Bubbles

I'm still working on improving the test loss of a from-scratch GPT-2 small base model, trained with code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". In my training code, I create the optimiser like this: optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1). The values in there, 0.0004 for the learning rate and 0.1 for the weight decay, were just copied from the tiny training run that we d...
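For context, here is a minimal, self-contained sketch of that optimiser setup. The tiny `torch.nn.Linear` stand-in model is my own placeholder (the post's actual model is a GPT-2 small built from scratch); the AdamW call and its two hyperparameters are exactly the ones quoted above.

```python
import torch

# Placeholder model; the real one is a from-scratch GPT-2 small.
model = torch.nn.Linear(8, 8)

# Hyperparameters as quoted in the post: lr=0.0004, weight_decay=0.1.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004,
    weight_decay=0.1,
)

# AdamW stores the constructor arguments in its defaults dict.
print(optimizer.defaults["lr"])           # 0.0004
print(optimizer.defaults["weight_decay"])  # 0.1
```

Note that AdamW (unlike plain Adam with L2 regularisation) applies the weight decay decoupled from the gradient update, which is why the two hyperparameters can be tuned somewhat independently.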
