I'm still working on improving the test loss of a from-scratch GPT-2 small base model, trained with code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". My training code creates the optimiser like this:

```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004,
    weight_decay=0.1,
)
```

In my last post I looked into the learning rate, the `lr` parameter in that code, and found a value for it, plus some extra code to schedu...
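The excerpt cuts off before showing the scheduling code, but one common shape for a learning-rate schedule in this kind of training run is linear warmup followed by cosine decay. The sketch below is illustrative only, with made-up warmup and step counts; it is not the schedule from the post itself:

```python
import math

# Hypothetical values for illustration; the actual schedule
# parameters used in the post aren't shown in this excerpt.
MAX_LR = 4e-4       # the lr passed to AdamW above
MIN_LR = 4e-5       # floor the decay settles at
WARMUP_STEPS = 100
MAX_STEPS = 5_000

def lr_at_step(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        # Linear warmup: ramp from ~0 up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_STEPS:
        return MIN_LR
    # Cosine decay from MAX_LR to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)
```

In a PyTorch training loop this function would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR`, or replaced by the built-in `CosineAnnealingLR` after a warmup phase.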