I'm still working on improving the test loss of a from-scratch GPT-2 small base model, trained with code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". My training code creates the optimiser like this:

```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.0004,
    weight_decay=0.1,
)
```

In my last post I looked into the learning rate, the `lr` parameter in that code, and found a value for it, plus some extra code to schedu...
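The excerpt cuts off before showing the scheduling code, but one common shape for a learning-rate schedule in this kind of training run is linear warmup followed by cosine decay. The sketch below is illustrative only, with made-up warmup and step counts; it is not the schedule from the post itself:

```python
import math

# Hypothetical values for illustration; the actual schedule
# parameters used in the post aren't shown in this excerpt.
MAX_LR = 4e-4       # the lr passed to AdamW above
MIN_LR = 4e-5       # floor the decay settles at
WARMUP_STEPS = 100
MAX_STEPS = 5_000

def lr_at_step(step: int) -> float:
    """Linear warmup to MAX_LR, then cosine decay down to MIN_LR."""
    if step < WARMUP_STEPS:
        # Linear warmup: ramp from ~0 up to MAX_LR.
        return MAX_LR * (step + 1) / WARMUP_STEPS
    if step >= MAX_STEPS:
        return MIN_LR
    # Cosine decay from MAX_LR to MIN_LR over the remaining steps.
    progress = (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)
```

In a PyTorch training loop this function would typically be wrapped in `torch.optim.lr_scheduler.LambdaLR`, or replaced by the built-in `CosineAnnealingLR` after a warmup phase.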