This is the last of the interventions I'm trying out to see if I can improve the test loss for a from-scratch GPT-2 small base model, trained using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". Back when I did my first training run for a base model, on my local RTX 3090, I used two optimisations: setting the 32-bit floating-point matrix multiplication precision to "high" rather than "highest", which means that it uses lower-precision (but still techn...
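For reference, here's a minimal sketch of how that precision setting is applied in PyTorch (assuming the training script calls `torch` directly; the exact placement in the original run may differ):

```python
import torch

# Allow TF32 (TensorFloat-32) for float32 matrix multiplications.
# "high" trades a little mantissa precision for much faster matmuls
# on Ampere-class GPUs like the RTX 3090; "highest" keeps full fp32.
torch.set_float32_matmul_precision("high")
```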