Last November, when I finished the main body of "Build a Large Language Model (from Scratch)", I set myself a number of follow-on goals. One was training the full GPT-2 base model myself. I've now reached the end of that journey, with a model that is almost -- if not quite -- as good as GPT-2 small, trained in 44 hours on my own machine, so I thought it would be worth summarising how it went.

In December, I trained my first model. It took two days, and I was disappointed to find that it was worse than the original GPT-2 model, both in terms of loss and in terms of how well it could be fine-tuned to follow instructions. I expected that a chunk of that difference was likely due to the original model having been trained for longer, but I also noticed that there were a number of changes -- interventions -- I could make to the model and the training run that I thought might help.

In January, I put together a DDP training system that would allow me to iterate on those interventions…