I've been working on a GPT-2-small-style LLM based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)", and have tried a number of different things to see if I could get it to approach the quality of the original OpenAI GPT-2-small, measured by loss on a held-out test dataset. After working through those interventions, in my last post I managed to train one that was almost (if not quite) there.

Back before I started digging into those interventions, I was running three evals for each model I built: a smoke test (to see whether it could give a coherent completion to "Every effort moves you"), a measurement of that test-set loss, and an instruction-following test that fine-tuned the model on the Alpaca dataset, had it generate responses to a test set of instructions, and then used an LLM as a judge to score them. The idea behind this was that the loss on the test set was an interesting technical measure of the quality of a model, but it didn't really tell us much about how…