
Since early February, I've been trying various interventions on a 163M-parameter GPT-2-style model that I trained from scratch on my local RTX 3090, using code based on Sebastian Raschka's book "Build a Large Language Model (from Scratch)". My original model reached a loss of 3.944 on my test set, while the original GPT-2 weights scored 3.500 on the same dataset. I wanted to see whether I could close that gap, so I drew up a list of potential changes to the training setup and to the model itself. Which of...
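One way to put those two numbers in perspective: assuming the losses are per-token cross-entropy in nats (the usual convention for this kind of evaluation), exponentiating them gives perplexities, which make the size of the gap easier to feel. A quick sketch:

```python
import math

# Cross-entropy losses reported above (assumed: nats per token, same test set)
my_loss = 3.944    # 163M model trained from scratch
gpt2_loss = 3.500  # original GPT-2 weights

# Perplexity = exp(loss): roughly, the model's effective number of
# equally likely choices per token -- lower is better.
print(f"my model: {math.exp(my_loss):.1f}")  # ~51.6
print(f"GPT-2:    {math.exp(gpt2_loss):.1f}")  # ~33.1
```

So a 0.444-nat difference in loss corresponds to the from-scratch model being roughly 1.56x more "confused" per token than the original GPT-2 weights.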
