
I remember finding it interesting when, back in 2015, Andrej Karpathy posted about RNNs and showed how their output improves over the course of a training run. What might that look like for a (relatively) modern transformer-based LLM? I recently trained a GPT-2-small-style LLM with 163 million parameters on about 3.2 billion tokens (roughly 12.8 GiB of text) from the Hugging Face FineWeb dataset, and over the course of that training run I saved the current model periodically.
