The largest open pretraining dataset for European Portuguese


15 hours ago · Tech · 0 comments

[Figure: Educational score over time vs. document count for Bagaco v2]

A couple of months ago I released Bagaço, a pretraining dataset for European Portuguese. The idea was simple: take the FineWeb 2 dataset, limit it to web pages that look like they came from Portugal, and label each page with a category (Sports, Culture, etc.) and an educational score.

Pulling the European Portuguese out of the wider corpus was a frustrating experience, a bit like finding a needle in a haystack. I sidestepped the problem and simply included anything with a .pt domain in the URL. But that didn't feel like enough, which led me to the next phase: European Portuguese variety identification. Or, in other words, spotting European Portuguese in the wild. After learning some bitter lessons, I built two FastText-based classifiers that achieved SOTA performance at 10x the throughput, so that I could run them at scale. With those two pieces in place, maybe you guessed where this was…
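The .pt shortcut mentioned above can be sketched in a few lines. This is only an illustration, not the code used to build Bagaço: it assumes each record is a dict with a `url` key, it reads "a .pt domain in the URL" as "the hostname ends in .pt", and the names `is_pt_domain` and `filter_pt` are made up for this example.

```python
from urllib.parse import urlsplit


def is_pt_domain(url: str) -> bool:
    """Return True if the URL's hostname ends in the .pt country-code TLD."""
    host = urlsplit(url).hostname  # None when the URL has no network location
    if host is None:
        return False
    # Drop a trailing root dot ("example.pt.") before checking the suffix.
    return host.rstrip(".").endswith(".pt")


def filter_pt(records):
    """Keep records whose 'url' field (an assumed schema) is on a .pt domain."""
    return [r for r in records if is_pt_domain(r.get("url", ""))]


if __name__ == "__main__":
    sample = [
        {"url": "https://www.publico.pt/desporto", "text": "..."},
        {"url": "https://example.com/page.pt", "text": "..."},
        {"url": "https://www.globo.com.br/", "text": "..."},
    ]
    for record in filter_pt(sample):
        print(record["url"])
```

Checking the hostname rather than the whole URL string avoids matching paths that merely contain ".pt". It also shows why the shortcut is limited: Portuguese pages hosted on .com or .org domains are missed, which is the gap a variety classifier is meant to close.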
