
I did a test run of adding text pages from comics to the comics magazine search engine a couple of weeks ago, but I wasn’t quite sure whether it was a good idea. Or whether my pipeline was good enough. And after getting some feedback, it seems like the answers are “yes” and “not quite”.

So I’ve fixed some problems: The detection of duplicate pages was really bad, but it’s now better. There are still duplicate pages in there, but getting it totally right just takes a lot of processing. The current algo uses Tesseract to do a “quick” OCR, then compares the texts to find the distances between them, and then excludes the ones that are too similar. The problem here is that while Tesseract is a good OCR engine as far as traditional OCRs go, it does get a goodly percentage of the words wrong, so if the scans aren’t pristine, you’ll get “different words” even if it’s the same page that’s been scanned twice.

For the search engine on kwakk.info, I’m using the Surya OCR engine, which is much, much more…
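To make the duplicate check described above a bit more concrete, here is a minimal sketch of the "compare the texts and exclude the ones that are too similar" step. It is not the actual pipeline: the function names, the sample strings and the 0.8 threshold are all made up for illustration, and Python's `difflib.SequenceMatcher` is swapped in as a stand-in for whatever distance measure is really used. The OCR step is left out; the strings stand in for text that would come out of a quick Tesseract pass.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't count."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the normalized texts are identical."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def drop_near_duplicates(pages: list[str], threshold: float = 0.8) -> list[int]:
    """Return the indices of the pages to keep.

    A page is skipped if its text is at least `threshold` similar to a page
    that has already been kept. The threshold is a hypothetical value, not
    one taken from the real pipeline.
    """
    kept: list[int] = []
    for i, text in enumerate(pages):
        if all(similarity(text, pages[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Stand-ins for OCR output: the second string is the first page "scanned
# again" with a couple of misread characters.
pages = [
    "The new issue ships in March with sixteen extra pages.",
    "The new issue sh1ps in Marcb with sixteen extra pages.",
    "Letters to the editor should be sent to the usual address.",
]
print(drop_near_duplicates(pages))
```

A fuzzy ratio like this tolerates a few misread characters, but the noisier the OCR, the lower the threshold has to go before real duplicates are caught, and the more likely it is that genuinely different pages get excluded too. Comparing every page against every kept page also grows quadratically with the number of pages.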
