Even more deduplication on kwakk.info


2 hours ago · Gaming

As I’ve nattered on about before, I started including text pages from comics in the comics ‘zine search engine, and this has some unique problems. I’m just dumping hundreds of thousands of comics into the grinder, picking out the text pages, and then OCR-ing them. But many comics have been scanned several times, and many include editorial pages that are more or less identical across several titles.

So I’m now running the pages through a text-based deduplicator… but that uses a “quick” (FSVO) OCR, Tesseract, which behaves horribly on non-standard pages like the above, and there are now five copies of that page in the search engine. Which is just so annoying when trying to actually find stuff: you have to wade through duplicates.

Which means that I had to do something more, and that’s now in production: after doing the real OCR with Surya, which is great even with text at a 30-degree angle and badly scanned pages on colourful backgrounds, I’m running an extra step and deduplicating again based on…
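For the curious, here is roughly what a text-based deduplicator over OCR output can look like. This is a minimal sketch and not the code running on kwakk.info: the word-shingle/Jaccard approach, the function names, the 0.6 threshold and the sample page texts are all assumptions made up for illustration, and it makes no claim about what the extra step described above actually keys on. It works on plain strings, so it doesn't call Tesseract or Surya at all.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and keep only runs of letters/digits, separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-word windows of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        # Very short pages: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(pages: dict[str, str], threshold: float = 0.6) -> list[str]:
    """Greedily keep the first page of every group of near-identical texts.

    `pages` maps a (hypothetical) page id to its OCR text. The threshold is
    an arbitrary illustrative value, not a tuned one.
    """
    kept: list[tuple[str, set]] = []
    for page_id, text in pages.items():
        page_shingles = shingles(text)
        if any(jaccard(page_shingles, other) >= threshold for _, other in kept):
            continue  # close enough to a page we already kept
        kept.append((page_id, page_shingles))
    return [page_id for page_id, _ in kept]


# Made-up sample texts: two scans of the same page (one with an OCR slip)
# and one unrelated page.
pages = {
    "scan_a": "Welcome to the letters page! Write to us at the usual "
              "address and tell us what you think.",
    "scan_b": "WELCOME to the letters page!  Wr1te to us at the usual "
              "address, and tell us what you think",
    "other": "Next issue: the team faces a brand new villain in an "
             "all-out battle.",
}
print(dedupe(pages))  # ['scan_a', 'other']
```

The greedy loop compares every page against every page kept so far, which is fine for a demo but quadratic; at hundreds of thousands of pages you'd presumably want some kind of hashing or indexing on top. It also shows why OCR quality matters here: a single misread word already knocks out three of the shingles, so a page that's badly garbled by the OCR will sail right past a threshold like this.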
