Jens Willmer — Bubbles

6 days ago · 10 min read · 1998 words · Tech · 0 comments

If your RAG pipeline ingests dirty data, the answers will be wrong. The embedding model and prompt chain cannot fix what was broken before indexing. I built this pipeline for a maritime email corpus: thousands of .eml files with PDF attachments, Office documents, images, and ZIP archives, turned into a searchable knowledge base. The examples here are maritime, but the patterns apply to any industry where you ingest unstructured documents into a RAG system. Corporate email, support tickets,…