Jens Willmer

https://jwillmer.de

1 post

Tech


  1. Data Cleaning for a RAG Ingest Pipeline

    If your RAG pipeline ingests dirty data, the answers will be wrong. The embedding model and prompt chain cannot fix what was broken before indexing. I built this pipeline for a maritime email corpus: thousands of .eml files with PDF attachments, Office documents, images, and ZIP archives, turned into a searchable knowledge base. The examples here are maritime, but the patterns apply to any industry where you ingest unstructured documents into a RAG system. Corporate email, support tickets,…
