2 hours ago · Tech

I have done a lot of data engineering work in my career. When you work with data a lot, you often get quick requests to extract some cold data and process it. Because the data is cold, it usually lives on S3. For example, a typical request in the past was to count unique values in an old MongoDB backup or CSV dump holding 100GB of compressed data. I quickly learned that throwaway Python scripts do not work well for this: the data is often too big to fit into memory for sorting and deduplication. Maintaining a Spark cluster was not worth it for this kind of work either. So, what I would do is something like this: LC_ALL=C aws s3 cp s3://bucket/data.json.gz - \ | gzip -dc \[......]
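The command above is cut off, so the following is not the rest of that pipeline. It is only a minimal sketch of the general idea, assuming the values to count arrive one per line on stdin. The helper name `count_unique` is hypothetical. It relies on `sort` spilling to temporary files on disk when the input outgrows its buffer, which is what lets it deduplicate data larger than memory, and on `LC_ALL=C` so that lines are compared as raw bytes.

```shell
# Hypothetical helper (not from the original post): count distinct lines on stdin.
# sort -u performs an external merge sort, writing temporary files to disk when
# the input exceeds its in-memory buffer, so the input does not need to fit in RAM.
count_unique() {
  # LC_ALL=C forces byte-wise comparison instead of locale-aware collation.
  # tr strips the padding that some wc implementations add around the number.
  LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Small demonstration with inline data: the distinct values are a, b and c.
printf 'b\na\nb\nc\n' | count_unique
```

With real data, the decompressed stream would be piped into the helper in place of the `printf` demonstration, after whatever step extracts the field of interest.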
