
Duplicates

Print duplicated rows:

$ sort file.csv | uniq -d

For the following examples I assume email addresses are the fifth column of a CSV file.

Print the column with email addresses only:

$ cat file.csv | cut -d ',' -f 5

Sort the output to group duplicates:

$ cat file.csv | cut -d ',' -f 5 | sort

Print duplicated email addresses only:

$ cat file.csv | cut -d ',' -f 5 | sort | uniq -d

Print the count of distinct duplicated email addresses (each duplicated address is counted once):

$ cat file.csv | cut -d ',' -f 5 | sort | uniq -d | wc -l

Create a new file without duplicated rows:

$ awk '!seen[$0]++' file.csv > no_dup.csv

See https://yctct.com/data-remove-duplicated-rows for an explanation.

Edit file names in Vim

$ sudo apt install moreutils
$ vidir

Concatenate files

$ cat foo.txt bar.txt > foobar.txt

Append foo.txt to bar.txt:

$ cat foo.txt >> bar.txt

Remove columns from a TSV or a CSV file

To remove columns 1, 3 and 5 from a TSV file:

$ cut --complement -f 1,3,5 file.txt > new_file.txt

and from a CSV file:

$ cut --complement -d ','…
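The awk one-liner above removes rows that are duplicated in their entirety. A small variation, sketched below with hypothetical sample data, deduplicates on a single column instead: setting the field separator with -F',' and keying the seen array on $5 keeps only the first row for each email address, even when the other columns differ. (The file path and sample rows are made up for the demo.)

```shell
# Hypothetical sample data: email addresses in the fifth column,
# as in the examples above.
cat > /tmp/users.csv <<'EOF'
1,Ann,x,y,ann@example.com
2,Bob,x,y,bob@example.com
3,Anne,x,y,ann@example.com
EOF

# Keep only the first row seen for each email address (column 5).
# seen[$5]++ is 0 the first time an address appears, so !seen[$5]++
# is true (print) once per address and false afterwards.
awk -F',' '!seen[$5]++' /tmp/users.csv
```

Here rows 1 and 2 survive and row 3 is dropped, because ann@example.com was already seen. Unlike the sort | uniq pipelines, this preserves the original row order and needs no sorting pass.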
