Get difference between two csv files based on column using bash - csv

I have two csv files a.csv and b.csv, both of them come with no headers and each value in a row is seperated by \t.
1 apple
2 banana
3 orange
4 pear
apple 0.89
banana 0.57
cherry 0.34
I want to subtract these two files and get difference between the second column in a.csv and the first column in b.csv, something like a.csv[1] - b.csv[0] that would give me another file c.csv looks like
orange
pear
Instead of using python and other programming languages, I want to use bash command to complete this task and found out that awk would be helpful but not so sure how to write the correct command. Here is another similar question but the second answer uses awk '{print $2,$6-$13}' to get the difference between values instead of occurence.
Thanks and appreciate for any help.

You can easily do this with the Steve's answer from the link you are referring to with a bit of tweak. Not sure the other answer with paste will get you solving this problem.
Create a hash-map from the second file b.csv and compare it again with the 2nd column in a.csv
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv
To redirect the output to a new file, append > c.csv at the end of the previous command.
Set the field separators (input and output) to \t as you were reading a tab-delimited file.
The FNR == NR { action; } { action } f1 f2 is a general construct you find in many awk commands that works if you had to do action on more than one file. The block right after the FNR == NR gets executed on the first file argument provided and the next block within {..} runs on the second file argument.
The part unique[$1]; next creates a hash-map unique with key as the value in the first column on the file b.csv. The part within {..} runs for all the columns in the file.
After this file is completely processed, on the next file a.csv, we do !($2 in unique) which means, mark those lines whose $2 in the second file is not part of the key in the unique hash-map generated from the first file.
On these lines print only the second column names { print $2 }

Assuming your real data is sorted on the columns you care about like your sample data is:
$ comm -23 <(cut -f2 a.tsv) <(cut -f1 b.tsv)
orange
pear
This uses comm to print out the entries in the first file that aren't in the second one, after using cut to get just the columns you care about.
If not already sorted:
comm -23 <(cut -f2 a.tsv | sort) <(cut -f1 b.tsv | sort)

If you want to use Miller (https://github.com/johnkerl/miller), a clean and easy tool, the command could be
mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f 01.txt then cut -f 2 02.txt
It gives you
orange
pear
It's a join in which it does not emit paired records and emits unpaired records from the left file.

Related

extract rows from one csv file based on column information in the other csv

I have 2 csv files. I want the information from column 1 of from file 1 to be used to extract rows from the other csv file.
My file1.csv has names arranged in columns :
ENSG00000102977
ENSG00000100823
ENSG00000149311
ENSG00000175054
ENSG00000085224
ENSG00000178999
ENSG00000197299
ENSG00000139618
ENSG00000105173
ENSG00000175305
ENSG00000166226
ENSG00000163468
ENSG00000115484
ENSG00000150753
ENSG00000146731
and the 2nd csv file has the names along with the corresponding values arranged in rows.
ENSG00000102977 1083.82334384253 1824.50639384557 1962.86064714976 1367.60568624972
I wrote an awk script
`awk 'FNR == NR{f1[$1];next} $1 in f2 {print $1,$2,$3,$4,$5 $NF}' FS="," file1.csv FS="," file2.csv`
but it returns without any output or error.
Kindly guide me as to where I am wrong. Its a bit puzzling since there is no error.
Try grep's option -f - read pattern from a file.
grep -f fileWithPatterns.csv fileToExtractFrom.csv

Compare two CSV files of different length

I have two CSV files with different no of records. I have to compare column 1 of file1 with column 1 of file2 if it matches then print only those lines from file2 where column 2 does not match.
Need a help to do this using Unix command.
It's difficult to post a solution without example data but you could try the following awk example none the less:
awk -F, 'NR==FNR { map[$2]=1;next } { if (map[$2]!=1) { print } }' file1.csv file2.csv
Set the field separator to comma with -F. Process the first file first (NR==FNR), create an array called map with the second comma separated field as the index. Then when processing the second file, if there is no entry in map for the second field, print the line.

Deduplicating by first field in awk

I'm looking for a modified version of the top answer to this question:
extracting unique values between 2 sets/files
awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
How do i accomplish the same thing by deduplicating on field one, instead of the entire line?
File format is the following:
blah#domain.com,Elon,Tusk
I want to output only the lines from file 2 which have emails unique to file 1.
The ideal solution would allow for multiple files, rather than only 2, which all duplicated against the files before it, so you could do:
awk .... file1 file2 file3 file4 file5 file6
and somehow output 6 new files containing rows with only unique first fields to all other files before it
However, if that's too complex, just working on 2 files is fine as well
Based on the input you provided and the requests you made, we can make the following awk script:
awk 'BEGIN{FS=","}
(FNR==1){close(f); f=FILENAME ".new"}
!(a[$1]++) { print > f }' file1 file2 file3 file4 file5
This will create 5 files called file[12345].new. Each of these files will contain the lines with a unique first column. Note that it is evident that file1.new and file1 are identical (with the exception if there are duplicates in file1)

Search in large csv files

The problem
I have thousands of csv files in a folder. Every file has 128,000 entries with four columns in each line.
From time to time (two times a day) I need to compare a list (10,000 entries) with all csv files. If one of the entries is identical with the third or fourth column of one of the csv files I need to write the whole csv row to an extra file.
Possible solutions
Grep
#!/bin/bash
getArray() {
array=()
while IFS= read -r line
do
array+=("$line")
done < "$1"
}
getArray "entries.log"
for e in "${array[#]}"
do
echo "$e"
/bin/grep $e ./csv/* >> found
done
This seems to work, but it lasts forever. After almost 48 hours the script checked only 48 entries of about 10,000.
MySQL
The next try was to import all csv files to a mysql database. But there I had problems with my table at around 50,000,000 entries.
So I wrote a script which created a new table after 49,000,000 entries and so I was able to import all csv files.
I tried to create an index on the second column but it always failed (timeout). To create the index before the import process wasn't possible, too. It slowed down the import to days instead of only a few hours.
The select statement was horrible, but it worked. Much faster than the "grep" solution but still to slow.
My question
What else can I try to search within the csv files?
To speed things up I copied all csv files to an ssd. But I hope there are other ways.
This is unlikely to offer you meaningful benefits, but some improvements to your script
use the built-in mapfile to slurp a file into an array:
mapfile -t array < entries.log
use grep with a file of patterns and appropriate flags.
I assume you want to match items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.
grep -Fwf entries.log ./csv/*
This means you don't have to grep the 1000's of csv files 1000's of times (once for each item in entries.log). Actually this alone should give you a real meaningful performance improvement.
This also removes the need to read entries.log into an array at all.
In awk assuming all the csv files change, otherwise it would be wise to keep track of the already checked files. But first some test material:
$ mkdir test # the csvs go here
$ cat > test/file1 # has a match in 3rd
not not this not
$ cat > test/file2 # no match
not not not not
$ cat > test/file3 # has a match in 4th
not not not that
$ cat > list # these we look for
this
that
Then the script:
$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that
Explained:
$ awk ' # awk
NR==FNR { # process the list file
a[$1] # hash list entries to a
next # next list item
}
($3 in a) || ($4 in a) { # if 3rd or 4th field entry in hash
print >> "out" # append whole record to file "out"
}' list test/* # first list then the rest of the files
The script hashes all the list entries to a and reads thru the csv files looking for 3rd and 4th field entries in the hash outputing when there is a match.
If you test it, let me know how long it ran.
You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution:
# prepare patterns file
while read -r line; do
printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$" # find value in third column
printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$" # find value in fourth column
done < entries.log > patterns.dat
find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep, in batches

Can awk deal with CSV file that contains comma inside a quoted field?

I am using awk to perform counting the sum of one column in the csv file. The data format is something like:
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
I was using this awk script to count the sum:
awk -F, '{sum+=$3} END {print sum}'
Some of the value in name field contains comma and this break my awk script.
My question is: can awk solve this problem? If yes, and how can I do that?
Thank you.
One way using GNU awk and FPAT
awk 'BEGIN { FPAT = "([^, ]+)|(\"[^\"]+\")" } { sum+=$3 } END { print sum }' file.txt
Result:
192
I am using
`FPAT="([^,]+)|(\"[^\"]+\")" `
to define the fields with gawk. I found that when the field is null this doesn't recognize correct number of fields. Because "+" requires at least 1 character in the field.
I changed it to:
`FPAT="([^,]*)|(\"[^\"]*\")"`
and replace "+" with "*". It works correctly.
I also find that GNU Awk User Guide also has this problem.
https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
You're probably better off doing it in perl with Text::CSV, since that's a fast and robust solution.
You can help awk work with data fields that contain commas (or newlines) by using a small script I wrote called csvquote. It replaces the offending commas inside quoted fields with nonprinting characters. If you need to, you can later restore those commas - but in this case, you don't need to.
Here is the command:
csvquote inputfile.csv | awk -F, '{sum+=$3} END {print sum}'
see https://github.com/dbro/csvquote for the code
For as simple an input file as that you can just write a small function to convert all of the real FSs outside of the quotes to some other value (I chose RS since the record separator cannot be part of the record) and then use that as the FS, e.g.:
$ cat decsv.awk
BEGIN{ fs=FS; FS=RS }
{
decsv()
for (i=1;i<=NF;i++) {
printf "Record %d, Field %d is <%s>\n" ,NR,i,$i
}
print ""
}
function decsv( curr,head,tail)
{
tail = $0
while ( match(tail,/"[^"]+"/) ) {
head = substr(tail, 1, RSTART-1);
gsub(fs,RS,head)
curr = curr head substr(tail, RSTART, RLENGTH)
tail = substr(tail, RSTART + RLENGTH)
}
gsub(fs,RS,tail)
$0 = curr tail
}
$ cat file
id, name, value
1, foo, 17
2, bar, 76
3, "I am the, question", 99
$ awk -F", " -f decsv.awk file
Record 1, Field 1 is <id>
Record 1, Field 2 is <name>
Record 1, Field 3 is <value>
Record 2, Field 1 is <1>
Record 2, Field 2 is <foo>
Record 2, Field 3 is <17>
Record 3, Field 1 is <2>
Record 3, Field 2 is <bar>
Record 3, Field 3 is <76>
Record 4, Field 1 is <3>
Record 4, Field 2 is <"I am the, question">
Record 4, Field 3 is <99>
It only becomes complicated when you have to deal with embedded newlines and embedded escaped quotes within the quotes and even then it's not too hard and it's all been done before...
See What's the most robust way to efficiently parse CSV using awk? for more information.
You can always tackle the problem from the source. Put quotes around the name field, just like the field of "I am the, question". This is much easier than spending your time coding workarounds for that.
Update(as Dennis requested). A simple example
$ s='id, "name1,name2", value 1, foo, 17 2, bar, 76 3, "I am the, question", 99'
$ echo $s|awk -F'"' '{ for(i=1;i<=NF;i+=2) print $i}'
id,
, value 1, foo, 17 2, bar, 76 3,
, 99
$ echo $s|awk -F'"' '{ for(i=2;i<=NF;i+=2) print $i}'
name1,name2
I am the, question
As you can see, by setting the delimiter to double quote, the fields that belong to the "quotes" are always on even number. Since OP doesn't have the luxury of modifying the source data, this method will not be appropriate to him.
This article did help me solve this same data field issue. Most CSV will put a quote around fields with spaces or commas within them. This messes up the field count for awk unless you filter them out.
If you need the data within those fields that contain the garbage, this is not for you. ghostdog74 provided the answer, which empties that field but maintains the total field count in the end, which is key to keeping the data output consistent. I did not like how this solution introduced new lines. This is the version of this solution I used. The fist three fields never had this problem in the data. The fourth field containing customer name often did, but I needed that data. The remaining fields that exhibit the problem I could throw away without issue because it was not needed in my report output. So I first sed out the 4th field's garbage very specifically and remove the first two instances of quotes. Then I apply what ghostdog74gave to empty the remaining fields that have commas within them - this also removes the quotes, but I use printfto maintain the data in a single record. I start off with 85 fields and end up with 85 fields in all cases from my 8000+ lines of messy data. A perfect score!
grep -i $1 $dbfile | sed 's/\, Inc.//;s/, LLC.//;s/, LLC//;s/, Ltd.//;s/\"//;s/\"//' | awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}' > $tmpfile
The solution that empties the fields with commas within them but also maintains the record, of course is:
awk -F'"' '{ for(i=1;i<=NF;i+=2) printf ($i);printf ("\n")}
Megs of thanks to ghostdog74 for the great solution!
NetsGuy256/
FPAT is the elegant solution because it can handle the dreaded commas within quotes problem, but to sum a column of numbers in the last column regardless of the number of preceding separators, $NF works well:
awk -F"," '{sum+=$NF} END {print sum}'
To access the second to last column, you would use this:
awk -F"," '{sum+=$(NF-1)} END {print sum}'
If you know for sure that the 'value' column is always the last column:
awk -F, '{sum+=$NF} END {print sum}'
NF represents the number of fields, so $NF is the last column
Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new({allow_whitespace => 1})} if($csv->parse($_)){#f=$csv->fields();$sum+=$f[2]} END{print $sum}' file
allow_whitespace is needed since the input data has whitespace surrounding the comma separators. Very old versions of Text::CSV_XS may not support this option.
I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk
you could try piping the file through a perl regex to convert the quoted , into something else like a |.
cat test.csv | perl -p -e "s/(\".+?)(,)(.+?\")/\1\|\3/g" | awk -F, '{...
The above regex assumes there is always a comma within the double quotes. so more work would be needed to make the comma optional
you write a function in awk like below:
$ awk 'func isnum(x){return(x==x+0)}BEGIN{print isnum("hello"),isnum("-42")}'
0 1
you can incorporate in your script this function and check whether the third field is numeric or not.if not numeric then go for the 4th field and if the 4th field inturn is not numberic go for 5th ...till you reach a numeric value.probably a loop will help here, and add it to the sum.