Deduplicating by first field in awk - csv

I'm looking for a modified version of the top answer to this question:
extracting unique values between 2 sets/files
awk 'FNR==NR {a[$0]++; next} !($0 in a)' file1 file2
How do I accomplish the same thing by deduplicating on field one instead of the entire line?
File format is the following:
blah#domain.com,Elon,Tusk
I want to output only the lines from file 2 which have emails unique to file 1.
The ideal solution would allow for multiple files, rather than only 2, each deduplicated against the files before it, so you could do:
awk .... file1 file2 file3 file4 file5 file6
and somehow output 6 new files, each containing only the rows whose first field is unique with respect to all the files before it.
However, if that's too complex, just working on 2 files is fine as well.

Based on the input you provided and the requests you made, we can write the following awk script:
awk 'BEGIN{FS=","}
(FNR==1){close(f); f=FILENAME ".new"}
!(a[$1]++) { print > f }' file1 file2 file3 file4 file5
This will create 5 files called file[12345].new. Each of these files will contain only the lines whose first column has not been seen in any earlier file (or earlier in the same file). Note that file1.new and file1 will be identical, unless file1 itself contains duplicate first fields.
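For the simpler two-file case (print only the lines of file2 whose email does not appear anywhere in file1), the command from the question needs only to key on the first comma-separated field instead of the whole line:
awk -F, 'FNR==NR {a[$1]; next} !($1 in a)' file1 file2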

Related

extract rows from one csv file based on column information in the other csv

I have 2 csv files. I want the information from column 1 of file 1 to be used to extract rows from the other csv file.
My file1.csv has names arranged in columns:
ENSG00000102977
ENSG00000100823
ENSG00000149311
ENSG00000175054
ENSG00000085224
ENSG00000178999
ENSG00000197299
ENSG00000139618
ENSG00000105173
ENSG00000175305
ENSG00000166226
ENSG00000163468
ENSG00000115484
ENSG00000150753
ENSG00000146731
and the 2nd csv file has the names along with the corresponding values arranged in rows.
ENSG00000102977 1083.82334384253 1824.50639384557 1962.86064714976 1367.60568624972
I wrote an awk script
awk 'FNR == NR{f1[$1];next} $1 in f2 {print $1,$2,$3,$4,$5 $NF}' FS="," file1.csv FS="," file2.csv
but it returns without any output or error.
Kindly guide me as to where I am wrong. It's a bit puzzling since there is no error.
Try grep's -f option, which reads patterns from a file.
grep -f fileWithPatterns.csv fileToExtractFrom.csv
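As a side note, the awk attempt in the question stays silent because the lookup array is filled as f1 but tested as f2, so the condition never matches; the second file also appears to be whitespace-separated rather than comma-separated. A corrected sketch under those assumptions:
awk 'FNR == NR { f1[$1]; next } $1 in f1' FS="," file1.csv FS=" " file2.csv
This builds a lookup from column 1 of file1.csv and prints every row of file2.csv whose first field appears in it.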

Compare two CSV files of different length

I have two CSV files with different numbers of records. I have to compare column 1 of file1 with column 1 of file2; if it matches, then print only those lines from file2 where column 2 does not match.
I need help doing this with a Unix command.
It's difficult to post a solution without example data, but you could try the following awk example nonetheless:
awk -F, 'NR==FNR { map[$2]=1;next } { if (map[$2]!=1) { print } }' file1.csv file2.csv
Set the field separator to comma with -F. Process the first file first (NR==FNR) and create an array called map with the second comma-separated field as the index. Then, when processing the second file, if there is no entry in map for the second field, print the line.
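If the requirement is taken literally (match on column 1 and print the file2 lines whose column 2 differs from file1's value for the same key), a sketch along the same lines would be:
awk -F, 'NR==FNR { col2[$1]=$2; next } ($1 in col2) && col2[$1] != $2' file1.csv file2.csv
Here col2 remembers file1's second field keyed by its first field, and only file2 lines with a matching key but a different second field are printed.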

output lines from file A where first columns match file B

Two csv files formatted identically like this:
blah#domain.com,Elon,Tusk
I want to output the lines from the first file which match the same email address in the second file
Instead of awk, I use join for this type of task because it's simpler/easier for me to remember:
join -t',' -o 1.1,1.2,1.3 <(sort -t',' -k1,1 first.csv) <(sort -t',' -k1,1 second.csv)
That said, I believe that awk is the best tool for this type of task, e.g.
awk -F, 'FNR==NR {a[$1]; next}; $1 in a' second.csv first.csv

Get difference between two csv files based on column using bash

I have two csv files, a.csv and b.csv; both of them come with no headers, and each value in a row is separated by \t.
a.csv:
1 apple
2 banana
3 orange
4 pear
b.csv:
apple 0.89
banana 0.57
cherry 0.34
I want to subtract these two files and get the difference between the second column in a.csv and the first column in b.csv, something like a.csv[1] - b.csv[0], which would give me another file c.csv that looks like
orange
pear
Instead of using python and other programming languages, I want to use a bash command to complete this task, and I found out that awk would be helpful but am not sure how to write the correct command. Here is another similar question, but the second answer uses awk '{print $2,$6-$13}' to get the difference between values instead of occurrence.
Thanks, and I appreciate any help.
You can easily do this with Steve's answer from the link you are referring to, with a bit of tweaking. I'm not sure the other answer using paste will solve this problem.
Create a hash-map from the second file b.csv and compare it again with the 2nd column in a.csv
awk -v FS="\t" 'BEGIN { OFS = FS } FNR == NR { unique[$1]; next } !($2 in unique) { print $2 }' b.csv a.csv
To redirect the output to a new file, append > c.csv at the end of the previous command.
Set the field separators (input and output) to \t, since you are reading tab-delimited files.
The FNR == NR { action; next } { action } f1 f2 pattern is a general construct you will find in many awk commands whenever an action has to be applied to more than one file. The block right after FNR == NR gets executed for the first file argument provided, and the next block within {..} runs for the second file argument.
The part unique[$1]; next creates a hash-map unique whose keys are the values in the first column of b.csv; this block runs once for every line of that file.
After this file is completely processed, on the next file a.csv we apply !($2 in unique), which selects those lines whose $2 is not among the keys of the unique hash-map built from the first file.
For those lines, print only the second column: { print $2 }.
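For readability, the same command can be spread over several lines with comments (equivalent to the one-liner above):
awk -v FS="\t" '
BEGIN           { OFS = FS }            # keep tab as the output separator as well
FNR == NR       { unique[$1]; next }    # first file (b.csv): store column 1 as keys
!($2 in unique) { print $2 }            # second file (a.csv): print column 2 if not seen in b.csv
' b.csv a.csv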
Assuming your real data is sorted on the columns you care about like your sample data is:
$ comm -23 <(cut -f2 a.tsv) <(cut -f1 b.tsv)
orange
pear
This uses comm to print out the entries in the first file that aren't in the second one, after using cut to get just the columns you care about.
If not already sorted:
comm -23 <(cut -f2 a.tsv | sort) <(cut -f1 b.tsv | sort)
If you want to use Miller (https://github.com/johnkerl/miller), a clean and easy tool, the command could be
mlr --nidx --fs "\t" join --ul --np -j join -l 2 -r 1 -f 01.txt then cut -f 2 02.txt
It gives you
orange
pear
It's a join that does not emit paired records (--np) and emits the unpaired records from the left file (--ul).

How to split text file into multiple files and extract filename from line prefix?

I have a simple log file with content like:
1504007980.039:{"key":"valueA"}
1504007990.359:{"key":"valueB", "key2": "valueC"}
...
That I'd like to output to multiple files that each have as content the JSON part that comes after the timestamp. So I would get as a result the files:
1504007980039.json
1504007990359.json
...
This is similar to How to split one text file into multiple *.txt files?, but the name of each file should be extracted from its line (with the extra dot removed) rather than generated via an index.
Preferably I'd want a one-liner that can be executed in bash.
Since you aren't using GNU awk, you need to close output files as you go to avoid the "too many open files" error. To avoid that, along with issues around specific values in your JSON and undefined behavior during output redirection, this is what you need:
awk '{
    fname = $0
    sub(/\./,"",fname)       # drop the dot inside the timestamp
    sub(/:.*/,".json",fname) # replace everything from the first ":" with ".json"
    sub(/[^:]+:/,"")         # strip the timestamp prefix from the line itself
    print >> fname
    close(fname)
}' file
You can of course squeeze it onto 1 line if you see some benefit to that:
awk '{f=$0;sub(/\./,"",f);sub(/:.*/,".json",f);sub(/[^:]+:/,"");print>>f;close(f)}' file
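For completeness: the only reason the close() is needed above is that you aren't using GNU awk. Assuming gawk were available (the question says it isn't), the same logic should work without the explicit close(), since gawk handles hitting the open-file limit by multiplexing descriptors itself:
gawk '{f=$0; sub(/\./,"",f); sub(/:.*/,".json",f); sub(/[^:]+:/,""); print >> f}' file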
awk solution:
awk '{ idx=index($0,":"); fn=substr($0,1,idx-1)".json"; sub(/\./,"",fn);
print substr($0,idx+1) > fn; close(fn) }' input.log
idx=index($0,":") - capturing the index of the 1st :
fn=substr($0,1,idx-1)".json" - preparing the filename from the timestamp prefix
sub(/\./,"",fn) - removing the first dot from the filename, so 1504007980.039.json becomes 1504007980039.json
Viewing results (for 2 sample lines from the question):
for f in *.json; do echo "$f"; cat "$f"; echo; done
The output (filename -> content):
1504007980039.json
{"key":"valueA"}
1504007990359.json
{"key":"valueB"}