Find matches between multiple CSV files

I have an indeterminate number of CSV files in one folder. All the CSV files have only one column, with different numbers of rows, like this:
File1.csv
rs1000
rs2000
rs4000
rs5000
...
I want to compare all the CSV files in that folder and output a CSV file with only the rows that are common to all the files.
I have this command:
awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' *.csv > out_p.csv
but it shows rows that are not in all the files.

I didn't test it, but this should work:
awk '{line[$0]++}END{for(x in line)if(line[x]==ARGC-1)print x}' *.csv
The one-liner reads all lines into a hashtable (an awk array), incrementing the value (occurrence count) each time a line is seen.
Finally it prints the lines whose occurrence count equals the number of *.csv files (ARGC-1).
Note: this assumes that no csv file contains duplicated lines.
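If a file might contain duplicate lines, an untested sketch of a variant that counts each distinct line at most once per file:
awk '!seen[FILENAME,$0]++ { count[$0]++ } END { for (x in count) if (count[x] == ARGC-1) print x }' *.csv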

Try something like this:
awk '{ array[$1]++ } END { for(i in array) { if(array[i] > 1) print i } }' *.csv
Each csv file has only one column, so you don't need to specify a field separator; a line is printed if it occurs more than once across the csv files.

Related

extract rows from one csv file based on column information in the other csv

I have 2 csv files. I want the information from column 1 of file 1 to be used to extract rows from the other csv file.
My file1.csv has names arranged in a single column:
ENSG00000102977
ENSG00000100823
ENSG00000149311
ENSG00000175054
ENSG00000085224
ENSG00000178999
ENSG00000197299
ENSG00000139618
ENSG00000105173
ENSG00000175305
ENSG00000166226
ENSG00000163468
ENSG00000115484
ENSG00000150753
ENSG00000146731
and the 2nd csv file has the names along with the corresponding values arranged in rows.
ENSG00000102977 1083.82334384253 1824.50639384557 1962.86064714976 1367.60568624972
I wrote an awk script
`awk 'FNR == NR{f1[$1];next} $1 in f2 {print $1,$2,$3,$4,$5 $NF}' FS="," file1.csv FS="," file2.csv`
but it returns without any output or error.
Kindly guide me as to where I am wrong. It's a bit puzzling since there is no error.
Try grep's -f option, which reads patterns from a file:
grep -f fileWithPatterns.csv fileToExtractFrom.csv
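Incidentally, the awk attempt stores the keys in f1 but then tests $1 in f2, which is why nothing is printed. For the grep route, since the IDs are fixed strings, adding -F (fixed-string matching) and -w (whole-word matching) guards against partial or regex matches; a possible invocation:
grep -F -w -f file1.csv file2.csv > extracted.csv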

Compare two CSV files of different length

I have two CSV files with different numbers of records. I have to compare column 1 of file1 with column 1 of file2; if it matches, then print only those lines from file2 where column 2 does not match.
I need help doing this with a Unix command.
It's difficult to post a solution without example data, but you could try the following awk example nonetheless:
awk -F, 'NR==FNR { map[$2]=1;next } { if (map[$2]!=1) { print } }' file1.csv file2.csv
Set the field separator to comma with -F. Process the first file first (NR==FNR), create an array called map with the second comma separated field as the index. Then when processing the second file, if there is no entry in map for the second field, print the line.
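A quick trace with made-up sample data (purely hypothetical, since none was posted): if file1.csv contains
a,1
b,2
and file2.csv contains
a,1
a,9
c,3
then the command prints
a,9
c,3
because those are the only lines of file2 whose second field never appears as a second field in file1.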

How to split text file into multiple files and extract filename from line prefix?

I have a simple log file with content like:
1504007980.039:{"key":"valueA"}
1504007990.359:{"key":"valueB", "key2": "valueC"}
...
That I'd like to output to multiple files that each have as content the JSON part that comes after the timestamp. So I would get as a result the files:
1504007980039.json
1504007990359.json
...
This is similar to How to split one text file into multiple *.txt files?, but the name of each file should be extracted from its line (with the extra dot removed) rather than generated from an index.
Preferably I'd want a one-liner that can be executed in bash.
Since you aren't using GNU awk, you need to close output files as you go to avoid the "too many open files" error. To avoid that, along with issues around specific values in your JSON and undefined behavior during output redirection, this is what you need:
awk '{
fname = $0
sub(/\./,"",fname)         # drop the dot from the timestamp to build the name
sub(/:.*/,".json",fname)   # replace everything from the ":" onwards with ".json"
sub(/[^:]+:/,"")           # strip the timestamp prefix from the record itself
print >> fname
close(fname)
}' file
You can of course squeeze it onto 1 line if you see some benefit to that:
awk '{f=$0;sub(/\./,"",f);sub(/:.*/,".json",f);sub(/[^:]+:/,"");print>>f;close(f)}' file
awk solution:
awk '{ idx=index($0,":"); fn=substr($0,1,idx-1)".json"; sub(/\./,"",fn);
print substr($0,idx+1) > fn; close(fn) }' input.log
idx=index($0,":") - capturing index of the 1st :
fn=substr($0,1,idx-1)".json" - preparing filename
sub(/\./,"",fn) - removing the dot from the timestamp part of the filename
print substr($0,idx+1) > fn - writing everything after the : into that file
Viewing results (for 2 sample lines from the question):
for f in *.json; do echo "$f"; cat "$f"; echo; done
The output (filename -> content):
1504007980039.json
{"key":"valueA"}
1504007990359.json
{"key":"valueB"}

awk, header: ignore when reading, include when writing

I know how to ignore the column header when reading a file. I can do something like this:
awk 'FNR > 1 { #process here }' table > out_table
But if I do this, everything other than the COLUMN HEADER is written into the output file, and I want the output file to have the COLUMN HEADER as well.
Of course, I can do something like this after I execute the first statement:
awk 'BEGIN {print "Column Headers\t"} {print}' out_table > out_table_with_header
But this becomes a 2-step process. So is there a way to do this in a SINGLE STEP?
In short, is there a way for me to ignore the column header while reading the file, perform operations on the data, then include the column header when writing to the output file, in a single step (or a block of steps that takes very little time)?
Not sure if I got you correctly, but you can simply do:
awk 'NR==1{print}; NR>1 { # process }' file
which can be simplified to:
awk 'NR==1; NR>1 { # process }' file
That works for a single input file.
If you want to process more than one file, all having the same column headers at line 1 use this:
awk 'FNR==1 && !h {print; h=1}; FNR>1 { # process }' file1 file2 ...
I'm using the variable h to check whether the headers have been printed already or not.
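As a concrete sketch, with the processing step filled in by a placeholder of my own (doubling a hypothetical tab-separated second column):
awk 'BEGIN{FS=OFS="\t"} FNR==1 && !h {print; h=1} FNR>1 {$2 = $2 * 2; print}' file1 file2 > out_table_with_header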

How do I get the length of file names longer than 250, then output the character count to a csv file?

I've got this:
cmd /c dir /s /b |? {$_.length -gt 250}
However, I would like to export the character count and file path into two separate columns in a csv file. Adding
| export-csv ./250files.csv
does the trick for exporting the count to a column, but I also want the path to each file in a second column.
Export-CSV expects objects to convert to CSV. Since you are just dealing with strings, you need to make the objects yourself in order to get the desired output. Calculated properties will do just that for you:
cmd /c dir /s /b | Select @{Name="Path";Expression={$_}}, @{Name="Length";Expression={$_.Length}}
You should be able to pipe that into the next step easily.
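A sketch of the full pipeline, keeping the original length filter and reusing the 250files.csv name from the question:
cmd /c dir /s /b | ? {$_.Length -gt 250} | Select-Object @{Name="Path";Expression={$_}}, @{Name="Length";Expression={$_.Length}} | Export-Csv ./250files.csv -NoTypeInformation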