Compare two CSV files of different length

I have two CSV files with different numbers of records. I need to compare column 1 of file1 with column 1 of file2 and, where they match, print only those lines from file2 whose column 2 does not match.
I need help doing this with a Unix command.

It's difficult to post a solution without example data, but you could try the following awk example nonetheless:
awk -F, 'NR==FNR { map[$1]=$2; next } ($1 in map) && map[$1]!=$2' file1.csv file2.csv
Set the field separator to a comma with -F. While reading the first file (NR==FNR), store the second comma-separated field in an array called map, indexed by the first field. Then, while reading the second file, print a line only if its first field has an entry in map and its second field differs from the stored value.
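Since the question came without example data, here is a minimal sketch of the stated requirement (key on column 1, print file2 lines whose column 2 differs). The file contents and the array name seen are made up for illustration, and the sketch assumes plain comma-separated fields with no quoting and a non-empty first file.

```shell
# Hypothetical sample data, written to a temporary directory
tmp=$(mktemp -d)
printf '%s\n' 'id1,red' 'id2,green' 'id3,blue' > "$tmp/file1.csv"
printf '%s\n' 'id1,red' 'id2,yellow' 'id4,blue' 'id3,black' > "$tmp/file2.csv"

# First file: remember column 2 under the column 1 key.
# Second file: print only when the key is known and column 2 differs.
result=$(awk -F, 'NR==FNR { seen[$1]=$2; next }
                  ($1 in seen) && seen[$1] != $2' "$tmp/file1.csv" "$tmp/file2.csv")
printf '%s\n' "$result"
rm -r "$tmp"
```

With this data, id1 matches in both columns and id4 is absent from file1, so neither is printed; only the id2 and id3 lines come out.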

Related

extract rows from one csv file based on column information in the other csv

I have 2 CSV files. I want to use the information in column 1 of file 1 to extract rows from the other CSV file.
My file1.csv has names arranged in a single column:
ENSG00000102977
ENSG00000100823
ENSG00000149311
ENSG00000175054
ENSG00000085224
ENSG00000178999
ENSG00000197299
ENSG00000139618
ENSG00000105173
ENSG00000175305
ENSG00000166226
ENSG00000163468
ENSG00000115484
ENSG00000150753
ENSG00000146731
and the 2nd csv file has the names along with the corresponding values arranged in rows.
ENSG00000102977 1083.82334384253 1824.50639384557 1962.86064714976 1367.60568624972
I wrote an awk script
`awk 'FNR == NR{f1[$1];next} $1 in f2 {print $1,$2,$3,$4,$5 $NF}' FS="," file1.csv FS="," file2.csv`
but it returns without any output or error.
Kindly guide me as to where I am going wrong. It's a bit puzzling since there is no error.

Try grep's -f option, which reads the patterns from a file:
grep -f fileWithPatterns.csv fileToExtractFrom.csv
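As for why the awk script in the question prints nothing: it fills the array f1 but then tests $1 in f2, which is always empty. Also, the sample row shown looks whitespace-separated, so FS="," would leave the whole row in $1. A minimal sketch of both approaches follows, assuming whitespace-separated rows in the second file; the data values are shortened and partly made up, and -w is assumed to be available (it is in GNU and BSD grep).

```shell
# Hypothetical, shortened sample data
tmp=$(mktemp -d)
printf '%s\n' 'ENSG00000102977' 'ENSG00000100823' > "$tmp/file1.csv"
printf '%s\n' 'ENSG00000102977 1083.8 1824.5' 'ENSG00000999999 1.0 2.0' > "$tmp/file2.csv"

# awk: test membership in the same array that was filled (f1),
# splitting on whitespace (the default field separator)
by_awk=$(awk 'NR==FNR { f1[$1]; next } $1 in f1' "$tmp/file1.csv" "$tmp/file2.csv")

# grep: -F treats the patterns as fixed strings, -w requires whole-word matches
by_grep=$(grep -F -w -f "$tmp/file1.csv" "$tmp/file2.csv")

printf '%s\n' "$by_awk"
rm -r "$tmp"
```

Note that the grep variant matches a name anywhere on the line, while the awk variant only looks at the first field.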

Trying to get all lines from csv file where column 8 equal a value using awk, but it prints all lines and matching lines twice

I have a CSV file and I want to get a subset of the lines where column 8 is equal to a specific value. I'm doing this:
awk 'FS=","; $8 == 0' infile.csv
The output of this command is that it prints every line in the file at least once, but the lines where the 8th column is 0, it will print twice. Why is it doing this? How do I get it to just print the matching lines once?

Your program has two patterns, FS="," and $8 == 0, and neither has an action. The implied action when a pattern is true is {print}. The assignment FS="," evaluates to a non-empty string, which counts as true, so it prints every record. The second pattern is true only when column 8 is 0, so those records are printed a second time.
If you don't want the assignment to be used as a pattern, wrap it in curly braces:
$ awk '{FS=","} $8==0{print}' infile.csv
However, setting FS again for each record is unnecessary, and per POSIX a new FS value only takes effect from the next record, so the first line would still be split with the default separator.
$ awk 'BEGIN{FS=","} $8==0' infile.csv
sets it once before any input is read. The easiest way is the -F option:
$ awk -F, '$8==0' infile.csv
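To check that the last two forms behave the same, here is a small sketch run against a made-up three-line infile.csv (the row contents are hypothetical).

```shell
# Hypothetical input: column 8 is 0 on the first and third rows
tmp=$(mktemp -d)
printf '%s\n' 'r1,b,c,d,e,f,g,0' 'r2,b,c,d,e,f,g,1' 'r3,b,c,d,e,f,g,0' > "$tmp/infile.csv"

# FS set once in BEGIN, before the first record is read
with_begin=$(awk 'BEGIN { FS="," } $8 == 0' "$tmp/infile.csv")

# Same thing with the -F command-line option
with_flag=$(awk -F, '$8 == 0' "$tmp/infile.csv")

printf '%s\n' "$with_flag"
rm -r "$tmp"
```

Both commands print the r1 and r3 rows once each.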

Find matches between multiple CSV files

I have an indeterminate number of CSV files in one folder. All the CSV files have only one column, with a different number of rows, like this:
File1.csv
rs1000
rs2000
rs4000
rs5000
...
I want to compare all the CSV files in that folder and output a CSV file with only the rows that are in common in all the files.
I have this command:
awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' *.csv > out_p.csv
but it shows rows that are not in all the files.

I didn't test it, but it should work:
awk '{line[$0]++}END{for(x in line)if(line[x]==ARGC-1)print x}' *.csv
The one-liner reads all lines into a hash table (an awk array),
increments the value (the occurrence count) each time a line is seen,
and finally prints the lines whose count equals the number of *.csv files (ARGC-1).
Note: this assumes that no CSV file contains duplicated lines.
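If a file might contain repeated lines, the count can be made per file rather than per occurrence. The following is only a sketch of that idea with made-up file contents; the array names seen and count are arbitrary, and the sort at the end is there only because for-in order in awk is unspecified.

```shell
# Hypothetical one-column files; File1.csv repeats rs2000 on purpose
tmp=$(mktemp -d)
printf '%s\n' rs1000 rs2000 rs2000 rs4000 > "$tmp/File1.csv"
printf '%s\n' rs2000 rs4000 rs5000 > "$tmp/File2.csv"
printf '%s\n' rs4000 rs2000 rs9000 > "$tmp/File3.csv"

# Count each line at most once per file, then keep the lines
# whose count equals the number of files given (ARGC - 1)
common=$(awk '
    !seen[FILENAME, $0]++ { count[$0]++ }
    END { for (line in count) if (count[line] == ARGC - 1) print line }
' "$tmp"/*.csv | sort)

printf '%s\n' "$common"
rm -r "$tmp"
```

Here rs2000 and rs4000 are in all three files, so only those two are printed.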

Try something like this:
awk '{ array[$1]++ } END { for(i in array) { if(array[i] > 1) print i } }' *.csv
Each CSV file has only one column, so you don't need to specify a field separator. A line is printed if it occurs more than once across all the CSV files, which means it appears in at least two files (or twice in one file), not necessarily in every file.

merge csv unix based on column 1

Hi, I have two CSV files with the same columns, like this:
x.csv
column1,column2
A,2
B,1
y.csv
column1,column2
A,1
C,2
I want the output like:
z.csv
column1,column2
A,2
B,1
C,2
i.e. for matching values in the first column, I want to keep the x.csv record (like A,2), and for a value that appears only in y.csv, I just want to append its record (like C,2).
Thanks

$ awk -F, 'NR==FNR{a[$1]; print; next} ! ($1 in a)' x.csv y.csv
column1,column2
A,2
B,1
C,2
How it works
-F,
This tells awk to use a comma as the field separator
NR==FNR{a[$1]; print; next}
While reading the first file (NR==FNR), this tells awk to (a) add $1 as a key to the associative array a, (b) print the line, and (c) skip the remaining commands and jump to the next line of input.
! ($1 in a)
If we get here, that means we are working on the second file. In that case, we print the line if the first field is not a key of array a (in other words, if the first field did not appear in the first file).

Find duplicate lines based on some delimited fields on the line

I have a file whose lines contain fields delimited by "|".
I have to extract the lines that are identical based on some of the fields
(i.e. find lines which contain the same values for fields 1, 2, 3, 12 and 13).
The contents of the other fields do not matter for the search, but the extracted lines have to be complete.
Can anyone tell me how I can do that in KSH scripting?
(For example, a script with some order-dependent arguments that define the field separator and the fields which have to be compared to find duplicate lines in the input file.)
Thanks in advance and kind regards
Oli

This prints duplicate lines based on matching fields. It uses an associative array which could grow large depending on the nature of your input file. The output is not sorted so most duplicates are not grouped together (except the first two of a set).
awk -F'|' '{ idx=$1$2$3$12$13; if (array[idx] == 1) {print} else if (array[idx]) {print array[idx]; print; array[idx]=1} else {array[idx]=$0}}' inputfile.txt
You could probably build up your index list in a shell variable in a wrapper script something like this:
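#!/bin/ksh
# Build an awk expression such as $1$2$3$12$13 from the field numbers given as arguments
idx=
for arg
do
    case $arg in # validate input (could be better)
        +([0-9]) ) # integers only
            idx="${idx}\$${arg}"
            ;;
        * )
            echo "Invalid field specifier"
            exit 1
            ;;
    esac
done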
awk -F'|' '{ idx='$idx'; if (array ...
You can sort the output by piping it through a command such as this:
awk ... | sort --field-separator='|' --key=1,1 --key=2,2 --key=3,3 --key=12,12 --key=13,13

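This prints only the first line for each distinct combination of the key fields, i.e. it removes the duplicates rather than listing them:
awk -F'|' '!arr[$1$2$3$12$13]++' inputfile > outputfile
To print one line for each key that occurs more than once, use the pattern arr[$1$2$3$12$13]++ == 1 instead.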
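The difference between keeping the first line per key and reporting one line per duplicated key can be seen on a tiny made-up input. This sketch keys on fields 1 and 2 only (instead of 1, 2, 3, 12 and 13) to keep the data short, and it uses the comma subscript arr[$1,$2], which joins the fields with SUBSEP so that values such as "ab","c" and "a","bc" do not collide.

```shell
# Hypothetical input: the key a|b occurs three times, c|d once
tmp=$(mktemp -d)
printf '%s\n' 'a|b|x' 'a|b|y' 'c|d|z' 'a|b|w' > "$tmp/inputfile"

# True only the first time a key is seen: one line per distinct key
first_of_each=$(awk -F'|' '!arr[$1,$2]++' "$tmp/inputfile")

# True only the second time a key is seen: one line per duplicated key
one_per_dup=$(awk -F'|' 'arr[$1,$2]++ == 1' "$tmp/inputfile")

printf 'first of each key:\n%s\n' "$first_of_each"
printf 'one per duplicated key:\n%s\n' "$one_per_dup"
rm -r "$tmp"
```

The first command keeps a|b|x and c|d|z; the second prints only a|b|y, the second occurrence of the repeated key.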
