Merge CSV files on Unix based on column 1

Hi, I have two CSV files with the same columns, like:
x.csv
column1,column2
A,2
B,1
y.csv
column1,column2
A,1
C,2
I want the output like:
z.csv
column1,column2
A,2
B,1
C,2
i.e. for matching data in the first column, I want to keep the x.csv record (so A,2 wins over A,1), and for a record that appears only in y.csv (like C,2) I just want to append it.
Thanks

$ awk -F, 'NR==FNR{a[$1]; print; next} ! ($1 in a)' x.csv y.csv
column1,column2
A,2
B,1
C,2
How it works
-F,
This tells awk to use a comma as the field separator.
NR==FNR{a[$1]; print; next}
While reading the first file (NR==FNR), this tells awk to (a) add $1 as a key to the associative array a, (b) print the line, and (c) skip the remaining commands and jump to the next line of input.
! ($1 in a)
If we get here, that means we are working on the second file. In that case, we print the line if the first field is not a key of array a (in other words, if the first field did not appear in the first file).
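To produce z.csv as in the question, just redirect the output to that file:
$ awk -F, 'NR==FNR{a[$1]; print; next} ! ($1 in a)' x.csv y.csv > z.csv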

Related

Update a CSV file to drop the first number and insert a decimal place in a particular column

I need help to perform the following
My CSV file looks like this
900001_10459.jpg,036921,Initiated
900002_10454.jpg,027964,Initiated
900003_10440.jpg,021449,Initiated
900004_10440.jpg,016650,Initiated
900005_10440.jpg,013929,Initiated
What I need to do is generate a new csv file to be as follows
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
if I was to do this as a test
echo '036921' | awk -v range=1 '{print substr($0,range+1)}' | sed 's/.$/.&/'
I get
3692.1
Can anyone help me so I can incorporate that, (or anything similar) to change my CSV file?
Try
sed -E 's/,0*([0-9]*)([0-9]),/,\1.\2,/' myfile.csv
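For example, on one line of the sample data:
$ echo '900001_10459.jpg,036921,Initiated' | sed -E 's/,0*([0-9]*)([0-9]),/,\1.\2,/'
900001_10459.jpg,3692.1,Initiated
The 0* consumes the leading zeros, the first group captures all but the last digit, and the second group captures the last digit so a dot can be inserted before it.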
Using awk, and with the conditions specified in the comments, you can use:
$ awk -F, '{ printf "%s,%06.1f,%s\n", $1, $2 / 10, $3 }' data
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
$
With the printf format string providing the commas, there's no need to set OFS (because OFS is not used by printf).
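In that format string, %06.1f prints the number with one digit after the decimal point, zero-padded to a minimum width of six characters; since every input value here has six digits, the padding never actually kicks in. A quick check:
$ awk 'BEGIN { printf "%06.1f\n", 36921 / 10 }'
3692.1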
Assuming that values with leading zeros appear solely in the 2nd column, I would use GNU AWK for this task as follows. Let file.txt content be
900001_10459.jpg,036921,Initiated
900002_10454.jpg,027964,Initiated
900003_10440.jpg,021449,Initiated
900004_10440.jpg,016650,Initiated
900005_10440.jpg,013929,Initiated
then
awk 'BEGIN{FS=",0?";OFS=","}{$2=gensub(/([0-9])$/, ".\\1", 1, $2);print}' file.txt
output
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
Explanation: I set the field separator (FS) to be , optionally followed by 0, so a leading zero is discarded as part of the separator. In the 2nd field I replace the last digit with . followed by that digit. Finally I print the changed line, using , as the output separator.
(tested in gawk 4.2.1)
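gensub() is specific to GNU awk; if portability matters, the same effect can be sketched with plain sub(), which supports & for the matched text and uses only POSIX awk features (assigning to the field via sub() rebuilds the record with OFS, just like the gensub version):
awk 'BEGIN{FS=",0?";OFS=","}{sub(/[0-9]$/, ".&", $2); print}' file.txt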
I wish to have 4 numbers (including zeros) and the last value (5th value) separated from the 4 values by a decimal point.
If I understand correctly, you need not all digits of that field but only the last five.
Using awk you can get the last five with the substr() function and then print the field with the last digit separated from the previous four by a decimal point, using the sub() function:
awk -F',' -v OFS=',' '{$2 = substr($2, length($2) - 4); sub(/[[:digit:]]$/, ".&", $2); print}' file
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated

Trying to get all lines from a CSV file where column 8 equals a value using awk, but it prints all lines, and matching lines twice

I have a CSV file and I want to get a subset of the lines where column 8 is equal to a specific value. I'm doing this:
awk 'FS=","; $8 == 0' infile.csv
The output of this command is that every line in the file is printed at least once, and the lines where the 8th column is 0 are printed twice. Why is it doing this? How do I get it to print just the matching lines, once each?
You have two pattern expressions, FS="," and $8 == 0, each with the implied action {print} when the pattern evaluates to true. The first is an assignment whose result (the string ",") is non-false, so it prints every record. The second is true only when the 8th field is 0, which is why those records are printed twice.
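You can reproduce the effect with any two always-true patterns, e.g. (made-up two-line input):
$ printf 'a\nb\n' | awk '1; /a/'
a
a
b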
If you don't want the assignment to be used as a condition, wrap it in curly braces:
$ awk '{FS=","} $8==0{print}'
However, setting FS again and again for each record is unnecessary, and it leaves the first record split with the default whitespace separator (FS is assigned only after that record has been read). Setting it once in a BEGIN block avoids both problems:
$ awk 'BEGIN{FS=","} $8==0'
does the job for every record. The easiest of all, though, is the -F option:
$ awk -F, '$8==0'
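For example, with a made-up line whose 8th field is 0:
$ echo 'a,b,c,d,e,f,g,0,i' | awk -F, '$8==0'
a,b,c,d,e,f,g,0,i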

Compare two CSV files of different lengths

I have two CSV files with different numbers of records. I have to compare column 1 of file1 with column 1 of file2; if it matches, then print only those lines from file2 where column 2 does not match.
I need help doing this with Unix commands.
It's difficult to post a solution without example data, but you could try the following awk example nonetheless:
awk -F, 'NR==FNR { map[$2]=1;next } { if (map[$2]!=1) { print } }' file1.csv file2.csv
Set the field separator to comma with -F. Process the first file first (NR==FNR), creating an array called map with the second comma-separated field as the index. Then, when processing the second file, if there is no entry in map for the second field, print the line.
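Note that this keys on the second field only. If the requirement is taken literally (match on column 1, print file2 lines whose column 2 differs), a sketch along these lines might be closer, assuming column 1 values are unique within file1:
awk -F, 'NR==FNR { seen[$1] = $2; next } ($1 in seen) && seen[$1] != $2' file1.csv file2.csv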

Print first, penultimate and last fields in CSV file

I have a big comma-separated file with 20000 rows and five data columns. I want to extract particular columns, but some values contain embedded commas (everywhere except the header), so how do I cut those columns?
example file:
name,v1,v2,v3,v4,v5
as,"10,12,15",21,"12,11,10,12",5,7
bs,"11,15,16",24,"19,15,18,23",9,3
This is my desired output:
name,v4,v5
as,5,7
bs,9,3
I tried the following cut command, but it doesn't work:
cut -d, -f1,5,6
In general, for these scenarios it is best to use a proper CSV parser; you can find those in Python, for example.
However, since your data seems to have fields with embedded commas only near the beginning of each line, you can simply print the first field followed by the penultimate and last ones:
$ awk 'BEGIN{FS=OFS=","} {print $1, $(NF-1), $NF}' file
name,v4,v5
as,5,7
bs,9,3
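NF is the number of fields on the current line, so $(NF-1) and $NF always pick out the last two columns no matter how many extra commas the quoted fields contribute. You can see the per-line field counts on the sample file:
$ awk -F, '{ print NF }' file
6
11
11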
In TXR Lisp:
$ txr extract.tl < data
name,v4,v5
as,5,7
bs,9,3
Code in extract.tl:
(mapdo
  (lambda (line)
    (let ((f (tok-str line #/"[^"]*"|[^,]+/)))
      (put-line `#[f 0],#[f 4],#[f 5]`)))
  (get-lines))
As a condensed one liner:
$ txr -t '(mapcar* (do let ((f (tok-str #1 #/"[^"]*"|[^,]+/)))
`#[f 0],#[f 4],#[f 5]`) (get-lines))' < data

Find duplicate lines based on some delimited fields

I have a file with lines having some fields delimited by "|".
I have to extract the lines that are identical based on some of the fields
(i.e. find lines which contain the same values for fields 1, 2, 3, 12 and 13).
The other fields' contents don't matter for the search, but the extracted lines have to be complete.
Can anyone tell me how I can do that in KSH scripting
(for example, a script with some arguments (order dependent) that define the field separator and the fields which have to be compared to find duplicate lines in the input file)?
Thanks in advance and kind regards
Oli
This prints duplicate lines based on matching fields. It uses an associative array which could grow large depending on the nature of your input file. The output is not sorted, so most duplicates are not grouped together (except the first two of a set).
awk -F'|' '{ idx=$1$2$3$12$13; if (array[idx] == 1) {print} else if (array[idx]) {print array[idx]; print; array[idx]=1} else {array[idx]=$0}}' inputfile.txt
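Spread out with comments, the same logic reads:
awk -F'|' '{
  idx = $1 $2 $3 $12 $13     # key built from the chosen fields
  if (array[idx] == 1)       # key already reported as a duplicate
    print
  else if (array[idx])       # second occurrence: emit the stored first line, then this one
    { print array[idx]; print; array[idx] = 1 }
  else                       # first occurrence: just remember the whole line
    array[idx] = $0
}' inputfile.txt
One caveat: plain concatenation of the fields can in principle collide (fields ab|c and a|bc produce the same key); joining the fields with the separator, or subscripting as array[$1,$2,$3,$12,$13], avoids that.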
You could probably build up your index list in a shell variable in a wrapper script, something like this:
#!/bin/ksh
for arg
do
    case $arg in        # validate input (could be better)
        +([0-9]) )      # integers only
            idx="$idx\$$arg"
            ;;
        * )
            echo "Invalid field specifier"
            exit 1
            ;;
    esac
done
awk -F'|' '{ idx='"$idx"'; if (array ...
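Invoked with the field numbers as arguments (script name hypothetical), it would run as:
$ ./dupfields.ksh 1 2 3 12 13 < inputfile.txt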
You can sort the output by piping it through a command such as this:
awk ... | sort --field-separator='|' --key=1,1 --key=2,2 --key=3,3 --key=12,12 --key=13,13
This prints lines which are duplicated, just one line each (it prints the second occurrence of each key; the commas in the subscript join the fields with awk's SUBSEP, which avoids collisions between concatenated field values):
awk -F'|' 'arr[$1,$2,$3,$12,$13]++ == 1' inputfile > outputfile