Using AWK to merge two files based on multiple conditions - csv

I know this question has been asked several times before. Here is one example:
Using AWK to merge two files based on multiple columns
My goal is to print out columns 2, 4, 5 and 7 of file_b and columns 17 and 18 of file_a if the following match occurs:
Columns 2, 6 and 7 of file_a.csv match columns 2, 4 and 5 of file_b.csv, respectively.
But no matter how much I try, I can't get it to work for my case. Here are my two files:
file_a.csv
col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
x, y, z, 243, 222
file_b.csv
col2, col4, col5, col7
a, b, c, 4.5
e, f, g, 6.3
x, k, l, 12.9
Output should look like this:
col2, col4, col5, col7, col17, col18
a, b, c, 4.5, 145, 88
e, f, g, 6.3, 101, 96
I tried this:
awk -F, -v RS='\r\n' 'NR==FNR{key[$2 FS $6 FS $7]=$17 FS $18;next} {if($2 FS $4 FS $5 in key); print $2 FS $4 FS $5 FS $7 FS key[$2 FS $6 FS $7]}' file_a.csv file_b.csv > out.csv
Currently the output I am getting is:
col2, col4, col5, col7,
a, b, c, 4.5,
e, f, g, 6.3,
In other words, col17 and col18 from file_a are not showing up.
Yesterday I asked a related question where I was having issues with line breaks. That got answered and solved but now I think this problem is related to checking the if condition.
Update:
I am sharing links to truncated copies of the actual data. The only difference between these files and the actual ones is that the real ones have millions of rows; these have only 10 rows each.
file_a.csv
file_b.csv

Please try this (GNU awk):
awk 'BEGIN{RS="\r\n";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}'
This is where the BEGIN block kicks in, and OFS along with it.
When printing many fields separated by the same delimiter, we can set OFS and simply put commas between the expressions we want to print.
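For example (a minimal sketch unrelated to the question's data), the effect of OFS can be seen with:
echo '1 2 3' | awk 'BEGIN{OFS=","} {print $1, $2, $3}'
which prints 1,2,3, because each comma in the print list is replaced by OFS on output.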
There's no need to check key in arr once you've assigned a value for that key in the array:
by default, when arr[somekey] hasn't been assigned yet, it is the empty string "", which evaluates to false in awk (0 in a numeric context), while a non-empty string evaluates to true (there are no literal true and false values in awk).
(You also used a misleading array name: $2,$6,$7 is the key into the array arr here, so calling the array itself key is confusing.)
You can test simple concepts like this:
awk 'BEGIN{print arr["newkey"]}'
You don't need an input file to execute the BEGIN block.
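As another small sketch of the truthiness rule above (the key name is arbitrary):
awk 'BEGIN{if(arr["newkey"]) print "set"; else print "unset"; arr["newkey"]="x"; if(arr["newkey"]) print "now set"}'
It prints unset and then now set, because an unassigned element is the empty string (false) until a non-empty value is stored in it.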
Also, quoting things explicitly can sometimes help avoid confusion and hidden problems.
Update:
Your files actually end in \n. If you can't be sure what the line ending is, use this:
awk 'BEGIN{RS="\r\n|\n|\r";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
or this (this one will also skip empty lines):
awk 'BEGIN{RS="[\r\n]+";FS=OFS=",";SUBSEP=FS}NR==FNR{arr[$2,$6,$7]=$17 FS $18;next} {if(arr[$2,$4,$5]) print $2,$4,$5,$7,arr[$2,$4,$5]}' file_a.csv file_b.csv
Also, it's better to convert the files first to avoid such situations:
sed -i 's/\r//' files
Or you can use the dos2unix command:
dos2unix file
It's a handy command-line tool that does exactly this.
You can install it if you don't have it on your system yet.
Once converted, you don't need to set RS in normal situations.
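If you want to check which line endings a file actually has before converting, one quick way (assuming GNU cat is available) is:
cat -A file_a.csv | head -3
CRLF-terminated lines show a trailing ^M$, plain LF lines show just $.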

$ awk 'BEGIN {RS="\r\n"; FS=OFS=","}
NR==FNR {a[$2,$6,$7]=$17 OFS $18; next}
($2,$4,$5) in a {print $2,$4,$5,$7,a[$2,$4,$5]}' file1 file2 > output
Your main issue is that in the array lookup, the index should be built from the second file's fields, not the first file's. Also, the semicolon right after the if condition is wrong: it terminates the if, so the print runs unconditionally. The rest is cosmetic.
It's not clear whether you want the output \r\n-terminated; if so, set ORS=RS as well, otherwise it will be newline-terminated only.
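As a side note on the two lookup styles seen in these answers, here is a small sketch of the difference between the in test and a value test:
awk 'BEGIN{a["k"]=""; print ("k" in a), (a["k"] ? "true" : "false"), ("other" in a)}'
This prints 1 false 0: the in operator only checks whether the key exists, while if(a[key]) checks the stored value's truthiness (and quietly creates the key if it was missing). With non-empty values such as $17 OFS $18 the two behave the same here.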

Since you have mentioned that the file is huge, you can give Perl a try, if that is an option.
The files are assumed to contain "\r".
$ cat file_a.csv
col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
x, y, z, 243, 222
$ cat file_b.csv
col2, col4, col5, col7
a, b, c, 4.5
e, f, g, 6.3
x, k, l, 12.9
$ perl -F, -lane 'BEGIN { %kv=map{chomp;chop;@a=split(",");"$a[0],$a[1],$a[2]"=>"$a[3]"} qx(cat file_b.csv) } if($.>1){ $x="$F[0],$F[1],$F[2]";chomp($F[-1]);print "$x,$kv{$x}",join(",",@F[-2,-1]) if $kv{$x} } ' file_a.csv
a, b, c, 4.5 145, 88
e, f, g, 6.3 101, 96
$

Related

Cassandra: Precision of double values exported with COPY TO in a collection (map)

How do I set the precision of a double value in a collection in the same way as a double column when exporting to a CSV file with COPY TO?
DOUBLEPRECISION works for the column (d below), but not for the key in the map.
CREATE TABLE test.digits (
name text,
d double,
digits map<double, int>,
PRIMARY KEY ((name))
);
INSERT INTO test.digits (name, d, digits) VALUES ('Fred', 0.1234567890123456789,
{
1.1234567890123456789 : 1,
2.1234567890123456789 : 2,
3.1234567890123456789 : 3
}
);
SELECT * from test.digits;
name | d | digits
------+----------+--------------------------------------
Fred | 0.123457 | {1.12346: 1, 2.12346: 2, 3.12346: 3}
COPY test.digits (name, d, digits) TO 'digits.csv' WITH header=true AND DELIMITER='|' AND NULL='' AND DOUBLEPRECISION=15;
cat digits.csv
name|d|digits
Fred|0.1234567890123457|{1.12346: 1, 2.12346: 2, 3.12346: 3}
If it's not possible to set this precision in a collection, is this a bug or a feature?
Try adding FLOATPRECISION as well:
COPY my_keyspace.digits (name, d, digits) TO 'digits.csv'
WITH header=true AND DELIMITER='|' AND NULL='' AND DOUBLEPRECISION=15
AND FLOATPRECISION = 16;
➜ cat digits.csv
name|d|digits
Fred|0.1234567890123457|{1.1234567890123457: 1, 2.1234567890123457: 2, 3.1234567890123457: 3}
The cqlsh utility displays 5 digits of precision by default.
To enable more digits for the select, add or create a ~/.cassandra/cqlshrc file with:
[ui]
float_precision = 10
double_precision = 15
However, this will not resolve the COPY TO precision issue which seems like a bug in collections export as implied by @fg78nc.

Merging two csv files, can't get rid of newline

I am merging two csv files. For simplicity, I am showing relevant columns only. There are more than four columns in both files.
file_a.csv
col2, col6, col7, col17
a, b, c, 145
e, f, g, 101
x, y, z, 243
file_b.csv
col2, col6, col7, col17
a, b, c, 88
e, f, g, 96
x, k, l, 222
Output should look like this:
col2, col6, col7, col17, col18
a, b, c, 145, 88
e, f, g, 101, 96
So col17 of file_b is added to file_a as col18 when the contents of col2, col6 and col7 match.
I tried this:
awk -F, 'NR == FNR {a[$2,$6,$7] = $17;next;} {if (! (b = a[$2,$6,$7])) b = "N/A";print $0,FS,b;}' file_a.csv file_b.csv > out.csv
The output looks like this:
col2, col6, col7, col17,
, col18
a, b, c, 145
, 88
e, f, g, 101
, 96
So the column 17 from file_b I am trying to add does get added but shows up on a new line.
I think this is because there are carriage returns after each line of file_a and file_b. In Notepad++, I can see CRLF. But I can't get rid of them. Also, I would rather not go through two steps: getting rid of carriage returns first and then merging. Instead, if I can bypass the carriage returns during the merge, it will be much faster.
Also, I would appreciate it if you could tell me how to get rid of the spaces before and after the comma separating the merged column. Note that I put spaces between the columns and commas for better readability; that is not how it is in the actual files. But there are indeed spaces between col17, ",", and col18 in the merged file, and I don't know why.
If you insist on marking this as a duplicate, kindly explain in a comment below how the answers to the previous question(s) address my issue. I tried figuring it out from those previous similar questions and I failed.
Try this please (GNU awk):
awk -F, -v RS="[\r\n]+" 'NR == FNR {a[$2,$6,$7] = $17;next;} {b=a[$2,$6,$7]; print $0 FS (b? b : "N/A")}' file_a.csv file_b.csv
The things you were having problems with:
1. Carriage returns: with RS="[\r\n]+", any run of \r and \n characters is treated as a record separator. Note that this also skips empty lines; if you don't want that, change it to RS="\r\n".
2. The spaces: awk's default OFS is a space, and every comma in a print list inserts OFS between the printed items. Just concatenate the expressions (write them next to each other without a comma) and they will be joined with nothing in between.
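To see where the stray spaces in your original print $0,FS,b came from, compare (a minimal reproduction, not your real data):
echo 'x,y' | awk -F, '{print $1, FS, $2}'
echo 'x,y' | awk -F, '{print $1 FS $2}'
The first prints x , y because each comma in the print list inserts the default OFS (a space); the second prints x,y because the pieces are simply concatenated.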
Could you please try the following.
awk -v RS="[\r\n]+" '
BEGIN{
  SUBSEP=OFS=", "
}
FNR==NR{
  if(FNR==1){
    header=$0
  }
  a[$1,$2,$3]=$4
  next
}
FNR==1 && FNR!=NR{
  split(header,array,", ")
  sub(/[a-zA-Z]+/,"",array[4])
  print header,"col"array[4]+1
  next
}
a[$1,$2,$3]{
  print $0,a[$1,$2,$3]
}' b.csv a.csv
What the above code does:
1- It looks like you may have carriage returns in your Input_file(s), so I have made \r\n the record separator (in case you want to remove the carriage returns instead, try tr -d '\r' < a.csv > temp && mv temp a.csv and do the same for the other files).
2- It also builds the new header column name from your file's last existing column.
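The header-building step can be seen in isolation with this small sketch (using only the sample header string):
awk 'BEGIN{header="col2, col6, col7, col17"; split(header,array,", "); sub(/[a-zA-Z]+/,"",array[4]); print header ", col" (array[4]+1)}'
It strips the letters from the last column name (col17 becomes 17), adds 1, and prints the original header with , col18 appended.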
With Miller (http://johnkerl.org/miller/doc):
mlr --csv join -j col2,col6,col7 --lp l --rp r -f file_a.csv \
then unsparsify --fill-with "" \
then rename lcol17,col17,rcol17,col18 file_b.csv
you have
col2,col6,col7,col17,col18
a,b,c,145,88
e,f,g,101,96
I have used as input
# file_a.csv
col2,col6,col7,col17
a,b,c,145
e,f,g,101
x,y,z,243
# file_b.csv
col2,col6,col7,col17
a,b,c,88
e,f,g,96
x,k,l,222
Since you wanted to get rid of the spaces around the , delimiter, you can try this Perl solution, which removes the spaces while splitting.
The answer assumes you have \r in the files. I have used cat's -vT option to show that the carriage returns exist:
$ cat -vT file_a.csv
col2, col6, col7, col17^M
a, b, c, 145^M
e, f, g, 101^M
x, y, z, 243^M
$ cat -vT file_b.csv
col2, col6, col7, col17^M
a, b, c, 88^M
e, f, g, 96^M
x, k, l, 222^M
$
$ perl -lne 'BEGIN { %kv=map{chomp;chop;@a=split(/\s*,\s*/);"$a[0],$a[1],$a[2]"=>"$a[3]"} qx(cat file_b.csv) } chop;@b=split(/\s*,\s*/);$x="$b[0],$b[1],$b[2]"; print "$x,$b[-1],",$kv{$x} if $kv{$x} ' file_a.csv
col2,col6,col7,col17,col17
a,b,c,145,88
e,f,g,101,96
$

awk delete field if other column match

I have a CSV file that looks like this:
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,5,c,01/03
2,5,6,c,01/03
The last 2 rows have been appended to the file, but they have an extra (third) column. I want to delete the third column from the last 2 rows (i.e. where column 4 == "c" and column 5 == "01/03").
The output I want removes the third column from the last 2 rows so that every row has only 4 columns:
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,c,01/03
2,5,c,01/03
If this can be done in vim, that would be good too.
Here's a slightly different approach that avoids having to type the list of columns to be included:
awk -F, 'BEGIN {OFS=FS} NF==5 {for(i=3;i<=NF;i++){$i=$(i+1)}; NF--} 1'
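Applied to the sample data (assuming the input is in a file named file.csv), that looks like:
awk -F, 'BEGIN {OFS=FS} NF==5 {for(i=3;i<=NF;i++){$i=$(i+1)}; NF--} 1' file.csv
The row 3,4,5,c,01/03 becomes 3,4,c,01/03 while the 4-field rows pass through untouched. Note that decrementing NF to drop the last field works in GNU awk; strict POSIX awk leaves that behaviour undefined.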
The solution with an explicit listing of columns can also be written more compactly as follows:
awk -F, 'BEGIN {OFS=FS} NF == 5 {print $1, $2, $4, $5; next} 1'
This should do it
awk -F, 'BEGIN {OFS=","} {if (NF == 5) {print $1, $2, $4, $5} else {print}}' filename
$ awk 'BEGIN{FS=OFS=","} {print $1,$2,$(NF-1),$NF}' file
col1,col2,col3,col4
1,2,a,01/01
2,3,b,01/02
3,4,c,01/03
2,5,c,01/03

Convert MySql column with multiple values to proper table

I have a dataset in the form of a CSV file that is sent to me on a regular basis. I want to import this data into my MySql database and turn it into a proper set of tables. The problem I am having is that one of the fields is used to store multiple values. For example, the field stores email addresses. It may have one email address, or two, or three, or four, etc. The field contents look something like this: "user1@domain.com,user2@domain.com,user3@domain.com".
I need to be able to take the undetermined number of values from each field and then add them into a separate table so that they look like this:
user1@domain.com
user2@domain.com
user3@domain.com
I am not sure how I can do this. Thank you for the help.
Probably the simplest way is a brute force approach of inserting the first email, then the second, and so on:
insert into newtable(email)
select substring_index(substring_index(emails, ',', 1), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 1;
insert into newtable(email)
select substring_index(substring_index(emails, ',', 2), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 2;
insert into newtable(email)
select substring_index(substring_index(emails, ',', 3), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 3;
And so on.
That is, extract the nth element from the list and insert that into the table. The where clause counts the number of commas in the list, which is a proxy for the length of the list.
You need to repeat this up to the maximum number of emails in the list.
Instead of importing the csv file directly and then trying to fix the problems in it, I found the best way to attack this was to first pass the csv through AWK.
AWK outputs three separate csv files that follow the normal forms. I then import those tables and all is well.
info="`ncftpget -V -c -u myuser -p mypassword ftp://fake.com/data_map.csv`"

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
  split($6, keyvalue, ";")
  for (var in keyvalue) {
    gsub(/.*:/, "", keyvalue[var])
    print $1, keyvalue[var]
  }}' > ~/sqlrw/table1.csv

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
  split($6, keyvalue, ";")
  for (var in keyvalue) {
    gsub(/:/, ",", keyvalue[var])
    print keyvalue[var]
  }}' > ~/sqlrw/table2.csv

sort -u ~/sqlrw/table2.csv -o ~/sqlrw/table2.csv

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
  print $1, $2, $3, $4, $5, $7, $8
}' > ~/sqlrw/table3.csv
Maybe a simple PHP script would do the trick:
<?php
$file = file_get_contents("my_file.csv");
$tmp = explode("\n", $file); // one record per line
for ($i = 0; $i < count($tmp); $i++)
{
    $field = $tmp[$i];
    $q = "INSERT INTO my_table (emails) VALUES ('$field')";
    // or use $i as an id if you don't have an autoincrement
    $q = "INSERT INTO my_table (id, emails) VALUES ($i, '$field')";
    // execute query ....
}
?>
Hope this helps even if it's not pure SQL .....

Awk: How to cut similar part of 2 fields and then get the difference of remaining part?

Let's say I have 2 fields displaying epoch time in microseconds:
1318044415123456,1318044415990056
What I wanted to do is:
Cut the common part from both fields: "1318044415"
Get the difference of the remaining parts: 990056 - 123456 = 866600
Why am I doing this? Because awk uses IEEE 754 floating point rather than 64-bit integers, and I need the difference between the epoch times of 2 events in microseconds.
Thanks for any help!
EDIT:
Finally, I found the largest integer Awk could handle exactly on Snow Leopard 10.6.8: 9007199254740992.
Try this: echo '9007199254740992' | awk -F ',' '{print $1 + 0}'
The version of Awk was 20070501 (reported by awk --version).
Here is an awk script that meets your requirements:
BEGIN {
  FS = ","
}
{
  s1 = $1
  s2 = $2
  while (length(s1) > 1 && substr(s1, 1, 1) == substr(s2, 1, 1))
  {
    s1 = substr(s1, 2)
    s2 = substr(s2, 2)
  }
  n1 = s1 + 0
  n2 = s2 + 0
  print n2 - n1
}
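Saved to a file (say diff.awk, a name used here only for illustration), it can be run like this:
echo '1318044415123456,1318044415990056' | awk -f diff.awk
866600
The loop strips the shared leading digits one at a time, so the subtraction only ever sees the small remainders and stays well inside awk's exact integer range.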