I have one CSV file with 11 columns, with the first column as the primary id (P1),
and a second CSV with three columns whose first column is the same primary id (P1), though the rows are not in the same order in both files.
I am merging both files using the command below:
awk 'NR==FNR {h[$2] = $3; next} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,h[$2]}' first.csv second.csv > final.csv
However, I am getting only three columns in the new CSV.
You should see if join wouldn't be an easier solution. Type man join for that:
join - join lines of two files on a common field
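As a rough sketch of the join route: the sample rows below are made up and use three columns instead of eleven, and the files are assumed to be comma-separated. join expects both inputs to be sorted on the join field, so they are sorted first.

```shell
# Hypothetical sample data standing in for the real files.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 'P3,a3,b3' 'P1,a1,b1' 'P2,a2,b2' > first.csv
printf '%s\n' 'P2,x2,y2' 'P1,x1,y1' > second.csv

# join needs both inputs sorted on the join field (field 1 by default).
sort -t, -k1,1 first.csv  > first.sorted
sort -t, -k1,1 second.csv > second.sorted

# -t, sets the delimiter; only ids present in both files are printed.
join -t, first.sorted second.sorted > joined.csv
cat joined.csv
```

Ids that appear in only one file (P3 here) are dropped by default; join's -a option can keep them.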
If first.csv has 11 columns and second.csv has three, then your files are in the wrong order. Try it like this:
awk 'NR==FNR {h[$2] = $3; next} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,h[$2]}' second.csv first.csv > final.csv
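Note that this example uses the second column as the key, not the first. It also sets no field separator, so awk splits on whitespace; for comma-separated data add -F, (and -v OFS=, to keep commas in the output).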
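Putting those points together, a minimal sketch might look like the following. The sample rows are invented (three columns stand in for the eleven), and the array name extra is only illustrative. The lookup file is read first, the id in column 1 is the key, and the whole row is printed with $0 rather than listing every field.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
# Hypothetical data: first.csv stands in for the 11-column file.
printf '%s\n' 'P1,a,b' 'P2,c,d' 'P3,e,f' > first.csv
# Lookup file, with rows in a different order and one id missing.
printf '%s\n' 'P3,x3,y3' 'P1,x1,y1' > second.csv

# Pass 1 (second.csv): remember columns 2 and 3 under the id in column 1.
# Pass 2 (first.csv): print the whole row, then the remembered columns,
# or two empty columns when the id has no match.
awk -F, -v OFS=, '
    NR == FNR { extra[$1] = $2 OFS $3; next }
    { print $0, (($1 in extra) ? extra[$1] : OFS) }
' second.csv first.csv > final.csv
cat final.csv
```

Padding unmatched rows with empty columns keeps every output line at the same width, which tends to matter if the result is loaded elsewhere.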
I often use the cqlsh command COPY...FROM CSV... but I have new needs.
I'd like to add an extra column to my Cassandra table that would be created from two other columns.
Example (CSV file):
1;2
2;4
3;6
would become a table with these values:
my table: 12;1;2
24;2;4
36;3;6
I've tried other options, but they're much slower than COPY...FROM CSV.
Do you know if I can do that using COPY...FROM CSV?
You can't do this with the COPY command alone.
If you are using Linux:
First dump the CSV to a file with the COPY command, say csv_test.csv:
1;2
2;4
3;6
Then use the command below to concatenate the first two columns and prepend the result as a new first column:
awk -F ";" '{print $1$2 ";" $0}' csv_test.csv > csv_test_combine.csv
Output file csv_test_combine.csv :
12;1;2
24;2;4
36;3;6
I have a file named key and another CSV file named val.csv. As you can imagine, the file named key looks something like this:
123
012
456
The file named val.csv has multiple columns and corresponding values. It looks like this:
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
9,0,0,452,K,p
1,2,2,000,L,x
I would like to get the subset of lines from val.csv whose value in the KEY column matches one of the values in the key file. Using the above example, I would like to get an output like this:
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
Obviously these are just toy examples. The real KEY file I am using has nearly 500,000 'keys' and the val.csv file has close to 5 million lines in them. Thanks.
$ awk -F, 'FNR==NR{k[$1]=1;next;} FNR==1 || k[$4]' key val.csv
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
How it works
FNR==NR { k[$1]=1;next; }
This saves the values of all keys read from the first file, key.
The condition is FNR==NR. FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, if FNR==NR, we are still reading the first file.
When reading the first file, key, this saves the value of key in associative array k. This then skips the rest of the commands and starts over on the next line.
FNR==1 || k[$4]
If we get here, we are working on the second file.
This condition is true either for the first line of the file, FNR==1, or for lines whose fourth field is in array k. If the condition is true, awk performs the default action which is to print the line.
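One variation worth considering for inputs of this size: referencing k[$4] creates an empty array entry for every value looked up, so the array can grow with the number of distinct KEY values in val.csv. Testing membership with the in operator avoids that. A sketch using the toy data from the question (the output file name subset.csv is just for illustration):

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 123 012 456 > key
printf '%s\n' 'V1,V2,V3,KEY,V5,V6' '1,2,3,012,X,t' \
    '9,0,0,452,K,p' '1,2,2,000,L,x' > val.csv

awk -F, '
    FNR == NR { k[$1]; next }   # first file: record each key, value unused
    FNR == 1 || ($4 in k)       # second file: header line, or KEY is a known key
' key val.csv > subset.csv
cat subset.csv
```

The selected lines are the same as with k[$4]; only the memory behaviour differs.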
I have a CSV file that contains a few thousand lines. It looks something like this:
abc,123,hello,world
abc,124,goodbye,turtles
def,100,apples,pears
....
I want each unique entry in column one to appear exactly three times. For example, if exactly three lines have "abc" in the first column, that is fine and nothing happens. But if there are not exactly three lines with "abc" in the first column, all lines with "abc" in column 1 must be deleted.
This
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
def,100,apples,pears
def,10,foo,bar
ghi,2,one,two
ghi,6,three,four
ghi,4,five,six
ghi,9,seven,eight
Should become:
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
Many Thanks,
Awk way
awk -F, 'FNR==NR{a[$1]++;next}a[$1]==3' test{,}
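Set the field separator to ,
While reading the first copy of the file (FNR==NR):
Increment the array entry keyed on field 1
Skip the remaining rules with next
Then read the file again (in bash, test{,} expands to test test)
Print the line if its counter is exactly 3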
this awk one-liner should do:
awk -F, 'NR==FNR{a[$1]++;next}a[$1]==3' file file
it doesn't require your file to be sorted.
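Both one-liners read the file twice, which does not work when the data arrives on a pipe. A single-pass sketch that buffers the lines in memory instead (reasonable for a few thousand lines; the array names and the output file kept.csv are illustrative):

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf '%s\n' 'abc,123,hello,world' 'abc,124,goodbye,turtles' 'abc,167,cat,dog' \
    'def,100,apples,pears' 'def,10,foo,bar' \
    'ghi,2,one,two' 'ghi,6,three,four' 'ghi,4,five,six' 'ghi,9,seven,eight' > test

awk -F, '
    # Count each first-column value and keep every line in input order.
    { count[$1]++; key[NR] = $1; line[NR] = $0 }
    END {
        # Replay the buffered lines, keeping groups that occur exactly 3 times.
        for (i = 1; i <= NR; i++)
            if (count[key[i]] == 3) print line[i]
    }
' test > kept.csv
cat kept.csv
```

The trade-off is memory: the whole input is held in arrays, whereas the two-pass version only stores the counts.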
I have an HTML table in which the first row is the title and the following rows are the body of the table. I want to extract the values from the third column of each row. How can I proceed?
Try the below awk command,
awk 'NR>1{print $3}' file
This prints the value of third column except the one in the header.
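Update: for actual HTML markup, split records on </tr> and fields on <td> (a multi-character RS needs gawk or mawk). The text before the first <td> becomes field 1, so the third cell is $4; gsub strips the remaining tags in place:
awk -v RS='</tr>' -F'<td>' 'NF>=4{gsub(/<[^<>]*>/,"",$4); print $4}' file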
UPDATE: added an example to clarify the format of the data.
Considering a CSV with each line formatted like this:
tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5,[tbl2.col1:tbl2.col2]+
where [tbl2.col1:tbl2.col2]+ means that there could be any number of these pairs repeated
ex:
tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2
The tables would relate to each other using the line number as a key, which would have to be created in addition to any columns mentioned above.
Is there a way to use mysql load data infile to load the data into two separate tables?
If not, what Unix command line tools would be best suited for this?
no, not directly. load data can only insert into one table or partitioned table.
what you can do is load the data into a staging table, then use insert into to select the individual columns into the 2 final tables. you may also need substring_index if you're using different delimiters for tbl2's values. the line number is handled by an auto incrementing column in the staging table (the easiest way is to make the auto column last in the staging table definition).
the format is not exactly clear, and this is best done with perl/php/python, but if you really want to use shell tools:
cut -d , -f 1-5 file | awk '{print NR "," $0}' > table1
cut -d , -f 6- file | sed 's/:/,/g' | \
awk -F, '{i=1; while (i<=NF) {print NR "," $(i) "," $(i+1); i+=2;}}' > table2
this creates table1 and table2 files with these contents:
1,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
2,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
3,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
and
1,tbl2.col1,tbl2.col2
1,tbl2.col1,tbl2.col2
2,tbl2.col1,tbl2.col2
2,tbl2.col1,tbl2.col2
3,tbl2.col1,tbl2.col2
3,tbl2.col1,tbl2.col2
As you say, the problematic part is the unknown number of [tbl2.col1:tbl2.col2] pairs in each line. I would be tempted to solve this with sed: split the one file into two files, one for each table. Then you can use load data infile to load each file into its corresponding table.
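The split can also be done in a single awk pass rather than with sed. This is a sketch under assumptions: the sample rows are invented, the output names table1 and table2 are arbitrary, and no value is assumed to contain a comma or a colon.

```shell
tmp=$(mktemp -d) && cd "$tmp" || exit 1
# Hypothetical input: five fixed columns, then any number of col1:col2 pairs.
printf '%s\n' 'a1,a2,a3,a4,a5,k1:v1,k2:v2' 'b1,b2,b3,b4,b5,k3:v3' > file

awk -F, -v OFS=, '
    {
        # The line number is the key that links the two tables.
        print NR, $1, $2, $3, $4, $5 > "table1"
        # Every field from the sixth onwards is one pair.
        for (i = 6; i <= NF; i++) {
            split($i, pair, ":")
            print NR, pair[1], pair[2] > "table2"
        }
    }
' file
cat table1 table2
```

Each output file can then be loaded into its own table with load data infile.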