Get lines from file that match strings in another file using AWK - csv

I have a file named key and another CSV file named val.csv. As you can imagine, the file named key looks something like this:
123
012
456
The file named val.csv has multiple columns and corresponding values. It looks like this:
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
9,0,0,452,K,p
1,2,2,000,L,x
I would like to get the subset of lines from val.csv whose value in the KEY column matches one of the values in the key file. Using the above example, I would like to get an output like this:
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
Obviously these are just toy examples. The real key file I am using has nearly 500,000 'keys' and the val.csv file has close to 5 million lines in it. Thanks.

$ awk -F, 'FNR==NR{k[$1]=1;next;} FNR==1 || k[$4]' key val.csv
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
How it works
FNR==NR { k[$1]=1;next; }
This saves the values of all keys read from the first file, key.
The condition is FNR==NR. FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, if FNR==NR, we are still reading the first file.
While reading the first file, key, this records the key (the first field) in the associative array k. The next statement then skips the rest of the commands and starts over on the next line.
FNR==1 || k[$4]
If we get here, we are working on the second file.
This condition is true either for the first line of the file (FNR==1, which keeps the header) or for lines whose fourth field is a key stored in array k. If the condition is true, awk performs the default action, which is to print the line.
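If the KEY column is not always the fourth field, a variant (just a sketch; it assumes the header of val.csv literally contains a column named KEY) can locate the column from the header first:
awk -F, '
  FNR==NR { k[$1]=1; next }                                              # first file: remember every key
  FNR==1  { for (i=1; i<=NF; i++) if ($i == "KEY") c = i; print; next }  # header: find the KEY column, print it
  k[$c]                                                                  # data rows: print if the KEY value was seen in key
' key val.csv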

Related

Let Google BigQuery infer schema from csv string file

I want to upload CSV data into BigQuery. When the data has different types (like string and int), BigQuery is able to infer the column names from the header, because the header values are all strings whereas the other lines contain integers.
BigQuery infers headers by comparing the first row of the file with other rows in the data set. If the first line contains only strings, and the other lines do not, BigQuery assumes that the first row is a header row.
https://cloud.google.com/bigquery/docs/schema-detect
The problem is when your data is all strings ...
You can specify --skip_leading_rows, but BigQuery still does not use the first row as the name of your variables.
I know I can specify the column names manually, but I would prefer not to do that, as I have a lot of tables. Is there another solution?
If your data is all in "string" type and the first row of your CSV file contains the metadata, then I guess it is easy to write a quick script that parses the first line of your CSV and generates a similar "create table" command:
bq mk --schema name:STRING,street:STRING,city:STRING... -t mydataset.myNewTable
Use that command to create a new (empty) table, and then load your CSV file into that new table (using --skip_leading_rows as you mentioned).
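A minimal sketch of such a script (assuming a plain comma-separated header with no quoted or embedded-comma column names; myData.csv and mydataset.myNewTable are placeholders):
# build a name:STRING,... schema from the header row, then create the table
schema=$(head -1 myData.csv | awk -F, '{for (i = 1; i <= NF; i++) printf "%s%s:STRING", (i > 1 ? "," : ""), $i}')
bq mk --schema "$schema" -t mydataset.myNewTable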
14/02/2018: Update thanks to Felipe's comment:
The above can be simplified this way:
bq mk --schema `head -1 myData.csv` -t mydataset.myNewTable
It's not possible with the current API. You can file a feature request in the public BigQuery tracker: https://issuetracker.google.com/issues/new?component=187149&template=0.
As a workaround, you can add a single non-string value at the end of the second line in your file, and then set the allowJaggedRows option in the load configuration. The downside is that you'll get an extra column in your table. If having an extra column is not acceptable, you can use a query instead of a load and SELECT * EXCEPT the added extra column, but queries are not free.
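For reference, a sketch of what that load could look like with the bq command-line tool (flag names to double-check against your bq version; mydataset.myNewTable and myData.csv are placeholders):
# let BigQuery detect the header (now that the second line contains a non-string value)
# and accept rows that are missing the extra trailing column
bq load --source_format=CSV --autodetect --allow_jagged_rows mydataset.myNewTable myData.csv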

AWK or GREP 1 instance of repeated output

So what I have here is some output from a Cisco switch, and I need to capture the hostname and use it to populate a CSV file.
Basically, I run show mac address-table, pull the MAC addresses, and populate them into a CSV file. That part I have working; however, I can't figure out how to grab the hostname so that I can put it in a separate column.
I have done this:
awk '/#/{print $1}'
but that will print every line that has '#' in it. I only need one to populate a variable so I can reuse it. The end result needs to look like this (the CSV file has MAC address, port number, hostname; I use commas to indicate the column separation):
0011.2233.4455,Gi1/1,Switch1#
0011.2233.4488,Gi1/2,Switch1#
0011.2233.4499,Gi1/3,Switch1#
Without knowing what the input file looks like, the exact solution required is uncertain. However, as an example, given an input file like the requested output (which I've called switch.txt):
0011.2233.4455,Gi1/1,Switch1#
0011.2233.4488,Gi1/2,Switch1#
0011.2233.4499,Gi1/3,Switch1#
0011.2233.4455,Gi1/1,Switch3#
0011.2233.4488,Gi1/2,Switch2#
0011.2233.4498,Gi1/3,Switch3#
... a list of the unique values of the first field (comma-separated) can be obtained from:
$ awk -F, '{print $1}' <switch.txt | sort | uniq
0011.2233.4455
0011.2233.4488
0011.2233.4498
0011.2233.4499
An approach like this might help with extracting unique values from the actual input file.
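For the original goal of grabbing the hostname just once, a small tweak to the attempted command (a sketch, assuming the hostname is the first field of the lines containing '#', as in the attempt above, and switch_output.txt is a placeholder for the captured switch output) is to exit after the first match:
# print the first field of the first line containing '#' and stop reading
hostname=$(awk '/#/{print $1; exit}' switch_output.txt)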

Delete lines if string does not appear exactly thrice

I have a CSV file that contains a few thousand lines. It looks something like this:
abc,123,hello,world
abc,124,goodbye,turtles
def,100,apples,pears
....
I want each unique entry in column one to be repeated exactly three times. For example: If exactly three lines have "abc" in the first column that is fine and nothing happens. But if there is not exactly three lines with "abc" in the first column, all lines with "abc" in column 1 must be deleted.
This
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
def,100,apples,pears
def,10,foo,bar
ghi,2,one,two
ghi,6,three,four
ghi,4,five,six
ghi,9,seven,eight
Should become:
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
Many Thanks,
Awk way
awk -F, 'FNR==NR{a[$1]++;next}a[$1]==3' test{,}
Set the field separator to ,
While reading the first file, increment the array counter with field 1 as the key
Skip the remaining instructions and move to the next line (next)
Read the file again (test{,} names the same file twice; see below)
On the second pass, print the line if the array counter for field 1 is exactly 3
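For reference, test{,} is shell brace expansion: it simply expands to the same file name twice, so awk reads the file in two passes:
$ echo test{,}
test test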
This awk one-liner should do it:
awk -F, 'NR==FNR{a[$1]++;next}a[$1]==3' file file
It doesn't require your file to be sorted.
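Run against the sample input above (saved as file), either one-liner keeps only the abc group, since abc is the only value in column 1 that appears exactly three times:
$ awk -F, 'NR==FNR{a[$1]++;next}a[$1]==3' file file
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog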

Howto process multivariate time series given as multiline, multirow *.csv files with Apache Pig?

I need to process multivariate time series given as multiline, multirow *.csv files with Apache Pig. I am trying to use a custom UDF (EvalFunc) to solve my problem. However, every loader I have tried for loading the data in my csv files and passing it to the UDF (except org.apache.pig.impl.io.ReadToEndLoader, which I cannot get to work) returns one line of the file as one record. What I need, however, is one column (or the content of the complete file) so that I can process a complete time series. Processing one value at a time is obviously useless because I need longer sequences of values...
The data in the csv-files looks like this (30 columns, 1st is a datetime, all others are double values, here 3 sample lines):
17.06.2013 00:00:00;427;-13.793273;2.885583;-0.074701;209.790688;233.118828;1.411723;329.099170;331.554919;0.077026;0.485670;0.691253;2.847106;297.912382;50.000000;0.000000;0.012599;1.161726;0.023110;0.952259;0.024673;2.304819;0.027350;0.671688;0.025068;0.091313;0.026113;0.271128;0.032320;0
17.06.2013 00:00:01;430;-13.879651;3.137179;-0.067678;209.796500;233.141233;1.411920;329.176863;330.910693;0.071084;0.365037;0.564816;2.837506;293.418550;50.000000;0.000000;0.014108;1.159334;0.020250;0.954318;0.022934;2.294808;0.028274;0.668540;0.020850;0.093157;0.027120;0.265855;0.033370;0
17.06.2013 00:00:02;451;-15.080651;3.397742;-0.078467;209.781511;233.117081;1.410744;328.868437;330.494671;0.076037;0.358719;0.544694;2.841955;288.345883;50.000000;0.000000;0.017203;1.158976;0.022345;0.959076;0.018688;2.298611;0.027253;0.665095;0.025332;0.099996;0.023892;0.271983;0.024882;0
Does anyone have an idea how I could process this as 29 time series?
Thanks in advance!
What do you want to achieve?
If you want to read all rows in all files as a single record, this can work:
a = LOAD '...' USING PigStorage(';') as <schema> ;
b = GROUP a ALL;
b will contain all the rows in a bag.
If you want to read each CSV file as a single record, this can work:
a = LOAD '...' USING PigStorage(';','tagsource') as <schema> ;
b = GROUP a BY $0; --$0 is the filename
b will contain all the rows per file in a bag.

DB load CSV into multiple tables

UPDATE: added an example to clarify the format of the data.
Considering a CSV with each line formatted like this:
tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5,[tbl2.col1:tbl2.col2]+
where [tbl2.col1:tbl2.col2]+ means that there could be any number of these pairs repeated
ex:
tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2
The tables would relate to each other using the line number as a key, which would have to be created in addition to any of the columns mentioned above.
Is there a way to use MySQL LOAD DATA INFILE to load the data into two separate tables?
If not, what Unix command-line tools would be best suited for this?
No, not directly. LOAD DATA can only insert into one table or a partitioned table.
What you can do is load the data into a staging table, then use INSERT INTO ... SELECT to move the individual columns into the two final tables. You may also need SUBSTRING_INDEX() if you're using different delimiters for tbl2's values. The line number is handled by an auto-increment column in the staging table (the easiest way is to make the auto-increment column the last one in the staging table definition).
The format is not exactly clear, and this is best done with Perl/PHP/Python, but if you really want to use shell tools:
cut -d , -f 1-5 file | awk -F, '{print NR "," $0}' > table1
cut -d , -f 6- file | sed 's,\:,\,,g' | \
awk -F, '{i=1; while (i<=NF) {print NR "," $(i) "," $(i+1); i+=2;}}' > table2
This creates the table1 and table2 files with these contents:
1,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
2,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
3,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
and
1,tbl2.col1,tbl2.col2
1,tbl2.col1,tbl2.col2
2,tbl2.col1,tbl2.col2
2,tbl2.col1,tbl2.col2
3,tbl2.col1,tbl2.col2
3,tbl2.col1,tbl2.col2
As you say, the problematic part is the unknown number of [tbl2.col1:tbl2.col2] pairs declared in each line. I would be tempted to solve this with sed: split the one file into two files, one for each table. Then you can use LOAD DATA INFILE to load each file into its corresponding table.
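A sketch of that sed split (assuming every line starts with the five tbl1 columns, and input.csv as a placeholder file name; the line-number key would still need to be added, for example with the awk approach shown above):
# keep only the first five comma-separated fields for tbl1
sed 's/^\(\([^,]*,\)\{4\}[^,]*\),.*/\1/' input.csv > table1.csv
# strip the first five fields, leaving the col1:col2 pairs for tbl2
sed 's/^\([^,]*,\)\{5\}//' input.csv > table2_pairs.csv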