Delete lines if string does not appear exactly thrice - csv

I have a CSV file that contains a few thousand lines. It looks something like this:
abc,123,hello,world
abc,124,goodbye,turtles
def,100,apples,pears
....
I want each unique entry in column one to be repeated exactly three times. For example: if exactly three lines have "abc" in the first column, that is fine and nothing happens. But if there are not exactly three lines with "abc" in the first column, all lines with "abc" in column 1 must be deleted.
This
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
def,100,apples,pears
def,10,foo,bar
ghi,2,one,two
ghi,6,three,four
ghi,4,five,six
ghi,9,seven,eight
Should become:
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
Many Thanks,

Awk way
awk -F, 'FNR==NR{a[$1]++;next}a[$1]==3' test{,}
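-F, sets the field separator to a comma.
test{,} is bash brace expansion for "test test", so awk reads the file twice.
While reading the first copy (FNR==NR), increment the array element keyed by field 1.
next skips the remaining rules and moves on to the next line.
On the second read, print the line if the counter for its field 1 is 3.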

this awk one-liner should do:
awk -F, 'NR==FNR{a[$1]++;next}a[$1]==3' file file
it doesn't require your file to be sorted.
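As a quick check of the two-pass approach, here is a minimal sketch that runs it against the sample data from the question. The scratch file name test.csv and the variable names are assumptions made for illustration.

```shell
# Sample data from the question, written to a scratch file (hypothetical name).
tmpdir=$(mktemp -d)
cat > "$tmpdir/test.csv" <<'EOF'
abc,123,hello,world
abc,124,goodbye,turtles
abc,167,cat,dog
def,100,apples,pears
def,10,foo,bar
ghi,2,one,two
ghi,6,three,four
ghi,4,five,six
ghi,9,seven,eight
EOF

# Pass 1 (NR==FNR): count how often each column-one value occurs.
# Pass 2: print only the lines whose column-one value occurred exactly 3 times.
kept=$(awk -F, 'NR==FNR { count[$1]++; next } count[$1] == 3' \
    "$tmpdir/test.csv" "$tmpdir/test.csv")
printf '%s\n' "$kept"
rm -rf "$tmpdir"
```

Only the three abc lines are printed; def (two lines) and ghi (four lines) are dropped.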

Related

Delete rows of a CSV file based off a column value on command line

I have a large file that I cannot open on my computer. I am trying to delete rows of information that are unneeded.
My file looks like this:
NODE,107983_gene,382,666,-,cd10161,8,49,9.0E-100,49.4,0.52,domain
NODE,107985_gene,24,659,-,PF09699.9,108,148,6.3E-500,22.5,0.8571428571428571,domain
NODE,33693_gene,213,1433,-,PF01966.21,92,230,9.0E-10,38.7,0.9344262295081968,domain
NODE,33693_gene,213,1433,-,PRK04926,39,133,1.0E-8,54.5,0.19,domain
NODE,33693_gene,213,1433,-,cd00077,88,238,4.0E-6,44.3,0.86,domain
NODE,33693_gene,213,1433,-,smart00471,88,139,9.0E-7,41.9,0.42,domain
NODE,33694_gene,1430,1912,-,cd16326,67,135,4.0E-50,39.5,0.38,domain
I am trying to remove all lines that have an E-value greater than 1.0E-10. This information is located in column 9. I have tried on the command line:
awk '$9 >=1E-10' file name > outputfile
This has given me a smaller file, but the E-values are all over the place and nothing above 1E-10 has actually been removed. I want small E-values only.
Does anyone have any suggestions?
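Almost there. You need to specify the field delimiter with -F, and your comparison is reversed: $9 >= 1E-10 keeps the large E-values, so test for $9 < 1E-10 instead: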
$ awk -F, '$9<1E-10' file > small.values
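To see the effect of the delimiter and the comparison together, here is a small sketch on four of the rows from the question. The scratch file name evalues.csv and the variable names are assumptions.

```shell
# Four rows from the question, written to a scratch file (hypothetical name).
tmpdir=$(mktemp -d)
cat > "$tmpdir/evalues.csv" <<'EOF'
NODE,107983_gene,382,666,-,cd10161,8,49,9.0E-100,49.4,0.52,domain
NODE,33693_gene,213,1433,-,PF01966.21,92,230,9.0E-10,38.7,0.9344262295081968,domain
NODE,33693_gene,213,1433,-,PRK04926,39,133,1.0E-8,54.5,0.19,domain
NODE,33694_gene,1430,1912,-,cd16326,67,135,4.0E-50,39.5,0.38,domain
EOF

# -F, makes column 9 the E-value; fields that look like numbers
# are compared numerically, so 9.0E-10 and 1.0E-8 are rejected.
small=$(awk -F, '$9 < 1E-10' "$tmpdir/evalues.csv")
printf '%s\n' "$small"
rm -rf "$tmpdir"
```

Only the rows with E-values 9.0E-100 and 4.0E-50 remain. If a value of exactly 1.0E-10 should also be kept, `<=` would be the comparison to use.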

Get lines from file that match strings in another file using AWK

I have a file named key and another CSV file named val.csv. As you can imagine, the file named key looks something like this:
123
012
456
The file named val.csv has multiple columns and corresponding values. It looks like this:
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
9,0,0,452,K,p
1,2,2,000,L,x
I would like to get the subset of lines from val.csv whose value in the KEY column matches one of the values in the key file. Using the above example, I would like to get an output like this:
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
Obviously these are just toy examples. The real key file I am using has nearly 500,000 keys and the val.csv file has close to 5 million lines in it. Thanks.
$ awk -F, 'FNR==NR{k[$1]=1;next;} FNR==1 || k[$4]' key val.csv
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
How it works
FNR==NR { k[$1]=1;next; }
This saves the values of all keys read from the first file, key.
The condition is FNR==NR. FNR is the number of lines read so far from the current file and NR is the total number of lines read. Thus, if FNR==NR, we are still reading the first file.
When reading the first file, key, this saves the value of key in associative array k. This then skips the rest of the commands and starts over on the next line.
FNR==1 || k[$4]
If we get here, we are working on the second file.
This condition is true either for the first line of the file, FNR==1, or for lines whose fourth field is in array k. If the condition is true, awk performs the default action which is to print the line.
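One possible variation, sketched here on the toy data: testing membership with the in operator instead of reading k[$4]. In awk, merely referencing k[$4] creates an empty element for every value looked up, which may matter with millions of non-matching lines; ($4 in k) does not. Whether this makes a noticeable difference on the real files is an assumption that would need measuring.

```shell
# Toy inputs from the question, written to scratch files.
tmpdir=$(mktemp -d)
printf '%s\n' 123 012 456 > "$tmpdir/key"
cat > "$tmpdir/val.csv" <<'EOF'
V1,V2,V3,KEY,V5,V6
1,2,3,012,X,t
9,0,0,452,K,p
1,2,2,000,L,x
EOF

# First file: mentioning k[$1] is enough to create the element.
# Second file: print the header, plus lines whose 4th field is a stored key.
matched=$(awk -F, 'FNR==NR { k[$1]; next } FNR==1 || ($4 in k)' \
    "$tmpdir/key" "$tmpdir/val.csv")
printf '%s\n' "$matched"
rm -rf "$tmpdir"
```

The output is the same header line and the 012 row as in the answer above.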

AWK or GREP 1 instance of repeated output

What I have here is some output from a Cisco switch, and I need to capture the host name and use it to populate a CSV file.
Basically, I run show mac address-table, pull the MAC addresses, and populate them into a CSV file. That part works; however, I can't figure out how to grab the host name so that I can put it in a separate column.
I have done this:
awk '/#/{print $1}'
but that prints every line that has '#' in it. I only need one to populate a variable so I can reuse it. The end result needs to look like this (the CSV file has MAC address, port number, hostname; I use commas to indicate the column separation):
0011.2233.4455,Gi1/1,Switch1#
0011.2233.4488,Gi1/2,Switch1#
0011.2233.4499,Gi1/3,Switch1#
Without knowing what the input file looks like, the exact solution that is required will be uncertain. However, as an example, given an input file like the requested output (which I've called switch.txt):
0011.2233.4455,Gi1/1,Switch1#
0011.2233.4488,Gi1/2,Switch1#
0011.2233.4499,Gi1/3,Switch1#
0011.2233.4455,Gi1/1,Switch3#
0011.2233.4488,Gi1/2,Switch2#
0011.2233.4498,Gi1/3,Switch3#
... a list of the unique values of the first field (comma-separated) can be obtained from:
$ awk -F, '{print $1}' <switch.txt | sort | uniq
0011.2233.4455
0011.2233.4488
0011.2233.4498
0011.2233.4499
An approach like this might help with extracting unique values from the actual input file.
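For the "only one instance" part of the question, awk can stop at the first match with exit. The sketch below assumes a captured session in which the prompt line looks like "Switch1#show mac address-table" and the table rows have the MAC address in column 2 and the port in column 4. That layout and the file name session.txt are guesses, since the real input was not shown.

```shell
# Hypothetical capture of a switch session; the real layout may differ.
tmpdir=$(mktemp -d)
cat > "$tmpdir/session.txt" <<'EOF'
Switch1#show mac address-table
Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
   1    0011.2233.4455    DYNAMIC     Gi1/1
   1    0011.2233.4488    DYNAMIC     Gi1/2
Switch1#
EOF

# Take the prompt from the first line containing '#', drop whatever
# follows the '#', then exit so later prompt lines are never printed.
host=$(awk '/#/ { sub(/#.*/, "#", $1); print $1; exit }' "$tmpdir/session.txt")

# Reuse the captured host name as the third CSV column for every MAC row.
csv=$(awk -v host="$host" \
    '$2 ~ /^[0-9a-f]+\.[0-9a-f]+\.[0-9a-f]+$/ { print $2 "," $4 "," host }' \
    "$tmpdir/session.txt")
printf '%s\n' "$csv"
rm -rf "$tmpdir"
```

With this sample, host is set to Switch1# once and each MAC row is emitted as MAC,port,Switch1#.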

Extract values from a specific column of an html table using bash

I have an HTML table in which the first row is the title and the next rows represent the body of the table. I want to extract the values from the third column of each row. How can I proceed?
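Try the awk command below:
awk 'NR>1{print $3}' file
This prints the third whitespace-separated field of every line except the first (the header), so it only works if each row sits on one line with its cells separated by whitespace.
Update, for rows marked up with <tr> and <td> tags (a multi-character RS is a gawk/mawk extension; POSIX leaves it unspecified):
awk -v RS='</tr>' -v FS='<td[^>]*>' 'NR>1 && NF>3{gsub(/<[^<>]*>/,"",$4); gsub(/^[ \t\r\n]+|[ \t\r\n]+$/,"",$4); print $4}' file
Because FS matches the opening <td> tag, $1 holds whatever precedes the first cell, so the third column is $4. The first gsub strips the remaining tags and the second trims surrounding whitespace. Note that gsub returns the number of substitutions, so its result must not be assigned back to the field.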

DB load CSV into multiple tables

UPDATE: added an example to clarify the format of the data.
Considering a CSV with each line formatted like this:
tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5,[tbl2.col1:tbl2.col2]+
where [tbl2.col1:tbl2.col2]+ means that there could be any number of these pairs repeated
ex:
tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2,tbl2.col1:tbl2.col2
The tables would relate to each other using the line number as a key, which would have to be created in addition to any columns mentioned above.
Is there a way to use MySQL load data infile to load the data into two separate tables?
If not, what Unix command line tools would be best suited for this?
No, not directly. load data can only insert into one table or one partitioned table.
What you can do is load the data into a staging table, then use insert into ... select to copy the individual columns into the two final tables. You may also need substring_index if you're using different delimiters for tbl2's values. The line number is handled by an auto-incrementing column in the staging table (the easiest way is to make the auto column last in the staging table definition).
The format is not exactly clear, and this is best done with Perl/PHP/Python, but if you really want to use shell tools:
cut -d , -f 1-5 file | awk -F, '{print NR "," $0}' > table1
cut -d , -f 6- file | tr ':' ',' | \
awk -F, '{i=1; while (i<=NF) {print NR "," $(i) "," $(i+1); i+=2;}}' > table2
This creates the files table1 and table2 with these contents:
1,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
2,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
3,tbl1.col1,tbl1.col2,tbl1.col3,tbl1.col4,tbl1.col5
and
1,tbl2.col1,tbl2.col2
1,tbl2.col1,tbl2.col2
2,tbl2.col1,tbl2.col2
2,tbl2.col1,tbl2.col2
3,tbl2.col1,tbl2.col2
3,tbl2.col1,tbl2.col2
As you say, the problematic part is the unknown number of [tbl2.col1:tbl2.col2] pairs declared in each line. I would be tempted to solve this through sed: split the one file into two files, one for each table. Then you can use load data infile to load each file into its corresponding table.
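The split into two files could also be done in a single awk pass rather than with sed. This is only a sketch under the format described in the question: five tbl1 columns followed by any number of col1:col2 pairs. The sample values and the file names input.csv, table1 and table2 are made up for illustration.

```shell
# Made-up input: 5 tbl1 columns, then a variable number of col1:col2 pairs.
tmpdir=$(mktemp -d)
cat > "$tmpdir/input.csv" <<'EOF'
a1,a2,a3,a4,a5,x1:y1,x2:y2
b1,b2,b3,b4,b5,x3:y3
EOF

awk -F, -v t1="$tmpdir/table1" -v t2="$tmpdir/table2" '{
    # The first five fields go to table1, prefixed with the line number as key.
    print NR "," $1 "," $2 "," $3 "," $4 "," $5 > t1
    # Each remaining field is one pair; write one table2 row per pair,
    # carrying the same line number so the two tables can be joined.
    for (i = 6; i <= NF; i++) {
        split($i, pair, ":")
        print NR "," pair[1] "," pair[2] > t2
    }
}' "$tmpdir/input.csv"

table1=$(cat "$tmpdir/table1")
table2=$(cat "$tmpdir/table2")
printf '%s\n' "$table1" "$table2"
rm -rf "$tmpdir"
```

Each output file can then be loaded into its own table with load data infile, using the leading line number as the shared key.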