Remove Rows From CSV Where A Specific Column Matches An Input File - csv

I have a CSV that contains multiple columns and rows [File1.csv].
I have another CSV file (just one column) that lists specific words [File2.csv].
I want to be able to remove rows from File1 if any column matches any of the words listed in File2.
I originally used this:
grep -v -F -f File2.csv File1.csv > File3.csv
This worked, to a certain extent. The issue I ran into was with columns that had more than one word in them (e.g. word1,word2,word3): File2 contained word2, but that row was not deleted.
I tried spreading the words apart so they looked like this: (word1 , word2 , word3), but the original command still did not work.
How can I remove a row that contains a word from File2 when the field may also contain other words?

One way using awk.
Content of script.awk:
BEGIN {
    ## Split the line on double quotes surrounded by optional spaces.
    FS = "[ ]*\"[ ]*"
}

## First file (the word list): save each word in a hash.
FNR == NR {
    words[ $2 ] = 1;
    next;
}

## Second file (the one with multiple columns).
FNR < NR {
    ## Print the line untouched if the eighth field has no interesting
    ## value or it is the first line of the file (the header).
    if ( $8 == "N/A" || FNR == 1 ) {
        print $0
        next
    }

    ## Split the field of interest on commas. Traverse the resulting array
    ## searching for a word saved from the first file. Print the line only
    ## if none is found.
    len = split( $8, array, /[ ]*,[ ]*/ )
    for ( i = 1; i <= len; i++ ) {
        if ( array[ i ] in words ) {
            found = 1
            break
        }
    }
    if ( ! found ) {
        print $0
    }
    found = 0
}
Assuming File1.csv and File2.csv have the content provided in the comments of Thor's answer (I suggest adding that information to the question), run the script like this:
awk -f script.awk File2.csv File1.csv
With the following output:
"DNSName","IP","OS","CVE","Name","Risk"
"ex.example.com","1.2.3.4","Linux","N/A","HTTP 1.1 Protocol Detected","Information"
"ex.example.com","1.2.3.4","Linux","CVE-2011-3048","LibPNG Memory Corruption Vulnerability (20120329) - RHEL5","High"
"ex.example.com","1.2.3.4","Linux","CVE-2012-2141","Net-SNMP Denial of Service (Zero-Day) - RHEL5","Medium"
"ex.example.com","1.2.3.4","Linux","N/A","Web Application index.php?s=-badrow Detected","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","Apache HTTPD Server Version Out Of Date","High"
"ex.example.com","1.2.3.4","Linux","CVE-1999-0662","PHP Unsupported Version Detected","High"
"ex.example.com","1.2.3.4","Linux","N/A","HBSS Common Management Agent - UNIX/Linux","High"

You could split lines in File2.csv that contain multiple patterns into one pattern per line.
Below, tr converts lines containing word1,word2 into separate lines before they are used as patterns. The <(...) construct acts as a temporary file/FIFO (tested in bash):
grep -v -F -f <(tr ',' '\n' < File2.csv) File1.csv > File3.csv
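To illustrate, with hypothetical File2.csv contents, the process substitution feeds grep one fixed-string pattern per line:
$ cat File2.csv
word1,word2,word3
word4
$ tr ',' '\n' < File2.csv
word1
word2
word3
word4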

Related

Convert single column to multiple, ensuring column count on last line

I would like to use AWK (Windows) to convert a text file with a single column to multiple columns, with the column count specified in the script or on the command line.
This question has been asked before but my final data file needs to have the same column count all the way.
Example of input:
L1
L2
L3
L4
L5
L6
L7
Split into 3 columns with ";" as the separator:
L1;L2;L3
L4;L5;L6
L7;;   <<< two empty fields are created here at the end, since only one value was left for this line.
I tried to modify variants of the typical solution given: NR%4 {printf $0",";next} 1; and a counter, but could not quite get it right.
I would prefer not to count the lines beforehand, which would mean running over the file multiple times.
You may use this awk solution:
awk -v n=3 '{
    sub(/\r$/, "")                      # removes DOS line break, if present
    printf "%s", $0(NR%n ? ";" : ORS)
}
END {
    # now we need to add empty columns to the last record
    if (NR % n) {
        for (i=1; i < (n - (NR % n)); ++i)
            printf ";"
        print ""
    }
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following xargs + awk combination to achieve the needed outcome:
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
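The ;;/; padding above is hard-coded for three columns; here is a sketch of the same idea parameterised on n (the n shell variable is my own addition):
n=3
xargs -n"$n" < Input_file |
awk -v n="$n" -v OFS=";" '{ for (i=NF+1; i<=n; i++) $i=""; $1=$1 } 1'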
For an awk-only solution I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
    if (NR%n) {    # pad only when the last row is incomplete
        for(i=NR%n; i<n-1; i++) printf(";")
        printf ORS
    }
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # assigning NF rebuilds $0 with OFS; print and reset
END { if (NR%n) { $0=row; NF=n; print } } # flush the partial last row, padded to n fields
' file
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # by using $\ and $/ with -l, RS and ORS are set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)
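To keep the column count configurable with the paste approach as well, the per-column dashes can be generated instead of typed out; a small bash sketch (relies on unquoted word splitting and assumes seq is available):
n=3
paste -d';' $(printf '%.0s- ' $(seq "$n")) < file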

How to combine several GAWK statements?

I have the following:
cat *.csv > COMBINED.csv
sort -k1 -n -t, COMBINED.csv > A.csv
gawk -F ',' '{sub(/[[:lower:]]+/,"",$1)}1' OFS=',' A.csv # REMOVE LOWER CASE CHARACTERS FROM 1st COLUMN
gawk -F ',' 'length($1) == 14 { print }' A.csv > B.csv # REMOVE ANY LINE FROM CSV WHERE VALUE IN FIRST COLUMN IS NOT 14 CHARACTERS
gawk -F ',' '{ gsub("/", "-", $2) ; print }' OFS=',' B.csv > C.csv # REPLACE FORWARD SLASH WITH HYPHEN IN SECOND COLUMN
gawk -F ',' '{print > ("processed/"$1".csv")}' C.csv # SPLIT CSV INTO FILES GROUPED BY VALUE IN FIRST COLUMN AND SAVE THE FILE WITH THAT VALUE
However, I think 4 separate lines is a bit overkill and was wondering whether I could optimise it or at least streamline it into a one-liner?
I've tried piping the data but keep getting stuck in a mix of errors.
Thanks
In awk you can chain multiple pattern-action pairs:
pattern1 { action1 }
pattern2 { action2 }
pattern3 { action3 }
So every time a record is read, awk processes it by applying pattern-action 1 first, then pattern-action 2, and so on.
In your case, it seems like you can do:
awk 'BEGIN{FS=OFS=","}
# remove lower case characters from first column
{sub(/[[:lower:]]+/,"",$1)}
# process only lines with 14 characters in first column
(length($1) != 14) { next }
# replace forward slash with hyphen
{ gsub("/", "-", $2) }
{ print > ("processed/" $1 ".csv") }' <(sort -k1 -n -t, combined.csv)
You could essentially also do the sorting in GNU awk, but to mimic the sort exactly we would need to know your input format.
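As an aside, the initial cat step isn't strictly needed either, since sort accepts multiple input files; also note that awk will not create the processed/ directory for you. A sketch of the shortened pipeline, assuming the awk program above is saved as process.awk (a file name I made up):
mkdir -p processed
sort -k1 -n -t, *.csv | awk -f process.awk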

Replacing a few sensitive characters in fields with XXX-masked fields in UNIX

I have a table which has been exported to a file in UNIX, with data in CSV format, for example:
File 1:
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123456,09-09-2019,Prisi,Kumar
Now I need to mask ACCT_NUM and FIRST_NAME and replace them with the masked values in File 1; the output should look something like this:
File 2:
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123XXX,09-09-2019,PRXXX,Kumar
I have separate masking functions for numerical and string fields, I need to know how to replace the masked columns in the original file.
I'm not sure what you want to do with FNR and what the point of assigning to array a should be. This is how I would do it:
$ cat x.awk
#!/bin/sh
awk -F, -vOFS=, '               # Set input and output field separators.
NR == 1 {                       # First record?
    print                       # Just output it.
    next                        # Then continue with the next line.
}
NR > 1 {                        # Second and subsequent records?
    if (length($1) < 4) {           # Short account number?
        $1 = "XXX"                  # Replace the whole number.
    } else {
        sub(/...$/, "XXX", $1)      # Change the last three characters.
    }
    if (length($3) < 4) {           # Short first name?
        $3 = "XXX"                  # Replace the whole name.
    } else {
        sub(/...$/, "XXX", $3)      # Change the last three characters.
    }
    print                       # Output the changed line.
}'
Showtime!
$ cat input
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123456,09-09-2019,Prisi,Kumar
123,29-12-2017,Jim,Kirk
$ ./x.awk < input
ACCT_NUM,EXPIRY_DT,FIRST_NAME,LAST_NAME
123XXX,09-09-2019,PrXXX,Kumar
XXX,29-12-2017,XXX,Kirk

package to query tab separated files in bash

I often have to conduct very simple queries on tab separated files in bash, for example summing/counting/max/min of all the values in the n-th column. I usually do this in awk on the command line, but I've grown tired of rewriting the same one-line scripts over and over and I'm wondering if there is a known package or solution for this.
For example, consider the text file (test.txt):
apples joe 4
oranges bill 3
apples sally 2
I can query this as:
awk '{ val += $3 } END { print "sum: "val }' test.txt
Also, I may want a where clause:
awk '{ if ($1 == "apples") { val += $3 } } END { print "sum: "val }' test.txt
Or a group by:
awk '{ val[$1] += $3 } END { for(k in val) { print k": "val[k] } }' test.txt
What I would rather do is:
query 'sum($3)' test.txt
query 'sum($3) where $1 = "apples"' test.txt
query 'sum($3) group by $1' test.txt
@Wintermute posted a link to a great tool for this in the comments below. Unfortunately it does have one drawback:
$ time gawk '{ a += $6 } END { print a }' my1GBfile.tsv
28371787287
real 0m2.276s
user 0m1.909s
sys 0m0.313s
$ time q -t 'select sum(c6) from my1GBfile.tsv'
28371787287
real 3m32.361s
user 3m27.078s
sys 0m1.983s
It also loads the entire file into memory; that will obviously be necessary in some cases, but it doesn't work for me as I often work with large files.
Wintermute's answer: Tools like q that can run SQL queries directly on CSVs.
Ed Morton's answer: see https://stackoverflow.com/a/15765479/1745001
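Short of a dedicated tool, a low-tech option is to wrap the recurring one-liners in tiny scripts so they don't have to be retyped; a minimal sketch for the sum case (the colsum name and its interface are made up for illustration):
#!/bin/sh
# colsum: sum the n-th column of a whitespace/tab separated file
# usage: colsum COLUMN FILE
col=${1:?usage: colsum COLUMN FILE}
file=${2:?usage: colsum COLUMN FILE}
awk -v c="$col" '{ val += $c } END { print "sum: " val }' "$file"
With the sample test.txt above, colsum 3 test.txt would print sum: 9.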

Pasting Text Vertically Into .CSV File

I have this awk script to write a text file into a specific cell of a .csv file, but I am trying to have the text displayed vertically, not horizontally.
nawk -v r=2 -v c=3 '
BEGIN { FS = OFS = "," }
FNR == NR {
    val = sprintf("%s%s%s", val, NR > 1 ? " " : "", $0)
    next
}
FNR == r {
    $c = val
}
1' file new_one.csv
Want
the
text
like
this
Can't you build val with a newline instead of a space, something like:
val = sprintf("%s%s%s", val, NR > 1 ? "\n" : "", $0)
Assuming a csv input file like:
a,b,c
d,e,f
g,h,i
k,l,m
and the data you want vertically like:
This file has words horizontally
here's one way to modify your script:
NR==FNR {gsub(" ","\n"); val="\""$0"\""; next}
This is going to replace all the single spaces in $0 with newlines. Then the whole line is assigned to val, wrapped in double quotes per Wikipedia's CSV page.
Running this with the data files I created, from the command line (using a slightly different syntax than yours for FS/OFS):
awk -F"," -v r=2 -v c=3 'NR==FNR{gsub(" ","\n"); val="\""$0"\""; next} FNR==r {$c=val} 1' OFS="," vert data
a,b,c
d,e,"This
file
has
words
horizontally"
g,h,i
k,l,m
where vert is the name of the vertical data and data is the name of the csv data file. Notice that f at [2,3] has been replaced with the altered input from the vert file.
Be aware that the row/column indexing you've chosen only works if none of the fields in the data file have internal commas in them and that awk isn't going to be your best friend for parsing csv files in general.