Delete rows of CSV file based on the value of a column

Here's an example of a few lines of my CSV file:
movieID,actorID,actorName,ranking
1,don_rickles,Don Rickles,3
1,jack_angel,Jack Angel,6
1,jim_varney,Jim Varney,4
1,tim_allen,Tim Allen,2
1,tom_hanks,Tom Hanks,1
1,wallace_shawn,Wallace Shawn,5
I would like to remove all rows that have a ranking of > 4; so far I've been trying to use this awk line:
awk -F ',' 'BEGIN {OFS=","} { if (($4) < 5) print }' file.csv > file_out.csv
It should print all the rows with a ranking (4th column) of less than 5 to a new file. I can't tell exactly what this line actually does, but it's not what I want. Can someone tell me where I've gone wrong with that line?

Instead of deleting the records, think of which ones you're going to print. I guess it's <=4. In awk, a pattern with no action prints the matching record, so idiomatically you can write this as:
$ awk -F, '$4<=4' file
1,don_rickles,Don Rickles,3
1,jim_varney,Jim Varney,4
1,tim_allen,Tim Allen,2
1,tom_hanks,Tom Hanks,1
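If you also want to keep the header line from the sample input, a small variation (not part of the original answer) is to OR in a test for the first record:
$ awk -F, 'NR==1 || $4<=4' file
movieID,actorID,actorName,ranking
1,don_rickles,Don Rickles,3
1,jim_varney,Jim Varney,4
1,tim_allen,Tim Allen,2
1,tom_hanks,Tom Hanks,1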

Related

Change column if some regex expression is true with awk or sed

I have a file (let's call it data.csv) similar to this:
"123","456","ud,h-match","moredata"
with many rows in the same format and embedded commas. What I need to do is look at the third column and see if it matches an expression. In this case I want to know if the third column contains "match" anywhere (which it does). If it does, then I want to replace the whole column with something else, like "replaced". So to relate it to the example data.csv file, I would want it to look like this:
"123","456","replaced","moredata"
Ideally, I want the file data.csv itself to be changed (time is of the essence since I have a big file) but it's also fine if you write it to another file.
Edit:
I have tried using awk:
awk -F'","' -OFS="," '{if(tolower($3) ~ "stringI'mSearchingFor"){$3="replacement"; print}else print}' file
but it doesn't change anything. If I remove the OFS portion then it works, but the output gets separated by spaces and the columns don't get enclosed in double quotes.
Depending on the answer to my question about what you mean by column, this may be what you want (uses GNU awk for FPAT):
$ awk -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '$3~/match/{$3="\"replaced\""} 1' file
"123","456","replaced","moredata"
Use awk -i inplace ... if you want to do "in place" editing.
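For example, the in-place form of the command above would look like this (assuming GNU awk 4.1 or later, which is when -i inplace appeared, and using the data.csv name from the question):
$ awk -i inplace -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '$3~/match/{$3="\"replaced\""} 1' data.csv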
With any awk (but slightly more fragile than the above since it leaves the leading/trailing " on the first and last fields, and has no -i inplace):
$ awk 'BEGIN{FS=OFS="\",\""} $3~/match/{$3="replaced"} 1' file
"123","456","replaced","moredata"

awk, header: ignore when reading, include when writing

I know how to ignore the column header when reading a file. I can do something like this:
awk 'FNR > 1 { #process here }' table > out_table
But if I do this, everything except the column header is written to the output file, and I want the output file to include the column header as well.
Of course I can do something like this after I execute the first statement:
awk 'BEGIN {print "Column Headers\t"} {print}' Out_table > out_table_with_header
But this becomes a two-step process. So is there a way to do this in a single step?
In short, is there a way for me to ignore the column header while reading the file, perform operations on the data, then include the column header when writing to the output file, all in a single step (or a block of steps that takes very little time)?
Not sure if I got you correctly, but you can simply:
awk 'NR==1{print}; NR>1 { # process }' file
which can be simplified to:
awk 'NR==1; NR>1 { # process }' file
That works for a single input file.
If you want to process more than one file, all having the same column headers at line 1 use this:
awk 'FNR==1 && !h {print; h=1}; FNR>1 { # process }' file1 file2 ...
I'm using the variable h to check whether the headers have been printed already or not.
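As a sketch of the multi-file form with an actual processing step (the second column and the uppercasing are hypothetical, just to stand in for "# process"):
$ awk -F, -v OFS=, 'FNR==1 && !h {print; h=1}; FNR>1 {$2=toupper($2); print}' file1 file2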

Print first, penultimate and last fields in CSV file

I have a big comma-separated file with 20000 rows and five columns. I want to extract particular columns, but some values contain embedded commas (everywhere except the header), so I can't simply cut on commas. How do I cut such columns?
example file:
name,v1,v2,v3,v4,v5
as,"10,12,15",21,"12,11,10,12",5,7
bs,"11,15,16",24,"19,15,18,23",9,3
This is my desired output:
name,v4,v5
as,5,7
bs,9,3
I tried the following cut command, but it doesn't work:
cut -d, -f1,5,6
In general, for these scenarios it is best to use a proper CSV parser; you can find those in Python, for example.
However, since your data seems to have fields with embedded commas only near the very beginning, you can simply print the first field and then the penultimate and last ones:
$ awk 'BEGIN{FS=OFS=","} {print $1, $(NF-1), $NF}' file
name,v4,v5
as,5,7
bs,9,3
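If embedded commas could also show up after the columns you want, a GNU-awk variant using FPAT (gawk only; same pattern as in an earlier answer, not part of this one) keeps NF accurate regardless of where the quoted fields sit:
$ awk -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '{print $1, $(NF-1), $NF}' file
name,v4,v5
as,5,7
bs,9,3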
In TXR Lisp:
$ txr extract.tl < data
name,v4,v5
as,5,7
bs,9,3
Code in extract.tl:
(mapdo
  (lambda (line)
    (let ((f (tok-str line #/"[^"]*"|[^,]+/)))
      (put-line `#[f 0],#[f 4],#[f 5]`)))
  (get-lines))
As a condensed one-liner:
$ txr -t '(mapcar* (do let ((f (tok-str #1 #/"[^"]*"|[^,]+/)))
`#[f 0],#[f 4],#[f 5]`) (get-lines))' < data

How to create a string with x numbers of the same character without looping in AWK?

I'm trying to write a CSV join program in AWK which joins the first.csv file with the second.csv file. The program now works perfectly fine, assuming the number of rows in both files is the same.
The problem arises when one of the files contains more rows than the other; in this case, I'd have to add some number of commas (which depends on the number of fields in the input files) to the file with fewer rows, so that the columns are not misplaced.
The question is: how can I create and assign strings containing different numbers of commas? For instance,
if NumberOfFields==5, then I want to create the string ",,,,," and add it to Array[i].
Here is another answer with sample code using your variable and array name.
BEGIN {
    NumberOfFields = 5;
    i = 1;
    Array[i] = gensub(/0/, ",", "g", sprintf("%0*d", NumberOfFields, 0));
    print Array[i];
}
Run it with awk -f x.awk where x.awk is the above code in a text file. Be aware that it always prints at least 1 comma, even if you specify zero.
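To see that edge case directly (gawk only, since gensub() is a GNU extension):
$ awk 'BEGIN {print gensub(/0/, ",", "g", sprintf("%0*d", 0, 0))}'
,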
$ awk -v num=3 'BEGIN {var=sprintf("%*s",num,""); gsub(/ /,",",var); print var}'
,,,
$
Use an array instead of var if/when you like. Note that, unlike the other solution posted, the above will work with any awk, not just gawk, and it will not print any commas if the number requested is zero:
$ awk -v num=0 'BEGIN {var=sprintf("%*s",num,""); gsub(/ /,",",var); print var}'
$
The equivalent with GNU awk and gensub() would be:
$ awk -v num=3 'BEGIN {var=gensub(/ /,",","g",sprintf("%*s",num,"")); print var}'
,,,
$
$ awk -v num=0 'BEGIN {var=gensub(/ /,",","g",sprintf("%*s",num,"")); print var}'
$
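To store the result in an array element as the question asks, here is a sketch using the question's own names, built on the portable sprintf/gsub idiom above:
$ awk -v NumberOfFields=5 'BEGIN {
    i = 1
    pad = sprintf("%*s", NumberOfFields, "")  # NumberOfFields spaces
    gsub(/ /, ",", pad)                       # turn each space into a comma
    Array[i] = pad
    print Array[i]
}'
,,,,,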

How to change csv file delimiter

Here's a CSV file, items.txt:
item-number,item-description,item-category,cost,quantity-available
I tried to change the field separator from , to \n using awk, and I need an easy way to do it.
1) This command does not work:
awk -F, 'BEGIN{OFS="\n"} {print}' items.txt
2) This command works, but the real CSV I need to process has 15+ columns, and I don't want to list every column variable:
awk -F, 'BEGIN{OFS="\n"} {print $1,$2}' items.txt
Thanks in advance.
If your fields do not contain commas, you may use tr:
tr ',' '\n' < infile > outfile
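If some fields are quoted and do contain commas, a GNU-awk sketch using FPAT (gawk only; this pattern appears in an earlier answer, and is not part of the tr approach) prints each field on its own line while keeping quoted fields intact:
$ awk -v FPAT='[^,]+|"[^"]+"' '{for (i=1; i<=NF; i++) print $i}' items.txt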
You were close. After setting OFS to a newline, you need to re-construct the record to get the fields separated by newlines.
$1=$1 re-evaluates the record and joins the fields with OFS, which by default is a space. Since we set OFS to RS, which by default is a newline, you get the desired output.
The 1 at the end of the script is the idiomatic way of saying print the line.
$ awk 'BEGIN{FS=",";OFS=RS}{$1=$1}1' file
item-number
item-description
item-category
cost
quantity-available
You need to get awk to rebuild the record by changing a field.
awk -F, 'BEGIN{OFS="\n"} {$1 = $1; print}' items.txt
Or, if you're sure the first column is always non-empty and nonzero, you could use the somewhat simpler:
awk -F, 'BEGIN{OFS="\n"} $1 = $1' items.txt
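To see why that caveat matters: when the first column is "0" or empty, the value of the assignment used as a pattern is false, and the line is silently dropped:
$ printf '0,a\n1,b\n' | awk -F, 'BEGIN{OFS="\n"} $1 = $1'
1
b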