Change column if some regex expression is true with awk or sed - csv

I have a file (let's call it data.csv) similar to this:
"123","456","ud,h-match","moredata"
with many rows in the same format and embedded commas. What I need to do is look at the third column and check whether it matches an expression. In this case I want to know if the third column contains "match" anywhere (which it does). If it does, I want to replace the whole column with something else, like "replaced". So for the example data.csv above, I would want it to look like this:
"123","456","replaced","moredata"
Ideally, I want the file data.csv itself to be changed (time is of the essence since I have a big file) but it's also fine if you write it to another file.
Edit:
I have tried using awk:
awk -F'","' -OFS="," '{if(tolower($3) ~ "stringI'mSearchingFor"){$3="replacement"; print}else print}' file
but it doesn't change anything. If I remove the OFS portion then it works, but the output is separated by spaces and the columns are no longer enclosed in double quotes.

Depending on the answer to my question about what you mean by column, this may be what you want (uses GNU awk for FPAT):
$ awk -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '$3~/match/{$3="\"replaced\""} 1' file
"123","456","replaced","moredata"
Use awk -i inplace ... if you want to do "in place" editing.
With any awk (but slightly more fragile than the above since it leaves the leading/trailing " on the first and last fields, and has no -i inplace):
$ awk 'BEGIN{FS=OFS="\",\""} $3~/match/{$3="replaced"} 1' file
"123","456","replaced","moredata"

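For the case-insensitive match attempted in the question, a minimal sketch of the any-awk approach might look like the following. OFS is normally set with -v OFS=... or in a BEGIN block, and here it needs to be the full "," sequence so that the inner quotes are restored when the record is rebuilt. The function name replace_third and its two arguments are made up for illustration, and the sketch assumes every field is double-quoted and that the third field is neither the first nor the last one.

```shell
# Replace field 3 when it contains the pattern, ignoring case.
# Hypothetical helper: $1 = lowercase pattern, $2 = replacement text.
replace_third() {
    awk -v pat="$1" -v rep="$2" '
        BEGIN { FS = OFS = "\",\"" }      # split and rejoin on the 3-char string ","
        tolower($3) ~ pat { $3 = rep }    # assigning to $3 rebuilds $0 using OFS
        1                                 # print every line, changed or not
    '
}

printf '%s\n' '"123","456","ud,h-MATCH","moredata"' | replace_third match replaced
```

Without GNU awk's -i inplace, redirect the output to a new file and move that file over the original once the command has succeeded.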
Related

Calling Imagemagick from awk?

I have a CSV of image details I want to loop over in a bash script. awk seems like an obvious choice to loop over the data.
For each row, I want to take the values, and use them to do Imagemagick stuff. The following isn't working (obviously):
awk -F, '{ magick "source.png" "$1.jpg" }' images.csv
GNU AWK excels at processing structured text data. It can run external commands through its system() function, but it is less convenient for that than some other languages; Python, for example, has the standard-library subprocess module, which is more feature-rich.
If you wish to use awk for this task anyway, I suggest generating output to be fed into bash. Say you have file.txt with the following content:
file1.jpg,file1.bmp
file2.png,file2.bmp
file3.webp,file3.bmp
and the files listed in the 1st column are in the current working directory, you wish to convert them to the files shown in the 2nd column, and you have access to the convert command. Then you might do:
awk 'BEGIN{FS=","}{print "convert \"" $1 "\" \"" $2 "\""}' file.txt | bash
which is equivalent to starting bash and running
convert "file1.jpg" "file1.bmp"
convert "file2.png" "file2.bmp"
convert "file3.webp" "file3.bmp"
Observe that I have used literal " to enclose the filenames, so it should work with names containing spaces. Disclaimer: it might fail if a name contains a special character, e.g. ".
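One way to sidestep that quoting caveat is to let the shell read the rows itself instead of piping generated command text into bash. The sketch below is only an illustration: the function name convert_rows is hypothetical, and it assumes exactly two comma-separated fields per line with no quoted commas. The command to run is passed as arguments, so it can be tried with a harmless command first.

```shell
# Run a command once per CSV row, passing the two fields as separate arguments.
# "$@" is the command to run, e.g. convert or magick.
convert_rows() {
    while IFS=, read -r src dst; do
        # </dev/null stops the command from swallowing the remaining rows
        "$@" "$src" "$dst" </dev/null
    done
}

# Dry run: print the arguments instead of converting anything.
printf '%s\n' 'file1.jpg,file1.bmp' 'my photo.png,my photo.bmp' |
    convert_rows printf '[%s] -> [%s]\n'
```

Replacing the printf dry run with `convert_rows convert < file.txt` would then invoke convert once per row, with each filename passed as a single argument regardless of spaces or quote characters.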

prefix every header column with string using awk

I have a bunch of big CSV files, and I want to prefix every header column with a fixed string. There are more than 500 columns in every file.
Suppose my header is:
number;date;customer;key;amount
I tried this awk line:
awk -F';' 'NR==1{gsub(/[^a-z_]/,"input_file.")} { print }'
but I get this (note that the first column is missing the prefix and the separators are removed):
numberinput_file.dateinput_file.customerinput_file.keyinput_file.amount
expected output:
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
In any awk that'd be:
$ awk 'NR==1{gsub(/^|;/,"&input_file.")} 1' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
but sed exists to do simple substitutions like that, e.g. using a sed that has -E to enable EREs (e.g. GNU and BSD sed):
$ sed -E '1s/^|;/&input_file./g' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
If you're using GNU tools then you could use either of the above to change all of your CSV files at once with either of these:
awk -i inplace 'NR==1{gsub(/^|;/,"&input_file.")} 1' *.csv
sed -i -E '1s/^|;/&input_file./g' *.csv
Your gsub replaces every character on the header line that is not a lowercase letter or an underscore with the prefix, and that includes your column separators.
The print can be abbreviated to the common idiom 1 at the very end of your script; this simply means "this condition is true; perform the default action for every line (i.e. print it all)" though this is just a stylistic change.
awk -F';' 'NR==1{
sub(/^/, "input_file."); gsub(/;/, ";input_file."); }
1' filename
If you want to perform this on multiple files, probably put a shell loop around it. If you only want to concatenate everything to standard output, you can give all the files to Awk in one go (in which case you probably don't want to print the header line for any file after the first; maybe change the 1 to NR==1 || FNR != 1).
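The concatenation case can be sketched as follows. This is a small illustration rather than a drop-in script: the function name merge_with_prefix and the two temporary files are made up for the demonstration.

```shell
# Prefix the header of the first file and drop the headers of later files.
merge_with_prefix() {
    awk -F';' '
        NR == 1 { sub(/^/, "input_file."); gsub(/;/, ";input_file.") }
        NR == 1 || FNR != 1      # print the first header, skip later headers
    ' "$@"
}

# Demonstration with two small throwaway files.
tmp=$(mktemp -d)
printf '%s\n' 'number;date' '1;2' > "$tmp/a.csv"
printf '%s\n' 'number;date' '3;4' > "$tmp/b.csv"
merge_with_prefix "$tmp/a.csv" "$tmp/b.csv"
rm -r "$tmp"
```

NR counts lines across all files while FNR restarts at 1 for each file, which is what lets the condition tell the very first header apart from the later ones.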
I would use GNU AWK in the following way. Let the content of file.txt be
number;date;customer;key;amount
1;2;3;4;5
6;7;8;9;10
then
awk 'BEGIN{FS=";";OFS=";input_file."}NR==1{$1="input_file." $1}{print}' file.txt
output
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
1;2;3;4;5
6;7;8;9;10
Explanation: I set OFS to ; followed by the prefix. Then on the first line I add the prefix to the first column, which triggers rebuilding of the record with the new OFS. No field is modified on any other line, so those lines are printed as is.
(tested in GNU Awk 5.0.1)
Also with awk using a for loop and printf (the header is rebuilt field by field; all other lines are printed unchanged):
awk 'BEGIN{FS=OFS=";"} NR==1{for (i=1; i<=NF; i++) printf "%s%s", "input_file." $i, (i<NF ? OFS : ORS); next} 1' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount

Find a string between 2 other strings in document

I have found a ton of solutions to do what I want, with only one exception.
I need to search a .html document and pull a string.
The line containing the string will look like this (1 line, no newlines)
<script type="text/javascript">g_initHeader(0);LiveSearch.attach(ge('oh2345v5ks'));var _ = g_items;_[60]={icon:'INV_Chest_Leather_09',name_enus:'Layered Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered Pants'};_[3070]={icon:'INV_Misc_Cape_01',name_enus:'Ensign Cloak'};</script>
The text I need to get is
INV_Chest_Leather_09
When I use awk, grep, and sed, I extract the data between icon:' and ',name_
The problem is, all three of these scripts scan the entire line and use the last occurring ',name_ thus I end up with
INV_Chest_Leather_09',name_enus:'Layered
Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered
Pants'};_[3070]={icon:'INV_Misc_Cape_01
Here's the last one I tried
grep -Po -m 1 "(?<=]={icon:').*(?=',name_)"
I've tried awk and sed too, and I don't really have a preference of which one to use.
So basically, I need to search the entire html file, find the first occurrence of icon:', extract the text right after it until the first occurrence after icon:' of ',name_.
With GNU awk for the 3rd arg to match():
$ awk 'match($0,/icon:\047([^\047]+)/,a){print a[1]}' file
INV_Chest_Leather_09
Simple perl approach:
perl -ne 'print "$1\n" if /\bicon:\047([^\047]+)/' file
The output:
INV_Chest_Leather_09
The .* in your regular expression is greedy, so it first matches to the end of the line and then backtracks only as far as the last ',name_. You could try replacing the .* with something like [^,]* (i.e. match anything except a comma):
grep -Po -m 1 "(?<=]={icon:')[^,]*(?=',name_)"
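Where neither GNU awk nor grep -P is available, the same "stop at the next quote" idea can be sketched with the RSTART and RLENGTH variables that match() sets in any POSIX awk. The function name first_icon is hypothetical, and the sketch assumes the icon name itself never contains a single quote.

```shell
# Print the text that follows the first icon:<quote> in the input.
first_icon() {
    awk '
        match($0, /icon:\047[^\047]+/) {
            # skip the 6 characters of the prefix icon:<quote>
            print substr($0, RSTART + 6, RLENGTH - 6)
            exit      # stop after the first matching line
        }
    '
}

printf '%s\n' "_[60]={icon:'INV_Chest_Leather_09',name_enus:'Layered Tunic'};_[6076]={icon:'INV_Pants_11',name_enus:'Tapered Pants'};" |
    first_icon
```

Because the bracket expression excludes the quote character, the match cannot run past the end of the first icon name, so no backtracking over later occurrences is involved.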

Print first, penultimate and last fields in CSV file

I have a big comma-separated file with 20000 rows and five columns. I want to extract particular columns, but some values are quoted lists that contain commas themselves, so every row except the header has more commas than columns. How can I cut such columns?
example file:
name,v1,v2,v3,v4,v5
as,"10,12,15",21,"12,11,10,12",5,7
bs,"11,15,16",24,"19,15,18,23",9,3
This is my desired output:
name,v4,v5
as,5,7
bs,9,3
I tried the following cut command, but it doesn't work:
cut -d, -f1,5,6
In general, for these scenarios it is best to use a proper CSV parser. You can find those in Python, for example.
However, since your data seems to have fields with embedded commas only among the leading columns, you can print the first field and then the penultimate and last ones:
$ awk 'BEGIN{FS=OFS=","} {print $1, $(NF-1), $NF}' file
name,v4,v5
as,5,7
bs,9,3
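If quoted fields could also appear in the trailing columns, counting from the end would no longer be safe. A rough sketch of a quote-aware split in plain awk is shown below; the function name pick_1_5_6 is made up, the field positions 1, 5 and 6 are hard-coded to match the cut attempt, and it assumes there are no escaped quotes or newlines inside quoted fields.

```shell
# Split each line into fields, treating "..." as one field even if it
# contains commas, then print fields 1, 5 and 6.
pick_1_5_6() {
    awk '
    {
        split("", field)          # clear fields left over from the previous line
        n = 0; rest = $0
        while (rest != "") {
            # try a quoted field first, then fall back to an unquoted one
            if (!match(rest, /^"[^"]*"/)) match(rest, /^[^,]*/)
            field[++n] = substr(rest, 1, RLENGTH)
            rest = substr(rest, RLENGTH + 2)   # skip the field and its comma
        }
        print field[1] "," field[5] "," field[6]
    }'
}

printf '%s\n' 'name,v1,v2,v3,v4,v5' 'as,"10,12,15",21,"12,11,10,12",5,7' | pick_1_5_6
```

This is still no substitute for a real CSV parser, but it keeps the column positions stable when the quoted lists vary in length.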
In TXR Lisp:
$ txr extract.tl < data
name,v4,v5
as,5,7
bs,9,3
Code in extract.tl:
(mapdo
(lambda (line)
(let ((f (tok-str line #/"[^"]*"|[^,]+/)))
(put-line `#[f 0],#[f 4],#[f 5]`)))
(get-lines))
As a condensed one liner:
$ txr -t '(mapcar* (do let ((f (tok-str #1 #/"[^"]*"|[^,]+/)))
`#[f 0],#[f 4],#[f 5]`) (get-lines))' < data

Add double quotes in .CSV comma delimited file using awk

Hi, I need to process a big CSV file (20M rows), adding double quotes around every comma-delimited field. The CSV file has 8 comma-delimited fields, as below:
'2016-03-12','12393659','134',,'35533605',189348,9798,gmail.com;live_com.com
'2016-03-12','12390103','138',,'35438006',5133,1897,google.com
'2016-03-12','45616164','139',,'01318800',10945593,596633,facebook.com;tumblr.com;t.co
'2016-03-12','45673436','38',,'86441702',4350985,150327,serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net
As you can see, the first 3 fields are enclosed in single quotes, the 4th is blank, the 5th is enclosed in single quotes, and the 6th to 8th are only comma-delimited.
I would like to get the following result (the 4th field needs to be double-quoted even though it is empty):
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985","150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"
I get part of the way there with a mix of sed and awk:
sed -e s/\'//g input.csv > output.csv    # eliminate quotes
awk '{gsub(/[^,]+/,"\"&\"")}1' output.csv > output1.csv    # add double quotes
but the 4th field is not double-quoted, and I need to reduce processing time as much as possible.
Any help doing it all in awk, with better performance and with the 4th field double-quoted too, would be appreciated.
Many thanks for the help. M.Tave
If your data is really that simple with no embedded quotes or newlines or anything then all you need is:
$ awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985","150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"
Give this awk one-liner a try:
awk -F, -v OFS="," -v re="^'?|'?$" -v q='"' \
'{for(i=1;i<=NF;i++)if($i)gsub(re,q,$i);else $i=q$i q}7' file
The idea is to use gsub() to add double quotes to the non-empty fields; for empty fields, just add " at the head and tail. The replacement regex is defined as an awk variable outside the script to avoid escaping.
It works with your input data here:
kent$ awk -F, -v OFS="," -v re="^'?|'?$" -v q='"' '{for(i=1;i<=NF;i++)if($i)gsub(re,q,$i);else $i=q$i q}7' f
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985","150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"