Add double quotes to every comma-delimited field in a CSV file using awk

Hi, I need to process a big CSV file (20M rows), adding double quotes around every comma-delimited field. The CSV file has 8 comma-delimited fields, as below:
'2016-03-12','12393659','134',,'35533605',189348,9798,gmail.com;live_com.com
'2016-03-12','12390103','138',,'35438006',5133,1897,google.com
'2016-03-12','45616164','139',,'01318800',10945593,596633,facebook.com;tumblr.com;t.co
'2016-03-12','45673436','38',,'86441702',4350985,150327,serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net
As you can see, the first 3 fields are in single quotes, the 4th is blank, the 5th is in single quotes, and the 6th to 8th are only comma delimited.
I would like to get the following result (the 4th field needs to be double quoted even when empty):
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985,"150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"
I partially obtained the result with a mix of sed and awk:
sed -e s/\'//g input.csv > output.csv                      # eliminate single quotes
awk '{gsub(/[^,]+/,"\"&\"")}1' output.csv > output1.csv    # add double quotes
but the 4th field is not double quoted, and I need to reduce processing time as much as possible.
Any help to do it all in awk, with better performance and the 4th field double quoted as well, would be appreciated.
Many thanks for the help. M.Tave

If your data is really that simple, with no embedded quotes or newlines or anything, then all you need is:
$ awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985","150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"

Give this awk one-liner a try:
awk -F, -v OFS="," -v re="^'?|'?$" -v q='"' \
'{for(i=1;i<=NF;i++)if($i)gsub(re,q,$i);else $i=q$i q}7' file
The idea is to use gsub() to add double quotes to the non-empty fields; for the empty fields, just add " to the head and tail. The replacement regex is defined as an awk variable outside the script to avoid escaping problems.
It works with your input data here:
kent$ awk -F, -v OFS="," -v re="^'?|'?$" -v q='"' '{for(i=1;i<=NF;i++)if($i)gsub(re,q,$i);else $i=q$i q}7' f
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985","150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"

Related

Update a CSV file to drop the first number and insert a decimal place in a particular column

I need help performing the following.
My CSV file looks like this:
900001_10459.jpg,036921,Initiated
900002_10454.jpg,027964,Initiated
900003_10440.jpg,021449,Initiated
900004_10440.jpg,016650,Initiated
900005_10440.jpg,013929,Initiated
What I need to do is generate a new CSV file as follows:
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
If I were to do this as a test:
echo '036921' | awk -v range=1 '{print substr($0,range+1)}' | sed 's/.$/.&/'
I get
3692.1
Can anyone help me incorporate that (or anything similar) to change my CSV file?
Try:
sed 's/,0*\([0-9]*\)\([0-9]\),/,\1.\2,/' myfile.csv
(The capture groups need to be escaped as \( \) in a basic regular expression; alternatively, use sed -E with plain parentheses.)
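For example, assuming the sample rows are in myfile.csv:
$ sed 's/,0*\([0-9]*\)\([0-9]\),/,\1.\2,/' myfile.csv
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated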
Using awk and with the conditions specified in the comment, you can use:
$ awk -F, '{ printf "%s,%06.1f,%s\n", $1, $2 / 10, $3 }' data
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
$
With the printf format string providing the commas, there's no need to set OFS (because OFS is not used by printf).
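To see what the %06.1f conversion does on its own: the 0 flag pads with leading zeros and 6 is the minimum field width, which keeps the output the same width as the original zero-padded values. For instance (the second value is a made-up shorter one, just to show the padding):
$ echo '036921' | awk '{ printf "%06.1f\n", $1 / 10 }'
3692.1
$ echo '001234' | awk '{ printf "%06.1f\n", $1 / 10 }'
0123.4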
Assuming that values with leading zeros appear solely in the 2nd column, I would use GNU AWK for this task in the following way. Let file.txt content be
900001_10459.jpg,036921,Initiated
900002_10454.jpg,027964,Initiated
900003_10440.jpg,021449,Initiated
900004_10440.jpg,016650,Initiated
900005_10440.jpg,013929,Initiated
then
awk 'BEGIN{FS=",0?";OFS=","}{$2=gensub(/([0-9])$/, ".\\1", 1, $2);print}' file.txt
output
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated
Explanation: I set the field separator (FS) to be , optionally followed by 0, so a leading zero is discarded as part of the separator. In the 2nd field I replace the last digit with . followed by that digit. Finally I print the changed line, using , as the output separator.
(tested in gawk 4.2.1)
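If your awk lacks GNU's gensub(), the same edit can be done with plain sub(), which modifies $2 in place (a sketch under the same assumption about leading zeros):
awk 'BEGIN{FS=",0?"; OFS=","} {sub(/[0-9]$/, ".&", $2); print}' file.txt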
I wish to have 4 digits (including zeros) and the last value (the 5th digit) separated from those 4 by a decimal point.
If I understand correctly, you don't need all the digits of that field, only the last five.
Using awk, you can get the last five with the substr() function and then print the field with the last digit separated from the previous 4 by a decimal point, using the sub() function:
awk -F',' -v OFS=',' '{$2 = substr($2, length($2) - 4, 5); sub(/[[:digit:]]$/, ".&", $2); print}' file
900001_10459.jpg,3692.1,Initiated
900002_10454.jpg,2796.4,Initiated
900003_10440.jpg,2144.9,Initiated
900004_10440.jpg,1665.0,Initiated
900005_10440.jpg,1392.9,Initiated

Change column if some regex expression is true with awk or sed

I have a file (let's call it data.csv) similar to this:
"123","456","ud,h-match","moredata"
with many rows in the same format and embedded commas. What I need to do is look at the third column and see if it matches an expression. In this case I want to know if the third column contains "match" anywhere (which it does). If it does, then I want to replace the whole column with something else, like "replaced". So, relating that to the example data.csv file, I would want it to look like this:
"123","456","replaced","moredata"
Ideally, I want the file data.csv itself to be changed (time is of the essence since I have a big file), but it's also fine if you write it to another file.
Edit:
I have tried using awk:
awk -F'","' -OFS="," '{if(tolower($3) ~ "stringI'mSearchingFor"){$3="replacement"; print}else print}' file
but it doesn't change anything. If I remove the OFS portion then it works, but the output gets separated by spaces and the columns don't get enclosed in double quotes.
Depending on the answer to my question about what you mean by column, this may be what you want (uses GNU awk for FPAT):
$ awk -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '$3~/match/{$3="\"replaced\""} 1' file
"123","456","replaced","moredata"
Use awk -i inplace ... if you want to do "in place" editing.
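A minimal in-place invocation would look like this (it needs gawk 4.1 or later; consider keeping a backup of the file first):
gawk -i inplace -v FPAT='[^,]+|"[^"]+"' -v OFS=',' '$3~/match/{$3="\"replaced\""} 1' data.csv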
With any awk (but slightly more fragile than the above since it leaves the leading/trailing " on the first and last fields, and has no -i inplace):
$ awk 'BEGIN{FS=OFS="\",\""} $3~/match/{$3="replaced"} 1' file
"123","456","replaced","moredata"

removing commas from numbers in CSV file

I have a file that has many columns, and I only need two of those columns. I am getting the columns I need using
cut -f 2-3 -d, file1.csv > file2.csv
The issue I am having is that the first column is an ID, and once it gets past 999 it becomes 1,000, so it is treated as an extra column. I can't get rid of all commas because I need them to separate the data. Is there a way to use sed to remove only the commas that show up between digits 0-9?
I'd use a real CSV parser, and count backwards from the end of the line:
ruby -rcsv -ne '
row = $_.parse_csv
puts row[-5..-4].to_csv :force_quotes => true
' <<END
999,"someone#example.com","Doe, John","Doe","555-1212","address"
1,234,"email#email.com","name","lastname","phone","address"
END
"someone#example.com","Doe, John"
"email#email.com","name"
This works for the example in the comments:
awk -F'"?,"' '{print $2, $3}' file
The field separator is zero or one " followed by ,". This means that the comma in the first number doesn't count.
To separate the two fields with a comma instead of a space, you can change the OFS variable like this:
awk -F'"?,"' -v OFS=',' '{print $2, $3}' file
Or like this:
awk -F'"?,"' 'BEGIN{OFS=","}{print $2, $3}' file
Alternatively, if you want the quotes as well, you can use printf:
awk -F'"?,"' '{printf "\"%s\",\"%s\"\n", $2, $3}' file
From your comments, it sounds like there is a comma and a space (', ') pattern between tokens.
If this is the case, you can do it easily with sed. The strategy is to first replace all occurrences of ", " (comma-space) with some unique character sequence (like maybe ||):
's:, :||:g'
From there you can remove all remaining commas:
's:,::g'
Finally, replace the double pipes with comma-space again:
's:||:, :g'
Putting it into one statement:
sed -i -e 's:, :||:g;s:,::g;s:||:, :g' your_odd_file.csv
And a command-line example to try before you buy:
bash$ sed -e 's:, :||:g;s:,::g;s:||:, :g' <<< "1,200,000, hello world, 123,456"
1200000, hello world, 123456
If you are in the unfortunate situation where there is no space between fields in the CSV, you can attempt to 'fake it' by detecting changes in data type, such as where a numeric field is followed by a text field.
's:,\([^0-9]\):, \1:g' # numeric followed by non-numeric
's:\([^0-9]\),:\1, :g' # non-numeric field followed by something (anything)
You can put this all together into one statement, but you are venturing into dangerous waters here - this will definitely be a one-off solution and should be taken with a large grain of salt.
sed -e 's:,\([^0-9]\):, \1:g;s:\([^0-9]\),:\1, :g' \
-e 's:, :||:g;s:,::g;s:||:, :g' file1.csv > file2.csv
And another example:
bash$ sed -e 's:,\([^0-9]\):, \1:g;s:\([^0-9]\),:\1, :g' \
-e 's:, :||:g;s:,::g;s:||:, :g' <<< "1,200,000,hello world,123,456"
1200000, hello world, 123456

One-liner awk HTML tag replace of "''>" with "''> " using gsub

I've been racking my brain with this for the past half an hour and everything I've tried so far has failed miserably!
Within an HTML file, there is a field within tags, but the field itself is not separated by a space from the > sign, so it's hard to read with awk. I would basically like to add a single space after the opening tag, but gsub and awk are refusing to cooperate.
I've tried
awk 'gsub("class\\\'\\\'>","class\\\'\\\'>")' filename
since one backslash is needed to escape the single quote, the second to escape the backslash itself, and the third to escape the sequence \' but Terminal (I'm working on a Mac) refuses to execute, and instead goes in the next line awaiting some other input from me.
Please help :(
In Bash, single quotes accept absolutely no kind of escape. Suppose e.g. I write this command:
$ echo '\''
>
Bash will consider the string opened by ' closed at the second ', generating a string containing only \. The next ', then, is considered the opening of a new string, so bash expects for more input in the next line (signalled by the >).
If you are not aware of this fact, you may think that the string after the echo command below is still open, but it is closed:
$ echo 'will this string contain a single quote like \'
will this string contain a single quote like \
So, when you write
'gsub("class\\\'\\\'>","class\\\'\\\'> ")'
you are writing the string gsub("class\\\ concatenated with a backslash and a quote (\'); then a greater-than sign. After this, the "," is interpreted as a string containing a comma, because the single quote from the beginning of the expression was already closed. So far, the result is:
gsub("class\\\\'>,
After the comma, you have the string class, followed by a backslash and a quote, followed by another backslash and another quote, and finally by a greater than symbol and a space. This is the current string:
gsub("class\\\\'>,class\'\'>
This is not a valid awk expression! Anyway, it gets worse: the double quote " will start a string, which will contain a closing parenthesis and a single quote, but this string is never closed!
Summing up, your problem is that, once you open a string with ' in Bash, it is forcibly closed at the next ', no matter how many backslashes you put before it.
Solution: you can play some tricks opening and closing strings with ' and ", but it becomes cumbersome quickly. My suggested solution is to put your awk expression in a file and then use awk's -f flag, which makes awk execute the given script file:
$ cat filename # The file to be changed
class''>
class>
class''>
$ cat mycode.awk # The awk script
gsub("class''>", "class''>[PSEUDOSPACE]")
$ awk -f mycode.awk filename # THE RESULT!
class''>[PSEUDOSPACE]
class''>[PSEUDOSPACE]
If you do not want to write a file, use a so-called here document:
$ awk -f- filename <<EOF
gsub("class''>", "class''>[PSEUDOSPACE]")
EOF
class''>[PSEUDOSPACE]
class''>[PSEUDOSPACE]
The problem is that you are escaping the ', so you are not finishing the command. For example:
echo \' > foo
echoes a single quote into the file named foo, and
echo \\\' > foo
writes a single backslash followed by a single quote.
In particular, you cannot escape a single quote inside a single-quoted string, so
'foo\'bar'
is the string foo\ followed by the string bar, followed by an unmatched open quote. It is exactly the same as writing "foo\\"bar'
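For completeness, the usual workaround is the four-character sequence '\'': it closes the single-quoted string, appends an escaped literal quote, and reopens the string. The original command written that way would be:
awk 'gsub(/class'\'''\''>/, "class'\'''\''> ")' filename
Note that, like the file-based version in the other answer, this prints only the lines where a substitution happened, because gsub() is used as a pattern; wrap it in braces and append 1, as in '{gsub(...)}1', to print every line.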

parse a csv file that contains commas in the fields with awk

I have to use awk to print out 4 different columns in a CSV file. The problem is the strings are in a $x,xxx.xx format. When I run the regular awk command
awk -F, '{print $1}' testfile.csv
my output ends up looking like
307.00
$132.34
30.23
What am I doing wrong?
"$141,818.88","$52,831,578.53","$52,788,069.53"
this is roughly the input. The file I have to parse is 90,000 rows and about 40 columns.
This is how the input is laid out, or at least the parts of it that I have to deal with. Sorry if I made you think this wasn't what I was talking about.
If the input is "$307.00","$132.34","$30.23"
I want the output to be
$307.00
$132.34
$30.23
Oddly enough, I had to tackle this problem some time ago and I kept the code around to do it. You almost had it, but you need to get a bit tricky with your field separator(s).
awk -F'","|^"|"$' '{print $2}' testfile.csv
Input
# cat testfile.csv
"$141,818.88","$52,831,578.53","$52,788,069.53"
"$2,558.20","$482,619.11","$9,687,142.69"
"$786.48","$8,568,159.41","$159,180,818.00"
Output
# awk -F'","|^"|"$' '{print $2}' testfile.csv
$141,818.88
$2,558.20
$786.48
You'll note that the "first" field is actually $2 because of the ^" field separator. A small price to pay for a short one-liner, if you ask me.
I think what you're saying is that you want to split the input into CSV fields without getting tripped up by the commas inside the double quotes. If so...
First, use "," as the field separator, like this:
awk -F'","' '{print $1}'
But then you'll still end up with a stray double-quote at the beginning of $1 (and at the end of the last field). Handle that by stripping quotes out with gsub, like this:
awk -F'","' '{x=$1; gsub("\"","",x); print x}'
Result:
echo '"abc,def","ghi,xyz"' | awk -F'","' '{x=$1; gsub("\"","",x); print x}'
abc,def
In order to let awk handle quoted fields that contain the field separator, you can use a small script I wrote called csvquote. It temporarily replaces the offending commas with nonprinting characters, and you then restore them at the end of your pipeline, like this:
csvquote testfile.csv | awk -F, '{print $1}' | csvquote -u
This would also work with any other UNIX text processing program like cut:
csvquote testfile.csv | cut -d, -f1 | csvquote -u
You can get the csvquote code here: https://github.com/dbro/csvquote
The data file:
$ cat data.txt
"$307.00","$132.34","$30.23"
The AWK script:
$ cat csv.awk
BEGIN { RS = "," }
{
    gsub("\"", "", $1)
    print $1
}
The execution:
$ awk -f csv.awk data.txt
$307.00
$132.34
$30.23
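As a GNU-awk alternative, the FPAT approach shown in an earlier answer applies here too: FPAT describes what a field looks like rather than what separates fields, so the commas inside the quoted amounts never split anything. A sketch (it assumes no escaped double quotes inside fields):
$ gawk -v FPAT='[^,]+|"[^"]+"' '{ f = $1; gsub(/"/, "", f); print f }' testfile.csv
$141,818.88
$2,558.20
$786.48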