I have a file that has many columns and I only need two of those columns. I am getting the columns I need using
cut -f 2-3 -d, file1.csv > file2.csv
The issue I am having is that the first column is an ID, and once it gets past 999 it becomes 1,000, so it is treated as an extra column. I can't get rid of all commas because I need them to separate the data. Is there a way to use sed to remove only the commas that appear between digits (0-9)?
I'd use a real CSV parser, and count backwards from the end of the line:
ruby -rcsv -ne '
row = $_.parse_csv
puts row[-5..-4].to_csv :force_quotes => true
' <<END
999,"someone#example.com","Doe, John","Doe","555-1212","address"
1,234,"email#email.com","name","lastname","phone","address"
END
"someone#example.com","Doe, John"
"email#email.com","name"
This works for the example in the comments:
awk -F'"?,"' '{print $2, $3}' file
The field separator is zero or one " followed by ,". This means that the comma in the first number doesn't count.
To separate the two fields with a comma instead of a space, you can change the OFS variable like this:
awk -F'"?,"' -v OFS=',' '{print $2, $3}' file
Or like this:
awk -F'"?,"' 'BEGIN{OFS=","}{print $2, $3}' file
Alternatively, if you want the quotes as well, you can use printf:
awk -F'"?,"' '{printf "\"%s\",\"%s\"\n", $2, $3}' file
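To see the separator at work, here is a small sketch that runs the same -F'"?,"' split on the two sample rows from the Ruby answer above. The | output separator and the variable names sample and result are my own choices for illustration, picked so that the commas surviving inside fields are easy to spot.

```shell
# Two sample rows; the ID in the second row contains a thousands comma.
sample='999,"someone#example.com","Doe, John","Doe","555-1212","address"
1,234,"email#email.com","name","lastname","phone","address"'

# Split on: an optional double quote, a comma, then a double quote.
# Print the first three fields joined by | (illustrative separator).
result=$(printf '%s\n' "$sample" | awk -F'"?,"' -v OFS='|' '{print $1, $2, $3}')
printf '%s\n' "$result"
# 999|someone#example.com|Doe, John
# 1,234|email#email.com|name
```

The comma in 1,234 and the one in Doe, John both stay inside their fields, because neither is followed by a double quote. This only holds as long as every field after the ID is quoted.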
From your comments, it sounds like there is a comma and a space (', ') pattern between tokens.
If this is the case, you can do this easily with sed. The strategy is to first replace all occurrences of ', ' (a comma followed by a space) with some unique character sequence (like maybe ||).
's:, :||:g'
From there you can remove all commas:
's:,::g'
Finally, replace the double pipes with comma-space again.
's:||:, :g'
Putting it into one statement:
sed -i -e 's:, :||:g;s:,::g;s:||:, :g' your_odd_file.csv
And a command-line example to try before you buy:
bash$ sed -e 's:, :||:g;s:,::g;s:||:, :g' <<< "1,200,000, hello world, 123,456"
1200000, hello world, 123456
If you are in the unfortunate situation where there is not a space between fields in the CSV - you can attempt to 'fake it' by detecting changes in data type - like where there is a numeric field followed by a text field.
's:,\([^0-9]\):, \1:g' # numeric followed by non-numeric
's:\([^0-9]\),:\1, :g' # non-numeric field followed by something (anything)
You can put this all together into one statement, but you are venturing into dangerous waters here - this will definitely be a one-off solution and should be taken with a large grain of salt.
sed -e 's:,\([^0-9]\):, \1:g;s:\([^0-9]\),:\1, :g' \
-e 's:, :||:g;s:,::g;s:||:, :g' file1.csv > file2.csv
And another example:
bash$ sed -e 's:,\([^0-9]\):, \1:g;s:\([^0-9]\),:\1, :g' \
-e 's:, :||:g;s:,::g;s:||:, :g' <<< "1,200,000,hello world,123,456"
1200000, hello world, 123456
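For the literal question (remove commas that sit between digits), a more direct sketch is possible. This is not from any of the answers above; it assumes that a comma followed by exactly three digits and then a non-digit is always a thousands separator, which is wrong if two numeric fields can sit next to each other (for example 12,345 meaning the fields 12 and 345). The function name strip_thousands is made up for the example.

```shell
# Sketch: drop a comma only when it sits between a digit and a group of
# exactly three digits that ends the number (the next character is a
# non-digit or the end of the line). The :a ... ta loop repeats the
# substitution until no such comma is left on the line.
strip_thousands() {
    sed -E -e ':a' -e 's/([0-9]),([0-9]{3})([^0-9]|$)/\1\2\3/' -e 'ta'
}

result=$(printf '%s\n' '1,234,"email#email.com","name"' | strip_thousands)
printf '%s\n' "$result"
# 1234,"email#email.com","name"
```

After this step the ID is a single field again, so the original cut -f 2-3 -d, would pick the intended columns, provided no other field contains a comma.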
I have a bunch of big CSV files, and I want to prefix every header column with a fixed string. There are more than 500 columns in every file.
Suppose my header is:
number;date;customer;key;amount
I tried this awk line:
awk -F';' 'NR==1{gsub(/[^a-z_]/,"input_file.")} { print }'
but I get (note the first column is missing the prefix and the separator is removed):
numberinput_file.dateinput_file.customerinput_file.keyinput_file.amount
expected output:
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
In any awk that'd be:
$ awk 'NR==1{gsub(/^|;/,"&input_file.")} 1' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
but sed exists to do simple substitutions like that, e.g. using a sed that has -E to enable EREs (e.g. GNU and BSD sed):
$ sed -E '1s/^|;/&input_file./g' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
If you're using GNU tools then you could use either of the above to change all of your CSV files at once with either of these:
awk -i inplace 'NR==1{gsub(/^|;/,"&input_file.")} 1' *.csv
sed -i -E '1s/^|;/&input_file./g' *.csv
Your gsub would brutally replace every character on the header line that is not a lowercase letter or an underscore with the prefix - including your column separators.
The print can be abbreviated to the common idiom 1 at the very end of your script; this simply means "this condition is true; perform the default action for every line (i.e. print it all)" though this is just a stylistic change.
awk -F';' 'NR==1{
sub(/^/, "input_file."); gsub(/;/, ";input_file."); }
1' filename
If you want to perform this on multiple files, probably put a shell loop around it. If you only want to concatenate everything to standard output, you can give all the files to Awk in one go (in which case you probably don't want to print the header line for any file after the first; maybe change the 1 to NR==1 || FNR != 1).
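A shell loop around it could look like the sketch below. The function name prefix_headers and the temporary-file handling are my own assumptions, not part of the answer; the awk program is the same one as above, with the prefix passed in through -v. Prefixes containing & or backslashes would need extra escaping, because awk treats them specially in -v assignments and in replacement strings.

```shell
# Hypothetical helper: prefix every header column of each given file.
# $1 is the prefix; the remaining arguments are ;-separated files.
prefix_headers() {
    prefix=$1
    shift
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # Rewrite only line 1; the trailing 1 prints every line.
        if awk -v p="$prefix" 'NR==1{sub(/^/, p); gsub(/;/, ";" p)} 1' "$f" > "$tmp"; then
            cat "$tmp" > "$f"   # overwrite in place, keeping the file's permissions
        fi
        rm -f "$tmp"
    done
}
```

It would be called as prefix_headers 'input_file.' *.csv. Writing to a temporary file first avoids truncating the input while awk is still reading it.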
I would use GNU AWK in the following way. Let the content of file.txt be
number;date;customer;key;amount
1;2;3;4;5
6;7;8;9;10
then
awk 'BEGIN{FS=";";OFS=";input_file."}NR==1{$1="input_file." $1}{print}' file.txt
output
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
1;2;3;4;5
6;7;8;9;10
Explanation: I set OFS to ; followed by the prefix. Then in the first line I add the prefix to the first column, which triggers rebuilding of the record with the new OFS. No modification is done in any other line, thus they are printed as is.
(tested in GNU Awk 5.0.1)
Also with awk using for loop and printf:
awk 'BEGIN{FS=OFS=";"} NR==1{for (i=1; i<=NF; i++) printf "%s%s", "input_file." $i, (i<NF ? OFS : ORS); next} 1' file
input_file.number;input_file.date;input_file.customer;input_file.key;input_file.amount
I have a JSON string and need to extract the values in the square brackets with a bash script and validate them against the expected values. If the expected values exist, leave the string as it is; otherwise add the missing values inside the square brackets.
"hosts": ["unix://","tcp://0.0.0.0:2376"]
I cannot use jq.
Expected :
Verify that the values "unix://" and "tcp://0.0.0.0:2376" exist for the key "hosts". Add them if they don't exist.
I tried the following:
$ echo "\"hosts\":[\"unix://\",\"tcp://0.0.0.0:2376\"]" | cut -d: -f2
["unix
$ echo "\"hosts\":[\"unix://\",\"tcp://0.0.0.0:2376\"]" | sed 's/:.*//'
"hosts"
I have tried multiple possibilities with sed & cut but cannot achieve what I expect. I'm a shell script beginner.
How can I achieve this with sed or cut?
You need to detect the presence of "unix://" and "tcp://0.0.0.0:2376" in your string. You can do it like this:
#!/bin/bash
#
string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
check1=$(echo "$string" | grep -cF "unix://")
check2=$(echo "$string" | grep -cF "tcp://0.0.0.0:2376")   # -F: fixed string, so the dots are not regex wildcards
(( total = check1 + check2 ))
if [[ "$total" -eq 2 ]]
then
echo "they are both in, nothing to do"
else
echo "they are NOT both there, fix variable string"
string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
fi
grep -c counts the number of lines that match, not the number of occurrences. Here the string is a single line, so each check yields 0 or 1, and adding them together will produce 0, 1 or 2. Only when it is equal to 2 is the string correct.
cut will extract some string based on a certain delimiter. But it is not typically used to verify if a string is in there, grep does that.
sed has many uses, such as replacing text (with 's///'). But again, grep is the tool that was built to detect strings in other strings (or files).
Now when it comes to adding text, you say that if one of "unix://" or "tcp://0.0.0.0:2376" is missing, add it. Well that comes back to redefining the whole string with the correct values, so just assign it.
Finally, if you think about it, you want to ensure that string is "hosts": ["unix://","tcp://0.0.0.0:2376"]. So no need to verify anything, just force it through hardcode at the start of your script. The end result will be the same.
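If you would rather add only the value that is missing instead of overwriting the whole string, something like the following sketch could work. The function name ensure_host is hypothetical, and the code assumes the whole "hosts" entry sits on one line, that an empty list is written as [] with no space inside, and that the values contain no double quotes. It uses only shell pattern matching, no jq.

```shell
# Hypothetical helper: print the line, with "value" appended to the
# bracketed list if it is not already one of its elements.
ensure_host() {
    line=$1
    value=$2
    case $line in
        *"\"$value\""*) ;;                  # already listed: leave as is
        *)  head=${line%]*}                 # everything before the final ]
            case $head in
                *\[) sep='' ;;              # list was empty: no comma needed
                *)   sep=',' ;;
            esac
            line="$head$sep\"$value\"]" ;;
    esac
    printf '%s\n' "$line"
}

string='"hosts": ["unix://"]'
string=$(ensure_host "$string" 'unix://')
string=$(ensure_host "$string" 'tcp://0.0.0.0:2376')
printf '%s\n' "$string"
# "hosts": ["unix://","tcp://0.0.0.0:2376"]
```

Because the pattern includes the surrounding double quotes, a longer value such as "unix://foo" is not mistaken for "unix://".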
Part 2
If you MUST use cut, you could:
#!/bin/bash
#
string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
firstelement=$(echo "$string" | cut -d',' -f1 | cut -d'"' -f4)
echo "$firstelement"
# will display unix://
secondelement=$(echo "$string" | cut -d',' -f2 | cut -d'"' -f2)
echo "$secondelement"
# will display tcp://0.0.0.0:2376
Then you can use if statements to compare to your desired values. But note that this approach will fail if you do not have at least 2 elements in your text between the [ ]. Ex. ["unix://"] contains no ',' character, so cut -d',' passes the whole line through unchanged and the second extraction returns the wrong value.
Part 3
If you MUST use sed:
#!/bin/bash
#
string='"hosts": ["unix://","tcp://0.0.0.0:2376"]'
firstelement=$(echo "$string" | sed 's/.*\["\(.*\)",".*/\1/')
echo "$firstelement"
# will output unix://
secondelement=$(echo "$string" | sed 's/.*","\(.*\)"\]/\1/')
echo "$secondelement"
# will output tcp://0.0.0.0:2376
Again here, the main character to work with is the ,.
firstelement explanation
sed 's/.*\["\(.*\)",".*/\1/'
.* anything...
\[" followed by [ and ". Since [ means something to sed, you have to \ it
\(.*\) followed by anything at all (. matches any character, * matches any number of these characters).
"," followed by ",". This only happens for the first element.
.* followed by anything
\1 keep only the characters enclosed between \( and \)
Similarly, for the second element the s/// is modified to keep only what follows ",", up to the last "] at the end of the string.
Again like with cut above, use if statements to verify if the extracted values are what you wanted.
Again, read my last comments in the first approach, you might not need all this...
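Both the cut and the sed variants above depend on there being exactly two elements. A sketch that lists however many elements there are, one per line, might look like this. The function name list_hosts is made up, and the code assumes a single-line entry whose values contain no commas or escaped quotes.

```shell
# Hypothetical helper: print each element of the bracketed list on its own line.
list_hosts() {
    printf '%s\n' "$1" |
        sed 's/.*\[\(.*\)].*/\1/' |   # keep only the text between [ and ]
        tr ',' '\n' |                 # one element per line
        sed 's/^ *"//; s/" *$//'      # strip the surrounding quotes
}

list_hosts '"hosts": ["unix://","tcp://0.0.0.0:2376"]'
# unix://
# tcp://0.0.0.0:2376
```

The output can then be checked line by line, for example with grep -Fx, which matches a whole line as a fixed string.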
Hi, I need to process a big CSV file (20M rows), adding double quotes around every comma-delimited field. The CSV file has 8 comma-delimited fields, as below:
'2016-03-12','12393659','134',,'35533605',189348,9798,gmail.com;live_com.com
'2016-03-12','12390103','138',,'35438006',5133,1897,google.com
'2016-03-12','45616164','139',,'01318800',10945593,596633,facebook.com;tumblr.com;t.co
'2016-03-12','45673436','38',,'86441702',4350985,150327,serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net
As you can see, the first 3 fields are in single quotes, the 4th is blank, the 5th is in single quotes, and the 6th to 8th are only comma-delimited.
I would like to get the following result (the 4th field, even though empty, also needs to be double-quoted):
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985","150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"
I get a partial result with a mix of sed and awk:
sed -e s/\'//g inpu.csv > output.csv   # eliminate quotes
awk '{gsub(/[^,]+/,"\"&\"")}1' output.csv > output1.csv   # add double quotes
but the 4th field is not double-quoted, and I need to reduce the processing time as much as possible.
Any help doing it all in awk, with better performance and with the 4th field double-quoted too, would be appreciated.
Many thanks for the help. M.Tave
If your data is really that simple with no embedded quotes or newlines or anything then all you need is:
$ awk -F"'?,'?" -v OFS='","' '{$1=$1; gsub(/^.|$/,"\"")} 1' file
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985","150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"
Give this awk one-liner a try:
awk -F, -v OFS="," -v re="^'?|'?$" -v q='"' \
'{for(i=1;i<=NF;i++)if($i)gsub(re,q,$i);else $i=q$i q}7' file
The idea is to use gsub() to add double quotes to the non-empty fields, replacing any single quote at the start or end of the field. Empty fields simply get a " at the head and tail. The regex is defined as an awk variable outside the script, to avoid having to escape the single quotes. The trailing 7 is just a true condition that triggers the default action, printing the line.
It works with your input data here:
kent$ awk -F, -v OFS="," -v re="^'?|'?$" -v q='"' '{for(i=1;i<=NF;i++)if($i)gsub(re,q,$i);else $i=q$i q}7' f
"2016-03-12","12393659","134","","35533605","189348","9798","gmail.com;live_com.com"
"2016-03-12","12390103","138","","35438006","5133","1897","google.com"
"2016-03-12","45616164","139","","01318800","10945593","596633","facebook.com;tumblr.com;t.co"
"2016-03-12","45673436","38","","86441702","4350985","150327","serving-sys.com;chartboost.com;admarvel.com;mydas.mobi;adap.tv;cloudfront.net"
I want to filter out the lines that have "synonymous" in the 3rd column. The command is as follows:
awk '$3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
But the fileCSV2.csv still contains the word "synonymous" in the 3rd column. I wonder what might be wrong?
Two lines from the fileCSV.csv:
"exonic","LINC00115","synonymous SNV","uc010nxx.2:c.C299T:p.P100L",,"0.99",,0.56,rs3115849,,,,,,,,,,,,,chr1,762273,762273,G,A,"chr1","762273",".","G","A","30483.62","PASS","AC=24;AF=1.00;AN=24;DP=2972;FS=0.000;MLEAC=8;MLEAF=1.00;MQ0=0;VQSLOD=19.50;culprit=FS;set=Intersection","GT:AD:DP:GQ:PL","1/1:0,2:2:6:66,6,0","1/1:0,297:297:99:10476,951,0","1/1:0,304:304:99:10098,950,0","1/1:0,295:295:99:9869,929,0","1/1:0,292:292:99:8655,895,0","1/1:0,304:304:99:10006,965,0","1/1:0,179:179:99:5862,568,0","1/1:0,273:273:99:9328,851,0","1/1:0,279:279:99:7946,850,0","1/1:0,283:283:99:9214,866,0","1/1:0,8:8:21:229,21,0","1/1:0,456:456:99:16385,1285,0"
"exonic","SAMD11","synonymous SNV","uc001abw.1:c.T1027C:p.W343R","559;Name=lod=249",,,1.00,rs6672356,1,0.916445,N,0.0,T,0.0,B,0.998605,N,4.19E-4,N,3.17,chr1,877831,877831,T,C,"chr1","877831",".","T","C","3594.56","PASS","AC=24;AF=1.00;AN=24;DP=387;FS=0.000;MLEAC=8;MLEAF=1.00;MQ=60.00;MQ0=0;VQSLOD=15.00;culprit=DP;set=Intersection","GT:AD:DP:GQ:PL","1/1:0,3:3:9:97,9,0","1/1:0,3:3:12:113,12,0","1/1:0,64:64:99:1805,189,0","1/1:0,57:57:99:1605,168,0","1/1:0,30:30:90:768,90,0","1/1:0,69:69:99:2026,216,0","1/1:0,15:15:45:428,45,0","1/1:0,23:23:81:809,81,0","1/1:0,22:22:69:562,69,0","1/1:0,40:40:99:1142,117,0","1/1:0,3:3:9:94,9,0","1/1:0,58:58:99:14,7,0"
If your fileCSV.csv has columns separated by , then you need to set the field separator:
awk -F, '$3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
If -F does not work with your version of awk try
awk 'BEGIN{FS=","} $3 !~ /^synonymous/' fileCSV.csv > fileCSV2.csv
EDIT: you also need to take " into account, so use /^"synonymous/
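Putting the edit together with -F, gives the sketch below. The first row is the start of the first sample line from the question; the second row (GENE1, nonsynonymous SNV) is made up for illustration so that there is something left after filtering. It assumes no field before the third contains a comma.

```shell
# Shortened sample: one real row prefix from the question, one hypothetical row.
rows='"exonic","LINC00115","synonymous SNV","uc010nxx.2:c.C299T:p.P100L"
"exonic","GENE1","nonsynonymous SNV","hypothetical"'

# With -F, the third field still carries its opening quote,
# so the anchor has to include it.
kept=$(printf '%s\n' "$rows" | awk -F, '$3 !~ /^"synonymous/')
printf '%s\n' "$kept"
# "exonic","GENE1","nonsynonymous SNV","hypothetical"
```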
To process csv file using awk I would prefer the following method to automatically account for quotation marks, namely preprocess with sed.
For your concrete question I would use
sed -e 's/^"//;s/"$//' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/'
If you also want to correctly process files with string fields containing quotation marks (which will be represented by double quotation marks in csv files), you need to change the sed expression the following way,
sed -e 's/^"//;s/"$//;s/""/"/g' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/'
This method has the advantage that it allows you to correctly print or process some fields using awk. For example, if you want to print the first and fifth fields from the filtered lines, separated by a :, you can now use
sed -e 's/^"//;s/"$//;s/""/"/g' fileCSV.csv | awk -F '"?,"?' '$3 !~ /^synonymous/ { print $1 ":" $5 }'
(If the difference between the methods is not clear to you, you can try the last awk command without the sed preprocessing)
I have to use awk to print out 4 different columns in a CSV file. The problem is the strings are in a $x,xxx.xx format. When I run the regular awk command
awk -F, '{print $1}' testfile.csv
my output ends up looking like
307.00
$132.34
30.23
What am I doing wrong?
"$141,818.88","$52,831,578.53","$52,788,069.53"
This is roughly the input. The file I have to parse is 90,000 rows and about 40 columns.
This is how the input is laid out, or at least the parts of it that I have to deal with. Sorry if I made you think this wasn't what I was talking about.
If the input is "$307.00","$132.34","$30.23"
I want the output to be
$307.00
$132.34
$30.23
Oddly enough I had to tackle this problem some time ago and I kept the code around to do it. You almost had it, but you need to get a bit tricky with your field separator(s).
awk -F'","|^"|"$' '{print $2}' testfile.csv
Input
# cat testfile.csv
"$141,818.88","$52,831,578.53","$52,788,069.53"
"$2,558.20","$482,619.11","$9,687,142.69"
"$786.48","$8,568,159.41","$159,180,818.00"
Output
# awk -F'","|^"|"$' '{print $2}' testfile.csv
$141,818.88
$2,558.20
$786.48
You'll note that the "first" field is actually $2 because of the field separator ^". Small price to pay for a short 1-liner if you ask me.
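Since the same separator also leaves an empty field after the closing quote, the real columns run from $2 to $(NF-1). A small sketch that prints every column of one of the sample rows on its own line (the variable names line and fields are only for the example):

```shell
line='"$141,818.88","$52,831,578.53","$52,788,069.53"'

# $1 (before the opening quote) and $NF (after the closing quote) are
# empty, so loop over the fields in between.
fields=$(printf '%s\n' "$line" |
    awk -F'","|^"|"$' '{for (i = 2; i < NF; i++) print $i}')
printf '%s\n' "$fields"
# $141,818.88
# $52,831,578.53
# $52,788,069.53
```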
I think what you're saying is that you want to split the input into CSV fields while not getting tripped up by the commas inside the double quotes. If so...
First, use "," as the field separator, like this:
awk -F'","' '{print $1}'
But then you'll still end up with a stray double-quote at the beginning of $1 (and at the end of the last field). Handle that by stripping quotes out with gsub, like this:
awk -F'","' '{x=$1; gsub("\"","",x); print x}'
Result:
echo '"abc,def","ghi,xyz"' | awk -F'","' '{x=$1; gsub("\"","",x); print x}'
abc,def
In order to let awk handle quoted fields that contain the field separator, you can use a small script I wrote called csvquote. It temporarily replaces the offending commas with nonprinting characters, and then you restore them at the end of your pipeline. Like this:
csvquote testfile.csv | awk -F, '{print $1}' | csvquote -u
This would also work with any other UNIX text processing program like cut:
csvquote testfile.csv | cut -d, -f1 | csvquote -u
You can get the csvquote code here: https://github.com/dbro/csvquote
The data file:
$ cat data.txt
"$307.00","$132.34","$30.23"
The AWK script:
$ cat csv.awk
BEGIN { RS = "," }
{ gsub("\"", "", $1);
print $1 }
The execution:
$ awk -f csv.awk data.txt
$307.00
$132.34
$30.23