Remove newline in the middle of CSV file - csv

I need to clean a CSV file that looks like this:
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking
price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
Sometimes a field has no double quotes, but the newline only occurs inside double-quoted fields.
The issue happens only with the 4th field.
I'm working on an awk command, and this is what I have so far:
awk '{ if (substr($4,1,1) == "\"" && substr($4,length($4)) != "\"") gsub(/\n/," ");}' FS=";" input_file
This awk checks whether the first character of the field is a double quote and the last one isn't, and then tries to remove the newline, but it clearly doesn't remove it.
I think I'm missing some small, "easy" thing but can't figure out what it is.

You may use this awk:
awk -F';' -v ORS= '1; {print (NF==4 ? " " : "\n")}' file
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
How it works:
This command initially sets ORS to an empty string, so print adds no newline of its own.
The 1 rule prints each record as-is.
The second block then appends a space when NF == 4 (the record is the incomplete first half of a broken line) or a newline otherwise.
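If the break could ever occur in a different column, a more general take on the same idea (my sketch, not part of the answer) is to count separators instead of fields: a complete record here has exactly 4 semicolons, assuming none occur inside the quoted data:
awk '{ n += gsub(/;/, ";")                  # count this line's separators (gsub returns the count)
       buf = buf $0 (n < 4 ? " " : "\n")    # join with a space until the record is complete
     }
     n >= 4 { printf "%s", buf; buf = ""; n = 0 }' input_file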

Using GNU sed
$ sed -Ez 's/(;"[^"]*)\n/\1 /g' input_file
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
The -z option makes sed read NUL-separated records, so the whole file becomes one string and the regex can match a newline that directly follows an opening ;" with no closing quote in between.

With GNU awk for RT:
$ awk -v RS='"' '!(NR%2){gsub(/\n/," ")} {ORS=RT} 1' file
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO
With RS='"', every even-numbered record lies between quotes, so any newline found there is one of the broken ones; ORS=RT puts back whatever separator (here, the quote) actually ended each record.

One idea for tweaking OP's current awk code. Note that awk reads one line at a time, so $0 never contains a newline and the gsub has nothing to match; instead, suppress the trailing newline when printing an incomplete line:
awk -F';' '
{ if (substr($4,1,1) == "\"" && substr($4,length($4)) != "\"") { # if we have an incomplete line then ...
printf "%s ", $0 # print the line without its newline; the space rejoins it to the next line, and "%s" guards against "%" in the data
next # skip to next line of input
}
}
1 # otherwise print current line
' input_file
# or as a one-liner sans comments:
awk -F';' ' { if (substr($4,1,1) == "\"" && substr($4,length($4)) != "\"") { printf "%s ", $0; next } } 1 ' input_file
This generates:
food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO

This might work for you (GNU sed):
sed -E ':a;/^[^\"]*(\\.[^\"]*)*("[^\"]*(\\.[^"\]*)*"[^\"]*)*"[^\"]*(\\.[^"\]*)*$/{N;s/\n//;ta}' file
This matches a line whose double quotes are unbalanced (with or without escaped double quotes), appends the following line, removes the newline, and repeats until the double quotes are balanced.
A simpler solution that forgoes escaped double quotes:
sed -E ':a;/^[^"]*("[^"]*"[^"]*)*"[^"]*$/{N;s/\n//;ta}' file

echo 'food;1;ZZ;"lipsum";NR
foobar;123;NA;"asking
price";NR
foobar;5;NN;Random text;NN
moongoo;13;VV;"Any label";OO' |
mawk '(ORS = (ORS==RS)==(NF % 2) ? RS : " ")^_' FS=';' | gcat -n
1 food;1;ZZ;"lipsum";NR
2 foobar;123;NA;"asking price";NR
3 foobar;5;NN;Random text;NN
4 moongoo;13;VV;"Any label";OO
Here ORS is set to a newline when the record is complete or completes a pending one (tracked through ORS itself), and to a space otherwise, so broken lines are glued back together; the trailing ^_ raises that value to an unset variable (0), which always yields 1, so the default print fires for every record.

How do I use shell variables in an awk script?

I found some ways to pass external shell variables to an awk script, but I'm confused about ' and ".
First, I tried with a shell script:
$ v=123test
$ echo $v
123test
$ echo "$v"
123test
Then I tried awk:
$ awk 'BEGIN{print "'$v'"}'
123test
$ awk 'BEGIN{print '"$v"'}'
123
Why the difference?
Lastly I tried this:
$ awk 'BEGIN{print " '$v' "}'
$ 123test
$ awk 'BEGIN{print ' "$v" '}'
awk: cmd. line:1: BEGIN{print
awk: cmd. line:1: ^ unexpected newline or end of string
I'm confused about this.
Getting shell variables into awk
may be done in several ways. Some are better than others. This should cover most of them.
Using -v (The best way, most portable)
Use the -v option (P.S.: use a space after -v or it will be less portable, e.g. awk -v var=, not awk -vvar=):
variable="line one\nline two"
awk -v var="$variable" 'BEGIN {print var}'
line one
line two
This is compatible with most awks, and the variable is available in the BEGIN block as well.
If you have multiple variables:
awk -v a="$var1" -v b="$var2" 'BEGIN {print a,b}'
Warning: as Ed Morton notes, escape sequences in the value will be interpreted, so \t becomes a real tab and not a literal \t if that is what you are searching for. This can be avoided by using ENVIRON[] or ARGV[] instead.
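A minimal illustration of that pitfall (the variable names here are arbitrary):
var='a\tb'
awk -v v="$var" 'BEGIN {print v}'          # prints "a<TAB>b": the \t became a real tab
v="$var" awk 'BEGIN {print ENVIRON["v"]}'  # prints "a\tb" literally, bytes untouched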
P.S. If you have a vertical bar or other regexp metacharacters as the separator, like |?( etc., they must be double-escaped. For example, three vertical bars ||| becomes -F'\\|\\|\\|'. You can also use -F"[|][|][|]".
Example of getting data from a program/function into awk (here date is used):
awk -v time="$(date +"%F %H:%M" -d '-1 minute')" 'BEGIN {print time}'
Example of testing the contents of a shell variable as a regexp:
awk -v var="$variable" '$0 ~ var{print "found it"}'
Variable after code block
Here we get the variable after the awk code. This will work fine as long as you do not need the variable in the BEGIN block:
variable="line one\nline two"
echo "input data" | awk '{print var}' var="${variable}"
or
awk '{print var}' var="${variable}" file
Adding multiple variables:
awk '{print a,b,$0}' a="$var1" b="$var2" file
In this way we can also set a different field separator (FS) for each file.
awk 'some code' FS=',' file1.txt FS=';' file2.ext
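For instance, a tiny runnable demo (with made-up one-line files):
$ printf 'a,b\n' > file1.txt
$ printf 'c;d\n' > file2.ext
$ awk '{print $1}' FS=',' file1.txt FS=';' file2.ext
a
c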
Variable after the code block will not work for the BEGIN block:
echo "input data" | awk 'BEGIN {print var}' var="${variable}"
This prints an empty line, because var has not been assigned yet when BEGIN runs.
Here-string
The variable can also be fed to awk using a here-string, in shells that support them (including Bash):
variable="test"
awk '{print $0}' <<< "$variable"
test
This is the same as:
printf '%s' "$variable" | awk '{print $0}'
P.S. this treats the variable as a file input.
ENVIRON input
As TrueY writes, you can use ENVIRON to read environment variables.
Set and export a variable before running awk (it has to be exported for awk to see it), then print it like this:
export X=MyVar
awk 'BEGIN{print ENVIRON["X"],ENVIRON["SHELL"]}'
MyVar /bin/bash
ARGV input
As Steven Penny writes, you can use ARGV to get the data into awk:
v="my data"
awk 'BEGIN {print ARGV[1]}' "$v"
my data
To get the data into the body of the script, not just the BEGIN block:
v="my data"
echo "test" | awk 'BEGIN{var=ARGV[1];ARGV[1]=""} {print var, $0}' "$v"
my data test
Variable within the code: USE WITH CAUTION
You can expand a shell variable within the awk code, but it's messy and hard to read, and as Charles Duffy points out, this version is also vulnerable to code injection: if someone adds bad stuff to the variable, it will be executed as part of the awk code.
This works by expanding the variable inside the code, so that it becomes part of the script text.
If you want awk code that changes dynamically based on a variable, you can do it this way, but DO NOT use it for ordinary variables.
variable="line one\nline two"
awk 'BEGIN {print "'"$variable"'"}'
line one
line two
Here is an example of code injection:
variable='line one\nline two" ; for (i=1;i<=1000;++i) print i"'
awk 'BEGIN {print "'"$variable"'"}'
line one
line two
1
2
3
.
.
1000
You can feed lots of commands to awk this way, and even make it crash with invalid commands.
One valid use of this approach, though, is when you want to pass a symbol to awk to be applied to some input, e.g. a simple calculator:
$ calc() { awk -v x="$1" -v z="$3" 'BEGIN{ print x '"$2"' z }'; }
$ calc 2.7 '+' 3.4
6.1
$ calc 2.7 '*' 3.4
9.18
There is no way to do that using an awk variable populated with the value of a shell variable; you NEED the shell variable to expand to become part of the text of the awk script before awk interprets it (see the comment below by Ed M.).
Extra info:
Use of double quotes
It's always good to double-quote the variable ("$variable"); if not, multiple lines will be collapsed into one long single line.
Example:
var="Line one
This is line two"
echo $var
Line one This is line two
echo "$var"
Line one
This is line two
Other errors you can get without double quotes:
variable="line one\nline two"
awk -v var=$variable 'BEGIN {print var}'
awk: cmd. line:1: one\nline
awk: cmd. line:1: ^ backslash not last character on line
awk: cmd. line:1: one\nline
awk: cmd. line:1: ^ syntax error
And with single quotes, the variable is not expanded:
awk -v var='$variable' 'BEGIN {print var}'
$variable
More info about AWK and variables
Read this faq.
It seems that the good old ENVIRON awk built-in hash is not mentioned at all. An example of its usage:
$ X=Solaris awk 'BEGIN{print ENVIRON["X"], ENVIRON["TERM"]}'
Solaris rxvt
You can pass the command-line option -v with a variable name (v) and a value (=) taken from the environment variable ("${v}"):
% awk -vv="${v}" 'BEGIN { print v }'
123test
Or to make it clearer (with far fewer vs):
% environment_variable=123test
% awk -vawk_variable="${environment_variable}" 'BEGIN { print awk_variable }'
123test
You can utilize ARGV:
v=123test
awk 'BEGIN {print ARGV[1]}' "$v"
Note that if you are going to continue into the body, you will need to adjust ARGC:
awk 'BEGIN {ARGC--} {print ARGV[2], $0}' file "$v"
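A quick demo of that, assuming a file containing the single line hello:
$ printf 'hello\n' > file
$ v=123test
$ awk 'BEGIN {ARGC--} {print ARGV[2], $0}' file "$v"
123test hello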
I just adapted @Jotne's answer for use in a for loop:
for i in `seq 11 20`; do host myserver-$i | awk -v i="$i" '{print "myserver-"i" " $4}'; done
I had to insert a date at the beginning of each line of a log file, and it's done like below:
DATE=$(date +"%Y-%m-%d")
awk '{ print "'"$DATE"'", $0; }' /path_to_log_file/log_file.log
The output can be redirected to another file to save it.
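Given the injection caveats discussed above, the same thing can be done with -v (a sketch; the output filename here is made up):
DATE=$(date +"%Y-%m-%d")
awk -v d="$DATE" '{ print d, $0 }' /path_to_log_file/log_file.log > /path_to_log_file/log_file_dated.log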
Pro Tip
It can come in handy to create a function that handles this, so you don't have to type everything every time. Using the selected solution, we get:
awk_switch_columns() {
    awk -v a="$1" -v b="$2" '{ t = $a; $a = $b; $b = t; print }'
}
And use it as...
echo 'a b c d' | awk_switch_columns 2 4
Output:
a d c b

Replace a string if it is not followed by another string

I would like to replace the @fortawesome string (when it is not followed by the /fontawesome-common-types string) with the @stephane string.
sed -e 's,"@fortawesome(/^fontawesome-common-types+),"@stephaneeybert\1,g'
sed: -e expression #1, char 65: invalid reference \1 on `s' command's RHS
An example input:
"#fortawesome/fontawesome-common-types": "^0.2.32"
"name": "#fortawesome/pro-duotone-svg-icons",
And its expected output:
"#fortawesome/fontawesome-common-types": "^0.2.32"
"name": "#stephane/pro-duotone-svg-icons",
UPDATE: I went with the simple alternative of using an intermediate variable:
EXCLUDE=fontawesome-common-types
BUFFER=EkSkLUdE
cat package/package.json \
| sed -e "s,\"#$REPO_SOURCE/$EXCLUDE,\"#$BUFFER,g" \
| sed -e "s,\"#$REPO_SOURCE,\"#$REPO_DEST,g" \
| sed -e "s,\"#$BUFFER,\"#$REPO_SOURCE/$EXCLUDE,g" \
> package/package.out.json;
sed doesn't support negative lookahead. Besides the obvious perl fallback that does support lookaheads (a sketch follows the explanation below), you may use this awk as a workaround:
awk -F 'fortawesome' -v OFS='stephane' 'NF > 1 {
s = ""
for (i=1; i<NF; ++i)
s = s $i ($(i+1) ~ /^\/fontawesome-common-type/ ? FS : OFS)
$0 = s $i
} 1' file
This awk uses fortawesome as the input field separator and stephane as the OFS.
NF > 1 is true only when fortawesome appears in the line.
We loop through the fields split by fortawesome, peeking at the field that follows each separator.
If the next field starts with /fontawesome-common-type we put the original FS back; otherwise we emit OFS.
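For comparison, the perl fallback mentioned above might look like this (a sketch; the \@ keeps Perl from interpolating @fortawesome as an array, and the plural -types follows the sample data):
perl -pe 's,\@fortawesome(?!/fontawesome-common-types),\@stephane,g' file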
Use temporary values:
exclude='fortawesome/fontawesome-common-type';
match='fortawesome';
repl='stephane';
tmpvar='EkSkLUdE';
sed "s#$exclude#$tmpvar#g;s#$match#$repl#g;s#$tmpvar#$exclude#g" file > newfile
All occurrences of exclude are first replaced with tmpvar, then the real matches are replaced with repl, and finally tmpvar is changed back to exclude.

How to combine several GAWK statements?

I have the following:
cat *.csv > COMBINED.csv
sort -k1 -n -t, COMBINED.csv > A.csv
gawk -F ',' '{sub(/[[:lower:]]+/,"",$1)}1' OFS=',' A.csv # REMOVE LOWER CASE CHARACTERS FROM 1st COLUMN
gawk -F ',' 'length($1) == 14 { print }' A.csv > B.csv # REMOVE ANY LINE FROM CSV WHERE VALUE IN FIRST COLUMN IS NOT 14 CHARACTERS
gawk -F ',' '{ gsub("/", "-", $2) ; print }' OFS=',' B.csv > C.csv # REPLACE FORWARD SLASH WITH HYPHEN IN SECOND COLUMN
gawk -F ',' '{print > ("processed/"$1".csv")}' C.csv # SPLIT CSV INTO FILES GROUPED BY VALUE IN FIRST COLUMN AND SAVE THE FILE WITH THAT VALUE
However, I think 4 separate lines is a bit overkill, and I was wondering whether I could optimise it or at least streamline it into a one-liner.
I've tried piping the data but keep getting stuck in a mix of errors.
Thanks
In awk you can chain multiple pattern-action blocks:
pattern1 { action1 }
pattern2 { action2 }
pattern3 { action3 }
So every time a record is read, it is processed by first running pattern-action 1, then pattern-action 2, and so on.
In your case, it seems like you can do:
awk 'BEGIN{FS=OFS=","}
# remove lower case characters from first column
{sub(/[[:lower:]]+/,"",$1)}
# process only lines with 14 characters in first column
(length($1) != 14) { next }
# replace forward slash with hyphen
{ gsub("/", "-", $2) }
{ print > ("processed/" $1 ".csv") }' <(sort -k1 -n -t, COMBINED.csv)
You could also do the sorting inside GNU awk itself, but to mimic sort's behaviour exactly we would need to know your input format.
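For illustration, a rough GNU-awk-only sketch of that idea using PROCINFO["sorted_in"] (this assumes the first column is purely numeric, as sort -n expects):
gawk 'BEGIN { FS = OFS = "," }
      { recs[$1] = recs[$1] $0 ORS }                 # group records by first column
      END { PROCINFO["sorted_in"] = "@ind_num_asc"   # iterate indices in numeric order
            for (k in recs) printf "%s", recs[k] }' COMBINED.csv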

Print Rows if End of Field Matches a String in AWK

I have a csv file and I am trying to print rows using awk if a certain field ends with a specific string. For example, I have the below CSV file:
col1,col2,col3
1,abcd,.abcd_efg
2,efgh,.abcd
3,ijkl,.abcd_mno
4,mnop,.abcd
5,qrst,.abcd_uvw
This is the result I am after:
2,efgh,.abcd
4,mnop,.abcd
But I am getting a different result. This is the awk command I am using:
cat file.csv | awk -F"," '{if ($3 ~ ".abcd" ) print $0}'
and this is the result I am getting:
1,abcd,.abcd_efg
2,efgh,.abcd
3,ijkl,.abcd_mno
4,mnop,.abcd
5,qrst,.abcd_uvw
I even tried the below, but no matches were returned, so it didn't work:
cat file.csv | awk -F"," '{if ($3 ~ ".abcd$" ) print $0}'
Any clue what the issue might be? Am I using the wrong expression to get this result?
EDIT: This is another command I tried, using Kent's solution, but it didn't work:
cat file.csv | awk -F"," '$3 ~ "[.]abcd"'
First of all, the cat in cat file | awk ... is useless; just use awk ... file.
Your input text has no commas at all, so why set FS=","? (The question originally showed space-separated input, hence this remark and the space-separated output below.)
If you want to do an exact string comparison, use $3 == "whatever" instead of $3 ~ /regex/.
So your code could be changed into:
awk '$3 == ".abcd"' file
If you really love regex and want to do it with a regex match:
awk '$3 ~ "[.]abcd$"' file
or
awk '$3 ~ /^[.]abcd$/' file
depending on what you require.
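To see why the original attempt matched every row: . matches any character and ~ performs an unanchored search, so ".abcd" is also found inside ".abcd_efg". A quick check:
$ echo '.abcd_efg' | awk '{print (($0 ~ ".abcd") ? "match" : "no match")}'
match
$ echo '.abcd_efg' | awk '{print (($0 ~ /^[.]abcd$/) ? "match" : "no match")}'
no match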
You may modify your awk command as follows (this answer predates the comma-separated edit, hence the space-separated input and output):
$ cat file.csv | awk '$3 ~ /\.abcd$/ {print $0}'
2 efgh .abcd
4 mnop .abcd
Brief explanation:
$3 ~ /\.abcd$/: if $3 matches the regex \.abcd$ (a literal dot followed by abcd at the end of the field), print $0
According to your modified question, you can change the awk command to:
cat file.csv | awk -F, '$3 ~ /\.abcd$/ {print $0}'

How to use multiple patterns with a condition

The input text in test.txt file:
{"col1":"250000","col2":"8089389","col4":"09876545","col3":"121","col5":"123456789"}
{"col1":"210000","col3":"112","col2":"8089389","col4":"09876545","col5":"123456789"}
{"col1":"120000","col2":"8089389","col3":"123","col4":"09876545","col5":"123456789"}
{"col1":"170000","col2":"8089389","col4":"09876545","col5":"123456789","col3":"123"}
{"col1":"190000","col2":"8089389","col4":"09876545","col5":"123456789,"col3":"124""}
{"col3":"176","col1":"220000","col2":"8089389","col4":"09876545","col5":"123456789"}
The command line and result that I tried:
$ awk -F"," '{for(i=1;i<=NF;i++){ if($i ~ /col1/){print $i} };for (x=1;x<=NF;x++){if($x ~ /col3/){print $x}}}' test.txt
{"col1":"250000"
"col3":"121"
{"col1":"210000"
"col3":"112"
{"col1":"120000"
"col3":"123"
{"col1":"170000"
"col3":"123"
{"col1":"190000"
"col3":"124"
{"col1":"220000"
"col3":"176"
The expected result that I would like to get:
col1:250000,col3:121
col1:210000,col3:112
col1:120000,col3:123
col1:170000,col3:123
col1:190000,col3:124
col1:220000,col3:176
It seems like you are parsing a JSON file, so you can use jq:
$ jq --raw-output '"col1:" + .col1 + ",col3:" + .col3' file.json
col1:250000,col3:121
col1:210000,col3:112
col1:120000,col3:123
col1:170000,col3:123
col1:190000,col3:124
col1:220000,col3:176
For more info: jq manual
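jq's string interpolation gives the same output a bit more readably (-r is short for --raw-output):
$ jq -r '"col1:\(.col1),col3:\(.col3)"' file.json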
try:
awk '{gsub(/\{|\"|\}|\;/,"");match($0,/col1[^,]*/);VAL1=substr($0,RSTART,RLENGTH)",";match($0,/col3[^,]*/);VAL2=substr($0,RSTART,RLENGTH);if(VAL1 && VAL2){print VAL1 VAL2}}' Input_file
Here I globally substitute away the { } " ; characters on each line, then look for matches for the col1 and col3 strings; if both are present, I print them.
EDIT: Also adding a non-one-liner form of the solution:
awk '{
  gsub(/\{|\"|\}|\;/,"");
  match($0,/col1[^,]*/);
  VAL1=substr($0,RSTART,RLENGTH)",";
  match($0,/col3[^,]*/);
  VAL2=substr($0,RSTART,RLENGTH);
  if (VAL1 && VAL2) {
    print VAL1 VAL2
  }
}' Input_file
If you are missing JSON tools, here is an awk hack:
$ awk -F'[:,]' -v OFS=, -v cols='col1,col3' '
{n=split(cols,c);
gsub(/[{}"]/,"");
for(i=1;i<NF;i+=2) a[$i]=$(i+1);
for(i=1;i<=n;i++) printf "%s%s", (c[i]":"a[c[i]]), (i==n?ORS:OFS)}' file
col1:250000,col3:121
col1:210000,col3:112
col1:120000,col3:123
col1:170000,col3:123
col1:190000,col3:124
col1:220000,col3:176
Whenever you're manipulating data that has name->value mappings, it's best to first create an associative array holding that mapping (n2v[] below); then you can print whatever values you want by looking them up by name. For instance, printing col2 and col5 instead only requires changing the final print to print p("col2"), p("col5").
$ cat tst.awk
BEGIN { RS="}"; FS="\""; OFS="," }
NF > 2 {   # skip the empty record left after the final "}"
for (i=2; i<=NF; i+=4) {
n2v[$i] = $(i+2)
}
print p("col1"), p("col3")
}
function p(n) { return (n ":" n2v[n]) }
$ awk -f tst.awk file
col1:250000,col3:121
col1:210000,col3:112
col1:120000,col3:123
col1:170000,col3:123
col1:190000,col3:124
col1:220000,col3:176