Change delimiters in files - csv

Below are the files as they should look and, further down, what I have produced so far. I think the problem in my code is the delimiters, but I can't get it any better.
My source file uses ; as the delimiter, while the files for my database use , as the separator; in addition, the strings are enclosed in double quotes:
The category file should be like this:
"1","1","testcategory","testdescription"
And the manufacturers file, like this:
"24","ASUS",NULL,NULL,NULL
"23","ASROCK",NULL,NULL,NULL
"22","ARNOVA",NULL,NULL,NULL
What I have at this moment:
- category file:
1;2;Alarmen en beveiligingen;
2;2;Apparatuur en toebehoren;
3;2;AUDIO;
- manufacturers file:
315;XTREAMER;NULL;NULL;NULL
316;XTREMEMAC;NULL;NULL;NULL
317;Y-CAM;NULL;NULL;NULL
318;ZALMAN;NULL;NULL;NULL
I experimented a bit with sed, first on the categories file:
cut -d ";" -f1 /home/arno/pixtmp/pixtmp.csv |sort | uniq > /home/arno/pixtmp/categories_description-in.csv
sed 's/^/;2;/g' /home/arno/pixtmp/categories_description-in.csv > /home/arno/pixtmp/categories_description-in.tmp
sed -e "s/$/;/" /home/arno/pixtmp/categories_description-in.tmp > /home/arno/pixtmp/categories_description-in.tmp2
awk 'BEGIN{n=1}{printf("%s%s\n",n++,$0)}' /home/arno/pixtmp/categories_description-in.tmp2 > /home/arno/pixtmp/categories_description$
And then on the manufacturers file:
cut -d ";" -f5 /home/arno/pixtmp/pixtmp.csv |sort | uniq > /home/arno/pixtmp/manufacturers-in
sed 's/^/;/g' /home/arno/pixtmp/manufacturers-in > /home/arno/pixtmp/manufacturers-tmp
sed -e "s/$/;NULL;NULL;NULL/" /home/arno/pixtmp/manufacturers-tmp > /home/arno/pixtmp/manufacturers-tmp2
awk 'BEGIN{n=1}{printf("%s%s\n",n++,$0)}' /home/arno/pixtmp/manufacturers-tmp2 > /home/arno/pixtmp/manufacturers.ok

You were trying to solve the problem by using cut, sed, and AWK. AWK by itself is powerful enough to solve your problem.
I wrote one AWK program that can handle both of your examples. If NULL is not a special case, and the manufacturers' file is a different format, you will need to make two AWK programs but I think it should be clear how to do it.
All we do here is tell AWK that the "field separator" is the semicolon. Then AWK splits the input lines into fields for us. We loop over the fields, printing as we go.
#!/usr/bin/awk -f
BEGIN {
    FS = ";"
    DQUOTE = "\""
}
function add_quotes(s) {
    if (s == "NULL")
        return s
    else
        return DQUOTE s DQUOTE
}
NF > 0 {
    # if input ended with a semicolon, last field will be empty
    if ($NF == "")
        NF -= 1 # subtract one from NF to forget the last field
    if (NF > 0)
    {
        for (i = 1; i <= NF - 1; ++i)
            printf("%s,", add_quotes($i))
        # after the loop i == NF, so this prints the last field
        printf("%s\n", add_quotes($i))
    }
}
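For the manufacturers file specifically, the numbering and the fixed NULL columns can also be generated directly, without the intermediate sed steps. The following is only a sketch: it assumes, as the cut command in the question does, that column 5 of the source holds the manufacturer name. The function name manufacturers_csv and the sample rows are made up for illustration.

```shell
# Sketch: read the ';'-separated source on stdin and write numbered,
# quoted manufacturer rows. Assumes column 5 is the manufacturer name.
manufacturers_csv() {
    cut -d ';' -f5 | sort -u |
        awk '{ printf "\"%d\",\"%s\",NULL,NULL,NULL\n", NR, $0 }'
}

# Hypothetical sample rows; only the fifth column matters here.
printf '%s\n' 'a;b;c;d;ZALMAN' 'a;b;c;d;ASUS' 'a;b;c;d;ASUS' | manufacturers_csv
```

The ids are assigned in sorted order of the names, as in the original pipeline; if the ids have to match existing database rows, they would need to come from somewhere else.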

Related

Convert single column to multiple, ensuring column count on last line

I would like to use AWK (Windows) to convert a text file with a single column to multiple columns - the count specified in the script or on the command line.
This question has been asked before but my final data file needs to have the same column count all the way.
Example of input:
L1
L2
L3
L4
L5
L6
L7
split into 3 columns and ";" as a separator
L1;L2;L3
L4;L5;L6
L7;; <<< here two empty fields are created after end of file, since I used just one on this line.
I tried to modify variants of the typical solution given: NR%4 {printf $0",";next} 1; and a counter, but could not quite get it right.
I would prefer not to count lines before, thereby running over the file multiple times.
You may use this awk solution:
awk -v n=3 '{
sub(/\r$/, "") # removes DOS line break, if present
printf "%s", $0(NR%n ? ";" : ORS)
}
END {
# now we need to add empty columns in last record
if (NR % n) {
for (i=1; i < (n - (NR % n)); ++i)
printf ";"
print ""
}
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following, which combines xargs and awk (as written, it is hardcoded for 3 columns):
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
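With awk, I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
# pad only when the last row is incomplete; otherwise an extra ";;" line would be printed
if (NR%n) {
for(i=NR%n; i<n-1; i++) printf(";")
printf ORS
}
}' file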
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # split the fields sep by OFS
END { if (NR%n) { $0=row; NF=n; print } } # same
' file
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # By using $\ and $/ with the -l the RS and ORS is set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
Any of those print (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)
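Since paste needs one - per column, the column count can be made a parameter by building that list in the shell. This is a rough sketch, not a tested tool: to_columns is a hypothetical helper name, and the tr call is an assumption added because the question mentions Windows, where input lines may end in a carriage return.

```shell
# to_columns N: read one value per line on stdin, write N ';'-separated columns.
# (to_columns is a hypothetical helper name.)
to_columns() {
    n=$1
    set --                      # reuse the positional parameters as the list of dashes
    i=0
    while [ "$i" -lt "$n" ]; do
        set -- "$@" -           # one "-" per requested column
        i=$((i + 1))
    done
    tr -d '\r' | paste -d ';' "$@"   # strip DOS carriage returns, then let paste fill the rows
}

printf '%s\n' L1 L2 L3 L4 L5 L6 L7 | to_columns 3
```

paste pads the last row with empty fields by itself, so the file is read only once and no line counting is needed.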

Awk multiple transformation / separators in once

I have to transform (preprocess) a CSV file, by generating / inserting a new column, being the result of the concat of existing columns.
For example, transform:
A|B|C|D|E
into:
A|B|C|D|C > D|E
In this example, I do it with:
cat myfile.csv | awk 'BEGIN{FS=OFS="|"} {$4 = $4 OFS $3" > "$4} 1'
But now I have something more complex to do, and I can't figure out how to do it.
I have to transform:
A|B|C|x,y,z|E
into
A|B|C|x,y,z|C > x,C > y,C > z|E
How can it be done in awk (or another command) efficiently (my csv file can contain thousands of lines)?
Thanks.
With GNU awk (for gensub which is a GNU extension):
awk -F'|' '{$6=$5; $5=gensub(/(^|,)/,"\\1" $3 " > ","g",$4); print}' OFS='|'
You can split the 4th field into an array. Note that, as written, this overwrites the 4th field rather than inserting a new column after it:
awk 'BEGIN{FS=OFS="|"} {split($4,a,",");$4="";for(i=1;i in a;i++)$4=($4? $4 "," : "") $3 " > " a[i]} 1' myfile.csv
A|B|C|C > x,C > y,C > z|E
There are many ways to do this, but the simplest is the following:
$ awk 'BEGIN{FS=OFS="|"}{t=$4;gsub(/[^,]+/,$3" > &",t);$4 = $4 OFS t}1'
We make a copy of the fourth field in the variable t. In that copy, we replace every run of characters that contains no comma (the separator inside the field) with the content of the third field, followed by " > " and the originally matched string (&).
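One thing to keep in mind with the gsub approach: in the replacement text, & stands for the matched string, so a third field that itself contains & (or a backslash) would not be inserted literally. A possible way around that, sketched here under the assumption that the new column should be inserted after the original fourth field, is to build the new column with split and plain concatenation. The function name expand_field is hypothetical.

```shell
# Sketch: insert a new column after field 4, prefixing each comma-separated
# item of field 4 with "<field 3> > ". No regex replacement is involved,
# so characters such as & in field 3 are copied literally.
expand_field() {
    awk 'BEGIN { FS = OFS = "|" }
    {
        n = split($4, parts, ",")
        out = ""
        for (i = 1; i <= n; i++)
            out = out (i > 1 ? "," : "") $3 " > " parts[i]
        $4 = $4 OFS out          # keep the original field, append the new one
        print
    }'
}

printf '%s\n' 'A|B|C|x,y,z|E' | expand_field
```

It is still a single pass over the file, so it should be fine for thousands of lines.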

How to combine several GAWK statements?

I have the following:
cat *.csv > COMBINED.csv
sort -k1 -n -t, COMBINED.csv > A.csv
gawk -F ',' '{sub(/[[:lower:]]+/,"",$1)}1' OFS=',' A.csv # REMOVE LOWER CASE CHARACTERS FROM 1st COLUMN
gawk -F ',' 'length($1) == 14 { print }' A.csv > B.csv # REMOVE ANY LINE FROM CSV WHERE VALUE IN FIRST COLUMN IS NOT 14 CHARACTERS
gawk -F ',' '{ gsub("/", "-", $2) ; print }' OFS=',' B.csv > C.csv # REPLACE FORWARD SLASH WITH HYPHEN IN SECOND COLUMN
gawk -F ',' '{print > ("processed/"$1".csv")}' C.csv # SPLIT CSV INTO FILES GROUPED BY VALUE IN FIRST COLUMN AND SAVE THE FILE WITH THAT VALUE
However, I think 4 separate lines is a bit overkill and was wondering whether I could optimise it or at least streamline it into a one-liner?
I've tried piping the data but keep getting stuck in a mix of errors.
Thanks
In awk you can append multiple actions as:
pattern1 { action1 }
pattern2 { action2 }
pattern3 { action3 }
So every time a record is read, awk tests it against pattern1 and runs action1 if it matches, then does the same for pattern2, and so on.
In your case, it seems like you can do:
awk 'BEGIN{FS=OFS=","}
# remove lower case characters from first column
{sub(/[[:lower:]]+/,"",$1)}
# process only lines with 14 characters in first column
(length($1) != 14) { next }
# replace forward slash with hyphen
{ gsub("/", "-", $2) }
{ print > ("processed/" $1 ".csv") }' <(sort -k1 -n -t, COMBINED.csv)
You could also do the sorting inside GNU awk, but to mimic the sort exactly we would need to know your input format.
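To see how the chained rules and next interact, here is a small sketch that can be run without any files. It is the same rule chain, except that it prints to stdout instead of writing per-key files; the function name clean_rows and the two sample rows are invented for the demonstration.

```shell
# Sketch: same chain of rules as above, but printing to stdout so the
# effect of each rule can be checked on a couple of made-up rows.
clean_rows() {
    awk 'BEGIN { FS = OFS = "," }
         { sub(/[[:lower:]]+/, "", $1) }   # rule 1: runs for every record
         length($1) != 14 { next }          # rule 2: drop the record, later rules are skipped
         { gsub("/", "-", $2) }             # rule 3: only reached by surviving records
         { print }                          # rule 4
    '
}

printf '%s\n' 'abc12345678901234,01/02/2020,x' 'short,03/04/2020,y' | clean_rows
```

Only the first row survives: its first column is 14 characters long once the lower-case letters are removed, while the second row is dropped by next before the later rules see it.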

How to grep specific value from JSON file?

I have a JSON file and content like below:
[
{
"id":"54545-f919-4b0f-930c-0117d6e6c987",
"name":"Inventory_Groups",
"path":"/Groups",
"subGroups":[
{
"id":"343534-394b-429a-834e-f8774240d736",
"name":"UserGroup",
"path":"/Groups/UserGroup",
"subGroups":[
]
}
]
}
]
Now I want to grep the value of the id key from the subGroups area. How can I achieve this? If the id key were not duplicated, it could be done with:
grep -o '"id": "[^"]*' Group.json | grep -o '[^"]*$'
But in my case how can I get the value of id as it appears two times?
A valid question to ask your employer is why you're in a position to use the shell but not to use appropriate linux packages. Compare:
awk -F '[":,]+' '$2=="subGroups" {f=1} f && $2=="id" {print $3; exit}' file
(Brittle solution, will fail if the structure of your JSON changes)
To:
jq '.[].subGroups[].id' file
Which can handle compact JSON in addition to numerous other realistic complications.
Using just standard UNIX tools and assuming your sed can tolerate input without a terminating newline (otherwise we can swap out the tr for an awk command that keeps the last newline):
$ tr -d '\n' < file | sed 's/.*"subGroups":[^]}]*"id":"\([^"]*\)\".*/\1\n/'
343534-394b-429a-834e-f8774240d736
Alternatively with just a call to any awk:
$ awk '
{ rec = (NR>1 ? rec ORS : "") $0 }
END {
gsub(/.*"subGroups":[^]}]*"id":"|".*/,"",rec)
print rec
}
' file
343534-394b-429a-834e-f8774240d736

package to query tab separated files in bash

I often have to conduct very simple queries on tab separated files in bash. For example summing/counting/max/min all the values in the n-th column. I usually do this in awk via command-line, but I've grown tired of re-writing the same one line scripts over and over and I'm wondering if there is a known package or solution for this.
For example, consider the text file (test.txt):
apples joe 4
oranges bill 3
apples sally 2
I can query this as:
awk '{ val += $3 } END { print "sum: "val }' test.txt
Also, I may want a where clause:
awk '{ if ($1 == "apples") { val += $3 } } END { print "sum: "val }' test.txt
Or a group by:
awk '{ val[$1] += $3 } END { for(k in val) { print k": "val[k] } }' test.txt
What I would rather do is:
query 'sum($3)' test.txt
query 'sum($3) where $1 = "apples"' test.txt
query 'sum($3) group by $1' test.txt
@Wintermute posted a link to a great tool for this in the comments below. Unfortunately it does have one drawback:
$ time gawk '{ a += $6 } END { print a }' my1GBfile.tsv
28371787287
real 0m2.276s
user 0m1.909s
sys 0m0.313s
$ time q -t 'select sum(c6) from my1GBfile.tsv'
28371787287
real 3m32.361s
user 3m27.078s
sys 0m1.983s
It also loads the entire file into memory. Obviously this is necessary in some cases, but it doesn't work for me, as I often work with large files.
Wintermute's answer: Tools like q that can run SQL queries directly on CSVs.
Ed Morton's answer: See https://stackoverflow.com/a/15765479/1745001
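Short of a full package, the one-liners from the question can be wrapped in small shell functions kept in a startup file. The following is a minimal sketch of that idea, not an existing tool: the names colsum, colsum_where and groupsum are invented here, and the input is assumed to be tab-separated as described in the question.

```shell
# colsum COL [FILE...]: sum of column COL
colsum() {
    col=$1; shift
    awk -F '\t' -v c="$col" '{ s += $c } END { print s + 0 }' "$@"
}

# colsum_where KEYCOL VALUE VALCOL [FILE...]: sum of VALCOL where KEYCOL equals VALUE
colsum_where() {
    kcol=$1; want=$2; vcol=$3; shift 3
    awk -F '\t' -v k="$kcol" -v w="$want" -v v="$vcol" \
        '$k == w { s += $v } END { print s + 0 }' "$@"
}

# groupsum KEYCOL VALCOL [FILE...]: sum of VALCOL per distinct value of KEYCOL
groupsum() {
    key=$1; val=$2; shift 2
    awk -F '\t' -v k="$key" -v v="$val" \
        '{ s[$k] += $v } END { for (g in s) print g "\t" s[g] }' "$@" | sort
}

# The sample data from the question, with real tabs between the columns.
printf 'apples\tjoe\t4\noranges\tbill\t3\napples\tsally\t2\n' | groupsum 1 3
```

Because awk streams the input, colsum and colsum_where keep only a running total in memory, and groupsum keeps one entry per distinct key, so none of them needs to load the whole file.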