Avoid printing the last line twice in awk - JSON

I'm trying to convert a file with a single column into JSON format, and I thought awk would be a great tool for this. My input is (for example):
a
b
c
d
e
And the output I want is:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'}]}
I tried two different approaches. The first one is:
BEGIN{
FS = "\t"
printf "{nodes:["
}
{printf "{'id':'%s'},\n",$1}
END{printf "{'id':'%s'}]}\n",$1}
But this prints the last line twice:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'},
{id='e'}]}
The other option I tried uses getline:
BEGIN{
FS = "\t"
printf "{nodes:["
}
{printf getline==0 ? "{'id':'%s'}]}" : "{'id':'%s'},\n",$1}
But for some reason, getline is always 1 instead of 0 on the last line, so:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'},
Any suggestions to solve my problem?

In awk: buffer the output in variable b and process it before outputting:
$ awk 'BEGIN{b="{nodes:["}{b=b "{id=\x27" $0 "\x27},\n"}END{sub(/,\n$/,"]}",b);print b}' file
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'}]}
Explained:
BEGIN { b="{nodes:[" } # front matter
{ b=b "{id=\x27" $0 "\x27},\n" } # middle
END { sub(/,\n$/,"]}",b); print b } # end matter and replace ,\n in the end
# with something more appropriate
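A note on the \x27 escapes: 0x27 is the ASCII code for a single quote, so this is just a way to get a literal ' into the output without closing and reopening the shell's single-quoted awk program. Keep in mind that \xHH escapes are a gawk/common-awk extension rather than something POSIX awk guarantees.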

Solution (thanks to @Ruud and @suleiman)
BEGIN{
FS = "\t"
printf "{'nodes':["
}
NR > 1{printf "{'id':'%s'},\n",prev}
{prev = $1}
END{printf "{'id':'%s'}]}",prev}
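A side note: the output above uses single quotes, which a strict JSON parser will reject. If you need parseable JSON, a minimal variation of the same prev-line technique (only the quoting changes) could be:
BEGIN{
FS = "\t"
printf "{\"nodes\":["
}
NR > 1{printf "{\"id\":\"%s\"},\n",prev}
{prev = $1}
END{printf "{\"id\":\"%s\"}]}\n",prev}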

Try this -
awk -v count=$(wc -l < f) 'BEGIN{kk=getline;printf "{nodes:[{'id':'%s'},\n",$kk}
{
if(NR < count)
{
{printf "{'id':'%s'},\n",$1}
}}
END{printf "{'id':'%s'}]}\n",$1}' f
{nodes:[{id:a},
{id:b},
{id:c},
{id:d},
{id:e}]}

Related

Convert single column to multiple, ensuring column count on last line

I would like to use AWK (on Windows) to convert a text file with a single column to multiple columns, with the column count specified in the script or on the command line.
This question has been asked before, but my final data file needs to have the same column count on every line.
Example of input:
L1
L2
L3
L4
L5
L6
L7
Split into 3 columns with ";" as the separator:
L1;L2;L3
L4;L5;L6
L7;; <<< here two empty fields are created at the end, since only one value was left for this line.
I tried to modify variants of the typical solution given: NR%4 {printf $0",";next} 1; and a counter, but could not quite get it right.
I would prefer not to count lines before, thereby running over the file multiple times.
You may use this awk solution:
awk -v n=3 '{
sub(/\r$/, "") # removes DOS line break, if present
printf "%s", $0(NR%n ? ";" : ORS)
}
END {
# now we need to add empty columns in last record
if (NR % n) {
for (i=1; i < (n - (NR % n)); ++i)
printf ";"
print ""
}
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following, which uses an xargs + awk combination to achieve the needed outcome.
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
For an awk I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
if (NR%n) {   # pad only when the last row is incomplete
for(i=NR%n; i<n-1; i++) printf(";")
printf ORS
}
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # split the fields sep by OFS
END { if (NR%n) { $0=row; NF=n; print } } # same
' file
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # By using $\ and $/ with the -l the RS and ORS is set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
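If you'd rather not type the dashes by hand for larger values of n, one hedged way to generate them (assuming seq is available) is:
n=3
dashes=$(printf ' -%.0s' $(seq "$n"))   # expands to one "-" per desired column
paste -d';' $dashes < file              # $dashes deliberately unquoted so it splits into n arguments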
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)

AWK: comparing 2 columns from 2 CSV files, outputting to a third. How do I also send the output that doesn't match to another file?

I currently have the following script:
awk -F, 'NR==FNR { a[$1 FS $4]=$0; next } $1 FS $4 in a { printf a[$1 FS $4]; sub($1 FS $4,""); print }' file1.csv file2.csv > combined.csv
This compares columns 1 & 4 from both CSV files and outputs the result from both files to combined.csv. Is it possible to output the lines from file 1 & file 2 that don't match to other files with the same awk line, or would I need to do separate passes?
File1
ResourceName,ResourceType,PatternType,User,Host,Operation,PermissionType
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow
File2
topic,groupName,Name,User,email,team,contact,teamemail,date,clienttype
BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby#example.com,team 1,Bobby,boys#example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby#example.com,team 1,Bobby,boys#example.com,2021-11-26T10:10:17Z,Consumer
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady#example.com,team 1,Susan,girls#example.com,2021-11-26T10:10:17Z,Producer
combined
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow,BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby#example.com,team 1,Bobby,boys#example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow,BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby#example.com,team 1,Bobby,boys#example.com,2021-11-26T10:10:17Z,Consumer
Wanted additional files:
non matched file1:
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
non matched file2:
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady#example.com,team 1,Susan,girls#example.com,2021-11-26T10:10:17Z,Producer
Again, I might be trying to do too much in one line? Would it be wiser to run another pass?
Assuming the key pairs of $1 and $4 are unique within each input file, then using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { next }
{ key = $1 FS $4 }
NR==FNR {
file1[key] = $0
next
}
key in file1 {
print file1[key], $0 > "out_combined"
delete file1[key]
next
}
{
print > "out_file2_only"
}
END {
for (key in file1) {
print file1[key] > "out_file1_only"
}
}
$ awk -f tst.awk file{1,2}
$ head out_*
==> out_combined <==
BIG.TestTopic,Cluster,LITERAL,Bigboy,*,Create,Allow,BIG.TestTopic,BIG.ConsumerGroup,Bobby,Bigboy,bobby#example.com,team 1,Bobby,boys#example.com,2021-11-26T10:10:17Z,Consumer
BIG.DEVtopic,Cluster,LITERAL,Oldboy,*,DescribeConfigs,Allow,BIG.DEVtopic,BIG.ConsumerGroup,Bobby,Oldboy,bobby#example.com,team 1,Bobby,boys#example.com,2021-11-26T10:10:17Z,Consumer
==> out_file1_only <==
BIG.PRETopic,Cluster,LITERAL,Smallboy,*,Create,Allow
==> out_file2_only <==
BIG.TestTopic,BIG.ConsumerGroup,Susan,Younglady,younglady#example.com,team 1,Susan,girls#example.com,2021-11-26T10:10:17Z,Producer
The order of lines in out_file1_only will be shuffled by the in operator - if that's a problem let us know as it's an easy tweak to retain the input order.
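As noted, retaining the input order is an easy tweak; a sketch of one way to do it (remembering, in a numerically indexed array, the order in which keys are first seen in file1, then walking that array in END) is:
BEGIN { FS=OFS="," }
FNR==1 { next }
{ key = $1 FS $4 }
NR==FNR {
file1[key] = $0
order[++numKeys] = key               # remember insertion order
next
}
key in file1 {
print file1[key], $0 > "out_combined"
delete file1[key]
next
}
{
print > "out_file2_only"
}
END {
for (i=1; i<=numKeys; i++)
if (order[i] in file1)               # still present = never matched
print file1[order[i]] > "out_file1_only"
}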

Merging rows based on cell contents in a CSV file

I am trying to merge rows with a matching first cell in a CSV file, so that the following cells are placed in their rightful columns based on matching strings.
I have a file with the following contents:
item,pieces,color,last order
"apples","4 pieces"
"apples","red color"
"apples","last ordered 2 hours ago"
"mangos","1 piece"
"mangos","last ordered 1 day ago"
"carrots","10 pieces"
"carrots","orange color"
Which then should be merged into the following:
item,pieces,color,last order
"apples","4 pieces","red color","last ordered 2 hours ago"
"mangos","1 piece","","last ordered 1 day ago"
"carrots","10 pieces","orange color",""
The code I have used for this:
awk '{ printf "%s", $0; if (NR % 3 == 0) print ""; else printf "," }' file.csv
This method of merging three rows at a time worked, with a little manual editing, as long as all items had the three pieces of data: "pieces", "color" & "last order".
However, this does not work when different items have different sets of data.
You may try this awk:
awk 'BEGIN {FS=OFS=","} NR == 1 {print; next} item != $1 {if (item != "") print item, pieces, color, order; item = $1; pieces = $2; color = order = "\"\""; next} {if ($2 ~ /color/) color = $2; else order = $2} END {print item, pieces, color, order}' file
item,pieces,color,last order
"apples","4 pieces","red color","last ordered 2 hours ago"
"mangos","1 piece","","last ordered 1 day ago"
"carrots","10 pieces","orange color",""
A more readable version:
awk 'BEGIN {
FS = OFS = ","
}
NR == 1 {
print
next
}
item != $1 {
if (item != "")
print item, pieces, color, order
item = $1
pieces = $2
color = order = "\"\""
next
}
{
if ($2 ~ /color/)
color = $2
else
order = $2
}
END {
print item, pieces, color, order
}' file
$ cat tst.awk
BEGIN { FS=OFS="," }
{ gsub(/"/,"") }
NR==1 {
print
sub(/s,/,",")              # "pieces" -> "piece" so the tag matches "1 piece" as well as "4 pieces"
numTags = split($0,tags)
next
}
$1 != prev {
if ( prev != "" ) {
prt()
}
prev=$1
}
{
tag = tags[1]
tag2val[tag] = $1
for (tagNr=2; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
if ( index($2,tag) ) {
tag2val[tag] = $2
next
}
}
}
END { prt() }
function prt( tagNr,tag,val) {
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
val = tag2val[tag]
printf "\"%s\"%s", val, (tagNr<numTags ? OFS : ORS)
}
delete tag2val
}
$ awk -f tst.awk file
item,pieces,color,last order
"apples","4 pieces","red color","last ordered 2 hours ago"
"mangos","1 piece","","last ordered 1 day ago"
"carrots","10 pieces","orange color",""
Check this out:
awk -F, '{ if (f == $1) { for (c=0; c <length($1) + length(FS); c++) printf " "; print $2 FS $3 } else { print $0 } } { f = $1 }' yourfile.csv
The following implementation works so long as all our items are grouped together in blocks of lines within the input file. The implementation caches fields for a single output record and prints that when it sees a different item (or alternatively when the script END's).
awk -F, -v OFS=, '
NR==1 { f1=$1; f2=$2; f3=$3; f4=$4 }
FNR==1 { next }
f1 != $1 {
print f1, f2, f3, f4
f1=$1
f2=f3=f4="\"\""
}
$2 ~ /piece/ { f2 = $2; next }
$2 ~ /color/ { f3 = $2; next }
$2 ~ /last order/ { f4 = $2; next }
END { print f1, f2, f3, f4 }
' file.csv
If items are not grouped, we have to cache the whole file(s)... In the following, f basically stores/represents the file as a table. The k array stores the line number for the output record of a given item, so that when we print, our output records come out in the input file's order.
awk -F, -v OFS=, '
BEGIN { n = 0; split("", k); split("", f) }
NR==1 { f[++n,1]=$1; f[n,2]=$2; f[n,3]=$3; f[n,4]=$4 }
FNR==1 { next }
!($1 in k) {
k[$1] = ++n
f[n,1]=$1
f[n,2]=f[n,3]=f[n,4]="\"\""
}
$2 ~ /piece/ { f[k[$1],2] = $2; next }
$2 ~ /color/ { f[k[$1],3] = $2; next }
$2 ~ /last order/ { f[k[$1],4] = $2; next }
END {
for (i=1; i<=n; ++i)
print f[i,1], f[i,2], f[i,3], f[i,4]
}
' file.csv
Notes:
The NR/FNR dance in both implementations is an attempt at handling the header line and multiple input files (and we may or may not care about handling multiple file input).
The latter implementation permits overwriting of an item's fields... If we don't want that behavior, we would have to change the logic in the lines that do the regex comparisons on input field $2; a sketch of such a change follows.
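For instance, a minimal sketch that keeps only the first value seen for each field (assuming the same f and k arrays as above, where a field still holding the quoted-empty default "" means it hasn't been set) would replace those three lines with:
$2 ~ /piece/      { if (f[k[$1],2] == "\"\"") f[k[$1],2] = $2; next }
$2 ~ /color/      { if (f[k[$1],3] == "\"\"") f[k[$1],3] = $2; next }
$2 ~ /last order/ { if (f[k[$1],4] == "\"\"") f[k[$1],4] = $2; next }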

Trying to read from specific fields of a CSV file

The code provided reads a CSV file and prints the count of all strings found, in descending order. However, I would like to know how to specify which fields should be included in the count. For example:
./example-awk.awk 1,2 file.csv would read strings from fields 1 and 2 and print the counts
#!/bin/awk -f
BEGIN {
FIELDS = ARGV[1];
delete ARGV[1];
FS = ", *"
}
{
for(i = 1; i <= NF; i++)
if(FNR != 1)
data[++data_index] = $i
}
END {
produce_numbers(data)
PROCINFO["sorted_in"] = "#val_num_desc"
for(i in freq)
printf "%s\t%d\n", i, freq[i]
}
function produce_numbers(sortedarray)
{
n = asort(sortedarray)
for(i = 1 ; i <= n; i++)
{
freq[sortedarray[i]]++
}
return
}
This is currently the code I am working with; ARGV[1] will of course be the specified fields. I am unsure how to go about storing this value so I can use it.
For example, ./example-awk.awk 1,2 simple.csv with simple.csv containing:
A,B,C,A
B,D,C,A
C,D,A,B
D,C,A,A
Should result in
D 3
C 2
B 2
A 1
Because it only counts strings in fields 1 and 2
EDIT (as per OP's request): The OP needs a solution using ARGV, so I'm adding one here (NOTE: cat script.awk is only used to show the contents of the awk script).
cat script.awk
BEGIN{
FS=","
OFS="\t"
for(i=1;i<(ARGC-1);i++){
arr[ARGV[i]]
delete ARGV[i]
}
}
{
for(i in arr){ value[$i]++ }
}
END{
PROCINFO["sorted_in"] = "#ind_str_desc"
for(j in value){
print j,value[j]
}
}
Now when we run it as follows:
awk -f script.awk 1 2 Input_file
D 3
C 2
B 2
A 1
My original solution: please try the following, written and tested with the shown samples. It is a generic solution in which the awk program has a variable named fields where you can list all the field numbers you want to process, separated by commas.
awk -v fields="1,2" '
BEGIN{
FS=","
OFS="\t"
num=split(fields,arr,",")
for(i=1;i<=num;i++){
key[arr[i]]
}
}
{
for(i in key){
value[$i]++
}
}
END{
for(i in value){
print i,value[i]
}
}' Input_file | sort -rk1
Output will be as follows.
D 3
C 2
B 2
A 1
Don't use a shebang to invoke awk in a shell script as that robs you of the ability to use the shell and awk separately for what they both do best. Use the shebang to invoke your shell and then call awk within the script. You also don't need to use gawk-only sorting functions for this:
$ cat tst.sh
#!/usr/bin/env bash
(( $# == 2 )) || { echo "bad args: $0 $*" >&2; exit 1; }
cols=$1
shift
awk -v cols="$cols" '
BEGIN {
FS = ","
OFS = "\t"
split(cols,tmp)
for (i in tmp) {
fldNrs[tmp[i]]
}
}
{
for (fldNr in fldNrs) {
val = $fldNr
cnt[val]++
}
}
END {
for (val in cnt) {
print val, cnt[val]
}
}
' "${@:--}" |
sort -r
$ ./tst.sh 1,2 file
D 3
C 2
B 2
A 1
I decided to give it a go in the spirit of the OP's attempt, as kids don't learn if kids don't play (trying ARGIND manipulation, which doesn't work, then delete ARGV[], and some other things that also didn't work):
$ gawk '
BEGIN {
FS=","
OFS="\t"
split(ARGV[1],t,/,/) # field list picked from ARGV
for(i in t) # from vals to index
h[t[i]]
delete ARGV[1] # ARGIND manipulation doesn't work
}
{
for(i in h) # subset of fields processes
a[$i]++ # count hits
}
END {
PROCINFO["sorted_in"]="#val_num_desc" # ordering from OPs attempt
for(i in a)
print i,a[i]
}' 1,2 file
Output
D 3
B 2
C 2
A 1
You could as well drop the ARGV[] manipulation and replace the BEGIN block with:
$ gawk -v var=1,2 '
BEGIN {
FS=","
OFS="\t"
split(var,t,/,/) # field list picked from a var
for(i in t) # from vals to index
h[t[i]]
} ...

Comparing split strings inside fields of two CSV files

I have a CSV file (file1) that looks something like this:
123,info,ONE NAME
124,info,ONE VARIATION
125,info,NAME ANOTHER
126,info,SOME TITLE
and another CSV file (file2) that looks like this:
1,info,NAME FIRST
2,info,TWO VARIATION
3,info,NAME SECOND
4,info,ANOTHER TITLE
My desired output would be:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER
Where, if the first word in comma-delimited field 3 (i.e. NAME in line 1) of file2 is equal to any of the words in field 3 of file1, a line is printed with the format:
field1(file2),field1(file1),field3(file2),field3(file1)
Each file has the same number of lines, and matches are only made between lines with the same line number.
I know I can split fields and get the first word in field3 in Awk like this:
awk -F"," '{split($3,a," "); print a[1]}' file
But since I'm only moderately competent in Awk, I'm at a loss for how to approach a job where two files are compared using splits.
I could do it in Python like this:
with open('file1', 'r') as f1, open('file2', 'r') as f2:
    l1 = f1.readlines()
    l2 = f2.readlines()
    for i in range(len(l1)):
        line_1 = l1[i].split(',')
        line_2 = l2[i].split(',')
        field_3_1 = line_1[2].split()
        field_3_2 = line_2[2].split()
        if field_3_2[0] in field_3_1:
            one = ' '.join(field_3_1)
            two = ' '.join(field_3_2)
            print(','.join((line_2[0], line_1[0], two, one)))
But I'd like to know how a job like this would be done in Awk as occasionally I use shells where only Awk is available.
This may seem like a strange task, and my example may be a bit confusing, but I need to perform it to check for broken/ill-formatted data in one of the files.
awk -F, -vOFS=, '
{
num1 = $1
name1 = $3
split(name1, words1, " ")
getline <"file2"
split($3, words2, " ")
for (i in words1)
if (words2[1] == words1[i]) {
print $1, num1, $3, name1
break
}
}
' file1
Output:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER
You can try something along these lines, although the following prints only one match for each line in the second file:
awk -F, 'FNR==NR {
count= split($3, words, " ");
for (i=1; i <= count; i++) {
field1hash[words[i]]=$1;
field3hash[$1]=$3;
}
next;
}
{
split($3,words," ");
if (field1hash[words[1]]) {
ff1 = field1hash[words[1]];
print $1","ff1","$3","field3hash[ff1]
}
}' file1 file2
I like @ooga's answer better than this:
awk -F, -v OFS=, '
NR==FNR {
split($NF, a, " ")
data[NR,"word"] = a[1]
data[NR,"id"] = $1
data[NR,"value"] = $NF
next
}
{
n = split($NF, a, " ")
for (i=1; i<=n; i++)
if (a[i] == data[FNR,"word"])
print data[FNR,"id"], $1, data[FNR,"value"], $NF
}
' file2 file1