Trying to read from specific fields of a CSV file

The code provided reads a CSV file and prints the count of all strings found, in descending order. However, I would like to know how to specify which fields should be read and counted. For example,
./example-awk.awk 1,2 file.csv
would read strings from fields 1 and 2 and print the counts.
#!/bin/awk -f
BEGIN {
    FIELDS = ARGV[1];
    delete ARGV[1];
    FS = ", *"
}
{
    for(i = 1; i <= NF; i++)
        if(FNR != 1)
            data[++data_index] = $i
}
END {
    produce_numbers(data)
    PROCINFO["sorted_in"] = "#val_num_desc"
    for(i in freq)
        printf "%s\t%d\n", i, freq[i]
}
function produce_numbers(sortedarray)
{
    n = asort(sortedarray)
    for(i = 1 ; i <= n; i++)
    {
        freq[sortedarray[i]]++
    }
    return
}
This is currently the code I am working with; ARGV[1] will of course be the specified fields. I am unsure how to go about storing this value so I can use it.
For example ./example-awk.awk 1,2 simple.csv with simple.csv containing
A,B,C,A
B,D,C,A
C,D,A,B
D,C,A,A
Should result in
D 3
C 2
B 2
A 1
Because it only counts strings in fields 1 and 2

EDIT (as per OP's request): The OP needs a solution that uses ARGV, so adding one accordingly (NOTE: cat script.awk is only shown to display the content of the actual awk script).
cat script.awk
BEGIN{
    FS=","
    OFS="\t"
    for(i=1;i<(ARGC-1);i++){
        arr[ARGV[i]]
        delete ARGV[i]
    }
}
{
    for(i in arr){ value[$i]++ }
}
END{
    PROCINFO["sorted_in"] = "#ind_str_desc"
    for(j in value){
        print j,value[j]
    }
}
Now when we run it as follows:
awk -f script.awk 1 2 Input_file
D 3
C 2
B 2
A 1
My original solution: could you please try the following, written and tested with the shown samples. It is a generic solution in which the awk program has a variable named fields where you can list all the field numbers you want to deal with, separated by commas.
awk -v fields="1,2" '
BEGIN{
    FS=","
    OFS="\t"
    num=split(fields,arr,",")
    for(i=1;i<=num;i++){
        key[arr[i]]
    }
}
{
    for(i in key){
        value[$i]++
    }
}
END{
    for(i in value){
        print i,value[i]
    }
}' Input_file | sort -rk1
Output will be as follows.
D 3
C 2
B 2
A 1

Don't use a shebang to invoke awk in a shell script as that robs you of the ability to use the shell and awk separately for what they both do best. Use the shebang to invoke your shell and then call awk within the script. You also don't need to use gawk-only sorting functions for this:
$ cat tst.sh
#!/usr/bin/env bash
(( $# == 2 )) || { echo "bad args: $0 $*" >&2; exit 1; }
cols=$1
shift
awk -v cols="$cols" '
BEGIN {
    FS = ","
    OFS = "\t"
    split(cols,tmp)
    for (i in tmp) {
        fldNrs[tmp[i]]
    }
}
{
    for (fldNr in fldNrs) {
        val = $fldNr
        cnt[val]++
    }
}
END {
    for (val in cnt) {
        print val, cnt[val]
    }
}
' "${@:--}" |
sort -r
$ ./tst.sh 1,2 file
D 3
C 2
B 2
A 1

I decided to give it a go in the spirit of OP's attempt, since kids don't learn if kids don't play. I tried ARGIND manipulation (it doesn't work), delete ARGV[], and some other approaches that also didn't work:
$ gawk '
BEGIN {
    FS=","
    OFS="\t"
    split(ARGV[1],t,/,/)   # field list picked from ARGV
    for(i in t)            # from vals to index
        h[t[i]]
    delete ARGV[1]         # ARGIND manipulation does not work
}
{
    for(i in h)            # process only the subset of fields
        a[$i]++            # count hits
}
END {
    PROCINFO["sorted_in"]="#val_num_desc"   # ordering from OP's attempt
    for(i in a)
        print i,a[i]
}' 1,2 file
Output
D 3
B 2
C 2
A 1
You could as well drop the ARGV[] manipulation and replace the BEGIN block with:
$ gawk -v var=1,2 '
BEGIN {
    FS=","
    OFS="\t"
    split(var,t,/,/)   # field list picked from a var
    for(i in t)        # from vals to index
        h[t[i]]
} ...

Related

Merging rows based on cell contents in a CSV file

I am trying to merge rows with a matching first cell in a CSV file, so that the following cells are placed in their rightful columns based on matching strings.
I have a file with the following contents:
item,pieces,color,last order
"apples","4 pieces"
"apples","red color"
"apples","last ordered 2 hours ago"
"mangos","1 piece"
"mangos","last ordered 1 day ago"
"carrots","10 pieces"
"carrots","orange color"
Which then should be merged into the following:
item,pieces,color,last order
"apples","4 pieces","red color","last ordered 2 hours ago"
"mangos","1 piece","","last ordered 1 day ago"
"carrots","10 pieces","orange color",""
The code I have used for this:
awk '{ printf "%s", $0; if (NR % 3 == 0) print ""; else printf "," }' file.csv
This method of merging three rows at a time worked, with a little manual editing, as long as all items had the three pieces of data "pieces", "color" & "last order".
However, it no longer works because different items have different sets of data.
You may try this awk:
awk 'BEGIN {FS=OFS=","} NR == 1 {print; next} item != $1 {if (item != "") print item, pieces, color, order; item = $1; pieces = $2; color = order = "\"\""; next} {if ($2 ~ /color/) color = $2; else order = $2} END {print item, pieces, color, order}' file
item,pieces,color,last order
"apples","4 pieces","red color","last ordered 2 hours ago"
"mangos","1 piece","","last ordered 1 day ago"
"carrots","10 pieces","orange color",""
A more readable version:
awk 'BEGIN {
    FS = OFS = ","
}
NR == 1 {
    print
    next
}
item != $1 {
    if (item != "")
        print item, pieces, color, order
    item = $1
    pieces = $2
    color = order = "\"\""
    next
}
{
    if ($2 ~ /color/)
        color = $2
    else
        order = $2
}
END {
    print item, pieces, color, order
}' file
$ cat tst.awk
BEGIN { FS=OFS="," }
{ gsub(/"/,"") }
NR==1 {
    print
    sub(/s,/,",")
    numTags = split($0,tags)
    next
}
$1 != prev {
    if ( prev != "" ) {
        prt()
    }
    prev=$1
}
{
    tag = tags[1]
    tag2val[tag] = $1
    for (tagNr=2; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        if ( index($2,tag) ) {
            tag2val[tag] = $2
            next
        }
    }
}
END { prt() }
function prt(   tagNr,tag,val) {
    for (tagNr=1; tagNr<=numTags; tagNr++) {
        tag = tags[tagNr]
        val = tag2val[tag]
        printf "\"%s\"%s", val, (tagNr<numTags ? OFS : ORS)
    }
    delete tag2val
}
$ awk -f tst.awk file
item,pieces,color,last order
"apples","4 pieces","red color","last ordered 2 hours ago"
"mangos","1 piece","","last ordered 1 day ago"
"carrots","10 pieces","orange color",""
Check this out:
awk -F, '{ if (f == $1) { for (c=0; c <length($1) + length(FS); c++) printf " "; print $2 FS $3 } else { print $0 } } { f = $1 }' yourfile.csv
The following implementation works so long as all our items are grouped together in blocks of lines within the input file. The implementation caches fields for a single output record and prints that when it sees a different item (or alternatively when the script END's).
awk -F, -v OFS=, '
NR==1 { f1=$1; f2=$2; f3=$3; f4=$4 }
FNR==1 { next }
f1 != $1 {
    print f1, f2, f3, f4
    f1=$1
    f2=f3=f4="\"\""
}
$2 ~ /piece/ { f2 = $2; next }
$2 ~ /color/ { f3 = $2; next }
$2 ~ /last order/ { f4 = $2; next }
END { print f1, f2, f3, f4 }
' file.csv
If items are not grouped, we have to cache the whole file(s)... In the following, f basically stores/represents the file as a table. The k array stores the line number of the output record for a given item so that, when we print, our output records come out in the input file's order.
awk -F, -v OFS=, '
BEGIN { n = 0; split("", k); split("", f) }
NR==1 { f[++n,1]=$1; f[n,2]=$2; f[n,3]=$3; f[n,4]=$4 }
FNR==1 { next }
!($1 in k) {
    k[$1] = ++n
    f[n,1]=$1
    f[n,2]=f[n,3]=f[n,4]="\"\""
}
$2 ~ /piece/ { f[k[$1],2] = $2; next }
$2 ~ /color/ { f[k[$1],3] = $2; next }
$2 ~ /last order/ { f[k[$1],4] = $2; next }
END {
    for (i=1; i<=n; ++i)
        print f[i,1], f[i,2], f[i,3], f[i,4]
}
' file.csv
Notes:
The NR/FNR dance in both implementations is an attempt at handling the header line and multiple input files (we may or may not care about handling multiple-file input).
The latter implementation permits overwriting an item's fields. If we don't want that behavior, we have to change the logic; the lines to change are the ones that do the regex comparisons on input field $2.
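For instance, the keep-first change described in that note could be sketched like this (shown against the first, grouped-input implementation for brevity; the guard simply refuses to assign a field once it no longer holds the "" placeholder, so later duplicate lines for the same item cannot overwrite it):

```shell
# Keep-first sketch: only the first "pieces" value per item is recorded,
# so the later "5 pieces" line below is ignored.
printf '%s\n' 'item,pieces,color,last order' \
              '"apples","4 pieces"' \
              '"apples","5 pieces"' |
awk -F, -v OFS=, '
NR==1  { f1=$1; f2=$2; f3=$3; f4=$4 }
FNR==1 { next }
f1 != $1 { print f1, f2, f3, f4; f1=$1; f2=f3=f4="\"\"" }
$2 ~ /piece/      { if (f2 == "\"\"") f2 = $2; next }
$2 ~ /color/      { if (f3 == "\"\"") f3 = $2; next }
$2 ~ /last order/ { if (f4 == "\"\"") f4 = $2; next }
END { print f1, f2, f3, f4 }
'
```

A keep-last variant would instead be the unguarded assignment already shown above; which one is right depends on which duplicate line in the input should win.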

awk combine unique values of other columns based on the unique values of a single column

My input file looks like
Item1,200,a,four,five,six,seven,eight1,nine1
Item2,500,b,four,five,six,seven,eight2,nine2
Item3,900,c,four,five,six,seven,eight3,nine3
Item2,800,d,four,five,six,seven,eight4,nine4
Item1,,e,four,five,six,seven,eight5,nine5
Based on the unique values of the first column, I want to combine the unique values of all the other columns.
What I tried so far is:
awk -F, '{
    a[$1]=a[$1]?a[$1]"_"$2:$2;
    b[$1]=b[$1]?b[$1]"_"$3:$3;
    c[$1]=c[$1]?c[$1]"_"$4:$4;
    d[$1]=d[$1]?d[$1]"_"$5:$5;
    e[$1]=e[$1]?e[$1]"_"$6:$6;
    f[$1]=f[$1]?f[$1]"_"$7:$7;
    g[$1]=g[$1]?g[$1]"_"$8:$8;
    h[$1]=h[$1]?h[$1]"_"$9:$9;
}END{for (i in a)print i, a[i], b[i], c[i], d[i], e[i], f[i], g[i], h[i];}' OFS=, input.txt
the output from above is:
Item3,900,c,four,five,six,seven,eight3,nine3
Item1,200_,a_e,four_four,five_five,six_six,seven_seven,eight1_eight5,nine1_nine5
Item2,500_800,b_d,four_four,five_five,six_six,seven_seven,eight2_eight4,nine2_nine4
but what I am expecting is:
Item3,900,c,four,five,six,seven,eight3,nine3
Item1,200,a_e,four,five,six,seven,eight1_eight5,nine1_nine5
Item2,500_800,b_d,four,five,six,seven,eight2_eight4,nine2_nine4
I am looking for some help on:
How to take only the unique values while combining them?
How to avoid appending the delimiter (underscore in my case above) at the end when a blank value is present?
How to sort the output based on column-1 values?
thanks a lot for your help.
With any awk plus sort:
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    key = $1
    keys[key]
    for (i=2; i<=NF; i++) {
        if ( ($i ~ /[^[:space:]]/) && (!seen[key,i,$i]++) ) {
            idx = key FS i
            vals[idx] = (idx in vals ? vals[idx] "_" : "") $i
        }
    }
}
END {
    for (key in keys) {
        printf "%s%s", key, OFS
        for (i=2; i<=NF; i++) {
            idx = key FS i
            printf "%s%s", vals[idx], (i<NF ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file | sort -t, -k1,1
Item1,200,a_e,four,five,six,seven,eight1_eight5,nine1_nine5
Item2,500_800,b_d,four,five,six,seven,eight2_eight4,nine2_nine4
Item3,900,c,four,five,six,seven,eight3,nine3
or with GNU awk for arrays of arrays (see https://www.gnu.org/software/gawk/manual/gawk.html#Multidimensional and https://www.gnu.org/software/gawk/manual/gawk.html#Arrays-of-Arrays for the difference between the two) and sorted_in (see https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Array-Traversal and https://www.gnu.org/software/gawk/manual/gawk.html#Controlling-Scanning):
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    for ( i=2; i<=NF; i++ ) {
        vals[$1][i][$i]
    }
}
END {
    PROCINFO["sorted_in"] = "#ind_str_asc"
    for ( key in vals ) {
        printf "%s%s", key, OFS
        for ( i=2; i<=NF; i++ ) {
            sep = ""
            for ( val in vals[key][i] ) {
                if ( val ~ /[^[:space:]]/ ) {
                    printf "%s%s", sep, val
                    sep = "_"
                }
            }
            printf "%s", (i<NF ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
Item1,200,a_e,four,five,six,seven,eight1_eight5,nine1_nine5
Item2,500_800,b_d,four,five,six,seven,eight2_eight4,nine2_nine4
Item3,900,c,four,five,six,seven,eight3,nine3
EDIT: Adding solution with more sensible variable names.
awk '
BEGIN{
    FS=OFS=","
}
{
    first_field_value[$1]
    for(i=2;i<=NF;i++){
        if($i!=""){
            split(field_values[$1,i],temp_array,"_")
            delete column_value
            for(p in temp_array){
                column_value[temp_array[p]]
            }
            if(!($i in column_value)){
                field_values[$1,i] = (field_values[$1,i] == "" ? "" : field_values[$1,i] "_") $i
            }
        }
    }
    tot_field=tot_field>NF?tot_field:NF
}
END{
    for(ind in first_field_value){
        printf "%s,",ind;
        for(j=2;j<=tot_field;j++){
            printf("%s%s",field_values[ind,j],j==tot_field?ORS:OFS)
        }
    }
}
' Input_file
Output will be as follows.
Item3,900,c,four,five,six,seven,eight3,nine3
Item1,200,a_e,four,five,six,seven,eight1_eight5,nine1_nine5
Item2,500_800,b_d,four,five,six,seven,eight2_eight4,nine2_nine4
Explanation: this is the explanation for my previous version of the code, which had less sensible variable names, but it can still be read for understanding purposes.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section.
    FS=OFS="," ##Setting FS and OFS as comma here.
}
{
    b[$1] ##Creating array b with index $1, basically to keep track of $1 values as indices here.
    for(i=2;i<=NF;i++){ ##Running for loop from i=2 till the value of NF here.
        if($i!=""){ ##Checking if the field is NOT NULL, then do the following.
            num=split(c[$1,i],d," ") ##Splitting the value of array c at index $1,i into array d; num holds the number of elements in array d.
            for(p=1;p<=num;p++){ ##Running a for loop from p=1 to the value of num.
                e[d[p]] ##Creating array e whose indices are the values of array d, i.e. the field values seen so far; this array is what prevents duplicate values.
            }
            if(!($i in e)){ ##If the current field is not present in array e, then do the following.
                a[$1,i]=(a[$1,i]?a[$1,i] "_":"")$i ##Creating array a with index $1,i and concatenating the field value onto it.
            }
            c[$1,i]=(c[$1,i]?c[$1,i] OFS:"")$i ##Creating array c with the current field value and concatenating it; array c is what stops values from re-entering, i.e. it does not allow duplicate values into array a.
        }
    }
    tot_field=tot_field>NF?tot_field:NF ##Creating variable tot_field, which tells us how far to run the loop in the END block of this code.
}
END{
    for(k in b){ ##Starting a for loop which traverses array b here.
        printf "%s,",k; ##Printing its index here, which is the first field of the lines.
        for(j=2;j<=tot_field;j++){ ##Running a for loop up to the maximum field count.
            printf("%s%s",a[k,j],j==tot_field?ORS:OFS) ##Printing the value of array a at index k,j where k is the index of array b (1st field) and j is the field number starting from 2.
        }
    }
}
' Input_file ##Mentioning Input_file name here.

Avoid to print twice the last line in awk

I'm trying to convert a file with one column into JSON format, and to do this I thought that awk could be a great tool. My input is (for example):
a
b
c
d
e
And my output that I want is:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'}]}
I tried with two different codes. The first one is:
BEGIN{
    FS = "\t"
    printf "{nodes:["
}
{printf "{'id':'%s'},\n",$1}
END{printf "{'id':'%s'}]}\n",$1}
But it prints the last line twice:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'},
{id='e'}]}
The other option that I tried is with getline:
BEGIN{
    FS = "\t"
    printf "{nodes:["
}
{printf getline==0 ? "{'id':'%s'}]}" : "{'id':'%s'},\n",$1}
But for some reason getline is always 1 instead of 0 on the last line, so:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'},
Any suggestion to solve my problem?
In awk: buffer the output in variable b and process it before outputting:
$ awk 'BEGIN{b="{nodes:["}{b=b "{id=\x27" $0 "\x27},\n"}END{sub(/,\n$/,"]}",b);print b}' file
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'}]}
Explained:
BEGIN { b="{nodes:[" } # front matter
{ b=b "{id=\x27" $0 "\x27},\n" } # middle
END { sub(/,\n$/,"]}",b); print b } # end matter and replace ,\n in the end
# with something more appropriate
Solution (thanks to @Ruud and @suleiman)
BEGIN{
    FS = "\t"
    printf "{'nodes':["
}
NR > 1{printf "{'id':'%s'},\n",prev}
{prev = $1}
END{printf "{'id':'%s'}]}",prev}
Try this:
$ awk -v count=$(wc -l < f) 'BEGIN{kk=getline;printf "{nodes:[={'id':'%s'},\n",$kk}
> {
> if(NR < count)
> {
> {printf "{'id':'%s'},\n",$1}
> }}
> END{printf "{'id':'%s'}]}\n",$1}' f
{nodes:[={id:a},
{id:b},
{id:c},
{id:d},
{id:e}]}

Comparing split strings inside fields of two CSV files

I have a CSV file (file1) that looks something like this:
123,info,ONE NAME
124,info,ONE VARIATION
125,info,NAME ANOTHER
126,info,SOME TITLE
and another CSV file (file2) that looks like this:
1,info,NAME FIRST
2,info,TWO VARIATION
3,info,NAME SECOND
4,info,ANOTHER TITLE
My desired output would be:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER
Where, if the first word in comma-delimited field 3 (i.e. NAME in line 1) of file2 is equal to any of the words in field 3 of file1, print a line with the format:
field1(file2),field1(file1),field3(file2),field3(file1)
Each file has the same number of lines and matches are only made when each has the same line number.
I know I can split fields and get the first word in field3 in Awk like this:
awk -F"," '{split($3,a," "); print a[1]}' file
But since I'm only moderately competent in Awk, I'm at a loss for how to approach a job where there are two files compared using splits.
I could do it in Python like this:
with open('file1', 'r') as f1, open('file2', 'r') as f2:
    l1 = f1.readlines()
    l2 = f2.readlines()
    for i in range(len(l1)):
        line_1 = l1[i].split(',')
        line_2 = l2[i].split(',')
        field_3_1 = line_1[2].split()
        field_3_2 = line_2[2].split()
        if field_3_2[0] in field_3_1:
            one = ' '.join(field_3_1)
            two = ' '.join(field_3_2)
            print(','.join((line_2[0], line_1[0], two, one)))
But I'd like to know how a job like this would be done in Awk as occasionally I use shells where only Awk is available.
This may seem like a strange task, and my example can be a bit confusing, but I need to do it to check for broken/ill-formatted data in one of the files.
awk -F, -vOFS=, '
{
    num1 = $1
    name1 = $3
    split(name1, words1, " ")
    getline <"file2"
    split($3, words2, " ")
    for (i in words1)
        if (words2[1] == words1[i]) {
            print $1, num1, $3, name1
            break
        }
}
' file1
Output:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER
You can try something along the following lines, although it prints only one match for each line in the second file:
awk -F, 'FNR==NR {
    count = split($3, words, " ");
    for (i=1; i <= count; i++) {
        field1hash[words[i]]=$1;
        field3hash[$1]=$3;
    }
    next;
}
{
    split($3,words," ");
    if (field1hash[words[1]]) {
        ff1 = field1hash[words[1]];
        print $1","ff1","$3","field3hash[ff1]
    }
}' file1 file2
I like @ooga's answer better than this:
awk -F, -v OFS=, '
NR==FNR {
    split($NF, a, " ")
    data[NR,"word"] = a[1]
    data[NR,"id"] = $1
    data[NR,"value"] = $NF
    next
}
{
    n = split($NF, a, " ")
    for (i=1; i<=n; i++)
        if (a[i] == data[FNR,"word"])
            print data[FNR,"id"], $1, data[FNR,"value"], $NF
}
' file2 file1

Separating output records in AWK without a trailing separator

I have the following records:
31 Stockholm
42 Talin
34 Helsinki
24 Moscow
15 Tokyo
And I want to convert it to JSON with AWK. Using this code:
#!/usr/bin/awk
BEGIN {
    print "{";
    FS=" ";
    ORS=",\n";
    OFS=":";
};
{
    if ( !a[city]++ && NR > 1 ) {
        key = $2;
        value = $1;
        print "\"" key "\"", value;
    }
};
END {
    ORS="\n";
    OFS=" ";
    print "\b\b}";
};
Gives me this:
{
"Stockholm":31,
"Talin":42,
"Helsinki":34,
"Moscow":24,
"Tokyo":15, <--- I don't want this comma
}
The problem is that trailing comma on the last data line. It makes the JSON output not acceptable. How can I get this output:
{
"Stockholm":31,
"Talin":42,
"Helsinki":34,
"Moscow":24,
"Tokyo":15
}
Mind some feedback on your posted script?
#!/usr/bin/awk # Just be aware that on Solaris this will be old, broken awk which you must never use
BEGIN {
    print "{"; # On this and every other line, the trailing semicolon is a pointless null statement; remove all of these.
    FS=" "; # This is setting FS to the value it already has, so remove it.
    ORS=",\n";
    OFS=":";
};
{
    if ( !a[city]++ && NR > 1 ) { # awk consists of <condition>{<action>} segments, so move this condition out to the condition part;
                                  # also, you never populate a variable named "city", so `!a[city]++` won't behave sensibly.
        key = $2;
        value = $1;
        print "\"" key "\"", value;
    }
};
END {
    ORS="\n"; # No need to set ORS and OFS when the script will no longer use them.
    OFS=" ";
    print "\b\b}"; # Why would you want to print a backspace???
};
so your original script should have been written as:
#!/usr/bin/awk
BEGIN {
    print "{"
    ORS=",\n"
    OFS=":"
}
!a[city]++ && (NR > 1) {
    key = $2
    value = $1
    print "\"" key "\"", value
}
END {
    print "}"
}
Here's how I'd really write a script to convert your posted input to your posted output though:
$ cat file
31 Stockholm
42 Talin
34 Helsinki
24 Moscow
15 Tokyo
$
$ awk 'BEGIN{print "{"} {printf "%s\"%s\":%s",sep,$2,$1; sep=",\n"} END{print "\n}"}' file
{
"Stockholm":31,
"Talin":42,
"Helsinki":34,
"Moscow":24,
"Tokyo":15
}
You have a couple of choices. An easy one would be to add the comma of the previous line as you are about to write out a new line:
Set a variable first = 1 in your BEGIN.
When about to print a line, check first. If it is 1, then just set it to 0. If it is 0, print out a comma and a newline:
if (first) { first = 0; } else { print ","; }
The point of this is to avoid putting an extra comma at the start of the list.
Use printf("%s", ...) instead of print ... so that you can avoid the newline when printing a record.
Add an extra newline before the close brace, as in: print "\n}";
Also, note that if you don't care about the aesthetics, JSON doesn't really require newlines between items, etc. You could just output one big line for the whole enchilada.
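Putting those steps together, the first-flag idea might look like the sketch below (a minimal illustration; first_done is an invented variable name, and the two-line input is just a sample):

```shell
printf '%s\n' '31 Stockholm' '42 Talin' |
awk '
BEGIN { print "{" }
{
    if (first_done) print ","        # terminate the previous entry with a comma
    printf "\"%s\":%s", $2, $1       # print this entry without a trailing newline
    first_done = 1
}
END { print "\n}" }                  # extra newline, then the closing brace
'
```

Because the comma is emitted only when a following record actually arrives, the last entry is never followed by one.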
You should really use a json parser, but here is how to do it with awk:
BEGIN {
    print "{"
}
NR==1{
    s = "\""$2"\":"$1
    next
}
{
    s = s",\n\""$2"\":"$1
}
END {
    printf "%s\n%s",s,"}"
}
Outputs:
{
"Stockholm":31,
"Talin":42,
"Helsinki":34,
"Moscow":24,
"Tokyo":15
}
Why not use a json parser? Don't force awk to do something it wasn't designed to do. Here is a solution using python:
import json

d = {}
with open("file") as f:
    for line in f:
        (val, key) = line.split()
        d[key] = int(val)
print(json.dumps(d, indent=0, sort_keys=True))  # sort_keys makes the key order deterministic
This outputs:
{
"Helsinki": 34,
"Moscow": 24,
"Stockholm": 31,
"Talin": 42,
"Tokyo": 15
}