Comparing split strings inside fields of two CSV files

I have a CSV file (file1) that looks something like this:
123,info,ONE NAME
124,info,ONE VARIATION
125,info,NAME ANOTHER
126,info,SOME TITLE
and another CSV file (file2) that looks like this:
1,info,NAME FIRST
2,info,TWO VARIATION
3,info,NAME SECOND
4,info,ANOTHER TITLE
My desired output would be:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER
That is, if the first word in comma-delimited field 3 of file2 (i.e. NAME in line 1) is equal to any of the words in field 3 of file1, print a line in the format:
field1(file2),field1(file1),field3(file2),field3(file1)
Both files have the same number of lines, and a match only counts when it occurs on the same line number in both files.
I know I can split fields and get the first word in field3 in Awk like this:
awk -F"," '{split($3,a," "); print a[1]}' file
But since I'm only moderately competent in Awk, I'm at a loss for how to approach a job where there are two files compared using splits.
I could do it in Python like this:
with open('file1', 'r') as f1, open('file2', 'r') as f2:
    l1 = f1.readlines()
    l2 = f2.readlines()
for i in range(len(l1)):
    line_1 = l1[i].split(',')
    line_2 = l2[i].split(',')
    field_3_1 = line_1[2].split()
    field_3_2 = line_2[2].split()
    if field_3_2[0] in field_3_1:
        one = ' '.join(field_3_1)
        two = ' '.join(field_3_2)
        print(','.join((line_2[0], line_1[0], two, one)))
But I'd like to know how a job like this would be done in Awk as occasionally I use shells where only Awk is available.
This seems like a strange task to need to do, and my example I think can be a bit confusing, but I need to perform this to check for broken/ill-formatted data in one of the files.

awk -F, -vOFS=, '
{
    num1 = $1                     # remember the file1 fields
    name1 = $3
    split(name1, words1, " ")
    getline <"file2"              # replace $0 with the same-numbered line of file2
    split($3, words2, " ")
    for (i in words1)
        if (words2[1] == words1[i]) {
            print $1, num1, $3, name1
            break
        }
}
' file1
Output:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER
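One caveat with the getline approach above: the return value of getline is not checked, so if file2 is shorter than file1 (or cannot be read), $0 is left unchanged and the file1 line ends up being compared with itself. Below is a minimal sketch of a guarded variant, wrapped in a shell function. The function name match_lines and the awk variable other are made up for illustration, and it assumes plain three-field lines with no quoted commas:

```shell
# match_lines FILE1 FILE2
# Pair line N of FILE1 with line N of FILE2 and print a row when the first
# word of FILE2's field 3 occurs among the words of FILE1's field 3.
match_lines() {
  awk -F, -v OFS=, -v other="$2" '
  {
    id1 = $1
    name1 = $3
    n = split(name1, words1, " ")
    # Read the matching line of the second file into $0; stop if there is none.
    if ((getline < other) <= 0)
      exit
    split($3, words2, " ")
    for (i = 1; i <= n; i++)
      if (words1[i] == words2[1]) {
        print $1, id1, $3, name1
        break
      }
  }' "$1"
}
```

It would be called as match_lines file1 file2. Passing the second file name with -v avoids hard-coding it inside the awk program.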

You can try something along these lines. Note that it looks words up across the whole of file1 rather than pairing lines by number, and it prints only one match for each line in the second file:
awk -F, 'FNR==NR {
    count = split($3, words, " ");
    for (i=1; i <= count; i++) {
        field1hash[words[i]]=$1;
        field3hash[$1]=$3;
    }
    next;
}
{
    split($3,words," ");
    if (field1hash[words[1]]) {
        ff1 = field1hash[words[1]];
        print $1","ff1","$3","field3hash[ff1]
    }
}' file1 file2

I like @ooga's answer better than this:
awk -F, -v OFS=, '
NR==FNR {
    split($NF, a, " ")
    data[NR,"word"] = a[1]
    data[NR,"id"] = $1
    data[NR,"value"] = $NF
    next
}
{
    n = split($NF, a, " ")
    for (i=1; i<=n; i++)
        if (a[i] == data[FNR,"word"])
            print data[FNR,"id"], $1, data[FNR,"value"], $NF
}
' file2 file1

Related

Convert single column to multiple, ensuring column count on last line

I would like to use AWK (Windows) to convert a text file with a single column to multiple columns - the count specified in the script or on the command line.
This question has been asked before but my final data file needs to have the same column count all the way.
Example of input:
L1
L2
L3
L4
L5
L6
L7
split into 3 columns and ";" as a separator
L1;L2;L3
L4;L5;L6
L7;; <<< here two empty fields are created after end of file, since I used just one on this line.
I tried to modify variants of the typical solution given: NR%4 {printf $0",";next} 1; and a counter, but could not quite get it right.
I would prefer not to count lines before, thereby running over the file multiple times.
You may use this awk solution:
awk -v n=3 '{
    sub(/\r$/, "")   # removes DOS line break, if present
    printf "%s", $0(NR%n ? ";" : ORS)
}
END {
    # now we need to add empty columns in last record
    if (NR % n) {
        for (i=1; i < (n - (NR % n)); ++i)
            printf ";"
        print ""
    }
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following, which combines xargs and awk. Note that the padding is hard-coded for three columns.
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
In awk I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
    if (NR%n) {   # pad only when the last row is incomplete
        for(i=NR%n; i<n-1; i++) printf(";")
        printf ORS
    }
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # split the fields sep by OFS
END { if (NR%n) { $0=row; NF=n; print } } # same
' file
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # By using $\ and $/ with the -l the RS and ORS is set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)
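Since paste needs one - per column, the argument list can be built in a loop when the column count is a variable. A rough sketch for a POSIX shell follows. The function name to_columns is made up for illustration, and it assumes the separator is always ;:

```shell
# to_columns N : read lines on stdin, write N ;-separated columns per row
to_columns() {
  n=$1
  set --                      # reuse the positional parameters as the argument list
  i=0
  while [ "$i" -lt "$n" ]; do
    set -- "$@" -             # one "-" (stdin) per output column
    i=$((i + 1))
  done
  paste -d';' "$@"
}
```

It would be used as to_columns 3 < file. paste itself pads the final row with empty fields, so the file is read only once.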

Avoid to print twice the last line in awk

I'm trying to convert a one-column file to a JSON-like format, and I thought awk would be a good tool for this. My input is (for example):
a
b
c
d
e
And my output that I want is:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'}]}
I tried with two different codes. The first one is:
BEGIN{
    FS = "\t"
    printf "{nodes:["
}
{printf "{'id':'%s'},\n",$1}
END{printf "{'id':'%s'}]}\n",$1}
But I print twice the last line:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'},
{id='e'}]}
The other option that I tried is with getline:
BEGIN{
    FS = "\t"
    printf "{nodes:["
}
{printf getline==0 ? "{'id':'%s'}]}" : "{'id':'%s'},\n",$1}
But for some reason, getline is always 1 instead of 0 on the last line, so:
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'},
Any suggestion to solve my problem?
In awk, buffer the output in a variable b and post-process it before printing:
$ awk 'BEGIN{b="{nodes:["}{b=b "{id=\x27" $0 "\x27},\n"}END{sub(/,\n$/,"]}",b);print b}' file
{nodes:[{id='a'},
{id='b'},
{id='c'},
{id='d'},
{id='e'}]}
Explained:
BEGIN { b="{nodes:[" } # front matter
{ b=b "{id=\x27" $0 "\x27},\n" } # middle
END { sub(/,\n$/,"]}",b); print b } # end matter and replace ,\n in the end
# with something more appropriate
Solution (thanks to @Ruud and @suleiman)
BEGIN{
    FS = "\t"
    printf "{'nodes':["
}
NR > 1{printf "{'id':'%s'},\n",prev}
{prev = $1}
END{printf "{'id':'%s'}]}",prev}
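Another way to avoid special-casing the last line is to emit the separator before every record except the first, so nothing has to be buffered. The sketch below assumes the {id='a'} layout from the desired output in the question. The function name to_nodes is made up for illustration:

```shell
# to_nodes : read one id per line on stdin, print the {nodes:[...]} wrapper
to_nodes() {
  awk '
    BEGIN  { printf "{nodes:[" }
    NR > 1 { printf ",\n" }                 # separator goes before records 2..N
           { printf "{id=\047%s\047}", $1 } # \047 is an octal escape for a single quote
    END    { print "]}" }
  '
}
```

It would be used as to_nodes < file. With empty input it prints just {nodes:[]}.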
Try this -
$ awk -v count=$(wc -l < f) 'BEGIN{kk=getline;printf "{nodes:[={'id':'%s'},\n",$kk}
> {
> if(NR < count)
> {
> {printf "{'id':'%s'},\n",$1}
> }}
> END{printf "{'id':'%s'}]}\n",$1}' f
{nodes:[={id:a},
{id:b},
{id:c},
{id:d},
{id:e}]}

AWK: convert messy record to Title Case

I have a pipe delimited file with the following syntax:
|ID Number|First Name|Middle Name|Last Name|#, Street, City|etc..
Some records are messy and I would like to have the strings be converted into title case. Based on other questions regarding converting the strings to title case, I've found this command:
awk 'BEGIN { FS=OFS="|" } {for (i=1; i<=NF; ++i) { $i=toupper(substr($i,1,1)) tolower(substr($i,2)); } print }'
Running that produces |Id number|First name|Middle name|Last name|#, street, city|, which capitalizes the first letter of each field, but since I have FS="|", the string in each field is treated as one word.
I would like to have every thing inside each field to be title-cased, not just the first letter of each field.
If possible, I would like to have an awk only solution for this.
Try something like this (the four-argument form of split() requires GNU awk):
awk 'BEGIN { FS=OFS="\0" }
{
    n = split($0, words, /[ |]/, separators);
    out = separators[0];
    for (i=1; i<=n; ++i) {
        out = out toupper(substr(words[i],1,1)) tolower(substr(words[i],2)) separators[i];
    }
    print out separators[n+1];
}' file
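Where only a POSIX awk is available (the four-argument split() is specific to GNU awk), the same effect can be had by walking the line one character at a time and upper-casing whatever follows a space or a |. This is a sketch under that assumption about word boundaries. The function name title_case is made up for illustration:

```shell
# title_case : title-case every space- or pipe-delimited word read from stdin
title_case() {
  awk '{
    out = ""
    prev = " "                       # treat the start of the line as a word boundary
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      # upper-case the character after a space or a pipe, lower-case the rest
      out = out ((prev == " " || prev == "|") ? toupper(c) : tolower(c))
      prev = c
    }
    print out
  }'
}
```

It would be used as title_case < file. Punctuation and digits pass through unchanged because toupper() and tolower() only affect letters.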

Prompt the way to edit csv

I have a CSV file like:
1;2,3,4
5;2,3
etc
I need to get a file like:
1;12
1;13
1;14
5;52
5;53
Can I do that without heavy programming, maybe with something like awk? I could do this in Perl or Python, but I think there is a simpler way.
This is a way:
$ awk 'BEGIN{FS=OFS=";"}{n=split($2, a, ","); for (i=1; i<=n; i++) print $1, $1a[i]}' file
1;12
1;13
1;14
5;52
5;53
Explanation
BEGIN{FS=OFS=";"} set input and output field separator as ;.
{n=split($2, a, ",") slice the second field based on comma. The pieces are stored in the array a[].
for (i=1; i<=n; i++) print $1, $1a[i]} loop through the pieces in a[], printing each one together with the first field in the format FIRST_FIELD;FIRST_FIELD + a[i]
awk -F '[;,]' '{ for (i = 2; i <= NF; ++i) print $1 ";" $1 $i }' file
Output:
1;12
1;13
1;14
5;52
5;53
how about:
awk -F";" '{sub(/;/,FS $1);gsub(/,/,ORS $1";"$1)}7' file
test with your data:
kent$ echo "1;2,3,4
5;2,3"|awk -F";" '{sub(/;/,FS $1);gsub(/,/,ORS $1";"$1)}7'
1;12
1;13
1;14
5;52
5;53
or:
awk -F";" 'sub(/;/,FS $1)+gsub(/,/,ORS $1";"$1)' file
You can use awk:
awk -F'[;,]' '{for(i=2;i<=NF;i++)printf "%s;%s%s\n",$1,$1,$i}' a.txt
Explanation
-F'[;,]' Split each line on , or ;
{for(i=2;i<=NF;i++)printf "%s;%s%s\n",$1,$1,$i} Iterate through the columns after the first and print each one prefixed with the first field, as desired.

awk process own field value

I have a CSV file formatted like this:
Postcode,Count,Total
L1 3RT,20,345.65
I am summing the counts and totals by Postcode using awk, however I'd like to do this for the first portion of a postcode (ie L1, thus combining the values for L1 3RT and L2 4XW). Sample data and existing awk command shown below.
CM1 4QR,979,32950.8
CM1 4QS,2,145.14
CM13 1DL,115,3771
AWK line
awk 'BEGIN { FS = "," } ; {sums[$1] += $2; totals[$1] += $3} END { for (i in sums) printf("%s,%s,%i\n", i, sums[i],totals[i])}' coach.csv
I would like the output to be
CM1,981,33095.94
CM13,115,3771
The following works:
awk -F'[ ,]' '
{
    sums[$1] += $3;
    totals[$1] += $4;
}
END {
    for (i in sums)
        printf("%s,%d,%.2f\n", i, sums[i], totals[i]);   # %.2f keeps the decimals; %i would truncate the total
}' coach.csv
It uses two delimiters, the comma and space. It works for your sample input, but won't for more complex input that has spaces elsewhere.
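To sidestep the problem of spaces elsewhere in the line, one option is to keep the comma as the only field separator and split just the postcode field on the space. The sketch below assumes the header line starts with Postcode, as in the sample, and that the first space-separated word of the postcode is the part to group by. The function name sum_by_outward is made up for illustration:

```shell
# sum_by_outward [FILE...] : sum Count and Total per first word of the postcode
sum_by_outward() {
  awk -F, '
    NR == 1 && $1 == "Postcode" { next }   # skip the header line if present
    {
      split($1, part, " ")                 # part[1] is the postcode up to the space
      key = part[1]
      sums[key] += $2
      totals[key] += $3
    }
    END {
      for (key in sums)
        printf "%s,%d,%.2f\n", key, sums[key], totals[key]
    }
  ' "$@" | LC_ALL=C sort                   # for (key in ...) order is unspecified
}
```

It would be used as sum_by_outward coach.csv. A postcode with no space at all is grouped under its full value.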
You can use multiple delimiters in awk. Please try this:
awk -F'[, ]' '{sums[$1] += $3; totals[$1] += $4} END {for (i in sums) printf("%s,%d,%.2f\n", i, sums[i], totals[i])}' coach.csv