AWK: convert messy record to Title Case - csv

I have a pipe delimited file with the following syntax:
|ID Number|First Name|Middle Name|Last Name|#, Street, City|etc..
Some records are messy and I would like the strings to be converted into title case. Based on other questions about converting strings to title case, I found this command:
awk 'BEGIN { FS=OFS="|" } {for (i=1; i<=NF; ++i) { $i=toupper(substr($i,1,1)) tolower(substr($i,2)); } print }'
Running that produces |Id number|First name|Middle name|Last name|#, street, city|, which capitalizes only the first letter of each field: because FS="|", the whole string in each field is treated as one word.
I would like everything inside each field to be title-cased, not just the first letter of the field.
If possible, I would like an awk-only solution for this.

Try something like this:
awk 'BEGIN { FS = OFS = "\0" }
{
    n = split($0, words, /[ |]/, separators)
    out = separators[0]
    for (i = 1; i <= n; ++i)
        out = out toupper(substr(words[i], 1, 1)) tolower(substr(words[i], 2)) separators[i]
    print out separators[n+1]
}' file
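The four-argument split() used above is a GNU awk (gawk) extension. If you only have a POSIX awk (mawk, BSD awk), here is a minimal portable sketch that title-cases every space-separated word inside each field; note that it collapses runs of spaces within a field:

```shell
printf '%s\n' '|123|JOHN doe|NEW york CITY|' |
awk 'BEGIN { FS = OFS = "|" }
{
    for (i = 1; i <= NF; i++) {
        n = split($i, w, " ")        # break the field into words
        $i = ""
        for (j = 1; j <= n; j++)     # rebuild it word by word, title-cased
            $i = $i (j > 1 ? " " : "") toupper(substr(w[j], 1, 1)) tolower(substr(w[j], 2))
    }
    print                            # prints |123|John Doe|New York City|
}'
```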

Related

Truncate CSV Header Names

I'm looking for a relatively simple method for truncating CSV header names to a given maximum length. For example a file like:
one,two,three,four,five,six,seven
data,more data,words,,,data,the end
Could limit all header names to a max of 3 characters and become:
one,two,thr,fou,fiv,six,sev
data,more data,words,,,data,the end
Requirements:
Only the first row is affected
I don't know what the headers are going to be, so it has to dynamically read and write the values and lengths
I tried a few things with awk and sed, but am not proficient at either. The closest I found was this snippet:
csvcut -c 3 file.csv |
sed -r 's/^"|"$//g' |
awk -F';' -vOFS=';' '{ for (i=1; i<=NF; ++i) $i = substr($i, 0, 2) } { printf("\"%s\"\n", $0) }' >tmp-3rd
But it focuses on columns, and using csvcut also feels more complicated than necessary.
Any help is appreciated.
With GNU sed:
sed -E '1s/([^,]{1,3})[^,]*/\1/g' file
Output:
one,two,thr,fou,fiv,six,sev
data,more data,words,,,data,the end
See: man sed and The Stack Overflow Regular Expressions FAQ
With your shown samples, please try the following awk program. Brief explanation: set the input and output field separators to a comma; on the first line, cut each field down to at most 3 characters and print them (with a newline after the last field of the first line); print all remaining lines as-is.
awk '
BEGIN { FS=OFS="," }
FNR==1 {
    for (i=1; i<=NF; i++) {
        printf("%s%s", substr($i, 1, 3), (i==NF ? ORS : OFS))
    }
    next
}
1
' Input_file
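To sanity-check the program, you can pipe the sample in instead of using a file:

```shell
printf '%s\n' 'one,two,three,four,five,six,seven' \
              'data,more data,words,,,data,the end' |
awk '
BEGIN { FS=OFS="," }
FNR==1 {
    # truncate each header to 3 characters
    for (i=1; i<=NF; i++) {
        printf("%s%s", substr($i, 1, 3), (i==NF ? ORS : OFS))
    }
    next
}
1
'
```

which prints:
one,two,thr,fou,fiv,six,sev
data,more data,words,,,data,the end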

Simple way to edit a CSV

I have a CSV file like:
1;2,3,4
5;2,3
etc
I need to get file like:
1;12
1;13
1;14
5;52
5;53
Can I do that without heavy programming, maybe with something like awk? I could do this in Perl or Python, but I think there is a simpler way.
This is a way:
$ awk 'BEGIN{FS=OFS=";"}{n=split($2, a, ","); for (i=1; i<=n; i++) print $1, $1a[i]}' file
1;12
1;13
1;14
5;52
5;53
Explanation
BEGIN{FS=OFS=";"} sets the input and output field separators to ;.
n=split($2, a, ",") slices the second field on the commas. The pieces are stored in the array a[].
for (i=1; i<=n; i++) print $1, $1a[i] loops through the elements of a[], printing each one together with the first field in the format FIRST_FIELD;FIRST_FIELD a[i].
awk -F '[;,]' '{ for (i = 2; i <= NF; ++i) print $1 ";" $1 $i }' file
Output:
1;12
1;13
1;14
5;52
5;53
how about:
awk -F";" '{sub(/;/,FS $1);gsub(/,/,ORS $1";"$1)}7' file
(the trailing 7 is just an always-true condition, so awk runs the default action and prints the modified line)
test with your data:
kent$ echo "1;2,3,4
5;2,3"|awk -F";" '{sub(/;/,FS $1);gsub(/,/,ORS $1";"$1)}7'
1;12
1;13
1;14
5;52
5;53
or:
awk -F";" 'sub(/;/,FS $1)+gsub(/,/,ORS $1";"$1)' file
You can use awk:
awk -F'[;,]' '{for(i=2;i<=NF;i++)printf "%s;%s%s\n",$1,$1,$i}' a.txt
Explanation
-F'[;,]' splits the line on , or ;.
{for(i=2;i<=NF;i++)printf "%s;%s%s\n",$1,$1,$i} iterates through the remaining columns and produces the output as desired.

Comparing split strings inside fields of two CSV files

I have a CSV file (file1) that looks something like this:
123,info,ONE NAME
124,info,ONE VARIATION
125,info,NAME ANOTHER
126,info,SOME TITLE
and another CSV file (file2) that looks like this:
1,info,NAME FIRST
2,info,TWO VARIATION
3,info,NAME SECOND
4,info,ANOTHER TITLE
My desired output would be:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER
where, if the first word in comma-delimited field 3 of file2 (i.e. NAME in line 1) is equal to any of the words in field 3 of file1, a line is printed with the format:
field1(file2),field1(file1),field3(file2),field3(file1)
Each file has the same number of lines and matches are only made when each has the same line number.
I know I can split fields and get the first word in field3 in Awk like this:
awk -F"," '{split($3,a," "); print a[1]}' file
But since I'm only moderately competent in Awk, I'm at a loss for how to approach a job where there are two files compared using splits.
I could do it in Python like this:
with open('file1', 'r') as f1, open('file2', 'r') as f2:
    l1 = f1.readlines()
    l2 = f2.readlines()
    for i in range(len(l1)):
        line_1 = l1[i].split(',')
        line_2 = l2[i].split(',')
        field_3_1 = line_1[2].split()
        field_3_2 = line_2[2].split()
        if field_3_2[0] in field_3_1:
            one = ' '.join(field_3_1)
            two = ' '.join(field_3_2)
            print(','.join((line_2[0], line_1[0], two, one)))
But I'd like to know how a job like this would be done in Awk as occasionally I use shells where only Awk is available.
This may seem like a strange task, and my example may be a bit confusing, but I need to do this to check for broken/ill-formatted data in one of the files.
awk -F, -v OFS=, '
{
    num1 = $1
    name1 = $3
    split(name1, words1, " ")
    getline <"file2"
    split($3, words2, " ")
    for (i in words1)
        if (words2[1] == words1[i]) {
            print $1, num1, $3, name1
            break
        }
}
' file1
Output:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER
You can try something along these lines, although the following prints at most one match for each line of the second file:
awk -F, 'FNR==NR {
    count = split($3, words, " ")
    for (i=1; i <= count; i++) {
        field1hash[words[i]] = $1
        field3hash[$1] = $3
    }
    next
}
{
    split($3, words, " ")
    if (field1hash[words[1]]) {
        ff1 = field1hash[words[1]]
        print $1","ff1","$3","field3hash[ff1]
    }
}' file1 file2
I like #ooga's answer better than this:
awk -F, -v OFS=, '
NR==FNR {
    split($NF, a, " ")
    data[NR,"word"] = a[1]
    data[NR,"id"] = $1
    data[NR,"value"] = $NF
    next
}
{
    n = split($NF, a, " ")
    for (i=1; i<=n; i++)
        if (a[i] == data[FNR,"word"])
            print data[FNR,"id"], $1, data[FNR,"value"], $NF
}
' file2 file1
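For completeness, here is one way to exercise the last program against the sample data (using a temp directory so nothing in the current directory is touched):

```shell
tmp=$(mktemp -d)
printf '%s\n' '123,info,ONE NAME' '124,info,ONE VARIATION' \
              '125,info,NAME ANOTHER' '126,info,SOME TITLE' > "$tmp/file1"
printf '%s\n' '1,info,NAME FIRST' '2,info,TWO VARIATION' \
              '3,info,NAME SECOND' '4,info,ANOTHER TITLE' > "$tmp/file2"
awk -F, -v OFS=, '
NR==FNR {                       # first file (file2): remember per-line data
    split($NF, a, " ")
    data[NR,"word"] = a[1]
    data[NR,"id"] = $1
    data[NR,"value"] = $NF
    next
}
{                               # second file (file1): compare same line number
    n = split($NF, a, " ")
    for (i=1; i<=n; i++)
        if (a[i] == data[FNR,"word"])
            print data[FNR,"id"], $1, data[FNR,"value"], $NF
}
' "$tmp/file2" "$tmp/file1"
rm -rf "$tmp"
```

This prints:
1,123,NAME FIRST,ONE NAME
3,125,NAME SECOND,NAME ANOTHER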

awk process own field value

I have a CSV file formatted like this:
Postcode,Count,Total
L1 3RT,20,345.65
I am summing the counts and totals by Postcode using awk; however, I'd like to do this for the first portion of the postcode only (i.e. L1, thus combining the values for L1 3RT and L1 4XW). Sample data and my existing awk command are shown below.
CM1 4QR,979,32950.8
CM1 4QS,2,145.14
CM13 1DL,115,3771
AWK line
awk 'BEGIN { FS = "," } ; {sums[$1] += $2; totals[$1] += $3} END { for (i in sums) printf("%s,%s,%i\n", i, sums[i],totals[i])}' coach.csv
I would like the output to be
CM1,981,33095.94
CM13,115,3771
The following works:
awk -F'[ ,]' '
{
    sums[$1] += $3
    totals[$1] += $4
}
END {
    for (i in sums)
        printf("%s,%i,%.2f\n", i, sums[i], totals[i])
}' coach.csv
It uses two delimiters, the comma and the space. Note that %.2f always prints two decimal places, so 3771 comes out as 3771.00; and if your real file begins with the Postcode,Count,Total header, skip it with FNR > 1. This works for your sample input, but won't for more complex input that has spaces elsewhere.
You can use multiple delimiters in awk. Please try this:
awk -F'[, ]' '{sums[$1] += $3; totals[$1] += $4} END {for (i in sums) printf("%s,%d,%.2f\n", i, sums[i], totals[i])}' coach.csv
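As a quick check, you can pipe the sample rows (header omitted) through the same idea; since `for (i in sums)` visits keys in an unspecified order, the output is sorted here to make it deterministic:

```shell
printf '%s\n' 'CM1 4QR,979,32950.8' 'CM1 4QS,2,145.14' 'CM13 1DL,115,3771' |
awk -F'[, ]' '{ sums[$1] += $3; totals[$1] += $4 }
END { for (i in sums) printf("%s,%d,%.2f\n", i, sums[i], totals[i]) }' |
LC_ALL=C sort
```

This prints:
CM1,981,33095.94
CM13,115,3771.00
(%.2f always emits two decimals, so 3771 appears as 3771.00.)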

Change delimiters in files

Below are the files as they should be and, further down, what I have made so far. I think the delimiters in my code are the source of the problem, but I can't get it much better.
My source file uses ; as the delimiter, while the files for my database use , as the separator; also, the strings must be wrapped in double quotes ("").
The category file should be like this:
"1","1","testcategory","testdescription"
And the manufacturers file, like this:
"24","ASUS",NULL,NULL,NULL
"23","ASROCK",NULL,NULL,NULL
"22","ARNOVA",NULL,NULL,NULL
What I have at this moment:
- category file:
1;2;Alarmen en beveiligingen;
2;2;Apparatuur en toebehoren;
3;2;AUDIO;
- manufacturers file:
315;XTREAMER;NULL;NULL;NULL
316;XTREMEMAC;NULL;NULL;NULL
317;Y-CAM;NULL;NULL;NULL
318;ZALMAN;NULL;NULL;NULL
I experimented a bit with sed; first, on the categories file:
cut -d ";" -f1 /home/arno/pixtmp/pixtmp.csv |sort | uniq > /home/arno/pixtmp/categories_description-in.csv
sed 's/^/;2;/g' /home/arno/pixtmp/categories_description-in.csv > /home/arno/pixtmp/categories_description-in.tmp
sed -e "s/$/;/" /home/arno/pixtmp/categories_description-in.tmp > /home/arno/pixtmp/categories_description-in.tmp2
awk 'BEGIN{n=1}{printf("%s%s\n",n++,$0)}' /home/arno/pixtmp/categories_description-in.tmp2 > /home/arno/pixtmp/categories_description$
And then on the manufacturers file:
cut -d ";" -f5 /home/arno/pixtmp/pixtmp.csv |sort | uniq > /home/arno/pixtmp/manufacturers-in
sed 's/^/;/g' /home/arno/pixtmp/manufacturers-in > /home/arno/pixtmp/manufacturers-tmp
sed -e "s/$/;NULL;NULL;NULL/" /home/arno/pixtmp/manufacturers-tmp > /home/arno/pixtmp/manufacturers-tmp2
awk 'BEGIN{n=1}{printf("%s%s\n",n++,$0)}' /home/arno/pixtmp/manufacturers-tmp2 > /home/arno/pixtmp/manufacturers.ok
You were trying to solve the problem by using cut, sed, and AWK. AWK by itself is powerful enough to solve your problem.
I wrote one AWK program that can handle both of your examples. If NULL is not a special case and the manufacturers file simply uses a different format, you will need to write two AWK programs, but I think it should be clear how to do that.
All we do here is tell AWK that the "field separator" is the semicolon. Then AWK splits the input lines into fields for us. We loop over the fields, printing as we go.
#!/usr/bin/awk -f
BEGIN {
    FS = ";"
    DQUOTE = "\""
}
function add_quotes(s) {
    if (s == "NULL")
        return s
    else
        return DQUOTE s DQUOTE
}
NF > 0 {
    # if the input line ended with a semicolon, the last field will be empty
    if ($NF == "")
        NF -= 1    # subtract one from NF to forget the trailing empty field
    if (NF > 0)
    {
        for (i = 1; i <= NF - 1; ++i)
            printf("%s,", add_quotes($i))
        printf("%s\n", add_quotes($i))    # after the loop, i == NF, so this prints the last field
    }
}
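A quick run over one line of each sample format shows the quoting behavior (here the data is fed via stdin rather than a file, and the ternary form of add_quotes is equivalent to the if/else above):

```shell
printf '%s\n' '315;XTREAMER;NULL;NULL;NULL' '1;2;Alarmen en beveiligingen;' |
awk '
BEGIN { FS = ";"; DQUOTE = "\"" }
function add_quotes(s) { return (s == "NULL") ? s : DQUOTE s DQUOTE }
NF > 0 {
    if ($NF == "")      # trailing semicolon leaves an empty last field
        NF -= 1
    if (NF > 0) {
        for (i = 1; i < NF; ++i)
            printf("%s,", add_quotes($i))
        printf("%s\n", add_quotes($NF))
    }
}'
```

This prints:
"315","XTREAMER",NULL,NULL,NULL
"1","2","Alarmen en beveiligingen"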