AWK: Converting CSV with Headers to Summary Table

I need to document the format of 100+ CSV files, including sample data. What I would like to do is take a CSV of the following format:
Name, Phone, State
Fred, 1234567, TX
John, 2345678, NC
and convert it to:
Field | Sample
--- | ----
Name | Fred
Phone | 1234567
State | TX
Is this possible with AWK? From my example below, you will see I am trying to format as a markdown table. I have it currently transposing the header row with
#!/usr/bin/awk -v RS='\r\n' -f
BEGIN { printf "| Field \t| Critical |\n"}
{
printf "|---\t|---\t|\n"
for (i=1; i<=NF; i++) {print "|", toupper($i), "| sample |"}
}
END {}
But I am not sure how to use the first row of data after the header to display the sample data.

awk is the right tool for data parsing. You can try something like:
awk '
BEGIN { FS=", "; OFS=" | " }
NR==1 {
for(tag = 1; tag <= NF; tag++) {
hdr[tag] = sprintf ("%-7s", $tag)
}
next
}
{
for(fld = 1; fld <= NF; fld++) {
data[NR,fld] = $fld
}
}
END {
print "Field | Sample\n------- | -------";
for(rec = 2; rec <= NR; rec++) {
for(line = 1; line <= NF; line++) {
print hdr[line], data[rec,line]
}
}
}' file
Output:
Field | Sample
------- | -------
Name | Fred
Phone | 1234567
State | TX
Name | John
Phone | 2345678
State | NC

Here is a simpler way to do it with awk.
There is no need to store everything in an array and print it at the end.
awk -F", " 'NR==1{split($0,a,FS);print "Field | Sample\n------- | -------";next} {for (i=1;i<=NF;i++) printf "%-8s| %s\n",a[i],$i}' file
Field | Sample
------- | -------
Name | Fred
Phone | 1234567
State | TX
Name | John
Phone | 2345678
State | NC
How it works:
awk -F", " ' # set field separator to ", " (comma followed by a space)
NR==1{ # if first line do:
split($0,a,FS) # split first line into an array named "a" to get the labels
print "Field | Sample" # print header
print "------- | -------" # print separator
next} # skip the remaining rules for the first line
{ # for all lines except the first do:
for (i=1;i<=NF;i++) # loop through all fields in the line
printf "%-8s| %s\n",a[i],$i # print label and value for every field
}
' file
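For comparison, the same transposition can be sketched in Python. This is only a rough sketch and not part of the answers above: the function name summarize_csv is made up for illustration, it assumes the ", " separator from the question, and it keeps only the first data row as the sample, which is what the question asked for.

```python
import csv
import io


def summarize_csv(text):
    """Return a Markdown table pairing each header with the first data row."""
    # skipinitialspace drops the blank that follows each comma in the sample.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    sample = next(reader, [""] * len(header))  # tolerate a header-only file
    lines = ["Field | Sample", "--- | ---"]
    lines += [f"{name} | {value}" for name, value in zip(header, sample)]
    return "\n".join(lines)


print(summarize_csv("Name, Phone, State\nFred, 1234567, TX\nJohn, 2345678, NC\n"))
```

Unlike the awk versions, this stops after the first data row, so John's row never appears in the table.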


How can I clean a TSV file having record or fields separators in one of its fields?

Given a TSV file in which col2 may contain a field or record separator (FS/RS), that is, a tab or a newline, enclosed in double quotes:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \cat -vet
col1^Icol2^Icol3$
1^I"A^IB"^I1234$
2^I"CD$
EF"^I567$
+------+---------+------+
| col1 | col2 | col3 |
+------+---------+------+
| 1 | "A B" | 1234 |
| 2 | "CD | 567 |
| | EF" | |
+------+---------+------+
Is there a way in sed/awk/perl or even (preferably) miller/mlr to transform those pesky characters into spaces in order to generate the following result:
+------+---------+------+
| col1 | col2 | col3 |
+------+---------+------+
| 1 | "A B" | 1234 |
| 2 | "CD EF" | 567 |
+------+---------+------+
I cannot get miller 6.2 to make the proper transformation (tried with DSL put/gsub) because it doesn't recognize the tab or CR/LF being part of the columns which breaks the field number:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | mlr --opprint --barred --itsv cat
mlr : mlr: CSV header/data length mismatch 3 != 4 at filename (stdin) line 2.
A good library cleanly handles things like embedded newlines and quoted separators (in fields).
In a Perl script with Text::CSV:
use warnings;
use strict;
use Text::CSV;
my $file = shift // die "Usage: $0 filename\n";
my $csv = Text::CSV->new( { binary => 1, sep_char => "\t", auto_diag => 1 } );
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
s/\s+/ /g for @$row; # collapse multiple spaces, tabs, newlines
$csv->say(*STDOUT, $row);
}
Note the many other options for the constructor that can help handle various irregularities.
This can fit in a one-liner; its functional interface (with csv) is particularly well suited for that.
If you run
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --c2t --fs "\t" clean-whitespace
col1 col2 col3
1 A B 1234
2 CD EF 567
I'm using mlr 6.2.
One way to do it in Miller 5 is simply to use the put verb:
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --tsv put -S 'for (k in $*) {$[k] = gsub($[k], "\n", " ")}' then clean-whitespace
perl -MText::CSV_XS=csv -e'
csv
in => *ARGV,
on_in => sub { s/\s+/ /g for @{$_[1]} },
sep_char => "\t";
'
Or s/[\t\n]/ /g if you prefer.
Can be placed all on one line.
Input is accepted from file named by argument or STDIN.
With GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='"([^"]|"")*"' '{ORS=gensub(/[\n\t]/," ","g",RT)} 1' file
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
The above just uses RS to isolate each "..." string and saves it in RT, then replaces every \n or \t in that string with a blank and saves the result in ORS, then prints the record.
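The same idea (isolate each "..." string, then replace tabs and newlines only inside it) can be sketched in Python. This is an illustration rather than anything from the answers: flatten_quoted is a hypothetical name, and the regular expression is the RS pattern above rewritten with a non-capturing group.

```python
import re

# A double-quoted field, where "" is an escaped quote (same pattern as the RS above).
QUOTED = re.compile(r'"(?:[^"]|"")*"')


def flatten_quoted(text):
    """Replace tabs and newlines with spaces, but only inside quoted fields."""
    return QUOTED.sub(lambda m: re.sub(r"[\n\t]", " ", m.group(0)), text)


raw = 'col1\tcol2\tcol3\n1\t"A\tB"\t1234\n2\t"CD\nEF"\t567\n'
print(flatten_quoted(raw), end="")
```

Tabs and newlines outside the quotes are left alone, so the result is still a valid TSV with three fields per record.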
You absolutely don't need gawk to get this done. Here's one that works for mawk, gawk, or macOS nawk:
INPUT
--before--
col1 col2 col3
1 "A B" 1234
2 "CD
EF" 567
CODE
{m,n,g}awk '
BEGIN {
__=substr((OFS=FS="\t\"")(FS)(ORS=_)\
(RS = "^$"),_+=_^=_<_,_)
}
END {
printbefore()
for (_^=_<_; _<=NF; _++) {
sub(/[\t-\r]+/, ($_~__)?" ":"&", $_)
}
print
}
function printbefore(_)
{
printf("\n\n--before--\n%s\n------"\
"------AFTER------\n\n", $+_)>("/dev/stderr")
}'
OUTPUT
------AFTER (using mawk)------
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
Strip out the printbefore() part, which is only there for debugging, and it's just
{m,n,g}awk '
BEGIN { __=substr((OFS=FS="\t\"") FS \
(ORS=_) (RS="^$"),_+=_^=_<_,_)
} END {
for(--_;_<=NF;_++) {
sub(/[\t-\r]+/, $_~__?" ":"&",$_) } print }'

AWK: How to merge CSV files and eliminate rows that contain certain values?

I have hundreds of CSV files. Each CSV file is similar to this:
| KEYWORD | NUMBER OF COMPS | AVGE M E (K) | GS/M | EST. A SE/M | C CORE |
|---------|-----------------|--------------|------|-------------|--------|
| Apples | 311 | 12 | N/A | <100 | 10 |
| Bananas | >1,200 | 737 | N/A | 490 | 88 |
| Oranges | 48 | 184 | N/A | N/A | 1 |
| Fruits | 161 | 94 | N/A | - | 6 |
(I have posted this in table format, to make it more readable, but the CSV data is at the bottom of this post).
All the CSV files have the same header row. Only the data is different.
I would like to do the following:
Merge all the CSV files together, but only have 1 header row.
Omit any rows where EST. A SE/M (Column 5) contains any of the following data: <100, N/A or -
Notes about the Data
Sometimes some or even all cells in the CSV file are wrapped in quotation marks.
Other times they are not.
Sometimes the first column (keyword) may contain multiple words or accented characters.
My code so far
This code merges all the CSV files into one with only one heading:
awk '(NR == 1) || (FNR > 1)' *.csv > ^0-output.csv
This works perfectly.
However, I am not sure how to delete the unwanted rows after the merge.
So far I have this:
awk '$5 !~ /(<100|N\/A|-)/' ^0-output.csv > ^0-output.csv
But when I use this code, it just produces a blank file.
Plus, I am not sure if there is a way to integrate it in the first line, so it does everything with a single command.
Notes
Here is how the data looks in CSV format
Sample1.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Apples,311,12,N/A,<100,10
Bananas,">1,200",737,N/A,490,88
Oranges,48,184,N/A,N/A,1
Fruits,161,94,N/A,-,63
Sample2.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Dino,588,67,N/A,888,234
Thunder,">1,200",211,N/A,<100,77
Ninja,95,37,N/A,-,878
Sample3.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Blur,84,2454,N/A,-,234
Sample4.csv
"KEYWORD","NUMBER OF COMPS","AVGE M E (K)","GS/M","EST. A SE/M","C CORE"
"hedgehog rolls ròund",32,481,N/A,"878",13
"Clever Fox jumps Hîgh",233,83,N/A,"<100",12
"Bear à lot",122,35,N/A,"-",11
"kitten hîgh life","121","673","32","N/A","15"
Please note: The actual files that the finished script will be used on will have a variety of file names. They will NOT always follow the pattern of sample 1, sample 2 etc.
Expected Output
Expected output: (CSV format)
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Bananas,">1,200",737,N/A,490,88
Dino,588,67,N/A,888,234
"hedgehog rolls ròund",32,481,N/A,"878",13
(Note: It doesn't matter if the expected output keeps the wrapping quote marks as the final CSV file is opened in Apple Numbers)
Expected output: (Readable format)
| KEYWORD | NUMBER OF COMPS | AVGE M E (K) | GS/M | EST. A SE/M | C CORE |
|---------|-----------------|--------------|------|-------------|--------|
| Bananas | >1,200 | 737 | N/A | 490 | 88 |
| Dino | 588 | 67 | N/A | 888 | 234 |
| hedgehog rolls ròund | 32 | 481 | N/A | 878 | 13 |
Environment:
I am using Mac OS X 10.14.6. I am unable to install other versions of awk.
You can merge the two conditions into one using &&:
awk -F, 'NR==1 || (FNR>1 && $5 !~ /^(<100|N\/A|-)$/)' *.csv > output.csv
Here $5 !~ /^(<100|N\/A|-)$/ will skip a row if $5 is <100, N/A, or -. It is important to use the regex anchors ^ and $ to avoid matching unwanted strings such as <1000 or AB-123.
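To see what the anchors buy you, here is a small Python comparison. It is just an illustration, using Python's re module as a stand-in for awk's regex matching; the variable names are arbitrary.

```python
import re

anchored = re.compile(r"^(<100|N/A|-)$")   # whole value must be one of the three
unanchored = re.compile(r"(<100|N/A|-)")   # any substring is enough

for value in ["<100", "N/A", "-", "490", "<1000", "AB-123"]:
    print(f"{value:8} anchored={bool(anchored.search(value))} "
          f"unanchored={bool(unanchored.search(value))}")
```

The two patterns agree on the first four values; only the unanchored one flags <1000 and AB-123, which is exactly the false positive the anchors prevent.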
It seems you also have a comma inside double quotes in Sample1.csv. In that case the following GNU awk command should work for you:
awk -v FPAT='"[^"]*"|[^,]*' '
NR == 1 || (FNR > 1 && $5 !~ /^"?(<100|N\/A|-)"?$/)' *.csv > output.csv
EDIT: As per the OP's comments, there could be a comma between the quotes too, so it's better to use FPAT to handle that. Written and tested with GNU awk.
awk -v FPAT='[^,]*|"[^"]+"' '
{ sub(/\r$/,"") }
FNR==1{
if(NR==1){ print }
next
}
$5=="<100"||$5=="N/A"||$5=="-"{
next
}
1
' *.csv
Could you please try following, written and tested with GNU awk on shown samples only.
awk '
BEGIN{
FS=OFS=","
}
FNR==1{
if(NR==1){ print }
next
}
$5=="<100"||$5=="N/A"||$5=="-"{ next }
1
' *.csv
Or, in case your values can contain something else as well and you want to use a regex to match the values you want to exclude, try the following.
awk '
BEGIN{
FS=OFS=","
}
FNR==1{
if(NR==1){ print }
next
}
$5~/<100/ || $5~/N\/A/ || $5~/-/{ next }
1
' *.csv
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting field separator as comma here.
}
FNR==1{ ##If this is the first line of the current input file, do the following.
if(NR==1){ print } ##If it is the very first line of the very first input file, print it.
next ##next skips all further statements for this line.
}
$5=="<100"||$5=="N/A"||$5=="-"{ next } ##If the 5th field equals <100, N/A, or -, skip the line.
1 ##awk idiom to print the current line.
' *.csv ##Passing all .csv files to awk program from here.
It looks to me like you're only interested in testing the second-to-last field, and neither that nor the last field can contain commas. So just count fields from the end of each line instead of from the beginning; then you don't care whether earlier fields contain commas. Given that, this will work using any awk:
$ awk -F',' '(NR==1) || (FNR>1 && $(NF-1)!~/^"?(<100|N\/A|-)"?$/)' *.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Bananas,">1,200",737,N/A,490,88
Dino,588,67,N/A,888,234
"hedgehog rolls ròund",32,481,N/A,"878",13

Output Error for Matrix in Python?

I am trying to create a function that outputs a matrix that contains each item in a list on a separate line with lines in between. The only output I'm getting is quotations (''). I do not understand why. I think I set it all up correctly to output what is needed but there has to be something missing?
I included examples below my code.
def show_table(table):
table=[]
s=[[str(e) for e in row] for row in table]
lens= [max(map(len, col)) for col in zip(*s)]
fmt= '\t'.join('{{:{}}}'.format(x) for x in lens)
table= [fmt.format(*row) for row in s]
return '\n'.join(table)
show_table([['A','BB'],['C','DD']])
output:
'| A | BB |\n| C | DD |\n'
print(show_table([['A','BB'],['C','DD']]))
output:
| A | BB |
| C | DD |
The issue is on the second line where you are initialising your list to an empty list. Instead try:
if table is None:
table = []
Perhaps a better way to accomplish this could be:
def show_table(table):
if table is None:
table = []
data = ""
for row in table:
for val in row:
data += "| " + val + " "
data += "|\n"
return data.strip("\n")
print(show_table([['a','bb'],['c','dd']]))
Output:
| a | bb |
| c | dd |
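If you also want the columns padded to a common width, as the original function tried to do, something along these lines should work. Treat it as a sketch: it drops the table=[] reassignment that discarded the argument (which is why the original returned ''), and it assumes every row has the same number of cells.

```python
def show_table(table):
    """Format rows as pipe-delimited lines with columns padded to equal width."""
    rows = [[str(cell) for cell in row] for row in table]
    if not rows:
        return ""  # nothing to format
    # Widest cell in each column decides that column's width.
    widths = [max(len(cell) for cell in col) for col in zip(*rows)]
    return "\n".join(
        "| " + " | ".join(cell.ljust(w) for cell, w in zip(row, widths)) + " |"
        for row in rows
    )


print(show_table([["A", "BB"], ["C", "DD"]]))
```

Calling the function in the interpreter without print shows the string's repr with \n escapes, while print renders the actual line breaks; that is the difference between the two outputs in the question.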

HTML in unix shell scripting with sequential output

I have a script called "main.ksh" which produces an "output.txt" file, and I am sending that file via mail (the list contains 50+ records; I show only a few as an example).
The mail output I am getting is:
DATE | FEED NAMEs | FILE NAMEs | JOB NAMEs | SCHEDULED_TIME| TIMESTAMP| SIZE(MB)| COUNT| STATUS |
Dec 17 INVEST_AI_FUNDS_FEED amlfunds_iai_20161217.txt gdcpl3392_uxmow080_ori_inv_ai TUE-SAT 02:03 0.4248 4031 On_Time
Dec 17 INVEST_AI_SECURITIES_FEED amltxn_iai_20161217.txt gdcpl3392_uxmow080_ori_inv_ai TUE-SAT 02:03 0.0015 9 On_Time
Dec 17 INVEST_AI_CONNECTED_PARTIES_FEED amlbene_iai_20161217.txt gdcpl3392_uxmow080_ori_inv_ai TUE-SAT 02:03 0.0001 1 No_Records
I am implementing coloring for the Delayed, On_Time and No_Records values in the STATUS field, and I wrote the script below, which gives the output shown at the bottom (the coloring is correct, but the column spacing is lost).
awk 'BEGIN {
print "<html>" \
"<body bgcolor=\"#333\" text=\"#f3f3f3\">" \
"<pre>"
}
NR == 1 { print $0 }
NR > 1 {
if ($NF == "Delayed") color="red"
else if ($NF == "On_time") color="green"
else if ($NF == "No_records") color="yellow"
else color="#003abc"
$NF="<span style=\"color:" color "\">" $NF "</span>"
print $0
}
END {
print "</pre>" \
"</body>" \
"</html>"
}
' output.txt > output.html
output with perfect coloring:
| DATE | FEED NAMEs | FILE NAMEs | JOB NAMEs | SCHEDULED_TIME| TIMESTAMP| SIZE(MB)| COUNT| STATUS |
Dec 17 INVEST_AI_FUNDS_FEED amlfunds_iai_20161217.txt gdcpl3392_uxmow080_ori_inv_ai On_Time
Dec 17 INVEST_AI_SECURITIES_FEED amltxn_iai_20161217.txt gdcpl3392_uxmow080_ori_inv_ai On_Time
Dec 17 INVEST_AI_CONNECTED_PARTIES_FEED amlbene_iai_20161217.txt gdcpl3392_uxmow080_ori_inv_ai No_Records
Four columns are skipped automatically. Could you please help me with this? Thanks a lot!
When your code executes this
$NF="<span style=\"color:" color "\">" $NF "</span>"
print $0
the input line is rebuilt, because assigning to a field makes awk reconstruct $0 with the fields joined by OFS. Therefore the multiple blanks between consecutive fields are replaced by a single blank.
My solution copies the input line in a variable, deletes the last field (changing the value of the variable, not the input line), adds the modified last field and prints:
Dummy=$0
sub("[^ ]+$","",Dummy) # removes last field
Dummy=Dummy "<span style=\"color:" color "\">" $NF "</span>"
print Dummy
update: the last two code lines can be reduced in this way:
print Dummy "<span style=\"color:" color "\">" $NF "</span>"
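The same "touch only the last token, leave the rest of the line alone" approach looks like this in Python. This is only a sketch: colorize_status and the COLORS table are hypothetical names, and the keys assume the spellings that appear in the sample data (On_Time, No_Records).

```python
import re

# Assumed status spellings, taken from the sample output in the question.
COLORS = {"Delayed": "red", "On_Time": "green", "No_Records": "yellow"}


def colorize_status(line):
    """Wrap the last whitespace-delimited token in a colored span, keeping the spacing."""
    def wrap(match):
        status = match.group(0)
        color = COLORS.get(status, "#003abc")  # fallback color from the question
        return f'<span style="color:{color}">{status}</span>'
    # \S+$ plays the role of the awk pattern "[^ ]+$": only the final token is touched.
    return re.sub(r"\S+$", wrap, line, count=1)


print(colorize_status("Dec 17   INVEST_AI_FUNDS_FEED     TUE-SAT   02:03   On_Time"))
```

Because the line is never split into fields and rejoined, the runs of blanks between columns survive unchanged inside the pre block.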

How to convert the following text to a comma-separated list using awk - Need to skip headers and trailers

+---------------------------------+------------+------+----------+
| Name | NumCourses | Year | Semester |
+---------------------------------+------------+------+----------+
| ABDULHADI, ASHRAF M | 2 | 1990 | 3 |
| ACHANTA, BALA | 2 | 1995 | 3 |
| ACHANTA, BALA | 2 | 1996 | 3 |
+---------------------------------+------------+------+----------+
648 rows in set (0.02 sec)
--------------------------
Skip the first 3 lines and the last two lines. I would need an output like -
ABDULHADI, ASHRAF M, 2, 1990, 3
ACHANTA, BALA, 2, 1995, 3
ACHANTA, BALA, 2, 1996, 3
You can start with this awk and build on it as you need.
awk '
BEGIN {
FS = " *[|] *" # Set the Field Separator to this pattern
OFS = "," # Set the Output Field Separator to ,
}
NF { # Skip blank lines
$1 = $1 # Reconstruct your input line
gsub(/^,|,$/,"") # Remove leading and trailing ,
lines[++i] = $0 # Add line to array
}
END {
for(x=4;x<=i-2;x++) # Skip first three and last two lines
print lines[x] # Print line
}' file
ABDULHADI, ASHRAF M,2,1990,3
ACHANTA, BALA,2,1995,3
ACHANTA, BALA,2,1996,3
If your data does not have blank lines then you can remove NF and use NR as key instead of ++i
FS pattern above is zero or more spaces followed by pipe (placed in character class to consider it literal, since it is a meta character) followed by zero or more spaces.
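To make the effect of that separator pattern concrete, here is the same split in Python. It is an illustrative sketch; table_row_to_csv is a made-up name, and the ", " join matches the output format requested in the question rather than the OFS used above.

```python
import re

# Zero or more spaces, a literal pipe, zero or more spaces (same as the FS above).
SEPARATOR = re.compile(r" *[|] *")


def table_row_to_csv(line):
    """Split one '| a | b |' row on the pipe pattern and join the cells with commas."""
    cells = SEPARATOR.split(line.strip())
    # The leading and trailing pipes produce empty first and last cells; drop them.
    return ", ".join(cells[1:-1])


print(table_row_to_csv("| ABDULHADI, ASHRAF M             |          2 | 1990 |        3 |"))
```

The empty first and last cells are the Python counterpart of the leading and trailing commas that the gsub in the awk script removes.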
Here is an awk one-liner:
awk -F" *[|] *" 'FNR==NR {a=FNR;next} FNR>3 && FNR<a-2 {print $2,$3,$4,$5}' OFS=", " file{,}
ABDULHADI, ASHRAF M, 2, 1990, 3
ACHANTA, BALA, 2, 1995, 3
ACHANTA, BALA, 2, 1996, 3
This reads the file twice: once to count the lines, and once to produce the output.
file{,} relies on shell brace expansion; if your shell does not support it, write file file instead to read the file twice.
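The reason for the two passes is that the trailer can only be recognised once the total line count is known. In Python the lines can simply be held in a list, so the count is available up front. A rough sketch, with body_lines as a hypothetical helper and the header and trailer sizes as parameters, since how many trailing lines the real file has is an assumption here:

```python
def body_lines(lines, skip_head=3, skip_tail=3):
    """Return the lines between a fixed-size header and a fixed-size trailer."""
    end = len(lines) - skip_tail
    # max() guards against files shorter than header plus trailer.
    return lines[skip_head:max(end, skip_head)]


sample = ["+---+", "| Name |", "+---+", "row 1", "row 2", "row 3",
          "+---+", "648 rows in set (0.02 sec)", "-----"]
print(body_lines(sample))
```

With the default skip_tail=3 this keeps the same lines as the FNR>3 && FNR<a-2 test above; pass a different value if your file ends differently.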