How can I clean a TSV file having record or fields separators in one of its fields? - csv

Given a TSV file with col2 that contains either a field or record separator (FS/RS) being respectively a tab or a carriage return which are escaped/surrounded by quotes.
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \cat -vet
col1^Icol2^Icol3$
1^I"A^IB"^I1234$
2^I"CD$
EF"^I567$
+------+---------+------+
| col1 | col2 | col3 |
+------+---------+------+
| 1 | "A B" | 1234 |
| 2 | "CD | 567 |
| | EF" | |
+------+---------+------+
Is there a way in sed/awk/perl or even (preferably) miller/mlr to transform those pesky characters into spaces in order to generate the following result:
+------+---------+------+
| col1 | col2 | col3 |
+------+---------+------+
| 1 | "A B" | 1234 |
| 2 | "CD EF" | 567 |
+------+---------+------+
I cannot get miller 6.2 to make the proper transformation (tried with DSL put/gsub) because it doesn't recognize the tab or CR/LF being part of the columns which breaks the field number:
$ printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | mlr --opprint --barred --itsv cat
mlr : mlr: CSV header/data length mismatch 3 != 4 at filename (stdin) line 2.

A good library cleanly handles things like embedded newlines and quoted separators (in fields)
In a Perl script with Text::CSV
use warnings;
use strict;
use Text::CSV;
my $file = shift // die "Usage: $0 filename\n";
my $csv = Text::CSV->new( { binary => 1, sep_char => "\t", auto_diag => 1 } );
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $row = $csv->getline($fh)) {
s/\s+/ /g for #$row; # collapse multiple spaces, tabs, newlines
$csv->say(*STDOUT, $row);
}
Note the many other options for the constructor that can help handle various irregularities.
This can fit in a one-liner; its functional interface (with csv) is particularly well suited for that.

if you run
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --c2t --fs "\t" clean-whitespace
col1 col2 col3
1 A B 1234
2 CD EF 567
I'm using mlr 6.2.
A way to do it in miller 5 is to use simply the put verb:
printf '%b\n' 'col1\tcol2\tcol3' '1\t"A\tB"\t1234' '2\t"CD\nEF"\t567' | \
mlr --tsv put -S 'for (k in $*) {$[k] = gsub($[k], "\n", " ")}' then clean-whitespace

perl -MText::CSV_XS=csv -e'
csv
in => *ARGV,
on_in => sub { s/\s+/ /g for #{$_[1]} },
sep_char => "\t";
'
Or s/[\t\n]/ /g if you prefer.
Can be placed all on one line.
Input is accepted from file named by argument or STDIN.

With GNU awk for multi-char RS, RT, and gensub():
$ awk -v RS='"([^"]|"")*"' '{ORS=gensub(/[\n\t]/," ","g",RT)} 1' file
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
The above just uses RS to isolate each "..." string and saves it in RT, then replaces every \n or \t in that string with a blank and saves the result in ORS, then prints the record.

you absolutely don't need gawk to get this done - here's one that works for mawk, gawk, or macos nawk :
INPUT
--before--
col1 col2 col3
1 "A B" 1234
2 "CD
EF" 567
CODE
{m,n,g}awk '
BEGIN {
1 __=substr((OFS=FS="\t\"")(FS)(ORS=_)\
(RS = "^$"),_+=_^=_<_,_)
}
END {
1 printbefore()
3 for (_^=_<_; _<=NF; _++) {
3 sub(/[\t-\r]+/, ($_~__)?" ":"&", $_)
}
1 print
}
1 function printbefore(_)
{
1 printf("\n\n--before--\n%s\n------"\
"------AFTER------\n\n", $+_)>("/dev/stderr")
}
OUTPUT
———AFTER (using mawk)------
col1 col2 col3
1 "A B" 1234
2 "CD EF" 567
strip out the part about printbefore() that's more for debugging purposes, then it's just
{m,n,g}awk '
BEGIN { __=substr((OFS=FS="\t\"") FS \
(ORS=_) (RS="^$"),_+=_^=_<_,_)
} END {
for(--_;_<=NF;_++) {
sub(/[\t-\r]+/, $_~__?" ":"&",$_) } print }'

Related

AWK: How to merge CSV files and eliminate rows that contain certain values?

I have hundreds of CSV files. Each CSV file is similar to this:
| KEYWORD | NUMBER OF COMPS | AVGE M E (K) | GS/M | EST. A SE/M | C CORE |
|---------|-----------------|--------------|------|-------------|--------|
| Apples | 311 | 12 | N/A | <100 | 10 |
| Bananas | >1,200 | 737 | N/A | 490 | 88 |
| Oranges | 48 | 184 | N/A | N/A | 1 |
| Fruits | 161 | 94 | N/A | - | 6 |
(I have posted this in table format, to make it more readable, but the CSV data is at the bottom of this post).
All the CSV files have the same header row. Only the data is different.
I would like to do the following:
Merge all the CSV files together, but only have 1 header row.
Omit any rows where EST. A SE/M (Column 5) contains any of the following data: <100, N/A or -
Notes about the Data
Sometimes the some or even all cells in the CSV file are wrapped in quotation marks.
Other times they are not.
Sometimes the first column (keyword) may contain multiple words or accented characters.
My code so far
This code merges all the CSV files into 1 without only one heading
awk '(NR == 1) || (FNR > 1)' *.csv > ^0-output.csv
This works perfectly.
However, I am not sure how to delete the unwanted rows after the merge.
So far I have this:
awk '$5 !~ /(<100|N\/A|-)/' ^0-output.csv > ^0-output.csv
But when I use this code, it just produces a blank file.
Plus, I am not sure if there is a way to integrate it in the first line, so it does everything with a single command.
Notes
Here is how the data looks in CSV format
Sample1.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Apples,311,12,N/A,<100,10
Bananas,">1,200",737,N/A,490,88
Oranges,48,184,N/A,N/A,1
Fruits,161,94,N/A,-,63
Sample2.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Dino,588,67,N/A,888,234
Thunder,">1,200",211,N/A,<100,77
Ninja,95,37,N/A,-,878
Sample3.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Blur,84,2454,N/A,-,234
Sample4.csv
"KEYWORD","NUMBER OF COMPS","AVGE M E (K)","GS/M","EST. A SE/M","C CORE"
"hedgehog rolls ròund",32,481,N/A,"878",13
"Clever Fox jumps Hîgh",233,83,N/A,"<100",12
"Bear à lot",122,35,N/A,"-",11
"kitten hîgh life","121","673","32","N/A","15"
Please note: The actual files that the finished script will be used on will have a variety of file names. They will NOT always follow the pattern of sample 1, sample 2 etc.
Expected Output
Expected output: (CSV format)
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Bananas,">1,200",737,N/A,490,88
Dino,588,67,N/A,888,234
"hedgehog rolls ròund",32,481,N/A,"878",13
(Note: It doesn't matter if the expected output keeps the wrapping quote marks as the final CSV file is opened in Apple Numbers)
Expected output: (Readable format)
| KEYWORD | NUMBER OF COMPS | AVGE M E (K) | GS/M | EST. A SE/M | C CORE |
|---------|-----------------|--------------|------|-------------|--------|
| Bananas | >1,200 | 737 | N/A | 490 | 88 |
| Dino | 588 | 67 | N/A | 888 | 234 |
| hedgehog rolls ròund | 588 | 67 | N/A | 888 | 234 |
Environment:
I am using Mac OS X 10.14.6. I am unable to install other versions of awk.
You may just add merge 2 conditions into one using && :
awk -F, 'NR==1 || (FNR>1 && $5 !~ /^(<100|N\/A|-)$/)' *.csv > output.csv
Here $5 !~ /^(<100|N\/A|-)$/) will skip a row if $5 is <100 or - or N/A. It is important to use regex anchors ^ and $ to avoid matching unwanted string such as 1000 or AB-123.
It seems you have a comma in double quotes also in file1.csv. In that case following gnu-awk command should work from you:
awk -v FPAT='"[^"]*"|[^,]*' '
NR == 1 || (FNR > 1 && $5 !~ /^(<100|N\/A|-)*$/)' *.csv > output.csv
EDIT: As per OP's comments there could be a comma in between " too, so to handle that its better to use FPAT, written and tested with GNU awk.
awk -v FPAT='[^,]*|"[^"]+"' '
{ sub(/\r$/,"") }
FNR==1{
if(NR==1){ print }
next
}
$5=="<100"||$5=="N/A"||$5=="-"{
next
}
1
' *.csv
Could you please try following, written and tested with GNU awk on shown samples only.
awk '
BEGIN{
FS=OFS=","
}
FNR==1{
if(NR==1){ print }
next
}
$5=="<100"||$5=="N/A"||$5=="-"{ next }
1
' *.csv
OR in case your values can contain something else also and you want to use regex to match the values which you want to neglect then try following.
awk '
BEGIN{
FS=OFS=","
}
FNR==1{
if(NR==1){ print }
next
}
$5~/<100/ || $5~/N\/A/ || $5~/-/{ next }
1
' *.csv
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS="," ##Setting field separator as comma here.
}
FNR==1{ ##Checking condition if its firt line of current Input_file then do following.
if(NR==1){ print } ##If its very first line of very first Input_file then print that line.
next ##next will skip all further statements from here.
}
$5=="<100"||$5=="N/A"||$5=="-"{ next } ##Checking condition if 5th field contains either <100 OR N/A OR - then skip all further statements.
1 ##awk'sh way to print the current line.
' *.csv ##Passing all .csv files to awk program from here.
It looks to me like you're only interested in testing the 2nd-last field and neither that nor the last field can contain commas so just count field numbers from the end instead of from the beginning of each line and then you don't care whether earlier fields contain commas or not. Given that, this will work using any awk:
$ awk -F',' '(NR==1) || (FNR>1 && $(NF-1)!~/^"?(<100|N\/A|-)"?$/)' *.csv
KEYWORD,NUMBER OF COMPS,AVGE M E (K),GS/M,EST. A SE/M,C CORE
Bananas,">1,200",737,N/A,490,88
Dino,588,67,N/A,888,234
"hedgehog rolls ròund",32,481,N/A,"878",13

jq create output in many separate files

given the following json:
[
{"_id":{"$oid":"6d2"},"jlo":"ΕΙ AJSB","dd":"d5f"},
{"_id":{"$oid":"c6d3"},"jlo":"ΕΙ ALKSB","dd":"5d9"},
{"_id":{"$oid":"b0cc6d4"},"jlo":"ΕΙ AGHTSB","dd":"1b1"},
{"_id":{"$oid":"6d2"},"jlo":"ΕPOWΙ AJSB","dd":"d5f"},
{"_id":{"$oid":"c6d3"},"jlo":"ΕGTΙ ALKSB","dd":"5d9"},
{"_id":{"$oid":"b0cc6d4"},"jlo":"ΕLKΙ AGHTSB","dd":"1b1"}
]
what i need to do is have as output for each discrete value of the ll element, the unique values of ta, in a separate file, named after a one to one representation where each dd code is substituted with a human readable representation:
d5f:departmentone
5d9:departmentalt
1b1:departshort
Desired output, in a per row basis, each unique value of jlo with the count of times it was found in each dd element so we get in the end something like this:
first file named departmentone.txt:
ΕΙ AJSB 1
ΕPOWΙ AJSB 1
second file named departmentalt.txt
ΕΙ ALKSB 1
ΕGTΙ ALKSB 1
third file named departshort.txt
ΕΙ AGHTSB 2
i have tried with map and reduce, group_by, sort_by, with really poor results
Only one invocation of jq is necessary. To allocate the output to the separate files, you can combine this one invocation with a single invocation to awk, or you could use a shell loop as illustrated below.
First, here's an illustration of how the shell pipeline would look:
jq -r --rawfile dd2name dd2name.tsv -f group.jq input.json |
while IFS=$'\t' read -r f v ; do echo "$v" >> "$f" ; done
This assumes that the mapping to filenames is in a TSV file named dd2name.tsv, and that the following jq program is in group.jq:
def dict:
split("\n") | map(select(length>0) | split("\t"))
| INDEX(.[0]) | map_values(.[1]);
($dd2name | dict) as $dict
| ($dict | keys_unsorted[]) as $dd
| map(select(.dd == $dd))
| group_by(.jlo)
| map("\($dict[$dd])\t\(.[0].jlo) \(length)")[]
As the name suggests, the dict function creates a dictionary giving the mapping of .dd values to the filenames. It assumes the availability of INDEX. If your jq does not have INDEX, then now would be an excellent time to upgrade your jq; otherwise, its def can easily be copied from builtin.jq (google: builtin.jq "def INDEX"), or you could replace the last line by: | reduce .[] as $p ({}; .[$p[0]] = $p[1]);
awk-based solution
The following invocation of awk can be used instead of the while ... done command above:
awk -F\\t 'fn && (fn!=$1) {close(fn)}; {fn=$1; print $2 >> fn}'
Season to taste
If the dd2name.tsv mapping file does not contain the ".txt" suffix, it can easily be added in any of a variety of ways, according to taste.
Note also that the proposed solutions above make some assumptions, notably that the .jlo values do not contain tabs, newlines, or NULs. If any of those assumptions is violated, then some tweaking will be required.
I'd do it in three passes, filtering the array with the desired dd and grouping by jlo, then extracting the jlo of the first (guaranteed) item of the array and its length :
map(select(.dd == "d5f")) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]
You can try it here.
Full bash run :
jq --arg dd d5f --raw-output 'map(select(.dd == $dd)) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]' yourJsonFile > departmentone.txt
jq --arg dd 5d9 --raw-output 'map(select(.dd == $dd)) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]' yourJsonFile > departmentalt.txt
jq --arg dd 1b1 --raw-output 'map(select(.dd == $dd)) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]' yourJsonFile > departmentshort.txt
Supposing you have a file named "mapping.txt" with the following content :
d5f:departmentone
5d9:departmentalt
1b1:departshort
You could extract those codes and labels to generate the files :
while IFS=: read -r code label; do
jq --arg dd $code --raw-output 'map(select(.dd == $dd)) | group_by(.jlo) | map("\(.[0].jlo) \(length)") | .[]' yourJsonFile > "$label".txt
done < mapping.txt

Read MySQL result set with multiple columns and spaces

Pretend I have a MySQL table test that looks like:
+----+---------------------+
| id | value |
+----+---------------------+
| 1 | Hello World |
| 2 | Foo Bar |
| 3 | Goodbye Cruel World |
+----+---------------------+
And I execute the query SELECT id, value FROM test.
How would I assign each column to a variable in Bash using read?
read -a truncates everything after the first space in value:
mysql -D "jimmy" -NBe "SELECT id, value FROM test" | while read -a row;
do
id="${row[0]}"
value="${row[1]}"
echo "$id : $value"
done;
and output looks like:
1 : Hello
2 : Foo
3 : Goodbye
but I need it to look like:
1 : Hello World
2 : Foo Bar
3 : Goodbye Cruel World
I'm aware there are args I could pass to MySQL to format the results in table format, but I need to parse each value in each row. This is just a simplified example of my problem.
Use individual fields in the read loop instead of the array:
mysql -D "jimmy" -NBe "SELECT id, value FROM test" | while read -r id value;
do
echo "$id : $value"
done
This will make sure that id will be read into the id field and everything else would be read into the value field - that's how read behaves when input has more fields than the number of variables being read into. If there are more columns to be read, using a delimiter (such as #) that doesn't clash with actual data would help:
mysql -D "jimmy" -NBe "SELECT CONCAT(id, '#', value, '#', column3) FROM test" | while IFS='#' read -r id value column3;
do
echo "$id : $value : $column3"
done
You can do this, also avoid piping a command to a while read loop if possible to avoid creating a subshell.
while read -r line; do
id=$(echo $line | awk '{print $1}')
value=$(echo $line | awk '{print $1=""; print $0}'|sed ':a;N;$!ba;s/\n/ /g'| sed 's/^[ \t]*//g')
echo "ID: $id"
echo "VALUE: $value"
done< <(mysql -D "jimmy" -NBe "SELECT id, value FROM test")
If you want to store all the id's and values in an array for later use, you can modify it to look like this.
#!/bin/bash
declare -A -g arr
while read -r line; do
id=$(echo $line | awk '{print $1}')
value=$(echo $line | awk '{print $1=""; print $0}'|sed ':a;N;$!ba;s/\n/ /g'| sed 's/^[ \t]*//g')
arr[$id]=$value
done< <(mysql -D "jimmy" -NBe "SELECT id, value FROM test")
for key in "${!arr[#]}"; do
echo "$key: ${arr[$key]}"
done
Which gives you this output
dumbledore#ansible1a [OPS]:~/tmp/tmp > bash test.sh
1: Hello World
2: Foo Bar
3: Goodbye Cruel World

How to convert the following text to comma seperated list using awk - Need to skip headers and trailers

+---------------------------------+------------+------+----------+
| Name | NumCourses | Year | Semester |
+---------------------------------+------------+------+----------+
| ABDULHADI, ASHRAF M | 2 | 1990 | 3 |
| ACHANTA, BALA | 2 | 1995 | 3 |
| ACHANTA, BALA | 2 | 1996 | 3 |
+---------------------------------+------------+------+----------+
648 rows in set (0.02 sec)
--------------------------
Skip the first 3 lines and the last two lines. I would need an output like -
ABDULHADI, ASHRAF M, 2, 1990, 3
ACHANTA, BALA, 2, 1995, 3
ACHANTA, BALA, 2, 1996, 3
You can start with this awk and build on it as you need.
awk '
BEGIN {
FS = " *[|] *" # Set the Field Separator to this pattern
OFS = "," # Set the Output Field Separator to ,
}
NF { # Skip blank lines
$1 = $1 # Reconstruct your input line
gsub(/^,|,$/,"") # Remove leading and trailing ,
lines[++i] = $0 # Add line to array
}
END {
for(x=4;x<=i-2;x++) # Skip first three and last two lines
print lines[x] # Print line
}' file
ABDULHADI, ASHRAF M,2,1990,3
ACHANTA, BALA,2,1995,3
ACHANTA, BALA,2,1996,3
If your data does not have blank lines then you can remove NF and use NR as key instead of ++i
FS pattern above is zero or more spaces followed by pipe (placed in character class to consider it literal, since it is a meta character) followed by zero or more spaces.
Here is an awk
awk -F" *[|] *" 'FNR==NR {a=FNR;next} FNR>3 && FNR<a-2 {print $2,$3,$4,$5}' OFS=", " file{,}
ABDULHADI, ASHRAF M, 2, 1990, 3
ACHANTA, BALA, 2, 1995, 3
ACHANTA, BALA, 2, 1996, 3
Read the file two times, one to count the lines, one to get the correct output.
If your awk does not work with file{,}, change to file file to read it two times

AWK: Converting CSV with Headers to Summary Table

I have a need to document 100+ CSV files as far as the format of those files and including sample data. What I would like to do is take a CSV of the following format:
Name, Phone, State
Fred, 1234567, TX
John, 2345678, NC
and convert it to:
Field | Sample
--- | ----
Name | Fred
Phone | 1234567
State | TX
Is this possible with AWK? From my example below, you will see I am trying to format as a markdown table. I have it currently transposing the header row with
#!/usr/bin/awk -v RS='\r\n' -f
BEGIN { printf "| Field \t| Critical |\n"}
{
printf "|---\t|---\t|\n"
for (i=1; i<=NF; i++) {print "|", toupper($i), "| sample |"}
}
END {}
But I am not sure now how to use the first row of data, after the header to display the sample data?
awk is the right tool for data parsing. You can try something like:
awk '
BEGIN { FS=", "; OFS=" | " }
NR==1 {
for(tag = 1; tag <= NF; tag++) {
hdr[tag] = sprintf ("%-7s", $tag)
}
next
}
{
for(fld = 1; fld <= NF; fld++) {
data[NR,fld] = $fld
}
}
END {
print "Field | Sample\n------- | -------";
for(rec = 2; rec <= NR; rec++) {
for(line = 1; line <= NF; line++) {
print hdr[line], data[rec,line]
}
}
}' file
Output:
Field | Sample
------- | -------
Name | Fred
Phone | 1234567
State | TX
Name | John
Phone | 2345678
State | NC
Here is a more simple way to do it with awk
No need to store everything in a array then print at the end.
awk -F", " 'NR==1{split($0,a,FS);print "Field | Sample\n------- | -------";next} {for (i=1;i<=NF;i++) printf "%-8s| %s\n",a[i],$i}' file
Field | Sample
------- | -------
Name | Fred
Phone | 1234567
State | TX
Name | John
Phone | 2345678
State | NC
How it works:
awk -F", " ' # set field separator to ","
NR==1{ # if first line do:
split($0,a,FS) # split first line to an array named "a" to get the labels
print "Field | Sample" # print header
print "------- | -------" # print separator
next} # prevents nothing more run for first line
{ # for all lines except first do:
for (i=1;i<=NF;i++) # loop trough all element in line
printf "%-8s| %s\n",a[i],$i # print data for every element
}
' file