awk: appending columns from multiple CSV files into a single CSV file

I have several CSV files (all have the same number of rows and columns). Each file follows this format:
1 100.23 1 102.03 1 87.65
2 300.56 2 131.43 2 291.32
. . . . . .
. . . . . .
200 213.21 200 121.81 200 500.21
I need to extract columns 2, 4 and 6, and add them to a single CSV file.
I have a loop in my shell script which goes through all the CSV files, extracts the columns, and appends these columns to a single file:
# output the leading row-number column
awk -F"," 'BEGIN {OFS=","}{ print $1; }' "$input" > "$output"
for f in "$1"*.csv;
do
if [[ -f "$f" ]] #removes symlinks (only executes on files with .csv extension)
then
fname=$(basename $f)
arr+=("$fname") #array to store filenames
paste -d',' $output <(awk -F',' '{ print $2","$4","$6; }' "$f") > temp.csv
mv temp.csv "$output"
fi
done
Running this produces this output:
1 100.23 102.03 87.65 219.42 451.45 903.1 ... 542.12 321.56 209.2
2 300.56 131.43 291.32 89.57 897.21 234.52 125.21 902.25 254.12
. . . . . . . . . .
. . . . . . . . . .
200 213.23 121.81 500.21 231.56 5023.1 451.09 ... 121.09 234.45 709.1
My desired output is a single CSV file that looks something like this:
1.csv 1.csv 1.csv 2.csv 2.csv 2.csv ... 700.csv 700.csv 700.csv
1 100.23 102.03 87.65 219.42 451.45 903.1 542.12 321.56 209.2
2 300.56 131.43 291.32 89.57 897.21 234.52 125.21 902.25 254.12
. . . . . . . . . .
. . . . . . . . . .
200 213.23 121.81 500.21 231.56 5023.1 451.09 ... 121.09 234.45 709.1
In other words, I need a header row containing the file names in order to identify which files the columns were extracted from. I can't seem to wrap my head around how to do this.
What is the easiest way to achieve this (preferably using awk)?
I was thinking of storing the file names in an array, inserting a header row and then printing the array, but I can't figure out the syntax.

So, based on a few assumptions:
the inputs are called "*.csv" but they're actually whitespace-separated, as they appear.
the odd-numbered input columns just repeat the row number 3 times, and can be ignored
the column headings are just the filenames, repeated 3 times each
they are input to some other program, and the numbers are left-justified anyway, so you aren't particular about the column formatting (columns lining up, decimals aligned, ...)
f=$(set -- *.csv; echo $*)   # the csv file names, space-separated, in glob order
(echo $f; paste $f) |        # line 1: the file names; then every file pasted side by side
awk 'NR==1 { for (i=1; i<=NF; i++) {x=x" "$i" "$i" "$i} }   # header: each name three times
     NR>1  { x=$1; for (i=2; i<=NF; i+=2) {x=x" "$i} }      # row number plus the even (value) columns
     {print x}'
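Alternatively, if you would rather keep your original loop, the header row you asked about can be built from the array you are already filling and prepended at the end. A rough bash sketch, reusing your arr and $output variables:
# after the loop: build the header row from the stored file names,
# repeating each name three times to match the three extracted columns
header=""
for fname in "${arr[@]}"; do
    header+="$fname,$fname,$fname,"
done
header=${header%,}   # drop the trailing comma
# prepend the header to the file the loop produced
{ echo "$header"; cat "$output"; } > temp.csv && mv temp.csv "$output"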
hth

Related

Create CSV file with below output line

I have the output line below, and from this line I want to create a CSV file. In the CSV
it should print the line below as the first column, and in the second column I want to print the string before the second delimiter ":". I am using the script below, but it separates the data wherever "," is present; I want to print the whole line in the first column and the string after the second delimiter ":" in the second column. Please help me sort the data into the proper format.
output line:
:/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my($lhys,$lytrn,$ccdethe
shell script:
input="out.txt"
while IFS= read -r LINES
do
    #echo "$LINES"
    if [[ $LINES = /* ]]
    then
        filename=$(echo "$LINES" | cut -d ":" -f1)
        echo "$LINES,$filename" >> out.csv
    fi
done < "$input"
I don't think I understand your question correctly.
You currently have this output
:/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my($lhys,$lytrn,$ccdethe
And you would like to have this kind of CSV output
column "2": /home/nagios/NaCl/files/chk_raid.pl:token=13704value=undef;next};my($lhys,$lytrn,$ccdethe
column "3": token=13704value=undef;next};my($lhys,$lytrn,$ccdethe
If that's what you want, you can use Miller like this
echo ":/home/nagios/NaCl/files/chk_raid.pl:token=$$value=undef;next};my(\$lhys,\$lytrn,\$ccdethe" | mlr --n2c --ifs ":" cut -x -f 1 then put '$2=$2.":".$3'
and you will have this two columns CSV
2,3
"/home/nagios/NaCl/files/chk_raid.pl:token=13704value=undef;next};my($lhys,$lytrn,$ccdethe","token=13704value=undef;next};my($lhys,$lytrn,$ccdethe"

How to use awk to sum up fields based on other field

In my assessment I'm asked to write a shell script using only bash commands and another shell script using only SQL queries. These scripts should do the following:
1. Clean data in the .csv file (not important at the moment)
2. Sum up earnings based upon gender
3. Produce a simple HTML table
I have made the SQL query produce the correct numbers and HTML file, but with some help from other bash commands.
For the file that should only contain bash commands I'm able to get the table, but one of the numbers is wrong.
I'm very new to bash scripting and SQL queries, so the code isn't very optimised.
The following is a shortened version of the sample input:
CSV input
title,site,country,year_release,box_office,director,number_of_subjects,subject,type_of_subject,race_known,subject_race,person_of_color,subject_sex,lead_actor_actress
10 Rillington Place,http://www.imdb.com/title/tt0066730/,UK,1971,-,Richard Fleischer,1,John Christie,Criminal,Unknown,,0,Male,Richard Attenborough
12 Years a Slave,http://www.imdb.com/title/tt2024544/,US/UK,2013,56700000,Steve McQueen,1, Solomon Northup,Other,Known,African American,1,Male,Chiwetel Ejiofor
127 Hours,http://www.imdb.com/title/tt1542344/,US/UK,2010,18300000,Danny Boyle,1,Aron Ralston,Athlete,Unknown,,0,Male,James Franco
1987,http://www.imdb.com/title/tt2833074/,Canada,2014,-,Ricardo Trogi,1,Ricardo Trogi,Other,Known,White,0,Male,Jean-Carl Boucher
20 Dates,http://www.imdb.com/title/tt0138987/,US,1998,537000,Myles Berkowitz,1,Myles Berkowitz,Other,Unknown,,0,Male,Myles Berkowitz
21,http://www.imdb.com/title/tt0478087/,US,2008,81200000,Robert Luketic,1,Jeff Ma,Other,Known,Asian American,1,Male,Jim Sturgess
24 Hour Party People,http://www.imdb.com/title/tt0274309/,UK,2002,1130000,Michael Winterbottom,1,Tony Wilson,Musician,Known,White,0,Male,Steve Coogan
42,http://www.imdb.com/title/tt0453562/,US,2013,95000000,Brian Helgeland,1,Jackie Robinson,Athlete,Known,African American,1,Male,Chadwick Boseman
8 Seconds,http://www.imdb.com/title/tt0109021/,US,1994,19600000,John G. Avildsen,1,Lane Frost,Athlete,Unknown,,0,Male,Luke Perry
84 Charing Cross Road,http://www.imdb.com/title/tt0090570/,US/UK,1987,1080000,David Hugh Jones,2,Frank Doel,Author,Unknown,,0,Male,Anthony Hopkins
84 Charing Cross Road,http://www.imdb.com/title/tt0090570/,US/UK,1987,1080000,David Hugh Jones,2,Helene Hanff,Author,Unknown,,0,Female,Anne Bancroft
A Beautiful Mind,http://www.imdb.com/title/tt0268978/,US,2001,171000000,Ron Howard,1,John Nash,Academic,Unknown,,0,Male,Russell Crowe
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Carl Gustav Jung,Academic,Known,White,0,Male,Michael Fassbender
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Sigmund Freud,Academic,Known,White,0,Male,Viggo Mortensen
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Sabina Spielrein,Academic,Known,White,0,Female,Keira Knightley
A Home of Our Own,http://www.imdb.com/title/tt0107130/,US,1993,1700000,Tony Bill,1,Frances Lacey,Other,Unknown,,0,Female,Kathy Bates
A Man Called Peter,http://www.imdb.com/title/tt0048337/,US,1955,-,Henry Koster,1,Peter Marshall,Other,Known,White,0,Male,Richard Todd
A Man for All Seasons,http://www.imdb.com/title/tt0060665/,UK,1966,-,Fred Zinnemann,1,Thomas More,Historical,Known,White,0,Male,Paul Scofield
A Matador's Mistress,http://www.imdb.com/title/tt0491046/,US/UK,2008,-,Menno Meyjes,2,Lupe Sino,Actress ,Known,Hispanic (White),0,Female,Penélope Cruz
For the SQL queries only file this is my code so far (produces right numbers and correct table):
python3 csv2sqlite.py --table-name test_table --input table.csv --output table.sqlite
echo -e '<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>' >> tmp1.txt
sqlite3 table.sqlite 'SELECT subject_sex,SUM(box_office) FROM test_table \
GROUP BY subject_sex;' -html > tmp2.txt
cat tmp2.txt >> tmp1.txt
echo '</TABLE>' >> tmp1.txt
cp tmp1.txt "$1"
cat "$1"
rm tmp1.txt tmp2.txt
For the bash only file this is my code so far:
echo -e '<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>' >> tmp1.txt
awk -F ',' '{for (i=1;i<=NF;i++)
if ($1)
a[$13] += $5} END{for (i in a) printf("<TR><TD> %s </TD><TD> %i </TD></TR>\n", i, a[i])}' table.csv | sort | head -2 > tmp2.txt
cat tmp2.txt >> tmp1.txt
echo -e "</TABLE>" >> tmp1.txt
cp tmp1.txt "$1"
cat "$1"
rm tmp1.txt tmp2.txt
The expected output should look like this:
<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>
<TR><TD>Female</TD>
<TD>8480000.0</TD>
</TR>
<TR><TD>Male</TD>
<TD>455947000.0</TD>
</TR>
</TABLE>
Thank you in advance!
#! /bin/bash
awk -F, '{
if (NR != 1)
{
if (sum[$13] == "")
{
sum[$13]=0
}
sum[$13]+=$5
}
}
END {
print "<TABLE BORDER = \"1\">"
print "<TR><TH>Gender</TH><TH>Total Amount [$]</TH></TR>"
for ( gender in sum )
{
print "<TR><TD>"gender"</TD>", "<TD>"sum[gender]"</TD></TR>"
}
print "</TABLE>"
}' table.csv
Try this and see if it works for you.
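One thing worth knowing about why this works on your data: awk coerces non-numeric strings to 0 in arithmetic, so the "-" box-office values simply add nothing. If you would rather skip those rows explicitly, a small variation (a sketch) is:
awk -F, 'NR > 1 && $5 ~ /^[0-9.]+$/ { sum[$13] += $5 }
END { for (gender in sum) print gender, sum[gender] }' table.csv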
UPDATE:
What I understand from your comment is that you want to sort data as per the sum.
#! /bin/bash
awk -F, -v OFS=, '{
if (NR != 1)
{
if (sum[$13] == "")
{
sum[$13]=0
}
sum[$13]+=$5
}
}
END {
for ( gender in sum )
{
print gender, sum[gender]
}
}' table.csv | sort -t, -nk 2,2 |
awk -v firstline="$(sed -n '1p' table.csv)" '{
printrow($0)
}
BEGIN {
split(firstline, headers, ",")
print "<html>"
print "<TABLE BORDER = "1">"
printrow(headers[5]","headers[13], 1)
}
END {
print "</table>"
print "</html>"
}
function printrow(row, flag)
{
# if flag == 0 or null "<TD>" else "<TH>"
len = split(row, cells, ",")
print "<TR>"
for (i = 1 ; i <= len ; ++i)
{
if (!flag)
print "<TD>"cells[i]"</TD>"
else
print "<TH>"cells[i]"</TH>"
}
print "</TR>"
}'
Above, I have basically divided what you need into two modules:
Manipulating the data in the table:
1) Organises the table.
2) Sorts the data by the second column. I could have done this in the first awk script itself, but it was a little shorter this way.
Converting it into an HTML table:
The second awk script receives the output of the first one and adds the headings and tags.
I feel it's more modular this way, which makes it easier to make modifications: the first script handles the data manipulation and the second places the headers and tags.
What I would personally prefer is to give the second awk script its own executable file, so you simply use the first script for data manipulation and then pass its output to another script for setting the HTML tags and headers; a sketch of that usage follows below.
There might be better alternatives; I have suggested the best I knew.
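For illustration only (the file names sum_by_gender.sh and tohtml.awk are hypothetical), that split could look like this:
# save the first awk pipeline as sum_by_gender.sh and the second awk
# program as tohtml.awk, then chain them; -f loads awk code from a file
./sum_by_gender.sh | sort -t, -nk 2,2 |
awk -v firstline="$(sed -n '1p' table.csv)" -f tohtml.awk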

Export a text file to MySQL in Talend

So I have many files like this:
The first file:
File1;
Code;1971;1981;1991;2001;2011
A;10;20;30;40;50
B;12;22;32;89;95
...
...
The second file:
File2;
Code;1971;1981;1991;2001;2011
A;1500;1600;460;6000;8000
B;6000;7000;8007;8009;9005
...
...
All files have the exact same format.
I would like to have a table in my database like this:
File Code Year Value
File1 A 1971 10
File1 A 1981 20
File1 A 1991 30
. . .
. . .
File2 A 1971 1500
File2 A 1981 1600
File1 A 1991 460
. . . .
. . . .
File2 B 1971 .
File2 B 1981 .
. . . .
My idea is to create a tMap for each file, in which we have the file as follows:
My solution with tMap
The problem is that I have so many files like this, and this solution will take me a long time to finish. Is there any better solution?
I prefer using PHP to dump all data into MySQL...
Code;1971;1981;1991;2001;2011
A;10;20;30;40;50
B;12;22;32;89;95
After opening and reading the file ($file), try this...
<?php
function _T($a){return rtrim(trim($a));}
function _E($a,$b){return explode($a,$b);}
function _S($a,$b,$c){return str_replace($a,$b,$c);}
function _G($a,$b,$c){$d=strpos($a,$b);$e=substr($a,$d);$f=strpos($a,$c);return substr($e,0,($f-$d));}
$file='Code;1971;1981;1991;2001;2011
A;10;20;30;40;50
B;12;22;32;89;95';
$fC=_E('|',_S('Code;','|Code;',$file));
/*Useful with Multiple "Code;" in same file too*/
$fC2=($file).'|';
/* Add "|" to the last */
$fY=array();/*Year*/
$fA=array();/*A*/
$fB=array();/*B*/
$dB=array();/* For Database */
foreach($fC as $fR){
/* Get Year; */
$gY=_E(';',_S('Code;','',_G(_T($fR),'Code;','A;')));
$c=0;foreach($gY as $yR){if(!empty($yR)){/*echo _T($yR).'<br>';*/$fY[$c++]=_T($yR);}}
/* Get A */
$gA=_E(';',_S('A;','',_G(_T($fR),'A;','B;')));
$c=0;foreach($gA as $aR){if(!empty($aR)){/*echo _T($aR).'<br>';*/$fA[$c++]=_T($aR);}}
/* Get B */
$gB=_E(';',_S('B;','',_G(_T($fR),'B;','|')));
$c=0;foreach($gB as $bR){if(!empty($bR)){/*echo _T($bR).'<br>';*/$fB[$c++]=_T($bR);}}
if(empty($fB)){$gB=_E(';',_S('B;','',_G(_T($fC2),'B;','|')));
$c=0;foreach($gB as $bR){if(!empty($bR)){/*echo _T($bR).'<br>';*/$fB[$c++]=_T($bR);}}}
}
/* Show Result */
foreach($fY as $fK=>$fV){
echo $fK.':'.$fV.'='.$fA[$fK].'='.$fB[$fK].'<br>';
$dB[]=array('Year'=>$fV,'A'=>$fA[$fK],'B'=>$fB[$fK]);
}print_r($dB);
?>
Test yourself... :)
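For what it is worth, if a shell pass is an option before the MySQL load, a short awk sketch can unpivot every file into the long File/Code/Year/Value layout in one go. Assumptions: each file's first line is its label (File1;), the second line holds the year headings, and the files match the glob File*.csv (adjust to taste):
awk -F';' -v OFS=';' '
    FNR == 1 { name = $1; next }                              # first line: the file label, e.g. File1
    FNR == 2 { for (i = 1; i <= NF; i++) yr[i] = $i; next }   # second line: the year headings
    NF       { for (i = 2; i <= NF; i++) print name, $1, yr[i], $i }
' File*.csv > long.csv
The resulting long.csv can then be loaded into the target table with a single LOAD DATA INFILE (or one Talend tFileInputDelimited) instead of one tMap per file.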

RegEx match for multiline in Perl

I have the following data line I need to parse in Perl:
my $string='Upper Left ( 440720.000, 3751320.000) (117d38\'28.21"W, 33d54\'8.47"N)';
Here is my perl script:
if ($string=~ m/Upper Left\s+[(]\s+\d{1,6}[.]\d{1,3}[\,]\s+\d{1,6}[.]\d{1,3}[)]\s+[(](\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})/ig) {
$upperLeft="lat=". $1. 'd'. $2. "'". $3. ".". $4. '"W long='. $5. 'd'. $6. "'". $7. ".". $8. '"W';
print $upperLeft. "\n";
}
However, this expression fails to extract 117d38'28.21" as lat and 33d54'8.47 as long. Note the space and '(' in the input $string, which I used when creating this regular expression.
What am I doing wrong in extracting (117d38'28.21"W, 33d54'8.47"N) into 8 fields? Any help is appreciated.
You had several issues, the main one being that your regex only parses up to lat, not lon.
What changed, your original pattern first and the corrected one below it:
m/Upper Left\s+[(]\s+\d{1,6}[.]\d{1,3}[\,]\s+\d{1,6}[.]\d{1,3}[)]\s+[(](\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})/ig
m/Upper Left\s+[(]\s+\d{1,6}[.]\d{1,3}[\,]\s+\d{1,7}[.]\d{1,3}[)]\s+[(](\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})"([WE])[\,]\s(\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})"([NS])/ig
The second coordinate in your test string is 7 digits wide, so the second \d{1,6} became \d{1,7}. At the end: (1) added a group to capture W/E (([WE])); (2) added groups to extract the lon numbers; (3) added a group to capture N/S (([NS])).
Your code, corrected:
if ($string=~ m/Upper Left\s+[(]\s+\d{1,6}[.]\d{1,3}[\,]\s+\d{1,7}[.]\d{1,3}[)]\s+[(](\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})"([WE])[\,]\s(\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})"([NS])/ig) {
$upperLeft = "lat=" . $1 . 'd' . $2 . "'" . $3 . "." . $4 . '"' . $5 . " long=" . $6 . 'd' . $7 . "'" . $8 . "." . $9 . '"' . $10;
print $upperLeft. "\n";
}
Output:
lat=117d38'28.21"W long=33d54'8.47"N

shell script to read CSV missing first line

I'm using the following:
#!/bin/sh
export IFS=","
cat myfile.csv | while read a b c d e f g h i j
do
#example do
echo "a=$a"
echo "b=$b:"
echo "c=$c:"
echo "d=$d:"
echo "e=$e:"
echo "f=$f:"
echo "g=$g:"
echo "h=$h:"
echo "i=$i:"
echo "j=$j:"
done
to read variables in from a CSV file. It works quite well, except that it seems to miss the first line of the file. Why is this, and can I change it to capture line one as well?
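As an aside, a common tightening of this pattern scopes IFS to the read itself, drops the cat, and also catches a final line that lacks a trailing newline. A sketch (it will not by itself explain a missing first line, but it rules out the usual loop pitfalls):
#!/bin/sh
# IFS applies to read only; the || test keeps an unterminated last line
while IFS=, read -r a b c d e f g h i j || [ -n "$a" ]
do
    echo "a=$a"
    echo "b=$b:"
    # ...print the remaining fields as before...
done < myfile.csv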