In my assessment I'm asked to write a shell script using only bash commands and another shell script using only SQL queries. These scripts should do the following:
1. Clean data in the .csv file (not important at the moment)
2. Sum up earnings based upon gender
3. Produce a simple HTML table
I have made the SQL query produce the correct numbers and HTML file, but with some help from other bash commands.
For the file that should only contain bash commands I'm able to get the table, but one of the numbers is wrong.
I'm very new to bash scripting and SQL queries, so the code isn't very optimised.
The following is a shortened version of the sample input:
CSV input
title,site,country,year_release,box_office,director,number_of_subjects,subject,type_of_subject,race_known,subject_race,person_of_color,subject_sex,lead_actor_actress
10 Rillington Place,http://www.imdb.com/title/tt0066730/,UK,1971,-,Richard Fleischer,1,John Christie,Criminal,Unknown,,0,Male,Richard Attenborough
12 Years a Slave,http://www.imdb.com/title/tt2024544/,US/UK,2013,56700000,Steve McQueen,1, Solomon Northup,Other,Known,African American,1,Male,Chiwetel Ejiofor
127 Hours,http://www.imdb.com/title/tt1542344/,US/UK,2010,18300000,Danny Boyle,1,Aron Ralston,Athlete,Unknown,,0,Male,James Franco
1987,http://www.imdb.com/title/tt2833074/,Canada,2014,-,Ricardo Trogi,1,Ricardo Trogi,Other,Known,White,0,Male,Jean-Carl Boucher
20 Dates,http://www.imdb.com/title/tt0138987/,US,1998,537000,Myles Berkowitz,1,Myles Berkowitz,Other,Unknown,,0,Male,Myles Berkowitz
21,http://www.imdb.com/title/tt0478087/,US,2008,81200000,Robert Luketic,1,Jeff Ma,Other,Known,Asian American,1,Male,Jim Sturgess
24 Hour Party People,http://www.imdb.com/title/tt0274309/,UK,2002,1130000,Michael Winterbottom,1,Tony Wilson,Musician,Known,White,0,Male,Steve Coogan
42,http://www.imdb.com/title/tt0453562/,US,2013,95000000,Brian Helgeland,1,Jackie Robinson,Athlete,Known,African American,1,Male,Chadwick Boseman
8 Seconds,http://www.imdb.com/title/tt0109021/,US,1994,19600000,John G. Avildsen,1,Lane Frost,Athlete,Unknown,,0,Male,Luke Perry
84 Charing Cross Road,http://www.imdb.com/title/tt0090570/,US/UK,1987,1080000,David Hugh Jones,2,Frank Doel,Author,Unknown,,0,Male,Anthony Hopkins
84 Charing Cross Road,http://www.imdb.com/title/tt0090570/,US/UK,1987,1080000,David Hugh Jones,2,Helene Hanff,Author,Unknown,,0,Female,Anne Bancroft
A Beautiful Mind,http://www.imdb.com/title/tt0268978/,US,2001,171000000,Ron Howard,1,John Nash,Academic,Unknown,,0,Male,Russell Crowe
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Carl Gustav Jung,Academic,Known,White,0,Male,Michael Fassbender
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Sigmund Freud,Academic,Known,White,0,Male,Viggo Mortensen
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Sabina Spielrein,Academic,Known,White,0,Female,Keira Knightley
A Home of Our Own,http://www.imdb.com/title/tt0107130/,US,1993,1700000,Tony Bill,1,Frances Lacey,Other,Unknown,,0,Female,Kathy Bates
A Man Called Peter,http://www.imdb.com/title/tt0048337/,US,1955,-,Henry Koster,1,Peter Marshall,Other,Known,White,0,Male,Richard Todd
A Man for All Seasons,http://www.imdb.com/title/tt0060665/,UK,1966,-,Fred Zinnemann,1,Thomas More,Historical,Known,White,0,Male,Paul Scofield
A Matador's Mistress,http://www.imdb.com/title/tt0491046/,US/UK,2008,-,Menno Meyjes,2,Lupe Sino,Actress ,Known,Hispanic (White),0,Female,Penélope Cruz
For the SQL-only file, this is my code so far (it produces the right numbers and the correct table):
python3 csv2sqlite.py --table-name test_table --input table.csv --output table.sqlite
echo -e '<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>' >> tmp1.txt
sqlite3 table.sqlite 'SELECT subject_sex, SUM(box_office) FROM test_table
GROUP BY subject_sex;' -html > tmp2.txt
cat tmp2.txt >> tmp1.txt
echo '</TABLE>' >> tmp1.txt
cp tmp1.txt $1
cat $1
rm tmp1.txt tmp2.txt
For the bash-only file, this is my code so far:
echo -e '<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>' >> tmp1.txt
awk -F ',' '{for (i=1;i<=NF;i++)
if ($1)
a[$13] += $5} END{for (i in a) printf("<TR><TD> %s </TD><TD> %i </TD></TR>\n", i, a[i])}' table.csv | sort | head -2 > tmp2.txt
cat tmp2.txt >> tmp1.txt
echo -e "</TABLE>" >> tmp1.txt
cp tmp1.txt $1
cat $1
rm tmp1.txt tmp2.txt
The expected output should look like this:
<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>
<TR><TD>Female</TD>
<TD>8480000.0</TD>
</TR>
<TR><TD>Male</TD>
<TD>455947000.0</TD>
</TR>
</TABLE>
Thank you in advance!
#! /bin/bash
awk -F, '{
if (NR != 1)
{
if (sum[$13] == "")
{
sum[$13]=0
}
sum[$13]+=$5
}
}
END {
print "<TABLE BORDER = \"1\">"
print "<TR><TH>Gender</TH><TH>Total Amount [$]</TH></TR>"
for ( gender in sum )
{
print "<TR><TD>"gender"</TD>", "<TD>"sum[gender]"</TD></TR>"
}
print "</TABLE>"
}' table.csv
Try this and see if it works for you.
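As for why the attempt in the question gives wrong totals, one likely cause (judging only from the posted code, not from a run on the full file) is the `for (i=1;i<=NF;i++)` loop: its body adds `$5` once per field, so every row is counted NF times. A small sketch with a made-up 14-field row (the row and the function names `looped_sum` and `single_sum` are invented for illustration):

```shell
# One made-up row with 14 comma-separated fields; field 5 is 10, field 13 is Male.
row='t,u,c,2000,10,d,1,s,o,k,r,0,Male,a'

# The loop from the question: $5 is added once for every field in the row.
looped_sum() {
  printf '%s\n' "$row" |
    awk -F, '{ for (i = 1; i <= NF; i++) if ($1) a[$13] += $5 }
             END { print a["Male"] }'
}

# Without the loop: $5 is added exactly once per row.
single_sum() {
  printf '%s\n' "$row" |
    awk -F, '{ a[$13] += $5 } END { print a["Male"] }'
}

looped_sum   # prints 140 (10 added 14 times)
single_sum   # prints 10
```

The script above avoids this by adding `$5` once per line and skipping the header with `NR != 1`.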
UPDATE:
From your comment, I understand that you want the rows sorted by the sum.
#! /bin/bash
awk -F, -v OFS=, '{
if (NR != 1)
{
if (sum[$13] == "")
{
sum[$13]=0
}
sum[$13]+=$5
}
}
END {
for ( gender in sum )
{
print gender, sum[gender]
}
}' table.csv | sort -t, -nk 2,2 |
awk -v firstline="$(sed -n '1p' table.csv)" '{
printrow($0)
}
BEGIN {
split(firstline, headers, ",")
print "<html>"
print "<TABLE BORDER = \"1\">"
printrow(headers[13]","headers[5], 1)
}
END {
print "</table>"
print "</html>"
}
function printrow(row, flag)
{
# if flag == 0 or null "<TD>" else "<TH>"
len = split(row, cells, ",")
print "<TR>"
for (i = 1 ; i <= len ; ++i)
{
if (!flag)
print "<TD>"cells[i]"</TD>"
else
print "<TH>"cells[i]"</TH>"
}
print "</TR>"
}'
Above, I have divided what you need into two modules.
Manipulating the data in the table:
1) The first awk script sums the earnings per gender.
2) sort then orders the rows by the 2nd column (the sum). I could have done this inside the first awk script, but it was a little shorter this way.
Converting it into an HTML table:
The second awk script receives the output of the first one.
It adds the headings and tags.
I feel it's more modular this way, which makes modifications easier: the first script handles the data manipulation and the second places the headers and tags.
Personally, I would put the second awk script in its own executable file, so that the first script does the data manipulation and pipes its output to the other script, which adds the HTML tags and headers.
There may be better alternatives; this is the best approach I know.
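The two-module split can be sketched with two small shell functions, one per stage. This is only an illustration under assumptions: the function names `sum_by_gender` and `to_html_table` are made up, the header texts are taken from the expected output in the question, and the rows are sorted by gender name rather than by sum.

```shell
# Stage 1: sum column 5 (box_office) per value of column 13 (subject_sex),
# skipping the header line; emits "gender,total" lines sorted by gender.
sum_by_gender() {
  awk -F, -v OFS=, 'NR > 1 { sum[$13] += $5 }
                    END { for (g in sum) print g, sum[g] }' "$1" | sort -t, -k1,1
}

# Stage 2: wrap "gender,total" lines read from stdin in an HTML table.
to_html_table() {
  awk -F, 'BEGIN { print "<TABLE BORDER = \"1\">"
                   print "<TR><TH>Gender</TH><TH>Total Amount [$]</TH></TR>" }
           { print "<TR><TD>" $1 "</TD><TD>" $2 "</TD></TR>" }
           END { print "</TABLE>" }'
}

# Usage: sum_by_gender table.csv | to_html_table > out.html
```

Because the stages only share the plain `gender,total` text format, either one can be changed or moved into its own file without touching the other.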
I have many files like this:
The first file:
File1;
Code;1971;1981;1991;2001;2011
A;10;20;30;40;50
B;12;22;32;89;95
...
...
The second file:
File2;
Code;1971;1981;1991;2001;2011
A;1500;1600;460;6000;8000
B;6000;7000;8007;8009;9005
...
...
All files have exactly the same format.
I would like to have a table in my database like this:
File Code Year Value
File1 A 1971 10
File1 A 1981 20
File1 A 1991 30
. . .
. . .
File2 A 1971 1500
File2 A 1981 1600
File2 A 1991 460
. . . .
. . . .
File2 B 1971 .
File2 B 1981 .
. . . .
My idea is to create, for each file, a t_map in which the file is mapped as follows:
[Image: my solution with t_map]
The problem is that I have so many files like this that my solution would take a long time to finish. Is there a better solution?
I prefer using PHP to dump all data into MySQL...
Code;1971;1981;1991;2001;2011
A;10;20;30;40;50
B;12;22;32;89;95
After opening the file and reading it into $file, try this:
<?php
function _T($a){return rtrim(trim($a));}
function _E($a,$b){return explode($a,$b);}
function _S($a,$b,$c){return str_replace($a,$b,$c);}
function _G($a,$b,$c){$d=strpos($a,$b);$e=substr($a,$d);$f=strpos($a,$c);return substr($e,0,($f-$d));}
$file='Code;1971;1981;1991;2001;2011
A;10;20;30;40;50
B;12;22;32;89;95';
$fC=_E('|',_S('Code;','|Code;',$file));
/*Useful with Multiple "Code;" in same file too*/
$fC2=($file).'|';
/* Add "|" to the last */
$fY=array();/*Year*/
$fA=array();/*A*/
$fB=array();/*B*/
$dB=array();/* For Database */
foreach($fC as $fR){
/* Get Year; */
$gY=_E(';',_S('Code;','',_G(_T($fR),'Code;','A;')));
$c=0;foreach($gY as $yR){if(!empty($yR)){/*echo _T($yR).'<br>';*/$fY[$c++]=_T($yR);}}
/* Get A */
$gA=_E(';',_S('A;','',_G(_T($fR),'A;','B;')));
$c=0;foreach($gA as $aR){if(!empty($aR)){/*echo _T($aR).'<br>';*/$fA[$c++]=_T($aR);}}
/* Get B */
$gB=_E(';',_S('B;','',_G(_T($fR),'B;','|')));
$c=0;foreach($gB as $bR){if(!empty($bR)){/*echo _T($bR).'<br>';*/$fB[$c++]=_T($bR);}}
if(empty($fB)){$gB=_E(';',_S('B;','',_G(_T($fC2),'B;','|')));
$c=0;foreach($gB as $bR){if(!empty($bR)){/*echo _T($bR).'<br>';*/$fB[$c++]=_T($bR);}}}
}
/* Show Result */
foreach($fY as $fK=>$fV){
echo $fK.':'.$fV.'='.$fA[$fK].'='.$fB[$fK].'<br>';
$dB[]=array('Year'=>$fV,'A'=>$fA[$fK],'B'=>$fB[$fK]);
}print_r($dB);
?>
Test it yourself... :)
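For comparison, the same wide-to-long reshaping can be sketched in awk from the shell, which handles any number of codes and files without hard-coding A and B. Assumptions in this sketch: every file starts directly with the `Code;...` header line (as in the sample used above), fields are separated by `;`, and the file name without its extension stands in for the File column. The function name `wide_to_long` is made up for illustration.

```shell
# Reshape one or more wide files (Code;1971;1981;...) into long
# File;Code;Year;Value lines. Assumes every file starts with the header line.
wide_to_long() {
  awk -F';' -v OFS=';' '
    FNR == 1 {                      # header line of each file: remember the years
      for (i = 2; i <= NF; i++) year[i] = $i
      name = FILENAME
      sub(/.*\//, "", name)         # strip any directory part
      sub(/\.[^.]*$/, "", name)     # strip the extension, e.g. File1.csv -> File1
      next
    }
    NF > 1 {                        # data line: one output line per year column
      for (i = 2; i <= NF; i++) print name, $1, year[i], $i
    }
  ' "$@"
}

# Usage: wide_to_long File1.csv File2.csv > long.csv
```

The resulting `File;Code;Year;Value` lines can then be bulk-loaded into a single four-column table instead of building a separate mapping per file.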
I have the following data line that I need to parse in Perl:
my $string='Upper Left ( 440720.000, 3751320.000) (117d38\'28.21"W, 33d54\'8.47"N)';
Here is my Perl script:
if ($string=~ m/Upper Left\s+[(]\s+\d{1,6}[.]\d{1,3}[\,]\s+\d{1,6}[.]\d{1,3}[)]\s+[(](\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})/ig) {
$upperLeft="lat=". $1. 'd'. $2. "'". $3. ".". $4. '"W long='. $5. 'd'. $6. "'". $7. ".". $8. '"W';
print $upperLeft. "\n";
}
However, this expression fails to extract 117d38'28.21" as lat and 33d54'8.47" as long. Note the spaces and '(' in the input $string, which I took into account when writing this regular expression.
What am I doing wrong in extracting (117d38'28.21"W, 33d54'8.47"N) into 8 fields? Any help is appreciated.
You had several issues. The main one is that your regex only parses up to the lat part, not the lon part.
What changed:
m/Upper Left\s+[(]\s+\d{1,6}[.]\d{1,3}[\,]\s+\d{1,6}[.]\d{1,3}[)]\s+[(](\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})/ig
m/Upper Left\s+[(]\s+\d{1,6}[.]\d{1,3}[\,]\s+\d{1,7}[.]\d{1,3}[)]\s+[(](\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})"([WE])[\,]\s(\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})"([NS])/ig
In the second line, the \d{1,6} before the closing ')' became \d{1,7}, because the second number in your test string (3751320) has 7 digits.
At the end: (1) added a group to capture W/E (([WE])); (2) added groups to extract the lon number; (3) added a group to capture N/S (([NS])).
Your code, corrected:
if ($string=~ m/Upper Left\s+[(]\s+\d{1,6}[.]\d{1,3}[\,]\s+\d{1,7}[.]\d{1,3}[)]\s+[(](\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})"([WE])[\,]\s(\d{1,3})d(\d{1,2})['](\d{1,2})[.](\d{1,2})"([NS])/ig) {
$upperLeft = "lat=" . $1 . 'd' . $2 . "'" . $3 . "." . $4 . '"' . $5 . " long=" . $6 . 'd' . $7 . "'" . $8 . "." . $9 . '"' . $10;
print $upperLeft. "\n";
}
Output:
lat=117d38'28.21"W long=33d54'8.47"N
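If the degree, minute and second fields are not needed separately, the pattern can be shortened by capturing each coordinate as a single group. Below is a rough sketch of that idea using `sed -E` from the shell rather than Perl. The names `parse_upper_left` and `dms` are made up, and `-E` is assumed to be supported by the local sed, as it is in GNU and BSD sed.

```shell
# Pattern for one coordinate such as 117d38'28.21" (without the hemisphere letter).
dms="[0-9]{1,3}d[0-9]{1,2}'[0-9]{1,2}[.][0-9]{1,2}\""

# Print lat=... long=... for an "Upper Left" line; prints nothing if it does not match.
parse_upper_left() {
  printf '%s\n' "$1" |
    sed -n -E "s/.*Upper Left +[(] *[0-9.]+, *[0-9.]+[)] +[(](${dms}[WE]), *(${dms}[NS])[)].*/lat=\1 long=\2/p"
}

parse_upper_left "Upper Left ( 440720.000, 3751320.000) (117d38'28.21\"W, 33d54'8.47\"N)"
```

With two groups instead of ten there is less to miscount, at the cost of not having the individual numbers available.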