How to use awk to sum up fields based on another field - html

In my assessment I'm asked to write a shell script using only bash commands and another shell script using only SQL queries. These scripts should do the following:
1. Clean data in the .csv file (not important at the moment)
2. Sum up earnings based upon gender
3. Produce a simple HTML table
I have made the SQL query produce the correct numbers and HTML file, but with some help from other bash commands.
For the file that should contain only bash commands, I'm able to get the table, but one of the numbers is wrong.
I'm very new to bash scripting and SQL queries, so the code isn't very optimised.
The following is a shortened version of the sample input:
CSV input
title,site,country,year_release,box_office,director,number_of_subjects,subject,type_of_subject,race_known,subject_race,person_of_color,subject_sex,lead_actor_actress
10 Rillington Place,http://www.imdb.com/title/tt0066730/,UK,1971,-,Richard Fleischer,1,John Christie,Criminal,Unknown,,0,Male,Richard Attenborough
12 Years a Slave,http://www.imdb.com/title/tt2024544/,US/UK,2013,56700000,Steve McQueen,1, Solomon Northup,Other,Known,African American,1,Male,Chiwetel Ejiofor
127 Hours,http://www.imdb.com/title/tt1542344/,US/UK,2010,18300000,Danny Boyle,1,Aron Ralston,Athlete,Unknown,,0,Male,James Franco
1987,http://www.imdb.com/title/tt2833074/,Canada,2014,-,Ricardo Trogi,1,Ricardo Trogi,Other,Known,White,0,Male,Jean-Carl Boucher
20 Dates,http://www.imdb.com/title/tt0138987/,US,1998,537000,Myles Berkowitz,1,Myles Berkowitz,Other,Unknown,,0,Male,Myles Berkowitz
21,http://www.imdb.com/title/tt0478087/,US,2008,81200000,Robert Luketic,1,Jeff Ma,Other,Known,Asian American,1,Male,Jim Sturgess
24 Hour Party People,http://www.imdb.com/title/tt0274309/,UK,2002,1130000,Michael Winterbottom,1,Tony Wilson,Musician,Known,White,0,Male,Steve Coogan
42,http://www.imdb.com/title/tt0453562/,US,2013,95000000,Brian Helgeland,1,Jackie Robinson,Athlete,Known,African American,1,Male,Chadwick Boseman
8 Seconds,http://www.imdb.com/title/tt0109021/,US,1994,19600000,John G. Avildsen,1,Lane Frost,Athlete,Unknown,,0,Male,Luke Perry
84 Charing Cross Road,http://www.imdb.com/title/tt0090570/,US/UK,1987,1080000,David Hugh Jones,2,Frank Doel,Author,Unknown,,0,Male,Anthony Hopkins
84 Charing Cross Road,http://www.imdb.com/title/tt0090570/,US/UK,1987,1080000,David Hugh Jones,2,Helene Hanff,Author,Unknown,,0,Female,Anne Bancroft
A Beautiful Mind,http://www.imdb.com/title/tt0268978/,US,2001,171000000,Ron Howard,1,John Nash,Academic,Unknown,,0,Male,Russell Crowe
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Carl Gustav Jung,Academic,Known,White,0,Male,Michael Fassbender
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Sigmund Freud,Academic,Known,White,0,Male,Viggo Mortensen
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Sabina Spielrein,Academic,Known,White,0,Female,Keira Knightley
A Home of Our Own,http://www.imdb.com/title/tt0107130/,US,1993,1700000,Tony Bill,1,Frances Lacey,Other,Unknown,,0,Female,Kathy Bates
A Man Called Peter,http://www.imdb.com/title/tt0048337/,US,1955,-,Henry Koster,1,Peter Marshall,Other,Known,White,0,Male,Richard Todd
A Man for All Seasons,http://www.imdb.com/title/tt0060665/,UK,1966,-,Fred Zinnemann,1,Thomas More,Historical,Known,White,0,Male,Paul Scofield
A Matador's Mistress,http://www.imdb.com/title/tt0491046/,US/UK,2008,-,Menno Meyjes,2,Lupe Sino,Actress ,Known,Hispanic (White),0,Female,Penélope Cruz
For the SQL-queries-only file, this is my code so far (it produces the right numbers and a correct table):
python3 csv2sqlite.py --table-name test_table --input table.csv --output table.sqlite
echo -e '<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>' >> tmp1.txt
sqlite3 -html table.sqlite 'SELECT subject_sex, SUM(box_office) FROM test_table
GROUP BY subject_sex;' > tmp2.txt
cat tmp2.txt >> tmp1.txt
echo '</TABLE>' >> tmp1.txt
cp tmp1.txt $1
cat $1
rm tmp1.txt tmp2.txt
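As an aside, the temporary files and the csv2sqlite step could be avoided with sqlite3's own CSV import. A sketch, assuming a recent sqlite3 (3.32+ for .import --csv); the column names come from the CSV header row:
{
  echo '<TABLE BORDER = "1">'
  echo '<TR><TH>Gender</TH><TH>Total Amount [$]</TH></TR>'
  sqlite3 -html :memory: <<'SQL'
.import --csv table.csv t
SELECT subject_sex, SUM(box_office) FROM t GROUP BY subject_sex;
SQL
  echo '</TABLE>'
} > "$1"
cat "$1"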
For the bash-only file, this is my code so far:
echo -e '<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>' >> tmp1.txt
awk -F ',' '{for (i=1;i<=NF;i++)
if ($1)
a[$13] += $5} END{for (i in a) printf("<TR><TD> %s </TD><TD> %i </TD></TR>\n", i, a[i])}' table.csv | sort | head -2 > tmp2.txt
cat tmp2.txt >> tmp1.txt
echo -e "</TABLE>" >> tmp1.txt
cp tmp1.txt $1
cat $1
rm tmp1.txt tmp2.txt
The expected output should look like this:
<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>
<TR><TD>Female</TD>
<TD>8480000.0</TD>
</TR>
<TR><TD>Male</TD>
<TD>455947000.0</TD>
</TR>
</TABLE>
Thank you in advance!

#! /bin/bash
awk -F, '{
if (NR != 1)
{
if (sum[$13] == "")
{
sum[$13]=0
}
sum[$13]+=$5
}
}
END {
print "<TABLE BORDER = \"1\">"
print "<TR><TH>Gender</TH><TH>Total Amount [$]</TH></TR>"
for ( gender in sum )
{
print "<TR><TD>"gender"</TD>", "<TD>"sum[gender]"</TD></TR>"
}
print "</TABLE>"
}' table.csv
Try this and see if it works for you. The likely culprit in your version is the for (i=1;i<=NF;i++) loop: its body runs once per field, so each row's $5 is added to a[$13] NF times, and the header row gets summed as well. The version above adds each row exactly once and skips the header via NR != 1.
UPDATE:
What I understand from your comment is that you want the data sorted by the sum.
#! /bin/bash
awk -F, -v OFS=, '{
if (NR != 1)
{
if (sum[$13] == "")
{
sum[$13]=0
}
sum[$13]+=$5
}
}
END {
for ( gender in sum )
{
print gender, sum[gender]
}
}' table.csv | sort -t, -nk 2,2 |
awk -v firstline="$(sed -n '1p' table.csv)" '{
printrow($0)
}
BEGIN {
split(firstline, headers, ",")
print "<html>"
print "<TABLE BORDER = "1">"
printrow(headers[13]","headers[5], 1)
}
END {
print "</table>"
print "</html>"
}
function printrow(row, flag)
{
# if flag == 0 or null "<TD>" else "<TH>"
len = split(row, cells, ",")
print "<TR>"
for (i = 1 ; i <= len ; ++i)
{
if (!flag)
print "<TD>"cells[i]"</TD>"
else
print "<TH>"cells[i]"</TH>"
}
print "</TR>"
}'
Above, I have basically divided what you need into two modules.
Manipulating the data in the table:
1) Just organises the table.
2) Sorts the data by the 2nd column. I could have done this inside the first awk script itself (see the sketch just below), but it was a little shorter this way.
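If you have GNU awk, the sort could indeed stay inside the first script; a minimal sketch, gawk-only because it relies on PROCINFO["sorted_in"]:
awk -F, 'NR != 1 { sum[$13] += $5 }
END {
    # gawk-only: iterate over sum[] in ascending numeric order of its values
    PROCINFO["sorted_in"] = "@val_num_asc"
    for (gender in sum)
        print gender "," sum[gender]
}' table.csv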
Converting it into an HTML table:
The second awk script receives the output from the first one.
It sets the headings and tags.
I feel it's more modular this way; it just makes it easier to make modifications. The first script handles the data manipulation and the second places the headers and tags.
What I would personally prefer is giving the second awk script its own executable file: use the first script purely for data manipulation, then pass its output to the second for setting the HTML tags and headers, as sketched below.
There might be better alternatives; I suggested the best I knew.
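For example (a sketch; to_html.awk is a hypothetical name), the HTML stage as its own executable awk file:
#! /usr/bin/awk -f
# to_html.awk: read comma-separated rows on stdin, emit an HTML table.
BEGIN {
    FS = ","
    print "<TABLE BORDER = \"1\">"
}
{
    printf "<TR>"
    for (i = 1; i <= NF; ++i)
        printf "<TD>%s</TD>", $i
    print "</TR>"
}
END {
    print "</TABLE>"
}
Then chmod +x to_html.awk and pipe the output of the first script into ./to_html.awk.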

Related

Convert single column to multiple, ensuring column count on last line

I would like to use AWK (on Windows) to convert a text file with a single column into multiple columns, the count specified in the script or on the command line.
This question has been asked before, but my final data file needs to have the same column count all the way through.
Example of input:
L1
L2
L3
L4
L5
L6
L7
split into 3 columns with ";" as the separator:
L1;L2;L3
L4;L5;L6
L7;; <<< two empty fields are created here at the end, since only one value was left for this line.
I tried to modify variants of the typical solution given (NR%4 {printf $0",";next} 1) with a counter, but could not quite get it right.
I would prefer not to count the lines beforehand, which would mean reading the file twice.
You may use this awk solution:
awk -v n=3 '{
sub(/\r$/, "") # removes DOS line break, if present
printf "%s", $0(NR%n ? ";" : ORS)
}
END {
# now we need to add empty columns in last record
if (NR % n) {
for (i=1; i < (n - (NR % n)); ++i)
printf ";"
print ""
}
}' file
L1;L2;L3
L4;L5;L6
L7;;
With your shown samples, please try the following code; it uses an xargs + awk combination to achieve the outcome the OP needs.
xargs -n3 < Input_file |
awk -v OFS=";" '{if(NF==1){$0=$0";;"};if(NF==2){$0=$0";"};$1=$1} 1'
For an awk I would do:
awk -v n=3 '
{printf("%s%s", $0, (NR%n>0) ? ";" : ORS)}
END{
for(i=NR%n; i<n-1; i++) printf(";")
printf ORS
}' file
Or, an alternative awk:
awk -v n=3 -v OFS=";" '
{ row=row ? row FS $0 : $0 } # build row of n fields
!(NR%n) {$0=row; NF=n; print; row="" } # split the fields sep by OFS
END { if (NR%n) { $0=row; NF=n; print } } # same
' file
Or you can use ruby if you want more options:
ruby -le '
n=3
puts $<.read.
split($/).
each_slice(n).
map{|sl| sl.fill(sl.size...n) { "" }; sl.join(";") }.
join($\) # By using $\ and $/ with the -l the RS and ORS is set correctly for the platform
' file
Or, realize that paste is designed to do this:
paste -d';' - - - <file
(Use a - for each column desired)
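If the column count is a variable rather than a fixed 3, the dashes can be generated; a sketch that relies on word splitting of the unquoted command substitution:
n=3
paste -d';' $(printf ' -%.0s' $(seq "$n")) < file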
Any of those prints (with n=3):
L1;L2;L3
L4;L5;L6
L7;;
(And work correctly for other values of n...)

package to query tab separated files in bash

I often have to run very simple queries on tab-separated files in bash, for example summing/counting/max/min of all the values in the n-th column. I usually do this in awk via the command line, but I've grown tired of rewriting the same one-line scripts over and over, and I'm wondering if there is a known package or solution for this.
For example, consider the text file (test.txt):
apples joe 4
oranges bill 3
apples sally 2
I can query this as:
awk '{ val += $3 } END { print "sum: "val }' test.txt
Also, I may want a where clause:
awk '{ if ($1 == "apples") { val += $3 } } END { print "sum: "val }' test.txt
Or a group by:
awk '{ val[$1] += $3 } END { for(k in val) { print k": "val[k] } }' test.txt
What I would rather do is:
query 'sum($3)' test.txt
query 'sum($3) where $1 = "apples"' test.txt
query 'sum($3) group by $1' test.txt
@Wintermute posted a link to a great tool for this in the comments below. Unfortunately, it does have one drawback:
$ time gawk '{ a += $6 } END { print a }' my1GBfile.tsv
28371787287
real 0m2.276s
user 0m1.909s
sys 0m0.313s
$ time q -t 'select sum(c6) from my1GBfile.tsv'
28371787287
real 3m32.361s
user 3m27.078s
sys 0m1.983s
It also loads the entire file into memory. Obviously that will be necessary in some cases, but it doesn't work for me, as I often work with large files.
Wintermute's answer: tools like q can run SQL queries directly on CSVs.
Ed Morton's answer: see https://stackoverflow.com/a/15765479/1745001
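Short of a full query language, the recurring one-liners can at least be wrapped in small shell functions; a sketch (sum and sumby are hypothetical names, not one of the tools referenced above):
# sum COL FILE: total of column COL
sum() {
    awk -v c="$1" '{ s += $c } END { print "sum: " s }' "$2"
}
# sumby COL KEYCOL FILE: total of column COL grouped by column KEYCOL
sumby() {
    awk -v c="$1" -v g="$2" '{ s[$g] += $c }
        END { for (k in s) print k ": " s[k] }' "$3"
}
Usage: sum 3 test.txt, or sumby 3 1 test.txt.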

inserting data into MYSQL tables by using shell script

I am a beginner at shell scripting. I am trying to store the output of a Linux command in MySQL tables. I need to put the partition details in one column and the used % in another column. I nearly made it, but I get the output in a single column. In table test, disk is one column and used is another. My desired output is:
**DISK** **USED**
filesystem 45%
but my actual output is like
**DISK USED**
filesystem
45%
My code:
df -h | tee /home/abcd/test/monitor.text;
details=$(awk '{ print $1 } ' monitor.text);
echo $details;
used=$(awk '{ print $5}' monitor.text);
echo $used;
mysql test<<EOF;
INSERT INTO test_1 (details,used) VALUES ('$details','$used');
EOF
Please give me the correct code for the desired output. Thank you in advance.
Here is a script which captures the values of those two columns correctly. Can you do the inserts from there?
#!/bin/sh
df -h | awk '{ print$1 " " $5} ' > monitor.txt
exec<monitor.txt
value=0
while read line
do
col1=`echo $line | cut -f1 -d " " `
col2=`echo $line | cut -f2 -d " " `
echo $col1
echo $col2 ;
done
Plug your INSERT statements inside the do-done loop, for example:
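A sketch of that (it assumes the database test and the table test_1(details, used) from the question, and skips df's header line; note the values are interpolated into the SQL unescaped, which is fine for df output but not for arbitrary data):
df -h | awk 'NR > 1 { print $1 " " $5 }' |
while read -r col1 col2
do
    mysql test <<EOF
INSERT INTO test_1 (details, used) VALUES ('$col1', '$col2');
EOF
done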

Output JSON from Bash script

So I have a bash script which outputs details on servers. The problem is that I need the output to be JSON. What is the best way to go about this? Here is the bash script:
# Get hostname
hostname=`hostname -A` 2> /dev/null
# Get distro
distro=`python -c 'import platform ; print platform.linux_distribution()[0] + " " + platform.linux_distribution()[1]'` 2> /dev/null
# Get uptime
if [ -f "/proc/uptime" ]; then
uptime=`cat /proc/uptime`
uptime=${uptime%%.*}
seconds=$(( uptime%60 ))
minutes=$(( uptime/60%60 ))
hours=$(( uptime/60/60%24 ))
days=$(( uptime/60/60/24 ))
uptime="$days days, $hours hours, $minutes minutes, $seconds seconds"
else
uptime=""
fi
echo $hostname
echo $distro
echo $uptime
So the output I want is something like:
{"hostname":"server.domain.com", "distro":"CentOS 6.3", "uptime":"5 days, 22 hours, 1 minutes, 41 seconds"}
Thanks.
If you only need to output a small JSON, use printf:
printf '{"hostname":"%s","distro":"%s","uptime":"%s"}\n' "$hostname" "$distro" "$uptime"
Or if you need to produce a larger JSON, use a heredoc as explained by leandro-mora. If you use the here-doc solution, please be sure to upvote his answer:
cat <<EOF > /your/path/myjson.json
{"id" : "$my_id"}
EOF
Some of the more recent distros have a file called /etc/lsb-release or a similar name (cat /etc/*release). Therefore, you could possibly do away with your dependency on Python:
distro=$(awk -F= 'END { print $2 }' /etc/lsb-release)
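On systemd-based distros there is also /etc/os-release, which is valid shell syntax and can be sourced directly; a sketch (NAME and VERSION_ID are standard os-release fields):
# run in a subshell so the sourced variables don't leak into the script
distro=$( . /etc/os-release && echo "$NAME $VERSION_ID" )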
As an aside, you should probably do away with using backticks; they're a bit old-fashioned. Prefer $( ... ).
I find it much easier to create the JSON using cat:
cat <<EOF > /your/path/myjson.json
{"id" : "$my_id"}
EOF
I'm not a bash ninja at all, but I wrote a solution that works perfectly for me. So, I decided to share it with the community.
First of all, I created a bash script called json.sh
arr=()
while read x y;
do
arr=("${arr[@]}" $x $y)
done
vars=(${arr[@]})
len=${#arr[@]}
printf "{"
for (( i=0; i<len; i+=2 ))
do
printf "\"${vars[i]}\": ${vars[i+1]}"
if [ $i -lt $((len-2)) ] ; then
printf ", "
fi
done
printf "}"
echo
And now I can easily execute it:
$ echo key1 1 key2 2 key3 3 | ./json.sh
{"key1":1, "key2":2, "key3":3}
@Jimilian's script was very helpful for me. I changed it a bit to send data to Zabbix auto-discovery:
arr=()
while read x y;
do
arr=("${arr[#]}" $x $y)
done
vars=(${arr[#]})
len=${#arr[#]}
printf "{\n"
printf "\t"data":[\n"
for (( i=0; i<len; i+=2 ))
do
printf "\t{ "{#VAL1}":\"${vars[i]}\",\t"{#VAL2}":\"${vars[i+1]}\" }"
if [ $i -lt $((len-2)) ] ; then
printf ",\n"
fi
done
printf "\n"
printf "\t]\n"
printf "}\n"
echo
Output:
$ echo "A 1 B 2 C 3 D 4 E 5" | ./testjson.sh
{
data:[
{ {#VAL1}:"A", {#VAL2}:"1" },
{ {#VAL1}:"B", {#VAL2}:"2" },
{ {#VAL1}:"C", {#VAL2}:"3" },
{ {#VAL1}:"D", {#VAL2}:"4" },
{ {#VAL1}:"E", {#VAL2}:"5" }
]
}
I wrote a tiny program in Go, json_encode. It works pretty well for such cases:
$ ./getDistro.sh | json_encode
["my.dev","Ubuntu 17.10","4 days, 2 hours, 21 minutes, 17 seconds"]
data=$(echo " BUILD_NUMBER : ${BUILD_NUMBER} , BUILD_ID : ${BUILD_ID} , JOB_NAME : ${JOB_NAME} " | sed 's/ /"/g')
output => data="BUILD_NUMBER":"29","BUILD_ID":"29","JOB_NAME":"OSM_LOG_ANA"
To answer the subject line: if you need to take tab-separated output from any command line and format it as a JSON list of lists, you can do the following:
echo <tsv_string> | python3 -c "import sys,json; print(json.dumps([row.split('\t') for row in sys.stdin.read().splitlines() if True]))"
For example, to get a zfs list output as json:
zfs list -Hpr -o name,creation,mountpoint,mounted | python3 -c "import sys,json; print(json.dumps([row.split('\t') for row in sys.stdin.read().splitlines() if True]))"

Creating an HTML table with BASH & AWK

I am having issues creating an HTML table to display stats from a text file. I am sure there are 100 ways to do this better, but here it is:
(The comments in the following script show the outputs)
#!/bin/bash
function getapistats () {
curl -s http://api.example.com/stats > api-stats.txt
awk '{print $1}' api-stats.txt > api-stats-int.txt
awk '{print $2}' api-stats.txt > api-stats-fqdm.txt
}
# api-stats.txt example
# 992 cdn.example.com
# 227 static.foo.com
# 225 imgcdn.bar.com
# end api-stats.txt example
function get_int () {
for i in `cat api-stats-int.txt`;
do echo -e "<tr><td>${i}</td>";
done
}
function get_fqdn () {
for f in `cat api-stats-fqdn.txt`;
do echo -e "<td>${f}</td></tr>";
done
}
function build_table () {
echo "<table>";
echo -e "`get_int`" "`get_fqdn`";
#echo -e "`get_fqdn`";
echo "</table>";
}
getapistats;
build_table > api-stats.html;
# Output fail :|
# <table>
# <tr><td>992</td>
# <tr><td>227</td>
# <tr><td>225</td><td>cdn.example.com</td></tr>
# <td>static.foo.com</td></tr>
# <td>imgcdn.bar.com</td></tr>
# Desired output:
# <tr><td>992</td><td>cdn.example.com</td></tr>
# ...
This is reasonably simple to do in pure awk:
curl -s http://api.example.com/stats > api-stats.txt
awk 'BEGIN { print "<table>" }
{ print "<tr><td>" $1 "</td><td>" $2 "</td></tr>" }
END { print "</table>" }' api-stats.txt > api-stats.html
Awk is really made for this type of use.
You can do it with one awk at least.
curl -s http://api.example.com/stats | awk '
BEGIN{print "<table>"}
{printf("<tr><td>%d</td><td>%s</td></tr>\n",$1,$2)}
END{print "</table>"}
'
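One refinement worth considering (an addition, not part of the original answers): if the second field can ever contain &, < or >, escape it before emitting the HTML:
curl -s http://api.example.com/stats | awk '
BEGIN{print "<table>"}
{
    gsub(/&/, "\\&amp;", $2)   # \\& is a literal ampersand in a gsub replacement
    gsub(/</, "\\&lt;", $2)
    gsub(/>/, "\\&gt;", $2)
    printf("<tr><td>%d</td><td>%s</td></tr>\n", $1, $2)
}
END{print "</table>"}
'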
this can be done w/ bash ;)
while read -u 3 a && read -u 4 b;do
echo $a$b;
done 3&lt/etc/passwd 4&lt/etc/services
but my experience is that it's usually a bad idea to do things like this in bash/awk/etc.
The feature I used in the code (read -u FD, which reads from a given file descriptor) is buried deep in the bash manual page...
I would recommend using a real language for this kind of data processing, for example Ruby or Python, because they are more flexible/readable/maintainable.