How to make a TSV-processing Go program as efficient as `cut`?

I have the following Go program to process TSV input, but it is slower than awk and cut. I know cut uses string-manipulation tricks to achieve its speed.
https://github.com/coreutils/coreutils/blob/master/src/cut.c
Is it possible to match cut's performance with Go (or at least beat awk)? What should I use in Go to get better performance?
$ ./main_.sh | indent.sh
time ./main.go 10 < "$infile" > /dev/null
real 0m1.431s
user 0m0.978s
sys 0m0.436s
time cut -f 10 < "$infile" > /dev/null
real 0m0.252s
user 0m0.225s
sys 0m0.025s
time awk -v FS='\t' -v OFS='\t' -e '{ print $10 }' < "$infile" > /dev/null
real 0m1.134s
user 0m1.108s
sys 0m0.024s
$ cat.sh main_.sh
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:
infile=$(mktemp)
seq 10000000 | paste -s -d $'\t\t\t\t\t\t\t\t\t\n' > "$infile"
set -v
time ./main.go 10 < "$infile" > /dev/null
time cut -f 10 < "$infile" > /dev/null
time awk -v FS='\t' -v OFS='\t' -e '{ print $10 }' < "$infile" > /dev/null
$ cat main.go
#!/usr/bin/env gorun
// vim: set noexpandtab tabstop=2:
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	idx, _ := strconv.Atoi(os.Args[1])
	col := idx - 1
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := strings.TrimRight(scanner.Text(), "\n")
		fields := strings.Split(line, "\t")
		fmt.Printf("%s\n", fields[col])
	}
}

If you profile the application, it will show that most of the time is spent in
fmt.Printf("%s\n", fields[col])
The main issue there is really the 10000000 syscalls you're making to write to stdout, so making stdout buffered will significantly reduce the execution time. Removing the overhead of the fmt calls will help even further.
The next step would be to reduce allocations, which you can do by using byte slices rather than strings. Combining these would lead to something like
stdout := bufio.NewWriter(os.Stdout)
defer stdout.Flush()
scanner := bufio.NewScanner(os.Stdin)
for scanner.Scan() {
	line := scanner.Bytes()
	fields := bytes.Split(line, []byte{'\t'})
	stdout.Write(fields[col])
	stdout.Write([]byte{'\n'})
}
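For reference, a complete, runnable version of that sketch might look like the following (the argument handling mirrors the original main.go; this is an illustration of the buffered-output approach, not a tuned benchmark entry):
package main

import (
	"bufio"
	"bytes"
	"os"
	"strconv"
)

func main() {
	// Column index is 1-based on the command line, as in the original program.
	idx, err := strconv.Atoi(os.Args[1])
	if err != nil {
		os.Exit(1)
	}
	col := idx - 1

	// Buffer stdout so we don't issue one write syscall per line.
	stdout := bufio.NewWriter(os.Stdout)
	defer stdout.Flush()

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		// Bytes() avoids allocating a string for every line.
		fields := bytes.Split(scanner.Bytes(), []byte{'\t'})
		stdout.Write(fields[col])
		stdout.WriteByte('\n')
	}
}
bytes.Split still allocates a slice of fields per line; a further step, not covered above, would be to locate the target column directly with bytes.IndexByte so that only the wanted field is touched, which should reduce allocations further and get closer to cut.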

Related

How to use awk to sum up fields based on another field

In my assessment I'm asked to write a shell script using only bash commands and another shell script using only SQL queries. These scripts should do the following:
1. Clean data in the .csv file (not important at the moment)
2. Sum up earnings based upon gender
3. Produce a simple HTML table
I have made the SQL query produce the correct numbers and HTML file, but with some help from other bash commands.
For the file that should only contain bash commands, I'm able to get the table, but one of the numbers is wrong.
I'm very new to bash scripting and SQL queries so the code isn't very optimised.
The following is a shortened version of the sample input:
CSV input
title,site,country,year_release,box_office,director,number_of_subjects,subject,type_of_subject,race_known,subject_race,person_of_color,subject_sex,lead_actor_actress
10 Rillington Place,http://www.imdb.com/title/tt0066730/,UK,1971,-,Richard Fleischer,1,John Christie,Criminal,Unknown,,0,Male,Richard Attenborough
12 Years a Slave,http://www.imdb.com/title/tt2024544/,US/UK,2013,56700000,Steve McQueen,1, Solomon Northup,Other,Known,African American,1,Male,Chiwetel Ejiofor
127 Hours,http://www.imdb.com/title/tt1542344/,US/UK,2010,18300000,Danny Boyle,1,Aron Ralston,Athlete,Unknown,,0,Male,James Franco
1987,http://www.imdb.com/title/tt2833074/,Canada,2014,-,Ricardo Trogi,1,Ricardo Trogi,Other,Known,White,0,Male,Jean-Carl Boucher
20 Dates,http://www.imdb.com/title/tt0138987/,US,1998,537000,Myles Berkowitz,1,Myles Berkowitz,Other,Unknown,,0,Male,Myles Berkowitz
21,http://www.imdb.com/title/tt0478087/,US,2008,81200000,Robert Luketic,1,Jeff Ma,Other,Known,Asian American,1,Male,Jim Sturgess
24 Hour Party People,http://www.imdb.com/title/tt0274309/,UK,2002,1130000,Michael Winterbottom,1,Tony Wilson,Musician,Known,White,0,Male,Steve Coogan
42,http://www.imdb.com/title/tt0453562/,US,2013,95000000,Brian Helgeland,1,Jackie Robinson,Athlete,Known,African American,1,Male,Chadwick Boseman
8 Seconds,http://www.imdb.com/title/tt0109021/,US,1994,19600000,John G. Avildsen,1,Lane Frost,Athlete,Unknown,,0,Male,Luke Perry
84 Charing Cross Road,http://www.imdb.com/title/tt0090570/,US/UK,1987,1080000,David Hugh Jones,2,Frank Doel,Author,Unknown,,0,Male,Anthony Hopkins
84 Charing Cross Road,http://www.imdb.com/title/tt0090570/,US/UK,1987,1080000,David Hugh Jones,2,Helene Hanff,Author,Unknown,,0,Female,Anne Bancroft
A Beautiful Mind,http://www.imdb.com/title/tt0268978/,US,2001,171000000,Ron Howard,1,John Nash,Academic,Unknown,,0,Male,Russell Crowe
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Carl Gustav Jung,Academic,Known,White,0,Male,Michael Fassbender
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Sigmund Freud,Academic,Known,White,0,Male,Viggo Mortensen
A Dangerous Method,http://www.imdb.com/title/tt1571222/,Canada/UK,2011,5700000,David Cronenberg,3,Sabina Spielrein,Academic,Known,White,0,Female,Keira Knightley
A Home of Our Own,http://www.imdb.com/title/tt0107130/,US,1993,1700000,Tony Bill,1,Frances Lacey,Other,Unknown,,0,Female,Kathy Bates
A Man Called Peter,http://www.imdb.com/title/tt0048337/,US,1955,-,Henry Koster,1,Peter Marshall,Other,Known,White,0,Male,Richard Todd
A Man for All Seasons,http://www.imdb.com/title/tt0060665/,UK,1966,-,Fred Zinnemann,1,Thomas More,Historical,Known,White,0,Male,Paul Scofield
A Matador's Mistress,http://www.imdb.com/title/tt0491046/,US/UK,2008,-,Menno Meyjes,2,Lupe Sino,Actress ,Known,Hispanic (White),0,Female,Penélope Cruz
For the SQL-queries-only file, this is my code so far (it produces the right numbers and the correct table):
python3 csv2sqlite.py --table-name test_table --input table.csv --output table.sqlite
echo -e '<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>' >> tmp1.txt
sqlite3 biopics.sqlite 'SELECT subject_sex,SUM(earnings) FROM table \
GROUP BY subject_sex;' -html > tmp2.txt
cat tmp2.txt >> tmp1.txt
echo '</TABLE>' >> tmp1.txt
cp tmp1.txt $1
cat $1
rm tmp1.txt tmp2.txt
For the bash-only file, this is my code so far:
echo -e '<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>' >> tmp1.txt
awk -F ',' '{for (i=1;i<=NF;i++)
if ($1)
a[$13] += $5} END{for (i in a) printf("<TR><TD> %s </TD><TD> %i </TD></TR>\n", i, a[i])}' table.csv | sort | head -2 > tmp2.txt
cat tmp2.txt >> tmp1.txt
echo -e "</TABLE>" >> tmp1.txt
cp tmp1.txt $1
cat $1
rm tmp1.txt tmp2.txt
The expected output should look like this:
<TABLE BORDER = "1">
<TR><TH>Gender</TH>
<TH>Total Amount [$]</TH>
</TR>
<TR><TD>Female</TD>
<TD>8480000.0</TD>
</TR>
<TR><TD>Male</TD>
<TD>455947000.0</TD>
</TR>
</TABLE>
Thank you in advance!
#! /bin/bash
awk -F, '{
	if (NR != 1)
	{
		if (sum[$13] == "")
		{
			sum[$13]=0
		}
		sum[$13]+=$5
	}
}
END {
	print "<TABLE BORDER = \"1\">"
	print "<TR><TH>Gender</TH><TH>Total Amount [$]</TH></TR>"
	for ( gender in sum )
	{
		print "<TR><TD>"gender"</TD>", "<TD>"sum[gender]"</TD></TR>"
	}
	print "</TABLE>"
}' table.csv
Here, try this and see if it works for you.
UPDATE:
What I understand from your comment is that you want to sort the data by the sum.
#! /bin/bash
awk -F, -v OFS=, '{
	if (NR != 1)
	{
		if (sum[$13] == "")
		{
			sum[$13]=0
		}
		sum[$13]+=$5
	}
}
END {
	for ( gender in sum )
	{
		print gender, sum[gender]
	}
}' table.csv | sort -nk 2,2 |
awk -v firstline="$(sed -n '1p' table.csv)" '{
	printrow($0)
}
BEGIN {
	split(firstline, headers, ",")
	print "<html>"
	print "<TABLE BORDER = \"1\">"
	# header order matches the data rows: gender first, then the sum
	printrow(headers[13]","headers[5], 1)
}
END {
	print "</table>"
	print "</html>"
}
function printrow(row, flag)
{
	# if flag == 0 or null "<TD>" else "<TH>"
	len = split(row, cells, ",")
	print "<TR>"
	for (i = 1 ; i <= len ; ++i)
	{
		if (!flag)
			print "<TD>"cells[i]"</TD>"
		else
			print "<TH>"cells[i]"</TH>"
	}
	print "</TR>"
}'
Above, I have basically divided what you need into two modules:
Manipulating the data in the table:
1) Just organises the table.
2) Sorts the data by the 2nd column. I should have done this in the first awk script itself, but it was a little shorter this way.
Converting it into an HTML table:
The second awk script receives output from the first one.
It sets the headings and tags.
I feel it's more modular this way, which makes it easier to make modifications: the first script handles data manipulation and the second places the headers and tags.
What I would personally prefer is giving the second awk script its own executable file: simply use the first script for data manipulation and then pass its output to another script that sets the HTML tags and headers.
There might be better alternatives; I suggested the best approach I knew.

Passing a variable into a function that uses awk

I have a script that attempts to stop a certain process by name (while excluding a certain string that must not be killed, namely "notThisProcess"), then kills it after 20 seconds if it hasn't come down gracefully, i.e.:
#!/bin/ksh
bname=BTEST
bserver=BSERVER
PROCESS_ID=`ps auxww | awk '/PROCESS_NAME_ALPHA/ && !/awk/ && !/notThisProcess/ {print $2}'`
/apps/customapp/stopcommand -a $bname -processName PROCESS_NAME_ALPHA -serverName $bserver
sleep 20
kill -4 $PROCESS_ID
PROCESS_ID2=`ps auxww | awk '/PROCESS_NAME_BETA/ && !/awk/ && !/notThisProcess/ {print $2}'`
/apps/customapp/stopcommand -a $bname -processName PROCESS_NAME_BETA -serverName $bserver
sleep 20
kill -4 $PROCESS_ID2
#etc..
As my list of processes has just grown, I'm trying to put those steps into a function, but I can't figure out how to pass the process name to awk. I.e., this doesn't work:
#!/bin/ksh
bname=BTEST
bserver=BSERVER
cycleProcess()
{
PROCESS_ID=`ps auxww | awk '/$1/ && !/awk/ && !/notThisProcess/ {print $2}'`
/apps/customapp/stopcommand -a $bname -processName PROCESS_NAME_ALPHA -serverName $bserver
sleep 20
kill -4 $PROCESS_ID
}
cycleProcess PROCESS_NAME_ALPHA
cycleProcess PROCESS_NAME_BETA
exit
I've seen several references to assignment via -v but despite several attempts I haven't been successful. Any suggestions?
I'd write it like this:
#!/bin/ksh
bname=BTEST
bserver=BSERVER
cycleProcess() {
	typeset procname="$1"
	typeset pid=$(ps auxww | awk -v name="$procname" '$0 ~ name && !/awk/ && !/notThisProcess/ {print $2}')
	if [[ -z "$pid" ]]; then
		echo "$procname is not running"
		return
	fi
	/apps/customapp/stopcommand -a "$bname" -processName "$procname" -serverName "$bserver"
	sleep 20
	kill -4 "$pid"
}
processes=(
	PROCESS_NAME_ALPHA
	PROCESS_NAME_BETA
)
for proc in "${processes[@]}"; do
	cycleProcess "$proc"
done
typeset in a function is a way to declare a variable as local to that function.
I don't have access to an AIX box. ps auxww output on my Linux box shows the command name in field 11, so instead of /name/ && !/awk/ && !/thisScript/ you might be able to use $11 == name {print $2},
or $11 ~ name if the match is not exact.
You can pass them in a pipe-delimited list and compare with the last field:
ps ... | awk -v keep="process1|process2|process3" '$NF!~keep{print $2}'
Also note that in your script, awk '/$1/ && ..., the variable is not the shell variable but awk's own $1, the first field of the record.
As others already noted, shell variables may be passed to awk scripts using the option -v. This must be used if the awk script resides in a separate file (used via the option -f).
When specifying the awk script directly within the shell script between single quotes ('...'), you may also use the construct '"$shell_variable"'. Note that when doing so, there must be no spaces between the single and double quotes!
Example:
process_string="plugin-container"
pids=$( ps -fu $LOGNAME | awk '/'"$process_string"'/ { print $2 }' )

Bash loop to merge files in batches for mongoimport

I have a directory with 2.5 million small JSON files in it. It's 104 GB on disk. They're multi-line files.
I would like to create a set of JSON arrays from the files so that I can import them using mongoimport in a reasonable amount of time. The files can be no bigger than 16 MB, but I'd be happy even if I managed to get them in sets of ten.
So far, I can use this to do them one at a time at about 1000/minute:
for i in *.json; do mongoimport --writeConcern 0 --db mydb --collection all --quiet --file $i; done
I think I can use "jq" to do this, but I have no idea how to make the bash loop pass 10 files at a time to jq.
Note that using bash find results in an error as there are too many files.
With jq you can use --slurp to create arrays, and -c to make multiline json single line. However, I can't see how to combine the two into a single command.
Please help with both parts of the problem if possible.
Here's one approach. To illustrate, I've used awk as it can read the list of files in small batches and because it has the ability to execute jq and mongoimport. You will probably need to make some adjustments to make the whole thing more robust, to test for errors, and so on.
The idea is either to generate a script that can be reviewed and then executed, or to use awk's system() command to execute the commands directly. First, let's generate the script:
ls *.json | awk -v group=10 -v tmpfile=json.tmp '
function out() {
print "jq -s . " files " > " tmpfile;
print "mongoimport --writeConcern 0 --db mydb --collection all --quiet --file " tmpfile;
print "rm " tmpfile;
files="";
}
BEGIN {n=1; files="";
print "test -r " tmpfile " && rm " tmpfile;
}
n % group == 0 {
out();
}
{ files = files " \""$0 "\"";
n++;
}
END { if (files) {out();}}
'
Once you've verified this works, you can either execute the generated script, or change the "print ..." lines to use "system(....)"
Using jq to generate the script
Here's a jq-only approach for generating the script.
Since the number of files is very large, the following uses features that were only introduced in jq 1.5 (notably inputs, which lets the file names be read as a stream), so its memory usage is similar to the awk script above:
def read(n):
# state: [answer, hold]
foreach (inputs, null) as $i
([null, null];
if $i == null then .[0] = .[1]
elif .[1]|length == n then [.[1],[$i]]
else [null, .[1] + [$i]]
end;
.[0] | select(.) );

# Assumed helper, not shown in the original excerpt: emit the same
# mongoimport command used in the awk version above.
def mongo(f): "mongoimport --writeConcern 0 --db mydb --collection all --quiet --file \(f)";

"test -r json.tmp && rm json.tmp",
(read($group|tonumber)
| map("\"\(.)\"")
| join(" ")
| ("jq -s . \(.) > json.tmp", mongo("json.tmp"), "rm json.tmp") )
Invocation:
ls *.json | jq -nRr --arg group 10 -f generate.jq
Here is what I came up with. It seems to work and is importing at roughly 80 files a second onto an external hard drive.
#!/bin/bash
files=(*.json)
for ((I=0; I<${#files[*]}; I+=500)); do jq -c '.' "${files[@]:I:500}" | mongoimport --writeConcern 0 --numInsertionWorkers 16 --db mydb --collection all --quiet; echo $I; done
However, some are failing. I've imported 105k files but only 98547 appeared in the mongo collection. I think it's because some documents are > 16 MB.

Using awk on a JSON log file to extract entries greater than a given time

I have a log file in which each line is a long JSON dictionary. None of the logs have the same length, but all of them have a '_time_' key, which is an epoch time in milliseconds. I want to search this log file and extract the logs whose time is greater than a given time such as 1450616426 (in seconds). Some log examples are:
{'id':Bob, 'last-login':'...', '_time_':1444211444123456, ...}
{'name':'ehsan', 'family':'toghian', 'last-login':'2015-4-12', '_time_': 1444215425123465, .....}
How can I write an awk command for this? Thanks in advance.
$ cat tst.awk
{
milli = $0
sub(/.*_time_[^[:digit:]]+/,"",milli)
sub(/[^[:digit:]].*/,"",milli)
secs = milli / 1000
}
secs > tgt
$ awk -v tgt=1450616426 -f tst.awk file
{'id':Bob, 'last-login':'...', '_time_':1444211444123456, ...}
{'name':'ehsan', 'family':'toghian', 'last-login':'2015-4-12', '_time_': 1444215425123465, .....}
or with GNU awk for gensub():
$ awk -v tgt=1450616426 '(gensub(/.*_time_[^[:digit:]]+([[:digit:]]+).*/,"\\1",1) / 1000) > tgt' file
{'id':Bob, 'last-login':'...', '_time_':1444211444123456, ...}
{'name':'ehsan', 'family':'toghian', 'last-login':'2015-4-12', '_time_': 1444215425123465, .....}
Or with gawk, using match() with a capture array:
awk -vl=1450616426 '{match($0,"_time_.: *([0-9]{10})[0-9]+",a);if(a[1]>l)print}' file
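For comparison with the awk versions above, the same filter can be sketched in Go, in the spirit of the first question on this page. This is only an illustration under the question's assumptions (a _time_ key holding epoch milliseconds, with the threshold given in seconds as the first argument); it is not a drop-in replacement for the awk one-liners:
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

func main() {
	// Threshold in seconds, e.g. 1450616426, passed as the first argument.
	tgt, err := strconv.ParseFloat(os.Args[1], 64)
	if err != nil {
		os.Exit(1)
	}

	// Capture the digits following the '_time_' key (milliseconds, per the question).
	re := regexp.MustCompile(`_time_[^0-9]+([0-9]+)`)

	out := bufio.NewWriter(os.Stdout)
	defer out.Flush()

	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		line := scanner.Text()
		m := re.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		millis, _ := strconv.ParseFloat(m[1], 64)
		if millis/1000 > tgt {
			fmt.Fprintln(out, line)
		}
	}
}
Run it as, for example, go run filter.go 1450616426 < file, where filter.go is whatever you name the sketch.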

Package to query tab-separated files in bash

I often have to run very simple queries on tab-separated files in bash, for example summing/counting/taking the max/min of all the values in the n-th column. I usually do this in awk on the command line, but I've grown tired of rewriting the same one-line scripts over and over, and I'm wondering if there is a known package or solution for this.
For example, consider the text file (test.txt):
apples joe 4
oranges bill 3
apples sally 2
I can query this as:
awk '{ val += $3 } END { print "sum: "val }' test.txt
Also, I may want a where clause:
awk '{ if ($1 == "apples") { val += $3 } } END { print "sum: "val }' test.txt
Or a group by:
awk '{ val[$1] += $3 } END { for(k in val) { print k": "val[k] } }' test.txt
What I would rather do is:
query 'sum($3)' test.txt
query 'sum($3) where $1 = "apples"' test.txt
query 'sum($3) group by $1' test.txt
@Wintermute posted a link to a great tool for this in the comments below. Unfortunately, it does have one drawback:
$ time gawk '{ a += $6 } END { print a }' my1GBfile.tsv
28371787287
real 0m2.276s
user 0m1.909s
sys 0m0.313s
$ time q -t 'select sum(c6) from my1GBfile.tsv'
28371787287
real 3m32.361s
user 3m27.078s
sys 0m1.983s
It also loads the entire file into memory. Obviously this will be necessary in some cases, but it doesn't work for me, as I often work with large files.
Wintermute's answer: tools like q, which can run SQL queries directly on CSVs.
Ed Morton's answer: see https://stackoverflow.com/a/15765479/1745001
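On the memory point above: the sum can be computed in a streaming fashion without loading the file, which is what gawk does. Here is a minimal Go sketch of that streaming approach (the file name colsum.go, the column argument, and the buffer size are illustrative assumptions, not an existing package):
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os"
	"strconv"
)

// Sums one tab-separated column of stdin, streaming line by line.
func main() {
	idx, err := strconv.Atoi(os.Args[1])
	if err != nil {
		os.Exit(1)
	}
	col := idx - 1

	var sum float64
	scanner := bufio.NewScanner(os.Stdin)
	// Allow lines longer than the default 64 KB token limit.
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024)
	for scanner.Scan() {
		fields := bytes.Split(scanner.Bytes(), []byte{'\t'})
		if col >= len(fields) {
			continue
		}
		v, err := strconv.ParseFloat(string(fields[col]), 64)
		if err == nil {
			sum += v
		}
	}
	// Print without exponent notation, similar to the gawk output above.
	fmt.Println(strconv.FormatFloat(sum, 'f', -1, 64))
}
Invocation would look like go run colsum.go 6 < my1GBfile.tsv; a real tool along the lines of the query examples above would parse a small expression language on top of the same streaming loop.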