Automatically sum numeric columns and print total - language-agnostic

Given the output of git ... --stat:
3 files changed, 72 insertions(+), 21 deletions(-)
3 files changed, 27 insertions(+), 4 deletions(-)
4 files changed, 164 insertions(+), 0 deletions(-)
9 files changed, 395 insertions(+), 0 deletions(-)
1 files changed, 3 insertions(+), 2 deletions(-)
1 files changed, 1 insertions(+), 1 deletions(-)
2 files changed, 57 insertions(+), 0 deletions(-)
10 files changed, 189 insertions(+), 230 deletions(-)
3 files changed, 111 insertions(+), 0 deletions(-)
8 files changed, 61 insertions(+), 80 deletions(-)
I wanted to produce the sum of the numeric columns but preserve the formatting of the line. In the interest of generality, I produced this awk script that automatically sums any numeric columns and produces a summary line:
{
    for (i = 1; i <= NF; ++i) {
        if ($i + 0 != 0) {
            numeric[i] = 1;
            total[i] += $i;
        }
    }
}
END {
    # re-use non-numeric columns of last line
    for (i = 1; i <= NF; ++i) {
        if (numeric[i])
            $i = total[i]
    }
    print
}
Yielding:
44 files changed, 1080 insertions(+), 338 deletions(-)
Awk has several features that simplify the problem, like automatic string->number conversion, all arrays as associative arrays, and the ability to overwrite auto-split positional parameters and then print the equivalent lines.
Is there a better language for this hack?
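For reference, here is a readable (non-golfed) Python sketch of the same generic idea; it is only an illustration of the problem statement, not one of the golfed answers below, and it assumes the lines arrive on stdin:

import sys

# Sum every column that parses as a number; reuse the last line's
# non-numeric fields when printing the summary line.
rows = [line.split() for line in sys.stdin if line.strip()]
totals = {}                      # column index -> running sum
for fields in rows:
    for i, field in enumerate(fields):
        try:
            totals[i] = totals.get(i, 0) + int(field)
        except ValueError:
            pass                 # non-numeric column, leave it alone
last = rows[-1]
print(" ".join(str(totals[i]) if i in totals else f for i, f in enumerate(last)))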

Perl - 47 char
Inspired by ChristopheD's awk solution. Used with the -an command-line switch. 43 chars + 4 chars for the command-line switch:
$i-=@a=map{($b[$i++]+=$_)||$_}@F}{print"@a"
I can get it to 45 (41 + -ap switch) with a little bit of cheating:
$i=0;$_="Ctrl-M#{[map{($b[$i++]+=$_)||$_}#F]}"
Older, hash-based 66 char solution:
@a=(),s#(\d+)(\D+)#$b{$a[@a]=$2}+=$1#gefor<>;print map$b{$_}.$_,@a

Ruby — 87
puts ' '+[*$<].map(&:split).inject{|i,j|[0,3,5].map{|k|i[k]=i[k].to_i+j[k].to_i};i}*' '

Python - 101 chars
import sys
print" ".join(`sum(map(int,x))`if"A">x[0]else x[0]for x in zip(*map(str.split,sys.stdin)))'
Using reduce is longer at 126 chars
import sys
print" ".join(reduce(lambda X,Y:[str(int(x)+int(y))if"A">x[0]else x for x,y in zip(X,Y)],map(str.split,sys.stdin)))

AWK - 63 characters
(in a bash script, $1 is the filename provided as a command-line argument):
awk -F' ' '{x+=$1;y+=$4;z+=$6}END{print x,$2,$3,y,$5,z,$7}' $1
One could of course also pipe the input in (would save another 3 characters when allowed).

This problem is not challenging or difficult... it is "cute" though.
Here is solution in Python:
import sys
r = []
for s in sys.stdin:
    r = map(lambda x,y:(x or 0)+int(y) if y.isdigit() else y, r, s.split())
print ' '.join(map(str, r))
What does it do? It keeps a tally in r while proceeding line by line. It splits the line, then for each element of the list, if it is a number, adds it to the tally, otherwise keeps it as a string. At the end everything is mapped back to strings and joined with spaces to be printed.
Alternative, more "algebraic" implementation, if we did not care about reading all input at once:
import sys
def totalize(l):
    try: r = str(sum(map(int,l)))
    except: r = l[-1]
    return r
print ' '.join(map(totalize, zip(*map(str.split, sys.stdin))))
What does this one do? totalize() takes a list of strings and tries to calculate the sum of the numbers; if that fails, it simply returns the last one. zip() is fed a matrix that is a list of rows, each of which is a list of that row's column items; zip transposes the matrix so it becomes a list of columns, then totalize is invoked on each column and the results are joined as before.

At the expense of making your code slightly longer, I moved the main parsing into the BEGIN clause so the main clause is only processing numeric fields. For a slightly larger input file, I was able to measure a significant improvement in speed.
BEGIN {
    getline
    for (i = 1; i <= NF; ++i) {
        # need to test for 0, too, in this version
        if ($i == 0 || $i + 0 != 0) {
            numeric[i] = 1;
            total[i] = $i;
        }
    }
}
{
    for (i in numeric) total[i] += $i
}
END {
    # re-use non-numeric columns of last line
    for (i = 1; i <= NF; ++i) {
        if (numeric[i])
            $i = total[i]
    }
    print
}
I made a test file using your data and doing paste file file file ... and cat file file file ... so that the result had 147 fields and 1960 records. My version took about 1/4 as long as yours. On the original data, the difference was not measurable.

JavaScript (Rhino) - 183 154 139 bytes
Golfed:
x=[n=0,0,0];s=[];readFile('/dev/stdin').replace(/(\d+)(\D+)/g,function(a,b,c){x[n]+=+b;s[n++]=c;n%=3});print(x[0]+s[0]+x[1]+s[1]+x[2]+s[2])
Readable-ish:
x=[n=0,0,0];
s=[];
readFile('/dev/stdin').replace(/(\d+)(\D+)/g,function(a,b,c){
    x[n]+=+b;
    s[n++]=c;
    n%=3
});
print(x[0]+s[0]+x[1]+s[1]+x[2]+s[2]);

PHP 152 130 Chars
Input:
$i = "
3 files changed, 72 insertions(+), 21 deletions(-)
3 files changed, 27 insertions(+), 4 deletions(-)
4 files changed, 164 insertions(+), 0 deletions(-)
9 files changed, 395 insertions(+), 0 deletions(-)
1 files changed, 3 insertions(+), 2 deletions(-)
1 files changed, 1 insertions(+), 1 deletions(-)
2 files changed, 57 insertions(+), 0 deletions(-)
10 files changed, 189 insertions(+), 230 deletions(-)
3 files changed, 111 insertions(+), 0 deletions(-)
8 files changed, 61 insertions(+), 80 deletions(-)";
Code:
$a = explode(" ", $i);
foreach($a as $k => $v){
    if($k % 7 == 0)
        $x += $v;
    if(3-$k % 7 == 0)
        $y += $v;
    if(5-$k % 7 == 0)
        $z += $v;
}
echo "$x $a[1] $a[2] $y $a[4] $z $a[6]";
Output:
44 files changed, 1080 insertions(+), 338 deletions(-)
Note: explode() requires that there is a space character before the newline.

Haskell - 151 135 bytes
import Char
c a b|all isDigit(a++b)=show$read a+read b|True=a
main=interact$unwords.foldl1(zipWith c).map words.filter(not.null).lines
... but I'm sure it can be done better/smaller.

Lua, 140 bytes
I know Lua isn't the best golfing language, but compared by the size of the runtimes, it does pretty well I think.
f,i,d,s=0,0,0,io.read"*a"for g,a,j,b,e,c in s:gmatch("(%d+)(.-)(%d+)(.-)(%d+)(.-)")do f,i,d=f+g,i+j,d+e end print(table.concat{f,a,i,b,d,c})

PHP, 176 166 164 159 158 153
for($a=-1;$a<count($l=explode("
",$i));$r=explode(" ",$l[++$a]))for($b=-1;$b<count($r);$c[++$b]=is_numeric($r[$b])?$c[$b]+$r[$b]:$r[$b]);echo join(" ",$c);
This would, however, require the whole input in $i... A variant with $i replaced by $_POST["i"], so the input could be submitted via a textarea, has 162 chars:
for($a=-1;$a<count($l=explode("
",$_POST["i"]));$r=explode(" ",$l[$a++]))for($b=0;$b<count($r);$c[$b]=is_numeric($r[$b])?$c[$b]+$r[$b]:$r[$b])$b++;echo join(" ",$c);
This is a version with NO HARDCODED COLUMNS.

rjson::fromJSON returns only the first item

I have a sqlite database file with several columns. One of the columns has a JSON dictionary (with two keys) embedded in it. I want to extract the JSON column to a data frame in R that shows each key in a separate column.
I tried rjson::fromJSON, but it reads only the first item. Is there a trick that I'm missing?
Here's an example that mimics my problem:
> eg <- as.vector(c("{\"3x\": 20, \"6y\": 23}", "{\"3x\": 60, \"6y\": 50}"))
> fromJSON(eg)
$3x
[1] 20
$6y
[1] 23
The desired output is something like:
# a data frame for both variables
3x 6y
1 20 23
2 60 50
or,
# a data frame for each variable
3x
1 20
2 60
6y
1 23
2 50
What you are looking for is actually a combination of lapply and some application of rbind or related.
I'll extend your data a little, just to have more than 2 elements.
eg <- c("{\"3x\": 20, \"6y\": 23}",
"{\"3x\": 60, \"6y\": 50}",
"{\"3x\": 99, \"6y\": 72}")
library(jsonlite)
Using base R, we can do
do.call(rbind.data.frame, lapply(eg, fromJSON))
# X3x X6y
# 1 20 23
# 2 60 50
# 3 99 72
You might be tempted to do something like Reduce(rbind, lapply(eg, fromJSON)), but the notable difference is that in the Reduce model, rbind is called "N-1" times, where "N" is the number of elements in eg; this results in a LOT of copying of data, and though it might work alright with small "N", it scales horribly. With the do.call option, rbind is called exactly once.
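The same copying behaviour can be sketched outside R; here is a rough Python/pandas analogue (pandas is assumed to be available, and none of this is part of the R answer) contrasting the two growth patterns:

import pandas as pd

# Toy stand-ins for the parsed JSON rows
chunks = [pd.DataFrame({"3x": [v], "6y": [v + 3]}) for v in (20, 60, 99)]

# "Reduce(rbind, ...)" style: N-1 concatenations, each copying everything
# accumulated so far, so total work grows quadratically with N.
slow = chunks[0]
for chunk in chunks[1:]:
    slow = pd.concat([slow, chunk], ignore_index=True)

# "do.call(rbind, ...)" style: one concatenation, one copy.
fast = pd.concat(chunks, ignore_index=True)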
Notice that the column labels have been R-ized, since data.frame column names should not start with numbers. (It is possible, but generally discouraged.)
If you're confident that all substrings will have exactly the same elements, then you may be good here. If there's a chance that there will be a difference at some point, perhaps
eg <- c(eg, "{\"3x\": 99}")
then you'll notice that the base R solution no longer works by default.
do.call(rbind.data.frame, lapply(eg, fromJSON))
# Error in (function (..., deparse.level = 1, make.row.names = TRUE, stringsAsFactors = default.stringsAsFactors()) :
# numbers of columns of arguments do not match
There may be techniques to try to normalize the elements such that you can be assured of matches. However, if you're not averse to a tidyverse package:
library(dplyr)
eg2 <- bind_rows(lapply(eg, fromJSON))
eg2
# # A tibble: 4 × 2
# `3x` `6y`
# <int> <int>
# 1 20 23
# 2 60 50
# 3 99 72
# 4 99 NA
Though you cannot access them directly with the dollar-sign method, you can still use [[ or backticks.
eg2$3x
# Error: unexpected numeric constant in "eg2$3"
eg2[["3x"]]
# [1] 20 60 99 99
eg2$`3x`
# [1] 20 60 99 99

Highlight a matched pattern in a DNA sequence with HTML markup using Perl

I am working on generating an HTML page using a CGI script in Perl.
I need to filter some sequences to check whether they contain a specific pattern; if they do, I need to print those sequences on my page with 50 bases per line and highlight the pattern within them. My sequences are in a hash called %hash; the keys are the names, the values are the actual sequences.
my %hash2;
foreach my $key (keys %hash) {
    if ($hash{$key} =~ s!(aaagg)!<b>$1</b>!) {
        $hash2{$key} = $hash{$key}
    }
}
foreach my $key (keys %hash2) {
    print "<p> <b> $key </b> </p>";
    print "<p>$_</p>\n" for unpack '(A50)*', $hash2{$key};
}
This method "does" the job however if I highlight the pattern "aaagg" using this method I am messing up the unpacking of the line (for unpack '(A50)*'); because now the sequences contains the extra characters of the bold tags which are included in the unpacking count. This beside making the lines of different length it is also a big problem if the tag falls between 2 lines due to unpacking 50 characters, it basically remains open and everything after that is bold.
The script below uses a single randomly generated DNA sequence of length 243 (generated using http://www.bioinformatics.org/sms2/random_dna.html) and a variable-length pattern.
It works by first recording the positions which need to be highlighted instead of changing the sequence string. The highlighting is inserted after the sequence is split into chunks of 50 bases.
The highlighting is done in reverse order to minimize bookkeeping busy work.
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use YAML::XS;
my $PRETTY_WIDTH = 50;
# I am using bold-italic so the highlighting
# is visible on Stackoverflow, but in real
# life, this would be something like:
# my @PRETTY_MARKUP = ('<span class="highlighted-match">', '</span>');
my @PRETTY_MARKUP = ('<b><i>', '</i></b>');
use constant { BAŞ => 0, SON => 1, ROW => 0, COL => 1 };
my $sequence = q{ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgcttttccccgtaatgagggccccatattcaggccgtcgtccggaattgtcttggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatatctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagaccggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct};
my $wanted = 'c..?gg';
my @pos;
while ($sequence =~ /($wanted)/g) {
    push @pos, [ pos($sequence) - length($1), pos($sequence) ];
}
print Dump \@pos;
my @output = unpack "(A$PRETTY_WIDTH)*", $sequence;
print Dump \@output;
while (my $pos = pop @pos) {
    my @rc = map pos_to_rc($_, $PRETTY_WIDTH), @$pos;
    substr($output[ $rc[$_][ROW] ], $rc[$_][COL], 0, $PRETTY_MARKUP[$_]) for SON, BAŞ;
}
print Dump \@output;
sub pos_to_rc {
    my $r = int( $_[0] / $_[1] );
    my $c = $_[0] - $r * $_[1];
    [ $r, $c ];
}
Output:
C:\...\Temp> perl s.pl
---
- - 0
- 4
- - 76
- 80
- - 87
- 91
- - 97
- 102
- - 104
- 108
- - 165
- 170
- - 184
- 188
- - 198
- 202
- - 226
- 231
---
- ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgct
- tttccccgtaatgagggccccatattcaggccgtcgtccggaattgtctt
- ggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatat
- ctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagacc
- ggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct
---
- ccggtgagacatccagttagttcactgagccgacttgcatcagtcatgct
- tttccccgtaatgagggccccatattcaggccgtcgtccggaattgtctt
- ggatccggaatgcagcttttctcaccgcttgatgaacattcactgaatat
- ctgacgccgcgaaaacagggtcactagcctgtttccggtcgcccgagacc
- ggcgagtttgtggtatcgcgagcgcccccgggcggtagggtct
Especially since this turns out to have been a homework assignment, it is now up to you to take this and apply it to all sequences in your hash table.
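If it helps to see the same positions-first idea without the Perl syntax, here is a rough Python sketch (my own names, not the author's code): it records the match spans, splits the sequence into 50-base rows, and inserts the markup in reverse order exactly as described above.

import re

WIDTH = 50
OPEN_TAG, CLOSE_TAG = "<b><i>", "</i></b>"   # same markup as the Perl example

def highlight(sequence, pattern):
    # 1. record the (start, end) of every match without changing the string
    spans = [m.span() for m in re.finditer(pattern, sequence)]
    # 2. split the sequence into fixed-width rows
    rows = [sequence[i:i + WIDTH] for i in range(0, len(sequence), WIDTH)]
    # 3. insert the tags in reverse order so earlier offsets stay valid
    for start, end in reversed(spans):
        for pos, tag in ((end, CLOSE_TAG), (start, OPEN_TAG)):
            r, c = divmod(pos, WIDTH)
            if r == len(rows):           # match ends exactly at the sequence end
                r, c = r - 1, len(rows[-1])
            rows[r] = rows[r][:c] + tag + rows[r][c:]
    return rows

# e.g. print("\n".join(highlight(sequence, "c..?gg")))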

Fixing broken csv files using awk

I have some csv files which are broken since there is junk such as control characters, newlines, and delimiters in some of the fields. An example mockup of the data without the control characters:
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut
in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut
and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut
in
three;data 38
9;data 19;data 29;data 39
I am processing the above crap with awk:
BEGIN { FS=OFS=";" }       # delimiters
NR==1 { nf=NF; }           # header record is fine, use its NF
NR>1 {
    if(NF<nf) {            # if NF is less than the header's NF
        prev=$0            # store $0
        if(getline==1) {   # read the "next" line
            succ=$0        # set the "next" line to succ
            $0=prev succ   # rebuild the current record
        }
    }
    if(NF!=nf)             # if NF is still not adequate
        $0=succ            # assume the original line was malformed
    if(NF!=nf)             # if the "next" line was malformed as well
        next               # skip it and move on
} 1
Naturally the above program will fail on records 4 and 6 (as the actual data has several fields where the extra delimiter may lurk) and on record 8 (since I only read the next line if NF is too short). I can live with losing 4 and 6, but 8 might be doable?
Also, three successive ifs scream for a for loop but it's Friday afternoon here and my day is nearing $ and I just can't spin my head around it anymore. Do you guys have any brain reserve left I could borrow? Any best practices I didn't think of?
The key here is to keep a buffer containing the lines that are still not "complete"; once they are, print them and clear the buffer:
awk -F';' 'NF>=4 && !nf {print; next}  # normal lines are printed
{                                      # otherwise,
    if (nf>0) {                        # continue a "broken" record by...
        buff=buff OFS $0               # appending to the buffer
        nf+=NF-1                       # and adding NF-1
    } else {                           # new "broken" record, so...
        buff=$0                        # start the buffer
        nf=NF                          # set number of fields already seen
    }
}
nf>=4{                                 # once the record is complete
    print buff                         # print it
    buff=""; nf=0                      # and reset the variables
}' file
Here, buff is that buffer and nf an internal counter that keeps track of how many fields have been seen so far for the current record (like you did in your attempt).
We add NF-1 when appending to the buffer (that is, from the 2nd line of a broken record) because a line with NF==1 does not add any new field; it just concatenates with the last field of the previous line:
8;data 18;cut # NF==3 |
in # NF==1 but it just continues $3 | all together, NF==4
three;data 38 # NF==2 but $1 continues $3 |
With your sample input:
$ awk -F';' 'NF>=4 && !nf {print; next} {buff=(nf>0 ? buff OFS : "") $0; nf+=(nf>0 ? NF-1 : NF)} nf>=4{print buff; buff=""; nf=0}' a
id;col 1;col 2;col 3
1;data 11;good 21;data 31
2;data 12;cut in two;data 32
3;data 13;good 23;data 33
4;data 14;has;extra delimiter;data 34
5;data 15;good 25;data 35
6;data 16;cut and;extra delimiter;data 36
7;data 17;data 27;data 37
8;data 18;cut in three;data 38
9;data 19;data 29;data 39
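For what it's worth, the same buffering idea can be prototyped outside awk; here is a rough Python sketch (assuming ';'-separated input on stdin, 4 expected fields, and a space as the glue between joined fragments, mirroring the awk above):

import sys

EXPECTED = 4                 # fields in a complete record (from the header)

buff, nf = "", 0
for line in sys.stdin:
    line = line.rstrip("\n")
    fields = len(line.split(";"))
    if not buff and fields >= EXPECTED:
        print(line)          # normal line, print as-is
        continue
    if buff:
        buff += " " + line   # continuation: glue onto the previous fragment
        nf += fields - 1     # the first piece only extends the last field
    else:
        buff, nf = line, fields
    if nf >= EXPECTED:
        print(buff)          # record is complete again
        buff, nf = "", 0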

Calling .csv file into Octave

I have a code written in C++ that outputs a .csv file with data in three columns (Time, Force, Height). I want to plot the data using Octave, or else use the octave function plot in the C++ file (I'm aware this is possible but I don't necessarily need to do it this way).
Right now I have the simple .m file:
filename = linear_wave_loading.csv;
M = csvread(filename);
Just to practice bringing this file into Octave (I will try to plot it afterwards).
I am getting this error:
error: dlmread: error parsing range
What is the correct method to load .csv files into octave?
Edit: Here is the first few lines of my .csv file
Wavelength= 88.7927 m
Time Height Force(KN/m)
0 -20 70668.2
0 -19 65875
0 -18 61411.9
0 -17 57256.4
Thanks in advance for your help.
Using octave 3.8.2
>> format long g
>> dlmread ('test.csv',' ',2,0)
ans =
0 0 0 -20 70668.2
0 0 0 -19 65875
0 0 0 -18 61411.9
0 0 0 -17 57256.4
In general, use dlmread if your value separator is not a comma. Furthermore, you have to skip the two header lines.
In theory dlmread works with tab-separated values ('\t') too, but this fails with your given example because of the inconsistent tab widths (maybe it's just a copy-paste problem), so taking one space ' ' as the separator is a workaround.
You would do better to save your .csv file comma-separated:
Wavelength= 88.7927 m
Time Height Force(KN/m)
0, -20, 70668.2
0, -19, 65875
0, -18, 61411.9
0, -17, 57256.4
Then you can easily do dlmread('file.csv',',',2,0).
You can try my csv2cell function (not to be confused with csv2cell from the io package!); I have never tried it on versions < 3.8.0.
>> str2double(reshape(csv2cell('test.csv', ' +',2),3,4))'
ans =
0 -20 70668.2
0 -19 65875
0 -18 61411.9
0 -17 57256.4
Usually it reshapes automatically without problems, but with space separators it often fails, so you have to reshape it yourself (and convert to double in any case).
And if you need your header line:
>> reshape(csv2cell('test.csv', ' +',1),3,5)'
ans =
{
[1,1] = Time
[2,1] = +0
[3,1] = +0
[4,1] = +0
[5,1] = +0
[1,2] = Height
[2,2] = -20
[3,2] = -19
[4,2] = -18
[5,2] = -17
[1,3] = Force(KN/m)
[2,3] = 70668.2
[3,3] = 65875
[4,3] = 61411.9
[5,3] = 57256.4
}
But take care: everything in the resulting cell array is then a string.
You're not storing your .csv filename as a string.
Try:
filename = 'linear_wave_loading.csv';

Best way to grep for HTML

I'm having trouble using grep to search through some HTML code.
I'm trying to find strings similar to this:
<td>product description here</td><td> $<font color='red'>0.25</font>
I'm trying to generalize a formula to count each line that is under $0.25. The parts that will vary are:
href='/go/12229' - the number after /go/ will change but will always be a number 5 digits long
the product description, which can be alphanumeric with spaces and special characters
the price, which can be anything from 0.01 to 0.25
I've tried making formulas like the one below, but they either do not work or return nothing.
grep -c "href='/go/'[*] target="_blank" rel="nofollow">*</a></td><td> $<font color='red'>[0].[0-2][0-9]</font>"
I think it has to do with me not escaping special characters correctly, but I'm not sure.
Any help is appreciated.
Okay - this requires that each line be formatted as in your example, but this should give you the link, description, and price for each line whose price is between 0.01 and 0.25. Take the contents of this code, put them in a file like "priceawk", and make it executable:
grep 'go\/[0-9]\{5\}' | awk -F"<" '
{
    split( $7, price_arr, ">" )
    if( price_arr[ 2 ] > 0.00 && price_arr[ 2 ] < 0.26 )
    {
        split( $3, link_arr, "'\''" )
        split( link_arr[ 3 ], desc_arr, ">" )
        printf( "%s %s %s\n", link_arr[ 2 ], desc_arr[ 2 ], price_arr[ 2 ] )
    }
} '
Then use it like:
cat input | priceawk
With a test input file I made from your line, I get the following kinds of output:
/go/12229 product description here 0.25
/go/13455 find this line2 0.01
/go/12334 find this line3 0.23
/go/34455 find this line4 0.16
The printf() can be improved to give your output in a different form, with a more useful delimiter than the current space.
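If the grep quoting keeps fighting you, the same filter can be sketched in Python with a single regular expression; note that the exact HTML layout (an anchor tag wrapping the description) is assumed from your description, so adjust the pattern to the real markup:

import re
import sys

# 5-digit /go/ id, anchor text as the description, red-font price (assumed layout)
row = re.compile(r"href='(/go/\d{5})'[^>]*>([^<]*)</a></td>"
                 r"<td> \$<font color='red'>(\d+\.\d{2})</font>")

count = 0
for line in sys.stdin:
    m = row.search(line)
    if m and 0.01 <= float(m.group(3)) <= 0.25:
        count += 1
        print(m.group(1), m.group(2), m.group(3))
print("lines under $0.25:", count)

Used like: python pricefilter.py < input (the script name is hypothetical).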