What is the most efficient way to pass some constant arguments to a function in Julia?

Suppose I have the following function:
function foo(x::Float64, a::Float64)
    if do_some_intense_stuff(a)
        return bar(x)
    else
        return baz(x)
    end
end
Let's assume that at runtime a will be a constant. But x will not. I have to run foo() many times, so I would like it to run as fast as possible, which means running do_some_intense_stuff as rarely as possible. Because a is a constant, at runtime we know which branch the if statement should take.
So ideally, I'd do the following:
foowrapper(x) = foo(x,a)
Y = [foowrapper(x) for x in lots_of_x]
and it would be a lot faster than
Y = [foo(x,a) for x in lots_of_x]
But that's not what happens. I don't blame the compiler for not optimizing my code since I didn't explicitly tell it that foo() will only ever be called with the constant value of a. But is there a good way for me to do that?
Of course, I can always get rid of foo and just write that if statement in the global scope, but that seems inelegant because the rest of the program does not care about the output of do_some_intense_stuff().
Update:
To benchmark the solution suggested below, I implemented the functions as follows. I also modified the declaration of foo() to make a an integer, for obvious reasons:
using Primes, Memoize  # isprime lives in Primes.jl; @memoize comes from Memoize.jl

function bar(x::Float64)
    return 2 * x
    #println("Ran bar for value ", x)
end

function baz(x::Float64)
    return -2 * x
    #println("Ran baz for value ", x)
end

@memoize function do_some_intense_stuff(a::Int64)
    return isprime(a + 32614262352646106013967035018546810367130464316134634614)
end
And defined lots_of_x = 1.0:1.0:1000.0.
Here is the output of @benchmark Y = [foo(x,a) for x in lots_of_x] with and without @memoize:
Without:
BenchmarkTools.Trial:
  memory estimate:  109.50 KiB
  allocs estimate:  5001
  --------------
  minimum time:     6.858 ms (0.00% GC)
  median time:      6.924 ms (0.00% GC)
  mean time:        7.067 ms (0.77% GC)
  maximum time:     78.747 ms (49.00% GC)
  --------------
  samples:          707
  evals/sample:     1
With:
BenchmarkTools.Trial:
  memory estimate:  39.19 KiB
  allocs estimate:  2001
  --------------
  minimum time:     97.500 μs (0.00% GC)
  median time:      98.801 μs (0.00% GC)
  mean time:        108.897 μs (1.37% GC)
  maximum time:     2.099 ms (93.76% GC)
  --------------
  samples:          10000

Perhaps caching the result of your call to do_some_intense_stuff(a) will help, e.g. using Memoize.jl.
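If memoization feels too heavy, a cheaper route is to hoist the expensive check out of the loop and pass its result in. A minimal sketch, not from the thread: foo_precomputed is a hypothetical name, and bar, baz, do_some_intense_stuff, a, and lots_of_x are assumed as defined above.

function foo_precomputed(x::Float64, take_bar::Bool)
    # branch on a Bool computed once, outside the hot loop
    return take_bar ? bar(x) : baz(x)
end

take_bar = do_some_intense_stuff(a)   # the expensive call runs exactly once
Y = [foo_precomputed(x, take_bar) for x in lots_of_x]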

Related

Julia waits for function to finish before printing message in a loop

I have a function in Julia that does its work in a loop. A parameter is passed to the loop, and the bigger this parameter, the slower the function gets. I would like a message telling me which iteration it is on, but it seems that Julia waits for the whole function to finish before printing anything. This is Julia 1.4; that behaviour was not present in Julia 1.3.
An example would be like this:
function f(x)
    rr = 0.000:0.0001:x
    aux = 0
    for r in rr
        print(r, " ")
        aux += veryslowfunction(r)
    end
    return aux
end
As it is, f, when called, does not print anything until it has finished.
You need to add the following after the print command:
    flush(stdout)
Explanation
The standard output of a process is usually buffered. The particular buffer size and behavior will depend on your system settings and perhaps the terminal type.
By flushing the buffer you make sure that its contents are actually sent to the terminal.
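Applied to the function from the question (a minimal sketch; veryslowfunction is assumed to be defined elsewhere):

function f(x)
    rr = 0.000:0.0001:x
    aux = 0
    for r in rr
        print(r, " ")
        flush(stdout)   # push the buffered output to the terminal right away
        aux += veryslowfunction(r)
    end
    return aux
end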
Alternatively, you can also use a library like ProgressLogging.jl (needs TerminalLoggers.jl to see actual output), or ProgressMeter.jl, which will automatically update a nicely formatted status bar during each step of the loop.
For example, with ProgressMeter, a call to

using ProgressMeter

function f(x)
    rr = 0.000:0.0001:x
    aux = 0
    @showprogress for r in rr
        aux += veryslowfunction(r)
    end
    return aux
end
will show something like (in the end):
Progress: 100%|██████████████████████████████████████████████████████████████| Time: 0:00:10
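With ProgressLogging.jl the loop looks much the same; here is a sketch under the assumption that TerminalLoggers.jl is installed, as noted above:

using ProgressLogging, TerminalLoggers, Logging
global_logger(TerminalLogger())   # route progress log records to the terminal

function f(x)
    rr = 0.000:0.0001:x
    aux = 0
    @progress for r in rr
        aux += veryslowfunction(r)
    end
    return aux
end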
Again I can't reproduce the behaviour in my terminal (it always prints), but I wanted to add that for these types of situations the @show macro is quite neat:
julia> function f(x)
           rr = 0.000:0.0001:x
           aux = 0
           for r in rr
               @show r
               aux += veryslowfunction(r)
           end
           return aux
       end
f (generic function with 1 method)
julia> f(1)
r = 0.0
r = 0.0001
r = 0.0002
...
It uses println under the hood:
julia> using MacroTools

julia> a = 5
5

julia> prettify(@expand(@show a))
quote
    Base.println("a = ", Base.repr($(Expr(:(=), :ibex, :a))))
    ibex
end

Trying to improve execution time of Julia code

I need to write a code to generate a string using a grammar with just one rule. For example, if the rule is "G -> [G+G]", and we apply the rule to "G", the result is the string "[G+G]"; if we apply it to the previous result, we obtain "[[G+G]+[G+G]]" and so on. In other words, it's about rewriting the axiom (the left side of the rule) a given number of times, following the rule.
I've been given a piece of code written in Octave that implements this operation (I won't include the code because it's a bit long, but I will if it's necessary for understanding or answering the question). What I need to do is to write an equivalent function in Julia; so I wrote this
function generate_developedstring(axiom::ASCIIString, genome::ASCIIString, iterations::Int8)
    tic()
    developedstring = axiom
    for i = 1:iterations
        developedstring = replace(developedstring, axiom, genome)
    end
    toc()
    return developedstring
end
In the example I wrote earlier, axiom would be "G" and genome "[G+G]".
According to the benchmark times published at julialang.org, Julia should be way faster than Octave, but in this case Octave is twice as fast as Julia (I used the same axiom, genome, and iterations for both codes, and I measured times with the tic/toc functions).
Is there any way to make the Julia code faster?
Edit: First of all, thank you all so much for your comments. I will show you the Octave code I've been given (I didn't write it):
function axiom = ls(genome)
    tic
    ProductionSystem = ['[=>[ ]=>] +=>+ -=>- G=>', genome];
    rule = extract(ProductionSystem);
    n_Rules = length(rule);
    % starting string
    axiom = 'G';
    % iterations (choose only from 1 to 7, >= 8 critical,
    % depends on the string and on the computer !!
    n_Repeats = 3;

    % CALCULATE THE STRING
    % =================================
    for i = 1:n_Repeats
        % a single letter (axiom)
        axiomINcells = cellstr(axiom);
        for j = 1:n_Rules
            % find all occurrences of that axiom
            hit = strfind(axiom, rule(j).pre);
            if (length(hit) >= 1)
                for k = hit
                    % perform the rule
                    % (replace 'pre' by 'post')
                    axiomINcells{k} = rule(j).pos;
                end
            end
        end
        axiom = [];
        for j = 1:length(axiomINcells)
            % put all strings together
            axiom = [axiom, axiomINcells{j}];
        end
    end
    toc

function rule = extract(ProductionSystem)
    % rules are separated by the space character, and pre and post sides are
    % separated by '=>'
    % e.g. F=>FF G=>F[+G][-G]F[+G][-G]FG
    i = 0;
    while (~isempty(ProductionSystem))
        i = i + 1;
        [rule1, ProductionSystem] = strtok(ProductionSystem, ' ');
        [rule(i).pre, post] = strtok(rule1, '=>');
        rule(i).pos = post(3:end);
        if (~isempty(ProductionSystem))
            ProductionSystem = ProductionSystem(2:end); % delete separator
        end
    end
About the Julia version I'm using, it's 0.4.7. You also asked me how fast I need it to run; I just need to write code as fast as possible, and the fact that Octave was faster made me think that I was doing something wrong.
Thank you again.
Can you specify the genome as a rule instead of a pattern? I mean, for example genome = ax -> "[$ax,$ax]".
Compare these two implementations, the first of which is the same as yours:
function genstring(axiom::String, genome::String, iter::Int)
    str = axiom
    for i in 1:iter
        str = replace(str, axiom, genome)
    end
    return str
end
And then with an anonymous function:
genome = ax -> "[$ax,$ax]"

function genstring_(axiom::String, genome, n::Int)
    if n < 1
        return axiom
    end
    return genstring_(genome(axiom), genome, n - 1)
end
On version 0.5, with BenchmarkTools:
julia> @benchmark genstring("G", "[G,G]", 2)
BenchmarkTools.Trial:
  memory estimate:  752.00 bytes
  allocs estimate:  15
  --------------
  minimum time:     745.950 ns (0.00% GC)
  median time:      801.067 ns (0.00% GC)
  mean time:        1.006 μs (14.30% GC)
  maximum time:     50.271 μs (96.63% GC)
  --------------
  samples:          10000
  evals/sample:     119
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark genstring_("G", genome, 2)
BenchmarkTools.Trial:
  memory estimate:  352.00 bytes
  allocs estimate:  9
  --------------
  minimum time:     397.562 ns (0.00% GC)
  median time:      414.149 ns (0.00% GC)
  mean time:        496.511 ns (13.06% GC)
  maximum time:     24.410 μs (97.18% GC)
  --------------
  samples:          10000
  evals/sample:     201
  time tolerance:   5.00%
  memory tolerance: 1.00%
It scales better:
julia> @benchmark genstring("G", "[G,G]", 10)
BenchmarkTools.Trial:
  memory estimate:  18.00 kb
  allocs estimate:  71
  --------------
  minimum time:     93.569 μs (0.00% GC)
  median time:      95.959 μs (0.00% GC)
  mean time:        103.429 μs (3.05% GC)
  maximum time:     4.216 ms (97.14% GC)
  --------------
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark genstring_("G", genome, 10)
BenchmarkTools.Trial:
  memory estimate:  14.13 kb
  allocs estimate:  49
  --------------
  minimum time:     3.072 μs (0.00% GC)
  median time:      3.597 μs (0.00% GC)
  mean time:        5.703 μs (29.78% GC)
  maximum time:     441.515 μs (98.24% GC)
  --------------
  samples:          10000
  evals/sample:     8
  time tolerance:   5.00%
  memory tolerance: 1.00%
As far as I know, string interpolation isn't superfast, so there could be further optimizations.
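As an illustration only (not benchmarked here, and assuming a current Julia where String(take!(io)) exists), one such optimization might replace the interpolating rule with one that writes into an IOBuffer:

# Hypothetical sketch: write the pieces into a buffer rather than
# allocating intermediate strings via interpolation.
function genome_buf(ax::String)
    io = IOBuffer()
    print(io, '[', ax, ',', ax, ']')
    return String(take!(io))
end

genstring_("G", genome_buf, 2)   # usage mirrors the anonymous-function version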

Memory allocation in Julia function

Here is a simple function in Julia 0.5.
function foo{T<:AbstractFloat}(x::T)
    a = zero(T)
    b = zero(T)
    return x
end
I started Julia with julia --track-allocation=user, then ran include("test.jl") (test.jl contains only this function), ran foo(5.), then Profile.clear_malloc_data(), then foo(5.) again in the REPL. After quitting Julia, the file test.jl.mem looks like this:
        - function foo{T<:AbstractFloat}(x::T)
        -     a = zero(T)
   194973     b = zero(T)
        0     return x
        - end
        -
Why are 194973 bytes of memory allocated here? This is also not the first line of the function, although after Profile.clear_malloc_data() that shouldn't matter anyway.
Let's clarify some parts of the relevant documentation, which can be a little misleading:
In interpreting the results, there are a few important details. Under the user setting, the first line of any function directly called from the REPL will exhibit allocation due to events that happen in the REPL code itself.
Indeed, the line with allocation is not the first line. However, it is still the first tracked line, since Julia 0.5 has some issues with tracking allocation on the actual first statement (this has been fixed in v0.6). Note that it may also (contrary to what the documentation says) propagate into functions, even if they are annotated with @noinline. The only real solution is to ensure the first statement of what's being called is something you don't want to measure.
More significantly, JIT-compilation also adds to allocation counts, because much of Julia’s compiler is written in Julia (and compilation usually requires memory allocation). The recommended procedure is to force compilation by executing all the commands you want to analyze, then call Profile.clear_malloc_data() to reset all allocation counters. Finally, execute the desired commands and quit Julia to trigger the generation of the .mem files.
You're right that Profile.clear_malloc_data() prevents the allocation for JIT compilation being counted. However, this paragraph is separate from the first paragraph; clear_malloc_data does not do anything about allocation due to "events that happen in the REPL code itself".
Indeed, as I'm sure you suspected, there is no allocation in this function:
julia> function foo{T<:AbstractFloat}(x::T)
           a = zero(T)
           b = zero(T)
           return x
       end
foo (generic function with 1 method)

julia> @allocated foo(5.)
0
The numbers you see are due to events in the REPL itself. To avoid this issue, wrap the code to measure in a function. That is to say, we can use this as our test harness, perhaps after disabling inlining on foo with @noinline. For instance, here's a revised test.jl:
@noinline function foo{T<:AbstractFloat}(x::T)
    a = zero(T)
    b = zero(T)
    return x
end

function do_measurements()
    x = 0. # dummy statement
    x += foo(5.)
    x # prevent foo call being optimized out
      # (it won't, but this is good practice)
end
Then a REPL session:
julia> include("test.jl")
do_measurements (generic function with 1 method)

julia> do_measurements()
5.0

julia> Profile.clear_malloc_data()

julia> do_measurements()
5.0
Which produces the expected result:
        - @noinline function foo{T<:AbstractFloat}(x::T)
        0     a = zero(T)
        0     b = zero(T)
        0     return x
        - end
        -
        - function do_measurements()
   155351     x = 0. # dummy statement
        0     x += foo(5.)
        0     x # prevent foo call being optimized out
        -       # (it won't, but this is good practice)
        - end
        -
-

"dimension too large" error when broadcasting to sparse matrix in octave

32-bit Octave has a limit on the maximum number of elements in an array. I have recompiled from source (following the script at https://github.com/calaba/octave-3.8.2-enable-64-ubuntu-14.04 ), and now have 64-bit indexing.
Nevertheless, when I attempt to perform elementwise multiplication using a broadcast function, I get "error: out of memory or dimension too large for Octave's index type".
Is this a bug, or an undocumented feature? If it's a bug, does anyone have a reasonably efficient workaround?
Minimal code to reproduce the problem:
function indexerror()
    % both of these are formed without error
    % a = zeros (2^32, 1, 'int8');
    % b = zeros (1024*1024*1024*3, 1, 'int8');
    % sizemax % returns 9223372036854775806

    nnz = 1000    % number of non-zero elements
    rowmax = 250000
    colmax = 100000

    irow = zeros(1, nnz);
    icol = zeros(1, nnz);
    for ind = 1:nnz
        irow(ind) = round(rowmax/nnz*ind);
        icol(ind) = round(colmax/nnz*ind);
    end
    sparseMat = sparse(irow, icol, 1, rowmax, colmax);

    % column vector to be broadcast
    broad = 1:rowmax;
    broad = broad(:);

    % this gives the "dimension too large" error
    toobig = bsxfun(@times, sparseMat, broad);

    % so does this
    toobig2 = sparse(repmat(broad, 1, size(sparseMat, 2)));
    mult = sparse(sparseMat .* toobig2); % never made it this far
end
EDIT:
Well, I have an inefficient workaround. It's slower than using bsxfun by a factor of 3 or so (depending on the details), but it's better than having to sort through the error in the libraries. Hope someone finds this useful some day.
% loop over rows, instead of using bsxfun
mult_loop = sparse([], [], [], rowmax, colmax);
for ind = 1:length(broad)
    mult_loop(ind,:) = broad(ind) * sparseMat(ind,:);
end
The unfortunate answer is that yes, this is a bug. Apparently bsxfun and repmat are returning full matrices rather than sparse ones. The bug has been filed here:
http://savannah.gnu.org/bugs/index.php?47175

How to optimize this short factorial function in Scala? (Creating 50000 BigInts)

I've compared the Scala version
(BigInt(1) to BigInt(50000)).reduce(_ * _)
to the Python version
reduce(lambda x, y: x*y, range(1, 50000))
and it turns out that the Scala version took about 10 times longer than the Python version.
I'm guessing a big difference is that Python can use its native long type instead of creating a new BigInt object for each number. But is there a workaround in Scala?
The fact that your Scala code creates 50,000 BigInt objects is unlikely to be making much of a difference here. A bigger issue is the multiplication algorithm—Python's long uses Karatsuba multiplication and Java's BigInteger (which BigInt just wraps) doesn't.
The easiest workaround is probably to switch to a better arbitrary precision math library, like JScience's:
import org.jscience.mathematics.number.LargeInteger
(1 to 50000).foldLeft(LargeInteger.ONE)(_ times _)
This is faster than the Python solution on my machine.
Update: I've written some quick benchmarking code using Caliper in response to Luigi Plingi's answer, which gives the following results on my (quad core) machine:
benchmark ms linear runtime
BigIntFoldLeft 4774 ==============================
BigIntFold 4739 =============================
BigIntReduce 4769 =============================
BigIntFoldLeftPar 4642 =============================
BigIntFoldPar 500 ===
BigIntReducePar 499 ===
LargeIntegerFoldLeft 3042 ===================
LargeIntegerFold 3003 ==================
LargeIntegerReduce 3018 ==================
LargeIntegerFoldLeftPar 3038 ===================
LargeIntegerFoldPar 246 =
LargeIntegerReducePar 260 =
I don't see the difference between reduce and fold that he does, but the moral is clear: if you can use Scala 2.9's parallel collections, they'll give you a huge improvement, but switching to LargeInteger helps as well.
Python on my machine:
import time

def func():
    start = time.clock()
    reduce(lambda x, y: x*y, range(1, 50000))
    end = time.clock()
    t = (end - start) * 1000
    print t
gives 1219 ms
Scala:
def timed[T](f: => T) = {
  val t0 = System.currentTimeMillis
  val r = f
  val t1 = System.currentTimeMillis
  println("Took: " + (t1 - t0) + " ms")
  r
}
timed { (BigInt(1) to BigInt(50000)).reduce(_ * _) }
4251 ms
timed { (BigInt(1) to BigInt(50000)).fold(BigInt(1))(_ * _) }
4224 ms
timed { (BigInt(1) to BigInt(50000)).par.reduce(_ * _) }
2083 ms
timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }
689 ms
// using org.jscience.mathematics.number.LargeInteger from Travis's answer
timed { val a = (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _) }
3327 ms
timed { val a = (1 to 50000).map(LargeInteger.valueOf(_)).par.fold(
LargeInteger.ONE)(_ times _) }
361 ms
These 689 ms and 361 ms results came after a few warmup runs. They both started at about 1000 ms, but seem to warm up by different amounts. The parallel collections seem to warm up significantly more than the non-parallel ones: the non-parallel operations did not improve significantly from their first runs.
The .par (meaning, use parallel collections) seemed to speed up fold more than reduce. I only have 2 cores, but a greater number of cores should see a bigger performance gain.
So, experimentally, the way to optimize this function is
a) Use fold rather than reduce
b) Use parallel collections
Update:
Inspired by the observation that breaking the calculation down into smaller chunks speeds things up, I managed to get the following to run in 215 ms on my machine, which is a 40% improvement on the standard parallelized algorithm. (Using BigInt, it takes 615 ms.) Also, it doesn't use parallel collections, but somehow uses 90% CPU (unlike for BigInt).
import org.jscience.mathematics.number.LargeInteger

def fact(n: Int) = {
  def loop(seq: Seq[LargeInteger]): LargeInteger = seq.length match {
    case 0 => throw new IllegalArgumentException
    case 1 => seq.head
    case _ => loop {
      val (a, b) = seq.splitAt(seq.length / 2)
      a.zipAll(b, LargeInteger.ONE, LargeInteger.ONE).map(i => i._1 times i._2)
    }
  }
  loop((1 to n).map(LargeInteger.valueOf(_)).toIndexedSeq)
}
Another trick here could be to try both reduceLeft and reduceRight to see what is fastest. On your example I get a much faster execution of reduceRight:
scala> timed { (BigInt(1) to BigInt(50000)).reduceLeft(_ * _) }
Took: 4605 ms
scala> timed { (BigInt(1) to BigInt(50000)).reduceRight(_ * _) }
Took: 2004 ms
Same difference between foldLeft and foldRight. Guess it matters what side of the tree you start reducing from :)
The most efficient way to calculate a factorial in Scala is to use a divide-and-conquer strategy:
def fact(n: Int): BigInt = rangeProduct(1, n)

private def rangeProduct(n1: Long, n2: Long): BigInt = n2 - n1 match {
  case 0 => BigInt(n1)
  case 1 => BigInt(n1 * n2)
  case 2 => BigInt(n1 * (n1 + 1)) * n2
  case 3 => BigInt(n1 * (n1 + 1)) * ((n2 - 1) * n2)
  case _ =>
    val nm = (n1 + n2) >> 1
    rangeProduct(n1, nm) * rangeProduct(nm + 1, n2)
}
Also, to get more speed, use the latest version of the JDK and the following JVM options:
-server -XX:+TieredCompilation
Below are results for an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz), RAM 12Gb DDR3-1333, Windows 7 SP1, Oracle JDK 1.8.0_25-b18 64-bit:
(BigInt(1) to BigInt(100000)).product took: 3,806 ms with 26.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduce(_ * _) took: 3,728 ms with 25.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceLeft(_ * _) took: 3,510 ms with 25.1 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceRight(_ * _) took: 4,056 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).fold(BigInt(1))(_ * _) took: 3,697 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.product took: 406 ms with 66.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduce(_ * _) took: 296 ms with 71.1 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceLeft(_ * _) took: 3,495 ms with 25.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceRight(_ * _) took: 3,900 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.fold(BigInt(1))(_ * _) took: 327 ms with 56.1 % of CPU usage
fact(100000) took: 203 ms with 28.3 % of CPU usage
BTW, to improve the efficiency of factorial calculation for numbers greater than 20000, use an implementation of the Schönhage-Strassen algorithm, or wait until it is merged into JDK 9, at which point Scala will be able to use it.