I need to write code that generates a string using a grammar with just one rule. For example, if the rule is "G -> [G+G]" and we apply it to "G", the result is the string "[G+G]"; if we apply it to the previous result, we obtain "[[G+G]+[G+G]]", and so on. In other words, it's about rewriting the axiom (the left side of the rule) a given number of times, following the rule.
I've been given a piece of code written in Octave that implements this operation (I won't include the code because it's a bit long, but I will if it's necessary for understanding or answering the question). What I need to do is write an equivalent function in Julia, so I wrote this:
function generate_developedstring(axiom::ASCIIString, genome::ASCIIString, iterations::Int8)
    tic()
    developedstring = axiom
    for i = 1:iterations
        developedstring = replace(developedstring, axiom, genome)
    end
    toc()
    return developedstring
end
In the example I wrote earlier, axiom would be "G" and genome "[G+G]".
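For example, calling it with those values should return the second expansion from above (toc() also prints the elapsed time):
generate_developedstring("G", "[G+G]", Int8(2))
# returns "[[G+G]+[G+G]]"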
According to the benchmark times published at julialang.org, Julia should be way faster than Octave, but in this case Octave is twice as fast as Julia
(I used the same axiom, genome and iterations for both codes, and I measured times with the tic/toc functions).
Is there any way to make the Julia code faster?
Edit: First of all, thank you all so much for your comments. I will show you the Octave code I've been given (I didn't write it):
function axiom = ls(genome)
tic
ProductionSystem = ['[=>[ ]=>] +=>+ -=>- G=>',genome];
rule = extract(ProductionSystem);
n_Rules = length(rule);
% starting string
axiom = 'G';
% iterations (choose only from 1 to 7, >= 8 critical,
% depends on the string and on the computer !!)
n_Repeats = 3;

% CALCULATE THE STRING
% =================================
for i = 1:n_Repeats
    % a single letter (axiom)
    axiomINcells = cellstr(axiom);
    for j = 1:n_Rules
        % find all occurrences of that axiom
        hit = strfind(axiom, rule(j).pre);
        if (length(hit) >= 1)
            for k = hit
                % perform the rule
                % (replace 'pre' by 'post')
                axiomINcells{k} = rule(j).pos;
            end
        end
    end
    axiom = [];
    for j = 1:length(axiomINcells)
        % put all strings together
        axiom = [axiom, axiomINcells{j}];
    end
end
toc

function rule = extract(ProductionSystem)
% rules are separated by the space character, and the pre and post sides are
% separated by '->'
% e.g. F->FF G->F[+G][-G]F[+G][-G]FG
i = 0;
while (~isempty(ProductionSystem))
    i = i + 1;
    [rule1, ProductionSystem] = strtok(ProductionSystem, ' ');
    [rule(i).pre, post] = strtok(rule1, '=>');
    rule(i).pos = post(3:end);
    if (~isempty(ProductionSystem))
        ProductionSystem = ProductionSystem(2:end); % delete separator
    end
end
About the Julia version I'm using, it's 0.4.7. You also asked me how fast I need it to run; I just need to write code that is as fast as possible, and the fact that Octave was faster made me think I was doing something wrong.
Thank you again.
Can you specify the genome as a rule instead of a pattern? I mean, for example genome = ax -> "[$ax,$ax]".
Compare these two implementations, the first of which is the same as yours:
function genstring(axiom::String, genome::String, iter::Int)
    str = axiom
    for i in 1:iter
        str = replace(str, axiom, genome)
    end
    return str
end
And then with an anonymous function:
genome = ax -> "[$ax,$ax]"

function genstring_(axiom::String, genome, n::Int)
    if n < 1
        return axiom
    end
    return genstring_(genome(axiom), genome, n-1)
end
On version 0.5, with BenchmarkTools:
julia> @benchmark genstring("G", "[G,G]", 2)
BenchmarkTools.Trial:
memory estimate: 752.00 bytes
allocs estimate: 15
--------------
minimum time: 745.950 ns (0.00% GC)
median time: 801.067 ns (0.00% GC)
mean time: 1.006 μs (14.30% GC)
maximum time: 50.271 μs (96.63% GC)
--------------
samples: 10000
evals/sample: 119
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark genstring_("G", genome, 2)
BenchmarkTools.Trial:
memory estimate: 352.00 bytes
allocs estimate: 9
--------------
minimum time: 397.562 ns (0.00% GC)
median time: 414.149 ns (0.00% GC)
mean time: 496.511 ns (13.06% GC)
maximum time: 24.410 μs (97.18% GC)
--------------
samples: 10000
evals/sample: 201
time tolerance: 5.00%
memory tolerance: 1.00%
It scales better:
julia> @benchmark genstring("G", "[G,G]", 10)
BenchmarkTools.Trial:
memory estimate: 18.00 kb
allocs estimate: 71
--------------
minimum time: 93.569 μs (0.00% GC)
median time: 95.959 μs (0.00% GC)
mean time: 103.429 μs (3.05% GC)
maximum time: 4.216 ms (97.14% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark genstring_("G", genome, 10)
BenchmarkTools.Trial:
memory estimate: 14.13 kb
allocs estimate: 49
--------------
minimum time: 3.072 μs (0.00% GC)
median time: 3.597 μs (0.00% GC)
mean time: 5.703 μs (29.78% GC)
maximum time: 441.515 μs (98.24% GC)
--------------
samples: 10000
evals/sample: 8
time tolerance: 5.00%
memory tolerance: 1.00%
As far as I know, string interpolation isn't super fast, so there could be further optimizations.
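Since the rule is fixed here, one more option (a rough sketch in current Julia syntax, not benchmarked; writestring! and genstring_buf are just illustrative names) is to skip interpolation entirely and write the fully expanded string into a single IOBuffer, so no intermediate strings are allocated:
function writestring!(io::IO, axiom::AbstractString, n::Int)
    # Expand the rule G -> [G,G] recursively, printing directly into `io`
    if n < 1
        print(io, axiom)
    else
        print(io, '[')
        writestring!(io, axiom, n - 1)
        print(io, ',')
        writestring!(io, axiom, n - 1)
        print(io, ']')
    end
    return io
end

genstring_buf(axiom::AbstractString, n::Int) = String(take!(writestring!(IOBuffer(), axiom, n)))
For example, genstring_buf("G", 2) returns "[[G,G],[G,G]]", the same result as genstring_ above.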
I am trying to write a CUDA kernel to generate a row-wise histogram based on the input feature set (2 x 6), where each feature row (each having 6 features) is to generate a histogram having nbins=10.
I have implemented the code below, but it doesn't seem to generate the correct row-wise histogram.
import numba
import numpy as np
from numba import cuda

np.random.seed(0)
feature = np.random.randint(1, high=6, size=(2,6), dtype=int)
output = np.zeros(20).astype(np.float32).reshape(2,10)

### Kernel Configuration
threads_per_block = 6
blocks = 2

# moving data to device
d_feature = cuda.to_device(feature)
d_output = cuda.to_device(output)
feature_size = d_feature.shape[1]

@cuda.jit
def row_wise_histogram(feature, output, n):
    xmin = np.float32(-4.0)
    xmax = np.float32(4.0)
    idx = cuda.grid(1)
    nbins = 10
    bin_width = (xmax - xmin) / nbins
    for i in range(n):
        # Each thread will take all the row features to generate the histogram
        input = feature[idx][i]
        bin_number = np.int32(nbins * (np.float32(input) - np.float32(xmin)) / (np.float32(xmax) - np.float32(xmin)))
        if bin_number >= 0 and bin_number < output.shape[1]:
            cuda.atomic.add(output[idx], bin_number, 1)

row_wise_histogram[blocks, threads_per_block](d_feature, d_output, feature_size)
print(d_output.copy_to_host())
And the output is
[[ 0. 0. 0. 0. 0. 0. 81111. 81111. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 162222. 0. 81111. 0.]]
which is wrong. I will appreciate it if I can get help with the issue inside the row_wise_histogram function!
I think the main issue in your code is that your kernel's thread strategy has each thread process a row, and you have 2 rows in your feature dataset, but you are launching 12 threads in total:
### Kernel Configuration
threads_per_block = 6
blocks = 2
10 of those threads will be indexing out-of-bounds. For 2 rows you only need 2 threads. We can fix this multiple ways, but I will add a "thread-check" to your kernel code, to prevent out-of-bounds threads from doing anything.
You are also histogramming values that don't fit in your output array. Let's suppose your feature has an input value of 4 at some location. Let's put that value through your arithmetic:
bin_number = np.int32(nbins * (np.float32(4) - np.float32(-4)) / (np.float32(4) - np.float32(-4)))
That is 10 * (4-(-4))/(4-(-4))
So that is a bin index of 10. But you only have 10 bins, so valid bin index can only go up to 9. Which means some of your input values (e.g. 4, 5) will not be recorded in your output.
The following code is your code with the thread-check added and the input range adjusted. I am also printing out the input, the bin each input value was assigned to, and the output bins. It seems to be working correctly.
$ cat t65.py
import numba
import numpy as np
from numba import cuda

np.random.seed(0)
feature = np.random.randint(1, high=4, size=(2,6), dtype=int)
output = np.zeros(20).astype(np.float32).reshape(2,10)
mybin = np.empty_like(feature)

### Kernel Configuration
threads_per_block = 6
blocks = 2

# moving data to device
d_feature = cuda.to_device(feature)
d_output = cuda.to_device(output)
feature_size = d_feature.shape[1]
d_mybin = cuda.to_device(mybin)

@cuda.jit
def row_wise_histogram(feature, output, mybin, n):
    xmin = np.float32(-4.0)
    xmax = np.float32(4.0)
    idx = cuda.grid(1)
    nbins = 10
    bin_width = (xmax - xmin) / nbins
    if idx < output.shape[0]:
        for i in range(n):
            # Each thread will take all the row features to generate the histogram
            input = feature[idx][i]
            bin_number = np.int32(nbins * (np.float32(input) - np.float32(xmin)) / (np.float32(xmax) - np.float32(xmin)))
            mybin[idx][i] = bin_number
            if bin_number >= 0 and bin_number < output.shape[1]:
                cuda.atomic.add(output[idx], bin_number, 1)

row_wise_histogram[blocks, threads_per_block](d_feature, d_output, d_mybin, feature_size)
print(feature)
print(d_mybin.copy_to_host())
print(d_output.copy_to_host())
$ python t65.py
[[1 2 1 2 2 3]
[1 3 1 1 1 3]]
[[6 7 6 7 7 8]
[6 8 6 6 6 8]]
[[ 0. 0. 0. 0. 0. 0. 2. 3. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 4. 0. 2. 0.]]
$ cuda-memcheck python t65.py
========= CUDA-MEMCHECK
[[1 2 1 2 2 3]
[1 3 1 1 1 3]]
[[6 7 6 7 7 8]
[6 8 6 6 6 8]]
[[ 0. 0. 0. 0. 0. 0. 2. 3. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 4. 0. 2. 0.]]
========= ERROR SUMMARY: 0 errors
$
Note that when I restrict the input values to 1..3, then the maximum bin index is 8 (do the math). If I increase the input range to include 4, the maximum bin index goes to 10, which "won't fit". You're correctly handling this case, but it may confuse you, as these values of 4 or 5 won't be recorded in the output. Histogram bin arithmetic is fun. You will need to work out exactly what you want.
Also note that if you run this code, you should see output almost exactly the same as above. If you don't, there is a good chance your numba or cuda install is broken somehow, and the additional run I show with cuda-memcheck will help to discover what may be the issue.
Note that since you are using atomics anyway, there isn't any particular need to assign one thread to each row, you could instead assign one thread to each input point. But that isn't your question; it's a story for another day. Conversely, if you do proceed with one thread per row, each thread doing effectively a private histogram, there is no particular need to use atomics.
Consider the following CUDA kernel, which computes the mean of each row of a 2-D matrix.
using CUDA

function mean!(x, n, out)
    """out = sum(x, dims=2)"""
    row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
    for i = 1:n
        @inbounds out[row_idx] += x[row_idx, i]
    end
    out[row_idx] /= n
    return
end

using Test
nrow, ncol = 1024, 10
x = CuArray{Float64, 2}(rand(nrow, ncol))
y = CuArray{Float64, 1}(zeros(nrow))
@cuda threads=256 blocks=4 mean!(x, size(x)[2], y)
@test isapprox(y, sum(x, dims=2)) # test passed
Also consider the following CUDA kernel
function add!(a, b, c)
    """ c = a .+ b """
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    c[i] = a[i] + b[i]
    return
end

a = CuArray{Float64, 1}(zeros(nrow))
b = CuArray{Float64, 1}(ones(nrow))
c = CuArray{Float64, 1}(zeros(nrow))
@cuda threads=256 blocks=4 add!(a, b, c)
@test all(c .== a .+ b) # test passed
Now, suppose I wanted to write another kernel that uses the intermediate results of mean!(). For example,
function g(x, y)
    """ mean(x, dims=2) + mean(y, dims=2) """
    xrow, xcol = size(x)
    yrow, ycol = size(y)
    mean1 = CuArray{Float64, 1}(undef, xrow)
    @cuda threads=256 blocks=4 mean!(x, xcol, mean1)
    mean2 = CuArray{Float64, 1}(zeros(yrow))
    @cuda threads=256 blocks=4 mean!(y, ycol, mean2)
    out = CuArray{Float64, 1}(zeros(yrow))
    @cuda threads=256 blocks=4 add!(mean1, mean2, out)
    return out
end
(Of course, g() isn't technically a kernel since it returns something.)
My question is whether g() is "correct". In particular, is g() wasting time by transferring data between the GPU/CPU?
For example, if my understanding is correct, one way g() could be optimized is by initializing mean2 the same way we initialize mean1. This is because when constructing mean2, we're actually first creating zeros(yrow) on the CPU, then passing this to the CuArray constructor to be copied to the GPU. In contrast, mean1 is constructed but uninitialized (due to the undef) and therefore avoids this extra transfer.
To summarize, how do I save/use intermediate kernel results while avoiding data transfers between the CPU/GPU as much as possible?
You can generate arrays or vectors of zeros directly on the GPU!
Try:
CUDA.zeros(Float64, nrow)
Some benchmarks:
julia> @btime CUDA.zeros(Float64, 1000,1000)
12.600 μs (26 allocations: 1.22 KiB)
1000×1000 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
...
julia> @btime CuArray(zeros(1000,1000))
3.551 ms (8 allocations: 7.63 MiB)
1000×1000 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
...
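Applied to the question's g(), a version that keeps every intermediate array on the GPU could look like the sketch below (it assumes the mean! and add! kernels from the question; g_gpu is just an illustrative name). Since mean! accumulates with out[row_idx] += ..., its output buffers are zero-filled with CUDA.zeros rather than left undef.
using CUDA

# Sketch: same structure as the question's g(), but every intermediate array is
# allocated directly on the GPU, so no host array is built and copied over.
function g_gpu(x, y)
    xrow, xcol = size(x)
    yrow, ycol = size(y)
    mean1 = CUDA.zeros(Float64, xrow)   # zero-filled on the device
    mean2 = CUDA.zeros(Float64, yrow)
    out   = CUDA.zeros(Float64, yrow)
    @cuda threads=256 blocks=4 mean!(x, xcol, mean1)
    @cuda threads=256 blocks=4 mean!(y, ycol, mean2)
    @cuda threads=256 blocks=4 add!(mean1, mean2, out)
    return out
end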
Suppose I have the following function
function foo(x::Float64, a::Float64)
    if do_some_intense_stuff(a)
        return bar(x)
    else
        return baz(x)
    end
end
Let's assume that at runtime a will be a constant. But x will not. I have to run foo() many times, so I would like it to run as fast as possible, which means running do_some_intense_stuff as rarely as possible. Because a is a constant, at runtime we know which branch the if statement should take.
So ideally, I'd do the following:
foowrapper(x) = foo(x,a)
Y = [foowrapper(x) for x in lots_of_x]
and it would be a lot faster than
Y = [foo(x,a) for x in lots_of_x]
But that's not what happens. I don't blame the compiler for not optimizing my code since I didn't explicitly tell it that foo() will only ever be called with the constant value of a. But is there a good way for me to do that?
Of course, I can always get rid of foo and just write that if statement in the global scope, but that seems inelegant because the rest of the program does not care about the output of do_some_intense_stuff()
Update:
To benchmark the solution suggested below, I implemented the functions as follows. I also modified the declaration of foo() to make a an integer, for obvious reasons:
function bar(x::Float64)
    return 2 * x
    #println("Ran bar for value ",x)
end

function baz(x::Float64)
    return -2 * x
    #println("Ran baz for value ",x)
end

@memoize function do_some_intense_stuff(a::Int64)
    return isprime(a + 32614262352646106013967035018546810367130464316134634614)
end
And defined lots_of_x = 1.0:1.0:1000.0.
Here is the output of @benchmark Y = [foo(x,a) for x in lots_of_x] with and without @memoize:
Without:
BenchmarkTools.Trial:
memory estimate: 109.50 KiB
allocs estimate: 5001
--------------
minimum time: 6.858 ms (0.00% GC)
median time: 6.924 ms (0.00% GC)
mean time: 7.067 ms (0.77% GC)
maximum time: 78.747 ms (49.00% GC)
--------------
samples: 707
evals/sample: 1
With:
BenchmarkTools.Trial:
memory estimate: 39.19 KiB
allocs estimate: 2001
--------------
minimum time: 97.500 μs (0.00% GC)
median time: 98.801 μs (0.00% GC)
mean time: 108.897 μs (1.37% GC)
maximum time: 2.099 ms (93.76% GC)
--------------
samples: 10000
Perhaps caching the result of your call to do_some_intense_stuff(a) will help, e.g. using Memoize.jl.
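A minimal runnable sketch of that suggestion, reusing the question's function names and assuming Memoize.jl is installed (and Primes.jl for isprime on recent Julia versions):
using Memoize, Primes

bar(x::Float64) = 2 * x
baz(x::Float64) = -2 * x

# The expensive predicate runs once per distinct `a`; later calls hit the cache.
@memoize function do_some_intense_stuff(a::Int64)
    return isprime(a + 32614262352646106013967035018546810367130464316134634614)
end

foo(x::Float64, a::Int64) = do_some_intense_stuff(a) ? bar(x) : baz(x)

a = 10                      # any runtime-constant value
lots_of_x = 1.0:1.0:1000.0
Y = [foo(x, a) for x in lots_of_x]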
Let i = [1 2] and j = [3 5]. Now in Octave:
arrayfun(@(x,y) x+y,i,j)
we get [4 7]. But I want to apply the function on the combinations of i vs. j to get [i(1)+j(1) i(1)+j(2) i(2)+j(1) i(2)+j(2)]=[4 6 5 7].
How do I accomplish this? I know I can go with for-loops, but I want vectorized code because it's faster.
In Octave, for finding summations between two vectors, you can use a truly vectorized approach with broadcasting like so -
out = reshape(ii(:).' + jj(:),[],1)
Here's a runtime test on ideone for the input vectors of size 1 x 100 each -
-------------------- With FOR-LOOP
Elapsed time is 0.148444 seconds.
-------------------- With BROADCASTING
Elapsed time is 0.00038299 seconds.
If you want to keep it generic to accommodate operations other than just summations, you can use anonymous functions like so -
func1 = @(I,J) I+J;
out = reshape(func1(ii,jj.'),1,[])
In MATLAB, you could accomplish the same with two bsxfun alternatives as listed next.
I. bsxfun with Anonymous Function -
func1 = @(I,J) I+J;
out = reshape(bsxfun(func1,ii(:).',jj(:)),1,[]);
II. bsxfun with Built-in @plus -
out = reshape(bsxfun(@plus,ii(:).',jj(:)),1,[]);
With the input vectors of size 1 x 10000 each, the runtimes at my end were -
-------------------- With FOR-LOOP
Elapsed time is 1.193941 seconds.
-------------------- With BSXFUN ANONYMOUS
Elapsed time is 0.252825 seconds.
-------------------- With BSXFUN BUILTIN
Elapsed time is 0.215066 seconds.
First, your first example is not the best because the most efficient way to accomplish what you're doing with arrayfun would be to vectorize:
a = [1 2];
b = [3 5];
out = a+b
Second, in Matlab at least, arrayfun is not necessarily faster than a simple for loop. arrayfun is mainly a convenience (especially for its more advanced options). Try this simple timing example yourself:
a = 1:1e5;
b = a+1;

y = arrayfun(@(x,y)x+y,a,b); % Warm up
tic
y = arrayfun(@(x,y)x+y,a,b);
toc

y = zeros(1,numel(a));
for k = 1:numel(a)
    y(k) = a(k)+b(k); % Warm up
end
tic
y = zeros(1,numel(a));
for k = 1:numel(a)
    y(k) = a(k)+b(k);
end
toc
In Matlab R2015a, the for loop method is over 70 times faster when run from the Command Window and over 260 times faster when run from an M-file function. Octave may be different, but you should experiment.
Finally, you can accomplish what you want using meshgrid:
a = [1 2];
b = [3 5];
[x,y] = meshgrid(a,b);
out = x(:).'+y(:).'
which returns [4 6 5 7] as in your question. You can also use ndgrid to get output in a different order.
I've compared the Scala version
(BigInt(1) to BigInt(50000)).reduce(_ * _)
to the Python version
reduce(lambda x,y: x*y, range(1,50000))
and it turns out that the Scala version took about 10 times longer than the Python version.
I'm guessing a big difference is that Python can use its native long type instead of creating new BigInt objects for each number. But is there a workaround in Scala?
The fact that your Scala code creates 50,000 BigInt objects is unlikely to be making much of a difference here. A bigger issue is the multiplication algorithm—Python's long uses Karatsuba multiplication and Java's BigInteger (which BigInt just wraps) doesn't.
The easiest workaround is probably to switch to a better arbitrary precision math library, like JScience's:
import org.jscience.mathematics.number.LargeInteger
(1 to 50000).foldLeft(LargeInteger.ONE)(_ times _)
This is faster than the Python solution on my machine.
Update: I've written some quick benchmarking code using Caliper in response to Luigi Plingi's answer, which gives the following results on my (quad core) machine:
benchmark ms linear runtime
BigIntFoldLeft 4774 ==============================
BigIntFold 4739 =============================
BigIntReduce 4769 =============================
BigIntFoldLeftPar 4642 =============================
BigIntFoldPar 500 ===
BigIntReducePar 499 ===
LargeIntegerFoldLeft 3042 ===================
LargeIntegerFold 3003 ==================
LargeIntegerReduce 3018 ==================
LargeIntegerFoldLeftPar 3038 ===================
LargeIntegerFoldPar 246 =
LargeIntegerReducePar 260 =
I don't see the difference between reduce and fold that he does, but the moral is clear: if you can use Scala 2.9's parallel collections, they'll give you a huge improvement, but switching to LargeInteger helps as well.
Python on my machine:
import time

def func():
    start = time.clock()
    reduce(lambda x,y: x*y, range(1,50000))
    end = time.clock()
    t = (end-start) * 1000
    print t
gives 1219 ms
Scala:
def timed[T](f: => T) = {
  val t0 = System.currentTimeMillis
  val r = f
  val t1 = System.currentTimeMillis
  println("Took: "+(t1 - t0)+" ms")
  r
}
timed { (BigInt(1) to BigInt(50000)).reduce(_ * _) }
4251 ms
timed { (BigInt(1) to BigInt(50000)).fold(BigInt(1))(_ * _) }
4224 ms
timed { (BigInt(1) to BigInt(50000)).par.reduce(_ * _) }
2083 ms
timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }
689 ms
// using org.jscience.mathematics.number.LargeInteger from Travis's answer
timed { val a = (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _) }
3327 ms
timed { val a = (1 to 50000).map(LargeInteger.valueOf(_)).par.fold(
LargeInteger.ONE)(_ times _) }
361 ms
These 689 ms and 361 ms timings were after a few warmup runs. They both started at about 1000 ms, but seem to warm up by different amounts. The parallel collections seem to warm up significantly more than the non-parallel ones: the non-parallel operations did not improve significantly from their first runs.
The .par (meaning, use parallel collections) seemed to speed up fold more than reduce. I only have 2 cores, but a greater number of cores should see a bigger performance gain.
So, experimentally, the way to optimize this function is
a) Use fold rather than reduce
b) Use parallel collections
Update:
Inspired by the observation that breaking the calculation down into smaller chunks speeds things up, I managed to get the following to run in 215 ms on my machine, which is a 40% improvement on the standard parallelized algorithm. (Using BigInt, it takes 615 ms.) Also, it doesn't use parallel collections, but somehow uses 90% CPU (unlike with BigInt).
import org.jscience.mathematics.number.LargeInteger

def fact(n: Int) = {
  def loop(seq: Seq[LargeInteger]): LargeInteger = seq.length match {
    case 0 => throw new IllegalArgumentException
    case 1 => seq.head
    case _ => loop {
      val (a, b) = seq.splitAt(seq.length / 2)
      a.zipAll(b, LargeInteger.ONE, LargeInteger.ONE).map(i => i._1 times i._2)
    }
  }
  loop((1 to n).map(LargeInteger.valueOf(_)).toIndexedSeq)
}
Another trick here could be to try both reduceLeft and reduceRight to see what is fastest. On your example I get a much faster execution of reduceRight:
scala> timed { (BigInt(1) to BigInt(50000)).reduceLeft(_ * _) }
Took: 4605 ms
scala> timed { (BigInt(1) to BigInt(50000)).reduceRight(_ * _) }
Took: 2004 ms
Same difference between foldLeft and foldRight. Guess it matters what side of the tree you start reducing from :)
The most efficient way to calculate a factorial in Scala is to use a divide-and-conquer strategy:
def fact(n: Int): BigInt = rangeProduct(1, n)

private def rangeProduct(n1: Long, n2: Long): BigInt = n2 - n1 match {
  case 0 => BigInt(n1)
  case 1 => BigInt(n1 * n2)
  case 2 => BigInt(n1 * (n1 + 1)) * n2
  case 3 => BigInt(n1 * (n1 + 1)) * ((n2 - 1) * n2)
  case _ =>
    val nm = (n1 + n2) >> 1
    rangeProduct(n1, nm) * rangeProduct(nm + 1, n2)
}
Also, to get more speed, use the latest version of the JDK and the following JVM options:
-server -XX:+TieredCompilation
Below are results for an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz), 12 GB DDR3-1333 RAM, Windows 7 SP1, Oracle JDK 1.8.0_25-b18 64-bit:
(BigInt(1) to BigInt(100000)).product took: 3,806 ms with 26.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduce(_ * _) took: 3,728 ms with 25.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceLeft(_ * _) took: 3,510 ms with 25.1 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceRight(_ * _) took: 4,056 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).fold(BigInt(1))(_ * _) took: 3,697 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.product took: 406 ms with 66.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduce(_ * _) took: 296 ms with 71.1 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceLeft(_ * _) took: 3,495 ms with 25.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceRight(_ * _) took: 3,900 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.fold(BigInt(1))(_ * _) took: 327 ms with 56.1 % of CPU usage
fact(100000) took: 203 ms with 28.3 % of CPU usage
BTW, to improve the efficiency of factorial calculation for numbers greater than 20000, use the following implementation of the Schönhage-Strassen algorithm, or wait until it is merged into JDK 9 and Scala is able to use it.