I want to execute a CUDA kernel in Python using the NumbaPro API. I have this code:
import math
import numpy
from timeit import default_timer as timer
from numbapro import jit, cuda, int32, float32
from matplotlib import pyplot

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, float32, float32, int32)')
def calculate_velocity_field(X, Y, u_source, v_source, x_source, y_source, strength_source, N):
    start = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    end = N
    stride = cuda.gridDim.x * cuda.blockDim.x
    for i in range(start, end, stride):
        u_source[i] = strength_source/(2*math.pi) * (X[i]-x_source)/((X[i]-x_source)**2 + (Y[i]-y_source)**2)
        v_source[i] = strength_source/(2*math.pi) * (Y[i]-y_source)/((X[i]-x_source)**2 + (Y[i]-y_source)**2)
def main():
    N = 200                                # number of points in each direction
    x_start, x_end = -4.0, 4.0             # boundaries in the x-direction
    y_start, y_end = -2.0, 2.0             # boundaries in the y-direction
    x = numpy.linspace(x_start, x_end, N)  # creates a 1D array with the x-coordinates
    y = numpy.linspace(y_start, y_end, N)  # creates a 1D array with the y-coordinates
    X, Y = numpy.meshgrid(x, y)            # generates a mesh grid

    strength_source = 5.0                  # source strength
    x_source, y_source = -1.0, 0.0         # location of the source

    start = timer()

    # calculate grid dimensions
    blockSize = 1024
    gridSize = int(math.ceil(float(N)/blockSize))

    # transfer memory to device
    X_d = cuda.to_device(X)
    Y_d = cuda.to_device(Y)
    u_source_d = cuda.device_array_like(X)
    v_source_d = cuda.device_array_like(Y)

    # launch kernel
    calculate_velocity_field[gridSize, blockSize](X_d, Y_d, u_source_d, v_source_d,
                                                  x_source, y_source, strength_source, N)

    # transfer memory to host
    u_source = numpy.empty_like(X)
    v_source = numpy.empty_like(Y)
    u_source_d.copy_to_host(u_source)
    v_source_d.copy_to_host(v_source)

    elapsed_time = timer() - start
    print("Exec time with GPU %f s" % elapsed_time)

if __name__ == "__main__":
    main()
It gives me this error:
NvvmError Traceback (most recent call last)
<ipython-input-17-85e4a6e56a14> in <module>()
----> 1 @cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, float32, float32, int32)')
2 def calculate_velocity_field(X, Y, u_source, v_source, x_source, y_source, strength_source, N):
3 start = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
4 end = N
5 stride = cuda.gridDim.x * cuda.blockDim.x
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/decorators.py in kernel_jit(func)
89 # Force compilation for the current context
90 if bind:
---> 91 kernel.bind()
92
93 return kernel
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in bind(self)
319 Force binding to current CUDA context
320 """
--> 321 self._func.get()
322
323 @property
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in get(self)
254 cufunc = self.cache.get(device.id)
255 if cufunc is None:
--> 256 ptx = self.ptx.get()
257
258 # Link
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in get(self)
226 arch = nvvm.get_arch_option(*cc)
227 ptx = nvvm.llvm_to_ptx(self.llvmir, opt=3, arch=arch,
--> 228 **self._extra_options)
229 self.cache[cc] = ptx
230 if config.DUMP_ASSEMBLY:
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in llvm_to_ptx(llvmir, **opts)
420 cu.add_module(llvmir.encode('utf8'))
421 cu.add_module(libdevice.get())
--> 422 ptx = cu.compile(**opts)
423 return ptx
424
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in compile(self, **options)
211 for x in opts])
212 err = self.driver.nvvmCompileProgram(self._handle, len(opts), c_opts)
--> 213 self._try_error(err, 'Failed to compile\n')
214
215 # get result
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in _try_error(self, err, msg)
229
230 def _try_error(self, err, msg):
--> 231 self.driver.check_error(err, "%s\n%s" % (msg, self.get_log()))
232
233 def get_log(self):
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in check_error(self, error, msg, exit)
118 sys.exit(1)
119 else:
--> 120 raise exc
121
122
NvvmError: Failed to compile
libnvvm : error: -arch=compute_52 is an unsupported option
NVVM_ERROR_INVALID_OPTION
I tried other NumbaPro examples and the same error occurs.
I don't know whether it's a bug in NumbaPro that doesn't support compute capability 5.2, or a problem with NVIDIA's NVVM... suggestions?
In theory it should be supported, but I don't know what is happening.
I'm using Linux with CUDA 7.0 and driver version 346.29.
Finally I found a solution.
Solution 1:
conda update cudatoolkit
Fetching package metadata: ....
# All requested packages already installed.
# packages in environment at ~/.anaconda3:
#
cudatoolkit 6.0 p0
It looks like conda update keeps the CUDA toolkit at 6.0 rather than updating it to CUDA 7.0. A second approach works:
Solution 2:
conda install -c numba cudatoolkit
Fetching package metadata: ......
Solving package specifications: .
Package plan for installation in environment ~/.anaconda3:
The following packages will be downloaded:
package | build
---------------------------|-----------------
cudatoolkit-7.0 | 1 190.8 MB
The following packages will be UPDATED:
cudatoolkit: 6.0-p0 --> 7.0-1
Proceed ([y]/n)? y
Before:
In [4]: check_cuda()
------------------------------libraries detection-------------------------------
Finding cublas
located at ~/.anaconda3/lib/libcublas.so.6.0.37
trying to open library... ok
Finding cusparse
located at ~/.anaconda3/lib/libcusparse.so.6.0.37
trying to open library... ok
Finding cufft
located at ~/.anaconda3/lib/libcufft.so.6.0.37
trying to open library... ok
Finding curand
located at ~/.anaconda3/lib/libcurand.so.6.0.37
trying to open library... ok
Finding nvvm
located at ~/.anaconda3/lib/libnvvm.so.2.0.0
trying to open library... ok
finding libdevice for compute_20... ok
finding libdevice for compute_30... ok
finding libdevice for compute_35... ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0 b'GeForce GTX 970' [SUPPORTED]
compute capability: 5.2
pci device id: 0
pci bus id: 7
Summary:
1/1 devices are supported
PASSED
Out[4]: True
After:
In [6]: check_cuda()
------------------------------libraries detection-------------------------------
Finding cublas
located at ~/.anaconda3/lib/libcublas.so.7.0.28
trying to open library... ok
Finding cusparse
located at ~/.anaconda3/lib/libcusparse.so.7.0.28
trying to open library... ok
Finding cufft
located at ~/.anaconda3/lib/libcufft.so.7.0.35
trying to open library... ok
Finding curand
located at ~/.anaconda3/lib/libcurand.so.7.0.28
trying to open library... ok
Finding nvvm
located at ~/.anaconda3/lib/libnvvm.so.3.0.0
trying to open library... ok
finding libdevice for compute_20... ok
finding libdevice for compute_30... ok
finding libdevice for compute_35... ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0 b'GeForce GTX 970' [SUPPORTED]
compute capability: 5.2
pci device id: 0
pci bus id: 7
Summary:
1/1 devices are supported
PASSED
Out[6]: True
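A side note for readers on current software: NumbaPro was later deprecated and its CUDA support merged into Numba itself, where the closest equivalent to check_cuda() is numba.cuda.detect(). A minimal sketch, assuming a recent Numba install:
# rough equivalent of numbapro's check_cuda() in present-day Numba
from numba import cuda

cuda.detect()  # prints the CUDA devices found and whether each is supported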
Hello, I'm a real beginner with Cython and C-based languages.
I have a problem getting the square root of a vector.
I have a vector (each value is of double type):
x = [1, 4, 9]
and I want to get:
y = [1, 2, 3]
How can I get this vector?
A solution I thought of is:
cdef floating[::1] y = x
for i in range(length):
    y[i] = x[i] ** 0.5
But this way is too slow. I want to accelerate it.
Can I use the sqrt or square functions from libc.math in this case?
Edit:
If I want to take the 1/3 root instead (like [1, 8, 27] -> [1, 2, 3]), what function should I use instead of sqrt?
Quick win
First you should check whether your function is already implemented in Numpy. If so, it will probably be a very fast (C/C++) implementation.
This is the case for your function:
import numpy as np
x = np.array([1, 4, 9])
y = np.sqrt(x)
#> array([1., 2., 3.])
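The edit also asks about the 1/3 (cube) root; the same quick win applies there. A sketch, assuming a reasonably recent NumPy (np.cbrt was added, I believe, in 1.10):
import numpy as np

x = np.array([1.0, 8.0, 27.0])
y = np.cbrt(x)   # element-wise cube root, computed in C
#> array([1., 2., 3.])
Inside a Cython loop, libc.math should likewise let you cimport cbrt in place of sqrt.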
With Numpy arrays
Alternatively (following @joni's comment), you can use NumPy arrays just for the input/output and compute the function element-wise in C:
cimport numpy as cnp
import numpy as np
from libc.math cimport sqrt

cpdef cnp.ndarray[double, ndim=1] cy_sqrt_np(cnp.ndarray[double, ndim=1] x):
    cdef Py_ssize_t i, l = x.shape[0]
    cdef cnp.ndarray[double, ndim=1] y = np.empty(l)
    for i in range(l):
        y[i] = sqrt(x[i])
    return y
With C++ vectors
Lastly, here is a possible implementation with C++ vectors and automatic conversion from/to Python lists:
from libc.math cimport sqrt
from libcpp.vector cimport vector

cpdef vector[double] cy_sqrt_vec(vector[double] x):
    cdef Py_ssize_t i, l = x.size()
    cdef vector[double] y
    y.reserve(l)
    for i in range(l):
        y.push_back(sqrt(x[i]))
    return y
Some things to keep in mind in this case and the previous one:
We initialize the y vector to be empty, and then allocate space for it with reserve(). According to SO this seems to be a good option.
We use a typed i in the for loop, and use push_back to assign new values.
We use sqrt from libc.math to avoid using Python code inside the loop.
We type the input of the function as vector[double]. This automatically adds convenient type conversions from other Python types (e.g., a list of ints), as the sketch after this list shows.
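As an illustration of that last point, a minimal usage sketch (cysqrt is a hypothetical module name, standing in for wherever you compiled the snippet above):
# cysqrt is a hypothetical module name for the compiled Cython snippet above
from cysqrt import cy_sqrt_vec

y = cy_sqrt_vec([1, 4, 9])   # list of ints -> vector[double] -> list of floats
print(y)                     # [1.0, 2.0, 3.0]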
Time comparison
We define a random input x to avoid cached results polluting our measurements:
%%timeit -n 10000 -r 7 x = gen_x()
y = np.sqrt(x)
#> executed in 177ms, finished 16:16:57 2022-04-19
#> 2.3 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = gen_x()
y = x**.5
#> executed in 194ms, finished 16:16:51 2022-04-19
#> 2.46 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = gen_x()
y = cy_sqrt(x)
#> executed in 359ms, finished 16:17:02 2022-04-19
#> 4.9 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = list(gen_x())
y = cy_sqrt_vec(x)
#> executed in 2.85s, finished 16:17:11 2022-04-19
#> 40.4 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As expected, the np.sqrt version wins. Besides, the version with C++ vectors looks comparatively slow.
I want to calculate the relative error between two arrays. The pure numpy code is:
# a1, a2 are the two array
np.abs( 1-a2/a1 ).max()
How can I use numba.cuda to accelerate the above code?
My attempt:
@cuda.jit
def calculate(a1, a2):
    start = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    grid = cuda.gridDim.x*cuda.blockDim.x
    for id in range(start, a1.size, grid):
        r = abs(1-a2[id]/a1[id])

ca1 = cuda.to_device(a1)
ca2 = cuda.to_device(a2)
But how can I compare the r values between different threads?
One possible method is to write your own shared-memory parallel reduction.
As indicated in the comments, another possible method is to use numba's built-in reduce decorator.
Here is an example demonstrating both:
$ cat t79.py
from numba import cuda, float32, vectorize
import numpy as np
from numpy import random

# values of 0..10 are legal here
TPBP2 = 9
TPB = 2**TPBP2
TPBH = TPB//2
ds = 4096

# method 1: standard cuda parallel max-finding reduction
@cuda.jit
def max_error(a1, a2, err):
    s = cuda.shared.array(shape=(TPB), dtype=float32)
    x = cuda.grid(1)
    st = cuda.gridsize(1)
    tx = cuda.threadIdx.x
    s[tx] = 0
    cuda.syncthreads()
    for i in range(x, a1.size, st):
        s[tx] = max(s[tx], abs(1-a2[i]/a1[i]))
    mid = TPBH
    for i in range(TPBP2):
        cuda.syncthreads()
        if tx < mid:
            s[tx] = max(s[tx], s[tx+mid])
        mid >>= 1
    if tx == 0:
        err[cuda.blockIdx.x] = s[0]

# data
# for best performance we should choose blocks based on GPU occupancy,
# but for demonstration, since we don't know the GPU:
blocks = (ds+TPB-1)//TPB
a1 = np.random.rand(ds).astype(np.float32)
a1 += 1
a2 = np.random.rand(ds).astype(np.float32)
err = np.zeros(blocks).astype(np.float32)

# Start the kernel
max_error[blocks, TPB](a1, a2, err)
# we could perform another stage of GPU reduction here, but for simplicity:
my_err = np.max(err)
print(my_err)

# method 2: using numba features
@vectorize(['float32(float32,float32)'], target='cuda')
def my_error(a1, a2):
    return abs(1-a2/a1)

@cuda.reduce
def max_reduce(a, b):
    return max(a, b)

r = my_error(a1, a2)
my_err = max_reduce(r)
print(my_err)
$ python t79.py
0.9999707
0.9999707
$
I am trying to test the effectiveness of the Python Numba module's @vectorize decorator for speeding up a code snippet relevant to my actual code. I'm utilizing a code snippet provided in CUDAcast #10, available here and shown below:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize

@vectorize(["float32(float32, float32)"], target='cpu')
def VectorAdd(a, b):
    return a + b

def main():
    N = 32000000
    A = np.ones(N, dtype=np.float32)
    B = np.ones(N, dtype=np.float32)
    C = np.zeros(N, dtype=np.float32)

    start = timer()
    C = VectorAdd(A, B)
    vectoradd_time = timer() - start

    print("C[:5] = " + str(C[:5]))
    print("C[-5:] = " + str(C[-5:]))
    print("VectorAdd took %f seconds" % vectoradd_time)

if __name__ == '__main__':
    main()
In the demo in the CUDAcast, the demonstrator gets a 100x speedup by sending the large array equation to the GPU via the @vectorize decorator. However, when I set the @vectorize target to the GPU:
@vectorize(["float32(float32, float32)"], target='cuda')
... the result is 3-4 times slower. With target='cpu' my runtime is 0.048 seconds; with target='cuda' it is 0.22 seconds. I'm using a DELL Precision laptop with an Intel Core i7-4710MQ processor and an NVIDIA Quadro K2100M GPU. The output of nvprof (NVIDIA's profiler tool) indicates that the majority of the time is spent in memory handling (expected), but even the function evaluation takes longer on the GPU than the whole process did on the CPU. Obviously this isn't the result I was hoping for, but is it due to some error on my part, or is this reasonable based on my hardware and code?
This question is also interesting to me.
I've tried your code and got similar results.
To investigate this issue I wrote a CUDA kernel using cuda.jit and added it to your code:
import numpy as np
from timeit import default_timer as timer
from numba import vectorize, cuda

N = 16*50000  # 32000000
blockdim = 16, 1
griddim = int(N/blockdim[0]), 1

@cuda.jit("void(float32[:], float32[:])")
def VectorAdd_GPU(a, b):
    i = cuda.grid(1)
    if i < N:
        a[i] += b[i]

@vectorize("float32(float32, float32)", target='cpu')
def VectorAdd(a, b):
    return a + b

A = np.ones(N, dtype=np.float32)
B = np.ones(N, dtype=np.float32)
C = np.zeros(N, dtype=np.float32)

start = timer()
C = VectorAdd(A, B)
vectoradd_time = timer() - start
print("VectorAdd took %f seconds" % vectoradd_time)

start = timer()
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
VectorAdd_GPU[griddim, blockdim](d_A, d_B)
C = d_A.copy_to_host()
vectoradd_time = timer() - start
print("VectorAdd_GPU took %f seconds" % vectoradd_time)

print("C[:5] = " + str(C[:5]))
print("C[-5:] = " + str(C[-5:]))
In this 'benchmark' I also take into account the time for copying arrays from host to device and from device to host. Even then, the GPU function is slower than the CPU one.
For the case above:
CPU - 0.0033 s;
GPU - 0.0096 s;
Vectorize (target='cuda') - 0.15 s (on my PC).
If the copying time is not counted:
GPU - 0.000245 s
So, what I have learned: (1) Copying from host to device and from device to host is time-consuming; that is obvious and well known. (2) I do not know the reason, but @vectorize can significantly slow down the calculations on the GPU. (3) It is better to use self-written kernels (and, of course, to minimize memory copying).
By the way, I have also tested @cuda.jit by solving the heat-conduction equation with an explicit finite-difference scheme, and found that in that case the Python program's execution time is comparable to a C program's, providing about a 100x speedup. That is because, fortunately, in that case you can run many iterations without exchanging data between host and device.
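To make that concrete, here is a minimal sketch of the pattern (my own illustration with a placeholder stencil update, not the actual heat-conduction code): copy the data to the device once, launch many kernels on device arrays, and copy the result back once.
import numpy as np
from numba import cuda

@cuda.jit
def step(a_in, a_out):
    i = cuda.grid(1)
    n = a_in.shape[0]
    if 0 < i < n - 1:
        # placeholder stencil update, standing in for a finite-difference step
        a_out[i] = 0.5 * a_in[i] + 0.25 * (a_in[i - 1] + a_in[i + 1])
    elif i < n:
        a_out[i] = a_in[i]  # keep the boundary values fixed

a = np.random.rand(1 << 20).astype(np.float32)
d_a = cuda.to_device(a)                 # one host-to-device copy
d_b = cuda.device_array_like(a)
threads = 128
blocks = (a.size + threads - 1) // threads
for _ in range(1000):                   # many iterations, no host round-trips
    step[blocks, threads](d_a, d_b)
    d_a, d_b = d_b, d_a                 # ping-pong the device buffers
result = d_a.copy_to_host()             # one device-to-host copy at the end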
UPD. Software and hardware used: Win7 64-bit, CPU: Intel Core2 Quad 3GHz, GPU: NVIDIA GeForce GTX 580.
Is it possible to increase the buffer used in popen2 between Octave and the subprocess? It looks like the buffer is limited to approximately 66560 bytes. This snippet shows the problem:
## This works with s = 65
## but not with s = 66
s = 66;
[in, out, pid] = popen2 ("dd", {"if=/dev/urandom",
                                "bs=1K",
                                sprintf("count=%i", s)});
pause (1);
[vt, cnt] = fread(out);
assert (cnt, s * 1024);
waitpid (pid);
fclose (in);
fclose (out);
returns:
66+0 records in
66+0 records out
67584 bytes (68 kB) copied, 0.999523 s, 67.6 kB/s
error: ASSERT errors for: assert (cnt,s * 1024)
Location  |  Observed  |  Expected  |  Reason
   ()         66560        67584       Abs err 1024 exceeds tol 0
I've compared the Scala version
(BigInt(1) to BigInt(50000)).reduce(_ * _)
to the python version
reduce(lambda x,y: x*y, range(1,50000))
and it turns out that the Scala version took about 10 times longer than the Python version.
I'm guessing a big difference is that Python can use its native long type instead of creating a new BigInt object for each number. But is there a workaround in Scala?
The fact that your Scala code creates 50,000 BigInt objects is unlikely to be making much of a difference here. A bigger issue is the multiplication algorithm: Python's long uses Karatsuba multiplication, and Java's BigInteger (which BigInt just wraps) doesn't.
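For readers unfamiliar with the distinction, here is a toy Python sketch of Karatsuba's idea: splitting each operand in half turns one big multiplication into three half-sized ones instead of four (real implementations are far more careful about base cases and digit handling).
def karatsuba(x, y):
    # fall back to hardware multiplication for small operands
    if x < 10 or y < 10:
        return x * y
    # split both numbers around the midpoint of the larger one
    m = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> m, x & ((1 << m) - 1)
    yh, yl = y >> m, y & ((1 << m) - 1)
    a = karatsuba(xh, yh)                    # product of high parts
    b = karatsuba(xl, yl)                    # product of low parts
    c = karatsuba(xh + xl, yh + yl) - a - b  # both cross terms via one multiply
    return (a << (2 * m)) + (c << m) + b

print(karatsuba(123456789, 987654321) == 123456789 * 987654321)  # True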
The easiest workaround is probably to switch to a better arbitrary precision math library, like JScience's:
import org.jscience.mathematics.number.LargeInteger
(1 to 50000).foldLeft(LargeInteger.ONE)(_ times _)
This is faster than the Python solution on my machine.
Update: I've written some quick benchmarking code using Caliper in response to Luigi Plingi's answer, which gives the following results on my (quad core) machine:
benchmark ms linear runtime
BigIntFoldLeft 4774 ==============================
BigIntFold 4739 =============================
BigIntReduce 4769 =============================
BigIntFoldLeftPar 4642 =============================
BigIntFoldPar 500 ===
BigIntReducePar 499 ===
LargeIntegerFoldLeft 3042 ===================
LargeIntegerFold 3003 ==================
LargeIntegerReduce 3018 ==================
LargeIntegerFoldLeftPar 3038 ===================
LargeIntegerFoldPar 246 =
LargeIntegerReducePar 260 =
I don't see the difference between reduce and fold that he does, but the moral is clear: if you can use Scala 2.9's parallel collections, they'll give you a huge improvement, but switching to LargeInteger helps as well.
Python on my machine:
import time

def func():
    start = time.clock()
    reduce(lambda x, y: x*y, range(1, 50000))
    end = time.clock()
    t = (end - start) * 1000
    print t
gives 1219 ms
Scala:
def timed[T](f: => T) = {
  val t0 = System.currentTimeMillis
  val r = f
  val t1 = System.currentTimeMillis
  println("Took: " + (t1 - t0) + " ms")
  r
}
timed { (BigInt(1) to BigInt(50000)).reduce(_ * _) }
4251 ms
timed { (BigInt(1) to BigInt(50000)).fold(BigInt(1))(_ * _) }
4224 ms
timed { (BigInt(1) to BigInt(50000)).par.reduce(_ * _) }
2083 ms
timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }
689 ms
// using org.jscience.mathematics.number.LargeInteger from Travis's answer
timed { val a = (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _) }
3327 ms
timed { val a = (1 to 50000).map(LargeInteger.valueOf(_)).par.fold(
LargeInteger.ONE)(_ times _) }
361 ms
The 689 ms and 361 ms results were after a few warmup runs. Both started at about 1000 ms, but they seem to warm up by different amounts: the parallel collections warm up significantly more than the non-parallel ones, whose times did not drop much from their first runs.
The .par (meaning, use parallel collections) seemed to speed up fold more than reduce. I only have 2 cores, but a greater number of cores should see a bigger performance gain.
So, experimentally, the way to optimize this function is:
a) Use fold rather than reduce
b) Use parallel collections
Update:
Inspired by the observation that breaking the calculation down into smaller chunks speeds things up, I managed to get the following to run in 215 ms on my machine, which is a 40% improvement on the standard parallelized algorithm. (Using BigInt, it takes 615 ms.) Also, it doesn't use parallel collections, but somehow uses 90% CPU (unlike BigInt).
import org.jscience.mathematics.number.LargeInteger

def fact(n: Int) = {
  def loop(seq: Seq[LargeInteger]): LargeInteger = seq.length match {
    case 0 => throw new IllegalArgumentException
    case 1 => seq.head
    case _ => loop {
      val (a, b) = seq.splitAt(seq.length / 2)
      a.zipAll(b, LargeInteger.ONE, LargeInteger.ONE).map(i => i._1 times i._2)
    }
  }
  loop((1 to n).map(LargeInteger.valueOf(_)).toIndexedSeq)
}
Another trick here could be to try both reduceLeft and reduceRight to see what is fastest. On your example I get a much faster execution of reduceRight:
scala> timed { (BigInt(1) to BigInt(50000)).reduceLeft(_ * _) }
Took: 4605 ms
scala> timed { (BigInt(1) to BigInt(50000)).reduceRight(_ * _) }
Took: 2004 ms
Same difference between foldLeft and foldRight. I guess it matters which side of the tree you start reducing from :)
The most efficient way to calculate a factorial in Scala is to use a divide-and-conquer strategy:
def fact(n: Int): BigInt = rangeProduct(1, n)

private def rangeProduct(n1: Long, n2: Long): BigInt = n2 - n1 match {
  case 0 => BigInt(n1)
  case 1 => BigInt(n1 * n2)
  case 2 => BigInt(n1 * (n1 + 1)) * n2
  case 3 => BigInt(n1 * (n1 + 1)) * ((n2 - 1) * n2)
  case _ =>
    val nm = (n1 + n2) >> 1
    rangeProduct(n1, nm) * rangeProduct(nm + 1, n2)
}
Also, to get more speed, use the latest version of the JDK and the following JVM options:
-server -XX:+TieredCompilation
Below are results for an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz), 12 GB DDR3-1333 RAM, Windows 7 SP1, Oracle JDK 1.8.0_25-b18 64-bit:
(BigInt(1) to BigInt(100000)).product took: 3,806 ms with 26.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduce(_ * _) took: 3,728 ms with 25.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceLeft(_ * _) took: 3,510 ms with 25.1 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceRight(_ * _) took: 4,056 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).fold(BigInt(1))(_ * _) took: 3,697 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.product took: 406 ms with 66.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduce(_ * _) took: 296 ms with 71.1 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceLeft(_ * _) took: 3,495 ms with 25.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceRight(_ * _) took: 3,900 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.fold(BigInt(1))(_ * _) took: 327 ms with 56.1 % of CPU usage
fact(100000) took: 203 ms with 28.3 % of CPU usage
BTW, to improve the efficiency of factorial calculation for numbers greater than 20000, use an implementation of the Schönhage-Strassen algorithm, or wait until it is merged into JDK 9, at which point Scala will be able to use it.