increase popen2 buffer size in GNU Octave - octave

Is it possible to increase the buffer used by popen2 between Octave and the subprocess? It looks like the buffer is limited to approximately 66560 bytes. This snippet shows the problem:
## This works with s = 65
## but not with s = 66
s = 66;
[in, out, pid] = popen2 ("dd", {"if=/dev/urandom",
                                "bs=1K",
                                sprintf("count=%i", s)});
pause (1);
[vt, cnt] = fread(out);
assert (cnt, s * 1024);
waitpid (pid);
fclose (in);
fclose (out);
returns:
66+0 records in
66+0 records out
67584 bytes (68 kB) copied, 0.999523 s, 67.6 kB/s
error: ASSERT errors for: assert (cnt,s * 1024)
Location | Observed | Expected | Reason
() 66560 67584 Abs err 1024 exceeds tol 0
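The roughly 65 KiB that does arrive matches the capacity of the underlying OS pipe (64 KiB by default on Linux) plus whatever additional buffering is involved: once the pipe is full, dd blocks until the reader drains it, so a single fread after a fixed pause only ever sees what fit in the pipe. As a rough illustration of the same mechanism outside Octave, here is a minimal Python sketch (assuming a Linux-like system with dd and /dev/urandom available) that reads the child's output in a loop instead of once:
# Minimal sketch (not Octave): read the child's stdout in a loop so the pipe
# is drained while dd is still writing.
import subprocess

count_kib = 66
proc = subprocess.Popen(
    ["dd", "if=/dev/urandom", "bs=1K", "count=%d" % count_kib],
    stdout=subprocess.PIPE,
)

chunks = []
while True:
    chunk = proc.stdout.read(4096)  # small reads keep the pipe from filling up
    if not chunk:                   # b"" means EOF: dd has exited and closed its end
        break
    chunks.append(chunk)
proc.wait()

data = b"".join(chunks)
assert len(data) == count_kib * 1024
The Octave-side equivalent would presumably be to call fread on out repeatedly, accumulating chunks until the expected number of bytes has arrived or EOF is reported, rather than doing a single fread after pause(1).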

Calculate output size in CNN

What will be the output size of each layer in the following model?
model = Sequential()
model.add(Conv2D(32, (8, 8), padding='same', strides=(4, 4), input_shape=(80,80,4)))
model.add(Activation('relu'))
model.add(Conv2D(64, (4, 4), padding='same', strides=(2, 2)))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3), padding='same', strides=(1, 1)))
model.add(Activation('relu'))
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dense(2))
TL;DR: If "output size" means the number of parameters, here's that plus the memory size of the model and the memory required to train:
-------------------------------------
Layer # of Params
=====================================
Conv2D 544
-------------------------------------
Conv2D 8256
-------------------------------------
Conv2D 36928
-------------------------------------
Dense 13107712
-------------------------------------
Dense 1026
=====================================
Total # of Params: 13154466
Model Size: 59.47 MiB
Memory Required to Train: 1.95 GiB
-------------------------------------
and here are the relevant equations to compute each:
Model:
2D Conv Layer:
nw = kw * kh * d * d_prev
nb = d
n = nw + nb
Dense Layer:
nw = no * ni
nb = no
n = nw + nb
Training:
2D Conv Layer:
n = wi * hi * dl
Dense Layer:
n = no
Legend:
nw = Number of Weight Parameters
nb = Number of Bias Parameters
kw = Kernel Width
kh = Kernel Height
d = Depth of Current Layer
d_prev = Depth of Previous Layer
no = Number of outputs
ni = Number of inputs
nt = Number of training parameters
n = Total number of parameters
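As a quick worked example of the model equations (using the same convention as the code below, which gives the first Dense layer ni = 80 * 80 * 4 = 25600 inputs):
nw = no * ni = 512 * 25600 = 13107200
nb = no = 512
n = nw + nb = 13107712
which matches the first Dense entry in the table above.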
For a batch size of 1, 59.47 MiB to train. For a batch size of 32, 1.95 GiB to train.
A great resource is Memory Usage Computational Considerations, by Kevin McGuinness. You can watch his presentation on YouTube here. He gives a link to the slides in the info portion of the YouTube post, but they are here for your reference (look for D2L1, Day 2 Lecture 1).
It depends on four things:
The size of your data types,
The number of parameters in your model,
The number of parameters required to store training outputs, and
Your batch size
By default, tensorflow uses 32-bit floating point data types (these are 4 bytes in size since there are 8 bits to a byte).
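For example, at 4 bytes per parameter, the 13,154,466 model parameters above come to 13,154,466 * 4 ≈ 52.6 MB on their own; the parameters that store training outputs and the batch size then add to this, which is what the script below accounts for.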
Here's the code I wrote to calculate it. It's pretty much the same as what Keras will output, but it also includes the memory requirements.
Source Code:
def model_params_Conv2D(d_n, k_w, k_h, d_n_prev, s_x=1, s_y=1):
    """Calculate the number of model parameters in a 2D Convolution Layer

    Args:
        d_n ([int]): Depth of current layer
        k_w ([int]): Kernel width
        k_h ([int]): Kernel height
        d_n_prev ([int]): Depth of previous layer
        s_x ([int]): Strides in x-direction
        s_y ([int]): Strides in y-direction

    Returns:
        [int]: Number of layer parameters
    """
    n_w = d_n * k_w * k_h * d_n_prev // (s_x * s_y)  # Number of weight parameters
    n_b = d_n  # Number of bias parameters
    return n_w + n_b


def model_params_Dense(n_o, n_i):
    """Calculate the number of model parameters in a dense layer

    Args:
        n_o ([int]): Number of outputs
        n_i ([int]): Number of inputs

    Returns:
        [int]: Number of layer parameters
    """
    n_w = n_o * n_i  # Number of weight parameters
    n_b = n_o  # Number of bias parameters
    return n_w + n_b


def training_params_Conv2D(w_i, h_i, d_l):
    """Calculate the number of training parameters in a 2D Convolution layer

    Args:
        w_i (int): Input width
        h_i (int): Input height
        d_l (int): Layer depth
    """
    return w_i * h_i * d_l


def training_params_Dense(n_o):
    """Calculate the number of training parameters in a Dense layer

    Args:
        n_o (int): Number of outputs
    """
    return n_o


def memory_requirement(n_p, m_dt=4):
    """Size of neural network model in bytes

    Args:
        n_p ([int]): Number of parameters
        m_dt ([int]): Memory size of data type in bytes

    Returns:
        [int]: Memory consumption in bytes
    """
    return n_p * m_dt


def SI2ibi(mem):
    """Convert from bytes to binary (IEC) units

    Computers use powers of 2, so memory is reported with binary prefixes,
    while SI prefixes are powers of 1000:
        kibi (KiB) = (2^10)^1, kilo (KB) = (10^3)^1 (1 KiB = 1.024 KB)
        mebi (MiB) = (2^10)^2, mega (MB) = (10^3)^2 (1 MiB = 1.048576 MB)
        gibi (GiB) = (2^10)^3, giga (GB) = (10^3)^3 (1 GiB = 1.073741824 GB)

    Args:
        mem ([int]): Memory size in bytes
    """
    KB = 10 ** 3
    MB = KB ** 2
    GB = KB ** 3
    KB2KiB = 1 / 1.024
    MB2MiB = 1 / 1.048576
    GB2GiB = 1 / 1.073741824

    if mem >= GB:
        mem /= GB * GB2GiB
        units = "GiB"
    elif mem >= MB:
        mem /= MB * MB2MiB
        units = "MiB"
    else:  # mem >= KB
        mem /= KB * KB2KiB
        units = "KiB"
    return mem, units


if __name__ == "__main__":
    # NOTE: Activation layers don't require any parameters. Use depth of
    # input as d_n_prev of first layer.
    input_shape = (80, 80, 4)
    w_i = input_shape[0]  # Input width
    h_i = input_shape[1]  # Input height
    d_i = input_shape[2]  # Input depth

    conv01_params = model_params_Conv2D(
        d_n=32, k_w=8, k_h=8, d_n_prev=d_i, s_x=4, s_y=4
    )
    conv02_params = model_params_Conv2D(d_n=64, k_w=4, k_h=4, d_n_prev=32, s_x=2, s_y=2)
    conv03_params = model_params_Conv2D(d_n=64, k_w=3, k_h=3, d_n_prev=64)
    dense01_params = model_params_Dense(n_i=w_i * h_i * d_i, n_o=512)
    dense02_params = model_params_Dense(n_i=512, n_o=2)
    num_model_params = (
        conv01_params + conv02_params + conv03_params + dense01_params + dense02_params
    )

    header_ = "Layer\t\t\t# of Params"
    len_header_ = len(repr(header_.expandtabs()))
    bar_eq_ = "=" * len_header_
    bar_dash_ = "-" * len_header_

    num_training_params = training_params_Conv2D(w_i, h_i, 32)
    num_training_params += training_params_Conv2D(w_i, h_i, 64)
    num_training_params += training_params_Conv2D(w_i, h_i, 64)
    num_training_params += training_params_Dense(512)
    num_training_params += training_params_Dense(2)

    model_memory = memory_requirement(num_model_params)
    training_memory = memory_requirement(num_training_params)
    total_memory = model_memory + training_memory
    batch_size = 32

    mem, units = SI2ibi(total_memory)
    mem32, units32 = SI2ibi(total_memory * batch_size)

    print(f"{bar_dash_}")
    print(f"{header_}")
    print(f"{bar_eq_}")
    print(f"Conv2D\t\t\t{conv01_params}")
    print(f"{bar_dash_}")
    print(f"Conv2D\t\t\t{conv02_params}")
    print(f"{bar_dash_}")
    print(f"Conv2D\t\t\t{conv03_params}")
    print(f"{bar_dash_}")
    print(f"Dense\t\t\t{dense01_params}")
    print(f"{bar_dash_}")
    print(f"Dense\t\t\t{dense02_params}")
    print(f"{bar_eq_}")
    print(f"Total # of Params: {num_model_params}")
    print(f"Model Size: {mem:.2f} {units}")
    print(f"Memory Required to Train: {mem32:.2f} {units32}")
    print(f"{bar_dash_}")

Decompiling 8051 binary, read from EEPROM

I'm trying to decompile the firmware of a Logitech Freedom 2.4 Cordless Joystick. I've managed to read something off the EEPROM (here).
The EEPROM that is used is the Microchip 25AA320, which is a 32 Kbit SPI EEPROM. The MCU is an nRF24E1G, which contains an 8051 MCU.
The ROM should be 4096 bytes, so I think that my reading program looped over itself 4 times.
I managed to extract a 4 kB ROM (here), but the start of the file doesn't look clean.
I loaded both files into IDA Pro and Ghidra and selected the 8051 processor. They don't generate anything useful.
Could anyone help me decompile this ROM?
I used this Arduino sketch to dump the ROM, together with this Python script:
## Author: Arpan Das
## Date: Fri Jan 11 12:16:59 2019 +0530
## URL: https://github.com/Cyberster/SPI-Based-EEPROM-Reader-Writer
## It listens to serial port and writes contents into a file
## requires pySerial to be installed
import sys
import serial
import time
start = time.time()
MEMORY_SIZE = 4096 # In bytes
serial_port = 'COM5'
baud_rate = 115200 # In arduino, Serial.begin(baud_rate)
write_to_file_path = "dump.rom"
output_file = open(write_to_file_path, "wb")
ser = serial.Serial(serial_port, baud_rate)
print("Press d for dump ROM else CTRL+C to exit.")
ch = sys.stdin.read(1)
if ch == 'd':
    ser.write('d')
    for i in range(MEMORY_SIZE / 32):  # read the ROM in 32-byte chunks
        # wait until the arduino responds with 'W', i.e. a 1-byte data write request
        while (ser.read() != 'W'): continue
        ser.write('G')  # send back the write-request-granted signal
        for j in range(32):
            byte = ser.read(1)
            output_file.write(byte)
        print(str(MEMORY_SIZE - (i * 32)) + " bytes remaining.")

print '\nIt took', time.time() - start, ' seconds.'
This is what I did; the next part is left for you. My machine is a Win10 notebook, but I used Unix tools because they are so capable.
First of all, I divided the 16KB dump into four 4KB parts. The first one was different from the other three, and the provided 4KB dump is different from all of these parts. I did not investigate this further, and simply took one of the other three parts, which are all equal.
$ split -b 4K LogitechFreedom2.4CordlessJoystick.rom part
$ cmp partaa partab
partaa partab differ: byte 1, line 1
$ cmp partab partac
$ cmp partac partad
$ cmp dump.rom partaa
dump.rom partaa differ: byte 9, line 1
$ cmp dump.rom partab
dump.rom partab differ: byte 1, line 1
From the microcontroller's data sheet I learned that the EEPROM contents have a header of at least 3 bytes (chapter 10.2, page 61).
These bytes are:
0b Version = 00, Reserved = 00, SPEED = 0.5MHz, XO_FREQ = 16MHz
03 Offset to start of user program = 3
0f Number of 256 bytes block = 15
The last entry seems to be off by one, because there seems to be code in the 16th block, too.
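The same three bytes can be checked and stripped with a few lines of Python as well (a minimal sketch; the file names partad and rom.bin simply match the dd commands that follow):
# Minimal sketch: print the 3-byte nRF24E1 EEPROM header described above and
# strip it off, mirroring the dd commands below.  File names are assumptions.
data = open("partad", "rb").read()

print("header bytes:", data[:3].hex())        # expected: 0b030f
print("offset to user program:", data[1])     # expected: 3
print("number of 256-byte blocks:", data[2])  # expected: 15

with open("rom.bin", "wb") as f:
    f.write(data[3:])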
Anyway, these bytes look decent, so I cut the first 3 bytes.
$ dd if=partad of=rom.bin bs=1 skip=3
4093+0 records in
4093+0 records out
4093 bytes (4.1 kB, 4.0 KiB) copied, 0.0270132 s, 152 kB/s
$ dd if=partad of=head.bin bs=1 count=3
3+0 records in
3+0 records out
3 bytes copied, 0.0043809 s, 0.7 kB/s
$ od -Ax -t x1 rom.bin > rom.hex
$ od -Ax -t x1 head.bin > head.hex
The hex files are nice for loading into an editor and looking around.
I loaded the remaining 4093 bytes into a disassembler I once wrote and peeked around a bit. It looks promising, so I think you can go on without me now:
C0000: ljmp C0F54
C0003: setb 021H.2
       reti
C000B: xch a,r5
       inc r6
       xrl a,r6
       mov a,#0B2H
       movc a,@a+pc
       movx @r1,a
       mov r7,a
       setb 021H.2
       reti
C0F54: mov psw,#000H
       mov sp,#07BH
       mov r0,#0FFH
       mov @r0,#000H
       djnz r0,C0F5C
       ljmp C0C09

Trying to improve execution time of Julia code

I need to write code that generates a string using a grammar with just one rule. For example, if the rule is "G -> [G+G]" and we apply it to "G", the result is the string "[G+G]"; if we apply it to the previous result, we obtain "[[G+G]+[G+G]]", and so on. In other words, it's about rewriting the axiom (the left side of the rule) a given number of times, following the rule.
I've been given a piece of code written in Octave that implements this operation (I won't include the code because it's a bit long, but I will if it's necessary for understanding or answering the question). What I need to do is to write an equivalent function in Julia; so I wrote this
function generate_developedstring(axiom::ASCIIString, genome::ASCIIString, iterations::Int8)
    tic()
    developedstring = axiom
    for i = 1:iterations
        developedstring = replace(developedstring, axiom, genome)
    end
    toc()
    return developedstring
end
In the example I wrote earlier, axiom would be "G" and genome "[G+G]".
According to the benchmark times published at julialang.org, Julia should be way faster than Octave, but in this case Octave is twice as fast as Julia
(I used the same axiom, genome and iterations for both codes and measured times with the tic/toc functions).
Is there any way to make the Julia code faster?
Edit: First of all, thank you all so much for your comments. I will show you the Octave code I've been given (I didn't write it):
function axiom = ls(genome)
  tic
  ProductionSystem = ['[=>[ ]=>] +=>+ -=>- G=>',genome];
  rule = extract(ProductionSystem);
  n_Rules = length(rule);

  % starting string
  axiom = 'G';
  % iterations (choose only from 1 to 7, >= 8 critical,
  % depends on the string and on the computer !!)
  n_Repeats = 3;

  % CALCULATE THE STRING
  % =================================
  for i = 1:n_Repeats
    % a single letter (axiom)
    axiomINcells = cellstr(axiom);
    for j = 1:n_Rules
      % find all occurrences of that axiom
      hit = strfind(axiom, rule(j).pre);
      if (length(hit) >= 1)
        for k = hit
          % perform the rule
          % (replace 'pre' by 'post')
          axiomINcells{k} = rule(j).pos;
        end
      end
    end
    axiom = [];
    for j = 1:length(axiomINcells)
      % put all strings together
      axiom = [axiom, axiomINcells{j}];
    end
  end
  toc

function rule = extract(ProductionSystem)
  % rules are separated by the space character, and pre and post sides are
  % separated by '=>'
  % e.g. F=>FF G=>F[+G][-G]F[+G][-G]FG
  i = 0;
  while (~isempty(ProductionSystem))
    i = i + 1;
    [rule1,ProductionSystem] = strtok(ProductionSystem,' ');
    [rule(i).pre,post] = strtok(rule1,'=>');
    rule(i).pos = post(3:end);
    if (~isempty(ProductionSystem))
      ProductionSystem = ProductionSystem(2:end); % delete separator
    end
  end
About the Julia version I'm using, it's 0.4.7. You also asked me how fast I need it to run; I just need to write code that is as fast as possible, and the fact that Octave was faster made me think I was doing something wrong.
Thank you again.
Can you specify the genome as a rule instead of a pattern? I mean, for example genome = ax -> "[$ax,$ax]".
Compare these two implementations, the first of which is the same as yours:
function genstring(axiom::String, genome::String, iter::Int)
    str = axiom
    for i in 1:iter
        str = replace(str, axiom, genome)
    end
    return str
end
And then with an anonymous function:
genome = ax -> "[$ax,$ax]"
function genstring_(axiom::String, genome, n::Int)
    if n < 1
        return axiom
    end
    return genstring_(genome(axiom), genome, n-1)
end
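Both versions produce the same output; for example, genstring("G", "[G,G]", 2) and genstring_("G", genome, 2) both return "[[G,G],[G,G]]".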
On version 0.5, with BenchmarkTools:
julia> @benchmark genstring("G", "[G,G]", 2)
BenchmarkTools.Trial:
memory estimate: 752.00 bytes
allocs estimate: 15
--------------
minimum time: 745.950 ns (0.00% GC)
median time: 801.067 ns (0.00% GC)
mean time: 1.006 μs (14.30% GC)
maximum time: 50.271 μs (96.63% GC)
--------------
samples: 10000
evals/sample: 119
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark genstring_("G", genome, 2)
BenchmarkTools.Trial:
memory estimate: 352.00 bytes
allocs estimate: 9
--------------
minimum time: 397.562 ns (0.00% GC)
median time: 414.149 ns (0.00% GC)
mean time: 496.511 ns (13.06% GC)
maximum time: 24.410 μs (97.18% GC)
--------------
samples: 10000
evals/sample: 201
time tolerance: 5.00%
memory tolerance: 1.00%
It scales better:
julia> @benchmark genstring("G", "[G,G]", 10)
BenchmarkTools.Trial:
memory estimate: 18.00 kb
allocs estimate: 71
--------------
minimum time: 93.569 μs (0.00% GC)
median time: 95.959 μs (0.00% GC)
mean time: 103.429 μs (3.05% GC)
maximum time: 4.216 ms (97.14% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark genstring_("G", genome, 10)
BenchmarkTools.Trial:
memory estimate: 14.13 kb
allocs estimate: 49
--------------
minimum time: 3.072 μs (0.00% GC)
median time: 3.597 μs (0.00% GC)
mean time: 5.703 μs (29.78% GC)
maximum time: 441.515 μs (98.24% GC)
--------------
samples: 10000
evals/sample: 8
time tolerance: 5.00%
memory tolerance: 1.00%
As far as I know, string interpolation isn't superfast, so there could be further optimizations.

NVVM_ERROR_INVALID_OPTION when using the CUDA kernel with Numbapro api

I want to execute a CUDA kernel in Python using the Numbapro API. I have this code:
import math
import numpy
from numbapro import jit, cuda, int32, float32
from matplotlib import pyplot
from timeit import default_timer as timer  # for timer() below (the original snippet omitted this import)

@cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, float32, float32, int32)')
def calculate_velocity_field(X, Y, u_source, v_source, x_source, y_source, strength_source, N):
    start = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    end = N
    stride = cuda.gridDim.x * cuda.blockDim.x
    for i in range(start, end, stride):
        u_source[i] = strength_source/(2*math.pi) * (X[i]-x_source)/((X[i]-x_source)**2 + (Y[i]-y_source)**2)
        v_source[i] = strength_source/(2*math.pi) * (Y[i]-x_source)/((X[i]-x_source)**2 + (Y[i]-y_source)**2)

def main():
    N = 200  # number of points in each direction
    x_start, x_end = -4.0, 4.0  # boundaries in the x-direction
    y_start, y_end = -2.0, 2.0  # boundaries in the y-direction
    x = numpy.linspace(x_start, x_end, N)  # creates a 1D-array with the x-coordinates
    y = numpy.linspace(y_start, y_end, N)  # creates a 1D-array with the y-coordinates
    X, Y = numpy.meshgrid(x, y)  # generates a mesh grid

    strength_source = 5.0  # source strength
    x_source, y_source = -1.0, 0.0  # location of the source

    start = timer()

    # calculate grid dimensions
    blockSize = 1024
    gridSize = int(math.ceil(float(N)/blockSize))

    # transfer memory to device
    X_d = cuda.to_device(X)
    Y_d = cuda.to_device(Y)
    u_source_d = cuda.device_array_like(X)
    v_source_d = cuda.device_array_like(Y)

    # launch kernel
    calculate_velocity_field[gridSize,blockSize](X_d,Y_d,u_source_d,v_source_d,x_source,y_source,strength_source,N)

    # transfer memory to host
    u_source = numpy.empty_like(X)
    v_source = numpy.empty_like(Y)
    u_source_d.to_host(u_source)
    v_source_d.to_host(v_source)

    elapsed_time = timer() - start
    print("Exec time with GPU %f s" % elapsed_time)

if __name__ == "__main__":
    main()
It gives me this error:
NvvmError Traceback (most recent call last)
<ipython-input-17-85e4a6e56a14> in <module>()
----> 1 @cuda.jit('void(float32[:], float32[:], float32[:], float32[:], float32, float32, float32, int32)')
2 def calculate_velocity_field(X, Y, u_source, v_source, x_source, y_source, strength_source, N):
3 start = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
4 end = N
5 stride = cuda.gridDim.x * cuda.blockDim.x
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/decorators.py in kernel_jit(func)
89 # Force compilation for the current context
90 if bind:
---> 91 kernel.bind()
92
93 return kernel
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in bind(self)
319 Force binding to current CUDA context
320 """
--> 321 self._func.get()
322
323 @property
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in get(self)
254 cufunc = self.cache.get(device.id)
255 if cufunc is None:
--> 256 ptx = self.ptx.get()
257
258 # Link
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/compiler.py in get(self)
226 arch = nvvm.get_arch_option(*cc)
227 ptx = nvvm.llvm_to_ptx(self.llvmir, opt=3, arch=arch,
--> 228 **self._extra_options)
229 self.cache[cc] = ptx
230 if config.DUMP_ASSEMBLY:
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in llvm_to_ptx(llvmir, **opts)
420 cu.add_module(llvmir.encode('utf8'))
421 cu.add_module(libdevice.get())
--> 422 ptx = cu.compile(**opts)
423 return ptx
424
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in compile(self, **options)
211 for x in opts])
212 err = self.driver.nvvmCompileProgram(self._handle, len(opts), c_opts)
--> 213 self._try_error(err, 'Failed to compile\n')
214
215 # get result
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in _try_error(self, err, msg)
229
230 def _try_error(self, err, msg):
--> 231 self.driver.check_error(err, "%s\n%s" % (msg, self.get_log()))
232
233 def get_log(self):
~/.anaconda3/lib/python3.4/site-packages/numba/cuda/cudadrv/nvvm.py in check_error(self, error, msg, exit)
118 sys.exit(1)
119 else:
--> 120 raise exc
121
122
NvvmError: Failed to compile
libnvvm : error: -arch=compute_52 is an unsupported option
NVVM_ERROR_INVALID_OPTION
I tried other Numbapro examples and the same error occurs.
I don't know if it's a bug in Numbapro that doesn't support compute capability 5.2, or a problem with Nvidia NVVM... suggestions?
In theory it should be supported, but I don't know what is happening.
I'm using Linux with CUDA 7.0 and driver version 346.29
Finally I found a solution here
Solution 1:
conda update cudatoolkit
Fetching package metadata: ....
# All requested packages already installed.
# packages in environment at ~/.anaconda3:
#
cudatoolkit 6.0 p0
It looks like updating the CUDA toolkit this way doesn't update it to CUDA 7.0. A second solution can be used instead:
Solution 2
conda install -c numba cudatoolkit
Fetching package metadata: ......
Solving package specifications: .
Package plan for installation in environment ~/.anaconda3:
The following packages will be downloaded:
package | build
---------------------------|-----------------
cudatoolkit-7.0 | 1 190.8 MB
The following packages will be UPDATED:
cudatoolkit: 6.0-p0 --> 7.0-1
Proceed ([y]/n)? y
Before:
In [4]: check_cuda()
------------------------------libraries detection-------------------------------
Finding cublas
located at ~/.anaconda3/lib/libcublas.so.6.0.37
trying to open library... ok
Finding cusparse
located at ~/.anaconda3/lib/libcusparse.so.6.0.37
trying to open library... ok
Finding cufft
located at ~/.anaconda3/lib/libcufft.so.6.0.37
trying to open library... ok
Finding curand
located at ~/.anaconda3/lib/libcurand.so.6.0.37
trying to open library... ok
Finding nvvm
located at ~/.anaconda3/lib/libnvvm.so.2.0.0
trying to open library... ok
finding libdevice for compute_20... ok
finding libdevice for compute_30... ok
finding libdevice for compute_35... ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0 b'GeForce GTX 970' [SUPPORTED]
compute capability: 5.2
pci device id: 0
pci bus id: 7
Summary:
1/1 devices are supported
PASSED
Out[4]: True
After:
In [6]: check_cuda()
------------------------------libraries detection-------------------------------
Finding cublas
located at ~/.anaconda3/lib/libcublas.so.7.0.28
trying to open library... ok
Finding cusparse
located at ~/.anaconda3/lib/libcusparse.so.7.0.28
trying to open library... ok
Finding cufft
located at ~/.anaconda3/lib/libcufft.so.7.0.35
trying to open library... ok
Finding curand
located at ~/.anaconda3/lib/libcurand.so.7.0.28
trying to open library... ok
Finding nvvm
located at ~/.anaconda3/lib/libnvvm.so.3.0.0
trying to open library... ok
finding libdevice for compute_20... ok
finding libdevice for compute_30... ok
finding libdevice for compute_35... ok
-------------------------------hardware detection-------------------------------
Found 1 CUDA devices
id 0 b'GeForce GTX 970' [SUPPORTED]
compute capability: 5.2
pci device id: 0
pci bus id: 7
Summary:
1/1 devices are supported
PASSED
Out[6]: True
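As a footnote, NumbaPro was later folded into plain Numba, where a similar sanity check is available; a minimal sketch, assuming a recent Numba installation:
# Assumes a recent Numba install (the old NumbaPro API is gone); this lists the
# detected CUDA devices and whether they are supported, much like check_cuda().
from numba import cuda
cuda.detect()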

How to optimize this short factorial function in scala? (Creating 50000 BigInts)

I've compared the Scala version
(BigInt(1) to BigInt(50000)).reduce(_ * _)
to the python version
reduce(lambda x,y: x*y, range(1,50000))
and it turns out that the Scala version took about 10 times longer than the Python version.
I'm guessing a big difference is that Python can use its native long type instead of creating new BigInt objects for each number. But is there a workaround in Scala?
The fact that your Scala code creates 50,000 BigInt objects is unlikely to be making much of a difference here. A bigger issue is the multiplication algorithm—Python's long uses Karatsuba multiplication and Java's BigInteger (which BigInt just wraps) doesn't.
The easiest workaround is probably to switch to a better arbitrary precision math library, like JScience's:
import org.jscience.mathematics.number.LargeInteger
(1 to 50000).foldLeft(LargeInteger.ONE)(_ times _)
This is faster than the Python solution on my machine.
Update: I've written some quick benchmarking code using Caliper in response to Luigi Plingi's answer, which gives the following results on my (quad core) machine:
benchmark ms linear runtime
BigIntFoldLeft 4774 ==============================
BigIntFold 4739 =============================
BigIntReduce 4769 =============================
BigIntFoldLeftPar 4642 =============================
BigIntFoldPar 500 ===
BigIntReducePar 499 ===
LargeIntegerFoldLeft 3042 ===================
LargeIntegerFold 3003 ==================
LargeIntegerReduce 3018 ==================
LargeIntegerFoldLeftPar 3038 ===================
LargeIntegerFoldPar 246 =
LargeIntegerReducePar 260 =
I don't see the difference between reduce and fold that he does, but the moral is clear: if you can use Scala 2.9's parallel collections, they'll give you a huge improvement, but switching to LargeInteger helps as well.
Python on my machine:
import time

def func():
    start = time.clock()
    reduce(lambda x,y: x*y, range(1,50000))
    end = time.clock()
    t = (end-start) * 1000
    print t
gives 1219 ms
Scala:
def timed[T](f: => T) = {
  val t0 = System.currentTimeMillis
  val r = f
  val t1 = System.currentTimeMillis
  println("Took: "+(t1 - t0)+" ms")
  r
}
timed { (BigInt(1) to BigInt(50000)).reduce(_ * _) }
4251 ms
timed { (BigInt(1) to BigInt(50000)).fold(BigInt(1))(_ * _) }
4224 ms
timed { (BigInt(1) to BigInt(50000)).par.reduce(_ * _) }
2083 ms
timed { (BigInt(1) to BigInt(50000)).par.fold(BigInt(1))(_ * _) }
689 ms
// using org.jscience.mathematics.number.LargeInteger from Travis's answer
timed { val a = (1 to 50000).foldLeft(LargeInteger.ONE)(_ times _) }
3327 ms
timed { val a = (1 to 50000).map(LargeInteger.valueOf(_)).par.fold(
LargeInteger.ONE)(_ times _) }
361 ms
This 689 ms and 361 ms were after a few warmup runs. They both started at about 1000 ms, but seem to warm up by different amounts. The parallel collections seem to warm up significantly more than the non-parallel: the non-parallel operations did not reduce significantly from their first runs.
The .par (meaning, use parallel collections) seemed to speed up fold more than reduce. I only have 2 cores, but a greater number of cores should see a bigger performance gain.
So, experimentally, the way to optimize this function is
a) Use fold rather than reduce
b) Use parallel collections
update:
Inspired by the observation that breaking the calculation down into smaller chunks speeds things up, I managed to get the following to run in 215 ms on my machine, which is a 40% improvement on the standard parallelized algorithm. (Using BigInt, it takes 615 ms.) Also, it doesn't use parallel collections, but somehow uses 90% CPU (unlike BigInt).
import org.jscience.mathematics.number.LargeInteger
def fact(n: Int) = {
  def loop(seq: Seq[LargeInteger]): LargeInteger = seq.length match {
    case 0 => throw new IllegalArgumentException
    case 1 => seq.head
    case _ => loop {
      val (a, b) = seq.splitAt(seq.length / 2)
      a.zipAll(b, LargeInteger.ONE, LargeInteger.ONE).map(i => i._1 times i._2)
    }
  }
  loop((1 to n).map(LargeInteger.valueOf(_)).toIndexedSeq)
}
Another trick here could be to try both reduceLeft and reduceRight to see which is fastest. On your example I get a much faster execution with reduceRight:
scala> timed { (BigInt(1) to BigInt(50000)).reduceLeft(_ * _) }
Took: 4605 ms
scala> timed { (BigInt(1) to BigInt(50000)).reduceRight(_ * _) }
Took: 2004 ms
Same difference between foldLeft and foldRight. Guess it matters what side of the tree you start reducing from :)
The most efficient way to calculate a factorial in Scala is to use a divide and conquer strategy:
def fact(n: Int): BigInt = rangeProduct(1, n)

private def rangeProduct(n1: Long, n2: Long): BigInt = n2 - n1 match {
  case 0 => BigInt(n1)
  case 1 => BigInt(n1 * n2)
  case 2 => BigInt(n1 * (n1 + 1)) * n2
  case 3 => BigInt(n1 * (n1 + 1)) * ((n2 - 1) * n2)
  case _ =>
    val nm = (n1 + n2) >> 1
    rangeProduct(n1, nm) * rangeProduct(nm + 1, n2)
}
Also, to get more speed, use the latest version of the JDK and the following JVM options:
-server -XX:+TieredCompilation
Below are results for an Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz (max 3.50GHz), 12 GB DDR3-1333 RAM, Windows 7 SP1, Oracle JDK 1.8.0_25-b18 64-bit:
(BigInt(1) to BigInt(100000)).product took: 3,806 ms with 26.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduce(_ * _) took: 3,728 ms with 25.4 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceLeft(_ * _) took: 3,510 ms with 25.1 % of CPU usage
(BigInt(1) to BigInt(100000)).reduceRight(_ * _) took: 4,056 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).fold(BigInt(1))(_ * _) took: 3,697 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.product took: 406 ms with 66.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduce(_ * _) took: 296 ms with 71.1 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceLeft(_ * _) took: 3,495 ms with 25.3 % of CPU usage
(BigInt(1) to BigInt(100000)).par.reduceRight(_ * _) took: 3,900 ms with 25.5 % of CPU usage
(BigInt(1) to BigInt(100000)).par.fold(BigInt(1))(_ * _) took: 327 ms with 56.1 % of CPU usage
fact(100000) took: 203 ms with 28.3 % of CPU usage
BTW, to improve the efficiency of factorial calculation for numbers greater than 20000, use an implementation of the Schönhage-Strassen algorithm, or wait until it is merged into JDK 9 and Scala is able to use it.