How to apply sqrt to vector in Cython? - cython

Hello, I'm a complete beginner with Cython and C-based languages.
I have a problem getting the square root of each element of a vector.
I have a vector (each value is of type double):
x = [1, 4, 9]
and I want to get:
y = [1, 2, 3]
How can I get this vector?
A solution I thought of is:
cdef floating[::1] y = x
for i in range(length):
    y[i] = x[i] ** 0.5
But this way is too slow, and I want to accelerate it.
Can I use sqrt or square function from libc.math in this case?
Edit:
If I want to take a cube root (the 1/3 power) instead, e.g. [1, 8, 27] -> [1, 2, 3], what function should I use instead of sqrt?

Quick win
First, you should check whether your function is already implemented in NumPy. If so, it will probably be a very fast (C/C++) implementation.
This is the case for your function:
import numpy as np
x = np.array([1, 4, 9])
y = np.sqrt(x)
#> array([1., 2., 3.])
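The same quick win covers the cube root asked about in the edit, since NumPy also ships a vectorized cube root:
y = np.cbrt(np.array([1., 8., 27.]))
#> array([1., 2., 3.])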
With NumPy arrays
Alternatively (following @joni's comment), you can use NumPy arrays just for the input/output and compute the function element-wise using C/C++:
cimport numpy as cnp
import numpy as np
from libc.math cimport sqrt
cpdef cnp.ndarray[double, ndim=1] cy_sqrt_np(cnp.ndarray[double, ndim=1] x):
    cdef Py_ssize_t i, l = x.shape[0]
    cdef cnp.ndarray[double, ndim=1] y = np.empty(l)
    for i in range(l):
        y[i] = sqrt(x[i])
    return y
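Incidentally, the typed-memoryview approach from the question can be accelerated the same way: type the loop variables and call sqrt from libc.math. A minimal sketch, assuming a C-contiguous float64 input (the function name is mine):
import numpy as np
from libc.math cimport sqrt
cpdef double[::1] cy_sqrt_mv(double[::1] x):
    cdef Py_ssize_t i, n = x.shape[0]
    cdef double[::1] y = np.empty(n, dtype=np.float64)
    for i in range(n):
        y[i] = sqrt(x[i])
    return y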
With C++ vectors
Lastly, here is a possible implementation with C++ vectors and automatic conversion from/to Python lists:
from libc.math cimport sqrt
from libcpp.vector cimport vector
cpdef vector[double] cy_sqrt_vec(vector[double] x):
    cdef Py_ssize_t i, l = x.size()
    cdef vector[double] y
    y.reserve(l)
    for i in range(l):
        y.push_back(sqrt(x[i]))
    return y
Some things to keep in mind in this case and the previous:
We initialize the y vector to be empty, and then allocate space for it with reserve(). According to SO this seems to be a good option.
We use a typed i in the for loop, and use push_back to assign new values.
We use sqrt from libc.math to avoid using Python code inside the loop.
We type the input of the function to be vector[double]. This automatically adds convenient type conversions from other python types (e.g., list of ints).
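For completeness, these snippets need to be compiled as a Cython extension module. A minimal setup.py sketch (the module/file name cy_sqrt is my assumption):
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "cy_sqrt",
    sources=["cy_sqrt.pyx"],
    language="c++",                   # needed for libcpp.vector
    include_dirs=[np.get_include()],  # needed for cimport numpy
)
setup(ext_modules=cythonize(ext))
Build in place with python setup.py build_ext --inplace, then import cy_sqrt as usual.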
Time comparison
We define a random input x to avoid cached results polluting our measures:
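(gen_x is not shown here; something along these lines, returning a fresh random float64 array for every run, is assumed:)
import numpy as np
def gen_x(n=100):
    # a fresh random float64 input, so no run benefits from cached results
    return np.random.rand(n)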
%%timeit -n 10000 -r 7 x = gen_x()
y = np.sqrt(x)
#> executed in 177ms, finished 16:16:57 2022-04-19
#> 2.3 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = gen_x()
y = x**.5
#> executed in 194ms, finished 16:16:51 2022-04-19
#> 2.46 µs ± 256 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = gen_x()
y = cy_sqrt_np(x)
#> executed in 359ms, finished 16:17:02 2022-04-19
#> 4.9 µs ± 274 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit -n 10000 -r 7 x = list(gen_x())
y = cy_sqrt_vec(x)
#> executed in 2.85s, finished 16:17:11 2022-04-19
#> 40.4 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As expected, the np.sqrt version wins. Moreover, the C++ vector allocation makes that version comparatively slower.

Related

Non-linear fit in GNU Octave

I have a problem performing a non-linear fit with GNU Octave. Basically, I need to perform a global fit with some shared parameters while keeping others fixed.
The following code works perfectly in MATLAB, but Octave returns an error:
error: operator *: nonconformant arguments (op1 is 34x1, op2 is 4x1)
Attached my code and the data to play with:
clear
close all
clc
pkg load optim
D = dlmread('hd', ';'); % raw data
bkg = D(1,2:end); % 4 sensors bkg
x = D(2:end,1); % input signal
Y = D(2:end,2:end); % 4 sensors response
W = 1./Y; % weights
b0 = [7 .04 .01 .1 .5 2 1]; % educated guess to start the fit
%% model function
F = @(b) ((bkg + (b(1) - bkg).*(1-exp(-(b(2:5).*x).^b(6))).^b(7)) - Y) .* W;
opts = optimset("Display", "iter");
lb = [5 .001 .001 .001 .001 .01 1];
ub = [];
[b, resnorm, residual, exitflag, output, lambda, Jacob] = ...
lsqnonlin(F,b0,lb,ub,opts)
To give more info: in the array b0, b0(1), b0(6) and b0(7) are shared among the 4 datasets, while b0(2:5) are specific to each dataset.
Thank you for your help and suggestions! ;)
Raw data:
0,0.3105,0.31342,0.31183,0.31117
0.013229,0.329,0.3295,0.332,0.372
0.013229,0.328,0.33,0.33,0.373
0.021324,0.33,0.3305,0.33633,0.399
0.021324,0.325,0.3265,0.333,0.397
0.037763,0.33,0.3255,0.34467,0.461
0.037763,0.327,0.3285,0.347,0.456
0.069405,0.338,0.3265,0.36533,0.587
0.069405,0.3395,0.329,0.36667,0.589
0.12991,0.357,0.3385,0.41333,0.831
0.12991,0.358,0.3385,0.41433,0.837
0.25368,0.393,0.347,0.501,1.302
0.25368,0.3915,0.3515,0.498,1.278
0.51227,0.458,0.3735,0.668,2.098
0.51227,0.47,0.3815,0.68467,2.124
1.0137,0.61,0.4175,1.008,3.357
1.0137,0.599,0.422,1,3.318
2.0162,0.89,0.5335,1.645,5.006
2.0162,0.872,0.5325,1.619,4.938
4.0192,1.411,0.716,2.674,6.595
4.0192,1.418,0.7205,2.691,6.766
8.0315,2.34,1.118,4.195,7.176
8.0315,2.33,1.126,4.161,6.74
16.04,3.759,1.751,5.9,7.174
16.04,3.762,1.748,5.911,7.151
32.102,5.418,2.942,7.164,7.149
32.102,5.406,2.941,7.164,7.175
64.142,7.016,4.478,7.174,7.176
64.142,7.018,4.402,7.175,7.175
128.32,7.176,6.078,7.175,7.176
128.32,7.175,6.107,7.175,7.173
255.72,7.165,7.162,7.165,7.165
255.72,7.165,7.164,7.166,7.166
511.71,7.165,7.165,7.165,7.165
511.71,7.165,7.165,7.166,7.164
Given the function definition above, if you call F(b0) in the command window you get a 34x4 matrix, which is correct, since the variable Y has the same size.
That way I can (in theory) compute the standard lsqnonlin objective, (fit - measured)^2.

Implementing CUDA kernel to use row wise features for histogram

I am trying to write a CUDA kernel to generate a row-wise histogram based on the input feature set (2 x 6), where each feature row (each having 6 features) generates its own histogram with nbins=10.
I have implemented the below code but it doesn’t seem to generate the correct row-wise histogram.
import numba
import numpy as np
from numba import cuda
np.random.seed(0)
feature = np.random.randint(1, high=6, size=(2,6), dtype=int)
output = np.zeros(20).astype(np.float32).reshape(2,10)
### Kernel Configuration
threads_per_block = 6
blocks = 2
# moving data to device
d_feature = cuda.to_device(feature)
d_output = cuda.to_device(output)
feature_size = d_feature.shape[1]
@cuda.jit
def row_wise_histogram(feature, output, n):
    xmin = np.float32(-4.0)
    xmax = np.float32(4.0)
    idx = cuda.grid(1)
    nbins = 10
    bin_width = (xmax - xmin) / nbins
    for i in range(n):
        # Each thread takes all the features of its row to build the histogram
        input = feature[idx][i]
        bin_number = np.int32(nbins * (np.float32(input) - np.float32(xmin)) / (np.float32(xmax) - np.float32(xmin)))
        if bin_number >= 0 and bin_number < output.shape[1]:
            cuda.atomic.add(output[idx], bin_number, 1)
row_wise_histogram[blocks, threads_per_block](d_feature, d_output, feature_size)
print(d_output.copy_to_host())
And the output is
[[ 0. 0. 0. 0. 0. 0. 81111. 81111. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 162222. 0. 81111. 0.]]
which is wrong. I will appreciate it if I can get help with the issue inside the row_wise_histogram function!
I think the main issue in your code is that your kernel's thread strategy is for each thread to process one row, and you have 2 rows in your feature dataset, but you are launching 12 threads in total:
### Kernel Configuration
threads_per_block = 6
blocks = 2
10 of those threads will be indexing out-of-bounds. For 2 rows you only need 2 threads. We can fix this multiple ways, but I will add a "thread-check" to your kernel code, to prevent out-of-bounds threads from doing anything.
You are also histogramming values that don't fit in your output array. Let's suppose your feature has an input value of 4 at some location. Let's put that value through your arithmetic:
bin_number = np.int32(nbins * (np.float32(4) - np.float32(-4)) / (np.float32(4) - np.float32(-4)))
That is 10 * (4-(-4))/(4-(-4))
So that is a bin index of 10. But you only have 10 bins, so valid bin index can only go up to 9. Which means some of your input values (e.g. 4, 5) will not be recorded in your output.
The following code is your code with the thread-check added, plus the range of the input adjusted. I am also printing out the input, the bin each input value was assigned to, and the output bins. It seems to be working correctly.
$ cat t65.py
import numba
import numpy as np
from numba import cuda
np.random.seed(0)
feature = np.random.randint(1, high=4, size=(2,6), dtype=int)
output = np.zeros(20).astype(np.float32).reshape(2,10)
mybin = np.empty_like(feature)
### Kernel Configuration
threads_per_block = 6
blocks = 2
# moving data to device
d_feature = cuda.to_device(feature)
d_output = cuda.to_device(output)
feature_size = d_feature.shape[1]
d_mybin = cuda.to_device(mybin)
@cuda.jit
def row_wise_histogram(feature, output, mybin, n):
    xmin = np.float32(-4.0)
    xmax = np.float32(4.0)
    idx = cuda.grid(1)
    nbins = 10
    bin_width = (xmax - xmin) / nbins
    if idx < output.shape[0]:
        for i in range(n):
            # Each thread takes all the features of its row to build the histogram
            input = feature[idx][i]
            bin_number = np.int32(nbins * (np.float32(input) - np.float32(xmin)) / (np.float32(xmax) - np.float32(xmin)))
            mybin[idx][i] = bin_number
            if bin_number >= 0 and bin_number < output.shape[1]:
                cuda.atomic.add(output[idx], bin_number, 1)
row_wise_histogram[blocks, threads_per_block](d_feature, d_output, d_mybin, feature_size)
print(feature)
print(d_mybin.copy_to_host())
print(d_output.copy_to_host())
$ python t65.py
[[1 2 1 2 2 3]
[1 3 1 1 1 3]]
[[6 7 6 7 7 8]
[6 8 6 6 6 8]]
[[ 0. 0. 0. 0. 0. 0. 2. 3. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 4. 0. 2. 0.]]
$ cuda-memcheck python t65.py
========= CUDA-MEMCHECK
[[1 2 1 2 2 3]
[1 3 1 1 1 3]]
[[6 7 6 7 7 8]
[6 8 6 6 6 8]]
[[ 0. 0. 0. 0. 0. 0. 2. 3. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 4. 0. 2. 0.]]
========= ERROR SUMMARY: 0 errors
$
Note that when I restrict the input values to 1..3, the maximum bin index is 8 (do the math: 10 * (3 - (-4)) / 8 = 8.75, which truncates to 8). If I increase the input range to include 4, the maximum bin index goes to 10, which "won't fit". You're correctly handling this case, but it may confuse you, as these values of 4 or 5 won't be recorded in the output. Histogram bin arithmetic is fun. You will need to work out exactly what you want.
Also note that if you run this code, you should see output almost exactly the same as above. If you don't, there is a good chance your numba or cuda install is broken somehow, and the additional run I show with cuda-memcheck will help to discover what may be the issue.
Note that since you are using atomics anyway, there isn't any particular need to assign one thread to each row, you could instead assign one thread to each input point. But that isn't your question; it's a story for another day. Conversely, if you do proceed with one thread per row, each thread doing effectively a private histogram, there is no particular need to use atomics.
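To illustrate that last point, here is a minimal sketch of the one-thread-per-row variant without atomics (reusing the imports and launch configuration from above; the kernel name is my own):
@cuda.jit
def row_wise_histogram_noatomic(feature, output, n):
    xmin = np.float32(-4.0)
    xmax = np.float32(4.0)
    nbins = 10
    idx = cuda.grid(1)
    if idx < output.shape[0]:
        for i in range(n):
            val = feature[idx][i]
            bin_number = np.int32(nbins * (np.float32(val) - xmin) / (xmax - xmin))
            if bin_number >= 0 and bin_number < output.shape[1]:
                # only this thread ever writes to row idx, so a plain add is safe
                output[idx, bin_number] += 1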

Julia CUDA - Saving intermediate kernel results without CPU

Consider the following CUDA kernel, which computes the mean of each row of a 2-D matrix.
using CUDA
function mean!(x, n, out)
"""out = sum(x, dims=2)"""
row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
for i = 1:n
#inbounds out[row_idx] += x[row_idx, i]
end
out[row_idx] /= n
return
end
using Test
nrow, ncol = 1024, 10
x = CuArray{Float64, 2}(rand(nrow, ncol))
y = CuArray{Float64, 1}(zeros(nrow))
@cuda threads=256 blocks=4 mean!(x, size(x)[2], y)
@test isapprox(y, sum(x, dims=2)) # test passed
Also consider the following CUDA kernel
function add!(a, b, c)
""" c = a .+ b """
i = (blockIdx().x-1) * blockDim().x + threadIdx().x
c[i] = a[i] + b[i]
return
end
a = CuArray{Float64, 1}(zeros(nrow))
b = CuArray{Float64, 1}(ones(nrow))
c = CuArray{Float64, 1}(zeros(nrow))
@cuda threads=256 blocks=4 add!(a, b, c)
@test all(c .== a .+ b) # test passed
Now, suppose I wanted to write another kernel that uses the intermediate results of mean!(). For example,
function g(x, y)
""" mean(x, dims=2) + mean(y, dims=2) """
xrow, xcol = size(x)
yrow, ycol = size(y)
mean1 = CuArray{Float64, 1}(undef, xrow)
#cuda threads=256 blocks=4 mean!(x, xcol, mean1)
mean2 = CuArray{Float64, 1}(zeros(yrow))
#cuda threads=256 blocks=4 mean!(y, ycol, mean2)
out = CuArray{Float64, 1}(zeros(yrow))
#cuda threads=256 blocks=4 add!(mean1, mean2, out)
return out
end
(Of course, g() isn't technically a kernel since it returns something.)
My question is whether g() is "correct". In particular, is g() wasting time by transferring data between the GPU/CPU?
For example, if my understanding is correct, one way g() could be optimized is by initializing mean2 the same way we initialize mean1. This is because when constructing mean2, we're actually first creating zeros(yrow) on the CPU, then passing this to the CuArray constructor to be copied to the GPU. In contrast, mean1 is constructed but uninitialized (due to the undef) and therefore avoids this extra transfer.
To summarize, how do I save/use intermediate kernel results while avoiding data transfers between the CPU/GPU as much as possible?
You can generate arrays or vectors of zeros directly on the GPU!
Try:
CUDA.zeros(Float64, nrow)
Some benchmarks:
julia> @btime CUDA.zeros(Float64, 1000, 1000)
  12.600 μs (26 allocations: 1.22 KiB)
1000×1000 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
...
julia> @btime CuArray(zeros(1000, 1000))
  3.551 ms (8 allocations: 7.63 MiB)
1000×1000 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
...

Apply function repeatedly a specific number of times

If you have a function, is there an easy or built-in way to apply it n times, or until the result is something specific? So, for example, if you want to apply the sqrt function 4 times, with the effect of:
julia> sqrt(sqrt(sqrt(sqrt(11231))))
1.791229164345863
you could type something like:
repeatf(sqrt, 11231, 4)
I don't know of such a function, but you could use this:
julia> repeatf(f, x, n) = n > 1 ? f(repeatf(f, x, n-1)) : f(x)
julia> repeatf(sqrt, 11321, 4)
106.40018796975878
also, even comfier
repeatf(n, f, x...) = n > 1 ? f(repeatf(n-1, f, x...)...) : f(x...)
for functions with more than one argument.
I'm a fan of defining that ^ operator to work on Functions and Ints
julia> (^)(f::Function, i::Int) = i==1 ? f : x->(f^(i-1))(f(x))
^ (generic function with 1 method)
julia> (sqrt^1)(2)
1.4142135623730951
julia> (sqrt^2)(2)
1.189207115002721
julia> (sqrt^3)(2)
1.0905077326652577
As @DNF points out, because Julia has no tail call optimization, it is better to do this iteratively:
julia> function (∧)(f::Function, i::Int)
           function inner(x)
               for ii in i:-1:1
                   x = f(x)
               end
               x
           end
       end
After warmup:
julia> @time((sqrt ∧ 1_000)(20e300)) #Iterative
0.000018 seconds (6 allocations: 192 bytes)
1.0
julia> @time((sqrt ^ 1_000)(20e300)) #Recursive
0.000522 seconds (2.00 k allocations: 31.391 KB)
1.0
#########
julia> @time((sqrt ∧ 10_000)(20e300)) #Iterative
0.000091 seconds (6 allocations: 192 bytes)
1.0
julia> @time((sqrt ^ 10_000)(20e300)) #Recursive
0.003784 seconds (20.00 k allocations: 312.641 KB)
1.0
#########
julia> @time((sqrt ∧ 30_000)(20e300)) # Iterative
0.000224 seconds (6 allocations: 192 bytes)
1.0
julia> @time((sqrt ^ 30_000)(20e300)) #Recursive
0.008128 seconds (60.00 k allocations: 937.641 KB)
1.0
#############
julia> @time((sqrt ∧ 100_000)(20e300)) #Iterative
0.000393 seconds (6 allocations: 192 bytes)
1.0
julia> @time((sqrt ^ 100_000)(20e300)) #Recursive
ERROR: StackOverflowError:
in (::##5#6{Base.#sqrt,Int64})(::Float64) at ./REPL[1]:1 (repeats 26667 times)
The overhead isn't too bad in this case, but that `StackOverflowError` at the end is a kicker.
[Edit: See bottom for simple solution without Iterators, though I suggest using it and all the useful functions inside the package]
With Iterators package, the following could be a solution:
julia> using Iterators # install with Pkg.add("Iterators")
julia> reduce((x,y)->y,take(iterate(sqrt,11231.0),5))
1.791229164345863
iterate does the composition logic (Do ?iterate on the REPL for description). The newer version of Iterators (still untagged) has a function called nth, which would make this even simpler:
nth(iterate(sqrt,11231.0),5)
As a side note, the (x,y)->y anonymous function could nicely be defined with a name since it could potentially be used often with reduce as in:
first(x,y) = x
second(x,y) = y
Now,
julia> reduce(second,take(iterate(sqrt,11231.0),5))
1.791229164345863
works. Also, since it avoids recursion (which entails stack allocation and waste) and allocation proportional to the depth of iteration, this could be more efficient, especially for iteration counts higher than 5.
Without the Iterators package, a simple solution using foldl is
julia> foldl((x,y)->sqrt(x),1:4, init=11231.0)
1.791229164345863
As before, the reduction operation is key; this time it applies sqrt but ignores the iterator values, which are only used to set the number of times the function is applied (an iterator or vector other than 1:4 could perhaps be used for better readability).
function apply(f, x, n=1)
    for _ in 1:n
        x = f(x)
    end
    return x
end
I find none of the above answers satisfying. They all work, but none of them is truly elegant. My personal favorite is this:
Base.repeat(f::Function, n::Integer) = reduce(∘, fill(f, n))
Of course, you don't even need to define repeat, you can just use the reduce(...) construct directly.
And this is how it would be used in the case of the original example:
julia> repeat(sqrt, 4)(11231)
1.791229164345863
or
julia> reduce(∘, fill(sqrt, 4))(11231)
1.791229164345863

Using arrayfun to apply two arguments of a function on every combination

Let i = [1 2] and j = [3 5]. Now in Octave:
arrayfun(@(x,y) x+y, i, j)
we get [4 7]. But I want to apply the function on the combinations of i vs. j to get [i(1)+j(1) i(1)+j(2) i(2)+j(1) i(2)+j(2)]=[4 6 5 7].
How do I accomplish this? I know I can go with for loops, but I want vectorized code because it's faster.
In Octave, for finding summations between two vectors, you can use a truly vectorized approach with broadcasting like so -
out = reshape(ii(:).' + jj(:),[],1)
Here's a runtime test on ideone for the input vectors of size 1 x 100 each -
-------------------- With FOR-LOOP
Elapsed time is 0.148444 seconds.
-------------------- With BROADCASTING
Elapsed time is 0.00038299 seconds.
If you want to keep it generic to accommodate operations other than just summations, you can use anonymous functions like so -
func1 = @(I,J) I+J;
out = reshape(func1(ii,jj.'),1,[])
In MATLAB, you could accomplish the same with two bsxfun alternatives as listed next.
I. bsxfun with Anonymous Function -
func1 = @(I,J) I+J;
out = reshape(bsxfun(func1,ii(:).',jj(:)),1,[]);
II. bsxfun with Built-in @plus -
out = reshape(bsxfun(@plus,ii(:).',jj(:)),1,[]);
With the input vectors of size 1 x 10000 each, the runtimes at my end were -
-------------------- With FOR-LOOP
Elapsed time is 1.193941 seconds.
-------------------- With BSXFUN ANONYMOUS
Elapsed time is 0.252825 seconds.
-------------------- With BSXFUN BUILTIN
Elapsed time is 0.215066 seconds.
First, your first example is not the best because the most efficient way to accomplish what you're doing with arrayfun would be to vectorize:
a = [1 2];
b = [3 5];
out = a+b
Second, in MATLAB at least, arrayfun is not necessarily faster than a simple for loop. arrayfun is mainly a convenience (especially for its more advanced options). Try this simple timing example yourself:
a = 1:1e5;
b = a+1;
y = arrayfun(@(x,y)x+y,a,b); % Warm up
tic
y = arrayfun(@(x,y)x+y,a,b);
toc
y = zeros(1,numel(a));
for k = 1:numel(a)
    y(k) = a(k)+b(k); % Warm up
end
tic
y = zeros(1,numel(a));
for k = 1:numel(a)
    y(k) = a(k)+b(k);
end
toc
In MATLAB R2015a, the for loop method is over 70 times faster when run from the Command Window and over 260 times faster when run from an M-file function. Octave may be different, but you should experiment.
Finally, you can accomplish what you want using meshgrid:
a = [1 2];
b = [3 5];
[x,y] = meshgrid(a,b);
out = x(:).'+y(:).'
which returns [4 6 5 7] as in your question. You can also use ndgrid to get output in a different order.