Poor performance in writing a cython library

test.pyx is:
import gzip, re
import numpy as np
cimport numpy as np
cpdef np.ndarray[np.uint32_t, ndim=2] collect_qualities(str file_in, int length):
    '''
    What it Does:
    ------------
    bla bla bla
    Input:
    -----
    file_in: path-filename
    length: length of each sequence
    Output:
    ------
    numpy array with shape (n,m) where n=length of reads and m=solexa scores
    '''
    cdef str solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"
    cdef np.ndarray[np.uint32_t, ndim=1] N = np.zeros(shape=length, dtype=np.uint32) # This is the divisor of the mean
    cdef np.ndarray[np.uint32_t, ndim=2] sums = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32) # This is the dividend of the mean
    cdef counter=0 # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
    with gzip.open(file_in, "rb") as f:
        for line in f:
            if counter%4==0: # first line of the sequence (obtain tail info)
                tile = line.decode('utf-8').split(':')[4]
                counter=0
            elif counter%3==0: # 3rd line of the sequence (obtain the qualities)
                for n, score in enumerate(line.decode('utf-8')):
                    sums[n, ord(score)] +=1
            counter+=1
    return sums
setup.py is:
from distutils.core import setup
from Cython.Build import cythonize
import numpy
setup(
    ext_modules = cythonize("test.pyx"),
    include_dirs=[numpy.get_include()]
)
It compiles with a warning about a deprecated numpy API.
python.py is:
import gzip, re
import numpy as np
def collect_qualities(file_in, length):
    '''
    What it Does:
    ------------
    bla bla bla
    Input:
    -----
    file_in: path-filename
    length: length of each sequence
    Output:
    ------
    numpy array with shape (n,m) where n=length of reads and m=solexa scores
    '''
    solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"
    N = np.zeros(shape=length, dtype=np.uint32) # This is the divisor of the mean
    sums = np.zeros(shape=(length, len(solexa_scores)+33)) # This is the dividend of the mean
    counter=0 # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
    with gzip.open(file_in, "rb") as f:
        for line in f:
            if counter%4==0: # first line of the sequence (obtain tail info)
                tile = line.decode('utf-8').split(':')[4]
                counter=0
            elif counter%3==0: # 3rd line of the sequence (obtain the qualities)
                for n, score in enumerate(line.decode('utf-8')):
                    sums[n, ord(score)] +=1
            counter+=1
    return sums
Then I import the functions in ipython and compare their running times.
For a fairly small input file, Python takes ~140 seconds while the Cython-compiled version takes ~950 seconds.
What am I doing wrong in cython?
Thank you!

As I said in the comments: I don't really believe your timings and think there's probably something else going on here (for example that you're inadvertently using an old version you have installed somewhere that does something different). I'd expect Cython to be marginally faster here, but not significantly. It's also impossible for me to actually test any of my suggestions with the information provided.
However, some suggestions:
First, a number of things you've done are pointless, or are tiny pessimizations.
There's no point in using a cpdef function, since the internals of def and cpdef/cdef functions are compiled identically. Using c[p]def gives you slightly quicker function calls from C/Cython, which is valuable for small functions that are called a lot. I doubt this applies here.
Specifying the return type is probably pointless.
Typing file_in is marginally worse than leaving it untyped: Cython can't usefully optimize it, since it is only passed to the open function, so it probably just loses time doing a type-check.
There's then a few missed opportunities for optimization:
counter should be typed as an int (cdef counter just makes it a generic Python object).
It might be worth typing line as bytes, since the file is opened in binary mode (you'll need to do this at the top of the function rather than where line is used in the for-loop).
It's probably worth having an intermediate decoded_line typed as str (e.g. decoded_line = line.decode(...)), since this is what is being iterated over. It's possible that Cython can deduce this on its own from bytes.decode, but it's better to be sure.
It's usually better in Cython to do direct iteration over a range rather than use things like enumerate. (This is different to Python). Do for n in range(len(decoded_line)): score = decoded_line[n]. It's possible that Cython can make this optimization itself, but do it yourself to be sure.
It might be worth turning off boundscheck and wraparound using compiler directives. My advice is to do this as locally as possible (i.e. don't just wrap every function with them in a cargo-cult manner, but think about where it helps and whether it's safe).
sums has a different dtype in your Python and Cython version (uint32 in Cython, double in Python). Think about which is right.
Use cython -a to get an annotated html version of your function that highlights unoptimized bits. Worry about important loops that are highlighted, and don't get too hung up on stuff that's only called once.
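To make the dtype point concrete, here is a minimal NumPy-only check. The array sizes are made up for illustration; the only thing it shows is that np.zeros defaults to float64 when no dtype is given, so the two versions in the question are not accumulating into the same kind of array.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
length, n_scores = 4, 75

# As written in python.py: no dtype given, so NumPy defaults to float64.
sums_python = np.zeros(shape=(length, n_scores))

# As written in test.pyx: explicit unsigned 32-bit integer counters.
sums_cython = np.zeros(shape=(length, n_scores), dtype=np.uint32)

print(sums_python.dtype)  # float64
print(sums_cython.dtype)  # uint32
```

Whichever dtype is chosen, using the same one in both versions keeps the timing comparison like-for-like.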

Related

is nogil safe when accessing cython extension type members

From a similar question's answer:
You should be able to access [an extension type's] cdef members [inside a nogil block]... and call their cdef functions that are marked as nogil.
However, the cython documentation disagrees:
After the GIL is released, any operation that involves python objects must first reacquire the GIL.
I assume "python object" includes cython extension types. This leads me to think that the following pseudo code is not safe, since it includes a race condition caused by modifying a python object without the GIL:
def function(ExtensionType arg):
    with nogil:
        # long running task
        # modify arg's member
arg = ExtensionType()
function(arg)
# access arg's member
# (alternatively, accessing and modifying the member could be swapped, with the same issue)
I've expanded this into actual code to illustrate my confusion:
cimport cytime
cdef class Tester:
    cdef int val
    def get_val(self):
        return self.val
    def set_val(self):
        print("1: GIL acquired by set_val")
        with nogil:
            cytime.sleep(0.1) # give me the GIL later, I'm not ready for it yet
            self.val = 1
        print("3: GIL reacquired by set_val")
t = Tester()
t.set_val()
print("2: val should be 0, but actually is: " + str(t.get_val()))
I expected the program's execution to follow: 1, 2, 3. However, here is the output:
1: GIL acquired by set_val
3: GIL reacquired by set_val
2: val should be 0, but actually is: 1
Can anyone explain this? Thanks.
Accessing/writing to self.val without the GIL is fine. You don't need to do any reference-counting for self (because you already have a reference to it, and you don't need another one) and you don't need to do any reference counting for val because it's a C int. You can actually do a reasonable amount on cdef class instances without the GIL (e.g. access nogil cdef methods).
Cython does generally block you from doing things that require the GIL in a nogil block (it isn't perfect, but generally it's reasonably thorough).
Note that if you chose to access .val from multiple threads then you may well have a race condition, and this is entirely your own fault. All I mean by "safe" is that the Python reference counting state doesn't get corrupted.
You seem to have a big misunderstanding of what a nogil block does, though, and are viewing it as something similar to a coroutine!?
All that happens when you release the GIL is that it lets any other Python threads you have run. The thread that is processing set_val continues in logical order: it stops and waits for a bit, it sets the value, it waits to re-acquire the GIL, it prints statement 3, it returns to the global scope, and it prints statement 2.
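The same ordering can be reproduced in plain Python, as a rough analogue rather than the Cython code from the question. This sketch relies on time.sleep releasing the GIL while it waits; the Tester class, the events list and the background function are stand-ins invented for the illustration. A second thread is free to run during the sleep, but the calling thread still executes its own statements strictly in order.

```python
import threading
import time

events = []  # order in which things happen (list.append is thread-safe in CPython)

class Tester:
    def __init__(self):
        self.val = 0

    def get_val(self):
        return self.val

    def set_val(self):
        events.append("1: set_val started")
        time.sleep(0.1)  # the GIL is released while this thread sleeps
        self.val = 1
        events.append("3: set_val finished")

def background():
    # Another Python thread; it can run while the main thread is sleeping.
    events.append("background thread ran")

t = Tester()
worker = threading.Thread(target=background)
worker.start()
t.set_val()  # the calling thread does not continue until this returns
events.append("2: val is %d" % t.get_val())
worker.join()

# Keep only the entries recorded by the main thread.
main_thread_events = [e for e in events if e[0].isdigit()]
print(main_thread_events)
# ['1: set_val started', '3: set_val finished', '2: val is 1']
```

Releasing the GIL gives other threads a chance to run; it does not hand control back to the caller of set_val.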
As a guess, a cdef class instance is refcounted, hence needs to be GIL-protected.

Replacing the njit decorator with the cuda.jit decorator

I have an Nvidia GPU, downloaded CUDA, and am trying to make use of it.
Say I have this code:
from numba import cuda, njit, int32
#@cuda.jit (Attempted fix #1)
#@cuda.jit(device = True) (Attempted fix #2)
#@cuda.jit(int32(int32,int32)) (Attempted fix #3)
@njit
def product(rho, theta):
    x = rho * (theta)
    return(x)
a = product(1,2)
print(a)
How do I make it work with the cuda.jit decorator instead of njit?
Things I've tried:
When I switch the decorator from @njit to @cuda.jit, I get: TypingError: No conversion from int64 to none for '$0.5', defined at None.
When I switch the decorator to @cuda.jit(device = True), I get: TypeError: 'DeviceFunctionTemplate' object is not callable.
And when I specify the types for my inputs and outputs, and use the decorator @cuda.jit(int32(int32,int32)), I get: TypeError: CUDA kernel must have void return type.
numba cuda kernels don't return anything. You must return results via parameters/arguments to the function. The starting point for this is usually some kind of numpy array. Here is an example:
$ cat t44.py
from numba import cuda
import numpy as np
@cuda.jit
def product(rho, theta, x):
    x[0] = rho * (theta)
x = np.ones(1,dtype=np.float32)
product(1,2,x)
print(x)
$ python t44.py
[ 2.]
$
There are potentially many other things that could be said; you may wish to avail yourself of the documentation linked above, or e.g. this tutorial. Usually you will want to handle problems much bigger than multiplying two scalars before GPU computation becomes interesting.
Also, numba provides other methods to access GPU computation that don't depend on use of the @cuda.jit decorator. These methods, such as @vectorize, are documented.
I'm also omitting any kernel launch configuration syntax on the call to product. This is legal in numba cuda, but it results in launching a kernel of 1 block that contains 1 thread. This works for this particular example, but this is basically a nonsensical way to use CUDA GPUs.
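For a sense of what a launch configuration looks like in practice, here is a small pure-Python sketch of the usual arithmetic. The problem size and the threads-per-block value are assumptions chosen for illustration, and the kernel call appears only in a comment, since running it needs a CUDA-capable GPU.

```python
import math

n_elements = 1_000_000    # hypothetical problem size
threads_per_block = 256   # an assumed, commonly used value; tune for your GPU

# Round up so that every element is covered by at least one thread.
blocks_per_grid = math.ceil(n_elements / threads_per_block)

# With numba the launch would then be written as
#     kernel[blocks_per_grid, threads_per_block](args...)
# and each thread would use cuda.grid(1) to find its index, skipping
# indices >= n_elements because the last block is only partly used.
total_threads = blocks_per_grid * threads_per_block
print(blocks_per_grid, total_threads >= n_elements)  # 3907 True
```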

More concise creations of memory view matrices in cython

What I am doing
I have a lot of tiny matrices (3x3,5x5,3x4 and so on), with sizes that are known at compile time.
Until now I used numpy to create these
A = np.zeros((3,5))
And use the numpy array as if it was a memory view.
Now, I would like to get rid of these numpy calls and instead use C arrays (or something similarly fast that is not dynamically allocated). I did the following:
cdef double[3][5] A_c
cdef double[:,:] A = A_c
A[:] = 0.0
Of course, whether the last line is needed depends on how important it is to set the elements to zero.
For dynamically sized arrays I am doing this:
cdef double[:,:] A = view.array(shape=(4, N), itemsize=sizeof(double), format="d")
and I am quite happy with that.
What I would like to do
I would like to use a more concise way to do this. I know I could implement a class like described here for conciseness:
Cython: Create memoryview without NumPy array?
But this is not a c array with a size known at compile time.
Maybe there is a way to use the DEF macros with arguments?
Like so:
** NOT WORKING, DO NOT COPY AND PASTE **
DEF small_matrix(name,size):
    cdef double[size[0],size[1]] name_c
    cdef double[:,:] name = name_c
...
small_matrix(A,(3,5))
small_matrix(B,(5,1))
...
small_matrix(C,(3,1))
for i in range(3):
    C[i,0] = 0.0
    for j in range(5):
        C[i,0] += A[i,j]*B[j,0]
Maybe I am also missing a simple way to create a cdef class of vectors/matrices with not-dynamically allocated data.

Using function pointers to methods of classes without the gil

Part of my work requires a lot of calculations, but they are often fairly straightforward and can in principle quite easily be done in parallel with Cython's prange, requiring nogil. However, given that I tried to write my Cython code with a focus on having cdef classes as building blocks, I encountered a problem.
Let's say I have a numerical integration routine or similar which takes a function pointer as an input
ctypedef double (* someFunctionPointer) (double tt) nogil
cdef double integrationRoutine(someFunctionPointer ff) nogil:
    # Doing something
    # ...
    return result
Now each of my integration points is actually a "larger" simulation (lots of parameters and so on I load/set/input or whatever), which is implemented in a class. So my initial approach was doing something like
cdef class SimulationClass:
    cdef double simulationFunctionPointer(SimulationClass self, double tt) nogil:
        # ...
where I thought I could just hand "simulationFunctionPointer" to "integrationRoutine" and would be fine. This does of course not work because of the self argument.
All my work-arounds either require to
Not use cdef classes, rather something like a C struct (tricky if SimulationClass references a lot of other classes, parameters and so on)
Execute something with gil (because I want to work with the SimulationClass; I wrote a stand-alone function which took SimulationClass as a void*, but then I need to cast it to SimulationClass again, which requires the gil)
Any advice or ideas how to approach this problem? Is my first approach possible in other languages like C++?
Cheers
You can use with gil: around the blocks that need the GIL, and then with nogil: around the important inner blocks that will take most of your run time. To give a trivial example
from cython.parallel import prange
cdef class Simulation:
    cdef double some_detail
    def __cinit__(self,double some_detail):
        self.some_detail = some_detail
    def do_long_calculation(self, double v):
        with nogil:
            pass # replace pass with something long and time-consuming
        return v*self.some_detail
def run_simulations(int number_of_simulations):
    cdef int n
    for n in prange(number_of_simulations,nogil=True):
        with gil: # immediately get the gil back to do the "pythony bits"
            sim = Simulation(5.3*n)
            sim.do_long_calculation(1.2) # and release again in here
Provided the nogil section in do_long_calculation runs for longer than the section where you set up and pass the simulations (which can run in parallel with do_long_calculation, but not with itself), this is reasonably efficient.
An additional small comment about turning a bound method into a function pointer: you really struggle to do this in Cython. The best workaround I have is to use ctypes (or possibly also cffi), which can turn any Python callable into a function pointer. The way they do this appears to involve some runtime code generation which you probably don't want to replicate. You can combine this method with Cython, but it probably adds a bit of overhead to the function call (so make sure do_long_calculation is actually long!)
The following works (credit to http://osdir.com/ml/python-cython-devel/2009-10/msg00202.html)
import ctypes
# define the function type for ctypes
ftype = ctypes.CFUNCTYPE(ctypes.c_double,ctypes.c_double)
S = Simulation(3.0)
f = ftype(S.do_long_calculation) # create the ctypes function pointer
cdef someFunctionPointer cy_f_ptr = (<someFunctionPointer*><size_t>ctypes.addressof(f))[0] # this is pretty awful!
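The ctypes half of this can be tried in plain Python without compiling anything. In the sketch below, Simulation is a pure-Python stand-in for the cdef class (an assumption for illustration only); it shows that CFUNCTYPE wraps a bound method, and that ctypes.addressof(f) is the address of a buffer holding the function pointer rather than the function's own address, which is why the Cython line above dereferences with [0].

```python
import ctypes

class Simulation:
    """Plain-Python stand-in for the cdef class (illustration only)."""
    def __init__(self, some_detail):
        self.some_detail = some_detail

    def do_long_calculation(self, v):
        return v * self.some_detail

# C signature being emulated: double f(double)
ftype = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)

S = Simulation(3.0)
# Keep a reference to f for as long as the raw pointer is in use,
# otherwise the generated C stub can be garbage collected.
f = ftype(S.do_long_calculation)

# Address of the generated C function itself.
func_address = ctypes.cast(f, ctypes.c_void_p).value

# addressof(f) points at the ctypes object's buffer, which stores that
# function pointer, so one dereference is needed to get the same value.
stored_address = ctypes.c_void_p.from_address(ctypes.addressof(f)).value

result = f(2.0)  # goes through the C stub, which calls the bound method
print(result, func_address == stored_address)  # 6.0 True
```

Every call through the pointer re-enters the Python interpreter, which is where the overhead mentioned above comes from.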

Use Typed memoryview in cython if dimensions are unknown

I want to use a typed memoryview for optimizing a function, but I don't know what the argument type will be. It could be a numpy array or even a scalar. How should I use a typed memoryview then?
The issue with this sort of problem is that Python is dynamically typed, so you're always going to lose speed when picking which code-path to take. However, in principle you can make the individual code-paths pretty quick. An approach which might get you good results would be:
define an "implementation" function, that operates on a 1D memoryview.
Define a wrapper function that operates on any python object.
If it's passed a 1D memoryview, call the implementation function;
if it's passed a scalar, make a 1x1 array and call the implementation function;
If it's passed a multi-D array then either flatten it for the implementation function, or iterate over the rows, calling the implementation function for each row.
A quick implementation follows. This assumes that you want a function applied to every element of the input array (and want an output array the same size). The illustrative function I've chosen just adds 1 to each value. It also uses numpy in places (rather than just typed memoryviews) which I think is reasonable:
cimport cython
import numpy as np
import numbers
@cython.boundscheck(False)
cdef double[:] _plus_one_impl(double[:] x):
    cdef int n
    cdef double[:] output
    output = x.copy()
    for n in range(x.shape[0]):
        output[n] = x[n]+1
    return output
def plus_one(x):
    if isinstance(x,numbers.Real): # check if it's a number
        # force float64 so that integer scalars also convert to double[:]
        return _plus_one_impl(np.array([x], dtype=np.float64))[0]
    else:
        try:
            return _plus_one_impl(x)
        except ValueError: # this gets thrown if conversion fails
            if len(x.shape)<2:
                raise ValueError('x could not be converted to double [:]')
            output = np.empty_like(x) # output is all numpy, whatever the input is
            for n in range(x.shape[0]): # this loop isn't typed, so is likely to be pretty slow
                output[n,...] = plus_one(x[n,...])
            return output
This code is likely to end up somewhat slow in some cases (i.e. a 2D array with a short second dimension).
However, my real recommendation is to look into numpy ufuncs, which provide an interface for achieving this kind of thing efficiently. (See http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html). Unfortunately they are more complicated than Cython.
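The "flatten" option from the list of steps can be sketched in plain NumPy, avoiding the untyped row-by-row loop. This is only an illustration of the dispatch logic: _plus_one_impl here is a pure-Python stand-in for the typed Cython implementation, and converting every input to float64 is an assumption made to match double[:].

```python
import numbers
import numpy as np

def _plus_one_impl(x):
    """Stand-in for the typed 1D implementation; expects a 1D float64 array."""
    return x + 1.0

def plus_one(x):
    if isinstance(x, numbers.Real):
        # Scalar: wrap it in a 1-element array, then unwrap the result.
        return _plus_one_impl(np.array([x], dtype=np.float64))[0]
    arr = np.asarray(x, dtype=np.float64)
    flat = arr.ravel()  # 1D view where possible, a copy otherwise
    # Run the 1D implementation once, then restore the original shape.
    return _plus_one_impl(flat).reshape(arr.shape)

print(plus_one(1))
print(plus_one(np.arange(6).reshape(2, 3)))
```

Flattening means the implementation function is called once whatever the number of dimensions, at the cost of a possible copy for non-contiguous inputs.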