is nogil safe when accessing cython extension type members

From a similar question's answer:
You should be able to access [an extension type's] cdef members [inside a nogil block]... and call their cdef functions that are marked as nogil.
However, the cython documentation disagrees:
After the GIL is released, any operation that involves python objects must first reacquire the GIL.
I assume "python object" includes cython extension types. This leads me to think that the following pseudo code is not safe, since it includes a race condition caused by modifying a python object without the GIL:
def function(ExtensionType arg):
    with nogil:
        # long running task
        # modify arg's member

arg = ExtensionType()
function(arg)
# access arg's member
# (alternatively, accessing and modifying the member could be swapped, with the same issue)
I've expanded this into actual code to illustrate my confusion:
cimport cytime

cdef class Tester:
    cdef int val

    def get_val(self):
        return self.val

    def set_val(self):
        print("1: GIL acquired by set_val")
        with nogil:
            cytime.sleep(0.1)  # give me the GIL later, I'm not ready for it yet
            self.val = 1
        print("3: GIL reacquired by set_val")

t = Tester()
t.set_val()
print("2: val should be 0, but actually is: " + str(t.get_val()))
I expected the program's execution to follow: 1, 2, 3. However, here is the output:
1: GIL acquired by set_val
3: GIL reacquired by set_val
2: val should be 0, but actually is: 1
Can anyone explain this? Thanks.

Accessing/writing to self.val without the GIL is fine. You don't need to do any reference-counting for self (because you already have a reference to it, and you don't need another one) and you don't need to do any reference counting for val because it's a C int. You can actually do a reasonable amount on cdef class instances without the GIL (e.g. access nogil cdef methods).
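To illustrate (a minimal sketch with made-up names, not code from the question): a nogil cdef method that only touches C-level fields can be called quite happily inside a nogil block.
cdef class Counter:
    cdef int val

    cdef void bump(self) nogil:
        # only reads/writes a plain C int through an existing reference,
        # so no reference counting (and hence no GIL) is needed
        self.val += 1

    def run(self):
        with nogil:
            self.bump()   # calling a nogil cdef method on self is fine here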
Cython does generally block you from doing things that require the GIL in a nogil block (it isn't perfect, but generally it's reasonably thorough).
Note that if you chose to access .val from multiple threads then you may well have a race condition, and this is entirely your own fault. All I mean by "safe" is that the Python reference counting state doesn't get corrupted.
You seem to have a huge misunderstanding of what a nogil block does though, and are viewing it as something similar to a coroutine!?
All that happens when you release the GIL is that it allows other Python threads you have to run. The thread that is processing set_val continues in logical order: it stops and waits for a bit, it sets the value, it waits to re-acquire the GIL, it prints statement 3, it returns to the global scope, it prints statement 2.

As a guess, a cdef class instance is refcounted, hence needs to be GIL-protected.


Poor performance in writing a cython library

test.pyx is:
import gzip, re
import numpy as np
cimport numpy as np

cpdef np.ndarray[np.uint32_t, ndim=2] collect_qualities(str file_in, int length):
    '''
    What it Does:
    ------------
    bla bla bla
    Input:
    -----
    file_in: path-filename
    length: length of each sequence
    Output:
    ------
    numpy array with shape (n,m) where n=length of reads and m=solexa scores
    '''
    cdef str solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"
    cdef np.ndarray[np.uint32_t, ndim=1] N = np.zeros(shape=length, dtype=np.uint32)  # This is the divisor of the mean
    cdef np.ndarray[np.uint32_t, ndim=2] sums = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)  # This is the dividend of the mean
    cdef counter=0  # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
    with gzip.open(file_in, "rb") as f:
        for line in f:
            if counter%4==0:  # first line of the sequence (obtain tile info)
                tile = line.decode('utf-8').split(':')[4]
                counter=0
            elif counter%3==0:  # 3rd line of the sequence (obtain the qualities)
                for n, score in enumerate(line.decode('utf-8')):
                    sums[n, ord(score)] +=1
            counter+=1
    return sums
setup.py is:
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    ext_modules = cythonize("test.pyx"),
    include_dirs=[numpy.get_include()]
)
It compiles with a warning about a deprecated numpy API.
python.py is:
import gzip, re
import numpy as np

def collect_qualities(file_in, length):
    '''
    What it Does:
    ------------
    bla bla bla
    Input:
    -----
    file_in: path-filename
    length: length of each sequence
    Output:
    ------
    numpy array with shape (n,m) where n=length of reads and m=solexa scores
    '''
    solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"
    N = np.zeros(shape=length, dtype=np.uint32)  # This is the divisor of the mean
    sums = np.zeros(shape=(length, len(solexa_scores)+33))  # This is the dividend of the mean
    counter=0  # Useful to know if it's the 3rd or 4th line of the current sequence in fastq.
    with gzip.open(file_in, "rb") as f:
        for line in f:
            if counter%4==0:  # first line of the sequence (obtain tile info)
                tile = line.decode('utf-8').split(':')[4]
                counter=0
            elif counter%3==0:  # 3rd line of the sequence (obtain the qualities)
                for n, score in enumerate(line.decode('utf-8')):
                    sums[n, ord(score)] +=1
            counter+=1
    return sums
Then I import the functions in ipython and compare their running times.
For a fairly small input file, python takes ~140 seconds while the cython-compiled version takes ~950 seconds.
What am I doing wrong in cython?
Thank you!
As I said in the comments: I don't really believe your timings and think there's probably something else going on here (for example that you're inadvertently using an old version you have installed somewhere that does something different). I'd expect Cython to be marginally faster here, but not significantly. It's also impossible for me to actually test any of my suggestions with the information provided.
However, some suggestions:
First, a number of things you've done are pointless, or are tiny pessimizations.
There's no point in using a cpdef function since the internals of def and cpdef/cdef functions are compiled identically. Using c[p]def gives you slightly quicker function calls from C/Cython which is valuable for small functions that are called a lot. I doubt this applies here.
Specifying the return type is probably pointless.
Typing file_in is marginally worse than leaving it untyped - Cython can't usefully optimize it since it is only passed to the open function, so you probably just lose time doing a type-check.
There are then a few missed opportunities for optimization:
counter should be typed as an int (cdef counter just makes it a generic Python object).
It might be worth typing line as bytes (you'll need to do this at the top of the function rather than where line is used in the for-loop), since iterating over a gzip file opened in "rb" mode yields bytes.
It's probably worth having an intermediate decoded_line typed as str (e.g. decoded_line = line.decode(...)) since this is what is being iterated over. It's possible that Cython can deduce this on its own from the decode call, but it's better to be sure.
It's usually better in Cython to do direct iteration over a range rather than use things like enumerate. (This is different to Python). Do for n in range(len(decoded_line)): score = decoded_line[n]. It's possible that Cython can make this optimization itself, but do it yourself to be sure.
It might be worth turning off boundschecking and wraparound using compiler directives. My advice is to do this as locally as possible (i.e. don't just wrap every function with them in a cargo-cult manner, but think about where it helps and whether it's safe).
sums has a different dtype in your Python and Cython version (uint32 in Cython, double in Python). Think about which is right.
Use cython -a to get an annotated html version of your function that highlights unoptimized bits. Worry about important loops that are highlighted - don't get too hung up on stuff that's only called once.
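Putting several of these suggestions together, a sketch of the Cython version might look something like the following. This is illustrative only - it hasn't been tested against the real data, the unused N array is dropped, and the directives are applied to the whole function for brevity rather than as locally as suggested above:
import gzip
import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def collect_qualities(file_in, int length):
    cdef str solexa_scores = '!"#$%&' + "'()*+,-./0123456789:;<=>?@ABCDEFGHI"
    cdef np.ndarray[np.uint32_t, ndim=2] sums = np.zeros(shape=(length, len(solexa_scores)+33), dtype=np.uint32)
    cdef int counter = 0          # typed as a C int rather than a generic Python object
    cdef int n
    cdef bytes line               # raw line from the gzip file opened in "rb" mode
    cdef str decoded_line         # explicit typed intermediate for the inner loop
    with gzip.open(file_in, "rb") as f:
        for line in f:
            if counter % 4 == 0:
                tile = line.decode('utf-8').split(':')[4]
                counter = 0
            elif counter % 3 == 0:
                decoded_line = line.decode('utf-8')
                for n in range(len(decoded_line)):      # direct iteration over a range
                    sums[n, ord(decoded_line[n])] += 1
            counter += 1
    return sums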

Replacing the njit decorator with the cuda.jit decorator

I have an Nvidia GPU, downloaded CUDA, and am trying to make use of it.
Say I have this code:
#@cuda.jit (Attempted fix #1)
#@cuda.jit(device = True) (Attempted fix #2)
#@cuda.jit(int32(int32,int32)) (Attempted fix #3)
@njit
def product(rho, theta):
    x = rho * (theta)
    return(x)

a = product(1,2)
print(a)
How do I make it work with the cuda.jit decorator instead of njit?
Things I've tried:
When I switch the decorator from @njit to @cuda.jit, I get: TypingError: No conversion from int64 to none for '$0.5', defined at None.
When I switch the decorator to @cuda.jit(device = True), I get: TypeError: 'DeviceFunctionTemplate' object is not callable.
And when I specify the types for my inputs and outputs, and use the decorator @cuda.jit(int32(int32,int32)), I get: TypeError: CUDA kernel must have void return type.
numba cuda kernels don't return anything. You must return results via parameters/arguments to the function. The starting point for this is usually some kind of numpy array. Here is an example:
$ cat t44.py
from numba import cuda
import numpy as np

@cuda.jit
def product(rho, theta, x):
    x[0] = rho * (theta)

x = np.ones(1,dtype=np.float32)
product(1,2,x)
print(x)
$ python t44.py
[ 2.]
$
There are potentially many other things that could be said; you may wish to avail yourself of the numba CUDA documentation, or e.g. this tutorial. Usually you will want to handle problems much bigger than multiplying two scalars before GPU computation will be interesting.
Also, numba provides other methods to access GPU computation that don't depend on use of the @cuda.jit decorator. These methods, such as @vectorize, are documented.
I'm also omitting any kernel launch configuration syntax on the call to product. This is legal in numba cuda, but it results in launching a kernel of 1 block that contains 1 thread. This works for this particular example, but this is basically a nonsensical way to use CUDA GPUs.
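For reference, an explicit launch configuration might look like the sketch below. It turns product into an element-wise kernel over arrays; the array length, block size and the cuda.grid indexing are illustrative choices, not something from the question:
from numba import cuda
import numpy as np

@cuda.jit
def product(rho, theta, out):
    i = cuda.grid(1)              # absolute index of this thread
    if i < out.size:              # guard against threads past the end of the array
        out[i] = rho[i] * theta[i]

rho = np.arange(1024, dtype=np.float32)
theta = np.full(1024, 2.0, dtype=np.float32)
out = np.empty_like(rho)

threads_per_block = 128
blocks = (out.size + threads_per_block - 1) // threads_per_block
product[blocks, threads_per_block](rho, theta, out)   # explicit launch configuration
print(out[:4])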

Using function pointers to methods of classes without the gil

Part of my work requires a lot of calculations, but they are often fairly straightforward and can in principle quite easily be done in parallel with Cython's prange, which requires nogil. However, given that I tried to write my Cython code with a focus on having cdef classes as building blocks, I encountered a problem.
Let's say I have a numerical integration routine or similar which takes a function pointer as an input:
ctypedef double (*someFunctionPointer)(double tt) nogil

cdef double integrationRoutine(someFunctionPointer ff) nogil:
    # Doing something
    # ...
    return result
Now each of my integration points is actually a "larger" simulation (lots of parameters and so on I load/set/input or whatever), which is implemented in a class. So my initial approach was doing something like
cdef class SimulationClass:
    cdef double simulationFunctionPointer(SimulationClass self, double tt) nogil:
        # ...
where I thought I could just hand simulationFunctionPointer to integrationRoutine and would be fine. This of course does not work because of the self argument.
All my work-arounds require me to either:
Not use cdef classes, but rather something like a C struct (tricky if SimulationClass references a lot of other classes, parameters and so on)
Execute something with the gil (because I want to work with the SimulationClass; I wrote a stand-alone function which took SimulationClass as a void*, but then I need to cast it to SimulationClass again, which requires the gil)
Any advice or ideas how to approach this problem? Is my first approach possible in other languages like C++?
Cheers
You can use with gil: around the blocks that need the GIL, and then with nogil: around the important inner blocks that will take most of your run time. To give a trivial example
from cython.parallel import prange

cdef class Simulation:
    cdef double some_detail

    def __cinit__(self, double some_detail):
        self.some_detail = some_detail

    def do_long_calculation(self, double v):
        with nogil:
            pass  # replace pass with something long and time-consuming
        return v*self.some_detail

def run_simulations(int number_of_simulations):
    cdef int n
    for n in prange(number_of_simulations, nogil=True):
        with gil:  # immediately get the gil back to do the "pythony bits"
            sim = Simulation(5.3*n)
            sim.do_long_calculation(1.2)  # and release it again in here
Provided the nogil section in do_long_calculation runs for longer than the section where you set up and pass the simulations (which can run in parallel with do_long_calculation, but not with itself), this is reasonably efficient.
An additional small comment about turning a bound method into a function pointer: you will really struggle to do this in Cython. The best workaround I have is to use ctypes (or possibly also cffi), which can turn any Python callable into a function pointer. The way they do this appears to involve some runtime code generation, which you probably don't want to replicate. You can combine this method with Cython, but it probably adds a bit of overhead to the function call (so make sure do_long_calculation is actually long!).
The following works (credit to http://osdir.com/ml/python-cython-devel/2009-10/msg00202.html)
import ctypes

# define the function type for ctypes
ftype = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)

S = Simulation(3.0)
f = ftype(S.do_long_calculation)  # create the ctypes function pointer
cdef someFunctionPointer cy_f_ptr = (<someFunctionPointer*><size_t>ctypes.addressof(f))[0]  # this is pretty awful!
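As a purely hypothetical follow-on (assuming the integrationRoutine and someFunctionPointer definitions from the question are in scope in the same module), the extracted pointer can then be handed to the nogil routine. The ctypes object f must be kept alive for the whole time cy_f_ptr is in use, since it owns the generated trampoline:
cdef double result = integrationRoutine(cy_f_ptr)  # keep `f` alive while cy_f_ptr is in use
print(result)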

why is cpdef not used with __init__ in cython

I had a basic question after reading the Cython documentation on classes, and I thought I would get it cleared up. Here is a sample code from the Cython documentation:
cdef class Rectangle:
    cdef int x0, y0
    cdef int x1, y1

    def __init__(self, int x0, int y0, int x1, int y1):
        self.x0 = x0; self.y0 = y0; self.x1 = x1; self.y1 = y1

    cpdef int area(self):
        cdef int area
        area = (self.x1 - self.x0) * (self.y1 - self.y0)
        if area < 0:
            area = -area
        return area
Why is the __init__ preceded by def and not cdef or cpdef?
I realize that there is a __cinit__ function, but shouldn't making __init__ a cpdef make the __init__ code faster?
Or are we supposed to put the code that we need to make very fast in __cinit__ and the code which we can afford to run slower in __init__?
cpdef does not affect the speed of the code inside the function. It only creates a version of the function that can be called directly from C/Cython (without having to go through the Python call mechanism). The "inside" of the function is compiled in both cases and runs at exactly the same speed. See Definition of def, cdef and cpdef in cython for a full discussion.
Most Cython special functions (those with names of the form __xxx__) are restricted to being defined as def functions, essentially because they serve a special purpose for the Python interpreter and Cython won't be able to use the C shortcut version. __init__ is no exception - it's part of the normal Python construction mechanism, and so it only makes sense to call it as a Python method.
__cinit__ is a slightly odd case - it can only be invoked by Cython (implicitly) and cannot be called by the user. It therefore isn't really a normal def, cdef or cpdef method, so it's just a matter of language design which keyword is used to define it (whichever you chose, it would always be called in the same way anyway). In terms of what arguments it accepts it behaves like a def function though, since it can handle *args and **kwds, but only types that can be directly converted from Python objects (i.e. not C pointers).
If you want to make an alternate cdef (or cpdef) staticmethod constructor to be called directly from Cython, you can. This is discussed in the documentation.
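As a sketch of what such an alternate constructor might look like for the Rectangle example (the name from_corners is made up; the pattern of combining @staticmethod, cdef and __new__ follows the documentation):
cdef class Rectangle:
    cdef int x0, y0, x1, y1

    def __init__(self, int x0, int y0, int x1, int y1):
        self.x0 = x0; self.y0 = y0; self.x1 = x1; self.y1 = y1

    @staticmethod
    cdef Rectangle from_corners(int x0, int y0, int x1, int y1):
        # bypasses __init__ entirely; only callable from C/Cython code
        cdef Rectangle r = Rectangle.__new__(Rectangle)
        r.x0 = x0; r.y0 = y0; r.x1 = x1; r.y1 = y1
        return r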

Use Typed memoryview in cython if dimensions are unknown

I want to use a typed memoryview for optimizing a function, but I don't know what the argument type would be. It could be a numpy array or even a scalar. How should I use a typed memoryview then?
The issue with this sort of problem is that Python is dynamically typed, so you're always going to lose some speed picking which code-path to take. However, in principle you can make the individual code-paths pretty quick. An approach which might get you good results would be:
Define an "implementation" function that operates on a 1D memoryview.
Define a wrapper function that operates on any Python object:
if it's passed a 1D memoryview, call the implementation function;
if it's passed a scalar, make a 1x1 array and call the implementation function;
if it's passed a multi-D array, then either flatten it for the implementation function, or iterate over the rows, calling the implementation function for each row.
A quick implementation follows. This assumes that you want a function applied to every element of the input array (and want an output array the same size). The illustrative function I've chosen just adds 1 to each value. It also uses numpy in places (rather than just typed memoryviews) which I think is reasonable:
cimport cython
import numpy as np
import numbers

@cython.boundscheck(False)
cdef double[:] _plus_one_impl(double[:] x):
    cdef int n
    cdef double[:] output

    output = x.copy()
    for n in range(x.shape[0]):
        output[n] = x[n]+1
    return output

def plus_one(x):
    if isinstance(x, numbers.Real):  # check if it's a number
        return _plus_one_impl(np.array([x]))[0]
    else:
        try:
            return _plus_one_impl(x)
        except ValueError:  # this gets thrown if conversion fails
            if len(x.shape)<2:
                raise ValueError('x could not be converted to double [:]')
            output = np.empty_like(x)  # output is all numpy, whatever the input is
            for n in range(x.shape[0]):  # this loop isn't typed, so is likely to be pretty slow
                output[n,...] = plus_one(x[n,...])
            return output
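A few illustrative calls, assuming the code above has been compiled into a module (the module name plus_mod is made up):
import numpy as np
import plus_mod

print(plus_mod.plus_one(3.5))                         # scalar in, scalar out
print(np.asarray(plus_mod.plus_one(np.arange(4.0))))  # 1D array goes straight to the typed implementation
print(plus_mod.plus_one(np.ones((2, 3))))             # 2D array falls back to the row-by-row loop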
This code is likely to end up somewhat slow in some cases (e.g. a 2D array with a short second dimension).
However, my real recommendation is to look into numpy ufuncs, which provide an interface for achieving this kind of thing efficiently (see http://docs.scipy.org/doc/numpy-dev/user/c-info.ufunc-tutorial.html). Unfortunately they are more complicated than Cython.