Cython: Convert Python string list to 2D character array

I am trying to convert a list of python strings to a 2D character array, and then pass it into a C function.
Python version: 3.6.4, Cython version: 0.28.3, OS Ubuntu 16.04
My first try looks like this:
def my_function(name_list):
    cdef char name_array[50][30]
    for i in range(len(name_list)):
        name_array[i] = name_list[i]
The code builds, but during runtime I receive the following response:
Traceback (most recent call last):
  File "test.py", line 532, in test_my_function
    my_function(name_list)
  File "my_module.pyx", line 817, in my_module.my_function
  File "stringsource", line 93, in carray.from_py.__Pyx_carray_from_py_char
IndexError: not enough values found during array assignment, expected 25, got 2
I then tried to make sure that the string on the right-hand side of the assignment is exactly 30 characters by doing the following:
def my_function(name_list):
    cdef char name_array[50][30]
    for i in range(len(name_list)):
        name_array[i] = (name_list[i] + ' '*30)[:30]
This caused another error, as follows:
Traceback (most recent call last):
  File "test.py", line 532, in test_my_function
    my_function(name_list)
  File "my_module.pyx", line 818, in my_module.my_function
  File "stringsource", line 87, in carray.from_py.__Pyx_carray_from_py_char
TypeError: an integer is required
I will appreciate any help. Thanks.

I don't like this Cython feature; at the very least it seems not very well thought through:
It is convenient to use a char array and thus avoid the hassle of allocating and freeing dynamic memory. However, it is only natural that the allocated buffer is larger than the strings stored in it. Enforcing equal lengths doesn't make sense.
C strings are null-terminated. The trailing \0 is not always needed, but often it is, so some additional steps are required to ensure it.
Thus, I would roll my own solution:
%%cython
from libc.string cimport memcpy

cdef int from_str_to_chararray(source, char *dest, size_t N, bint ensure_nullterm) except -1:
    # hold a reference to the underlying bytes object so as_ptr stays valid
    cdef bytes as_bytes = source.encode('ascii')
    cdef const char *as_ptr = <const char *>(as_bytes)
    # count the encoded bytes rather than the characters of the str
    cdef size_t source_len = len(as_bytes)
    if ensure_nullterm:
        source_len += 1  # also copy the trailing \0
    if source_len > N:
        raise IndexError("destination array too small")
    memcpy(dest, as_ptr, source_len)
    return 0
and then use it as follows:
%%cython
def test(name):
    cdef char name_array[30]
    from_str_to_chararray(name, name_array, 30, 1)
    print("In array: ", name_array)
A quick test yields:
>>> test("A")
In array: A
>>> test("A"*29)
In array: AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>>> test("A"*30)
IndexError: destination array too small
Some additional remarks on the implementation:
It is necessary to hold a reference to the underlying bytes object to keep it alive; otherwise as_ptr becomes dangling as soon as it is created.
The internal representation of bytes objects has a trailing \0, so memcpy(dest, as_ptr, source_len) is safe even if source_len=len(source)+1.
except -1 in the signature is needed so that the exception is actually propagated to, and checked in, Python code.
Obviously, not everything is perfect: one has to pass the size of the array manually, which will lead to errors in the long run, something Cython's version gets right automatically. But given the functionality currently lacking in Cython's version, the hand-rolled version is the better option in my opinion.
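The bounds check and the null termination described above can also be tried out without compiling anything. What follows is only a minimal sketch in plain Python using ctypes: copy_to_char_buffer is a hypothetical name, not part of Cython or of the code above, and the ctypes buffer merely stands in for the C char array.

```python
import ctypes

def copy_to_char_buffer(source, dest, ensure_nullterm=True):
    """Copy an ASCII string into a ctypes char buffer (hypothetical helper)."""
    as_bytes = source.encode("ascii")
    needed = len(as_bytes) + (1 if ensure_nullterm else 0)
    if needed > len(dest):
        raise IndexError("destination array too small")
    # as_bytes + b"\0" is at least `needed` bytes long, so the copy stays in bounds
    ctypes.memmove(dest, as_bytes + b"\0", needed)

name_array = ctypes.create_string_buffer(30)   # stands in for char name_array[30]
copy_to_char_buffer("A" * 29, name_array)
print(name_array.value)                        # the bytes up to the first \0
```

Unlike the Cython version, the size does not have to be passed separately here because a ctypes array knows its own length; in C that information is lost once the array decays to a pointer.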

Thanks to @ead for responding. It got me to something that works. I am not convinced that it is the best way, but for now it is OK.
I addressed null termination, as @ead suggested, by appending null characters.
I received a TypeError: string argument without an encoding error, and had to encode the string before converting it to a bytearray. That is what the added .encode('ASCII') bit is for.
Here is the working code:
def my_function(name_list):
    cdef char name_array[50][30]
    for i in range(len(name_list)):
        name_array[i] = bytearray((name_list[i] + '\0'*30)[:30].encode('ASCII'))
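One caveat with slicing to 30 characters: a name of 30 or more characters fills the whole row and leaves no room for a terminating \0. Below is a minimal sketch in plain Python of a stricter padding step. The helper to_fixed_width is hypothetical, and it assumes the C function expects every row to be null-terminated. The two print calls also hint at why the plain str version failed: iterating a str yields one-character strings, whereas bytes and bytearray yield integers, which is presumably what the char conversion needs.

```python
def to_fixed_width(name, width=30):
    """Return exactly `width` bytes: the ASCII name, truncated if necessary,
    padded with NUL bytes and always ending in at least one NUL
    (hypothetical helper)."""
    encoded = name.encode("ascii")[: width - 1]   # leave room for the terminator
    return bytearray(encoded.ljust(width, b"\0"))

print(list("ab"))     # ['a', 'b']  -> items are str
print(list(b"ab"))    # [97, 98]    -> items are int

row = to_fixed_width("Alice")
print(len(row))       # 30
```

In the loop above, the right-hand side could then be replaced by to_fixed_width(name_list[i]), provided the helper is defined in the same .pyx file.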

Related

Cythonize: check if word in list of strings is a substring of another string

I want to iterate over a list of input words list_words and check if any belongs to an input string.
I tried to cythonize the code but when I annotate it I see almost all of it in yellow, suggesting python interactions.
I am not sure how I could speed this up:
cpdef cy_check_any_word_is_substring(list_words, string):
    cdef unicode w
    cdef unicode s_lowered = string.lower()
    for w in list_words:
        if w in s_lowered:
            return True
    return False
Example
# all words in list_words are lower cased
list_words = ['cat', 'dog', 'eat', 'seat']
input_string = 'The animal saw the Dog and started to make noises'
# should return true
cy_check_any_word_is_substring(list_words, input_string)
Note that I want the code to work regardless of whether characters are capitalized (that is why I call string.lower()). I assume the input list of words is already lower-cased.
Update
I wonder if a solution that uses C++ could be faster.
I don't know C++ though, I tried
from libcpp.vector cimport vector
from libcpp.string cimport string

cpdef cy_check_any_word_is_substring(vector[string] list_words,string string):
    s_lowered = string.lower()
    for w in list_words:
        if w in s_lowered:
            return True
    return False
But it produces the error
Invalid types for 'in' (string, Python object)
Update 2
I tried the following to avoid the error presented in the previous section update.
from libcpp.vector cimport vector
from libcpp.string cimport string, npos

cdef bint cy_check_w_substring(string s_lowered, vector[string] list_words):
    cdef string w
    for w in list_words:
        if s_lowered.find(w) != npos:
            return True
    return False

cpdef cy3_check_any_word_is_substring(words_bytes, input_string):
    cdef bint result = False
    s_lowered = input_string.lower()
    result = cy_check_w_substring(bytes(s_lowered, 'utf8'), words_bytes)
    return result
This can be used by converting the original list of words into a list of bytes objects.
# all words in list_words are lower cased
list_words = ['cat', 'dog', 'eat', 'seat']
list_words_bytes = [bytes(w,'utf8') for w in list_words]
input_string = 'The animal saw the Dog and started to make noises'
# should return true
cy3_check_any_word_is_substring(list_words_bytes, input_string)
Nevertheless this is much slower
%%timeit
cy3_check_any_word_is_substring(list_words_bytes, input_string)
#1.01 µs ± 3.16 ns per loop
%%timeit
cy_check_any_word_is_substring(list_words, input_string)
#190 ns ± 0.773 ns per loop
Note that cy3_check_any_word_is_substring internally converts s_lowered to bytes, and that conversion alone takes 145 ns, almost the entire cost of cy_check_any_word_is_substring, which makes this approach clearly not viable.
%%timeit
bytes(input_string, 'utf8')
#145 ns ± 0.55 ns per loop

The basic problem with the C++ solution is that if you pass it a Python iterable there's a hidden type conversion. So it has to iterate through the entire list and then convert every string to a C++ string. For this reason I doubt it'll give you much benefit.
If you can generate the data as a C++ vector without the type conversion then it may work better. For this you should use a cdef function instead of a cpdef function (I rarely like cpdef functions because they're usually the worst of both worlds).
The specific problems you have:
The C++ string class does not have a .lower() function, so the line s_lowered = string.lower() is implicitly converting it back to a Python bytes then calling .lower() on that. You'll have to implement .lower yourself (or convert to the C++ string after calling .lower on the Python object).
w in s_lowered isn't implemented for C++ strings. You want s_lowered.find(w) != npos (where npos is cimported from libcpp.string).
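Since the type conversions dominate the cost here, the plain-Python form is a reasonable baseline to measure any C++ variant against. The following is only a sketch; check_any_word_is_substring is a hypothetical name and no timings are claimed for it.

```python
def check_any_word_is_substring(list_words, string):
    """Return True if any (already lower-cased) word occurs in `string`."""
    s_lowered = string.lower()   # lower-case once, outside the loop
    return any(w in s_lowered for w in list_words)

list_words = ['cat', 'dog', 'eat', 'seat']
input_string = 'The animal saw the Dog and started to make noises'
print(check_any_word_is_substring(list_words, input_string))  # True
```

The substring test itself already runs in C inside the interpreter, which is one reason a typed rewrite has little left to gain.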

Memoryviews slices in Cython ask for a scalar

I'm trying to create a memoryview to store several vectors as rows, but when I try to change the values of any of them I get an error, as if it were expecting a scalar.
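%%cython
import numpy as np
cimport numpy as np
# np.float was only an alias for the builtin float and has been removed from recent NumPy releases
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
cdef DTYPE_t[:, ::1] results = np.zeros(shape=(10, 10), dtype=DTYPE)
results[:, 0] = np.random.rand(10)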
This throws the following error:
TypeError: only size-1 arrays can be converted to Python scalars
Which I don't understand given that I want to overwrite the first row with that vector... Any idea about what I am doing wrong?

The operation you would like to use is possible between NumPy arrays (Python functionality) or between Cython memoryviews (C functionality, i.e. Cython generates the appropriate for-loops in the C code), but not when you mix a memoryview (on the left-hand side) with a NumPy array (on the right-hand side).
So you have to use either Cython's memoryviews:
...
cdef DTYPE_t[::1] r = np.random.rand(10)
results[:, 0] = r
#check it worked:
print(results.base)
...
or numpy-arrays (we know .base is a numpy-array):
results.base[:, 0] = np.random.rand(10)
#check it worked:
print(results.base)
Cython's version has less overhead, but for large matrices there won't be much difference.
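The NumPy-side assignment can be checked in plain Python. This is a small sketch that assumes NumPy is installed and uses np.arange as a deterministic stand-in for np.random.rand(10). Note that results[:, 0] addresses the first column, while results[0, :] addresses the first row.

```python
import numpy as np

results = np.zeros(shape=(10, 10), dtype=np.float64)
vector = np.arange(10, dtype=np.float64)   # deterministic stand-in for np.random.rand(10)

results[:, 0] = vector    # overwrite the first column
results[0, :] = vector    # overwrite the first row

print(results[:3, :3])
```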

Cython set variable to named constant

I'm chasing my tail with what I suspect is a simple problem, but I can't seem to find any explanation for the observed behavior. Assume I have a constant in a C header file defined by:
#define FOOBAR 128
typedef uint32_t mytype_t;
I convert this in Cython by putting the following in the .pxd file:
cdef int _FOOBAR "FOOBAR"
ctypedef uint32_t mytype_t
In my .pyx file, I have a declaration:
FOOBAR = _FOOBAR
followed later in a class definition:
cdef class MyClass:
    cdef mytype_t myvar
    def __init__(self):
        try:
            self.myvar = FOOBAR
            print("GOOD")
        except:
            print("BAD")
I then try to execute this with a simple program:
try:
    foo = MyClass()
except:
    print("FAILED TO CREATE CLASS")
Sadly, this errors out, but I don't get an error message - I just get the exception print output:
BAD
Any suggestions on root cause would be greatly appreciated.

I believe I have finally tracked it down. The root cause is that FOOBAR in my code was actually set to UINT32MAX. Apparently, Cython/Python interprets that as -1, and Python then refuses to assign it to a uint32_t variable. The solution is to define FOOBAR as 0xffffffff, which Python apparently treats as a non-negative value and accepts.
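The relationship between -1 and 0xffffffff is easy to see with ctypes, and printing the caught exception instead of using a bare except would have exposed the cause directly. The following is a minimal sketch in plain Python; to_uint32 is a hypothetical stand-in that imitates the range check a conversion to uint32_t has to perform and is not Cython's actual code.

```python
import ctypes

UINT32_MAX = 0xFFFFFFFF

def to_uint32(value):
    """Range-checked conversion, imitating an assignment to a uint32_t field."""
    if not 0 <= value <= UINT32_MAX:
        raise OverflowError(f"{value} does not fit into uint32_t")
    return value

# -1 and 0xffffffff share the same 32-bit pattern
print(ctypes.c_uint32(-1).value == UINT32_MAX)   # True

for candidate in (-1, UINT32_MAX):
    try:
        to_uint32(candidate)
        print("GOOD", candidate)
    except OverflowError as exc:   # report the reason instead of hiding it
        print("BAD", exc)
```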

Referencing Cython constants in Python

I have a C-header file (let's call it myheader.h) that contains some character string definitions such as:
#define MYSTRING "mystring-constant"
In Cython, I create a cmy.pxd file that contains:
cdef extern from "myheader.h":
    cdef const char* MYSTRING "MYSTRING"
and a corresponding my.pyx file that contains some class definitions, all headed by:
from cmy cimport *
I then try to reference that string in a Python script:
from my import *

def main():
    print("CONSTANT ", MYSTRING)

if __name__ == '__main__':
    main()
Problem is that I keep getting an error:
NameError: name 'MYSTRING' is not defined
I've searched the documentation and can't identify the problem. Any suggestions would be welcomed - I confess it is likely something truly silly.

You cannot access cdef-variables from Python. So you have to create a Python object which would correspond to your define, something like this (it uses Cython>=0.28-feature verbatim-C-code, so you need a recent Cython version to run the snippet):
%%cython
cdef extern from *:
    """
    #define MYSTRING "mystring-constant"
    """
    # avoid name clash with Python-variable
    # in cdef-code the value can be accessed as MYSTRING_DEFINE
    cdef const char* MYSTRING_DEFINE "MYSTRING"

#python variable, can be accessed from Python
#the data is copied from MYSTRING_DEFINE
MYSTRING = MYSTRING_DEFINE
and now MYSTRING is a bytes-object:
>>> print(MYSTRING)
b'mystring-constant'
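If a str is wanted on the Python side rather than bytes, the value can be decoded once at module level. Here is a minimal sketch in plain Python, with a bytes literal standing in for the value copied from MYSTRING_DEFINE and assuming the constant contains only ASCII characters.

```python
MYSTRING_BYTES = b"mystring-constant"   # stand-in for the value copied from the C define

MYSTRING = MYSTRING_BYTES.decode("ascii")
print(MYSTRING)   # mystring-constant
```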

This function should work to calculate factorials, but it doesn't. [python-3.x]

(Full disclosure, I am going through the Python tutorial at CodeAcademy, and am using their web-based IDE.)
def factorial(x):
bang = 1
for num in x:
bang = bang * num
return bang
In java, this works to generate a factorial from a number smaller than 2,147,483,647. I think it should work in python, but it doesn't. Instead I get the error:
"Traceback (most recent call last):
File "python", line 3, in factorial
TypeError: 'int' object is not iterable"
Perhaps there's something I'm not understanding here, or perhaps my syntax is wrong. I tested further and created a separate function called factorial that iterates:
def factorial(x):
    if x > 2:
        return x
    else:
        return x(factorial(x-1))
This also doesn't work, giving me the error:
"Traceback (most recent call last):
File "python", line 11, in factorial
TypeError: 'int' object is not callable"
I am a python noob, but it seems that both of these should work. Please advise on the best way to learn Python syntax...

You can't do for num in x if x is an integer. An integer isn't "iterable", as the error says. You want something like this:
def factorial(x):
    bang = 1
    for num in range(1, x+1):
        bang = bang * num
    return bang
The range call generates the sequence of numbers for the in to iterate over. (In Python 2, xrange does the same lazily; it no longer exists in Python 3.)

def f(x):
    if x < 2:
        return 1
    else:
        return x * f(x - 1)
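Both approaches can be checked against the standard library's math.factorial. This is a short sketch for Python 3; the function names are only illustrative.

```python
import math

def factorial_iterative(x):
    bang = 1
    for num in range(1, x + 1):
        bang = bang * num
    return bang

def factorial_recursive(x):
    if x < 2:
        return 1
    return x * factorial_recursive(x - 1)

# compare both versions with the standard library for a few inputs
for n in (0, 1, 5, 20):
    assert factorial_iterative(n) == factorial_recursive(n) == math.factorial(n)

print(factorial_iterative(5))   # 120
```

Python integers have arbitrary precision, so unlike a Java int the result is not limited to 2,147,483,647.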
