How to move list outside of cython cdef? [duplicate]

This question already has answers here:
How to wrap a C pointer and length in a new-style buffer object in Cython? (3 answers)
Scope vs life of variable in C (4 answers)
Closed 2 years ago.
I've searched the web for a while but can't seem to find the answer: Have a look at this code:
cdef float [::1] Example_v1 (float *A, float *B, float *C) :
    cdef:
        float [8] out
        float [::1] out_mem
        int i
    ## < do some stuff >
    ## body
    ## < finish doing stuff >
    out_mem = out
    print( " what is out & out_mem type here ", type(out) , type(out_mem) )
    return out_mem

def run1(float [::1] A, float [::1] B, float [::1] C):
    return Example_v1( &A[0] , &B[0] , &C[0] )
I can compile this in Cython without any error. Then, when I use it from Python code, the information inside out_mem is junk, while I have verified that out is correct if I use a print statement to check the result. I know of other alternatives, such as defining out as a pointer or directly initializing out_mem with np.zeros(8, dtype=np.float32). I'm just curious why it doesn't work as it is.
Also, as a related question: if I do switch to using a pointer:
cdef:
    float *out = <float*> malloc( 8 * sizeof(float))
    float [::1] out_mem = <float [:8]> out
will I need to call free(out) to avoid a memory leak?
Please help.
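For reference, here is a minimal sketch of the NumPy-backed alternative mentioned above (not the original code; the function name Example_v2 is made up, and np.float32 is assumed so the dtype matches the float memoryview). With a raw malloc, a matching free is indeed needed once the view is no longer used, otherwise the allocation leaks; letting a NumPy array own the buffer avoids that bookkeeping:

import numpy as np

cdef float [::1] Example_v2 (float *A, float *B, float *C):
    # NumPy owns the 8-float buffer, so it stays alive as long as the
    # returned memoryview does and is freed automatically afterwards.
    cdef float [::1] out_mem = np.zeros(8, dtype=np.float32)
    cdef int i
    ## < do some stuff, writing into out_mem[i] >
    return out_mem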

Why can I not use int parameter in Cython in Jupyterlab?

I am trying to use Cython in Jupyterlab:
%load_ext Cython
%%cython -a
def add (int x, int y):
    cdef int result
    result = x + y
    return result
add(3,67)
error:
File "<ipython-input-1-e496e2665826>", line 9
def add (int x, int y):
^
SyntaxError: invalid syntax
What am I missing?
Update:
I just measured cpdef vs def and the difference in score was quite small (45 for cpdef vs 52 for def; smaller = better/faster), so for your function it might not matter if it is called only a few times, but having it chew through a large amount of data might make a real difference.
If that's not applicable to you, just call %load_ext in a separate cell, keep def, and that should be enough.
(Cython 0.29.24, GCC 9.3.0, x86_64)
Use cpdef to make it a C-like function but also expose it to Python, so you can call it from Jupyter (because Jupyter uses Python, unless a cell is marked with the %%cython magic). Also, check the Early Binding for Speed section.
cpdef add (int x, int y):
    cdef int result
    result = x + y
    return result
Also make sure to check Using the Jupyter notebook which explains that the % needs to be in a separate cell as ead mentioned in the comments.
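For reference, a minimal layout that follows both points above (a sketch, not the original notebook):

Cell 1 (load the extension on its own):
%load_ext Cython

Cell 2 (the %%cython magic must be the first line of its cell):
%%cython
cpdef int add(int x, int y):
    cdef int result
    result = x + y
    return result

Cell 3 (plain Python, calling the compiled function):
add(3, 67)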
add has to be defined with cdef, not def.
cdef add (int x, int y):
    cdef int result
    result = x + y
    return result

add(3,67)

ctypes: How to get address of NULL c_void_p field?

I need to get the address of a NULL void pointer. If I make a NULL c_void_p in Python I have no problem getting its address:
ptr = c_void_p(None)
print(ptr)
print(ptr.value)
print(addressof(ptr))
gives
c_void_p(None)
None
4676189120
But I have a
class Effect(Structure):
    _fields_ = [("ptr", c_void_p)]
where ptr gets initialized to NULL in C. When I access it in python
myclib.get_effect.restype = POINTER(Effect)
effect = myclib.get_effect().contents
print(effect.ptr)
gives None, so I can't take addressof(effect.ptr).
If I change my field type to a pointer to any ctype type
class Effect(Structure):
    _fields_ = [("ptr", POINTER(c_double))]

# get effect instance from C shared library
print(addressof(effect.ptr))
I have checked that I get the right address on the heap on the C side
140530973811664
Unfortunately, changing the field type from c_void_p is not an option. How can I do this?
Clarification
Here's C code following @CristiFati's answer for my specific situation. The struct is allocated in C, I get a pointer back to it in Python, and now I need to pass a reference to the ptr field of the struct. First, if I make ptr a double*, there's no problem!
#include <stdio.h>
#include <stdlib.h>

#define PRINT_MSG_2SX(ARG0, ARG1) printf("From C - [%s] (%d) - [%s]: ARG0: [%s], ARG1: 0x%016llX\n", __FILE__, __LINE__, __FUNCTION__, ARG0, (unsigned long long)ARG1)

typedef struct Effect {
    double* ptr;
} Effect;

void print_ptraddress(double** ptraddress){
    PRINT_MSG_2SX("Address of Pointer:", ptraddress);
}

Effect* get_effect(){
    Effect* pEffect = malloc(sizeof(*pEffect));
    pEffect->ptr = NULL;
    print_ptraddress(&pEffect->ptr);
    return pEffect;
}
And in Python
from ctypes import cdll, Structure, c_int, c_void_p, addressof, pointer, POINTER, c_double, byref

clibptr = cdll.LoadLibrary("libpointers.so")

class Effect(Structure):
    _fields_ = [("ptr", POINTER(c_double))]

clibptr.get_effect.restype = POINTER(Effect)
pEffect = clibptr.get_effect()
effect = pEffect.contents
clibptr.print_ptraddress(byref(effect.ptr))
gives matching addresses:
From C - [pointers.c] (11) - [print_ptraddress]: ARG0: [Address of Pointer:], ARG1: 0x00007FC2E1AD3770
From C - [pointers.c] (11) - [print_ptraddress]: ARG0: [Address of Pointer:], ARG1: 0x00007FC2E1AD3770
But if I change the double* to void* and c_void_p, I get an error, because the c_void_p in python is set to None
ctypes ([Python 3]: ctypes - A foreign function library for Python) is meant to let Python "talk to" C, which makes it Python friendly, and that means no pointers or memory addresses (well, as far as possible, to be more precise).
So, under the hood, it does some "magic", which in this case stands between you and your goal.
#EDIT0: Updated the answer to better fit the (clarified) question.
Example:
>>> import ctypes
>>> s0 = ctypes.c_char_p(b"Some dummy text")
>>> s0, type(s0)
(c_char_p(2180506798080), <class 'ctypes.c_char_p'>)
>>> s0.value, "0x{:016X}".format(ctypes.addressof(s0))
(b'Some dummy text', '0x000001FBB021CF90')
>>>
>>> class Stru0(ctypes.Structure):
... _fields_ = [("s", ctypes.c_char_p)]
...
>>> stru0 = Stru0(s0)
>>> type(stru0)
<class '__main__.Stru0'>
>>> "0x{:016X}".format(ctypes.addressof(stru0))
'0x000001FBB050E310'
>>> stru0.s, type(stru0.s)
(b'Dummy text', <class 'bytes'>)
>>>
>>>
>>> b = b"Other dummy text"
>>> char_p = ctypes.POINTER(ctypes.c_char)
>>> s1 = ctypes.cast((ctypes.c_char * len(b))(*b), char_p)
>>> s1, type(s1)
(<ctypes.LP_c_char object at 0x000001FBB050E348>, <class 'ctypes.LP_c_char'>)
>>> s1.contents, "0x{:016X}".format(ctypes.addressof(s1))
(c_char(b'O'), '0x000001FBB050E390')
>>>
>>> class Stru1(ctypes.Structure):
... _fields_ = [("s", ctypes.POINTER(ctypes.c_char))]
...
>>> stru1 = Stru1(s1)
>>> type(stru1)
<class '__main__.Stru1'>
>>> "0x{:016X}".format(ctypes.addressof(stru1))
'0x000001FBB050E810'
>>> stru1.s, type(stru1.s)
(<ctypes.LP_c_char object at 0x000001FBB050E6C8>, <class 'ctypes.LP_c_char'>)
>>> "0x{:016X}".format(ctypes.addressof(stru1.s))
'0x000001FBB050E810'
This is a parallel between 2 types which in theory are the same thing:
ctypes.c_char_p: as you can see, s0 was automatically converted to bytes. This makes sense, since it's Python and there's no need to work with pointers here; also, it would be very annoying to have to convert each member from ctypes to plain Python (and vice versa) every time it is used.
The current scenario is not part of the "happy flow"; it's rather a corner case, and there's no functionality for it (or at least none I'm aware of).
ctypes.POINTER(ctypes.c_char) (named char_p above): this is closer to C and offers the functionality you need, but as seen it's also much harder (from a Python perspective) to work with.
The problem is that ctypes.c_void_p is similar to #1., so there's no OOTB functionality for what you want, and there's also no ctypes.c_void to go with #2. However, it is possible; it just requires additional work.
The well known (C) rule is: AddressOf(Structure.Member) = AddressOf(Structure) + OffsetOf(Structure, Member) (beware of memory alignment, which can "play dirty tricks on your mind").
For this particular case, things couldn't be simpler. Here's an example:
dll.c:
#include <stdio.h>
#include <stdlib.h>

#if defined(_WIN32)
#  define DLL_EXPORT __declspec(dllexport)
#else
#  define DLL_EXPORT
#endif

#define PRINT_MSG_2SX(ARG0, ARG1) printf("From C - [%s] (%d) - [%s]: ARG0: [%s], ARG1: 0x%016llX\n", __FILE__, __LINE__, __FUNCTION__, ARG0, (unsigned long long)ARG1)

static float f = 1.618033;

typedef struct Effect {
    void *ptr;
} Effect;

DLL_EXPORT void test(Effect *pEffect, int null) {
    PRINT_MSG_2SX("pEffect", pEffect);
    PRINT_MSG_2SX("pEffect->ptr", pEffect->ptr);
    PRINT_MSG_2SX("&pEffect->ptr", &pEffect->ptr);
    pEffect->ptr = !null ? NULL : &f;
    PRINT_MSG_2SX("new pEffect->ptr", pEffect->ptr);
}
code.py:
#!/usr/bin/env python3

import sys
from ctypes import CDLL, POINTER, \
    Structure, \
    c_int, c_void_p, \
    addressof, pointer

DLL = "./dll.dll"


class Effect(Structure):
    _fields_ = [("ptr", c_void_p)]


def hex64_str(item):
    return "0x{:016X}".format(item)


def print_addr(ctypes_inst, inst_name, heading=""):
    print("{:s}{:s} addr: {:s} (type: {:})".format(heading, "{:s}".format(inst_name) if inst_name else "", hex64_str(addressof(ctypes_inst)), type(ctypes_inst)))


def main():
    dll_dll = CDLL(DLL)
    test_func = dll_dll.test
    test_func.argtypes = [POINTER(Effect), c_int]

    effect = Effect()
    print_addr(effect, "effect")
    test_func(pointer(effect), 1)
    print(effect.ptr, type(effect.ptr))  # Not helping, it's Python int for c_void_p
    try:
        print_addr(effect.ptr, "effect.ptr")
    except:
        print("effect.ptr: - wrong type")
    print_addr(effect, "effect", "\nSecond time...\n ")
    print("Python addrs (irrelevant): effect: {:s}, effect.ptr: {:s}".format(hex64_str(id(effect)), hex64_str(id(effect.ptr))))


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()
Output:
(py35x64_test) e:\Work\Dev\StackOverflow\q053531795>call "c:\Install\x86\Microsoft\Visual Studio Community\2015\vc\vcvarsall.bat" x64
(py35x64_test) e:\Work\Dev\StackOverflow\q053531795>dir /b
code.py
dll.c
(py35x64_test) e:\Work\Dev\StackOverflow\q053531795>cl /nologo /DDLL /MD dll.c /link /NOLOGO /DLL /OUT:dll.dll
dll.c
Creating library dll.lib and object dll.exp
(py35x64_test) e:\Work\Dev\StackOverflow\q053531795>dir /b
code.py
dll.c
dll.dll
dll.exp
dll.lib
dll.obj
(py35x64_test) e:\Work\Dev\StackOverflow\q053531795>"e:\Work\Dev\VEnvs\py35x64_test\Scripts\python.exe" code.py
Python 3.5.4 (v3.5.4:3f56838, Aug 8 2017, 02:17:05) [MSC v.1900 64 bit (AMD64)] on win32
effect addr: 0x000001FB25B8CB10 (type: <class '__main__.Effect'>)
From C - [dll.c] (21) - [test]: ARG0: [pEffect], ARG1: 0x000001FB25B8CB10
From C - [dll.c] (22) - [test]: ARG0: [pEffect->ptr], ARG1: 0x0000000000000000
From C - [dll.c] (23) - [test]: ARG0: [&pEffect->ptr], ARG1: 0x000001FB25B8CB10
From C - [dll.c] (25) - [test]: ARG0: [new pEffect->ptr], ARG1: 0x00007FFFAFB13000
140736141012992 <class 'int'>
effect.ptr: - wrong type
Second time...
effect addr: 0x000001FB25B8CB10 (type: <class '__main__.Effect'>)
Python addrs (irrelevant): effect: 0x000001FB25B8CAC8, effect.ptr: 0x000001FB25BCC9F0
As seen, the address of effect is the same as the address of effect's ptr field. But again, this is the simplest possible scenario. As explained, a general solution would be preferable; that's not really possible, but it can be worked around:
Use the above formula and get the field offset using [SO]: Getting elements from ctype structure with introspection? (it's long, I had a hard time coming to the current solution - especially because of the 2 container types (Structure and Array) nesting possibilities; hopefully, it's bug free (or as close as possible) :) )
Modify the C interface to something like: Effect *get_effect(void **ptr), and store the address in the parameter
Modify the (Python) Effect structure, and instead of a ctypes.c_void_p field have something that involves POINTER (e.g.: ("ptr", POINTER(c_ubyte))). The definition will differ from C, and semantically things are not OK, but in the end they're both pointers
Note: don't forget to have a function that destroys a pointer returned by get_effect (to avoid memory leaks)
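A sketch of that last point, with hypothetical names (it assumes the shared library also exports something like void free_effect(Effect *), which simply calls free on the pointer returned by get_effect):

from ctypes import CDLL, POINTER, Structure, c_void_p

class Effect(Structure):
    _fields_ = [("ptr", c_void_p)]

lib = CDLL("./libpointers.so")
lib.get_effect.restype = POINTER(Effect)
lib.free_effect.argtypes = [POINTER(Effect)]   # assumed C counterpart of get_effect

p_effect = lib.get_effect()
try:
    effect = p_effect.contents
    # ... work with effect ...
finally:
    lib.free_effect(p_effect)                  # release the malloc'd struct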
So after raising this on the Python bug tracker, Martin Panter and Eryk Sun provided a better solution.
There is indeed an undocumented offset attribute, which allows us to access the right location in memory without having to do any introspection. We can get back our pointer using
offset = type(effect).ptr.offset
ptr = c_void_p.from_buffer(effect, offset)
We can more elegantly wrap this into our class by using a private field and adding a property:
class Effect(Structure):
    _fields_ = [("j", c_int),
                ("_ptr", c_void_p)]

    @property
    def ptr(self):
        offset = type(self)._ptr.offset
        return c_void_p.from_buffer(self, offset)
I have added an integer field before our pointer so the offset isn't just zero. For completeness, here is the code above adapted with this solution showing that it works. In C:
#include <stdio.h>
#include <stdlib.h>

#define PRINT_MSG_2SX(ARG0, ARG1) printf("%s : 0x%016llX\n", ARG0, (unsigned long long)ARG1)

typedef struct Effect {
    int j;
    void* ptr;
} Effect;

void print_ptraddress(void** ptraddress){
    PRINT_MSG_2SX("Address of Pointer:", ptraddress);
}

Effect* get_effect(){
    Effect* pEffect = malloc(sizeof(*pEffect));
    pEffect->ptr = NULL;
    print_ptraddress(&pEffect->ptr);
    return pEffect;
}
In Python (omitting the above Effect definition):
from ctypes import cdll, Structure, c_int, c_void_p, POINTER, byref
clibptr = cdll.LoadLibrary("libpointers.so")
clibptr.get_effect.restype = POINTER(Effect)
effect = clibptr.get_effect().contents
clibptr.print_ptraddress(byref(effect.ptr))
yields
Address of Pointer: : 0x00007F9EB248FB28
Address of Pointer: : 0x00007F9EB248FB28
Thanks again to everyone for the quick suggestions.

Is it a data race in a nested thrust functor?

I have tested this snippet and tried to explain its behavior, as well as to find a way to resolve it, but have failed to do so:
#include <thrust/inner_product.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/random.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <cmath>
#include <boost/concept_check.hpp>

struct alter_tuple {
    alter_tuple(const int& a_, const int& b_) : a(a_), b(b_){};

    __host__ __device__
    thrust::tuple<int,int> operator()(thrust::tuple<int,int> X)
    {
        int Xx = thrust::get<0>(X);
        int Xy = thrust::get<1>(X);
        int Xpx = a*Xx-b*Xy;
        int Xpy = -b*Xx+a*Xy;
        printf("in (%d,%d) -> (%d,%d)\n",Xx,Xy,Xpx,Xpy);
        return thrust::make_tuple(Xpx,Xpy);
    }

    int a; // these variables a,b are shared between different threads used by this functor kernel
    int b; // which easily creates racing problem
};

struct alter_tuple_arr {
    alter_tuple_arr(int* a_, int* b_, int* c_, int* d_) : a(a_), b(b_), c(c_), d(d_) {};

    __host__ __device__
    thrust::tuple<int,int> operator()(const int& idx)
    {
        int Xx = a[idx];
        int Xy = b[idx];
        int Xpx = a[idx]*Xx-b[idx]*Xy;
        int Xpy = -b[idx]*Xx+a[idx]*Xy;
        printf("in (%d,%d) -> (%d,%d)\n",Xx,Xy,Xpx,Xpy);
        return thrust::make_tuple(Xpx,Xpy);
    }

    int* a;
    int* b;
    int* c;
    int* d;
};

struct bFuntor
{
    bFuntor(int* av__, int* bv__, int* cv__, int* dv__, const int& N__) : av_(av__), bv_(bv__), cv_(cv__), dv_(dv__), N_(N__) {};

    __host__ __device__
    int operator()(const int& idx)
    {
        thrust::device_ptr<int> av_dpt = thrust::device_pointer_cast(av_);
        thrust::device_ptr<int> av_dpt1 = thrust::device_pointer_cast(av_+N_);
        thrust::device_ptr<int> bv_dpt = thrust::device_pointer_cast(bv_);
        thrust::device_ptr<int> bv_dpt1 = thrust::device_pointer_cast(bv_+N_);
        thrust::device_ptr<int> cv_dpt = thrust::device_pointer_cast(cv_);
        thrust::device_ptr<int> cv_dpt1 = thrust::device_pointer_cast(cv_+N_);
        thrust::device_ptr<int> dv_dpt = thrust::device_pointer_cast(dv_);
        thrust::device_ptr<int> dv_dpt1 = thrust::device_pointer_cast(dv_+N_);

        thrust::detail::normal_iterator<thrust::device_ptr<int>> a0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(av_dpt);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> a1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(av_dpt1);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> b0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(bv_dpt);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> b1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(bv_dpt1);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> c0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(cv_dpt);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> c1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(cv_dpt1);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> d0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(dv_dpt);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> d1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(dv_dpt1);

        // ** alter_tuple is WRONG
#define WRONG
#ifdef WRONG
        thrust::transform(thrust::device,
                          thrust::make_zip_iterator(thrust::make_tuple(a0,b0)),
                          thrust::make_zip_iterator(thrust::make_tuple(a1,b1)),
                          // thrust::make_zip_iterator(thrust::make_tuple(cv_dpt,dv_dpt)), // cv_dpt
                          thrust::make_zip_iterator(thrust::make_tuple(c0,d0)),            // cv_dpt
                          alter_tuple(cv_[idx],dv_[idx]));
#endif

#ifdef RIGHT
        // ** alter_tuple_arr is CORRECT way to do it
        thrust::transform(thrust::device,
                          thrust::counting_iterator<int>(0),
                          thrust::counting_iterator<int>(N_),
                          // thrust::make_zip_iterator(thrust::make_tuple(cv_dpt,dv_dpt)), // cv_dpt
                          thrust::make_zip_iterator(thrust::make_tuple(c0,d0)),            // cv_dpt
                          alter_tuple_arr(av_,bv_,cv_,dv_));
#endif

        for (int i=0; i<N_; i++)
            printf("out: (%d,%d) -> (%d,%d)\n",av_[i],bv_[i],cv_[i],dv_[i]);

        return cv_dpt[idx];
    }

    int* av_;
    int* bv_;
    int* cv_;
    int* dv_;
    int N_;
    float af; // are these variables host side or device side??
};

__host__ __device__
unsigned int hash(unsigned int a)
{
    a = (a+0x7ed55d16) + (a<<12);
    a = (a^0xc761c23c) ^ (a>>19);
    a = (a+0x165667b1) + (a<<5);
    a = (a+0xd3a2646c) ^ (a<<9);
    a = (a+0xfd7046c5) + (a<<3);
    a = (a^0xb55a4f09) ^ (a>>16);
    return a;
}

int main(void)
{
    int N = 10;
    std::vector<int> av,bv,cv,dv;

    unsigned int seed = hash(10);
    thrust::default_random_engine rng(seed);
    thrust::uniform_real_distribution<float> u01(0,10);
    for (int i=0;i<N;i++) {
        av.push_back((int)u01(rng));
        bv.push_back((int)u01(rng));
        cv.push_back((int)u01(rng));
        dv.push_back((int)u01(rng));
        // printf("%d %d %d %d \n",av[i],bv[i],cv[i],dv[i]);
    }

    thrust::device_vector<int> av_d(N);
    thrust::device_vector<int> bv_d(N);
    thrust::device_vector<int> cv_d(N);
    thrust::device_vector<int> dv_d(N);
    av_d = av; bv_d = bv; cv_d = cv; dv_d = dv;

    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(N),
                      cv_d.begin(),
                      bFuntor(thrust::raw_pointer_cast(av_d.data()),
                              thrust::raw_pointer_cast(bv_d.data()),
                              thrust::raw_pointer_cast(cv_d.data()),
                              thrust::raw_pointer_cast(dv_d.data()),
                              N));

    thrust::host_vector<int> bv_h(N);
    thrust::copy(bv_d.begin(), bv_d.end(), bv_h.begin()); // probably I forgot this! to copy back the result from device to host!

    return 0;
}
In these nested thrust calls, two nested functors were tested, and one of them worked (the one enabled with #define RIGHT). In the case of the WRONG functor, i.e. alter_tuple:
Where do the two variables int a, int b reside: on the host, on the device, or in local kernel registers? Or are they shared between the threads running this functor's operator()?
Inside the alter_tuple functor, I print the result (the printf("in ...") statement) and the calculation is correct. However, when this result is returned to the calling functor and printed out (the printf("out ...") statement), the values are incorrect and differ from the previous calculation.
How can these results be different? I can't explain it, and there are no documents or examples to refer to.
This difference is shown in the output here.
Edit 1:
A minimum-size test code shows that the functors (literally, a*x = y) in both cases receive/initialize values correctly: SO_example_no_tuple_arr_wo_c.cu
Its printout is:
out: 9*8 -> 72
out: 9*8 -> 72
out: 9*8 -> 72
out: 6*4 -> 24
out: 6*4 -> 24
out: 6*4 -> 24
out: 1*8 -> 8
out: 1*8 -> 8
out: 1*6 -> 6
out: 9*1 -> 9
out: 9*1 -> 9
which shows the values are received correctly.
A minimum test code that does not use a pointer/array to pass the input values shows that, even though the input values are correctly initialized, the returned results are wrong: SO_example_no_tuple.cu
Its output in the case N=2:
in 9*8 -> 72
in 6*4 -> 24
in 9*8 -> 72
in 6*4 -> 24
out: 9*8 -> 24
out: 9*8 -> 24
out: 6*4 -> 24
out: 6*4 -> 24
The difference in values is not strictly due to a data race problem.
Your two approaches do not do the same thing, and it has to do with the values of a and b that will be selected for each invocation of the nested thrust::transform call. This is evident if you set N = 1, which should remove any concerns about data racing. The results are still different.
In the "failing" case, you are invoking the alter_tuple() operator like so:
thrust::transform(thrust::device,
...
alter_tuple(cv_[idx],dv_[idx]));
These values (cv_[idx], dv_[idx]) then become your initializing parameters, ending up in the a and b variables inside the functor. But your "passing" case is effectively initializing these variables differently, using a[idx] and b[idx], which correspond to av_[idx] and bv_[idx]. If we change the alter_tuple invocation to use a and b:
alter_tuple(av_[idx],bv_[idx]));
then the N = 1 case results now match. This was easier to understand because we had in fact only one entry in the a, b, c, d vectors.
When we expand to the N = 10 case, however, we no longer get matching results. To explain why, we need to understand the use of a and b inside the functor in this case. In the "failing" case, we are passing a single initializing value for each of a and b as used in the functor:
alter_tuple(av_[idx],bv_[idx]));
so, for a given thread, which means for a given invocation of the nested thrust::transform call, a single value will be used for a and b:
alter_tuple(const int& a_, const int& b_) : a(a_), b(b_){};
...
int a; // these values are constant across variation of "idx"
int b; // passed to the functor
On the other hand, in the "passing" case, the a and b values will vary for each element passed to the functor, within the nested transform call:
thrust::tuple<int,int> operator()(const int& idx)
{
    int Xx = a[idx]; // these values of a and b *vary* for each idx
    int Xy = b[idx]; // passed to the functor
Once that is understood, if the "passing" case is the desired case, then I have no idea how to transform the first case to produce passing results, as there is no way you can cause a single initializing value to take on the behavior of the varying values for a and b in the "passing" case.
None of the above involves data racing, but since your operations (i.e. each thread) are writing to every value of c and d, I don't think this overall approach makes any sense, and I'm not sure what you are trying to accomplish. I think if you expanded this to more elements/threads, then you could certainly experience unpredictable/variable results.
To answer some of your other questions, the variables a and b end up as thread-local variables, on the device. So each data member in either functor is a thread-local variable on the device.
Inside the alter_tuple functor, I print the result (the printf("in ...") statement) and the calculation is correct. However, when this result is returned to the calling functor and printed out (the printf("out ...") statement), the values are incorrect and differ from the previous calculation.
Each thread is writing to the same locations in the c and d vector. Therefore, since each thread writes to the entire vector, but (in the failing case) each thread uses a different initializing value for a and b inside the functor, it stands to reason that each thread will compute a different result for the values of c and d, and the results you get after completion of the thrust call will depend on which thread "wins" the output write operation. This is unpredictable, and certainly not all threads printout will match the final result, because each thread will compute different values for c and d.

How to write LOP3-based instructions for Maxwell and newer NVIDIA architectures?

The Maxwell architecture introduced a new instruction in PTX assembly called LOP3 which, according to the NVIDIA blog:
"Can save instructions when performing complex logic operations on multiple inputs."
At GTC 2016, some CUDA developers managed to accelerate the atan2f function for the Tegra X1 processor (Maxwell) with such instructions.
However, the function below, defined within a .cu file, leads to undefined identifiers for __SET_LT and __LOP3_0xe2.
Do I have to define them in a .ptx file instead? If so, how?
float atan2f(const float dy, const float dx)
{
    float flag, z = 0.0f;
    __SET_LT(flag, fabsf(dy), fabsf(dx));
    uint32_t m, t1 = 0x80000000;
    float t2 = float(M_PI) / 2.0f;
    __LOP3_0x2e(m, __float_as_int(dx), t1, __float_as_int(t2));
    float w = flag * __int_as_float(m) + float(M_PI)/2.0f;
    float Offset = copysignf(w, dy);
    float t = fminf(fabsf(dx), fabsf(dy)) / fmaxf(fabsf(dx), fabsf(dy));
    uint32_t r, b = __float_as_int(flag) << 2;
    uint32_t mask = __float_as_int(dx) ^ __float_as_int(dy) ^ (~b);
    __LOP3_0xe2(r, mask, t1, __float_as_int(t));
    const float p = fabsf(__int_as_float(r)) - 1.0f;
    return ((-0.0663f*(-p) + 0.311f) * (-p) + float(float(M_PI)/4.0)) * (*(float *)&r) + Offset;
}
Edit:
The macro defines are finally:
#define __SET_LT(D, A, B) asm("set.lt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __SET_GT(D, A, B) asm("set.gt.f32.f32 %0, %1, %2;" : "=f"(D) : "f"(A), "f"(B))
#define __LOP3_0x2e(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0x2e;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
#define __LOP3_0xe2(D, A, B, C) asm("lop3.b32 %0, %1, %2, %3, 0xe2;" : "=r"(D) : "r"(A), "r"(B), "r"(C))
The lop3.b32 PTX instruction can perform a more-or-less arbitrary boolean (logical) operation on 3 variables A,B, and C.
In order to set the actual operation to be performed, we must provide a "lookup-table" immediate argument (immLut -- an 8-bit quantity). As indicated in the documentation, a method to compute the necessary immLut argument for a given operation F(A,B,C) is to substitute the values of 0xF0 for A, 0xCC for B, and 0xAA for C in the actual desired equation. For example suppose we want to compute:
F = (A || B) && (!C) ((A or B) and (not-C))
Then we would compute immLut argument by:
immLut = (0xF0 | 0xCC) & (~0xAA)
Note that the specified equation for F is a boolean equation, treating the arguments A,B, and C as boolean values, and producing a true/false result (F). However, the equation to compute immLut is a bitwise logical operation.
For the above example, immLut would have a computed value of 0x54
If it's desired to use a PTX instruction in ordinary CUDA C/C++ code, probably the most common (and arguably easiest) method would be to use inline PTX. Inline PTX is documented, and there are other questions discussing how to use it (such as this one), so I'll not repeat that here.
Here is a worked example of the above example case. Note that this particular PTX instruction is only available on cc5.0 and higher architectures, so be sure to compile for at least that level of target.
$ cat t1149.cu
#include <stdio.h>

const unsigned char A_or_B_and_notC=((0xF0|0xCC)&(~0xAA));

__device__ int my_LOP_0x54(int A, int B, int C){
    int temp;
    asm("lop3.b32 %0, %1, %2, %3, 0x54;" : "=r"(temp) : "r"(A), "r"(B), "r"(C));
    return temp;
}

__global__ void testkernel(){
    printf("A=true, B=false, C=true, F=%d\n", my_LOP_0x54(true, false, true));
    printf("A=true, B=false, C=false, F=%d\n", my_LOP_0x54(true, false, false));
    printf("A=false, B=false, C=false, F=%d\n", my_LOP_0x54(false, false, false));
}

int main(){
    printf("0x%x\n", A_or_B_and_notC);
    testkernel<<<1,1>>>();
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_50 -o t1149 t1149.cu
$ ./t1149
0x54
A=true, B=false, C=true, F=0
A=true, B=false, C=false, F=1
A=false, B=false, C=false, F=0
$
Since immLut is an immediate constant in PTX code, I know of no way using inline PTX to pass this as a function parameter - even if templating is used. Based on your provided link, it seems that the authors of that presentation also used a separately defined function for the specific desired immediate value -- presumably 0xE2 and 0x2E in their case. Also, note that I have chosen to write my function so that it returns the result of the operation as the function return value. The authors of the presentation you linked appear to be passing the return value back via a function parameter. Either method should be workable. (In fact, it appears they have written their __LOP3... codes as functional macros rather than ordinary functions.)
Also see here for a method of understanding how the 8 bit truthtable (immLut) works for LOP3 at the source code level.
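As a rough sketch of that truth-table view (an illustration, not code from the linked material): bit 4*A + 2*B + C of immLut is F(A,B,C), which is exactly why substituting 0xF0, 0xCC, and 0xAA for A, B, and C produces the constant. A small host-side check for the 0x54 example:

#include <stdio.h>

int main(void) {
    const unsigned char immLut = (0xF0 | 0xCC) & (~0xAA);   /* 0x54 */
    /* Enumerate all 8 input combinations; bit (4*A + 2*B + C) of immLut
       holds F(A,B,C), here F = (A || B) && !C. */
    for (int a = 0; a <= 1; ++a)
        for (int b = 0; b <= 1; ++b)
            for (int c = 0; c <= 1; ++c)
                printf("A=%d B=%d C=%d -> F=%d\n",
                       a, b, c, (immLut >> (4 * a + 2 * b + c)) & 1);
    return 0;
}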

CUDA Fortran: Data copy from CPU to GPU [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have a problem with copying data from host to device. Here is my problem. I have an array defined as
real, allocatable :: cpuArray(:,:,:)
real, device, allocatable :: gpuArray(:,:,:)

allocate(cpuArray(0:imax-1,0:jmax-1,0:kmax-1))
allocate(gpuArray(-1:imax,-1:jmax,-1:kmax))

! array initialization
cpuArray = randomValue   ! non-zero value
gpuArray = 0.0           ! first zero all gpu array elements
gpuArray(0:imax-1,0:jmax-1,0:kmax-1) = cpuArray
My expectation is that only the designated index range in gpuArray will receive data from the host; however, it does not work.
Could you help me find what is wrong with this?
PS: I based my approach on this tutorial from the PGI home page.
--
When I set both cpuArray and gpuArray to the same dimensions, I get exactly the correct result.
But the current situation produces 0 for all elements in gpuArray. I modified the default value to a non-zero one (i.e. gpuArray = 10.0, to first set all gpu array elements to 10) but the result is still 0.
Best regards,
Adjeiinfo
All my apologies to the whole community. I was able to solve my problem. It was a silly bug I introduced in the test program: instead of cpuArrray = cpuArray(0:imax-1,0:jmax-1,0:kmax-1) in the check program, I did cpuArrray = cpuArray. So the program was working well, but the result-check program was buggy.
Thank you for your follow-up.
For your reference, this is part of the program (it can be built and run):
module mytest
    use cudafor
    implicit none

    integer :: imax , jmax, kmax
    integer :: i,j,k

    ! host arrays
    real,allocatable:: h_a(:,:,:)
    real,allocatable:: h_b(:,:,:)
    real,allocatable:: h_c(:,:,:)

    ! device arrays
    real,device,allocatable:: d_b(:,:,:)
    real,device,allocatable:: d_c(:,:,:)
    real,device,allocatable:: d_b_copy(:,:,:)
    real,device,allocatable:: d_c_copy(:,:,:)

contains

    attributes(global) subroutine testdata()
        integer :: d_i, d_j, d_k

        d_i = (blockIdx%x-1) * blockDim%x + threadIdx%x-1
        d_j = (blockIdx%y-1) * blockDim%y + threadIdx%y-1

        do d_k = 0, 1
            d_b_copy(d_i, d_j, d_k) = d_b(d_i, d_j, d_k)
            d_c_copy(d_i, d_j, d_k) = d_c(d_i, d_j, d_k)
        end do
    end subroutine testdata

end module mytest

program Test
    use mytest
    type(dim3) :: dimGrid, dimBlock, dimGrid1, dimBlock1

    imax = 32
    jmax = 32
    kmax = 2

    dimGrid = dim3(2,2, 1)
    dimBlock = dim3(imax,jmax,1)

    allocate(h_a(0:imax-1,0:jmax-1,0:1))
    allocate(h_b(0:imax-1,0:jmax-1,0:1))
    allocate(h_c(0:imax-1,0:jmax-1,0:1))

    !real,device,allocatable::d_c(:,:,:)
    allocate(d_b(0:imax-1,0:jmax-1,0:1))
    allocate(d_c(-1:imax,-1:jmax,-1:16))
    allocate(d_b_copy(0:imax-1,0:jmax-1,0:1))
    allocate(d_c_copy(-1:imax,-1:jmax,-1:1))

    ! array initialization
    do k = 0,kmax-1
        do j = 0, jmax-1
            do i = 0, imax-1
                h_a(i,j,k) = i*0.1
            end do
        end do
    end do

    ! data transfer (cpu to gpu)
    d_b = h_a
    d_c(0:imax-1,0:jmax-1,0:kmax-1) = h_a

    call testdata<<<dimGrid,dimBlock>>>()

    ! copy back to cpu
    h_b = d_b_copy(0:imax-1,0:jmax-1,0:kmax-1)
    h_c = d_c_copy(0:imax-1,0:jmax-1,0:kmax-1)

    ! just for visual test
    write(*,*), h_b

    open(24,file='h_a.dat')
    write(24,*) h_a
    close(24)

    open(24,file='d_b_copy.dat')
    write(24,*) h_b
    close(24)

    open(24,file='d_c_copy.dat')
    write(24,*) h_c
    close(24)
end program Test