Memory allocation in CUDA C [duplicate]

For example, cudaMalloc((void**)&device_array, num_bytes);
This question has been asked before, and the reply was "because cudaMalloc returns an error code", but I don't get it - what has a double pointer got to do with returning an error code? Why can't a simple pointer do the job?
If I write
cudaError_t catch_status;
catch_status = cudaMalloc((void**)&device_array, num_bytes);
the error code will be put in catch_status, and returning a simple pointer to the allocated GPU memory should suffice, shouldn't it?

In C, data can be passed to functions by value or via simulated pass-by-reference (i.e. by a pointer to the data). By value is a one-way methodology, by pointer allows for two-way data flow between the function and its calling environment.
When a data item is passed to a function via the function parameter list, and the function is expected to modify the original data item so that the modified value shows up in the calling environment, the correct C method for this is to pass the data item by pointer. In C, when we pass by pointer, we take the address of the item to be modified, creating a pointer (perhaps a pointer to a pointer in this case) and hand the address to the function. This allows the function to modify the original item (via the pointer) in the calling environment.
Normally malloc returns a pointer, and we can use assignment in the calling environment to assign this returned value to the desired pointer. In the case of cudaMalloc, the CUDA designers chose to use the returned value to carry an error status rather than a pointer. Therefore the setting of the pointer in the calling environment must occur via one of the parameters passed to the function, by reference (i.e. by pointer). Since it is a pointer value that we want to set, we must take the address of the pointer (creating a pointer to a pointer) and pass that address to the cudaMalloc function.
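To make the above concrete, here is a minimal host-only sketch (plain C++, no CUDA required). The names Status, alloc_by_value and alloc_by_address are made up for illustration and are not part of the CUDA API; std::malloc merely stands in for the device allocator. It shows that a function which receives the pointer by value cannot change the caller's pointer, while one that receives the pointer's address can.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical status type standing in for cudaError_t.
enum Status { kSuccess = 0, kAllocFailed = 1 };

// Receives a COPY of the caller's pointer. Assigning to `p` only changes
// the local copy, so the caller never learns where the block is.
Status alloc_by_value(void* p, std::size_t bytes) {
    p = std::malloc(bytes);
    if (p == nullptr) return kAllocFailed;
    std::free(p);  // the caller cannot reach this block, so release it here
    return kSuccess;
}

// Receives the ADDRESS of the caller's pointer (a pointer to a pointer).
// Writing through `*p` updates the caller's variable, which leaves the
// return value free to carry the status code.
Status alloc_by_address(void** p, std::size_t bytes) {
    *p = std::malloc(bytes);
    return (*p != nullptr) ? kSuccess : kAllocFailed;
}
```

Calling alloc_by_value(ptr, n) leaves ptr exactly as it was, whereas alloc_by_address(reinterpret_cast<void**>(&ptr), n) fills it in, which is the same calling shape as cudaMalloc((void**)&device_array, num_bytes).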

Adding to Robert Crovella's answer, and to reiterate first: cudaMalloc is a C API, and C has no references, which are what would let a function modify the value of a pointer itself (not just what it points to). Also note that the parameter has to be void** because C does not support function overloading, so one signature must serve every pointer type.
Further, when using a C API within a C++ program (but you have not stated this), it is common to wrap such a function in a template. For example,
template<typename T>
cudaError_t cudaAlloc(T*& d_p, size_t elements)
{
return cudaMalloc((void**)&d_p, elements * sizeof(T));
}
There are two differences in how you call the above cudaAlloc function compared with cudaMalloc:
Pass the device pointer directly, without using the address-of operator (&) and without casting to a void type.
The second argument, elements, is now the number of elements rather than the number of bytes; the sizeof operator does the conversion. Specifying elements instead of bytes is arguably more intuitive.
For example:
float *d = nullptr; // floats, 4 bytes per element
size_t N = 100; // 100 elements
cudaError_t err = cudaAlloc(d,N); // modifies d, input is not bytes
if (err != cudaSuccess)
std::cerr << "Unable to allocate device memory" << std::endl;

I guess the signature of cudaMalloc function could be better explained by an example. It is basically assigning a buffer through a pointer to that buffer (a pointer to pointer), like the following method:
int cudaMalloc(void **memory, size_t size)
{
int errorCode = 0;
*memory = new char[size];
return errorCode;
}
As you can see, the function takes a pointer to a pointer and stores the address of the newly allocated memory through it. It then returns the error code (here an int, but in the real API it is an enum).
The cudaMalloc function could also have been designed as follows:
void * cudaMalloc(size_t size, int * errorCode = nullptr)
{
    if (errorCode)
        *errorCode = 0; // write through the pointer, not to the pointer itself
    char *memory = new char[size];
    return memory;
}
In this second case, the error code is set through a pointer that defaults to null (for callers who do not care about the error code at all), and the allocated memory is returned.
The first method is used the same way as the actual cudaMalloc:
float *p;
int errorCode;
errorCode = cudaMalloc((void**)&p, sizeof(float));
While the second one can be used as follows:
float *p;
int errorCode;
p = (float *) cudaMalloc(sizeof(float), &errorCode);
These two methods are functionally equivalent but have different signatures. The CUDA designers went with the first, returning the error code and assigning the memory through a pointer, although most people say the second would have been the better choice.


thrust transform defining custom binary function

I am trying to write a custom function to carry out a sum. I used this question for reference: Cuda Thrust Custom function. Here is how I have defined my functor:
struct hashElem
{
int freq;
int error;
};
//basically this function adds some value to the error field of each element
struct hashErrorAdd{
const int error;
hashErrorAdd(int _error): error(_error){}
__host__ __device__
struct hashElem operator()(const hashElem& o1,const int& o2)
{
struct hashElem o3;
o3.freq = o1.freq;
o3.error = o1.error + (NUM_OF_HASH_TABLE-o2)*error; //NUM_OF_HASH_TABLE is a constant
return o3;
}
};
struct hashElem freqError[SIZE_OF_HASH_TABLE*NUM_OF_HASH_TABLE];
int count[SIZE_OF_HASH_TABLE*NUM_OF_HASH_TABLE];
thrust::device_ptr<struct hashElem> d_freqError(freqError);
thrust::device_ptr<int> d_count(count);
thrust::transform(thrust::device,d_freqError,d_freqError+new_length,d_count,hashErrorAdd(perThreadLoad)); //new_length is a constant
This code on compilation gives the following error:
error: function "hashErrorAdd::operator()" cannot be called with the given argument list
argument types are: (hashElem)
object type is: hashErrorAdd
Can anybody explain why I am getting this error and how I can resolve it? Please comment if I have not explained the problem clearly. Thank you.
It appears that you want to pass two input vectors to thrust::transform and then do an in-place transform (i.e. no output vector is specified).
There is no such overload of thrust::transform.
Since you have passed:
thrust::transform(vector_first, vector_last, vector_first, operator);
The closest matching prototype is a version of transform that takes one input vector and creates one output vector. In that case, you would need to pass a unary op that takes the input vector type (hashElem) only as an argument, and returns a type appropriate for the output vector, which is int in this case, i.e. as you have written it (not as your intent). Your operator() does not do that, and it cannot be called with the arguments that thrust is expecting to pass to it.
As I see it, you have a couple options:
You could switch to the version of transform that takes two input vectors and produces one output vector, and create a binary op as functor.
You could zip together your two input vectors, and do an in-place transform if that is what you want. Your functor would then be a unary op, but it would take as argument whatever tuple was created from dereferencing the input vector, and it would have to return or modify the same kind of tuple.
As an aside, your method of creating device pointers directly from host arrays looks broken to me. You may wish to review the thrust quick start guide.
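The first option (two input ranges, one output range, binary functor) can be sketched on the host with std::transform, whose two-input form takes its arguments in the same order as the two-input thrust::transform: first1, last1, first2, result, op. This is only an analogue in standard C++, not Thrust code; kNumTables is an assumed stand-in for NUM_OF_HASH_TABLE and add_errors is a hypothetical helper name. The output iterator is the first input range again, which gives the in-place update the question seems to want.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct HashElem {
    int freq;
    int error;
};

// Assumed value standing in for NUM_OF_HASH_TABLE in the question.
constexpr int kNumTables = 4;

// Binary functor: combines one HashElem with one count.
struct HashErrorAdd {
    int error;
    explicit HashErrorAdd(int e) : error(e) {}
    HashElem operator()(const HashElem& o1, const int& o2) const {
        return HashElem{o1.freq, o1.error + (kNumTables - o2) * error};
    }
};

// Two inputs (elems, counts) and one output (elems again, so in place).
void add_errors(std::vector<HashElem>& elems,
                const std::vector<int>& counts,
                int perThreadLoad) {
    std::transform(elems.begin(), elems.end(),  // first input range
                   counts.begin(),              // second input range
                   elems.begin(),               // output range
                   HashErrorAdd(perThreadLoad));
}
```

With Thrust the call would have the same shape, presumably with an execution policy in front and with iterators that refer to device memory rather than host arrays.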

How can I use SWIG to handle a JAVA to C++ call with a pointer-to-pointer argout argument?

The problem involved a JAVA call to a C-function (API) which returned a pointer-to-pointer as an argout argument. I was trying to call the C API from JAVA and I had no way to modify the API.
Using SWIG typemap to pass pointer-to-pointer:
Here is another approach using typemaps. It is targeting Perl, not Java, but the concepts are the same. And I finally managed to get it working using typemaps and no helper functions:
For this function:
typedef void * MyType;
int getblock( int a, int b, MyType *block );
I have 2 typemaps:
%typemap(perl5, in, numinputs=0) void ** data( void * scrap )
{
$1 = &scrap;
}
%typemap(perl5, argout) void ** data
{
SV* tempsv = sv_newmortal();
if ( argvi >= items ) EXTEND(sp,1);
SWIG_MakePtr( tempsv, (void *)*$1, $descriptor(void *), 0);
$result = tempsv;
argvi++;
}
And the function is defined as:
int getblock( int a, int b, void ** data );
In my swig .i file. Now, this passes back an opaque pointer in the argout typemap, because that's what is useful for this particular situation; however, you could replace the SWIG_MakePtr line with code that actually does something with the data in the pointer if you wanted to. Also, when I want to pass the pointer into a function, I have a typemap that looks like this:
%typemap(perl5, in) void * data
{
if ( !SvROK($input) ) croak( "Not a reference...\n" );
if ( SWIG_ConvertPtr($input, (void **) &$1, $1_descriptor, 0 ) == -1 )
croak( "Couldn't convert $1 to $1_descriptor\n");
}
And the function is defined as:
int useblock( void * data );
In my swig .i file.
Obviously, this is all perl, but should map pretty directly to Java as far as the typemap architecture goes. Hope it helps...
[Swig] Java: Using C helper function to pass pointer-to-pointer
The API.h header file contained:
extern int ReadMessage(HEADER **hdr);
The original C-call looked like:
HEADER *hdr;
int status;
status = ReadMessage(&hdr);
The function of the API was to store data at the memory location specified by the pointer-to-pointer.
I tried to use SWIG to create the appropriate interface file. SWIG.i created the file SWIGTYPE_p_p_header.java from API.h. The problem is the SWIGTYPE_p_p_header constructor initialized swigCPtr to 0.
The JAVA call looked like:
SWIGTYPE_p_p_header hdr = new SWIGTYPE_p_p_header();
status = SWIG.ReadMessage(hdr);
But when I called the API from JAVA the ptr was always 0.
I finally gave up passing the pointer-to-pointer as an input argument. Instead I defined another C-function in SWIG.i to return the pointer-to-pointer in a return value. I thought it was a Kludge ... but it worked!
You may want to try this:
SWIG.i looks like:
// return pointer-to-pointer
%inline %{
HEADER *ReadMessageHelper() {
HEADER *hdr;
int returnValue;
returnValue = ReadMessage(&hdr);
if (returnValue!= 1) hdr = NULL;
return hdr;
}%}
The inline function above could leak memory: the HEADER instance is created on the heap, and Java won't take ownership of the memory returned by ReadMessageHelper.
The fix for the memory leak is to declare ReadMessageHelper as a newobject so that Java takes control of the memory.
%newobject ReadMessageHelper();
JAVA call now would look like:
HEADER hdr;
hdr = SWIG.ReadMessageHelper();
If you are lucky, as I was, you may have another API available to release the message buffer. In which case, you wouldn’t have to do the previous step.
William Fulton, the SWIG guru, had this to say about the approach above:
“I wouldn't see the helper function as a kludge, more the simplest solution to a tricky problem. Consider what the equivalent pure 100% Java code would be for ReadMessage(). I don't think there is an equivalent as Java classes are passed by reference and there is no such thing as a reference to a reference, or pointer to a pointer in Java. In the C function you have, a HEADER instance is created by ReadMessage and passed back to the caller. I don't see how one can do the equivalent in Java without providing some wrapper class around HEADER and passing the wrapper to the ReadMessage function. At the end of the day, ReadMessage returns a newly created HEADER and the Java way of returning newly created objects is to return it in the return value, not via a parameter.”

how to cast thrust::device_vector<int> to raw pointer

I have a thrust device_vector. I want to cast it to a raw pointer so that I can pass it to a kernel. How can I do so?
thrust::device_vector<int> dv(10);
//CAST TO RAW
kernel<<<bl,tpb>>>(pass raw)
You can do this using thrust::raw_pointer_cast. The device vector class has a member function data which will return a thrust::device_ptr to the memory held by the vector, which can be cast, something like this:
thrust::device_vector<int> dv(10);
int * dv_ptr = thrust::raw_pointer_cast(dv.data());
kernel<<<bl,tpb>>>(dv_ptr);
(disclaimer: written in browser, never compiled, never tested). There is a full working example of this included with thrust: unwrap_pointer.cu

atomic operation implementation

I am using an atomic operation provided by SunOS in <sys/atomic.h>:
void *atomic_cas_ptr(volatile void *target, void *cmp, void *newval);
To make it usable, I have to check whether the old value returned by this function is the same as the cmp value passed in by the caller; if they are equal, the operation succeeded.
But I have some doubts. The function returns a void pointer to the old value, call it void *old, and I pass in void *cmp, so how do I compare old and cmp? And what do I do if *old changes while I am comparing? In essence, I want to wrap this function inside another function that takes the same three arguments and returns true or false to indicate success or failure.
About CAS: I have read that it is a misnomer to call it a lock-free operation, since it eventually takes a lock in hardware (a bus lock). Is that correct? Is that why CAS is a costly operation?
Possibly the function declaration confused you. This function does not return a pointer to the old value (of what?), but the old value itself, read from the memory pointed to by target (which should really be a pointer to void*, i.e. void* volatile * target).
Usually if a CAS primitive returns an old value rather than a bool, you check CAS success with something like this:
void* atomic_ptr; // global atomically modified pointer
void *oldval, *newval, *comparand; // local variables (each one a pointer)
/* ... */
oldval = atomic_cas_ptr( (void*)&atomic_ptr, /* note that address is taken */
comparand, newval );
if( oldval == comparand ) {
// success
} else {
// failure
}
So when you compare oldval and comparand, you work with local variables that do not change concurrently (while the global atomic_ptr might be changed again), and you compare pointer values without dereferencing.
The function you want should be like this:
bool my_atomic_cas_ptr(volatile void* target, void* comparand, void* newval)
{
return (comparand == atomic_cas_ptr(target, comparand, newval));
}
Note that since in some algorithms the old value (the one before CAS) should be known, it's better to have a CAS primitive returning the old value rather than a bool, as you can easily build the latter from the former while the opposite is more complex and inefficient (see the following code that tries to obtain the correct old value out of a MacOS CAS primitive that returns a bool).
void* CAS(void* volatile* target, void* comparand, void* newval)
{
while( !OSAtomicCompareAndSwapPtr(comparand, newval, target) ) {
void* snapshot = *target;
if( snapshot!=comparand ) return snapshot;
}
return comparand;
}
As for CAS locking memory bus, it depends on hardware. It was true for old x86 processors, but in modern x86 systems it's different. First, there is no central bus; it was replaced by AMD's HyperTransport and Intel's QuickPath Interconnect. Second, in recent CPU generations locked instructions are not all serialized (see some data showing that locked instructions on different memory addresses do not interfere). And finally, in the commonly accepted definition lock freedom is the guarantee of system-wide progress, not absence of serializing synchronization.
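For comparison, here is a small sketch of both CAS flavours using std::atomic from standard C++ instead of the SunOS or MacOS primitives (a swapped-in technique, not what the question uses). The wrapper names cas_bool and cas_old are made up for illustration. compare_exchange_strong returns a bool and, on failure, writes the value it actually observed into its first argument, so either form falls out of a single call.

```cpp
#include <atomic>
#include <cassert>

// bool-returning flavour: true if target held `comparand` and now holds `newval`.
bool cas_bool(std::atomic<void*>& target, void* comparand, void* newval) {
    // `comparand` is a local copy; the library may overwrite it on failure.
    return target.compare_exchange_strong(comparand, newval);
}

// Old-value-returning flavour, like atomic_cas_ptr.
// On success `comparand` is untouched and equals the old value.
// On failure compare_exchange_strong stores the observed value into it.
void* cas_old(std::atomic<void*>& target, void* comparand, void* newval) {
    target.compare_exchange_strong(comparand, newval);
    return comparand;
}
```

As in the answer above, the caller checks success of the second form by comparing the returned pointer with the comparand it passed in; no dereferencing is involved.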

Object allocation inline on the stack

What does that mean when it says 'object allocation inline on the stack'?
Especially the 'inline' bit
It means that all the data for the object is allocated on the stack, and will be popped off when the current method terminates.
The alternative (which occurs in C# and Java, or if you're using a pointer in C++) is to have a reference or pointer on the stack, which refers to the object data which is allocated on the heap.
I think the "inline" here just means "as part of the stack frame for this method" as opposed to existing separately from the method.
Well, you know what the stack is, right? If you define a function in, say, C:
int foo() {
int bar = 42;
return bar;
}
When the function is called, some space is created for information about the function on the stack, and the integer bar is allocated there as well. When the function returns, everything in that stack frame is deallocated.
Now, in C++:
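#include <iostream>
using namespace std;

class A {
public: // without this the constructor is private and foo() cannot create an A
    int a;
    int b;
    A(int x, int y) {
        a = x;
        b = y;
    }
    ~A() { // destructor
        cout << "A(" << a << "," << b << ") being deleted!" << endl;
    }
}; // a class definition must end with a semicolon
void foo() {
    A on_the_stack(1,2);
    A *on_the_heap = new A(3,4); // deliberately never deleted, see below
}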
In languages like Java, all objects are allocated on the heap (unless the compiler does some sort of optimization). But in some languages like C++, the class objects can go right on the stack just like ints or floats. Memory from the heap is not used unless you explicitly call new. Note that our on_the_heap object never gets deallocated (by calling delete on it), so it causes a memory leak. The on_the_stack object, on the other hand, is automatically deallocated when the function returns, and will have its destructor called prior to doing so.
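The destruction behaviour described above can be observed with a small sketch. Tracked and scope_demo are hypothetical names, and the log vector only exists to record events. One change from the example above: the heap object is held in a std::unique_ptr (not used in the original answer), so it is deleted automatically when the scope ends and nothing leaks. With a raw new and no delete, the "destroy heap" entry would never appear.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Appends a message to a shared log when it is destroyed.
class Tracked {
public:
    Tracked(std::string name, std::vector<std::string>& log)
        : name_(std::move(name)), log_(log) {}
    ~Tracked() { log_.push_back("destroy " + name_); }

private:
    std::string name_;
    std::vector<std::string>& log_;
};

void scope_demo(std::vector<std::string>& log) {
    // Object data lives inline in this function's stack frame.
    Tracked on_the_stack("stack", log);
    // Only the smart pointer lives in the frame; the object is on the heap.
    auto on_the_heap = std::make_unique<Tracked>("heap", log);
    log.push_back("end of scope");
    // Locals are destroyed in reverse order of declaration:
    // on_the_heap (which deletes its object) first, then on_the_stack.
}
```

After scope_demo(log) returns, the log holds "end of scope", "destroy heap", "destroy stack", in that order.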