Passing variables to CUDA kernel

So, I am writing a PDE solver in CUDA C++. The solver is a function which in turn calls the CUDA kernel to solve the PDE. Now, I want to use the PDE parameters as arguments to the kernel. That means I would have to allocate device memory for those variables, like
cudaMalloc((void **)&Nt_d, size); and then cudaMemcpy(Nt_d, &Nt, size, cudaMemcpyHostToDevice); (Nt is an integer), which is the pattern for pointers. I want to pass integers as well as floats, i.e. non-pointer variables, but I can't find the correct syntax for it. I don't want to use the parameters as global constants; I want to use them as arguments to the kernel. Is there any way to do so?
Your help is highly appreciated.

You pass them directly; pass-by-value.
A kernel may have a prototype like this:
__global__ void mykernel(int *p1, float *p2, int i1, float f2);
in that case, p1 and p2 are pointer parameters, whereas i1 is an int parameter passed by value, and f2 is a float parameter passed by value.
This is more-or-less just a recital of what you would do for a function call in C or C++ for these types of parameters. You can use those parameters like i1 and f2 in your kernel code directly, just as you would with an ordinary C/C++ function.
As you've already pointed out, pointer parameters should presumably point to allocations you've already set up on the device via e.g. cudaMalloc.
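For example, a minimal sketch (the names and sizes here are just illustrative, and d_p1/d_p2 are assumed to have been allocated with cudaMalloc beforehand):
__global__ void mykernel(int *p1, float *p2, int i1, float f2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < i1)                // i1 used directly, passed by value
        p2[idx] = f2 * p1[idx];  // f2 used directly, passed by value
}

// host side: only the pointer arguments need cudaMalloc/cudaMemcpy;
// Nt and scale are simply passed by value in the launch
int Nt = 1024;
float scale = 0.5f;
mykernel<<<(Nt + 255) / 256, 256>>>(d_p1, d_p2, Nt, scale);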
You might want to study some CUDA sample codes, such as vectorAdd.

Related

CUDA: cublas: exception raised at dot product [duplicate]

I need to find the index of the maximum element in an array of floats. I am using the function "cublasIsamax", but this returns the index to the CPU, and this is slowing down the running time of the application.
Is there a way to compute this index efficiently and store it in the GPU?
Thanks!
Since the CUBLAS V2 API was introduced (with CUDA 4.0, IIRC), routines which return a scalar result or index can store it directly into a variable in device memory, rather than into a host variable (which entails a device-to-host transfer and might leave the result in the wrong memory space).
To use this, call cublasSetPointerMode with the CUBLAS_POINTER_MODE_DEVICE mode to tell the CUBLAS context to expect the pointers for scalar arguments to be device pointers. This then implies that in a call like
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n,
const float *x, int incx, int *result)
that result must be a device pointer.
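As a rough sketch of what that looks like in practice (handle, n, and x are assumed to exist already; d_result is a single int allocated on the device with cudaMalloc):
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
cublasIsamax(handle, n, x, 1, d_result);  // the index is written to *d_result in device memory
// note: the returned index follows the BLAS convention and is 1-based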
If you want to use CUBLAS and you have a GPU with compute capability 3.5 (K20, Titan), then you can use CUBLAS with dynamic parallelism: you can call CUBLAS from within a kernel on the GPU, and no data needs to be returned to the CPU.
If you have no device with compute capability 3.5, you will probably have to implement a find-max function yourself or look for an additional library.

Passing value from device memory as kernel parameter in CUDA

I'm writing a CUDA application that has a step where the variance of some complex-valued input data is computed, and then that variance is used to threshold the data. I've got a reduction kernel that computes the variance for me, but I'm not sure if I have to pull the value back to the host to pass it to the thresholding kernel or not.
Is there a way to pass the value directly from device memory?
You can use a __device__ variable to hold the variance value in-between kernel calls.
Put this before the definition of the kernels that use it:
__device__ float my_variance = 0.0f;
Variables defined this way can be used by any kernel executing on the device (without requiring that they be explicitly passed as a kernel function parameter) and persist for the lifetime of the context, i.e. beyond the lifetime of any single kernel call.
It's not entirely clear from your question, but you can also define an array of data this way.
__device__ float my_variance[32] = {0.0f};
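A minimal sketch of how two kernels might share such a variable (the kernel bodies here are placeholders, not a real reduction):
__device__ float my_variance = 0.0f;

__global__ void reduce_kernel(const float *data, int n)
{
    // ... reduction omitted; one thread writes the final result
    if (blockIdx.x == 0 && threadIdx.x == 0)
        my_variance = 1.23f;  // stand-in for the computed variance
}

__global__ void threshold_kernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && data[idx] < my_variance)  // read directly, no kernel parameter needed
        data[idx] = 0.0f;
}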
Likewise, allocations created by cudaMalloc live for the duration of the application/context (or until an appropriate cudaFree is encountered) and so there is no need to "pull back the data" to the host if you want to use it in a successive kernel:
float *d_variance;
cudaMalloc((void **)&d_variance, sizeof(float));
my_reduction_kernel<<<...>>>(..., d_variance, ...);
my_thresholding_kernel<<<...>>>(..., d_variance, ...);
Any value set in *d_variance by the reduction kernel above will be properly observed by the thresholding kernel.

Thrust device_malloc and device_new

What are the advantages for using Thrust device_malloc instead of the normal cudaMalloc and what does device_new do?
For device_malloc it seems the only reason to use it is that it's just a bit cleaner.
The device_new documentation says:
"device_new implements the placement new operator for types resident
in device memory. device_new calls T's null constructor on an array of
objects in device memory. No memory is allocated by this function."
Which I don't understand...
device_malloc returns the proper type of object if you plan on using Thrust for other things. There is normally no reason to use cudaMalloc if you are using Thrust. Encapsulating the CUDA calls makes things easier and usually cleaner. The same goes for C++ and STL containers versus C-style arrays and malloc.
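For illustration, a small sketch of the Thrust-style allocation (sizes are arbitrary):
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>

// allocate space for 100 floats and get a typed device_ptr back
thrust::device_ptr<float> d_vec = thrust::device_malloc<float>(100);

// the device_ptr can be used directly with other Thrust algorithms,
// or converted with thrust::raw_pointer_cast for a kernel launch

thrust::device_free(d_vec);  // counterpart of cudaFree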
For device_new, you should read the following line of the documentation:
template<typename T>
device_ptr<T> thrust::device_new (device_ptr< void > p, const size_t n = 1)
p: A device_ptr to a region of device memory into which to construct
one or many Ts.
Basically, this function can be used if memory has already been allocated. Only the default constructor will be called, and the call returns a device_ptr cast to T's type.
On the other hand, the following method allocates memory and returns a device_ptr<T>:
template<typename T >
device_ptr<T> thrust::device_new (const size_t n = 1)
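A sketch contrasting the two overloads (Particle stands for any default-constructible type):
#include <thrust/device_malloc.h>
#include <thrust/device_new.h>
#include <thrust/device_free.h>

// overload 1: memory already allocated; device_new only runs the default constructor
thrust::device_ptr<void> raw = thrust::device_malloc(sizeof(Particle));
thrust::device_ptr<Particle> p1 = thrust::device_new<Particle>(raw, 1);

// overload 2: allocate and default-construct in one call
thrust::device_ptr<Particle> p2 = thrust::device_new<Particle>(1);

thrust::device_free(p1);  // note: device_free deallocates but does not call destructors
thrust::device_free(p2);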
So I think I found one good use for device_new.
It's basically a better way of initialising an object and copying it to the device, while holding a pointer to it on the host.
So instead of doing:
Particle *dev_p;
cudaMalloc((void**)&(dev_p), sizeof(Particle));
cudaMemcpy(dev_p, &p, sizeof(Particle), cudaMemcpyHostToDevice);
test2<<<1,1>>>(dev_p);
I can just do:
thrust::device_ptr<Particle> p = thrust::device_new<Particle>(1);
test2<<<1,1>>>(thrust::raw_pointer_cast(p));

Why do we have to pass a pointer to a pointer to cudaMalloc

The following codes are widely used for GPU global memory allocation:
float *M;
cudaMalloc((void**)&M,size);
I wonder why we have to pass a pointer to a pointer to cudaMalloc, and why it was not designed like this instead:
float *M;
cudaMalloc((void*)M,size);
Thanks for any plain descriptions!
cudaMalloc needs to write the value of the pointer to M (not *M), so M must be passed by reference.
Another way would be to return the pointer in the classic malloc fashion. Unlike malloc, however, cudaMalloc returns an error status, like all CUDA runtime functions.
To explain the need in a little more detail:
Before the call to cudaMalloc, M points anywhere; its value is undefined. After the call to cudaMalloc you want a valid array to be present at the memory location it points to. One could naïvely say "then just allocate the memory at this location", but that is of course not possible in general: the undefined address will normally not even be inside valid memory. cudaMalloc needs to be able to choose the location. But if the pointer is passed by value, there is no way to tell the caller where that location is.
In C++, one could make the signature
template<typename PointerType>
cudaStatus_t cudaMalloc(PointerType& ptr, size_t);
where passing ptr by reference allows the function to change the location, but since cudaMalloc is part of the CUDA C API this is not an option. The only way to pass something as modifiable in C is to pass a pointer to it. And since the object is itself a pointer, what you need to pass is a pointer to a pointer.
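To make the contrast concrete, here is a small made-up allocator in plain C-style code showing why pass-by-value cannot work:
#include <cstdlib>

// pass-by-value: the assignment only changes the local copy; the caller's pointer is untouched
void broken_alloc(void *p, std::size_t size)   { p = std::malloc(size); }   // (also leaks the allocation)

// pass a pointer to the pointer: now the caller's variable itself is updated
void working_alloc(void **p, std::size_t size) { *p = std::malloc(size); }

int main()
{
    float *M = nullptr;
    broken_alloc(M, 100 * sizeof(float));             // M is still nullptr afterwards
    working_alloc((void **)&M, 100 * sizeof(float));  // M now points to the allocation
    std::free(M);
    return 0;
}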

Variable Sizes Array in CUDA

Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About printf: the problem is that it only works on compute capability 2.x. There is an alternative, cuPrintf, that you might try.
For the allocation of variable-size arrays in CUDA, you do it like this:
Inside the kernel you write extern __shared__ int array[];
On the kernel call you pass the shared memory size in bytes as the third launch parameter, like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
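Putting the two pieces together, a minimal sketch could look like this (grid/block sizes are illustrative):
__global__ void mykernel(int arraySize)
{
    extern __shared__ int array[];  // size is fixed at launch time, not at compile time
    if (threadIdx.x < arraySize)
        array[threadIdx.x] = threadIdx.x;
    __syncthreads();
    // ... use array ...
}

// host side: the third launch parameter is the dynamic shared memory size in bytes
int arraySize = 10;
mykernel<<<1, 32, arraySize * sizeof(int)>>>(arraySize);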
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes, stores them in an array, then after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads, and then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because these would be arrays allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.
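And a rough sketch of the shared-memory splitting idea, assuming each thread's offset and element count have been computed beforehand (the offsets and counts arrays are hypothetical):
__global__ void per_thread_slices(const int *offsets, const int *counts)
{
    extern __shared__ float pool[];  // one dynamically sized pool for the whole block
    // each thread claims its own slice of the pool
    float *mine = pool + offsets[threadIdx.x];
    for (int i = 0; i < counts[threadIdx.x]; ++i)
        mine[i] = 0.0f;  // stand-in for per-thread work on a variable-size array
}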