CUDA - Transferring CPU variables to GPU __constant__ variables

As with anything in CUDA, the most basic things are sometimes the hardest...
So... I just want to copy a variable from the CPU to a GPU's constant variable, and I am having a hard time.
This is what I have:
__constant__ int contadorlinhasx_d;
int main(){
(...)
int contadorlinhasx=100;
status=cudaMemcpyToSymbol(contadorlinhasx_d,contadorlinhasx,1*sizeof(int),0,cudaMemcpyHostToDevice);
And I get this error:
presortx.cu(222): error: no instance of overloaded function "cudaMemcpyToSymbol" matches the argument list
argument types are: (int, int, unsigned long, int, cudaMemcpyKind)
Could anyone help me? I know it is some stupid error, but I am tired of googling it, and I have spent almost 30 minutes just trying to copy a stupid variable :/
Thanks in advance

You need to do something like
cudaMemcpyToSymbol("contadorlinhasx_d",
&contadorlinhasx,
1*sizeof(int),
0,
cudaMemcpyHostToDevice);
[Note this is the old API call, now deprecated in CUDA 4.0 and newer]
or
cudaMemcpyToSymbol(contadorlinhasx_d,
                   &contadorlinhasx,
                   1 * sizeof(int),
                   0,
                   cudaMemcpyHostToDevice);
If you look at the API documentation, the first two arguments are pointers. The first can either be a string, which forces a symbol lookup internally in the API (pre CUDA 4), or a device symbol address (CUDA 4 and later). The second argument is the address of the host source memory for the copy. The compiler error message is pretty explicit: you are passing the wrong argument types, and the compiler can't find an instance in the library which matches.
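Putting it together, a minimal sketch of the modern (CUDA 4+) symbol-based form with error checking; the kernel name here is mine, added only to show the constant being read:

#include <cstdio>

__constant__ int contadorlinhasx_d;

// hypothetical kernel, just to demonstrate reading the constant
__global__ void le_constante(int *out)
{
    *out = contadorlinhasx_d;
}

int main()
{
    int contadorlinhasx = 100;

    // pass the symbol itself and the *address* of the host variable
    cudaError_t status = cudaMemcpyToSymbol(contadorlinhasx_d,
                                            &contadorlinhasx,
                                            sizeof(int),
                                            0,
                                            cudaMemcpyHostToDevice);
    if (status != cudaSuccess)
        printf("copy failed: %s\n", cudaGetErrorString(status));

    int *d_out;
    cudaMalloc(&d_out, sizeof(int));
    le_constante<<<1, 1>>>(d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}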

Related

Passing variables to CUDA kernel

So, I am writing a PDE solver in CUDA C++. The solver is a function which in turn calls the CUDA kernel to solve the PDE. Now, I want to use the PDE parameters as arguments to the kernel. That means I have to malloc for those variables, like
cudaMalloc((void **)&Nt_d, size); and then cudaMemcpy(&Nt_d, Nt, size, cudaMemcpyHostToDevice); (Nt is an integer), which is the pattern for pointers. I want to pass integers as well as floats, i.e. non-pointer variables, but can't find the correct syntax for it. I don't want to use the parameters as global constants. I want to use them as arguments to the kernel. Is there any way to do so?
Your help is highly appreciated.
You pass them directly; pass-by-value.
A kernel may have a prototype like this:
__global__ void mykernel(int *p1, float *p2, int i1, float f2);
in that case, p1 and p2 are pointer parameters, whereas i1 is an int parameter passed by value, and f2 is a float parameter passed by value.
This is more-or-less just a recital of what you would do for a function call in C or C++ for these types of parameters. You can use those parameters like i1 and f2 in your kernel code directly, just as you would with an ordinary C/C++ function.
As you've already pointed out, the pointer parameters should point to allocations you've already set up on the device, e.g. via cudaMalloc.
You might want to study some CUDA sample codes, such as vectorAdd.
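As a minimal runnable sketch of that prototype (the array size and scalar values here are made up):

__global__ void mykernel(int *p1, float *p2, int i1, float f2)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < i1)                  // i1 is used directly, like any C parameter
        p2[idx] = f2 * p1[idx];    // f2 likewise
}

int main()
{
    const int N = 256;
    int   *d_p1;
    float *d_p2;
    cudaMalloc(&d_p1, N * sizeof(int));    // pointer args: device allocations
    cudaMalloc(&d_p2, N * sizeof(float));

    // scalar arguments are simply passed by value at launch time
    mykernel<<<1, N>>>(d_p1, d_p2, N, 3.5f);
    cudaDeviceSynchronize();

    cudaFree(d_p1);
    cudaFree(d_p2);
    return 0;
}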

CUDA: cublas: exception raised at dot product [duplicate]

I need to find the index of the maximum element in an array of floats. I am using the function "cublasIsamax", but this returns the index to the CPU, and this is slowing down the running time of the application.
Is there a way to compute this index efficiently and store it in the GPU?
Thanks!
Since the CUBLAS V2 API was introduced (with CUDA 4.0, IIRC), routines which return a scalar or an index can store the result directly into a variable in device memory, rather than into a host variable (which entails a device-to-host transfer and might leave the result in the wrong memory space).
To use this, call cublasSetPointerMode with the CUBLAS_POINTER_MODE_DEVICE mode to tell the CUBLAS context to expect pointers for scalar arguments to be device pointers. This then implies that in a call like
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n,
                            const float *x, int incx, int *result)
result must be a device pointer.
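A sketch of how that looks in practice (the wrapper function name is mine; d_x and d_result are assumed to be existing device allocations):

#include <cublas_v2.h>

// writes the (1-based) index of the max-magnitude element of d_x into
// d_result, which stays in device memory throughout
void isamax_device(const float *d_x, int n, int *d_result)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    // scalar results are now written through device pointers
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

    cublasIsamax(handle, n, d_x, 1, d_result);

    cublasDestroy(handle);
}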
If you want to use CUBLAS and you have a GPU with compute capability 3.5 (K20, Titan), then you can use CUBLAS with dynamic parallelism: you can call CUBLAS from within a kernel on the GPU, and no data will be returned to the CPU.
If you have no device with cc 3.5, you will probably have to implement a find-max function yourself or look for an additional library.
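Roughly, the dynamic-parallelism variant looks like the sketch below. Note this assumes the separately linked device-side CUBLAS library and compilation with something like -arch=sm_35 -rdc=true -lcublas_device, so treat it as an outline rather than a drop-in solution:

__global__ void maxidx_kernel(const float *x, int n, int *result)
{
    // device-side CUBLAS: the handle is created inside the kernel
    cublasHandle_t handle;
    cublasCreate(&handle);

    // result stays in device memory; nothing is copied back to the host
    cublasIsamax(handle, n, x, 1, result);

    cublasDestroy(handle);
}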

How to declare a struct with a dynamic array inside it in device

How can I declare a struct in device code where one member is an array, and then dynamically allocate memory for it? For example, for the code below the compiler says: error : calling a __host__ function("malloc") from a __global__ function("kernel_ScoreMatrix") is not allowed. Is there another way to perform this action?
The type of dev_size_idx_threads is int*; its value is sent to the kernel and used to allocate memory.
struct struct_matrix
{
    int *idx_threads_x;
    int *idx_threads_y;
    int thread_diag_length;
    int idx_length;
};

struct struct_matrix matrix[BLOCK_SIZE_Y];
matrix->idx_threads_x = (int *) malloc(*(dev_size_idx_threads) * sizeof(int));
From device code, dynamic memory allocation (malloc and new) is supported only on devices of cc2.0 and greater. If you have a cc2.0 device or greater, and you pass an appropriate flag to nvcc (such as -arch=sm_20), you should not see this error. Note that if you are passing multiple compilation targets (sm_10, sm_20, etc.) and even one of the targets does not meet the cc2.0+ requirement, you will see this error.
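On a cc2.0+ build, in-kernel allocation would look roughly like this sketch (reusing the struct and size pointer from the question; the kernel body is illustrative):

__global__ void kernel_ScoreMatrix(int *dev_size_idx_threads)
{
    struct struct_matrix m;

    // device-side malloc: only valid when compiled for sm_20 or higher
    m.idx_threads_x = (int *) malloc(*dev_size_idx_threads * sizeof(int));
    if (m.idx_threads_x != NULL) {
        // ... use m.idx_threads_x ...
        free(m.idx_threads_x);   // free from device code when done
    }
}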
If you have a cc1.x device, you will need to perform these types of allocations from the host (e.g. using cudaMalloc) and pass appropriate pointers to your kernel.
If you choose that route (allocating from the host), you may also be interested in my answer to questions like this one.
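A sketch of that host-side pattern, using the question's struct (the kernel signature and the size are assumptions of mine):

struct struct_matrix
{
    int *idx_threads_x;
    int *idx_threads_y;
    int thread_diag_length;
    int idx_length;
};

// hypothetical kernel: the struct arrives by value, its members are device pointers
__global__ void kernel_ScoreMatrix(struct_matrix m)
{
    int i = threadIdx.x;
    if (i < m.idx_length)
        m.idx_threads_x[i] = i;
}

int main()
{
    int size = 128;                 // assumed: known on the host
    struct_matrix m;
    m.idx_length = size;

    // allocate the member arrays from host code instead of in-kernel malloc
    cudaMalloc(&m.idx_threads_x, size * sizeof(int));
    cudaMalloc(&m.idx_threads_y, size * sizeof(int));

    kernel_ScoreMatrix<<<1, size>>>(m);
    cudaDeviceSynchronize();

    cudaFree(m.idx_threads_x);
    cudaFree(m.idx_threads_y);
    return 0;
}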
EDIT: responding to questions below:
In Visual Studio (2008 Express; it should be similar for other versions), you can set the compilation target as follows: open the project, select Project...Properties, then Configuration Properties...CUDA Runtime API...GPU. Now, on the right-hand pane, you will see entries like GPU Architecture (1) (and (2), etc.). These are drop-downs that you can click on to select the target(s) you want to compile for. If your GPU is sm_21, I would select that for (1) and leave the others blank, or select compatible versions like sm_20.
To see worked examples, please follow the link I gave above. A couple of worked examples are linked from my answer here, as well as a description of how it is done.

Passing a struct pointer to a CUDA kernel [duplicate]

Possible Duplicate:
Copying a struct containing pointers to CUDA device
I have a structure of device pointers, pointing to arrays allocated on the device, like this:
struct mystruct{
    int* dev1;
    double* dev2;
    .
    .
};
There are a large number of arrays in this structure. I started writing a CUDA kernel in which I passed a pointer to mystruct and then dereferenced it within the CUDA kernel code, like this: mystruct->dev1[i]. But I realized after writing a few lines that this will not work, since by CUDA first principles you cannot dereference a host pointer (in this case, to mystruct) within a CUDA kernel.
But this is kind of inconvenient, since I will have to pass a large number of arguments to my kernels. Is there any way to avoid this? I would like to keep the argument list of my kernel calls as short as possible.
As I explain in this answer, you can pass your struct by value to the kernel, so you don't have to worry about dereferencing a host pointer:
__global__ void kernel(mystruct in)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    in.dev1[idx] *= 2;
    in.dev2[idx] += 3.14159;
}
There is some overhead to passing the struct by value to be aware of. However, if your struct is not too large, it shouldn't matter.
If you pass the same struct to a lot of kernels, or repeatedly, you may consider copying the struct itself to global or constant memory instead as suggested by aland, or use mapped host memory as suggested by Mark Ebersole. But passing the struct by value is a much simpler way to get started.
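For completeness, host-side setup for that kernel might look like this sketch (the element count N is an assumption of mine):

int main()
{
    const int N = 1024;                       // assumed element count
    mystruct s;
    cudaMalloc(&s.dev1, N * sizeof(int));     // members are device allocations
    cudaMalloc(&s.dev2, N * sizeof(double));

    // the struct is copied by value into the kernel's parameter space,
    // so dereferencing its (device-pointer) members inside the kernel is legal
    kernel<<<N / 256, 256>>>(s);
    cudaDeviceSynchronize();

    cudaFree(s.dev1);
    cudaFree(s.dev2);
    return 0;
}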
(Note: please search StackOverflow before duplicating questions...)
You can copy your mystruct structure to global memory and pass its device address to kernel.
From a performance viewpoint, however, it would be better to store mystruct in constant memory, since (I guess) there are a lot of random reads from it by many threads.
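A sketch of that variant (the kernel body and the host-side setup shown in the comments are illustrative):

__constant__ mystruct c_s;    // the struct itself lives in constant memory

__global__ void kernel_const(int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        c_s.dev1[idx] *= 2;   // reads of c_s come through the constant cache
}

// host side: allocate the member arrays, then copy the struct once, e.g.
// mystruct s;
// cudaMalloc(&s.dev1, n * sizeof(int));
// cudaMalloc(&s.dev2, n * sizeof(double));
// cudaMemcpyToSymbol(c_s, &s, sizeof(mystruct));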
You could also use page-locked (pinned) host memory and create the structure within that region, if your setup supports it. Please see section 3.2.4 of the CUDA programming guide.
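A sketch of that approach, reusing mystruct from the question (the kernel and element count are illustrative, and the device must support mapped host memory):

__global__ void kernel_ptr(mystruct *s)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    s->dev1[idx] *= 2;    // s is a mapped pointer, so dereferencing it works
}

int main()
{
    const int N = 1024;                        // assumed element count
    cudaSetDeviceFlags(cudaDeviceMapHost);     // enable mapped pinned memory

    // allocate the struct itself in page-locked, mapped host memory
    mystruct *h_s;
    cudaHostAlloc(&h_s, sizeof(mystruct), cudaHostAllocMapped);
    cudaMalloc(&h_s->dev1, N * sizeof(int));   // members still live on the device
    cudaMalloc(&h_s->dev2, N * sizeof(double));

    // obtain the device-side alias of the host allocation and pass that
    mystruct *d_s;
    cudaHostGetDevicePointer(&d_s, h_s, 0);
    kernel_ptr<<<N / 256, 256>>>(d_s);
    cudaDeviceSynchronize();
    return 0;
}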