In CUDA, how can I send an integer to constant memory? [duplicate] - cuda

As with anything in CUDA, the most basic things are sometimes the hardest...
So... I just want to copy a variable from the CPU to a GPU's constant variable, and I am having a hard time.
This is what I have:
__constant__ int contadorlinhasx_d;
int main(){
(...)
int contadorlinhasx=100;
status=cudaMemcpyToSymbol(contadorlinhasx_d,contadorlinhasx,1*sizeof(int),0,cudaMemcpyHostToDevice);
And I get this error:
presortx.cu(222): error: no instance of overloaded function "cudaMemcpyToSymbol" matches the argument list
argument types are: (int, int, unsigned long, int, cudaMemcpyKind)
Could anyone help me? I know it is some stupid error, but I am tired of googling it, and I have spent almost 30 minutes just trying to copy a stupid variable :/
Thanks in advance

You need to do something like
cudaMemcpyToSymbol("contadorlinhasx_d",
&contadorlinhasx,
1*sizeof(int),
0,
cudaMemcpyHostToDevice);
[Note this is the old API call, now deprecated in CUDA 4.0 and newer]
or
cudaMemcpyToSymbol(contadorlinhasx_d,
&contadorlinhasx,
1*sizeof(int),
0,
cudaMemcpyHostToDevice);
If you look at the API documentation, the first two arguments are pointers. The first can either be a string, which forces a symbol lookup internally in the API (pre CUDA 4), or a device symbol address (CUDA 4 and later). The second argument is the address of the host source memory for the copy. The compiler error message is pretty explicit: you are passing the wrong argument types, and the compiler can't find an overload in the library which matches them.
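Putting it together, a minimal complete sketch might look like this (assuming CUDA 4.0 or newer, so the symbol is passed by address rather than by name; the kernel is hypothetical and only prints the copied value):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ int contadorlinhasx_d;

// Hypothetical kernel just to verify the constant was set.
__global__ void print_constant()
{
    printf("constant value: %d\n", contadorlinhasx_d);
}

int main()
{
    int contadorlinhasx = 100;
    // Pass the symbol itself and the *address* of the host variable.
    // Offset and direction default to 0 and cudaMemcpyHostToDevice.
    cudaError_t status = cudaMemcpyToSymbol(contadorlinhasx_d,
                                            &contadorlinhasx,
                                            sizeof(int));
    if (status != cudaSuccess) {
        fprintf(stderr, "copy failed: %s\n", cudaGetErrorString(status));
        return 1;
    }
    print_constant<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```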

Related

What are the types of these CUDA pointer attributes?

cuPointerGetAttribute() is passed a pointer to one of multiple types, filled according to the actual attribute requested. Some of those types are stated explicitly or can be deduced implicitly, but some - not so much. Specifically... what are the types to which a pointer must be passed for the attributes:
CU_POINTER_ATTRIBUTE_BUFFER_ID - probably a numeric ID, but what's its type?
CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES - a bitmask, supposedly, but how wide?
The CUDA driver API doesn't seem to answer these questions.
PS - Even for the boolean attributes it's not made clear enough whether you should pass an int* or a bool*.
According to the documentation, the buffer id is stored as unsigned long long:
CU_POINTER_ATTRIBUTE_BUFFER_ID: Returns in *data a buffer ID which is guaranteed to be unique within the process. data must point to an unsigned long long.
When I try to pass a char* with CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES, valgrind reports an invalid write of size 8. Passing std::size_t* does not cause errors.
Similarly, using char* with CU_POINTER_ATTRIBUTE_IS_LEGACY_CUDA_IPC_CAPABLE, valgrind reports an invalid write of size 4, which does not happen with int*.
(using NVCC V11.5.119)
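The observations above can be sketched as a small driver-API program. This is a hedged example: the documented type for the buffer ID is unsigned long long, and an 8-byte integer for the handle-types bitmask is an assumption based on the valgrind result, not on the documentation.

```cuda
#include <cstdio>
#include <cuda.h>

int main()
{
    // Minimal driver-API setup: one device, one context, one allocation.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    CUdeviceptr ptr;
    cuMemAlloc(&ptr, 1024);

    // The buffer ID is documented to be an unsigned long long.
    unsigned long long buffer_id = 0;
    cuPointerGetAttribute(&buffer_id, CU_POINTER_ATTRIBUTE_BUFFER_ID, ptr);
    printf("buffer id: %llu\n", buffer_id);

    // Assumption: an 8-byte integer is wide enough for the
    // allowed-handle-types bitmask (matches the valgrind observation).
    unsigned long long handle_types = 0;
    cuPointerGetAttribute(&handle_types,
                          CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES, ptr);

    cuMemFree(ptr);
    cuCtxDestroy(ctx);
    return 0;
}
```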

What's the difference between launching with an API call vs the triple-chevron syntax?

Consider the following two function templates:
template <typename... Params>
void foo(Params... params)
{
/* etc etc */
my_kernel<<<grid_dims, block_dims, shmem_size, stream_id>>>(params...);
}
and:
template <typename... Params>
void bar(Params... params)
{
/* etc etc */
void* argument_ptrs[sizeof...(Params)];
auto arg_index = 0;
for_each_argument(
[&](auto& param) {argument_ptrs[arg_index++] = &param;},
params...);
cudaLaunchKernel<decltype(my_kernel)>(
&my_kernel, grid_dims, block_dims, argument_ptrs, shmem_size, stream_id);
}
with for_each_argument being as defined by Sean Parent.
Questions:
Are the semantics of foo and bar exactly identical?
Is there some kind of benefit to using one over the other? (e.g. perhaps the first form does heap allocation under the hood or something....)
Is it a good idea to use forwarding references in the second function? Both functions?
Are the semantics of foo and bar exactly identical?
I haven't checked in CUDA 9, but prior to that, no. The <<<>>> syntax is inline expanded to an API call and a wrapper function call. Interestingly, the kernel launch APIs used are long deprecated. But the wrapper function allows for explicit argument type safety checking at compile time, which is helpful.
[EDIT: I checked CUDA 9.1 and it still uses cudaLaunch as all previous versions of the runtime API did]
Is there some kind of benefit to using one over the other? (e.g. perhaps the first form does heap allocation under the hood or something....)
Not that I am aware of.
Is it a good idea to use forwarding references in the second function? Both functions?
If the kernels are compiled in the same compilation unit as the calling code, then no. The toolchain automatically emits forward declarations for kernels.
Remember that, eventually, the runtime API needs to make driver API calls (assuming it doesn't make secret API calls which we don't know about), so eventually, what's used is cuLaunchKernel():
CUresult cuLaunchKernel (
CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void** kernelParams,
void** extra )
and that's a non-templated interface which doesn't care about kinds-of-references and such.
Of course, there is the fact that there are two ways to specify launch arguments - using kernelParams and using extra. So if you want to tweak how you go about launching kernels, you might just want to play with that.
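The two launch forms discussed above can be compared side by side. This is a minimal sketch with a hypothetical kernel; the runtime-API form uses cudaLaunchKernel, whose kernelParams argument is an array of pointers to each kernel argument, in declaration order.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used by both launch forms.
__global__ void my_kernel(int a, float b)
{
    printf("a=%d b=%f\n", a, b);
}

int main()
{
    int a = 7;
    float b = 3.5f;

    // Triple-chevron form: arguments are passed directly.
    my_kernel<<<1, 1>>>(a, b);

    // Runtime-API form: kernelParams is an array of pointers to the
    // arguments, so the host variables must outlive the launch call.
    void* args[] = { &a, &b };
    cudaLaunchKernel((const void*)my_kernel, dim3(1), dim3(1),
                     args, /*sharedMem=*/0, /*stream=*/0);

    cudaDeviceSynchronize();
    return 0;
}
```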

CUDA Fortran: Cannot copy array of arrays(or array of pointers to arrays) using cudaMemCpy

I am trying to compile a basic memory transfer code using PGI's Fortran compiler (Workstation/PGI Visual Fortran). The compiler throws an error on the line where I have a cudaMemcpy call. The exact error message is "Could not resolve generic procedure cudamemcpy" for the line
istat=cudaMemcpy(arr(1),arr(2),800,cudaMemcpyDevicetoDevice)
I am also using the cuda fortran module--"use cudafor". What's the solution to this compiler error? Thanks!
The arrays arr(1) and arr(2) are of type
type subgrid
integer, device, dimension(:,:,:), allocatable :: field
end type subgrid
The problem was resolved by not using the 4th argument and by specifying the actual field data that needed to be transferred. 800 is the number of integers I needed to be transferred from one slice to the other.
istat=cudaMemcpy(arr(1)%field(:,:,:), arr(2)%field(:,:,:), 800)
Also, the cudaMemcpyDevicetoDevice argument doesn't affect the call; it works the same with or without it.

How to declare a struct with a dynamic array inside it in device

How can I declare a struct in device code where one of its members is an array, and then dynamically allocate memory for it? For example, in the code below, the compiler says: error : calling a __host__ function("malloc") from a __global__ function("kernel_ScoreMatrix") is not allowed. Is there another way to perform this action?
The type of dev_size_idx_threads is int*; its value is sent to the kernel and used to allocate memory.
struct struct_matrix
{
int *idx_threads_x;
int *idx_threads_y;
int thread_diag_length;
int idx_length;
};
struct struct_matrix matrix[BLOCK_SIZE_Y];
matrix->idx_threads_x= (int *) malloc ((*(dev_size_idx_threads) * sizeof(int) ));
From device code, dynamic memory allocations (malloc and new) are supported only with devices of cc2.0 and greater. If you have a cc2.0 device or greater, and you pass an appropriate flag to nvcc (such as -arch=sm_20), you should not see this error. Note that if you are passing multiple compilation targets (sm_10, sm_20, etc.) and even one of the targets does not meet the cc2.0+ requirement, you will see this error.
If you have a cc1.x device, you will need to perform these types of allocations from the host (e.g. using cudaMalloc) and pass appropriate pointers to your kernel.
If you choose that route (allocating from the host), you may also be interested in my answer to questions like this one.
EDIT: responding to questions below:
In Visual Studio (2008 Express; it should be similar for other versions), you can set the compilation target as follows: open the project, select Project...Properties, then Configuration Properties...CUDA Runtime API...GPU. Now, on the right-hand pane, you will see entries like GPU Architecture (1) (and (2), etc.). These are drop-downs that you can click on to select the target(s) you want to compile for. If your GPU is sm_21, I would select that for (1) and leave the others blank, or select compatible versions like sm_20.
To see worked examples, please follow the link I gave above. A couple worked examples are linked from my answer here as well as a description of how it is done.
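For the cc2.0+ route, a minimal sketch of in-kernel allocation might look like this (assuming compilation with -arch=sm_20 or newer; the kernel body is hypothetical and just exercises the allocation):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct struct_matrix
{
    int *idx_threads_x;
    int *idx_threads_y;
    int thread_diag_length;
    int idx_length;
};

__global__ void kernel_ScoreMatrix(int *dev_size_idx_threads)
{
    struct_matrix matrix;
    // Device-side malloc is legal on cc2.0+ targets; always check for NULL,
    // since the device heap has a fixed (configurable) size.
    matrix.idx_threads_x = (int *) malloc(*dev_size_idx_threads * sizeof(int));
    if (matrix.idx_threads_x != NULL) {
        matrix.idx_threads_x[0] = threadIdx.x;
        free(matrix.idx_threads_x);
    }
}

int main()
{
    int size = 16;
    int *dev_size;
    cudaMalloc(&dev_size, sizeof(int));
    cudaMemcpy(dev_size, &size, sizeof(int), cudaMemcpyHostToDevice);
    kernel_ScoreMatrix<<<1, 32>>>(dev_size);
    cudaDeviceSynchronize();
    cudaFree(dev_size);
    return 0;
}
```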
