Unexpected CUDA_ERROR_INVALID_VALUE from cuLaunchKernel()

I'm trying to launch a kernel using the CUDA driver API. Specifically I'm calling
CUresult CUDAAPI cuLaunchKernel(
CUfunction f,
unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ,
unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void **kernelParams,
void **extra);
I'm only using kernelParams, and passing nullptr for extra. Now, for one of my kernels, I get CUDA_ERROR_INVALID_VALUE.
The documentation says:
The error CUDA_ERROR_INVALID_VALUE will be returned if kernel parameters are specified with both kernelParams and extra (i.e. both kernelParams and extra are non-NULL).
Well, I'm not doing that, and I'm still getting CUDA_ERROR_INVALID_VALUE. To be extra safe, I synchronized the stream right before launching the kernel - but to no avail.
What are the other reasons for getting CUDA_ERROR_INVALID_VALUE when trying to launch?

Apparently, you can get a CUDA_ERROR_INVALID_VALUE error in multiple cases involving issues with your kernelParams and/or extra arguments:
Both kernelParams and extra are null, but the kernel takes parameters.
Both kernelParams and extra are non-null (this is the officially documented case).
The number of elements in kernelParams before the terminating nullptr value doesn't match the number of kernel parameters.
and this is not an exhaustive list - misusing extra can probably cause this too.
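For reference, here is a minimal sketch of a launch that avoids these pitfalls; the kernel signature (an int count plus a device pointer) and all names here are assumptions for illustration, not part of the original question:
#include <cuda.h>

// Hypothetical kernel: __global__ void my_kernel(int n, float* data)
void launch_it(CUfunction my_kernel, CUdeviceptr d_data, int n, CUstream stream)
{
    // One pointer per kernel parameter, in declaration order
    void* kernel_params[] = { &n, &d_data };

    CUresult status = cuLaunchKernel(
        my_kernel,
        256, 1, 1,        // grid dimensions
        128, 1, 1,        // block dimensions
        0,                // sharedMemBytes
        stream,
        kernel_params,    // kernelParams - must match the kernel's parameter list
        nullptr);         // extra - must be null when kernelParams is used
    // If status == CUDA_ERROR_INVALID_VALUE here, the usual suspect is a
    // mismatch between kernel_params and the kernel's actual signature.
}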

Related

What are the types of these CUDA pointer attributes?

cuPointerGetAttribute() is passed a pointer to one of several types, filled in according to the attribute requested. Some of those types are stated explicitly, or can be deduced implicitly - but some, not so much. Specifically, what are the types to which a pointer must be passed for the following attributes?
CU_POINTER_ATTRIBUTE_BUFFER_ID - probably a numeric ID, but what's its type?
CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES - a bitmask, supposedly, but how wide?
The CUDA driver API documentation doesn't seem to answer these questions.
PS - Even for the boolean attributes it's not made clear enough whether you should pass an int* or a bool*.
According to the documentation, the buffer id is stored as unsigned long long:
CU_POINTER_ATTRIBUTE_BUFFER_ID: Returns in *data a buffer ID which is guaranteed to be unique within the process. data must point to an unsigned long long.
When I try to pass a char* with CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES, valgrind reports an invalid write of size 8. Passing std::size_t* does not cause errors.
Similarly, using a char* with CU_POINTER_ATTRIBUTE_IS_LEGACY_CUDA_IPC_CAPABLE results in an invalid write of size 4, which is not the case with an int*.
(using NVCC V11.5.119)
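Putting those observations together, a query would look roughly like this (a sketch only; dptr is assumed to be a valid device pointer obtained from cuMemAlloc()):
unsigned long long buffer_id = 0;
CUresult st = cuPointerGetAttribute(&buffer_id,
                                    CU_POINTER_ATTRIBUTE_BUFFER_ID, dptr);

// The boolean-ish attribute appears to be int-sized (4 bytes),
// per the valgrind observation above
int legacy_ipc_capable = 0;
st = cuPointerGetAttribute(&legacy_ipc_capable,
                           CU_POINTER_ATTRIBUTE_IS_LEGACY_CUDA_IPC_CAPABLE, dptr);

// The handle-types bitmask appears to be 8 bytes wide
unsigned long long allowed_handle_types = 0;
st = cuPointerGetAttribute(&allowed_handle_types,
                           CU_POINTER_ATTRIBUTE_ALLOWED_HANDLE_TYPES, dptr);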

What's the difference between launching with an API call vs the triple-chevron syntax?

Consider the following two function templates:
template <typename... Params>
void foo(Params... params)
{
    /* etc etc */
    my_kernel<<<grid_dims, block_dims, shmem_size, stream_id>>>(params...);
}
and:
template <typename... Params>
void bar(Params... params)
{
    /* etc etc */
    void* argument_ptrs[sizeof...(Params)];
    auto arg_index = 0;
    for_each_argument(
        [&](auto& param) { argument_ptrs[arg_index++] = &param; },
        params...);
    cudaLaunchKernel<decltype(my_kernel)>(
        &my_kernel, grid_dims, block_dims, argument_ptrs, shmem_size, stream_id);
}
with for_each_argument being as defined by Sean Parent.
Questions:
Are the semantics of foo and bar exactly identical?
Is there some kind of benefit to using one over the other? (e.g. perhaps the first form does heap allocation under the hood or something....)
Is it a good idea to use forwarding references in the second function? Both functions?
Are the semantics of foo and bar exactly identical?
I haven't checked in CUDA 9, but prior to that, no. The <<<>>> syntax is inline expanded to an API call and a wrapper function call. Interestingly the kernel launch APIs used are long deprecated. But the wrapper function allows for explicit argument type safety checking at compile time, which is helpful.
[EDIT: I checked CUDA 9.1 and it still uses cudaLaunch as all previous versions of the runtime API did]
Is there some kind of benefit to using one over the other? (e.g. perhaps the first form does heap allocation under the hood or something....)
Not that I am aware of.
Is it a good idea to use forwarding references in the second function? Both functions?
If the kernels are compiled in the same translation unit as the calling code, then no. The toolchain automatically emits forward declarations for kernels.
Remember that the runtime API eventually needs to make driver API calls (assuming it doesn't make undocumented calls we don't know about), so what ultimately gets used is cuLaunchKernel():
CUresult cuLaunchKernel (
CUfunction f,
unsigned int gridDimX,
unsigned int gridDimY,
unsigned int gridDimZ,
unsigned int blockDimX,
unsigned int blockDimY,
unsigned int blockDimZ,
unsigned int sharedMemBytes,
CUstream hStream,
void** kernelParams,
void** extra )
and that's a non-templated interface which doesn't care about kinds-of-references and such.
Of course, there is the fact that there are two ways to specify launch arguments - using kernelParams and using extra. So if you want to tweak how you go about launching kernels, you might just want to play with that.
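If you do want to experiment with the extra mechanism, a rough sketch looks like this; this assumes a kernel taking an int followed by a float, that f, stream and the dimension variables are already in scope, and that the packed buffer matches the kernel's parameter layout (including alignment and padding):
struct { int n; float scale; } args = { 1024, 0.5f };
size_t args_size = sizeof(args);

void* extra[] = {
    CU_LAUNCH_PARAM_BUFFER_POINTER, &args,      // single packed argument buffer
    CU_LAUNCH_PARAM_BUFFER_SIZE,    &args_size, // its size in bytes
    CU_LAUNCH_PARAM_END
};

cuLaunchKernel(f, gridDimX, gridDimY, gridDimZ,
               blockDimX, blockDimY, blockDimZ,
               0, stream,
               nullptr,   // kernelParams unused...
               extra);    // ...so extra carries the arguments instead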

Can an unsigned long long int be used to store the output from clock64()?

I need to atomically update a global array storing clock64() values from different threads. The CUDA atomic functions that operate on 64-bit integers only take unsigned long long int, but the return type of clock64() is signed. Is it safe to store the output of clock64() in an unsigned type?
There are various atomic functions which support atomic operations on unsigned long long int (i.e. a 64-bit unsigned integer), such as atomicCAS(), atomicExch() and atomicAdd(). And if you have a cc3.5 or higher GPU, you have even more options.
Referring to the documentation on clock64():
long long int clock64(); when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle.
So, since it is a 64-bit signed quantity, it is bit-for-bit identical to an unsigned long long int until it becomes negative. Let's assume the counter is reset to zero either at the start of your kernel, at the start of the CUDA context, or at machine power-on. The counter will not become negative until around:
2^63 (cycles) / 1,000,000,000 (cycles/s) ≈ 292 years
after whichever of the above events is the actual reset point. (I'm using 1 GHz as an estimate of the GPU core clock.)
So for the first 200-300 years (after machine power-on, let's say), the clock64() function will not return a negative value. So I'd say it's pretty safe to consider it as "always" positive, and therefore always identical to unsigned long long int, meaning you can safely cast it to that, and use it in one of the atomic functions that support unsigned long long int.
On the other hand, it's probably not safe to truncate it to a plain 32-bit unsigned int. That arithmetic would be:
2^32 (cycles) / 1,000,000,000 (cycles/s) ≈ 4 seconds
So after only about 4 seconds, the clock64() value will exceed what can be represented in a 32-bit unsigned quantity.
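To make that concrete, here is a minimal device-side sketch of the kind of atomic update the question describes; the kernel and variable names are just for illustration, and the 64-bit atomicMax() used here requires cc3.5 or higher:
__global__ void record_latest_tick(unsigned long long* latest)
{
    // clock64() won't be negative in practice, so the cast preserves the value
    unsigned long long now = static_cast<unsigned long long>(clock64());
    atomicMax(latest, now);
}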

In CUDA, how can I send an integer to constant memory? [duplicate]

As with anything in CUDA, the most basic things are sometimes the hardest...
So... I just want to copy a variable from the CPU to a __constant__ variable on the GPU, and I am having a hard time.
This is what I have:
__constant__ int contadorlinhasx_d;

int main(){
    (...)
    int contadorlinhasx = 100;
    status = cudaMemcpyToSymbol(contadorlinhasx_d, contadorlinhasx, 1*sizeof(int), 0, cudaMemcpyHostToDevice);
And I get this error:
presortx.cu(222): error: no instance of overloaded function "cudaMemcpyToSymbol" matches the argument list
argument types are: (int, int, unsigned long, int, cudaMemcpyKind)
Could anyone help me? I know it is some stupid error, but I am tired of googling it, and I have spent almost 30 minutes just trying to copy a stupid variable :/
Thanks in advance
You need to do something like
cudaMemcpyToSymbol("contadorlinhasx_d",
&contadorlinhasx,
1*sizeof(int),
0,
cudaMemcpyHostToDevice);
[Note: this is the old API call, deprecated as of CUDA 4.0]
or
cudaMemcpyToSymbol(contadorlinhasx_d,
&contadorlinhasx,
1*sizeof(int),
0,
cudaMemcpyHostToDevice);
If you look at the API documentation, the first two arguments are pointers. The first can either be a string, which will force a symbol lookup internally in the API (pre CUDA 4), or a device symbol address (CUDA 4 and later). The second argument is the address of the host source memory for the copy. The compiler error message is pretty explicit - you are passing the wrong types of argument and the compiler can't find an instance in the library which matches.
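Putting it together, a complete minimal version of the modern (CUDA 4 and later) form might look like this; error checking is omitted for brevity, and note that the offset and kind arguments default to 0 and cudaMemcpyHostToDevice, so they can be left out:
__constant__ int contadorlinhasx_d;

__global__ void kernel_using_it(int* out)
{
    *out = contadorlinhasx_d;   // constant memory is read directly in device code
}

int main()
{
    int contadorlinhasx = 100;
    // Note: pass the address of the host variable, not the variable itself
    cudaMemcpyToSymbol(contadorlinhasx_d, &contadorlinhasx, sizeof(int));
    // ... launch kernels that read contadorlinhasx_d ...
}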
