Why can't a __device__ variable be marked as constexpr? - cuda

If I declare/define a variable like this:
__device__ constexpr int test{5};
I get an error:
error: A __device__ variable cannot be marked constexpr
I can't find that restriction in the guide. The guide says:
I.4.20.9. __managed__ and __shared__ variables cannot be marked with the
keyword constexpr.
Moreover, my colleague with the same major compiler version (11) doesn't have this error.
What exactly causes this error on my machine?

In CUDA 11.2, such usage was not allowed. See here:
G.4.16.9. __device__/__constant__/__shared__ variables
__device__, __constant__ and __shared__ variables cannot be marked with the keyword constexpr.
In CUDA 11.4 (and beyond) such usage is allowed.
The change in compiler behavior took place sometime between the CUDA 11.2.0 version and the CUDA 11.4.0 version.
Presumably your colleague's machine has a newer CUDA 11 version than yours.
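For reference, a minimal sketch that nvcc accepts from CUDA 11.4 onward but that CUDA 11.2 rejects with the error quoted above; the kernel is purely illustrative:

__device__ constexpr int test{5};

// Illustrative kernel reading the constexpr __device__ variable.
__global__ void readTest(int* out) {
    *out = test;
}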

Related

Unable to call CUDA half precision functions from the host

I am trying to do some FP16 work that will have both CPU and GPU backends. I researched my options and decided to use CUDA's half precision converter and data types. The ones I intend to use are specified as both __device__ and __host__, which according to my understanding (and the official documentation) should mean that the functions are callable from both HOST and DEVICE code. I wrote a simple test program:
#include <iostream>
#include <cuda_fp16.h>

int main() {
    const float a = 32.12314f;
    __half2 test = __float2half2_rn(a);
    __half test2 = __float2half(a);
    return 0;
}
However, when I try to compile it I get:
nvcc cuda_half2.cu
cuda_half2.cu(6): error: calling a __device__ function("__float2half2_rn") from a __host__ function("main") is not allowed
cuda_half2.cu(7): error: calling a __device__ function("__float2half") from a __host__ function("main") is not allowed
2 errors detected in the compilation of "/tmp/tmpxft_000013b8_00000000-4_cuda_half2.cpp4.ii".
The only thing that comes to mind is that my CUDA is 9.1 and I'm reading the documentation for 9.2, but I can't find an older version of it, nor can I find anything in the changelog. Ideas?
Switch to CUDA 9.2
Your code compiles without error on CUDA 9.2, but throws the errors you indicate on CUDA 9.1. If you have CUDA 9.1 installed, then the documentation for it is already installed on your machine. On a typical Linux install, it will be located in /usr/local/cuda-9.1/doc. If you look at /usr/local/cuda-9.1/doc/pdf/CUDA_Math_API.pdf, you will see that the corresponding functions are only marked __device__, so this change was indeed made between CUDA 9.1 and CUDA 9.2.
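As a quick sanity check on CUDA 9.2 or later, here is a minimal sketch that round-trips a float through __half on the host; __half2float is likewise marked __host__ __device__ from CUDA 9.2 onward:

#include <cstdio>
#include <cuda_fp16.h>

int main() {
    const float a = 32.12314f;
    __half h = __float2half(a);    // host-callable from CUDA 9.2 onward
    float back = __half2float(h);  // convert back to inspect the rounding loss
    printf("original = %f, round trip = %f\n", a, back);
    return 0;
}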

How to redefine malloc/free in CUDA?

I want to redefine malloc() and free() in my code, but when I run, two errors appear:
allowing all exceptions is incompatible with previous function "malloc";
allowing all exceptions is incompatible with previous function "free";
Then I searched for this error; it seems CUDA doesn't allow us to redefine library functions. Is this true? If we can't redefine those functions, how can I resolve the error?
The very short answer is that you cannot.
malloc is fundamentally a C++ standard library function which the CUDA toolchain internally overloads with a device hook in device code. Attempting to define your own device version of malloc or free can and will break the toolchain's internals. Exactly how depends on platform and compiler.
In your previous question on this, you had code like this:
__device__ void* malloc(size_t t)
{ return theHeap.alloc(t); }
__device__ void free(void* p)
{ theHeap.dealloc(p); }
Because of existing standard library requirements, malloc and free must be defined as __device__ __host__ functions at global namespace scope. It is illegal in CUDA to have separate __device__ and __host__ definitions of the same function. You could probably get around this restriction by using a private namespace for the custom allocator, or by using different function names, as sketched below. But don't try to redefine anything from the standard library in device or host code. It will break things.
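To illustrate the namespace workaround, here is a minimal sketch; the myalloc namespace and its function names are hypothetical, and the bodies simply forward to the built-in device heap (compute capability 2.0+):

namespace myalloc {  // private namespace, so nothing collides with ::malloc / ::free
__device__ void* allocate(size_t n)  { return malloc(n); }  // forwards to device malloc
__device__ void  deallocate(void* p) { free(p); }           // forwards to device free
}

__global__ void kernel() {
    int* p = static_cast<int*>(myalloc::allocate(sizeof(int)));
    *p = 42;
    myalloc::deallocate(p);
}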

complex CUDA kernel in MATLAB

I wrote a CUDA kernel to run via MATLAB,
with several cuDoubleComplex pointers. I launched the kernel with complex double vectors (defined as gpuArray) and got the error message: "unsupported type in argument specification cuDoubleComplex".
How do I make MATLAB aware of this type?
The short answer: you can't.
The list of supported types for kernels is shown here, and that is all your kernel code can contain to compile correctly with the GPU computing toolbox. You will need to either modify your code to use double2 in place of cuDoubleComplex, or supply MATLAB with compiled PTX code and a function declaration that maps cuDoubleComplex to double2. For example,
__global__ void mykernel(cuDoubleComplex *a) { .. }
would be compiled to PTX using nvcc and then loaded up in MATLAB as
k = parallel.gpu.CUDAKernel('mykernel.ptx','double2*');
Either method should work.
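For the first option, a minimal sketch of the kernel rewritten against double2; the element count n and the conjugation are purely illustrative:

__global__ void mykernel(double2 *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        a[i].y = -a[i].y;  // real part in .x, imaginary in .y: complex conjugate
    }
}

Since cuDoubleComplex is itself a typedef of double2 in cuComplex.h, the two prototypes describe the same memory layout, which is why the declaration mapping above works.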

"Unexpected address space" compilation error while using shared memory in PTX

I have written a trivial kernel in which I declare my shared memory array as
extern __shared__ float As[100];
In my kernel launch I specify the number_of_bytes of shared memory. I get the error "Unexpected address space" while compiling the kernel (to PTX). I am using a fairly new version of LLVM from svn (3.3 in progress). Any ideas what I am doing wrong here? The problem seems to be with the extern keyword, but then how else am I going to specify it (shared memory)?
Should I use a different LLVM build?
Config: CUDA 5.0, Nvidia Tesla C1060
Well, it turns out that the extern keyword is not really required in this case, as per Gert-Jan from the Nvidia forum. I am not sure what his id is on SO.
His reply --
"If you know how many elements your shared memory array has (e.g. 100 elements), you should not use the extern keyword, and you don't have to specify the number of bytes of shared memory in the kernel launch (the compiler can figure it out by himself). Only if you don't know how many elements you will need, you have to specify this in the kernel launch, and in your kernel you have to write "extern shared float *As"."
Hope this helps other users.
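To make the two cases concrete, here is a minimal sketch contrasting the static declaration (what the asker should use) with the dynamic one; the kernel names and the stores are illustrative:

// Size known at compile time: no extern, no size in the launch configuration.
__global__ void kernel_static() {
    __shared__ float As[100];
    As[threadIdx.x % 100] = threadIdx.x;
}

// Size not known until launch: unsized extern array, size passed as the
// third launch parameter, e.g. kernel_dynamic<<<grid, block, number_of_bytes>>>();
__global__ void kernel_dynamic() {
    extern __shared__ float As[];
    As[threadIdx.x] = threadIdx.x;
}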
I am not sure if CUDA C/C++ supports this, but perhaps try to set the address space attribute as a workaround:
__attribute__((address_space(3)))
extern __shared__ float As[100];
That should force LLVM to put it in the shared address space.
Good luck!

How do I retrieve the parameter list information for a CUDA 4.0+ kernel?

According to the NVidia documentation for the cuLaunchKernel function, kernels compiled with CUDA 3.2+ contain information regarding their parameter list. Is there a way to retrieve this information programmatically from a CUfunction handle? I need to know the number of arguments and the size of each argument in bytes of a kernel from its CUfunction handle. I have seen the above-referenced NVidia documentation saying that this information exists, but I haven't seen anywhere in the CUDA documentation indicating a programmatic way to access this information.
To add a little more explanation: I'm working with a middleware system. Its frontside library replaces libcuda (the driver API library) on the target system. The backside then runs as a daemon on another host that has the GPGPU resource being used and calls into the real libcuda on that machine. There are other middleware solutions that already do this with cuLaunchKernel, so it's definitely possible. Also, CUDA itself uses this information in order to know how to parse the parameters from the pointer that you pass into cuLaunchKernel.
Edit: I originally had the CUDA version where this metadata was introduced listed incorrectly. It was 3.2, not 4.0, according to the cuLaunchKernel documentation.
cuLaunchKernel is designed to launch kernels for which you know the function prototype. There is no API for "reverse engineering" the function prototype.
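For context, a minimal sketch of the forward direction that cuLaunchKernel does document: the caller must already know the prototype and build the kernelParams array to match it. The handle f is assumed to come from elsewhere, and the two arguments are illustrative.

#include <cuda.h>

// Launch a kernel with the known prototype (int, float).
void launch(CUfunction f) {  // f assumed obtained via cuModuleGetFunction
    int a = 5;
    float b = 1.0f;
    void* params[] = { &a, &b };  // one pointer per kernel argument, in declaration order
    cuLaunchKernel(f,
                   1, 1, 1,       // grid dimensions
                   1, 1, 1,       // block dimensions
                   0, 0,          // dynamic shared memory bytes, stream
                   params,
                   nullptr);      // the "extra" packaging alternative, unused here
}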
I'm working on the same issue (I don't know whether you have solved it in the meantime).
I'm using a known kernel to investigate how the memory pointed to by the CUfunction is used.
This is the no-parameters version:
#include <cstdio>

extern "C" {
__global__ void HelloWorld() {
    int thid = (blockIdx.x * blockDim.x) + threadIdx.x;
}
}
This is the one-parameter version, and so on:
#include <cstdio>

extern "C" {
__global__ void HelloWorld(int a) {
    int thid = (blockIdx.x * blockDim.x) + threadIdx.x;
}
}
I suggest dumping the first 1024 bytes of the memory pointed to by the CUfunction and following the pointers. For example, at offset 0x30 there is a pointer pointing to a table of pointers. I noticed that the size of the struct pointed to by the CUfunction doesn't change with the number of function parameters, so the table we are looking for has to be hunted down by following the pointers.
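A heavily hedged sketch of that dumping step: it reinterprets the opaque CUfunction handle as a pointer into driver-internal memory, which is undocumented behavior and may print garbage or crash on other driver versions.

#include <cstdio>
#include <cuda.h>

// Hex-dump the first 1024 bytes behind the opaque CUfunction handle.
// The layout is undocumented driver-internal state; offsets such as 0x30
// mentioned above are empirical observations, not a stable ABI.
void dumpFunction(CUfunction f) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(f);
    for (int i = 0; i < 1024; ++i) {
        if (i % 16 == 0) printf("\n%04x:", i);
        printf(" %02x", p[i]);
    }
    printf("\n");
}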