Can I get the CUDA compute capability (version) at compile time via #define? - cuda

How can I get the CUDA compute capability (version) at compile time via a #define?
For example, if I use __ballot and compile with
nvcc -c -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_13,code=sm_13 \
     source.cu
can I get the compute capability version in my code via a #define, so that I can choose between a code branch that uses __ballot and one that does not?

Yes. First, it's best to understand what happens when you use -gencode. NVCC will compile your input device code multiple times, once for each device target architecture. So in your example, NVCC will run compilation stage 1 once for compute_20 and once for compute_13.
When nvcc compiles a .cu file, it defines two preprocessor macros, __CUDACC__ and __CUDA_ARCH__. __CUDACC__ does not have a value; it is simply defined when nvcc is the compiler, and not defined when it isn't.
__CUDA_ARCH__ is defined to an integer value representing the SM version being compiled.
100 = compute_10
110 = compute_11
200 = compute_20
etc. To quote the NVCC documentation included with the CUDA Toolkit:
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy. This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
So, in your case where you want to use __ballot(), you can do this:
...
#if __CUDA_ARCH__ >= 200
    int b = __ballot(pred);        // pred: whatever per-thread predicate you are voting on
    int p = __popc(b & lanemask);  // count the voting lanes selected by lanemask
#else
    // do something else for earlier architectures
#endif
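For completeness, here is a minimal, self-contained sketch of the same guard in a full .cu file; the kernel, its predicate, and the fallback path are made up for illustration, and the defined(__CUDA_ARCH__) test keeps host code from depending on the macro, as the documentation quoted above requires:
#include <cstdio>

__global__ void count_votes(int *out, int value)
{
    int pred = (value > 0);                  // made-up per-thread predicate
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    unsigned int b = __ballot(pred);         // warp vote, available from compute_20 (Fermi)
    if (threadIdx.x == 0) *out = __popc(b);  // number of lanes that voted "yes"
#else
    // Fallback for older virtual architectures without __ballot:
    // accumulate the votes with an integer atomic instead.
    atomicAdd(out, pred);
#endif
}

int main()
{
    int *d_out;
    cudaMalloc(&d_out, sizeof(int));
    cudaMemset(d_out, 0, sizeof(int));

    count_votes<<<1, 32>>>(d_out, 1);        // one full warp

    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("votes = %d\n", h_out);           // expect 32 with this made-up predicate
    cudaFree(d_out);
    return 0;
}
Compiled with the -gencode flags from the question, the compute_20 device pass takes the __ballot branch and the compute_13 pass takes the fallback.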

Related

How do I force a CUDA warp to re-converge after a condition and before __shfl_down_sync?

I'm trying to write a CUDA kernel that uses warp-sync intrinsics like __shfl_down_sync(). Part of it looks like (simplified):
if (lane_id < 31) {
    do_something();
}
x = __shfl_down_sync(0xffffffff, y, 1);
I'm targeting compute capability 5.3. My understanding is that the behavior of __shfl_down_sync() is undefined if the lanes are divergent (per e.g. https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/). So I was careful to put the __shfl_down_sync() outside the if block; I expected the lanes to re-converge at the end of the if block. However, at high optimization levels, NVCC produces code that doesn't re-converge before the shuffle! The generated PTX code looks like (simplified):
setp.lt.u32 %p1, %r55, 31;   // r55 is lane_id
@%p1 bra BB0_7;
bra.uni BB0_1;
BB0_7:
    do_something();
    shfl.down.sync.b32 ...;
    // and so on
BB0_1:
    shfl.down.sync.b32 ...;
    // and so on
So lane 31 executes the shfl.down.sync.b32 instruction separately from the rest of the warp, producing undefined & incorrect results. How can I prevent this from happening?
(Alternatively -- Have I misunderstood the rules? Is the PTX code actually correct, and my bug is being caused by something else? But the output I observe is that lane 31 doesn't seem to participate in the sync, causing incorrect values to be computed, so this seems to be the culprit...)
(If it helps, a longer version of the code is available here: https://gist.github.com/timmaxw/bf317303155a74298c3805cce92630be Also, I'm using CUDA 10.2 with compute capability 5.3. Compiler flags are nvcc --generate-line-info -O2 -MMD -ccbin g++ -Xcompiler -fPIC -Xcompiler -Wall,-Werror -gencode arch=compute_53,code=sm_53.)
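One approach suggested by the warp-primitives blog post linked in the question is to insert an explicit __syncwarp() at the point where the warp must be converged, i.e. between the divergent region and the shuffle. A minimal sketch of that idea applied to the snippet above (this reflects the blog post's general recommendation, not a verified fix for this particular compiler version):
if (lane_id < 31) {
    do_something();
}
__syncwarp(0xffffffff);                  // explicit convergence point for the full warp (CUDA 9+)
x = __shfl_down_sync(0xffffffff, y, 1);  // every lane named in the mask reaches the shuffle together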

CUDA atomicAdd for doubles definition error

In previous versions of CUDA, atomicAdd was not implemented for doubles, so it is common to implement it yourself, as shown here. With the new CUDA 8 RC, I run into trouble when I try to compile code that includes such a function. I guess this is because with Pascal and compute capability 6.0, a native double version of atomicAdd has been added, but somehow it is not properly ignored for previous compute capabilities.
The code below used to compile and run fine with previous CUDA versions, but now I get this compilation error:
test.cu(3): error: function "atomicAdd(double *, double)" has already been defined
But if I remove my implementation, I instead get this error:
test.cu(33): error: no instance of overloaded function "atomicAdd" matches the argument list
argument types are: (double *, double)
I should add that I only see this if I compile with -arch=sm_35 or similar. If I compile with -arch=sm_60 I get the expected behavior, i.e. only the first error, and successful compilation in the second case.
Edit: Also, this is specific to atomicAdd -- if I change the name, it works fine.
It really looks like a compiler bug. Can someone else confirm that this is the case?
Example code:
__device__ double atomicAdd(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

__global__ void kernel(double *a)
{
    double b = 1.3;
    atomicAdd(a, b);
}

int main(int argc, char **argv)
{
    double *a;
    cudaMalloc(&a, sizeof(double));
    kernel<<<1,1>>>(a);
    cudaFree(a);
    return 0;
}
Edit: I got an answer from Nvidia, who recognize this problem; here is what the developers say about it:
The sm_60 architecture, that is newly supported in CUDA 8.0, has
native fp64 atomicAdd function. Because of the limitations of our
toolchain and CUDA language, the declaration of this function needs to
be present even when the code is not being specifically compiled for
sm_60. This causes a problem in your code because you also define a
fp64 atomicAdd function.
CUDA builtin functions such as atomicAdd are implementation-defined
and can be changed between CUDA releases. Users should not define
functions with the same names as any CUDA builtin functions. We would
suggest you to rename your atomicAdd function to one that is not the
same as any CUDA builtin functions.
That flavor of atomicAdd is a new method introduced for compute capability 6.0. You may keep your previous implementation for earlier compute capabilities by guarding it with a macro check:
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
    <... place your own pre-Pascal atomicAdd definition here ...>
#endif
This architecture identification macro is documented here:
5.7.4. Virtual Architecture Identification Macro
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy.
This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
I assume NVIDIA did not declare it for earlier compute capabilities in order to avoid conflicts with users who had defined it themselves and were not moving to compute capability >= 6.x. I would not consider it a bug, though, rather a release delivery practice.
EDIT: the macro guard was incomplete (now fixed); here is a complete example.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
#else
__device__ double atomicAdd(double* a, double b) { return b; }
#endif

__device__ double s_global;

__global__ void kernel() { atomicAdd(&s_global, 1.0); }

int main(int argc, char* argv[])
{
    kernel<<<1,1>>>();
    return ::cudaDeviceSynchronize();
}
Compilation with:
$> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
Command lines (both successful):
$> nvcc main.cu -arch=sm_60
$> nvcc main.cu -arch=sm_35
You can see why this works by looking at the include file sm_60_atomic_functions.h, where the function is not declared if __CUDA_ARCH__ is lower than 600.
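A paraphrased sketch of that guard (not the verbatim header contents) looks like this:
// Paraphrased sketch of how the built-in double overload is guarded in
// sm_60_atomic_functions.h; the exact header text may differ.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 600
__device__ double atomicAdd(double *address, double val);
#endif
Since the built-in declaration and the user-side guard shown above use the same condition with the branches swapped, at most one of the two definitions is visible in any given compilation pass, which is why both -arch=sm_60 and -arch=sm_35 now compile.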

How can I dereference a thrust::device_vector from within a thrust functor?

I'm doing a thrust transform_reduce and need to access a thrust::device_vector from within the functor. I am not iterating on the device_vector. It allows me to declare the functor, passing in the device_vector reference, but won't let me dereference it, either with begin() or operator[].
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include\thrust/detail/function.h(187): warning : calling a host function("thrust::detail::vector_base > ::operator []") from a host device function("thrust::detail::host_device_function ::operator () ") is not allowed
I assume I'll be able to pass in the base pointer and do the pointer math myself, but is there a reason this isn't supported?
Just expanding on what @JaredHoberock has already indicated. I think he will not mind.
A functor usable by thrust must (for the most part) conform to the requirements imposed on any CUDA device code.
Both thrust::host_vector and thrust::device_vector are host code containers used to manipulate host data and device data respectively. A reference to the host code container cannot be used successfully in device code. This means even if you passed a reference to the container successfully, you could not use it (i.e. could not do .push_back(), for example) in device code.
For direct manipulation in device code (such as functors, or kernels) you must extract raw device pointers from thrust and use those directly, with your own pointer arithmetic. And advanced functions (such as .push_back()) will not be available.
There are a variety of ways to extract the raw device pointer corresponding to thrust data, and the following example code demonstrates two possibilities:
$ cat t651.cu
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

__global__ void printkernel(float *data){
    printf("data = %f\n", *data);
}

int main(){
    thrust::device_vector<float> mydata(5);
    thrust::sequence(mydata.begin(), mydata.end());
    printkernel<<<1,1>>>(mydata.data().get());
    printkernel<<<1,1>>>(thrust::raw_pointer_cast(&mydata[2]));
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -o t651 t651.cu
$ ./t651
data = 0.000000
data = 2.000000
$
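Applied to the original question (a functor used with a thrust algorithm rather than a raw kernel), the same raw-pointer idea looks roughly like this; the functor, vector names, and sizes are made up for illustration:
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>

// Functor that captures a raw device pointer to a lookup table and
// indexes into it from device code with plain pointer arithmetic.
struct lookup_functor
{
    const float *table;

    lookup_functor(const float *t) : table(t) {}

    __host__ __device__
    float operator()(int i) const
    {
        return table[i];   // only dereference this on the device
    }
};

int main()
{
    thrust::device_vector<float> table(5);
    thrust::sequence(table.begin(), table.end());   // 0, 1, 2, 3, 4

    thrust::device_vector<int> idx(5);
    thrust::sequence(idx.begin(), idx.end());

    thrust::device_vector<float> out(5);
    thrust::transform(idx.begin(), idx.end(), out.begin(),
                      lookup_functor(thrust::raw_pointer_cast(table.data())));
    return 0;
}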

thrust functor: "too many resources requested for launch"

I'm trying to implement something like this in CUDA:
for each element:
    p = { p   if p >= floor
        { z   if p <  floor
Where floor and z are constants configured at the start of the test.
I have attempted to implement it like so, but I get the error "too many resources requested for launch"
A functor:
struct floor_functor : thrust::unary_function<float, float>
{
    const float floorLevel, floorVal;

    floor_functor(float _floorLevel, float _floorVal)
        : floorLevel(_floorLevel), floorVal(_floorVal) {}

    __host__ __device__
    float operator()(float& x) const
    {
        if (x >= floorLevel)
            return x;
        else
            return floorVal;
    }
};
Used by a transform:
thrust::transform(input->begin(), input->end(), output.begin(), floor_functor(floorLevel, floorVal));
If I remove one of the members of my functor, say floorVal, and use a functor with only one member variable, it works fine.
Does anyone know why this might be, and how I could fix it?
Additional info:
My array is 786432 elements long.
My GPU is a GeForce GTX590
I am building with the command:
`nvcc -c -g -arch sm_11 -Xcompiler -fPIC -Xcompiler -Wall -DTHRUST_DEBUG -I <my_include_dir> -o <my_output> <my_source>`
My cuda version is 4.0:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Thu_May_12_11:09:45_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221
And my maximum number of threads per block is 1024 (reported by deviceQuery):
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
UPDATE:
I have stumbled upon a fix for my problem, but do not understand it. If I rename my functor from "floor_functor" to basically anything else, it works! I have no idea why this is the case, and would be interested to hear anyone's ideas about this.
For an easier CUDA implementation, you could do this with ArrayFire in one line of code:
p(p < floor) = z;
Just declare your variables as af::array's.
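A minimal sketch of that, assuming ArrayFire is installed (the random input and the variable names are just for illustration):
#include <arrayfire.h>

int main()
{
    const float floorLevel = 0.5f;    // "floor" in the pseudocode above
    const float z          = 0.0f;    // replacement value

    af::array p = af::randu(786432);  // stand-in for the real input data
    p(p < floorLevel) = z;            // every element below the floor becomes z

    p.eval();                         // force evaluation of the lazy expression
    return 0;
}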
Good luck!
Disclaimer: I work on all sorts of CUDA projects, including ArrayFire.

CUDA, cuPrintf causes "unspecified launch failure"?

I have a kernel which runs twice with different grid sizes.
My problem is with cuPrintf. When I don't call cudaPrintfInit() before the kernel launch, and cudaPrintfDisplay(stdout, true) and cudaPrintfEnd() after it, I get no error; but when I put them there, I get an "unspecified launch failure" error.
In my device code there is only one place like this where I print:
if (threadIdx.x == 0) {
    cuPrintf("MAX:%f x:%d y:%d\n", maxVal, blockIdx.x, blockIdx.y);
}
I'm using CUDA 4.0 with a card of compute capability 2.0, so I'm compiling my code with this command:
nvcc LB2.0.cu -arch=compute_20 -code=sm_20
If you are on a CC 2.0 GPU, you don't need cuPrintf at all -- CUDA has printf built in for CC 2.0 and higher GPUs. So just replace your call to cuPrintf with this:
#if __CUDA_ARCH__ >= 200
if (threadIdx.x == 0) {
    printf("MAX:%f x:%d y:%d\n", maxVal, blockIdx.x, blockIdx.y);
}
#endif
(Note you only need the #if / #endif lines if you are compiling your code for sm_20 and also earlier versions. With the example compilation command line you gave, you can eliminate them.)
With printf, you don't need cudaPrintfInit() or cudaPrintfDisplay() -- it is automatic. However if you print a lot of data, you may need to increase the default printf FIFO size with cudaDeviceSetLimit(), passing the cudaLimitPrintfFifoSize option.
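For example, a host-side sketch of enlarging the FIFO before the first printing kernel launch (the 8 MB size is only an illustrative value):
#include <cuda_runtime.h>

int main()
{
    // Enlarge the device-side printf FIFO; size it to the volume of output you expect.
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 8 * 1024 * 1024);

    // ... launch kernels that call printf() here ...

    cudaDeviceSynchronize();   // output is flushed at synchronization points
    return 0;
}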