How do I force a CUDA warp to re-converge after a condition and before __shfl_down_sync? - cuda

I'm trying to write a CUDA kernel that uses warp-sync intrinsics like __shfl_down_sync(). Part of it looks like (simplified):
if (lane_id < 31) {
    do_something();
}
x = __shfl_down_sync(0xffffffff, y, 1);
I'm targeting compute capability 5.3. My understanding is that the behavior of __shfl_down_sync() is undefined if the lanes are divergent (per e.g. https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/). So I was careful to put the __shfl_down_sync() outside the if block; I expected the lanes to re-converge at the end of the if block. However, at high optimization levels, NVCC produces code that doesn't re-converge before the shuffle! The generated PTX code looks like (simplified):
setp.lt.u32 %p1, %r55, 31; // r55 is lane_id
@%p1 bra BB0_7;
bra.uni BB0_1;
BB0_7:
do_something();
shfl.down.sync.b32 ...;
// and so on
BB0_1:
shfl.down.sync.b32 ...;
// and so on
So lane 31 executes the shfl.down.sync.b32 instruction separately from the rest of the warp, producing undefined & incorrect results. How can I prevent this from happening?
(Alternatively: have I misunderstood the rules? Is the PTX code actually correct, with my bug being caused by something else? The output I observe is that lane 31 doesn't seem to participate in the sync, causing incorrect values to be computed, so the divergent shuffle seems to be the culprit...)
(If it helps, a longer version of the code is available here: https://gist.github.com/timmaxw/bf317303155a74298c3805cce92630be Also, I'm using CUDA 10.2 with compute capability 5.3. Compiler flags are nvcc --generate-line-info -O2 -MMD -ccbin g++ -Xcompiler -fPIC -Xcompiler -Wall,-Werror -gencode arch=compute_53,code=sm_53.)
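(In case it's relevant: the workaround I'm currently experimenting with is an explicit __syncwarp() before the shuffle, as sketched below. I'm not sure whether this is guaranteed to force re-convergence, which is part of what I'm asking.)
if (lane_id < 31) {
    do_something();
}
__syncwarp(0xffffffff);  // intended re-convergence point before the shuffle
x = __shfl_down_sync(0xffffffff, y, 1);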

Related

Cuda signed 128-bit multiplication error

I think I discovered a problem when doing 128-bit signed multiplication in CUDA using inline PTX.
Here is my sample code:
long long result_lo, result_hi;
asm(" mul.lo.s64 %0, 0, -1; \n\t" // 0 * -1 = 0
" mul.hi.s64 %1, 0, -1; \n\t"
: "=l"(result_lo), "=l"(result_hi));
This should produce the result result_lo = 0x0, result_hi = 0x0. However it produces result_lo = 0x0, result_hi = 0xFFFFFFFFFFFFFFFF, which interpreted as a signed 128-bit value is -2^64 if I'm not mistaken, and clearly not zero.
First off, I want to make sure my understanding is correct, but more importantly, is there a way around this?
Update: Changing from Debug mode to Release mode fixes the issue, but I'm still wondering whether this is a bug in CUDA.
Update 2: Reported this bug to NVIDIA.
Using CUDA Toolkit 7.5 with Visual Studio 2013, x64 Debug build, sm_52, compute_52.
TL;DR This appears to be a bug in the emulation of the PTX instruction mul.hi.s64 that is specific to sm_5x platforms, so filing a bug report with NVIDIA is the recommended course of action.
Generally, NVIDIA GPUs are 32-bit architectures, so all 64-bit integer instructions require emulation sequences. In the particular case of 64-bit integer multiplies, for sm_2x and sm_3x platforms, these are constructed from the machine code instruction IMAD.U32, which is a 32-bit integer multiply-add instruction.
For the Maxwell architecture (that is, sm_5x), a high-throughput, but lower-width, integer multiply-add instruction XMAD was introduced, although a low-throughput legacy 32-bit integer multiply IMUL was apparently retained. Inspection of disassembled machine code generated for sm_5x by the CUDA 7.5 toolchain with cuobjdump --dump-sass shows that for ptxas optimization level -O0 (which is used for debug builds), the 64-bit multiplies are emulated with the IMUL instruction, while for optimization level -O1 and higher XMAD is used. I cannot think of a reason why two fundamentally different emulation sequences are employed.
As it turns out, the IMUL-based emulation for mul.hi.s64 for sm_5x is broken while the XMAD-based emulation works fine. Therefore, one possible workaround is to utilize an optimization level of at least -O1 for ptxas, by specifying -Xptxas -O1 on the nvcc command line. Note that release builds use -Xptxas -O3 by default, so no corrective action is necessary for release builds.
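For illustration, the workaround flag might be passed like this (a command-line sketch; the file name and architecture flags are placeholders for your actual build):
nvcc -Xptxas -O1 -gencode arch=compute_52,code=sm_52 -o app app.cu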
From code analysis, the emulation for mul.hi.s64 is implemented as a wrapper around the emulation for mul.hi.u64, and this latter emulation seems to work fine on all platforms including sm_5x. Thus another possible workaround is to use our own wrapper around mul.hi.u64. Coding with inline PTX is unnecessary in this case, since mul.hi.s64 and mul.hi.u64 are accessible via the device intrinsics __mul64hi() and __umul64hi(). As can be seen from the code below, the adjustments to convert a result from unsigned to signed multiplication are fairly trivial.
long long int m1, m2, result;
#if 0  // broken on sm_5x at ptxas optimization level -O0
    asm(" mul.hi.s64 %0, %1, %2; \n\t"
        : "=l"(result)
        : "l"(m1), "l"(m2));
#else  // workaround: unsigned high multiply plus sign adjustments
    result = __umul64hi(m1, m2);
    if (m1 < 0LL) result -= m2;  // correct for negative m1
    if (m2 < 0LL) result -= m1;  // correct for negative m2
#endif
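If you want to double-check the sign-correction identity itself, here is a small host-only test against the compiler's __int128 type (a verification sketch of my own, compiled with g++ -std=c++11; it is not part of the workaround):
#include <cstdio>

// reference: true signed 64x64 -> high 64 bits, via the compiler's __int128
static long long mulhi_s64_ref(long long a, long long b)
{
    return (long long)(((__int128)a * (__int128)b) >> 64);
}

int main(void)
{
    const long long tests[] = { 0LL, 1LL, -1LL, 0x7fffffffffffffffLL,
                                (long long)0x8000000000000000ULL };
    for (long long a : tests) {
        for (long long b : tests) {
            // unsigned high product (stand-in for __umul64hi on the host)...
            unsigned long long u = (unsigned long long)
                (((unsigned __int128)(unsigned long long)a *
                  (unsigned long long)b) >> 64);
            // ...followed by the two sign adjustments from the workaround
            if (a < 0LL) u -= (unsigned long long)b;
            if (b < 0LL) u -= (unsigned long long)a;
            if ((long long)u != mulhi_s64_ref(a, b)) {
                printf("mismatch for a=%lld b=%lld\n", a, b);
                return 1;
            }
        }
    }
    printf("sign-correction identity holds for all test pairs\n");
    return 0;
}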

Can I get the CUDA compute capability (version) at compile time via a #define? - cuda

How can I get the CUDA compute capability (version) at compile time via a #define?
For example, if I use __ballot and compile with
nvcc -c -gencode arch=compute_20,code=sm_20 \
-gencode arch=compute_13,code=sm_13
source.cu
can I get the compute capability version in my code via a #define, so I can choose between a code branch that uses __ballot and one that doesn't?
Yes. First, it's best to understand what happens when you use -gencode. NVCC will compile your input device code multiple times, once for each device target architecture. So in your example, NVCC will run compilation stage 1 once for compute_20 and once for compute_13.
When nvcc compiles a .cu file, it defines two preprocessor macros, __CUDACC__ and __CUDA_ARCH__. __CUDACC__ does not have a value; it is simply defined if nvcc is the compiler, and not defined if it isn't.
__CUDA_ARCH__ is defined to an integer value representing the SM version being compiled.
100 = compute_10
110 = compute_11
200 = compute_20
etc. To quote the NVCC documentation included with the CUDA Toolkit:
The architecture identification macro __CUDA_ARCH__ is assigned a three-digit value string xy0 (ending in a literal 0) during each nvcc compilation stage 1 that compiles for compute_xy. This macro can be used in the implementation of GPU functions for determining the virtual architecture for which it is currently being compiled. The host code (the non-GPU code) must not depend on it.
So, in your case where you want to use __ballot(), you can do this:
....
#if __CUDA_ARCH__ >= 200
    unsigned int b = __ballot(pred);  // pred is this thread's predicate
    int p = __popc(b & lanemask);
#else
    // do something else for earlier architectures
#endif
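One more wrinkle: __CUDA_ARCH__ is undefined during the host compilation pass, so in code that is also compiled for the host, guard with defined() as well. A minimal sketch (active_lanes and pred are illustrative names, not from the question):
__host__ __device__ int active_lanes(int pred)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    return __popc(__ballot(pred));  // device path on compute 2.0+
#else
    return pred ? 1 : 0;            // trivial host (or pre-2.0) stand-in
#endif
}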

thrust functor: "too many resources requested for launch"

I'm trying to implement something like this in CUDA:
for each element:
    p = { p   if p >= floor
        { z   if p <  floor
Where floor and z are constants configured at the start of the test.
I have attempted to implement it like so, but I get the error "too many resources requested for launch"
A functor:
struct floor_functor : thrust::unary_function<float, float>
{
    const float floorLevel, floorVal;

    floor_functor(float _floorLevel, float _floorVal)
        : floorLevel(_floorLevel), floorVal(_floorVal) {}

    __host__ __device__
    float operator()(float& x) const
    {
        if (x >= floorLevel)
            return x;
        else
            return floorVal;
    }
};
Used by a transform:
thrust::transform(input->begin(), input->end(), output.begin(), floor_functor(floorLevel, floorVal));
If I remove one of the members of my functor, say floorVal, and use a functor with only one member variable, it works fine.
Does anyone know why this might be, and how I could fix it?
Additional info:
My array is 786432 elements long.
My GPU is a GeForce GTX590
I am building with the command:
nvcc -c -g -arch sm_11 -Xcompiler -fPIC -Xcompiler -Wall -DTHRUST_DEBUG -I <my_include_dir> -o <my_output> <my_source>
My cuda version is 4.0:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Thu_May_12_11:09:45_PDT_2011
Cuda compilation tools, release 4.0, V0.2.1221
And my maximum number of threads per block is 1024 (reported by deviceQuery):
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
UPDATE:
I have stumbled upon a fix for my problem, but do not understand it. If I rename my functor from "floor_functor" to basically anything else, it works! I have no idea why this is the case, and would be interested to hear anyone's ideas about this.
For an easier CUDA implementation, you could do this with ArrayFire in one line of code:
p(p < floor) = z;
Just declare your variables as af::array's.
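Spelled out slightly, that might look like the following (a sketch; the function and variable names are mine, and I'm assuming the current ArrayFire C++ API):
#include <arrayfire.h>

void apply_floor(af::array &p, float floorLevel, float floorVal)
{
    // every element below the floor level is replaced by the constant
    p(p < floorLevel) = floorVal;
}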
Good luck!
Disclaimer: I work on all sorts of CUDA projects, including ArrayFire.

Thrust sort_by_key issue when using zip_iterator values

I'm trying to use a zip_iterator as the values argument of sort_by_key() in CUDA, and the values inside the zip_iterator are not getting re-ordered during the sort (the positions of the data stay the same as they were originally).
Example code:
typedef thrust::device_vector<int> IntVec;
IntVec keyVec(100);
IntVec fooVec(100);
IntVec barVec(100);
for (int z = 0; z < 100; z++)
{
    keyVec[z] = rand();
    fooVec[z] = z;
    barVec[z] = z;
}
thrust::sort_by_key(keyVec.begin(), keyVec.end(),
    thrust::make_zip_iterator(thrust::make_tuple(fooVec.begin(), barVec.begin())));
What I expect this code to do is sort based on the value in keyVec (which it does properly) while reordering fooVec and barVec to match. Is this not what sort_by_key does? Does sort_by_key work with zip_iterators? Am I doing something incorrect when setting up/pulling the data from the zip_iterator? If this method is incorrect, what is the proper method to keep the value ordering?
EX:
key,foo,bar (presort)
3,1,1
2,2,2
...
key,foo,bar (what i expect post sort)
2,2,2
3,1,1
...
key,foo,bar (what i actually get)
2,1,1
3,2,2
...
Using the Thrust that ships with CUDA 4.1.
System Details:
OS: RHEL 6.0 x86_64
CUDA Version: 4.1 (also tested with 4.1.1.5)
Thrust Version: 1.5
GPU: 4x nVidia Corporation GF100 [GeForce GTX 480] (rev a3)
nvidia driver: 290.10
nvcc version: release 4.1, V0.2.1221
compile string: nvcc testfile.cu
UPDATE:
Still cannot get sort_by_key() to work with zip_iterator values, but it works correctly with a standard thrust::device_vector<>.begin() iterator.
thrust::sort_by_key should be able to sort via a zip_iterator in the manner of your example.
I've not been able to reproduce the behavior you describe on any of several different platforms, but it's possible there's something unique about your system which causes an issue.
You should post the contents of testfile.cu and the details of your system to Thrust's bug tracker on Google Code so the developers can take a closer look.
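Something along these lines would make a good self-contained testfile.cu for the report (a sketch based on your posted snippet; the consistency check at the end is my own):
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cstdio>
#include <cstdlib>

int main(void)
{
    thrust::device_vector<int> keyVec(100), fooVec(100), barVec(100);
    for (int z = 0; z < 100; z++) {
        keyVec[z] = rand();
        fooVec[z] = z;
        barVec[z] = z;
    }

    thrust::sort_by_key(keyVec.begin(), keyVec.end(),
        thrust::make_zip_iterator(thrust::make_tuple(fooVec.begin(), barVec.begin())));

    for (int z = 1; z < 100; z++) {
        // keys must be ascending, and foo/bar must have moved together
        if (keyVec[z - 1] > keyVec[z] || fooVec[z] != barVec[z]) {
            printf("FAILED at %d\n", z);
            return 1;
        }
    }
    printf("OK\n");
    return 0;
}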

CUDA, cuPrintf causes "unspecified launch failure"?

I have a kernel which runs twice, with different grid sizes.
My problem is with cuPrintf. When I don't have cudaPrintfInit() before the kernel launch, and cudaPrintfDisplay(stdout, true) and cudaPrintfEnd() after it, I get no error; but when I put them there, I get an "unspecified launch failure" error.
In my device code, there is only one place like this where I print:
if (threadIdx.x == 0) {
    cuPrintf("MAX:%f x:%d y:%d\n", maxVal, blockIdx.x, blockIdx.y);
}
I'm using CUDA 4.0 with a card of compute capability 2.0, so I'm compiling my code like this:
nvcc LB2.0.cu -arch=compute_20 -code=sm_20
If you are on a CC 2.0 GPU, you don't need cuPrintf at all -- CUDA has printf built-in for CC-2.0 and higher GPUs. So just replace your call to cuPrintf with this:
#if __CUDA_ARCH__ >= 200
if (threadIdx.x == 0) {
    printf("MAX:%f x:%d y:%d\n", maxVal, blockIdx.x, blockIdx.y);
}
#endif
(Note you only need the #if / #endif lines if you are compiling your code for sm_20 and also earlier versions. With the example compilation command line you gave, you can eliminate them.)
With printf, you don't need cudaPrintfInit() or cudaPrintfDisplay() -- it is automatic. However if you print a lot of data, you may need to increase the default printf FIFO size with cudaDeviceSetLimit(), passing the cudaLimitPrintfFifoSize option.
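For example (a sketch; the 8 MB figure is an arbitrary value chosen for illustration):
// enlarge the device-side printf FIFO before launching the kernel
size_t request = 8 * 1024 * 1024;
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, request);

// optionally verify what the runtime actually granted
size_t actual = 0;
cudaDeviceGetLimit(&actual, cudaLimitPrintfFifoSize);
printf("printf FIFO size: %zu bytes\n", actual);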