Why does addition without overflow set CC.CF to 1? - cuda

I have the following code:
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime.h>
__global__ void cuda_test() {
    int result;
    asm(
        ".reg .u32 r1;\n\t"
        "add.cc.u32 r1, 0, 0;\n\t"
        "subc.u32 %0, 0, 0;\n\t"
        : "=r"(result)
    );
    printf("r= %x\n", result);
}

int main() {
    cuda_test<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
This code prints
r= ffffffff
Why? As far as I understand, the operation add.cc.u32 r1, 0, 0 must set the carry flag to 0. I am under the impression that the subc.u32 operation uses the inverse of CC.CF, but according to the documentation it shouldn't be that way.

I cannot find any information in the PTX documentation on how what PTX calls the CC.CF flag is actually generated. Looking at the generated machine code (SASS), I see that subtraction is implemented via addition and the use of an extend flag, CC.X.
Based on some quick experiments, this .X flag always seems to be the normal carry-out from the adder. Since a-b = a+~b+1, on a subtraction .X will be set if a >= b. It represents the carry-out from the adder which is the one's complement of an x86-style borrow on subtracts, which is set when a < b.
In other words, the extended arithmetic instructions of the GPU appear to use the same convention that is used by the ARM and PowerPC architectures for their extended arithmetic instructions. The Wikipedia article on the carry flag covers the two design alternatives for handling of the flag during subtraction.
In the code in the question, add.cc.u32 clears CC.CF, which signals to the subsequent subc.u32 that a borrow has occurred, causing it to compute a+~b.
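To make the convention concrete, here is a minimal sketch (my own illustration, not part of the original answer) of the intended use case: a 64-bit subtraction assembled from two 32-bit halves. Under the carry-style convention described above, the borrow in the low word leaves CC.CF = 0, and subc then produces the correct high word.
#include <cstdio>

// Sketch: a 64-bit subtraction built from two 32-bit halves.
// It computes (hi:lo) = 0x0000000100000000 - 0x0000000000000001 = 0x00000000FFFFFFFF.
__global__ void wide_sub_demo()
{
    unsigned int lo, hi;
    asm("sub.cc.u32 %0, 0, 1;\n\t"   // low word: 0 - 1 = 0xFFFFFFFF, borrow, so CC.CF = 0
        "subc.u32   %1, 1, 0;\n\t"   // high word: 1 + ~0 + CC.CF = 0 (mod 2^32)
        : "=r"(lo), "=r"(hi));
    printf("hi:lo = %08x:%08x\n", hi, lo);   // expected: 00000000:ffffffff
}

int main()
{
    wide_sub_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
If CC.CF carried an x86-style borrow instead, the high word of this multi-word subtraction would come out wrong.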
You may wish to file an enhancement request with NVIDIA to clarify the PTX documentation regarding details of CC.CF generation and handling.

Related

Implementing of mutex on cuda kernel function happens to be deadlocked

I'm a newcomer to CUDA, and I am trying to implement a mutex in a kernel function.
I read some tutorials and wrote my own version, but in some cases a deadlock happens.
Here is my code; the kernel simply counts the number of threads launched from the main function.
#include <iostream>
#include <cuda_runtime.h>
__global__ void countThreads(int* sum, int* mutex) {
    while (atomicCAS(mutex, 0, 1) != 0); // lock
    *sum += 1;
    __threadfence();
    atomicExch(mutex, 0); // unlock
}
int main() {
    int* mutex = nullptr;
    cudaMalloc(&mutex, sizeof(int));
    cudaMemset(mutex, 0, sizeof(int));

    int* sum = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));

    int ret = 0;
    // pass, result is 1024
    countThreads<<<1024, 1>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;

    // deadlock, why?
    countThreads<<<1, 2>>>(sum, mutex);
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << ret << std::endl;
    return 0;
}
So, can anyone tell me why the program deadlocks when calling countThreads<<<1, 2>>>(), and how to fix it? I want a cross-block mutex, even though it may not be a good idea. Many thanks.
I experimented for some time and found that the deadlock happens when the threads are in the same block; otherwise, everything works well.
Threads in the same warp attempting to negotiate for a lock or mutex is probably the worst-case scenario. It is fairly difficult to program correctly, and the behavior may change depending on the exact GPU you are running on.
Here is an example of the type of analysis needed to explain the exact reason for the deadlock in a particular case. Such analysis is not readily done on what you have shown here, because you have not indicated the type of GPU you are compiling for or running on. It's also fairly important to provide the CUDA version you are using for compilation; I have witnessed code generation change from one compiler version to another in ways that may impact this. Even if you provided that information, I'm not sure the analysis is really worthwhile, because I consider the negotiation-within-a-warp case to be extra troublesome to program correctly. This question/answer may also be of interest.
My general suggestion for a newcomer to CUDA (as you say) would be to use a method similar to what is described here. Briefly, negotiate for a lock at the threadblock level (i.e. have one thread in each block negotiate with the other blocks for the lock), then manage singleton activity within the block using standard, available block-level coordination schemes, such as __syncthreads() and conditional coding (a sketch of this follows below).
You can learn more about this topic by searching on the cuda tag for such keywords as "lock" "critical section" etc.
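Here is a minimal sketch of that block-level negotiation approach; the kernel and variable names (countThreadsBlockwise, blockLock) are mine, not from the question, and the launch configuration is just an example:
#include <cstdio>

// Block-level lock negotiation: only thread 0 of each block spins on the global lock;
// the rest of the block waits at __syncthreads().
__global__ void countThreadsBlockwise(int* sum, int* blockLock)
{
    if (threadIdx.x == 0) {
        while (atomicCAS(blockLock, 0, 1) != 0) { }   // one thread per block negotiates
    }
    __syncthreads();          // the whole block now "owns" the critical section

    if (threadIdx.x == 0) {
        *sum += blockDim.x;   // singleton activity: count this block's threads at once
        __threadfence();      // make the update visible before releasing the lock
        atomicExch(blockLock, 0);
    }
    __syncthreads();          // keep the block together until the lock is released
}

int main()
{
    int *sum = nullptr, *blockLock = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMalloc(&blockLock, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));
    cudaMemset(blockLock, 0, sizeof(int));

    countThreadsBlockwise<<<1024, 64>>>(sum, blockLock);

    int ret = 0;
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", ret);      // expected: 1024 * 64 = 65536
    return 0;
}
Because at most one thread per warp ever spins on the lock, this avoids the intra-warp negotiation problem described above.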
FWIW, for me, anyway, your code does deadlock on a Kepler device and does not deadlock on a Volta device, as suggested by the reference in the comments. I'm not attempting to make any statement about whether your code is defect-free; it's just an observation. If I modify your kernel to look like this:
__global__ void countThreads(int* sum, int* mutex) {
    int old = 1;
    while (old) {
        old = atomicCAS(mutex, 0, 1); // lock
        if (old == 0) {
            *sum += 1;
            __threadfence();
            atomicExch(mutex, 0); // unlock
        }
    }
}
Then it seems to me to work in either the Kepler case or the Volta case. I'm not advancing this example to suggest it is "correct", rather to show that somewhat innocuous code modifications can change a code from deadlock to non-deadlock case, or vice versa. This kind of fragility is best avoided, certainly in the pre-Volta case, in my opinion.
For the Volta-and-newer case, with CUDA 11 and later, you may want to use functionality from the libcu++ library, such as a semaphore (sketched below).
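As a rough sketch only (assuming CUDA 11+ with libcu++ available, a Volta-or-newer GPU, and compilation for a matching architecture such as -arch=sm_70; check the libcu++ documentation for the exact header and initialization requirements), the hand-rolled lock could be replaced by a device-scoped binary semaphore along these lines:
#include <cstdio>
#include <cuda/semaphore>

// Assumed setup: CUDA 11+, libcu++, Volta or newer.
// A device-scoped binary semaphore guards the update of *sum.
__device__ cuda::binary_semaphore<cuda::thread_scope_device> sem(1);

__global__ void countThreads(int* sum)
{
    sem.acquire();   // waits until the semaphore is available
    *sum += 1;
    sem.release();
}

int main()
{
    int* sum = nullptr;
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(sum, 0, sizeof(int));

    countThreads<<<1, 2>>>(sum);

    int ret = 0;
    cudaMemcpy(&ret, sum, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d\n", ret);   // expected: 2
    return 0;
}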

Bug in PTX ISA (carry propagation)?

Is there a bug in CUDA? I have run the following code on my GTX 580 and r2 is zero at the end; I expect it to be one due to carry propagation. I have tested the code with CUDA Toolkit 4.2.9 and 5.5 and use "nvcc -arch=sm_20 bug.cu -o bug && ./bug" to compile and run it.
#include <stdio.h>
#include <cuda.h>
__global__ void bug()
{
    unsigned int r1 = 0;
    unsigned int r2 = 0;
    asm( "\n\t"
         "sub.cc.u32 %0, 0, 1;\n\t"
         "addc.cc.u32 %1, 0, 0;\n\t"
         : "=r"(r1), "=r"(r2) );
    printf("r1 >> %04X\n", r1);
    printf("r2 >> %04X\n", r2);
}

int main(void)
{
    float *a_d;
    cudaMalloc((void **) &a_d, 1);
    bug <<< 1,1 >>> ();
    cudaFree(a_d);
}
Output
r1 >> FFFFFFFF
r2 >> 0000
I believe you're making some assumptions about the CC.CF flag referenced in the PTX ISA documentation that may not be valid.
Note that the definition of the specific states (e.g. 0 or 1) of this bit is never given, as far as I can see. Furthermore, I don't find any mapping between the definitions of "carry-in/carry-out" and "borrow-in/borrow-out".
Stated another way, I think you are assuming that a "borrow" status in this flag is identical to a "carry" status. In other words, you are assuming something like:
CF:
0 = (NO CARRY) or (NO BORROW)
1 = (CARRY) or (BORROW)
But such a truth table or mapping is never given. Furthermore the manual states:
The condition code register ... is mainly intended for use in straight-line code sequences for computing extended-precision integer addition, subtraction, and multiplication.
I don't think your code satisfies the intent, nor do I think the above assumption of truth table for CC.CF is valid.
In fact what I think is happening is a truth table like this:
CF:
0 = (CARRY) or (NO BORROW)
1 = (NO CARRY) or (BORROW)
(the 0 and 1 here are arbitrary; that is also not defined in the manual.)
All examples of code I have tried (about 6 cases, including yours) have fit the definition I have given above.
Having said this, I would think it unwise to depend on this, as it is mostly undocumented. A safe rule for computer architecture is that undocumented behavior may change in the future.
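A small probe along these lines (my own sketch, not the exact cases referred to above) makes the pairing observable: whatever numeric label you attach to CC.CF, a carry-out and a "no borrow" leave it in the same state, while "no carry" and "borrow" leave it in the other.
#include <cstdio>

// Probe sketch: after each .cc operation, "addc.u32 d, 0, 0" computes 0 + 0 + CC.CF,
// i.e. it reads back whatever the flag contributes to extended arithmetic.
__global__ void cf_probe()
{
    unsigned int after_add_nocarry, after_add_carry, after_sub_noborrow, after_sub_borrow;
    asm("{\n\t"
        ".reg .u32 t;\n\t"
        "add.cc.u32 t, 0, 0;\n\t"            // 0 + 0: no carry-out
        "addc.u32   %0, 0, 0;\n\t"
        "add.cc.u32 t, 0xffffffff, 1;\n\t"   // wraps around: carry-out
        "addc.u32   %1, 0, 0;\n\t"
        "sub.cc.u32 t, 1, 0;\n\t"            // 1 - 0: no borrow
        "addc.u32   %2, 0, 0;\n\t"
        "sub.cc.u32 t, 0, 1;\n\t"            // 0 - 1: borrow
        "addc.u32   %3, 0, 0;\n\t"
        "}"
        : "=r"(after_add_nocarry), "=r"(after_add_carry),
          "=r"(after_sub_noborrow), "=r"(after_sub_borrow));
    printf("no-carry: %u  carry: %u  no-borrow: %u  borrow: %u\n",
           after_add_nocarry, after_add_carry, after_sub_noborrow, after_sub_borrow);
}

int main()
{
    cf_probe<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
Based on the behavior reported in these questions, one would expect this to print 0 1 1 0: the subsequent addc adds 1 after a carry or a no-borrow and adds 0 after a no-carry or a borrow, consistent with the pairing above and with the SASS-level description of CC.X in the first answer.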
I think I have found an explanation. There is a note in the PTX manual which says for the sub.cc instruction: "Behavior is the same for unsigned and signed integers."

STL unordered_map crashes with __m128 values

I tracked a bug to the use of a __m128 (SSE vector) as a value in a std::unordered_map.
This causes a runtime segmentation fault with mingw32 g++4.7.2.
Please see the example below.
Is there any reason why this should fail?
Or, might there be a workaround? (I tried wrapping the value in a class but it did not help.)
Thanks.
#include <unordered_map>
#include <xmmintrin.h> // __m128
#include <iostream>
int main()
{
    std::unordered_map<int, __m128> m;
    std::cerr << "still ok\n";
    m[0] = __m128();
    std::cerr << "crash in previous statement\n";
    return 0;
}
Compilation settings:
g++ -march=native -std=c++11
There are 2 issues regarding alignment:
Does the ABI ensure that __m128 variables are always aligned on the stack?
Does the global operator new return memory suitably aligned for the __m128 type, i.e., memory with 16-byte alignment?
C++ currently doesn't handle dynamic allocation of over-aligned types. With usual x86 ABIs, standard alignment is 8 and __m128 has an alignment of 16 bytes, so it is overaligned. With usual x86_64 ABIs, the standard alignment is 16 which makes __m128 safe (but __m256 is unsafe again with its 32-byte alignment).
See this paper for a possible change in the next standard that would make things "just work":
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3396.htm
In the meantime, you can specify your own allocator, for instance based on aligned_alloc (C11), posix_memalign (unix), _aligned_malloc (Microsoft), etc.
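For example, a minimal C++11-style aligned allocator (a sketch under the assumption that the crash comes from under-aligned node allocations; aligned_allocator and its hard-coded 16-byte alignment are mine, error handling is omitted, and very old standard-library versions may require the full allocator boilerplate) could look like this:
#include <unordered_map>
#include <utility>       // std::pair
#include <cstddef>       // std::size_t
#include <xmmintrin.h>   // __m128, _mm_malloc, _mm_free
#include <iostream>

// Hypothetical allocator that over-aligns every allocation to 16 bytes.
template <typename T>
struct aligned_allocator {
    typedef T value_type;

    aligned_allocator() {}
    template <typename U> aligned_allocator(const aligned_allocator<U>&) {}
    template <typename U> struct rebind { typedef aligned_allocator<U> other; };

    T* allocate(std::size_t n) {
        return static_cast<T*>(_mm_malloc(n * sizeof(T), 16));  // 16-byte aligned block
    }
    void deallocate(T* p, std::size_t) { _mm_free(p); }
};

template <typename T, typename U>
bool operator==(const aligned_allocator<T>&, const aligned_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const aligned_allocator<T>&, const aligned_allocator<U>&) { return false; }

int main()
{
    // All node allocations made by the map now come back 16-byte aligned,
    // so holding __m128 by value inside the nodes is safe.
    typedef std::pair<const int, __m128> value_t;
    std::unordered_map<int, __m128, std::hash<int>, std::equal_to<int>,
                       aligned_allocator<value_t> > m;
    m[0] = __m128();
    std::cerr << "ok\n";
    return 0;
}
Note that this only addresses the heap side (the second issue above); stack alignment of __m128 temporaries (the first issue) is a separate concern on 32-bit targets.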

CUDA performance: branching and shared memory

I wish to ask two questions on performance. I have been unable to create simple code to illustrate.
Question 1: How expensive is non-divergent branching? In my code it seems to cost even more than the equivalent of 4 non-FMA FLOPs. Note that I am speaking of the BRA PTX instruction where the predicate has already been calculated.
Question 2: I have been reading a lot about the performance of shared memory, and some articles, such as a Dr. Dobb's article, even state that it can be as fast as registers (as long as it is accessed well). In my code all threads within the warps within the block access the same shared variable. I believe in this case shared memory is accessed in broadcast mode, isn't it? Should it reach the performance of registers in this way? Are there any special considerations to make it work?
EDIT: I have been able to construct some simple code that gives more insight into my query.
Here it is
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <float.h>
#include "cuComplex.h"
#include "time.h"
#include "cuda_runtime.h"
#include <iostream>
using namespace std;
__global__ void test()
{
    __shared__ int t[1024];
    int v = t[0];

    bool b = (v == -1);
    bool c = (v == -2);
    int myValue = 0;

    for (int i = 0; i < 800; i++)
    {
#if 1
        v = i;
#else
        v = t[i];
#endif

#if 0
        if (b) {
            printf("abs");
        }
#endif

        if (c)
        {
            printf("IT HAPPENED");
            v = 8;
        }

        myValue += v;
    }

    if (myValue == 1000)
        printf("IT HAPPENED");
}

int main(int argc, char *argv[])
{
    cudaEvent_t event_start, event_stop;
    float timestamp;
    float4 *data;

    // Initialise
    cudaDeviceReset();
    cudaSetDevice(0);

    dim3 threadsPerBlock;
    dim3 blocks;
    threadsPerBlock.x = 32;
    threadsPerBlock.y = 32;
    threadsPerBlock.z = 1;
    blocks.x = 1;
    blocks.y = 1000;
    blocks.z = 1;

    cudaEventCreate(&event_start);
    cudaEventCreate(&event_stop);
    cudaEventRecord(event_start, 0);

    test<<<blocks, threadsPerBlock, 0>>>();

    cudaEventRecord(event_stop, 0);
    cudaEventSynchronize(event_stop);
    cudaEventElapsedTime(&timestamp, event_start, event_stop);
    printf("Calculated in %f", timestamp);
}
I am running this code on a GTX680.
Now the results are as follows:
If run as it is, it takes 5.44 ms.
If I change the first #if conditional to 0 (which enables reading from shared memory), it takes 6.02 ms. Not much more, but still not good enough for me.
If I enable the second #if conditional (which inserts a branch that will never evaluate to true), it runs in 9.647040 ms. The performance reduction is very big. What is the cause, and what can be done?
I have also changed the code slightly to make further checks with shared memory.
Instead of
__shared__ int t[1024]
I did
__shared__ int2 t[1024]
and wherever I access t[] I just access t[].x. I got a further drop in performance to 10 ms (another 400 microseconds). Why should this happen?
Regards
Daniel
Have you determined whether your kernel is compute bound or memory bound? Your first question would be most relevant if your kernel is compute bound, while the second would be most relevant if it is memory bound. You might be getting results that are confusing or hard to reproduce if you're assuming one while it is the other.
(1) I don't think the cost of a branch has been published. You might be left to determining that experimentally for your architecture. The CUDA Programming Guide does say that there is no "branch prediction and no speculative execution."
(2) You're right that when you access a single 32-bit value in shared memory from all the threads in a warp, the value is broadcast. But my guess would be that accessing a single value from all threads would have the same cost as accessing any combination of values as long as you don't incur any bank conflicts. So you end up with the latency of a single fetch from shared memory. I don't think the number of cycles of latency has been published. It is short enough that it is normally easily hidden.
You need to keep in mind that the compiler is highly optimizing. So if you comment out the branch, you also eliminate the evaluation of the conditional, whether or not you leave it in the source code. Thus a difference of four instructions seems very plausible for your example:
load -1,
compare v to it (and store result in b),
test b,
branch,
although I have not compiled your example and looked at the code (which is what you should do - run cuobjdump -sass on your binaries and look at the actual differences in machine code).
Using only the .x component of an int2 changes the layout in shared memory so that you go from bank-conflict-free access to a 2-way bank conflict, which causes the slight further slowdown in your example (illustrated below). IIRC the latency of a shared memory access is on the order of 30 cycles, which usually is easily hidden by other threads (as Roger has already mentioned).
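To make the bank arithmetic concrete, here is a small illustration (my own sketch, assuming the default configuration of 32 banks that are 4 bytes wide; on Kepler the bank width is configurable, and the 8-byte bank mode would change the picture):
#include <cstdio>

// Illustration only: compute which shared-memory bank each lane would hit.
//
//   __shared__ int  t[1024];  t[i]   -> byte address 4*i -> bank  i      % 32  (no conflict)
//   __shared__ int2 t[1024];  t[i].x -> byte address 8*i -> bank (2*i)   % 32  (2-way conflict:
//                                                                 lanes i and i+16 share a bank)
__global__ void bank_mapping_demo()
{
    int lane      = threadIdx.x;
    int bank_int  = lane % 32;        // stride-1 access pattern of int t[]
    int bank_int2 = (2 * lane) % 32;  // stride-2 access pattern of int2 t[].x
    printf("lane %2d -> int bank %2d, int2.x bank %2d\n", lane, bank_int, bank_int2);
}

int main()
{
    bank_mapping_demo<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}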

clock() in opencl

I know that there is a clock() function in CUDA that you can put in kernel code to query the GPU time. But I wonder if such a thing exists in OpenCL? Is there any way to query the GPU time in OpenCL? (I'm using NVIDIA's toolkit.)
There is no OpenCL way to query clock cycles directly. However, OpenCL does have a profiling mechanism that exposes incremental counters on compute devices. By comparing the differences between ordered events, elapsed times can be measured. See clGetEventProfilingInfo.
Just for others coming here for help: a short introduction to profiling kernel runtime with OpenCL.
Enable profiling mode:
cmdQueue = clCreateCommandQueue(context, *devices, CL_QUEUE_PROFILING_ENABLE, &err);
Profiling kernel:
cl_event prof_event;
clEnqueueNDRangeKernel(cmdQueue, kernel, 1 , 0, globalWorkSize, NULL, 0, NULL, &prof_event);
Read profiling data in:
cl_ulong ev_start_time=(cl_ulong)0;
cl_ulong ev_end_time=(cl_ulong)0;
clFinish(cmdQueue);
err = clWaitForEvents(1, &prof_event);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ev_start_time, NULL);
err |= clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ev_end_time, NULL);
Calculate kernel execution time:
float run_time_gpu = (float)(ev_end_time - ev_start_time)/1000; // in usec
Profiling of individual work-items / work-groups is NOT possible yet.
You can set globalWorkSize = localWorkSize for profiling; then you have only one work-group.
By the way: profiling a single work-item (or just a few work-items) isn't very helpful. With only a few work-items you won't be able to hide memory latencies or the scheduling overhead, which leads to measurements that are not meaningful.
Try this (it only works with NVIDIA OpenCL, of course):
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time));
    return clock_time;
}
The NVIDIA OpenCL SDK has an example, Using Inline PTX with OpenCL. The clock register is accessible through inline PTX as the special register %clock. %clock is described in the PTX: Parallel Thread Execution ISA manual. You should be able to replace %%laneid with %%clock.
I have never tested this with OpenCL but use it in CUDA.
Please be warned that the compiler may reorder or remove the register read.
On NVIDIA you can use the following:
typedef unsigned long uint64_t; // if you haven't done so earlier

inline uint64_t n_nv_Clock()
{
    uint64_t n_clock;
    asm volatile("mov.u64 %0, %%clock64;" : "=l" (n_clock)); // make sure the compiler will not reorder this
    return n_clock;
}
The volatile keyword tells the optimizer that you really mean it and don't want it moved / optimized away. This is a standard way of doing so both in PTX and e.g. in gcc.
Note that this returns clocks, not nanoseconds. You need to query the device clock frequency, using clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq), &freq, 0). Also note that on older devices there are two frequencies (or three if you count the memory frequency, which is irrelevant in this case): the device clock and the shader clock. What you want is the shader clock.
With the 64-bit version of the register you don't need to worry about overflowing as it generally takes hundreds of years. On the other hand, the 32-bit version can overflow quite often (you can still recover the result - unless it overflows twice).
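The single-overflow case works out automatically if the difference of the two raw readings is computed in unsigned arithmetic, since unsigned subtraction is performed modulo 2^32. A minimal illustration (shown as a CUDA kernel using the built-in clock() for brevity; the same arithmetic applies to the inline-PTX readings above):
#include <cstdio>

// A single wrap of the 32-bit counter is harmless when the difference is computed
// with unsigned 32-bit arithmetic.
__global__ void wrap_demo()
{
    unsigned int start = (unsigned int)clock();
    // ... the code being timed would go here ...
    unsigned int end = (unsigned int)clock();
    unsigned int elapsed = end - start;   // correct even if end < start (one overflow)
    printf("elapsed cycles: %u\n", elapsed);
}

int main()
{
    wrap_demo<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}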
Now, 10 years after the question was posted, I did some tests on NVIDIA hardware. I tried running the answers given by the users 'Spectral' and 'the swine'. The answer given by 'Spectral' does not work; I always got the same invalid values returned by the clock_time function.
uint clock_time()
{
    uint clock_time;
    asm("mov.u32 %0, %%clock;" : "=r"(clock_time)); // this is wrong
    return clock_time;
}
After subtracting the start and end times I got zero.
So I had a look at the PTX assembly, which in PyOpenCL you can get this way:
kernel_string = """
your OpenCL code
"""
prg = cl.Program(ctx, kernel_string).build()
print(prg.binaries[0].decode())
It turned out that the clock command was optimized away! So there was no '%clock' instruction in the printed assembly.
Looking into Nvidia's PTX documentation I found the following:
'Normally any memory that is written to will be specified as an out operand, but if there is a hidden side effect on user memory (for example, indirect access of a memory location via an operand), or if you want to stop any memory optimizations around the asm() statement performed during generation of PTX, you can add a "memory" clobbers specification after a 3rd colon, e.g.:'
So the function that actually works is this:
uint clock_time()
{
    uint clock_time;
    asm volatile ("mov.u32 %0, %%clock;" : "=r"(clock_time) :: "memory");
    return clock_time;
}
The assembly contained lines like:
// inline asm
mov.u32 %r13, %clock;
// inline asm
The version given by 'the swine' also works.