When I use float atomicAdd(float *address, float val) to add a float value smaller than approximately 1e-39 to 0, the addition has no effect: the value at address remains 0.
Here is the simplest code:
__device__ float test[6] = {0};

__global__ void testKernel() {
    // ~1e-20: a normal float
    float addit = sinf(1e-20);
    atomicAdd(&test[0], addit);
    test[1] += addit;

    // ~1e-37: still a normal float
    addit = sinf(1e-37);
    atomicAdd(&test[2], addit);
    test[3] += addit;

    // ~1e-40: a subnormal (denormal) float
    addit = sinf(1e-40);
    atomicAdd(&test[4], addit);
    test[5] += addit;
}
When I run the code above as testKernel<<<1, 1>>>(); and stop with the debugger I see:
test 0x42697800
[0] 9.9999997e-21
[1] 9.9999997e-21
[2] 9.9999999e-38
[3] 9.9999999e-38
[4] 0
[5] 9.9999461e-41
Notice the difference between test[4] and test[5]. Both additions used the same value, yet the plain addition worked and the atomic one did nothing at all.
What am I missing here?
Update: System info: CUDA 5.5.20, NVIDIA Titan card, driver 331.82, Windows 7 x64, Nsight 3.2.1.13309.
atomicAdd is a special instruction that does not necessarily obey the same flush and rounding behaviors that you get when you specify, for example, -ftz=true or -ftz=false for other floating-point operations (e.g. an ordinary floating-point add).
As documented in the PTX ISA manual:
The floating-point operation .add is a single-precision, 32-bit operation. atom.add.f32 rounds to nearest even and flushes subnormal inputs and results to sign-preserving zero.
So even though an ordinary floating-point add will not flush denormals to zero if you specify -ftz=false (which I believe is the default for nvcc), the floating-point atomic add to global memory always flushes denormals to zero.
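One possible workaround, if a double-precision accumulator is acceptable: values around 1e-40 are still normal numbers in double precision, so they are not flushed. The sketch below follows the atomicCAS-based pattern shown in the CUDA C Programming Guide (needed because this card has no native double-precision atomicAdd); the function name is mine.

__device__ double atomicAddDouble(double *address, double val)
{
    // Reinterpret the double as a 64-bit integer so atomicCAS can be used.
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Add in double precision, then try to swap the result in atomically.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}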
I'm having trouble here. I launch two kernels, then check whether some value is the one expected (memcpy to the host); if it is I stop, and if it isn't I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this, but that printf never gets printed; the __syncthreads() is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, i just removed it for now. This "added" variable isn't the smartest thing to do but it's faster than the alternative, which is checking if any variable within an array is some value on the host and deciding to keep iterating or not.
getElement simply does the pitched-matrix address calculation to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right at the beginning, and it doesn't get printed either. After the printf it sets a variable in device memory, which is copied to the CPU after the kernel finishes, and the value isn't right when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances: with 8 blocks and 512 threads it runs OK, and with 32 blocks and 512 threads it's also OK. But with 8 blocks and 1024 threads this happens and the kernel doesn't work, nor with 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: I tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work and nothing gets printed, even when the printf is right after the three initial lines, and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with a maximum of 512, or check how high I can push this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers from a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by checking the return codes of all CUDA function calls for errors.
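For example, a minimal error-checking sketch (the macro name is mine, not from the question's code) that wraps runtime calls and checks the kernel launch would look like this:

#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage after a kernel launch:
//   aco_step<<<blocks, threads>>>(dev_data);
//   CUDA_CHECK(cudaGetLastError());        // launch-configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // execution errors (bad memory access, etc.)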
I am new to CUDA. I was using CUDA to compute the dot product of float vectors and I came across a floating-point addition issue. In essence, the following is the simple kernel. I'm compiling with -arch=sm_50.
So the basic idea is for thread 0 to sum the values of vector a.
__global__ void temp(float *a, float *b, float *c) {
    // Only thread 0 of block (0,0) does the summation.
    if (0 == threadIdx.x && blockIdx.x == 0 && blockIdx.y == 0) {
        float xx = 0.0f;
        for (int i = 0; i < LENGTH; i++) {   // LENGTH is 1000 in these tests
            xx += a[i];
        }
        *c = xx;
    }
}
When I initialize 'a' with 1000 elements of 1.0 I get the desired result of 1000.00,
but when I initialize 'a' with 1.1, I should get 1100.00xx; instead, I am getting 1099.989014. The CPU implementation simply yields 1100.000024.
I am trying to understand what the issue is here! :-(
I even tried counting the number of 1.1 elements in the a vector, and that yields 1000, which is expected. I also tried atomicAdd and still have the same issue.
would be very grateful if someone could help me out here!
best
EDIT:
The biggest concern here is the disparity between the CPU result and the GPU result! I understand floats can be off by a few decimal places, but the GPU error seems very significant! :-(
It is not possible to represent 1.1 exactly in IEEE-754 floating point. As @RobertCrovella mentioned in his comment, the computation performed on the CPU does not use the same IEEE-754 settings as the GPU one.
Indeed, 1.1 in single-precision floating point is stored as 0x3F8CCCCD, which is 1.10000002384185. When summing 1000 elements, the last bits get lost in rounding: one bit for the first addition, two bits after four additions, and so on, up to 10 bits after 1000. Depending on the rounding mode, you may effectively truncate those 10 bits for the second half of the operations, hence ending up summing 0x3F8CCC00, which is 1.09997558.
The result from CUDA divided by 1000 is 0x3F8CCC71, which is consistent with a calculation in 32 bits.
When compiling on the CPU, depending on optimization flags, you may be using fast math, which uses the internal register precision. If vector registers are not specified, it can use the x87 FPU, which has 80 bits of precision. In that case, the computation reads 1.1 as a float, which is 1.10000002384185, adds it 1000 times at the higher precision, hence not losing any bits to rounding, resulting in 1100.00002384185, and displays 1100.000024, which is its round-to-nearest representation.
Depending on compilation flags, getting the truly equivalent computation on the CPU may require enforcing 32-bit floating-point arithmetic, which can be done using addss from the SSE2 instruction set, for example.
You can also play with the /fp: option or -mfpmath with the compiler and inspect the issued instructions. In that context, the assembly instruction fadd is the 80-bit-precision addition.
All of this has nothing to do with GPU floating-point precision; it is rather a misunderstanding of the IEEE-754 standard and the legacy x87 FPU behaviour.
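To see the same effect on the host, here is a small standalone sketch (mine, not from the question) that sums 1.1f a thousand times with a strict 32-bit accumulator and with a wider double accumulator:

#include <cstdio>

int main()
{
    float  sum_f = 0.0f;   // rounded back to float after every addition
    double sum_d = 0.0;    // wider accumulator, similar to x87/double math on the CPU
    for (int i = 0; i < 1000; ++i) {
        sum_f += 1.1f;
        sum_d += (double)1.1f;   // same 32-bit input value, 1.10000002384185...
    }
    printf("float accumulator : %f\n", sum_f);  // ~1099.989014, like the GPU result
    printf("double accumulator: %f\n", sum_d);  // ~1100.000024, like the reported CPU result
    return 0;
}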
I'm experimenting with using cuFFT's callback feature to perform input format conversion on the fly (for instance, calculating FFTs of 8-bit integer input data without first doing an explicit conversion of the input buffer to float). In many of my applications, I need to calculate overlapped FFTs on an input buffer, as described in this previous SO question. Typically, adjacent FFTs might overlap by 1/4 to 1/8 of the FFT length.
cuFFT, with its FFTW-like interface, explicitly supports this via the idist parameter of the cufftPlanMany() function. Specifically, if I want to calculate FFTs of size 32768 with an overlap of 4096 samples between consecutive inputs, I would set idist = 32768 - 4096. This does work properly in the sense that it yields the correct output.
However, I'm seeing strange performance degradation when using cuFFT in this way. I have devised a test that implements this format conversion and overlap in two different ways:
Explicitly tell cuFFT about the overlapping nature of the input: set idist = nfft - overlap as I described above. Install a load callback function that just does the conversion from int8_t to float as needed on the buffer index provided to the callback.
Don't tell cuFFT about the overlapping nature of the input; lie to it and set idist = nfft. Then, let the callback function handle the overlapping by calculating the correct index that should be read for each FFT input. (A sketch of both plan setups follows below.)
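Roughly, the two plan setups differ only in idist. The following sketch uses illustrative names; it omits error checking and the callback installation via cufftXtSetCallback(), and assumes an out-of-place real-to-complex transform (the full test program is in the gist linked below):

#include <cufft.h>

void make_plans(cufftHandle *plan1, cufftHandle *plan2)
{
    const int nfft = 32768, overlap = 4096, batch = 1024;
    int n[1]       = { nfft };
    int inembed[1] = { nfft };          // non-NULL so that istride/idist are honored
    int onembed[1] = { nfft / 2 + 1 };

    // Method 1: declare the overlap to cuFFT via idist.
    cufftPlanMany(plan1, 1, n,
                  inembed, 1, nfft - overlap,   // istride = 1, idist = nfft - overlap
                  onembed, 1, nfft / 2 + 1,
                  CUFFT_R2C, batch);

    // Method 2: claim the inputs are contiguous; the load callback re-indexes.
    cufftPlanMany(plan2, 1, n,
                  inembed, 1, nfft,             // idist = nfft, no overlap declared
                  onembed, 1, nfft / 2 + 1,
                  CUFFT_R2C, batch);
}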
A test program implementing both of these approaches with timing and equivalence tests is available in this GitHub gist. I didn't reproduce it all here for brevity. The program calculates a batch of 1024 32768-point FFTs that overlap by 4096 samples; the input data type is 8-bit integers. When I run it on my machine (with a Geforce GTX 660 GPU, using CUDA 8.0 RC on Ubuntu 16.04), I get the following result:
executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec
Method 2 is noticeably faster, which I would not expect. Look at the implementations of the callback functions:
Method 1:
template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index,
void *, void *)
{
return (cufftReal)(((const T *) inbuf)[fft_index]);
}
Method 2:
template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf,
size_t fft_index, void *, void *)
{
// fft_index is the index of the sample that we need, not taking
// the overlap into account. Convert it to the appropriate sample
// index, considering the overlap structure. First, grab the FFT
// parameters from constant memory.
int nfft = overlap_params.nfft;
int overlap = overlap_params.overlap;
// Calculate which FFT in the batch that we're reading data for. This
// tells us how much overlap we need to account for. Just use integer
// arithmetic here for speed, knowing that this would cause a problem
// if we did a batch larger than 2Gsamples long.
int fft_index_int = fft_index;
int fft_batch_index = fft_index_int / nfft;
// For each transform past the first one, we need to slide "overlap"
// samples back in the input buffer when fetching the sample.
fft_index_int -= fft_batch_index * overlap;
// Cast the input pointer to the appropriate type and convert to a float.
return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}
Method 2 has a significantly more complex callback function, one that even involves integer division by a non-compile time value! I would expect this to be much slower than method 1, but I'm seeing the opposite. Is there a good explanation for this? Is it possible that cuFFT structures its processing much differently when the input overlaps, thus resulting in the degraded performance?
It seems like I should be able to achieve performance that is quite a bit faster than method 2 if the index calculations could be removed from the callback (but that would require the overlapping to be specified to cuFFT).
Edit: After running my test program under nvvp, I can see that cuFFT definitely seems to be structuring its computations differently. It's hard to make sense of the kernel symbol names, but the kernel invocations break down like this:
Method 1:
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE: 3.72 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 7.71 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.75 msec (yes, it gets invoked twice)
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.49 msec
Method 2:
spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>: 5.15 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.88 msec
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.51 msec
Interestingly, it looks like cuFFT invokes two kernels to actually compute the FFTs using method 1 (when cuFFT knows about the overlapping), but with method 2 (where it doesn't know that the FFTs are overlapped), it does the job with just one. For the kernels that are used in both cases, it does seem to use the same grid parameters between methods 1 and 2.
I don't see why it should have to use a different implementation here, especially since the input stride istride == 1. It should just use a different base address when fetching data at the transform input; the rest of the algorithm should be exactly the same, I think.
Edit 2: I'm seeing some even more bizarre behavior. I realized by accident that if I fail to destroy the cuFFT handles appropriately, I see differences in measured performance. For example, I modified the test program to skip destruction of the cuFFT handles and then executed the tests in a different sequence: method 1, method 2, then method 2 and method 1 again. I got the following results:
executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec
So the performance seems to change depending upon whether there are other cuFFT plans in existence when creating a plan for the test case! Using the profiler, I see that the structure of the kernel launches doesn't change between the two cases; the kernels just all seem to execute faster. I have no reasonable explanation for this effect either.
If you specify non-standard strides (it doesn't matter whether for the batch or the transform), cuFFT uses a different path internally.
Regarding edit 2:
This is likely GPU Boost adjusting the clocks on the GPU. cuFFT plans do not have an impact on one another.
Ways to get more stable results:
run a warm-up kernel (anything that makes the whole GPU do work is fine) before your problem; see the sketch after this list
increase batch size
run the test several times and take the average
lock the clocks of the GPU (not really possible on GeForce; Tesla can do it)
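A minimal warm-up sketch (kernel name and launch sizes are placeholders, not from the test program):

// Trivial kernel that touches a large buffer so the GPU ramps up to its
// boosted clocks before the timed cuFFT runs.
__global__ void warmup_kernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = buf[i] * 1.000001f + 0.5f;
}

// Usage: launch it a few times and synchronize before the timed section.
//   for (int k = 0; k < 10; ++k)
//       warmup_kernel<<<4096, 256>>>(d_buf, 4096 * 256);
//   cudaDeviceSynchronize();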
At the suggestion of @llukas, I filed a bug report with NVIDIA regarding the issue (https://partners.nvidia.com/bug/viewbug/1821802 if you're registered as a developer). They acknowledged the poorer performance with overlapped plans. They indicated that the kernel configuration used in both cases is suboptimal and that they plan to improve it eventually. No ETA was given, but it will likely not be in the next release (8.0 was just released last week). Finally, they said that as of CUDA 8.0, there is no workaround to make cuFFT use a more efficient method with strided inputs.
I launch a very simple kernel <<<1,512>>> on a CUDA Fermi GPU.
__global__ void kernel(){
int x1,x2;
x1=5;
x2=1;
for (int k=0;k<=1000000;k++)
{
x1+=x2;
}
}
The kernel is very simple, it does 10^6 additions and does not transfer anything back to global memory. The result is correct, i.e. after the loop x1 (in all its 512 thread instances) contains 10^6 + 5
I am trying to measure the execution time of the kernel using both Visual Studio Parallel Nsight and nvvp. Nsight measures 2.5 microseconds and nvvp measures 4 microseconds.
The issue is the following: I can increase the size of the loop by a lot, e.g. to 10^8, and the time remains constant. The same happens if I decrease the loop size a lot. Why does this happen?
Please note that if I use shared memory or global memory inside the loop, the measurements reflect the work being performed (i.e. there is proportionality).
As noted, CUDA compiler optimisation is very aggressive at removing dead code. Because x2 doesn't participate in a value which is written to memory, it and the loop can be removed. The compiler will also pre-calculate any results which can be deduced at compile time, so if all the constants in the loop are known to the compiler, it can compute the final result and replace it with a constant.
To get around both of these problems, rewrite your code like this:
__global__
void kernel(int *out, int x0, bool flag)
{
int x1 = x0, x2 = 1;
for (int k=0; k<=1000000; k++) {
x1+=x2;
}
if (flag) out[threadIdx.x + blockIdx.x*blockDim.x] = x1;
}
and then run it like this:
kernel<<<1,512>>>((int *)0, 5, false);
By passing the initial value of x1 as an argument to the kernel, you ensure that the loop result isn't available to the compiler. The flag makes the memory store conditional, and the memory store makes the whole calculation unsafe to remove. As long as the flag is set to false at runtime, no store is performed, so it doesn't affect the timing of the loop.
Because the compiler eliminates the dead paths. Your code doesn't actually do anything. Look at the assembly.
If you are actually seeing the value, then the compiler may simply have optimized out the loop, since it can compute the value at compile time.
When you write the register contents out to shared memory, the compiler cannot guarantee that the result will not be used, and hence the value will actually be computed. In other words, the value you compute must eventually be used somewhere or written to memory, otherwise its computation will be dropped.
The Nsight profiler tells me that the following kernel uses 52 registers per thread:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
float dt, float currTime,
float *dev_voles, float *dev_weasels,
curandStateMtgp32 *state)
{
__shared__ float dev_params[9];
__shared__ int BuYeSimStep[4];
if(threadIdx.x < 4)
{
BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
}
if(threadIdx.x < 9){
dev_params[threadIdx.x] = params[threadIdx.x];
}
__syncthreads();
float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
float oldVole = currVole;
float oldWeas = currWeas;
int jj;
if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
{
int dayIndex = 0;
/* Not declaring any new variable from here on, just doing arithmetics.
....... */
If each register has 4 bytes, I don't understand how we get to 52 registers, even assuming that the arrays params[9] and ctrl_params[4] end up in registers (in which case using shared memory as I did doesn't make sense). I would like to increase occupancy, but I don't get why I'm using so many registers.
Any ideas?
It's generally difficult to look at C code and predict the register usage from it. The compiler may aggressively optimize code by increasing register usage, perhaps to save an instruction here or there. You seem to be making an assumption that register usage can be predicted from your C code variable allocations, and while there is some connection between the two, you cannot assume register usage can be computed directly from C code variable allocations.
Since you haven't provided your code, nobody can actually help with the register usage. If you want to better understand the register usage, you will need to look at the PTX code directly. To do this, compile your code using nvcc with the -ptx switch, and inspect the resultant .ptx file directly. To do this you may wish to refer to the PTX documentation as well as the nvcc documentation to look at the various compiler options.
You haven't provided your code, so it's not really possible to make any direct suggestions, but you may be able to reduce register usage by reducing constant usage, reducing or refactoring arithmetic usage, switching from double to float, and I'm sure there are many other suggestions as well. Register usage will also be affected if you are passing the -G switch to the compiler.
You can limit the compiler's usage of registers per thread by passing the -maxrregcount switch to nvcc with an appropriate parameter, such as -maxrregcount 20, which will instruct the compiler to limit itself to 20 registers per thread. This tactic may not give good results, however, and you may need to tune the parameter to a value which doesn't sacrifice too much performance. Still, you may find an optimum choice which doesn't sacrifice too much basic performance but allows you to improve occupancy. If you constrain the compiler too much, it will begin to spill its needed register usage to local memory, which will generally reduce performance.
You should also be aware that you can pass -Xptxas -v to nvcc which will give useful output about the compiler's register usage and other related data (spilling, etc.) at compile time.
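A related per-kernel alternative to the global -maxrregcount switch (not mentioned above, but documented in the CUDA C Programming Guide) is the __launch_bounds__ qualifier. A rough sketch with illustrative numbers, not tuned for this kernel:

#include <curand_kernel.h>

// The qualifier tells ptxas the maximum block size and a desired minimum
// number of resident blocks per multiprocessor; ptxas then caps register
// usage accordingly, spilling to local memory if it has to.
__global__ void __launch_bounds__(256, 4)
voles_kernel_bounded(float *params, int *ctrl_params,
                     float dt, float currTime,
                     float *dev_voles, float *dev_weasels,
                     curandStateMtgp32 *state)
{
    // ... same body as voles_kernel ...
}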
If you want to increase occupancy, a direct way is to use the compiler flag maxrregcount to restrict the usage of registers, but it may cause a performance loss because some registers will be spilled to local memory, which is very slow.
I suggest you debug your code with Nsight Eclipse Edition.
Create a breakpoint at the first line of your kernel and step to it.
In the Debug perspective, inside the CUDA thread view, you have the current stack trace. Right-click on the stack and click on "Instruction Stepping Mode". The "Disassembly" window will show your kernel's assembly. You can continue stepping through your kernel to track the correlation between your source code and the assembly, and so discover what each register is used for.