I've been debugging a C++ application in VS2015 and found that a number of my double variables were ending up as NaN following a divide by zero. While this is reasonable, I have floating point exceptions enabled (/fp:except), so I would have expected this to raise an exception. The MS help page doesn't list what causes a floating point exception. According to this answer to a related question, divide by zero is a floating point exception. That does not appear to be the case, i.e. the following test program, built with /fp:except enabled,
#include <cstdio>

int main()
{
    try
    {
        double x = 1;
        double y = 0;
        double z = x / y;
        printf("%f\n", z);
        return 0;
    }
    catch (...)
    {
        printf("Exception!\n");
        return 0;
    }
}
displays "inf". Should this raise a floating point exception?
Edit: I checked that the exceptions were enabled in the debugger and I get the same result regardless.
Edit2: Further reading here on IEEE 754 suggests to me that with floating point exceptions enabled I should be getting an exception. A comment to the previously linked question, however, states 'The name "floating point exception" is a historical misnomer. Floating point division by zero is well-defined (per Annex F/IEEE754) and does not produce any signal.'
Floating point exceptions are not the same as C++ exceptions. They have to be either checked using std::fetestexcept() (see below) or trapped (SIGFPE). Enabling traps requires calling _controlfp() in MSVC and feenableexcept() in GCC/glibc.
#include <iostream>
#include <cfenv>
#include <stdexcept>

// MSVC specific, see
// https://learn.microsoft.com/en-us/cpp/preprocessor/fenv-access?view=msvc-160
#pragma fenv_access (on)

int main() {
    std::feclearexcept(FE_ALL_EXCEPT);
    double y = 0.0;
    double result{};
    result = 1 / y;
    std::cout << result << std::endl;
    if (std::fetestexcept(FE_ALL_EXCEPT) & FE_DIVBYZERO) {
        throw std::runtime_error("Division by zero");
    }
    return 0;
}
There are more examples of this in the SEI CERT C Coding Standard: https://wiki.sei.cmu.edu/confluence/display/c/FLP03-C.+Detect+and+handle+floating-point+errors.
Note that division by zero is undefined in the base standard; it's not required to signal. C and C++ implementations that define __STDC_IEC_559__, however, "shall" raise the FE_DIVBYZERO flag per Annex F/IEEE 754.
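For completeness, here is a minimal sketch of the trap approach mentioned above. It assumes glibc's feenableexcept() extension and a POSIX SIGFPE handler; on MSVC the equivalent step would be unmasking the exception with _controlfp():

#include <cfenv>
#include <csignal>
#include <cstdio>
#include <cstdlib>

// Minimal sketch: turn FE_DIVBYZERO into a hardware trap (SIGFPE).
// feenableexcept() is a glibc extension; MSVC would clear _EM_ZERODIVIDE
// via _controlfp() instead.
static void fpeHandler(int) {
    std::puts("SIGFPE: floating point divide by zero trapped");
    std::exit(1); // returning from a SIGFPE handler is undefined, so bail out
}

int main() {
    std::signal(SIGFPE, fpeHandler);
    feenableexcept(FE_DIVBYZERO);   // unmask the divide-by-zero exception

    volatile double y = 0.0;        // volatile so the division isn't folded away
    double z = 1.0 / y;             // traps here instead of quietly producing inf
    std::printf("%f\n", z);         // not reached once the trap fires
    return 0;
}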
Related
I'm trying to run mainSift.cpp from CudaSift on an Nvidia Tesla M2090. First of all, as explained in this question, I had to change sm_35 to sm_20 in the CMakeLists.txt.
Unfortunately, this error is now returned:
checkMsg() CUDA error: LaplaceMulti() execution failed
in file </ghome/rzhengac/Downloads/CudaSift/cudaSiftH.cu>, line 318 : unknown error.
And this is the LaplaceMulti code:
double LaplaceMulti(cudaTextureObject_t texObj, CudaImage *results, float baseBlur, float diffScale, float initBlur)
{
  float kernel[12*16];
  float scale = baseBlur;
  for (int i=0;i<NUM_SCALES+3;i++) {
    float kernelSum = 0.0f;
    float var = scale*scale - initBlur*initBlur;
    for (int j=-LAPLACE_R;j<=LAPLACE_R;j++) {
      kernel[16*i+j+LAPLACE_R] = (float)expf(-(double)j*j/2.0/var);
      kernelSum += kernel[16*i+j+LAPLACE_R];
    }
    for (int j=-LAPLACE_R;j<=LAPLACE_R;j++)
      kernel[16*i+j+LAPLACE_R] /= kernelSum;
    scale *= diffScale;
  }
  safeCall(cudaMemcpyToSymbol(d_Kernel2, kernel, 12*16*sizeof(float)));
  int width = results[0].width;
  int pitch = results[0].pitch;
  int height = results[0].height;
  dim3 blocks(iDivUp(width+2*LAPLACE_R, LAPLACE_W), height);
  dim3 threads(LAPLACE_W+2*LAPLACE_R, LAPLACE_S);
  LaplaceMulti<<<blocks, threads>>>(texObj, results[0].d_data, width, pitch, height);
  checkMsg("LaplaceMulti() execution failed\n");
  return 0.0;
}
I've already read this question, which seems somewhat similar, but I don't understand what the solution means or how to apply it to my problem.
Why does the error occur?
The error occurs because you are running code that uses a feature which is not supported on your GPU: texture objects require compute capability 3.0 or newer, while the Tesla M2090 is a compute capability 2.0 device. I am a little surprised that the compiler doesn't generate an error during compilation, but that is another question.
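If it helps to confirm this on your machine, a small sketch along these lines (assuming the standard CUDA runtime API) reports the device's compute capability before the code attempts to create any texture objects:

#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: report the compute capability of device 0 so you can
// verify whether texture objects (which need sm_30 or newer) are available.
int main() {
    cudaDeviceProp prop{};
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    if (prop.major < 3) {
        std::printf("Texture objects are not supported on this device.\n");
    }
    return 0;
}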
There is no solution except to use supported hardware, or to rewrite the code.
[This answer assembled from comments and added as a community wiki entry to get this answer off the unanswered list for the CUDA tag]
I read the wiki page about heisenbugs, but don't understand this example. Can anyone explain it in detail?
One common example of a heisenbug is a bug that appears when the program is compiled with an optimizing compiler, but not when the same program is compiled without optimization (as is often done for the purpose of examining it with a debugger). While debugging, values that an optimized program would normally keep in registers are often pushed to main memory. This may affect, for instance, the result of floating-point comparisons, since the value in memory may have smaller range and accuracy than the value in the register.
Here's a concrete example recently posted:
Infinite loop heisenbug: it exits if I add a printout
It's a really nice specimen because we can all reproduce it: http://ideone.com/rjY5kQ
These bugs are so dependent on very precise features of the platform that people also find them very difficult to reproduce.
In this case, when the print-out is omitted, the program performs a high-precision comparison inside the CPU registers (at higher precision than a stored double). To print out the value, however, the compiler moves the result to main memory, which implicitly truncates it to double precision. When that truncated value is used for the comparison, the comparison succeeds and the loop exits.
#include <iostream>
#include <cmath>

double up = 19.0 + (61.0/125.0);
double down = -32.0 - (2.0/3.0);
double rectangle = (up - down) * 8.0;

double f(double x) {
    return (pow(x, 4.0)/500.0) - (pow(x, 2.0)/200.0) - 0.012;
}

double g(double x) {
    return -(pow(x, 3.0)/30.0) + (x/20.0) + (1.0/6.0);
}

double area_upper(double x, double step) {
    return (((up - f(x)) + (up - f(x + step))) * step) / 2.0;
}

double area_lower(double x, double step) {
    return (((g(x) - down) + (g(x + step) - down)) * step) / 2.0;
}

double area(double x, double step) {
    return area_upper(x, step) + area_lower(x, step);
}

int main() {
    double current = 0, last = 0, step = 1.0;
    do {
        last = current;
        step /= 10.0;
        current = 0;
        for(double x = 2.0; x < 10.0; x += step) current += area(x, step);
        current = rectangle - current;
        current = round(current * 1000.0) / 1000.0;
        //std::cout << current << std::endl; //<-- COMMENT BACK IN TO "FIX" BUG
    } while(current != last);
    std::cout << current << std::endl;
    return 0;
}
Edit: Verified that the bug and the fix still reproduce: 03-FEB-22, 20-Feb-17
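One way to see the mechanism at work (my own sketch, not part of the original answer): forcing the value through memory yourself has the same effect as the commented-out print-out, because the volatile store rounds the extended-precision register value down to a 64-bit double before the comparison.

// Hypothetical variant of the loop above: the volatile store pushes
// `current` out of the x87 80-bit register into a 64-bit double in memory,
// just like handing it to std::cout does, so the loop terminates.
do {
    last = current;
    step /= 10.0;
    current = 0;
    for (double x = 2.0; x < 10.0; x += step) current += area(x, step);
    current = rectangle - current;
    current = round(current * 1000.0) / 1000.0;

    volatile double truncated = current;  // force rounding to double precision
    current = truncated;
} while (current != last);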
It comes from the Heisenberg Uncertainty Principle, which basically states that there is a fundamental limit to the precision with which certain pairs of physical properties of a particle can be known simultaneously. If you observe a particle too closely (i.e., you know its position precisely), then you can't measure its momentum precisely. (And if you know its exact speed, then you can't tell its exact position.)
So, following this, a heisenbug is a bug which disappears when you are watching it closely.
In your example, if you need the program to perform well, you will compile it with optimization, and there will be a bug. But as soon as you enter debugging mode, you will compile it without optimization, which removes the bug.
So if you start observing the bug too closely, you will be unable to pin down its properties (or to find it at all), which resembles Heisenberg's Uncertainty Principle and is hence called a heisenbug.
The idea is that the code is compiled into two states - one is the normal or debug build and the other is the optimised or production build.
Just as it is important to know what happens to matter at the quantum level, we should also know what happens to our code at the compiler level!
I would like to use Thrust's stream compaction functionality (copy_if) for distilling indices of elements from a vector if the elements adhere to a number of constraints. One of these constraints depends on the values of neighboring elements (8 in 2D and 26 in 3D). My question is: how can I obtain the neighbors of an element in Thrust?
The function call operator of the functor for the 'copy_if' basically looks like:
__host__ __device__ bool operator()(float x) {
    bool mark = x < 0.0f;
    if (mark) {
        if (left neighbor of x > 1.0f) return false;
        if (right neighbor of x > 1.0f) return false;
        if (top neighbor of x > 1.0f) return false;
        //etc.
    }
    return mark;
}
Currently I use a work-around by first launching a CUDA kernel (in which it is easy to access neighbors) to appropriately mark the elements. After that, I pass the marked elements to Thrust's copy_if to distill the indices of the marked elements.
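To give an idea of what that work-around looks like, here is a rough sketch (my own illustration with hypothetical names such as markKernel and distillIndices, assuming a 1D layout where the row above/below is width elements away; the real constraints and boundary handling are application-specific):

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>

// Hypothetical marking kernel: each thread inspects its element and its
// neighbors (easy here because the raw index is available) and writes a flag.
__global__ void markKernel(const float* x, int* flags, int width, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool mark = x[i] < 0.0f;
    if (mark) {
        if (i - 1     >= 0 && x[i - 1]     > 1.0f) mark = false;  // left neighbor
        if (i + 1     <  n && x[i + 1]     > 1.0f) mark = false;  // right neighbor
        if (i - width >= 0 && x[i - width] > 1.0f) mark = false;  // top neighbor
        // ... remaining neighbors
    }
    flags[i] = mark ? 1 : 0;
}

// Distill the indices of the marked elements with copy_if, using the flags
// vector as the stencil.
void distillIndices(const thrust::device_vector<float>& x, int width,
                    thrust::device_vector<int>& indices) {
    int n = static_cast<int>(x.size());
    thrust::device_vector<int> flags(n);
    markKernel<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(x.data()),
                                         thrust::raw_pointer_cast(flags.data()),
                                         width, n);
    indices.resize(n);
    auto end = thrust::copy_if(thrust::counting_iterator<int>(0),
                               thrust::counting_iterator<int>(n),
                               flags.begin(),              // stencil
                               indices.begin(),
                               thrust::identity<int>());
    indices.resize(end - indices.begin());
}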
I came across counting_iterator as a sort of substitute for directly using threadIdx and blockIdx to acquire the index of the processed element. I tried the solution below, but when compiling it I get the error "/usr/include/cuda/thrust/detail/device/cuda/copy_if.inl(151): Error: Unaligned memory accesses not supported". As far as I know, I'm not trying to access memory in an unaligned fashion. Does anybody know what's going on and/or how to fix it?
struct IsEmpty2 {
    float* xi;

    IsEmpty2(float* pXi) { xi = pXi; }

    __host__ __device__ bool operator()(thrust::tuple<float, int> t) {
        bool mark = thrust::get<0>(t) < -0.01f;
        if (mark) {
            int countindex = thrust::get<1>(t);
            if (xi[countindex] > 1.01f) return false;
            //etc.
        }
        return mark;
    }
};
thrust::copy_if(indices.begin(),
                indices.end(),
                thrust::make_zip_iterator(thrust::make_tuple(xi, thrust::counting_iterator<int>())),
                indicesEmptied.begin(),
                IsEmpty2(rawXi));
#phoad: you're right about the shared mem; it struck me after I had already posted my reply, and I subsequently thought the cache would probably help me. But you beat me with your quick response. The if-statement, however, is executed in less than 5% of all cases, so either using shared mem or relying on the cache will probably have negligible impact on performance.
Tuples only support 10 values, so that would mean I would require tuples of tuples for the 26 values in the 3D case. Working with tuples and zip_iterator was already quite cumbersome, so I'll pass on this option (also from a code readability standpoint). I tried your suggestion of directly using threadIdx.x etc. in the device function, but Thrust doesn't like that. I seem to be getting some unexplainable results and sometimes I end up with a Thrust error. The following program, for example, generates a 'thrust::system::system_error' with an 'unspecified launch failure', although it first correctly prints "Processing 10" to "Processing 41":
struct printf_functor {
    __host__ __device__ void operator()(int e) {
        printf("Processing %d\n", threadIdx.x);
    }
};

int main() {
    thrust::device_vector<int> dVec(32);
    for (int i = 0; i < 32; ++i)
        dVec[i] = i + 10;
    thrust::for_each(dVec.begin(), dVec.end(), printf_functor());
    return 0;
}
The same applies to printing blockIdx.x. Printing blockDim.x, however, generates no error. I was hoping for a clean solution, but I guess I am stuck with my current work-around.
Do any of the operations dealing with masking or extracting individual bits from an integer depend on endianness? I've written some code, but with access only to hardware of one type, I can't really check that its behaviour is endian-independent. Please let me know if you see any bugs. NOTE: This code was written for a homework problem, and for personal edification:
#include <iostream>
using std::cout;

void PrintDecimalInBinaryRecursion (long long n, bool sign);

void PrintDecimalIntegerInBinary (long long n)
{
    PrintDecimalInBinaryRecursion(n, n >= 0);
}

void PrintDecimalInBinaryRecursion (long long n, bool sign)
{
    if (n == 0) {
        cout << (sign ? 0x0 : 0x1);
    }
    else {
        PrintDecimalInBinaryRecursion((unsigned long long)n >> 1, sign);
        cout << (n & 0x1);
    }
}
Thanks for your help.
Endianness only determines how data is stored in memory, not how it is processed. Bitwise operators and bit shifts operate on values, so they are unaffected by endianness. In particular, 0x1 means the same thing regardless of the endianness.
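As a small illustrative sketch (my own example, not from the original answer): the shift below extracts the same value on big- and little-endian machines, even though inspecting the bytes of the integer in memory gives different layouts:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    std::uint32_t v = 0x11223344u;

    // Value-level operations: identical result on any endianness.
    unsigned secondByte = (v >> 8) & 0xFFu;   // always 0x33
    std::printf("(v >> 8) & 0xFF = 0x%02X\n", secondByte);

    // Memory-level view: this is where endianness shows up.
    unsigned char bytes[4];
    std::memcpy(bytes, &v, sizeof v);
    std::printf("bytes in memory: %02X %02X %02X %02X (a little-endian machine prints 44 33 22 11)\n",
                bytes[0], bytes[1], bytes[2], bytes[3]);
    return 0;
}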
Hey all, I am using CUDA and the Thrust library. I am running into a problem when I try to access a double pointer on the CUDA kernel, loaded with a thrust::device_vector of type Object* (vector of pointers) from the host. When compiled with 'nvcc -o thrust main.cpp cukernel.cu' I receive the warning 'Warning: Cannot tell what pointer points to, assuming global memory space' and a launch error upon attempting to run the program.
I have read the Nvidia forums and the solution seems to be 'Don't use double pointers in a CUDA kernel'. I am not looking to collapse the double pointer into a 1D pointer before sending it to the kernel... Has anyone found a solution to this problem? The required code is below, thanks in advance!
--------------------------
main.cpp
--------------------------
Sphere * parseSphere(int i)
{
  Sphere * s = new Sphere();
  s->a = 1+i;
  s->b = 2+i;
  s->c = 3+i;
  return s;
}
int main( int argc, char** argv ) {
  int i;
  thrust::host_vector<Sphere *> spheres_h;
  thrust::host_vector<Sphere> spheres_resh(NUM_OBJECTS);

  //initialize spheres_h
  for(i=0;i<NUM_OBJECTS;i++){
    Sphere * sphere = parseSphere(i);
    spheres_h.push_back(sphere);
  }

  //initialize spheres_resh
  for(i=0;i<NUM_OBJECTS;i++){
    spheres_resh[i].a = 1;
    spheres_resh[i].b = 1;
    spheres_resh[i].c = 1;
  }

  thrust::device_vector<Sphere *> spheres_dv = spheres_h;
  thrust::device_vector<Sphere> spheres_resv = spheres_resh;
  Sphere ** spheres_d = thrust::raw_pointer_cast(&spheres_dv[0]);
  Sphere * spheres_res = thrust::raw_pointer_cast(&spheres_resv[0]);

  kernelBegin(spheres_d,spheres_res,NUM_OBJECTS);

  thrust::copy(spheres_dv.begin(),spheres_dv.end(),spheres_h.begin());
  thrust::copy(spheres_resv.begin(),spheres_resv.end(),spheres_resh.begin());

  bool result = true;
  for(i=0;i<NUM_OBJECTS;i++){
    result &= (spheres_resh[i].a == i+1);
    result &= (spheres_resh[i].b == i+2);
    result &= (spheres_resh[i].c == i+3);
  }
  if(result)
  {
    cout << "Data GOOD!" << endl;
  }else{
    cout << "Data BAD!" << endl;
  }
  return 0;
}
--------------------------
cukernel.cu
--------------------------
__global__ void deviceBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
  int index = threadIdx.x + blockIdx.x*blockDim.x;
  spheres_res[index].a = (*(spheres_d+index))->a; //causes warning/launch error
  spheres_res[index].b = (*(spheres_d+index))->b;
  spheres_res[index].c = (*(spheres_d+index))->c;
}

void kernelBegin(Sphere ** spheres_d, Sphere * spheres_res, float num_objects)
{
  int threads = 512;//per block
  int grids = ((num_objects)/threads)+1;//blocks per grid
  deviceBegin<<<grids,threads>>>(spheres_d, spheres_res, num_objects);
}
The basic problem here is that device vector spheres_dv contains host pointers. Thrust cannot do "deep copying" or pointer translation between the GPU and host CPU address spaces. So when you copy spheres_h to GPU memory, you are winding up with a GPU array of host pointers. Indirection of host pointers on the GPU is illegal - they are pointers in the wrong memory address space, thus you are getting the GPU equivalent of a segfault inside the kernel.
The solution is going to involve replacing your parseSphere function with something that performs memory allocation on the GPU, rather than the current parseSphere, which allocates each new structure in host memory. If you had a Fermi GPU (which it appears you do not) and were using CUDA 3.2 or 4.0, then one approach would be to turn parseSphere into a kernel. The C++ new operator is supported in device code, so structure creation would occur in device memory. You would need to modify the definition of Sphere so that the constructor is defined as a __device__ function for this approach to work.
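A rough sketch of what that first approach might look like (my own illustration, assuming a plain Sphere struct as in the question; device-side new requires sm_20+ and CUDA 4.0 or newer):

#include <cuda_runtime.h>

struct Sphere { float a, b, c; };  // assumed layout, as in the question

// Sketch: construct each Sphere in device memory from inside a kernel,
// using device-side operator new.
__global__ void parseSphereKernel(Sphere** spheres_d, int num_objects) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i >= num_objects) return;
    Sphere* s = new Sphere;   // allocated on the device heap
    s->a = 1 + i;
    s->b = 2 + i;
    s->c = 3 + i;
    spheres_d[i] = s;         // a genuine device pointer, safe to dereference in later kernels
}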
The alternative approach will involve creating a host array holding device pointers, then copy that array to device memory. You can see an example of that in this answer. Note that it is probably the case that declaring a thrust::device_vector containing thrust::device_vector won't work, so you will likely need to do this array of device pointers construction using the underlying CUDA API calls.
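A hedged sketch of that second approach (my own illustration, assuming a simple Sphere struct and NUM_OBJECTS as in the question; error checking omitted for brevity): allocate each Sphere on the device with cudaMalloc, collect the resulting device pointers in a host array, and then copy that array of pointers to the device.

#include <cuda_runtime.h>
#include <vector>

struct Sphere { float a, b, c; };  // assumed layout, as in the question

const int NUM_OBJECTS = 64;        // placeholder value

Sphere** buildDeviceSphereArray() {
    // 1. Allocate each Sphere in device memory and keep the device pointer on the host.
    std::vector<Sphere*> dev_ptrs_h(NUM_OBJECTS);
    for (int i = 0; i < NUM_OBJECTS; ++i) {
        Sphere tmp = { float(1 + i), float(2 + i), float(3 + i) };
        cudaMalloc((void**)&dev_ptrs_h[i], sizeof(Sphere));
        cudaMemcpy(dev_ptrs_h[i], &tmp, sizeof(Sphere), cudaMemcpyHostToDevice);
    }

    // 2. Copy the host array of device pointers to the device, giving a Sphere**
    //    that can legally be dereferenced inside a kernel.
    Sphere** spheres_d = nullptr;
    cudaMalloc((void**)&spheres_d, NUM_OBJECTS * sizeof(Sphere*));
    cudaMemcpy(spheres_d, dev_ptrs_h.data(), NUM_OBJECTS * sizeof(Sphere*),
               cudaMemcpyHostToDevice);
    return spheres_d;
}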
You should also note that I haven't mentioned the reverse copy operation, which is just as difficult to do.
The bottom line is that thrust (and C++ STL containers, for that matter) really are not intended to hold pointers. They are intended to hold values, abstracting away pointer indirection and direct memory access through the use of iterators and underlying algorithms which the user isn't supposed to see. Further, the "deep copy" problem is the main reason why the wise people on the NVIDIA forums counsel against multiple levels of pointers in GPU code. It greatly complicates code, and it executes slower on the GPU as well.