cudaErrorLaunchFailure when running MD5 5000 times [duplicate] - cuda

I have a CUDA program that seems to be hitting some sort of limit of some resource, but I can't figure out what that resource is. Here is the kernel function:
__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap,
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;

    if(segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;

        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];

            // legit if not a segment within our own polyline and
            // not a fake segment
            bool isLegit = (currentPolylineIndex != polylineIndex &&
                currentPolylineIndex >= 0);

            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;

            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;

            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;

            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);

            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;

            result = result | (isLegit && !isParallel &&
                intersectionPoint.x > min(p1.x, p2.x) &&
                intersectionPoint.x > min(p3.x, p4.x) &&
                intersectionPoint.x < max(p1.x, p2.x) &&
                intersectionPoint.x < max(p3.x, p4.x) &&
                intersectionPoint.y > min(p1.y, p2.y) &&
                intersectionPoint.y > min(p3.y, p4.y) &&
                intersectionPoint.y < max(p1.y, p2.y) &&
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }

    output[segmentIndex] = result;
}
Here is the call to execute the kernel function:
DoCheck<<<702, 32>>>(
    (float2*)devicePoints,
    deviceSegmentsToPolylineIndexMap,
    numSegments,
    deviceOutput);
The sizes of the parameters are as follows:
devicePoints = 22,464 float2s = 179,712 bytes
deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 bytes
numSegments = 1 int = 4 bytes
deviceOutput = 22,463 ints = 89,852 bytes
When I execute this kernel, it crashes the video card. It would appear that I am hitting some sort of limit, because if I execute the kernel using DoCheck<<<300, 32>>>(...);, it works. Just to be clear, the parameters are the same, just the number of blocks is different.
Any idea why one crashes the video driver and the other doesn't? The one that fails still seems to be within the card's limit on the number of blocks.
Update
More information on my system configuration:
Video Card: nVidia 8800GT
CUDA Version: 1.1
OS: Windows Server 2008 R2
I also tried it on a laptop with the following configuration, but got the same results:
Video Card: nVidia Quadro FX 880M
CUDA Version: 1.2
OS: Windows 7 64-bit

The resource being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer which will kill any kernel that takes more than a few seconds to execute. Code running on a card that is also driving a display is subject to this limit.
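As a side note, a minimal host-side check like the one below (a hypothetical sketch, not code from the question) makes the watchdog kill visible as an error code, such as cudaErrorLaunchTimeout or the cudaErrorLaunchFailure in the title, rather than an apparent display crash:

// Hypothetical host-side check after the launch.
DoCheck<<<702, 32>>>((float2*)devicePoints, deviceSegmentsToPolylineIndexMap,
                     numSegments, deviceOutput);
cudaError_t err = cudaDeviceSynchronize();   // waits for the kernel and returns its status
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));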
On the WDDM Windows platforms you are using, there are three possible solutions/work-arounds:
Get a Tesla card and use the TCC driver, which eliminates the problem completely
Try modifying the registry settings to increase the timer limit (google the TdrDelay registry key for more information, but I am not a Windows user and can't be more specific than that)
Modify your kernel code to be "re-entrant" and process the data-parallel workload in several kernel launches rather than one. Kernel launch overhead isn't all that large, and processing the workload over several kernel runs is often pretty easy to achieve, depending on the algorithm you are using; a minimal sketch of this idea follows.
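To illustrate the third option, here is a minimal, hypothetical sketch of slicing the launch (it reuses the parameter names from the question and assumes DoCheck gains an extra segmentOffset parameter that is added to segmentIndex inside the kernel; it is not the questioner's actual code):

// Hypothetical sketch: process the segments in slices so no single launch
// runs long enough to trip the display watchdog.
// Assumes the kernel computes:
//     int segmentIndex = segmentOffset + threadIdx.x + blockIdx.x * blockDim.x;
const int threadsPerBlock   = 32;
const int blocksPerLaunch   = 100;   // <<<300, 32>>> was known to work, so stay well below that
const int segmentsPerLaunch = blocksPerLaunch * threadsPerBlock;

for(int firstSegment = 0; firstSegment < numSegments; firstSegment += segmentsPerLaunch)
{
    int segmentsLeft = numSegments - firstSegment;
    int segmentsThisLaunch = segmentsLeft < segmentsPerLaunch ? segmentsLeft : segmentsPerLaunch;
    int blocks = (segmentsThisLaunch + threadsPerBlock - 1) / threadsPerBlock;

    DoCheck<<<blocks, threadsPerBlock>>>(
        (float2*)devicePoints,
        deviceSegmentsToPolylineIndexMap,
        numSegments,
        deviceOutput,
        firstSegment);               // hypothetical extra argument

    cudaDeviceSynchronize();         // let each slice finish before launching the next
}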

Related

Why does my code fail when calling curand too many times? [duplicate]


Cuda program not scaling [duplicate]


GPU Precision issues for relatively small array sizes?

I have some CUDA code that does some linear algebra to invert a special type of structured matrix. I calculate the RMS error against the results of a serial version of the algorithm. The error grows with problem size to a greater extent than I would expect. Can anyone provide insight as to why this may be the case?
The GPU code is very naive. This is intentional, and I will optimize it very soon - I just wanted a simple baseline kernel that gives the proper results.
__global__ void levinson_durbin_gpu(TYPE *h0_d, TYPE *h_d, TYPE *v_d, TYPE *x_d, TYPE *y_d, int N) //Naive kernel
{
    int j = threadIdx.x;
    int i;
    __shared__ TYPE hn_1[512];
    hn_1[j] = h_d[j];

    for(i=1; i<N; i++)
    {
        if(j < i)
        {
            TYPE hn = h_d[i];
            TYPE yn = y_d[i];
            __syncthreads();

            //Set up temporary arrays, compute inner products
            __shared__ TYPE temp[512];  //Temp for hn_1_J_v
            __shared__ TYPE temp2[512]; //Temp for hn_1_J_x
            __shared__ TYPE temp3[512]; //Temp for hn_1_v
            temp[j] = hn_1[j]*v_d[i-j-1];
            temp2[j] = hn_1[j]*x_d[i-j-1];
            temp3[j] = hn_1[j]*v_d[j];
            __syncthreads();

            //Three reductions at once
            for(unsigned int s=1; s<i; s*=2)
            {
                int index = 2*s*j;
                if((index+s) < i)
                {
                    temp[index] += temp[index+s];
                    temp2[index] += temp2[index+s];
                    temp3[index] += temp3[index+s];
                }
                __syncthreads();
            }

            TYPE hn_1_J_v = temp[0];
            TYPE hn_1_J_x = temp2[0];
            TYPE hn_1_v = temp3[0];
            TYPE alpha_v = (hn - hn_1_J_v)/(h0_d[0] - hn_1_v);
            TYPE alpha_x = (yn - hn_1_J_x)/(h0_d[0] - hn_1_v);

            __shared__ TYPE w_v[512];
            w_v[j] = v_d[j] - alpha_v*v_d[i-j-1];
            __shared__ TYPE w_x[512];
            w_x[j] = x_d[j] - alpha_x*v_d[i-j-1];

            v_d[j] = w_v[j];
            x_d[j] = w_x[j];

            if(j == 0)
            {
                v_d[i] = alpha_v;
                x_d[i] = alpha_x;
            }
        }
        __syncthreads();
    }
}
The identifier TYPE is either float or double depending on how I compile the code. I'm using 1 block with N threads (again, keeping things naive and simple here). With single precision I see the following results:
N=4: RMS Error = 0.0000000027
N=8: RMS Error = 0.0000001127
N=16: RMS Error = 0.0000008832
N=32: RMS Error = 0.0000009233
N=64: RMS Error = 42.0136776452
N=80: RMS Error = 281371.7533760048
I can't tell if this is an error with my algorithm or some sort of precision issue. If it helps I can show the above results using double precision, the CPU version of the algorithm, or the code that calculates the RMS error. I'm using a GeForce GTX 660 Ti (cc 3.0) GPU. The variable x_d contains the end result.
Thanks to the help from the comments section I was able to solve the problem myself, so I'll document it here in case others experience a similar issue.
The problem was indeed a synchronization issue: my use of __syncthreads() within a divergent control flow block. The solution was to break that control flow block into multiple parts and call __syncthreads() after each part:
__global__ void levinson_durbin_gpu(TYPE *h0_d, TYPE *h_d, TYPE *v_d, TYPE *x_d, TYPE *y_d, int N) //Naive kernel
{
    int j = threadIdx.x;
    int i;
    __shared__ TYPE hn_1[512];
    hn_1[j] = h_d[j];
    __syncthreads();

    //Set up temporary arrays
    __shared__ TYPE temp[512];  //Temp for hn_1_J_v
    __shared__ TYPE temp2[512]; //Temp for hn_1_J_x
    __shared__ TYPE temp3[512]; //Temp for hn_1_v
    TYPE hn;
    TYPE yn;

    for(i=1; i<N; i++)
    {
        if(j < i)
        {
            hn = h_d[i];
            yn = y_d[i];

            //Compute inner products
            temp[j] = hn_1[j]*v_d[i-j-1];
            temp2[j] = hn_1[j]*x_d[i-j-1];
            temp3[j] = hn_1[j]*v_d[j];
        }
        __syncthreads();

        //Have all threads complete this section to avoid synchronization issues
        //Three reductions at once
        for(unsigned int s=1; s<i; s*=2)
        {
            int index = 2*s*j;
            if((index+s) < i)
            {
                temp[index] += temp[index+s];
                temp2[index] += temp2[index+s];
                temp3[index] += temp3[index+s];
            }
            __syncthreads();
        }

        if(j < i)
        {
            TYPE hn_1_J_v = temp[0];
            TYPE hn_1_J_x = temp2[0];
            TYPE hn_1_v = temp3[0];
            TYPE alpha_v = (hn - hn_1_J_v)/(h0_d[0] - hn_1_v);
            TYPE alpha_x = (yn - hn_1_J_x)/(h0_d[0] - hn_1_v);

            __shared__ TYPE w_v[512];
            w_v[j] = v_d[j] - alpha_v*v_d[i-j-1];
            __shared__ TYPE w_x[512];
            w_x[j] = x_d[j] - alpha_x*v_d[i-j-1];

            v_d[j] = w_v[j];
            x_d[j] = w_x[j];

            if(j == 0)
            {
                v_d[i] = alpha_v;
                x_d[i] = alpha_x;
            }
        }
        __syncthreads();
    }
}
N=32: RMS Error = 0.0000009233
N=64: RMS Error = 0.0000027644
N=128: RMS Error = 0.0000058276
N=256: RMS Error = 0.0000117755
N=512: RMS Error = 0.0000237040
What I learned: when you use synchronization mechanisms in CUDA, make sure all threads reach the same barrier point! I feel as though this sort of thing should produce a compiler warning.
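To make that rule concrete, here is a standalone sketch (hypothetical buffer and size names, not code from the kernel above) showing the difference between placing the barrier inside a divergent branch and hoisting it out so every thread in the block reaches it:

// Standalone illustration of the barrier rule (hypothetical names).
//
// Wrong: the barrier sits inside a divergent branch, so threads with
// threadIdx.x >= n never reach it and the behaviour is undefined:
//
//     if(threadIdx.x < n){
//         buf[threadIdx.x] = in[threadIdx.x];
//         __syncthreads();
//         out[threadIdx.x] = buf[0];
//     }
//
// Right: only the work is predicated; the barrier is reached by all threads.
__global__ void barrier_demo(const float* in, float* out, int n)
{
    __shared__ float buf[512];

    if(threadIdx.x < n)
        buf[threadIdx.x] = in[threadIdx.x];

    __syncthreads();   // executed by every thread in the block

    if(threadIdx.x < n)
        out[threadIdx.x] = buf[0];
}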

2D kernel calling and launch parameters for non-square matrix

I am attempting to port the following (simplified) nested loop as a CUDA 2D kernel. The sizes of NgS and NgO will increase with larger data sets; for now I just want to get this kernel to output the correct results for all values:
// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )

int NgS = 1859;
int NgO = 900;

// 1D flattened matrices have been initialized as:
Radio_cpu = new double [NgS*NgO];
Result_cpu = new double [NgS*NgO];
// ignoring the part where they are filled w/ data

for (m=0; m<NgO; m++) {
    for (n=0; n<NgS; n++) {
        Result_cpu[idx(n,m,NgO)] = k0*Radio_cpu[idx(n,m,NgO)];
    }
}
The examples I have come across usually deal with square loops, and I have been unable to get the correct output for all the GPU array indices compared to the CPU version. Here is the host code calling the kernel:
dim3 dimBlock(16, 16);
dim3 dimGrid;
dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;
// Result_gpu and Radio_gpu are allocated versions of the CPU variables on GPU
trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, Radio_gpu, Result_gpu);
Here is the kernel:
__global__ void trans(int NgO, int NgS,
                      double k0, double * Radio, double * Result) {

    int n = blockIdx.x * blockDim.x + threadIdx.x;
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    if(n > NgS || m > NgO) return;

    // map the two 2D indices to a single linear, 1D index
    int grid_width = gridDim.x * blockDim.x;
    int idxxx = m + (n * grid_width);
    Result[idxxx] = k0 * Radio[idxxx];
}
With the current code, I proceeded to compare the Result_cpu variable with the Result_gpu variable once it was copied back. When I cycle through the values I get:
// matches from NgS = 0...913
Result_gpu[NgS = 913][NgO = 0]: -56887.2
Result_cpu[Ngs = 913][NgO = 0]: -56887.2
// mismatches from NgS = 914...1858
Result_gpu[NgS = 914][NgO = 0]: -12.2352
Result_cpu[NgS = 914][NgO = 0]: 79448.6
This pattern is the same regardless of the value of NgO. I have been trying to figure out where I have made a mistake by looking at various examples for a few hours and trying out changes, but so far this scheme has worked apart from the obvious issue at hand, whereas the other schemes I tried caused kernel invocation errors or left the GPU array uninitialized for all values. Since I clearly cannot see the mistake, I'd really appreciate it if someone could point me in the right direction towards a fix. I'm pretty sure it's right under my nose and I can't see it.
In case it matters, I'm testing this code on a Kepler card, compiling using MSVC 2010, CUDA 4.2 and 304.79 driver and have compiled the code with both arch=compute_20,code=sm_20 and arch=compute_30,code=compute_30 flags with no difference.
@vaca_loca: I tested the following kernel (it also works for me with non-square block dimensions):
__global__ void trans(int NgO, int NgS,
                      double k0, double * Radio, double * Result) {

    int n = blockIdx.x * blockDim.x + threadIdx.x;
    int m = blockIdx.y * blockDim.y + threadIdx.y;
    if(n > NgO || m > NgS) return;

    int ofs = m * NgO + n;
    Result[ofs] = k0 * Radio[ofs];
}

void test() {

    int NgS = 1859, NgO = 900;
    int data_sz = NgS * NgO, bytes = data_sz * sizeof(double);
    cudaSetDevice(0);

    double *Radio_cpu = new double [data_sz*3],
           *Result_cpu = Radio_cpu + data_sz,
           *Result_gpu = Result_cpu + data_sz;
    double k0 = -1.7961233;

    srand48(time(NULL));
    int i, j, n, m;
    for(m=0; m<NgO; m++) {
        for (n=0; n<NgS; n++) {
            Radio_cpu[m + n*NgO] = lrand48() % 234234;
            Result_cpu[m + n*NgO] = k0*Radio_cpu[m + n*NgO];
        }
    }

    double *g_Radio, *g_Result;
    cudaMalloc((void **)&g_Radio, bytes * 2);
    g_Result = g_Radio + data_sz;
    cudaMemcpy(g_Radio, Radio_cpu, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(16, 16);
    dim3 dimGrid;
    dimGrid.x = (NgO + dimBlock.x - 1) / dimBlock.x;
    dimGrid.y = (NgS + dimBlock.y - 1) / dimBlock.y;

    trans<<<dimGrid,dimBlock>>>(NgO, NgS, k0, g_Radio, g_Result);
    cudaMemcpy(Result_gpu, g_Result, bytes, cudaMemcpyDeviceToHost);

    for(m=0; m<NgO; m++) {
        for (n=0; n<NgS; n++) {
            double c1 = Result_cpu[m + n*NgO],
                   c2 = Result_gpu[m + n*NgO];
            if(std::abs(c1-c2) > 1e-4)
                printf("(%d;%d): %.7f %.7f\n", n, m, c1, c2);
        }
    }
    cudaFree(g_Radio);
    delete []Radio_cpu;
}
Though, in my opinion, accessing data from global memory using quads might not be very cache-friendly, since the access stride is pretty large. You might consider using 2D textures instead if it is critical for your algorithm to access data with 2D locality.

Cuda demoting double to float error despite no doubles in code

I'm writing a kernel using PyCUDA. My GPU device only supports compute capability 1.1 (arch sm_11) and so I can only use floats in my code. I've taken great effort to ensure I'm doing everything with floats, but despite that, there is a particular line in my code that keeps causing a compiler error.
The chunk of code is:
// Gradient magnitude, so 1 <= x <= width, 1 <= y <= height.
if( j > 0 && j < im_width && i > 0 && i < im_height){
    gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
}
Here, idx() is a __device__ helper function that returns a linear index based on pixel indices i and j, and it only works with integers. I use it throughout and it doesn't give errors anywhere else, so I strongly suspect it's not idx(). The sqrt() call is just from the standard C math functions, which support floats. The arrays involved, x_gradient, y_gradient, and gradient_mag, are all float* and are part of the input to my function (i.e. declared in Python, then converted to device variables, etc.).
I've tried removing the extra cast to float in my code above, with no luck. I've also tried doing something completely stupid like this:
// Gradient magnitude, so 1 <= x <= width, 1 <= y <= height.
if( j > 0 && j < im_width && i > 0 && i < im_height){
    gradient_mag[idx(i,j)] = 3.0f; // also tried float(3.0) here
}
All of these variations give the same error:
pycuda.driver.CompileError: nvcc said it demoted types in source code it compiled--this is likely not what you want.
[command: nvcc --cubin -arch sm_11 -I/usr/local/lib/python2.7/dist-packages/pycuda-2011.1.2-py2.7-linux-x86_64.egg/pycuda/../include/pycuda kernel.cu]
[stderr:
ptxas /tmp/tmpxft_00004329_00000000-2_kernel.ptx, line 128; warning : Double is not supported. Demoting to float
]
Any ideas? I've debugged many errors in my code and was hoping to get it working tonight, but this has proved to be a bug that I cannot understand.
Added -- Here is a truncated version of the kernel that produces the same error above on my machine.
every_pixel_hog_kernel_source = \
"""
#include <math.h>
#include <stdio.h>

__device__ int idx(int ii, int jj){
    return gridDim.x*blockDim.x*ii+jj;
}
__device__ int bin_number(float angle_val, int total_angles, int num_bins){

    float angle1;
    float min_dist;
    float this_dist;
    int bin_indx;

    angle1 = 0.0;
    min_dist = abs(angle_val - angle1);
    bin_indx = 0;

    for(int kk=1; kk < num_bins; kk++){
        angle1 = angle1 + float(total_angles)/float(num_bins);
        this_dist = abs(angle_val - angle1);
        if(this_dist < min_dist){
            min_dist = this_dist;
            bin_indx = kk;
        }
    }
    return bin_indx;
}
__device__ int hist_number(int ii, int jj){

    int hist_num = 0;

    if(jj >= 0 && jj < 11){
        if(ii >= 0 && ii < 11){
            hist_num = 0;
        }
        else if(ii >= 11 && ii < 22){
            hist_num = 3;
        }
        else if(ii >= 22 && ii < 33){
            hist_num = 6;
        }
    }
    else if(jj >= 11 && jj < 22){
        if(ii >= 0 && ii < 11){
            hist_num = 1;
        }
        else if(ii >= 11 && ii < 22){
            hist_num = 4;
        }
        else if(ii >= 22 && ii < 33){
            hist_num = 7;
        }
    }
    else if(jj >= 22 && jj < 33){
        if(ii >= 0 && ii < 11){
            hist_num = 2;
        }
        else if(ii >= 11 && ii < 22){
            hist_num = 5;
        }
        else if(ii >= 22 && ii < 33){
            hist_num = 8;
        }
    }
    return hist_num;
}
__global__ void every_pixel_hog_kernel(float* input_image, int im_width, int im_height, float* gaussian_array, float* x_gradient, float* y_gradient, float* gradient_mag, float* angles, float* output_array)
{
    /////
    // Setup the thread indices and linear offset.
    /////
    int i = blockDim.y * blockIdx.y + threadIdx.y;
    int j = blockDim.x * blockIdx.x + threadIdx.x;
    int ang_limit = 180;
    int ang_bins = 9;
    float pi_val = 3.141592653589f; //91

    /////
    // Compute a Gaussian smoothing of the current pixel and save it into a new image array
    // Use sync threads to make sure everyone does the Gaussian smoothing before moving on.
    /////
    if( j > 1 && i > 1 && j < im_width-2 && i < im_height-2 ){
        // Hard-coded unit standard deviation 5-by-5 Gaussian smoothing filter.
        gaussian_array[idx(i,j)] = float(1.0/273.0) *(
            input_image[idx(i-2,j-2)] + float(4.0)*input_image[idx(i-2,j-1)] + float(7.0)*input_image[idx(i-2,j)] + float(4.0)*input_image[idx(i-2,j+1)] + input_image[idx(i-2,j+2)] +
            float(4.0)*input_image[idx(i-1,j-2)] + float(16.0)*input_image[idx(i-1,j-1)] + float(26.0)*input_image[idx(i-1,j)] + float(16.0)*input_image[idx(i-1,j+1)] + float(4.0)*input_image[idx(i-1,j+2)] +
            float(7.0)*input_image[idx(i,j-2)] + float(26.0)*input_image[idx(i,j-1)] + float(41.0)*input_image[idx(i,j)] + float(26.0)*input_image[idx(i,j+1)] + float(7.0)*input_image[idx(i,j+2)] +
            float(4.0)*input_image[idx(i+1,j-2)] + float(16.0)*input_image[idx(i+1,j-1)] + float(26.0)*input_image[idx(i+1,j)] + float(16.0)*input_image[idx(i+1,j+1)] + float(4.0)*input_image[idx(i+1,j+2)] +
            input_image[idx(i+2,j-2)] + float(4.0)*input_image[idx(i+2,j-1)] + float(7.0)*input_image[idx(i+2,j)] + float(4.0)*input_image[idx(i+2,j+1)] + input_image[idx(i+2,j+2)]);
    }
    __syncthreads();
    /////
    // Compute the simple x and y gradients of the image and store these into new images
    // again using syncthreads before moving on.
    /////

    // X-gradient, ensure x is between 1 and width-1
    if( j > 0 && j < im_width){
        x_gradient[idx(i,j)] = float(input_image[idx(i,j)] - input_image[idx(i,j-1)]);
    }
    else if(j == 0){
        x_gradient[idx(i,j)] = float(0.0);
    }

    // Y-gradient, ensure y is between 1 and height-1
    if( i > 0 && i < im_height){
        y_gradient[idx(i,j)] = float(input_image[idx(i,j)] - input_image[idx(i-1,j)]);
    }
    else if(i == 0){
        y_gradient[idx(i,j)] = float(0.0);
    }
    __syncthreads();

    // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height.
    if( j < im_width && i < im_height){
        gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
    }
    __syncthreads();

    /////
    // Compute the orientation angles
    /////
    if( j < im_width && i < im_height){
        if(ang_limit == 360){
            angles[idx(i,j)] = float((atan2(y_gradient[idx(i,j)],x_gradient[idx(i,j)])+pi_val)*float(180.0)/pi_val);
        }
        else{
            angles[idx(i,j)] = float((atan( y_gradient[idx(i,j)]/x_gradient[idx(i,j)] )+(pi_val/float(2.0)))*float(180.0)/pi_val);
        }
    }
    __syncthreads();
    // Compute the HoG using the above arrays. Do so in a 3x3 grid, with 9 angle bins for each grid.
    // forming an 81-vector and then write this 81 vector as a row in the large output array.
    int top_bound, bot_bound, left_bound, right_bound, offset;
    int window = 32;

    if(i-window/2 > 0){
        top_bound = i-window/2;
        bot_bound = top_bound + window;
    }
    else{
        top_bound = 0;
        bot_bound = top_bound + window;
    }

    if(j-window/2 > 0){
        left_bound = j-window/2;
        right_bound = left_bound + window;
    }
    else{
        left_bound = 0;
        right_bound = left_bound + window;
    }

    if(bot_bound - im_height > 0){
        offset = bot_bound - im_height;
        top_bound = top_bound - offset;
        bot_bound = bot_bound - offset;
    }

    if(right_bound - im_width > 0){
        offset = right_bound - im_width;
        right_bound = right_bound - offset;
        left_bound = left_bound - offset;
    }

    int counter_i = 0;
    int counter_j = 0;
    int bin_indx, hist_indx, glob_col_indx, glob_row_indx;
    int row_width = 81;

    for(int pix_i = top_bound; pix_i < bot_bound; pix_i++){
        for(int pix_j = left_bound; pix_j < right_bound; pix_j++){
            bin_indx = bin_number(angles[idx(pix_i,pix_j)], ang_limit, ang_bins);
            hist_indx = hist_number(counter_i,counter_j);
            glob_col_indx = ang_bins*hist_indx + bin_indx;
            glob_row_indx = idx(i,j);
            output_array[glob_row_indx*row_width + glob_col_indx] = float(output_array[glob_row_indx*row_width + glob_col_indx] + float(gradient_mag[idx(pix_i,pix_j)]));
            counter_j = counter_j + 1;
        }
        counter_i = counter_i + 1;
        counter_j = 0;
    }
}
"""
Here's an unmistakable case of using doubles:
gaussian_array[idx(i,j)] = float(1.0/273.0) *
See the double literals being divided?
But really, use float literals instead of double literals cast to floats - the casts are ugly, and I suggest they will hide bugs like this.
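As a standalone illustration of the point about literals (hypothetical names, not code from the kernel above): mixing a float value with an unsuffixed literal promotes the whole expression to double precision, which is exactly what ptxas then has to demote on sm_1x hardware, whereas an f-suffixed literal keeps the arithmetic in single precision:

// Hypothetical example of the double vs. float literal distinction.
__global__ void literal_demo(const float* in, float* out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if(i >= n) return;

    float promoted = in[i] * 0.5;     // 0.5 is a double, so the multiply happens in double precision
    float kept     = in[i] * 0.5f;    // 0.5f keeps the multiply in single precision
    out[i] = promoted + kept;
}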
-------Edit 1/Dec---------
Firstly, thanks @CygnusX1, constant folding would prevent that calculation - I didn't even think of it.
I've tried to reproduce the environment of the error: I installed the CUDA SDK 3.2 (which @EMS has mentioned they seem to use in the lab) and compiled the truncated kernel version above. nvopencc did indeed optimize the above calculation away (thanks @CygnusX1), and it didn't use doubles anywhere in the generated PTX code. Further, ptxas didn't give the error received by @EMS. From that, I thought the problem is outside of the every_pixel_hog_kernel_source code itself, perhaps in PyCUDA. However, using PyCUDA 2011.1.2 and compiling with that still does not produce a warning like the one in @EMS's question. I can get the error in the question, but only by introducing a double calculation, such as removing the cast from gaussian_array[idx(i,j)] = float(1.0/273.0) *
To get to the same python case, does the following produce your error:
import pycuda.driver as cuda
from pycuda.compiler import compile
x=compile("""put your truncated kernel code here""",options=[],arch="sm_11",keep=True)
It doesn't produce an error in my circumstance, so there is a possibility I simply can't replicate your result.
However, I can give some advice. When using compile (or SourceModule), if you use keep=True, python will print out the folder where the ptx file is being generated just before showing the error message.
Then, examining the ptx file generated in that folder and looking for where .f64 appears should give some idea of what is being treated as a double. However, deciphering which code in your original kernel it corresponds to is difficult; having the simplest example that produces your error will help you.
Your problem is here:
angle1 = 0.0;
0.0 is a double precision constant. 0.0f is a single precision constant.
(A comment, not an answer, but it is too big to post as a comment.)
Could you provide the PTX code around the line where the error occurs?
I tried compiling a simple kernel using the code you provided:
__constant__ int im_width;
__constant__ int im_height;

__device__ int idx(int i,int j) {
    return i+j*im_width;
}

__global__ void kernel(float* gradient_mag, float* x_gradient, float* y_gradient) {
    int i = threadIdx.x;
    int j = threadIdx.y;

    // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height.
    if( j > 0 && j < im_width && i > 0 && i < im_height){
        gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
    }
}
using:
nvcc.exe -m32 -maxrregcount=32 -gencode=arch=compute_11,code=\"sm_11,compute_11\" --compile -o "Debug\main.cu.obj" main.cu
and got no errors (using the CUDA 4.1 beta compiler).
Update
I tried compiling your new code (I am working within CUDA/C++, not PyCUDA, but this shouldn't matter). Didn't catch the error either! Used CUDA 4.1 and CUDA 4.0.
What is your version of CUDA installation?
C:\>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Wed_Oct_19_23:13:02_PDT_2011
Cuda compilation tools, release 4.1, V0.2.1221