Number of threads in a block - cuda

I use x and y to compute the cells of a matrix on the device.
When lenA and lenB are larger than 32, the breakpoint (at int x = threadIdx.x; in the device code) is never hit and the output is incorrect.
In host code:
int lenA=52;
int lenB=52;
dim3 threadsPerBlock(lenA, lenB);
dim3 numBlocks(lenA / threadsPerBlock.x, lenB / threadsPerBlock.y);
kernel_matrix<<<numBlocks,threadsPerBlock>>>(dev_A, dev_B);
In device code:
int x= threadIdx.x;
int y= threadIdx.y;
...

Your threadsPerBlock dim3 variable must satisfy the limits of the compute capability that you are targeting.
CC 1.x devices can handle up to 512 threads per block.
CC 2.0 - 8.6 devices can handle up to 1024 threads per block.
A dim3 variable of (32,32) specifies 1024 (= 32 x 32) threads per block, which is exactly at the limit. With lenA = lenB = 52 you are requesting 52 x 52 = 2704 threads per block, which exceeds the limit, so the kernel launch fails.
If you did cuda error checking on your kernel launch, you would see the error.
Since the kernel doesn't actually launch with this type of error, any breakpoints set in the kernel code also won't be hit.
Additional notes:
You won't get any compilation error for threads per block, regardless of what you do. It doesn't work that way. The compiler doesn't check that.
If you do proper CUDA error checking you will get a runtime error report, and even if you don't do proper CUDA error checking, your kernel will not actually run with that sort of error.
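To make the point about error checking concrete, here is a minimal sketch. The kernel fill_cells and the helpers cudaOk and ceilDiv are hypothetical names invented for this illustration (they are not from the question), and the 16 x 16 block size is just one reasonable choice. The first launch repeats the oversized configuration and should report an error; the second uses a fixed block size and enough blocks to cover the matrix.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print a message if a CUDA call failed, return true on success.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Ceiling division: number of blocks of size `block` needed to cover n elements.
static unsigned int ceilDiv(unsigned int n, unsigned int block)
{
    return (n + block - 1) / block;
}

// Hypothetical kernel: one thread per cell, threads outside the matrix do nothing.
__global__ void fill_cells(float *out, int lenA, int lenB)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < lenA && y < lenB)
        out[y * lenA + x] = 1.0f;
}

int main()
{
    const int lenA = 52, lenB = 52;
    float *dev_out = nullptr;
    if (!cudaOk(cudaMalloc(&dev_out, (size_t)lenA * lenB * sizeof(float)), "cudaMalloc"))
        return 1;

    // 52 x 52 = 2704 threads in one block: over the per-block limit, so the
    // launch is rejected and the kernel never runs.
    fill_cells<<<dim3(1, 1), dim3(lenA, lenB)>>>(dev_out, lenA, lenB);
    cudaOk(cudaGetLastError(), "oversized launch");

    // 16 x 16 = 256 threads per block, with enough blocks to cover all cells.
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(ceilDiv(lenA, threadsPerBlock.x), ceilDiv(lenB, threadsPerBlock.y));
    fill_cells<<<numBlocks, threadsPerBlock>>>(dev_out, lenA, lenB);
    bool ok = cudaOk(cudaGetLastError(), "launch") &&
              cudaOk(cudaDeviceSynchronize(), "kernel execution");

    cudaFree(dev_out);
    return ok ? 0 : 1;
}
```

Note that once the grid has more than one block, the kernel has to combine blockIdx and threadIdx to get the cell index; threadIdx alone only covers the first block.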

Related

CUDA printf() crashes when large number of threads and blocks are launched

I'm using CUDA 6.5 + VS2013 + GTX Titan Black. I observe that the following printing code crashes when the total number of threads is larger than 65536. I googled a bit but haven't found anything useful. Does anyone else observe the same behaviour? Can anyone provide an explanation? Thank you very much!
#include <cstdio>
__global__ void testKernel(int val)
{
int X = blockDim.x * blockIdx.x + threadIdx.x;
int Y = blockDim.y * blockIdx.y + threadIdx.y;
printf("[%d, %d]:\t\tValue is:%d\n", X, Y, val);
}
int main(){
dim3 block(16,16);
dim3 grid(16,16);
testKernel<<<grid, block>>>(10);
cudaDeviceSynchronize();
cudaGetLastError();
cudaDeviceReset();
return 0;
}
And I got the following error message when I use block(32,16) and grid(16,16):
Gpu API call (the launch timed out and was terminated)...
Your kernel is taking too long to execute:
the launch timed out and was terminated
This is a limitation of the Windows operating system when running on WDDM devices.
There are a variety of workarounds possible. Some are:
reduce your kernel execution time
switch the GPU to TCC mode, if possible (not possible with GeForce GPUs)
extend the TDR timeout delay (or remove it) via Windows registry modification
Also, the in-kernel printf feature has significant limits. It's really not designed for large-scale output for a variety of reasons. One in particular is that the buffer for this activity is limited, and when overflowed, the previous buffer data will be lost (i.e. not printed out).
Thanks to Robert's answer, I realized that the problem might be due to the size of the buffer. I used the following code to find out that the default size of the printf buffer is 1048576 bytes (1 MiB):
size_t sz;
cudaDeviceGetLimit(&sz, cudaLimitPrintfFifoSize);
std::cout << sz << std::endl;
When I increase the buffer size to 100 MiB using the following code, the error disappears and I get all expected output, 131072 lines in total! (I use block(32,16); ... grid(16,16); ...)
sz = 1048576 * 100;
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, sz);
Somehow, overflowing the printf buffer causes a longer response time than usual and triggers a TDR. When I increase the buffer size accordingly, the code manages to finish before the timeout. More importantly, a sufficient buffer size ensures that no data is lost.
However, I think the upper bound on the buffer size and the execution time depend on the device. That it works well on a Titan Black does not necessarily mean it also works on other NVIDIA cards. Again, I agree with Robert that using printf to export large amounts of data from CUDA kernels is unreliable in practice. I just use it to dump some info while debugging the kernel.

How to properly add in global memory in CUDA?

I'm trying to implement sum of absolute differences in CUDA for a homework assignment, but am having trouble getting correct results.
I am given a Blocksize that represents X and Y size (in pixels) of a square portion of the images I am given to compare. I am also given two images in YUV format. Below are the portions of the program I have to implement: the kernel that calculates the SAD and the setup for the size of the grid/blocks of threads. The rest of the program is provided, and can be assumed to be correct.
Here I'm getting the x and y index of the current thread and using those to get the pixel in the image arrays I'm dealing with in the current thread. Then I calculate the absolute difference, wait for all the threads to finish calculating that, then if the current thread is within the block in the image we care about the absolute difference is added to the sum in global memory with an atomicAdd to avoid a collision during write.
__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;
int id = idx * cuda_Blocksize + idy;
int AD = abs( cuda_curBlock[id] - cuda_refBlock[id] );
__syncthreads();
if( idx < cuda_Blocksize && idy < cuda_Blocksize ) {
atomicAdd( cuda_SAD, AD );
}
}
And this is how I'm setting up the grid and blocks for the kernel:
int grid_sizeX = Blocksize/2;
int grid_sizeY = Blocksize/2;
int block_sizeX = Blocksize/4;
int block_sizeY = Blocksize/4;
dim3 blocksInGrid(grid_sizeX, grid_sizeY);
dim3 threadsInBlock(block_sizeX, block_sizeY);
The given program calculates the SAD on the CPU as well and compares our result from the GPU with that one to check for correctness. Valid block sizes within the image are from 1-1000. My solution above is getting correct results from 10-91, but anything above 91 just returns 0 for the sum. What am I doing wrong?
Your grid and block size settings look odd.
For image pixels, we usually use settings similar to the following.
int imageROISize=1000;
dim3 threadInBlock(16,16);
dim3 blocksInGrid((imageROISize+15)/16, (imageROISize+15)/16);
You could refer to the following section in cuda programming guide for more information on how to distribute workloads to CUDA threads.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
You really should show all the code and identify the GPU you are running on. At least the portion that calls the kernel and allocates data for GPU use.
Are you doing proper cuda error checking on all cuda API calls and kernel calls?
Probably your kernel is not running at all because your threadsInBlock parameter is exceeding 512 threads total. You indicate that at Blocksize = 92 and above, things are not working. Let's do the math:
92/4 = 23 threads in X and Y dimensions
23 * 23 = 529 total threads requested per threadblock
529 exceeds 512 which is the limit for cc 1.x devices, so I'm guessing you're running on a cc 1.x device, and therefore your kernel launch is failing, so your kernel is not running, and so you get no computed results (i.e. 0). Note that at 91/4 = 22 threads in X and Y dimensions, you are requesting 484 total threads which does not exceed the 512 limit for cc 1.x devices.
If you were doing proper cuda error checking, the error report would have focused your attention on the cuda kernel launch failing due to incorrect launch parameters.
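Putting the two answers together, one way to restructure the launch is sketched below. This is an illustration only, not the assignment's reference solution: the typedefs for pixel and uint32, the names sadKernel and blocksFor, and the synthetic test data are all assumptions made for this example. It uses a fixed 16 x 16 block (256 threads, within the limit of any compute capability), derives the grid from Blocksize by rounding up, and does the bounds check before touching memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

typedef unsigned char pixel;   // assumption: one byte per sample
typedef unsigned int uint32;   // assumption: 32-bit unsigned accumulator

// Number of tiles of width `tile` needed to cover n pixels (rounds up).
static unsigned int blocksFor(unsigned int n, unsigned int tile)
{
    return (n + tile - 1) / tile;
}

// Hypothetical kernel: one thread per pixel; out-of-range threads do nothing.
__global__ void sadKernel(const pixel *cur, const pixel *ref, uint32 *sad, uint32 blocksize)
{
    uint32 idx = blockIdx.x * blockDim.x + threadIdx.x;
    uint32 idy = blockIdx.y * blockDim.y + threadIdx.y;
    if (idx < blocksize && idy < blocksize) {
        uint32 id = idy * blocksize + idx;
        int diff = (int)cur[id] - (int)ref[id];
        atomicAdd(sad, (uint32)abs(diff));
    }
}

int main()
{
    const uint32 blocksize = 92;   // a size that failed in the question
    const size_t n = (size_t)blocksize * blocksize;

    // Synthetic data so the GPU result can be compared with a CPU sum.
    pixel *hCur = (pixel *)malloc(n);
    pixel *hRef = (pixel *)malloc(n);
    uint32 expected = 0;
    for (size_t i = 0; i < n; ++i) {
        hCur[i] = (pixel)(i % 251);
        hRef[i] = (pixel)((i * 7) % 253);
        expected += (uint32)abs((int)hCur[i] - (int)hRef[i]);
    }

    // Return codes of the allocation and copy calls are ignored here for brevity.
    pixel *dCur = nullptr, *dRef = nullptr;
    uint32 *dSad = nullptr;
    cudaMalloc(&dCur, n);
    cudaMalloc(&dRef, n);
    cudaMalloc(&dSad, sizeof(uint32));
    cudaMemcpy(dCur, hCur, n, cudaMemcpyHostToDevice);
    cudaMemcpy(dRef, hRef, n, cudaMemcpyHostToDevice);
    cudaMemset(dSad, 0, sizeof(uint32));

    dim3 threadsInBlock(16, 16);
    dim3 blocksInGrid(blocksFor(blocksize, 16), blocksFor(blocksize, 16));
    sadKernel<<<blocksInGrid, threadsInBlock>>>(dCur, dRef, dSad, blocksize);
    cudaError_t err = cudaGetLastError();              // launch configuration errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    uint32 got = 0;
    cudaMemcpy(&got, dSad, sizeof(uint32), cudaMemcpyDeviceToHost);
    std::printf("GPU SAD = %u, CPU SAD = %u\n", got, expected);

    cudaFree(dCur); cudaFree(dRef); cudaFree(dSad);
    free(hCur); free(hRef);
    return got == expected ? 0 : 1;
}
```

A single atomicAdd per pixel is simple but serializes on one address; a per-block shared memory reduction followed by one atomicAdd per block would likely be faster, at the cost of more code.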

CUDA summation reduction puzzle

Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles make use of the declaration extern __shared__ temp[]. The keyword extern is used in C when a declaration is made but the allocation takes place "elsewhere" (e.g. in another C file context in general). What is the relevance of extern here? Why don't we use:
__shared__ float temp[N/2];
for instance? Or why don't we declare temp to be a global variable, e.g.
#define N 1024
__shared__ float temp[N/2];
__global__ void sum(float *sum, float *data){ ... }
int main(){
...
sum<<<M,L>>>(sum, data);
}
I have yet another question: how many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).
Note: You can find information about my devices here.
The answer to the first question is that CUDA supports dynamic shared memory allocation at runtime (see this SO question and the documentation for more details). The declaration of shared memory using extern denotes to the compiler that shared memory size will be determined at kernel launch, passed in bytes as an argument to the <<< >>> syntax (or equivalently via an API function), something like:
sum<<< gridsize, blocksize, sharedmem_size >>>(....);
For the second question, the usual approach is to launch the number of blocks that will completely fill all the streaming multiprocessors on your GPU. Most sensibly written reduction kernels will accumulate many values per thread and then perform a shared memory reduction. The reduction requires that the number of threads per block be a power of two. That usually gives you 32, 64, 128, 256, 512 (or 1024 if you have a Fermi or Kepler GPU). It is a very finite search space, so just benchmark to see what works best on your hardware. You can find a more general discussion about block and grid sizing here and here.
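A minimal sketch of both points follows, under stated assumptions: the kernel name blockSum, the helper isPowerOfTwo, the 64 x 256 launch shape, and the all-ones input are invented for this illustration and are not taken from the linked tutorials. The extern __shared__ array gets its size from the third launch parameter, each thread first accumulates several values in a grid-stride loop, and the per-block partial sums are added up on the host.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// True if v is a nonzero power of two (the tree reduction below relies on this).
static bool isPowerOfTwo(unsigned int v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

// Hypothetical kernel: writes one partial sum per block.
__global__ void blockSum(const float *data, float *partial, int n)
{
    extern __shared__ float temp[];   // size is supplied at launch time, in bytes
    unsigned int tid = threadIdx.x;

    // Each thread accumulates many values before the shared memory stage.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += blockDim.x * gridDim.x)
        acc += data[i];
    temp[tid] = acc;
    __syncthreads();

    // Tree reduction in shared memory; requires blockDim.x to be a power of two.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            temp[tid] += temp[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = temp[0];
}

int main()
{
    const int n = 1 << 20;
    const unsigned int threads = 256;  // assumption: a starting point to benchmark from
    const unsigned int blocks = 64;    // assumption: tune to fill the SMs on your GPU
    if (!isPowerOfTwo(threads))
        return 1;

    // Return codes of the allocation and copy calls are ignored here for brevity.
    std::vector<float> host(n, 1.0f);
    float *dData = nullptr, *dPartial = nullptr;
    cudaMalloc(&dData, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dData, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    size_t sharedBytes = threads * sizeof(float);
    blockSum<<<blocks, threads, sharedBytes>>>(dData, dPartial, n);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "blockSum failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    double total = 0.0;
    for (float p : partial)
        total += p;
    std::printf("sum = %.0f (n = %d)\n", total, n);

    cudaFree(dData);
    cudaFree(dPartial);
    return 0;
}
```

Because the shared array size is a launch parameter, the same kernel can be benchmarked with 128, 256, or 512 threads per block without recompiling, which is harder with a fixed-size __shared__ float temp[N/2] declaration.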

Maximum number of threads for a CUDA kernel on Tesla M2050

I am testing what the maximum number of threads is for a simple kernel. I find that the total number of threads cannot exceed 4096. The code is as follows:
#include <stdio.h>
#define N 100
__global__ void test(){
printf("%d %d\n", blockIdx.x, threadIdx.x);
}
int main(void){
double *p;
size_t size=N*sizeof(double);
cudaMalloc(&p, size);
test<<<64,128>>>();
//test<<<128,64>>>();
cudaFree(p);
return 0;
}
My test environment: CUDA 4.2.9 on Tesla M2050. The code is compiled with
nvcc -arch=sm_20 test.cu
While checking the output, I found that some combinations are missing. Running the command
./a.out|wc -l
I always get 4096. When I check cc2.0, I can only find that the maximum number of blocks for the x,y,z dimensions is (1024,1024,512) and the maximum number of threads per block is 1024. The calls to the kernel (either <<<64,128>>> or <<<128,64>>>) are well within the limits. Any idea?
NB: The CUDA memory operations are there to block the code so that the output from the kernel will be shown.
You are abusing kernel printf, and using it to judge how many threads you can run is a completely nonsensical idea. The runtime has a limited buffer size for printf output, and you are simply overflowing it with output when you run enough threads. There is an API to query and set the printf buffer size, using cudaDeviceGetLimit and cudaDeviceSetLimit (thanks to Robert Crovella for the link to the printf documentation in comments).
You can find the maximum number of threads a given kernel can run by looking here in the documentation.
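As an alternative to looking the numbers up, the limits can also be queried from the device at runtime. The sketch below is one way to do that; the helper launchFits is a hypothetical name for this example and only checks the total thread count, not the per-dimension limits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: does an x * y * z block stay within the per-block thread limit?
static bool launchFits(int maxThreadsPerBlock, unsigned int x, unsigned int y, unsigned int z)
{
    return (unsigned long long)x * y * z <= (unsigned long long)maxThreadsPerBlock;
}

int main()
{
    const int device = 0;   // assumption: query the first GPU
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    std::printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    std::printf("max block dimensions:  %d x %d x %d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("max grid dimensions:   %d x %d x %d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);

    // The in-kernel printf buffer size is a separate, adjustable limit.
    size_t fifoBytes = 0;
    err = cudaDeviceGetLimit(&fifoBytes, cudaLimitPrintfFifoSize);
    if (err == cudaSuccess)
        std::printf("printf FIFO size:      %zu bytes\n", fifoBytes);

    // Example check for a 128-thread block such as the <<<64,128>>> launch.
    std::printf("128-thread block fits: %s\n",
                launchFits(prop.maxThreadsPerBlock, 128, 1, 1) ? "yes" : "no");
    return 0;
}
```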

Memory Error in CUDA Program for Fermi GPU

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between, -128 and 127.) I read these into a char4 array, and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unqualified unspecified launch failure, or the same message along with User Stack Overflow or Breakpoint Hit or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768
int main(int argc, char *argv[])
{
char4 *pc4Buf_h = NULL;
char4 *pc4Buf_d = NULL;
float2 *pf2InX_d = NULL;
float2 *pf2InY_d = NULL;
dim3 dimBCopy(1, 1, 1);
dim3 dimGCopy(1, 1);
...
/* i do check for errors in the actual code */
pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
(void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
(void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
(void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
...
dimBCopy.x = 1024; /* number of threads in a block, for my GPU */
dimGCopy.x = N / 1024;
CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
pf2InX_d,
pf2InY_d);
...
}
__global__ void CopyDataForFFT(char4 *pc4Data,
float2 *pf2FFTInX,
float2 *pf2FFTInY)
{
int i = (blockIdx.x * blockDim.x) + threadIdx.x;
pf2FFTInX[i].x = (float) pc4Data[i].x;
pf2FFTInX[i].y = (float) pc4Data[i].y;
pf2FFTInY[i].x = (float) pc4Data[i].z;
pf2FFTInY[i].y = (float) pc4Data[i].w;
return;
}
One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one from the first two (pf2FFTInX), and another from the second two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and uses 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc. but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and memory errors disappeared for N = 32768. But at N = 65536, I still get errors. <-- This didn't last long. Still getting errors.
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.
The problem was a bad GPU card (see the comments). [I'm adding this answer to remove the question from the unanswered list and make it more useful.]