How to avoid using number of threads exceeding the maximum allowed on GPU? - cuda

As described in a previous post:
how to find the number of maximum available threads in CUDA?
I found that the maximum number of threads on my GPU card is 21504. However, when I assigned more than that number to the kernel, everything ran smoothly.
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy()
{
}

int main()
{
    //int N=21504;
    int N = 21504*40;
    dummy<<<1,N>>>();
    return 0;
}
I don't know what happened, but I believe this should be avoided, and I'm not sure how to do it.

Your example did not run correctly. It only appeared to run correctly because you did not check the CUDA error status after the kernel launch.
The comment I made on your other question also applies here:
The maximum number of threads per multiprocessor is the upper limit to how many threads can be "in flight" at the same time. Other limiting factors will normally limit the number further. This value does not affect how many threads can be launched at the same time and it is not very useful for finding out the number of threads needed for optimal performance.
Your card is a compute capability 2.0 device. See the Features and Technical Specifications section in the CUDA Programming Guide for details on the limitations of your device. In particular, your device is limited to 1024 threads per block and to a grid size of 65535 in each of the X, Y and Z dimensions. You attempted to launch a single block of 21504*40 threads, which far exceeds the per-block limit, so the launch fails with a configuration error that goes unnoticed unless you check for it.
So, in theory, you can launch up to 65535 * 65535 * 65535 blocks in a single kernel launch, with up to 1024 threads in each block.
There is no performance penalty to launching kernels with many more threads than the maximum number of resident threads your device supports.
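For reference, here is a minimal sketch of what that error checking could look like when added to the example from the question; with a single block of 21504*40 threads the launch is expected to be rejected with a configuration error:
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummy()
{
}

int main()
{
    int N = 21504*40;
    dummy<<<1,N>>>();                      // exceeds the threads-per-block limit

    // A configuration error is reported by the launch itself...
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // ...while errors during kernel execution are reported when synchronizing.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    return 0;
}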

Related

Maximum number of resident blocks per SM?

It seems that there is a maximum number of resident blocks allowed per SM. But while other "hard" limits are easily found (via, for example, cudaGetDeviceProperties), a maximum number of resident blocks doesn't seem to be widely documented.
In the following sample code, I configure the kernel with one thread per block. To test the hypothesis that this GPU (a P100) has a maximum of 32 resident blocks per SM, I create a grid of 56*32 blocks (56 = number of SMs on the P100). Each block takes 1 second to execute (via a "sleep" routine), so if I have configured the kernel correctly, the code should take about 1 second. The timing results confirm this. Configuring with 32*56+1 blocks takes 2 seconds, suggesting that 32 resident blocks per SM is the maximum allowed.
What I wonder is, why isn't this limit made more widely available? For example, it doesn't show up in cudaGetDeviceProperties. Where can I find this limit for various GPUs? Or maybe this isn't a real limit, but is derived from other hard limits?
I am running CUDA 10.1
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)tp.tv_sec + (double)tp.tv_usec*1e-6;
}

#define CLOCK_RATE 1328500 /* Modify from below */

__device__ void sleep(float t) {
    clock_t t0 = clock64();
    clock_t t1 = t0;
    while ((t1 - t0)/(CLOCK_RATE*1000.0f) < t)
        t1 = clock64();
}

__global__ void mykernel() {
    sleep(1.0);
}

int main(int argc, char* argv[]) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int mp = prop.multiProcessorCount;
    //clock_t clock_rate = prop.clockRate;

    int num_blocks = atoi(argv[1]);
    dim3 block(1);
    dim3 grid(num_blocks); /* N blocks */

    double start = cpuSecond();
    mykernel<<<grid,block>>>();
    cudaDeviceSynchronize();
    double etime = cpuSecond() - start;

    printf("mp %10d\n", mp);
    printf("blocks/SM %10.2f\n", num_blocks/((double)mp));
    printf("time %10.2f\n", etime);

    cudaDeviceReset();
}
Results :
% srun -p gpuq sm_short 1792
mp 56
blocks/SM 32.00
time 1.16
% srun -p gpuq sm_short 1793
mp 56
blocks/SM 32.02
time 2.16
% srun -p gpuq sm_short 3584
mp 56
blocks/SM 64.00
time 2.16
% srun -p gpuq sm_short 3585
mp 56
blocks/SM 64.02
time 3.16
Yes, there is a limit to the number of blocks per SM. The maximum number of blocks that can be contained in an SM refers to the maximum number of active (resident) blocks at any given time. Blocks can be organized into one-, two- or three-dimensional grids of up to 65,535 blocks in the y and z dimensions (and up to 2^31 - 1 in the x dimension on your device), but the SMs of your GPU will only be able to accommodate a certain number of blocks at once. This limit is linked in two ways to the compute capability of your GPU.
Hardware limit stated by CUDA.
Each GPU allows a maximum number of blocks per SM, regardless of the number of threads each block contains and the amount of resources it uses. For example, a GPU with compute capability 2.0 has a limit of 8 blocks/SM, while one with compute capability 7.0 has a limit of 32 blocks/SM. This is the best number of active blocks per SM that you can achieve: let's call it MAX_BLOCKS.
Limit derived from the amount of resources used by each block.
A block is made up of threads, and each thread uses a certain number of registers: the more registers it uses, the greater the number of resources needed by the block that contains it. Similarly, the amount of shared memory assigned to a block increases the amount of resources the block needs to be allocated. Once a certain value is exceeded, the number of resources needed per block becomes so large that the SM will not be able to host as many blocks as MAX_BLOCKS allows: in that case, the resources needed by each block are what limits the maximum number of active blocks per SM.
How do I find these boundaries?
NVIDIA thought about that too. The CUDA Occupancy Calculator spreadsheet, available on their site, lets you look up the hardware limits grouped by compute capability. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and important information about the number of active blocks.
The first tab of the linked file allows you to calculate the actual use of each SM based on the resources used. If you want to know how many registers per thread you use, add the -Xptxas -v option so that the compiler reports the register usage of each kernel when it builds the device code.
In the last tab of the file you will find the hardware limits grouped by compute capability.
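If you prefer to query this limit programmatically, the occupancy API in the CUDA runtime (available since CUDA 6.5) combines the hardware cap with the resource-derived cap for a concrete kernel. A minimal standalone sketch; the empty probe kernel is only a stand-in, and you could pass the mykernel from the question with a block size of 1 instead:
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void probe() { }   // trivial stand-in kernel that uses almost no resources

int main()
{
    int blocksPerSM = 0;
    // Maximum resident blocks of this kernel per SM, for a block size of
    // 1 thread and no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, probe, 1, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("resident blocks per SM      : %d\n", blocksPerSM);
    printf("SM count                    : %d\n", prop.multiProcessorCount);
    printf("resident blocks, whole GPU  : %d\n", blocksPerSM * prop.multiProcessorCount);
    return 0;
}
On a P100 this should report 32 blocks per SM for such a trivial configuration, matching the timing experiment above.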

How cuda handle __syncthreads() in kernel?

Say I have a block of size 1024 and assume my GPU has 192 CUDA cores.
How does CUDA handle __syncthreads() in a kernel when the number of CUDA cores is lower than the block size?
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[1024];
    int t = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();   // wait until every thread has written its element of s
    d[t] = s[tr];
}
How does 'tr' remain in local memory?
I think you are mixing up a few things.
First of all, a GPU having 192 CUDA cores means that is its total core count. Each block, however, maps to a single Streaming Multiprocessor (SM), which may have a lower core count (depending on the GPU generation).
Let us assume that you own a Pascal GPU, which has 64 cores per SM, and that you have 3 SMs.
A single block maps to a single SM. So you will have 64 cores handling 1024 threads concurrently. Such an SM has enough registers to hold all the necessary data for 1024 threads, but it has only 64 cores, which quickly swap which threads they are handling.
This way all the local data, e.g. tr, can remain in memory.
Now, because of this quick swapping and concurrent execution, it may happen -- completely by accident -- that some threads get ahead of others. If you want to ensure that at a certain point all threads are at the same spot, you use __syncthreads(). All that function does is instruct the scheduler to assign work to the CUDA cores in such a way that every thread of the block has reached that spot in the program before any of them proceeds past it.
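To make this concrete, here is a minimal host-side sketch (sizes chosen for illustration; it assumes the staticReverse kernel from the question is defined in the same file) that launches a single 1024-thread block, which the SM's cores then time-slice:
#include <stdio.h>
#include <cuda_runtime.h>

// Assumes the staticReverse kernel shown in the question above.

int main()
{
    const int n = 1024;                 // must not exceed the static size of s[]
    int h[n], *d;
    for (int i = 0; i < n; i++) h[i] = i;

    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // One block of 1024 threads; __syncthreads() guarantees all writes to s[]
    // have finished before any thread reads s[tr].
    staticReverse<<<1, n>>>(d, n);

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h[0] = %d, h[%d] = %d\n", h[0], n - 1, h[n - 1]);   // expect 1023 and 0
    cudaFree(d);
    return 0;
}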

Should I check the number of threads in kernel code?

I am a beginner with CUDA, and my coworkers always design kernels with the following wrapping:
__global__ void myKernel(int nbThreads)
{
    int threadId = blockDim.x*blockIdx.y*gridDim.x  // rows preceding current row in grid
                 + blockDim.x*blockIdx.x            // blocks preceding current block
                 + threadIdx.x;
    if (threadId < nbThreads)
    {
        statement();
        statement();
        statement();
    }
}
They think there are some situations where CUDA might launch more threads than specified, for alignment/warp-size reasons, so we need to check it every time.
However, I've seen no example kernel on the internet so far where they actually do this verification.
Can CUDA actually launch more threads than specified block/grid dimensions?
CUDA will not launch more threads than what are specified by the block/grid dimensions.
However, due to the granularity of block dimensions (e.g. it's desirable for block dimensions to be a multiple of 32, and the block size is limited to 1024 or 512 threads depending on the device), it is frequently difficult to make the grid of threads numerically equal to the desired problem size.
In these cases, the typical behavior is to launch more threads, effectively rounding up to the next even size based on the block granularity, and use the "thread check" code in the kernel to make sure that the "extra threads", i.e. those beyond the problem size, don't do anything.
In your example, this could be clarified by writing:
__global__ void myKernel(int problem_size)
if (threadId < problem_size)
which communicates what is intended, that only threads corresponding to the problem size (which may not match the launched grid size) do any actual work.
As a very simple example, suppose I wanted to do a vector add, on a vector whose length was 10000 elements. 10000 is not a multiple of 32, nor is it less than 1024, so in a typical implementation I would launch multiple threadblocks to do the work.
If I want each threadblock to be a multiple of 32, there is no number of threadblocks that I can choose which will give me exactly 10000 threads. Therefore, I might choose 256 threads in a threadblock, and launch 40 threadblocks, giving me 10240 threads total. Using the thread check, I prevent the "extra" 240 threads from doing anything.
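Here is a minimal sketch of that vector-add example (illustrative names; explicit device copies are used so it also works on older devices), showing the rounded-up grid size and the thread check together:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // thread check: the 240 extra threads do nothing
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 10000;             // problem size, not a multiple of 32
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;   // rounds up to 40 blocks = 10240 threads
    const size_t bytes = n * sizeof(float);

    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f, c[%d] = %f\n", h_c[0], n - 1, h_c[n - 1]);   // expect 3.0 for both
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}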

Empirically determining how many threads are in a warp

Is it possible to write a CUDA kernel that shows how many threads are in a warp without using any of the warp related CUDA device functions and without using benchmarking? If so, how?
Since you indicated a solution with atomics would be interesting, I advance this as something that I believe gives an answer, but I'm not sure it is necessarily the answer you are looking for. I acknowledge it is somewhat statistical in nature. I provide this merely because I found the question interesting. I don't suggest that it is the "right" answer, and I suspect someone clever will come up with a "better" answer. This may provide some ideas, however.
In order to avoid using anything that explicitly references warps, I believe it is necessary to focus on "implicit" warp-synchronous behavior. I initially went down a path thinking about how to use an if-then-else construct, (which has some warp-synchronous implications) but struggled with that and came up with this approach instead:
#include <stdio.h>
#define LOOPS 100000

__device__ volatile int test2 = 0;
__device__ int test3 = 32767;

__global__ void kernel(){
    for (int i = 0; i < LOOPS; i++){
        unsigned long time = clock64();
        // while (clock64() < (time + (threadIdx.x * 1000)));
        int start = test2;
        atomicAdd((int *)&test2, 1);
        int end = test2;
        int diff = end - start;
        atomicMin(&test3, diff);
    }
}

int main() {
    kernel<<<1, 1024>>>();
    int result;
    cudaMemcpyFromSymbol(&result, test3, sizeof(int));
    printf("result = %d threads\n", result);
    return 0;
}
I compile with:
nvcc -O3 -arch=sm_20 -o t331 t331.cu
I call it "statistical" because it requires a large number of iterations (LOOPS) to produce a correct estimate (32). As the iteration count is decreased, the "estimate" increases.
We can apply additional warp-synchronous leverage by uncommenting the line that is commented out in the kernel. For my test case*, with that line uncommented, the estimate is correct even when LOOPS = 1
*my test case is CUDA 5, Quadro5000, RHEL 5.5
Here are several easy solutions. There are other solutions that use warp synchronous programming; however, many of the solutions will not work across all devices.
SOLUTION 1: Launch one or more blocks with max threads per block, read the special registers %smid and %warpid, and blockIdx and write values to memory. Group data by the three variables to find the warp size. This is even easier if you limit the launch to a single block then you only need %warpid.
SOLUTION 2: Launch one block with max threads per block and read the special register %clock. This requires the following assumptions which can be shown to be true on CC 1.0-3.5 devices:
%clock is defined as an unsigned 32-bit read-only cycle counter that wraps silently and updates every issue cycle
all threads in a warp read the same value for %clock
due to warp launch latency and instruction fetch, warps on the same SM but different warp schedulers cannot issue the first instruction of a warp on the same cycle
On CC 1.0 - 3.5 devices (this may change in the future), all threads in the block that read the same value for %clock therefore belong to the same warp, so grouping threads by %clock value reveals the warp size.
SOLUTION 3: Use Nsight VSE or cuda-gdb debugger. The warp state views show you sufficient information to determine the warp size. It is also possible to single step and see the change to the PC address for each thread.
SOLUTION 4: Use Nsight VSE, Visual Profiler, nvprof, etc. Launch kernels of 1 block with increasing thread count per launch. Determine the thread count that causes warps_launched to go from 1 to 2.
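For illustration, here is a hypothetical sketch along the lines of SOLUTION 1, restricted to a single block: each thread records the %warpid special register via inline PTX, and the host divides the thread count by the number of distinct warp slots seen (this assumes the slot assignment stays fixed during such a trivial kernel):
#include <stdio.h>
#include <cuda_runtime.h>

// Each thread records the hardware warp slot (%warpid) it runs in.
__global__ void tagWarps(unsigned *warpid)
{
    unsigned w;
    asm volatile("mov.u32 %0, %%warpid;" : "=r"(w));
    warpid[threadIdx.x] = w;
}

int main()
{
    const int N = 1024;                        // one full block
    unsigned *d_w, h_w[N];
    cudaMalloc(&d_w, N * sizeof(unsigned));
    tagWarps<<<1, N>>>(d_w);
    cudaMemcpy(h_w, d_w, N * sizeof(unsigned), cudaMemcpyDeviceToHost);

    // Count distinct warp slots; N divided by that count estimates the warp size.
    int distinct = 0;
    for (int i = 0; i < N; i++) {
        bool seen = false;
        for (int j = 0; j < i; j++)
            if (h_w[j] == h_w[i]) { seen = true; break; }
        if (!seen) distinct++;
    }
    printf("estimated warp size = %d\n", N / distinct);
    cudaFree(d_w);
    return 0;
}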

What CUDA shared memory size means

I am trying to solve this problem myself but I can't, so I want to get your advice.
I am writing kernel code like this. The graphics card is a GTX 580.
xxxx <<< blockNum, threadNum, SharedSize >>> (... threadNum ...)
(note: SharedSize is set to 2*threadNum)
__global__ void xxxx(..., int threadNum, ...)
{
    extern __shared__ int shared[];
    int* sub_arr = &shared[0];
    int* sub_numCounting = &shared[threadNum];
    ...
}
My program creates about 1085 blocks with 1024 threads per block.
(I am trying to handle a huge array.)
So the size of shared memory per block is 8192 (1024*2*4) bytes, right?
Using cudaDeviceProp I figured out that I can use a maximum of 49152 bytes of shared memory per block on the GTX 580.
And I know the GTX 580 has 16 multiprocessors, and that a thread block runs on a multiprocessor.
But my program produces an error (even though 8192 bytes < 49152 bytes).
I use printf in the kernel to see whether it runs correctly, but several blocks do not run. (Although I create 1085 blocks, only 50~100 blocks actually run.)
And I want to know whether blocks that run on the same multiprocessor share the same shared memory address or not. (If not, is separate memory allocated for each block's shared memory?)
I can't quite understand what the maximum size of shared memory per block means.
Please give me advice.
Yes, blocks resident on the same multiprocessor share that multiprocessor's shared memory, which is 48 KB per multiprocessor for your GPU card (compute capability 2.0). So if you have N blocks resident on the same multiprocessor at the same time, the maximum usable shared memory per block is (48/N) KB.
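To see the relevant numbers for a given device, here is a minimal sketch querying the device properties (the sharedMemPerMultiprocessor field assumes a reasonably recent CUDA runtime):
#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("shared memory per block          : %zu bytes\n", prop.sharedMemPerBlock);
    printf("shared memory per multiprocessor : %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("multiprocessor count             : %d\n", prop.multiProcessorCount);
    return 0;
}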