What does CUDA shared memory size mean? - cuda

I have been trying to solve this problem myself, but I can't,
so I would like your advice.
I am writing kernel code like this. The GPU is a GTX 580.
xxxx <<< blockNum, threadNum, SharedSize >>> (... threadNum ...)
(Note: SharedSize is set to 2*threadNum.)
__global__ void xxxx(..., int threadNum, ...)
{
extern __shared__ int shared[];
int* sub_arr = &shared[0];
int* sub_numCounting = &shared[threadNum];
...
}
My program creates about 1085 blocks with 1024 threads per block.
(I am trying to handle a huge array.)
So the size of shared memory per block is 8192 bytes (1024*2*4), right?
Using cudaDeviceProp, I found that I can use at most 49152 bytes of shared memory per block on the GTX 580.
I also know that the GTX 580 has 16 multiprocessors, and that thread blocks are executed on a multiprocessor.
But my program fails with an error, even though 8192 bytes < 49152 bytes.
I use printf in the kernel to see whether it runs correctly, but many blocks do not run. (Although I create 1085 blocks, only 50~100 blocks actually run.)
I also want to know whether blocks that run on the same multiprocessor share the same shared memory address or not. (If not, is separate memory allocated for each block's shared memory?)
I don't really understand what the maximum size of shared memory per block means.
Please give me some advice.

Yes, blocks on the same multiprocessor share the same pool of shared memory, which is 48 KB per multiprocessor for your GPU (compute capability 2.0). Each resident block gets its own region of that pool, so if you have N blocks resident on the same multiprocessor, the maximum size of shared memory per block is (48/N) KB.
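One thing worth checking in the question's launch is the unit of the third launch argument: it is a byte count, not an element count. The sketch below is only an illustration, assuming that SharedSize was passed as 2*threadNum elements rather than bytes; the names splitShared, sharedBytesFor and launchSplitShared are hypothetical and not from the original post. It also checks the CUDA error status, which would report a failed launch instead of leaving blocks silently unexecuted.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel mirroring the question's layout: two int arrays of
// threadNum elements each, carved out of one dynamic shared allocation.
__global__ void splitShared(int *out, int threadNum)
{
    extern __shared__ int shared[];
    int *sub_arr = &shared[0];
    int *sub_numCounting = &shared[threadNum];

    int t = threadIdx.x;
    sub_arr[t] = t;
    sub_numCounting[t] = 1;
    __syncthreads();
    out[blockIdx.x * threadNum + t] = sub_arr[t] + sub_numCounting[t];
}

// Bytes needed per block: the third launch argument is in bytes, not elements.
size_t sharedBytesFor(int threadNum)
{
    return 2 * static_cast<size_t>(threadNum) * sizeof(int);
}

// Launches the kernel and reports any launch or execution error.
// d_out must point to blockNum * threadNum ints on the device.
cudaError_t launchSplitShared(int *d_out, int blockNum, int threadNum)
{
    splitShared<<<blockNum, threadNum, sharedBytesFor(threadNum)>>>(d_out, threadNum);
    cudaError_t err = cudaGetLastError();      // errors detected at launch time
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();         // errors raised while the kernel runs
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

With threadNum = 1024 and 4-byte ints, sharedBytesFor returns 8192, the per-block figure computed in the question.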

Related

Maximum number of resident blocks per SM?

It seems that there is a maximum number of resident blocks allowed per SM. But while other "hard" limits are easily found (via, for example, `cudaGetDeviceProperties`), a maximum number of resident blocks doesn't seem to be widely documented.
In the following sample code, I configure the kernel with one thread per block. To test the hypothesis that this GPU (a P100) has a maximum of 32 resident blocks per SM, I create a grid of 56*32 blocks (56 = number of SMs on the P100). Each block takes 1 second to run (via a "sleep" routine), so if I have configured the kernel correctly, the code should take 1 second. The timing results confirm this. Configuring with 32*56+1 blocks takes 2 seconds, suggesting that 32 blocks per SM is the maximum allowed.
What I wonder is, why isn't this limit made more widely available? For example, it doesn't show up in `cudaGetDeviceProperties`. Where can I find this limit for various GPUs? Or maybe this isn't a real limit, but is derived from other hard limits?
I am running CUDA 10.1.
#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <sys/time.h>

/* Wall-clock time in seconds. */
double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double) tp.tv_sec + (double) tp.tv_usec * 1e-6;
}

#define CLOCK_RATE 1328500 /* Modify from below */

/* Busy-wait on the device for roughly t seconds. */
__device__ void sleep(float t) {
    long long t0 = clock64();
    long long t1 = t0;
    while ((t1 - t0) / (CLOCK_RATE * 1000.0f) < t)
        t1 = clock64();
}

__global__ void mykernel() {
    sleep(1.0);
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s num_blocks\n", argv[0]);
        return 1;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int mp = prop.multiProcessorCount;
    //clock_t clock_rate = prop.clockRate;
    int num_blocks = atoi(argv[1]);
    dim3 block(1);          /* one thread per block */
    dim3 grid(num_blocks);  /* N blocks */
    double start = cpuSecond();
    mykernel<<<grid,block>>>();
    cudaDeviceSynchronize();
    double etime = cpuSecond() - start;
    printf("mp %10d\n", mp);
    printf("blocks/SM %10.2f\n", num_blocks / ((double) mp));
    printf("time %10.2f\n", etime);
    cudaDeviceReset();
    return 0;
}
Results :
% srun -p gpuq sm_short 1792
mp 56
blocks/SM 32.00
time 1.16
% srun -p gpuq sm_short 1793
mp 56
blocks/SM 32.02
time 2.16
% srun -p gpuq sm_short 3584
mp 56
blocks/SM 64.00
time 2.16
% srun -p gpuq sm_short 3585
mp 56
blocks/SM 64.02
time 3.16
Yes, there is a limit on the number of blocks per SM. The maximum number of blocks per SM refers to the maximum number of blocks that can be active (resident) at a given time. Blocks can be organized into one-, two-, or three-dimensional grids containing far more blocks than that (on compute capability 3.0 and later, up to 2^31-1 blocks in x and 65,535 in y and z), but each SM of your GPU can only accommodate a certain number of blocks at once. This limit is linked in two ways to the compute capability of your GPU.
Hardware limit stated by CUDA.
Each GPU has a maximum number of blocks per SM, regardless of the number of threads each block contains and the amount of resources it uses. For example, a GPU with compute capability 2.0 has a limit of 8 blocks/SM while one with compute capability 7.0 has a limit of 32 blocks/SM. This is the highest number of active blocks per SM that you can achieve: let's call it MAX_BLOCKS.
Limit derived from the amount of resources used by each block.
A block is made up of threads and each thread uses a certain number of registers: the more registers it uses, the more resources are used by the block that contains it. Similarly, the amount of shared memory assigned to a block increases the amount of resources the block needs in order to be allocated. Once a certain value is exceeded, the resources needed by a block are so large that the SM cannot hold as many blocks as MAX_BLOCKS allows: the amount of resources needed by each block is then what limits the number of active blocks per SM.
How do I find these limits?
NVIDIA thought about that too. Their site offers the CUDA Occupancy Calculator spreadsheet, with which you can look up the hardware limits grouped by compute capability. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and other information about the number of active blocks.
The first tab of the linked file lets you calculate the actual SM occupancy based on the resources used. If you want to know how many registers per thread your kernel uses, add the -Xptxas -v option so that the compiler reports the register count when it compiles the PTX.
In the last tab of the file you will find the hardware limits grouped by compute capability.
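The same number can also be queried at run time instead of being read from the spreadsheet. The following is a minimal sketch using the runtime function cudaOccupancyMaxActiveBlocksPerMultiprocessor, which accounts for both the hardware block limit and the kernel's register and shared memory usage; the helper name residentBlocksPerSm is hypothetical, and mykernel here is an empty stand-in for the kernel in the question.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Empty stand-in for the kernel whose occupancy is being queried.
__global__ void mykernel() {}

// Returns how many blocks of mykernel can be resident on one SM for the
// given block size and dynamic shared memory size, or -1 if the query fails.
int residentBlocksPerSm(int blockSize, size_t dynamicSharedBytes)
{
    int numBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, mykernel, blockSize, dynamicSharedBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    return numBlocks;
}
```

Calling residentBlocksPerSm(1, 0) corresponds to the one-thread-per-block configuration used in the timing experiment, so on that hardware it would be expected to agree with the limit inferred from the timings.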

How cuda handle __syncthreads() in kernel?

Suppose I have a block of 1024 threads and my GPU has 192 CUDA cores.
How does CUDA handle __syncthreads() in kernels when the number of CUDA cores is lower than the block size?
__global__ void staticReverse(int *d, int n)
{
__shared__ int s[1024];
int t = threadIdx.x;
int tr = n-t-1;
s[t] = d[t];
__syncthreads();
d[t] = s[tr];
}
How does 'tr' remain in local memory?
I think you are mixing a few things.
First of all, 192 CUDA cores is the total core count of the GPU. Each block, however, maps to a single Streaming Multiprocessor (SM), which may have a lower core count (depending on the GPU generation).
Let us assume that you own a Pascal GPU which has 64 cores per SM and that you have 3 SMs.
A single block maps to a single SM. So you will have 64 cores handling 1024 threads concurrently. Such an SM has enough registers to hold all the necessary data for 1024 threads, but it has only 64 cores, which quickly switch between the threads they are executing.
This way all the per-thread local data, e.g. tr, can stay resident in registers.
Now, because of this quick switching and concurrent execution, it may happen, completely by accident, that some threads get ahead of others. If you want to ensure that at a certain point all threads of the block are at the same spot, you use __syncthreads(). It is a barrier: no thread of the block proceeds past it until every thread of the block has reached it, and shared memory writes made before the barrier (here s[t]) are visible to all threads of the block after it.

Dynamically allocated shared memory in CUDA. Execution Configuration

What does NVIDIA mean by this?
Ns is of type size_t and specifies the number of bytes in shared
memory that is dynamically allocated per block for this call in
addition to the statically allocated memory; this dynamically
allocated memory is used by any of the variables declared as an
external array as mentioned in __shared__; Ns is an optional
argument which defaults to 0;
The size of shared memory on my GPU is 48 kB.
For example, I want to run 4 kernels at the same time, each of which uses 12 kB of shared memory.
In order to do that, should I call the kernel this way
kernel<<< gridSize, blockSize, 12 * 1024 >>>();
or should the third argument be 48 * 1024 ?
Ns is a size in bytes. If you want to reserve 12 kB of shared memory you would pass 12*1024.
I doubt that is all you want, though. The Ns value is PER BLOCK, so it is the amount of dynamic shared memory for each block executing on the device. If 12 kB is meant to be the total for the kernel, I'm guessing you'd like to do something along the lines of 12*1024/number_of_blocks;
Kernel launching with concurrency:
If, as mentioned in a comment, you are using streams, there is a fourth argument in the kernel launch, which is the CUDA stream.
If you want to launch a kernel on another stream without any dynamic shared memory, it will look like:
kernel_name<<<128, 128, 0, mystream>>>(...);
But concurrency is a whole different issue.
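Putting the two launch arguments together, a rough sketch of the scenario in the question (four kernels, each requesting 12 kB of dynamic shared memory per block, each on its own stream) might look like the following. The kernel body, the constants kSharedBytes and kNumStreams, and the helper launchOnStreams are all hypothetical. Whether blocks from the four launches actually run side by side is decided by the hardware scheduler and the other resource limits, so this only expresses the request.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

constexpr size_t kSharedBytes = 12 * 1024;  // 12 kB of dynamic shared memory per block
constexpr int kNumStreams = 4;              // one stream per concurrent kernel

// Hypothetical kernel: stages one block's data in dynamic shared memory
// and writes it back in reversed order within the block.
__global__ void kernel(float *data)
{
    extern __shared__ float buf[];          // sized by the third launch argument
    int base = blockIdx.x * blockDim.x;
    buf[threadIdx.x] = data[base + threadIdx.x];
    __syncthreads();
    data[base + threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}

// Launches the kernel once per stream. Each d_data[i] must point to
// gridSize * blockSize floats on the device; streams must already exist.
cudaError_t launchOnStreams(float *const d_data[], const cudaStream_t streams[],
                            int gridSize, int blockSize)
{
    // The kernel uses blockSize floats of shared memory; refuse sizes that
    // would not fit in the 12 kB requested per block.
    if (blockSize <= 0 || blockSize * sizeof(float) > kSharedBytes)
        return cudaErrorInvalidValue;
    for (int i = 0; i < kNumStreams; ++i) {
        kernel<<<gridSize, blockSize, kSharedBytes, streams[i]>>>(d_data[i]);
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            return err;
    }
    return cudaSuccess;
}
```

The caller would create the streams with cudaStreamCreate, call launchOnStreams, and then wait with cudaStreamSynchronize or cudaDeviceSynchronize before reading the results.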

CUDA summation reduction puzzle

Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles make use of the declaration extern __shared__ temp[]. The keyword extern is used in C when a declaration is made but the allocation takes place "elsewhere" (e.g. in another C file). What is the relevance of extern here? Why don't we use:
__shared__ float temp[N/2];
for instance? Or why don't we declare temp to be a global variable, e.g.
#define N 1024
__shared__ float temp[N/2];
__global__ void sum(float *sum, float *data){ ... }
int main(){
...
sum<<<M,L>>>(sum, data);
}
I have yet another question: how many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).
Note: You can find information about my devices here.
The answer to the first question is that CUDA supports dynamic shared memory allocation at runtime (see this SO question and the documentation for more details). The declaration of shared memory using extern denotes to the compiler that shared memory size will be determined at kernel launch, passed in bytes as an argument to the <<< >>> syntax (or equivalently via an API function), something like:
sum<<< gridsize, blocksize, sharedmem_size >>>(....);
The answer to the second question is normally to launch enough blocks to completely fill all the streaming multiprocessors on your GPU. Most sensibly written reduction kernels will accumulate many values per thread and then perform a shared memory reduction. The reduction requires that the number of threads per block be a power of two. That usually gives you 32, 64, 128, 256, 512 (or 1024 if you have a Fermi or later GPU). It is a very finite search space, so just benchmark to see what works best on your hardware. You can find a more general discussion about block and grid sizing here and here.
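To make the two points concrete, here is a small sketch of a sum reduction that declares its scratch space with extern __shared__ and sizes it at launch. It is not the code from either tutorial; the names sumKernel, isPowerOfTwo and sumOnDevice are hypothetical, and the final pass over the per-block partial sums is done on the host to keep the example short.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Each thread accumulates many values with a grid-stride loop, then the
// block reduces its partial sums in dynamically sized shared memory.
__global__ void sumKernel(const float *data, float *blockSums, int n)
{
    extern __shared__ float temp[];          // blockDim.x floats, set at launch
    int t = threadIdx.x;
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + t; i < n; i += blockDim.x * gridDim.x)
        acc += data[i];
    temp[t] = acc;
    __syncthreads();
    // Tree reduction; requires blockDim.x to be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s)
            temp[t] += temp[t + s];
        __syncthreads();
    }
    if (t == 0)
        blockSums[blockIdx.x] = temp[0];
}

bool isPowerOfTwo(int x) { return x > 0 && (x & (x - 1)) == 0; }

// Sums h on the device; returns true on success and writes the total to *result.
bool sumOnDevice(const std::vector<float> &h, int blocks, int threads, float *result)
{
    if (h.empty() || blocks <= 0 || !isPowerOfTwo(threads))
        return false;
    int n = static_cast<int>(h.size());
    float *d_data = nullptr, *d_sums = nullptr;
    bool ok = cudaMalloc(&d_data, n * sizeof(float)) == cudaSuccess &&
              cudaMalloc(&d_sums, blocks * sizeof(float)) == cudaSuccess &&
              cudaMemcpy(d_data, h.data(), n * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    std::vector<float> sums(blocks, 0.0f);
    if (ok) {
        size_t sharedBytes = threads * sizeof(float);   // bytes, per block
        sumKernel<<<blocks, threads, sharedBytes>>>(d_data, d_sums, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(sums.data(), d_sums, blocks * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_data);
    cudaFree(d_sums);
    if (!ok)
        return false;
    float total = 0.0f;
    for (float s : sums)
        total += s;                                     // final pass on the host
    *result = total;
    return true;
}
```

Because the shared array size comes from the launch, the same kernel can be benchmarked with 32, 64, ..., 1024 threads per block without recompiling, which is the practical reason for preferring extern over a fixed temp[N/2].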

How to avoid using number of threads exceeding the maximum allowed on GPU?

As described in a previous post:
how to find the number of maximum available threads in CUDA?
I found that the maximum number of threads on my GPU is 21504. However, when I launched a kernel with more than that number of threads, everything seemed to run smoothly.
#include <stdio.h>
#include <cuda_runtime.h>
__global__ void dummy()
{
}
int main()
{
//int N=21504;
int N=21504*40;
dummy<<<1,N>>>();
return 0;
}
I don't know what happened, but I believe we should avoid this, and I am not sure how to do it.
Your example did not run correctly. It only appeared to run correctly because you did not check the CUDA error status after the kernel launch.
The comment I made on your other question also applies here:
The maximum number of threads per multiprocessor is the upper limit to how many threads can be "in flight" at the same time. Other limiting factors will normally limit the number further. This value does not affect how many threads can be launched at the same time and it is not very useful for finding out the number of threads needed for optimal performance.
Your card is a compute capability 2.0 device. See the Features and Technical Specifications section in the CUDA Programming Guide for details on the limitations of your device. In particular, your device is limited to a grid size of 65535 in each of the X, Y and Z dimensions, and to 1024 threads per block. In a launch of the form <<<grid, block>>>, the first argument is the grid size and the second is the block size, so the value 21504*40 = 860160 exceeds the limit whichever position it is passed in: as written, <<<1,N>>> asks for 860160 threads in one block, and as a grid size X = 21504*40, Y = 1, Z = 1 it would exceed 65535.
So, in theory, you can launch up to 65535 * 65535 * 65535 blocks, each with 1024 threads, at the same time.
There is no performance penalty to launching kernels with many more threads than the maximum number of resident threads your device supports.
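A minimal sketch of how the launch in the question could stay within the limits and report failures: split N threads into blocks of at most maxThreadsPerBlock, compare the resulting grid size against maxGridSize, and check the error status after the launch. The helper names blocksFor and launchDummy are hypothetical, and the kernel takes n only so that surplus threads in the last block can exit early.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;   // guard: threads past the end of the last block do nothing
}

// Number of blocks needed to cover n threads (rounded up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Launches dummy with n threads in total, within the device's limits.
cudaError_t launchDummy(int n)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess)
        return err;
    int threads = prop.maxThreadsPerBlock;        // e.g. 1024 on compute capability 2.0
    int blocks = blocksFor(n, threads);
    if (blocks > prop.maxGridSize[0])
        return cudaErrorInvalidConfiguration;     // would need a 2D or 3D grid
    dummy<<<blocks, threads>>>(n);
    err = cudaGetLastError();                     // catches invalid launch configurations
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();            // catches errors during execution
    if (err != cudaSuccess)
        fprintf(stderr, "dummy failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

For N = 21504*40 = 860160 and 1024 threads per block, blocksFor gives 840 blocks, well inside the 65535 limit. Applying the same cudaGetLastError check to the original <<<1,N>>> launch is what would have revealed that it never ran.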