Where does the shared memory of non-resident threadblocks go?

I am trying to understand how shared memory works when blocks use a lot of it.
So my GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of memory. I force both blocks to communicate before finishing, but since they can't run in parallel, this should be a deadlock. The program does terminate, however, whether I run 2 blocks or 1000.
Is this because block 1 is paused once it runs into the deadlock, and swapped out for block 2? If yes, where does the 48 KB of data from block 1 go while block 2 is active? Is it stored in global memory?
Kernel:
__global__ void testKernel(uint8_t* globalmem_message_buffer, int n) {
    const uint32_t size = 48000;
    __shared__ uint8_t data[size];
    for (int i = 0; i < size; i++)
        data[i] = 1;
    globalmem_message_buffer[blockIdx.x] = 1;
    while (globalmem_message_buffer[(blockIdx.x + 1) % n] == 0) {}
    printf("ID: %d\n", blockIdx.x);
}
Host code:
int n = 2; // Still works with n=1000
cudaStream_t astream;
cudaStreamCreate(&astream);
uint8_t* globalmem_message_buffer;
cudaMallocManaged(&globalmem_message_buffer, sizeof(uint8_t) * n);
for (int i = 0; i < n; i++) globalmem_message_buffer[i] = 0;
cudaDeviceSynchronize();
testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Edit: Changed "threadIdx" to "blockIdx"

So my gpu (RTX 2080 ti) has 48 kb of shared memory per SM, and the same per threadblock. In my example below i have 2 blocks forced on the same SM, each using the full 48 kb of memory.
That wouldn't happen. The general premise here is flawed. The GPU block scheduler only deposits a block on an SM when there are free resources sufficient to support that block.
An SM with 48KB of shared memory, that already has a block resident on it that uses 48KB of shared memory, will not get any new blocks of that type deposited on it, until the existing/resident block "retires" and releases the resources it is using.
Therefore, in the normal CUDA scheduling model, the only way a block can be non-resident is if it has not yet been scheduled on an SM. In that case, it uses no resources while it is waiting in the queue.
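To check the residency claim on a particular device, one option is to ask the runtime how many blocks of a kernel it considers able to be resident on one SM. The sketch below is only an illustration: `sharedHogKernel` and `maxResidentBlocksPerSM` are made-up names, and the kernel merely mimics the 48000-byte static shared array from the question. `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the runtime call doing the actual work.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical kernel: same 48000-byte static shared array as in the question.
__global__ void sharedHogKernel(uint8_t* out) {
    __shared__ uint8_t data[48000];
    for (int i = threadIdx.x; i < 48000; i += blockDim.x)
        data[i] = 1;
    __syncthreads();
    // Read the array back so the compiler cannot discard it.
    if (threadIdx.x == 0)
        out[blockIdx.x] = data[(blockIdx.x * 7) % 48000];
}

// Returns the number of sharedHogKernel blocks the runtime reports as able to
// be resident on one SM at the same time, or -1 if the query fails.
int maxResidentBlocksPerSM(int threadsPerBlock) {
    int numBlocks = 0;
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, sharedHogKernel, threadsPerBlock, 0 /* dynamic shared bytes */);
    return err == cudaSuccess ? numBlocks : -1;
}
```

Calling `maxResidentBlocksPerSM(1)` mirrors the one-thread-per-block launch in the question. The value depends on how much shared memory each SM of the device offers; a result of 1 would mean a second block can only be deposited after the first one retires.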
The exception to this would be CUDA preemption. This mechanism is not well documented, but it would occur, for example, at a context switch. In such a case, the entire threadblock state is somehow removed from the SM and stored somewhere else. However, preemption is not applicable when we are analyzing the behavior of a single kernel launch.
You haven't provided a complete code example, however, for the n=2 case, your claim that these will somehow deposit on the same SM simply isn't true.
For the n=1000 case, your code only requires that a single location in memory be set to 1:
while (globalmem_message_buffer[(threadIdx.x + 1) % n] == 0) {}
threadIdx.x for your code is always 0, since you are launching threadblocks of only 1 thread:
testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Therefore the index generated here is always 1 (for n greater than or equal to 2). All threadblocks are checking location 1. Therefore, when the threadblock whose blockIdx.x is 1 executes, all threadblocks in the grid will be "unblocked", because they are all testing the same location. In short, your code may not be doing what you think it is or intended. Even if you had each threadblock check the location of another threadblock, we can imagine a sequence of threadblock deposits that would satisfy this without requiring all n threadblocks to be simultaneously resident, so I don't think that would prove anything either. (There is no specified order for the block deposit sequence.)

Related

Maximum number of resident blocks per SM?

It seems that there is a maximum number of resident blocks allowed per SM. But while other "hard" limits are easily found (via, for example, `cudaGetDeviceProperties'), a maximum number of resident blocks doesn't seem to be widely documented.
In the following sample code, I configure the kernel with one thread per block. To test the hypothesis that this GPU (a P100) has a maximum of 32 resident blocks per SM, I create a grid of 56*32 blocks (56 = number of SMs on the P100). Each block takes 1 second to process (via a "sleep" routine), so if I have configured the kernel correctly, the code should take 1 second. The timing results confirm this. Configuring with 32*56+1 blocks takes 2 seconds, suggesting that 32 blocks per SM is the maximum allowed.
What I wonder is, why isn't this limit made more widely available? For example, it doesn't show up in `cudaGetDeviceProperties'. Where can I find this limit for various GPUs? Or maybe this isn't a real limit, but is derived from other hard limits?
I am running CUDA 10.1
#include <stdio.h>
#include <stdlib.h> /* atoi */
#include <sys/time.h>
/* Wall-clock time in seconds */
double cpuSecond() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double) tp.tv_sec + (double) tp.tv_usec * 1e-6;
}
#define CLOCK_RATE 1328500 /* Modify from below */
/* Busy-wait for roughly t seconds, counted in SM clock cycles */
__device__ void sleep(float t) {
    long long t0 = clock64(); /* clock64() returns long long */
    long long t1 = t0;
    while ((t1 - t0) / (CLOCK_RATE * 1000.0f) < t)
        t1 = clock64();
}
__global__ void mykernel() {
    sleep(1.0);
}
int main(int argc, char* argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s num_blocks\n", argv[0]);
        return 1;
    }
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int mp = prop.multiProcessorCount;
    //clock_t clock_rate = prop.clockRate;
    int num_blocks = atoi(argv[1]);
    dim3 block(1);
    dim3 grid(num_blocks); /* N blocks */
    double start = cpuSecond();
    mykernel<<<grid,block>>>();
    cudaDeviceSynchronize();
    double etime = cpuSecond() - start;
    printf("mp %10d\n", mp);
    printf("blocks/SM %10.2f\n", num_blocks / ((double) mp));
    printf("time %10.2f\n", etime);
    cudaDeviceReset();
    return 0;
}
Results :
% srun -p gpuq sm_short 1792
mp 56
blocks/SM 32.00
time 1.16
% srun -p gpuq sm_short 1793
mp 56
blocks/SM 32.02
time 2.16
% srun -p gpuq sm_short 3584
mp 56
blocks/SM 64.00
time 2.16
% srun -p gpuq sm_short 3585
mp 56
blocks/SM 64.02
time 3.16
Yes, there is a limit to the number of blocks per SM. The maximum number of blocks that can be contained in an SM refers to the maximum number of active blocks at a given time. A grid can contain far more blocks than that (on compute capability 3.0 and later, up to 2^31-1 blocks in the x dimension and 65,535 in each of y and z), but each SM of your GPU can accommodate only a certain number of them at once. This limit is linked in two ways to the compute capability of your GPU.
Hardware limit stated by CUDA.
Each GPU has a maximum number of blocks per SM, regardless of the number of threads each block contains and the amount of resources it uses. For example, a GPU with compute capability 2.0 has a limit of 8 blocks/SM while one with compute capability 7.0 has a limit of 32 blocks/SM. This is the highest number of active blocks per SM that you can achieve: let's call it MAX_BLOCKS.
Limit derived from the amount of resources used by each block.
A block is made up of threads and each thread uses a certain number of registers: the more registers it uses, the more resources the block that contains it needs. Similarly, the shared memory assigned to a block adds to the resources the block needs in order to be allocated. Once a certain value is exceeded, the resources needed for a block will be so large that the SM will not be able to hold as many blocks as MAX_BLOCKS allows: this means that the amount of resources needed for each block is limiting the maximum number of active blocks per SM.
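The interplay of the two limits can be written down as a small host-side calculation. This is a simplified sketch, not the occupancy calculator's exact algorithm: it ignores allocation granularity (registers and shared memory are handed out in fixed-size chunks), and `SmLimits` and `residentBlocksPerSM` are hypothetical names whose inputs you would fill in from the tables for your compute capability.

```cuda
#include <algorithm>
#include <cstddef>

// Per-SM limits, supplied by the caller (hypothetical struct for illustration).
struct SmLimits {
    int maxBlocks;       // MAX_BLOCKS: hardware cap on resident blocks
    int maxThreads;      // resident threads per SM
    int registers;       // 32-bit registers per SM
    size_t sharedBytes;  // shared memory per SM
};

// Number of blocks that fit on one SM: the smallest of the hardware cap and
// the caps implied by each resource the block consumes.
int residentBlocksPerSM(const SmLimits& sm, int threadsPerBlock,
                        int regsPerThread, size_t sharedPerBlock) {
    int byThreads = sm.maxThreads / threadsPerBlock;
    int byRegs = sm.registers / (regsPerThread * threadsPerBlock);
    int limit = std::min(sm.maxBlocks, std::min(byThreads, byRegs));
    if (sharedPerBlock > 0)
        limit = std::min(limit, static_cast<int>(sm.sharedBytes / sharedPerBlock));
    return limit;
}
```

With the illustrative limits `{32, 2048, 65536, 65536}`, a one-thread block using 16 registers and no shared memory is capped by MAX_BLOCKS at 32, while a 256-thread block using 64 registers per thread is capped by the register file at 4.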
How do I find these boundaries?
NVIDIA thought about that too. Their site offers the CUDA Occupancy Calculator spreadsheet, in which you can look up the hardware limits grouped by compute capability. You can also enter the amount of resources used by your blocks (number of threads, registers per thread, bytes of shared memory) and get graphs and important information about the number of active blocks.
The first tab of the linked file lets you calculate the actual occupancy of an SM based on the resources used. If you want to know how many registers per thread you use, add the -Xptxas -v option so that the compiler reports the register count of each kernel it builds.
In the last tab of the file you will find the hardware limits grouped by Compute capability.
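As a side note, newer toolkits than the CUDA 10.1 used in the question (CUDA 11 and later, to my knowledge) expose this limit through the runtime as the device attribute `cudaDevAttrMaxBlocksPerMultiprocessor`. A minimal sketch of querying it, where `maxBlocksPerSM` is a hypothetical helper name:

```cuda
#include <cuda_runtime.h>

// Returns the hardware limit on resident blocks per SM for the given device,
// or -1 if the query fails (for example, on a toolkit or driver without it).
int maxBlocksPerSM(int device) {
    int value = 0;
    cudaError_t err = cudaDeviceGetAttribute(
        &value, cudaDevAttrMaxBlocksPerMultiprocessor, device);
    return err == cudaSuccess ? value : -1;
}
```

This only reports MAX_BLOCKS; the resource-derived limit still has to be worked out per kernel.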

Getting an unexpected value in global device memory when multiple threads write to it

Here is a problem with CUDA threads and memory management: it returns a single thread's result, "100", but I would expect the result of 9 threads, "900".
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <helper_functions.h>
#include <helper_cuda.h>
__global__
void test(int in1, int* ptr){
    int e = 0;
    for (int i = 0; i < 100; i++){
        e++;
    }
    *ptr += e; // unsynchronized read-modify-write by every thread
}
int main(int argc, char **argv)
{
    int devID = 0;
    cudaError_t error;
    error = cudaGetDevice(&devID);
    if (error == cudaSuccess)
    {
        printf("GPU Device fine\n");
    }
    else{
        printf("GPU Device problem, aborting");
        abort();
    }
    int* d_A;
    cudaMalloc(&d_A, sizeof(int));
    int res = 0;
    //cudaMemcpy(d_A, &res, sizeof(int), cudaMemcpyHostToDevice);
    test<<<3, 3>>>(0, d_A);
    cudaDeviceSynchronize();
    cudaMemcpy(&res, d_A, sizeof(int), cudaMemcpyDeviceToHost);
    printf("res is : %i", res);
    Sleep(10000);
    return 0;
}
It returns:
GPU Device fine\n
res is : 100
I would expect it to return a higher number because of the 3x3 (blocks, threads) launch, instead of just one thread's result.
What is done wrong, and where do the numbers get lost?
You can't write your sum in this way to global memory.
You have to use an atomic function to ensure that the store is atomic.
In general, when multiple device threads write to the same value in global memory, you have to use either atomic functions:
float atomicAdd(float* address, float val);
double atomicAdd(double* address, double val);
reads the 32-bit or 64-bit word old located at the address address in
global or shared memory, computes (old + val), and stores the result
back to memory at the same address. These three operations are
performed in one atomic transaction. The function returns old.
or thread synchronization :
Throughput for __syncthreads() is 16 operations per clock cycle for
devices of compute capability 2.x, 128 operations per clock cycle for
devices of compute capability 3.x, 32 operations per clock cycle for
devices of compute capability 6.0 and 64 operations per clock cycle
for devices of compute capability 5.x, 6.1 and 6.2.
Note that __syncthreads() can impact performance by forcing the
multiprocessor to idle as detailed in Device Memory Accesses.
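Applied to the kernel in the question, the atomic variant could look like the sketch below. The names `testAtomic` and `runAtomicSum` are made up for illustration; the sketch uses the `int` overload of `atomicAdd` and also initializes the device value to 0 before the launch, which the original code leaves commented out.

```cuda
#include <cuda_runtime.h>

// Same per-thread work as the question's kernel, but the final update of *ptr
// is a single atomic read-modify-write.
__global__ void testAtomic(int* ptr) {
    int e = 0;
    for (int i = 0; i < 100; i++)
        e++;
    atomicAdd(ptr, e);
}

// Hypothetical host helper: launches the kernel and returns the accumulated
// value, or -1 if the allocation fails.
int runAtomicSum(int blocks, int threads) {
    int* d_A = nullptr;
    int res = 0;
    if (cudaMalloc(&d_A, sizeof(int)) != cudaSuccess)
        return -1;
    cudaMemcpy(d_A, &res, sizeof(int), cudaMemcpyHostToDevice);  // start from 0
    testAtomic<<<blocks, threads>>>(d_A);
    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&res, d_A, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    return res;
}
```

With the question's launch configuration, `runAtomicSum(3, 3)` adds 100 nine times.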
(adapting another answer of mine:)
You are experiencing the effects of the increment operator not being atomic. (C++-oriented description of what that means). What's happening, chronologically, is the following sequence of events (not necessarily in the same order of threads though):
...(other work)...
block 0 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 1 issues a LOAD instruction with address ptr into register r
...
block 2 thread 0 issues a LOAD instruction with address ptr into register r
block 0 thread 0 completes the LOAD, now having 0 in register r
...
block 2 thread 2 completes the LOAD, now having 0 in register r
block 0 thread 0 adds 100 to r
...
block 2 thread 2 adds 100 to r
block 0 thread 0 issues a STORE instruction from register r to address ptr
...
block 2 thread 2 issues a STORE instruction from register r to address ptr
Thus every thread sees the initial value of *ptr, which is 0; adds 100; and stores 0+100=100 back. The order of the stores doesn't matter here as long as all of the threads try to store the same false value.
What you need to do is one of the following:
Use atomic operations - the least amount of modifications to your code, but very inefficient, since it serializes your work to a great extent, or
Use a block-level reduction primitive. This will ensure some partial ordering of the computational activity vis-a-vis shared block memory - using __syncthreads() or other mechanisms. Thus it might first have each thread add its own two elements up; then synchronize block threads; then have fewer threads add up pairs of pair-sums and so on. Here's an NVIDIA blog post on implementing fast reductions on their more modern GPU architectures, or
Use block-local or warp-local partial results, which require less/cheaper synchronization, and combine them eventually after having done a lot of work on them.
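A rough sketch of the block-local approach, under a few assumptions of mine: the block size does not exceed `MAX_THREADS` (a constant I picked), and `testBlockSum` and `runBlockSum` are hypothetical names. Each block first combines its threads' values in shared memory with `__syncthreads()` between steps, and only one thread per block then touches global memory with an atomic.

```cuda
#include <cuda_runtime.h>

constexpr int MAX_THREADS = 256;  // assumed upper bound on threads per block

__global__ void testBlockSum(int* ptr) {
    __shared__ int partial[MAX_THREADS];
    int e = 0;
    for (int i = 0; i < 100; i++)
        e++;
    partial[threadIdx.x] = e;
    __syncthreads();
    // Pairwise reduction in shared memory; works for any block size, since
    // the upper half of the live range is folded onto the lower half.
    for (int n = blockDim.x; n > 1; n = (n + 1) / 2) {
        int half = (n + 1) / 2;
        if (threadIdx.x + half < n)
            partial[threadIdx.x] += partial[threadIdx.x + half];
        __syncthreads();
    }
    // One atomic per block instead of one per thread.
    if (threadIdx.x == 0)
        atomicAdd(ptr, partial[0]);
}

// Hypothetical host helper: returns the total, or -1 on bad input or failure.
int runBlockSum(int blocks, int threads) {
    if (threads < 1 || threads > MAX_THREADS)
        return -1;
    int* d_A = nullptr;
    int res = 0;
    if (cudaMalloc(&d_A, sizeof(int)) != cudaSuccess)
        return -1;
    cudaMemcpy(d_A, &res, sizeof(int), cudaMemcpyHostToDevice);  // start from 0
    testBlockSum<<<blocks, threads>>>(d_A);
    cudaMemcpy(&res, d_A, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_A);
    return res;
}
```

For a 3x3 launch this makes little difference, but with large blocks the number of contended atomic updates drops from one per thread to one per block.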

Bank conflict in CUDA when reading from the same location

I have a CUDA kernel where there is a point where each thread is reading the same value from the global memory. So something like:
__global__ void my_kernel(const float4 * key_pts)
{
    // every thread of the block reads the same element
    if (key_pts[blockIdx.x].x < 0) return;
}
The kernel is configured as follows:
dim3 blocks(16, 16);
dim3 grid(2000);
my_kernel<<<grid, blocks, 0, stream>>>(key_pts);
My question is whether this will lead to some sort of bank conflict or sub-optimal access in CUDA. I must confess I do not understand this issue in detail yet.
I was thinking I could do something like the following in case we have sub-optimal access:
__global__ void my_kernel(const float4 * key_pts)
{
    __shared__ float x;
    // one thread loads the value, then the whole block reads it from shared memory
    if (threadIdx.x == 0 && threadIdx.y == 0)
        x = key_pts[blockIdx.x].x;
    __syncthreads();
    if (x < 0) return;
}
Doing some timing though, I do not see any difference between the two but so far my tests are with limited data.
Bank conflicts apply to shared memory, not global memory.
Since all threads need (ultimately) the same value to make their decision, this won't yield sub-optimal access on global memory because there is a broadcast mechanism so that all threads in the same warp, requesting the same location/value from global memory, will retrieve that without any serialization or overhead. All threads in the warp can be serviced at the same time:
Note that threads can access any words in any order, including the same words.
Furthermore, assuming your GPU has a cache (cc2.0 or newer) the value retrieved from global memory for the first warp encountering this will likely be available in the cache for subsequent warps that hit this point.
I wouldn't expect much performance difference between the two cases.

Maximum threads per block vs shared memory size

Is there any relation between the size of the shared memory and the maximum number of threads per block? In my case I use max threads per block = 512; my program makes use of all the threads and uses a considerable amount of shared memory.
Each thread has to do a particular task repeatedly. For example my kernel might look like,
int threadsPerBlock = (blockDim.x * blockDim.y * blockDim.z);
int bId = (blockIdx.x * gridDim.y * gridDim.z) + (blockIdx.y * gridDim.z) + blockIdx.z;
for(j = 0; j <= N; j++) {
    tId = threadIdx.x + (j * threadsPerBlock);
    uniqueTid = bId*blockDim.x + tId;
    curand_init(uniqueTid, 0, 0, &seedValue);
    randomP = (float) curand_uniform( &seedValue );
    if(randomP <= input_value)
        /* Some task */
    else
        /* Some other task */
}
But my threads are not going into the next iteration (say j = 2). Am I missing something obvious here?
You have to distinguish between shared memory and global memory. The former is always per block. The latter refers to the off-chip memory that is available on the GPU.
So generally speaking, there is a kind of relation when it comes to threads: the maximum amount of shared memory per block stays the same when you use more threads per block, so each thread gets a smaller share of it.
Also refer to e.g. Using Shared Memory in CUDA C/C++.
There is no immediate relationship between the maximum number of threads per block and the size of the shared memory (not 'device memory' - they're not the same thing).
However, there is an indirect relationship, in that with different Compute Capabilities, both these numbers change:
Compute Capability               1.x     2.x - 3.x
Threads per block                512     1024
Max shared memory (per block)    16KB    48KB
as one of them has increased with newer CUDA devices, so has the other.
Finally, there is a block-level resource which is affected, used up, by the launching of more threads: The Register File. There is a single register file which all block threads share, and the constraint is
ThreadsPerBlock x RegistersPerThread <= RegisterFileSize
It is not trivial to determine how many registers your kernel code is using; but as a rule of thumb, if you use "a lot" of local variables, function call parameters etc., you might hit the above limit, and will not be able to schedule as many threads.
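One way to inspect what a compiled kernel actually uses is the runtime call `cudaFuncGetAttributes`, which reports, among other things, registers per thread, static shared memory, and the largest block size the kernel can be launched with. The sketch below uses a made-up `sampleKernel` and a hypothetical helper `kernelResources` purely for illustration.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical kernel with a small static shared array (1024 bytes);
// it assumes blocks of at most 256 threads.
__global__ void sampleKernel(float* out) {
    __shared__ float tile[256];
    tile[threadIdx.x] = threadIdx.x * 0.5f;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[(threadIdx.x + 1) % blockDim.x];
}

// Fills in the resources the compiled sampleKernel uses; returns false if
// the query fails.
bool kernelResources(int* regsPerThread, size_t* staticSharedBytes,
                     int* maxThreadsPerBlock) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, sampleKernel) != cudaSuccess)
        return false;
    *regsPerThread = attr.numRegs;
    *staticSharedBytes = attr.sharedSizeBytes;
    *maxThreadsPerBlock = attr.maxThreadsPerBlock;
    return true;
}
```

If `maxThreadsPerBlock` comes back lower than the device-wide maximum, the kernel's register use is what is limiting the block size.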

CUDA shared memory broadcast and __syncthreads behavior

I am experiencing a strange issue, well at least to me it looks strange, and I was hoping someone might be able to shed some light on it. I have a CUDA kernel which relies on shared memory for fast local accesses. To the limits of my knowledge, if all the threads within a half-warp access the same shared memory bank then the value will be broadcast to the threads in the warp. Also, access from multiple warps to the same bank do not cause bank conflicts, they will just be serialized. Keeping this in mind, I have created a small kernel to test this out (after encountering issues in my original kernel). Here's the snippet:
#define NUM_VALUES 16
#define NUM_LOOPS 1024
__global__ void shared_memory_test(float *output)
{
    // Create some shared memory
    __shared__ int dm_delays[NUM_VALUES];
    // Loop over NUM_LOOPS
    float accumulator = 0;
    for(unsigned c = 0; c < NUM_LOOPS; c++)
    {
        // Force shared memory update
        for(int d = threadIdx.x; d < NUM_VALUES; d++)
            dm_delays[d] = c * d;
        // __syncthreads();
        for(int d = 0; d < NUM_VALUES; d++)
            accumulator += dm_delays[d];
    }
    // Store accumulated value to global memory
    for(unsigned d = 0; d < NUM_VALUES; d++)
        output[d] = accumulator;
}
I've run this with a block dimension of 16 (half a warp, not terribly efficient but it's just for testing purposes). All the threads should be addressing the same shared memory bank, so there should be no conflicts. However, the opposite seems to be true. I'm using Parallel Nsight on Visual Studio 2010 for this testing.
What is even more mysterious to me is the fact that if I uncomment the __syncthreads call in the outer loop then the number of bank conflicts increases dramatically.
Just some numbers to give you an idea (this is for a grid containing one block with 16 threads, so a single half-warp, NUM_VALUES = 16, NUM_LOOPS = 1024):
without __syncthreads: 4 bank conflicts
with __syncthreads : 4,096 bank conflicts
I'm running this on a GTX 670, set at compute_capability 3.0
Thank you in advance
UPDATE: It was pointed out that without __syncthreads the NUM_LOOPS reads in the outer loop were being optimised away by the compiler since the values of dm_delays never change. Now I get a constant 4,096 bank conflicts in both cases, which still doesn't play well with the broadcast behavior for shared memory.
Since the value of dm_delays does not change, this may be a case where the compiler optimizes away the 1024 reads to shared memory if the __syncthreads is not present. With the __syncthreads there, it may assume that the value could be changed by another thread, and so it reads the value over and over again.