Maximum threads per block vs shared memory size - cuda

Is there any relation between the size of the shared memory and the maximum number of threads per block? In my case the maximum number of threads per block is 512, my program makes use of all of them, and it also uses a considerable amount of shared memory.
Each thread has to do a particular task repeatedly. For example my kernel might look like,
int threadsPerBlock = (blockDim.x * blockDim.y * blockDim.z);
int bId = (blockIdx.x * gridDim.y * gridDim.z) + (blockIdx.y * gridDim.z) + blockIdx.z;
for (j = 0; j <= N; j++) {
    tId = threadIdx.x + (j * threadsPerBlock);
    uniqueTid = bId * blockDim.x + tId;
    curand_init(uniqueTid, 0, 0, &seedValue);
    randomP = (float) curand_uniform(&seedValue);
    if (randomP <= input_value)
        /* Some task */
    else
        /* Some other task */
}
But my threads are not going into the next iteration (say j = 2). Am I missing something obvious here?

You have to distinguish between shared memory and global memory. The former is always per block, while the latter refers to the off-chip memory that is available on the GPU.
So, generally speaking, there is a relation of sorts where threads are concerned: when you launch more threads per block, the maximum amount of shared memory per block stays the same.
Also refer to e.g. Using Shared Memory in CUDA C/C++.
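To make the "per block" point concrete, here is a minimal sketch (my own illustration, not from the linked article) of the two ways a kernel obtains its per-block shared memory: a statically sized __shared__ array, and a dynamically sized extern __shared__ array whose byte count is passed as the third launch configuration parameter.
#include <cstdio>

// Statically sized shared memory: size fixed at compile time, one copy per block.
__global__ void staticSharedKernel()
{
    __shared__ float tile[256];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        printf("block %d wrote %d floats to shared memory\n", (int)blockIdx.x, (int)blockDim.x);
}

// Dynamically sized shared memory: size chosen at launch time.
__global__ void dynamicSharedKernel()
{
    extern __shared__ float tile[];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0)
        printf("block %d wrote %d floats to shared memory\n", (int)blockIdx.x, (int)blockDim.x);
}

int main()
{
    staticSharedKernel<<<2, 256>>>();
    // Third launch parameter = dynamic shared memory bytes per block.
    dynamicSharedKernel<<<2, 256, 256 * sizeof(float)>>>();
    cudaDeviceSynchronize();
    return 0;
}
Either way, each resident block gets its own copy of the array, and all resident copies must fit within the per-SM shared memory budget.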

There is no immediate relationship between the maximum number of threads per block and the size of the shared memory (not 'device memory' - they're not the same thing).
However, there is an indirect relationship, in that with different Compute Capabilities, both these numbers change:
Compute Capability               1.x       2.x - 3.x
Threads per block                512       1024
Max shared memory (per block)    16 KB     48 KB
As one of them has increased with newer CUDA devices, so has the other.
Finally, there is a block-level resource which is used up by launching more threads: the register file. There is a single register file which all of a block's threads share, and the constraint is
ThreadsPerBlock x RegistersPerThread <= RegisterFileSize
It is not trivial to determine how many registers your kernel code is using; but as a rule of thumb, if you use "a lot" of local variables, function call parameters etc., you might hit the above limit, and will not be able to schedule as many threads.
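If you want to see where a particular kernel stands relative to these limits, one option (a sketch of my own, with myKernel as a placeholder) is to ask the runtime for the kernel's per-thread register count and static shared memory usage and compare them against the device limits; compiling with nvcc -Xptxas -v prints the same per-kernel figures at build time.
#include <cstdio>

__global__ void myKernel(float *data)   // placeholder kernel
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    data[idx] *= 2.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);

    printf("Device limits: %d threads/block, %zu bytes shared mem/block, %d registers/block\n",
           prop.maxThreadsPerBlock, prop.sharedMemPerBlock, prop.regsPerBlock);
    printf("Kernel usage : %d registers/thread, %zu bytes static shared mem, max %d threads/block\n",
           attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);
    return 0;
}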

Related

Where does the shared memory of non-resident threadblocks go?

I am trying to understand how shared memory works when blocks use a lot of it.
My GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of shared memory. I force both blocks to communicate before finishing, but since they can't run in parallel, this should be a deadlock. The program, however, does terminate, whether I run 2 blocks or 1000.
Is this because block 1 is paused once it runs into the deadlock and swapped with block 2? If so, where does the 48 KB of data from block 1 go while block 2 is active? Is it stored in global memory?
Kernel:
__global__ void testKernel(uint8_t* globalmem_message_buffer, int n) {
    const uint32_t size = 48000;
    __shared__ uint8_t data[size];
    for (int i = 0; i < size; i++)
        data[i] = 1;
    globalmem_message_buffer[blockIdx.x] = 1;
    while (globalmem_message_buffer[(blockIdx.x + 1) % n] == 0) {}
    printf("ID: %d\n", blockIdx.x);
}
Host code:
int n = 2; // Still works with n=1000
cudaStream_t astream;
cudaStreamCreate(&astream);
uint8_t* globalmem_message_buffer;
cudaMallocManaged(&globalmem_message_buffer, sizeof(uint8_t) * n);
for (int i = 0; i < n; i++) globalmem_message_buffer[i] = 0;
cudaDeviceSynchronize();
testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Edit: Changed "threadIdx" to "blockIdx"
My GPU (RTX 2080 Ti) has 48 KB of shared memory per SM, and the same per threadblock. In my example below I have 2 blocks forced onto the same SM, each using the full 48 KB of shared memory.
That wouldn't happen. The general premise here is flawed. The GPU block scheduler only deposits a block on a SM when there are free resources sufficient to support that block.
An SM with 48KB of shared memory, that already has a block resident on it that uses 48KB of shared memory, will not get any new blocks of that type deposited on it, until the existing/resident block "retires" and releases the resources it is using.
Therefore in the normal CUDA scheduling model, the only way a block can be non-resident is if it has never been scheduled yet on a SM. In that case, it uses no resources, while it is waiting in the queue.
The exceptions to this would be in the case of CUDA preemption. This mechanism is not well documented, but would occur for example at the point of a context switch. In such a case, the entire threadblock state is somehow removed from the SM and stored somewhere else. However preemption is not applicable in the case where we are analyzing the behavior of a single kernel launch.
You haven't provided a complete code example. However, for the n=2 case, your claim that these two blocks will somehow be deposited on the same SM simply isn't true.
For the n=1000 case, your code only requires that a single location in memory be set to 1:
while (globalmem_message_buffer[(threadIdx.x + 1) % n] == 0) {}
threadIdx.x for your code is always 0, since you are launching threadblocks of only 1 thread:
testKernel<<<n, 1, 0, astream>>>(globalmem_message_buffer, n);
Therefore the index generated here is always 1 (for n greater than or equal to 2). All threadblocks are checking location 1. Therefore, when the threadblock whose blockIdx.x is 1 executes, all threadblocks in the grid will be "unblocked", because they are all testing the same location. In short, your code may not be doing what you think it is doing or what you intended. Even if you had each threadblock check the location of another threadblock, we can imagine a sequence of threadblock deposits that would satisfy this without requiring all n threadblocks to be simultaneously resident, so I don't think that would prove anything either. (There is no specified order for the block deposit sequence.)
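To see how the block scheduler treats a kernel with a 48000-byte static shared array, one thing you can do (a sketch of my own, not part of the original answer) is ask the runtime how many such blocks can ever be resident on one SM at a time:
#include <cstdio>
#include <cstdint>

// Same shared memory footprint as the kernel in the question: 48000 bytes, statically allocated.
__global__ void testKernel(uint8_t* globalmem_message_buffer, int n)
{
    const uint32_t size = 48000;
    __shared__ uint8_t data[size];
    for (uint32_t i = 0; i < size; i++)
        data[i] = 1;
    globalmem_message_buffer[blockIdx.x] = data[0];
}

int main()
{
    int maxBlocksPerSM = 0;
    // How many blocks of testKernel (1 thread each, no dynamic shared memory)
    // can be resident on a single SM at the same time?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, testKernel,
                                                  /*blockSize=*/1,
                                                  /*dynamicSMemSize=*/0);
    // Expect 1 on a device where two 48000-byte blocks cannot share one SM's shared memory.
    printf("Max resident blocks per SM: %d\n", maxBlocksPerSM);
    return 0;
}
Everything beyond that number simply waits in the queue, consuming no shared memory at all.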

How to properly coalesce writes from global memory into global memory?

Please bear with me; my English is not good.
My computing environment is:
CPU : Intel Xeon X5690 3.46 GHz x 2
OS : CentOS 5.8
GPU : NVIDIA GeForce GTX 580 (compute capability 2.0)
I have already read the sections about "coalesced memory access" in the CUDA C Programming Guide, but I can't apply them to my case.
I have 32x32 blocks per grid and 16x16 threads per block, which corresponds to the following code.
dim3 grid(32, 32);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
How can I get coalesced memory access in this setup?
I used the following code in my kernel.
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i*512+j] = ...;
I used the constant 512 because the total number of threads is 512x512 (grid size x block size).
But, I saw "Low Global Memory Store Efficiency[9.7% avg, for kernels accounting for 100% of compute]" from Visual Profiler.
Helper says using the coalesced memory access.
But, I cannot know what should I use the index context of the memory.
For more information for detail code, The result of an experiment different from CUDA Occupancy Calculator
Coalescing memory loads and stores in CUDA is a pretty straightforward concept - threads in the same warp need to load or store from/into suitably aligned, consecutive words in memory.
The warp size is 32 in CUDA, and warps are formed from threads within the same block, ordered so that the x dimension of threadIdx.{xyz} varies the fastest, the y the next fastest, and the z the slowest (functionally this is the same as column major ordering in arrays).
The code you have posted isn't achieving coalesced memory stores because threads within the same warp are storing with a pitch of 512 words, not within the required 32 consecutive words.
A simple hack to improve coalescing would be to address the memory in column major order, so:
int i = blockIdx.x*16 + threadIdx.x;
int j = blockIdx.y*16 + threadIdx.y;
...
global_memory[i+512*j] = ...;
A more general approach on a 2D block and grid to achieve coalescing in the spirit of what you showed in the question would be like this:
int tid_in_block      = threadIdx.x + threadIdx.y * blockDim.x;
int bid_in_grid       = blockIdx.x + blockIdx.y * gridDim.x;
int threads_per_block = blockDim.x * blockDim.y;
int tid_in_grid       = tid_in_block + threads_per_block * bid_in_grid;
global_memory[tid_in_grid] = ...;
The most appropriate solution will depend on other details of the code and data which you have not described.
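Put together as a complete kernel for the 32x32 grid of 16x16 blocks from the question, the coalesced store pattern could look like the sketch below (the kernel name, the float element type, and the value written are my assumptions, not from the question):
// Each thread computes one flat, grid-wide index so that consecutive threads
// in a warp store to consecutive words: a fully coalesced store.
__global__ void writeCoalesced(float *global_memory)
{
    int tid_in_block      = threadIdx.x + threadIdx.y * blockDim.x;
    int bid_in_grid       = blockIdx.x + blockIdx.y * gridDim.x;
    int threads_per_block = blockDim.x * blockDim.y;
    int tid_in_grid       = tid_in_block + threads_per_block * bid_in_grid;

    global_memory[tid_in_grid] = 1.0f;   // placeholder value
}

int main()
{
    const int total = 512 * 512;          // 32x32 blocks of 16x16 threads
    float *d_mem;
    cudaMalloc(&d_mem, total * sizeof(float));

    dim3 grid(32, 32);
    dim3 block(16, 16);
    writeCoalesced<<<grid, block>>>(d_mem);
    cudaDeviceSynchronize();

    cudaFree(d_mem);
    return 0;
}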

Memory coalescing while implementing FDTD equations

I was trying to implement FDTD equations on the GPU. I initially had implemented the kernel which used global memory. The memory coalescing wasn't that great. Hence I implemented another kernel which used shared memory to load the values. I am working on a grid of 1024x1024.
The code is below
__global__ void update_Hx(float *Hx, float *Ez, float *coef1, float* coef2){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;
    __shared__ float Ez_shared[BLOCKSIZE_HX][BLOCKSIZE_HY + 1];
    /*int top = offset + x_index_dim;*/
    if(threadIdx.y == (blockDim.y - 1)){
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
        Ez_shared[threadIdx.x][threadIdx.y + 1] = Ez[offset + x_index_dim];
    }
    else{
        Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
    }
}
The constants BLOCKSIZE_HX = 16 and BLOCKSIZE_HY = 16.
When I run the visual profiler, it still says that the memory is not coalesced.
EDIT:
I am using a GT 520 graphics card with CUDA compute capability 2.1.
My global L2 transactions per access = 7.5, i.e. there are 245,760 L2 transactions for 32,768 executions of the line
Ez_shared[threadIdx.x][threadIdx.y] = Ez[offset];
Global memory load efficiency is 50%.
Global memory load efficiency = 100 * gld_requested_throughput/ gld_throughput
I am not able to figure out why there are so many memory accesses, even though my threads are looking at 16 consecutive values. Can somebody point out what I am doing wrong?
EDIT: Thanks for all the help.
Your memory access pattern is the problem here. You are getting only 50% efficiency (for both L1 and L2) because you are accessing consecutive regions of 16 floats, that is 64 bytes, but the L1 transaction size is 128 bytes. This means that for every 64 bytes requested, 128 bytes must be loaded into L1 (and consequently also into L2).
You also have a problem with shared memory bank conflicts but that is currently not negatively affecting your global memory load efficiency.
You could solve the load efficiency problem in several ways. The easiest would be to change the x dimension block size to 32 (as sketched below). If that is not an option, you could change the global memory data layout so that every two consecutive blockIdx.y values ([0, 1], [2, 3], etc.) map to a contiguous memory block. If even that is not an option and you only need to load the global data once anyway, you could use non-cached global memory loads to bypass the L1 - that would help because L2 uses 32-byte transactions, so your 64 bytes would be loaded in two L2 transactions without overhead.
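As an illustration of the easiest fix, here is a self-contained sketch (a stand-in kernel of my own, not your actual update_Hx; the launch configuration is assumed since it isn't shown in the question). With blockDim.x = 32, each warp loads 32 consecutive floats, i.e. 128 contiguous bytes, which matches the L1 transaction size; the shared tile is also indexed [y][x] to avoid the bank conflicts mentioned above.
#define BLOCKSIZE_HX 32   // x block size widened to a full warp
#define BLOCKSIZE_HY 8    // keep 256 threads per block
#define X_DIM 1024
#define Y_DIM 1024

// Stand-in kernel with the same load pattern as update_Hx:
// each warp now reads 32 consecutive floats from Ez (one 128-byte L1 transaction).
__global__ void update_Hx_like(const float *Ez, float *out)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    __shared__ float Ez_shared[BLOCKSIZE_HY][BLOCKSIZE_HX];
    Ez_shared[threadIdx.y][threadIdx.x] = Ez[offset];
    __syncthreads();

    out[offset] = Ez_shared[threadIdx.y][threadIdx.x];
}

int main()
{
    float *Ez, *out;
    cudaMalloc(&Ez,  X_DIM * Y_DIM * sizeof(float));
    cudaMalloc(&out, X_DIM * Y_DIM * sizeof(float));

    dim3 block(BLOCKSIZE_HX, BLOCKSIZE_HY);
    dim3 grid(X_DIM / BLOCKSIZE_HX, Y_DIM / BLOCKSIZE_HY);
    update_Hx_like<<<grid, block>>>(Ez, out);
    cudaDeviceSynchronize();

    cudaFree(Ez);
    cudaFree(out);
    return 0;
}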

Why is global + shared faster than global alone

I need some help understanding the behavior of Ron Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I'm not understanding how the use of shared memory gives faster performance than the non-shared-memory version. That is, if I add a few more index calculation steps and another read/write cycle to access the shared memory, how can this be faster than just using global memory alone? The same number of read/write cycles access global memory in either case. The data is still accessed only once per kernel instance. Data still goes in and out through global memory. The number of kernel instances is the same. The register count looks to be the same. How can adding more processing steps make it faster? (We are not subtracting any processing steps.) Essentially we are doing more work, and it is getting done faster.
Shared memory access is much faster than global memory access, but its cost is not zero (or negative).
What am I missing?
The 'slow' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}
The 'fast' code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    extern __shared__ int s_data[];
    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;
    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];
    // Block until all threads in the block have written their data to shared mem
    __syncthreads();
    // Write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}
Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case where i-th thread in a half warp accesses i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half warp, instead of just one memory transaction.
Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
The source of your code sample is the article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market in 2009, so the writer of the article was clearly talking about devices with compute capability < 1.2.
For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.
This is because shared memory is closer to the compute units, so its latency and peak bandwidth will not be the bottleneck for this computation (at least in the case of matrix multiplication).
But most importantly, a lot of the numbers in a tile are reused by many threads, so if you read them from global memory you are retrieving the same numbers multiple times. Writing them once to shared memory eliminates that wasted bandwidth usage.
Looking at the global memory accesses, the slow code reads forwards and writes backwards. The fast code both reads and writes forwards. I think the fast code is faster because the cache hierarchy is optimized, in some way, for accessing global memory in ascending order (towards higher memory addresses).
CPUs do some speculative fetching, where they will fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.

Copying whole global memory buffer many times to shared memory buffer

I have a buffer in global memory that I want to copy into shared memory for each block so as to speed up my read-only access. Each thread in each block will use the whole buffer, at different positions, concurrently.
How does one do that?
I know the size of the buffer only at run time:
__global__ void foo( int *globalMemArray, int N )
{
    extern __shared__ int s_array[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if( idx < N )
    {
        ...?
    }
}
The first point to make is that shared memory is limited to a maximum of either 16 KB or 48 KB per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured, so unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
The second point to make is that the contents of shared memory only have the scope and lifetime of the block they are associated with. Your sample kernel only has a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the lifespan of the block that filled it, or that you intend to write the results of the block calculations back into the same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistent results. It is probably better to think of shared memory as a small, block-scope, fully programmer-managed L1 cache than as some sort of faster version of global memory.
With those points out of the way, a kernel which loads successive segments of a large input array, processes them, and then writes a per-thread final result back to global memory might look something like this:
template <int blocksize>
__global__ void foo( int *globalMemArray, int *globalMemOutput, int N )
{
    __shared__ int s_array[blocksize];
    int npasses = (N / blocksize) + (((N % blocksize) > 0) ? 1 : 0);

    for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) {
        if( pos < N ) {
            s_array[threadIdx.x] = globalMemArray[pos];
        }
        __syncthreads();

        // Calculations using partial buffer contents
        .......

        __syncthreads();
    }

    // write final per thread result to output
    globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;
}
In this case I have specified the shared memory array size as a template parameter, because it isn't really necessary to dynamically allocate the shared memory array size at runtime, and the compiler has a better chance at performing optimizations when the shared memory array size is known at compile time (perhaps in the worst case there could be selection between different kernel instances done at run time).
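If the block size genuinely isn't known until run time, that selection between kernel instances can be a simple host-side dispatch over a few pre-instantiated block sizes. A sketch (the helper name launch_foo and the set of supported sizes are my own choices), assuming the templated foo kernel above is in scope:
// Hypothetical host-side dispatcher: pick the compiled instantiation of foo<>
// that matches the block size chosen at run time.
void launch_foo(int *d_in, int *d_out, int N, int blocksize, int gridsize)
{
    switch (blocksize) {
        case 128: foo<128><<<gridsize, 128>>>(d_in, d_out, N); break;
        case 256: foo<256><<<gridsize, 256>>>(d_in, d_out, N); break;
        case 512: foo<512><<<gridsize, 512>>>(d_in, d_out, N); break;
        default:  break;   // unsupported block size: handle the error here
    }
}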
The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.