Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) [closed] - cuda

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
How are threads organized to be executed by a GPU?

Hardware
If a GPU device has, for example, 4 multiprocessing units, and they can run 768 threads each: then at a given moment no more than 4*768 threads will be really running in parallel (if you planned more threads, they will be waiting their turn).
Software
threads are organized in blocks. A block is executed by a multiprocessing unit.
The threads of a block can be indentified (indexed) using 1Dimension(x), 2Dimensions (x,y) or 3Dim indexes (x,y,z) but in any case xyz <= 768 for our example (other restrictions apply to x,y,z, see the guide and your device capability).
Obviously, if you need more than those 4*768 threads you need more than 4 blocks.
Blocks may be also indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter
the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are
being executed simultaneously).
Now a simple case: processing a 512x512 image
Suppose we want one thread to process one pixel (i,j).
We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks
(so to have 512x512 threads = 4096*64)
It's common to organize (to make indexing the image easier) the threads in 2D blocks having blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.
dim3 threadsPerBlock(8, 8); // 64 threads
and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.
dim3 numBlocks(imageWidth/threadsPerBlock.x, /* for instance 512/8 = 64*/
imageHeight/threadsPerBlock.y);
The kernel is launched like this:
myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );
Finally: there will be something like "a queue of 4096 blocks", where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.
In the kernel the pixel (i,j) to be processed by a thread is calculated this way:
uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;

Suppose a 9800GT GPU:
it has 14 multiprocessors (SM)
each SM has 8 thread-processors (AKA stream-processors, SP or cores)
allows up to 512 threads per block
warpsize is 32 (which means each of the 14x8=112 thread-processors can schedule up to 32 threads)
https://www.tutorialspoint.com/cuda/cuda_threads.htm
A block cannot have more active threads than 512 therefore __syncthreads can only synchronize limited number of threads. i.e. If you execute the following with 600 threads:
func1();
__syncthreads();
func2();
__syncthreads();
then the kernel must run twice and the order of execution will be:
func1 is executed for the first 512 threads
func2 is executed for the first 512 threads
func1 is executed for the remaining threads
func2 is executed for the remaining threads
Note:
The main point is __syncthreads is a block-wide operation and it does not synchronize all threads.
I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp handle the scheduling. To my understanding it's more accurate to say: func1 is executed at least for the first 512 threads.
Before I edited this answer (back in 2010) I measured 14x8x32 threads were synchronized using __syncthreads.
I would greatly appreciate if someone test this again for a more accurate piece of information.

Related

When can threads of a warp get scheduled independently on Volta+?

Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the
same instruction for all active threads in a warp just as before, retaining the execution efficiency
of previous architectures
From my understanding, this implies that if there is no divergence within threads of a warp,
(i.e. all threads of a warp are active), the threads should execute in lockstep.
Now, consider listing 8 from this blog post, reproduced below:
unsigned tid = threadIdx.x;
int v = 0;
v += shmem[tid+16]; __syncwarp(); // 1
shmem[tid] = v; __syncwarp(); // 2
v += shmem[tid+8]; __syncwarp(); // 3
shmem[tid] = v; __syncwarp(); // 4
v += shmem[tid+4]; __syncwarp(); // 5
shmem[tid] = v; __syncwarp(); // 6
v += shmem[tid+2]; __syncwarp(); // 7
shmem[tid] = v; __syncwarp(); // 8
v += shmem[tid+1]; __syncwarp(); // 9
shmem[tid] = v;
Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without
any of the __syncwarp() calls.
This seems to contradict the statement I quote above.
I would appreciate if someone can clarify this confusion?
From my understanding, this implies that if there is no divergence within threads of a warp, (i.e. all threads of a warp are active), the threads should execute in lockstep.
If all threads in a warp are active for a particular instruction, then by definition there is no divergence. This has been true since day 1 in CUDA. It's not logical in my view to connect your statement with the one you excerpted, because it is a different case:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
This indicates that the active threads are in lockstep. Divergence is still possible. The inactive threads (if any) would be somehow divergent from the active threads. Note that both of these statements are describing the CUDA SIMT model and they have been correct and true since day 1 of CUDA. They are not specific to the Volta execution model.
For the remainder of your question, I guess instead of this:
I would appreciate if someone can clarify this confusion?
You are asking:
Why is the syncwarp needed?
Two reasons:
As stated near the top of that post:
Thread synchronization: synchronize threads in a warp and provide a memory fence. __syncwarp
A memory fence is needed in this case, to prevent the compiler from "optimizing" shared memory locations into registers.
The CUDA programming model provides no specified order of thread execution. It would be a good idea for you to acknowledge that statement as ground truth. If you write code that requires a specific order of thread execution (for correctness), and you don't provide for it explicitly in your source code as a programmer, your code is broken. Regardless of the way it behaves or what results it produces.
The volta whitepaper is describing the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may ensure things that are not guaranteed by the programming model.

Why does this kernel not achieve peak IPC on a GK210?

I decided that it would be educational for me to try to write a CUDA kernel that achieves peak IPC, so I came up with this kernel (host code omitted for brevity but is available here)
#define WORK_PER_THREAD 4
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
i *= WORK_PER_THREAD;
if (i < n)
{
#pragma unroll
for(int j=0; j<WORK_PER_THREAD; j++)
y[i+j] = a * x[i+j] + y[i+j];
}
}
I ran this kernel on a GK210, with n=32*1000000 elements, and expected to see an IPC of close to 4, but ended up with a lousy IPC of 0.186
ubuntu#ip-172-31-60-181:~/ipc_example$ nvcc saxpy.cu
ubuntu#ip-172-31-60-181:~/ipc_example$ sudo nvprof --metrics achieved_occupancy --metrics ipc ./a.out
==5828== NVPROF is profiling process 5828, command: ./a.out
==5828== Warning: Auto boost enabled on device 0. Profiling results may be inconsistent.
==5828== Profiling application: ./a.out
==5828== Profiling result:
==5828== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "Tesla K80 (0)"
Kernel: saxpy_parallel(int, float, float*, float*)
1 achieved_occupancy Achieved Occupancy 0.879410 0.879410 0.879410
1 ipc Executed IPC 0.186352 0.186352 0.186352
I was even more confused when I set WORK_PER_THREAD=16, resulting in less threads launched, but 16, as opposed to 4, independent instructions for each to execute, the IPC dropped to 0.01
My two questions are:
What is the peak IPC I can expect on a GK210? I think it is 8 = 4 warp schedulers * 2 instruction dispatches per cycle, but I want to be sure.
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
What is the peak IPC I can expect on a GK210?
The peak IPC per SM is equal to the number of warp schedulers in an SM times the issue rate of each warp scheduler. This information can be found in the whitepaper for a particular GPU. The GK210 whitepaper is here. From that document (e.g. SM diagram on p8) we see that each SM has 4 warp schedulers capable of dual issue. Therefore the peak theoretically achievable IPC is 8 instructions per clock per SM. (however as a practical matter even for well-crafted codes, you're unlikely to see higher than 6 or 7).
Why does this kernel achieve such low IPC while achieved occupancy is high, why does IPC decrease as WORK_PER_THREAD increases, and how can I improve the IPC of this kernel?
Your kernel requires global transactions at nearly every operation. Global loads and even L2 cache loads have latency. When everything you do is dependent on those, there is no way to avoid the latency, so your warps are frequently stalled. The peak observable IPC per SM on a GK210 is somewhere in the vicinity of 6, but you won't get that with continuous load and store operations. Your kernel does 2 loads, and one store (12 bytes total moved), for each multiply/add. You won't be able to improve it. (Your kernel has high occupancy because the SMs are loaded up with warps, but low IPC because those warps are frequently stalled, unable to issue an instruction, waiting for latency of load operations to expire.) You'll need to find other useful work to do.
What might that be? Well if you do a matrix multiply operation, which has considerable data reuse and a relatively low number of bytes per math op, you're likely to see better measurements.
What about your code? Sometimes the work you need to do is like this. We'd call that a memory-bound code. For a kernel like this, the figure of merit to use for judging "goodness" is not IPC but achieved bandwidth. If your kernel requires a particular number of bytes loaded and stored to perform its work, then if we compare the kernel duration to just the memory transactions, we can get a measure of goodness. Stated another way, for a pure memory bound code (i.e. your kernel) we would judge goodness by measuring the total number of bytes loaded and stored (profiler has metrics for this, or for a simple code you can compute it directly by inspection), and divide that by the kernel duration. This gives the achieved bandwidth. Then, we compare that to the achievable bandwidth based on a proxy measurement. A possible proxy measurement tool for this is bandwidthTest CUDA sample code.
As the ratio of these two bandwidths approaches 1.0, your kernel is doing "well", given the memory bound work it is trying to do.

CUDA: Why is striding faster than contiguous memory access? [duplicate]

What is "coalesced" in CUDA global memory transaction? I couldn't understand even after going through my CUDA guide. How to do it? In CUDA programming guide matrix example, accessing the matrix row by row is called "coalesced" or col.. by col.. is called coalesced?
Which is correct and why?
It's likely that this information applies only to compute capabality 1.x, or cuda 2.0. More recent architectures and cuda 3.0 have more sophisticated global memory access and in fact "coalesced global loads" are not even profiled for these chips.
Also, this logic can be applied to shared memory to avoid bank conflicts.
A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is oversimple, but the correct way to do it is just have consecutive threads access consecutive memory addresses.
So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below
0 1 2 3
4 5 6 7
8 9 a b
could be done row after row, like this, so that (r,c) maps to memory (r*4 + c)
0 1 2 3 4 5 6 7 8 9 a b
Suppose you need to access element once, and say you have four threads. Which threads will be used for which element? Probably either
thread 0: 0, 1, 2
thread 1: 3, 4, 5
thread 2: 6, 7, 8
thread 3: 9, a, b
or
thread 0: 0, 4, 8
thread 1: 1, 5, 9
thread 2: 2, 6, a
thread 3: 3, 7, b
Which is better? Which will result in coalesced reads, and which will not?
Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. The second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.
Memory coalescing is a technique which allows optimal usage of the global memory bandwidth.
That is, when parallel threads running the same instruction access to consecutive locations in the global memory, the most favorable access pattern is achieved.
The example in Figure above helps explain the coalesced arrangement:
In Fig. (a), n vectors of length m are stored in a linear fashion. Element i of vector j is denoted by v j i. Each thread in GPU kernel is assigned to one m-length vector. Threads in CUDA are grouped in an array of blocks and every thread in GPU has a unique id which can be defined as indx=bd*bx+tx, where bd represents block dimension, bx denotes the block index and tx is the thread index in each block.
Vertical arrows demonstrate the case that parallel threads access to the first components of each vector, i.e. addresses 0, m, 2m... of the memory. As shown in Fig. (a), in this case the memory access is not consecutive. By zeroing the gap between these addresses (red arrows shown in figure above), the memory access becomes coalesced.
However, the problem gets slightly tricky here, since the allowed size of residing threads per GPU block is limited to bd. Therefore coalesced data arrangement can be done by storing the first elements of the first bd vectors in consecutive order, followed by first elements of the second bd vectors and so on. The rest of vectors elements are stored in a similar fashion, as shown in Fig. (b). If n (number of vectors) is not a factor of bd, it is needed to pad the remaining data in the last block with some trivial value, e.g. 0.
In the linear data storage in Fig. (a), component i (0 ≤ i < m) of vector indx
(0 ≤ indx < n) is addressed by m × indx +i; the same component in the coalesced
storage pattern in Fig. (b) is addressed as
(m × bd) ixC + bd × ixB + ixA,
where ixC = floor[(m.indx + j )/(m.bd)]= bx, ixB = j and ixA = mod(indx,bd) = tx.
In summary, in the example of storing a number of vectors with size m, linear indexing is mapped to coalesced indexing according to:
m.indx +i −→ m.bd.bx +i .bd +tx
This data rearrangement can lead to a significant higher memory bandwidth of GPU global memory.
source: "GPU‐based acceleration of computations in nonlinear finite element deformation analysis." International journal for numerical methods in biomedical engineering (2013).
If the threads in a block are accessing consecutive global memory locations, then all the accesses are combined into a single request(or coalesced) by the hardware. In the matrix example, matrix elements in row are arranged linearly, followed by the next row, and so on.
For e.g 2x2 matrix and 2 threads in a block, memory locations are arranged as:
(0,0) (0,1) (1,0) (1,1)
In row access, thread1 accesses (0,0) and (1,0) which cannot be coalesced.
In column access, thread1 accesses (0,0) and (0,1) which can be coalesced because they are adjacent.
The criteria for coalescing are nicely documented in the CUDA 3.2 Programming Guide, Section G.3.2. The short version is as follows: threads in the warp must be accessing the memory in sequence, and the words being accessed should >=32 bits. Additionally, the base address being accessed by the warp should be 64-, 128-, or 256-byte aligned for 32-, 64- and 128-bit accesses, respectively.
Tesla2 and Fermi hardware does an okay job of coalescing 8- and 16-bit accesses, but they are best avoided if you want peak bandwidth.
Note that despite improvements in Tesla2 and Fermi hardware, coalescing is BY NO MEANS obsolete. Even on Tesla2 or Fermi class hardware, failing to coalesce global memory transactions can result in a 2x performance hit. (On Fermi class hardware, this seems to be true only when ECC is enabled. Contiguous-but-uncoalesced memory transactions take about a 20% hit on Fermi.)

How cuda handle __syncthreads() in kernel?

Think i have a block with 1024 size and assume my gpu has 192 cuda cores.
How cuda handle __syncthreads() in kernels when cuda cores size is lower than block size?
__global__ void staticReverse(int *d, int n)
{
__shared__ int s[1024];
int t = threadIdx.x;
int tr = n-t-1;
s[t] = d[t];
__syncthreads();
d[t] = s[tr];
}
How 'tr' remaining in local memory?
I think you are mixing a few things.
First of all, GPU having 192 CUDA cores is the total core count. Each block however maps to a single Streaming Multiprocessor (SM) which may have a lower core count (depending on the GPU generation).
Let us assume that you own a Pascal GPU which has 64 cores per SM and you have 3
SMs.
A single block maps to a single SM. So you will have 64 cores handling 1024 threads concurrently. Such an SM has enough registers to hold all the necessary data for 1024 threads, but it has only 64 cores which quickly swap which threads they are handling.
This way all the local data, e.g. tr can remain in memory.
Now, because of this quick swapping and concurrent execution, it may happen -- completely by accident -- that some threads get ahead of others. If you want to ensure that at certain point all threads are at the same spot, you use __syncthreads(). All that function does is to instruct the scheduler to properly assign work to the CUDA cores so that they all are at that spot in program at some moment.

What is the number of registers in CUDA CC 5.0?

I have a GeForce GTX 745 (CC 5.0).
The deviceQuery command shows that the total number of registers available per block is 65536 (65536 * 4 / 1024 = 256KB).
I wrote a kernel that uses an array of size 10K and the kernel is invoked as follows. I have tried two ways of allocating the array.
// using registers
fun1() {
short *arr = new short[100*100]; // 100*100*sizeof(short)=256K / per block
....
delete[] arr;
}
fun1<<<4, 64>>>();
// using global memory
fun2(short *d_arr) {
...
}
fun2<<<4, 64>>>(d_arr);
I can get the correct result in both cases.
The first one which uses registers runs much faster.
But when invoking the kernel using 6 blocks I got the error code 77.
fun1<<<6, 64>>>();
an illegal memory access was encountered
Now I'm wondering, actually how many of registers can I use? And how is it related to the number of blocks?
The important misconception in your question is that the new operator somehow uses registers to store memory allocated at runtime on the device. It does not. Registers are only allocated statically by the compiler. The new operator uses a dedicated heap for device allocation.
In detail: In your code, fun1, the first line is invoked by all threads, hence each thread of each block would allocate 10,000 16 bits values, that is 1,280,000 bytes per block. For 4 blocks, that make 5,120,000 bytes, for 6 that makes 7,680,000 bytes which for some reason seems to overflow the preallocated limit (default limit is 8MB - see Heap memory allocation). This may be why you get this Illegal Access Error (77).
Using new will make use of some preallocated global memory as malloc would, but not registers - maybe the code you provided is not exactly the one you run. If you want registers, you need to define the data in a fixed array:
func1()
{
short arr [100] ;
...
}
The compiler will then try to fit the array in registers. Note however that this register data is per thread, and maximum number of 32 bits registers per thread is 255 on your device.