Optimizing CUDA Execution Configuration [closed]

I’m trying to learn CUDA in order to speed up the solution of some stochastic systems of differential equations for my PhD.
I will be using A100 GPUs, which have 128 SMs, each with 64K registers and 164 KB of shared memory. I will be referring to these as memory resources (I'm not sure whether I should also have other things in mind when talking about resources).
First, I have a general question concerning the best CUDA execution configuration.
I have been reading the book Professional CUDA C Programming by Cheng, Grossman and McKercher, in which they state:
“Avoid small block sizes: Start with at least 128 or 256 threads per block” 
and
 “Keep the number of blocks much greater than the number of SMs to expose sufficient parallelism to your device”.
The first sentence obviously refers to the need for enough warps on an SM to keep good occupancy in order to hide latencies.
However, I would like to validate my understanding of the second sentence. Is the following way of thinking correct:
Assume I have a streaming multiprocessor that has enough memory resources for 2048 of my threads to run concurrently. This means I’m able to use threads that each use fewer than 64K/2048 = 32 registers and less than 164 KB/2048 ≈ 82 B of shared memory.
Since 2048 threads is also the maximum number of resident threads per SM (64 warps), I can assume that occupancy is high enough.
So what is the difference between having 4 blocks of 512 threads and 2 blocks of 1024 threads for the SM? In both cases I have the same number of warps, so those two approaches expose the SM to the same level of parallelism, right?
Similarly, if I only have enough resources for 1024 threads, there is actually no difference between 1 block of 1024 threads, 2 blocks of 512 and 4 blocks of 256 threads. In this case I will just need twice the number of SMs to run 2048 threads (threads using the same memory resources).
A difference can arise when the resources limit the number of threads that can run concurrently on an SM to below the maximum number of threads per block. For example, assume there are only enough memory resources for 256 threads. Now, 4 blocks of 256 threads is obviously much better than 2 blocks of 512 threads, because they can be spread over more SMs to expose more parallelism.
So it is the limited amount of resources that favors increasing the number of blocks? Is that how I should understand this sentence?
If this is true, the way to expose the most parallelism is to minimize the resources needed per thread, or to subdivide the program into smaller independent threads when reasonable.
Now suppose that, based on the application, we can determine the ideal per-thread workload we can work with.
Based on the available memory resources, we have then determined that we want X threads to run on each SM, split into Y blocks, in order to have good occupancy.
Using the CUDA execution configuration we can only give CUDA a number of blocks and a number of threads per block. So should we expect <<<128*Y, X>>> to do what I described?
To make it concrete, let’s assume we calculate that the memory resources allow us to have 256 independent threads on a single SM. Therefore, we want 1 block of 256 threads to run on a single SM. Then we would choose a grid dimension of 128 or more and a block dimension of 256 threads (X = 256, Y ≥ 1).
Is this way of thinking correct?

A100 GPUs, which have 128 SMs
A100 GPUs have 108 SMs. The A100 die has 128 SMs possible, but not all 128 are exposed in any actual product.
So what is the difference between having 4 blocks of 512 threads and 2 blocks with 1024 threads for the SM? In both cases I have the same number of warps so those two approaches expose the SM to the same level of parallelism right?
Yes, given your stipulations (max occupancy = 2048 threads/SM).
For example, assume there are only enough memory resources for 256 threads. Now, 4 blocks of 256 threads is obviously much better than 2 blocks of 512 threads, because they can be spread over more SMs to expose more parallelism. So it is the limited amount of resources that favors increasing the number of blocks? Is that how I should understand this sentence?
Given your stipulation ("only enough memory resources for 256 threads"), the case of two threadblocks of 512 threads would fail to launch. However, for these very small grid sizes, 8 blocks of 128 threads might be better than 4 blocks of 256 threads, because 8 blocks of 128 threads could conceivably bring 8 SMs to bear on the problem (depending on your GPU), whereas 4 blocks of 256 threads could only bring 4 SMs to bear on the problem.
To make it concrete, let’s assume we calculate that the memory resources allow us to have 256 independent threads on a single SM. Therefore, we want 1 block of 256 threads to run on a single SM. Then we would choose a grid dimension of 128 or more and a block dimension of 256 threads (X=256, Y=1+).
Yes, if your GPU has 108 SMs, then grid sizes of 108 * N blocks, where N is a positive integer, would probably be sensible/reasonable choices for the number of blocks. For N of 2 or larger, this would also tend to satisfy the 2nd statement given in the book:
“Keep the number of blocks much greater than the number of SMs to expose sufficient parallelism to your device”.
(This statement is a general statement, and is not advanced with a particular limit on block size or threads per SM in mind. If you truly have a limit due to code design of 256 threads per SM, and your threadblock size is 256, then N = 1 should be sufficient for "full occupancy".)
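As a minimal sketch of that sizing logic (the kernel name myKernel and the 256-thread block size are placeholders, not taken from the original post), the SM count can be queried at runtime rather than hard-coding 108, and the occupancy API reports how many 256-thread blocks of a given kernel actually fit per SM:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* placeholder kernel */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // device 0
    printf("SMs: %d\n", prop.multiProcessorCount);     // 108 on an A100

    int blocksPerSM = 0;
    // Max resident 256-thread blocks per SM for myKernel, given its
    // register and shared memory usage (0 bytes of dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
    printf("256-thread blocks per SM: %d\n", blocksPerSM);

    // Grid sized as N times the SM count, per the suggestion above.
    int N = 2;
    myKernel<<<prop.multiProcessorCount * N, 256>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}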
Kernel designs using a grid stride loop will often give you the flexibility to choose grid size independently of problem size.
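A minimal grid-stride sketch (the saxpy kernel and variable names are illustrative; it assumes d_x and d_y are device arrays of length n and that smCount holds the SM count queried as in the previous sketch):

__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Each thread starts at its global index and strides by the total
    // number of threads in the grid, so any grid size covers all n elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

// Launch with a grid that is a multiple of the SM count, independent of n:
// saxpy<<<smCount * 2, 256>>>(n, 2.0f, d_x, d_y);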

Related

Is there a correlation between the exact meaning of gpu wave and thread block?

computation performed by a GPU kernel is partitioned into groups of threads called thread blocks, which typically execute in concurrent groups, resulting in waves of execution
What exactly does wave mean here? Isn't that the same meaning as warp?
A GPU can execute a maximum number of threads, grouped in a maximum number of thread blocks. When the whole grid for a kernel is larger than the maximum of either of those limits, or if there are concurrent kernels occupying the GPU, it will launch as many thread blocks as possible. When the last thread of a block has terminated, a new block will start.
Since blocks typically have equal run times and scheduling has a certain latency, this often results in bursts of activity on the GPU that you can see in the occupancy. I believe this is what is meant by that sentence.
Do not confuse this with the term "wavefront" which is what AMD calls a warp.
Wave: a group of thread blocks running concurrently on GPU.
Full Wave: (number of SMs on the device) x (max active blocks per SM)
Launching a grid with fewer thread blocks than a full wave results in low achieved occupancy. A launch is usually composed of some number of full waves and possibly one incomplete wave. It is worth mentioning that the size of a full wave depends on how many blocks can fit on one SM, given the registers per thread, shared memory per block, etc.
If we look at Julien Demouth's blog post and use those values to understand the issue:
max # of threads per SM: 2048 (NVIDIA Tesla K20)
kernel has 4 blocks of 256 threads per SM
Theoretical Occupancy: 50% (4 * 256 / 2048)
Full Wave: (# of SMs) x (max active blocks per SM) = 13x4 = 52 blocks
The kernel is launched with 128 blocks, so there are 2 full waves and 1 incomplete wave with 24 blocks. The full-wave value may be increased by using the __launch_bounds__ attribute or by configuring the amount of shared memory per SM (on some devices; see also the related report), etc.
Also, the incomplete wave is called the partial last wave, and it has a negative effect on performance due to low occupancy. This underutilization of the GPU is called the tail effect, and it is especially dominant when launching few thread blocks in a grid.
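A rough sketch of that wave arithmetic (the kernel name and the 128-block launch are taken from the example above; the occupancy call is the runtime way to obtain the blocks-per-SM limit):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel() { /* placeholder kernel */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blocksPerSM = 0;  // limited by registers, shared memory, etc.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, 256, 0);

    int fullWave   = prop.multiProcessorCount * blocksPerSM;  // e.g. 13 * 4 = 52 on a K20
    int gridBlocks = 128;
    printf("full waves: %d, tail blocks: %d\n",
           gridBlocks / fullWave, gridBlocks % fullWave);      // 2 and 24 in the example
    return 0;
}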

Why use thread blocks larger than the number of cores per multiprocessor

I have a Nvidia GeForce GTX 960M graphics card, which has the following specs:
Multiprocessors: 5
Cores per multiprocessor: 128 (i.e. 5 x 128 = 640 cores in total)
Max threads per multiprocessor: 2048
Max block size (x, y, z): (1024, 1024, 64)
Warpsize: 32
If I run 1 block of 640 threads, then a single multiprocessor gets a workload of 640 threads, but will run concurrently only 128 threads at a time. However, if I run 5 blocks of 128 threads then each multiprocessor gets a block and all 640 threads are run concurrently. So, as long as I create blocks of 128 threads, the distribution of threads per multiprocessor can be as even as possible (assuming at least 640 threads in total).
My question then is: why would I ever want to create blocks of sizes larger than the number of cores per multiprocessor (as long as I'm not hitting the max number of blocks per dimension)?
If I run 1 block of 640 threads, then a single multiprocessor gets a workload of 640 threads, but will run concurrently only 128 threads at a time.
That isn't correct. All 640 threads run concurrently. The SM has instruction latency and is pipelined, so that all threads are active and have state simultaneously. Threads are not tied to a specific core and the execution model is very different from a conventional multi-threaded CPU execution model.
However, if I run 5 blocks of 128 threads then each multiprocessor gets a block and all 640 threads are run concurrently.
That may happen, but it is not guaranteed. All blocks will run. What SM they run on is determined by the block scheduling mechanism, and those heuristics are not documented.
So, as long as I create blocks of 128 threads, the distribution of threads per multiprocessor can be as even as possible (assuming at least 640 threads in total).
From the answers above, that does not follow either.
My question then is: why would I ever want to create blocks of sizes larger than the number of cores per multiprocessor (as long as I'm not hitting the max number of blocks per dimension)?
Because threads are not tied to cores, the architecture has a lot of latency and requires a significant number of threads in flight to hide all that latency and reach peak performance. Unfortunately, basically none of the premises in your question are correct or relevant to determining the optimal number of blocks or their size for a given device.

CUDA purpose of manually specifying thread blocks

I just started learning CUDA and there is something I can't quite understand yet. I was wondering whether there is a reason for splitting threads into blocks besides optimizing GPU workload. Because if there isn't, I can't understand why you would need to manually specify the number of blocks and their sizes. Wouldn't it be better to simply supply the number of threads needed to solve the task and let the GPU distribute the threads over the SMs?
That is, consider the following dummy task and GPU setup.
number of available SMs: 16
max number of blocks per SM: 8
max number of threads per block: 1024
Let's say we need to process every entry of a 256x256 matrix and we want a thread assigned to every entry, i.e. the overall number of threads is 256x256 = 65536. Then the number of blocks is:
overall number of threads / max number of threads per block = 65536 / 1024 = 64
Finally, 64 blocks will be distributed among 16 SMs, making it 8 blocks per SM. Now these are trivial calculations that the GPU could handle automatically, right?
The only other reason for manually supplying the number of blocks and their sizes, that I can think of, is separating threads in a specific fashion in order for them to have shared local memory, i.e. somewhat isolating one block of threads from another block of threads.
But surely there must be another reason?
I will try to answer your question from the point of view what I understand best.
The major factor that decides the number of threads per block is the multiprocessor occupancy. The occupancy of a multiprocessor is calculated as the ratio of active warps to the maximum number of active warps that is supported. The threads of a warp may be active or dormant for many reasons depending on the application. Hence a fixed structure for the number of threads may not be viable.
Besides, each multiprocessor has a fixed number of registers shared among all the threads of that multiprocessor. If the total number of registers needed exceeds the maximum, the application is liable to fail.
Further to the above, the fixed shared memory available to a given block may also affect the decision on the number of threads, in case the shared memory is heavily used.
Hence, a naive way to decide the number of threads is to simply use the occupancy calculator spreadsheet, in case you want to be completely oblivious to the type of application at hand. The other, better option would be to consider the occupancy along with the type of application being run.
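For reference, the runtime occupancy API can play the role of the spreadsheet and suggest a block size for a particular kernel; a minimal sketch (myKernel is a placeholder name):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(int n, float *data) { /* placeholder kernel */ }

int main() {
    int minGridSize = 0, blockSize = 0;
    // Returns the block size that maximizes occupancy for myKernel and the
    // minimum grid size needed to reach that occupancy on the current device.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    printf("suggested block size: %d, minimum grid size: %d\n",
           blockSize, minGridSize);
    return 0;
}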

resident warps per SM in (GK20a GPU) tegra k1

How many resident warps are present per SM in (GK20a GPU) tegra k1?
As per the documents, I got the following information:
In the Tegra K1 there is 1 SMX with 192 cores per multiprocessor
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Can anyone specify the value of the maximum number of blocks per SMX?
Is 32 * 4 = 128 (number of threads per warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently? If no, how many threads run concurrently?
Kindly help me to solve and understand it.
Can anyone specify the value of the maximum number of blocks per SMX?
The maximum number of resident blocks per multiprocessor is 16 for Kepler (cc 3.x) devices.
Is 32 * 4 = 128 (number of threads per warp * number of warps, as Kepler allows four warps to be issued and executed concurrently) the number of threads running concurrently? If no, how many threads run concurrently?
There is a difference between what can be issued in a given clock cycle and what may be executing "concurrently".
Since instruction execution is pipelined, multiple instructions from multiple different warps can be executing at any point in the pipeline(s).
Kepler has 4 warp schedulers, each of which can issue up to two instructions from a given warp (4 warps total for 4 warp schedulers, up to 2 instructions per issue slot, for a maximum of 8 instructions that can be issued per clock cycle).
Up to 64 warps (32 threads per warp x 64 warps = 2048 max threads per multiprocessor) can be resident (i.e. open and schedulable) per multiprocessor. This is also the maximum number that may be currently executing (at various phases of the pipeline) at any given moment.
So, at any given instant, instructions from any of the 64 (maximum) available warps can be in various stages of execution, in the various pipelines for the various functional units in a Kepler multiprocessor.
However, the maximum instruction issue per clock cycle per multiprocessor for Kepler is 4 warp schedulers x (max) 2 instructions = 8 warp instructions, i.e. 8 * 32 = 256 thread instructions. In practice, well-optimized codes don't usually achieve this maximum, but an average of 4-6 instructions per issue slot (i.e. per clock cycle) may in practice be achievable.
Each block deployed for execution to an SM requires certain resources, either registers or shared memory. Let's imagine the following situation:
each thread of a certain kernel uses 64 32-bit registers (256 B of register memory),
the kernel is launched with blocks of 1024 threads,
obviously such a block would consume 256 B * 1024 = 256 KB of registers on a particular SM.
I don't know about Tegra, but in the case of the card I am using now (a GK110 chip), every SM has 65536 32-bit registers (~256 KB) available; therefore, in the scenario above, all of the registers would get used by a single block deployed to this SM, so the limit of blocks per SM would be 1 in this case...
The example with shared memory works the same way: in the kernel launch parameters you can define the amount of shared memory used by each block, so if you set it to 32 KB, then two blocks could be deployed to an SM in the case of a 64 KB shared memory size. Worth mentioning is that, as of now, I believe only blocks from the same kernel can be deployed to one SM at the same time.
I am not sure at the moment whether there is some blocking factor other than registers or shared memory, but obviously, if the blocking factor for registers is 1 and for shared memory is 2, then the lower number is the limit on the number of blocks per SM.
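A rough sketch of that calculation (heavyKernel and the 1024-thread block size are placeholders; register allocation granularity is ignored, so treat the results as estimates):

#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void heavyKernel(float *data) { /* placeholder kernel */ }

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, heavyKernel);  // registers per thread, static shared memory per block

    int threadsPerBlock = 1024;
    // Blocking factor from registers: registers per SM / (registers per thread * block size).
    int regLimit  = attr.numRegs > 0
                  ? prop.regsPerMultiprocessor / (attr.numRegs * threadsPerBlock)
                  : INT_MAX;
    // Blocking factor from shared memory: shared memory per SM / static shared memory per block.
    int smemLimit = attr.sharedSizeBytes > 0
                  ? (int)(prop.sharedMemPerMultiprocessor / attr.sharedSizeBytes)
                  : INT_MAX;

    printf("blocks per SM limited by registers: %d, by shared memory: %d\n",
           regLimit, smemLimit);
    return 0;
}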
As for your second question, how many threads can run concurrently, the answer is: as many as there are cores in one SM, so in the case of an SMX in the Kepler architecture it is 192. The number of concurrent warps is obviously 192 / 32 = 6.
If you are interested in this stuff, I advise you to use the Nsight profiling tool, where you can inspect all kernel launches, their blocking factors, and much more useful info.
EDIT:
Reading Robert Crovella's answer, I realized there really are these limits for blocks per SM and threads per SM, but I was never able to reach them as my kernels typically used too many registers or too much shared memory. Again, these values can be investigated using Nsight, which displays all the useful info about available CUDA devices, but such info can also be found (for example, in the case of the GK110 chip) on NVIDIA's pages in the related document.

Why bother to know about CUDA Warps?

I have GeForce GTX460 SE, so it is: 6 SM x 48 CUDA Cores = 288 CUDA Cores.
It is known that one warp contains 32 threads, and that only one warp per block can be executed simultaneously (at a time).
That is, can a single multiprocessor (SM) simultaneously execute only one block, one warp and only 32 threads, even if there are 48 cores available?
In addition, to address a specific thread and block one can use threadIdx.x and blockIdx.x; to allocate them, one uses kernel <<< Blocks, Threads >>> ().
But how does one allocate a specific number of warps and distribute them, and if that is not possible, then why bother to know about warps?
The situation is quite a bit more complicated than what you describe.
The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They keep the results of many computations or operations at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started a long time ago (around 20 cycles for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
The warp schedulers (yellow in the image) can schedule 2 * 32 threads per warp = 64 threads to the pipelines per cycle. So that's the number of results that can be obtained per clock. So, given that there are a mix of computing resources, 48 core, 16 LD/ST, 8 SFU, each which have different latencies, a mix of warps are being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.
The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
Adding to the complexity, warps that are executing instructions for which there are fewer than 32 resources, must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so that means that a warp containing an instruction that requires the SFUs must be scheduled 4 times.
This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".
So, coming to your actual question,
why bother to know about Warps?
Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance (a short sketch illustrating them follows the list):
In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads that is a multiple of the number of threads in a warp. If you don't, you end up launching a block that contains inactive threads.
In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.
In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory.
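A short sketch illustrating these three rules (the kernel and variable names are illustrative): consecutive threads touch consecutive 32-bit words, the block size is a multiple of the warp size, and all threads in a warp follow the same path except at the final boundary check.

__global__ void scale(int n, float s, const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive addresses
    if (i < n) {                                    // only the last warp can diverge here
        out[i] = s * in[i];                         // coalesced: one 128-byte line per warp
    }
}

// Launch with a block size that is a multiple of 32, e.g. 256:
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;       // round up so all n elements are covered
// scale<<<blocks, threads>>>(n, 2.0f, d_in, d_out);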
Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?
Yes, the programming model guarantees that the threads are grouped into warps in that specific order.
As a simple example of optimizing divergent code paths, can one use the separation of all the threads in the block into groups of 32 threads? For example: switch (threadIdx.x / 32) { case 0: /* warp 0 */ break; case 1: /* warp 1 */ break; /* etc. */ }
Exactly :)
How many bytes must be read at one time for a single warp: 4 bytes * 32 threads, 8 bytes * 32 threads, or 16 bytes * 32 threads? As far as I know, one transaction to global memory at one time receives 128 bytes.
Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).
This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.