Conversion from block dimensions to warps in CUDA [duplicate] - cuda

This question already has answers here:
How are 2D / 3D CUDA blocks divided into warps?
(2 answers)
Closed 7 years ago.
I'm a little confused regarding how blocks of certain dimensions are mapped to warps of size 32.
I have read and experienced first hand that the inner dimension of a block being a multiple of 32 improves performance.
Say I create a block with dimensions 16x16.
Can a warp contain threads from two different y-dimensions, e.g. 1 and 2 ?
Why would having an inner dimension of 32 improve performance even though there technically are enough threads to be scheduled to a warp?

Your biggest question has already been answered in About warp and threads and How are CUDA threads divided into warps?, so I have focused this answer on the why.
A block in CUDA always occupies a whole number of warps. The warp size is implementation defined, and the number 32 is mainly related to shared memory organization, data access patterns and data flow control [1].
So, a block size that is a multiple of 32 does not improve performance by itself, but it means that all of the scheduled threads will be used for something. Note that "used for something" depends on what you do with the threads within the block.
A block size that is not a multiple of 32 is rounded up to the nearest multiple, even if you request fewer threads. The GPU Optimization Fundamentals presentation by Cliff Woolley from the NVIDIA Developer Technology Group has interesting hints about performance.
In addition, memory operations and instructions are executed per warp, so you can understand the importance of this number. I think the reason why it is 32 and not 16 or 64 is undocumented, so I like to remember the warp size as "The Answer to the Ultimate Question of Life, the Universe, and Everything" [2].
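To make the mapping concrete, here is a minimal sketch (the kernel name is made up) showing how a 16x16 block is cut into warps along the linearized thread index; it also answers the question above: with an inner dimension of 16, a warp does span two consecutive y values.

#include <cstdio>

// Sketch of the x-major linearization rule described in the linked duplicates:
// threads are numbered threadIdx.y * blockDim.x + threadIdx.x and then cut
// into groups of warpSize (32).
__global__ void show_warp_mapping()
{
    int linear_id = threadIdx.y * blockDim.x + threadIdx.x; // 0..255 for a 16x16 block
    int warp_id   = linear_id / warpSize;                   // 0..7
    int lane      = linear_id % warpSize;                   // 0..31

    // For blockDim = (16,16), warp 0 holds the rows y = 0 and y = 1,
    // warp 1 holds y = 2 and y = 3, and so on.
    if (lane == 0)
        printf("warp %d starts at (x=%d, y=%d)\n", warp_id, threadIdx.x, threadIdx.y);
}

int main()
{
    show_warp_mapping<<<1, dim3(16, 16)>>>();
    cudaDeviceSynchronize();
    return 0;
}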
[1] David B. Kirk and Wen-mei W. Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Elsevier, 2010.
[2] Douglas Adams. The Hitchhiker's Guide to the Galaxy.

Related

Optimizing Cuda Execution Configuration [closed]

Closed 1 year ago.
I'm trying to learn CUDA in order to speed up the resolution of some stochastic systems of differential equations for my PhD.
I will be using A100 GPUs, which have 128 SMs, each with 64K registers and 164KB of shared memory. I will be referring to those as memory resources (not sure if I should also have other things in mind when talking about resources).
First, I have a general question concerning the best CUDA execution configuration.
I have been reading the book Professional CUDA C Programming by Cheng, Grossman, and McKercher, in which they state:
“Avoid small block sizes: Start with at least 128 or 256 threads per block” 
and
 “Keep the number of blocks much greater than the number of SMs to expose sufficient parallelism to your device”.
The first sentence obviously refers to the need for enough warps on an SM to keep good occupancy in order to hide latencies.
However, I would like to validate my understanding of the second sentence. Is the following way of thinking correct:
Assume I have a streaming multiprocessor that has enough memory resources for 2048 of my threads to run concurrently. This means I'm able to use threads that use less than 64K/2048 registers and 164KB/2048 of shared memory.
Since 2048 threads (64 warps) also corresponds to the maximum number of resident threads per SM, I can assume that occupancy is high enough.
So what is the difference between having 4 blocks of 512 threads and 2 blocks with 1024 threads for the SM? In both cases I have the same number of warps, so those two approaches expose the SM to the same level of parallelism, right?
Similarly, if I only have enough resources for 1024 threads, there is actually no difference between 1 block of 1024 threads, 2 blocks of 512, and 4 blocks of 256 threads. In this case I will just need twice the number of SMs to run 2048 threads (threads using the same memory resources).
Where a difference can appear is when the resources limit the number of threads that can run concurrently on an SM to fewer than the maximum number of threads per block. For example, assume there are only enough memory resources for 256 threads. Now, 4 blocks of 256 threads is obviously much better than 2 blocks of 512 threads, because they can be spread over more SMs to expose more parallelism.
So it is the limited amount of resources which favors increasing the number of blocks? Is that how I should understand this sentence?
If this is true, the way to expose the most parallelism is to minimize the resources needed per thread, or to subdivide the program into smaller independent threads when reasonable.
Now suppose that based on the application we can determine the ideal thread size we can work with.
Based on the available memory resources, we have then determined that we want X threads to run on each SM, split into Y blocks, in order to have good occupancy.
Using the CUDA execution configuration we can only give CUDA a number of blocks and a number of threads per block. So should we expect <<<128*Y, X>>> to do what I described?
To make it concrete, let's assume we calculate that the memory resources allow us to have 256 independent threads on a single SM. Therefore, we want 1 block of 256 threads to run on a single SM. Then we would choose a grid dimension of 128 or more and a block dimension of 256 threads (X=256, Y=1+).
Is this way of thinking correct?
A100 GPUs, which have 128 SMs
A100 GPUs have 108 SMs. The A100 die has 128 SMs possible, but not all 128 are exposed in any actual product.
So what is the difference between having 4 blocks of 512 threads and 2 blocks with 1024 threads for the SM? In both cases I have the same number of warps, so those two approaches expose the SM to the same level of parallelism, right?
Yes, given your stipulations (max occupancy = 2048 threads/SM).
For example, assume there are only enough memory resources for 256 threads. Now, 4 blocks of 256 threads is obviously much better than 2 blocks of 512 threads, because they can be spread over more SMs to expose more parallelism. So it is the limited amount of resources which favors increasing the number of blocks? Is that how I should understand this sentence?
Given your stipulation ("only enough memory resources for 256 threads"), the case of two threadblocks of 512 threads would fail to launch. However, for these very small grid sizes, 8 blocks of 128 threads might be better than 4 blocks of 256 threads, because 8 blocks of 128 threads could conceivably bring 8 SMs to bear on the problem (depending on your GPU), whereas 4 blocks of 256 threads could only bring 4 SMs to bear on the problem.
To make it concrete, let’s assume we calculate that the memory resources allow us to have 256 independent threads on a single SM. Therefore, we want 1 block of 256 threads to run on a single SM. Then we would choose a grid dimension of 128 or more and a block dimension of 256 threads (X=256, Y=1+).
Yes, if your GPU has 108 SMs, then grid sizing choices of 108 * N where N is a positive integer would probably be sensible/reasonable, for the number of blocks. For N of 2 or larger, this would also tend to satisfy the 2nd statement given in the book:
“Keep the number of blocks much greater than the number of SMs to expose sufficient parallelism to your device”.
(This statement is a general statement, and is not advanced with a particular limit on block size or threads per SM in mind. If you truly have a limit due to code design of 256 threads per SM, and your threadblock size is 256, then N = 1 should be sufficient for "full occupancy".)
Kernel designs using a grid stride loop will often give you the flexibility to choose grid size independently of problem size.
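For illustration, here is a minimal sketch of that pattern (the kernel body and the choice of 2 blocks per SM are arbitrary examples): the grid size is derived from the SM count at run time, and the grid-stride loop lets the same launch configuration cover any problem size.

#include <cstdio>

// Grid-stride loop: the placeholder body (y[i] += a * x[i]) stands in for real work.
__global__ void scale_add(float a, const float *x, float *y, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x)
        y[i] += a * x[i];
}

int main()
{
    int dev = 0, num_sms = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev); // 108 on an A100

    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    // Grid size chosen from the SM count (here N = 2 blocks per SM), independent of n.
    int threads = 256;
    int blocks  = 2 * num_sms;
    scale_add<<<blocks, threads>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}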

NSIGHT: What are those red and black colours in kernel-level experiments?

I am trying to learn NSIGHT.
Can someone tell me what these red marks are indicating in the following screenshot taken from the User Guide? There are two red marks in the Occupancy per SM section and two in the Warps section, as you can see.
Similarly, what are those black lines of varying length indicating?
Another example from the same page:
Here is the basic explanation:
Grey bars represent the available amount of resources your particular device has (due to both its hardware and its compute capability).
Black bars represent the theoretical limit that it is possible to achieve for your kernel under your launch configuration (blocks per grid and threads per block).
The red dots represent the resources that you are using.
For instance, looking at "Active warps" on the first picture:
Grey: The device supports 64 active warps concurrently.
Black: Because of the register usage, it is theoretically possible to have 64 active warps.
Red: You achieve 63.56 active warps.
In this case, the grey bar is under the black one, so you can't see the grey one.
In some cases it can happen that the theoretical limit is greater than the device limit. This is OK. You can see examples on the second picture (block limit (shared memory) and block limit (registers)). That makes sense if you consider that your kernel uses only a small fraction of your resources: if one block used 1 register, it would be possible to launch 65536 blocks (without taking other factors into account), but the device limit is still 16. Then, the number 128 comes from 65536/512. The same applies to the shared memory section: since you use 0 bytes of shared memory per block, you could launch an infinite number of blocks according to the shared memory limitation.
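As a back-of-the-envelope sketch of where such a "theoretical" bar can come from (the 512 registers per block below is inferred from the 65536/512 above and is an assumption, not a value read off the screenshots):

#include <algorithm>
#include <cstdio>

int main()
{
    int regs_per_sm    = 65536; // registers available on one SM
    int regs_per_block = 512;   // what the kernel needs per block (assumed)
    int device_limit   = 16;    // max resident blocks per SM for this device

    int reg_limit = regs_per_sm / regs_per_block;      // 128: the "theoretical" black bar
    int resident  = std::min(reg_limit, device_limit); // 16: what the device actually allows

    printf("register-limited blocks: %d, resident blocks: %d\n", reg_limit, resident);
    return 0;
}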
About blank spaces
The theoretical and the achieved values are the same for all rows except for "Active warps" and "Occupancy".
You are really executing 1024 threads per block with 32 warps per block on the first picture.
In the case of Occupancy and Active warps, I guess the achieved number is a kind of statistical measure. I think that is because of the nature of the CUDA model. In CUDA, each thread within a warp is executed simultaneously on an SM, and the way of hiding high-latency operations, such as memory reads, is through "almost-free" warp context switches. I guess it must be difficult to take an exact measure of the number of active warps in that situation. Besides hardware concepts, we also have to take the kernel implementation into account: branch divergence, for instance, could make one warp slower than others, etc.
Extended information
As you saw, these numbers are closely related to your device-specific hardware and compute capability, so perhaps a concrete example could help here:
A device with CCC 3.0 can handle a maximum of 2048 threads per SM, 16 blocks per SM, and 64 warps per SM. You also have a maximum number of registers available to use (65536 in that case).
This Wikipedia entry is a handy place to check the features of each CCC.
You can query these parameters using the deviceQuery utility sample code provided with the CUDA toolkit or, at execution time, using the CUDA API, as in the sketch below.
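For example, a minimal host-side sketch using cudaGetDeviceProperties (the fields printed are just a selection):

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0

    printf("compute capability:    %d.%d\n", prop.major, prop.minor);
    printf("multiprocessors (SMs): %d\n",    prop.multiProcessorCount);
    printf("max threads per SM:    %d\n",    prop.maxThreadsPerMultiProcessor);
    printf("max threads per block: %d\n",    prop.maxThreadsPerBlock);
    printf("registers per SM:      %d\n",    prop.regsPerMultiprocessor);
    printf("shared memory per SM:  %zu\n",   prop.sharedMemPerMultiprocessor);
    printf("warp size:             %d\n",    prop.warpSize);
    return 0;
}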
Performance considerations
The thing is that, ideally, 16 blocks of 128 threads could be executed using no more than 32 registers per thread. That means a high occupancy rate. In most cases your kernel needs more than 32 registers per thread, so it is no longer possible to execute 16 blocks concurrently on the SM; the reduction is then done at block-level granularity, i.e., by decreasing the number of blocks. And this is what the bars capture.
You can play with the number of threads and blocks, or even with the __launch_bounds__ qualifier to optimize your kernel, or you can use the --maxrregcount compiler setting to lower the number of registers used by a single kernel to see if it improves overall execution speed.
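For reference, a minimal sketch of both knobs (the numbers 256 and 4 are arbitrary examples, and my_kernel is a made-up name):

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells the compiler
// to cap register usage so that blocks of up to 256 threads can have at least
// 4 blocks resident per SM.
__global__ void __launch_bounds__(256, 4) my_kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f; // placeholder body
}

// Alternatively, cap registers for the whole compilation unit from the command line:
//   nvcc --maxrregcount=32 my_kernel.cu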

Does CUDA automatically load-balance for you?

I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel what do I need to load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques would you use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
// Shared scratch space for compacting the indices of the threads that actually have work
// (tmp is sized here for up to 256 threads per block).
__shared__ int tmp[256];
__shared__ int tmp_counter;

if (threadIdx.x == 0) tmp_counter = 0;
__syncthreads();

// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
    int tmp_index = atomicAdd(&tmp_counter, 1);
    tmp[tmp_index] = threadIdx.x;
}
__syncthreads();

// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
    int thread_index = tmp[threadIdx.x];
    do_work(thread_index); // thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD). SIMD means that there is a single instruction decoding unit in the hardware controlling multiple arithmetic and logic units (ALUs). A CUDA "core" is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp of threads will execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently-executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will have no speedup from parallelism at all within that warp. It will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within the warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads are following the same code path.
The reason that the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power). Having smaller cores that use less power per core means that more cores can be packed onto the chip. Having small cores like this is what allows GPUs to have hundreds or thousands of cores per chip while CPUs only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade-off with SIMD is that you can pack a lot more ALUs onto the chip and get a lot more parallelism, but you only get the speedup when those ALUs are all executing the same code path. The reason this trade-off is made to such a high degree for GPUs is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process to compute each output value of the resultant matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often even matrix multiplication itself) also happen to appear commonly in scientific and engineering applications. This is why General Purpose processing on GPUs (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs, which were already being mass produced for the gaming industry, could also be used to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads not in the currently executed branch is then lost. You can check the CUDA Programming Guide; it explains quite well what exactly happens.
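A tiny sketch of the difference (a() and b() are placeholder device functions):

__device__ void a() {}  // placeholder branches
__device__ void b() {}

__global__ void divergence_demo()
{
    // Divergent: lanes 0-15 and 16-31 of the same warp take different paths,
    // so the two paths are executed one after the other.
    if (threadIdx.x % 32 < 16) a(); else b();

    // Not divergent: all 32 lanes of any given warp take the same path,
    // so nothing is serialized (different warps may still take different paths).
    if ((threadIdx.x / 32) % 2 == 0) a(); else b();
}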
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
A warp has to be SIMD (single instruction, multiple data) to achieve optimal performance, whereas the warps inside a block can be completely divergent from one another; however, they share some other resources (shared memory, registers, etc.).
So in general, for a given call to a kernel what do I need to load balance?
I don't think load balance is the right word here. Just make sure that you always have enough threads being executed and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could execute m threads, with m = 0..N*0.05, each picking a random number and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
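Here is a minimal sketch of that idea using the cuRAND device API (logf stands in for the "complicated function", and collisions between threads that pick the same index are simply ignored, which a real implementation would need to handle):

#include <curand_kernel.h>

// Each of the ~N*0.05 threads picks one random index of x and applies logf in place.
__global__ void sample_and_log(float *x, int N, unsigned long long seed)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= N / 20) return;              // only 5% of N threads do work

    curandState state;
    curand_init(seed, m, 0, &state);      // independent sequence per thread
    int idx = curand(&state) % N;         // random point: a scattered, uncoalesced access

    x[idx] = logf(x[idx]);
}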
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
Your example is a really good way of showing off reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from global memory; I will show you why.
512 threads read 512 values from global memory (1) and store them in shared memory (2); then, for step (3), 256 threads read random values from shared memory and store them in shared memory again. You do this until you end up with one thread, which writes the result back to global memory (4).
You could extend this further by having, at the initial step, 256 threads read two values each, or 128 threads read 4 values each, etc.
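For reference, here is a minimal sketch of the general shared-memory halving pattern being described, with a plain sum standing in for the random "pick 50% and log" step and a fixed block size of 512 threads:

#define BLOCK 512

// Classic shared-memory tree reduction: each step halves the number of active
// threads until thread 0 writes the block's result back to global memory.
__global__ void block_reduce(const float *in, float *out)
{
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;

    s[tid] = in[blockIdx.x * BLOCK + tid];   // (1) one global read per thread
    __syncthreads();                         // (2) data now lives in shared memory

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {  // (3) halve each step
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];              // (4) single write back to global memory
}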

thread & block configuration requirements [closed]

Closed 10 years ago.
I am developing a program in which I am calling a function that generates random binary numbers.
The total count will be provided at run time, e.g. 1,000 or 1,000,000.
After generating the random numbers, I need to calculate the total number of 0s and the total number of 1s using counters.
I have the following queries:
How many threads, blocks, and grids should I allot?
Do I need 2D threads, or can it work with 1D threads only?
What will each thread do? I feel it should check whether a particular value is 1 or 0. Does this sound right?
How should I use warps or the tiling method?
I'm guessing this might be a homework question, especially based on the only other question you've posted on SO.
How many threads/blocks/grids? The answer to this question depends on your thread strategy. What will each thread do? For problems that produce a large amount of output, like image processing or matrix multiply, a common thread strategy is to assign each thread to do the work to create one output point. But this problem only produces a small number of output values (2, it seems) and is in a category of problems including reductions, stream compactions, and histograms. These problems are often solved in two steps (maybe 2 kernels...) and a common thread strategy (at least for the first step or kernel) is to assign one thread to each input point. But see also my answer to 2 below. Once you know how many threads you need, it's common to pick some number of threads per block like 256 or 512 (definitely use a power of 2), and then create enough blocks so that the number of threads per block times the number of blocks is equal to or larger than the problem size (number of input points in this case).
2D or 1D? Your problem isn't inherently 2D in nature, so a 1D grid of threads is a reasonable starting point. However in a 1D grid of threads, the maximum number of threads you can create in the grid is limited to the max grid X dimension for the GPU you are using, times the number of threads per block. These numbers are typically something like 65535 and 1024, so after about 64M elements of input points you'll run out of threads. It's not hard to convert to using a 2D grid structure at this point, which will increase the number of possible threads to a size that is bigger than the GPU can handle at once. However another strategy rather than switching to a 2D grid of threadblocks is to retain a 1D grid of threadblocks, but have each thread process multiple input points/elements, probably using a loop in your kernel code. If your loop can handle up to 512 elements for example, then 65535x1024x512 should cover your problem size. This is also a convenient thread strategy for this type of problem, because a thread can keep a local copy of the intermediate results it creates (the counts of ones and zeros so far) without interference or synchronization with other threads.
My suggestion based on the above is that a single thread would execute a loop, and each pass of the loop would look at an element, and update local variables that contain the counts of ones and zeros. This would be the first part of a 2-part algorithm. The second part would then have to collect these intermediate results. You will want to give some thought to how the second part will collect the results from the first part. For example, at the completion of the kernel, you may want to store the intermediate results back to global memory.
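A rough sketch of that two-part structure (the kernel and variable names are made up; the collection step is done here with two global atomic counters rather than a second kernel):

// Part 1: each thread strides over the input and keeps private counts.
// Part 2: the per-thread counts are folded into two global counters with atomics
// (a block-level reduction before the atomics would reduce contention).
__global__ void count_bits(const int *data, int n,
                           unsigned int *ones, unsigned int *zeros)
{
    unsigned int my_ones = 0, my_zeros = 0;

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
    {
        if (data[i]) my_ones++;
        else         my_zeros++;
    }

    atomicAdd(ones,  my_ones);   // collect the intermediate results
    atomicAdd(zeros, my_zeros);
}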
warps/tiling? Warps refer to the grouping of threads into units of 32 threads for execution. This will happen automatically for you. You should arrange your algorithm such that when you are reading values from global memory (or writing values to global memory) that each thread reads (or writes) in a consecutive, contiguous block. That is thread 0 reads from location 0, thread 1 from the next location, etc. If you don't do anything unusual in your threads, this will happen more or less automatically for you. The data storage created by cudaMalloc will be properly aligned, and if your array indexing strategy is something like a[thread_number] then you will get aligned and coalesced accesses across the warp, which is recommended to get good speed out of the GPU. Tiling refers to a process of organizing data accesses to accentuate locality, which is usually beneficial for cache-dependent architectures. If you do a good job of memory coalescing you won't be depending on the cache much.
If you can spare the time, the CUDA C programming guide is a very readable document and will expose you to the basic concepts needed for good GPU programming. Also there are webinars on the nvidia web site which can cover the important material here in about 2 hours. Also, thrust can conveniently handle problems like this with a minimum of coding effort (in C++), but I'm guessing that's outside the scope of what you're trying to do right now.

how does CUDA schedule its threads

I've got a few questions regarding CUDA's scheduling system.
A. When I use, for example, the foo<<<255, 255>>>() function, what actually happens inside the card? I know that each SM receives a block to schedule from the upper level, and that each SM is responsible for scheduling its incoming block, but which part does it? If, for example, I've got 8 SMs, each of which contains 8 small CPUs, is the upper level responsible for scheduling the remaining 255*255 - (8 * 8) threads?
B. What's the maximum number of threads that one can define? I mean foo<<<X, Y>>>(); X, Y = ?
C. Regarding the last example, how many threads can be inside one block? Can we say that the more blocks/threads we have, the faster the execution will be?
Thanks for your help
A. The compute work distributor will distribute a block from the grid to an SM. The SM will convert the block into warps (WARP_SIZE = 32 on all NVIDIA GPUs). On Fermi 2.0 GPUs, each SM has two warp schedulers which share a set of data paths. Every cycle, each warp scheduler picks a warp and issues an instruction to one of the data paths (please don't think in terms of CUDA cores). On Fermi 2.1 GPUs, each warp scheduler has independent data paths as well as a set of shared data paths. On 2.1, every cycle each warp scheduler will pick a warp and attempt to dual-issue instructions for that warp.
The warp schedulers attempt to optimize the use of the data paths. This means that a single warp may execute multiple instructions in back-to-back cycles, or the warp scheduler can choose to issue from a different warp every cycle.
The number of warps/threads that each SM can handle is specified in the CUDA Programming Guide v.4.2 Table F-1. This scales from 768 threads to 2048 threads (24-64 warps).
B. The maximum threads per launch is defined by the maximum GridDims * the maximum threads per block. See Table F-1 or refer to the documentation for cudaGetDeviceProperties.
C. See the same resources as (B). The optimum distribution of threads per block is determined by how you partition your algorithm and is influenced by the occupancy calculation. There are observable performance impacts related to the problem size, the number of warps on the SM, and the amount of time spent blocked at instruction barriers (among other things). For starters I recommend at least 2 blocks per SM and ~50% occupancy.
B. It depends on your device. You can use the CUDA function cudaGetDeviceProperties to see the specifications for your device. A common maximum is Y = 1024 threads per block and X = 65535 blocks per grid dimension.
C. A common practice is to use a power of 2 (128, 256, 512, etc.) threads per block. Reducing large arrays is very effective that way (see Reduction). The optimum distribution of blocks and threads actually depends on your application and your hardware. I personally use 512 threads/block for large sparse linear algebra computations on a Tesla M2050 since it's the most efficient for my applications.