Thread hierarchy design in a CUDA kernel

Assuming a block has a limit of 512 threads, and my kernel needs more than 512 threads for execution, how should one design the thread hierarchy for optimal performance?
(case 1)
1st block - 512 threads
2nd block - remaining threads
(case 2) distribute an equal number of threads across both blocks.

I don't think the split itself really matters; it is more important to group the thread blocks logically, so that you are able to use other CUDA optimizations (like memory coalescing).
This link provides some insight into how CUDA will (likely) organize your threads.
A quote from the summary:
To summarize, special parameters at a kernel launch define the dimensions of a grid and its blocks. Unique coordinates in blockId and threadId variables allow threads of a grid to distinguish among them. It is the programmer's responsibility to use these variables in the kernel functions so that the threads can properly identify the portion of the data to process. These variables compel the programmers to organize threads and their data into hierarchical and multi-dimensional organizations.

It is preferable to divide the threads equally between the two blocks, in order to maximize the computation / memory access overlap. When you have, for instance, 256 threads in a block, they do not all compute at the same time; they are scheduled on the SM in warps of 32 threads. When a warp is waiting for global memory data, another warp is scheduled. If you have a small block of threads, your global memory accesses are a lot more penalizing.
Furthermore, in your example you underuse your GPU. Just remember that a GPU has dozens of multiprocessors (e.g. 30 for the Tesla C1060), and a block is mapped to a multiprocessor. In your case, you will only use 2 multiprocessors.
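A minimal sketch of what this looks like in practice (kernel name, data and sizes are hypothetical): the threads are split into several equally sized blocks, each a multiple of the warp size, and the leftover threads in the last block are simply masked off with a bounds check.

#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)                                      // extra threads in the last block do nothing
        data[idx] = 2.0f * data[idx];
}

int main()
{
    const int n = 1000;                  // needs more than one 512-thread block
    const int threadsPerBlock = 256;     // equal-sized blocks, a multiple of 32
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 4 blocks
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}

This way more multiprocessors get whole warps to work with, which helps hide global memory latency.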

Related

CUDA purpose of manually specifying thread blocks

I just started learning CUDA and there is something I can't quite understand yet. I was wondering whether there is a reason for splitting threads into blocks besides optimizing GPU workload, because if there isn't, I can't understand why you would need to manually specify the number of blocks and their sizes. Wouldn't it be better to simply supply the number of threads needed to solve the task and let the GPU distribute the threads over the SMs?
That is, consider the following dummy task and GPU setup.
number of available SMs: 16
max number of blocks per SM: 8
max number of threads per block: 1024
Let's say we need to process every entry of a 256x256 matrix and we want a thread assigned to every entry, i.e. the overall number of threads is 256x256 = 65536. Then the number of blocks is:
overall number of threads / max number of threads per block = 65536 / 1024 = 64
Finally, 64 blocks will be distributed among 16 SMs, making it 4 blocks per SM. These are trivial calculations that the GPU could handle automatically, right?
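In host code, that sizing boils down to a one-line ceiling division (a sketch; the kernel name is a placeholder):

const int totalThreads    = 256 * 256;   // 65536 matrix entries, one thread each
const int threadsPerBlock = 1024;        // device maximum in this example
const int numBlocks = (totalThreads + threadsPerBlock - 1) / threadsPerBlock;  // 64
// processMatrix<<<numBlocks, threadsPerBlock>>>(...);  // 64 blocks over 16 SMs = 4 per SM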
The only other reason for manually supplying the number of blocks and their sizes, that I can think of, is separating threads in a specific fashion in order for them to have shared local memory, i.e. somewhat isolating one block of threads from another block of threads.
But surely there must be another reason?
I will try to answer your question from the point of view of what I understand best.
The major factor that decides the number of threads per block is the multiprocessor occupancy. The occupancy of a multiprocessor is calculated as the ratio of the active warps to the maximum number of active warps that is supported. The threads of a warp may be active or dormant for many reasons depending on the application. Hence a fixed structure for the number of threads may not be viable.
Besides, each multiprocessor has a fixed number of registers shared among all the threads of that multiprocessor. If the total number of registers needed exceeds that maximum, the application is liable to fail.
Further to the above, the fixed shared memory available to a given block may also affect the decision on the number of threads, in case the shared memory is heavily used.
Hence a naive way to decide the number of threads is to simply use the occupancy calculator spreadsheet, in case you want to be completely oblivious to the type of application at hand. The better option would be to consider the occupancy along with the type of application being run.
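As a hedged aside (this helper only exists in later CUDA runtimes), the same occupancy heuristic can also be queried programmatically with cudaOccupancyMaxPotentialBlockSize, which plays the role of the spreadsheet for one specific kernel; the kernel below is a placeholder:

__global__ void myKernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] + 1.0f;
}

int suggestBlockSize()
{
    int minGridSize = 0, blockSize = 0;
    // Returns a block size that maximizes theoretical occupancy for myKernel,
    // taking its register and shared-memory usage into account.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
    return blockSize;
}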

CUDA: GPU Working

I have a basic question for my understanding. I apologize if some reference to the answer is provided in some documentation. I couldn't find anything related to this in C programming guide.
I have a Fermi architecture GPU, a GeForce GTX 470. It has
14 Streaming Multiprocessors
32 Stream Cores per SM
I wanted to understand the thread pre-emption mechanism with an example. Suppose I have the simplest kernel with a 'printf' statement (printing out the thread id). And I use the following dimensions for the grid and blocks:
dim3 grid, block;
grid.x = 14;
grid.y = 1;
grid.z = 1;
block.x = 32;
block.y = 1;
block.z = 1;
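The kernel itself would look something like this sketch (the name is a placeholder; printf in device code requires compute capability 2.0, which the GTX 470 has):

#include <cstdio>

__global__ void printTid()
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// launched as: printTid<<<grid, block>>>();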
So as I understand 14 blocks will be scheduled to 14 streaming multi-processors. And as each streaming multiprocessor has 32 cores, each core will execute one kernel (one thread). Is this correct?
If this is correct, then what happens in the following case?
grid.x = 14;
grid.y = 1;
grid.z = 1;
block.x = 64;
block.y = 1;
block.z = 1;
I understand that whatever number of blocks I assign to the grid, they will be scheduled without any particular sequence or prediction, because as soon as a resource bottleneck is encountered the GPU will schedule those blocks which do not require those resources.
1) Is the same criterion used for scheduling threads?
2) But like I mentioned, I have a printf statement and no common resource usage, so what happens in that case? After the first 32 threads are executed, are the remaining 32 threads executed?
3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for every y index first, and then the rest?
Can someone please comment on this?
So as I understand 14 blocks will be scheduled to 14 streaming multi-processors.
Not necessarily. A single block with 32 threads is not enough to saturate an SM, so multiple blocks may be scheduled on a single SM while some go unused. As you increase the number of blocks, you will get to a point where they get evenly distributed over all SMs.
And as each multiprocessor has 32 cores, each core will execute one kernel (one thread).
The CUDA cores are heavily pipelined so each core processes many threads at the same time. Each thread is in a different stage in the pipeline. There are also a varying number of different types of resources.
Taking a closer look at the Fermi SM (see below), you see the 32 CUDA Cores (marketing speak for ALUs), each of which can hold around 20 threads in their pipelines. But there are only 16 LD/ST (Load/Store) units and only 4 SFU (Special Function) units. So, when a warp gets to an instruction that is not supported by the ALUs, the warp will be scheduled multiple times. For instance, if the instruction requires the SFU units, the warp will be scheduled 8 (32 / 4) times.
I understand that whatever number of blocks I assign to the grid, they will be scheduled without any particular sequence or prediction, because as soon as a resource bottleneck is encountered the GPU will schedule those blocks which do not require those resources.
1) Is the same criterion used for scheduling threads?
Because the CUDA architecture guarantees that all threads in a block will have access to the same shared memory, a block can never move between SMs. When the first warp for a block has been scheduled on a given SM, all other warps in that block will be run on that same SM regardless of resources becoming available on other SMs.
2) But like I mentioned, I have a printf statement and no common resource usage, so what happens in that case? After the first 32 threads are executed, are the remaining 32 threads executed?
Think of blocks as sets of warps that are guaranteed to run on the same SM. So, in your example, the 64 threads (2 warps) of each block will be executed on the same SM. On the first clock, the first instruction of one warp is scheduled. On the second clock, that instruction has moved one step into the pipelines, so the resource that was used is free to accept either the second instruction from the same warp or the first instruction from the second warp. Since there are around 20 steps in the ALU pipelines on Fermi, 2 warps alone will not provide enough parallelism to fill all the stages of the pipeline, and they will probably not contain enough instruction-level parallelism (ILP) to do so either.
3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for every y index first, and then the rest?
The dimensions are only to enable offloading of generation of 2D and 3D thread indexes to dedicated hardware. The schedulers see the blocks as a 1D array of warps. The order in which they search for eligible warps is undefined. The scheduler will search in a fairly small set of "active" warps for a warp that has a current instruction that needs a resource that is currently open. When a warp is complete, a new one will be added to the active set. So, the order in which the warps are completed becomes unpredictable.
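A sketch of that linearization, following the thread-numbering rule in the programming guide (x varies fastest, then y, then z); the helper name is hypothetical:

__device__ int warpIdInBlock()
{
    int linear = threadIdx.x
               + threadIdx.y * blockDim.x
               + threadIdx.z * blockDim.x * blockDim.y;  // 1D index within the block
    return linear / warpSize;                            // each run of 32 consecutive indexes is one warp
}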
Fermi SM: (block diagram of the Fermi streaming multiprocessor; image not reproduced here)

How do I appropriately size and launch a CUDA grid?

First question:
Suppose I need to launch a kernel with 229080 threads on a Tesla C1060 which has compute capability 1.3.
So according to the documentation this machine has 240 cores, with 8 cores on each streaming multiprocessor, for a total of 30 SMs.
I can have up to 1024 threads per SM, for a total of 30720 threads running "concurrently".
Now if I define blocks of 256 threads that means I can have 4 blocks for each SM because 1024/256=4. So those 30720 threads can be arranged in 120 blocks across all SMs.
Now for my example of 229080 threads I would need 229080/256=~895 (rounded up) blocks to process all the threads.
Now let's say I want to call a kernel and I must use those 229080 threads, so I have two options. The first one is that I divide the problem and call the kernel ~8 times in a for loop, with a grid of 120 blocks and 30720 threads each time (229080/30720). That way I make sure the device stays completely occupied. The other option is to call the kernel with a grid of 895 blocks for the entire 229080 threads, in which case many blocks will remain idle until an SM finishes with the 8 blocks it has.
So which is the preferred option? Does it make any difference for those blocks to remain idle, waiting? Do they take up resources?
Second question
Let's say that within the kernel I'm calling I need to access non-coalesced global memory, so an option is to use shared memory.
I can then use each thread to extract a value from an array in global memory, say global_array, which is of length 229080. Now, if I understand correctly, you have to avoid branching when copying to shared memory, since all threads in a block need to reach the __syncthreads() call to make sure they can all access the shared memory.
The problem here is that for the 229080 threads I need exactly 229080/256=894.84375 blocks because there is a residue of 216 threads. I can round up that number and get 895 blocks and the last block will just use 216 threads.
But since I need to load values into shared memory from global_array, which is of length 229080, and I can't use a conditional statement to prevent the last 40 threads (256-216) of the last block from accessing illegal addresses in global_array, how can I circumvent this problem while working with shared memory loading?
So which is the preferred option? Does it make any difference for those blocks to remain idle, waiting? Do they take up resources?
A single kernel launch is preferred according to what you describe. Threadblocks queued up but not yet assigned to an SM don't take any resources you need to worry about, and the machine is definitely designed to handle situations just like that. The overhead of 8 kernel calls will definitely make that option slower, all other things being equal.
Now, if I understand correctly, you have to avoid branching when copying to shared memory, since all threads in a block need to reach the __syncthreads() call to make sure they can all access the shared memory.
This statement is not correct on the face of it. You can have branching while copying to shared memory. You just need to make sure that either:
The __syncthreads() is outside the branching construct, or,
The __syncthreads() is reached by all threads within the branching construct (which effectively means that the branch construct evaluates to the same path for all threads in the block, at least at the point where the __syncthreads() barrier is.)
Note that option 1 above is usually achievable, which makes code simpler to follow and easy to verify that all threads can reach the barrier.
But since I need to load values into shared memory from global_array, which is of length 229080, and I can't use a conditional statement to prevent the last 40 threads (256-216) of the last block from accessing illegal addresses in global_array, how can I circumvent this problem while working with shared memory loading?
Do something like this:
int idx = threadIdx.x + (blockDim.x * blockIdx.x);  // global thread index
if (idx < data_size)                                // only in-range threads load data
    shared[threadIdx.x] = global[idx];
__syncthreads();                                    // outside the branch: every thread reaches it
This is perfectly legal. All threads in the block, whether they are participating in the data copy to shared memory or not, will reach the barrier.
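For completeness, option 2 above can also be sketched: if the branch condition depends only on per-block values (here blockIdx.x, a hypothetical example, with 256-thread blocks assumed), every thread in a block takes the same path, so a __syncthreads() inside the branch is still reached by the whole block:

__global__ void uniformBranch(const float *global, float *out, int n)
{
    __shared__ float shared[256];             // assumes blockDim.x == 256
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (blockIdx.x < gridDim.x - 1) {         // uniform across the block: full blocks only
        shared[threadIdx.x] = global[idx];    // no bounds check needed for a full block
        __syncthreads();                      // barrier inside the branch, still safe
        out[idx] = shared[threadIdx.x];
    }
}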

Why bother to know about CUDA Warps?

I have GeForce GTX460 SE, so it is: 6 SM x 48 CUDA Cores = 288 CUDA Cores.
It is known that one warp contains 32 threads, and that in one block only one warp can be executed at a time.
That is, can a single multiprocessor (SM) simultaneously execute only one block, one warp and only 32 threads, even if there are 48 cores available?
In addition, threadIdx.x and blockIdx.x can be used to identify a specific thread and block, and they are allocated with kernel<<<Blocks, Threads>>>().
But how does one allocate a specific number of warps and distribute them, and if that is not possible, then why bother to know about warps?
The situation is quite a bit more complicated than what you describe.
The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They keep many operations in flight at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the result of another operation that was started a long time ago (around 20 cycles earlier for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
The warp schedulers (yellow in the image) can schedule 2 * 32 threads per warp = 64 threads to the pipelines per cycle. So that's the number of results that can be obtained per clock. So, given that there is a mix of computing resources, 48 cores, 16 LD/ST, 8 SFU, each of which has different latencies, a mix of warps is being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.
The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
Adding to the complexity, warps that are executing instructions for which there are fewer than 32 resources, must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so that means that a warp containing an instruction that requires the SFUs must be scheduled 4 times.
This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".
So, coming to your actual question,
why bother to know about Warps?
Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:
In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads per block that is a multiple of the number of threads in a warp. If you don't, you end up launching blocks that contain inactive threads.
In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.
In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory.
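A sketch of that last rule (kernel name is a placeholder): with this indexing, the 32 threads of a warp read 32 consecutive 32-bit words, which the hardware can coalesce into a single 128-byte transaction.

__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads -> consecutive addresses
    if (idx < n)
        out[idx] = in[idx];
}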
Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?
Yes, the programming model guarantees that the threads are grouped into warps in that specific order.
As a simple example of optimizing divergent code paths, can the threads in a block be separated into groups of 32 threads? For example: switch (threadIdx.x / 32) { case 0: /* warp 0 */ break; case 1: /* warp 1 */ break; /* etc. */ }
Exactly :)
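Written out as a small sketch (the kernel name is hypothetical and the per-warp work is left as comments):

__global__ void perWarpWork(float *data)
{
    switch (threadIdx.x / 32) {   // all 32 threads of a warp land in the same case,
    case 0:                       // so there is no divergence within a warp
        /* work for warp 0 */
        break;
    case 1:
        /* work for warp 1 */
        break;
    /* etc. */
    }
}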
How many bytes must be read at one time for a single warp: 4 bytes * 32 threads, 8 bytes * 32 threads, or 16 bytes * 32 threads? As far as I know, one transaction to global memory fetches 128 bytes at a time.
Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).
This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.

how does CUDA schedule its threads

I've got a few questions regarding CUDA's scheduling system.
A. When I use, for example, the foo<<<255, 255>>>() launch, what actually happens inside the card? I know that each SM receives a block to schedule from the upper level, and that each SM is responsible for scheduling its incoming block, but which part does that? If, for example, I've got 8 SMs, each of which contains 8 small cores, is the upper level responsible for scheduling the remaining 255*255 - (8 * 8) threads?
B. What's the limit on the maximum number of threads one can define? I mean foo<<<X, Y>>>(); X, Y = ?
C. Regarding the last example, how many threads can be inside one block? Can we say that the more blocks / threads we have, the faster the execution will be?
Thanks for your help
A. The compute work distributor will distribute a block from the grid to an SM. The SM will convert the block into warps (WARP_SIZE = 32 on all NVIDIA GPUs). On Fermi 2.0 GPUs, each SM has two warp schedulers which share a set of data paths. Every cycle, each warp scheduler picks a warp and issues an instruction to one of the data paths (please don't think in terms of CUDA cores). On Fermi 2.1 GPUs, each warp scheduler has independent data paths as well as a set of shared data paths. On 2.1, every cycle each warp scheduler will pick a warp and attempt to dual-issue instructions for it.
The warp schedulers attempt to optimize the use of the data paths. This means that it is possible for a single warp to execute multiple instructions in back-to-back cycles, or for the warp scheduler to issue from a different warp every cycle.
The number of warps/threads that each SM can handle is specified in the CUDA Programming Guide v.4.2 Table F-1. This scales from 768 threads to 2048 threads (24-64 warps).
B. The maximum threads per launch is defined by the maximum GridDims * the maximum threads per block. See Table F-1 or refer to the documentation for cudaGetDeviceProperties.
C. See the same resources as (B). The optimum distribution of threads/block is defined by your algorithm partitioning and is influenced by the occupancy calculation. There are observable performance impacts based on, among other things, the number of warps resident on the SM and the amount of time spent blocked at instruction barriers. For starters I recommend at least 2 blocks per SM and ~50% occupancy.
B. It depends on your device. You can use the CUDA function cudaGetDeviceProperties to see the specifications of your device. A common maximum is Y=1024 threads per block and X=65535 blocks per grid dimension.
C. A common practice is to have a power of 2 (128, 256, 512, etc.) threads per block. Reducing large arrays is very effective that way (see Reduction). The optimum distribution of blocks and threads actually depends on your application and your hardware. I personally use 512 threads/block for large sparse linear algebra computations on a Tesla M2050, since it's the most efficient for my applications.
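A closing sketch of the cudaGetDeviceProperties query mentioned in both answers (device 0 is assumed):

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max grid dimensions: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("warp size: %d\n", prop.warpSize);
    return 0;
}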