The goal is simple: plot the effect of block size on execution time with CUDA. What one would expect to see is that the execution time is lowest for each block size that is a multiple of 32, and that just past these multiples (e.g. 33, 65, 97, 129, ...) the execution time should increase. However, this is not the result I'm getting. The execution time simply goes down and then flattens out.
I'm running CUDA runtime 10.0 on an NVIDIA GeForce 940M.
I've tried several ways of getting the execution time. The one recommended in the CUDA documentation says the following should work:
cudaEvent_t gpu_execution_start, gpu_execution_end;
float gpu_execution_time_ms = 0.0f;
cudaEventCreate(&gpu_execution_start);
cudaEventCreate(&gpu_execution_end);
cudaEventRecord(gpu_execution_start);                 // enqueue start event before the kernel
kernel<<<n_blocks,blocksize>>> (device_a, device_b, device_out, arraysize);
cudaEventRecord(gpu_execution_end);                   // enqueue end event after the kernel
cudaEventSynchronize(gpu_execution_end);              // wait until the end event has actually happened
cudaEventElapsedTime(&gpu_execution_time_ms, gpu_execution_start, gpu_execution_end);
This way of timing, however, produces the result described above.
Does the issue lie in how I am timing the execution? Or could the specific GPU be causing this result?
Each of those thread blocks will be translated into warps. As you increase the number of threads per thread block, the last (partial) warp just gets more active lanes, so the fraction of inactive lanes shrinks each time. For example, if you launch 33 threads per thread block, each block will have 1 warp with all 32 lanes active and another warp with only 1 lane active. So at each increment of your test you are not increasing the amount of divergence; you are just activating one more lane in that last warp, and adding one more full warp every 32 threads.
If, in addition, your app is not scaled up enough, all of the work can be scheduled at the same time anyway, so there won't be any visible effect on execution time.
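To make that concrete, here is a tiny host-side sketch (the sweep range is made up) showing what each extra thread per block actually buys you: one more active lane in the block's last warp, with a brand-new warp appearing only at 33, 65, 97, and so on:

#include <stdio.h>

int main(void)
{
    for (int blocksize = 32; blocksize <= 96; ++blocksize) {
        int warps_per_block = (blocksize + 31) / 32;                    /* a partial warp still occupies a full warp slot */
        int active_lanes    = blocksize - (warps_per_block - 1) * 32;   /* active lanes in the last warp */
        printf("blocksize %2d -> %d warp(s) per block, last warp: %2d/32 lanes active\n",
               blocksize, warps_per_block, active_lanes);
    }
    return 0;
}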
Hope this helps!
computation performed by a GPU kernel is partitioned into groups of threads called thread blocks, which typically execute in concurrent groups, resulting in waves of execution
What exactly does wave mean here? Isn't that the same as a warp?
A GPU can execute a maximum number of threads, grouped in a maximum number of thread blocks. When the whole grid for a kernel is larger than the maximum of either of those limits, or if there are concurrent kernels occupying the GPU, it will launch as many thread blocks as possible. When the last thread of a block has terminated, a new block will start.
Since blocks typically have equal run times and scheduling has a certain latency, this often results in bursts of activity on the GPU that you can see in the occupancy. I believe this is what is meant by that sentence.
Do not confuse this with the term "wavefront" which is what AMD calls a warp.
Wave: a group of thread blocks running concurrently on GPU.
Full Wave: (number of SMs on the device) x (max active blocks per SM)
Launching a grid with fewer thread blocks than a full wave results in low achieved occupancy. Most launches consist of some number of full waves and possibly one incomplete wave. It should be mentioned that the maximum size of a wave depends on how many blocks can fit on one SM, which in turn depends on registers per thread, shared memory per block, etc.
If we look at Julien Demouth's blog post and use those values to understand the issue:
max # of threads per SM: 2048 (NVIDIA Tesla K20)
kernel has 4 blocks of 256 threads per SM
Theoretical Occupancy: 50% (4*256/2048)
Full Wave: (# of SMs) x (max active blocks per SM) = 13x4 = 52 blocks
The kernel is launched with 128 blocks, so there are 2 full waves and 1 incomplete wave with 24 blocks. The full-wave value may be increased by using the __launch_bounds__ attribute or by configuring the amount of shared memory per SM (for some devices; see also the related report), etc.
Also, the incomplete wave is called the partial last wave, and it has a negative effect on performance due to its low occupancy. This underutilization of the GPU is called the tail effect, and it is especially dominant when launching only a few thread blocks in a grid.
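As a rough sketch of how those wave numbers can be computed at run time (my_kernel and the launch of 128 blocks of 256 threads are placeholders standing in for the blog's example):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void my_kernel(void) { }   /* placeholder kernel */

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int block_size = 256, n_blocks = 128, max_blocks_per_sm = 0;
    /* resident blocks per SM for this kernel, given its register / shared memory usage */
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm, my_kernel, block_size, 0);

    int full_wave = prop.multiProcessorCount * max_blocks_per_sm;
    printf("full wave    : %d blocks\n", full_wave);
    printf("full waves   : %d\n", n_blocks / full_wave);
    printf("partial wave : %d blocks\n", n_blocks % full_wave);
    return 0;
}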
I am trying to learn NSIGHT.
Can someone tell me what these red marks are indicating in the following screenshot taken from the User Guide? There are two red marks in the Occupancy Per SM section and two in the Warps section, as you can see.
Similarly, what are those black lines of varying length indicating?
Another example from same page:
Here is the basic explanation:
Grey bars represent the available amount of resources your particular device has (due to both its hardware and its compute capability).
Black bars represent the theoretical limit that it is possible to achieve for your kernel under your launch configuration (blocks per grid and threads per block)
The red dots represent the resources that you are actually using.
For instance, looking at "Active warps" on the first picture:
Grey: The device supports 64 active warps concurrently.
Black: Because of the register usage, it is theoretically possible to map 64 warps.
Red: You achieve 63.56 active warps.
In this case, the grey bar is under the black one, so you can't see the grey one.
In some cases, it can happen that the theoretical limit is greater than the device limit. This is OK. You can see examples in the second picture (block limit (shared memory) and block limit (registers)). That makes sense if you consider that your kernel uses only a small fraction of your resources; if one block uses 1 register, it would theoretically be possible to launch 65536 blocks (without taking other factors into account), but the device limit is still 16. The number 128 comes from 65536/512. The same applies to the shared memory section: since you use 0 bytes of shared memory per block, you could launch an infinite number of blocks as far as shared memory limitations are concerned.
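A hedged sketch of the arithmetic behind those "Block Limit" rows, using the figures discussed above (1 register per thread, 512 threads per block, 0 bytes of shared memory; the 48 KB of shared memory per SM is an assumed typical value, not taken from the screenshot):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* Device limits from the example (a compute capability 3.0-class device). */
    int regs_per_sm        = 65536;
    int smem_per_sm        = 49152;     /* bytes; assumed typical value */
    int device_block_limit = 16;

    /* Per-kernel usage from the example above. */
    int threads_per_block = 512;
    int regs_per_thread   = 1;
    int smem_per_block    = 0;          /* bytes */

    int limit_regs = regs_per_sm / (regs_per_thread * threads_per_block);            /* 65536 / 512 = 128 */
    int limit_smem = smem_per_block > 0 ? smem_per_sm / smem_per_block : INT_MAX;    /* effectively unlimited */

    printf("block limit (registers)    : %d\n", limit_regs);
    printf("block limit (shared memory): %d\n", limit_smem);
    printf("block limit (device)       : %d\n", device_block_limit);
    return 0;
}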
About blank spaces
The theoretical and the achieved values are the same for all rows except for "Active warps" and "Occupancy".
You are really executing 1024 threads per block with 32 warps per block on the first picture.
In the case of Occupancy and Active warps, I guess the achieved number is a kind of statistical measure. I think that is because of the nature of the CUDA model. In CUDA, each thread within a warp is executed simultaneously on an SM. The way of hiding high-latency operations, such as memory reads, is through "almost-free" warp context switches. I guess it would be difficult to take an exact measure of the number of active warps in that situation. Besides hardware concepts, we also have to take into account the kernel implementation; branch divergence, for instance, could make one warp slower than others... etc.
Extended information
As you saw, these numbers are closely related to your device specific hardware and compute capability, so perhaps a concrete example could help here:
A device with CCC 3.0 can handle a maximum of 2048 threads per SM, 16 blocks per SM and 64 warps per SM. You also have a maximum number of registers available to use (65536 in that case).
This Wikipedia entry is a handy reference for the features of each CCC.
You can query these parameters using the deviceQuery utility sample code provided with the CUDA toolkit or, at execution time, using the CUDA API as shown here.
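For reference, a minimal deviceQuery-style sketch of that run-time query (only a few of the relevant cudaDeviceProp fields are printed):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   /* device 0 */

    printf("SM count                : %d\n", prop.multiProcessorCount);
    printf("max threads per SM      : %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max threads per block   : %d\n", prop.maxThreadsPerBlock);
    printf("warp size               : %d\n", prop.warpSize);
    printf("32-bit registers per SM : %d\n", prop.regsPerMultiprocessor);
    printf("shared memory per SM    : %zu bytes\n", prop.sharedMemPerMultiprocessor);
    return 0;
}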
Performance considerations
The thing is that, ideally, 16 blocks of 128 threads could be executed using fewer than 32 registers per thread. That would mean a high occupancy rate. In most cases, though, your kernel needs more than 32 registers per thread, so it is no longer possible to execute 16 blocks concurrently on the SM; the reduction is then done at block-level granularity, i.e., by decreasing the number of blocks. And this is what the bars capture.
You can play with the number of threads and blocks, or even with the __launch_bounds__ directive, to optimize your kernel, or you can use the --maxrregcount setting to lower the number of registers used by a single kernel to see if it improves overall execution speed.
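A minimal sketch of what the __launch_bounds__ variant looks like (the kernel body and the chosen bounds are placeholders); the equivalent compiler-wide knob is nvcc --maxrregcount=32:

/* Promise the compiler this kernel is launched with at most 128 threads per block and
   ask for at least 16 resident blocks per SM; the compiler then caps register usage
   accordingly, possibly spilling to local memory. */
__global__ void __launch_bounds__(128, 16)
my_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   /* placeholder work */
}

Whether this actually helps depends on whether the register spills cost more than the occupancy you gain, so it is worth measuring both variants.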
I have a basic question for my understanding. I apologize if some reference to the answer is provided in some documentation. I couldn't find anything related to this in C programming guide.
I have a Fermi architecture GPU, a GeForce GTX 470. It has
14 Streaming Multiprocessors
32 Stream Cores per SM
I wanted to understand the thread pre-emption mechanism with an example. Suppose I have the simplest kernel with a 'printf' statement (printing out the thread id), and I use the following dimensions for the grid and blocks:
dim3 grid, block;
grid.x = 14;
grid.y = 1;
grid.z = 1;
block.x = 32;
block.y = 1;
block.z = 1;
So as I understand 14 blocks will be scheduled to 14 streaming multi-processors. And as each streaming multiprocessor has 32 cores, each core will execute one kernel (one thread). Is this correct?
If this is correct, then what happens in the following case?
grid.x = 14;
grid.y = 1;
grid.z = 1;
block.x = 64;
block.y = 1;
block.z = 1;
I understand that whatever number of blocks I assign to the grid, they will be scheduled without any particular sequence or prediction. That is because, as soon as a resource bottleneck is encountered, the GPU will schedule those blocks which do not require those resources.
1) Is the same criterion used for scheduling threads?
2) But, like I mentioned, I have a printf statement and no common resource usage; what happens in that case? After the first 32 threads are executed, are the remaining 32 threads executed?
3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for all y values first, and then the rest?
Can someone please comment on this?
So as I understand 14 blocks will be scheduled to 14 streaming multi-processors.
Not necessarily. A single block with 32 threads is not enough to saturate an SM, so multiple blocks may be scheduled on a single SM while some go unused. As you increase the number of blocks, you will get to a point where they get evenly distributed over all SMs.
And as each multiprocessor has 32 cores, each core will execute one kernel (one thread).
The CUDA cores are heavily pipelined so each core processes many threads at the same time. Each thread is in a different stage in the pipeline. There are also a varying number of different types of resources.
Taking a closer look at the Fermi SM (see below), you see the 32 CUDA Cores (marketing speak for ALUs), each of which can hold around 20 threads in their pipelines. But there are only 16 LD/ST (Load/Store) units and only 4 SFU (Special Function) units. So, when a warp gets to an instruction that is not supported by the ALUs, the warp will be scheduled multiple times. For instance, if the instruction requires the SFU units, the warp will be scheduled 8 (32 / 4) times.
I understand that whatever number of blocks I assign to the grid, they will be scheduled without any particular sequence or prediction. That is because, as soon as a resource bottleneck is encountered, the GPU will schedule those blocks which do not require those resources.
1) Is the same criterion used for scheduling threads?
Because the CUDA architecture guarantees that all threads in a block will have access to the same shared memory, a block can never move between SMs. When the first warp for a block has been scheduled on a given SM, all other warps in that block will be run on that same SM regardless of resources becoming available on other SMs.
2) But, like I mentioned, I have a printf statement and no common resource usage; what happens in that case? After the first 32 threads are executed, are the remaining 32 threads executed?
Think of blocks as sets of warps that are guaranteed to run on the same SM. So, in your example, the 64 threads (2 warps) of each block will be executed on the same SM. On the first clock, the first instruction of one warp is scheduled. On the second clock, that instruction has moved one step into the pipelines so the resource that was used is free to accept either the second instruction from the same warp or the first instruction from the second warp. Since there are around 20 steps in the ALU pipelines on Fermi, 2 warps will not contain enough explicit parallelism to fill all the stages in the pipeline and they will probably not contain enough ILP to do so.
3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for all y values first, and then the rest?
The dimensions are only to enable offloading of generation of 2D and 3D thread indexes to dedicated hardware. The schedulers see the blocks as a 1D array of warps. The order in which they search for eligible warps is undefined. The scheduler will search in a fairly small set of "active" warps for a warp that has a current instruction that needs a resource that is currently open. When a warp is complete, a new one will be added to the active set. So, the order in which the warps are completed becomes unpredictable.
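To make the "1D array of warps" view concrete, this small kernel (purely illustrative, not from the question) prints the linearized in-block thread index the hardware uses and the warp/lane it falls into. Threads are linearized x-fastest, then y, then z:

#include <stdio.h>

__global__ void show_warp_id(void)
{
    /* Linearize (x, y, z) into the block-local thread id; the hardware then slices
       this 1D order into warps of 32 consecutive threads. */
    int linear_tid = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
    printf("thread (%d,%d) -> linear id %2d -> warp %d, lane %d\n",
           threadIdx.x, threadIdx.y, linear_tid, linear_tid / 32, linear_tid % 32);
}

int main(void)
{
    dim3 block(8, 8);                 /* 64 threads = 2 warps */
    show_warp_id<<<1, block>>>();
    cudaDeviceSynchronize();
    return 0;
}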
Fermi SM:
I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel what do I need load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques you would use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
// Shared scratch space used to compact the work items (assumes blockDim.x <= 1024)
__shared__ int tmp[1024];
__shared__ int tmp_counter;

if (threadIdx.x == 0) tmp_counter = 0;
__syncthreads();

// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
    int tmp_index = atomicAdd(&tmp_counter, 1);
    tmp[tmp_index] = threadIdx.x;
}
__syncthreads();

// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
    int thread_index = tmp[threadIdx.x];
    do_work(thread_index); // Thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD). SIMD means that there is a single instruction decoding unit in the hardware controlling multiple arithmetic and logic units (ALU's). A CUDA 'core' is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp of threads will execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently-executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will have no speedup from parallelism at all within that warp. It will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within the warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads are following the same code path.
The reason that the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power.) Having smaller cores that use less power per core means that more cores can be packed onto the chip. Having small cores like this is what allows GPU's to have hundreds or thousands of cores per chip while CPU's only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade off with SIMD is that you can pack a lot more ALU's onto the chip and get a lot more parallelism, but you only get the speedup when those ALU's are all executing the same code path. The reason this trade off is made to such a high degree for GPU's is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process to compute each output value of the resultant matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often even matrix multiplication itself) also happen to appear commonly in scientific and engineering applications. This is why General Purpose processing on GPU's (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs which were already being mass produced for the gaming industry could also be used to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence within a warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads that are not in the currently executed branch is then lost. You can check the CUDA Programming Guide; it explains quite well what exactly happens.
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
Because a Warp has to be SIMD (single instruction, multiple data) to achieve optimal performance. The Warps inside a block can be completely divergent from each other; however, they share some other resources (shared memory, registers, etc.).
So in general, for a given call to a kernel what do I need load balance?
I don't think load balancing is the right word here. Just make sure that you always have enough threads being executed and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could execute m threads with m=0..N*0.05, each picking a random number and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
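If the selection does need to stay random and on the GPU, a rough sketch of that "m threads" idea could look like the code below; curand supplies the random index, logf stands in for the "complicated function", and x1 is assumed to already hold a copy of x0 (none of these names come from the question or the answer). Note that the scattered x1[idx] accesses are exactly the inefficient random global reads mentioned above, and two threads may occasionally pick the same index.

#include <curand_kernel.h>

/* Launch with n_selected = N * 0.05 threads in total. */
__global__ void sample_and_log(float *x1, int N, int n_selected, unsigned long long seed)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= n_selected) return;

    curandState state;
    curand_init(seed, m, 0, &state);          /* one independent sequence per thread */

    unsigned int idx = curand(&state) % N;    /* random point to transform */
    x1[idx] = logf(x1[idx]);                  /* "complicated function" placeholder */
}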
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
Your example is a really good way of showing off reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from global memory; I will show you why.
512 threads read 512 values from global memory (1) and store them into shared memory (2); then, for step (3), 256 threads read random values from shared memory and store them in shared memory as well. You do this until you end up with one thread, which writes the result back to global memory (4).
You could extend this further by having, at the initial step, 256 threads read two values each, or 128 threads read 4 values each, etc...
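A rough sketch of that scheme, assuming a single block of 512 threads, a deterministic pairing instead of the random pick, and logf as the "complicated function" (this is illustrative only, not the answerer's code):

__global__ void iterative_shrink(const float *x0, float *result)   /* launch as <<<1, 512>>> */
{
    __shared__ float buf[512];
    unsigned int tid = threadIdx.x;

    buf[tid] = x0[tid];                      /* (1)+(2): one global read per thread, staged into shared memory */
    __syncthreads();

    /* (3): halve the number of active threads each iteration, touching only shared memory */
    for (unsigned int active = blockDim.x / 2; active >= 1; active /= 2) {
        if (tid < active)
            buf[tid] = logf(1.0f + fabsf(buf[2 * tid]));   /* placeholder "complicated function" */
        __syncthreads();
    }

    if (tid == 0)
        *result = buf[0];                    /* (4): a single write back to global memory */
}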
I have GeForce GTX460 SE, so it is: 6 SM x 48 CUDA Cores = 288 CUDA Cores.
It is known that one Warp contains 32 threads, and that in one block only one Warp can be executed simultaneously (at a time).
That is, a single multiprocessor (SM) can simultaneously execute only one Block, one Warp and only 32 threads, even if there are 48 cores available?
And in addition, to distribute a concrete Thread and Block, threadIdx.x and blockIdx.x can be used. To allocate them, use kernel <<< Blocks, Threads >>> ().
But how does one allocate a specific number of Warps and distribute them? And if that is not possible, then why bother knowing about Warps?
The situation is quite a bit more complicated than what you describe.
The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They keep the results of many computations or operations at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started a long time ago (around 20 cycles for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
The warp schedulers (yellow in the image) can schedule 2 * 32 threads per warp = 64 threads to the pipelines per cycle. So that's the number of results that can be obtained per clock. So, given that there are a mix of computing resources, 48 core, 16 LD/ST, 8 SFU, each which have different latencies, a mix of warps are being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.
The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
Adding to the complexity, warps that are executing instructions for which there are fewer than 32 resources, must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so that means that a warp containing an instruction that requires the SFUs must be scheduled 4 times.
This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".
So, coming to your actual question,
why bother to know about Warps?
Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:
In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads per block that is a multiple of the number of threads in a warp. If you don't, you end up launching a block that contains inactive threads.
In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.
In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory.
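As a quick illustration of the last point (kernel and array names are made up), consecutive threads reading consecutive 32-bit words let the hardware service a whole warp with a single 128-byte transaction, whereas a strided pattern forces several transactions per warp:

__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            /* lane k of a warp touches word k: one 128-byte transaction per warp */
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];            /* neighbouring lanes touch addresses stride*4 bytes apart: many transactions per warp */
}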
Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?
Yes, the programming model guarantees that the threads are grouped into warps in that specific order.
As a simple example of optimizing the divergent code paths, can the separation of all the threads in the block into groups of 32 threads be used? For example: switch (threadIdx.x/32) { case 0: /* 1st warp */ break; case 1: /* 2nd warp */ break; /* etc */ }
Exactly :)
How many bytes must be read at one time for a single Warp: 4 bytes * 32 Threads, 8 bytes * 32 Threads, or 16 bytes * 32 Threads? As far as I know, one transaction to global memory at one time receives 128 bytes.
Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).
This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.