The goal is simple: plot the effect of block size on execution time with CUDA. What I would expect to see is that the execution time is lowest for each block size that is a multiple of 32, and that it increases just past these multiples (e.g. 33, 65, 97, 129, ...). However, this is not the result I'm getting. The execution time simply goes down and then flattens out.
I'm running CUDA runtime 10.0 on an NVIDIA GeForce 940M.
I've tried several ways of getting the execution time. The one recommended in the CUDA documentation says the following should work:
cudaEvent_t gpu_execution_start, gpu_execution_end;
cudaEventCreate(&gpu_execution_start);
cudaEventCreate(&gpu_execution_end);
cudaEventRecord(gpu_execution_start);
kernel<<<n_blocks, blocksize>>>(device_a, device_b, device_out, arraysize);
cudaEventRecord(gpu_execution_end);
cudaEventSynchronize(gpu_execution_end);
float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, gpu_execution_start, gpu_execution_end); // elapsed time in milliseconds
This way of timing, however, produces the result mentioned above.
Does the issue lie in how I time the execution? Or could the specific GPU be causing the result?
Each of those thread blocks will be translated into warps, and as you increase the number of threads per thread block by 32, you decrease the percentage of diverged (idle) lanes each time. For example, if you launch 33 threads per thread block, each block will have one warp with all 32 lanes active and another warp with only 1 lane active. So at each 32-thread increment of your test you are not increasing the amount of divergence, you are just adding one more fully active warp to that thread block.
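As a quick illustration (a sketch with a hypothetical helper name, not code from the question), the number of warps a block occupies grows in steps of one per 32 threads, and a partial warp still takes a full warp's scheduling slot:

__host__ __device__ int warps_per_block(int blocksize)
{
    // Hypothetical helper: warps occupied by a block of `blocksize` threads.
    return (blocksize + 31) / 32;   // 32 -> 1 warp, 33 -> 2 warps, 64 -> 2, 65 -> 3
}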
If, on top of that, your problem size is too small to saturate the GPU, all of your work can be scheduled at the same time anyway, so there won't be any visible effect on execution time.
Hope this helps!
The clock64() device-side function in CUDA gives us some sort of clock ticks value. The documentation says:
when executed in device code, [clock64()] returns the value of a per-multiprocessor counter that is incremented every clock cycle.
A little program I wrote to examine clock64() behavior suggests that you get about the same initial value when you start a kernel at different points in (wall clock) time (without rebooting the machine or "manually" resetting the device). For my specific case that seems to be about 5,200,000 to 6,400,000 for the first kernel a process starts. Also, the values increase slightly from SM to SM - although it's not clear they should be related at all, or, if they are, whether they should actually be identical.
I also found that with the next kernel launch, the initial clock64() value increases - but then after some more kernel runs jumps down to a much lower value (e.g. 350,000 or so) and gradually climbs again. There doesn't seem to be a consistent pattern to this behavior (that I can detect with a few runs and manual inspection).
So, my questions are:
Does clock64() actually return clock ticks, or something else that's time-based?
In what ways is clock64() SM-specific, and in what ways are the values on different SMs related?
What resets/re-initializes the clock64() value?
Can I initialize the clock64() value(s) myself?
Does clock64() actually return clock ticks, or something else that's time-based?
clock64() reads a per-SM 64-bit counter (it actually returns a signed result, so 63 bits available). The clock source for this counter is the GPU core clock. The core clock frequency is discoverable using the deviceQuery sample code, for example. As an order-of-magnitude estimate, most CUDA GPUs I am familiar with have a clock period that is on the order of 1 nanosecond. If we multiply 2^63 by 1 nanosecond, I compute a counter rollover period of approximately 300 years.
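As a rough illustration (a minimal sketch, not from the question) of how the counter is typically used: intra-kernel timing only makes sense when the start and end reads come from the same thread, and therefore the same SM:

__global__ void timed_section(long long *elapsed)
{
    long long start = clock64();                 // read this SM's counter

    // ... the code to be timed would go here ...

    long long end = clock64();                   // same thread, so guaranteed to be the same SM
    if (blockIdx.x == 0 && threadIdx.x == 0)
        *elapsed = end - start;                  // in core clock ticks; divide by the core clock rate for seconds
}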
In what ways is clock64() SM-specific, and in what ways are the values on different SMs related?
There is no guarantee that the counter in a particular SM has any defined relationship to a counter in another SM, other than that they will have the same clock period.
What resets/re-initializes the clock64() value?
The counter will be reset at some unspecified point, somewhere between machine power-on and the first point at which you access the counter for that SM. The counter may additionally be reset at any point when a SM is inactive, i.e. has no resident threadblocks. The counter should not be reset during any interval when one or more threadblocks are active on the SM.
Can I initialize the clock64() value(s) myself?
You cannot. You have no direct control over the counter value.
I have a basic question for my understanding. I apologize if the answer is already given in some documentation; I couldn't find anything related to this in the C Programming Guide.
I have a Fermi architecture GPU, a GeForce GTX 470. It has:
14 Streaming Multiprocessors
32 Stream Cores per SM
I wanted to understand the thread preemption mechanism with an example. Suppose I have the simplest possible kernel with a 'printf' statement (printing out the thread id), and I use the following dimensions for the grid and blocks:
dim3 grid, block;
grid.x = 14;
grid.y = 1;
grid.z = 1;
block.x = 32;
block.y = 1;
block.z = 1;
So, as I understand it, the 14 blocks will be scheduled onto the 14 streaming multiprocessors. And as each streaming multiprocessor has 32 cores, each core will execute one kernel instance (one thread). Is this correct?
If this is correct, then what happens in the following case?
grid.x = 14;
grid.y = 1;
grid.z = 1;
block.x = 64;
block.y = 1;
block.z = 1;
I understand that whatever number of blocks I assign to the grid, they will be scheduled without any particular sequence or predictability. That is because as soon as a resource bottleneck is encountered, the GPU will schedule those blocks that do not require those resources.
1) Is the same criterion used for scheduling threads?
2) But, as I mentioned, I only have a printf statement and no shared resource usage; what happens in that case? Are the remaining 32 threads executed after the first 32 threads have finished?
3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for every y value first, and then the rest?
Can someone please comment on this?
So, as I understand it, the 14 blocks will be scheduled onto the 14 streaming multiprocessors.
Not necessarily. A single block with 32 threads is not enough to saturate an SM, so multiple blocks may be scheduled on a single SM while some go unused. As you increase the number of blocks, you will get to a point where they get evenly distributed over all SMs.
And as each streaming multiprocessor has 32 cores, each core will execute one kernel instance (one thread).
The CUDA cores are heavily pipelined, so each core processes many threads at the same time, with each thread in a different stage of the pipeline. In addition, an SM contains varying numbers of several different types of execution resources.
Taking a closer look at the Fermi SM (see below), you see the 32 CUDA Cores (marketing speak for ALUs), each of which can hold around 20 threads in their pipelines. But there are only 16 LD/ST (Load/Store) units and only 4 SFU (Special Function) units. So, when a warp gets to an instruction that is not supported by the ALUs, the warp will be scheduled multiple times. For instance, if the instruction requires the SFU units, the warp will be scheduled 8 (32 / 4) times.
I understand that whatever number of blocks I assign to the grid, they will be scheduled without any particular sequence or predictability. That is because as soon as a resource bottleneck is encountered, the GPU will schedule those blocks that do not require those resources.
1) Is the same criterion used for scheduling threads?
Because the CUDA architecture guarantees that all threads in a block will have access to the same shared memory, a block can never move between SMs. When the first warp for a block has been scheduled on a given SM, all other warps in that block will be run on that same SM regardless of resources becoming available on other SMs.
2) But, as I mentioned, I only have a printf statement and no shared resource usage; what happens in that case? Are the remaining 32 threads executed after the first 32 threads have finished?
Think of blocks as sets of warps that are guaranteed to run on the same SM. So, in your example, the 64 threads (2 warps) of each block will be executed on the same SM. On the first clock, the first instruction of one warp is scheduled. On the second clock, that instruction has moved one step into the pipelines, so the resource that was used is free to accept either the second instruction from the same warp or the first instruction from the second warp. Since there are around 20 stages in the ALU pipelines on Fermi, 2 warps will not contain enough explicit parallelism to fill all the stages of the pipeline, and they will probably not contain enough ILP (instruction-level parallelism) to do so either.
3) If I also have a y-dimension in the block, what is the sequence then? Are the first 32 threads in the x-dimension done for every y value first, and then the rest?
The dimensions are only to enable offloading of generation of 2D and 3D thread indexes to dedicated hardware. The schedulers see the blocks as a 1D array of warps. The order in which they search for eligible warps is undefined. The scheduler will search in a fairly small set of "active" warps for a warp that has a current instruction that needs a resource that is currently open. When a warp is complete, a new one will be added to the active set. So, the order in which the warps are completed becomes unpredictable.
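As an illustration of that linearization (a sketch following the thread-numbering rule in the Programming Guide; the helper name is mine), the warp a thread belongs to is determined by its linearized index within the block, with x varying fastest:

__device__ int my_warp_id()
{
    // Threads are numbered with x varying fastest, then y, then z.
    int linear_tid = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
    return linear_tid / warpSize;   // warpSize is 32 on current hardware
}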
Fermi SM: (diagram of the Fermi streaming multiprocessor referred to above)
I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel, what do I need to load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques would you use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown in the sketch below (the shared scratch buffer and counter are declared here for completeness; should_do_work, do_work, and BLOCK_SIZE are placeholders).
__shared__ int tmp[BLOCK_SIZE];   // scratch space for the indices of threads that have work
__shared__ int tmp_counter;

if (threadIdx.x == 0) tmp_counter = 0;
__syncthreads();

// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
    int tmp_index = atomicAdd(&tmp_counter, 1);
    tmp[tmp_index] = threadIdx.x;
}
__syncthreads();

// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
    int thread_index = tmp[threadIdx.x];
    do_work(thread_index); // Thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
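If you want help picking such a block size, the runtime's occupancy API can suggest one; a minimal sketch (my_kernel is a placeholder for your own kernel):

int min_grid_size = 0, block_size = 0;
// Ask the runtime for a block size that maximizes occupancy for my_kernel.
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, my_kernel, 0, 0);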
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
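One hedged way to size the grid accordingly (the attribute query is standard CUDA API; the factor of 8 is only an illustrative choice):

int device = 0, num_sms = 0;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);
// Overprovision: several blocks per SM so the tail of the kernel keeps every SM busy.
int n_blocks_total = num_sms * 8;   // the factor 8 is an arbitrary example; tune for your kernel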
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD). SIMD means that there is a single instruction decoding unit in the hardware controlling multiple arithmetic and logic units (ALUs). A CUDA 'core' is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp will execute exactly the same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will get no speedup from parallelism at all within that warp; it will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within a warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads follow the same code path.
The reason the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power). Having smaller cores that use less power per core means that more cores can be packed onto the chip. Having small cores like this is what allows GPUs to have hundreds or thousands of cores per chip while CPUs only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade-off with SIMD is that you can pack a lot more ALUs onto the chip and get a lot more parallelism, but you only get the speedup when those ALUs are all executing the same code path. The reason this trade-off is made to such a high degree for GPUs is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process of computing each output value of the resulting matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often matrix multiplication itself) also appear commonly in scientific and engineering applications. This is why General Purpose processing on GPUs (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs, already being mass produced for the gaming industry, could also be used to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads not in the currently executing branch is then lost. You can check the CUDA Programming Guide; it explains quite well what exactly happens.
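For example (an illustrative sketch; expensive_f, expensive_g, and the arrays are placeholders), a branch on the lane index forces the two paths to be serialized:

int i = blockIdx.x * blockDim.x + threadIdx.x;
if (threadIdx.x % 2 == 0)
    out[i] = expensive_f(in[i]);   // even lanes execute while odd lanes sit idle
else
    out[i] = expensive_g(in[i]);   // then odd lanes execute while even lanes sit idle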
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
A warp has to execute in SIMD (single instruction, multiple data) fashion to achieve optimal performance, whereas the warps inside a block can be completely divergent from each other; however, they share some other resources (shared memory, registers, etc.), which is why both concepts are needed.
So in general, for a given call to a kernel, what do I need to load balance?
I don't think load balancing is the right term here. Just make sure that you always have enough threads being executed at all times and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could execute m threads with m=0..N*0.05, each picking a random number and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
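A minimal sketch of that idea (kernel and buffer names are mine; cuRAND supplies the per-thread random pick, and logf stands in for the "complicated function", applied in place here rather than written to a separate x1):

#include <curand_kernel.h>

__global__ void sample_and_log(float *x, int n, unsigned long long seed)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= n / 20) return;              // roughly 5% as many threads as points

    curandState state;
    curand_init(seed, m, 0, &state);      // one RNG sequence per thread
    int idx = curand(&state) % n;         // random point -> scattered, uncoalesced access
    x[idx] = logf(x[idx]);                // the "complicated function"
}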
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
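A sketch of step 2 under this restructuring (the names are placeholders; logf stands in for the complicated function): every thread touches exactly one element, so the accesses are coalesced and there is no divergence:

__global__ void map_all(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = logf(x[i]);   // applied to every element: coalesced reads/writes, no divergence
}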
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
Your example is a really good way of showing off reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from global memory; I will show you why.
512 threads read 512 values from global memory (1) and store them into shared memory (2); then, for step (3), 256 threads read random values from shared memory and store them back into shared memory. You repeat this until you end up with one thread, which writes the result back to global memory (4).
You could extend this further by having, at the initial step, 256 threads read two values each, or 128 threads read 4 values each, etc.
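Here is a rough sketch of that scheme for one block of 512 threads (simplified: the halving is deterministic rather than random, and logf stands in for the complicated function, so this only illustrates the memory traffic pattern):

__global__ void iterate_in_shared(const float *x0, float *out)
{
    __shared__ float s[512];
    int tid = threadIdx.x;

    s[tid] = x0[blockIdx.x * 512 + tid];          // (1) the only global read
    __syncthreads();

    // (3) repeated halving; all intermediate traffic stays in shared memory
    for (int active = 256; active >= 1; active /= 2) {
        if (tid < active)
            s[tid] = logf(s[tid + active]);       // stand-in for the "complicated function"
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = s[0];                   // (4) the single global write per block
}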
In CUDA, when we talk about parallel threads executing the same code, is there any order to their execution?
For-example:
Say I have 4 threads for a 1D array of 4 elements, and all four threads perform some operation on some index of the array.
Will thread 4 always execute after thread 3, or is there no specific order to the execution?
Thank you!
Generally, there is no order to thread execution. It's wrong to rely on the order of threads when designing your algorithm.
There is no deterministic order for threads' execution and if you need a specific order then you should be programming it sequentially instead of using a parallel execution model.
There is something that can be said about thread execution, though. In CUDA's execution model, threads are grouped into "warps". Depending on the compute capability of the underlying device, the threads of each warp (or half-warp) are executed simultaneously - literally at the same time. Execution proceeds until the code stalls waiting for a memory transfer, at which point another warp (or half-warp) is scheduled to run.
The documentation, though, is very specific about what assumptions you can make on the matter: the best execution barrier you have is a kernel call ending.
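For instance (an illustrative sketch with hypothetical kernel names), if every result of one step must be visible before the next step reads it, splitting the steps into two kernel launches on the same stream gives you exactly that barrier:

step_a<<<n_blocks, block_size>>>(d_data);   // all threads of step_a complete...
step_b<<<n_blocks, block_size>>>(d_data);   // ...before any thread of step_b starts (same default stream)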