In CUDA when we talk about parallel threads executing the same code is there any order to their execution?
For-example:
If, I have 4 threads,for a 1D array of 4 elements.All four threads perfom some operation on some index of the array.
Will thread 4 always execute after thread 3 or there is no specific order in the execution?
Thank you!
Generally, there are no order in threads execution. It's wrong to rely on the order of threads designing your algorithm.
There is no deterministic order for threads' execution and if you need a specific order then you should be programming it sequentially instead of using a parallel execution model.
There is something that can be said about thread execution, though. In CUDA's execution model, threads are grouped in "warps". Depending on the compute capability of the underlying device, each warp (or half-warp) is executed simultaneously - literally at the same time. The execution proceeds until the code locks due to waiting for memory transference and another warp (or half-warp) is scheduled to run.
The documentation, though, is very specific about what assumptions you can make on the matter: the best execution barrier you have is a kernel call ending.
Related
...or just the threads in the current warp or block?
Also, when the threads in a particular block encounter (in the kernel) the following line
__shared__ float srdMem[128];
will they just declare this space once (per block)?
They all obviously operate asynchronously so if Thread 23 in Block 22 is the first thread to reach this line, and then Thread 69 in Block 22 is the last one to reach this line, Thread 69 will know that it already has been declared?
The __syncthreads() command is a block level synchronization barrier. That means it is safe to be used when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code but only when all threads evaluate identically such code otherwise the execution is likely to hang or produce unintended side effects [4].
Example of using __syncthreads(): (source)
__global__ void globFunction(int *arr, int N)
{
__shared__ int local_array[THREADS_PER_BLOCK]; //local block memory cache
int idx = blockIdx.x* blockDim.x+ threadIdx.x;
//...calculate results
local_array[threadIdx.x] = results;
//synchronize the local threads writing to the local memory cache
__syncthreads();
// read the results of another thread in the current thread
int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];
//write back the value to global memory
arr[idx] = val;
}
To synchronize all threads in a grid currently there is not native API call. One way of synchronizing threads on a grid level is using consecutive kernel calls as at that point all threads end and start again from the same point. It is also commonly called CPU synchronization or Implicit synchronization. Thus they are all synchronized.
Example of using this technique (source):
Regarding the second question. Yes, it does declare the amount of shared memory specified per block. Take into account that the quantity of available shared memory is measured per SM. So one should be very careful how the shared memory is used along with the launch configuration.
I agree with all the answers here but I think we are missing one important point here w.r.t first question. I am not answering second answer as it got answered perfectly in the above answers.
Execution on GPU happens in units of warps. A warp is a group of 32 threads and at one time instance each thread of a particular warp execute the same instruction. If you allocate 128 threads in a block its (128/32 = ) 4 warps for a GPU.
Now the question becomes "If all threads are executing the same instruction then why synchronization is needed?". The answer is we need to synchronize the warps that belong to the SAME block. __syncthreads does not synchronizes threads in a warp, they are already synchronized. It synchronizes warps that belong to same block.
That is why answer to your question is : __syncthreads does not synchronizes all threads in a grid, but the threads belonging to one block as each block executes independently.
If you want to synchronize a grid then divide your kernel (K) into two kernels(K1 and K2) and call both. They will be synchronized (K2 will be executed after K1 finishes).
__syncthreads() waits until all threads within the same block has reached the command and all threads within a warp - that means all warps that belongs to a threadblock must reach the statement.
If you declare shared memory in a kernel, the array will only be visible to one threadblock. So each block will have his own shared memory block.
Existing answers have done a great job answering how __syncthreads() works (it allows intra-block synchronization), I just wanted to add an update that there are now newer methods for inter-block synchronization. Since CUDA 9.0, "Cooperative Groups" have been introduced, which allow synchronizing an entire grid of blocks (as explained in the Cuda Programming Guide). This achieves the same functionality as launching a new kernel (as mentioned above), but can usually do so with lower overhead and make your code more readable.
In order to provide further details, aside of the answers, quoting seibert:
More generally, __syncthreads() is a barrier primitive designed to protect you from read-after-write memory race conditions within a block.
The rules of use are pretty simple:
Put a __syncthreads() after the write and before the read when there is a possibility of a thread reading a memory location that another thread has written to.
__syncthreads() is only a barrier within a block, so it cannot protect you from read-after-write race conditions in global memory unless the only possible conflict is between threads in the same block. __syncthreads() is pretty much always used to protect shared memory read-after-write.
Do not use a __syncthreads() call in a branch or a loop until you are sure every single thread will reach the same __syncthreads() call. This can sometimes require that you break your if-blocks into several pieces to put __syncthread() calls at the top-level where all threads (including those which failed the if predicate) will execute them.
When looking for read-after-write situations in loops, it helps to unroll the loop in your head when figuring out where to put __syncthread() calls. For example, you often need an extra __syncthreads() call at the end of the loop if there are reads and writes from different threads to the same shared memory location in the loop.
__syncthreads() does not mark a critical section, so don’t use it like that.
Do not put a __syncthreads() at the end of a kernel call. There’s no need for it.
Many kernels do not need __syncthreads() at all because two different threads never access the same memory location.
I'm hoping for some general advice and clarification on best practices for load balancing in CUDA C, in particular:
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
If so, will the spare processing capacity be assigned to another warp?
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
So in general, for a given call to a kernel what do I need load balance?
Threads in each warp?
Threads in each block?
Threads across all blocks?
Finally, to give an example, what load balancing techniques you would use for the following function:
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly select 5% of the points and log them (or some complicated function)
I write the resulting vector x1 (e.g. [1, log(2), 3, 4, 5, ..., N]) to memory
I repeat the above 2 operations on x1 to yield x2 (e.g. [1, log(log(2)), 3, 4, log(5), ..., N]), and then do a further 8 iterations to yield x3 ... x10
I return x10
Many thanks.
Threads are grouped into three levels that are scheduled differently. Warps utilize SIMD for higher compute density. Thread blocks utilize multithreading for latency tolerance. Grids provide independent, coarse-grained units of work for load balancing across SMs.
Threads in a warp
The hardware executes the 32 threads of a warp together. It can execute 32 instances of a single instruction with different data. If the threads take different control flow, so they are not all executing the same instruction, then some of those 32 execution resources will be idle while the instruction executes. This is called control divergence in CUDA references.
If a kernel exhibits a lot of control divergence, it may be worth redistributing work at this level. This balances work by keeping all execution resources busy within a warp. You can reassign work between threads as shown below.
// Identify which data should be processed
if (should_do_work(threadIdx.x)) {
int tmp_index = atomicAdd(&tmp_counter, 1);
tmp[tmp_index] = threadIdx.x;
}
__syncthreads();
// Assign that work to the first threads in the block
if (threadIdx.x < tmp_counter) {
int thread_index = tmp[threadIdx.x];
do_work(thread_index); // Thread threadIdx.x does work on behalf of thread tmp[threadIdx.x]
}
Warps in a block
On an SM, the hardware schedules warps onto execution units. Some instructions take a while to complete, so the scheduler interleaves the execution of multiple warps to keep the execution units busy. If some warps are not ready to execute, they are skipped with no performance penalty.
There is usually no need for load balancing at this level. Simply ensure that enough warps are available per thread block so that the scheduler can always find a warp that is ready to execute.
Blocks in a grid
The runtime system schedules blocks onto SMs. Several blocks can run concurrently on an SM.
There is usually no need for load balancing at this level. Simply ensure that enough thread blocks are available to fill all SMs several times over. It is useful to overprovision thread blocks to minimize the load imbalance at the end of a kernel, when some SMs are idle and no more thread blocks are ready to execute.
As others have already said, the threads within a warp use a scheme called Single Instruction, Multiple Data (SIMD.) SIMD means that there is a single instruction decoding unit in the hardware controling multiple arithmetic and logic units (ALU's.) A CUDA 'core' is basically just a floating-point ALU, not a full core in the same sense as a CPU core. While the exact CUDA core to instruction decoder ratio varies between different CUDA Compute Capability versions, all of them use this scheme. Since they all use the same instruction decoder, each thread within a warp of threads will execute the exact same instruction on every clock cycle. The cores assigned to the threads within that warp that do not follow the currently-executing code path will simply do nothing on that clock cycle. There is no way to avoid this, as it is an intentional physical hardware limitation. Thus, if you have 32 threads in a warp and each of those 32 threads follows a different code path, you will have no speedup from parallelism at all within that warp. It will execute each of those 32 code paths sequentially. This is why it is ideal for all threads within the warp to follow the same code path as much as possible, since parallelism within a warp is only possible when multiple threads are following the same code path.
The reason that the hardware is designed this way is that it saves chip space. Since each core doesn't have its own instruction decoder, the cores themselves take up less chip space (and use less power.) Having smaller cores that use less power per core means that more cores can be packed onto the chip. Having small cores like this is what allows GPU's to have hundreds or thousands of cores per chip while CPU's only have 4 or 8, even while maintaining similar chip sizes and power consumption (and heat dissipation) levels. The trade off with SIMD is that you can pack a lot more ALU's onto the chip and get a lot more parallelism, but you only get the speedup when those ALU's are all executing the same code path. The reason this trade off is made to such a high degree for GPU's is that much of the computation involved in 3D graphics processing is simply floating-point matrix multiplication. SIMD lends itself well to matrix multiplication because the process to compute each output value of the resultant matrix is identical, just on different data. Furthermore, each output value can be computed completely independently of every other output value, so the threads don't need to communicate with each other at all. Incidentally, similar patterns (and often even matrix multiplication itself) also happen to appear commonly in scientific and engineering applications. This is why General Purpose processing on GPU's (GPGPU) was born. CUDA (and GPGPU in general) was basically an afterthought on how existing hardware designs which were already being mass produced for the gaming industry could also be used to speed up other types of parallel floating-point processing applications.
If 1 thread in a warp takes longer than the other 31, will it hold up the other 31 from completing?
Yes. As soon as you have divergence in a Warp, the scheduler needs to take all divergent branches and process them one by one. The compute capacity of the threads not in the currently executed branch will then be lost. You can check the CUDA Programming Guide, it explains quite well what exactly happens.
If so, will the spare processing capacity be assigned to another warp?
No, unfortunately that is completely lost.
Why do we need the notion of warp and block? Seems to me a warp is just a small block of 32 threads.
Because a Warp has to be SIMD (single instruction, multiple data) to achieve optimal performance, the Warps inside a block can be completely divergent, however, they share some other resources. (Shared Memory, Registers, etc.)
So in general, for a given call to a kernel what do I need load balance?
I don't think load balance is the right word here. Just make sure, that you always have enough Threads being executed all the time and avoid divergence inside warps. Again, the CUDA Programming Guide is a good read for things like that.
Now for the example:
You could execute m threads with m=0..N*0.05, each picking a random number and putting the result of the "complicated function" in x1[m].
However, randomly reading from global memory over a large area isn't the most efficient thing you can do with a GPU, so you should also think about whether that really needs to be completely random.
Others have provided good answers for the theoretical questions.
For your example, you might consider restructuring the problem as follows:
have a vector x of N points: [1, 2, 3, ..., N]
compute some complicated function on every element of x, yielding y.
randomly sample subsets of y to produce y0 through y10.
Step 2 operates on every input element exactly once, without consideration for whether that value is needed. If step 3's sampling is done without replacement, this means that you'll be computing 2x the number of elements you'll actually need, but you'll be computing everything with no control divergence and all memory access will be coherent. These are often much more important drivers of speed on GPUs than the computation itself, but this depends on what the complicated function is really doing.
Step 3 will have a non-coherent memory access pattern, so you'll have to decide whether it's better to do it on the GPU or whether it's faster to transfer it back to the CPU and do the sampling there.
Depending on what the next computation is, you might restructure step 3 to instead randomly draw an integer in [0,N) for each element. If the value is in [N/2,N) then ignore it in the next computation. If it's in [0,N/2), then associate its value with an accumulator for that virtual y* array (or whatever is appropriate for your computation).
Your example is a really good way of showing of reduction.
I have a vector x0 of N points: [1, 2, 3, ..., N]
I randomly pick 50% of the points and log them (or some complicated function) (1)
I write the resulting vector x1 to memory (2)
I repeat the above 2 operations on x1 to yield x2, and then do a further 8 iterations to yield x3 ... x10 (3)
I return x10 (4)
Say |x0| = 1024, and you pick 50% of the points.
The first stage could be the only stage where you have to read from the global memory, I will show you why.
512 threads read 512 values from memory(1), it stores them into shared memory (2), then for step (3) 256 threads will read random values from shared memory and store them also in shared memory. You do this until you end up with one thread, which will write it back to global memory (4).
You could extend this further by at the initial step having 256 threads reading two values, or 128 threads reading 4 values, etc...
Which is better, the atomic's competition (concurrency) between threads of the single Warp or between threads of different Warps in one block? I think that when you access the shared memory is better when threads of one warp are competing with each other is less than the threads of different warps. And with access to global memory on the contrary, it is better that a threads of different warps of one block competed less than the threads of single warp, isn't it?
I need it to know how better to resolve competition (concurrency) and what better to separate store: between threads in single warp or between warps.
Incidentally it may be said that the team __ syncthreads (); synchronizes it warps in a single block and not the threads of one warp?
If a significant number of threads in a block perform atomic updates to the same value, you will get poor performance since those threads must all be serialized. In such cases, it is usually better to have each thread write its result to a separate location and then, in a separate kernel, process those values.
If each thread in a warp performs an atomic update to the same value, all the threads in the warp perform the update in the same clock cycle, so they must all be serialized at the point of the atomic update. This probably means that the warp is scheduled 32 times to get all the threads serviced (very bad).
On the other hand, if a single thread in each warp in a block performs an atomic update to the same value, the impact will be lower because the pairs of warps (the two warps processed at each clock by the two warp schedulers) are offset in time (by one clock cycle), as they move through the processing pipelines. So you end up with only two atomic updates (one from each of the two warps), getting issued within one cycle and needing to immediately be serialized.
So, in the second case, the situation is better, but still problematic. The reason is that, depending on where the shared value is, you can still get serialization between SMs, and this can be very slow since each thread may have to wait for updates to go all the way out to global memory, or at least to L2, and then back. It may be possible to refactor the algorithm in such a way that threads within a block perform atomic updates to a value in shared memory (L1), and then have one thread in each block perform an atomic update to a value in global memory (L2).
The atomic operations can be complete lifesavers but they tend to be overused by people new to CUDA. It is often better to use a separate step with a parallel reduction or parallel stream compaction algorithm (see thrust::copy_if).
Section B.10 of CUDA Programming Guide 4.1 explains that:
[...] the number of clock cycles taken by the device to completely
execute the thread, [is different from] the number of clock cycles the
device actually spent executing thread instructions. The former number
is greater than the latter [...]
I understand that the first is the wall clock time for the completion of thread execution. The second time is first time minus the time the thread spent idle. The thread would be idle when its instructions need to wait for results from previous instructions (instruction dependency), or waiting for operand values from memory or waiting at a synchronization point.
The guide then goes on to say that:
The former number is greater than the latter since threads are time
sliced.
What is the meaning of time sliced in this context? What does it mean by saying that threads are time sliced?
Note that this term does not appear anywhere else in the guide. (Forgive me if I am missing something obvious by context here, I am not a native English speaker.)
Time slicing in this context refers to the fact that multiple warps are running on a multiprocessor (SM) and that the SM switches among warps as execution proceeds in order to hide latency. This is not the same as preemption in traditional CPU threading; nor is it the same as pipelining.
If you have code like this:
if (threadIdx.x == 0 && blockIdx.x == 0) x = clock();
// other work done by all threads
if (threadIdx.x == 0 && blockIdx.x == 0) y = clock();
If there is more than one warp running on the SM, then the value of y-x will be greater than the actual time spent executing in thread 0 (== warp 0). And that is not just because of thread 0 having to wait for results from instructions or memory accesses, it is also due to the time spent executing other warps.
The point of this statement in the programming guide is that it's tricky to use clock() to do absolute timing or latency measurements.
When multiple threads are running and they have to share a processing unit then the way this is usually handled is that each thread is given a fixed maximum period of time to run (your timeslice) and then it gets preempted and another thread gets to run for a period of time. So if your thread cannot finish its work in one timeslice then it might have to wait until it is its turn again. How long that is depends on the number of parallel threads, what they are doing, how the scheduler is implemented and which processing resources are available.
Time Slicing in this context also means preemption.
You can think of a time slice as being some percentage of the total available execution time.
Effectively your thread is scheduled to run for some period of time, however the scheduler may only give you a smaller time slice if other threads need to be executed.
I'm writing a cuda program that can be interpreted as producer/consumer model.
There are two kernels,
one produces a data on the device memory,
and the other kernel the produced data.
The number of comsuming threads are set two a multiple of 32 which is the warp size.
and each warp waits utill 32 data have been produced.
I've got some problem here.
If the consumer kernel is loaded later than the producer,
the program doesn't halt.
The program runs indefinately sometimes even though consumer is loaded first.
What I'm asking is that is there a nice implementation model of producer/consumer in CUDA?
Can anybody give me a direction or reference?
here is the skeleton of my code.
**kernel1**:
while LOOP_COUNT
compute something
if SOME CONDITION
atomically increment PRODUCE_COUNT
write data into DATA
atomically increment PRODUCER_DONE
**kernel2**:
while FOREVER
CURRENT=0
if FINISHED CONDITION
return
if PRODUCER_DONE==TOTAL_PRODUCER && CONSUME_COUNT==PRODUCE_COUNT
return
if (MY_WARP+1)*32+(CONSUME_WARPS*32*CURRENT)-1 < PRODUCE_COUNT
process the data
if SOME CONDITION
set FINISHED CONDITION true
increment CURRENT
else if PRODUCUER_DONE==TOTAL_PRODUCER
if currnet*32*CONSUME_WARPS+THREAD_INDEX < PRODUCE_COUNT
process the data
if SOME CONDITION
set FINISHED CONDITION true
increment CURRENT
Since you did not provide an actual code, it is hard to check where is the bug. Usually the sceleton is correct, but the problem lies in details.
One of possible issues that I can think of:
By default, in CUDA there is no guarantee that global memory writes by one kernel will be visible by another kernel, with an exception of atomic operations. It can happen then that your first kernel increments PRODUCER_DONE, but there is still no data in DATA.
Fortunately, you are given the intristic function __threadfence() which halts the execution of the current thread, until the data is visible. You should put it before atomically incrementing PRODUCER_DONE. Check out chapter B.5 in the CUDA Programming Guide.
Another issue that may or may not appear:
From the point of view of kernel2, the compiler may deduct that PRODUCE_COUNT, once read, it never changes. The compiler may optimise the code so that, once loaded in register it reuses its value, instead of querying the global memory every time. Solution? Use volatile, or read the value using another atomic operation.
(Edit)
Third issue:
I forgot about one more problem. On pre-Fermi cards (GeForce before 400-series) you can run only a single kernel at a time. So, if you schedule the producer to run after the consumer, the system will wait for consumer-kernel to end before producer-kernel starts its execution. If you want both to run at the same time, put both into a single kernel and have an if-branch based on some block index.