If I'm doing an element-by-element operation on a matrix M, say M[i, j] *= (1 - M[i, j]), is it fine to launch a thread for each element (i, j)? I'm just wondering at what point the overhead of launching threads outweighs the parallelism achieved.
It's oftentimes a better idea to try to do more work per thread if possible, with the goal of having instruction-level parallelism. If a given thread executes multiple, independent operations, the instructions can be pipelined and executed without stalls, which will increase your arithmetic throughput. In contrast, if each thread does one piece of (trivial) work, there's no opportunity for any sort of instruction-level parallelism and no opportunity to hide any of your memory latency.
Also, there's a finite number of registers available, so the more threads you launch with, the fewer registers are available per thread. I'm not sure about Kepler cards, but back in the Fermi generation, registers had roughly 8x the bandwidth of shared memory, so using registers when possible was important (again, I don't have a Kepler card, so I don't know whether this has since changed).
Although it's a bit dated, the recommendations detailed here are still very relevant
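For the concrete case in the question, here is a minimal sketch of the "more work per thread" idea, assuming M is stored as a flat row-major float array of n elements (the kernel name and launch configuration are just illustrative):
// Each thread processes several elements via a grid-stride loop; successive
// iterations are independent, so the compiler/hardware can overlap them (ILP).
__global__ void elementwiseUpdate(float* M, int n)
{
    int stride = blockDim.x * gridDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n; idx += stride)
    {
        float m = M[idx];
        M[idx] = m * (1.0f - m);   // the original M[i, j] *= (1 - M[i, j])
    }
}

// Launch with far fewer threads than elements, e.g.:
// elementwiseUpdate<<<128, 256>>>(d_M, n);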
Note: I don't have my computer and GPU with me, so this is me typing from memory. I did time this and compile it correctly, so please ignore any odd typos should they exist.
I don't know if the overhead of what I'm going to describe below is the problem, or if I'm doing this wrong, or why launching kernels from inside a kernel is slower than one big kernel in which a lot of threads are predicated off and never used. Maybe it's because I'm not swamping the GPU with enough work and I just don't notice the saturation.
Suppose we're doing something simple for the sake of this example, like multiplying all the values in a square matrix by two. The matrices can be any size, but they won't be larger than 16x16.
Now suppose I have 200 matrices all in the device memory ready to go. I launch a kernel like
// One matrix given to each block
__global__ void matrixFunc(Matrix** matrices)
{
    Matrix* m = matrices[blockIdx.x];
    int area = m->width * m->height;
    if (threadIdx.x < area)
    {
        // Heavy calculations on the cell indexed by threadIdx.x
    }
}

// Assume 200 matrices, no larger than 16x16
matrixFunc<<<200, 256>>>(ptrs);
whereby I'm using one block per matrix, and an abundance of threads such that I know I'll never have fewer threads per block than cells in a matrix.
The above runs in 0.17 microseconds.
This seems wasteful. I know that I have a bunch of small matrices (256 threads is overkill when a 2x2 matrix only needs 4), so why not launch a bunch of smaller kernels dynamically from a kernel and see what the runtime overhead is? (For learning reasons.)
I change my code to be like the following:
// Child kernels launched with <<<>>> must be __global__
__global__ void matrixFunc(float* matrix)
{
    // Heavy calculations (threadIdx.x indexes the cell)
}

__global__ void matrixFuncCaller(Matrix** matrices)
{
    Matrix* m = matrices[threadIdx.x];
    int area = m->width * m->height;
    matrixFunc<<<1, area>>>(m->data);
}
matrixFuncCaller<<<1, 200>>>(ptrs);
But this performs a lot worse at 11.3 microseconds.
I realize I could put each launch on its own stream, so I do that. I change the caller to create a stream per launch:
__global__ void matrixFuncCaller(Matrix** matrices)
{
    Matrix* m = matrices[threadIdx.x];
    int area = m->width * m->height;

    // Device-side streams must be created with the non-blocking flag
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    matrixFunc<<<1, area, 0, stream>>>(m->data);

    cudaStreamDestroy(stream);
}
This does better, it's now 3 microseconds instead of 11, but it's still much worse than 0.17 microseconds.
I want to know why this is worse.
Is this kernel launching overhead? I figure that maybe my examples are small enough that the overhead drowns out the work seen here. In my real application, which I cannot post, there is a lot more work done than just "2 * matrix", but it's probably still small enough that the overhead might be significant.
Am I doing anything wrong?
Put shortly: the benchmark is almost certainly biased and the computation is latency-bound.
I do not know how you measured the timings, but I do not believe "0.17 microseconds" is even possible. In fact, the overhead of launching a kernel is typically a few microseconds (I have never seen an overhead smaller than 1 microsecond). Indeed, launching a kernel typically requires a system call, which is expensive and known to take at least about 1000 cycles. An example of overhead analysis can be found in this research paper (confirming that it should take several microseconds). Not to mention that current RAM accesses take at least 50-100 ns on mainstream x86-64 platforms, and a GPU global-memory access requires several hundred cycles. While it is possible that everything fits in both the CPU and GPU caches, this is very unlikely to be the case across multiple kernel executions (and the GPU may be used for other tasks in between). For more information about this, please read this research paper. Thus, what you measured almost certainly has nothing to do with the kernel execution. To measure the overhead of the kernel, you need to take care of synchronization (e.g. call cudaDeviceSynchronize), since kernels are launched asynchronously.
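As a minimal timing sketch (reusing matrixFunc and ptrs from the question, and using CUDA events), the measurement should only be read after the GPU has actually finished:
// Kernels launch asynchronously: record events around the launch and
// synchronize on the stop event before reading the elapsed time.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matrixFunc<<<200, 256>>>(ptrs);
cudaEventRecord(stop);
cudaEventSynchronize(stop);   // wait for the kernel to really complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);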
When multiple kernels are launched, you may pay the overhead of an implicit synchronization, since the launch queue is certainly bounded (for the sake of performance). In fact, as pointed out by @talonmies in the comments, the number of concurrent kernels is bounded to 16-128 (so fewer than the number of matrices).
Using multiple streams reduces the need for synchronization, hence the better performance, but there is certainly still some synchronization. That being said, for the comparison to be fair, you need to add a synchronization in all cases, or measure the execution time on the GPU itself (without counting the launch overhead), again in all cases.
Profilers like nvvp help a lot to understand what is going on in such a case. I strongly advise you to use them.
As for the computation, please note that GPUs are designed for heavy, SIMT-friendly computational kernels, not for low-latency kernels operating on small variable-sized matrices stored in unpredictable memory locations. In fact, the overhead of a global memory access is so big that it should be much larger than the actual matrix computation. If you want GPUs to be useful, then you need to submit more work to them (so as to provide more parallelism and overlap the high latencies). If you cannot provide more work, then the latency cannot be overlapped, and if you care about microsecond latencies then GPUs are clearly not suited for the task.
By the way, note that Nvidia GPUs operate on warps of typically 32 threads. Threads should perform coalesced memory loads/stores to be efficient (otherwise they are split into many load/store requests). Operating on very small matrices like this likely prevents that. Not to mention that most threads will do nothing. Flattening the matrices and sorting them by size, as proposed by @sebastian in the comments, helps a bit (see the sketch below), but the computation and memory accesses will still be very inefficient for a GPU (not SIMT-friendly). Note that using fewer threads and making use of unrolling should also be a bit more efficient (but still far from great). CPUs are better suited for such a task (thanks to higher frequencies and instruction-level parallelism combined with out-of-order execution). For fast low-latency kernels like this, FPGAs can be even better suited (though they are hard to program).
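Here is a rough sketch of that flattening idea, assuming the 200 matrices are packed into one contiguous buffer padded to 16x16 = 256 floats per matrix and their areas are stored in a separate array (all names are illustrative):
// One block per matrix slot; consecutive threads touch consecutive floats,
// so loads/stores within a warp are coalesced, and there is no pointer chasing.
__global__ void matrixFuncFlat(float* data, const int* areas, int numMatrices)
{
    int m = blockIdx.x;          // matrix slot handled by this block
    if (m >= numMatrices) return;

    int cell = threadIdx.x;      // cell within the (padded) 16x16 slot
    if (cell < areas[m])
        data[m * 256 + cell] *= 2.0f;   // placeholder for the heavy calculation
}

// matrixFuncFlat<<<200, 256>>>(d_data, d_areas, 200);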
The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using a smaller number of threads.
I know one way to avoid execution of the unused threads: return from them and launch a second kernel with a smaller block size.
What I'm asking is: provided the unused threads diverge and end (return), and provided they align to complete warps, can I safely assume they won't waste execution?
Is there a common practice for this, other than splitting in two consecutive kernel execution?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads() you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.
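To make the "retire in warp units" point concrete, here is a hedged sketch, assuming the number of surviving elements after compaction (numValid here) is already known and the block size is a multiple of 32:
// Any warp whose 32 threads all lie past numValid takes the early return
// together and stops consuming execution resources; only the one boundary
// warp stays partially active. No __syncthreads() is used after the return.
__global__ void processCompacted(const float* in, float* out, int numValid)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int warpBase = tid & ~31;          // global index of lane 0 of this thread's warp

    if (warpBase >= numValid)          // uniform across the warp: the whole warp retires
        return;

    if (tid < numValid)
        out[tid] = in[tid] * in[tid];  // placeholder for the real work
}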
I'm studying OpenCL concepts as well as the CUDA architecture for a small project, and there is one thing that is unclear to me: the necessity for Warps.
I know a lot of questions have been asked on this subject, however after having read some articles I still don't get the "meaning" of warps.
As far as I understand (speaking for my GPU card, which is a Tesla, but I guess this easily translates to other boards):
A work-item is linked to a CUDA thread, several of which can be executed by a Streaming Processor (SP). BTW, does an SP treat those work-items in parallel?
Work-items are grouped into work-groups. Work-groups operate on a Streaming Multiprocessor (SM) and cannot migrate. However, work-items in a work-group can collaborate via shared memory (a.k.a. local memory). One or more work-groups may be executed by a Streaming Multiprocessor. BTW, does an SM treat those work-groups in parallel?
Work-items are executed in parallel inside a work-group. However, synchronization is NOT guaranteed; that's why you need concurrent programming primitives such as barriers.
As far as I understand, all of this is rather a logical view than a 'physical', hardware perspective.
If all of the above is correct, can you help me on the following. Is that true to say that:
1 - Warps execute 32 threads or work-items simultaneously. Thus, they will 'consume' parts of a work-group. And that's why in the end you need things like memory fences to synchronize the work-items in a work-group.
2 - The Warp scheduler allocates the registers for the 32 threads of warp when it becomes active.
3 - Also, are the threads executed in a warp synchronized at all?
Thanks for any input on Warps, and especially why they are necessary in the CUDA architecture.
My best analogy is that a warp is the vector that gets processed in parallel, not unlike an AVX or SSE vector on an Intel CPU. This makes an SM a 32-wide vector processor.
Then, to your questions:
Yes, all 32 elements will be run in parallel. Note also that a GPU takes hyperthreading to the extreme: a work-group will consist of multiple warps, which are all run more-or-less in parallel. You will need memory fences to synchronise all that.
Yes, typically all 32 work-items (CUDA: threads) in a warp will work in parallel. Note that you will typically have multiple registers per work-item.
Not guaranteed, AFAIK.
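A small sketch (illustrative kernels, not from the question) of why the warp matters in practice: a branch that splits lanes within a warp serializes both paths, while a warp-uniform branch does not.
// Divergent: even and odd lanes of the same warp take different paths, so
// the warp executes both branches one after another (like masked SIMD lanes).
__global__ void divergent(float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}

// Warp-uniform: all 32 lanes of a warp see the same condition
// (threadIdx.x / 32 is the warp index within the block), so nothing is serialized.
__global__ void warpUniform(float* x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}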
I wanted to ask: we say that using --ptxas-options=-v doesn't give the exact number of registers that our program uses.
1) Then, how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
2) In my program I also make Thrust calls, which generate PTX code. I have 2 kernels, but I can see the Thrust functions producing PTX as well. So, should I take these numbers into account as well when counting the total number of registers I use? (I think yes!)
(The same applies to the shared memory.)
1) Then, how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
The only other thing needed should be rounding up (if necessary) the output of ptxas to an even granularity of register allocation, which varies by device (see Greg's answer here). I think the common register allocation granularities are 4 and 8, but I don't have a table of register allocation granularity by compute capability.
I think shared memory also has an allocation granularity. Since the max number of threadblocks per SM is limited anyway, this should only matter (for occupancy) if your allocation/usage is within a granular amount of exceeding the limit for however many blocks you are otherwise limited to.
I think in most cases you'll get a pretty good feel by using the numbers from ptxas without rounding. If you feel you need this level of accuracy in the occupancy calculator, asking a nice directed question like "what are the allocation granularities for registers and shared memory for various GPUs" may get someone like Greg to give you a crisp answer.
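If you would rather not parse the ptxas output by hand, a minimal sketch (assuming a kernel symbol named myKernel) is to ask the runtime for the per-kernel numbers the calculator wants:
// cudaFuncGetAttributes reports the compiled kernel's per-thread registers and
// per-block static shared memory (dynamic shared memory is whatever you pass at launch).
cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, myKernel);
printf("registers per thread:        %d\n", attr.numRegs);
printf("static shared mem per block: %zu bytes\n", attr.sharedSizeBytes);
printf("local mem per thread:        %zu bytes\n", attr.localSizeBytes);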
2) In my program I also make Thrust calls, which generate PTX code. I have 2 kernels, but I can see the Thrust functions producing PTX as well. So, should I take these numbers into account as well when counting the total number of registers I use? (I think yes!) (The same applies to the shared memory.)
Fundamentally I believe this thinking is incorrect. The only place I could see where it might matter is if you are running concurrent kernels, and I doubt that is the case since you mention thrust. The only figures that matter for occupancy are the metrics for a single kernel launch. You do not add threads, or registers, or shared memory across different kernels, to calculate resource usage. When a kernel completes execution, it releases its resource usage, at least for these resource types (registers, shared memory, threads).
I am a mathematician using CUDA for some numerical integration. My understanding is that each Nvidia streaming multiprocessor has 8 CUDA cores. So to me it seems that there is no benefit to using more than 8 threads per block. However, when I run my code I get huge performance gain by using 32 threads per block as opposed to 8 threads per block.
Also I noticed there is a huge gain when using more than 12 blocks (even though my card only has 12 streaming multiprocessors).
Is there a reason for this?
talonmies and chaohuang provide good information in the comments, and you should look into that (not sure why these aren't answers, but that's their call). In any event, I will provide an abbreviated partial answer to explain something that you might not be considering.
Let's say that you have 8 threads of control, and 8 processors. If all the instructions in all 8 threads are on-chip instructions taking only a single cycle, then all 8 threads will finish in n cycles (assuming n total instructions per thread).
Now let's say that each thread of control consists of n instructions, where a fraction r of these are off-chip memory instructions, which take, e.g., 100 cycles to complete. These 8 threads will now take [(1 - r) + 100r]n cycles to complete. If r=0.1, this is about 11 times more than the previous case.
Now let's say that we have 16 threads. When the first batch of 8 threads is blocked on the slow operations, the other threads can execute; on-chip instructions can execute, and off-chip instructions can start. So instead of needing 2[(1 - r) + 100r]n cycles to complete all threads, you might need only a little more than [(1 - r) + 100r]n. In essence, because you have some room to overlap waiting threads with other threads, you can add more threads for free.
This is the great strength of the GPU model: massive parallelism to overcome long latency. It takes a long time to do a little bit of work, but not much more time to do a lot more work. Note that occupancy - related to the amount of work (in threads) you have ready to hide latency - isn't all that important for peak performance when the arithmetic intensity (related to r in the above formulae) is high. You might play around with the CUDA Occupancy Calculator to see the effect I describe for different scenarios.
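On newer toolkits there is also a runtime occupancy API; a rough sketch (assuming a kernel named myKernel launched with 256 threads per block and no dynamic shared memory) looks like:
// Ask how many blocks of myKernel can be resident per SM at this block size,
// then convert to occupancy = resident threads / maximum resident threads per SM.
int blocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
float occupancy = (blocksPerSM * 256.0f) / prop.maxThreadsPerMultiProcessor;
printf("theoretical occupancy: %.0f%%\n", occupancy * 100.0f);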
The short answer is latency hiding.
If you only have as many units of work (threads & blocks) as you have cores to work on them and execution hits a memory operation that needs hundreds of clock cycles to complete, the GPU has nothing else to work on so the cores sit idle until the memory op completes. That's wasting compute cycles.
If you offer more units of work than you have cores to do the work, then when one of the units of work hits a long-latency memory operation, the hardware scheduler can swap some other unit of work into the core(s) so that the cores are kept busy doing productive work while the long-latency memory operation completes. Having an excess of threads or blocks provides a better opportunity to use all the compute cycles when there are long-latency memory ops in the mix.
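A tiny experiment sketch of this effect (the kernel and sizes are illustrative; time it with events as usual): the same memory-bound kernel launched once with barely enough threads to cover the cores and once with far more.
// Memory-bound kernel: one fused multiply-add per element, grid-stride loop.
__global__ void saxpy(const float* x, float* y, float a, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        y[i] = a * x[i] + y[i];
}

// Same total work, very different latency hiding:
// saxpy<<<12, 8>>>(d_x, d_y, 2.0f, n);     //    96 threads: cores idle during memory waits
// saxpy<<<48, 256>>>(d_x, d_y, 2.0f, n);   // 12288 threads: waits overlapped with work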
There are basically 2 ways of hiding memory latency on a GPU:
increase occupancy, which means having more threads in flight than strictly required, so the scheduler can switch to other warps while some wait on memory.
increase the number of independent operations per thread. This keeps the cores busy through instruction-level parallelism.
Consider this sequence of instructions, used to compute a large number of elements:
a = b + c;
d = a + c;
The second instruction will stall because it is waiting for the result of the first instruction to complete.
When you use only 8 threads, these threads are waiting and the GPU cores are idling.
However, if you have more threads, the GPU is able to schedule other elements' computations while the current warp is waiting. That is why, when you increase the number of threads, it performs better. It is utilizing the GPU cores more efficiently =)
Hope this helps~