I would like to ask a question about concurrent kernel execution on NVIDIA GPUs. Let me explain my situation. I have code which launches one sparse matrix multiplication for each of 2 different matrices. These matrix multiplications are performed with the cuSPARSE library. I want both operations to be performed concurrently, so I launch them in 2 streams. With the NVIDIA Visual Profiler, I've observed that both operations (cuSPARSE kernels) are completely overlapped. The time stamps for both kernels are:
Kernel 1) Start Time: 206,205 ms - End Time: 284,177 ms.
Kernel 2) Start Time: 263,519 ms - End Time: 278,916 ms.
I'm using a Tesla K20c with 13 SMs, each of which can execute up to 16 blocks. Both kernels have 100% occupancy and launch a sufficient number of blocks:
Kernel 1) 2277 blocks, 32 Register/Thread, 1,156 KB shared memory.
Kernel 2) 46555 blocks, 32 Register/Thread, 1,266 KB shared memory.
With this configuration, both kernels shouldn't show this behaviour, since each launches enough blocks to fill all the SMs of the GPU. However, the NVIDIA Visual Profiler shows that these kernels are being overlapped. Why? Could anyone explain why this behaviour occurs?
Many thanks in advance.
With this configuration, both kernels shouldn't show this behaviour, since each launches enough blocks to fill all the SMs of the GPU.
I think this is an incorrect statement. As far as I know, the low-level scheduling behavior of blocks is not specified in CUDA.
Newer devices (cc3.5+ with hyper-Q) can more easily schedule blocks from concurrent kernels at the same time.
So, if you launch 2 kernels (A and B) concurrently, each with a large number of blocks, then you may observe any of the following:
blocks from kernel A execute concurrently with kernel B
all (or nearly all) of the blocks of kernel A execute before kernel B
all (or nearly all) of the blocks of kernel B execute before kernel A
Since there is no specification at this level, there is no direct answer. Any of the above are possible. The low level block scheduler is free to choose blocks in any order, and the order is not specified.
If a given kernel launch "completely saturates" the machine (i.e. uses enough resources to fully occupy the machine while it is executing) then there is no reason to think that the machine has extra capacity for a second concurrent kernel. Therefore there would be no reason to expect much, if any, speed up from running the two kernels concurrently as opposed to sequentially. In such a scenario, whether they execute concurrently or not, we would expect the total execution time for the 2 kernels running concurrently to be approximately the same as the total execution time if the two kernels are launched or scheduled sequentially (ignoring tail effects and launch overheads, and the like).
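For reference, a minimal sketch of the launch pattern under discussion: one cuSPARSE SpMV per non-default stream. All buffer names, dimensions, and the use of the legacy `cusparseDcsrmv` entry point are illustrative; the matrices are assumed to be already in CSR form on the device.

```cuda
#include <cuda_runtime.h>
#include <cusparse_v2.h>

// Issue one SpMV per stream so the two cuSPARSE kernels *may* overlap.
// Device buffers (d_val*, d_rowPtr*, d_colInd*, d_x*, d_y*) and the matrix
// dimensions are assumed to be set up elsewhere.
void launch_two_spmv(cusparseHandle_t handle, cusparseMatDescr_t descr,
                     int m1, int n1, int nnz1,
                     const double *d_val1, const int *d_rowPtr1,
                     const int *d_colInd1, const double *d_x1, double *d_y1,
                     int m2, int n2, int nnz2,
                     const double *d_val2, const int *d_rowPtr2,
                     const int *d_colInd2, const double *d_x2, double *d_y2)
{
    const double alpha = 1.0, beta = 0.0;
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cusparseSetStream(handle, s1);   // kernel 1 goes to stream 1
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m1, n1, nnz1,
                   &alpha, descr, d_val1, d_rowPtr1, d_colInd1,
                   d_x1, &beta, d_y1);

    cusparseSetStream(handle, s2);   // kernel 2 goes to stream 2
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, m2, n2, nnz2,
                   &alpha, descr, d_val2, d_rowPtr2, d_colInd2,
                   d_x2, &beta, d_y2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```

Whether the two SpMV kernels actually overlap, and whether overlapping helps, is then up to the block scheduler and the resource usage of each kernel, as described above.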
Related
I have read that CUDA prevents synchronization between threads from different blocks in order to allow scalability. How does that relate to scalability: would increasing the number of MPs in a device increase execution speed regardless of whether threads were synchronized within the same block or between different blocks?
There are at least two related things to consider.
Scalability goes in both directions, up and down. The desire is that your CUDA program be able to run on a GPU other than the one(s) it was "designed" for. This implies it might be running on a larger GPU or on a smaller GPU. Suppose I had a synchronization requirement among 128 threadblocks (as part of a perhaps larger program), all of which must execute concurrently in order to exchange data and satisfy the synchronization requirement before any of them can complete. If I run this program on a GPU with 16 SMs, each of which can schedule up to 16 threadblocks, it's reasonable to conclude the program might work, since the momentary capacity of the GPU is 256 threadblocks. But if I run the program on a GPU with 4 SMs, each of which can schedule 16 threadblocks, there is no circumstance under which the program can sensibly complete.
CUDA Scheduling will interfere (to some degree) with desired program execution. There is no guaranteed or specified order in which the CUDA scheduler may schedule threadblocks. Therefore, extending the above example, let's say for an 8 SM GPU (128 threadblocks momentary capacity), it might be that 127 of my 128 critical threadblocks get scheduled, while at the same time other, non-synchronization-critical threadblocks are scheduled. The 127 critical threadblocks will occupy 127 of the 128 available "slots", leaving only 1 slot remaining to process other threadblocks. The 127 critical threadblocks may well be idle, waiting for threadblock 128 to appear, consuming slots but not necessarily doing useful work. This will bring the performance of the GPU to a very low level, until the 128th threadblock eventually gets scheduled.
These are some examples of reasons why it is desirable not to have inter-threadblock synchronization requirements in the design of a CUDA program.
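To make the hazard concrete, here is a naive inter-block barrier (entirely hypothetical code): it only terminates if every block in the grid is resident at the same time, which nothing in the CUDA model guarantees.

```cuda
__device__ unsigned int arrived = 0;  // global arrival counter

__global__ void two_phase_kernel()
{
    // ... phase 1 work ...
    __syncthreads();

    // Naive global barrier: each block's thread 0 signals arrival, then
    // spins until all gridDim.x blocks have arrived.
    if (threadIdx.x == 0) {
        atomicAdd(&arrived, 1);
        // DANGER: if the GPU cannot co-schedule the whole grid, the resident
        // blocks spin here forever waiting for blocks that cannot launch
        // until an SM slot frees up, which never happens. Deadlock.
        while (atomicAdd(&arrived, 0) < gridDim.x) { }
    }
    __syncthreads();

    // ... phase 2 work, which assumes it can read other blocks' output ...
}
```

Run on a GPU whose momentary capacity is smaller than the grid, this kernel hangs; run on a larger GPU, it may appear to work, which is exactly the non-portability the answer above describes.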
I have a kernel that runs on my GPU (GeForce 690) and uses a single block. It runs in about 160 microseconds. My plan is to launch 8 of these kernels separately, each of which only uses a single block, so each would run on a separate SM, and then they'd all run concurrently, hopefully in about 160 microseconds.
However, when I do that, the total time increases linearly with each kernel: 320 microseconds if I run 2 kernels, about 490 microseconds for 3 kernels, etc.
My question: Do I need to set any flag somewhere to get these kernels to run concurrently? Or do I have to do something that isn't obvious?
As @JackOLantern indicated, concurrent kernels require the use of streams, which are required for all forms of asynchronous activity scheduling on the GPU. They also require a GPU of compute capability 2.0 or greater, generally speaking. If you do not use streams in your application, all CUDA API and kernel calls will be executed sequentially, in the order in which they were issued in the code, with no overlap from one call/kernel to the next.
Rather than give a complete tutorial here, please review the concurrentKernels CUDA sample that @JackOLantern referenced.
Also note that actually witnessing concurrent execution can be more difficult on Windows, for a variety of reasons. If you run the concurrentKernels sample, it will indicate pretty quickly whether the environment you are in (OS, driver, etc.) provides concurrent execution.
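A stripped-down version of the pattern that sample demonstrates, matching the 8-kernel plan above (the spin length is arbitrary, chosen only so overlap is visible in a profiler):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Busy-wait for roughly `cycles` clock ticks so each kernel runs long
// enough for overlap to be observable.
__global__ void spin(clock_t cycles)
{
    clock_t start = clock();
    while (clock() - start < cycles) { }
}

int main()
{
    const int nkernels = 8;           // e.g. one single-block kernel per SM
    cudaStream_t streams[nkernels];

    for (int i = 0; i < nkernels; ++i)
        cudaStreamCreate(&streams[i]);

    // One block each, issued into separate non-default streams: this is the
    // prerequisite for the kernels to run concurrently.
    for (int i = 0; i < nkernels; ++i)
        spin<<<1, 1, 0, streams[i]>>>(100000000);

    cudaDeviceSynchronize();

    for (int i = 0; i < nkernels; ++i)
        cudaStreamDestroy(streams[i]);
    printf("done\n");
    return 0;
}
```

If the same launches go into the default stream instead, they serialize, which reproduces the linear time growth described in the question.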
I was wondering, if I run a kernel with 10 blocks of 1000 threads in one stream to analyse an array of data, and then launch a kernel that requires 10 blocks of 1000 threads to analyse another array in a second stream, what is going to happen?
Are the un-active threads on my card going to begin analysing my second array?
Or is the second stream going to be paused until the first stream has finished?
Thank you.
Generally speaking, if the kernels are issued from different (non-default) streams of the same application, and all requirements for execution of concurrent kernels are met, and there are enough resources available (SMs, especially -- I guess this is what you mean by "un-active threads") to schedule both kernels, then some of the blocks of the second kernel will begin executing alongside the blocks of the first kernel that are already executing. This may occur on the same SMs that the blocks of the first kernel are already scheduled on, or it may occur on other, unoccupied SMs, or both (for example, if your GPU has 14 SMs, the work distributor would distribute the 10 blocks of the first kernel onto 10 of the SMs, leaving 4 unused at that point).
If on the other hand, your kernels had threadblocks requiring 32KB of shared memory usage, and your GPU had 8 SMs, then the threadblocks of the first kernel would effectively "use up" the 8 SMs, and the threadblocks of the second kernel would not begin executing until some of the threadblocks of the first kernel had "drained" i.e. completed and been retired. That's just one example of resource utilization that could inhibit concurrent execution. And of course, if you were launching kernels with many threadblocks each (say 100 or more) then the first kernel would mostly occupy the machine, and the second kernel would not begin executing until the first kernel had largely finished.
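On toolkits that provide it (CUDA 6.5 and later), the shared-memory limit in that example can be checked programmatically instead of by hand; a sketch, assuming a hypothetical `heavy_kernel` that requests 32KB of dynamic shared memory:

```cuda
#include <cuda_runtime.h>

// Placeholder for a kernel whose blocks each need 32KB of shared memory.
__global__ void heavy_kernel(float *data) { /* ... */ }

int resident_blocks_per_sm()
{
    int numBlocks = 0;
    // How many blocks of heavy_kernel can be resident on one SM at
    // 256 threads/block with 32KB of dynamic shared memory per block?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocks, heavy_kernel,
                                                  256, 32 * 1024);
    // On a device with 48KB of shared memory per SM this comes back as 1:
    // one resident block per SM, so an 8-block grid fills an 8-SM GPU and
    // a second kernel must wait for blocks to drain.
    return numBlocks;
}
```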
If you search in the upper right hand corner on "cuda concurrent kernels" you'll find a number of questions that highlight some of the challenges associated with observing concurrent kernel execution.
Question
total GPU time + total CPU overhead is smaller than the total execution time. Why?
Detail
I am studying how frequent global memory access and kernel launch may affect the performance and I have designed a code which has multiple small kernels and ~0.1 million kernel calls in total. Each kernel reads data from global memory, processes them and then writes back to the global memory. As expected, the code runs much slower than the original design which has only one large kernel and very few kernel launches.
The problem arose when I used the command-line profiler to get "gputime" (execution time for a GPU kernel or memory-copy method) and "cputime" (CPU overhead for a non-blocking method; the sum of gputime and CPU overhead for a blocking method). To my understanding, the sum of all gputimes and all cputimes should exceed the entire execution time (the last "gpuendtimestamp" minus the first "gpustarttimestamp"), but it turns out the contrary is true (sum of gputimes = 13.835064 s, sum of cputimes = 4.547344 s, total time = 29.582793 s). Between the end of one kernel and the start of the next, there is often a large amount of waiting time, larger than the CPU overhead of the next kernel. Most of the kernels that suffer from this problem are memcpyDtoH, memcpyDtoD and Thrust internal functions such as launch_closure_by_value, fast_scan, etc. What is the probable reason?
System
Windows 7, TCC driver, VS 2010, CUDA 4.2
Thanks for your help!
This is possibly a combination of profiling, which increases latency, and the Windows WDDM subsystem. To overcome the high latency of the latter, the CUDA driver batches GPU operations and submits them in groups with a single Windows kernel call. This can cause large periods of GPU inactivity if CUDA API commands are sitting in an unsubmitted batch.
(Copied @talonmies' comment to an answer, to enable voting and accepting.)
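One workaround that is often suggested for the WDDM batching issue (hypothetical for this code, since the original loop is not shown, and only relevant if the WDDM driver rather than TCC is actually in use) is to poll the stream after issuing work, which forces the driver to submit its accumulated batch:

```cuda
#include <cuda_runtime.h>

// Placeholder for one of the ~0.1 million small kernels in the loop.
__global__ void work(float *data, int iter) { /* ... */ }

void run_batched(cudaStream_t stream, float *d_data, int niters,
                 dim3 grid, dim3 block)
{
    for (int i = 0; i < niters; ++i) {
        work<<<grid, block, 0, stream>>>(d_data, i);
        // Cheap, returns immediately; side effect under WDDM is that the
        // queued command batch is flushed to the GPU instead of sitting
        // unsubmitted, which shrinks the idle gaps between kernels.
        cudaStreamQuery(stream);
    }
    cudaStreamSynchronize(stream);
}
```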
I am having some trouble understanding threads in the NVIDIA GPU architecture with CUDA.
Could anybody please clarify this information:
An 8800 GPU has 16 SMs with 8 SPs each, so we have 128 SPs.
I was viewing Stanford University's video presentation and it said that every SP is capable of running 96 threads concurrently. Does this mean that an SP can run 96/32 = 3 warps concurrently?
Moreover, since every SP can run 96 threads and we have 8 SPs in every SM, does this mean that every SM can run 96*8 = 768 threads concurrently? But if every SM can run a single block at a time, and the maximum number of threads in a block is 512, what is the purpose of being able to run 768 threads concurrently with a maximum of 512 threads?
A more general question: how are blocks, threads, and warps distributed to SMs and SPs? I read that every SM gets a single block to execute at a time, that threads in a block are divided into warps (32 threads), and that SPs execute warps.
You should check out the webinars on the NVIDIA website, you can join a live session or view the pre-recorded sessions. Below is a quick overview, but I strongly recommend you watch the webinars, they will really help as you can see the diagrams and have it explained at the same time.
When you execute a function (a kernel) on a GPU, it executes as a grid of blocks of threads.
A thread is the finest granularity, each thread has a unique identifier within the block (threadIdx) which is used to select which data to operate on. The thread can have a relatively large number of registers and also has a private area of memory known as local memory which is used for register file spilling and any large automatic variables.
A block is a group of threads which execute together in a batch. The main reason for this level of granularity is that threads within a block can cooperate by communicating using the fast shared memory. Each block has a unique identifier (blockIdx) which, in conjunction with the threadIdx, is used to select data.
A grid is a set of blocks which together execute the GPU operation.
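In code, the hierarchy looks like this (a minimal sketch; the kernel name and sizes are made up):

```cuda
// Each thread computes one element: blockIdx selects the block within the
// grid, threadIdx selects the thread within the block, and together they
// form a unique global index.
__global__ void scale(float *data, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // the grid is rounded up, so guard the tail
        data[i] *= a;
}

// Launch as a grid of blocks of threads: enough 256-thread blocks to
// cover all n elements.
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```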
That's the logical hierarchy. You really only need to understand the logical hierarchy to implement a function on the GPU, however to get performance you need to understand the hardware too which is SMs and SPs.
A GPU is composed of SMs, and each SM contains a number of SPs. Currently there are 8 SPs per SM and between 1 and 30 SMs per GPU, but really the actual number is not a major concern until you're getting really advanced.
The first point to consider for performance is warps. A warp is a set of 32 threads: if you have 128 threads in a block (for example), then threads 0-31 will be in one warp, 32-63 in the next, and so on. Warps are very important for a few reasons, the most important being:
Threads within a warp are bound together: if one thread within a warp goes down the 'if' side of an if-else block and the others go down the 'else', then actually all 32 threads will go down both sides. Functionally there is no problem (those threads which should not have taken the branch are disabled, so you will always get the correct result), but if both sides are long then the performance penalty is significant.
Threads within a warp (actually a half-warp, but if you get it right for warps then you're safe on the next generation too) fetch data from memory together, so if you can ensure that all threads fetch data within the same 'segment' then you will only pay one memory transaction, whereas if they all fetch from random addresses then you will pay 32 memory transactions. See the Advanced CUDA C presentation for details on this, but only when you are ready!
Threads within a warp (again, half-warp on current GPUs) access shared memory together, and if you're not careful you will have 'bank conflicts' where the threads have to queue up behind each other to access the memory banks.
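A small sketch of how a 1-D block decomposes into warps, and of the divergence case from the first point (the kernel is illustrative only):

```cuda
__global__ void warp_demo(int *out)
{
    int warpId = threadIdx.x / 32;  // threads 0-31 -> warp 0, 32-63 -> warp 1, ...
    int lane   = threadIdx.x % 32;  // position of this thread within its warp

    // Divergence: lanes of the same warp disagree on the condition, so the
    // warp executes both sides with the non-taking lanes masked off.
    if (lane < 16)
        out[threadIdx.x] = warpId;    // taken by lanes 0-15
    else
        out[threadIdx.x] = -warpId;   // taken by lanes 16-31
}
```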
So having understood what a warp is, the final point is how the blocks and grid are mapped onto the GPU.
Each block will start on one SM and will remain there until it has completed. As soon as it has completed it will retire and another block can be launched on the SM. It's this dynamic scheduling that gives GPUs their scalability: if you have one SM then all blocks run on the same SM in one big queue; if you have 30 SMs then the blocks will be scheduled across the SMs dynamically. So you should ensure that when you launch a GPU function your grid is composed of a large number of blocks (at least hundreds) to ensure it scales across any GPU.
The final point to make is that an SM can execute more than one block at any given time. This explains why an SM can handle 768 threads (or more in some GPUs) while a block is limited to 512 threads (currently). Essentially, if the SM has the resources available (registers and shared memory) then it will take on additional blocks (up to 8). The Occupancy Calculator spreadsheet (included with the SDK) will help you determine how many blocks can execute at any moment.
Sorry for the brain dump, watch the webinars - it'll be easier!
It's a little confusing at first, but it helps to know that each SP does something like 4-way SMT: it cycles through 4 threads, issuing one instruction per clock, with a 4-cycle latency on each instruction. That's how you get 32 threads per warp running on 8 SPs.
Rather than go through all the rest of the stuff with warps, blocks, threads, etc., I'll refer you to the NVIDIA CUDA Forums, where this kind of question crops up regularly and there are already some good explanations.