On a compute capability 1.3 CUDA card, we run the following code:

for (int i = 1; i < 20; ++i)
    kernelrun<<<30, 320>>>(...);

We know that each SM has 8 SPs and can hold 1024 resident threads, and that the Tesla C1060 has 30 SMs, so it can run 30*1024 threads concurrently.
Given this code, how many threads can run concurrently?
If the kernelrun kernel uses 48 registers per thread, what are the limitations on the Tesla C1060, which has 16384 registers and 16 KB of shared memory per SM?
Since concurrent kernel execution is not supported on the Tesla C1060, how can we execute the kernel calls in the loop concurrently? Are streams possible, given that the C1060 has only one engine for concurrent copy and execute?
NVIDIA has been shipping an occupancy calculator, which you can use to answer this question for yourself, since 2007. You should try it.
But to answer your question: each SM in your compute capability 1.3 device has 16384 registers, so if your kernel is register limited, the number of threads per block would be roughly 320 (16384/48, rounded down to the nearest multiple of 32). There is also a register allocation granularity to consider.
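For reference, here is a minimal sketch of that register-limit arithmetic as plain host code, using the 16384 registers per SM and 48 registers per thread given in the question (allocation granularity is ignored):

#include <stdio.h>

int main(void)
{
    const int regs_per_sm     = 16384; // compute capability 1.3
    const int regs_per_thread = 48;    // figure from the question
    const int warp_size       = 32;

    int raw_limit = regs_per_sm / regs_per_thread;        // 341
    int threads   = (raw_limit / warp_size) * warp_size;  // round down to a warp multiple -> 320
    printf("Register-limited threads per SM: %d\n", threads);
    return 0;
}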
Related
Is it possible, using streams, to have multiple unique kernels on the same streaming multiprocessor in Kepler 3.5 GPUs? I.e. run 30 kernels of size <<<1,1024>>> at the same time on a Kepler GPU with 15 SMs?
On a compute capability 3.5 device, it might be possible.
Those devices support up to 32 concurrent kernels per GPU and 2048 threads per multiprocessor. With 64k registers per multiprocessor, two blocks of 1024 threads could run concurrently if their register footprint was no more than 32 per thread and their shared memory usage was less than 24 KB per block.
You can find all of this in the hardware description in the appendices of the CUDA programming guide.
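As a minimal sketch of the launch pattern (the kernel, buffer, and sizes here are illustrative, not from the original question), each small kernel goes into its own stream so the hardware is free to run them concurrently, subject to the per-SM limits above:

__global__ void tinyKernel(float *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    out[idx] += 1.0f;
}

int main(void)
{
    const int nKernels = 30, threads = 1024;
    cudaStream_t streams[nKernels];
    float *buf;
    cudaMalloc(&buf, nKernels * threads * sizeof(float));
    cudaMemset(buf, 0, nKernels * threads * sizeof(float));

    for (int i = 0; i < nKernels; ++i)
        cudaStreamCreate(&streams[i]);

    // Each launch goes into its own stream; whether they actually overlap
    // depends on the register, shared memory and thread limits per SM.
    for (int i = 0; i < nKernels; ++i)
        tinyKernel<<<1, threads, 0, streams[i]>>>(buf + i * threads);

    cudaDeviceSynchronize();
    for (int i = 0; i < nKernels; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(buf);
    return 0;
}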
I am reading in the Kepler whitepaper here that Kepler supports up to 16 blocks per MP.
But threads/block = 1024 and threads/MP = 2048, hence blocks/MP = 2.
Am I missing something here?
As you correctly said, for Kepler a streaming multiprocessor can run up to 16 thread blocks.
In your example, if a thread block consists of 1024 threads, then only two blocks can be launched at the same time on one MP, because in this case you are limited by the maximum number of threads per multiprocessor: 2048 / 1024 = 2 blocks.
There are several factors that influence how many blocks can run concurrently on a streaming multiprocessor. An SM also has a limited number of registers and a limited amount of shared memory. If you use too many registers or too much shared memory, you will be limited by these factors instead.
A good overview of this is the CUDA occupancy calculator. With the Excel sheet you can easily set up a kernel configuration for any CUDA architecture and see what the kernel will be limited by.
The CUDA programming guide also provides all the required information.
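If you prefer to query this at run time rather than in the spreadsheet, newer CUDA toolkits (6.5 and later) expose the occupancy calculation through the runtime API. A minimal sketch, assuming a hypothetical kernel myKernel:

#include <stdio.h>

__global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }

int main(void)
{
    int blocksPerSM = 0;
    const int blockSize = 512;
    // Ask the runtime how many blocks of this size can be resident on one SM,
    // given the kernel's actual register and shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);
    printf("Resident blocks per SM for blockSize %d: %d\n", blockSize, blocksPerSM);
    return 0;
}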
Maybe a simple example helps; it was done with the occupancy calculator for compute capability 3.0.
If your thread block consists of 512 threads and you use hardly any registers or shared memory, then the number of parallel blocks is limited only by the block size.
For cc 3.0, 2048 threads can be resident per SM, so 2048 / 512 = 4: at most 4 thread blocks can run at the same time.
In the second step, suppose you additionally use 48 registers per thread.
Per thread block, 512 * 48 = 24576 registers are used. But an SM only has 65536 registers, so now only two blocks can run instead of four.
In the last step, let's assume a block uses 32000 bytes of shared memory. Because an SM only has 49152 bytes of shared memory, only one thread block can run at a time.
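The same three steps can be written out as plain arithmetic (values taken from the example above; the occupancy calculator additionally applies allocation-granularity rounding, which is ignored here):

#include <stdio.h>

int main(void)
{
    // Compute capability 3.0 limits used in the example above
    const int maxThreadsPerSM = 2048;
    const int regsPerSM       = 65536;
    const int smemPerSM       = 49152;

    // Kernel configuration from the example
    const int blockSize     = 512;
    const int regsPerThread = 48;
    const int smemPerBlock  = 32000;

    int byThreads = maxThreadsPerSM / blockSize;              // 4
    int byRegs    = regsPerSM / (regsPerThread * blockSize);  // 2
    int bySmem    = smemPerSM / smemPerBlock;                 // 1

    int blocks = byThreads;
    if (byRegs < blocks) blocks = byRegs;
    if (bySmem < blocks) blocks = bySmem;
    printf("Resident blocks per SM: %d\n", blocks);           // 1
    return 0;
}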
My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
No it doesn't.
From chapter 4 of the CUDA C programming guide:
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits, as well as the amount of registers and shared memory available on the multiprocessor, are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device compile and execute the cudaDeviceQuery example from the SDK.
At the time of writing, the maximum number of resident blocks per multiprocessor is the same across all compute capabilities and is equal to 8 (Kepler later raised this limit to 16, as noted above).
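If you don't have the SDK sample handy, a minimal sketch using cudaGetDeviceProperties prints the most relevant limits yourself (field names are from the CUDA runtime API):

#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors:         %d\n",    prop.multiProcessorCount);
    printf("Max threads per block:   %d\n",    prop.maxThreadsPerBlock);
    printf("Registers per block:     %d\n",    prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Warp size:               %d\n",    prop.warpSize);
    return 0;
}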
This comes down to semantics. What does "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 results of calculations per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU), it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources, such as registers, shared memory and global memory to run. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores. On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
The number of concurrently executed threads depends on your code and on the type of your CUDA device. For example, Fermi has 2 warp schedulers per streaming multiprocessor, and in each clock cycle two half-warps can be scheduled for arithmetic, a memory load, or a transcendental function calculation. While one half-warp waits on a load or on the transcendental units, the CUDA cores can execute something else. So you can keep 96 threads busy on the cores, but only if your code allows it. And, of course, you must have enough memory.
The GeForce GTX 560 Ti has 8 SMs and each SM has 48 CUDA cores (SPs). I'm going to launch a kernel this way: kernel<<<1024,1024>>>. The SM schedules threads in groups of 32 parallel threads called warps. How will blocks and threads be distributed among the 8 SMs and the 48 SPs in each SM? We have 1024 blocks of 1024 threads each, so what is a possible scenario? What is the maximum number of threads literally executing at the same time? What is the difference between Fermi's dual warp scheduler and earlier schedulers?
The NVIDIA supplied occupancy calculator spreadsheet, which ships in every SDK or is available for download here, can provide the answer to the first three "sub-questions" you have asked.
As for the difference between multiprocessor-level scheduling in Fermi compared with earlier architectures, the name ("dual warp scheduler") really says it all. In Fermi, MPs issue instructions from two warps simultaneously, compared to a single warp in the first two generations of CUDA capable architectures. If you want a more detailed answer than that, I recommend reading the Fermi architecture whitepaper, available for download here.
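As a rough back-of-the-envelope sketch (assuming the compute capability 2.1 limits of the GTX 560 Ti, namely at most 1536 resident threads per SM; these figures are not stated in the answer above, so treat them as an assumption):

#include <stdio.h>

int main(void)
{
    // Assumed compute capability 2.1 (GTX 560 Ti) figures
    const int numSM           = 8;
    const int coresPerSM      = 48;
    const int maxThreadsPerSM = 1536;

    // Launch configuration from the question: kernel<<<1024, 1024>>>
    const int blockSize = 1024;
    const int numBlocks = 1024;

    int blocksPerSM      = maxThreadsPerSM / blockSize;      // 1 resident block per SM
    int residentThreads  = numSM * blocksPerSM * blockSize;  // 8192 threads resident at once
    int executingThreads = numSM * coresPerSM;               // 384 results per core clock
    int waves = (numBlocks + numSM * blocksPerSM - 1) / (numSM * blocksPerSM); // 128 waves of blocks

    printf("Resident threads: %d, literally executing per clock: %d, waves of blocks: %d\n",
           residentThreads, executingThreads, waves);
    return 0;
}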
Dear friends:
I want to study CUDA programming, so I bought an NVIDIA GTS 450 PCI-E card. It has 192 SMs, so how many threads does it have? 192 threads? Or 192*512 threads?
Regards
In CUDA, the term "threads" refers to a property of a specific kernel invocation, not a property of the hardware.
For instance, in this CUDA invocation:
someFunction<<<2,32>>>(1,2,3);
you have 32 threads in each of 2 blocks, so 64 threads in total.
The hardware schedules threads to processors automatically.
According to the specs, your device has 192 "processor cores"; these are not the same as SMs. In CUDA, an SM is a multiprocessor that executes groups of threads in lockstep and contains several of those cores (8 per SM for the 1.3 family of devices, more for later devices).
As shoosh pointed out, the number of threads used is a function of your kernel invocation.
Typically to get good performance in CUDA, you should run many more threads than you have CUDA processor cores - this is to hide the latency of your global memory accesses.
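As a minimal sketch of that advice (the kernel and sizes are illustrative), a simple vector add launches far more threads than the 192 cores, so the hardware always has warps ready to run while others wait on global memory:

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 1 << 20;   // about a million elements
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    // A real program would copy input data from the host here (cudaMemcpy).

    // One thread per element: far more threads than cores, which hides memory latency.
    const int blockSize = 256;
    const int gridSize  = (n + blockSize - 1) / blockSize;
    vecAdd<<<gridSize, blockSize>>>(a, b, c, n);

    cudaDeviceSynchronize();
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}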