Dear friends:
I want to study CUDA programming, so I bought an NVIDIA GTS 450 PCI-E card. It has 192 SMs, so how many threads does it have? 192 threads, or 192*512 threads?
Regards
In CUDA, the term "threads" refers to a property of a specific kernel invocation, not a property of the hardware.
For instance, in this CUDA invocation:
someFunction<<<2,32>>>(1,2,3);
you have 2 blocks of 32 threads, so 64 threads in total.
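For completeness, here is what that invocation might look like as a minimal, self-contained program. The kernel body is made up purely for illustration (the original someFunction could do anything), and device printf needs compute capability 2.0 or newer:

#include <cstdio>

// Hypothetical kernel body: each of the 64 threads prints its global index.
// The parameters a, b and c are just placeholders, as in the launch above.
__global__ void someFunction(int a, int b, int c)
{
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d, global id %d\n", blockIdx.x, threadIdx.x, globalId);
}

int main()
{
    someFunction<<<2, 32>>>(1, 2, 3);   // 2 blocks * 32 threads = 64 threads
    cudaDeviceSynchronize();            // wait for the kernel (and its printf output) to finish
    return 0;
}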
The hardware schedules threads to processors automatically.
According to the specs, your device has 192 "processor cores" - these are not the same as SMs. In CUDA, an SM is a multiprocessor that executes multiple threads in lockstep (8 at a time for the compute capability 1.3 family of devices, more for later devices).
As shoosh pointed out, the number of threads used is a function of your kernel invocation.
Typically to get good performance in CUDA, you should run many more threads than you have CUDA processor cores - this is to hide the latency of your global memory accesses.
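As a rough sketch of what that looks like in practice (the kernel and sizes here are made up for illustration), an element-wise kernel typically launches one thread per data element, which is usually thousands of blocks:

#include <cuda_runtime.h>

// Hypothetical element-wise kernel: one thread per array element.
__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;   // one global load and one global store per thread
}

int main()
{
    const int n = 1 << 20;   // 1M elements: far more threads than CUDA cores
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // 4096 blocks
    scaleArray<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}

With far more threads than cores, the scheduler can switch to ready warps while others are still waiting on global memory.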
I have already read several threads about the capacity of the GPU and understood that the concept of blocks and threads has to be separated from the physical hardware. Although the maximum number of threads per block is 1024, there is no limit on the number of blocks one can use. However, as the number of streaming processors is finite, there has to be a physical limit.

After writing a GPU program, I would be interested in evaluating the used capacity of my GPU. To do this, I have to know how many threads I could theoretically start at one time on the hardware. My graphics card is an NVIDIA GeForce 1080 Ti, so I have 3584 CUDA cores. As far as I understood, each CUDA core executes one thread, so in theory I would be able to execute 3584 threads per cycle. Is this correct?
Another question concerns memory. I installed and used nvprof to get some insight into the kernels used. What is displayed there is, for example, the number of registers used. I transfer my arrays to the GPU using cuda.to_device (in Python Numba) and, as far as I understood, the arrays then reside in global memory. How do I find out how big this global memory is? Is it equivalent to the DRAM size?
Thanks in advance
I'll focus on the first part of the question. The second should really be its own separate question.
CUDA cores do not map 1-to-1 to threads. They are more like execution ports in a superscalar CPU: multiple threads can issue instructions to the same CUDA core in different clock cycles, somewhat like hyperthreading in a CPU.
You can see the relation and the numbers in the documentation: compare chapter K, Compute Capabilities, with Table 3, Throughput of Native Arithmetic Instructions. Depending on your architecture, for your card (compute capability 6.1) you have, for example, 2048 threads per SM and 128 32-bit floating-point operations per clock cycle. That means 128 CUDA cores are shared by a maximum of 2048 threads.
Within one GPU generation, the absolute number of threads and CUDA cores only scales with the number of multiprocessors (SMs). TechPowerUp's excellent GPU database documents 28 SMs for your card, which should give you 28 * 2048 = 57344 threads, unless I did something wrong.
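If you would rather not rely on a database, you can query these numbers at runtime with cudaGetDeviceProperties; a small sketch (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    // Maximum number of threads that can be resident on the whole GPU at once.
    int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("SMs: %d, max resident threads per SM: %d, total: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, residentThreads);
    return 0;
}

On a 1080 Ti this should report 28 SMs and 2048 threads per SM, i.e. 57344 resident threads.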
Is it possible, using streams, to have multiple unique kernels on the same streaming multiprocessor in Kepler 3.5 GPUs? I.e., can I run 30 kernels of size <<<1,1024>>> at the same time on a Kepler GPU with 15 SMs?
On a compute capability 3.5 device, it might be possible.
Those devices support up to 32 concurrent kernels per GPU and 2048 threads per multiprocessor. With 64K registers per multiprocessor, two blocks of 1024 threads could run concurrently if their register footprint was no more than 32 per thread (65536 / 2048) and no more than 24 KB of shared memory per block.
You can find all of this in the hardware description in the appendices of the CUDA programming guide.
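As a sketch of what such a launch could look like (the kernel name and workload are made up; whether the kernels actually overlap depends on their register and shared memory usage, as described above):

#include <cuda_runtime.h>

// Hypothetical long-running kernel, launched as a single block of 1024 threads.
__global__ void busyKernel(float *out)
{
    float x = (float)threadIdx.x;
    for (int i = 0; i < 100000; ++i)
        x = x * 0.999f + 1.0f;           // keep the threads busy
    out[threadIdx.x] = x;
}

int main()
{
    const int numKernels = 30;
    cudaStream_t streams[numKernels];
    float *d_out;
    cudaMalloc(&d_out, numKernels * 1024 * sizeof(float));

    // One kernel per stream, so the hardware is free to run them concurrently.
    for (int i = 0; i < numKernels; ++i) {
        cudaStreamCreate(&streams[i]);
        busyKernel<<<1, 1024, 0, streams[i]>>>(d_out + i * 1024);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < numKernels; ++i)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_out);
    return 0;
}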
On a compute capability 1.3 CUDA card, we run the following code:
for(int i=1;i<20;++i)
kernelrun<<<30,320>>>(...);
We know that each SM has 8 SPs and can run 1024 threads, so the 30 SMs in the Tesla C1060 can run 30*1024 threads concurrently.
Given this code, how many threads can run concurrently?
If the kernelrun kernel uses 48 registers per thread, what are the limitations on the Tesla C1060, which has 16384 registers and 16 KB of shared memory per SM?
Since concurrent kernel execution is not supported on the Tesla C1060, how can we execute the kernels in the loop concurrently? Are streams possible?
Is there only one concurrent copy-and-execute engine in the Tesla C1060?
NVIDIA have been shipping an Occupancy Calculator since 2007 which you can use to answer this question for yourself. You should try it.
But to answer your question: each SM in your compute capability 1.3 device has 16384 registers, so if your kernel is register limited, the number of threads per block would be roughly 320 (16384/48, rounded down to the nearest multiple of 32). There is also a register page allocation granularity to consider.
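For this particular launch the arithmetic can be sketched as follows (register pressure only, ignoring allocation granularity, and assuming the 48 registers per thread given in the question):

#include <cstdio>

int main()
{
    const int regsPerSM     = 16384;                               // compute capability 1.3
    const int regsPerThread = 48;                                  // from the question
    const int maxThreads    = regsPerSM / regsPerThread / 32 * 32; // 341 rounded down -> 320
    printf("Register-limited threads per SM: %d\n", maxThreads);   // 320
    // A 320-thread block uses 320 * 48 = 15360 registers, so only one block fits per SM.
    printf("Concurrent threads for <<<30,320>>> on 30 SMs: %d\n", 30 * 320);   // 9600
    return 0;
}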
My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
No, it doesn't.
From chapter 4 of the CUDA C programming guide:
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits as well as the amount of registers and shared memory available on the multiprocessor are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device, compile and execute the deviceQuery example from the SDK.
So far the maximum number of resident blocks per multiprocessor is the same across all compute capabilities and is equal to 8.
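Later CUDA toolkits (6.5 and newer, so not the ones this answer was written against) also expose an occupancy API that does this bookkeeping for you; a rough sketch with a made-up kernel:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose resource usage we want to check.
__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = (float)i;
}

int main()
{
    int blocksPerSM = 0;
    // How many blocks of 256 threads can be resident per SM for this kernel,
    // given its register/shared memory usage (0 bytes of dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
    printf("Resident blocks of 256 threads per SM: %d\n", blocksPerSM);
    return 0;
}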
This comes down to semantics. What does "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 results of calculations per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU), it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources, such as registers, shared memory and global memory to run. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores. On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
The number of concurrently executed threads depends on your code and on the type of your CUDA device. For example, Fermi has 2 warp schedulers per streaming multiprocessor, and on each clock cycle two half-warps can be scheduled for computation, a memory load, or a transcendental function calculation. While one half-warp waits on a load or executes a transcendental function, the CUDA cores can work on something else. So you can get 96 threads running on the cores, but only if your code allows it. And, of course, you must have enough memory.
The GeForce GTX 560 Ti has 8 SMs, and each SM has 48 CUDA cores (SPs). I'm going to launch a kernel this way: kernel<<<1024,1024>>>. The SM schedules threads in groups of 32 parallel threads called warps. How will the blocks and threads be distributed among the 8 SMs and the 48 SPs in each SM? We have 1024 blocks and 1024 threads per block, so what is a possible scenario? What is the maximum number of threads literally executing at the same time? What is the difference between the Fermi dual warp scheduler and earlier schedulers?
The NVIDIA-supplied occupancy calculator spreadsheet, which ships in every SDK and is also available for download here, can provide the answer to the first three sub-questions you have asked.
As for the difference between multiprocessor-level scheduling in Fermi compared with earlier architectures, the name ("dual warp scheduler") really says it all. In Fermi, MPs retire instructions from two warps simultaneously, compared to a single warp, as was the case in the first two generations of CUDA-capable architectures. If you want a more detailed answer than that, I recommend reading the Fermi architecture whitepaper, available for download here.