CUDA: Does the compute capability impact the maximum number of active threads?

If I have a device supporting CC 3.0, that means it has a maximum of 2048 active threads per multiprocessor. If I set the CC to 2.0 (compute_20,sm_20), does that mean the maximum number of active threads will only be 1536 per multiprocessor, or does the compute capability have no impact on this?
Does it have an impact on the shared memory size?

CUDA is designed for scalability; kernels will expand to use all of the resources they can. So it doesn't matter how you compile the kernel; it will fill up all of the available threads unless you do something that prevents it from doing so, like launching it with 768 threads per block.
Now, GPU threads aren't like CPU cores; you aren't losing the ability to do computation if you aren't using all of the threads. A streaming multiprocessor (SM) on a device of compute capability 3.0 can manage 2048 threads simultaneously, but is only capable of executing 256 instructions per tick. There are other limits too; e.g. if you're doing 32-bit floating point addition, it can only do 192 of those per tick. Doing left shifts on 32-bit integers? Only 64 per tick.
The point of having more threads is latency hiding: when one thread is blocked for some reason, such as waiting to fetch a value from memory or to get the result of an arithmetic instruction, the SM will run a different thread instead. Using more threads gives you more opportunities to hide this latency: more chances to have independent work available to do when some instructions are blocked, waiting for data.
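As a concrete check, the active-thread and shared-memory limits are properties of the physical device, so you can read them at run time regardless of which -arch/-code settings you compiled with. A minimal sketch using the CUDA runtime API (device 0 assumed; the sharedMemPerMultiprocessor field requires a reasonably recent toolkit):

```cuda
// Minimal sketch: the resident-thread limit comes from the hardware's compute
// capability, not from the architecture the kernel was compiled for.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    printf("Compute capability:            %d.%d\n", prop.major, prop.minor);
    printf("Max resident threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Shared memory per SM (bytes):  %zu\n", prop.sharedMemPerMultiprocessor);
    return 0;
}
```

On a CC 3.0 device this should report 2048 threads per SM whether the kernel was built as sm_20 or sm_30; compiling for an older architecture changes which instructions are generated, not the hardware limits.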

Related

Maximum number of GPU Threads on Hardware and used memory

I have already read several threads about the capacity of the GPU and understood that the concept of blocks and threads has to be separated from the physical hardware. Although the maximum number of threads per block is 1024, there is no limit on the number of blocks one can use. However, as the number of streaming processors is finite, there has to be a physical limit. After I wrote a GPU program, I would be interested in evaluating the used capacity of my GPU. To do this, I have to know how many threads I could start theoretically at one time on the hardware. My graphics card is an NVIDIA GeForce 1080 Ti, so I have 3584 CUDA cores. As far as I understood, each CUDA core executes one thread, so in theory I would be able to execute 3584 threads per cycle. Is this correct?
Another question is the one about memory. I installed and used nvprof to get some insight into the used kernels. What is displayed there is for example the number of used registers. I transfer my arrays to the GPU using cuda.to_device (in Python Numba) and as far as I understood, the arrays then reside in global memory. How do I find out how big this global memory is? Is it equivalent to the DRAM size?
Thanks in advance
I'll focus on the first part of the question. The second should really be its own separate question.
CUDA cores do not map 1-to-1 to threads. They are more like ports in a superscalar CPU. Multiple threads can issue instructions to the same CUDA core in different clock cycles. Sort of like hyperthreading in a CPU.
You can see the relationship and the numbers in the documentation by comparing chapter K, Compute Capabilities, with Table 3, Throughput of Native Arithmetic Instructions. Depending on your architecture, for your card (compute capability 6.1) you may have, for example, 2048 threads per SM and 128 32-bit floating-point operations per clock cycle. That means you have 128 CUDA cores shared by a maximum of 2048 threads.
Within one GPU generation, the absolute number of threads and CUDA cores only scales with the number of multiprocessors (SMs). TechPowerup's excellent GPU database documents 28 SMs for your card, which should give you 28 * 2048 = 57,344 resident threads, unless I did something wrong.
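If you prefer not to rely on a database table, the same numbers can be queried at run time. A minimal sketch (device 0 assumed), which on a GeForce GTX 1080 Ti should report 28 SMs and 2048 resident threads per SM; its last line also addresses the memory part of the question, since totalGlobalMem is the size of the device's global (DRAM) memory:

```cuda
// Minimal sketch: query SM count, resident threads per SM, and global memory size.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    int residentThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    printf("SMs: %d, max resident threads per SM: %d, total resident threads: %d\n",
           prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, residentThreads);
    printf("Global (DRAM) memory: %zu bytes\n", prop.totalGlobalMem);
    return 0;
}
```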

How does an SM in CUDA run multiple blocks simultaneously?

In CUDA, can an SM run multiple blocks simultaneously if each block doesn't use too many resources?
On Fermi, we know that an SM has 32 KB of register space to use. Suppose a thread uses 32 registers; then this SM can launch one block containing 256 ((32*1024)/(32*4)) threads. If an SM can run multiple blocks simultaneously, we could also configure 32 threads per block and 8 blocks for the SM. Is there any difference?
As talonmies commented, your math is not entirely correct. But the key point is that an SM contains a balance of many different types of resources. The better your kernel and kernel launch parameters fit with this balance, the better your performance.
I haven't checked the numbers for Kepler (compute capability 3.x) but for Fermi (2.x), an SM can keep track of 48 concurrent warps (1,536 threads) and 8 concurrent blocks. This means that if you chose a low thread count for your blocks, the 8 concurrent blocks becomes the limiting factor to occupancy in your kernel. For instance, if you chose 32 threads per block, you get up to 256 (8 * 32) concurrent threads running on the SM while the SM can run up to 1,536 threads (48 * 32).
In the occupancy calculator, you can see what the different hardware limits are and it will tell you which of them becomes the limiting factor with your specific kernel. You can experiment with variations in launch parameters, shared memory usage and register usage to see how they affect your occupancy.
Occupancy is not everything when it comes to performance. Increased occupancy translates to increased ability to hide the latency of memory transfers. When the memory bandwidth is saturated, increasing occupancy further does not help. There is another effect in play as well. Increasing the size of a block may decrease occupancy but at the same time increase the amount of instruction level parallelism (ILP) available in your kernel. In this case, decreasing occupancy can increase performance.
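On toolkits newer than this answer (CUDA 6.5 and later), you can also get these limits programmatically from the occupancy API rather than the calculator. A minimal sketch with a hypothetical placeholder kernel, looping over block sizes to show where the blocks-per-SM limit, rather than the threads-per-SM limit, becomes the bottleneck:

```cuda
// Minimal sketch: how block size interacts with the per-SM block and thread limits.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) {        // hypothetical placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (out) out[i] = (float)i;                  // trivial work so the kernel isn't empty
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // device 0 assumed
    for (int blockSize = 32; blockSize <= 1024; blockSize *= 2) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                      blockSize, 0 /* dynamic smem */);
        printf("block size %4d -> %2d resident blocks/SM, %4d of %d resident threads/SM\n",
               blockSize, blocksPerSM, blocksPerSM * blockSize,
               prop.maxThreadsPerMultiProcessor);
    }
    return 0;
}
```

On a Fermi-class device, the 32-thread case should come out limited by the 8-block cap (256 of 1,536 threads), matching the arithmetic above.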

My GPU has 2 multiprocessors with 48 CUDA cores each. What does this mean?

My GPU has 2 multiprocessors with 48 CUDA cores each. Does this mean that I can execute 96 thread blocks in parallel?
No it doesn't.
From chapter 4 of the CUDA C programming guide:
The number of blocks and warps that can reside and be processed together on the multiprocessor for a given kernel depends on the amount of registers and shared memory used by the kernel and the amount of registers and shared memory available on the multiprocessor. There are also a maximum number of resident blocks and a maximum number of resident warps per multiprocessor. These limits as well as the amount of registers and shared memory available on the multiprocessor are a function of the compute capability of the device and are given in Appendix F. If there are not enough registers or shared memory available per multiprocessor to process at least one block, the kernel will fail to launch.
Get the guide at: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
To check the limits for your specific device, compile and execute the deviceQuery example from the SDK.
So far the maximum number of resident blocks per multiprocessor is the same across all compute capabilities and is equal to 8.
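Per the quoted paragraph, whether blocks fit on a multiprocessor depends on your kernel's register and shared memory footprint. A minimal sketch of how to read those numbers at run time with cudaFuncGetAttributes (the kernel here is just a hypothetical placeholder):

```cuda
// Minimal sketch: inspect a kernel's per-thread registers and static shared memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out, int n) {  // hypothetical placeholder
    __shared__ float tile[256];                    // some static shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // assumes blockDim.x <= 256
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * 2.0f;
}

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread:            %d\n", attr.numRegs);
    printf("static shared memory per block:  %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block (kernel):  %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```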
This comes down to semantics. What do "execute" and "running in parallel" really mean?
At a basic level, having 96 CUDA cores really means that you have a potential throughput of 96 results of calculations per cycle of the core clock.
A core is mainly an Arithmetic Logic Unit (ALU), it performs basic arithmetic and logical operations. Aside from access to an ALU, a thread needs other resources, such as registers, shared memory and global memory to run. The GPU will keep many threads "in flight" to keep all these resources utilized to the fullest. The number of threads "in flight" will typically be much higher than the number of cores. On one hand, these threads can be seen as being "executed in parallel" because they are all consuming resources on the GPU at the same time. But on the other hand, most of them are actually waiting for something, such as data to arrive from global memory or for results of arithmetic to go through the pipelines in the cores. The GPU puts threads that are waiting for something on the "back burner". They are consuming some resources, but are they actually running? :)
The number of concurrently executed threads depends on your code and on the type of your CUDA device. For example, Fermi has two warp schedulers per streaming multiprocessor, so per clock cycle two half-warps can be scheduled for arithmetic, a memory load, or a transcendental function calculation. While one half-warp waits on a load or on a transcendental function, the CUDA cores can execute something else. So you can get 96 threads onto the cores, but only if your code allows it. And, of course, you must have enough memory.

Relation between number of blocks of threads and cuda cores on machine (in CUDA C)

I have CUDA 2.1 installed on my machine and it has a graphics card with 64 CUDA cores.
I have written a program in which I launch 30000 blocks simultaneously (with 1 thread per block), but I am not getting satisfying results from the GPU (it performs more slowly than the CPU).
Does the number of blocks have to be smaller than or equal to the number of cores for good performance? Or does the performance have nothing to do with the number of blocks?
CUDA cores are not exactly what you might call a core on a classical CPU. Indeed, they have to be viewed as nothing more than ALUs (Arithmetic and Logic Units), which are only able to execute operations whose operands are ready.
You might know that threads are handled per warp (groups of 32 threads) inside the blocks you've defined. When your blocks are dispatched onto the different SMs (Streaming Multiprocessors, the actual cores of the GPU), each SM schedules the warps within a block so as to overlap computation with the memory accesses needed to fetch the threads' input data.
The problem is that threads are always handled through the warp they belong to, so if you have only one thread per block, the SM it is running on won't be able to schedule across warps and you won't take advantage of the multiple CUDA cores available. Your CUDA cores will be waiting for data to process, since CUDA cores compute far more quickly than data can be retrieved from memory.
Having lots of blocks with few threads is not what the GPU expects. In this case, you run into the blocks-per-SM limit (the exact number depends on your device), which forces your GPU to spend a lot of time putting blocks onto SMs and then retiring them to process the next ones. You should increase the number of threads in your blocks instead of the number of blocks in your application.
The warp size in all current CUDA hardware is 32. Using less than 32 threads per block (or not using a round multiple of 32 threads per block) just wastes cycles. As it stands, using 1 thread per block is leaving something like 95% of the ALU cycles of your GPU idle. That is the underlying reason for the poor performance.
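A minimal sketch of the suggested restructuring, covering the same 30000 work items with 256-thread blocks instead of 30000 single-thread blocks (the kernel name and the actual work are illustrative, not from the original program):

```cuda
// Minimal sketch: many threads per block, with the block count derived from the data size.
#include <cuda_runtime.h>

__global__ void addOne(float *data, int n) {        // illustrative kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) data[i] += 1.0f;                     // guard the partially-filled last block
}

int main() {
    const int N = 30000;
    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    int threadsPerBlock = 256;                                  // a multiple of the warp size (32)
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;   // 118 blocks instead of 30000
    addOne<<<blocks, threadsPerBlock>>>(d_data, N);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```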

CUDA determining threads per block, blocks per grid

I'm new to the CUDA paradigm. My question is about determining the number of threads per block and blocks per grid. Does a bit of art and trial and error play into this? What I've found is that many examples have a seemingly arbitrary number chosen for these things.
I'm considering a problem where I would be able to pass matrices - of any size - to a method for multiplication. So that, each element of C (as in C = A * B) would be calculated by a single thread. How would you determine the threads/block, blocks/grid in this case?
In general you want to size your blocks/grid to match your data and simultaneously maximize occupancy, that is, how many threads are active at one time. The major factors influencing occupancy are shared memory usage, register usage, and thread block size.
A CUDA enabled GPU has its processing capability split up into SMs (streaming multiprocessors), and the number of SMs depends on the actual card, but here we'll focus on a single SM for simplicity (they all behave the same). Each SM has a finite number of 32 bit registers, shared memory, a maximum number of active blocks, AND a maximum number of active threads. These numbers depend on the CC (compute capability) of your GPU and can be found in the middle of the Wikipedia article http://en.wikipedia.org/wiki/CUDA.
First of all, your thread block size should always be a multiple of 32, because kernels issue instructions in warps (32 threads). For example, if you have a block size of 50 threads, the GPU will still issue commands to 64 threads and you'd just be wasting them.
Second, before worrying about shared memory and registers, try to size your blocks based on the maximum numbers of threads and blocks that correspond to the compute capability of your card. Sometimes there are multiple ways to do this... for example, on a CC 3.0 card each SM can have 16 active blocks and 2048 active threads. This means that if you have 128 threads per block, you could fit 16 blocks in your SM before hitting the 2048-thread limit. If you use 256 threads, you can only fit 8 blocks, but you're still using all of the available threads and will still have full occupancy. However, using 64 threads per block will only use 1024 threads when the 16-block limit is hit, so only 50% occupancy. If shared memory and register usage are not a bottleneck, this should be your main concern (other than your data dimensions).
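A minimal host-side sketch of that arithmetic, using the CC 3.0 limits quoted above (16 resident blocks and 2048 resident threads per SM, and ignoring register and shared memory limits):

```cuda
// Minimal sketch: occupancy limited by either the block cap or the thread cap per SM.
#include <cstdio>
#include <algorithm>

int main() {
    const int maxBlocksPerSM = 16, maxThreadsPerSM = 2048;      // CC 3.0 limits
    for (int blockSize : {64, 128, 256}) {
        int blocks  = std::min(maxBlocksPerSM, maxThreadsPerSM / blockSize);
        int threads = blocks * blockSize;
        printf("%3d threads/block -> %2d blocks, %4d threads, %3.0f%% occupancy\n",
               blockSize, blocks, threads, 100.0 * threads / maxThreadsPerSM);
    }
    return 0;
}
```

This reproduces the 50% / 100% / 100% figures for 64, 128 and 256 threads per block.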
On the topic of your grid... the blocks in your grid are spread out over the SMs to start, and then the remaining blocks are placed into a pipeline. Blocks are moved into the SMs for processing as soon as there are enough resources in that SM to take the block. In other words, as blocks complete in an SM, new ones are moved in. You could make the argument that having smaller blocks (128 instead of 256 in the previous example) may complete faster since a particularly slow block will hog fewer resources, but this is very much dependent on the code.
Regarding registers and shared memory, look at that next, as it may be limiting your occupancy. Shared memory is finite for a whole SM, so try to use it in an amount that allows as many blocks as possible to still fit on an SM. The same goes for register use. Again, these numbers depend on compute capability and can be found tabulated on the wikipedia page.
https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of a GPU by a given CUDA kernel. The multiprocessor occupancy is the ratio of active warps to the maximum number of warps supported on a multiprocessor of the GPU. Each multiprocessor on the device has a set of N registers available for use by CUDA program threads. These registers are a shared resource that are allocated among the thread blocks executing on a multiprocessor. The CUDA compiler attempts to minimize register usage to maximize the number of thread blocks that can be active in the machine simultaneously. If a program tries to launch a kernel for which the registers used per thread times the thread block size is greater than N, the launch will fail...
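Besides the calculator, more recent toolkits (CUDA 6.5 and later) expose the same model through the runtime: cudaOccupancyMaxPotentialBlockSize suggests a block size that maximizes occupancy for a specific kernel. A minimal sketch with a hypothetical placeholder kernel:

```cuda
// Minimal sketch: let the runtime suggest an occupancy-maximizing block size.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {   // hypothetical placeholder
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scaleKernel, 0, 0);
    printf("suggested block size: %d (minimum grid size for full occupancy: %d)\n",
           blockSize, minGridSize);
    return 0;
}
```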
With rare exceptions, you should use a constant number of threads per block. The number of blocks per grid is then determined by the problem size, such as the matrix dimensions in the case of matrix multiplication.
Choosing the number of threads per block is very complicated. Most CUDA algorithms admit a large range of possibilities, and the choice is based on what makes the kernel run most efficiently. It is almost always a multiple of 32, and at least 64, because of how the thread scheduling hardware works. A good choice for a first attempt is 128 or 256.
You also need to consider shared memory because threads in the same block can access the same shared memory. If you're designing something that requires a lot of shared memory, then more threads-per-block might be advantageous.
For example, in terms of context switching, any multiple of 32 works just the same. So for the 1D case, launching 1 block with 64 threads or 2 blocks with 32 threads each makes no difference for global memory accesses. However, if the problem at hand naturally decomposes into one length-64 vector, then the first option will be better than the second (less memory overhead, and every thread can access the same shared memory).
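To tie this back to the matrix-multiplication question above: a common starting point is a fixed 16x16 (256-thread) 2D block, with the grid dimensions derived from the size of C. A minimal, unoptimized sketch (names and the naive kernel are illustrative, not a tuned implementation):

```cuda
// Minimal sketch: one thread per element of C = A * B.
#include <cuda_runtime.h>

__global__ void matMul(const float *A, const float *B, float *C,
                       int M, int N, int K) {   // A is MxK, B is KxN, C is MxN
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

void launchMatMul(const float *dA, const float *dB, float *dC, int M, int N, int K) {
    dim3 block(16, 16);                            // constant threads per block (256)
    dim3 grid((N + block.x - 1) / block.x,         // blocks per grid from the matrix dimensions
              (M + block.y - 1) / block.y);
    matMul<<<grid, block>>>(dA, dB, dC, M, N, K);
}
```

The guard on row and col handles matrices whose dimensions are not multiples of 16.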
There is no silver bullet. The best number of threads per block depends a lot on the characteristics of the specific application being parallelized. CUDA's design guide recommends using a small number of threads per block when a function offloaded to the GPU has several barriers; however, there are experiments showing that for some applications a small number of threads per block increases the overhead of synchronization. In contrast, a larger number of threads per block may decrease the number of synchronizations and improve the overall performance.
For an in-depth discussion (too lengthy for Stack Overflow) about the impact of the number of threads per block on CUDA kernels, check this journal article; it shows tests of different configurations of the number of threads per block in the NPB (NAS Parallel Benchmarks) suite, a set of CFD (Computational Fluid Dynamics) applications.