Scalability Analysis on GPU - cuda

I am trying to do a scalability analysis on my Quadro FX 5800, which has 240 cores: how the run time scales with the number of cores, which is a classic study in parallel computing.
I was wondering how the definition of a core fits into this?
And how can I use it to run on different core counts, say 8, 16, 32, 64, 128, 240 cores?
My test case is the simple matrix multiplication.

Scalability on the GPU should not be measured in terms of CUDA cores but in terms of SM utilization. IPC is probably the best single measure of SM utilization. When developing an algorithm you want to partition your work so that you can distribute sufficient work to all SMs, such that on every cycle the warp scheduler has at least one warp eligible to issue an instruction. In general this means you have to have sufficient warps on each SM to hide instruction and memory latency and to provide a variety of instruction types to fill the execution pipeline.
If you want to test scaling across CUDA cores (which is meaningless) then you can launch thread blocks containing 1, 2, 3, ... 32 threads per block. Launching a number of threads per thread block that is not a multiple of WARP_SIZE (=32) will result in only using a subset of the cores; the rest are basically wasted execution slots.
If you want to test scaling in terms of SMs you can scale your algorithm from 1 thread block to 1000s of thread blocks. In order to understand scaling you can artificially limit the thread blocks per SM by configuring the shared memory per thread block when you launch.
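Here is a minimal sketch of that shared-memory trick, under my own assumptions (the kernel, the names, and the target of 2 resident blocks per SM are illustrative, not from the original answer): by requesting enough dynamic shared memory per block at launch time, you cap how many blocks can be resident on one SM.

// Sketch: artificially cap resident blocks per SM by over-allocating
// dynamic shared memory at launch time. All names and sizes are illustrative.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *out)
{
    extern __shared__ float scratch[];          // dynamic shared memory, sized at launch
    if (threadIdx.x == 0) scratch[0] = 0.0f;    // touch it so it is not optimized away
    __syncthreads();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = sqrtf((float)i) + scratch[0];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // To allow at most K resident blocks per SM, each block must request
    // more than sharedMemPerMultiprocessor / (K + 1) bytes.
    int K = 2;
    size_t smemPerBlock = prop.sharedMemPerMultiprocessor / (K + 1) + 1;
    if (smemPerBlock > prop.sharedMemPerBlock)  // stay within the per-block limit
        smemPerBlock = prop.sharedMemPerBlock;

    int threads = 256, blocks = 1024;
    float *out;
    cudaMalloc(&out, (size_t)blocks * threads * sizeof(float));

    busyKernel<<<blocks, threads, smemPerBlock>>>(out);
    cudaDeviceSynchronize();
    printf("requested %zu bytes of dynamic shared memory per block\n", smemPerBlock);

    cudaFree(out);
    return 0;
}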
Re-writing matrix multiply to scale optimally in each of these directions is likely to be frustrating. Before you undertake that project I would recommend understanding how distributing a simple parallel computation, such as summing from 0-100000 or calculating a factorial, scales across multiple thread blocks. These algorithms are only a few lines of code, and the aforementioned scaling can be tried by varying the launch configuration (GridDim, BlockDim, SharedMemoryPerBlock) and 1-2 kernel parameters. You can time the different launches using the CUDA profiler, Visual Profiler, Parallel Nsight, or CUDA events.
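As one concrete starting point, here is a minimal sketch of that kind of experiment, under my own assumptions (the kernel and variable names are mine, not from the original answer): a block-wise sum of 0..n-1 whose grid and block dimensions are swept, with each launch timed using CUDA events.

// Sketch of the scaling experiment suggested above: sum 0..n-1 with a
// configurable launch configuration, timed with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumKernel(unsigned long long *result, int n)
{
    extern __shared__ unsigned long long partial[];
    unsigned long long local = 0;
    // Grid-stride loop so any grid/block size covers all n values.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        local += i;
    partial[threadIdx.x] = local;
    __syncthreads();
    // Tree reduction within the block (blockDim.x assumed to be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);
}

int main()
{
    const int n = 100000;
    unsigned long long *d_result;
    cudaMalloc(&d_result, sizeof(*d_result));

    int blockSizes[] = {32, 64, 128, 256};
    int gridSizes[]  = {1, 2, 4, 8, 16, 32};

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int b : blockSizes)
        for (int g : gridSizes) {
            cudaMemset(d_result, 0, sizeof(unsigned long long));
            cudaEventRecord(start);
            sumKernel<<<g, b, b * sizeof(unsigned long long)>>>(d_result, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            unsigned long long h = 0;
            cudaMemcpy(&h, d_result, sizeof(h), cudaMemcpyDeviceToHost);
            printf("grid=%3d block=%3d sum=%llu time=%.3f ms\n", g, b, h, ms);
        }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_result);
    return 0;
}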

Assuming you are using CUDA or OpenCL as the programming model: one simple way to restrict utilization to M multiprocessors (SMs) is to launch your kernel with an execution configuration of M blocks (of threads). If each SM is composed of N cores, in this way you can test scalability across N, 2N, 4N, ... cores.
For example, if the GPU has 4 SMs, each with 32 cores, then by running kernels of 1, 2 and 4 blocks, your kernel will utilize 32, 64 and 128 cores of the GPU.
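A minimal sketch of that idea, under my own assumptions (myKernel is just a placeholder for your own computation): sweep the number of blocks from 1 up to the SM count reported by the device.

// Sketch: restrict utilization to M multiprocessors by launching M blocks.
// myKernel is a placeholder for your own kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int threadsPerBlock = 128;   // one block's worth of work per SM
    size_t maxElems = (size_t)prop.multiProcessorCount * threadsPerBlock;

    float *d_data;
    cudaMalloc(&d_data, maxElems * sizeof(float));
    cudaMemset(d_data, 0, maxElems * sizeof(float));

    // M = 1, 2, 4, ... blocks -> roughly N, 2N, 4N, ... cores in use,
    // where N is the number of cores per SM.
    for (int m = 1; m <= prop.multiProcessorCount; m *= 2) {
        myKernel<<<m, threadsPerBlock>>>(d_data);
        cudaDeviceSynchronize();
        printf("launched %d block(s) -> at most %d SM(s) busy\n", m, m);
    }

    cudaFree(d_data);
    return 0;
}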

Related

Maximum number of GPU Threads on Hardware and used memory

I have already read several threads about the capacity of the GPU and understood that the concept of blocks and threads has to be separated from the physical hardware. Although the maximum number of threads per block is 1024, there is no limit on the number of blocks one can use. However, as the number of streaming processors is finite, there has to be a physical limit. After I wrote a GPU program, I would be interested in evaluating the used capacity of my GPU. To do this, I have to know how many threads I could start theoretically at one time on the hardware. My graphics card is an Nvidia Geforce 1080Ti, so I have 3584 CUDA cores. As far as I understood, each CUDA core executes one thread, so in theory I would be able to execute 3584 threads per cycle. Is this correct?
Another question is the one about memory. I installed and used nvprof to get some insight into the used kernels. What is displayed there is for example the number of used registers. I transfer my arrays to the GPU using cuda.to_device (in Python Numba) and as far as I understood, the arrays then reside in global memory. How do I find out how big this global memory is? Is it equivalent to the DRAM size?
Thanks in advance
I'll focus on the first part of the question. The second should really be its own separate question.
CUDA cores do not map 1-to-1 to threads. They are more like ports in a superscalar CPU. Multiple threads can issue instructions to the same CUDA core in different clock cycles. Sort of like hyperthreading in a CPU.
You can see the relation and the numbers in the documentation, in chapter K, Compute Capabilities, compared against Table 3, Throughput of Native Arithmetic Instructions. Depending on your architecture, you may have, for example for your card (compute capability 6.1), 2048 threads per SM and 128 32-bit floating point operations per clock cycle. That means you have 128 CUDA cores shared by a maximum of 2048 threads.
Within one GPU generation, the absolute number of threads and CUDA cores only scales with the number of multiprocessors (SMs). TechPowerup's excellent GPU database documents 28 SMs for your card, which should give you 28 * 2048 = 57,344 resident threads unless I did something wrong.
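If you prefer to read those numbers (and the global memory size asked about above) off the device itself rather than a database, a small sketch like this using the CUDA runtime API works; Numba exposes similar attributes on its device object, but the CUDA C version is shown here:

// Sketch: query SM count, resident-thread limits and global memory directly.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // The CUDA core count per SM is architecture-specific and is not in
    // cudaDeviceProp; see the Programming Guide tables mentioned above.
    printf("device:                 %s\n", prop.name);
    printf("multiprocessors (SMs):  %d\n", prop.multiProcessorCount);
    printf("max threads per SM:     %d\n", prop.maxThreadsPerMultiProcessor);
    printf("max resident threads:   %d\n",
           prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    printf("global memory (bytes):  %zu\n", prop.totalGlobalMem);
    return 0;
}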

CUDA blocks & warps - which can run in parallel on a single SM?

Ok I know that related questions have been asked over and over again and I read pretty much everything I found about this, but things are still unclear. Probably also because I found and read things contradicting each other (maybe because, being from different times, they referred to devices with different compute capability, between which there seems to be quite a gap). I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel. Also I was thinking of generalizing this and calculating an optimal number of threads and blocks to pass to my kernel based only on the number of operations I know I have to do (for simpler programs) and the system specs.
I have a GTX 550Ti, btw with compute capability 2.1.
4 SMs x 48 cores = 192 CUDA cores.
Ok so what's unclear to me is:
Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)? I read that up to 8 blocks can be assigned to an SM, but nothing as to how they're run. From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024), I would think that blocks aren't run in parallel (maybe 1 and a half?). Or at least not if I have a max number of threads in them. Also, if I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each?
Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
Secondly, I know that a block will divide its threads into groups of 32 threads that run in parallel, called warps. Now, can these warps (presuming they have no relation to each other) be run in parallel as well? Because the Fermi architecture states that 2 warps are executed concurrently, sending one instruction from each warp to a group of 16 (?) cores, while somewhere else I read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
On a simpler note, what I'm asking is: (for example) if I want to sum 2 vectors into a third one, what length should I give them (number of operations) and how should I split them into blocks and threads for my device to work concurrently (in parallel) at full capacity (without having idle cores or SMs)?
I'm sorry if this was asked before and I didn't get it or didn't see it. Hope you can help me. Thank you!
The distribution and parallel execution of work are determined by the launch configuration and the device. The launch configuration states the grid dimensions, block dimensions, registers per thread, and shared memory per block. Based upon this information and the device you can determine the number of blocks and warps that can execute on the device concurrently. When developing a kernel you usually look at the ratio of warps that can be active on the SM to the maximum number of warps per SM for the device. This is called the theoretical occupancy. The CUDA Occupancy Calculator can be used to investigate different launch configurations.
When a grid is launched the compute work distributor will rasterize the grid and distribute thread blocks to SMs and SM resources will be allocated for the thread block. Multiple thread blocks can execute simultaneously on the SM if the SM has sufficient resources.
In order to launch a warp, the SM assigns the warp to a warp scheduler and allocates registers for the warp. At this point the warp is considered an active warp.
Each warp scheduler manages a set of warps (24 on Fermi, 16 on Kepler). Warps that are not stalled are called eligible warps. On each cycle the warp scheduler picks an eligible warp and issues instruction(s) for the warp to execution units such as int/fp units, double precision floating point units, special function units, branch resolution units, and load store units. The execution units are pipelined, allowing many warps to have 1 or more instructions in flight each cycle. Warps can be stalled on instruction fetch, data dependencies, execution dependencies, barriers, etc.
Each kernel has a different optimal launch configuration. Tools such as Nsight Visual Studio Edition and the NVIDIA Visual Profiler can help you tune your launch configuration. I recommend that you try to write your code in a flexible manner so you can try multiple launch configurations. I would start by using a configuration that gives you at least 50% occupancy then try increasing and decreasing the occupancy.
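If you would rather query theoretical occupancy programmatically than through the spreadsheet, later CUDA toolkits (6.5 and up) provide cudaOccupancyMaxActiveBlocksPerMultiprocessor; here is a minimal sketch, with a placeholder kernel of my own:

// Sketch: compute theoretical occupancy for a chosen block size.
// myKernel is a placeholder; substitute your own kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 0.5f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256;
    int blocksPerSM = 0;
    // Resident blocks per SM for this kernel at this block size,
    // given its register and shared memory usage.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0 /* dynamic smem */);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("block size %d: %d resident blocks/SM, theoretical occupancy %.0f%%\n",
           blockSize, blocksPerSM, 100.0 * activeWarps / maxWarps);
    return 0;
}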
Answers to each Question
Q: Can more than 1 block run AT ONCE (in parallel) on a multiprocessor (SM)?
Yes, the maximum number is based upon the compute capability of the device. See Table 10, Technical Specifications per Compute Capability: Maximum number of resident blocks per multiprocessor, to determine the value. In general the launch configuration limits the run time value. See the occupancy calculator or one of the NVIDIA analysis tools for more details.
Q: From the fact that my max number of threads per SM (1536) is barely larger than my max number of threads per block (1024), I would think that blocks aren't run in parallel (maybe 1 and a half?).
The launch configuration determines the number of blocks per SM. The ratio of maximum threads per block to maximum threads per SM is set to allow the developer more flexibility in how they partition work.
Q: If I set the number of blocks to, let's say 4 (my number of SMs), will they be sent to a different SM each? Or I can't really control how all this is distributed on the hardware and then this is a moot point, my execution time will vary based on the whims of my device ...
You have limited control of work distribution. You can artificially control this by limiting occupancy by allocating more shared memory but this is an advanced optimization.
Q: Secondly, I know that a block will divide its threads into groups of 32 threads that run in parallel, called warps. Now, can these warps (presuming they have no relation to each other) be run in parallel as well?
Yes, warps can run in parallel.
Q: Because the Fermi architecture states that 2 warps are executed concurrently
Each Fermi SM has 2 warps schedulers. Each warp scheduler can dispatch instruction(s) for 1 warp each cycle. Instruction execution is pipelined so many warps can have 1 or more instructions in flight every cycle.
Q: Sending one instruction from each warp to a group of 16 (?) cores, while somewhere else I read that each core handles a warp, which would explain the 1536 max threads (32*48) but seems a bit much. Can 1 CUDA core handle 32 threads concurrently?
Yes. The CUDA core count is the number of integer and floating point execution units. The SM has other types of execution units, which I listed above. The GTX 550 Ti is a CC 2.1 device. On each cycle an SM has the potential to dispatch at most 4 instructions (128 threads). Depending on the definition of execution, the total threads in flight per cycle can range from many hundreds to many thousands.
I am looking to be more efficient, to reduce my execution time and thus I need to know exactly how many threads/warps/blocks can run at once in parallel.
In short, the number of threads/warps/blocks that can run concurrently depends on several factors. The CUDA C Best Practices Guide has a writeup on Execution Configuration Optimizations that explains these factors and provides some tips for reasoning about how to shape your application.
One of the concepts that took a while to sink in, for me, is the efficiency of the hardware support for context-switching on the CUDA chip.
Consequently, a context switch can occur on every memory access, allowing calculations to proceed for many contexts alternately while the others wait on their memory accesses. One of the ways that GPGPU architectures achieve performance is the ability to parallelize this way, in addition to parallelizing on the multiple cores.
Best performance is achieved when no core is ever waiting on a memory access, which is achieved by having just enough contexts to ensure this happens.

clarification about CUDA number of threads executed per SM

I am new to CUDA programming and am reading about the G80 chip, which has 128 SPs (16 SMs, each with 8 SPs), from the book "Programming Massively Parallel Processors - A Hands-on Approach".
There is a comparison between Intel CPUs and G80 chip.
Intel CPUs support 2 to 4 threads per core, depending on the machine model,
whereas the G80 chip supports 768 threads per SM, which sums up to about 12,000 threads for this chip.
My question here is: can the G80 chip execute 768 threads simultaneously?
If not simultaneously, then what is meant by Intel CPUs supporting 2 to 4 threads per core? We can always have many threads/processes running on an Intel CPU, scheduled by the OS.
The G80 keeps the context for 768 threads per SM resident concurrently and interleaves their execution. This is the key difference between CPUs and GPUs. GPUs are deeply multithreaded processors, hiding the memory accesses of some threads with computation from other threads. The latency of executing a single thread is much higher than on a CPU; the GPU is optimized for thread throughput instead of thread latency. In comparison, CPUs use out-of-order speculative execution to reduce the execution delay of one thread. There are several techniques used by GPUs to reduce thread scheduling overhead. For example, GPUs group threads into coarser schedulable elements called warps (or wavefronts) and execute the threads of a warp on SIMD units. GPU threads are identical, making them a suitable fit for the SIMD model. In the eyes of the programmer, threads are executed in MIMD fashion, and they are grouped into thread blocks to reduce communication overhead.
The threads employed in a CPU core are used to fill different execution units by dynamic scheduling. CPU threads are not necessarily of the same type, which means that once one thread is busy with the floating point unit, other threads may find the ALU idle; therefore, execution of these threads can proceed concurrently. Multiple threads per core are maintained to fill the different execution units effectively, preventing idle units. However, dynamic scheduling is costly in terms of power and energy consumption, which is why manufacturers use only a few threads per CPU core.
In answer to the second part of your question: threads in GPUs are scheduled by hardware (the per-SM warp schedulers); the OS and even the driver do not affect the scheduling.
As far as I know, 768 is the maximum number of resident threads in an SM, and the threads are executed in warps that consist of 32 threads each. So in an SM, all 768 threads will not be executed at the same time; they will be scheduled in chunks of 32 threads, i.e. one warp at a time.
The analogous technology on CPUs is called "simultaneous multithreading" (SMT), or hyperthreading in Intel's marketing speak. It allows usually two, on some CPUs four, threads to be scheduled by the CPU itself in hardware.
This is different from the fact that the operating system may on top of that schedule a larger number of threads in software.

blocks, threads, warpSize

There has been much discussion about how to choose the number of blocks and the block size, but I am still missing something. Many of my concerns address this question: How CUDA Blocks/Warps/Threads map onto CUDA Cores? (To simplify the discussion, there is enough per-thread and per-block memory. Memory limits are not an issue here.)
kernelA<<<nBlocks, nThreads>>>(varA, constB, nThreadsTotal);
1) To keep the SM as busy as possible, I should set nThreads to a multiple of warpSize. True?
2) An SM can only execute one kernel at a time. That is, all HWcores of that SM are executing only kernelA (not some HWcores running kernelA while others run kernelB). So if I have only one thread to run, I'm "wasting" the other HWcores. True?
3) If the warp scheduler issues work in units of warpSize (32 threads), and each SM has 32 HWcores, then the SM would be fully utilized. What happens when the SM has 48 HWcores? How can I keep all 48 cores fully utilized when the scheduler is issuing work in chunks of 32? (If the previous paragraph is true, wouldn't it be better if the scheduler issued work in units of the HWcore count?)
4) It looks like the warp-scheduler queues up 2 tasks at a time. So that when the currently-executing kernel stalls or blocks, the 2nd kernel is swapped in. (It is not clear, but I'll guess the queue here is more than 2 kernels deep.) Is this correct?
5) If my HW has an upper limit of 512 threads per block (nThreadsMax), that doesn't mean the kernel with 512 threads will run fastest in one block. (Again, memory is not an issue.) There is a good chance I'll get better performance if I spread the 512-thread kernel across many blocks, not just one. The block is executed on one or many SMs. True?
5a) I'm thinking the smaller the better, but does it matter how small I make nBlocks? The question is, how do I choose a decent (not necessarily optimal) value for nBlocks? Is there a mathematical approach to choosing nBlocks, or is it simply trial and error?
1) Yes.
2) CC 2.0 - 3.0 devices can execute up to 16 grids concurrently. Each SM is limited to 8 blocks so in order to reach full concurrency the device has to have at least 2 SMs.
3) Yes, the warp schedulers select and issue whole warps at a time. Forget the concept of CUDA cores; they are irrelevant. In order to hide latency you need to have high instruction level parallelism or a high occupancy. It is recommended to have >25% for CC 1.x and >50% for CC >= 2.0. In general, CC 3.0 requires higher occupancy than 2.0 devices due to the doubling of schedulers but only a 33% increase in warps per SM. The Nsight VSE Issue Efficiency experiment is the best way to determine if you had sufficient warps to hide instruction and memory latency. Unfortunately, the Visual Profiler does not have this metric.
4) The warp scheduler algorithm is not documented; however, it does not consider which grid the thread block originated. For CC 2.x and 3.0 devices the CUDA work distributor will distribute all blocks from a grid before distributing blocks from the next grid; however, this is not guaranteed by the programming model.
5) In order to keep the SM busy you have to have sufficient blocks to fill the device. After that you want to make sure you have sufficient warps to reach a reasonable occupancy. There are both pros and cons to using large thread blocks. Large thread blocks in general use less instruction cache and have smaller footprints in cache; however, large thread blocks stall at __syncthreads() (the SM can become less efficient as there are fewer warps to choose from) and tend to keep instructions executing on similar execution units. I recommend trying 128 or 256 threads per thread block to start. There are good reasons for both larger and smaller thread blocks.
5a) Use the occupancy calculator. Picking too large of a thread block size will often cause you to be limited by registers. Picking too small of a thread block size can find you limited by shared memory or the 8 blocks per SM limit.
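For 5a, besides the occupancy calculator spreadsheet, later CUDA toolkits (6.5 and up) added a runtime helper that suggests a block size for a given kernel, from which nBlocks follows; here is a minimal sketch, where the kernel signature mirrors the question but its body is my own placeholder:

// Sketch: let the runtime suggest a block size, then derive nBlocks from it.
// kernelA mirrors the question's launch; the body is a placeholder.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA(float *varA, float constB, int nThreadsTotal)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nThreadsTotal)           // guard against the grid overshooting
        varA[i] = varA[i] * constB;
}

int main()
{
    int nThreadsTotal = 1 << 20;
    float *varA;
    cudaMalloc(&varA, nThreadsTotal * sizeof(float));
    cudaMemset(varA, 0, nThreadsTotal * sizeof(float));

    int minGridSize = 0, blockSize = 0;
    // Suggests a block size that maximizes occupancy for this kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, kernelA, 0, 0);

    int nBlocks = (nThreadsTotal + blockSize - 1) / blockSize;  // cover every element
    printf("suggested blockSize=%d, minGridSize=%d, using nBlocks=%d\n",
           blockSize, minGridSize, nBlocks);

    kernelA<<<nBlocks, blockSize>>>(varA, 2.0f, nThreadsTotal);
    cudaDeviceSynchronize();
    cudaFree(varA);
    return 0;
}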
Let me try to answer your questions one by one.
That is correct.
What do you mean exactly by "HWcores"? The first part of your statement is correct.
According to the NVIDIA Fermi Compute Architecture Whitepaper: "The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi’s scheduler does not need to check for dependencies from within the instruction stream".
Furthermore, the NVIDIA Kepler Architecture Whitepaper states: "Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle."
The "excess" cores are therefore used by scheduling more than one warp at a time.
The warp scheduler schedules warps of the same kernel, not different kernels.
Not quite true: Each block is locked in to a single SM, since that's where its shared memory resides.
That's a tricky issue and depends on how your kernel is implemented. You may want to have a look at the nVidia Webinar Better Performance at Lower Occupancy by Vasily Volkov which explains some of the more important issues. Primarily, though, I would suggest you choose your thread count to improve occupancy, using the CUDA Occupancy Calculator.

Relation between number of blocks of threads and cuda cores on machine (in CUDA C)

I have CUDA 2.1 installed on my machine and it has a graphic card with 64 cuda cores.
I have written a program in which I launch 30000 blocks simultaneously (with 1 thread per block), but I am not getting satisfying results from the GPU (it performs more slowly than the CPU).
Is it that the number of blocks must be smaller than or equal to the number of cores for good performance? Or does the performance have nothing to do with the number of blocks?
CUDA cores are not exactly what you might call a core on a classical CPU. Indeed, they have to be viewed as nothing more than ALUs (Arithmetic and Logic Units), which are just able to compute ready operations.
You might know that threads are handled per warps (groups of 32 threads) inside the blocks you've defined. When your blocks are dispatched on the different SMs (Streaming Multiprocessors, they are the actual cores of the GPU), each SM schedules warps within a block to optimize the computation time in regard of the memory access time needed to get threads' input data.
The problem is threads are always handled through their belonging warp, so if you have only one thread per block, the SM it is running on won't be able to schedule through warps and you won't take advantage of the multiple CUDA cores available. Your CUDA cores will be waiting for data to process, since CUDA cores compute far quicker than data are retrieved through memory.
Having lots of blocks with few threads is not what the GPU expects. In this case, you run into the blocks-per-SM limitation (this number depends on your device), which forces your GPU to spend a lot of time putting blocks on an SM and then removing them to treat the next ones. You should rather increase the number of threads in your blocks instead of the number of blocks in your application.
The warp size in all current CUDA hardware is 32. Using less than 32 threads per block (or not using a round multiple of 32 threads per block) just wastes cycles. As it stands, using 1 thread per block is leaving something like 95% of the ALU cycles of your GPU idle. That is the underlying reason for the poor performance.
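To make that concrete, here is a minimal sketch under my own assumptions (the kernel is just a stand-in for the poster's program) of reshaping 30000 one-thread blocks into full, warp-multiple blocks:

// Sketch: same total amount of work, but grouped into full blocks so
// whole warps (and therefore the ALUs/CUDA cores) are kept busy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may overshoot n
        data[i] = data[i] * data[i] + 1.0f;
}

int main()
{
    const int n = 30000;            // the 30000 work items from the question
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // Poor configuration: 30000 blocks of 1 thread -> 31/32 of each warp idle.
    work<<<n, 1>>>(d, n);

    // Better: a multiple-of-32 block size, with enough blocks to cover n items.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;   // = 118 blocks here
    work<<<blocks, threads>>>(d, n);

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}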