Questions regarding Fermi Architecture, Warps and Performance - cuda

As the Fermi whitepaper describes, there are 16 SMs (Streaming Multiprocessors), each of which consists of 32 cores. The GPU executes threads in groups of 32, called warps.
First question: Am I right to assume that each warp could be treated as something like the vector width, meaning: I could execute a single instruction on 32 "datas" in parallel?
And if so, does it mean that in total the Fermi architecture allows executing operations on 16 * 32 = 512 data elements in parallel, where the 16 operations can each be different?
If so, how many times per second can it execute operations on 512 data elements in parallel?

First question: Am I right to assume that each warp could be treated as something like the vector width, meaning: I could execute a single instruction on 32 "datas" in parallel?
Yes.
And if so, does it mean that in total the Fermi architecture allows executing operations on 16 * 32 = 512 data elements in parallel, where the 16 operations can each be different?
Yes, possibly, depending on the operation type. A GPU SM includes functional units that handle different types of operations (instructions). An integer add may not be handled by the same functional unit as a floating-point add, for example. Because different operations are handled by different functional units, and because there is no requirement that a GPU SM contain 32 functional units for every instruction type, the specific throughput depends on the instruction. However, the 32 functional units you are referring to can each handle a float add, multiply, or multiply-add, so for those specific operation types your calculation is correct.
If so, how many times per second can it execute operations on 512 data elements in parallel?
This is given by the clock rate divided by the number of clocks needed to service an instruction. For example, with 32 FP add units, the GPU can theoretically retire one such instruction, covering 512 "datas", in a single clock cycle. If another operation, such as integer add, had only 16 functional units to service it, it would require 2 clocks to be serviced warp-wide, so we would divide the number by 2. And if you had a mix of operations, say 8 floating-point adds issued on 8 SMs and 8 integer adds issued on the other 8 SMs, the calculation would be more complex.
The theoretical maximum floating-point throughput is computed this way. For example, the Fermi M2090 has all 16 SMs enabled and is claimed to have a peak theoretical throughput of 1332 GF/s for FP32 ops. That calculation is as follows:
16 SMs * 32 functional units/SM * 2 ops/functional unit/hot clock * 2 hot clocks/clock * 651M clocks/sec ≈ 1333 GF/s FP32 (the small difference from the quoted figure is clock-rate rounding)
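If it helps to see the arithmetic in code, here is a minimal sketch that queries the device and reproduces the estimate. It assumes 32 SP units per SM and 2 flops per FMA (valid for CC 2.0 Fermi parts), and that cudaDeviceProp::clockRate reports the shader (hot) clock in kHz, as it does on Fermi-era devices, so the 2-hot-clocks-per-clock factor is already folded in:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        const int spUnitsPerSM = 32;  // assumption: holds for CC 2.0 (Fermi)
        const int flopsPerFMA  = 2;   // one multiply plus one add
        // clockRate is in kHz; on Fermi it is the shader (hot) clock
        double peakGFlops = (double)prop.multiProcessorCount * spUnitsPerSM
                          * flopsPerFMA * (prop.clockRate * 1e3) / 1e9;
        printf("%d SMs @ %d kHz -> peak FP32: %.0f GFLOP/s\n",
               prop.multiProcessorCount, prop.clockRate, peakGFlops);
        return 0;
    }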

CUDA coalesced memory access speed depending on word size

I have a CUDA program where one warp needs to access (for example) 12 bytes of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes each accessing a uint8_t. Alternatively, it could use 6 lanes accessing a uint16_t, or 3 lanes accessing a uint32_t.
Is there a performance difference between these alternatives? Is the access faster if each thread accesses a smaller amount of memory?
When the amounts of memory each warp needs to access vary, is there a benefit in optimizing it such that the threads are made to access smaller units (16-bit or 8-bit) when possible?
Without knowing how the data will be used once in registers, it is hard to state the optimal option. For almost all GPUs the performance difference between these options will likely be very small.
The NVIDIA GPU L1 cache can return either 64 bytes/warp (CC 5.x, 6.x) or 128 bytes/warp (CC 3.x, 7.x). As long as the size is <= 32 bits per thread, the performance should be very similar.
On CC 5.x/6.x there may be a small performance benefit to reducing the number of predicated-true threads (prefer larger data). The L1TEX unit breaks a global access into requests of 4 x 8 threads. If full groups of 8 threads are predicated off, an L1TEX cycle is saved. Write-back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.
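A sketch of such a micro-benchmark follows; the kernel and harness are illustrative, assume the 12-bytes-per-warp layout from the question, and use one warp per block for simplicity:

    #include <cstdint>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Copies 12 bytes per warp using LANES active lanes of width sizeof(T).
    template <typename T, int LANES>
    __global__ void copyKernel(const T* __restrict__ in, T* __restrict__ out) {
        int lane = threadIdx.x % 32;
        int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
        if (lane < LANES)  // remaining lanes are predicated off
            out[warp * LANES + lane] = in[warp * LANES + lane];
    }

    // Times one variant with CUDA events, averaged over many launches.
    template <typename T, int LANES>
    float timeVariant(const void* in, void* out, int warps, int iters) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        for (int i = 0; i < iters; ++i)
            copyKernel<T, LANES><<<warps, 32>>>((const T*)in, (T*)out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start); cudaEventDestroy(stop);
        return ms / iters;
    }

    int main() {
        const int warps = 1 << 14, iters = 1000;
        void *in, *out;
        cudaMalloc(&in,  warps * 12);
        cudaMalloc(&out, warps * 12);
        printf("uint8_t  x 12 lanes: %f ms\n", timeVariant<uint8_t,  12>(in, out, warps, iters));
        printf("uint16_t x  6 lanes: %f ms\n", timeVariant<uint16_t,  6>(in, out, warps, iters));
        printf("uint32_t x  3 lanes: %f ms\n", timeVariant<uint32_t,  3>(in, out, warps, iters));
        cudaFree(in); cudaFree(out);
        return 0;
    }

As the answer suggests, pair the timings with the L1TEX counters in the profiler to see where any difference comes from.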

CUDA Kepler: not enough ALUs

According to the Kepler whitepaper, the warp size for a Kepler-based GPU is 32, and each multiprocessor contains 4 warp schedulers which select two independent instructions from a chosen warp. This means that each clock cycle, 32 * 4 * 2 = 256 calculations are to be performed, but a multiprocessor contains only 192 ALUs. How are these calculations performed then?
The actual whitepaper wording is as follows:
The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Kepler's quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle.
The interpretation is that in any given cycle, at most 4 warps can be scheduled. For each of those 4 warps, (up to) 2 independent instructions per warp can be dispatched. "can be dispatched" is not the same as "will be dispatched".
The 192 ALUs you are referring to are related to single-precision floating-point arithmetic operations (SP units, for the purpose of this discussion). However, there are other functional units in the SM(X), such as double-precision floating-point arithmetic units (DP units), load/store units (LD/ST units), and others. Refer to the diagram on page 8 of the whitepaper linked above. If a given set of instructions were all using the SP units, then 8 instructions could not be scheduled; at most 6 (32 x 6 = 192) could be. However, if the instruction mix contains independent instructions of different types (e.g. loads, stores, SP ops, etc.), then the limitation of 192 SP units will not necessarily be the determining factor in how many instructions actually get scheduled in any given cycle.
The bottom line is that 8 instructions (2 inst/scheduler x 4 schedulers) per cycle is the maximum possible instruction issue rate per SM(X). Real world codes do not necessarily achieve this. It's entirely possible that in a given cycle no instructions could get issued, due to stall/starvation conditions.
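One way to see whether independent instructions help is a micro-benchmark along these lines (kernel names are illustrative). The second kernel does four times the arithmetic per iteration, so if the dispatchers and pipelines can exploit the independence it should take well under four times as long:

    // A single dependent chain: each FMA must wait for the previous one.
    __global__ void dependentChain(float* out, int iters) {
        float a = 1.0f;
        for (int i = 0; i < iters; ++i)
            a = a * 1.0001f + 0.0001f;
        out[threadIdx.x] = a;  // store the result so the loop is not optimized away
    }

    // Four independent chains: dispatch can overlap their FMAs.
    __global__ void independentChains(float* out, int iters) {
        float a = 1.0f, b = 2.0f, c = 3.0f, d = 4.0f;
        for (int i = 0; i < iters; ++i) {
            a = a * 1.0001f + 0.0001f;
            b = b * 1.0001f + 0.0002f;
            c = c * 1.0001f + 0.0003f;
            d = d * 1.0001f + 0.0004f;
        }
        out[threadIdx.x] = a + b + c + d;
    }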

Why bother to know about CUDA Warps?

I have a GeForce GTX 460 SE, so it has: 6 SMs x 48 CUDA Cores = 288 CUDA Cores.
It is known that one warp contains 32 threads, and that only one warp at a time can be executed in a block.
Does that mean that a single multiprocessor (SM) can simultaneously execute only one block, one warp, and only 32 threads, even if there are 48 cores available?
In addition, specific threads and blocks can be distinguished using threadIdx.x and blockIdx.x, and they are allocated with the kernel launch kernel<<<Blocks, Threads>>>().
But how can I allocate a specific number of warps and distribute them, and if that is not possible, then why bother to know about warps?
The situation is quite a bit more complicated than what you describe.
The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They hold many operations in flight at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started a long time ago (around 20 cycles for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
The warp schedulers (yellow in the image) can schedule 2 warps * 32 threads per warp = 64 threads to the pipelines per cycle, so that's the number of results that can be obtained per clock. Given that there is a mix of computing resources, 48 cores, 16 LD/ST units, 8 SFUs, each of which has a different latency, a mix of warps is being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.
The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
Adding to the complexity, warps that are executing instructions for which there are fewer than 32 resources must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so a warp containing an instruction that requires the SFUs must be scheduled 4 times.
This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".
So, coming to your actual question,
why bother to know about Warps?
Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance (a short sketch after the list pulls them together):
In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads per block that is a multiple of the number of threads in a warp. If you don't, you end up launching blocks that contain inactive threads.
In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.
In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory.
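A minimal sketch pulling these three rules together (names and sizes are illustrative):

    // Consecutive threads touch consecutive 32-bit words (rule 3); the only
    // possible divergence is the bounds check in the final warp (rule 2).
    __global__ void scale(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];
    }

    // Launch with a block size that is a multiple of the warp size (rule 1):
    // scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);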
Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?
Yes, the programming model guarantees that the threads are grouped into warps in that specific order.
As a simple example of optimizing divergent code paths, could the threads in a block be separated into groups of 32 threads? For example: switch (threadIdx.x / 32) { case 0: /* warp 0 */ break; case 1: /* warp 1 */ break; /* etc. */ }
Exactly :)
How many bytes must be read at one time for a single warp: 4 bytes * 32 threads, 8 bytes * 32 threads, or 16 bytes * 32 threads? As far as I know, one transaction to global memory at a time fetches 128 bytes.
Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).
This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.
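To make the access-pattern point concrete, here is a small sketch contrasting the two cases (it assumes in is 128-byte aligned and large enough for the strided version):

    // Coalesced: thread k of a warp reads word k of a 128-byte segment,
    // so the whole warp is serviced by a single 128-byte transaction.
    __global__ void coalesced(const float* in, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }

    // Strided: consecutive threads read addresses 32 floats (128 bytes)
    // apart, so a warp touches 32 different lines -> 32 transactions.
    __global__ void strided(const float* in, float* out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * 32];
    }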

CUDA: How many concurrent threads in total?

I have a GeForce GTX 580, and I want to make a statement about the total number of threads that can (ideally) actually be run in parallel, to compare with 2- or 4-core CPUs.
deviceQuery gives me the following possibly relevant information:
CUDA Capability Major/Minor version number: 2.0
(16) Multiprocessors x (32) CUDA Cores/MP: 512 CUDA Cores
Maximum number of threads per block: 1024
I think I heard that each CUDA core can run a warp in parallel, and that a warp is 32 threads. Would it be correct to say that the card can run 512*32 = 16384 threads in parallel then, or am I way off and the CUDA cores are somehow not really running in parallel?
The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads.
Don't confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored on-chip simultaneously -- the number that can be resident. In CUDA terms we also call this maximum occupancy. The hardware switches between warps constantly to help cover or "hide" the (large) latency of memory accesses as well as the (small) latency of arithmetic pipelines.
While each SM can have 48 resident warps, it can only issue instructions from a small number (on average between 1 and 2 for GTX 580, but it depends on the program instruction mix) of warps at each clock cycle.
So you are probably better off comparing throughput, which is determined by the available execution units and how the hardware is capable of performing multi-issue. On GTX580, there are 512 FMA execution units, but also integer units, special function units, memory instruction units, etc, which can be dual-issued to (i.e. issue independent instructions from 2 warps simultaneously) in various combinations.
Taking into account all of the above is too difficult, though, so most people compare on two metrics:
Peak GFLOP/s (which for GTX 580 is 512 FMA units * 2 flops per FMA * 1544e6 cycles/second = 1581.1 GFLOP/s (single precision))
Measured throughput on the application you are interested in.
The most important comparison is always measured wall-clock time on a real application.
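For the GPU side of such a measurement, a minimal wall-clock timing sketch looks like the following; saxpy is just a stand-in for "a real application", and the cudaDeviceSynchronize() before stopping the clock matters because kernel launches are asynchronous:

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    // A stand-in workload; substitute the kernel you actually care about.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 24;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // warm-up launch
        cudaDeviceSynchronize();
        auto t0 = std::chrono::steady_clock::now();
        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();  // wait for the kernel to finish
        auto t1 = std::chrono::steady_clock::now();
        printf("wall-clock: %f ms\n",
               std::chrono::duration<double, std::milli>(t1 - t0).count());
        cudaFree(x); cudaFree(y);
        return 0;
    }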
There are certain traps that you can fall into by doing that comparison to 2 or 4-core CPUs:
The number of concurrent threads does not match the number of threads that actually run in parallel. Of course you can launch 24576 threads concurrently on GTX 580 but the optimal value is in most cases lower.
A 2- or 4-core CPU can run arbitrarily many concurrent threads! As with the GPU, beyond some point adding more threads won't help, and may even slow things down.
A "CUDA core" is a single scalar processing unit, while CPU core is usually a bigger thing, containing for example a 4-wide SIMD unit. To compare apples-to-apples, you should multiply the number of advertised CPU cores by 4 to match what NVIDIA calls a core.
CPU supports hyperthreading, which allows a single core to process 2 threads concurrently in a light way. Because of that, an operating system may actually see 2 times more "logical cores" than the hardware cores.
To sum it up: For a fair comparison, your 4-core CPU can actually run 32 "scalar threads" concurrently, because of SIMD and hyperthreading.
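If you want to know how many warps can actually be resident for your particular kernel, rather than the hardware maximum, CUDA toolkits from 6.5 on (so newer than the GTX 580 era) expose an occupancy API. A minimal sketch with a hypothetical kernel:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel; occupancy depends on its register and shared-memory use.
    __global__ void myKernel(float* data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    int main() {
        int blocksPerSM = 0;
        const int threadsPerBlock = 256;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, myKernel, threadsPerBlock, 0 /* dynamic shared mem */);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("resident warps/SM: %d of %d possible\n",
               blocksPerSM * threadsPerBlock / 32,
               prop.maxThreadsPerMultiProcessor / 32);
        return 0;
    }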
I realize this is a bit late but I figured I'd help out anyway. From page 10 of the CUDA Fermi architecture whitepaper:
Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.
To me this means that each SM can have 2*32=64 threads running concurrently. I don't know if that means that the GPU can have a total of 16*64=1024 threads running concurrently.