The question is:
A wide bus configuration has the following parameters:
Number of cycles to send the address
Number of cycles for a bus transfer = 2 cycles
Memory Access = 30 cycles
How many cycles are needed to transfer a block of 32 bytes?
So, since it's a wide bus configuration, I assumed that the bus transfer would be done in one iteration, and the same for the memory access.
Which means I got 30 + 2 = 32 cycles.
However, I can't make sense of the bus width and its impact. I don't understand how to calculate the remaining cycles from it.
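For reference, here is the poster's own arithmetic written out as a tiny snippet. It assumes, as the poster did, that the bus is wide enough to move the whole 32-byte block in a single transfer, and it ignores the address cycles since their count is not given above; if the bus were narrower than the block, the transfer (and possibly the memory access) would have to repeat once per bus-width chunk.

    // Sketch of the reasoning above, under the stated assumptions.
    #include <cstdio>

    int main() {
        const int memoryAccessCycles = 30;  // given
        const int busTransferCycles  = 2;   // given, per transfer
        const int blockBytes         = 32;  // block to move

        // Assumption: the bus is wide enough to move the whole block at once,
        // so exactly one memory access and one bus transfer are needed.
        int total = memoryAccessCycles + busTransferCycles;
        printf("cycles for one %d-byte block (single transfer): %d\n",
               blockBytes, total);
        return 0;
    }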
As the Fermi whitepaper describes, there are 16 SMs (Streaming Multiprocessors), each consisting of 32 cores. The GPU executes threads in groups of 32 threads, called a warp.
First question: Am I right to assume that each warp could be treated as something like the vector width, meaning I could execute a single instruction on 32 "datas" in parallel?
And if so, does it mean that in total the Fermi architecture allows executing operations on 16 * 32 = 512 data elements in parallel, where the 16 operations can differ from one another?
If so, how many times can it execute 512 datas in parallel in one second?
First question: Am I right to assume that each warp could be treated as something like the vector width, meaning I could execute a single instruction on 32 "datas" in parallel?
Yes.
And if so, does it mean that in total the Fermi architecture allows executing operations on 16 * 32 = 512 data elements in parallel, where the 16 operations can differ from one another?
Yes, possibly, depending on the operation type. A GPU SM includes functional units that handle different types of operations (instructions); an integer add may not be handled by the same functional unit as a floating-point add, for example. Because different operations are handled by different functional units, and because there is no requirement that the GPU SM contain 32 functional units for each instruction type, the specific throughput will depend on the instruction. However, the 32 functional units you are referring to can handle a float add, multiply, or multiply-add, so for those specific operation types your calculation is correct.
If so, how many times can it execute 512 datas in parallel in one second?
This is given by the clock rate, divided by the number of clocks needed to service an instruction. For example, with 32 FP add units per SM, the GPU can theoretically retire one such warp-wide instruction per SM per clock, i.e. 512 "datas" in a single clock cycle. If there were another operation, such as integer add, that had only 16 functional units to service it, then it would require 2 clocks to service warp-wide, so we would divide the number by 2. And if you had a mix of operations, say 8 floating-point adds issued on 8 SMs and 8 integer adds issued on the other 8 SMs, then the calculation would be more complex.
The theoretical maximum floating-point throughput is computed this way. For example, the Fermi M2090 has all 16 SMs enabled and is claimed to have a peak theoretical throughput of roughly 1330 GF/s for FP32 ops. That calculation is as follows:
16 SMs * 32 functional units/SM * 2 ops/functional unit/hotclk * 2 hotclk/clk * 651M clks/sec = 1333 GF/s FP32
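As a sanity check, the same arithmetic as a small snippet, using exactly the numbers quoted in the line above (the 651 MHz clock and the 2x Fermi hot clock are taken from that line, not from a spec sheet):

    // Peak FP32 throughput from the quoted figures.
    #include <cstdio>

    int main() {
        const double sms              = 16;      // streaming multiprocessors
        const double unitsPerSm       = 32;      // FP32 functional units per SM
        const double opsPerUnitHotclk = 2;       // multiply-add counts as 2 flops
        const double hotclksPerClk    = 2;       // Fermi ALUs run at 2x the core clock
        const double clkHz            = 651e6;   // core clock from the line above

        double peakFlops = sms * unitsPerSm * opsPerUnitHotclk * hotclksPerClk * clkHz;
        printf("peak FP32 throughput: %.0f GFLOP/s\n", peakFlops / 1e9);  // ~1333
        return 0;
    }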
I have a CUDA program where one warp needs to access (for example) 96 bytes of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes, each accessing a uint8_t. Alternatively, it could use 6 lanes accessing a uint16_t, or 3 lanes accessing a uint32_t.
Is there a performance difference between these alternatives; is the access faster if each thread accesses a smaller amount of memory?
When the amounts of memory each warp needs to access vary, is there a benefit in optimizing it such that the threads are made to access smaller units (16-bit or 8-bit) when possible?
Without knowing how the data will be used once in registers, it is hard to state the optimal option. For almost all GPUs, the performance difference between these options will likely be very small.
The NVIDIA GPU L1 cache can return either 64 bytes/warp (CC 5.x, 6.x) or 128 bytes/warp (CC 3.x, 7.x) per request. As long as the size is <= 32 bits per thread, the performance should be very similar.
In CC 5.x/6.x there may be a small performance benefit to reducing the number of predicated-true threads (prefer larger data). The L1TEX unit breaks a global access into 4 x 8-thread requests. If full groups of 8 threads are predicated off, then an L1TEX cycle is saved. Write-back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.
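As a sketch of what such a micro-benchmark could look like (not the poster's actual code): each warp copies 96 bytes either as 12 x uint8_t, 6 x uint16_t, or 3 x uint32_t lanes, selected by a template parameter, and each variant is timed with CUDA events. The buffer size and launch configuration are arbitrary choices for illustration; error checking and warm-up runs are omitted for brevity.

    #include <cstdint>
    #include <cstdio>

    // Each warp copies 96 bytes; only the first 96/sizeof(T) lanes participate,
    // the rest are predicated off, mirroring the situation in the question.
    template <typename T>
    __global__ void copy96PerWarp(const T* __restrict__ in, T* __restrict__ out,
                                  size_t nElems)
    {
        const int lane         = threadIdx.x % 32;
        const int lanesPerWarp = 96 / sizeof(T);
        const size_t warpId    = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
        const size_t idx       = warpId * lanesPerWarp + lane;
        if (lane < lanesPerWarp && idx < nElems)
            out[idx] = in[idx];
    }

    template <typename T>
    float timeKernel(const void* in, void* out, size_t bytes)
    {
        const size_t nElems = bytes / sizeof(T);
        const int block = 256;
        const int grid  = (int)((nElems * 32 / (96 / sizeof(T)) + block - 1) / block);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        copy96PerWarp<T><<<grid, block>>>((const T*)in, (T*)out, nElems);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }

    int main()
    {
        const size_t bytes = 96u * 1024 * 1024;   // 96 MiB, a multiple of 96
        void *in, *out;
        cudaMalloc(&in, bytes);
        cudaMalloc(&out, bytes);

        printf("uint8_t : %.3f ms\n", timeKernel<uint8_t >(in, out, bytes));
        printf("uint16_t: %.3f ms\n", timeKernel<uint16_t>(in, out, bytes));
        printf("uint32_t: %.3f ms\n", timeKernel<uint32_t>(in, out, bytes));

        cudaFree(in);
        cudaFree(out);
        return 0;
    }

Compile with nvcc and run it under the profiler to compare the L1TEX counters mentioned above for the three variants.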
The NVIDIA website mentions a few causes of low achieved occupancy, among them uneven distribution of workload among blocks, which results in blocks hoarding shared-memory resources and not releasing them until the block is finished. The suggestion is to decrease the block size, thus increasing the overall number of blocks (keeping the total number of threads constant, of course).
A good explanation of that was also given here on Stack Overflow.
Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) simply be to set the block size as small as possible (equal to the warp size, say 32 threads)? That is, unless a larger number of threads needs to communicate through shared memory, I assume.
Given the aforementioned information, shouldn't the right course of action (in order to maximize performance) simply be to set the block size as small as possible (equal to the warp size, say 32 threads)?
No.
As shown in the documentation here, there is a limit on the number of blocks per multiprocessor, which would leave you with a maximum theoretical occupancy of 25% or 50% when using 32-thread blocks, depending on what hardware you run the kernel on.
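A back-of-the-envelope version of that limit, using blocks-per-SM and warps-per-SM limits from the occupancy tables for two example compute capabilities (the exact numbers vary by device, so treat these as illustrative and check the table for your own GPU):

    #include <cstdio>

    int main() {
        const int threadsPerBlock = 32;                    // one warp per block
        const int warpsPerBlock   = threadsPerBlock / 32;

        struct Limit { const char* cc; int maxBlocksPerSm; int maxWarpsPerSm; };
        const Limit limits[] = {
            { "3.x", 16, 64 },   // e.g. Kepler
            { "6.x", 32, 64 },   // e.g. Pascal
        };

        for (const Limit& l : limits) {
            // Resident warps are capped by the block limit before the warp limit.
            int residentWarps = l.maxBlocksPerSm * warpsPerBlock;
            if (residentWarps > l.maxWarpsPerSm) residentWarps = l.maxWarpsPerSm;
            printf("CC %s: %d resident warps of %d max -> %.0f%% occupancy\n",
                   l.cc, residentWarps, l.maxWarpsPerSm,
                   100.0 * residentWarps / l.maxWarpsPerSm);
        }
        return 0;
    }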
Usually it is a good approach to use blocks that are as small as possible but still big enough to saturate the device (64 or 128 threads per block, depending on the device). This is not always possible, since you might want to synchronize threads or communicate via shared memory.
Having a large number of small blocks allows the GPU to do a kind of "auto-balancing" and keep all SMs running.
The same applies to a CPU: if you have 5 independent tasks, each taking 4 seconds to finish, but only 4 cores, the job ends after 8 seconds (during the first 4 seconds all 4 cores run the first 4 tasks, then 1 core runs the last task while 3 cores idle).
If you are able to divide the whole job into 20 tasks that take 1 second each, the whole job will be done in 5 seconds. So having a lot of small tasks helps to utilize the hardware.
In the case of a GPU you can have a large number of active blocks (on a Titan X it is 24 SMs x 32 active blocks = 768 blocks), and it is good to use this capacity.
Anyway, it is not always true that you need to fully saturate the device. For many tasks I see that using 32 threads per block (so 50% possible occupancy) gives the same performance as using 64 threads per block.
In the end it is all a matter of doing some benchmarks and choosing whatever is best for your case on your hardware.
I have a GeForce GTX 460 SE, so it has 6 SMs x 48 CUDA cores = 288 CUDA cores.
It is known that one warp contains 32 threads, and that in one block only one warp can be executed at a time.
That is, can a single multiprocessor (SM) simultaneously execute only one block, one warp, and only 32 threads, even if there are 48 cores available?
In addition, to distribute a concrete thread and block, threadIdx.x and blockIdx.x can be used; to allocate them, use kernel <<< Blocks, Threads >>> ().
But how can one allocate a specific number of warps and distribute them, and if that is not possible, then why bother to know about warps?
The situation is quite a bit more complicated than what you describe.
The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They can have many operations in flight at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started a long time ago (around 20 cycles for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
The warp schedulers (yellow in the image) can schedule 2 warps * 32 threads per warp = 64 threads to the pipelines per cycle, so that is the number of results that can be obtained per clock. Given that there is a mix of computing resources (48 cores, 16 LD/ST units, 8 SFUs), each with different latencies, a mix of warps is being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.
The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
Adding to the complexity, warps executing instructions for which there are fewer than 32 functional units must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so a warp containing an instruction that requires the SFUs must be scheduled 4 times.
This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".
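The arithmetic from the description above, spelled out in a small snippet (the 48 cores, 8 SFUs, and the roughly 20-cycle ALU latency are the figures quoted above, not exact values for any particular chip):

    #include <cstdio>

    int main() {
        const int warpSize   = 32;
        const int aluPerSm   = 48;   // "CUDA cores" per SM on this part
        const int aluLatency = 20;   // approximate cycles, as stated above
        const int sfuPerSm   = 8;

        // Operations that can be in flight in the ALU pipelines at once,
        // and roughly how many warps that corresponds to.
        int opsInFlight   = aluPerSm * aluLatency;   // 960
        int warpsInFlight = opsInFlight / warpSize;  // 30

        // With only 8 SFUs, one warp-wide SFU instruction needs several issues.
        int sfuIssuesPerWarp = warpSize / sfuPerSm;  // 4

        printf("ops in flight: %d (~%d warps), SFU issues per warp: %d\n",
               opsInFlight, warpsInFlight, sfuIssuesPerWarp);
        return 0;
    }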
So, coming to your actual question,
why bother to know about Warps?
Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:
In the kernel invocation, <<<Blocks, Threads>>>, try to choose a number of threads per block that is a multiple of the number of threads in a warp. If you don't, you end up launching blocks that contain inactive threads (see the sketch after these points).
In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.
In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory.
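A minimal sketch tying the first and third points together: the block size is a multiple of the warp size, and consecutive threads read consecutive 32-bit words, so each warp's loads coalesce. The kernel and sizes are illustrative, and the host-side setup (allocations, copies) is omitted:

    // Each thread handles one float; thread k in a warp reads word k, so the
    // warp's accesses are consecutive 32-bit words and coalesce.
    __global__ void scaleArray(const float* __restrict__ in,
                               float* __restrict__ out,
                               int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // tail threads simply do nothing
            out[i] = in[i] * factor;
    }

    // Launch with a block size that is a multiple of 32 and enough blocks
    // to cover n elements:
    //   int threads = 256;                        // 8 warps per block
    //   int blocks  = (n + threads - 1) / threads;
    //   scaleArray<<<blocks, threads>>>(d_in, d_out, n, 2.0f);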
Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?
Yes, the programming model guarantees that the threads are grouped into warps in that specific order.
As a simple example of optimizing divergent code paths, can the threads in the block be separated into groups of 32 threads? For example: switch (threadIdx.x / 32) { case 0: /* warp 1 */ break; case 1: /* warp 2 */ break; /* etc. */ }
Exactly :)
How many bytes must be read at one time for a single warp: 4 bytes * 32 threads, 8 bytes * 32 threads, or 16 bytes * 32 threads? As far as I know, one transaction to global memory receives 128 bytes at a time.
Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).
This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.
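For the aligned, contiguous case described above, the transaction count is just the warp's total bytes divided by the 128-byte line size; a misaligned or scattered access pattern touches more lines than this. A small snippet of that arithmetic:

    #include <cstdio>

    int main() {
        const int warpSize  = 32;
        const int lineBytes = 128;                 // transaction / line size
        const int bytesPerThread[] = { 4, 8, 16 }; // e.g. float, float2, float4

        for (int b : bytesPerThread) {
            int totalBytes   = b * warpSize;
            int transactions = (totalBytes + lineBytes - 1) / lineBytes;
            printf("%2d bytes/thread -> %3d bytes/warp -> %d x 128-byte transactions\n",
                   b, totalBytes, transactions);
        }
        return 0;                                  // prints 1, 2, and 4 transactions
    }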
I have to find the execution time (in microseconds) of a small block of MIPS code, given that:
it will take a total of 30 cycles
total of 10 MIPS instructions
2.0 GHz CPU
That's all the information I am given to solve this question with (I already added up the total number of cycles, given the assumptions I am supposed to make about how many cycles different kinds of instructions take). I have been playing around with the formulas from the book trying to find the execution time, but I can't get an answer that seems right. What's the process for solving a problem like this? Thanks.
My best guess at interpreting your problem is that on average each instruction takes 3 cycles to complete. Because you were given the total number of cycles I'm not sure that the instruction count even matters.
You have a 2 GHz machine, so that is 2 * 10^9 cycles per second. This equates to each cycle taking 5 * 10^(-10) seconds (twice as fast as a 1 GHz machine, where each cycle takes 1 * 10^(-9) seconds).
We have 30 cycles to run the program, so...
30 * (5 * 10^(-10)) = 1.5 * 10^(-8) seconds, or 15 nanoseconds (0.015 microseconds), to execute all 10 instructions in 30 cycles.
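The same calculation as a small snippet, reported both in nanoseconds and in the microseconds the question asks for:

    #include <cstdio>

    int main() {
        const double clockHz = 2.0e9;              // 2 GHz
        const double cycles  = 30.0;               // total cycles for the code block
        const double seconds = cycles / clockHz;   // 1.5e-8 s
        printf("execution time: %.1f ns = %.3f us\n",
               seconds * 1e9, seconds * 1e6);      // prints 15.0 ns = 0.015 us
        return 0;
    }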