Clarifying memory transactions in CUDA - cuda

I am confused about the following statements in the CUDA programming guide 4.0 section 5.3.2.1
in the chapter of Performance Guidelines.
Global memory resides in device memory and device memory is accessed
via 32-, 64-, or 128-byte memory transactions.
These memory transactions must be naturally aligned:Only the 32-, 64- ,
128- byte segments of device memory
that are aligned to their size (i.e. whose first address is a
multiple of their size) can be read or written by memory
transactions.
1)
My understanding of device memory was that accesses to the device memory by threads is uncached: So if thread accesses memory location a[i] it will fetch only a[i] and none of the
values around a[i]. So the first statement seems to contradict this. Or perhaps I am misunderstanding the usage of the phrase "memory transaction" here?
2) The second sentence does not seem very clear. Can someone explain this?

Memory transactions are performed per warp. So 32 byte transactions is a warp sized read of an 8 bit type, 64 byte transactions is a warp sized read of an 16 bit type, and 128 byte transactions is a warp sized read of an 32 bit type.
It just means that all reads have to be aligned to a natural word size boundary. It is not possible for a warp to read a 128 byte transaction with a one byte offset. See this answer for more details.

Related

Cuda: async-copy vs coalesced global memory read atomicity

I was reading something about the memory model in Cuda. In particular, when copying data from global to shared memory, my understanding of shared_mem_data[i] = global_mem_data[i] is that it is done in a coalesced atomic fashion, i.e each thread in the warp reads global_data[i] in a single indivisible transaction. Is that correct?
tl;dr: No.
It is not guaranteed, AFAIK, that all values are read in a single transaction. In fact, a GPU's memory bus is not even guaranteed to be wide enough for a single transaction to retrieve a full warp's width of data (1024 bits for a full warp read of 4 bytes each). It is theoretically for some values in the read-from locations in memory to change while the read is underway.

CUDA coalesced memory access speed depending on word size

I have a CUDA program where one warp needs to access (for example) 96 bytes of global memory.
It properly aligns the memory location and lane indices such that the access is coalesced and done in a single transaction.
The program could do the access using 12 lanes each accessing a uint8_t. Alternately it would use 6 lanes accessing a uint16_t, or 3 lanes accessing a uint32_t.
Is there a performance difference between these alternatives, is the access faster if each thread accesses a smaller amount of memory?
When the amounts of memory each warp needs to access vary, is there a benefit in optimizing it such that the threads are made to access smaller units (16bit or 8bit) when possible?
Without knowing how the data will be used once in registers it is hard to state the optimal option. For almost all GPUs the performance difference between these options will likely be very small.
NVIDIA GPU L1 supports returning either 64 bytes/warp (CC5.,6.) or 128 bytes/warp (CC3., CC7.) returns from L1. As long as the size <= 32 bits per thread then the performance should be very similar.
In CC 5./6. there may be a small performance benefit to reduce the number of predicated true threads (prefer larger data). The L1TEX unit breaks global access into 4 x 8 thread requests. If full groups of 8 threads are predicated off then a L1TEX cycle is saved. Write back to the register file takes the same number of cycles. The grouping order of threads is not disclosed.
Good practice is to write a micro-benchmark. The CUDA profilers have numerous counters for different portions of the L1TEX path to help see the difference.

CUDA coalesced access of FP64 data

I am a bit confused with how memory access issued by a warp is affected by FP64 data.
A warp always consists of 32 threads regardless if these threads are doing FP32 or FP64 calculations. Right?
I have read that each time a thread in a warp tries to read/write the global memory, the warp accesses 128 bytes (32 single-precision floats). Right?
So if all the threads in a warp are reading different single precision floats (a total of 128 bytes) from the memory but in a coalesced manner, the warp will issue a single memory transaction. Right?
Here is my question now:
What if all threads in the warp try to access different double-precision floats (a total of 256 bytes) in a coalesced manner? Will the warp issue two memory transactions (128+128)?
PS: I am mostly interested in Compute Capability 2.0+ architectures
A warp always consists of 32 threads regardless if these threads are
doing FP32 or FP64 calculations. Right?
Correct
I have read that each time a thread in a warp tries to read/write the
global memory, the warp accesses 128 bytes (32 single-precision
floats). Right?
Not exactly. There are also 32 byte transaction sizes.
So if all the threads in a warp are reading different single precision
floats (a total of 128 bytes) from the memory but in a coalesced
manner, the warp will issue a single memory transaction. Right?
Correct
What if all threads in the warp try to access different
double-precision floats (a total of 256 bytes) in a coalesced manner?
Will the warp issue two memory transactions (128+128)?
Yes. The compiler will emit a 64 bit load instruction which will be serviced by two 128 byte transactions per warp when coalesced memory access is possible.

Can many threads set bit on the same word simultaneously?

I need each thread of a warp deciding on setting or not its respective bit in a 32 bits word. Does this multiple setting take only one memory access, or will be one memory access for each bit set?
There is no independent bit-setting capability in CUDA. (There is a bit-field-insert instruction in PTX, but it nevertheless operates on a 32-bit quantity.)
Each thread would set a bit by doing a full 32-bit write. Such a write would need to be an atomic RMW operation in order to preserve the other bits. Therefore the accesses would effectively be serialized, at whatever the throughput of atomics are.
If memory space is not a concern, breaking the bits out into separate integers would allow you to avoid atomics.
A 32-bit packed quantity could then be quickly assembled using the __ballot() warp vote function. An example is given in the answer here.
(In fact, the warp vote function may allow you to avoid memory transactions altogether; everything can be handled in registers, if the only result you need is the 32-bit packed quantity.)

Why bother to know about CUDA Warps?

I have GeForce GTX460 SE, so it is: 6 SM x 48 CUDA Cores = 288 CUDA Cores.
It is known that in one Warp contains 32 threads, and that in one block simultaneously (at a time) can be executed only one Warp.
That is, in a single multiprocessor (SM) can simultaneously execute only one Block, one Warp and only 32 threads, even if there are 48 cores available?
And in addition, an example to distribute concrete Thread and Block can be used threadIdx.x and blockIdx.x. To allocate them use kernel <<< Blocks, Threads >>> ().
But how to allocate a specific number of Warp-s and distribute them, and if it is not possible then why bother to know about Warps?
The situation is quite a bit more complicated than what you describe.
The ALUs (cores), load/store (LD/ST) units and Special Function Units (SFU) (green in the image) are pipelined units. They keep the results of many computations or operations at the same time, in various stages of completion. So, in one cycle they can accept a new operation and provide the results of another operation that was started a long time ago (around 20 cycles for the ALUs, if I remember correctly). So, a single SM in theory has resources for processing 48 * 20 cycles = 960 ALU operations at the same time, which is 960 / 32 threads per warp = 30 warps. In addition, it can process LD/ST operations and SFU operations at whatever their latency and throughput are.
The warp schedulers (yellow in the image) can schedule 2 * 32 threads per warp = 64 threads to the pipelines per cycle. So that's the number of results that can be obtained per clock. So, given that there are a mix of computing resources, 48 core, 16 LD/ST, 8 SFU, each which have different latencies, a mix of warps are being processed at the same time. At any given cycle, the warp schedulers try to "pair up" two warps to schedule, to maximize the utilization of the SM.
The warp schedulers can issue warps either from different blocks, or from different places in the same block, if the instructions are independent. So, warps from multiple blocks can be processed at the same time.
Adding to the complexity, warps that are executing instructions for which there are fewer than 32 resources, must be issued multiple times for all the threads to be serviced. For instance, there are 8 SFUs, so that means that a warp containing an instruction that requires the SFUs must be scheduled 4 times.
This description is simplified. There are other restrictions that come into play as well that determine how the GPU schedules the work. You can find more information by searching the web for "fermi architecture".
So, coming to your actual question,
why bother to know about Warps?
Knowing the number of threads in a warp and taking it into consideration becomes important when you try to maximize the performance of your algorithm. If you don't follow these rules, you lose performance:
In the kernel invocation, <<<Blocks, Threads>>>, try to chose a number of threads that divides evenly with the number of threads in a warp. If you don't, you end up with launching a block that contains inactive threads.
In your kernel, try to have each thread in a warp follow the same code path. If you don't, you get what's called warp divergence. This happens because the GPU has to run the entire warp through each of the divergent code paths.
In your kernel, try to have each thread in a warp load and store data in specific patterns. For instance, have the threads in a warp access consecutive 32-bit words in global memory.
Are threads grouped into Warps necessarily in order, 1 - 32, 33 - 64 ...?
Yes, the programming model guarantees that the threads are grouped into warps in that specific order.
As a simple example of optimizing of the divergent code paths can be used the separation of all the threads in the block in groups of 32 threads? For example: switch (threadIdx.s/32) { case 0: /* 1 warp*/ break; case 1: /* 2 warp*/ break; /* Etc */ }
Exactly :)
How many bytes must be read at one time for single Warp: 4 bytes * 32 Threads, 8 bytes * 32 Threads or 16 bytes * 32 Threads? As far as I know, the one transaction to the global memory at one time receives 128 bytes.
Yes, transactions to global memory are 128 bytes. So, if each thread reads a 32-bit word from consecutive addresses (they probably need to be 128-byte aligned as well), all the threads in the warp can be serviced with a single transaction (4 bytes * 32 threads = 128 bytes). If each thread reads more bytes, or if the the addresses are not consecutive, more transactions need to be issued (with separate transactions for each separate 128-byte line that is touched).
This is described in the CUDA Programming Manual 4.2, section F.4.2, "Global Memory". There's also a blurb in there saying that the situation is different with data that is cached only in L2, as the L2 cache has 32-byte cache lines. I don't know how to arrange for data to be cached only in L2 or how many transactions one ends up with.