Does serializing of simultaneous global memory accesses to one address happen when there are L1 and L2 cache levels? - cuda

Based on what I know, when the threads of a warp access the same address in global memory, the requests get serialized, so it's better to use constant memory. Does this serializing of simultaneous global memory accesses still happen when the GPU is equipped with L1 and L2 cache levels (as in the Fermi and Kepler architectures)? In other words, when the threads of a warp access the same global memory address, do the other 31 threads of the warp benefit from the cache because one thread has already requested that address? What happens when the access is a read, and what happens when it is a write?

Simultaneous global accesses to the same address by threads in the same warp in Fermi and Kepler do not get serialized. The warp read has a broadcast mechanism which satisfies all such reads from a single cacheline read with no performance impact. The performance is the same as if it were a fully coalesced read. This is true regardless of cache specifics, for example it is true even if L1 caching is disabled.
The performance of simultaneous writes is not specified (AFAIK) but behaviorally, simultaneous writes always get serialized, and the order is undefined.
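As a minimal sketch (hypothetical names, not code from the question or answer), both loads below are serviced with a single cache-line transaction per warp on Fermi/Kepler-class hardware; the same-address load is broadcast to all 32 threads:
__global__ void broadcast_vs_coalesced(const int *in, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    // Every thread in the warp reads the same address: the value is
    // broadcast from one cache-line read, with no serialization.
    int common = in[0];
    // Each thread reads its own adjacent element: a fully coalesced read,
    // also one cache-line transaction per warp.
    int own = in[idx];
    out[idx] = common + own;
}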
EDIT responding to additional questions below:
Even if all threads in the warp write the same value into the same address, does it get serialized? Isn't there a write broadcast mechanism that recognizes such situation?
There is not a write broadcast mechanism that looks at all the simultaneous writes to see if they are all the same and then takes some action based on that. The correct answer is that the writes happen in an unspecified order, and the performance characteristics are undefined. Obviously, if all the values being written are the same, you can be assured that the value that ends up in the location will be that value. But if you're asking whether the write activity is collapsed to a single cycle or requires multiple cycles to complete, that actual behavior is undefined (undocumented) and in fact may vary from one architecture to the next (for example, cc1.x may serialize in such a way that all the writes are performed, whereas cc2.x may "serialize" in such a way that one write "wins" and all the others are discarded, not consuming actual cycles). Again, the performance is undocumented/unspecified, but the program-observable behavior is defined.
With this broadcast mechanism you explained, is the only difference between a constant memory broadcast access and a global memory broadcast access that the latter may route the access all the way to global memory, while the former has dedicated hardware and is therefore faster?
__constant__ memory uses the constant cache, which is a dedicated piece of hardware available on a per-SM basis that caches a particular section of global memory in a read-only fashion. This hardware cache is physically and logically separate from the L1 cache (if it exists and is enabled) and the L2 cache.
For Fermi and beyond, both mechanisms support broadcast on read, and for the constant cache this is the preferred access pattern, because the constant cache can only service one read access per cycle (i.e. it does not support a whole-cacheline read by a warp). Either mechanism may "hit" in the cache (if present) or "miss" and trigger a global read. On the first read of a given location (or cacheline), neither cache will have the requested data, so it will "miss" and trigger a global memory read to service the access. Thereafter, in either case, subsequent reads will be serviced out of the cache, assuming the relevant data is not evicted in the interim.
For early cc1.x devices, the constant cache was quite valuable, since those devices did not have an L1 cache. For Fermi and beyond, the principal reason to use the constant cache is when suitable data (i.e. read-only) and access patterns (same address per warp) are available; using the constant cache then prevents those reads from travelling through L1 and possibly evicting other data. In effect you are increasing the cacheable footprint somewhat, over what the L1 can support alone.
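For illustration, here is a sketch of the constant-cache-friendly pattern described above (the names and the 16-coefficient filter are assumptions, not from the original answer). The loop index k is uniform across the warp, so every thread reads the same coeffs[k] address each iteration, which is the broadcast pattern the constant cache services best:
__constant__ float coeffs[16];   // cached by the per-SM constant cache

__global__ void apply_filter(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float x = in[idx];           // ordinary global read, serviced via L1/L2
    float acc = 0.0f;
    for (int k = 0; k < 16; ++k) // k is uniform across the warp
        acc += coeffs[k] * x;    // same constant address for all threads
    out[idx] = acc;
}

// Host side, before the launch:
//   cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(coeffs));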

Related

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all, I think you have the wrong idea when you say that concurrent reads may or may not cripple the degree of parallelism, because that is what it means to be parallel: each thread is reading concurrently. Instead you should be asking whether performance suffers from more memory accesses when each thread basically wants the same thing, i.e. the same node.
According to the article here, memory accesses can be coalesced only within a warp and only if data locality is present.
This means that if threads within a warp try to access memory locations near each other, those accesses can be coalesced. In your case each thread tries to access the "same" node until it reaches a point where the threads branch off.
So the memory accesses will be coalesced within the warp until the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated in global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by the threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access on devices with a compute capability of less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device that has a compute capability of 2.0 or higher. That being said, pretty much regardless of your device generation, accessing global memory with large strides yields poor effective bandwidth (Reference). I would assume that similar behavior is likely for random access.
If you are changing the structure while it is being read (which I assume you might be, since a scene is usually updated every frame), then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem; using atomic operations guarantees that race conditions don't happen.
You can try stuffing the scene into shared memory if you can fit it (see the sketch below).
You can also try using streams to increase concurrency, which also brings some synchronization to the kernels that are run in the same stream.
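To make the shared-memory suggestion concrete, here is a rough sketch; the Node layout, the cached-prefix size, and the kernel signature are all assumptions, since the question does not show the scene structure. The idea is that the nodes every ray visits first (the top of the tree) are copied into shared memory once per block:
struct Node { float bounds[6]; int left, right; };   // hypothetical node layout

__global__ void trace(int width, int height, float *frameBuffer,
                      const Node *scene, int numNodes)
{
    __shared__ Node cached[32];                       // top of the tree, per block
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;
    // Cooperative copy of the first nodes into shared memory.
    for (int i = tid; i < 32 && i < numNodes; i += nthreads)
        cached[i] = scene[i];
    __syncthreads();
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    // Traversal would start from cached[0] and fall back to scene[] for
    // nodes deeper than the cached prefix (details omitted).
}
Whether this actually wins over simply relying on L1/L2 depends on how often the same nodes are revisited; on Fermi and later the caches already handle the shared-root reads reasonably well.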

Performance of cmem vs texture on Pascal [duplicate]

Does storing data in constant memory provide any benefit over texture memory on the Pascal architecture if the data request frequency among threads is very high (every thread picks at least one datum from a specific column)?
EDIT: This is a split version of this question to improve community searching
If the expectations for constant memory usage are satisfied, the use of constant memory is a good idea in the general case. It allows your code to take advantage of an additional cache mechanism provided by the GPU hardware, and in doing so puts less pressure on the use of texture by other parts of your code.
Since the constant memory and its cache, like the texture and surface memory and their own cache, are defined by the hardware compute capability, the target hardware should be taken into account. The choice between constant memory and texture memory therefore depends on the access pattern and on how the caches are used, as well as on cache availability.
Constant memory performance is tied to data broadcast among the threads of a warp, so the maximum performance is achieved when all threads request the very same address and the data is already in the cache. Thus, if a single warp requests multiple addresses, the access is split into multiple requests, since the constant cache can retrieve only a single address per operation. If the number of split requests caused by reads from multiple addresses is too high, texture and surface memory performance may be superior to constant memory in that situation. This information is detailed in the CUDA Programming Guide:
The constant memory space resides in device memory and is cached in the constant cache mentioned in Compute Capability 2.x. A request is then split into as many separate requests as there are different memory addresses in the initial request, decreasing throughput by a factor equal to the number of separate requests. The resulting requests are then serviced at the throughput of the constant cache in case of a cache hit, or at the throughput of device memory otherwise.
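A small sketch of the two access patterns this passage contrasts (hypothetical names): the first read is a single broadcast request per warp, while the second is split into as many requests as there are distinct addresses in the warp, which is exactly what degrades constant-memory throughput:
__constant__ float table[256];

__global__ void constant_access_patterns(float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float fast = table[0];          // same address for the whole warp: one request
    float slow = table[idx % 256];  // up to 32 different addresses: split into many requests
    out[idx] = fast + slow;
}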
The texture memory cache is more flexible than the constant memory cache. It can take advantage of reads within the same warp to addresses that are close together in a 2D sense. Despite some advantages over constant memory, in general texture memory should be used if the data access pattern or the data size does not fit the constant memory requirements, or to make use of the texture cache. More detailed information can be found at:
The texture and surface memory spaces reside in device memory and are cached in texture cache, so a texture fetch or surface read costs one memory read from device memory only on a cache miss, otherwise it just costs one read from texture cache. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture or surface addresses that are close together in 2D will achieve best performance. Also, it is designed for streaming fetches with a constant latency; a cache hit reduces DRAM bandwidth demand but not fetch latency.
Reading device memory through texture or surface fetching presents some benefits that can make it an advantageous alternative to reading device memory from global or constant memory:
- If the memory reads do not follow the access patterns that global or constant memory reads must follow to get good performance, higher bandwidth can be achieved providing that there is locality in the texture fetches or surface reads;
- Addressing calculations are performed outside the kernel by dedicated units;
- Packed data may be broadcast to separate variables in a single operation;
- 8-bit and 16-bit integer input data may be optionally converted to 32-bit floating-point values in the range [0.0, 1.0] or [-1.0, 1.0] (see Texture Memory).
The developer should keep in mind that combining texture memory with constant memory can be a real advantage over preferring a single one, because it allows the code to take advantage of the dedicated cache of each; both caches perform better than any access that has to go outside the caches (i.e. to device memory).
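As a hedged illustration of the texture path (the buffer names and sizes are assumptions), linear device memory can be read through a texture object so that the reads are serviced by the texture cache rather than the L1/constant cache:
__global__ void read_through_texture(cudaTextureObject_t tex, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = tex1Dfetch<float>(tex, idx);   // served by the texture cache
}

void launch(float *d_in, float *d_out, int n)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    read_through_texture<<<(n + 255) / 256, 256>>>(tex, d_out, n);
    cudaDestroyTextureObject(tex);
}
Data that satisfies the constant-memory expectations (small, read-only, uniform address per warp) can stay in __constant__ memory at the same time, so each dataset uses the cache that suits its access pattern.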

How the access of the same global memory address is performed by threads from different kernels?

If many threads in a warp want to read an address in global memory, this data is broadcast, is that right?
If many threads in a warp want to write into an address in global memory, there is a serialization, but it is not possible to predict the order, is that right?
But, the first question: if many threads, in different warps and in different blocks, want to write into an address in global memory, what will the GPU do? Does it serialize all the accesses to this address? Is there any guarantee of data consistency?
With Hyper-Q it is possible to launch a lot of streams containing kernels. If I have a position in memory, and a number of threads in different kernels want to write to or read this address, what will the GPU do? Does it serialize the accesses of all threads from different kernels, or does the GPU do nothing and some inconsistencies happen? Is there any guarantee of data consistency when multiple kernels are reading/writing the same address?
It's preferred that you ask one question per question.
If many threads in a warp want to read an address in global memory, this data is broadcast, is that right?
Yes this is true for Fermi (CC2.0) and beyond.
If many threads in a warp want to write into an address in global memory, there is a serialization, but it is not possible to predict the order, is that right?
Correct. The order is undefined.
If many threads, in different warps and in different blocks, want to write into an address in global memory, what will the GPU do? Does it serialize all the accesses to this address?
If the accesses are simultaneous, they are serialized. Again, order is undefined.
Is there any guarantee of data consistency?
I'm not sure what you mean by data consistency. In any case, what else could the GPU do except serialize simultaneous writes? I'm surprised this is such a difficult concept, as there appears to me to be no obvious alternative.
If I have a position in memory, and a number of threads in different kernels want to write to or read this address, what will the GPU do? Does it serialize the accesses of all threads from different kernels, or does the GPU do nothing and some inconsistencies happen? Is there any guarantee of data consistency when multiple kernels are reading/writing the same address?
It does not matter what the origin of simultaneous writes to global memory is, whether from the same warp, or from different warps, in different blocks, or in different kernels. Simultaneous writes are serialized, in an undefined order. Again, for "data consistency" I'd like to know what you mean by that. Simultaneous reads and writes are also going to produce undefined behavior. The reads may return a value including the initial value of the memory location or any of the values that were written.
The final result of simultaneous writes to any GPU memory location is undefined. If all simultaneous writes are writing the same value, then the final value in that location will reflect that. Otherwise, the final value will reflect one of the values that got written; which value is undefined. Beyond that, most of your questions and statements don't make sense to me (what do you mean by data consistency?). You should not expect anything rational from such programming behavior. The GPU should be programmed as a distributed independent work machine, not a globally synchronous machine. Note that "undefined" also means that results may vary from one run of a kernel to the next, even if the input data is identical.
Simultaneous or nearly simultaneous reading and writing of global memory from different blocks (whether from the same or different kernels) is especially hazardous on Fermi (cc2.x) devices due to the independent non-coherent L1 caches that are interposed between the SMs (where the threadblocks execute) and the L2 cache (which is device-wide, and therefore coherent). Attempting to create synchronized behavior between threadblocks using global memory as a vehicle is difficult at best, and discouraged. It is suggested to consider ways to recast your algorithm to structure the work independently.
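If writers from different warps, blocks, or concurrently running kernels all need to update one location with a well-defined result, atomics are the usual tool (a sketch with assumed names, not part of the original answer); plain simultaneous stores, as discussed above, only guarantee that one of the written values survives:
__global__ void count_hits(const float *data, int n, unsigned int *counter)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && data[idx] > 0.0f)
        atomicAdd(counter, 1u);   // serialized by the hardware; the total is well defined
}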

CUDA Compute Capability 2.0. Global memory access pattern

Starting with CUDA Compute Capability 2.0 (Fermi), global memory access works through a 768 KB L2 cache. It looks like developers no longer need to care about global memory banks. But global memory is still very slow, so the right access pattern is important. Now the point is to use/reuse L2 as much as possible. And my question is: how? I would be thankful for some detailed info on how L2 works and how I should organize and access global memory if I need, for example, a 100-200 element array per thread.
L2 cache helps in some ways, but it does not obviate the need for coalesced access of global memory. In a nutshell, coalesced access means that for a given read (or write) instruction, individual threads in a warp are reading (or writing) adjacent, contiguous locations in global memory, preferably that are aligned as a group on a 128-byte boundary. This will result in the most effective utilization of the available memory bandwidth.
In practice this is often not difficult to accomplish. For example:
int idx=threadIdx.x + (blockDim.x * blockIdx.x);
int mylocal = global_array[idx];
will give coalesced (read) access across all the threads in a warp, assuming global_array is allocated in an ordinary fashion using cudaMalloc in global memory. This type of access makes 100% usage of the available memory bandwidth.
A key takeaway is that memory transactions ordinarily occur in 128-byte blocks, which happens to be the size of a cache line. If you request even one of the bytes in a block, the entire block will be read (and stored in L2, normally). If you later read other data from that block, it will normally be serviced from L2, unless it has been evicted by other memory activity. This means that the following sequence:
int mylocal1 = global_array[0];
int mylocal2 = global_array[1];
int mylocal3 = global_array[31];
would all typically be serviced from a single 128-byte block. The first read for mylocal1 will trigger the 128-byte read. The second read for mylocal2 would normally be serviced from the cached value (in L2 or L1), not by triggering another read from memory. However, if the algorithm can be suitably modified, it's better to read all your data contiguously from multiple threads, as in the first example. This may be just a matter of clever organization of data, for example using structures of arrays rather than arrays of structures (see the sketch below).
In many respects, this is similar to CPU cache behavior. The concept of a cache line is similar, along with the behavior of servicing requests from the cache.
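To illustrate the structure-of-arrays suggestion above, here is a small sketch (hypothetical layout): with the AoS layout each thread's load of .x is strided by the struct size, so a warp touches several 128-byte blocks, while with the SoA layout the same loads are adjacent and coalesce into a single transaction per warp:
struct Particle { float x, y, z, w; };           // Array of Structures element

struct ParticlesSoA { float *x, *y, *z, *w; };   // Structure of Arrays

__global__ void scale_x_aos(Particle *p, float s, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) p[idx].x *= s;                  // 16-byte stride between threads
}

__global__ void scale_x_soa(ParticlesSoA p, float s, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) p.x[idx] *= s;                  // contiguous, fully coalesced
}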
Fermi L1 and L2 can support write-back and write-through. L1 is available on a per-SM basis and is configurably split with shared memory to be either 16KB L1 (and 48KB shared memory) or 48KB L1 (and 16KB shared memory). L2 is unified across the device and is 768KB.
Some advice I would offer is to not assume that the L2 cache just fixes sloppy memory accesses. The GPU caches are much smaller than equivalent caches on CPUs, so it's easier to get into trouble there. A general piece of advice is simply to code as if the caches were not there. Rather than CPU oriented strategies like cache-blocking, it's usually better to focus your coding effort on generating coalesced accesses and then possibly make use of shared memory in some specific cases. Then for the inevitable cases where we can't make perfect memory accesses in all situations, we let the caches provide their benefit.
You can get more in-depth guidance by looking at some of the available NVIDIA webinars. For example, the Global Memory Usage & Strategy webinar (and slides) or the CUDA Shared Memory & Cache webinar would be instructive for this topic. You may also want to read the Device Memory Access section of the CUDA C Programming Guide.

the latency of accessing shared memory

Which latency is longer between the two situations below?
1. The data is loaded into shared memory from global memory, and all the threads access the shared memory concurrently; the data may be the same for multiple accessing threads.
2. All the threads access global memory directly, but the data items are neighbors.
If you plan on accessing each value only once, then you won't gain anything from using shared memory.
Values in shared memory are only valid within a block, so one or more threads in each block will have to load the values from global memory. So you're not able to avoid the global memory accesses.
If you have a device of compute capability >= 2.0 (Fermi), values read from global memory are automatically cached in the L1 and L2 caches. L1 has the same latency as shared memory.
Latency is a fixed value that depends on which memory you're accessing. It doesn't change. Latency is always much lower for shared memory than for global memory.
I think what you might really be asking is what type of access would give you the best memory throughput. If you will be using each value only once, case (2) will give the best throughput. If you will be reusing values and have CC >= 2.0, letting L1 handle the caching is likely to give the best throughput. If you're reusing values on CC < 2.0, using shared memory will give the best throughput.
Case (1) may or may not cause bank conflicts but will give better throughput regardless, for values that are already stored in shared memory.
Case (2) describes the optimal access pattern for global memory.
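A small sketch of the two cases (assumed block size of 256, hypothetical names): case (2) reads neighboring global addresses directly and coalesces; case (1) stages a tile into shared memory first, which only pays off if each value is then reused, e.g. by neighboring threads:
__global__ void case2_direct(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = 2.0f * in[idx];                    // one coalesced global read, used once
}

__global__ void case1_shared(const float *in, float *out, int n)
{
    __shared__ float tile[256];                       // assumes blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;   // still one global read per value
    __syncthreads();
    if (idx < n) {
        // The shared copy only helps because each value is read again by a neighbor.
        float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        out[idx] = tile[threadIdx.x] + left;
    }
}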
Perhaps I don't understand the difference between the two cases. But if I do:
The second is faster if your hardware architecture allows it, for example on a multicore machine with parallel registers. Notice also that in the second case, even from a pure software viewpoint, the data does not need to be made thread-safe against hazards such as race conditions due to interleaving.
Think of it like this:
CASE 2:
You have a large table with five dinners, and you have five kids to eat them: no synchronization needed.
CASE 1:
You have, say, three tables with three dinners; so that two kids may have to eat from the same plate and thus may need to synchronize their movements so they don't hit each other. Synchronization means delay.