How the access of the same global memory address is performed by threads from different kernels? - cuda

If many threads in a warp want to read an address in global memory, this data is broadcasted, is that right?
If many threads in a warp want to write into an address in global memory, there is a serialization, but is not possible to predict the order, is that right?
But, the first question: If many threads in a different warps, in different blocks, want to write into an address in global memory? What the GPU gonna do? Serializes all the access to this address? Is there any guarantee of data consistence?
With Hyper-Q it is possible to launch a lot of streams containing kernels. If I have a position in the memory, and a number of threads in different kernels wants to write or read this address, what the GPU gonna do? Serializes the accesses of all threads from different kernels, or does the GPU do nothing and some inconsistencies gonna happen? Is there any guarantee of data consistence when multiple kernels are reading/writing into the same address?

It's preferred that you ask one question per question.
If many threads in a warp want to read an adress in global memory, this data is broadcasted, is that right?
Yes this is true for Fermi (CC2.0) and beyond.
If many threads in a warp want to write into an adress in global memory, there is a serialization, but is not possible to predict the order, is that right?
Correct. The order is undefined.
If many threads in a different warps, in different blocks, want to write into an adress in global memory? What the GPU gonna do? Serializes all the access to this address?
If the accesses are simultaneous, they are serialized. Again, order is undefined.
Is there any guarantee of data consistence?
Not sure what you mean by data consistence. Anyway, what else could the GPU do except serialize simultaneous writes? I'm surprised this is such a difficult concept, as there appears to me to be no obvious alternative.
If I have a possition in the memory, and a number of threads in different kernels wants to write or read this address, what the GPU gonna do? Serializes the access of all threads from different kernels, or the GPU do nothing and some inconsistences gonna happen? Is there any guarantee of data consistence when multiple kernels are reading/writing into the same address?
It does not matter what is the origin of simultaneous writes to global memory, whether from the same warp, or different warps, in different blocks, in different kernels. Simultaneous writes are serialized, in an undefined order. Again, for "data consistence" I'd like to know what you mean by that. Simultaneous reads and writes are also going to produce undefined behavior. The reads may return a value including the initial value of the memory location or any of the values that were written.
The final result of simultaneous writes to any GPU memory location is undefined. If all simultaneous writes are writing the same value, then the final value in that location will reflect that. Otherwise, the final value will reflect one of the values that got written. Which value is undefined. Beyond that, most of your questions and statements don't make sense to me. (What do you mean by data consistence?) You should not expect anything rational from such programming behavior. The GPU should be programmed as a distributed independent work machine, not a globally synchronous machine. Note that "undefined" also means that results may vary from one run of a kernel to the next, even if the input data is identical.
Simultaneous or nearly simultaneous reading and writing of global memory from different blocks (whether from the same or different kernels) is especially hazardous on Fermi (cc2.x) devices due to the independent non-coherent L1 caches that are interposed between the SMs (where the threadblocks execute) and the L2 cache (which is device-wide, and therefore coherent). Attempting to create synchronized behavior between threadblocks using global memory as a vehicle is difficult at best, and discouraged. It is suggested to consider ways to recast your algorithm to structure the work independently.

Related

global memory access for individual threads

I am writing a simplistic raytracer. The idea is that for every pixel there is a thread that traverses a certain structure (geometry) that resides in global memory.
I invoke my kernel like so:
trace<<<gridDim, blockDim>>>(width, height, frameBuffer, scene)
Where scene is a structure that was previously allocated with cudaMalloc. Every thread has to start traversing this structure starting from the same node, and chances are that many concurrent threads will attempt to read the same nodes many times. Does that mean that when such reads take place, it cripples the degree of parallelism?
Given that geometry is large, I would assume that replicating it is not an option. I mean the whole processing still happens fairly fast, but I was wondering whether it is something that has to be dealt with, or simply left flung to the breeze.
First of all I think you got the wrong idea when you say concurrent reads may or may not cripple the degree of parallelism. Because that is what it means to be parallel. Each thread is reading concurrently. Instead you should be thinking if it affects the performance due to more memory accesses when each thread basically wants the same thing i.e. the same node.
Well according to the article here, Memory accesses can be coalesced if data locality is present and within warps only.
Which means if threads within a warp are trying to access memory locations near each other they can be coalesced. In your case each thread is trying to access the "same" node until it meets an endpoint where they branch.
This means the memory accesses will be coalesced within the warp till the threads branch off.
Efficient access to global memory from each thread depends on both your device architecture and your code. Arrays allocated on global memory are aligned to 256-byte memory segments by the CUDA driver. The device can access global memory via 32-, 64-, or 128-byte transactions that are aligned to their size. The device coalesces global memory loads and stores issued by threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A misaligned data access for devices with a compute capability of less than 2.0 affects the effective bandwidth of accessing data. This is not a serious issue when working with a device that has a compute capability of > 2.0. That being said, pretty much regardless of your device generation, when accessing global memory with large strides, the effective bandwidth becomes poor (Reference). I would assume that for random access the the same behavior is likely.
Unless you are not changing the structure while reading, which I assume you do (if it's a scene you probably render each frame?) then yes, it cripples performance and may cause undefined behaviour. This is called a race condition. You can use atomic operations to overcome this type of problem. Using atomic operations guarantees that the race conditions don't happen.
You can try, stuffing the 'scene' to the shared memory if you can fit it.
You can also try using streams to increase concurrency which also brings some sort of synchronization to the kernels that are run in the same stream.

Does serializing of simultaneous global memory accesses to one address happen when there are L1 and L2 cache levels?

Based on what I know, when threads of a warp access the same address in global memory, requests get serialized so it's better to use constant memory. Does serializing of simultaneous global memory accesses happen when GPU is equipped with L1 and L2 cache levels (in Fermi and Kepler architecture)? In other words, when threads of a warp access the same global memory address, do 31 threads of a warp benefit from cache existence because 1 thread has already requested that address? What happens when the access is a read and also when access is a write?
Simultaneous global accesses to the same address by threads in the same warp in Fermi and Kepler do not get serialized. The warp read has a broadcast mechanism which satisfies all such reads from a single cacheline read with no performance impact. The performance is the same as if it were a fully coalesced read. This is true regardless of cache specifics, for example it is true even if L1 caching is disabled.
The performance of simultaneous writes is not specified (AFAIK) but behaviorally, simultaneous writes always get serialized, and the order is undefined.
EDIT responding to additional questions below:
Even if all threads in the warp write the same value into the same address, does it get serialized? Isn't there a write broadcast mechanism that recognizes such situation?
There is not a write broadcast mechanism that looks at all the simultaneous writes to see if they are all the same, and then take some action based on that. The correct answer is that the writes happen in unspecified order, and the performance characteristics are undefined. Obviously, if all the values being written are the same, you can be assured that the value that ends up in the location will be that value. But if you're asking whether the write activity is collapsed to a single cycle or requires multiple cycles to complete, that actual behavior is undefined (undocumented) and in fact may vary from one architecture to the next (for example, cc1.x may serialize in such all way that all the writes are performed, whereas cc2.x may "serialize" in such a way that one write "wins" and all the others are discarded, not consuming actual cycles.) Again, the performance is undocumented/unspecified, but the program-observable behavior is defined.
2 With this broadcast mechanism you explained, the only difference between constant memory broadcast access and global memory broadcast access is that the first one may route the access all the way to the global memory but the latter has a dedicated hardware and is faster, right?
__constant__ memory uses the constant cache, which is a dedicated piece of hardware that is available on a per-SM basis, and caches a particular section of global memory in a read-only fashion. This HW cache is physically and logically separate from L1 cache (if it exists and is enabled) and L2 cache. For Fermi and beyond, both mechanisms support broadcast on read, and for constant cache, this is the preferred access pattern, because the constant cache can only service one read access per cycle (i.e. does not support a whole cacheline read by a warp.) Either mechanism may "hit" in the cache (if present) or "miss" and trigger a global read. On the first read of a given location (or cacheline), niether cache will have the requested data, and it will therefore "miss" and trigger a global memory read, to service the access. Thereafter, in either case, subsequent reads will be serviced out of the cache, assuming the relevant data is not evicted in the interim. For early cc1.x devices, the constant memory cache was pretty valuable since those early devices did not have a L1 cache. For Fermi and beyond the principal reason to use the constant cache would be if identifiable data(i.e. read-only) and access patterns (same address per warp) are available, then using the constant cache will prevent those reads from travelling through L1 and possibly evicting other data. In effect you are increasing the cacheable footprint somewhat, over just what the L1 can support alone.

the latency of acessing shared memory

which latency is longer between two situation below,
The data be filled into the shared memory from global memory, and all the thread access the shared memory concurrently.the data maybe the same for multiple threads accessing
All the threads access the global memory,but the data are neighbors.
If you plan on accessing each value only once, then you won't gain anything from using shared memory.
Values in shared memory are only valid within a block, so one or more threads in each block will have to load the values from global memory. So you're not able to avoid the global memory accesses.
If you have a device of compute capability >= 2.0 (Fermi), values read from global memory are automatically cached in the L1 and L2 caches. L1 has the same latency as shared memory.
Latency is a fixed value that depends on which memory you're accessing. It doesn't change. Latency is always much lower for shared memory than for global memory.
I think what you might really be asking is what type of access would give you the best memory throughput. If you will be using each value only once, case (2) will give the best throughput. If you will be reusing values and have CC >= 2.0, letting L1 handle the caching is likely to give the best throughput. If you're reusing values on CC < 2.0, using shared memory will give the best throughput.
Case (1) may or may not cause bank conflicts but will give better throughput regardless, for values that are already stored in shared memory.
Case (2) describes the optimal access pattern for global memory.
Perhaps I don't understand the difference between the two case. But if I do:
The second is faster if your hardware architecture allows it. For example, on a multicore machine with parallel registers. Notice also that in the second case, even from a pure software viewpoint, the data does not need to be made thread-safe for such fears as race-conditions due to interleaving.
Think of it like this:
CASE 2:
you have a large table with five dinners, and you have five kids to eat them: no synchronization needed.
CASE 1:
You have, say, three tables with three dinners; so that two kids may have to eat from the same plate and thus may need to synchronize their movements so they don't hit each other. Synchronization means delay.

CUDA thread local array

I am writing a CUDA kernel that requires maintaining a small associative array per thread. By small, I mean 8 elements max worst case, and an expected number of entries of two or so; so nothing fancy; just an array of keys and an array of values, and indexing and insertion happens by means of a loop over said arrays.
Now I do this by means of thread local memory; that is identifier[size]; where size is a compile time constant. Now ive heard that under some circumstances this memory is stored off-chip, and under some circumstances it is stored on-chip. Obviously I want the latter, under all circumstances. I understand that I can accomplish such with a block of shared mem, where I let each thread work on its own private block; but really? I dont want to share anything between threads, and it would be a horrible kludge.
What exactly are the rules for where this memory goes? I cant seem to find any word from nvidia. For the record, I am using CUDA5 and targetting Kepler.
Local variables are either stored in registers, or (cached for compute capability >=2.0) off-chip memory.
Registers are only used for arrays if all array indices are constant and can be determined at compile time, as the architecture has no means for indexed access to registers at runtime.
In you case the number of keys may be small enough to use registers (and tolerate the increase in register pressure). Unroll all loops over array accesses to allow the compiler to place the keys in registers, and use cuobjdump -sass to check it actually does.
If you don't want to spend registers, you can either choose shared memory with a per-thread offset (but check that the additional registers used to hold per-thread indices into shared memory don't outvalue the number of keys you use) as you mentioned, or do nothing and use off-chip "local" memory (really "global" memory with just a different addressing scheme) hoping for the cache to do it's work.
If you hope for the cache to hold the keys and values, and don't use much shared memory, it may be beneficial to select the 48kB cache / 16kB shared memory setting over the default 16kB/48kB split using cudaDeviceSetCacheConfig().

Is there a good way use a read only hashmap on cuda?

I am really new to programming and Cuda. Basically I have a C function that reads a list of data and then checks each item against a hashmap (I'm using uthash for this in C). It works well but I want to run this process in Cuda (once it gets the value for the hash key then it does a lot of processing), but I'm unsure the best way to create a read only hash function that's as quick as possible in Cuda.
Background
Basically I'm trying to value a very very large batch of portfolio as quickly as possible. I get several million portfolio constantly that are in the form of two lists. One has the stock name and the other has the weight. I then use the stock name to look up a hashtable to get other data(value, % change,etc..) and then process it based on the weight. On a CPU in plain C it takes about 8 minutes so I am interesting in trying it on a GPU.
I have read and done the examples in cuda by example so I believe I know how to do most of this except the hash function(there is one in the appendix but it seems focused on adding to it while I only really want it as a reference since it'll never change. I might be rough around the edges in cuda for example so maybe there is something I'm missing that is helpful for me in this situation, like using textual or some special form of memory for this). How would I structure this for best results should each block have its own access to the hashmap or should each thread or is one good enough for the entire GPU?
Edit
Sorry just to clarify, I'm only using C. Worst case I'm willing to use another language but ideally I'd like something that I can just natively put on the GPU once and have all future threads read to it since to process my data I'll need to do it in several large batches).
This is some thoughts on potential performance issues of using a hash map on a GPU, to back up my comment about keeping the hash map on the CPU.
NVIDIA GPUs run threads in groups of 32 threads, called warps. To get good performance, each of the threads in a warp must be doing essentially the same thing. That is, they must run the same instructions and they must read from memory locations that are close to each other.
I think a hash map may break with both of these rules, possibly slowing the GPU down so much that there's no use in keeping the hash map on the GPU.
Consider what happens when the 32 threads in a warp run:
First, each thread has to create a hash of the stock name. If these names differ in length, this will involve a different number of rounds in the hashing loop for the different lengths and all the threads in the warp must wait for the hash of the longest name to complete. Depending on the hashing algorithm, there might different paths that the code can take inside the hashing algorithm. Whenever the different threads in a warp need to take different paths, the same code must run multiple times (once for each code path). This is called warp divergence.
When all the threads in warp each have obtained a hash, each thread will then have to read from different locations in slow global memory (designated by the hashes). The GPU runs optimally when each of the 32 threads in the warp read in a tight, coherent pattern. But now, each thread is reading from an essentially random location in memory. This could cause the GPU to have to serialize all the threads, potentially dropping the performance to 1/32 of the potential.
The memory locations that the threads read are hash buckets. Each potentially containing a different number of hashes, again causing the threads in the warp to have to do different things. They may then have to branch out again, each to a random location, to get the actual structures that are mapped.
If you instead keep the stock names and data structures in a hash map on the CPU, you can use the CPU to put together arrays of information that are stored in the exact pattern that the GPU is good at handling. Depending on how busy the CPU is, you may be able to do this while the GPU is processing the previously submitted work.
This also gives you an opportunity to change the array of structures (AoS) that you have on the CPU to a structure of arrays (SoA) for the GPU. If you are not familiar with this concept, essentially, you convert:
my_struct {
int a;
int b;
};
my_struct my_array_of_structs[1000];
to:
struct my_struct {
int a[1000];
int b[1000];
} my_struct_of_arrays;
This puts all the a's adjacent to each other in memory so that when the 32 threads in a warp get to the instruction that reads a, all the values are neatly laid out next to each other, causing the entire warp to be able to load the values very quickly. The same is true for the b's, of course.
There is a hash_map extension for CUDA Thrust, in the cuda-thrust-extensions library. I have not tried it.
Because of your hash map is so large, I think it can be replaced by a database, mysql or other products will all be OK, they probably will be fast than hash map design by yourself. And I agree with Roger's viewpoint, it is not suitable to move it to GPU, it consumes too large device memory (may be not capable to contain it) and it is terribly slow for kernel function access global memory on device.
Further more, which part of your program takes 8 minutes, finding in hash map or process on weight? If it is the latter, may be it can be accelerated by GPU.
Best regards!