My question is how I can have atomic load in CUDA. Atomic exchange can emulate atomic store. Can atomic load be emulated non-expensively in a similar manner?
I can use an atomic add with 0 to load the content atomically but I think it is expensive because it does an atomic read-modify-write instead of only a read.
In addition to using volatile as recommended in the other answer, using __threadfence appropriately is also required to get an atomic load with safe memory ordering.
While some of the comments are saying to just use a normal read because it cannot tear, that is not the same as an atomic load. There's more to atomics than just tearing:
A normal read may reuse a previous load that's already in a register, and thus may not reflect changes made by other SMs with the desired memory ordering. For instance, int *flag = ...; while (*flag) { ... } may only read flag once and reuse this value for every iteration of the loop. If you're waiting for another thread to change the flag's value, you'll never observe the change. The volatile modifier ensures that the value is actually read from memory on every access. See the CUDA documentation on volatile for more info.
Additionally, you'll need to use a memory fence to enforce the correct memory ordering in the calling thread. Without a fence, you get "relaxed" semantics in C++11 parlance, and this can be unsafe when using an atomic for communication.
For example, say your code (non-atomically) writes some large data to memory and then uses a normal write to set an atomic flag to indicate that the data has been written. The instructions may be reordered, hardware cachelines may not be flushed prior to setting the flag, etc. The result is that these operations are not guaranteed to be executed in any particular order, and other threads may not observe these events in the order you expect: the write to the flag is permitted to happen before the guarded data is written.
Meanwhile, if the reading thread is also using normal reads to check the flag before conditionally loading the data, there will be a race at the hardware level. Out-of-order and/or speculative execution may load the data before the flag's read is completed. The speculatively loaded data is then used, which may not be valid since it was loaded prior to the flag's read.
Well-placed memory fences prevent these sorts of issues by enforcing that instruction reordering will not affect your desired memory ordering and that previous writes are made visible to other threads. __threadfence() and friends are also covered in the CUDA docs.
Putting all of this together, writing your own atomic load method in CUDA looks something like:
// addr must be aligned properly.
__device__ unsigned int atomicLoad(const unsigned int *addr)
{
    const volatile unsigned int *vaddr = addr; // volatile to bypass cache
    __threadfence(); // for seq_cst loads. Remove for acquire semantics.
    const unsigned int value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

// addr must be aligned properly.
__device__ void atomicStore(unsigned int *addr, unsigned int value)
{
    volatile unsigned int *vaddr = addr; // volatile to bypass cache
    // fence to ensure that previous non-atomic stores are visible to other threads
    __threadfence();
    *vaddr = value;
}
This can be written similarly for other non-tearing load/store sizes.
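For illustration, here is a minimal (hypothetical) producer/consumer sketch built on the two helpers above; payload and ready are assumed to be device globals, with ready zero-initialized before any thread runs:
__device__ int payload;        // ordinary global data guarded by the flag
__device__ unsigned int ready; // 0 = not ready, 1 = ready; zero-initialized by default

__device__ void produce(int value)
{
    payload = value;           // plain (non-atomic) write of the guarded data
    atomicStore(&ready, 1u);   // the fence inside atomicStore publishes payload first
}

__device__ int consume()
{
    while (atomicLoad(&ready) == 0) { }  // spin until the flag becomes visible
    return payload;            // the fences in atomicLoad order this read after the flag
}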
From talking with some NVIDIA devs who work on CUDA atomics, it looks like we should start seeing better support for atomics in CUDA, and the PTX already contains load/store instructions with acquire/release memory ordering semantics, but there is currently no way to access them without resorting to inline PTX. They're hoping to add them sometime this year. Once those are in place, a full std::atomic implementation shouldn't be far behind.
To the best of my knowledge, there is currently no way of requesting an atomic load in CUDA, and that would be a great feature to have.
There are two quasi-alternatives, with their advantages and drawbacks:
Use a no-op atomic read-modify-write as you suggest. I have provided a similar answer in the past. Guaranteed atomicity and memory consistency, but you pay the cost of a needless write.
The second-closest thing to an atomic load is marking the variable volatile, although strictly speaking the semantics are completely different: the language does not guarantee atomicity of the load (for example, you may in theory get a torn read), but you are guaranteed to get the most up-to-date value. In practice, though, as indicated in the comments by @Robert Crovella, it is impossible to get a torn read for properly-aligned transactions of at most 32 bytes, which does make them atomic.
Solution 2 is kind of hacky and I do not recommend it, but it is currently the only write-less alternative to solution 1 (both are sketched below). The ideal solution would be to add a way to express atomic loads directly in the language.
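For concreteness, a minimal sketch of both quasi-alternatives for a 32-bit word (the function names are mine, not part of any CUDA API):
// Option 1: no-op read-modify-write. atomicAdd with 0 leaves the word
// unchanged and returns its current value, at the cost of a needless write.
__device__ unsigned int loadViaRMW(unsigned int *addr)
{
    return atomicAdd(addr, 0u);
}

// Option 2: volatile read. Not an atomic load in the language sense, but a
// properly-aligned 32-bit access cannot tear in practice.
__device__ unsigned int loadViaVolatile(const unsigned int *addr)
{
    return *reinterpret_cast<const volatile unsigned int *>(addr);
}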
Related
It is said that zero copy should be used in situations where the “read and/or write exactly once” constraint is met. That's fine.
I have understood this, but my question is why is zero copy fast in the first place? After all, whether we use an explicit transfer via cudaMemcpy or zero copy, in both cases the data has to travel through the PCI Express bus. Or is there some other path (i.e., does the copy happen directly into GPU registers, bypassing device RAM)?
Considered purely from a data-transfer-rate perspective, I know of no reason why the data transfer rate for moving data between host and device via PCIE should be any different when comparing moving that data using a zero-copy method vs. moving it using cudaMemcpy.
However, both operations have overheads associated with them. The primary overhead I can think of for zero-copy comes with pinning of the host memory. This has a noticeable time overhead (compared, e.g., to allocating the same amount of data using malloc or new). The primary overhead that comes to mind with cudaMemcpy is a per-transfer overhead of at least a few microseconds that is associated with the setup costs of using the underlying DMA engine that does the transfer.
Another difference is in accessibility to the data. Pinned/zero-copy data is simultaneously accessible to both host and device, and this can be useful for some kinds of communication patterns that would otherwise be more complicated with cudaMemcpyAsync, for example.
Here are two fairly simple design patterns where it may make sense to use zero-copy rather than cudaMemcpy.
When you have a large amount of data and you're not sure what will be needed. Suppose we have a large table of data, say 1GB, and the GPU kernel will need access to it. Suppose also that the kernel design is such that only one or a few locations in the table are needed for each kernel call, and we don't know a priori which locations those will be. We could use cudaMemcpy to transfer the entire 1GB to the GPU. This would certainly work, but it would take a possibly non-trivial amount of time (e.g. ~0.1s). Suppose also that we don't know which location was updated, and after the kernel call we need access to the modified data on the host. Another transfer would be needed. Using pinned/zero-copy methods here would mostly eliminate the costs associated with moving the data, and since our kernel is only accessing a few locations, the cost for the kernel to do so using zero-copy is far less than 0.1s.
When you need to check the status of a search or convergence algorithm. Suppose that we have an algorithm that consists of a loop that is calling a kernel in each loop iteration. The kernel is doing some kind of search or convergence type algorithm, and so we need a "stopping condition" test. This might be as simple as a boolean value that we communicate back to the host from the kernel activity to indicate whether we have reached the stopping point or not. If the stopping point is reached, the loop terminates. Otherwise the loop continues with the next kernel launch. There may even be "two-way" communication here. For example, the host code might be setting the boolean value to false. The kernel might set it to true if iteration needs to continue, but the kernel never sets the flag to false. Therefore, if continuation is needed, the host code sets the flag to false and calls the kernel again. We could realize this with cudaMemcpy:
bool *d_continue;
cudaMalloc(&d_continue, sizeof(bool));
bool h_continue = true;
while (h_continue){
    h_continue = false;
    cudaMemcpy(d_continue, &h_continue, sizeof(bool), cudaMemcpyHostToDevice);
    my_search_kernel<<<...>>>(..., d_continue);
    cudaMemcpy(&h_continue, d_continue, sizeof(bool), cudaMemcpyDeviceToHost);
}
The above pattern should be workable, but even though we are only transferring a small amount of data (1 byte), the cudaMemcpy operations will each take ~5 microseconds. If this were a performance concern, we could almost certainly reduce the time cost with:
bool *z_continue;
cudaHostAlloc(&z_continue, sizeof(bool), ...);
*z_continue = true;
while (*z_continue){
    *z_continue = false;
    my_search_kernel<<<...>>>(..., z_continue);
    cudaDeviceSynchronize();
}
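For completeness, here is a sketch of how z_continue might be set up; I'm assuming the flags elided above are cudaHostAllocMapped, and that the device-side alias is obtained with cudaHostGetDevicePointer (on systems with unified addressing the host pointer can often be passed to the kernel directly):
bool *z_continue = nullptr;   // host-side pointer to the mapped flag
bool *d_z_continue = nullptr; // device-side alias of the same allocation
// Mapped pinned allocation: a single byte visible to both host and device.
cudaHostAlloc(&z_continue, sizeof(bool), cudaHostAllocMapped);
// Fetch the device-side pointer explicitly (not needed with unified addressing).
cudaHostGetDevicePointer(&d_z_continue, z_continue, 0);
*z_continue = true;
while (*z_continue){
    *z_continue = false;
    my_search_kernel<<<...>>>(..., d_z_continue);
    cudaDeviceSynchronize(); // ensure the kernel's write has landed before the host reads it
}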
For example, assume that you wrote a CUDA-accelerated editor algorithm to fix spelling errors in books. If 2MB of text data has only 5 bytes of errors, it would need to edit only those 5 bytes, so it doesn't need to copy the whole array from GPU VRAM to system RAM. Here, the zero-copy version would access only the page that owns the 5-byte word. Without zero-copy, it would need to copy the whole 2MB of text. Copying 2MB would take more time than copying 5 bytes (or just the page that owns those bytes), so it would reduce books/second throughput.
Another example: there could be a sparse path-tracing algorithm to add shiny surfaces for a few small objects of a game scene. The result may need to update just 10-100 pixels instead of 1920x1080 pixels. Zero copy would work better again.
Maybe sparse matrix multiplication would work better with zero-copy. If 8192x8192 matrices are multiplied but only 3-5 elements are non-zero, then zero-copy could still make a difference when writing results.
A lot of sources provide an implementation of a spin lock in CUDA:
https://devtalk.nvidia.com/default/topic/1014009/try-to-use-lock-and-unlock-in-cuda/
Cuda Mutex, why deadlock?
How to implement Critical Section in cuda?
Implementing a critical section in CUDA
https://wlandau.github.io/gpu/lectures/cudac-atomics/cudac-atomics.pdf.
They follow the same pattern:
LOCK: wait for the atomic change of the value of lock from 0 to 1
do some critical operations
UNLOCK: release the lock by setting its value to 0
Let's assume that we don't have warp-divergence or, in other words, we don't use locks for interwarp synchronization.
What is the right way to implement step 1?
Some answers propose to use atomicCAS while others atomicExch. Are both equivalent?
while (0 != (atomicCAS(&lock, 0, 1))) {}
while (atomicExch(&lock, 1) != 0) {}
What is the right way to implement step 3?
Almost all sources propose to use atomicExch for that:
atomicExch(&lock, 0);
One user proposed an alternative (Implementing a critical section in CUDA) that also makes sense, but it doesn't work for him (so it probably leads to undefined behavior in CUDA):
lock = 0;
It seems that for a general spin lock on a CPU it is valid to do that: https://stackoverflow.com/a/7007893/8044236. Why can't we use it in CUDA?
Do we have to use a memory fence and the volatile specifier for memory accesses in step 2?
CUDA docs about atomics (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions) say that they don't guarantee ordering constraints:
Atomic functions do not act as memory fences and do not imply synchronization or ordering constraints for memory operations
Does it mean that we have to use a memory fence at the end of the critical section (2) to ensure that the changes made inside the critical section (2) are made visible to other threads before unlocking (3)?
Does CUDA guarantee that other threads will ever see the changes made by a thread with atomic operations in steps (1) and (3)?
This is not true for memory fences (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions):
Memory fence functions only affect the ordering of memory operations by a thread; they do not ensure that these memory operations are visible to other threads (like __syncthreads() does for threads within a block (see Synchronization Functions)).
So probably it is also not true for atomic operations? If so, all spinlock implementations in CUDA rely on UB.
How can we implement a reliable spinlock in the presence of warps?
Now, provided that we have answers to all the questions above, let's remove the assumption that we don't have warp divergence. Is it possible to implement a spinlock in such a case?
The main issue (deadlock) is illustrated on slide 30 of https://wlandau.github.io/gpu/lectures/cudac-atomics/cudac-atomics.pdf:
Is the only option to replace the while loop with an if in step (1) and enclose all 3 steps in a single while loop, as proposed, for example, in Thread/warp local lock in cuda or CUDA, mutex and atomicCAS()?
What is the right way to implement step 1? Some answers propose to use
atomicCAS while others atomicExch. Are both equivalent?
No, they are not, and only the atomicCAS version is correct. The point of that code is to check that the change of state of the lock from unlocked to locked by a given thread worked. The atomicExch version doesn't do that because it doesn't check that the initial state is unlocked before performing the assignment.
It seems that for a general spin lock on a CPU it is valid to do that: https://stackoverflow.com/a/7007893/8044236. Why can't we use it in CUDA?
If you read the comments on that answer you will see that it isn't valid on a CPU either.
Do we have to use a memory fence and the volatile specifier for memory accesses in step 2?
That depends completely on your end use and why you are using a critical section in the first place. Obviously, if you want a given thread's manipulation of globally visible memory to be globally visible, you need either a fence or an atomic transaction to do so, and you need to ensure that values are not cached in registers by the compiler.
Does CUDA guarantee that other threads will ever see the changes made by a thread with atomic operations in steps (1) and (3)?
Yes, but only for other operations performed atomically. Atomic operations imply serialization of all memory transactions performed on a given address, and they return the previous state of an address when a thread performs an action.
How can we implement a reliable spinlock in the presence of warps?
Warp divergence is irrelevant. The serialization of atomic memory operations implies warp divergence when multiple threads from the same warp try to obtain the lock.
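Putting the points above together, a sketch of a lock following the accepted pattern might look like the following. This is an illustration, not a guaranteed-portable mutex: lock is assumed to be a device-global int, the critical work is just an example, and the if-inside-a-while structure is the one referenced in the question so that the thread that wins the CAS can make progress within its warp:
__device__ int lock = 0;   // 0 = unlocked, 1 = locked

__device__ void locked_increment(int *data)
{
    bool done = false;
    while (!done) {
        // Step 1: try to acquire. atomicCAS returns the old value, so 0 means
        // this thread flipped the lock from 0 to 1.
        if (atomicCAS(&lock, 0, 1) == 0) {
            __threadfence();       // order the critical section after the acquire
            *data += 1;            // step 2: the critical work (an example)
            __threadfence();       // make the work visible before releasing
            atomicExch(&lock, 0);  // step 3: release
            done = true;           // leave the retry loop
        }
    }
}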
I'd like to optimize the random access read, and random access write in the following code:
__global__ void kernel(float* input, float* output, float* table, size_t size)
{
    int x_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (x_id >= size)
        return;

    float in_f = input[x_id];
    int in_i = (int)(floorf(in_f));
    int table_index = (int)((in_f - float(in_i)) * 1024000.0f);
    float* t = table + table_index;
    output[table_index] = t[0] * in_f;
}
As you can see, the index to the table and to the output are determined at run-time, and completely random.
I understand that I can use texture memory or __ldg() for reading such data.
So, my questions are:
Is there better way to read a randomly indexed data than using the texture memory or __ldg()?
What about the random access write as the case of output[table_index] above?
Actually, I'm adding the code here to give an example of random access read and write. I do not need code optimization, I just need a high level description of the best way to deal with such situation.
There are no magic bullets for random access of data on the GPU.
The best advice is to attempt to perform data reorganization or some other method to regularize the data access. For repeated/heavy access patterns, even such intensive methods as sorting operations on the data may result in an overall net improvement in performance.
Since your question implies that the random access is unavoidable, the main thing you can do is intelligently make use of caches.
The L2 is a device-wide cache, and all DRAM accesses go through it. So thrashing of the L2 may be unavoidable if you have large-scale random accesses. There aren't any functions to disable (selectively or otherwise) the L2 for either read or write accesses (*).
For smaller scale cases, the main thing you can do is route the accesses through one of the "non-L1" caches, i.e. the texture cache (on all GPUs) and the Read-Only cache (i.e. __ldg()) on cc3.5 and higher GPUs. The use of these caches may help in 2 ways:
For some access patterns that would thrash the linear-organized L1, you may get some cache hits in the texture or read-only cache, due to a different caching strategy employed by those caches.
On devices that also have an L1 cache in use, routing the "random" traffic through an alternate cache will keep the L1 "unpolluted" and therefore less likely to thrash. In other words, the L1 may still provide caching benefit for other accesses, since it is not being thrashed by the random accesses.
Note that the compiler may route traffic through the read-only cache for you, without explicit use of __ldg(), if you decorate appropriate pointers with const __restrict__ as described here.
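As an illustration of both routes (the kernel names and shapes are mine, not from the question), a sketch assuming a cc3.5+ device:
// Compiler-directed: const __restrict__ is enough for nvcc to route the reads
// of table and indices through the read-only cache on its own.
__global__ void gather_restrict(const float * __restrict__ table,
                                const int   * __restrict__ indices,
                                float       * __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[indices[i]];
}

// Explicit: __ldg() forces each read through the read-only cache.
__global__ void gather_ldg(const float *table, const int *indices,
                           float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __ldg(table + __ldg(indices + i));
}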
You can also use cache control hints on loads and stores.
Similar to the above advice for protecting the L1, it may make sense on some devices, in some cases, to perform loads and stores in an "uncached" fashion. You can generally get the compiler to handle this for you through the use of the volatile keyword. You can keep both an ordinary and a volatile pointer to the same data, so that accesses that you can regularize can use the "ordinary" pointer, and "random" accesses can use the volatile version. Other mechanisms to pursue uncached access would be to use the ptxas compiler switches (e.g. -Xptxas -dlcm=cg) or else manage the load/store operations via appropriate use of inline PTX with appropriate caching modifiers.
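A small sketch of that dual-pointer idea (purely illustrative; it assumes rand_idx holds in-range, distinct indices):
__global__ void mixed_access(float *data, const int *rand_idx, int n)
{
    // Ordinary pointer for the regular, coalesced traffic; a volatile alias of
    // the same buffer for the scattered accesses, so they bypass the L1.
    volatile float *vdata = data;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];        // regular access through the normal cache path
        vdata[rand_idx[i]] = v;   // "random" write through the volatile alias
    }
}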
The "uncached" advice is the main advice I can offer for "random" writes. Use of the surface mechanism might provide some benefit for some access patterns, but I think it is unlikely to make any improvement for random patterns.
(*) This has changed in recent versions of CUDA and for recent GPU families such as Ampere (cc 8.x). There is a new capability to reserve a portion of the L2 for data persistence. Also see here.
For accessing structures, nvcc generates code that reads/writes the structure field by field. Given this structure:
typedef struct cache_s {
    int tag;
    TYPE data;
} cache_t;
The following PTX code is generated to write a variable of this type to shared memory:
st.shared.f64 [%rd1+8], %fd53;
st.shared.u32 [%rd1], %r33;
This can raise a logical error in the execution of programs. If two concurrent threads of a thread block write back different values at the same shared memory address, fields from different structures may get mixed up. The CUDA Programming Guide states:
If a non-atomic instruction executed by a warp writes to the same location in global or shared memory for more than one of the threads of the warp, the number of serialized writes that occur to that location varies depending on the compute capability of the device (see Compute Capability 2.x, Compute Capability 3.x, and Compute Capability 5.x), and which thread performs the final write is undefined.
From this, I expect one of the threads to write its complete structure (all the fields together), and I don't expect a mix of fields from different writes to form an undefined value. Is there a way to force nvcc to generate the code that I expect?
More Information:
NVCC Version: 7.5
This can raise a logical error in the execution of programs. If two concurrent threads of a thread block write back different values at the same shared memory address, fields from different structures may get mixed up.
If you need a complete result from one thread in the block while discarding the results from the other threads, just have one of the threads (thread 0 is often used for this) write out its result and have the remaining threads skip the write:
__global__ void mykernel(...)
{
    ...
    if (!threadIdx.x) {
        // store the struct
    }
}
Is there a way to force nvcc to generate the code that I expect?
You want to see NVCC generate a single instruction that does an atomic write of a complete struct of arbitrary size. There is no such instruction, so, no, you can't get NVCC to generate the code.
I assume using an atomic lock on shared memory is a workaround, though a terrible solution. Is there a better solution?
We can't tell you what would be a better solution because you haven't told us what problem you're trying to solve. In CUDA, atomic operations are typically used only for locking a single 32- or 64-bit word during a read-modify-write operation, so they wouldn't be a good fit for protecting a complete structure.
There are parallel operations, sometimes called parallel primitives, such as "reduce" and "scan", that allow many types of problems to be solved without locking. For instance, you might first start a kernel in which each thread writes its results to a separate location, then start a new kernel that performs a parallel reduce to pick the result you need.
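A sketch of that two-pass idea, with a made-up "score" standing in for whatever criterion selects the result you need:
// Pass 1: every thread writes its own result to a private slot, no locking needed.
__global__ void score_kernel(const float *input, float *scores, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        scores[i] = fabsf(input[i]);   // stand-in for a per-thread result
}

// Pass 2: a block-level reduction picks the index of the best score.
// Assumes blockDim.x is a power of two; launch with blockDim.x * sizeof(int)
// bytes of dynamic shared memory. A final small pass combines the block winners.
__global__ void argmax_kernel(const float *scores, int *block_best, int n)
{
    extern __shared__ int s_idx[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    s_idx[tid] = (i < n) ? i : 0;      // out-of-range threads fall back to index 0
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride && scores[s_idx[tid + stride]] > scores[s_idx[tid]])
            s_idx[tid] = s_idx[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_best[blockIdx.x] = s_idx[0];
}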
I have a process in which I send data to CUDA for processing, and it outputs data that matches certain criteria. The problem is that I often don't know the size of the output array. What can I do?
I send in several hundred lines of data and have it processed in over 20K different ways on CUDA. If the results match some rules I have, then I want to save them. The problem is that I cannot create a linked list in CUDA (let me know if I can), and memory on my card is small, so I was thinking of using zero copy to have CUDA write directly to the host's memory. This solves my memory size issue but still doesn't give me a way to deal with the unknown size.
My initial idea was to figure out the maximum possible number of results and malloc an array of that size. The problem is that it would be huge and most of it would not be used (800 lines of data * 20K possible outcomes = 16 million items in an array, which is not likely).
Is there a better way to deal with variable-size arrays in CUDA? I'm new to programming, so ideally it would be something not too complex (although if it is, I'm willing to learn).
Heap memory allocation using malloc in kernel code is an expensive operation (it forces the CUDA driver to initialize the kernel with a custom heap size and to manage memory operations inside the kernel).
Generally, CUDA device memory allocation is the main bottleneck of program performance. The common practice is to allocate all needed memory at the beginning and reuse it as long as possible.
I think that you can create a buffer that is big enough and use it instead of memory allocations. In the worst case you can wrap it to implement memory allocation from this buffer; in the simplest case you can keep track of the last free cell in your array and write data into it next time (see the sketch below).
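One common way to realize that idea is a preallocated output buffer plus a single counter, with each matching thread appending its result via atomicAdd. This is a sketch with made-up names and a made-up matching rule, not code from the question:
__global__ void filter_kernel(const float *input, int n,
                              float *out, unsigned int *out_count,
                              unsigned int out_capacity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float v = input[i];
    if (v > 0.5f) {                                    // stand-in for "matches my rules"
        unsigned int slot = atomicAdd(out_count, 1u);  // claim the next free cell
        if (slot < out_capacity)                       // don't run past the buffer
            out[slot] = v;
    }
}
On the host you would zero out_count before the launch, size out to a conservative capacity (it could itself be mapped/pinned zero-copy memory, as the question suggests), and copy out_count back afterwards to learn how many results were actually produced.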
Yes, the bottleneck in CUDA and all GPGPU work is the transfer from host to device and back.
But in kernels, always work with known sizes.
A kernel must not call malloc; that is very much at odds with the concept of the platform.
Even if you have a for loop in a CUDA kernel, think twenty times about whether your approach is optimal: you must be implementing a really complex algorithm. Is it really necessary on a parallel platform?
You would not believe what problems can arise if you don't.
Use a buffered approach. You determine some buffer size that depends more on CUDA (read: hardware) requirements than on your array, then call the kernel in a loop to upload, process, and retrieve the data (see the sketch below).
Eventually your array of data will be exhausted and the last buffer will not be full.
You can pass the size of each buffer as a single value (a pointer to an int, for example), which each thread compares against its thread id to determine whether it can fetch a value or would be out of bounds.
Only the last block will have divergence.
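A rough sketch of that buffered loop (all names and sizes are illustrative, and error checking is omitted):
// Kernel: each thread checks the valid-item count for this buffer.
__global__ void process_kernel(float *buf, const int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < *count)          // only the last, partially filled buffer diverges here
        buf[i] *= 2.0f;      // stand-in for the real processing
}

// Host loop: upload a chunk, process it, retrieve it, repeat.
void process_in_chunks(float *h_data, size_t total_items)
{
    const size_t BUF_ITEMS = 1 << 20;   // buffer size chosen for the hardware, not the array
    float *d_buf;
    int *d_count;
    cudaMalloc(&d_buf, BUF_ITEMS * sizeof(float));
    cudaMalloc(&d_count, sizeof(int));

    for (size_t offset = 0; offset < total_items; offset += BUF_ITEMS) {
        size_t remaining = total_items - offset;
        int count = (int)(remaining < BUF_ITEMS ? remaining : BUF_ITEMS);

        cudaMemcpy(d_count, &count, sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_buf, h_data + offset, count * sizeof(float), cudaMemcpyHostToDevice);

        process_kernel<<<(count + 255) / 256, 256>>>(d_buf, d_count);

        cudaMemcpy(h_data + offset, d_buf, count * sizeof(float), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_buf);
    cudaFree(d_count);
}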
Here is a useful link: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
You can do in your kernel function something like this, using shared memory:
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];
    .....
}
and when you call the kernel function on the host, pass the shared memory size, precisely n*sizeof(int), as the third launch parameter:
dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
Also, it is best practice to split a huge kernel function, if possible, into several smaller kernel functions that contain less code and are easier to execute.