Writing a struct in shared memory atomically - cuda

For accessing structures, nvcc generates code that reads/writes the structure field by field. Given this structure:
typedef struct cache_s {
    int tag;
    TYPE data;
} cache_t;
The following PTX code is generated to write a variable of this type to shared memory:
st.shared.f64 [%rd1+8], %fd53;
st.shared.u32 [%rd1], %r33;
This can raise a logical error in the execution of programs. If two concurrent threads of a thread block write back different values at the same shared memory address, fields from different structures may get mixed up. The CUDA Programming Guide states:
If a non-atomic instruction executed by a warp writes to the same
location in global or shared memory for more than one of the threads
of the warp, the number of serialized writes that occur to that
location varies depending on the compute capability of the device (see
Compute Capability 2.x, Compute Capability 3.x, and Compute Capability
5.x), and which thread performs the final write is undefined.
From this, I expect one of the threads to write its complete structure (all fields together), not a mix of fields from different writes forming an undefined value. Is there a way to force nvcc to generate the code that I expect?
More Information:
NVCC Version: 7.5

This can raise a logical error in the execution of programs. If two concurrent threads of a thread block write back different values at the same shared memory address, fields from different structures may get mixed up.
If you need a complete result from one thread in the block while discarding the results from the other threads, just have one of the threads (thread 0 is often used for this) write out its result and have the remaining threads skip the write:
__global__ void mykernel(...)
{
    ...
    if (!threadIdx.x) {
        // store the struct
    }
}
Is there a way to force nvcc to generate the code that I expect?
You want to see NVCC generate a single instruction that does an atomic write of a complete struct of arbitrary size. There is no such instruction, so, no, you can't get NVCC to generate the code.
I assume using an atomic lock on shared memory is a workaround, a terrible solution though. Is there a better solution?
We can't tell you what would be a better solution because you haven't told us what the problem is that you're trying to solve. In CUDA, atomic operations are typically used only for locking a single 32- or 64-bit word during a read-modify-write operation, so they wouldn't be a good fit for protecting a complete structure.
There are parallel operations, sometimes called parallel primitives, such as "reduce" and "scan", that allow many types of problems to be solved without locking. For instance, you might first start a kernel in which each thread writes its results to a separate location, then start a new kernel that performs a parallel reduce to pick the result you need.
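For instance, that two-kernel pattern could look something like the sketch below. It reuses the cache_t and TYPE definitions from the question; the kernel names, the placeholder per-thread values, and the selection rule (largest tag wins) are all illustrative assumptions, and a production version would use a proper parallel reduction (e.g. CUB) instead of the single-thread pick shown here.

// Each thread writes its whole struct to its own slot -- no contention, no field mixing.
__global__ void produce_results(cache_t *results)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache_t c;
    c.tag  = i;          // placeholder for the real per-thread result
    c.data = (TYPE)i;    // placeholder; TYPE comes from the question's typedef
    results[i] = c;      // one slot per thread, so no concurrent writes to this address
}

// A second kernel picks one complete struct according to some rule.
__global__ void pick_result(const cache_t *results, int n, cache_t *out)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        cache_t best = results[0];
        for (int i = 1; i < n; ++i)
            if (results[i].tag > best.tag)   // assumed rule: largest tag wins
                best = results[i];
        *out = best;                         // the chosen struct is copied whole
    }
}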

Related

CUDA How Does Kernel Fusion Improve Performance on Memory Bound Applications on the GPU?

I've been conducting research on streaming datasets larger than the memory available on the GPU to the device for basic computations. One of the main limitations is that the PCIe bus is generally limited to around 8 GB/s, and kernel fusion can help by reusing data and by exploiting shared memory and locality within the GPU. Most research papers I have found are very difficult to understand, and most of them implement fusion in complex applications such as https://ieeexplore.ieee.org/document/6270615 . I've read many papers and they all fail to explain the simple steps needed to fuse two kernels together.
My question is: how does fusion actually work? What are the steps one would go through to change a normal kernel into a fused kernel? Also, is it necessary to have more than one kernel in order to fuse, or is fusing just a fancy term for eliminating some memory-bound issues and exploiting locality and shared memory?
I need to understand how kernel fusion is used for a basic CUDA program, like matrix multiplication, or addition and subtraction kernels. A really simple example (The code is not correct but should give an idea) like:
int *device_A;
int *device_B;
int *device_C;
cudaMalloc(&device_A, N * sizeof(int));
cudaMemcpyAsync(device_A, host_A, N * sizeof(int), cudaMemcpyHostToDevice, stream);
KernelAdd<<<blocks, threads, 0, stream>>>(device_A, device_B, device_C);            // put result in C
KernelSubtract<<<blocks, threads, 0, stream>>>(device_C);
cudaMemcpyAsync(host_C, device_C, N * sizeof(int), cudaMemcpyDeviceToHost, stream); // send final result over PCIe to the CPU
The basic idea behind kernel fusion is that 2 or more kernels will be converted into 1 kernel. The operations are combined. Initially it may not be obvious what the benefit is. But it can provide two related kinds of benefits:
by reusing the data that a kernel may have populated either in registers or shared memory
by reducing (i.e. eliminating) "redundant" loads and stores
Let's use an example like yours, where we have an Add kernel and a multiply kernel, and assume each kernel works on a vector, and each thread does the following:
Load my element of vector A from global memory
Add a constant to, or multiply by a constant, my vector element
Store my element back out to vector A (in global memory)
This operation requires one read per thread and one write per thread. If we did both of them back-to-back, the sequence of operations would look like:
Add kernel:
Load my element of vector A from global memory
Add a value to my vector element
Store my element back out to vector A (in global memory)
Multiply kernel:
Load my element of vector A from global memory
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
We can see that step 3 in the first kernel and step 1 in the second kernel are doing things that aren't really necessary to achieve the final result, but they are necessary due to the design of these (independent) kernels. There is no way for one kernel to pass results to another kernel except via global memory.
But if we combine the two kernels together, we could write a kernel like this:
Load my element of vector A from global memory
Add a value to my vector element
Multiply my vector element by a value
Store my element back out to vector A (in global memory)
This fused kernel does both operations, produces the same result, but instead of 2 global memory load operations and 2 global memory store operations, it only requires 1 of each.
This savings can be very significant for memory-bound operations (like these) on the GPU. By reducing the number of loads and stores required, the overall performance is improved, usually proportional to the reduction in number of load/store operations.
Here is a trivial code example.
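The original code example is not reproduced here; the following is a minimal sketch of what it might look like, with hypothetical kernel names, using the add-then-multiply pair described above.

// Unfused: two launches, two loads and two stores per element.
__global__ void vector_add(float *a, float v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] + v;                // load + store
}

__global__ void vector_multiply(float *a, float v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i] * v;                // load + store again
}

// Fused: one launch, one load and one store per element.
__global__ void vector_add_multiply(float *a, float add_v, float mul_v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = (a[i] + add_v) * mul_v;  // single load, single store
}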

Does __syncthreads() synchronize all threads in the grid?

...or just the threads in the current warp or block?
Also, when the threads in a particular block encounter (in the kernel) the following line
__shared__ float srdMem[128];
will they just declare this space once (per block)?
They all obviously operate asynchronously, so if thread 23 in block 22 is the first thread to reach this line, and thread 69 in block 22 is the last one to reach it, will thread 69 know that it has already been declared?
The __syncthreads() command is a block-level synchronization barrier. That means it is safe to use when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code, but only when all threads evaluate such code identically; otherwise the execution is likely to hang or produce unintended side effects.
Example of using __syncthreads():
__global__ void globFunction(int *arr, int N)
{
    __shared__ int local_array[THREADS_PER_BLOCK]; // local block memory cache
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int results = arr[idx];   // ...calculate results (placeholder computation)
    local_array[threadIdx.x] = results;
    // synchronize the local threads writing to the local memory cache
    __syncthreads();
    // read the results of another thread in the current thread
    int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];
    // write back the value to global memory
    arr[idx] = val;
}
To synchronize all threads in a grid there is currently no native API call. One way of synchronizing threads at grid level is to use consecutive kernel calls, since at that point all threads end and start again from the same point. This is also commonly called CPU synchronization or implicit synchronization.
Example of using this technique:
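(The originally linked example is not reproduced here; below is a minimal sketch of the idea, with hypothetical kernel names.)

__global__ void phase1_kernel(float *d_data) { /* every thread writes its partial result */ }
__global__ void phase2_kernel(float *d_data) { /* reads results written by any block in phase 1 */ }

void run_two_phases(float *d_data, dim3 grid, dim3 block)
{
    phase1_kernel<<<grid, block>>>(d_data);
    // The kernel boundary acts as the grid-wide synchronization point:
    // phase2_kernel will not start until every block of phase1_kernel has
    // finished, because both launches are issued to the same (default) stream.
    phase2_kernel<<<grid, block>>>(d_data);
    cudaDeviceSynchronize();   // optional: also make the host wait for phase 2
}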
Regarding the second question: yes, the declaration allocates the specified amount of shared memory once per block. Take into account that the quantity of available shared memory is measured per SM, so one should be very careful about how shared memory is used together with the launch configuration.
I agree with all the answers here, but I think we are missing one important point regarding the first question. I am not answering the second question as it was answered perfectly above.
Execution on a GPU happens in units of warps. A warp is a group of 32 threads, and at any one time each thread of a particular warp executes the same instruction. If you allocate 128 threads in a block, that is (128/32 =) 4 warps for the GPU.
Now the question becomes: "If all threads are executing the same instruction, then why is synchronization needed?" The answer is that we need to synchronize the warps that belong to the SAME block. __syncthreads does not synchronize the threads within a warp; they are already synchronized. It synchronizes warps that belong to the same block.
That is why the answer to your question is: __syncthreads does not synchronize all threads in a grid, only the threads belonging to one block, as each block executes independently.
If you want to synchronize a grid, then divide your kernel (K) into two kernels (K1 and K2) and call both. They will be synchronized (K2 will be executed after K1 finishes).
__syncthreads() waits until all threads within the same block have reached the command; that means all warps that belong to the thread block must reach the statement.
If you declare shared memory in a kernel, the array will only be visible to one thread block. So each block will have its own shared memory block.
Existing answers have done a great job answering how __syncthreads() works (it allows intra-block synchronization); I just wanted to add an update that there are now newer methods for inter-block synchronization. Since CUDA 9.0, "Cooperative Groups" have been introduced, which allow synchronizing an entire grid of blocks (as explained in the CUDA Programming Guide). This achieves the same functionality as launching a new kernel (as mentioned above), but can usually do so with lower overhead and make your code more readable.
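A minimal sketch of a grid-wide sync with cooperative groups is shown below (kernel and variable names are illustrative). Note that the kernel must be launched with cudaLaunchCooperativeKernel and the device must report support for cooperative launches; otherwise grid.sync() is not valid.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void two_phase_kernel(float *data)
{
    cg::grid_group grid = cg::this_grid();

    // Phase 1: every thread in the entire grid writes its result.
    // data[...] = ...;

    grid.sync();   // waits for ALL blocks in the grid, not just this block

    // Phase 2: now it is safe to read values written by any other block.
}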
In order to provide further details, aside from the other answers, quoting seibert:
More generally, __syncthreads() is a barrier primitive designed to protect you from read-after-write memory race conditions within a block.
The rules of use are pretty simple:
Put a __syncthreads() after the write and before the read when there is a possibility of a thread reading a memory location that another thread has written to.
__syncthreads() is only a barrier within a block, so it cannot protect you from read-after-write race conditions in global memory unless the only possible conflict is between threads in the same block. __syncthreads() is pretty much always used to protect shared memory read-after-write.
Do not use a __syncthreads() call in a branch or a loop until you are sure every single thread will reach the same __syncthreads() call. This can sometimes require that you break your if-blocks into several pieces to put __syncthreads() calls at the top level where all threads (including those which failed the if predicate) will execute them.
When looking for read-after-write situations in loops, it helps to unroll the loop in your head when figuring out where to put __syncthreads() calls. For example, you often need an extra __syncthreads() call at the end of the loop if there are reads and writes from different threads to the same shared memory location in the loop (see the sketch after this list of rules).
__syncthreads() does not mark a critical section, so don’t use it like that.
Do not put a __syncthreads() at the end of a kernel call. There’s no need for it.
Many kernels do not need __syncthreads() at all because two different threads never access the same memory location.
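To illustrate the loop rule above, here is a minimal sketch (not from the quoted post) of a tiled loop that needs a barrier both after the cooperative write and at the end of each iteration. It assumes blockDim.x == TILE and a per-block element count that is a multiple of TILE; the kernel name and the trivial sum are just placeholders.

#define TILE 128  // assumed block size for this sketch

__global__ void block_sums(const float *in, float *out, int elems_per_block)
{
    __shared__ float tile[TILE];
    const float *my_in = in + blockIdx.x * elems_per_block;
    float acc = 0.0f;

    for (int base = 0; base < elems_per_block; base += TILE) {
        tile[threadIdx.x] = my_in[base + threadIdx.x];   // cooperative write
        __syncthreads();    // after the write, before any thread reads the tile

        for (int j = 0; j < TILE; ++j)
            acc += tile[j];                              // read other threads' writes

        __syncthreads();    // end-of-loop barrier: without it a fast thread could
                            // overwrite tile[] on the next iteration while a slower
                            // thread is still reading the current contents
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = acc;   // every thread holds the same sum; thread 0 stores it
}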

Writing large unknown size array in Cuda?

I have a process in which I send data to CUDA for processing, and it outputs data that matches certain criteria. The problem is I often don't know the size of the outputted array. What can I do?
I send in several hundred lines of data and have it processed in over 20K different ways on CUDA. If the results match some rules I have, then I want to save them. The problem is I cannot create a linked list in CUDA (let me know if I can) and memory on my card is small, so I was thinking of using zero copy to have CUDA write directly to the host's memory. This solves my memory size issue but still doesn't give me a way to deal with the unknown size.
My initial idea was to figure out the maximum possible number of results and malloc an array of that size. The problem is it would be huge and most of it would not be used (800 lines of data * 20K possible outcomes = 16 million items in an array, which is not likely).
Is there a better way to deal with variable-size arrays in CUDA? I'm new to programming, so ideally it would be something not too complex (although if it is, I'm willing to learn).
Heap memory allocation using malloc in kernel code is an expensive operation (it forces the CUDA driver to initialize the kernel with a custom heap size and to manage memory operations inside the kernel).
Generally, CUDA device memory allocation is the main bottleneck of program performance. The common practice is to allocate all needed memory at the beginning and reuse it as long as possible.
I think that you can create a buffer that is big enough and use it instead of memory allocations. In the worst case you can wrap it to implement memory allocation from this buffer. In the simplest case you can keep the index of the last free cell in your array, so you know where to write the next piece of data.
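A minimal sketch of that "last free cell" idea, using a global atomic counter and a preallocated output buffer (the kernel name, the even-number filter, and the buffer names are all illustrative stand-ins):

__device__ int d_out_count;   // index of the next free cell in the output buffer

__global__ void filter_kernel(const int *in, int n, int *out, int out_capacity)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int value = in[i];              // placeholder for the real computation
    if (value % 2 == 0) {           // stand-in for "matches my rules"
        int slot = atomicAdd(&d_out_count, 1);  // claim the next free cell
        if (slot < out_capacity)    // never write past the preallocated buffer
            out[slot] = value;
    }
}
// Host side: zero d_out_count with cudaMemcpyToSymbol before the launch,
// then copy it back afterwards to learn how many results were actually written.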
Yes, the bottleneck in CUDA and GPGPU computing in general is the transfer from host to device and back.
But in kernels, always work with known sizes.
A kernel should not do malloc; it is very much at odds with the concept of the platform.
Even if you have a for-loop in a CUDA kernel, think twenty times about whether your approach is optimal: you must be implementing a really complex algorithm. Is it really necessary on a parallel platform?
You would not believe what problems can come up if you don't.
Use a buffered approach. You determine some buffer size, which depends more on CUDA requirements (read: hardware) than on your array. You call the kernel in a loop and upload, process, and retrieve data each time.
At some point your array of data will be exhausted and the last buffer will not be full.
You can pass the size of each buffer as a single value (a pointer to an int, for example), which each thread compares to its thread ID to determine whether it can safely fetch a value or would go out of bounds.
Only the last block will have divergence.
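Putting that together, a rough sketch of the buffered loop might look like this (buffer names, the chunk size, and the doubling kernel are all illustrative; error checking is omitted):

__global__ void process_chunk(int *buf, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)              // only the final chunk's last block diverges here
        buf[i] = buf[i] * 2;    // placeholder computation
}

void stream_in_chunks(const int *h_in, int *h_out, size_t total, cudaStream_t stream)
{
    const size_t CHUNK = 1 << 20;        // buffer size chosen for the device, not the data
    int *d_buf;
    cudaMalloc(&d_buf, CHUNK * sizeof(int));

    for (size_t offset = 0; offset < total; offset += CHUNK) {
        size_t remaining = total - offset;
        int count = (int)(remaining < CHUNK ? remaining : CHUNK);   // last chunk may be short

        cudaMemcpyAsync(d_buf, h_in + offset, count * sizeof(int),
                        cudaMemcpyHostToDevice, stream);
        process_chunk<<<(count + 255) / 256, 256, 0, stream>>>(d_buf, count);
        cudaMemcpyAsync(h_out + offset, d_buf, count * sizeof(int),
                        cudaMemcpyDeviceToHost, stream);
    }
    cudaStreamSynchronize(stream);
    cudaFree(d_buf);
}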
Here is a useful link: https://devblogs.nvidia.com/parallelforall/using-shared-memory-cuda-cc/
You can do something like this in your kernel function, using shared memory:
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];
    .....
}
and when you call the kernel function on the host, pass the shared memory size as the third launch configuration parameter, precisely n*sizeof(int):
dynamicReverse<<<1,n,n*sizeof(int)>>>(d_d, n);
Also, it's best practice to split a huge kernel function, if possible, into several kernel functions that have less code and are easier to execute.

producer/consumer model and concurrent kernels

I'm writing a CUDA program that can be interpreted as a producer/consumer model.
There are two kernels:
one produces data in device memory,
and the other kernel consumes the produced data.
The number of consuming threads is set to a multiple of 32, the warp size,
and each warp waits until 32 data items have been produced.
I've got a problem here.
If the consumer kernel is loaded later than the producer,
the program does not terminate.
Sometimes the program runs indefinitely even though the consumer is loaded first.
What I'm asking is: is there a good implementation pattern for producer/consumer in CUDA?
Can anybody give me a direction or reference?
Here is the skeleton of my code:
kernel1:
    while LOOP_COUNT
        compute something
        if SOME CONDITION
            atomically increment PRODUCE_COUNT
            write data into DATA
    atomically increment PRODUCER_DONE

kernel2:
    while FOREVER
        CURRENT = 0
        if FINISHED CONDITION
            return
        if PRODUCER_DONE == TOTAL_PRODUCER && CONSUME_COUNT == PRODUCE_COUNT
            return
        if (MY_WARP+1)*32 + (CONSUME_WARPS*32*CURRENT) - 1 < PRODUCE_COUNT
            process the data
            if SOME CONDITION
                set FINISHED CONDITION true
            increment CURRENT
        else if PRODUCER_DONE == TOTAL_PRODUCER
            if CURRENT*32*CONSUME_WARPS + THREAD_INDEX < PRODUCE_COUNT
                process the data
                if SOME CONDITION
                    set FINISHED CONDITION true
            increment CURRENT
Since you did not provide actual code, it is hard to tell where the bug is. Usually the skeleton is correct, but the problem lies in the details.
One of possible issues that I can think of:
By default, in CUDA there is no guarantee that global memory writes by one kernel will be visible by another kernel, with an exception of atomic operations. It can happen then that your first kernel increments PRODUCER_DONE, but there is still no data in DATA.
Fortunately, you are given the intrinsic function __threadfence(), which halts the execution of the current thread until its prior writes are visible. You should put it before atomically incrementing PRODUCER_DONE. Check out chapter B.5 in the CUDA Programming Guide.
Another issue that may or may not appear:
From the point of view of kernel2, the compiler may deduce that PRODUCE_COUNT, once read, never changes. The compiler may optimise the code so that, once the value is loaded into a register, it is reused instead of querying global memory every time. Solution? Mark the variable volatile, or read the value using another atomic operation.
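A minimal sketch combining both points, with made-up names (data, data_ready); it assumes the producer and consumer kernels really do run concurrently (separate streams on a device that supports concurrent kernels), which the third issue below touches on:

__device__ int data;          // payload written by the producer
__device__ int data_ready;    // flag: 0 = not produced yet, 1 = produced

__global__ void producer()
{
    data = 42;                     // write the payload first
    __threadfence();               // make the payload visible device-wide...
    atomicExch(&data_ready, 1);    // ...before publishing the flag
}

__global__ void consumer(int *out)
{
    // volatile forces a real memory read on every iteration instead of
    // letting the compiler reuse a stale register value
    volatile int *flag = &data_ready;
    while (*flag == 0) { /* spin until the producer publishes */ }
    __threadfence();               // order the flag read before the payload read
    *out = data;
}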
(Edit)
Third issue:
I forgot about one more problem. On pre-Fermi cards (GeForce before the 400 series) you can run only a single kernel at a time. So, if you schedule the producer to run after the consumer, the system will wait for the consumer kernel to end before the producer kernel starts its execution. If you want both to run at the same time, put both into a single kernel and have an if-branch based on some block index.
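That single-kernel variant could be structured roughly like this (the split point and parameter names are made up):

__global__ void producer_consumer(int num_producer_blocks,
                                  int *data, int *produce_count)
{
    if (blockIdx.x < num_producer_blocks) {
        // Producer blocks: compute results, write them into data[],
        // __threadfence(), then atomically bump *produce_count.
    } else {
        // Consumer blocks: poll *produce_count (volatile or atomic read)
        // and process the entries that have already been published.
    }
}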

How to have atomic load in CUDA

My question is how I can have an atomic load in CUDA. Atomic exchange can emulate an atomic store. Can an atomic load be emulated inexpensively in a similar manner?
I can use an atomic add with 0 to load the content atomically, but I think it is expensive because it does an atomic read-modify-write instead of only a read.
In addition to using volatile as recommended in the other answer, using __threadfence appropriately is also required to get an atomic load with safe memory ordering.
While some of the comments are saying to just use a normal read because it cannot tear, that is not the same as an atomic load. There's more to atomics than just tearing:
A normal read may reuse a previous load that's already in a register, and thus may not reflect changes made by other SMs with the desired memory ordering. For instance, int *flag = ...; while (*flag) { ... } may only read flag once and reuse this value for every iteration of the loop. If you're waiting for another thread to change the flag's value, you'll never observe the change. The volatile modifier ensures that the value is actually read from memory on every access. See the CUDA documentation on volatile for more info.
Additionally, you'll need to use a memory fence to enforce the correct memory ordering in the calling thread. Without a fence, you get "relaxed" semantics in C++11 parlance, and this can be unsafe when using an atomic for communication.
For example, say your code (non-atomically) writes some large data to memory and then uses a normal write to set an atomic flag to indicate that the data has been written. The instructions may be reordered, hardware cache lines may not be flushed prior to setting the flag, and so on. The result is that these operations are not guaranteed to be executed in any particular order, and other threads may not observe these events in the order you expect: the write to the flag is permitted to happen before the guarded data is written.
Meanwhile, if the reading thread is also using normal reads to check the flag before conditionally loading the data, there will be a race at the hardware level. Out-of-order and/or speculative execution may load the data before the flag's read is completed. The speculatively loaded data is then used, which may not be valid since it was loaded prior to the flag's read.
Well-placed memory fences prevent these sorts of issues by enforcing that instruction reordering will not affect your desired memory ordering and that previous writes are made visible to other threads. __threadfence() and friends are also covered in the CUDA docs.
Putting all of this together, writing your own atomic load method in CUDA looks something like:
// addr must be aligned properly.
__device__ unsigned int atomicLoad(const unsigned int *addr)
{
    const volatile unsigned int *vaddr = addr; // volatile to bypass cache
    __threadfence(); // for seq_cst loads. Remove for acquire semantics.
    const unsigned int value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

// addr must be aligned properly.
__device__ void atomicStore(unsigned int *addr, unsigned int value)
{
    volatile unsigned int *vaddr = addr; // volatile to bypass cache
    // fence to ensure that previous non-atomic stores are visible to other threads
    __threadfence();
    *vaddr = value;
}
This can be written similarly for other non-tearing load/store sizes.
From talking with some NVIDIA devs who work on CUDA atomics, it looks like we should start seeing better support for atomics in CUDA, and the PTX already contains load/store instructions with acquire/release memory ordering semantics -- but there is no way to access them currently without resorting to inline PTX. They're hoping to add them in sometime this year. Once those are in place, a full std::atomic implementation shouldn't be far behind.
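For completeness, such an inline-PTX acquire load might look roughly like the sketch below. This assumes a GPU and toolkit new enough for PTX ISA 6.0 scoped load instructions (sm_70 or newer), which the CUDA 7.5-era setups discussed elsewhere on this page do not have, so treat it as forward-looking rather than a drop-in.

// Sketch only: an acquire load via inline PTX (requires sm_70+ / PTX ISA 6.0+).
__device__ unsigned int atomicLoadAcquire(const unsigned int *addr)
{
    unsigned int value;
    asm volatile("ld.acquire.gpu.u32 %0, [%1];"
                 : "=r"(value)
                 : "l"(addr)
                 : "memory");
    return value;
}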
To the best of my knowledge, there is currently no way of requesting an atomic load in CUDA, and that would be a great feature to have.
There are two quasi-alternatives, with their advantages and drawbacks:
Use a no-op atomic read-modify-write as you suggest. I have provided a similar answer in the past. Guaranteed atomicity and memory consistency but you pay the cost of a needless write.
In practice, the second-closest thing to an atomic load is marking a variable volatile, although strictly speaking the semantics are completely different. The language does not guarantee atomicity of the load (for example, you may in theory get a torn read), but you are guaranteed to get the most up-to-date value. In practice, as indicated in the comments by @Robert Crovella, it is impossible to get a torn read for properly-aligned transactions of at most 32 bytes, which does make them atomic.
Solution 2 is kind of hacky and I do not recommend it, but it is currently the only write-less alternative to 1. The ideal solution would be to add a way to express atomic loads directly in the language.