Do atomic functions really make variables volatile in CUDA?

I've written a very simple kernel that asks thread 0 to update a global variable while the other threads keep reading that variable. But I found that the other threads never see the updated value.
The code is below; it is quite simple. Could anyone give me a suggestion on how to fix it?
Thanks a lot.
__global__ void addKernel(int *c)
{
    int i = threadIdx.x;
    int j = 0;
    if (i == 0)
    {
        while (*c < 2000) {
            int temp = *c;
            printf("*c = %d\n", *c);
            atomicCAS(c, temp, temp + 1);
        }
    } else {
        while (*c < 1000)
        {
            j++;
        }
    }
}

I'd like to make an analogy: imagine for a second that atomic operations are mutexes: for a program to be well-defined, two threads accessing a shared resource must both agree to use the mutex to access the resource exclusively. If one of the threads accesses the resource without first holding the mutex, the result is undefined.
The same thing is true for atomics: if you decide to treat a particular location in memory as an atomic variable, then all threads accessing that location should agree and treat it as such for your program to have meaning. You should only be manipulating it through atomic loads and stores, not a combination of non-atomic and atomic operations.
In other words, this:
atomicCAS(c, temp, temp + 1);
Contains an atomic load-compare-store. The resulting instruction will go all the way down to global memory to load c, do the comparison, and go all the way down to global memory to store the new value.
But this:
while(*c < 2000)
Is not atomic by any means. The compiler (and the hardware) has no idea that c may have been modified by another thread. So instead of going all the way down to global memory, it will simply read from the fastest available cache. Possibly the compiler will even put the variable in a register, because it doesn't see anyone else modifying it in the current thread.
What you would want is something like (imaginary):
while (atomicLoad(c) < 2000)
But to the best of my knowledge there is no such construct in CUDA at the time of writing.
In this regard, the volatile qualifier may help: it tells the compiler to not optimize the variable, and consider it as "modifiable from external sources". This will trigger a load for every read of the variable, although I am not sure this load bypasses all the caches. In practice, it may work, but in theory I don't think you should rely on it. Besides, this will also disable any optimizations on that variable (such as constant propagation or promoting the variable to a register for better performance).
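As a sketch of that approach (untested), the reader loop from the question could go through a volatile pointer so that every check performs a real load:
volatile int *vc = c;   // reads through vc are never cached in a register
while (*vc < 1000)      // the compiler must re-load the value on every iteration
{
    j++;
}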
You may want to try the following hack (I haven't tried it):
while(atomicAdd(c, 0) < 2000)
This will emit an atomic instruction that does a real load from global memory, and therefore should see the most recent value of c. However, it also introduces a (useless in this case) atomic store.
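Putting these suggestions together, a minimal, untested sketch of the original kernel might look like this (note that a spin loop like this can still fail to make progress on hardware where the threads of a warp execute in lockstep, so treat it as an illustration of the atomic load rather than a robust pattern):
__global__ void addKernel(int *c)
{
    int i = threadIdx.x;
    if (i == 0)
    {
        int temp;
        while ((temp = atomicAdd(c, 0)) < 2000)   // atomic read of the counter
        {
            printf("*c = %d\n", temp);
            atomicCAS(c, temp, temp + 1);         // try to advance it
        }
    }
    else
    {
        while (atomicAdd(c, 0) < 1000)            // atomic "load" via a no-op add
        {
            // spin until thread 0 has pushed the counter past 1000
        }
    }
}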

Related

How can I make sure the compiler parallelizes my loads from global memory?

I've written a CUDA kernel that looks something like this:
int tIdx = threadIdx.x; // Assume a 1-D thread block and a 1-D grid
int buffNo = 0;
for (int offset = buffSz*blockIdx.x; offset < totalCount; offset += buffSz*gridDim.x) {
    // Select which "page" we're using on this iteration
    float *buff = &sharedMem[buffNo*buffSz];
    // Load data from global memory
    if (tIdx < nLoadThreads) {
        for (int ii = tIdx; ii < buffSz; ii += nLoadThreads)
            buff[ii] = globalMem[ii+offset];
    }
    // Wait for shared memory
    __syncthreads();
    // Perform computation
    if (tIdx >= nLoadThreads) {
        // Perform some computation on the contents of buff[]
    }
    // Switch pages
    buffNo ^= 0x01;
}
Note that there's only one __syncthreads() in the loop, so the first nLoadThreads threads will start loading the data for the 2nd iteration while the rest of the threads are still computing the results for the 1st iteration.
I was thinking about how many threads to allocate for loading vs. computing, and I reasoned that I would only need a single warp for loading, regardless of buffer size, because that inner for loop consists of independent loads from global memory: they can all be in flight at the same time. Is this a valid line of reasoning?
And yet when I try this out, I find that (1) increasing the # of load warps dramatically increases performance, and (2) the disassembly in nvvp shows that buff[ii] = globalMem[ii+offset] was compiled into a load from global memory followed 2 instructions later by a store to shared memory, indicating that the compiler is not applying instruction-level parallelism here.
Would additional qualifiers (const, __restrict__, etc) on buff or globalMem help ensure the compiler does what I want?
I suspect the problem has to do with the fact that buffSz is not known at compile-time (the actual data is 2-D and the appropriate buffer size depends on the matrix dimensions). In order to do what I want, the compiler will need to allocate a separate register for each LD operation in flight, right? If I manually unroll the loop, the compiler re-orders the instructions so that there are a few LD in flight before the corresponding ST needs to access that register. I tried a #pragma unroll but the compiler only unrolled the loop without reordering the instructions, so that didn't help. What else can I do?
The compiler has no chance to reorder stores to shared memory away from loads from global memory, because a __syncthreads() barrier is immediately following.
As all of the threads have to wait at the barrier anyway, it is faster to use more threads for loading. This means that more global memory transactions can be in flight at any time, and each load thread has to incur global memory latency less often.
All CUDA devices so far do not support out-of-order execution, so the load loop will incur exactly one global memory latency per loop iteration, unless the compiler can unroll it and reorder loads before stores.
To allow full unrolling, the number of loop iterations needs to be known at compile time. You can use talonmies' suggestion of templating the loop trips to achieve this.
You can also use partial unrolling. Annotating the load loop with #pragma unroll 2 will allow the compiler to issue two loads, then two stores, for every two loop iterations, thus achieving a similar effect to doubling nLoadThreads. Replacing 2 with higher numbers is possible, but you will hit the maximum number of transactions in flight at some point (use float2 or float4 moves to transfer more data with the same number of transactions). Also, it is difficult to predict whether the compiler will prefer reordering instructions over the cost of more complex code for the final, potentially partial, trip through the unrolled loop.
So the suggestions are:
Use as many load threads as possible.
Unroll the load loop by templating the number of loop iterations and instantiating it for all possible numbers of loop trips (or the most common ones, with a generic fallback), or by using partial loop unrolling (see the sketch below).
If the data is suitably aligned, move it as float2 or float4 to move more data with the same number of transactions.
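To make the partial-unroll suggestion concrete, here is a hedged, untested sketch of just the load loop (buff, globalMem, buffSz, nLoadThreads and offset are the variables from the question):
if (tIdx < nLoadThreads) {
    // Ask the compiler to unroll two iterations; it may then batch the two
    // global loads before the two shared-memory stores, doubling the number
    // of transactions in flight per load thread.
    #pragma unroll 2
    for (int ii = tIdx; ii < buffSz; ii += nLoadThreads)
        buff[ii] = globalMem[ii + offset];
}
For the fully unrolled variant, the trip count has to be a compile-time constant, e.g. a template parameter of the kernel that you instantiate for each buffer size you actually use.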

Synchronization in CUDA

I read the CUDA reference manual about synchronization in CUDA, but I don't understand it clearly. For example, why do we use cudaDeviceSynchronize() or __syncthreads()? If I don't use them, will the program work incorrectly? What is the difference between cudaMemcpy and cudaMemcpyAsync in practice? Can you show an example of this difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load its appropriate shared memory location(s) before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
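A minimal sketch of the difference (hypothetical buffers d_buf/h_buf; for the async copy to actually overlap, h_buf should be pinned memory allocated with cudaMallocHost or cudaHostAlloc):
// Blocking copy: the CPU thread waits here until the transfer has finished.
cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice);

// Asynchronous copy: control returns immediately; the transfer is queued on the stream.
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
// ... launch kernels or do CPU work here while the copy is in flight ...
cudaStreamSynchronize(stream);   // wait for everything queued on this stream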
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have conditional paths in your code and then want to run an operation that depends on several array elements.
Consider the following example:
int n = threadIdx.x;
if( myarray[n] > 0 )
{
    myarray[n] = - myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reach here at the same time
In the above example, not all threads will have the same execution sequence. Some threads will take longer, depending on the if condition. When considering the last line of the example, you need to make sure that all threads have finished the if-condition and updated myarray correctly. If this isn't the case, y may mix updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if( myarray[n] > 0 )
{
    myarray[n] = - myarray[n];
}

__syncthreads(); // All threads will wait till they come to this point
// We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];

how can a __global__ function RETURN a value or BREAK out like C/C++ does

Recently I've been doing string-comparison jobs in CUDA, and I wonder how a __global__ function can return a value when it finds the exact string that I'm looking for.
I mean, I need the __global__ function, which contains a great number of threads, to search for a certain string among a big, big string pool simultaneously, and I hope that once the exact string is caught, the __global__ function can stop all the threads, return to the main function, and tell me "he did it"!
I'm using CUDA C. How can I achieve this?
There is no way in CUDA (or on NVIDIA GPUs) for one thread to interrupt execution of all running threads. You can't have the kernel exit immediately as soon as a result is found; it's just not possible today.
But you can have all threads exit as soon as possible after one thread finds a result. Here's a model of how you would do that.
__global__ void kernel(volatile bool *found, ...)
{
    while (!(*found) && workLeftToDo()) {
        bool iFoundIt = do_some_work(...); // see notes below
        if (iFoundIt) *found = true;
    }
}
Some notes on this.
Note the use of volatile. This is important.
Make sure you initialize found—which must be a device pointer—to false before launching the kernel!
Threads will not exit instantly when another thread updates found. They will exit only the next time they return to the top of the while loop.
How you implement do_some_work matters. If it is too much work (or too variable), then the delay to exit after a result is found will be long (or variable). If it is too little work, then your threads will be spending most of their time checking found rather than doing useful work.
do_some_work is also responsible for allocating tasks (i.e. computing/incrementing indices), and how you do that is problem specific.
If the number of blocks you launch is much larger than the maximum occupancy of the kernel on the present GPU, and a match is not found in the first running "wave" of thread blocks, then this kernel (and the one below) can deadlock. If a match is found in the first wave, then later blocks will only run after found == true, which means they will launch, then exit immediately. The solution is to launch only as many blocks as can be resident simultaneously (aka "maximal launch"), and update your task allocation accordingly.
If the number of tasks is relatively small, you can replace the while with an if and run just enough threads to cover the number of tasks. Then there is no chance for deadlock (but the first part of the previous point applies).
workLeftToDo() is problem-specific, but it would return false when there is no work left to do, so that we don't deadlock in the case that no match is found.
Now, the above may result in excessive partition camping (all threads banging on the same memory), especially on older architectures without L1 cache. So you might want to write a slightly more complicated version, using a shared status per block.
__global__ void kernel(volatile bool *found, ...)
{
    volatile __shared__ bool someoneFoundIt;

    // initialize shared status
    if (threadIdx.x == 0) someoneFoundIt = *found;
    __syncthreads();

    while(!someoneFoundIt && workLeftToDo()) {
        bool iFoundIt = do_some_work(...);

        // if I found it, tell everyone they can exit
        if (iFoundIt) { someoneFoundIt = true; *found = true; }

        // if someone in another block found it, tell
        // everyone in my block they can exit
        if (threadIdx.x == 0 && *found) someoneFoundIt = true;
        __syncthreads();
    }
}
This way, one thread per block polls the global variable, and only threads that find a match ever write to it, so global memory traffic is minimized.
Aside: __global__ functions are void because it's difficult to define how to return values from 1000s of threads into a single CPU thread. It is trivial for the user to contrive a return array in device or zero-copy memory which suits his purpose, but difficult to make a generic mechanism.
Disclaimer: Code written in browser, untested, unverified.
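For completeness, a host-side sketch of how the found flag from the notes above might be set up (hypothetical grid/block sizes, untested):
// Allocate and clear the device-side flag before launching the kernel.
bool *d_found;
cudaMalloc(&d_found, sizeof(bool));
cudaMemset(d_found, 0, sizeof(bool));   // initialize to false

kernel<<<numBlocks, blockSize>>>(d_found /*, ... other arguments ... */);

// Copy the flag back to see whether any thread found a match.
bool h_found = false;
cudaMemcpy(&h_found, d_found, sizeof(bool), cudaMemcpyDeviceToHost);
cudaFree(d_found);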
If you feel adventurous, an alternative approach to stopping kernel execution would be to just execute
// (write result to memory here)
__threadfence();
asm("trap;");
if an answer is found.
This doesn't require polling memory, but it is inferior to the solution that Mark Harris suggested in that it makes the kernel exit with an error condition. This may mask actual errors (so be sure to write out your results in a way that clearly allows you to tell a successful execution from an error), and it may cause other hiccups or decrease overall performance, as the driver treats this as an exception.
If you look for a safe and simple solution, go with Mark Harris' suggestion instead.
The __global__ function doesn't really contain a great number of threads the way you think it does. It is simply a kernel, a function that runs on the device, that is called by passing launch parameters that specify the thread model. The model that CUDA employs is a 2D grid model with a 3D thread model inside each block of the grid.
With the type of problem you have, it is not really necessary to use anything besides a 1D grid with 1D blocks of threads, because it doesn't really make sense to split a string pool into 2D the way other problems are split (e.g. matrix multiplication).
I'll walk through a simple example of, say, 100 strings in the string pool that you want checked in a parallelized fashion instead of sequentially.
//main
//Should cudaMalloc and cudaMemcpy the string pool, the target string, and answerIdx
//to the device before this code
dim3 dimGrid(10, 1);   // 1D grid with 10 blocks
dim3 dimBlocks(10, 1); // 1D blocks with 10 threads
fun<<<dimGrid, dimBlocks>>>(strings, stringToMatch, answerIdx); // device pointers set up above
//cudaMemcpy answerIdx back to an integer on the host

//kernel (not positive on these types as my CUDA is very rusty)
__global__ void fun(char *strings[], char *stringToMatch, int *answerIdx)
{
    int idx = blockIdx.x * 10 + threadIdx.x;
    //Obviously use whatever function you've been using for string comparison
    //I'm just using == for example's sake
    if(strings[idx] == stringToMatch)
    {
        *answerIdx = idx;
    }
}
This is obviously not the most efficient approach and is most likely not the exact way to pass parameters and work with memory in CUDA, but I hope it gets across the point of splitting the workload, and that 'global' functions get executed on many different cores, so you can't really tell them all to stop. There may be a way I'm not familiar with, but the speed-up you will get by just dividing the workload onto the device (in a sensible fashion, of course) will already give you a tremendous performance improvement. To get a sense of the thread model I highly recommend reading up on the documents on Nvidia's site for CUDA. They will help tremendously and teach you the best way to set up the grid and blocks for optimal performance.

How to share a common value between threads in a given block?

I have a kernel that, for each thread in a given block, computes a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.
However, I'm not sure which way is best (performance-wise) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
    __shared__ int nIter;

    if( threadIdx.x == 0 )
        nIter = nIterBuf[blockIdx.x];
    __syncthreads();

    for( int i=0; i < nIter; i++ )
        ...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!
Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.
The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1, and if the amount of data you are accessing per block is relatively small, you should always hit in the cache after the initial compulsory miss in each thread block.
You also won't need to use any shared memory or transfer from global to shared memory.
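A hedged sketch of the const-pointer variant (the out parameter and the loop body are placeholders; whether the compiler actually routes the load through the constant/uniform cache depends on the architecture and compiler version):
// Read-only pointer indexed only by blockIdx: a candidate for the constant cache on SM 2.x+.
__global__ void foo( const int* __restrict__ nIterBuf, float *out )
{
    int nIter = nIterBuf[blockIdx.x];   // uniform across the block
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    for( int i = 0; i < nIter; i++ )
        out[tid] += 1.0f;               // placeholder work
}
The explicit alternative is a __constant__ array filled from the host with cudaMemcpyToSymbol, which guarantees the per-block counts sit behind the constant cache, at the cost of a fixed compile-time size.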
If my info is up to date, shared memory is the second-fastest memory, second only to the registers.
If reading this data from shared memory every iteration slows you down and you still have registers available (refer to your GPU's compute capability and specs), you could perhaps try to store a copy of this value in every thread's register (using a local variable).
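Applied to the kernel above, that suggestion might look like the following sketch (same placeholder work as before):
__global__ void foo( int* nIterBuf, float *out )
{
    __shared__ int nIterShared;
    if( threadIdx.x == 0 )
        nIterShared = nIterBuf[blockIdx.x];
    __syncthreads();

    int nIter = nIterShared;   // one shared-memory read, then a per-thread register copy
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    for( int i = 0; i < nIter; i++ )
        out[tid] += 1.0f;      // placeholder work
}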

Atomic Instruction

What do you mean by Atomic instructions?
How does the following become Atomic?
TestAndSet
int TestAndSet(int *x){
    register int temp = *x;
    *x = 1;
    return temp;
}
From a software perspective, if one does not want to use non-blocking synchronization primitives, how can one ensure the atomicity of an instruction? Is it possible only in hardware, or can some assembly-level directive be used?
Some machine instructions are intrinsically atomic - for example, reading and writing properly aligned values of the native processor word size is atomic on many architectures.
This means that hardware interrupts, other processors and hyper-threads cannot interrupt the read or store and read or write a partial value to the same location.
More complicated things such as reading and writing together atomically can be achieved by explicit atomic machine instructions e.g. LOCK CMPXCHG on x86.
Locking and other high-level constructs are built on these atomic primitives, which typically only guard a single processor word.
Some clever concurrent algorithms can be built using just the reading and writing of pointers e.g. in linked lists shared between a single reader and writer, or with effort, multiple readers and writers.
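To make the "locks are built on these primitives" point concrete, here is a minimal spinlock sketch using C++11's std::atomic_flag, whose test_and_set member is the TestAndSet from the question performed as one indivisible hardware operation (illustration only, not production code):
#include <atomic>

struct SpinLock {
    std::atomic_flag flag;

    SpinLock() { flag.clear(); }

    void lock() {
        // test_and_set atomically sets the flag and returns its previous value;
        // keep retrying until we are the one that flipped it from clear to set.
        while (flag.test_and_set(std::memory_order_acquire)) {
            // spin
        }
    }

    void unlock() {
        flag.clear(std::memory_order_release);
    }
};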
Below are some of my notes on Atomicity that may help you understand the meaning. The notes are from the sources listed at the end and I recommend reading some of them if you need a more thorough explanation rather than point-form bullets as I have. Please point out any errors so that I may correct them.
Definition :
From the Greek meaning "not divisible into smaller parts"
An "atomic" operation is always observed to be done or not done, but never halfway done.
An atomic operation must be performed entirely or not performed at all.
In multi-threaded scenarios, a variable goes from unmutated to mutated directly, with no "halfway mutated" values.
Example 1 : Atomic Operations
Consider the following integers used by different threads :
int X = 2;
int Y = 1;
int Z = 0;
Z = X; //Thread 1
X = Y; //Thread 2
In the above example, two threads make use of X, Y, and Z
Each read and write are atomic
The threads will race :
If thread 1 wins, then Z = 2
If thread 2 wins, then Z = 1
Z will definitely be one of those two values
Example 2 : Non-Atomic Operations : ++/-- Operations
Consider the increment/decrement expressions :
i++; //increment
i--; //decrement
The operations translate to :
Read i
Increment/decrement the read value
Write the new value back to i
The operations are each composed of 3 atomic operations, and are not atomic themselves
Two attempts to increment i on separate threads could interleave such that one of the increments is lost
Example 3 - Non-Atomic Operations : Values greater than 4-Bytes
Consider the following immutable struct :
struct MyLong
{
    public readonly int low;
    public readonly int high;

    public MyLong(int low, int high)
    {
        this.low = low;
        this.high = high;
    }
}
We create fields with specific values of type MyLong :
MyLong X = new MyLong(0xAAAA, 0xAAAA);
MyLong Y = new MyLong(0xBBBB, 0xBBBB);
MyLong Z = new MyLong(0xCCCC, 0xCCCC);
We modify our fields in separate threads without thread safety :
X = Y; //Thread 1
Y = X; //Thread 2
In .NET, when copying a value type, the CLR doesn't call a constructor - it moves the bytes one atomic operation at a time
Because of this, the operations in the two threads are now four atomic operations
If there is no thread safety enforced, the data can be corrupted
Consider the following execution order of operations :
X.low = Y.low; //Thread 1 - X = 0xAAAABBBB
Y.low = Z.low; //Thread 2 - Y = 0xCCCCBBBB
Y.high = Z.high; //Thread 2 - Y = 0xCCCCCCCC
X.high = Y.high; //Thread 1 - X = 0xCCCCBBBB <-- corrupt value for X
Reading and writing values greater than 32-bits on multiple threads on a 32-bit operating system without adding some sort of locking to make the operation atomic is likely to result in corrupt data as above
Processor Operations
On all modern processors, you can assume that reads and writes of naturally aligned native types are atomic as long as :
1 : The memory bus is at least as wide as the type being read or written
2 : The CPU reads and writes these types in a single bus transaction, making it impossible for other threads to see them in a half-completed state
On x86 and X64 there is no guarantee that reads and writes larger than eight bytes are atomic
Processor vendors define the atomic operations for each processor in a Software Developer's Manual
In single processors / single core systems it is possible to use standard locking techniques to prevent CPU instructions from being interrupted, but this can be inefficient
Disabling interrupts is another more efficient solution, if possible
In multiprocessor / multicore systems it is still possible to use locks but merely using a single instruction or disabling interrupts does not guarantee atomic access
Atomicity can be achieved by ensuring that the instructions used assert the 'LOCK' signal on the bus to prevent other processors in the system from accessing the memory at the same time
Language Differences
C#
C# guarantees that reads and writes of any built-in value type that takes up to 4 bytes are atomic
Reads and writes of value types that take more than four bytes (double, long, etc.) are not guaranteed to be atomic
The CLI guarantees that reads and writes of variables of value type that are the size (or smaller) of the processor's natural pointer size are atomic
Ex - running C# on a 64-bit OS in a 64-bit version of the CLR performs reads and writes of 64-bit doubles and long integers atomically
Creating atomic operations :
.NET provides the Interlocked class as part of the System.Threading namespace
The Interlocked Class provides atomic operations such as increment, compare, exchange, etc.
using System.Threading;
int unsafeCount;
int safeCount;
unsafeCount++;
Interlocked.Increment(ref safeCount);
C++
C++ standard does not guarantee atomic behavior
All C / C++ operations are presumed non-atomic unless otherwise specified by the compiler or hardware vendor - including 32-bit integer assignment
Creating atomic operations :
The C++11 concurrency library includes the Atomic Operations Library (<atomic>)
The Atomic library provides atomic types as a template class to use with any type you want
Operations on atomic types are atomic and thus thread-safe
struct AtomicCounter
{
    std::atomic<int> value;

    void increment(){
        ++value;
    }
    void decrement(){
        --value;
    }
    int get(){
        return value.load();
    }
};
Java
Java guarantees that reads and writes of any built-in value type that takes up to 4 bytes are atomic
Assignments to volatile longs and doubles are also guaranteed to be atomic
Java provides a small toolkit of classes that support lock-free thread-safe programming on single variables through java.util.concurrent.atomic
This provides atomic lock-free operations based on low-level atomic hardware primitives such as compare-and-swap (CAS) - also called compare and set :
CAS form - boolean compareAndSet(expectedValue, updateValue);
This method atomically sets a variable to the updateValue if it currently holds the expectedValue - reporting true on success
import java.util.concurrent.atomic.AtomicInteger;

public class Counter
{
    private AtomicInteger value = new AtomicInteger();

    public int increment(){
        return value.incrementAndGet();
    }
    public int getValue(){
        return value.get();
    }
}
Sources
http://www.evernote.com/shard/s10/sh/c2735e95-85ae-4d8c-a615-52aadc305335/99de177ac05dc8635fb42e4e6121f1d2
Atomic comes from the Greek ἄτομος (atomos) which means "indivisible". (Caveat: I don't speak Greek, so maybe it's really something else, but most English speakers citing etymologies interpret it this way. :-)
In computing, this means that the operation, well, happens. There isn't any intermediate state that's visible before it completes. So if your CPU gets interrupted to service hardware (IRQ), or if another CPU is reading the same memory, it doesn't affect the result, and these other operations will observe it as either completed or not started.
As an example... let's say you wanted to set a variable to something, but only if it has not been set before. You might be inclined to do this:
if (foo == 0)
{
    foo = some_function();
}
But what if this is run in parallel? It could be that the program fetches foo and sees it as zero; meanwhile, thread 2 comes along, does the same thing, and sets the value to something. Back in the original thread, the code still thinks foo is zero, and the variable gets assigned twice.
For cases like this, the CPU provides some instructions that can do the comparison and the conditional assignment as an atomic entity. Hence test-and-set, compare-and-swap, and load-linked/store-conditional. You can use these to implement locks (your OS and your C library have done this), or you can write one-off algorithms that rely on the primitives to do something. (There's cool stuff to be done here, but most mere mortals avoid this for fear of getting it wrong.)
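A sketch of how compare-and-swap closes that window, written with C++11 atomics (the principle is the same with CUDA's atomicCAS or x86's LOCK CMPXCHG; some_function is just the question's stand-in):
#include <atomic>

std::atomic<int> foo{0};

int some_function() { return 42; }   // stand-in for whatever computes the value

void set_once()
{
    int desired  = some_function();
    int expected = 0;
    // Atomically: "if foo still holds 0, store desired; otherwise leave it alone."
    // The compare and the store happen as one indivisible step, so only one of
    // the racing threads can win; the others simply see that foo was already set.
    foo.compare_exchange_strong(expected, desired);
}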
Atomicity is a key concept when you have any form of parallel processing (including different applications cooperating or sharing data) that includes shared resources.
The problem is well illustrated with an example. Let's say you have two programs that want to create a file, but only if the file doesn't already exist. Either of the two programs can create the file at any point in time.
If you do (I'll use C since it's what's in your example):
...
f = fopen ("SYNCFILE","r");
if (f == NULL) {
f = fopen ("SYNCFILE","w");
}
...
you can't be sure that the other program hasn't created the file between your open for read and your open for write.
There's no way you can do this on your own; you need help from the operating system, which usually provides synchronization primitives for this purpose, or another mechanism that is guaranteed to be atomic (for example a relational database, where the lock operation is atomic, or a lower-level mechanism like the processor's "test and set" instructions).
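For the file example specifically, POSIX exposes exactly such a primitive: open() with O_CREAT | O_EXCL creates the file and fails if it already exists, with the check and the creation done as a single atomic step inside the operating system (a sketch):
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    // Fails with EEXIST if another program created SYNCFILE first.
    int fd = open("SYNCFILE", O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");   // the other program got there first (or open failed)
        return 1;
    }
    // We are the program that created the file.
    close(fd);
    return 0;
}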
Atomicity can only be guaranteed by the OS. The OS uses the underlying processor features to achieve this.
So creating your own testandset function is impossible. (Although I'm not sure whether one could use an inline asm snippet and use the testandset mnemonic directly; it could be that this instruction can only be executed with OS privileges.)
EDIT:
According to the comments below this post, making your own 'bittestandset' function using an ASM directive directly is possible (on Intel x86). However, whether these tricks also work on other processors is not clear.
I stand by my point: if you want to do atomic things, use the OS functions and don't do it yourself.