Why are there no simple atomic increment, decrement operations in CUDA? - cuda

From the CUDA Programming guide:
unsigned int atomicInc(unsigned int* address,
unsigned int val);
reads the 32-bit word old located at the address address in global or
shared memory, computes ((old >= val) ? 0 : (old+1)), and stores the
result back to memory at the same address. These three operations are
performed in one atomic transaction. The function returns old.
This is nice and dandy. But where's
unsigned int atomicInc(unsigned int* address);
which simply increments the value at address and returns the old value? And
void atomicInc(unsigned int* address);
which simply increments the value at address and returns nothing?
Note: Of course I can 'roll my own' by wrapping the actual API call, but I would think the hardware has such a simpler operation, which could be cheaper.

They haven't implemented the simple increase and decrease operations because the operations wouldn't have better performance. Each machine code instruction in current architectures takes the same amount of space, which is 64 bits. In other words, there's room in the instruction for a full 32-bit immediate value and since they have atomic instructions that support adding a full 32-bit value, they have already spent the transistors.
I think that dedicated inc and dec instructions on old processors are now just artifacts from a time when transistors were much more expensive and the instruction cache was small, thereby making it worth the effort to code instructions into as few bits as possible. My guess is that, on new CPUs, the inc and dec instructions are implemented in terms of a more general addition function internally and are mostly there for backwards compatibility.

You should be able to accomplish "simple" atomic increment using:
__global__ void mykernel(int *value){
int my_old_val = atomicAdd(value, 1);
}
Likewise if you don't care about the return value, this is perfectly acceptable:
__global__ void mykernel(int *value){
atomicAdd(value, 1);
}
You can do atomic decrement similarly with:
atomicSub(value, 1);
or even
atomicAdd(value, -1);
Here is the section on atomics in the programming guide, which covers support by compute capability for the various functions and data types.

Related

Why does cuFFT performance suffer with overlapping inputs?

I'm experimenting with using cuFFT's callback feature to perform input format conversion on the fly (for instance, calculating FFTs of 8-bit integer input data without first doing an explicit conversion of the input buffer to float). In many of my applications, I need to calculate overlapped FFTs on an input buffer, as described in this previous SO question. Typically, adjacent FFTs might overlap by 1/4 to 1/8 of the FFT length.
cuFFT, with its FFTW-like interface, explicitly supports this via the idist parameter of the cufftPlanMany() function. Specifically, if I want to calculate FFTs of size 32768 with an overlap of 4096 samples between consecutive inputs, I would set idist = 32768 - 4096. This does work properly in the sense that it yields the correct output.
However, I'm seeing strange performance degradation when using cuFFT in this way. I have devised a test that implements this format conversion and overlap in two different ways:
Explicitly tell cuFFT about the overlapping nature of the input: set idist = nfft - overlap as I described above. Install a load callback function that just does the conversion from int8_t to float as needed on the buffer index provided to the callback.
Don't tell cuFFT about the overlapping nature of the input; lie to it an dset idist = nfft. Then, let the callback function handle the overlapping by calculating the correct index that should be read for each FFT input.
A test program implementing both of these approaches with timing and equivalence tests is available in this GitHub gist. I didn't reproduce it all here for brevity. The program calculates a batch of 1024 32768-point FFTs that overlap by 4096 samples; the input data type is 8-bit integers. When I run it on my machine (with a Geforce GTX 660 GPU, using CUDA 8.0 RC on Ubuntu 16.04), I get the following result:
executing method 1...done in 32.523 msec
executing method 2...done in 26.3281 msec
Method 2 is noticeably faster, which I would not expect. Look at the implementations of the callback functions:
Method 1:
template <typename T>
__device__ cufftReal convert_callback(void * inbuf, size_t fft_index,
void *, void *)
{
return (cufftReal)(((const T *) inbuf)[fft_index]);
}
Method 2:
template <typename T>
__device__ cufftReal convert_and_overlap_callback(void *inbuf,
size_t fft_index, void *, void *)
{
// fft_index is the index of the sample that we need, not taking
// the overlap into account. Convert it to the appropriate sample
// index, considering the overlap structure. First, grab the FFT
// parameters from constant memory.
int nfft = overlap_params.nfft;
int overlap = overlap_params.overlap;
// Calculate which FFT in the batch that we're reading data for. This
// tells us how much overlap we need to account for. Just use integer
// arithmetic here for speed, knowing that this would cause a problem
// if we did a batch larger than 2Gsamples long.
int fft_index_int = fft_index;
int fft_batch_index = fft_index_int / nfft;
// For each transform past the first one, we need to slide "overlap"
// samples back in the input buffer when fetching the sample.
fft_index_int -= fft_batch_index * overlap;
// Cast the input pointer to the appropriate type and convert to a float.
return (cufftReal) (((const T *) inbuf)[fft_index_int]);
}
Method 2 has a significantly more complex callback function, one that even involves integer division by a non-compile time value! I would expect this to be much slower than method 1, but I'm seeing the opposite. Is there a good explanation for this? Is it possible that cuFFT structures its processing much differently when the input overlaps, thus resulting in the degraded performance?
It seems like I should be able to achieve performance that is quite a bit faster than method 2 if the index calculations could be removed from the callback (but that would require the overlapping to be specified to cuFFT).
Edit: After running my test program under nvvp, I can see that cuFFT definitely seems to be structuring its computations differently. It's hard to make sense of the kernel symbol names, but the kernel invocations break down like this:
Method 1:
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex14packR2C_kernelIjfEEvNS_19spRealComplexR2C_stIT_T0_EE: 3.72 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 7.71 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.75 msec (yes, it gets invoked twice)
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.49 msec
Method 2:
spRadix0128C::kernel1MemCallback<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, L1, ALL, WRITEBACK>: 5.15 msec
spRadix0128C::kernel1Tex<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=4, CONSTANT, ALL, WRITEBACK>: 12.88 msec
__nv_static_73__60_tmpxft_00006cdb_00000000_15_spRealComplex_compute_60_cpp1_ii_1f28721c__ZN13spRealComplex24postprocessC2C_kernelTexIjfL9fftAxii_t1EEEvP7ComplexIT0_EjT_15coordDivisors_tIS6_E7coord_tIS6_ESA_S6_S3_: 7.51 msec
Interestingly, it looks like cuFFT invokes two kernels to actually compute the FFTs using method 1 (when cuFFT knows about the overlapping), but with method 2 (where it doesn't know that the FFTs are overlapped), it does the job with just one. For the kernels that are used in both cases, it does seem to use the same grid parameters between methods 1 and 2.
I don't see why it should have to use a different implementation here, especially since the input stride istride == 1. It should just use a different base address when fetching data at the transform input; the rest of the algorithm should be exactly the same, I think.
Edit 2: I'm seeing some even more bizarre behavior. I realized by accident that if I fail to destroy the cuFFT handles appropriately, I see differences in measured performance. For example, I modified the test program to skip destruction of the cuFFT handles and then executed the tests in a different sequence: method 1, method 2, then method 2 and method 1 again. I got the following results:
executing method 1...done in 31.5662 msec
executing method 2...done in 17.6484 msec
executing method 2...done in 17.7506 msec
executing method 1...done in 20.2447 msec
So the performance seems to change depending upon whether there are other cuFFT plans in existence when creating a plan for the test case! Using the profiler, I see that the structure of the kernel launches doesn't change between the two cases; the kernels just all seem to execute faster. I have no reasonable explanation for this effect either.
If you specify non-standard strides (doesn't matter if batch/transform) cuFFT uses different path internally.
ad edit 2:
This is likely GPU Boost adjusting clocks on GPU. cuFFT plan do not have impact one on another
Ways to get more stable results:
run warmup kernel (anything that would make full GPU work is good) and then your problem
increase batch size
run test several times and take average
lock clocks of the GPU (not really possible on GeForce - Tesla can do it)
At the suggestion of #llukas, I filed a bug report with NVIDIA regarding the issue (https://partners.nvidia.com/bug/viewbug/1821802 if you're registered as a developer). They acknowledged the poorer performance with overlapped plans. They actually indicated that the kernel configuration used in both cases is suboptimal and they plan to improve that eventually. No ETA was given, but it will likely not be in the next release (8.0 was just released last week). Finally, they said that as of CUDA 8.0, there is no workaround to make cuFFT use a more efficient method with strided inputs.

Implementing parallel merging in CUDA with shared memory does not enhance performance

I am trying to implement a parallel merging algorithm in CUDA. The algorithm is designed to be executed in one thread block. Its basic idea is to compute the global rank of each element in two input sequences. Since these two input sequences are sorted, a element's global rank is equal to its index in its original sequence plus its rank in the other sequence which is computed by binary search. I think the best strategy for implementing such algorithm is to load two sequences into shared memory to reduce the global memory read. However, when I compared two version of implementation, one using shared memory and one without using shared memory, I can't see the performance enhancement. I wonder if I am doing something wrong.
hardware: GeForce GTX 285, Linux x86_64.
time to merge two sequences of 1024 elements for both implementations is about 0.068672 ms.
__global__ void localMerge(int * A, int numA,int * B,int numB,int * C){
extern __shared__ int temp[]; // shared memory for A and B;
int tx=threadIdx.x;
int size=blockDim.x;
int *tempA=temp;
int *tempB=temp+numA;
int i,j,k,mid;
//read sequences into shared memory
for(i=tx;i<numA;i+=size){
tempA[i]=A[i];
}
for(i=tx;i<numB;i+=size){
tempB[i]=B[i];
}
__syncthreads();
//compute global rank for elements in sequence A
for(i=tx;i<numA;i+=size){
j=0;
k=numB-1;
if(tempA[i]<=tempB[0]){
C[i]=tempA[i];
}
else if(tempA[i]>tempB[numB-1]){
C[i+numB]=tempA[i];
}
else{
while(j<k-1){
mid=(j+k)/2;
if(tempB[mid]<tempA[i]){
j=mid;
}
else{
k=mid;
}
}
//printf("i=%d,j=%d,C=%d\n",i,j,tempA[i]);
C[i+j+1]=tempA[i];
}
}
//compute global rank for elements in sequence B
for(i=tx;i<numB;i+=size){
j=0;
k=numA-1;
if(tempB[i]<tempA[0]){
C[i]=tempB[i];
}
else if(tempB[i]>=tempA[numA-1]){
C[i+numA]=tempB[i];
}
else{
while(j<k-1){
mid=(j+k)/2;
if(tempA[mid]<=tempB[i]){
j=mid;
}
else{
k=mid;
}
}
//printf("i=%d,j=%d,C=%d\n",i,j,tempB[i]);
C[i+j+1]=tempB[i];
}
}
}
You may have more luck with applying the "merge path" algorithm than by relying on a collection of parallel fine-grained binary searches through both input lists in __shared__ memory. Using __shared__ memory for this problem is less important because what reuse exists in this algorithm can be captured pretty well by the cache.
With this merge algorithm, the idea is that each thread of the CTA is responsible for producing k outputs in the merged result. This has the nice property that each thread's work is roughly uniform, and the binary searching involved is fairly coarse grained.
Thread i searches both input lists at once to find the position in each list of the k*ith output element. The job then is simple: each thread serially merges k items from the input lists and copies them to location k*i in the output.
You can refer to Thrust's implementation for the details.

About warp voting function

The CUDA programming guide introduced the concept of warp vote function, "_all", "_any" and "__ballot".
My question is: what applications will use these 3 functions?
The prototype of __ballot is the following
unsigned int __ballot(int predicate);
If predicate is nonzero, __ballot returns a value with the Nth bit set, where N is the thread index.
Combined with atomicOr and __popc, it can be used to accumulate the number of threads in each warp having a true predicate.
Indeed, the prototype of atomicOr is
int atomicOr(int* address, int val);
and atomicOr reads the value pointed to by address, performs a bitwise OR operation with val, and writes the value back to address and returns its old value as a return parameter.
On the other side, __popc returns the number of bits set withing a 32-bit parameter.
Accordingly, the instructions
volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];
const u32 warp_sum = threadIdx.x >> 5;
atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold));
atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));
can be used to count the number of threads for which the predicate is true.
For more details, see Shane Cook, CUDA Programming, Morgan Kaufmann
__ballot is used in CUDA-histogram and in CUDA NPP library for quick generation of bitmasks, and combining it with __popc intrinsic to make a very efficient implementation of boolean reduction.
__all and __any was used in reduction before introduction of __ballot, though I can not think of any other use of them.
As an example of algorithm that uses __ballot API i would mention the In-Kernel Stream Compaction by D.M Hughes et Al. It is used in prefix sum part of the stream compaction to count (per warp) the number of elements that passed the predicate.
Here the paper. In-k Stream Compaction
CUDA provides several warp-wide broadcast and reduction operations that NVIDIA’s architectures efficiently support. For example, __ballot(predicate) instruction evaluates predicate for all active threads of the warp and returns an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active [Reference: Flexible Software Profiling of GPU Architectures].

How to share a common value between threads in a given block?

I have a kernel that, for each thread in a given block, computes a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.
However, I'm not sure which way is the best (performance speaking) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
__shared__ int nIter;
if( threadIdx.x == 0 )
nIter = nIterBuf[blockIdx.x];
__syncthreads();
for( int i=0; i < nIter; i++ )
...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!
Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.
The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1, and if the amount of data you are accessing per block is relatively small, after the first access within each block, you should always hit in the cache after the initial compulsory miss in each thread block.
You also won't need to use any shared memory or transfer from global to shared memory.
If my info is up-to-date, the shared memory is the second fastest memory, second only to the registers.
If reading this data from shared memory every iteration slows you down and you still have registers available (refer to your GPU's compute capability and specs), you could perhaps try to store a copy of this value in every thread's register (using a local variable).

Atomic Instruction

What do you mean by Atomic instructions?
How does the following become Atomic?
TestAndSet
int TestAndSet(int *x){
register int temp = *x;
*x = 1;
return temp;
}
From a software perspective, if one does not want to use non-blocking synchronization primitives, how can one ensure Atomicity of instruction? is it possible only at Hardware or some assembly level directive optimization can be used?
Some machine instructions are intrinsically atomic - for example, reading and writing properly aligned values of the native processor word size is atomic on many architectures.
This means that hardware interrupts, other processors and hyper-threads cannot interrupt the read or store and read or write a partial value to the same location.
More complicated things such as reading and writing together atomically can be achieved by explicit atomic machine instructions e.g. LOCK CMPXCHG on x86.
Locking and other high-level constructs are built on these atomic primitives, which typically only guard a single processor word.
Some clever concurrent algorithms can be built using just the reading and writing of pointers e.g. in linked lists shared between a single reader and writer, or with effort, multiple readers and writers.
Below are some of my notes on Atomicity that may help you understand the meaning. The notes are from the sources listed at the end and I recommend reading some of them if you need a more thorough explanation rather than point-form bullets as I have. Please point out any errors so that I may correct them.
Definition :
From the Greek meaning "not divisible into smaller parts"
An "atomic" operation is always observed to be done or not done, but
never halfway done.
An atomic operation must be performed entirely or not performed at
all.
In multi-threaded scenarios, a variable goes from unmutated to
mutated directly, with no "halfway mutated" values
Example 1 : Atomic Operations
Consider the following integers used by different threads :
int X = 2;
int Y = 1;
int Z = 0;
Z = X; //Thread 1
X = Y; //Thread 2
In the above example, two threads make use of X, Y, and Z
Each read and write are atomic
The threads will race :
If thread 1 wins, then Z = 2
If thread 2 wins, then Z=1
Z will will definitely be one of those two values
Example 2 : Non-Atomic Operations : ++/-- Operations
Consider the increment/decrement expressions :
i++; //increment
i--; //decrement
The operations translate to :
Read i
Increment/decrement the read value
Write the new value back to i
The operations are each composed of 3 atomic operations, and are not atomic themselves
Two attempts to increment i on separate threads could interleave such that one of the increments is lost
Example 3 - Non-Atomic Operations : Values greater than 4-Bytes
Consider the following immutable struct :
struct MyLong
{
public readonly int low;
public readonly int high;
public MyLong(int low, int high)
{
this.low = low;
this.high = high;
}
}
We create fields with specific values of type MyLong :
MyLong X = new MyLong(0xAAAA, 0xAAAA);
MyLong Y = new MyLong(0xBBBB, 0xBBBB);
MyLong Z = new MyLong(0xCCCC, 0xCCCC);
We modify our fields in separate threads without thread safety :
X = Y; //Thread 1
Y = X; //Thread 2
In .NET, when copying a value type, the CLR doesn't call a constructor - it moves the bytes one atomic operation at a time
Because of this, the operations in the two threads are now four atomic operations
If there is no thread safety enforced, the data can be corrupted
Consider the following execution order of operations :
X.low = Y.low; //Thread 1 - X = 0xAAAABBBB
Y.low = Z.low; //Thread 2 - Y = 0xCCCCBBBB
Y.high = Z.high; //Thread 2 - Y = 0xCCCCCCCC
X.high = Y.high; //Thread 1 - X = 0xCCCCBBBB <-- corrupt value for X
Reading and writing values greater than 32-bits on multiple threads on a 32-bit operating system without adding some sort of locking to make the operation atomic is likely to result in corrupt data as above
Processor Operations
On all modern processors, you can assume that reads and writes of naturally aligned native types are atomic as long as :
1 : The memory bus is at least as wide as the type being read or written
2 : The CPU reads and writes these types in a single bus transaction, making it impossible for other threads to see them in a half-completed state
On x86 and X64 there is no guarantee that reads and writes larger than eight bytes are atomic
Processor vendors define the atomic operations for each processor in a Software Developer's Manual
In single processors / single core systems it is possible to use standard locking techniques to prevent CPU instructions from being interrupted, but this can be inefficient
Disabling interrupts is another more efficient solution, if possible
In multiprocessor / multicore systems it is still possible to use locks but merely using a single instruction or disabling interrupts does not guarantee atomic access
Atomicity can be achieved by ensuring that the instructions used assert the 'LOCK' signal on the bus to prevent other processors in the system from accessing the memory at the same time
Language Differences
C#
C# guarantees that operations on any built-in value type that takes up to 4-bytes are atomic
Operations on value types that take more than four bytes (double, long, etc.) are not guaranteed to be atomic
The CLI guarantees that reads and writes of variables of value type that are the size (or smaller) of the processor's natural pointer size are atomic
Ex - running C# on a 64-bit OS in a 64-bit version of the CLR performs reads and writes of 64-bit doubles and long integers atomically
Creating atomic operations :
.NET provodes the Interlocked Class as part of the System.Threading namespace
The Interlocked Class provides atomic operations such as increment, compare, exchange, etc.
using System.Threading;
int unsafeCount;
int safeCount;
unsafeCount++;
Interlocked.Increment(ref safeCount);
C++
C++ standard does not guarantee atomic behavior
All C / C++ operations are presumed non-atomic unless otherwise specified by the compiler or hardware vendor - including 32-bit integer assignment
Creating atomic operations :
The C++ 11 concurrency library includes the - Atomic Operations Library ()
The Atomic library provides atomic types as a template class to use with any type you want
Operations on atomic types are atomic and thus thread-safe
struct AtomicCounter
{
std::atomic< int> value;
void increment(){
++value;
}
void decrement(){
--value;
}
int get(){
return value.load();
}
}
Java
Java guarantees that operations on any built-in value type that takes up to 4-bytes are atomic
Assignments to volatile longs and doubles are also guaranteed to be atomic
Java provides a small toolkit of classes that support lock-free thread-safe programming on single variables through java.util.concurrent.atomic
This provides atomic lock-free operations based on low-level atomic hardware primitives such as compare-and-swap (CAS) - also called compare and set :
CAS form - boolean compareAndSet(expectedValue, updateValue );
This method atomically sets a variable to the updateValue if it currently holds the expectedValue - reporting true on success
import java.util.concurrent.atomic.AtomicInteger;
public class Counter
{
private AtomicInteger value= new AtomicInteger();
public int increment(){
return value.incrementAndGet();
}
public int getValue(){
return value.get();
}
}
Sources
http://www.evernote.com/shard/s10/sh/c2735e95-85ae-4d8c-a615-52aadc305335/99de177ac05dc8635fb42e4e6121f1d2
Atomic comes from the Greek ἄτομος (atomos) which means "indivisible". (Caveat: I don't speak Greek, so maybe it's really something else, but most English speakers citing etymologies interpret it this way. :-)
In computing, this means that the operation, well, happens. There isn't any intermediate state that's visible before it completes. So if your CPU gets interrupted to service hardware (IRQ), or if another CPU is reading the same memory, it doesn't affect the result, and these other operations will observe it as either completed or not started.
As an example... let's say you wanted to set a variable to something, but only if it has not been set before. You might be inclined to do this:
if (foo == 0)
{
foo = some_function();
}
But what if this is run in parallel? It could be that the program will fetch foo, see it as zero, meanwhile thread 2 comes along and does the same thing and sets the value to something. Back in the original thread, the code still thinks foo is zero, and the variable gets assigned twice.
For cases like this, the CPU provides some instructions that can do the comparison and the conditional assignment as an atomic entity. Hence, test-and-set, compare-and-swap, and load-linked/store-conditional. You can use these to implement locks (your OS and your C library has done this.) Or you can write one-off algorithms that rely on the primitives to do something. (There's cool stuff to be done here, but most mere mortals avoid this for fear of getting it wrong.)
Atomicity is a key concept when you have any form of parallel processing (including different applications cooperating or sharing data) that includes shared resources.
The problem is well illustrated with an example. Let's say you have two programs that want to create a file but only if the file doesn't already exists. Any of the two program can create the file at any point in time.
If you do (I'll use C since it's what's in your example):
...
f = fopen ("SYNCFILE","r");
if (f == NULL) {
f = fopen ("SYNCFILE","w");
}
...
you can't be sure that the other program hasn't created the file between your open for read and your open for write.
There's no way you can do this on your own, you need help from the operating system, that usually provide syncronization primitives for this purpose, or another mechanism that is guaranteed to be atomic (for example a relational database where the lock operation is atomic, or a lower level mechanism like processors "test and set" instructions).
Atomicity can only be guaranteed by the OS. The OS uses the underlying processor features to achieve this.
So creating your own testandset function is impossible. (Although I'm not sure if one could use an inline asm snippet, and use the testandset mnemonic directly (Could be that this statement can only be done with OS priviliges))
EDIT:
According to the comments below this post, making your own 'bittestandset' function using an ASM directive directly is possible (on intel x86). However, if these tricks also work on other processors is not clear.
I stand by my point: if You want to do atmoic things, use the OS functions and don't do it yourself