Is there proper CUDA atomicLoad function? - cuda

I've faced with the issue that CUDA atomic API do not have atomicLoad function.
After searching on stackoverflow, I've found the following implementation of CUDA atomicLoad
But looks like this function is failed to work in following example:
#include <cassert>
#include <iostream>
#include <cuda_runtime_api.h>
template <typename T>
__device__ T atomicLoad(const T* addr) {
const volatile T* vaddr = addr; // To bypass cache
__threadfence(); // for seq_cst loads. Remove for acquire semantics.
const T value = *vaddr;
// fence to ensure that dependent reads are correctly ordered
__threadfence();
return value;
}
__global__ void initAtomic(unsigned& count, const unsigned initValue) {
count = initValue;
}
__global__ void addVerify(unsigned& count, const unsigned biasAtomicValue) {
atomicAdd(&count, 1);
// NOTE: When uncomment the following while loop the addVerify is stuck,
// it cannot read last proper value in variable count
// while (atomicLoad(&count) != (1024 * 1024 + biasAtomicValue)) {
// printf("count = %u\n", atomicLoad(&count));
// }
}
int main() {
std::cout << "Hello, CUDA atomics!" << std::endl;
const auto atomicSize = sizeof(unsigned);
unsigned* datomic = nullptr;
cudaMalloc(&datomic, atomicSize);
cudaStream_t stream;
cudaStreamCreate(&stream);
constexpr unsigned biasAtomicValue = 11;
initAtomic<<<1, 1, 0, stream>>>(*datomic, biasAtomicValue);
addVerify<<<1024, 1024, 0, stream>>>(*datomic, biasAtomicValue);
cudaStreamSynchronize(stream);
unsigned countHost = 0;
cudaMemcpyAsync(&countHost, datomic, atomicSize, cudaMemcpyDeviceToHost, stream);
assert(countHost == 1024 * 1024 + biasAtomicValue);
cudaStreamDestroy(stream);
return 0;
}
If you will uncomment the section with atomicLoad then application will stuck ...
Maybe I missed something ? Is there a proper way to load variable modified atomically ?
P.S.: I know there exists cuda::atomic implementation, but this API is not supported by my hardware

Since warps work in a lockstep manner (at least in old arch), if you put a conditional wait for one thread and a producer on another thread, both in same warp, then the warp could be stuck in the waiting if it starts/is executed first. Maybe only newest architecture that has asynchronous warp thread scheduling can do this. For example, you should query minor-major versions of cuda architecture before running this. Volta and onwards is ok.
Also you are launching 1million threads and waiting on all of them at once. GPU may not have that many execution ports/pipeline availability to have 1 million threads in-flight. Maybe it would work in only a GPU of 64k CUDA pipelines (assuming 16 threads in flight per pipeline). Instead of waiting on millions of threads, just spawn sub-kernels from main kernel when a condition occurs. Dynamic parallelism is the key feature. You should also check for the minimum minor-major cuda version to use dynamic parallelism just in case someone is using ancient nvidia cards.
Atomic-add command returns the old value in the target address. If you have meant to call a third kernel only once only after the condition, then you can simply check that returned value by an "if" before starting the dynamic parallelism.
You are printing for 1 million times, it is not good for performance and it may take some time before text appears in console output if you have a slow CPU/RAM.
Lastly, you can optimize performance of atomic operations by running them on shared memory first then going global atomic only once per block. This will miss the point of condition if there are more threads than the condition value (assuming always 1 increment value) so it may not be applicable for all algorithms.

Related

Can I run a CUDA device function without parallelization or calling it as part of a kernel?

I have a program that loads an image onto a CUDA device, analyzes it with cufft and some custom stuff, and updates a single number on the device which the host then queries as needed. The analysis is mostly parallelized, but the last step sums everything up (using thrust::reduce) for a couple final calculations that aren't parallel.
Once everything is reduced, there's nothing to parallelize, but I can't figure out how to just run a device function without calling it as its own tiny kernel with <<<1, 1>>>. That seems like a hack. Is there a better way to do this? Maybe a way to tell the parallelized kernel "just do these last lines once after the parallel part is finished"?
I feel like this must have been asked before, but I can't find it. Might just not know what to search for though.
Code snip below, I hope I didn't remove anything relevant:
float *d_phs_deltas; // Allocated using cudaMalloc (data is on device)
__device__ float d_Z;
static __global__ void getDists(const cufftComplex* data, const bool* valid, float* phs_deltas)
{
const int i = blockIdx.x*blockDim.x + threadIdx.x;
// Do stuff with the line indicated by index i
// ...
// Save result into array, gets reduced to single number in setDist
phs_deltas[i] = phs_delta;
}
static __global__ void setDist(const cufftComplex* data, const bool* valid, const float* phs_deltas)
{
// Final step; does it need to be it's own kernel if it only runs once??
d_Z += phs2dst * thrust::reduce(thrust::device, phs_deltas, phs_deltas + d_y);
// Save some other stuff to refer to next frame
// ...
}
void fftExec(unsigned __int32 *host_data)
{
// Copy image to device, do FFT, etc
// ...
// Last parallel analysis step, sets d_phs_deltas
getDists<<<out_blocks, N_THREADS>>>(d_result, d_valid, d_phs_deltas);
// Should this be a serial part at the end of getDists somehow?
setDist<<<1, 1>>>(d_result, d_valid, d_phs_deltas);
}
// d_Z is copied out only on request
void getZ(float *Z) { cudaMemcpyFromSymbol(Z, d_Z, sizeof(float)); }
Thank you!
There is no way to run a device function directly without launching a kernel. As pointed out in comments, there is a working example in the Programming Guide which shows how to use memory fence functions and an atomically incremented counter to signal that a given block is the last block:
__device__ unsigned int count = 0;
__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
__shared__ bool isLastBlockDone;
float partialSum = calculatePartialSum(array, N);
if (threadIdx.x == 0) {
result[blockIdx.x] = partialSum;
// Thread 0 makes sure that the incrementation
// of the "count" variable is only performed after
// the partial sum has been written to global memory.
__threadfence();
// Thread 0 signals that it is done.
unsigned int value = atomicInc(&count, gridDim.x);
// Thread 0 determines if its block is the last
// block to be done.
isLastBlockDone = (value == (gridDim.x - 1));
}
// Synchronize to make sure that each thread reads
// the correct value of isLastBlockDone.
__syncthreads();
if (isLastBlockDone) {
// The last block sums the partial sums
// stored in result[0 .. gridDim.x-1] float totalSum =
calculateTotalSum(result);
if (threadIdx.x == 0) {
// Thread 0 of last block stores the total sum
// to global memory and resets the count
// varilable, so that the next kernel call
// works properly.
result[0] = totalSum;
count = 0;
}
}
}
I would recommend benchmarking both ways and choosing which is faster. On most platforms kernel launch latency is only a few microseconds, so a short running kernel to finish an action after a long running kernel can be the most efficient way to get this done.

What is the exact behaviour of the CUDA execution?

Let's suppose we want to call a global function with the code that follows. Every single thread will have a curandState generator and an array of ints (both properly initialized) that we'll use in order to execute the following code sample:
#define NUMTHREADS 200
int main(){
int * result;
curandState * randState;
if (cudaMalloc(&randState, NUMTHREADS * sizeof(curandState)) == cudaErrorMemoryAllocation ||
cudaMalloc(&result, NUMTHREADS * sizeof(int)) == cudaErrorMemoryAllocation){
cudaDeviceReset();
exit(-1);
}
setup_cuRand <<<1, NUMTHREADS>>> (randState, unsigned(time(NULL)));
method <<<1, NUMTHREADS>>> (state,result);
return 1;
}
__global__ void setup_cuRand(curandState * state, unsigned long seed)
{
int id = threadIdx.x;
curand_init(seed, id, 0, &state[id]);
}
__global__ void generic method(curandState* state, int * result){
curandState localState = state[threadIdx.x];
int num = curand(&localState) % 100;
if(num > 50)
result[threadIdx.x] = threadIdx.x;
else
result[threadIdx.x] = -1;
}
What would be our execution? I mean, do the threads split into both codes magically and re-join later or how it works? are all 1024 threads in execution at once? this last question is because when i'm debugging on Visual Studio 2013, using Cuda Debugger, when i'm going forward, threadIdx.x allways has a value like n*32 and until now i tought that 1024 threads could be executed at the same time and now i'm doubtfull
The test is likely to be transformed into a predicate that will mean conditional assignment of some value in your region of memory. Should your if be more complex, the threads of a warp would magically join after the second part of the if clause. Depending on predicate for each thread of a warp, a branch might not even get visited.
When entering a breakpoint, the data will be shown for a specific thread/block id. Which thread/block is followed is given by the CUDA Debug Focus setting in NSIGHT for Visual Studio (While debugging with CUDA, enter the NSIGHT menu entry, and select Windows, then CUDA Debug Focus...) By default, thread 0,0,0 will be focused.
Threads are logically executed at the same time, but in practice, you have less than 1024 CUDA-cores per SM. The threads are organized into warps of 32, and warps are scheduled on different execution units by the instruction scheduler.
For 1024 threads, that is 32 warps, the first and last warp are not necessarily executed at the same time precisely.
See Memory Fence function in cuda documentation for more details, as well as Synchronization Functions.

How do i calculate the number of CUDA threads being launched?

I have a CUDA card with : Cuda Compute capability (3.5) If i have a call such as <<<2000,512>>> , what are the number of iterations that occur within the kernel? I thought it was (2000*512), but testing isn't proving this? I also want to confirm that the way I'm calculating the the variable is correct.
The situation is, within the kernel I am incrementing a passed global memory number based on the thread number :
int thr = blockDim.x * blockIdx.x + threadIdx.x;
worknumber = globalnumber + thr;
So, when I return back to the CPU, I want to know exactly how many increments there were so I can keep track so I don't repeat or skip numbers when I recall the kernel GPU to process my next set of numbers.
Edit :
__global__ void allin(uint64_t *lkey, const unsigned char *d_patfile)
{
uint64_t kkey;
int tmp;
int thr = blockDim.x * blockIdx.x + threadIdx.x;
kkey = *lkey + thr;
if (thr > tmp) {
tmp = thr;
printf("%u \n", thr);
}
}
If you launch a kernel with the configuration <<<X,Y>>>, and you have not violated any rules of CUDA usage, then the number of threads launched will, in fact, be X*Y (or a suitable modification of that if we are talking about 2 or 3 dimensional threadblocks and/or grids, i.e. X.x*X.y*X.z*Y.x*Y.y*Y.z ).
printf from a CUDA kernel has various limitations. Therefore, generating a large amount of printf output from a CUDA kernel is generally unwise and probably not useful for validating the number of threads launched in a large grid.
If you want to keep track of the number of threads that actually get launched, you could use a global variable and have each thread atomically update it. Something like this:
$ cat t848.cu
#include <stdio.h>
__device__ unsigned long long totThr = 0;
__global__ void mykernel(){
atomicAdd(&totThr, 1);
}
int main(){
mykernel<<<2000,512>>>();
unsigned long long total;
cudaMemcpyFromSymbol(&total, totThr, sizeof(unsigned long long));
printf("Total threads counted: %lu\n", total);
}
$ nvcc -o t848 t848.cu
$ cuda-memcheck ./t848
========= CUDA-MEMCHECK
Total threads counted: 1024000
========= ERROR SUMMARY: 0 errors
$
Note that atomic operations may be relatively slow. I wouldn't recommend making regular use of such a code for performance reasons. But if you want to convince yourself of the number of threads launched, it should give the correct answer.

Kernel Launch Failure

I'm operating on a Linux system and a Tesla C2075 machine. I am launching a kernel that is a modified version of the reduction kernel. My aim is to find the mean and a step by step averaged version(time_avg) of a large data set (result). See code below.
Size of "result" and "time_avg" is same and equal to "nsamps". "time_avg" contains successive averaged sets of the array result. So, first half contains averages of every two non-overlapping samples, the quarter after that has averages of every four non-overlapping samples, the next eighth of 8 samples and so on.
__global__ void timeavg_mean(float *result, unsigned int *nsamps, float *time_avg, float *mean) {
__shared__ float temp[1024];
int ltid = threadIdx.x, gtid = blockIdx.x*blockDim.x + threadIdx.x, stride;
int start = 0, index;
unsigned int npts = *nsamps;
printf("here here\n");
// Store chunk of memory=2*blockDim.x (which is to be reduced) into shared memory
if ( (2*gtid) < npts ){
temp[2*ltid] = result[2*gtid];
temp[2*ltid+1] = result[2*gtid + 1];
}
for (stride=1; stride<blockDim.x; stride>>=1) {
__syncthreads();
if (ltid % (stride*2) == 0){
if ( (2*gtid) < npts ){
temp[2*ltid] += temp[2*ltid + stride];
index = (int)(start + gtid/stride);
time_avg[index] = (float)( temp[2*ltid]/(2.0*stride) );
}
}
start += npts/(2*stride);
}
__syncthreads();
if (ltid == 0)
{
atomicAdd(mean, temp[0]);
}
__syncthreads();
printf("%f\n", *mean);
}
Launch configuration is 40 blocks, 512 threads. Data set is ~40k samples.
In my main code, I call cudaGetLastError() after the kernel call and it returns no error. Memory allocations and memory copies return no errors. If I write cudaDeviceSynchronize() (or a cudaMemcpy to check for the value of mean) after the kernel call, the program hangs completely after the kernel call. If I remove it, program runs and exits. In neither case, do I get the outputs "here here" or the mean value printed. I understand that unless the kernel executes successfully, the printf's won't print.
Has this got to do with __syncthreads() in a recursion? All threads will go till the same depth so I think that checks out.
What is the problem here?
Thank you!
A kernel call is asynchronous, if the kernel starts successfully your host code will continue to run and you will see no error. Errors that happen during the kernel run appear only after you do an explicit synchronization or call a function that causes an implicit synchronization.
If your host hangs on synchronization than your kernel probably didn't finished running - it is either running some infinite loop or it is waiting on some __synchthreads() or some other synchronization primitive.
Your code seems to contain an infinite loop: for (stride=1; stride<blockDim.x; stride>>=1). You probably want to shift the stride left not right: stride<<=1.
You mentioned recursion but your code contains only one __global__ function, there are no recursive calls.
Your kernel has an infinite loop. Replace the for loop with
for (stride=1; stride<blockDim.x; stride<<=1) {

Efficient method to check for matrix stability in CUDA

A number of algorithms iterate until a certain convergence criterion is reached (e.g. stability of a particular matrix). In many cases, one CUDA kernel must be launched per iteration. My question is: how then does one efficiently and accurately determine whether a matrix has changed over the course of the last kernel call? Here are three possibilities which seem equally unsatisfying:
Writing a global flag each time the matrix is modified inside the kernel. This works, but is highly inefficient and is not technically thread safe.
Using atomic operations to do the same as above. Again, this seems inefficient since in the worst case scenario one global write per thread occurs.
Using a reduction kernel to compute some parameter of the matrix (e.g. sum, mean, variance). This might be faster in some cases, but still seems like overkill. Also, it is possible to dream up cases where a matrix has changed but the sum/mean/variance haven't (e.g. two elements are swapped).
Is there any of the three options above, or an alternative, that is considered best practice and/or is generally more efficient?
I'll also go back to the answer I would have posted in 2012 but for a browser crash.
The basic idea is that you can use warp voting instructions to perform a simple, cheap reduction and then use zero or one atomic operations per block to update a pinned, mapped flag that the host can read after each kernel launch. Using a mapped flag eliminates the need for an explicit device to host transfer after each kernel launch.
This requires one word of shared memory per warp in the kernel, which is a small overhead, and some templating tricks can allow for loop unrolling if you provide the number of warps per block as a template parameter.
A complete working examplate (with C++ host code, I don't have access to a working PyCUDA installation at the moment) looks like this:
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <assert.h>
__device__ unsigned int process(int & val)
{
return (++val < 10);
}
template<int nwarps>
__global__ void kernel(int *inout, unsigned int *kchanged)
{
__shared__ int wchanged[nwarps];
unsigned int laneid = threadIdx.x % warpSize;
unsigned int warpid = threadIdx.x / warpSize;
// Do calculations then check for change/convergence
// and set tchanged to be !=0 if required
int idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tchanged = process(inout[idx]);
// Simple blockwise reduction using voting primitives
// increments kchanged is any thread in the block
// returned tchanged != 0
tchanged = __any(tchanged != 0);
if (laneid == 0) {
wchanged[warpid] = tchanged;
}
__syncthreads();
if (threadIdx.x == 0) {
int bchanged = 0;
#pragma unroll
for(int i=0; i<nwarps; i++) {
bchanged |= wchanged[i];
}
if (bchanged) {
atomicAdd(kchanged, 1);
}
}
}
int main(void)
{
const int N = 2048;
const int min = 5, max = 15;
std::vector<int> data(N);
for(int i=0; i<N; i++) {
data[i] = min + (std::rand() % (int)(max - min + 1));
}
int* _data;
size_t datasz = sizeof(int) * (size_t)N;
cudaMalloc<int>(&_data, datasz);
cudaMemcpy(_data, &data[0], datasz, cudaMemcpyHostToDevice);
unsigned int *kchanged, *_kchanged;
cudaHostAlloc((void **)&kchanged, sizeof(unsigned int), cudaHostAllocMapped);
cudaHostGetDevicePointer((void **)&_kchanged, kchanged, 0);
const int nwarps = 4;
dim3 blcksz(32*nwarps), grdsz(16);
// Loop while the kernel signals it needs to run again
do {
*kchanged = 0;
kernel<nwarps><<<grdsz, blcksz>>>(_data, _kchanged);
cudaDeviceSynchronize();
} while (*kchanged != 0);
cudaMemcpy(&data[0], _data, datasz, cudaMemcpyDeviceToHost);
cudaDeviceReset();
int minval = *std::min_element(data.begin(), data.end());
assert(minval == 10);
return 0;
}
Here, kchanged is the flag the kernel uses to signal it needs to run again to the host. The kernel runs until each entry in the input has been incremented to above a threshold value. At the end of each threads processing, it participates in a warp vote, after which one thread from each warp loads the vote result to shared memory. One thread reduces the warp result and then atomically updates the kchanged value. The host thread waits until the device is finished, and can then directly read the result from the mapped host variable.
You should be able to adapt this to whatever your application requires
I'll go back to my original suggestion. I've updated the related question with an answer of my own, which I believe is correct.
create a flag in global memory:
__device__ int flag;
at each iteration,
initialize the flag to zero (in host code):
int init_val = 0;
cudaMemcpyToSymbol(flag, &init_val, sizeof(int));
In your kernel device code, modify the flag to 1 if a change is made to the matrix:
__global void iter_kernel(float *matrix){
...
if (new_val[i] != matrix[i]){
matrix[i] = new_val[i];
flag = 1;}
...
}
after calling the kernel, at the end of the iteration (in host code), test for modification:
int modified = 0;
cudaMemcpyFromSymbol(&modified, flag, sizeof(int));
if (modified){
...
}
Even if multiple threads in separate blocks or even separate grids, are writing the flag value, as long as the only thing they do is write the same value (i.e. 1 in this case), there is no hazard. The write will not get "lost" and no spurious values will show up in the flag variable.
Testing float or double quantities for equality in this fashion is questionable, but that doesn't seem to be the point of your question. If you have a preferred method to declare "modification" use that instead (such as testing for equality within a tolerance, perhaps).
Some obvious enhancements to this method would be to create one (local) flag variable per thread, and have each thread update the global flag variable once per kernel, rather than on every modification. This would result in at most one global write per thread per kernel. Another approach would be to keep one flag variable per block in shared memory, and have all threads simply update that variable. At the completion of the block, one write is made to global memory (if necessary) to update the global flag. We don't need to resort to complicated reductions in this case, because there is only one boolean result for the entire kernel, and we can tolerate multiple threads writing to either a shared or global variable, as long as all threads are writing the same value.
I can't see any reason to use atomics, or how it would benefit anything.
A reduction kernel seems like overkill, at least compared to one of the optimized approaches (e.g. a shared flag per block). And it would have the drawbacks you mention, such as the fact that anything less than a CRC or similarly complicated computation might alias two different matrix results as "the same".