cudaStreamAddCallback doesn't block later cudaMemcpyAsync

I am trying to let a host-to-device cudaMemcpy wait for a specific event by using cudaStreamAddCallback. I found this comment in the cudaStreamCallback API documentation:
The callback will block later work in the stream until it is finished.
So I expected later work in the stream, such as cudaMemcpyAsync, to be blocked. But the assertion in the code below fails.
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>
#include <cassert>
#include <unistd.h>
#include <stdio.h>

#define cuda_check(x) \
    assert((x) == cudaSuccess)

const size_t size = 1024 * 1024;

static void CUDART_CB cuda_callback(
        cudaStream_t, cudaError_t, void* host) {
    float* host_A = static_cast<float*>(host);
    for (size_t i = 0; i < size; ++i) {
        host_A[i] = i;
    }
    printf("hello\n");
    sleep(1);
}

int main(void) {
    float* A;
    cuda_check(cudaMalloc(&A, size * 4));
    float* host_A = static_cast<float*>(malloc(size * 4));
    float* result = static_cast<float*>(malloc(size * 4));
    memset(host_A, 0, size * 4);
    cuda_check(cudaMemcpy(A, host_A, size * 4, cudaMemcpyHostToDevice));
    cudaStream_t stream;
    cuda_check(cudaStreamCreate(&stream));
    cuda_check(cudaStreamAddCallback(stream, cuda_callback, host_A, 0));
    cuda_check(cudaMemcpyAsync(A, host_A, size * 4, cudaMemcpyHostToDevice,
                               stream));
    cuda_check(cudaStreamSynchronize(stream));
    cuda_check(cudaMemcpy(result, A, size * 4, cudaMemcpyDeviceToHost));
    for (size_t i = 0; i < size; ++i) {
        assert(result[i] == i);
    }
    return 0;
}

Your assumption about what is happening isn't really correct. If I use the profiler to collect the runtime API trace for your code (the cudaDeviceReset call was added by me to ensure profiling data is flushed), I see this:
124.79ms 129.57ms cudaMalloc
255.23ms 694.20us cudaMemcpy
255.93ms 38.881us cudaStreamCreate
255.97ms 123.44us cudaStreamAddCallback
256.09ms 1.00348s cudaMemcpyAsync
1.25957s 76.899us cudaStreamSynchronize
1.25965s 1.3067ms cudaMemcpy
1.26187s 71.884ms cudaDeviceReset
As you can see, the cudaMemcpyAsync did get blocked by the callback (it took > 1.0 second to finish).
The fact that the copy did not happen in the sequence you expected is likely caused by the fact that you are using a regular pageable host allocation, not pinned memory, and expecting the callback to fire instantly on the empty queue. Note that registering the stream callback and starting the copy occur less than 0.1 milliseconds apart, and the callback might not fire immediately (it runs in another thread), leaving the possibility that the copy starts before the callback function reacts to the empty-queue condition.
Interestingly, if I change host_A to a pinned allocation and run the code I get this API timeline:
124.21ms 130.24ms cudaMalloc
254.45ms 1.0988ms cudaHostAlloc
255.98ms 376.14us cudaMemcpy
256.36ms 33.841us cudaStreamCreate
256.39ms 87.303us cudaStreamAddCallback
256.48ms 17.208us cudaMemcpyAsync
256.50ms 1.00331s cudaStreamSynchronize
1.25981s 1.2880ms cudaMemcpy
1.26205s 68.506ms cudaDeviceReset
Note now that the cudaStreamSynchronize is the call which is blocked. But in this case the program passes the assert, which is probably related to the scheduler correctly managing dependencies in the stream given the host memory is pinned.
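For completeness, a minimal sketch of that pinned-allocation change (my reconstruction, not the exact code profiled above; everything else in main() stays as in the question):

    float* host_A = nullptr;
    cuda_check(cudaHostAlloc((void**)&host_A, size * 4, cudaHostAllocDefault));
    // ... same memset, cudaMemcpy, callback registration, and
    //     cudaMemcpyAsync as in the original program ...
    cuda_check(cudaFreeHost(host_A));  // pinned memory is released with cudaFreeHost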

Related

Delete a cudaMalloc allocated memory inside kernel

I want to delete an array allocated by cudaMalloc inside a kernel using delete[], but the memory checker shows an access violation; the array is kept in memory and the kernel continues to execute.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__global__ void kernel(int *a)
{
int *b = new int[10];
delete[] b; // no violation
delete[] a; // Memory Checker detects access violation.
}
int main()
{
int *d_a;
cudaMalloc(&d_a, 10 * sizeof(int));
kernel<<<1, 1>>>(d_a);
return 0;
}
What is the difference between memory allocated by cudaMalloc and new in device code?
Is it possible to delete memory allocated by cudaMalloc in device code?
Thanks
cudaMalloc in host code and new (or malloc) in device code allocate out of logically separate areas. The two areas are not generally interoperable from an API standpoint.
No. Memory allocated with cudaMalloc in host code must be freed with cudaFree in host code.
You may wish to read the documentation. The description given there for in-kernel malloc and free generally applies to in-kernel new and delete as well.
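As an illustrative sketch (mine, not from the thread): in-kernel new/delete operate on the device heap, whose size can be raised with cudaDeviceSetLimit, while a cudaMalloc allocation is released from the host with cudaFree:

    #include <cuda_runtime.h>

    __global__ void kernel(int *a)
    {
        int *b = new int[10];  // allocated on the device heap
        delete[] b;            // valid: same heap
        // 'a' came from cudaMalloc and must not be deleted here
    }

    int main()
    {
        // Optionally enlarge the device heap used by in-kernel new/malloc
        cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);
        int *d_a;
        cudaMalloc(&d_a, 10 * sizeof(int));
        kernel<<<1, 1>>>(d_a);
        cudaDeviceSynchronize();
        cudaFree(d_a);         // the correct way to release cudaMalloc memory
        return 0;
    }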

synchronizing device memory access with host thread

Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, it doesn't seem that the kernel waits for the writes to device-mapped memory to complete before terminating because examining the page-locked host memory immediately after the kernel launch does not show any modification of the memory (unless a delay is inserted or the call to cudaDeviceSynchronize is uncommented):
#include <stdio.h>
#include <cuda.h>

__global__ void func(int *a, int N) {
    int idx = threadIdx.x;
    if (idx < N) {
        a[idx] *= -1;
        __threadfence_system();
    }
}

int main(void) {
    int *a, *a_gpu;
    const int N = 8;
    size_t size = N*sizeof(int);
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **) &a, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);
    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");
    func<<<1, N>>>(a_gpu, N);
    // cudaDeviceSynchronize();
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");
    cudaFreeHost(a);
}
I'm compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).
A kernel launch will immediately return to the host code before any kernel activity has occurred. Kernel execution is in this way asynchronous to host execution and does not block host execution. So it's no surprise that you have to wait a bit or else use a barrier (like cudaDeviceSynchronize()) to see the results of the kernel.
As described here:
In order to facilitate concurrent execution between host and device,
some function calls are asynchronous: Control is returned to the host
thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
This is all intentional, of course, so that you can use the GPU and CPU simultaneously. If you don't want this behavior, a simple solution, as you've already discovered, is to insert a barrier. If your kernel is producing data which you will immediately copy back to the host, you don't need a separate barrier: the cudaMemcpy call after the kernel will wait until the kernel is completed before it begins its copy operation.
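For instance (a sketch with hypothetical names, not code from the question):

    // 'my_kernel', 'd_out', and 'h_out' are hypothetical.
    my_kernel<<<grid, block>>>(d_out, N);
    // No explicit barrier needed: this blocking cudaMemcpy waits for
    // the kernel to complete before starting the copy.
    cudaMemcpy(h_out, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);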
I guess to answer your question: you want kernel launches to be synchronous without even having to use a barrier (why do you want to do this? Is adding the cudaDeviceSynchronize() call a problem?). It's possible to do this:
"Programmers can globally disable asynchronous kernel launches for all
CUDA applications running on a system by setting the
CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is
provided for debugging purposes only and should never be used as a way
to make production software run reliably. "
If you want this synchronous behavior, it's better just to use the barriers (or depend on another subsequent CUDA call, like cudaMemcpy). If you use the above method and depend on it, your code will break as soon as somebody else tries to run it without the environment variable set. So it's really not a good idea.

Does my code run 1000 instances of 1000 iterations of my non linear recursive equation?

From my understanding of CUDA C, each thread executes an instance of the equation. But how do I print out all the values? The code actually works, but I really need someone to review it for me to confirm my result is actually in line with what I set out to design.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <conio.h>
#include <cuda.h>
#include <cutil.h>

__global__ void my_compute(float *y_d, float *theta_d, float *u_d)
{
    int idx = threadIdx.x + blockIdx.x * gridDim.x;
    for (idx = 7; idx < 1000; idx++)
    {
        y_d[idx] = theta_d[0]*y_d[idx-1] + theta_d[1]*y_d[idx-3] +
                   theta_d[2]*u_d[idx-5]*u_d[idx-4] + theta_d[3] +
                   theta_d[4]*u_d[idx-6] + theta_d[5]*u_d[idx-4]*y_d[idx-6] +
                   theta_d[6]*u_d[idx-7] + theta_d[7]*u_d[idx-7]*u_d[idx-6] +
                   theta_d[8]*y_d[idx-4] + theta_d[9]*y_d[idx-5] +
                   theta_d[10]*u_d[idx-4]*y_d[idx-5] + theta_d[11]*u_d[idx-4]*y_d[idx-2] +
                   theta_d[12]*u_d[idx-7]*u_d[idx-3] + theta_d[13]*u_d[idx-5] +
                   theta_d[14]*u_d[idx-4];
    }
}

int main(void)
{
    float y[1000];
    FILE *fpoo;
    FILE *u;
    float theta[15];
    float u_data[1000];
    float *y_d;
    float *theta_d;
    float *u_d;
    cudaEvent_t start, stop;
    float time;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // memory allocation
    cudaMalloc((void**)&y_d, 1000*sizeof(float));
    cudaMalloc((void**)&theta_d, 15*sizeof(float));
    cudaMalloc((void**)&u_d, 1000*sizeof(float));
    cudaEventRecord(start, 0);

    // importing data for theta and input of model
    fpoo = fopen("c:\\Fly_theta.txt", "r");
    u = fopen("c:\\Fly_u.txt", "r");
    for (int k = 0; k < 15; k++)
    {
        fscanf(fpoo, "%f\n", &theta[k]);
    }
    for (int k = 0; k < 1000; k++)
    {
        fscanf(u, "%f\n", &u_data[k]);
    }

    // NB: does this for loop below make my equation run 1000000
    // instances as opposed to the 1000 instances I desire?
    for (int i = 0; i < 1000; i++)
    {
        // I initialised the first 7 values of y because the equation output
        // starts from y(8)
        for (int k = 0; k < 8; k++)
        {
            y[k] = 0;
            cudaMemcpy(y_d, y, 1000*sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(theta_d, theta, 15*sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(u_d, u_data, 1000*sizeof(float), cudaMemcpyHostToDevice);
            // calling kernel function
            my_compute<<<200, 5>>>(y_d, theta_d, u_d);
            cudaMemcpy(y, y_d, 1000*sizeof(float), cudaMemcpyDeviceToHost);
        }
        printf("\n\n*******Iteration %i*******\n", i);
        // does this actually print all the values from the threads?
        for (int i = 0; i < 1000; i++)
        {
            printf("%f", y[i]);
        }
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("Time to generate: %3.1f ms \n", time);
    cudaFree(y_d);
    cudaFree(theta_d);
    cudaFree(u_d);
    fclose(u);
    fclose(fpoo);
    _getche();
    return 0;
}
how do i print out all the entire values.
Well, you can copy it to host (which you do already) and print it out normally?
However, I am worried about your code for several reasons:
Only the threads belonging to the same warp are executed truly in parallel. A warp is a collection of 32 adjacent threads (something like warpId = threadIdx.x/32). Threads belonging to different warps can execute in any order, unless you apply some synchronization functions.
Because of the above, you cannot say much about y_d[idx-1] when computing y_d[idx]. Was y_d[idx-1] already computed/overwritten by the other thread or not?
You have only 5 threads in your block (<<<200,5>>>), but because blocks are launched at warp granularity (a multiple of 32 threads), each block you launch will keep 5 threads running and 27 lanes idling.
You are not using parallelism at all! You have a for loop which is executed by all 1000 threads. All 1000 threads compute exactly the same thing (modulo the race conditions). You compute the thread index idx, but then completely ignore it and set idx to 7 for all threads.
I would strongly suggest --- as an exercise for launch configuration, synchronization, thread indexing --- implementing a parallel prefix-sum algorithm, and only after confirming that it works correctly, doing this more advanced stuff...
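As a concrete side note on the indexing point above, here is a minimal sketch of the canonical one-element-per-thread pattern (note blockDim.x, the number of threads per block, rather than gridDim.x; the recursive equation itself cannot be parallelized this way, since y[idx] depends on earlier y values):

    __global__ void scale_kernel(float *y_d, int n)
    {
        // Canonical global thread index.
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx < n) {
            y_d[idx] *= 2.0f;  // placeholder per-element work
        }
    }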

Equivalent of usleep() in CUDA kernel?

I'd like to call something like usleep() inside a CUDA kernel. The basic goal is to make all GPU cores sleep or busy-wait for a number of milliseconds; it's part of some sanity checks that I want to do for a CUDA application. My attempt at doing this is below:
#include <unistd.h>
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>

__global__ void gpu_uSleep(useconds_t wait_time_in_ms)
{
    usleep(wait_time_in_ms);
}

int main(void)
{
    //input parameters -- arbitrary
    // TODO: set these exactly for full occupancy
    int m = 16;
    int n = 16;
    int block1D = 16;
    dim3 block(block1D, block1D);
    dim3 grid(m/block1D, n/block1D);
    useconds_t wait_time_in_ms = 1000;

    //execute the kernel
    gpu_uSleep<<< grid, block >>>(wait_time_in_ms);
    cudaDeviceSynchronize();
    return 0;
}
I get the following error when I try to compile this using NVCC:
error: calling a host function("usleep") from a __device__/__global__
function("gpu_uSleep") is not allowed
Clearly, I'm not allowed to use a host function such as usleep() inside a kernel. What would be a good alternative to this?
You can spin on clock() or clock64(). The CUDA SDK concurrentKernels sample does the following:
__global__ void clock_block(clock_t *d_o, clock_t clock_count)
{
    clock_t start_clock = clock();
    clock_t clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock() - start_clock;
    }
    d_o[0] = clock_offset;
}
I recommend using clock64(). clock() and clock64() count in cycles, so you will have to query the clock frequency using cudaGetDeviceProperties(). The frequency can be dynamic, so it will be hard to guarantee an accurate spin loop.
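A rough sketch of that query (assuming device 0; prop.clockRate is reported in kHz, i.e. cycles per millisecond, and dynamic clocking means the result is only approximate):

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // prop.clockRate is in kHz, i.e. clock cycles per millisecond.
    // 'wait_time_ms' is a hypothetical desired wait in milliseconds.
    long long cycles_to_wait = (long long)prop.clockRate * wait_time_ms;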
You can busy wait with a loop that reads clock().
To wait for at least 10,000 clock cycles:
clock_t start = clock();
clock_t now;
for (;;) {
    now = clock();
    clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
    if (cycles >= 10000) {
        break;
    }
}
// Store "now" in global memory here to prevent the
// compiler from optimizing away the entire loop.
*global_now = now;
Note: This is untested. The code that handles overflows was borrowed from this answer by @Pedro. See his answer and section B.10 in the CUDA C Programming Guide 4.2 for details on how clock() works. There is also a clock64() command.
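A minimal sketch of the clock64() variant (my sketch; the 64-bit counter makes the overflow handling above unnecessary for any realistic wait):

    __device__ void spin_cycles(long long cycles)
    {
        long long start = clock64();
        // clock64() is 64-bit, so wraparound is not a practical concern.
        while (clock64() - start < cycles) { }
    }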
With recent versions of CUDA, and a device with Compute Capability 7.0 or later (Volta, Turing, Ampere etc.), you can use the __nanosleep() primitive:
void __nanosleep(unsigned ns);
which obviates the need for busy-sleeping as suggested in older answers.
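For example, a minimal sketch (compile for sm_70 or later, e.g. with -arch=sm_70):

    __global__ void sleep_kernel(unsigned ns)
    {
        // Each thread suspends for approximately ns nanoseconds;
        // the duration is a hint, not a hard guarantee.
        __nanosleep(ns);
    }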
The best way to "sleep the cores" is to have the kernel return, to the CPU, and then have the CPU just launch a second kernel (or the same kernel again). This prevents the GPUs from having to spin & overheat.

Implementing a critical section in CUDA

I'm trying to implement a critical section in CUDA using atomic instructions, but I ran into some trouble. I have created a test program to show the problem:
#include <cuda_runtime.h>
#include <cutil_inline.h>
#include <stdio.h>

__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    while (atomicExch(&(locks[id]), 1u) != 0u) {} //lock
    //critical section would go here
    atomicExch(&(locks[id]), 0u); //unlock
}

int main(int argc, char** argv) {
    //initialize the locks array on the GPU to (0...0)
    unsigned int* locks;
    unsigned int zeros[10]; for (int i = 0; i < 10; i++) { zeros[i] = 0u; }
    cutilSafeCall(cudaMalloc((void**)&locks, sizeof(unsigned int)*10));
    cutilSafeCall(cudaMemcpy(locks, zeros, sizeof(unsigned int)*10, cudaMemcpyHostToDevice));

    //Run the kernel:
    k_testLocking<<<dim3(1), dim3(256)>>>(locks, 10);

    //Check the error messages:
    cudaError_t error = cudaGetLastError();
    cutilSafeCall(cudaFree(locks));
    if (cudaSuccess != error) {
        printf("error 1: CUDA ERROR (%d) {%s}\n", error, cudaGetErrorString(error));
        exit(-1);
    }
    return 0;
}
This code, unfortunately, hard freezes my machine for several seconds and finally exits, printing out the message:
fcudaSafeCall() Runtime API error in file <XXX.cu>, line XXX : the launch timed out and was terminated.
which means that one of those while loops is not returning, but it seems like this should work.
As a reminder, atomicExch(unsigned int* address, unsigned int val) atomically sets the value of the memory location stored in address to val and returns the old value. So the idea behind my locking mechanism is that the lock is initially 0u, so one thread should get past the while loop and all other threads should wait on the while loop since they will read locks[id] as 1u. Then, when the thread is done with the critical section, it resets the lock to 0u so another thread can enter.
What am I missing?
By the way, I am compiling with:
nvcc -arch sm_11 -Ipath/to/cuda/C/common/inc XXX.cu
Okay, I figured it out, and this is yet-another-one-of-the-cuda-paradigm-pains.
As any good CUDA programmer knows (notice that I did not remember this, which makes me a bad CUDA programmer, I think), all threads in a warp must execute the same code. The code I wrote would work perfectly if not for this fact. As it is, however, there are likely to be two threads in the same warp accessing the same lock. If one of them acquires the lock, it exits the loop, but it cannot continue past the loop until all other threads in its warp have also completed the loop. Unfortunately the other thread will never complete, because it is waiting for the first one to unlock.
Here is a kernel that will do the trick without error:
__global__ void k_testLocking(unsigned int* locks, int n) {
    int id = threadIdx.x % n;
    bool leaveLoop = false;
    while (!leaveLoop) {
        if (atomicExch(&(locks[id]), 1u) == 0u) {
            //critical section
            leaveLoop = true;
            atomicExch(&(locks[id]), 0u);
        }
    }
}
The poster has already found an answer to his own issue. Nevertheless, in the code below, I'm providing a general framework to implement a critical section in CUDA. More in detail, the code performs a block count, but it is easily modifiable to host other operations to be performed in a critical section. Below, I also report some explanation of the code, along with some "typical" mistakes in the implementation of critical sections in CUDA.
THE CODE
#include <stdio.h>
#include "Utilities.cuh"

#define NUMBLOCKS  512
#define NUMTHREADS 512 * 2

/***************/
/* LOCK STRUCT */
/***************/
struct Lock {

    int *d_state;

    // --- Constructor
    Lock(void) {
        int h_state = 0;                                       // --- Host side lock state initializer
        gpuErrchk(cudaMalloc((void **)&d_state, sizeof(int))); // --- Allocate device side lock state
        gpuErrchk(cudaMemcpy(d_state, &h_state, sizeof(int), cudaMemcpyHostToDevice)); // --- Initialize device side lock state
    }

    // --- Destructor
    __host__ __device__ ~Lock(void) {
#if !defined(__CUDACC__)
        gpuErrchk(cudaFree(d_state));
#else
#endif
    }

    // --- Lock function
    __device__ void lock(void) { while (atomicCAS(d_state, 0, 1) != 0); }

    // --- Unlock function
    __device__ void unlock(void) { atomicExch(d_state, 0); }
};

/*************************************/
/* BLOCK COUNTER KERNEL WITHOUT LOCK */
/*************************************/
__global__ void blockCountingKernelNoLock(int *numBlocks) {
    if (threadIdx.x == 0) { numBlocks[0] = numBlocks[0] + 1; }
}

/**********************************/
/* BLOCK COUNTER KERNEL WITH LOCK */
/**********************************/
__global__ void blockCountingKernelLock(Lock lock, int *numBlocks) {
    if (threadIdx.x == 0) {
        lock.lock();
        numBlocks[0] = numBlocks[0] + 1;
        lock.unlock();
    }
}

/****************************************/
/* BLOCK COUNTER KERNEL WITH WRONG LOCK */
/****************************************/
__global__ void blockCountingKernelDeadlock(Lock lock, int *numBlocks) {
    lock.lock();
    if (threadIdx.x == 0) { numBlocks[0] = numBlocks[0] + 1; }
    lock.unlock();
}

/********/
/* MAIN */
/********/
int main() {

    int h_counting, *d_counting;
    Lock lock;

    gpuErrchk(cudaMalloc(&d_counting, sizeof(int)));

    // --- Unlocked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));
    blockCountingKernelNoLock<<<NUMBLOCKS, NUMTHREADS>>>(d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the unlocked case: %i\n", h_counting);

    // --- Locked case
    h_counting = 0;
    gpuErrchk(cudaMemcpy(d_counting, &h_counting, sizeof(int), cudaMemcpyHostToDevice));
    blockCountingKernelLock<<<NUMBLOCKS, NUMTHREADS>>>(lock, d_counting);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());
    gpuErrchk(cudaMemcpy(&h_counting, d_counting, sizeof(int), cudaMemcpyDeviceToHost));
    printf("Counting in the locked case: %i\n", h_counting);

    gpuErrchk(cudaFree(d_counting));
}
CODE EXPLANATION
Critical sections are sequences of operations that must be executed sequentially by the CUDA threads.
Suppose we want to construct a kernel which has the task of computing the number of thread blocks in a thread grid. One possible idea is to let the thread having threadIdx.x == 0 in each block increase a global counter. To prevent race conditions, all the increases must occur sequentially, so they must be incorporated in a critical section.
The above code has two kernel functions: blockCountingKernelNoLock and blockCountingKernelLock. The former does not use a critical section to increase the counter and, as one can see, returns wrong results. The latter encapsulates the counter increase within a critical section and so produces correct results. But how does the critical section work?
The critical section is governed by a global state d_state. Initially, the state is 0. Furthermore, two __device__ methods, lock and unlock, can change this state. The lock and unlock methods can be invoked only by a single thread within each block and, in particular, by the thread having local thread index threadIdx.x == 0.
Randomly during the execution, one of the threads having local thread index threadIdx.x == 0 and global thread index, say, t, will be the first to invoke the lock method. In particular, it will launch atomicCAS(d_state, 0, 1). Since initially d_state == 0, d_state will be updated to 1, atomicCAS will return 0, and the thread will exit the lock function and move on to the update instruction. While that thread performs these operations, all the other threads with threadIdx.x == 0 in the other blocks will execute the lock method. They will, however, find a value of d_state equal to 1, so atomicCAS(d_state, 0, 1) will perform no update and will return 1, leaving these threads spinning in the while loop. After thread t accomplishes the update, it executes the unlock function, namely atomicExch(d_state, 0), thus restoring d_state to 0. At this point, randomly, another of the threads with threadIdx.x == 0 will lock the state again.
The above code also contains a third kernel function, namely blockCountingKernelDeadlock. However, this is another wrong implementation of the critical section, leading to deadlocks. Indeed, we recall that warps operate in lockstep and synchronize after every instruction. So, when we execute blockCountingKernelDeadlock, there is the possibility that one of the threads in a warp, say a thread with local thread index t ≠ 0, will lock the state. Under this circumstance, the other threads in the same warp as t, including the one with threadIdx.x == 0, will execute the same while loop statement as thread t, since the execution of threads in the same warp proceeds in lockstep. Accordingly, all the threads will wait for someone to unlock the state, but no other thread will be able to do so, and the code will be stuck in a deadlock.
By the way, you have to remember that global memory writes (and reads!) aren't necessarily completed at the point where you write them in the code, so to make this work in practice you need to add a global memory fence, i.e. __threadfence().
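A sketch of what that suggests for the Lock methods above (the fence placement is my assumption, not code from the thread):

    __device__ void lock(void) {
        while (atomicCAS(d_state, 0, 1) != 0);
        __threadfence();   // order the acquire before critical-section accesses
    }

    __device__ void unlock(void) {
        __threadfence();   // make critical-section writes visible before release
        atomicExch(d_state, 0);
    }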