Equivalent of usleep() in CUDA kernel?

I'd like to call something like usleep() inside a CUDA kernel. The basic goal is to make all GPU cores sleep or busy-wait for a number of milliseconds; it's part of some sanity checks that I want to do for a CUDA application. My attempt at doing this is below:
#include <unistd.h>
#include <stdio.h>
#include <cuda.h>
#include <sys/time.h>

__global__ void gpu_uSleep(useconds_t wait_time_in_ms)
{
    usleep(wait_time_in_ms);
}

int main(void)
{
    // input parameters -- arbitrary
    // TODO: set these exactly for full occupancy
    int m = 16;
    int n = 16;
    int block1D = 16;
    dim3 block(block1D, block1D);
    dim3 grid(m / block1D, n / block1D);
    useconds_t wait_time_in_ms = 1000;

    // execute the kernel
    gpu_uSleep<<<grid, block>>>(wait_time_in_ms);
    cudaDeviceSynchronize();
    return 0;
}
I get the following error when I try to compile this using NVCC:
error: calling a host function("usleep") from a __device__/__global__
function("gpu_uSleep") is not allowed
Clearly, I'm not allowed to use a host function such as usleep() inside a kernel. What would be a good alternative to this?

You can spin on clock() or clock64(). The CUDA SDK concurrentKernels sample does the following:
__global__ void clock_block(clock_t *d_o, clock_t clock_count)
{
    clock_t start_clock = clock();
    clock_t clock_offset = 0;
    while (clock_offset < clock_count)
    {
        clock_offset = clock() - start_clock;
    }
    d_o[0] = clock_offset;
}
I recommend using clock64(). clock() and clock64() report cycles, so you will have to query the clock frequency using cudaGetDeviceProperties(). The frequency can change dynamically, so it is hard to guarantee an accurate spin loop.
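For instance, a rough, untested sketch of converting a desired wait in microseconds into a cycle count (wait_time_in_us is a placeholder of mine; prop.clockRate is the peak clock in kHz, so under dynamic frequency scaling the actual wait may be longer than requested):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);  // query device 0
// clockRate is in kHz: kHz * us / 1000 = cycles
long long cycles_to_wait = (long long)prop.clockRate * wait_time_in_us / 1000;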

You can busy wait with a loop that reads clock().
To wait for at least 10,000 clock cycles:
clock_t start = clock();
clock_t now;
for (;;) {
    now = clock();
    clock_t cycles = now > start ? now - start : now + (0xffffffff - start);
    if (cycles >= 10000) {
        break;
    }
}
// Store "now" in global memory here to prevent the
// compiler from optimizing away the entire loop.
*global_now = now;
Note: this is untested. The code that handles overflow was borrowed from this answer by @Pedro. See his answer and Section B.10 of the CUDA C Programming Guide 4.2 for details on how clock() works. There is also a clock64() function.

With recent versions of CUDA, and a device of compute capability 7.0 or later (Volta, Turing, Ampere, etc.), you can use the __nanosleep() primitive:
void __nanosleep(unsigned ns);
which obviates the need for busy-sleeping as suggested in older answers.
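For instance, a minimal sleep kernel might look like the sketch below; note that the sleep duration is approximate, and a single call is capped at an implementation-defined maximum of roughly one millisecond:

__global__ void gpu_sleep(unsigned ns)
{
    // Suspends the calling thread for approximately ns nanoseconds.
    __nanosleep(ns);
}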

The best way to "sleep the cores" is to have the kernel return to the CPU, and then have the CPU launch a second kernel (or the same kernel again). This keeps the GPU from having to spin and overheat.
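A sketch of that pattern, with hypothetical kernel and parameter names:

// Do the waiting on the host between launches instead of spinning
// on the device; work_kernel and its arguments are placeholders.
for (int pass = 0; pass < num_passes; ++pass) {
    work_kernel<<<grid, block>>>(args);
    cudaDeviceSynchronize();  // wait for this pass to finish
    usleep(wait_time_in_us);  // sleep on the CPU, not the GPU
}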

Related

Is there a proper CUDA atomicLoad function?

I've run into the issue that the CUDA atomics API does not have an atomicLoad function.
After searching on Stack Overflow, I found the following implementation of a CUDA atomicLoad, but it seems the function fails to work in the following example:
#include <cassert>
#include <iostream>
#include <cuda_runtime_api.h>

template <typename T>
__device__ T atomicLoad(const T* addr) {
    const volatile T* vaddr = addr;  // to bypass cache
    __threadfence();                 // for seq_cst loads; remove for acquire semantics
    const T value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

__global__ void initAtomic(unsigned& count, const unsigned initValue) {
    count = initValue;
}

__global__ void addVerify(unsigned& count, const unsigned biasAtomicValue) {
    atomicAdd(&count, 1);
    // NOTE: When the following while loop is uncommented, addVerify gets
    // stuck; it cannot read the last proper value in the variable count
    // while (atomicLoad(&count) != (1024 * 1024 + biasAtomicValue)) {
    //     printf("count = %u\n", atomicLoad(&count));
    // }
}

int main() {
    std::cout << "Hello, CUDA atomics!" << std::endl;
    const auto atomicSize = sizeof(unsigned);

    unsigned* datomic = nullptr;
    cudaMalloc(&datomic, atomicSize);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    constexpr unsigned biasAtomicValue = 11;
    initAtomic<<<1, 1, 0, stream>>>(*datomic, biasAtomicValue);
    addVerify<<<1024, 1024, 0, stream>>>(*datomic, biasAtomicValue);
    cudaStreamSynchronize(stream);

    unsigned countHost = 0;
    cudaMemcpyAsync(&countHost, datomic, atomicSize, cudaMemcpyDeviceToHost, stream);
    assert(countHost == 1024 * 1024 + biasAtomicValue);

    cudaStreamDestroy(stream);
    return 0;
}
If you uncomment the section with atomicLoad, the application gets stuck ...
Did I miss something? Is there a proper way to load a variable that is modified atomically?
P.S.: I know there is a cuda::atomic implementation, but that API is not supported by my hardware.
Since warps work in a lockstep manner (at least on older architectures), if you put a conditional wait on one thread and the producer on another thread in the same warp, the warp could get stuck in the wait if it starts executing first. Perhaps only the newest architectures, with independent thread scheduling within a warp, can handle this, so you should query the major/minor compute capability of the device before running this. Volta and onwards is OK.
Also, you are launching a million threads and waiting on all of them at once. The GPU may not have enough execution ports/pipelines to keep a million threads in flight; it might only work on a GPU with 64k CUDA pipelines (assuming 16 threads in flight per pipeline). Instead of waiting on millions of threads, just spawn sub-kernels from the main kernel when the condition occurs; dynamic parallelism is the key feature here. You should also check the minimum compute capability required for dynamic parallelism, just in case someone is using an ancient NVIDIA card.
The atomic-add instruction returns the old value at the target address. If you meant to call a third kernel only once, and only after the condition is met, then you can simply check that returned value with an "if" before starting the dynamic-parallelism launch.
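A hypothetical sketch of that idea (the kernel and names below are mine, not from the question; the follow-up launch needs dynamic parallelism, so it is left as a comment):

__global__ void addAndMaybeLaunch(unsigned* count, unsigned target)
{
    // atomicAdd returns the value *before* the increment, so exactly
    // one thread in the whole grid observes old + 1 == target.
    unsigned old = atomicAdd(count, 1);
    if (old + 1 == target) {
        // thirdKernel<<<grid, block>>>(...);  // dynamic-parallelism launch
    }
}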
You are printing a million times; that is not good for performance, and it may take a while before the text appears in the console output if you have a slow CPU/RAM.
Lastly, you can optimize the performance of atomic operations by running them in shared memory first and then issuing a global atomic only once per block. This misses the point of the condition if there are more threads than the condition value (assuming an increment of 1 each time), so it may not be applicable to all algorithms.
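A sketch of that block-level aggregation, assuming every thread contributes an increment of 1:

__global__ void aggregatedIncrement(unsigned* globalCount)
{
    __shared__ unsigned blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    atomicAdd(&blockCount, 1u);              // cheap shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount);  // one global atomic per block
}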

cudaStreamAddCallback doesn't block later cudaMemcpyAsync

I am trying to make a host-to-device cudaMemcpyAsync wait for a specific event by using cudaStreamAddCallback. I found this in the comments on the cudaStreamCallback API:
The callback will block later work in the stream until it is finished.
So I expected later work in the stream, such as the cudaMemcpyAsync, to be blocked. But the assertion in the code below fails.
#include <cuda_runtime.h>
#include <stdlib.h>
#include <string.h>
#include <cassert>
#include <unistd.h>
#include <stdio.h>

#define cuda_check(x) \
    assert((x) == cudaSuccess)

const size_t size = 1024 * 1024;

static void CUDART_CB cuda_callback(
        cudaStream_t, cudaError_t, void* host) {
    float* host_A = static_cast<float*>(host);
    for (size_t i = 0; i < size; ++i) {
        host_A[i] = i;
    }
    printf("hello\n");
    sleep(1);
}

int main(void) {
    float* A;
    cuda_check(cudaMalloc(&A, size * 4));
    float* host_A = static_cast<float*>(malloc(size * 4));
    float* result = static_cast<float*>(malloc(size * 4));
    memset(host_A, 0, size * 4);
    cuda_check(cudaMemcpy(A, host_A, size * 4, cudaMemcpyHostToDevice));

    cudaStream_t stream;
    cuda_check(cudaStreamCreate(&stream));
    cuda_check(cudaStreamAddCallback(stream, cuda_callback, host_A, 0));
    cuda_check(cudaMemcpyAsync(A, host_A, size * 4, cudaMemcpyHostToDevice,
                               stream));
    cuda_check(cudaStreamSynchronize(stream));

    cuda_check(cudaMemcpy(result, A, size * 4, cudaMemcpyDeviceToHost));
    for (size_t i = 0; i < size; ++i) {
        assert(result[i] == i);
    }
    return 0;
}
Your assumption about what is happening isn't really correct. If I use the profiler to collect the runtime API trace for your code (I added a cudaDeviceReset call to ensure profiling data is flushed), I see this:
124.79ms 129.57ms cudaMalloc
255.23ms 694.20us cudaMemcpy
255.93ms 38.881us cudaStreamCreate
255.97ms 123.44us cudaStreamAddCallback
256.09ms 1.00348s cudaMemcpyAsync
1.25957s 76.899us cudaStreamSynchronize
1.25965s 1.3067ms cudaMemcpy
1.26187s 71.884ms cudaDeviceReset
As you can see, the cudaMemcpyAsync did get blocked by the callback (it took > 1.0 second to finish).
The copy failing to happen in the sequence you expected is likely down to the fact that you are using a regular pageable host allocation, not pinned memory, and expecting the callback to fire instantly on the empty queue. The stream callback is registered less than 0.1 milliseconds before the copy is enqueued, and since the callback runs in another thread, it might not fire immediately, leaving the possibility that the copy starts before the callback function reacts to the empty-queue condition.
Interestingly, if I change host_A to a pinned allocation and run the code I get this API timeline:
124.21ms 130.24ms cudaMalloc
254.45ms 1.0988ms cudaHostAlloc
255.98ms 376.14us cudaMemcpy
256.36ms 33.841us cudaStreamCreate
256.39ms 87.303us cudaStreamAddCallback
256.48ms 17.208us cudaMemcpyAsync
256.50ms 1.00331s cudaStreamSynchronize
1.25981s 1.2880ms cudaMemcpy
1.26205s 68.506ms cudaDeviceReset
Note that now cudaStreamSynchronize is the call which is blocked. But in this case the program passes the assert, which is probably related to the scheduler correctly managing dependencies in the stream, given that the host memory is pinned.
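For reference, the pinned variant amounts to a change like this sketch (the rest of the program stays the same; the allocation is released with cudaFreeHost rather than free):

float* host_A = nullptr;
cuda_check(cudaHostAlloc((void**)&host_A, size * 4, cudaHostAllocDefault));
// ... rest of the program unchanged ...
cuda_check(cudaFreeHost(host_A));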

How to call a CUDA kernel from inside a class containing device member variables

I want to use CUDA 5.0 linking to write re-usable CUDA objects. I've set up this simple test, but my kernel fails silently (it runs without error or exception and outputs junk).
My simple test (below) allocates an array of integers to CUDA device memory. The CUDA kernel should populate the array with sequential entries (0,1,2,....,9). The device array is copied to CPU memory and output to the console.
Currently, this code outputs "0,0,0,0,0,0,0,0,0," instead of the desired "0,1,2,3,4,5,6,7,8,9,". It is compiled using VS2010 and CUDA 5.0 (with compute_35 and sm_35 set). Running on Win7-64-bit with a GeForce 580.
In Test.h:
class Test
{
public:
    Test();
    ~Test();
    void Run();
private:
    int* cuArray;
};
In Test.cu:
#include <stdio.h>
#include <assert.h>
#include <cuda_runtime.h>
#include "Test.h"

#define ARRAY_LEN 10

__global__ void kernel(int *p)
{
    int elemID = blockIdx.x * blockDim.x + threadIdx.x;
    p[elemID] = elemID;
}

Test::Test()
{
    cudaMalloc(&cuArray, ARRAY_LEN * sizeof(int));
}

Test::~Test()
{
    cudaFree(cuArray);
}

void Test::Run()
{
    kernel<<<1, ARRAY_LEN>>>(cuArray);

    // Copy the array contents to CPU-accessible memory
    int cpuArray[ARRAY_LEN];
    cudaMemcpy(static_cast<void*>(cpuArray), static_cast<void*>(cuArray),
               ARRAY_LEN * sizeof(int), cudaMemcpyDeviceToHost);

    // Write the array contents to console
    for (int i = 0; i < ARRAY_LEN; ++i)
        printf("%d,", cpuArray[i]);
    printf("\n");
}
In main.cpp:
#include <iostream>
#include "Test.h"
int main()
{
Test t;
t.Run();
}
I've experimented with the DECLs (__device__ __host__) as suggested by @harrism, but to no effect.
Can anyone suggest how to make this work? (The code works when it isn't inside a class.)
The device you are using is a GTX 580, whose compute capability is 2.0. If you compile the code for any architecture greater than 2.0, the kernel will not run on your device and the output will be garbage. Compile the code for compute capability 2.0 or lower, and the code will run fine.
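As an aside, a little error checking after the launch would have surfaced the failure instead of printing junk; a minimal sketch:

kernel<<<1, ARRAY_LEN>>>(cuArray);
cudaError_t err = cudaGetLastError();  // reports launch/architecture mismatches
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));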

synchronizing device memory access with host thread

Is it possible for a CUDA kernel to synchronize writes to device-mapped memory without any host-side invocation (e.g., of cudaDeviceSynchronize)? When I run the following program, it doesn't seem that the kernel waits for the writes to device-mapped memory to complete before terminating because examining the page-locked host memory immediately after the kernel launch does not show any modification of the memory (unless a delay is inserted or the call to cudaDeviceSynchronize is uncommented):
#include <stdio.h>
#include <cuda.h>

__global__ void func(int *a, int N) {
    int idx = threadIdx.x;
    if (idx < N) {
        a[idx] *= -1;
        __threadfence_system();
    }
}

int main(void) {
    int *a, *a_gpu;
    const int N = 8;
    size_t size = N * sizeof(int);

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **) &a, size, cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **) &a_gpu, (void *) a, 0);

    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    func<<<1, N>>>(a_gpu, N);
    // cudaDeviceSynchronize();

    for (int i = 0; i < N; i++) {
        printf("%i ", a[i]);
    }
    printf("\n");

    cudaFreeHost(a);
}
I'm compiling the above for sm_20 with CUDA 4.2.9 on Linux and running it on a Fermi GPU (S2050).
A kernel launch returns control to the host code immediately, before any kernel activity has occurred. Kernel execution is in this way asynchronous to host execution and does not block it. So it's no surprise that you have to wait a bit, or else use a barrier (like cudaDeviceSynchronize()), to see the results of the kernel.
As described here:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
This is all intentional, of course, so that you can use the GPU and CPU simultaneously. If you don't want this behavior, a simple solution, as you've already discovered, is to insert a barrier. If your kernel is producing data which you will immediately copy back to the host, you don't need a separate barrier: the cudaMemcpy call after the kernel will wait until the kernel has completed before it begins its copy operation.
I guess to answer your question: you want kernel launches to be synchronous without even having to use a barrier (why do you want to do this? Is adding the cudaDeviceSynchronize() call a problem?). It is possible to do this:
"Programmers can globally disable asynchronous kernel launches for all
CUDA applications running on a system by setting the
CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is
provided for debugging purposes only and should never be used as a way
to make production software run reliably. "
If you want this synchronous behavior, it's better just to use the barriers (or depend on another subsequent CUDA call, like cudaMemcpy). If you use the above method and depend on it, your code will break as soon as somebody else tries to run it without the environment variable set, so it's really not a good idea.
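To make that concrete for the code in the question, the minimal fix is simply to uncomment the barrier after the launch:

func<<<1, N>>>(a_gpu, N);
cudaDeviceSynchronize();  // barrier: the kernel's writes to the mapped
                          // allocation are visible on the host after this
for (int i = 0; i < N; i++) {
    printf("%i ", a[i]);
}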

Does my code run 1000 instances of 1000 iterations of my nonlinear recursive equation?

From my understanding of CUDA C, each thread executes an instance of the equation. But how do I print out all the values? The code actually works, but I really need someone to review it for me to confirm that my result is actually in line with what I set out to design.
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <conio.h>
#include <cuda.h>
#include <cutil.h>

__global__ void my_compute(float *y_d, float *theta_d, float *u_d)
{
    int idx = threadIdx.x + blockIdx.x * gridDim.x;
    for (idx = 7; idx < 1000; idx++)
    {
        y_d[idx] = theta_d[0]*y_d[idx-1] + theta_d[1]*y_d[idx-3] +
                   theta_d[2]*u_d[idx-5]*u_d[idx-4] + theta_d[3] +
                   theta_d[4]*u_d[idx-6] + theta_d[5]*u_d[idx-4]*y_d[idx-6] +
                   theta_d[6]*u_d[idx-7] + theta_d[7]*u_d[idx-7]*u_d[idx-6] +
                   theta_d[8]*y_d[idx-4] + theta_d[9]*y_d[idx-5] +
                   theta_d[10]*u_d[idx-4]*y_d[idx-5] + theta_d[11]*u_d[idx-4]*y_d[idx-2] +
                   theta_d[12]*u_d[idx-7]*u_d[idx-3] + theta_d[13]*u_d[idx-5] +
                   theta_d[14]*u_d[idx-4];
    }
}

int main(void)
{
    float y[1000];
    FILE *fpoo;
    FILE *u;
    float theta[15];
    float u_data[1000];
    float *y_d;
    float *theta_d;
    float *u_d;
    cudaEvent_t start, stop;
    float time;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // memory allocation
    cudaMalloc((void**)&y_d, 1000*sizeof(float));
    cudaMalloc((void**)&theta_d, 15*sizeof(float));
    cudaMalloc((void**)&u_d, 1000*sizeof(float));

    cudaEventRecord(start, 0);

    // importing data for theta and input of model
    fpoo = fopen("c:\\Fly_theta.txt", "r");
    u = fopen("c:\\Fly_u.txt", "r");
    for (int k = 0; k < 15; k++)
    {
        fscanf(fpoo, "%f\n", &theta[k]);
    }
    for (int k = 0; k < 1000; k++)
    {
        fscanf(u, "%f\n", &u_data[k]);
    }

    // NB: does this for loop below make my equation run 1,000,000
    // instances as opposed to the 1000 instances I desire?
    for (int i = 0; i < 1000; i++)
    {
        // I initialised the first 7 values of y because the equation output
        // starts from y(8)
        for (int k = 0; k < 8; k++)
        {
            y[k] = 0;
            cudaMemcpy(y_d, y, 1000*sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(theta_d, theta, 15*sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(u_d, u_data, 1000*sizeof(float), cudaMemcpyHostToDevice);
            // calling the kernel function
            my_compute<<<200, 5>>>(y_d, theta_d, u_d);
            cudaMemcpy(y, y_d, 1000*sizeof(float), cudaMemcpyDeviceToHost);
        }

        printf("\n\n*******Iteration %i*******\n", i);
        // does this actually print all the values from the threads?
        for (int i = 0; i < 1000; i++)
        {
            printf("%f", y[i]);
        }
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    printf("Time to generate: %3.1f ms \n", time);

    cudaFree(y_d);
    cudaFree(theta_d);
    cudaFree(u_d);
    fclose(u);
    fclose(fpoo);
    _getche();
    return 0;
}
how do I print out all the values?
Well, you can copy them to the host (which you do already) and print them out normally.
However, I am worried about your code for several reasons:
Only threads belonging to the same warp are executed truly in parallel. A warp is a collection of 32 adjacent threads (something like warpId = threadIdx.x/32). Threads belonging to different warps can execute in any order, unless you apply some synchronization functions.
Because of the above, you cannot say much about y_d[idx-1] when computing y_d[idx]. Has y_d[idx-1] already been computed/overwritten by another thread or not?
You have only 5 threads in each block (<<<200,5>>>), but because blocks are launched at warp granularity (a multiple of 32 threads), each block will keep 5 threads running and 27 threads idling.
You are not using parallelism at all! You have a for loop which is executed by all 1000 threads. All 1000 threads compute exactly the same thing (modulo the race conditions). You compute the thread index idx, but then completely ignore it and set idx to 7 for all threads.
I would strongly suggest, as an exercise in launch configuration, synchronization, and thread indexing, implementing a parallel prefix-sum algorithm, and only after confirming that it works correctly, doing this more advanced stuff...
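As a hedged aside on the indexing point above: the conventional global thread index uses blockDim.x (not gridDim.x, as in the question's kernel), with each thread handling one element. Note that the recurrence itself cannot be parallelized this naively, since y_d[idx] depends on earlier outputs; this only illustrates the idiom:

__global__ void per_thread(float *out, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;  // one element per thread
    if (idx < n)
        out[idx] = idx;  // placeholder per-element work
}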