Launching a CUDA stream from each host thread

My intention is to use n host threads to create n streams concurrently on an NVIDIA Tesla C2050. The kernel is a simple vector multiplication. I am dividing the data equally among the n streams, and each stream would have concurrent execution and data transfer going on.
The data is floating point. Sometimes the CPU and GPU sums are equal, and sometimes they are far apart. I guess this could be attributed to missing synchronization constructs in my code, but I also don't think any synchronization between streams is necessary, because I want every CPU thread to have a unique stream to control, and I do not care about asynchronous data copy and kernel execution within a thread.
Following is the code each thread runs:
//every thread would run this method in conjunction
static CUT_THREADPROC solverThread(TGPUplan *plan)
{
    //Allocate memory
    cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );
    //Copy input data from CPU
    cutilSafeCall( cudaMemcpyAsync((void *)plan->d_Data, (void *)plan->h_Data, plan->dataN * sizeof(float), cudaMemcpyHostToDevice, plan->stream) );
    //to make cudaMemcpyAsync blocking
    cudaStreamSynchronize( plan->stream );
    //launch
    launch_simpleKernel( plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
    cutilCheckMsg("simpleKernel() execution failed.\n");
    cudaStreamSynchronize(plan->stream);
    //Read back GPU results
    cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data, plan->dataN * sizeof(float), cudaMemcpyDeviceToHost, plan->stream) );
    //to make the cudaMemcpyAsync blocking...
    cudaStreamSynchronize(plan->stream);
    cutilSafeCall( cudaFree(plan->d_Data) );
    CUT_THREADEND;
}
And creation of multiple threads and calling the above function:
for(i = 0; i < nkernels; i++)
    threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, &plan[i]);
printf("main(): waiting for GPU results...\n");
cutWaitForThreads(threadID, nkernels);
I took this strategy from one of the CUDA SDK code samples. As I've said before, this code works sometimes, and at other times it gives wayward results. I need help with fixing this code...
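The question does not show how the data is divided among the n streams, so the following is only a host-side sketch of one way to do it. The names Chunk and chunk_for are hypothetical, and data_n stands for the total element count before the split (an assumption; in the question, plan->dataN is already the per-thread count). The point it illustrates is that an uneven division must hand the remainder to some chunks, otherwise elements are silently dropped and the CPU and GPU sums differ.

```cpp
#include <cassert>
#include <cstddef>

// Half-open element range [offset, offset + count) owned by one stream.
// Hypothetical type, introduced only for this sketch.
struct Chunk {
    std::size_t offset;
    std::size_t count;
};

// Splits data_n elements over n_streams chunks and returns chunk k.
// The first (data_n % n_streams) chunks get one extra element, so no
// element is dropped when the division is not exact.
Chunk chunk_for(std::size_t data_n, std::size_t n_streams, std::size_t k)
{
    std::size_t base  = data_n / n_streams;
    std::size_t extra = data_n % n_streams;
    Chunk c;
    c.count  = base + (k < extra ? 1 : 0);
    c.offset = k * base + (k < extra ? k : extra);
    return c;
}
```

Each host thread would then copy and process only its own range, for example chunk_for(total, nkernels, i) for thread i.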

First off, I am not an expert by any stretch of the imagination; this is just from my experience.
I don't see why this needs multiple host threads. It seems like you're managing one device and passing it multiple streams. The way I've seen this done (pseudocode):
{
    // create a handle
    // allocate an array of streams equal to the number of streams you want
    cudaStream_t streamArray[NUM_STREAMS];
    for(int n=0;n<NUM_STREAMS;n++)
    {
        cudaStreamCreate(&streamArray[n]);
    }
}
From there you can just pass the streams in your array to the various asynchronous calls (cudaMemcpyAsync(), kernel streams, etc.) and the device manages the rest. I've had weird scalability issues with multiple streams (don't try to make 10k streams, I run into problems around 4-8 on a GTX460), so don't be surprised if you run into those. Best of luck,
John

My bet is that BLOCK_N and THREAD_N don't cover the exact size of the array you are passing.
Please provide the code for initializing the streams and the size of those buffers.
As a side note, streams are useful for overlapping computation with memory transfers. Synchronizing the stream after each async call removes any chance of overlap, so it is not useful at all.
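The coverage concern above can be checked on the host before launching. This is a minimal sketch, assuming the kernel processes one element per thread with no grid-stride loop (the kernel body is not shown in the question); blocks_for and launch_covers are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>

// Number of blocks needed so that blocks * threads_per_block >= n
// (ceiling division).
std::size_t blocks_for(std::size_t n, std::size_t threads_per_block)
{
    return (n + threads_per_block - 1) / threads_per_block;
}

// True if a launch of `blocks` x `threads_per_block` threads touches every
// one of the n elements, assuming one element per thread.
bool launch_covers(std::size_t n, std::size_t blocks, std::size_t threads_per_block)
{
    return blocks * threads_per_block >= n;
}
```

When the grid is rounded up like this, the kernel also needs a bounds check such as `if (idx < n)` so that the surplus threads in the last block do nothing.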

Related

How can I make sure the compiler parallelizes my loads from global memory?

I've written a CUDA kernel that looks something like this:
int tIdx = threadIdx.x; // Assume a 1-D thread block and a 1-D grid
int buffNo = 0;
for (int offset=buffSz*blockIdx.x; offset<totalCount; offset+=buffSz*gridDim.x) {
    // Select which "page" we're using on this iteration
    float *buff = &sharedMem[buffNo*buffSz];
    // Load data from global memory
    if (tIdx < nLoadThreads) {
        for (int ii=tIdx; ii<buffSz; ii+=nLoadThreads)
            buff[ii] = globalMem[ii+offset];
    }
    // Wait for shared memory
    __syncthreads();
    // Perform computation
    if (tIdx >= nLoadThreads) {
        // Perform some computation on the contents of buff[]
    }
    // Switch pages
    buffNo ^= 0x01;
}
Note that there's only one __syncthreads() in the loop, so the first nLoadThreads threads will start loading the data for the 2nd iteration while the rest of the threads are still computing the results for the 1st iteration.
I was thinking about how many threads to allocate for loading vs. computing, and I reasoned that I would only need a single warp for loading, regardless of buffer size, because that inner for loop consists of independent loads from global memory: they can all be in flight at the same time. Is this a valid line of reasoning?
And yet when I try this out, I find that (1) increasing the number of load warps dramatically increases performance, and (2) the disassembly in nvvp shows that buff[ii] = globalMem[ii+offset] was compiled into a load from global memory followed 2 instructions later by a store to shared memory, indicating that the compiler is not applying instruction-level parallelism here.
Would additional qualifiers (const, __restrict__, etc) on buff or globalMem help ensure the compiler does what I want?
I suspect the problem has to do with the fact that buffSz is not known at compile-time (the actual data is 2-D and the appropriate buffer size depends on the matrix dimensions). In order to do what I want, the compiler will need to allocate a separate register for each LD operation in flight, right? If I manually unroll the loop, the compiler re-orders the instructions so that there are a few LD in flight before the corresponding ST needs to access that register. I tried a #pragma unroll but the compiler only unrolled the loop without reordering the instructions, so that didn't help. What else can I do?
The compiler has no chance to reorder stores to shared memory away from loads from global memory, because a __syncthreads() barrier is immediately following.
As all of the threads have to wait at the barrier anyway, it is faster to use more threads for loading. This means that more global memory transactions can be in flight at any time, and each load thread has to incur global memory latency less often.
No CUDA device so far supports out-of-order execution, so the load loop will incur exactly one global memory latency per loop iteration, unless the compiler can unroll it and reorder loads before stores.
To allow full unrolling, the number of loop iterations needs to be known at compile time. You can use talonmies' suggestion of templating the loop trips to achieve this.
You can also use partial unrolling. Annotating the load loop with #pragma unroll 2 will allow the compiler to issue two loads, then two stores for every two loop iterations, thus achieving an effect similar to doubling nLoadThreads. Replacing 2 with higher numbers is possible, but you will hit the maximum number of transactions in flight at some point (use float2 or float4 moves to transfer more data with the same number of transactions). Also, it is difficult to predict whether the compiler will prefer reordering instructions over the cost of more complex code for the final, potentially partial, trip through the unrolled loop.
So the suggestions are:
Use as many load threads as possible.
Unroll the load loop by templating the number of loop iterations and instantiating it for all possible numbers of loop trips (or the most common ones, with a generic fallback), or by using partial loop unrolling.
If the data is suitably aligned, move it as float2 or float4 to move more data with the same number of transactions.
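The templating suggestion can be sketched in plain host C++. This is not the kernel from the question: copy_staged, copy_generic and copy_dispatch are hypothetical names, the same index is used for source and destination for brevity, and on the device `first` would correspond to tIdx and `stride` to nLoadThreads. What it shows is the shape of the idea: a compile-time trip count lets both loops be fully unrolled, with all loads staged in registers before the first store.

```cpp
#include <cassert>
#include <cstddef>

// Copies TRIPS elements starting at index `first`, `stride` elements apart.
// TRIPS is a template parameter, so both loops have a compile-time trip
// count and can be fully unrolled; every load precedes the first store.
template <int TRIPS>
void copy_staged(float* dst, const float* src, std::size_t first, std::size_t stride)
{
    float staged[TRIPS];
    for (int k = 0; k < TRIPS; ++k)
        staged[k] = src[first + k * stride];   // loads only
    for (int k = 0; k < TRIPS; ++k)
        dst[first + k * stride] = staged[k];   // stores only
}

// Generic fallback for trip counts that have no dedicated instantiation.
void copy_generic(float* dst, const float* src, std::size_t first,
                  std::size_t stride, int trips)
{
    for (int k = 0; k < trips; ++k)
        dst[first + k * stride] = src[first + k * stride];
}

// Picks an instantiation for a few common trip counts at run time.
void copy_dispatch(float* dst, const float* src, std::size_t first,
                   std::size_t stride, int trips)
{
    switch (trips) {
    case 1: copy_staged<1>(dst, src, first, stride); break;
    case 2: copy_staged<2>(dst, src, first, stride); break;
    case 4: copy_staged<4>(dst, src, first, stride); break;
    case 8: copy_staged<8>(dst, src, first, stride); break;
    default: copy_generic(dst, src, first, stride, trips); break;
    }
}
```

Which trip counts deserve their own instantiation depends on the matrix dimensions actually encountered; the set {1, 2, 4, 8} here is arbitrary.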

How to avoid Cuda error 6 (Launch Timeout) with consecutive asynchronous kernel launches?

I get a Cuda error 6 (also known as cudaErrorLaunchTimeout and CUDA_ERROR_LAUNCH_TIMEOUT) with this (simplified) code:
for(int i = 0; i < 650; ++i)
{
    int param = foo(i); //some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
The Cuda error 6 indicates that the kernel took too much time to return. The duration of a single MyKernel is only ~60 ms though. The block size is a classic 16×16.
Now, when I call cudaDeviceSynchronize() every, say, 50 iterations, the error doesn't occur:
for(int i = 0; i < 650; ++i)
{
    int param = foo(i); //some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
    if(i % 50 == 0) cudaDeviceSynchronize();
}
I would like to avoid this synchronization, because it slows the program down a lot.
Since kernel launches are asynchronous, I guess the error occurs because the watchdog measures the execution duration of a kernel from its asynchronous launch, and not from the actual beginning of its execution.
I am new to Cuda. Is this a common case for the error 6 to occur? Is there a way to avoid this error without altering the performance?
Thanks to talonmies and Robert Crovella (whose proposed solution didn't work for me), I've been able to find an acceptable workaround.
To prevent the CUDA driver from batching the kernel launches together, another operation must be performed before or after each kernel launch. For example, a dummy copy does the trick:
void* dummy;
cudaMalloc(&dummy, 1);
for(int i = 0; i < 650; ++i)
{
    int param = foo(i); //some CPU computation here, but no memory copy
    cudaMemcpyAsync(dummy, dummy, 1, cudaMemcpyDeviceToDevice);
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}
This solution is 8 seconds faster (50s to 42s) than the one that includes calls to cudaDeviceSynchronize() (see question).
Besides, it's more reliable, 50 being an arbitrary, device-specific period.
The watchdog isn't measuring execution time of kernels, per se. The watchdog is keeping track of requests in the command queue that goes to the GPU, and determining if any of them have not been acknowledged by the GPU within a timeout period.
As @talonmies indicated in the comments, my best guess is that (if you are certain that no kernel execution exceeds the timeout period) this behavior is due to the CUDA driver's WDDM batching mechanism, which seeks to reduce average latency by collecting GPU commands and sending them to the GPU in batches.
You don't have direct control over the batching behavior, and so in general, trying to work around this without disabling or modifying the Windows TDR mechanism will be an imprecise exercise.
The general (somewhat undocumented) suggestion for a low-cost "flush" of the command queue, which you might try experimenting with, is to use cudaEventQuery(0); (as suggested here) in place of cudaDeviceSynchronize();, perhaps every 50 kernel launches or so. To some degree the specifics may depend on the machine configuration, and the GPU in use.
I'm not sure how effective it will be in your case. I don't think that it can be advanced as a "guarantee" of avoiding a TDR event without a lot more experimentation. Your mileage may vary.

cuda out-of-core implementation using a circular buffer

I'm trying to do out-of-core processing between GPU memory and CPU memory. For example, I have blocks of data of 1 GB each, and I need to process 1000 such blocks in order, each by one kernel launch. Assume the processing must be done one block at a time, because the n'th kernel launch (except the first) needs the result produced by the (n-1)'th kernel, which is stored in the (n-1)'th block. So I'm thinking of using a circular buffer on the GPU to store the most recent 5 blocks, and using events to synchronize between the data stream and the task stream. The data stream prepares the data and the task stream launches the kernels. The code is illustrated below.
const int N_CBUF = 5, N_TASK = 1000;
// Each pointer points to a data block of 1GB
float* d_cir_buf[N_CBUF];
float* h_data_blocks[N_TASK];
// The data stream for transferring data from host to device.
// The task stream for launching kernels to process the data.
cudaStream_t s_data, s_task;
// The data events for the completion of each data transfer.
// The task events for the completion of each kernel execution.
cudaEvent_t e_data[N_TASK], e_task[N_TASK];
// ... code for creating the streams and events.
for (int i = 0; i < N_TASK; i++) {
    // Data transfer must not overwrite the data still needed by the kernels.
    if (i >= N_CBUF) {
        cudaStreamWaitEvent(s_data, e_task[i-N_CBUF+1], 0);
    }
    cudaMemcpyAsync(d_cir_buf[i % N_CBUF], h_data_blocks[i], ..., cudaMemcpyHostToDevice, s_data);
    cudaEventRecord(e_data[i], s_data);
    cudaStreamWaitEvent(s_task, e_data[i], 0);
    // Pass the current and the last data block to the kernel.
    my_kernel<<<..., s_task>>>(d_cir_buf[i % N_CBUF],
                               i == 0 ? 0 : d_cir_buf[(i+N_CBUF-1)%N_CBUF]);
    cudaEventRecord(e_task[i], s_task);
}
I'm wondering if this is even a valid idea, or is there anything completely wrong? Also, the CUDA programming guide mentions that if there are memcpys from two different host memory addresses to the same device address, then there will be no concurrent execution. Does this matter in my case? In particular, if the memory for d_cir_buf is allocated as one big block and then split into 5 pieces, would that count as "the same memory address in device", causing concurrency to fail? Also, in my case the (n+5)'th data transfer will go to the same address as the n'th data transfer; however, given the synchronization required, no two such transfers will execute at the same time. So is this OK?
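The index arithmetic in the loop above can be checked on the host without a GPU. The following is a minimal sketch with hypothetical helper names (slot_of, prev_slot_of, wait_task_of, schedule_is_consistent); it assumes, as the question states, that kernel i reads only block i and block i-1. It replays the schedule and confirms that the event the upload waits on, e_task[i-N_CBUF+1], belongs to the last kernel that reads the slot about to be overwritten.

```cpp
#include <cassert>

const int N_CBUF = 5;  // same value as in the question

// Slot of the circular buffer that block i is uploaded into.
int slot_of(int i) { return i % N_CBUF; }

// Slot holding the previous block (i - 1), or -1 for the first task.
int prev_slot_of(int i) { return i == 0 ? -1 : (i + N_CBUF - 1) % N_CBUF; }

// Index of the task event the upload of block i has to wait for, or -1 if
// the slot has never been used. Block i overwrites block i - N_CBUF, whose
// last reader is kernel i - N_CBUF + 1 (it reads it as its previous block).
int wait_task_of(int i) { return i < N_CBUF ? -1 : i - N_CBUF + 1; }

// Brute-force check over n_task tasks: for every upload that reuses a slot,
// the last kernel reading the overwritten block must be wait_task_of(i).
bool schedule_is_consistent(int n_task)
{
    for (int i = N_CBUF; i < n_task; ++i) {
        int overwritten = i - N_CBUF;  // block currently stored in the slot
        if (slot_of(overwritten) != slot_of(i)) return false;
        int last_reader = -1;
        for (int k = 0; k < n_task; ++k) {
            // Kernel k reads block k and block k - 1.
            if (k == overwritten || k - 1 == overwritten) last_reader = k;
        }
        if (last_reader != wait_task_of(i)) return false;
    }
    return true;
}
```

This only validates the indices; it says nothing about whether the copy engine and the kernels actually overlap on a given device.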
I have the feeling that your problem is best suited to double buffering:
two streams
upload data1 in stream1
run kernel on data1 in stream1
upload data2 in stream2
run kernel on data2 in stream2
... And so on
The kernel in stream2 can overlap with data transfers in stream1 and vice versa.

Writing from Device to Host and notifying the host

Using CUDA 5 with VS 2012 and capability 3.5 (Titan and K20).
At particular stages of my kernel execution, I want to send a generated data chunk to the host memory and notify the host that the data is ready, so the host will operate on it.
I cannot wait until the end of the kernel execution to read the data back from the device, because:
The data is no longer relevant to the device once it is calculated, so there is no point keeping it to the end.
The data size is too large to fit on the device memory and wait until the end.
The host should not have to wait until the end of the kernel execution to start processing the data.
Could you point me to the path I have to take and the possible CUDA concepts and functions I have to use to achieve my requirements? Put simply, how can I write to the host and notify the host that a chunk of data is ready for host processing?
N.B. Each thread does not share any generated data with any other thread; they run independently. So, as far as I know (and please correct me if I am wrong), the concepts of blocks, threads and warps do not affect the question. In other words, if it helps the answer, I am free to alter their combination.
Below is a sample code that shows what I am trying to do:
#pragma once
#include <conio.h>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>
__global__ void Kernel(size_t length, float* hResult)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Processing multiple data chunks
    for(int i = 0;i < length;i++)
    {
        // Once this is assigned, I don't need it on the device anymore.
        hResult[i + (tid * length)] = i * 100;
    }
}
int main()
{
    size_t length = 10;
    size_t threads = 2;
    float* hResult;
    // An array that will hold all data from all threads
    cudaMallocHost((void**)&hResult, threads * length * sizeof(float));
    Kernel<<<threads,1>>>(length, hResult);
    // I DO NOT want to wait to the end and block to get the data
    cudaError_t error = cudaDeviceSynchronize();
    if (error != cudaSuccess) { throw error; }
    for(int i = 0;i < threads * length;i++)
    {
        printf("%f\n", hResult[i]);
    }
    cudaFreeHost(hResult);
    system("pause");
    return 0;
}
Here is one possible approach. At a high level, on the device:
You'll need to write the data to either device global memory (allocated previously with cudaMalloc) or else directly to host memory (allocated previously with cudaHostAlloc). This memory should be accessed via a volatile pointer.
You may wish to do all the data writing to this region from a single threadblock, to be sure that all the data is written prior to the following steps.
You'll then want to issue a __threadfence() call (if you're using device global memory) or a __threadfence_system() call (if using host memory) prior to the following steps.
Next you'll write to a special location in device global memory or host memory, let's call it the mailbox location, with a specific value indicating the data is ready. This location should also be accessed with a volatile pointer.
Optionally issue another __threadfence() or __threadfence_system() call.
For device memory usage on the receiving end, again both regions (payload and "mailbox") should be accessed using a volatile pointer.
On the host:
Before launching the kernel, the host will need to set the mailbox location to a default value.
After launching the kernel, the host thread will need to "poll" the mailbox location, looking for the specific value indicating data is ready
Once the specific value is seen, indicating that the data is ready, the host can consume the data
Optionally, if you want to repeat this process, the host can reset the mailbox location to the default value. The device can check for this default value before updating the data block with new data.
Both the mailbox location and the payload region should be accessed by the host thread using a volatile pointer.
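The handshake described in the two lists above can be illustrated with a host-only analogue. This is not device code: std::atomic with release/acquire ordering stands in for the volatile pointers and the __threadfence_system() call, and Mailbox, try_publish, try_consume and CHUNK are hypothetical names chosen for this sketch. It shows the order of operations only: write the payload, make it visible, raise the flag; on the other side poll the flag, read the payload, reset the flag.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

const int MAILBOX_EMPTY = 0;   // default value set before the producer starts
const int MAILBOX_READY = 1;   // value meaning "payload is complete"
const std::size_t CHUNK = 4;   // payload size in floats (arbitrary)

struct Mailbox {
    float payload[CHUNK];
    std::atomic<int> flag{MAILBOX_EMPTY};
};

// Producer side (the role played by the kernel). Returns false if the
// consumer has not yet reset the mailbox, so the old chunk is not overwritten.
bool try_publish(Mailbox& mb, const float* chunk)
{
    if (mb.flag.load(std::memory_order_acquire) != MAILBOX_EMPTY) return false;
    for (std::size_t i = 0; i < CHUNK; ++i) mb.payload[i] = chunk[i];
    // Release ordering: payload writes become visible before the flag does.
    mb.flag.store(MAILBOX_READY, std::memory_order_release);
    return true;
}

// Consumer side (the role played by the host). One poll step: returns false
// if no chunk is ready, otherwise copies it out and resets the mailbox.
bool try_consume(Mailbox& mb, float* out)
{
    if (mb.flag.load(std::memory_order_acquire) != MAILBOX_READY) return false;
    for (std::size_t i = 0; i < CHUNK; ++i) out[i] = mb.payload[i];
    mb.flag.store(MAILBOX_EMPTY, std::memory_order_release);
    return true;
}
```

In a real producer/consumer setup each function would be called in a loop from its own thread; the single-producer, single-consumer restriction mirrors the advice to write from a single threadblock.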
Note that even with the above process, there is still an implied device-wide synchronization needed, if the data is being generated/created from multiple threadblocks. The only straightforward device-wide synchronization available is the kernel launch (or completion of the kernel, specifically). Copying the data from a single threadblock simply moves the requirement for device-wide sync out of this particular sequence (to somewhere before this sequence).
The reasons you give don't really suggest to me that the code could not be refactored to create the data on a kernel-launch by kernel-launch basis, which would neatly solve these issues and eliminate the need for the above process as well.
EDIT: responding to a question in the comments.
It's difficult to be more specific about how to refactor the code to deliver one data chunk per kernel call, without a specific example.
Let's take an image processing case, where I have a video sequence of 30 frames stored in global memory. The kernel will process each frame according to some algorithm, then make the processed data available to the host.
In your proposal, after the kernel is done processing a frame, it can signal to the host that the data is ready, and go on to process the next frame. The problem is, if the frame is processed by multiple threadblocks, there's no easy way to know when all threadblocks are done processing that frame. A device-wide synchronization barrier might be what is needed, but it doesn't exist conveniently, except via the kernel call mechanism. However, presumably inside such a kernel we might have a sequence like this:
while (more_frames)
    process a frame
    signal host
    increment frame pointer
In a refactored approach, we would move the loop outside the kernel, to host code:
while (more_frames)
    call kernel to process frame
    consume frame
    increment frame pointer
By doing this, the completion of each kernel launch provides the explicit synchronization needed to know when the frame processing is complete and the data can be consumed.

Code running perfectly on host, put in a kernel, fails for mysterious reasons

I have to port a pre-existing “host-only” backpropagation implementation to CUDA. I think the nature of the algorithm doesn’t matter here, so I won’t explain much about how it works. What I think does matter, though, is that it uses 3-dimensional arrays, all three dimensions of which are dynamically allocated.
I use VS2010 with CUDA 5.0, and my device has compute capability 2.1. The original host-only code can be downloaded here
→ http://files.getwebb.org/view-cre62u4d.html
Main points of the code:
patterns from adult.data are loaded into memory, using the Data structure, present in “pattern.h”.
several multi-dimensional arrays are allocated
the algorithm is run over the patterns, using the arrays allocated just before.
If you want to try to run the code, don’t forget to modify the PATH constant at the beginning of kernel.cu. I also advise you to use “2” layers, “5” neurons, and a learning rate of “0.00001”. As you can see, this works perfectly: the “MSE” is improving. For those who have no clue about what this algorithm does, let’s simply say that it learns how to predict a target value based on 14 variables present in the patterns. The “MSE” decreases, meaning that the algorithm makes fewer mistakes after each “epoch”.
I spent a really long time trying to run this code on the device, and I’m still unsuccessful. My last attempt was simply to copy the code that initializes the arrays and runs the algorithm into one big kernel, which failed again. This code can be downloaded there
→ http://files.getwebb.org/view-cre62u4c.html
To be precise, here are the differences with the original host-only code:
f() and fder(), which are used by the algorithm, become device functions.
parameters are hardcoded: 2 layers, 5 neurons, and a learning rate of 0.00001
the “w” array is initialized using a fixed value (0.5), not rand() anymore
a Data structure is allocated in device’s memory, and the data are sent in device’s memory after they have been loaded from adult.data in host’s memory
I think I made the minimal amount of modifications needed to make the code run in a kernel. The “kernel_check_learningData” kernel shows some information about the patterns loaded in the device’s memory, proving that the following code, which sends the patterns from the host to the device, did work:
Data data;
Data* dev_data;
int* dev_t;
double* dev_x;
...
input_adult(PathFile, &data);
...
cudaMalloc((void**)&dev_data, sizeof(Data));
cudaMalloc((void**)&dev_t, data.N * sizeof(int));
cudaMalloc((void**)&dev_x, data.N * data.n * sizeof(double));
// Filling the device with t and x's data.
cudaMemcpy(dev_t, data.t, data.N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_x, data.x, data.N * data.n * sizeof(double), cudaMemcpyHostToDevice);
// Updating t and x pointers into devices Data structure.
cudaMemcpy(&dev_data->t, &dev_t, sizeof(int*), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->x, &dev_x, sizeof(double*), cudaMemcpyHostToDevice);
// Copying N and n.
cudaMemcpy(&dev_data->N, &data.N, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(&dev_data->n, &data.n, sizeof(int), cudaMemcpyHostToDevice);
It apparently fails at the beginning of the forward phase, when reading the “w” array. I can’t find any explanation for that.
I see two possibilities:
the code sending the patterns into the device's memory is buggy, despite the fact that it seems to work properly, and provokes a bug much later, when beginning the forward phase.
the CUDA API is not behaving like it should!
I have been desperately searching for my mistake for a very long time, so I wondered if the community could provide me with some help.
Thanks.
Here's the problem in your code, and why it works in 64 bit machine mode but not 32 bit machine mode.
In your backpropagation kernel, in the forward path, you have a sequence of code like this:
/*
* for layer = 0
*/
for (i = 0; i < N[0]; i++) { // for all neurons i of layer 0
    a[0][i] = x[ data->n * pat + i]; // a[0][i] = input i
}
In 32 bit machine mode (Win32 project, --machine 32 is being passed to nvcc), the failure occurs on the iteration i=7 when the write of a[0][7] occurs; this write is out of bounds. At this point, a[0][7] is intended to hold a double value, but for some reason the indexing is placing us out of bounds.
By the way, you can verify this by simply opening a command prompt in the directory where your executable is built, and running the command:
cuda-memcheck test_bp
assuming test_bp.exe is the name of your executable. cuda-memcheck conveniently identifies that there is an out of bounds write occurring, and even identifies the line of source that it is occurring on.
So why is this out of bounds? Let's take a look earlier in the kernel code where a[0][] is allocated:
a[0] = (double *)malloc( N[0] * sizeof(double *) );
                                              ^ oops!!
a[0][] is intended to hold double data but you're allocating pointer storage.
As it turns out, on a 64-bit machine the two types of storage are the same size, so it ends up working. But on a 32-bit machine, a pointer to double is 4 bytes whereas double data is 8 bytes. So, on a 32-bit machine, when we index through this array in strides of 8 bytes, we eventually run off the end of the array.
Elsewhere in the kernel code you are allocating storage for the other "layers" of a like this:
a[layer] = (double *)malloc( N[layer] * sizeof(double) );
which is correct. I see that the original "host-only" code seems to contain this error as well, so there may be a latent defect in that code too.
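The arithmetic behind the failure can be reproduced on the host. This is a sketch with hypothetical helper names (alloc_doubles, first_out_of_bounds). It assumes N[0] is 14, matching the 14 input variables mentioned in the question; that assumption is consistent with the first bad write happening at i=7 in 32-bit mode, since 14 pointers of 4 bytes give 56 bytes, room for exactly 7 doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Allocates n doubles. sizeof *p follows the element type of p, so the
// element size cannot silently become the size of a pointer.
double* alloc_doubles(std::size_t n)
{
    double* p = static_cast<double*>(std::malloc(n * sizeof *p));
    return p;
}

// Index of the first double that falls outside a buffer of
// n * pointer_size bytes, i.e. the buffer obtained from
// malloc(n * sizeof(double *)) on a machine with that pointer size.
std::size_t first_out_of_bounds(std::size_t n, std::size_t pointer_size)
{
    return (n * pointer_size) / sizeof(double);
}
```

With 8-byte pointers the result equals n, which is why the same mistake goes unnoticed in a 64-bit build.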
You will still need to address the kernel running time in some fashion to avoid the Windows TDR event if you want to run on a Windows WDDM device. And as I already pointed out, this code makes no attempt to use the parallel capability of the machine.