Concurrent execution of streams and host functions - CUDA

I have written a program which has two streams. Both streams operate on some data and write the results back to host memory.
Here is the generic structure of how I am doing this:
loop {
    AsyncCpy(....,HostToDevice,Stream1);
    AsyncCpy(....,HostToDevice,Stream2);
    Kernel<<<...,Stream1>>>
    Kernel<<<...,Stream2>>>
    /* Write the results on the host memory */
    AsyncCpy(....,DeviceToHost,Stream1);
    AsyncCpy(....,DeviceToHost,Stream2);
}
I want to do some work on the CPU once I know that StreamX has finished copying its results back to host memory. At the same time, I don't want to stop the loop from issuing async operations (memcpy or kernel execution).
If I insert my host functions, say host_ftn1(..) and host_ftn2(..), like this:
loop {
    AsyncCpy(....,HostToDevice,Stream1);
    AsyncCpy(....,HostToDevice,Stream2);
    Kernel<<<...,Stream1>>>
    Kernel<<<...,Stream2>>>
    /* Write the results on the host memory to be processed by host_ftn1(..) */
    AsyncCpy(....,DeviceToHost,Stream1);
    /* Write the results on the host memory to be processed by host_ftn2(..) */
    AsyncCpy(....,DeviceToHost,Stream2);
    if(Stream1 results are copied to host)
        host_ftn1(..);
    if(Stream2 results are copied to host)
        host_ftn2(..);
}
This will stop the execution of the loop until both host functions, host_ftn1 and host_ftn2, have finished. But I don't want to stall the GPU work, i.e. the AsyncCpy(..) calls and the Kernel<<<....,StreamX>>> launches, while the CPU is busy executing host_ftn1(..) and host_ftn2(..).
Is there any solution or approach to this problem?

As huseyin tugrul buyukisik suggested, stream callbacks worked in this scenario. I have tested this with two streams.
The final design is as follows:
loop {
    AsyncCpy(....,HostToDevice,Stream1);
    AsyncCpy(....,HostToDevice,Stream2);
    Kernel<<<...,Stream1>>>
    Kernel<<<...,Stream2>>>
    /* Write the results on the host memory to be processed by host_ftn1(..) */
    AsyncCpy(....,DeviceToHost,Stream1);
    /* Write the results on the host memory to be processed by host_ftn2(..) */
    AsyncCpy(....,DeviceToHost,Stream2);
    callback1(..); // Work to be done on the host once stream1 completes
    callback2(..); // Work to be done on the host once stream2 completes
}
See Stream Callbacks
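For reference, here is a minimal sketch of how such callbacks can be registered with cudaStreamAddCallback. The hostWork function, the stream names and the id variables are illustrative and not taken from the original code:

#include <cstdio>
#include <cuda_runtime.h>

// Runs on a CPU thread owned by the CUDA runtime once everything queued
// in the stream before cudaStreamAddCallback() has finished.
void CUDART_CB hostWork(cudaStream_t stream, cudaError_t status, void *userData)
{
    printf("stream %d: results are on the host, CPU post-processing can start\n",
           *(int *)userData);
    // Note: CUDA API calls are not allowed from inside a stream callback.
}

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    int id1 = 1, id2 = 2;

    // ... enqueue AsyncCpy(HostToDevice), Kernel, AsyncCpy(DeviceToHost) in s1/s2 here ...

    cudaStreamAddCallback(s1, hostWork, &id1, 0);  // fires only after s1's D2H copy completes
    cudaStreamAddCallback(s2, hostWork, &id2, 0);

    // The loop can keep issuing more asynchronous work here without waiting.
    cudaDeviceSynchronize();   // only at the very end
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    return 0;
}

The callback is queued into the stream like any other operation, so the loop can continue issuing asynchronous copies and kernels for later iterations while host_ftn1/host_ftn2 run inside the callbacks.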

Related

cudaEventSynchronize vs cudaDeviceSynchronize

I am new to CUDA and am a little confused about cudaEvent. I now have a code sample that goes as follows:
float elapsedTime;
cudaEvent_t start, stop;
CUDA_ERR_CHECK(cudaEventCreate(&start));
CUDA_ERR_CHECK(cudaEventCreate(&stop));
CUDA_ERR_CHECK(cudaEventRecord(start));
// Kernel functions go here ...
CUDA_ERR_CHECK(cudaEventRecord(stop));
CUDA_ERR_CHECK(cudaEventSynchronize(stop));
CUDA_ERR_CHECK(cudaEventElapsedTime(&elapsedTime, start, stop));
CUDA_ERR_CHECK(cudaDeviceSynchronize());
I have two questions regarding this code:
1. Is the last cudaDeviceSynchronize necessary? According to the documentation for cudaEventSynchronize, its functionality is to "Wait until the completion of all device work preceding the most recent call to cudaEventRecord()". So, given that we have already called cudaEventSynchronize(stop), do we need to call cudaDeviceSynchronize once again?
2. How does the above code compare to the following implementation:
#include <chrono>
auto tic = std::chrono::system_clock::now();
// Kernel functions go here ...
CUDA_ERR_CHECK(cudaDeviceSynchronize());
auto toc = std::chrono::system_clock::now();
float elapsedTime = std::chrono::duration_cast<std::chrono::milliseconds>(toc - tic).count() * 1.0;
Just to flesh out comments so that this question has an answer and will fall off the unanswered queue:
No, the cudaDeviceSynchronize() call is not necessary. In fact, in many cases where asynchronous API calls are being used in multiple streams, it is incorrect to use a global scope synchronization call, because you will break the features of event timers which allow accurate timing of operations in streams.
They are completely different. One uses host-side timing, the other uses device/driver timing. In the simplest cases, the times measured by both will be comparable. However, in the host-side timing version, if you put a host CPU operation which consumes a significant amount of time inside the timed section, your measurement will include that host time and will no longer reflect the GPU time used whenever the GPU operations take less time than the host operations.

CUDA out-of-core implementation using a circular buffer

I'm trying to do out-of-core processing between GPU memory and CPU memory. For example, I have blocks of data, each 1 GB in size, and I need to process 1000 of such blocks in order, each processed by a kernel launch. Assume the processing must be done one by one, because the n-th kernel launch needs to use the result produced by the (n-1)-th kernel launch, which is stored in the (n-1)-th block (except for the first kernel launch). So I'm thinking of using a circular buffer on the GPU to store the most recent 5 blocks, and using events to synchronize between the data stream and the task stream. The data stream prepares the data and the task stream launches the kernels. The code is illustrated below:
const int N_CBUF = 5, N_TASK = 1000;

// Each pointer points to a data block of 1 GB
float* d_cir_buf[N_CBUF];
float* h_data_blocks[N_TASK];

// The data stream for transferring data from host to device.
// The task stream for launching kernels to process the data.
cudaStream_t s_data, s_task;

// The data events mark the completion of each data transfer.
// The task events mark the completion of each kernel execution.
cudaEvent_t e_data[N_TASK], e_task[N_TASK];

// ... code for creating the streams and events.

for (int i = 0; i < N_TASK; i++) {
    // The data transfer must not overwrite data still needed by the kernels.
    if (i >= N_CBUF) {
        cudaStreamWaitEvent(s_data, e_task[i - N_CBUF + 1]);
    }
    cudaMemcpyAsync(d_cir_buf[i % N_CBUF], h_data_blocks[i], ..., cudaMemcpyHostToDevice, s_data);
    cudaEventRecord(e_data[i], s_data);

    cudaStreamWaitEvent(s_task, e_data[i]);
    // Pass the current and the previous data block to the kernel.
    my_kernel<<<..., s_task>>>(d_cir_buf[i % N_CBUF],
                               i == 0 ? 0 : d_cir_buf[(i + N_CBUF - 1) % N_CBUF]);
    cudaEventRecord(e_task[i], s_task);
}
I'm wondering whether this is even a valid idea, or is there anything completely wrong? Also, the CUDA programming guide mentions that if there are memcpys from two different host memory addresses to the same device address, then there will be no concurrent execution; does this matter in my case? In particular, if the memory for d_cir_buf is allocated as one big block and then split into 5 pieces, would that count as "the same memory address on the device", causing concurrency to fail? Also, in my case the (n+5)-th data transfer will go to the same address as the n-th data transfer; however, given the synchronization required, there won't be two such transfers executing at the same time. So is this OK?
I have the feeling that your problem is best suited to double buffering:
two streams
upload data1 in stream1
run kernel on data1 in stream1
upload data2 in stream2
run kernel on data2 in stream2
... And so on
The kernel in stream2 can overlap with data transfers in stream1, and vice versa.
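Below is a hedged sketch of that pattern, reusing the names from the question (h_data_blocks, N_TASK, my_kernel); the grid/block dimensions and BLOCK_SZ are placeholders I made up:

cudaStream_t s[2];
float *d_buf[2];
for (int b = 0; b < 2; b++) {
    cudaStreamCreate(&s[b]);
    cudaMalloc((void **)&d_buf[b], BLOCK_SZ * sizeof(float));
}

for (int i = 0; i < N_TASK; i++) {
    int b = i % 2;   // alternate buffer/stream every iteration
    // The upload of block i can overlap with the kernel still running in the other stream.
    cudaMemcpyAsync(d_buf[b], h_data_blocks[i], BLOCK_SZ * sizeof(float),
                    cudaMemcpyHostToDevice, s[b]);
    my_kernel<<<grid, block, 0, s[b]>>>(d_buf[b], i == 0 ? 0 : d_buf[1 - b]);
    // Because kernel i still depends on kernel i-1's output (and must not see its
    // input buffer overwritten while the other stream is reading it), event-based
    // waits like the cudaStreamWaitEvent calls in the question are still needed here.
}
cudaDeviceSynchronize();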

Writing from Device to Host and notifying the host

Using CUDA 5 with VS 2012 and capability 3.5 (Titan and K20).
At particular stages of my kernel execution, I want to send a generated data chunk to the host memory and notify the host that the data is ready, so the host will operate on it.
I cannot wait until the end of the kernel execution to read the data back from the device, because:
The data is no longer relevant to the device once it is calculated, so there is no point keeping it to the end.
The data is too large to fit in device memory while waiting for the end of the kernel.
The host should not have to wait until the end of the kernel execution to start processing the data.
Could you point me to the path I have to take and the possible CUDA concepts and functions I have to use to achieve my requirements? Put simply, how can I write to the host and notify the host that a chunk of data is ready for host processing?
N.B. Each thread does not share any generated data with any other thread; they run independently. So, as far as I know (and please correct me if I am wrong), the concept of blocks, threads and warps does not affect the question. Or in other words, if they aid the answer, I am free to alter their combination.
Below is a sample code that shows what I am trying to do:
#pragma once
#include <conio.h>
#include <cstdio>
#include <cuda_runtime_api.h>

__global__ void Kernel(size_t length, float* hResult)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;

    // Processing multiple data chunks
    for (int i = 0; i < length; i++)
    {
        // Once this is assigned, I don't need it on the device anymore.
        hResult[i + (tid * length)] = i * 100;
    }
}

void main()
{
    size_t length = 10;
    size_t threads = 2;
    float* hResult;

    // An array that will hold all data from all threads
    cudaMallocHost((void**)&hResult, threads * length * sizeof(float));

    Kernel<<<threads, 1>>>(length, hResult);

    // I DO NOT want to wait to the end and block to get the data
    cudaError_t error = cudaDeviceSynchronize();
    if (error != cudaSuccess) { throw error; }

    for (int i = 0; i < threads * length; i++)
    {
        printf("%f\n", hResult[i]);
    }

    cudaFreeHost(hResult);
    system("pause");
}
Here is one possible approach. At a high level, on the device:
You'll need to write the data to either device global memory (allocated previously with cudaMalloc) or else directly to host memory (allocated previously with cudaHostAlloc). This memory should be accessed via a volatile pointer.
You may wish to do all the data writing to this region from a single threadblock, to be sure that all the data is written prior to the following steps
You'll then want to issue a __threadfence() call (if you're using device global memory) or a __threadfence_system() call (if using host memory) prior to the following steps.
Next you'll write to a special location in device global memory or host memory, let's call it the mailbox location, with a specific value indicating the data is ready. This location should also be accessed with a volatile pointer.
Optionally issue another __threadfence() or __threadfence_system() call.
On the receiving end, in the device global memory case as well, both regions (payload and "mailbox") should again be accessed using a volatile pointer.
On the host:
Before launching the kernel, the host will need to set the mailbox location to a default value.
After launching the kernel, the host thread will need to "poll" the mailbox location, looking for the specific value indicating data is ready
Once the specific value is seen, indicating that the data is ready, the host can consume the data
Optionally, if you want to repeat this process, the host can reset the mailbox location to the default value. The device can check for this default value before updating the data block with new data.
Both the mailbox location and the payload region should be accessed by the host thread using a volatile pointer.
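Putting the device side and host side together, here is a minimal sketch of the idea, assuming mapped pinned host memory; the names (producer, payload, mailbox) are illustrative and error checking is omitted:

#include <cstdio>
#include <cuda_runtime.h>

#define READY 1

__global__ void producer(volatile float *payload, volatile int *mailbox, int n)
{
    // Single-threadblock example: thread 0 writes the whole payload.
    if (threadIdx.x == 0) {
        for (int i = 0; i < n; i++)
            payload[i] = (float)i;        // generate the data chunk
        __threadfence_system();           // make the payload visible to the host first
        *mailbox = READY;                 // ...then raise the flag
        __threadfence_system();
    }
}

int main()
{
    const int n = 16;
    cudaSetDeviceFlags(cudaDeviceMapHost);          // allow mapped pinned memory

    float *h_payload; int *h_mailbox;
    cudaHostAlloc((void **)&h_payload, n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&h_mailbox, sizeof(int), cudaHostAllocMapped);
    *h_mailbox = 0;                                 // default ("not ready") value

    float *d_payload; int *d_mailbox;
    cudaHostGetDevicePointer((void **)&d_payload, h_payload, 0);
    cudaHostGetDevicePointer((void **)&d_mailbox, h_mailbox, 0);

    producer<<<1, 32>>>(d_payload, d_mailbox, n);   // asynchronous: returns immediately
    cudaStreamQuery(0);                             // on Windows/WDDM this can help flush the launch queue

    while (*((volatile int *)h_mailbox) != READY) { }   // host polls the mailbox

    for (int i = 0; i < n; i++)                     // consume the data
        printf("%f\n", h_payload[i]);

    cudaDeviceSynchronize();
    cudaFreeHost(h_payload);
    cudaFreeHost(h_mailbox);
    return 0;
}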
Note that even with the above process, there is still an implied device-wide synchronization needed, if the data is being generated/created from multiple threadblocks. The only straightforward device-wide synchronization available is the kernel launch (or completion of the kernel, specifically). Copying the data from a single threadblock simply moves the requirement for device-wide sync out of this particular sequence (to somewhere before this sequence).
The reasons you give don't really suggest to me that the code could not be refactored to create the data on a kernel-launch by kernel-launch basis, which would neatly solve these issues and eliminate the need for the above process as well.
EDIT: responding to a question in the comments.
It's difficult to be more specific about how to refactor the code to deliver one data chunk per kernel call, without a specific example.
Let's take an image processing case, where I have a video sequence of 30 frames stored in global memory. The kernel will process each frame according to some algorithm, then make the processed data available to the host.
In your proposal, after the kernel is done processing a frame, it can signal to the host that the data is ready, and go on to process the next frame. The problem is, if the frame is processed by multiple threadblocks, there's no easy way to know when all threadblocks are done processing that frame. A device-wide synchronization barrier might be what is needed, but it doesn't exist conveniently, except via the kernel call mechanism. However, presumably inside such a kernel we might have a sequence like this:
while (more_frames)
process a frame
signal host
increment frame pointer
In a refactored approach, we would move the loop outside the kernel, to host code:
while (more_frames)
call kernel to process frame
consume frame
increment frame pointer
By doing this, the kernel launch itself provides the explicit synchronization needed to know when the frame processing is complete, so that the data can be consumed.
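As a hedged illustration of that refactored structure (process_frame, consume_frame and all the sizes below are made-up names, not from the original question):

void process_video(const float *d_frames, float *d_result, float *h_result,
                   int num_frames, size_t frame_elems, size_t result_bytes)
{
    for (int f = 0; f < num_frames; f++) {
        // Process one frame per kernel launch.
        process_frame<<<256, 256>>>(d_frames + f * frame_elems, d_result);
        // Same (default) stream, so the copy starts only after the kernel is done,
        // and cudaMemcpy itself blocks the host until the data has arrived.
        cudaMemcpy(h_result, d_result, result_bytes, cudaMemcpyDeviceToHost);
        consume_frame(h_result);   // host-side processing of this frame's output
    }
}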

Synchronization in CUDA

I read the CUDA reference manual about synchronization in CUDA, but I don't understand it clearly. For example, why do we use cudaDeviceSynchronize() or __syncthreads()? If we don't use them, what happens? Will the program not work correctly? What is the difference between cudaMemcpy and cudaMemcpyAsync in practice? Can you show an example that demonstrates this difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load its appropriate shared memory location(s) before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
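A hedged sketch of that shared-memory pattern (the kernel and array names are made up; it assumes blockDim.x == 256 and an input length that is a multiple of 256):

__global__ void reverse_within_block(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];        // each thread loads its own element
    __syncthreads();                  // wait until the whole tile is populated
    // Only now is it safe to read elements that *other* threads loaded:
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}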
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
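A minimal sketch of the difference (illustrative only; the buffer names and sizes are mine, and error checking is omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t N = 1 << 20;
    float *h_buf, *d_buf;
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMallocHost((void **)&h_buf, N * sizeof(float)); // pinned host memory (needed for real async copies)
    cudaMalloc((void **)&d_buf, N * sizeof(float));

    // Blocking: the CPU thread waits here until the copy is complete.
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);

    // Non-blocking: returns immediately; the copy proceeds in stream s.
    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice, s);
    printf("CPU is free to do other work while the async copy runs\n");
    cudaStreamSynchronize(s);        // wait only when the data is actually needed

    cudaStreamDestroy(s);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}

Note that the asynchronous copy only overlaps with host work when the host buffer is pinned; with pageable host memory the copy is staged and behaves largely synchronously.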
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have conditional paths in your code and then want to run an operation that depends on several array elements.
Consider the following example:
int n = threadIdx.x;
if (myarray[n] > 0)
{
    myarray[n] = -myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reach here at the same time
In the above example, not all threads will have the same execution sequence; some threads will take longer, depending on the if condition. When considering the last line of the example, you need to make sure that all the threads have finished the if block and updated myarray correctly. If this is not the case, y may be computed from a mix of updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if (myarray[n] > 0)
{
    myarray[n] = -myarray[n];
}

__syncthreads(); // All threads will wait till they come to this point

// We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];

Launching a CUDA stream from each host thread

My intention is to use n host threads to create n streams concurrently on an NVIDIA Tesla C2050. The kernel is a simple vector multiplication... I am dividing the data equally among the n streams, and each stream would have concurrent execution/data transfer going on.
The data is floating point. I sometimes get the CPU/GPU sums as equal, and sometimes they are wide apart... I guess this could be attributed to a loss of synchronization constructs in my code, but I also don't think any synchronization constructs between streams are necessary, because I want every CPU thread to have a unique stream to control, and I do not care about asynchronous data copies and kernel execution within a thread.
Following is the code each thread runs:
//every thread would run this method in conjunction
static CUT_THREADPROC solverThread(TGPUplan *plan)
{
    //Allocate memory
    cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );

    //Copy input data from CPU
    cutilSafeCall( cudaMemcpyAsync((void *)plan->d_Data, (void *)plan->h_Data, plan->dataN * sizeof(float), cudaMemcpyHostToDevice, plan->stream) );
    //to make cudaMemcpyAsync blocking
    cudaStreamSynchronize( plan->stream );

    //launch
    launch_simpleKernel( plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
    cutilCheckMsg("simpleKernel() execution failed.\n");
    cudaStreamSynchronize(plan->stream);

    //Read back GPU results
    cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data, plan->dataN * sizeof(float), cudaMemcpyDeviceToHost, plan->stream) );
    //to make the cudaMemcpyAsync blocking...
    cudaStreamSynchronize(plan->stream);

    cutilSafeCall( cudaFree(plan->d_Data) );
    CUT_THREADEND;
}
And here is the creation of multiple threads and the call to the above function:
for (i = 0; i < nkernels; i++)
    threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, &plan[i]);

printf("main(): waiting for GPU results...\n");
cutWaitForThreads(threadID, nkernels);
I took this strategy from one of the CUDA SDK code samples. As I've said before, this code works sometimes, and other times it gives wayward results. I need help with fixing this code...
First off, I am not an expert by any stretch of the imagination; this is just from my experience.
I don't see why this needs multiple host threads. It seems like you're managing one device and passing it multiple streams. The way I've seen this done (pseudocode):
{
    create a handle
    allocate an array of streams equal to the number of streams you want
    for(int n = 0; n < NUM_STREAMS; n++)
    {
        cudaStreamCreate(&streamArray[n]);
    }
}
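Fleshed out slightly, as a hedged sketch with a single host thread: the chunk arithmetic and buffer names below are illustrative; d_Data/h_Data are assumed to be one large device buffer and one large pinned host buffer, and simpleKernel stands in for the kernel behind launch_simpleKernel in the question.

#define NUM_STREAMS 4
cudaStream_t streamArray[NUM_STREAMS];
for (int n = 0; n < NUM_STREAMS; n++)
    cudaStreamCreate(&streamArray[n]);

size_t chunk = dataN / NUM_STREAMS;   // assumes dataN divides evenly
for (int n = 0; n < NUM_STREAMS; n++) {
    // Each stream works on its own slice of the data.
    cudaMemcpyAsync(d_Data + n * chunk, h_Data + n * chunk, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streamArray[n]);
    simpleKernel<<<BLOCK_N, THREAD_N, 0, streamArray[n]>>>(d_Data + n * chunk);
    cudaMemcpyAsync(h_Data + n * chunk, d_Data + n * chunk, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streamArray[n]);
}
cudaDeviceSynchronize();   // one synchronization at the end, instead of after every call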
From there you can just pass the streams in your array to the various asynchronous calls (cudaMemcpyAsync(), kernel launches, etc.) and the device manages the rest. I've had weird scalability issues with multiple streams (don't try to make 10k streams; I ran into problems around 4-8 on a GTX 460), so don't be surprised if you run into those. Best of luck,
John
My bet is that BLOCK_N, THREAD_N don't cover the exact size of the array you are passing.
Please provide the code for initializing the streams and the size of those buffers.
As a side note, streams are useful for overlapping computation with memory transfers. Synchronizing the stream after each async call is not useful at all.