CUDA out-of-core implementation using a circular buffer

I'm trying to do out-of-core processing between GPU memory and CPU memory. For example, I have data blocks of 1 GB each, and I need to process 1000 of such blocks in order, each by one kernel launch. Assume the processing must be done one by one, because the n-th kernel launch needs to use the result produced by the (n-1)-th kernel, which is stored in the (n-1)-th block (except for the first kernel launch). So I'm thinking of using a circular buffer on the GPU to store the most recent 5 blocks, and using events to synchronize between the data stream and the task stream. The data stream prepares the data and the task stream launches the kernels. The idea is illustrated by the following code.
const int N_CBUF = 5, N_TASK = 1000;

// Each pointer points to a 1 GB data block.
float* d_cir_buf[N_CBUF];
float* h_data_blocks[N_TASK];

// The data stream transfers data from host to device;
// the task stream launches the kernels that process the data.
cudaStream_t s_data, s_task;

// The data events mark the completion of each data transfer;
// the task events mark the completion of each kernel execution.
cudaEvent_t e_data[N_TASK], e_task[N_TASK];

// ... code for creating the streams and events.

for (int i = 0; i < N_TASK; i++) {
    // A data transfer must not overwrite a block that is still needed by a kernel.
    if (i >= N_CBUF) {
        cudaStreamWaitEvent(s_data, e_task[i - N_CBUF + 1], 0);
    }
    cudaMemcpyAsync(d_cir_buf[i % N_CBUF], h_data_blocks[i], ..., cudaMemcpyHostToDevice, s_data);
    cudaEventRecord(e_data[i], s_data);

    cudaStreamWaitEvent(s_task, e_data[i], 0);
    // Pass the current and the previous data block to the kernel.
    my_kernel<<<..., s_task>>>(d_cir_buf[i % N_CBUF],
                               i == 0 ? 0 : d_cir_buf[(i + N_CBUF - 1) % N_CBUF]);
    cudaEventRecord(e_task[i], s_task);
}
I'm wondering if this is even a valid idea, or whether there is anything completely wrong with it. Also, the CUDA programming guide mentions that if there are memcpys from two different host memory addresses to the same device address, then there will be no concurrent execution; does this matter in my case? In particular, if the memory for d_cir_buf is allocated as one big block and then split into 5 pieces, would that count as "the same memory address in device", causing concurrency to fail? Also, in my case the (n+5)-th data transfer will go to the same address as the n-th data transfer; however, given the synchronization required, there won't be two such transfers executing at the same time. So is this OK?

I have the feeling that your problem is best suited to double buffering:
two streams
upload data1 in stream1
run kernel on data1 in stream1
upload data2 in stream2
run kernel on data2 in stream2
... and so on
The kernel in stream2 can overlap with data transfers in stream1, and vice versa.
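A minimal sketch of that two-stream pattern, assuming the per-chunk kernels are independent (the kernel-to-kernel dependency from your question would additionally need a cudaStreamWaitEvent between the streams); the names process, h_blocks, d_buf and block_bytes are illustrative:

__global__ void process(float* data) { /* placeholder per-block computation */ }

void run_double_buffered(float* h_blocks[], float* d_buf[2],
                         size_t block_bytes, int n_task)
{
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int i = 0; i < n_task; ++i) {
        int b = i % 2;   // ping-pong between the two buffers/streams
        cudaMemcpyAsync(d_buf[b], h_blocks[i], block_bytes,
                        cudaMemcpyHostToDevice, s[b]);
        // Runs after the copy above because both are issued to stream s[b];
        // meanwhile the copy for chunk i+1 can be issued on the other stream.
        process<<<256, 256, 0, s[b]>>>(d_buf[b]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}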

Related

How can I make sure the compiler parallelizes my loads from global memory?

I've written a CUDA kernel that looks something like this:
int tIdx = threadIdx.x; // Assume a 1-D thread block and a 1-D grid
int buffNo = 0;
for (int offset = buffSz*blockIdx.x; offset < totalCount; offset += buffSz*gridDim.x) {
    // Select which "page" we're using on this iteration
    float *buff = &sharedMem[buffNo*buffSz];
    // Load data from global memory
    if (tIdx < nLoadThreads) {
        for (int ii = tIdx; ii < buffSz; ii += nLoadThreads)
            buff[ii] = globalMem[ii+offset];
    }
    // Wait for shared memory
    __syncthreads();
    // Perform computation
    if (tIdx >= nLoadThreads) {
        // Perform some computation on the contents of buff[]
    }
    // Switch pages
    buffNo ^= 0x01;
}
Note that there's only one __syncthreads() in the loop, so the first nLoadThreads threads will start loading the data for the 2nd iteration while the rest of the threads are still computing the results for the 1st iteration.
I was thinking about how many threads to allocate for loading vs. computing, and I reasoned that I would only need a single warp for loading, regardless of buffer size, because that inner for loop consists of independent loads from global memory: they can all be in flight at the same time. Is this a valid line of reasoning?
And yet when I try this out, I find that (1) increasing the # of load warps dramatically increases performance, and (2) the disassembly in nvvp shows that buff[ii] = globalMem[ii+offset] was compiled into a load from global memory followed 2 instructions later by a store to shared memory, indicating that the compiler is not applying instruction-level parallelism here.
Would additional qualifiers (const, __restrict__, etc) on buff or globalMem help ensure the compiler does what I want?
I suspect the problem has to do with the fact that buffSz is not known at compile-time (the actual data is 2-D and the appropriate buffer size depends on the matrix dimensions). In order to do what I want, the compiler will need to allocate a separate register for each LD operation in flight, right? If I manually unroll the loop, the compiler re-orders the instructions so that there are a few LD in flight before the corresponding ST needs to access that register. I tried a #pragma unroll but the compiler only unrolled the loop without reordering the instructions, so that didn't help. What else can I do?
The compiler has no chance to reorder stores to shared memory away from loads from global memory, because a __syncthreads() barrier is immediately following.
As all of the threads have to wait at the barrier anyway, it is faster to use more threads for loading. This means that more global memory transactions can be in flight at any time, and each load thread incurs global memory latency less often.
All CUDA devices so far do not support out-of-order execution, so the load loop will incur exactly one global memory latency per loop iteration, unless the compiler can unroll it and reorder loads before stores.
To allow full unrolling, the number of loop iterations needs to be known at compile time. You can use talonmies' suggestion of templating the loop trips to achieve this.
You can also use partial unrolling. Annotating the load loop with #pragma unroll 2 will allow the compiler to issue two loads, then two stores, for every two loop iterations, thus achieving a similar effect to doubling nLoadThreads. Replacing 2 with a higher number is possible, but you will hit the maximum number of transactions in flight at some point (use float2 or float4 moves to transfer more data with the same number of transactions). Also, it is difficult to predict whether the compiler will prefer reordering the instructions over the cost of more complex code for the final, potentially partial, trip through the unrolled loop.
So the suggestions are:
Use as many load threads as possible.
Unroll the load loop, either by templating the number of loop iterations and instantiating it for all possible numbers of loop trips (or the most common ones, with a generic fallback), or by using partial loop unrolling (see the sketch below).
If the data is suitably aligned, move it as float2 or float4 to move more data with the same number of transactions.
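A hedged sketch of those two unrolling variants; buff, globalMem, buffSz, tIdx, nLoadThreads and offset are the names from the question, while wrapping them in __device__ helper functions is only for illustration:

// (a) Partial unrolling: the compiler may now batch two loads before the two
// corresponding shared-memory stores, similar in effect to doubling nLoadThreads.
__device__ void load_tile_partial(float *buff, const float *globalMem,
                                  int offset, int buffSz,
                                  int tIdx, int nLoadThreads)
{
    #pragma unroll 2
    for (int ii = tIdx; ii < buffSz; ii += nLoadThreads)
        buff[ii] = globalMem[ii + offset];
}

// (b) Templated trip count (talonmies' suggestion): with the sizes known at
// compile time, the loop can be fully unrolled, letting the compiler issue
// several loads before the corresponding stores.
template <int BUFF_SZ, int N_LOAD_THREADS>
__device__ void load_tile_full(float *buff, const float *globalMem,
                               int offset, int tIdx)
{
    #pragma unroll
    for (int ii = tIdx; ii < BUFF_SZ; ii += N_LOAD_THREADS)
        buff[ii] = globalMem[ii + offset];
}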

Simulating pipeline program with CUDA

Say I have two arrays A and B and a kernel1 that does some calculation on both arrays (vector addition, for example) by breaking the arrays into different chunks and writing the partial result to C. kernel1 then keeps doing this until all elements in the arrays are processed.
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int gridSize = blockDim.x*gridDim.x;
// iterate through each chunk of gridSize in both A and B
while (i < N) {
    C[i] = A[i] + B[i];
    i += gridSize;
}
Say, now I want to launch a kernel2 on C and another data array D. Is there any way I can start kernel2 immediately after the first chunk of C is calculated? In essence, kernel1 would pipe its result to kernel2. The dependency tree would look like this:
    Result
    /    \
   C      D
  / \
 A   B
I have thought about using CUDA streams, but I'm not sure exactly how. Maybe by incorporating the host in the calculation?
Yes, you could use CUDA streams to manage order and dependencies in such a scenario.
Let's assume that you will want to overlap the copy and compute operations. This typically implies that you will break your input data into "chunks" and you will copy chunks to the device, then launch compute operations. Each kernel launch operates on a "chunk" of data.
We could manage the process with a loop in host code:
// create streams and a ping-pong pointer
cudaStream_t stream1, stream2, *st_ptr;
cudaStreamCreate(&stream1); cudaStreamCreate(&stream2);
// assume D is already on the device as dev_D
for (int chunkid = 0; chunkid < max; chunkid++){
    // ping-pong between the streams
    st_ptr = (chunkid % 2) ? (&stream1) : (&stream2);
    size_t offset = chunkid*chunk_size;
    // copy the A and B chunks
    cudaMemcpyAsync(dev_A+offset, A+offset, chunk_size*sizeof(A_type), cudaMemcpyHostToDevice, *st_ptr);
    cudaMemcpyAsync(dev_B+offset, B+offset, chunk_size*sizeof(B_type), cudaMemcpyHostToDevice, *st_ptr);
    // then compute C based on A and B
    compute_C_kernel<<<..., *st_ptr>>>(dev_C+offset, dev_A+offset, dev_B+offset, chunk_size);
    // then compute Result based on C and D
    compute_Result_kernel<<<..., *st_ptr>>>(dev_C+offset, dev_D, chunk_size);
    // could copy a chunk of Result back to the host here with cudaMemcpyAsync on the same stream
}
All operations issued to the same stream are guaranteed to execute in order (i.e. sequentially) on the device. Operations issued to separate streams can overlap. Therefore the above sequence should:
copy a chunk of A to the device
copy a chunk of B to the device
launch a kernel to compute C from A and B
launch a kernel to compute Result from C and D
The above steps will be repeated for each chunk, but successive chunk operations will be issued to alternate streams. Therefore the copy operations of chunk 2 can overlap with the kernel operations from chunk 1, etc.
You can learn more by reviewing a presentation on CUDA streams. Here is one example.
Newer devices (Kepler and Maxwell) should be fairly flexible about the program-issue-order needed to witness overlap of operations on the device. Older (Fermi) devices may be sensitive to issue order. You can read more about that here

cudaDeviceSynchronize and performing sequential work

I have a program that, when profiled with nvprof, shows that ~98% of the execution time is spent in cudaDeviceSynchronize. In thinking about how to optimize the following code, I'm brought back here to try to confirm my understanding of the need for cudaDeviceSynchronize.
The general layout of my program is thus:
Copy input array to GPU.
program<<<1,1>>>(inputs)
Copy outputs back to host.
Thus, my program kernel is a master thread that basically looks like this:
for (int i = 0; i < 10000; i++)
{
    calcKs(inputs);
    takeStep(inputs);
}
The calcKs function is one of the most egregious abusers of cudaDeviceSynchronize and looks like this:
//Calculate k1's
//Calc fluxes for r = 1->(ml-1), then for r = 0, then calc K's
zeroTemps();
calcFlux<<< numBlocks, numThreads >>>(concs, temp2); //temp2 calculated from concs
cudaDeviceSynchronize();
calcMonomerFlux(temp2, temp1); //temp1 calculated from temp2
cudaDeviceSynchronize();
calcK<<< numBlocks, numThreads >>>(k1s, temp2); //k1s calculated from temp2
cudaDeviceSynchronize();
where the arrays temp2, temp1 and k1s are each calculated from the results of the preceding steps. My understanding was that cudaDeviceSynchronize was essential because I need temp2 to be completely calculated before temp1 is calculated, and the same for temp1 and k1s.
I feel like I've critically misunderstood the function of cudaDeviceSynchronize from reading this post: When to call cudaDeviceSynchronize?. I'm not sure how pertinent the comments there are to my situation, however, as all of my program runs on the device and there's no CPU-GPU interaction until the final memory copy back to the host, so I don't get the implicit serialization caused by the memcpy.
CUDA activities (kernel calls, memcopies, etc.) issued to the same stream will be serialized.
When you don't use streams at all in your application, everything you are doing is in the default stream.
Therefore, in your case, there is no functional difference between:
calcFlux<<< numBlocks, numThreads >>>(concs, temp2); //temp2 calculated from concs
cudaDeviceSynchronize();
calcMonomerFlux(temp2, temp1); //temp1 calculated from temp2
and:
calcFlux<<< numBlocks, numThreads >>>(concs, temp2); //temp2 calculated from concs
calcMonomerFlux(temp2, temp1); //temp1 calculated from temp2
You don't show what calcMonomerFlux is, but assuming it uses data from temp2 and is doing calculations on the host, it must be using cudaMemcpy to grab the temp2 data before it actually uses it. Since the cudaMemcpy will be issued to the same stream as the preceding kernel call (calcFlux) it will be serialized, i.e. it will not begin until calcFlux is done. Your other code depending on temp2 data in calcMonomerFlux presumably executes after the cudaMemcpy, which is a blocking operation, so it will not begin executing until the cudaMemcpy is done.
Even if calcMonomerFlux contains kernels that operate on temp2 data, the argument is the same. Those kernels are presumably issued to the same stream (default stream) as calcFlux, and therefore will not begin until calcFlux is complete.
So the cudaDeviceSynchronize() call is almost certainly not needed.
Having said that, cudaDeviceSynchronize() by itself should not consume a tremendous amount of overhead. The reason that most of your execution time is being attributed to cudaDeviceSynchronize(), is because from a host thread perspective, this sequence:
calcFlux<<< numBlocks, numThreads >>>(concs, temp2); //temp2 calculated from concs
cudaDeviceSynchronize();
spends almost all its time in the cudaDeviceSynchronize() call. The kernel call is asynchronous, meaning it launches the kernel and then immediately returns control to the host thread, allowing the host thread to continue. Therefore the overhead in the host thread for a kernel call may be as low as a few microseconds. But the cudaDeviceSynchronize() call will block the host thread until the preceding kernel call completes. The longer your kernel executes, the more time the host thread spends waiting at the cudaDeviceSynchronize() call. So nearly all your host thread execution time appears to be spent on these calls.
For properly written single threaded, single (default) stream CUDA codes, cudaDeviceSynchronize() is almost never needed in the host thread. It may be useful in some cases for certain types of debugging/error checking, and it may be useful in the case where you have a kernel executing and want to see the printout (printf) from the kernel before your application terminates.
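For the debugging/error-checking case just mentioned, one common pattern looks roughly like this (a sketch; the DEBUG guard is illustrative, the kernel and arguments are the ones from the question):

calcFlux<<<numBlocks, numThreads>>>(concs, temp2);
cudaError_t err = cudaGetLastError();        // catches launch-configuration errors
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));
#ifdef DEBUG
// Block the host until the kernel finishes, so execution errors are reported
// here rather than at some later, unrelated CUDA call.
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("kernel error: %s\n", cudaGetErrorString(err));
#endif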

Writing from Device to Host and notifying the host

Using CUDA 5 with VS 2012 and capability 3.5 (Titan and K20).
At particular stages of my kernel execution, I want to send a generated data chunk to the host memory and notify the host that the data is ready, so the host will operate on it.
I cannot wait until the end of the kernel execution to read the data back from the device, because:
The data is no longer relevant to the device once it is calculated, so there is no point keeping it to the end.
The data size is too large to fit on the device memory and wait until the end.
The host should not have to wait until the end of the kernel execution to start processing the data.
Could you point me to the path I have to take and the possible CUDA concepts and functions I have to use to achieve my requirements? Put simply, how can I write to the host and notify the host that a chunk of data is ready for host processing?
N.B. Each thread does not share any generated data with any other thread; they run independently. So, as far as I know (and please correct me if I am wrong), the concept of blocks, threads and warps does not affect the question. Or in other words, if they aid the answer, I am free to alter their combination.
Below is a sample code that shows what I am trying to do:
#pragma once
#include <conio.h>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime_api.h>

__global__ void Kernel(size_t length, float* hResult)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // Processing multiple data chunks
    for (int i = 0; i < length; i++)
    {
        // Once this is assigned, I don't need it on the device anymore.
        hResult[i + (tid * length)] = i * 100;
    }
}

int main()
{
    size_t length = 10;
    size_t threads = 2;
    float* hResult;
    // An array that will hold all data from all threads
    cudaMallocHost((void**)&hResult, threads * length * sizeof(float));
    Kernel<<<threads, 1>>>(length, hResult);
    // I DO NOT want to wait to the end and block to get the data
    cudaError_t error = cudaDeviceSynchronize();
    if (error != cudaSuccess) { throw error; }
    for (int i = 0; i < threads * length; i++)
    {
        printf("%f\n", hResult[i]);
    }
    cudaFreeHost(hResult);
    system("pause");
    return 0;
}
Here is one possible approach. At a high level, on the device:
You'll need to write the data to either device global memory (allocated previously with cudaMalloc) or else directly to host memory (allocated previously with cudaHostAlloc). This memory should be accessed via a volatile pointer.
You may wish to do all the data writing to this region from a single threadblock, to be sure that all the data is written prior to the following steps
You'll then want to issue a __threadfence() call (if you're using device global memory) or a __threadfence_system() call (if using host memory) prior to the following steps.
Next you'll write to a special location in device global memory or host memory, let's call it the mailbox location, with a specific value indicating the data is ready. This location should also be accessed with a volatile pointer.
Optionally issue another __threadfence() or __threadfence_system() call.
If device global memory is used, then on the receiving end both regions (payload and "mailbox") should again be accessed using a volatile pointer.
On the host:
Before launching the kernel, the host will need to set the mailbox location to a default value.
After launching the kernel, the host thread will need to "poll" the mailbox location, looking for the specific value indicating data is ready
Once the specific value is seen, indicating that the data is ready, the host can consume the data
Optionally, if you want to repeat this process, the host can reset the mailbox location to the default value. The device can check for this default value before updating the data block with new data.
Both the mailbox location and the payload region should be accessed by the host thread using a volatile pointer.
Note that even with the above process, there is still an implied device-wide synchronization needed, if the data is being generated/created from multiple threadblocks. The only straightforward device-wide synchronization available is the kernel launch (or completion of the kernel, specifically). Copying the data from a single threadblock simply moves the requirement for device-wide sync out of this particular sequence (to somewhere before this sequence).
The reasons you give don't really suggest to me that the code could not be refactored to create the data on a kernel-launch by kernel-launch basis, which would neatly solve these issues and eliminate the need for the above process as well.
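A minimal sketch of that mailbox pattern, assuming a single producer threadblock and a 64-bit UVA platform so the kernel can dereference pinned host pointers directly (producer, payload, mailbox and READY are illustrative names):

#include <cstdio>
#include <cuda_runtime.h>

#define READY 1

__global__ void producer(volatile float *payload, volatile int *mailbox, int n)
{
    // A single threadblock writes the whole payload...
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        payload[i] = (float)i;
    __syncthreads();                  // every thread in the block has written
    if (threadIdx.x == 0) {
        __threadfence_system();       // make the payload visible to the host first
        *mailbox = READY;             // ...then raise the flag
        __threadfence_system();
    }
}

int main()
{
    const int n = 1024;
    float *payload; int *mailbox;
    cudaHostAlloc((void**)&payload, n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&mailbox, sizeof(int), cudaHostAllocMapped);
    *mailbox = 0;                                   // default value, set before launch

    producer<<<1, 256>>>(payload, mailbox, n);
    cudaStreamQuery(0);   // on WDDM this can help flush the launch queue

    volatile int *vbox = mailbox;
    while (*vbox != READY) { }                      // host polls the mailbox

    printf("payload[42] = %f\n", payload[42]);      // consume the data

    cudaDeviceSynchronize();
    cudaFreeHost(payload);
    cudaFreeHost(mailbox);
    return 0;
}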
EDIT: responding to a question in the comments.
It's difficult to be more specific about how to refactor the code to deliver one data chunk per kernel call, without a specific example.
Let's take an image processing case, where I have a video sequence of 30 frames stored in global memory. The kernel will process each frame according to some algorithm, then make the processed data available to the host.
In your proposal, after the kernel is done processing a frame, it can signal to the host that the data is ready, and go on to process the next frame. The problem is, if the frame is processed by multiple threadblocks, there's no easy way to know when all threadblocks are done processing that frame. A device-wide synchronization barrier might be what is needed, but it doesn't exist conveniently, except via the kernel call mechanism. However, presumably inside such a kernel we might have a sequence like this:
while (more_frames)
    process a frame
    signal host
    increment frame pointer
In a refactored approach, we would move the loop outside the kernel, to host code:
while (more_frames)
    call kernel to process frame
    consume frame
    increment frame pointer
By doing this, the kernel marks the explicit synchronization needed to know when the frame processing is complete, and the data can be consumed.
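A rough sketch of that refactored structure, where the blocking copy-back acts as the per-frame synchronization point (process_frame, consume_on_host, d_frames, d_out, h_out and the sizes are illustrative names):

for (int f = 0; f < num_frames; ++f) {
    process_frame<<<grid, block>>>(d_frames + f * frame_elems, d_out, frame_elems);
    // cudaMemcpy waits for the preceding kernel in the same (default) stream,
    // so the frame is guaranteed to be complete before the host consumes it.
    cudaMemcpy(h_out, d_out, frame_elems * sizeof(float), cudaMemcpyDeviceToHost);
    consume_on_host(h_out, frame_elems);
}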

Launching a CUDA stream from each host thread

My intention is to use n host threads to create n streams concurrently on a NVidia Tesla C2050. The kernel is a simple vector multiplication...I am dividing the data equally amongst n streams, and each stream would have concurrent execution/data transfer going on.
The data is floating point. I am sometimes getting CPU/GPU sums that are equal, and sometimes they are wide apart... I guess this could be attributed to missing synchronization constructs in my code, but I also don't think any sync constructs between streams are necessary, because I want every CPU thread to have a unique stream to control, and I do not care about asynchronous data copies and kernel execution within a thread.
Following is the code each thread runs:
// every thread would run this method in conjunction
static CUT_THREADPROC solverThread(TGPUplan *plan)
{
    // Allocate memory
    cutilSafeCall( cudaMalloc((void**)&plan->d_Data, plan->dataN * sizeof(float)) );

    // Copy input data from CPU
    cutilSafeCall( cudaMemcpyAsync((void *)plan->d_Data, (void *)plan->h_Data, plan->dataN * sizeof(float), cudaMemcpyHostToDevice, plan->stream) );
    // to make cudaMemcpyAsync blocking
    cudaStreamSynchronize( plan->stream );

    // launch
    launch_simpleKernel( plan->d_Data, BLOCK_N, THREAD_N, plan->stream);
    cutilCheckMsg("simpleKernel() execution failed.\n");
    cudaStreamSynchronize(plan->stream);

    // Read back GPU results
    cutilSafeCall( cudaMemcpyAsync(plan->h_Data, plan->d_Data, plan->dataN * sizeof(float), cudaMemcpyDeviceToHost, plan->stream) );
    // to make the cudaMemcpyAsync blocking...
    cudaStreamSynchronize(plan->stream);

    cutilSafeCall( cudaFree(plan->d_Data) );
    CUT_THREADEND;
}
And creation of multiple threads and calling the above function:
for (i = 0; i < nkernels; i++)
    threadID[i] = cutStartThread((CUT_THREADROUTINE)solverThread, &plan[i]);

printf("main(): waiting for GPU results...\n");
cutWaitForThreads(threadID, nkernels);
I took this strategy from one of the CUDA SDK code samples. As I've said before, this code works sometimes, and other times it gives wayward results. I need help with fixing this code...
First off, I am not an expert by any stretch of the imagination; this is just from my experience.
I don't see why this needs multiple host threads. It seems like you're managing one device and passing it multiple streams. The way I've seen this done (pseudocode):
{
    // create a handle / context for the device
    // allocate an array of streams equal to the number of streams you want
    cudaStream_t streamArray[NUM_STREAMS];
    for (int n = 0; n < NUM_STREAMS; n++)
    {
        cudaStreamCreate(&streamArray[n]);
    }
}
From there you can just pass the streams in your array to the various asynchronous calls (cudaMemcpyAsync(), stream-parameterized kernel launches, etc.) and the device manages the rest. I've had weird scalability issues with multiple streams (don't try to make 10k streams; I ran into problems around 4-8 on a GTX 460), so don't be surprised if you run into those. Best of luck,
John
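For illustration, issuing the per-chunk work on those streams from a single host thread might look like this (a sketch; chunkN, d_Data, h_Data and simpleKernel are placeholder names, and h_Data is assumed to be pinned host memory so the async copies can actually overlap):

for (int n = 0; n < NUM_STREAMS; n++) {
    size_t off = (size_t)n * chunkN;
    cudaMemcpyAsync(d_Data + off, h_Data + off, chunkN * sizeof(float),
                    cudaMemcpyHostToDevice, streamArray[n]);
    simpleKernel<<<BLOCK_N, THREAD_N, 0, streamArray[n]>>>(d_Data + off, chunkN);
    cudaMemcpyAsync(h_Data + off, d_Data + off, chunkN * sizeof(float),
                    cudaMemcpyDeviceToHost, streamArray[n]);
}
cudaDeviceSynchronize();   // wait once for all streams, instead of after every call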
My bet is that BLOCK_N and THREAD_N don't cover the exact size of the array you are passing.
Please provide the code for initializing the streams and the size of those buffers.
As a side note, streams are useful for overlapping computation with memory transfers. Syncing the stream after each async call is not useful at all.