CUDA host to device transfer faster than device to host transfer

I was working on a simple CUDA program in which I found that 90% of the time was spent in a single statement: a cudaMemcpy from device to host. The program transfers about 2 MB of data from host to device in 600-700 microseconds, but copying 4 MB back from device to host takes 10 ms. The total time taken by my program is 13 ms. My question is: why is there this asymmetry between the host-to-device and device-to-host copies? Is it because CUDA developers assumed that the copy back would usually be smaller? My second question is whether there is any way to circumvent it.
I am using a Fermi GTX 560 graphics card with 343 cores and 1 GB of memory.

Timing CUDA functions is a bit different from timing CPU code. First of all, make sure you do not include the initialization cost of CUDA in your measurement: call a CUDA function at the start of your application; otherwise, initialization may happen after you have already started your timer.
int main(int argc, char **argv) {
    cudaFree(0);
    // ... CUDA is initialized from here on
}
Use a Cutil timer like this:
unsigned int timer;
cutCreateTimer(&timer);
cutStartTimer(timer);
//your code, to assess elapsed time..
cutStopTimer(timer);
printf("Elapsed: %.3f\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);
Now, after these preliminary steps, let's look at the problem. When a kernel is called, the CPU is stalled only until the launch has been delivered to the GPU; the GPU then executes the kernel while the CPU continues. If you call cudaThreadSynchronize(), the CPU stalls until the GPU finishes the current work. A cudaMemcpy operation also waits for the GPU to finish executing, because the values being requested are the ones the kernel is supposed to fill in.
kernel<<<numBlocks, threadPerBlock>>>(...);
cudaError_t err = cudaThreadSynchronize();
if (cudaSuccess != err) {
    fprintf(stderr, "cudaCheckError() failed at %s:%i : %s.\n",
            __FILE__, __LINE__, cudaGetErrorString(err));
    exit(1);
}
// now the kernel is complete..
cutStopTimer(timer);
So place a synchronization before calling the stop-timer function. If you place a memory copy right after the kernel call, the elapsed time of that copy will include part of the kernel execution, which is very likely why your device-to-host copy appears so slow. So the memcpy operation should be placed after the timing operations, or behind an explicit synchronization.
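For reference, the same measurement can be done without cutil by using CUDA events, which are recorded into the stream and therefore let you time the kernel and the copy separately. This is only a minimal sketch; the kernel, buffer sizes, and launch configuration below are placeholders, not your actual code:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = 1.0f;                 // placeholder work
}

int main()
{
    const int n = 1 << 20;                      // ~4 MB of floats
    float *d_out, *h_out;
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMallocHost(&h_out, n * sizeof(float));  // pinned host buffer

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start);
    kernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaEventRecord(afterKernel);               // marks the end of the kernel
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy);                 // marks the end of the D2H copy
    cudaEventSynchronize(afterCopy);

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
    printf("kernel: %.3f ms, D2H copy: %.3f ms\n", kernelMs, copyMs);

    cudaFree(d_out);
    cudaFreeHost(h_out);
    return 0;
}
If the copy measured this way is still around 10 ms for 4 MB, the bottleneck really is the transfer; if not, the earlier measurement was including kernel time.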
There are also some profiler counters that can be used to assess specific sections of your kernels:
How to profile the number of global memory transactions for cuda kernels?
How Do You Profile & Optimize CUDA Kernels?

Related

How can I pause a CUDA stream and then resume it?

Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a running CUDA kernel with an instruction placed in the host code and resume it with another instruction in the host code?
I have no idea how to write sample code for this case, for example, to flesh out this question.
My question is exactly this: is there an instruction in CUDA that can pause a CUDA kernel running in a CUDA stream and then resume it?
You can use dynamic parallelism, with kernel parameters used to communicate the signals from the host. Then launch a parent kernel with only one CUDA thread and let it launch child kernels continuously until the work is done or a signal is received. If the child kernel does not fully occupy the GPU, it will lose performance.
__global__ void parent(int *atomicSignalPause, int *atomicSignalExit, Parameters *prm)
{
    int progress = 0;
    while (checkSignalExit(atomicSignalExit) && progress < 100)
    {
        while (checkSignalPause(atomicSignalPause))
        {
            child<<<X, Y>>>(prm, progress++);
            cudaDeviceSynchronize();
        }
    }
}
There is no command to pause a stream. For multiple GPUs, you should use unified memory allocation for the communication between the GPUs.
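To make the signalling part concrete, here is a minimal host-side sketch using unified (managed) memory for the flags, as suggested above. It assumes a device that supports concurrentManagedAccess, so the host can write the flags while the kernel is running; the checkSignal* helpers and the parent kernel are simplified stand-ins for the snippet above, and the child launch is elided so the sketch compiles without relocatable device code:
#include <cstdio>
#include <cuda_runtime.h>

// Simplified stand-ins for the helpers used above: a zero flag means "keep going".
__device__ int checkSignalExit(volatile int *exitFlag)   { return *exitFlag == 0; }
__device__ int checkSignalPause(volatile int *pauseFlag) { return *pauseFlag == 0; }

__global__ void parent(int *pauseFlag, int *exitFlag, int *progressOut)
{
    int progress = 0;
    while (checkSignalExit(exitFlag) && progress < 100)
    {
        while (checkSignalPause(pauseFlag) && progress < 100)
        {
            // child<<<X, Y>>>(prm, progress);  // real work goes here (needs -rdc=true)
            progress++;
        }
        // while paused (and not exited) the parent just spins here
    }
    *progressOut = progress;                    // report how far we got
}

int main()
{
    int *pauseFlag, *exitFlag, *progress;
    cudaMallocManaged(&pauseFlag, sizeof(int));
    cudaMallocManaged(&exitFlag, sizeof(int));
    cudaMallocManaged(&progress, sizeof(int));
    *pauseFlag = *exitFlag = *progress = 0;

    parent<<<1, 1>>>(pauseFlag, exitFlag, progress); // single control thread

    // In a real application these writes would happen at arbitrary later times.
    *pauseFlag = 1;   // "pause": the device loop stops issuing work
    *pauseFlag = 0;   // "resume"
    *exitFlag  = 1;   // ask the parent to finish

    cudaDeviceSynchronize();
    printf("progress = %d\n", *progress);
    return 0;
}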
To overcome the GPU-utilization issue, you can build a task queue for the child kernels. It pushes work N times (roughly enough to keep the GPU efficient in power/compute); then, for every completed child kernel, it increments a dedicated counter in the parent kernel and pushes a new work item, until all the work is complete (while trying to keep the number of concurrent kernels at N).
Maybe something like this:
// producer kernel
// N: number of work items that keep the GPU fully utilized
while (hasWork)
{
    // concurrency is a global parameter
    while (checkConcurrencyAtomic(concurrency) < N)
    {
        incrementConcurrencyAtomic(concurrency);
        // a "consumer" parent kernel will get items from the queue;
        // it will decrement concurrency when a work item is done
        bool success = myQueue.tryPush(work, concurrency);
        if (success)
        {
            // update the status of the whole work, or signal the host
        }
    }
    // synchronization once per ~N work items
    cudaDeviceSynchronize();
    // ... then check for pause signals and other tasks
}
If the total work takes more than a few seconds, these atomic updates shouldn't be a performance problem, but if you have far too many child kernels to launch then you can launch more producer/consumer (parent) CUDA threads.

Nvidia CUDA: Profiler indicates memory transfer operations are not performed asynchronously

I have profiled my CUDA application and the profiling results are not as I would expect them to be.
Here's a summary of how my application works:
There are 4 streams used
The CPU loop runs around polling the state of each stream
If the stream is found to be idle, then a function is called: launch_job
This function looks like this:
launch_job(cudaStream_t stream, ...)
{
    cudaMemcpyAsync(..., stream);
    cuda_process_kernel<<<grid, block, 0, stream>>>(...);
    cudaError_t err = cudaGetLastError();
    if (err) ...
    cudaMemcpyAsync(..., stream);
}
For the first block of four kernel launches seen in the profiler screenshot, a different stream is used each time launch_job is called.
However, there is no overlap of the memory transfers or the kernel executions.
I would have expected to see at least one memory transfer overlapped with a kernel execution, if not both memory transfers. (One is in the H2D direction, the other is D2H, but that is probably obvious.)
Have I fundamentally misunderstood something about the way in which streams work? Or is there some other reason why my launch_job function does not produce parallelized memory transfer and kernel function execution?
Please try this:
For each stream, do cudaMemcpyAsync(..., stream) to copy H2D.
For each stream, launch the kernels on that stream;
For each stream, do cudaMemcpyAsync(..., stream) to copy D2H.
Note that you now have three for loops here. If your GPU supports it, your profiler should show some overlap among the different streams.
Also, if your data is really small, say only 1 MB, you may not see much overlap; it will be more obvious if you have a 100 MB copy on each stream.
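A rough sketch of that issue order (not your actual code), assuming pinned host buffers allocated with cudaMallocHost, since cudaMemcpyAsync only overlaps with kernel execution when the host memory is page-locked; the kernel body and buffer sizes are placeholders:
#include <cuda_runtime.h>

__global__ void cuda_process_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;           // placeholder work
}

int main()
{
    const int nStreams = 4, n = 1 << 22;        // ~16 MB per buffer
    const size_t bytes = n * sizeof(float);
    float *h_in[nStreams], *h_out[nStreams], *d_in[nStreams], *d_out[nStreams];
    cudaStream_t streams[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaMallocHost(&h_in[i], bytes);        // pinned host memory: required for overlap
        cudaMallocHost(&h_out[i], bytes);
        cudaMalloc(&d_in[i], bytes);
        cudaMalloc(&d_out[i], bytes);
        cudaStreamCreate(&streams[i]);
    }

    // 1) all H2D copies first
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, streams[i]);
    // 2) then all kernels
    for (int i = 0; i < nStreams; ++i)
        cuda_process_kernel<<<(n + 255) / 256, 256, 0, streams[i]>>>(d_in[i], d_out[i], n);
    // 3) then all D2H copies
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, streams[i]);

    cudaDeviceSynchronize();
    return 0;
}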

parallel execution of kernels in cuda

Let's say I have three global arrays which have been copied to the GPU using cudaMemcpy, but these global arrays in C have NOT been allocated using cudaHostAlloc (which would give page-locked memory); they are plain global allocations.
int a[100], b[100], c[100];
cudaMemcpy(d_a,a,100*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(d_b,b,100*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(d_c,c,100*sizeof(int),cudaMemcpyHostToDevice);
Now I have 10 kernels which are launched in separate streams so they can run concurrently, and some of them use the global arrays copied to the GPU.
These kernels run for, say, 1000 iterations.
They don't have to copy anything back to the host during the iterations.
But the problem is that they are not executing in parallel; they run in a serial fashion instead.
cudaStream_t stream[3];
for (int i = 0; i < 3; i++) cudaStreamCreate(&stream[i]);
for (int i = 0; i < 100; i++) {
    kernel1<<<blocks, threads, 0, stream[0]>>>(d_a, d_b);
    kernel2<<<blocks, threads, 0, stream[1]>>>(d_b, d_c);
    kernel3<<<blocks, threads, 0, stream[2]>>>(d_c, d_a);
    cudaDeviceSynchronize();
}
I can't understand why.
Kernels issued this way:
for (int i = 0; i < 100; i++) {
    kernel1<<<blocks, threads>>>(d_a, d_b);
    kernel2<<<blocks, threads>>>(d_b, d_c);
    kernel3<<<blocks, threads>>>(d_c, d_a);
    cudaDeviceSynchronize();
}
Will always run serially. In order to get kernels to run concurrently, they must be issued to separate CUDA streams. And there are other requirements as well. Read the documentation.
You'll need to create some CUDA streams, then launch your kernels like this:
cudaStream_t stream1, stream2, stream3;
cudaStreamCreate(&stream1); cudaStreamCreate(&stream2); cudaStreamCreate(&stream3);
for (int i = 0; i < 100; i++) {
    kernel1<<<blocks, threads, 0, stream1>>>(d_a, d_b);
    kernel2<<<blocks, threads, 0, stream2>>>(d_b, d_c);
    kernel3<<<blocks, threads, 0, stream3>>>(d_c, d_a);
    cudaDeviceSynchronize();
}
Actually witnessing concurrent kernel execution will also generally require kernels that have limited resource utilization. If a given kernel will "fill" the machine, due to a large number of blocks, or threads per block, or shared memory usage, or some other resource usage, then you won't actually witness concurrency; there's no room left in the machine.
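For illustration only, here is a hypothetical pair of launches small enough to actually overlap: each kernel uses a single block, so several launches can be resident on the device at once (the kernel, names, and sizes are made up for the sketch):
#include <cuda_runtime.h>

// A deliberately tiny, long-running kernel: one block per launch leaves room for others.
__global__ void busy_kernel(float *d, int iters)
{
    float v = d[threadIdx.x];
    for (int i = 0; i < iters; ++i) v = v * 0.999f + 0.001f;
    d[threadIdx.x] = v;
}

int main()
{
    const int nStreams = 3;
    cudaStream_t streams[nStreams];
    float *d[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        cudaMalloc(&d[i], 256 * sizeof(float));
    }
    // One block per launch, issued to different streams: these can run concurrently.
    for (int i = 0; i < nStreams; ++i)
        busy_kernel<<<1, 256, 0, streams[i]>>>(d[i], 1 << 20);
    cudaDeviceSynchronize();
    return 0;
}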
You may also want to review some of the CUDA sample codes, such as simpleStreams and concurrentKernels.

Difference on creating a CUDA context

I have a program that uses three kernels. In order to get correct speedup measurements, I was launching a dummy kernel to create a context, as follows:
__global__ void warmStart(int* f)
{
*f = 0;
}
which is launched before the kernels I want to time as follows:
int *dFlag = NULL;
cudaMalloc( (void**)&dFlag, sizeof(int) );
warmStart<<<1, 1>>>(dFlag);
Check_CUDA_Error("warmStart kernel");
I also read about other, simpler ways to create a context, such as cudaFree(0) or cudaDeviceSynchronize(). But using these API calls gives worse times than using the dummy kernel.
The execution times of the program, after forcing the context, are 0.000031 seconds for the dummy kernel and 0.000064 seconds for both cudaDeviceSynchronize() and cudaFree(0). The times were obtained as the mean of 10 individual executions of the program.
Therefore, the conclusion I've reached is that launching a kernel initializes something that is not initialized when creating a context in the canonical way.
So, what is the difference between creating a context in these two ways, using a kernel launch and using an API call?
I ran the test on a GTX 480, using CUDA 4.0 under Linux.
Each CUDA context has memory allocations that are required to execute a kernel but are not required to synchronize, allocate memory, or free memory. The initial allocation of this context memory, and the resizing of these allocations, is deferred until a kernel requires the resources. Examples of these allocations include the local memory buffer, the device heap, and the printf heap.
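A small sketch of that idea, for illustration only: cudaFree(0) creates the context but does not by itself trigger those per-kernel resources, while the first kernel launch does. The limits that govern some of them (device heap, printf FIFO, per-thread stack) can be queried with cudaDeviceGetLimit; note these are the configured limits, not measured allocations:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warmStart(int *f) { *f = 0; }

int main()
{
    // Context creation alone; the kernel-launch resources are still deferred.
    cudaFree(0);

    int *dFlag = NULL;
    cudaMalloc((void **)&dFlag, sizeof(int));
    warmStart<<<1, 1>>>(dFlag);                 // first launch forces the deferred allocations
    cudaDeviceSynchronize();

    // Configured limits for some of the lazily allocated, per-context resources:
    size_t heapSize = 0, printfSize = 0, stackSize = 0;
    cudaDeviceGetLimit(&heapSize,   cudaLimitMallocHeapSize);
    cudaDeviceGetLimit(&printfSize, cudaLimitPrintfFifoSize);
    cudaDeviceGetLimit(&stackSize,  cudaLimitStackSize);
    printf("device heap: %zu B, printf FIFO: %zu B, stack/thread: %zu B\n",
           heapSize, printfSize, stackSize);

    cudaFree(dFlag);
    return 0;
}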

cudaMemcpy & blocking

I'm confused by some comments I've seen about blocking and cudaMemcpy. It is my understanding that the Fermi HW can simultaneously execute kernels and do a cudaMemcpy.
I read that the library function cudaMemcpy() is a blocking function. Does this mean the function will block further execution until the copy has fully completed? Or does this mean the copy won't start until the previous kernels have finished?
e.g. Does this code provide the same blocking operation?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);
Your examples are equivalent. If you want asynchronous execution you can use streams or contexts and cudaMemcpyAsync, so that you can overlap execution with copy.
According to the NVIDIA Programming guide:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
So as long as your transfer size is larger than 64 KB, your examples are equivalent.
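If you do want the host to continue past the copy, here is a sketch of the asynchronous variant the answer mentions, assuming pinned host memory and a non-default stream. SomeCudaCall and the launch configuration are placeholders taken from the question; the cudaMallocHost allocation is an addition needed for truly asynchronous copies:
#include <cuda_runtime.h>

__global__ void SomeCudaCall(int *someData)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) *someData = 1;  // placeholder work
}

int main()
{
    int *h_data, *d_data;
    cudaMallocHost(&h_data, sizeof(int));       // pinned: required for truly async copies
    cudaMalloc(&d_data, sizeof(int));
    *h_data = 42;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both calls return immediately to the host; within the stream they still
    // execute in issue order, so the copy waits for the kernel on the device.
    SomeCudaCall<<<25, 34, 0, stream>>>(d_data);
    cudaMemcpyAsync(d_data, h_data, sizeof(int), cudaMemcpyHostToDevice, stream);

    // The host is free to do other work here, then synchronize when needed.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}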