I have a program that uses three kernels. To get the speedups measured correctly, I was forcing context creation beforehand with a dummy kernel launch, as follows:
__global__ void warmStart(int* f)
{
    *f = 0;
}
which is launched before the kernels I want to time as follows:
int *dFlag = NULL;
cudaMalloc( (void**)&dFlag, sizeof(int) );
warmStart<<<1, 1>>>(dFlag);
Check_CUDA_Error("warmStart kernel");
I also read about other, simpler ways to create a context, such as cudaFree(0) or cudaDeviceSynchronize(). But using these API calls gives worse times than using the dummy kernel.
After forcing context creation, the execution times of the program are 0.000031 seconds with the dummy kernel and 0.000064 seconds with either cudaDeviceSynchronize() or cudaFree(0). The times were obtained as the mean of 10 individual executions of the program.
Therefore, the conclusion I've reached is that launching a kernel initializes something that is not initialized when creating a context the canonical way.
So, what is the difference between these two ways of creating a context: using a kernel launch versus using an API call?
I ran the test on a GTX 480, using CUDA 4.0 under Linux.
Each CUDA context has memory allocations that are required to execute a kernel but are not required in order to synchronize, allocate memory, or free memory. The initial allocation of this context memory, and the resizing of these allocations, is deferred until a kernel requires the resources. Examples of these allocations include the local memory buffer, the device heap, and the printf heap.
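For example (a minimal sketch, not from the original answer; the empty kernel and function name are just for illustration), you can combine both ideas: create the context cheaply and then force the kernel-related allocations with a trivial launch before the timed section starts:
__global__ void emptyKernel() {}

void warmUpContext()
{
    cudaFree(0);                  // establishes the context only
    emptyKernel<<<1, 1>>>();      // forces the deferred kernel-execution allocations
    cudaDeviceSynchronize();      // make sure the warm-up has finished before timing
}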
In the following code:
__managed__ int mData[1024];

void foo(int* dataOut)
{
    some_kernel_that_writes_to_mdata<<<...>>>();
    // cudaDeviceSynchronize() // do I need this synch here?
    memcpy(dataOut, mData, sizeof(int) * 1024);
    ...
    cudaDeviceSynchronize();
}
do I need synchronization between the kernel and memcpy?
The cudaMemcpy documentation mentions that the function exhibits synchronous behavior for most use cases. But what about a "normal" memcpy from/to managed memory? In my tests the synchronization seems to happen implicitly, but I can't find that in the documentation.
Yes, you need that synchronization.
The kernel launch is asynchronous. Therefore the CPU thread will continue on to the next line of code, after launching the kernel, without any guarantee that the kernel completes.
If your subsequent copy operation is expecting to pick up data modified by the kernel, it's necessary to force the kernel to complete first.
cudaMemcpy is a special case. It is issued into the default stream. It has both a device synchronizing characteristic (forces all previously issued work to that device to complete, before it begins the copy), as well as a CPU thread blocking characteristic (it does not return from the library call, i.e. allow the CPU thread to proceed, until the copy operation is complete.)
(That synchronization would also be required in a pre-Pascal UM regime. The fact that you are not getting a segfault suggests to me that you are in a demand-paged UM regime.)
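To make that concrete, here is a minimal sketch of the fixed foo() (I've invented a trivial body and launch configuration for the kernel, since the question elides them):
#include <cstring>   // memcpy

__managed__ int mData[1024];

// Stand-in body for the kernel from the question.
__global__ void some_kernel_that_writes_to_mdata()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1024)
        mData[i] = i;
}

void foo(int* dataOut)
{
    some_kernel_that_writes_to_mdata<<<4, 256>>>();
    cudaDeviceSynchronize();                     // required: plain memcpy does not synchronize
    memcpy(dataOut, mData, sizeof(int) * 1024);  // safe now, the kernel has completed
}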
Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a running CUDA kernel with an instruction placed in the host code, and then resume it with another instruction in the host code?
I have no idea how to write sample code for this case, for example, to continue this question.
Exactly, my question is whether there is an instruction in CUDA that can pause a CUDA kernel running in a CUDA stream and then resume it.
You can use dynamic parallelism, with kernel parameters used to communicate the signals from the host. Then launch a parent kernel with only one CUDA thread and let it launch child kernels continuously until the work is done or a signal is received. If a child kernel does not fully occupy the GPU, it will lose performance.
__global__ void parent(int* atomicSignalPause, int* atomicSignalExit, Parameters* prm)
{
    int progress = 0;
    // checkSignalExit/checkSignalPause are user-provided helpers (not shown)
    // that read the atomic flags written by the host.
    while (checkSignalExit(atomicSignalExit) && progress < 100)
    {
        while (checkSignalPause(atomicSignalPause))
        {
            child<<<X, Y>>>(prm, progress++);
            cudaDeviceSynchronize();   // device-side sync: wait for the child kernel
        }
    }
}
There is no command to pause a stream. For multiple GPUs, you should use unified memory allocation for the communication (between GPUs).
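Here is a hedged, self-contained sketch of the host-side signalling. It simplifies the scheme above to a single polling kernel rather than the full parent/child structure, and uses pinned, mapped host memory as one way to make host writes visible to a running kernel; the flag names are mine.
#include <cuda_runtime.h>

// Minimal persistent kernel that polls two host-visible flags.
__global__ void parentLoop(volatile int* pauseFlag, volatile int* exitFlag)
{
    int progress = 0;
    while (*exitFlag == 0 && progress < 100)
    {
        if (*pauseFlag == 0)   // "resumed": do, or launch, one piece of work
            ++progress;
    }
}

int main()
{
    int *pauseFlag, *exitFlag;
    cudaHostAlloc((void**)&pauseFlag, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void**)&exitFlag,  sizeof(int), cudaHostAllocMapped);
    *pauseFlag = 1;   // start "paused"
    *exitFlag  = 0;

    parentLoop<<<1, 1>>>(pauseFlag, exitFlag);   // persistent single-thread kernel

    *pauseFlag = 0;   // "resume": the kernel starts making progress
    // ... later ...
    *exitFlag  = 1;   // ask the kernel to finish
    cudaDeviceSynchronize();

    cudaFreeHost(pauseFlag);
    cudaFreeHost(exitFlag);
    return 0;
}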
To overcome the GPU utilization issue, you can implement a task queue for child kernels. It pushes work N times (N being roughly enough to keep the GPU efficient in power/compute); then, for every completed child kernel, it increments a dedicated counter in the parent kernel and pushes a new piece of work, until all of the work is complete (while trying to keep the number of concurrent kernels at N).
Maybe something like this:
// producer kernel
// N: number of works that make gpu fully utilized
while (hasWork)
{
    // concurrency is a global parameter
    while (checkConcurrencyAtomic(concurrency) < N)
    {
        incrementConcurrencyAtomic(concurrency);
        // a "consumer" parent kernel will get items from the queue
        // and decrement concurrency when a work item is done
        bool success = myQueue.tryPush(work, concurrency);
        if (success)
        {
            // update status of whole work or signal the host
        }
    }
    // synchronization once per ~N work
    cudaDeviceSynchronize();
    // ... then check for pause signals and other tasks
}
If the total work takes more than a few seconds, these atomic value updates shouldn't be a performance problem. But if you have far too many child kernels to launch, you can launch more producer/consumer (parent) CUDA threads.
I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have the lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations made with malloc or new in device code are allocated on the device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. The device heap is also not accessible to the host side memory management APIs, so you additionally need a copy kernel to transfer the contents into host-API-accessible memory before you can transfer them back to the host.
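For completeness, here is a rough sketch of what the device-side route looks like (the kernel names are mine, and error checking is omitted):
__device__ char* buffer;

// Setup kernel: allocate the persistent buffer on the device runtime heap.
__global__ void setupBuffer(size_t n)
{
    buffer = (char*)malloc(n);
}

// Copy kernel: the runtime heap is not visible to cudaMemcpy, so stage the
// contents into memory that was allocated with cudaMalloc.
__global__ void copyOut(char* dst, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = buffer[i];
}

// Host side (sketch):
// cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes); // before the first device-side malloc
// setupBuffer<<<1, 1>>>(bufferBytes);
// ... kernels using buffer ...
// copyOut<<<grid, block>>>(d_staging, bufferBytes);
// cudaMemcpy(h_result, d_staging, bufferBytes, cudaMemcpyDeviceToHost);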
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
#include <vector>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero memory and assign to global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // copy to host
    std::vector<char> results(800 * 600);
    cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer has lifespan until free'd here
    cudaFree(d_buffer);

    return 0;
}
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.
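As one further illustration (the kernel below is hypothetical, not part of the original answer), any kernel in the same translation unit can then use the symbol directly, without taking it as an argument:
// Hypothetical kernel that writes to the persistent buffer via the global symbol.
__global__ void fillBuffer(char value, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        buffer[i] = value;   // buffer was set with cudaMemcpyToSymbol above
}

// Usage, in place of the "Kernels go here" comment (sizes taken from the example):
// fillBuffer<<<(800 * 600 + 255) / 256, 256>>>(0x7f, 800 * 600);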
Let's say I have three global arrays which have been copied to the GPU using cudaMemcpy, but these global arrays in C have NOT been allocated using cudaHostAlloc (so the memory is not page-locked); instead they are plain global allocations.
int a[100],b [100],c[100];
cudaMemcpy(d_a,a,100*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(d_b,b,100*sizeof(int),cudaMemcpyHostToDevice);
cudaMemcpy(d_c,c,100*sizeof(int),cudaMemcpyHostToDevice);
Now I have 10 kernels which are launched in separate streams so as to run concurrently, and some of them use the global arrays copied to the GPU.
These kernels run for, say, 1000 iterations.
They don't have to copy anything back to the host during the iterations.
But the problem is that they are not executing in parallel; instead they run serially.
cudaStream_t stream[3];
for(int i=0;i<3;i++) cudaStreamCreate(&stream[i]);

for(int i=0;i<100;i++){
    kernel1<<<blocks,threads,0,stream[0]>>>(d_a,d_b);
    kernel2<<<blocks,threads,0,stream[1]>>>(d_b,d_c);
    kernel3<<<blocks,threads,0,stream[2]>>>(d_c,d_a);
    cudaDeviceSynchronize();
}
I can't understand why?
Kernels issued this way:
for(int i=0;i<100;i++){
    kernel1<<<blocks,threads>>>(d_a,d_b);
    kernel2<<<blocks,threads>>>(d_b,d_c);
    kernel3<<<blocks,threads>>>(d_c,d_a);
    cudaDeviceSynchronize();
}
Will always run serially. In order to get kernels to run concurrently, they must be issued to separate CUDA streams. And there are other requirements as well. Read the documentation.
You'll need to create some CUDA streams, then launch your kernels like this:
cudaStream_t stream1, stream2, stream3;
cudaStreamCreate(&stream1); cudaStreamCreate(&stream2); cudaStreamCreate(&stream3);

for(int i=0;i<100;i++){
    kernel1<<<blocks,threads,0,stream1>>>(d_a,d_b);
    kernel2<<<blocks,threads,0,stream2>>>(d_b,d_c);
    kernel3<<<blocks,threads,0,stream3>>>(d_c,d_a);
    cudaDeviceSynchronize();
}
Actually witnessing concurrent kernel execution will also generally require kernels that have limited resource utilization. If a given kernel will "fill" the machine, due to a large number of blocks, or threads per block, or shared memory usage, or some other resource usage, then you won't actually witness concurrency; there's no room left in the machine.
You may also want to review some of the CUDA sample codes, such as simpleStreams and concurrentKernels.
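In the same spirit as the concurrentKernels sample (this is a sketch, not the sample's actual code), a kernel that uses a single small block and just spins on the clock leaves plenty of room on the device, so the overlap between streams is easy to see in a profiler:
// Tiny spin kernel: one small block per launch, so several can run concurrently.
__global__ void spinKernel(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles)
        ;  // burn time while using almost no resources
}

// e.g.:
// spinKernel<<<1, 32, 0, stream1>>>(100000000);
// spinKernel<<<1, 32, 0, stream2>>>(100000000);
// spinKernel<<<1, 32, 0, stream3>>>(100000000);
// cudaDeviceSynchronize();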
I'm confused by some comments I've seen about blocking and cudaMemcpy. It is my understanding that the Fermi HW can simultaneously execute kernels and do a cudaMemcpy.
I read that the library function cudaMemcpy() is a blocking function. Does this mean the function will block further execution until the copy has fully completed? Or does this mean the copy won't start until the previous kernels have finished?
e.g. Does this code provide the same blocking operation?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);
Your examples are equivalent. If you want asynchronous execution you can use streams or contexts and cudaMemcpyAsync, so that you can overlap execution with copy.
According to the NVIDIA Programming Guide:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
So as long as your transfer size is larger than 64 KB, your examples are equivalent.
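To illustrate the asynchronous alternative mentioned above (the variable names are mine and SomeCudaCall is given a trivial stand-in body), pinned host memory plus cudaMemcpyAsync in its own stream lets the copy overlap with kernel execution:
#include <cuda_runtime.h>

__global__ void SomeCudaCall(int* p)   // stand-in for the kernel in the question
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *p = 42;
}

int main()
{
    int *h_data, *d_data, *d_param;
    cudaHostAlloc((void**)&h_data, sizeof(int), cudaHostAllocDefault);  // pinned: required for truly async copies
    cudaMalloc((void**)&d_data, sizeof(int));
    cudaMalloc((void**)&d_param, sizeof(int));

    cudaStream_t kernelStream, copyStream;
    cudaStreamCreate(&kernelStream);
    cudaStreamCreate(&copyStream);

    SomeCudaCall<<<25, 34, 0, kernelStream>>>(d_param);    // kernel in its own stream
    cudaMemcpyAsync(d_data, h_data, sizeof(int),
                    cudaMemcpyHostToDevice, copyStream);   // returns immediately; copy may overlap the kernel

    cudaStreamSynchronize(copyStream);     // wait only for the copy
    cudaStreamSynchronize(kernelStream);   // then for the kernel

    cudaFree(d_data);
    cudaFree(d_param);
    cudaFreeHost(h_data);
    return 0;
}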