cudaMemcpy & blocking - cuda

I'm confused by some comments I've seen about blocking and cudaMemcpy. It is my understanding that the Fermi HW can simultaneously execute kernels and do a cudaMemcpy.
I read that the library function cudaMemcpy() is a blocking function. Does this mean the function will block further host execution until the copy has fully completed? Or does it mean the copy won't start until the previous kernels have finished?
e.g. Does this code provide the same blocking operation?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);

Your examples are equivalent. If you want asynchronous execution, you can use streams (or multiple contexts) together with cudaMemcpyAsync, so that you can overlap kernel execution with the copy.

According to the NVIDIA Programming guide:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
So as long as your transfer size is larger than 64KB your examples are equivalent.
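If you do want the overlap, a minimal sketch of the stream-based approach (hypothetical names `someKernel`, `h_a`, `d_a`, `d_b`; pinned host memory is required for real overlap):

```cuda
#include <cuda_runtime.h>

__global__ void someKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;   // placeholder work
}

int main()
{
    const int N = 1 << 20;
    int *h_a, *d_a, *d_b;
    cudaMallocHost(&h_a, N * sizeof(int));   // pinned memory: required for real overlap
    cudaMalloc(&d_a, N * sizeof(int));
    cudaMalloc(&d_b, N * sizeof(int));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // The kernel in s0 and the copy in s1 touch different buffers,
    // so the hardware is free to overlap them.
    someKernel<<<25, 34, 0, s0>>>(d_b, N);
    cudaMemcpyAsync(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice, s1);

    cudaDeviceSynchronize();   // wait for both streams before using the results
    return 0;
}
```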

Related

Does memcpy from/to unified memory exhibit synchronous behavior?

In the following code:
__managed__ int mData[1024];

void foo(int* dataOut)
{
    some_kernel_that_writes_to_mdata<<<...>>>();
    // cudaDeviceSynchronize() // do I need this sync here?
    memcpy(dataOut, mData, sizeof(int) * 1024);
    ...
    cudaDeviceSynchronize();
}
do I need synchronization between the kernel and memcpy?
cudaMemcpy documentation mentions that the function exhibits synchronous behavior for most use cases. But what about "normal" memcpy from/to managed memory? In my tests it seems the synchronization happens implicitly, but I can't find that in documentation.
Yes, you need that synchronization.
The kernel launch is asynchronous. Therefore the CPU thread will continue on to the next line of code, after launching the kernel, without any guarantee that the kernel completes.
If your subsequent copy operation is expecting to pick up data modified by the kernel, it's necessary to force the kernel to complete first.
cudaMemcpy is a special case. It is issued into the default stream. It has both a device synchronizing characteristic (forces all previously issued work to that device to complete, before it begins the copy), as well as a CPU thread blocking characteristic (it does not return from the library call, i.e. allow the CPU thread to proceed, until the copy operation is complete.)
(That synchronization would also be required in a pre-Pascal UM regime. The fact that you are not getting a seg fault suggests to me that you are in a demand-paged UM regime.)
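A sketch of the corrected foo(), with the synchronization in place (grid/block sizes and the kernel body are placeholders):

```cuda
#include <cstring>
#include <cuda_runtime.h>

__managed__ int mData[1024];

__global__ void some_kernel_that_writes_to_mdata()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 1024) mData[i] = i;   // placeholder writes
}

void foo(int *dataOut)
{
    some_kernel_that_writes_to_mdata<<<4, 256>>>();
    cudaDeviceSynchronize();   // required: plain memcpy knows nothing about CUDA
                               // streams, so the kernel must be forced to finish first
    memcpy(dataOut, mData, sizeof(int) * 1024);
}
```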

When is safe to reuse CPU buffer when calling cudaMemcpyAsync?

My project will have multiple threads, each one issuing kernel executions on different CUDA streams. Another thread will consume the results, which will be stored in a queue. Some pseudo-code:
while (true) {
    cudaMemcpyAsync(d_mem, h_mem, nbytes, cudaMemcpyHostToDevice, some_stream);
    kernel_launch(some_stream);
    cudaMemcpyAsync(h_queue_results[i++], d_result, nbytes, cudaMemcpyDeviceToHost, some_stream);
}
Is it safe to reuse h_mem after the first cudaMemcpyAsync returns? Or should I use N host buffers for issuing the GPU computation?
How do I know when h_mem can be reused? Should I do some synchronization using cudaEvents?
BTW, h_mem is host-pinned. If it were pageable, could I reuse it immediately? From what I have read here, it seems I could reuse it immediately after cudaMemcpyAsync returns. Am I right?
Asynchronous:
For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed. For transfers between pinned host memory and device memory, the function is fully asynchronous. For transfers from device memory to pageable host memory, the function will return only once the copy has completed. For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread. For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
(From the documentation on memcpy asynchronous behavior.)
Thanks!
In order to get copy/compute overlap, you must use pinned memory. The reason for this is contained in the paragraph you excerpted. Presumably the whole reason for your multi-streamed approach is for copy/compute overlap, so I don't think the correct answer is to switch to using pageable memory buffers.
Regarding your question, assuming h_mem is only used as the source buffer for the pseudo-code you've shown here (i.e. the data in it only participates in that one cudaMemcpyAsync call), then the h_mem buffer is no longer needed once the next cuda operation in that stream begins. So if your kernel_launch were an actual kernel<<<...>>>(...), then once kernel begins, you can be assured that the previous cudaMemcpyAsync is complete.
You could use cudaEvents with cudaEventSynchronize() or cudaStreamWaitEvent(), or you could use cudaStreamSynchronize() directly in the stream. For example, if you have a cudaStreamSynchronize() call somewhere in the stream pseudocode you have shown, and it is after the cudaMemcpyAsync call, then any code after the cudaStreamSynchronize() call is guaranteed to be executing after the cudaMemcpyAsync() call is complete. All of the calls I've referenced are documented in the usual place.
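A sketch of the event-based approach (kernel name, sizes, and the queue layout are assumptions; `cudaEventSynchronize` here blocks the host only until the H2D copy is done, not until the whole stream drains):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;   // placeholder work
}

void produce(int *h_mem /* pinned */, int **h_queue_results, int iters, int n,
             cudaStream_t some_stream)
{
    int *d_mem, *d_result;
    cudaMalloc(&d_mem, n * sizeof(int));
    cudaMalloc(&d_result, n * sizeof(int));

    cudaEvent_t h2dDone;
    cudaEventCreate(&h2dDone);

    for (int i = 0; i < iters; ++i) {
        cudaMemcpyAsync(d_mem, h_mem, n * sizeof(int),
                        cudaMemcpyHostToDevice, some_stream);
        cudaEventRecord(h2dDone, some_stream);   // marks the end of the H2D copy
        my_kernel<<<(n + 255) / 256, 256, 0, some_stream>>>(d_mem, d_result, n);
        cudaMemcpyAsync(h_queue_results[i], d_result, n * sizeof(int),
                        cudaMemcpyDeviceToHost, some_stream);

        cudaEventSynchronize(h2dDone);   // host waits only for the copy, not the kernel
        // h_mem may now be refilled for the next iteration
    }
}
```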

Cuda: Kernel launch queue

I'm not finding much info on the mechanics of a kernel launch operation. The API say to see the CudaProgGuide. And I'm not finding much there either.
Being that kernel execution is asynchronous, and some machines support concurrent execution, I'm led to believe there is a queue for the kernels.
Host code:
1. malloc(hostArry, ......);
2. cudaMalloc(deviceArry, .....);
3. cudaMemcpy(deviceArry, hostArry, ... hostToDevice);
4. kernelA<<<1,300>>>(int, int);
5. kernelB<<<10,2>>>(float, int);
6. cudaMemcpy(hostArry, deviceArry, ... deviceToHost);
7. cudaFree(deviceArry);
Line 3 is synchronous. Lines 4 and 5 are asynchronous, and the machine supports concurrent execution. So at some point, both of these kernels are running on the GPU. (There is the possibility that kernelB starts and finishes before kernelA finishes.) While this is happening, the host is executing line 6. Line 6 is synchronous with respect to the copy operation, but there is nothing preventing it from executing before kernelA or kernelB has finished.
1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?
Yes, there are a variety of queues on the GPU, and the driver manages those.
Asynchronous calls return more or less immediately. Synchronous calls do not return until the operation is complete. Kernel calls are asynchronous. Most other CUDA runtime API calls are designated by the suffix Async if they are asynchronous. So to answer your question:
1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
There are various queues. The GPU blocks/stalls the host on a synchronous call, but the kernel launch is not a synchronous operation. It returns immediately, before the kernel has completed, and perhaps before the kernel has even started. When launching operations into a single stream, all CUDA operations in that stream are serialized. Therefore, even though kernel launches are asynchronous, you will not observe overlapped execution for two kernels launched to the same stream, because the CUDA subsystem guarantees that a given CUDA operation in a stream will not start until all previous CUDA operations in the same stream have finished. There are other specific rules for the null stream (the stream you are using if you don't explicitly call out streams in your code), but the preceding description is sufficient for understanding this question.
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?
Since the operation that transfers results from the device to the host is a CUDA call (cudaMemcpy...), and it is issued in the same stream as the preceding operations, the device and CUDA driver manage the execution sequence of cuda calls so that the cudaMemcpy does not begin until all previous CUDA calls issued to the same stream have completed. Therefore a cudaMemcpy issued after a kernel call in the same stream is guaranteed not to start until the kernel call is complete, even if you use cudaMemcpyAsync.
You can use cudaDeviceSynchronize() after a kernel call to guarantee that all previous tasks requested to the device has been completed.
If the results of kernelB are independent of the results of kernelA, you can place this call right before the memory copy operation. If not, you would need to block the device before calling kernelB as well, resulting in two blocking operations.
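A short sketch of the ordering guarantee described above (all operations issued to the default stream; variable names are placeholders):

```cuda
// Both kernels are issued to the same (default) stream as the copy below.
kernelA<<<1, 300>>>(d_arry, n);
kernelB<<<10, 2>>>(d_other, n);

// The copy will not begin until both kernels have completed, because CUDA
// serializes operations within a stream. And because plain cudaMemcpy also
// blocks the host, hostArry is safe to read on the very next line.
cudaMemcpy(hostArry, d_arry, n * sizeof(int), cudaMemcpyDeviceToHost);
```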

Accessing cuda device memory when the cuda kernel is running

I have allocated memory on device using cudaMalloc and have passed it to a kernel function. Is it possible to access that memory from host before the kernel finishes its execution?
The only way I can think of to get a memcpy to kick off while the kernel is still executing is by submitting an asynchronous memcpy in a different stream than the kernel. (If you use the default APIs for either kernel launch or asynchronous memcpy, the NULL stream will force the two operations to be serialized.)
But because there is no way to synchronize a kernel's execution with a stream, that code would be subject to a race condition. i.e. the copy engine might pull from memory that hasn't yet been written by the kernel.
The person who alluded to mapped pinned memory is onto something: if the kernel writes to mapped pinned memory, it is effectively "copying" data to host memory as it finishes processing it. This idiom works nicely, provided the kernel will not be touching the data again.
It is possible, but there's no guarantee as to the contents of the memory you retrieve in such a way, since you don't know what the progress of the kernel is.
What you're trying to achieve is to overlap data transfer and execution. That is possible through the use of streams. You create multiple CUDA streams, and queue a kernel execution and a device-to-host cudaMemcpy in each stream. For example, put the kernel that fills the location "0" and cudaMemcpy from that location back to host into stream 0, kernel that fills the location "1" and cudaMemcpy from "1" into stream 1, etc. What will happen then is that the GPU will overlap copying from "0" and executing "1".
Check CUDA documentation, it's documented somewhere (in the best practices guide, I think).
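A sketch of that chunked pattern (`NSTREAMS`, `CHUNK`, and `fill_chunk` are hypothetical names; h_data must be pinned host memory for the async copies to overlap):

```cuda
#include <cuda_runtime.h>

#define NSTREAMS 4
#define CHUNK (1 << 18)

__global__ void fill_chunk(int *chunk, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) chunk[i] = i;   // placeholder work
}

void overlapped_fill(int *h_data /* pinned */, int *d_data)
{
    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NSTREAMS; ++i) {
        int *d_chunk = d_data + i * CHUNK;
        fill_chunk<<<(CHUNK + 255) / 256, 256, 0, streams[i]>>>(d_chunk, CHUNK);
        // The D2H copy of chunk i can overlap with the kernel filling chunk i+1
        cudaMemcpyAsync(h_data + i * CHUNK, d_chunk, CHUNK * sizeof(int),
                        cudaMemcpyDeviceToHost, streams[i]);
    }
    cudaDeviceSynchronize();   // wait for all streams before using h_data
}
```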
You can't access GPU memory directly from the host, regardless of whether a kernel is running or not.
If you're talking about copying that memory back to the host before the kernel is finished writing to it, then the answer depends on the compute capability of your device. But all but the very oldest chips can perform data transfers while the kernel is running.
It seems unlikely that you would want to copy memory that is still being updated by a kernel though. You would get some random snapshot of partially finished data. Instead, you might want to set up something where you have two buffers on the device. You can copy one of the buffers while the GPU is working on the other.
Update:
Based on your clarification, I think the closest you can get is using mapped page-locked host memory, also called zero-copy memory. With this approach, values are copied to the host as they are written by the kernel. There is no way to query the kernel to see how much of the work it has performed, so I think you would have to repeatedly scan the memory for newly written values. See section 3.2.4.3, Mapped Memory, in the CUDA Programming Guide v4.2 for a bit more information.
I wouldn't recommend this though. Unless you have some very unusual requirements, there is likely to be a better way to accomplish your task.
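For reference, a minimal zero-copy sketch along the lines described (`flag_kernel` and `N` are hypothetical; requires a device that supports mapped host memory, and `cudaSetDeviceFlags` must run before any other CUDA call creates the context):

```cuda
#include <cuda_runtime.h>

__global__ void flag_kernel(volatile int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1;   // writes land directly in mapped host memory
}

int main()
{
    const int N = 1024;
    int *h_out, *d_out;
    cudaSetDeviceFlags(cudaDeviceMapHost);                      // enable mapping
    cudaHostAlloc(&h_out, N * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_out, h_out, 0);                 // device-side alias

    flag_kernel<<<(N + 255) / 256, 256>>>(d_out, N);
    // The host may poll h_out while the kernel runs, but there is no
    // per-element completion guarantee; synchronize for the final state.
    cudaDeviceSynchronize();
    return 0;
}
```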
When you launch the kernel, it is an asynchronous (non-blocking) call. Calling cudaMemcpy next will block until the kernel has finished.
If you want the result for debugging purposes, maybe you can use a CUDA debugger such as cuda-gdb, where you can step through the kernel and inspect the memory.
For small result checks you could also use printf() in the Kernel code.
Or run only a threadblock of size (1,1) if you are interested in that specific result.

How to use shared memory between kernel call of CUDA?

I want to use shared memory between kernel calls of one kernel.
Can I use shared memory between kernel calls?
No, you can't. Shared memory has thread-block life-cycle. A variable stored in it is accessible only by the threads belonging to one block, and only during one __global__ function invocation.
You could try page-locked (pinned) memory instead, though it will be much slower than device memory:
cudaHostAlloc(&ptr, size, cudaHostAllocMapped);
then pass the pointer to the kernel code (with UVA the host pointer can be used directly; otherwise obtain a device pointer via cudaHostGetDevicePointer()).
Previously you could do it in a non-standard way, where you would have a unique id for each shared memory block and the next kernel would check the id and carry out the required processing on that shared memory block. This was hard to implement, as you needed to ensure full occupancy for each kernel and deal with various corner cases. In addition, without formal support you could not rely on compatibility across compute devices and CUDA versions.