My project will have multiple threads, each one issuing kernel executions on different cudaStreams. Some other thread will consume the results that whill be stored in a queue Some pseudo-code here:
while(true) {
cudaMemcpyAsync(d_mem, h_mem, some_stream)
kernel_launch(some_stream)
cudaMemcpyAsync(h_queue_results[i++], d_result, some_stream)
}
Is safe to reuse the h_mem after the first cudaMemcpyAsync returns? or should I use N host buffers for issuing the gpu computation?
How to know when the h_mem can be reused? should I make some synchronization using cudaevents?
BTW. h_mem is host-pinned. If it was pageable, could I reuse h_mem inmediatly? from what I have read here it seems I could reuse inmediatly after memcpyasync returns, am i right?
Asynchronous
For transfers from pageable host memory to device memory, host memory
is copied to a staging buffer immediately (no device synchronization
is performed). The function will return once the pageable buffer has
been copied to the staging memory. The DMA transfer to final
destination may not have completed. For transfers between pinned host
memory and device memory, the function is fully asynchronous. For
transfers from device memory to pageable host memory, the function
will return only once the copy has completed. For all other transfers,
the function is fully asynchronous. If pageable memory must first be
staged to pinned memory, this will be handled asynchronously with a
worker thread. For transfers from any host memory to any host memory,
the function is fully synchronous with respect to the host.
MemcpyAsynchronousBehavior
Thanks!
In order to get copy/compute overlap, you must use pinned memory. The reason for this is contained in the paragraph you excerpted. Presumably the whole reason for your multi-streamed approach is for copy/compute overlap, so I don't think the correct answer is to switch to using pageable memory buffers.
Regarding your question, assuming h_mem is only used as the source buffer for the pseudo-code you've shown here (i.e. the data in it only participates in that one cudaMemcpyAsync call), then the h_mem buffer is no longer needed once the next cuda operation in that stream begins. So if your kernel_launch were an actual kernel<<<...>>>(...), then once kernel begins, you can be assured that the previous cudaMemcpyAsync is complete.
You could use cudaEvents with cudaEventSynchronize() or cudaStreamWaitEvent(), or you could use cudaStreamSynchronize() directly in the stream. For example, if you have a cudaStreamSynchronize() call somewhere in the stream pseudocode you have shown, and it is after the cudaMemcpyAsync call, then any code after the cudaStreamSynchronize() call is guaranteed to be executing after the cudaMemcpyAsync() call is complete. All of the calls I've referenced are documented in the usual place.
Related
I am using Cuda API:
cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )
to copy data from GPU memory from CPU memory. In case copying the data from CPU memory to Persistent Memory using memcpy(), we need to explicitly call the flush operation(eg. clflush()) to make sure data is flushed from CPU caches. Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync();
Do I need to call the flush operation when copying from GPU Memory to Persistent Memory using cudaMemcpyAsync();
No.
However, you are calling a potentially asynchronous API, so you may need to use one of the synchronization APIs (stream or device scope) in order to ensure data consistency between operations that can potentially overlap and need to access the same memory area.
Intel processors with the server uncore design starting with Sandy Bridge support Data Direct I/O (DDIO), which is enabled by default. With DDIO, an inbound PCIe write targeting system memory location of type WB is an allocating write transaction.
For a full write (that writes to an entire cache line), the IIO first obtains ownership of the target cache line by invalidating all copies in the coherence domain except in the L3 that exists in the same NUMA node to which the originating device is attached. If the line doesn't already exist in the target L3, an L3 entry is allocated, which may require evicting another line to make space. The write is performed in the L3 and the coherence state of the line becomes M. This means that the data is not sent to the memory controller to which its address is mapped. Partial writes are buffered in the IIO (which is in the coherence domain) until they are eventually evicted to be written into the LLC (allocate or update). In DDIO, reads are never allocating.
Even if DDIO is disabled, PCIe writes can be buffered in the DDIO. When cudaMemcpyAsync or even cudaMemcpy returns, there is no guarantee that all writes have reached the persistence domain on Intel processors (unless you have Whole System Persistence). In addition, the memory copy is not guaranteed to be persistently atomic and there is no guarantee in what order the bytes will move from the IIO to the target memory controllers. You need a flag to tell you whether the entire data was persisted or not.
You can use a barrier (cudaStreamSynchronize() or cudaDeviceSynchronize()) to wait on the host until the data copy operation is complete, and then flush each cache line, followed by writing a flag, in that order.
I repeatedly enqueue a sequence of kernels:
for 1..100:
for 1..10000:
// Enqueue GPU kernels
Kernel 1 - update each element of array
Kernel 2 - sort array
Kernel 3 - operate on array
end
// run some CPU code
output "Waiting for GPU to finish"
// copy from device to host
cudaMemcpy ... D2H(array)
end
Kernel 3 is of order O(N^2) so is by far the slowest of all. For Kernel 2 I use thrust::sort_by_key directly on the device:
thrust::device_ptr<unsigned int> key(dKey);
thrust::device_ptr<unsigned int> value(dValue);
thrust::sort_by_key(key,key+N,value);
It seems that this call to thrust is blocking, as the CPU code only gets executed once the inner loop has finished. I see this because if I remove the call to sort_by_key, the host code (correctly) outputs the "Waiting" string before the inner loop finishes, while it does not if I run the sort.
Is there a way to call thrust::sort_by_key asynchronously?
First of all consider that there is a kernel launch queue, which can hold only so many pending launches. Once the launch queue is full, additional kernel launches, of any kind are blocking. The host thread will not proceed (beyond those launch requests) until empty queue slots become available. I'm pretty sure 10000 iterations of 3 kernel launches will fill this queue before it has reached 10000 iterations. So there will be some latency (I think) with any sort of non-trivial kernel launches if you are launching 30000 of them in sequence. (eventually, however, when all kernels are added to the queue because some have already completed, then you would see the "waiting..." message, before all kernels have actually completed, if there were no other blocking behavior.)
thrust::sort_by_key requires temporary storage (of a size approximately equal to your data set size). This temporary storage is allocated, each time you use it, via a cudaMalloc operation, under the hood. This cudaMalloc operation is blocking. When cudaMalloc is launched from a host thread, it waits for a gap in kernel activity before it can proceed.
To work around item 2, it seems there might be at least 2 possible approaches:
Provide a thrust custom allocator. Depending on the characteristics of this allocator, you might be able to eliminate the blocking cudaMalloc behavior. (but see discussion below)
Use cub SortPairs. The advantage here (as I see it - your example is incomplete) is that you can do the allocation once (assuming you know the worst-case temp storage size throughout the loop iterations) and eliminate the need to do a temporary memory allocation within your loop.
The thrust method (1, above) as far as I know, will still effectively do some kind of temporary allocation/free step at each iteration, even if you supply a custom allocator. If you have a well-designed custom allocator, it might be that this is almost a "no-op" however. The cub method appears to have the drawback of needing to know the max size (in order to completely eliminate the need for an allocation/free step), but I contend the same requirement would be in place for a thrust custom allocator. Otherwise, if you needed to allocate more memory at some point, the custom allocator is effectively going to have to do something like a cudaMalloc, which will throw a wrench in the works.
I'm using: cudaMemcpy2DArrayToArray(). Is there also an asynchronous counterpart of this function? cudaMemcpy2DArrayToArrayAsync() does, not exist. I want to avoid implicit synchronization of my cuda operations.
The other 10 or so cudaMemcpy*() calls all have an async version so my guess is that this call is implemented in some way that prevents a fully async version. Note, though, that the docs say that "This function exhibits synchronous behavior for most use cases" and, by that, they seem to mean:
For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.
For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed.
For transfers from device memory to device memory, no host-side synchronization is performed.
For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
I'm confused by some comments I've seen about blocking and cudaMemcpy. It is my understanding that the Fermi HW can simultaneously execute kernels and do a cudaMemcpy.
I read that Lib func cudaMemcpy() is a blocking function. Does this mean the func will block further execution until the copy has has fully completed? OR Does this mean the copy won't start until the previous kernels have finished?
e.g. Does this code provide the same blocking operation?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);
Your examples are equivalent. If you want asynchronous execution you can use streams or contexts and cudaMemcpyAsync, so that you can overlap execution with copy.
According to the NVIDIA Programming guide:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
So as long as your transfer size is larger than 64KB your examples are equivalent.
I have allocated memory on device using cudaMalloc and have passed it to a kernel function. Is it possible to access that memory from host before the kernel finishes its execution?
The only way I can think of to get a memcpy to kick off while the kernel is still executing is by submitting an asynchronous memcpy in a different stream than the kernel. (If you use the default APIs for either kernel launch or asynchronous memcpy, the NULL stream will force the two operations to be serialized.)
But because there is no way to synchronize a kernel's execution with a stream, that code would be subject to a race condition. i.e. the copy engine might pull from memory that hasn't yet been written by the kernel.
The person who alluded to mapped pinned memory is into something: if the kernel writes to mapped pinned memory, it is effectively "copying" data to host memory as it finishes processing it. This idiom works nicely, provided the kernel will not be touching the data again.
It is possible, but there's no guarantee as to the contents of the memory you retrieve in such a way, since you don't know what the progress of the kernel is.
What you're trying to achieve is to overlap data transfer and execution. That is possible through the use of streams. You create multiple CUDA streams, and queue a kernel execution and a device-to-host cudaMemcpy in each stream. For example, put the kernel that fills the location "0" and cudaMemcpy from that location back to host into stream 0, kernel that fills the location "1" and cudaMemcpy from "1" into stream 1, etc. What will happen then is that the GPU will overlap copying from "0" and executing "1".
Check CUDA documentation, it's documented somewhere (in the best practices guide, I think).
You can't access GPU memory directly from the host regardless of a kernel is running or not.
If you're talking about copying that memory back to the host before the kernel is finished writing to it, then the answer depends on the compute capability of your device. But all but the very oldest chips can perform data transfers while the kernel is running.
It seems unlikely that you would want to copy memory that is still being updated by a kernel though. You would get some random snapshot of partially finished data. Instead, you might want to set up something where you have two buffers on the device. You can copy one of the buffers while the GPU is working on the other.
Update:
Based on your clarification, I think the closest you can get is using mapped page-locked host memory, also called zero-copy memory. With this approach, values are copied to the host as they are written by the kernel. There is no way to query the kernel to see how much of the work it has performed, so I think you would have to repeatedly scan the memory for newly written values. See section 3.2.4.3, Mapped Memory, in the CUDA Programming Guide v4.2 for a bit more information.
I wouldn't recommend this though. Unless you have some very unusual requirements, there is likely to be a better way to accomplish your task.
When you launch the Kernel it is an asynchronous (non blocking) call. Calling cudaMemcpy next will block until the Kernel has finished.
If you want to have the result for Debug purposes maybe it is possible for you to use cudaDebugging where you can step through the kernel and inspect the memory.
For small result checks you could also use printf() in the Kernel code.
Or run only a threadblock of size (1,1) if you are interested in that specific result.