I'm using cudaMemcpy2DArrayToArray(). Is there an asynchronous counterpart of this function? cudaMemcpy2DArrayToArrayAsync() does not exist. I want to avoid implicit synchronization of my CUDA operations.
The other 10 or so cudaMemcpy*() calls all have an async version, so my guess is that this call is implemented in some way that prevents a fully async version. Note, though, that the docs say "This function exhibits synchronous behavior for most use cases", by which they seem to mean:
For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.
For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.
For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed.
For transfers from device memory to device memory, no host-side synchronization is performed.
For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
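If you need an asynchronous path anyway, one possible workaround (my assumption, not something the docs promise) is to express the 2D array-to-array copy as a 3D copy with a depth of 1, since cudaMemcpy3DAsync accepts cudaArray sources and destinations. A minimal sketch, where srcArr, dstArr, width, height and stream are assumed to exist already:

// For cudaArray participants the extent is given in array elements.
cudaMemcpy3DParms p = {};
p.srcArray = srcArr;                            // cudaArray_t source
p.dstArray = dstArr;                            // cudaArray_t destination
p.srcPos   = make_cudaPos(0, 0, 0);
p.dstPos   = make_cudaPos(0, 0, 0);
p.extent   = make_cudaExtent(width, height, 1);
p.kind     = cudaMemcpyDeviceToDevice;
cudaMemcpy3DAsync(&p, stream);                  // enqueued on the given stream

Being a device-to-device copy, this should not block the host, but whether it overlaps with other work is up to the driver and hardware.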
I am using Cuda API:
cudaMemcpyAsync ( void* dst, const void* src, size_t count, cudaMemcpyKind kind, cudaStream_t stream = 0 )
to copy data from GPU memory to CPU memory. When copying data from CPU memory to persistent memory using memcpy(), we need to explicitly call a flush operation (e.g. clflush()) to make sure the data is flushed from the CPU caches. Do I need to call a flush operation when copying from GPU memory to persistent memory using cudaMemcpyAsync()?
Do I need to call a flush operation when copying from GPU memory to persistent memory using cudaMemcpyAsync()?
No.
However, you are calling a potentially asynchronous API, so you may need to use one of the synchronization APIs (stream or device scope) in order to ensure data consistency between operations that can potentially overlap and need to access the same memory area.
Intel processors with the server uncore design starting with Sandy Bridge support Data Direct I/O (DDIO), which is enabled by default. With DDIO, an inbound PCIe write targeting system memory location of type WB is an allocating write transaction.
For a full write (that writes to an entire cache line), the IIO first obtains ownership of the target cache line by invalidating all copies in the coherence domain except in the L3 that exists in the same NUMA node to which the originating device is attached. If the line doesn't already exist in the target L3, an L3 entry is allocated, which may require evicting another line to make space. The write is performed in the L3 and the coherence state of the line becomes M. This means that the data is not sent to the memory controller to which its address is mapped. Partial writes are buffered in the IIO (which is in the coherence domain) until they are eventually evicted to be written into the LLC (allocate or update). In DDIO, reads are never allocating.
Even if DDIO is disabled, PCIe writes can be buffered in the IIO. When cudaMemcpyAsync or even cudaMemcpy returns, there is no guarantee that all writes have reached the persistence domain on Intel processors (unless you have Whole System Persistence). In addition, the memory copy is not guaranteed to be persistently atomic, and there is no guarantee in what order the bytes will move from the IIO to the target memory controllers. You need a flag to tell you whether the entire data was persisted or not.
You can use a barrier (cudaStreamSynchronize() or cudaDeviceSynchronize()) to wait on the host until the data copy operation is complete, and then flush each cache line, followed by writing a flag, in that order.
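A minimal sketch of that sequence, assuming clflushopt support, a 64-byte cache line, and that dst_pmem and flag live in persistent memory mapped into the host address space (all of the names are mine):

#include <cstdint>
#include <immintrin.h>      // _mm_clflushopt, _mm_sfence
#include <cuda_runtime.h>

// Copy device data into persistent memory, then make it durable and publish a flag.
void copy_and_persist(char* dst_pmem, const char* d_src, size_t nbytes,
                      volatile uint64_t* flag, cudaStream_t stream)
{
    cudaMemcpyAsync(dst_pmem, d_src, nbytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);           // wait until the copy is complete

    for (size_t off = 0; off < nbytes; off += 64)
        _mm_clflushopt(dst_pmem + off);      // flush each cache line
    _mm_sfence();                            // order flushes before the flag write

    *flag = 1;                               // publish "data is persistent"
    _mm_clflushopt((void*)flag);             // flush the flag itself
    _mm_sfence();
}

Compile the host code with a target that supports clflushopt (e.g. -mclflushopt with gcc), otherwise fall back to _mm_clflush.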
I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have a lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations using malloc or new in device code are allocated on a device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. And the device heap is not accessible to host side memory management APIs, so you also require a copy kernel to transfer the memory contents to host API accessible memory before you can transfer the contents back to the host.
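For completeness, a rough sketch of what that device-side route looks like (heap sizing, a setup kernel that calls malloc, and a copy kernel); the kernel names and sizes are purely illustrative:

__device__ char* buffer;   // same declaration as in the question

__global__ void setup_kernel(size_t sz)
{
    // One thread allocates from the device runtime heap and publishes the pointer
    if (threadIdx.x == 0 && blockIdx.x == 0)
        buffer = static_cast<char*>(malloc(sz));
}

__global__ void copyout_kernel(char* dst, size_t sz)
{
    // Copy from the heap allocation into memory that host APIs can access
    for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < sz;
         i += (size_t)gridDim.x * blockDim.x)
        dst[i] = buffer[i];
}

// Host side (error checking omitted):
//   cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_bytes);  // size the heap first
//   setup_kernel<<<1, 1>>>(buffer_sz);
//   ... kernels using buffer ...
//   cudaMalloc(&d_staging, buffer_sz);
//   copyout_kernel<<<grid, block>>>(d_staging, buffer_sz);
//   cudaMemcpy(h_result, d_staging, buffer_sz, cudaMemcpyDeviceToHost);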
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
#include <vector>
#include <cuda_runtime.h>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero memory and assign to global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // copy to host
    std::vector<char> results(800 * 600);
    cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer has lifespan until freed here
    cudaFree(d_buffer);

    return 0;
}
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.
My project will have multiple threads, each one issuing kernel executions on different cudaStreams. Another thread will consume the results, which will be stored in a queue. Some pseudo-code here:
while (true) {
    cudaMemcpyAsync(d_mem, h_mem, some_stream)
    kernel_launch(some_stream)
    cudaMemcpyAsync(h_queue_results[i++], d_result, some_stream)
}
Is it safe to reuse h_mem after the first cudaMemcpyAsync returns, or should I use N host buffers for issuing the GPU computation?
How do I know when h_mem can be reused? Should I do some synchronization using cudaEvents?
BTW, h_mem is host-pinned. If it were pageable, could I reuse h_mem immediately? From what I have read here it seems I could reuse it immediately after cudaMemcpyAsync returns; am I right?
Asynchronous
For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to final destination may not have completed. For transfers between pinned host memory and device memory, the function is fully asynchronous. For transfers from device memory to pageable host memory, the function will return only once the copy has completed. For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread. For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
MemcpyAsynchronousBehavior
Thanks!
In order to get copy/compute overlap, you must use pinned memory. The reason for this is contained in the paragraph you excerpted. Presumably the whole reason for your multi-streamed approach is for copy/compute overlap, so I don't think the correct answer is to switch to using pageable memory buffers.
Regarding your question, assuming h_mem is only used as the source buffer for the pseudo-code you've shown here (i.e. the data in it only participates in that one cudaMemcpyAsync call), then the h_mem buffer is no longer needed once the next CUDA operation in that stream begins. So if your kernel_launch were an actual kernel<<<...>>>(...), then once the kernel begins, you can be assured that the previous cudaMemcpyAsync is complete.
You could use cudaEvents with cudaEventSynchronize() or cudaStreamWaitEvent(), or you could use cudaStreamSynchronize() directly in the stream. For example, if you have a cudaStreamSynchronize() call somewhere in the stream pseudocode you have shown, and it is after the cudaMemcpyAsync call, then any code after the cudaStreamSynchronize() call is guaranteed to be executing after the cudaMemcpyAsync() call is complete. All of the calls I've referenced are documented in the usual place.
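For instance, a sketch of the event-based variant, under the assumption that h_mem is pinned and that nbytes, result_bytes, grid and block are defined elsewhere (kernel_launch stands in for your real kernel):

cudaEvent_t h2d_done;
cudaEventCreateWithFlags(&h2d_done, cudaEventDisableTiming);

cudaMemcpyAsync(d_mem, h_mem, nbytes, cudaMemcpyHostToDevice, some_stream);
cudaEventRecord(h2d_done, some_stream);     // marks completion of the H2D copy
kernel_launch<<<grid, block, 0, some_stream>>>(d_mem, d_result);
cudaMemcpyAsync(h_queue_results[i++], d_result, result_bytes,
                cudaMemcpyDeviceToHost, some_stream);

cudaEventSynchronize(h2d_done);             // after this returns, h_mem can be refilled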
I have:
Host memory that has been successfully pinned and mapped using cudaHostAlloc(..., cudaHostAllocMapped) or cudaHostRegister(..., cudaHostRegisterMapped);
Device pointers have been obtained using cudaHostGetDevicePointer(...).
I initiate cudaMemcpy(..., cudaMemcpyDeviceToDevice) on src and dest device pointers that point to two different regions of pinned+mapped memory obtained by the technique above.
Everything works fine.
Question: should I continue doing this or just use a traditional CPU-style memcpy() since everything is in system memory anyway? ...or are they the same (i.e. does cudaMemcpy map to a straight memcpy when both src and dest are pinned)?
(I am still using the cudaMemcpy method because previously everything was in device global memory, but have since switched to pinned memory due to gmem size constraints)
With cudaMemcpy the CUDA driver detects that you are copying from a host pointer to a host pointer and the copy is done on the CPU. You can of course use memcpy on the CPU yourself if you prefer.
If you use cudaMemcpy, there may be an extra stream synchronization performed before the copy (you might see it in the profiler, but I'm not certain; test and see).
On a UVA system you can just use cudaMemcpyDefault, as talonmies says in his answer. But if you don't have UVA (which requires sm_20+ and a 64-bit OS), then you have to call the right copy (e.g. cudaMemcpyDeviceToDevice). If you cudaHostRegister() everything you are interested in, then cudaMemcpyDeviceToDevice will end up doing the following, depending on where the memory is located:
Host <-> Host: performed by the CPU (memcpy)
Host <-> Device: DMA (device copy engine)
Device <-> Device: Memcpy CUDA kernel (runs on the SMs, launched by driver)
If you are working on a platform with UVA (unified virtual addressing), I would strongly suggest using cudaMemcpy with cudaMemcpyDefault. That way all of this handwringing about the fastest path becomes an internal API implementation detail you don't have to worry about.
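A minimal sketch of that approach, assuming a UVA platform (the buffer names and size are mine):

char *h_src, *h_dst;
const size_t nbytes = 1 << 20;

// Pinned + mapped allocations; with UVA the returned pointers are valid
// in both the host and device address spaces.
cudaHostAlloc((void**)&h_src, nbytes, cudaHostAllocMapped);
cudaHostAlloc((void**)&h_dst, nbytes, cudaHostAllocMapped);

// Let the runtime work out where each pointer lives and pick the transfer path.
cudaMemcpy(h_dst, h_src, nbytes, cudaMemcpyDefault);

cudaFreeHost(h_src);
cudaFreeHost(h_dst);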
I'm confused by some comments I've seen about blocking and cudaMemcpy. It is my understanding that the Fermi HW can simultaneously execute kernels and do a cudaMemcpy.
I read that the library function cudaMemcpy() is a blocking function. Does this mean the function will block further execution until the copy has fully completed? Or does this mean the copy won't start until the previous kernels have finished?
e.g. Does this code provide the same blocking operation?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);
Your examples are equivalent. If you want asynchronous execution you can use streams or contexts and cudaMemcpyAsync, so that you can overlap execution with copy.
According to the NVIDIA Programming guide:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
So as long as your transfer size is larger than 64 KB, your examples are equivalent.
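If you do want the copy to overlap with kernel execution, the usual pattern is pinned host memory plus cudaMemcpyAsync on a non-default stream. A sketch, reusing SomeCudaCall and someData from your question, with N and the buffer names as my own illustrative choices:

cudaStream_t copy_stream, exec_stream;
cudaStreamCreate(&copy_stream);
cudaStreamCreate(&exec_stream);

const size_t N = 1 << 20;
int *h_data, *d_data;
cudaMallocHost((void**)&h_data, N * sizeof(int));   // pinned, needed for true overlap
cudaMalloc((void**)&d_data, N * sizeof(int));

SomeCudaCall<<<25, 34, 0, exec_stream>>>(someData);   // kernel on exec_stream
cudaMemcpyAsync(d_data, h_data, N * sizeof(int),
                cudaMemcpyHostToDevice, copy_stream); // can overlap with the kernel

cudaStreamSynchronize(copy_stream);
cudaStreamSynchronize(exec_stream);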