A few questions about implicit synchronization? - cuda

In the cuda programming guide, it is mentioned that the following operations will cause implicit synchronization:
a page-locked host memory allocation
I'm wondering whether this include cudaHostRegister and cudaHostUnregister? If not then this shall imply that we can call malloc before all asynchronous operations, then in the asynchronous part we can do cudaHostRegister. Is this right?
any CUDA command to the default stream
Does this include any operations with cudaEvent, such as recording events on stream 0 or let stream 0 wait some events in other streams?
By the way, does the implicit synchronization happen within one device or will the synchronization be over all the devices?

I'm wondering whether this include cudaHostRegister and
cudaHostUnregister?
Yes, I believe there is implicit synchronization with these functions. But like I said in my comment above, these are slow functions. Use cudaHostAlloc() instead if you are able. If you're using shared memory or something like that which requires cudaHostRegister(), you'd generally want to take care of this just once near the beginning of your program and then just leave it registered.
Does this include any operations with cudaEvent, such as recording
events on stream 0 or let stream 0 wait some events in other streams?
Again, this is a CUDA call in the default stream, so I believe implicit synchronization is done here as well.
By the way, does the implicit synchronization happen within one device
or will the synchronization be over all the devices?
Synchronization only applies to the same device. It doesn't impact other devices.
Note that you can now create a stream which doesn't implicitly synchronize with the default stream using cudaStreamCreateWithFlags:
cudaStreamCreateWithFlags( &stream, cudaStreamNonBlocking );
There is something else that could be useful if your host code runs CUDA kernels on the same GPU from multiple host threads. CUDA 7.0 RC has a new nvcc option, --default-stream=per-thread, which you might want to look into. With this option, each host thread by default uses its own stream.
But if you're trying to optimize and check for implicit synchronizations, I'd start by using the CUDA profiler, nvvp, or the profiler which is part of NSight Eclipse Edition.

Related

Wait until *any* device has finished in CUDA?

I have a CUDA kernel that I want to run across multiple GPUs. On each GPU, it's performing a search task, so I'd like to launch it on each GPU and then in the host code wait until any of the GPUs returns (indicating that it found what it was looking for).
I know about cudaDeviceSynchronize(), but that blocks until the current GPU is finished. Is there anything that will let me block until any one of N different GPUs finishes?
CUDA doesn't provide any built-in functions to accomplish that directly.
I believe you would need to do something via polling, and then if you want to poll the results, you can. If you want to build something that blocks the CPU thread, I guess a spin on the polling operation would do it. (cudaDeviceSynchronize() is by default a spin operation under the hood)
You could build a polling system using various ideas:
cudaEvent - launch an event after each kernel launch, then use cudaEventQuery() operations to poll
cudaHostAlloc - use host-pinned memory that each kernel can update with status - read the memory directly
cudaLaunchHostFunc - put a callback in place after each kernel launch. The callback host function would update ordinary host memory, which you could poll for status.
The callback method (at least) would allow you (perhaps via atomics) to collapse the polling to a single memory location, if that were important for some reason. You could probably implement something similar using the host-pinned memory method for systems that have CUDA system atomic support.

Synchronization between CUDA applications

Is there a way to synchronize two different CUDA applications on the same GPU?
I have two different part of processes: original process & post processing. Original process is using GPU. And now we're going to migrate post processing to GPU also. In our architecture there is a requirement, that this two processes should be organized as two separate applications.
And now I'm thinking of synchronization problem:
if I synchronize them on CPU level, I have to know outside when GPU of 1 app is over.
ideal way as I see is to synchronize them somehow on GPU level.
Is there some flag for that purpose? Or some workaround?
Is there a way to synchronize two different CUDA applications on the same GPU?
In a word, no. You would have to do this via some sort of inter-process communication mechanism on the host side.
If you are on Linux or on Windows with a GPU in TCC mode, host IPC will still be required, but you can "interlock" CUDA activity in one process to CUDA activity in another process using the CUDA IPC mechanism. In particular, it is possible to communicate an event handle to another process, using cudaIpcGetEventHandle, and cudaIpcOpenEventHandle. This would provide an event that you could use for a cudaStreamWaitEvent call. Of course, this is really only half of the solution. You would also need to have CUDA IPC memory handles. The CUDA simpleIPC sample code has most of the plumbing you need.
You should also keep in mind that CUDA cannot be used in a child process if CUDA has been initialized in a parent process. This concept is also already provided for in the sample code.
So you would do something like this:
Process A:
create (cudaMalloc) allocation for buffer to hold results to send to post-process
create event for synchronization
get cuda IPC memory and event handles
using host-based IPC, communicate these handles to process B
launch processing work (i.e. GPU kernel) on the data, results should be put in the buffer created above
into the same stream as the GPU kernel, record the event
signal to process B via host based IPC, to launch work
Process B:
receive memory and event handles from process A using host IPC
extract memory pointer and create IPC event from the handles
create a stream for work issue
wait for signal from process A (indicates event has been recorded)
perform cudaStreamWaitEvent using the local event and created stream
in that same stream, launch the post-processing kernel
This should allow the post-processing kernel to begin only when kernel from process A is complete, using the event interlock. Another caveat with this is that you cannot allow process A to terminate at any time during this. Since it is the owner of the memory and event, it must continue to run as long as that memory or that event is required, even if required in another process. If that is a concern, it might make sense to make process B the "owner" and communicate the handles to process A.

Is cudaMallocHost() , cudaCreateEvent() asynchronous with executing kernels?

I am running on a very strange issue with the Cuda Runtime API. Calls to functions like cudaMallocHost(), cudaEventCreate(), cudaFree() etc.. seem to be executed only when kernels finish execution on GPU. This kernels are all launched on a stream created with the cudaStreamNonBlocking flag. What is the problem? Do I have to put up some other flags somewhere?
They could be made asynchronous, but it wouldn't be surprising if they are not.
With respect to cudaMallocHost(), which requires that the host memory be mapped for the GPU: if the allocation can't be satisfied from a preallocated pool, the GPU's page tables must be edited. It would not surprise me in the least if the driver had a restriction where it could not edit the page tables of an executing kernel. (Esp. since the page table editing must be done by kernel mode driver code.)
With respect to cudaEventCreate(), that really should be asynchronous since those allocations generally can be satisfied from a preallocated pool. The main impediment there is that changing the behavior would break existing applications that rely on its current, synchronous behavior.
Freeing objects asynchronously requires the driver to track which objects are referenced in the command buffers submitted to the GPU, and defer the actual free operation until after the GPU has finished processing them. It is doable but I am not sure NVIDIA has done the work.
For cudaFree(), it is not possible to track references as you could for CUDA events (because pointers can be stored for running kernels to read and chase). So for large vitrual address ranges that should be deallocated and unmapped, the free must be deferred until after all pending GPU operations have executed. Again, doable but I am not sure NVIDIA has done the work.
I think NVIDIA generally expects developers to work around the lack of asynchrony in these entry points.

Understanding CUDA dependency check

CUDA C Programming Guide provides the following statements:
For devices that support concurrent kernel execution and are of compute capability 3.0
or lower, any operation that requires a dependency check to see if a streamed kernel
launch is complete:
‣ Can start executing only when all thread blocks of all prior kernel launches from any
stream in the CUDA context have started executing;
‣ Blocks all later kernel launches from any stream in the CUDA context until the kernel
launch being checked is complete.
I am quite lost here. What is a dependency check? Can I say a kernel execution on some device memories requires a dependency check on all the previous kernel or memory transfer involving the same device memory? If this is true (maybe not true), this dependency check blocks all later kernels from any other stream according to the above statement, and therefore no asynchronous or concurrent execution will happen afterward, which seems not true.
Any explanation or elaboration will be appreciated!
First of all I suggest you visit the webinar-site of nvidia and watch the webinar on Concurrency & Streams.
Furthermore consider the following points:
commands issued to the same stream are treated as dependent
e.g. you would insert a kernel into a stream after a memcopy of some data this
kernel will acess. The kernel "depends" on the data being available.
commands in the same stream are therefore guaranteed to be executed sequentially (or synchronously, which is often used as synonym)
commands in different streams however are independent and can be run concurrently
so dependencies are only known to the programmer and are expressed using streams (to avoid errors)!
The following corresponds only to devices with compute capability 3.0 or lower (as stated in the quide). If you want to know more about the changes to stream scheduling behaviour with compute capability 3.5 have a look at HyperQ and the corresponding example. At this point I also want to reference this thread where I found the HyperQ examples :)
About your second question: I do not quite understand what you mean by a "kernel execution on some device memory" or a "kernel execution involving device memory" so i reduced your statement to:
A kernel execution requires a dependency check on all the previous kernels and memory transfers.
Better would be:
A CUDA operation requires a dependency check to see whether preceding CUDA operations in the same stream have completed.
I think your problem here is with the expression "started executing". That means there can still be independent (that is on different streams) kernel launches, which will be concurrent with the previous kernels, provided they have all started executing and enough device resources are available.

CUDA inter-kernel communication between different streams

Has anyone successfully run 2 different kernels in 2 different CUDA streams and gotten them to synchronize? Basically I want to have 1 kernel A send data to another concurrently running kernel B (in a different stream), then get results back. The reason: kernel A is running in 1 CUDA thread and I want a multiple GPU thread implementation for kernel B.
This is with high end GPUs (Fermi/Tesla), CUDA 4.2
Same GPU, different streams. So the data should be able to be communicated thru device memory, but how to sync them?
The CUDA Programming Model only supports communication between threads in the same thread block (CUDA C Programming Guide at the end of section 2.2 Thread Hierarchy). This cannot be reliably implemented through the current CUDA API. If you try you may find partial success. However, this will fail on different OSes, different executions of your application, and this will be broken by future driver updates and new hardware (GK110 supports enhanced concurrency model).
If I correctly caught your question, you have two problems:
Inter-Kernel data exchange
Inter-Kernel synchronization
1) Inter-Kernel Data Exchange can be achieved through sharing data in global device memory.
2) As I know, there is no reliable facilities for inter-kernel synchronization provided by CUDA. And I'm unaware about any suitable trick that can be applied here.
CUDA C Programming Gide v7.5 tells us:
"Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined)."
You will need to synchronize on the host. From the top of my head, calling cudaDeviceSynchronize for every stream in turn should do the trick but it may not be that easy.
Your data must be in global memory
You need to get the data address on the host
You must send this data back to the second kernel
your code must be something similar to this:
*dataToExchange_h,*dataToExchange_d;
cudaMalloc((void**)dataToExchange, sizeof(data));
kernel1<<< M1,N1,0,stream1>>>(dataToExchange);
cudaStreamSynchronize(stream1);
kernel2<<< M2,N2,0,stream2>>>(dataToExchange);
But note that stream synchronization slow down the process, so you should avoid it as much as possible.
You can also get stream synchronization through cuda events, it less obvious and does not give special advantage, but it's useful to know it ;-)