Can cudaIPC only be used between processes?

I followed the CUDA samples to implement cudaIPC. Previously I had two machines, each with 8 GPUs. Let's say we have worker and server processes; in my case, the worker has to send data to the server using cudaIPC. But due to the project's needs, we now start the worker and the server as two threads in one process on each machine. If I keep the old logic, it reports: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading CUDA: invalid device ordinal. I want to know whether cudaIPC can only be used between processes. For my case, how should I modify my implementation?

Yes, CUDA IPC only works between separate processes on the same machine.
In your case, if the worker and server activities are created in separate threads of the same process, simply remove all CUDA IPC calls that create handles, and use the pointers to the allocations (or events) directly.
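For example, here is a minimal sketch of the same-process case (the kernel and thread names are illustrative): a device allocation made in one thread is valid in every thread of the process, so the "worker" and "server" can share the raw pointer with ordinary host-side synchronization.

#include <cuda_runtime.h>
#include <thread>

__global__ void produce(float *buf) { buf[threadIdx.x] = threadIdx.x; }
__global__ void consume(float *buf) { buf[threadIdx.x] += 1.0f; }

int main() {
    float *d_buf = nullptr;
    cudaMalloc(&d_buf, 256 * sizeof(float));   // visible to all threads in this process

    std::thread worker([d_buf] {               // "worker" thread: no IPC handle needed
        produce<<<1, 256>>>(d_buf);
        cudaDeviceSynchronize();
    });
    worker.join();                             // host-side ordering between the threads

    std::thread server([d_buf] {               // "server" thread: same pointer, used directly
        consume<<<1, 256>>>(d_buf);
        cudaDeviceSynchronize();
    });
    server.join();

    cudaFree(d_buf);
    return 0;
}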

Related

Synchronization between CUDA applications

Is there a way to synchronize two different CUDA applications on the same GPU?
I have two different parts of a pipeline: an original process and a post-processing step. The original process uses the GPU, and now we're going to migrate the post-processing to the GPU as well. Our architecture has a requirement that these two parts be organized as two separate applications.
And now I'm thinking about the synchronization problem:
if I synchronize them at the CPU level, I have to know outside the GPU when app 1's GPU work is over.
the ideal way, as I see it, is to synchronize them somehow at the GPU level.
Is there some flag for that purpose? Or some workaround?
In a word, no. You would have to do this via some sort of inter-process communication mechanism on the host side.
If you are on Linux, or on Windows with a GPU in TCC mode, host IPC will still be required, but you can "interlock" CUDA activity in one process with CUDA activity in another process using the CUDA IPC mechanism. In particular, it is possible to communicate an event handle to another process using cudaIpcGetEventHandle and cudaIpcOpenEventHandle. This gives you an event that you can use in a cudaStreamWaitEvent call. Of course, this is really only half of the solution; you will also need CUDA IPC memory handles. The CUDA simpleIPC sample code has most of the plumbing you need.
You should also keep in mind that CUDA cannot be used in a child process if CUDA has been initialized in a parent process. This concept is also already provided for in the sample code.
So you would do something like this:
Process A:
create (cudaMalloc) allocation for buffer to hold results to send to post-process
create event for synchronization
get cuda IPC memory and event handles
using host-based IPC, communicate these handles to process B
launch processing work (i.e. GPU kernel) on the data, results should be put in the buffer created above
record the event into the same stream as the GPU kernel
signal process B via host-based IPC to launch its work
Process B:
receive memory and event handles from process A using host IPC
extract memory pointer and create IPC event from the handles
create a stream for work issue
wait for signal from process A (indicates event has been recorded)
perform cudaStreamWaitEvent using the local event and created stream
in that same stream, launch the post-processing kernel
This should allow the post-processing kernel to begin only when the kernel from process A is complete, using the event interlock. One more caveat: you cannot allow process A to terminate at any point during this. Since it is the owner of the memory and the event, it must continue to run as long as that memory or that event is required, even if required by another process. If that is a concern, it might make sense to make process B the "owner" and communicate the handles to process A.
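A condensed sketch of that flow (error checking and the host-side IPC transport are omitted; the kernel names, sizes, and how the handles travel between processes are illustrative):

// Process A (owner) -- sketch, error checks omitted
float *d_buf;
cudaEvent_t done;
cudaIpcMemHandle_t memHandle;
cudaIpcEventHandle_t evtHandle;

cudaMalloc(&d_buf, N * sizeof(float));
// IPC-shareable events must be created with these two flags
cudaEventCreateWithFlags(&done, cudaEventDisableTiming | cudaEventInterprocess);
cudaIpcGetMemHandle(&memHandle, d_buf);
cudaIpcGetEventHandle(&evtHandle, done);
// ...send memHandle and evtHandle to process B via host IPC...
processKernel<<<grid, block, 0, stream>>>(d_buf);  // produce results in d_buf
cudaEventRecord(done, stream);                     // record into the same stream
// ...signal process B via host IPC that the event has been recorded...

// Process B (consumer) -- sketch
float *d_buf;
cudaEvent_t done;
cudaStream_t stream;
// ...receive memHandle and evtHandle from process A via host IPC...
cudaIpcOpenMemHandle((void **)&d_buf, memHandle, cudaIpcMemLazyEnablePeerAccess);
cudaIpcOpenEventHandle(&done, evtHandle);
cudaStreamCreate(&stream);
// ...wait for process A's signal...
cudaStreamWaitEvent(stream, done, 0);              // GPU-side interlock
postProcessKernel<<<grid, block, 0, stream>>>(d_buf);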

Can execution of CUDA kernels from two contexts overlap?

From this, it appears that two kernels from different contexts cannot execute concurrently. In this regard, I am confused when reading CUPTI activity traces from two applications. The traces show kernel_start_timestamp, kernel_end_timestamp and duration (which is kernel_end_timestamp - kernel_start_timestamp).
Application 1:
.......
8024328958006530 8024329019421612 61415082
.......
Application 2:
.......
8024328940410543 8024329048839742 108429199
To make the long timestamp and duration more readable:
Application 1 : kernel X of 61.415 ms ran from xxxxx28.958 s to xxxxx29.019 s
Application 2 : kernel Y of 108.429 ms ran from xxxxx28.940 s to xxxxx29.0488 s
So, the execution of kernel X completely overlaps with that of kernel Y.
I am using the /path_to_cuda_install/extras/CUPTI/sample/activity_trace_async sample for tracing the applications. I modified CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE to 1024 and CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT to 1. I have only enabled tracing for CUPTI_ACTIVITY_KIND_MEMCPY, CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL and CUPTI_ACTIVITY_KIND_OVERHEAD. My applications call cuptiActivityFlushAll(0) once in each of their respective logical timesteps.
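Roughly, the configuration I described looks like this (a sketch; the buffer-request/complete callbacks registered via cuptiActivityRegisterCallbacks in the sample are omitted):

#include <cupti.h>

size_t attrValue, attrValueSize = sizeof(size_t);
attrValue = 1024;   // modified device buffer size
cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE, &attrValueSize, &attrValue);
attrValue = 1;      // modified device buffer pool limit
cuptiActivitySetAttribute(CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_POOL_LIMIT, &attrValueSize, &attrValue);

// only these three activity kinds are enabled
cuptiActivityEnable(CUPTI_ACTIVITY_KIND_MEMCPY);
cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
cuptiActivityEnable(CUPTI_ACTIVITY_KIND_OVERHEAD);

// once per logical timestep of the application:
cuptiActivityFlushAll(0);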
Are these erroneous CUPTI values that I am seeing due to improper usage or is it something else?
Clarification: MPS not enabled, running on a single GPU
UPDATE: bug filed, this seems to be a known problem for CUDA 6.5
Waiting for a chance to test this with CUDA 7 (have a GPU shared between multiple users and need a window of inactivity for temporary switch to CUDA 7)
I don't know how to set up the CUPTI activity traces. But two kernels can share a time-span on a single GPU even without the MPS server, though only one will actually be running on the GPU at any instant.
If the CUDA MPS server is not in use, then kernels from different contexts cannot overlap. Assuming you're not using the MPS server, a time-sliced scheduler decides which context may access the GPU at any given moment; without MPS, a context can only access the GPU in the time slots that the scheduler assigns to it. Thus, only kernels from a single context are running on the GPU at a time (without the MPS server).
Note that it is possible for multiple kernels to share a time-span with each other on a GPU, but within that time-span only kernels from a single context can actually access the GPU's resources (I am also assuming that you're using a single GPU).
For more information you can also check the CUDA MPS documentation

How to choose a non-busy CUDA device?

I'm working on a cluster with many nodes, and each node has two GPUs. On the cluster I can't launch "nvidia-smi" to check which device is busy. My code selects the best device (with cudaChooseDevice) in terms of capability, but when the cluster assigns me the same node for two different jobs, I end up with two tasks running on the same GPU.
My question is: is there a way to check at runtime whether the device is busy or not?
Thanks
Your cluster managers should install and use cluster management (job-scheduling) software that allows them to assign and track GPUs just like CPUs and memory. There are a number of job schedulers that can do this. Even without explicit GPU support in the job-scheduler, it's possible to build job entry/exit scripts that will assign GPUs properly.
You can effectively include the same functionality that nvidia-smi uses by embedding NVML in your applications. Any query or data item reported on by nvidia-smi can be accessed programmatically through NVML.
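As a sketch of the NVML route (the "no running compute processes" heuristic and the function name are illustrative; link against the NVML library, and note that NVML's device order may differ from CUDA's):

#include <nvml.h>

// Return the index of a device with no compute processes, or -1 if none found.
int pickIdleDevice(void) {
    unsigned int count = 0;
    int found = -1;
    nvmlInit();
    nvmlDeviceGetCount(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        unsigned int procCount = 0;
        nvmlDeviceGetHandleByIndex(i, &dev);
        // With a zero-size buffer this returns NVML_SUCCESS only if
        // no compute processes are running on the device.
        if (nvmlDeviceGetComputeRunningProcesses(dev, &procCount, NULL) == NVML_SUCCESS
            && procCount == 0) {
            found = (int)i;
            break;
        }
    }
    nvmlShutdown();
    return found;
}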
It's also not clear to me why you couldn't launch a script for your job that checks which devices are busy using nvidia-smi, then picks an un-busy device.
But keep in mind that any runtime check you might do would be subject to the behavior of other applications. If those applications (whether launched by you or other users) have unusual behavior, your runtime check can easily be defeated.

Are cudaMallocHost() and cudaEventCreate() asynchronous with executing kernels?

I am running into a very strange issue with the CUDA runtime API. Calls to functions like cudaMallocHost(), cudaEventCreate(), cudaFree(), etc. seem to execute only when kernels finish execution on the GPU. These kernels are all launched on a stream created with the cudaStreamNonBlocking flag. What is the problem? Do I have to set some other flag somewhere?
They could be made asynchronous, but it wouldn't be surprising if they are not.
With respect to cudaMallocHost(), which requires that the host memory be mapped for the GPU: if the allocation can't be satisfied from a preallocated pool, the GPU's page tables must be edited. It would not surprise me in the least if the driver had a restriction where it could not edit the page tables of an executing kernel (especially since the page table editing must be done by kernel-mode driver code).
With respect to cudaEventCreate(), that really should be asynchronous since those allocations generally can be satisfied from a preallocated pool. The main impediment there is that changing the behavior would break existing applications that rely on its current, synchronous behavior.
Freeing objects asynchronously requires the driver to track which objects are referenced in the command buffers submitted to the GPU, and defer the actual free operation until after the GPU has finished processing them. It is doable but I am not sure NVIDIA has done the work.
For cudaFree(), it is not possible to track references the way you could for CUDA events, because pointers can be stored for running kernels to read and chase. So for large virtual address ranges that should be deallocated and unmapped, the free must be deferred until after all pending GPU operations have executed. Again, doable, but I am not sure NVIDIA has done the work.
I think NVIDIA generally expects developers to work around the lack of asynchrony in these entry points.
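If you want to observe this behavior yourself, here is a timing sketch (the spin kernel and sizes are illustrative):

#include <cuda_runtime.h>
#include <cstdio>
#include <chrono>

__global__ void spin(long long cycles) {      // busy-wait roughly 'cycles' clocks
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main() {
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    spin<<<1, 1, 0, s>>>(2000000000LL);       // long-running kernel, queued asynchronously

    auto t0 = std::chrono::steady_clock::now();
    void *h = nullptr;
    cudaMallocHost(&h, 1 << 20);              // does this block until spin() finishes?
    auto t1 = std::chrono::steady_clock::now();
    printf("cudaMallocHost took %.1f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    cudaDeviceSynchronize();
    cudaFreeHost(h);
    cudaStreamDestroy(s);
    return 0;
}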

CUDA inter-kernel communication between different streams

Has anyone successfully run 2 different kernels in 2 different CUDA streams and gotten them to synchronize? Basically I want kernel A to send data to another concurrently running kernel B (in a different stream), then get results back. The reason: kernel A runs in 1 CUDA thread, and I want kernel B to be implemented with multiple GPU threads.
This is with high end GPUs (Fermi/Tesla), CUDA 4.2
Same GPU, different streams. So the data should be able to be communicated through device memory, but how do I synchronize them?
The CUDA programming model only supports communication between threads in the same thread block (see the CUDA C Programming Guide, end of section 2.2, Thread Hierarchy). What you describe cannot be reliably implemented through the current CUDA API. If you try, you may find partial success; however, it will fail on different OSes and different executions of your application, and it will be broken by future driver updates and new hardware (GK110 supports an enhanced concurrency model).
If I understood your question correctly, you have two problems:
Inter-Kernel data exchange
Inter-Kernel synchronization
1) Inter-Kernel Data Exchange can be achieved through sharing data in global device memory.
2) As far as I know, there are no reliable facilities for inter-kernel synchronization provided by CUDA, and I'm not aware of any suitable trick that can be applied here.
The CUDA C Programming Guide v7.5 tells us:
"Applications manage the concurrent operations described above through streams. A stream is a sequence of commands (possibly issued by different host threads) that execute in order. Different streams, on the other hand, may execute their commands out of order with respect to one another or concurrently; this behavior is not guaranteed and should therefore not be relied upon for correctness (e.g., inter-kernel communication is undefined)."
You will need to synchronize on the host. Off the top of my head, calling cudaStreamSynchronize for every stream in turn should do the trick, but it may not be that easy.
Your data must be in global memory
You need to get the data address on the host
You must send this data back to the second kernel
Your code should look something like this:
float *dataToExchange_d;                              // buffer in global device memory
cudaMalloc((void**)&dataToExchange_d, sizeof(data));
kernel1<<< M1,N1,0,stream1>>>(dataToExchange_d);
cudaStreamSynchronize(stream1);                       // host waits until kernel1 is done
kernel2<<< M2,N2,0,stream2>>>(dataToExchange_d);
But note that stream synchronization slows down the process, so you should avoid it as much as possible.
You can also get stream synchronization through CUDA events; it's less obvious and doesn't give a special advantage here, but it's useful to know ;-)
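For example (a sketch reusing the names above): record an event into stream1 after kernel1, and make stream2 wait on it with cudaStreamWaitEvent; the ordering is then enforced on the GPU instead of stalling the host.

cudaEvent_t kernel1Done;
cudaEventCreateWithFlags(&kernel1Done, cudaEventDisableTiming);

kernel1<<< M1,N1,0,stream1>>>(dataToExchange_d);
cudaEventRecord(kernel1Done, stream1);         // marks the end of kernel1's work
cudaStreamWaitEvent(stream2, kernel1Done, 0);  // stream2 waits on the GPU side
kernel2<<< M2,N2,0,stream2>>>(dataToExchange_d);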