How to reduce CUDA context size (Multi-Process Service)

I followed Robert Crovella's example on how to use Nvidia's Multi-Process Service. According to the docs:
2.1.2. Reduced on-GPU context storage
Without MPS each CUDA process using a GPU allocates separate storage and scheduling resources on the GPU. In contrast, the MPS server allocates one copy of GPU storage and scheduling resources shared by all its clients.
which I understood to mean that each process's context becomes smaller because that storage is shared. That would free up GPU memory and thus allow more processes to run in parallel.
Now, back to the example. Without MPS:
[screenshot of nvidia-smi output]
And with MPS:
[screenshot of nvidia-smi output]
Unfortunately each process still takes virtually the same amount of memory (~300MB). Doesn't this contradict the docs? Is there a way to decrease per-process memory consumption?

Oops, I overeagerly asked before checking the memory usage on the other (pre-Volta) card, and yes, there actually is a difference. Let me just post it here for future reference, in case anyone else stumbles on this problem too:
MPS off: [screenshot of nvidia-smi output]
MPS on: [screenshot of nvidia-smi output]

Indeed, as seen here, on the Volta architecture the client processes communicate directly with the GPU, without the MPS server in the middle:
Volta MPS clients submit work directly to the GPU without passing through the MPS server.
This can be easily seen from your first screenshot where the t1034 processes are listed as using the GPU.
By contrast, on pre-Volta architectures the client processes communicate with the GPU through the MPS server. That is why only the MPS server process shows up as using the GPU in the latter screenshot.
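For anyone who wants to quantify the per-process overhead rather than eyeballing nvidia-smi, here is a minimal sketch (not part of the original example; error handling kept to a minimum) that each client process can run with MPS on and off, so the reported free memory can be compared across runs:

// Sketch: force context creation, then report free/total GPU memory so the
// per-process context overhead can be compared with MPS enabled vs. disabled.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Force the CUDA context to be created before querying memory.
    cudaFree(0);

    size_t freeMem = 0, totalMem = 0;
    cudaError_t err = cudaMemGetInfo(&freeMem, &totalMem);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("free: %zu MB, total: %zu MB\n", freeMem >> 20, totalMem >> 20);
    return 0;
}

On a pre-Volta GPU with the MPS server running, the drop in free memory per additional client should be noticeably smaller than without MPS.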

Related

What is the difference between Nvidia Hyper Q and Nvidia Streams?

I always thought that the Hyper-Q technology is nothing but streams in the GPU. Later I found out I was wrong (am I?). So I did some reading about Hyper-Q and got even more confused.
I was going through one article and it had these two statements:
A. Hyper-Q is a flexible solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process
B. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi)
In the aforementioned points, point B says that there can be multiple connections created to a single GPU from the host. Does it mean I can create multiple contexts on a single GPU through different applications? Does it mean that I will have to execute all applications on different streams? And if all my connections consume memory and compute resources, who manages the resource (memory/cores) scheduling?
Think of HyperQ as streams implemented in hardware on the device side.
Before the arrival of HyperQ, e.g. on Fermi, commands (kernel launches, memory transfers, etc.) from all streams were placed in a single work queue by the driver on the host. That meant that commands could not overtake each other, and you had to be careful issuing them in the right order on the host to achieve best overlap.
On the GK110 GPU and later devices with HyperQ, there are (at least) 32 work queues on the device. This means that commands from different queues can be reordered relative to each other until they start execution. So both orderings in the example linked above lead to good overlap on a GK110 device.
This is particularly important for multithreaded host code, where you can't control the order without additional synchronization between threads.
Note that of the 32 hardware queues, only 8 are used by default to save resources. Set the CUDA_DEVICE_MAX_CONNECTIONS environment variable to a higher value if you need more.
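As a rough illustration of what Hyper-Q buys you from a single process (a sketch only; the kernel, sizes, and stream count are placeholders), the usual way to exploit it is simply to issue independent work into separate streams:

// Sketch: launch independent kernels into separate streams. On Fermi the
// single work queue could serialize them depending on issue order; with
// Hyper-Q each stream can map to its own hardware work queue.
// Raise CUDA_DEVICE_MAX_CONNECTIONS in the environment before launching
// if you need more than the default 8 queues.
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 10000; ++k)   // placeholder work
            v = v * 1.0000001f + 0.0000001f;
        data[i] = v;
    }
}

int main()
{
    const int nStreams = 8, n = 1 << 20;
    cudaStream_t streams[nStreams];
    float *buf[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        // Each launch goes into its own stream, so the kernels are free to
        // overlap if the device has enough resources.
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }

    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}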

Multiple GPUs and Multiple Executables

Suppose I have 4 GPUs and would like to run 50 CUDA programs in parallel. My question is: is the NVIDIA driver smart enough to run the 50 CUDA programs on the different GPUs or do I have to set the CUDA device for each program?
thank you
The first point to make is that you cannot run 50 applications in parallel on 4 GPUs on just about any CUDA platform. If you have a Hyper-Q capable GPU, there is the possibility of up to 32 threads or MPI processes queuing work to the GPU. Otherwise there is a single command queue.
For anything other than the latest Kepler Tesla cards, the CUDA driver only supports a single active context at a time. If you run more than one application on a GPU, the processes will each have a context, and those contexts contend with one another on a "first come, first served" basis. If one application blocks the other with a long-running kernel or similar, there is no pre-emption or anything else which makes the process yield to another process. When the GPU is shared with a display manager, there is a watchdog timer that will impose an upper limit of a few seconds before the application gets its context killed. The result is that only one context ever runs on the hardware at a time. Context switching isn't free, and there is a performance penalty to having multiple processes contending for a single device.
Furthermore, every context present on a GPU requires device memory. On the platform you are asking about, Linux, there is no memory paging, so every context's resources must coexist in GPU memory. I don't believe it would be possible to have 12 non-trivial contexts running on any current GPU simultaneously; you would run out of available memory well before that number. Trying to run more applications would result in a context establishment failure.
As for the behaviour of the driver distributing multiple applications on multiple GPUs, AFAIK the Linux driver doesn't do any intelligent distribution of processes amongst GPUs, except when one or more of the GPUs are in a non-default compute mode. If no device is specifically requested, the driver will always try to find the first valid, free GPU it can run a process or thread on. If a GPU is busy and marked compute-exclusive (either thread or process) or marked prohibited, then the driver will skip over it when trying to find a GPU to run on. If all GPUs are exclusive and occupied or prohibited, then the application will fail with a "no valid device available" error.
So in summary, for everything other than Hyper-Q devices, there is no performance gain in doing what you are asking about (quite the opposite), and I would expect it to break if you tried. A much saner approach would be to use compute exclusivity in combination with a resource-managing task scheduler like Torque or one of the (former) Sun Grid Engine versions, which could schedule your processes to run in an orderly fashion according to the availability of GPUs. This is how most general-purpose HPC clusters deal with scheduling in multi-GPU environments.
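If you do want to spread your own processes across the 4 GPUs explicitly rather than relying on compute-exclusive mode and a scheduler, each program can simply pick its device before doing anything else. A minimal sketch, where MY_GPU_INDEX is a made-up environment variable name that your launcher or scheduler would have to set:

// Sketch: explicitly select a device per process instead of relying on the
// driver's default choice. The device index comes from an environment
// variable that a wrapper script is assumed to provide (Linux shown here).
#include <cstdio>
#include <cstdlib>
#include <unistd.h>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    if (deviceCount == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }

    // Hypothetical variable set by the launcher; fall back to device 0.
    const char *env = getenv("MY_GPU_INDEX");
    int device = env ? atoi(env) % deviceCount : 0;

    if (cudaSetDevice(device) != cudaSuccess) {
        fprintf(stderr, "could not select device %d\n", device);
        return 1;
    }
    printf("process %d using GPU %d of %d\n", (int)getpid(), device, deviceCount);
    // ... rest of the application ...
    return 0;
}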

Multiple processes launching CUDA kernels in parallel

I know that NVIDIA GPUs with compute capability 2.x or greater can execute up to 16 kernels concurrently.
However, my application spawns 7 "processes" and each of these 7 processes launch CUDA kernels.
My first question is: what would be the expected behavior of these kernels? Will they execute concurrently as well, or, since they are launched by different processes, will they execute sequentially?
I am confused because the CUDA C programming guide says:
"A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context."
This brings me to my second question, what are CUDA "contexts"?
Thanks!
A CUDA context is a virtual execution space that holds the code and data owned by a host thread or process. Only one context can ever be active on a GPU with all current hardware.
So to answer your first question, if you have seven separate threads or processes all trying to establish a context and run on the same GPU simultaneously, they will be serialised and any process waiting for access to the GPU will be blocked until the owner of the running context yields. There is, to the best of my knowledge, no time slicing and the scheduling heuristics are not documented and (I would suspect) not uniform from operating system to operating system.
You would be better off launching a single worker thread holding a GPU context and using messaging from the other threads to push work onto the GPU. Alternatively there is a context migration facility available in the CUDA driver API, but that will only work with threads from the same process, and the migration mechanism has latency and host CPU overhead.
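A bare-bones sketch of that worker-thread pattern (the Job struct, queue protocol, and kernel are all illustrative; with the modern runtime API the threads would share one context per device anyway, but the point is that only one thread ever submits work to the GPU):

// Sketch: one worker thread owns the GPU and drains a queue that the other
// threads fill. Job, the kernel, and the signalling are placeholders.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <cuda_runtime.h>

struct Job { float *devPtr; int n; };          // illustrative work item

static std::queue<Job> jobs;
static std::mutex m;
static std::condition_variable cv;
static bool done = false;

__global__ void process(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

// Only this thread launches kernels.
void gpuWorker()
{
    cudaSetDevice(0);
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return done || !jobs.empty(); });
        if (jobs.empty()) break;               // done and nothing left to do
        Job j = jobs.front(); jobs.pop();
        lk.unlock();
        process<<<(j.n + 255) / 256, 256>>>(j.devPtr, j.n);
        cudaDeviceSynchronize();
    }
}

// Producer threads just enqueue; they never touch the GPU themselves.
void submit(Job j)
{
    { std::lock_guard<std::mutex> lk(m); jobs.push(j); }
    cv.notify_one();
}

int main()
{
    std::thread worker(gpuWorker);

    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    submit({d, n});

    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
    worker.join();
    cudaFree(d);
    return 0;
}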
To add to talonmies' answer:
On newer architectures, multiple processes can launch kernels concurrently by using MPS. So it is now definitely possible, which was not the case some time ago. For a detailed understanding, read this document:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf
Additionally, you can look up the maximum number of concurrent kernels allowed per CUDA compute capability supported by different GPUs. Here is a link to that:
https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
For example, a GPU with CUDA compute capability 7.5 can have a maximum of 128 concurrent kernels launched to it.
Do you really need to have separate threads and contexts?
I believe that best practice is to use one context per GPU, because multiple contexts on a single GPU bring considerable overhead.
To execute many kernels concurrently, you should create a few CUDA streams in one CUDA context and queue each kernel into its own stream; they will then be executed concurrently, if there are enough resources for it.
If you need to make the context accessible from a few CPU threads, you can use cuCtxPopCurrent() and cuCtxPushCurrent() to pass it around, but only one thread will be able to work with the context at any time.
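For reference, a minimal sketch of what that push/pop hand-off looks like with the driver API (error checking omitted for brevity; the allocation inside the worker thread is just a stand-in for real work):

// Sketch: hand one driver-API context between CPU threads with
// cuCtxPopCurrent() / cuCtxPushCurrent().
#include <cuda.h>
#include <thread>

int main()
{
    cuInit(0);

    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);       // context is current to this thread

    // Detach the context so another thread can take it.
    cuCtxPopCurrent(&ctx);

    std::thread worker([ctx]() mutable {
        cuCtxPushCurrent(ctx);       // now current to the worker thread
        CUdeviceptr p;
        cuMemAlloc(&p, 1 << 20);     // placeholder work inside the same context
        cuMemFree(p);
        cuCtxPopCurrent(&ctx);       // give it back when done
    });
    worker.join();

    cuCtxPushCurrent(ctx);
    cuCtxDestroy(ctx);
    return 0;
}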

Prefetching in Nvidia CUDA

I'm working on data prefetching in Nvidia CUDA. I have read some documents on prefetching on the device itself, i.e. prefetching from shared memory to cache.
But I'm interested in data prefetching between the CPU and the GPU. Can anyone point me to some documents or anything else regarding this matter? Any help would be appreciated.
Answer based on your comment:
When we want to perform computation on large data, ideally we send the maximum amount of data to the GPU, perform the computation, and send it back to the CPU, i.e. SEND, COMPUTE, SEND (back to CPU). While the data is being sent back to the CPU, the GPU has to stall. My plan is: given a CUDA program that, say, runs in the entire global memory, I'll compel it to run in half of the global memory so that I can use the other half for data prefetching. While computation is being performed in one half, I can simultaneously prefetch data into the other half, so there will be no stalls. Now tell me, is this feasible? Will performance be degraded or improved? It should improve.
CUDA streams were introduced to enable exactly this approach.
If your computation is rather intensive, then yes --- it can greatly speed up your performance. On the other hand, if data transfers take, say, 90% of your time, you will save only on computation time - that is - 10% tops...
The details, including examples, of how to use streams are provided in the CUDA Programming Guide.
For version 4.0, that will be section "3.2.5.5 Streams", and in particular "3.2.5.5.5 Overlapping Behavior" --- there, they launch another, asynchronous memory copy, while a kernel is still running.
Perhaps you would be interested in the asynchronous host/device memory transfer capabilities of CUDA 4.0? You can overlap host/device memory transfers and kernels by using page-locked host memory. You could use this to...
Copy working set #1 & #2 from host to device.
Process #i, promote #i+1, and load #i+2 - concurrently.
So you could be streaming data in and out of the GPU and computing on it all at once (!). Please refer to the CUDA 4.0 Programming Guide and CUDA 4.0 Best Practices Guide for more detailed information. Good luck!
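A compact sketch of that pipeline, assuming CUDA 4.0-style streams and page-locked host memory (the chunk sizes, kernel, and two-buffer scheme are illustrative, not taken from either answer):

// Sketch of the copy/compute/copy pipeline: two streams, pinned host memory,
// and chunked transfers so the next chunk is uploading while the current one
// is computing.
#include <cuda_runtime.h>

__global__ void compute(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    const int nChunks = 8, chunk = 1 << 20;
    float *h, *d[2];
    cudaStream_t s[2];

    // Page-locked host memory is required for async overlap.
    cudaHostAlloc(&h, (size_t)nChunks * chunk * sizeof(float), cudaHostAllocDefault);
    for (int b = 0; b < 2; ++b) {
        cudaMalloc(&d[b], chunk * sizeof(float));
        cudaStreamCreate(&s[b]);
    }

    for (int c = 0; c < nChunks; ++c) {
        int b = c & 1;                        // ping-pong between two device buffers
        cudaMemcpyAsync(d[b], h + (size_t)c * chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);
        compute<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d[b], chunk);
        cudaMemcpyAsync(h + (size_t)c * chunk, d[b], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    for (int b = 0; b < 2; ++b) { cudaFree(d[b]); cudaStreamDestroy(s[b]); }
    cudaFreeHost(h);
    return 0;
}

Because operations within a stream are ordered, reusing a device buffer two chunks later is safe: the new copy in that stream cannot start until the previous chunk in the same stream has finished.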
CUDA 6 will eliminate the need to copy, i.e. the copying will be automatic.
However, you may still benefit from prefetching.
In a nutshell, you want the data for the "next" computation to be transferring while you complete the current computation. To achieve that you need to have at least two threads on the CPU, and some kind of signalling scheme (to know when to send the next data). Chunking will of course play a big role and affect performance.
The above may be easier on an APU (CPU+GPU on the same die), as the need to copy is eliminated because both processors can access the same memory.
If you want to find some papers on GPU prefetching, just use Google Scholar.
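For completeness, on newer setups the same idea carries over to managed memory: cudaMemPrefetchAsync() (introduced later than CUDA 6, in CUDA 8 for Pascal-class GPUs and newer) lets you stage the next chunk onto the device while the current one is being processed. A rough sketch, with arbitrary chunk counts and a placeholder kernel:

// Sketch: unified (managed) memory removes the explicit copies;
// cudaMemPrefetchAsync() hints the driver to migrate the next chunk
// while the current chunk is computing.
#include <cuda_runtime.h>

__global__ void compute(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main()
{
    const int nChunks = 4, chunk = 1 << 20;
    const int device = 0;
    float *data;
    cudaStream_t computeStream, copyStream;

    cudaSetDevice(device);
    cudaStreamCreate(&computeStream);
    cudaStreamCreate(&copyStream);
    cudaMallocManaged(&data, (size_t)nChunks * chunk * sizeof(float));

    // Stage the first chunk before the pipeline starts.
    cudaMemPrefetchAsync(data, chunk * sizeof(float), device, computeStream);

    for (int c = 0; c < nChunks; ++c) {
        float *cur = data + (size_t)c * chunk;
        if (c + 1 < nChunks)
            // Migrate the next chunk in a separate stream while this one computes.
            cudaMemPrefetchAsync(cur + chunk, chunk * sizeof(float),
                                 device, copyStream);
        compute<<<(chunk + 255) / 256, 256, 0, computeStream>>>(cur, chunk);
    }
    cudaDeviceSynchronize();

    cudaFree(data);
    cudaStreamDestroy(computeStream);
    cudaStreamDestroy(copyStream);
    return 0;
}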

Where can I find information about the Unified Virtual Addressing in CUDA 4.0?

Where can I find information / changesets / suggestions for using the new enhancements in CUDA 4.0? I'm especially interested in learning about Unified Virtual Addressing.
Note: I would really like to see an example where we can access the RAM directly from the GPU.
Yes, using host memory (if that is what you mean by RAM) will most likely slow your program down, because transfers to/from the GPU take some time and are limited by RAM and PCI bus transfer rates. Try to keep everything in GPU memory. Upload once, execute kernel(s), download once. If you need anything more complicated try to use asynchronous memory transfers with streams.
As far as I know "Unified Virtual Addressing" is really more about using multiple devices, abstracting from explicit memory management. Think of it as a single virtual GPU, everything else still valid.
Using host memory directly is already possible with device-mapped memory. See cudaHostAlloc() with the cudaHostAllocMapped flag (and cudaHostGetDevicePointer()) in the reference manual found on the Nvidia CUDA website.
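A minimal sketch of that device-mapped ("zero-copy") approach, in case it helps as the requested example; the kernel and sizes are arbitrary:

// Sketch of device-mapped host memory: the kernel reads and writes host RAM
// directly over the PCI bus, so no explicit cudaMemcpy is needed, at the cost
// of much lower bandwidth than device memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main()
{
    const int n = 1024;
    int *hostPtr, *devPtr;

    // Must be set before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaHostAlloc(&hostPtr, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hostPtr[i] = i;

    // Get the device-side alias of the same allocation.
    cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

    increment<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();

    printf("hostPtr[42] = %d\n", hostPtr[42]);   // expect 43

    cudaFreeHost(hostPtr);
    return 0;
}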
CUDA 4.0 UVA (Unified Virtual Addressing) does not help you in accessing the main memory from CUDA threads. As in previous versions of CUDA, you still have to map the main memory using the CUDA API for direct access from GPU threads, but this will slow down the performance as mentioned above. Similarly, you cannot access GPU device memory from a CPU thread just by dereferencing the pointer to the device memory. UVA only guarantees that the address spaces do not overlap across multiple devices (including CPU memory); it does not provide coherent accessibility.