Schedule jobs between GPUs for PyTorch Models - deep-learning

I'm trying to build up a system that trains deep models on requests. A user comes to my web site, clicks a button and a training process starts.
However, I have two GPUs and I'm not sure what the best way is to queue/handle jobs between them: start a job when at least one GPU is available, and queue the job if no GPU is currently free. I'd like to use one GPU per job request.
Is this something I can do in combination with Celery? I've used this in the past but I'm not sure how to handle this GPU related problem.
Thanks a lot!

Not sure about Celery as I've never used it, but conceptually here is what seems reasonable (the question is quite open-ended anyway):
create thread(s) responsible solely for receiving requests and distributing tasks to specific GPUs
if any GPU is free, assign the task to it immediately
if both are occupied, estimate how long each GPU will probably take to finish its current tasks (neural network training)
add the new task to the GPU with the smallest estimated remaining time
Time estimation
The ETA of the current task can be approximated quite well given a fixed number of samples and epochs. If that's not the case (e.g. early stopping), it will be much harder and will need some heuristic.
When GPUs are overloaded (say each has 5 tasks in queue), what I would do is:
Stop the process currently running on the GPU
Run the new task for a few batches of data to get a rough estimate of how long it might take to finish
Add that estimate to the estimated time of all the tasks already queued on that GPU
Now, this depends on the traffic. If it's heavy and would interrupt the on-going process too often, you should simply add new tasks to the GPU queue with the fewest tasks (some heuristic would be needed here as well; you should have an estimate of the possible number of requests by now, and with only 2 GPUs it probably cannot be huge).
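If it helps to see the scheduling logic independent of any particular framework, here is a minimal C++/CUDA sketch of the per-GPU worker idea; Job and train_model are hypothetical placeholders, and with Celery the rough equivalent would be one worker process pinned to each GPU (e.g. via CUDA_VISIBLE_DEVICES):

    // Sketch: one host thread per GPU, each pulling jobs from a shared queue.
    // `Job` and `train_model` are placeholders for the real training work.
    #include <cuda_runtime.h>
    #include <condition_variable>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    struct Job { int id; /* dataset, hyperparameters, ... */ };

    std::queue<Job> jobs;
    std::mutex m;
    std::condition_variable cv;
    bool shutting_down = false;

    void train_model(const Job& job) { /* run the actual training on the current device */ }

    void gpu_worker(int device) {
        cudaSetDevice(device);                     // bind this worker thread to one GPU
        for (;;) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !jobs.empty() || shutting_down; });
            if (jobs.empty()) return;              // shutting down and nothing left to do
            Job job = jobs.front();
            jobs.pop();
            lock.unlock();
            train_model(job);                      // one job occupies this GPU at a time
        }
    }

    int main() {
        int n_gpus = 0;
        cudaGetDeviceCount(&n_gpus);               // 2 in the question
        std::vector<std::thread> workers;
        for (int d = 0; d < n_gpus; ++d) workers.emplace_back(gpu_worker, d);

        // a web request handler would do the equivalent of this for every "train" click
        for (int i = 0; i < 10; ++i) {
            { std::lock_guard<std::mutex> lock(m); jobs.push({i}); }
            cv.notify_one();
        }

        { std::lock_guard<std::mutex> lock(m); shutting_down = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
        return 0;
    }

A job only starts when a worker (i.e. a GPU) is free; otherwise it simply waits in the queue, which matches the behaviour you described.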

Related

Using CUDA GPUs at prediction time for high throughput streams

We're trying to develop a Natural Language Processing application that has a user facing component. The user can call models through an API, and get the results back.
The models are pretrained using Keras with Theano. We use GPUs to speed up training; however, prediction is also sped up significantly by using the GPU. Currently, we have a machine with two GPUs. However, at runtime (e.g. when running the user-facing bits) there is a problem: multiple Python processes sharing the GPUs via CUDA do not seem to offer a parallelism speedup.
We're using nvidia-docker with libgpuarray (pygpu), Theano and Keras.
The GPUs are still mostly idle, but adding more Python workers does not speed up the process.
What is the preferred way of solving the problem of running GPU models behind an API? Ideally we'd utilize the existing GPUs more efficiently before buying new ones.
I can imagine that we want some sort of buffer that collects requests before sending them off to the GPU, rather than requesting a lock for each HTTP call?
This is not an answer to your more general question, but rather an answer based on how I understand the scenario you described.
If someone has coded a system which uses a GPU for some computational task, they have (hopefully) taken the time to parallelize its execution so as to benefit from the full resources the GPU can offer, or something close to that.
That means that if you add a second similar task - even in parallel - the total amount of time to complete them should be similar to the amount of time to complete them serially, i.e. one after the other - since there are very few underutilized GPU resources for the second task to benefit from. In fact, it could even be the case that both tasks will be slower (if, say, they both utilize the L2 cache heavily and thrash it when running together).
At any rate, when you want to improve performance, a good thing to do is profile your application - in this case, using the nvprof profiler or its nvvp visual frontend.
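If you want to see this effect for yourself before reaching for the profiler, a toy experiment along these lines will do. The kernel and sizes below are made up; the point is only that a second copy of a kernel that already fills the GPU roughly doubles the total time even when launched in a separate stream:

    // Toy experiment: time 1 saturating kernel vs. 2 launched into separate streams.
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void busy(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            for (int k = 0; k < 2000; ++k) v = v * 1.0000001f + 0.5f;  // burn cycles
            x[i] = v;
        }
    }

    static float time_launches(float* d, int n, int copies) {
        cudaStream_t s[2];
        cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        for (int c = 0; c < copies; ++c)
            busy<<<(n + 255) / 256, 256, 0, s[c % 2]>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start); cudaEventDestroy(stop);
        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
        return ms;
    }

    int main() {
        const int n = 1 << 24;                 // large enough to occupy the whole GPU
        float* d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        printf("1 kernel : %.2f ms\n", time_launches(d, n, 1));
        printf("2 kernels: %.2f ms (roughly 2x if one kernel already fills the GPU)\n",
               time_launches(d, n, 2));
        cudaFree(d);
        return 0;
    }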

Transferring data to GPU while kernel is running to save time

The GPU is really fast when it comes to parallel computation and outperforms the CPU by being 15-30 (some have reported even 50) times faster. However, GPU memory is very limited compared to CPU memory, and communication between GPU memory and the CPU is not as fast.
Let's say we have some data that won't fit into GPU RAM but we still want to use its wonders to compute. What we can do is split that data into pieces and feed them to the GPU one by one.
Sending large data to the GPU can take time, and one might think: what if we split a data piece in two, feed the first half, run the kernel, and then feed the other half while the kernel is running?
By that logic we should save some time, because the data transfer would be going on while the computation is, hopefully without interrupting it, and when the kernel is finished it could just continue its job without having to wait for new data.
I must say that I'm new to GPGPU and new to CUDA, but I have been experimenting with simple CUDA code and have noticed that cudaMemcpy, the function used to transfer data between CPU and GPU, will block if a kernel is running. It will wait until the kernel is finished and only then do its job.
My question: is it possible to accomplish something like what is described above, and if so, could someone show an example or provide some source of information on how it could be done?
Thank you!
is it possible to accomplish something like that described above
Yes, it's possible. What you're describing is a pipelined algorithm, and CUDA has various asynchronous capabilities to enable it.
The asynchronous concurrent execution section of the programming guide covers the necessary elements in CUDA to make it work. To use your example, there exists a non-blocking version of cudaMemcpy, called cudaMemcpyAsync. You'll need to understand CUDA streams and how to use them.
I would also suggest this presentation which covers most of what is needed.
Finally, here is a worked example. That particular example happens to use CUDA stream callbacks, but those are not necessary for basic pipelining. They enable additional host-oriented processing to be asynchronously triggered at various points in the pipeline, but the basic chunking of data, and delivery of data while processing is occurring does not depend on stream callbacks. Note also the linked CUDA sample codes in that answer, which may be useful for study/learning.
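For reference, a minimal two-stream sketch of the ping-pong idea looks like this (the kernel and chunk sizes are placeholders; note the pinned host buffer from cudaMallocHost, which is required for the copies to actually overlap with computation):

    // Pipelining with two streams: copy chunk k+1 while chunk k computes.
    #include <cuda_runtime.h>

    __global__ void process(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = data[i] * 2.0f + 1.0f;             // stand-in for real work
    }

    int main() {
        const int chunk = 1 << 20, nchunks = 8;
        float *h, *d[2];
        cudaMallocHost(&h, (size_t)chunk * nchunks * sizeof(float));  // pinned host buffer
        cudaMalloc(&d[0], chunk * sizeof(float));
        cudaMalloc(&d[1], chunk * sizeof(float));

        cudaStream_t s[2];
        cudaStreamCreate(&s[0]);
        cudaStreamCreate(&s[1]);

        for (int k = 0; k < nchunks; ++k) {
            int b = k % 2;                                      // ping-pong buffers/streams
            float* src = h + (size_t)k * chunk;
            cudaMemcpyAsync(d[b], src, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);      // non-blocking copy in
            process<<<(chunk + 255) / 256, 256, 0, s[b]>>>(d[b], chunk);
            cudaMemcpyAsync(src, d[b], chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);      // non-blocking copy out
        }
        cudaDeviceSynchronize();                                // wait for the whole pipeline

        cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
        cudaFree(d[0]); cudaFree(d[1]); cudaFreeHost(h);
        return 0;
    }

Within one stream the copy-kernel-copy sequence still runs in order, but the two streams can overlap with each other, which is exactly the "transfer while computing" behaviour you are after.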

Utilizing GPU worth it?

I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.
If you have less than 20 entities that need to be simulated in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother about using GPU.
Using a multi-core CPU would be much easier to program and avoid the steps of translating all your operations into GPU operations.
Also, as you already suggested, the performance gain using GPU will be small (or even negative) with this small number of processes.
There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
For how many different potentials do you need to run simulations? Enough to just use the structure from the N-body example and still load the whole GPU?
If not, and assuming the potential calculation is expensive, I'd think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write out the forces to shared memory, __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. __syncthreads() again, and continue for the next time step.
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.
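To make the suggestion above concrete, here is a rough sketch under those assumptions: one block per potential setting, one thread per particle pair, forces accumulated in shared memory, and several time steps per kernel launch so results only need copying back occasionally. pair_force, the integrator, and all sizes are placeholders, and the thread-to-pair mapping is just one possible choice:

    #include <cuda_runtime.h>

    #define N      16                            // particles (small, as in the question)
    #define NPAIRS (N * (N - 1) / 2)

    __device__ float3 pair_force(float3 a, float3 b, int potential) {
        // placeholder: evaluate the chosen potential between particles at a and b
        return make_float3(b.x - a.x, b.y - a.y, b.z - a.z);
    }

    __global__ void simulate(float3* pos, float3* vel, int nsteps, float dt) {
        int potential = blockIdx.x;              // one block per potential setting
        float3* p = pos + potential * N;         // each block owns its own particle set
        float3* v = vel + potential * N;

        __shared__ float3 f[N][N];               // f[i][j] = force on i due to j
        int t = threadIdx.x;

        // map the linear pair index t to a pair (i, j) with i < j
        int i = 0, j = t;
        while (j >= N - 1 - i) { j -= N - 1 - i; ++i; }
        j += i + 1;

        if (t < N) f[t][t] = make_float3(0.f, 0.f, 0.f);        // diagonal stays zero

        for (int s = 0; s < nsteps; ++s) {
            float3 fij = pair_force(p[i], p[j], potential);
            f[i][j] = fij;                                      // action ...
            f[j][i] = make_float3(-fij.x, -fij.y, -fij.z);      // ... and reaction
            __syncthreads();

            if (t < N) {                         // one thread per particle sums its row
                float3 a = make_float3(0.f, 0.f, 0.f);
                for (int k = 0; k < N; ++k) {
                    a.x += f[t][k].x; a.y += f[t][k].y; a.z += f[t][k].z;
                }
                // simple Euler step with unit mass, just to show the structure
                v[t].x += a.x * dt;  v[t].y += a.y * dt;  v[t].z += a.z * dt;
                p[t].x += v[t].x * dt; p[t].y += v[t].y * dt; p[t].z += v[t].z * dt;
            }
            __syncthreads();                     // positions updated before the next step
        }
    }

    int main() {
        const int npot = 4;                      // number of potential settings (example)
        float3 *d_pos, *d_vel;
        cudaMalloc(&d_pos, npot * N * sizeof(float3));
        cudaMalloc(&d_vel, npot * N * sizeof(float3));
        // ... copy initial positions/velocities into d_pos / d_vel here ...

        float3 h_pos[npot * N];
        for (int block = 0; block < 100; ++block) {
            simulate<<<npot, NPAIRS>>>(d_pos, d_vel, 1000, 1e-3f);           // many steps per launch
            cudaMemcpy(h_pos, d_pos, sizeof(h_pos), cudaMemcpyDeviceToHost); // copy back rarely
            // ... record or inspect h_pos ...
        }
        cudaFree(d_pos);
        cudaFree(d_vel);
        return 0;
    }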

Right way to report CUDA speedup

I would like to compare the performance of a serial program running on a CPU and a CUDA program running on a GPU. But I'm not sure how to compare the performance fairly. For example, if I compare the performance of an old CPU with a new GPU, then I will have immense speedup.
Another question: how can I compare my CUDA program with another CUDA program reported in a paper (both run on different GPUs and I cannot access the source code)?
For fairness, you should include the data transfer times to get the data into and out of the GPU. It's not hard to write a blazing fast CUDA function. The real trick is in figuring out how to keep it fed, or how to hide the cost of data transfer by overlapping it with other necessary work. Unless your routine is 100% compute-bound, including data transfer in your units-of-work-done-per-unit-of-time is critical to understanding how your implementation would handle, say, a lot more units of work.
For cross-device comparisons, it might be useful to report units of work performed per unit of time per processor core. The per-core figure helps normalize large differences between, say, a 200-core and a 2000-core CUDA device.
If you're talking about your algorithm (not just output), it is useful to describe how you broke the problem down for parallel execution - your block/thread distribution, for example.
Make sure you are not measuring performance on a debug build, or running in a debugger. Debugging adds overhead.
Make sure that your work sample is large enough that it is significantly above the "noise floor". A test run that takes a few seconds to complete will be measuring more of your function and less of the ambient noise of the environment than a test run that completes in milliseconds. You can always divide the units of work by the test execution time to arrive at a sexy "units per nanosecond" figure, but you don't actually measure it that way.
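As a concrete illustration of the first two points (include the transfers, and measure a batch of work well above the noise floor), a timing harness might look like the following; my_kernel and the sizes are placeholders:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void my_kernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;        // stand-in for the real computation
    }

    int main() {
        const int n = 1 << 24;
        const int repeats = 100;                 // enough work to dominate ambient noise
        size_t bytes = n * sizeof(float);

        float *h_in, *h_out, *d_in, *d_out;
        cudaMallocHost(&h_in, bytes);  cudaMallocHost(&h_out, bytes);
        cudaMalloc(&d_in, bytes);      cudaMalloc(&d_out, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);

        cudaEventRecord(start);
        for (int r = 0; r < repeats; ++r) {
            cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);     // count the copy in
            my_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
            cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // and the copy out
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double elems_per_sec = (double)n * repeats / (ms / 1000.0);
        printf("%.1f ms total, %.3g elements/s including transfers\n", ms, elems_per_sec);

        cudaFree(d_in); cudaFree(d_out); cudaFreeHost(h_in); cudaFreeHost(h_out);
        return 0;
    }

The serial CPU version should be timed over the same amount of work (a release build, outside a debugger) so that the two units-per-second figures are directly comparable.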
The speed of a CUDA program on different GPUs depends on many factors of the GPU, like memory bandwidth, core clock speed, number of cores, and the number of threads/registers/amount of shared memory available, so it is difficult to compare performance across different GPUs.

Is it possible to split CUDA jobs between GPU & CPU?

I'm having a bit of a problem understanding how, or if it's possible, to share a workload between a GPU and a CPU. I have a large log file where I need to read each line and then run about 5 million operations on it (testing for various scenarios). My current approach has been to read a few hundred lines, add them to an array, and then send that to each GPU, which is working fine, but because there is so much work per line and so many lines it takes a long time. I noticed that while this is going on my CPU cores are basically doing nothing. I'm using EC2, so I have 2 quad-core Xeons & 2 Tesla GPUs; one CPU core reads the file (running the main program) and the GPUs do the work, so I'm wondering what I can do to involve the other 7 cores in the process?
I'm a bit confused about how to design a program that balances the tasks between GPU and CPU, because they would finish their jobs at different times, so I couldn't just send work to all of them at the same time. I thought about setting up a queue (I'm new to C, so not sure if this is possible yet), but then is there a way to know when a GPU job is completed (since I thought sending jobs to CUDA was asynchronous)? My kernel is very similar to a normal C function, so converting it for CPU usage is not a problem; just balancing the work seems to be the issue. I went through 'CUDA by Example' again but couldn't really find anything referring to this type of balancing.
Any suggestions would be great.
I think the key is to create a multithreaded app, following all the common practices for that, and have two types of worker threads. One that does work with the GPU and one that does work with the CPU. So basically, you will need a thread pool and a queue.
http://en.wikipedia.org/wiki/Thread_pool_pattern
The queue can be very simple. You can have one shared integer that is the index of the current row in the log file. When a thread is ready to retrieve more work, it locks that index, gets some number of lines from the log file, starting at the line designated by the index, then increases the index by the number of lines that it retrieved, and then unlocks.
When a worker thread is done with one chunk of the log file, it posts its results back to the main thread and gets another chunk (or exits if there are no more lines to process).
The app launches some combination of GPU and CPU worker threads to utilize all available GPUs and CPU cores.
One problem you may run into is that if the CPU is busy, performance of the GPUs may suffer, as slight delays in submitting new work or processing results from the GPUs are introduced. You may need to experiment with the number of threads and their affinity. For instance, you may need to reserve one CPU core for each GPU by manipulating thread affinities.
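A bare-bones sketch of that shared-index queue with mixed CPU and GPU workers might look like this; the process_chunk_* functions are placeholders for the real per-line work:

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <mutex>
    #include <thread>
    #include <vector>

    const int total_lines = 1000000;     // lines in the log file (placeholder)
    const int chunk_size  = 500;

    std::mutex m;
    int next_line = 0;                   // shared index into the log file

    // hand out the next chunk of lines; returns false when there is no more work
    bool get_chunk(int& first, int& count) {
        std::lock_guard<std::mutex> lock(m);
        if (next_line >= total_lines) return false;
        first = next_line;
        count = std::min(chunk_size, total_lines - next_line);
        next_line += count;
        return true;
    }

    void process_chunk_on_gpu(int first, int count) { /* copy lines, launch kernel, copy back */ }
    void process_chunk_on_cpu(int first, int count) { /* the same logic compiled for the CPU */ }

    void gpu_worker(int device) {
        cudaSetDevice(device);
        int first, count;
        while (get_chunk(first, count)) process_chunk_on_gpu(first, count);
    }

    void cpu_worker() {
        int first, count;
        while (get_chunk(first, count)) process_chunk_on_cpu(first, count);
    }

    int main() {
        std::vector<std::thread> workers;
        workers.emplace_back(gpu_worker, 0);                            // one thread per Tesla
        workers.emplace_back(gpu_worker, 1);
        for (int i = 0; i < 6; ++i) workers.emplace_back(cpu_worker);   // remaining CPU cores
        for (auto& w : workers) w.join();
        return 0;
    }

Because each worker only asks for a new chunk when it has finished the previous one, the faster GPUs and the slower CPU cores automatically end up with different amounts of work, which is the load balancing you were asking about.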
Since you say the work is line-by-line, maybe you can split the jobs across 2 different processes:
One CPU + GPU process
One CPU process that utilizes the remaining 7 cores
You can start each process with different offsets - e.g. the 1st process reads lines 1-50, 101-150, etc., while the 2nd one reads 51-100, 151-200, etc.
This will spare you the headache of optimizing the CPU-GPU interaction.
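One simple way to implement that split is sketched below; here both hypothetical processes scan the same file on stdin but only handle the 50-line blocks assigned to them (seeking to real offsets would achieve the same partition):

    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char** argv) {
        int process_id = (argc > 1) ? atoi(argv[1]) : 0;   // 0 = CPU+GPU process, 1 = CPU-only
        const int block = 50;
        char line[4096];
        long lineno = 0;
        while (fgets(line, sizeof(line), stdin)) {
            if ((lineno / block) % 2 == process_id) {
                // ... this line belongs to this process: hand it to its workers ...
            }
            ++lineno;
        }
        return 0;
    }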