I would like to ask for advice on using OpenMPI and CUDA on a GPU cluster.
I am a beginner and feel I can't foresee the consequences of my software architecture decisions. I would highly appreciate advice or a rule of thumb, as information on GPU clusters is quite sparse.
Framework:
Cluster architecture
the cluster has 1 front node and 9 computation nodes
the cluster is heterogeneous: every node has Intel Xeon CPU(s) and Nvidia Tesla K80 cards, but the number of processors and the number of GPU cards differ between nodes
the cluster runs PBSPro scheduler
Goal
1) redistribute the data from a root_MPI_process to MPI_processes
2) load the data to GPU, execute kernel (SIMT-parallel calculations), get the results back
3) send the results back to root_MPI_process
4) root_MPI_process processes the results, creates new data
... iterate -> redistribute the data ...
The steps 1, 2, 3 are purely [SERIAL], and each spawned MPI_process is independent of all the others, i.e. no pieces of data are moved between any two MPI_processes
My considerations for software architecture
Alt. 1) 1 MPI process == 1 GPU
start X MPI_processes, every MPI_process (except the root_MPI_process) is responsible for 1 GPU
the MPI_process then receives a chunk of data, suitable right-away to be passed to GPU and executes kernel ... steps described above
Alt. 2) 1 MPI process == 1 computational cluster node (with multiple GPUs)
start X MPI processes, every MPI_process (except the root_MPI_process) runs on 1 computational cluster node
the MPI_process then identifies number of GPUs, and asks for appropriate amount of data from root_MPI_process
the data, passed from root_MPI_process to the MPI_process, are then redistributed among available GPUs ... step 2, 3, 4 mentioned above
Questions
1) From an experienced point of view, what else (except the data passing, which is easier in 1) and more complicated in 2), from my point of view) should I consider?
2) This application cannot take advantage of CUDA-aware MPI, because the data are not passed between GPUs, is that right? (Is CUDA-aware MPI useful for something other than inter-GPU communication?)
3) Solution 2) offers Universal Addressing Space with a Single Address Space, but solution 1) does not, because every MPI_process accesses 1 GPU, is that right?
Edit
this is research in progress, and I don't dare to estimate E2E timing. For reference, this task takes approx. 60 hours on 3x GTX 1070, the cluster has 16x Tesla K80. My computational time at the moment is quite unlimited.
The data are approx 1 [kB] per thread, therefore 1 kernel requires blocks * threads * 1024 [B] of data, I would like to run 1 kernel per GPU at a time.
the kernel (each thread in each block) runs simulation of 2nd order dynamic system with evaluation of small neural network (30 neurons) (the number of multiplications and additions are in 100's per iteration), there are around 1,000,000 simulation iterations before delivering the result.
From the above I can say with confidence, that evaluation of the kernel is more time consuming than the data transfer from host<->device.
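A rough back-of-envelope check of that claim could look like the sketch below. Only the 1 kB per thread, the roughly 1,000,000 iterations and the "100's" of operations per iteration come from the description above; the launch configuration, the sustained throughput and the host-to-device bandwidth are placeholder assumptions, not measurements from this cluster.

```python
# Back-of-envelope comparison of kernel time vs. host<->device transfer time.
BYTES_PER_THREAD = 1024      # from the post: approx. 1 kB per thread
ITERATIONS = 1_000_000       # from the post: simulation iterations per result
OPS_PER_ITERATION = 300      # "in the 100's" per the post; 300 is an assumed value


def input_bytes(blocks, threads_per_block):
    """Input size of one kernel launch: blocks * threads * 1024 B."""
    return blocks * threads_per_block * BYTES_PER_THREAD


def transfer_seconds(n_bytes, bandwidth_bytes_per_s):
    """Time to move n_bytes at the given (assumed) bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


def compute_seconds(blocks, threads_per_block, sustained_ops_per_s):
    """Time to execute all operations at the given (assumed) throughput."""
    total_ops = blocks * threads_per_block * ITERATIONS * OPS_PER_ITERATION
    return total_ops / sustained_ops_per_s


# Assumed launch configuration and hardware figures (placeholders only).
blocks, threads = 1024, 256
n_bytes = input_bytes(blocks, threads)
t_copy = transfer_seconds(n_bytes, 10e9)           # assumed ~10 GB/s host<->device
t_kernel = compute_seconds(blocks, threads, 1e12)  # assumed ~1e12 ops/s sustained
print(f"copy ~{t_copy:.3f} s, kernel ~{t_kernel:.1f} s")
```

With these assumed figures the kernel time exceeds the copy time by several orders of magnitude; the real ratio depends on the actual launch configuration and hardware.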
1) From an experienced point of view, what else (except the data passing, which is easier in 1) and more complicated in 2), from my point of view) should I consider?
If your assumption of kernel execution time >>> communication time holds true, then this is a simple problem. Also if you don't benefit / intend to really utilize the Xeon CPUs, then things are simpler. Just use Alt. 1) (1 to 1, pure MPI). Alt. 2) means you would have to implement two tiers of workload distribution. There's no need to do that without a good reason.
If your assumptions don't hold true, things can get way more complicated and far beyond a concise answer on SO. Addressing these issues is not useful without a clear understanding of the application characteristics.
One thing that you may have to consider if your application runs for > 1 day, is checkpointing.
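A minimal checkpointing sketch for the root process's iterate loop, assuming the whole state fits in one picklable object. The file layout, the state dictionary and the `step` callback are hypothetical and only illustrate the idea of writing the state atomically after each outer iteration so a restart can resume from it.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the target path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename replaces the old checkpoint in one step, so a crash
    # never leaves a half-written checkpoint under the final name.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_iterations, step):
    """Resume from the last checkpoint and iterate up to total_iterations."""
    state = load_checkpoint(path, {"iteration": 0, "data": 0})
    while state["iteration"] < total_iterations:
        state["data"] = step(state["data"])  # stands in for steps 1) to 4)
        state["iteration"] += 1
        save_checkpoint(path, state)
    return state
```

If the job is killed and resubmitted through the scheduler, calling `run` again with the same path continues from the last completed iteration instead of starting over.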
2) This application cannot take advantage of CUDA-aware MPI, because the data are not passed between GPUs, is that right? (Is CUDA-aware MPI useful for something other than inter-GPU communication?)
Since the CPU processes the resulting data in step 4), you wouldn't benefit from CUDA-aware MPI.
3) Solution 2) offers Universal Addressing Space with a Single Address Space, but solution 1) does not, because every MPI_process accesses 1 GPU, is that right?
No, there are multiple (9) address spaces in your second approach. One address space per compute node. So you have to use MPI anyway, even in the second approach - which is exactly what makes the 1-rank-1-GPU mapping much simpler.
One thing you should consider: your step 4) will become a scalability bottleneck at some point, though probably not at the scales you are talking about. It is worth investing in performance analysis tools and methodology to get a good understanding of how your code performs and where the bottlenecks are, both during development and when scaling up to production.
I would start with the first alternative:
The data transfer to each node would be the same in either situation, so that's a wash.
The first scenario lets the scheduler assign a core to each GPU with room to spare.
The time to spawn the multiple MPI listeners only occurs once if done right.
The second alternative has to, unless you add parallelism in each MPI worker, process each GPU in a serial fashion.
My only caveat is to watch the network and DMA for multiple cores fighting for data. If collisions dominate, add the extra code to implement the second alternative. There is little lost in coding the easier solution first and checking the first iteration at step 4 to see if data passing is problematic.
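To make the first alternative concrete, here is a small planning sketch in plain Python (no MPI calls) of how the root could map worker ranks to GPUs on a heterogeneous cluster and cut the data into one chunk per GPU. The function names and the `gpus_per_node` list are hypothetical; in a real program the chunks would be sent with something like MPI_Scatterv, and each worker would select its GPU with cudaSetDevice.

```python
def split_evenly(items, n_parts):
    """Split items into n_parts contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def plan_alt1(work_items, gpus_per_node):
    """Return {rank: (node, local_gpu, chunk)}.

    Rank 0 is the root process and gets no GPU; worker ranks start at 1,
    one rank per GPU, so nodes with more GPUs simply host more ranks.
    """
    slots = [(node, gpu)
             for node, n_gpus in enumerate(gpus_per_node)
             for gpu in range(n_gpus)]
    chunks = split_evenly(work_items, len(slots))
    return {rank: (node, gpu, chunk)
            for rank, ((node, gpu), chunk)
            in enumerate(zip(slots, chunks), start=1)}


# Example: node 0 has 2 GPUs, node 1 has 1 GPU, 10 work items in total.
plan = plan_alt1(list(range(10)), [2, 1])
```

The heterogeneity of the cluster only shows up in `gpus_per_node`; the workers themselves all run the same code.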
Related
In my use case, the global GPU memory has many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of these chunks of data to be constant is fine as well. Now, there are a set of functions that take as input some of the chunks of data and modify some of them. Some of these functions should only start processing if others completed already. In other words, these functions could be drawn in graph form with the functions being the nodes and edges being dependencies between them. The ordering of these tasks is quite weak though.
My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?
An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).
For any of the solutions, I would create an array in memory of 500 node blocks * 10,000 floats (= 20 MB), with each group of 10,000 floats stored as one contiguous block. (The number of floats should preferably be divisible by 32, e.g. 10,016 floats, for memory alignment reasons.)
Solution 1: Runtime Compilation (sequential, but optimized)
Use Python code to generate a sequential order of functions according to the graph and create (printing out the source code into a string) a small program which calls the functions in turn. Each function should read the input from its predecessor blocks in memory and store the output in its own output block. Python should output the glue code (as string) which calls all functions in the correct order.
Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
A further optimization would be to keep the intermediate results in local variables rather than in memory. Registers will be enough for all your specified cases (maximum of 255 registers per thread), but of course this makes the program (a little) more complicated. The variables can be freely named, and you can have 500 of them: the compiler will optimize the assignment to registers and reuse registers. So have one variable for each node output, e.g. float node352 = f_352(node45, node182, node416);
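The glue-code generation could be sketched like this, assuming the graph is given as a dictionary from node id to the ids it reads from (a hypothetical representation) and that device functions named f_<id> already exist. It only builds the source string; handing that string to NVRTC is left out.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def generate_glue(deps):
    """Emit one C statement per node, predecessors first.

    deps maps a node id to the list of node ids whose outputs it reads.
    """
    order = TopologicalSorter(deps).static_order()
    lines = []
    for node in order:
        args = ", ".join(f"node{p}" for p in deps.get(node, []))
        lines.append(f"float node{node} = f_{node}({args});")
    return "\n".join(lines)


# Tiny example graph: 45 has no inputs, 182 reads 45, 352 reads 45 and 182.
source = generate_glue({352: [45, 182], 45: [], 182: [45]})
print(source)
```

Because every node output becomes a local variable in a single function body, the compiler is free to keep the values in registers and reuse them.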
Solution 2: Controlled run on device (sequential)
The python program creates a list with the order, in which the functions have to be called. The individual functions know, from what memory blocks to read and in what block to write (either hard-coded, or you have to submit it to them in a memory structure).
In the device kernel, a for loop walks through the order list sequentially and calls each function from the list.
How to specify, which functions to call?
The function pointers in the list can be created on the CPU like the following code: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure, if it works in Python).
Or regardless of host programming language a separate kernel can create a translation table: device function pointers (assign_kernel). Then the list from Python would contain indices into this table.
Solution 3: Dynamic Parallelism (parallel)
With Dynamic Parallelism kernels themselves start other kernels (grids).
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
There is a maximum depth of 24.
The state of the parent grid could be swapped to memory (which could take a maximum of 860 MB per level, probably not for your program). But this could be a limitation.
All this swapping could make the parallel version slower again.
But the advantage would be that nodes can really be run in parallel.
Solution 4: Use Cuda Streams and Events (parallel)
Each kernel just calls one function. The synchronization and scheduling is done from Python. But the kernels run asynchronously and call a callback as soon as they are finished. Each kernel running in parallel has to be run on a separate stream.
Optimization: You can use the CUDA graph API, with which CUDA learns the order of the kernels and can do additional optimizations, when replaying (with possibly other float input data, but the same graph).
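As a host-side illustration of the scheduling part, the sketch below groups the nodes into "waves": all nodes in one wave have their predecessors finished and could be launched on separate streams. This is a simplification I am assuming for the example (a barrier between waves is stricter than per-kernel callbacks or events), and the dictionary format for the graph is hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def launch_waves(deps):
    """Group nodes into waves of mutually independent nodes.

    deps maps a node to the list of nodes that must finish before it.
    """
    ts = TopologicalSorter(deps)
    ts.prepare()
    waves = []
    while ts.is_active():
        ready = sorted(ts.get_ready())  # everything whose predecessors are done
        waves.append(ready)             # launch these, one stream each
        ts.done(*ready)                 # stands in for "all callbacks arrived"
    return waves


waves = launch_waves({"c": ["a", "b"], "d": ["c"], "e": ["a"]})
```

The number of nodes in the widest wave gives an upper bound on how many streams are useful for this graph.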
For all methods
You can try different launch configurations from 32 or better 64 threads per block up to 1024 threads per block.
Let's assume that most, or all, of your chunks of data are large, and that you have many distinct functions. If the former does not hold, it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and you don't have the ability to identify fine-grained dependencies between individual values in your different buffers, with simple, local dependency functions.
Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.
Traditional plain-vanilla approach
You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and schedule event-fires and event-waits corresponding to your function's inter-dependency (or the buffer processing dependency). The event-waits before kernel launches ensure no buffer is processed until all preconditions have been satisfied. Then you have different CPU threads wait/synchronize with these streams, to get your work going.
Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.
The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.
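To illustrate the bookkeeping, here is a sketch that turns a dependency graph into an ordered list of stream operations. The round-robin stream choice is a naive assumption of mine, precisely the part described above as tricky; the tuples are hypothetical stand-ins for a kernel launch, cudaEventRecord and cudaStreamWaitEvent.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def plan_stream_ops(deps, n_streams):
    """Return (op, stream, node) tuples in the order the host would issue them.

    deps maps a node to the list of nodes it depends on.
    """
    ops, stream_of = [], {}
    for i, node in enumerate(TopologicalSorter(deps).static_order()):
        stream = i % n_streams  # naive round-robin assignment (assumption)
        stream_of[node] = stream
        for pred in deps.get(node, []):
            # Work on the same stream is already ordered; only cross-stream
            # dependencies need an explicit event wait.
            if stream_of[pred] != stream:
                ops.append(("wait_event", stream, pred))
        ops.append(("launch", stream, node))
        ops.append(("record_event", stream, node))
    return ops


ops = plan_stream_ops({"b": ["a"], "c": ["b"]}, n_streams=2)
```

For a pure chain like this one, a single stream would need no events at all, which shows how much the stream assignment matters.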
Using CUDA task graphs
With CUDA 10.x, NVIDIA added API functions for explicitly creating task graphs, with kernels and memory copies as nodes and edges for dependencies; and when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.
For an elaboration on how to do this, please read:
Getting Started with CUDA Graphs
on the NVIDIA developer blog. Or, for a deeper treatment, there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs.
White-box functions
If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing to the part of a subsequent function applied to such local results. For example, if your first kernel does c[i] = a[i] + b[i] and your second kernel does e[i] = d[i] * c[i], you could instead write a kernel which performs the second action after the first, with inputs a, b, d (no need for c). Unfortunately I can't be less vague here, since your question was somewhat vague.
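The fusion idea can be shown in plain Python, reading the second kernel as consuming the first kernel's output c. This is only an illustration of the data flow, not GPU code; the function names are made up.

```python
def unfused(a, b, d):
    """Two separate passes with an intermediate buffer c."""
    c = [x + y for x, y in zip(a, b)]     # first "kernel": c[i] = a[i] + b[i]
    return [u * v for u, v in zip(d, c)]  # second "kernel": e[i] = d[i] * c[i]


def fused(a, b, d):
    """One pass: the sum lives only in a temporary (a register on the GPU)."""
    return [di * (ai + bi) for ai, bi, di in zip(a, b, d)]


a, b, d = [1, 2], [3, 4], [5, 6]
assert fused(a, b, d) == unfused(a, b, d)
```

The fused version never writes c to memory or reads it back, which is where the saving comes from.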
I'm trying to build up a system that trains deep models on requests. A user comes to my web site, clicks a button and a training process starts.
However, I have two GPUs and I'm not sure which is the best way to queue/handle jobs between the two GPUs: start a job when at least one GPU is available, queue the job if there are currently no GPUs available. I'd like to use one GPU per job request.
Is this something I can do in combination with Celery? I've used this in the past but I'm not sure how to handle this GPU related problem.
Thanks a lot!
Not sure about celery as I've never used it, but conceptually what seems reasonable (and the question is quite open ended anyway):
create thread(s) responsible solely for distributing tasks to certain GPUs and receiving requests
if any GPU is free assign task immediately to it
if both are occupied estimate time it will probably take to finish the task (neural network training)
add it to the queue of the GPU with the smallest estimated remaining time
Time estimation
ETA of current task can be approximated quite well given fixed number of samples and epochs. If that's not the case (e.g. early stopping) it will be harder/way harder and would need some heuristic.
When GPUs are overloaded (say each has 5 tasks in queue), what I would do is:
Stop process currently on-going on GPU
Run new process for a few batches of data to make rough estimation how long it might take to finish this task
Add it to the estimation of all tasks
Now, this depends on the traffic. If it's big and would interrupt on-going process too often you should simply add new tasks to GPU queue which has the least amount of tasks (some heuristic would be needed here as well, you should have estimated possible amount of requests by now, assuming only 2 GPUs it cannot be huge probably).
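The assignment rule could be sketched as below. The class and its methods are hypothetical, the ETAs are assumed to be supplied by the caller, and nothing here talks to Celery or to a GPU; a worker per GPU would pop jobs from its own queue and could pin itself to its card, for example through the CUDA_VISIBLE_DEVICES environment variable.

```python
class GpuDispatcher:
    """Assign each job to the GPU with the smallest estimated backlog."""

    def __init__(self, n_gpus=2):
        # One FIFO list of (job, eta_seconds) per GPU.
        self.queues = [[] for _ in range(n_gpus)]

    def backlog(self, gpu):
        """Sum of the estimated times of all jobs queued on this GPU."""
        return sum(eta for _, eta in self.queues[gpu])

    def submit(self, job, eta_seconds):
        """Queue the job on the least-loaded GPU and return that GPU's index."""
        gpu = min(range(len(self.queues)), key=self.backlog)
        self.queues[gpu].append((job, eta_seconds))
        return gpu

    def finish(self, gpu):
        """Call when the job at the head of a GPU's queue has completed."""
        return self.queues[gpu].pop(0)


dispatcher = GpuDispatcher(n_gpus=2)
first = dispatcher.submit("job-a", 100)   # both GPUs idle: goes to GPU 0
second = dispatcher.submit("job-b", 10)   # GPU 1 is idle: starts there
third = dispatcher.submit("job-c", 50)    # GPU 1 has the smaller backlog
```

A free GPU has a backlog of zero, so the "assign immediately if any GPU is free" case falls out of the same rule.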
I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.
If you have less than 20 entities that need to be simulated in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother about using GPU.
Using a multi-core CPU would be much easier to program and avoid the steps of translating all your operations into GPU operations.
Also, as you already suggested, the performance gain using GPU will be small (or even negative) with this small number of processes.
There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
For how many different potentials do you need to run simulations? Enough to just use the structure from the N-body example and still load the whole GPU?
If not, and assuming the potential calculation is expensive, I'd think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write out the forces to shared memory, __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. __syncthreads() again, and continue for the next time step.
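The two-stage structure (one thread per pair, then one thread per particle) can be mimicked on the CPU as follows. The inverse-square attraction in 2D is an assumed placeholder for whatever potentials are actually used, and the dictionary stands in for the block's shared memory.

```python
import math


def pair_force(pi, pj, strength=1.0):
    """Force on particle i due to j for an assumed inverse-square attraction."""
    dx, dy = pj[0] - pi[0], pj[1] - pi[1]
    r = math.hypot(dx, dy)
    f = strength / (r * r)
    return (f * dx / r, f * dy / r)


def net_forces(positions):
    n = len(positions)
    # Stage 1: one "thread" per ordered pair (i, j) writes its force into
    # a scratch table (shared memory on the GPU).
    scratch = {(i, j): pair_force(positions[i], positions[j])
               for i in range(n) for j in range(n) if i != j}
    # The first __syncthreads() would go here.
    # Stage 2: one "thread" per particle sums its own row of the table.
    return [(sum(scratch[i, j][0] for j in range(n) if j != i),
             sum(scratch[i, j][1] for j in range(n) if j != i))
            for i in range(n)]


forces = net_forces([(0.0, 0.0), (2.0, 0.0)])
```

With fewer than 20 particles there are fewer than 400 ordered pairs, so one block per potential setting comfortably holds one thread per pair.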
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.
I'm having a bit of trouble understanding how, or if, it's possible to share a workload between a GPU and CPU. I have a large log file; I need to read each line and then run about 5 million operations on it (testing for various scenarios). My current approach has been to read a few hundred lines, add them to an array and then send it to each GPU, which is working fine, but because there is so much work per line and so many lines it takes a long time. I noticed that while this is going on my CPU cores are basically doing nothing. I'm using EC2, so I have 2 quad-core Xeons & 2 Tesla GPUs; one CPU core reads the file (running the main program) and the GPUs do the work, so I'm wondering what I can do to involve the other 7 cores in the process.
I'm a bit confused about how to design a program to balance the tasks between GPU and CPU, because they would finish their jobs at different times, so I couldn't just send work to them all at the same time. I thought about setting up a queue (I'm new to C, so not sure if this is possible yet), but then is there a way to know when a GPU job is completed (since I thought sending jobs to CUDA was asynchronous)? My kernel is very similar to a normal C function, so converting it for CPU usage is not a problem; just balancing the work seems to be the issue. I went through 'CUDA by Example' again but couldn't really find anything referring to this type of balancing.
Any suggestions would be great.
I think the key is to create a multithreaded app, following all the common practices for that, and have two types of worker threads. One that does work with the GPU and one that does work with the CPU. So basically, you will need a thread pool and a queue.
http://en.wikipedia.org/wiki/Thread_pool_pattern
The queue can be very simple. You can have one shared integer that is the index of the current row in the log file. When a thread is ready to retrieve more work, it locks that index, gets some number of lines from the log file, starting at the line designated by the index, then increases the index by the number of lines that it retrieved, and then unlocks.
When a worker thread is done with one chunk of the log file, it posts its results back to the main thread and gets another chunk (or exits if there are no more lines to process).
The app launches some combination of GPU and CPU worker threads to utilize all available GPUs and CPU cores.
One problem you may run into is that if the CPU is busy, performance of the GPUs may suffer, as slight delays in submitting new work or processing results from the GPUs are introduced. You may need to experiment with the number of threads and their affinity. For instance, you may need to reserve one CPU core for each GPU by manipulating thread affinities.
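The shared-index queue can be sketched with Python threads as below. In this toy version every worker runs the same `work` function; in the real app a GPU worker and a CPU worker would differ only in what they do with their chunk. All names are hypothetical.

```python
import threading


def process_log(lines, n_workers, chunk_size, work):
    """Workers repeatedly claim the next chunk of lines via a locked index."""
    next_index = 0
    lock = threading.Lock()
    results = [None] * len(lines)

    def worker():
        nonlocal next_index
        while True:
            with lock:  # lock, read the index, advance it, unlock
                start = next_index
                next_index = min(start + chunk_size, len(lines))
            if start >= len(lines):
                return  # no more lines to process
            end = min(start + chunk_size, len(lines))
            for i in range(start, end):
                results[i] = work(lines[i])  # each slot has a single writer

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because a fast worker simply comes back for the next chunk sooner, the load balances itself without knowing in advance how fast the GPU and CPU workers are.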
Since you say line-by-line, maybe you can split the job across 2 different processes:
One CPU + GPU process
One CPU process that utilizes the remaining 7 cores
You can start off each process with different offsets, e.g. the 1st process reads lines 1-50, 101-150 etc. while the 2nd one reads 51-100, 151-200 etc.
This will spare you the headache of optimizing CPU-GPU interaction.
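The interleaved offsets could be computed like this (a sketch; the function name and the 1-based inclusive line numbering are my assumptions, chosen to match the ranges mentioned above):

```python
def line_ranges(process_id, n_processes, block, total_lines):
    """Yield 1-based inclusive (first, last) line ranges owned by process_id."""
    start = process_id * block + 1
    while start <= total_lines:
        yield (start, min(start + block - 1, total_lines))
        start += n_processes * block  # skip over the other processes' blocks


ranges_p0 = list(line_ranges(0, 2, 50, 200))
ranges_p1 = list(line_ranges(1, 2, 50, 200))
```

Note that a fixed 50/50 split assumes both processes are equally fast; if the GPU process is much faster, it would need larger or more frequent blocks.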
Modern programming languages provide parallelism and concurrency mechanisms as first class citizens to their users. I understand how parallel algorithms are programmed and can well imagine how two threads on a multi-core CPU can run in parallel.
Yet, most of these platforms also support running parallel processes on a single thread.
Do these processes really run in parallel?
How, on an assembly level can two different routines be executed simultaneously on a single thread?
TL;DR: parallelism (in the sense of true simultaneous execution) on a single, non-hyperthreaded CPU core is NOT possible.
Hardware parallelism can be achieved at several levels, ordered by decreasing granularity:
1. multi-host
2. multi-processor
3. multi-core
4. multi-threads ("Hyper-Threading", i.e. "HT")
(EDIT: I deliberately omit the case of vectorized computations, where several ALUs can be driven by the same core.)
Your question relates to running two software threads in cases 3. (in case HT is unavailable / disabled) or 4.
In both cases, the processes actually do NOT run in parallel. The user has an impression of simultaneity due to the extremely fast context switches performed at the CPU level, which allocate the physical core's (resp. hardware thread's) time sequentially to one or the other software thread.
In both cases, those routines are simply not executed simultaneously, but sequentially.
The relative priority allocated to each of those 2 routines can be set on various OSes by the "Priority" you give to the process, that will be handled by the OS's scheduler, which in turn will allocate CPU time.
HTH.
To run tests to better understand this topic, you may want to google "cpu affinity". This will let you run a two-threaded process on a single physical core of a multi-core CPU and measure the time taken by each of the threads while modifying their priority, etc.
Yes, there is parallelism in each thread and you get it for free, no matter which programming language you use (although the amount of parallelism may vary).
It's called instruction-level parallelism. The details are quite complex and differ between different processor micro-architectures.
Computer Architecture: A Quantitative Approach is a brilliant book which includes a chapter on instruction-level parallelism and the book's examples teach how to think rationally about engineering.
Check out the following links for more information:
http://en.wikipedia.org/wiki/Superscalar
http://en.wikipedia.org/wiki/Instruction_pipelining
http://en.wikipedia.org/wiki/Out-of-order_execution