Is it possible to split CUDA jobs between GPU & CPU?

I'm having a bit of trouble understanding how, or whether, it's possible to share a workload between a GPU and a CPU. I have a large log file; I need to read each line and then run about 5 million operations on it (testing for various scenarios). My current approach is to read a few hundred lines, add them to an array, and then send that to each GPU, which works fine, but because there is so much work per line and so many lines, it takes a long time. I noticed that while this is going on, my CPU cores are basically doing nothing. I'm using EC2, so I have 2 quad-core Xeons and 2 Tesla GPUs; one CPU core reads the file (running the main program) and the GPUs do the work, so I'm wondering how, or what, I can do to involve the other 7 cores in the process?
I'm a bit confused about how to design a program that balances tasks between the GPU and CPU, because they would finish their jobs at different times, so I can't just send work to all of them at the same time. I thought about setting up a queue (I'm new to C, so I'm not sure if this is possible yet), but then is there a way to know when a GPU job is completed (since I thought sending jobs to CUDA was asynchronous)? My kernel is very similar to a normal C function, so converting it for CPU use is no problem; balancing the work seems to be the issue. I went through 'CUDA by Example' again but couldn't really find anything referring to this type of balancing.
Any suggestions would be great.

I think the key is to create a multithreaded app, following all the common practices for that, and have two types of worker threads. One that does work with the GPU and one that does work with the CPU. So basically, you will need a thread pool and a queue.
http://en.wikipedia.org/wiki/Thread_pool_pattern
The queue can be very simple. You can have one shared integer that is the index of the current row in the log file. When a thread is ready to retrieve more work, it locks that index, gets some number of lines from the log file, starting at the line designated by the index, then increases the index by the number of lines that it retrieved, and then unlocks.
When a worker thread is done with one chunk of the log file, it posts its results back to the main thread and gets another chunk (or exits if there are no more lines to process).
The app launches some combination of GPU and CPU worker threads to utilize all available GPUs and CPU cores.
One problem you may run into is that if the CPU is busy, performance of the GPUs may suffer, as slight delays in submitting new work or processing results from the GPUs are introduced. You may need to experiment with the number of threads and their affinity. For instance, you may need to reserve one CPU core for each GPU by manipulating thread affinities.
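For illustration, here is a minimal sketch of that shared-index queue using C++ std::thread and std::mutex. The helpers process_lines_gpu and process_lines_cpu are hypothetical placeholders for the actual per-line work, and error handling is omitted.

    #include <algorithm>
    #include <cstddef>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical helpers supplied elsewhere: one copies a batch of lines to the
    // device and launches the kernel, the other runs the same logic as a CPU loop.
    void process_lines_gpu(const std::string *lines, std::size_t count);
    void process_lines_cpu(const std::string *lines, std::size_t count);

    static std::vector<std::string> g_lines;   // the log file, read in by the main thread
    static std::size_t g_next = 0;             // index of the next unclaimed line
    static std::mutex  g_lock;
    static const std::size_t CHUNK = 256;      // lines handed out per request

    // Claim the next chunk of lines; returns false when the file is exhausted.
    static bool claim(std::size_t &begin, std::size_t &end)
    {
        std::lock_guard<std::mutex> guard(g_lock);
        if (g_next >= g_lines.size()) return false;
        begin  = g_next;
        end    = std::min(g_next + CHUNK, g_lines.size());
        g_next = end;
        return true;
    }

    // One of these threads per GPU; 'device' selects which Tesla it drives.
    static void gpu_worker(int device)
    {
        cudaSetDevice(device);
        std::size_t b, e;
        while (claim(b, e))
            process_lines_gpu(&g_lines[b], e - b);
    }

    // One of these threads per spare CPU core.
    static void cpu_worker()
    {
        std::size_t b, e;
        while (claim(b, e))
            process_lines_cpu(&g_lines[b], e - b);
    }

main() would then start, say, one gpu_worker per Tesla and six cpu_worker threads, join them all, and collect the results; as noted above, it may pay to leave one core per GPU lightly loaded.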

Since you say the work is line-by-line, maybe you can split the job across 2 different processes:
One CPU + GPU process
One CPU process that utilizes the remaining 7 cores
You can start each process with a different offset - e.g. the 1st process reads lines 1-50, 101-150, etc. while the 2nd one reads 51-100, 151-200, etc.
This will save you the headache of optimizing the CPU-GPU interaction.
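For illustration, a tiny sketch of how each process could compute its own interleaved line ranges; the 50-line stride and the function name are just placeholders.

    #include <algorithm>

    // Hypothetical helper: process 'which' (0 or 1) walks its interleaved
    // 50-line ranges of a file with 'total_lines' lines.
    void my_ranges(int which, long total_lines)
    {
        const long STRIDE = 50;
        for (long start = which * STRIDE; start < total_lines; start += 2 * STRIDE) {
            long end = std::min(start + STRIDE, total_lines);
            (void)end;   // process lines [start, end) in this process
        }
    }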

How do you keep data in fast GPU memory (l1/shared) across kernel invocations?

How do you keep data in fast GPU memory across kernel invocations?
Let's suppose I need to answer 1 million queries, each of which has about 1.5MB of data that's reusable across invocations and about 8KB of data that's unique to each query.
One approach is to launch a kernel for each query, copying the 1.5MB + 8KB of data to shared memory each time. However, then I spend a lot of time just copying 1.5MB of data that really could persist across queries.
Another approach is to "recycle" the GPU threads (see https://stackoverflow.com/a/49957384/3738356). That involves launching one kernel that immediately copies the 1.5MB of data to shared memory, then waits for requests to come in, waiting for the 8KB of data to show up before proceeding with each iteration. It really seems like CUDA wasn't meant to be used this way. If one just uses managed memory and volatile, monotonically increasing counters to synchronize, there's still no guarantee that the data necessary to compute the answer will be on the GPU when you go to read it. You can seed the memory with dummy values like -42 that indicate the value hasn't yet made its way to the GPU (via the caching/managed memory mechanisms), and then busy-wait until the values become valid. Theoretically, that should work. However, I had enough memory errors that I've given up on it for now, and I've pursued....
Another approach still uses recycled threads but instead synchronizes data via cudaMemcpyAsync, CUDA streams, CUDA events, and still a couple of volatile, monotonically increasing counters. I hear I need to pin the 8KB of data that's fresh with each query in order for cudaMemcpyAsync to work correctly. But the async copy isn't blocked -- its effects just aren't observable. I suspect that with enough grit I can make this work too.
However, all of the above makes me think "I'm doing it wrong." How do you keep extremely re-usable data in the GPU caches so it can be accessed from one query to the next?
First of all, to observe the effects of streams and async copying, you definitely need to pin the host memory. Then you can observe concurrent kernel invocations "almost" happening at the same time. I'd rather use async copying, since it makes me feel in control of the situation.
Secondly, you could just hold on to the data in global memory and load it into shared memory whenever you need it. To my knowledge, shared memory is only visible to the kernel itself and is disposed of after termination. Try using async copies while the kernel is running and synchronize the streams accordingly. Don't forget to __syncthreads() after loading into shared memory. I hope it helps.
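To make that concrete, here is a minimal sketch (names, sizes, and grid dimensions are illustrative, no error checking): the reusable ~1.5 MB block stays in device global memory across launches, the fresh 8 KB is uploaded per query with cudaMemcpyAsync from pinned host memory, and the kernel stages it into shared memory behind a __syncthreads().

    #include <cuda_runtime.h>

    #define REUSABLE_BYTES (1536 * 1024)   // ~1.5 MB, stays resident on the device
    #define QUERY_BYTES    (8 * 1024)      // fresh data per query

    __global__ void answer_query(const char *reusable, const char *query, float *out)
    {
        __shared__ char q[QUERY_BYTES];            // the per-query data fits in shared memory
        for (int i = threadIdx.x; i < QUERY_BYTES; i += blockDim.x)
            q[i] = query[i];                       // cooperative load from global memory
        __syncthreads();                           // every thread now sees the query data

        // ... compute using q[] and tiles of reusable[] read from global memory ...
        if (threadIdx.x == 0) out[blockIdx.x] = 0.0f;  // placeholder result
    }

    int main()
    {
        char *d_reusable, *d_query, *h_query;
        float *d_out, *h_out;
        cudaMalloc((void **)&d_reusable, REUSABLE_BYTES);
        cudaMalloc((void **)&d_query, QUERY_BYTES);
        cudaMalloc((void **)&d_out, sizeof(float));
        cudaHostAlloc((void **)&h_query, QUERY_BYTES, cudaHostAllocDefault); // pinned: needed for real async copies
        cudaHostAlloc((void **)&h_out, sizeof(float), cudaHostAllocDefault);

        // ... fill d_reusable once with cudaMemcpy; it persists across kernel launches ...

        cudaStream_t s;
        cudaStreamCreate(&s);
        for (int query = 0; query < 1000000; ++query) {
            // ... write the next 8 KB of query data into h_query ...
            cudaMemcpyAsync(d_query, h_query, QUERY_BYTES, cudaMemcpyHostToDevice, s);
            answer_query<<<1, 256, 0, s>>>(d_reusable, d_query, d_out);
            cudaMemcpyAsync(h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost, s);
            cudaStreamSynchronize(s);   // the result for this query is now in *h_out
        }
        cudaStreamDestroy(s);
        return 0;
    }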

MPI + CUDA software architecture on GPU cluster

I would like to ask for advice on using OpenMPI and CUDA on a GPU cluster.
I am a beginner, and I feel I can't foresee the consequences of my decisions about the software architecture. I would highly appreciate someone's advice/rule of thumb, as information on GPU clusters is quite sparse.
Framework:
Cluster architecture
the cluster has 1 front node and 9 computation nodes
the cluster is heterogeneous: every node has Intel Xeon CPU(s) and Nvidia Tesla K80 GPUs, but with a different number of processors and a different number of GPU cards
the cluster runs PBSPro scheduler
Goal
1) redistribute the data from a root_MPI_process to MPI_processes
2) load the data to GPU, execute kernel (SIMT-parallel calculations), get the results back
3) send the results back to root_MPI_process
4) root_MPI_process processes the results, creates new data
... iterate -> redistribute the data ...
Steps 1, 2, 3 are purely [SERIAL], and each spawned MPI_process is independent of all the others, i.e. no pieces of data are moved between any two MPI_processes
My considerations for software architecture
Alt. 1) 1 MPI process == 1 GPU
start X MPI_processes, every MPI_process (except the root_MPI_process) is responsible for 1 GPU
the MPI_process then receives a chunk of data, suitable to be passed right away to the GPU, and executes the kernel ... steps described above
Alt. 2) 1 MPI process == 1 computational cluster node (with multiple GPUs)
start X MPI processes, every MPI_process (except the root_MPI_process) runs on 1 computational cluster node
the MPI_process then identifies the number of GPUs and asks for the appropriate amount of data from the root_MPI_process
the data passed from the root_MPI_process to the MPI_process are then redistributed among the available GPUs ... steps 2, 3, 4 mentioned above
Questions
1) From an experienced point of view, what else -- besides the data passing (which is easier in 1) and more complicated in 2), from my point of view) -- should I consider?
2) This application cannot take advantage of CUDA-aware MPI, because the data are not passed between GPUs, is that right? (Is CUDA-aware MPI useful for anything other than inter-GPU communication?)
3) Solution 2) offers a unified address space (a single address space), but solution 1) does not, because every MPI_process accesses 1 GPU, is that right?
Edit
this is research in progress, and I don't dare to estimate E2E timing. For reference, this task takes approx. 60 hours on 3x GTX 1070; the cluster has 16x Tesla K80. My computational time at the moment is quite unlimited.
The data are approx. 1 [kB] per thread, so 1 kernel requires blocks * threads * 1024 [B] of data; I would like to run 1 kernel per GPU at a time.
the kernel (each thread in each block) runs a simulation of a 2nd-order dynamic system with evaluation of a small neural network (30 neurons) (the number of multiplications and additions is in the hundreds per iteration); there are around 1,000,000 simulation iterations before delivering the result.
From the above I can say with confidence that evaluating the kernel is more time-consuming than the host<->device data transfer.
1) From an experienced point of view, what else -- besides the data passing (which is easier in 1) and more complicated in 2), from my point of view) -- should I consider?
If your assumption of kernel execution time >>> communication time holds true, then this is a simple problem. Also, if you don't benefit from / don't intend to really utilize the Xeon CPUs, then things are simpler. Just use Alt. 1) (1-to-1, pure MPI). Alt. 2) means you would have to implement two tiers of workload distribution; there's no need to do that without a good reason.
If your assumptions don't hold true, things can get way more complicated and far beyond a concise answer on SO. Addressing these issues is not useful without a clear understanding of the application characteristics.
One thing that you may have to consider, if your application runs for > 1 day, is checkpointing.
2) This application cannot take advantage of CUDA-aware MPI, because the data are not passed between GPUs, is that right? (Is CUDA-aware MPI useful for anything other than inter-GPU communication?)
Since the CPU processes the resulting data in step 4), you wouldn't benefit from CUDA-aware MPI.
3) Solution 2) offers a unified address space (a single address space), but solution 1) does not, because every MPI_process accesses 1 GPU, is that right?
No, there are multiple (9) address spaces in your second approach. One address space per compute node. So you have to use MPI anyway, even in the second approach - which is exactly what makes the 1-rank-1-GPU mapping much simpler.
One thing you should consider: your step 4) will become a scalability bottleneck at some point, but probably not at the scales you are talking about. It is worth investing in performance analysis tools/methodology to get a good understanding of how your code performs and where the bottlenecks are during development and when scaling up to production.
I would start with the first alternative:
The data transfer to each node would be the same in either situation, so that's a wash.
The first scenario lets the scheduler assign a core to each GPU with room to spare.
The time to spawn the multiple MPI listeners only occurs once if done right.
Unless you add parallelism within each MPI worker, the second alternative has to process each GPU serially.
My only caveat is to watch the network and DMA for multiple cores fighting for data. If collisions dominate, add the extra code to implement the second alternative. There is little lost in coding the easier solution first and checking the first iteration at step 4 to see if data passing is problematic.
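For illustration, here is a stripped-down sketch of Alt. 1) with one MPI rank per GPU, simplified so that the root also takes a chunk of work; my_kernel, the chunk size, the iteration count, and the rank-to-GPU mapping are placeholders (how ranks land on GPUs depends on how the scheduler places them).

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <vector>

    __global__ void my_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];   // placeholder for the real simulation
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int chunk = 1 << 20;                       // elements per rank (illustrative)
        std::vector<float> send(rank == 0 ? chunk * size : 0);
        std::vector<float> local(chunk), result(chunk);

        int ndev = 1;
        cudaGetDeviceCount(&ndev);
        cudaSetDevice(rank % ndev);                      // simple modulo mapping as a starting point

        for (int iter = 0; iter < 10; ++iter) {
            // 1) the root redistributes the data
            MPI_Scatter(send.data(), chunk, MPI_FLOAT,
                        local.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

            // 2) each rank loads its chunk onto its own GPU and runs the kernel
            float *d_in, *d_out;
            cudaMalloc((void **)&d_in,  chunk * sizeof(float));
            cudaMalloc((void **)&d_out, chunk * sizeof(float));
            cudaMemcpy(d_in, local.data(), chunk * sizeof(float), cudaMemcpyHostToDevice);
            my_kernel<<<(chunk + 255) / 256, 256>>>(d_in, d_out, chunk);
            cudaMemcpy(result.data(), d_out, chunk * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(d_in); cudaFree(d_out);

            // 3) results go back to the root, which then builds the next input (step 4)
            MPI_Gather(result.data(), chunk, MPI_FLOAT,
                       send.data(), chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);
        }
        MPI_Finalize();
        return 0;
    }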

Transferring data to GPU while kernel is running to save time

A GPU is really fast when it comes to parallel computation and outperforms a CPU, being 15-30 (some have reported even 50) times faster; however,
GPU memory is very limited compared to CPU memory, and communication between GPU memory and the CPU is not as fast.
Let's say we have some data that won't fit into GPU RAM but we still want to use its wonders to compute. What we can do is split that data into pieces and feed them to the GPU one by one.
Sending large data to the GPU can take time, and one might think: what if we split a piece of data in two, feed the first half, run the kernel, and then feed the other half while the kernel is running?
By that logic we should save some time, because the data transfer would go on while the computation does, hopefully without interrupting it, and when the kernel is finished it can just, well, continue its job without needing to wait for new data.
I must say that I'm new to GPGPU and new to CUDA, but I have been experimenting with simple CUDA code and have noticed that the cudaMemcpy function used to transfer data between CPU and GPU will block if a kernel is running. It will wait until the kernel is finished and then do its job.
My question: is it possible to accomplish something like that described above, and if so, could someone show an example or provide some information on how it could be done?
Thank you!
is it possible to accomplish something like that described above
Yes, it's possible. What you're describing is a pipelined algorithm, and CUDA has various asynchronous capabilities to enable it.
The asynchronous concurrent execution section of the programming guide covers the necessary elements in CUDA to make it work. To use your example, there exists a non-blocking version of cudaMemcpy, called cudaMemcpyAsync. You'll need to understand CUDA streams and how to use them.
I would also suggest this presentation which covers most of what is needed.
Finally, here is a worked example. That particular example happens to use CUDA stream callbacks, but those are not necessary for basic pipelining. They enable additional host-oriented processing to be asynchronously triggered at various points in the pipeline, but the basic chunking of data, and delivery of data while processing is occurring does not depend on stream callbacks. Note also the linked CUDA sample codes in that answer, which may be useful for study/learning.
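As a minimal illustration of the ingredients (names are made up, no error checking): pin the host buffer, put the kernel and the copy in different non-default streams, and the host-to-device transfer can overlap the running kernel instead of waiting behind it the way a plain cudaMemcpy would.

    #include <cuda_runtime.h>

    __global__ void work(float *d, int n)
    {
        // ... long-running kernel operating on the chunk already on the device ...
    }

    int main()
    {
        const int N = 1 << 20;
        float *h_next, *d_cur, *d_next;
        cudaHostAlloc((void **)&h_next, N * sizeof(float), cudaHostAllocDefault); // pinned host memory
        cudaMalloc((void **)&d_cur,  N * sizeof(float));
        cudaMalloc((void **)&d_next, N * sizeof(float));

        cudaStream_t compute, copy;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&copy);

        work<<<256, 256, 0, compute>>>(d_cur, N);      // kernel runs in the 'compute' stream
        cudaMemcpyAsync(d_next, h_next, N * sizeof(float),
                        cudaMemcpyHostToDevice, copy); // next chunk uploads while the kernel runs
        cudaDeviceSynchronize();                       // wait for both before reusing the buffers

        cudaStreamDestroy(compute);
        cudaStreamDestroy(copy);
        return 0;
    }

In a real pipeline you would do this in a loop over chunks, swapping the two device buffers each iteration.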

Persistent GPU function/operation

I'm wondering if it is possible to write a persistent GPU function. I have my doubts, but I'm not sure how the scheduler works.
I'm looking to process an unknown number of data points, (approx 50 million). The data arrives in chunks of 20 or so. It would be good if I could drop these 20 points into a GPU 'bucket', and have this 'persistent' operation grab and process them as they come in. When done, grab the result.
I could keep the GPU busy with dummy data while the bucket is empty, but I think race conditions on a partially empty bucket will be an issue.
I suspect I wouldn't be able to run any other operations on the GPU while this persistent operation is running, i.e. put the other, undedicated SMs to work.
Is this a viable (fermi) GPU approach, or just a bad idea?
I'm not sure whether this persistent kernel is possible, but it would surely be very inefficient. Although the idea is elegant, it doesn't fit the GPU: you would have to communicate globally which thread picks which element out of the bucket, some threads might never even start as they wait for others to finish, and the bucket would have to be declared volatile, which would slow down access to your entire input data.
A more common solution to your problem is to divide the data into chunks and asynchronously copy the chunks onto the GPU. You would use two streams, one working on the last chunk sent and the other sending a new chunk from the host. The two actually run simultaneously, so you are likely to hide most of the transfer. But don't let the chunks become too small, or your kernel will suffer from low occupancy and performance will degrade.
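A minimal sketch of that two-stream chunking scheme, with illustrative sizes and a placeholder kernel (no error checking): because each stream executes its own operations in order while different streams can overlap, the upload of one chunk can run concurrently with the kernel working on the previous one.

    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)   // elements per chunk (illustrative)

    __global__ void process_chunk(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;   // placeholder work
    }

    int main()
    {
        const int nchunks = 16;
        float *h_in, *h_out, *d_in[2], *d_out[2];
        cudaHostAlloc((void **)&h_in,  nchunks * CHUNK * sizeof(float), cudaHostAllocDefault); // pinned
        cudaHostAlloc((void **)&h_out, nchunks * CHUNK * sizeof(float), cudaHostAllocDefault);
        cudaStream_t s[2];
        for (int b = 0; b < 2; ++b) {
            cudaMalloc((void **)&d_in[b],  CHUNK * sizeof(float));
            cudaMalloc((void **)&d_out[b], CHUNK * sizeof(float));
            cudaStreamCreate(&s[b]);
        }

        for (int c = 0; c < nchunks; ++c) {
            int b = c % 2;                                   // alternate buffers/streams
            cudaMemcpyAsync(d_in[b], h_in + c * CHUNK, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);   // upload chunk c
            process_chunk<<<(CHUNK + 255) / 256, 256, 0, s[b]>>>(d_in[b], d_out[b], CHUNK);
            cudaMemcpyAsync(h_out + c * CHUNK, d_out[b], CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);   // download result of chunk c
            // while stream s[b] works on chunk c, the next loop iteration queues
            // chunk c+1 on the other stream, so its upload overlaps this kernel
        }
        cudaDeviceSynchronize();
        return 0;
    }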

Is cudaMemcpy from host to device executed in parallel?

I am curious if cudaMemcpy is executed on the CPU or the GPU when we copy from host to device?
In other words, is the copy a sequential process or is it done in parallel?
Let me explain why I ask this: I have an array of 5 million elements. Now, I want to copy 2 sets of 50,000 elements from different parts of the array. So, I was thinking: will it be faster to first form one large array on the CPU of all the elements I want to copy and then do just 1 large transfer, or should I just call cudaMemcpy twice, once for each set?
If cudaMemcpy is done in parallel, then I think the 2nd approach will be faster, as you don't have to copy 100,000 elements sequentially on the CPU first.
I am curious if cudaMemcpy is executed on the CPU or the GPU when we copy from host to device?
In the case of the synchronous API call with regular pageable user allocated memory, the answer is it runs on both. The driver must first copy data from the source memory to a DMA mapped source buffer on the host, then signal to the GPU that data is waiting for transfer. The GPU then executes the transfer. The process is repeated as many times as necessary for the complete copy from source memory to the GPU.
The throughput of this process can be improved by using pinned memory, which the driver can DMA to or from directly without intermediate copying (although pinning has a large initialization/allocation overhead which needs to be amortised as well).
As to the rest of the question, I suspect that two memory copies directly from the source memory would be more efficient than the alternative, but this is the sort of question that can only be conclusively answered by benchmarking.
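For what it's worth, here is the kind of micro-benchmark that would settle it, timed with CUDA events; the sizes, offsets, and names are illustrative only.

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    int main()
    {
        const size_t N = 5 * 1000 * 1000, SET = 50000, OFF1 = 0, OFF2 = 1000000;
        std::vector<float> h(N, 1.0f);
        std::vector<float> packed(2 * SET);
        float *d;
        cudaMalloc((void **)&d, 2 * SET * sizeof(float));
        cudaMemcpy(d, h.data(), SET * sizeof(float), cudaMemcpyHostToDevice); // warm-up, not timed

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float ms = 0.0f;

        // Strategy A: two direct copies from different parts of the array.
        cudaEventRecord(start);
        cudaMemcpy(d,       h.data() + OFF1, SET * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d + SET, h.data() + OFF2, SET * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("two copies: %.3f ms\n", ms);

        // Strategy B: pack on the CPU first, then one transfer.
        cudaEventRecord(start);
        memcpy(packed.data(),       h.data() + OFF1, SET * sizeof(float));
        memcpy(packed.data() + SET, h.data() + OFF2, SET * sizeof(float));
        cudaMemcpy(d, packed.data(), 2 * SET * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("pack + one copy: %.3f ms\n", ms);

        cudaFree(d);
        return 0;
    }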
I believe a transfer from host to GPU memory is a blocking call. It uses the entire bus and, as such, it doesn't really make sense (even if it were physically possible) to run multiple operations in parallel.
I doubt you'll get any performance gain from concatenating the data before transferring it. The bottleneck will likely be the transfer itself. The copies should be queued and executed sequentially, with minimal overhead.