Multiple matrix inversions in parallel on GPU (CUDA)

I'm attempting to optimise an application for real-time 3D modelling. The compute part of the application runs almost entirely on the GPU in CUDA. The application requires the solution of a small (6x6) double-precision symmetric positive definite linear system Ax = b 500+ times per second. Currently this is done with an efficient CPU-based linear algebra library using Cholesky, but it necessitates copying data from the GPU to the CPU and back hundreds of times per second, plus the overhead of the associated kernel launches each time, etc.
How can I solve the linear system entirely on the GPU, without moving the data onto the CPU at all? I've read a little about the MAGMA library, but it seems to use hybrid CPU/GPU algorithms rather than GPU-only algorithms.
I accept that solving an individual linear system on the GPU will be a lot slower than with the existing CPU-based library, but I want to see whether that is made up for by eliminating the host-device data transfers and the kernel-launch overhead incurred hundreds of times per second. If there is no GPU-only LAPACK-like alternative out there, how would I go about implementing something to solve this particular 6x6 case on the GPU only? Could it be done without a huge time investment, using GPU BLAS libraries for example?

NVIDIA posted code for a batched Ax=b solver to the registered developer website last fall. This code works for generic matrices, and should work well enough for your needs provided you can expand the symmetric matrices to full matrices (that should not be an issue for a 6x6?). As the code performs pivoting, which is unnecessary for positive definite matrices, it is not optimal for your case, but you may be able to modify it for your purposes as the code is under a BSD license.
NVIDIA's standard developer website is experiencing some issues at the moment. Here is how you can download the batched solver code at this time:
(1) Go to http://www.nvidia.com/content/cuda/cuda-toolkit.html
(2) If you have an existing NVdeveloper account (e.g. via partners.nvidia.com) click on the green "Login to nvdeveloper" link on the right half of the screen. Otherwise click on "Join nvdeveloper" to apply for a new account; requests for new accounts are typically approved within one business day.
(3) Log in at the prompt with your email address and password
(4) There is a section on the right hand side titled "Newest Downloads". The fifth item from the top is "Batched Solver". Click on that and it will bring you to the download page for the code.
(5) Click on the "download" link, then click "Accept" to accept the license terms. Your download should start.

Related

Using CUDA GPUs at prediction time for high throughput streams

We're trying to develop a Natural Language Processing application that has a user facing component. The user can call models through an API, and get the results back.
The models are pretrained using Keras with Theano. We use GPUs to speed up training, but prediction is also sped up significantly by using the GPU. Currently we have a machine with two GPUs. However, at runtime (e.g. when running the user-facing bits) there is a problem: multiple Python processes sharing the GPUs via CUDA do not seem to offer a parallel speed-up.
We're using nvidia-docker with libgpuarray (pygpu), Theano and Keras.
The GPUs are still mostly idle, but adding more Python workers does not speed up the process.
What is the preferred way of solving the problem of running GPU models behind an API? Ideally we'd utilize the existing GPUs more efficiently before buying new ones.
I can imagine that we want some sort of buffer that batches requests before sending them off to the GPU, rather than acquiring a lock for each HTTP call?
This is not an answer to your more general question, but rather an answer based on how I understand the scenario you described.
If someone has coded a system which uses a GPU for some computational task, they have (hopefully) taken the time to parallelize its execution so as to benefit from the full resources the GPU can offer, or something close to that.
That means that if you add a second similar task - even in parallel - the total amount of time to complete them should be similar to the amount of time to complete them serially, i.e. one after the other - since there are very few underutilized GPU resources for the second task to benefit from. In fact, it could even be the case that both tasks will be slower (if, say, they both make heavy use of the L2 cache and thrash it when running together).
At any rate, when you want to improve performance, a good thing to do is profile your application - in this case, using the nvprof profiler or its nvvp visual front end.

Overlapping Data Transfers on Maxwell (NVIDIA GPU)

I'm a newbie on the forum and I hope that you can help me with my question. Recently, I've developed an application in which I've used CUDA streams with the aim of overlapping computation and data transfers. I've executed this application on an NVIDIA GPU (Maxwell architecture). I've observed with the Visual Profiler tool that some HostToDevice data transfers occur at the same time. Maxwell GPUs only have 2 copy engines: one copy engine is for HostToDevice transfers and the other is for DeviceToHost transfers, right? With this in mind, I think that two HostToDevice transfers can't occur at the same time. However, I've observed with the Visual Profiler that this behaviour appears in my application. So, my question is: on this architecture, is it possible for two HostToDevice (or DeviceToHost) data transfers to occur at the same time?
Thank you so much.
So, my question is: on this architecture, is it possible for two HostToDevice (or DeviceToHost) data transfers to occur at the same time?
No, it's not possible.
It's not possible for 2 transfers to occur at the same time in the same direction. This is arguably a property of PCI Express rather than anything to do with CUDA: when a PCI Express transaction is in progress in a given direction, no other transaction can be taking place in that direction. Either you are misinterpreting the output of the Visual Profiler, or the Visual Profiler has some sort of bug.
By hovering your mouse over the specific transactions in the Visual Profiler, you can get additional details about them in the window at the right-hand side of the display. This additional information should include the start and finish time of each transaction (as well as its size in bytes, etc.). I would start there, to see if the Visual Profiler thinks they are in the same direction and have the same start time.
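As a sanity check, you can reproduce the situation with a trivial two-stream test and inspect it in the profiler. A minimal sketch (transfer size and stream count are arbitrary): with pinned host memory, the two HostToDevice copies below are issued to different streams, yet they share the single H2D copy engine and the PCIe link, so the profiler should show them back to back rather than overlapped.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        const size_t bytes = 64 << 20;               // 64 MB per transfer
        float *h0, *h1, *d0, *d1;
        cudaStream_t s0, s1;

        // Pinned host memory is required for truly asynchronous copies.
        cudaHostAlloc(&h0, bytes, cudaHostAllocDefault);
        cudaHostAlloc(&h1, bytes, cudaHostAllocDefault);
        cudaMalloc(&d0, bytes);
        cudaMalloc(&d1, bytes);
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Two HostToDevice copies in different streams: same direction,
        // same copy engine, so they serialize.
        cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
        cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
        cudaDeviceSynchronize();

        printf("done\n");
        return 0;
    }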

Parallelizing FFT (using CUDA)

On my application I need to transform each line of an image, apply a filter and transform it back.
I want to be able to compute multiple FFTs at the same time using the GPU. More precisely, I'm using NVIDIA's CUDA. Now, some considerations:
CUDA's FFT library, CUFFT, can only be called from the host ( https://devtalk.nvidia.com/default/topic/523177/cufft-device-callable-library/).
On this topic (running FFTW on GPU vs. using CUFFT), Robert Crovella says
"cufft routines can be called by multiple host threads".
I believed that doing all these FFTs in parallel would increase performance, but Robert comments
"the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine"
So,
Is this it? Is there no gain in performing more than one FFT at a time?
Is there any library that supports calls from the device?
Should I just use cufftPlanMany() instead (as referred to in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang, or as referred to in the previous topic by Robert)?
Or is the best option to call multiple host threads?
(this 2-link limit is killing me...)
My objective is to get some discussion on what's the best solution to this problem, since many have faced similar situations.
This might be obsolete once NVIDIA implements device calls on CUFFT.
(something they said they are working on, but with no expected release date - as mentioned in the discussion on the NVIDIA forum (first link))
So, is this it? Is there no gain in performing more than one FFT at a time?
If the individual FFTs are large enough to fully utilize the device, there is no gain in performing more than one FFT at a time. You can still use standard methods like overlapping copy and compute to get the most performance out of the machine.
If the FFTs are small, then a batched plan is a good way to get the most performance. If you go this route, I recommend using CUDA 5.5, as there have been some API improvements.
Is there any library that supports calls from the device?
The cuFFT library cannot be called from device code.
There are other CUDA libraries, of course, such as ArrayFire, which may have options I'm not familiar with.
Should I just use cufftPlanMany() instead (as referred to in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang, or as referred to in the previous topic by Robert)?
Or is the best option to call multiple host threads?
Batched plan is preferred over multiple host threads - the API can do a better job of resource management that way, and you will have more API-level visibility (such as through the resource estimation functions in CUDA 5.5) as to what is possible.
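For the image-rows case described in the question, a batched 1D plan with cufftPlanMany() might look like this minimal sketch (assuming single-precision complex data already resident on the device; the function name, width and height are placeholders for your own image dimensions):

    #include <cufft.h>
    #include <cuda_runtime.h>

    // One forward FFT per image row, all rows in a single batched call.
    void fft_rows(cufftComplex* d_image, int width, int height)
    {
        cufftHandle plan;
        int n[1] = { width };        // FFT length = row length

        // NULL inembed/onembed selects the default contiguous layout
        // (stride 1, distance between consecutive transforms = width),
        // which is exactly how row-major image rows are laid out.
        cufftPlanMany(&plan, 1, n,
                      NULL, 1, width,
                      NULL, 1, width,
                      CUFFT_C2C, height);

        cufftExecC2C(plan, d_image, d_image, CUFFT_FORWARD);
        // ... apply the filter in a kernel, then reuse the same plan
        //     with CUFFT_INVERSE to transform back ...

        cufftDestroy(plan);
    }

The plan can be created once and reused for every image; if needed, cufftSetStream() lets you overlap the transforms of one image with the filtering of another.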

Can CUDA handle its own work queues?

Sorry if this is obvious, but I'm studying C++ and CUDA right now and wanted to know if this was possible, so I could focus more on the relevant sections.
Basically my problem is highly parallelizable; in fact, I'm running it on multiple servers currently. My program gets a work item (a very small list), runs a loop on it, and makes one of 3 decisions:
Keep the data (save it),
Discard the data (don't do anything with it),
Process the data further (it's unsure what to do, so it modifies the data and resends it to the queue for processing).
This used to be recursive, but I made each part independent. Although I'm no longer bound by one CPU, the negative effect is that there are a lot of messages passing back and forth. I understand at a high level how CUDA works and how to submit work to it, but is it possible for CUDA to manage the queue on the device itself?
My current thought process was to manage the queue on the C++ host and then send the processing to the device, after which the results are returned to the host and sent back to the device (and so on). I think that could work, but I wanted to see if it is possible to have the queue in CUDA memory itself, with kernels taking work from it and sending work directly to it.
Is something like this possible with CUDA or is there a better way to do this?
I think what you're asking is if you can keep intermediate results on the device. The answer to that is yes. In other words, you should only need to copy new work items to the device and only copy finished items from the device. The work items that are still undetermined can stay on the device between kernel calls.
You may want to look into CUDA Thrust for this. Thrust has efficient algorithms for transformations, which can be combined with custom logic (search for "kernel fusion" in the Thrust manual.) It sounds like maybe your processing can be considered to be transformations, where you take a vector of work items and create two new vectors, one of items to keep and one of items that are still undetermined.
Is the host aware of (or can it monitor) memory on the device? My concern is how to be aware of, and deal with, data that starts to exceed the GPU's onboard memory.
It is possible to allocate and free memory from within a kernel but it's probably not going to be very efficient. Instead, manage memory by running CUDA calls such as cudaMalloc() and cudaFree() or, if you're using Thrust, creating or resizing vectors between kernel calls.
With this "manual" memory management you can keep track of how much memory you have used with cudaMemGetInfo().
Since you will be copying completed work items back to the host, you will know how many work items are left on the device and thus, what the maximum amount of memory that might be required in a kernel call is.
Maybe a good strategy would be to swap source and destination vectors for each transform. To take a simple example, say you have a set of work items that you want to filter in multiple steps. You create vector A and fill it with work items. Then you create vector B of the same size and leave it empty. After the filtering, some portion of the work items in A have been moved to B, and you have the count. Now you run the filter again, this time with B as the source and A as the destination.
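A minimal sketch of that swap strategy with Thrust might look like the following (the WorkItem type, the is_undetermined predicate, and the processing step are placeholders for your own logic):

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    struct WorkItem { int state; /* ... your small list ... */ };

    struct is_undetermined {
        __host__ __device__ bool operator()(const WorkItem& w) const { return w.state == 0; }
    };

    // One filtering pass: process every item in src, keep the still-undetermined
    // ones in dst, and return how many are left. Finished items would be pulled
    // out with a complementary predicate (omitted here).
    size_t process_pass(thrust::device_vector<WorkItem>& src,
                        thrust::device_vector<WorkItem>& dst)
    {
        // ... launch your processing kernel over src here ...

        dst.resize(src.size());
        auto end = thrust::copy_if(src.begin(), src.end(), dst.begin(), is_undetermined());
        size_t remaining = end - dst.begin();
        dst.resize(remaining);

        // Optional: check memory headroom before the next pass.
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        printf("undetermined: %zu, free device memory: %zu MB\n",
               remaining, free_bytes >> 20);
        return remaining;
    }

You would then alternate process_pass(A, B) and process_pass(B, A) until the remaining count reaches zero, copying finished items into a separate device vector with a complementary predicate and transferring only that back to the host.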

CUDA contexts, streams, and events on multiple GPUs

TL;DR version: "What's the best way to round-robin kernel calls to multiple GPUs with Python/PyCUDA such that CPU and GPU work can happen in parallel?" with a side of "I can't have been the first person to ask this; anything I should read up on?"
Full version:
I would like to know the best way to design context, etc. handling in an application that uses CUDA on a system with multiple GPUs. I've been trying to find literature that talks about guidelines for when context reuse vs. recreation is appropriate, but so far haven't found anything that outlines best practices, rules of thumb, etc.
The general overview of what we need to do is:
Requests come in to a central process.
That process forks to handle a single request.
Data is loaded from the DB (relatively expensive).
Then the following is repeated an arbitrary number of times, based on the request (dozens of times):
A few quick kernel calls to compute data that is needed for later kernels.
One slow kernel call (10 sec).
Finally:
Results from the kernel calls are collected and processed on the CPU, then stored.
At the moment, each kernel call creates and then destroys a context, which seems wasteful. Setup is taking about 0.1 sec per context and kernel load, and while that's not huge, it is precluding us from moving other quicker tasks to the GPU.
I am trying to figure out the best way to manage contexts, etc. so that we can use the machine efficiently. I think that in the single-GPU case, it's relatively simple:
Create a context before starting any of the GPU work.
Launch the kernels for the first set of data.
Record an event for after the final kernel call in the series.
Prepare the second set of data on the CPU while the first is computing on the GPU.
Launch the second set, repeat.
Ensure that each event gets synchronized before collecting the results and storing them.
That seems like it should do the trick, assuming proper use of overlapped memory copies.
However, I'm unsure what I should do when wanting to round-robin each of the dozens of items to process over multiple GPUs.
The host program is Python 2.7, using PyCUDA to access the GPU. Currently it's not multi-threaded, and while I'd rather keep it that way ("now you have two problems" etc.), if the answer means threads, it means threads. Similarly, it would be nice to just be able to call event.synchronize() in the main thread when it's time to block on data, but for our needs efficient use of the hardware is more important. Since we'll potentially be servicing multiple requests at a time, letting other processes use the GPU when this process isn't using it is important.
I don't think that we have any explicit reason to use exclusive compute mode (i.e. we're not filling up the memory of the card with one work item), so I don't think that solutions that involve long-standing contexts are off the table.
Note that answers in the form of links to other content that covers my questions are completely acceptable (encouraged, even), provided they go into enough detail about the why, not just the API. Thanks for reading!
Caveat: I'm not a PyCUDA user (yet).
With CUDA 4.0+ you don't even need an explicit context per GPU. You can just call cudaSetDevice (or the PyCUDA equivalent) before doing per-device stuff (cudaMalloc, cudaMemcpy, launch kernels, etc.).
If you need to synchronize between GPUs, you will need to potentially create streams and/or events and use cudaEventSynchronize (or the PyCUDA equivalent). You can even have one stream wait on an event inserted in another stream to do sophisticated dependencies.
So I suspect the answer today is quite a lot simpler than talonmies' excellent pre-CUDA-4.0 answer.
You might also find this answer useful.
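In runtime-API terms (PyCUDA exposes equivalents), the round-robin pattern described above boils down to something like this minimal sketch; slow_kernel, d_data and the launch configuration are placeholders, and the fixed-size arrays assume at most 8 GPUs for brevity:

    #include <cuda_runtime.h>

    __global__ void slow_kernel(float* data) { /* placeholder for the 10 s kernel */ }

    void round_robin(float** d_data, int num_gpus, int num_items)
    {
        cudaStream_t stream[8];   // assumes num_gpus <= 8
        cudaEvent_t  done[8];

        for (int g = 0; g < num_gpus; ++g) {
            cudaSetDevice(g);                 // per-device context is created lazily
            cudaStreamCreate(&stream[g]);
            cudaEventCreate(&done[g]);
        }

        for (int i = 0; i < num_items; ++i) {
            int g = i % num_gpus;             // round-robin the work items
            cudaSetDevice(g);
            slow_kernel<<<256, 256, 0, stream[g]>>>(d_data[g]);
            cudaEventRecord(done[g], stream[g]);
            // The host is free to prepare the next item's data here while the
            // kernel runs; block on done[g] only when its results are needed.
        }

        for (int g = 0; g < num_gpus; ++g) {
            cudaSetDevice(g);
            cudaEventSynchronize(done[g]);
            cudaStreamDestroy(stream[g]);
            cudaEventDestroy(done[g]);
        }
    }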
(Re)Edit by OP: Per my understanding, PyCUDA supports versions of CUDA prior to 4.0, and so still uses the old API/semantics (the driver API?), so talonmies' answer is still relevant.