are there existing libraries for many optimization jobs in parallel on GPU - cuda

I'm looking to perform many (thousands) of small optimization jobs on my nVidia Geforce.
With small jobs I mean 3-6 dimensions and around 1000 data points input each. Basically it's for curve fitting purposes, so the objective function to minimize is a sum of squares of a continuous (non-trivial) analytical function, of which I can compute the first derivative analytically. Each dimension is constrained between lower and upper boundary.
The only thing these jobs have in common is the original data series out of which they take different 1000 data points.
I suspect this will be much faster on my GPU than now, running them one by one my CPU, so I could use it for realtime monitoring.
However, the GPU libraries I've seen only focus on calculating a single function evaluation (faster) on the GPU.
There was a thread on my specific question on the nvidia CUDA forum, with more users looking for this, but the forums have been down for a while. It mentioned porting an existing C library (eg. levmar) to the CUDA language, but this got lost...
Do you know of an existing library to run many optimizations in parallel on a gpu?
Thanks!

The GFOR loop is meant to tile together many small problems like this. Each body of the loop is tiled together with the other loop bodies. Many people have used it for optimization problems like you describe. It is available for C/C++, Fortran, or Python as well as MATLAB code.
My disclaimer is that I work on GFOR. But I'm not aware of any other production-level GPU library that does similar optimization problems. You might be able to find some academic projects if you search around.

Related

Using CUDA GPUs at prediction time for high througput streams

We're trying to develop a Natural Language Processing application that has a user facing component. The user can call models through an API, and get the results back.
The models are pretrained using Keras with Theano. We use GPUs to speed up the training. However, prediction is still sped up significantly by using the GPU. Currently, we have a machine with two GPUs. However, at runtime (e.g. when running the user facing bits) there is a problem: multiple Python processes sharing the GPUs via CUDA does not seem to offer a parallelism speed up.
We're using nvidia-docker with libgpuarray (pygpu), Theano and Keras.
The GPUs are still mostly idle, but adding more Python workers does not speed up the process.
What is the preferred way of solving the problem of running GPU models behind an API? Ideally we'd utilize the existing GPUs more efficiently before buying new ones.
I can imagine that we want some sort of buffer before sending it off to the GPU, rather than requesting a lock for each HTTP call?
This is not an answer to your more general question, but rather an answer based on how I understand the scenario you described.
If someone has coded a system which uses a GPU for some computational task, they have (hopefully) taken the time to parallelize its execution so as to benefit from the full resources the GPU can offer, or something close to that.
That means that if you add a second similar task - even in parallel - the total amount of time to complete them should be similar to the amount of time to complete them serially, i.e. one after the other - since there are very little underutilized GPU resources for the second task to benefit from. In fact, it could even be the case that both tasks will be slower (if, say, they both somehow utilize the L2 cache a lot, and when running together they thrash it).
At any rate, when you want to improve performance, a good thing to do is profile your application - in this case, using the nvprof profiler or its nvvp frontend (the first link is the official documentation, the second link is a presentation).

Paralelizing FFT (using CUDA)

On my application I need to transform each line of an image, apply a filter and transform it back.
I want to be able to make multiple FFT at the same time using the GPU. More precisely, I'm using NVIDIA's CUDA. Now, some considerations:
CUDA's FFT library, CUFFT is only able to make calls from the host ( https://devtalk.nvidia.com/default/topic/523177/cufft-device-callable-library/).
On this topic (running FFTW on GPU vs using CUFFT), Robert Corvella says
"cufft routines can be called by multiple host threads".
I believed that doing all this FFTs in parallel would increase performance, but Robert comments
"the FFT operations are of reasonably large size, then just calling the cufft library routines as indicated should give you good speedup and approximately fully utilize the machine"
So,
Is this it? Is there no gain in performing more than one FFT at a time?
Is there any library that supports calls from the device?
Shoud I just use cufftPlanMany() instead (as refered in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred in the previous topic, by Robert)?
Or the best option is to call mutiple host threads?
(this 2 links limit is killing me...)
My objective is to get some discussion on what's the best solution to this problem, since many have faced similar situations.
This might be obsolete once NVIDIA implements device calls on CUFFT.
(something they said they are working on but there is no expected date for the release - something said on the discussion at the NVIDIA forum (first link))
So, Is this it? Is there no gain in performing more than one FFT at a time?
If the individual FFT's are large enough to fully utilize the device, there is no gain in performing more than one FFT at a time. You can still use standard methods like overlap of copy and compute to get the most performance out of the machine.
If the FFT's are small then the batched plan is a good way to get the most performance. If you go this route, I recommend using CUDA 5.5, as there have been some API improvements.
Is there any library that supports calls from the device?
cuFFT library cannot be used by making calls from device code.
There are other CUDA libraries, of course, such as ArrayFire, which may have options I'm not familiar with.
Shoud I just use cufftPlanMany() instead (as refered in "is-there-a-method-of-fft-that-will-run-inside-cuda-kernel" by hang or as referred in the previous topic, by Robert)?
Or the best option is to call mutiple host threads?
Batched plan is preferred over multiple host threads - the API can do a better job of resource management that way, and you will have more API-level visibility (such as through the resource estimation functions in CUDA 5.5) as to what is possible.

Advice needed for a physics engine

I've recently started a project, building a physics engine.
I was hoping you could give me some advice related to some documentation and/or best technologies for this.
First of all, I've seen that Game-Physics-Engine-Development is highly recommended for the task at hand, and I was wondering if you could give me a second opinion.Should I get it?
Also, while browsing Amazon, I've stumbled onto Game Engine Architecture and since I want to build my physics engine for games, I thought this might be a good read aswell.
Second, I know that simulating physics is highly computation intensive so I would like to use either CUDA or OpenCL.Right now I'm leaning towards OpenCL, because it would work on both NVIDIA and ATI chipsets.What do you guys suggest?
PS: I will be implementing this in C++ on Linux.
Thanks.
I would suggest first of all planning a simple game as a test case for your engine. Having a basic game will drive feature and API development. Writing an engine without having clear goal makes the project riskier. While I agree nVidia and ATi should be treated as separate targets for performance reasons, I'd recommend you start with neither.
I personally wrote physics engine for Uncharted:Drake's Fortune - a PS3 game - and I did a pass in C++, and when it worked, made a pass to optimize it for VMX and then put it on SPU. Mind you, I did just a fraction what I wanted to do initially because of time constraints. After that I made an iteration to split data stages out and formulate a pipeline of data transformations. It's important because whether CPU, GPU or SPU, modern processors running nontrivial code spend most of the time waiting for caches. You have to pay special attention to data structures and pipeline them such that you have a small working set of data at any stage. E.g. first I do broadphase, so I don't need shapes but I need world-space bounding boxes. So I split bounding boxes into separate array and compute them all together in another pass, that writes them out in an optimal way. As input to bbox computation, I need shape transformations and some bounds from them, but not necessarily the whole shapes. After broadphase, I compute/update sim islands, at the same time performing narrow phase, for which I do actually need the shapes. And so on - I described this with pictures in an article to Game Physics Pearls I wrote.
I guess what I'm trying to say are the following points:
Make sure you have a clear goal that drives your development - a very basic game with flushed out design would be best in game physics engine case.
Don't try to optimize before you have a working product. Write it in the simplest and fastest way possible first and fix all the bugs in math. Design it so that you can port it to CUDA later, but don't start writing CUDA kernels before you have boxes rolling on the screen.
After you write the first pass in C++, optimize it for CPU : streamline it such that it doesn't thrash the cache, and compartmentalize the code so that there's no spaghetti of calls and all the code from each stage is localized. This will help a) port to CUDA b) port to OpenCL c) port to a console d) make it run reasonably fast e) make it possible to debug.
While developing, resist temptation to go do something you just thought about unless that feature is not necessary for your clear goal (see #1) - that's why you need a goal, to steer you towards it and make it possible to finish the actual project. Distractions usually kill projects without clear goals.
Remember that in one way or another, software development is iterative. It's ok to do a rough-in and then refine it. Leather, rinse, repeat - it's a mantra of a programmer :)
It's easy to give advice. If you wanna do something, just go and do it, and we'll sit back and critique :)
Here is an answer regarding the choice of CUDA or OpenCL. I do not have a recommendation for a book.
If you want to run your program on both NVIDIA and ATI chipsets, then OpenCL will make the job easier. However, you will want to write a different version of each kernel to get good performance on each chipset. For example, on ATI cards you'll want to manually vectorize code using float4/int4 data types (or accept a nearly 4x performance penalty), while NVIDIA works better with scalar data types.
If you're only targeting NVIDIA, then CUDA is somewhat more convenient to program in.

Different threads on different multiprocessors

Is it possible to run different threads on different multiprocessors? similar to CPU cores?
Suppose I have 2 large arrays a, b and I want to compute both sum and difference. Lets say I have 2 multiprocessors on my device. Is it possible to run both kernel functions (which compute sum and difference) concurrently on 2 different multiprocessors?
Using your example of computing both the sum and difference, the best performance is probably going to be achieved if you compute both at the same time (i.e. in the same kernel).
Assuming this is not possible for some reason, then if your arrays are very large then the best performance may be to use the whole GPU (i.e. multiple multiprocessors) to compute the result in which case it doesn't matter too much that you do one after the other.
For both of the above cases I would strongly recommend you check out the reduction sample in the SDK which walks you through a naive implementation up to a pretty quick version with good documentation.
Having said all of that, if the amount of work is sufficiently small that you would not be fully utilising the whole GPU for one of your computations then there are two ways to run different computations on different multiprocessors:
Use "Concurrent Kernels", where multiple kernels run on the same GPU at the same time. See the CUDA Programming Guide for more information and check out the concurrentKernels sample in the SDK, in essence you manual tell the scheduler that there is no dependency between the two (by using CUDA streams) which allows thenm to be executed simultaneously.
Have a switch on the blockIdx to select which code to execute.
The first of these is far more preferable if your hardware supports it (you will need Compute Capability 2.0 or greater) since it is far simpler to read and maintain.
yes, using Fermi devices and multiple streams.

Have you successfully used a GPGPU? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I am interested to know whether anyone has written an application that takes advantage of a GPGPU by using, for example, nVidia CUDA. If so, what issues did you find and what performance gains did you achieve compared with a standard CPU?
I have been doing gpgpu development with ATI's stream SDK instead of Cuda.
What kind of performance gain you will get depends on a lot of factors, but the most important is the numeric intensity. (That is, the ratio of compute operations to memory references.)
A BLAS level-1 or BLAS level-2 function like adding two vectors only does 1 math operation for each 3 memory references, so the NI is (1/3). This is always run slower with CAL or Cuda than just doing in on the cpu. The main reason is the time it takes to transfer the data from the cpu to the gpu and back.
For a function like FFT, there are O(N log N) computations and O(N) memory references, so the NI is O(log N). If N is very large, say 1,000,000 it will likely be faster to do it on the gpu; If N is small, say 1,000 it will almost certainly be slower.
For a BLAS level-3 or LAPACK function like LU decomposition of a matrix, or finding its eigenvalues, there are O( N^3) computations and O(N^2) memory references, so the NI is O(N). For very small arrays, say N is a few score, this will still be faster to do on the cpu, but as N increases, the algorithm very quickly goes from memory-bound to compute-bound and the performance increase on the gpu rises very quickly.
Anything involving complex arithemetic has more computations than scalar arithmetic, which usually doubles the NI and increases gpu performance.
(source: earthlink.net)
Here is the performance of CGEMM -- complex single precision matrix-matrix multiplication done on a Radeon 4870.
I have written trivial applications, it really helps if you can parallize floating point calculations.
I found the following course cotaught by a University of Illinois Urbana Champaign professor and an NVIDIA engineer very useful when I was getting started: http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html (includes recordings of all lectures).
I have used CUDA for several image processing algorithms. These applications, of course, are very well suited for CUDA (or any GPU processing paradigm).
IMO, there are three typical stages when porting an algorithm to CUDA:
Initial Porting: Even with a very basic knowledge of CUDA, you can port simple algorithms within a few hours. If you are lucky, you gain a factor of 2 to 10 in performance.
Trivial Optimizations: This includes using textures for input data and padding of multi-dimensional arrays. If you are experienced, this can be done within a day and might give you another factor of 10 in performance. The resulting code is still readable.
Hardcore Optimizations: This includes copying data to shared memory to avoid global memory latency, turning the code inside out to reduce the number of used registers, etc. You can spend several weeks with this step, but the performance gain is not really worth it in most cases. After this step, your code will be so obfuscated that nobody understands it (including you).
This is very similar to optimizing a code for CPUs. However, the response of a GPU to performance optimizations is even less predictable than for CPUs.
I have been using GPGPU for motion detection (Originally using CG and now CUDA) and stabilization (using CUDA) with image processing.
I've been getting about a 10-20X speedup in these situations.
From what I've read, this is fairly typical for data-parallel algorithms.
While I haven't got any practical experiences with CUDA yet, I have been studying the subject and found a number of papers which document positive results using GPGPU APIs (they all include CUDA).
This paper describes how database joins can be paralellized by creating a number of parallel primitives (map, scatter, gather etc.) which can be combined into an efficient algorithm.
In this paper, a parallel implementation of the AES encryption standard is created with comparable speed to discreet encryption hardware.
Finally, this paper analyses how well CUDA applies to a number of applications such as structured and unstructured grids, combination logic, dynamic programming and data mining.
I've implemented a Monte Carlo calculation in CUDA for some financial use. The optimised CUDA code is about 500x faster than a "could have tried harder, but not really" multi-threaded CPU implementation. (Comparing a GeForce 8800GT to a Q6600 here). It is well know that Monte Carlo problems are embarrassingly parallel though.
Major issues encountered involves the loss of precision due to G8x and G9x chip's limitation to IEEE single precision floating point numbers. With the release of the GT200 chips this could be mitigated to some extent by using the double precision unit, at the cost of some performance. I haven't tried it out yet.
Also, since CUDA is a C extension, integrating it into another application can be non-trivial.
I implemented a Genetic Algorithm on the GPU and got speed ups of around 7.. More gains are possible with a higher numeric intensity as someone else pointed out. So yes, the gains are there, if the application is right
I wrote a complex valued matrix multiplication kernel that beat the cuBLAS implementation by about 30% for the application I was using it for, and a sort of vector outer product function that ran several orders of magnitude than a multiply-trace solution for the rest of the problem.
It was a final year project. It took me a full year.
http://www.maths.tcd.ie/~oconbhup/Maths_Project.pdf
I have implemented Cholesky Factorization for solving large linear equation on GPU using ATI Stream SDK. My observations were
Got performance speedup upto 10 times.
Working on same problem to optimize it more, by scaling it to multiple GPUs.
Yes. I have implemented the Nonlinear Anisotropic Diffusion Filter using the CUDA api.
It is fairly easy, since it's a filter that must be run in parallel given an input image. I haven't encountered many difficulties on this, since it just required a simple kernel. The speedup was at about 300x. This was my final project on CS. The project can be found here (it's written in Portuguese thou).
I have tried writing the Mumford&Shah segmentation algorithm too, but that has been a pain to write, since CUDA is still in the beginning and so lots of strange things happen. I have even seen a performance improvement by adding a if (false){} in the code O_O.
The results for this segmentation algorithm weren't good. I had a performance loss of 20x compared to a CPU approach (however, since it's a CPU, a different approach that yelded the same results could be taken). It's still a work in progress, but unfortunaly I left the lab I was working on, so maybe someday I might finish it.