Advice needed for a physics engine - cuda

I've recently started a project, building a physics engine.
I was hoping you could give me some advice related to some documentation and/or best technologies for this.
First of all, I've seen that Game-Physics-Engine-Development is highly recommended for the task at hand, and I was wondering if you could give me a second opinion.Should I get it?
Also, while browsing Amazon, I've stumbled onto Game Engine Architecture and since I want to build my physics engine for games, I thought this might be a good read aswell.
Second, I know that simulating physics is highly computation intensive so I would like to use either CUDA or OpenCL.Right now I'm leaning towards OpenCL, because it would work on both NVIDIA and ATI chipsets.What do you guys suggest?
PS: I will be implementing this in C++ on Linux.
Thanks.

I would suggest first of all planning a simple game as a test case for your engine. Having a basic game will drive feature and API development. Writing an engine without having clear goal makes the project riskier. While I agree nVidia and ATi should be treated as separate targets for performance reasons, I'd recommend you start with neither.
I personally wrote physics engine for Uncharted:Drake's Fortune - a PS3 game - and I did a pass in C++, and when it worked, made a pass to optimize it for VMX and then put it on SPU. Mind you, I did just a fraction what I wanted to do initially because of time constraints. After that I made an iteration to split data stages out and formulate a pipeline of data transformations. It's important because whether CPU, GPU or SPU, modern processors running nontrivial code spend most of the time waiting for caches. You have to pay special attention to data structures and pipeline them such that you have a small working set of data at any stage. E.g. first I do broadphase, so I don't need shapes but I need world-space bounding boxes. So I split bounding boxes into separate array and compute them all together in another pass, that writes them out in an optimal way. As input to bbox computation, I need shape transformations and some bounds from them, but not necessarily the whole shapes. After broadphase, I compute/update sim islands, at the same time performing narrow phase, for which I do actually need the shapes. And so on - I described this with pictures in an article to Game Physics Pearls I wrote.
I guess what I'm trying to say are the following points:
Make sure you have a clear goal that drives your development - a very basic game with flushed out design would be best in game physics engine case.
Don't try to optimize before you have a working product. Write it in the simplest and fastest way possible first and fix all the bugs in math. Design it so that you can port it to CUDA later, but don't start writing CUDA kernels before you have boxes rolling on the screen.
After you write the first pass in C++, optimize it for CPU : streamline it such that it doesn't thrash the cache, and compartmentalize the code so that there's no spaghetti of calls and all the code from each stage is localized. This will help a) port to CUDA b) port to OpenCL c) port to a console d) make it run reasonably fast e) make it possible to debug.
While developing, resist temptation to go do something you just thought about unless that feature is not necessary for your clear goal (see #1) - that's why you need a goal, to steer you towards it and make it possible to finish the actual project. Distractions usually kill projects without clear goals.
Remember that in one way or another, software development is iterative. It's ok to do a rough-in and then refine it. Leather, rinse, repeat - it's a mantra of a programmer :)
It's easy to give advice. If you wanna do something, just go and do it, and we'll sit back and critique :)

Here is an answer regarding the choice of CUDA or OpenCL. I do not have a recommendation for a book.
If you want to run your program on both NVIDIA and ATI chipsets, then OpenCL will make the job easier. However, you will want to write a different version of each kernel to get good performance on each chipset. For example, on ATI cards you'll want to manually vectorize code using float4/int4 data types (or accept a nearly 4x performance penalty), while NVIDIA works better with scalar data types.
If you're only targeting NVIDIA, then CUDA is somewhat more convenient to program in.

Related

Normal Cuda Vs CuBLAS?

Just of curiosity. CuBLAS is a library for basic matrix computations. But these computations, in general, can also be written in normal Cuda code easily, without using CuBLAS. So what is the major difference between the CuBLAS library and your own Cuda program for the matrix computations?
We highly recommend developers use cuBLAS (or cuFFT, cuRAND, cuSPARSE, thrust, NPP) when suitable for many reasons:
We validate correctness across every supported hardware platform, including those which we know are coming up but which maybe haven't been released yet. For complex routines, it is entirely possible to have bugs which show up on one architecture (or even one chip) but not on others. This can even happen with changes to the compiler, the runtime, etc.
We test our libraries for performance regressions across the same wide range of platforms.
We can fix bugs in our code if you find them. Hard for us to do this with your code :)
We are always looking for which reusable and useful bits of functionality can be pulled into a library - this saves you a ton of development time, and makes your code easier to read by coding to a higher level API.
Honestly, at this point, I can probably count on one hand the number of developers out there who actually implement their own dense linear algebra routines rather than calling cuBLAS. It's a good exercise when you're learning CUDA, but for production code it's usually best to use a library.
(Disclosure: I run the CUDA Library team)
There's several reasons you'd chose to use a library instead of writing your own implementation. Three, off the top of my head:
You don't have to write it. Why do work when somebody else has done it for you?
It will be optimised. NVIDIA supported libraries such as cuBLAS are likely to be optimised for all current GPU generations, and later releases will be optimised for later generations. While most BLAS operations may seem fairly simple to implement, to get peak performance you have to optimise for hardware (this is not unique to GPUs). A simple implementation of SGEMM, for example, may be many times slower than an optimised version.
They tend to work. There's probably less chance you'll run up against a bug in a library then you'll create a bug in your own implementation which bites you when you change some parameter or other in the future.
The above isn't just relevent to cuBLAS: if you have a method that's in a well supported library you'll probably save a lot of time and gain a lot of performance using it relative to using your own implementation.

are there existing libraries for many optimization jobs in parallel on GPU

I'm looking to perform many (thousands) of small optimization jobs on my nVidia Geforce.
With small jobs I mean 3-6 dimensions and around 1000 data points input each. Basically it's for curve fitting purposes, so the objective function to minimize is a sum of squares of a continuous (non-trivial) analytical function, of which I can compute the first derivative analytically. Each dimension is constrained between lower and upper boundary.
The only thing these jobs have in common is the original data series out of which they take different 1000 data points.
I suspect this will be much faster on my GPU than now, running them one by one my CPU, so I could use it for realtime monitoring.
However, the GPU libraries I've seen only focus on calculating a single function evaluation (faster) on the GPU.
There was a thread on my specific question on the nvidia CUDA forum, with more users looking for this, but the forums have been down for a while. It mentioned porting an existing C library (eg. levmar) to the CUDA language, but this got lost...
Do you know of an existing library to run many optimizations in parallel on a gpu?
Thanks!
The GFOR loop is meant to tile together many small problems like this. Each body of the loop is tiled together with the other loop bodies. Many people have used it for optimization problems like you describe. It is available for C/C++, Fortran, or Python as well as MATLAB code.
My disclaimer is that I work on GFOR. But I'm not aware of any other production-level GPU library that does similar optimization problems. You might be able to find some academic projects if you search around.

CUDA - Use the CURAND Library for Dummies

I was reading the CURAND Library API and I am a newbie in CUDA and I wanted to see if someone could actually show me a simple code that uses the CURAND Library to generate random numbers. I am looking into generating a large amount of number to use with Discrete Event Simulation. My task is just to develop the algorithms to use GPGPU's to speed up the random number generation. I have implemented the LCG, Multiplicative, and Fibonacci methods in standard C Language Programming. However I want to "port" those codes into CUDA and take advantage of threads and blocks to speed up the process of generating random numbers.
Link 1: http://adnanboz.wordpress.com/tag/nvidia-curand/
That person has two of the methods I will need (LCG and Mersenne Twister) but the codes do not provide much detail. I was wondering if anyone could expand on those initial implementations to actually point me in the right direction on how to use them properly.
Thanks!
Your question is misleading - you say "Use the cuRAND Library for Dummies" but you don't actually want to use cuRAND. If I understand correctly, you actually want to implement your own RNG from scratch rather than use the optimised RNGs available in cuRAND.
First recommendation is to revisit your decision to use your own RNG, why not use cuRAND? If the statistical properties are suitable for your application then you would be much better off using cuRAND in the knowledge that it is tuned for all generations of the GPU. It includes Marsaglia's XORWOW, l'Ecuyer's MRG32k3a, and the MTGP32 Mersenne Twister (as well as Sobol' for Quasi-RNG).
You could also look at Thrust, which has some simple RNGs, for an example see the Monte Carlo sample.
If you really need to create your own generator, then there's some useful techniques in GPU Computing Gems (Emerald Edition, Chapter 16: Parallelization Techniques for Random Number Generators).
As a side note, remember that while a simple LCG is fast and easy to skip-ahead, they typically have fairly poor statistical properties especially when using large quantities of draws. When you say you will need "Mersenne Twister" I assume you mean MT19937. The referenced Gems book talks about parallelising MT19937 but the original developers created the MTGP generators (also referenced above) since MT19937 is fairly complex to implement skip-ahead.
Also as another side note, just using a different seed to achieve parallelisation is usually a bad idea, statistically you are not assured of the independence. You either need to skip-ahead or leap-frog, or else use some other technique (e.g. DCMT) for ensuring there is no correlation between sequences.

Is GPGPU a hack?

I had started working on GPGPU some days ago and successfully implemented cholesky factorization with good performacne and I attended a conference on High Performance Computing where some people said that "GPGPU is a Hack".
I am still confused what does it mean and why they were saying it hack. One said that this is hack because you are converting your problem into a matrix and doing operations on it. But still I am confused that does people think it is a hack or if yes then why?
Can anyone help me, why they called it a hack while I found nothing wrong with it.
One possible reason for such opinion is that the GPU was not originally intended for general purpose computations. Also programming a GPU is less traditional and more hardcore and therefore more likely to be perceived as a hack.
The point that "you convert the problem into a matrix" is not reasonable at all. Whatever task you solve with writing code you choose reasonable data structures. In case of GPU matrices are likely the most reasonable datastructures and it's not a hack but just a natural choice to use them.
However I suppose that it's a matter of time for GPGPU becoming widespread. People just have to get used to the idea. After all who cares which unit of the computer runs the program?
On the GPU, having efficient memory access is paramount to achieving optimal performance. This often involves restructuring or even choosing entirely new algorithms and data structures. This is reason why GPU programming can be perceived as a hack.
Secondly, adapting an existing algorithm to run on the GPU is not in and of itself science. The relatively low scientific contribution of some GPU algorithm-related papers has led to a negative perception of GPU programming as strictly "engineering".
Obviously, only the person who said that can say for certain why he said it, but, here's my take:
A "Hack" is not a bad thing.
It forces people to learn new programming languages and concepts. For people who are just trying to model the weather or protein folding or drug reactions, this is an unwelcome annoyance. They didn't really want to learn FORTRAN (or whatever) in the first place, and now the have to learn another programming system.
The programming tools are NOT very mature yet.
The hardware isn't as reliable as CPUs (yet) so all of the calculations have to be done twice to make sure you've got the right answer. One reason for this is that GPUs don't come with error-correcting memory yet, so if you're trying to build a supercomputer with thousands of processors, the probability of a cosmic ray flipping a bit in you numbers approaches certainty.
As for the comment "you are converting your problem into a matrix and doing operations on it", I think that shows a lot of ignorance. Virtually ALL of high-performance computing fits that description!
One of the major problems in GPGPU for the past few years and probably for the next few is that programming them for arbitrary tasks was not very easy. Up until DX10 there was no integer support among GPUs and branching is still very poor. This is very much a situation where in order to get maximum benefit you have to write your code in a very awkward manner to extract all sorts of efficiency gains from the GPU. This is because you're running on hardware that is still dedicated to processing polygons and textures, rather than abstract parallel tasks.
Obviously, thats my take on it and YMMV
The GPGPU harks back to the days of the math co-processor. A hack is a shortcut to solving a long winded problem. GPGPU is a hack just like NAT on top of IPV4 is a hack. Computational problems just like networks are getting bigger as we try to do more, GPGPU is an useful interim solution, whether it stays outside the core CPU chip and has separate cranky API or gets sucked into the CPU via API or manufacture is up to the path finders.
I suppose he meant that using GPGPU forced you to restructure your implementation, so that it fitted the hardware, not the problem domain. Elegant implementation should fit the latter.
Note, that the word "hack" may have several different meanings:
http://www.urbandictionary.com/define.php?term=hack

How well do common programming tasks translate to GPUs?

I have recently begun working on a project to establish how best to leverage the processing power available in modern graphics cards for general programming. It seems that the field general purpose GPU programming (GPGPU) has a large bias towards scientific applications with a lot of heavy math as this fits well with the GPU computational model. This is all good and well, but most people don't spend all their time running simulation software and the like so we figured it might be possible to create a common foundation for easily building GPU-enabled software for the masses.
This leads to the question I would like to pose; What are the most common types of work performed by programs? It is not a requirement that the work translates extremely well to GPU programming as we are willing to accept modest performance improvements (Better little than nothing, right?).
There are a couple of subjects we have in mind already:
Data management - Manipulation of large amounts of data from databases
and otherwise.
Spreadsheet type programs (Is somewhat related to the above).
GUI programming (Though it might be impossible to get access to the
relevant code).
Common algorithms like sorting and searching.
Common collections (And integrating them with data manipulation
algorithms)
Which other coding tasks are very common? I suspect a lot of the code being written is of the category of inventory management and otherwise tracking of real 'objects'.
As I have no industry experience I figured there might be a number of basic types of code which is done more often than I realize but which just doesn't materialize as external products.
Both high level programming tasks as well as specific low level operations will be appreciated.
General programming translates terribly to GPUs. GPUs are dedicated to performing fairly simple tasks on streams of data at a massive rate, with massive parallelism. They do not deal well with the rich data and control structures of general programming, and there's no point trying to shoehorn that into them.
General programming translates terribly to GPUs. GPUs are dedicated to performing fairly simple tasks on streams of data at a massive rate, with massive parallelism. They do not deal well with the rich data and control structures of general programming, and there's no point trying to shoehorn that into them.
This isn't too far away from my impression of the situation but at this point we are not concerning ourselves too much with that. We are starting out by getting a broad picture of which options we have to focus on. After that is done we will analyse them a bit deeper and find out which, if any, are plausible options. If we end up determining that it is impossible to do anything within the field, and we are only increasing everybody's electricity bill then that is a valid result as well.
Things that modern computers do a lot of, where a little benefit could go a long way? Let's see...
Data management: relational database management could benefit from faster relational joins (especially joins involving a large number of relations). Involves massive homogeneous data sets.
Tokenising, lexing, parsing text.
Compilation, code generation.
Optimisation (of queries, graphs, etc).
Encryption, decryption, key generation.
Page layout, typesetting.
Full text indexing.
Garbage collection.
I do a lot of simplifying of configuration. That is I wrap the generation/management of configuration values inside a UI. The primary benefit is I can control work flow and presentation to make it simpler for non-techie users to configure apps/sites/services.
The other thing to consider when using a GPU is the bus speed, Most Graphics cards are designed to have a higher bandwidth when transferring data from the CPU out to the GPU as that's what they do most of the time. The bandwidth from the GPU back up to the CPU, which is needed to return results etc, isn't as fast. So they work best in a pipelined mode.
You might want to take a look at the March/April issue of ACM's Queue magazine, which has several articles on GPUs and how best to use them (besides doing graphics, of course).