How to accelerate a C++ code using CUDA - cuda

I had done my physics simulation project using C++ , OpenGL in Visual Studio 10. Later I had used OpenMP for CPU Parallelization. Now I want to accelerate my C++ code to CUDA so that I can achieve higher performance. Is it possible to convert my code into CUDA or any GPU devices?

Cuda and C++ are different programming languages (even if they look syntactically similar) with different programming paradigm.
You'll have to recode, and perhaps even redesign, your project to take advantage of Cuda (or of OpenCL).
Actually, you'll need to define what are the numerical kernels that might take advantage of your GPGPU and then recode these kernels (in Cuda, or in OpenCL); you'll also have to write some glue code to make all this work together.

You can determine which parts of your project can be parallelized and then reimplement these parts in Cuda. You can take a look at Fast N-Body Simulation with CUDA.

Related

OpenCL vs CUDA performance on Nvidia's device

I coded a program to create a color lookup table. I did it in CUDA and OpenCL, from my point of view both programs are pretty much the same, i.e. use the same amount of constant memory, global memory, same loops and branching code, etc.
I measure the running time and CUDA performed slightly better than OpenCL. My question is if using CUDA+NvidiaGPU is faster than OpenCL+NvidiaGPU because CUDA is the native way of programming such GPU?
Could you share some links to info related on this topic?
OpenCL and CUDA are equally fast if they are tweaked correctly for the target architecture. However, tweaking may negatively impact portability.
Links:
http://arxiv.org/ftp/arxiv/papers/1005/1005.2581.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=6047190&tag=1

Mathematica and CUDA

Is it possible that built in functions in Mathematica (like Minimize[expr,{x1,x2,...}]) will start to work via CUDA after installation of CUDA module for Mathematica?
I don't believe so, no. Mathematica's CUDALink module currently provides only a handful of GPU accelerated functions - some basic image processing operations, BLAS style linear algebra calls, Fourier Transforms and simple parallel reductions (argmin, argmax, and summation). There is also tools for integrating user written CUDA code, and for generating CUDA code symbolically. Outside of that, the rest of Mathematica's core functionality remains CPU only.
You can see full details of current CUDA and OpenCL support here.

What is CUDA like? What is it for? What are the benefits? And how to start?

I am interested in developing under some new technology and I was thinking in trying out CUDA. Now... their documentation is too technical and doesn't provide the answers I'm looking for. Also, I'd like to hear those answers from people that've had some experience with CUDA already.
Basically my questions are those in the title:
What exactly IS CUDA? (is it a framework? Or an API? What?)
What is it for? (is there something more than just programming to the GPU?)
What is it like?
What are the benefits of programming against CUDA instead of programming to the CPU?
What is a good place to start programming with CUDA?
CUDA brings together several things:
Massively parallel hardware designed to run generic (non-graphic) code, with appropriate drivers for doing so.
A programming language based on C for programming said hardware, and an assembly language that other programming languages can use as a target.
A software development kit that includes libraries, various debugging, profiling and compiling tools, and bindings that let CPU-side programming languages invoke GPU-side code.
The point of CUDA is to write code that can run on compatible massively parallel SIMD architectures: this includes several GPU types as well as non-GPU hardware such as nVidia Tesla. Massively parallel hardware can run a significantly larger number of operations per second than the CPU, at a fairly similar financial cost, yielding performance improvements of 50× or more in situations that allow it.
One of the benefits of CUDA over the earlier methods is that a general-purpose language is available, instead of having to use pixel and vertex shaders to emulate general-purpose computers. That language is based on C with a few additional keywords and concepts, which makes it fairly easy for non-GPU programmers to pick up.
It's also a sign that nVidia is willing to support general-purpose parallelization on their hardware: it now sounds less like "hacking around with the GPU" and more like "using a vendor-supported technology", and that makes its adoption easier in presence of non-technical stakeholders.
To start using CUDA, download the SDK, read the manual (seriously, it's not that complicated if you already know C) and buy CUDA-compatible hardware (you can use the emulator at first, but performance being the ultimate point of this, it's better if you can actually try your code out)
(Disclaimer: I have only used CUDA for a semester project in 2008, so things might have changed since then.) CUDA is a development toolchain for creating programs that can run on nVidia GPUs, as well as an API for controlling such programs from the CPU.
The benefits of GPU programming vs. CPU programming is that for some highly parallelizable problems, you can gain massive speedups (about two orders of magnitude faster). However, many problems are difficult or impossible to formulate in a manner that makes them suitable for parallelization.
In one sense, CUDA is fairly straightforward, because you can use regular C to create the programs. However, in order to achieve good performance, a lot of things must be taken into account, including many low-level details of the Tesla GPU architecture.

best way of using cuda

There are ways of using cuda:
auto-paralleing tools such as PGI workstation;
wrapper such as Thrust(in STL style)
NVidia GPUSDK(runtime/driver API)
Which one is better for performance or learning curve or other factors?
Any suggestion?
Performance rankings will likely be 3, 2, 1.
Learning curve is (1+2), 3.
If you become a CUDA expert, then it will be next to impossible to beat the performance of your hand-rolled code using all the tricks in the book using the GPU SDK due to the control that it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and shown on several problems to have 90-95+% efficiency compared with hand-rolled CUDA. The reductions, scans, and many cool iterators they have are useful for a wide class of problems too.
Auto-parallelizing tools tend to not do quite as good a job with the different memory types as karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application, there are great articles about it on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK, and I've personally, as well as many others, have run into some problems and bugs in them, that are hard to fix (There also is no documentation for this "library" and I've heard that NVIDIA does not support it at all)
Instead, as you mentioned, use the one of the two provided APIs. In particular I recommend the Runtime API, as it is a higher level API, and so you don't have to worry quite as much about all of the low level implementation details as you do in the Device API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
It depends on what you want to do on the GPU. If your algorithm would highly benefit from the things thrust can offer, like reduction, prefix, sum, then thrust is definitely worth a try and I bet you can't write the code faster yourself in pure CUDA C.
However if you're porting already parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I had already successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extend, but as far as I know you're launching new kernels for each thrust call, if you want to have all in one big fat kernel (taking too frequent kernel starts out of the equation), you have to use plain CUDA C with the SDK.
I find the pure CUDA C actually easier to learn, as it gives you quite a good understanding on what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I never used auto-paralleing tools such as PGI workstation, but I wouldn't advise to add even more "magic" into the equation.

Performance differences between different CUDA SDK's?

If I want to re-write my application so that it leverages the power of nVidia's CUDA SDK, are there any differences at all in runtime performance between the different SDK offerings: C++, Java, Python?
Is there any difference at all between these 3 SDK's, besides the obvious language being used?
There will be a measurable performance impact on the CPU bound portions of your processing. For instance, if your CUDA data requires pre-processing before reaching the GPU, writing the numerical routine in Python would be suboptimal.
If your CUDA routines dominate the computation time (the CPU remains relatively idle), any of the bindings are a good choice.
It may be best to first prototype in a language such as Python, and if you identify a performance bottleneck move that code to C++.