Using High Level Shader Language for computational algorithms - CUDA

So, I heard that some people have figured out ways to run programs on the GPU using High Level Shader Language, and I would like to start writing my own programs that run on the GPU rather than on my CPU, but I have been unable to find anything on the subject.
Does anyone have any experience with writing programs for the GPU or know of any documentation on the subject?
Thanks.

For computation, CUDA and OpenCL are more suitable than shader languages. For CUDA, I highly recommend the book CUDA by Example. The book is aimed at absolute beginners to this area of programming.

The best way, I think, to start is to:
1. Have a CUDA-capable card from Nvidia
2. Download the driver + toolkit + SDK
3. Build the examples
4. Read the CUDA Programming Guide
5. Start by recreating the cudaDeviceInfo example
6. Try to allocate memory on the GPU
7. Try to create a little kernel (see the minimal sketch below)
From there you should be able to gain enough momentum to learn the rest.
Once you learn CUDA, OpenCL and the others are a breeze.
I am suggesting CUDA because it is the one most widely supported and tested.
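To give an idea of what steps 6 and 7 amount to, here is a minimal sketch (error checking omitted): it allocates a buffer on the GPU, runs a tiny kernel over it, and copies the result back. It should build with the toolkit's nvcc.

    #include <cstdio>
    #include <cuda_runtime.h>

    // A "little kernel": each thread squares one element in place.
    __global__ void square(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = data[i] * data[i];
    }

    int main()
    {
        const int n = 256;
        float host[n];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        // Allocate memory on the GPU and copy the input over.
        float *dev = NULL;
        cudaMalloc((void **)&dev, n * sizeof(float));
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

        // Launch the kernel: one block of 256 threads.
        square<<<1, 256>>>(dev, n);

        // Copy the result back and release the GPU buffer.
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);

        printf("host[10] = %f\n", host[10]);   // expect 100.0
        return 0;
    }

The <<<1, 256>>> launch syntax is the only CUDA-specific piece; everything else is ordinary C.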

Related

How do I develop a CUDA application on my ATI card, to be later executed on NVIDIA?

My computer has an ATI graphics card, but I need to code an algorithm I already have in CUDA to accelerate the process. Is that even possible? If yes, does anyone have a link or tutorial covering everything from setting up my IDE to coding a simple image-processing example or passing an image? I also considered OpenCL, but I have not found any information on how to do anything with it.
This answer is more directed toward the part
I also considered OpenCL, but I have not found any information on how to do anything with it.
Check on this NVIDIA site:
http://developer.nvidia.com/nvidia-gpu-computing-documentation
Scroll down and you will find:
OpenCL Programming Guide
This is a detailed programming guide for OpenCL developers.
OpenCL Best Practices Guide
This is a manual to help developers obtain the best performance from OpenCL.
OpenCL Overview for the CUDA Architecture
This whitepaper summarizes the guidelines for how to choose the best implementations for NVIDIA GPUs.
OpenCL Implementation Notes
This document describes the "implementation defined" behavior for the NVIDIA OpenCL implementation, as required by the OpenCL specification, version 1.0. The implementation-defined behavior is referenced below in the order of its reference in the OpenCL specification and is grouped by the section number of the specification.
On AMD/ATI you have this site for a brief introduction:
http://www.amd.com/us/products/technologies/stream-technology/opencl/pages/opencl-intro.aspx
And for more resources check:
http://www.amd.com/us/products/technologies/stream-technology/Pages/training-resources.aspx
Unless CUDA is a requirement, you should consider OpenCL again, as you can use it on both platforms, and you state that you have one and want to develop for the other.
You might also want to take a look at these:
http://blogs.nvidia.com/2011/06/cuda-now-available-for-multiple-x86-processors/
http://www.pgroup.com/resources/cuda-x86.htm
I haven't tried it myself, but the prospect of running CUDA code on x86 seems pretty attractive.

GPGPU on CUDA and OpenGL

I have been working with CUDA recently. I am just wondering if there is any performance difference between CUDA and OpenGL in terms of general-purpose computing. I am currently working on a GTX 580.
The correct answer is probably "it depends".
In pure floating point or integer throughput it shouldn't matter much whether you use GLSL or something more "modern", but CUDA and OpenCL expose hardware features like pointers, shared memory, communication and synchronization between threads, and the grid/block virtualization of compute domains, which are pretty crucial to achieving good performance on compute workloads. There are lots of algorithms that would be difficult or impossible to implement in a shader language but can be implemented efficiently in literally a handful of lines of OpenCL or CUDA.
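To make that concrete, here is a sketch of a block-wide sum reduction, the sort of thing that takes a handful of lines in CUDA because of __shared__ memory and __syncthreads() but has no natural expression in a classic shader:

    // Each block stages a tile of the input in shared memory, the threads
    // cooperate on a tree reduction, and thread 0 writes one partial sum.
    __global__ void block_sum(const float *in, float *partial, int n)
    {
        extern __shared__ float tile[];        // one float per thread
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        tile[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                       // barrier: all loads are visible

        // Tree reduction within the block.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                tile[tid] += tile[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            partial[blockIdx.x] = tile[0];     // one partial sum per block
    }

It would be launched with the shared-memory size as the third launch parameter, e.g. block_sum<<<blocks, threads, threads * sizeof(float)>>>(in, partial, n), and the per-block partial sums reduced in a second pass or on the host.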

I've got an Nvidia GPU, how can I code on it?

I've never really been into GPUs, not being a gamer, but I'm aware of their parallel ability and wondered how I could get started programming on one. I recall (from somewhere) that there is a CUDA C-style programming language. What IDE do I use, and is it relatively simple to execute code?
There are quick-start guides for getting the dev drivers and libraries set up on different platforms (Windows/Mac/Linux) here; there is also a link to the CUDA C Programming Guide.
http://developer.nvidia.com/object/nsight.html
All the CUDA stuff we do (fluid sims, particle sims, etc.) is done on Linux, though, essentially with emacs and gcc.
Some suggestions:
(1) Download the CUDA SDK from Nvidia (http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html). It has an extensive set of application examples that have been previously developed, tested, and commented. Some useful examples to start with are matrixMul, histogram, and convolutionSeparable. For more complex, well-documented code, see the nbody example.
(2) If you are very good at C++ programming, then the Thrust C++ library for the GPU is another good place to start. It has extensive STL-like support for doing operations on the GPU, and the overall programming effort is much smaller for standard algorithms.
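For a flavour of how STL-like Thrust is, a complete program that sorts and sums a small vector on the GPU can be as short as the sketch below (Thrust ships as headers with the toolkit, so it builds with plain nvcc):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <cstdio>

    int main()
    {
        // An STL-style container whose storage lives in GPU memory.
        thrust::device_vector<int> d(4);
        d[0] = 3; d[1] = 1; d[2] = 4; d[3] = 1;

        thrust::sort(d.begin(), d.end());               // runs on the GPU
        int sum = thrust::reduce(d.begin(), d.end());   // also runs on the GPU

        printf("sum = %d\n", sum);                      // prints 9
        return 0;
    }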
(3) Eclipse with the CUDA plugin is a good IDE to work with initially.
On Windows, Visual Studio. On Linux, Eclipse, Code::Blocks, and others, depending on which you feel more comfortable with.
The IDE, though, is the last thing. There are steps preceding it (installing the appropriate display driver and toolkit, running the SDK samples). The manuals/links provided above are really helpful. There is also an Nvidia forum for CUDA development and many getting-started guides.

Best way of using CUDA

There are several ways of using CUDA:
1. auto-parallelizing tools such as PGI Workstation;
2. wrappers such as Thrust (in STL style);
3. the NVIDIA GPU SDK (runtime/driver API).
Which one is better for performance, learning curve, or other factors?
Any suggestions?
Performance rankings will likely be 3, 2, 1.
Learning curve (shallowest first) is (1+2), then 3.
If you become a CUDA expert, it will be next to impossible to beat the performance of hand-rolled code written against the GPU SDK, using all the tricks in the book, because of the control it gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and has been shown on several problems to reach 90-95%+ efficiency compared with hand-rolled CUDA. The reductions, scans, and many cool iterators it has are useful for a wide class of problems too.
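As a small example of what those scans and fancy iterators buy you, here is a sketch that computes an inclusive prefix sum of squares on the device without ever materializing the input array:

    #include <thrust/device_vector.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/scan.h>

    struct square_op
    {
        __host__ __device__ int operator()(int x) const { return x * x; }
    };

    int main()
    {
        const int n = 10;
        thrust::counting_iterator<int> first(1);   // generates 1, 2, 3, ... on the fly
        thrust::device_vector<int> out(n);

        thrust::inclusive_scan(
            thrust::make_transform_iterator(first, square_op()),       // 1, 4, 9, ...
            thrust::make_transform_iterator(first + n, square_op()),
            out.begin());

        // out now holds 1, 5, 14, 30, ...; the squared input never existed in memory.
        return 0;
    }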
Auto-parallelizing tools tend to not do quite as good a job with the different memory types as karlphillip mentioned.
My preferred workflow is using Thrust to write as much as I can and then using the GPU SDK for the rest. This is largely a factor of not trading away too much performance to reduce development time and increase maintainability.
Go with the traditional CUDA SDK, for both performance and smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application, there are great articles about it on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
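To see why those memory types matter so much, compare a naive matrix transpose with a tiled one that stages data through shared memory; the tiled version keeps both the global-memory reads and writes coalesced and is typically several times faster (a sketch, not tuned for any particular GPU):

    #define TILE 16

    // Naive transpose: the reads are coalesced, but the writes are strided
    // through global memory, so it is usually much slower.
    __global__ void transpose_naive(float *out, const float *in, int width, int height)
    {
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            out[x * height + y] = in[y * width + x];
    }

    // Tiled transpose: stage a TILE x TILE block through shared memory so that
    // both the global reads and the global writes are coalesced.
    __global__ void transpose_tiled(float *out, const float *in, int width, int height)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;     // transposed block coordinates
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }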
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK; I personally, as well as many others, have run into problems and bugs in them that are hard to fix. (There is also no documentation for this "library", and I've heard that NVIDIA does not support it at all.)
Instead, as you mentioned, use one of the two provided APIs. In particular, I recommend the Runtime API, as it is a higher-level API, so you don't have to worry quite as much about the low-level implementation details as you do with the Driver API.
Both APIs are fully documented in the CUDA Programming Guide and CUDA Reference Guide, both of which are updated and provided with each CUDA release.
It depends on what you want to do on the GPU. If your algorithm would highly benefit from the things Thrust can offer, like reduction or prefix sum, then Thrust is definitely worth a try, and I bet you can't write the code faster yourself in pure CUDA C.
However, if you're porting already parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I have already had successful projects with a good speedup going this route, and the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extent, but as far as I know each Thrust call launches its own kernel; if you want to have everything in one big fat kernel (taking overly frequent kernel launches out of the equation), you have to use plain CUDA C with the SDK.
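Combining them does work in one direction, though: you can hand a thrust::device_vector's storage to your own kernel with thrust::raw_pointer_cast and keep using Thrust algorithms on it afterwards. A rough sketch:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>

    // A hand-written kernel operating on the same data as the Thrust container.
    __global__ void scale(float *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    void example(thrust::device_vector<float> &d)
    {
        int n = (int)d.size();

        // Hand Thrust's storage to a plain CUDA kernel...
        float *raw = thrust::raw_pointer_cast(&d[0]);
        scale<<<(n + 255) / 256, 256>>>(raw, n, 2.0f);

        // ...and keep using Thrust algorithms on the same vector afterwards.
        thrust::sort(d.begin(), d.end());
    }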
I find pure CUDA C actually easier to learn, as it gives you quite a good understanding of what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I have never used auto-parallelizing tools such as PGI Workstation, but I wouldn't advise adding even more "magic" into the equation.

OpenCL examples with benchmarks

I'm looking for some introductory examples of OpenCL that illustrate the types of applications that can experience large (e.g., 50x-1000x) increases in speed. CUDA has lots of nice examples, but I haven't found the same thing for OpenCL.
A nice example might be global optimization of complex functions via particle swarms, simulated annealing, evolutionary algorithms, ant colony optimization, etc.
The algorithms you are describing are neither simple nor introductory from the perspective of GPU programming. The reason CUDA has examples in these areas is that it has been around long enough for people to have developed these examples. There is currently no publicly available version of OpenCL that runs on GPUs. Both ATI and NVIDIA are offering beta versions of their OpenCL drivers, but ATI's supports only CPU computation and NVIDIA's requires signing an NDA to get. Simply put, OpenCL has not been around long enough for comprehensive examples like these to have been developed and demonstrated.
That said, gaining access to NVIDIA's OpenCL drivers is not difficult. You can find out how to do so on their forums here. I assume that the OpenCL distribution contains some sample programs to help you get started.
This also means that it's an excellent opportunity for you to develop some of these benchmarks and post your results. Then people will refer to your work rather than you referring to their work. I wouldn't expect too many surprises though. OpenCL performance should be roughly on par with CUDA performance once it becomes widely available and supported.
There are some great examples in the SDK from Nvidia:
http://developer.nvidia.com/object/get-opencl.html
Our team has been working on OpenCL algorithms and acceleration and we would like to suggest the article
http://www.cmsoft.com.br/index.php?view=article&catid=1:latest-news&id=247:opencl-simulated-annealing
as a sample implementation of the Simulated Annealing algorithm for minimization.
You could try the following two books:
Programming Massively Parallel Processors: A Hands-on Approach (NVIDIA), chapters 1 and 2
The OpenCL Programming Book: Parallel Programming for Multi-Core CPU and GPU (history components)
Both go into detail explaining why the development went the way it did and where the true benefits can be found.
Not sure about benchmarking, though; I haven't had any luck there myself either.