CUDA math API: difference between functions and intrinsics

According to the CUDA math API, many mathematical functions, like sine and cosine, are implemented both in software (functions) and in hardware (intrinsics). These intrinsics probably use the Special Function Units of the GPU, so what is the point of the software implementation? Isn't that slower than the hardware implementation?

The better question to ask is "what is the point of the intrinsics?".
The answer lies in Appendix D of the programming guide. The intrinsics for the transcendental, trigonometric, and special functions are faster, but have more domain restrictions and generally lower accuracy than their software counterparts. For the primary purpose of the hardware (i.e., graphics), having fast approximate functions for sin, cos, square root, reciprocal, etc. allows for improved shader performance when ultimate mathematical accuracy is not critical. For some compute tasks, the less accurate versions are also fine. For other applications, the intrinsics may not be sufficient.
Having both allows the informed programmer to have a choice: speed or accuracy.
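As a minimal illustration (the kernel and variable names here are purely hypothetical), the two variants can be called side by side as below; nvcc's -use_fast_math option will also substitute the intrinsic versions automatically.

    __global__ void sines(const float *x, float *accurate, float *fast, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            accurate[i] = sinf(x[i]);    // software implementation: slower, full single-precision accuracy
            fast[i]     = __sinf(x[i]);  // SFU intrinsic: faster, lower accuracy, restricted argument range
        }
    }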

Related

Special CUDA Double Precision trig functions for SFU

I was wondering how I would go about using __cos(x) (and, respectively, __sin(x)) in kernel code with CUDA. I looked up in the CUDA manual that there is such a device function, but when I use it, the compiler just says that I cannot call a host function in device code.
However, I found that there are two sister functions, cosf(x) and __cosf(x), the latter of which runs on the SFU and is overall much faster than the original cosf(x) function. The compiler does not complain about the __cosf(x) function, of course.
Is there a library I'm missing? Am I mistaken about this trig function?
As the SFU only supports certain single-precision operations, there are no double-precision __cos() and __sin() device functions. There are single-precision __cosf() and __sinf() device functions, as well as other functions detailed in table C-4 of the CUDA 4.2 Programming Manual.
I assume you are looking for faster alternatives to the double-precision versions of the standard math functions sin() and cos()? If sine and cosine of the same argument are needed, sincos() should be used for a significant performance boost. If the argument of sine or cosine is multiplied by π, you would want to use sinpi(), cospi(), or sincospi() instead, for even more performance. For example, sincospi() is very useful when implementing the Box-Muller algorithm for generating normally distributed random numbers. Also, check out the CUDA 5.0 preview for best possible performance (note that the preview provides alpha-release quality).
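For example, a sketch of the Box-Muller step using sincospi() might look like the following (assuming a toolkit that provides sincospi(); u1 and u2 are uniform random numbers in (0,1] obtained elsewhere, e.g. from cuRAND):

    __device__ void box_muller(double u1, double u2, double *z0, double *z1)
    {
        double r = sqrt(-2.0 * log(u1));   // radial part
        double s, c;
        sincospi(2.0 * u2, &s, &c);        // sin(2*pi*u2) and cos(2*pi*u2) in one call
        *z0 = r * c;
        *z1 = r * s;
    }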

GPU-accelerated hardware simulation?

I am investigating if GPGPUs could be used for accelerating simulation of hardware.
My reasoning is this: As hardware by nature is very parallel, why simulate on highly sequential CPUs?
GPUs would be excellent for this, if not for their restrictive style of programming: You have a single kernel running, etc.
I have little experience with GPGPU-programming, but is it possible to use events or queues in OpenCL / CUDA?
Edit: By hardware simulation I don't mean emulation, but bit-accurate behavioral simulation (as in VHDL behavioral simulation).
I am not aware of any approaches regarding VHDL simulation on GPUs (or a general scheme to map discrete-event simulations), but there are certain application areas where discrete-event simulation is typically applied and which can be simulated efficiently on GPUs (e.g. transportation networks, as in this paper or this one, or stochastic simulation of chemical systems, as done in this paper).
Is it possible to re-formulate the problem in a way that makes a discrete time-stepped simulator feasible? In this case, simulation on a GPU should be much simpler (and still faster, even if it seems wasteful because the time steps have to be sufficiently small - see this paper on the GPU-based simulation of cellular automata, for example).
Note, however, that this is still most likely a non-trivial (research) problem, and the reason why there is no general scheme (yet) is what you already assumed: implementing an event queue on a GPU is difficult, and most simulation approaches on GPUs gain speed-up due to clever memory layout and application-specific optimizations and problem modifications.
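To make the time-stepped idea concrete, here is a minimal sketch (with a made-up toy update rule): the host loop acts as the global clock, and one kernel launch advances every element by one step.

    __global__ void step(const int *state, int *next, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1)
            next[i] = (state[i - 1] + state[i] + state[i + 1]) > 1 ? 1 : 0;  // toy neighbourhood rule
        else if (i == 0 || i == n - 1)
            next[i] = state[i];                                              // keep boundary cells fixed
    }

    // Host side: ping-pong between two device buffers, one launch per time step.
    // for (int t = 0; t < numSteps; ++t) {
    //     step<<<blocks, threads>>>(d_a, d_b, n);
    //     std::swap(d_a, d_b);
    // }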
This is outside my area of expertise, but it seems that while the following paper discusses gate-level simulation rather than behavioral simulation, it may contain some useful ideas:
Debapriya Chatterjee, Andrew Deorio, Valeria Bertacco.
Gate-Level Simulation with GPU Computing
http://web.eecs.umich.edu/~valeria/research/publications/TODAES0611.pdf

gpgpu on cuda and opengl

I have been working with CUDA recently. I am just wondering if there is any performance difference between CUDA and OpenGL in terms of general-purpose computing. I am currently working on a GTX 580.
The correct answer is probably "it depends".
In pure floating-point or integer performance terms it shouldn't matter much whether you use GLSL or something more "modern". But CUDA and OpenCL expose hardware features such as pointers, shared memory, communication and synchronization between threads, and the grid/block virtualization of compute domains, which are pretty crucial to achieving good performance on compute workloads. There are lots of algorithms that would be either difficult or impossible to implement in a shading language but can be implemented efficiently in literally a handful of lines of OpenCL or CUDA.
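As one illustration of those features (the names here are hypothetical), a per-block sum reduction uses shared memory and __syncthreads() in a way that has no natural counterpart in a classic pixel/vertex shader:

    __global__ void block_sum(const float *in, float *blockSums, int n)
    {
        extern __shared__ float cache[];        // per-block shared memory
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        cache[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                        // all loads done before reducing

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride)
                cache[tid] += cache[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            blockSums[blockIdx.x] = cache[0];   // one partial sum per block
    }

    // launched as: block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);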

What is CUDA like? What is it for? What are the benefits? And how to start?

I am interested in developing with some new technology and I was thinking of trying out CUDA. Now... their documentation is too technical and doesn't provide the answers I'm looking for. Also, I'd like to hear those answers from people who've already had some experience with CUDA.
Basically my questions are those in the title:
What exactly IS CUDA? (is it a framework? Or an API? What?)
What is it for? (is there something more than just programming to the GPU?)
What is it like?
What are the benefits of programming against CUDA instead of programming to the CPU?
What is a good place to start programming with CUDA?
CUDA brings together several things:
Massively parallel hardware designed to run generic (non-graphic) code, with appropriate drivers for doing so.
A programming language based on C for programming said hardware, and an assembly language that other programming languages can use as a target.
A software development kit that includes libraries, various debugging, profiling and compiling tools, and bindings that let CPU-side programming languages invoke GPU-side code.
The point of CUDA is to write code that can run on compatible massively parallel SIMD architectures: this includes several GPU types as well as non-GPU hardware such as nVidia Tesla. Massively parallel hardware can run a significantly larger number of operations per second than the CPU, at a fairly similar financial cost, yielding performance improvements of 50× or more in situations that allow it.
One of the benefits of CUDA over the earlier methods is that a general-purpose language is available, instead of having to use pixel and vertex shaders to emulate general-purpose computers. That language is based on C with a few additional keywords and concepts, which makes it fairly easy for non-GPU programmers to pick up.
It's also a sign that nVidia is willing to support general-purpose parallelization on their hardware: it now sounds less like "hacking around with the GPU" and more like "using a vendor-supported technology", and that makes its adoption easier in the presence of non-technical stakeholders.
To start using CUDA, download the SDK, read the manual (seriously, it's not that complicated if you already know C) and buy CUDA-compatible hardware (you can use the emulator at first, but since performance is the ultimate point of this, it's better if you can actually try your code out).
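For a first program, something along these lines is enough (a sketch assuming a reasonably recent toolkit and GPU, since cudaMallocManaged needs CUDA 6 or later; compile with nvcc):

    #include <cstdio>

    __global__ void add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b, *c;
        cudaMallocManaged(&a, n * sizeof(float));   // managed memory keeps the example short;
        cudaMallocManaged(&b, n * sizeof(float));   // explicit cudaMalloc/cudaMemcpy works too
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        add<<<blocks, threads>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f (expected 3.0)\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }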
(Disclaimer: I have only used CUDA for a semester project in 2008, so things might have changed since then.) CUDA is a development toolchain for creating programs that can run on nVidia GPUs, as well as an API for controlling such programs from the CPU.
The benefit of GPU programming vs. CPU programming is that for some highly parallelizable problems, you can gain massive speedups (about two orders of magnitude). However, many problems are difficult or impossible to formulate in a manner that makes them suitable for parallelization.
In one sense, CUDA is fairly straightforward, because you can use regular C to create the programs. However, in order to achieve good performance, a lot of things must be taken into account, including many low-level details of the Tesla GPU architecture.

Have you successfully used a GPGPU? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I am interested to know whether anyone has written an application that takes advantage of a GPGPU by using, for example, nVidia CUDA. If so, what issues did you find and what performance gains did you achieve compared with a standard CPU?
I have been doing GPGPU development with ATI's Stream SDK instead of CUDA.
What kind of performance gain you will get depends on a lot of factors, but the most important is the numeric intensity. (That is, the ratio of compute operations to memory references.)
A BLAS level-1 or BLAS level-2 function like adding two vectors only does 1 math operation for each 3 memory references, so the NI is (1/3). This will always run slower with CAL or CUDA than just doing it on the CPU. The main reason is the time it takes to transfer the data from the CPU to the GPU and back.
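To make that concrete, the vector-add kernel below does one floating-point add against three global memory accesses per element, and the O(N) host-to-device and device-to-host copies around the launch cost even more than the kernel itself (names are illustrative):

    __global__ void vec_add(const float *x, const float *y, float *z, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            z[i] = x[i] + y[i];   // 2 loads + 1 store, 1 add: NI = 1/3
    }

    // host side (sketch): three O(N) PCIe transfers bracket the O(N) arithmetic
    //   cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    //   cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);
    //   vec_add<<<blocks, 256>>>(d_x, d_y, d_z, n);
    //   cudaMemcpy(h_z, d_z, bytes, cudaMemcpyDeviceToHost);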
For a function like FFT, there are O(N log N) computations and O(N) memory references, so the NI is O(log N). If N is very large, say 1,000,000, it will likely be faster to do it on the GPU; if N is small, say 1,000, it will almost certainly be slower.
For a BLAS level-3 or LAPACK function like LU decomposition of a matrix, or finding its eigenvalues, there are O(N^3) computations and O(N^2) memory references, so the NI is O(N). For very small arrays, say N is a few score, this will still be faster to do on the CPU, but as N increases, the algorithm very quickly goes from memory-bound to compute-bound and the performance increase on the GPU rises very quickly.
Anything involving complex arithmetic has more computations than scalar arithmetic, which usually doubles the NI and increases GPU performance.
Here is the performance of CGEMM, complex single-precision matrix-matrix multiplication, on a Radeon 4870 (chart not reproduced; source: earthlink.net).
I have written trivial applications; it really helps if you can parallelize floating-point calculations.
I found the following course, co-taught by a University of Illinois Urbana-Champaign professor and an NVIDIA engineer, very useful when I was getting started: http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html (includes recordings of all lectures).
I have used CUDA for several image processing algorithms. These applications, of course, are very well suited for CUDA (or any GPU processing paradigm).
IMO, there are three typical stages when porting an algorithm to CUDA:
Initial Porting: Even with a very basic knowledge of CUDA, you can port simple algorithms within a few hours. If you are lucky, you gain a factor of 2 to 10 in performance.
Trivial Optimizations: This includes using textures for input data and padding of multi-dimensional arrays. If you are experienced, this can be done within a day and might give you another factor of 10 in performance. The resulting code is still readable.
Hardcore Optimizations: This includes copying data to shared memory to avoid global memory latency, turning the code inside out to reduce the number of used registers, etc. You can spend several weeks with this step, but the performance gain is not really worth it in most cases. After this step, your code will be so obfuscated that nobody understands it (including you).
This is very similar to optimizing code for CPUs. However, the response of a GPU to performance optimizations is even less predictable than for CPUs.
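One concrete instance of the "copying data to shared memory" step is the classic tiled matrix transpose, sketched below (assuming a square matrix whose side is a multiple of TILE); staging a tile in shared memory makes both the global read and the global write coalesced:

    #define TILE 32

    __global__ void transpose_tiled(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids shared-memory bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;               // origin of the transposed tile
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
    }

    // launched with dim3 block(TILE, TILE) and dim3 grid(n / TILE, n / TILE)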
I have been using GPGPU for motion detection (originally using Cg and now CUDA) and stabilization (using CUDA) with image processing.
I've been getting about a 10-20X speedup in these situations.
From what I've read, this is fairly typical for data-parallel algorithms.
While I haven't got any practical experiences with CUDA yet, I have been studying the subject and found a number of papers which document positive results using GPGPU APIs (they all include CUDA).
This paper describes how database joins can be parallelized by creating a number of parallel primitives (map, scatter, gather, etc.) which can be combined into an efficient algorithm.
In this paper, a parallel implementation of the AES encryption standard is created with speed comparable to discrete encryption hardware.
Finally, this paper analyses how well CUDA applies to a number of applications such as structured and unstructured grids, combinational logic, dynamic programming and data mining.
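The primitives mentioned for the join paper are themselves tiny kernels; a gather, for instance, is essentially one line per thread (the names here are illustrative):

    __global__ void gather(const int *in, const int *index, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[index[i]];   // scatter and map look much the same
    }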
I've implemented a Monte Carlo calculation in CUDA for some financial use. The optimised CUDA code is about 500x faster than a "could have tried harder, but not really" multi-threaded CPU implementation. (Comparing a GeForce 8800GT to a Q6600 here.) It is well known that Monte Carlo problems are embarrassingly parallel, though.
The major issue encountered was the loss of precision due to the G8x and G9x chips' limitation to IEEE single-precision floating-point numbers. With the release of the GT200 chips this can be mitigated to some extent by using the double-precision unit, at the cost of some performance. I haven't tried it out yet.
Also, since CUDA is a C extension, integrating it into another application can be non-trivial.
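A toy version of that embarrassingly parallel structure (estimating pi rather than pricing anything, with cuRAND supplying per-thread random streams; the names are illustrative) looks like this:

    #include <curand_kernel.h>

    __global__ void mc_pi(unsigned long long seed, int trialsPerThread, unsigned int *hits)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curandState state;
        curand_init(seed, id, 0, &state);          // independent stream per thread

        unsigned int count = 0;
        for (int t = 0; t < trialsPerThread; ++t) {
            float x = curand_uniform(&state);
            float y = curand_uniform(&state);
            if (x * x + y * y <= 1.0f) ++count;    // point falls inside the quarter circle
        }
        atomicAdd(hits, count);                    // crude final reduction; fine for a sketch
    }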
I implemented a genetic algorithm on the GPU and got speedups of around 7x. More gains are possible with a higher numeric intensity, as someone else pointed out. So yes, the gains are there, if the application is right.
I wrote a complex-valued matrix multiplication kernel that beat the cuBLAS implementation by about 30% for the application I was using it for, and a sort of vector outer product function that ran several orders of magnitude faster than a multiply-trace solution for the rest of the problem.
It was a final year project. It took me a full year.
http://www.maths.tcd.ie/~oconbhup/Maths_Project.pdf
I have implemented Cholesky factorization for solving large linear equations on the GPU using the ATI Stream SDK. My observations were:
Got a performance speedup of up to 10 times.
Working on the same problem to optimize it further, by scaling it to multiple GPUs.
Yes. I have implemented the Nonlinear Anisotropic Diffusion Filter using the CUDA api.
It is fairly easy, since it's a filter that must be run in parallel given an input image. I haven't encountered many difficulties on this, since it just required a simple kernel. The speedup was at about 300x. This was my final project on CS. The project can be found here (it's written in Portuguese though).
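For reference, a kernel of the kind described, one thread per pixel, could look roughly like this sketch of a single explicit Perona-Malik diffusion step (a reconstruction for illustration, not the project's actual code; K is the edge threshold and lambda the step size):

    __global__ void diffuse_step(const float *in, float *out, int w, int h,
                                 float K, float lambda)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        int i = y * w + x;
        if (x == 0 || y == 0 || x == w - 1 || y == h - 1) { out[i] = in[i]; return; }

        float c  = in[i];
        float dN = in[i - w] - c, dS = in[i + w] - c;
        float dW = in[i - 1] - c, dE = in[i + 1] - c;

        // conduction coefficients: near 1 in flat regions, small across strong edges
        float gN = 1.0f / (1.0f + (dN / K) * (dN / K));
        float gS = 1.0f / (1.0f + (dS / K) * (dS / K));
        float gW = 1.0f / (1.0f + (dW / K) * (dW / K));
        float gE = 1.0f / (1.0f + (dE / K) * (dE / K));

        out[i] = c + lambda * (gN * dN + gS * dS + gW * dW + gE * dE);
    }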
I have tried writing the Mumford & Shah segmentation algorithm too, but that has been a pain to write, since CUDA is still in the beginning and so lots of strange things happen. I have even seen a performance improvement by adding an if (false) {} in the code O_O.
The results for this segmentation algorithm weren't good. I had a performance loss of 20x compared to a CPU approach (however, since it's a CPU, a different approach that yielded the same results could be taken). It's still a work in progress, but unfortunately I left the lab I was working in, so maybe someday I might finish it.