about floating point operation - cuda

Recently, I have been writing a program (an FDTD simulation) using the CUDA development environment. The OS is Windows Server 2008, the graphics card is a Tesla C2070, and the compiler is VS2010. The program performs calculations using single- and double-precision floating point.
I have been reading the CUDA Programming Guide, versions 3.2 and 4.0. The appendix says that sin() and cos() have a maximum error of 2 ULP. My original CPU program produces results that differ from the CUDA version.
I want the results to match exactly. Is that possible?

To quote Goldberg (a paper that every Computer Scientist, Computational Scientist, and possibly even every scientist who programs, should read):
Due to roundoff errors, the associative laws of algebra do not
necessarily hold for floating-point numbers.
This means that when you change the order of operations—even when using ostensibly associative arithmetic—you are likely to get slightly different answers.
Parallelism, by definition, results in a different ordering of operations relative to serial arithmetic. "Embarrassingly parallel" computations, that is, computations where each output element is computed independently of all others, sometimes do not have to worry about this. But collective operations, like reductions or scans, and spatial neighborhood computations, such as stencils (as in FDTD), do experience this effect.
In practice, compiling the same code with a different compiler, or even with different compiler options, can change the result of a floating-point computation, with or without parallelism.
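A minimal host-side demonstration of the non-associativity (the values are chosen only to make the rounding visible; this is a sketch, not code from the question):

    #include <cstdio>

    int main() {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        float left  = (a + b) + c;   // (a + b) is exactly 0, so the result is 1
        float right = a + (b + c);   // (b + c) rounds back to -1.0e8f, so the result is 0
        printf("(a+b)+c = %g, a+(b+c) = %g\n", left, right);
        return 0;
    }

A parallel reduction on the GPU adds the same values in a different order than a serial CPU loop does, so discrepancies of this kind are expected even when both results are correct to within rounding.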

Why are CUDA indices 2D? [duplicate]

I have basically the same question as posed in this discussion. In particular I want to refer to this final response:
I think there are two different questions mixed together in this thread:
1. Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is "absolutely" for all the reasons you and others have described. If the data or calculation has spatial locality, then so should the assignment of work to threads in a warp.
2. Is there a performance benefit to using CUDA's multidimensional grids to do this work assignment? In this case, I don't think so, since you can do the index calculation trivially yourself at the top of the kernel. This burns a few arithmetic instructions, but that should be negligible compared to the kernel launch overhead.
This is why I think the multidimensional grids are intended as a programmer convenience rather than a way to improve performance. You do absolutely need to think about each warp's memory access patterns, though.
I want to know whether this situation still holds today, and why there is a need for a multidimensional "outer" grid at all.
What I'm trying to understand is whether there is a significant purpose to this (e.g. an actual benefit from spatial locality) or whether it is only there for convenience (e.g. in an image-processing context, so that CUDA is aware of the x/y "patch" that a particular block is processing and can report it to the CUDA Visual Profiler, or something along those lines).
A third option is that it is nothing more than a holdover from earlier versions of CUDA, where it was a workaround for hardware indexing limits.
There is definitely a benefit to using a multi-dimensional grid. The different entries (tid, ctaid) are read-only variables visible as special registers. See the PTX ISA:
PTX includes a number of predefined, read-only variables, which are visible as special registers and accessed through mov or cvt instructions.
The special registers are:
%tid
%ntid
%laneid
%warpid
%nwarpid
%ctaid
%nctaid
If some of this data can be used without further processing, you not only save arithmetic instructions (potentially at each indexing step of multi-dimensional data), but more importantly you save registers, which are a very scarce resource on any hardware.
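To make the trade-off concrete, here is a sketch of the two indexing styles for a 2D problem (the kernel and parameter names are only placeholders):

    // Style 1: read the 2D coordinates directly from the special registers.
    __global__ void scale2d(float* img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // from %ctaid.x, %ntid.x, %tid.x
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // from %ctaid.y, %ntid.y, %tid.y
        if (x < width && y < height)
            img[y * width + x] *= 2.0f;
    }

    // Style 2: launch a 1D grid and derive the 2D coordinates by hand.
    __global__ void scale1d(float* img, int width, int height)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < width * height) {
            int y = i / width;                           // extra integer arithmetic,
            int x = i - y * width;                       // and extra registers to hold it
            img[y * width + x] *= 2.0f;
        }
    }

Both kernels produce the same result; the first simply gets its coordinates from the predefined registers listed above instead of recomputing them.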

Large matrix multiplication on gpu

I need to implement a matrix multiplication on the GPU with CUDA for large matrices. The size of each matrix alone is bigger than the GPU memory, so I think I need an algorithm to do this efficiently. I searched the internet but couldn't find anything. Can anyone give me the name of, or a link to, such an algorithm?
There isn't really a formal algorithm for this; in general, these sorts of linear algebra operations where the whole problem isn't stored in memory simultaneously are referred to as "out of core" operations.
To solve it, you don't need a particularly elaborate algorithm, just the CUBLAS library and a pencil and paper. For example, if you partition A into two row blocks and B into two column blocks, you can decompose the matrix product like this:

    A * B = [ A1 ] * [ B1  B2 ] = [ A1*B1  A1*B2 ]
            [ A2 ]                [ A2*B1  A2*B2 ]

which gives you four independent sub-matrix multiplication operations. These can be calculated using four calls to CUBLAS gemm with very straightforward host code. You can extend the idea to as many sub-matrices as are required to match the problem size and your GPU capacity. The same principle can also be used to implement matrix multiplication problems on multiple GPUs (see this question for an example).
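Here is a minimal sketch of how those sub-multiplications might be driven from host code with the CUBLAS v2 API. It assumes column-major host storage and tile sizes that divide M and N evenly, and it omits all error checking; the function and variable names are just placeholders:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstddef>

    // C (M x N) = A (M x K) * B (K x N), all column-major in host memory,
    // computed one (tileM x tileN) tile of C at a time.
    void out_of_core_gemm(const float* A, const float* B, float* C,
                          int M, int N, int K, int tileM, int tileN)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, sizeof(float) * tileM * K);
        cudaMalloc((void**)&dB, sizeof(float) * K * tileN);
        cudaMalloc((void**)&dC, sizeof(float) * tileM * tileN);

        const float alpha = 1.0f, beta = 0.0f;

        for (int j0 = 0; j0 < N; j0 += tileN) {
            // Column block of B: K x tileN, starting at column j0.
            cublasSetMatrix(K, tileN, sizeof(float), B + (size_t)j0 * K, K, dB, K);
            for (int i0 = 0; i0 < M; i0 += tileM) {
                // Row block of A: tileM x K, starting at row i0.
                cublasSetMatrix(tileM, K, sizeof(float), A + i0, M, dA, tileM);
                // One independent sub-multiplication: dC = dA * dB.
                cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                            tileM, tileN, K, &alpha,
                            dA, tileM, dB, K, &beta, dC, tileM);
                // Copy the finished tile of C back to the host.
                cublasGetMatrix(tileM, tileN, sizeof(float), dC, tileM,
                                C + (size_t)j0 * M + i0, M);
            }
        }

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cublasDestroy(handle);
    }

If even a single row block of A or column block of B is too large for the device, the same decomposition can be applied along K as well, accumulating partial products into the C tile with beta = 1.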
Alternatively, you can find a working implementation of this precise idea in the Harvard-developed SciGPU-GEMM codebase and in the HPL-CUDA linpack implementation (disclaimer: I am affiliated with the latter codebase).

Fermi architecture possible solution to my comparative study?

I am working on a comparative study in which I have to make a comparison of the serial and parallel versions of an algorithm (NSGA-II algorithm to be precise download link here). NSGA-II is a heuristic optimization method and hence depends on the initial random population generated. If the initial populations generated using the CPU and the GPU are different, then I can not make an impartial speedup study.
I possess an NVIDIA Tesla C1060 card, which has compute capability 1.3. As per this answer and this NVIDIA document, we can't expect an sm_13 device to always yield IEEE-754 compliant float (single precision) values. In other words, on my current device I can't conduct an impartial speedup study of the CUDA program against its serial counterpart.
My question is: Would switching to Fermi architecture solve the problem?
Floating-point operations will yield different results on different architectures, regardless of whether they support IEEE754 or not, since floating-point is not associative. Even switching compiler on x86 will typically give different results. This whitepaper gives some excellent explanations.
Having said that, your real issue is that you have a data-dependent algorithm where the operations depend on the random numbers you generate. So if you generate the same numbers on the CPU and the GPU, then both runs will follow the same paths. Consider using cuRAND, which can generate the same numbers on both the CPU and the GPU.
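A sketch of that approach with the cuRAND host API is below; the idea is that the same generator type and seed should produce the same sequence whether generation runs on the device or on the host (treat the specific generator choice here as an assumption to verify against the cuRAND documentation). Link against the cuRAND library (e.g. -lcurand).

    #include <curand.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t n = 8;
        const unsigned long long seed = 1234ULL;

        // GPU generation: the numbers are produced on the device.
        curandGenerator_t gpuGen;
        curandCreateGenerator(&gpuGen, CURAND_RNG_PSEUDO_XORWOW);
        curandSetPseudoRandomGeneratorSeed(gpuGen, seed);
        float* dOut;
        cudaMalloc((void**)&dOut, n * sizeof(float));
        curandGenerateUniform(gpuGen, dOut, n);
        float fromGpu[n];
        cudaMemcpy(fromGpu, dOut, n * sizeof(float), cudaMemcpyDeviceToHost);

        // CPU generation: same generator type and seed, produced on the host.
        curandGenerator_t cpuGen;
        curandCreateGeneratorHost(&cpuGen, CURAND_RNG_PSEUDO_XORWOW);
        curandSetPseudoRandomGeneratorSeed(cpuGen, seed);
        float fromCpu[n];
        curandGenerateUniform(cpuGen, fromCpu, n);

        for (size_t i = 0; i < n; ++i)
            printf("%zu: gpu=%f cpu=%f\n", i, fromGpu[i], fromCpu[i]);

        curandDestroyGenerator(gpuGen);
        curandDestroyGenerator(cpuGen);
        cudaFree(dOut);
        return 0;
    }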

Terminology for a "complete" programming language?

The full definition of "Turing Completeness" requires infinite memory.
Is there a better term than Turing Complete for a programming language and implementation that seems useably complete, except for being limited by finite (say 100 word or 16-bit or 32-bit, etc.) address space?
I guess you can bring the limited memory into the definition. Something like:
A programming language for a given architecture (!) is Limitedly Turing Complete, if for every Turing Machine there exists a program that either
a) simulates the Turing Machine and returns the same result (iff the Turing Machine returns) or
b) at some point uses at least one available limited resource (e.g. memory) completely and returns an arbitrary result.
The question is whether this intuitive definition really helps, or whether it is better to assume that your architecture has unlimited memory (even though it is actually finite). Note that you don't even have to try hard to satisfy Limited Turing Completeness (as defined above): if you simply go into an infinite loop that mallocs one byte each time, you have found your program for all Turing Machines.
The problem seems to be that you cannot pin down implementation-specific properties. For instance, if you have 500K of RAM, you may be able to express a program that computes 1+1, or maybe you can't; who knows.
I'd argue that languages like Haskell and Brainfuck (yes, I'm serious) are actually Turing Complete because they abstract resources away. While languages like C++ are only Limitedly Turing Complete, because at some point the address-space of pointers is exhausted and it is not possible to address any more data (e.g. sort a list of 2^2^2^2^100 items).
You could say that an implementation requires infinite memory to be truly Turing complete, but languages themselves have no concept of a memory limit. You can make a compiler for a million-bit machine or a 4-bit machine without changing the language.

Have you successfully used a GPGPU? [closed]

I am interested to know whether anyone has written an application that takes advantage of a GPGPU by using, for example, nVidia CUDA. If so, what issues did you find and what performance gains did you achieve compared with a standard CPU?
I have been doing GPGPU development with ATI's Stream SDK instead of CUDA.
What kind of performance gain you will get depends on a lot of factors, but the most important is the numeric intensity. (That is, the ratio of compute operations to memory references.)
A BLAS level-1 or level-2 function like adding two vectors does only 1 math operation for every 3 memory references, so the NI is (1/3). This will always run slower with CAL or CUDA than just doing it on the CPU. The main reason is the time it takes to transfer the data from the CPU to the GPU and back.
For a function like FFT, there are O(N log N) computations and O(N) memory references, so the NI is O(log N). If N is very large, say 1,000,000, it will likely be faster to do it on the GPU; if N is small, say 1,000, it will almost certainly be slower.
For a BLAS level-3 or LAPACK function like LU decomposition of a matrix, or finding its eigenvalues, there are O(N^3) computations and O(N^2) memory references, so the NI is O(N). For very small arrays, say N is a few score, this will still be faster to do on the CPU, but as N increases, the algorithm very quickly goes from memory-bound to compute-bound and the performance increase on the GPU rises very quickly.
Anything involving complex arithmetic has more computations than scalar arithmetic, which usually doubles the NI and increases GPU performance.
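To make the level-1 case concrete, here is a sketch of a plain vector add written as a CUDA kernel (names are placeholders); the point is the ratio of arithmetic to memory traffic, not the kernel itself:

    __global__ void vec_add(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];   // 2 loads + 1 store for a single addition: NI = 1/3
    }

Whatever time the GPU saves on the additions is dwarfed by moving 3N floats between host and device, which is why low-NI operations lose to the CPU.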
[chart: performance of CGEMM, complex single-precision matrix-matrix multiplication, on a Radeon 4870; source: earthlink.net]
I have written trivial applications; it really helps if you can parallelize floating-point calculations.
I found the following course, co-taught by a University of Illinois Urbana-Champaign professor and an NVIDIA engineer, very useful when I was getting started: http://courses.ece.illinois.edu/ece498/al/Archive/Spring2007/Syllabus.html (includes recordings of all lectures).
I have used CUDA for several image processing algorithms. These applications, of course, are very well suited for CUDA (or any GPU processing paradigm).
IMO, there are three typical stages when porting an algorithm to CUDA:
Initial Porting: Even with a very basic knowledge of CUDA, you can port simple algorithms within a few hours. If you are lucky, you gain a factor of 2 to 10 in performance.
Trivial Optimizations: This includes using textures for input data and padding of multi-dimensional arrays. If you are experienced, this can be done within a day and might give you another factor of 10 in performance. The resulting code is still readable.
Hardcore Optimizations: This includes copying data to shared memory to avoid global memory latency, turning the code inside out to reduce the number of used registers, etc. You can spend several weeks with this step, but the performance gain is not really worth it in most cases. After this step, your code will be so obfuscated that nobody understands it (including you).
This is very similar to optimizing code for CPUs. However, the response of a GPU to performance optimizations is even less predictable than for CPUs.
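As a small illustration of the "copying data to shared memory" step mentioned under Hardcore Optimizations, here is a sketch of a 3-point smoothing filter; it is a hypothetical kernel, and it assumes n is a multiple of the block size:

    __global__ void smooth3(const float* in, float* out, int n)
    {
        extern __shared__ float tile[];               // blockDim.x + 2 floats
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;                    // +1 leaves room for the left halo

        tile[lid] = in[gid];                          // each thread stages one element
        if (threadIdx.x == 0)                         // first thread loads the left halo
            tile[0] = (gid > 0) ? in[gid - 1] : in[gid];
        if (threadIdx.x == blockDim.x - 1)            // last thread loads the right halo
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : in[gid];
        __syncthreads();

        // The three reads per output now hit shared memory instead of global memory.
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }

    // launch: smooth3<<<n / threads, threads, (threads + 2) * sizeof(float)>>>(d_in, d_out, n);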
I have been using GPGPU for motion detection (originally using Cg and now CUDA) and stabilization (using CUDA) in image processing.
I've been getting about a 10-20X speedup in these situations.
From what I've read, this is fairly typical for data-parallel algorithms.
While I haven't got any practical experience with CUDA yet, I have been studying the subject and found a number of papers which document positive results using GPGPU APIs (they all include CUDA).
This paper describes how database joins can be parallelized by creating a number of parallel primitives (map, scatter, gather, etc.) which can be combined into an efficient algorithm.
In this paper, a parallel implementation of the AES encryption standard is created with speed comparable to discrete encryption hardware.
Finally, this paper analyses how well CUDA applies to a number of applications such as structured and unstructured grids, combinational logic, dynamic programming and data mining.
I've implemented a Monte Carlo calculation in CUDA for some financial use. The optimised CUDA code is about 500x faster than a "could have tried harder, but not really" multi-threaded CPU implementation. (Comparing a GeForce 8800GT to a Q6600 here.) It is well known that Monte Carlo problems are embarrassingly parallel, though.
The major issue encountered was the loss of precision due to the G8x and G9x chips being limited to IEEE single-precision floating-point numbers. With the release of the GT200 chips this could be mitigated to some extent by using the double-precision unit, at the cost of some performance. I haven't tried it out yet.
Also, since CUDA is a C extension, integrating it into another application can be non-trivial.
I implemented a Genetic Algorithm on the GPU and got speedups of around 7x. More gains are possible with a higher numeric intensity, as someone else pointed out. So yes, the gains are there if the application is right.
I wrote a complex-valued matrix multiplication kernel that beat the cuBLAS implementation by about 30% for the application I was using it for, and a sort of vector outer product function that ran several orders of magnitude faster than a multiply-trace solution for the rest of the problem.
It was a final year project. It took me a full year.
http://www.maths.tcd.ie/~oconbhup/Maths_Project.pdf
I have implemented Cholesky factorization for solving large linear equations on the GPU using the ATI Stream SDK. My observations were:
Got a performance speedup of up to 10 times.
Working on the same problem to optimize it further, by scaling it to multiple GPUs.
Yes. I have implemented the Nonlinear Anisotropic Diffusion Filter using the CUDA API.
It is fairly easy, since it's a filter that must be run in parallel given an input image. I haven't encountered many difficulties with this, since it just required a simple kernel. The speedup was about 300x. This was my final CS project. The project can be found here (it's written in Portuguese, though).
I have tried writing the Mumford-Shah segmentation algorithm too, but that has been a pain to write, since CUDA is still in its early days and lots of strange things happen. I have even seen a performance improvement from adding an if (false){} to the code O_O.
The results for this segmentation algorithm weren't good. I had a performance loss of 20x compared to a CPU approach (however, since it's a CPU, a different approach that yielded the same results could be taken). It's still a work in progress, but unfortunately I left the lab I was working in, so maybe someday I might finish it.