I'm using an ARM7 device without any NEON floating-point hardware, so for my project I'm having to write the arithmetic in assembly. I already have a multiplier working; is there any way I could calculate inverses quickly?
Would the GCC software floating-point library work for you?
You'll find the assembler routines for floating-point division under libgcc/config/arm.
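If you want to roll your own using just your multiplier, the standard trick is Newton-Raphson iteration for the reciprocal: given an estimate x of 1/a, the refinement x = x*(2 - a*x) roughly doubles the number of correct bits per step and needs only multiplies and a subtraction. Here is a minimal fixed-point sketch in C (my own illustration, not the libgcc routine; the Q16.16 format, seed constants, and iteration count are assumptions you would tune for your device):

```
/* Sketch only: Newton-Raphson reciprocal built from multiplies and
   subtracts.  Q16.16 fixed point; assumes a > 0 and 1/a fits in Q16.16. */
#include <stdint.h>

typedef int32_t q16_16;                 /* 16 integer / 16 fraction bits */

static q16_16 qmul(q16_16 x, q16_16 y)  /* fixed-point multiply */
{
    return (q16_16)(((int64_t)x * y) >> 16);
}

q16_16 reciprocal(q16_16 a)
{
    /* Normalise a into [0.5, 1.0) so the linear seed below converges. */
    int shift = 0;
    while (a >= 0x00010000) { a >>= 1; ++shift; }
    while (a <  0x00008000) { a <<= 1; --shift; }

    /* Seed x0 = 48/17 - (32/17)*a, good to ~4 bits on [0.5, 1). */
    q16_16 x = 0x0002D2D3 - qmul(0x0001E1E2, a);

    for (int i = 0; i < 4; ++i)         /* each step ~doubles correct bits */
        x = qmul(x, 0x00020000 - qmul(a, x));   /* x = x*(2 - a*x) */

    /* Undo the normalisation. */
    return (shift >= 0) ? (x >> shift) : (x << -shift);
}
```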
I have seen that some people suggest that using signbit() can eliminate warp divergence and improve performance. If this is correct, then how is it implemented in the GPU? Is there some dedicated hardware for this function in, e.g., special function units (SFU)?
The implementation of signbit() is in the open in CUDA versions up to, and including, CUDA 6.5. It can be found in the header file math_functions.h. For newer versions of CUDA, you could inspect the machine code with cuobjdump --dump-sass to see how it is implemented.
Looking at the header file in CUDA 6.5, one sees that signbit() is a macro that maps to an inline function which extracts the sign bit from the raw bit representation of the floating-point operand. On GPUs this is easily done, since integer and floating-point operands share the same register file. In the case of CUDA 6.5, the sign bit is extracted with a single right-shift instruction.
So the implementation of signbit() is branchless and efficient; there is no dedicated hardware instruction for it, as none is necessary.
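As an illustration, here is a minimal sketch of that idea (my own, not the verbatim contents of math_functions.h; the function name is made up):

```
/* Sketch: branchless sign-bit extraction along the lines described
   above.  __float_as_int reinterprets the float's bits in an integer
   register; the unsigned shift brings the sign bit down to bit 0. */
__device__ int my_signbit(float x)
{
    return (int)((unsigned int)__float_as_int(x) >> 31);
}
```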
In general, CUDA programmers do not need to worry about branches all that often, especially where if-then-else constructs with small bodies are concerned. The compiler frequently renders these into branchless code using either predication or select-type instructions (the machine equivalent of the C/C++ ternary operator). It may also combine uniform branches with predication.
I need to implement matrix multiplication on a GPU with CUDA for large matrices. The size of each matrix alone is bigger than the GPU memory, so I think I need an algorithm to do that efficiently. I searched the internet but couldn't find any. Can anyone give me the name of, or a link to, such an algorithm?
There isn't really a formal algorithm for this; in general, these sorts of linear algebra operations where the whole problem isn't stored in memory simultaneously are referred to as "out of core" operations.
To solve it, you don't need a particularly elaborate algorithm, just the CUBLAS library and a pencil and paper. For example, you can split A into two blocks of rows and B into two blocks of columns and decompose the matrix product like this:

$$AB = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A_1 B_1 & A_1 B_2 \\ A_2 B_1 & A_2 B_2 \end{pmatrix}$$
which gives you four independent sub-matrix multiplication operations. These can be calculated using four calls to CUBLAS gemm using very straightforward host code. You can extend the idea to as many sub-matrices as are required to match the problem size and your GPU capacity. The same principle can also be used to implement matrix multiplication problems on multiple GPUs (see this question for an example).
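As a hedged host-side sketch (my own, not from any of the codebases mentioned), here is one of the four sub-products for an N x N single-precision problem in column-major order, assuming the blocks A1 (N/2 x N), B1 (N x N/2), and the result block have already been staged into device memory; the transfers, error checking, and the other three gemm calls are omitted:

```
#include <cublas_v2.h>

/* Computes C11 = A1 * B1, i.e. the top-left block of the product. */
void multiply_block(cublasHandle_t handle,
                    const float *dA1,   /* device block A1: (N/2) x N       */
                    const float *dB1,   /* device block B1: N x (N/2)       */
                    float *dC11,        /* device result C11: (N/2) x (N/2) */
                    int N)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N / 2,          /* m: rows of A1 and C11     */
                N / 2,          /* n: columns of B1 and C11  */
                N,              /* k: shared inner dimension */
                &alpha,
                dA1, N / 2,     /* leading dimension of A1   */
                dB1, N,         /* leading dimension of B1   */
                &beta,
                dC11, N / 2);   /* leading dimension of C11  */
}
```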
Alternatively, you can find a working implementation of this precise idea in the Harvard-developed SciGPU-GEMM codebase and in the HPL-CUDA Linpack implementation (disclaimer: I am affiliated with the latter codebase).
Recently, I have been writing a program (an FDTD simulation) in the CUDA development environment. The OS is Windows Server 2008, the graphics card is a Tesla C2070, and the compiler is VS2010. The program computes with both single- and double-precision floating point.
I was reading the CUDA Programming Guide 3.2 and 4.0. The appendix tells me sin() and cos() have a maximum error of 2 ulp. My original CPU program produces results that differ from the CUDA version.
I want to make the results exactly the same. Is that possible?
To quote Goldberg's "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (a paper that every computer scientist, computational scientist, and possibly even every scientist who programs, should read):
"Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers."
This means that when you change the order of operations—even when using ostensibly associative arithmetic—you are likely to get slightly different answers.
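A tiny single-precision illustration (mine, not from the paper):

```
#include <stdio.h>

int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;   /* 0.0f + 1.0f          = 1.0f */
    float right = a + (b + c);   /* b + c rounds to b, so = 0.0f */
    printf("(a+b)+c = %g, a+(b+c) = %g\n", left, right);
    return 0;
}
```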
Parallelism, by definition, results in a different ordering of operations relative to serial arithmetic. "Embarrassingly parallel" computations, that is, computations where each output element is computed independently of all the others, sometimes do not have to worry about this. But collective operations, like reductions or scans, and spatial neighborhood computations, such as stencils (as in FDTD), do experience this effect.
In practice, even using a different compiler (or even just different compiler options) can change the result of a floating-point computation, even when compiling the same code, with or without parallelism.
I was wondering how I would go about using __cos(x) (and, respectively, __sin(x)) in kernel code with CUDA. I looked up in the CUDA manual that there is such a device function, but when I use it the compiler just says that I cannot call a host function from the device.
However, I found that there are two sister functions, cosf(x) and __cosf(x), the latter of which runs on the SFU and is overall much faster than the original cosf(x) function. The compiler does not complain about __cosf(x), of course.
Is there a library I'm missing? Am I mistaken about this trig function?
As the SFU only supports certain single-precision operations, there are no double-precision __cos() and __sin() device functions. There are single-precision __cosf() and __sinf() device functions, as well as other functions detailed in table C-4 of the CUDA 4.2 Programming Manual.
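For reference, here is a minimal kernel showing the fast single-precision path (a sketch; the names and sizes are illustrative):

```
/* __cosf runs on the SFU: fast, with reduced accuracy compared to cosf.
   Compiling with -use_fast_math maps cosf to __cosf automatically. */
__global__ void cos_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __cosf(in[i]);
}
```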
I assume you are looking for faster alternatives to the double-precision versions of the standard math functions sin() and cos()? If the sine and cosine of the same argument are needed, sincos() should be used for a significant performance boost. If the argument of the sine or cosine is multiplied by π, you would want to use sinpi(), cospi(), or sincospi() instead, for even more performance. For example, sincospi() is very useful when implementing the Box-Muller algorithm for generating normally distributed random numbers. Also, check out the CUDA 5.0 preview for the best possible performance (note that the preview is of alpha-release quality).
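For example, here is a hedged sketch of the Box-Muller step using sincospi() (my own; u1 and u2 are assumed to be uniform random numbers in (0, 1], e.g. from cuRAND, and generating them is outside this snippet):

```
/* Converts two uniform variates into two normally distributed ones.
   sincospi(x, &s, &c) computes sin(pi*x) and cos(pi*x) in one call. */
__device__ void box_muller(double u1, double u2, double *z0, double *z1)
{
    double r = sqrt(-2.0 * log(u1));   /* radius; needs u1 > 0 */
    double s, c;
    sincospi(2.0 * u2, &s, &c);        /* sin(2*pi*u2), cos(2*pi*u2) */
    *z0 = r * c;
    *z1 = r * s;
}
```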
I'm making medical imaging equipment, and I want to use CUDA to make it faster.
I receive 1024-sample 1D data from a CCD, 512 times. Before I perform the IFFT, I have to apply a high-performance interpolation algorithm (like cubic spline interpolation) to each 1024-sample array (so, 1D interpolation, 512 times).
1. Is there any CUDA library to perform cubic spline interpolation?
(I found one such library, but it is for 2- or 3-dimensional images. Also, since I need to perform other complicated filtering functions, I need the data in global memory, not texture memory.)
2. Is there any NUFFT (non-uniform fast Fourier transform) library (it doesn't need to be written for CUDA)?
I'm thinking that if I have an NUFFT function, I won't have to do the interpolation and IFFT separately, which could make the equipment even faster.
Since more people have asked this, I have extended my CUDA cubic interpolation code with 1D cubic interpolation as well. You can find the updated code here: http://www.dannyruijters.nl/cubicinterpolation/
A working CUDA example that also contains 1D cubic interpolation can be found in the cudaAccuracyTest sample in the examples subdirectory in CI.zip.
For those of you who are more interested in an SSE approach, I have some working SSE-optimized multi-threaded cubic interpolation code (albeit in 3D, not 1D) in the referenceCubicTexture3D sample in the examples subdirectory.
Edit: The cubic interpolation code is now available on GitHub. The 1D cubic interpolation code is here.
Regarding #1
Ruijters' bi/tricubic spline interpolation, which I think is what you are referring to (http://dannyruijters.nl/cubicinterpolation), now (edited!) works with 1D data, thank you! See Danny Ruijters' answer on this page.
Regarding #2
Here are a few NUFFT implementations that I'm aware of, with brief thoughts on each.
The first library, mentioned by @ardiyu07, is Greengard, et al.'s implementation of fast Gaussian gridding. It is in Fortran, which I don't know, so I didn't look at it for long (though it does offer type-III nonuniform-to-nonuniform transforms).
The second is Ferrara's implementation of Greengard's algorithm in Matlab/MEX, and I couldn't get it to give me the correct solution (see my comment to that effect on the MathWorks FileExchange, which I just posted).
The third is Potts, et al. (http://www-user.tu-chemnitz.de/~potts/nfft/). I couldn't get this to compile on Windows, so I gave up on it. It also has type-III NUFFTs.
The fourth is Fessler, et al. (http://web.eecs.umich.edu/~fessler/code/), written in Matlab/MEX, with pre-compiled binaries provided for at least Linux and Windows. Definitely written by non-professional programmers, but it's the only one of the four that I've gotten to work correctly. I even got it to work in GNU Octave after changing their Matlab source code in a handful of places (basically by seeing where Octave raised errors), since Octave can use pre-compiled MEX binaries. It also uses a different algorithm than Greengard's or Potts', based on min-max criteria (its solutions are guaranteed to minimize the maximum DFT error), but it lacks type-III NUFFTs (only types I and II: one of the domains has to be uniform).
I believe a fifth NUFFT/"gridding" implementation is by Hargreaves, et al.: http://www-mrsrl.stanford.edu/~brian/gridding/ (paper at http://dx.doi.org/10.1109/TMI.2005.848376). It is in Matlab/MEX. As is, it is not as general-purpose as the previous four on this list, as it's very much embedded in its MRI context.
And here's a sixth implementation, in Cython (fast Python), with type-III nonuniform-to-nonuniform transforms and some other nice features, alas under the GPL: https://github.com/mrbell/gfft
I'm working, at a glacial pace, on porting Fessler's algorithm to Python/Cython, and maybe CUDA ("maybe" because just zero-padding the standard (CU)FFT and linear interpolation seems to work well enough). Best of luck.
I don't know about that algorithm, but if what you've found is fast enough for your equipment, why don't you change the implementation from using texture memory to a simple array? You might get a further speedup by using shared memory.
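In that spirit, here is a hedged sketch of what that could look like (entirely my own, not taken from the library under discussion): one block per 1024-sample CCD line, the line staged into shared memory, and Catmull-Rom cubic interpolation evaluated from there; edge handling by clamping is deliberately crude.

```
__global__ void cubic_interp_1d(const float *in,  /* 512 lines x nIn samples  */
                                float *out,       /* 512 lines x nOut samples */
                                int nIn, int nOut)
{
    extern __shared__ float line[];               /* one input line */
    const float *src = in + blockIdx.x * nIn;

    /* Stage the whole line into shared memory. */
    for (int i = threadIdx.x; i < nIn; i += blockDim.x)
        line[i] = src[i];
    __syncthreads();

    for (int j = threadIdx.x; j < nOut; j += blockDim.x) {
        float x  = (float)j * (nIn - 1) / (nOut - 1);   /* source position  */
        int   i1 = min(max((int)x, 1), nIn - 3);        /* clamp the window */
        float t  = x - (float)i1;
        float p0 = line[i1 - 1], p1 = line[i1],
              p2 = line[i1 + 1], p3 = line[i1 + 2];
        /* Catmull-Rom cubic through p0..p3, evaluated at fraction t. */
        out[blockIdx.x * nOut + j] =
            p1 + 0.5f * t * (p2 - p0 +
                 t * (2.0f * p0 - 5.0f * p1 + 4.0f * p2 - p3 +
                 t * (3.0f * (p1 - p2) + p3 - p0)));
    }
}
```

Launched as, e.g., cubic_interp_1d<<<512, 256, 1024 * sizeof(float)>>>(d_in, d_out, 1024, nOut).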
I've found some written in Matlab and Fortran 77:
http://www.cims.nyu.edu/cmcl/nufft/nufft.html
http://www.mathworks.com/matlabcentral/fileexchange/25135-nufft-nufft-usffft
To be honest, your parallelism seems a bit low for the GPU. A six-core CPU with SSE optimizations might outperform a GPU here.