1-dimensional cubic spline interpolation in CUDA

I'm building a medical imaging device and want to use CUDA to make it faster.
I receive 1D records of 1024 samples from a CCD, 512 times per acquisition.
Before I perform the IFFT, I have to apply a high-performance interpolation algorithm (such as cubic spline interpolation) to each 1024-sample record (so 512 separate 1D interpolations).
1. Is there a CUDA library that performs cubic spline interpolation? (I found one library, but it is for 2- or 3-dimensional images. Since I also need to perform other, more complicated filtering, I need the data in global memory, not in texture memory.)
2. Is there an NUFFT (non-uniform fast Fourier transform) library (it doesn't need to be written for CUDA)? If I had an NUFFT routine, I wouldn't have to do the interpolation and the IFFT separately, which could make the device even faster.

Since more people have asked this, I have extended my CUDA cubic interpolation code with 1D cubic interpolation as well. You can find the updated code here: http://www.dannyruijters.nl/cubicinterpolation/
A working CUDA example that also contains 1D cubic interpolation can be found in the cudaAccuracyTest sample in the examples subdirectory in CI.zip.
For those of you who are more interested in a SSE approach, I have some working SSE optimized multi-threaded cubic interpolation code (albeit in 3D, not 1D) in the referenceCubicTexture3D sample in the examples subdirectory.
Edit: The cubic interpolation code is now available on GitHub, including the 1D cubic interpolation code.

Regarding #1
Ruijters' bi/tricubic spline interpolation, which I think is what you're referring to (http://dannyruijters.nl/cubicinterpolation), now (edit!) works with 1D data as well, thank you! See Danny Ruijters' answer on this page.
Regarding #2
Here are a few NUFFT implementations that I'm aware of, with brief thoughts on each.
The first library mentioned by @ardiyu07, Greengard et al.'s implementation of fast Gaussian gridding, is in Fortran, which I don't know, so I didn't look at it for long (though it does offer type-III nonuniform-to-nonuniform transforms).
The second one is Ferrara's implementation of Greengard's algorithm in Matlab/MEX, and I couldn't get it to give me the correct solution (see my comment to that effect on the MathWorks File Exchange, which I just posted).
Potts et al.'s NFFT, http://www-user.tu-chemnitz.de/~potts/nfft/ : I couldn't get this to compile on Windows, so I gave up on it. It also offers type-III NUFFTs.
Fessler et al., http://web.eecs.umich.edu/~fessler/code/ : written in Matlab/MEX, with pre-compiled binaries provided for at least Linux and Windows. It is clearly not written by professional programmers, but it's the only one of the four that I've gotten to work correctly. I even got it to work in GNU Octave after changing their Matlab source code in a handful of places (basically by seeing where Octave raised errors), since Octave can use pre-compiled MEX binaries. It uses a different algorithm than Greengard's or Potts', based on a min-max criterion (its solutions are guaranteed to minimize the maximum DFT error), but it lacks type-III NUFFTs (only types I and II: one of the domains has to be uniform).
I believe a fifth NUFFT/"gridding" implementation is by Hargreaves, et al.: http://www-mrsrl.stanford.edu/~brian/gridding/ (paper at http://dx.doi.org/10.1109/TMI.2005.848376). It is in Matlab/MEX. As is, it is not as general-purpose as the previous four on this list, as it's very much embedded in its MRI context.
And here's a sixth implementation, in Cython (fast Python), with type-III nonuniform-to-nonuniform transforms and some other nice features, alas under GPL: https://github.com/mrbell/gfft
I'm working, at a glacial pace, on porting Fessler's algorithm to Python/Cython, and maybe CUDA ("maybe" because just zero-padding the standard (CU)FFT and linear interpolation seems to work well enough). Best of luck.
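For what it's worth, here is a hedged sketch of the zero-padding (CU)FFT interpolation I just mentioned, assuming complex input, an output length M > N, and glossing over the Nyquist-bin split for even N; the function name zeropad_interp and the cudaMemcpy-based spectrum shuffling are my own illustration, and error checking is omitted.

    #include <cufft.h>
    #include <cuda_runtime.h>

    __global__ void scale(cufftComplex *d, int m, float s)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < m) { d[i].x *= s; d[i].y *= s; }
    }

    /* Band-limited upsampling of a length-n complex record to length m:
       FFT, place the low/high halves of the spectrum at the ends of a
       zero-filled length-m array, inverse FFT, scale by 1/n. */
    void zeropad_interp(cufftComplex *d_in, int n, cufftComplex *d_out, int m)
    {
        cufftComplex *d_spec;
        cudaMalloc(&d_spec, n * sizeof(cufftComplex));
        cudaMemset(d_out, 0, m * sizeof(cufftComplex));

        cufftHandle fwd, inv;
        cufftPlan1d(&fwd, n, CUFFT_C2C, 1);
        cufftPlan1d(&inv, m, CUFFT_C2C, 1);

        cufftExecC2C(fwd, d_in, d_spec, CUFFT_FORWARD);

        /* Low half of the spectrum at the start, high half at the end
           (the Nyquist bin of even-length records is not split here). */
        cudaMemcpy(d_out, d_spec, (n / 2) * sizeof(cufftComplex),
                   cudaMemcpyDeviceToDevice);
        cudaMemcpy(d_out + m - n / 2, d_spec + n / 2,
                   (n - n / 2) * sizeof(cufftComplex), cudaMemcpyDeviceToDevice);

        cufftExecC2C(inv, d_out, d_out, CUFFT_INVERSE);
        scale<<<(m + 255) / 256, 256>>>(d_out, m, 1.0f / n);

        cufftDestroy(fwd); cufftDestroy(inv); cudaFree(d_spec);
    }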

I don't know about that algorithm, but if what you've found is fast enough for your equipment, why don't you change its implementation from texture memory to a plain array in global memory? You might get a further speedup by using shared memory as well. A rough sketch of such a kernel is included below.
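The sketch below is mine, for illustration only; it uses Catmull-Rom cubic interpolation rather than the prefiltered cubic B-spline of the library you found, and the kernel name and signature are hypothetical.

    /* Catmull-Rom cubic interpolation of one 1D record stored in global memory.
       Each thread computes one output sample at position x[i], expressed in
       units of input samples (the positions may be non-uniformly spaced). */
    __device__ float catmull_rom(float p0, float p1, float p2, float p3, float t)
    {
        return 0.5f * ((2.0f * p1) +
                       (-p0 + p2) * t +
                       (2.0f * p0 - 5.0f * p1 + 4.0f * p2 - p3) * t * t +
                       (-p0 + 3.0f * p1 - 3.0f * p2 + p3) * t * t * t);
    }

    __global__ void interp1d(const float *in, int n_in,
                             const float *x, float *out, int n_out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_out) return;

        float xi = x[i];
        int   k  = (int)floorf(xi);
        float t  = xi - (float)k;

        /* Clamp the four-sample neighbourhood to the record boundaries. */
        int k0 = max(k - 1, 0);
        int k1 = min(max(k, 0), n_in - 1);
        int k2 = min(k + 1, n_in - 1);
        int k3 = min(k + 2, n_in - 1);

        out[i] = catmull_rom(in[k0], in[k1], in[k2], in[k3], t);
    }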
As for NUFFT, I've found implementations written in Matlab and Fortran 77:
http://www.cims.nyu.edu/cmcl/nufft/nufft.html
http://www.mathworks.com/matlabcentral/fileexchange/25135-nufft-nufft-usffft

To be honest, the amount of parallelism here seems a bit low for a GPU. A 6-core CPU with SSE optimizations might outperform a GPU on this workload.

Related

Solving the Poisson equation on multiple GPUs located on different cluster nodes interacting by the MPI protocol

I'm trying to solve the Poisson equation in real space on a multi-GPU architecture, using C/CUDA code with the MPI library. For the moment I'm only interested in solving the problem in a periodic box, but in the future I may want to look at spherical geometry.
Is there an existing routine to solve this problem?
Comments dating from August 2012 seem to indicate that the Thrust library is not suited to multi-GPU architectures. Is that still correct?
If such a routine exists, what method does it use (Jacobi, SOR, Gauss-Seidel, Krylov)?
Please share your opinion about its speed and any problems you may have encountered.
Thanks for your time.
Solving the Poisson equation by a Multi-GPU approach, with GPUs located on different cluster nodes interacting by using the MPI protocol, is a relatively recent research topic. The basic idea is to use domain decomposition, so that each GPU solves for one part of the computational domain, and MPI is used to exchange boundary data.
You may wish to have a look at the papers Towards a multi-GPU solver for the three-dimensional two-phase incompressible Navier-Stokes equations, presented at GTC 2012, and An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters. In the first approach in particular, the Navier-Stokes equations are solved by Chorin's projection method, which in turn requires the solution of a Poisson equation; this is the most demanding task and is solved by a multi-GPU/MPI strategy that exploits a Jacobi-preconditioned conjugate gradient solver.
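For orientation, here is a hedged sketch (mine, not taken from either paper) of a single Jacobi sweep for the 2D Poisson equation -∇²u = f on a uniform grid with spacing h; a multi-GPU version wraps such a kernel with MPI exchanges of the halo rows/columns between subdomains.

    __global__ void jacobi_sweep(const float *u_old, float *u_new, const float *f,
                                 int nx, int ny, float h)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;

        /* Update interior points only; boundary values are either fixed
           (Dirichlet) or come from the neighbouring subdomain via MPI. */
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int idx = j * nx + i;
            u_new[idx] = 0.25f * (u_old[idx - 1] + u_old[idx + 1] +
                                  u_old[idx - nx] + u_old[idx + nx] +
                                  h * h * f[idx]);
        }
    }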
Concerning available routines, in the past I bumped into GAMER, downloadable software for astrophysics applications. The authors claim that the code contains a variety of GPU-accelerated Poisson solvers and hybrid OpenMP/MPI/GPU parallelization. However, I have never had the chance to download it.
Thrust can be used in a multi-GPU environment. You can use the runtime API, i.e. cudaSetDevice, to switch devices. Since Thrust handles allocations and deallocations for vectors implicitly, care must be taken to make sure that the correct device is selected both when device vectors are declared and when they are deallocated, i.e. when they go out of scope.
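A minimal sketch of that scoping point, under the assumption of one independent Thrust computation per device: select the device before the device_vector is constructed, and let the vector go out of scope while that same device is still current, so its deallocation lands on the right GPU.

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <cuda_runtime.h>
    #include <iostream>

    int main()
    {
        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);

        for (int dev = 0; dev < num_devices; ++dev) {
            cudaSetDevice(dev);                        // make this GPU current
            {
                // Allocated, used, and freed while device 'dev' is current.
                thrust::device_vector<double> x(1 << 20, 1.0);
                double sum = thrust::reduce(x.begin(), x.end());
                std::cout << "device " << dev << ": sum = " << sum << std::endl;
            }   // x is destroyed here, still on device 'dev'
        }
        return 0;
    }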

Large matrix multiplication on gpu

I need to implement matrix multiplication on a GPU with CUDA for large matrices. Each matrix alone is bigger than the GPU memory, so I think I need an algorithm to do this efficiently. I searched the internet but couldn't find anything. Can anyone give me the name of, or a link to, such an algorithm?
There isn't really a formal algorithm for this; in general, these sorts of linear algebra operations where the whole problem isn't stored in memory simultaneously are referred to as "out of core" operations.
To solve it, you don't need a particularly elaborate algorithm, just the CUBLAS library and a pencil and paper. For example, you can decompose the matrix product like this (A split into two row blocks, B into two column blocks):

    [ A1 ]                [ A1*B1   A1*B2 ]
    [    ] * [ B1  B2 ] = [               ]
    [ A2 ]                [ A2*B1   A2*B2 ]

which gives you four independent sub-matrix multiplication operations. These can be calculated using four calls to CUBLAS gemm with very straightforward host code. You can extend the idea to as many sub-matrices as are required to match the problem size and your GPU capacity. The same principle can also be used to implement matrix multiplication problems on multiple GPUs (see this question for an example).
In the alternative, you can find a working implementation of this precise idea in the Harvard developed SciGPU-GEMM codebase and in the HPL-CUDA linpack implementation (disclaimer: I am affiliated with the latter codebase).
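To make the blocking concrete, here is a hedged sketch (a hypothetical blocked_sgemm helper, not taken from either codebase) that streams row blocks of A and column blocks of B through the GPU and computes each block of C with a single cublasSgemm call. It assumes column-major storage (the CUBLAS convention) and that the inner dimension K is small enough for one row block of A plus one column block of B to fit in device memory; error checking is omitted for brevity.

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void blocked_sgemm(const float *A, const float *B, float *C,
                       int M, int N, int K, int MB, int NB)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);

        float *dA, *dB, *dC;
        cudaMalloc(&dA, (size_t)MB * K * sizeof(float));
        cudaMalloc(&dB, (size_t)K  * NB * sizeof(float));
        cudaMalloc(&dC, (size_t)MB * NB * sizeof(float));

        const float alpha = 1.0f, beta = 0.0f;

        for (int i = 0; i < M; i += MB) {
            int mb = (M - i < MB) ? M - i : MB;
            /* Row block A(i:i+mb-1, :) of the column-major matrix A. */
            cublasSetMatrix(mb, K, sizeof(float), A + i, M, dA, mb);
            for (int j = 0; j < N; j += NB) {
                int nb = (N - j < NB) ? N - j : NB;
                /* Column block B(:, j:j+nb-1) is contiguous in column-major order. */
                cublasSetMatrix(K, nb, sizeof(float), B + (size_t)j * K, K, dB, K);
                /* C block = A block * B block, one gemm per block pair. */
                cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, mb, nb, K,
                            &alpha, dA, mb, dB, K, &beta, dC, mb);
                cublasGetMatrix(mb, nb, sizeof(float), dC, mb,
                                C + (size_t)j * M + i, M);
            }
        }

        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        cublasDestroy(handle);
    }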

about floating point operation

Recently I have been writing a program (an FDTD simulation) using the CUDA development environment. The OS is Windows Server 2008, the graphics card is a Tesla C2070, and the compiler is VS2010. The program computes with both single- and double-precision floating point.
I was reading the CUDA programming guide, versions 3.2 and 4.0. The appendix says that sin() and cos() have a maximum error of 2 ULP. My original CPU program produces results that differ from the CUDA version.
I want the results to be exactly the same. Is that possible?
To quote Goldberg (a paper that every Computer Scientist, Computational Scientist, and possibly even every scientist who programs, should read):
Due to roundoff errors, the associative laws of algebra do not necessarily hold for floating-point numbers.
This means that when you change the order of operations—even when using ostensibly associative arithmetic—you are likely to get slightly different answers.
Parallelism, by definition, results in a different ordering of operations relative to serial arithmetic. "Embarrassingly parallel" computations, that is, computations where each output element is computed independently of all the others, sometimes do not have to worry about this. But collective operations, like reductions or scans, and spatial neighborhood computations, such as stencils (as in FDTD), do experience this effect.
In practice, even using a different compiler (and even different compiler options) can change the result of floating point computation, even when compiling the same code, with or without parallelism.
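A minimal illustration of the ordering effect (plain C, with values chosen only for demonstration): the two groupings of the same three summands give different answers in single precision.

    #include <stdio.h>

    int main(void)
    {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

        float left  = (a + b) + c;  /* (1e8 + -1e8) + 1 = 1 */
        float right = a + (b + c);  /* -1e8 + 1 rounds back to -1e8, so the 1 is lost */

        printf("(a + b) + c = %g\n", left);   /* prints 1 */
        printf("a + (b + c) = %g\n", right);  /* prints 0 */
        return 0;
    }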

CUDA - Use the CURAND Library for Dummies

I was reading the CURAND Library API documentation and, as a newbie in CUDA, I wanted to see if someone could show me a simple piece of code that uses the CURAND library to generate random numbers. I am looking into generating a large amount of numbers to use with discrete event simulation. My task is to develop algorithms that use GPGPUs to speed up random number generation. I have implemented the LCG, multiplicative, and Fibonacci methods in standard C. However, I want to "port" those codes to CUDA and take advantage of threads and blocks to speed up the process of generating random numbers.
Link 1: http://adnanboz.wordpress.com/tag/nvidia-curand/
That person covers two of the methods I will need (LCG and Mersenne Twister), but the code does not provide much detail. I was wondering if anyone could expand on those initial implementations and point me in the right direction on how to use them properly.
Thanks!
Your question is misleading - you say "Use the cuRAND Library for Dummies" but you don't actually want to use cuRAND. If I understand correctly, you actually want to implement your own RNG from scratch rather than use the optimised RNGs available in cuRAND.
First recommendation is to revisit your decision to use your own RNG, why not use cuRAND? If the statistical properties are suitable for your application then you would be much better off using cuRAND in the knowledge that it is tuned for all generations of the GPU. It includes Marsaglia's XORWOW, l'Ecuyer's MRG32k3a, and the MTGP32 Mersenne Twister (as well as Sobol' for Quasi-RNG).
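For reference, a minimal host-API cuRAND example (my own sketch, not from the original post) that fills a device buffer with uniform floats using the XORWOW generator; error checking is omitted, and it links against -lcurand.

    #include <cuda_runtime.h>
    #include <curand.h>
    #include <stdio.h>

    int main(void)
    {
        const size_t n = 1 << 20;            /* number of random floats */
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

        /* Generate n uniform floats in (0, 1] directly into device memory. */
        curandGenerateUniform(gen, d_data, n);

        /* Copy a few back just to show they are there. */
        float h[4];
        cudaMemcpy(h, d_data, sizeof(h), cudaMemcpyDeviceToHost);
        printf("%f %f %f %f\n", h[0], h[1], h[2], h[3]);

        curandDestroyGenerator(gen);
        cudaFree(d_data);
        return 0;
    }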
You could also look at Thrust, which has some simple RNGs, for an example see the Monte Carlo sample.
If you really need to create your own generator, then there's some useful techniques in GPU Computing Gems (Emerald Edition, Chapter 16: Parallelization Techniques for Random Number Generators).
As a side note, remember that while a simple LCG is fast and easy to skip ahead, it typically has fairly poor statistical properties, especially over large quantities of draws. When you say you will need "Mersenne Twister" I assume you mean MT19937. The referenced Gems book talks about parallelising MT19937, but the original developers created the MTGP generators (also referenced above) since skip-ahead is fairly complex to implement for MT19937.
Also, as another side note, just using a different seed per sequence to achieve parallelisation is usually a bad idea; statistically you are not assured of independence. You either need to skip ahead or leap-frog, or else use some other technique (e.g. DCMT) to ensure there is no correlation between sequences; a device-API sketch of the skip-ahead approach follows.
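A hedged sketch of that skip-ahead idea with the cuRAND device API: every thread uses the same seed but its own subsequence, so cuRAND guarantees the per-thread streams do not overlap. The kernel names and launch configuration are illustrative only.

    #include <curand_kernel.h>

    __global__ void setup_states(curandState *states, unsigned long long seed)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        /* Same seed for everyone, per-thread subsequence, zero offset. */
        curand_init(seed, id, 0, &states[id]);
    }

    __global__ void draw_uniforms(curandState *states, float *out, int per_thread)
    {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curandState local = states[id];          /* work on a register copy */
        for (int i = 0; i < per_thread; ++i)
            out[id * per_thread + i] = curand_uniform(&local);
        states[id] = local;                      /* save state for later launches */
    }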

CUDA 2D or 3D arrays

I am dealing with a set of (largish, 2k x 2k) images,
and I need to do per-pixel operations down a stack of a few sequential images.
Are there any opinions on using a single large 2D texture and calculating offsets versus using 3D arrays?
It seems that 3D arrays are a bit out of the mainstream in the CUDA API; the allocation and transfer functions are quite different from their 2D counterparts.
There doesn't seem to be any good documentation on the higher-level "how and why" of CUDA, as opposed to the specific calls.
There is the Best Practices Guide, but it doesn't address this.
I would recommend reading the book "CUDA by Example". It goes through all these things that aren't well documented and explains the "how and why".
If you're rendering the result of the CUDA kernel, I think you should use OpenGL interop. That way your code processes the image on the GPU and leaves the processed data there, making it much faster to render. There's a good example of doing this in the book.
If each CUDA thread needs to read only one pixel from the first frame and one pixel from the next frame, you don't need to use textures. Textures only benefit you if each thread is reading in a bunch of consecutive pixels. So you're best off using a 3D array.
Here is an example of using CUDA and 3D cuda arrays:
https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st
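Since the 3D allocation and copy calls do differ noticeably from their 2D counterparts, here is a hedged sketch (mine, with a hypothetical upload_stack helper) of allocating pitched 3D linear memory for a stack of images and copying the host data over; error checking is omitted.

    #include <cuda_runtime.h>

    void upload_stack(const float *host_stack, int width, int height, int depth)
    {
        /* Widths are given in bytes here because no CUDA array is involved. */
        cudaExtent extent = make_cudaExtent(width * sizeof(float), height, depth);

        cudaPitchedPtr d_stack;
        cudaMalloc3D(&d_stack, extent);          /* pitched 3D allocation on the GPU */

        cudaMemcpy3DParms p = {0};
        p.srcPtr = make_cudaPitchedPtr((void *)host_stack, width * sizeof(float),
                                       width * sizeof(float), height);
        p.dstPtr = d_stack;
        p.extent = extent;
        p.kind   = cudaMemcpyHostToDevice;
        cudaMemcpy3D(&p);

        /* A kernel would then index row (y, z) of the stack via the pitch:
           float *row = (float *)((char *)d_stack.ptr
                                  + ((size_t)z * height + y) * d_stack.pitch); */

        cudaFree(d_stack.ptr);
    }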