CUDA 2D or 3D arrays

I am dealing with a set of largish (2k x 2k) images.
I need to do per-pixel operations down a stack of a few sequential images.
Are there any opinions on using a single large 2D texture plus calculating offsets vs. using 3D arrays?
It seems that 3D arrays are a bit 'out of the mainstream' in the CUDA API; the allocation and transfer functions are quite different from their 2D counterparts.
There doesn't seem to be any good documentation on the higher-level "how and why" of CUDA, as opposed to the specific calls.
There is the Best Practices Guide, but it doesn't address this.

I would recommend reading the book "CUDA by Example". It goes through the things that aren't documented as well and explains the "how and why".
If you're rendering the result of the CUDA kernel, I think you should use OpenGL interop. That way, your code processes the image on the GPU and leaves the processed data there, making it much faster to render. There's a good example of doing this in the book.
If each CUDA thread needs to read only one pixel from the first frame and one pixel from the next frame, you don't need to use textures. Textures only benefit you if each thread is reading a bunch of consecutive pixels. So you're best off using a 3D array.
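To make the last suggestion concrete, here is a minimal sketch (not the asker's or answerer's code; the frame size, stack depth and the per-pixel operation are placeholders) that keeps the image stack in plain pitched global memory via cudaMalloc3D and walks down the stack per pixel, with no textures involved:

    #include <cuda_runtime.h>
    #include <cstdio>

    #define W 2048
    #define H 2048
    #define DEPTH 4   // number of sequential frames in the stack (illustrative)

    __global__ void diffDownStack(cudaPitchedPtr stack, float *out)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= W || y >= H) return;

        const char *base  = (const char *)stack.ptr;
        size_t pitch      = stack.pitch;       // bytes per padded row
        size_t slicePitch = pitch * H;         // bytes per frame

        float acc = 0.0f;
        for (int z = 1; z < DEPTH; ++z) {      // per-pixel operation down the stack
            const float *prev = (const float *)(base + (z - 1) * slicePitch + y * pitch);
            const float *cur  = (const float *)(base +  z      * slicePitch + y * pitch);
            acc += cur[x] - prev[x];           // example: accumulate frame-to-frame difference
        }
        out[y * W + x] = acc;
    }

    int main()
    {
        cudaPitchedPtr stack;
        cudaExtent extent = make_cudaExtent(W * sizeof(float), H, DEPTH);
        cudaMalloc3D(&stack, extent);          // pitched linear memory, not a cudaArray
        // (a real program would cudaMemcpy3D the frames in here)

        float *out;
        cudaMalloc(&out, W * H * sizeof(float));

        dim3 block(16, 16);
        dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);
        diffDownStack<<<grid, block>>>(stack, out);
        cudaDeviceSynchronize();
        printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(stack.ptr);
        cudaFree(out);
        return 0;
    }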

Here is an example of using CUDA and 3D CUDA arrays:
https://github.com/nvpro-samples/gl_cuda_interop_pingpong_st

Related

send custom datatype/class to GPU

All tutorials and introductory material for GPGPU/CUDA tend to use flat arrays; however, I'm trying to port a piece of code that uses somewhat more sophisticated objects than a flat array.
I have a 3-dimensional std::vector whose data I want to have on the GPU. What strategies are there for getting it onto the GPU?
I can think of one for now:
Copy the vector's data on the host into a simpler structure like an array. However, this seems wasteful because 1) I have to copy the data and then send it to the GPU; and 2) I have to allocate an array whose dimensions are padded to the maximum element count over all of the inner vectors. For example, with a 2D vector,
imagine {{1, 2, 3, 4, ..., 1000}, {1}}: in host memory these are roughly 1001 allocated items, whereas copying this to a dense 2-dimensional array padded to the longest inner vector would require 2 x 1000 = 2000 elements, and the waste grows with the number and imbalance of the inner vectors.
Are there better strategies?
There are many methodologies for refactoring data to suit GPU computation. One challenge is copying data between device and host; the other is representing the data (and designing the algorithm) on the GPU so that memory bandwidth is used efficiently. I'll highlight 3 general approaches, focusing on ease of copying data between host and device.
1. Since you mention std::vector, you might take a look at Thrust, which has vector container representations that are compatible with GPU computing. However, Thrust won't conveniently handle vectors of vectors AFAIK, which is what I interpret your "3D std::vector" nomenclature to mean, so some (non-trivial) refactoring will still be involved. And Thrust still doesn't let you use a vector directly in ordinary CUDA device code, although the data it contains is usable.
2. You could manually refactor the vector of vectors into flat (1D) arrays. You'll need one array for the data elements (length = total number of elements contained in your "3D" std::vector), plus one or more additional (1D) arrays to store the start (and implicitly the end) points of each individual sub-vector; a minimal sketch of this flattening appears after this list. Yes, folks will say this is inefficient because it involves indirection or pointer chasing, but so does the use of vector containers on the host. I would suggest that getting your algorithm working first is more important than worrying about one level of indirection in some aspects of your data access.
3. As you point out, the "deep copy" issue with CUDA can be a tedious one. It's pretty new, but you might want to take a look at Unified Memory, which is available on 64-bit Windows and Linux platforms under CUDA 6 with a Kepler (cc 3.0) or newer GPU. With C++ especially, UM can be very powerful, because we can overload operators like new under the hood and provide almost seamless usage of UM for shared host/device allocations.
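To illustrate approach 2, here is a minimal sketch (hypothetical names, and a 2-level vector for brevity; the same data-plus-offsets scheme nests once more for your 3D case) of flattening a vector of vectors into one data array plus an offsets array and using both in a kernel:

    #include <vector>
    #include <cuda_runtime.h>

    // Each row i occupies data[offsets[i]] .. data[offsets[i+1]-1].
    __global__ void sumRows(const float *data, const int *offsets, int numRows, float *rowSums)
    {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= numRows) return;
        float s = 0.0f;
        for (int i = offsets[row]; i < offsets[row + 1]; ++i)   // one level of indirection
            s += data[i];
        rowSums[row] = s;
    }

    int main()
    {
        std::vector<std::vector<float> > host = {{1, 2, 3, 4}, {5}, {6, 7}};

        // Flatten: all elements back to back, plus start offsets (CSR-style).
        std::vector<float> flat;
        std::vector<int> offsets(1, 0);
        for (size_t r = 0; r < host.size(); ++r) {
            flat.insert(flat.end(), host[r].begin(), host[r].end());
            offsets.push_back((int)flat.size());
        }

        int numRows = (int)host.size();
        float *dData, *dSums;
        int *dOffsets;
        cudaMalloc(&dData, flat.size() * sizeof(float));
        cudaMalloc(&dOffsets, offsets.size() * sizeof(int));
        cudaMalloc(&dSums, numRows * sizeof(float));
        cudaMemcpy(dData, flat.data(), flat.size() * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dOffsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);

        sumRows<<<(numRows + 127) / 128, 128>>>(dData, dOffsets, numRows, dSums);
        cudaDeviceSynchronize();

        cudaFree(dData); cudaFree(dOffsets); cudaFree(dSums);
        return 0;
    }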

Median selection in CUDA kernel

I need to compute the median of an array of size p inside a CUDA kernel (in my case, p is small e.g. p = 10). I am using an O(p^2) algorithm for its simplicity, but at the cost of time performance.
Is there a "function" to find the median efficiently that I can call inside a CUDA kernel?
I know I could implement a selection algorithm, but I'm looking for a function and/or tested code.
Thanks!
Here are a few hints:
Use a better selection algorithm: QuickSelect is a faster version of QuickSort for selecting the kth element in an array. For compile-time-constant mask sizes, sorting networks are even faster, thanks to high TLP and an O(log^2 n) critical path. If you only have 8-bit values, you can use a histogram-based approach; there is a paper describing an implementation that takes constant time per pixel, independent of mask size, which makes it very fast for very large mask sizes. You can parallelize it by using a minimal launch strategy (only run as many threads as you need to keep all SMs at max capacity), tiling the image, and letting threads of the same block cooperate on each kernel histogram.
Sort in registers. For small mask sizes, you can keep the entire array in registers, making median selection with a sorting network much faster (a minimal sketch follows these hints). For larger mask sizes, you can use shared memory.
Copy all pixels used by the block to shared memory first, and then copy to thread-local buffers that are also in shared memory.
If you only have a few mask sizes that need to go really fast (such as 3x3 and 5x5), use templates to make them compile-time constants. This can speed things up a lot, because the compiler can unroll loops and reorder a lot more instructions, possibly improving load batching and other goodies, leading to large speed-ups.
Make sure your reads are coalesced and aligned.
There are many other optimizations you can do. Make sure you read through the CUDA documentation, especially the Programming Guide and the Best Practices Guide.
When you really want to gun for high performance, don't forget to take a good look at a CUDA profiler, such as the Visual Profiler.
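As a rough illustration of the "sort in registers with a compile-time size" hints above (a sketch under those assumptions, not tuned or profiled code), here is a templated odd-even transposition network; because N is a compile-time constant the loops fully unroll, so the working array can stay in registers, and the median is just the middle element afterwards:

    // Odd-even transposition sorting network over a small thread-local array.
    template <int N>
    __device__ float medianN(float v[N])
    {
        #pragma unroll
        for (int pass = 0; pass < N; ++pass) {
            #pragma unroll
            for (int i = pass & 1; i + 1 < N; i += 2) {   // compare-exchange disjoint pairs
                float lo = fminf(v[i], v[i + 1]);
                float hi = fmaxf(v[i], v[i + 1]);
                v[i] = lo;
                v[i + 1] = hi;
            }
        }
        return v[N / 2];   // lower median when N is even (e.g. p = 10)
    }

    // Example use: each thread gathers its p = 10 values and selects their median.
    __global__ void medianKernel(const float *in, float *out, int n)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid * 10 + 9 >= n) return;
        float v[10];
        #pragma unroll
        for (int i = 0; i < 10; ++i)
            v[i] = in[tid * 10 + i];   // placeholder gather; use your own access pattern
        out[tid] = medianN<10>(v);
    }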
Even in a single thread, one can sort the array and pick the value in the middle in O(p*log(p)), which makes O(p^2) look excessive. If you have p threads at your disposal, it's also possible to sort the array as fast as O(log(p)), although that may not be the fastest solution for small p. See the top answer here:
Which parallel sorting algorithm has the best average case performance?

Multiple GPUs in OptiX (asynchronous launches possible?)

I have some challenges with my Master's thesis that I hope you can help me with, or maybe point me in the right direction.
I'm implementing Progressive Photon Mapping using the new approach by Knaus and Zwicker (http://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Knaus11.pdf) using OptiX. This approach makes each iteration/frame of PPM independent and more suitable for multi-GPU.
What I do (with a single GPU) is trace a number of photons using OptiX and store them in a buffer. The photons are then sorted into a spatial hash map using CUDA and Thrust, never leaving the GPU. I want to do the spatial hash map creation on the GPU, since it is the bottleneck of my renderer. Finally, this buffer is used during indirect radiance estimation. So this is a multi-pass algorithm, consisting of ray tracing, photon tracing, photon map generation and, finally, image creation.
I understand that OptiX can support multiple GPUs. Each context launch is divided up across the GPUs. Any writes to buffers seem to be serialized and broadcast to each device so that their buffer contents are the same.
What I would like to do is let one GPU do one frame while the second GPU does the next frame. I can then combine the results, for instance on the CPU or on one of the GPUs in a combine pass. It is also acceptable if I can do each pass in parallel on each device (synchronizing between passes). Is this somehow possible?
For instance, could I create two OptiX contexts, one mapped to each device, on two different host threads? This would allow me to do the CUDA/Thrust spatial hash map generation as before, assuming the photons are on one device, and merge the two generated images at the end of the pipeline. However, the programming guide states that it does not support multi-threaded context handling. I could use multiple processes, but then there is a lot of mess with inter-process communication. This approach also requires duplicated work: creating the scene geometry, compiling PTX files, and so on.
Thanks!
OptiX already splits the workload according to your GPUs' power, so your approach will likely not be faster than letting OptiX use all the GPUs itself.
If you want to force your data to remain on the device (note that in such a situation writes from different devices will not be coherent), you can use the RT_BUFFER_GPU_LOCAL flag, as described in the programming guide:
https://developer.nvidia.com/optix-documentation
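For reference, a rough sketch of what that looks like with the (pre-OptiX 7) C++ wrapper; the buffer name, element struct and sizes are made up, and the rest of the pipeline setup is omitted:

    #include <optixu/optixpp_namespace.h>

    struct PhotonRecord { float position[3]; float power[3]; };   // placeholder layout

    int main()
    {
        const unsigned int numPhotons = 1u << 20;

        optix::Context context = optix::Context::create();

        // GPU_LOCAL makes the buffer device-local: each GPU keeps its own copy and
        // writes are not broadcast, so the contents are not coherent across devices.
        optix::Buffer photons = context->createBuffer(
            RT_BUFFER_INPUT_OUTPUT | RT_BUFFER_GPU_LOCAL,
            RT_FORMAT_USER, numPhotons);
        photons->setElementSize(sizeof(PhotonRecord));
        context["photon_buffer"]->setBuffer(photons);

        // ... declare programs/geometry and call context->launch(...) as usual ...
        return 0;
    }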

1-dimensional cubic spline interpolation in CUDA

I'm building a piece of medical imaging equipment and want to use CUDA to make it faster.
I receive 1D data of size 1024 from a CCD, 512 times.
Before I perform the IFFT,
I have to apply a high-performance interpolation algorithm (like cubic spline interpolation)
to each of the 1024-sample arrays (so 1D interpolation, 512 times).
Is there any CUDA library to perform cubic spline interpolation?
(I found that there is one library, but it is for 2- or 3-dimensional images.
Since I need to perform other complicated filtering operations, I need the data in global memory, not in texture memory.)
Is there any NUFFT (non uniform fast Fourier transform) library (doesn't need to be written for CUDA)?
I'm thinking that if I have a NUFFT function, I don't have to do the interpolation and IFFT separately, which could make the equipment even faster.
Since more people have asked this, I have extended my CUDA cubic interpolation code with 1D cubic interpolation as well. You can find the updated code here: http://www.dannyruijters.nl/cubicinterpolation/
A working CUDA example that also contains 1D cubic interpolation can be found in the cudaAccuracyTest sample in the examples subdirectory in CI.zip.
For those of you who are more interested in a SSE approach, I have some working SSE optimized multi-threaded cubic interpolation code (albeit in 3D, not 1D) in the referenceCubicTexture3D sample in the examples subdirectory.
Edit: The cubic interpolation code is now available on GitHub; the 1D cubic interpolation code is there as well.
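If a dependency-free starting point helps: below is a minimal, hypothetical kernel (not taken from the library above) doing batched 1D Catmull-Rom cubic interpolation directly in global memory, since the asker wants to avoid texture memory. The line length, batch size and resampling positions are placeholders, and Catmull-Rom is just one common cubic variant, not the prefiltered cubic B-spline scheme the library implements.

    // Batched 1D Catmull-Rom cubic interpolation in global memory.
    // in:  numLines lines of length inLen (e.g. 512 x 1024 CCD spectra)
    // pos: for each line, outLen fractional sample positions to resample at
    // out: numLines x outLen interpolated values
    __global__ void catmullRom1D(const float *in, const float *pos, float *out,
                                 int inLen, int outLen, int numLines)
    {
        int i    = blockIdx.x * blockDim.x + threadIdx.x;  // output sample index
        int line = blockIdx.y;                             // which 1D line
        if (i >= outLen || line >= numLines) return;

        float x  = pos[line * outLen + i];                 // fractional position in [0, inLen)
        int   i1 = (int)floorf(x);
        float t  = x - (float)i1;

        const float *src = in + line * inLen;
        // Clamp the 4-tap neighbourhood to the line boundaries.
        int i0 = max(i1 - 1, 0);
        int i2 = min(i1 + 1, inLen - 1);
        int i3 = min(i1 + 2, inLen - 1);
        i1     = min(max(i1, 0), inLen - 1);

        float p0 = src[i0], p1 = src[i1], p2 = src[i2], p3 = src[i3];

        // Catmull-Rom cubic: f(t) = 0.5 * (2*p1 + (-p0+p2)*t
        //                  + (2*p0-5*p1+4*p2-p3)*t^2 + (-p0+3*p1-3*p2+p3)*t^3)
        out[line * outLen + i] = 0.5f * (2.0f * p1 + (-p0 + p2) * t
                               + (2.0f * p0 - 5.0f * p1 + 4.0f * p2 - p3) * t * t
                               + (-p0 + 3.0f * p1 - 3.0f * p2 + p3) * t * t * t);
    }
    // Launch example: dim3 block(256); dim3 grid((outLen + 255) / 256, numLines);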
Regarding #1
Ruijters' bi/tricubic spline interpolation, which I think is what you're referring to (http://dannyruijters.nl/cubicinterpolation), now works with 1D data as well (edited!). Thank you! See Danny Ruijters' answer on this page.
Regarding #2
Here are a few NUFFT implementations that I'm aware of, and brief thoughts on them.
The first library mentioned by @ardiyu07, Greengard et al.'s implementation of fast Gaussian gridding, is in Fortran, which I don't know, so I didn't look at it for long (though it does offer type-III nonuniform-to-nonuniform transforms).
The second one is Ferrara's implementation of Greengard's algorithm in Matlab/MEX, and I couldn't get it to give me the correct solution (see my comment to that effect on MathWorks File Exchange, which I just posted).
Potts et al., http://www-user.tu-chemnitz.de/~potts/nfft/ : I couldn't get this to compile on Windows, so I gave up on it. It also has type-III NUFFTs.
Fessler, et al., http://web.eecs.umich.edu/~fessler/code/ written in Matlab/MEX and pre-compiled binaries provided for Linux and Windows at least. Definitely written by non-professional programmers, but it's the only one of the 4 that I've gotten to work correctly. I even got it to work in GNU Octave after changing their Matlab source code in a handful of places (basically by seeing where Octave errors were raised), since Octave can use pre-compiled MEX binaries. This also uses a different algorithm than Greengard's or Potts', based on min-max criteria (its solutions are guaranteed to minimize the maximum DFT error), but lacks type-III NUFFTs (only types-I and II: one of the domains has to be uniform).
I believe a fifth NUFFT/"gridding" implementation is by Hargreaves, et al.: http://www-mrsrl.stanford.edu/~brian/gridding/ (paper at http://dx.doi.org/10.1109/TMI.2005.848376). It is in Matlab/MEX. As is, it is not as general-purpose as the previous four on this list, as it's very much embedded in its MRI context.
And here's a sixth implementation, in Cython (fast Python), with type-III nonuniform-to-nonuniform transforms and some other nice features, alas under the GPL: https://github.com/mrbell/gfft
I'm working, at a glacial pace, on porting Fessler's algorithm to Python/Cython, and maybe CUDA ("maybe" because just zero-padding the standard (CU)FFT and linear interpolation seems to work well enough). Best of luck.
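A side note on the IFFT half of the question (a hedged sketch, assuming complex-to-complex data already resident on the device): the 512 transforms of length 1024 can be done in a single call with a batched cuFFT plan, which is usually much faster than launching them one by one.

    #include <cufft.h>
    #include <cuda_runtime.h>

    int main()
    {
        const int n = 1024, batch = 512;
        cufftComplex *d_data;
        cudaMalloc(&d_data, sizeof(cufftComplex) * n * batch);
        // ... fill d_data with the 512 interpolated spectra ...

        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, batch);           // one plan for all 512 lines
        cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE); // in-place batched IFFT
        cudaDeviceSynchronize();                           // note: cuFFT leaves the result unnormalized

        cufftDestroy(plan);
        cudaFree(d_data);
        return 0;
    }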
I don't know about that algorithm, but if you think what you've found is fast enough for your equipment, then why don't you change the implementation from texture memory to a simple array? Maybe you can get a further speedup using shared memory.
I've found some implementations written in Matlab and Fortran 77:
http://www.cims.nyu.edu/cmcl/nufft/nufft.html
http://www.mathworks.com/matlabcentral/fileexchange/25135-nufft-nufft-usffft
To be honest, your parallelism seems to be a bit low for the GPU. A six-core CPU with SSE optimizations might outperform a GPU here.

Graph algorithms on GPU

The current GPU execution and memory models are somewhat limited (memory limits, limited data structures, no recursion, ...).
Do you think it would be feasible to implement a graph theory problem on a GPU? For example, vertex cover, dominating set, independent set, max clique, ...?
Is it also feasible to have branch-and-bound algorithms on GPUs? Recursive backtracking?
You will be interested in:
"Exploring the Limits of GPUs With Parallel Graph Algorithms"
"Accelerating large graph algorithms on the GPU using CUDA"
This is tangentially related to your question, but I've implemented a "recursive" backtracking algorithm for enumerating "self-avoiding walks" on a lattice (n.b.: the stack was simulated within the CUDA kernel, to avoid the overhead of creating local variables for a whole bunch of function calls). It's possible to do this efficiently, so I'm sure it can be adapted to a graph-theoretical context. Here's a link to a seminar on the topic where I gave some general discussion of backtracking within the Single Instruction Multiple Data (SIMD) paradigm; it's a PDF of about 1 MB: http://bit.ly/9ForGS .
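To make the "stack simulated within the kernel" remark concrete, here is a minimal, hypothetical sketch of recursion-free backtracking on the GPU (it is not the self-avoiding-walk code from the talk; the walk-counting payload and the fixed depth are placeholders for whatever bounding or pruning your problem needs):

    #define MAX_DEPTH 6   // illustrative bound; the search grows exponentially with depth

    // Each thread explores the search tree rooted at vertex 'tid' of an n-vertex graph
    // given as a dense adjacency matrix, using an explicit stack instead of recursion.
    __global__ void backtrackKernel(const int *adj, int n, unsigned long long *counts)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        int stack[MAX_DEPTH];   // current partial walk
        int next[MAX_DEPTH];    // next candidate neighbour to try at each depth
        int top = 0;
        stack[0] = tid;
        next[0] = 0;
        unsigned long long found = 0;

        while (top >= 0) {
            int v = stack[top];
            int c = next[top]++;                 // pick the next candidate at this depth
            if (c >= n) { --top; continue; }     // all candidates tried: backtrack (pop)
            if (!adj[v * n + c]) continue;       // not an edge: try the next candidate
            ++found;                             // counted a walk extension (placeholder "work")
            if (top + 1 < MAX_DEPTH) {           // descend (push) instead of a recursive call
                stack[++top] = c;
                next[top] = 0;
            }
        }
        counts[tid] = found;                     // number of walks of length <= MAX_DEPTH from tid
    }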
I don't claim to know about the wider literature on graph theoretical algorithms on GPUs, but hope the above helps a little.
(@TheMachineCharmer, thanks for the links.)