Is there an efficient way to map a graph onto blocks in CUDA programming?

In parallel computing, the first step is usually to divide the original problem into sub-tasks and map them onto blocks and threads.
For problems with a regular data structure this is easy and efficient, for example matrix multiplication, FFT, and so on.
But graph problems such as shortest path, graph traversal, and tree search have irregular data structures, and it seems far from easy, at least to me, to partition such a problem onto blocks and threads on a GPU.
I am wondering whether there are efficient solutions for this kind of partitioning.
For simplicity, take the single-source shortest-path problem as an example. I am stuck on how to divide the graph so that both locality and coalescing are achieved.

Tree data structures are designed to optimize sequential processing. In a tree search, each state depends heavily on the previous state, so I think it would not be optimal to parallelize traversal over a tree.
As far as general graphs are concerned, each connected node can be analyzed in parallel, but I suspect there would be redundant work on overlapping paths.
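To make the mapping concrete, here is a hedged sketch (not a tuned implementation) of one relaxation pass of frontier-based SSSP: one thread per frontier vertex, with the graph in CSR form so that adjacency reads are reasonably coalesced across a warp. Locality beyond that is hard to guarantee on irregular graphs.

// dist[] must be initialized to +inf (except the source). Duplicates can
// appear in the next frontier, which costs extra work but not correctness.
__global__ void relaxFrontier(const int *rowPtr, const int *colIdx,
                              const float *weight, float *dist,
                              const int *frontier, int frontierSize,
                              int *nextFrontier, int *nextSize)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;

    int u = frontier[i];
    float du = dist[u];
    for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
        int v = colIdx[e];
        float cand = du + weight[e];
        if (cand < dist[v]) {
            // For non-negative floats the int bit pattern preserves order,
            // so atomicMin on the reinterpreted bits updates the distance.
            atomicMin((int *)&dist[v], __float_as_int(cand));
            nextFrontier[atomicAdd(nextSize, 1)] = v;
        }
    }
}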

You can use MapGraph, which implements the GAS (Gather-Apply-Scatter) model for all the things you mentioned. It ships with examples and a library for the Gather, Apply, and Scatter phases in CUDA for the GPU, along with a CPU-only implementation.
You can find latest version here: http://sourceforge.net/projects/mpgraph/files/0.3.3/
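For a feel of the pattern, one GAS step over a CSR graph looks roughly like the generic sketch below. This is the pattern only, written in raw CUDA, not MapGraph's actual API.

__global__ void gasStep(const int *rowPtr, const int *colIdx,
                        const float *src, float *dst, int numVertices)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= numVertices) return;

    // Gather: reduce over the values of v's neighbors.
    float acc = 0.0f;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
        acc += src[colIdx[e]];

    // Apply: compute the new vertex value from the gathered result
    // (identity here; plug in your own operator).
    dst[v] = acc;

    // Scatter: a full GAS engine would activate changed neighbors for the
    // next iteration; that bookkeeping is omitted in this one-kernel sketch.
}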

Related

Faceted search and heat map creation on GPU

I am trying to find ways to filter and render 100 million+ data points as a heat map in real time.
Each point in addition to the (x,y) coordinates has a fixed set of attributes (int, date, bit flags) which can be dynamically chosen by the user in order to filter down the data set.
Would it be feasible to accelerate all or parts of this task on GPUs?
It would help if you were more specific, but I'm assuming that you want to apply a user-specified filter to the same 2D spatial data. If that is the case, you could consider organizing your data into a spatial data structure, such as a quadtree or k-d tree.
Once you have done this, you can run a GPU kernel for each region of your data structure, based on the filter you want to apply. Each thread figures out which points in its region satisfy the specified filter.
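As a minimal sketch (the Point layout and field names below are hypothetical), the per-point test could look like this: one thread per point writes a keep flag that a later compaction or binning pass consumes. A struct-of-arrays layout would coalesce better than this array-of-structs one; it is kept simple here.

struct Point {
    float x, y;          // coordinates
    int attr;            // e.g. a user-filterable category
    unsigned int flags;  // bit flags
};

__global__ void filterPoints(const Point *pts, int n,
                             int wantedAttr, unsigned int wantedFlags,
                             unsigned char *keep)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Point p = pts[i];
    keep[i] = (p.attr == wantedAttr) &&
              ((p.flags & wantedFlags) == wantedFlags);
}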
Definitely, this is the kind of problem that fits into the GPGPU spectrum.
You could decide to write your own kernel to filter your data, or simply use functions from vendor libraries to that end. You would probably need to normalize, interpolate, and so on, which are common utilities in those libraries. These kinds of functions are typically embarrassingly parallel, so it shouldn't be difficult to create your own kernel.
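For the library route, here is a hedged sketch using Thrust (which ships with the CUDA toolkit): stream compaction with thrust::copy_if and a user predicate, filtering one attribute array into the values inside [lo, hi].

#include <thrust/device_vector.h>
#include <thrust/copy.h>

struct InRange {
    float lo, hi;
    __host__ __device__ bool operator()(float v) const {
        return v >= lo && v <= hi;
    }
};

thrust::device_vector<float> filter(const thrust::device_vector<float> &d_in,
                                    float lo, float hi)
{
    thrust::device_vector<float> d_out(d_in.size());
    InRange pred = { lo, hi };
    auto newEnd = thrust::copy_if(d_in.begin(), d_in.end(),
                                  d_out.begin(), pred);
    d_out.resize(newEnd - d_out.begin());  // shrink to the kept points
    return d_out;
}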
I'd rather use a visualization framework that allows you to filter and visualize your data in real time. Vispy is a great option, but of course there are others.

Median selection in CUDA kernel

I need to compute the median of an array of size p inside a CUDA kernel (in my case, p is small e.g. p = 10). I am using an O(p^2) algorithm for its simplicity, but at the cost of time performance.
Is there a "function" to find the median efficiently that I can call inside a CUDA kernel?
I know I could implement a selection algorithm, but I'm looking for a function and/or tested code.
Thanks!
Here are a few hints:
Use a better selection algorithm: QuickSelect is a faster version of QuickSort for selecting the kth element of an array. For compile-time-constant mask sizes, sorting networks are even faster, thanks to high TLP and an O(log^2 n) critical path. If you only have 8-bit values, you can use a histogram-based approach. This paper describes an implementation that takes constant time per pixel, independent of mask size, which makes it very fast for very large mask sizes. You can parallelize it by using a minimal launch strategy (only run as many threads as you need to keep all SMs at max capacity), tiling the image, and letting threads of the same block cooperate on each kernel histogram.
Sort in registers. For small mask sizes, you can keep the entire array in registers, making median selection with a sorting network much faster (a register-resident sketch follows these hints). For larger mask sizes, you can use shared memory: copy all pixels used by the block to shared memory first, and then copy to thread-local buffers that are also in shared memory.
If you only have a few mask sizes that need to go really fast (such as 3x3 and 5x5), use templates to make them compile-time constants. This can speed things up a lot because the compiler can unroll loops and reorder many more instructions, possibly improving load batching and other goodies, leading to large speed-ups.
Make sure your reads are coalesced and aligned.
There are many other optimizations you can do. Make sure you read through the CUDA documents, especially the Programming Guide and the Best Practices Guide.
When you really want to gun for high performance, don't forget to take a good look at a CUDA profiler, such as the Visual Profiler.
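To make the register-sort hint concrete, here is a hedged sketch for the common 3x3 case (p = 9) using the classic 19-exchange median-of-9 network (J. L. Smith's network, as published by N. Devillard). Everything stays in registers and there are no branches.

// s2 orders a pair so that a <= b.
__device__ __forceinline__ void s2(float &a, float &b)
{
    float t = fminf(a, b);
    b = fmaxf(a, b);
    a = t;
}

__device__ float median9(float p0, float p1, float p2,
                         float p3, float p4, float p5,
                         float p6, float p7, float p8)
{
    s2(p1, p2); s2(p4, p5); s2(p7, p8);
    s2(p0, p1); s2(p3, p4); s2(p6, p7);
    s2(p1, p2); s2(p4, p5); s2(p7, p8);
    s2(p0, p3); s2(p5, p8); s2(p4, p7);
    s2(p3, p6); s2(p1, p4); s2(p2, p5);
    s2(p4, p7); s2(p4, p2); s2(p6, p4);
    s2(p4, p2);
    return p4;  // the median ends up in p4
}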
Even in a single thread one can sort the array and pick the middle value in O(p log p), which makes O(p^2) look excessive. If you have p threads at your disposal, it is also possible to sort the array in as little as O(log p), although that may not be the fastest solution for small p. See the top answer here:
Which parallel sorting algorithm has the best average case performance?
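As a minimal single-thread sketch of "sort, then take the middle element": for tiny p, an insertion sort kept in registers or local memory is usually fastest in practice, even though its worst case is O(p^2).

// Copy the P values into a local array, insertion-sort, return the middle
// element (the upper median when P is even). Call as medianSmall<9>(buf).
template <int P>
__device__ float medianSmall(const float *in)
{
    float a[P];
    for (int i = 0; i < P; ++i) a[i] = in[i];

    for (int i = 1; i < P; ++i) {
        float key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; --j; }
        a[j + 1] = key;
    }
    return a[P / 2];
}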

Are there existing libraries for running many optimization jobs in parallel on a GPU?

I'm looking to perform many (thousands of) small optimization jobs on my NVIDIA GeForce.
By small jobs I mean 3-6 dimensions and around 1000 data points of input each. Basically it's for curve-fitting purposes, so the objective function to minimize is a sum of squares of a continuous (non-trivial) analytical function, of which I can compute the first derivative analytically. Each dimension is constrained between a lower and an upper boundary.
The only thing these jobs have in common is the original data series from which they each take a different 1000 data points.
I suspect this will be much faster on my GPU than running the jobs one by one on my CPU as I do now, so I could use it for real-time monitoring.
However, the GPU libraries I've seen focus only on making a single function evaluation faster on the GPU.
There was a thread on my specific question on the NVIDIA CUDA forum, with more users looking for this, but the forums have been down for a while. It mentioned porting an existing C library (e.g. levmar) to CUDA, but this got lost...
Do you know of an existing library for running many optimizations in parallel on a GPU?
Thanks!
The GFOR loop is meant to tile together many small problems like this: each body of the loop is tiled together with the other loop bodies. Many people have used it for optimization problems like the one you describe. It is available for C/C++, Fortran, and Python, as well as MATLAB code.
My disclaimer is that I work on GFOR. But I'm not aware of any other production-level GPU library that handles similar optimization problems. You might be able to find some academic projects if you search around.
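For flavor, here is a hedged sketch assuming ArrayFire's C++ gfor; the exponential model a*exp(-b*x) and all names are illustrative stand-ins for your analytical function, not a vetted implementation. It evaluates every job's sum-of-squares objective in one batched pass.

#include <arrayfire.h>
using namespace af;

// params: [2, nJobs], x and y: [nPoints, nJobs]; returns [1, nJobs].
array objectiveAll(const array &params, const array &x, const array &y)
{
    const int nPoints = x.dims(0);
    const int nJobs   = params.dims(1);
    array out = constant(0, 1, nJobs);
    gfor (seq j, nJobs) {
        array a = tile(params(0, j), nPoints);  // per-job scalar, broadcast
        array b = tile(params(1, j), nPoints);
        array r = y(span, j) - a * exp(-b * x(span, j));
        out(span, j) = sum(r * r);              // sum of squared residuals
    }
    return out;
}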

CUDA - Use the CURAND Library for Dummies

I was reading the cuRAND library API. I am a newbie in CUDA, and I wanted to see if someone could show me simple code that uses the cuRAND library to generate random numbers. I am looking into generating a large quantity of numbers to use with discrete event simulation. My task is to develop algorithms that use GPGPUs to speed up random number generation. I have implemented the LCG, multiplicative, and Fibonacci methods in standard C. Now I want to port those codes to CUDA and take advantage of threads and blocks to speed up the process.
Link 1: http://adnanboz.wordpress.com/tag/nvidia-curand/
That person covers two of the methods I will need (LCG and Mersenne Twister), but the code does not provide much detail. I was wondering if anyone could expand on those initial implementations to actually point me in the right direction on how to use them properly.
Thanks!
Your question is misleading - you say "Use the cuRAND Library for Dummies" but you don't actually want to use cuRAND. If I understand correctly, you actually want to implement your own RNG from scratch rather than use the optimised RNGs available in cuRAND.
First recommendation is to revisit your decision to use your own RNG, why not use cuRAND? If the statistical properties are suitable for your application then you would be much better off using cuRAND in the knowledge that it is tuned for all generations of the GPU. It includes Marsaglia's XORWOW, l'Ecuyer's MRG32k3a, and the MTGP32 Mersenne Twister (as well as Sobol' for Quasi-RNG).
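To show how little code cuRAND needs, here is a minimal host-API example (standard cuRAND calls) that fills a device buffer with uniform floats using the default XORWOW generator:

#include <cuda_runtime.h>
#include <curand.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_data, n);  // n uniform floats in (0, 1]

    /* ... launch the simulation kernels that consume d_data ... */

    curandDestroyGenerator(gen);
    cudaFree(d_data);
    return 0;
}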
You could also look at Thrust, which has some simple RNGs, for an example see the Monte Carlo sample.
If you really need to create your own generator, then there's some useful techniques in GPU Computing Gems (Emerald Edition, Chapter 16: Parallelization Techniques for Random Number Generators).
As a side note, remember that while a simple LCG is fast and easy to skip ahead, it typically has fairly poor statistical properties, especially when using large quantities of draws. When you say you will need "Mersenne Twister" I assume you mean MT19937. The referenced Gems book talks about parallelising MT19937, but the original developers created the MTGP generators (also referenced above) since MT19937 is fairly complex to implement skip-ahead for.
Also as another side note, just using a different seed to achieve parallelisation is usually a bad idea; statistically you are not assured of independence. You either need to skip ahead or leap-frog, or else use some other technique (e.g. DCMT) to ensure there is no correlation between sequences.
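cuRAND's device API makes the skip-ahead approach straightforward: give every thread the same seed but a distinct subsequence, as in this minimal sketch:

#include <curand_kernel.h>

// Same seed, distinct subsequence per thread: cuRAND guarantees the
// subsequences do not overlap, unlike ad-hoc per-thread seeding.
__global__ void setupStates(curandState *states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);
}

__global__ void draw(curandState *states, float *out, int perThread)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[id];  // work on a register copy
    for (int i = 0; i < perThread; ++i)
        out[id * perThread + i] = curand_uniform(&local);
    states[id] = local;              // save the advanced state
}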

Recursion / stacks and queues in cuda

To traverse a tree data structure, in whatever form one might represent it, one needs to use either recursion or iteration with stacks and queues.
How would one do this on the GPU using CUDA? As far as I know, neither recursion nor structures like stacks and queues are supported in CUDA.
In context, my problem is range searching: given a point, I want to traverse an octree data structure to find all points within a radius r centered at that point.
The most efficient serial algorithms and data structures do not necessarily make the most efficient parallel implementations.
That said, this is not a new question, and a little bit of googling can turn up interesting results.
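One common workaround is an explicit per-thread stack in local memory. Here is a hedged sketch (the octree node layout below is hypothetical) with one thread per query point, counting neighbors within radius r:

#define STACK_MAX 64  // enough for a bounded-depth octree

struct OctNode {
    float cx, cy, cz, half;        // cell center and half-width
    int children[8];               // child indices, -1 if absent
    int pointStart, pointCount;    // leaf payload into the point array
};

__device__ bool cellIntersectsSphere(const OctNode &n,
                                     float qx, float qy, float qz, float r)
{
    // per-axis distance from the query to the cell, clamped at zero
    float dx = fmaxf(fabsf(qx - n.cx) - n.half, 0.0f);
    float dy = fmaxf(fabsf(qy - n.cy) - n.half, 0.0f);
    float dz = fmaxf(fabsf(qz - n.cz) - n.half, 0.0f);
    return dx * dx + dy * dy + dz * dz <= r * r;
}

__global__ void rangeSearch(const OctNode *nodes, const float3 *points,
                            const float3 *queries, float r,
                            int *counts, int numQueries)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= numQueries) return;
    float3 Q = queries[q];

    int stack[STACK_MAX];  // explicit stack replaces recursion
    int top = 0;
    stack[top++] = 0;      // push the root node
    int found = 0;

    while (top > 0) {
        OctNode n = nodes[stack[--top]];  // pop
        if (!cellIntersectsSphere(n, Q.x, Q.y, Q.z, r)) continue;
        if (n.pointCount > 0) {           // leaf: test its points
            for (int i = 0; i < n.pointCount; ++i) {
                float3 p = points[n.pointStart + i];
                float dx = p.x - Q.x, dy = p.y - Q.y, dz = p.z - Q.z;
                if (dx * dx + dy * dy + dz * dz <= r * r) ++found;
            }
        } else {                          // inner node: push children
            for (int c = 0; c < 8; ++c)
                if (n.children[c] >= 0) stack[top++] = n.children[c];
        }
    }
    counts[q] = found;
}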