Recursion / stacks and queues in CUDA

To traverse a tree data structure, in whatever form it is represented, one needs either recursion or iteration with stacks and queues.
How would one do this on the GPU using CUDA? As far as I know, neither recursion nor structures like stacks and queues are supported in CUDA.
For context, my problem is range searching: given a point, I want to traverse an octree data structure to find all points within a radius r centered at that point.

The most efficient serial algorithms and data structures do not necessarily make the most efficient parallel implementations.
That said, this is not a new question, and a little bit of googling can turn up interesting results.
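The usual workaround is to replace recursion with an explicit stack kept in each thread's local memory. Below is a minimal sketch for the octree range search described above; the OctNode layout, the array-of-nodes representation and MAX_DEPTH are assumptions made purely for illustration, not a reference implementation.

    // Sketch: iterative octree range search with an explicit per-thread stack.
    // The node layout and MAX_DEPTH are hypothetical.
    struct OctNode {
        float3 center;      // centre of the node's cube
        float  halfWidth;   // half the edge length of the cube
        int    children[8]; // indices into the node array, -1 if absent
        int    pointStart;  // range of points stored in a leaf
        int    pointCount;  // 0 for internal nodes
    };

    #define MAX_DEPTH 32    // assumed upper bound on tree depth

    __device__ bool sphereOverlapsNode(float3 q, float r, const OctNode &n) {
        // squared distance from the query point to the node's axis-aligned cube
        float dx = fmaxf(fabsf(q.x - n.center.x) - n.halfWidth, 0.0f);
        float dy = fmaxf(fabsf(q.y - n.center.y) - n.halfWidth, 0.0f);
        float dz = fmaxf(fabsf(q.z - n.center.z) - n.halfWidth, 0.0f);
        return dx * dx + dy * dy + dz * dz <= r * r;
    }

    __global__ void rangeSearch(const OctNode *nodes, const float3 *points,
                                const float3 *queries, float r,
                                int *hitCounts, int numQueries) {
        int qi = blockIdx.x * blockDim.x + threadIdx.x;
        if (qi >= numQueries) return;

        float3 q = queries[qi];
        int stack[MAX_DEPTH * 8];   // explicit stack replaces recursion
        int top = 0;
        stack[top++] = 0;           // push the root node
        int hits = 0;

        while (top > 0) {
            const OctNode &n = nodes[stack[--top]];        // pop
            if (!sphereOverlapsNode(q, r, n)) continue;    // prune this subtree
            if (n.pointCount > 0) {                        // leaf: test its points
                for (int i = 0; i < n.pointCount; ++i) {
                    float3 p = points[n.pointStart + i];
                    float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
                    if (dx * dx + dy * dy + dz * dz <= r * r) ++hits;
                }
            } else {                                       // internal: push children
                for (int c = 0; c < 8; ++c)
                    if (n.children[c] >= 0) stack[top++] = n.children[c];
            }
        }
        hitCounts[qi] = hits;
    }

With one query per thread, neighbouring threads may diverge while walking different parts of the tree; that is exactly the kind of trade-off the answer above alludes to.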

Related

Most efficient tensor format (padded vs packed, time-major vs batch-major) for RNNs [closed]

Some time ago (before CuDNN introduced its own RNN/LSTM implementation), you would use a tensor of shape [B,T,D] (batch-major) or [T,B,D] (time-major) and then have a straightforward LSTM implementation.
Straightforward means e.g. pure Theano or pure TensorFlow.
It was (is?) common wisdom that time-major is more efficient for RNNs/LSTMs.
This might be due to the unrolling internal details of Theano/TensorFlow (e.g. in TensorFlow, you would use tf.TensorArray, and that naturally unrolls over the first axis, so it must be time-major, and otherwise it would imply a transpose to time-major; and not using tf.TensorArray but directly accessing the tensor would be extremely inefficient in the backprop phase).
But I think this is also related to memory locality, so even with your own custom native implementation where you have full control over these details (and thus could choose any format you like), time-major should be more efficient.
(Maybe someone can confirm this?)
(In a similar way, for convolutions, batch-channel major (NCHW) is also more efficient. See here.)
Then CuDNN introduced their own RNN/LSTM implementation and they used packed tensors, i.e. with all padding removed. Also, sequences must be sorted by sequence length (longest first). This is also time-major but without the padded frames.
This caused some difficulty in adopting these kernels because padded tensors (non packed) were pretty standard in all frameworks up to that point, and thus you need to sort by seq length, pack it, then call the kernel, then unpack it and undo the sequence sorting. But slowly the frameworks adopted this.
However, Nvidia then extended the CuDNN functions (e.g. cudnnRNNForwardTrainingEx, and later cudnnRNNForward), which now support all three formats:
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED: Data layout is padded, with outer stride from one time-step to the next (time-major, or sequence-major)
CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED: Data layout is padded, with outer stride from one batch to the next (batch-major)
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_PACKED: The sequence length is sorted and packed as in the basic RNN API (time-major without padding frames, i.e. packed)
CuDNN references:
CuDNN developer guide,
CuDNN API reference
(search for "packed", or "padded").
See for example cudnnSetRNNDataDescriptor. Some quotes:
With the unpacked layout, both sequence major (meaning, time major) and batch major are supported. For backward compatibility, the packed sequence major layout is supported.
This data structure is intended to support the unpacked (padded) layout for input and output of extended RNN inference and training functions. A packed (unpadded) layout is also supported for backward compatibility.
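To make the layout choice concrete, here is a hedged sketch of setting up such a descriptor with cudnnSetRNNDataDescriptor for the padded (unpacked) sequence-major layout; the dimensions, the seqLengths values and the function name are made up for illustration, and error checking is omitted.

    #include <cudnn.h>

    // Hypothetical helper: describe a padded batch of sequences to cuDNN.
    void describeRNNInput(cudnnRNNDataDescriptor_t *xDesc) {
        const int maxSeqLen = 50, batchSize = 32, inputSize = 128;  // made-up sizes
        int seqLengths[batchSize];              // actual length of each sequence
        for (int b = 0; b < batchSize; ++b)
            seqLengths[b] = maxSeqLen;          // replace with the real lengths

        float paddingFill = 0.0f;               // value written into padded frames

        cudnnCreateRNNDataDescriptor(xDesc);
        cudnnSetRNNDataDescriptor(
            *xDesc,
            CUDNN_DATA_FLOAT,
            CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED,  // or ..._BATCH_MAJOR_UNPACKED,
                                                       // or ..._SEQ_MAJOR_PACKED
            maxSeqLen, batchSize, inputSize,
            seqLengths,
            &paddingFill);
    }

With the padded layouts the framework keeps its usual [T,B,D] or [B,T,D] tensor and only passes the per-sequence lengths; with the packed layout the caller is responsible for the sort-by-length and packing step described above.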
In TensorFlow, since CuDNN supports the padded layout, they have cleaned up the code and only support the padded layout now. I don't see that you can use the packed layout anymore. (Right?)
(I'm not sure why this decision was made. Just to have simpler code? Or is this more efficient?)
PyTorch only supports the packed layout properly (when you have sequences of different lengths) (documentation).
Besides computational efficiency, there is also memory efficiency. Obviously the packed tensor is better w.r.t. memory consumption, so that is not really the question.
I mostly wonder about computational efficiency. Is the packed format the most efficient? Or just the same as padded time-major? Is time-major more efficient than batch-major?
(This question is not necessarily about CuDNN, but in general about any naive or optimized implementation in CUDA.)
But obviously, this question also depends on the remaining neural network. When you mix the LSTM together with other modules which might require non-packed tensors, you would have a lot of packing and unpacking, if the LSTM uses the packed format. But consider that you could re-implement all other modules as well to work on packed format: Then maybe packed format would be better in every aspect?
(Maybe the answer is that there is no clear answer. But I don't know. Maybe there is also a clear answer. Last time I actually measured, the answer was pretty clear, at least for some parts of my question, namely that time-major is in general more efficient than batch-major for RNNs. Maybe the answer is that it depends on the hardware. But this should not be a guess; it should be backed either by real measurements, or even better by some good explanation. To the best of my knowledge, this should be mostly invariant to the hardware, and it would be unexpected to me if the answer varied depending on the hardware. I also assume that packed vs padded probably does not make much of a difference, again no matter the hardware. But maybe someone really knows.)

Why are CUDA indices 2D? [duplicate]

I have basically the same question as posed in this discussion. In particular I want to refer to this final response:
I think there are two different questions mixed together in this thread:
Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is "absolutely" for all the reasons you and others have described. If the data or calculation has spatial locality, then so should the assignment of work to threads in a warp.
Is there a performance benefit to using CUDA's multidimensional grids to do this work assignment? In this case, I don't think so, since you can do the index calculation trivially yourself at the top of the kernel. This burns a few arithmetic instructions, but that should be negligible compared to the kernel launch overhead.
This is why I think the multidimensional grids are intended as a programmer convenience rather than a way to improve performance. You do absolutely need to think about each warp's memory access patterns, though.
I want to know whether this situation still holds today, and what the reason is for needing a multidimensional "outer" grid.
What I'm trying to understand is whether there is a significant purpose to this (e.g. an actual benefit from spatial locality), or whether it is there only for convenience (e.g. in an image processing context, is it there only so that CUDA is aware of the x/y "patch" that a particular block is processing and can report it to the CUDA Visual Profiler or something)?
A third option is that this is nothing more than a holdover from earlier versions of CUDA, where it was a workaround for hardware indexing limits.
There is definitely a benefit in the use of a multi-dimensional grid. The different entries (tid, ctaid) are read-only variables visible as special registers. See the PTX ISA:
PTX includes a number of predefined, read-only variables, which are visible as special registers and accessed through mov or cvt instructions.
The special registers are:
%tid
%ntid
%laneid
%warpid
%nwarpid
%ctaid
%nctaid
If some of this data can be used without further processing, not only do you save arithmetic instructions (potentially at each indexing step of multi-dimensional data), but more importantly you save registers, which are a very scarce resource on any hardware.
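To make the trade-off concrete, here is a small sketch contrasting a built-in 2D grid with a 1D grid plus manual index recovery; the kernels just copy a W x H image, and all names and sizes are arbitrary.

    // (a) Built-in 2D grid: the y index comes straight from the %ctaid.y / %tid.y
    //     special registers, with no extra arithmetic or registers spent decoding it.
    __global__ void copy2D(const float *in, float *out, int W, int H) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < W && y < H) out[y * W + x] = in[y * W + x];
    }

    // (b) 1D grid with the 2D index recovered by hand: a div/mod per thread,
    //     plus the registers needed to hold the intermediate values.
    __global__ void copy1D(const float *in, float *out, int W, int H) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= W * H) return;
        int y = i / W;
        int x = i - y * W;
        out[y * W + x] = in[y * W + x];
    }

    // Launch examples:
    //   dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    //   copy2D<<<grid, block>>>(d_in, d_out, W, H);
    //   copy1D<<<(W * H + 255) / 256, 256>>>(d_in, d_out, W, H);

Both versions produce the same memory access pattern per warp; the difference is only where the 2D index comes from.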

CUDA - Use the CURAND Library for Dummies

I was reading the CURAND Library API, and as a newbie in CUDA I wanted to see if someone could show me a simple code example that uses the CURAND Library to generate random numbers. I am looking into generating a large amount of numbers to use with Discrete Event Simulation. My task is to develop algorithms that use GPGPUs to speed up random number generation. I have implemented the LCG, Multiplicative, and Fibonacci methods in standard C. However, I want to "port" those codes to CUDA and take advantage of threads and blocks to speed up the process of generating random numbers.
Link 1: http://adnanboz.wordpress.com/tag/nvidia-curand/
That person has two of the methods I will need (LCG and Mersenne Twister) but the codes do not provide much detail. I was wondering if anyone could expand on those initial implementations to actually point me in the right direction on how to use them properly.
Thanks!
Your question is misleading - you say "Use the cuRAND Library for Dummies" but you don't actually want to use cuRAND. If I understand correctly, you actually want to implement your own RNG from scratch rather than use the optimised RNGs available in cuRAND.
First recommendation is to revisit your decision to use your own RNG, why not use cuRAND? If the statistical properties are suitable for your application then you would be much better off using cuRAND in the knowledge that it is tuned for all generations of the GPU. It includes Marsaglia's XORWOW, l'Ecuyer's MRG32k3a, and the MTGP32 Mersenne Twister (as well as Sobol' for Quasi-RNG).
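For reference, a minimal sketch of the cuRAND host API is shown below, assuming you just want to fill a device buffer with uniform floats; the buffer size is arbitrary and error checking is omitted.

    #include <cuda_runtime.h>
    #include <curand.h>

    int main() {
        const size_t n = 1 << 20;          // arbitrary: about one million draws
        float *devData;
        cudaMalloc(&devData, n * sizeof(float));

        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
        // CURAND_RNG_PSEUDO_MRG32K3A and CURAND_RNG_PSEUDO_MTGP32 are the other
        // generators mentioned above; CURAND_RNG_QUASI_SOBOL32 for quasi-random.
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

        curandGenerateUniform(gen, devData, n);   // uniform floats in (0, 1]

        curandDestroyGenerator(gen);
        cudaFree(devData);
        return 0;
    }

The generated numbers stay on the device, so they can feed a simulation kernel directly without a round trip through host memory.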
You could also look at Thrust, which has some simple RNGs, for an example see the Monte Carlo sample.
If you really need to create your own generator, then there's some useful techniques in GPU Computing Gems (Emerald Edition, Chapter 16: Parallelization Techniques for Random Number Generators).
As a side note, remember that while a simple LCG is fast and easy to skip-ahead, LCGs typically have fairly poor statistical properties, especially when drawing large quantities of numbers. When you say you will need "Mersenne Twister" I assume you mean MT19937. The referenced Gems book talks about parallelising MT19937, but the original developers created the MTGP generators (also referenced above) since skip-ahead is fairly complex to implement for MT19937.
Also as another side note, just using a different seed to achieve parallelisation is usually a bad idea; statistically, you are not assured of independence. You either need to skip-ahead or leap-frog, or else use some other technique (e.g. DCMT) to ensure there is no correlation between sequences.
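As an illustration of the subsequence approach, here is a hedged sketch using the cuRAND device API, where every thread shares one seed but draws from its own subsequence (cuRAND handles the skip-ahead internally); the kernel names and launch geometry are arbitrary.

    #include <curand_kernel.h>

    // One state per thread: same seed, different subsequence, zero offset.
    __global__ void setupStates(curandState *states, unsigned long long seed) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curand_init(seed, /*subsequence=*/id, /*offset=*/0, &states[id]);
    }

    // Each thread draws perThread uniform floats from its own subsequence.
    __global__ void draw(curandState *states, float *out, int perThread) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        curandState local = states[id];       // work on a register copy
        for (int i = 0; i < perThread; ++i)
            out[id * perThread + i] = curand_uniform(&local);
        states[id] = local;                   // save the state for later launches
    }

This is the pattern to prefer over "thread id as seed": the subsequences are constructed to be statistically independent of one another.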

Is there efficient way to map graph onto blocks in CUDA programming?

In parallel computing, the first step is usually to divide the original problem into sub-tasks and map them onto blocks and threads.
For problems with a regular data structure, this is very easy and efficient, for example matrix multiplication, FFT, and so on.
But graph-theory problems like shortest path, graph traversal, and tree search have irregular data structures. It does not seem easy, at least to my mind, to partition the problem onto blocks and threads when using the GPU.
I am wondering whether there are efficient solutions for this kind of partitioning?
For simplicity, take the single-source shortest-path problem as an example. I am stuck on how to divide the graph so that both locality and coalescing are achieved.
The tree data structure is designed to optimize the sequential way of progressing. In tree search, since each state is highly dependent on the previous state, I think it would not be optimal to parallelize traversal of a tree.
As far as the graph is concerned, each connected node can be analyzed in parallel, but I guess there might be redundant operations for overlapping paths.
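As a rough illustration of what the vertex-parallel approach can look like for single-source shortest paths, here is a minimal sketch of one Bellman-Ford style relaxation pass over a CSR graph; the rowPtr/colIdx/weight arrays and the benign-race distance update are simplifying assumptions, not an optimized kernel.

    #include <cfloat>

    // One thread per vertex; run repeatedly until no thread sets *changed.
    __global__ void relax(const int *rowPtr, const int *colIdx, const float *weight,
                          float *dist, int *changed, int numVertices) {
        int u = blockIdx.x * blockDim.x + threadIdx.x;
        if (u >= numVertices) return;

        float du = dist[u];
        if (du == FLT_MAX) return;           // vertex u not reached yet

        for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
            int v  = colIdx[e];
            float nd = du + weight[e];
            // atomicMin has no float overload, so a real kernel would use a CAS
            // loop or reinterpret distances as ordered unsigned ints; a plain
            // racy update is shown here for brevity (it still converges).
            if (nd < dist[v]) { dist[v] = nd; *changed = 1; }
        }
    }

    // Host side (sketch): initialise dist[] to FLT_MAX except dist[source] = 0,
    // then launch relax<<<...>>> in a loop until *changed stays 0.

The CSR layout gives coalesced reads of rowPtr and reasonably coalesced reads of colIdx/weight, but the scattered writes to dist[v] are exactly the irregularity the question is about; frontier-based variants (processing only active vertices) reduce the wasted work.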
You can use MapGraph, which uses the GAS (Gather, Apply, Scatter) method for all the things you mentioned. They also have some examples implemented for these problems, and the library includes Gather, Apply, and Scatter primitives in CUDA for the GPU, as well as a CPU-only implementation.
You can find latest version here: http://sourceforge.net/projects/mpgraph/files/0.3.3/

Performances evaluation with Message Passing

I have to build a distributed application using MPI.
One of the decisions I have to make is how to map instances of classes onto processes (and then onto machines), in order to take maximum advantage of a distributed environment.
My question is: is there a model that lets me choose the better mapping? I mean, some arrangements are surely wrong (for example, putting on two different machines two objects that have to process a fairly large amount of data together, in a sequential manner, without a stream of tokens to process), but is there a systematic way to determine such wrong arrangements, based on the flow of execution, message complexity, and the time taken by the computation done by the algorithmic components?
Well, there are data flow diagrams. Those can help identify opportunities for parallelism and its pitfalls. The references on the Wikipedia page might give you some more theoretical grounding.
When I worked at Lockheed Martin, I was exposed to CSIM, a tool they developed for modeling algorithm mapping to processing blocks.
Another thing you might try is the Join Calculus. I've found examples of programming with it to be surprisingly intuitive, and I think it's well grounded in theory. I'm not sure why it hasn't caught on more.
The other approach is the Pi Calculus, and I think that might be more popular, though it seems harder to understand.
A practical solution to this would be to use a different model of distributed-memory parallel programming that directly addresses your concerns. I work on the Charm++ programming system, whose model is that of individual objects sending messages from one to another. The runtime system facilitates automatic mapping of these objects to available processors, to account for issues of load balance and communication locality.