Decoding RLE in CUDA efficiently - cuda

I need to decode an RLE in CUDA and I have been trying to think about the most efficient way of expanding the RLE into a list with all my values. Say my values are 2, 3, 4 and my runs are 3, 3, 1; I want to expand that to 2, 2, 2, 3, 3, 3, 4.
At first I thought I could use cudaMemset, but I am now fairly sure that it launches a kernel. I have CUDA Compute Capability 3.0, so even if launching a new kernel for each value/run pair were not inefficient, I do not have dynamic parallelism available to do this.
So I want to know if this solution is sound before I go and implement it, since there are so many things that end up not working well on CUDA if you aren't being clever. Would it be reasonable to make a kernel that will call cudaMalloc then cudaMemcpy to the destination? I can easily compute the prefix sums to know where to copy the memory to and from, and make all my reads at least coalesced. What I am worried about is calling cudaMalloc and cudaMemcpy so many times.
Another potential option is writing these values to shared memory and then copying those to global memory. I want to know if my first solution should work and be efficient or if I have to do the latter.

You don't want to think about doing a separate operation (e.g. cudaMalloc, or cudaMemset) for each value/run pair.
After computing the prefix sum on the run sequence, the last value in the prefix sum will be the total allocation size. Use that for a single cudaMalloc operation for the entire final expanded sequence.
Once you have the necessary space allocated, and the prefix sum computed, the actual expansion is pretty straightforward.
Thrust can make this pretty easy if you want a fast prototype. There is example code for it.
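As a rough illustration of the prefix-sum approach described above, here is a host-side C++ sketch (not a CUDA kernel, and not the Thrust example's exact method). The function name `rle_expand` is made up for this sketch. The prefix sum gives the total size for a single allocation, and each output position then finds its run with a binary search, which is the work one GPU thread per output element could do independently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side reference for the expansion step (hypothetical helper name).
// Every iteration of the loop over output positions is independent, so on the
// GPU it could map to one thread per output element.
std::vector<int> rle_expand(const std::vector<int>& values,
                            const std::vector<std::size_t>& runs) {
    assert(values.size() == runs.size());

    // Inclusive prefix sum of the run lengths: ends[i] is one past the last
    // output position covered by run i. The final entry is the total size.
    std::vector<std::size_t> ends(runs.size());
    std::partial_sum(runs.begin(), runs.end(), ends.begin());

    const std::size_t total = ends.empty() ? 0 : ends.back();
    std::vector<int> out(total);  // one allocation for the whole expansion

    for (std::size_t k = 0; k < total; ++k) {
        // The first run whose end lies strictly beyond k owns position k.
        const auto it = std::upper_bound(ends.begin(), ends.end(), k);
        out[k] = values[static_cast<std::size_t>(it - ends.begin())];
    }
    return out;
}
```

With the values 2, 3, 4 and runs 3, 3, 1 from the question, this produces 2, 2, 2, 3, 3, 3, 4. The binary search costs O(log n) per output element; it is used here only because it keeps every output position independent.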

@RobertCrovella is of course correct, but you can go even further in terms of efficiency if you have the leeway to slightly tweak your compression scheme.
Sorry for the self-plug, but you might be interested in my own implementation of a variant of Run-Length Encoding, with the addition of anchoring of output positions into the input (e.g. "at which offset in which run does the 2048th element fall?"); this allows for a more equitable assignment of work to thread blocks and avoids the need for a full-blown prefix sum. It's still a work-in-progress, so I only get ~34 GB/sec on a 336 GB/sec memory bandwidth card (Titan X) at the time of writing, but it's quite usable.

Related

Is there a way to know what's the extra space that cudaMalloc is going to reserve?

When I use cudaMalloc(100) it reserves more than 100 B (according to some users here, this is due to granularity issues and housekeeping information).
Is it possible to determine how big this space will be based on the Bytes I need to reserve?
Thank you so much.
EDIT: I'll explain why I need to know.
I want to apply the convolution algorithm over huge images on the GPU. To do so, since there isn't enough memory on the GPU to hold the whole image, I need to split it into batches of rows and call the kernel several times.
In fact, I need to send 2 images, the OnlyRead matrix and the Results matrix.
I want to calculate a priori the maximum number of rows I can send to the device according to the amount of free memory.
The first cudaMalloc executes successfully, but the problem appears when trying to execute the second cudaMalloc, since the first reservation took more bytes than expected.
What I'm doing now is treating the amount of free memory as 10% less than what is reported... but that's just a magic number that came from nowhere.
"Is there a way to know what's the extra space that cudaMalloc is going to reserve?"
Not without violating CUDA's platform guarantees, no. cudaMalloc() returns a pointer to the requested amount of memory. You can't make any assumptions about the amount of memory that happens to be valid after the end of the requested amount - the CUDA allocator already makes use of suballocators, and unlike CPU-based memory allocators, the data structures to track free lists etc. are not interleaved with the allocated memory. So for example, it would be unwise to assume that the CUDA runtime's guarantees about the alignment of the returned pointers mean anything other than that returned pointers will have a certain alignment.
If you study the CUDA runtime's behavior, that will shed light on the behavior of that particular CUDA runtime, but the behavior may change with future releases and break your code.

CUDA: How to find index of extrema in sub matrices?

I have a large rectangular matrix NxM in GPU memory, stored as 1-dimensional array in row-by-row representation. Let us say that this matrix is actually composed of submatrices of size nxm. For simplicity, assume that N is a multiple of n and same with M and m. Let us say, the data type of the array is float or double.
What is an efficient method to find the index of the extrema in each sub-matrix? For example, how to find the 1-dimensional index of the maximum element of each submatrix and write down those indices in some array.
I can hardly imagine to be so self-confident (or arrogant?) to say that one particular solution is the "most efficient way" to do something.
However, some thoughts (without the claim to cover the "most efficient" solution) :
I think that there are basically two "orthogonal" ways of approaching this
For all sub-matrices in parallel: Find the extremum sequentially
For all sub-matrices sequentially: Find the extremum in parallel
The question of which one is more appropriate probably depends on the sizes of the matrices. You mentioned that "N is a multiple of n" (similarly for M and m). Let's say the N x M matrix is composed of a*b sub-matrices of size n x m.
For the first approach, one could simply let each thread take care of one sub-matrix, with a trivial loop like
for (all elements of my sub-matrix) max = element > max ? element : max;
The prerequisite here is that a*b is "reasonably large". That is, when you can launch this kernel for, let's say, 10000 sub-matrices, then this could already bring a good speedup.
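To make the first approach concrete, here is a host-side C++ sketch (not a CUDA kernel) of the work each thread would do. The function name `tile_argmax` and the row-major layout with N rows and M columns are assumptions taken from the question's description; on the GPU, each iteration of the two outer loops would be one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each n x m tile of a row-major N x M matrix, returns the 1-D index
// (into the full matrix) of the tile's maximum element.
std::vector<std::size_t> tile_argmax(const std::vector<float>& mat,
                                     std::size_t N, std::size_t M,
                                     std::size_t n, std::size_t m) {
    assert(n > 0 && m > 0 && N % n == 0 && M % m == 0);
    assert(mat.size() == N * M);
    const std::size_t tiles_down = N / n;
    const std::size_t tiles_across = M / m;
    std::vector<std::size_t> result(tiles_down * tiles_across);

    // Each (ti, tj) pair is independent: one GPU thread per sub-matrix.
    for (std::size_t ti = 0; ti < tiles_down; ++ti) {
        for (std::size_t tj = 0; tj < tiles_across; ++tj) {
            std::size_t best = (ti * n) * M + tj * m;  // top-left of the tile
            for (std::size_t r = 0; r < n; ++r) {
                for (std::size_t c = 0; c < m; ++c) {
                    const std::size_t idx = (ti * n + r) * M + (tj * m + c);
                    if (mat[idx] > mat[best]) best = idx;
                }
            }
            result[ti * tiles_across + tj] = best;
        }
    }
    return result;
}
```

Note that with this mapping neighbouring threads read addresses that are m elements apart, so the coalescing concern raised in this answer applies directly to this variant.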
In contrast to that, in the second approach, each kernel launch (with all its threads) would take care of one sub-matrix. In this case, the kernel could be a standard "reduction" kernel. (The reduction is often presented as an example for "computing the sum/product of the elements of an array", but it works for any binary associative operation, so instead of computing the sum or product, one can basically use the same kernel for computing the minimum or maximum). So the kernel would be launched for each sub-matrix, and this would only make sense when the sub-matrix is "reasonably large".
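The reduction pattern for a maximum with its index can be emulated on the host as follows. This is only a C++ sketch of the access pattern, not a kernel: the name `argmax_tree` is hypothetical, the input stands in for one sub-matrix already copied into shared memory, and a power-of-two element count is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a single block's tree reduction on the host. Returns the position
// (within `vals`) of the maximum element. Size must be a power of two.
std::size_t argmax_tree(std::vector<float> vals) {
    const std::size_t count = vals.size();
    assert(count > 0 && (count & (count - 1)) == 0);

    std::vector<std::size_t> idx(count);
    for (std::size_t t = 0; t < count; ++t) idx[t] = t;

    // Each pass halves the number of active "threads"; on the GPU a
    // __syncthreads() would separate consecutive passes.
    for (std::size_t stride = count / 2; stride > 0; stride /= 2) {
        for (std::size_t t = 0; t < stride; ++t) {  // parallel on the GPU
            if (vals[t + stride] > vals[t]) {
                vals[t] = vals[t + stride];
                idx[t] = idx[t + stride];
            }
        }
    }
    return idx[0];
}
```

The only change from a sum reduction is that the combining step compares instead of adds and carries the index along with the value.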
However, in both cases, one has to consider the general performance guidelines. Particularly, since in this case, the operation is obviously memory-bound (and not compute-bound), one has to make sure that the accesses to global memory (that is, to the matrix itself) are coalesced, and that the occupancy that is created by the kernel is as high as possible.
EDIT: Of course, one could consider to somehow combine these approaches, but I think that they are at least showing the most important directions of the space of available options.

CUDA shared memory - sum reduction from kernel

I am working on big datasets that are image cubes (450x450x1500). I have a kernel that works on individual data elements. Each data element produces 6 intermediate results (floats). My block consists of 1024 threads. The 6 intermediate results are stored in shared memory by each thread (6 float arrays). However, now I need to add each of the intermediate result to produce a sum (6 sum values). I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code.
Are there any reduction routines that can be called from inside a kernel function on arrays in shared memory?
What will be the best way to solve this problem? I am a newbie to CUDA programming and would welcome any suggestions.
This seems unlikely:
"I do not have enough global memory to save these 6 float arrays to global memory and then run a reduction from thrust or any other library from the host code."
I can't imagine how you have enough space to store your data in shared memory but not in global memory.
Anyway, CUB provides reduction routines that can be called from within a threadblock, and that can operate on data stored in shared memory.
Or you can write your own sum-reduction code. It's not terribly hard to do, there are many questions on SO about it, such as this one.
Or you could adapt the cuda sample code.
Update
After seeing all the comments, I understand that instead of doing one or a few reductions, you need to do 450x450x6 reductions.
In this case there's a simpler solution.
You don't need to implement a relatively complex parallel reduction for each 1500-D vector. Since you already have 450x450x6 vectors to reduce, you could reduce all these vectors in parallel, using a traditional serial reduction within each one.
You could use a block with 16x16 threads to process a particular region of the image, and a grid with 29x29 blocks to cover the whole 450x450 image.
In each thread, you could iterate over the 1500 frames. In each iteration, you could first compute the 6 intermediate results, then add them to the sums. When you finish all the iterations, you could write the 6 sums to global memory.
That finishes the kernel design. And no shared mem is needed.
You will find that the performance is very good. Since it is a memory-bound operation, it won't take much longer than simply accessing all the image cube data once.
In case you don't have enough global mem for the whole cube, you could split it into 4 sub-cubes of [1500][225][225], and call the kernel routine on each sub-cube. The only thing you need to change is the grid size.
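The design above can be sketched on the host in C++ as follows. This is an illustration rather than a kernel: `intermediates` is a hypothetical placeholder because the question does not say how the six values are computed, the [frame][y][x] layout is an assumption, and `blocks_for` simply reproduces the 29x29 grid arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Blocks needed to cover `extent` pixels with `block` threads per dimension.
constexpr std::size_t blocks_for(std::size_t extent, std::size_t block) {
    return (extent + block - 1) / block;
}

// Hypothetical stand-in for the six per-element intermediate results.
std::array<float, 6> intermediates(float v) {
    return {v, 2 * v, 3 * v, 4 * v, 5 * v, 6 * v};
}

// `cube` is laid out as [frame][y][x]. Returns six sums per pixel, laid out
// as [y][x][6]. The body of the (y, x) loops is what one GPU thread would do.
std::vector<float> reduce_over_frames(const std::vector<float>& cube,
                                      std::size_t frames, std::size_t height,
                                      std::size_t width) {
    assert(cube.size() == frames * height * width);
    std::vector<float> sums(height * width * 6, 0.0f);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            std::array<float, 6> acc{};  // would live in registers on the GPU
            for (std::size_t f = 0; f < frames; ++f) {
                const auto r = intermediates(cube[(f * height + y) * width + x]);
                for (std::size_t k = 0; k < 6; ++k) acc[k] += r[k];
            }
            for (std::size_t k = 0; k < 6; ++k)
                sums[(y * width + x) * 6 + k] = acc[k];
        }
    }
    return sums;
}
```

With this layout, threads that are adjacent in x read adjacent addresses within each frame, which is what keeps the global memory reads coalesced.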
Have a look at this that explains parallel reduction in CUDA thoroughly.
If I understand it correctly, each thread should sum up "only" 6 floats.
I'm not sure if it is worth doing that by a parallel reduction in general, in the sense that you will experience performance gains.
If you are targeting a Kepler, you may try to use shuffle operations if you properly set the block size so that your intermediate results fit the Streaming Multiprocessor's registers in some way.
As also pointed out by Robert Crovella, your statement about the possibility of storing the intermediate results seems strange as the amount of global memory is certainly larger than the amount of shared memory.

speed up ideas -- can CUDA help here?

I'm working on an algorithm that has to do a small number
of operations on a large numbers of small arrays, somewhat independently.
To give an idea:
1k sorting of arrays of length typically of 0.5k-1k elements.
1k of LU-solve of matrices that have rank 10-20.
everything is in floats.
Then, there is some horizontality to this problem: the above
operations have to be carried independently on 10k arrays.
Also, the intermediate results need not be stored: for example, i don't
need to keep the sorted arrays, only the sum of the smallest $m$ elements.
The whole thing has been programmed in c++ and runs. My question is:
would you expect a problem like this to enjoy significant speed ups
(factor 2 or more) with CUDA?
You can run this in 5 lines of ArrayFire code. I'm getting speedups of ~6X with this over the CPU. I'm getting speedups of ~4X with this over Thrust (which was designed for vectors, not matrices). Since you're only using a single GPU, you can run ArrayFire Free version.
array x = randu(512,1000,f32);
array y = sort(x); // sort each 512-element column independently

array a = randu(15,15,1000,f32), b;
gfor (array i, a.dim(2))
    b(span,span,i) = lu(a(span,span,i)); // LU-decomposition of each 15x15 matrix
Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32x32 matrices will perform better than a bunch of 31x31.
If you "only" need a factor of 2 speed up I would suggest looking at more straightforward optimisation possibilities first, before considering GPGPU/CUDA. E.g. assuming x86 take a look at using SSE for a potential 4x speed up by re-writing performance critical parts of your code to use 4 way floating point SIMD. Although this would tie you to x86 it would be more portable in that it would not require the presence of an nVidia GPU.
Having said that, there may even be simpler optimisation opportunities in your code base, such as eliminating redundant operations (useless copies and initialisations are a favourite) or making your memory access pattern more cache-friendly. Try profiling your code with a decent profiler to see where the bottlenecks are.
Note however that in general sorting is not a particularly good fit for either SIMD or CUDA, but other operations such as LU decomposition may well benefit.
Just a few pointers, which you may have already incorporated:
1) If you just need the m smallest elements, you are probably better off just searching for the smallest element, removing it, and repeating m times.
2) Did you already parallelize the code on the CPU? OpenMP or so ...
3) Did you think about buying better hardware? (I know it's not the nice thing to do, but if you want to reach performance goals for a specific application it's sometimes the cheapest possibility ...)
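Pointer 1) can be sketched in plain C++ like this. The function name `sum_of_m_smallest` is made up for the illustration; it returns the sum of the m smallest elements, which is the only result the question says it needs to keep.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of the m smallest elements by repeated selection: O(m * n), no sort.
// Takes the array by value because selected elements are removed from it.
float sum_of_m_smallest(std::vector<float> a, std::size_t m) {
    assert(m <= a.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < m; ++i) {
        auto smallest = std::min_element(a.begin(), a.end());
        sum += *smallest;
        // Remove in O(1): overwrite with the last element, then shrink.
        *smallest = a.back();
        a.pop_back();
    }
    return sum;
}
```

This pays off when m is small compared to the array length; for larger m, std::nth_element followed by a sum over the first m entries is a standard alternative that also avoids a full sort.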
If you want to do it on CUDA, it should work conceptually, so no big problems should occur. However, there are always the little things, which depend on experience and so on.
Consider the thrust-library for the sorting thing, hopefully someone else can suggest some good LU-decomposition algorithm.

matrix multiplication in cuda

say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.
a) one thread to calculate each element of the result matrix. So I have a loop in thread multiplies one row and one column.
b) one thread to do each multiplication. Each element of the result matrix requires 50 threads. After multiplications are done, I can use binary reduction to sum the results.
I wasn't sure which way to take, so I took b. It wasn't ideal. In fact it was slow. Any idea why? My guess would be there are just too many threads and they are waiting for resource most of time, is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarithmic number of adds. That's three memory accesses for a mult and an add and a bit - the arithmetic intensity is very low. The good news is that there are many, many threads' worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the memory access to work ratio is poor.
The simple one-thread-per-dot-product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction, which doesn't scale as well and requires a lot of synchronization; the downside is that there are far fewer threads now, which at least your (b) approach had working for it.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matrices, there's N^3 work to do the multiplication, but only 3xN^2 elements - so you should be able to find a way to do far more than 1 computation per 2ish memory accesses.
The approach taken in the CUDA SDK is the best way - the matrices are broken into tiles, and your (a) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling entire little sub-matrices from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful approach in lots of applications, because getting data - whether it's over a network, or from main memory for a CPU, or off-chip access for a GPU - often takes much longer than processing the data.
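The tiling idea can be illustrated with a host-side C++ sketch. This is not the SDK's matrixMul kernel: the name `matmul_tiled` and the default tile width of 16 are assumptions made for the example. On the GPU, each (bi, bj) output tile would be handled by one thread block, with the A and B tiles of each bk step staged in shared memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C = A * B for row-major N x N matrices, processed tile by tile.
std::vector<float> matmul_tiled(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t N, std::size_t TILE = 16) {
    assert(TILE > 0 && A.size() == N * N && B.size() == N * N);
    std::vector<float> C(N * N, 0.0f);
    for (std::size_t bi = 0; bi < N; bi += TILE) {
        for (std::size_t bj = 0; bj < N; bj += TILE) {
            for (std::size_t bk = 0; bk < N; bk += TILE) {
                // Within this step, each element of the A tile is reused for
                // up to TILE columns and each element of the B tile for up
                // to TILE rows, instead of being fetched once per product.
                const std::size_t i_end = std::min(bi + TILE, N);
                const std::size_t j_end = std::min(bj + TILE, N);
                const std::size_t k_end = std::min(bk + TILE, N);
                for (std::size_t i = bi; i < i_end; ++i) {
                    for (std::size_t j = bj; j < j_end; ++j) {
                        float acc = C[i * N + j];
                        for (std::size_t k = bk; k < k_end; ++k)
                            acc += A[i * N + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
                }
            }
        }
    }
    return C;
}
```

For a 50 by 50 problem as in the question, a tile width of 16 leaves partial tiles at the edges, which the std::min bounds handle here; a real kernel needs the equivalent bounds checks or padding.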
There are documents on NVIDIA's CUDA pages (esp. http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
Have you looked at the CUDA documentation: Cuda Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.