I have a large rectangular matrix NxM in GPU memory, stored as 1-dimensional array in row-by-row representation. Let us say that this matrix is actually composed of submatrices of size nxm. For simplicity, assume that N is a multiple of n and same with M and m. Let us say, the data type of the array is float or double.
What is an efficient method to find the index of the extrema in each sub-matrix? For example, how to find the 1-dimensional index of the maximum element of each submatrix and write down those indices in some array.
I can hardly imagine to be so self-confident (or arrogant?) to say that one particular solution is the "most efficient way" to do something.
However, some thoughts (without the claim to cover the "most efficient" solution) :
I think that there are basically two "orthogonal" ways of approaching this
For all sub-matrices in parallel: Find the extremum sequentially
For all sub-matrices sequentially: Find the extremum in parallel
The question which one is more appropriate probably depends on the sizes of the matrices. You mentioned that "N is a multiple of n" (similarly for M and m). Let's the matrix of size M x N is composed of a*b sub-matrices of size m x n.
For the first approach, one could simply let each thread take care of one sub-matrix, with a trivial loop like
for (all elements of my sub-matrix) max = element > max ? element : max;
The prerequisite here is that a*b is "reasonably large". That is, when you can launch this kernel for, let's say, 10000 sub-matrices, then this could already bring a good speedup.
In contrast to that, in the second approach, each kernel (with all its threads) would take care of one sub-matrix. In this case, the kernel could be a standard "reduction" kernel. (The reduction is often presented an example for "computing the sum/product of the elements of an array", but it works for any binary associative operation, so instead of computing the sum or product, one can basically use the same kernel for computing the minimum or maximum). So the kernel would be launched for each sub-matrix, and this would only make sense when the sub-matrix is "reasonably large".
However, in both cases, one has to consider the general performance guidelines. Particularly, since in this case, the operation is obviously memory-bound (and not compute-bound), one has to make sure that the accesses to global memory (that is, to the matrix itself) are coalesced, and that the occupancy that is created by the kernel is as high as possible.
EDIT: Of course, one could consider to somehow combine these approaches, but I think that they are at least showing the most important directions of the space of available options.
Related
In CUDA programming, threads and blocks have multiple directions (x, y and z).
Until now, I ignored this and only took into account the x direction (threadIdx.x, blockIdx.x, blockDim.x, etc.).
Apparently, both threads within a block and blocks on the grid are arranged as a cube. However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that? Only using the x direction, am I able to address all threads available to my GPU?
Only using the x direction, am I able to address all threads available to my GPU?
If we are talking about a desire to spin up ~2 trillion threads or less, then there is no particular requirement to use a multidimensional block, or grid. All CUDA GPUs of compute capability 3.0 and higher can launch up to about 2 billion blocks (2^31-1) with 1024 threads each, using a 1-D grid organization.
With methodologies like grid-stride loop it seems rare to me that more than ~2 trillion threads would be needed.
I claim without formal proof that any problem that can be realized in a 1D grid can be realized in a 2D or 3D grid, or vice versa. This is just a mathematical mapping from one realization to another. Furthermore, it should be possible to arrange for important by-products like coalesced access in either realization.
There may be some readability benefits, code complexity benefits, and possibly small performance considerations when realizing in a 1D or multi-dimensional way. The usual case for this that I can think of is when the data to be processed is "inherently" multi-dimensional. In this case, letting the CUDA engine generate 2 or 3 distinct indices for you:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
might be simpler than using a 1D grid index, and computing 2D data indices from those:
int tid = threadIdx.x+blockDim.x*blockIdx.x;
int idx = tid%DATA_WIDTH;
int idy = tid/DATA_WIDTH;
(the integer division operation above is unavoidable in the general case. The modulo operation can be simplified by using the result from the integer division.)
It's arguably an extra line of code and an extra division operation required to get to the same point, when only a 1D grid is created. However I would suggest that even this is small potatoes, and you should use whichever approach seems most reasonable and comfortable to you as a programmer.
If for some reason you desire to spin up more than ~2 Trillion threads, then moving to a multidimensional grid, at least, is unavoidable.
Apparently, both threads within a block and blocks on the grid are arranged as a cube.
To understand how the threadblock thread index is computed in any case, I refer you to the programming guide. It should be evident that one case can be made equivalent to another - each thread gets a unique thread ID no matter how you specify the threadblock dimensions. In my opinion, a threadblock should only be thought of as a "cube" of threads (i.e. 3-dimensional) if you specify the configuration that way:
dim3 block(32,8,4); //for example
However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that?
If you only used a single threadblock dimension to create a thread index in the 32,8,4 case:
int tid = threadIdx.x;
then you certainly would be "addressing" multiple threads (in y, and z) using that approach. That would typically, in my experience, be "broken" code. Therefore a kernel designed to use a multidimensional block or grid may not work correctly if the block or grid is specified as 1 dimensional, and the reverse statement is also true. You can find examples of such problems (thread index calculation not being correct for the grid design) here on the cuda tag.
I am new to cuda and learning GPU programming. I want to add two nxm matrices (float* A and float* B) and store the results in float* C in the kernel. The goal is to get the fastest implementation. I have the following questions:
I was wondering how to arange the blocks and grid to get the best performance ( for both small and large n and m)
It is good to assign one thread to each element of matrices. However, for large n and m it is not possible. What is the best option then?
How can matrix padding improve the performance?
1: A simple method would be to store the matrix as a vector/array of floats where the rows are concatenated. Then you could just use a large number of threads per block and the smallest necessary number of blocks. Here is an example how the kernel could look like.
2: You basically can have a infinite number of threads, as long as the size of the matrix doesn't exceed the free memory on your GPU. They won't be executed simultaneously, but the driver will schedule them for you and you don't have to care about it.
A thread per element generally works good, if you want to try another way, have a look at Grid Stride Loops, which is a scalable method to organize your elements in less threads.
3: I don't see how padding would improve the performance as you get more elements to copy and calculate, but I'm no expert for that.
What is the fastest way to move data that is on the device around in CUDA?
What I need to do is basically copy continuous sub-rows and sub-columns (of which I have the indexes on the device) from row-major matrices into new smaller matrices, but from what I've observed, memory access in CUDA is not particularly efficient, as it seems the cores are optimized to do computation rather that memory stuff.
Now the CPU seems to be pretty good at doing sequential stuff like moving rows of aligned memory from a place to another.
I see three options:
make a kernel that does the memory copying
outside a kernel, call cudaMemcpy(.., device to device) for each position (terribly slow for columns I would guess)
move the memory to the host, create the new smaller matrix and send it back on the device
Now I could test this on my specific gpu, but given its specs I don't think it would be representative. In general, what is recommended?
Edit:
I'm essentially multiplying two matrices A,B but I'm only interested in multiplying the X elements:
A =[[XX XX]
[ XX XX ]
[XX XX ]]
with the corresponding elements in the columns of B. The XX are always of the same length and I know their positions (and there's a fixed number of them per row).
If you have a matrix storage pattern that involves varying spacing between corresponding row elements (or corresponding column elements), none of the input transformation or striding capabilities of cublas will help, and none of the api striding-copy functions (such as cudaMemcpy2D) will help.
You'll need to write your own kernel to gather the data, before feeding it to cublasXgemm. This should be fairly trivial to do, if you have the locations of the incoming data elements listed in a vector or otherwise listed.
I'm working on an algorithm that has to do a small number
of operations on a large numbers of small arrays, somewhat independently.
To give an idea:
1k sorting of arrays of length typically of 0.5k-1k elements.
1k of LU-solve of matrices that have rank 10-20.
everything is in floats.
Then, there is some horizontality to this problem: the above
operations have to be carried independently on 10k arrays.
Also, the intermediate results need not be stored: for example, i don't
need to keep the sorted arrays, only the sum of the smallest $m$ elements.
The whole thing has been programmed in c++ and runs. My question is:
would you expect a problem like this to enjoy significant speed ups
(factor 2 or more) with CUDA?
You can run this in 5 lines of ArrayFire code. I'm getting speedups of ~6X with this over the CPU. I'm getting speedups of ~4X with this over Thrust (which was designed for vectors, not matrices). Since you're only using a single GPU, you can run ArrayFire Free version.
array x = randu(512,1000,f32);
array y = sort(x); // sort each 512-element column independently
array x = randu(15,15,1000,f32), y;
gfor (array i, x.dim(2))
y(span,span,i) = lu(x(span,span,i)); // LU-decomposition of each 15x15 matrix
Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32x32 matrices will perform better than a bunch of 31x31.
If you "only" need a factor of 2 speed up I would suggest looking at more straightforward optimisation possibilities first, before considering GPGPU/CUDA. E.g. assuming x86 take a look at using SSE for a potential 4x speed up by re-writing performance critical parts of your code to use 4 way floating point SIMD. Although this would tie you to x86 it would be more portable in that it would not require the presence of an nVidia GPU.
Having said that, there may even be simpler optimisation opportunities in your code base, such as eliminating redundant operations (useless copies and initialisations are a favourite) or making your memory access pattern more cache-friendly. Try profiling your code with a decent profiler to see where the bottlenecks are.
Note however that in general sorting is not a particularly good fit for either SIMD or CUDA, but other operations such as LU decomposition may well benefit.
Just a few pointers, you maybe already incorporated:
1) If you just need the m smallest elements, you are probably better of to just search the smallest element, remove it and repeat m - times.
2) Did you already parallelize the code on the cpu? OpenMP or so ...
3) Did you think about buying better hardware? (I know it´s not the nice think to do, but if you want to reach performance goals for a specific application it´s sometimes the cheapest possibility ...)
If you want to do it on CUDA, it should work conceptually, so no big problems should occur. However, there are always the little things, which depend on experience and so on.
Consider the thrust-library for the sorting thing, hopefully someone else can suggest some good LU-decomposition algorithm.
say I want to multiply two matrices together, 50 by 50. I have 2 ways to arrange threads and blocks.
a) one thread to calculate each element of the result matrix. So I have a loop in thread multiplies one row and one column.
b) one thread to do each multiplication. Each element of the result matrix requires 50 threads. After multiplications are done, I can use binary reduction to sum the results.
I wasn't sure which way to take, so I took b. It wasn't ideal. In fact it was slow. Any idea why? My guess would be there are just too many threads and they are waiting for resource most of time, is that true?
As with so many things in high performance computing, the key to understanding performance here is understanding the use of memory.
If you are using one thread do to do one multiplication, then for that thread you have to pull two pieces of data from memory, multiply them, then do some logarthmic number of adds. That's three memory accesses for a mult and an add and a bit - the arithmatic intensity is very low. The good news is that there are many many threads worth of tasks this way, each of which only needs a tiny bit of memory/registers, which is good for occupancy; but the memory access to work ratio is poor.
The simple one thread doing one dot product approach has the same sort of problem - each multiplication requires two memory accesses to load. The good news is that there's only one store to global memory for the whole dot product, and you avoid the binary reduction which doesn't scale as well and requires a lot of synchronization; the down side is there's way less threads now, which at least your (b) approach had working for you.
Now you know that there should be some way of doing more operations per memory access than this; for square NxN matricies, there's N^3 work to do the multiplication, but only 3xN^2 elements - so you should be able to find a way to do way more than 1 computation per 2ish memory accesses.
The approach taken in the CUDA SDK is the best way - the matricies are broken into tiles, and your (b) approach - one thread per output element - is used. But the key is in how the threads are arranged. By pulling in entire little sub-matricies from slow global memory into shared memory, and doing calculations from there, it's possible to do many multiplications and adds on each number you've read in from memory. This approach is the most successful approach in lots of applications, because getting data - whether it's over a network, or from main memory for a CPU, or off-chip access for a GPU - often takes much longer than processing the data.
There's documents in NVidia's CUDA pages (esp http://developer.nvidia.com/object/cuda_training.html ) which describe their SDK example very nicely.
Have you looked at the CUDA documentation: Cuda Programming Model
Also, sample source code: Matrix Multiplication
Did you look at
$SDK/nvidia-gpu-sdk-3.1/C/src/matrixMul
i.e. the matrix multiplication example in the SDK?
If you don't need to implement this yourself, just use a library -- CUBLAS, MAGMA, etc., provide tuned matrix multiplication implementations.