I have n (very large) independent linear systems (Ax = b_i). They all have the same A, but b_i is different for (i = 1, ..., n). I want to solve these n systems in parallel in CUDA.
I was thinking it might be most efficient to do the LU factorization of A on the host and then copy the factored A to the GPU's constant memory (because even if I did the LU on the device, only one thread would do it while the others sit idle; besides, constant memory is faster). Is there a better way to do this?
Another issue is that while all threads solve their systems at the same time with the same algorithm, they all access the same memory location (A[i]) at the same time, so the accesses are not coalesced. How can I optimize this?
(This is assuming A is a stably invertible n x n matrix.)
Don't solve a much harder problem just because it seems to parallelize better
Let B be the matrix whose columns are b_1 ... b_n. Under our assumptions about A, you actually need to solve the equation A X = B for an n x n matrix of variables, i.e. your solution is A^{-1}B.
So basically you have one matrix inversion and one matrix multiplication. This holds regardless of what software and hardware you're going to use. For the inversion and multiplication, just use cuBLAS, cuSPARSE, cuSOLVER, ArrayFire, or whatever solves these things the fastest.
You could do both of them together, I suppose, but I'm not sure there are optimizations for that.
In CUDA programming, threads and blocks have multiple directions (x, y and z).
Until now, I ignored this and only took into account the x direction (threadIdx.x, blockIdx.x, blockDim.x, etc.).
Apparently, both threads within a block and blocks on the grid are arranged as a cube. However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that? Only using the x direction, am I able to address all threads available to my GPU?
Only using the x direction, am I able to address all threads available to my GPU?
If we are talking about a desire to spin up ~2 trillion threads or less, then there is no particular requirement to use a multidimensional block, or grid. All CUDA GPUs of compute capability 3.0 and higher can launch up to about 2 billion blocks (2^31-1) with 1024 threads each, using a 1-D grid organization.
With methodologies like grid-stride loop it seems rare to me that more than ~2 trillion threads would be needed.
I claim without formal proof that any problem that can be realized in a 1D grid can be realized in a 2D or 3D grid, or vice versa. This is just a mathematical mapping from one realization to another. Furthermore, it should be possible to arrange for important by-products like coalesced access in either realization.
There may be readability benefits, code complexity benefits, and possibly small performance considerations when realizing the grid in a 1D or a multi-dimensional way. The usual case I can think of for a multi-dimensional grid is when the data to be processed is "inherently" multi-dimensional. In this case, letting the CUDA engine generate 2 or 3 distinct indices for you:
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
might be simpler than using a 1D grid index, and computing 2D data indices from those:
int tid = threadIdx.x+blockDim.x*blockIdx.x;
int idx = tid%DATA_WIDTH;
int idy = tid/DATA_WIDTH;
(The integer division operation above is unavoidable in the general case. The modulo operation can be avoided by reusing the result of the integer division: idx = tid - idy*DATA_WIDTH.)
It's arguably an extra line of code and an extra division operation required to get to the same point, when only a 1D grid is created. However I would suggest that even this is small potatoes, and you should use whichever approach seems most reasonable and comfortable to you as a programmer.
If for some reason you desire to spin up more than ~2 trillion threads, then moving to a multidimensional grid, at least, is unavoidable.
Apparently, both threads within a block and blocks on the grid are arranged as a cube.
To understand how the threadblock thread index is computed in any case, I refer you to the programming guide. It should be evident that one case can be made equivalent to another - each thread gets a unique thread ID no matter how you specify the threadblock dimensions. In my opinion, a threadblock should only be thought of as a "cube" of threads (i.e. 3-dimensional) if you specify the configuration that way:
dim3 block(32,8,4); //for example
However, if this is the case, why is it enough to specify the x direction? Would I not address multiple threads like that?
If you only used a single threadblock dimension to create a thread index in the 32,8,4 case:
int tid = threadIdx.x;
then multiple threads (differing only in y and z) would compute the same index, so you certainly would be "addressing" multiple threads with one index using that approach. That would typically, in my experience, be "broken" code. Therefore a kernel designed to use a multidimensional block or grid may not work correctly if the block or grid is specified as 1-dimensional, and the reverse statement is also true. You can find examples of such problems (the thread index calculation not matching the grid design) here on the cuda tag.
I was wondering what the fastest way of computing a sparse matrix-vector product y = Ax in CUDA on multiple (let say n) GPUs is.
My naive approach would be to divide the vectors x and y into n chunks, one chunk per GPU. Then I would also split up the matrix A into n^2 smaller blocks A_ij and compute
y_i = \sum_j A_{i,j} x_j, // GPU j stores A_{i,j} and x_j, result is copied
// to and summed up on GPU i
on the different GPUs j=1..n with let's say cuSPARSE. Would this work? With the unified memory architecture, in principle all GPUs should be able to access the global memory.
Is the memory transfer between the GPUs going to be incredibly slow? I don't expect a large speed up but I was wondering if it is going to be slower than doing the matrix-vector multiplication on 1 single GPU.
I would suggest a different approach. Don't break up the vector x into chunks. Transfer x to all GPUs.
Break up the A matrix according to rows. So, for example, if A had 9 rows, and you have 3 GPUs, then transfer rows 1-3 of A to the first GPU, 4-6 of A to the second GPU, and 7-9 of A to the third GPU.
Then compute the 3 individual pieces of y on the 3 GPUs:
y[1-3] = A[1-3]*x
y[4-6] = A[4-6]*x
y[7-9] = A[7-9]*x
Each of those 3 operations could be done with cusparse<T>csrmv, for example (or CUB now has an spmv routine also).
Reassembly of the y vector should be trivial (concatenation).
There is no need for inter-GPU data transfer during the computation, only a transfer of the results (y) at the end.
A possible "optimization" would be to partition A based on "work" rather than naively by rows. But the benefit of this would depend on the structure of A, so would require analysis. A simplistic approach to this optimization could be to just break up A based on (approximately) equalizing the number of NZ elements in each chunk.
I have a large rectangular N x M matrix in GPU memory, stored as a 1-dimensional array in row-major order. Let us say that this matrix is actually composed of sub-matrices of size n x m. For simplicity, assume that N is a multiple of n, and likewise M of m. Let us say the data type of the array is float or double.
What is an efficient method to find the index of the extrema in each sub-matrix? For example, how to find the 1-dimensional index of the maximum element of each submatrix and write down those indices in some array.
I can hardly imagine being so self-confident (or arrogant?) as to say that one particular solution is the "most efficient way" to do something.
However, here are some thoughts (without claiming to cover the "most efficient" solution):
I think that there are basically two "orthogonal" ways of approaching this
For all sub-matrices in parallel: Find the extremum sequentially
For all sub-matrices sequentially: Find the extremum in parallel
The question of which one is more appropriate probably depends on the sizes of the matrices. You mentioned that "N is a multiple of n" (similarly for M and m). Let's say the matrix of size N x M is composed of a*b sub-matrices of size n x m.
For the first approach, one could simply let each thread take care of one sub-matrix, with a trivial loop like
for (all elements of my sub-matrix) max = element > max ? element : max;
The prerequisite here is that a*b is "reasonably large". That is, when you can launch this kernel for, let's say, 10000 sub-matrices, then this could already bring a good speedup.
In contrast, in the second approach, each kernel launch (with all its threads) would take care of one sub-matrix. In this case, the kernel could be a standard "reduction" kernel. (The reduction is often presented as an example of "computing the sum/product of the elements of an array", but it works for any binary associative operation, so instead of computing the sum or product, one can basically use the same kernel for computing the minimum or maximum.) So the kernel would be launched once for each sub-matrix, and this would only make sense when each sub-matrix is "reasonably large".
However, in both cases, one has to consider the general performance guidelines. Particularly, since in this case, the operation is obviously memory-bound (and not compute-bound), one has to make sure that the accesses to global memory (that is, to the matrix itself) are coalesced, and that the occupancy that is created by the kernel is as high as possible.
EDIT: Of course, one could consider to somehow combine these approaches, but I think that they are at least showing the most important directions of the space of available options.
I have to multiply a very small matrix (size 10x10) with a vector many times: 50,000 to 100,000 times (it could even be more than that). This happens for 1000 different matrices (could be many more). Would there be any significant performance gain from doing this operation in CUDA?
Yes, it's an ideal task for the GPU.
If you want to multiply a single matrix with a vector 50K times and each multiplication depends on the result of the previous one, then don't use CUDA. That's a serial problem, best suited to the CPU. However, if each multiplication is independent, you can perform them simultaneously in CUDA.
The only case where your program will give a tremendous speedup is when each vector multiplication iteration is independent of the data of the other iterations. That way you'll be able to launch 50K or more iterations simultaneously by launching an equal number of threads.
Depending on what exactly you're doing, then yes, this could be done very quickly on a GPU, but you might have to run your own kernel to get some good performance from it.
Without knowing more about your problem, I can't give you too much advice. But I could speculate on a solution:
If you're taking one vector and multiplying it by the same matrix several thousand times, you would be much better off finding the closed form of the matrix raised to an arbitrary power. You can do this using the Cayley–Hamilton theorem or the Jordan canonical form.
I can't seem to find an implementation of this from a quick googling, but considering I did this in first-year linear algebra, it's not too bad. Some info on the Jordan normal form and raising it to powers can be found at http://en.wikipedia.org/wiki/Jordan_normal_form#Powers ; its transformation matrix is just a matrix of (generalized) eigenvectors, together with the inverse of that matrix.
Say you have a matrix A, and you find the Jordan normal form J, and the transformation matrices P, P^-1, you find
A^n = P J^n P^-1
I can't seem to find a good link to an implementation of this, but computing the closed form of a 10x10 matrix power would be significantly less time-consuming than 50,000 matrix multiplications. And an implementation of this would probably run much quicker on a CPU.
If this is your problem, you should look into this.
I'm trying to figure out the best way to solve a pentadiagonal matrix. Is there something faster than gaussian elimination?
You should do an LU or Cholesky decomposition of the matrix, depending on whether your matrix is Hermitian positive definite, and then do back substitution with the factors. This is essentially just Gaussian elimination, but it tends to have better numerical properties. I recommend using LAPACK, since those implementations tend to be the fastest and the most robust. Look at the _GBSV routines, where the blank is one of s, d, c, or z, depending on your number type.
Edit: In case you're asking whether there is an algorithm faster than the factor/solve (Gaussian elimination) method: no, there is not. A specialized factorization routine for a banded matrix takes about 4*n*k^2 operations (where k is the bandwidth), while the back substitution takes about 6*n*k operations. Thus, for fixed bandwidth, you cannot do better than linear time in n.