Solve linear equation AX=B - CUDA

I currently solve the equation Ax=b twice,
where A is a sparse NxN matrix and
x, b are vectors of size N. (I have two right-hand sides, b1 and b2.)
I want to cut the time by solving both of them in one shot using cuSPARSE functions,
so what I thought is to build one Nx2 matrix B from the two b vectors I have and solve AX=B instead.
Is this theoretically correct?
Which cuSPARSE function should I use?
Please note that I'm working with a sparse matrix, not a dense one.
Thanks!

To answer your questions:
Yes, it is possible to solve a suitably well conditioned sparse problem for multiple RHS vectors in this way.
Unless your LHS sparse matrix is either tridiagonal or triangular, you can't use cuSPARSE directly for this.
cuSOLVER 7.5 contains several "low level" routines for factorising sparse matrices, meaning that you can factorise once and reuse the factorisation several times with different RHS. For example, cusolverSpXcsrluSolve() can be called after an LU factorisation to solve with the same precomputed factorisation as many times as you require. (Note that I originally assumed there was a sparse getrs-like function in cuSOLVER, and it appears there isn't. I certainly talked to NVIDIA about the use case for one some years ago and thought they had added it; sorry for the confusion there.)
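As a rough sketch of what the two-solve version can look like (the library calls below are real cuSOLVER/cuSPARSE entry points, but the wrapper function, variable names and the assumption that all arrays already live on the device are mine), here is the high-level sparse QR solver applied to the same A for b1 and b2. Each call refactorises A from scratch; the low-level csrlu*/csrqr* routines in cusolverSp_LOWLEVEL_PREVIEW.h are what let you factorise once and reuse the factors.

```cpp
#include <cusolverSp.h>
#include <cusparse.h>

// Sketch: solve A*x1 = b1 and A*x2 = b2 for the same sparse A (CSR, 0-based
// indexing, all pointers on the device) with cuSOLVER's high-level sparse QR.
// Error checking omitted for brevity.
void solveTwoRhs(int n, int nnz,
                 const double* d_csrVal, const int* d_csrRowPtr,
                 const int* d_csrColInd,
                 const double* d_b1, const double* d_b2,
                 double* d_x1, double* d_x2)
{
    cusolverSpHandle_t handle;
    cusparseMatDescr_t descrA;
    cusolverSpCreate(&handle);
    cusparseCreateMatDescr(&descrA);
    cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);

    const double tol = 1e-12;   // singularity tolerance
    const int reorder = 0;      // no fill-reducing reordering
    int singularity = 0;        // set to -1 by the solver if A is invertible

    cusolverSpDcsrlsvqr(handle, n, nnz, descrA, d_csrVal, d_csrRowPtr,
                        d_csrColInd, d_b1, tol, reorder, d_x1, &singularity);
    cusolverSpDcsrlsvqr(handle, n, nnz, descrA, d_csrVal, d_csrRowPtr,
                        d_csrColInd, d_b2, tol, reorder, d_x2, &singularity);

    cusparseDestroyMatDescr(descrA);
    cusolverSpDestroy(handle);
}
```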

Related

Writing a Discrete Fourier Transform program

I would like to write a DFT program using FFT.
This is actually used for very large matrix-vector multiplication (10^8 * 10^8), which is simplified to a vector-to-vector convolution, and further reduced to a Discrete Fourier Transform.
May I ask whether the DFT is exact? The matrix has only discrete binary elements, and the multiplication would not tolerate any non-zero error in the result. However, from what I have learnt about the DFT so far, it seems to be an approximation algorithm?
Also, may I ask roughly how long the code would be? I.e., is this something I could write from scratch in C++ in perhaps one or two hundred lines? This is actually for a paper, and all I need is that the complexity is O(n log n); the constant in front doesn't really matter :) So the simplest implementation would be best. (I did see packages like kissfft and FFTW, but they are very lengthy and probably overkill for my purpose...)
A canonical radix-2 FFT can be written in fewer than 200 lines of C++. The average numerical error grows roughly as O(log N), so you will need a large enough numeric type and data scale factor to account for this.
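For illustration, a minimal recursive radix-2 Cooley-Tukey FFT of this kind might look as follows (function and variable names are mine; it assumes the input length is a power of two and makes no attempt at performance):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// In-place recursive radix-2 Cooley-Tukey FFT.
// Precondition: a.size() is a power of two.
void fft(std::vector<std::complex<double>>& a)
{
    const std::size_t n = a.size();
    if (n <= 1) return;

    // Split into even- and odd-indexed halves and transform each recursively.
    std::vector<std::complex<double>> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even);
    fft(odd);

    // Combine with the twiddle factors exp(-2*pi*i*k/n).
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const std::complex<double> t =
            std::polar(1.0, -2.0 * pi * static_cast<double>(k) / static_cast<double>(n)) * odd[k];
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}
```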
You can compute exact (error-free) convolutions using the Number Theoretic Transform (NTT). It uses integer roots of unity to compute the discrete Fourier transform over integer rings/fields, so the only caveat is that the signal needs to be integer valued.
Its implementation is roughly the same size as the FFT, but a little faster. You can find my implementation of it at finitetransform.sourceforge.net as the NTTW sub-library. The APFloat library might be more relevant to your needs, as they do multiplication of large numbers using convolutions.
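Not the NTTW code itself, but as a sketch of the idea: a recursive radix-2 NTT over Z/p with the commonly used prime p = 119*2^23 + 1 = 998244353 (which has primitive root 3) has exactly the same shape as the FFT above, only with modular arithmetic in place of complex arithmetic. Names and the choice of prime are my own.

```cpp
#include <cstdint>
#include <vector>

// Radix-2 number theoretic transform over Z/p, p = 998244353.
// Precondition: a.size() is a power of two dividing 2^23.
namespace {
const std::uint64_t MOD = 998244353;

std::uint64_t powMod(std::uint64_t base, std::uint64_t exp)
{
    std::uint64_t result = 1;
    base %= MOD;
    while (exp > 0) {
        if (exp & 1) result = result * base % MOD;
        base = base * base % MOD;
        exp >>= 1;
    }
    return result;
}
}  // namespace

void ntt(std::vector<std::uint64_t>& a, bool inverse)
{
    const std::size_t n = a.size();
    if (n <= 1) return;

    std::vector<std::uint64_t> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    ntt(even, inverse);
    ntt(odd, inverse);

    // n-th root of unity in Z/p (or its inverse for the inverse transform).
    std::uint64_t root = powMod(3, (MOD - 1) / n);
    if (inverse) root = powMod(root, MOD - 2);

    std::uint64_t w = 1;
    for (std::size_t k = 0; k < n / 2; ++k) {
        const std::uint64_t t = w * odd[k] % MOD;
        a[k]         = (even[k] + t) % MOD;
        a[k + n / 2] = (even[k] + MOD - t) % MOD;
        w = w * root % MOD;
    }
    // For the inverse transform, multiply every element by n^-1 mod p
    // (i.e. powMod(n, MOD - 2)) after the top-level call returns.
}
```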

Special Case of Matrix multiplication Using CUDA

I am searching for special CUDA functions dedicated to a particular kind of dense matrix multiplication: A*B, where A is 6*n, B is n*6 and n is very large (n = 2^24). I have tried CUBLAS and some other libraries on this example. In CUBLAS this example uses only 6*6 = 36 threads, which is far from the full parallelism of the GPU, so I split A and B into sub-vectors and implemented a dot product for each pair, which improved performance considerably. The problem is that this requires launching 36 CUDA kernels, and between them a lot of the same data is read several times from the GPU's global memory. So I am asking whether there is a better solution to this kind of problem.
I have recently written such a matrix multiplication routine for a client of mine. The trick is to extract more parallelism by splitting the long inner summation into several smaller ones. Then use a separate kernel launch to calculate the full sum from the partial ones.
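A minimal sketch of that two-stage scheme (kernel names, block layout and the assumption of row-major double-precision storage are mine, not the routine mentioned above): the first kernel gives each block one of the 36 outputs and one strided slice of the inner dimension, and the second kernel sums the per-block partials.

```cuda
// Stage 1: each block computes a partial dot product for one of the 36
// output entries over a strided slice of the inner dimension.
// A is 6 x n (row-major), B is n x 6 (row-major).
// Launch as <<<dim3(numChunks, 36), threads, threads * sizeof(double)>>>
// with 'threads' a power of two; 'partial' must hold 36 * numChunks doubles.
__global__ void partialProducts(const double* A, const double* B,
                                double* partial, int n)
{
    const int out = blockIdx.y;      // output index i*6 + j
    const int i   = out / 6;
    const int j   = out % 6;

    extern __shared__ double sdata[];

    double sum = 0.0;
    for (int k = blockIdx.x * blockDim.x + threadIdx.x; k < n;
         k += blockDim.x * gridDim.x)
        sum += A[i * n + k] * B[k * 6 + j];

    // Block-wide tree reduction of the per-thread sums.
    sdata[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[out * gridDim.x + blockIdx.x] = sdata[0];
}

// Stage 2: sum the partials into the 6x6 result C (36 doubles).
// Launch with at least 36 threads in total.
__global__ void reducePartials(const double* partial, double* C, int numChunks)
{
    const int out = blockIdx.x * blockDim.x + threadIdx.x;
    if (out >= 36) return;
    double sum = 0.0;
    for (int c = 0; c < numChunks; ++c)
        sum += partial[out * numChunks + c];
    C[out] = sum;
}
```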

Should I use CUDA here?

I have to multiply a very small matrix (10x10) with a vector 50,000 to 100,000 times (could even be more than that). This happens for 1000 different matrices (could be many more). Would there be any significant performance gain by doing this operation on CUDA?
Yes, it's an ideal task for the GPU.
If you want to multiply a single matrix with a vector 50K times and each multiplication depends on the result of the previous one, then don't use CUDA. That is a serial problem, best suited to the CPU. However, if the multiplications are independent of one another, you can run them simultaneously on CUDA.
Your program will only give a tremendous speedup when each vector multiplication is independent of the data of the other iterations. In that case you can launch 50K or more multiplications at once by launching an equal number of threads.
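As a rough sketch of that independent case (memory layout, names and launch sizes are my own assumptions): give each thread one (matrix, vector) pair, so one thread per product covers all 50K+ multiplications at once.

```cuda
// One thread per (matrix, vector) pair.  Matrices are stored back to back in
// row-major order (100 floats each); vectors and results take 10 floats each.
__global__ void batchedMatVec10(const float* mats, const float* vecs,
                                float* out, int numPairs)
{
    const int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPairs) return;

    // size_t offsets so very large batches don't overflow int arithmetic.
    const float* A = mats + static_cast<size_t>(p) * 100;
    const float* x = vecs + static_cast<size_t>(p) * 10;
    float*       y = out  + static_cast<size_t>(p) * 10;

    for (int i = 0; i < 10; ++i) {
        float s = 0.0f;
        for (int j = 0; j < 10; ++j)
            s += A[i * 10 + j] * x[j];
        y[i] = s;
    }
}

// Host-side launch, e.g.:
//   const int threads = 256;
//   const int blocks  = (numPairs + threads - 1) / threads;
//   batchedMatVec10<<<blocks, threads>>>(d_mats, d_vecs, d_out, numPairs);
```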
Depending on what exactly you're doing, yes, this could be done very quickly on a GPU, but you might have to write your own kernel to get good performance from it.
Without knowing more about your problem, I can't give you too much advice. But I could speculate on a solution:
If you're taking one vector and multiplying it by the same matrix several thousand times, you would be much better off finding the closed form of the matrix raised to an arbitrary power. You can do this using the Cayley–Hamilton theorem or the Jordan canonical form.
I can't seem to find an implementation of this from a quick googling, but considering I did this in first-year linear algebra, it's not too bad. Some info on the Jordan normal form and raising it to powers can be found at http://en.wikipedia.org/wiki/Jordan_normal_form#Powers; the transformation matrix is just a matrix of eigenvectors, together with its inverse.
Say you have a matrix A, its Jordan normal form J, and the transformation matrices P and P^-1; then you find
A^n = P J^n P^-1
I can't seem to find a good link to an implementation of this, but computing the closed form of a 10x10 matrix would be significantly less time consuming than 50,000 matrix multiplications. And an implementation of this would probably run much quicker on a CPU.
If this is your problem, you should look into this.
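If working out the Jordan form is more trouble than it's worth, a simpler (if less elegant) way to get much of the same saving is exponentiation by squaring, which computes A^n in O(log n) matrix multiplications instead of n. A plain C++ sketch (names mine, no connection to any particular library):

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Product of two square matrices of the same size.
Matrix multiply(const Matrix& A, const Matrix& B)
{
    const std::size_t n = A.size();
    Matrix C(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

// A^p via exponentiation by squaring: about 2*log2(p) multiplications,
// e.g. roughly 32 for p = 50,000 instead of 50,000.
Matrix matrixPower(Matrix A, unsigned long long p)
{
    const std::size_t n = A.size();
    Matrix result(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) result[i][i] = 1.0;  // identity

    while (p > 0) {
        if (p & 1ULL) result = multiply(result, A);
        A = multiply(A, A);
        p >>= 1;
    }
    return result;
}
```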

Best GPU algorithm for calculating lists of neighbours

Given a collection of thousands of points in 3D, I need to get, for each particle, the list of neighbours that fall inside some cutoff value (in terms of Euclidean distance), and if possible sorted from nearest to farthest.
Which is the fastest GPU algorithm for this purpose in the CUDA or OpenCL languages?
One of the fastest GPU MD codes I'm aware of, HALMD, uses a (highly tuned) version of the same approach used in the CUDA SDK "Particles" example. Both the HALMD paper and the Particles whitepaper are very clearly written. The underlying algorithm is to assign particles to cutoff-radius-sized bins, do a radix sort on that bin index, and then look at particles in the neighbouring bins.
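A bare-bones sketch of the binning + sort step with Thrust (the kernel, names, and the assumption of a cubic [0, boxSize)^3 domain are mine; the real Particles/HALMD codes add cell-start/cell-end tables and tuned memory layouts on top of this):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Map each particle position to a linear cell index; cells have edge length
// >= cutoff, so all neighbours of a particle lie in the 27 surrounding cells.
__global__ void computeCellIndices(const float3* pos, unsigned int* cellIdx,
                                   int numParticles, float cellSize,
                                   int cellsPerSide)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numParticles) return;

    const int cx = min(static_cast<int>(pos[i].x / cellSize), cellsPerSide - 1);
    const int cy = min(static_cast<int>(pos[i].y / cellSize), cellsPerSide - 1);
    const int cz = min(static_cast<int>(pos[i].z / cellSize), cellsPerSide - 1);
    cellIdx[i] = (cz * cellsPerSide + cy) * cellsPerSide + cx;
}

// Sort particle ids by cell index so that particles sharing a cell end up
// contiguous in memory; the neighbour search then only visits nearby cells.
void buildCellList(thrust::device_vector<unsigned int>& cellIdx,
                   thrust::device_vector<unsigned int>& particleId)
{
    thrust::sequence(particleId.begin(), particleId.end());
    thrust::sort_by_key(cellIdx.begin(), cellIdx.end(), particleId.begin());
}
```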
Fast k Nearest Neighbor Search using GPU
I haven't tested or used it at all. I just googled it and posted the first link I found.

What is the best algorithm for solving a band-diagonal matrix?

I'm trying to figure out the best way to solve a pentadiagonal system. Is there something faster than Gaussian elimination?
You should do an LU or Cholesky decomposition of the matrix, depending on whether your matrix is Hermitian positive definite, and then do back substitution with the factors. This is essentially just Gaussian elimination, but it tends to have better numerical properties. I recommend using LAPACK, since those implementations tend to be the fastest and the most robust. Look at the _GBSV routines, where the blank is one of s, d, c, z, depending on your number type.
Edit: In case you're asking whether there is an algorithm faster than the factor/solve (Gaussian elimination) method, no, there is not. A specialized factorization routine for a banded matrix takes about 4nk^2 operations (where k is the bandwidth), while the back substitution takes about 6nk operations. Thus, for fixed bandwidth, you cannot do better than linear time in n.
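For reference, a minimal sketch of calling the double-precision routine through LAPACKE for a pentadiagonal system (kl = ku = 2); the helper function and the dense-matrix convenience argument are mine, and the only subtle part is LAPACK's band storage layout:

```cpp
#include <algorithm>
#include <vector>
#include <lapacke.h>

// Solve a pentadiagonal system A*x = b with dgbsv (kl = ku = 2).
// Band storage (column-major): A(i,j) is placed at ab[(kl+ku+i-j) + j*ldab],
// with ldab = 2*kl + ku + 1 to leave room for fill-in during factorization.
// On success b is overwritten with the solution x; the return value is
// LAPACK's info code (0 means success).
int solvePentadiagonal(int n, const std::vector<std::vector<double>>& A,
                       std::vector<double>& b)
{
    const int kl = 2, ku = 2;
    const int ldab = 2 * kl + ku + 1;

    std::vector<double> ab(static_cast<std::size_t>(ldab) * n, 0.0);
    std::vector<lapack_int> ipiv(n);

    for (int j = 0; j < n; ++j)
        for (int i = std::max(0, j - ku); i <= std::min(n - 1, j + kl); ++i)
            ab[(kl + ku + i - j) + static_cast<std::size_t>(j) * ldab] = A[i][j];

    return LAPACKE_dgbsv(LAPACK_COL_MAJOR, n, kl, ku, /*nrhs=*/1,
                         ab.data(), ldab, ipiv.data(), b.data(), /*ldb=*/n);
}
```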