Best approach for convolution of multiple small matrices using CUDA - cuda

I need to perform multiple convolutions with small matrices and kernels, and I was hoping that using the GPU's many processors would let me do it as fast as possible.
The problem is as follows: I have many matrices (~1,000 to ~10,000) of relatively small sizes (~15x15 down to 1x1, i.e. a scalar), and a certain number of convolution masks (~20 down to 1). I need to convolve all the matrices with each convolution mask.
example:
A; %5,000 matrices of size 10x10, A(i) = a 10x10 matrix
B; 10 matrices of size 5x5, B(k) = a 5x5 matrix
res(j)=conv(A,B(j)); %res(j) is the result of convolving all 5,000
%matrices in A by the j'th kernel B(j)
The goal is to compute res(1),...,res(10) as quickly as possible.
I would like to hear suggestions about how to implement the most efficient algorithm.
FFT based convolution would probably be too slow.
Every implementation I've seen so far is for 2d convolution, meant to convolve 2 large matrices, while I need to convolve many small matrices.
I know very little about CUDA programming right now, but I'm in the process of learning.
I was hoping to figure this out myself, but due to time constraints, I am forced to ask for any advice anyone with experience can give me, while I learn how to code in CUDA.
Thank you!
P.S. Any pointers to an implementation that suits my purposes are more than appreciated. I am a university student, and this is for a small research project, so nothing I need to pay for please...

I do not pretend to give an ultimate answer to your question, but I would just like to point out a couple of things:
As you mentioned, a first possibility would be to use the FFT approach. One problem here is that (correct me if I'm wrong) the cuFFT library is primarily designed to cope with large matrices, so benefiting from this approach would mean developing FFT routines that are efficient for small matrices. Algorithms of this kind do exist; see for example the paper: Small Discrete Fourier Transforms on GPUs. I have no direct experience with the performance of CUDA FFTs on small matrices of the indicated type, but it could be interesting for you since there are only a few mask matrices (10), so you can "recycle" their FFTs across a large number of convolutions (5000).
If you decide not to use the FFT approach and you have a GPU architecture with compute capability >= 3.5, then dynamic parallelism could be a good candidate for calculating the convolutions. If you regard the evaluation of each element of the convolution result as an interpolation, then you have interpolation problems of size 15x15, and dynamic parallelism could help; see the post: Benefit of splitting a big CUDA kernel and using dynamic parallelism

One approach is to use ArrayFire's GFOR loop, which I work on.
You can tile as many small convolutions into one big kernel launch as you want, as long as you don't run out of GPU memory, as follows:
array x = randu(5); // the input
array y = randu(5,m); // the output, one column per convolution
array f = constant(1,3); // the kernel
gfor (array k, 0, m-1) {
    y(span,k) = convolve(x,f);
}
Good luck!
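For reference, the computation being asked for can be written down directly. The following is a minimal pure-Python sketch, not a CUDA implementation: the names conv2d_full and batch_convolve are made up for illustration, and "full" convolution (output larger than the input) is an assumption about which boundary handling is wanted. Each output element is computed independently from the inputs, which is the property a GPU kernel could exploit by assigning one thread per (matrix, kernel, output element) triple.

```python
def conv2d_full(a, k):
    """Full 2D convolution of matrix a with kernel k (both lists of rows)."""
    ar, ac = len(a), len(a[0])
    kr, kc = len(k), len(k[0])
    out = []
    for r in range(ar + kr - 1):
        row = []
        for c in range(ac + kc - 1):
            # Gather form: every output element is an independent sum.
            acc = 0.0
            for p in range(kr):
                for q in range(kc):
                    i, j = r - p, c - q
                    if 0 <= i < ar and 0 <= j < ac:
                        acc += a[i][j] * k[p][q]
            row.append(acc)
        out.append(row)
    return out


def batch_convolve(mats, kernels):
    """res[j][i] is mats[i] convolved with kernels[j]."""
    return [[conv2d_full(a, k) for a in mats] for k in kernels]


mats = [[[1, 2], [3, 4]], [[5]]]   # two small matrices of different sizes
kernels = [[[1]], [[1, 1]]]        # two kernels
res = batch_convolve(mats, kernels)
```

Because the matrices may differ in size, a GPU version would also need some flattened layout with per-matrix offsets; that part is not shown here.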

Related

DM Script, why does the Fourier transform of a Gaussian kernel need the modulus

I have recently been learning DM_Script for TEM image processing.
I needed a Gaussian blur and found a script named 'Gaussian Blur' at http://www.dmscripting.com/recent_updates.html
This code implements the Gaussian blur by multiplying the fast Fourier transform (FFT) of the source image by the FFT of a Gaussian-kernel image and then taking the inverse Fourier transform of the product.
Here is the relevant part of the code:
// Carry out the convolution in Fourier space
compleximage fftkernelimg:=realFFT(kernelimg) // FFT of Gaussian-kernel image
compleximage FFTSource:=realfft(warpimg) // FFT of source image
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
realimage invFFT:=realIFFT(FFTProduct)
My question is about this line:
compleximage FFTProduct:=FFTSource*fftkernelimg.modulus().sqrt()
Why does the FFT of the Gaussian kernel need '.modulus().sqrt()' for the convolution?
Is it related to the fact that the Fourier transform of a Gaussian function is another Gaussian function?
Or is it related to some limitation of the discrete Fourier transform?
Please let me know.
Thanks
This is related to the general precision limitation of any floating-point numeric computing (see e.g. here, or more in depth here).
A rotationally symmetric (real-valued) Gaussian of std. dev. sigma should be transformed into a 100% real-valued rotationally symmetric Gaussian of std. dev. 1/sigma. However, doing this numerically will show you deviations. Just try the following:
number sigma = 30
number A0 = 1
realimage first := RealImage( "First", 8, 256, 256 )
first = A0 * exp( - (iradius**2/(2*sigma*sigma) ))
first.showimage()
complexImage second := FFT(first)
second.Showimage()
image nonZeroImaginaryMask = ( 0 != second.Imaginary() )
nonZeroImaginaryMask.Showimage()
nonZeroImaginaryMask.SetLimits(0,1)
When you then multiply these complex images (before back-transforming), you introduce even more errors. By using the modulus, one ensures that the forward-transformed kernel is purely real and hence a better "damping" curve.
A better implementation of an FFT filtering code would actually create the FFT(Gaussian) directly with a std. dev. of 1/sigma, as this is the analytically correct result. Doing an FFT of the kernel only makes sense if the kernel (or its FFT) is not analytically known.
In general: when implementing any "maths" in program code, it can pay hugely to think it through with numerical computation limits in the back of your head. Reduce actual computation whenever possible (i.e. compute analytically and use the result instead of relying on brute-force numerical computation) and try to "reshape" equations when possible, e.g. avoid large sums over many small numbers, be careful about checks against exact numeric values, try to avoid expressions which are very sensitive to small numerical errors, etc.
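The same effect can be reproduced outside DM script. The sketch below is a 1D pure-Python illustration under assumed values (N = 64 samples, sigma = 4): it compares a numerically computed DFT of a centred Gaussian with the analytic transform, which is a real Gaussian whose width in frequency pixels is N/(2*pi*sigma), i.e. proportional to 1/sigma. The numerical result agrees with the analytic one only up to rounding, and its imaginary part is tiny but not guaranteed to be exactly zero.

```python
import cmath
import math

N = 64        # number of samples (assumed for illustration)
sigma = 4.0   # standard deviation in samples (assumed)


def centered(n):
    """Map index n to a signed coordinate so the Gaussian is centred on index 0."""
    return n if n < N // 2 else n - N


gauss = [math.exp(-centered(n) ** 2 / (2 * sigma ** 2)) for n in range(N)]

# Naive O(N^2) DFT; fine for a demonstration of this size.
dft = [sum(g * cmath.exp(-2j * math.pi * k * n / N) for n, g in enumerate(gauss))
       for k in range(N)]

# Analytic transform of the Gaussian: purely real, again a Gaussian.
analytic = [sigma * math.sqrt(2 * math.pi)
            * math.exp(-2 * (math.pi * sigma * centered(k) / N) ** 2)
            for k in range(N)]

max_imag = max(abs(z.imag) for z in dft)                       # rounding residue
max_err = max(abs(z.real - a) for z, a in zip(dft, analytic))  # numeric vs analytic
```

Printing max_imag and max_err shows how small the residues are; a filter built from the analytic list avoids them entirely.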

Calculating (A - B(D^-1)B^T )^-1 with CUDA

What might be the most efficient way of calculating the following expression using CUDA C ?
(A - B(D^-1)B^T )^-1
where D is a very large symmetric matrix and A is a small symmetric matrix, which makes B and B^T medium sized rectangular non-symmetric matrices. Of course (^-1) and (^T) are the inverse and transpose operations, respectively.
If you are open to "low" level programming, then matrix inversion could be performed with the CULA or MAGMA libraries.
CULA Dense provides single-precision (real or complex) routines for System Solve, Linear Least Squares Solve, and Constrained Linear Least Squares Solve. CULA Sparse is a collection of iterative solvers for sparse matrices. MAGMA contains dgetrf and dgetri to calculate inverses of square double-precision matrices.
For matrix multiplications, including transpositions, you could use cuBLAS routines.
If you prefer "higher" level programming, then ArrayFire enables you to perform matrix multiplications, inversions, transposes, solution of linear systems, and elementwise operations with a more natural mathematical syntax. Also, Matlab has GPU Computing Support for NVIDIA CUDA-Enabled GPUs.
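Whichever library is used, the order of operations matters. Since D is very large, one option is to avoid forming D^-1 explicitly and instead solve D Y = B^T for Y, then form the small matrix A - B Y and invert only that. The following is a small pure-Python sketch of that ordering, not GPU code; solve, matmul, transpose and schur_inverse are illustrative helper names, and on a GPU the solve would be delegated to a factorization routine and the products to GEMM calls.

```python
def solve(M, B):
    """Solve M X = B by Gaussian elimination with partial pivoting.

    M is n x n and B is n x m (lists of rows); returns X as n x m.
    """
    n, m = len(M), len(B[0])
    aug = [list(map(float, M[i])) + list(map(float, B[i])) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + m):
                aug[r][c] -= f * aug[col][c]
    X = [[0.0] * m for _ in range(n)]
    for i in reversed(range(n)):          # back substitution
        for j in range(m):
            s = aug[i][n + j] - sum(aug[i][c] * X[c][j] for c in range(i + 1, n))
            X[i][j] = s / aug[i][i]
    return X


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(P):
    return [list(r) for r in zip(*P)]


def schur_inverse(A, B, D):
    """Return (A - B D^-1 B^T)^-1 without ever forming D^-1."""
    Y = solve(D, transpose(B))            # Y = D^-1 B^T via a linear solve
    BY = matmul(B, Y)
    S = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, BY)]
    identity = [[float(i == j) for j in range(len(S))] for i in range(len(S))]
    return solve(S, identity)             # S is small, so inverting it is cheap


A = [[5.0]]
B = [[1.0, 2.0]]
D = [[1.0, 0.0], [0.0, 2.0]]
result = schur_inverse(A, B, D)           # A - B D^-1 B^T = 5 - 3 = 2
```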

Apply PCA on very large sparse matrix

I am doing a text classification task with R, and I have obtained a document-term matrix of size 22490 by 120,000 (only 4 million non-zero entries, fewer than 1% of the entries). Now I want to reduce the dimensionality using PCA (Principal Component Analysis). Unfortunately, R cannot handle this huge matrix, so I stored the sparse matrix in a file in the "Matrix Market Format", hoping to use some other technique to do PCA.
So could anyone give me some hints about useful libraries (whatever the programming language) that could do PCA on this large-scale matrix with ease, or about how to do a longhand PCA myself, in other words, calculate the covariance matrix first and then calculate its eigenvalues and eigenvectors?
What I want is to calculate all PCs (120,000) and keep only the top N PCs, which account for 90% of the variance. Obviously, in this case, I have to choose a threshold a priori to set some very tiny variance values to 0 (in the covariance matrix); otherwise the covariance matrix will not be sparse and its size would be 120,000 by 120,000, which is impossible to handle on a single machine. Also, the loadings (eigenvectors) will be extremely large and should be stored in sparse format.
Thanks very much for any help !
Note: I am using a machine with 24GB RAM and 8 cpu cores.
The Python toolkit scikit-learn has a few PCA variants, of which RandomizedPCA can handle sparse matrices in any of the formats supported by scipy.sparse. scipy.io.mmread should be able to parse the Matrix Market format (I never tried it, though).
Disclaimer: I'm on the scikit-learn development team.
EDIT: the sparse matrix support from RandomizedPCA has been deprecated in scikit-learn 0.14. TruncatedSVD should be used in its stead. See the documentation for details.
Instead of running PCA, you could try Latent Dirichlet Allocation (LDA), which decomposes the document-word matrix into a document-topic and a topic-word matrix. Here is a link to an R implementation: http://cran.r-project.org/web/packages/lda/ - though there are quite a few implementations out there if you google.
With LDA you need to specify a fixed number of topics (similar to principal components) in advance. A potentially better alternative is HDP-LDA (http://www.gatsby.ucl.ac.uk/~ywteh/research/npbayes/npbayes-r21.tgz), which learns the number of topics that form a good representation of your corpus.
If you can fit your dataset in memory (which it seems like you can), then you also shouldn't have a problem running the LDA code.
As a number of people on the scicomp forum pointed out, there should be no need to compute all of the 120k principal components. Algorithms like http://en.wikipedia.org/wiki/Power_iteration calculate the largest eigenvalues of a matrix, and LDA algorithms will converge to a minimum-description-length representation of the data given the number of topics specified.
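To make the power-iteration idea concrete, here is a minimal pure-Python sketch that finds only the leading principal direction of sparse data. The row-of-dicts storage and the function name top_component are assumptions for illustration. The point it shows is that the covariance matrix never has to be formed: each step needs only products of the sparse matrix with a vector, with the mean-centring applied as a correction term.

```python
import math


def top_component(rows, n_features, iters=200):
    """Leading principal direction of sparse data by power iteration.

    rows: list of dicts {column_index: value}, one dict per document.
    Only X v and X^T u are computed; X is never centred or densified.
    """
    n = len(rows)
    mean = [0.0] * n_features
    for row in rows:
        for j, x in row.items():
            mean[j] += x / n
    v = [1.0 / (j + 1) for j in range(n_features)]   # arbitrary nonzero start
    for _ in range(iters):
        mv = sum(m * vj for m, vj in zip(mean, v))
        # u = (X - 1 mean^T) v
        u = [sum(x * v[j] for j, x in row.items()) - mv for row in rows]
        # w = (X - 1 mean^T)^T u = X^T u - mean * sum(u)
        su = sum(u)
        w = [-m * su for m in mean]
        for ui, row in zip(u, rows):
            for j, x in row.items():
                w[j] += x * ui
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


# Four "documents" whose two features always move together.
rows = [{}, {0: 1.0, 1: 1.0}, {0: 2.0, 1: 2.0}, {0: 3.0, 1: 3.0}]
v = top_component(rows, 2)
```

Further components would need deflation or a block method, which is what library routines for truncated decompositions provide.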
In R big.PCA of bigpca package http://cran.r-project.org/web/packages/bigpca/bigpca.pdf does the job.
text classification task
I solved almost the same problem using a technique for PCA of a sparse matrix.
This technique can handle very large sparse matrices.
The results show that such a simple PCA outperforms word2vec.
This implies that the simple PCA also outperforms LDA.
I suppose you wouldn't be able to compute all principal components. But you can still obtain a reduced-dimension version of your dataset matrix. I've implemented a simple routine in MATLAB, which can easily be replicated in Python.
Compute the covariance matrix of your input dataset, and convert it to a dense matrix. Assuming S is your input 120,000 * 22490 sparse matrix, this would be like:
Smul=full(S.'*S);
Sm=full(mean(S));
Sm2=120000*Sm.'*Sm;
Scov=Smul-Sm2;
Apply eigs function on the covariance matrix to obtain the first N dominant eigenvectors,
[V,D] = eigs(Scov,N);
And obtain pcs by projecting the zero centered matrix on eigenvectors,
Sr=(S-Sm)*V;
Sr is the reduced dimension version of S.

CUDA cublas<t>gbmv understanding

I recently wanted to use a simple CUDA matrix-vector multiplication. I found a suitable function in the cuBLAS library: cublas<t>gbmv. Here is the official documentation.
But it is actually very poor, so I didn't manage to understand what the kl and ku parameters mean. Moreover, I have no idea what the stride is (it must also be provided).
There is a brief explanation of these parameters (Page 37), but it looks like I need to know something else.
A search on the internet doesn't turn up much useful information on this question, mostly references to different versions of the documentation.
So I have several questions to GPU/CUDA/cublas gurus:
How do I find more understandable docs or guides about using cublas?
If you know how to use this particular function, could you explain how to use it?
Maybe the cublas library is somewhat unusual and everyone uses something more popular, better documented and so on?
Thanks a lot.
So BLAS (Basic Linear Algebra Subprograms) generally is an API to, as the name says, basic linear algebra routines. It includes vector-vector operations (level 1 blas routines), matrix-vector operations (level 2) and matrix-matrix operations (level 3). There is a "reference" BLAS available that implements everything correctly, but most of the time you'd use an optimized implementation for your architecture. cuBLAS is an implementation for CUDA.
The BLAS API was so successful as an API that describes the basic operations that it's become very widely adopted. However, (a) the names are incredibly cryptic because of architectural limitations of the day (this was 1979, and the API was defined using names of 6 characters or fewer to ensure it could compile widely), and (b) it is successful because it's quite general, and so even the simplest function calls require a lot of extraneous arguments.
Because it's so widespread, it's often assumed that if you're doing numerical linear algebra, you already know the general gist of the API, so implementation manuals often leave out important details, and I think that's what you're running into.
The Level 2 and 3 routines generally have function names of the form TMMOO.. where T is the numerical type of the matrix/vector (S/D for single/double precision real, C/Z for single/double precision complex), MM is the matrix type (GE for general - eg, just a dense matrix you can't say anything else about; GB for a general banded matrix, SY for symmetric matrices, etc), and OO is the operation.
This all seems slightly ridiculous now, but it worked and works relatively well -- you quickly learn to scan these for familiar operations so that SGEMV is a single-precision general-matrix times vector multiplication (which is probably what you want, not SGBMV), DGEMM is double-precision matrix-matrix multiply, etc. But it does take some practice.
So if you look at the cublas sgemv instructions, or in the documentation of the original, you can step through the argument list. First, the basic operation is
This function performs the matrix-vector multiplication
y = alpha * op(A) * x + beta * y
where A is a m x n matrix stored in column-major format, x and y
are vectors, and alpha and beta are scalars.
where op(A) can be A, A^T, or A^H. So if you just want y = Ax, as is the common case, then alpha = 1, beta = 0, and trans == CUBLAS_OP_N.
incx is the stride between different elements in x; there's lots of situations where this would come in handy, but if x is just a simple 1d array containing the vector, then the stride would be 1.
And that's about all you need for SGEMV.
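To make the roles of the arguments concrete, here is a small pure-Python reference sketch of what a GEMV call computes for real data. It is not the cuBLAS API: the function name gemv and the 'N'/'T' flags are chosen for illustration, and only non-negative strides are handled. It shows how column-major storage, the leading dimension lda, and the strides incx and incy are used for indexing.

```python
def gemv(trans, m, n, alpha, A, lda, x, incx, beta, y, incy):
    """Reference semantics of GEMV for real data: y = alpha*op(A)*x + beta*y.

    A is an m x n matrix in column-major order with leading dimension lda,
    so element (i, j) lives at A[i + j * lda]. trans is 'N' or 'T'.
    """
    rows, cols = (m, n) if trans == 'N' else (n, m)   # shape of op(A)
    for i in range(rows):
        acc = 0.0
        for j in range(cols):
            a_ij = A[i + j * lda] if trans == 'N' else A[j + i * lda]
            acc += a_ij * x[j * incx]                 # incx: step between x elements
        y[i * incy] = alpha * acc + beta * y[i * incy]
    return y


# A = [[1, 2, 3],
#      [4, 5, 6]] stored column by column
A = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
y_n = gemv('N', 2, 3, 1.0, A, 2, [1.0, 1.0, 1.0], 1, 0.0, [0.0, 0.0], 1)
y_t = gemv('T', 2, 3, 1.0, A, 2, [1.0, 1.0], 1, 0.0, [0.0, 0.0, 0.0], 1)
```

With alpha = 1 and beta = 0 the first call returns the row sums of A and the second the column sums, which is a quick way to check that a matrix has been laid out in the order the routine expects.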

Apple FFT Accelerate Framework Inverse FFT from Array of Real Numbers

I am using the Accelerate framework FFT functions to produce a spectrogram of a sound sample. This part works great. However, I want to (effectively) manipulate the spectrum directly (i.e. manipulate the real numbers) and then call the inverse again. How would I go about doing that? It looks like the INVERSE call expects an array of IMAGINARY numbers, but how can I produce that from my manipulated real numbers? I have tried making the realp array my reals, and the imagp part zero, but that doesn't seem to work.
The reason I ask this is because I wish to run an FFT on a voice audio sample, and then run the FFT again and then lifter the low part of the cepstrum (thus hopefully separating the vocal tract components from the pitch) and then run an inverse FFT again to produce a spectrogram showing the vocal tract (formant) information more clearly (ie, without the pitch information). However, I seem to be running into problems on the inverse FFT, into which I am passing in my real values (cepstrum) in the realp array and the imagp is zero. I think I am doing something wrong here and the results are unexpected.
You need to process the complex forward FFT results, rather than the real magnitudes, or else the shape of the IFFT result spectrum will be distorted. Don't consider them imaginary numbers, consider them to be part of a 2D vector containing the required angular phase information.
If your cepstrum lifter/filter alters only the real magnitudes, then you can try using the amount of change of the real magnitudes as scaling factors to alter your forward complex FFT result before doing a complex IFFT.
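The scaling idea can be sketched as follows. This is a pure-Python illustration with a naive complex DFT, not Accelerate/vDSP code; dft and apply_magnitude_edit are made-up names, and the doubling of every magnitude is just a stand-in for whatever edit the lifter produces. Each complex bin is multiplied by the ratio of new to old magnitude, so its phase is preserved. For a real output signal, the edited magnitudes are assumed to keep the symmetry of the original spectrum.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; the inverse includes the 1/N factor."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * math.pi * k * t / n) for t, v in enumerate(x))
           for k in range(n)]
    return [z / n for z in out] if inverse else out


def apply_magnitude_edit(spectrum, new_mags, eps=1e-12):
    """Scale each complex bin so its magnitude becomes new_mags[k]; phase is kept."""
    out = []
    for z, m in zip(spectrum, new_mags):
        old = abs(z)
        # Bins that are numerically zero carry no usable phase; leave them at zero.
        out.append(z * (m / old) if old > eps else 0j)
    return out


signal = [0.0, 1.0, 0.0, -1.0, 0.5, 0.25, -0.5, 0.0]
spectrum = dft(signal)
mags = [abs(z) for z in spectrum]
edited = apply_magnitude_edit(spectrum, [2.0 * m for m in mags])  # example edit
restored = [z.real for z in dft(edited, inverse=True)]
```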