I am trying to optimize some code for performance, and I notice that I am forced to convert from sparse to full vectors, since the built-in ifft and fft functions do not support the sparse matrix type. My signal is, however, sparse, and I want to exploit this fact.
Does anybody have a suggestion for what can be done here?
Nithin
Related
Original input activations undergo a kernel transformation through im2col to improve memory access patterns. But when we convert the original matrix into the im2col matrix, we are still accessing memory in the same original patterns. So why isn't the im2col operation itself slow?
The main reason for im2col is that the input and kernels can be represented as two big matrices and the convolution can be done in a single matrix multiplication. This speeds up the process because a matrix multiplication can be parallelized very well.
The memory access alone is not the problem, and as you said, im2col has to access the original tensors the same way a simple convolution operation would.
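To make that concrete, here is a minimal, hypothetical single-channel im2col sketch (no padding, stride 1); each row of the resulting (out_h*out_w) x (kh*kw) matrix is one patch, so the convolution becomes a single matrix-vector product (a GEMM once channels and multiple filters are added). All names are placeholders.

```cpp
#include <vector>

// Lay out every kh x kw patch of an H x W image as one row of an
// (out_h*out_w) x (kh*kw) matrix, stored row-major.  The reads are the
// same as a direct convolution would make; only the writes are new.
std::vector<float> im2col(const std::vector<float>& img, int H, int W,
                          int kh, int kw)
{
    int out_h = H - kh + 1, out_w = W - kw + 1;   // no padding, stride 1
    std::vector<float> cols((size_t)out_h * out_w * kh * kw);
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox)
            for (int ky = 0; ky < kh; ++ky)
                for (int kx = 0; kx < kw; ++kx)
                    cols[(((size_t)oy * out_w + ox) * kh + ky) * kw + kx] =
                        img[(size_t)(oy + ky) * W + (ox + kx)];
    return cols;
}
```

The extra pass is linear in the output size; what it buys is that the actual convolution then maps onto highly optimized, parallel GEMM kernels.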
I am new to CUDA. While writing a program to quickly sum a 3D array over the 3rd dimension, some questions came to mind:
The most natural way is to assign one thread per matrix entry and have each thread loop over the 3rd dimension. In this scenario, is the memory access considered coalesced? Neighboring threads access neighboring elements; they only stride on the loop variable.
For improved performance, a reduction on the 3rd dimension certainly helps.
Are there any libraries I can use? For 2D summation, cuBLAS is considered a good choice. I am thinking of a forced type conversion that tricks the compiler into treating the piece of memory as a 2D array, and then using a matrix-vector multiplication from cuBLAS.
That's a coalesced read.
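For illustration, a sketch of that thread-per-entry pattern, assuming the array is stored with x fastest (index = x + nx*(y + ny*z)); consecutive threads in a warp then read consecutive addresses at every z step:

```cpp
// One thread per (x, y) entry; each thread loops over z.
// With x-fastest storage, a warp reads a contiguous run of floats at each
// iteration, so every load is coalesced (the per-thread stride of nx*ny
// between iterations does not matter for coalescing).
__global__ void sum_over_z(const float* in, float* out, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    float acc = 0.0f;
    for (int z = 0; z < nz; ++z)
        acc += in[(size_t)x + (size_t)nx * (y + (size_t)ny * z)];
    out[(size_t)x + (size_t)nx * y] = acc;
}
```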
You can use cuBLAS in the same way. Just tell GEMV that the first (uncontracted) dimension is nx*ny.
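A hedged sketch of that GEMV trick, assuming the same x-fastest layout, an existing cuBLAS handle, device buffers, and a device vector of nz ones (all names below are placeholders):

```cpp
#include <cublas_v2.h>

// View the nx*ny*nz array as an (nx*ny) x nz column-major matrix and
// multiply it by a vector of ones: out[i] = sum_z A[i, z].
void sum_over_z_cublas(cublasHandle_t handle,
                       const float* d_in,   // nx*ny*nz values, x fastest
                       const float* d_ones, // nz ones
                       float* d_out,        // nx*ny results
                       int nx, int ny, int nz)
{
    const float alpha = 1.0f, beta = 0.0f;
    const int m = nx * ny;   // uncontracted (leading) dimension
    const int n = nz;        // contracted dimension
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha,
                d_in, m, d_ones, 1, &beta, d_out, 1);
}
```

No copy or real type conversion is needed; the "reinterpretation" is just telling GEMV that the leading dimension is nx*ny.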
I would like to write a DFT program using FFT.
This is actually used for very large matrix-vector multiplication (10^8 * 10^8), which is simplified to a vector-to-vector convolution, and further reduced to a Discrete Fourier Transform.
May I ask whether the DFT is exact? The matrix has all discrete binary elements, and the multiplication would not tolerate any non-zero probability of error in the result. However, from what I have learned about the DFT so far, it seems to be an approximation algorithm?
Also, may I ask roughly how long the code would be? That is, would this be something I could start from scratch and write in C++ in perhaps one or two hundred lines? Because this is actually for a paper... and all I need is that the complexity analysis is O(n log n); the coefficient in front of it doesn't really matter :) So the simplest implementation would be best. (I did see some packages like kissfft and FFTW, but they are very lengthy and probably overkill for my purpose...)
A canonical radix-2 FFT can be written in fewer than 200 lines of C++. The average numerical error grows roughly as O(log N), so you will need to use a large enough numeric type and data scale factor to account for this.
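For a sense of scale, a minimal recursive radix-2 Cooley-Tukey sketch (the input length must be a power of two); an iterative in-place version is about the same length:

```cpp
#include <complex>
#include <vector>
#include <cmath>

using cd = std::complex<double>;

// In-place DFT of a, with |a| a power of two.  O(N log N) operations.
void fft(std::vector<cd>& a)
{
    const size_t n = a.size();
    if (n <= 1) return;

    std::vector<cd> even(n / 2), odd(n / 2);
    for (size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even);
    fft(odd);

    const double pi = std::acos(-1.0);
    for (size_t k = 0; k < n / 2; ++k) {
        cd t = std::polar(1.0, -2.0 * pi * (double)k / (double)n) * odd[k];  // twiddle
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}
```

The inverse transform is the same routine with the sign of the exponent flipped and a final division by N.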
You can compute exact, numerically stable convolutions using the Number Theoretic Transform (NTT). It uses integer sequences (roots of unity modulo a prime) to compute the discrete Fourier transform over integer fields/rings. The only caveat is that the signal needs to be integer-valued.
Its implementation is roughly the same size as an FFT's, and it is a little faster. You can find my implementation of it at finitetransform.sourceforge.net as the NTTW sub-library. The APFloat library might be more relevant to your needs, as it does multiplication of large numbers using convolutions.
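This is not the NTTW code, just a generic sketch of an iterative NTT over Z_p for the common choice p = 998244353 = 119*2^23 + 1 (primitive root 3); all arithmetic is exact modular integer arithmetic, so integer convolutions come out error-free as long as the true values stay below p:

```cpp
#include <cstdint>
#include <vector>
#include <utility>

const uint64_t MOD = 998244353, G = 3;   // p = 119*2^23 + 1, primitive root 3

uint64_t pow_mod(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    for (b %= m; e; e >>= 1, b = b * b % m)
        if (e & 1) r = r * b % m;
    return r;
}

// In-place transform of a length-n (power of two, n <= 2^23) sequence;
// invert = true gives the inverse transform including the 1/n factor.
void ntt(std::vector<uint64_t>& a, bool invert)
{
    const size_t n = a.size();
    for (size_t i = 1, j = 0; i < n; ++i) {          // bit-reversal permutation
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    for (size_t len = 2; len <= n; len <<= 1) {      // butterfly passes
        uint64_t w = pow_mod(G, (MOD - 1) / len, MOD);
        if (invert) w = pow_mod(w, MOD - 2, MOD);    // inverse root of unity
        for (size_t i = 0; i < n; i += len) {
            uint64_t wn = 1;
            for (size_t j = 0; j < len / 2; ++j) {
                uint64_t u = a[i + j], v = a[i + j + len / 2] * wn % MOD;
                a[i + j]           = (u + v) % MOD;
                a[i + j + len / 2] = (u - v + MOD) % MOD;
                wn = wn * w % MOD;
            }
        }
    }
    if (invert) {
        uint64_t n_inv = pow_mod(n, MOD - 2, MOD);   // modular inverse of n
        for (auto& x : a) x = x * n_inv % MOD;
    }
}
```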
Can someone recommend a good iterative method that can be used to solve a dense non-symmetric linear system?
From my understanding, the bi-conjugate gradient method has irregular convergence and is unstable.
Maybe try GMRES. It requires storing the intermediate basis (residual) vectors and orthogonalizing against them, so storage grows with the number of iterations. In this way it avoids CG's problem of accumulating numerical error over the iterations.
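To make that structure explicit, here is a textbook, full-memory GMRES sketch (the orthogonalized vectors it stores are the Krylov basis grown from the initial residual); restarts, preconditioning, and breakdown handling are all omitted, so treat it as an illustration rather than a solver:

```cpp
#include <vector>
#include <cmath>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // dense row-major: A[i][j]

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (size_t i = 0; i < A.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}
double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
double nrm(const Vec& a) { return std::sqrt(dot(a, a)); }

// Solve A x = b starting from x0 = 0; m = max Krylov dimension,
// tol = relative residual target.  Storage grows linearly with iterations.
Vec gmres(const Mat& A, const Vec& b, int m, double tol)
{
    const size_t n = b.size();
    Vec x(n, 0.0), r = b;                 // residual for x0 = 0
    double beta = nrm(r);
    if (beta == 0.0) return x;

    std::vector<Vec> V{r};                // orthonormal Krylov basis
    for (auto& v : V[0]) v /= beta;
    std::vector<Vec> H(m + 1, Vec(m, 0.0));
    Vec cs(m), sn(m), g(m + 1, 0.0);
    g[0] = beta;

    int k = 0;
    for (; k < m; ++k) {
        // Arnoldi step with modified Gram-Schmidt
        Vec w = matvec(A, V[k]);
        for (int i = 0; i <= k; ++i) {
            H[i][k] = dot(w, V[i]);
            for (size_t j = 0; j < n; ++j) w[j] -= H[i][k] * V[i][j];
        }
        H[k + 1][k] = nrm(w);             // (lucky-breakdown check omitted)
        for (auto& wi : w) wi /= H[k + 1][k];
        V.push_back(w);

        // Apply previous Givens rotations to the new Hessenberg column
        for (int i = 0; i < k; ++i) {
            double t    = cs[i] * H[i][k] + sn[i] * H[i + 1][k];
            H[i + 1][k] = -sn[i] * H[i][k] + cs[i] * H[i + 1][k];
            H[i][k]     = t;
        }
        // New rotation eliminating H[k+1][k]; |g[k+1]| is the residual norm
        double d = std::hypot(H[k][k], H[k + 1][k]);
        cs[k] = H[k][k] / d;  sn[k] = H[k + 1][k] / d;
        H[k][k] = d;          H[k + 1][k] = 0.0;
        g[k + 1] = -sn[k] * g[k];
        g[k]     =  cs[k] * g[k];
        if (std::fabs(g[k + 1]) < tol * beta) { ++k; break; }
    }

    // Back-substitute the k x k triangular system and form x = V y
    Vec y(k, 0.0);
    for (int i = k - 1; i >= 0; --i) {
        y[i] = g[i];
        for (int j = i + 1; j < k; ++j) y[i] -= H[i][j] * y[j];
        y[i] /= H[i][i];
    }
    for (int j = 0; j < k; ++j)
        for (size_t i = 0; i < n; ++i) x[i] += y[j] * V[j][i];
    return x;
}
```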
I'm trying to figure out the best way to solve a pentadiagonal matrix. Is there something faster than Gaussian elimination?
You should do an LU or Cholesky decomposition of the matrix, depending on whether your matrix is Hermitian positive definite, and then do back substitution with the factors. This is essentially just Gaussian elimination, but it tends to have better numerical properties. I recommend using LAPACK, since those implementations tend to be the fastest and the most robust. Look at the _GBSV routines, where the blank is one of s, d, c, or z, depending on your number type.
Edit: In case you're asking whether there is an algorithm faster than the factor/solve (Gaussian elimination) approach, no, there is not. A specialized factorization routine for a banded matrix takes about 4*n*k^2 operations (k is the bandwidth), while the back substitution takes about 6*n*k operations. Thus, for fixed bandwidth, you cannot do better than linear time in n.
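For example, a hedged sketch of calling the banded solver through LAPACKE for a pentadiagonal system (kl = ku = 2); the band-storage indexing below follows the _GBSV documentation, and the dense input A is only used to keep the example short:

```cpp
#include <lapacke.h>
#include <algorithm>
#include <vector>

// Solve A x = b for a pentadiagonal A (bandwidths kl = ku = 2).
// Band storage (column-major, 0-based): ab[(kl+ku+i-j) + j*ldab] = A(i,j)
// for |i - j| <= 2, with ldab = 2*kl + ku + 1 leaving room for pivoting fill-in.
std::vector<double> solve_pentadiagonal(int n,
                                        const std::vector<std::vector<double>>& A,
                                        std::vector<double> b)
{
    const int kl = 2, ku = 2, ldab = 2 * kl + ku + 1;
    std::vector<double> ab((size_t)ldab * n, 0.0);
    for (int j = 0; j < n; ++j)
        for (int i = std::max(0, j - ku); i <= std::min(n - 1, j + kl); ++i)
            ab[(size_t)(kl + ku + i - j) + (size_t)j * ldab] = A[i][j];

    std::vector<lapack_int> ipiv(n);
    lapack_int info = LAPACKE_dgbsv(LAPACK_COL_MAJOR, n, kl, ku, 1,
                                    ab.data(), ldab, ipiv.data(),
                                    b.data(), n);
    (void)info;   // info == 0 on success; b now holds the solution x
    return b;
}
```

In practice you would fill the band storage directly from the five diagonals rather than from a dense matrix, so the whole solve stays O(n) in both time and memory.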