Writing a Discrete Fourier Transform program - fft

I would like to write a DFT program using FFT.
This is actually used for very large matrix-vector multiplication (10^8 * 10^8), which is simplified to a vector-to-vector convolution, and further reduced to a Discrete Fourier Transform.
May I ask whether the DFT is accurate? The matrix has all discrete binary elements, and the multiplication would not tolerate any non-zero probability of error in the result. However, from what I have learnt about the DFT so far, it seems to be an approximation algorithm?
Also, may I ask roughly how long the code would be? I.e. would this be something I could start from scratch and write in C++ in perhaps one or two hundred lines? Because this is actually for a paper... and all I need is that the complexity analysis is O(n log n); the constant in front of it doesn't really matter :) So the simplest implementation would be best. (I did see packages like kissfft and FFTW, but they are very lengthy and probably overkill for my purpose...)

A canonical radix-2 FFT can be written in less than 200 lines of C++. The average numerical error is roughly proportional to O(log N), so you will need to use a large enough numeric type and data scale factor to account for this.
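For concreteness, here is a minimal sketch of the kind of iterative radix-2 Cooley-Tukey FFT meant here, using double-precision std::complex; it is an illustration, not a drop-in replacement for kissfft/FFTW, and it comes in at roughly 30 lines rather than 200. For an exact binary convolution you would forward-transform both inputs, multiply pointwise, inverse-transform, and round each result to the nearest integer, which is safe as long as the accumulated floating-point error stays below 0.5 (the number-theoretic transform in the next answer avoids the issue entirely).

#include <complex>
#include <cmath>
#include <vector>

// In-place iterative radix-2 Cooley-Tukey FFT. a.size() must be a power of two.
// With inverse = true this computes the unscaled inverse transform; divide by
// a.size() afterwards to get the true inverse.
void fft(std::vector<std::complex<double>>& a, bool inverse = false) {
    const std::size_t n = a.size();
    const double pi = std::acos(-1.0);
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // Butterfly stages: lengths 2, 4, ..., n.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const double ang = 2.0 * pi / static_cast<double>(len) * (inverse ? 1.0 : -1.0);
        const std::complex<double> wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len / 2; ++k) {
                const std::complex<double> u = a[i + k];
                const std::complex<double> v = a[i + k + len / 2] * w;
                a[i + k] = u + v;
                a[i + k + len / 2] = u - v;
                w *= wlen;
            }
        }
    }
}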

You can compute exact, numerically stable convolutions using the Number Theoretic Transform (NTT). It uses integer roots of unity to compute the discrete Fourier transform over finite fields/rings, so there is no floating-point round-off at all. The only caveat is that the signal needs to be integer-valued.
Its implementation is roughly the same size as the FFT, and it can be a little faster. You can find my implementation of it at finitetransform.sourceforge.net as the NTTW sub-library. The APFloat library might be more relevant to your needs, as it does multiplication of large numbers using convolutions.
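For illustration (this is not the NTTW code referenced above), here is a rough sketch of an iterative NTT modulo the FFT-friendly prime 998244353 (primitive root 3): the butterfly structure is identical to the radix-2 FFT, but the roots of unity are modular integers, so every intermediate value is exact. A convolution computed this way is exact provided the true result coefficients are smaller than the modulus; otherwise use several such primes and combine with the CRT.

#include <cstdint>
#include <vector>

using u64 = std::uint64_t;
constexpr u64 MOD = 998244353;   // 119 * 2^23 + 1, supports transform lengths up to 2^23
constexpr u64 ROOT = 3;          // a primitive root modulo MOD

u64 pow_mod(u64 b, u64 e, u64 m) {
    u64 r = 1 % m;
    for (b %= m; e; e >>= 1, b = b * b % m)
        if (e & 1) r = r * b % m;
    return r;
}

// In-place NTT; a.size() must be a power of two dividing 2^23.
void ntt(std::vector<u64>& a, bool inverse = false) {
    const std::size_t n = a.size();
    for (std::size_t i = 1, j = 0; i < n; ++i) {   // bit-reversal permutation
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    for (std::size_t len = 2; len <= n; len <<= 1) {
        u64 wlen = pow_mod(ROOT, (MOD - 1) / len, MOD);    // len-th root of unity
        if (inverse) wlen = pow_mod(wlen, MOD - 2, MOD);   // its modular inverse
        for (std::size_t i = 0; i < n; i += len) {
            u64 w = 1;
            for (std::size_t k = 0; k < len / 2; ++k) {
                const u64 u = a[i + k];
                const u64 v = a[i + k + len / 2] * w % MOD;
                a[i + k] = (u + v) % MOD;
                a[i + k + len / 2] = (u + MOD - v) % MOD;
                w = w * wlen % MOD;
            }
        }
    }
    if (inverse) {                                         // scale by n^-1
        const u64 n_inv = pow_mod(n, MOD - 2, MOD);
        for (u64& x : a) x = x * n_inv % MOD;
    }
}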

Related

Implementing large linear regression models using CUDA

For analysis of 10^6 genetic factors and their GeneXGene interactions (~5x10^11), I have numerous and independent linear regression problems which are probably suitable for analysis on GPUs.
The objective is to exhaustively search for GeneXGene interaction effects in modulating an outcome variable (a brain phenotype) using linear regression with the interaction term included.
As far as I know, Householder QR factorization could be the method for fitting the regression models; however, given that each regression matrix in this particular work could easily approach a size of ~10'000x10, not even a single regression matrix seems to fit in GPU on-chip memory (shared memory, registers, etc.).
Should I accept this as a problem which is inherently bandwidth-limited and keep the matrices in GPU global memory during regression analysis, or are other strategies possible?
EDIT
Here are more details about the problem:
There will be approximately 10'000 subjects, each with ~1M genetic parameters (genetic matrix:10'000x10^6). The algorithm in each iteration should select two columns of this genetic matrix (10'000x2) and also around 6 other variables unrelated to genetic data (age, gender etc) so the final regression model will be dealing with a matrix like the size of 10'000x[2(genetic factors)+6(covariates)+2(intercept&interaction term)] and an outcome variable vector (10'000x1). This same process will be repeated ~5e11 times each time with a given pair of genetic factors. Those models passing a predefined statistical threshold should be saved as output.
The specific problem is that although there are ~5e11 separate regression models, even a single one does not seem to fit in on-chip memory.
I also suspect that sticking with CUDA libraries may not be the solution here, as that would force most of the data manipulation to take place on the CPU side, sending only each QR decomposition to the GPU?
Your whole data matrix (1e4 x 1e6) may be too large to fit in global memory, while each of your least-squares solves (1e4 x 10) may be too small to fully utilize the GPU.
For each least squares problem, you could use cuSolver for QR factorization and triangular solving.
http://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
If the problem size is too small to fully utilize the GPU, you could use concurrent kernel execution to solve multiple equations at the same time.
https://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/
For the whole data matrix, if it cannot fit into global memory, you could work on only part of it at a time. For example, you could divide the matrix into ten (1e4 x 1e5) blocks; each time, load two of the blocks through PCIe, select all possible two-column combinations from the two blocks, solve those equations, and then load another two blocks. Maximizing the block size will help you minimize the PCIe data transfer. With proper design, I'm sure the time for PCIe data transfer will be much smaller than solving ~1e12 equations. Furthermore, you could overlap the data transfer with solver kernel executions.
https://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
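To make the blocking concrete, here is a rough host-side sketch of that loop structure in plain C++, with hypothetical placeholder functions (load_block_to_gpu, solve_column_pairs_on_gpu) standing in for the actual cuSolver/stream code: each pair of blocks is moved over PCIe once, and all column pairs drawn from those two blocks (including pairs within a single block) are solved while they are resident.

#include <cstddef>
#include <vector>

struct Block { std::size_t first_col, num_cols; };   // one 1e4 x 1e5 slice of the genetic matrix

// Hypothetical placeholders: a real version would copy the slice into device
// memory and launch batched QR solves, overlapping copies with compute via streams.
void load_block_to_gpu(const Block&) {}
void solve_column_pairs_on_gpu(const Block&, const Block&) {}

int main() {
    const std::size_t total_cols = 1000000, block_cols = 100000;
    std::vector<Block> blocks;
    for (std::size_t c = 0; c < total_cols; c += block_cols)
        blocks.push_back({c, block_cols});

    // Each unordered pair of blocks (a block may be paired with itself) is
    // resident on the GPU exactly once, so only ~n_blocks^2/2 PCIe transfers
    // cover all ~5e11 column pairs.
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        load_block_to_gpu(blocks[i]);
        for (std::size_t j = i; j < blocks.size(); ++j) {
            if (j != i) load_block_to_gpu(blocks[j]);
            solve_column_pairs_on_gpu(blocks[i], blocks[j]);
        }
    }
    return 0;
}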

speed up ideas -- can CUDA help here?

I'm working on an algorithm that has to perform a small number of operations on a large number of small arrays, somewhat independently.
To give an idea:
1k sorts of arrays, each typically 0.5k-1k elements long.
1k LU-solves of matrices of rank 10-20.
everything is in floats.
Then there is some horizontality to this problem: the above operations have to be carried out independently on 10k arrays.
Also, the intermediate results need not be stored: for example, I don't need to keep the sorted arrays, only the sum of the smallest $m$ elements.
The whole thing has been programmed in C++ and runs. My question is: would you expect a problem like this to enjoy significant speed-ups (a factor of 2 or more) with CUDA?
You can run this in 5 lines of ArrayFire code. I'm getting speedups of ~6X with this over the CPU, and ~4X over Thrust (which was designed for vectors, not matrices). Since you're only using a single GPU, you can run the ArrayFire Free version.
array x = randu(512,1000,f32);
array y = sort(x); // sort each 512-element column independently

array a = randu(15,15,1000,f32), b;
gfor (array i, a.dim(2))
    b(span,span,i) = lu(a(span,span,i)); // LU-decomposition of each 15x15 matrix
Keep in mind that GPUs perform best when memory accesses are aligned to multiples of 32, so a bunch of 32x32 matrices will perform better than a bunch of 31x31.
If you "only" need a factor of 2 speed up I would suggest looking at more straightforward optimisation possibilities first, before considering GPGPU/CUDA. E.g. assuming x86 take a look at using SSE for a potential 4x speed up by re-writing performance critical parts of your code to use 4 way floating point SIMD. Although this would tie you to x86 it would be more portable in that it would not require the presence of an nVidia GPU.
Having said that, there may even be simpler optimisation opportunities in your code base, such as eliminating redundant operations (useless copies and initialisations are a favourite) or making your memory access pattern more cache-friendly. Try profiling your code with a decent profiler to see where the bottlenecks are.
Note however that in general sorting is not a particularly good fit for either SIMD or CUDA, but other operations such as LU decomposition may well benefit.
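As a small illustration of what 4-way float SIMD looks like (a generic sketch, not taken from the poster's code base), here is an SSE-intrinsics sum over an array, processing four floats per instruction:

#include <xmmintrin.h>   // SSE intrinsics
#include <cstddef>

// Sum an array of floats four lanes at a time. For simplicity this assumes
// n is a multiple of 4; a real version adds a scalar tail loop and, ideally,
// uses 16-byte-aligned loads (_mm_load_ps) when alignment is guaranteed.
float sum_sse(const float* data, std::size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(data + i));   // 4-wide add
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}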
Just a few pointers that you may already have incorporated:
1) If you just need the m smallest elements, you are probably better off just searching for the smallest element, removing it, and repeating m times (a standard-library alternative is sketched after this answer).
2) Did you already parallelize the code on the CPU? OpenMP or so...
3) Did you think about buying better hardware? (I know it's not the nicest thing to do, but if you want to reach performance goals for a specific application it's sometimes the cheapest possibility...)
If you want to do it on CUDA, it should work conceptually, so no big problems should occur. However, there are always the little things, which depend on experience and so on.
Consider the Thrust library for the sorting part; hopefully someone else can suggest a good LU-decomposition algorithm.
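On point 1) above: a standard-library alternative to repeated minimum extraction is std::nth_element, which partially partitions the array in linear expected time and is usually cheaper once m grows beyond a handful of elements. A minimal sketch (not part of the original answer):

#include <algorithm>
#include <numeric>
#include <vector>

// Sum of the m smallest elements without a full sort (assumes m <= v.size()).
// The vector is taken by value so the caller's data is left untouched.
float sum_of_smallest_m(std::vector<float> v, std::size_t m) {
    std::nth_element(v.begin(), v.begin() + m, v.end());   // smallest m moved to the front
    return std::accumulate(v.begin(), v.begin() + m, 0.0f);
}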

Should I use CUDA here?

I have to multiply a very small matrix (size 10x10) with a vector 50,000 to 100,000 times (it could even be more than that). This happens for 1000 different matrices (could be many more). Would there be any significant performance gain by doing this operation on CUDA?
Yes, it's an ideal task for the GPU.
If you want to multiply a single matrix with a vector 50K times and each multiplication depends on the result of the previous one, then don't use CUDA: it's a serial problem, best suited to the CPU. However, if each multiplication is independent, you can perform them simultaneously on CUDA.
The only case where your program will give a tremendous speedup is when each vector-multiplication iteration is independent of the data of the other iterations. That way you'll be able to launch 50K or more iterations simultaneously by launching an equal number of threads.
Depending on what exactly you're doing, yes, this could be done very quickly on a GPU, but you might have to run your own kernel to get good performance from it.
Without knowing more about your problem, I can't give you too much advice. But I can speculate on a solution:
If you're taking one vector and multiplying it by the same matrix several thousand times, you would be much better off finding the closed form of the matrix raised to an arbitrary power. You can do this using the Cayley–Hamilton theorem or the Jordan canonical form.
I can't seem to find an implementation of this from a quick googling, but considering I did this in first-year linear algebra, it's not too bad. Some info on the Jordan normal form, and on raising it to powers, can be found at http://en.wikipedia.org/wiki/Jordan_normal_form#Powers ; the transformation matrices are just a matrix of eigenvectors and the inverse of that matrix.
Say you have a matrix A, and you find the Jordan normal form J, and the transformation matrices P, P^-1, you find
A^n = P J^n P^-1
I can't seem to find a good link to an implementation of this, but computing the closed form of a 10x10 matrix would be significantly less time consuming than 50,000 matrix multiplications. And an implementation of this would probably run much quicker on a CPU.
If this is your problem, you should look into this.
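A hedged sketch of this closed-form idea using the Eigen library, under the assumption that the matrix is diagonalizable (then the Jordan form is simply diagonal; defective matrices need the full Jordan blocks, which this shortcut does not handle):

#include <Eigen/Dense>
#include <cmath>
#include <complex>

// Compute A^n * v via the eigendecomposition A = P D P^-1, so A^n = P D^n P^-1.
// One decomposition plus a few 10x10 multiplies replaces n repeated products.
Eigen::VectorXd apply_matrix_power(const Eigen::MatrixXd& A,
                                   const Eigen::VectorXd& v, int n) {
    Eigen::EigenSolver<Eigen::MatrixXd> es(A);
    const Eigen::MatrixXcd P = es.eigenvectors();
    const Eigen::VectorXcd lambda = es.eigenvalues();
    Eigen::VectorXcd lambda_n(lambda.size());
    for (Eigen::Index i = 0; i < lambda.size(); ++i)
        lambda_n(i) = std::pow(lambda(i), static_cast<double>(n));   // lambda_i^n
    const Eigen::MatrixXcd An = P * lambda_n.asDiagonal() * P.inverse();
    return (An * v.cast<std::complex<double>>()).real();             // imaginary parts cancel for real A
}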

How can I convert a very large integer from one base/radix to another using FFT?

Are there known algorithms which will take a big integer with n digits encoded in one base/radix and convert it to another arbitrary base? (Let's say from base 7 to base 19.) n can be really big, like more than 100,000 digits, so I am looking for something better than O(n^2) run time.
I have seen algorithms that can multiply two huge integers using the Fast Fourier Transform (FFT) with a theoretical complexity of O(n log n), where n is the number of digits, so I wonder if something similar exists for base/radix conversion?
I'm not well versed on the topic myself, but here's a page that hints at how to do radix conversion a bit faster than the naive remainder-and-divide algorithm:
GNU MP - Binary to Radix
The page hints that you need a fast divide-and-conquer division algorithm, which in turn needs a fast multiplication algorithm (Karatsuba, Toom-Cook, FFT, etc.).
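For illustration, here is a rough sketch of that divide-and-conquer radix conversion on top of GMP's big integers (gmpxx), so the fast division comes from GMP itself. Each recursion level splits the number with one large division per piece, giving O(log n) levels whose total cost is dominated by the sub-quadratic division/multiplication rather than the O(n^2) of digit-by-digit remainder-and-divide:

#include <gmpxx.h>
#include <vector>

namespace {
// Emit the digits of x in `base`, most significant first, splitting by
// powers[level] = base^(2^level). `pad` keeps leading zeros in lower halves.
void emit_digits(const mpz_class& x, int level,
                 const std::vector<mpz_class>& powers,
                 std::vector<unsigned>& out, bool pad) {
    if (level < 0) {                                    // x < base: a single digit
        out.push_back(static_cast<unsigned>(x.get_ui()));
        return;
    }
    mpz_class q, r;
    mpz_fdiv_qr(q.get_mpz_t(), r.get_mpz_t(),
                x.get_mpz_t(), powers[level].get_mpz_t());
    if (q == 0 && !pad)
        emit_digits(r, level - 1, powers, out, false);  // drop leading zeros
    else {
        emit_digits(q, level - 1, powers, out, pad);
        emit_digits(r, level - 1, powers, out, true);
    }
}
}  // namespace

// Digits of x (x >= 0) in `base`, most significant first.
std::vector<unsigned> to_base(const mpz_class& x, unsigned base) {
    std::vector<mpz_class> powers{mpz_class(base)};     // base^(2^0), base^(2^1), ...
    while (powers.back() <= x)
        powers.push_back(powers.back() * powers.back());
    std::vector<unsigned> digits;
    emit_digits(x, static_cast<int>(powers.size()) - 2, powers, digits, false);
    return digits;
}

// Usage: GMP already parses input in an arbitrary base, e.g.
//   mpz_class v("16342", 7);                      // base-7 input
//   std::vector<unsigned> d = to_base(v, 19);     // digits in base 19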

What is the best algorithm for solving a band-diagonal matrix?

I'm trying to figure out the best way to solve a pentadiagonal system. Is there anything faster than Gaussian elimination?
You should do an LU or Cholesky decomposition of the matrix, depending on whether your matrix is Hermitian positive definite or not, and then do back-substitution with the factors. This is essentially just Gaussian elimination, but it tends to have better numerical properties. I recommend using LAPACK, since those implementations tend to be the fastest and the most robust. Look at the _GBSV routines, where the blank is one of s, d, c, z, depending on your number type.
Edit: In case you're asking if there is an algorithm faster than the factor/solve (Gaussian elimination) method, no there is not. A specialized factorization routine for a banded matrix takes about 4n*k^2 operations (k is the bandwidth), while the backward substitution takes about 6*n*k operations. Thus, for fixed bandwidth, you cannot do better than linear time in n.
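To show what calling the _GBSV routines actually looks like, here is a minimal sketch using the double-precision banded solver dgbsv from C++ for a pentadiagonal system (kl = ku = 2); the only fiddly part is LAPACK's band storage layout. This assumes a Fortran LAPACK is linked (e.g. -llapack) with 32-bit integers; an ILP64 build would need 64-bit integer arguments instead.

#include <algorithm>
#include <cstdlib>
#include <vector>

extern "C" void dgbsv_(const int* n, const int* kl, const int* ku,
                       const int* nrhs, double* ab, const int* ldab,
                       int* ipiv, double* b, const int* ldb, int* info);

int main() {
    const int n = 8, kl = 2, ku = 2, nrhs = 1, ldab = 2 * kl + ku + 1;
    // Example pentadiagonal matrix: 4 on the diagonal, -1 on the first
    // off-diagonals, 0.5 on the second off-diagonals.
    // LAPACK band storage (column-major, 0-based): ab[(kl+ku+i-j) + j*ldab] = A(i,j).
    std::vector<double> ab(static_cast<std::size_t>(ldab) * n, 0.0);
    for (int j = 0; j < n; ++j)
        for (int i = std::max(0, j - ku); i <= std::min(n - 1, j + kl); ++i) {
            const int d = std::abs(i - j);
            ab[(kl + ku + i - j) + j * ldab] = (d == 0) ? 4.0 : (d == 1 ? -1.0 : 0.5);
        }
    std::vector<double> b(n, 1.0);   // right-hand side
    std::vector<int> ipiv(n);
    int info = 0;
    dgbsv_(&n, &kl, &ku, &nrhs, ab.data(), &ldab, ipiv.data(), b.data(), &n, &info);
    return info;                     // 0 on success; b now holds the solution
}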