Intel MKL FFT performance for some conditions

I am currently using Intel's MKL 2D FFT routines.
I am running into a condition where the performance changes by a factor of 4-5.
What I am doing is implementing a type of band-pass filter using FFT libraries. The results of the tests are correct, but the speed is an issue.
What I am seeing is about 1.3 sec on the forward FFT and between 1.3 and 6 seconds on the inverse FFT.
I have tracked this down to the weights I am applying after the forward pass of the FFT.
The weights are between -1 and 0, mostly 0, in the cases where I am getting the 6 seconds.
If I set the weights to 1 before applying them, the time is 1.3 seconds. Other tests show this kind of behavior without using weights of 1.
My question is: how can the values I am applying cause this kind of slowdown? I could understand a minor change in execution time, but not such a dramatic change.
Thanks,
Jim K
I don't know if this is specific to the MKL version of the FFT or a general issue.

Some CPUs require many more execution cycles to do floating-point arithmetic when the operands are underflowed (denormal/subnormal) numbers or when the results underflow.
For your filter coefficients, you can try weights far larger than zero (relative to the IEEE single- or double-precision underflow threshold) and still have a filter with a better than -120 dB stopband. Try that.
Some CPU and OS combinations also allow turning off underflowed floating-point arithmetic (flushing denormals to zero). That may also help; see the sketch below.
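As a hedged illustration of both suggestions (the clamp threshold and the function names here are assumptions, not part of the question), a minimal C++ sketch that clamps near-underflow weights to exact zero and, on x86 with SSE, enables the flush-to-zero / denormals-are-zero modes:

    #include <cmath>
    #include <cstddef>
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE (SSE)
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE (SSE3)

    // Clamp weights tiny enough to produce denormal products to exact 0.
    // The 1e-20f threshold is an assumption: pick one comfortably above
    // FLT_MIN (~1.18e-38) that still meets your stopband spec.
    void clamp_weights(float* w, std::size_t n, float threshold = 1e-20f) {
        for (std::size_t i = 0; i < n; ++i)
            if (std::fabs(w[i]) < threshold) w[i] = 0.0f;
    }

    // Tell the SSE unit to treat denormal inputs/outputs as zero, avoiding
    // the microcode-assist penalty. Affects the calling thread only.
    void disable_denormals() {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }

Calling disable_denormals() before the weight multiply and the inverse FFT should make the 6-second case run at the same speed as the 1.3-second case, if denormals are indeed the culprit.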

Related

Most efficient computational method to numerically minimize an 8-variable constrained system

I have been working for quite some time on finding a numerical instance that solves a system of 7 very complicated inequalities in 8 variables, plus a region specification. Unfortunately I cannot produce an MWE or anything of the sort, since the inputs are really long.
My current method is Mathematica's NMinimize routine, minimizing one of the 7 inequalities subject to every other condition as a constraint -- the FindInstance command simply quits the kernel without being able to finish running.
NMinimize is able to produce output, but besides being slower than would be optimal, it produces results that do not obey every constraint.
The thing is that I need to be certain, for each benchmark I run, that if the output doesn't satisfy every constraint it is because such a set of real numbers doesn't exist -- which, from experience, I can't be with my current method.
So: is there a foolproof, as efficient as possible, computational method for finding a single numerical solution to 7 complicated inequalities (involving trigonometric functions) in 8 variables, or for being sure that no such solution exists?
It could be a Mathematica/Python/Fortran package, a genetic algorithm, or anything else -- as long as there is clear enough documentation.
You need to assign importance multipliers to the constraints, and the optimization method should not be greedy.
A genetic algorithm combined with multiple starting points (or simulated annealing with diminishing mutations) tends to converge to the global minimum (hence not greedy) as you give it more time, but there is no guarantee the heuristic will finish a given problem within a given time budget. The more time you give it, the better it converges.
In a genetic algorithm, you can add big constraint penalties like this:
fitness_minima = some_function_output_between_1_and_10 + (constraints_breached ? 1000.0f : 0.0f);
so that the DNAs with no constraint violations are favored in the crossover part of the GA.
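For concreteness, a minimal C++ sketch of such a penalized fitness function; the objective, the constraint check, and the 1000.0 penalty weight are placeholders to be replaced by your real system:

    #include <cmath>
    #include <vector>

    using Candidate = std::vector<double>;   // the 8 variables of the system

    // Placeholder objective and constraint check; substitute the real ones.
    double objective(const Candidate& x) {
        return 1.0 + 9.0 * std::fabs(std::sin(x[0]));     // value in [1, 10]
    }
    int violated_constraints(const Candidate& x) {
        return (x[0] * x[0] + x[1] * x[1] > 1.0) ? 1 : 0; // dummy constraint
    }

    // Penalized fitness (lower is better): candidates with no constraint
    // violations dominate, so they are favored for crossover.
    double fitness(const Candidate& x) {
        const double penalty = 1000.0;       // assumed weight; tune as needed
        return objective(x) + penalty * violated_constraints(x);
    }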
"As efficient as possible" depends on your algorithm. If you can parallelize the algorithm and run it on multiple GPUs, it should give substantial speedup over CPU. Compared to some hours of Mona-Lisa painting by CPU, a parallelized version running on 3 low-end GPUs complete within 10 minutes (https://www.youtube.com/watch?v=QRZqBLJ6brQ). At least some OpenCL/CUDA supporting libraries/frameworks (like Tensorflow) should be able to accelerate your algorithm if you don't want to do the work distribution yourself.

Writing a Discrete Fourier Transform program

I would like to write a DFT program using FFT.
This is actually for a very large matrix-vector multiplication (10^8 x 10^8), which simplifies to a vector-with-vector convolution, and further reduces to a discrete Fourier transform.
May I ask whether the DFT is exact? The matrix has all discrete binary elements, and the multiplication process cannot tolerate any non-zero error probability in the result. From what I have learnt about the DFT so far, it seems to be an approximation algorithm?
Also, may I ask roughly how long the code would be? I.e., is this something I could write from scratch in C++ in perhaps one or two hundred lines? This is for a paper... and all I need is that the complexity is O(n log n); the constant in front of it doesn't really matter :) So the simplest implementation would be best. (I did see packages like kissfft and FFTW, but they are very lengthy and probably overkill for my purpose...)
A canonical radix-2 FFT can be written in fewer than 200 lines of C++. The average numerical error grows roughly as O(log N), so you will need to use a numeric type and data scale factor large enough to account for this; see the sketch below.
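To give a sense of the size, here is a minimal recursive radix-2 Cooley-Tukey FFT sketch in C++ (untuned; the input length must be a power of two):

    #include <cmath>
    #include <complex>
    #include <cstddef>
    #include <vector>

    using cd = std::complex<double>;

    // In-place recursive radix-2 Cooley-Tukey FFT; a.size() must be a power of two.
    void fft(std::vector<cd>& a) {
        const std::size_t n = a.size();
        if (n <= 1) return;
        std::vector<cd> even(n / 2), odd(n / 2);
        for (std::size_t i = 0; i < n / 2; ++i) {  // split into even/odd samples
            even[i] = a[2 * i];
            odd[i]  = a[2 * i + 1];
        }
        fft(even);                                 // transform the two halves
        fft(odd);
        const double pi = std::acos(-1.0);
        for (std::size_t k = 0; k < n / 2; ++k) {  // combine with twiddle factors
            cd t = std::polar(1.0, -2.0 * pi * k / n) * odd[k];
            a[k]         = even[k] + t;
            a[k + n / 2] = even[k] - t;
        }
    }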
You can compute numerically exact convolutions using the number theoretic transform (NTT). It computes the discrete Fourier transform over integer fields/rings using unique integer sequences, so there is no rounding error. The only caveat is that the signal needs to be integer-valued.
Its implementation is roughly the same size as the FFT's, and a little faster. You can find my implementation at finitetransform.sourceforge.net as the NTTW sub-library. The APFloat library might be more relevant to your needs, as it multiplies large numbers using convolutions.
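For illustration only (this is not the NTTW library's code), a standard iterative NTT sketch over the common prime p = 998244353 with primitive root 3; because all arithmetic is exact modulo p, integer convolutions computed this way have no rounding error:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    const uint64_t MOD = 998244353;  // 119 * 2^23 + 1; supports lengths up to 2^23
    const uint64_t G   = 3;          // primitive root of MOD

    uint64_t pw(uint64_t b, uint64_t e) {            // modular exponentiation
        uint64_t r = 1;
        for (b %= MOD; e; e >>= 1, b = b * b % MOD)
            if (e & 1) r = r * b % MOD;
        return r;
    }

    // In-place NTT; pass invert = true for the inverse transform.
    // a.size() must be a power of two dividing 2^23.
    void ntt(std::vector<uint64_t>& a, bool invert) {
        const std::size_t n = a.size();
        for (std::size_t i = 1, j = 0; i < n; ++i) { // bit-reversal permutation
            std::size_t bit = n >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) std::swap(a[i], a[j]);
        }
        for (std::size_t len = 2; len <= n; len <<= 1) {
            uint64_t w = invert ? pw(G, MOD - 1 - (MOD - 1) / len)
                                : pw(G, (MOD - 1) / len);
            for (std::size_t i = 0; i < n; i += len) {
                uint64_t wn = 1;
                for (std::size_t j = 0; j < len / 2; ++j) {  // butterfly step
                    uint64_t u = a[i + j];
                    uint64_t v = a[i + j + len / 2] * wn % MOD;
                    a[i + j]           = (u + v) % MOD;
                    a[i + j + len / 2] = (u + MOD - v) % MOD;
                    wn = wn * w % MOD;
                }
            }
        }
        if (invert) {                                // scale by n^{-1} mod p
            uint64_t n_inv = pw(n, MOD - 2);
            for (auto& x : a) x = x * n_inv % MOD;
        }
    }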

Do different arithmetic operations have different processing times?

Are the basic arithmetic operations the same with respect to processor usage? E.g., if I do an addition vs. a division in a loop, will the calculation time for the addition be less than that for the division?
I am not sure if this question belongs here or on Computer Science SE.
Yes. Here is a quick example:
http://my.safaribooksonline.com/book/hardware/9788131732465/instruction-set-and-instruction-timing-of-8086/app_c
Those are the microcode and operation timings of a very old architecture, the 8086. It is a fairly simple place to start.
Of relevant note: they are measured in cycles, or clocks, and everything moves at the speed of the CPU (they are synchronized to the main clock, i.e. the frequency of the microprocessor).
If you scroll down that table you'll see a division taking anywhere from 80 to 150 cycles.
Also note that operation speed is affected by which area of memory the operands reside in.
Note that on a modern processor multiple instructions can execute concurrently (even if the CPU is single-threaded), some of them out of order; vector instructions muddy the question even more.
E.g., an SSE multiplication can multiply several numbers in a single instruction (taking multiple cycles).
Yes. Different machine instructions are not equally expensive.
You can either do measurements yourself or use one of the references in this question to help you understand the costs for various instructions.
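If you want to measure it yourself, here is a crude C++ timing sketch; the dependent chain keeps the compiler from vectorizing or deleting the work, but a serious microbenchmark needs considerably more care:

    #include <chrono>
    #include <cstdio>

    // Times a dependent chain of one operation; returns rough ns per op.
    template <typename Op>
    double time_op(Op op, double seed) {
        const long iters = 100000000L;      // 1e8 dependent operations
        double x = seed;
        auto t0 = std::chrono::steady_clock::now();
        for (long i = 0; i < iters; ++i) x = op(x);
        auto t1 = std::chrono::steady_clock::now();
        volatile double sink = x;           // keep the result observable
        (void)sink;
        return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
    }

    int main() {
        double add = time_op([](double x) { return x + 1.0000001; }, 1.0);
        double div = time_op([](double x) { return x / 1.0000001; }, 1.0);
        std::printf("add: %.2f ns/op, div: %.2f ns/op\n", add, div);
    }

On typical modern x86 hardware the division chain runs several times slower per operation than the addition chain, mirroring the cycle counts in the tables above.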

Solving a system of equations with a 10th-degree polynomial (least squares)

I have a numerical problem solving a system of equations (10th-degree polynomial) using the ordinary least squares method (LSM). I obtain parameters with both huge and very small values, and therefore I can't invert the matrix constructed in this method: the precision is too low, even with extended-precision variables. I have tried this in C++, Matlab, and Delphi.
Does anybody know of tools that could do this with enough accuracy, or numerical tips for getting good results? Standard matrix calculation is unfortunately not working out.
I think that your problem comes from the fact that you are using 10th-order polynomials, which quite often lead to numerical problems:
First of all, they can be unsuitable because of large oscillations. Even when interpolating a simple function these oscillations can be present; see the famous Runge example.
Secondly, fitting high-order polynomials can lead to ill-conditioned linear systems, which is why you could not invert the matrix (which you should not do anyway). I ran a simple experiment: I took 11 equidistant points on the interval [0,1] and assembled the matrix of the linear system to solve. Matlab gives a condition number of about 1e8, so the least-squares (normal-equations) matrix has a condition number of about 1e16. Your matrix is therefore 'close to singular', which means that essentially all numerical precision is lost.
So the best way to get rid of your problem is to get rid of the 10th-order polynomial. Consider lower-order polynomials, splines, or piecewise polynomial approximations instead.
If you really need a 10th-order polynomial (e.g. if you know that your data were generated by such a polynomial), then do not invert the matrix. Use a good preconditioner and an iterative method to solve the system without inverting the matrix.
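As a hedged sketch of "solve without inverting" (assuming the Eigen library; the data and the degree are placeholders), here is a least-squares polynomial fit via column-pivoted Householder QR of the Vandermonde matrix, which avoids forming inv(A'A) and its squared condition number:

    #include <Eigen/Dense>
    #include <cmath>
    #include <iostream>

    int main() {
        const int n = 11, degree = 5;         // deliberately lower than 10
        Eigen::VectorXd x(n), y(n);
        for (int i = 0; i < n; ++i) {         // placeholder sample data on [0, 1]
            x(i) = i / double(n - 1);
            y(i) = std::sin(6.0 * x(i));
        }
        Eigen::MatrixXd A(n, degree + 1);     // Vandermonde matrix
        for (int i = 0; i < n; ++i)
            for (int j = 0; j <= degree; ++j)
                A(i, j) = std::pow(x(i), j);
        // QR solves min ||A c - y|| directly; no explicit inverse is formed.
        Eigen::VectorXd c = A.colPivHouseholderQr().solve(y);
        std::cout << "coefficients:\n" << c << "\n";
    }

Switching from raw powers of x to an orthogonal polynomial basis (e.g. Chebyshev) improves the conditioning further.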

Should I use CUDA here?

I have to multiply a very small matrix (size 10x10) with a vector 50,000 to 100,000 times (could even be more than that). This happens for 1000 different matrices (could be many more). Would there be any significant performance gain from doing this operation in CUDA?
Yes, it's an ideal task for the GPU.
If you want to multiply a single matrix with a vector 50K times and each multiplication depends on the previous one, then don't use CUDA: it's a serial problem, best suited to the CPU. However, if the multiplications are independent, you can perform them simultaneously in CUDA.
The case where your program will give a tremendous speedup is when each vector-multiplication iteration is independent of the data of the other iterations. Then you can launch 50K or more iterations simultaneously by launching an equal number of threads; see the sketch below.
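A minimal CUDA sketch of that independent case (one block per matrix-vector pair, one thread per output row; the 10x10 size comes from the question, everything else is assumed):

    #include <cuda_runtime.h>

    const int N = 10;                  // matrix dimension from the question

    // A holds num_mats row-major N*N matrices, x holds num_mats N-vectors;
    // y receives num_mats N-vectors. All pointers are device memory.
    __global__ void batched_matvec(const float* A, const float* x,
                                   float* y, int num_mats) {
        int m   = blockIdx.x;          // which matrix/vector pair
        int row = threadIdx.x;         // which output row
        if (m >= num_mats || row >= N) return;
        const float* Am = A + m * N * N;
        const float* xm = x + m * N;
        float acc = 0.0f;
        for (int col = 0; col < N; ++col)
            acc += Am[row * N + col] * xm[col];
        y[m * N + row] = acc;
    }

    // Host-side launch (allocation and copies omitted for brevity):
    //   batched_matvec<<<num_mats, N>>>(d_A, d_x, d_y, num_mats);

With only 10 threads per block this underutilizes each warp; batching several matrices per block would be the next optimization, but the sketch shows the independence structure that makes the GPU worthwhile.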
Depending on what exactly you're doing, yes, this could be done very quickly on a GPU, but you might have to write your own kernel to get good performance out of it.
Without knowing more about your problem, I can't give you more advice. But I can speculate on a solution:
If you're taking one vector and multiplying it by the same matrix several thousand times, you would be much better off finding the closed form of the matrix raised to an arbitrary power. You can do this using the Cayley-Hamilton theorem or the Jordan canonical form.
I can't seem to find an implementation of this from a quick googling, but considering I did this in first-year linear algebra, it's not too bad. Some info on the Jordan normal form and raising it to powers can be found at http://en.wikipedia.org/wiki/Jordan_normal_form#Powers; the transformation matrices are just a matrix of eigenvectors and the inverse of that matrix.
Say you have a matrix A, and you find its Jordan normal form J along with the transformation matrices P and P^-1. Then
A^n = P J^n P^-1
I can't seem to find a good link to an implementation of this, but computing the closed form of a 10x10 matrix would be significantly less time-consuming than 50,000 matrix multiplications, and an implementation of this would probably run much faster, even on a CPU.
If this is your problem, you should look into this.
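I can't vouch for a packaged Jordan-form routine either, but as a simpler alternative (a different technique than the Jordan route above, assuming the Eigen library), exponentiation by squaring gets A^n in O(log n) matrix products with no eigendecomposition at all:

    #include <Eigen/Dense>

    // Computes A^n by repeated squaring: about log2(n) 10x10 products
    // (roughly 17 for n = 100000) instead of n - 1 of them.
    Eigen::MatrixXd mat_pow(Eigen::MatrixXd A, unsigned long n) {
        Eigen::MatrixXd result =
            Eigen::MatrixXd::Identity(A.rows(), A.cols());
        while (n > 0) {
            if (n & 1) result = result * A;  // fold in the current bit's power
            A = A * A;                       // square for the next bit
            n >>= 1;
        }
        return result;
    }

    // A^n * v is then a single 10x10 matrix-vector product per query.

Note that repeated squaring inherits the same numerical caveat as the closed form: if A has eigenvalues far from magnitude 1, the entries of A^n grow or shrink quickly in floating point.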