Accelerate vDSP FFT resulting in NaN under demanding scenario

I'm using the vDSP framework for a real-time audio application based on FFT computation.
After a lot of trouble trying to figure out why the algorithm was producing incorrect results, I found the following comment in the official vDSP FFT example code (DemonstrateFFT.c, lines 242, 416, 548):
/* Zero the signal before timing because repeated FFTs on non-zero
data can cause abnormalities such as infinities, NaNs, and
subnormal numbers.
*/
To reproduce the error, just comment out line 247 (so the signal is not zeroed) and add something similar to the following line at line 273 (just after the vDSP_fft_zrip call):
if (isnan(Observed.realp[0])) printf("Iteration %lu: NaN\n",i); // it would work with any of the components of Observed
It is interesting to observe that reducing N (i.e. increasing the number of FFTs per unit time) makes the zrip algorithm fail sooner, which kind of makes sense given that the comment warns about performing repeated FFTs.
The behavior is also observed with the vDSP_fft_zrop algorithm.
I'm really wondering what the point is of performing FFTs on "zero data" as the comment advises. Either I'm missing something important, or the vDSP framework is simply not suited for real-time audio processing.

Normal 16- and 24-bit "real-time" audio samples will not trigger this issue.
But benchmarks can create numbers large and small enough to exceed the range of double-precision floats when iterated enough times, and with many functions, not just FFTs. Try iterating exp() fed back into itself; that blows up even faster. It's a problem one encounters with any finite-precision computer arithmetic (not just the ARM and x86 CPUs that vDSP uses).
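For instance (a tiny self-contained sketch, nothing vDSP-specific): feeding exp() back into itself exceeds the range of double precision within a handful of iterations.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double x = 1.0;
    for (int i = 0; i < 10; i++) {
        x = exp(x);  /* grows super-exponentially: 2.7, 15.2, 3.8e6, inf */
        printf("iteration %d: x = %g\n", i, x);
        if (isinf(x) || isnan(x)) {
            printf("non-finite after %d iterations\n", i + 1);
            break;
        }
    }
    return 0;
}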

Related

FFT output shows unexpected symmetry

I am running a CFFT on a signal. The output seems to show symmetry. I know that an FFT of a real signal is symmetric, but the code
arm_cfft_f32(&arm_cfft_sR_f32_len512, &FFTBuf[0], 0, 1);
arm_cmplx_mag_f32(&FFTBuf[0], &FFTMagBuf[0], FFT_LEN);
accounts for this, since FFTMagBuf is half the length of the input array. The output, though, still appears to show symmetry:
https://imgur.com/K0uMDAm
The arrows point to my whistle, which shows up nicely, surrounded by a lot of noise. The middle peak is probably a harmonic (my whistling is crap), but the left-right symmetry is noticeable.
I am using an STM32F4 Discovery board, the samples come from the on-board MEMS microphone, and each block of samples (in this case 1024, to give an FFT of length 512) is passed through a Hann window.
I am using a modified version of Tony DiCola's spectrogramui.py for visualization.
According to the documentation, arm_cmplx_mag_f32 computes the magnitude of a complex signal. That's why FFTMagBuf has to be half the size of FFTBuf: both arrays hold real numbers, but each complex sample is made of two reals. It's unrelated to the symmetry of the FFT.
So the output signal has exactly the same number of samples as the input.
That is, you compute the complex FFT of a real signal, which has conjugate symmetry, and then you take the magnitude, which is therefore symmetric. Of course, the plot is then symmetric too.
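If what you want is the non-redundant, one-sided spectrum, a minimal sketch along these lines (hypothetical buffer names; assumes CMSIS-DSP and the same 512-point CFFT instance as in the question, not the poster's exact pipeline) pads each real sample with a zero imaginary part and keeps only the first half of the magnitude bins:

#include "arm_math.h"
#include "arm_const_structs.h"

#define FFT_LEN 512

float32_t samples[FFT_LEN];      /* windowed real input (hypothetical) */
float32_t fftBuf[2 * FFT_LEN];   /* interleaved re/im pairs            */
float32_t magBuf[FFT_LEN / 2];   /* one-sided magnitude spectrum       */

void spectrum(void)
{
    /* pack real samples as complex numbers with zero imaginary parts */
    for (uint32_t i = 0; i < FFT_LEN; i++) {
        fftBuf[2 * i]     = samples[i];
        fftBuf[2 * i + 1] = 0.0f;
    }
    arm_cfft_f32(&arm_cfft_sR_f32_len512, fftBuf, 0, 1);
    /* for real input, bins k and FFT_LEN-k are complex conjugates, so
       only the first FFT_LEN/2 bins carry unique information */
    arm_cmplx_mag_f32(fftBuf, magBuf, FFT_LEN / 2);
}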

Why does FFT of sine wave have magnitudes in multiple bins

I've been playing around with Web Audio a bit. I have a simple oscillator node playing at a frequency of context.sampleRate / analyzerNode.fftSize * 5 (107.666015625 in this case). When I call analyzer.getByteFrequencyData I would expect it to have a value in the 5th bin and nowhere else. What I actually see is [0,0,0,240,255,255,255,240,0,0...]
Why am I getting values in multiple bins?
The Web Audio AnalyserNode applies a Blackman window before computing the FFT. This windowing function smears the single tone across several bins.
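To see the smearing outside a browser, here is a tiny self-contained sketch (naive DFT, plain C, nothing from Web Audio assumed): a tone placed exactly on bin 5, multiplied by a Blackman window, shows energy in the neighboring bins as well, much like the [0,0,0,240,255,255,255,240,...] above.

#include <math.h>
#include <stdio.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define N 64

int main(void)
{
    double x[N];
    for (int n = 0; n < N; n++) {
        /* classic Blackman window */
        double w = 0.42 - 0.5 * cos(2 * M_PI * n / (N - 1))
                        + 0.08 * cos(4 * M_PI * n / (N - 1));
        x[n] = w * sin(2 * M_PI * 5 * n / N);  /* tone exactly on bin 5 */
    }
    for (int k = 0; k <= N / 2; k++) {         /* naive DFT magnitudes  */
        double re = 0, im = 0;
        for (int n = 0; n < N; n++) {
            re += x[n] * cos(2 * M_PI * k * n / N);
            im -= x[n] * sin(2 * M_PI * k * n / N);
        }
        printf("bin %2d: %8.3f\n", k, sqrt(re * re + im * im));
    }
    return 0;
}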
This happens because your sequence is finite, so your signal is implicitly assumed to last for a finite amount of time. You are effectively calculating the FFT with a rectangular window: the signal is considered to last only for the number of generated samples, and that "discontinuity" (the fact that the signal has a finite number of samples) creates spectral leakage. To minimise this effect, you can try various window functions which, applied to your data before the FFT, reduce the leakage.
It also looks like you might be clipping somewhere in your computation, by using a test signal too large for your data or arithmetic format (note the run of 255s, the maximum byte value). Try again using a floating-point format.

CURAND properties of generators

CURAND comes with an array of random number generators, but I have failed to find any comparison of the performance (and randomness) properties of each of them; mostly, I'd be interested in which generator to use for which application to gain maximum performance. I'd be happy if someone could quickly outline the differences between them or link me to a resource that does so.
Thanks in advance.
The cuRAND documentation (linked in the sources below) includes a chart comparing the performance of the different RNGs.
As for randomness, that depends only on the RNG type/algorithm, so you can refer to the Intel MKL documentation, which has detailed information and references to research papers. The generator type names in cuRAND and MKL are very similar:
http://software.intel.com/sites/products/documentation/hpc/mkl/mklman/GUID-3D7D2650-A414-4C95-AF33-BE291BAB2AC3.htm
The first difference is efficiency. XORWOW is the default generator, but it isn't always the most efficient. For instance, Philox is faster at generating normally distributed floats.
Another difference is that, in practice, some generators let you produce more than one float per call.
For example, with Philox you can generate up to four normally or uniformly distributed floats per call, while with XORWOW you can generate at most two:
__device__ float4 curand_normal4(curandStatePhilox4_32_10_t *state)
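For instance (a minimal device-side sketch with hypothetical kernel and buffer names; the curand_init, curand_normal2 and curand_normal4 calls are the real cuRAND device API):

#include <curand_kernel.h>

/* one state per thread, one subsequence per thread */
__global__ void setup_states(curandStateXORWOW_t *sx,
                             curandStatePhilox4_32_10_t *sp,
                             unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &sx[id]);
    curand_init(seed, id, 0, &sp[id]);
}

__global__ void draw_normals(curandStateXORWOW_t *sx,
                             curandStatePhilox4_32_10_t *sp,
                             float2 *out2, float4 *out4)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandStateXORWOW_t x = sx[id];         /* local copies are cheaper */
    curandStatePhilox4_32_10_t p = sp[id];  /* than repeated global I/O */
    out2[id] = curand_normal2(&x);          /* 2 normals per call       */
    out4[id] = curand_normal4(&p);          /* 4 normals per call       */
    sx[id] = x;                             /* persist updated states   */
    sp[id] = p;
}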
The next difference is the period of the pseudorandom sequence (the total state space of the PRNG before you start to see repeats). XORWOW has a period of about 2^190 (with the state set up 2^67 into the sequence for a given seed)*. For Philox, subsequence and offset together define an offset within a sequence of period 2^128.
Note that if you run millions of threads with the same seed, you could run out of state space per thread and start seeing repeats: ((2^190) / (10^6)) / (2^67) ≈ 1.0633824 × 10^31.
One more difference is the size of the state. For XORWOW, sizeof(curandState_t) is 48 bytes, while sizeof(curandStatePhilox4_32_10_t) is 64 bytes.
When you run millions of threads (each thread has its own curand state), you can run out of device memory: 1024^2 * 64 bytes ≈ 64 megabytes per million threads.
XORWOW, Philox, MRG32k3a, and MTGP32 are pseudorandom generators, while both Sobol variants are quasi-random generators.
* When curand_init is called with a seed, it scrambles that seed and then skips ahead 2^67 numbers (this is somewhat expensive, but has some nice properties).
sources:
https://developer.nvidia.com/cuRAND
http://cs.brown.edu/courses/cs195v/lecture/week11.pdf

CUDA samples matrixMul error

I am very new to CUDA and started reading about parallel programming and CUDA just a few weeks ago. After I installed the CUDA toolkit, I was browsing the SDK samples (which come with the installation of the toolkit) and wanted to try some of them out. I started with matrixMul from the 0_Simple folder. This program executes fine (I am using Visual Studio 2010).
Now I want to change the size of the matrices and try a bigger one (for example 960x960 or 1024x1024). In this case, something crashes: I get a black screen, and then the message: display driver stopped responding and has recovered.
I am changing these two lines in the code (in the main function):
dim3 dimsA(8*4*block_size, 8*4*block_size, 1);
dim3 dimsB(8*4*block_size, 8*4*block_size, 1);
before they were:
dim3 dimsA(5*2*block_size, 5*2*block_size, 1);
dim3 dimsB(5*2*block_size, 5*2*block_size, 1);
Can someone point out what I am doing wrong? Should I alter something else in this example for it to work properly? Thanks!
Edit: as some of you suggested, I changed the timeout value (0 somehow did not work for me, so I set the timeout to 60), so my driver no longer crashes, but I get a huge list of errors, like:
... ... ...
Error! Matrix[409598]=6.40005159, ref=6.39999986 error term is > 1e-5
Error! Matrix[409599]=6.40005159, ref=6.39999986 error term is > 1e-5
Does this have something to do with the allocation of memory? Should I make changes there, and what could they be?
Your new problem is actually just the strict tolerance set in the NVIDIA example. Your kernel is running correctly; the check is just complaining that the accumulated error is greater than the limit set for this example. That is because you're doing many more math operations, each of which accumulates error. If you look at the numbers it prints, you're only off from the reference answer by about 0.00005, which is not unusual after a lot of single-precision floating-point math. You're getting these errors now, and not with the default matrix sizes, because the original matrices were smaller and thus required far fewer operations to multiply. Matrix multiplication of N x N matrices requires on the order of N^3 operations, so the operation count grows much faster than the matrix size, and the accumulated error grows in proportion to the number of operations.
If you look near the end of the runTest() function, there's a call to computeGold(), which computes the reference answer on your CPU. There should then be a call to something like shrCompareL2fe that compares the results. The last parameter to this call is a tolerance. If you increase this tolerance (say, to 1e-3 or 1e-4 instead of 1e-5), you should eliminate these error messages. Note that there may be a couple of these calls: the version of the SDK examples that I have includes an optional CUBLAS implementation, so it compares that against the gold result too. The one right after the print statement that says "Comparing CUDA matrixMul & Host results" is the one you'd want to change.
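A sketch of that change (illustrative only; the helper name and argument order depend on your SDK version):

/* last argument is the comparison tolerance: relax it from 1e-5 to 1e-4 */
shrBOOL res = shrCompareL2fe(reference, h_C, size_C, 1.0e-4f);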
I'd advise looking a bit more closely at the indexing used in the kernel (matrixMulCUDA); it sounds like you're writing to unallocated memory.
More specifically, is the only thing you changed the dimsA and dimsB variables? Inside the kernel, the thread and block indices are used to access the data. Did you also increase the data size accordingly? There is no bounds checking in the kernel, so if you change only the kernel launch configuration but not the data, odds are you're writing past your data into some other memory.
Have you disabled Timeout Detection and Recovery (TDR) in Windows? It is entirely possible that your code is running fine but that the larger matrices caused the kernel execution to exceed Windows' timeout, which makes Windows assume the card is locked up, so it resets the card and gives you a message identical to the one you describe. Even if that is not your problem here, you definitely want to disable TDR before doing any serious CUDA work on Windows. The timeout is quite short by default, since normal graphics rendering takes small fractions of a second per frame.
See this post on the NVidia forums that describes TDR and how to turn it off:
WDDM TDR - NVidia devtalk forum
In particular, you probably want to set the key HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrLevel to 0 (Detection Disabled).
Alternatively, you can increase the timeout period by setting
HKLM\System\CurrentControlSet\Control\GraphicsDrivers\TdrDelay. It defaults to 2 and is specified in seconds. Personally, I have found that TDR is always annoying when doing work in CUDA, so I just turn it off entirely. IIRC, you need to restart your system for any TDR-related changes to take effect.

Fermi CUDA double precision against C

There is a small error between CPU and GPU double-precision results when using a Fermi GPU.
E.g. for a small test set, I get the following absolute error for (Number 1 (CPU) - Number 2 (GPU)): 3E-18.
In binary form it is, as expected, very small:
NUMBER 1 in binary:
xxxxxxxxxxxxx11100000001001
vs
NUMBER 2 in binary:
xxxxxxxxxxxx111100000001010
Although this is a difference of only one binary digit, I am keen to eliminate any differences, as the errors add up during my code.
Any tips from those familiar with Fermi? If this is unavoidable, can I get C/C++ to mimic the Fermi rounding behaviour?
You should take a look at this post.
Floating-point arithmetic is not associative, so if a compiler chooses to do operations in a different order, you'll get a different result. Two versions of the same compiler can produce differences! Different compilers are even more likely to produce differences, and if you're doing work in parallel on the GPU (you are, right?) then you're inherently doing operations in a different order...
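For example (a self-contained sketch; a classic demonstration, not from the original post):

#include <stdio.h>

int main(void)
{
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);  /* prints 1 */
    printf("a + (b + c) = %g\n", a + (b + c));  /* prints 0: b + c rounds
                                                   back to -1e16 */
    return 0;
}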
Fermi hardware is IEEE 754-2008 compliant, which means that in addition to IEEE 754 standard rounding it also has the fused multiply-add (FMA) instruction, which avoids losing precision between the multiplication and the addition by rounding only once.
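If FMA contraction is what differs between your CPU and GPU results, one way to mimic it on the CPU is the C99 fma() function (a sketch; compile with contraction disabled, e.g. -ffp-contract=off on gcc/clang, so the compiler does not fuse the first expression on its own):

#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + 1e-8, b = 1.0 - 1e-8, c = -1.0;
    double separate = a * b + c;    /* two roundings: multiply, then add */
    double fused    = fma(a, b, c); /* one rounding, like Fermi's FMA    */
    printf("separate: %.17g\n", separate);  /* the two results differ    */
    printf("fused:    %.17g\n", fused);     /* in the last bits          */
    return 0;
}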