How to interpret the result from KissFFT's kiss_fftr (FFT for a real signal) function - fft

I'm using KissFFT's real function to transform some real audio signals. I'm confused, since I input a real signal with nfft samples, but the result is nfft/2+1 complex frequency bins.
From KissFFT's README:
The real (i.e. not complex) optimization code only works for even length ffts. It does two half-length FFTs in parallel (packed into real&imag), and then combines them via twiddling. The result is nfft/2+1 complex frequency bins from DC to Nyquist.
So I have no concrete knowledge of how to interpret the result. My assumption is the data is packed like r[0]i[0]r[1]i[1]...r[nfft/2]i[nfft/2], where r[0] would be DC, i[0] is the first frequency bin, r[1] the second, and so on. Is this the case?

Yes. The reason kiss_fftr produces only nfft/2+1 bins is that the DFT of a real signal is conjugate-symmetric. The coefficients corresponding to negative frequencies (-pi:0, or pi:2pi, whichever way you like to think about it) are the conjugates of the coefficients from [0:pi).
Note that the out[0] and out[nfft/2] bins (DC and Nyquist) have zero in the imaginary part. I've seen some libraries pack these two real parts together into the first complex element, but I view that as a breach of contract that leads to difficult-to-diagnose, nearly-right bugs.
Tip: if you are using float as your data type (the default), you can cast the output array to float complex* (C99) or std::complex<float>* (C++). The memory layout of the kiss_fft_cpx struct is compatible. The reason kiss_fft doesn't use these types by default is that it works with types besides float and double, and on older ANSI C compilers that lack these features.
Here's a contrived example (assuming a C99 compiler and float as the scalar type):
float get_nth_bin_phase(const float * in, int nfft, int whichbin)
{
    kiss_fftr_cfg st = kiss_fftr_alloc(nfft, 0, 0, 0);
    float complex * out = malloc(sizeof(float complex) * (nfft/2 + 1));
    float ph;
    kiss_fftr(st, in, (kiss_fft_cpx*)out);
    whichbin %= nfft;
    if (whichbin <= nfft/2)
        ph = cargf(out[whichbin]);               /* bins 0..nfft/2 are stored directly */
    else
        ph = cargf(conjf(out[nfft - whichbin])); /* negative frequencies via conjugate symmetry */
    free(out);
    kiss_fft_free(st);
    return ph;
}

r[1] and i[1] of the kiss_fftr result constitute a complex value. Together they give you the magnitude (the square root of the sum of the squares of the two components) and the phase (via atan2()) of the first frequency bin above DC.
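In code, a minimal sketch (assuming the default float scalar type; the helper name is just for illustration):
#include <math.h>
#include "kiss_fftr.h"

/* Magnitude and phase of bin k from a kiss_fftr output array. */
void bin_mag_phase(const kiss_fft_cpx *out, int k, float *mag, float *phase)
{
    *mag   = sqrtf(out[k].r * out[k].r + out[k].i * out[k].i);
    *phase = atan2f(out[k].i, out[k].r);
}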

Related

Numerical stability of ODE system

I have to numerically solve an ODE system of the following form:
du_j/dt = f_1(u_j, v_j, t) + g_1(t)v_(j-1) + h_1(t)v_(j+1),
dv_j/dt = f_2(u_j, v_j, t) + g_2(t)u_(j-1) + h_2(t)u_(j+1),
where u_j(t) and v_j(t) are complex-valued scalar functions of time t, f_i, g_i, and h_i are given functions, and j = -N, ..., N. This is an initial value problem, and the task is to find the solution at a certain time T.
If g_i(t) = h_i(t) = 0, the equations for different values of j can be solved independently. In this case I obtain stable and accurate solutions with the fourth-order Runge-Kutta method. However, once I turn on the couplings, the results become very unstable with respect to the time step and the explicit form of the functions g_i, h_i.
I guess it is reasonable to try an implicit Runge-Kutta scheme, which might be stable in such a case, but then I would have to invert a huge matrix of size 4*N*c at each step, where c depends on the order of the method (e.g. c = 3 for the Gauss–Legendre method). Of course, the matrix will mostly contain zeros and have block tridiagonal form, but it still seems very time-consuming.
So I have two questions:
Is there a stable explicit method which works even when the coupling functions g_i and h_i are (very) large?
If an implicit method is, indeed, a good solution, what is the fastest method for inverting a block tridiagonal matrix? At the moment I just use Gaussian elimination, avoiding the redundant operations that arise from the specific structure of the matrix.
Additional info and details that might help:
I use Fortran 95.
I currently consider g_1(t) = h_1(t) = g_2(t) = h_2(t) = -iAF(t)sin(omega*t), where i is the imaginary unit, A and omega are given constants, and F(t) is a smooth envelope going slowly first from 0 to 1 and then from 1 back to 0, so F(0) = F(T) = 0.
Initially u_j = v_j = 0 except for j = 0. The functions u_j and v_j for large |j| remain extremely small for all t, so the initial peak does not reach the "boundaries".
To 1): There will be no stable explicit method if your coupling functions are very large. This is because the stability region of explicit (Runge-Kutta) methods is bounded.
To 2): If your matrices are larger than 100x100, you could use the method from this paper:
Inverses of Block Tridiagonal Matrices and Rounding Errors.
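For orientation, here is the scalar Thomas algorithm in C (a sketch only; you use Fortran 95, but the structure translates directly). It exploits the tridiagonal structure to solve in O(n) rather than O(n^3); the block tridiagonal variant has the same shape, with each scalar division replaced by a small dense solve. Note that b and d are modified in place:
/* Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], i = 0..n-1,
   with a[0] and c[n-1] unused. */
void thomas_solve(int n, const double *a, double *b, const double *c,
                  double *d, double *x)
{
    for (int i = 1; i < n; i++) {          /* forward elimination */
        double m = a[i] / b[i-1];
        b[i] -= m * c[i-1];
        d[i] -= m * d[i-1];
    }
    x[n-1] = d[n-1] / b[n-1];
    for (int i = n - 2; i >= 0; i--)       /* back substitution   */
        x[i] = (d[i] - c[i] * x[i+1]) / b[i];
}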

fftw library, what is the output size of fftw_plan_dft_r2c_1d?

I'm new to the FFTW library. I have an array of n real data points and use fftw_plan_dft_r2c_1d to find the FFT spectrum. What will the size of the output be? Is it n, the same as the input? Also, is the result centered around 0 Hz, or do I have to do that manually?
For a real-to-complex transform you get N / 2 + 1 complex outputs for N real inputs (the redundant symmetric outputs are not generated).
The 0 Hz component is in bin 0.
This is all covered in the FFTW manual.
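For reference, a minimal sketch (assuming double precision and the FFTW_ESTIMATE planner flag):
#include <fftw3.h>

int main(void)
{
    int N = 1024;                                /* example input size */
    double       *in  = fftw_malloc(sizeof(double) * N);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (N/2 + 1));
    fftw_plan plan = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);

    /* ... fill in[0..N-1] with the real data ... */
    fftw_execute(plan);
    /* out[k][0] and out[k][1] are the real and imaginary parts of bin k,
       for k = 0 (DC, i.e. 0 Hz) through N/2 (Nyquist). */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}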
This is not an answer to your question, but I hope it can be a solution to your problem.
If you only want to find the spectrum of your data, you might use the "halfcomplex" format.
Here is a piece of code:
double *in, *out;
fftw_plan myplan;
in  = fftw_malloc(N * sizeof(double));
out = fftw_malloc(N * sizeof(double));
myplan = fftw_plan_r2r_1d(N, in, out, FFTW_R2HC, FFTW_ESTIMATE);
// Fill in[] with your data.
...
fftw_execute(myplan);
Now out contains r0, r1, r2, ..., r(N/2), i((N+1)/2-1), ..., i2, i1, as described in the manual.
r0 (out[0]) is the mean value of your data/signal.
r1 (out[1]) is the real part of the first element of the DFT.
...
i0 is 0 because your data are real, so it isn't stored in out.
i1 (out[N-1]) is the imaginary part of the first element of the DFT.
i2 (out[N-2]) is the imaginary part of the second element of the DFT.
If N is an even number, then r(N/2) (out[N/2]) is the amplitude at the Nyquist frequency.
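As an illustration (a sketch; the helper name is made up), the bin magnitudes can be recovered from this layout like so:
#include <math.h>

/* mag[] must have room for N/2 + 1 values. */
void halfcomplex_magnitudes(const double *out, double *mag, int N)
{
    mag[0] = fabs(out[0]);                    /* DC bin: purely real            */
    for (int k = 1; k < (N + 1) / 2; k++)     /* r_k at out[k], i_k at out[N-k] */
        mag[k] = sqrt(out[k]*out[k] + out[N-k]*out[N-k]);
    if (N % 2 == 0)
        mag[N/2] = fabs(out[N/2]);            /* Nyquist bin: purely real       */
}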
Remember that FFTW computes only the sums of products of your data with the trigonometric basis functions; it does not normalize them.
You can find more info about the halfcomplex here.

CUDA Atomic operation on array in global memory

I have a CUDA program whose kernel basically does the following.
I provide a list of n points in Cartesian coordinates, e.g. (x_i, y_i), in a plane of dimension dim_x * dim_y. I invoke the kernel accordingly.
For every point (x_p, y_p) on this plane I calculate, via a formula, the time it would take each of those n points to reach it, given that the n points are moving with a certain velocity.
I order those times in increasing order t_0, t_1, ..., t_n, where the precision of t_i is set to one decimal place, i.e. if t'_i = 2.3453 then I only use t_i = 2.3.
Assuming the times are generated from a normal distribution, I simulate the 3 quickest times to find the percentage of times each of those 3 points arrives earliest. Suppose a random experiment yields prob_0 = 0.76, prob_1 = 0.20 and prob_2 = 0.04. Since the point with time t_0 arrives first most often among the three, I also return the original index (before sorting of times) of that point, say idx_0 = 5 (an integer).
Hence for every point on this plane I get a pair (prob,idx).
Suppose n/2 of those points are of one kind and the rest are of the other. (A sample generated image accompanied the original question.)
Especially when the precision of the times was set to one decimal place, I noticed that the number of unique 3-tuples of times (t_0, t_1, t_2) was just 2.5% of the total number of data points, i.e. the number of points on the plane. This meant that most of the time the kernel was uselessly simulating when it could just reuse the values from previous simulations. Hence I could use a dictionary with a 3-tuple of times as the key and (prob, idx) as the value. Since, as far as I know and have tested, the STL can't be accessed inside a kernel, I constructed an array of floats of size 201000000. This choice came from experimentation, since none of the top 3 times exceeded 20 seconds: t_0 can take any value from {0.0, 0.1, 0.2, ..., 20.0}, i.e. 201 choices. With the times scaled by 10 to make them integers, I could construct a key for such a dictionary as
Key = t_0 * 10^6 + t_1 * 10^3 + t_2
As far as the value is concerned, I could make it prob + idx. Since idx is an integer and 0.0 <= prob <= 1.0, I can retrieve both values later by
prob = dict[key] - floor(dict[key])
idx = floor(dict[key])
So now my kernel looks like the following
__global__ void my_kernel(float* points, float* dict, float *p, float *i, size_t width, ...)
{
    unsigned int col = blockIdx.y*blockDim.y + threadIdx.y;
    unsigned int row = blockIdx.x*blockDim.x + threadIdx.x;
    size_t key;
    float prob, idx;
    // Calculate the time taken for each of the n points to reach this point on the plane
    // Order the times in increasing order t_0, t_1, ..., t_n
    // Calculate key = t_0 * 10^6 + t_1 * 10^3 + t_2 (times scaled by 10)
    if (dict[key] > 0.0f) {
        prob = dict[key] - floorf(dict[key]);
        idx  = floorf(dict[key]);
    }
    else {
        // Simulate to find prob and idx
        dict[key] = prob + idx;
    }
    p[row*width + col] = prob;
    i[row*width + col] = idx;
}
The result is quite similar to that of the original program for most points, but for some it is wrong.
I am quite sure this is due to a race condition. Note that dict was initialized with all zeroes. The basic idea is to make the data structure "read many, write once" at any particular location of dict.
I am aware that there might be far more optimized ways of solving this problem than allocating so much memory. Please let me know in that case. But I would really like to understand why this particular solution is failing. In particular, I would like to know how to use atomicAdd in this setting. I have failed to use it.
Unless your simulation in the else branch is very long (hundreds of floating-point operations), a lookup table in global memory is likely to be slower than just running the computation. Global memory access is very expensive!
In any case, there is no way to save time by "skipping work" using conditional branching. The Single Instruction, Multiple Thread (SIMT) architecture of a GPU means that the instructions for both sides of the branch will be executed serially, unless all of the threads in a warp follow the same branch.
edit:
The fact that you are seeing a performance increase as a result of introducing the conditional branch, and that you didn't have any problems with deadlock, suggests that all the threads in each warp are always taking the same branch. I suspect that once dict starts getting populated, the performance increase will go away.
Perhaps I have misunderstood something, but if you want the probability density at a value x, assuming a normal distribution with given mean mu and standard deviation sigma, there is no need to generate a load of random numbers and approximate a Gaussian curve. You can calculate it directly:
p = exp(-((x - mu) * (x - mu) / (2.0f * sigma * sigma))) /
(sigma * sqrt(2.0f * M_PI));
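As for the atomics question, a sketch of one possible approach (not from the answers above, so treat it as an assumption to verify): a deadlock-free way to get "read many, write once" behavior is to have a thread that misses in the cache run the simulation anyway, then publish its result with atomicCAS, so only the first writer wins and no thread ever waits on another:
__device__ float cached_or_fresh(float *dict, size_t key, float fresh)
{
    float old = dict[key];
    if (old > 0.0f)
        return old;                 // cache hit: reuse the earlier result
    // Cache miss: 'fresh' holds the value we just simulated (prob + idx).
    // atomicCAS works on 32-bit words, so compare/swap the float's bit
    // pattern; empty slots hold 0.0f, whose bit pattern is 0.
    atomicCAS((int *)&dict[key],
              __float_as_int(0.0f),
              __float_as_int(fresh));
    return fresh;
}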

CUDA, float precision

I am using CUDA 4.0 on a GeForce GTX 580 (Fermi). I have numbers as small as 7.721155e-43. I want to multiply them with each other just once, or better said, I want to calculate 7.721155e-43 * 7.721155e-43.
My experience shows that I can't do it straightforwardly. Could you please give me a suggestion? Do I need to use double precision? How?
The magnitude of the smallest normal IEEE single-precision number is about 1.18e-38; the smallest denormal gets you down to about 1.40e-45. As a consequence, an operand of magnitude 7.72e-43 will comprise only about 9 non-zero bits, which in itself may already be a problem, even before you get to the multiplication (whose result will underflow to zero in single precision). So you may also want to look at any upstream computation that produces these tiny numbers.
If these small numbers are intermediate terms in a mathematical expression, rewriting that expression into a mathematically equivalent one that does not involve tiny intermediates would be one way of addressing the issue. Or you could scale some operands by factors that are powers of two (so as not to incur additional round-off due to the scaling). For example, scale by 2^24 = 16777216.
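For instance (a sketch; the scale factor 2^140 is chosen only to make this particular operand comfortably normal), ldexpf multiplies by a power of two exactly, so the scaling itself adds no round-off:
#include <math.h>
#include <stdio.h>

int main(void)
{
    float f = 7.721155e-43f;    /* denormal: only ~9 significant bits      */
    float s = ldexpf(f, 140);   /* f * 2^140, about 1.08; exact scaling    */
    float p = s * s;            /* represents f*f scaled by 2^280          */
    /* f*f itself (about 5.96e-85) is not representable in float, so keep
       the result in scaled form, or unscale in double for the final value */
    double result = ldexp((double)p, -280);
    printf("%g\n", result);     /* prints roughly 5.96e-85 */
    return 0;
}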
Lastly, you can switch part of the computation to double precision. To do so, simply introduce temporary variables of type double, perform the computation on them, then convert the final result back to float:
float r, f = 7.721155e-43f;
double d, t;
d = (double)f; // explicit cast is not necessary, since converting to wider type
t = d * d;
[... more intermediate computation, leaving result in 't' ...]
r = (float)t; // since conversion is to narrower type, cast will avoid warnings
In statistics we often have to work with likelihoods that end up being very small numbers, and the standard technique is to use logs for everything. Multiplication on a log scale then becomes addition, and all intermediate numbers are stored as logs. It can take a bit of getting used to, but the alternative will often fail even for relatively modest computations. In R (for my convenience!), which uses doubles and prints 7 significant figures by default:
> 7.721155e-43 * 7.721155e-43
[1] 5.961623e-85
> exp(log(7.721155e-43) + log(7.721155e-43))
[1] 5.961623e-85
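For reference, the same technique sketched in C, using double for the logs:
#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 7.721155e-43, b = 7.721155e-43;
    double log_prod = log(a) + log(b);   /* multiplication becomes addition */
    printf("%g\n", exp(log_prod));       /* prints roughly 5.96162e-85      */
    return 0;
}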

How would you write this algorithm for large combinations in the most compact way?

The number of combinations of k items which can be retrieved from N items is described by the following formula.
c = N! / (k! * (N - k)!)
An example would be how many combinations of 6 Balls can be drawn from a drum of 48 Balls in a lottery draw.
Optimize this formula to run with the smallest O time complexity
This question was inspired by the new WolframAlpha math engine and the fact that it can calculate extremely large combinations very quickly (e.g. the query linked below), and by a subsequent discussion on the topic on another forum.
http://www97.wolframalpha.com/input/?i=20000000+Choose+15000000
I'll post some info/links from that discussion after some people take a stab at the solution.
Any language is acceptable.
Python: O(min(k, n-k)^2)
def choose(n,k):
    k = min(k,n-k)
    p = q = 1
    for i in xrange(k):
        p *= n - i
        q *= 1 + i
    return p/q
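For the lottery example from the question, choose(48, 6) returns 12271512.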
Analysis:
The size of p and q will increase linearly inside the loop, if n-i and 1+i can be considered to have constant size.
The cost of each multiplication will then also increase linearly.
The sum over all iterations becomes an arithmetic series in k.
My conclusion: O(k^2)
If rewritten to use floating point numbers, the multiplications will be atomic operations, but we will lose a lot of precision. It even overflows for choose(20000000, 15000000). (Not a big surprise, since the result would be around 0.2119620413 × 10^4884378.)
def choose(n,k):
    k = min(k,n-k)
    result = 1.0
    for i in xrange(k):
        result *= 1.0 * (n - i) / (1 + i)
    return result
Notice that WolframAlpha returns a "Decimal Approximation". If you don't need absolute precision, you could do the same thing by calculating the factorials with Stirling's Approximation.
Now, Stirling's approximation requires the evaluation of (n/e)^n, where e is the base of the natural logarithm, which will be by far the slowest operation. But this can be done using the techniques outlined in another stackoverflow post.
If you use double precision and repeated squaring to accomplish the exponentiation, the operations will be:
3 evaluations of a Stirling approximation, each requiring O(log n) multiplications and one square root evaluation.
2 multiplications
1 division
The number of operations could probably be reduced with a bit of cleverness, but the total time complexity is going to be O(log n) with this approach. Pretty manageable.
EDIT: There's also bound to be a lot of academic literature on this topic, given how common this calculation is. A good university library could help you track it down.
EDIT2: Also, as pointed out in another response, the values will easily overflow a double, so a floating point type with very extended precision will need to be used for even moderately large values of k and n.
I'd solve it in Mathematica:
Binomial[n, k]
Man, that was easy...
Python: approximation in O(1) ?
Using Python's decimal module to calculate an approximation. Since it does not use any explicit loop, and the numbers are limited in size, I think it will execute in O(1).
from decimal import Decimal

ln   = lambda z: z.ln()
exp  = lambda z: z.exp()
sinh = lambda z: (exp(z) - exp(-z))/2
sqrt = lambda z: z.sqrt()

pi = Decimal('3.1415926535897932384626433832795')
e  = Decimal('2.7182818284590452353602874713527')

# Stirling's approximation of the gamma function.
# Simplification by Robert H. Windschitl.
# Source: http://en.wikipedia.org/wiki/Stirling%27s_approximation
gamma = lambda z: sqrt(2*pi/z) * (z/e*sqrt(z*sinh(1/z)+1/(810*z**6)))**z

def choose(n, k):
    n = Decimal(str(n))
    k = Decimal(str(k))
    return gamma(n+1)/gamma(k+1)/gamma(n-k+1)
Example:
>>> choose(20000000,15000000)
Decimal('2.087655025913799812289651991E+4884377')
>>> choose(130202807,65101404)
Decimal('1.867575060806365854276707374E+39194946')
Any higher, and it will overflow. The exponent seems to be limited to 40000000.
Given a reasonable number of values for n and k, calculate them in advance and use a lookup table.
It's dodging the issue in some fashion (you're offloading the calculation), but it's a useful technique if you have to determine large numbers of values.
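A sketch of that idea in C, filling the table with Pascal's rule (the bound of 67 is where the central binomial coefficients stop fitting in 64 bits):
#include <stdint.h>

#define MAXN 67   /* C(67,33) still fits in a uint64_t; C(68,34) does not */

static uint64_t C[MAXN + 1][MAXN + 1];

/* Precompute all C(n,k) for 0 <= k <= n <= MAXN via Pascal's rule. */
void build_table(void)
{
    for (int n = 0; n <= MAXN; n++) {
        C[n][0] = C[n][n] = 1;
        for (int k = 1; k < n; k++)
            C[n][k] = C[n-1][k-1] + C[n-1][k];
    }
}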
MATLAB:
The cheater's way (using the built-in function NCHOOSEK): 13 characters, O(?)
nchoosek(N,k)
My solution: 36 characters, O(min(k,N-k))
a=min(k,N-k);
prod(N-a+1:N)/prod(1:a)
I know this is a really old question, but I struggled with this problem for a long while until I found a really simple solution written in VB 6. After porting it to C#, here is the result:
public int NChooseK(int n, int k)
{
    var result = 1;
    for (var i = 1; i <= k; i++)
    {
        // After this pair of operations, result == C(n - k + i, i),
        // so the integer division is always exact.
        result *= n - (k - i);
        result /= i;
    }
    return result;
}
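For the lottery example above, NChooseK(48, 6) returns 12271512. Be aware, though, that the intermediate products overflow a 32-bit int well before the final result does, so use a wider type for anything much larger.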
The final code is so simple you won't believe it will work until you run it.
Also, the original article gives a nice explanation of how the author arrived at the final algorithm.