I have some FFT data, 257 dimensions, every 10 ms, with 121 frames, i.e. 1.21 secs. I think the first dimension is probably something else and the remaining are the FFT coefficients, I guess.
It's probably just spectrogram data. From a comment about the FFT data, sqrt10 and mean-variance normalization might have been applied to it.
From there, I want to calculate back some PCM signal at 44.1 kHz so I can play the sound. I asked the same question in a more mathematical way here, but maybe StackOverflow is a better place because I actually want to implement this.
I also asked the same question about the theory here on DSP SE.
How would I do that? Maybe I need some more information (which I have to find out somehow), but which? Maybe this missing information can be intelligently guessed somehow?
This question is both about the theory and the practical implementation. The implementation is trivial, I guess. But a concrete example in some language would be nice to help in understanding the theory. Maybe C++ with FFTW? I skimmed through the FFTW docs but I fail to understand all the terminology and some background, e.g. here. Why is it from complex to real or the other way around, when I only want real to real? What are those REDFT? What's a DCT, DFT, DST? FFTW_HC2R?
I read all the FFT data, i.e. 121 * 257 floats, into a vector freq_bins.
std::vector<float32_t> freq_bins; // FFT data, 121 * 257 floats
int freq_bins_count = 257;
size_t len = 121;
std::vector<float32_t> pcm; // output, PCM data
int N = freq_bins_count;
std::vector<double> out(N), orig_in(N);
// offset, scale, dt (the 10 ms frame step) and sampleRate are defined elsewhere.
// inspiration: https://stackoverflow.com/questions/2459295/invertible-stft-and-istft-in-python/6891772#6891772
for(int f = 0; f < len; ++f) {
    size_t pos = freq_bins_count * f;
    for(int i = 0; i < N; ++i)
        out[i] = pow(freq_bins[pos + i] + offset, 10); // fft was sqrt10 + mvn
    fftw_plan q = fftw_plan_r2r_1d(N, &out[0], &orig_in[0], FFTW_REDFT00, FFTW_ESTIMATE);
    fftw_execute(q);
    fftw_destroy_plan(q);
    // naive overlap-and-add
    auto start_frame = size_t(f * dt * sampleRate);
    for(int i = 0; i < N; ++i) {
        sample_t frame = orig_in[i] * scale / (2 * (N - 1));
        size_t idx = start_frame + i;
        while(idx >= pcm.size())
            pcm.push_back(0);
        pcm[idx] += frame;
    }
}
But this is wrong, I guess. I just get garbage out.
Related might be this question. Or this.
If the data you have is real, it is most probably spectrogram data; if it is complex, you most probably have raw short-time Fourier transform (STFT) data (see the diagram on this post for how STFT/spectrogram data is produced). Spectrogram data is produced by taking the magnitude squared of STFT data and is therefore not invertible, because all the phase information in the audio signal has been lost. Raw STFT data, on the other hand, is invertible, so if that is what you have, you might want to look for a library that performs the inverse STFT and try using that.
As for the question of what the FFT dimensions in your data represent, I reckon the 257 data points you are receiving every 10 ms are the result of a 512-point FFT being used in the STFT process (512/2 + 1 = 257). The first value is the 0 Hz bin and the remaining 256 data points are one half of the FFT spectrum; the other half has been discarded because the input to the FFT is real, so one half of the FFT data is simply the complex conjugate of the other half.
In addition to this, I would like to point out that just because you are receiving FFT data every 10 ms, 121 times, does not mean the audio signal is 1.21 s long. The STFT is usually produced using overlapping windows, so your audio signal might be shorter than 1.21 s.
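If it turns out you really do have complex STFT frames (257 complex bins per frame), the inverse step for a single frame with FFTW could look roughly like the sketch below. The FFT size (512), the 10 ms hop and the absence of a synthesis window are assumptions that would have to match whatever produced the data, and none of this helps if the data is only a magnitude spectrogram, since the phase is gone.

#include <fftw3.h>
#include <vector>

// Inverts one STFT frame (257 complex bins) back to a 512-sample chunk.
std::vector<double> invert_frame(const double *re, const double *im) // 257 values each
{
    const int N = 512;                                   // assumed FFT size (N/2 + 1 = 257)
    std::vector<double> frame(N);

    fftw_complex *in = (fftw_complex*) fftw_malloc(sizeof(fftw_complex) * (N/2 + 1));
    for (int i = 0; i <= N/2; ++i) { in[i][0] = re[i]; in[i][1] = im[i]; }

    fftw_plan p = fftw_plan_dft_c2r_1d(N, in, frame.data(), FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
    fftw_free(in);

    // FFTW's r2c/c2r pair is unnormalized, so divide by N here
    // (assuming no scaling was applied on the analysis side).
    for (double &s : frame) s /= N;
    return frame;   // overlap-add these chunks at the 10 ms hop to rebuild the signal
}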
You'd simply push the data you have through the inverse Fourier transform. All FFT libraries offer forward and backward transformation functions.
Related
I'm implementing a realtime graphics engine (C++ / OpenGL) that moves a vehicle over time along a specified course that is described by a polynomial function. The function itself was programmatically generated outside the application and is of a high order (I believe >25), so I can't really post it here (I don't think it matters anyway). During runtime the function does not change, so it's easy to calculate the first and second derivatives once to have them available quickly later on.
My problem is that I have to move along the curve with a constant speed (say 10 units per second), so my function parameter is not equal to the time directly, since the arc length between two points x1 and x2 differs depending on the function values. For example the difference f(a+1) - f(a) may be way larger or smaller than f(b+1) - f(b), depending on how the function looks at points a and b.
I don't need a 100% accurate solution, since the movement is only visual and will not be processed any further, so any approximation is OK as well. Also please keep in mind that the whole thing has to be calculated at runtime each frame (60fps), so solving huge equations with complex math may be out of the question, depending on computation time.
I'm a little lost on where to start, so even any train of thought would be highly appreciated!
Since the criterion was not to have an exact solution, but a visually appealing approximation, there were multiple possible solutions to try out.
The first approach (suggested by Alnitak in the comments and later answered by coproc) I implemented, which is approximating the actual arclength integral by tiny iterations. This version worked really well most of the time, but was not reliable at really steep angles and used too many iterations at flat angles. As coproc already pointed out in the answer, a possible solution would be to base dx on the second derivative.
All these adjustments could be made; however, I need a runtime-friendly algorithm. With this one it is hard to predict the number of iterations, which is why I was not happy with it.
The second approach (also inspired by Alnitak) is utilizing the first derivative by "pushing" the vehicle along the calculated slope (which is equal to the derivative at the current x value). The function for calculating the next x value is really compact and fast. Visually there is no obvious inaccuracy and the result is always consistent. (That's why I chose it)
float current_x = ...; // stores current x
float f(float x) { ... }
float f_derv(float x) { ... }

void calc_next_x(float units_per_second, float time_delta) {
    float arc_length = units_per_second * time_delta;
    float derv_squared = f_derv(current_x) * f_derv(current_x);
    current_x += arc_length / sqrt(derv_squared + 1);
}
This approach, however, will possibly only be accurate enough for cases with high frame rates (mine is >60 fps), since the object will always be pushed along a straight line with a length depending on said frame time.
Given the constant speed and the time between frames the desired arc length between frames can be computed. So the following function should do the job:
#include <cmath>

typedef double (*Function)(double);

double moveOnArc(Function f, const double xStart, const double desiredArcLength, const double dx = 1e-2)
{
    double arcLength = 0.;
    double fPrev = f(xStart);
    double x = xStart;
    double dx2 = dx*dx;
    while (arcLength < desiredArcLength)
    {
        x += dx;
        double fx = f(x);
        double dfx = fx - fPrev;
        arcLength += sqrt(dx2 + dfx*dfx);
        fPrev = fx;
    }
    return x;
}
Since you say that accuracy is not a top criterion, the above function might work right away with an appropriately chosen dx. Of course, it could be improved by adjusting dx automatically (e.g. based on the second derivative) or by refining the endpoint with a binary search.
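For illustration, the binary-search refinement mentioned above could be bolted on roughly like this (a sketch, untested against your curve): march in dx steps as before, and once the next step would pass the target, bisect that last interval.

#include <cmath>

typedef double (*Function)(double);

double moveOnArcRefined(Function f, const double xStart, const double desiredArcLength,
                        const double dx = 1e-2, const int refineSteps = 20)
{
    double arcLength = 0.;
    double x = xStart;
    double fPrev = f(x);

    while (arcLength < desiredArcLength)
    {
        const double fx = f(x + dx);
        const double step = std::sqrt(dx*dx + (fx - fPrev)*(fx - fPrev));
        if (arcLength + step >= desiredArcLength)
        {
            // Bisect [x, x + dx] to find where the remaining arc length is used up.
            double lo = 0., hi = dx;
            const double remaining = desiredArcLength - arcLength;
            for (int i = 0; i < refineSteps; ++i)
            {
                const double mid = 0.5 * (lo + hi);
                const double fMid = f(x + mid);
                const double len = std::sqrt(mid*mid + (fMid - fPrev)*(fMid - fPrev));
                if (len < remaining) lo = mid; else hi = mid;
            }
            return x + 0.5 * (lo + hi);
        }
        arcLength += step;
        x += dx;
        fPrev = fx;
    }
    return x;
}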
I have a lot of random floating point numbers residing in global GPU memory. I also have "buckets" that specify ranges of numbers they will accept and a capacity of numbers they will accept.
ie:
numbers: -2 0 2 4
buckets(size=1): [-2, 0], [1, 5]
I want to run a filtration process that yields me
filtered_nums: -2 2
(where filtered_nums can be a new block of memory)
But every approach I take runs into a huge overhead of trying to synchronize threads across bucket counters. If I try to use a single-thread, the algorithm completes successfully, but takes frighteningly long (over 100 times slower than generating the numbers in the first place).
What I am asking for is a general, high-level, efficient, as-simple-as-possible algorithm that you would use to filter these numbers.
edit
I will be dealing with 10 buckets and half a million numbers, where every number falls into exactly one of the 10 bucket ranges. Each bucket will hold 43,000 elements. (There are excess elements, since the objective is to fill every bucket, and many numbers will be discarded.)
2nd edit
It's important to point out that the buckets do not have to be stored individually. The objective is just to discard elements that would not fit into a bucket.
You can use thrust::copy_if (thrust::remove_copy_if with the opposite predicate would also work):

struct within_limit
{
    __host__ __device__
    bool operator()(const float x)
    {
        return (x >= lo && x < hi);
    }
};

thrust::copy_if(input, input + N, result, within_limit());
You will have to replace lo and hi with constants for each bin.
I think you can templatize the kernel, but then again you will have to instantiate the template with actual constants. I can't see an easy way around it, but I may be missing something.
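One way around hard-coding the constants, sketched below with made-up helper names, is to give the functor its limits through its constructor and loop over the bins on the host:

#include <thrust/device_vector.h>
#include <thrust/copy.h>

struct within_limit
{
    float lo, hi;
    within_limit(float lo_, float hi_) : lo(lo_), hi(hi_) {}

    __host__ __device__
    bool operator()(const float x) const
    {
        return x >= lo && x < hi;
    }
};

// input:  device vector with the N random numbers
// result: device vector large enough to hold all kept numbers
// lo, hi: host arrays with the bin limits
void filter_bins(const thrust::device_vector<float> &input,
                 thrust::device_vector<float> &result,
                 const float *lo, const float *hi, int nbins)
{
    auto out = result.begin();
    for (int i = 0; i < nbins; ++i)
        out = thrust::copy_if(input.begin(), input.end(), out, within_limit(lo[i], hi[i]));
}

Note this only does the range filtering; enforcing the 43,000-element capacity per bucket from the edit would still have to happen afterwards, e.g. by truncating each bin's output range.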
If you are willing to look at third party libraries, arrayfire may offer an easier solution.
array I = array(N, input, afDevice);
float **Res = (float **)malloc(sizeof(float *) * nbins);
for(int i = 0; i < nbins; i++) {
    array res = where(I >= lo[i] && I < hi[i]);
    Res[i] = res.device<float>();
}
I have been thinking of how to perform this operation on CUDA using reductions, but I'm a bit at a loss as to how to accomplish it. The C code is below. The important part to keep in mind -- the variable precalculatedValue depends on both loop iterators. Also, the variable ngo is not unique to every value of m... e.g. m = 0,1,2 might have ngo = 1, whereas m = 4,5,6,7,8 could have ngo = 2, etc. I have included sizes of loop iterators in case it helps to provide better implementation suggestions.
// macro that translates 2D [i][j] array indices to 1D flattened array indices
#define idx(i,j,lda) ( (j) + ((i)*(lda)) )
int Nobs = 60480;
int NgS = 1859;
int NgO = 900;
// ngo goes from [1,900]
// rInd is an initialized (and filled earlier) as:
// rInd = new long int [Nobs];
for (m=0; m<Nobs; m++) {
    ngo = rInd[m] - 1;

    for (n=0; n<NgS; n++) {
        Aggregation[idx(n,ngo,NgO)] += precalculatedValue;
    }
}
In a previous case, when precalculatedValue was only a function of the inner loop variable, I saved the values in unique array indices and added them with a parallel reduction (Thrust) after the fact. However, this case has me stumped: the values of m are not uniquely mapped to the values of ngo. Thus, I don't see a way of making this code efficient (or even workable) with a reduction. Any ideas are most welcome.
Just a stab...
I suspect that transposing your loops might help.
for (n=0; n<NgS; n++) {
    for (m=0; m<Nobs; m++) {
        ngo = rInd[m] - 1;
        Aggregation[idx(n,ngo,NgO)] += precalculatedValue(m,n);
    }
}
The reason I did this is because idx varies more rapidly with ngo (function of m) than with n, so making m the inner loop improves coherence. Note I also made precalculatedValue a function of (m, n) because you said that it is -- this makes the pseudocode clearer.
Then, you could start by leaving the outer loop on the host, and making a kernel for the inner loop (60,480-way parallelism is enough to fill most current GPUs).
In the inner loop, just start by using an atomicAdd() to handle collisions. If they are infrequent, on Fermi GPUs performance shouldn't be too bad. In any case, you will be bandwidth bound since arithmetic intensity of this computation is low. So once this is working, measure the bandwidth you are achieving, and compare to the peak for your GPU. If you are way off, then think about optimizing further (perhaps by parallelizing the outer loop -- one iteration per thread block, and do the inner loop using some shared memory and thread cooperation optimizations).
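For illustration, an inner-loop kernel along those lines might look roughly like the sketch below (my names, not code from the question; it assumes the values for the current n are already on the device as one float per m, and uses the single-precision atomicAdd that Fermi-class GPUs support):

// One thread per m; the host keeps the outer loop over n and launches this once per n.
__global__ void innerLoopKernel(const long *rInd,    // rInd[m], 1-based as in the question
                                const float *values,  // precalculated value for (m, n), one per m
                                float *Aggregation,   // flattened [NgS x NgO] array
                                int n, int Nobs, int NgO)
{
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= Nobs) return;

    int ngo = rInd[m] - 1;
    // Several m can map to the same ngo, so the update has to be atomic.
    atomicAdd(&Aggregation[n * NgO + ngo], values[m]);
}

// Host side, inside the loop over n:
//   innerLoopKernel<<<(Nobs + 255) / 256, 256>>>(d_rInd, d_values_n, d_Aggregation, n, Nobs, NgO);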
The key: start simple, measure performance, and then decide how to optimize.
Note that this calculation looks very similar to a histogram calculation, which has similar challenges, so you might want to google for GPU histograms to see how they have been implemented.
One idea is to sort (rInd[m], m) pairs using thrust::sort_by_key() and then (since the rInd duplicates will be grouped together), you can iterate over them and do your reductions without collisions. (This is one way to do histograms.) You could even do this with thrust::reduce_by_key().
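A rough sketch of that sort-then-reduce route with Thrust (the names are mine, not from the question): sort the (rInd[m], m) pairs once, then for each n gather that row's values in sorted order and sum each run of equal keys.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/gather.h>

// One-time preparation:
//   thrust::device_vector<int> mOrder(Nobs);
//   thrust::sequence(mOrder.begin(), mOrder.end());              // 0, 1, ..., Nobs-1
//   thrust::sort_by_key(rInd.begin(), rInd.end(), mOrder.begin());

void aggregateRow(const thrust::device_vector<long>  &rIndSorted, // rInd sorted ascending
                  const thrust::device_vector<int>   &mOrder,     // original m for each sorted entry
                  const thrust::device_vector<float> &valuesRow,  // precalculatedValue(m, n) for the current n, indexed by m
                  thrust::device_vector<long>        &outKeys,    // at least NgO slots
                  thrust::device_vector<float>       &outSums)    // at least NgO slots
{
    // Reorder this row's values to match the sorted keys.
    thrust::device_vector<float> gathered(valuesRow.size());
    thrust::gather(mOrder.begin(), mOrder.end(), valuesRow.begin(), gathered.begin());

    // Equal rInd values are now adjacent, so reduce_by_key sums each group
    // without any collisions or atomics.
    thrust::reduce_by_key(rIndSorted.begin(), rIndSorted.end(), gathered.begin(),
                          outKeys.begin(), outSums.begin());
    // outSums[j] is the total to add to Aggregation[idx(n, outKeys[j] - 1, NgO)].
}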
I'm using CUDA to run a problem where I need a complex equation with many input matrices. Each matrix has an ID depending on its set (between 1 and 30; there are 100,000 matrices in total), and the result for each matrix is stored in a float[N] array, where N is the number of input matrices.
After this, the result I want is the sum of every float in this array for each ID, so with 30 IDs there are 30 result floats.
Any suggestions on how I should do this?
Right now, I read the float array (400kb) back to the host from the device and run this on the host:
// Allocate result_array for 100,000 floats on the device
// CUDA process input matrices
// Read from the device back to the host into result_array
float result[30] = { 0 }; // one accumulator per ID (IDs assumed 0-based here)
for (int i = 0; i < N; i++)
{
result[input[i].ID] += result_array[i];
}
But I'm wondering if there's a better way.
You could use cublasSasum() to do this - this is a bit easier than adapting one of the SDK reductions (but less general of course; note that Sasum sums absolute values, so it only matches your loop if the per-matrix results are non-negative). Check out the CUBLAS examples in the CUDA SDK.
The number of combinations of k items which can be retrieved from N items is described by the following formula.
c = N! / (k! * (N - k)!)
An example would be how many combinations of 6 Balls can be drawn from a drum of 48 Balls in a lottery draw.
Optimize this formula to run with the smallest O time complexity
This question was inspired by the new WolframAlpha math engine and the fact that it can calculate extremely large combinations very quickly, e.g.
http://www97.wolframalpha.com/input/?i=20000000+Choose+15000000
and by a subsequent discussion on the topic on another forum.
I'll post some info/links from that discussion after some people take a stab at the solution.
Any language is acceptable.
Python: O(min(k, n-k)^2)
def choose(n,k):
    k = min(k,n-k)
    p = q = 1
    for i in xrange(k):
        p *= n - i
        q *= 1 + i
    return p/q
Analysis:
The size of p and q will increase linearly inside the loop, if n-i and 1+i can be considered to have constant size.
The cost of each multiplication will then also increase linearly.
This sum of all iterations becomes an arithmetic series over k.
My conclusion: O(k^2)
If rewritten to use floating point numbers, the multiplications will be atomic operations, but we will lose a lot of precision. It even overflows for choose(20000000, 15000000). (Not a big surprise, since the result would be around 0.2119620413×10^4884378.)
def choose(n,k):
    k = min(k,n-k)
    result = 1.0
    for i in xrange(k):
        result *= 1.0 * (n - i) / (1 + i)
    return result
Notice that WolframAlpha returns a "Decimal Approximation". If you don't need absolute precision, you could do the same thing by calculating the factorials with Stirling's Approximation.
Now, Stirling's approximation requires the evaluation of (n/e)^n, where e is the base of the natural logarithm, which will be by far the slowest operation. But this can be done using the techniques outlined in another stackoverflow post.
If you use double precision and repeated squaring to accomplish the exponentiation, the operations will be:
3 evaluations of a Stirling approximation, each requiring O(log n) multiplications and one square root evaluation.
2 multiplications
1 division
The number of operations could probably be reduced with a bit of cleverness, but the total time complexity is going to be O(log n) with this approach. Pretty manageable.
EDIT: There's also bound to be a lot of academic literature on this topic, given how common this calculation is. A good university library could help you track it down.
EDIT2: Also, as pointed out in another response, the values will easily overflow a double, so a floating point type with very extended precision will need to be used for even moderately large values of k and n.
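For what it's worth, one concrete way down this road that avoids an extended-precision float entirely is to work with the logarithm of the result. The sketch below (not part of the original answer) uses C++'s std::lgamma, which is itself typically implemented with a Stirling-type series, so the whole computation is O(1) in n and k:

#include <cmath>
#include <cstdio>

// ln C(n, k) via the log-gamma function; no intermediate value can overflow.
double log_choose(double n, double k)
{
    return std::lgamma(n + 1) - std::lgamma(k + 1) - std::lgamma(n - k + 1);
}

int main()
{
    const double l = log_choose(20000000, 15000000);   // natural log of the result
    const double log10c = l / std::log(10.0);
    // Print as mantissa x 10^exponent, since the value itself overflows a double.
    const double expo = std::floor(log10c);
    std::printf("C(20000000, 15000000) ~= %.4f x 10^%.0f\n",
                std::pow(10.0, log10c - expo), expo);
    return 0;
}

The result comes out in mantissa-times-power-of-ten form, of the order of 10^4884377 for the example above, with roughly 7 significant digits surviving the cancellation in the log domain.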
I'd solve it in Mathematica:
Binomial[n, k]
Man, that was easy...
Python: approximation in O(1) ?
Using the Python decimal implementation to calculate an approximation. Since it does not use any explicit loop, and the numbers are limited in size, I think it will execute in O(1).
from decimal import Decimal
ln = lambda z: z.ln()
exp = lambda z: z.exp()
sinh = lambda z: (exp(z) - exp(-z))/2
sqrt = lambda z: z.sqrt()
pi = Decimal('3.1415926535897932384626433832795')
e = Decimal('2.7182818284590452353602874713527')
# Stirling's approximation of the gamma function.
# Simplification by Robert H. Windschitl.
# Source: http://en.wikipedia.org/wiki/Stirling%27s_approximation
gamma = lambda z: sqrt(2*pi/z) * (z/e*sqrt(z*sinh(1/z)+1/(810*z**6)))**z
def choose(n, k):
    n = Decimal(str(n))
    k = Decimal(str(k))
    return gamma(n+1)/gamma(k+1)/gamma(n-k+1)
Example:
>>> choose(20000000,15000000)
Decimal('2.087655025913799812289651991E+4884377')
>>> choose(130202807,65101404)
Decimal('1.867575060806365854276707374E+39194946')
Any higher, and it will overflow. The exponent seems to be limited to 40000000.
Given a reasonable number of values for n and k, calculate them in advance and use a lookup table.
It's dodging the issue in some fashion (you're offloading the calculation), but it's a useful technique if you're having to determine large numbers of values.
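For example (a sketch, not from the answer): precompute Pascal's triangle once, after which each query is a single table lookup. With 64-bit integers the table only stays exact up to about n = 67, so this suits the "reasonable number of values" case, not the huge WolframAlpha-sized inputs.

#include <vector>
#include <cstdint>

std::vector<std::vector<std::uint64_t>> buildChooseTable(int maxN)
{
    std::vector<std::vector<std::uint64_t>> c(maxN + 1);
    for (int n = 0; n <= maxN; ++n)
    {
        c[n].assign(n + 1, 1);              // C(n, 0) = C(n, n) = 1
        for (int k = 1; k < n; ++k)
            c[n][k] = c[n - 1][k - 1] + c[n - 1][k];   // Pascal's rule
    }
    return c;
}

// Usage:
//   auto table = buildChooseTable(67);
//   std::uint64_t c = table[48][6];   // 12271512 ways to pick 6 balls from 48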
MATLAB:
The cheater's way (using the built-in function NCHOOSEK): 13 characters, O(?)
nchoosek(N,k)
My solution: 36 characters, O(min(k,N-k))
a=min(k,N-k);
prod(N-a+1:N)/prod(1:a)
I know this is a really old question, but I struggled with a solution to this problem for a long while until I found a really simple one written in VB 6. After porting it to C#, here is the result:
public int NChooseK(int n, int k)
{
    var result = 1;
    for (var i = 1; i <= k; i++)
    {
        result *= n - (k - i);
        result /= i;
    }
    return result;
}
The final code is so simple you won't believe it will work until you run it. (Note that result is an int, so it will overflow for larger n and k; switching to long or BigInteger extends the usable range.)
Also, the original article gives some nice explanation on how he reached the final algorithm.