Why is TensorFlow.js fftSize different by a factor of two from the Web Audio API fftSize?

tf.data.microphone has an fftSize parameter, described as "the number of samples used to compute each nonoverlapping “frame” of audio" in the Deep Learning with JavaScript book.
However, when a browser AnalyserNode is configured, this number is doubled:
this.analyser.fftSize = this.fftSize * 2;
(see microphone_iterator.ts).
The AnalyserNode.fftSize is also described as, "the window size in samples that is used when performing a FFT".
Why is this quantity doubled?
I suspect part of the reason is that there is a quantity related to the window size, namely the number of frequency bins, which is set to half the fftSize. Indeed, in the source of MicrophoneIterator, fftSize looks to be used as the number of frequency bins:
this.freqData = new Float32Array(this.fftSize);
...
this.analyser.getFloatFrequencyData(this.freqData)
But elsewhere it is used as if it were indeed the window size - here's the polling interval for frames:
this.fftSize / this.sampleRateHz * 1e3
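For reference, here's a minimal sketch (my own illustration, not the library's code) of how these two quantities relate on an AnalyserNode; frequencyBinCount is always half of the node's fftSize:

// Hypothetical illustration relating the TF.js-style fftSize (N) to the
// AnalyserNode configuration performed in microphone_iterator.ts.
const N = 1024; // what tf.data.microphone calls fftSize
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = N * 2; // the doubling in question
console.log(analyser.frequencyBinCount); // N: always half the AnalyserNode fftSize

// A buffer of N floats is exactly the right size for one frame of frequency data:
const freqData = new Float32Array(analyser.frequencyBinCount);
analyser.getFloatFrequencyData(freqData);

// The time-domain window behind that frame is N * 2 samples long,
// i.e. (N * 2) / sampleRate seconds of audio:
console.log((analyser.fftSize / audioCtx.sampleRate) * 1e3, 'ms per window');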
Hoping someone can clarify the precise interpretation of this parameter for me.

Related

Statistical method to know when enough performance test iterations have been performed

I'm doing some performance/load testing of a service. Imagine the test function like this:
bytesPerSecond = test(filesize: 10MB, concurrency: 5)
Using this, I'll populate a table of results for different sizes and levels of concurrency. There are other variables too, but you get the idea.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
However, it can take a while for this stability to occur, and there are a lot of input combinations to evaluate.
How can the test function decide when it's performed enough samples? By enough, I suppose I mean that the result isn't going to change beyond some margin if testing continues.
I remember reading an article about this a while ago (from one of the jsperf authors) that discussed a robust method, but I cannot find the article any more.
One simple method would be to compute the standard deviation over a sliding window of values. Is there a better approach?
IIUC, you're describing the classic problem of estimating the confidence interval of the mean with unknown variance. That is, suppose you have n results, x1, ..., xn, where each xi is a sample from some process about which you don't know much: not the mean, not the variance, and not the distribution's shape. For some required confidence level, you'd like to know whether n is large enough so that, with high probability, the true mean lies within the interval around your sample mean.
(Note that under relatively weak conditions, the Central Limit Theorem guarantees that the sample mean converges to a normal distribution, but to apply it directly you would need the variance.)
So, in this case, the classic solution to determine if n is large enough, is as follows:
Start by calculating the sample mean μ = (∑ᵢ xᵢ) / n. Also calculate the normalized sample variance s² = ∑ᵢ (xᵢ − μ)² / (n − 1).
Depending on the size of n:
If n > 30, the confidence interval is approximated as μ ± z_(α/2) · s / √n, where, if necessary, you can find here an explanation of the z and α.
If n < 30, the confidence interval is approximated as μ ± t_(α/2) · s / √n; see again here an explanation of the t value, as well as a table.
If the confidence is enough, stop. Otherwise, increase n.
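A short sketch of that stopping rule (my own illustration, fixing a 95% confidence level so z_(α/2) ≈ 1.96; the margin is an assumption you'd tune):

// Keep sampling until the 95% confidence half-width of the mean drops
// below an acceptable margin. Names here are hypothetical.
function meanAndStd(xs: number[]): { mean: number; s: number } {
  const n = xs.length;
  const mean = xs.reduce((a, b) => a + b, 0) / n;
  const variance = xs.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1); // Bessel-corrected
  return { mean, s: Math.sqrt(variance) };
}

function enoughSamples(xs: number[], margin: number): boolean {
  const n = xs.length;
  if (n < 30) return false; // below ~30 samples you'd use the t table instead of z
  const { s } = meanAndStd(xs);
  const halfWidth = 1.96 * (s / Math.sqrt(n)); // z_(α/2) for 95% confidence
  return halfWidth <= margin;
}

// Usage: keep calling test() and appending results until enoughSamples()
// reports that the interval is tight enough.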
Stability means rate of change (derivative) is zero or close to zero.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
I would track the past throughput values, for example the last X values or so. Based on these values, I would calculate the rate of change (the derivative of your throughput). If the derivative is close to zero, the test is stable and you can stop it.
How to find X? Instead of a constant value such as 10, choosing a value based on the maximum number of tests may be more suitable, for example:
X = max(10,max_test_count * 0.01)
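A minimal sketch of that derivative check (my own illustration; the slope threshold is an assumption you'd tune):

// Fit a least-squares slope to the last X throughput samples and declare
// stability once the slope is close to zero.
function recentSlope(values: number[], X: number): number {
  const w = values.slice(-X);
  const n = w.length;
  const xMean = (n - 1) / 2; // mean of the indices 0..n-1
  const yMean = w.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (w[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den; // throughput change per sample
}

function isStable(values: number[], maxTestCount: number): boolean {
  const X = Math.floor(Math.max(10, maxTestCount * 0.01)); // the suggestion above
  if (values.length < X) return false;
  return Math.abs(recentSlope(values, X)) < 1e-3; // hypothetical "close to zero"
}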

Indirect Kalman Filter for Inertial Navigation System

I'm trying to implement an Inertial Navigation System using an Indirect Kalman Filter. I've found many publications and theses on this topic, but not much example code. For my implementation I'm using the Master's Thesis available at the following link:
https://fenix.tecnico.ulisboa.pt/downloadFile/395137332405/dissertacao.pdf
As reported at page 47, the measured values from inertial sensors equal the true values plus a series of other terms (bias, scale factors, ...).
For my question, let's consider only bias.
So:
Wmeas = Wtrue + BiasW (gyro measurement)
Ameas = Atrue + BiasA (accelerometer measurement)
Therefore,
when I propagate the Mechanization equations (equations 3-29, 3-37 and 3-41)
I should use the "true" values, or better:
Wmeas - BiasW
Ameas - BiasA
where BiasW and BiasA are the last available estimation of the bias. Right?
Concerning the update phase of the EKF,
if the measurement equation is
dzV = VelGPS_est - VelGPS_meas
the H matrix should have an identity block corresponding to the velocity error state variables dx(VEL) and zeros elsewhere. Right?
That said, I'm not sure how to propagate the state variables after the update phase.
The propagation of the state variable should be (in my opinion):
POSk|k = POSk|k-1 + dx(POS);
VELk|k = VELk|k-1 + dx(VEL);
...
But this didn't work. Therefore I've tried:
POSk|k = POSk|k-1 - dx(POS);
VELk|k = VELk|k-1 - dx(VEL);
but that didn't work either... I tried both solutions, even though in my opinion the "+" should be used. Since neither works (I may have some other error elsewhere),
I would like to ask if you have any suggestions.
You can see a snippet of code at the following link: http://pastebin.com/aGhKh2ck.
Thanks.
The difficulty you're running into is the difference between the theory and the practice. Taking your code from the snippet instead of the symbolic version in the question:
% Apply corrections
Pned = Pned + dx(1:3);
Vned = Vned + dx(4:6);
In theory, when you use the Indirect form you are freely integrating the IMU (the process called Mechanization in that paper) and occasionally running the IKF to update its correction. In theory, the unchecked double integration of the accelerometer produces large (or for cheap MEMS IMUs, enormous) error values in Pned and Vned. That, in turn, causes the IKF to produce correspondingly large values of dx(1:6) as time evolves and the unchecked IMU integration runs farther and farther away from the truth. In theory, you then sample your position at any time as Pned +/- dx(1:3) (the sign isn't important -- you can set that up either way). The important part here is that you are not modifying Pned from the IKF: both run independently of each other, and you add them together when you need the answer.
In practice you do not want to take the difference between two enormous double values, because you will lose precision (many of the bits of the significand are needed to represent the enormous part instead of the precision you want). You have grasped that in practice you want to recursively update Pned on each update. However, when you diverge from the theory this way, you have to take the corresponding (and somewhat unobvious) step of zeroing out your correction value in the IKF state vector. In other words, after you do Pned = Pned + dx(1:3) you have "used" the correction, and you need to balance the equation with dx(1:3) = dx(1:3) - dx(1:3) (simplified: dx(1:3) = 0) so that you don't inadvertently integrate the correction over time.
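A minimal sketch of that apply-then-zero pattern (my own illustration; the state layout follows the snippet above, with dx[0..2] as the position error and dx[3..5] as the velocity error):

// After the IKF update, fold the error-state correction into the
// mechanization output, then zero the used portion of the error state
// so it is not integrated twice.
function applyCorrection(Pned: number[], Vned: number[], dx: number[]): void {
  for (let i = 0; i < 3; i++) {
    Pned[i] += dx[i];     // Pned = Pned + dx(1:3)
    Vned[i] += dx[i + 3]; // Vned = Vned + dx(4:6)
    dx[i] = 0;            // the correction has been consumed
    dx[i + 3] = 0;
  }
}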
Why does this work? Why doesn't it mess up the rest of the filter? As it turns out, the KF process covariance P does not actually depend on the state x. It depends on the update function and the process noise Q and so on. So the filter doesn't care what the data is. (Now that's a simplification, because often Q and R include rotation terms, and R might vary based on other state variables, etc, but in those cases you are actually using state from outside the filter (the cumulative position and orientation) not the raw correction values, which have no meaning by themselves).

understanding getByteTimeDomainData and getByteFrequencyData in web audio

The documentation for both of these methods are both very generic wherever I look. I would like to know what exactly I'm looking at with the returned arrays I'm getting from each method.
For getByteTimeDomainData, what time period is covered with each pass? I believe most oscilloscopes cover a 32 millisecond span for each pass. Is that what is covered here as well? For the actual element values themselves, the range seems to be 0-255. Is this equivalent to -1 to +1 volts?
For getByteFrequencyData the frequencies covered is based on the sampling rate, so each index is an actual frequency, but what about the actual element values themselves? Is there a dB range that is equivalent to the values returned in the returned array?
getByteTimeDomainData (and the newer getFloatTimeDomainData) return an array of the size you requested - it's frequencyBinCount, which is calculated as half of the requested fftSize. That array is, of course, at the current sampleRate exposed on the AudioContext, so if it's the default 2048 fftSize, frequencyBinCount will be 1024, and if your device is running at 44.1kHz, that will equate to around 23ms of data.
The byte values do range between 0-255, and yes, that maps to -1 to +1, so 128 is zero. (It's not volts, but full-range unitless values.)
If you use getFloatFrequencyData, the values returned are in dB; if you use the Byte version, the values are mapped based on minDecibels/maxDecibels (see the minDecibels/maxDecibels description).
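That byte mapping can be sketched as follows (my own illustration of the linear dB-to-byte mapping; minDecibels = -100 and maxDecibels = -30 are the spec's defaults):

// How a dB value ends up as a byte in getByteFrequencyData: the
// [minDecibels, maxDecibels] range is mapped linearly onto [0, 255].
function dbToByte(db: number, minDb = -100, maxDb = -30): number {
  const scaled = (db - minDb) / (maxDb - minDb); // 0..1 within the range
  const clamped = Math.max(0, Math.min(1, scaled));
  return Math.floor(clamped * 255);
}

console.log(dbToByte(-100)); // 0   (at or below minDecibels)
console.log(dbToByte(-65));  // 127 (midway)
console.log(dbToByte(-30));  // 255 (at or above maxDecibels)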
Mozilla's documentation describes the difference between getFloatTimeDomainData and getFloatFrequencyData, which I summarize below. The Mozilla docs reference a Web Audio experiment, the voice-change-o-matic, which illustrates the conceptual difference well (it only works in my Firefox browser; it does not work in my Chrome browser).
TimeDomain/getFloatTimeDomainData
TimeDomain functions are over some span of time.
We often visualize TimeDomain data using oscilloscopes.
In other words:
we visualize TimeDomain data with a line chart,
where the x-axis (aka the "original domain") is time
and the y-axis is a measure of a signal (aka the "amplitude").
Change the voice-change-o-matic "visualizer setting" to Sinewave to
see getFloatTimeDomainData(...)
Frequency/getFloatFrequencyData
Frequency functions (getByteFrequencyData) are at a point in time, i.e. right now: "the current frequency data"
We sometimes see these in mp3 players/ "winamp bargraph style" music players (aka "equalizer" visualizations).
In other words:
we visualize Frequency data with a bar graph
where the x-axis (aka "domain") is frequencies or frequency bands
and the y-axis is the strength of each frequency band
Change the voice-change-o-matic "visualizer setting" to Frequency bars to see getFloatFrequencyData(...)
Fourier Transform (aka Fast Fourier Transform/FFT)
Another way to think about "time domain vs frequency" is shown in the diagram below, from the Fast Fourier Transform Wikipedia article:
getFloatTimeDomainData gives you the chart on the top (x-axis is Time)
getFloatFrequencyData gives you the chart on the bottom (x-axis is Frequency)
A Fast Fourier Transform (FFT) converts the Time Domain data into Frequency data; in other words, the FFT converts the first chart into the second chart.
cwilso has it backwards.
The time data array is the longer one (fftSize), and the frequency data array is the shorter one (half that, frequencyBinCount).
An fftSize of 2048 at the usual sample rate of 44.1kHz means each sample has a duration of 1/44100 of a second; you have 2048 samples at hand, and thus are covering a duration of 2048/44100 seconds, which is about 46 milliseconds, not 23 milliseconds. The frequencyBinCount is indeed 1024, but that refers to the frequency domain (as the name suggests), not the time domain, and the computation 1024/44100 is, in this context, about as meaningful as adding your birth date to the fftSize.
A little math illustrating what's happening: Fourier transform is a 'vector space isomorphism', that is, a mapping going bijectively (i.e., reversible) between 2 vector spaces of the same dimension; the 'time domain' and the 'frequency domain.' The vector space dimension we have here (in both cases) is fftSize.
So where does the 'half' come from? The frequency domain coefficients 'count double'. Either because they 'actually are' complex numbers, or because you have the 'sin' and the 'cos' flavor. Or, because you have a 'magnitude' and a 'phase', which you'll understand if you know how complex numbers work. (Those are 3 ways to say the same in a different jargon, so to speak.)
I don't know why the API only gives us half of the relevant numbers when it comes to frequency - I can only guess. And my guess is that those are the 'magnitude' numbers, and the 'phase' numbers are thrown out. The reason that this is my guess is that in applications, magnitude is far more important than phase. Still, I'm quite surprised that the API throws out information, and I'd be glad if some expert who actually knows (and isn't guessing) can confirm that it's indeed the magnitude. Or - even better (I love to learn) - correct me.
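To make the symmetry concrete, here's a small sketch (my own illustration, not the browser's implementation) of a direct DFT on real input, showing why only the first N/2 magnitude bins carry independent information:

// Naive DFT of a real-valued signal. For real input the spectrum is
// conjugate-symmetric, so bins k and N-k have equal magnitude; the
// unique content is in bins 0..N/2-1, and the phase is discarded when
// only magnitudes are reported.
function dftMagnitudes(signal: number[]): number[] {
  const N = signal.length;
  const mags: number[] = [];
  for (let k = 0; k < N; k++) {
    let re = 0;
    let im = 0;
    for (let n = 0; n < N; n++) {
      const angle = (-2 * Math.PI * k * n) / N;
      re += signal[n] * Math.cos(angle);
      im += signal[n] * Math.sin(angle);
    }
    mags.push(Math.hypot(re, im)); // magnitude; phase would be atan2(im, re)
  }
  return mags;
}

// For this real signal, mags[k] equals mags[N - k] (up to rounding),
// so half the bins suffice.
console.log(dftMagnitudes([0, 1, 0, -1, 0, 1, 0, -1]).map((m) => m.toFixed(2)));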

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's about how the magnitude of the data relates to the scaling inside the exponential.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less than 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale, the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has 2 parameters: startingTemperature and cooldownSchedule (= what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradient (which is a double going from 0.0 to 1.0 during the solver time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.
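A sketch of that acceptance test with the starting-temperature guideline (my own illustration; the linear cooldown and the concrete numbers are assumptions):

// SA acceptance with a starting temperature on the same scale as the
// score deltas, cooled via a timeGradient running from 0.0 to 1.0.
function accept(delta: number, t: number): boolean {
  if (delta < 0) return true; // always accept improvements
  return Math.exp(-delta / t) > Math.random();
}

const startingTemperature = 20; // guideline: max score diff of a single move
const steps = 10000;
for (let step = 0; step < steps; step++) {
  const timeGradient = step / steps; // 0.0 -> 1.0 over the run
  const t = startingTemperature * (1 - timeGradient) + 1e-9; // avoid divide-by-zero
  const delta = Math.random() * 20; // stand-in for a real move's score delta
  if (accept(delta, t)) {
    // move to the new state here
  }
}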

How to detect local maxima and curve windows correctly in semi complex scenarios?

I have a series of data and need to detect peak values in the series within a certain number of readings (window size) and excluding a certain level of background "noise." I also need to capture the starting and stopping points of the appreciable curves (ie, when it starts ticking up and then when it stops ticking down).
The data are high precision floats.
Here's a quick sketch that captures the most common scenarios that I'm up against visually:
One method I attempted was to pass a window of size X along the curve going backwards to detect the peaks. It started off working well, but I missed a lot of conditions I had not initially anticipated. Another method I started to work out was a growing window that would discover the longer-duration curves. Yet another was a more calculus-based approach that watches for velocity / gradient features. None seemed to hit the sweet spot, probably due to my lack of experience in statistical analysis.
Perhaps I need to use some kind of statistical analysis package to cover my bases rather than writing my own algorithm? Or would there be an efficient method for tackling this directly with SQL using some kind of local-max technique? I'm simply not sure how to approach this efficiently. With each method I try, it seems that I keep missing various thresholds, detecting too many peak values, or not capturing entire events (reporting a peak datapoint too early in the reading process).
Ultimately this is implemented in Ruby and so if you could advise as to the most efficient and correct way to approach this problem with Ruby that would be appreciated, however I'm open to a language agnostic algorithmic approach as well. Or is there a certain library that would address the various issues I'm up against in this scenario of detecting the maximum peaks?
My idea is simple: after you get your window of interest, you need to find all the peaks in that window. You can just compare each value with its neighbours; after that you will know where the peaks occur, and you can decide which is the best peak.
I wrote a simple example in MATLAB to show my idea.
My example works on a waveform from an audio file :-)
% Read the audio file and plot the original signal
waveFile = 'Chick_eco.wav';
[y, fs] = audioread(waveFile); % wavread(waveFile) in older MATLAB
subplot(2,2,1); plot(y); legend('Original signal');

% Extract a window of 100 samples
startIndex = 15000;
windowSize = 100;
endIndex = startIndex + windowSize - 1;
frame = y(startIndex:endIndex);
nframe = length(frame);

% Find the peaks: a sample is a peak if it is positive, greater than
% its left neighbour, and not less than its right neighbour
peaks = zeros(nframe, 1);
for k = 2 : nframe - 1
    y1 = frame(k - 1);
    y2 = frame(k);
    y3 = frame(k + 1);
    if (y2 > 0 && y2 > y1 && y2 >= y3)
        peaks(k) = frame(k);
    end
end

% Replace non-peaks with NaN so they are not drawn
peaks2 = peaks;
peaks2(peaks2 <= 0) = nan;
subplot(2,2,2); plot(frame); legend('Get Window Length = 100');
subplot(2,2,3); plot(peaks); legend('Where are the PEAKS');
subplot(2,2,4); plot(frame); legend('Peaks in the Window');
hold on; plot(peaks2, '*');

% Print the location and value of each peak
for j = 1 : nframe
    if (peaks(j) > 0)
        fprintf('Local=%i\n', j);
        fprintf('Value=%e\n', peaks(j));
    end
end

% Where the local maximum occurs
[maxivalue, maxi] = max(peaks)
You can see all the peaks and where they occur:
Local=37
Value=3.266296e-001
Local=51
Value=4.333496e-002
Local=65
Value=5.049438e-001
Local=80
Value=4.286804e-001
Local=84
Value=3.110046e-001
I'll propose a couple of different ideas. One is to use discrete wavelets, the other is to use the geographer's concept of prominence.
Wavelets: Apply some sort of wavelet decomposition to your data. There are multiple choices, with Daubechies wavelets being the most widely used. You want the low frequency peaks. Zero out the high frequency wavelet elements, reconstruct your data, and look for local extrema.
Prominence: Those noisy peaks and valleys are of key interest to geographers. They want to know exactly which of a mountain's multiple little peaks is tallest, and the exact location of the lowest point in the valley. Find the local minima and maxima in your data set. You should have a sequence of min/max/min/max/.../min. (You might want to add arbitrary end points that are lower than your global minimum.) Consider a min/max/min sequence. Classify each of these triples by the difference between the max and the larger of the two minima. Make a reduced sequence that replaces the smallest of these triples with the smaller of the two minima. Iterate until you get down to a single min/max/min triple. In your example, you want the next layer down, the min/max/min/max/min sequence; a sketch of this reduction follows below.
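A rough sketch of that prominence-style reduction (my own reading of the procedure above; tie-breaking and end handling are assumptions):

// Iteratively remove the least prominent peak from an alternating
// min/max/min/... sequence until only the desired number of peaks remain.
type Extremum = { index: number; value: number };

function reducePeaks(seq: Extremum[], keepPeaks: number): Extremum[] {
  // seq must alternate min, max, min, ..., min (odd length, maxima at odd indices)
  let s = seq.slice();
  while ((s.length - 1) / 2 > keepPeaks) {
    // Score each min/max/min triple: peak height above the larger neighbour minimum
    let worst = 1;
    let worstScore = Infinity;
    for (let i = 1; i < s.length - 1; i += 2) {
      const score = s[i].value - Math.max(s[i - 1].value, s[i + 1].value);
      if (score < worstScore) {
        worstScore = score;
        worst = i;
      }
    }
    // Replace the least prominent triple with the smaller of its two minima
    const keptMin = s[worst - 1].value <= s[worst + 1].value ? s[worst - 1] : s[worst + 1];
    s = [...s.slice(0, worst - 1), keptMin, ...s.slice(worst + 2)];
  }
  return s;
}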
Note: I'm going to describe the algorithmic steps as if each pass were distinct. Obviously, in a specific implementation, you can combine steps where it makes sense for your application. For the purposes of my explanation, it makes the text a little more clear.
I'm going to make some assumptions about your problem:
The windows of interest (the signals that you are looking for) cover a fraction of the entire data space (i.e., it's not one long signal).
The windows have significant scope (i.e., they aren't one pixel wide on your picture).
The windows have a minimum peak of interest (i.e., even if the signal exceeds the background noise, the peak must have an additional signal excess of the background).
The windows will never overlap (i.e., each can be examined as a distinct sub-problem out of context of the rest of the signal).
Given those, you can first look through your data stream for a set of windows of interest. You can do this by making a first pass through the data: moving from left to right, look for noise threshold crossing points. If the signal was below the noise floor and exceeds it on the next sample, that's a candidate starting point for a window (vice versa for the candidate end point).
Now make a pass through your candidate windows: compare the scope and contents of each window with the values defined above. To use your picture as an example, the small peaks on the left of the image barely exceed the noise floor and do so for too short a time. However, the window in the center of the screen clearly has a wide time extent and a significant max value. Keep the windows that meet your minimum criteria, discard those that are trivial.
Now examine your remaining windows in detail (remember, they can be treated individually). The peak is easy to find: pass through the window and keep the local max. With respect to the leading and trailing edges of the signal, you can see in the picture that you have a window that's slightly larger than the actual point at which the signal exceeds the noise floor. In this case, you can use a finite difference approximation to calculate the first derivative of the signal. You know that the leading edge will be somewhat to the left of the window on the chart: look for a point at which the first derivative exceeds a positive noise floor of its own (the slope turns upwards sharply). Do the same for the trailing edge (which will always be to the right of the window).
Result: a set of time windows, the leading and trailing edges of the signals, and the peak that occurred in each window.
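Under those assumptions, a condensed sketch of the first two passes (my own illustration; the thresholds are placeholders, and the derivative-based edge refinement is omitted for brevity):

// Find candidate windows by noise-floor crossings, keep only the
// significant ones, and locate the peak within each.
type Window = { start: number; end: number; peakIndex: number };

function detectWindows(
  signal: number[],
  noiseFloor: number,    // assumption: known background level
  minWidth: number,      // assumption: minimum window scope
  minPeakExcess: number, // assumption: required peak excess above the noise floor
): Window[] {
  const windows: Window[] = [];
  let start = -1;
  for (let i = 1; i < signal.length; i++) {
    // Upward crossing: candidate window start
    if (signal[i - 1] <= noiseFloor && signal[i] > noiseFloor) start = i;
    // Downward crossing: candidate window end
    if (start >= 0 && signal[i - 1] > noiseFloor && signal[i] <= noiseFloor) {
      if (i - start >= minWidth) {
        let peakIndex = start;
        for (let j = start; j < i; j++) {
          if (signal[j] > signal[peakIndex]) peakIndex = j;
        }
        if (signal[peakIndex] - noiseFloor >= minPeakExcess) {
          windows.push({ start, end: i, peakIndex });
        }
      }
      start = -1;
    }
  }
  return windows;
}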
It looks like the definition of a window is the range of x over which y is above the threshold. So use that to determine the size of the window. Within that, locate the largest value, thus finding the peak.
If that fails, then what additional criteria do you have for defining a region of interest? You may need to nail down your implicit assumptions to more than 'that looks like a peak to me'.