How to interpret output "Percent_Sig" in ParcelAllocaiton Function (Sem Tools)? - structural-equation-model

In the parcel allocation function, there is an output called "Percent_Sig".
I interpreted this value as an averaged p-value acros allocations. (e.g. 1rst allocation p-value 0.56, 2nd allocation p-value 0.34, 3rd allocation p-value 0.54 -> averaged p-value = 0.48)
However in the description of the sem tool package it says, it represents the "proportion of allocations in which each test of fit was significant."
How do I interpet this value then?
For instance, if it is a value of Percent_Sig = 0.48. Okay, I know that in 48 % allocations there was a significant p-value. But when would I say the probability is low enough to say that my chi sqaure value (because low p-values mean a better model fit) is good.
Would be happy about an answer :)

there is an output called "Percent_Sig". I interpreted this value as an averaged p-value acros allocations
Nope.
it represents the "proportion of allocations in which each test of fit was significant."
Yup.
I know that in 48 % allocations there was a significant p-value. But when would I say the probability is low enough to say that my chi square value (because low p-values mean a better model fit) is good.
The proportion significant is not meant to help you test your model, but rather to provide an idea about how uncertain you should be about whether your model would be rejected if you chose a different random-allocation of items to parcels. This is the issue with arbitrary allocations, as discussed in the older papers listed among References on the ?parcelAllocation help page.
To obtain a single test statistic for your model (which appropriately accounts for the uncertainty due to random allocation), as well as tests for individual parameters, you can save the allocations as a list of data sets, then treat them as multiple imputations. This is discussed in the later papers among the References, and demonstrated in the ## POOL RESULTS section of the help-page Examples.

Related

Statistical method to know when enough performance test iterations have been performed

I'm doing some performance/load testing of a service. Imagine the test function like this:
bytesPerSecond = test(filesize: 10MB, concurrency: 5)
Using this, I'll populate a table of results for different sizes and levels of concurrency. There are other variables too, but you get the idea.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
However it can take a while for this stability to occur, and there are lot of combinations of input to evaluate.
How can the test function decide when it's performed enough samples? By enough, I suppose I mean that the result isn't going to change beyond some margin if testing continues.
I remember reading an article about this a while ago (from one of the jsperf authors) that discussed a robust method, but I cannot find the article any more.
One simple method would be to compute the standard deviation over a sliding window of values. Is there a better approach?
IIUC, you're describing the classic problem of estimating the confidence interval of the mean with unknown variance. That is, suppose you have n results, x1, ..., xn, where each of the xi is a sample from some process of which you don't know much: not the mean, not the variance, and not the distribution's shape. For some required confidence interval, you'd like to now whether n is large enough so that, with high probability the true mean is within the interval of your mean.
(Note that with relatively-weak conditions, the Central Limit Theorem guarantees that the sample mean will converge to a normal distribution, but to apply it directly you would need the variance.)
So, in this case, the classic solution to determine if n is large enough, is as follows:
Start by calculating the sample mean μ = ∑i [xi] / n. Also calculate the normalized sample variance s2 = ∑i [(xi - μ)2] / (n - 1)
Depending on the size of n:
If n > 30, the confidence interval is approximated as μ ± zα / 2(s / √(n)), where, if necessary, you can find here an explanation on the z and α.
If n < 30, the confidence interval is approximated as μ ± tα / 2(s / √(n)); see again here an explanation of the t value, as well as a table.
If the confidence is enough, stop. Otherwise, increase n.
Stability means rate of change (derivative) is zero or close to zero.
The test function spins up concurrency requests and tracks throughput.
This rate starts off at zero, then spikes and dips until it eventually
stabilises on the 'true' value.
I would track your past throughput values. For example last X values or so. According to this values, I would calculate rate of change (derivative of your throughput). If your derivative is close to zero, then your test is stable. I will stop test.
How to find X? I think instead of constant value, such as 10, choosing a value according to maximum number of test can be more suitable, for example:
X = max(10,max_test_count * 0.01)

understanding getByteTimeDomainData and getByteFrequencyData in web audio

The documentation for both of these methods are both very generic wherever I look. I would like to know what exactly I'm looking at with the returned arrays I'm getting from each method.
For getByteTimeDomainData, what time period is covered with each pass? I believe most oscopes cover a 32 millisecond span for each pass. Is that what is covered here as well? For the actual element values themselves, the range seems to be 0 - 255. Is this equivalent to -1 - +1 volts?
For getByteFrequencyData the frequencies covered is based on the sampling rate, so each index is an actual frequency, but what about the actual element values themselves? Is there a dB range that is equivalent to the values returned in the returned array?
getByteTimeDomainData (and the newer getFloatTimeDomainData) return an array of the size you requested - its frequencyBinCount, which is calculated as half of the requested fftSize. That array is, of course, at the current sampleRate exposed on the AudioContext, so if it's the default 2048 fftSize, frequencyBinCount will be 1024, and if your device is running at 44.1kHz, that will equate to around 23ms of data.
The byte values do range between 0-255, and yes, that maps to -1 to +1, so 128 is zero. (It's not volts, but full-range unitless values.)
If you use getFloatFrequencyData, the values returned are in dB; if you use the Byte version, the values are mapped based on minDecibels/maxDecibels (see the minDecibels/maxDecibels description).
Mozilla 's documentation describes the difference between getFloatTimeDomainData and getFloatFrequencyData, which I summarize below. Mozilla docs reference the Web Audio
experiment ; the voice-change-o-matic. The voice-change-o-matic illustrates the conceptual difference to me (it only works in my Firefox browser; it does not work in my Chrome browser).
TimeDomain/getFloatTimeDomainData
TimeDomain functions are over some span of time.
We often visualize TimeDomain data using oscilloscopes.
In other words:
we visualize TimeDomain data with a line chart,
where the x-axis (aka the "original domain") is time
and the y axis is a measure of a signal (aka the "amplitude").
Change the voice-change-o-matic "visualizer setting" to Sinewave to
see getFloatTimeDomainData(...)
Frequency/getFloatFrequencyData
Frequency functions (GetByteFrequencyData) are at a point in time; i.e. right now; "the current frequency data"
We sometimes see these in mp3 players/ "winamp bargraph style" music players (aka "equalizer" visualizations).
In other words:
we visualize Frequency data with a bar graph
where the x-axis (aka "domain") are frequencies or frequency bands
and the y-axis is the strength of each frequency band
Change the voice-change-o-matic "visualizer setting" to Frequency bars to see getFloatFrequencyData(...)
Fourier Transform (aka Fast Fourier Transform/FFT)
Another way to think about "time domain vs frequency" is shown the diagram below, from Fast Fourier Transform wikipedia
getFloatTimeDomainData gives you the chart on on the top (x-axis is Time)
getFloatFrequencyData gives you the chart on the bottom (x-axis is Frequency)
a Fast Fourier Transform (FFT) converts the Time Domain data into Frequency data, in other words, FFT converts the first chart to the second chart.
cwilso has it backwards.
the time data array is the longer one (fftSize), and the frequency data array is the shorter one (half that, frequencyBinCount).
fftSize of 2048 at the usual sample rate of 44.1kHz means each sample has 1/44100 duration, you have 2048 samples at hand, and thus are covering a duration of 2048/44100 seconds, which 46 milliseconds, not 23 milliseconds. The frequencyBinCount is indeed 1024, but that refers to the frequency domain (as the name suggests), not the time domain, and it the computation 1024/44100, in this context, is about as meaningful as adding your birth date to the fftSize.
A little math illustrating what's happening: Fourier transform is a 'vector space isomorphism', that is, a mapping going bijectively (i.e., reversible) between 2 vector spaces of the same dimension; the 'time domain' and the 'frequency domain.' The vector space dimension we have here (in both cases) is fftSize.
So where does the 'half' come from? The frequency domain coefficients 'count double'. Either because they 'actually are' complex numbers, or because you have the 'sin' and the 'cos' flavor. Or, because you have a 'magnitude' and a 'phase', which you'll understand if you know how complex numbers work. (Those are 3 ways to say the same in a different jargon, so to speak.)
I don't know why the API only gives us half of the relevant numbers when it comes to frequency - I can only guess. And my guess is that those are the 'magnitude' numbers, and the 'phase' numbers are thrown out. The reason that this is my guess is that in applications, magnitude is far more important than phase. Still, I'm quite surprised that the API throws out information, and I'd be glad if some expert who actually knows (and isn't guessing) can confirm that it's indeed the magnitude. Or - even better (I love to learn) - correct me.

FFT Magnitude Output

I'm just getting into signal processing and need to do some DFT/FFT work.
If I take a signal with two freqs of 2Hz and 5Hz: x(t)=sin(2*2pi*t)+sin(5*2pi*t). I sample at 100Hz for 5 sec (so my DFT size is 500).
Because my inputs are real values I get a symmetric DFT, so can discard the 2nd half and convert the DFT values into magnitude by doing sqrt(re^2+c^2).
My bin widths are 100/500 = 0.2Hz, and so I get:
With peaks at 2Hz and 5Hz as expected.
My question is: why are the magnitudes different?
On a related note, why are there not two perfect spikes at 2hz and 5Hz, i.ee the graph has non-zero values at 1.5 and 2.5 etc. Is this spectral leakage?
I expect your 500 data points are being processed as a 512 point FFT (most FFT libraries do not support arbitrary size inputs and so typically they zero pad to the next highest power of 2). If that is the case then you will be seeing the effects of spectral leakage. Applying a window function prior to the FFT should fix this. Note that you will still see "skirts" on either side of your peaks - this is due to the uncertainty introduced by a finite sampling window.

Why is the knapsack problem pseudo-polynomial?

I know that Knapsack is NP-complete while it can be solved by DP. They say that the DP solution is pseudo-polynomial, since it is exponential in the "length of input" (i.e. the numbers of bits required to encode the input). Unfortunately I did not get it. Can anybody explain that pseudo-polynomial thing to me slowly ?
The running time is O(NW) for an unbounded knapsack problem with N items and knapsack of size W. W is not polynomial in the length of the input though, which is what makes it pseudo-polynomial.
Consider W = 1,000,000,000,000. It only takes 40 bits to represent this number, so input size = 40, but the computational runtime uses the factor 1,000,000,000,000 which is O(240).
So the runtime is more accurately said to be O(N.2bits in W), which is exponential.
Also see:
How to understand the knapsack problem is NP-complete?
The NP-Completeness of Knapsack
Complexity of dynamic programming algorithm for the 0-1 knapsack problem
Pseudo-polynomial time
In most of our problems, we're dealing with large lists of numbers which fit comfortably inside standard int/float data types. Because of the way most processors are built to handle 4-8 byte numbers at a time at no additional cost (relative to numbers than fit in, say, 1 byte), we rarely encounter a change in running time from scaling our numbers up or down within ranges we encounter in real problems - so the dominant factor remains just the sheer quantity of data points, the n or m factors that we're used to.
(You can imagine that the Big-O notation is hiding a constant factor that divides-out 32 or 64 bits-per-datum, leaving only the number-of-data-points whenever each of our numbers fit in that many bits or less)
But try reworking with other algorithms to act on data sets involving big ints - numbers that require more than 8 bytes to represent - and see what that does to the runtime. The magnitude of the numbers involved always makes a difference, even in the other algorithms like binary sort, once you expand beyond the buffer of safety conventional processors give us "for-free" by handling 4-8 byte batches.
The trick with the Knapsack algorithm that we discussed is that it's unusually sensitive (relative to other algorithms ) to the magnitude of a particular parameter, W. Add one bit to W and you double the running time of the algorithm. We haven't seen that kind of dramatic response to changes in value in other algorithms before this one, which is why it might seem like we're treating Knapsack differently - but that's a genuine analysis of how it responds in a non-polynomial fashion to changes in input size.
The way I understand this is that the capacity would've been O(W) if the capacity input were an array of [1,2,...,W], which has a size of W. But the capacity input is not an array of numbers, it's instead a single integer. The time complexity is about the relationship to the size of input. The size of an integer is NOT the value of the integer, but the number of bits representing it. We do later convert this integer W into an array [1,2,...,W] in the algorithm, leading people into mistakenly thinking W is the size, but this array is not the input, the integer itself is.
Think of input as "an array of stuff", and the size as "how many stuff in the array". The item input is actually an array of n items in the array so size=n. The capacity input is NOT an array of W numbers in it, but a single integer, represented by an array of log(W) bits. Increase the size of it by 1 (adding 1 meaningful bit), W doubles so run time doubles, hence the exponential time complexity.
The Knapsack algorithm's run-time is bound not only on the size of the input (n - the number of items) but also on the magnitude of the input (W - the knapsack capacity) O(nW) which is exponential in how it is represented in computer in binary (2^n) .The computational complexity (i.e how processing is done inside a computer through bits) is only concerned with the size of the inputs, not their magnitudes/values.
Disregard the value/weight list for a moment. Let's say we have an instance with knapsack capacity 2. W would take two bits in the input data. Now we shall increase the knapsack capacity to 4, keeping the rest of the input. Our input has only grown by one bit, but the computational complexity has increased twofold. If we increase the capacity to 1024, we would have just 10 bits of the input for W instead of 2, but the complexity has increased by a factor of 512. Time complexity grows exponentially in the size of W in binary (or decimal) representation.
Another simple example that helped me understand the pseudo-polynomial concept is the naive primality testing algorithm. For a given number n we are checking if it's divided evenly by each integer number in range 2..√n, so the algorithm takes √(n−1) steps. But here, n is the magnitude of the input, not it's size.
Now The regular O(n) case
By contrast, searching an array for a given element runs in polynomial time: O(n). It takes at most n steps and here n is the size of the input (the length of the array).
[ see here ]
Calculating bits required to store decimal number
Complexity is based on input. In knapsack problem, Inputs are size, max Capacity, and profit, weight arrays. We construct dp table as size * W so we feel as its of polynomial time complexity. But, input W is an integer, not an array. So, it will be O(size*(no Of bits required to store given W)). If no of bits increase by 1, then running time doubles. Thus it is exponential, thereby pseudo-polynomial.

How to represent stereo audio data for FFT

How should stereo (2 channel) audio data be represented for FFT? Do you
A. Take the average of the two channels and assign it to the real component of a number and leave the imaginary component 0.
B. Assign one channel to the real component and the other channel to the imag component.
Is there a reason to do one or the other? I searched the web but could not find any definite answers on this.
I'm doing some simple spectrum analysis and, not knowing any better, used option A). This gave me an unexpected result, whereas option B) went as expected. Here are some more details:
I have a WAV file of a piano "middle-C". By definition, middle-C is 260Hz, so I would expect the peak frequency to be at 260Hz and smaller peaks at harmonics. I confirmed this by viewing the spectrum via an audio editing software (Sound Forge). But when I took the FFT myself, with option A), the peak was at 520Hz. With option B), the peak was at 260Hz.
Am I missing something? The explanation that I came up with so far is that representing stereo data using a real and imag component implies that the two channels are independent, which, I suppose they're not, and hence the mess-up.
I don't think you're taking the average correctly. :-)
C. Process each channel separately, assigning the amplitude to the real component and leaving the imaginary component as 0.
Option B does not make sense. Option A, which amounts to convert the signal to mono, is OK (if you are interested in a global spectrum).
Your problem (double freq) is surely related to some misunderstanding in the use of your FFT routines.
Once you take the FFT you need to get the Magnitude of the complex frequency spectrum. To get the magnitude you take the absolute of the complex spectrum |X(w)|. If you want to look at the power spectrum you square the magnitude spectrum, |X(w)|^2.
In terms of your frequency shift I think it has to do with you setting the imaginary parts to zero.
If you imagine the complex Frequency spectrum as a series of complex vectors or position vectors in a cartesian space. If you took one discrete frequency bin X(w), there would be one real component representing its direction in the real axis (x -direction), and one imaginary component in the in the imaginary axis (y - direction). There are four important values about this discrete frequency, 1. real value, 2. imaginary value, 3. Magnitude and, 4. phase. If you just take the real value and set imaginary to 0, you are setting Magnitude = real and phase = 0deg or 90deg. You have hence forth modified the resulting spectrum, and applied a bias to every frequency bin. Take a look at the wiki on Magnitude of a vector, also called the Euclidean norm of a vector to brush up on your understanding. Leonbloy was correct, but I hope this was more informative.
Think of the FFT as a way to get information from a single signal. What you are asking is what is the best way to display data from two signals. My answer would be to treat each independently, and display an FFT for each.
If you want a really fast streaming FFT you can read about an algorithm I wrote here: www.depthcharged.us/?p=176