How to print probability for repeated measures logistic regression? - binary

I would like SAS to print the probability of my binary dependent variable occurring (“Calliphoridae” a particular fly family being present (1) or not (0), at a specific instance for my continuous independent variable (“degree_index” that was recorded from .055 to 2.89, but can be continuously recorded past 2.89 and always increases as time goes on) using Proc GENMOD. How do I change my code to print the probability, for example, that Calliphoridae is present at degree_index=.1?
My example code is:
proc genmod data=thesis descending ;
class Body_number ;
model Calliphoridae = degree_index / dist=binomial link=logit ;
repeated subject=Body_number/ type=cs;
estimate 'degreeindex=.1' intercept 1 degree_index 0 /exp;
estimate 'degree_index=.2' intercept 1 degree_index .1 /exp;run;
I get an output for the contrast estimate results as mean estimate at degree_index=.1 is ..99; degree_index=.2 is .98.

I think that it is correctly modeling the probability...I just didn't include the square of
the degree-day index. If you do, it allows the probability to increase and decrease. I
realized this when I did the probability by hand
(e^-1.1307x+.2119)/(1+e^-1.1307x+.2119) to verify that this really was modeling
probability when y=1 for the mean estimates at specific x values...and then I realized that it is
fitting a regression line and cannot increase and decrease because there is only
one x value. http://www.stat.sc.edu/~hansont/stat704/chapter14a.pdf

Related

Making sense of soundMixer.computeSpectrum

All examples that I can find on the Internet just visualize the result array of the function computeSpectrum, but I am tasked with something else.
I generate a music note and I need by analyzing the result array to be able to say what note is playing. I figured out that I need to set the second parameter of the function call 'FFTMode' to true and then it returns sound frequencies. I thought that really it should return only one non-zero value which I could use to determine what note I generated using Math.sin function, but it is not the case.
Can somebody suggest a way how I can accomplish the task? Using the soundMixer.computeSpectrum is a requirement because I am going to analyze more complex sounds later.
FFT will transform your signal window into set of Nyquist sine waves so unless 440Hz is one of them you will obtain more than just one nonzero value! For a single sine wave you would obtain 2 frequencies due to aliasing. Here an example:
As you can see for exact Nyquist frequency the FFT response is single peak but for nearby frequencies there are more peaks.
Due to shape of the signal you can obtain continuous spectrum with peaks instead of discrete values.
Frequency of i-th sample is f(i)=i*samplerate/N where i={0,1,2,3,4,...(N/2)-1} is sample index (first one is DC offset so not frequency for 0) and N is the count of samples passed to FFT.
So in case you want to detect some harmonics (multiples of single fundamental frequency) then set the samplerate and N so samplerate/N is that fundamental frequency or divider of it. That way you would obtain just one peak for harmonics sinwaves. Easing up the computations.

Statistical method to know when enough performance test iterations have been performed

I'm doing some performance/load testing of a service. Imagine the test function like this:
bytesPerSecond = test(filesize: 10MB, concurrency: 5)
Using this, I'll populate a table of results for different sizes and levels of concurrency. There are other variables too, but you get the idea.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
However it can take a while for this stability to occur, and there are lot of combinations of input to evaluate.
How can the test function decide when it's performed enough samples? By enough, I suppose I mean that the result isn't going to change beyond some margin if testing continues.
I remember reading an article about this a while ago (from one of the jsperf authors) that discussed a robust method, but I cannot find the article any more.
One simple method would be to compute the standard deviation over a sliding window of values. Is there a better approach?
IIUC, you're describing the classic problem of estimating the confidence interval of the mean with unknown variance. That is, suppose you have n results, x1, ..., xn, where each of the xi is a sample from some process of which you don't know much: not the mean, not the variance, and not the distribution's shape. For some required confidence interval, you'd like to now whether n is large enough so that, with high probability the true mean is within the interval of your mean.
(Note that with relatively-weak conditions, the Central Limit Theorem guarantees that the sample mean will converge to a normal distribution, but to apply it directly you would need the variance.)
So, in this case, the classic solution to determine if n is large enough, is as follows:
Start by calculating the sample mean μ = ∑i [xi] / n. Also calculate the normalized sample variance s2 = ∑i [(xi - μ)2] / (n - 1)
Depending on the size of n:
If n > 30, the confidence interval is approximated as μ ± zα / 2(s / √(n)), where, if necessary, you can find here an explanation on the z and α.
If n < 30, the confidence interval is approximated as μ ± tα / 2(s / √(n)); see again here an explanation of the t value, as well as a table.
If the confidence is enough, stop. Otherwise, increase n.
Stability means rate of change (derivative) is zero or close to zero.
The test function spins up concurrency requests and tracks throughput.
This rate starts off at zero, then spikes and dips until it eventually
stabilises on the 'true' value.
I would track your past throughput values. For example last X values or so. According to this values, I would calculate rate of change (derivative of your throughput). If your derivative is close to zero, then your test is stable. I will stop test.
How to find X? I think instead of constant value, such as 10, choosing a value according to maximum number of test can be more suitable, for example:
X = max(10,max_test_count * 0.01)

STFT Clarification (FFT for real-time input)

I get how the DFT via correlation works, and use that as a basis for understanding the results of the FFT. If I have a discrete signal that was sampled at 44.1kHz, then that means if I were to take 1s of data, I would have 44,100 samples. In order to run the FFT on that, I would have to have an array of 44,100 and a DFT with N=44,100 in order to get the resolution necessary to detect a frequencies up to 22kHz, right? (Because the FFT can only correlate the input with sinusoidal components up to a frequency of N/2)
That's obviously a lot of data points and calculation time, and I have read that this is where the Short-time FT (STFT) comes in. If I then take the first 1024 samples (~23ms) and run the FFT on that, then take an overlapping 1024 samples, I can get the continuous frequency domain of the signal every 23ms. Then how do I interpret the output? If the output of the FFT on static data is N/2 data points with fs/(N/2) bandwidth, what is the bandwidth of the STFT's frequency output?
Here's an example that I ran in Mathematica:
100Hz sine wave at 44.1kHz sample rate:
Then I run the FFT on only the first 1024 points:
The frequency of interest is then at data point 3, which should somehow correspond to 100Hz. I think 44100/1024 = 43 is something like a scaling factor, which means that a signal with 1Hz in this little window will then correspond to a signal of 43Hz in the full data array. However, this would give me an output of 43Hz*3 = 129Hz. Is my logic correct but not my implementation?
As I have already stated in my earlier comments, the variable N affects the resolution achievable by the output frequency spectrum and not the range of frequencies you can detect.A larger N gives you a higher resolution at the expense of higher computation time and a lower N gives you lower computation time but can cause spectral leakage, which is the effect you have seen in your last figure.
As for your other question, well, theoretically the bandwidth of an FFT is infinite but we band-limit our result to the band of frequencies in the range [-fs/2 to fs/2] because all frequencies outside that band are susceptible to aliasing and are therefore of no use.Furthermore, if the input signal is real (which is true in most cases including ours) then the frequencies from [-fs/2 to 0] are just a reflection of the frequencies from [0 to fs/2] and so some FFT procedures just output the FFT spectrum from [0 to fs/2], which I think applies to your case.This means that the N/2 data points that you received as output represent the frequencies in the range [0 to fs/2] so that is the bandwidth you are working with in the case of the FFT and also in the case of the STFT (the STFT is just a series of FFT's, each FFT in a STFT will give you a spectrum with data points in this band).
I would also like to point out that the STFT will most likely not reduce your computation time if your input is a varying signal such as music because in that case you will need to take perform it several times over the duration of the song for it to be of any use, it will however enable you to understand the frequency characteristics of your song much better that you would do if you just performed one FFT.
To visualise the results of an FFT you use frequency (and/or phase) spectrum plots but in order to visualise the results of an STFT you will most probably need to create a spectrogram which is basically a graph can is made by just basically putting the individual FFT spectrums side by side.The process of creating a spectrogram can be seen in the figure below (Source: Dan Ellis - Introduction to Speech Processing).The spectrogram will show you how your signal's frequency characteristics change over time and how you interpret it will depend on what specific features you are looking to extract/detect from the audio.You might want to look at the spectrogram wikipedia page for more information.

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's how the data magnitude relates to the scaling of the exponentiation.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less then 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.
Classical SA has 2 parameters: startingTemperate and cooldownSchedule (= what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradiant (which is a double going from 0.0 to 1.0 during the solver time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.

FFT Magnitude Output

I'm just getting into signal processing and need to do some DFT/FFT work.
If I take a signal with two freqs of 2Hz and 5Hz: x(t)=sin(2*2pi*t)+sin(5*2pi*t). I sample at 100Hz for 5 sec (so my DFT size is 500).
Because my inputs are real values I get a symmetric DFT, so can discard the 2nd half and convert the DFT values into magnitude by doing sqrt(re^2+c^2).
My bin widths are 100/500 = 0.2Hz, and so I get:
With peaks at 2Hz and 5Hz as expected.
My question is: why are the magnitudes different?
On a related note, why are there not two perfect spikes at 2hz and 5Hz, i.ee the graph has non-zero values at 1.5 and 2.5 etc. Is this spectral leakage?
I expect your 500 data points are being processed as a 512 point FFT (most FFT libraries do not support arbitrary size inputs and so typically they zero pad to the next highest power of 2). If that is the case then you will be seeing the effects of spectral leakage. Applying a window function prior to the FFT should fix this. Note that you will still see "skirts" on either side of your peaks - this is due to the uncertainty introduced by a finite sampling window.