How do you calculate the relative probability using a Monte Carlo simulation?

How do you calculate the relative probability using a Monte Carlo simulation? - octave

We are supposed to plot a graph, on octave, of the relative probability and cumulative probability of picking a two of spades from a deck of cards. Relative probability=number of successes observed so far/number of experiments run so far, while cumulative probability = number of successes run so far/N where N is the number of experiments run (in this case N=10000 and N=100000). I have already determined the number of successes observed by using the random function on octave. However, I don't understand how to obtain the number of experiments run so far.

Related

How does score function help in policy gradient?

I'm trying to learn policy gradient methods for reinforcement learning but I stuck at the score function part.
While searching for maximum or minimum points in a function, we take the derivative and set it to zero, then look for the points that holds this equation.
In policy gradient methods, we do it by taking the gradient of the expectation of trajectories and we get:
Objective function image
Here I could not get how this gradient of log policy shifts the distribution (through its parameters θ) to increase the scores of its samples mathematically? Don't we look for something that make this objective function's gradient zero as I explained above?

What you want to maximize is
J(theta) = int( p(tau;theta)*R(tau) )
The integral is over tau (the trajectory) and p(tau;theta) is its probability (i.e., of seeing the sequence state, action, next state, next action, ...), which depends on both the dynamics of the environment and the policy (parameterized by theta). Formally
p(tau;theta) = p(s_0)*pi(a_0|s_0;theta)*P(s_1|s_0,a_0)*pi(a_1|s_1;theta)*P(s_2|s_1,a_1)*...
where P(s'|s,a) is the transition probability given by the dynamics.
Since we cannot control the dynamics, only the policy, we optimize w.r.t. its parameters, and we do it by gradient ascent, meaning that we take the direction given by the gradient. The equation in your image comes from the log-trick df(x)/dx = f(x)*d(logf(x))/dx.
In our case f(x) is p(tau;theta) and we get your equation. Then since we have access only to a finite amount of data (our samples) we approximate the integral with an expectation.
Step after step, you will (ideally) reach a point where the gradient is 0, meaning that you reached a (local) optimum.
You can find a more detailed explanation here.
EDIT
Informally, you can think of learning the policy which increases the probability of seeing high return R(tau). Usually, R(tau) is the cumulative sum of the rewards. For each state-action pair (s,a) you therefore maximize the sum of the rewards you get from executing a in state s and following pi afterwards. Check this great summary for more details (Fig 1).

How to define the number of factors in parallel analysis

I conducted an Exploratory Factor Analysis (Principal Axis Factoring) on my data and wanted to determine the number of factors to extract via. Horn's Parallel Analysis.
However I have two problems:
The parallel analysis suggests to extract 1 factor, however the plot shows more than one intersection of my "FA Actual Data" and my "FA Simulated Data" line. I do not get why it is just one factor (the first intersection) then.... This plot does not look typical to other parallel analysis plots.
Why does the number of factors to extract change with the number of observations (n.obs) I state? I mean that I just changed the number of observations from 50 to 500 (which is a lie), however then parallel analysis suggested 5 factors to extract instead of 9. I do not get why....
Thank you so much for any helpful tips.
Valerie
fa.parallel(cor(My_Data), n.obs = 50, fa="fa", fm="pa")
Parallel analysis suggests that the number of factors = 1 and the number of components = NA

STFT Clarification (FFT for real-time input)

I get how the DFT via correlation works, and use that as a basis for understanding the results of the FFT. If I have a discrete signal that was sampled at 44.1kHz, then that means if I were to take 1s of data, I would have 44,100 samples. In order to run the FFT on that, I would have to have an array of 44,100 and a DFT with N=44,100 in order to get the resolution necessary to detect a frequencies up to 22kHz, right? (Because the FFT can only correlate the input with sinusoidal components up to a frequency of N/2)
That's obviously a lot of data points and calculation time, and I have read that this is where the Short-time FT (STFT) comes in. If I then take the first 1024 samples (~23ms) and run the FFT on that, then take an overlapping 1024 samples, I can get the continuous frequency domain of the signal every 23ms. Then how do I interpret the output? If the output of the FFT on static data is N/2 data points with fs/(N/2) bandwidth, what is the bandwidth of the STFT's frequency output?
Here's an example that I ran in Mathematica:
100Hz sine wave at 44.1kHz sample rate:
Then I run the FFT on only the first 1024 points:
The frequency of interest is then at data point 3, which should somehow correspond to 100Hz. I think 44100/1024 = 43 is something like a scaling factor, which means that a signal with 1Hz in this little window will then correspond to a signal of 43Hz in the full data array. However, this would give me an output of 43Hz*3 = 129Hz. Is my logic correct but not my implementation?

As I have already stated in my earlier comments, the variable N affects the resolution achievable by the output frequency spectrum and not the range of frequencies you can detect.A larger N gives you a higher resolution at the expense of higher computation time and a lower N gives you lower computation time but can cause spectral leakage, which is the effect you have seen in your last figure.
As for your other question, well, theoretically the bandwidth of an FFT is infinite but we band-limit our result to the band of frequencies in the range [-fs/2 to fs/2] because all frequencies outside that band are susceptible to aliasing and are therefore of no use.Furthermore, if the input signal is real (which is true in most cases including ours) then the frequencies from [-fs/2 to 0] are just a reflection of the frequencies from [0 to fs/2] and so some FFT procedures just output the FFT spectrum from [0 to fs/2], which I think applies to your case.This means that the N/2 data points that you received as output represent the frequencies in the range [0 to fs/2] so that is the bandwidth you are working with in the case of the FFT and also in the case of the STFT (the STFT is just a series of FFT's, each FFT in a STFT will give you a spectrum with data points in this band).
I would also like to point out that the STFT will most likely not reduce your computation time if your input is a varying signal such as music because in that case you will need to take perform it several times over the duration of the song for it to be of any use, it will however enable you to understand the frequency characteristics of your song much better that you would do if you just performed one FFT.
To visualise the results of an FFT you use frequency (and/or phase) spectrum plots but in order to visualise the results of an STFT you will most probably need to create a spectrogram which is basically a graph can is made by just basically putting the individual FFT spectrums side by side.The process of creating a spectrogram can be seen in the figure below (Source: Dan Ellis - Introduction to Speech Processing).The spectrogram will show you how your signal's frequency characteristics change over time and how you interpret it will depend on what specific features you are looking to extract/detect from the audio.You might want to look at the spectrogram wikipedia page for more information.

Temperature Scale in SA

First, this is not a question about temperature iteration counts or automatically optimized scheduling. It's how the data magnitude relates to the scaling of the exponentiation.
I'm using the classic formula:
if(delta < 0 || exp(-delta/tK) > random()) { // new state }
The input to the exp function is negative because delta/tK is positive, so the exp result is always less then 1. The random function also returns a value in the 0 to 1 range.
My test data is in the range 1 to 20, and the delta values are below 20. I pick a start temperature equal to the initial computed temperature of the system and linearly ramp down to 1.
In order to get SA to work, I have to scale tK. The working version uses:
exp(-delta/(tK * .001)) > random()
So how does the magnitude of tK relate to the magnitude of delta? I found the scaling factor by trial and error, and I don't understand why it's needed. To my understanding, as long as delta > tK and the step size and number of iterations are reasonable, it should work. In my test case, if I leave out the extra scale the temperature of the system does not decrease.
The various online sources I've looked at say nothing about working with real data. Sometimes they include the Boltzmann constant as a scale, but since I'm not simulating a physical particle system that doesn't help. Examples (typically with pseudocode) use values like 100 or 1000000.
So what am I missing? Is scaling another value that I must set by trial and error? It's bugging me because I don't just want to get this test case running, I want to understand the algorithm, and magic constants mean I don't know what's going on.

Classical SA has 2 parameters: startingTemperate and cooldownSchedule (= what you call scaling).
Configuring 2+ parameters is annoying, so in OptaPlanner's implementation, I automatically calculate the cooldownSchedule based on the timeGradiant (which is a double going from 0.0 to 1.0 during the solver time). This works well. As a guideline for the startingTemperature, I use the maximum score diff of a single move. For more information, see the docs.

How do i know if this is random enough?

I wrote a program in java that rolls a die and records the total number of times each value 1-6 is rolled. I rolled 6 Million times. Here's the distribution:
#of 0's: 0
#of 1's: 1000068
#of 2's: 999375
#of 3's: 999525
#of 4's: 1001486
#of 5's: 1000059
#of 6's: 999487
(0 wasn't an option.)
Is this distribution consistant with random dice rolls?
What objective statistical tests might confirm that the dice rolls are indeed random enough?
EDIT: questions have been raised over application: a game that i want to be as fair as can be reasonably achieved.

To test whether this particular distribution is consistent with the expected distribution of numbers rolled with a "fair" dive, you need to perform the Pearson's Chi-square test.
Note that this still will not prove that your algorithm is "fair", only that these particular results look "fair".
To test whether your algorithm is "fair" in general, use the Diehard tests, as others have mentioned.

If your random number generator passes the Diehard tests, that's the best you can do.
Even a physical die won't be perfect with 1/6 per face.
Increase the trials by an order of magnitude, then do it again. If you get 1/6 for each trial you'll be fine.

This test alone isn't enough to determine randomness. Not that it's completely useless, but a "random" dice roller that outputs 1,2,3,4,5,6 and repeats would be perfectly random according to this test.
Another suggested test: pick a number, x, and each time it is rolled, record the statistics of what number comes next; you should see an even distribution again. Repeat for all six values of x. If it passes this test it is probably random enough to be used as a dice roller.

The probability that 6'000'000 dice rolls will end up in exactly 1'000'000 outcomes of each is close to 0. As long as the sum if the outcomes is correct, and that the variance (error) of the difference from the expected outcome goes towards 0 (relatively) when the number of trials increase, then your random function is not wrong.
You can either prove it mathematically or by testing the random function with larger and larger trial sequences to see that it converges.
For a repeated number of tests, the sum for each outcome should approximate Gaussian distribution. E.g. each outcome 1-6 should fall within normal distribution centered around 1'000'000 with a variance that is inversely proportional to the number of dice rolls.
The other tests, the Diehard tests, tests that the actual sequence of dice rolls is random in itself and not that the outcome of 6'000'000 rolls for example is 100'000 consecutive 1's, then 100'000 2's and so on and finally some random sequences.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008