Evaluate_policy records much higher mean reward then stable baselines 3 logger - reinforcement-learning

As the title says, I am testen PPO with the Cartpole Environment using SB3, but if I look at the performance measured be the evaluate_policy function I reach a mean reward of 475 reliable at 20000 timesteps, but I need about 90000 timesteps if I look at console log to get comparable results during learning.
Why does my model perform so much better using the evaluation helper?
I used the same hyperparameters in both cases, and I used a new environment for the evaluation with the helper method.

I think I have solved the "problem":
evaluate_policy uses deterministic action in it's default settings, which leads to better results faster.

Related

Most efficient computational method to numerically minimize a 8 variables constrained system

I'm working for quite some time on finding a numerical instance for solution of a 8 variables system of 7 very complicated inequalities plus region specification. Unfortunately I cannot produce a MWE or nothing of the sort since the inputs are really long.
My current method is Mathematica's NMinimize routine, minimizing one of the 7 inequalities subject to every other condition as constraint -- The FindInstance command simply quits the kernel without being able to finish running.
The NMinimize is able to produce output, but besides being slower than would be optimal, produce results that do not obey every constraint.
The thing is that I need to be certain, for each benchmark I run, that if the output doesn't satisfy every constraint it is because such a set of real numbers doesn't exist -- which with my current method I can't be, by experience.
So: is there a foolproof, as efficient as possible, computational method for me to find a single instance of numerical solution to 7 complicated inequalities (involving trigonometric functions) of 8 variables or be sure that such a set doesn't exist?
It could be a Mathematica/python/fortran package, genetic algorithm or anything -- as long as there is clear enough documentation.
You need to give importance multiplier to constraints and the optimization method should not be greedy.
A genetic algorithm combined with multiple-starting points (or simulated annealing for diminishing mutations) tends to converge to global minima (hence not greedy) with more time given to it but there is not guarantee that the heuristic will complete X function in Y time. The more time given to it, the better it converges to global minima.
In genetic algorithm, you can add big constraint penalties like this:
fitness_minima = some_function_output_between_1_and_10 +
constraints_breached?1000.0f:0;
so that the DNAs with no contraint-violations will be favored for the crossover part of GA.
"As efficient as possible" depends on your algorithm. If you can parallelize the algorithm and run it on multiple GPUs, it should give substantial speedup over CPU. Compared to some hours of Mona-Lisa painting by CPU, a parallelized version running on 3 low-end GPUs complete within 10 minutes (https://www.youtube.com/watch?v=QRZqBLJ6brQ). At least some OpenCL/CUDA supporting libraries/frameworks (like Tensorflow) should be able to accelerate your algorithm if you don't want to do the work distribution yourself.

Does backpropagation use optimization function to update weights?

I know that backpropagation calculates the derivative of the cost function with respect to the parameters of the model (weight and bias). However, I need to make sure that backpropagation does not update weight and bias; instead, it uses OPTIMIZERs in order to update weights and bias like Adam, Gradient Descent, and others
Thanks in advance
If I understand well your question: when you use an optimizer in a deep learning framework (PyTorch/TensorFlow), say "Adam", the weights updating are being performed by the optimizer. This process is taking place automatically, you do not need to manually write any code, the framework does the updating weights+biases for you.

Q-learning, how about picking the action that actually gives most reward?

So in Q learning, you update the Q function by Qnew(s,a) = Q(s,a) + alpha(r + gamma*MaxQ(s',a) - Q(s,a).
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
Of course, the training time would probably increase because you actually do all the actions once for each update, but since you are guaranteed to select the best action each time (except when exploring), it would give you a global optimum policy in the end?
This is a bit similar to value iteration, except I'm don't have and not building a model for the problem.
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
It is typically assumed in Reinforcement Learning that we do not have the ability to reset the (simulated) environment. Sure, when we're working on simulations it often may technically be possible, but generally we hope that work in RL can also extend to "real-world" problems outside of simulations afterwards, where that would no longer be possible.
If you do have that possibility, it would generally be recommended to look into search algorithms like Monte-Carlo Tree Search, rather than Reinforcement Learning like Sarsa, Q-learning, etc. I suspect your suggestion might work slightly better than Q-learning indeed in this case, but things like MCTS would be even better.
Now, if I were to use the same principle but change Q to V function, instead of performing the action based on the current V function, you actually perform all actions (assuming you can reset the simulated environment), and select the best action out of those, and update the V function for that state. Would this yield a better result?
Given that you don't have access to the model, you have to resort to model free methods. What you are suggesting is basically a Dynamics programming backup. See the slides 28 - 31 in David Silver's lecture notes for various backup strategies to iterate on the value function.
However, note that this is just for prediction (i.e. estimating the value function for a given policy) and not for control (figuring out the best policy). There won't be a Max involved in prediction. To do control, you can use the above policy evaluation + greedy policy improvement to arrive at a "policy iteration based on Dynamic prog backup policy evaluation" method.
The other options for model-free control are SARSA [+ greedy policy improvement] (on policy) and Q-learning (off-policy). These are Q-function based methods, though.
If you are just trying to win the game, and not necessarily interested in RL techniques discussed above, then you also have the choice of using purely planning based methods (like Monte Carlo Tree Search). Finally, you can combine planning and learning with methods such as Dyna.

Is BatchNorm turned off when inferencing?

I read from several sources that implicitly suggest batchnorm being turned off for inference but I have no definite answer for this.
Most common is to use a moving average of mean and std for your batch normalization as used by Keras for example (https://github.com/keras-team/keras/blob/master/keras/layers/normalization.py). If you just turn it off the network will perform worse on the same data, due to changes in how the images are processed.
This is done by storing the average mean and the average std of all the batches used during training the network. Then in inference this moving average is used for normalization.

Benchmarking: When can I stop making measurements?

I have a series of functions that are all designed to do the same thing. The same inputs produce the same outputs, but the time that it takes to do them varies by function. I want to determine which one is 'fastest', and I want to have some confidence that my measurement is 'statistically significant'.
Perusing Wikipedia and the interwebs tells me that statistical significance means that a measurement or group of measurements is different from a null hypothesis by a p-value threshold. How would that apply here? What is the null hypothesis between function A being faster than function B?
Once I've got that whole setup defined, how do I figure out when to stop measuring? I'll typically see that a benchmark is run three times, and then the average is reported; why three times and not five or seven? According to this page on Statistical Significance (which I freely admit I do not understand fully), Fisher used 8 as the number of samples that he needed to measure something with 98% confidence; why 8?
I would not bother applying statistics principles to benchmarking results. In general, the term "statistical significance" refers to the likelihood that your results were achieved accidentally, and do not represent an accurate assessment of the true values. In statistics, as a result of simple probability, the likelihood of a result being achieved by chance decreases as the number of measurements increases. In the benchmarking of computer code, it is a trivial matter to increase the number of trials (the "n" in statistics) so that the likelihood of an accidental result is below any arbitrary threshold you care to define (the "alpha" or level of statistical significance).
To simplify: benchmark by running your code a huge number of times, and don't worry about statistical measurements.
Note to potential down-voters of this answer: this answer is somewhat of a simplification of the matter, designed to illustrate the concepts in an accessible way. Comments like "you clearly don't understand statistics" will result in a savage beat-down. Remember to be polite.
You are asking two questions:
How do you perform a test of statistical significance that the mean time of function A is greater than the mean time of function B?
If you want a certain confidence in your answer, how many samples should you take?
The most common answer to the first question is that you either want to compute a confidence interval or perform a t-test. It's not different than any other scientific experiment with random variation. To compute the 95% confidence interval of the mean response time for function A simply take the mean and add 1.96 times the standard error to either side. The standard error is the square root of the variance divided by N. That is,
95% CI = mean +/- 1.96 * sqrt(sigma2/N))
where sigma2 is the variance of speed for function A and N is the number of runs you used to calculate mean and variance.
Your second question relates to statistical power analysis and the design of experiments. You describe a sequential setup where you are asking whether to continue sampling. The design of sequential experiments is actually a very tricky problem in statistics, since in general you are not allowed to calculate confidence intervals or p-values and then draw additional samples conditional on not reaching your desired significance. If you wish to do this, it would be wiser to set up a Bayesian model and calculate your posterior probability that speed A is greater than speed B. This, however, is massive overkill.
In a computing environment it is generally pretty trivial to achieve a very small confidence interval both because drawing large N is easy and because the variance is generally small -- one function obviously wins.
Given that Wikipedia and most online sources are still horrible when it comes to statistics, I recommend buying Introductory Statistics with R. You will learn both the statistics and the tools to apply what you learn.
The research you site sounds more like a highly controlled environment. This is purely a practical answer that has proven itself time and again to be effective for performance testing.
If you are benchmarking code in a modern, multi-tasking, multi-core, computing environment, the number of iterations required to achieve a useful benchmark goes up as the length of time of the operation to be measured goes down.
So, if you have an operation that takes ~5 seconds, you'll want, typically, 10 to 20 iterations. As long as the deviation across the iterations remains fairly constant, then your data is sound enough to draw conclusions. You'll often want to throw out the first iteration or two because the system is typically warming up caches, etc...
If you are testing something in the millisecond range, you'll want 10s of thousands of iterations. This will eliminate noise caused by other processes, etc, firing up.
Once you hit the sub-millisecond range -- 10s of nanoseconds -- you'll want millions of iterations.
Not exactly scientific, but neither is testing "in the real world" on a modern computing system.
When comparing the results, consider the difference in execution speed as percentage, not absolute. Anything less than about 5% difference is pretty close to noise.
Do you really care about statistical significance or plain old significance? Ultimately you're likely to have to form a judgement about readability vs performance - and statistical significance isn't really going to help you there.
A couple of rules of thumb I use:
Where possible, test for enough time to make you confident that little blips (like something else interrupting your test for a short time) won't make much difference. Usually I reckon 30 seconds is enough for this, although it depends on your app. The longer you test for, the more reliable the test will be - but obviously your results will be delayed :)
Running a test multiple times can be useful, but if you're timing for long enough then it's not as important IMO. It would alleviate other forms of error which made a whole test take longer than it should. If a test result looks suspicious, certainly run it again. If you see significantly different results for different runs, run it several more times and try to spot a pattern.
The fundamental question you're trying to answer is how likley is it that what you observe could have happened by chance? Is this coin fair? Throw it once: HEADS. No it's not fair it always comes down heads. Bad conclusion! Throw it 10 times and get 7 Heads, now what do you conclude? 1000 times and 700 heads?
For simple cases we can imagine how to figure out when to stop testing. But you have a slightly different situation - are you really doing a statistical analysis?
How much control do you have of your tests? Does repeating them add any value? Your computer is deterministic (maybe). Eistein's definition of insanity is to repeat something and expect a different outcome. So when you run your tests do you get repeatable answers? I'm not sure that statistical analyses help if you are doing good enough tests.
For what you're doing I would say that the first key thing is to make sure that you really are measuring what you think. Run every test for long enough that any startup or shutdown effects are hidden. Useful performance tests tend to run for quite extended periods for that reason. Make sure that you are not actually measuing the time in your test harness rather than the time in your code.
You have two primary variables: how many iterations of your method to run in one test? How many tests to run?
Wikipedia says this
In addition to expressing the
variability of a population, standard
deviation is commonly used to measure
confidence in statistical conclusions.
For example, the margin of error in
polling data is determined by
calculating the expected standard
deviation in the results if the same
poll were to be conducted multiple
times. The reported margin of error is
typically about twice the standard
deviation.
Hence if your objective is to be sure that one function is faster than another you could run a number of tests of each, compute the means and standard deviations. My expectation is that if your number of iterations within any one test is high then the standard deviation is going to be low.
If we accept that defintion of margin of error, you can see whether the two means are further apart than their total margin's of error.