Stable-Baselines3 package, model.learn() function - how do total_timesteps and num_eval_episodes work together?

I am using the SB3 package for RL, and I'm trying out the model.learn() function.
I don't understand exactly what the model.learn() parameters do, in terms of how they interact with each other and with my environment.
My RL environment is built on a tabular dataset, so there's an inherent limit on the number of timesteps possible.
Let's say these are my conditions:
I have a dataset with 20,000 rows (possible timesteps)
In my environment, my step() function contains an if-statement which flips "done" to True when the number of steps taken reaches 1,000 (the step() function counts the number of times it's been called since the initialization of the env).
I run model.learn() with total_timesteps=30,000.
I encounter no errors when I do this. Can someone please explain what is happening? Is model.learn() running my environment through the first 1,000 timesteps, then restarting and looping this way until 30,000 total timesteps have been taken?
If so, how does num_eval_episodes feed into this? Does it change how the function runs? If so, how?
I'm sorry for the scattered question, I appreciate any clarification.

I'm working with SB3 as well these days, and I think your own assessment that "model.learn() runs the environment through the first 1,000 timesteps, then restarts and keeps looping this way until 30,000 total timesteps have been taken" is probably correct.
Have you ever set the if-statement that flips "done" to True to a number of steps greater than the length of your dataset?
As far as I know, SB3 works this way so that you can train on environments with or without a fixed number of timesteps, without running into infinite training in cases where a terminal state is never reached.
In my own application, which also has episodes with a fixed number of timesteps per episode (n_max_timesteps), I always set total_timesteps = n_episodes * n_max_timesteps in model.learn().
"n_eval_episodes" runs the agent for a specified number of episodes from reset to reaching a final / terminal state.

Related

OpenAI gym GuessingGame-v0 possible solutions

I have been struggling to solve the GuessingGame-v0 environment which is part of the OpenAI gym.
In the environment each episode a random number within a range is selected and the agent must "guess" what this random number is. The agent is only provided with the observation of whether the guess was too large or too small.
After researching how to frame the problem, I think it may be possible to model it as a Hidden Markov Model, but I am unsure how to do this.
Each episode the randomly selected number changes, and because of this I don't see how the model can avoid having to change each episode, since the goal state is continually shifting.
I could not find any resources on the environment or any environments similar to it other than the documentation provided by OpenAI which I did not find useful.
I would greatly appreciate any assistance on how to solve this environment.
I'm putting this as an answer so people don't have to read through the list of comments.
You need a program that can simply cycle through the following (see the sketch below):
generate the random number
the agent guesses a number (within the allowable guess range)
test whether the guess is within 1% of the number
if the guess is within 1%, stop the iteration, and maybe print the guess at this point
if the iteration is at step 200, stop the iteration and maybe produce some output that gives the final guessed number and the fact that it is not within 1%
if neither 200 steps nor 1%: (a) if the guess is too high, record the guess and that it is too high, or (b) if the guess is too low, record the guess and that it is too low; then narrow the guessing bounds accordingly. Repeat until either the 1% or the 200-step criterion is reached.
Another thought for you: do you need a starting low number and a starting high number?
There are a number of ways in which to implement this solution. There is also a range of programming software in which the solution can be implemented. The particular software you use is probably the one with which you are most familiar.
Good luck!
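Here is a standalone sketch of that loop in plain Python (outside gym; the bounds and the 1% tolerance are assumptions taken from the steps above):

import random

def guessing_game(low=-1000.0, high=1000.0, max_steps=200):
    target = random.uniform(low, high)
    lo, hi = low, high
    for step in range(1, max_steps + 1):
        guess = (lo + hi) / 2.0                        # bisect the current bounds
        if abs(guess - target) <= 0.01 * abs(target):  # within 1% of the number
            print(f"step {step}: {guess:.4f} is within 1% of {target:.4f}")
            return guess
        if guess < target:
            lo = guess                                 # too low: raise the lower bound
        else:
            hi = guess                                 # too high: lower the upper bound
    print(f"not within 1% after {max_steps} steps; last guess {guess:.4f}")
    return guess

guessing_game()

This also answers the starting-bounds question: with an initial low and an initial high, each too-high/too-low observation lets you halve the interval, which is just bisection.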

Difference between the effect of episodes and time steps in DQN, and where the experience replay is updated

In DeepMind's DQN paper, there are two loops: an outer loop over episodes and an inner loop over the time steps within each episode (one pass of the inner loop per environment step). Am I right?
Since nothing is done in the outer loop except initialization and resetting to the conditions of the first step, what difference do the two loop counts make?
For instance, what differences should we expect between case 1, running 1000 episodes of 400 time steps each, and case 2, running 4000 episodes of 100 time steps each?
(Is the difference that the second one has more chances to escape a local minimum, or something similar? Or are both the same?)
Another question: where is the experience replay updated?
For your first question: the answer is yes, there are two loops, and they do have differences.
You have to think about the true meaning of an episode. In most cases, we can consider each episode a 'game', and a 'game' needs to have an end. We should do our best to let every game end within the length of an episode (imagine what you could learn if you could never get out of a labyrinth). The Q-values of DQN are an approximation of 'current reward' + 'discounted future rewards', and you need to know when the future ends to make a good approximation.
So assume we usually take 200 steps to finish the game: then an episode of 100 time steps is hugely different from an episode of 400 time steps, because the shorter one cuts the game off before it can end.
The experience replay update happens at every time step. Beyond that, I don't get what you're asking; if you can explain your question in more detail, I think I can answer it.
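Schematically, the two loops and the point where the replay buffer is updated look like this (a sketch, not the paper's code; the environment and agent are trivial stand-ins so the loop structure is runnable):

import random
from collections import deque

class ReplayBuffer:
    # Fixed-capacity store of (s, a, r, s', done) transitions.
    def __init__(self, capacity=10_000):
        self.data = deque(maxlen=capacity)
    def add(self, *transition):
        self.data.append(transition)
    def sample(self, batch_size):
        return random.sample(list(self.data), min(batch_size, len(self.data)))

class ToyEnv:
    # Stand-in environment: random rewards, episode ends after 400 steps.
    def reset(self):
        self.t = 0
        return 0
    def step(self, action):
        self.t += 1
        return self.t, random.random(), self.t >= 400, {}

class ToyAgent:
    # Stand-in agent: random actions; a real DQN trains a Q-network.
    def act(self, state):
        return random.randint(0, 1)
    def train(self, batch):
        pass  # a real DQN would take a gradient step on the minibatch here

env, agent, replay_buffer = ToyEnv(), ToyAgent(), ReplayBuffer()
for episode in range(1000):                # outer loop: only resets the env (case 1)
    state = env.reset()
    for t in range(400):                   # inner loop: one time step per pass
        action = agent.act(state)
        next_state, reward, done, info = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)  # replay updated every step
        agent.train(replay_buffer.sample(32))                        # learning also every step
        state = next_state
        if done:                           # episode ended early
            break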

stepSimulation parameters in Bullet Physics

I use Bullet for physics simulation and don't care about real-time simulation: it's fine if one minute of model time lasts two hours in real time. I am trying to invoke a callback at fixed intervals of model time, but I realized that I don't understand how stepSimulation works.
The documentation of stepSimulation() isn't that clear. I would strongly appreciate it if someone explained what its parameters are for.
timeStep is simply the amount of time to step the simulation by. Typically you're going to be passing it the time since you last called it.
Why is it so? Why does everyone use the time passed since last simulation for this parameter? What will happen if we make this parameter fixed - say 0.1?
The third parameter is the size of that internal step. The second parameter is the maximum number of steps that Bullet is allowed to take each time you call it. It's important that timeStep < maxSubSteps * fixedTimeStep, otherwise you are losing time.
What exactly is internal step? Why is its size inversely proportional to resolution? Is it basically inverse frequency?
P.S. I somewhat duplicate the question which already exists, but the answer doesn't clarify what I would like to know.
Why is it so? Why does everyone use the time passed since last simulation for this parameter? What will happen if we make this parameter fixed - say 0.1?
It is so because the physics engine needs to know how much time has passed; otherwise it will run too fast or too slow and appear wrong to the user.
You can absolutely fix the parameter for your application.
For example
stepSimulation(btScalar(1.) / btScalar(60.)); // maxSubSteps defaults to 1 and fixedTimeStep to 1/60
Will step your physics world on by 1/60th of a second. In a game this would result in it running faster/slower depending on how far out the actual frame rate is from the 1/60th of a second we are passing.
What exactly is internal step? Why is its size inversely proportional to resolution? Is it basically inverse frequency?
It is the duration that Bullet will simulate at a time. For example, if you set fixedTimeStep to 1/60 and 1/6 of a second has passed since the last step, then Bullet will do ten internal steps of 1/60 rather than one large 1/6 step. This is so that it produces the same results every time; a varying timestep will not.
This is a great article on why you need a fixed physics timestep and what happens when you don't: http://gafferongames.com/game-physics/fix-your-timestep/
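Roughly, the substep logic inside stepSimulation behaves like the following sketch (written in Python only for illustration; World and its method names are stand-ins, not Bullet's API):

class World:
    # Stand-in for a Bullet dynamics world; only the stepping logic matters.
    def __init__(self):
        self.local_time = 0.0   # time accumulated but not yet simulated
    def internal_single_step(self, dt):
        print(f"substep of {dt:.4f} s")

def step_simulation(world, time_step, max_sub_steps=1, fixed_time_step=1.0 / 60.0):
    world.local_time += time_step
    n = int(world.local_time / fixed_time_step + 1e-9)  # whole fixed steps available (epsilon guards float rounding)
    n = min(n, max_sub_steps)                           # clamp; excess time is dropped ("losing time")
    for _ in range(n):
        world.internal_single_step(fixed_time_step)     # every substep has the same fixed size
    world.local_time -= n * fixed_time_step             # carry the remainder to the next call

step_simulation(World(), 1.0 / 6.0, max_sub_steps=10)   # ten substeps of 1/60 s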

ScriptDb operational limits

I have a script that generates about 20,000 small objects with about 8 simple properties each. My plan was to toss these objects into ScriptDb for later processing of the data.
What I'm experiencing, though, is that even with a saveBatch operation, the process takes much longer than desired and then silently stops. By too long, I mean it often exceeds the 5-minute execution limit, without throwing any error. The script runs so long that I've not attempted to check a mutation result to see what didn't make it, but from a check after execution it appears that most objects do not get saved.
So although I'm fairly certain that my collection of objects is below the storage size limit, is there a lesser-known limit or throttle on access that is causing me problems? Is the number of objects the culprit here? Should I instead attempt to save one big object that is a collection of the lesser ones?
I think it's the amount of data you're writing. You can store 20,000 small objects; you just can't write that many in 5 minutes. Write 1,000 and then quit; write the next thousand on the next run, and so on. Run your function 20 times and the data is loaded. If you need to do this more often or automatically, use ScriptApp to schedule time-driven triggers, as sketched below.
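The pattern is a resumable batch write. Here is a platform-agnostic sketch of the idea (in Python purely for illustration; ScriptDb itself lives in Google Apps Script, and every name below is a stand-in):

records = [{"id": i} for i in range(20_000)]   # stand-in for the 20,000 objects
saved = []                                     # stand-in for ScriptDb storage

def save_batch(chunk):
    saved.extend(chunk)                        # ScriptDb's saveBatch would go here

def save_next_chunk(offset, chunk_size=1000):
    # Write one chunk per run so no single run exceeds the execution limit;
    # persist the returned offset and resume from it on the next run.
    chunk = records[offset:offset + chunk_size]
    save_batch(chunk)
    return offset + len(chunk)

offset = 0
while offset < len(records):                   # in Apps Script, each pass would be
    offset = save_next_chunk(offset)           # a separate trigger-driven execution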

Project Euler 298 - there must be a correct answer? (only pastebinned code)

Project Euler has a paging file problem (though it's disguised in other words).
I tested my code (pastebinned so as not to spoil it for anyone) against the sample data and got the same memory contents and score as the problem. However, there is nowhere near a consistent grouping of scores. The problem asks for the expected difference in scores after 50 turns. A random sampling of my scores:
1.50000000
1.78000000
1.64000000
1.64000000
1.80000000
2.02000000
2.06000000
1.56000000
1.66000000
2.04000000
I've tried a few of those as answers, but none of them have been accepted... I know some people have succeeded, so I'm really confused - what the heck am I missing?
Your problem is likely that you aren't applying the definition of expected value: the expected value of a quantity is the probability-weighted average of its possible values, E[X] = sum over x of x * P(X = x).
You would have to run the simulation many times, maintain the frequency of each score difference's occurrence, and then take the weighted mean to get the expected value.
Of course, given that it is a Project Euler problem, there is probably a mathematical formula which can be used instead.
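As a small illustration of that weighted mean (using the sampled differences quoted in the question; these numbers are not the real answer):

from collections import Counter

diffs = [1.50, 1.78, 1.64, 1.64, 1.80]   # |L - R| scores from several runs
freq = Counter(diffs)                    # frequency of each observed difference
expected = sum(d * n for d, n in freq.items()) / sum(freq.values())
print(expected)                          # the weighted mean of the observed differences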
Yep, there is a correct answer. To be honest, Monte Carlo can in theory close in on the expected value, by the law of large numbers, but you won't want to try it here. In practice, each time you run the simulation you will get a slightly different result rounded to eight decimal places (and I suspect that requirement exists exactly to deprive anybody of any thought of using Monte Carlo). If you are lucky, one simulation out of many trials will deliver the right answer, after you have submitted all the previous failures. I think the captcha is the second way Project Euler makes you give up on any brute-force approach.
Well, I agree with Moron: you have to figure out "expected value" first. The principle of this problem is that you have to find a way to enumerate every possible "essential" outcome after 50 rounds. Each outcome has its own probability and its own |L - R|, so weight them and sum them up, and you have the answer. Needless to say, a brute-force approach fails in most cases, and especially in this one. Fortunately, we have dynamic programming (dp), which is fast!
Basically, dp saves the computation results of each round as states and reuses them in the next round, so it avoids repeating the same computation over and over again. The difficult part of this problem is finding a way to represent a state, that is to say, how you would like to save your intermediate results. If you have solved problem 290 with dp, you can get some hints there about how to understand the problem and formulate a state.
Actually, that isn't the hardest part for the mind. The hardest mental piece is realizing that some memory states of the two players are numerically different but substantially equivalent. For example, L:12345 R:12345 vs L:23456 R:23456, or even vs L:98765 R:98765. That is due to the fact that the call is random. That is also why I wrote possible "essential" outcomes: you can merge several such states into one, and only by doing so can your program finish in reasonable time.
I would run your simulation a whole bunch of times and then take a weighted average of the |L - R| value over all the runs. That should get you closer to the expected value.
Just submitting one run as an answer is really unlikely to work. Imagine it were a dice-roll expected value: roll one die, score a 6, and submit that as the expected value.
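To make that last point concrete with the dice example:

import random

rolls = [random.randint(1, 6) for _ in range(100_000)]
print(sum(rolls) / len(rolls))   # ~3.5, the expected value; any single roll is a poor estimate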