Difference between the effect of episodes and time steps in DQN, and where the experience replay is updated - deep-learning

In DeepMind's DQN paper, there are two loops: an outer loop over episodes and an inner loop over the time steps within each episode (one for training runs and one for the individual time steps of each run). Am I right?
Since nothing is done in the outer loop except initialization and resetting to the conditions of the first step, what is the difference between the two?
For instance, what differences should we expect between case 1, running 1000 episodes of 400 time steps each, and case 2, running 4000 episodes of 100 time steps each?
(Is the difference that the second one has a better chance of escaping local minima, or something similar? Or are the two equivalent?)
Another question: where is the experience replay buffer updated?

For your first question: the answer is yes, there are two loops, and they do have differences.
You have to think about what an episode really means. In most cases, we can consider each episode a 'game', and a 'game' needs to have an end. We should do our best to let every game end within the length of an episode (imagine what you could learn if you could never get out of a labyrinth). The Q values of DQN approximate 'current reward' + 'discounted future rewards', and you need to know when the future ends to make a good approximation.
So if the game usually takes about 200 steps to finish, an episode capped at 100 time steps is hugely different from one capped at 400 time steps.
As for the experience replay update: it happens at every time step. Beyond that, I don't quite get what you're asking; if you can explain your question in more detail, I'll try to answer it.
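To make the two-loop structure concrete, here is a minimal, runnable sketch. It substitutes a tabular Q function and a toy environment for the deep network, so every name here is illustrative rather than taken from the paper's pseudocode; the point is where the two loops sit, and that the replay buffer is appended to at every inner-loop step.

```python
import random
from collections import deque, defaultdict

class ToyEnv:
    """Hypothetical 1-D chain: walk right to reach a reward at position 9."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):                  # action: 0 = left, 1 = right
        self.pos = max(0, min(9, self.pos + (1 if action else -1)))
        done = self.pos == 9
        return self.pos, (1.0 if done else 0.0), done

env = ToyEnv()
q = defaultdict(lambda: [0.0, 0.0])          # tabular stand-in for the Q-network
replay = deque(maxlen=10_000)                # the experience replay buffer
gamma, alpha, eps, batch = 0.99, 0.1, 0.1, 32

for episode in range(200):                   # outer loop: one 'game' per episode
    state = env.reset()                      # only (re)initialization happens here
    for t in range(400):                     # inner loop: time steps within the game
        if random.random() < eps:
            action = random.randrange(2)     # explore
        else:
            action = max((0, 1), key=lambda a: q[state][a])  # exploit
        next_state, reward, done = env.step(action)
        replay.append((state, action, reward, next_state, done))  # replay updated every step
        minibatch = random.sample(list(replay), min(batch, len(replay)))
        for s, a, r, n, d in minibatch:      # in real DQN this is one gradient step
            target = r + (0.0 if d else gamma * max(q[n]))
            q[s][a] += alpha * (target - q[s][a])
        state = next_state
        if done:                             # the 'game' ended before 400 steps
            break
```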

Related

Stable-Baselines3 package, model.learn() function - how do total_timesteps and num_eval_episodes work together?

I am using the SB3 package for RL, and I'm trying out the model.learn() function.
I don't understand exactly what model.learn() parameters do in terms of how they work together and with my environment.
My RL works from a tabular dataset, so there's an inherent limit on the number of timesteps possible.
Let's say these are my conditions:
I have a dataset with 20,000 rows (possible timesteps)
In my environment, my step() function contains an if-statement that flips "done" to True when the number of steps taken reaches 1,000 (the step() function counts how many times it has been called since the environment was initialized).
I run model.learn() with total_timesteps = 30,000.
I encounter no errors when I do this. Can someone please explain what is happening? Is model.learn() running my environment through the first 1,000 timesteps, then restarting and looping this way until 30,000 total timesteps have been taken?
If so, how does num_eval_episodes feed into this? Does it change how the function runs? If so, how?
I'm sorry for the scattered question, I appreciate any clarification.
I'm working with SB3 as well these days, and I think your own assessment that model.learn() runs the environment through the first 1,000 timesteps, then restarts and keeps looping until 30,000 total timesteps have been taken, is probably correct.
Have you ever set the if-statement that flips "done" to True to a number of steps greater than your dataset?
As far as I know, SB3 works that way so that you can train on environments with or without a fixed number of timesteps, without running into infinite training when a terminal state is never reached.
In my own application, which also has episodes with a fixed number of timesteps per episode (n_max_timesteps), I always set total_timesteps = n_episodes * n_max_timesteps in model.learn().
"n_eval_episodes" runs the agent for a specified number of episodes from reset to reaching a final / terminal state.

OpenAI gym GuessingGame-v0 possible solutions

I have been struggling to solve the GuessingGame-v0 environment which is part of the OpenAI gym.
In the environment each episode a random number within a range is selected and the agent must "guess" what this random number is. The agent is only provided with the observation of whether the guess was too large or too small.
After researching how to frame the problem I think it may be possible to frame the problem as a Hidden Markov Model, but I am unsure of how to do this.
The randomly selected number changes each episode, and because of this I don't see how the model can avoid having to change each episode, since the goal state is continually shifting.
I could not find any resources on the environment or any environments similar to it other than the documentation provided by OpenAI which I did not find useful.
I would greatly appreciate any assistance on how to solve this environment.
I'm putting this as an answer so people don't have to read through the list of comments.
You need a program that can simply cycle through:
generate the random number
the agent guesses a number (within the allowable guess range)
test whether the guess is within 1% of the number
if the guess is within 1%, stop the iteration, and maybe print the guess at this point
if the iteration is at step 200, stop the iteration and maybe produce some output that gives the final guessed number and the fact that it is not within 1%
if neither 200 steps nor 1% has been reached: (a) if the guess is too high, record the guess and that it was too high, or (b) if the guess is too low, record the guess and that it was too low; then narrow the guessing range to that new bound and repeat until either the 1% or the 200-step criterion is met (see the sketch below)
Another thought for you: do you need a starting low number and a starting high number?
There are a number of ways in which to implement this solution. There is also a range of programming software in which the solution can be implemented. The particular software you use is probably the one with which you are most familiar.
Good luck!
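For what it's worth, here is a minimal sketch of that loop as a plain bisection in Python, without the gym wrapper (whose exact observation encoding may differ); the starting bounds and the 1% tolerance are illustrative.

```python
import random

def guess_number(low=-1000.0, high=1000.0, max_steps=200, tol=0.01):
    """Bisect toward a hidden number using only 'too high'/'too low' feedback."""
    target = random.uniform(low, high)
    for step in range(1, max_steps + 1):
        guess = (low + high) / 2              # always guess the midpoint
        if abs(guess - target) <= tol * abs(target):
            print(f"Found {guess:.4f} in {step} steps (target {target:.4f})")
            return guess
        if guess < target:
            low = guess                       # too low: raise the lower bound
        else:
            high = guess                      # too high: lower the upper bound
    print(f"Not within {tol:.0%} after {max_steps} steps; last guess {guess:.4f}")
    return guess

guess_number()
```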

Will alpha-beta pruning remove randomness in my solution with minimax?

Existing implementation:
In my implementation of Tic-Tac-Toe with minimax, I look for all boxes that give the best result and choose one of them randomly, so that the same solution isn't displayed each time.
For example, if the returned list is [1, 0, 1, -1], I will randomly choose between the two highest values.
Question about Alpha-Beta Pruning:
Based on what I understood, when the algorithm finds that it is winning along one path, it no longer needs to look at other paths that might or might not lead to a winning case.
So will this, as I suspect, cause the earliest possible box that leads to the best outcome to be displayed as the result, looking the same each time? For example, at the time of the first move, all moves lead to a draw. So will the first box be selected every time?
How can I bring randomness to the solution, as with the minimax version? One way I can think of would be to pass the indices to the alpha-beta algorithm in random order, so the result will be the first best solution in that randomly shuffled list of positions.
Thanks in advance. If there is some literature on this, I'd be glad to read it.
If someone could post a good reference for alpha-beta pruning, that would be excellent, as I had a hard time understanding how to apply it.
To randomly pick among multiple best solutions (all equal) in alpha-beta pruning, you can modify your evaluation function to add a very small random number whenever you evaluate a game state. You should just make sure that the magnitude of that random number is never greater than the true difference between the evaluations of two states.
For example, if the true evaluation function for your game state can only return values -1, 0, and 1, you could add a randomly generated number in the range [0.0, 0.01] to the evaluation of every game state.
Without this, alpha-beta pruning doesn't necessarily find only one solution. Consider this example from Wikipedia: in the middle, you can see that two solutions with an evaluation of 6 were found, so it can find more than one. I do think it will still find all moves leading to optimal solutions at the root node, but not necessarily all solutions deeper in the tree. Suppose, in the example image, that the pruned node with a score of 9 in the middle actually had a score of 6. It would still get pruned there, so that particular solution wouldn't be found, but the move from the root node leading to it (the middle move at the root) would still be found. So, eventually, you would be able to reach it.
Some interesting notes:
This implementation would also work in plain minimax, and it avoids the need to store a list of multiple (equally good) solutions.
In more complex games than Tic-Tac-Toe, where you cannot search the complete state space, adding a small random number for the max player and deducting a small random number for the min player like this may actually slightly improve your heuristic evaluation function. The reason is as follows. Suppose in state A you have 5 moves available, and in state B you have 10 moves available, all of which result in the same heuristic evaluation score. Intuitively, the successors of state B may be slightly better, because in many games having more moves available means you are in a better position. Because you generated 10 random numbers for the 10 successors of state B, it is also a bit more likely that the highest generated random number is among those 10 (instead of among the 5 generated for the successors of A).
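A minimal sketch of the jitter idea on Tic-Tac-Toe (the names and structure are mine, not from the question's code): the true scores are only -1, 0, or 1, so jitter in [0, 0.01] can never reorder genuinely different outcomes, but it breaks ties differently on each run.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def jittered_eval(board):
    # True evaluation is -1, 0, or 1; the tiny jitter only breaks ties.
    true_score = {"X": 1.0, "O": -1.0, None: 0.0}[winner(board)]
    return true_score + random.uniform(0.0, 0.01)

def alphabeta(board, player, alpha, beta):
    if winner(board) or " " not in board:     # terminal: win, loss, or draw
        return jittered_eval(board)
    best = float("-inf") if player == "X" else float("inf")
    for i, cell in enumerate(board):
        if cell != " ":
            continue
        child = board[:i] + player + board[i + 1:]
        val = alphabeta(child, "O" if player == "X" else "X", alpha, beta)
        if player == "X":
            best = max(best, val)
            alpha = max(alpha, best)
        else:
            best = min(best, val)
            beta = min(beta, best)
        if alpha >= beta:                     # prune the remaining siblings
            break
    return best

def best_move(board, player="X"):
    opponent = "O" if player == "X" else "X"
    moves = [i for i, c in enumerate(board) if c == " "]
    pick = max if player == "X" else min
    return pick(moves, key=lambda i: alphabeta(
        board[:i] + player + board[i + 1:], opponent,
        float("-inf"), float("inf")))

print(best_move(" " * 9))   # varies between runs instead of always picking box 0
```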

Getting average or keeping temp data in db - performance concern

I am building a little app for users to create collections, and I want to have a rating system in there. Now, since I want to cover all my bases, let's pretend that I have a lot of visitors, so performance comes into play, especially with ratings.
Let's suppose that I have a rates table with id, game_id, user_id and rate. The data is simple: one entry per user. Let's suppose again that 1000 users will rate one game, and I want to display the average rating on that game's subpage (and somewhere else, like on the games list). For now, I have two scenarios to go with:
Getting AVG each time the game is displayed.
Creating another column in games, called temprate, and storing the rating for the game there. It would be updated every time someone votes.
Both scenarios have obvious flaws. The first is more stressful for my host, since it will definitely consume more of the machine's power. The second means more work while rating (getting all the game data, submitting the rate, computing the new AVG).
Please advise me: which scenario should I go with? Or maybe you have some other ideas?
I work with PDO and no framework.
So I've finally managed to solve this issue. I used file caching based on dumping arrays into files; I just go with something like if (cache) { $var = cache } else { $var = db }. I am using JG Cache for now, but I'll probably write something similar myself soon. For now, it's a great solution.
I'd have gone with a variation of your "number 2" solution (update a separate rating column), maybe in a separate table just for this.
If the number of writes becomes a problem, that will happen well after select avg(foo) from ... becomes one, and there are lots of ways to mitigate it, such as updating the average rating periodically or processing new votes in batches every so often.
Likely you eventually can't just do an avg() anyway, because you have to check each vote for fraud, calculate a sort score, and who knows what else.
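For illustration, here is a sketch of the incremental flavour of that approach using Python's sqlite3 (the asker works in PHP/PDO, but the SQL is the same idea): keep a running sum and count per game, so each vote is an O(1) update and reads never need a full-table AVG().

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE rates (id INTEGER PRIMARY KEY, game_id INT, user_id INT, rate INT);
    CREATE TABLE games (id INTEGER PRIMARY KEY,
                        rate_sum INT DEFAULT 0, rate_count INT DEFAULT 0);
    INSERT INTO games (id) VALUES (1);
""")

def add_vote(game_id, user_id, rate):
    with db:   # one transaction per vote
        db.execute("INSERT INTO rates (game_id, user_id, rate) VALUES (?, ?, ?)",
                   (game_id, user_id, rate))
        db.execute("UPDATE games SET rate_sum = rate_sum + ?, "
                   "rate_count = rate_count + 1 WHERE id = ?", (rate, game_id))

add_vote(1, 42, 5)
add_vote(1, 43, 3)

# Reading the average is now a cheap single-row lookup, not a table scan.
avg, = db.execute("SELECT CAST(rate_sum AS REAL) / rate_count "
                  "FROM games WHERE id = 1").fetchone()
print(avg)   # 4.0
```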

Project Euler 298 - there must be a correct answer? (only pastebinned code)

Project Euler has a paging file problem (though it's disguised in other words).
I tested my code (pastebinned so as not to spoil it for anyone) against the sample data and got the same memory contents and score as the problem. However, there is nowhere near a consistent grouping of scores. It asks for the expected difference in scores after 50 turns. A random sampling of scores:
1.50000000
1.78000000
1.64000000
1.64000000
1.80000000
2.02000000
2.06000000
1.56000000
1.66000000
2.04000000
I've tried a few of those as answers, but none of them have been accepted... I know some people have succeeded, so I'm really confused - what the heck am I missing?
Your problem is likely that you're overlooking the definition of expected value.
You will have to run the simulation many times, maintain the frequency of each score difference, and then take the weighted mean to get the expected value.
Of course, given that it is a Project Euler problem, there is probably a mathematical formula that can be used directly.
Yep, there is a correct answer. To be honest, Monte Carlo can theoretically close in on the expected value, by the law of large numbers. However, you won't want to try it here, because in practice each run of the simulation gives a slightly different result when rounded to eight decimal places (and I think this setting is designed precisely to deprive anybody of any chance of even considering Monte Carlo). If you are lucky, one run out of many will deliver the right answer, after you have submitted all the previous failed attempts. I think the captcha is Project Euler's second way of making you give up on brute-force approaches.
Well, I agree with Moron: you have to figure out "expected value" first. The principle of this problem is that you have to find a way to enumerate every possible "essential" outcome after 50 rounds. Each outcome has its own |L - R| and its own probability, so sum up the probability-weighted values and you have the answer. Needless to say, a brute-force approach fails in most cases, and especially in this one. Fortunately, we have dynamic programming (dp), which is fast!
Basically, dp saves the computation results of each round as states and reuses them in the next round, so it avoids repeating the same computation over and over again. The difficult part of this problem is finding a way to represent a state, that is to say, how to store your intermediate results. If you have solved problem 290 with dp, you can get some hints there about how to understand the problem and formulate a state.
Actually, that isn't the hardest part. The hardest mental step is realizing that some memory states of the two players are numerically different but substantially equivalent, for example L:12345 R:12345 vs L:23456 R:23456, or even vs L:98765 R:98765. That is due to the fact that the numbers called are random, and it is also why I wrote possible "essential" outcomes: you can collapse several such states into one, and only by doing so can your program finish in reasonable time.
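As a sketch of that equivalence (the helper below is hypothetical, not from an actual solution): relabel the numbers in both players' memories by order of first appearance, so numerically different but substantially equivalent states collapse to a single dp key.

```python
def canonical(larry, robin):
    """Map both memories to a shared relabeling by order of first appearance."""
    relabel = {}
    def key(mem):
        return tuple(relabel.setdefault(x, len(relabel)) for x in mem)
    return key(larry), key(robin)

# All three equivalent states collapse to the same dp key:
print(canonical((1, 2, 3, 4, 5), (1, 2, 3, 4, 5)))   # ((0, 1, 2, 3, 4), (0, 1, 2, 3, 4))
print(canonical((2, 3, 4, 5, 6), (2, 3, 4, 5, 6)))   # same canonical form
print(canonical((9, 8, 7, 6, 5), (9, 8, 7, 6, 5)))   # same canonical form
```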
I would run your simulation a whole bunch of times and then take a weighted average of the |L - R| values over all the runs. That should get you closer to the expected value.
Just submitting one run as an answer is really unlikely to work. Imagine it was the expected value of a dice roll: roll one die, score a 6, and submit that as the expected value.
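To make that dice analogy concrete, here is a tiny runnable example of the frequency-weighted mean described above, estimating the expected value of a die roll (which is exactly 3.5) instead of the actual 50-turn game:

```python
import random
from collections import Counter

counts = Counter(random.randint(1, 6) for _ in range(100_000))
total = sum(counts.values())
expected = sum(face * freq for face, freq in counts.items()) / total
print(f"{expected:.8f}")   # close to 3.50000000; a single roll never is
```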