So basically, I'm trying to set up an AI agent that learns to control mouse movements using Tensorforce and pyautogui. I want this agent to be curious and able to respond to different things in the environment. Should I implement a second agent that controls the rewards given to the first agent, handing out a randomized reward based on the first agent's actions, or does Tensorforce already provide something for agent curiosity? I've noticed a few mentions of action_exploration in Tensorforce's library, but I don't really understand what it's supposed to do...
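For reference, here is roughly how I'm creating the agent at the moment. I'm assuming the exploration argument is what the action_exploration references are about, but I may be wrong about its semantics, and the CartPole environment is just a stand-in for my pyautogui setup:

    from tensorforce import Agent, Environment

    # Stand-in environment; the real one wraps pyautogui mouse control.
    environment = Environment.create(environment='gym', level='CartPole-v1')

    agent = Agent.create(
        agent='dqn',
        environment=environment,
        memory=10000,
        batch_size=32,
        exploration=0.1,  # assumed to mean epsilon-style random-action exploration
    )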
Good morning. I'm facing an RL problem which has many constraints. The main idea is that my agent will control many different machines, for example ordering them to go out to do their missions (the mission itself doesn't matter here), or ordering them to enter the depot and choosing the right place for each of them to sit (depending on the constraints).
The problem is: the agent takes decisions at predefined periods of time, and for each period we know which actions (go out, go in) are allowed. For example, at 8 o'clock it might decide to order 4 machines to go out, and at 14 o'clock decide to bring back 2 machines (choosing the right place for each).
In the literature I saw many ideas that refer to BDQ, but is it required for my problem? I'm thinking about having actions like [chooseMachine1, chooseMachine2, chooseMachine3, ..., chooseMachineN, goOut, goInPlace1, goInPlace2, goInPlace3, goInPlace4], and specifying in the code the logic that, depending on the current period, exposes only a number M <= N of machines to choose from, giving 0 probability to the actions that aren't possible at the moment (if it is 14 o'clock, only the machines that are out are concerned by the agent's decision). If the agent chooses Machine1, it then gets access only to the actions that are possible for that machine.
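Concretely, the masking step I have in mind would look something like this (a rough sketch; the Q-values and the validity mask are placeholders):

    import numpy as np

    def select_action(q_values, valid_mask, epsilon=0.1):
        # valid_mask is True where the action is allowed in the current period
        masked_q = np.where(valid_mask, q_values, -np.inf)  # invalid actions can never win
        if np.random.rand() < epsilon:
            return int(np.random.choice(np.flatnonzero(valid_mask)))  # explore valid actions only
        return int(np.argmax(masked_q))

    # e.g. at 14 o'clock only machines 1 and 3 are out, so only they can be chosen
    q = np.random.randn(9)  # placeholder network output, one value per action
    mask = np.array([False, True, False, True, False, False, False, False, False])
    action = select_action(q, mask)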
So, my question is: do you think my ideas are right? (I'm a beginner.) My idea is to build a DQN and supply the logic for the possible/impossible actions.
Do you think a BDQ is a better fit for my problem? That is, having N branches for N machines, where each branch has the same possible actions (branch1 (Machine1): goOut, goInPlace1, goInPlace2, ...)?
If so, are there any implementation examples?
If you have resources to recommend, I'd be glad to check them.
Thank you.
What would an agent navigating a maze do if the chosen action would run it into a wall?
I think the usual approach in RL is to allow the move and then handle the result in the environment. That way the environment can simply make nothing happen, or even give a negative reward, when an action is "disallowed".
At training convergence the agent will hopefully have learned not to choose ineffective actions.
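A minimal sketch of that idea for a grid maze (the layout and the penalty value are made up):

    import numpy as np

    class MazeEnv:
        # 4 moves: up, down, left, right
        MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

        def __init__(self):
            self.grid = np.array([[0, 1, 0],
                                  [0, 1, 0],
                                  [0, 0, 0]])  # 1 = wall, 0 = free
            self.pos = (0, 0)
            self.goal = (0, 2)

        def step(self, action):
            dr, dc = self.MOVES[action]
            r, c = self.pos[0] + dr, self.pos[1] + dc
            if not (0 <= r < 3 and 0 <= c < 3) or self.grid[r, c] == 1:
                return self.pos, -0.1, False  # "disallowed": nothing happens, small penalty
            self.pos = (r, c)
            done = self.pos == self.goal
            return self.pos, (1.0 if done else 0.0), done

The agent is free to pick the wall-bumping action; through the penalty it simply learns to stop doing so.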
Is there a way to model action masking for continuous action spaces? I want to model economic problems with reinforcement learning. These problems often have continuous action and state spaces. In addition, the state often influences which actions are possible, so the allowed actions change from step to step.
Simple example:
The agent has wealth (continuous state) and decides how much to spend (continuous action). The next period's wealth is then wealth minus spending. But the agent is restricted by a budget constraint: it is not allowed to spend more than its wealth. What is the best way to model this?
What I tried:
For discrete actions it is possible to use action masking, so in each time step I provided the agent with information about which actions are allowed and which are not. I also tried to do this with a continuous action space by providing lower and upper bounds on the allowed actions and clipping the actions sampled from the actor network (e.g. DDPG).
I am wondering whether this is a valid thing to do (it works in a simple toy model), because I did not find any RL library that implements it. Or is there a smarter way / best practice to give the agent the information about which actions are allowed?
I think you are on the right track. I've looked into masked actions and found two possible approaches: give a negative reward when an invalid action is attempted (without letting the environment evolve), or dive deeper into the neural network code and let the network output only valid actions.
I've always considered the latter the more efficient approach, and your idea of introducing boundaries seems very similar to it. So as long as boundaries are the type of mask you are looking for, I think you are good to go.
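For the wealth/spending example, the clipping amounts to something like this (the raw action is a stand-in for a noisy DDPG actor output):

    import numpy as np

    def feasible_spending(raw_action, wealth):
        # hard-clip the sampled action to the state-dependent set [0, wealth]
        return float(np.clip(raw_action, 0.0, wealth))

    def squashed_spending(raw_action, wealth):
        # alternative: squash an unbounded output to (0, 1), then rescale by wealth,
        # which keeps the mapping smooth near the boundary
        return float(1.0 / (1.0 + np.exp(-raw_action)) * wealth)

    wealth = 10.0
    raw = 12.3                              # actor proposes overspending
    print(feasible_spending(raw, wealth))   # -> 10.0, the budget constraint binds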
I'm a newbie in RL, so please forgive me if I ask a stupid question :)
I'm working on a DQN project right now, and it's very similar to the simplest snake game. The game is written in JS and has a demo (in which the snake moves randomly). But since I don't know how to write JS, I can't pass the action value to the game during the training process, so what I'm doing now is generating random game images and training the DQN model on those instead.
What I want to ask is: is it possible to do it this way? Can Q(s, a) still converge? If it's possible, is there anything I should pay attention to, and do I still need the epsilon parameter?
Thank you very much :)
I'd definitely say no!
The problem is that the agent will only ever learn from random decisions and can never test whether a learned action produces even more reward. So everything it learns will be based on the starting frames.
Furthermore, in your case the agent will never learn how to handle its size (if it grows like in Snake), because the bad random decisions mean it will never grow.
Imagine a child that is learning to ride a bike, and you lift it off the bike as soon as it has ridden one meter. It will probably manage to ride one or even a few meters straight, but it will never be able to do turns, etc.
I'm interested in writing software that uses machine learning and performs certain actions based on external data.
However, I've run into a problem (one that has always interested me):
how is it possible to write machine learning software that issues orders or sequences of orders?
The problem is that, as I understand it, a neural network gets a bunch of inputs and "recalls" an output based on the results of previous training, more or less instantly. So I'm not sure how "issuing orders" could fit into that system, especially when the actions performed by the system affect the system itself with a certain delay. I'm also a bit unsure how it would be possible to train such a thing.
Examples of such systems:
1. A first-person shooter enemy controller. As I understand it, it is possible to implement a neural network controller for a bot that switches between behavior strategies (well, assigns priorities to them) based on some inputs (probably something like health, ammo, etc.). But I don't see a way to make a higher-order controller that could issue a sequence of commands like "go there, then turn left". Also, the bot's actions will affect the very variables that control its behavior: shooting reduces ammo, falling from heights reduces health, etc.
2. An automated market trader. It is certainly possible to build a system that tries to predict the next market price of something. However, I don't see how it is possible to build a system that would issue an order to buy something, watch the trend, and then sell it back to take profit or cover losses.
3. A car driver. Again (as I understand it), it is possible to build a system that maintains a desired movement vector based on position/velocity/torque data and the results of previous training. However, I don't see a way to make such a system (learn to) perform a sequence of actions.
That is, as I understand it, a neural net is technically a matrix: you give it input, it produces output. But what about generating sequences of actions that change the environment the program operates in?
If such tasks are not entirely suitable for neural networks, what else could be used?
P.S. I understand that the question isn't exactly clear, and I suspect I'm missing some knowledge, so I'd appreciate some pointers (books/resources to read, etc.).
You could try to connect the output neurons to controllers directly, e.g. moving forward, turning, or shooting in the first-person shooter, or buy orders for the trader. However, I think the best results nowadays are obtained by letting the neural net solve one rather specific subproblem and then letting a "normal" program interpret its answer. For example, you could let the neural net construct a map overlay of "where do I want to be", which the bot then translates into movements. The neural network for the trader could produce a "how much do I want which security", which the bot then translates into buy or sell orders.
Deciding which subproblem the neural network should solve is central to its design. The important thing is that good solutions can be taught to the network.
Edit: expanding on the examples: when the shooter bot gets shot, it should not have wanted to be there; when it gets to shoot someone else, it should have wanted to be there more. When the trader loses money on a security, it should have wanted it less beforehand; when it gains, it should have wanted it more. These things can be taught.
The problem you are describing is known as reinforcement learning: essentially a machine learning algorithm (such as a neural network) coupled with a controller. It has been used for all of the applications you mention, even to drive real cars.
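At its core, the interaction loop is simple. A minimal sketch, where env, policy, and learn are placeholders for whatever environment, action-selection rule, and update rule you choose:

    def run_episode(env, policy, learn):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                       # the "order" issued this step
            next_state, reward, done = env.step(action)  # environment reacts, possibly with delay
            learn(state, action, reward, next_state)     # credit the action for what followed
            state = next_state

Sequences of orders emerge because the loop runs repeatedly and the learning rule propagates reward back to the earlier actions that set them up.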
I'm interested in how the protocols (and game loop) work for these types of games; any pointers or insights are appreciated.
I guess the main loop would have a world state that is advanced a few "ticks" per second, but how are the players' commands executed? What kind of data needs to go back and forth?
I can go into a lot of detail about this, but first go read "1500 Archers on a 28.8" (http://www.gamasutra.com/view/feature/3094/1500_archers_on_a_288_network_.php); it will answer many of your questions. Here's a summary:
First, most games use UDP due to the real-time nature of the game. The game loop looks something like this:
1. Read network data.
2. Do client-side predictions and compare with where the network says your objects should actually be.
3. Mess with physics to reconcile what the network says with your local game state.
4. Send data back out onto the network based on what you did this frame (if anything).
5. Render.
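In rough Python-style pseudocode, with every function a placeholder:

    def game_frame(net, world, local_input):
        snapshots = net.receive()                # 1. read network data
        predicted = world.predict(local_input)   # 2. client-side prediction
        world.reconcile(predicted, snapshots)    # 3. fudge physics toward the server's view
        net.send(world.local_events())           # 4. report what happened this frame
        world.render()                           # 5. draw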
That's vastly simplified, and "messing with physics" could easily fill a 200-page book on its own, but it involves predicting client-side where something is likely to be, receiving data from the server that is old but says exactly where an object was/should be, and then interpolating those values somehow so that the object appears "close enough" to where it's actually supposed to be that no one notices. This is super-critical in first-person shooters, but less so in real-time strategy.
For real-time strategy, what typically happens is a turn-based system in which time is divided into discrete chunks called "turns" that happen sequentially, and each turn has a number generated by a monotonic function that guarantees ever-increasing values in a particular order without duplicates. On any given turn n, each client sends a message to all other clients with its intended action for turn n + m, where m is an arbitrary number that is usually fairly small and is best determined through trial and error and playtesting. Once all the clients have sent their intended actions, each client executes all the actions scheduled for turn n + m when that turn arrives. This introduces a tiny delay between when the user orders an action and when it executes, but it is usually not noticeable.
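A bare-bones sketch of that scheduling (the network object is a placeholder and m is hard-coded):

    from collections import defaultdict

    TURN_DELAY = 2  # the "m" above; tune through playtesting

    class LockstepClient:
        def __init__(self, net):
            self.net = net
            self.turn = 0
            self.pending = defaultdict(list)  # turn number -> commands to run on that turn

        def issue(self, command):
            execute_on = self.turn + TURN_DELAY
            self.net.broadcast((execute_on, command))  # tell every other client
            self.pending[execute_on].append(command)

        def end_turn(self):
            for execute_on, command in self.net.receive():  # peers' scheduled commands
                self.pending[execute_on].append(command)
            self.turn += 1
            for command in self.pending.pop(self.turn, []):
                command.run()  # every client runs the same commands on the same turn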
There are several techniques that can be used to fudge the time as well. For example, if you highlight a unit and then tell it to move, it will make a sound and play an animation when it starts moving, but it won't actually move right away. However, the network message announcing the intent to move that unit is sent immediately, so by the time the screen responds to the player's input, the network messages have already been sent and acknowledged. You can fudge it further by introducing a small delay (100 ms or so) between the mouse click and the game object's response. This is usually not noticeable to the player, but 100 ms is an eternity in a LAN game, and even with a broadband connection on a home computer the average ping is probably around 15-60 ms, which gives you ample time to send the packet before the move.
As for the data to send, there are two types of data in games: deterministic and non-deterministic. Deterministic actions are grounded in game physics, so when such an action starts there is a 100% guarantee that I can predict its result. This data never needs to be sent across the network, since I can compute it on the client from the initial state. Note that using a random number generator with the same seed on every client turns "random" events into deterministic behavior. Non-deterministic data is usually user input, though in many cases it is possible to predict what a user's input is likely to be. The way these pair up in a real-time strategy game is that the non-deterministic event is some sort of order given to one of my game objects; once the object has been ordered to move, the way it moves is 100% deterministic. Therefore, all you need to send over the network is the ID of the object, the command given (make this an enum to save bandwidth), and the target of the command, if any (a spell may have no target if it's an area of effect, but a move command has a destination). If the user clicks 100 times to make a unit move, there is no need to send a separate move command for each click, since they're all in the same general area, so be sure to filter this out as well or it will kill your bandwidth.
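As a sketch of how small such an order can be on the wire (the exact layout here is made up for illustration):

    from dataclasses import dataclass
    from enum import IntEnum
    from typing import Optional, Tuple
    import struct

    class Command(IntEnum):  # an enum keeps the command itself to one byte
        MOVE = 0
        ATTACK = 1
        CAST_AOE = 2         # example of a command with no target

    @dataclass
    class Order:
        object_id: int
        command: Command
        target: Optional[Tuple[float, float]]  # destination, or None if untargeted

        def pack(self) -> bytes:
            # id (4 bytes) + command (1 byte) + target flag (1 byte) + optional x, y
            header = struct.pack("<IB", self.object_id, self.command)
            if self.target is None:
                return header + b"\x00"
            return header + b"\x01" + struct.pack("<ff", *self.target)

    # 14 bytes for a move order; everything after it plays out deterministically
    wire = Order(42, Command.MOVE, (128.0, 96.0)).pack()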
One final trick for handling a possible delay between a command and its execution is something called a local perception filter. If I receive a move order some time t after it was given, I know when the unit should have started moving and I know its destination. Rather than teleporting the unit to where it's supposed to be, I can start its movement late and then mess with physics to speed it up slightly so that it catches up to where it's supposed to be, and then slow it back down to put it in the correct place. The exact speed-up is also relative, and playtesting is the only way to determine the correct value, because it just has to "feel right". You can do the same thing with firing bullets and missiles, and it's highly effective there. The reason this works is that humans aren't terribly good at perceiving subtle changes in movement, particularly when an object is heading directly towards or away from them, so they simply don't notice.
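A toy version of the catch-up calculation; the boost cap and the catch-up window are exactly the kind of values the playtesting is for:

    def catch_up_speed(base_speed, lag, catch_up_time, boost_cap=1.5):
        # spread the missed time over the catch-up window, but cap the boost
        # so the correction stays subtle enough that players don't notice
        boost = 1.0 + lag / catch_up_time
        return base_speed * min(boost, boost_cap)

    # a unit that started 0.2 s late catches up smoothly over the next 2 s
    speed = catch_up_speed(base_speed=3.0, lag=0.2, catch_up_time=2.0)  # -> 3.3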
The next thing to think about is cutting down on bandwidth. Don't send messages to clients that couldn't possibly see or interact with a unit that is moving. Don't send the same message over and over just because the user keeps clicking. Don't send messages immediately for events that have no immediate effect. Finally, don't require an acknowledgement for events that will be stale if they fail to arrive: if I don't get a movement update, by the time I retransmit it, its value will be so old that it's no longer relevant, so it's better to just send another move and use a local perception filter to catch up, or use a cubic spline to interpolate the movement so that it looks more correct, or something of that nature. However, a critical event, such as "you're dead" or "your flag has been taken", should be acknowledged and retransmitted if needed. I teach network game programming at DigiPen, so feel free to ask any other questions about this, as I can probably provide an answer. Network game programming can be quite complicated, but ultimately it's all about making choices in your implementation and understanding the consequences of those choices.
Check out Battle for Wesnoth.
http://www.wesnoth.org/
It's free, open source, and totally awesome. You can learn a lot from digging into its source.
Discussion of Age of Empires network architecture here
IMHO, that style of peer-to-peer duplicated-replay-based architecture is impressive, but a bit of a dead end for anything more than 8 or so players.