Defining observation space and reward for traffic signal phase optimization for reinforcement learning - reinforcement-learning

I am trying to use Reinforcement Learning for traffic signal phase optimization for improving traffic flow at intersections.
I am aware that in practice we won't be able to get the information about all the vehicles in each of the lanes.
If we use a camera to get information about the queue length, then we can get accurate data only up to, say, 200 meters.
Should I take this into consideration while defining my observation space, or can I directly use the data from SUMO?
Furthermore, what should be the ideal observation space for such a task?
sumo_rl allows you to use various metrics for reward calculation, such as the pressure metric, the queue length metric, etc. What would be a good choice of reward for my use case, and what factors should I consider while defining my reward?
I have tried getting metrics such as throughput, lane delay and queue length from the E2 detector's output file. However, I might not be able to use that output file for the agent (since the TraCI/SUMO wrappers offer better implementations?), so how do I use TraCI to get this information during the simulation?

Yes, you should try to match your observation space to the real world as closely as possible. SUMO can also filter the data directly (for instance with an E3 detector).
If you want to maximize flow, then the reward should also include a flow metric (throughput). It's quite easy to get via TraCI (as you already noticed), but I cannot tell how it integrates with your framework since you did not give details about it.
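For example, here is a minimal TraCI sketch (the lane and detector IDs are placeholders for whatever you defined in your network and additional files) that approximates the camera's 200 m range by filtering vehicles by their distance to the stop line, and reads queue and throughput figures from E2/E3 detectors during the simulation instead of from the output file:

import traci

CAMERA_RANGE_M = 200.0  # assumed sensing range of the camera

def queue_length_within_range(lane_id):
    # Count halted vehicles on lane_id that are within CAMERA_RANGE_M of the
    # stop line (the end of the lane), mimicking the camera's limited view.
    lane_length = traci.lane.getLength(lane_id)
    count = 0
    for veh_id in traci.lane.getLastStepVehicleIDs(lane_id):
        dist_to_stop_line = lane_length - traci.vehicle.getLanePosition(veh_id)
        if dist_to_stop_line <= CAMERA_RANGE_M and traci.vehicle.getSpeed(veh_id) < 0.1:
            count += 1
    return count

def detector_metrics():
    # The same quantities read from detectors; "e2_north" and "e3_junction"
    # are hypothetical detector IDs from the additional file.
    queue = traci.lanearea.getLastStepHaltingNumber("e2_north")
    jam_meters = traci.lanearea.getJamLengthMeters("e2_north")
    throughput = traci.multientryexit.getLastStepVehicleNumber("e3_junction")
    return queue, jam_meters, throughput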

Related

Best practice to set Drake's simulator for fixed integration when using with reinforcement learning?

I'm using Drake for some model-free reinforcement learning and I noticed that Drake uses non-fixed-step integration when simulating an update. This makes sense for the sake of integrating multiple times over a smaller duration when the accelerations of a body are large, but in the case of reinforcement learning this results in significant compute overhead and slow rollouts. I was wondering if there is a principled way to make the simulation environment operate in a fixed-timestep integration mode beyond the method I'm currently using (code below). I'm using the PyDrake bindings, and PPO as the RL algorithm currently.
integrator = simulator.get_mutable_integrator()
integrator.set_fixed_step_mode(True)
One way to change the integrator that is used for continuous-time dynamics is to call ResetIntegratorFromFlags. For example, to use the RungeKutta2Integrator you would call:
ResetIntegratorFromFlags(simulator=simulator, scheme="runge_kutta2", max_step_size=0.01)
The other thing to keep in mind is whether the System(s) you are simulating use continuous- or discrete-time dynamics, and whether that is configurable in those particular System(s). If there are no continuous-time dynamics being simulated, then the choice of integrator does not matter. Only the update period(s) of the discrete systems will matter.
In particular, if you are simulating a MultibodyPlant, it takes a time_step argument to its constructor. When zero, it will use continuous-time dynamics; when greater than zero, it will use discrete-time dynamics.
When I've used Drake for RL, I've almost always put the MultibodyPlant into discrete mode. Starting with time_step=0.001 is usually a safe choice. You might be able to use a larger step depending on the bodies and properties in the scene.
I agree with @jwnimmer-tri -- I suspect for your use case you want to put the MultibodyPlant into discrete mode by specifying the time_step in the constructor.
And to the higher-level question -- I do think it is better to use fixed-step integration in RL. The variable-step integration is more accurate for any one rollout, but could introduce (small) artificial non-smoothness as your algorithm changes the parameters or initial conditions.
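For reference, a minimal PyDrake sketch of the discrete-plant setup described above (the model loading and diagram wiring are elided; time_step=1e-3 is just the suggested starting point):

from pydrake.systems.framework import DiagramBuilder
from pydrake.multibody.plant import AddMultibodyPlantSceneGraph
from pydrake.systems.analysis import Simulator

builder = DiagramBuilder()
# time_step > 0 puts the MultibodyPlant into discrete-time mode
plant, scene_graph = AddMultibodyPlantSceneGraph(builder, time_step=1e-3)
# ... add models to `plant` with a Parser and connect the rest of the diagram ...
plant.Finalize()
diagram = builder.Build()
simulator = Simulator(diagram)
simulator.AdvanceTo(1.0)  # advances with fixed discrete updates; no adaptive integrator involved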

Multiagent (not deep) reinforcement learning? Modeling the problem

I have N agents/users accessing a single wireless channel, and at each time step only one agent can access the channel and receive a reward.
Each user has a buffer that can store B packets, and I assume the buffer is infinite.
Each user n gets an observation from the environment telling it whether its packet in time slot t was a success or a failure (collision). If more than one user accesses the channel, they all get a penalty.
This feedback from the channel is the same for all users since we only have one channel. The reward is -B_n (the negative of the number of packets in the buffer). Each user wants to maximize its own reward and tries to empty its buffer.
Packets arrive at each user following a Poisson process with an average of $\lambda$ packets per time slot.
Each user keeps a history of the previous 10 time slots, which it uses as input to the DQN to output the probability of taking action A_n: stay silent or transmit. The history consists of (A_n, F, B_n).
Each user is unaware of the action and buffer status of other users.
I am trying to model my problem with multiagent reinforcement learning, and so far I have tried it with DQN, but the results are more or less the same as a random scheme. It could be that the users don't have enough contextual information to learn the behaviour of the other users? Or could there be another reason?
I would like to know how I can model my environment, since the state (in the RL sense) is static and the environment doesn't change. The only thing that changes is each user's history at each time slot. So I am not sure whether it is a partially observable MDP, or whether it should be modelled as a multiagent single-armed bandit problem, which I don't know is correct or not.
The second concern is that I have tried DQN and it has not worked, and I would like to know whether such a problem can be tackled with tabular Q-learning. I have not seen any multiagent work in which tabular Q-learning is used. Any insights would be helpful.
Your problem can be modeled as a Decentralized POMDP (see an overview here).
Summarizing this approach: you consider a multi-agent system in which each agent models its own policy, and then you try to build a joint policy from these individual ones. Of course, the complexity grows as the number of agents, states and actions increases, so there are several approaches, mainly based on heuristics, that prune branches of the joint policy tree that are not "good" in comparison with others. A well-known example of this approach is precisely about routing packets, where it is possible to define a discrete action/state space.
But be aware that even for tiny systems the complexity often becomes infeasible!
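As a concrete (and purely illustrative) sketch of the per-agent view described in the question, each agent could keep its own rolling window of (A_n, F, B_n) tuples and feed the flattened window to its own DQN; the class below is an assumption about how that history might be encoded, not part of any particular framework:

import numpy as np
from collections import deque

HISTORY_LEN = 10  # the last 10 time slots, as in the question

class AgentHistory:
    # Per-agent observation: a rolling window of (A_n, F, B_n) tuples,
    # flattened into the vector fed to that agent's DQN.
    def __init__(self):
        # pre-fill with zeros so the first observation already has full length
        self.window = deque([(0, 0, 0)] * HISTORY_LEN, maxlen=HISTORY_LEN)

    def push(self, action, feedback, buffer_size):
        # action: 0 = stay silent, 1 = transmit
        # feedback: channel outcome, e.g. 1 = success, -1 = collision, 0 = idle
        # buffer_size: B_n, packets currently queued (the reward would be -B_n)
        self.window.append((action, feedback, buffer_size))

    def observation(self):
        return np.asarray(self.window, dtype=np.float32).flatten()  # shape (30,)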

Why is a target network required?

I have trouble understanding why a target network is necessary in DQN. I'm reading the paper "Human-level control through deep reinforcement learning".
I understand Q-learning. Q-learning is a value-based reinforcement learning algorithm that learns the "optimal" state-action value function, which maximizes the long-term discounted reward over a sequence of timesteps.
Q-learning is updated using the Bellman equation, and a single step of the Q-learning update is given by
$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S', a') - Q(S, A) \right]$
where $\alpha$ and $\gamma$ are the learning rate and the discount factor.
I can understand that, without further measures, the reinforcement learning algorithm will become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate datasets provided to learn the probability distribution.
This is where I fail.
Let me break the paragraph from the paper down here for discussion
The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution -- I understood this part. As the Q-network changes, it may lead to instability and to changes in the data distribution. For example, if we suddenly always take a left turn, or something like this.
and the correlations between the action-values (Q) and the target values $r + \gamma \max_{a'} Q(s', a')$ -- This says that the target is the reward plus gamma times my prediction of the return, given that I take what I think is the best action in the next state and follow my policy from then on.
We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
So, in summary, is a target network required because the network keeps changing at each timestep and the "target values" are being updated at each timestep?
But I do not understand how it is going to solve it?
So, in summary, is a target network required because the network keeps changing at each timestep and the "target values" are being updated at each timestep?
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (a bunch of fully connected layers of the same size). The effect you typically see with this is referred to as "catastrophic forgetting", and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, then all of a sudden it completely craps out and starts making awful decisions again, even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better" as opposed to saying "I'm going to retrain myself how to play this entire game after every move". By giving your network more time to consider many actions that have taken place recently instead of updating all the time, it hopefully finds a more robust model before you start using it to make actions.
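To make that concrete, here is a minimal PyTorch sketch (the network sizes and update period are arbitrary placeholders, not values from the paper) of the periodic hard update to a target network:

import copy
import torch

q_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)  # frozen copy used only to compute targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, target_update_every = 0.99, 1000

def td_target(reward, next_state, done):
    # Targets come from the target network, so they stay fixed between syncs.
    with torch.no_grad():
        best_next = target_net(next_state).max(dim=1).values
    return reward + gamma * (1.0 - done) * best_next

# Inside the training loop, after every gradient step:
#   step += 1
#   if step % target_update_every == 0:
#       target_net.load_state_dict(q_net.state_dict())  # periodic hard sync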
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.

Does an increase in the number of mining pools decrease the block generation time?

Can anyone clarify whether or not an increase in the number of mining pools in the Ethereum network will decrease the average block generation time? (For example, if another pool like Ethermine joins the network today and starts mining.) Since all the pools are competing with each other, I am getting confused.
No, block generation times are driven by the current difficulty of the algorithm used in the Proof of Work model. Only when a solution is found is the block accepted onto the chain, and the difficulty determines how long it will take to find that solution. This difficulty automatically adjusts to speed up or slow down block generation times.
From the mining section of the Ethereum wiki:
The proof of work algorithm used is called Ethash (a modified version of Dagger-Hashimoto) and involves finding a nonce input to the algorithm so that the result is below a certain threshold depending on the difficulty. The point in PoW algorithms is that there is no better strategy to find such a nonce than enumerating the possibilities, while verification of a solution is trivial and cheap. If outputs have a uniform distribution, then we can guarantee that on average the time needed to find a nonce depends on the difficulty threshold, making it possible to control the time of finding a new block just by manipulating difficulty.
The difficulty dynamically adjusts so that on average one block is produced by the entire network every 12 seconds.
(Note that the current block generation time is closer to 15 seconds. You can find the block generation times on Etherscan)

What does the EpisodeParameterMemory of keras-rl do?

I have found the keras-rl/examples/cem_cartpole.py example and I would like to understand it, but I can't find documentation.
What does the line
memory = EpisodeParameterMemory(limit=1000, window_length=1)
do? What are limit and window_length? What effect does increasing either or both parameters have?
EpisodeParameterMemory is a special class that is used for CEM. In essence it stores the parameters of a policy network that were used for an entire episode (hence the name).
Regarding your questions: The limit parameter simply specifies how many entries the memory can hold. After exceeding this limit, older entries will be replaced by newer ones.
The second parameter is not used in this specific type of memory (CEM is somewhat of an edge case in Keras-RL and mostly there as a simple baseline). Typically, however, the window_length parameter controls how many observations are concatenated to form a "state". This may be necessary if the environment is not fully observable (think of it as transforming a POMDP into an MDP, or at least approximately). DQN on Atari uses this since a single frame is clearly not enough to infer the velocity of a ball with a FF network, for example.
Generally, I recommend reading the relevant paper (again, CEM is somewhat of an exception). It should then become relatively clear what each parameter means. I agree that Keras-RL desperately needs documentation but I don't have time to work on it right now, unfortunately. Contributions to improve the situation are of course always welcome ;).
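For intuition only (this is not Keras-RL's actual implementation), window_length behaves roughly like the following observation stacker, which turns the last few raw observations into a single state:

import numpy as np
from collections import deque

class WindowedObservation:
    # Rough sketch of the window_length idea: stack the last `window_length`
    # raw observations into one state for the agent.
    def __init__(self, window_length, obs_shape):
        self.frames = deque([np.zeros(obs_shape)] * window_length, maxlen=window_length)

    def add(self, observation):
        self.frames.append(np.asarray(observation))

    def state(self):
        # e.g. four stacked 84x84 Atari frames -> state of shape (4, 84, 84)
        return np.stack(list(self.frames), axis=0)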
A little late to the party, but I feel like the answer doesn't really answer the question.
I found this description online (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html#replay-memory):
We'll be using experience replay memory for training our DQN. It stores the transitions that the agent observes, allowing us to reuse this data later. By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.
Basically, you observe and save all of your state transitions so that you can train your network on them later (instead of having to gather new observations from the environment all the time).
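A minimal replay memory along the lines of the linked PyTorch tutorial looks something like this (a sketch, with capacity playing the same role as limit in Keras-RL):

import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayMemory:
    # Stores up to `capacity` transitions (oldest entries are dropped first)
    # and returns decorrelated random minibatches for training.
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)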