(crossposted: https://ai.stackexchange.com/questions/15693/state-reward-per-step-in-a-multiagnet-environment)
In a single agent environment, the agent takes an action, then observes the next state and reward:
for ep in num_episodes:
    action = dqn.select_action(state)
    next_state, reward = env.step(action)
Implicitly, the logic for moving the simulation (env) forward is embedded inside the env.step() function.
Now in the multiagent scenario, agent 1 ($a_1$) has to make a decision at time $t_{1a}$, which will finish at time $t_{2a}$, and agent 2 ($a_2$) makes a decision at time $t_{1b} < t_{1a}$ which is finished at $t_{2b} > t_{2a}$.
If both of their actions would start and finish at the same time, then it could easily be implemented as:
for ep in num_episodes:
    action1, action2 = dqn.select_action([state1, state2])
    next_state_1, reward_1, next_state_2, reward_2 = env.step([action1, action2])
because the env can execute both in parallel, wait till they are done, and then return the next states and rewards. But in the scenario that I described previously, it is not clear (at least to me) how to implement this. Here we need to explicitly track time and check at every timepoint whether an agent needs to make a decision. Just to be concrete:
for ep in num_episodes:
    for t in total_time:
        action1 = dqn.select_action(state1)
        env.step(action1)  # this step might take 5t to complete,
        # so the step() function won't return the reward until 5t later.
        # In the meantime, agent 2 arrives and has to make a decision; its reward
        # and next state won't be observed until 10t later.
To summarize, how would one implement a multiagent environment with asynchronous action/rewards per agents?
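One possible shape for such a loop (a rough sketch only, not taken from the original post; the env methods finished_actions, agents_needing_action, observe, schedule, and tick are hypothetical) is to advance the simulation one tick at a time, query an agent only when its previous action has completed, and cache the pending (state, action) pair until its reward arrives:

def run_episode(env, agents, total_time):
    # agents: dict mapping agent_id -> learner with select_action()/store()
    pending = {}  # agent_id -> (state, action) whose reward has not arrived yet

    for t in range(total_time):
        # 1) hand out rewards for actions that just finished at time t
        for agent_id, (next_state, reward) in env.finished_actions(t).items():
            state, action = pending.pop(agent_id)
            agents[agent_id].store(state, action, reward, next_state)

        # 2) only agents that are idle at time t get to pick a new action
        for agent_id in env.agents_needing_action(t):
            state = env.observe(agent_id)
            action = agents[agent_id].select_action(state)
            env.schedule(agent_id, action, t)  # may take several ticks to finish
            pending[agent_id] = (state, action)

        # 3) advance the simulation by one time step
        env.tick()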
I am using Ray 1.3.0 (for RLlib) with a combination of SUMO version 1.9.2 for the simulation of a multi-agent scenario. I have configured RLlib to use a single PPO network that is commonly updated/used by all N agents. My evaluation settings look like this:
# === Evaluation Settings ===
# Evaluate with every `evaluation_interval` training iterations.
# The evaluation stats will be reported under the "evaluation" metric key.
# Note that evaluation is currently not parallelized, and that for Ape-X
# metrics are already only reported for the lowest epsilon workers.
"evaluation_interval": 20,
# Number of episodes to run per evaluation period. If using multiple
# evaluation workers, we will run at least this many episodes total.
"evaluation_num_episodes": 10,
# Whether to run evaluation in parallel to a Trainer.train() call
# using threading. Default=False.
# E.g. evaluation_interval=2 -> For every other training iteration,
# the Trainer.train() and Trainer.evaluate() calls run in parallel.
# Note: This is experimental. Possible pitfalls could be race conditions
# for weight synching at the beginning of the evaluation loop.
"evaluation_parallel_to_training": False,
# Internal flag that is set to True for evaluation workers.
"in_evaluation": True,
# Typical usage is to pass extra args to evaluation env creator
# and to disable exploration by computing deterministic actions.
# IMPORTANT NOTE: Policy gradient algorithms are able to find the optimal
# policy, even if this is a stochastic one. Setting "explore=False" here
# will result in the evaluation workers not using this optimal policy!
"evaluation_config": {
# Example: overriding env_config, exploration, etc:
"lr": 0, # To prevent any kind of learning during evaluation
"explore": True # As required by PPO (read IMPORTANT NOTE above)
},
# Number of parallel workers to use for evaluation. Note that this is set
# to zero by default, which means evaluation will be run in the trainer
# process (only if evaluation_interval is not None). If you increase this,
# it will increase the Ray resource usage of the trainer since evaluation
# workers are created separately from rollout workers (used to sample data
# for training).
"evaluation_num_workers": 1,
# Customize the evaluation method. This must be a function of signature
# (trainer: Trainer, eval_workers: WorkerSet) -> metrics: dict. See the
# Trainer.evaluate() method to see the default implementation. The
# trainer guarantees all eval workers have the latest policy state before
# this function is called.
"custom_eval_function": None,
What happens is that every 20 iterations (each iteration collecting "X" training samples), there is an evaluation run of at least 10 episodes. The rewards received by all N agents are summed over these episodes, and that total is reported as the reward sum for that evaluation run. Over time, I notice a pattern in the reward sums that keeps repeating over the same interval of evaluation runs, and the learning goes nowhere.
UPDATE (23/06/2021)
Unfortunately, I did not have TensorBoard activated for that particular run but from the mean rewards that were collected during evaluations (that happens every 20 iterations) of 10 episodes each, it is clear that there is a repeating pattern as shown in the annotated plot below:
The 20 agents in the scenario should be learning to avoid colliding, but instead they somehow stagnate at a certain policy and end up showing the exact same reward sequence during evaluation.
Is this a characteristic of how I have configured the evaluation aspect, or should I be checking something else? I would be grateful if anyone could advise or point me in the right direction.
Thank you.
Step 1: I noticed that when I stopped the run at some point for any reason and then restarted it from the saved checkpoint, most graphs on TensorBoard (including rewards) charted out the line in EXACTLY the same fashion all over again, which made it look like the sequence was repeating.
Step 2: This led me to believe that there was something wrong with my checkpoints. I compared the weights across checkpoints using a loop and, voila, they were all the same! Not a single change! So either there was something wrong with the saving/restoring of checkpoints (which, after a bit of playing around, I found was not the case), or my weights were simply not being updated.
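For reference, that weight comparison can be done through the RLlib API; a minimal sketch, assuming a single shared policy (the policy id "shared_ppo_policy", the checkpoint paths, the env name, and my_training_config are placeholders, not the exact setup from the question):

import numpy as np
import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
trainer = PPOTrainer(config=my_training_config, env="my_sumo_env")  # same config as used for training

trainer.restore("checkpoints/checkpoint_100/checkpoint-100")
w_old = trainer.get_policy("shared_ppo_policy").get_weights()

trainer.restore("checkpoints/checkpoint_120/checkpoint-120")
w_new = trainer.get_policy("shared_ppo_policy").get_weights()

# if every layer is bit-for-bit identical across checkpoints, nothing is being trained
identical = all(np.array_equal(w_old[k], w_new[k]) for k in w_old)
print("weights identical across checkpoints:", identical)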
Step 3: I sifted through my training configuration to see if something there was preventing the network from learning, and I noticed I had set the "policies_to_train" option of my "multiagent" configuration to a policy that did not exist. Unfortunately, this either did not throw a warning/error, or it did and I completely missed it.
Solution step: After setting the multiagent "policies_to_train" configuration option correctly, it started to work!
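For anyone hitting the same issue, the fix boils down to making sure the id listed in "policies_to_train" actually exists under "policies". A minimal sketch of such a config (the policy id, observation/action spaces, and mapping function are placeholders for whatever your setup uses):

config["multiagent"] = {
    "policies": {
        # (policy_cls, obs_space, act_space, config); None -> use the trainer's default (PPO here)
        "shared_ppo_policy": (None, obs_space, act_space, {}),
    },
    # every agent maps onto the single shared policy
    "policy_mapping_fn": lambda agent_id: "shared_ppo_policy",
    # must reference an id defined in "policies" above, otherwise nothing is trained
    "policies_to_train": ["shared_ppo_policy"],
}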
Could it be that due to the multi-agent dynamics, your policy is chasing its tail? How many policies do you have? Are they competing/collaborating/neutral to each other?
Note that multi-agent training can be very unstable and seeing these fluctuations is quite normal as the different policies get updated and then have to face different "env"-dynamics b/c of that (env=env+all other policies, which appear as part of the env as well).
I am working on a pipeline with 6 stages: F D I X0 X1 W. I am asked how many instructions need to be killed when a branch mispredict happens.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch. In the pipeline diagram, it looks like it would require killing 4 instructions that are in the process of flowing through the pipeline. Is that correct?
I am also asked how many need to be killed if the pipeline is a three-wide superscalar. This one I am not sure on. I think that it would be 12 because you can fetch 3 instructions at a time. Is that correct?
kill all the instructions that came after the branch
Not if this is a real MIPS. MIPS has one branch-delay slot: The instruction after a branch always executes whether the branch is taken or not. (jal's return address is the end of the delay slot so it doesn't execute twice.)
This was enough to fully hide the 1 cycle of branch latency on classic MIPS I (R2000), which used a scalar classic RISC 5-stage pipeline. It managed that 1 cycle branch latency by forwarding from the first half of an EX clock cycle to an IF starting in the 2nd half of a clock cycle. This is why MIPS branch conditions are all "simple" (don't need carry propagation through the whole word), like beq between two registers but only one-operand bgez / bltz against an implicit 0 for signed 2's complement comparisons. That only has to check the sign bit.
If your pipeline was well-designed, you'd expect it to resolve branches after X0 because the MIPS ISA is already limited to make low-latency branch decision easy for the ALU. But apparently your pipeline is not optimized and branch decisions aren't ready until the end of X1, defeating the purpose of making it run MIPS code instead of RISC-V or whatever other RISC instruction set.
I have come up with 4. I think this because the branch resolution happens in X1 and we will need to kill all the instructions that came after the branch.
I think 4 killed instructions looks right for a generic scalar pipeline without a branch delay slot.
At the end of that X1 cycle, there's an instruction in each of the previous 4 pipeline stages, waiting to move to the next stage on that clock edge. (Assuming no other pipeline bubbles). The delay-slot instruction is one of those and doesn't need to be killed.
(Unless there was an I-cache miss fetching the delay slot instruction, in which case the delay slot instruction might not even be in the pipeline yet. So it's not as simple as killing the 3 stages before X0, or even killing all but the oldest previous instruction in the pipeline. Delay slots are not free to implement, also complicating exception handling.)
So 0..3 instructions need to be killed in pipeline stages from F to I. (If it's possible for the delay-slot instruction to be in one of those stages, you have to detect that special case. If it isn't, e.g. I-cache miss latency long enough that it's either in X0 or still waiting to be fetched, then the pipeline can just kill those first 3 stages and do something based on X0 being a bubble or not.)
I think that it would be 12 because you can fetch 3 instructions at a time
No. Remember the branch itself is one of a group of 3 instructions that can go through the pipeline. In the predict-not-taken case, presumably the decode stage would have sent all 3 instructions in that fetch/decode group down the pipe.
The worst case is I think when the branch is the first (oldest in program order) instruction in a group. Then 1 (or 2 with no branch delay slot) instructions from that group in X1 have to be killed, as well as all instructions in previous stages. Then (assuming no bubbles) you're cancelling 13 (or 14) instructions, 3 in each previous stage.
The best case is when the branch is last (youngest in program order) in a group of 3. Then you're discarding 11 (or 12 with no delay slot).
So for a 3-wide version of this pipeline with no delay slot, depending on bubbles in previous pipeline stages, you're killing 0..14 instructions that are in the pipeline already.
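To make the counting explicit (assuming no bubbles, as above): each of the 4 stages before X1 (F, D, I, X0) can hold up to 3 instructions, plus whatever younger instructions share the branch's own group in X1.

Worst case (branch oldest in its group): $2 + 4 \times 3 = 14$ killed without a delay slot, or $1 + 4 \times 3 = 13$ with one.
Best case (branch youngest in its group): $0 + 4 \times 3 = 12$ without a delay slot, or $12 - 1 = 11$ with one (the delay-slot instruction in X0 survives).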
Implementing a delay slot sucks; there's a reason newer ISAs don't expose that pipeline detail. Long-term pain for short-term gain.
I see the following code segment in proposer_task (xcom_base.c):
if(threephase || ep->p->force_delivery){
    push_msg_3p(ep->site, ep->p, ep->prepare_msg, ep->msgno, normal);
}else{
    push_msg_2p(ep->site, ep->p);
}
Here threephase is int const threephase = 0 and force_delivery == 0.
push_msg_3p is normal Paxos, including the prepare, accept, and learn phases,
but push_msg_2p skips the prepare phase and directly sends the accept request.
I want to know why. Thanks a lot.
If you look at the paper Paxos Made Simple page 10 paragraph 3 says:
A newly chosen leader executes phase 1 for infinitely many instances
of the consensus algorithm [...]
Then paragraph 4:
Since failure of the leader and election of a new one should be rare
events, the effective cost of executing a state machine command—that
is, of achieving consensus on the command/value—is the cost of
executing only phase 2 of the consensus algorithm. It can be shown
that phase 2 of the Paxos consensus algorithm has the minimum possible
cost of any algorithm for reaching agreement in the presence of faults.
Hence, the Paxos algorithm is essentially optimal.
This is saying that a leader only issues a prepare during a leader failover. After that it streams accept messages. It then has "optimal messaging" in that the leader only needs one round trip to know a value is chosen (the accept message and its acknowledgment).
In a three node cluster, a leader self-accepts instantaneously, then only needs one accept acknowledgment from a second node to have a majority. It then knows the value is chosen without having to await the response from the 3rd node (which could be down). That is as efficient as you can get. The value is known to be accepted at a second node with strong consistency.
Given that this is how Paxos achieves maximum efficiency, we should expect MySQL xcom to have a mode that skips the prepare message phase in steady state.
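As a rough sketch of that steady-state shortcut (pseudocode-style Python; the leader object and the helpers broadcast_prepare and broadcast_accept are hypothetical, not the actual xcom code), the leader only pays for phase 1 when it has just taken over; afterwards every value goes straight to phase 2, which is what push_msg_2p corresponds to:

def propose(leader, value):
    # phase 1 (prepare/promise) is only needed right after a leader change;
    # it covers infinitely many instances at once, so it is not repeated per value
    if not leader.established:
        promises = broadcast_prepare(leader.next_ballot())
        if len(promises) < leader.majority:
            return False          # someone else holds a higher ballot; step down
        leader.established = True

    # steady state: phase 2 only -- one accept broadcast plus a majority of acks
    # is enough to know the value is chosen (cf. push_msg_2p skipping prepare)
    acks = broadcast_accept(leader.ballot, value)
    return len(acks) >= leader.majority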
You can read more about the Paxos Made Simple techniques on my blog here.
You might be interested to know about the latest developments in Paxos, where you don't need a majority response to accept messages in the cluster, using FPaxos and tricks like the even-nodes optimization.
A learner might be in the training stage, where it updates the Q-table for a bunch of epochs.
In this stage, the Q-table is updated using gamma (the discount rate) and the learning rate (alpha), and actions are chosen according to a random action rate.
After some epochs, when the reward is getting stable, let me call this "training is done". Do I then have to ignore these parameters (gamma, learning rate, etc.)?
I mean, in the training stage I got an action from the Q-table like this:
if rand_float < rar:
    action = rand.randint(0, num_actions - 1)
else:
    action = np.argmax(Q[s_prime_as_index])
But after the training stage, do I have to remove rar, which means getting an action from the Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only affect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
Although the answer provided by @Nick Walker is correct, here is some additional information.
What you are talking about is closely related to the concept technically known as the "exploration-exploitation trade-off". From the Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is epsilon-greedy exploration, which is what you are using in your code sample. So, in the end, once the agent has converged to the optimal policy, it must select only the actions that exploit the current knowledge, i.e., you can forget the rand_float < rar part. Ideally you should decrease the epsilon parameter (rar in your case) with the number of episodes (or steps).
On the other hand, regarding the learning rate, it is worth noting that theoretically this parameter should follow the Robbins-Monro conditions:
$\sum_{k} \alpha_k = \infty \quad \text{and} \quad \sum_{k} \alpha_k^2 < \infty$
This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply maintain fixed epsilon and alpha parameters until your algorithm converges and then set them to 0 (i.e., ignore them).
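As a concrete illustration of both points (a sketch only; the decay constants and episode count are arbitrary, not taken from the question), you can decay rar (epsilon) and alpha during training and then act purely greedily once you consider training done:

import numpy as np

num_episodes = 5000
rar = 1.0                 # epsilon for e-greedy exploration
rar_decay = 0.999
alpha0 = 0.5              # initial learning rate

for episode in range(num_episodes):
    alpha = alpha0 / (1 + episode / 100.0)   # decreasing step size, Robbins-Monro flavored
    # ... run one episode: choose actions e-greedily with rar,
    #     update Q with this alpha and your gamma ...
    rar = max(0.01, rar * rar_decay)         # keep a little exploration while training

# "training is done": no more updates, no more exploration
def greedy_action(Q, s):
    return np.argmax(Q[s])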
I need to figure out how to manage my retries in NServiceBus.
If there is any exception in my flow, it should retry 10 times, every 10 seconds. But when I search NServiceBus' website (http://docs.particular.net/nservicebus/errors/automatic-retries), there are 2 different retry mechanisms: First Level Retry (FLR) and Second Level Retry (SLR).
FLR is for transient errors. When you get an exception, it will retry instantly according to your MaxRetries parameter. This parameter should be 1 for me.
SLR is for errors that persist after FLR, where a small delay is needed between retries. There is a config parameter called "TimeIncrease" that defines the delay between retries. However, NServiceBus increases this delay with each retry: when you set the parameter to 10 seconds, it will retry after 10 seconds, 30 seconds, 60 seconds, and so on.
What do you suggest to me to provide my first request to try every 10 seconds with or without these mechanisms?
I found my answer:
Per the reply from Particular Software's community (John Simon), you need to apply a custom retry policy; have a look at http://docs.particular.net/nservicebus/errors/automatic-retries#second-level-retries-custom-retry-policy-simple-policy for an example.