Early stopping in Caffe - caffe

Seems this related PR is dead now, is there any workaround to use early stopping in Caffe? maybe using python on top of Caffe?

A first part is easy to do manually: let monitor your validation error then stop when this one do not change a lot (below a threshold). Then let consider the state with the lowest validation error as the "optimal" network.
The real problem is then to benefit from the full train+val dataset from there. There are two basic strategies:
Retrain your network with train+val for the same number of epoch OR for the same number of data (i.e let compute the number of minibatches that were used to reach the "optimal" state and set the number of passes such that the same number of minibatches (with same size...) go through the network
Let keep the "optimal" network then add the validation data and continue the training. If you reach the same error rate as before, let stop. Else, let just fix an a priori number of epochs.

You could apply this patch for early stopping to standard Caffe RC 1.0.0. It adds an optional early_stop_param into the solver. You can specify the test network ID, the length of tries to check for improvement in the test loss and a "skip" so that not every iteration is tried. Disclosure: I am one of the developers.

Related

Defining observation space and reward for traffic signal phase optimization for reinforcement learning

I am trying to use Reinforcement Learning for traffic signal phase optimization for improving traffic flow at intersections.
I am aware that in practice we won't be able to get the information about all the vehicles in each of the lanes.
If we use a camera for getting information about the queue length then we can get accurate data only upto, say 200 meters.
Should I take this into consideration while defining my observation space or can I directly use the data from sumo?
Furthermore, what should be the ideal observation space for such a task?
sumo_rl allows to use various metrics for reward calucation such as pressure metric, queue length metric, etc. What will be a good choice of rewards for my use case or what factors should I consider while defining my reward?
I have tried getting metrics from the e2 detector's output file such as throughput, lane delay and queue length. For the agent however, I might not be able to use them (as traci/sumo wrappers offer better implementations?) So how do I use traci for getting this modified information?
Yes, you should try to match your observation space as close to the real world as possible. SUMO can also filter the data directly (for instance with an E3 detector).
If you want to maximize flow than the reward should also include the flow metric (throughput). It's quite easy to get it via traci (as you already noticed) but I cannot tell how it integrates with your framework since you did not give details about it.

Why is a target network required?

I have a concern in understanding why a target network is necessary in DQN? I’m reading paper on “human-level control through deep reinforcement learning”
I understand Q-learning. Q-learning is value-based reinforcement learning algorithm that learns “optimal” probability distribution between state-action that will maximize it’s long term discounted reward over a sequence of timesteps.
The Q-learning is updated using the bellman equation, and a single step of the q-learning update is given by
Q(S, A) = Q(S, A) + $\alpha$[R_(t+1) + $\gamma$ (Q(s’,a;’) - Q(s,a)]
Where alpha and gamma are learning and discount factors.
I can understand that the reinforcement learning algorithm will become unstable and diverge.
The experience replay buffer is used so that we do not forget past experiences and to de-correlate datasets provided to learn the probability distribution.
This is where I fail.
Let me break the paragraph from the paper down here for discussion
The fact that small updates to $Q$ may significantly change the policy and therefore change the data distribution — understood this part. Changes to Q-network periodically may lead to unstability and changes in distribution. For example, if we always take a left turn or something like this.
and the correlations between the action-values (Q) and the target values r + $gamma$ (argmax(Q(s’,a’)) — This says that the reward + gamma * my prediction of the return given that I take what I think is the best action in the current state and follow my policy from then on.
We used an iterative update that adjusts the action-values (Q) towards target values that are only periodically updated, thereby reducing correlations with the target.
So, in summary a target network required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
But I do not understand how it is going to solve it?
So, in summary a target network required because the network keeps changing at each timestep and the “target values” are being updated at each timestep?
The difference between Q-learning and DQN is that you have replaced an exact value function with a function approximator. With Q-learning you are updating exactly one state/action value at each timestep, whereas with DQN you are updating many, which you understand. The problem this causes is that you can affect the action values for the very next state you will be in instead of guaranteeing them to be stable as they are in Q-learning.
This happens basically all the time with DQN when using a standard deep network (bunch of layers of the same size fully connected). The effect you typically see with this is referred to as "catastrophic forgetting" and it can be quite spectacular. If you are doing something like moon lander with this sort of network (the simple one, not the pixel one) and track the rolling average score over the last 100 games or so, you will likely see a nice curve up in score, then all of a sudden it completely craps out starts making awful decisions again even as your alpha gets small. This cycle will continue endlessly regardless of how long you let it run.
Using a stable target network as your error measure is one way of combating this effect. Conceptually it's like saying, "I have an idea of how to play this well, I'm going to try it out for a bit until I find something better" as opposed to saying "I'm going to retrain myself how to play this entire game after every move". By giving your network more time to consider many actions that have taken place recently instead of updating all the time, it hopefully finds a more robust model before you start using it to make actions.
On a side note, DQN is essentially obsolete at this point, but the themes from that paper were the fuse leading up to the RL explosion of the last few years.

What does the EpisodeParameterMemory of keras-rl do?

I have found the keras-rl/examples/cem_cartpole.py example and I would like to understand, but I don't find documentation.
What does the line
memory = EpisodeParameterMemory(limit=1000, window_length=1)
do? What is the limit and what is the window_length? Which effect does increasing either / both parameters have?
EpisodeParameterMemory is a special class that is used for CEM. In essence it stores the parameters of a policy network that were used for an entire episode (hence the name).
Regarding your questions: The limit parameter simply specifies how many entries the memory can hold. After exceeding this limit, older entries will be replaced by newer ones.
The second parameter is not used in this specific type of memory (CEM is somewhat of an edge case in Keras-RL and mostly there as a simple baseline). Typically, however, the window_length parameter controls how many observations are concatenated to form a "state". This may be necessary if the environment is not fully observable (think of it as transforming a POMDP into an MDP, or at least approximately). DQN on Atari uses this since a single frame is clearly not enough to infer the velocity of a ball with a FF network, for example.
Generally, I recommend reading the relevant paper (again, CEM is somewhat of an exception). It should then become relatively clear what each parameter means. I agree that Keras-RL desperately needs documentation but I don't have time to work on it right now, unfortunately. Contributions to improve the situation are of course always welcome ;).
A little late to the party, but I feel like the answer doesn't really answer the question.
I found this description online (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html#replay-memory):
We’ll be using experience replay
memory for training our DQN. It stores the transitions that the agent
observes, allowing us to reuse this data later. By sampling from it
randomly, the transitions that build up a batch are decorrelated. It
has been shown that this greatly stabilizes and improves the DQN
training procedure.
Basically you observe and save all of your state transitions so that you can train your network on them later on (instead of having to make observations from the environment all the time).

MXNet distributed training accuracy

I am using MXNet to finetune Resnet model on Caltech 256 dataset from the following example:
https://mxnet.incubator.apache.org/how_to/finetune.html
I am primarily doing it for a POC to test distributed training (which I'll later use in my actual project).
First I ran this example on a single machine with 2 GPUs for 8 epochs. I took around 20 minutes and the final validation accuracy was 0.809072.
Then I ran it on 2 machines (identical, each with 2 GPUs) with distributed setting and partitioned the training data in half for these two machines (using num_parts and part_index).
8 epochs took only 10 minutes, but the final validation accuracy was only 0.772847 (highest of the two). Even when I used 16 epochs, I was only able to achieve 0.797006.
So my question is that is it normal? I primarily want to use distributed training to reduce training time. But if it takes twice or more epochs to achieve the same accuracy, then what's the advantage? Maybe I am missing something.
I can post my code and run command if required.
Thanks
EDIT
Some more info to help with the answer:
MXNet version: 0.11.0
Topology: 2 workers (each on a separate machine)
Code: https://gist.github.com/reactivefuture/2a1f9dcd3b27c0fe8215b4e3d25056ce
Command to start:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
I have used a hacky way to do partitioning (using IP addresses) since I couldn't get kv.num_workers and kv.rank to work.
So my question is that is it normal? I primarily want to use distributed training to reduce training time. But if it takes twice or more epochs to achieve the same accuracy, then what's the advantage?
No it is not normally, distributed training, indeed, should be used to speed up the training process, not to slow it down. However there are many ways to do it in a wrong way.
Based on the provided data it feels like workers are still running in the single training('device') mode, or maybe kv_store is created incorrectly. Therefore each worker just trains model himself. In such case you should see validation result after 16 epoch been close to the single machine with 8 epoch (simply because in cluster you are splitting the data). In your case it is 0.797006 vs 0.809072. Depends on how many experiments you have executed this numbers might be treated as equal. I would focus my investigation on the way how cluster bootstrapped.
If you need to dive deeper on the topic how to create kv_store(or what is this) and use it with the distributed training please see this article.
In general in order to give a better answer, in the future pleace provide at least the following information:
what is the version of MXNet?
what is the topology of the cluster, with the following information:
how many logical workers are used;
how many servers are used (are they on the same machines with workers)?
how do you start the training (ideally with the code)
if it is not possible to provide code, at least specify type of kv_store
how do you partitioning data between worker
EDIT
Even though call that starts training looks correct:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
There is, at least one problem in the training.py itself. If you look here, it actually does not respect type of kv-store from the input argument and just uses 'device'. Therefore all workers are trining training the model separatly(and not in a cluster). I believe fixing this one line should help.
I would again advice to read the article in order to familiarize yourself in the topic how MXNet cluster is working. Such problems can be easily spotted by analyzing debug logs and observing that there are no kv-store created and therefore cluster is not training anything (only stand-alone machines are doing something).

deep learning: How do I know my net is not memorizing

I have a convolutional neural network and my input data are 10.000 images of the same object from different views (angles in 3D around the image). My network converges, but I am not sure if the network has memorized all the different angles / views or not. Since I only have one object I cannot really check test it with different data.
My training / test plot looks like this (red trainig, green test):
Since the test is lower than training I expect the network to learn all the images by heart? Even though I have 10.000 kind of different images.
First, "memorize" is not a term we apply to the learning process, since it's not exact regurgitation of prior examples.
This is a matter of your experimental process. You get to define the success criteria. Is 95% accuracy good enough for your intended application? What, to you, is good enough performance to declare success?
One way to build a more convincing argument is to make the typical third partition: besides training and test sets, save part of your data for validation. You do the training and testing as you've already done. When the model has converged, you apply it to the validation set to predict results. If that test passes your success criterion, then you have a finished model.