Hello StackOverflow Community!
I have a question about Actor-Critic Models in Reinforcement Learning.
While listening policy gradient methods classes of Berkeley University, it is said in the lecture that in the actor-critic algorithms where we both optimize our policy with some policy parameters and our value functions with some value function parameters, we use same parameters in both optimization problems(i.e. policy parameters = value function parameters) in some algorithms (e.g. A2C/A3C)
I could not understand how this works. I was thinking that we should optimize them separately. How does this shared parameter solution helps us?
Thanks in advance :)
You can do it by sharing some (or all) layers of their network. If you do, however, you are assuming that there is a common state representation (the intermediate layer output) that is optimal w.r.t. both. This is a very strong assumption and it usually doesn't hold. It has been shown to work for learning from image, where you put (for instance) an autoencoder on the top both the actor and the critic network and train it using the sum of their loss function.
This is mentioned in PPO paper (just before Eq. (9)). However, they just say that they share layers only for learning Atari games, not for continuous control problems. They don't say why, but this can be explained as I said above: Atari games have a low-dimensional state representation that is optimal for both the actor and the critic (e.g., the encoded image learned by an autoencoder), while for continuous control you usually pass directly a low-dimensional state (coordinates, velocities, ...).
A3C, which you mentioned, was also used mostly for games (Doom, I think).
From my experience, in control sharing layers never worked if the state is already compact.
Related
Is there a value-based (Deep) Reinforcement Learning RL algorithm available that is centred fully around learning only the state-value function V(s), rather than to the state-action-value function Q(s,a)?
If not, why not, or, could it easily be implemented?
Any implementations even available in Python, say Pytorch, Tensorflow or even more high-level in RLlib or so?
I ask because
I have a multi-agent problem to simulate where in reality some efficient centralized decision-making that (i) successfully incentivizes truth-telling on behalf of the decentralized agents, and (ii) essentially depends on the value functions of the various actors i (on Vi(si,t+1) for the different achievable post-period states si,t+1 for all actors i), defines the agents' actions. From an individual agents' point of view, the multi-agent nature with gradual learning means the system looks non-stationary as long as training is not finished, and because of the nature of the problem, I'm rather convinced that learning any natural Q(s,a) function for my problem is significantly less efficient than learning simply the terminal value function V(s) from which the centralized mechanism can readily derive the eventual actions for all agents by solving a separate sub-problem based on all agents' values.
The math of the typical DQN with temporal difference learning seems to naturally be adaptable a state-only value based training of a deep network for V(s) instead of the combined Q(s,a). Yet, within the value-based RL subdomain, everybody seems to focus on learning Q(s,a) and I have not found any purely V(s)-learning algos so far (other than analytical & non-deep, traditional Bellman-Equation dynamic programming methods).
I am aware of Dueling DQN (DDQN) but it does not seem to be exactly what I am searching for. 'At least' DDQN has a separate learner for V(s), but overall it still targets to readily learn the Q(s,a) in a decentralized way, which seems not conducive in my case.
This is my first post here, and I came here to discuss or get clarifications on something that I have trouble understanding, namely model-free vs model-based RL methods. I am currently implementing Q-learning, but am not certain I am doing it correctly.
Example: Say I am applying Q-learning to an inverted pendulum, where the reward is given as the absolute distance between the pendulum upward position, and terminal state (or goal state) is defined to be when the pendulum is very close to upward position.
Would this setup mean that I have a model-free or model-based setup? From how I have understood, this would be model-based as I have a model of the environment that is giving me the reward (R=abs(pos-wantedPos)). But then I saw an implementation of this using Q-learning (https://medium.com/#tuzzer/cart-pole-balancing-with-q-learning-b54c6068d947), which is a model-free algorithm. Now I am clueless...
Thankful for all responses.
Vanilla Q-learning is model-free.
The idea behind reinforcement learning is that an agent is trained to learn an optimal policy based on pairs of states and rewards--this is in contrast to trying to model the environment.
If you took a model-based approach, you would be trying to model the environment and ultimately perform value iteration or policy iteration of the Markov decision process.
In reinforcement learning, it is assumed you do not have the MDP, and thus must try to find an optimal policy based on the various rewards you receive from your experiences.
For a longer explanation, check out this post.
I am trying to train a learning model to recognize one specific scene. For example, say I would like to train it to recognize pictures taken at an amusement park and I already have 10 thousand pictures taken at an amusement park. I would like to train this model with those pictures so that it would be able to give a score for other pictures of the probability that they were taken at an amusement park. How do I do that?
Considering this is an image recognition problem, I would probably use a convolutional neural network, but I am not quite sure how to train it in this case.
Thanks!
There are several possible ways. The most trivial one is to collect a large number of negative examples (images from other places) and train a two-class model.
The second approach would be to train a network to extract meaningful low-dimensional representations from an input image (embeddings). Here you can use siamese training to explicitly train the network to learn similarities between images. Such an approach is employed for face recognition, for instance (see FaceNet). Having such embeddings, you can use some well-established methods for outlier detections, for instance, 1-class SVM, or any other classifier. In this case you also need negative examples.
I would heavily augment your data using image cropping - it is the most obvious way to increase the amount of training data in your case.
In general, your success in this task strongly depends on the task statement (are restricted to parks only, or any kind of place) and the proper data.
I am fairly new to Deep Learning and get quite overwhelmed by the many different Nets and their field of application. Thus, I want to know if there is some kind of overview which kind of different networks exist, what there key-features are and what kind of purpose they have.
For example I know abut LeNet, ConvNet, AlexNet - and somehow they are the same but still differ?
There are basically two types of neural networks, supervised and unsupervised learning. Both need a training set to "learn". Imagine training set as a massive book where you can learn specific information. In supervised learning, the book is supplied with answer key but without the solution manual, in contrast, unsupervised learning comes without answer key or solution manual. But the goal is the same, which is that to find patterns between the questions and answers (supervised learning) and questions (unsupervised learning).
Now we have differentiate between those two, we can go into the models. Let's discuss about supervised learning, which basically has 3 main models:
artificial neural network (ANN)
convolutional neural network (CNN)
recurrent neural network (RNN)
ANN is the simplest of all three. I believe that you have understand it, so we can move forward to CNN.
Basically in CNN all you have to do is to convolve our input with feature detectors. Feature detectors are matrices which have the dimension of (row,column,depth(number of feature detectors). The goal of convolving our input is to extract informations related to spatial data. Let's say you want to distinguish between cats and dogs. Cats have whiskers but dogs does not. Cats also have different eyes than dogs and so on. But the downside is, the more convolution layers will result in slower computation time. To mitigate that, we do some kind of processing called pooling or downsampling. Basically, this reduce the size of feature detectors while minimizing lost features or information. Then the next step would be flattening or squashing all those 3d matrix into (n,1) dimension so you can input it into ANN. Then the next step is self explanatory, which is normal ANN. Because CNN is inherently able to detect certain features, it mostly(maybe always) used for classification, for example image classification, time series classification, or maybe even video classification. For a crash course in CNN, check out this video by Siraj Raval. He's my favourite youtuber of all time!
Arguably the most sophisticated of all three, RNN is bestly described as neural networks that have "memory" by introducing "loops" within them which allow information to persist. Why is this important? As you are reading this, your brain use previous memory to comprehend all of this information. You don't seem to rethink everything from scratch again and this is what traditional neural networks do, which is to forget everything and re-learn again. But native RNN aren't effective so when people talk about RNN they mostly refer to LSTM which stands for Long Short-Term Memory. If that seems confusing to you, Cristopher Olah will give you in depth explanation in a very simple way. I advice you to check out his link for complete understanding about how RNN, especially LSTM variant
As for unsupervised learning, I'm so sorry that I haven't got the time to learn them, so this is the best I can do. Good luck and have fun!
They are the same type of Networks. Convolutional Neural Networks. The problem with the overview is that as soon as you post something it is already outdated. Most of the networks you describe are already old, even though they are only a few years old.
Nevertheless you can take a look at the networks supplied by caffe (https://github.com/BVLC/caffe/tree/master/models).
In my personal view the most important concepts in deep Learning are recurrent networks (https://keras.io/layers/recurrent/), residual connections, inception blocks (see https://arxiv.org/abs/1602.07261). The rest are largely theoretical concepts, which would not fit in a stack overflow answer.
I do know that feedforward multi-layer neural networks with backprop are used with Reinforcement Learning as to help it generalize the actions our agent does. This is, if we have a big state space, we can do some actions, and they will help generalize over the whole state space.
What do recurrent neural networks do, instead? To what tasks are they used for, in general?
Recurrent Neural Networks, RNN for short (although beware that RNN is often used in the literature to designate Random Neural Networks, which effectively are a special case of Recurrent NN), come in very different "flavors" which causes them to exhibit various behaviors and characteristics. In general, however these many shades of behaviors and characteristics are rooted in the availability of [feedback] input to individual neurons. Such feedback comes from other parts of the network, be it local or distant, from the same layer (including in some cases "self"), or even on different layers (*). Feedback information it treated as "normal" input the neuron and can then influence, at least in part, its output.
Unlike back propagation which is used during the learning phase of a Feed-forward Network for the purpose of fine-tuning the relative weights of the various [Feedfoward-only] connections, FeedBack in RNNs constitute true a input to the neurons they connect to.
One of the uses of feedback is to make the network more resilient to noise and other imperfections in the input (i.e. input to the network as a whole). The reason for this is that in addition to inputs "directly" pertaining to the network input (the types of input that would have been present in a Feedforward Network), neurons have the information about what other neurons are "thinking". This extra info then leads to Hebbian learning, i.e. the idea that neurons that [usually] fire together should "encourage" each other to fire. In practical terms this extra input from "like-firing" neighbor neurons (or no-so neighbors) may prompt a neuron to fire even though its non-feedback inputs may have been such that it would have not fired (or fired less strongly, depending on type of network).
An example of this resilience to input imperfections is with associative memory, a common employ of RNNs. The idea is to use the feeback info to "fill-in the blanks".
Another related but distinct use of feedback is with inhibitory signals, whereby a given neuron may learn that while all its other inputs would prompt it to fire, a particular feedback input from some other part of the network typically indicative that somehow the other inputs are not to be trusted (in this particular context).
Another extremely important use of feedback, is that in some architectures it can introduce a temporal element to the system. A particular [feedback] input may not so much instruct the neuron of what it "thinks" [now], but instead "remind" the neuron that say, two cycles ago (whatever cycles may represent), the network's state (or one of its a sub-states) was "X". Such ability to "remember" the [typically] recent past is another factor of resilience to noise in the input, but its main interest may be in introducing "prediction" into the learning process. These time-delayed input may be seen as predictions from other parts of the network: "I've heard footsteps in the hallway, expect to hear the door bell [or keys shuffling]".
(*) BTW such a broad freedom in the "rules" that dictate the allowed connections, whether feedback or feed-forward, explains why there are so many different RNN architectures and variations thereof). Another reason for these many different architectures is that one of the characteristics of RNN is that they are not readily as tractable, mathematically or otherwise, compared with the feed-forward model. As a result, driven by mathematical insight or plain trial-and-error approach, many different possibilities are being tried.
This is not to say that feedback network are total black boxes, in fact some of the RNNs such as the Hopfield Networks are rather well understood. It's just that the math is typically more complicated (at least to me ;-) )
I think the above, generally (too generally!), addresses devoured elysium's (the OP) questions of "what do RNN do instead", and the "general tasks they are used for". To many complement this information, here's an incomplete and informal survey of applications of RNNs. The difficulties in gathering such a list are multiple:
the overlap of applications between Feed-forward Networks and RNNs (as a result this hides the specificity of RNNs)
the often highly specialized nature of applications (we either stay in with too borad concepts such as "classification" or we dive into "Prediction of Carbon shifts in series of saturated benzenes" ;-) )
the hype often associated with neural networks, when described in vulgarization texts
Anyway, here's the list
modeling, in particular the learning of [oft' non-linear] dynamic systems
Classification (now, FF Net are also used for that...)
Combinatorial optimization
Also there are a lots of applications associated with the temporal dimension of the RNNs (another area where FF networks would typically not be found)
Motion detection
load forecasting (as with utilities or services: predicting the load in the short term)
signal processing : filtering and control
There is an assumption in the basic Reinforcement Learning framework that your state/action/reward sequence is a Markov Decision Process. That basically means that you do not need to remember any information about previous states from this episode to make decisions.
But this is obviously not true for all problems. Sometimes you do need to remember some recent things to make informed decisions. Sometimes you can explicitly build the things that need to be remembered into the state signal, but in general we'd like our system to learn what it needs to remember. This is called a Partially Observable Markov Decision Process (POMDP), and there are a variety of methods used to deal with it. One possibly solution is to use a recurrent neural network, since they incorporate details from previous time steps into the current decision.