A recommendation system is running in real time. It can suggest three products, say x, y, and z. But on today's data it only ever recommends product x. What needs to be tweaked? Is there a solution in terms of reinforcement learning?
Most materials (e.g., David Silver's online course) I can find offer discussions about the relationship between supervised learning and reinforcement learning. However, it is actually a comparison between supervised learning and online reinforcement learning where the agent runs in the environment (or simulates interactions) to get feedback given limited knowledge about the underlying dynamics.
I am more curious about offline (batch) reinforcement learning, where the dataset (collected learning experiences) is given a priori. What are the differences compared to supervised learning in that case? And what similarities do they share?
In the online setting, the fundamental difference between supervised learning and reinforcement learning is the need for exploration and the trade-off between exploration and exploitation in RL. However, in the offline setting there are also several differences that make RL a more difficult and richer problem than supervised learning. A few differences I can think of off the top of my head:
In reinforcement learning the agent receives what is termed "evaluative feedback" in the form of a scalar reward, which gives the agent some indication of the quality of the action that was taken, but it does not tell the agent whether this action is the optimal action or not. Contrast this with supervised learning, where the agent receives what is termed "instructive feedback": for each prediction that the learner makes, it receives feedback (a label) that says what the optimal action/prediction was. The differences between instructive and evaluative feedback are detailed in the first chapters of Rich Sutton's book. Essentially, reinforcement learning is optimization with sparse labels: for some actions you may not get any feedback at all, and in other cases the feedback may be delayed, which creates the credit-assignment problem.
In reinforcement learning you have a temporal aspect where the goal is to find an optimal policy that maps states to actions over some horizon (number of time-steps). If the horizon T=1, then it is just a one-off prediction problem like in supervised learning, but if T>1 then it is a sequential optimization problem where you have to find the optimal action not just in a single state but in multiple states and this is further complicated by the fact that the actions taken in one state can influence which actions should be taken in future states (i.e. it is dynamic).
In supervised learning there is a fixed distribution from which the data points are drawn i.i.d. (this is the common assumption at least). In RL there is no fixed distribution; rather, the distribution depends on the policy that is followed, and the samples are often not i.i.d. but correlated.
Hence, RL is a much richer problem than supervised learning. In fact, it is possible to convert any supervised learning task into a reinforcement learning task: the loss function of the supervised task can be used to define a reward function, with smaller losses mapping to larger rewards. It is not clear why one would want to do this, though, because it converts the supervised problem into a more difficult reinforcement learning problem. Reinforcement learning makes fewer assumptions than supervised learning and is therefore in general a harder problem to solve. However, the opposite does not hold: it is in general not possible to convert a reinforcement learning problem into a supervised learning problem.
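To make the conversion concrete, here is a minimal sketch, assuming a squared-error loss and a toy dataset (the data, the two policies, and all function names are made up for illustration). Each example is treated as a one-step episode (T=1) whose reward is the negative loss:

```python
def squared_loss(prediction, label):
    """Ordinary supervised loss for a single example."""
    return (prediction - label) ** 2


def reward_from_loss(prediction, label):
    """Reward for a one-step episode: smaller loss maps to larger reward."""
    return -squared_loss(prediction, label)


# Hypothetical (feature, label) pairs; the feature plays the role of the state.
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def policy_good(x):
    # The "action" is the prediction; this policy matches the labels exactly.
    return 2.0 * x


def policy_bad(x):
    return 0.5 * x


def average_reward(policy):
    """Average one-step reward of a policy over the dataset."""
    return sum(reward_from_loss(policy(x), y) for x, y in dataset) / len(dataset)
```

Note that the agent would only ever see the scalar reward, not the label itself, which is exactly why the converted problem is harder than the original one.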
I am new to deep learning (I just finished reading Deep Learning with PyTorch), and I was wondering what the best neural network architecture is for my case.
I have a large multiclass classification problem (user identification problem), about 1000 classes in which each class is a user. I have about 2000 features for each user after one-hot encoding and cleaning. Data are highly imbalanced, but I can always use oversampling/downsampling techniques.
I was wondering what is the best architecture to implement for my case. I've always seen deep learning applied to time series or images, so I'm not sure about what to use in this case. I was thinking about a multi-layer perceptron but maybe there are better solutions.
Thanks for your tips and help. Have a nice day!
You can try triplet learning instead of simple classification.
From your 1000 users, you can make c * 1000 * 999 / 2 pairs, where c is the average number of samples per class/user.
https://arxiv.org/pdf/1412.6622.pdf
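For intuition, here is a minimal sketch of a triplet objective in plain Python. It uses the common hinge-style formulation with a margin, which is an assumption on my part and not necessarily the exact loss used in the linked paper; the function names and the margin value are illustrative only. In practice the three vectors would be embeddings produced by the same network for an anchor sample, a sample from the same user, and a sample from a different user.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss.

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

For example, with anchor [0, 0], positive [0, 1] and negative [3, 4], the distances are 1 and 5, so the triplet is already well separated and contributes no loss.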
In deep reinforcement learning, is there any way to decay the learning rate with respect to the cumulative reward? I mean, decay the learning rate once the agent is able to learn and maximize the reward?
It is common to modify learning rates with number of steps, so it would certainly be possible to modify learning rates as a function of cumulative reward.
One risk would be that you do not know what reward you are seeking at the beginning of training, so reducing the learning rate too early is a common problem. If you target a reward of 80, with the learning rate declining sharply as you attain that value, you will never know if your algorithm could have attained 90, as learning will stop at 80.
Another problem is setting the target too high. If you set the target for 100, meaning that the learning rate does not reduce as you reach 85, the instability may mean that the algorithm cannot converge well enough to reach 90.
So in general, I think people try a variety of learning schedules, and if possible sometimes let the algorithms run for plenty of time to see if they converge.
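As a rough illustration of the idea and its pitfalls, here is a minimal sketch of a reward-driven schedule. The function name, the linear decay, the floor `min_lr`, and the assumption of a positive `target_reward` are all choices I made for the example, not a standard recipe:

```python
def reward_based_lr(base_lr, avg_reward, target_reward, min_lr=1e-5):
    """Shrink the learning rate linearly as avg_reward approaches target_reward.

    Assumes target_reward > 0. Progress is clipped to [0, 1], so rewards
    above the target or below zero do not push the rate outside
    [min_lr, base_lr].
    """
    progress = max(0.0, min(1.0, avg_reward / target_reward))
    return max(min_lr, base_lr * (1.0 - progress))
```

With a target of 80, an average reward of 40 halves the learning rate and an average of 80 drops it to the floor, which is where the risk described above shows up: if the true optimum is 90, learning has effectively stopped before reaching it.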
If lots of iterations are needed in a simulated environment before a reinforcement learning (RL) algorithm works in the real world, why don't we use the same simulated environment to generate labeled data and then use supervised learning methods instead of RL?
The reason is that the two fields have a fundamental difference:
One tries to replicate previous results and the other tries to be better than previous results.
There are 4 fields in machine learning:
Supervised learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement learning
Let's talk about the two fields you asked about, and let's intuitively explore them with a real-life example: archery.
Supervised Learning
For supervised learning, we would observe a master archer in action for maybe a week and record how far they pulled the bow string back, the angle of the shot, etc. Then we go home and build a model. In the most ideal scenario, our model becomes as good as the master archer. It cannot get better, because the loss function in supervised learning is usually MSE or cross-entropy, so we simply try to replicate the feature-label mapping. After building the model, we deploy it. And let's just say we're extra fancy and make it learn online. So we continually take data from the master archer and continue to learn to be exactly the same as the master archer.
The biggest takeaway:
We're trying to replicate the master archer simply because we think he is the best. Therefore we can never beat him.
Reinforcement Learning
In reinforcement learning, we simply build a model and let it try many different things. And we give it a reward / penalty depending on how far the arrow was from the bullseye. We are not trying to replicate any behaviour, instead, we try to find our own optimal behaviour. Because of this, we are not given any bias towards what we think the optimal shooting strategy is.
Because RL does not have any prior knowledge, it may be difficult for RL to converge on difficult problems. Therefore, there is a method called apprenticeship learning / imitation learning, where we basically give the RL agent some trajectories of master archers just so it has a starting point and can begin to converge. But after that, RL will explore by sometimes taking random actions to try to find other optimal solutions. This is something that supervised learning cannot do. If you explore using supervised learning, you are basically saying that taking this action in this state is optimal, and then you try to make your model replicate it. But that is wrong in supervised learning; the exploratory action should instead be seen as an outlier in the data.
Key differences of Supervised learning vs RL:
Supervised Learning replicates what's already done
Reinforcement learning can explore the state space, and do random actions. This then allows RL to be potentially better than the current best.
Why we don’t use the same simulated environment to generate the labeled data and then use supervised learning methods instead of RL
We do this for Deep RL because it has an experience replay buffer. But this is not possible for supervised learning because the concept of reward is lacking.
Example: Walking in a maze.
Reinforcement Learning
Taking a right in square 3: Reward = 5
Taking a left in square 3: Reward = 0
Going up in square 3: Reward = -5
Supervised Learning
Taking a right in square 3
Taking a left in square 3
Going up in square 3
When you try to make a decision in square 3, RL will know to go right. Supervised learning will be confused, because the first example in your data says to take a right in square 3, the second says to take a left, and the third says to go up. So it will never converge.
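The maze example can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the experience list mirrors the three transitions above, and the helper simply averages the observed reward per action, which is far simpler than a real RL algorithm.

```python
from collections import defaultdict

# (state, action, reward) tuples mirroring the maze example.
experience = [
    ("square 3", "right", 5),
    ("square 3", "left", 0),
    ("square 3", "up", -5),
]


def greedy_action(experience, state):
    """Pick the action with the highest average observed reward in `state`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for s, action, reward in experience:
        if s == state:
            totals[action] += reward
            counts[action] += 1
    return max(totals, key=lambda a: totals[a] / counts[a])


# The same data with the rewards dropped: three conflicting labels for one input.
labels = sorted({action for s, action, _ in experience if s == "square 3"})
```

With the rewards, the data ranks the three actions; without them, the same state simply maps to three different labels and there is nothing to say which one is right.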
In short, supervised learning is passive learning, that is, all the data is collected before you start training your model.
However, reinforcement learning is active learning. In RL, usually, you don't have much data at first and you collect new data as you are training your model. Your RL algorithm and model decide what specific data samples you can collect while training.
Supervised Learning is about the generalization of the knowledge given by the supervisor (training data) to use in an uncharted area (test data). It is based on instructive feedback where the agent is provided with correct actions (labels) to take given a situation (features).
Reinforcement Learning is about learning through interaction by trial-and-error. There is no instructive feedback but only evaluative feedback that evaluates the action taken by an agent by informing how good the action taken was instead of saying the correct action to take.
In supervised learning we have target labelled data which is assumed to be correct.
In RL that's not the case: we have nothing but rewards. The agent needs to figure out for itself which action to take by interacting with the environment while observing the rewards it gets.
Reinforcement learning is an area of machine learning. It is about taking suitable actions to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behavior or path to take in a specific situation. Reinforcement learning differs from supervised learning in that in supervised learning the training data comes with an answer key, so the model is trained on the correct answers, whereas in reinforcement learning there is no answer key and the agent must decide what to do to perform the given task. In the absence of a training data set, it is bound to learn from its experience.
I am learning about the approach employed in Reinforcement Learning for robotics and I came across the concept of Evolutionary Strategies. But I couldn't understand how RL and ES are different. Can anyone please explain?
To my understanding, there are two main differences.
1) Reinforcement learning uses the concept of one agent, and the agent learns by interacting with the environment in different ways. In evolutionary algorithms, they usually start with many "agents" and only the "strong ones survive" (the agents with characteristics that yield the lowest loss).
2) A reinforcement learning agent learns from both positive and negative actions, but evolutionary algorithms only keep the optimal solutions; the information about negative or suboptimal solutions is discarded and lost.
Example
You want to build an algorithm to regulate the temperature in the room.
The room is 15 °C, and you want it to be 23 °C.
Using Reinforcement learning, the agent will try a bunch of different actions to increase and decrease the temperature. Eventually, it learns that increasing the temperature yields a good reward. But it also learns that reducing the temperature will yield a bad reward.
An evolutionary algorithm starts with a bunch of random agents, each with a preprogrammed set of actions it is going to take. Then the agents that have the "increase temperature" action survive and move on to the next generation. Eventually, only agents that increase the temperature survive and are deemed the best solution. However, the algorithm does not know what happens if you decrease the temperature.
TL;DR: RL is usually one agent, trying different actions, and learning and remembering all info (positive or negative). EA uses many agents that guess many actions, and only the agents that have the optimal actions survive. Basically a brute-force way to solve a problem.
I think the biggest difference between Evolutionary Strategies and Reinforcement Learning is that ES is a global optimization technique while RL is a local optimization technique. So RL can converge faster, but to a local optimum, while ES converges more slowly to a global optimum.
Evolution Strategies optimization happens on a population level. An evolution strategy algorithm in an iterative fashion (i) samples a batch of candidate solutions from the search space (ii) evaluates them and (iii) discards the ones with low fitness values. The sampling for a new iteration (or generation) happens around the mean of the best scoring candidate solutions from the previous iteration. Doing so enables evolution strategies to direct the search towards a promising location in the search space.
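The loop described above can be sketched as follows. This is a deliberately tiny, one-dimensional illustration under my own assumptions: the toy fitness function, the fixed sampling width `sigma`, and all parameter values are made up, and real evolution strategies usually adapt the sampling distribution as well.

```python
import random


def fitness(x):
    """Toy fitness with its maximum at x = 3 (an assumption for the demo)."""
    return -(x - 3.0) ** 2


def evolution_strategy(generations=50, population=20, elite=5, sigma=0.5, seed=0):
    """Simple evolution strategy on a single real-valued parameter."""
    rng = random.Random(seed)
    mean = 0.0
    for _ in range(generations):
        # (i) sample a batch of candidate solutions around the current mean
        candidates = [mean + sigma * rng.gauss(0.0, 1.0) for _ in range(population)]
        # (ii) evaluate them and (iii) discard the ones with low fitness
        candidates.sort(key=fitness, reverse=True)
        best = candidates[:elite]
        # the next generation is sampled around the mean of the survivors
        mean = sum(best) / len(best)
    return mean
```

Calling `evolution_strategy()` moves the mean from 0 towards the optimum at 3. Note that nothing here uses states, actions, or per-step rewards; each candidate is only scored as a whole.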
Reinforcement learning requires the problem to be formulated as a Markov Decision Process (MDP). An RL agent optimizes its behavior (or policy) by maximizing a cumulative reward signal received on transitions from one state to another. Since the problem is abstracted as an MDP, learning can happen on a step or episode level. Learning per step (or N steps) is done via temporal-difference (TD) learning, and learning per episode is done via Monte Carlo methods. So far I am talking about learning via action-value functions (learning the values of actions). Another way of learning is to optimize the parameters of a neural network representing the policy of the agent directly via gradient ascent. This approach was introduced in the REINFORCE algorithm, and the general approach is known as policy-based RL.
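To illustrate the step-level versus episode-level distinction, here is a minimal tabular sketch. For brevity it updates state values rather than action values, which is a simplification on my part, and the step size `alpha`, discount `gamma`, and function names are illustrative assumptions.

```python
def td0_update(value, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Per-step TD(0) update: bootstrap from the current estimate of next_state."""
    target = reward + gamma * value[next_state]
    value[state] += alpha * (target - value[state])


def monte_carlo_update(value, episode, alpha=0.1, gamma=0.9):
    """Per-episode (every-visit) Monte Carlo update.

    `episode` is a list of (state, reward) pairs, where reward is received
    on leaving that state. The return is accumulated backwards from the end
    of the episode, so no bootstrapping from value estimates is involved.
    """
    g = 0.0
    for state, reward in reversed(episode):
        g = reward + gamma * g
        value[state] += alpha * (g - value[state])
```

The TD update can be applied after every single transition, while the Monte Carlo update has to wait until the episode has finished and the full return is known.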
For a comprehensive comparison check out this paper https://arxiv.org/pdf/2110.01411.pdf