Stable-baselines3 vs. Tianshou - reinforcement-learning

What would you recommend between Stable-Baselines3 and Tianshou for applied research in Reinforcement Learning?
Can anyone provide a comparison of the strengths and weaknesses of each library?

Related

Supervised learning vs. offline (batch) reinforcement learning

Most materials (e.g., David Silver's online course) I can find offer discussions about the relationship between supervised learning and reinforcement learning. However, what they actually compare is supervised learning and online reinforcement learning, where the agent runs in the environment (or simulates interactions) to get feedback given limited knowledge about the underlying dynamics.
I am more curious about offline (batch) reinforcement learning, where the dataset (collected learning experiences) is given a priori. What are the differences compared to supervised learning then, and what similarities may they share?
In the online setting, the fundamental difference between supervised learning and reinforcement learning is the need for exploration and the exploration/exploitation trade-off in RL. However, even in the offline setting there are several differences that make RL a more difficult/rich problem than supervised learning. A few differences I can think of off the top of my head:
In reinforcement learning the agent receives what is termed "evaluative feedback" in the form of a scalar reward, which gives the agent some feedback about the quality of the action that was taken, but it does not tell the agent whether this action was the optimal one. Contrast this with supervised learning, where the agent receives what is termed "instructive feedback": for each prediction the learner makes, it receives feedback (a label) that says what the optimal action/prediction was. The differences between instructive and evaluative feedback are detailed in the first chapters of Rich Sutton's book. Essentially, reinforcement learning is optimization with sparse labels: for some actions you may not get any feedback at all, and in other cases the feedback may be delayed, which creates the credit-assignment problem.
In reinforcement learning you have a temporal aspect, where the goal is to find an optimal policy that maps states to actions over some horizon (number of time steps). If the horizon T=1, then it is just a one-off prediction problem as in supervised learning, but if T>1 then it is a sequential optimization problem where you have to find the optimal action not just in a single state but in multiple states, and this is further complicated by the fact that the actions taken in one state can influence which actions should be taken in future states (i.e. the problem is dynamic).
In supervised learning there is a fixed distribution from which the data points are drawn i.i.d. (at least this is the common assumption). In RL there is no such fixed distribution; rather, the data distribution depends on the policy that is followed, and the samples are often not i.i.d. but correlated.
Hence, RL is a much richer problem than supervised learning. In fact, it is possible to convert any supervised learning task into a reinforcement learning task: the loss function of the supervised task can be used to define a reward function, with smaller losses mapping to larger rewards. Although it is not clear why one would want to do this, because it converts the supervised problem into a more difficult reinforcement learning problem. Reinforcement learning makes fewer assumptions than supervised learning and is therefore in general a harder problem to solve. The opposite is not possible: in general, a reinforcement learning problem cannot be converted into a supervised learning problem.
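As a toy illustration of that conversion, here is a minimal sketch that wraps a labelled classification dataset as a one-step (T=1) RL task, with the reward derived from the supervised loss. The class and variable names are made up for the example and do not come from any particular library.

    import numpy as np

    class SupervisedAsRL:
        """Expose a labelled classification dataset as a one-step (T=1) RL task.

        Each episode presents one example as the state; the action is the
        predicted class; the reward is derived from the supervised loss
        (here +1 for a correct label, 0 otherwise). The agent only ever
        sees this evaluative scalar, never the correct label itself.
        """

        def __init__(self, features, labels, rng=None):
            self.features = np.asarray(features)
            self.labels = np.asarray(labels)
            self.rng = rng or np.random.default_rng()
            self._idx = None

        def reset(self):
            # Draw a random example; its feature vector is the state.
            self._idx = self.rng.integers(len(self.features))
            return self.features[self._idx]

        def step(self, action):
            # Evaluative feedback: a scalar reward, not the correct label.
            reward = 1.0 if action == self.labels[self._idx] else 0.0
            return None, reward, True, {}  # done=True since the horizon is T=1

    # Usage sketch: a random "agent" interacting with the wrapped dataset.
    env = SupervisedAsRL(features=np.random.randn(100, 4),
                         labels=np.random.randint(0, 3, size=100))
    state = env.reset()
    _, reward, done, _ = env.step(action=np.random.randint(0, 3))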

How TVM is different from MLIR?

As per my understanding, both TVM and MLIR are used as compiler infrastructure for deep neural networks. Is my understanding correct?
And which would be better if we are building a compiler for custom hardware that runs deep learning inference?
I found this discussion helpful:
https://discuss.tvm.apache.org/t/google-lasted-work-mlir-primer/1721

Deep Value-only Reinforcement Learning: Train V(s) instead of Q(s,a)?

Is there a value-based (deep) reinforcement learning algorithm available that is centred fully around learning only the state-value function V(s), rather than the state-action value function Q(s,a)?
If not, why not, or, could it easily be implemented?
Are any implementations available in Python, say in PyTorch or TensorFlow, or at a higher level such as RLlib?
I ask because
I have a multi-agent problem to simulate in which, in reality, the agents' actions are defined by an efficient centralized decision-making mechanism that (i) successfully incentivizes truth-telling on behalf of the decentralized agents, and (ii) essentially depends on the value functions of the various actors i (on Vi(si,t+1) for the different achievable post-period states si,t+1 of all actors i). From an individual agent's point of view, the multi-agent nature with gradual learning means the system looks non-stationary as long as training is not finished. Because of the nature of the problem, I'm rather convinced that learning any natural Q(s,a) function for my problem is significantly less efficient than simply learning the terminal value function V(s), from which the centralized mechanism can readily derive the eventual actions for all agents by solving a separate sub-problem based on all agents' values.
The math of the typical DQN with temporal-difference learning seems naturally adaptable to state-only, value-based training of a deep network for V(s) instead of the combined Q(s,a); a sketch of the kind of update I mean follows below. Yet, within the value-based RL subdomain, everybody seems to focus on learning Q(s,a), and I have not found any purely V(s)-learning algorithms so far (other than analytical, non-deep, traditional Bellman-equation dynamic programming methods).
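To make that concrete, here is a minimal sketch of a TD(0)-style update for a deep V(s) network, assuming PyTorch; the network size, state dimension, and transition tensors are placeholder assumptions for illustration, not taken from any existing algorithm or library.

    import torch
    import torch.nn as nn

    # Placeholder state-value network: maps an 8-dimensional state to a scalar V(s).
    value_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
    optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
    gamma = 0.99  # discount factor

    def td0_update(states, rewards, next_states, dones):
        """One TD(0) step on a batch of transitions (s, r, s', done).

        Target: r + gamma * V(s') for non-terminal s', else just r.
        No action appears anywhere, only the successor state's value.
        """
        v = value_net(states).squeeze(-1)
        with torch.no_grad():
            v_next = value_net(next_states).squeeze(-1)
            targets = rewards + gamma * (1.0 - dones) * v_next
        loss = nn.functional.mse_loss(v, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # Usage sketch with dummy transitions of state dimension 8.
    batch = 32
    loss = td0_update(torch.randn(batch, 8), torch.rand(batch),
                      torch.randn(batch, 8), torch.zeros(batch))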
I am aware of Dueling DQN, but it does not seem to be exactly what I am searching for. At least Dueling DQN has a separate stream estimating V(s), but overall it still aims to learn Q(s,a) in a decentralized way, which seems not conducive in my case.

Reinforcement Learning tools

What is the difference between Tensorforce, Kerasrl, and chainerrl used for Reinforcement Learning?
As far as I've found, all three work with OpenAI Gym environments and have implemented the same reinforcement learning algorithms. Is there a difference in performance?
They are different deep learning back-ends.
TensorFlow, Keras, and Chainer are different libraries used to build and run neural-network-based AI algorithms.
OpenAI Gym is a reinforcement learning library (it provides environments).
These are two different kinds of technology.
If you want reinforcement learning with a TensorFlow back-end and a TensorFlow-based RL library, check out
https://github.com/google/dopamine
It has no connection with OpenAI; it is purely a Google project.
Short answer: Keras is more "high level" than TensorFlow in the sense that you can write code quicker with Keras, but it's less flexible. Check out this post for instance.
TensorFlow, Keras, and Chainer are all frameworks, and these frameworks can be used to implement deep reinforcement learning models. As Jaggernaut said, Keras is more high-level (meaning: pretty easy to learn) and uses a TensorFlow backend to function.
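As an illustration of what "high level" means in practice, a small network (e.g. a Q-network for a 4-dimensional observation and 2 discrete actions) can be defined and compiled in a few lines with the standard tf.keras API; the layer sizes here are arbitrary assumptions for the sketch.

    import tensorflow as tf

    # A small fully connected network: 4-dimensional input, 2 outputs
    # (one per action), built with the high-level Keras Sequential API.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
    model.summary()  # prints the layer and parameter overview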

What are the similarities between A3C and PPO in reinforcement learning policy gradient methods?

Is there any easy way to merge the properties of PPO with an A3C method? A3C methods run a number of parallel actors and optimize the parameters. I am trying to merge PPO with A3C.
PPO has a built-in mechanism (the clipped surrogate objective function) to prevent large gradient updates, and it generally outperforms A3C on most continuous control environments.
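The clipped surrogate term itself is short; a minimal PyTorch-style sketch follows, assuming the advantages and the log-probabilities under the old and new policies have already been computed (the tensors below are dummy placeholders, not from any specific implementation).

    import torch

    def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
        """PPO clipped surrogate objective (negated so it can be minimized).

        ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio keeps each
        update close to the old policy, which is what prevents large
        gradient steps.
        """
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()

    # Usage sketch with dummy values for a batch of 4 transitions.
    loss = ppo_clip_loss(new_log_probs=torch.tensor([-0.9, -1.1, -0.5, -2.0]),
                         old_log_probs=torch.tensor([-1.0, -1.0, -0.7, -1.8]),
                         advantages=torch.tensor([0.5, -0.2, 1.0, 0.1]))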
In order for PPO to enjoy the benefits of parallel computing like A3C, Distributed PPO (DPPO) is the way to go.
Check out the links below to find out more information about DPPO.
Pseudo code from the original DeepMind paper
Original DeepMind paper: Emergence of Locomotion Behaviours in Rich Environments
If you plan to implement your DPPO code in Python with TensorFlow, I would suggest trying Ray for the distributed-execution part.