Gym Taxi-v2 is deprecated. My implementation of Q-learning still works with Taxi-v3, but for some reason env.render() shows the wrong taxi position at each step.
Anyway, apart from an added wall, what are the differences between Taxi-v2 and Taxi-v3?
There were small corrections to the description and the map; you can look at the pull request on GitHub for the details.
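If you want to inspect the new map yourself, a minimal sketch using the classic gym API (newer gym/gymnasium releases changed the render call, so adjust accordingly):
import gym

# Taxi-v2 may no longer be registered in recent gym releases,
# so only Taxi-v3 is created here.
env = gym.make("Taxi-v3")
env.reset()
env.render()  # prints the ASCII map, including the changed walls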
I am trying to make a PPO model using the stable-baselines3 library. I want to use a policy network with an LSTM layer in it. However, I can't find such an option in the library's documentation, although it exists in the previous version, stable-baselines: https://stable-baselines.readthedocs.io/en/master/modules/policies.html#stable_baselines.common.policies.MlpLstmPolicy.
Does this exist in stable-baselines3 (not stable-baselines)? If not, is there any other way I can do this? Thanks.
From the migration doc.
https://stable-baselines3.readthedocs.io/en/master/guide/migration.html
Breaking Changes
LSTM policies (MlpLstmPolicy, CnnLstmPolicy) are not supported for
the time being (see PR #53 for a recurrent PPO implementation)
Currently this functionality does not exist in stable-baselines3.
However, in their contributions repo (stable-baselines3-contrib) there is an experimental version of PPO with an LSTM policy. I have not tried it myself, but according to this pull request it works.
You can find it on the feat/ppo-lstm branch, which may get merged into master soon.
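If you install sb3-contrib from that branch (or a later release that includes it), a minimal sketch would look something like this (CartPole is just a placeholder environment):
from sb3_contrib import RecurrentPPO

# "MlpLstmPolicy" is the recurrent counterpart of the usual MlpPolicy.
model = RecurrentPPO("MlpLstmPolicy", "CartPole-v1", verbose=1)
model.learn(total_timesteps=10_000)
model.save("ppo_lstm_cartpole")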
My state for a custom Gym environment is not the same as my observation space. The observation is calculated from the state.
How will RL methods that require exploring starts, etc., work? Or am I getting this wrong?
I imagine the algorithm sampling from my observation space and then setting the state of the environment and checking an action. But this will not work with my environment.
As you can see from the question above, I'm a newbie with RL and with Gym. What RL method should I use in a case like this? How would you address such a situation?
Any tips?
My custom Gym environment is now selecting a random start state. Therefore, by using this environment, one can achieve "Exploring Starts". So I no longer need to worry that my observation is not the same as my state. For example, when implementing Monte Carlo ES for Blackjack, as described in RLbook2018, the state of my environment includes the dealer's hidden card, while an observation does not.
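A minimal sketch of what I mean (a hypothetical Blackjack-like environment using the old gym reset API; names are illustrative, not from any library):
import random
import gym
from gym import spaces

class BlackjackLikeEnv(gym.Env):
    """State contains the dealer's hidden card; the observation does not."""
    def __init__(self):
        self.action_space = spaces.Discrete(2)   # hit or stick
        self.observation_space = spaces.Tuple((
            spaces.Discrete(32),                  # player sum
            spaces.Discrete(11),                  # dealer's visible card
        ))

    def reset(self):
        # Exploring starts: pick a random full state, including the hidden card.
        self.state = {
            "player_sum": random.randint(12, 21),
            "dealer_visible": random.randint(1, 10),
            "dealer_hidden": random.randint(1, 10),  # never exposed to the agent
        }
        return self._get_obs()

    def _get_obs(self):
        # The observation is computed from the state and omits the hidden card.
        return (self.state["player_sum"], self.state["dealer_visible"])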
I was confused at the time because I wanted the algorithm itself to pick the random state and set it in the environment.
PS: If you need to save the states of previous "alternative realities", search SO or Google for wrappers and how they are used for MCTS (Monte Carlo Tree Search).
I'm trying to load the decomposable attention model proposed in this paper (the decomposable attention model (Parikh et al, 2017) combined with ELMo embeddings, trained on SNLI), and I used the code suggested on the demo website:
from allennlp.predictors.predictor import Predictor  # requires the allennlp-models package as well

predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/decomposable-attention-elmo-2020.04.09.tar.gz", "textual_entailment")
predictor.predict(
    hypothesis="Two women are sitting on a blanket near some rocks talking about politics.",
    premise="Two women are wandering along the shore drinking iced tea."
)
I found this in the log:
Did not use initialization regex that was passed: .*token_embedder_tokens\._projection.*weight
and the prediction was also different from what I got on the demo website (which is what I intended to reproduce). Did I miss anything here?
Also, I tried the two other versions of the pretrained model, decomposable-attention-elmo-2018.02.19.tar.gz and decomposable-attention-elmo-2020.02.10.tar.gz. Neither of them works and I got this error:
ConfigurationError: key "token_embedders" is required at location "model.text_field_embedder."
What do I need to do to get the exact output presented on the demo website?
ELMo is a bit difficult in this way, in that it keeps state, so you don't get the same output if you call it twice; the result depends on what you processed beforehand. In general, ELMo should be warmed up with a few queries before you use it seriously.
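A minimal warm-up sketch, reusing the predictor loaded above (the number of throwaway queries is an arbitrary assumption):
# Run a few throwaway queries so ELMo's internal state settles,
# then take the prediction you actually care about.
for _ in range(5):
    predictor.predict(
        hypothesis="Two women are sitting on a blanket near some rocks talking about politics.",
        premise="Two women are wandering along the shore drinking iced tea."
    )

result = predictor.predict(
    hypothesis="Two women are sitting on a blanket near some rocks talking about politics.",
    premise="Two women are wandering along the shore drinking iced tea."
)
print(result)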
If you're still seeing large discrepancies in the output, let us know and we'll look into it.
The old versions of the model don't work with the new code. That's why we published the new model versions.
TRPO - RL: I need to get an 8-DOF robot arm to move to a specified point. I need to implement the TRPO RL code using OpenAI Gym. I already have the Gazebo environment, but I am unsure of how to write the code for the reward functions and the algorithm for the joint-space motion. Please help.
Reward
Gazebo should be able to tell you the position of the end-effector link, from which we can calculate the progress made towards the specified point after each step (i.e. positive if moving towards the goal, negative if moving away, and 0 otherwise).
This alone should encourage the end-effector towards the goal.
You may want to confirm that the system is able to learn with just this basic reward before considering other criteria such as smoothness (avoiding jerky motions), handedness (positioning the elbows on the left/right), etc.
These are significantly harder to specify and will have to be hand-designed according to your needs, possibly based on the joint states and/or some other derivatives that are available in your environment.
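As a rough sketch of the progress-based term described above (function and variable names are illustrative, not from Gazebo or gym):
import numpy as np

def progress_reward(prev_ee_pos, curr_ee_pos, goal_pos, tol=1e-4):
    """Positive if the end-effector moved towards the goal, negative if away, 0 otherwise."""
    prev_dist = np.linalg.norm(np.asarray(goal_pos) - np.asarray(prev_ee_pos))
    curr_dist = np.linalg.norm(np.asarray(goal_pos) - np.asarray(curr_ee_pos))
    progress = prev_dist - curr_dist
    if abs(progress) < tol:
        return 0.0
    return progress  # scale or clip as needed for your setup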
Motion
This will largely depend on your stack.
I am adding this part in just as a passing comment, but, for instance, if you are using ROS as your middleware, then you can easily integrate MoveIt to handle all the movement for you.
I'm using joint positions from a Kinect camera as my state space, but I think it's going to be too large (25 joints × 30 frames per second) to just feed into SARSA or Q-learning.
Right now I'm using the Kinect Gesture Builder program, which uses supervised learning to associate user movement with specific gestures. But that requires supervised training, which I'd like to move away from. I figure the algorithm might pick up certain associations between joints, the way I do when I classify the data myself (hands up, step left, step right, for example).
I think feeding that data into a deep neural network and then passing that into a reinforcement learning algorithm might give me a better result.
There was a paper on this recently. https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
I know Accord.NET has both deep neural networks and RL, but has anyone combined the two? Any insights?
If I understand correctly from your question + comment, what you want is to have an agent that performs discrete actions using visual input (raw pixels from a camera). This looks exactly like what the DeepMind team recently did, extending the paper you mentioned. Have a look at this. It is the newer (and better) version of playing Atari games. They also provide an official implementation, which you can download here.
There is even an implementation in Neon which works pretty well.
Finally, if you want to use continuous actions, you might be interested in this very recent paper.
To recap: yes, somebody has combined DNN + RL, it works, and if you want to use raw camera data to train an agent with RL, this is definitely one way to go :)
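If you instead stick with the low-dimensional joint-position representation from the question rather than raw pixels, a minimal Q-network sketch in Python/PyTorch could look like this (the input size of 25 joints × 3 coordinates and the number of discrete gesture actions are illustrative assumptions, not from any of the papers above):
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a flattened joint-position state to Q-values, one per discrete action."""
    def __init__(self, state_dim=25 * 3, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Q-values for a batch of states; the greedy action is the argmax, as in DQN.
q = QNetwork()
q_values = q(torch.randn(1, 25 * 3))
action = q_values.argmax(dim=1)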