I am experimenting with reinforcement learning in Python using Keras. Most of the available tutorials use the OpenAI Gym library to create the environment, state, and action sets.
After practicing with many good examples written by others, I decided that I want to create my own reinforcement learning environment, state, and action sets.
This is what I think would be fun to teach the machine to do:
An array of integers from 1 to 4. I will call these targets.
targets = [[1, 2, 3, 4]]
An additional list of numbers (drawn at random) from 1 to 4. I will call these bullets.
bullets = [1, 2, 3, 4]
When I shoot a bullet at a target, the target's number becomes the sum of the original target number and the bullet number.
I want to shoot a bullet (one at a time) at one of the targets to make 5.
For example, given targets [1 2 3 4] and bullet 1, I want the machine to predict the correct index to shoot at.
In this case, it should be index 3, because 4 + 1 = 5
curr_state = [[1, 2, 3, 4]]
bullet = 1
action = 3 (<-- index of the curr_state)
next_state = [[1, 2, 3, 5]]
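In code, the transition I have in mind is just this (a minimal sketch):

```python
def shoot(targets, bullet, action):
    # Apply one shot: add the bullet to the target at the chosen index.
    next_state = list(targets)
    next_state[action] += bullet
    return next_state

print(shoot([1, 2, 3, 4], bullet=1, action=3))  # -> [1, 2, 3, 5]
```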
I have been racking my brain to think of the best way to frame this as a reinforcement learning problem. I tried a few designs, but the model's results were not very good (meaning, it most likely fails to make the number 5).
Mostly, this is because the state has two parts: (1) the targets; (2) the bullet at that time. The method I have employed so far is to convert the state as follows:
State = 5 - targets - bullet
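For the example above, this encoding gives a vector whose zero marks the index where target + bullet = 5:

```python
import numpy as np

targets = np.array([1, 2, 3, 4])
bullet = 1
state = 5 - targets - bullet
print(state)  # -> [3 2 1 0]; the 0 sits at index 3, the correct action
```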
I was wondering if anyone can think of a better way to design this model?
Thanks in advance!
Alright, since it looks like no one is helping you out, I wrote a Python environment file for you as you described. I also made it as close to the OpenAI Gym style as possible. Here is the link to it in my GitHub repository; you can copy the code or fork it. I will explain it below:
https://github.com/RuiNian7319/Miscellaneous/blob/master/ShootingRange.py
States = [0, 1, 2, ..., 10]
Actions = [-2, -1, 0, 1, 2]
So the game starts at a random number between 0 and 10 (you can change this easily if you want), and that random number is the "target" you described above. Given this target, your AI agent can fire the gun, shooting bullets corresponding to the numbers above. The objective is for your bullet and the target to add up to 5. There are negative bullets in case your AI agent overshoots 5, or in case the target is a number above 5.
To get a positive reward, the agent has to get to 5. So if the current value is 3 and the agent shoots 2, the agent gets a reward of 1 since it reached a total of 5, and that episode ends.
There are 3 ways for the game to end:
1) Agent gets 5
2) Agent fails to get 5 in 15 tries
3) The number is above 10. In this case, we say the target is too far
Sometimes, you need to shoot multiple times to get 5. So, if your agent shoots, its current bullet will be added to the state, and the agent tries again from that new state.
Example:
Current state = 2. Agent shoots 2. New state is 4. And the agent starts at 4 at the next time step. This "sequential decision making" creates a reinforcement learning environment, rather than a contextual bandit.
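For reference, here is a minimal sketch of an environment along these lines (not the exact code from the linked repository; the failure reward of -1 is just illustrative):

```python
import random

class ShootingRange:
    # States are 0..10, actions index into the bullet values below,
    # and the goal is to make the running total exactly 5.
    BULLETS = [-2, -1, 0, 1, 2]

    def reset(self):
        self.state = random.randint(0, 10)  # random starting target
        self.shots = 0
        return self.state

    def step(self, action):
        self.state += self.BULLETS[action]
        self.shots += 1
        if self.state == 5:                      # 1) agent gets 5
            return self.state, 1.0, True, {}
        if self.shots >= 15 or self.state > 10:  # 2) out of tries, 3) too far
            return self.state, -1.0, True, {}    # failure reward is illustrative
        return self.state, 0.0, False, {}        # keep shooting from the new state
```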
I hope this makes sense, let me know if you have any questions.
Related
I have an RL problem where I want the agent to make a selection of x out of an array of size n.
I.e. if I have [0, 1, 2, 3, 4, 5] then n = 6 and if x = 3 a valid action could be
[2, 3, 5].
Right now, what I have tried is to have n scores:
Output n continuous numbers and select the x highest ones (see the sketch below). This works quite well.
I also tried iteratively replacing duplicates out of a MultiDiscrete action, where we have x values that can each be anything from 0 to n-1.
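The first approach looks roughly like this (a sketch, with random numbers standing in for the policy outputs):

```python
import numpy as np

n, x = 6, 3
scores = np.random.randn(n)       # stand-in for the n continuous outputs
action = np.argsort(scores)[-x:]  # indices of the x highest scores
print(sorted(action.tolist()))    # e.g. [2, 3, 5]
```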
Is there some other optimal action space I am missing that would force the agent to make unique choices?
Many thanks in advance for your valuable insights and tips! I am happy to try them all!
Since reinforcement learning is mostly about interacting with an environment, you can approach it like this:
Your agent starts choosing actions. After it chooses its first action, you can either update its set of possible choices by removing that choice (using a temporary action list), or you can update the value of the chosen action (giving it a negative reward, i.e., punishing it). I think this could solve your problem.
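A rough sketch of the first idea, with a uniform random choice standing in for the agent's policy:

```python
import random

def select_unique(n, x):
    # Pick one element at a time and remove it from the pool,
    # so duplicates are impossible by construction.
    available = list(range(n))
    chosen = []
    for _ in range(x):
        a = random.choice(available)  # the agent's policy would pick here
        chosen.append(a)
        available.remove(a)
    return chosen

print(select_unique(6, 3))  # e.g. [2, 5, 0]
```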
I am trying to understand Transformers. While I understand the concept of the encoder-decoder structure and the idea behind self-attention what I am stuck at is the "multi head part" of the "MultiheadAttention-Layer".
Looking at this explanation https://jalammar.github.io/illustrated-transformer/, which I generally found very good, it appears that multiple weight matrices (one set of weight matrices per head) are used to transform the original input into the query, key and value, which are then used to calculate the attention scores and the actual output of the MultiheadAttention layer. I also understand the idea behind multiple heads: the individual attention heads can focus on different parts (as depicted in the link).
However, this seems to contradict other observations I have made:
In the original paper https://arxiv.org/abs/1706.03762, it is stated that the input is split into parts of equal size per attention head.
So, for example I have:
batch_size = 1
sequence_length = 12
embed_dim = 512 (I assume that the dimensions for `query`, `key` and `value` are equal)
Then the shape of my query, key and value would each be [1, 12, 512]
We assume we have two heads, so num_heads = 2
This results in a dimension per head of 512/2=256. According to my understanding this should result in the shape [1, 12, 256] for each attention head.
So, am I correct in assuming that this depiction https://jalammar.github.io/illustrated-transformer/ just does not display this factor appropriately?
Does the splitting of the input into different heads actually lead to different calculations in the layer or is it just done to make computations faster?
I have looked at the implementation in torch.nn.MultiheadAttention and printed out the shapes at various stages during the forward pass through the layer. To me it appears that the operations are conducted in the following order:
Use the in_projection weight matrices to get the query, key and value from the original inputs. After this the shape for query, key and value is [1, 12, 512]. From my understanding the weights in this step are the parameters that are actually learned in the layer during training.
Then the shape is modified for the multiple heads into [2, 12, 256].
After this, the dot product between query and key is calculated, etc. The output of this operation has the shape [2, 12, 256].
Then the output of the heads is concatenated, which results in the shape [12, 512].
The attention output is multiplied by the output projection weight matrices and we get [12, 1, 512] (the batch size and the sequence_length are sometimes switched around). Again, here we have weights that are trained inside the matrices.
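To check this, the head split can be reproduced with plain tensor operations (a sketch; no learned weights are involved in this step):

```python
import torch

batch, seq, embed_dim, num_heads = 1, 12, 512, 2
head_dim = embed_dim // num_heads  # 256

q = torch.randn(batch, seq, embed_dim)  # query after the in-projection
# the "split" is just a reshape followed by a transpose
q_heads = q.view(batch, seq, num_heads, head_dim)  # [1, 12, 2, 256]
q_heads = q_heads.transpose(1, 2).reshape(batch * num_heads, seq, head_dim)
print(q_heads.shape)  # torch.Size([2, 12, 256])
```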
I printed the shapes of the parameters in the layer for different values of num_heads, and the number of parameters does not change:
First parameter: [1536,512] (The input projection weight matrix, I assume, 1536=3*512)
Second parameter: [1536] (The input projection bias, I assume)
Third parameter: [512,512] (The output projection weight matrix, I assume)
Fourth parameter: [512] (The output projection bias, I assume)
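This can be verified directly; the parameter shapes come out the same for any valid num_heads:

```python
import torch.nn as nn

for h in (1, 2, 8):
    mha = nn.MultiheadAttention(embed_dim=512, num_heads=h)
    for name, p in mha.named_parameters():
        print(h, name, tuple(p.shape))
# prints the same four shapes for every h:
# in_proj_weight (1536, 512), in_proj_bias (1536,),
# out_proj.weight (512, 512), out_proj.bias (512,)
```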
On this website https://towardsdatascience.com/transformers-explained-visually-part-3-multi-head-attention-deep-dive-1c1ff1024853, it is stated that this is only a "logical split". This seems to fit my own observations using the PyTorch implementation.
So does the number of attention heads actually change the values that are outputted by the layer and the weights learned by the model? The way I see it, the weights are not influenced by the number of heads.
Then how can multiple heads focus on different parts (similar to the filters in convolutional layers)?
I am working on a problem that I want to implement as a reinforcement learning problem and integrate into OpenAI Gym. My states are lists of length n, in which each element is chosen from a discrete interval [0, m].
for example for n=6 and m=3, this is a sample from the observation space:
[0 2 1 3 3 2]
and the states accessible from this one are the set of other lists obtained by changing k of the elements in the list to another number from the same interval [0, m].
for example, for k=1 we can have the following states as two subsequent states of the previous state:
[0 2 2 3 3 2]
or
[0 3 1 3 3 2]
My question is: what is an efficient way to represent the "actions" in OpenAI Gym for such a scenario?
One way that comes to mind is to just use the next state as the action itself. For example, if I write:
action = env.action_space.sample()
the action would be the next state (which also implicitly contains the action), and then env.step(action) would set the state equal to the next state.
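In Gym terms, what I have in mind is something like this (a rough sketch; the reward and termination logic are problem-specific placeholders):

```python
import gym
import numpy as np
from gym import spaces

class ListEnv(gym.Env):
    # Both the observation and the action are length-n vectors over [0, m].
    def __init__(self, n=6, m=3):
        self.observation_space = spaces.MultiDiscrete([m + 1] * n)
        self.action_space = spaces.MultiDiscrete([m + 1] * n)
        self.state = None

    def reset(self):
        self.state = self.observation_space.sample()
        return self.state

    def step(self, action):
        self.state = np.asarray(action)  # the action *is* the next state
        reward, done = 0.0, False        # placeholder logic
        return self.state, reward, done, {}
```

Note that action_space.sample() can then return states that differ from the current one in more than k elements, so step would have to reject or penalize those.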
Does anyone know a better way, or is the implicit action representation via the next state the optimal way?
Does anyone know a predefined Gym environment that uses the same representation?
What are the cons of the implicit representation of the actions that I just explained?
I have a question about my own project for testing reinforcement learning techniques. First, let me explain the setup. I have an agent that can take 4 actions over 8 steps. At the end of these eight steps, the agent can be in one of 5 possible victory states. The goal is to find the minimum cost. To reach these 5 victory states (with different cost values: 50, 50, 0, 40, 60), the agent does not take the same path (it is like a graph). The blue states are the fail states (sorry for the quality), and the episode stops there.
[Figure: graph of the agent's states and transitions; blue states are fail states.]
The real good path is: DCCBBAD
Now my question: I don't understand why, in SARSA and Q-learning (mainly in Q-learning), the agent finds a path but not the optimal one after 100,000 iterations (always DACBBAD/DACBBCD). Sometimes, when I run it again, the agent lands on the good path (DCCBBAD). So I would like to understand why the agent sometimes finds it and sometimes does not, and whether there is a way to stabilize my agent.
Thanks a lot,
Tanguy
TL;DR:
Set your epsilon so that you explore a lot over a large number of episodes, e.g., linearly decaying from 1.0 to 0.1.
Set your learning rate to a small constant value, such as 0.1.
Don't stop your algorithm based on number of episodes but on changes to the action-value function.
More detailed version:
Q-learning is only guaranteed to converge under the following conditions:
1) You must visit all state-action pairs infinitely often.
2) The sum of the learning rates over all timesteps must be infinite, so $\sum_{t} \alpha_t = \infty$.
3) The sum of the squares of the learning rates over all timesteps must be finite, that is $\sum_{t} \alpha_t^2 < \infty$.
To hit 1), just make sure your epsilon is not decaying to a low value too early. Make it decay very, very slowly, and perhaps never all the way down to 0. You can try a small constant $\epsilon$, too.
To hit 2) and 3), you must ensure you take care of 1), so that you collect infinite learning rates, but also pick your learning rate so that its square is finite. That basically means $\alpha \le 1$. If your environment is deterministic, you should try $\alpha = 1$. Deterministic here means that taking an action $a$ in a state $s$ always transitions you to the same state $s'$, for all states and actions in your environment. If your environment is stochastic, you can try a low number, such as 0.05-0.3.
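In code, the suggested schedules and a stopping rule based on changes to the action-value function might look like this (a tabular sketch with made-up state and action counts):

```python
import numpy as np

n_states, n_actions, n_episodes = 10, 4, 100_000
alpha, gamma = 0.1, 0.99                      # small constant learning rate
epsilons = np.linspace(1.0, 0.1, n_episodes)  # linear epsilon decay

Q = np.zeros((n_states, n_actions))

def choose_action(s, eps):
    # epsilon-greedy: explore with probability eps, otherwise exploit
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())

def q_update(s, a, r, s_next):
    # Returns the size of the update, so training can stop once the
    # action-value function stops changing, rather than after a fixed
    # number of episodes.
    delta = alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    Q[s, a] += delta
    return abs(delta)
```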
Maybe check out https://youtu.be/wZyJ66_u4TI?t=2790 for more info.
Let's say I have a sequence of values (e.g., 3, 5, 8, 12, 15) and I want to occasionally decrease all of them by a certain value.
If I store them as the sequence (0, 2, 3, 4, 3) with a base variable of 3, I now only have to change the base (and check the first items) whenever I want to decrease them all, instead of actually going over all the values.
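In code, the idea looks something like this (a minimal sketch):

```python
class DeltaSequence:
    # Store a base plus consecutive differences; decreasing every
    # value then only touches the base.
    def __init__(self, values):
        self.base = values[0]
        self.deltas = [0] + [b - a for a, b in zip(values, values[1:])]

    def decrease_all(self, amount):
        self.base -= amount  # O(1) instead of rewriting every element

    def values(self):
        out, running = [], self.base
        for d in self.deltas:
            running += d
            out.append(running)
        return out

seq = DeltaSequence([3, 5, 8, 12, 15])  # stored deltas: [0, 2, 3, 4, 3]
seq.decrease_all(3)
print(seq.values())  # -> [0, 2, 5, 9, 12]
```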
I know there's an official term for this, but when I literally translate from my native language to English it doesn't come out right.
Differential Coding / Delta Encoding?
I don't know a name for the data structure, but it's basically just base+offset :-)
An offset?
If I understand your question right, you're rebasing. That's normally used in reference to patching up addresses in DLLs from a load address.
I'm not sure that's what you're doing, because your example seems to be incorrect. In order to come out with { 3, 5, 8, 12, 15 }, with a base of 3, you'd need { 0, 2, 5, 9, 12 }.
I'm not sure. If you imagine your first array as giving the results of some function of an index value, f(i), where f(0) is 3, f(1) is 5, and so forth, then your second array describes the function f'(i), where f(i+1) = f(i) + f'(i), given f(0) = 3.
I'd call it something like a derivative function, where the process of retrieving your original data is simply the summation function.
What will happen more often, will you be changing f(0) or retrieving values from f(i)? Is this technique rooted in a desire to optimize?
Perhaps you're looking for a term like "Inductive Sequence" or "Induction Sequence." (I just made that up.)