How to generalise over multiple dependent actions in Reinforcement Learning

I am trying to build an RL agent to price paid seating on airline flights (the seats, not the ticket). The general set-up is:
After choosing their flights (for n people on a booking), a customer will view a web page with the available seat types and their prices visible.
They select between zero and n seats from a seat map with a variety of different prices for different seats, to be added to their booking.
After perhaps some other steps, they pay for the booking and the agent is rewarded with the seat revenue.
I have not decided on a general architecture yet. I want to take various booking and flight information into account, so I know I will be using function approximation (most likely a neural net) to generalise over the state space.
However, I am less clear on how to set up my action space. I imagine an action would amount to a vector with a price for each different seat type. If I have, for example, 8 different seat types, and 10 different price points for each, this gives me a total of 10^8 different actions, many of which will be very similar. Additionally, each sub-action (pricing one seat type) is somewhat dependent on the others, in the sense that the price of one seat type will likely affect the demand (and hence reward contribution) for another. Hence, I doubt the problem can be decomposed into a set of sub-problems.
I'm interested in whether there has been any research into dealing with a problem like this. Clearly any agent I build needs some way to generalise across actions, since collecting real data on millions of actions is not possible, even for just one state.
As I see it, this comes down to two questions:
Is it possible to get an agent to understand actions in relative terms? Say for example, one set of potential prices is [10, 12, 20]. Can I get my agent to realise that there is a natural ordering there, and that the first two pricing actions are more similar to each other than to the third possible action?
Further to this, is it possible to generalise from this set of actions - could an agent be set up to understand that the set of prices [10, 13, 20] is very similar to the first set?
I haven't been able to find any literature on this, especially relating to the second question - any help would be much appreciated!

Correct me if I'm wrong, but I am going to assume this is what you are asking and will answer accordingly.
I am building an RL agent, and it needs to be smart enough to understand that if I buy one airplane ticket, it will subsequently affect the price of other airplane tickets because there is now less supply.
Also, the RL agent must realize that actions very close to each other are relatively similar actions, such as [10, 12, 20] ≈ [10, 13, 20]
1) There are two ways to provide memory to your RL agent. The easy way is to feed in the state as a vector of past purchased tickets, together with the current ticket.
Example: let's say we build the RL agent to remember at least the past 3 transactions. At the very beginning, our state vector will be [0, 0, 3], meaning that there were no purchases of tickets previously (the zeros) and currently we are purchasing ticket #3. The next time step's state vector can then be [0, 3, 6], telling the RL agent that ticket #3 was picked previously and now we're buying ticket #6. The neural network will learn that the state vector [0, 0, 6] should map to a different outcome than [0, 3, 6], because in the first case ticket #6 was the first ticket purchased and there was lots of supply, while in the second case ticket #3 was already sold, so all the remaining tickets went up in price.
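For concreteness, here is a minimal sketch of that fixed-window state encoding (the helper name, window size, and ticket IDs are illustrative, not from the question):

```python
import numpy as np

def make_state(purchase_history, window=3):
    """Return the last `window` ticket IDs, left-padded with 0 ("no purchase")."""
    padded = [0] * window + list(purchase_history)
    return np.array(padded[-window:], dtype=np.float32)

print(make_state([3]))     # [0. 0. 3.] -> nothing bought before, now buying #3
print(make_state([3, 6]))  # [0. 3. 6.] -> #3 bought earlier, now buying #6
print(make_state([6]))     # [0. 0. 6.] -> #6 is the first purchase
```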
The proper and more complex way would be to use a recurrent neural network as the function approximator for your RL agent. A recurrent architecture allows certain "important" states to be remembered by the network. In your case, the tickets purchased previously are important, so the network will remember them and calculate the output accordingly.
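As a rough illustration of the recurrent idea (PyTorch is assumed here, and all sizes are hypothetical; this is a sketch, not a full agent):

```python
import torch
import torch.nn as nn

class RecurrentQNet(nn.Module):
    def __init__(self, n_tickets=10, hidden=32, n_actions=10):
        super().__init__()
        self.embed = nn.Embedding(n_tickets + 1, 8)   # ID 0 = "no ticket"
        self.lstm = nn.LSTM(8, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, ticket_seq):
        # ticket_seq: (batch, seq_len) of past ticket IDs
        x = self.embed(ticket_seq)
        out, _ = self.lstm(x)                 # the LSTM carries the "memory"
        return self.head(out[:, -1])          # Q-values from the last hidden state

q = RecurrentQNet()
print(q(torch.tensor([[0, 3, 6]])).shape)     # torch.Size([1, 10])
```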
2) Any function-approximation reinforcement learning algorithm will automatically generalize across sets of actions that are close to each other. The only RL architectures that would not do this are tabular-based approaches.
The reason is the following:
We can think of these function approximators as curves. A neural network builds a highly nonlinear but continuous function (it is trained with backpropagation and gradient descent, so the mapping must be continuous), and a set of states maps to a unique set of outputs. Because the mapping is continuous, states that are very similar SHOULD map to outputs that are also very close. In the most basic case, imagine y = 2x: if the input is x = 1, the output is y = 2, and if the input is x = 1.1, which is very close to 1, the output is y = 2.2, which is very close to 2, because both points lie on a continuous line.
In a tabular approach, there is simply a matrix: along one axis you have the states, and along the other you have the actions. In this approach the states and actions are discrete. Depending on the discretization, the difference between neighbouring entries can be massive, and if the system is poorly discretized, actions that are very close to each other MAY not be generalized.
I hope this helps, please let me know if anything is unclear.

Related

How to reveal relations between number of words and target with self-attention based models?

Transformers can handle variable-length input, but what if the number of words might correlate with the target? Let's say we want to perform sentiment analysis on some reviews, where longer reviews are more likely to be bad. How can the model harness this knowledge? Of course, a simple solution could be to add this count as a feature after the self-attention layer. However, this hand-crafted approach wouldn't reveal more complex relations; for example, a high number of word X might correlate with target 1, except if there is also a high number of word Y, in which case the target tends to be 0.
How could this information be included using deep learning? Paper recommendations in the topic are also well appreciated.
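For reference, the simple hand-crafted solution mentioned above might look something like this sketch (PyTorch assumed, sizes hypothetical; it only shows the head, not the full transformer):

```python
import torch
import torch.nn as nn

class LengthAwareHead(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.classifier = nn.Linear(d_model + 1, 2)  # +1 for the word count

    def forward(self, pooled, lengths):
        # pooled: (batch, d_model) pooled output of the self-attention encoder
        # lengths: (batch,) number of words in each review
        feats = torch.cat([pooled, lengths.unsqueeze(1).float()], dim=1)
        return self.classifier(feats)

head = LengthAwareHead()
pooled = torch.randn(4, 64)
lengths = torch.tensor([12, 87, 33, 200])
print(head(pooled, lengths).shape)  # torch.Size([4, 2])
```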

Building an autonomic drugs widget for medical education

I've made my way over to this community because I'm planning on building a widget to help medical students with understanding the effects of various autonomic medications on cardiovascular metrics like heart rate (HR), BP (systolic, diastolic, and mean) and peripheral resistance (SVR). Some background - I'm a 3rd year med student in the US without fluency in any programming languages (which makes this particularly difficult), but am willing to spend the time to pick up what I need to know to make this happen.
Regarding the project:
The effects of autonomic medications like epinephrine, norepinephrine, beta-blockers, and alpha-blockers on the cardiovascular system are of great interest to physicians because these drugs can be used to resuscitate, to prep for surgery, to slow the progression of cardiovascular and respiratory disease, and even as antidotes for certain toxicities. There are four receptor types we are primarily concerned with: alpha1, alpha2, beta1, beta2. The receptor selectivity profile of any given drug is what governs its effects on the CV system. The way these effects are taught and tested in med school classrooms and by the United States board exams is in the form of graphs.
The impetus for this project is that many of my classmates and I struggled with this concept when we were initially learning it, and I believe a large part of that arises from the lack of a resource that shows the changes in the graphs from baseline, in real time.
When being taught this info, we are required to consider: a) the downstream effects when the receptor types listed above are stimulated (by an agonist drug) or inhibited (by an antagonist); b) the receptor specificities of each of the autonomic drugs of interest (there are about 8 that are very important); c) how to interpret the graphs shown above and how those graphs would change if multiple autonomics were administered in succession. (Exams and the boards love to show graphs with various points marked along them, then ask which drugs are responsible for the changes seen, just like the example above.)
The current methods of learning these three points are a mess, and having gone through it, I'd like to do what I can to contribute to building a more effective resource.
My goal is to create a widget that allows a user to visualize these changes with up to 3 drugs in succession. Here is a rough sketch of the goal.
In this example, norepinephrine has strong alpha1 agonist effects which causes an increase in systolic (blue line), diastolic (red line), and mean BP, as well as peripheral resistance. Due to the increased BP, there is a reflexive decrease in HR.
Upon the administration of phentolamine, a strong alpha1 antagonist, the BP and SVR decline while HR increases reflexively.
Regarding the widget, I would like the user to be able to choose up to 3 drugs from a drop-down menu (e.g. Drug 1, Drug 2, Drug 3), and the graphs to reflect the effects of those drugs on the CV metrics while ALSO taking into account the interactions of the drugs with one another.
This is an IMPORTANT point: the order in which drugs are added matters, because certain receptors become blocked, preventing other drugs from having their primary effect, so they revert to their secondary effect.
If you're still following me on this, what I'm looking for is some help in figuring out how best to approach all the possibilities that can happen. Should I try to learn if-then statements and write a script to produce graphs based on those? (e.g. if epi, then Psys = x, Pdia = y, MAP = z). Should I create a contingency table in Excel in which I list the 8 drugs I'm focusing on, make values for the metrics, and then plot those, essentially taking into account all the permutations? Any thoughts and direction would be greatly appreciated.
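To make the contingency-table idea concrete, here is a minimal sketch; every drug name and number below is an illustrative placeholder, not a medically validated value, and order-dependent receptor blocking would still need extra state (e.g. a set of blocked receptors carried through the loop):

```python
# Map each drug to the direction/size of its effect on each metric.
# PLACEHOLDER values for illustration only -- not medical data.
EFFECTS = {
    "norepinephrine": {"Psys": +20, "Pdia": +15, "HR": -10, "SVR": +300},
    "phentolamine":   {"Psys": -15, "Pdia": -12, "HR": +12, "SVR": -250},
}

BASELINE = {"Psys": 120, "Pdia": 80, "HR": 70, "SVR": 1100}

def apply_drugs(drugs):
    """Apply drugs in order; extend with blocked-receptor state to model order effects."""
    state = dict(BASELINE)
    for drug in drugs:
        for metric, delta in EFFECTS[drug].items():
            state[metric] += delta
    return state

print(apply_drugs(["norepinephrine", "phentolamine"]))
```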
Thank you for your time.

Reinforcement learning with pair of actions

I'm learning reinforcement learning in Python and have followed some tutorials, most of which deal with simple actions (like up, down, right, or left), so basically one action at a time.
In my project the actions are different: each action is a pair, meaning an action type plus an offset taken with that action, like (action-type, offset-taken).
Action types, for example, are: u1_set, u1_clear, u2_set, u2_clear, u3_set, u3_clear.
And with each action there is an attenuation offset associated with it (offsets like -1, -0.5, 0, +0.5, +1), so some example action pairs would be (u2_set, +1), (u2_clear, -0.5), etc.
I'm wondering what the best way would be to implement reinforcement learning in this situation (pairs of action type and offset), and whether there is a good example available online to share.
Thanks in advance.
By far the easiest approach will be to simply treat every possible pair of "sub-actions" as a single complete action. So, in your example, every action is a pair (U, Offset), where U is one of {u1_set, u1_clear, u2_set, u2_clear, u3_set, u3_clear}, and Offset is one of {-1, -0.5, 0, +0.5, +1}. With this example, there would be a total of 6 x 5 = 30 possible pairs, so 30 different actions. That should be perfectly fine for most RL approaches.
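A small sketch of flattening the two sub-actions into one discrete action space, using the action types and offsets from the question:

```python
from itertools import product

action_types = ["u1_set", "u1_clear", "u2_set", "u2_clear", "u3_set", "u3_clear"]
offsets = [-1, -0.5, 0, 0.5, 1]

# Every (type, offset) pair becomes one atomic action.
actions = list(product(action_types, offsets))
print(len(actions))   # 30
print(actions[0])     # ('u1_set', -1)

# The agent then just picks an index 0..29; decode it with actions[index].
```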
If you move on to more complex situations (too many possible pairs), you could start considering more complex solutions as well. For example, you could treat the problem of selecting an action type as a first RL problem, and then the problem of selecting an offset as an additional, separate RL problem (possibly with an enhanced state representation that also contains the already-selected action type).
Or, if you were to move on to Reinforcement Learning with Neural Networks, you could simply have two separate "heads" as output layers, both connected to otherwise the same architecture.
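A minimal sketch of that two-head idea (PyTorch assumed, sizes hypothetical): a shared body produces features, and each head scores one sub-action.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self, state_dim=8, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.type_head = nn.Linear(hidden, 6)    # u1_set ... u3_clear
        self.offset_head = nn.Linear(hidden, 5)  # -1 ... +1

    def forward(self, state):
        h = self.body(state)                     # shared representation
        return self.type_head(h), self.offset_head(h)

net = TwoHeadNet()
t, o = net(torch.randn(1, 8))
print(t.shape, o.shape)  # torch.Size([1, 6]) torch.Size([1, 5])
```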
I suspect those last two paragraphs may be unnecessarily complex, especially if you've only just started learning RL, and the first paragraph may be just fine.

Parametrization of sparse sampling algorithms

I have a question about the parametrization of C, H, and lambda in the paper "A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes" (or for anyone with some general knowledge of reinforcement learning, and especially lambda), on page 5.
More precisely, I do not see any indication of whether the parametrizations of H, C, or lambda depend on factors such as the sparsity or distance of rewards, given that the environment might have rewards within any number of steps in the future.
For example, let's assume there is an environment that requires a string of 7 actions to reach a reward from an average starting state, and another that requires 2 actions. When planning with trees, it seems obvious that, given the usual exponential branching of the state space, C (the sample size) and H (the horizon length) should depend on how far removed from the current state these rewards are. For the environment with rewards 2 steps away from an average state, H = 2 might be enough, for example. Similarly, C should depend on the sparsity of rewards: if there are 1000 possible states and only one of them has a reward, C should be higher than if a reward were found in every 5 states (assume multiple states give the same reward vs. a goal-oriented problem).
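As a back-of-the-envelope illustration of the horizon intuition (this is the generic discounting argument, not the paper's exact bound): a reward d steps away is invisible to any tree with H < d, and with discount gamma the value ignored beyond horizon H is at most gamma**H * Vmax.

```python
import math

def min_horizon(gamma, vmax, eps):
    """Smallest H with gamma**H * vmax <= eps (generic discounting bound)."""
    return math.ceil(math.log(eps / vmax) / math.log(gamma))

print(min_horizon(0.9, vmax=1.0, eps=0.01))  # 44
# Separately, to see a reward d steps away at all, H >= d is required:
# with d = 7, any H < 7 makes the lookahead tree blind to that reward.
```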
So the question is: are my assumptions correct, or what have I missed about sampling? The definitions on page 5 of the linked PDF make no mention of any dependency on the branching factor or the sparsity of rewards.
Thank you for your time.

Area of overlap of two circular Gaussian functions

I am trying to write code that finds the overlap between 3D shapes.
Each shape is defined by two intersecting normal distributions (one in the x direction, one in the y direction).
Do you have any suggestions of existing code that addresses this question or functions that I can utilize to build this code? Most of my programming experience has been in R, but I am open to solutions in other languages as well.
Thank you in advance for any suggestions and assistance!
The longer research context on this question: I am studying the use of acoustic space by insects. I want to know whether randomly assembled groups of insects would have calls that are more or less similar than we observe in natural communities (a randomization test). To do so, I need to randomly select insect species and calculate the similarity between their calls.
For each species, I have a mean and variance for two call characteristics that are approximately normally distributed. I would like to use these two call characteristics to build a 3D probability distribution for the species. I would then like to calculate the amount by which the PDF for one species overlaps with another.
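To make the question concrete, here is one way the overlap (the overlapping coefficient, the integral of min(f, g)) could be computed numerically in Python with SciPy; the means and variances are placeholders for the per-species call parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two bivariate normal PDFs, one per species (PLACEHOLDER parameters).
f = multivariate_normal(mean=[0.0, 0.0], cov=np.diag([1.0, 2.0]))
g = multivariate_normal(mean=[1.0, 0.5], cov=np.diag([1.5, 1.0]))

# Evaluate both PDFs on a grid covering both distributions.
xs = np.linspace(-8, 10, 600)
ys = np.linspace(-8, 10, 600)
X, Y = np.meshgrid(xs, ys)
grid = np.dstack([X, Y])

# Riemann-sum approximation of the integral of min(f, g).
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
overlap = np.minimum(f.pdf(grid), g.pdf(grid)).sum() * dx * dy
print(overlap)  # shared probability mass, between 0 and 1
```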
Please accept my apologies if the question is not clear or appropriate for this forum.
I work in small molecule drug discovery, and I frequently use a program (ROCS, by OpenEye Scientific Software) based on algorithms that represent molecules as collections of spherical Gaussian functions and compute intersection volumes. You might look at the following references, as well as the ROCS documentation:
(1) Grant and Pickup, J. Phys. Chem. 1995, 99, 3503-3510
(2) Grant, Gallardo, and Pickup, J. Comp. Chem. 1996, 17, 1653-1666