Parametrization of sparse sampling algorithms - reinforcement-learning

I have a question about the parametrization of C, H and lambda in the paper: "A Sparse Sampling Algorithm for Near-Optimal Planning in Large Markov Decision Processes" (or for anyone with some general knowledge on reinforcement learning and especially lambda), in page 5.
More precisely, I do not see any indication of if the parametrizations H, C or lambda are dependent on such factors as the sparsity or distance of a reward, given the environment might have rewards within any number of steps in the future.
For example, let's assume that there is an environment that requires a string of 7 actions to reach a reward from an average starting state, and another that requires 2 actions. When planning with trees, it seems obvious that, given the usual exponential branching of the state space, C (size of the sample) and H (horizon length) should be dependent on how far removed from the current state these rewards are. For the one with 2 steps away from an average state it might be enough to have an H = 2 for example. Similarly C should be dependent on the sparsity of a reward, that is, if there are 1000 possible states and only one of them has a reward, C should be higher than if the reward would be found with every 5 states (assume multiple states give the same reward vs. a goal-oriented problem).
So the question is, are my assumptions correct, or what have I missed about sampling? Those definitions on page 5 of the linked pdf do not have any mention of any dependency on the branching factor or sparsity of rewards.
Thank you for your time.

Related

Questions About Deep Q-Learning

I read several materials about deep q-learning and I'm not sure if I understand it completely. From what I learned, it seems that Deep Q-learning calculates faster the Q-values rather than putting them on a table by using NN to perform a regression, calculating loss and backpropagating the error to update the weights. Then, in a testing scenario, it takes a state and the NN will return several Q-values for each action possible for that state. Then, the action with the highest Q-value will be chosen to be done at that state.
My only question is how the weights are updated. According to this site the weights are updated as follows:
I understand that the weights are initialized randomly, R is returned by the environment, gamma and alpha are set manually, but I dont understand how Q(s',a,w) and Q(s,a,w) are initialized and calculated. Does it seem that we should build a table of Q-values and update them as with Q-learning or they are calculated automatically at each NN training epoch? what I am not understanding here? can somebody explain to me better such an equation?
In Q-Learning, we are concerned with learning the Q(s, a) function which is a mapping between a state to all actions. Say you have an arbitrary state space and an action space of 3 actions, each of these states will compute to three different values, each an action. In tabular Q-Learning, this is done with a physical table. Consider the following case:
Here, we have a Q table for each state in the game (upper left). And after each time step, the Q value for that specific action is updated according to some reward signal. The reward signal can be discounted by some value between 0 and 1.
In Deep Q-Learning, we disregard the use of tables and create a parametrized "table" such as this:
Here, all of the weights will form combinations given on the input that should appromiately match the value seen in the tabular case (Still actively researched).
The equation you presented is the Q-learning update rule set in a gradient update rule.
alpha is the step-size
R is the reward
Gamma is the discounting factor
You do inference of the network to retrieve the value of the "discounted future state" and subtract this with the "current" state. If this is unclear, I recommend you to look up boostrapping which is basicly what is happening here.

How to generalise over multiple dependent actions in Reinforcement Learning

I am trying to build an RL agent to price paid for airline seats (not the ticket). The general set up is:
After choosing their flights (for n people on a booking), a customer will view a web page with the available seat types and their prices visible.
They select between zero and n seats from a seat map with a variety of different prices for different seats, to be added to their booking.
After perhaps some other steps, they pay for the booking and the agent is rewarded with the seat revenue.
I have not decided on a general architecture yet. I want to take various booking and flight information into account, so I know I will be using function approximation (most likely a neural net) to generalise over the state space.
However, I am less clear on how to set up my action space. I imagine an action would amount to a vector with a price for each different seat type. If I have, for example, 8 different seat types, and 10 different price points for each, this gives me a total of 10^8 different actions, many of which will be very similar. Additionally, each sub-action (pricing one seat type) is somewhat dependent on the others, in the sense that the price of one seat type will likely affect the demand (and hence reward contribution) for another. Hence, I doubt the problem can be decomposed into a set of sub-problems.
I'm interested if there has been any research into dealing with a problem like this. Clearly any agent I build needs some way to generalise across actions to some degree, since collecting real data on millions of actions is not possible, even just for one state.
As I see it, this comes down to two questions:
Is it possible to get an agent to understand actions in relative terms? Say for example, one set of potential prices is [10, 12, 20]. Can I get my agent to realise that there is a natural ordering there, and that the first two pricing actions are more similar to each other than to the third possible action?
Further to this, is it possible to generalise from this set of actions - could an agent be set up to understand that the set of prices [10, 13, 20] is very similar to the first set?
I haven't been able to find any literature on this, especially relating to the second question - any help would be much appreciated!
Correct me if I'm wrong, but I am going to assume this is what you are asking and will answer accordingly.
I am building an RL and it needs to be smart enough to understand that if I were to buy one airplane ticket, it will subsequently affect the price of other airplane tickets because there is now less supply.
Also, the RL agent must realize that actions very close to each other are relatively similar actions, such as [10, 12, 20] ≈ [10, 13, 20]
1) In order to provide memory to your RL agent, you can do this in two ways. The easy way is to feed the states as a vector of past purchased tickets, as well as the current ticket.
Example: Let's say we build the RL to remember at least the past 3 transactions. At the very beginning, our state vector will be [0, 0, 3], meaning that there was no purchases of tickets previously (the zeros), and currently, we are purchasing ticket #3. Then, the next time step's state vector can be [0, 3, 6], telling the RL agent that previously, ticket #3 has been picked, and now we're buying ticket #6. The neural network will learn that the state vector [0, 0, 6] should map to a different outcome compared to [0, 3, 6], because in the first case, ticket #6 was the first ticket purchased and there was lots of supply. But in the 2nd case, ticket 3 was already sold, so now all the remaining tickets went up in price.
The proper and more complex way would be to use a recurrent neural network as your function approximator for your RL agent. The recurrent neural network architecture allows for certain "important" states to be remembered by the neural network. In your case, the amount of tickets purchased previously is important, so the neural network will remember the previously purchased tickets, and calculate the output accordingly.
2) Any function approximation reinforcement learning algorithm will automatically generalize sets of actions close to each other. The only RL architectures that would not do this would be tabular based approaches.
The reason is the following:
We can think of these function approximators simply as a line. Neural networks simply build a highly nonlinear continuous line (neural networks are trained using backpropagation and gradient descent, so they must be continuous), and a set of states will map to a unique set of outputs. Because it is a line, sets of states that are very similar SHOULD map to outputs that are also very close. In the most basic case, imagine y = 2x. If our input x = 1, our y = 2. And if our input x is 1.1, which is very close to 1, our output y = 2.2, which is very close to 2 because they are on a continuous line.
For tabular approach, there is simply a matrix. On the y axis, you have the states, and on the x axis, you have the actions. In this approach, the states and actions are discrete. Depending on the discretization, the difference can be massive, and if the system is poorly discretized, the actions very close to each other MAY not be generalized.
I hope this helps, please let me know if anything is unclear.

Why do we need the hyperparameters beta and alpha in LDA?

I'm trying to understand the technical part of Latent Dirichlet Allocation (LDA), but I have a few questions on my mind:
First: Why do we need to add alpha and gamma every time we sample the equation below? What if we delete the alpha and gamma from the equation? Would it still be possible to get the result?
Second: In LDA, we randomly assign a topic to every word in the document. Then, we try to optimize the topic by observing the data. Where is the part which is related to posterior inference in the equation above?
If you look at the inference derivation on Wiki, the alpha and beta are introduced simply because the theta and phi are both drawn from Dirichlet distribution uniquely determined by them separately. The reason of choosing Dirichlet distribution as the prior distribution (e.g. P(phi|beta)) are mainly for making the math feasible to tackle by utilizing the nice form of conjugate prior (here is Dirichlet and categorical distribution, categorical distribution is a special case of multinational distribution where n is set to one, i.e. only one trial). Also, the Dirichlet distribution can help us "inject" our belief that doc-topic and topic-word distribution are centered in a few topics and words for a document or topic (if we set low hyperparameters). If you remove alpha and beta, I am not sure how it will work.
The posterior inference is replaced with joint probability inference, at least in Gibbs sampling, you need joint probability while pick one dimension to "transform the state" as the Metropolis-Hasting paradigm does. The formula you put here is essentially derived from the joint probability P(w,z). I would like to refer you the book Monte Carlo Statistical Methods (by Robert) to fully understand why inference works.

Computation consideration with different Caffe's network topology (difference in number of output)

I would like to use one of Caffe's reference model i.e. bvlc_reference_caffenet. I found that my target class i.e. person is one of the classes included in the ILSVRC dataset that has been trained for the model. As my goal is to classify whether a test image contains a person or not, I may achieve this by the following:
Use inference directly with 1000 number of output. This doesn't
require training/learning.
Change the network topology a little bit with the final FC layer's number of output (num_output) is set to 2 (instead of 1000). Retrain it as a binary classification problem.
My concern is about computational effort at deployment/prediction phase (testing). The latter looks more expensive computationally than the former. This is because during prediction phase it needs to compute those 1000 output possibilities to find the one with the highest score. What I'm not sure is that, it could be the case that there's a heuristic (which I'm not aware of) that simplifies the computation.
Can somebody please help cross check my understanding on this.

How does SPSS assign factor scores for cases where underlying variables were pairwise deleted?

Here's a simplified example of what I'm trying to figure out from a report. All analyses are being run in SPSS, which I don't have and don't use (my experience is with SAS and R).
They were running a regression to predict overall meal satisfaction from food type ordered, self-reported food flavor, and self-reported food texture.
But food flavor and texture are highly correlated, so they conducted a factor analysis, found food flavor and texture load on one factor, and used the factor scores in the regression.
However, about 40% of respondents don't have responses on self-reported food texture, so they used pairwise deletion while making the factors.
My question is when SPSS calculates the factor scores and outputs them as new variables in the data set, what does it do with people who had an input for a factor that was pairwise deleted?
How does it calculate (if it calculates it at all) factor scores for those people who had a response pairwise deleted during the creation of the factors and who therefore have missing data for one of the variables?
Factor scores are a linear combination of their scaled inputs. That is, given normalized variables X1, ..., Xn, we have the following (where LaTeX formatting isn't supported and the L's indicate the loadings)
f = \sum_{i=1}^n L_i X_i
In your case n = 2. Now suppose one of the X_i is missing. How do you take a sum involving a missing value? Clearly, you cannot... unless you impute it.
I don't know what the analyst who created your report did. I suggest you ask them.