Anyway, during trig yesterday, I had an idea. Suppose I want to write a program that uses artificial intelligence to solve problems. I stripped the idea down to an implementation of Dijkstra's algorithm on a directed graph, using actions as nodes and requirements/results as paths. For example, let's say Fred's in the living room and he's hungry. Some of Fred's possible actions are:
Get up:
Requirements: state = sitting or lying down.
Results: state = standing.
Lie down:
Requirements: state = standing or sitting.
Results: state = lying down.
Fall asleep:
Requirements: state = lying down, location = bedroom.
Results: state = asleep.
Walk to kitchen:
Requirements: state = standing, location is not kitchen.
Results: location = kitchen.
Walk to bedroom:
Requirements: state = standing, location is not bedroom.
Results: location = bedroom.
Prepare food:
Requirements: state = standing, location = kitchen.
Results: hasfood = true.
Eat:
Requirements: hasfood = true.
Results: hungry = false, hasfood = false.
Actions such as "get up" and "lie down" are easy because there is one requirement and one result. Actions such as "walk to kitchen" and "walk to bedroom" present more of a problem, because they have more than one requirement. How can I use requirements/results as a path if paths intertwine with each other?
Ultimately, the question(s):
Could problem solving + pathfinding work in practice (or has it worked already)? Would it make more sense to use requirements/results as nodes and actions as paths? If you think this approach is promising, please respond with pseudocode or an explanation for implementation.
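To make the second option concrete, here is roughly what I imagine "requirements/results as nodes, actions as paths" would look like: complete world states as the nodes and actions as the edges between them. Everything below is just the Fred example from above, and plain breadth-first search stands in for Dijkstra since every action costs the same.

from collections import deque

# Each action: (requirements as a predicate on the state, results as a dict of updates).
ACTIONS = {
    "get up":          (lambda s: s["state"] in ("sitting", "lying down"),                 {"state": "standing"}),
    "lie down":        (lambda s: s["state"] in ("standing", "sitting"),                   {"state": "lying down"}),
    "fall asleep":     (lambda s: s["state"] == "lying down" and s["location"] == "bedroom", {"state": "asleep"}),
    "walk to kitchen": (lambda s: s["state"] == "standing" and s["location"] != "kitchen", {"location": "kitchen"}),
    "walk to bedroom": (lambda s: s["state"] == "standing" and s["location"] != "bedroom", {"location": "bedroom"}),
    "prepare food":    (lambda s: s["state"] == "standing" and s["location"] == "kitchen", {"hasfood": True}),
    "eat":             (lambda s: s["hasfood"],                                            {"hungry": False, "hasfood": False}),
}

def plan(start, goal_test):
    # Breadth-first search over world states; returns the list of action names.
    frontier = deque([(start, [])])
    visited = {tuple(sorted(start.items()))}
    while frontier:
        state, path = frontier.popleft()
        if goal_test(state):
            return path
        for name, (requirements, results) in ACTIONS.items():
            if requirements(state):
                new_state = {**state, **results}
                key = tuple(sorted(new_state.items()))
                if key not in visited:
                    visited.add(key)
                    frontier.append((new_state, path + [name]))
    return None

start = {"state": "sitting", "location": "living room", "hungry": True, "hasfood": False}
print(plan(start, lambda s: not s["hungry"]))
# ['get up', 'walk to kitchen', 'prepare food', 'eat']

If actions had different costs, I assume the deque would just become a priority queue keyed on accumulated cost, which is Dijkstra again.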
I am pretty new to RL. Could anyone suggest results/papers on whether policy gradient (or more general RL algorithms) can be applied to problems where the action does not determine the next state, i.e. the next state is independent of the action: P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t)?
I think it is doable, as it would not change the derivation of the policy gradient.
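Roughly, the reasoning I have in mind (a sketch based on the standard REINFORCE-style derivation):

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big],
\qquad
\log p_\theta(\tau) = \log p(s_1) + \sum_t \log \pi_\theta(a_t \mid s_t) + \sum_t \log P(s_{t+1} \mid s_t, a_t)

The transition terms carry no \theta, so they vanish from the gradient whether or not they depend on a_t:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta}\Big[\Big(\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big) R(\tau)\Big]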
Also, I am curious about the difference between the RL setting where the next state is independent of the action, P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_t), and the multi-armed bandit setting. If the next state really is independent of the actions, what would be the correct framework to start with?
I am a beginner Python user who is trying to get a feel for computer science. I've been learning by studying concepts/subjects I'm already familiar with, such as Computational Fluid Dynamics and Finite Element Analysis. I got my degree in mechanical engineering, so I don't have much of a CS background.
I'm studying a series by Lorena Barba on the Jupyter notebook viewer, Practical Numerical Methods, and I'm looking for some help, hopefully from someone familiar with CFD and FEA in general.
If you click on the link below and go to the following output cell, you'll find the code I've pasted here. I'm really confused by this block of code inside the function that is defined there.
Anyway, if there is anyone out there with suggestions on how to tackle learning Python, help would be much appreciated.
In [9]:
rho_hist = [rho0.copy()]
rho = rho0.copy()  # I'm confused by the role of this variable here
for n in range(nt):
    # Compute the flux.
    F = flux(rho, *args)
    # Advance in time using the Lax-Friedrichs scheme.
    rho[1:-1] = (0.5 * (rho[:-2] + rho[2:]) -
                 dt / (2.0 * dx) * (F[2:] - F[:-2]))
    # Set the value at the first location.
    rho[0] = bc_values[0]
    # Set the value at the last location.
    rho[-1] = bc_values[1]
    # Record the time-step solution.
    rho_hist.append(rho.copy())
return rho_hist
http://nbviewer.jupyter.org/github/numerical-mooc/numerical-mooc/blob/master/lessons/03_wave/03_02_convectionSchemes.ipynb
The intent of the first two lines is to preserve rho0 and to provide copies of it: one for the history (a copy, so that later changes do not reflect back into it) and one as the initial value of the "working" variable rho that is used and modified during the computation.
The background is that Python list and array variables are always references to the object in question. Assigning one variable to another copies the reference (the address of the object), but not the object itself; both variables then refer to the same memory. Thus, without .copy(), modifying rho would also change rho0.
a = [1, 2, 3]
b = a          # b is another reference to the same list object
b[2] = 5
print(a)
# [1, 2, 5]
Composite objects that themselves contain structured data objects need copy.deepcopy to copy the data on all levels.
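A small example of the difference (a shallow copy shares the nested objects, copy.deepcopy does not):

import copy

a = [[1, 2], [3, 4]]
shallow = list(a)           # new outer list, but the inner lists are still shared
deep = copy.deepcopy(a)     # new objects on every level

a[0][0] = 99
print(shallow[0][0])        # 99, still shares the inner list with a
print(deep[0][0])           # 1, fully independent copy

For a numpy array of plain numbers, .copy() is enough, since the elements are values rather than references.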
Related questions:
Numpy array values changed without being asked?
How to pass a list as value and not as reference?
I am trying to implement Q-learning in an environment where the rewards R are stochastic, time-dependent variables that arrive in real time at a constant interval deltaT. The states S (scalars) also arrive at the constant interval deltaT. The task for the agent is to give the optimal action after it receives (S(n*deltaT), R(n*deltaT)).
My problem is that I am very new to RL and I don't understand how this algorithm should be implemented; most papers describing the Q-learning algorithm are written in "scientific English", which is not helping me.
OnTimer() executes after a fixed interval:
double a = 0.95;
double g = 0.95;
double old_state = 0;
action new_action = null;
action old_action = random_action;

void OnTimer()
{
    double new_state = environment.GetNewState();
    double Qmax = 0;
    foreach(action a in Actions)
    {
        if(Q(new_state, a) > Qmax)
            Qmax = Q(new_state, a);
        new_action = a;
    }
    double reward = environment.Reward(old_state, old_action);
    Q(old_state, old_action) = Q(old_state, old_action) + a*(reward + g*Qmax - Q(old_state, old_action));
    old_state = new_state;
    old_action = new_action;
    agent.ExecuteInEnvironment(new_action);
}
Question:
Is this a proper implementation of online Q-learning? It does not seem to work. Why does it not behave optimally as n*deltaT -> infinity? Please help, this is very important to me.
It's hard to say exactly what's going wrong without more information, but it doesn't look like you've implemented the algorithm correctly. Generally, the algorithm is:
Start out in an initial state as the current state.
Select the next action from the current state using a learning policy (such as epsilon-greedy). The chosen action will cause the transition from the current state to the next state.
The (current state, action) pair will tell you what the next state is.
Find Qmax (which I think you're doing correctly). One exception might be that Qmax should be 0 if the next state is a terminal state, but you might not have one.
Get the reward for the (current state, action, next state) tuple. You seem to be ignoring the transition to the next state in your calculation.
Update Q value for (old state, old action). I think you're doing this correctly.
Set current state to next state
Return to step 2, unless the current state is terminal.
Do you know the probability of your selected action actually causing your agent to move to the intended state, or is that something you have to estimate by observation? If states are just arriving arbitrarily and you don't have any control over what happens, this might not be an appropriate environment to apply reinforcement learning.
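To make the loop structure concrete, here is a rough tabular sketch in Python (epsilon-greedy selection; the environment interface is made up purely for illustration, not taken from your code):

import random
from collections import defaultdict

alpha, gamma, epsilon = 0.95, 0.95, 0.1
Q = defaultdict(float)                  # Q[(state, action)] -> estimated value

def choose_action(state, actions):
    # Learning policy (step 2): mostly greedy, occasionally explore.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def run_episode(env, actions):
    state = env.reset()                                    # step 1: initial state
    while not env.is_terminal(state):
        action = choose_action(state, actions)             # step 2
        next_state, reward = env.step(action)              # steps 3 and 5: transition and its reward
        q_max = 0.0 if env.is_terminal(next_state) else \
                max(Q[(next_state, a)] for a in actions)   # step 4
        Q[(state, action)] += alpha * (reward + gamma * q_max - Q[(state, action)])  # step 6
        state = next_state                                 # step 7

The key difference from your OnTimer() version is that the action is chosen before the transition happens, and the reward is tied to that specific (state, action, next state) step.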
I'm trying to implement an Inertial Navigation System using an Indirect Kalman Filter. I've found many publications and theses on this topic, but not much example code. For my implementation I'm using the Master's thesis available at the following link:
https://fenix.tecnico.ulisboa.pt/downloadFile/395137332405/dissertacao.pdf
As reported on page 47, the values measured by the inertial sensors equal the true values plus a series of other terms (bias, scale factors, ...).
For my question, let's consider only bias.
So:
Wmeas = Wtrue + BiasW   (gyro measurement)
Ameas = Atrue + BiasA   (accelerometer measurement)
Therefore,
when I propagate the Mechanization equations (equations 3-29, 3-37 and 3-41)
I should use the "true" values, or better:
Wmeas - BiasW
Ameas - BiasA
where BiasW and BiasA are the latest available estimates of the biases. Right?
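In code I imagine something like this (a schematic sketch; the function and variable names are mine, not from the thesis):

def corrected_imu(w_meas, a_meas, bias_w_est, bias_a_est):
    # Subtract the latest bias estimates before feeding the measurements
    # into the mechanization equations (3-29, 3-37 and 3-41).
    return w_meas - bias_w_est, a_meas - bias_a_est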
Concerning the update phase of the EKF,
if the measurement equation is
dzV = VelGPS_est - VelGPS_meas
the H matrix should have an identity block corresponding to the velocity error state variables dx(VEL) and zeros elsewhere. Right?
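i.e. something like this (assuming, purely for illustration, a 15-element error state ordered [dPOS, dVEL, dATT, dBiasA, dBiasW]):

import numpy as np

n_states = 15                  # 3 pos + 3 vel + 3 att + 3 accel-bias + 3 gyro-bias errors
H = np.zeros((3, n_states))
H[:, 3:6] = np.eye(3)          # identity block over dx(VEL), zeros elsewhere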
That said, I'm not sure how I have to propagate the state variables after the update phase.
In my opinion, the propagation of the state variables should be:
POSk|k = POSk|k-1 + dx(POS);
VELk|k = VELk|k-1 + dx(VEL);
...
But this didn't work. Therefore I've tried:
POSk|k = POSk|k-1 - dx(POS);
VELk|k = VELk|k-1 - dx(VEL);
but that didn't work either. I tried both solutions, even though in my opinion the "+" should be used. Since neither works (so I probably have some other error elsewhere),
I would like to ask if you have any suggestions.
You can see a snippet of code at the following link: http://pastebin.com/aGhKh2ck.
Thanks.
The difficulty you're running into is the difference between the theory and the practice. Taking your code from the snippet instead of the symbolic version in the question:
% Apply corrections
Pned = Pned + dx(1:3);
Vned = Vned + dx(4:6);
In theory, when you use the indirect form you freely integrate the IMU (the process called Mechanization in that paper) and occasionally run the IKF to update its correction. In theory, the unchecked double integration of the accelerometer produces large (or, for cheap MEMS IMUs, enormous) error values in Pned and Vned. That, in turn, causes the IKF to produce correspondingly large values of dx(1:6) as time evolves and the unchecked IMU integration drifts farther and farther from the truth. In theory you then sample your position at any time as Pned +/- dx(1:3) (the sign isn't important; you can set that up either way). The important part here is that you are not modifying Pned from the IKF: both run independently of each other, and you add them together whenever you need the answer.
In practice you do not want to take the difference between two enormous double values, because you will lose precision (many of the bits of the significand are needed to represent the enormous part instead of the precision you want). You have grasped that in practice you want to recursively update Pned on each update. However, when you diverge from the theory this way, you have to take the corresponding (and somewhat unobvious) step of zeroing out the correction value in the IKF state vector. In other words, after you do Pned = Pned + dx(1:3) you have "used" the correction, and you need to balance the equation with dx(1:3) = dx(1:3) - dx(1:3) (simplified: dx(1:3) = 0) so that you don't inadvertently integrate the correction over time.
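In terms of the snippet above, the extra step looks roughly like this (written as a Python/numpy-style sketch that mirrors the MATLAB variable names):

def apply_corrections(Pned, Vned, dx):
    # Apply the IKF corrections to the freely integrated states ...
    Pned = Pned + dx[0:3]
    Vned = Vned + dx[3:6]
    # ... then zero out the used-up part of the error state
    # so it is not integrated again on the next cycle.
    dx[0:6] = 0.0
    return Pned, Vned, dx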
Why does this work? Why doesn't it mess up the rest of the filter? As it turns out, the KF process covariance P does not actually depend on the state x. It depends on the update function and the process noise Q and so on. So the filter doesn't care what the data is. (Now that's a simplification, because often Q and R include rotation terms, and R might vary based on other state variables, etc, but in those cases you are actually using state from outside the filter (the cumulative position and orientation) not the raw correction values, which have no meaning by themselves).
I am working on an AI bot for the game Defcon. The game has cities, with varying populations, and defensive structures with limited range. I'm trying to work out a good algorithm for placing defence towers.
Cities with higher populations are more important to defend
Losing a defence tower is a blow, so towers should be placed reasonably close together
Towers and cities can only be placed on land
So, given these three rules, the best kind of placement is a ring of towers around the largest population areas (although I don't want an algorithm that just blindly places a ring around the single highest-population area; sometimes there might be two clusters of cities far apart, in which case the algorithm should make two rings, each with half my total towers).
I'm wondering what kind of algorithms might be used for determining placement of towers?
I would define a function that determines the value of a tower placed at a given position, then search for maxima of that function and place a tower there.
A sketch for the function could look like this:
if water: return 0
popsum = sum over all cities of (population / distance)        // it's better to have towers close by
towersum = - sum over all existing towers of (1 / distance)    // you want your towers spread somewhat evenly
return popsum + towersum * f   // f adjusts the relative importance of spreading towers evenly vs. protecting the population centers with many towers
This should give a reasonable algorithm to start with. For improvement, you might change the 1/distance function to something different to get a faster or slower drop-off.
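In (rough) Python, assuming you have lists of cities and already-placed towers with known positions, that sketch might look like:

def distance(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def tower_value(pos, cities, towers, f, is_water):
    # cities: list of (position, population); towers: list of positions already placed.
    if is_water(pos):
        return 0.0
    # reward being close to big populations ...
    popsum = sum(population / max(distance(pos, city_pos), 1.0)
                 for city_pos, population in cities)
    # ... but penalise crowding so the towers spread out somewhat evenly
    towersum = -sum(1.0 / max(distance(pos, t), 1.0) for t in towers)
    return popsum + towersum * f

The max(..., 1.0) terms just guard against division by zero when a candidate position sits exactly on a city or an existing tower.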
I'd start with implementing a fitness function that calculates the expected protection provided by a set of towers on a given map.
You'd calculate the amount of population inside the "protected" area, where areas covered by two towers are rated a bit higher than areas covered by only one tower (the exact scaling factor depends a lot on the game mechanics, though).
Then you could use a genetic algorithm to experiment with different sets of placements and let that run for several (hundred?) iterations.
If your fitness function is a good fit to the real quality of the placement and your implementation of the genetic algorithm is correct, then you should get a reasonable result.
And once you've done all that you can start developing an attack plan that tries to optimize the casualties for any given set of defense tower placements. Once you have that you can set the two populations against each other and reach even better defense plans this way (that is one of the basic ideas of artificial life).
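A very rough outline of that loop (truncation selection plus mutation only, no crossover; fitness, random_placement and mutate are the game-specific pieces you would have to supply):

import random

def evolve(fitness, random_placement, mutate, pop_size=50, generations=200):
    # fitness(placement)  -> expected protected population (higher is better)
    # random_placement()  -> a random legal set of tower positions
    # mutate(placement)   -> a slightly perturbed copy (move one tower, etc.)
    population = [random_placement() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[:pop_size // 2]
        children = [mutate(random.choice(survivors))
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness)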
I don't know the game, but from your description it seems that you need an algorithm similar to the one for solving the (weighted) k-centers problem. Unfortunately, this is an NP-hard problem, so in the best case you'll get an approximation within some factor of the optimum.
Take a look here: http://algo2.iti.kit.edu/vanstee/courses/kcenter.pdf
Just define a utility function that takes a potential build position as input and returns a "rating" for that position. I imagine it would look something like:
utility(position p) = k1 * population_of_city_at_p +
k2 * new_area_covered_if_placed_at_p +
k3 * number_of_nearby_defences
(k1, k2, and k3 are arbitrary constants that you'll need to tune)
Then just randomly sample a bunch of different points p and choose the one with the highest utility.
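In code, the sampling step could be as simple as this sketch (utility is whatever you build from the formula above, and candidates is any collection of legal land positions; both are placeholders):

import random

def best_position(candidates, utility, samples=1000):
    # Randomly sample candidate build positions and keep the best-scoring one.
    pool = list(candidates)
    sampled = random.sample(pool, min(samples, len(pool)))
    return max(sampled, key=utility)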