EKF - How to detect if the filter has converged? - kalman-filter

I implemented an EKF. The algorithm works very well, but I need a criterion to detect when the filter has converged after initialisation. What is the best / most common way to do this? I have two ideas:
1.) When the innovation has reached a predefined limit.
2.) When the estimated variance has reached a predefined limit.
Any suggestions?

The most common error when dealing with Kalman filters (and especially EKF) is to think that the convergence of the P matrix is equivalent to real convergence of the estimation.
You need to look at the normalized innovations.
The innovation is the difference between the expected (predicted) measurement and the actual one:
Innov = y - h(x_predicted)
The normalized innovation is the Mahalanobis distance of the innovation, which correlates with the convergence once the P matrix is small:
d^2 = Innov.transpose * Cov(Innov).inverse * Innov
Where Cov(Innov) = Cov(y - h(x_predicted)) = R + H * P_predicted * H.Transpose
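A minimal sketch of this check with NumPy (the function and variable names, the chi-squared gate, and the 0.95 level are illustrative assumptions, not part of the answer):

import numpy as np

def normalized_innovation(y, x_pred, P_pred, H, R, h):
    # Innovation: difference between actual and predicted measurement.
    innov = y - h(x_pred)
    # Innovation covariance: Cov(Innov) = R + H * P_predicted * H^T
    S = R + H @ P_pred @ H.T
    # Squared Mahalanobis distance d^2 = Innov^T * S^-1 * Innov
    d2 = innov.T @ np.linalg.solve(S, innov)
    return float(d2)

# One common gate (an assumption here, not from the answer): compare d^2
# against a chi-squared quantile with dim(y) degrees of freedom, e.g.
#   converged = d2 < scipy.stats.chi2.ppf(0.95, df=len(y))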

I usually look at the derivative of the estimated variance. If it is not changing any more, the filter has converged.
This usually works quite well.
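A possible implementation of that idea, assuming you record trace(P) after every update (the window size and tolerance below are made-up values that need tuning per system):

import numpy as np

def variance_settled(trace_history, window=20, tol=1e-4):
    # trace_history: trace(P) recorded after every update step.
    # window and tol are illustrative values, not prescriptions.
    if len(trace_history) < window:
        return False
    recent = np.asarray(trace_history[-window:])
    # Approximate derivative: largest step-to-step change within the window.
    return float(np.max(np.abs(np.diff(recent)))) < tol

# Inside the filter loop (hypothetical names):
#   trace_history.append(np.trace(P))
#   if variance_settled(trace_history): ...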

Related

Backpropagation on Two Layered Networks

I have been following the cs231n lectures from Stanford and trying to complete the assignments on my own, sharing the solutions both on GitHub and my blog. But I'm having a hard time understanding how to model backpropagation. I mean, I can code modular forward and backward passes, but what bothers me is if I have the model below: Two Layered Neural Network
Let's assume that our loss function here is a softmax loss. In my modular softmax_loss() function I am calculating the loss and the gradient with respect to the scores (dSoft = dL/dY). After that, when I'm going backwards, say for b2, db2 would be equal to dSoft*1, and dW2 would be equal to dSoft*dX2 (the outputs of the ReLU gate). What's the chain rule here? Why isn't dSoft equal to 1? Because dL/dL would be 1?
The softmax function outputs a number given an input x.
What dSoft means is that you're computing the derivative of the softmax loss with respect to its input x (the scores). Then, to calculate the derivative with respect to W of the last layer, you use the chain rule, i.e. dL/dW = dsoftmax/dx * dx/dW. Note that x = W*x_prev + b, where x_prev is the input to the last layer. Therefore dx/dW is just x_prev and dx/db is just 1, which means that dL/dW (or simply dW) is dsoftmax/dx * x_prev and dL/db (or simply db) is dsoftmax/dx * 1. Note that here dsoftmax/dx is the dSoft we defined earlier.
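For concreteness, here is a sketch of the backward pass through the last affine layer in NumPy; the cs231n-style shapes (rows are the N samples) and the exact names are assumptions on my part:

import numpy as np

# Backward pass through the last affine layer, scores = x_prev @ W2 + b2.
# dSoft is dL/dscores returned by softmax_loss().
def last_layer_backward(dSoft, x_prev, W2):
    dW2 = x_prev.T @ dSoft      # chain rule: dL/dW2 = x_prev^T * dL/dscores
    db2 = dSoft.sum(axis=0)     # dscores/db2 = 1, so just sum dSoft over the batch
    dx_prev = dSoft @ W2.T      # gradient passed back into the ReLU gate
    return dx_prev, dW2, db2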

What is the difference between RSE and MSE?

I am going through Introduction to Statistical Learning in R by Hastie and Tibshirani. I came across two concepts: RSE and MSE. My understanding is like this:
RSE = sqrt(RSS/(N-2))
MSE = RSS/N
Now I am building 3 models for a problem and need to compare them. While MSE comes to me intuitively, I was also wondering whether there is any use in calculating RSS/(N-2), which according to the above is RSE^2.
I am not sure which to use where.
RSE is an estimate of the standard deviation of the residuals, and therefore also of the observations around the regression line, which is why it is computed as sqrt(RSS/df). In your case of a simple linear model, df = N - 2, because two parameters are estimated.
MSE is the mean squared error observed in your models, and it's usually calculated on a test set to compare the predictive accuracy of your fitted models. Since we're concerned with the mean error of the model, we divide by N.
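As an illustration, RSE and the (training) MSE for a simple linear fit might be computed like this; the synthetic data and names are made up for the example:

import numpy as np

# Toy data for a simple linear fit y ~ b0 + b1*x (values are invented).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.5, 50)

b1, b0 = np.polyfit(x, y, 1)             # slope, intercept
RSS = np.sum((y - (b0 + b1 * x)) ** 2)
n = len(y)

RSE = np.sqrt(RSS / (n - 2))             # residual standard error: 2 fitted parameters
MSE = RSS / n                            # training MSE; for comparing models, compute it on a test set
print(RSE, MSE)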
I think RSE ⊂ MSE (i.e. RSE is a special case of the MSE idea),
with MSE = RSS / degrees of freedom.
The MSE for a single set of data (X1, X2, ..., Xn) would be RSS over N,
or more accurately RSS/(N-1)
(since your freedom to vary is reduced by one once the mean has used up a degree of freedom).
But in a linear regression of Y on X with two terms (intercept and slope), the degrees of freedom are reduced by both, thus N-2,
so your MSE = RSS/(N-2),
and the square root of this is what is called the RSE.
And in an over-parameterised model, meaning one with a collection of many ßs (more than 2 terms: y ~ ß0 + ß1*X1 + ß2*X2 + ...), one can even penalise the model by reducing the denominator by the number of parameters:
MSE = RSS/(N-p) (p is the number of fitted parameters)

Statistical method to know when enough performance test iterations have been performed

I'm doing some performance/load testing of a service. Imagine the test function like this:
bytesPerSecond = test(filesize: 10MB, concurrency: 5)
Using this, I'll populate a table of results for different sizes and levels of concurrency. There are other variables too, but you get the idea.
The test function spins up concurrency requests and tracks throughput. This rate starts off at zero, then spikes and dips until it eventually stabilises on the 'true' value.
However it can take a while for this stability to occur, and there are a lot of combinations of input to evaluate.
How can the test function decide when it's performed enough samples? By enough, I suppose I mean that the result isn't going to change beyond some margin if testing continues.
I remember reading an article about this a while ago (from one of the jsperf authors) that discussed a robust method, but I cannot find the article any more.
One simple method would be to compute the standard deviation over a sliding window of values. Is there a better approach?
IIUC, you're describing the classic problem of estimating the confidence interval of the mean with unknown variance. That is, suppose you have n results, x1, ..., xn, where each xi is a sample from some process of which you don't know much: not the mean, not the variance, and not the distribution's shape. For some required confidence level, you'd like to know whether n is large enough so that, with high probability, the true mean lies within the interval around your sample mean.
(Note that with relatively-weak conditions, the Central Limit Theorem guarantees that the sample mean will converge to a normal distribution, but to apply it directly you would need the variance.)
So, in this case, the classic solution to determine if n is large enough, is as follows:
Start by calculating the sample mean μ = ∑ᵢ xᵢ / n. Also calculate the normalized sample variance s² = ∑ᵢ (xᵢ − μ)² / (n − 1).
Depending on the size of n:
If n > 30, the confidence interval is approximated as μ ± z(α/2) · s / √n, where z(α/2) is the standard-normal critical value for your chosen α.
If n < 30, the confidence interval is approximated as μ ± t(α/2) · s / √n, using the Student t critical value with n − 1 degrees of freedom (available from any t table).
If the confidence interval is tight enough, stop. Otherwise, increase n (a sketch of this rule follows below).
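A sketch of the procedure above as a stopping rule in Python; the relative-margin criterion and the 0.95 confidence level are illustrative choices, not part of the answer:

import numpy as np
from scipy import stats

def enough_samples(samples, rel_margin=0.01, confidence=0.95):
    # Stop when the CI half-width is within rel_margin of the sample mean.
    # rel_margin and confidence are illustrative choices.
    x = np.asarray(samples, dtype=float)
    n = len(x)
    if n < 2:
        return False
    mean = x.mean()
    s = x.std(ddof=1)                                        # sqrt of the normalized sample variance
    if n > 30:
        crit = stats.norm.ppf(1 - (1 - confidence) / 2)          # z value
    else:
        crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)   # t value
    half_width = crit * s / np.sqrt(n)
    return half_width <= rel_margin * abs(mean)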
Stability means the rate of change (the derivative) is zero or close to zero.
The test function spins up concurrency requests and tracks throughput.
This rate starts off at zero, then spikes and dips until it eventually
stabilises on the 'true' value.
I would track your past throughput values, for example the last X values or so. Based on these values, I would calculate the rate of change (the derivative of your throughput). If the derivative is close to zero, then your test is stable and I would stop it.
How to find X? I think that instead of a constant value such as 10, choosing a value based on the maximum number of tests is more suitable, for example:
X = max(10, max_test_count * 0.01)
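A rough sketch of that check; the window follows the formula above, while the tolerance is an illustrative choice:

import numpy as np

def is_stable(throughputs, max_test_count, rel_tol=0.01):
    # X follows the rule above; rel_tol is an illustrative tolerance.
    X = int(max(10, max_test_count * 0.01))
    if len(throughputs) < X:
        return False
    recent = np.asarray(throughputs[-X:], dtype=float)
    slope = np.polyfit(np.arange(X), recent, 1)[0]   # per-sample rate of change
    return abs(slope) <= rel_tol * abs(recent.mean())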

Numerical integration of a discontinuous function in multiple dimensions

I have a function f(x) = 1/(x + a + b*I*sign(x)) and I want to calculate the
integral of
dx dy dz f(x) f(y) f(z) f(x+y+z) f(x-y-z)
over the entire R^3 (b > 0 and a, b are of order unity). This is just a representative example - in practice I have n < 7 variables and 2n-1 instances of f(), n of them involving the n integration variables and n-1 of them involving some linear combination of the integration variables. At this stage I'm only interested in a rough estimate with a relative error of 1e-3 or so.
I have tried the following libraries :
Steven Johnson's cubature code: the hcubature algorithm works but is abysmally slow, taking hundreds of millions of integrand evaluations for even n=2.
HintLib: I tried adaptive integration with a Genz-Malik rule, the cubature routines, VEGAS and MISER with the Mersenne twister RNG. For n=3 only the first seems to be a somewhat viable option, but it again takes hundreds of millions of integrand evaluations for n=3 and relerr = 1e-2, which is not encouraging.
For the region of integration I have tried both approaches: Integrating over [-200, 200]^n (i.e. a region so large that it essentially captures most of the integral) and the substitution x = sinh(t) which seems to be a standard trick.
I do not have much experience with numerical analysis but presumably the difficulty lies in the discontinuities from the sign() term. For n=2 and f(x)f(y)f(x-y) there are discontinuities along x=0, y=0, x=y. These create a very sharp peak around the origin (with a different sign in the various quadrants) and sort of 'ridges' at x=0,y=0,x=y along which the integrand is large in absolute value and changes sign as you cross them. So at least I know which regions are important. I was thinking that maybe I could do Monte Carlo but somehow "tell" the algorithm in advance where to focus. But I'm not quite sure how to do that.
I would be very grateful if you had any advice on how to evaluate the integral with a reasonable amount of computing power or how to make my Monte Carlo "idea" work. I've been stuck on this for a while so any input would be welcome. Thanks in advance.
One thing you can do is to use a guiding function for your Monte Carlo integration: given an integral (I am writing it in 1D for simplicity) of ∫ f(x) dx, write it as ∫ f(x)/g(x) g(x) dx, and use g(x) as a distribution from which you sample x.
Since g(x) is arbitrary, construct it such that (1) it has peaks where you expect them in f(x), and (2) you can sample x from g(x) (e.g., a Gaussian, or 1/(1+x^2)).
Alternatively, you can use a Metropolis-type Markov chain MC. It will find the relevant regions of the integrand (almost) by itself.
Here are a couple of trivial examples.
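For the guiding-function approach, a minimal 1-D sketch could look like the following. The stand-in integrand |f(x)|^2 = 1/((x+a)^2 + b^2), the standard Cauchy proposal g(x) = 1/(π(1+x^2)), and the parameter values are assumptions made for illustration; the exact answer π/b makes the estimate easy to check:

import numpy as np

# Stand-in integrand |f(x)|^2 = 1/((x+a)^2 + b^2); its integral over R is pi/b.
# Proposal g is a standard Cauchy, which has fat enough tails for this integrand.
rng = np.random.default_rng(0)
a, b = 1.0, 1.0

def integrand(x):
    return 1.0 / ((x + a) ** 2 + b ** 2)

N = 1_000_000
x = rng.standard_cauchy(N)                 # samples drawn from g
g = 1.0 / (np.pi * (1.0 + x ** 2))         # density g(x) at those samples
w = integrand(x) / g                       # importance weights f(x)/g(x)
print(w.mean(), w.std() / np.sqrt(N))      # estimate of the integral (~pi/b) and its standard error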

Determining edge weights given a list of walks in a graph

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, such as in scenario 3, and in any case we can apply knowledge of the real world - such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimates, and that we also know what we don't know - e.g. when there's not enough data to tell a from b.
Seems like an application of linear algebra.
You have a set of linear equations to solve, with the variables being the lengths of the tasks (i.e. the edge weights).
For instance, suppose the task lengths were t1, t2, t3 for 3 tasks,
and you are given:
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra technique (e.g. Gaussian elimination, http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you whether there is a unique solution, no solution, or an infinite number of solutions (there are no other possibilities).
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and try solving it again (I believe this falls under perturbation theory). Matrices are notorious for radically changing behaviour with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
Or you can try introducing a 'slack' task in each walk (i.e. add more variables) and pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001, and minimize the sum of the s_i), using linear programming techniques.
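For example, using the asker's numbers, a non-negative least-squares fit (which also encodes the "lengths can't be negative" constraint) could look like this; the SciPy routine is just one possible choice, not something the answer prescribes:

import numpy as np
from scipy.optimize import nnls

# The asker's example written as A @ [a, b, c] = lengths. The system is
# inconsistent, so an exact solve fails; non-negative least squares gives
# an estimate instead.
A = np.array([[1.0, 0.0, 0.0],    # a         = 5
              [0.0, 1.0, 0.0],    # b         = 4
              [0.0, 1.0, 1.0],    # b + c     = 5
              [1.0, 1.0, 1.0]])   # a + b + c = 8
lengths = np.array([5.0, 4.0, 5.0, 8.0])

weights, misfit = nnls(A, lengths)
print(weights, misfit)            # estimated a, b, c and the residual norm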
Assume you have an unlimited number of arbitrary symbols to represent each edge (a, b, c, d, etc.).
w is the list of all the walks, each in the form 0,a,b,c,d,e together with its length (the leading 0 is explained at the end).
The procedure, roughly:
i = 1
while walk w[i] still contains more than one unknown edge:
    take its first unknown edge and replace it, everywhere in w, with the LENGTH of w[i] minus the other edges of w[i]
    move on to the next walk and repeat
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So:
a is the first unknown. Replace all instances of "a" with 50 - b - c - d - e.
New data:
50, 50 (the first walk is now trivially satisfied)
50 - d, 20 (i.e. d = 30)
0,c,e 10
And repeat until only one unknown is left in each remaining walk, and you finish! Alternatively, the first number can simply be subtracted from the length of each walk.
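A small sketch of the same elimination done symbolically; SymPy is my choice here rather than anything the answer specifies, and the walks are the ones from the example above:

from sympy import symbols, solve

# The three walks from the example, written as equations equal to zero.
a, b, c, d, e = symbols('a b c d e')
walks = [a + b + c + d + e - 50,
         a + c + b + e - 20,
         c + e - 10]
print(solve(walks, [a, b, c, d, e], dict=True))
# SymPy performs the same elimination: d = 30 comes out exactly, while a and c
# are expressed through the still-undetermined edges b and e.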
I'd forget about graphs and treat the lists of tasks as vectors - every task represented as a component whose value equals its cost (time to complete, in this case).
If tasks appear in different orders initially, that's where to use domain knowledge to bring them to a canonical form, and to assign multipliers if domain knowledge tells you that the ratio of costs will be substantially influenced by ordering / timing. Timing is implicit in the initial ordering, but you may have to make the adjustment factors a function of time (say, driving at lunch time vs. driving at midnight); the function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (how hard something is to do). You may need a functional language to do repeated rewrites of your vectors until there is nothing more that domain knowledge and rules can change.
With canonical vectors, consider just the presence or absence of each task (just 0|1 for this iteration) and look for minimal diffs - single-task diffs first - as these will provide estimates with a small number of variables. Keep doing this recursively, be ready to backtrack, and have a heuristic rule for the goodness or quality of the estimates so far. Keep track of good "rounds" that you backtracked from.
When you reach a minimal, irreducible state - you can't make any more diffs and all vectors have the same remaining tasks - you can do some basic statistics (variance, mean, median), look for big outliers, and look for ways to improve the initial domain-knowledge-based estimates that led to the canonical form. If you find a lot of them and can infer new rules, take them in and start the whole process over from the start.
Yes, this can cost a lot :-)