Determining edge weights given a list of walks in a graph - language-agnostic

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, such as in scenario 3, and in any case we can apply knowledge of the real world - such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimations and that we also know what we don't know - eg. when there's not enough data to tell a from b.

Seems like an application of linear algebra.
You have a set of linear equations which you need to solve. The variables being the lengths of the tasks (or edge weights).
For instance if the tasks lengths were t1, t2, t3 for 3 tasks.
And you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra techniques (for eg: http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you if there is a unique solution, no solution or an infinite number of solutions (no other possibilities are possible).
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and try solving it again. (I believe falls under Perturbation Theory). Matrices are notorious for radically changing behavior with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
Or maybe you can try introducing some 'slack' task in each walk (i.e add more variables) and try to pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001 and minimize sum of s_i), using Linear Programming Techniques.

Assume you have an unlimited number of arbitrary characters to represent each edge. (a,b,c,d etc)
w is a list of all the walks, in the form of 0,a,b,c,d,e etc. (the 0 will be explained later.)
i = 1
if #w[i] ~= 1 then
replace w[2] with the LENGTH of w[i], minus all other values in w.
repeat forever.
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So:
a is the first. Replace all instances of "a" with 50, -b,-c,-d,-e.
New data:
50, 50
50,-b,-d, 20
0,c,e 10
And, repeat until one value is left, and you finish! Alternatively, the first number can simply be subtracted from the length of each walk.

I'd forget about graphs and treat lists of tasks as vectors - every task represented as a component with value equal to it's cost (time to complete in this case.
In tasks are in different orderes initially, that's where to use domain knowledge to bring them to a cannonical form and assign multipliers if domain knowledge tells you that the ratio of costs will be synstantially influenced by ordering / timing. Timing is implicit initial ordering but you may have to make a function of time just for adjustment factors (say drivingat lunch time vs driving at midnight). Function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (hardnes of doing something). You may need a functional language to do repeated rewrites of your vectors till there's nothing more that romain knowledge and rules can change.
With cannonical vectors consider just presence and absence of task (just 0|1 for this iteratioon) and look for minimal diffs - single task diffs first - that will provide estimates which small number of variables. Keep doing this recursively, be ready to back track and have a heuristing rule for goodness or quality of estimates so far. Keep track of good "rounds" that you backtraced from.
When you reach minimal irreducible state - dan't many any more diffs - all vectors have the same remaining tasks then you can do some basic statistics like variance, mean, median and look for big outliers and ways to improve initial domain knowledge based estimates that lead to cannonical form. If you finsd a lot of them and can infer new rules, take them in and start the whole process from start.
Yes, this can cost a lot :-)

Related

Would this be a valid Implementation of an ordinal CrossEntropy?

Would this be a valid implementation of a cross entropy loss that takes the ordinal structure of the GT y into consideration? y_hat is the prediction from a neural network.
ce_loss = F.cross_entropy(y_hat, y, reduction="none")
distance_weight = torch.abs(y_hat.argmax(1) - y) + 1
ordinal_ce_loss = torch.mean(distance_weight * ce_loss)
I'll attempt to answer this question by first fully defining the task, since the question is a bit sparse on details.
I have a set of ordinal classes (e.g. first, second, third, fourth,
etc.) and I would like to predict the class of each data example from
among this set. I would like to define an entropy-based loss-function
for this problem. I would like this loss function to weight the loss
between a predicted class torch.argmax(y_hat) and the true class y
according to the ordinal distance between the two classes. Does the
given loss expression accomplish this?
Short answer: sure, it is "valid". You've roughly implemented L1-norm ordinal class weighting. I'd question whether this is truly the correct weighting strategy for this problem.
For instance, consider that for a true label n, the bin n response is weighted by 1, but the bin n+1 and n-1 responses are weighted by 2. This means that a lot more emphasis will be placed on NOT predicting false positives than on correctly predicting true positives, which may imbue your model with some strange bias.
It also means that examples on the edge will result in a larger total sum of weights, meaning that you'll be weighting examples where the true label is say "first" or "last" more highly than the intermediate classes. (Say you have 5 classes: 1,2,3,4,5. A true label of 1 will require distance_weight of [1,2,3,4,5], the sum of which is 15. A true label of 3 will require distance_weight of [3,2,1,2,3], the sum of which is 11.
In general, classification problems and entropy-based losses are underpinned by the assumption that no set of classes or categories is any more or less related than any other set of classes. In essence, the input data is embedded into an orthogonal feature space where each class represents one vector in the basis. This is quite plainly a bad assumption in your case, meaning that this embedding space is probably not particularly elegant: thus, you have to correct for it with sort of a hack-y weight fix. And in general, this assumption of class non-correlation is probably not true in a great many classification problems (consider e.g. the classic ImageNet classification problem, wherein the class pairs [bus,car], and [bus,zebra] are treated as equally dissimilar. But this is probably a digression into the inherent lack of usefulness of strict ontological structuring of information which is outside the scope of this answer...)
Long Answer: I'd highly suggest moving into a space where the ordinal value you care about is instead expressed in a continuous space. (In the first, second, third example, you might for instance output a continuous value over the range [1,max_place]. This allows you to benefit from loss functions that already capture well the notion that predictions closer in an ordered space are better than predictions farther away in an ordered space (e.g. MSE, Smooth-L1, etc.)
Let's consider one more time the case of the [first,second,third,etc.] ordinal class example, and say that we are trying to predict the places of a set of runners in a race. Consider two races, one in which the first place runner wins by 30% relative to the second place runner, and the second in which the first place runner wins by only 1%. This nuance is entirely discarded by the ordinal discrete classification. In essence, the selection of an ordinal set of classes truncates the amount of information conveyed in the prediction, which means not only that the final prediction is less useful, but also that the loss function encodes this strange truncation and binarization, which is then reflected (perhaps harmfully) in the learned model. This problem could likely be much more elegantly solved by regressing the finishing position, or perhaps instead by regressing the finishing time, of each athlete, and then performing the final ordinal classification into places OUTSIDE of the network training.
In conclusion, you might expect a well-trained ordinal classifier to produce essentially a normal distribution of responses across the class bins, with the distribution peak on the true value: a binned discretization of a space that almost certainly could, and likely should, be treated as a continuous space.

perMANOVA for small sample size

I have data of 6 groups with sample size of n = 2, 10, 2, 9, 3, 1 and I want to perform Permutational multivariate analysis of variance (PERMANOVA) on these data.
My question is: Is it correct to run perMANOVA on these data with the small sample size? The results look strange for me because the group of n = 1 showed insignificant difference to other groups although the graphical representation of the groups clearly show a difference.
Thank you
I would not trust any result with group of n=1 because there is no source of variation to define difference among groups.
I have also received some answers from other platforms. I put them here for information:
The sample size is simply too small to yield a stable solution via manova. Note that the n = 1 cell contributes a constant value for that cell's mean, no matter what you do by way of permutations.
Finally, note that the effective per-cell sample size with unequal cell n for one-way designs tracks well to the harmonic mean of n. For your data set as it stands, that means an "effective" per-cell n of about 2.4. Unless differences are gigantic on the DV set, no procedure (parametric or exact/permutation) will have the statistical power to detect differences with that size.
MANOVA emphasizes the attribute scattering in the study group and the logic of this analysis is based on the scattering of scores. It is not recommended to use small groups with one or more people (I mean less than 20 people) to perform parametric tests such as MANOVA. In my opinion, use non-parametric tests to examine small groups.

Indirect Kalman Filter for Inertial Navigation System

I'm trying to implement an Inertial Navigation System using an Indirect Kalman Filter. I've found many publications and thesis on this topic, but not too much code as example. For my implementation I'm using the Master Thesis available at the following link:
https://fenix.tecnico.ulisboa.pt/downloadFile/395137332405/dissertacao.pdf
As reported at page 47, the measured values from inertial sensors equal the true values plus a series of other terms (bias, scale factors, ...).
For my question, let's consider only bias.
So:
Wmeas = Wtrue + BiasW (Gyro meas)
Ameas = Atrue + BiasA. (Accelerometer meas)
Therefore,
when I propagate the Mechanization equations (equations 3-29, 3-37 and 3-41)
I should use the "true" values, or better:
Wmeas - BiasW
Ameas - BiasA
where BiasW and BiasA are the last available estimation of the bias. Right?
Concerning the update phase of the EKF,
if the measurement equation is
dzV = VelGPS_est - VelGPS_meas
the H matrix should have an identity matrix in corrispondence of the velocity error state variables dx(VEL) and 0 elsewhere. Right?
Said that I'm not sure how I have to propagate the state variable after update phase.
The propagation of the state variable should be (in my opinion):
POSk|k = POSk|k-1 + dx(POS);
VELk|k = VELk|k-1 + dx(VEL);
...
But this didn't work. Therefore I've tried:
POSk|k = POSk|k-1 - dx(POS);
VELk|k = VELk|k-1 - dx(VEL);
that didn't work too... I tried both solutions, even if in my opinion the "+" should be used. But since both don't work (I have some other error elsewhere)
I would ask you if you have any suggestions.
You can see a snippet of code at the following link: http://pastebin.com/aGhKh2ck.
Thanks.
The difficulty you're running into is the difference between the theory and the practice. Taking your code from the snippet instead of the symbolic version in the question:
% Apply corrections
Pned = Pned + dx(1:3);
Vned = Vned + dx(4:6);
In theory when you use the Indirect form you are freely integrating the IMU (that process called the Mechanization in that paper) and occasionally running the IKF to update its correction. In theory the unchecked double integration of the accelerometer produces large (or for cheap MEMS IMUs, enormous) error values in Pned and Vned. That, in turn, causes the IKF to produce correspondingly large values of dx(1:6) as time evolves and the unchecked IMU integration runs farther and farther away from the truth. In theory you then sample your position at any time as Pned +/- dx(1:3) (the sign isn't important -- you can set that up either way). The important part here is that you are not modifying Pned from the IKF because both are running independent from each other and you add them together when you need the answer.
In practice you do not want to take the difference between two enourmous double values because you will lose precision (because many of the bits of the significand were needed to represent the enormous part instead of the precision you want). You have grasped that in practice you want to recursively update Pned on each update. However, when you diverge from the theory this way, you have to take the corresponding (and somewhat unobvious) step of zeroing out your correction value from the IKF state vector. In other words, after you do Pned = Pned + dx(1:3) you have "used" the correction, and you need to balance the equation with dx(1:3) = dx(1:3) - dx(1:3) (simplified: dx(1:3) = 0) so that you don't inadvertently integrate the correction over time.
Why does this work? Why doesn't it mess up the rest of the filter? As it turns out, the KF process covariance P does not actually depend on the state x. It depends on the update function and the process noise Q and so on. So the filter doesn't care what the data is. (Now that's a simplification, because often Q and R include rotation terms, and R might vary based on other state variables, etc, but in those cases you are actually using state from outside the filter (the cumulative position and orientation) not the raw correction values, which have no meaning by themselves).

Generalization functions for Q-Learning

I have to do some work with Q Learning, about a guy that has to move furniture around a house (it's basically that). If the house is small enough, I can just have a matrix that represents actions/rewards, but as the house size grows bigger that will not be enough. So I have to use some kind of generalization function for it, instead. My teacher suggests I use not just one, but several ones, so I could compare them and so. What you guys recommend?
I heard that for this situation people are using Support Vector Machines, also Neural Networks. I'm not really inside the field so I can't tell. I had in the past some experience with Neural Networks, but SVM seem a lot harder subject to grasp. Are there any other methods that I should look for? I know there must be like a zillion of them, but I need something just to start.
Thanks
Just as a refresher of terminology, in Q-learning, you are trying to learn the Q-functions, which depend on the state and action:
Q(S,A) = ????
The standard version of Q-learning as taught in most classes tells you that you for each S and A, you need to learn a separate value in a table and tells you how to perform Bellman updates in order to converge to the optimal values.
Now, lets say that instead of table you use a different function approximator. For example, lets try linear functions. Take your (S,A) pair and think of a bunch of features you can extract from them. One example of a feature is "Am I next to a wall," another is "Will the action place the object next to a wall," etc. Number these features f1(S,A), f2(S,A), ...
Now, try to learn the Q function as a linear function of those features
Q(S,A) = w1 * f1(S,A) + w2*f2(S,A) ... + wN*fN(S,A)
How should you learn the weights w? Well, since this is a homework, I'll let you think about it on your own.
However, as a hint, lets say that you have K possible states and M possible actions in each state. Lets say you define K*M features, each of which is an indicator of whether you are in a particular state and are going to take a particular action. So
Q(S,A) = w11 * (S==1 && A == 1) + w12 * (S == 1 && A == 2) + w21 * (S==2 && A==3) ...
Now, notice that for any state/action pair, only one feature will be 1 and the rest will be 0, so Q(S,A) will be equal to the corresponding w and you are essentially learning a table. So, you can think of the standard, table Q-learning as a special case of learning with these linear functions. So, think of what the normal Q-learning algorithm does, and what you should do.
Hopefully you can find a small basis of features, much fewer than K*M, that will allow you to represent your space well.

What is the proper method of constraining a pseudo-random number to a smaller range?

What is the best way to constrain the values of a PRNG to a smaller range? If you use modulus and the old max number is not evenly divisible by the new max number you bias toward the 0 through (old_max - new_max - 1). I assume the best way would be something like this (this is floating point, not integer math)
random_num = PRNG() / max_orginal_range * max_smaller_range
But something in my gut makes me question that method (maybe floating point implementation and representation differences?).
The random number generator will produce consistent results across hardware and software platforms, and the constraint needs to as well.
I was right to doubt the pseudocode above (but not for the reasons I was thinking). MichaelGG's answer got me thinking about the problem in a different way. I can model it using smaller numbers and test every outcome. So, let's assume we have a PRNG that produces a random number between 0 and 31 and you want the smaller range to be 0 to 9. If you use modulus you bias toward 0, 1, 2, and 3. If you use the pseudocode above you bias toward 0, 2, 5, and 7. I don't think there can be a good way to map one set into the other. The best that I have come up with so far is to regenerate the random numbers that are greater than old_max/new_max, but that has deep problems as well (reducing the period, time to generate new numbers until one is in the right range, etc.).
I think I may have naively approached this problem. It may be time to start some serious research into the literature (someone has to have tackled this before).
I know this might not be a particularly helpful answer, but I think the best way would be to conceive of a few different methods, then trying them out a few million times, and check the result sets.
When in doubt, try it yourself.
EDIT
It should be noted that many languages (like C#) have built in limiting in their functions
int maximumvalue = 20;
Random rand = new Random();
rand.Next(maximumvalue);
And whenever possible, you should use those rather than any code you would write yourself. Don't Reinvent The Wheel.
This problem is akin to rolling a k-sided die given only a p-sided die, without wasting randomness.
In this sense, by Lemma 3 in "Simulating a dice with a dice" by B. Kloeckner, this waste is inevitable unless "every prime number dividing k also divides p". Thus, for example, if p is a power of 2 (and any block of random bits is the same as rolling a die with a power of 2 number of faces) and k has prime factors other than 2, the best you can do is get arbitrarily close to no waste of randomness, such as by batching multiple rolls of the p-sided die until p^n is "close enough" to a power of k.
Let me also go over some of your concerns about regenerating random numbers:
"Reducing the period": Besides batching of bits, this concern can be dealt with in several ways:
Use a PRNG with a bigger "period" (maximum cycle length).
Add a Bays–Durham shuffle to the PRNG's implementation.
Use a "true" random number generator; this is not trivial.
Employ randomness extraction, which is discussed in Devroye and Gravel 2015-2020 and in my Note on Randomness Extraction. However, randomness extraction is pretty involved.
Ignore the problem, especially if it isn't a security application or serious simulation.
"Time to generate new numbers until one is in the right range": If you want unbiased random numbers, then any algorithm that does so will generally have to run forever in the worst case. Again, by Lemma 3, the algorithm will run forever in the worst case unless "every prime number dividing k also divides p", which is not the case if, say, k is 10 and p is 32.
See also the question: How to generate a random integer in the range [0,n] from a stream of random bits without wasting bits?, especially my answer there.
If PRNG() is generating uniformly distributed random numbers then the above looks good. In fact (if you want to scale the mean etc.) the above should be fine for all purposes. I guess you need to ask what the error associated with the original PRNG() is, and whether further manipulating will add to that substantially.
If in doubt, generate an appropriately sized sample set, and look at the results in Excel or similar (to check your mean / std.dev etc. for what you'd expect)
If you have access to a PRNG function (say, random()) that'll generate numbers in the range 0 <= x < 1, can you not just do:
random_num = (int) (random() * max_range);
to give you numbers in the range 0 to max_range?
Here's how the CLR's Random class works when limited (as per Reflector):
long num = maxValue - minValue;
if (num <= 0x7fffffffL) {
return (((int) (this.Sample() * num)) + minValue);
}
return (((int) ((long) (this.GetSampleForLargeRange() * num))) + minValue);
Even if you're given a positive int, it's not hard to get it to a double. Just multiply the random int by (1/maxint). Going from a 32-bit int to a double should provide adequate precision. (I haven't actually tested a PRNG like this, so I might be missing something with floats.)
Psuedo random number generators are essentially producing a random series of 1s and 0s, which when appended to each other, are an infinitely large number in base two. each time you consume a bit from you're prng, you are dividing that number by two and keeping the modulus. You can do this forever without wasting a single bit.
If you need a number in the range [0, N), then you need the same, but instead of base two, you need base N. It's basically trivial to convert the bases. Consume the number of bits you need, return the remainder of those bits back to your prng to be used next time a number is needed.