What is the best way to find the closest list to an expected list?

I am currently working on a program where I try to experimentally come up with an ordering of elements, then compare to a given ordering. For instance:
Experimental: A, C, B, F, E, D
Given: A, B, C, D, E, F
At the end I am trying to find some metric by which I can measure how close my experimental ordering is to the given ordering. I know that all of the same elements will be present in both. Is the number of elements in the correct position divided by the total number of elements in the list the best I can do? Thanks!

I think this largely depends on how you define similarity between two sequences. I will give you some ideas and describe the corresponding distance function for each.
Just the correct positions matter: count the number of correctly positioned elements (as you proposed in your question).
The difference to the desired position is important: for each element, sum up the absolute difference between its position in the experimental sequence and its position in the given sequence (this is known as the Spearman footrule).
The ranking between elements is important: count how many pairs of elements are in the correct relative order (similar to the Kendall rank correlation). Besides this one there are a couple more rank correlation measures.
The cost to transform one list into the other: calculate the minimum number of swaps needed to get from one list to the other. If you also care about how far elements are from their desired position, you could allow only swaps of adjacent elements. Computing this is a little more complicated, but this GeeksforGeeks article might help.
If you want a distance between 0 and 1 you would have to normalize the results. I am sure there are more measures; these are just the ones I could think of off the top of my head. A small sketch of all four follows.
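For concreteness, here is a minimal Python sketch of all four measures, assuming both lists contain exactly the same elements (as in the question); the helper names are my own, not standard library functions.

def correct_position_score(experimental, given):
    # Fraction of elements sitting in exactly the right position.
    return sum(e == g for e, g in zip(experimental, given)) / len(given)

def position_difference(experimental, given):
    # Sum of absolute position differences (Spearman footrule).
    target = {elem: i for i, elem in enumerate(given)}
    return sum(abs(i - target[e]) for i, e in enumerate(experimental))

def concordant_pairs(experimental, given):
    # Pairs of elements appearing in the same relative order in both lists.
    target = {elem: i for i, elem in enumerate(given)}
    ranks = [target[e] for e in experimental]
    n = len(ranks)
    return sum(ranks[i] < ranks[j] for i in range(n) for j in range(i + 1, n))

def adjacent_swap_distance(experimental, given):
    # Minimum number of adjacent swaps to transform one list into the
    # other: the number of inversions (the Kendall tau distance).
    target = {elem: i for i, elem in enumerate(given)}
    ranks = [target[e] for e in experimental]
    n = len(ranks)
    return sum(ranks[i] > ranks[j] for i in range(n) for j in range(i + 1, n))

experimental = list("ACBFED")
given = list("ABCDEF")
print(correct_position_score(experimental, given))  # 2/6: A and E match
print(position_difference(experimental, given))     # 6
print(concordant_pairs(experimental, given))        # 11 of 15 pairs
print(adjacent_swap_distance(experimental, given))  # 4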

Related

Coding for observation space using a list of values (openai gym)

I have a tuple of tuples as my observation space, where each item corresponds to an action for that space.
Think of a long panel with buttons that can each have multiple discrete values, where I can switch any one of them. If the panel has 10 items then my action space is
self.action_space = spaces.Discrete(10)
What I want to do is simplify my observation_space in such a way that I can provide my list of discrete values. How do I define that?
PS: my observation space is currently a list of 10 values (categorical), each distinct within its space. e.g., the first can take only A and B, the second can only take C and D, and so on.
You're looking for the MultiDiscrete space (spaces.MultiDiscrete).
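A minimal sketch for the panel described in the question, assuming each of the 10 slots takes one of 2 categorical values encoded as integer indices (the 2-values-per-slot assumption comes from the A/B, C/D example):

from gym import spaces

# Each of the 10 panel items takes one of 2 categorical values
# (A/B for the first slot, C/D for the second, ...), encoded as 0/1.
observation_space = spaces.MultiDiscrete([2] * 10)
action_space = spaces.Discrete(10)  # as in the question

print(observation_space.sample())   # e.g. [0 1 1 0 0 1 0 0 1 0]

If some slots have more than two possible values, the list passed to MultiDiscrete simply holds a different count per slot.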

Quadtrees: a common intersect method failing to handle a simple case

I am writing a simple GUI library and am using quadtrees to determine which, if any, objects are interacted with during a mouse event. I was looking through a number of quadtree libraries on github and they all contained a method for adding a rectangular object to a quadtree.
The method, in all cases, simply checked to see if the rectangle intersected with the given quadtree:
return (quadtree.x2 >= rect.x1
        and quadtree.x1 <= rect.x2
        and quadtree.y2 >= rect.y1
        and quadtree.y1 <= rect.y2)
However, this gives an unwanted result in one of the simplest cases: Imagine a 100x100 square area. I place four 50x50 square objects into the area with coordinates (0,0), (0,50), (50,0), and (50,50). If these objects had been placed into a 100x100 quadtree with a maximum capacity of one object, I would (visually) expect that the first layer of the quadtree would split and that the four resulting trees would each exactly contain one of the squares.
If I use the above method to determine which tree the squares are placed into, though, I find that each object intersects with all four trees. This would cause each of the trees to rapidly split until the maximum depth is reached.
The only way I see to avoid this is to use two checks:
return (quadtree.x2 > rect.x1
and quadtree.x1 < rect.x2
and quadtree.y2 > rect.y1
and quadtree.y1 < rect.y2)
or (quadtree.x2 == rect.x1
and quadtree.x1 == rect.x2
and quadtree.y2 == rect.y1
and quadtree.y1 == rect.y2)
(in the simplest case. Larger objects would have to be viewed within a bounding box since, for example, an object with coordinates (0,0), w=100, h=100 would belong in the upper-left quadtree as well.)
I could also calculate the overlap between the rectangles and the quadtrees to see if it's non-zero.
Am I missing something? It seems like this should be an ideal situation for a quadtree, yet, in most implementations, it's a huge mess.
I wouldn't call this an ideal situation, because the four rectangles effectively overlap by a tiny amount. For example, if we assume a (fictional) floating-point precision of 10^(-10), every 'point' is actually a small rectangle with side length 10^(-10), and thus the rectangles overlap by 10^(-10). This is why you get the deep tree.
But I also think the tree could be improved with a slightly modified overlap check. With your code, the sub-nodes all overlap by a tiny amount. It would work better if you excluded the minimum (or maximum) values, for example:
return (quadtree.x2 >= rect.x1
        and quadtree.x1 < rect.x2
        and quadtree.y2 >= rect.y1
        and quadtree.y1 < rect.y2)
So the lower left coordinate of a node is actually outside of that node. This would at least avoid points turning up in several nodes (such as the point (50,50)), and the lower left rectangle would be stored in only one node.
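A small runnable sketch (reusing the question's field names) showing the effect of the half-open test on the shared corner point (50,50) and on the lower-left square from the question:

from collections import namedtuple

Box = namedtuple("Box", "x1 y1 x2 y2")

def intersects(node, rect):
    # Half-open test: the node's minimum edges are exclusive.
    return (node.x2 >= rect.x1
            and node.x1 < rect.x2
            and node.y2 >= rect.y1
            and node.y1 < rect.y2)

# The four quadrants of a 100x100 tree.
quads = [Box(0, 0, 50, 50), Box(50, 0, 100, 50),
         Box(0, 50, 50, 100), Box(50, 50, 100, 100)]

point = Box(50, 50, 50, 50)   # the shared corner, as a degenerate box
square = Box(0, 0, 50, 50)    # the lower-left 50x50 object

print(sum(intersects(q, point) for q in quads))   # 1 quadrant, not 4
print(sum(intersects(q, square) for q in quads))  # 1 quadrant, not 4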

Rapidminer Classification

I am trying to solve a simple classification problem where the label has 12 different levels and I need to classify each example into one of these 12. However, I want my output to look like the image below:
http://i.stack.imgur.com/49USG.png
Here, assuming I set a confidence threshold of 20%, I want the output to contain, for each id, all the labels which are above 20%, ordered with the highest confidence first. If none of the labels are above 20%, then a default label.
More specifically, are there any existing operators in Rapidminer which could give such an output?
Whenever the Apply Model operator runs, it produces new special attributes corresponding to the confidences for the individual values of the label attribute. So if the label has values one, two, three, three new attributes will be created: confidence(one), confidence(two), confidence(three). It would then be possible to use the Generate Attributes operator to implement the logic that decides how to really classify each example. It would also be possible to use the Apply Threshold operator (with Create Threshold) to do something similar. It's impossible to give any more guidance unless you post a representative example with data.
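The decision logic itself is simple; here is a sketch in plain Python (not RapidMiner operators, and the names are mine) of what the Generate Attributes step would have to express:

# Keep labels whose confidence clears the threshold, sorted highest
# first, and fall back to a default label otherwise.
def select_labels(confidences, threshold=0.20, default="unknown"):
    above = [(label, c) for label, c in confidences.items() if c > threshold]
    above.sort(key=lambda lc: lc[1], reverse=True)
    return [label for label, _ in above] or [default]

print(select_labels({"one": 0.45, "two": 0.25, "three": 0.05}))  # ['one', 'two']
print(select_labels({"one": 0.10, "two": 0.08, "three": 0.02}))  # ['unknown']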

Question marks appear in Distance (similarity measure object)

I am trying to use KLDivergence for measuring similarity between text data. But, although other similarity measures work fine, KLDivergence returns question marks in the result. What can cause this problem?
If any of the attributes has a value of zero, the KL divergence will produce a missing result (the ?). This is most likely caused by a division by zero.
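To see why, here is the standard definition in a minimal sketch (plain Python, not RapidMiner's internals):

import math

# D(P || Q) = sum(p * log(p / q)); any q == 0 with p > 0 divides by zero.
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # works fine
# print(kl_divergence([0.5, 0.5], [1.0, 0.0])) # ZeroDivisionError

A common workaround is to smooth the data, e.g. add a tiny epsilon to every attribute value and renormalize before computing the divergence.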

Center of a cluster of points and track shape

I have plots of points which look like this.
The tracks which these points form can be a circle or an ellipse. Clearly the center of the circular tracks in the two images above are different.
How can I find the center point of these tracks (circular/elliptical)? I want to find the (x,y) coordinates of the center; it does not have to be a point in the plotted data set, i.e., I don't want a medoid.
EDIT: Also, is there any way that I can find an equation for a circle/ellipse that envelops a majority of these points? In the elliptical track, I've added an ellipse that envelops the points on the track. The values were calculated by trial and error, and the center by eyeballing the plot. How can I do this programmatically?
See the smallest circle problem, and here is a paper (PDF download available) on the smallest ellipse problem. Both have O(N) algorithms and should be able to provide the formula for the circle and its area, from which you can get the center. However, they focus on enclosing all of the points. To get around that, you'll need to remove a number of the bounding points, which you should get from the algorithms as well. Unfortunately, it's pretty much up to you as to what qualifies as a good enough solution.
A fast and simple randomized solution is:
1. Randomly divide the set of points into k sets of N/k points each.
2. Run the smallest circle/ellipse algorithm on each set.
3. For each of the k sets, pick at least 1 but no more than m bounding points to remove from the main point set.
4. Return to step 1, t times.
5. Return the result of the circle/ellipse algorithm on the remaining points.
The algorithm removes between k and mk bounding points every pass at a cost of O(N). For your purposes you'll probably want to remove some percentage of the bounding points; 1-25% seems like a good starting point. This solution assumes that k is very small compared to N, otherwise you'll be removing too many points.
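A sketch of this loop, where smallest_enclosing_circle is a hypothetical stand-in for any O(N) smallest-circle routine (e.g. Welzl's algorithm) that returns the circle together with its bounding points; it is not implemented here:

import random

def shave_and_fit(points, smallest_enclosing_circle, k=4, m=2, t=3):
    # smallest_enclosing_circle(pts) is assumed to return
    # (center, radius, bounding_points); it is NOT defined here.
    remaining = set(points)
    for _ in range(t):                       # step 4: repeat t times
        pts = list(remaining)
        random.shuffle(pts)                  # step 1: random partition into k sets
        chunk = max(1, len(pts) // k)
        for i in range(k):
            subset = pts[i * chunk:(i + 1) * chunk]
            if len(subset) < 3:
                continue
            _, _, bounding = smallest_enclosing_circle(subset)   # step 2
            for p in list(bounding)[:m]:     # step 3: drop up to m bounding points
                remaining.discard(p)
    return smallest_enclosing_circle(list(remaining))            # step 5: final fit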
A slower but likely better algorithm is useful if you want to repeatedly remove one or all of the bounding points from the smallest ellipse, recalculate the smallest ellipse, then remove the bounding points again.
You can do this with a tree in which each parent node stores the bounding points (kept as a set for faster removal) of the smallest enclosing ellipse of its children. The maximum number of bounding points should be no more than k (which I believe is 9 for an ellipse, compared to 3 for a circle). Removing a point from the data structure then costs O(k log N): recalculating the smallest ellipse is O(k) for each affected parent, and there are O(log N) such parents. Removing m points should therefore be O(mk log N). You might also consider recalculating the area of the ellipse after every removed point, removing points one at a time at a total cost of O(Nk log N) until only three points are left. You could then analyze the area data to determine which ellipse should be used. A simple choice would be the ellipse whose area is closest to the average area of all of the ellipses created, though that may not be exactly what you seek. It also might be too slow, in which case I recommend a single pass of the faster algorithm.
This looks like an instance of Robust Ellipse Fitting. Check this paper: Outlier Elimination for Robust Ellipse and Ellipsoid Fitting, http://arxiv.org/pdf/0910.4610.pdf
A first rough and easy solution is provided by the ellipse of inertia (the 2D version of the ellipsoid of inertia, http://en.wikipedia.org/wiki/Moment_of_inertia#Inertia_ellipsoid). Its center is just the centroid, and its axes are given by the eigenvectors/eigenvalues of the 2x2 inertia matrix.
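A minimal numpy sketch of that estimate, using synthetic data (the sample covariance matrix plays the role of the 2x2 inertia matrix here):

import numpy as np

# Synthetic elliptical cloud: stretched along x and shifted to (5, 2).
rng = np.random.default_rng(0)
points = rng.standard_normal((200, 2)) @ np.diag([3.0, 1.0]) + [5.0, 2.0]

center = points.mean(axis=0)           # centroid = ellipse center
cov = np.cov(points, rowvar=False)     # 2x2 matrix of second moments
eigvals, eigvecs = np.linalg.eigh(cov) # axis directions and squared scales

print("center:", center)
print("axis directions (columns):\n", eigvecs)
print("relative axis lengths:", np.sqrt(eigvals))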