Locating all elements between starting and ending points, given by value (not index) - language-agnostic

The problem is as follows:
I would be given a set of x and y coordinates (a coordinate array of around 30 to 40 thousand points) of a long rope. The rope is lying on the ground and can be in any shape.
Now I would be given a start point (essentially an x and y coordinate) and an ending point.
What is an efficient way to determine which of the coordinates in the above-mentioned array lie between the start and end points?
Exhaustive searching, i.e. looping over all 40k points, is not an acceptable solution (as mentioned on the question paper).
A little margin for error is acceptable.

We need to find the start point in the array, then the end point. For each, we can think of the rope as describing a function of distance from that point, and we're looking for the lowest point on that distance graph. If one sampled point is a long way away and another is pretty close, we can make some kind of interpolated guess about where to search next.
distance
|-\
|  \      /\        /--\
|   \    /  \      /    \      /----
|    \  /    \    /      \    /
|     \/      \  /        \--/
|              \/
+--------------X--------------------- array index
In the representation above, we want to find "X"... we look at the distances at a few points, get an impression of the slope of the distance curve, possibly even the rate of change of that slope, to help guide our next bit of probing....
To refine the basic approach of doing binary or interpolated searches in areas where we know the distance values are low, we may be able to use the following:
- If we happen to be given the rope length and know the coordinate samples are equidistant along the rope, then we can calculate a maximum change in distance from our target point per sample.
- If we know the rope has a stiffness ensuring it can't loop in a trivially small diameter, then:
  - there's a known limit to how fast the slope of the curve can change, and
  - the distance curve converges to vertical on both sides of the 0 point.
- You could potentially cross-reference/combine distance with, or use instead, the direction of each point from the target: only at the target would the direction instantly change by ~180 degrees (how well the data points capture this still depends on the distance between adjacent samples and any stiffness of the rope).
Otherwise, there's always a risk that the target point may weirdly be encased by two very distant points, frustrating our whole searching algorithm (that must be what they mean about some margin for error - every now and then this search would have to revert to an O(N) brute-force search because the trend analysis fails).
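
To illustrate the first refinement, a minimal Python sketch, assuming the per-sample spacing `step` along the rope is known: from a probe at distance d, the target must be at least d/step samples away, so we can skip ahead rather than test every index.

    import math

    def find_point_index(points, target, step, tol=1e-9):
        # points: rope samples in order; consecutive samples are ~`step` apart
        # along the rope, so the straight-line distance to `target` can change
        # by at most `step` per sample.
        i, n = 0, len(points)
        while i < n:
            d = math.hypot(points[i][0] - target[0], points[i][1] - target[1])
            if d <= tol:                  # close enough: this is the target sample
                return i
            i += max(1, int(d / step))    # target is at least d/step samples away
        return None                       # trend failed; fall back to brute force

Worst case this is still O(N), but when the probe is far from the target it skips large stretches of the array in one step.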

For a one-time search, sometimes linear traversal is the simplest, fastest solution. Maybe that's the case for this problem.
Iterate through the ordered list of points until finding the start or end, and then collect points until hitting the other endpoint.
Now, if we expected to repeat the search, we could build an index of the points.
Edit: This presumes no additional constraints beyond those mentioned by @koool. Constraining the distance between the points would allow the hill-climbing approach described in @Tony's answer.
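
For what it's worth, a minimal Python sketch of that single pass, assuming the two endpoints are distinct and both occur in the list:

    def points_between(points, p1, p2):
        collecting, out = False, []
        for pt in points:
            if collecting:
                out.append(pt)
                if pt == p1 or pt == p2:   # reached the other endpoint
                    return out
            elif pt == p1 or pt == p2:     # found the first endpoint
                collecting = True
                out.append(pt)
        return out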

I don't think you can solve it accurately using anything other than exhaustive search. Consider cases where the rope is folded in half and the resulting doubled rope forms a spiral with the two ends at the centre.
However, if we assume that long portions of the rope lie in straight lines, then we can eliminate a lot of points based on a slope check:
if (abs(slope(x[i],   y[i],   x[i+1], y[i+1])
       -slope(x[i+1], y[i+1], x[i+2], y[i+2])) < tolerance)
    eliminate(x[i+1], y[i+1]);
This will reduce the search time significantly if large portions of the rope are straight, but the search will still be linear in the number of remaining points.
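
A Python sketch of that elimination pass, with one substitution: a cross-product collinearity test instead of explicit slopes, which avoids division by zero on vertical segments (`tol` is an arbitrary tolerance to tune).

    def decimate(points, tol=1e-6):
        out = [points[0]]
        for i in range(1, len(points) - 1):
            (x0, y0), (x1, y1), (x2, y2) = out[-1], points[i], points[i + 1]
            # cross product of (p1-p0) and (p2-p1): zero when the points are collinear
            cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
            if abs(cross) >= tol:
                out.append(points[i])
        out.append(points[-1])
        return out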

So basically, you've got an ordered list of the points that comprise the entire rope, you're given two arbitrary points from within that list, and you're tasked with returning the sublist that lies between those two points.
I'm going to make the assumption that the start and end points provided are guaranteed to coincide exactly with points within the ordered list (otherwise it introduces a host of issues, particularly if the rope may be arbitrarily thin and passes by the start/end points multiple times).
That means all you're really looking for are the indices of the two provided coordinates. Or the index of one, and the answer to "is the second coordinate to the right or to the left?".
A simple O(n) solution to that would be:
For each index in array:
    coord = array[index]
    if (coord == point1)
        startIndex = index
    if (coord == point2)
        endIndex = index
if (endIndex < startIndex)
    swap(startIndex, endIndex)
return array.sublist(startIndex, endIndex)
Or, if you wanted to optimize for repeated queries, I'd suggest a hashing-based approach where you map each coordinate to its index in the array. Something like:
// build the map (do this once, at init)
map = {}
For each index in array:
    coord = array[index]
    map[coord] = index

// find a sublist (do this for each set of start/end points)
startIndex = map[point1]
endIndex   = map[point2]
if (endIndex < startIndex)
    swap(startIndex, endIndex)
return array.sublist(startIndex, endIndex)
That's O(n) to build the map, but once it's built you can find the indices of any two points in O(1). Assuming an efficient hashmap, of course.
Note that if my assumption doesn't hold, then the same solutions are still usable, provided that as a first step you take the provided start and end points and locate the points in the array that best correspond to each one. As noted, unless you are given some constraints regarding the thickness of the rope then interpolating from an arbitrary coordinate to one that's actually part of the rope can only be guesswork at best.
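
For that first step, a nearest-neighbour index such as a k-d tree is one option. A minimal sketch, assuming SciPy is available and `rope_points` is the (n, 2) array of samples:

    import numpy as np
    from scipy.spatial import cKDTree

    def sublist_between(rope_points, p1, p2):
        points = np.asarray(rope_points)   # (n, 2) array of rope samples
        tree = cKDTree(points)             # build once; reuse for many queries
        _, i = tree.query(p1)              # index of the nearest rope sample to p1
        _, j = tree.query(p2)
        if j < i:
            i, j = j, i
        return points[i:j + 1]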

Related

Optimal approach for filtering out outlier map coordinates

I've got a list of map coordinates [lat, lon]. I would like to filter out those that, by some metric, are too far away from the rest of the main group - outliers.
A) A plain approach would be to take the median of lat and lon and then filter out whatever is further away from that median than said metric (e.g. distance) allows. This would only work for an absolute distance (e.g. 5 km).
B) An improvement to that approach could be to assume that no more than x% of the coordinate pairs are outliers (essentially setting a threshold there). Then I'd sort the coordinates array and remove the first x/2% and the final x/2%. Then find the max distance of that trimmed group of markers, which would be the distance from the first marker to the last marker in the array. Finally, apply A) with that calculated distance as the metric (so that the distance metric is not fixed).
This is simply an approach I came up with very briefly, so if it has any obvious downsides please let me know. In a more open discussion spirit, how would you go about solving this problem? Thanks for your input.
Working on the two coordinates separately is not the best approach because it is not rotation invariant.
You can try "onion peeling", i.e. building the convex hull of the point cloud and removing the hull vertices, repeatedly.
Read the paper "Onion-Peeling Outlier Detection in 2-D Data Sets" by Archit Harsh, John E. Ball & Pan Wei.
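
A minimal Python sketch of the peeling loop, assuming SciPy and treating [lat, lon] as planar coordinates (fine for small regions):

    import numpy as np
    from scipy.spatial import ConvexHull

    def peel(points, layers=1):
        pts = np.asarray(points, dtype=float)
        for _ in range(layers):
            if len(pts) < 4:                # ConvexHull needs at least 3 points
                break
            hull = ConvexHull(pts)
            keep = np.ones(len(pts), dtype=bool)
            keep[hull.vertices] = False     # drop the outermost "skin"
            pts = pts[keep]
        return pts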

Kalman Filter corrected by known path

I am trying to get filtered velocity/spatial data from noisy position data from a tracked vehicle. I have a set of noisy position/time data = (x_i, y_i, t_i) and a known curve along which the vehicle is traveling, curve = (x(s), y(s)), where s is the total distance along the curve. I can run a Kalman filter on the data, but I don't know how to constrain it to the 'road' without throwing out data that is too far from the road, which I don't want to do.
Alternatively, I'm trying to estimate the value of s along the constrained path from position data that is noisy in x and y.
Does anyone have an idea of how to merge the two types of data?
Thanks!
Do you understand what a Kalman filter does? Fundamentally, it assigns a probability to each possible state given just the observables. In simple cases, this doesn't use a priori knowledge. But in your case, you can simply set the off-road estimates to zero and renormalize the remaining probabilities.
Note: this isn't throwing out observables which are too far off the road, or even discarding outcomes which are too far off. It means that an apparent off-road position strongly increases the probabilities of outcomes on, but near the edge of, the road.
If you want the model to allow small excursions away from the road, you can use a fast decaying function to model the low but non-zero probability of a car being off the road.
You could have as states the distance s along the path, and the rate of change of s. The position observations X and Y will then be non-linear functions of the state (assuming your track is not a line) so you'll need to use an extended or unscented filter.
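
To make that concrete, a minimal extended-Kalman-filter sketch under stated assumptions: the state is [s, ds/dt], `curve(s)` is a hypothetical helper returning the road point (x(s), y(s)), and the measurement Jacobian is taken by finite differences.

    import numpy as np

    def ekf_along_path(curve, zs, dt, q=1.0, r=5.0):
        x = np.array([0.0, 0.0])                  # state: [s, speed along path]
        P = np.diag([100.0, 10.0])                # initial uncertainty
        F = np.array([[1.0, dt], [0.0, 1.0]])     # constant-velocity model in s
        Q = q * np.array([[dt**3/3, dt**2/2], [dt**2/2, dt]])
        R = r * np.eye(2)                         # x/y measurement noise
        est = []
        for z in zs:                              # z is a noisy (x, y) measurement
            x = F @ x                             # predict
            P = F @ P @ F.T + Q
            h = np.asarray(curve(x[0]))           # predicted road position
            eps = 1e-3                            # linearize h around predicted s
            dh = (np.asarray(curve(x[0] + eps)) - h) / eps
            H = np.column_stack([dh, np.zeros(2)])
            y = np.asarray(z) - h                 # update
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ y
            P = (np.eye(2) - K @ H) @ P
            est.append(x[0])                      # filtered distance along the curve
        return est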

Why are there no asin2() and acos2() functions similar to atan2()?

From my understanding, the atan2() function exists in programming languages because atan() itself cannot always determine the correct theta since the output is restricted to -pi/2 to pi/2.
If this is the case, then the same problem applies to both asin() and acos(), both of which also have restricted ranges, so why are there no asin2() and acos2() functions?
First off, note that the syntaxes of the two arctan functions are atan(y/x) and atan2(y, x). This distinction is important, because by not performing the division you provide additional information, most importantly the individual signs of x and y. If you know the individual x and y coordinates, the particular solution to the atan function can be found (i.e. the solution which takes into account the quadrant that (x,y) is in).
If you go from tan(θ) = y/x to sin(θ) = y/sqrt(x²+y²), then the inverse operation asin takes y and sqrt(x²+y²) and combines that to obtain some information about the angle. Here it doesn't matter whether we perform the division ourselves or let some hypothetical asin2 function handle it. The denominator is always positive, so the divided argument contains just as much information as the separate numerator and denominator. (At least in an IEEE environment where division by zero leads to a correctly-signed infinity.)
If you know the y coordinate and the hypotenuse sqrt(x²+y²), then you know the sine of the angle, but you cannot know the angle itself, since you cannot distinguish between negative and positive x values. Likewise, if you know the x coordinate and the hypotenuse, you know the cosine of the angle but you cannot know the sign of the y value.
So asin2 and acos2 are not mathematically feasible, at least not in an obvious way. If you had some kind of sign encoded into the hypotenuse, things might be different, but I can think of no situation where such a sign would arise naturally.
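
A quick numeric illustration of that ambiguity: one point per quadrant, same |x| and |y|. asin collapses quadrants 1/2, acos collapses 1/4 and 2/3, while atan2 distinguishes all four.

    import math

    for x, y in [(3, 4), (-3, 4), (-3, -4), (3, -4)]:   # one point per quadrant
        r = math.hypot(x, y)
        print(f"({x:2},{y:2})  asin={math.degrees(math.asin(y/r)):7.2f}"
              f"  acos={math.degrees(math.acos(x/r)):7.2f}"
              f"  atan2={math.degrees(math.atan2(y, x)):7.2f}")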
Because a hypothetical asin2(y, x) or acos2(y, x) would each take the same parameters as atan2(y, x) and give the same answer. Each would be equally valid, but we only need one such function.
The unclarity arises from the name (of atan2). It's a function that, given x and y, computes the angle (made by the line from the origin to this point) with the (positive) x-axis. A name like angle_from(x, y) would arguably have been more appropriate.
There are times when a function like "acos2" is needed, for example when performing rotations of vectors in 3D space. Under those circumstances, I hard-code my own acos2 function which simply performs the following checks:
x_perp = sqrt(x*x + y*y)
r      = sqrt(x*x + y*y + z*z)
if (x_perp .gt. 0.0d0) then
    phi = acos(x/x_perp)
else
    phi = 0.0d0
endif
if (y .lt. 0.0d0) phi = 2.0d0*pi - phi
theta = acos(z/r)
where theta and phi are the usual spherical coordinates and x, y, z are the Cartesian coordinates. The problem arises when y is negative: there needs to be a phase shift in phi. There is no such problem for theta.
I will explain in SIMPLE TERMS this way.
Refer to this image for the following explanation:
Task: Choose a function that will track the correct angle across a range -180 < θ < 180
Trial 1:
sin() is positive in the first and second quadrants, sin(30) = sin(150) = 0.5. It won't be easy to track quadrant change with sin().
Therefore, asin2() is not feasible.
Trial 2:
cos() is positive in the first and fourth quadrants, cos(60) = cos(300) = 0.5. Also, it won't be easy to track quadrant change with cos().
Therefore, acos2() is again not feasible.
Trial 3:
tan() is positive in the first and third quadrants, and in an interesting order.
It is positive in the 1st quadrant, negative in the 2nd, positive in the 3rd, negative in the 4th, and positive in the wrapped-around-1st quadrant.
such that tan(45) = 1, tan(135) = -1, tan(225) = 1, tan(315) = -1, and tan(360+45) = 1. Hurray! We can track quadrant change.
Notice that the unambiguous range is -180 < θ < 180. Also, note in my 45-degree-increment example above, if the sequence is 1,-1,.. the angle goes counter-clockwise, and if the sequence is -1,1,.. it goes clockwise. This idea should resolve directionality.
Therefore, atan2() BECOMES OUR CHOICE.

How to detect local maxima and curve windows correctly in semi complex scenarios?

I have a series of data and need to detect peak values in the series within a certain number of readings (window size) and excluding a certain level of background "noise." I also need to capture the starting and stopping points of the appreciable curves (i.e., when a curve starts ticking up and when it stops ticking down).
The data are high precision floats.
Here's a quick sketch that captures the most common scenarios that I'm up against visually:
One method I attempted was to pass a window of size X along the curve going backwards to detect the peaks. It started off working well, but I missed a lot of conditions I had not initially anticipated. Another method I started to work out was a growing window that would discover the longer-duration curves. Yet another was a more calculus-based approach that watches for velocity/gradient aspects. None seemed to hit the sweet spot, probably due to my lack of experience in statistical analysis.
Perhaps I need to use some kind of statistical analysis package to cover my bases rather than writing my own algorithm? Or would there be an efficient method for tackling this directly with SQL with some kind of local max technique? I'm simply not sure how to approach this efficiently. With each method I try, it seems that I keep missing various thresholds, detecting too many peak values, or not capturing entire events (reporting a peak datapoint too early in the reading process).
Ultimately this is implemented in Ruby and so if you could advise as to the most efficient and correct way to approach this problem with Ruby that would be appreciated, however I'm open to a language agnostic algorithmic approach as well. Or is there a certain library that would address the various issues I'm up against in this scenario of detecting the maximum peaks?
My idea is simple: after you get your window of interest you will need to find all the peaks in this window, which you can do by just comparing each value with the next. After this you will know where the peaks occur and you can decide which is the best peak.
I wrote a simple example in MATLAB to show my idea!
My example works on a waveform from an audio file :-)
waveFile = 'Chick_eco.wav';
[y, fs, nbits] = wavread(waveFile);
subplot(2,2,1); plot(y); legend('Original signal');

startIndex = 15000;
WindowSize = 100;
endIndex = startIndex + WindowSize - 1;
frame = y(startIndex:endIndex);
nframe = length(frame)

% find the peaks
peaks = zeros(nframe, 1);
k = 3;
while (k <= nframe - 1)
    y1 = frame(k - 1);
    y2 = frame(k);
    y3 = frame(k + 1);
    if (y2 > 0)
        if (y2 > y1 && y2 >= y3)
            peaks(k) = frame(k);
        end
    end
    k = k + 1;
end

peaks2 = peaks;
peaks2(peaks2 <= 0) = nan;
subplot(2,2,2); plot(frame); legend('Get Window Length = 100');
subplot(2,2,3); plot(peaks); legend('Where are the PEAKS');
subplot(2,2,4); plot(frame); legend('Peaks in the Window');
hold on; plot(peaks2, '*');

for j = 1:nframe
    if (peaks(j) > 0)
        fprintf('Local=%i\n', j);
        fprintf('Value=%i\n', peaks(j));
    end
end

% where the local maxima occur
[maxivalue, maxi] = max(peaks)
You can see all the peaks and where they occur:
Local=37
Value=3.266296e-001
Local=51
Value=4.333496e-002
Local=65
Value=5.049438e-001
Local=80
Value=4.286804e-001
Local=84
Value=3.110046e-001
I'll propose a couple of different ideas. One is to use discrete wavelets, the other is to use the geographer's concept of prominence.
Wavelets: Apply some sort of wavelet decomposition to your data. There are multiple choices, with Daubechies wavelets being the most widely used. You want the low frequency peaks. Zero out the high frequency wavelet elements, reconstruct your data, and look for local extrema.
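
A minimal sketch of that low-frequency reconstruction, assuming the PyWavelets package (pywt) and a 1-D array `data`; the wavelet and level are arbitrary choices to tune:

    import numpy as np
    import pywt

    def low_frequency(data, wavelet='db4', level=4):
        coeffs = pywt.wavedec(data, wavelet, level=level)
        coeffs[1:] = [np.zeros_like(c) for c in coeffs[1:]]  # zero the detail bands
        return pywt.waverec(coeffs, wavelet)[:len(data)]     # smooth reconstruction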
Prominence: Those noisy peaks and valleys are of key interest to geographers. They want to know exactly which of a mountain's multiple little peaks is tallest, and the exact location of the lowest point in the valley. Find the local minima and maxima in your data set. You should have a sequence of min/max/min/max/.../min. (You might want to add arbitrary endpoints that are lower than your global minimum.) Consider a min/max/min sequence. Classify each of these triples by the difference between the max and the larger of the two minima. Make a reduced sequence that replaces the smallest of these triples with the smaller of its two minima. Iterate until you get down to a single min/max/min triple. In your example, you want the next layer down, the min/max/min/max/min sequence.
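
If you'd rather not hand-roll that triple reduction, SciPy's find_peaks exposes peak prominence directly. A sketch, with thresholds you would have to tune to your data:

    import numpy as np
    from scipy.signal import find_peaks

    def detect_peaks(readings, min_prominence=0.5, min_width=5):
        data = np.asarray(readings, dtype=float)
        peaks, props = find_peaks(data, prominence=min_prominence, width=min_width)
        # left_ips/right_ips approximate where each event starts and stops
        return list(zip(props["left_ips"], peaks, props["right_ips"]))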
Note: I'm going to describe the algorithmic steps as if each pass were distinct. Obviously, in a specific implementation, you can combine steps where it makes sense for your application. For the purposes of my explanation, it makes the text a little more clear.
I'm going to make some assumptions about your problem:
- The windows of interest (the signals that you are looking for) cover a fraction of the entire data space (i.e., it's not one long signal).
- The windows have significant scope (i.e., they aren't one pixel wide on your picture).
- The windows have a minimum peak of interest (i.e., even if the signal exceeds the background noise, the peak must have an additional signal excess over the background).
- The windows will never overlap (i.e., each can be examined as a distinct sub-problem out of context of the rest of the signal).
Given those, you can first look through your data stream for a set of windows of interest. You can do this by making a first pass through the data: moving from left to right, look for noise threshold crossing points. If the signal was below the noise floor and exceeds it on the next sample, that's a candidate starting point for a window (vice versa for the candidate end point).
Now make a pass through your candidate windows: compare the scope and contents of each window with the values defined above. To use your picture as an example, the small peaks on the left of the image barely exceed the noise floor and do so for too short a time. However, the window in the center of the screen clearly has a wide time extent and a significant max value. Keep the windows that meet your minimum criteria, discard those that are trivial.
Now to examine your remaining windows in detail (remember, they can be treated individually). The peak is easy to find: pass through the window and keep the local max. With respect to the leading and trailing edges of the signal, you can see in the picture that you have a window that's slightly larger than the actual point at which the signal exceeds the noise floor. In this case, you can use a finite-difference approximation to calculate the first derivative of the signal. You know that the leading edge will be somewhat to the left of the window on the chart: look for a point at which the first derivative exceeds a positive noise floor of its own (the slope turns upwards sharply). Do the same for the trailing edge (which will always be to the right of the window).
Result: a set of time windows, the leading and trailing edges of the signals, and the peak that occurred in each window.
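
A minimal NumPy sketch of the first two passes described above (threshold crossings, then rejection of trivial windows, then the peak within each surviving window); the thresholds are placeholders:

    import numpy as np

    def find_events(signal, noise_floor, min_width=5, min_excess=0.5):
        x = np.asarray(signal, dtype=float)
        above = np.concatenate(([False], x > noise_floor, [False]))
        edges = np.flatnonzero(np.diff(above.astype(int)))
        starts, stops = edges[::2], edges[1::2]      # paired rise/fall crossings
        events = []
        for lo, hi in zip(starts, stops):
            peak = lo + int(np.argmax(x[lo:hi]))
            # keep only windows with enough extent and enough excess over the floor
            if hi - lo >= min_width and x[peak] >= noise_floor + min_excess:
                events.append((lo, peak, hi - 1))    # leading edge, peak, trailing edge
        return events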
It looks like the definition of a window is the range of x over which y is above the threshold, so use that to determine the size of the window. Within that window, locate the largest value, thus finding the peak.
If that fails, then what additional criteria do you have for defining a region of interest? You may need to nail down your implicit assumptions to something more precise than "that looks like a peak to me".

Computing which points (latitude, longitude) are within a certain distance in mysql?

There are two points A and B, and distances x (miles from A) and y (miles from B). Let the distance from A to B be N, so A is N miles away from B. How do I solve the problem: what are the points available that are (N + x + y) miles away from A? I'm not sure how to explain this any better. I really have no clue how to attack this problem. I read Fastest Way to Find Distance Between Two Lat/Long Points, and I believe the solution given there calculates the distance between two points, but I have no idea whether that solution could be applied to my problem, or if so, how.
If you are looking for an approximation algorithm, I suggest looking at a k-means algorithm or hierarchical clustering, and especially at monster curves, i.e. space-filling curves. First you can compute a minimal spanning tree of the graph and then remove the longest and most expensive edges. The tree then falls apart into many little trees, and you can use k-means to compute groups of points, i.e. clusters.
"The single-link k-clustering algorithm ... is precisely Kruskal's algorithm ... equivalent to finding an MST and deleting the k-1 most expensive edges." See for example here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering.
A good example of a monster curve is the Hilbert curve. The basic form of this curve is a U-shape; by copying many of them together and rotating them, the curve fills the Euclidean space. Surprisingly, a Gray code can help to find out the orientation of this U-shape. You can look up Nick's spatial index quadtree hilbert curve blog article for more details. Instead of calculating the curve's index you can put together a quadkey, like in Bing Maps. The quadkey is unique for each coordinate and it can be used with normal string operations. Each position in the key corresponds to a part of the U-shaped curve, so you can select a region of points by matching the quadkey partially from left to right.
In this image you can see that the green polygon is found using a Hilbert curve:
You can find my php classes here: http://www.phpclasses.org/package/6202-PHP-Generate-points-of-an-Hilbert-curve.html
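
For reference, a minimal Python sketch of the quadkey construction (Bing Maps tile system; strictly speaking its digit order follows a Z-order curve rather than a true Hilbert curve):

    import math

    def quadkey(lat, lon, level=15):
        sin_lat = math.sin(math.radians(lat))
        x = (lon + 180.0) / 360.0                     # Mercator projection
        y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
        n = 1 << level                                # tiles per axis at this level
        tx = min(n - 1, max(0, int(x * n)))
        ty = min(n - 1, max(0, int(y * n)))
        key = ""
        for i in range(level, 0, -1):                 # one digit per zoom level
            mask = 1 << (i - 1)
            key += str((1 if tx & mask else 0) + (2 if ty & mask else 0))
        return key

    # Points whose quadkeys share a prefix lie in the same tile, so a query
    # like  WHERE quadkey LIKE '0231%'  selects a whole spatial region.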