Computing which points (latitude, longitude) are within a certain distance in MySQL?

There are two points, A and B, and two distances: x (miles from A) and y (miles from B). Let the distance from A to B be N, so A is N miles away from B. How do I solve the problem: which of the available points are (N + x + y) miles away from A? I'm not sure how to explain this any better, and I really have no clue how to attack the problem. I read "Fastest Way to Find Distance Between Two Lat/Long Points"; I believe the solution given there calculates the distance between two points, but I have no idea whether it can be applied to my problem or, if so, how.

If you are looking for an approximation algorithm, I suggest looking at k-means or hierarchical clustering, and especially at a monster curve, i.e. a space-filling curve. First, you can compute a minimum spanning tree of the graph and then remove its longest, most expensive edges. The tree then breaks into many small trees, and you can use k-means to compute groups of points, i.e. clusters.
"The single-link k-clustering algorithm ... is precisely Kruskal's algorithm ... equivalent to finding an MST and deleting the k-1 most expensive edges." See for example here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering.
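The quoted equivalence can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `single_link_clusters` is hypothetical, points are treated as planar (x, y) tuples with Euclidean distance (for latitude/longitude you would likely swap in a great-circle distance), and all pairwise edges are generated, which is only reasonable for small inputs.

```python
import math
from itertools import combinations

def single_link_clusters(points, k):
    """Run Kruskal's algorithm and stop once k components remain, which is
    equivalent to building the MST and deleting its k-1 most expensive edges."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, cheapest first (O(n^2) edges: fine for a sketch).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    components = len(points)
    for _, i, j in edges:
        if components <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj  # merge the two components
            components -= 1

    clusters = {}
    for idx, p in enumerate(points):
        clusters.setdefault(find(idx), []).append(p)
    return list(clusters.values())
```

Stopping the merge loop early avoids building the full tree and then cutting it; the resulting groups are the same.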
A good example of a monster curve is the Hilbert curve. The basic form of this curve is a U-shape, and by copying many of these shapes together and rotating them, the curve fills the Euclidean space. Surprisingly, a Gray code can help to find the orientation of each U-shape. You can look up Nick's spatial index quadtree Hilbert curve blog article for more details. Instead of calculating the curve's index, you can put together a quadkey like in Bing Maps. The quadkey is unique for each coordinate and can be used with normal string operations. Each position in the key is one level of the curve, so you can select a region of points by matching the quadkey partially, from left to right.
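As a rough illustration of the quadkey idea, the sketch below follows the publicly documented Bing Maps tile system: project to Web Mercator, pick the tile at a zoom level, then interleave the bits of the tile's x and y. The function names are hypothetical, and note that this interleaving yields a Z-order style key rather than a true Hilbert index.

```python
import math

def tile_to_quadkey(tile_x, tile_y, level):
    """Interleave the bits of tile_x and tile_y into a base-4 string."""
    digits = []
    for i in range(level, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

def latlon_to_quadkey(lat, lon, level):
    """Map a WGS84 coordinate to the quadkey of its tile at `level`."""
    # Web Mercator is only defined up to about +/-85.05 degrees latitude.
    lat = max(-85.05112878, min(85.05112878, lat))
    lon = max(-180.0, min(180.0, lon))
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    n = 1 << level  # number of tiles per axis at this level
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_to_quadkey(tile_x, tile_y, level)
```

Because a coarser tile's key is a prefix of every finer tile inside it, a region query can become a string prefix match, for example `WHERE quadkey LIKE '0212%'` on an indexed column (the column name here is an assumption).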
In this image you can see the green polygon is found using a hilbert curve:
You can find my php classes here: http://www.phpclasses.org/package/6202-PHP-Generate-points-of-an-Hilbert-curve.html

Related

Why do my t-SNE plots with euclidean and cosine distances look similar

I have a question about two t-SNE plots I made.
I have a set of 850 articles for which I wanted to check which articles are similar to each other.
This was done by pre-processing the articles first, then making a tf-idf vector of the whole set and making a t-SNE plot of this tf-idf, one with cosine distances and one with euclidean distances.
However, they both look very similar; it looks a bit as if only the axes were switched. Is there a logical explanation for this?
The colors are the labels an article got from a simple sentiment analysis.
Above the Cosine Distances
Above the Euclidean distances
Thanks for any help in advance!
The result suggests that Euclidean distance and cosine distance are likely the same distance function (up to a certain scaling factor) for this specific type of data. You could verify this by plotting heatmaps of the two distance matrices.
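One possible explanation, offered as an assumption about the data rather than something stated in the question: many tf-idf implementations L2-normalize each document vector (scikit-learn's TfidfVectorizer does so by default with norm='l2'). For unit-length vectors, squared Euclidean distance equals twice the cosine distance, so both metrics rank pairs of documents identically. The toy vectors below are made up purely to check that identity.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up "tf-idf" rows, normalized to unit length.
a = l2_normalize([3.0, 0.0, 1.0, 2.0])
b = l2_normalize([1.0, 4.0, 0.0, 2.0])

lhs = euclidean_distance(a, b) ** 2   # ||a - b||^2
rhs = 2.0 * cosine_distance(a, b)     # 2 * (1 - cos(a, b))
print(abs(lhs - rhs) < 1e-12)
```

If the rows are not normalized, the two distances can differ substantially, so it is worth checking the norms of the actual vectors.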

Optimal approach for filtering out outlier map coordinates

I've got a list of map coordinates [lat,lon]. I would like to filter out those that by some metric, are too far away from the rest of the main group, outliers.
A) A plain approach would be to get the median of lat and lon and then filter out whatever is further away from that median than said metric (e.g. distance). This would only work for an absolute distance (e.g. 5 km).
B) An improvement on that approach could be to assume that no more than x% of the coordinate pairs are outliers (essentially setting a threshold there). Then I'd sort the coordinates array and remove the first x/2% and the final x/2%. Next, find the max distance within that group of markers, which would be the distance from the first marker to the last marker in that array. Finally, apply A) using the distance we just calculated as the metric (so that the distance metric is not fixed).
This is simply an approach I very briefly came up with so if it has any obvious downsides please let me know. In a more open discussion spirit, how would you go about solving this problem? Thanks for your input
Working separately on the coordinates is not the best approach because it is not rotation invariant.
You can try by "onion peeling", i.e. building the convex hull of the point cloud and removing the hull vertices, repeatedly.
Read the paper "Onion-Peeling Outlier Detection in 2-D data Sets; Archit Harsh, John E. Ball & Pan Wei".
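A minimal sketch of the peeling loop, not the method from the cited paper: it uses Andrew's monotone chain for the convex hull and treats [lat, lon] pairs as planar tuples, which is an assumption that is only reasonable over small areas. The function names and the fixed `layers` parameter are illustrative choices.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peel(points, layers):
    """Remove the outermost `layers` convex hulls; return (kept, removed)."""
    kept = list(points)
    removed = []
    for _ in range(layers):
        if len(kept) <= 3:
            break  # nothing meaningful left to peel
        hull = set(convex_hull(kept))
        removed.extend(p for p in kept if p in hull)
        kept = [p for p in kept if p not in hull]
    return kept, removed
```

How many layers to peel, or when to stop, is the part that needs a criterion of your own; a fixed count is just the simplest placeholder.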

store geolocation efficiently in mysql

Right now I store long and lat as two decimal, indexed fields in the DB.
I am wondering (without installing any bizarre engine) if there is an efficient way to do this, so that the index will also help me calculate distance. A sample query would be:
get me all the locations in a 10M radius from long X lat Y
Use the float datatype for latitude and longitude. Anything of higher precision is most likely over-engineering.
Unless your results need to be accurate to less than a meter or so, the float datatype has PLENTY of precision for what you're trying to do. If you are working at resolutions of less than a meter, you're going to need to find out about projections (sphere-to-plane) like Universal Transverse Mercator and Lambert.
When you start doing the computations, keep in mind that one minute (one-sixtieth of a degree) of change in latitude (north-to-south) is one nautical mile.
Here's a nice presentation from a MySQL person on doing this search.
http://www.scribd.com/doc/2569355/Geo-Distance-Search-with-MySQL
The performance optimization is to make an index on the latitudes, and maybe also longitudes, then do a search like this (positive radius)
where loctable.lat >= (mylat-radius)
and loctable.lat <= (mylat+radius)
and loctable.long >= (mylong-radius)
and loctable.long <= (mylong+radius)
and haversine_distance(mylat, mylong, loctable.lat, loctable.long) <= radius
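This searches a bounding box first. Note the units: the box comparisons are in degrees, while the haversine comparison is a distance, so the radius has to be converted between the two (one degree of latitude is 60 nautical miles). A box of plus or minus radius degrees is the right size in latitude, but away from the equator a degree of longitude covers less ground (by a factor of cos(latitude)), so the longitude range should be widened to roughly radius / cos(mylat), or matches near the east and west edges will be missed. It's OK if the box is somewhat too big, because the last line gets rid of any extra matches.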
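The same two-stage filter can be sketched outside the database. Everything here is an illustration: `haversine_km`, `bounding_box` and `within_radius` are hypothetical helper names, distances are in kilometres using a mean Earth radius, and the sketch does not handle boxes that cross a pole or the 180° meridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def bounding_box(lat, lon, radius_km):
    """Degrees box that contains every point within radius_km of (lat, lon)."""
    ang = radius_km / EARTH_RADIUS_KM  # angular radius in radians
    dlat = math.degrees(ang)
    # Longitude degrees shrink with cos(latitude), so the box must widen.
    ratio = math.sin(ang) / math.cos(math.radians(lat))
    dlon = math.degrees(math.asin(ratio)) if ratio < 1 else 180.0
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

def within_radius(rows, lat, lon, radius_km):
    """Cheap box test first (index-friendly in SQL), exact distance second."""
    lat_min, lat_max, lon_min, lon_max = bounding_box(lat, lon, radius_km)
    return [
        (rlat, rlon)
        for rlat, rlon in rows
        if lat_min <= rlat <= lat_max
        and lon_min <= rlon <= lon_max
        and haversine_km(lat, lon, rlat, rlon) <= radius_km
    ]
```

In SQL, the four box bounds would be computed once in application code and passed as parameters, so that the indexed latitude and longitude columns are compared against constants.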
You want to look for a spatial index or a space-filling curve. A spatial index reduces the 2D complexity to a 1D complexity. It looks like a quadtree and a bit like a fractal. If you don't mind the shape and don't need an exact search, you can drop the haversine formula, because you can just search for a quadtree tile. Of course, you need the Mercator projection. This is by far the fastest method. I use it a lot with a Hilbert curve. You want to look for Nick's Hilbert curve spatial index quadtree blog.

Locating all elements between starting and ending points, given by value (not index)

The problem is as follows,
I would be given a set of x and y coordinates (a coordinate array of around 30 to 40 thousand entries) of a long rope. The rope is lying on the ground and can be in any shape.
Now I would be given a start point (essentially an x and y coordinate) and an end point.
What is an efficient way to determine which x and y coordinates from the above-mentioned array lie between the start and end points?
Exhaustive searching, i.e. looping 40k times, is not an acceptable solution (as stated on the question paper).
A small margin for error is acceptable.
We need to find the start point in the array, then the end point. For each, we can think of the rope as describing a function of distance from that point, and we're looking for the lowest point on that distance graph. If one point is a long way away and another is pretty close, we can do some kind of interpolation guess of where to search next.
[ASCII sketch, layout lost: distance to the target (vertical axis) plotted against array index (horizontal axis); the curve rises and falls, and its lowest point is marked "X" on the index axis]
In the representation above, we want to find "X"... we look at the distances at a few points, get an impression of the slope of the distance curve, possibly even the rate of change of that slope, to help guide our next bit of probing....
To refine the basic approach of doing binary or interpolated searches in areas where we know the distance values are low, we may be able to use the following:
- If we happen to be given the rope length and know the coordinate samples are equidistant along the rope, then we can calculate a maximum change in distance from our target point per sample.
- If we know the rope has a stiffness ensuring it can't loop in a trivially small diameter, then:
  - there's a known limit to how fast the slope of the curve can change;
  - the distance curve converges to vertical on both sides of the 0 point.
- You could potentially cross-reference/combine distance with, or use instead, the direction of each point from the target: only at the target would the direction instantly change ~180 degrees (how well the data points capture this still depends on the distance between adjacent samples and any stiffness of the rope).
Otherwise, there's always a risk that the target point may weirdly be encased by two very distant points, frustrating our whole searching algorithm (that must be what they mean by some margin for error: every now and then this search would have to revert to an O(N) brute-force search because any trend analysis fails).
For a one-time search, sometimes linear traversal is the simplest, fastest solution. Maybe that's the case for this problem.
Iterate through the ordered list of points until finding the start or end, and then collect points until hitting the other endpoint.
Now, if we expected to repeat the search, we could build an index to the points.
Edit: This presumes no additional constraints beyond those mentioned by @koool. Constraining the distance between the points would allow the hill-climbing approach described in @Tony's answer.
I don't think you can solve it accurately using anything other than exhaustive search. Consider, for example, the case where the rope is folded in half and the resulting double rope forms a spiral with the two ends at the centre.
However, if we assume that long portions of the rope lie in a straight line, then we can eliminate a lot of points based on a slope check:
if (abs(slope(x[i], y[i], x[i+1], y[i+1])
        - slope(x[i+1], y[i+1], x[i+2], y[i+2])) < tolerance)
    eliminate(x[i+1], y[i+1]);
This will reduce the search time significantly if large portions of the rope lie in a straight line. The search will still be linear in the number of remaining points, though.
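One way to make the slope check runnable is sketched below. It deliberately swaps the slope comparison for a cross-product test, since a slope is undefined for vertical segments; that substitution, the function name `drop_collinear`, and the default tolerance are assumptions, not part of the answer above.

```python
import math

def drop_collinear(points, tolerance=1e-3):
    """Keep only points where the polyline turns by more than `tolerance`
    (measured as |sin(turn angle)|) or doubles back on itself.
    The first and last points always stay."""
    if len(points) < 3:
        return list(points)
    kept = [points[0]]
    for cur, nxt in zip(points[1:], points[2:]):
        ax, ay = cur[0] - kept[-1][0], cur[1] - kept[-1][1]  # last kept -> cur
        bx, by = nxt[0] - cur[0], nxt[1] - cur[1]            # cur -> next
        norm = math.hypot(ax, ay) * math.hypot(bx, by)
        if norm == 0:
            continue  # repeated point: carries no new shape information
        turn = abs(ax * by - ay * bx) / norm
        reverses = (ax * bx + ay * by) < 0  # a U-turn has sin = 0 but matters
        if turn > tolerance or reverses:
            kept.append(cur)
    kept.append(points[-1])
    return kept
```

Measuring the turn against the last kept point, rather than the immediately preceding sample, keeps a slowly curving section from being discarded entirely.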
So basically, you've got a sorted list of the points that comprise the entire rope and you're given two arbitrary points from within that list, and tasked with returning the sublist that exists between those two points.
I'm going to make the assumption that the start and end points that are provided are guaranteed to coincide exactly with points within the sorted list (otherwise it introduces a host of issues, particularly if the rope may be arbitrarily thin and passes by the start/end points multiple times).
That means all you're really looking for are the indices of the two provided coordinates. Or the index of one, and the answer to "is the second coordinate to the right or to the left?".
A simple O(n) solution to that would be:
For each index in array
    coord = array[index]
    if (coord == point1)
        startIndex = index
    if (coord == point2)
        endIndex = index
if (endIndex < startIndex)
    swap(startIndex, endIndex)
return array.sublist(startIndex, endIndex)
Or, if you wanted to optimize for repeated queries, I'd suggest a hashing-based approach where you map each coordinate to its index in the array. Something like:
// build the map (do this once, at init)
map = {}
For each index in array
    coord = array[index]
    map[coord] = index
// find a sublist (do this for each set of start/end points)
startIndex = map[point1]
endIndex = map[point2]
if (endIndex < startIndex)
    swap(startIndex, endIndex)
return array.sublist(startIndex, endIndex)
That's O(n) to build the map, but once it's built you can determine the sublist between any two points in O(1). Assuming an efficient hashmap, of course.
Note that if my assumption doesn't hold, then the same solutions are still usable, provided that as a first step you take the provided start and end points and locate the points in the array that best correspond to each one. As noted, unless you are given some constraints regarding the thickness of the rope then interpolating from an arbitrary coordinate to one that's actually part of the rope can only be guesswork at best.
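A runnable sketch of the hashing approach, with the nearest-sample fallback bolted on for query points that are not exact members of the array. The names `build_index`, `locate` and `between` are hypothetical, coordinates are assumed to be hashable (x, y) tuples, and the fallback is a plain O(n) scan, so it only pays off when exact hits are the common case.

```python
import math

def build_index(rope):
    """Map each coordinate to its first position along the rope (O(n), once)."""
    index = {}
    for i, coord in enumerate(rope):
        index.setdefault(coord, i)
    return index

def locate(rope, index, point):
    """Exact hit is O(1); otherwise fall back to the nearest sample (O(n))."""
    if point in index:
        return index[point]
    return min(range(len(rope)), key=lambda i: math.dist(rope[i], point))

def between(rope, index, start, end):
    """Return the samples from start to end, inclusive, in rope order."""
    i, j = locate(rope, index, start), locate(rope, index, end)
    if j < i:
        i, j = j, i
    return rope[i:j + 1]
```

Usage would be `index = build_index(rope)` once, followed by `between(rope, index, p1, p2)` per query. As noted above, the nearest-sample guess can pick the wrong strand if the rope passes close to the query point more than once.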

How to calculate where each sensor is when I have only few variables

Suppose I have 3 sensors: sensor1, sensor2 and sensor3.
The only variables I know are:
Distance from sensor1 to origin is 36.05
Distance from sensor2 to origin is 62.00
Distance from sensor3 to origin is 63.19
Distance from sensor1 to sensor2 is 61.03
Distance from sensor1 to sensor3 is 90.07
Distance from sensor2 to sensor3 is 59.50
This is what it would look like if you had the positions:
How can I calculate the position of every point using only those variables?
This is not homework, just curiosity.
You cannot find the positions of the points exactly, because any rotation around the origin, as well as any reflection, still gives the same distances.
Do you want a way to find all the possible results?
Finding the points is pretty straightforward, but do you need the method to be robust on noise?
This process is called trilateration. As others have noted, finding absolute, unambiguous positions for the sensors is not possible without more information: you'll need the positions of three non-coincident, non-collinear sensors in 2D, or four non-coincident, non-coplanar sensors in 3D, to resolve all rotation/reflection ambiguities.
There's been an enormous amount of research into this problem in the field of wireless sensor network localisation; dealing with incomplete, noisy range measurements, unreliable communication and highly constrained resources makes it interesting.
This might be an apt approach - the basic idea is to build up a system of located nodes piecewise - start with a seed formation of 3 or 4 nodes with well-defined relative locations and add nodes one by one as their locations become unambiguously computable relative to already-located nodes.
The anchor nodes with known locations can be used as the seed for system growth if possible, or used to compute a corrective transform after all nodes have been located.
The problem as posed is impossible without more information. If you add more information and some noise, then it is doable. See "Finding a point that best fits the intersection of n spheres", which discusses how to solve that type of problem.
Look at these images.
And
You will see that the triangle can rotate freely (so no "fixed" position exists), and also the third intersensor distance is not needed in the general case, as it is determined by the other two distances.
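To make the ambiguity concrete, here is a sketch that produces one valid placement by fixing the free rotation and reflection arbitrarily: sensor1 is pinned to the positive x-axis and sensor2 to the upper half-plane. The function name and that pinning are assumptions for illustration. In this sketch the sensor2 to sensor3 distance is used only to choose between the two mirror-image positions of sensor3 relative to sensor2.

```python
import math

def place_sensors(d1, d2, d3, d12, d13, d23):
    """Return one consistent placement (s1, s2, s3) with the origin at (0, 0).
    d1..d3 are distances to the origin; d12, d13, d23 are pairwise distances."""
    def angle(a, b, c):
        # Law of cosines: angle at the origin between sides a and b,
        # with c the side opposite the origin. Clamped against rounding.
        cos_t = (a * a + b * b - c * c) / (2 * a * b)
        return math.acos(max(-1.0, min(1.0, cos_t)))

    t2 = angle(d1, d2, d12)  # angle between sensor1 and sensor2
    t3 = angle(d1, d3, d13)  # angle between sensor1 and sensor3
    s1 = (d1, 0.0)
    s2 = (d2 * math.cos(t2), d2 * math.sin(t2))
    # sensor3 may sit on either side of the x-axis; keep the side that
    # best matches the measured sensor2-to-sensor3 distance.
    candidates = [(d3 * math.cos(t), d3 * math.sin(t)) for t in (t3, -t3)]
    s3 = min(candidates, key=lambda p: abs(math.dist(s2, p) - d23))
    return s1, s2, s3

s1, s2, s3 = place_sensors(36.05, 62.00, 63.19, 61.03, 90.07, 59.50)
print(s1, s2, s3)
```

Any rotation of the returned points about the origin, or a mirror image of all three, satisfies the same six distances equally well.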