distance calculation between two tables of lat/lon - mysql

I have the following two tables
cities
id,lat,lon
mountains
id,latitude,longitude
SELECT cities.id,
(SELECT id FROM mountains
WHERE SQRT(POW(69.1 * ( latitude - cities.lat ) , 2 ) +
POW( 69.1 * (cities.lon - longitude ) *
COS( latitude / 57.3 ) , 2 ) )<20 LIMIT 1) as mountain_id
FROM cities
(Query took 0.5060 seconds.)
I've removed some parts of the query (e.g. ORDER BY, WHERE) for simplicity's sake. They don't really affect the execution time.
The EXPLAIN below.
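+----+--------------------+-----------+------+---------------+------+---------+------+--------+-------------+
| id | select_type        | table     | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+--------------------+-----------+------+---------------+------+---------+------+--------+-------------+
| 1  | PRIMARY            | cities    | ALL  | NULL          | NULL | NULL    | NULL | 478379 |             |
| 2  | DEPENDENT SUBQUERY | mountains | ALL  | NULL          | NULL | NULL    | NULL | 15645  | Using where |
+----+--------------------+-----------+------+---------------+------+---------+------+--------+-------------+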
Using the SELECT itself is not my problem but when I try to use the given result... e.g.
id mountain_id
588437 NULL
588993 4269
589014 4201
589021 4213
589036 4952
589052 7625
589113 9235
589125 NULL
589176 1184
589210 4317
...to UPDATE a table everything gets awfully slow. I tried pretty much everything that I know of. I do know that a dependent sub-query isn't optimal but I don't know how to get rid of it.
Is there any way to improve my query, maybe by changing it into a JOIN?
The two tables themselves have nothing in common except latitude and longitude, which differ and are only related through the calculation.
Spatial distance search (km,miles) in MariaDB seems not to be available yet.

The trick to making this sort of operation fast is to avoid doing all that computation on every possible pair of lat/lon points. To do that you should incorporate a bounding-box operation.
Let's start by using a JOIN. In pseudocode, you want something like this, but it doesn't matter if you catch a few extra pairs, as long as they are further apart than the others.
SELECT c.city_id, m.mountain_id
FROM cities c
JOIN mountains m ON distance_in_miles(c, m) < 20
So we need to figure out how to make that ON clause fast -- make it use indexes rather than rambling around all the cities and mountains (with apologies to Woody Guthrie).
Let's try this for the ON clause. It searches within square bounding boxes of +/- 20 miles for nearby pairs.
SELECT c.id AS city_id, m.id AS mountain_id
FROM cities c
JOIN mountains m
ON m.latitude BETWEEN c.lat - (20.0 / 69.0)
AND c.lat + (20.0 / 69.0)
AND m.longitude BETWEEN c.lon - (20.0 / (69.0 * COS(RADIANS(c.lat))))
AND c.lon + (20.0 / (69.0 * COS(RADIANS(c.lat))))
In this query, 20.0 is the comparison limit radius, and 69.0 is the constant defining statute miles per degree of latitude.
Then, put a compound index on (lat, lon, id) in cities and on (latitude, longitude, id) in mountains, and your JOIN operation will be able to use index range scans to make the query more efficient.
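To make the arithmetic behind the bounding box concrete, here is a minimal Python sketch of the same calculation. The function name `bounding_box` and the sample coordinates are made up for illustration; the 69.0 miles-per-degree constant is the one used in the query.

```python
import math

MILES_PER_DEG_LAT = 69.0  # approximate statute miles per degree of latitude


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) enclosing a circle of radius_miles."""
    dlat = radius_miles / MILES_PER_DEG_LAT
    # A degree of longitude shrinks by cos(latitude) away from the equator,
    # so the box has to be wider in degrees of longitude.
    dlon = radius_miles / (MILES_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lat - dlat, lat + dlat, lon - dlon, lon + dlon)


# Hypothetical city at 40 N, 105 W with a 20-mile search radius.
box = bounding_box(40.0, -105.0, 20.0)
```

The box is deliberately a little generous: it catches everything inside the circle plus the corners, which the exact distance check then discards.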
Finally, you can augment that query with these sorts of clauses, in pseudocode
ORDER BY dist_in_miles (c,m) ASC
LIMIT 1
Here you actually need to use a distance formula. The cartesian-distance formula in your question is an approximation that works tolerably well unless you're near a pole. You may want to use a great circle formula instead. Common choices are the spherical law of cosines, haversine, and Vincenty formulas.
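For comparison, here is a rough Python sketch of the haversine formula next to the flat approximation from the question. The function names are illustrative, and the 3959-mile Earth radius is an assumed mean value rather than anything taken from the question.

```python
import math

EARTH_RADIUS_MILES = 3959.0  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def flat_miles(lat1, lon1, lat2, lon2):
    """The question's approximation: 69.1 miles per degree, longitude scaled by cos(lat)."""
    dy = 69.1 * (lat2 - lat1)
    dx = 69.1 * (lon2 - lon1) * math.cos(lat1 / 57.3)  # 57.3 is roughly degrees per radian
    return math.sqrt(dx * dx + dy * dy)
```

At a 20-mile scale in mid latitudes the two should agree closely, so the flat version is usually fine for the filter and the great-circle version only matters for the final ordering.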

Related

MySQL Query To Select Closest City

I am trying to repeat the following query for all rows. Basically I am trying to map the closest city (based on latitude and longitude) to each place's latitude and longitude. I have a table places which contains the places that need to be mapped, and a table CityTable with the cities to be matched to. I have the following query which works for a single row:
SELECT p.placeID, p.State, p.City, p.County, p.name,
SQRT(POW((69.1 * (p.lat - z.Latitude)), 2 )
+ POW((53 * (p.lng - z.Loungitude)), 2)) AS distance,
p.lat,p.lng,z.Latitude,z.Loungitude,z.City
FROM places p,CityTable z
WHERE p.placeID = 1
ORDER BY distance ASC
LIMIT 1;
This works for a single location. Obviously I would need to remove the WHERE constraint to apply it to the entire table. The problem I am encountering is that it pairs every row with every row of the other table. For example, if there are 100 rows in p and 100 rows in z, then the resulting table seems to be 10,000 rows. I need the result to have count(*) of p rows, one per place. Any ideas? Also, are there more efficient ways to do this if my table p contains over a million rows? Thanks.
You can find the nearest city to a place using:
SELECT p.placeID, p.State, p.City, p.County, p.name,
(select z.City
from CityTable z
order by SQRT(POW((69.1 * (p.lat - z.Latitude)), 2 ) + POW((53 * (p.lng - z.Loungitude)), 2))
limit 1
) as City,
p.lat, p.lng
FROM places p;
(If you want additional city information, join the city table back in on City.)
This doesn't solve the problem of having to do the Cartesian product. It does, however, frame it in a different way. If you know that a city is within five degrees longitude/latitude of any place, then you can make the subquery more efficient:
(select z.City
from CityTable z
where z.Latitude between p.lat - 5 and p.lat + 5 and
z.Loungitude between p.lng - 5 and p.lng + 5
order by SQRT(POW((69.1 * (p.lat - z.Latitude)), 2 ) + POW((53 * (p.lng - z.Loungitude)), 2))
limit 1
) as City,
p.lat, p.lng;
This query can use an index on Latitude. It might even use a composite index on (Latitude, Loungitude).
If this isn't sufficient, then you might consider another way of reducing the search space, by looking only at neighboring states (in the US) or countries.
Finally, you may want to consider the geospatial extensions to MySQL if you are often dealing with this type of data.

Slow SQL Query by Limit/Order dynamic field (coordinates from X point)

I'm trying to run a SQL query against a database of 7 million records. The "geonames" database stores "latitude" and "longitude" as decimal(10,7), both indexed. The problem is that the query is too slow:
SELECT SQL_NO_CACHE DISTINCT
geonameid,
name,
(6367.41 * SQRT(2 * (1-Cos(RADIANS(latitude)) * Cos(0.704231626533) * (Sin(RADIANS(longitude))*Sin(-0.0669560660943) + Cos(RADIANS(longitude)) * Cos(-0.0669560660943)) - Sin(RADIANS(latitude)) * Sin(0.704231626533)))) AS Distance
FROM geoNames
WHERE (6367.41 * SQRT(2 * (1 - Cos(RADIANS(latitude)) * Cos(0.704231626533) * (Sin(RADIANS(longitude)) * Sin(-0.0669560660943) + cos(RADIANS(longitude)) * Cos(-0.0669560660943)) - Sin(RADIANS(latitude)) * Sin(0.704231626533))) <= '10')
ORDER BY Distance
The problem is filtering and sorting by the "Distance" field: because it is computed on the fly, applying it in the WHERE condition takes a long time. If I remove the "WHERE ... <= 10" condition, the query takes only 0.34 seconds, but the result is 7 million records, and transferring them from MySQL to PHP takes almost 120 seconds.
Can you think of any way to keep the query fast while limiting on the Distance field, given that the values in the query will change very often?
This kind of query cannot use an index but must compute whether the lat/lon of each row falls within the specified distance. Therefore, it is typical that some form of preprocessing is used to limit the scan to a subset of rows. You could create tables corresponding to distance "bands" (2, 5, 8, 10, 20 miles/km -- whatever makes sense for your application requirements) and then populate these bands and keep them up to date. If you want only those medical providers, say, or hotels, or whatever, within 10 miles of a given location, there's no need to worry about the ones that are hundreds or thousands of miles away. With ad hoc queries you could inner join on the "within 10 miles" band, say, and thereby exclude from the comparison scan all rows where the computed distance > 10. When the location varies, the "elegant" way to handle this is to implement an RTREE, but you can define your encompassing region in any arbitrary way you like if you have access to additional data -- e.g. by using zipcodes or counties or states.
There are two things you can do:
- Make sure the data types are the same on both sides of a comparison, i.e. compare with 10 (a number), not '10' (a char type). It will make less work for the DB.
- In cases like this, I create a view, which means the calculation is made just once, even if you refer to it more than once in the query.
If these two points are incorporated into your code, you get:
CREATE VIEW geoNamesDistance AS
SELECT SQL_NO_CACHE DISTINCT
geonameid,
name,
(6367.41 * SQRT(2 * (1-Cos(RADIANS(latitude)) * Cos(0.704231626533) * (Sin(RADIANS(longitude))*Sin(-0.0669560660943) + Cos(RADIANS(longitude)) * Cos(-0.0669560660943)) - Sin(RADIANS(latitude)) * Sin(0.704231626533)))) AS Distance
FROM geoNames;
SELECT * FROM geoNamesDistance
WHERE Distance <= 10
ORDER BY Distance;
I came up with:
select * from retailer
where latitude is not null and longitude is not null
and pow(2*(latitude - ?), 2) + pow(longitude - ?, 2) < your_magic_distance_value
With this fast & easy flat-Earth code, Los Angeles is closer to Honolulu than San Francisco, but I doubt customers will consider that when going that far to shop.

Need Help understanding the HAVING clause as it relates to COUNT

I am starting to learn SQL and I am currently getting hung up on the HAVING clause when I use it in conjunction with COUNT. I have done a lot of research, and my general understanding is that, unlike WHERE, it is not applied until any functions in the query have run.
This led me to find a rather neat little function that allows me to find the closest locations to a given zip code by latitude and longitude. I started playing with this and used it to tie together three tables. One has a list of zip codes and their latitudes and longitudes, another has a list of events and their zip codes, and the last has a list of any special requirements for each event. The event tables are tied to each other through an index. The latitude/longitude table is tied in via the zip code.
So for example:
SELECT `EVENT_CAT`,
`NAME`,
(3959 * acos(cos(radians(44.643418)) * cos(radians(`Latitude`)) * cos(radians(`Longitude`) - radians(-73.121685) ) + sin( radians(44.643418)) * sin(radians( `Latitude`)))) AS distance
FROM DATA_ZipCodes
JOIN `EVENT_POST_General` ON ZIP_CODE = ZipCode
JOIN `DATA_EVENTCategories` ON EVENT_CAT = DATA_EVENTCategories.ID
JOIN `EVENT_POST_Filtering` ON EVENT_POST_General.EVENT_ID = EVENT_POST_Filtering.EVENT_ID
WHERE `REQUIRE_TICKET` = '0'
HAVING distance < 10
ORDER BY distance
This works great and returns:
EVENT_CAT NAME DISTANCE
-------------------------------
1 CONCERT 1
1 CONCERT 1
1 CONCERT 1
2 GAMES 1
2 GAMES 2
3 DANCE 4
4 DINNER 4
5 MOVIES 4
The catch is I also want to be able to just query a count of how many of each category of events I have.
To this end I tried just incorporating a COUNT and GROUP BY
SELECT COUNT(`EVENT_CAT`),
`NAME`,
(3959 * acos(cos(radians(44.643418)) * cos(radians(`Latitude`)) * cos(radians(`Longitude`) - radians(-73.121685) ) + sin( radians(44.643418)) * sin(radians( `Latitude`)))) AS distance
FROM DATA_ZipCodes
JOIN `EVENT_POST_General` ON ZIP_CODE = ZipCode
JOIN `DATA_EVENTCategories` ON EVENT_CAT = DATA_EVENTCategories.ID
JOIN `EVENT_POST_Filtering` ON EVENT_POST_General.EVENT_ID = EVENT_POST_Filtering.EVENT_ID
WHERE `REQUIRE_TICKET` = '0'
GROUP BY `EVENT_CAT`
HAVING distance < 10
ORDER BY distance
When I do this however the response does not error but it also does not return anything. I am very baffled as to how to best do this. I tried moving the group by but that just caused errors.
*EDITED TO CORRECT THE GROUP_BY TO GROUP BY as that was a typo when I was pasting it in here and is not in the original code :)
Your query, shortened a bit:
SELECT COUNT(`EVENT_CAT`),`NAME`, ... as `distance`
FROM ...
WHERE `REQUIRE_TICKET` = '0'
GROUP_BY `EVENT_CAT`
HAVING distance < 10
ORDER BY distance
The GROUP BY clause (I think it should not have a _ there, either) groups the result by the distinct event_cat values, collapsing the other columns within each group. But:
- your query result does not even contain such a column.
- in each group you try to count how many different values of event_cat there are. There can only be one if you group by this column.
- additionally, the other columns of the result do not depend only on the GROUP BY column, so this does not work.
The HAVING clause is used to filter the resulting groups after grouping. Since your distance is not unique within each group, this is not a good way to do it. Put the condition in the WHERE clause instead if you want to count how many events within 10 miles are in each category. The problem is that you can't refer in the WHERE clause to aliases defined only in the SELECT clause, so we need to put the formula itself into the WHERE clause.
Then count the different name values (or something else unique), not the event_cat.
SELECT `EVENT_CAT`, COUNT(`NAME`)
FROM ...
WHERE `REQUIRE_TICKET` = '0' AND ... < 10
GROUP BY `EVENT_CAT`
Ordering by distance also is not possible, since the distance is not unique by category. Maybe ordering by count, or minimal distance, or such?
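A small way to see the difference is to run the filter-then-group pattern on toy data. This sketch uses Python's built-in sqlite3 module as a stand-in for MySQL, with a made-up `events` table whose `distance` column is already computed, so it only illustrates the WHERE / GROUP BY ordering and not the distance formula itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per event, with a precomputed distance in miles.
conn.execute("CREATE TABLE events (event_cat INTEGER, name TEXT, distance REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "CONCERT", 1), (1, "CONCERT", 1), (1, "CONCERT", 12),
        (2, "GAMES", 1), (2, "GAMES", 2),
        (3, "DANCE", 40),
    ],
)

# Filter individual rows first (WHERE), then count per category (GROUP BY).
# Ordering uses an aggregate, since a group has no single distance.
rows = conn.execute(
    """
    SELECT event_cat, COUNT(*) AS n, MIN(distance) AS nearest
    FROM events
    WHERE distance < 10
    GROUP BY event_cat
    ORDER BY nearest, event_cat
    """
).fetchall()
```

Category 3 drops out entirely because its only event is 40 miles away, and the 12-mile concert is excluded from category 1's count before grouping happens.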
Use:
SELECT dzc.name,
COUNT(dzc.name),
(3959 * acos(cos(radians(44.643418)) * cos(radians(`Latitude`)) * cos(radians(`Longitude`) - radians(-73.121685) ) + sin( radians(44.643418)) * sin(radians( `Latitude`)))) AS distance
FROM DATA_ZipCodes dzc
JOIN EVENT_POST_General epg ON epg.zip_code = dzc.zipcode
JOIN DATA_EVENTCategories dec ON dec.id = dzc.event_cat
JOIN EVENT_POST_Filtering epf ON epf.event_id = epg.event_id
WHERE REQUIRE_TICKET = '0'
GROUP BY dzc.name
HAVING distance < 10
ORDER BY distance
- You had a typo: there's no underscore in "GROUP BY".
- MySQL supports hidden columns in the GROUP BY, but few other databases do.
- You only need backticks in MySQL queries if you are escaping reserved words or keywords.
- Use table aliases. They make the query more readable and give you more specific errors if a table gets changed.

Is there a Way to Optimize a MySQL Query that Runs a Function on Every Row?

I've got a MySQL query that pulls lat longs from a database based on a criterion, tests whether these points are within a polygon, and returns the points that are within the polygon.
Everything works fine. The problem is that the query takes approx. 20 seconds to return a result. Is there a way to optimize this query so that query speed is faster?
SELECT latitude, longitude
FROM myTable
WHERE offense = 'green' AND myWithin(
POINTFROMTEXT( CONCAT( 'POINT(', latitude, ' ', longitude, ')' ) ) , POLYFROMTEXT( 'POLYGON(( ...bunch of lat longs...))' )
) = 1;
I ran an EXPLAIN SELECT... which produced
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | myTable | ALL | NULL | NULL | NULL | NULL | 137003 | Using where
Is there a way to optimize a query that is run on every latitude and longitude in the db or is this as good as it gets?
I'm thinking about doing a select into another table and then querying the results table, but I was hoping that there would be a way to improve the performance of this query.
If anyone has any suggestions or ideas, I'd love to hear them.
Thanks,
Laxmidi
How big are the polygons? You could define a "bounding rectangle" around the whole polygon and then do:
SELECT latitude, longitude
FROM myTable
WHERE
offense = 'green' AND
latitude BETWEEN rect_left AND rect_right AND
longitude BETWEEN rect_top AND rect_bottom AND
myWithin(
POINTFROMTEXT( CONCAT( 'POINT(', latitude, ' ', longitude, ')' ) ),
POLYFROMTEXT( 'POLYGON(( ...bunch of lat longs...))' )) = 1;
That way, it could use an index on latitude and longitude to narrow down the number of points that it has to run the complex stuff on.
I see two obvious avenues for optimization:
- Reduce the result set more before you run your function O(n) times. Right now you're running the function 137003 times, and there's little way to avoid that if you can't filter the result set any further.
- Make the function faster, such that you're still running it 137k times, but each invocation takes less time, thus reducing your total runtime.
Right now your function is taking 0.1459 milliseconds per row to run, which really isn't bad. You probably want to try to find some way to further reduce the number of rows you have to run it on. Reducing the result set through clever use of WHERE also has the side benefit of allowing your database to do some optimization for you, which is how you want to be using it.
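As an illustration of the "cheap filter first, expensive test second" idea from both answers, here is a Python sketch. It swaps in a plain ray-casting point-in-polygon test in place of the `myWithin` function from the question, whose implementation isn't shown, and all names and sample points are hypothetical.

```python
def bounding_rect(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def point_in_polygon(x, y, polygon):
    """Ray-casting test: count edge crossings of a ray going in +x from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def points_within(points, polygon):
    """Reject points outside the bounding rectangle before the exact test."""
    min_x, min_y, max_x, max_y = bounding_rect(polygon)
    hits = []
    for x, y in points:
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue  # cheap rejection; in SQL this is the indexed BETWEEN part
        if point_in_polygon(x, y, polygon):
            hits.append((x, y))
    return hits


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
candidates = [(1.0, 1.0), (3.0, 3.0), (5.0, 1.0), (-1.0, 2.0)]
result = points_within(candidates, triangle)
```

Here two of the four candidates never reach the exact test, and (3, 3) shows why the exact test is still needed: it lies inside the rectangle but outside the triangle.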

SQL Query For Total Points Within Radius of a Location

I have a database table of all zipcodes in the US that includes city,state,latitude & longitude for each zipcode. I also have a database table of points that each have a latitude & longitude associated with them. I'd like to be able to use 1 MySQL query to provide me with a list of all unique city/state combinations from the zipcodes table with the total number of points within a given radius of that city/state. I can get the unique city/state list using the following query:
select city,state,latitude,longitude
from zipcodes
group by city,state order by state,city;
I can get the number of points within a 100 mile radius of a specific city with latitude '$lat' and longitude '$lon' using the following query:
select count(*)
from points
where (3959 * acos(cos(radians($lat)) * cos(radians(latitude)) * cos(radians(longitude) - radians($lon)) + sin(radians($lat)) * sin(radians(latitude)))) < 100;
What I haven't been able to do is figure out how to combine these queries in a way that doesn't kill my database. Here is one of my sad attempts:
select city,state,latitude,longitude,
(select count(*) from points
where status="A" AND
(3959 * acos(cos(radians(zipcodes.latitude)) * cos(radians(latitude)) * cos(radians(longitude) - radians(zipcodes.longitude)) + sin(radians(zipcodes.latitude)) * sin(radians(latitude)))) < 100) as 'points'
from zipcodes
group by city,state order by state,city;
The tables currently have the following indexes:
Zipcodes - `zip` (zip)
Zipcodes - `location` (state,city)
Points - `status_length_location` (status,length,longitude,latitude)
When I run explain before the previous MySQL query here is the output:
+----+--------------------+----------+------+------------------------+------------------------+---------+-------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+------+------------------------+------------------------+---------+-------+-------+---------------------------------+
| 1 | PRIMARY | zipcodes | ALL | NULL | NULL | NULL | NULL | 43187 | Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | points | ref | status_length_location | status_length_location | 2 | const | 16473 | Using where; Using index |
+----+--------------------+----------+------+------------------------+------------------------+---------+-------+-------+---------------------------------+
I know I could loop through all the zipcodes and calculate the number of matching points within a given radius but the points table will be growing all the time and I'd rather not have stale point totals in the zipcodes database. I'm hoping a MySQL guru out there can show me the error of my ways. Thanks in advance for your help!
MySQL Guru or not, the problem is that unless you find a way of filtering out various rows, the distance needs to be calculated between each point and each city...
There are two general approaches that may help the situation:
- make the distance formula simpler
- filter out unlikely candidates for the 100-mile radius from a given city
Before going into these two avenues of improvement, you should decide on the level of precision desired for this 100-mile distance. You should also indicate which geographic area is covered by the database (is this just the continental USA, etc.).
The reason for this is that, while numerically more precise, the Great Circle formula is computationally very expensive. Another avenue of performance improvement would be to store "grid coordinates" of sorts in addition to (or instead of) the lat/long coordinates.
Edit:
A few ideas about a simpler (but less precise) formula:
Since we're dealing with relatively small distances, (and I'm guessing between 30 and 48 deg Lat North), we can use the euclidean distance (or better yet the square of the euclidean distance) rather than the more complicated spherical trigonometry formulas.
Depending on the level of precision expected, it may even be acceptable to use a single value for the length of a full degree of longitude, taking an average over the area considered (say circa 46 statute miles). The formula would then become:
LatDegInMi = 69.0
LongDegInMi = 46.0
DistSquared = ((Lat1 - Lat2) * LatDegInMi) ^2 + ((Long1 - Long2) * LongDegInMi) ^2
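Written out as runnable Python, the squared-distance comparison looks roughly like this. The constants are the ones given above; the function names are only for illustration.

```python
LAT_DEG_IN_MI = 69.0   # miles per degree of latitude
LONG_DEG_IN_MI = 46.0  # assumed average miles per degree of longitude for the area


def dist_squared(lat1, lon1, lat2, lon2):
    """Squared flat-Earth distance in square miles."""
    dy = (lat1 - lat2) * LAT_DEG_IN_MI
    dx = (lon1 - lon2) * LONG_DEG_IN_MI
    return dx * dx + dy * dy


def within_radius(lat1, lon1, lat2, lon2, radius_miles):
    # Compare squared values so no square root is needed per pair.
    return dist_squared(lat1, lon1, lat2, lon2) < radius_miles ** 2
```

Comparing against the squared radius gives the same answer as comparing the distance itself, since both sides are non-negative.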
On the idea of columns with grid info to limit the number of rows considered for the distance calculation:
Each "point" in the system, be it a city or another point (delivery locations, store locations... whatever), is assigned two integer coordinates which define the square of, say, 25 miles by 25 miles where the point lies. The grid coordinates of any point within 100 miles of the reference point (a given city) will differ from the city's by at most 4 in the x direction and 4 in the y direction. We can then write a query similar to the following
SELECT Z.city, Z.state, Z.latitude, Z.longitude, COUNT(*)
FROM zipcodes Z
JOIN points P
ON P.GridX IN (Z.GridX - 4, Z.GridX - 3, Z.GridX - 2, Z.GridX - 1, Z.GridX,
Z.GridX + 1, Z.GridX + 2, Z.GridX + 3, Z.GridX + 4)
AND P.GridY IN (Z.GridY - 4, Z.GridY - 3, Z.GridY - 2, Z.GridY - 1, Z.GridY,
Z.GridY + 1, Z.GridY + 2, Z.GridY + 3, Z.GridY + 4)
WHERE P.status = 'A'
-- LatDegInMi and LongDegInMi are placeholders for the constants defined above
AND POW((Z.latitude - P.latitude) * LatDegInMi, 2)
+ POW((Z.longitude - P.longitude) * LongDegInMi, 2) < POW(100, 2)
GROUP BY Z.city, Z.state, Z.latitude, Z.longitude;
Note that the LongDegInMi could either be hardcoded (same for all locations within continental USA), or come from corresponding record in the zipcodes table. Similarly, LatDegInMi could be hardcoded (little need to make it vary, as unlike the other it is relatively constant).
The reason why this is faster is that for most records in the cartesian product between the zipcodes table and the points table, we do not calculate the distance at all. We eliminate them on the basis of an index value (the GridX and GridY).
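One possible way to derive the grid columns is sketched below in Python. The 25-mile cell size and the miles-per-degree constants come from the text above; the choice of `floor` and the function names are assumptions of this sketch, not something the answer specifies.

```python
import math

CELL_MILES = 25.0
LAT_DEG_IN_MI = 69.0
LONG_DEG_IN_MI = 46.0


def grid_cell(lat, lon):
    """Map a lat/lon in degrees to integer (GridX, GridY) cell coordinates."""
    grid_x = math.floor(lon * LONG_DEG_IN_MI / CELL_MILES)
    grid_y = math.floor(lat * LAT_DEG_IN_MI / CELL_MILES)
    return grid_x, grid_y


def may_be_within_100_miles(cell_a, cell_b):
    """Cheap pre-check: cells more than 4 apart on either axis cannot be within 100 miles."""
    return abs(cell_a[0] - cell_b[0]) <= 4 and abs(cell_a[1] - cell_b[1]) <= 4


city = grid_cell(45.0, -93.0)
near = grid_cell(46.0, -93.0)  # about 69 miles north
far = grid_cell(50.0, -93.0)   # about 345 miles north
```

The pre-check only says "maybe": pairs that pass it still need the real distance comparison, but pairs that fail it can be skipped without any arithmetic on the coordinates.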
This brings us to the question of which SQL indexes to produce. For sure, we may want:
- GridX + GridY + Status (on the points table)
- GridY + GridX + status (possibly)
- City + State + latitude + longitude + GridX + GridY on the zipcodes table
An alternative to the grids is to "bound" the limits of latitude and longitude which we'll consider, based on the latitude and longitude of a given city, i.e. the JOIN condition becomes a range rather than an IN:
JOIN points P
ON P.latitude > (Z.Latitude - (100 / LatDegInMi))
AND P.latitude < (Z.Latitude + (100 / LatDegInMi))
AND P.longitude > (Z.longitude - (100 / LongDegInMi))
AND P.longitude < (Z.longitude + (100 / LongDegInMi))
When I do these types of searches, my needs allow some approximation. So I use the formula you have in your second query to first calculate the "bounds" (the four lat/long values at the extremes of the allowed radius), then take those bounds and do a simple query to find the matches within them (less than the max lat and long, more than the min lat and long). So what I end up with is everything within a square that encloses the circle defined by the radius.
SELECT * FROM tblLocation
WHERE 2 > POWER(POWER(Latitude - 40, 2) + POWER(Longitude - -90, 2), .5)
Here the 2 is the search radius in degrees, and 40 and -90 are the lat/lon of the test point.
Sorry I didn't use your tablenames or structures, I just copied this out of one of my stored procedures I have in one of my databases.
If I wanted to see the number of points in a zip code I suppose I would do something like this:
SELECT
ParcelZip, COUNT(LocationID) AS LocCount
FROM
tblLocation
WHERE
2 > POWER(POWER(Latitude - 40, 2) + POWER(Longitude - -90, 2), .5)
GROUP BY
ParcelZip
Getting the total count of all locations in the range would look like this:
SELECT
COUNT(LocationID) AS LocCount
FROM
tblLocation
WHERE
2 > POWER(POWER(Latitude - 40, 2) + POWER(Longitude - -90, 2), .5)
A cross join may be inefficient here since we are talking about a large quantity of records but this should do the job in a single query:
SELECT
ZipCodes.ZipCode, COUNT(PointID) AS LocCount
FROM
Points
CROSS JOIN
ZipCodes
WHERE
2 > POWER(POWER(Points.Latitude - ZipCodes.Latitude, 2) + POWER(Points.Longitude - ZipCodes.Longitude, 2), .5)
GROUP BY
ZipCodes.ZipCode