I bought a geo-database a long time ago and I'm now improving the precision of its lat/lng values. But I've found some weird stuff: some cities have the same lat/lng coordinates, which is geographically impossible.
id  City  State  Lat  Lng
1   A     sA     XX   XX
2   B     sA     XX   XX
3   C     sA     YY   YY
4   D     sA     ZZ   ZZ
So I tried GROUP BY City, Lat, Lng, but since I need the id to update the record, the GROUP BY clause will ask me to add the 'id' column.
From the table, ids 1 and 2 should be updated, leaving 3 and 4 out. There shouldn't be 2 (or more) cities with the same Lat/Lng. The table has 22K rows. I could send them all to the gmap API, but I want to use time, bandwidth and API hits as smartly as possible, and I'm running out of time considering I can make only one request per second with the free API access.
I've tried
SELECT DISTINCT postcodes_id, Latitude, Longitude, Region1Name, Region2Name, Nation_D
FROM postcodes
where Latitude + Longitude IN
(
SELECT Latitude + Longitude
FROM
(
SELECT postcodes_id, Latitude, Longitude, count(distinct(Region2Name)) as cantidad
FROM postcodes
where Nation_D is not null
GROUP BY Latitude, Longitude
having count(distinct(Region2Name)) > 1
) A
)
AND Nation_D IS NOT NULL
ORDER BY Latitude, Longitude, Region1Name, Region2Name, Nation_D
But it is not working as expected. I think the problem will be pretty obvious to a new pair of eyes.
I wrote a python script to use Google Map geocode to get the current Lat/Lng and update it if it's different. This script works ok.
Hope someone has an idea. Thanks!!
Running MySQL 5.5 and Python 2.7 on a CentOS 7.
Just some pointers for you, which may be helpful:
You should not use GROUP BY or DISTINCT on lat/lon or any combination of them, since they are continuous floating-point numbers and not discrete integers or strings.
By the same token, you should not use WHERE clauses on lat/lon or their sum. If you mean to check for proximity of two locations, use st_distance() function instead.
Multiple city names can refer to the same location. For example, New York, NY and Manhattan, NY.
And a non-technical point: storing Google geocoding data in your database is against their licensing agreement.
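As a rough sketch of the pointers above, the ids that share a location can be collected outside the database by rounding the coordinates to a fixed precision before grouping, rather than comparing raw floats. The function name `shared_coordinates`, the `(id, city, lat, lng)` row layout and the 6-decimal precision are assumptions for illustration only. Rounding still has bucket edges, and some groups may be legitimate aliases of the same place, so the result is only a candidate list for re-geocoding.

```python
from collections import defaultdict


def shared_coordinates(rows, decimals=6):
    """Group rows by rounded (lat, lng) and keep groups with several city names.

    rows: iterable of (id, city, lat, lng) tuples (hypothetical layout).
    Returns {(lat, lng): [(id, city), ...]} for suspicious groups only.
    """
    groups = defaultdict(list)
    for row_id, city, lat, lng in rows:
        key = (round(lat, decimals), round(lng, decimals))
        groups[key].append((row_id, city))
    # A group is suspicious only if it holds more than one distinct city name.
    return {
        key: members
        for key, members in groups.items()
        if len({city for _, city in members}) > 1
    }
```

Only the ids in the returned groups would then need a geocoding request, which keeps the number of API hits well below the full 22K rows.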
Is there a way to find all the orders shipped to London using an SQL query? Simply searching for London in the columns doesn't work as some customers have put the district name rather than "London".
So I thought the best way to go was via the postcode. Would this be the best way to go about finding the rows? And continue with using OR statements for each postcode?
select * from tt_order_data
where ship_postcode like "e1%"
According to wiki, this is the postcode range:
The E, EC, N, NW, SE, SW, W and WC postcode areas (the eight London
postal districts) comprised the inner area of the London postal region
and correspond to the London post town.
The BR, CR, DA, EN, HA, IG, SL, TN, KT, RM, SM, TW, UB, WD and CM (the
14 outer London postcode areas) comprised the outer area of the London
postal region.[20]
The inner and outer areas together comprised the London postal
region.[13]
One way to do this would be to leverage REGEXP and define a pattern that matches only ship_postcodes that begin with one of the aforementioned London postcode character sequences:
SELECT *
FROM tt_order_data
WHERE UPPER(TRIM(ship_postcode)) REGEXP '^(E|EC|N|NW|SE|SW|W|WC|BR|CR|DA|EN|HA|IG|SL|TN|KT|RM|SM|TW|UB|WD|CM)'
It's important to keep in mind that you will still need to perform some amount of data cleansing if the inputs weren't properly controlled, as invalid postcodes would match this filter (e.g., E1 7AA is valid, but this filter would also consider a string like ERGO valid as well).
As an aside, I'm not exactly sure how this will perform with your specific dataset at scale, but if this is for a one-off exercise then it should fit your needs just fine.
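As a sketch of the data-cleansing point, the same prefix list can be tightened by requiring a digit right after the area letters, since the district part of a UK postcode starts with a digit. The names `LONDON_AREAS` and `is_london_postcode` are made up for this example, and the check is only a heuristic, not a full postcode validator.

```python
import re

# Postcode areas quoted above (inner and outer London postal region).
LONDON_AREAS = [
    "E", "EC", "N", "NW", "SE", "SW", "W", "WC",
    "BR", "CR", "DA", "EN", "HA", "IG", "SL", "TN",
    "KT", "RM", "SM", "TW", "UB", "WD", "CM",
]

# Area letters must be followed directly by a digit, so "ERGO" or "NE1" fail.
LONDON_PATTERN = re.compile(r"^(?:%s)[0-9]" % "|".join(LONDON_AREAS))


def is_london_postcode(raw):
    """Return True if the trimmed, upper-cased value starts with a London area."""
    return bool(LONDON_PATTERN.match(raw.strip().upper()))
```

With this pattern "E1 7AA" and " ec1a 1bb " are accepted, while "ERGO" and the Newcastle postcode "NE1 4ST" are rejected.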
I have a city with longitude and latitude, and a database of city names, also with longitude and latitude. Since there can be several cities with the same name, I want to match the one that is geographically closest.
To give an example, I have New York with a lat of 40.7262 and a long of -73.9796. I want to find the closest city of Bangor to NY, and there are several in the db:
Bangor PA 40.86555560 -75.20694440
Bangor NY 44.81222220 -74.39777780
Bangor ME 44.80111110 -68.77833330
I can get the closest latitude with this query:
Select * from cities
where city='bangor'
order by abs(Latitude - 40.7262) limit 1;
and I can get the closest longitude with this query:
Select * from cities
where city='bangor'
order by abs(Longitude - -73.9796) limit 1;
but that does NOT get me the definitive closest city because Bangor NY wins in one case and Bangor PA wins another. How can I write my query to find the closest city taking into account BOTH lat and long?
We can use the Haversine formula to determine the distance between two points on a map, given their latitudes and longitudes. You can get more details at this link: http://www.plumislandmedia.net/mysql/haversine-mysql-nearest-loc/
We determine the distance in km from New York to each Bangor, using the formula described in the linked article (the query below uses the equivalent spherical law of cosines form), and then ORDER BY the calculated distance. LIMIT 1 keeps only the closest city.
SELECT *,
111.045 *
DEGREES(ACOS(COS(RADIANS(40.7262))
* COS(RADIANS(Latitude))
* COS(RADIANS(-73.9796 - Longitude))
+ SIN(RADIANS(40.7262))
* SIN(RADIANS(Latitude)))) AS distance_in_km
FROM cities
WHERE city='bangor'
ORDER BY distance_in_km LIMIT 1;
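To check the query by hand, here is a minimal Python sketch of the same expression applied to the three Bangor rows from the question. The function name `distance_km` and the tuple layout are assumptions; the constant 111.045 (km per degree) is taken from the query above.

```python
import math


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, mirroring the SQL expression above."""
    cos_angle = (
        math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
        * math.cos(math.radians(lon1 - lon2))
        + math.sin(math.radians(lat1)) * math.sin(math.radians(lat2))
    )
    # Clamp against rounding errors that would push acos outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return 111.045 * math.degrees(math.acos(cos_angle))


new_york = (40.7262, -73.9796)
bangors = [
    ("PA", 40.86555560, -75.20694440),
    ("NY", 44.81222220, -74.39777780),
    ("ME", 44.80111110, -68.77833330),
]

# Equivalent of ORDER BY distance_in_km LIMIT 1.
closest = min(bangors, key=lambda row: distance_km(new_york[0], new_york[1], row[1], row[2]))
```

Here `closest` comes out as the Bangor, PA row, because both coordinates are taken into account at once instead of one at a time.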
Right now I have a table of 100 million rows:
CREATE TABLE o (
id int UNIQUE,
latitude FLOAT(10, 8),
longitude FLOAT(11, 8)
);
On my back end I am receiving a user lat/long and trying to return everything within x distance of that.
Instead of doing the distance formula on every single result I was thinking I could possibly calculate the maximum lat/long for X distance.
So we are sort of creating a square by finding the max lat/min lat, max long/min long.
Once we have these max values we would do the query on this range of values thus making our subset significantly smaller to then do the actual distance formula on (i.e., finding the values within X distance).
So my question to you is:
What makes me run faster?
Option 1)
Distance formula on 100 million entries to get the set.
Option 2)
Instead of doing the distance formula on the set of 100 million entries we calculate the min/max lat/long.
Select the values in that range from the table of 100 million entries
Do the distance formula on our new smaller set.
Option 3)
Something exists already for this in SQL
If option 2 is faster the next issue is actually solving that math problem.
If you want to look at that continue reading:
Lat/Long distance formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = (sin(dlat/2))^2 + cos(lat1) * cos(lat2) * (sin(dlon/2))^2
c = 2 * atan2(sqrt(a), sqrt(1-a))
d = R * c
Obviously we can rearrange this, because d (assume 1 mile) and R (the radius of the earth) are fixed values, so we get d/R = c.
The problem then becomes: how do we solve c/2 = atan2(sqrt(a), sqrt(1-a)) for the lat/long bounds?
1 -- 100M rows is a lot to scan and test. It's OK to do once in a while, but it is too slow to do a lot.
2 -- Using a pseudo-square bounding box and doing
WHERE latitude BETWEEN ...
AND longitude BETWEEN ...
is a good first step. The latitude range is a simple constant times X; the longitude range also divides by cos(latitude).
But the problem comes when you try to find just those rows in the square. Any combination of index on latitude and/or longitude, either separately or together, will only partially filter. That is, it will ignore longitude and give you everything within the latitude range, or vice versa. That might get you down to 100,000 rows to check the distance against. That's a lot better than 100,000,000, but not as good as you would hope for.
3 -- http://mysql.rjweb.org/doc.php/latlng Does get down to the square, or very close. It is designed to scale. I have tested only 3M rows, not 100M, but it should work fine.
The main trick is to partition on latitude, then have longitude be the first column in the PRIMARY KEY so that InnoDB will cluster the nearby rows nearby in the partition(s). If you look for all rows within X miles (or km) it might look at (and compute the great-circle-distance) for about twice as many rows as necessary, not 100K. If you want to find the nearest 100 items, it might touch about 400 (4x).
As for SPATIAL index, you might want to upgrade to 5.7.6, which is when ST_Distance_Sphere() and ST_MakeEnvelope() were added. (MakeEnvelope is only marginally more convenient than building a Polygon yourself -- it has flat-earth syndrome.)
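The two-step idea from point 2 can be sketched in Python as follows: build the pseudo-square first, then run the distance formula from the question only on rows inside it. The names `bounding_box` and `within_radius`, the `(id, lat, lon)` row layout and the earth radius of 3958.8 miles are assumptions for illustration. Dividing the longitude range by cos(latitude) is an approximation that suits small radii away from the poles.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Distance formula from the question, with R in miles."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    return EARTH_RADIUS_MILES * c


def bounding_box(lat, lon, radius_miles):
    """Return (min_lat, max_lat, min_lon, max_lon) around a point."""
    lat_delta = math.degrees(radius_miles / EARTH_RADIUS_MILES)
    lon_delta = lat_delta / math.cos(math.radians(lat))
    return lat - lat_delta, lat + lat_delta, lon - lon_delta, lon + lon_delta


def within_radius(rows, lat, lon, radius_miles):
    """Cheap box filter first, exact distance check second."""
    min_lat, max_lat, min_lon, max_lon = bounding_box(lat, lon, radius_miles)
    candidates = [
        r for r in rows
        if min_lat <= r[1] <= max_lat and min_lon <= r[2] <= max_lon
    ]
    return [r for r in candidates
            if haversine_miles(lat, lon, r[1], r[2]) <= radius_miles]
```

In SQL the box filter corresponds to the two BETWEEN conditions, and the second step removes the rows that sit in the corners of the square but outside the circle.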
I am having some trouble with figuring out how to do this. What I have is a list of 160K locations on an Access table with lat and long coordinates for each. I am trying to find out how to create a column that compares 1 item on the list to the rest of the items to bring back the closest distance in miles.
I've figured out how to use the haversine formula to make a 1 to 1 comparison but I am lost in trying to automate the rest.
This is basically what I want to try to produce...
Loc_ID Loc_Lat Loc_Long Min_Miles_Away
1 33.537214 -81.687378 674.48
4 42.16584 -87.845117 11.83
5 41.99558 -87.869057 11.83
6 41.85325 -89.486883 83.75
Explanation to the table...
Location 1 is closest to location 5 (674.48 miles apart)
Location 4 is closest to location 5 (11.83 miles apart)
Location 5 is closest to location 4 (11.83 miles apart)
Location 6 is closest to location 5 (83.75 miles apart)
Any help would be appreciated.
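You can do a Cartesian join, i.e. a join without a join condition. It will pair each row with every other row. You can do that by simply writing the SQL into the SQL view of the query. Exclude the pairing of a row with itself, otherwise the minimum distance will always be 0.
SELECT *
FROM locations a, locations b
WHERE a.loc_id <> b.loc_id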
Next you can calculate the distance (I guess you have that code already, so just insert the function) on that table.
Finally, you can group by location and take the MIN of the distance.
SELECT loc_id, loc_lat, loc_long, MIN(calculated_distance) AS min_miles_away
FROM myCalculatedQuery
GROUP BY loc_id, loc_lat, loc_long
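The same join, distance and MIN steps can be sketched in Python to sanity-check the expected output, using the four sample rows from the question. The names `haversine_miles` and `nearest_neighbours` and the earth radius of 3958.8 miles are assumptions. Like the Cartesian join, this is quadratic in the number of rows, so for 160K locations it would be slow without some pre-filtering.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two lat/long points."""
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlon / 2) ** 2)
    return EARTH_RADIUS_MILES * 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))


def nearest_neighbours(locations):
    """Map each loc_id to (closest other loc_id, distance in miles)."""
    result = {}
    for loc_id, lat, lon in locations:
        # Skip the row itself, otherwise the minimum is always 0.
        others = [o for o in locations if o[0] != loc_id]
        best = min(others, key=lambda o: haversine_miles(lat, lon, o[1], o[2]))
        result[loc_id] = (best[0], haversine_miles(lat, lon, best[1], best[2]))
    return result


locations = [
    (1, 33.537214, -81.687378),
    (4, 42.16584, -87.845117),
    (5, 41.99558, -87.869057),
    (6, 41.85325, -89.486883),
]
nearest = nearest_neighbours(locations)
```

For these rows, locations 1, 4 and 6 are all closest to location 5, and location 5 is closest to location 4, as in the table above.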
Let's say I have a table venues with the following columns:
id
user_id
name
latitude
longitude
The latitude and longitude are kept as FLOAT(10,6) values. As different users add venues, there are venue duplicates. How can I select all the duplicates from the table within a range of, let's say, 50 metres? (This might be hard to achieve, as the length of a degree of longitude in metres differs at different latitudes, so this is only approximate.) The query should select all venues: VenueA and VenueB (there might be VenueC, VenueD, etc.) so that I can compare them. It should filter out venues that are the only one at their location within the range (I care only about duplicates).
I was looking for an answer but had to settle with answering myself.
SELECT s1.id, s1.name, s2.id, s2.name FROM venues s1, venues s2
WHERE s2.id > s1.id AND
(POW(s1.latitude - s2.latitude, 2) + POW(s1.longitude - s2.longitude, 2) < 0.001)
The first condition selects only half of the matrix, as the order of similar venues is not important. The second one is a simplified distance calculation. As user185631 suggested, the haversine formula should do the trick if you need more precision, but I didn't need it: I was looking for duplicates with the same coordinates but couldn't rely on s1.latitude = s2.latitude AND s1.longitude = s2.longitude due to float/decimal corruption in my DB.
Of course checking this at insert would be better, but if you get a corrupt DB you need to clean it somehow. Please also note that this query is heavy on the server if your tables are big.
Create a function which computes distances between lat/lons. For small/less accurate distance (which is the case here) you can use the Equirectangular approximation (see section here: http://www.movable-type.co.uk/scripts/latlong.html). If the distance is less than your chosen threshold (50m), then it is a duplicate.
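A minimal Python sketch of that function and the 50 m threshold might look like this. The names `equirectangular_m` and `duplicate_pairs`, the `(id, name, latitude, longitude)` tuple layout and the earth radius of 6,371,000 m are assumptions for illustration. Comparing every pair is quadratic, so this suits a one-off clean-up rather than a check on every insert.

```python
import math
from itertools import combinations

EARTH_RADIUS_M = 6371000.0  # assumed mean earth radius in metres


def equirectangular_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation of the distance in metres."""
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_M * math.hypot(x, y)


def duplicate_pairs(venues, threshold_m=50.0):
    """Return (id_a, id_b) pairs with id_a < id_b that lie within the threshold."""
    pairs = []
    # sorted() orders by id, so each unordered pair is visited exactly once.
    for a, b in combinations(sorted(venues), 2):
        if equirectangular_m(a[2], a[3], b[2], b[3]) < threshold_m:
            pairs.append((a[0], b[0]))
    return pairs
```

A venue that appears in no pair is the only one at its location and is left out, which matches the requirement to list duplicates only.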
Determine what 50 meters is in terms of lat and long. Then plus and minus that to your starting location to come up with a max and min for both lat and long. Then...
SELECT id FROM venues WHERE latitude < (your max latitude) AND latitude > (your min latitude) AND longitude < (your max longitude) AND longitude > (your min longitude);
Converting meters to lat/long is very tricky as it depends on where the starting point is on the globe. See the middle section of the page here: http://www.uwgb.edu/dutchs/usefuldata/utmformulas.htm