How is geocoding from an address done? - google-maps

I was wondering how Google geocodes an address. Does it work like a DNS lookup, where they have a big table of addresses that hashes to a geocode, or is there some fun geometry that goes into it? If it is a big hash table, how did they go about gathering all that data?

Busbina, I work for SmartyStreets where we verify and geocode street addresses -- so I'll tell you what I know, and link you to further sources for your own research.
To answer your question: It is both.
There are suppliers of massive databases (for example, TIGER data from the US Census Bureau) which contain relational, geo-political information including coordinates, streets, boundaries, and names. For US data, you can typically obtain at least ZIP-level accuracy from tables like these simply by doing a lookup. For more accuracy, though, append the +4 code and you may narrow it down to a city block or a floor of a large building.
To attempt further accuracy (i.e. knowing where precisely on the street a building is located), Google and others perform what is called interpolation, where they take the known boundaries from their datasets and the known range of primary numbers from the start of that block or street to the end of it, and they solve a ratio. If the correct primary number is known, and for straight streets in an ideal setting, a simple ratio like this works:
(primary number - starting primary number) / (ending primary number - starting primary number) =
(x - starting boundary coordinate) / (ending boundary coordinate - starting boundary coordinate)
Where x is a close guess to the actual location on the street - but only a guess. Accurate building-level data can be very expensive and I think is only available for some urban areas.
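For straight streets, that ratio is just a linear blend of the block's endpoint coordinates. Here is a minimal Python sketch of the idea; the endpoint coordinates and the house-number range are made up for illustration:

    def interpolate_along_street(house_number, start_number, end_number,
                                 start_coord, end_coord):
        """Estimate where a house number falls along a straight block.

        start_coord and end_coord are (lat, lon) tuples for the ends of the block;
        start_number and end_number are the primary numbers at those ends.
        """
        # fraction of the way along the block, by house number
        t = (house_number - start_number) / float(end_number - start_number)
        lat = start_coord[0] + t * (end_coord[0] - start_coord[0])
        lon = start_coord[1] + t * (end_coord[1] - start_coord[1])
        return lat, lon

    # e.g. 137 Main St on a block numbered 100-199 (coordinates invented):
    # interpolate_along_street(137, 100, 199, (40.7000, -74.0000), (40.7010, -74.0000))

Real streets curve, so production geocoders interpolate along the street's actual geometry rather than a single straight segment.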
The key is to get the right primary number and accurate, up-to-date data. Maintaining this can be time-consuming and expensive because of all the overhead involved with so much information.
Note that Google and similar map services only perform address approximation, not address verification, and thus are liable to make mistakes (even if the geocoding algorithm is very precise), because the primary number may be wrong or may not even exist. So when that matters to you (or when you aren't showing a Google Map and must honor the Terms of Service), something like LiveAddress is a starting point: it is certified by the USPS and won't return bad addresses.
So there are some things to consider.
More information:
http://en.wikipedia.org/wiki/Geocoding#Address_interpolation
http://www.ncgia.ucsb.edu/giscc/units/u016/u016.html
** I'll add a note, since I have had this question a lot: rooftop- or building-level accuracy is very expensive information. I know of very few providers who offer this, and they have mined and collected that data themselves. For example, Google has the Street View project, from which they've obtained accurate coordinates for approximate addresses, and they can provide such precision. But most geocoders use the same data from official sources, they just interpolate differently. If you want extremely precise coordinates like building-level, you can expect to pay mightily for it, or go collect the data yourself. (Yes, Google's is free to a point -- unless you intend to use the information for more than just showing a map, basically.)

Another, similar service is GeoNames, a database of place names. It is better tailored toward points of interest, like an airport or a landmark: just a database of names, locations, and some metadata.
http://www.geonames.org/

Related

Should I use Google Maps API/Geocoding to power a store finder

I'm new to geocoding, so I'm not certain this is even the question I should be asking, but all of the other discussions I've seen on this topic (here and on the Google API forum) are so application-specific that I feel like I might be missing a very elementary step. I don't need to know how to implement a store finder; I need to know whether I should.
Here is my specific situation - I have been contracted to design an application wherein we will build a database of shops (say, independently owned bars and pubs). This list will continually grow and change as shops close and new ones open. The user can enter his/her point of origin (zip code or address) and be shown a list or map containing all the various shops within a given radius in order of proximity.
I know how to deliver these results from a static database:
One would store the longitude and latitude as columns for each row and then just use that information to check distances.
But I have inherited an (already fairly large) database of shops which have addresses but not coordinates, so I'm not sure of the best way to get those coordinates. I could write a script to query them one at a time against Google's geocoder, I could have a data-entry person manually look up the coordinates for each one and populate the data that way, or maybe there is a third option I'm not aware of.
Is this the right place to be asking this question? Google Maps Geocoding doesn't host a forum of its own, but refers people to Stack Overflow. Other forums on the net dealing with this topic all relate to a specific technical question, but no one seems to be talking about it from a top-down perspective (i.e. the big picture).
Google imposes a 2,500-queries-per-day limit on free users and a 100,000-queries-per-day limit on paid ones. Neither of these seems up to the task of a site with even moderate traffic if, every time a user makes a request, the entire database (perhaps thousands of shops) is being checked against Google's data. It seems certain we must store the coordinates locally, but even then there will have to be checks against Google in order to plot them on a map. If I had a finite number of locations (if, for example, I had six hardware shops) and I wanted to make a store locator, there would be a wealth of discussions, tutorials, and Stack Overflow questions available to point the way for me, but I'm dealing with a potentially vast number of records and I'm not sure how to proceed or where to begin.
Any advice would be welcome - Additionally, if this is not the best place to be asking this question, a helpful response would be to indicate a better place to post it. I've searched for three days but haven't found what looks like a good resource for asking such subjective questions.
The best way, of course, would be to use a geocoding service to get coordinates and store the coordinates in your DB. But that's not possible with Google's geocoding service, because it is not permitted to store geocoded data permanently.
There are free services without this restriction; some keywords to search for: MapQuest, Nominatim, GeoNames (but these services are less accurate than Google).
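If you do write a script to geocode the existing shops one at a time, a hedged sketch against the public Nominatim endpoint might look like this (the shop list is invented, and Nominatim's usage policy asks for a descriptive User-Agent and roughly one request per second):

    import time
    import requests

    NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"

    def geocode(address):
        """Return (lat, lon) for an address via Nominatim, or None if nothing matched."""
        resp = requests.get(
            NOMINATIM_URL,
            params={"q": address, "format": "json", "limit": 1},
            headers={"User-Agent": "shop-finder-batch-geocoder"},  # placeholder UA
            timeout=10,
        )
        resp.raise_for_status()
        results = resp.json()
        if not results:
            return None
        return float(results[0]["lat"]), float(results[0]["lon"])

    # in practice the addresses would come from your own shops table
    for shop_id, address in [(1, "10 Downing Street, London"), (2, "221B Baker Street, London")]:
        print(shop_id, geocode(address))   # store these back into lat/lon columns
        time.sleep(1)                      # stay around one request per second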
Another option would be to use a FusionTable. The geocoding runs automatically (but the daily limits are the same as for the geocoding service). The benefit: the geocoding is persistent (although you can't access the locations directly, e.g. by downloading a DB dump), and you may use the coordinates for plotting markers (via a FusionTablesLayer) or for filtering (e.g. by distance).
The number of entries shouldn't be an issue; 100k rows is no problem for a database.

How do I discern a country from X/Y Coordinates (or Long/Lat)?

I have a number of X/Y coordinates and am looking to discern which country each of them is in. Does anyone know a good service or way of doing this?
I am working with MySQL & PHP, not that it's really relevant. I am au fait with consuming web services/pages and assume there must be a web service/page somewhere which will do this; if someone can point me in the right direction, that would be awesome.
How do I take 306458,383136 and turn it into "United Kingdom" (for example)?
Appreciate your responses in advance.
What you're looking for is called reverse geocoding, and e.g. Google Maps has this functionality: http://code.google.com/apis/maps/documentation/geocoding/#ReverseGeocoding
It works from a lat/lon coordinate and can return even more precise information than just a country; note that this is only an estimate - in some places, country boundaries are somewhat tangled.
If you're looking to do this offline, or if Google's/Bing's/whoever's licensing is too strict for you (e.g. you need to do a gazillion of requests per day, or need to present the result in an unorthodox format), it's possible to run your own instance of Nominatim, feed it a data extract from OpenStreetMap (under ODbL, a much more permissive license), and query that.
For example, there's a set of boundaries available at https://wambachers-osm.website/boundaries/ - just the national boundaries, so you wouldn't need to download the entire planet map.
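As a rough illustration, a country lookup against the public Nominatim "reverse" endpoint could look like this (it expects WGS84 lat/lon, so easting/northing pairs such as 306458,383136 would need converting first; the User-Agent value is just a placeholder):

    import requests

    def country_for(lat, lon):
        """Reverse-geocode a WGS84 lat/lon to a country name via Nominatim."""
        resp = requests.get(
            "https://nominatim.openstreetmap.org/reverse",
            params={"lat": lat, "lon": lon, "format": "json", "zoom": 3},  # zoom=3 -> country level
            headers={"User-Agent": "country-lookup-example"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get("address", {}).get("country")

    print(country_for(52.95, -1.15))   # a point in England -> "United Kingdom"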

Calculating trip travel times using available geo APIs for 5k+ addresses

I'm working on a transportation model, and am about to do a travel time matrix between 5,000 points. Is there a free, semi-reliable way to calculate the travel times between all my nodes?
I think google maps has a limit on the number of queries / hits I can achieve.
EDIT
I'd like to use an API such as Google Maps or similar, as they include data such as road directions, number of lanes, posted speeds, type of road, etc.
EDIT 2
Please be advised that OpenStreetMap data is incomplete and not available for all jurisdictions outside the US.
Google Directions API restricts you to 2500 calls per day. Additionally, terms of service stipulate that you must only use the service "in conjunction with displaying the results on a Google map".
You may be interested in OpenTripPlanner, an in-development project which can do multi-modal routing, and Graphserver on which OpenTripPlanner is built.
One approach would be to use OpenStreetMap data with Graphserver to generate Shortest Path Trees from each node.
As that's 12,502,500 total connections, I'm pretty sure you'll hit some sort of limit if you attempt to use Google Maps for all of them. How accurate do the results need to be, and how far are you travelling?
I might try to generate a crude map with travel speeds on it (e.g. mark off interstates as fast, yadda yadda) then use some software to calculate how long it would take from point to point. One could visualize it as an electromagnetic fields problem, where you're trying to calculate the resistance from point to point over a plane with varying resistance (interstates are wires, lakes are open circuits...).
If you really need all these routes accurately calculated and stored in your database, it sounds like (and I would believe) you are going to have to spend the money to obtain this. As you can imagine, this is expensive to develop and there should be remuneration.
I would, however, probe a bit about your problem:
Do you really need all ~12.5 million distances in a database? What if you asked Google for them as you needed them, and then cached them (if allowed)? I've had web applications like this where, because of the slow traffic ramp-up pattern, I was able to leverage free services early on to vet the idea.
Do you really need all 5000 points? Or could you pick the top 100 and have a more tractable problem?
Perhaps there is some hybrid where you store distances between big cities and do more estimates for shorter distances.
Again, I really don't know what your problem is, but maybe thinking a bit outside the box will help you find an easier solution.
You might have to go for some heuristics here. Maybe you can estimate travel time based on a few factors like geometric distance and some features of the start and end points (urban vs. rural areas, country, ...). You could get a few real travel times, try to fit your parameters on a subset of them, and see how well you're able to predict the other ones. My prediction would be, for example, that in many cases travel times approach a linear dependence on distance as distance grows larger.
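As a toy illustration of that fit-and-validate idea, you could fit a simple linear model on a subset of real travel times and check it against the rest; the sample numbers below are invented:

    import numpy as np

    # distances (km) and measured travel times (minutes) for pairs you did query for real
    sample_dist = np.array([5, 12, 40, 80, 150, 300, 600], dtype=float)
    sample_time = np.array([9, 18, 35, 55, 95, 175, 330], dtype=float)

    # fit travel_time ~ a * distance + b on half of the sample ...
    a, b = np.polyfit(sample_dist[::2], sample_time[::2], 1)

    # ... and see how well it predicts the held-out half
    pred = a * sample_dist[1::2] + b
    rel_err = np.abs(pred - sample_time[1::2]) / sample_time[1::2]
    print("coefficients:", a, b, "mean relative error:", rel_err.mean())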
I know it's messy, but hey, you're trying to estimate 12.5 million data points (or whatever the amount is :)
You might also be able to incrementally add knowledge from already-retrieved "real" travel times by finding close points to the ones you're looking for:
get closest points StartApprox, EndApprox to starting and end position such that you have a travel time between StartApprox and EndApprox
compute distances StartError, EndError between start and StartApprox, end and EndApprox
if StartError+EndError>Distance(StartApprox, EndApprox) * 0.10 (or whatever your threshold) -> compute distance via API (and store it), else use known travel time plus overhead time based on StartError+EndError
(If you have 100 addresses in NY and 100 in SF, all the values are going to be more or less the same (i.e. the difference between them is probably lower than the uncertainty involved in these predictions), and such an approach would keep you from issuing 10,000 queries where 1 would do.)
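A rough Python sketch of that reuse idea follows; the cache structure, the 10% threshold, the per-kilometre overhead, and fetch_travel_time_from_api are all placeholders for whatever you actually use:

    import math

    def haversine_km(a, b):
        """Great-circle distance in km between two (lat, lon) points."""
        lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371.0 * math.asin(math.sqrt(h))

    def estimate_travel_time(start, end, cache, fetch_travel_time_from_api,
                             overhead_per_km=1.5, threshold=0.10):
        """cache maps ((lat, lon), (lat, lon)) pairs to travel times already fetched."""
        best = None
        for (s, e), known_time in cache.items():
            err = haversine_km(start, s) + haversine_km(end, e)
            if best is None or err < best[0]:
                best = (err, s, e, known_time)
        if best is not None:
            err, s, e, known_time = best
            if err <= threshold * haversine_km(s, e):
                # close enough: reuse the known time plus a crude overhead for the mismatch
                return known_time + overhead_per_km * err
        # otherwise pay for a real API call and remember the answer
        real_time = fetch_travel_time_from_api(start, end)
        cache[(start, end)] = real_time
        return real_time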
Many GIS software packages have routing algorithms, if you have the data... Transportation data can be fairly spendy.
There are some other choices of sources for planning routes. Is this something to be done repeatedly, or a one-time process? Can this be broken up into smaller sub-sets of points? Perhaps you can use multiple routing sources and break up the data points into segments small enough for each routing engine.
Here are some other choices from a quick Google search:
Wikipedia
Route66
Truck Miles

How to get the equivalent of the accuracy in Google Map Geocoder V3

I want to get geocodes from Google, and I used to do it with V2 of the API.
Google sends a pretty useful piece of information in the JSON, the accuracy; reference here: http://code.google.com/intl/fr-FR/apis/maps/documentation/javascript/v2/reference.html#GGeoAddressAccuracy
In V3, Google doesn't seem to send exactly the same information. There is the "address_components" array, which seems bigger when the accuracy is better, but not consistently so.
For example, one request is accurate to the street number and the array is of size 8.
Another query is only accurate to the route, so less accurate, but the array is still of size 8, because there is a 'sublocality' entry that does not appear in the first case.
OK, for each result Google sends a 'types' field, which holds the 'best' accuracy. These types are listed here: http://code.google.com/intl/fr-FR/apis/maps/documentation/geocoding/#Types
But there is no real ordering, and if I want only results better than postal_code, I have no clue how to do that.
So, how can I get the equivalent of the V2 accuracy without some dumb and horrible code?
Well, there is the location_type, which is not so bad:
location_type stores additional data about the specified location. The following values are currently supported:
"ROOFTOP" indicates that the returned result is a precise geocode for which we have location information accurate down to street address precision.
"RANGE_INTERPOLATED" indicates that the returned result reflects an approximation (usually on a road) interpolated between two precise points (such as intersections). Interpolated results are generally returned when rooftop geocodes are unavailable for a street address.
"GEOMETRIC_CENTER" indicates that the returned result is the geometric center of a result such as a polyline (for example, a street) or polygon (region).
"APPROXIMATE" indicates that the returned result is approximate.
I test whether the location_type is different from APPROXIMATE, and it gives some good results.
With Google deprecating their Geocoding v2 API later this year, there's going to be a ton of people migrating their geocoding logic to v3 and this very question is going to crop up: How to map the 'location_type' string to an equivalent 'accuracy'?
Here's a decent mapping:
"ROOFTOP" -> 9
[Everything else] -> 4 to 8 (aka the text string might as well read "GARBAGE")
If something other than ROOFTOP is specified, use the area spanned by 'northeast' and 'southwest' (the viewport) to decide whether it is accurate enough for you.
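For example, a rough Python sketch of such a mapping, working on an already-parsed v3 geocoder result; the numeric scores and the viewport-size threshold below are arbitrary choices, not anything Google defines:

    LOCATION_TYPE_ACCURACY = {
        "ROOFTOP": 9,
        "RANGE_INTERPOLATED": 8,
        "GEOMETRIC_CENTER": 6,
        "APPROXIMATE": 4,
    }

    def v2_style_accuracy(result, max_viewport_degrees=0.005):
        """Collapse a v3 geocoder result (one entry of 'results') to a v2-ish accuracy score."""
        geometry = result["geometry"]
        score = LOCATION_TYPE_ACCURACY.get(geometry["location_type"], 4)
        if score < 9:
            # a tight viewport suggests a precise match; a huge one suggests garbage
            ne, sw = geometry["viewport"]["northeast"], geometry["viewport"]["southwest"]
            span = max(abs(ne["lat"] - sw["lat"]), abs(ne["lng"] - sw["lng"]))
            if span > max_viewport_degrees:
                score = min(score, 5)
        return score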
Now what should happen if you don't get something "accurate"? Run a Google Places text search query for the same address. Google Places does geocoding as well and, with Billing enabled, you can get 10,000 Places text search queries per day (no rate limit) and Google claims they won't charge the card (they supposedly just use it for verifying the account). With Billing, you get 100,000 queries, but Places text search queries have a "cost" of 10 times the amount of a regular Places query, hence the aforementioned 10,000 limit. Places can be finicky though and you should only consider responses with one result.
Sometimes Places queries will not return a zipcode - especially if one isn't sent. If you need the zipcode, take the lat/lng results of the Places query and feed it back into the geocoder, which will usually spit out an address with a zipcode (and very frequently a ROOFTOP match).
It should be noted that the official Geocoding API courtesy limit is 2,500 requests per day with a rate limit of one per second per IP address. Therefore, following the above formula will likely decimate and may even halve the number of geocodings available to you.
If you need more than the Google Geocoding limit (who doesn't?), invent your own mini geocoding service with something like the OpenStreetMap database. Clone the parts of OpenStreetMap that you need and write your own geocoder (or use a library). Then you can geocode to your heart's content with no quantity limit or rate limit. If you still use Google Maps, you can use Google's geocoder as a fallback should the OSM geocoder not be accurate enough for all cases.
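One possible shape for that fallback, sketched with the public Nominatim endpoint standing in for your own OSM-based geocoder and Google's geocoding web service as the backstop (key handling, retries, and the "accurate enough" test are left out):

    import requests

    def geocode_osm(address):
        """Try an OSM/Nominatim geocoder first (self-hosted or the public instance)."""
        resp = requests.get(
            "https://nominatim.openstreetmap.org/search",
            params={"q": address, "format": "json", "limit": 1},
            headers={"User-Agent": "fallback-geocoder-example"},
            timeout=10,
        )
        results = resp.json()
        return (float(results[0]["lat"]), float(results[0]["lon"])) if results else None

    def geocode_google(address, api_key):
        """Fall back to Google's geocoder when the OSM lookup comes up empty."""
        resp = requests.get(
            "https://maps.googleapis.com/maps/api/geocode/json",
            params={"address": address, "key": api_key},
            timeout=10,
        )
        data = resp.json()
        if data.get("status") != "OK":
            return None
        loc = data["results"][0]["geometry"]["location"]
        return loc["lat"], loc["lng"]

    def geocode(address, api_key):
        return geocode_osm(address) or geocode_google(address, api_key)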
Alternatively, if you trust your users to not submit bogus data (really?) and have to use the Google geocoding service, you could also abuse a user's web browser by having the browser geocode information for you and then feed the results to your server. You might burn through the user's daily limit and you risk someone pushing bogus data, but if you are going to all that trouble, do you actually care?
At any rate, the tips above should suffice for interim usage for most users to get a working v3 API set up. Ran into this issue myself, so figured I'd share with the community a halfway decent solution. I still think v2 was the better API - integer accuracy ratings instead of ugly text strings always wins out.
Paula's answer is good but you do need to also consider John's comment that ROOFTOP can return garbage.
I use a post-geocode-query sanity checker to get rid of those cases where location_type is 'ROOFTOP' but the address has nothing to do with the address you sent to Google. This sanity checker compares the new address with the old address and considers what changed and by how much. The Google geocoder is good at fixing typos, sometimes, but it can also make some nonsensical decisions - for instance choosing a different city, a different state, or a different country. You need to decide whether the result is a typo fix or whether the geocoding logic went astray.
So, don't just assume ROOFTOP == 9. It can also be garbage if the new address is way off from the original that you sent.
For things like apartments or buildings with multiple units, location_type = 'RANGE_INTERPOLATED' may also be accurate when the result type is 'subpremise'.
Remember, geocoding is not the same thing as address validation. They have some overlap, but Google's geocoding logic tries too hard to get you an answer, even when your input is garbage.

How can I sort/group Salesforce leads by geography?

If I had lat/long data for all our leads in Salesforce, is there a way to write a query to group them, or say list all the leads within 10 miles of San Francisco, CA ?
[EDIT: Clarification]
I have thousands of leads with both a full address, and long/lats.
I want to build a query on these leads that will give me all of the leads near San Francisco, CA. This means doing GIS-type work within Salesforce.
I could of course filter specifically on city, zip codes, or area code, but this presents some problems when trying to roll up a whole metro area.
Yes. You need to reverse-geocode them with a tool/service. In the past I have used Maporama's service, but it was quite expensive, and that was before Google Maps and Virtual Earth existed, so I am sure there is something cheaper (or free) out there now. Googling around, I have found this and this.
EDIT:
OK, from what I understand you are trying to calculate the distance between two lat/long points. I would start by discounting the ones that are outside your radius of (let's say) 10 miles. So from your central point you will want to get the coordinates 10 miles east, west, south, and north of it. To do this you can use the great-circle distance formula.
From that point you have your Salesforce data; if you wish to break it up further, order the points by distance from the central point. To do this you can use the Haversine formula.
I am not sure what your language preference is, so I just included some examples in SQL (mainly) and C#:
Haversine Formula in C# and in SQL
Determine the distance between ZIP codes using C#
Great Circle SQL
Great Circle 2
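In case a plain-Python version of that two-step approach is useful (bounding-box prefilter, then Haversine ordering), here is a minimal sketch; the 10-mile radius and the lead tuples are illustrative:

    import math

    EARTH_RADIUS_MILES = 3959.0

    def haversine_miles(lat1, lon1, lat2, lon2):
        """Great-circle distance in miles between two lat/long points."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlmb = math.radians(lon2 - lon1)
        a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
        return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

    def leads_near(center_lat, center_lon, leads, radius_miles=10):
        """leads: iterable of (lead_id, lat, lon); returns (distance, lead_id), nearest first."""
        # cheap bounding-box pre-filter before the trig-heavy great-circle calculation
        dlat = radius_miles / 69.0                                   # ~69 miles per degree of latitude
        dlon = radius_miles / (69.0 * math.cos(math.radians(center_lat)))
        box = [(lead_id, lat, lon) for lead_id, lat, lon in leads
               if abs(lat - center_lat) <= dlat and abs(lon - center_lon) <= dlon]
        hits = [(haversine_miles(center_lat, center_lon, lat, lon), lead_id)
                for lead_id, lat, lon in box]
        return sorted(hit for hit in hits if hit[0] <= radius_miles)

    # leads_near(37.7749, -122.4194, [(1, 37.80, -122.41), (2, 34.05, -118.24)]) -> only lead 1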
Use GeoHash.org (either as a web service or implement the algorithm). It hashes your lat-long coords into a form that appears similar for nearby places. For example A may have a hash like "akusDf3af" and B might have a hash like "akusDf3b2" if they are nearby. Then do a SOQL query that looks for places starting with the same n characters as a known location. Your n will determine the radius of the lookup.
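As a sketch of how that works, here is a small hand-rolled geohash encoder (the standard interleaved-bit algorithm) plus a prefix match; in practice you would precompute the hash into a field on each lead and filter it with a LIKE-style SOQL prefix query:

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # the geohash alphabet

    def geohash(lat, lon, precision=9):
        """Encode a lat/lon pair as a geohash string; nearby points share prefixes."""
        lat_lo, lat_hi = -90.0, 90.0
        lon_lo, lon_hi = -180.0, 180.0
        bits, use_lon = [], True            # geohash interleaves bits, longitude first
        while len(bits) < precision * 5:
            if use_lon:
                mid = (lon_lo + lon_hi) / 2
                bits.append(1 if lon >= mid else 0)
                lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
            else:
                mid = (lat_lo + lat_hi) / 2
                bits.append(1 if lat >= mid else 0)
                lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
            use_lon = not use_lon
        chars = []
        for i in range(0, len(bits), 5):
            value = 0
            for b in bits[i:i + 5]:
                value = (value << 1) | b
            chars.append(BASE32[value])
        return "".join(chars)

    # keep records whose hash shares a prefix with the query point; the prefix length
    # controls the lookup radius (cells that straddle a prefix boundary can be missed,
    # which is the known weakness of this trick)
    here = geohash(37.7749, -122.4194)                       # San Francisco
    leads = [("A", geohash(37.78, -122.42)), ("B", geohash(40.71, -74.00))]
    nearby = [lead for lead in leads if lead[1].startswith(here[:4])]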
These are some great technical solutions that can provide very exact answers, but two things to consider:
geospatial proximity does not map neatly to responsibility
Ownership calculation seems to be done best through postal code lookups or other rules that don't allow for gaps or overlaps. Otherwise, you'll have two (or more) salespeople fighting over leads that are close to both of them, and ignore those leads that are far away from both of them.
So, if you're using geo-calculations like those above to assign ownership, just acknowledge the system will leak and create business rules to accommodate that. But a simple postal lookup to define territories (as Salesforce's own territory management feature does) might be better.
I'd suggest the problem we're trying to solve geospatially is not who owns which lead. Rather, given all the leads you own, which are nearby?
maps often offer more data per pixel than columnar reports
Again, geospatial data in a report may not be the best answer. A lead 50km away, but along a major road, is more interesting than another lead 50km away on the other side of a mountain or lake. Or a lead close to other leads is more interesting than a lead by itself.
A report can't show this, but a map can.
Salesforce has some great examples of Google Maps integrations. Instead of a columnar report called "My Nearby Leads", why not a visualforce page, with a google map inside? You're giving the user far more information than a columnar report could. They might like it better, and it's easier to implement than trying to calculate some of the equations above.
Just another perspective that may (or may not) be appropriate to the problem at hand.
This post is really old, but is showing up at the top of Google results, so I figured I would post some info to it anyways.
2 nice mapping tools are batchgeo.com and geocod.io. Geocod.io can even give you lat and long coordinates from an address.
If you just need a one time calculation, you can use Excel. Export all your leads with the lat and long. Then go to Google Maps and get the lat and long in decimal degrees for the city center of wherever you want to measure to.
Then use this formula in Excel to calculate the distance between the coordinates in miles. lat1dd and long1dd are the coordinates for one point, and lat2dd and long2dd are the coordinates for the other point.
=3963*ACOS(COS(RADIANS(90-lat1dd))*COS(RADIANS(90-lat2dd))+SIN(RADIANS(90-lat1dd))*SIN(RADIANS(90-lat2dd))*COS(RADIANS(long1dd-long2dd)))
After you run it, just sort the results from smallest to largest to get those results that are the closest.
I haven't done this next part yet, but conceptually it should work. We have a field that lists the major market each account is in. Example, Chicago IL. I am going to build a trigger or formula field that essentially says IF(Market="Chicago IL") then use X and Y for the lat and long. These will be hardcoded as the city center for that specific market. The query will then run each individual account's lat and long against the one from the city center to calculate a distance.
If you wanted to break the market into different zones, you could adjust your formula so it uses < and > on the lat and long fields. Everything less than X but greater than Y goes in Zone A, etc.
Hope this helps someone.