Efficient way to geocode a huge amount of Twitter user locations - google-maps

I'd like to geocode around 2M strings every day, each representing a city, state, or country anywhere in the world.
These strings are not very clean (you have to distinguish between "Paris,France" and "Paris,TX" for instance, but you also have to resolve a bare "Paris" to the French city!), which is why I'm looking at the GMaps API or the OSM Nominatim API.
The brute force solution would be to dump OSM data on my computer and process it locally, but I hope I can find an easier way to do it.
Obviously, I can reduce the number of strings to be geocoded and cache the results so I never query the same string twice, but I would still have hundreds of thousands of strings to geocode...
Thanks!

To comply with the usage policy when using OSM's Nominatim instance, you have to wait 1 second between each request. Geocoding 100,000 requests would therefore take around 28 hours, which could be feasible for you. Alternatively, you can use one of the other Nominatim instances (after checking their usage policies!) or try to install it locally.
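If you go the self-throttled route, a minimal sketch of a rate-limited, cached client might look like this (assuming the public nominatim.openstreetmap.org instance and Python's requests library; the User-Agent string and in-memory cache are illustrative placeholders):

```python
import time
import requests

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"
# The usage policy requires an identifying User-Agent; this one is a placeholder.
HEADERS = {"User-Agent": "my-geocoding-script/0.1 (contact@example.com)"}

cache = {}  # query string -> (lat, lon) or None; persist to disk in practice

def geocode(query):
    if query in cache:                      # never query the same string twice
        return cache[query]
    resp = requests.get(NOMINATIM_URL,
                        params={"q": query, "format": "json", "limit": 1},
                        headers=HEADERS, timeout=10)
    resp.raise_for_status()
    results = resp.json()
    coords = (float(results[0]["lat"]), float(results[0]["lon"])) if results else None
    cache[query] = coords
    time.sleep(1)                           # usage policy: at most 1 request/second
    return coords

print(geocode("Paris, France"))
```

With the cache in place, only distinct strings cost you a request, which is what makes the ~28-hour estimate workable.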

Related

Google Map API v3 Geocoding: should I use full address or just part of it?

When geocoding a human-readable address into lat/lng, e.g. 3 mystreetname, myarea, mycity, mypostcode:
Should I use the full address or just some part of it, for instance the postcode?
I know that it works either way, but I'd like to know the best practice for avoiding geocoding errors.
Some cities span multiple postal codes, and some postal codes span multiple cities. You might not have problems with smaller data sets but as you start dealing with a lot of addresses things like that can pop up. It's entirely possible that there are two 201 Main Streets in the same zip code, located in two different cities.
So yep, give as much detail as you can.
I'd really recommend storing your information in a spatial datatype if your database supports that as well.
This link is Google's documentation on geocoding:
https://developers.google.com/maps/documentation/geocoding/?csw=1
And refer to the link below for a live sample:
http://gmaps-samples.googlecode.com/svn/trunk/geocoder/singlegeocode.html
For better results, supply more information so that you get accurate answers.
Hope it helps.
I would try running the queries with as much detail as possible, and if a query fails, rerun it without the zip, as the zip sometimes causes problems with the Google API (especially zips longer than 5 digits).
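A small sketch of that fallback, assuming the Google Geocoding web service's JSON endpoint (the key and sample address are placeholders):

```python
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
API_KEY = "MYKEY"  # placeholder

def geocode_with_fallback(address, postcode=None):
    """Try the full address first; if that fails, retry without the postcode."""
    full = f"{address}, {postcode}" if postcode else address
    for attempt in (full, address):
        resp = requests.get(GEOCODE_URL,
                            params={"address": attempt, "key": API_KEY},
                            timeout=10)
        data = resp.json()
        if data.get("status") == "OK":
            loc = data["results"][0]["geometry"]["location"]
            return loc["lat"], loc["lng"]
    return None  # both attempts failed

print(geocode_with_fallback("3 mystreetname, myarea, mycity", "mypostcode"))
```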

Compiling a list of all colonies/neighborhoods for a particular city

I want a list of locations (coordinates) for all possible colonies/neighborhoods of some Indian cities. Take, for example, Delhi. Can this data be obtained with the Places API?
The only thing that comes to mind is to use a query like:
https://maps.googleapis.com/maps/api/place/search/xml?location=28.540346,77.210026&radius=500&types=administrative_area_level_1|administrative_area_level_2|administrative_area_level_3|locality|neighborhood|street_address|sublocality|sublocality_level_4|sublocality_level_5|sublocality_level_3|sublocality_level_2|sublocality_level_1|subpremise&sensor=false&key=MYKEY
and then keep increasing the radius in 500 m steps until the whole city is covered.
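For reference, roughly what that sweep might look like (an illustrative sketch using the JSON variant of the same endpoint; the key is a placeholder and the type list is abbreviated):

```python
import requests

PLACES_URL = "https://maps.googleapis.com/maps/api/place/search/json"
API_KEY = "MYKEY"  # placeholder
CENTER = "28.540346,77.210026"  # Delhi

seen = {}
for radius in range(500, 20500, 500):   # grow the radius in 500 m steps
    resp = requests.get(PLACES_URL, params={
        "location": CENTER,
        "radius": radius,
        "types": "sublocality|neighborhood|locality",
        "sensor": "false",
        "key": API_KEY,
    }, timeout=10)
    for place in resp.json().get("results", []):
        seen[place["name"]] = place["geometry"]["location"]  # dedupe overlapping circles
print(len(seen), "distinct places found")
```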
Is there a better way of doing this?
Given how often you would need to do this for your map, and since caching that data goes against the terms of service, this is not a great approach. If your map gets any decent usage, you'll rapidly hit your quota. Plus, you'd only get the center points of the colonies/neighborhoods. I'd recommend trying to find another source of that data you can download. The Places API was not designed with this in mind.

Batch reverse geocoding

Recently I found myself in need to perform large batch reverse-geocoding operations.
Large means around 20k points per request.
I was looking at Nominatim as a standalone server, but there is no mention of batch requests in the docs (or I just couldn't find it).
Thus the questions are:
1: Could I perform something like this with Nominatim?
2: If not, is there another standalone solution? [Not a service; it may be proprietary.] The main zone of interest is Europe, if that's relevant.
3: What would be the approximate time cost of such a request?
Or am I facing building my own geocoder on top of PostgreSQL + PostGIS?
Thanks in advance.
Not a current question, but in the absence of any answers: there seem to be two main solutions. You can either:
use the Google geocoding API, but this will limit you to a check every few seconds, which will make large bulk geocoding very slow, or
download a comprehensive database of place names (such as http://www.geonames.org/), and use a GIS to calculate the nearest points to your data at whatever hierarchical level is most appropriate (see the sketch below).
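A minimal sketch of that second approach, assuming SciPy and a tab-separated GeoNames dump (e.g. cities1000.txt); note that a k-d tree on raw lat/lon is only a rough proxy for great-circle distance, which is usually acceptable at place-name granularity:

```python
import csv
from scipy.spatial import cKDTree

places, coords = [], []
with open("cities1000.txt", encoding="utf-8") as f:      # GeoNames dump layout
    for row in csv.reader(f, delimiter="\t"):
        places.append(row[1])                            # place name
        coords.append((float(row[4]), float(row[5])))    # latitude, longitude

tree = cKDTree(coords)                                   # built once, queried 20k times

def reverse_geocode(lat, lon):
    _, idx = tree.query((lat, lon))
    return places[idx]

print(reverse_geocode(48.85, 2.35))  # nearest place name to central Paris
```

Built once, the tree answers a 20k-point batch in well under a second, which makes the standalone route very feasible.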

Alternative to Google Maps

My client wants some of the functionality of Google Maps, namely:
- geocoding
- generating maps with points based on postal code or long/lat
- optimal trip mapping
Their issues with Google Maps:
- cannot control outages
- postal codes are sometimes inaccurate or not updated frequently for Canada/UK
- they have no way to correct inaccurate information
They would prefer to host the mapping application themselves, but will require postal code updates.
Can anyone suggest such a product?
Thanks
"cannot control outages - postal codes are sometimes inaccurate or not updated frequently for Canada/UK - they have no way to correct inaccurate information"
Outages
Hosting your own mapping is the only way to control this, but you would be very hard pushed to beat Google Maps / Bing Maps uptime over the last 5 years. Take a look at the following:
OpenStreetMap for the road data; this is open data, very good in the UK (I'm not sure about Canada), and you can make your own changes and submit them (or just change the data you have downloaded).
GeoServer, Mapnik, or MapServer will read OpenStreetMap data and create the image tiles needed to build your own maps in whatever style you wish. If you don't need all countries and all zoom levels, these products can create all the tiles you will need in advance; otherwise, tiles usually have to be rendered in real time and cached. You need a BIG, fast server to manage the tile crunching.
OpenLayers or Leaflet are open-source JavaScript mapping libraries that will display your tiles for you.
Obviously this is just for road maps; aerial imagery would cost you an absolute fortune.
Post Code Data
Many people do not realize that UK postcode data for latitude and longitude is now completely free and available to download every quarter from the official source (Ordnance Survey): http://www.ordnancesurvey.co.uk/oswebsite/products/code-point-open/index.html.
This is the same data source Google uses, and there is none better, but it will always contain inaccuracies and will always be a few months out of date.
Finally
Hopefully that answers the question you asked and gives you the information to brief your client. Now for the question you didn't ask: "Is this approach good value for my client?"
I won't presume to know your business or client; however, what I described above is possible, but with one to many months of work involved to get it all working together. Even then it won't have anywhere near the performance or uptime of something like Google/Bing Maps, and it only offers a small subset of their features.
I think you're looking for something like Caliper. It's a very custom, and I would expect expensive, solution. Not suggested.
http://www.caliper.com/GISMappingSoftwareDevelopment.htm
One solution could be to use two different mapping services and compare their results; this way there's a much better chance the data is accurate. You can also fix inaccurate data by building a layer that sits between the API and your user, where data you know is inaccurate is corrected before it's displayed. I'm not sure exactly what you're doing, though, so this might not work for you.
Is trip mapping/routing the basic functionality you want to do?
Before rushing into rolling your own, I'd suggest a good think about the consequences of doing so. The first that springs to mind is that whilst the pro is that you now control your data, the con is that you now control your data.
So you are going to have to consider where and when you get updates, and the processes you will have to employ to keep your maps in sync with the rest of the world. There are a lot of headaches involved in these things, which is why so many people use externally hosted solutions such as Google's.

Calculating trip travel times using available geo APIs for 5k+ addresses

I'm working on a transportation model, and am about to do a travel time matrix between 5,000 points. Is there a free, semi-reliable way to calculate the travel times between all my nodes?
I think Google Maps has a limit on the number of queries/hits I can make.
EDIT
I'd like to use an API such as Google Maps or similar, as they include data such as road directions, number of lanes, posted speeds, road type, etc.
EDIT 2
Please be advised that OpenStreetMap data is incomplete and not available for all jurisdictions outside the US.
Google Directions API restricts you to 2500 calls per day. Additionally, terms of service stipulate that you must only use the service "in conjunction with displaying the results on a Google map".
You may be interested in OpenTripPlanner, an in-development project which can do multi-modal routing, and Graphserver on which OpenTripPlanner is built.
One approach would be to use OpenStreetMap data with Graphserver to generate Shortest Path Trees from each node.
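An illustrative sketch of that idea: one Dijkstra pass per origin yields travel times from that origin to every other node, so 5,000 passes cover the whole matrix. The toy graph below stands in for a road network extracted from OpenStreetMap (edge weights are travel seconds):

```python
import heapq

def shortest_path_tree(graph, origin):
    """Dijkstra from `origin`; returns travel time to every reachable node."""
    times = {origin: 0.0}
    heap = [(0.0, origin)]
    while heap:
        t, node = heapq.heappop(heap)
        if t > times.get(node, float("inf")):
            continue                        # stale queue entry, already improved
        for neighbor, weight in graph[node]:
            nt = t + weight
            if nt < times.get(neighbor, float("inf")):
                times[neighbor] = nt
                heapq.heappush(heap, (nt, neighbor))
    return times

# toy road network: node -> [(neighbor, travel seconds)]
toy_graph = {
    "A": [("B", 60), ("C", 300)],
    "B": [("A", 60), ("C", 120)],
    "C": [("A", 300), ("B", 120)],
}
print(shortest_path_tree(toy_graph, "A"))  # {'A': 0.0, 'B': 60.0, 'C': 180.0}
```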
As that's 12,502,500 total connections, I'm pretty sure you'll hit some sort of limit if you attempt to use Google Maps for all of them. How accurate do the results need to be, and how far are you travelling?
I might try to generate a crude map with travel speeds on it (e.g. mark off interstates as fast, yadda yadda) then use some software to calculate how long it would take from point to point. One could visualize it as an electromagnetic fields problem, where you're trying to calculate the resistance from point to point over a plane with varying resistance (interstates are wires, lakes are open circuits...).
If you really need all these routes accurately calculated and stored in your database, it sounds like (and I would believe) that you are going to have to spend money to obtain this. As you can imagine, this data is expensive to develop, and there should be remuneration.
I would, however, probe a bit about your problem:
Do you really need all ~12.5 million distances in a database? What if you asked Google for them as you needed them, and then cached them (if allowed)? I've had web applications like this where, because of the slow traffic ramp-up, I was able to leverage free services early on to vet the idea.
Do you really need all 5000 points? Or could you pick the top 100 and have a more tractable problem?
Perhaps there is some hybrid where you store distances between big cities and do more estimates for shorter distances.
Again, I really don't know what your problem is, but maybe thinking a bit outside the box will help you find an easier solution.
You might have to go for some heuristics here. Maybe you can estimate travel time based on a few factors like geometric distance and some features of the start and end points (urban vs. rural areas, country, ...). You could get a few real travel times, fit your parameters on a subset of them, and see how well you're able to predict the others. My prediction would be, for example, that in many cases travel times approach a linear dependence on distance as distance grows larger.
I know it's messy, but hey, you're trying to estimate 12.5 million data points (or whatever the amount is :)
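For instance, a sketch of fitting travel time as a linear function of great-circle distance on a small sample of real API results (the sample pairs below are made up; NumPy's polyfit does the least-squares fit):

```python
import math
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2 +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

# (distance_km, observed_travel_minutes) from a sampled subset of real queries
sample = np.array([(5, 12), (20, 25), (80, 70), (300, 200), (600, 380)], dtype=float)

# fit travel_minutes ~= a * distance_km + b, then predict for unqueried pairs
a, b = np.polyfit(sample[:, 0], sample[:, 1], 1)
print(round(a * haversine_km(40.71, -74.00, 39.95, -75.16) + b), "min (NY -> Philadelphia)")
```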
You might also be able to incrementally add knowledge from already-retrieved "real" travel times by finding cached points close to the ones you're looking for (see the sketch after this list):
get the closest points StartApprox, EndApprox to the start and end positions such that you already have a travel time between StartApprox and EndApprox
compute the distances StartError, EndError between start and StartApprox, and between end and EndApprox
if StartError+EndError > Distance(StartApprox, EndApprox) * 0.10 (or whatever your threshold), compute the distance via the API (and store it); else use the known travel time plus an overhead time based on StartError+EndError
(If you have 100 addresses in NY and 100 in SF, all the values are going to be more or less the same (i.e. the difference between them is probably lower than the uncertainty involved in these predictions), and such an approach would keep you from issuing 10,000 queries where 1 would do.)
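A rough sketch of that scheme (self-contained; api_travel_time is a hypothetical stand-in for whatever routing service you'd actually query, and the overhead factor is a made-up tuning knob):

```python
import math

known = {}  # ((lat, lon), (lat, lon)) -> travel time in minutes

def haversine_km(p, q):
    """Great-circle distance in km between (lat, lon) points p and q."""
    (lat1, lon1), (lat2, lon2) = p, q
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2 +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def api_travel_time(start, end):
    raise NotImplementedError("placeholder for a real routing API call")

def travel_time(start, end, threshold=0.10, overhead_min_per_km=1.0):
    # find the cached pair whose endpoints are closest to (start, end)
    pair = min(known,
               key=lambda p: haversine_km(start, p[0]) + haversine_km(end, p[1]),
               default=None)
    if pair is not None:
        s_err = haversine_km(start, pair[0])   # StartError
        e_err = haversine_km(end, pair[1])     # EndError
        if s_err + e_err <= haversine_km(pair[0], pair[1]) * threshold:
            # close enough: reuse the known time plus a distance-based overhead
            return known[pair] + (s_err + e_err) * overhead_min_per_km
    t = api_travel_time(start, end)            # fall back to a real query
    known[(start, end)] = t                    # and store it
    return t
```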
Many GIS software packages have routing algorithms, if you have the data... Transportation data can be fairly spendy.
There are some other choices of sources for planning routes. Is this something to be done repeatedly, or a one-time process? Can this be broken up into smaller sub-sets of points? Perhaps you can use multiple routing sources and break up the data points into segments small enough for each routing engine.
Here are some other choices from a quick Google search:
Wikipedia
Route66
Truck Miles