I have lat/lon coordinates in a 400 million rows partitioned mysql table.
The table grows # 2000 records a minute and old data is flushed every few weeks.
I am exploring ways to do spatial analysis of this data as it comes in.
Most of the analysis requires finding whether a point is in a particular lat/lon polygon or which polygons contain that point.
I see the following ways of tackling the point in polygon (PIP) problem:
Create a mysql function that takes a point and a Geometry and returns a boolean.
Simple but not sure how Geometry can be used to perform operations on lat/lon co-ordinates since Geometry assumes flat surfaces and not spheres.
Create a mysql function that takes a point and identifier of a custom data structure and returns a boolean.
The polygon vertices can be stored in a table and a function can compute PIP using spherical math. Large number of polygon points may lead to a huge table and slow queries.
Leave point data in mysql and store polygon data in PostGIS and use the app server to run PIP query in PostGIS by probviding point as a parameter.
Port the application from MySQL to Postgresql/PostGIS.
This will require a lot of effort in rewriting queries and procedures.
I can still do it but how good is Postgresql at handling 400 million rows.
A quick search on google for "mysql 1 billion rows" returns many results. same query for Postgres returns no relevant results.
Would like to hear some thoughts & suggestions.
A few thoughts.
First PostgreSQL and MySQL are completely different beasts when it comes to performance tuning. So if you go the porting route be prepared to rethink your indexing strategies. Not only does PostgreSQL have a far more flexible indexing than MySQL, but the table approaches are very different also, meaning the appropriate indexing strategies are as different as the tactics are. Unfortunately this means you can expect to struggle a bit. If i could give advice I would suggest dropping all non-key indexes at first and then adding them back sparingly as needed.
The second point is that nobody here can likely give you a huge amount of practical advice at this point because we don't know the internals of your program. In PostgreSQL, you are best off indexing only what you need, but you can index functions' outputs (which is really helpful in cases like this) and you can index only part of a table.
I am more a PostgreSQL guy than a MySQL guy so of course I think you should go with PostgreSQL. However rather than tell you why etc. and have you struggle at this scale, I will tell you a few things that I would look at using if I were trying to do this.
Functional indexes
Write my own functions for indexes for related analysis
PostGIS is pretty amazing and very flexible
In the end, switching db's at this volume is going to be a learning curve, and you need to be prepared for that. However, PostgreSQL can handle the volume just fine.
The number of rows is quite irrelevant here.
The question is how much of the point in polygon work that can be done by the index.
The answer to that depends on how big the polygons are.
PostGIS is very fast to find all points in the bounding box of a polygon. Then it takes more effort to find out if the point actually is inside the polygon.
If your polygons is small (small bounding boxes) the query will be efficient. If your polygons are big or have a shape that mekes the bounding box big then it will be less efficient.
If your polygons is more or less static there is work arounds. You can divide your polygons in smaller polygons and recreate the idnex. Then the index will be more efficient.
If your polygons is actually multipolygons the firs step is to split the multipolygons to polygons with ST_Dump and recreate and build an index on the result.
HTH
Nicklas
Related
i am working on a GEO-enabled application where i have a obvious use case of searching users within some distance of given user location .Currently i am having MySQL DB used. as the User table is expected to be very large by time the time for getting results will get longer (too long in case it need to traverse entire table).
i am using InnoDB as my table do need many things which MYISAM cant do. i have tried mongo and had a test drive with adding 5 million users and doing some test over them . now i am curious to know what MYSQL can offer in same situation as i will prefer MYSQL if it gives slightly near results to mongo .
My user table is having other fields plus a lat field and a lng (both indexed). still it takes much time. can anyone suggest a better design approach for faster results.
Mongo has a bunch of very useful built in geospatial commands and aggregations that will be ideal for your given case of finding users near to a given user point. Others include within that finds points within a bounding box or polygon. In your case the geoNear aggregation is perfect and can provide the calculated distance away from the given point.
You will have to code a lot of that functionality with mysql. Then you also have Postgis an add on for Postgres. Postgres is the classic open source Mysql competitor and Postgis has been around longer than Mongo and the database presumably behind open street maps, government gis and similar.
But to the problem, you need to use geojson format and 2dsphere index that you might not be using. Post a single record of your data.
The site currently does mainly range searches (latitude & longitude) with some filtering like WHERE color = "red" type of clauses. However using MySQL with geospatial index is still quite slow and I need to speed it up.
Problem: Will using Solr to do the search be a good idea?
If so, should I only duplicate the range columns from MySQL into Solr, and do the WHERE clauses in MySQL, or do both type of queries in Solr?
I've read that Solr is not for storing data like a database (ie. MySQL). Does this mean that if my search can take place over 10 different columns (or field in Solr terms), and the MySQL table that I replicated Solr's from only has 11 tables, I would still keep the MySQL table even though that will use up almost twice as much storage space half of which is redundant?
It appears that I'm using structured data (because each row has many columns defined?) and storing the entire table in Solr instead having redundant data on MySQL and Solr will save storage space and number of database access operations when writing. Is Solr a good choice here?
In terms of speed, would it be better to use PostGIS or Solr?
Solr has very fast numerical/date range queries. Solr 3 geospatial takes advantage of that, and I wrote a plugin that does even better. I doubt MySQL is faster.
That said, if the sole problem you are trying to solve is slow geospatial queries then bringing in Solr may solve it but will add a lot of overall complexity to your system since it isn't designed to replace relational databases--it works alongside them. Don't get me wrong; Solr is awesome, particularly for faceted navigation and text search. But you didn't state you wanted to take advantage of Solr's primary features.
PostGIS is by far the most mature open-source GIS storage system. I suggest you try it as an experiment to see if it's better. I would try a lat + lon pair of columns approach like what you are doing now with MySQL, and I would also try using the PostGIS native geospatial way to do it, whatever that is exactly.
One thing you could try in either MySQL or PostGIS is to round your latitude and longitude value to the number of decimals to get an appropriate level of precision you need, which is surely far less than the full precision of a double. And if you store them in floats rather than doubles, right there the precision is capped to 2.37 meters. The system you use will probably have a much easier time doing range queries if there are fewer distinct values to scan over.
Well I want your opinions about this case:
I need a database that will have... two or three tables at most, one of them will have points (latitude, longitude) and some other info.
It's really simple what I need: Get the points within a given radius.
I'm not asking how to do it (but any advice is more than welcome, specially if it's about good practices), I want to know if making use of the MySQL's spatial support would help. Since what I need is fairly easy to get with just one query, what I expect by using Spatial support is to increase performance.
So, are the spatial indexes going to help noticeably? I don't think the table will store that many points. I'd say no more than 200.
If it's really only 200 points, I recommend you do without: This makes it much easier to write portable SQL (which I consider an important thing).
Write your SQL so, that first longitued and latitude are checked against the precalculated mins and maxes (giving you a rectangle), then check for the radius. This way, you will only need to calculate the radius without finally selecting the point for 1/pi of the result set.
I personally consider this an acceptable tradeof against writing SQL, that could if must be executed against SQlite or whatever.
I have a large database full of customers, implemented in sql server 2005. Customers each have a latitude and longitude, represented as Decimal(18,15). The most important search query in the database tries to find all customers close to a certain location like this:
(Addresses.Latitude - #SearchInLat) BETWEEN -1 * #LatitudeBound AND #LatitudeBound)
AND ( (Addresses.Longitude - #SearchInLng) BETWEEN -1 * #LongitudeBound AND #LongitudeBound)
So, this is a very simple method. #LatitudeBound and #LongitudeBound are just numbers, used to pull back all the customers within a rough bounding rectangle of the point #SearchInLat, #SearchInLng. Once the results get to a client PC, some results are filtered out so that there is a bounding circle rather than a rectangle. (This is done on the client PC to avoid calculating square roots on the server.)
This method has worked well enough in the past. However, we now want to make the search do more interesting things - for instance, having the number of results pulled back be more predictable, or for the user to dynamically increase the size of the search radius. To do this, I have been looking at the possibility of ugprading to sql server 2008, with its Geography datatype, spatial indexes, and distance functions. My question is this: how fast are these?
The advantage of the simple query we have at the moment is that it is very fast and not performance intensive, which is important as it is called very often. How fast would a query based around something like this:
SearchInPoint.STDistance(Addresses.GeographicPoint) < #DistanceBound
be by comparison? Do the spatial indexes work well, and is STDistance fast?
If your handling just a standard Lat/Lng pair as you describe, and all your doing is a simple lookup, then arguably your not going to gain much in the way of a speed increase by using the Geometry Type.
However, if you do want to get more adventurous as you state, then swapping to using the Geometry types will open up a whole world of new possibilities for you, and not just for searches.
For example (Based on a project I'm working on) you could (If it's uk data) download the polygon definitions for all the towns / villages / city's for a given area, then do cross references to search in a particular town, or if you had a road map, you could find which customers lived next to major delivery routes, motorways, primary roads all sorts of things.
You could also do some very fancy reporting, imagine a map of towns, where each outline was plotted on a map, then shaded in with a colour to show density of customers in an area, some simple geometry SQL will easily return you a count straight from the database, to graph this kind of information.
Then there's tracking, I don't know what data you handle, or why you have customers, but if your delivering anything, feeding the co-ordinates of a delivery van in, tells you how close it is to a given customer.
As for the Question is STDistance fast? well that's difficult to say really, I think a better question is "Is it fast in comparison to.....", it's difficult to say yes or no, unless you have something to compare it to.
Spatial Indexes are one of the primary reasons for moving your data to geographically aware database they are optimised to produce the best results for a given task, but like any database, if you create bad indexes, then you will get bad performance.
In general you should definitely see a speed increase of some sort, because the maths in the sorting and indexing are more aware of the data's purpose as opposed to just being fairly linear in operation like a normal index is.
Bear in mind as well, that the more beefy the SQL server machine is, the better results you'll get.
One last point to mention is management of the data, if your using a GIS aware database, then that opens the avenue for you to use a GIS package such as ArcMap or MapInfo to manage, correct and visualise your data, meaning corrections are very easy to do by pointing, clicking and dragging.
My advice would be to create a side by side table to your existing one, that is formatted for spatial operations, then write a few stored procs and do some timing tests, see which comes out the best. If you have a significant increase just on the basic operations your doing, then that's justification alone, if it's about equal then your decision really hinges on, what new functionality you actually want to achieve.
I have a MySQL database that contains geo-tagged objects. The objects are tagged by using a bounding polygon that the user draws and my program exports into the database. The bounding polygon is stored in the database as a Polygon (the MySQL spatial extensions kind).
I can think of a couple ways to do this, but I'm not very pleased with any of them, as this needs to be an efficient process that will execute fairly often, although on probably only < 50,000 records in the pertinent table.
I need a way to, given any point on the earth, find the record that corresponds to the closest geo-tagged/bounded object. It doesn't need to be correct in all cases but, let's say (just to invent a number), 95% of the time. Manual correction is acceptable if it doesn't need to be done very frequently.
It appears as though this question is very similar
Get polygons close to a lat,long in MySQL.
I am going to write some application-level code to do an interatively-widening search on the distance in the linked question.