Do I really need to use MySQL Spatial Functions? - mysql

Well I want your opinions about this case:
I need a database that will have... two or three tables at most, one of them will have points (latitude, longitude) and some other info.
It's really simple what I need: Get the points within a given radius.
I'm not asking how to do it (though any advice is more than welcome, especially about good practices); I want to know whether making use of MySQL's spatial support would help. Since what I need is fairly easy to get with just one query, what I'd expect from using spatial support is a performance increase.
So, are the spatial indexes going to help noticeably? I don't think the table will store that many points. I'd say no more than 200.

If it's really only 200 points, I recommend you do without: This makes it much easier to write portable SQL (which I consider an important thing).
Write your SQL so that longitude and latitude are first checked against the precalculated mins and maxes (giving you a rectangle), and only then check the radius; a sketch follows below. This way you only have to calculate the distance for points inside the rectangle, and only roughly a fifth of those ((4 - pi)/4 of the rectangle's area lies outside the circle) get computed and then discarded.
I personally consider this an acceptable tradeoff against writing SQL that could, if need be, be executed against SQLite or whatever.
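As a rough illustration of that approach, here is a sketch only: the points table, the :named placeholders, and the flat-earth distance approximation are assumptions, and the application precomputes the rectangle bounds and cos(latitude) so the SQL stays portable (plain arithmetic also runs on SQLite).

-- :min_lat/:max_lat/:min_lng/:max_lng are precalculated in the application from
-- the centre point and radius; :cos_lat is cos(centre latitude); 111.0 is the
-- approximate number of km per degree.
SELECT id, lat, lng
FROM points
WHERE lat BETWEEN :min_lat AND :max_lat
  AND lng BETWEEN :min_lng AND :max_lng
  AND   ((lat - :lat) * 111.0) * ((lat - :lat) * 111.0)
      + ((lng - :lng) * 111.0 * :cos_lat) * ((lng - :lng) * 111.0 * :cos_lat)
      <= :radius_km * :radius_km;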

Related

Improve performance using geolocation to sort by distance

I have to build the structure of a posts table to handle a large amount of data (let's say, 1 million rows), notably with these two fields:
latitude
longitude
What I'd like to do is optimise the time consumed by read queries, when sorting by distance.
I have chosen this type: decimal (precision: 10, scale: 6), thinking it is more precise than float and therefore the relevant choice.
Would it be appropriate to add an index on latitude and an index on longitude?
I'm always scared watching all the operations, such as SIN(), that ORMs perform to build such queries. I'd like to follow best practices, to be sure it will scale, even with a lot of rows.
Note: If a general solution is not possible, let's say the database is MySQL.
Thanks.
INDEX(latitude) will help some. But to make it significantly faster you need a more complicated data structure and code. See my blog.
In there, I point out that 6 decimal places is probably overkill in resolution, unless you are trying to distinguish two persons standing next to each other.
There is also reference code that includes the trigonometry to handle great circle distances.
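To make the points above concrete, here is a hedged sketch of the kind of schema and query this implies. The posts/latitude/longitude names come from the question, but the (6,4)/(7,4) precision choice (0.0001 degree is roughly 11 m of latitude) and the @lat/@lng/@delta variables are illustrative, not taken from the blog's reference code.

-- Smaller DECIMAL columns plus an index on latitude.
CREATE TABLE posts (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
  latitude  DECIMAL(6,4) NOT NULL,
  longitude DECIMAL(7,4) NOT NULL,
  PRIMARY KEY (id),
  INDEX idx_latitude (latitude)          -- lets MySQL scan only a latitude band
);

-- Search point and a latitude half-band of 0.5 degrees (about 55 km):
SET @lat = 48.8566, @lng = 2.3522, @delta = 0.5;

-- Great-circle (spherical law of cosines) distance, sorted, restricted to the band first.
SELECT id,
       6371 * ACOS(LEAST(1.0,
             COS(RADIANS(@lat)) * COS(RADIANS(latitude))
           * COS(RADIANS(longitude) - RADIANS(@lng))
           + SIN(RADIANS(@lat)) * SIN(RADIANS(latitude)))) AS distance_km
FROM posts
WHERE latitude BETWEEN @lat - @delta AND @lat + @delta   -- uses idx_latitude
ORDER BY distance_km
LIMIT 20;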

SQL and fuzzy comparison

Let's assume we have a table of People (name, surname, address, SSN, etc).
We want to find all rows that are "very similar" to specified person A.
I would like to implement some kind of fuzzy-logic comparison of A against all rows from table People. There will be several fuzzy inference rules working separately on several columns (e.g. 3 fuzzy rules for name, 2 rules for surname, 5 rules for address).
The question is: which of the following two approaches would be better, and why?
Implement all fuzzy rules as stored procedures and use one heavy SELECT statement to return all rows that are "very similar" to A. This approach may include using Soundex, similarity metrics, etc.
Implement one or more simpler SELECT statements that return less accurate results, "rather similar" to A, and then fuzzy-compare A with all returned rows (outside the database) to get the "very similar" rows. So the fuzzy comparison would be implemented in my favorite programming language.
Table People should have up to 500k rows, and I would like to make about 500-1000 queries like this a day. I use MySQL (but this is yet to be considered).
I don't really think there is a definitive answer because it depends on information not available in the question. Anyway, too long for a comment.
DBMSes are good at retrieving information according to indexes. It does not make sense to have a db server wasting time on heavy computations unless it is dedicated to this specific purpose (as answered by @Adrian).
Therefore, your client application should delegate to the DBMS the retrieval of information required by the rules.
If the computations are minor, all could be done on the server. Else, pull it off into the client system.
The disadvantage of the second approach lies in the amount of data traveling from the server to the client and the number of connections to establish. So, typically, it is a compromise between computation on the server and data transfer to the client, a balance to be struck depending on the specifics of the fuzzy rules.
Edit: I've seen in a comment that you are almost sure to have to implement the code in the client. In that case, you should consider an additional criterion, code locality, for maintenance purposes, i.e., try to keep related code together rather than spreading it between systems (and languages).
I would say you're best off using simple selects to get the closest matches you can without hammering the database, then do the heavy lifting in your application layer. The reason I would suggest this solution is scalability: if you do your heavy lifting in the application layer, your problem is a perfect use case for a map-reduce-style solution wherein you can distribute the processing of similarities across nodes and get your results back much faster than if you put it through the database; plus, this way, you're not locking up your database and slowing down any other operations that may be going on at the same time.
Since you're still considering which DB to use: PostgreSQL has the fuzzystrmatch module, which provides Levenshtein and Soundex functions. Also, you might want to look at the pg_trgm module, as described here. Maybe you could also put an index on the column using soundex() so you won't have to calculate that every time.
But you seem to be optimizing prematurely, so my advice would be to test using pg first and then decide whether you need to optimize at all; the numbers you provided really don't seem like a lot, considering you have almost two minutes per query.
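For what it's worth, a minimal sketch of that in PostgreSQL, assuming the fuzzystrmatch extension is available; the people table and its name/surname columns follow the question, the index name and literals are placeholders:

-- Provides soundex(), levenshtein(), etc.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

-- Expression index so soundex(surname) is not recomputed for every row at query time.
CREATE INDEX people_surname_soundex_idx ON people (soundex(surname));

-- Coarse candidate retrieval: same-sounding surname, small edit distance on the name.
SELECT *
FROM people
WHERE soundex(surname) = soundex('Kowalski')
  AND levenshtein(lower(name), lower('jon')) <= 2;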
An option I'd consider is to add a column to the People table that holds the Soundex value of the person.
I've done joins using
Select [Column]
From People P
Inner Join TableA A On Soundex(A.ComparisonColumn) = P.SoundexColumn
That'll return anything in TableA that has the same Soundex value as the People table's Soundex column.
I haven't used that kind of query on tables that size, but I see no issues with trying it. You can also index that SoundexColumn to help with performance.
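Since the question mentions MySQL, a hedged sketch of the same idea there would be a stored generated column plus an index (assuming MySQL 5.7+; the column and index names are illustrative):

-- Persist SOUNDEX(surname) once per row and index it, instead of computing it per query.
ALTER TABLE People
  ADD COLUMN SoundexSurname VARCHAR(32)
      GENERATED ALWAYS AS (SOUNDEX(surname)) STORED,
  ADD INDEX idx_soundex_surname (SoundexSurname);

SELECT name, surname
FROM People
WHERE SoundexSurname = SOUNDEX('Kowalski');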

MySQL Postgresql / PostGIS

I have lat/lon coordinates in a partitioned MySQL table with 400 million rows.
The table grows at about 2000 records a minute, and old data is flushed every few weeks.
I am exploring ways to do spatial analysis of this data as it comes in.
Most of the analysis requires finding whether a point is in a particular lat/lon polygon or which polygons contain that point.
I see the following ways of tackling the point in polygon (PIP) problem:
Create a MySQL function that takes a point and a Geometry and returns a boolean.
Simple, but I'm not sure how Geometry can be used to perform operations on lat/lon coordinates, since Geometry assumes flat surfaces and not spheres.
Create a MySQL function that takes a point and the identifier of a custom data structure and returns a boolean.
The polygon vertices can be stored in a table, and a function can compute PIP using spherical math. A large number of polygon points may lead to a huge table and slow queries.
Leave the point data in MySQL, store the polygon data in PostGIS, and use the app server to run the PIP query in PostGIS by providing the point as a parameter.
Port the application from MySQL to PostgreSQL/PostGIS.
This will require a lot of effort in rewriting queries and procedures.
I can still do it, but how good is PostgreSQL at handling 400 million rows?
A quick search on Google for "mysql 1 billion rows" returns many results; the same query for Postgres returns no relevant results.
Would like to hear some thoughts & suggestions.
A few thoughts.
First, PostgreSQL and MySQL are completely different beasts when it comes to performance tuning. So if you go the porting route, be prepared to rethink your indexing strategies. Not only does PostgreSQL have far more flexible indexing than MySQL, but the table approaches are very different as well, meaning the appropriate indexing strategies differ as much as the tactics do. Unfortunately this means you can expect to struggle a bit. If I could give one piece of advice, I would suggest dropping all non-key indexes at first and then adding them back sparingly as needed.
The second point is that nobody here can likely give you a huge amount of practical advice at this point because we don't know the internals of your program. In PostgreSQL, you are best off indexing only what you need, but you can index functions' outputs (which is really helpful in cases like this) and you can index only part of a table.
I am more a PostgreSQL guy than a MySQL guy so of course I think you should go with PostgreSQL. However rather than tell you why etc. and have you struggle at this scale, I will tell you a few things that I would look at using if I were trying to do this.
Functional indexes
Write my own functions for indexes for related analysis
PostGIS is pretty amazing and very flexible
In the end, switching db's at this volume is going to be a learning curve, and you need to be prepared for that. However, PostgreSQL can handle the volume just fine.
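To make the indexing remarks above concrete, a hedged PostgreSQL sketch; the points table, geom column, created_at column (assumed to be a plain TIMESTAMP), and index names are all hypothetical:

-- Functional (expression) index: index a function's output, e.g. a geohash of the
-- point, assuming geom is stored as lon/lat (SRID 4326).
CREATE INDEX points_geohash_idx ON points (ST_GeoHash(geom, 6));

-- Partial index: since old data is flushed every few weeks, index only recent rows.
CREATE INDEX points_recent_geom_idx ON points USING GIST (geom)
  WHERE created_at > TIMESTAMP '2024-01-01';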
The number of rows is quite irrelevant here.
The question is how much of the point-in-polygon work can be done by the index.
The answer to that depends on how big the polygons are.
PostGIS is very fast to find all points in the bounding box of a polygon. Then it takes more effort to find out if the point actually is inside the polygon.
If your polygons are small (small bounding boxes), the query will be efficient. If your polygons are big, or have a shape that makes the bounding box big, then it will be less efficient.
If your polygons are more or less static, there are workarounds. You can divide your polygons into smaller polygons and recreate the index. Then the index will be more efficient.
If your polygons are actually multipolygons, the first step is to split the multipolygons into polygons with ST_Dump and build an index on the result.
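A hedged PostGIS sketch of that splitting step and the resulting PIP query; the table and column names are placeholders and geometries are assumed to be in SRID 4326:

-- Split multipolygons into simple polygons and index the result.
CREATE TABLE polygons_split AS
SELECT id, (ST_Dump(geom)).geom AS geom
FROM multipolygons;

CREATE INDEX polygons_split_geom_idx ON polygons_split USING GIST (geom);

-- Which polygons contain this point? The GiST index prunes by bounding box,
-- ST_Contains then does the exact point-in-polygon test.
SELECT id
FROM polygons_split
WHERE ST_Contains(geom, ST_SetSRID(ST_MakePoint(2.3522, 48.8566), 4326));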
HTH
Nicklas

How good is the geography datatype in sql server 2008?

I have a large database full of customers, implemented in sql server 2005. Customers each have a latitude and longitude, represented as Decimal(18,15). The most important search query in the database tries to find all customers close to a certain location like this:
( (Addresses.Latitude - @SearchInLat) BETWEEN -1 * @LatitudeBound AND @LatitudeBound )
AND ( (Addresses.Longitude - @SearchInLng) BETWEEN -1 * @LongitudeBound AND @LongitudeBound )
So, this is a very simple method. @LatitudeBound and @LongitudeBound are just numbers, used to pull back all the customers within a rough bounding rectangle of the point @SearchInLat, @SearchInLng. Once the results get to a client PC, some results are filtered out so that there is a bounding circle rather than a rectangle. (This is done on the client PC to avoid calculating square roots on the server.)
This method has worked well enough in the past. However, we now want to make the search do more interesting things - for instance, having the number of results pulled back be more predictable, or letting the user dynamically increase the size of the search radius. To do this, I have been looking at the possibility of upgrading to SQL Server 2008, with its Geography datatype, spatial indexes, and distance functions. My question is this: how fast are these?
The advantage of the simple query we have at the moment is that it is very fast and not performance intensive, which is important as it is called very often. How fast would a query based around something like this:
@SearchInPoint.STDistance(Addresses.GeographicPoint) < @DistanceBound
be by comparison? Do the spatial indexes work well, and is STDistance fast?
If you're handling just a standard Lat/Lng pair as you describe, and all you're doing is a simple lookup, then arguably you're not going to gain much in the way of a speed increase by using the Geometry Type.
However, if you do want to get more adventurous as you state, then swapping to using the Geometry types will open up a whole world of new possibilities for you, and not just for searches.
For example (based on a project I'm working on), you could (if it's UK data) download the polygon definitions for all the towns / villages / cities in a given area, then do cross-references to search within a particular town; or if you had a road map, you could find which customers live next to major delivery routes, motorways, primary roads, all sorts of things.
You could also do some very fancy reporting: imagine a map of towns where each outline is plotted and then shaded with a colour to show the density of customers in the area; some simple geometry SQL will easily return a count straight from the database to graph that kind of information.
Then there's tracking. I don't know what data you handle or why you have customers, but if you're delivering anything, feeding in the coordinates of a delivery van tells you how close it is to a given customer.
As for the question "is STDistance fast?", that's difficult to say really. I think a better question is "is it fast in comparison to ...?"; it's difficult to say yes or no unless you have something to compare it to.
Spatial indexes are one of the primary reasons for moving your data to a geographically aware database: they are optimised to produce the best results for a given task, but like any database, if you create bad indexes, you will get bad performance.
In general you should definitely see a speed increase of some sort, because the maths in the sorting and indexing is more aware of the data's purpose, as opposed to a normal index, which is fairly linear in operation.
Bear in mind as well, that the more beefy the SQL server machine is, the better results you'll get.
One last point to mention is management of the data: if you're using a GIS-aware database, that opens the avenue for you to use a GIS package such as ArcMap or MapInfo to manage, correct and visualise your data, meaning corrections are very easy to do by pointing, clicking and dragging.
My advice would be to create a table side by side with your existing one that is formatted for spatial operations, then write a few stored procs and do some timing tests to see which comes out best. If you get a significant increase just on the basic operations you're doing, then that's justification alone; if it's about equal, then your decision really hinges on what new functionality you actually want to achieve.
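As a hedged sketch of that side-by-side test: the Addresses table, Latitude/Longitude columns, GeographicPoint column and the @SearchInLat/@SearchInLng/@DistanceBound parameters follow the question, while the index name and the metre-based distance bound are assumptions.

-- Add and populate a geography column next to the existing Decimal(18,15) pair.
ALTER TABLE Addresses ADD GeographicPoint geography;

UPDATE Addresses
SET GeographicPoint = geography::Point(Latitude, Longitude, 4326);

CREATE SPATIAL INDEX SI_Addresses_GeographicPoint ON Addresses (GeographicPoint);

-- Radius search to time against the bounding-rectangle approach (STDistance is in metres).
DECLARE @SearchInPoint geography = geography::Point(@SearchInLat, @SearchInLng, 4326);

SELECT *
FROM Addresses
WHERE GeographicPoint.STDistance(@SearchInPoint) <= @DistanceBound;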

Most efficient way to get points within radius of a point with sql server spatial

I am trying to work out the most efficient query to get points within a radius of a given point. The results do not have to be very accurate so I would favor speed over accuracy.
We have tried using a where clause comparing distance of points using STDistance like this (where @point and v.GeoPoint are geography types):
WHERE v.GeoPoint.STDistance(@point) <= @radius
Also one using STIntersects similar to this:
WHERE @point.STBuffer(@radius).STIntersects(v.GeoPoint) = 1
Are either of these queries preferred or is there another function that I have missed?
If accuracy is not paramount then using the Filter function might be a good idea:
http://msdn.microsoft.com/en-us/library/cc627367.aspx
This can in many cases be orders of magnitude faster, because it does not do the check to see whether your match was exact.
In the index the data is stored in a grid pattern, so how viable this approach is probably depends on your spatial index options.
Also, if you don't have too many matches, then doing a filter first and then doing a full intersect might be viable.
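For example, something along these lines (a sketch only: the Venues table name is a placeholder, @point and @radius follow the snippets above, and Filter() may return false positives, which is fine here since accuracy is not paramount):

-- Index-only primary filter; optionally follow with the exact STIntersects check.
DECLARE @area geography = @point.STBuffer(@radius);

SELECT v.Id
FROM Venues v
WHERE v.GeoPoint.Filter(@area) = 1          -- cheap approximate check via the spatial index
  AND v.GeoPoint.STIntersects(@area) = 1;   -- drop this line if false positives are acceptable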