I am working on a geo-enabled application where an obvious use case is searching for users within some distance of a given user's location. I am currently using MySQL. The user table is expected to grow very large over time, so queries will take longer and longer (far too long if they need to scan the entire table).
I am using InnoDB because my tables need many things that MyISAM can't do. I have tried Mongo and took it for a test drive by adding 5 million users and running some tests against them. Now I am curious to know what MySQL can offer in the same situation, as I would prefer MySQL if its results come reasonably close to Mongo's.
My user table has other fields plus a lat field and a lng field (both indexed). Queries still take a long time. Can anyone suggest a better design approach for faster results?
Mongo has a bunch of very useful built-in geospatial commands and aggregations that are ideal for your case of finding users near a given point. Others include within, which finds points within a bounding box or polygon. In your case the geoNear aggregation is perfect, and it can also return the calculated distance from the given point.
You will have to code a lot of that functionality yourself with MySQL. There is also PostGIS, an add-on for Postgres. Postgres is the classic open-source MySQL competitor; PostGIS has been around longer than Mongo and is presumably the database behind OpenStreetMap, government GIS and similar.
But to the problem: you need to use the GeoJSON format and a 2dsphere index, which you might not be using. Post a single record of your data.
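To make the GeoJSON and geoNear suggestion concrete, here is a rough sketch of what a user record and the aggregation pipeline could look like. The field names ("location", "dist_m") and both helper functions are my own assumptions for illustration, since the actual schema was not posted. The pipeline is built as plain Python dicts so it can be inspected without a running server.

```python
# Sketch only: "location", "dist_m" and the helper names are hypothetical.

def user_document(name, lat, lng):
    """Build a user record whose position is stored as a GeoJSON Point."""
    # GeoJSON orders coordinates as [longitude, latitude].
    return {
        "name": name,
        "location": {"type": "Point", "coordinates": [lng, lat]},
    }


def geo_near_pipeline(lat, lng, max_distance_m, limit=50):
    """Aggregation pipeline returning users near a point, nearest first."""
    return [
        {
            "$geoNear": {
                "near": {"type": "Point", "coordinates": [lng, lat]},
                "distanceField": "dist_m",      # output field for the distance
                "maxDistance": max_distance_m,  # meters when "near" is GeoJSON
                "spherical": True,
            }
        },
        {"$limit": limit},
    ]


# Assuming pymongo is the driver and `users` is the collection,
# this would be used roughly as:
#   users.create_index([("location", "2dsphere")])
#   nearby = users.aggregate(geo_near_pipeline(52.52, 13.405, 5000))
```

The $geoNear stage has to be the first stage of the pipeline and needs a geospatial index on the collection, which is why the 2dsphere index matters here.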
Related
The site currently does mainly range searches (latitude & longitude) with some filtering, such as WHERE color = "red" type clauses. However, using MySQL with a geospatial index is still quite slow and I need to speed it up.
Problem: Will using Solr to do the search be a good idea?
If so, should I only duplicate the range columns from MySQL into Solr and do the WHERE clauses in MySQL, or do both types of queries in Solr?
I've read that Solr is not meant for storing data like a database (i.e. MySQL). Does this mean that if my search can take place over 10 different columns (or fields, in Solr terms), and the MySQL table that I replicated into Solr only has 11 columns, I would still keep the MySQL table, even though that will use almost twice as much storage space, half of which is redundant?
It appears that I'm using structured data (because each row has many defined columns?), and storing the entire table in Solr, instead of keeping redundant data in both MySQL and Solr, would save storage space and reduce the number of database operations when writing. Is Solr a good choice here?
In terms of speed, would it be better to use PostGIS or Solr?
Solr has very fast numerical/date range queries. Solr 3 geospatial takes advantage of that, and I wrote a plugin that does even better. I doubt MySQL is faster.
That said, if the sole problem you are trying to solve is slow geospatial queries then bringing in Solr may solve it but will add a lot of overall complexity to your system since it isn't designed to replace relational databases--it works alongside them. Don't get me wrong; Solr is awesome, particularly for faceted navigation and text search. But you didn't state you wanted to take advantage of Solr's primary features.
PostGIS is by far the most mature open-source GIS storage system. I suggest you try it as an experiment to see if it's better. I would try a lat + lon pair of columns approach like what you are doing now with MySQL, and I would also try using the PostGIS native geospatial way to do it, whatever that is exactly.
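For the lat + lon pair of columns approach, the usual trick is to turn the radius into a latitude/longitude box so that the database can use plain range conditions, and then do the exact distance check only on the few rows that survive. A minimal sketch, assuming a spherical Earth and ignoring the poles and the 180° meridian; the function names and the `users` table in the comment are made up for illustration:

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def bounding_box(lat, lng, radius_km):
    """Return (min_lat, max_lat, min_lng, max_lng) enclosing the search circle.

    Does not handle the poles or boxes crossing the 180 degree meridian.
    """
    lat_delta = math.degrees(radius_km / EARTH_RADIUS_KM)
    # Longitude degrees shrink with cos(latitude), so the box must get wider.
    lng_delta = math.degrees(
        radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat)))
    )
    return lat - lat_delta, lat + lat_delta, lng - lng_delta, lng + lng_delta


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# The box feeds an index-friendly range query (hypothetical table/columns):
#   SELECT id, lat, lng FROM users
#   WHERE lat BETWEEN %s AND %s AND lng BETWEEN %s AND %s
# and haversine_km() then filters the returned rows to the true circle.
```

Whether this beats a native spatial index depends on the data distribution, so it is worth measuring both.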
One thing you could try in either MySQL or PostGIS is to round your latitude and longitude values to the number of decimals that gives the level of precision you actually need, which is surely far less than the full precision of a double. And if you store them in floats rather than doubles, right there the precision is capped at about 2.37 meters. The system you use will probably have a much easier time doing range queries if there are fewer distinct values to scan over.
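To get a feel for how many decimals are enough, this small sketch estimates the size of one rounding step on the ground. It assumes roughly 111.32 km per degree of latitude, which is an approximation, and the function names are mine:

```python
import math

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude


def grid_size_m(decimals, lat=0.0):
    """Approximate (north-south, east-west) size in meters of one rounding step."""
    step_deg = 10.0 ** -decimals
    north_south = step_deg * METERS_PER_DEGREE
    # A degree of longitude gets shorter away from the equator.
    east_west = north_south * math.cos(math.radians(lat))
    return north_south, east_west


def snap(lat, lng, decimals):
    """Round a coordinate pair to the given number of decimals."""
    return round(lat, decimals), round(lng, decimals)
```

For example, grid_size_m(5) comes out at about 1.1 m per step at the equator, and grid_size_m(3) at about 111 m, so three or four decimals are often plenty for "users near me" style searches.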
I have lat/lon coordinates in a partitioned MySQL table with 400 million rows.
The table grows by about 2000 records a minute, and old data is flushed every few weeks.
I am exploring ways to do spatial analysis of this data as it comes in.
Most of the analysis requires finding whether a point is in a particular lat/lon polygon or which polygons contain that point.
I see the following ways of tackling the point in polygon (PIP) problem:
1. Create a MySQL function that takes a point and a Geometry and returns a boolean.
   Simple, but I am not sure how Geometry can be used for operations on lat/lon coordinates, since Geometry assumes flat surfaces and not spheres.
2. Create a MySQL function that takes a point and the identifier of a custom data structure and returns a boolean.
   The polygon vertices can be stored in a table and a function can compute PIP using spherical math. A large number of polygon points may lead to a huge table and slow queries.
3. Leave the point data in MySQL, store the polygon data in PostGIS, and have the app server run the PIP query in PostGIS, providing the point as a parameter.
4. Port the application from MySQL to PostgreSQL/PostGIS.
   This will require a lot of effort in rewriting queries and procedures.
   I can still do it, but how good is PostgreSQL at handling 400 million rows?
A quick Google search for "mysql 1 billion rows" returns many results. The same query for Postgres returns no relevant results.
Would like to hear some thoughts & suggestions.
A few thoughts.
First, PostgreSQL and MySQL are completely different beasts when it comes to performance tuning. So if you go the porting route, be prepared to rethink your indexing strategies. Not only does PostgreSQL have far more flexible indexing than MySQL, but the table approaches are very different too, meaning the appropriate indexing strategies are as different as the tactics are. Unfortunately this means you can expect to struggle a bit. If I could give advice, I would suggest dropping all non-key indexes at first and then adding them back sparingly as needed.
The second point is that nobody here can likely give you a huge amount of practical advice at this point because we don't know the internals of your program. In PostgreSQL, you are best off indexing only what you need, but you can index functions' outputs (which is really helpful in cases like this) and you can index only part of a table.
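To show what indexing a function's output and indexing only part of a table look like, here is a toy sketch. Note the swap: it uses SQLite through Python's standard library as a stand-in so it can run anywhere, not PostgreSQL itself. PostgreSQL supports both ideas with similar CREATE INDEX statements, but the exact syntax and planner behavior differ. The table, columns and index names are made up.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE points (id INTEGER PRIMARY KEY, lat REAL, lng REAL, ts INTEGER)"
)
conn.executemany(
    "INSERT INTO points (lat, lng, ts) VALUES (?, ?, ?)",
    [(52.52, 13.40, 1500), (52.53, 13.41, 500), (48.85, 2.35, 2000)],
)

# Index on an expression: rows are bucketed into 0.1 degree latitude cells.
conn.execute("CREATE INDEX idx_lat_cell ON points (CAST(lat * 10 AS INTEGER))")

# Partial index: only rows with ts >= 1000 are indexed, keeping the index small.
conn.execute("CREATE INDEX idx_recent ON points (lat, lng) WHERE ts >= 1000")

# A query written with the same expression and predicate can use those indexes.
rows = conn.execute(
    "SELECT id FROM points "
    "WHERE CAST(lat * 10 AS INTEGER) = 525 AND ts >= 1000 "
    "ORDER BY id"
).fetchall()

index_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name"
    )
]
```

With a table that is flushed every few weeks, a partial index over just the recent window is the kind of thing I would look at first.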
I am more a PostgreSQL guy than a MySQL guy so of course I think you should go with PostgreSQL. However rather than tell you why etc. and have you struggle at this scale, I will tell you a few things that I would look at using if I were trying to do this.
Functional indexes
Write my own functions for indexes for related analysis
PostGIS is pretty amazing and very flexible
In the end, switching db's at this volume is going to be a learning curve, and you need to be prepared for that. However, PostgreSQL can handle the volume just fine.
The number of rows is quite irrelevant here.
The question is how much of the point-in-polygon work can be done by the index.
The answer to that depends on how big the polygons are.
PostGIS is very fast to find all points in the bounding box of a polygon. Then it takes more effort to find out if the point actually is inside the polygon.
If your polygons are small (small bounding boxes) the query will be efficient. If your polygons are big, or have a shape that makes the bounding box big, it will be less efficient.
If your polygons are more or less static there are workarounds. You can divide your polygons into smaller polygons and recreate the index. Then the index will be more efficient.
If your polygons are actually multipolygons, the first step is to split the multipolygons into polygons with ST_Dump and then build an index on the result.
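The two-step check described above (cheap bounding box test first, exact test only for the survivors) can be sketched in a few lines. This is not how PostGIS implements it internally; it is a plain ray-casting illustration that treats lon/lat as flat x/y coordinates, which I am assuming is acceptable for small polygons that do not cross the 180° meridian. All function names are mine.

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) for a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_box(point, box):
    """Cheap test: is the point inside the axis-aligned box?"""
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point, polygon):
    """Exact test by ray casting: count edge crossings of a ray going right."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def contains(point, polygon, box=None):
    """Bounding box filter first, exact point-in-polygon test second."""
    if box is None:
        box = bounding_box(polygon)
    return in_box(point, box) and in_polygon(point, polygon)
```

An L-shaped polygon shows why the bounding box alone is not enough: a point in the empty corner of the L passes the box test but fails the exact test. The bigger that empty area, the more wasted exact tests, which is the reason for splitting big polygons.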
HTH
Nicklas
I'm starting a new project and although I'm used to MySQL, I'm worried about efficiency. I'm open to other options, and graph databases sound intriguing.
I will need to find similar users based on location and rating-like values. In MySQL I would probably have to join across two many-to-many relationships and order by the distance of both location and those values (Euclidean distance, probably). MySQL seems slow with things like that.
I will also need to do things like find 10 nodes whose text starts with a given substring and that have the largest number of connections (which is an autocomplete, I guess).
Would Neo4j or another graph database do this easily and efficiently?
Yes, Neo4j is certainly more appropriate than MySQL. I've used it myself for similarity searches and continue to do so. Check out Cypher, or Gremlin, depending on how complex your criteria are -- together with the inbuilt Lucene index, it's terrific.
Examples of what you may be trying to achieve: http://docs.neo4j.org/chunked/stable/data-modeling-examples.html
So, this question is similar to what I need, but the answers there don't quite match. I'm looking for a way to take a set of SURF descriptors and store them in a MySQL database so that I can take an image from a user, and run a reverse image search quickly.
What I'm doing now
At the moment, I am taking the list of descriptors given to me by jOpenSurf, running through them, and converting them to two 64 character strings. With this, I can query and find exact matches very easily, but I don't just want exact matches, I would like to do comparison of features.
What (I think) I need to do
After doing a bit of research online and looking at the comparison code provided by jOpenSurf, I think what I need to do is store the vector value of each interest point in the database so that I can compare that. But that is where I'm stuck.
What I need help with
How in the world can I store a vector value into a MySQL database so I can do a comparison for similarity matching on images?
I don't know of anything in MySQL that would allow vector comparisons natively (e.g. a cosine function on vectors). Some people have asked whether the geospatial module would allow this, but the consensus was no.
You would need to evaluate the similarity scores outside of the database and store them in the database after evaluation (e.g. store the top 5 hits per image in the database). Assuming you are using a symmetric scoring algorithm, you would only need to update the database for at most N images for managing top N similar images for each new image evaluated.
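As a rough illustration of scoring outside the database and keeping only the top hits, here is a sketch using cosine similarity. It assumes one feature vector per image, which is a simplification: SURF produces many descriptors per image, so a real matcher would have to aggregate per-descriptor matches first. The function names and the dict of stored vectors are hypothetical.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if one is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_matches(query_vec, stored, n=5):
    """Score query_vec against every stored vector and keep the n best.

    stored maps image_id -> feature vector.
    Returns a list of (score, image_id), best first.
    """
    scored = (
        (cosine_similarity(query_vec, vec), image_id)
        for image_id, vec in stored.items()
    )
    return heapq.nlargest(n, scored)
```

The (score, image_id) pairs returned by top_matches() are what you would write back to a MySQL table, so later lookups are a simple indexed read instead of a vector comparison.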
There is some interesting work using a reverse index database (e.g. search engine) in providing image relevancy searches based on both image features and additional metadata or text if you so choose: http://www.mendeley.com/research/lire-lucene-image-retrieval-an-extensible-java-cbir-library/
I have a MySQL database that contains geo-tagged objects. The objects are tagged by using a bounding polygon that the user draws and my program exports into the database. The bounding polygon is stored in the database as a Polygon (the MySQL spatial extensions kind).
I can think of a couple ways to do this, but I'm not very pleased with any of them, as this needs to be an efficient process that will execute fairly often, although on probably only < 50,000 records in the pertinent table.
I need a way to, given any point on the earth, find the record that corresponds to the closest geo-tagged/bounded object. It doesn't need to be correct in all cases but, let's say (just to invent a number), 95% of the time. Manual correction is acceptable if it doesn't need to be done very frequently.
It appears as though this question is very similar
Get polygons close to a lat,long in MySQL.
I am going to write some application-level code to do an iteratively widening search on the distance in the linked question.
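A first sketch of that widening search is below. The actual database query is hidden behind a callback, find_within, which is a hypothetical function I would still have to write; it is assumed to return (distance_km, record) pairs for objects within the given radius. The starting radius, growth factor and cap are arbitrary defaults.

```python
def widening_search(find_within, start_km=1.0, factor=2.0, max_km=20000.0):
    """Grow the search radius until something is found, then return the nearest.

    find_within(radius_km) must return a list of (distance_km, record) pairs.
    Returns the nearest record, or None if nothing is found up to max_km.
    """
    radius = start_km
    while True:
        hits = find_within(radius)
        if hits:
            # Several objects may fall inside the radius; keep the closest one.
            return min(hits, key=lambda hit: hit[0])[1]
        if radius >= max_km:
            return None
        radius = min(radius * factor, max_km)
```

Doubling the radius keeps the number of queries small (about 15 steps from 1 km to the 20000 km cap), and since the match only has to be right most of the time, the distance returned by find_within can be a cheap approximation.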