Has any of you had any experience with using NoSQL (non-relational) databases to store spatial data? Are there any potential benefits (speed, space, ...) of using such databases to hold data for, say, a desktop application (compared to using SpatiaLite or PostGIS)?
I've seen posts about using MongoDB for spatial data, but I'm interested in some performance comparison.
graphs databases like Neo4j are a very good fit, especially as you can add different indexing schemes dynamically as you go. Typical stuff you can do on your base data is of course 1D indexing (e.g. Timline or B-Trees) or funkier stuff like Hilbert Curves etc, see Nick's blog. Also, for some live demonstration, look at the AWE open source GIS desktop tool here, the underlying indexed graph being visible around time 07:00 .
Currently, MongoDB uses geohashing with B-trees which will be slower than the R-trees of PostGIS (I can't give exact numbers, I'm afraid, but there is plenty of theoretical literature on the differences). However, in these slides, http://www.slideshare.net/nknize/rtree-spatial-indexing-with-mongodb-mongodc the author talks about adding R-trees to MongoDB and sharding on a geo key. You talk about desktop use, so geosharding may not be of interest, as sharding's benefits will be felt more on massive datasets.
Ultimately, it probably comes down more to what you want to do with your spatial data. Postgis has vastly more functions and support for topology, rasters, 3D, conversions between coordinate systems, so if this is what you are looking for, PostGIS would still be the best option. If you are interested in storing billions/trillions of spatial objects and just running basic find all points near/inside this point based on some criteria, then MongoDB is likely a very good choice.
Couchdb also has a simple spatial extension
http://vmx.cx/cgi-bin/blog/index.cgi/category/CouchDB
I've been storing spatial data with ZODB. There's some inherent performance advantage in accessing local file data (spatialite) or unix socket (PostGIS) compared to TCP or HTTP requests (CouchDB etc), surely, but having an spatial index makes the biggest difference. I'm using the same R-trees mentioned in the MongoDB article, but there are plenty of good options. The JTS topology suite has various spatial indexes for Java.
Cassandra is also an option for spatial data:
http://www.readwriteweb.com/cloud/2011/02/video-simplegeo-cassandra.php
Tarantool supports spatial two-dimensional index (RTREE) with nearest neighbor search, overlaps, contains, and other spatial operators. Tarantool maintains the entire data set in RAM, making it the only OSS in-memory database with spatial index support.
https://github.com/tarantool/tarantool/wiki/R-tree-index-quick-start-and-usage
MarkLogic(Enterprise NoSQL) provides spatial functionality. This NoSQL product provides GIS applications the ability to conflate multiple objects into one entity. This provides support for managing relationships across structured and unstructured content, provenance and pedigree information about the data, historic and timeline information, etc. in a single entity.
Related
I am building a data ware house that is the range of 15+ TBs. While storage is cheap, but due to limited budget we have to squeeze as much data as possible in to that space while maintaining performance and flexibility since the data format changes quiet frequently.
I tried Infobright(community edition) as a SQL solution and it works wonderful in term of storage and performance, but the limitation on data/table alteration is making it almost a no go. and infobright's pricing on enterprise version is quiet steep.
After checking out MongoDB, it seems promising except one thing. I was in a chat with a 10gen guy, and he stated that they don't really give much of a thought in term of storage space since they flatten out the data to achieve the performance and flexibility, and in their opinion storage is too cheap nowadays to be bother with.
So any experienced mongo user out there can comment on its storage space vs mysql (as it is the standard for what we comparing against to right now). if it's larger or smaller, can you give rough ratio? I know it's very situation dependent on what sort of data you put in SQL and how you define the fields, indexing and such... but I am just trying to get a general idea.
Thanks for the help in advance!
MongoDB is not optimized for small disk space - as you've said, "disk is cheap".
From what I've seen and read, it's pretty difficult to estimate the required disk space due to:
Padding of documents to allow in-place updates
Attribute names are stored in each collection, so you might save quite a bit by using abbreviations
No built in compression (at the moment)
...
IMHO the general approach is to build a prototype, insert data and see how much disk space your specific use case requires. The more realistic you can model your queries (inserts and updates) the better your result will be.
For more details see http://www.mongodb.org/display/DOCS/Excessive+Disk+Space as well.
Pros and Cons of MongoDB
For the most part, users seem to like MongoDB. Reviews on TrustRadius give the document-oriented database 8.3 out of 10 stars.
Some of the things that authenticated MongoDB users say they like about the database include its:
Scalability.
Readable queries.
NoSQL.
Change streams and graph queries.
A flexible schema for altering data elements.
Quick query times.
Schema-less data models.
Easy installation.
Users also have negative things to say about MongoDB. Some cons reported by authenticated users include:
User interface, which has a fairly steep learning curve.
Lack of joins, which can make some data retrieval projects difficult.
Occasional slowness in the cloud environment.
High memory consumption
Poorly structured documentation.
Lack of built-in analytics.
Pros and Cons of MySQL
MySQL gets a slightly higher rating (8.6 out of 10 stars) on TrustRadius than MongoDB. Despite the higher rating, authenticated users still mention plenty of pros and cons of choosing MySQL.
Some of the positive features that users mention frequently include MySQL’s:
Portability that lets it connect to secondary databases easily.
Ability to store relational data.
Fast speed.
Excellent reliability.
Exceptional data security standards.
User-friendly interface that helps beginners complete projects.
Easy configuration and management.
Quick processing.
Of course, even people who enjoy using MySQL find features that they don’t like. Some of their complaints include:
Reliance on SQL, which creates a steeper learning curve for users who
do not know the language.
Lack of support for full-text searches in InnoDB tables.
Occasional stability issues.
Dependence on add-on features.
Limitations on fine-tuning and common table expressions.
Difficulties with some complex data types.
MongoDB vs MySQL Performance
When comparing the performance of MongoDB and MySQL, you must consider how each database will affect your projects on a case-by-case basis. While some performance features may appear to be objectively promising, your team members may never use the features that drew you to a database in the first place.
MongoDB Performance
Many people claim that MongoDB outperforms MySQL because it allows them to create queries in multiple ways. To put it another way, MongoDB can be used without knowing SQL. While the flexibility improves MongoDB's performance for some organizations, SQL queries will suffice for others.
MongoDB is also praised for its ability to handle large amounts of unstructured data. Depending on the types of data you collect, this feature could be extremely useful.
MongoDB does not bind you to a single vendor, giving you the freedom to improve its performance. If a vendor fails to provide you with excellent customer service, look for another vendor.
MySQL Performance
MySQL performs extremely well for teams that want an open-source relational database that can store information in multiple tables. The performance that you get, however, depends on how well you configure the MySQL database. Configurations should differ depending on the intended use. An e-commerce site, for example, might need a different MySQL configuration than a team of research scientists.
No matter how you plan to use MySQL, the database’s performance gets a boost from full-text indexes, a high-speed transactional system, and memory caches that prevent you from losing crucial information or work.
If you don’t get the performance that you expect from MySQL data warehouses and databases, you can improve performance by integrating them with an excellent ETL tool that makes data storage and manipulation easier than ever.
MySQL vs MongoDB Speed
In most speed comparisons between MySQL and MongoDB, MongoDB is the clear winner. MongoDB is much faster than MySQL at accepting large amounts of unstructured data. When dealing with large projects, it's difficult to say how much faster MongoDB is than MySQL. The speed you get depends on a number of factors, including the bandwidth of your internet connection, the distance between your location and the database server, and how well you organise your data.
If all else is equal, MongoDB should be able to handle large data projects much faster than MySQL.
Choosing Between MySQL and MongoDB
Whether you choose MySQL or MongoDB probably depends on how you plan to use your database.
Choosing MySQL
For projects that require a strong relational database management system, such as storing data in a table format, MySQL is likely to be the better choice. MySQL is also a great choice for cases requiring data security and fault tolerance. MySQL is a good choice if you have high-quality data that you've been collecting for a long time.
Keep in mind that to use MySQL, your team members will need to know SQL. You'll need to provide training to get them up to speed if they don't already know the language.
Choosing MongoDB
When you want to use data clusters and search languages other than SQL, MongoDB may be a better option. Anyone who knows how to code in a modern language will be able to get started with MongoDB. MongoDB is also good at scaling quickly, allowing multiple teams to collaborate, and storing data in a variety of formats.
Because MongoDB does not use data tables to make browsing easy, some people may struggle to understand the information stored there. Users can grow accustomed to MongoDB's document-oriented storage system over time.
I'm currently using MySQL and Solr, but considering making some changes. Should I keep on using Solr for location-based search, or move the entire thing over to MongoDB?
I did some experiments with both MongoDB & Solr. Don't have many data points to share, but from what I remember Solr had a much better performance than mongoDB when combining queries including both location and text.
Overall, since MongoDB does not have full-text search capability, and you are already using Solr, I would recommed to continue using it for search.
You can compute a geohash on your own. I've wrote a php library to lookup point of sales based on lat-lng pairs and spatial index. I also used it for a hierarchical cluster. It makes you indepedent from the database vendor and you have much more choice of the spatial index. You can grab my library at phpclasses.org (hilbert-curve).
Simple question, could I conceivably use redis instead of mysql for all sorts of web applications: social networks, geo-location services etc?
Nothing is impossible in IT. But some things might get extremely complicated.
Using key-value storage for things like full-text search might be extremely painfull.
Also, as far as I see, it lack support for large, clustered databases: so on MySQL you have no problems if you grow over 100s of Gb in Database, and on Redis... Well, it will require more effort :-)
So use it for what it was developed for, storing simple things which just need to be retreived by id.
ACID compliance is a must, if data integrity is important. Medical records and financial transactions would be an example. Most of the NoSQL solutions, including Redis, are fast because they trade ACID properties for speed.
Sometimes data is simply more convenient to represent using a relational database and the queries are simpler.
Also, thanks to foreign relationships and constraints in relational databases, your data is more likely to be correct. Keeping data in sync in NoSQL solutions is more difficult.
So, no I don't think we can talk about full replacement. They are different tools for different jobs. I wouldn't trade my hammer for a screwdriver.
I have researching NoSQL for a while, but I am still struggling to wrap my head around searching and filtering results/documents.
In a NoSQL world, how would I find all data within two timestamps for example? If everything is stored on a key/value basis? Or find all documents within a radius of a latitude and longitude point?
Thanks
Different data structures support different use cases. For instance, a distributed hash-table would be a good choice if you can make do with the limited API of the dictionary/map interface.
If you need to ask more complex queries of your database, then pick a database that supports this use case efficiently. The database landscape is very varied, and there is probably a database out there for you.
For range queries, the BigTable clones (and upwards in expressive querying power) will probably be worth considering.
Even if the database with the right data structure isn't fast enough or can't scale enough, you can still pull tricks like sharding, replication and clever use of caching or search indexes.
How you want to weigh the constraints of consistency, availability, resilience, through-put and latency all depend on the specific problem, and you may find that you need more than one type of database to implement the optimal solution.
Don't artificially make things harder for yourself; premature-optimization, you know.
I work at a MySQL shop and it works fast enough for our hundreds of thousands of transactions a day (millions of queries), and reliably enough for when our software is used in live TV events.
Sorry that I can't give you a more concrete answer than this.
For example MongoDB supports range queries and you can create geo indexes on latitude/longitude info. There are large differences between the NoSQL databases.
I have huge database (kinda wordnet) and want to know if it's easier to use Cassandra instead of MySQL|PostrgreSQL
All my life I was using MySQL and PostrgreSQL and I could easily think in terms of relational algebra, but several weeks ago I learned about Cassandra and that it's used in Facebook and Twitter.
Is it more convenient?
What DBMS are usually used nowadays to store social net's data, relationships between objects, wordnet?
There is nothing like a Silver bullet solution, everything is built to solve specific problem and has its own pros and cons. It is up to you to decide - what problem statement you have and what is best solution that fits your problem. Whether you use Cassandra (NoSQL) or MySQL(RDBMS), it is all driven from your system's requirements. Below are the inputs that will help you in taking better decision while deciding on database.
Why to Use NoSQL
In the case of RDBMS database, making choice is quite easy because almost all the databases like MySQL, Oracle, MS SQL, PostgreSQL in this category offer almost same kind of solutions oriented to the ACID property. When it comes to NoSQL, decision becomes difficult because every NoSQL database offers different solution and you have to understand which one is best suited for your app/system requirement. For example, MongoDB fits for use cases where your system demands schema-less document store. HBase might fit for Search engines, analysing log data, any place where scanning huge, two-dimensional join-less tables is a requirement. Redis is built to provide In-Memory search for varieties of data structures like tree, queue, link list etc and can be good fit for making real time leader board, pub-sub kind of system. Similarly there are other database in this category (including Cassandra) which fits for different problems. Now lets move to original question, and answer them one by one.
When to use Cassandra
Being a part of NoSQL family, Cassandra offers solution for problem where your requirement is to have very heavy write system and you want to have quite responsive reporting system on top of that stored data. Consider use case of Web analytics where log data is stored for each request and you want to built analytical platform around it to count hits by hour, by browser, by IP, etc in real time manner. You can refer to blog post (http://blogs.shephertz.com/2015/04/22/why-cassandra-excellent-choice-for-realtime-analytics-workload/) to understand more about the use cases where Cassandra fits in.
When to Use a RDMS instead of Cassandra/NoSQL
Cassandra is based on NoSQL database and does not provide ACID and relational data property. If you have strong requirement of ACID property (for example Financial data), Cassandra would not be a fit in that case. Obviously, you can make work out of it, however you will end up writing lots of application code to handle ACID property and will loose on time to market badly. Also managing that kind of system with Cassandra would be complex and tedious for you.
There are many different flavours of "NoSQL" databases. If your application is really like Wordnet perhaps you should look at a graph database such as Neo4j.
I would suggest to analyse your request.
If you are going with more clusters, machines take NoSQL
If your data model is complicated - require efficient structures take NoSQL (no limits with type of columns)
If you fit in a few machines without scales, and you don't need super performance for multi request (as for example in social network - where lot of users send http request), and you don't think you involve saleability take RDBMS (Postgres have some good functions and structures which you can use, like array column type).
Cassandra should work better with large scales of data, multi purpose.
neo4j - would be better for special structures, graphs.
Cassandra and other NoSQL stores are being used for social based sites because of their need for massive write based operations. Not that MySQL and Postgres can't achieve this but NoSQL requires far less time and money, generally speaking.
Sounds like you may want to look at Neo4J though, just in terms of your object model needs.
All different products and they all have their pro's and conn's. What kind of problem do you have to solve?
Huge, as in TB's?