Combine MySQL, Sphinx and MongoDB. Good idea?

For a new project I'm looking to combine MySQL, Sphinx and MongoDB: MySQL for the relational data and searching on numeric values, Sphinx for free-text search and MongoDB for geodata. As far as my (quick) benchmarks show, MongoDB is the fastest for geo queries, Sphinx for free-text search and MySQL for relational data searches. So to get the best performance I might have to combine them in my project.
There are, however, three drawbacks to this:
1) Three points of failure: Sphinx, MySQL or MongoDB can crash, which will stop my site.
2) I need the data in three databases and have to keep them all up to date (the data only changes once per day, so it's not the worst problem).
3) Hardware requirements, mainly RAM, go through the roof, since every database wants a large portion of the RAM to perform well.
So the question is: should I combine the three, leave one out (probably MongoDB, using Sphinx for geodata as well), or even go with only one (MongoDB or MySQL)?
To give an idea of the data: the relational data is approximately 6 GB, the geodata about 4 GB and the free-text data about 16 GB.

I didn't quite understand whether the records/collections/documents contained in the three DBs have inter-DB references, e.g. if user names, jobs and telephone numbers are in MySQL and user addresses are in MongoDB. I'll assume that the answer is yes.
IMHO having 3 different storage solutions is not recommended, because:
1) (most important) You cannot aggregate data from two DBs (in a scalable way).
Example:
Let's say that you keep user data (user names) in MySQL and user geo coordinates in MongoDB. You can't run a query with filters/sorts on fields located in both DBs. For example, you can't:
SELECT all users
WHERE name starts with 'A'
SORT BY distance_from_center
Same applies for Sphinx.
Solution: you either limit yourself to the data available in a single DB, or you duplicate/mirror data from one DB to another; a minimal sketch of the second option follows.
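For the duplicate/mirror option, since the data only changes once per day, a once-a-day sync job is usually enough. A minimal sketch in Python; the collection, table and column names here are assumptions for illustration, not anything from the original question:

# Hypothetical sketch: mirror user coordinates from MongoDB into MySQL once a day,
# so a single SQL query can filter on name and sort on location.
from pymongo import MongoClient
import mysql.connector

mongo = MongoClient("mongodb://localhost:27017")
mysql_conn = mysql.connector.connect(host="localhost", user="app",
                                     password="secret", database="mydb")
cur = mysql_conn.cursor()

# Assumed Mongo collection: users {_id, lat, lng}; assumed MySQL table: users(id, name, lat, lng)
for doc in mongo.mydb.users.find({}, {"_id": 1, "lat": 1, "lng": 1}):
    cur.execute("UPDATE users SET lat = %s, lng = %s WHERE id = %s",
                (doc["lat"], doc["lng"], doc["_id"]))
mysql_conn.commit()

Once the coordinates are mirrored, the example query above can run entirely inside MySQL (filter on name, order by a distance expression over lat/lng).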
2) Maintenance costs: three servers to maintain, different backup/redundancy strategies, different scaling strategies. Development costs: the developer must use three query libraries, with three different ways to query, and so on.
3) Inconsistency/synchronization issues that must be dealt with manually (e.g. you want to insert data into both MongoDB and MySQL; MongoDB writes the data, but MySQL raises a referential integrity exception, so now the two DBs are inconsistent).
4) Regarding HW costs, the only RAM-eater is MongoDB (the recommendation is that all its indexes fit in RAM). For MySQL and Solr servers, you can control memory consumption.
What I would do:
If I don't need all the SQL features (like transactions, referential integrity, joins, etc.), I would go with MongoDB.
If I need those features, and I can live with lower performance on geo operations, I would go with MySQL.
Now, if I need (I mean, really need) full-text search, and MongoDB/MySQL FTS capabilities are not enough, I would also attach an FTS server like Sphinx, Solr, Elasticsearch, etc.


When is it time to switch to NoSQL?

I am dealing with a large database that is collecting historical pricing data. The schema is relatively simple and does not change.
Something like:
SKU (char), type(enum), price(double), datetime(datetime)
The issue is that this table now has over 500,000,000 rows, is around 20 GB, and is growing. It is already getting a bit difficult to run queries. One common query is to get all SKUs from a specific date range, consisting of maybe 500,000 records. Add any complexity like GROUP BY, and you can forget it.
This DB is mostly writes, but we obviously need to crunch the data and run queries occasionally. I understand that better index planning can help speed up the queries, but I am wondering if this is the type of data that would benefit from a NoSQL solution like MongoDB. Can I expect MySQL (probably moving to MariaDB) to continue to work for us, even after it grows beyond 100-200 GB in size? Or should I explore alternatives before things get unwieldy?
NoSQL is not a solution to a "large database" problem; NoSQL (specifically, document databases) is designed for scenarios where the nature of the data you're storing varies, so you don't want to define rigid schemas and relationships up front.
What you have is simple, well-defined data. This is ideally suited for a relational database, but for something of that scale I would recommend looking at something commercial (i.e. SQL Server or Oracle, depending on your platform). The databases I work with in SQL Server are around four terabytes in size, with several tables in the hundreds of millions of records like yours. A relational database can easily accommodate the simple data you've outlined.
You actually have an ideal use case for SQL, and a rather bad fit for NoSQL. MySQL devs report people using databases of 5,000,000,000 records. Some other SQL servers will be even more scalable than that. However, without proper index support, even a fraction of that will be unmanageable.
BTW, what is your table schema, including indices?
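To make the "proper index support" point concrete, here is a minimal sketch (Python with MySQL Connector/Python) of a composite index that serves the date-range query described in the question. The table name, column names and ENUM values are assumptions loosely based on the schema outline above, not the asker's real schema:

# Hypothetical sketch: a composite index so the common date-range query
# reads an index range instead of scanning 500M+ rows.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="pricing")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS price_history (
        sku   CHAR(20)                   NOT NULL,   -- assumed length
        type  ENUM('bid', 'ask', 'last') NOT NULL,   -- assumed values
        price DOUBLE                     NOT NULL,
        ts    DATETIME                   NOT NULL,
        KEY idx_ts_sku (ts, sku)   -- range scan on ts; sku comes from the index itself
    )
""")
cur.execute(
    "SELECT sku, price, ts FROM price_history WHERE ts BETWEEN %s AND %s",
    ("2013-01-01 00:00:00", "2013-01-31 23:59:59"))
rows = cur.fetchall()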
You could switch to MariaDB and then use the Spider engine. The Spider engine makes it possible to split your data across multiple MariaDB instances without losing the ability to run queries against your existing instance.
So you can define your own rules for partitioning and then create one instance per partition. In the end you have multiple instances of MariaDB, but all your records are virtually combined into one table by the Spider engine.
Your performance gain comes from splitting your data across multiple instances, thereby reducing the number of records per table or instance, and of course from using more hardware resources.
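As a rough illustration only (check the exact options against the MariaDB Spider documentation), such a setup can be declared roughly like this; the server names, hosts, credentials and partition rule are all assumptions:

# Hypothetical sketch of a Spider setup: two backend MariaDB servers and one
# Spider "virtual" table on the front-end instance spanning both.
import mysql.connector  # Connector/Python also works against MariaDB

conn = mysql.connector.connect(host="frontend.example", user="admin",
                               password="secret", database="pricing")
cur = conn.cursor()

statements = [
    # Register the backend instances (names/hosts are placeholders).
    """CREATE SERVER backend1 FOREIGN DATA WRAPPER mysql
       OPTIONS (HOST '10.0.0.1', DATABASE 'pricing', USER 'spider',
                PASSWORD 'secret', PORT 3306)""",
    """CREATE SERVER backend2 FOREIGN DATA WRAPPER mysql
       OPTIONS (HOST '10.0.0.2', DATABASE 'pricing', USER 'spider',
                PASSWORD 'secret', PORT 3306)""",
    # Spider table on the front end: queries against it fan out to the backends.
    """CREATE TABLE price_history (
           sku   CHAR(20) NOT NULL,
           ts    DATETIME NOT NULL,
           price DOUBLE   NOT NULL,
           KEY (ts, sku)
       ) ENGINE=SPIDER COMMENT='wrapper "mysql", table "price_history"'
       PARTITION BY KEY (sku) (
           PARTITION p1 COMMENT = 'srv "backend1"',
           PARTITION p2 COMMENT = 'srv "backend2"'
       )""",
]
for stmt in statements:
    cur.execute(stmt)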

Redis vs MySQL for Financial Data?

I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores them in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting. It looks good especially because of its performance, which is crucial in my application. I know that MySQL can be fast too; I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!
Redis is an in-memory store. All the data must fit in memory, so unless you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space-efficient time series: https://github.com/antirez/redis-timeseries
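The generic pattern behind that kind of library (not the linked project's exact API) is to pack many samples under one key, for example one hash per symbol per day, which keeps the key count and per-key overhead low. A minimal sketch with redis-py; the key layout and encoding are assumptions:

# Hypothetical sketch: one Redis hash per symbol per day; fields are timestamps,
# values are the quotes. Thousands of samples share a single key's overhead.
import redis
import time

r = redis.Redis(host="localhost", port=6379)

def record_quote(symbol, price, ts=None):
    ts = int(ts if ts is not None else time.time())
    day = time.strftime("%Y%m%d", time.gmtime(ts))
    r.hset("quotes:%s:%s" % (symbol, day), ts, price)

def quotes_for_day(symbol, day):
    # Returns {timestamp: price} for one symbol and one day ("YYYYMMDD").
    raw = r.hgetall("quotes:%s:%s" % (symbol, day))
    return {int(k): float(v) for k, v in raw.items()}

record_quote("AAPL", 512.34)
print(quotes_for_day("AAPL", time.strftime("%Y%m%d", time.gmtime())))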
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent for aggregating statistics in real time and storing the results of these calculations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on these data. B-tree-based stores supporting sharding (MongoDB, for instance) are probably more convenient than Redis for storing large time series.
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files. It is space efficient. It can be used with partitioning, so although it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is carefully chosen.
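A minimal sketch of what that could look like; the table, columns and partition boundaries are assumptions, not a prescription:

# Hypothetical sketch: ARCHIVE storage, partitioned by month, so a date-bounded
# query prunes to a few partitions despite ARCHIVE having no indexes.
# Note: ARCHIVE only supports INSERT and SELECT, which fits append-only quote data.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="quotes")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE quote_history (
        symbol CHAR(10)      NOT NULL,
        price  DECIMAL(12,4) NOT NULL,
        ts     DATETIME      NOT NULL
    ) ENGINE=ARCHIVE
    PARTITION BY RANGE (TO_DAYS(ts)) (
        PARTITION p2013_01 VALUES LESS THAN (TO_DAYS('2013-02-01')),
        PARTITION p2013_02 VALUES LESS THAN (TO_DAYS('2013-03-01'))
    )
""")

New monthly partitions then get added on a schedule, as the partition-pruning answer further down sketches.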
You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is that along one of your query dimensions (usually by ticker) you're accessing disk (SSD or spinning) contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar stores, without having to do any client-side shenanigans to merge blobs together.
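A minimal sketch of that "contiguous along the ticker dimension" layout in Cassandra, using CQL and the DataStax Python driver; the keyspace, table and column names are assumptions. The partition key groups all rows for a symbol, and the clustering column keeps them ordered on disk by time, so a time-range read is one contiguous slice:

# Hypothetical sketch: a Cassandra time-series table where each symbol's rows
# are stored together and ordered by time.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("market")  # assumes keyspace "market" exists
session.execute("""
    CREATE TABLE IF NOT EXISTS ticks (
        symbol text,
        ts     timestamp,
        price  double,
        PRIMARY KEY (symbol, ts)   -- partition by symbol, cluster by time
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
session.execute("INSERT INTO ticks (symbol, ts, price) VALUES (%s, %s, %s)",
                ("AAPL", "2013-05-01 14:30:00", 439.29))
rows = session.execute(
    "SELECT ts, price FROM ticks WHERE symbol = %s AND ts >= %s AND ts < %s",
    ("AAPL", "2013-05-01", "2013-05-02"))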
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra, at least, is that you sometimes lack consistency for a few seconds if you have a big cluster, so you either need to force it, slowing it down, or accept that the very latest print will sometimes be a few seconds old. On a single machine there will be zero consistency problems, and you'll get the same columnar benefits.
I'm less familiar with HBase, but it claims to be more consistent (there will be a cost elsewhere; see the CAP theorem), though it's much more of a commitment to set up the HBase stack.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (also removing regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to
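A minimal sketch of the rolling-pruning idea described in those posts; the table, partition names and boundaries are assumptions. Dropping a whole partition is a cheap metadata operation, unlike a DELETE over millions of rows:

# Hypothetical sketch: drop the oldest monthly partition and add one for the next month.
# Assumes quote_history is PARTITION BY RANGE (TO_DAYS(ts)) with monthly partitions
# and no MAXVALUE partition, as in the sketch earlier in this thread.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="quotes")
cur = conn.cursor()
cur.execute("ALTER TABLE quote_history DROP PARTITION p2012_01")
cur.execute("""
    ALTER TABLE quote_history
    ADD PARTITION (PARTITION p2013_06 VALUES LESS THAN (TO_DAYS('2013-07-01')))
""")

Run from a cron job (or the MySQL Event Scheduler, as the linked post does), this keeps a rolling window of data with no large deletes.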

When to consider Solr

I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3,000 and 10,000 records with 20-30 fields each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer if it is to be answered in all aspects. There are certainly specifics that may make one system superior to another for a particular use case, but I want to cover the basics here.
I will deal entirely with Solr as an example of the several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There is a list of reasons why, but they mostly consist of missing recovery options, lack of ACID transactions, possible complications, etc. If you decide to use Solr, you need to populate your index from another source like an SQL table. In fact, Solr is perfect for storing documents that include data from several tables and relations that would otherwise require complex joins to construct.
Solr/Lucene provides mind-blowing text analysis / stemming / full-text search scoring / fuzziness functions. These are things you just cannot do with MySQL. In fact, full-text search in MySQL is limited to MyISAM, and its scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, scoring results based on phrase proximity, matching accuracy, etc. ranges from very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations and process them. Well, you can of course index the keys of other documents inside a multivalued field of some document, so this way you can actually store 1:n relations, and do it both ways to get n:n, but it's data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example, for a product catalog where you want to store the distributors for products and you want to search only parts that are available at certain distributors). But you reach the end of the possibilities with HAS / HAS NOT. You almost cannot do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search that had 40,000 hits, you can display that you would only get 3 hits if you refined your search to the combination of having this field this value and that field that value. Stuff that needs additional queries in MySQL is done efficiently and conveniently.
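A minimal sketch of such a faceted, highlighted query through the pysolr client; the core name, fields and facet fields are assumptions:

# Hypothetical sketch: one Solr request returns the full-text hits plus the
# facet counts used to offer "refine your search" options.
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/products")  # assumed core name

results = solr.search("laptop backpack", **{
    "defType": "edismax",
    "qf": "title^3 description",        # boost matches in the title
    "hl": "true",                       # hit highlighting
    "hl.fl": "title,description",
    "facet": "true",
    "facet.field": ["brand", "color"],  # counts per brand/color for refinement
    "rows": 10,
})
print(results.hits)                     # total matches
print(results.facets["facet_fields"])   # e.g. {'brand': ['Acme', 42, ...], ...}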
So let's sum up
The power of Lucene is text searching/analysis. It is also mind-blowingly fast because of its inverted index structure. You can really do a lot of post-processing and satisfy other needs. Although it's document-oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and query. If your application is focused on text searching, you should definitely go for Solr/Lucene, unless you have good reasons, like very complex, multi-dimensional range filter queries, to do otherwise.
If you do not have text search, but rather something where users point and click rather than enter text, good old relational databases are probably the better way to go.
Use Solr if:
You do not want to stress your database.
You want real full-text search.
You want lightning-fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magic for full-text indexing, which is difficult to achieve with MySQL. A mix of MySQL and Solr can be used: MySQL for CRUD operations and Solr for searches. I previously worked with one of India's best real-estate online classifieds portals, which was using Solr for search (and had previously used MySQL). The migration reduced the search times manifold.
Solr can be easily integrated with MySQL:
The Solr full data import (DataImportHandler) can be used to import data from MySQL tables into Solr collections.
The Solr delta import can be scheduled at short intervals to load the latest data from MySQL into Solr collections.
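Both imports are triggered through the DataImportHandler's HTTP endpoint; a minimal sketch, e.g. from a cron job (the Solr URL and core name are assumptions):

# Hypothetical sketch: kick off DataImportHandler imports over HTTP.
import requests

SOLR_CORE = "http://localhost:8983/solr/listings"  # assumed core name

# One-off full rebuild of the index from the configured MySQL query.
requests.get(SOLR_CORE + "/dataimport", params={"command": "full-import"})

# Scheduled frequently (e.g. every few minutes) to pick up rows changed since
# the last import, based on the deltaQuery in data-config.xml.
requests.get(SOLR_CORE + "/dataimport", params={"command": "delta-import"})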

Gem that allows for data access using sharded MySQL databases while maintaining the usage of ActiveRecord

This is a relatively complex problem that I am thinking about, so please suggest edits or comment on parts you are not clear about. I will update and iterate based on your comments.
I am thinking of developing a Rails gem that simplifies the usage of sharded tables, even when most of your data is stored in relational databases. I believe this is similar to the approach used at Quora or FriendFeed when they hit a wall scaling with traditional MySQL, where most of the potential solutions require a massive migration (NoSQL) or are just really painful (sticking with relational completely).
http://bret.appspot.com/entry/how-friendfeed-uses-mysql
http://www.quora.com/When-Adam-DAngelo-says-partition-your-data-at-the-application-level-what-exactly-does-he-mean?q=application+layer+quora+adam+
Essentially, how can we continue using MySQL for the many things it is really good at, while allowing parts of the system to scale? This would let someone who got started with MySQL/ActiveRecord, but hit a roadblock scaling, easily scale the parts of the database where it makes sense.
For us, we are using Ruby on Rails on a sharded database and storing JSON blobs in it. Since we cannot do joins, we are creating tables for the relationships between entities.
For example, we have 10 different types of entities. Each entity can be linked to another using big (sharded) relationship tables.
The tables are extremely simple. The index is (Id1, Id2, ..., type), and the data is stored in the JSON blob:
Id, type, {json data}
Id1, Id2, type, {json data}
Id1, Id2, Id3, type, {json data}
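To illustrate the idea (in Python rather than Ruby, and with made-up table, host and helper names), the routing layer is typically just: hash one of the ids to pick a shard, then read/write the JSON blob in that shard's relationship table:

# Hypothetical sketch of the routing layer: pick a shard from Id1, then store the
# relationship row as (Id1, Id2, type, JSON blob) on that shard.
import json
import mysql.connector

NUM_SHARDS = 4
SHARD_HOSTS = ["db0.example", "db1.example", "db2.example", "db3.example"]  # placeholders

def shard_for(entity_id):
    return mysql.connector.connect(host=SHARD_HOSTS[entity_id % NUM_SHARDS],
                                   user="app", password="secret", database="app")

def put_relationship(id1, id2, rel_type, data):
    conn = shard_for(id1)
    cur = conn.cursor()
    cur.execute(
        "REPLACE INTO relationships (id1, id2, type, body) VALUES (%s, %s, %s, %s)",
        (id1, id2, rel_type, json.dumps(data)))
    conn.commit()

put_relationship(42, 7, "follows", {"since": "2011-08-01", "weight": 3})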
We have put a lot of work into creating higher-level interfaces for storing a range of data sets for relational data.
For any given type, you can define a type of storage (value, unweighted list, weighted list, weighted list with GUIDs).
We have higher-level interfaces for each of them: querying, sorting, timestamp comparison, intersections, etc.
That way, if someone realizes that they need to scale a specific part of the database, they can keep most of their infrastructure and move only the tables they need into this sharded database.
What are your thoughts? As mentioned above, I would love to know what you folks think.
Scalability is a tough nut to crack. My background includes two years as a sales engineer for BEA systems, back when all they sold was the TUXEDO middleware (TUXEDO == Transactions for UNix Extended for Distributed Operations). TUXEDO is still the king of the TPC-C benchmark on Unix platforms.
Scaling with respect to a database is not so much about the database itself; it's about how you access that database. For example, if you establish a connection to a database and you want that single connection to scale, make that connection always access the same table in the database. The problem with today's infrastructures (RoR included) is that when they open generic connections, those connections access many tables in the database.
So if you want to make a database CONNECTION scale, make that connection focus the database engine on as few database resources as possible. If you can manage to create a 'focused' connection, that ONLY accesses one table, and one table index, for example, it will scale much better than a connection that accesses EVERY table in the database and every index defined for all those tables.

How can I implement a large relational database schema on a basic key value store?

How can I implement a large relational database schema on a key value store?
The following are the requirements:
1) Uses no stored procedures or vendor-specific database features
2) Uses indexes
3) Uses joins
4) Many complex types in tables (VARCHAR, INT, BLOB, etc)
5) Billions of records
6) Full text search
7) Timestamped backup possible
8) Transactions not needed (atomic single row/field updates only)
You will have to, in essence, build your own relational database system. Without one, joins will be horribly slow (there is no query optimizer). You can get full-text search by grafting on Lucene.
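For a sense of what "build your own" means: secondary indexes and joins have to be maintained and executed by the application. A minimal sketch over a plain dict standing in for the key-value store; all key layouts and names are assumptions:

# Hypothetical sketch: an application-maintained secondary index and a lookup "join"
# over a plain key-value store (a dict stands in for the real store).
import json

kv = {}  # stand-in for the key-value store

def put_user(user_id, name, city_id):
    kv["user:%d" % user_id] = json.dumps({"name": name, "city_id": city_id})
    # Secondary "index": one key per city holding the ids of its users.
    ids = json.loads(kv.get("idx:users_by_city:%d" % city_id, "[]"))
    ids.append(user_id)
    kv["idx:users_by_city:%d" % city_id] = json.dumps(ids)

def users_in_city(city_id):
    # The "join": look up the index, then fetch each user row individually.
    ids = json.loads(kv.get("idx:users_by_city:%d" % city_id, "[]"))
    return [json.loads(kv["user:%d" % i]) for i in ids]

put_user(1, "Ada", 10)
put_user(2, "Bob", 10)
print(users_in_city(10))   # [{'name': 'Ada', ...}, {'name': 'Bob', ...}]

With billions of records, every such index update and multi-get has to be planned by hand, which is exactly the work a query optimizer would otherwise do for you.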
Have you considered an open source RDBMS (e.g. Derby)?
Often, the whole point of a key-value store is to get away from the constraints of relational databases, but it sounds like you want them back here. Everybody wants to have their cake and eat it too, but I'm confused about what you're trying to accomplish. If you want the power of a relational database, you should use a relational database.
That said, you may want to take a look at MongoDB, which represents a very good compromise between the rigid, structured nature of relational databases and the more free-form approach of a key-value store.