Switching from MySQL to Cassandra - Pros/Cons?

For a bit of background - this question deals with a project running on a single small EC2 instance, and about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real-time data comes in, streamed from various networked sensors, and ideally I'd like to establish a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on it, and store statistics in another few tables. All of this is rendered using Django.
Relational features I would need -
Order by [SliceRange in Cassandra's API seems to satisfy this]
Group by
Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
Sphinx on this gives me a nice full-text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are,
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well.) Size of tables is around 4-5 million rows, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But, for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
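If it helps to picture that "administrivia", here is a minimal sketch, assuming pycassa (a third-party Python client for Cassandra's Thrift API) and invented keyspace/column family names, of the write-twice, read-by-slice pattern:

    import time
    import pycassa

    # Hypothetical keyspace and column families; comparator assumed LongType.
    pool = pycassa.ConnectionPool('SensorApp', ['localhost:9160'])
    by_sensor = pycassa.ColumnFamily(pool, 'ReadingsBySensor')
    by_day = pycassa.ColumnFamily(pool, 'ReadingsByDay')

    def store_reading(sensor_id, value):
        ts = int(time.time() * 1e6)  # microsecond timestamp as the column name
        # One event, two writes: denormalized copies replace the join.
        by_sensor.insert(sensor_id, {ts: value})
        by_day.insert(time.strftime('%Y%m%d'), {ts: '%s:%s' % (sensor_id, value)})

    # Reads are column slices over one row, newest first -- the SliceRange
    # behaviour standing in for ORDER BY.
    latest = by_sensor.get('sensor-42', column_reversed=True, column_count=100)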
Would it make any sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could utilize a large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql)
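For concreteness, the FriendFeed idea boils down to something like this sketch (table and helper names invented; entities are opaque JSON blobs, and any indexes would be separate tables maintained by the application):

    import json
    import uuid
    import MySQLdb

    db = MySQLdb.connect(user='app', passwd='secret', db='appdata')
    cur = db.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS entities (
            id      BINARY(16) NOT NULL PRIMARY KEY,
            updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                    ON UPDATE CURRENT_TIMESTAMP,
            body    MEDIUMBLOB NOT NULL
        )""")  # body is opaque JSON: no per-field columns, so no ALTER TABLE ever

    def put(entity):
        cur.execute("REPLACE INTO entities (id, body) VALUES (%s, %s)",
                    (uuid.UUID(entity['id']).bytes, json.dumps(entity)))
        db.commit()

    def get(entity_id):
        cur.execute("SELECT body FROM entities WHERE id = %s",
                    (uuid.UUID(entity_id).bytes,))
        row = cur.fetchone()
        return json.loads(row[0]) if row else None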
Any insights from people who've done a shift would be greatly appreciated!
Thanks.

Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.
That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, key/columnfamily) but MySQL is much better at being a relational database. :)
There is no Django support for Cassandra (or any other NoSQL database) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.

If you're a relational database developer (as I am), I'd suggest/point out:
Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets is non-trivial.
Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.
These things said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.
Some good resources I've found include:
Dominic Williams' Cassandra blog posts
Secondary Indexes in Cassandra
More from Ed Anuff on indexing
Cassandra book (not fantastic, but a good start)
"WTF is a SuperColumn" pdf

Django-Cassandra support is in early beta. Django also wasn't made for NoSQL databases: the Django ORM is built around SQL (Django recommends PostgreSQL). If you need to use ONLY NoSQL (you can mix SQL and NoSQL in the same app), you'd have to take the risk of a NoSQL ORM (significantly slower than a traditional SQL ORM or direct use of the NoSQL storage), or completely rewrite the Django ORM. But in that case I can't see why you'd need Django at all. Maybe you could use something else, like Tornado?

Related

Backend technology for high-volume data for a web application

I am developing an application to provide daily dynamic information like prices, availability, etc. for around 50,000 objects. I need to store data for about the next 200 days, which means a total of 10 million rows. The prices will be batch updated and new data will be added once daily - say about 10,000 existing rows get updated and 50,000 rows are inserted each day. What is the best backend framework I can use?
Can MySQL scale with limited hardware capacity? Or is a NoSQL database the way to go? If so, which NoSQL database would be best suited for fast fetching and updating of the data?
I would recommend Cassandra, as you need to write more than read, and Cassandra is optimized for high write throughput.
It provides scalability, no single point of failure and high throughput. And you can update records as well.
Cassandra also supports batch operations for DML (data manipulation language), i.e. insert, update and delete, and a Cassandra batch provides atomicity as well.
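As a sketch of what such a batch looks like from Python, assuming the DataStax driver and an invented keyspace/table (a logged batch is applied atomically - all mutations or none):

    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement

    session = Cluster(['127.0.0.1']).connect('pricing')  # hypothetical keyspace
    insert = session.prepare(
        "INSERT INTO daily_prices (object_id, day, price) VALUES (?, ?, ?)")

    batch = BatchStatement()  # logged batch: atomic, all-or-nothing
    for object_id, day, price in [(1, '2014-01-01', 9.99), (2, '2014-01-01', 4.50)]:
        batch.add(insert, (object_id, day, price))
    session.execute(batch)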
This type of volume is well within the capabilities/capacities of a traditional RDBMS. I would say that if you are familiar with MySQL, you will be safe to stick with it. A lot also depends on what kind of queries you want to run. With a properly structured, denormalized setup you can run ad hoc queries in an RDBMS, whereas with document stores you need to think quite carefully about structure up front -- embedding versus referencing, see: MongoDB relationships: embed or reference?. MongoDB has added a very nice aggregation framework, which goes a long way towards being able to query data as you would in an RDBMS, but in many other NoSQL systems, queries are essentially map-reduce jobs and joins are either painful or impossible.
It sounds like your data is structured around dates/days. One thing you can do that will yield dramatic speed improvements on queries is partitioning by date ranges. I have worked on dbs over 100m rows in MySQL where historical data had to be kept for auditing purposes but where most of the read/write was on current data, and partitioning led to truly dramatic read query improvements.
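A minimal sketch of that in MySQL (invented table and column names; note the partitioning key has to be part of every unique key on the table):

    import MySQLdb

    DDL = """
    ALTER TABLE prices
    PARTITION BY RANGE (TO_DAYS(price_date)) (
        PARTITION p201401 VALUES LESS THAN (TO_DAYS('2014-02-01')),
        PARTITION p201402 VALUES LESS THAN (TO_DAYS('2014-03-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
    )
    """

    db = MySQLdb.connect(user='app', passwd='secret', db='appdata')
    db.cursor().execute(DDL)
    # Queries that constrain on price_date now touch only the matching
    # partitions (partition pruning) - that is where the read speedup comes from.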
You might be interested by this link which shows what some very high volume sites are using: What databases do the World Wide Web's biggest sites run on? Anecdotally, I know that Facebook had trillions of rows in MySQL across various clusters before they started hitting real bottlenecks, but it is no surprise that Cassandra ultimately came out of Facebook engineering, given the truly colossal data volumes they now handle.
Cassandra, Riak, CouchDB, MongoDB, etc all arose to solve very real problems, but these come with tradeoffs, both in terms of the CAP theorem, and in terms of ad hoc queries being more difficult than in RDBMS. Having said that, MongoDB and Cassandra (which I have most experience with) are easy to set up and fun to work with, so if you want to give them a go, I'm sure you will have no problems, but I would say your usage requirements are well within the capabilities of MySQL. Just my 2c.

Redis usage for a Semantic Web application: RDF quads alongside SPARQL

I am the sysadmin supporting a Semantic Web application which runs on Tomcat and is using MySQL for the datastore. The dynamic nature of the queries causes the larger pages to load exponentially more slowly than pages with less dynamic content. The database is queried with scan-type SELECT statements millions of times a day, and it is my thinking that the queries generated from SPARQL are not the most efficient, but changing them seems to be out of our hands, as the queries are somewhat of a black box to the developers. What I would like to know is: can Redis be used in a situation like this to handle semantic relational data sets in RDF (quads, triples, etc.)? I am not very knowledgeable on this, so an explanation/links would be appreciated. Thanks!
In my experience, MySQL is not an acceptable backend for a triplestore. You should consider using one of the many dedicated triple stores; any of them will probably outperform MySQL because they do not have the impedance mismatch of storing triples/quads in a relational store - they use native storage.
There is no production-ready solution based on Redis that I'm aware of. A quick Google search turned up a couple of projects that seem to be related, but they may not meet your requirements.

SQL Server vs. NoSQL

So I have a website that could eventually get some pretty high traffic. My DB implementation is in SQL Server 2008 at the moment. I really only have 2 tables and a few stored procs. Most of the DB could be re-designed to work without joining (although it wouldn't make sense when I can join so easily within SQL Server).
I heard that sites like Digg and Facebook use NoSQL databases for a lot of their basic data access. Is this something worth looking into, or will SQL Server not really slow me down that bad?
I use paging on my site (although this might change in the future), and I also use AJAX'd data access for most of the "live" stuff, so it doesn't really seem to be a performance hindrance at the moment, but I'm afraid it will be as the data starts expanding exponentially.
Am I going to gain a lot of performance by moving to NoSQL? Honestly, right now I don't even completely understand NoSQL, so any tips on how it might help would be appreciated.
Thanks guys.
Actually Facebook use a relational database at its core, see SOCC Keynote Address: Building Facebook: Performance at Massive Scale. And so do many other web-scale sites, see Why does Quora use MySQL as the data store instead of NoSQLs such as Cassandra, MongoDB, CouchDB etc?. There is also a discussion of how to scale SQL Server to web-scale size, see How do large-scale sites and applications remain SQL-based? which is based on MySpace's architecture (more details at Scale out SQL Server by using Reliable Messaging). I'm not saying that NoSQL doesn't have its use cases, I just want to point out that there are many shades of gray between white and black.
If you're afraid that your current solution will not scale then perhaps you should look at what are the factors that prevent scalability with your current solution. Test data is cheap to produce, load the 'exponentially increased' data volume and run your test harness, see where it cracks. None of the NoSQL solutions will bring magic off-the-shelf scalability, they all require you to understand how to use them effectively and deploy them correctly. And they also require you to test with large volumes if you want to ensure success at scale. Same for traditional relational solutions.
SQL Server scales pretty well. For example, Stack Overflow used it to serve you this very page. Facebook and Google might use a form of NoSQL, but even if you make it really big, you're unlikely to rise to that level.
With a simple table structure and data that fits on one server, it doesn't matter much what platform you use. There are several possible reasons to need to move to NoSQL:
Data scaling - SQL works best when all the data fits on one server (up to a few TB). The reason a lot of NoSQL stores don't have joins is that they were designed not to require all the objects to be on one server.
Performance scaling - NoSQL stores do tend to be faster at handling high traffic, but not necessarily by enough to matter. You can improve SQL performance quite a lot with replication and caching as long as you aren't running into data size issues. Writes generally do have to run on the one server, but in most cases you will need to improve read performance long before write performance becomes an issue.
Complex data access - some types of queries simply don't fit well into a relational model. Graph and set stores work quite differently from relational databases so are a better fit for some applications.
Easier development - If you don't already have a SQL database and all the code to support it, using a schemaless datastore can save quite a bit of development time.
I don't think you have to move your database from SQL to NoSQL unless and until you are serving thousands of terabytes of data. If you normalize your tables properly, serve the data sensibly, and set up a proper archiving mechanism, it should work.
If you are still unsure what to choose and how, check this. Assuming you have decided to move to a NoSQL database, there are a lot of players in the market; have a look at the list, which again depends upon your needs and the type of data you have.
Am I going to gain a lot of performance by moving to NoSQL?
It depends.
Check out this article for 7 reasons when you DON'T want to use NoSQL. If none of them applies to your case, then read further.
The main advantage of document-based NoSQL for traditional enterprise needs is cheaper hosting at high scale, due to lower CPU usage when querying denormalised data (the most frequent kind of request). Key points:
The CPU goes nuts on JOINs and GROUP BYs in SQL queries, whereas a denormalised data structure implies fewer or no JOINs, hence less stress on the CPU (see the sketch after these points).
CPU is the most expensive resource in the cloud, while storage is the cheapest. Denormalised data trades higher storage for lower CPU.
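To make the trade concrete, a sketch (invented field names) of the same read served two ways: normalised, it is a three-table JOIN plus a GROUP BY on every request; denormalised, it is a single key lookup of a pre-joined document:

    order_document = {
        "order_id": 17,
        "customer": {"id": 4, "name": "Acme Ltd"},           # copied in, not referenced
        "lines": [
            {"sku": "A-100", "qty": 2, "unit_price": 9.99},  # embedded, no join table
            {"sku": "B-200", "qty": 1, "unit_price": 4.50},
        ],
        "total": 24.48,  # precomputed at write time instead of GROUP BY at read time
    }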
How to get there?
1. Master DDD (Domain-Driven Design).
2. Gain a good understanding of CQRS (Command Query Responsibility Segregation) and eventual consistency.
3. Understand your domain and business processes.
4. Design a model tuned to the access patterns.
5. Review.
6. Repeat steps 3 - 5.

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be frequently displayed back and often updated. At the moment I store the content inside a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications, etc. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once the size of our database reaches a significant number. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews of MongoDB and it looks interesting. I'm happy to spend the time and learn if one turns out to be the way to go. I'd much appreciate anyone offering points or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarefied database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput to cripple a properly structured, well-resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of db is how efficiently data can be cached. An effective cache can have dramatic effects on reducing db load and speeding up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
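The cache-aside pattern is only a few lines; a sketch with the python-memcached client (the key name and the stand-in query function are invented):

    import json
    import memcache

    mc = memcache.Client(['127.0.0.1:11211'])

    def expensive_db_query(user_id):
        return [{'user': user_id, 'item': 'example'}]  # stand-in for the real joins

    def get_user_feed(user_id):
        key = 'feed:%d' % user_id
        cached = mc.get(key)
        if cached is not None:
            return json.loads(cached)            # hit: no db work at all
        feed = expensive_db_query(user_id)       # miss: hit the database once
        mc.set(key, json.dumps(feed), time=300)  # keep the result for 5 minutes
        return feed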
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although maybe someone with more legal experience could tell you the implications better. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands - you can do that with closed source.
PostgreSQL is usually the best free replacement for Oracle, and the BSD license should be more business-friendly.
If you go for a non-relational database instead, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type factor.
There are three things that really have a deep impact on which database is your best choice, and you do not mention them:
The size of your data or if you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case, rather than a database, you may need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their concept/design they do not handle data distribution as well... But will you be handling so much data that it does not fit into one server, or have access rights that need special separate/extra servers?
However, most people will go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, and basically most relational database engines, can handle rather large amounts of data with proper indexes and a sane database schema.
Why don't you test how MySQL behaves with a bigger amount of data in your setup? Write some scripts that generate realistic data into a MySQL test database, generate some load on the system, and see if it is fast enough.
Only if it is not fast enough should you start considering optimizing the database and changing to a different database engine.
Be careful with NHibernate: it is easy to make a solution that is nice and easy to code with, but has bad performance with large amounts of data. For example, whether to use lazy or eager fetching with associations should be carefully considered. I don't mean that you shouldn't use NHibernate, but make sure that you understand how NHibernate works, for example what the "n + 1 selects" problem means.
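NHibernate aside, the trap looks the same in any ORM. A sketch of "n + 1 selects" using Django's ORM (a fragment: it assumes an existing project with a hypothetical Submission model holding a ForeignKey named author):

    # n + 1: one query for the submissions, then one hidden query per row
    # the moment the lazy relation is touched.
    for sub in Submission.objects.all():
        print(sub.author.username)

    # 1 query: the JOIN is made up front with eager fetching.
    for sub in Submission.objects.select_related('author'):
        print(sub.author.username)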
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed, since at the beginning developer time is more valuable than CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables, so you can usually just store your objects directly as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because Mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
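A sketch of that with PyMongo (modern API; database, collection and field names invented):

    from pymongo import MongoClient, DESCENDING

    posts = MongoClient().blog.posts

    # The object graph is stored as-is: comments embedded, no join table,
    # and no ORM mapping layer in between.
    posts.insert_one({
        "title": "Why we switched",
        "author": "jane",
        "comments": [
            {"who": "john", "text": "Nice write-up"},
            {"who": "ana", "text": "What about transactions?"},
        ],
    })

    # One round trip brings the post back with its comments already nested.
    latest = posts.find_one({"author": "jane"}, sort=[("_id", DESCENDING)])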
There are a few ways to connect to MongoDB from .NET, but I don't have enough experience with that platform to know which is best:
NoRM: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.

Alternatives to traditional relational databases for activity streams

I'm wondering if some other non-relational database would be a good fit for activity streams - sort of like what you see on Facebook, Flickr (http://www.flickr.com/activity), etc. Right now, I'm using MySQL but it's pretty taxing (I have tens of millions of activity records) and since they are basically read-only once written and always viewed chronologically, I was thinking that an alternative DB might work well.
The activities are things like:
6 PM: John favorited Bacon
5:30 PM: Jane commented on Snow Crash
5:15 PM: Jane added a photo of Bacon to her album
The catch is that unlike Twitter and some other systems, I can't simply append activities to a list for each user who is interested in the activity - if I could, it looks like Redis would be a good fit (with its list operations).
I need to be able to do the following:
Pull activities for a set or subset of people who you are following ("John" and "Jane"), in reverse date order
Pull activities for a thing (like "Bacon") in reverse date order
Filter by activity type ("favorite", "comment")
Store at least 30 million activities
Ideally, if you added or removed a person who you are following, your activity stream would reflect the change.
I have been doing this with MySQL. My "activities" table is as compact as I could make it, the keys are as small as possible, and it is indexed appropriately. It works, but it just feels like the wrong tool for this job.
Is anybody doing anything like this outside of a traditional RDBMS?
Update November 2009: It's too early to answer my own question, but my current solution is to stick with MySQL but augment with Redis for fast access to the fresh activity stream data. More information in my answer here: How to implement the activity stream in a social network...
Update August 2014: Years later, I'm still using MySQL as the system of record and using Redis for very fast access to the most recent activities for each user. Dealing with schema changes on a massive MySQL table has become a non-issue thanks to pt-online-schema-change.
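For anyone curious, the Redis half of that hybrid is small. A sketch with redis-py (key names invented; recent zadd signature):

    import time
    import redis

    r = redis.Redis()
    STREAM_LEN = 1000  # keep only the freshest activities per user in RAM

    def record_activity(follower_ids, activity_id):
        now = time.time()
        for uid in follower_ids:
            key = 'stream:%d' % uid
            r.zadd(key, {activity_id: now})             # sorted set scored by time
            r.zremrangebyrank(key, 0, -STREAM_LEN - 1)  # trim the old tail

    def recent_activities(uid, n=50):
        # Newest first; the full rows still come from MySQL, the system of record.
        return r.zrevrange('stream:%d' % uid, 0, n - 1)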
I'd really, really suggest staying with MySQL (or an RDBMS) until you fully understand the situation.
I have no idea how much performance you need or how much data you plan on handling, but 30M rows is not very many.
If you need to optimise certain range scans, you can do this with (for example) InnoDB by choosing an (implicitly clustered) primary key judiciously, and/or denormalising where necessary.
But like most things, make it work first, then fix performance problems you detect in your performance test lab on production-grade hardware.
EDIT: Some other points:
Key/value databases such as Cassandra, Voldemort, etc. do not generally support secondary indexes
Therefore, you cannot do a CREATE INDEX
Most of them also don't do range scans (even on the main index) because they use hashing to implement partitioning (which most of them do).
Therefore they also don't do range expiry (DELETE FROM tbl WHERE ts < NOW() - INTERVAL 30 DAY)
Your application must do ALL of this itself, or manage without it; secondary indexes are really the killer (see the sketch after this list)
ALTER TABLE ... ADD INDEX takes quite a long time in e.g. MySQL with a large table, but at least you don't have to write much code to do it. In a "nosql" database, it will also take a long time, BUT you also have to write heaps and heaps of code to maintain the new secondary index, expire it correctly, AND modify your queries to use it.
In short... you can't use a key/value database as a shortcut to avoid ALTER TABLE.
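To illustrate how much of CREATE INDEX you end up re-implementing: with only get/put/delete, a secondary index is something the application maintains by hand on every write. A generic sketch (the kv client here is hypothetical - any dict-like store):

    import json

    def save_user(kv, user):
        old = kv.get('user:%s' % user['id'])
        if old is not None and json.loads(old)['email'] != user['email']:
            kv.delete('email_idx:%s' % json.loads(old)['email'])  # un-index stale value
        kv.put('user:%s' % user['id'], json.dumps(user))
        kv.put('email_idx:%s' % user['email'], user['id'])        # the "index" row

    def find_by_email(kv, email):
        user_id = kv.get('email_idx:%s' % email)
        return json.loads(kv.get('user:%s' % user_id)) if user_id else None

    # ...and every query that used to say WHERE email = ? must now be routed
    # through find_by_email - that is the work CREATE INDEX was doing for you.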
I am also planning on moving away from SQL. I have been looking at CouchDB, which looks promising. Looking at your requirements, I think all of them can be met with CouchDB views and the list API.
It seems to me that what you want to do -- query a large set of data in several different ways and order the results -- is exactly and precisely what RDBMSes were designed for. I doubt you would find any other datastore that would do this as well as a modern commercial DBMS (Oracle, SQL Server, DB2, etc.), or any open source tool that would accomplish it any better than MySQL.
You could have a look at Google's BigTable, which is not really a relational database but can present an 'object-y' personality to your program. It's exceptionally good for free-format text searches and complex predicates. As the whole thing (at least the version you can download) is implemented in Python, I doubt it would beat MySQL in a query marathon.
For a project I once needed a simple database that was fast at lookups, doing lots of reads and just an occasional write. I just ended up writing my own file format.
While you could do this too, it is pretty complex, especially if you need to support it from a web server. With a web server, you would at least need to protect every write to the file and make sure it can be read from multiple threads. The design of this file format is something you should work out as well as possible, with plenty of testing and experimentation. One minor bug could prove fatal for a web project in this style, but if you get it working, it can work really well and extremely fast.
But for 99.999% of all situations, you don't want such a custom solution. It's easier to just upgrade the hardware, move to Oracle, SQL Server or InterBase, use a dedicated database server, use faster hard disks, install more memory, upgrade to a 64-bit system. Those are the more generic tricks to improve performance with the least effort.
I'd recommend learning about message queue technology. There are several open-source options available, and also robust commercial products that would serve up the volume you describe as a tiny snack.
CouchDB is schema-free, and it's fairly simple to retrieve a huge amount of data quickly, because you are working only with indexes. You are not "querying" the database each time; you are retrieving only matching keys (which are pre-sorted, making it even faster).
"Views" are re-indexed every time new data is entered into the database, but this takes place transparently to the user, so while there might be a potential delay in generating an updated view, there will virtually never be any delay in retrieving results.
I've just started to explore building an "activity stream" solution using CouchDB, and because the paradigm is different, my thinking about the process had to change from the SQL thinking.
Rather than figuring out how to query the data I want and then processing it on the page, I instead generate a view that keys all documents by date. I can then easily create multiple groups of data just by using the appropriate date key, essentially running several queries simultaneously but with no degradation in performance.
This is ideal for activity streams: I can isolate everything by date, or along with date isolation further filter results of a particular subtype, etc., by creating a view as needed. And because the view itself is just JavaScript and all data in CouchDB is JSON, virtually everything can be done client-side to render your page.
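A sketch of such a view with couchdb-python (design doc and field names invented; the map function itself is a JavaScript string):

    import couchdb

    db = couchdb.Server()['activities']

    # Key every activity by [type, date] so one pre-sorted index serves both
    # "everything in reverse date order" and "only favourites in a range".
    db['_design/stream'] = {
        'views': {
            'by_type_and_date': {
                'map': """function(doc) {
                    emit([doc.type, doc.created_at], null);
                }"""
            }
        }
    }

    # Newest favourites first: a range read over the index, not a query.
    rows = db.view('stream/by_type_and_date',
                   startkey=['favorite', {}], endkey=['favorite'],
                   descending=True, limit=50, include_docs=True)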