How to create pseudo document oriented model? - mysql

Currently, I am using Rails with MySQL as the backend. Unfortunately, my application's data has grown far beyond what was expected or foreseen when it started. Now I am facing a lot of performance issues as the number of entries in the database increases, and ActiveRecord is taking a hit from the innumerable queries that are fired as a result of relying on relational logic.
I have come to a point where I feel I am paying a penalty for enjoying the advantages of a proper relational model. Since speed is now the priority, I researched document-oriented databases like MongoDB and found that they offer speed at the cost of relational features.
My question is how to migrate gradually from a relational model to a document model. Perhaps I could take my temporary schemas, or the tables returned by queries, and dump them as bulk documents on the fly instead of setting up a proper document-oriented DB (at least during the initial phase). Space is not an issue for me; all I care about now is time. However, I cannot do the migration in one single sweep. I would like to know how to approach this problem; any links or references where this kind of problem has been solved before would be much appreciated.

I would highly recommend against migrating to a document db unless your data is better suited to such a database.
Migrating for speed reasons would generally be a bad idea, and you should instead look for slow queries in your existing AR based system and optimise them.
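As a rough illustration of "look for slow queries", here is a minimal sketch of an application-side slow query log. It is written in Python with an in-memory SQLite database standing in for MySQL; the `timed_query` helper, the `posts` table and the 50 ms threshold are all assumptions made for the example, not part of Rails or ActiveRecord. MySQL itself also ships a server-side slow query log (the `slow_query_log` and `long_query_time` settings), which is usually the first place to look.

```python
import sqlite3
import time

SLOW_THRESHOLD_SECONDS = 0.05  # assumed cut-off; tune it for your application

slow_log = []  # (elapsed_seconds, sql) pairs for queries at or over the threshold


def timed_query(conn, sql, params=(), threshold=SLOW_THRESHOLD_SECONDS):
    """Run a query and record it in slow_log if it takes at least `threshold` seconds."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if elapsed >= threshold:
        slow_log.append((elapsed, sql))
    return rows


conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL connection
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO posts (user_id, title) VALUES (?, ?)",
    [(i % 100, f"post {i}") for i in range(10_000)],
)

rows = timed_query(conn, "SELECT id, title FROM posts WHERE user_id = ?", (7,))

# Slowest queries first: these are the candidates for indexes or rewrites.
for elapsed, sql in sorted(slow_log, reverse=True):
    print(f"{elapsed:.4f}s  {sql}")
```

Once the worst offenders are known, they can be examined one by one (missing indexes, repeated lookups, oversized result sets) before any change of database is considered.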

Related

Is there a high performance difference in a Key-Value db on a single server with MySQL vs. NoSQL

In my PHP application I have a 470M rows table weighing 200GB in a MySQL MyISAM partitioned table on one server. Usage includes 70% Writes/30% Reads.
I'm trying to improve performance. Main problem currently is read/write contentions due to table-level locks. I'm trying to decide between two options:
Changing the MySQL table to InnoDB. Pros: avoids the table-level locks. Cons: much more disk space, so I would need bigger HDs, which might not be as fast as the current ones (RAID10, 6*300GB SAS 15k).
Moving data to a NoSQL db. Main Con: Learning curve. Have never used NoSQL before.
The question is: while still trying to avoid sharding the data, and considering that I'm using the RDBMS MySQL as a simple key-value store, is there a large performance difference between the two approaches, or does the main advantage of NoSQL only appear when moving to a distributed system?
I can only answer your question partially but hopefully more than a comment.
MongoDB is not typically a key-value store and has been known to have certain performance hits when used as one.
MongoDB also has a locking problem here that could come back to haunt you. It currently has a DB-level lock, which means it could (this would need testing) cause write lock saturation.
It is also heavily designed for an 80% read app (said to be the most common setup for websites nowadays), so the more writes you do, the more you will notice a performance drop over time. That being said, you can tweak MongoDB to be more write-friendly, and its distributed nature does help to reduce write lock saturation a little.
That being said, here is my personal opinion of moving to MongoDB from SQL:
The learning curve was next to nothing
It is more natural and simpler to integrate into my app than SQL
The query language is simple, making it dead easy to get to grips with
Query language has a lot of similarities to SQL
The drivers are standardised so that the syntax you see in the Docs for the JS driver in the console is consistent across the board.
As for the general question, my personal opinion is that the main advantage lies in the distributed nature. If you get a NoSQL solution designed as a key-value store, it could be really good. A quick search on Google pulled out a small list of NoSQL key-value stores on Wikipedia: http://en.wikipedia.org/wiki/NoSQL#Key-value_stores_on_solid_state_or_rotating_disk

How to handle ever changing database structure

I am working on my masters thesis. For my implementation I have some MySQL tables.
With every iteration my table structure will differ (adding, removing columns etc). I was wondering what the best way is to handle the ever changing structure, without changing old code too much.
I read that Facebook has a version control system where they can specify exactly what kind of code/feature is available and for which user. As far as I know, that must mean that they manage many different database structures at once. How does their old code work alongside their new code with respect to their database? Do they do a lot of testing? Did they abandon MySQL altogether?
Personally I like FriendFeed's solution a lot. However, I am wondering if it is too much for me.
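For readers unfamiliar with it, the FriendFeed approach, as publicly described, keeps MySQL but stores each entity as an opaque serialized blob in one generic table and maintains separate small index tables for the properties that need to be queried. The following is only a rough sketch of that idea, using Python with SQLite as a stand-in for MySQL and JSON as the serialization; the table names, the `save_entity` and `find_by_user` helpers and the sample fields are assumptions for illustration, not FriendFeed's actual code.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")  # stand-in for MySQL
# One generic table holds every entity as an opaque blob, so adding or
# removing a property never requires ALTER TABLE.
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
# Each property that must be queryable gets its own small index table.
conn.execute("CREATE TABLE index_user_id (user_id TEXT, entity_id TEXT)")


def save_entity(entity):
    """Store the entity as JSON and refresh its row in the user_id index."""
    entity_id = entity.setdefault("id", uuid.uuid4().hex)
    conn.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (entity_id, json.dumps(entity)),
    )
    conn.execute("DELETE FROM index_user_id WHERE entity_id = ?", (entity_id,))
    if "user_id" in entity:
        conn.execute(
            "INSERT INTO index_user_id (user_id, entity_id) VALUES (?, ?)",
            (entity["user_id"], entity_id),
        )
    return entity_id


def find_by_user(user_id):
    """Look up entity ids through the index table, then decode the blobs."""
    rows = conn.execute(
        "SELECT e.body FROM index_user_id i JOIN entities e ON e.id = i.entity_id "
        "WHERE i.user_id = ? ORDER BY e.id",
        (user_id,),
    ).fetchall()
    return [json.loads(body) for (body,) in rows]


# Entities written in different iterations can carry different fields.
save_entity({"id": "a1", "user_id": "u1", "title": "first run"})
save_entity({"id": "a2", "user_id": "u1", "title": "second run", "score": 0.93})
results = find_by_user("u1")
```

The trade-off is that the database no longer validates the structure of the data; the application code has to cope with entities that lack newer fields.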
Why would anyone try to use a relational database for non-relational data?
Forget about FriendFeed and take a look at NoSQL solutions. They are schemaless, they support horizontal scalability much better than any RDBMS, and most of them are free/open source.
I can recommend MongoDB. It's very fast and written in C++, but it is not ACID compliant.
You could also try RavenDB. It's not as fast as MongoDB, and inserts are very slow compared to Mongo, but it is ACID compliant. It is written in .NET.

Are there any advantages to using mongodb over mysql if said mongo db were used without embedded documents?

I'm using a php framework with a mongodb adapter that doesn't currently comprehend embedded documents as a Model/association relationship. After reading about mongodb for a few days it seems that you should use embedded documents for objects that are most often displayed together. This makes a lot of sense to me. It was said during one mongo schema talk that a collection of many small documents can negate some of the advantages of mongo over an RDBMS.
In searching stackoverflow and beyond, I can't seem to see what advantages exist, if any, when deploying mongodb into an environment where it is implemented with a reasonably normalized schema like you'd find in a traditional RDBMS.
Are there still advantages to using MongoDB when used in this way? Scaling? Performance?
If by "reasonably normalized" you mean that you need information from one table to filter the information from another table (i.e. a join), then mongo is going to work against you. In a SQL database you can easily get the info from multiple tables with a single query. In mongo you'll need multiple queries to get data from multiple collections. Any speed advantage mongo gives you in pulling from a single collection will quickly be negated by making multiple round trips to the database.
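The difference described above can be sketched in a few lines. This is a toy model in plain Python: the `users` and `orders` lists stand in for two collections, and the `find` helper is a hypothetical stand-in for one round trip to a document store, not a real driver API.

```python
# Hypothetical in-memory "collections"; find() models one round trip to a
# document store and is not a real driver call.
users = [
    {"_id": 1, "name": "ann", "country": "NO"},
    {"_id": 2, "name": "bob", "country": "US"},
    {"_id": 3, "name": "cat", "country": "NO"},
]
orders = [
    {"_id": 10, "user_id": 1, "total": 25},
    {"_id": 11, "user_id": 2, "total": 40},
    {"_id": 12, "user_id": 3, "total": 15},
    {"_id": 13, "user_id": 1, "total": 5},
]

round_trips = 0


def find(collection, predicate):
    """Return the matching documents and count the call as one round trip."""
    global round_trips
    round_trips += 1
    return [doc for doc in collection if predicate(doc)]


# Round trip 1: which users are in Norway?
norwegian_ids = {u["_id"] for u in find(users, lambda u: u["country"] == "NO")}
# Round trip 2: the orders that belong to those users.
norwegian_orders = find(orders, lambda o: o["user_id"] in norwegian_ids)

# The relational equivalent is a single statement:
#   SELECT o.* FROM orders o JOIN users u ON u.id = o.user_id
#   WHERE u.country = 'NO';
```

With a normalized schema, every extra relationship in the filter adds another round trip like this, which is where the single-collection speed advantage gets eaten up.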
Here are some advantages that MongoDB might give you (depending on your use case):
Schemaless: More flexible if the document structure is modified later.
Performance: MongoDB makes very good use of the available RAM, which makes it very performant.
Easy replication: Replication is easy to set up.
Sharding/Clustering: MongoDB is designed with sharding in mind. It is easy to set up and doesn't require experts.
Map/Reduce: If you happen to need this, there is built-in support.
JavaScript: Intuitive to use if you already know JavaScript (and who doesn't nowadays :) )
The MongoDB website has a good list of case studies of production deployments.
MongoDB has replication and sharding built in.
These are things that can be done with MySQL.
The downside is the learning curve and lack of programmers that know it.
If it's just for you, it would be fun as a learning project.
If this is for a larger project, you'll need to weigh the lack of MongoDB programmers and learning curve against popularity of MySQL.
I have been developing my University dissertation project with MySQL first then thought to give a shot to MongoDB to improve performance. Rewriting code was really easy and straightforward with Jongo. Production has been really smooth.
Unfortunately, performance was terrible. I am not particularly skilled with MongoDB queries, but I believe I did quite a lot of research: I used map reduce, I used the aggregation framework, $limit and all that stuff... When at some stage I got the message "request heap use exceeded 10% of physical RAM", I really gave up and delivered the MySQL version.
For me it's really a shame, because I was working so hard to make it work the best way possible with MongoDB (as a University project stands out if you do something different). I think I will continue to study MongoDB in the future, but for the moment I am sticking with what performs (or rather, with what I can make perform).
I hope my comment will not offend MongoDB fans, but this is my experience.

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be frequently displayed back and often updated. At the moment I store the content inside a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications, etc. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once the size of our database reaches a significant number. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews of MongoDB and it looks interesting. I'm happy to spend the time and learn if one turns out to be the way to go. I'd much appreciate anyone offering points or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarefied database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput that cripples a properly structured, well-resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of db is how efficiently data can be cached. An effective cache can dramatically reduce db load and speed up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although someone with more legal experience could explain the implications better. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands; you can do that with closed source.
PostgreSQL is usually the best free replacement for Oracle, and the BSD license should be more business friendly.
Since you prefer a non-relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also take the license type into account.
There are three things that have a deep impact on which database is your best choice and that you do not mention:
The size of your data, or whether you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case, more than a database, you need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their concept/design they do not handle data distribution as well... but will you really handle so much data that it does not fit into one server, or have access rights that need special separate/extra servers?
However, most people will go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, and basically most relational database engines, can handle rather large amounts of data, given proper indexes and a sane database schema.
Why don't you try out how MySQL behaves with a bigger amount of data in your setup? Write some scripts that generate realistic data in a MySQL test database, generate some load on the system, and see if it is fast enough.
Only when it is not fast enough should you start considering first optimizing the database and then changing to a different database engine.
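A small sketch of that experiment follows. It uses Python and an in-memory SQLite database as a stand-in for a MySQL test database, so the numbers and plan wording will differ from MySQL (where the equivalent tool is `EXPLAIN`); the `submissions` table, the row counts and the index name are made up for the example. It generates test rows and asks the query planner how it would run a lookup before and after an index is added.

```python
import random
import sqlite3

random.seed(42)  # reproducible fake data
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)
conn.executemany(
    "INSERT INTO submissions (user_id, body) VALUES (?, ?)",
    [(random.randrange(1_000), f"text entry {i}") for i in range(50_000)],
)

QUERY = "SELECT body FROM submissions WHERE user_id = ?"


def query_plan(sql, params):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_before = query_plan(QUERY, (42,))  # no index yet: full table scan
conn.execute("CREATE INDEX idx_submissions_user ON submissions (user_id)")
plan_after = query_plan(QUERY, (42,))   # now an index lookup

print(plan_before)
print(plan_after)
```

The same before/after comparison, combined with timing under realistic load, shows whether a query is slow because of the engine or simply because it is scanning the whole table.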
Be careful with NHibernate: it is easy to build a solution that is nice and easy to code with but performs badly with large amounts of data. For example, whether to use lazy or eager fetching for associations should be considered carefully. I don't mean that you shouldn't use NHibernate, but make sure that you understand how NHibernate works, for example what the "n + 1 selects" problem means.
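For anyone who has not met the term, the "n + 1 selects" problem can be shown without any ORM. The sketch below is plain Python over in-memory SQLite, not NHibernate code; the tables, the `run` counter and both functions are invented for illustration. The lazy version issues one query for the parent rows and then one more per parent, while the eager version fetches everything with a single JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    INSERT INTO submissions VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c'), (4, 3, 'd');
""")

queries = 0  # number of SELECT statements sent to the database


def run(sql, params=()):
    """Execute a query and count it."""
    global queries
    queries += 1
    return conn.execute(sql, params).fetchall()


def titles_lazy():
    """Lazy style: 1 query for the users, then 1 more query per user."""
    result = {}
    for user_id, name in run("SELECT id, name FROM users ORDER BY id"):
        rows = run(
            "SELECT title FROM submissions WHERE user_id = ? ORDER BY id",
            (user_id,),
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_eager():
    """Eager style: one JOIN fetches users and their submissions together."""
    result = {}
    rows = run(
        "SELECT u.name, s.title FROM users u "
        "JOIN submissions s ON s.user_id = u.id ORDER BY u.id, s.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


lazy = titles_lazy()
lazy_queries = queries   # 1 + number of users
queries = 0
eager = titles_eager()
eager_queries = queries  # always 1
```

With three users the difference is four queries against one; with a page listing thousands of users, the lazy version sends thousands of queries for the same result.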
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables are, so you can usually store your objects directly as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
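As a small illustration of the "blog post with comments" case, the sketch below shows the shape such a document might take. It is plain Python with JSON serialization and does not call any MongoDB driver; the field names and values are invented for the example. In a relational schema the same data would typically live in a posts table and a comments table joined on a post id.

```python
import json

# One self-contained document: the comments are nested inside the post
# instead of living in a separate table keyed by post_id.
post = {
    "title": "Choosing a database",
    "author": "ann",
    "tags": ["mysql", "nosql"],
    "comments": [
        {"author": "bob", "text": "Measure first."},
        {"author": "cat", "text": "Caching helped us most."},
    ],
}

stored = json.dumps(post)    # what would be written to the store
loaded = json.loads(stored)  # a single read returns the post and its comments

comment_authors = [c["author"] for c in loaded["comments"]]
```

Reading the post and all of its comments is a single lookup, which is the case where the document model fits best; comments that must be queried on their own across posts are a less natural fit.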
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.

What database works well with 200+GB of data?

I've been using mysql (with innodb; on Amazon rds) because it's sort of the universal default, but it's been ridiculously under-performing, and tweaking it only delays the inevitable.
The data is mostly relatively short blobs (<1kB each) of information about hundreds of millions of urls. There is (or should be, mysql cannot seem to handle it) a very high volume of inserts / updates / retrievals but few complex queries. That is not because complex queries wouldn't be useful, but because mysql is so slow that it's far faster to get the data out, process it locally, and cache the results somewhere.
I can keep tweaking mysql and throwing more hardware at it, but it seems increasingly futile.
So what are the options? SQL/relational model/etc. optional - anything will do as long as it's fast, networked, and language-independent.
Have you done any sort of end-to-end profiling of your application and MySQL database? To provide better advice it would also be good to understand what improvements you have tried to implement, and your database structure. You haven't given a lot of information on how your MySQL database is configured either. It provides a lot of options for tuning.
You should pick up a copy of High Performance MySQL if you haven't already to learn more about the product.
There is no point in doing anything until you know what your problem is. NoSQL solutions can offer performance benefits but you have provided little evidence that MySQL is incapable of servicing your needs.
Well "Fast, networked and language-independent" + "few complex queries" brings to mind the various NoSQL solutions. To name a few:
MongoDB
CouchDB
Cassandra
And if that's not fast enough, there is always the wicked fast Redis, which is my personal favorite atm. :) It is not a database per se, but it's good enough for most scenarios.
I am sure other people can list more NoSQL databases...
and there is always http://nosql-database.org/ .
Generally speaking, databases in this category are better and faster in your scenario because they have relaxed constraints, and thus it is easier and faster to insert/update/retrieve frequently. But that requires that you think harder about your data model, and it is generally not possible to do SQL-style complex queries directly -- you'll instead write more pre-computed data or use a more denormalized design to account for the lack of complex queries.
But since complex queries are a minor concern in your case, I think NoSQL solutions are ideal for you.
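To show what "write more pre-computed data" can mean in practice, here is a minimal sketch in Python. A dict stands in for a key-value store holding one record per url, and a second dict holds a per-domain counter that is updated at write time, so the question "how many urls do we have for this domain?" never needs a scan or a GROUP BY. The `put_url` and `urls_in_domain` helpers and the sample urls are assumptions for the example.

```python
from urllib.parse import urlparse

store = {}            # hypothetical key-value store: url -> record
count_by_domain = {}  # pre-computed aggregate, maintained on every write


def put_url(url, info):
    """Insert or update a url record and keep the per-domain counter current."""
    if url not in store:
        domain = urlparse(url).netloc
        count_by_domain[domain] = count_by_domain.get(domain, 0) + 1
    store[url] = info


def urls_in_domain(domain):
    """Answer the aggregate question with one key lookup instead of a scan."""
    return count_by_domain.get(domain, 0)


put_url("http://example.com/a", {"status": 200})
put_url("http://example.com/b", {"status": 404})
put_url("http://example.com/a", {"status": 301})  # update, not a new url
put_url("http://example.org/", {"status": 200})
```

The cost is that every aggregate you want to read cheaply has to be anticipated and maintained on the write path, which is the extra data-model thinking mentioned above.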
With the data you've given about your application's data and workload, it is almost impossible to determine whether the problem really is MySQL itself or something else. You seem to assume that you can throw any workload to a relational engine and it should handle it. Therefore the suggestions made by other commenters about analyzing the performance more carefully are valid in my opinion. Without more data (transactions / second etc.) any further analysis regarding other suitable engines is also futile.
I'm not sure I agree with the advice to jump ship on traditional databases. A relational database might not be the most efficient tool, but it is the one that is FAR more widely understood and used, and I strongly doubt you have a problem that can't be handled by an efficiently set up relational database.
Obvious answers are Oracle, SQL Server, etc., but it might just be that your database structure isn't right. I don't know much about MySQL, but I do know it's used in some pretty big projects (eBay being noteworthy).