MySQL vs MongoDB - best solution for a complex, user-focused site?

I've spent a few days researching the pros and cons of MySQL versus NoSQL solutions (specifically MongoDB) for my project.
The project needs to be able to eventually scale to handle tens of thousands of simultaneous users - millions of users in total. The site is heavily user-focused and will interact with the database as much as, if not more than, a site like Facebook - it is very relational, all functionality is dependent on the relation to the user and their relationship with other users. It's also data heavy - lots of files, images, audio, messaging, a personal news feed, etc.
I like the look of MongoDB a lot, I like the way it works, and I like how it scales - but I can't get my head around how this would work for a site such as I describe. Would all interactions for a specific user have to be stored in a single document?
I am, however, very comfortable using MySQL and like its relational aspect. I am just worried that, without a lot of work, there will be scalability issues with this project - although perhaps with memcached and sharding this won't be an issue?
I'd like to know, from those with experience of both databases on large projects: out of MySQL and MongoDB, which is the right tool for this particular job?

If the data is highly relational, use a relational database. If it's not, don't. NoSQL is great, don't get me wrong, but it's not suited to all tasks. It may be suited to your task, but the only way to find out is to build some tests for your specific use case. Add a bunch of dummy data (millions, if not hundreds of millions, of rows), and then load test it.
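For example, that kind of test can be sketched in a few lines. This uses Python's built-in sqlite3 purely as a stand-in backend with a made-up posts table; point the same idea at your actual MySQL or MongoDB setup with its own driver.

```python
import random
import sqlite3
import time

# Stand-in backend: an in-memory SQLite table. For a real test, swap in your
# actual MySQL or MongoDB driver; the shape of the experiment stays the same.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)")
conn.execute("CREATE INDEX idx_posts_user ON posts(user_id)")

# Dummy data: push this up to millions of rows to see how the backend behaves.
rows = ((i, random.randrange(100_000), "x" * 50) for i in range(1_000_000))
conn.executemany("INSERT INTO posts VALUES (?, ?, ?)", rows)
conn.commit()

# Time the query pattern your application will actually run.
start = time.perf_counter()
for _ in range(10_000):
    uid = random.randrange(100_000)
    conn.execute("SELECT id, body FROM posts WHERE user_id = ?", (uid,)).fetchall()
print(f"10k per-user lookups: {time.perf_counter() - start:.2f}s")
```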
As far as scaling goes, that's more a function of how you build your application than of the backend you choose. Do you have a solid schema? Do you have a strong cache layer with write-through caching? Do you access the backend as efficiently as possible (queries and such)? Can you shard based on your application?
Those are the kinds of questions that are appropriate here. Not "which will scale better for me", and not "which is the right tool". Both can do the job fine. Which is best is up to you...

Obviously, there's no silver bullet here. However, I would like to challenge this one assumption you've made:
... it is very relational, all functionality is dependent on the relation to the user and their relationship with other users...
OK, I'd like you to picture having 100M users in a relational database and start building this model. Let's try something simple, grab the names of a user's friends.
How do you get a user's friends? Well, you go to the users_friends table. If each user has even just 10 friends, that table contains a billion rows. If users have a more reasonable 100 friends, you now have 10B rows.
So now you have a user and a list of their friend IDs. How do we get the friends' names? Well, you go through the list of 100 IDs and pull down each of the friends. Perfect.
So now, if you want to show one user the names of all of their friends, all you have to do is join the 100M-record table to the 10B-record table. This is not a simple task. Scaling joins becomes dramatically harder and more expensive as the dataset grows.
So, to make this easier, you're probably going to run a for loop and manually collect the records for each friend. You have to do this because the friends are scattered across multiple servers, so each "lookup" has to be done individually.
Already you've broken your "relational model".
What about the friends list? Is keeping a table of 10B records really practical? Why not just keep a list of friend IDs with each user? Why do an extra query?
If you notice the pattern here, we've basically broken down the "very relational" model into something that's effectively key-value lookups. Of course, the key-value model will scale much better. And so, MongoDB seems like a good fit here.
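To make that concrete, here is roughly what the two shapes look like side by side. This is only a sketch: the SQL is illustrative, and the MongoDB half assumes the pymongo driver, a local mongod, and made-up collection and field names.

```python
from pymongo import MongoClient

# Relational shape (illustrative SQL): joining a 100M-row users table to a
# multi-billion-row friendship table just to list one user's friends' names.
FRIEND_NAMES_SQL = """
SELECT f.name
FROM users u
JOIN users_friends uf ON uf.user_id = u.id
JOIN users f          ON f.id = uf.friend_id
WHERE u.id = %s
"""

# Document shape: each user embeds a list of friend IDs, so the same question
# becomes two cheap key-value style lookups instead of a giant join.
db = MongoClient().social          # assumes a local mongod is running
for fid, fname in [(7, "Bob"), (19, "Carol"), (23, "Dan")]:
    db.users.replace_one({"_id": fid}, {"name": fname, "friend_ids": []}, upsert=True)
db.users.replace_one(
    {"_id": 42},
    {"name": "Alice", "friend_ids": [7, 19, 23]},   # embedded friend list
    upsert=True,
)

me = db.users.find_one({"_id": 42}, {"friend_ids": 1})
friends = db.users.find({"_id": {"$in": me["friend_ids"]}}, {"name": 1})
print([f["name"] for f in friends])
```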
Don't get me wrong, there are lots of good uses for relational databases. But when you're talking about handling millions of individual key-value style requests, you probably want to look at a NoSQL database.

There is no law that says you have to build an application with exactly one database. It is common practice to have dedicated backends for particular tasks. E.g., in the context of a Facebook-like application it may make sense to use a graph database for storing relations between users - every database has its pros and cons, and only fools would implement large backends with only an RDBMS or only a NoSQL DB just because they don't know better.

Related

Performance Analysis of CouchDB

I am developing a discussion forum for my university, and I am using CouchDB as the database.
I'm finding it difficult to design the structure of my DB in order to maximize its performance.
I want to discuss what the good practice is for designing a document database.
Should we make only one database, as in SQL, and create n documents in that database?
Or should we make a larger number of databases in order to flatten the DB structure? This would also reduce the number of documents per database.
The question you need to ask is simply this: "How do you want to get data out of your database?"
Database design hinges on the queries to be made, not on what is available to be stored.
This is especially important for document DBs like Couch, since, while it does have a flexible schema, it does not have flexible indexing. By that I mean that, because of the granularity of the data, it's quite likely that later on, when you need to ask a question that the database was not designed to answer, answering that question may well be very expensive. It's much, much cheaper to design your views and other constructs early, when there is little data in the database, rather than later, after you have thousands or millions of rows.
RDBMSs, since they tend to have a finer granularity of data, tend to be more nimble with new queries later in life. Document DBs, not so much.
So think through your use cases up front and design around those, and design them early on - it's much less painful now than later.
It's hard to tell the right way to approach modeling your data since you don't give much information. Generally though you want to keep as much data as possible in one database as this allows you to index it together (indexes cannot span more than one database).
Also, since there is no schema enforcement in the database, you can create different types of records in the same database. For example, there is nothing wrong with having both user information and forum entries in the same database.
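As a rough illustration of mixing document types in one database, here is a sketch against CouchDB's HTTP API (Python with the requests package; the database name, credentials, and fields are placeholders):

```python
import requests

COUCH = "http://localhost:5984"
AUTH = ("admin", "secret")    # placeholder credentials for a local CouchDB

# One database holds every kind of document; a "type" field tells them apart.
requests.put(f"{COUCH}/forum", auth=AUTH)

requests.post(f"{COUCH}/forum", auth=AUTH, json={
    "type": "user",
    "username": "alice",
    "department": "CS",
})
requests.post(f"{COUCH}/forum", auth=AUTH, json={
    "type": "post",
    "author": "alice",
    "topic": "exam schedule",
    "body": "Does anyone know when results come out?",
})
```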
Last, you will most likely want to keep messages and their replies in different records. This is an old but still relevant discussion on this topic: http://www.cmlenz.net/archives/2007/10/couchdb-joins
Cheers.

Is mongo appropriate to use alongside MySQL?

I can't discuss things in great detail due to an NDA, but I'm hoping an overview of the system being built can help you in aiding me in making a decision concerning our databases.
I'm building an app that will help vendors compete to gain clientele by making strategic offers based on records of inventory/purchase from the storefronts.
One side of the app is for the store owners to see presented offers, network, etc. I've got that going with a standard php/MySQL setup.
My question concerns the records of inventory. We are talking millions of records here almost immediately. The sample data I'm using is a roll-up from four of their managers (they have dozens) over the course of a year or two, and it had over 500k rows with about 30 or more columns. When we get scores of stores with all of their managers it will be massive, at least compared to anything I've worked with so far.
The vendors will have a side of the product in which they can search through these records and make competitive offers based off of it.
Is the sheer size a good reason to use something like mongo? Or is it more a matter of how the data is laid out / what it consists of? Or some other element that I'm not considering?
And, if not mongo/nosql, then is there some other methodology or technology that such large data stores would benefit from me using (sharding, amazon cloud database, etc).
Thanks
Answers ...
Q: Is the sheer size a good reason to use something like mongo?
A: I think so. Mongo was built from the ground up to scale in a massive way. You have replica sets and sharding that can help you scale. They also have features to make sure your data gets stored in the appropriately geographically distributed data centers.
Q: Or is it more a matter of how the data is laid out / what it consists of?
A: Mongo is a document database and you're right, the data models will be different. You have to think of data in a denormalized way instead of normalized. Just like any technology, there are pros and cons to storing things as documents.
Some pros: Schema management is a breeze. Data more naturally fits objects in your application. Don't have to pay the price of complicated/slow joins.
Some cons: Schemas can be inconsistent - you have to manage that yourself. Data is repeated, and if that repetition is not managed it can become inconsistent.
In general I think Mongo would be a good choice to deal with that scale. Mongo has a new aggregation framework that brings a lot of SQL concepts to queries on documents. Easier to make complex queries. Also Mongo has map/reduce to run any kind of query you might have.
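As a taste of the aggregation framework, here is a hedged sketch with pymongo that rolls up inventory records per store; the collection and field names are invented for illustration, and it assumes a local mongod:

```python
from pymongo import MongoClient

db = MongoClient().vendor_app      # assumes a local mongod is running

# Roll up inventory records per store: total units moved and distinct products.
pipeline = [
    {"$match": {"year": 2012}},    # filter early so the later stages stay cheap
    {"$group": {
        "_id": "$store_id",
        "total_units": {"$sum": "$quantity"},
        "products": {"$addToSet": "$sku"},
    }},
    {"$sort": {"total_units": -1}},
    {"$limit": 10},
]

for doc in db.inventory.aggregate(pipeline):
    print(doc["_id"], doc["total_units"], len(doc["products"]))
```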
After using Mongo daily for about a year, I've really enjoyed the support around it as a product and the general ease of setting it up and working with it.

is this sort of SQL structuring (in a mysql database) efficient in a real world model?

I was hoping to get feedback on an example MySQL structure for a web application in a real-world environment, from people who have used complex MySQL setups in real-world situations before.
example ~
education management app.
80,000 users.
each user gets his own database containing tables for
-messages
-uploads
-grades
-info
and more tables for other features
What I'm wondering is (and any info would be appreciated): in a situation like this,
-is this database model efficient (basically, something like 80,000 databases)? Or is there (and I have this itching feeling there is) a better way to do it?
-what kind of dedicated server would this require? 80,000 databases, each with 10-15 tables containing TONS of rows, with all 80,000 people accessing the site 20-30 times a day for 10-20 sessions
Start running.
Now!
Jokes aside, don't do that. Don't create one database per user. That's hell to administer, maintain, and query. What if you need to know which users logged in yesterday? Will you query each database??
The structure you need is the same, only the amount of data changes. Just have one database, see how it goes and then optimize/fine tune.
I hate to bring this quote up, but in your case it totally applies:
Premature optimization is the root of all evil (Donald Knuth)
Don't try to optimize your solution before you know where your bottlenecks will be.
Just model your database the best you can. Worry about your constraints, PKs, FKs, Indexes. Do your database-design homework. Then have your data and software going. Only then you will see where it works and where it hurts. At this moment, you optimize.
Only attack your enemy when you know who it is.
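For what it's worth, the single-database shape being recommended here looks roughly like this - a minimal sketch using SQLite syntax as a stand-in for MySQL, with tables and columns that are only illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    email TEXT NOT NULL UNIQUE
);

-- Each per-user feature is a table keyed by user_id, not a separate database.
CREATE TABLE messages (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    body    TEXT,
    sent_at TEXT
);
CREATE TABLE grades (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    course  TEXT,
    grade   TEXT
);

CREATE INDEX idx_messages_user ON messages(user_id);
CREATE INDEX idx_grades_user   ON grades(user_id);
""")

# "Which users logged in yesterday?" becomes one query, not 80,000 of them.
```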
It can be efficient with the appropriate storage engine to support that model. Most of the recent NoSQL data stores (Hadoop, Bigtable, MongoDB, etc.) work great for this sort of scenario.
If you think about it, user data is an excellent way to partition data stores into isolated islands (most interactions between different user databases do not need to be transactional, and there is very little, if any, exchange of write requests).
Basically, I would think that NoSQL doesn't give you any benefit for data relationships inside the user data island itself, so performance differences between relational and non-relational stores should not matter that much.
That said, traditional relational databases such as MySQL are not designed to be managed en masse, so in this respect you might want to consider a different data store (or hiring a DBA rock star).
Both Adrian and Lurscher have already given lot of details.
Without knowing how much data there is and what the schema looks like, it is hard to get into design. From the outset, with 80K active users the load doesn't seem huge, even if they access the data hundreds of times a day. I have a feeling you should be able to create a normalized schema for an OLTP environment with a users table and work from there with other tables. And again, it is the requirements that will drive the design. For example, what should the response time for a user query be - sub-second, a second?

Best way to structure a database for scaling

I am working on a project that has the potential to have a large number of users each of which will be managing their own unique data sets. I am thinking the data can be stored in one of two ways.
1) Create a completely different database for each user so that their data is fully separate from everyone else's
2) Share the data in the same database, and segregate it at the query level using a user_id field.
The schema will always be identical for each user.
The main thing is that the system will need to be able to scale, and I am not sure if having potentially several thousand different databases, or storing millions of records in the same tables would scale better.
I am interested in hearing from anyone who has dealt with this kind of situation in the past and what pitfalls might be out there with either option.
In addition to the scaling aspect that you have already identified, there are a few other concerns which may drive your decision - 'a large number of users' can cover such a range of numbers that you would do best to clarify it.
Other operational concerns:
Security - relying on a user_id field within your code means relying on there being no error or flaw that allows a user to see or manipulate other users' data.
Upgrades - this goes both ways: you either upgrade everyone at once (single DB) or, by splitting, allow yourself to upgrade different sets of users at different times.
Backup / Restore - depending on the restore requirements and SLAs, you may find that having everyone in a single database creates too much of a problem when it comes to backup / restore. If a single client wants to restore their data, the operational overhead when it is combined with all the other clients' data is not trivial. Equally, having lots of databases means lots of separate backups.
Scalability - having the ability to place different users' databases on separate servers can aid scaling, instead of requiring a big-iron DB server. But again, that is a management overhead.
Multi-tenancy of an application and its data source is not an easy question to answer - understanding more about how many users 'large' means in this case, combined with the operational concerns above, should provide you guidance.
Option 2 should be your best bet. Databases are usually designed to work with millions and millions of rows and a lot of data. So, as long as you design your schema correctly and have proper indexes, fill factors, etc., option 2 will lead you to the scaling that you are looking for. As DarthVader said, learn more about database design.
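To illustrate option 2, the segregation boils down to a user_id column, a supporting index, and a parameterized filter in every query. A rough sketch (SQLite as a stand-in; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_points (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    label      TEXT,
    value      REAL,
    created_at TEXT
);
-- Composite index keeps per-user queries fast as total rows grow into the millions.
CREATE INDEX idx_data_user_time ON data_points(user_id, created_at);
""")

def rows_for_user(user_id):
    # Always bind user_id as a parameter; never interpolate it into the SQL string.
    return conn.execute(
        "SELECT label, value FROM data_points WHERE user_id = ? ORDER BY created_at DESC",
        (user_id,),
    ).fetchall()
```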
Don't create a separate database for each user. That's not good.
What if you end up with a million users?
Create tables for users and for the entities that belong to the same context. You can't scale applications like that, and before learning about scalability you need to learn about database design and how databases work.

5 separate databases or 5 tables in 1 database?

Let's say I want to build a gaming website and I have many game sections. They ALL have a lot of data that needs to be stored. Is it better to make one database with a table representing each game or have a database represent each section of the game? I'm pretty much expecting a "depends" kind of answer.
Managing 5 different databases is going to be a headache. I would suggest using one database with 5 different tables. Aside from anything else, I wouldn't be surprised to find you've got some common info between the 5 - e.g. user identity.
Note that your idea of "a lot of data" may well not be the same as the database's... databases are generally written to cope with huge globs of data.
Depends.
Just kidding. If this is one project and the data are in any way related to each other I would always opt for one database absent a specific and convincing reason for doing otherwise. Why? Because I can't ever remember thinking to myself "Boy, I sure wish it were harder to see that information."
While there is not enough information in your question to give a good answer, I would say that unless you foresee needing data from two games at the same time for the same user (or query), there is no reason to combine databases.
You should probably have a single database for anything common, and then create independent databases for anything unique. Databases, like code, tend to end up evolving in different directions for different applications. Keeping them together may lead you to break things or to be more conservative in your changes.
In addition, some databases are optimized, managed, and backed up at the database level rather than the table level. Since they may have different performance characteristics and usage profiles, a one-size-fits-all solution may not be scalable.
If you use an ORM framework, you get access to multiple databases (almost) for free while still avoiding code replication. So unless you have queries that join across them, I don't think it's worth taking on the risk of shared databases.
Of course, if you pay someone to host your databases, it may be cheaper to use a single database, but that's really a business question, not software.
If you do choose to use a single database, do yourself a favour and make sure the code for each game only knows about specific tables. It would make it easier for you to maintain things later or separate into multiple databases.
One database.
Most of the stuff you are reasonably going to want to store is going to be text, or primitive data types such as integers. You might fancy throwing your binary content into blobs, but that's a crazy plan on a media-heavy website when the web server will serve files over HTTP for free.
I pulled lead programming duties on a web-site for a major games publisher. We managed to cover a vast portion of their current and previous content, in three European languages.
At no point did we ever consider having multiple databases to store all of this, despite the fact that each title was replete with video and image resources.
I cannot imagine why a multiple-database configuration would suit your needs here, either in development or outside of it. The amount of synchronisation you'll have to do, and the capacity for error, is immense. Trying to pull data that pertains to all of them, from all of them, will be a nightmare.
Every site-wide update you migrate will be n times as hard and error prone, where n is the number of databases you eventually plump for.
Seriously, one database - and that's about as far from your anticipated depends answer as you're going to get.
If the different games don't share any data it would make sense to use separate databases. On the other hand it would make sense to use one database if the structure of the games' data is the same--you would have to make changes in every game database separately otherwise.
Update: In case of doubt, you should always use one database, because it's easier to manage in most cases. Only if you're sure that the applications are completely separate and have completely different structures should you use multiple databases. The only real advantage is more clarity.
Generally speaking, "one database per application" tends to be a good rule of thumb.
If you're building one site that has many sections for talking about different games (or different types of games), then that's a single application, so one database is likely the way to go. I'm not positive, but I think this is probably the situation you're asking about.
If, on the other hand, your "one site" is a battle.net-type matching service for a collection of five distinct games, then the site itself is one application and each of the five games is a separate application, so you'd probably want six databases since you have a total of six largely-independent applications. Again, though, my impression is that this is not the situation you're asking about.
If you are going to be storing the same data for each game, it would make sense to use 1 database to store all the information. There would be no sense in replicating table structures across different databases, likewise there would be no sense in creating 5 tables for 5 games if they are all storing the same information.
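Concretely, that means one table with a game column instead of five identically shaped tables, which still lets you query per game or across all games. A sketch (SQLite as a stand-in; the schema is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table instead of scores_game1 ... scores_game5 with identical columns.
CREATE TABLE scores (
    id        INTEGER PRIMARY KEY,
    game      TEXT NOT NULL,
    player_id INTEGER NOT NULL,
    score     INTEGER NOT NULL
);
CREATE INDEX idx_scores_game ON scores(game, score);
""")

# A per-game leaderboard and a cross-game total both come from the same table.
top_for_game = "SELECT player_id, score FROM scores WHERE game = ? ORDER BY score DESC LIMIT 10"
totals       = "SELECT player_id, SUM(score) FROM scores GROUP BY player_id"
```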
I'm not sure this is correct, but I think you want to do one database with 5 tables because (among other reasons) of the alternative's impact on connection pooling (if, for example, you're using ADO.NET). In the ADO.NET connection pool, connections are keyed by the connection string, so with five different databases you might end up with 20 connections to each database instead of 100 connections to one database, which could limit the flexibility of how connections are allocated.
If anybody knows better or has additional info, please add it here, as I'm not sure if what I'm saying is accurate.
What's your idea of "a lot of data"? The only reason that you'd need to split this across multiple databases is if you are trying to save some money with shared hosting (i.e. getting cheap shared hosts and splitting it across servers), or if you feel each database will be in the 500GB+ range and do not have access to appropriate storage.
Note that both of these reasons have nothing to do with architecture, and entirely based on monetary concerns during scaling.
But since you haven't created the site yet, you're putting the cart before the horse. It is very unlikely that a brand new site would use anywhere near this level of storage, so just create 1 database.
Some companies have single databases in the 1,000+ TB range ... there is basically no upper bound on database size.
The number of databases you want to create depends not on the number of games, but on the data stored in the databases - or, better said, on how you exchange data between the databases.
If it is export and import, then do separate databases.
If it is normal relationships (with foreign keys and cross-queries), then leave it in one database.
If the databases are not related to each other, then they are separate databases, of course.
In one of my projects, I distinguished between the internal and external data (which were stored in separate databases).
The difference was quite simple:
The external database stored only the facts you cannot change or undo. In our case that was phone calls, SMS messages, and incoming payments.
The internal database stored the things that are usually stored: users, passwords, etc.
The external database used only natural PRIMARY KEYs: phone numbers, bank transaction IDs, etc.
The databases were given with completely different rights and exchanging data between them was a matter of import and export, not relationships.
This made sure that nothing would happen to the actual data: it is easy to relink a payment to a user, but it's very hard to restore a payment if it's lost.
I can pass on my experience with a similar situation.
We had 4 "Common" databases and about 30 "Specific" databases, separated for the same space concerns. The downside is that the space concerns were just projecting dBase shortcomings onto SQL Server. We ended up with all these databases on SQL Server Enterprise that were well under the maximum size allowed by the Desktop edition.
From a database perspective with respect to separation of concerns, the 4 Common databases could've been 2. The 30 Specific databases could've been 3 (or even 1 with enough manipulation / generalization). It was inefficient code (both stored procs and data access layer code) and table schema that dictated the multitude of databases; in the end it had nothing at all to do with space.
I would consolidate as much as possible early and keep your design & implementation flexible enough to extract components if necessary. In short, plan for several databases but implement as one.
Remember, especially on web sites. If you have multiple databases, you often lose the performance benefits of query caching and connection pooling. Stick to one.
Definitely, one database.
One place I worked had many databases: a common one for the stuff all clients used, and client-specific ones for per-client customization. What ended up happening was that, since the clients asked for the changes, the changes would end up in the client databases instead of the common one, and thus there would be 27 ways of doing essentially the same thing, because there was no refactoring from client-specific to "hey, this is something other clients will need to do as well, so let's put it in common". So one database tends to lead to less reinventing the wheel.
Security Model
If each game will have a distinct set of permissions/roles specific to that game, split it out.
Query Performance / Complexity
I'd suggest keeping them in a single database if you need to frequently query across the data between the games.
Scalability
Another consideration is your scalability plans. If the games get extremely popular, you might want to buy separate database hardware for each game. Separating them into different databases from the start would make that easier.
Data Size
The size of the data should not be a factor in this decision.
Just to add a little: when you have millions and millions of players in one game, your game is realtime, you have tens of thousands of simultaneous players online, and you have to keep at least some essential data (say, a player's virtual money) as up to date in the DB as possible, then you will want to separate tables into independent DBs even though they are all "connected".
It really depends. And scaling will be painful whatever you try to do to avoid the pain. But if you really expect A LOT of players, updates, and data, I would advise thinking twice, thrice, and more before settling on a "one DB for several projects" solution.
Yes, it will probably be difficult to manage several DBs. But you will have to do this anyway.
Really depends :)..
Ask yourself these questions:
Could there be reusability (a users table) that I may want to think about?
Is it worth separating these entities, or are they pretty much the same?
Do any of these entities share specific events / needs?
Is it worth my time and effort to build 5 different database systems (remember, if you are writing the games, that would mean different connection strings and also more security to manage, etc.)?
Or you could create one database OnlineGames and have a table that stores the game name and a category:
PacMan - Arcade
Zelda - Role-playing
etc., etc.
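For illustration, that lookup table is about as small as schemas get - a sketch (SQLite as a stand-in; names are only examples), with game-specific rows elsewhere referencing it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (id INTEGER PRIMARY KEY, name TEXT NOT NULL, category TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO games (name, category) VALUES (?, ?)",
    [("PacMan", "Arcade"), ("Zelda", "Role-playing")],
)
```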
It really depends on what your intentions are...