SQL for large-scale datasets - MySQL

Some job descriptions include something such as "demonstrated skill in handling large-scale (massive) datasets using SQL".
I would like to know which kinds of SQL-related skills are required to meet the requirements of these jobs?

Designing a performant schema and knowing when to denormalize (and when you've got problems you can solve other ways.)
Efficient query design.
The intimate details of index design, to the point where you can make changes and get the results you expected.
How to build, support, and effectively make use of test data (a test-data sketch follows this list).
How to read all the breadcrumbs your server leaves in its trail (logs and query plan analyses in particular.)
How to understand how hardware, DBMS software, and configuration work together, so you can adjust parameters and modify hardware without fear and get the results you expected.
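
On the test-data point, here is a minimal MySQL sketch of one common approach; every table and column name is invented for illustration. The idea is to hand-write a small seed of representative rows and then cross-join it against itself to multiply the volume:

```sql
-- Hypothetical seed table holding a few hand-written, representative rows.
CREATE TABLE orders_seed (
    customer_id INT,
    amount      DECIMAL(10,2),
    created_at  DATETIME
);

INSERT INTO orders_seed VALUES
    (1, 19.99, '2015-01-10 09:30:00'),
    (2, 45.00, '2015-02-03 14:00:00'),
    (3,  7.50, '2015-03-21 18:45:00');

-- Hypothetical target table we want to fill to production-like volume.
CREATE TABLE orders (
    id          BIGINT AUTO_INCREMENT PRIMARY KEY,
    customer_id INT,
    amount      DECIMAL(10,2),
    created_at  DATETIME
);

-- Cross-joining the seed against itself grows the row count geometrically;
-- add more join terms (or repeat the INSERT) until the table is as large as
-- the volume you need to test against.
INSERT INTO orders (customer_id, amount, created_at)
SELECT s1.customer_id,
       s1.amount * (1 + RAND()),                          -- jitter the values
       s1.created_at + INTERVAL FLOOR(RAND() * 365) DAY   -- spread the dates
FROM   orders_seed s1
CROSS JOIN orders_seed s2
CROSS JOIN orders_seed s3;
```

With a realistic volume in place, the slow query log and the query plan analyses mentioned above become far more informative.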

Everything related to SQL would be, in my opinion: everything from query writing to DBA-level changes on large-scale datasets.

Related

How do You Organize Big Data in your Database?

I have a database with a lot of data in it, and I am now thinking about how to organize it to be more scalable.
Some points I am considering are:
Security
Performance
Cost
General answers are welcome, because I have not yet anticipated all of the problems or risks that may come up; it would help me if you could give me some suggestions.
To give a full answer to your question we would need more information on how big the data is, how complex it is, and what your use cases are (i.e. do you do many joins across multiple tables, or are queries mostly against a single table?). In any case, here are some good pointers to help you get on your way.
If you are expecting your data to grow rapidly, I would recommend that you look at a cloud-based database solution rather than invest in physical hardware that would need replacing every so often. Cloud-based solutions give you more freedom to scale your database both vertically and horizontally. There are specialized cloud database technologies such as Amazon Redshift and the recently introduced Aurora, which can be reconfigured easily as your requirements grow.
For performance improvement within the database you can always look at indexes and changes to structures. Use the EXPLAIN statement in MySQL to analyze your queries and see whether they use temporary tables or full table scans, which will slow things down. Adding indexes to the columns you use for filtering or joining data increases performance drastically.
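As a concrete illustration of that workflow (the table and column names below are made up), you might do something like:

```sql
-- Ask MySQL how it plans to execute the query. An access type of ALL
-- (full table scan) or "Using temporary; Using filesort" in the Extra column
-- is the kind of warning sign mentioned above.
EXPLAIN
SELECT customer_id, SUM(amount)
FROM   orders
WHERE  created_at >= '2015-01-01'
GROUP BY customer_id;

-- Add an index on the column used for filtering (here a composite index that
-- also covers the grouped column) so only the matching range is scanned.
ALTER TABLE orders ADD INDEX idx_orders_created_customer (created_at, customer_id);

-- Re-running EXPLAIN should now show the new index in the key column and a
-- much smaller estimated row count.
```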
In data warehouses, you can also denormalize and pre-join tables to improve performance. Although this will drastically increase your storage use, working with a single pre-joined table improves performance because the cost of repeating the join on every query is taken out of the equation.
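A minimal sketch of that pre-joining idea, again with invented names and assuming normalized orders and customers tables already exist:

```sql
-- Build a flattened reporting table once (or on a schedule), so the join cost
-- is paid at load time instead of on every report query.
CREATE TABLE orders_flat AS
SELECT o.id         AS order_id,
       o.amount,
       o.created_at,
       c.id         AS customer_id,
       c.name       AS customer_name,
       c.country
FROM   orders o
JOIN   customers c ON c.id = o.customer_id;

-- Reports now read a single wide table: more storage, simpler and faster reads.
SELECT country, SUM(amount)
FROM   orders_flat
WHERE  created_at >= '2015-01-01'
GROUP BY country;
```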
If you are looking at massive datasets that will grow in structure and complexity, there are also non-relational technologies such as Hadoop, Cassandra, and other NoSQL stores. Moving into these environments may require you to rewrite most of your application, but it is something you should consider before the data has grown too big and you find yourself forced into it.
EDIT
Privacy and data security, as pointed out by Saïd Tahali in the comments: if you can't host your data externally for legal or security reasons, you will need to invest in your own hardware that addresses all of the above in-house.

SQL Server vs. NoSQL

So I have a website that could eventually get some pretty high traffic. My DB implementation is in SQL Server 2008 at the moment. I really only have 2 tables and a few stored procs. Most of the DB could be re-designed to work without joining (although it wouldn't make sense when I can join so easily within SQL Server).
I heard that sites like Digg and Facebook use NoSQL databases for a lot of their basic data access. Is this something worth looking into, or will SQL Server not really slow me down that bad?
I use paging on my site (although this might change in the future), and I also use AJAX'd data access for most of the "live" stuff, so it doesn't really seem to be a performance hindrance at the moment, but I'm afraid it will be as the data starts expanding exponentially.
Am I going to gain a lot of performance by moving to NoSQL? Honestly, right now I don't even completely understand NoSQL, so any tips on how this could help me would be appreciated.
Thanks guys.
Actually Facebook uses a relational database at its core, see SOCC Keynote Address: Building Facebook: Performance at Massive Scale. And so do many other web-scale sites, see Why does Quora use MySQL as the data store instead of NoSQLs such as Cassandra, MongoDB, CouchDB etc?. There is also a discussion of how to scale SQL Server to web-scale size, see How do large-scale sites and applications remain SQL-based?, which is based on MySpace's architecture (more details at Scale out SQL Server by using Reliable Messaging). I'm not saying that NoSQL doesn't have its use cases; I just want to point out that there are many shades of gray between white and black.
If you're afraid that your current solution will not scale, then perhaps you should look at what factors prevent scalability with your current solution. Test data is cheap to produce: load the 'exponentially increased' data volume, run your test harness, and see where it cracks. None of the NoSQL solutions will bring magic off-the-shelf scalability; they all require you to understand how to use them effectively and deploy them correctly, and they also require you to test with large volumes if you want to ensure success at scale. The same goes for traditional relational solutions.
SQL Server scales pretty well. For example, Stack Overflow used it to serve you this very page. Facebook and Google might use a form of NoSQL, but even if you make it really big you're unlikely to rise to that level.
With a simple table structure and data that fits on one server, it doesn't matter much what platform you use. There are several possible reasons to need to move to NoSQL:
Data scaling - SQL works best when all the data fits on one server (up to a few TB). The reason a lot of NoSQL stores don't have joins is that they were designed not to require all the objects to be on one server.
Performance scaling - NoSQL stores do tend to be faster at handling high traffic, but not necessarily by enough to matter. You can improve SQL performance quite a lot with replication and caching as long as you aren't running into data size issues. Writes generally do have to run on the one server, but in most cases you will need to improve read performance long before write performance becomes an issue.
Complex data access - some types of queries simply don't fit well into a relational model. Graph and set stores work quite differently from relational databases so are a better fit for some applications.
Easier development - If you don't already have a SQL database and all the code to support it, using a schemaless datastore can save quite a bit of development time.
I don't think you have to move your database from SQL to NoSQL unless and until you are serving thousands of terabytes of data. If you properly normalize your tables, serve the data efficiently, and set up a proper archiving mechanism, it should work.
If you still have questions about what to choose and how, then check this. If you do decide to move to a NoSQL database, there are a lot of players in the market; have a look at the list, which again depends on your needs and the type of data you have.
Am I going to gain a lot of performance by moving to NoSQL?
It depends.
Check out this article for 7 reasons why you DON'T want to use NoSQL. If none of them applies to your case, then read further.
The main advantage of document-based NoSQL for traditional enterprise needs is cheaper hosting at high scale, due to lower CPU usage when querying denormalised data (the most frequent kind of request). Key points:
The CPU works hard on the JOINs and GROUP BYs in SQL queries, whereas a denormalised data structure implies fewer (or no) JOINs and hence less stress on the CPU, as the sketch after these points illustrates.
CPU is the most expensive resource in the cloud, while storage is the cheapest; denormalised data trades higher storage use for lower CPU use.
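A rough illustration of that trade-off in SQL terms (the schemas are invented for the example):

```sql
-- Normalized model: every page view pays for the joins and the aggregation.
SELECT p.id, p.title, u.name AS author, COUNT(c.id) AS comment_count
FROM   posts p
JOIN   users u       ON u.id = p.author_id
LEFT JOIN comments c ON c.post_id = p.id
WHERE  p.id = 42
GROUP BY p.id, p.title, u.name;

-- Denormalised (document-style) model: the author name and the comment count
-- are stored on the post record itself, so the hot read is a single lookup.
SELECT id, title, author_name, comment_count
FROM   posts_denormalized
WHERE  id = 42;
```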
How to get there?
1. Master DDD (Domain-Driven Design).
2. Gain a good understanding of CQRS (Command Query Responsibility Segregation) and eventual consistency.
3. Understand your domain and business processes.
4. Design a model that is tuned to the access patterns (see the sketch after this list).
5. Review.
6. Repeat steps 3 - 5.
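
For step 4, here is a minimal relational sketch of a CQRS-style read model (all names are invented; in a document database the projection would be a denormalised document rather than a table):

```sql
-- The write side stays normalized; the read side is a projection shaped
-- exactly like the screen that consumes it (a customer's order overview).
CREATE TABLE customer_order_summary_read (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    open_orders   INT,
    total_spent   DECIMAL(12,2),
    last_order_at DATETIME
);

-- Rebuilt asynchronously from the normalized tables (or from domain events);
-- readers accept that it may lag slightly behind the writes, which is the
-- eventual consistency mentioned above.
REPLACE INTO customer_order_summary_read
SELECT c.id,
       c.name,
       SUM(o.status = 'open'),
       SUM(o.amount),
       MAX(o.created_at)
FROM   customers c
JOIN   orders o ON o.customer_id = c.id
GROUP BY c.id, c.name;
```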

SQL Server: Mitigating schema changes/upgrades

I haven't spent a ton of time researching this yet, mostly looking for best practices on upgrading/changing DB schemas.
We're actively developing a new product and as such we often have additions or changes to our DB schema. We also have many copies of the DB -- one for the test environment, one for the prod environment, dev environments, you name it. We don't really want to have to blow away test data every time we want to make a change to the DB.
Are there good ways of automating this or handling this? None of us have really ever had to deal with this so...
Normalize, Normalize, Normalize
Then do it again.
This means that you can just slip new tables / views and other tasty goodness in without disrupting other tables.
I have seen databases that claim to be normalized, but are not. Try and look ahead when thinking about separating things out.
You may pay a bit with joins, but query views rather than tables, adopt a good caching strategy, and you will be good to go. Some NoSQL databases offer better flexibility, but they are still rather immature at the moment.
What we have is an SQL-independent table description that gets translated into SQL and update statements, plus an ORM/ActiveRecord/Mapper that uses nothing but the SQL database schema itself to work out what is going on... this means your app adjusts to changes too.
We also use stored procedures heavily for inserts and mainly read from views.
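A small sketch of that pattern with invented names: reads go through a view, writes go through a stored procedure, so the underlying tables can change without touching application code.

```sql
-- Readers query the view, not the base tables.
CREATE VIEW v_active_products AS
SELECT p.id, p.name, p.price, s.quantity
FROM   products p
JOIN   stock s ON s.product_id = p.id
WHERE  p.discontinued = 0;

-- Writers call a procedure, which leaves room for the schema to evolve
-- underneath without breaking callers.
DELIMITER //
CREATE PROCEDURE add_product(IN p_name VARCHAR(100), IN p_price DECIMAL(10,2))
BEGIN
    INSERT INTO products (name, price, discontinued)
    VALUES (p_name, p_price, 0);
END //
DELIMITER ;
```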

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be frequently displayed back and often updated. At the moment I store the content in a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications and so on. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once the size of our database reaches a significant number. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews of MongoDB and it looks interesting. I'm happy to spend the time and learn if one turns out to be the way to go. I'd much appreciate anyone offering points or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarefied database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput that would cripple a properly structured, well-resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of db is how efficiently data can be cached. An effective cache can have dramatic effects on reducing db load and speeding up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although maybe someone with more legal experience could explain the implications better. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands; you can do that with closed source.
PostgreSQL is usually the best free replacement for Oracle, and the BSD license should be more business friendly.
Since you prefer a non-relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type as a factor.
There are three things that really have a deep impact on which database is your best choice, and which you do not mention:
The size of your data or if you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case, more than a database, you may need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their concept/design they do not handle data distribution as well. But will you really handle so much data that it does not fit on one server, or have access-rights requirements that need special separate/extra servers?
However, most people will go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, and basically most relational database engines, can handle rather large amounts of data, given proper indexes and a sane database schema.
Why don't you try how MySQL behaves with a bigger data volume in your setup? Make some scripts that generate realistic data in a MySQL test database, generate some load on the system, and see if it is fast enough.
Only if it is not fast enough should you start considering optimizing the database or changing to a different database engine.
Be careful with NHibernate: it is easy to make a solution that is nice and easy to code with, but has bad performance with large amounts of data. For example, whether to use lazy or eager fetching with associations should be carefully considered. I don't mean that you shouldn't use NHibernate, but make sure that you understand how NHibernate works, for example what the "n + 1 selects" problem means.
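To make the "n + 1 selects" problem concrete, this is the query pattern a lazily-loading ORM can silently generate (the tables are hypothetical), compared with the single query eager fetching produces:

```sql
-- N+1 pattern: one query for the parent rows...
SELECT id, title FROM posts WHERE author_id = 7;
-- ...then one extra query per post as each comments collection is lazily loaded.
SELECT * FROM comments WHERE post_id = 101;
SELECT * FROM comments WHERE post_id = 102;
SELECT * FROM comments WHERE post_id = 103;
-- (and so on, once for every row returned by the first query)

-- Eager fetching collapses all of those round trips into one joined query.
SELECT p.id, p.title, c.id AS comment_id, c.body
FROM   posts p
LEFT JOIN comments c ON c.post_id = p.id
WHERE  p.author_id = 7;
```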
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables are, so you can usually just store your objects directly as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because Mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.

Database structure - is MySQL the right choice?

We are currently planning the database structure of a quite complex e-commerce web app that has flexibility as its main cornerstone.
Our app features a large amount of data (products) and we have run into a slight headache trying to keep performance high without compromising normalization rules in the database, or leaving our highly beloved flexibility concept behind when integrating product options (also widely known as product attributes or parameters).
Based on various references and sources available, we have made lists of the pros and cons of all the major and well-known database patterns that address this. After comparing them, we have come up with two final alternatives:
EAV (Entity-attribute-value model):
Pros: Database is used for all sorting.
Cons: All related queries will include a number of joins between multiple tables in order to complete the collection of data.
SLOB (Serialized LOB, also known as Facade?):
Pros: Very flexible. Keeping the number of necessary joins low compared to a EAV design pattern. Easy to update/add/remove data from each product but hard to keep data integrity without additional tables.
Cons: All sorting will be done by the application instead of the database. Will consume a lot of resources (memory?) when big datasets are processed by a large number of users. (See the sketch below for how the two patterns compare as tables.)
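For concreteness, a minimal sketch of the two candidates as MySQL tables (column names simplified and invented):

```sql
-- EAV: one row per (product, attribute, value). Very flexible, and the
-- database can filter/sort on values, but every attribute read needs a join.
CREATE TABLE product_attributes (
    product_id   INT NOT NULL,
    attribute_id INT NOT NULL,
    value        VARCHAR(255),
    PRIMARY KEY (product_id, attribute_id)
);

-- SLOB: the whole option bag is serialized into one column on the product
-- row. No joins, but the database can no longer filter or sort on its contents.
CREATE TABLE products (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(255),
    options TEXT  -- serialized (e.g. JSON) attribute bag
);
```

Worth noting: newer MySQL versions offer a native JSON column type, and values extracted from it can be indexed via generated columns, which softens the sorting drawback of the SLOB approach.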
Our main questions:
Which pattern/structure would you use, or maybe even a different solution?
Are there better databases besides MySQL available nowadays to accomplish what we want?
Thanks a lot!
Reference: How to design a product table for many kinds of product where each product has many parameters
Why limit yourself to one model? It's very possible that you'll be better off with two different models where each one meets a specific goal very well.
Assuming, as is often the case, that the two don't have to be absolutely and instantaneously in sync, you might easily end up with much better overall performance. What kind of hard requirements would you have on synchronization? Milliseconds up to a minute?
Udi Dahan has some good information on command query responsibility separation (CQRS) that's relevant. See also a couple of other articles. InfoQ also has very relevant video of Greg Young from QCon08.
EDIT: Here's another video (by Udi Dahan) that discusses, among other things, the benefits of multiple models.
MySQL performs very well even for very large datasets. I use it at a financial services SaaS company and it has always worked well. I have also used SQL Server and Oracle for very large applications, and MySQL performs no better or worse on the whole. My focus is more the business layer, though, and you may get more detailed opinions from people closer to the DB.
When selecting a pattern, keep in mind that it's much more straightforward to scale the application tier than the data tier (easy and cheap to add application servers). Performing many joins for common operations can cause a real performance bottleneck.
I would suggest you prototype both approaches so that you can both get more familiar with each of them, and benchmark their performance in your specific environment.
Additionally, you may want to look into alternatives to SQL that attempt to achieve a pattern similar to the ones you outline. A friend at a very large, well-known Internet company is starting to use Project Voldemort. He prefers it over similar efforts mostly due to the very active community.
From your solution, it seems you don't really want to use a relational model, so perhaps it's better not to use a relational database; take a look at these alternatives: http://nosql-database.org/ By the way, SQL Server has nice SLOB-style features in the form of XML fields (which can be indexed and queried through XQuery).
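
For illustration, a small sketch of that SQL Server feature with an invented table: the XML column carries the flexible option bag, and XQuery lets the server filter and extract inside it.

```sql
-- SQL Server: an XML column holds the per-product option bag.
CREATE TABLE Products (
    Id      INT IDENTITY PRIMARY KEY,
    Name    NVARCHAR(255),
    Options XML
);

-- The column can be covered by an XML index, and queries can reach inside it.
SELECT Id,
       Name,
       Options.value('(/options/color)[1]', 'nvarchar(50)') AS Color
FROM   Products
WHERE  Options.exist('/options[color="red"]') = 1;
```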