I am working on a project involving large amount of data from the delicious website. The data available is "Date, UserId, Url, Tags" (for each bookmark).
I normalized my database to a 3NF, and because of the nature of the queries that we wanted to use in combination, I came down to 6 tables... The design looks fine, however, now that a large amount of data is in the database, most of the queries need to join at least 2 tables together to get the answer, sometimes 3 or 4. At first, we didn't have any performance issues, because for testing matters we had not added too much data to the database. Now that we have a lot of data, simply joining extremely large tables takes a lot of time and for our project, which has to be real-time, this is a disaster.
I was wondering how big companies solve these issues. Looks like normalizing tables just adds complexity, but how does the big company handle large amounts of data in their databases, don't they use normalization?
Thanks.
Since you asked about how big companies (generally) approaches this:
They usually have a dba(database administrator) who lives and breathes the database the company uses.
This means they have people that know everything from how to design the tables optimally, profile and tune the queries/indexes/OS/server to knowing what firmware revision of the RAID controller that can cause problems for the database.
You don't talk much about what kind of tuning you've done, e.g.
Are you using MyISAM or InnoDB tables ? Their performance(and not the least their features) is radically different for different workloads.
Are the tables properly indexed according to the queries you run ?
run EXPLAIN on all your queries - which will help you identify keys that could be added/removed, wether the proper keys are selected, compare queries(SQL leaves you with lots of way to accomplish the same things)
Have you tuned the query-cache ? For some workloads the query cache(default on) can cause considerable slowdown.
How much memory do your box have , and is mysql tuned to take advantage of this ?
Do you use a file system and raid setup geared towards the database ?
Sometimes a little de-normalization is needed.
Different database products will have different charasteristics, MySQL might be blazingly fast for some worlkoads, and slow for others.
Related
I am dealing with a large database that is collecting historical pricing data. The schema is relatively simple and does not change.
Something like:
SKU (char), type(enum), price(double), datetime(datetime)
The issue is that this table now has over 500,000,000 rows and is around 20gb and growing. It is already getting a bit difficult to run queries. One common query is to get all skus from a specific date range consisting of maybe 500,000 records. Add any complexity like group by, and you can forget it.
This db is mostly writes. But we obviously need to crunch the data and run queries occasionally. I understand that better index planning can help speed up the queries, but I am wondering if this is the type of data that would benefit from a noSQL solution like MongoDB? Can I expect mysql (probably moving to MariaDB) to continue to work for us, even after it grows beyond 100-200 gb in size? Or should I explore alternatives before things get unweildly?
NoSQL is not a solution to a "large database" problem; NoSQL--specifically document databases--are designed for scenarios where the nature of the data you're storing varies, so you don't want to define rigid schemas and relationships up front.
What you have is simple, well-defined data. This is ideally suited for a relational database, but for something of that scale I would recommend looking something either commercial (i.e. SQL Server or Oracle, depending on your platform). The databases I work with in SQL Server are around four terabytes in size with several tables in the hundreds-of-millions records like you have. A relational database can easily accommodate the simple data you've outlined.
You actually have an ideal use-case for SQL, and a rather bad fit for NoSQL. MySQL devs report people using databases of 5,000,000,000 records. Some other SQL servers will be even more scalable than that. However, if you don't have a proper index support, it should be impossible to manage even a fraction of that.
BTW, what is your table schema, including indices?
You could switch to mariadb and then use the spider engine. The spider engine makes it possible to split your data across multiple mariadb instances without loosing the abillity to run queries against your existing instance.
So you can define your own rules for partitioning and then create one instance per partition. So in the end you have multiple instances of mariadb but all your records are virtual sumed up in one table with the spider engine.
Your performace gain would be because you split your data across multiple instances and therefore reduce the amount of records per table or instance and of course by using more hardware ressources.
I am trying to apply for a job, which asks for the experiences on handling large scale data sets using relational database, like mySQL.
I would like to know which specific skill sets are required for handling large scale data using MySQL.
Handling large scale data with MySQL isn't just a specific set of skills, as there are a bazillion ways to deal with a large data set. Some basic things to understand are:
Column Indexes, how, why, and when they're used, and the pros and cons of using them.
Good database structure to balance between fast writes and easy reads.
Caching, leveraging several layers of caching and different caching technologies (memcached, redis, etc)
Examining MySQL queries to identify bottlenecks and understanding the MySQL internals to see how queries get planned an executed by the database server in order to increase query performance
Configuring the MySQL server to be able to handle a lot of concurrent connections, and access it's data fast. Hardware bottlenecks, and the advantages to using different technologies to speed up your hardware (for example, storing your MySQL data on a RAID5 Array to increase IO performance))
Leveraging built-in MySQL technology (like Replication) to off-load read traffic
These are just a few things that get thought about in regards to big data in MySQL. There's a TON more, which is why the company is looking for experience in the area. Knowing what to do, or having experience with things that have worked or failed for you is an absolutely invaluable asset to bring to a company that deals with high traffic, high availability, and high volume services.
edit
I would be remis if I didn't mention a source for more information. Check out High Performance MySQL. This is an incredible book, and has a plethora of information on how to make MySQL perform in all scenarios. Definitely worth the money, and the time spent reading it.
edit -- good structure for balanced writes and reads
With this point, I was referring to the topic of normalization / de-normalization. If you're familiar with DB design, you know that normalization is the separation of data as to reduce (eliminate) the amount of duplicate data you have about any single record. This is generally a fantastic idea, as it makes tables smaller, faster to query, easier to index (individually) and reduces the number of writes you have to do in order to create/update a new record.
There are different levels of normalization (as #Adam Robinson pointed out in the comments below) which are referred to as normal forms. Almost every web application I've worked with hasn't had much benefit beyond the 3NF (3rd Normal Form). Which the definition of, if you were to read that wikipedia link above, will probably make your head hurt. So in lamens (at the risk of dumbing it down too far...) a 3NF structure satisfies the following rules:
No duplicate columns within the same table.
Create different tables for each set related data. (Example: a Companies table which has a list of companies, and an Employees table which has a list of each companies' employees)
No sub-sets of columns which apply to multiple rows in a table. (Example: zip_code, state, and city is a sub-set of data which can be identified uniquely by zip_code. These 3 columns could be put in their own table, and referenced by the Employees table (in the previous example) by the zip_code). This eliminates large sets of duplication within your tables, so any change that is required to the city/state for any zip code is a single write operation instead of 1 write for every employee who lives in that zip code.
Each sub-set of data is moved to it's own table and is identified by it's own primary key (this is touched/explained in the example for #3).
Remove columns which are not fully dependent on the primary key. (An example here might be if your Employees table has start_date, end_date, and years_employed columns. The start_date and end_date are both unique and dependent on any single employee row, but the years_employed can be derived by subtracting start_date from end_date. This is important because as end-date increases, so does years_employed so if you were to update end_date you'd also have to update years_employed (2 writes instead of 1)
A fully normalized (3NF) database table structure is great, if you've got a very heavy write-load. If your server is doing a lot of writes, it's very easy to write small bits of data, especially when you're running fewer of them. The drawback is, all your reads become much more expensive, because you have to (typically) run a lot of JOIN queries when you're pulling data out. JOINs are typically expensive and harder to create proper indexes for when you're utilizing WHERE clauses that span the relationship and when sorting the result-sets If you have to perform a lot of reads (SELECTs) on your data-set, using a 3NF structure can cause you some performance problems. This is because as your tables grow you're asking MySQL to cram more and more table data (and indexes) into memory. Ideally this is what you want, but with big data-sets you're just not going to have enough memory to fit all of this at once. This is when MySQL starts to create temporary tables, and has to use the disk to load data and manipulate it. Once MySQL becomes reliant on the hard disk to serve up query results you're going to see a significant performance drop. This is less-so the case with solid state disks, but they are super expensive, and (imo) are not mature enough to use on mission critical data sets yet (i mean, unless you're prepared for them to fail and have a very fast backup recovery system in place...then use them and gonuts!).
This is the balancing part. You have to decide what kind of traffic the data you're reading/writing is going to be serving more of, and design that to be fast. In some instances, people don't mind writes being slow because they happen less frequently. In other cases, writes have to be very fast, and the reads don't have to be fast because the data isn't accessed that often (or at all, or even in real time).
Workloads that require a lot of reads benefit the most from a middle-tier caching layer. The idea is that your writes are still fast (because you're 'normal') and your reads can be slow because you're going to cache it (in memcached or something competitive to it), so you don't hit the database very frequently. The drawback here is, if your cache gets invalidated quickly, then the cache is not reducing the read load by a meaningful amount and that results in no added performance (and possibly even more overhead to check/invalidate the caches).
With workloads that have the requirement for high throughput in writes, with data that is read frequently, and can't be cached (constantly changes), you have to come up with another strategy. This could mean that you start to de-normalize your tables, by removing some of the normalization requirements you choose to satisfy, or something else. Instead of making smaller tables with less repetitive data, you make larger tables with more repetitive / redundant data. The advantage here is that your data is all in the same table, so you don't have to perform as many (or, any) JOINs to pull the data out. The drawback...writes are more expensive because you have to write in multiple places.
So with any given situation the developer(s) have to identify what kind of use the data structure is going to have to serve, and balance between any number of technologies and paradigms to achieve an acceptable solution that meets their needs. No two systems or solutions are the same which is why the employer is looking for someone with experience on how to deal with these large datasets. Finding these solutions is not something that can really be learned out of a book, it typically takes some experience in the field and experience with how different solutions performed.
I hope that helps. I know I rambled a bit, but it's really a lot of information. This is why DBAs make the big dollars (:
You need to know how to process the data in "chunks". That means instead of simply trying to manipulate the entire data set, you need to break it into smaller more manageable pieces. For example, if you had a table with 1 Billion records, a single update statement against the entire table would likely take a long time to complete, and may possibly bring the server to it's knees.
You could, however, issue a series of update statements within a loop that would update 20,000 records at a time. Each iteration of the loop you would increment your range/counters/whatever to identify the next set of records.
Also, you commit your changes at the end of each loop, thereby allowing you to stop the process and continue where you left off.
This is just one aspect of managing large data sets. You still need to know:
how to perform backups
proper indexing
database maintenance
You can raed/learn how to handle large dataset with MySQL But it is not equivalent to having actual experiences.
Straight and simple answer: Study about partitioned database and find appropriate MySQL data structure types for large scale datasets similar with the partitioned database architecture.
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
How many rows in a database are TOO MANY?
I am building the database scheme for an application that will have users, and each user will have many rows in relation tables such as 'favorites'.
Each user could have thousands of favorites, and there could be thousands of registered users (over time).
Given that users are never deleted, because that would either leave other entities orphaned, or have them deleted too (which isn't desired), and therefore these tables will keep growing forever, I was wondering if the resulting tables could be too big (eg: 1kk rows), and I should worry about this and do something like mark old and inactive users as deleted and remove the relations that only affect them (such as the favorites and other preferences).
Is this the way to go? Or can mysql easily handle 1kk rows in a table? Is there a known limit? Or is it fully hardware-dependant?
I agree with klennepette and Brian - with a couple of caveats.
If your data is inherently relational, and subject to queries that work well with SQL, you should be able to scale to hundreds of millions of records without exotic hardware requirements.
You will need to invest in indexing, query tuning, and making the occasional sacrifice to the relational model in the interests of speed. You should at least nod to performance when designing tables – preferring integers to strings for keys, for instance.
If, however, you have document-centric requirements, need free text search, or have lots of hierarchical relationships, you may need to look again.
If you need ACID transactions, you may run into scalability issues earlier than if you don't care about transactions (though this is still unlikely to affect you in practice); if you have long-running or complex transactions, your scalability decreases quite rapidly.
I'd recommend building the project from the ground up with scalability requirements in mind. What I've done in the past is set up a test environment populated with millions of records (I used DBMonster, but not sure if that's still around), and regularly test work-in-progress code against this database using load testing tools like Jmeter.
Millions of rows is fine, tens of millions of rows is fine - provided you've got an even remotely decent server, i.e. a few Gbs of RAM, plenty disk space. You will need to learn about indexes for fast retrieval, but in terms of MySQL being able to handle it, no problem.
Here's an example that demonstrates what can be achived using a well designed/normalised innodb schema which takes advantage of innodb's clustered primary key indexes (not available with myisam). The example is based on a forum with threads and has 500 million rows and query runtimes of 0.02 seconds while under load.
MySQL and NoSQL: Help me to choose the right one
It is mostly hardware dependant, but having that said MySQL scales pretty well.
I wouldn't worry too much about table size, if it does become an issue later on you can always use partitioning to ease the stress.
I'm working on a project which is similar in nature to website visitor analysis.
It will be used by 100s of websites with average of 10,000s to 100,000s page views a day each so the data amount will be very large.
Should I use a single table with websiteid or a separate table for each website?
Making changes to a live service with 100s of websites with separate tables for each seems like a big problem. On the other hand performance and scalability are probably going to be a problem with such large data. Any suggestions, comments or advice is most welcome.
How about one table partitioned by website FK?
I would say use the design that most makes sense given your data - in this case one large table.
The records will all be the same type, with same columns, so from a database normalization standpoint they make sense to have them in the same table. An index makes selecting particular rows easy, especially when whole queries can be satisfied by data in a single index (which can often be the case).
Note that visitor analysis will necessarily involve a lot of operations where there is no easy way to optimise other than to operate on a large number of rows at once - for instance: counts, sums, and averages. It is typical for resource intensive statistics like this to be pre-calculated and stored, rather than fetched live. It's something you would want to think about.
If the data is uniform, go with one table. If you ever need to SELECT across all websites
having multiple tables is a pain. However if you write enough scripting you can do it with multiple tables.
You could use MySQL's MERGE storage engine to do SELECTs across the tables (but don't expect good performance, and watch out for the Windows hard limit on the number of open files - in Linux you may haveto use ulimit to raise the limit. There's no way to do it in Windows).
I have broken a huge table into many (hundreds) of tables and used MERGE to SELECT. I did this so the I could perform off-line creation and optimization of each of the small tables. (Eg OPTIMIZE or ALTER TABLE...ORDER BY). However the performance of SELECT with MERGE caused me to write my own custom storage engine. (Described http://blog.coldlogic.com/categories/coldstore/'>here)
Use the single data structure. Once you start encountering performance problems there are many solutions like you can partition your tables by website id also known as horizontal partitioning or you can also use replication. This all depends upon the the ratio of reads vs writes.
But for start keep things simple and use one table with proper indexing. You can also determine if you need transactions or not. You can also take advantage of various different mysql storage engines like MyIsam or NDB (in memory clustering) to boost up the performance. Also caching plays a very good role in offloading the load from the database. The data that is mostly read only and can be computed easily is usually put in the cache and the cache serves the request instead of going to the database and only the necessary queries go to the database.
Use one table unless you have performance problems with MySQL.
Nobody here cannot answer performance questions, you should just do performance tests yourself to understand, whether having one big table is sufficient.
I know that in SQL Server, the maximum number of "objects" in a database is a little over 2 billion. Objects contains tables, views, stored procedures, indexes, among other things . I'm not at all worried about going beyond 2 billion objects. However, what I would like to know, is, does SQL Server suffer a performance hit from having a large number of tables. Does each table you add have a performance hit, or is there basically no difference (assuming constant amount of data). Does anybody have any experience working with databases with thousands of tables? I'm also wondering the same about MySQL.
No difference, assuming constant amount of data.
Probably a gain in practical terms because of things like reduced maintenance windows (smaller index rebuilds), ability to have read-only file groups etc.
Performance is determined by queries and indexes (at the most basic level): not number of objects
In terms of the max number of tables I have had a database with 2 million tables. No performance hit at all. my tables where small around 15MB each.
I doubt SQL Server will have a performance problem working with thousands of tables, but I sure would.
I've worked on databases with hundreds of tables in SQL Server with no problems, though.
SQl Server can suffer a larger performance hit by using tables with many, many columns instead of breaking out a related table (even one with a one-to_one relationship). Plus a wide table likely can have problems when the data you want to input exceeds the number of bytes that you can store for a column. You can create a table that has the potential to store, for example, 10000 bytes but you will still only be able to store 8060 bytes.
In my experience I dont think that the number of tables will hit the performance. But then you should be able to justify why you are having so many tables in the database. That is because having so many tables at the database side will also effect the work of developer at the server side.
IMO if you divide the tables on the basis of functionality then you can not make the life of developer easy but also have performance gain in your application because you have fixed tables from where you suppose to get the required data.
Say like you need to store sales,purchase,receipt and payment details. All of the them have the same table structure then instead of storing them in single table you could store them all separately in separate tables. with these you can get all the details for the sales in single table, for purchase in its single table and likewise. thus it can help in improving the response time of database tier of the application which is one of the most slowest component in all web tiers...!!! ofcourse we imporve upon the performance of database by SQL quires but then such structuring can also indirectly help you improve upon the database performance.