I have following SQL query that taking too much time to fetch data.
Customer.joins("LEFT OUTER JOIN renewals ON customers.id = renewals.customer_id").where("renewals.customer_id IS NULL && customers.status_id = 4").order("created_at DESC").select('first_name, last_name, customer_state, customers.created_at, customers.customer_state, customers.id, customers.status_id')
Above query takes 230976.6ms to execute.
I added indexing on firstname, lastname, customer_state and status_id.
How can I execute query within less then 3 sec. ?
Try this...
Everyone wants faster database queries, and both SQL developers and DBAs can turn to many time-tested methods to achieve that goal. Unfortunately, no single method is foolproof or ironclad. But even if there is no right answer to tuning every query, there are plenty of proven do's and don'ts to help light the way. While some are RDBMS-specific, most of these tips apply to any relational database.
Do use temp tables to improve cursor performance
I hope we all know by now that it’s best to stay away from cursors if at all possible. Cursors not only suffer from speed problems, which in itself can be an issue with many operations, but they can also cause your operation to block other operations for a lot longer than is necessary. This greatly decreases concurrency in your system.
However, you can’t always avoid using cursors, and when those times arise, you may be able to get away from cursor-induced performance issues by doing the cursor operations against a temp table instead. Take, for example, a cursor that goes through a table and updates a couple of columns based on some comparison results. Instead of doing the comparison against the live table, you may be able to put that data into a temp table and do the comparison against that instead. Then you have a single UPDATE statement against the live table that’s much smaller and holds locks only for a short time.
Sniping your data modifications like this can greatly increase concurrency. I’ll finish by saying you almost never need to use a cursor. There’s almost always a set-based solution; you need to learn to see it.
Don’t nest views
Views can be convenient, but you need to be careful when using them. While views can help to obscure large queries from users and to standardize data access, you can easily find yourself in a situation where you have views that call views that call views that call views. This is called nesting views, and it can cause severe performance issues, particularly in two ways. First, you will very likely have much more data coming back than you need. Second, the query optimizer will give up and return a bad query plan.
I once had a client that loved nesting views. The client had one view it used for almost everything because it had two important joins. The problem was that the view returned a column with 2MB documents in it. Some of the documents were even larger. The client was pushing at least an extra 2MB across the network for every single row in almost every single query it ran. Naturally, query performance was abysmal.
And none of the queries actually used that column! Of course, the column was buried seven views deep, so even finding it was difficult. When I removed the document column from the view, the time for the biggest query went from 2.5 hours to 10 minutes. When I finally unraveled the nested views, which had several unnecessary joins and columns, and wrote a plain query, the time for that same query dropped to subseconds.
Do use table-valued functions
RESOURCES
VIDEO/WEBCAST
Sponsored
Discover your Data Dilemma
WHITE PAPER
Best Practices when Designing a Digital Workplace
SEE ALL
Search Resources
Go
This is one of my favorite tricks of all time because it is truly one of those hidden secrets that only the experts know. When you use a scalar function in the SELECT list of a query, the function gets called for every single row in the result set. This can reduce the performance of large queries by a significant amount. However, you can greatly improve the performance by converting the scalar function to a table-valued function and using a CROSS APPLY in the query. This is a wonderful trick that can yield great improvements.
Want to know more about the APPLY operator? You'll find a full discussion in an excellent course on Microsoft Virtual Academy by Itzik Ben-Gan.
Do use partitioning to avoid large data moves
Not everyone will be able to take advantage of this tip, which relies on partitioning in SQL Server Enterprise, but for those of you who can, it’s a great trick. Most people don’t realize that all tables in SQL Server are partitioned. You can separate a table into multiple partitions if you like, but even simple tables are partitioned from the time they’re created; however, they’re created as single partitions. If you're running SQL Server Enterprise, you already have the advantages of partitioned tables at your disposal.
This means you can use partitioning features like SWITCH to archive large amounts of data from a warehousing load. Let’s look at a real example from a client I had last year. The client had the requirement to copy the data from the current day’s table into an archive table; in case the load failed, the company could quickly recover with the current day’s table. For various reasons, it couldn’t rename the tables back and forth every time, so the company inserted the data into an archive table every day before the load, then deleted the current day’s data from the live table.
This process worked fine in the beginning, but a year later, it was taking 1.5 hours to copy each table -- and several tables had to be copied every day. The problem was only going to get worse. The solution was to scrap the INSERT and DELETE process and use the SWITCH command. The SWITCH command allowed the company to avoid all of the writes because it assigned the pages to the archive table. It’s only a metadata change. The SWITCH took on average between two and three seconds to run. If the current load ever fails, you SWITCH the data back into the original table.
YOU MIGHT ALSO LIKE
Microsoft Dynamics AX ERP
Microsoft Dynamics AX: A new ERP is born, this time in the cloud
Joseph Sirosh
Why Microsoft’s data chief thinks current machine learning tools are like...
Urs Holzle Structure
Google's infrastructure czar predicts cloud business will outpace ads in 5...
This is a case where understanding that all tables are partitions slashed hours from a data load.
If you must use ORMs, use stored procedures
This is one of my regular diatribes. In short, don’t use ORMs (object-relational mappers). ORMs produce some of the worst code on the planet, and they’re responsible for almost every performance issue I get involved in. ORM code generators can’t possibly write SQL as well as a person who knows what they're doing. However, if you use an ORM, write your own stored procedures and have the ORM call the stored procedure instead of writing its own queries. Look, I know all the arguments, and I know that developers and managers love ORMs because they speed you to market. But the cost is incredibly high when you see what the queries do to your database.
Stored procedures have a number of advantages. For starters, you’re pushing much less data across the network. If you have a long query, then it could take three or four round trips across the network to get the entire query to the database server. That's not including the time it takes the server to put the query back together and run it, or considering that the query may run several -- or several hundred -- times a second.
Using a stored procedure will greatly reduce that traffic because the stored procedure call will always be much shorter. Also, stored procedures are easier to trace in Profiler or any other tool. A stored procedure is an actual object in your database. That means it's much easier to get performance statistics on a stored procedure than on an ad-hoc query and, in turn, find performance issues and draw out anomalies.
In addition, stored procedures parameterize more consistently. This means you’re more likely to reuse your execution plans and even deal with caching issues, which can be difficult to pin down with ad-hoc queries. Stored procedures also make it much easier to deal with edge cases and even add auditing or change-locking behavior. A stored procedure can handle many tasks that trouble ad-hoc queries. My wife unraveled a two-page query from Entity Framework a couple of years ago. It took 25 minutes to run. When she boiled it down to its essence, she rewrote that huge query as SELECT COUNT(*) from T1. No kidding.
OK, I kept it as short as I could. Those are the high-level points. I know many .Net coders think that business logic doesn’t belong in the database, but what can I say other than you’re outright wrong. By putting the business logic on the front end of the application, you have to bring all of the data across the wire merely to compare it. That’s not good performance. I had a client earlier this year that kept all of the logic out of the database and did everything on the front end. The company was shipping hundreds of thousands of rows of data to the front end, so it could apply the business logic and present the data it needed. It took 40 minutes to do that. I put a stored procedure on the back end and had it call from the front end; the page loaded in three seconds.
Of course, the truth is that sometimes the logic belongs on the front end and sometimes it belongs in the database. But ORMs always get me ranting.
Don’t do large ops on many tables in the same batch
This one seems obvious, but apparently it's not. I’ll use another live example because it will drive home the point much better. I had a system that suffered tons of blocking. Dozens of operations were at a standstill. As it turned out, a delete routine that ran several times a day was deleting data out of 14 tables in an explicit transaction. Handling all 14 tables in one transaction meant that the locks were held on every single table until all of the deletes were finished. The solution was to break up each table's deletes into separate transactions so that each delete transaction held locks on only one table. This freed up the other tables and reduced the blocking and allowed other operations to continue working. You always want to split up large transactions like this into separate smaller ones to prevent blocking.
Don't use triggers
This one is largely the same as the previous one, but it bears mentioning. Don’t use triggers unless it’s unavoidable -- and it’s almost always avoidable.
The problem with triggers: Whatever it is you want them to do will be done in the same transaction as the original operation. If you write a trigger to insert data into another table when you update a row in the Orders table, the lock will be held on both tables until the trigger is done. If you need to insert data into another table after the update, then put the update and the insert into a stored procedure and do them in separate transactions. If you need to roll back, you can do so easily without having to hold locks on both tables. As always, keep transactions as short as possible and don’t hold locks on more than one resource at a time if you can help it.
Don’t cluster on GUID
After all these years, I can't believe we’re still fighting this issue. But I still run into clustered GUIDs at least twice a year.
A GUID (globally unique identifier) is a 16-byte randomly generated number. Ordering your table’s data on this column will cause your table to fragment much faster than using a steadily increasing value like DATE or IDENTITY. I did a benchmark a few years ago where I inserted a bunch of data into one table with a clustered GUID and into another table with an IDENTITY column. The GUID table fragmented so severely that the performance degraded by several thousand percent in a mere 15 minutes. The IDENTITY table lost only a few percent off performance after five hours. This applies to more than GUIDs -- it goes toward any volatile column.
Don’t count all rows if you only need to see if data exists
It's a common situation. You need to see if data exists in a table or for a customer, and based on the results of that check, you’re going to perform some action. I can't tell you how often I've seen someone do a SELECT COUNT(*) FROM dbo.T1 to check for the existence of that data:
SET #CT = (SELECT COUNT(*) FROM dbo.T1);
If #CT > 0
BEGIN
END
It’s completely unnecessary. If you want to check for existence, then do this:
If EXISTS (SELECT 1 FROM dbo.T1)
BEGIN
END
Don’t count everything in the table. Just get back the first row you find. SQL Server is smart enough to use EXISTS properly, and the second block of code returns superfast. The larger the table, the bigger difference this will make. Do the smart thing now before your data gets too big. It’s never too early to tune your database.
In fact, I just ran this example on one of my production databases against a table with 270 million rows. The first query took 15 seconds, and included 456,197 logical reads, while the second one returned in less than one second and included only five logical reads. However, if you really do need a row count on the table, and it's really big, another technique is to pull it from the system table. SELECT rows from sysindexes will get you the row counts for all of the indexes. And because the clustered index represents the data itself, you can get the table rows by adding WHERE indid = 1. Then simply include the table name and you're golden. So the final query is SELECT rows from sysindexes where object_name(id) = 'T1' and indexid = 1. In my 270 million row table, this returned sub-second and had only six logical reads. Now that's performance.
Don’t do negative searches
Take the simple query SELECT * FROM Customers WHERE RegionID <> 3. You can’t use an index with this query because it’s a negative search that has to be compared row by row with a table scan. If you need to do something like this, you may find it performs much better if you rewrite the query to use the index. This query can easily be rewritten like this:
SELECT * FROM Customers WHERE RegionID < 3 UNION ALL SELECT * FROM Customers WHERE RegionID
This query will use an index, so if your data set is large it could greatly outperform the table scan version. Of course, nothing is ever that easy, right? It could also perform worse, so test this before you implement it. There are too many factors involved for me to tell you that it will work 100 percent of the time. Finally, I realize this query breaks the “no double dipping” tip from the last article, but that goes to show there are no hard and fast rules. Though we're double dipping here, we're doing it to avoid a costly table scan.
Ref:http://www.infoworld.com/article/2604472/database/10-more-dos-and-donts-for-faster-sql-queries.html
http://www.infoworld.com/article/2628420/database/database-7-performance-tips-for-faster-sql-queries.html
Related
We currently have a table that contains 90 columns and as the table is growing and the business needs change, we're having to alter the table alot (add/remove cols & indexes).
|------ (Table name: quotes)
|Column|Type|Null|Default
|------
|//**id**//|int(11)|No|
....
|completed_at|datetime|Yes|NULL
|reviewed_at|datetime|Yes|NULL
|marked_dud_at|datetime|Yes|NULL
|closed_at|datetime|Yes|NULL
|subscribed_at|datetime|Yes|NULL
|admin_checked_at|datetime|Yes|NULL
|priced_at|datetime|Yes|NULL
|number_verified_at|datetime|Yes|NULL
|created_at|datetime|Yes|NULL
|deleted_at|datetime|Yes|NULL
For the application, our staff are constantly querying all sorts of variations on the above data, example being where it has been completed (completed_at), checked (admin_checked_at) and not deleted, reviewed (deleted_at, reviewed_at)
We're thinking it may be easier to offload some of these columns into their own row, we'll call it quotes_actions, then when querying do some joining.
|------ (Table name: quotes_actions)
|Column|Type|Null|Default
|------
|//**id**//|int(11)|No|
|quote_id|int(11)|No|
|action|varchar(100)|No|
|user_id|int(11)|No|
|time|datetime|Yes|NULL
|created_at|datetime|Yes|NULL
An example would be action = 'completed' using the field, with an index covering quote_id and action.
We've split the data into this format on 150,000 rows and it's not any faster nor slower than querying the original database with correct indexes.
Has anyone got any experience with this and has any recommendations or pitfalls for each approach? It's taking a lot of time to add covering indexes and add columns to the original table as we needed them, whereas the second approach has the indexes set up ready to go but is introducing a lot more joins and more complicated queries.
0.09s
select * from `quotes`
where `completed_at` is not null
and `approved_at` is not null
and deleted_at is null
=>
0.0005s
select * from `quotes_new`
inner join quotes_actions as q1 on q1.action = 'completed' and q1.quote_id = quotes_new.id
inner join quotes_actions as q2 on q2.action = 'approved' and q2.quote_id = quotes_new.id
where quotes_new.deleted_at is null
In addition, if the 2nd approach is better, how do you query for negative results, where a quote hasn't been approved?
Database design will vary from application to application, and things that are great for one implementation will be terrible for another. You've identified a few things that are important to you:
speed of data access (at least no reduction in current performance)
ability to respond to application needs/changes
limiting complexity of queries
Without being able to see the entirity of your database and how you are using it, these are the principles I would follow:
Use Stored Procedures and Views for as much as possible
This is just good design. You create an adapter layer between your application and the data tables, which allows you to make whatever changes you need to in the database (and the views/stored procs) without having to change the application itself. Decoupling your systems makes maintenance significantly easier. Also this is good for security, as if the only way outsiders can access the data is through your stored procs, you've eliminated a few avenues of attack. (There's also debate about whether or not the DBMS will cache execution plans for stored procedures, making them execute faster than similar queries, but I'm not a DBA or DBDev, so I'm not touching that).
Attempt to limit width of tables
One thing I've seen time and time again is every time a need arises in a production systems, a column gets added to a table and they call it a day. Far easier than rewriting a bunch of queries or reviewing table structures. This is terrible design. If you've already limited the changes needed to the application layer by following my first piece of advice, you've limited the work needed to actually resolve table changes in the right way. You should always evaluate whether data belongs to the row in question, or if it should be offloaded into its own table. You shouldn't be afraid to radically alter your database, as sometimes it is necessary.
Looking at the data you've provided, I think your second option is okay. You've identified many columns that actually represent the same thing (the "status changes" or as you put it "quote actions" that occur) and offloaded that from the main table to a secondary table. This is perfectly fine, and likely will be effective. You can further "cheat" to make this table faster by offloading status onto its own table, and using an integer to represent it instead of a string (since the string doesn't matter to the database, and integers are far faster to index and search).
This is not to say a wide table is a bad thing, sometimes tables just need to be wide. You just need to evaluate whether the data really belongs to the entity the data row represents.
Approach queries in new ways
You will want to play with the execution plan tools of your DBMS and understand how each query really works. Changing the order of joins can drastically alter the query return speed, and you shouldn't be afraid to use table variables and temp tables in your queries. They are all tools at your disposal.
Querying for Negative Results
Since you asked this question specifically, I'll address it. This requires thinking about your query in a little different way (consequently, if you haven't, you should look into taking a course or working through a textbook of Relational Algebra, it makes understanding databases so much easier).
Your original query made finding something where the quote was not approved easy. It was all in the table: approved_at is null. Simple, easy peasy, no problems. Now, however, instead of being in a column on the main table, it is in its own table, that also represents all the other actions that could be taken. You need to break the problem down a little.
You want to find the set wherein of all orders, there is no action to signify it is approved. In SQL that looks like:
select quote_id from quotes_action where quote_id not in
(select quote_id from quotes_action where action = 'approved');
Final Thoughts
You need to sit down with your team and talk about how you want to move forward with this product. Spend a few days or a couple weeks really thinking deeply about it. Brainstorm....hackathon....do something to find a solution you like and makes your product better and more maintainable. We've all been in the situation where we have an unmaintainable product that could have been fixed at some point, but is beyond that point. Try not to get to that point, and fix it while you have the opportunity.
i'm learning mysql and was working on a database for work. Everything's fine so far but I had a question. I am organizing financial statements for firms(balance sheet table, income statement table, cashflow table,etc.) and most companies have quarterly statements(they are unaudited) and annual statements(which are audited). Right now for each statement I have a column that flags it for annual or quarterly.
Its not likely that someone will be running a report on an audited and unaudited statement at the same time, so I was thinking if it was worth it to create a table for audited and one for unaudited. The reason I was thinking this was eventually the data will get fairly large and I thought the smaller the tables the faster performance.
So when I design a database should I be designing based on the content(i.e. group everything thats the same regardless) or should I be grouping based on how people will access it?
Another question this raises is should I be grouping financial statements by countries..since all analysis down at our firm in 90% within the same country
This is impossible to answer definitively without knowing the whole problem.
However, usually you want a single table to represent each logical entity in your system. From the sound of it, quarterly and annual statements represent the same logical entity, but differ by a single category column/field. The same holds true for the country question--if the only difference is the country (a categorization), then they likely should all be stored in the same table.
If you were to split your data into separate tables by category, your data would be scattered across multiple tables, and would be very hard to query. For example, if you wanted a count of all statements in the system, you would have to query ALL country tables and add the results together.
Edit: Joe Celko calls this anti-pattern "Attribute Splitting".
First of all I have to point out, I'm not a professional DB designer.
But if I ware you, in this case I would create one table as the entities are the same basically.
If you fear of mysql's performace on lager datasets, maybe it would be better to start building your app on Postgres. You can boost mysql's performace with stored functions/procedures or maybe views if you have to run complicated queries and of course you can use memcache or any nosql stuff to let the SQL rest a bit.
If you are sure in that users will search mainly only for this or that type of records, you can build three tables. One for all of the records, one-one for the audited and unaudited ones. You can keep them syncronized with the InnoDB's triggers (ON UPDATE/DELETE/INSERT). They could work like views, but I think (not tested) they would be faster then views. In this case you have to manage only the first "large" table. If you insert an audited record, the trigger fires and puts a record in to the audited table an so on...
Best wishes!
I agree with Phil and Damien - one table is better. What you want is one table per type of real business thing. If you design your tables to resemble real things, even abstract or conceptual things, then your data design is more likely to stand the test of time. Once you've sketched out a schema based on the real things you have data about, then you can go back and apply the rules of normalization to formalize your design.
As a rule, it is a bad idea to design for a performance problem you are worried about, but haven't actually seen. Your intuition about big tables being slower might actually be wrong. Most DBMS systems like bigger tables, at least to a point. When tables are big the query optimizers choose to use indexes. When tables are small they often end up getting full table scans, which can really slow down concurrent access. If your tables get so big that they are beyond the capabilities of your DBMS then it is time to consider either archiving out old data that you aren't using anymore or buying a more scalable DBMS.
I am trying to apply for a job, which asks for the experiences on handling large scale data sets using relational database, like mySQL.
I would like to know which specific skill sets are required for handling large scale data using MySQL.
Handling large scale data with MySQL isn't just a specific set of skills, as there are a bazillion ways to deal with a large data set. Some basic things to understand are:
Column Indexes, how, why, and when they're used, and the pros and cons of using them.
Good database structure to balance between fast writes and easy reads.
Caching, leveraging several layers of caching and different caching technologies (memcached, redis, etc)
Examining MySQL queries to identify bottlenecks and understanding the MySQL internals to see how queries get planned an executed by the database server in order to increase query performance
Configuring the MySQL server to be able to handle a lot of concurrent connections, and access it's data fast. Hardware bottlenecks, and the advantages to using different technologies to speed up your hardware (for example, storing your MySQL data on a RAID5 Array to increase IO performance))
Leveraging built-in MySQL technology (like Replication) to off-load read traffic
These are just a few things that get thought about in regards to big data in MySQL. There's a TON more, which is why the company is looking for experience in the area. Knowing what to do, or having experience with things that have worked or failed for you is an absolutely invaluable asset to bring to a company that deals with high traffic, high availability, and high volume services.
edit
I would be remis if I didn't mention a source for more information. Check out High Performance MySQL. This is an incredible book, and has a plethora of information on how to make MySQL perform in all scenarios. Definitely worth the money, and the time spent reading it.
edit -- good structure for balanced writes and reads
With this point, I was referring to the topic of normalization / de-normalization. If you're familiar with DB design, you know that normalization is the separation of data as to reduce (eliminate) the amount of duplicate data you have about any single record. This is generally a fantastic idea, as it makes tables smaller, faster to query, easier to index (individually) and reduces the number of writes you have to do in order to create/update a new record.
There are different levels of normalization (as #Adam Robinson pointed out in the comments below) which are referred to as normal forms. Almost every web application I've worked with hasn't had much benefit beyond the 3NF (3rd Normal Form). Which the definition of, if you were to read that wikipedia link above, will probably make your head hurt. So in lamens (at the risk of dumbing it down too far...) a 3NF structure satisfies the following rules:
No duplicate columns within the same table.
Create different tables for each set related data. (Example: a Companies table which has a list of companies, and an Employees table which has a list of each companies' employees)
No sub-sets of columns which apply to multiple rows in a table. (Example: zip_code, state, and city is a sub-set of data which can be identified uniquely by zip_code. These 3 columns could be put in their own table, and referenced by the Employees table (in the previous example) by the zip_code). This eliminates large sets of duplication within your tables, so any change that is required to the city/state for any zip code is a single write operation instead of 1 write for every employee who lives in that zip code.
Each sub-set of data is moved to it's own table and is identified by it's own primary key (this is touched/explained in the example for #3).
Remove columns which are not fully dependent on the primary key. (An example here might be if your Employees table has start_date, end_date, and years_employed columns. The start_date and end_date are both unique and dependent on any single employee row, but the years_employed can be derived by subtracting start_date from end_date. This is important because as end-date increases, so does years_employed so if you were to update end_date you'd also have to update years_employed (2 writes instead of 1)
A fully normalized (3NF) database table structure is great, if you've got a very heavy write-load. If your server is doing a lot of writes, it's very easy to write small bits of data, especially when you're running fewer of them. The drawback is, all your reads become much more expensive, because you have to (typically) run a lot of JOIN queries when you're pulling data out. JOINs are typically expensive and harder to create proper indexes for when you're utilizing WHERE clauses that span the relationship and when sorting the result-sets If you have to perform a lot of reads (SELECTs) on your data-set, using a 3NF structure can cause you some performance problems. This is because as your tables grow you're asking MySQL to cram more and more table data (and indexes) into memory. Ideally this is what you want, but with big data-sets you're just not going to have enough memory to fit all of this at once. This is when MySQL starts to create temporary tables, and has to use the disk to load data and manipulate it. Once MySQL becomes reliant on the hard disk to serve up query results you're going to see a significant performance drop. This is less-so the case with solid state disks, but they are super expensive, and (imo) are not mature enough to use on mission critical data sets yet (i mean, unless you're prepared for them to fail and have a very fast backup recovery system in place...then use them and gonuts!).
This is the balancing part. You have to decide what kind of traffic the data you're reading/writing is going to be serving more of, and design that to be fast. In some instances, people don't mind writes being slow because they happen less frequently. In other cases, writes have to be very fast, and the reads don't have to be fast because the data isn't accessed that often (or at all, or even in real time).
Workloads that require a lot of reads benefit the most from a middle-tier caching layer. The idea is that your writes are still fast (because you're 'normal') and your reads can be slow because you're going to cache it (in memcached or something competitive to it), so you don't hit the database very frequently. The drawback here is, if your cache gets invalidated quickly, then the cache is not reducing the read load by a meaningful amount and that results in no added performance (and possibly even more overhead to check/invalidate the caches).
With workloads that have the requirement for high throughput in writes, with data that is read frequently, and can't be cached (constantly changes), you have to come up with another strategy. This could mean that you start to de-normalize your tables, by removing some of the normalization requirements you choose to satisfy, or something else. Instead of making smaller tables with less repetitive data, you make larger tables with more repetitive / redundant data. The advantage here is that your data is all in the same table, so you don't have to perform as many (or, any) JOINs to pull the data out. The drawback...writes are more expensive because you have to write in multiple places.
So with any given situation the developer(s) have to identify what kind of use the data structure is going to have to serve, and balance between any number of technologies and paradigms to achieve an acceptable solution that meets their needs. No two systems or solutions are the same which is why the employer is looking for someone with experience on how to deal with these large datasets. Finding these solutions is not something that can really be learned out of a book, it typically takes some experience in the field and experience with how different solutions performed.
I hope that helps. I know I rambled a bit, but it's really a lot of information. This is why DBAs make the big dollars (:
You need to know how to process the data in "chunks". That means instead of simply trying to manipulate the entire data set, you need to break it into smaller more manageable pieces. For example, if you had a table with 1 Billion records, a single update statement against the entire table would likely take a long time to complete, and may possibly bring the server to it's knees.
You could, however, issue a series of update statements within a loop that would update 20,000 records at a time. Each iteration of the loop you would increment your range/counters/whatever to identify the next set of records.
Also, you commit your changes at the end of each loop, thereby allowing you to stop the process and continue where you left off.
This is just one aspect of managing large data sets. You still need to know:
how to perform backups
proper indexing
database maintenance
You can raed/learn how to handle large dataset with MySQL But it is not equivalent to having actual experiences.
Straight and simple answer: Study about partitioned database and find appropriate MySQL data structure types for large scale datasets similar with the partitioned database architecture.
This is my first time building a database with a table containing 10 million records. The table is a members table that will contain all the details of a member.
What do I need to pay attention when I build the database?
Do I need a special version of MySQL? Should I use MyISAM or InnoDB?
For a start, you may need to step back and re-examine your schema. How did you end up with 10 million rows in the member table? Do you actually have 10 million members (it seems like a lot)?
I suspect (although I'm not sure) that you have less than 10 million members in which case your table will not be correctly structured. Please post the schema, that's the first step to us helping you out.
If you do have 10 million members, my advice is to make your application vendor-agnostic to begin with (i.e., standard SQL). Then, if you start running into problems, just toss out your current DBMS and replace it with a more powerful one.
Once you've established you have one that's suitable, then, and only then would I advise using vendor-specific stuff. Otherwise it will be a painful process to change.
BTW, 10 million rows is not really considered a big database table, at least not where I come from.
Beyond that, the following is important (not necessarily an exhaustive list but a good start).
Design your tables for 3NF always. Once you identify performance problems, you can violate that rule provided you understand the consequences.
Don't bother performance tuning during development, your queries are in a state of flux. Just accept the fact they may not run fast.
Once the majority of queries are locked down, then start tuning your tables. Add whatever indexes speed up the selects, de-normalize and so forth.
Tuning is not a set-and-forget operation (which is why we pay our DBAs so much). Continuously monitor performance and tune to suit.
I prefer to keep my SQL standard to retain the ability to switch vendors at any time. But I'm pragmatic. Use vendor-specific stuff if it really gives you a boost. Just be aware of what you're losing and try to isolate the vendor-specific stuff as much as possible.
People that use "select * from ..." when they don't need every column should be beaten into submission.
Likewise those that select every row to filter out on the client side. The people that write our DBMS' aren't sitting around all day playing Solitaire, they know how to make queries run fast. Let the database do what it's best at. Filtering and aggregation is best done on the server side - only send what is needed across the wire.
Generate your queries to be useful. Other than the DoD who require reports detailing every component of their aircraft carriers down to the nuts-and-bolts level, no-one's interested in reading your 1200-page report no matter how useful you think it may be. In fact, I don't think the DoD reads theirs either, but I wouldn't want some general chewing me out because I didn't deliver - those guys can be loud and they have a fair bit of sophisticated weaponry under their control.
At least use InnoDB. You will feel the pain when you realize MyISAM has just lost your data...
Apart from this, you should give more information about what you want to do.
You don't need to use InnoDB if you don't have data integrity and atomic action requirements. You want to use InnoDB if you have foreign keys between tables and you are required to keep the constraints, or if you need to update multiple tables in atomic operation. Otherwise, if you just need to use the table to do analysis, MyISAM is fine.
For queries, make sure you build smart indexes to suite your query. For example, if you want to sort by columns c and selecting based on columns a, and b, make sure you have an index that covers columns a, b, and c, in that order, and that index includes full length of each column, rather than a prefix. If you don't do your index right, sorting over a large amount of data will kill you. See http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html
Just a note about InnoDB and setting up and testing a large table with it. If you start injecting your data, it will take hours. Make sure you issue commits periodically, otherwise if you want to stop and redo for whatever reason, you end up have to 1) wait hours for transaction recovery, or 2) kill mysqld, set InnoDB recover flag to no recover and restart. Also if you want to re-inject data from scratch, DROP the table and recreate it is almost instantaneous, but it will take hours to actually "DELETE FROM table".
I'm developping a chat application. I want to keep everything logged into a table (i.e. "who said what and when").
I hope that in a near future I'll have thousands of rows.
I was wondering : what is the best way to optimize the table, knowing that I'll do often rows insertion and sometimes group reading (i.e. showing an entire conversation from a user (look when he/she logged in/started to chat then look when he/she quit then show the entire conversation)).
This table should be able to handle (I hope though !) many many rows. (15000 / day => 4,5 M each month => 54 M of rows at the end of the year).
The conversations older than 15 days could be historized (but I don't know how I should do to do it right).
Any idea ?
I have two advices for you:
If you are expecting lots of writes
with little low priority reads. Then you
are better off with as little
indexes as possible. Indexes will
make insert slower. Only add what you really need.
If the log table
is going to get bigger and bigger
overtime you should consider log
rotation. Otherwise you might end up
with one gigantic corrupted table.
54 million rows is not that many, especially over a year.
If you are going to be rotating out lots of data periodically, I would recommend using MyISAM and MERGE tables. Since you won't be deleting or editing records, you won't have any locking issues as long as concurrency is set to 1. Inserts will then always be added to the end of the table, so SELECTs and INSERTs can happen simultaneously. So you don't have to use InnoDB based tables (which can use MERGE tables).
You could have 1 table per month, named something like data200905, data200904, etc. Your merge table would them include all the underlying tables you need to search on. Inserts are done on the merge table, so you don't have to worry about changing names. When it's time to rotate out data and create a new table, just redeclare the MERGE table.
You could even create multiple MERGE tables, based on quarter, years, etc. One table can be used in multiple MERGE tables.
I've done this setup on databases that added 30 million records per month.
Mysql does surprisingly well handling very large data sets with little more than standard database tuning and indexes. I ran a site that had millions of rows in a database and was able to run it just fine on mysql.
Mysql does have an "archive" table engine option for handling many rows, but the lack of index support will make it not a great option for you, except perhaps for historical data.
Index creation will be required, but you do have to balance them and not just create them because you can. They will allow for faster queries (and will required for usable queries on a table that large), but the more indexes you have, the more cost there will be inserting.
If you are just querying on your "user" id column, an index on there will not be a problem, but if you are looking to do full text queries on the messages, you may want to consider only indexing the user column in mysql and using something like sphynx or lucene for the full text searches, as full text searches in mysql are not the fastest and significantly slow down insert time.
You could handle this with two tables - one for the current chat history and one archive table. At the end of a period ( week, month or day depending on your traffic) you can archive current chat messages, remove them from the small table and add them to the archive.
This way your application is going to handle well the most common case - query the current chat status and this is going to be really fast.
For queries like "what did x say last month" you will query the archive table and it is going to take a little longer, but this is OK since there won't be that much of this queries and if someone does search like this he would be willing to wait a couple of seconds more.
Depending on your use cases you could extend this principle - if there will be a lot of queries for chat messages during last 6 months - store them in separate table too.
Similar principle (for completely different area) is used by the .NET garbage collector which has different storage for short lived objects, long lived objects, large objects, etc.