Sorry for the long, mostly useless preamble. The most important part is in the last three paragraphs :D
Recently we had a MySQL problem on one of our client servers: out of the blue, something started sending the CPU usage of the mysql process sky-high. Tracking it down led us to finding and optimizing bad queries, and that is where my problem starts.
I had always thought that optimization means speeding up queries (reducing the total time a query needs to execute). But after I optimized several queries with that goal, my colleague started complaining that some of them read too many rows, even all rows in the table (as shown by EXPLAIN).
After rewriting a query I noticed that if I make it read fewer rows, its speed suffers; if I tune it for speed, it reads more rows.
That didn't make sense to me: fewer rows read, yet a longer execution time.
And that made me wonder what should be done. Of course it would be perfect to have a fast query that also reads the fewest rows, but since that doesn't seem possible here, I'm looking for answers. Which should I aim for: speed, or fewer rows read? What are the pros and cons of a fast query that reads more rows versus a query that reads fewer rows but runs slower? What happens on the server in each case?
All my googling turned up articles and discussions about how to improve speed, but none of them covered the trade-off I described above.
I'd be happy to hear even personal preferences, as long as they come with some reasoning.
Links that could point me in the right direction are welcome too.
I think your problem depends on how you are limiting the number of rows read. If you read fewer rows by adding more WHERE conditions that MySQL has to evaluate, then yes, performance will take a hit.
I would look at indexing the columns that make your search expensive. Simple data types are faster to look up than complex ones. Check whether you are searching against indexed columns.
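For instance, a rough sketch of what I mean (the table, column, and index names here are made up for illustration):

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;              -- the "key" column shows which index, if any, is used
ALTER TABLE orders ADD INDEX idx_orders_customer (customer_id);   -- add one if the lookup column isn't covered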
Without more data, I can give you some hints:
Be sure your tables are properly indexed. Create the appropriate indexes for each of your tables. Also drop the indexes that are not needed.
Decide the best approach for each query. For example, if you use GROUP BY only to deduplicate rows, you are wasting resources; it is better to use SELECT DISTINCT (on an indexed field).
"Divide and conquer". Can you split your process in two, three or more intermediate steps? If the answer is "yes", then: Can you create temporary tables for some of these steps? I've split proceses using temp tables, and they are very useful to speed things up.
The count of rows read reported by EXPLAIN is an estimate anyway -- don't take it as a literal value. Notice that if you run EXPLAIN on the same query multiple times, the number of rows read changes each time. This estimate can even be totally inaccurate, as there have been bugs in EXPLAIN from time to time.
Another way to measure query performance is SHOW SESSION STATUS LIKE 'Handler%' as you test the query. This will tell you accurate counts of how many times the SQL layer made requests for individual rows to the storage engine layer. For examples, see my presentation, SQL Query Patterns, Optimized.
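The pattern is roughly this (the SELECT in the middle is just a placeholder for the query you are testing):

FLUSH STATUS;                                   -- zero the session counters
SELECT * FROM orders WHERE customer_id = 42;    -- your query under test (hypothetical example)
SHOW SESSION STATUS LIKE 'Handler%';            -- Handler_read_* shows how many rows the storage engine was asked for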
There's also an issue of whether the rows requested were already in the buffer pool (I'm assuming you use InnoDB), or did the query have to read them from disk, incurring I/O operations. A small number of rows read from disk can be orders of magnitude slower than a large number of rows read from RAM. This doesn't necessarily account for your case, but it points out that such a scenario can occur, and "rows read" doesn't tell you if the query caused I/O or not. There might even be multiple I/O operations for a single row, because of InnoDB's multi-versioning.
Insight into the difference between logical row requests and physical I/O reads is harder to get. In Percona Server, enhancements to the slow query log include the count of InnoDB I/O operations per query.
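At a coarser level, you can at least watch the server-wide InnoDB counters while you test (a rough sketch; these are global, not per-query, numbers):

SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- Innodb_buffer_pool_read_requests = logical page reads; Innodb_buffer_pool_reads = reads that had to go to disk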
I realize this is a sort of meta-programming question, but I'm assuming there are enough experienced people here to give a decent answer.
I was just building a query again, to retrieve some data from a table.
SELECT pl.field1, pl.field2
FROM table pl
LEFT JOIN table2 dp ON pl.field1 = dp.field1
WHERE dp.field1 IS NULL
Executing this query took ages (1800+ seconds).
After I got sick of waiting, and made the effort to EXPLAIN the query, it turned out that a full table scan was done.
I created an index on dp.field1 and the query was almost instant thereafter; creating that index took less than a second.
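Concretely, the fix was a single statement along these lines (column names as in the query above; the index name is arbitrary):

CREATE INDEX idx_dp_field1 ON table2 (field1);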
Judging from the EXPLAIN, this wasn't too difficult to determine. Why can't, or won't, MySQL do this automatically? Spending just a second to create that index will make the query instant, so MySQL could theoretically create a temporary index, use it to do the query and then remove it again, which would still be orders of magnitude faster than the alternative.
I'm expecting the usual answers of 'to make sure you design a good schema' or 'mysql just does what you tell it to do', but I'm wondering if there might be a technical reason why this is a bad idea.
For columns with low cardinality it is not a good idea to use a B-tree index. B-trees degenerate for low cardinalities and can in fact increase query time compared to a full table scan.
So always creating a B-tree index automatically is not a good idea. At the very least the engine would have to consider cardinality too, and probably several other factors as well.
Quite simply - because the idea doesn't really scale using the current design of RDBMS engines.
It's okay for a single user, but databases are designed to support many concurrent users. Having every user's query also run a speculative optimization step ("can I speed up this query by creating an index?"), and then create that index - which in some circumstances is a very expensive operation - would become slow at any degree of scale. Making the index "single use" would waste both computation time and disk space, but keeping lots of permanent indexes would in turn slow down the query optimizer, which has to consider every candidate index for a given query, and would also slow down data modification operations.
Admittedly, on modern hardware these concerns are much less significant - the basic design of RDBMS engines dates back to the days when disk space was expensive, CPUs were several orders of magnitude slower, and memory was an unimaginable luxury.
I'm only speaking for MySQL because there may be a database system out there that automatically modifies your database design.
The simple answer is, MySQL simply does what you tell it to do.
MySQL cannot predict the future. Only you can. You know much more about your data than MySQL does. MySQL keeps some statistics, but it's guessing the best way to execute your query on very sparse information (that is sometimes outdated) before it actually tries to do so. Once it starts executing, it doesn't change its plan, no matter how wrong the guess was.
The methods that it uses to guess are all very well documented. It's our job to provide the indexes that will provide the most benefit, and even, at times, hint that it should use those indexes.
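For example, a hint can be as simple as this (a rough sketch; the table, column, and index names are hypothetical):

SELECT o.id, o.total
FROM orders o FORCE INDEX (idx_orders_customer)
WHERE o.customer_id = 42;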
If you tell MySQL to perform a query that requires a table scan, it assumes you know that it's going to do a table scan, because it told you in its documentation that it would. It simply obeys.
Database systems that don't allow the DBA to make decisions don't scale well. There are always tradeoffs to be made, and you're the one to make them. MySQL is a hammer, not a carpenter.
My application is using JPA with Hibernate and I see that Hibernate generates some interesting SQL queries with a lot of joins in my log files. The application does not have a lot of users right now and I am worried that some of the queries being generated by Hibernate are going to cause problems when the database grows in size.
I have run some of the SQL queries generated by Hibernate through the EXPLAIN command to look at the query plans that are generated.
Is the output of EXPLAIN dependent on the size of the database? When my database grows in size will the query planner generate different plans for the same SQL queries?
At what point in the development / deployment cycle should I be looking at the query plans for SQL queries generated by Hibernate? When is the right time to use EXPLAIN?
How can the output of EXPLAIN be used to determine whether a query will become a problem, when the database is so small that every query, no matter how complex-looking, runs in under 0.5 seconds?
I am using Postgres 9.1 as the database for my application but I am interested in the general answer to the above questions.
Actually, #ams, you are right in your comment - it is generally pointless to use EXPLAIN with tiny amounts of data.
If a table only has 10 rows then it's quite likely all in one page and it costs (roughly) the same to read one row as all 10. Going to an index first and then fetching the page will be more expensive than just reading the lot and ignoring what you don't want. PostgreSQL's planner has configured costs for things like index reads, table reads, disk accesses vs cache accesses, sorting etc. It sizes these according to the (approximate) size of the tables and distribution of values within them. What it doesn't do (as of the pending 9.2 release) is account for cross-column or cross-table correlations. It also doesn't offer manual hints that let you override the planner's choices (unlike MS-SQL or Oracle).
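As a rough illustration (PostgreSQL; the table and column names are made up), you can inspect those cost knobs and compare the planner's estimates with what actually happens:

SHOW seq_page_cost;       -- planner cost of a sequential page read (default 1.0)
SHOW random_page_cost;    -- planner cost of a random page read (default 4.0)
EXPLAIN ANALYZE SELECT * FROM invoices WHERE client_id = 42;   -- estimated cost next to actual time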
Each RDBMS' planner has different strengths and weaknesses but I think it's fair to say that MySQL's is the weakest (particularly in older releases).
So - if you want to know how your system will perform with 100 concurrent users and billions of rows you'll want to generate test data and load for a sizeable fraction of that. Worse, you'll want to have roughly the same distribution of values too. If most clients have about 10 invoices but a few have 1000 then that's something your test data will need to reflect. If you need to maintain performance across multiple RDBMS then repeat testing across all of them.
This is all separate from the overall performance of the system of course, which depends on the size and capabilities of your server vs its required load. A system can cope with a steady increase in load and then suddenly you will see performance drop rapidly as cache sizes are exceeded etc.
HTH
1. Is the output of EXPLAIN dependent on the size of the database? When my database grows in size, will the query planner generate different plans for the same SQL queries?
It all depends on your data and the statistics about that data. Many performance problems occur because of a lack of statistics, e.g. when somebody forgot to ANALYZE or turned autovacuum (which includes analyze) off.
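A minimal sketch of keeping the statistics fresh (the table name is a placeholder):

ANALYZE invoices;     -- refresh planner statistics for one table
SHOW autovacuum;      -- check that autovacuum (which also runs analyze) is switched on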
2. At what point in the development / deployment cycle should I be looking at the query plans for SQL queries generated by Hibernate? When is the right time to use EXPLAIN?
Hibernate has a habit of sending lots and lots of queries to the database, even for simple joins. Turn your query log on and keep an eye on it. Later on, you could automatically run EXPLAIN on all queries from your log.
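For example, in PostgreSQL you can turn on statement (and plan) logging like this; the settings go in postgresql.conf, or can be set per session by a superuser:

SET log_min_duration_statement = 0;      -- log every statement together with its duration
LOAD 'auto_explain';                     -- optionally log plans as well
SET auto_explain.log_min_duration = '100ms';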
3. How can the output of EXPLAIN be used to determine whether a query will become a problem, when the database is so small that every query, no matter how complex-looking, runs in under 0.5 seconds?
You can't, because it all depends on the data. When 95% of your users are male, an index on gender won't be used when searching for a man; when you're looking for a woman, the index makes sense and will be used. A partial index covering only the rows where gender = female is even better: it's useless to index something that will never benefit from an index, and the partial index will be much smaller.
The only thing you can do to predict the usage of indexes is to test with SET enable_seqscan = off; that will show whether it is possible to use some index, but that's all.
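A sketch of both points, with a hypothetical users table (PostgreSQL syntax):

CREATE INDEX users_female_idx ON users (id) WHERE gender = 'female';   -- partial index, covers only the rare rows

SET enable_seqscan = off;   -- only to check whether an index *could* be used
EXPLAIN SELECT * FROM users WHERE gender = 'female';
SET enable_seqscan = on;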
I'm curious to hear this from folks with more DBA insight: what performance implications does an application face when you run a query like:
select * from some_large_table;
You have to do a full table scan since no index is being hit, and I believe if we're talking O notation, we're speaking O(N) here where N is the size of the table. Is this typically considered not optimal behavior? What if you really do need everything from the table at certain times? Yes we have tools such as pagination etc, but I'm talking strictly from a database perspective here. Is this type of behavior normally frowned upon?
What happens if you don't specify columns is that the DB engine has to query the catalog (the master table data) for the column list. That lookup is really fast, but it causes a minor performance hit. As long as you're not doing a sloppy SELECT * with a JOIN statement or nested queries, you should be fine. However, note the small performance impact of letting the DB engine run a query to find the columns.
The MySQL server opens a server-side cursor to read that table. The client of the query may read none or all of the records, and performance for the client will depend only on the number of records it actually fetched. Also, on the server side this query can actually be faster than a query with some conditions, since the conditional query also involves index reads. Only if the client fetches all records is it equivalent to a full table scan.
Selecting more columns than you need (SELECT *) is always bad. Don't do more than you have to.
If you're selecting from the whole table, it doesn't matter if you have an index.
Another issue you're going to run into is how you want to lock the table. If this is a busy application you might not want to prevent locking entirely, because of the inconsistent data that might be returned; but if you lock too tightly it could slow the query down further. O(n) is considered acceptable in most computer science applications. However, in databases we measure in time and number of reads/writes. This is a huge number of reads and will probably take a long time to execute, so it's unacceptable.
I'm writing an app with a MySQL table that indexes 3 columns. I'm concerned that after the table reaches a significant number of records, the time to save a new record will become slow. Please advise on how best to approach indexing these columns.
UPDATE
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
Index what seems the most logical (that should hopefully be obvious, for example, a customer ID column in the CUSTOMERS table).
Then run your application and collect statistics periodically to see how the database is performing. RUNSTATS on DB2 is one example; I would hope MySQL has a similar tool.
When you find some oft-run queries doing full table scans (or taking too long for other reasons), then, and only then, should you add more indexes. It does little good to optimise a once-a-month-run-at-midnight query so it can finish at 12:05 instead of 12:07. However, it's a huge improvement to reduce a customer-facing query from 5 seconds down to 2 seconds (that's still too slow, customer-facing queries should be sub-second if possible).
More indexes tend to slow down inserts and speed up queries. So it's always a balancing act. That's why you only add indexes in specific response to a problem. Anything else is premature optimization and should be avoided.
In addition, revisit the indexes you already have periodically to see if they're still needed. It may be that the queries that caused you to add those indexes are no longer run often enough to warrant it.
To be honest, I don't believe indexing three columns on a table will cause you to suffer unless you plan on storing really huge numbers of rows :-) - indexing is pretty efficient.
After your edit which states:
I am indexing a point_value, the user_id, and an event_id, all required for client-facing purposes. For an instance such as scoring baseball runs by player id and game id. What would be the cost of inserting about 200 new records a day, after the table holds records for two seasons, say 72,000 runs, and after 5 seasons, maybe a quarter million records? Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
My response is that 200 records a day is an extremely small load for a database; you definitely won't have anything to worry about with those three indexes.
Just this week, I imported a day's worth of transactions into one of our database tables at work and it contained 2.1 million records (we get at least one transaction per second across the entire day from 25 separate machines). And it has four separate composite keys, which is somewhat more intensive than your three individual keys.
Now granted, that's on a DB2 database but I can't imagine IBM are so much better than the MySQL people that MySQL can only handle less than 0.01% of the DB2 load.
I made some simple tests using my real project and real MySql database.
My results: adding an average index (1-3 columns per index) to a table makes inserts about 2.1% slower. So if you add 20 indexes, your inserts will be roughly 40-50% slower, but your selects will be 10-100 times faster.
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
Nothing for SELECT queries, though updates and especially inserts will be orders of magnitude slower - which you won't really notice before you start inserting a LOT of rows at the same time...
In fact at a previous employer (single user, desktop system) we actually DROPPED indexes before starting our "import routine" - which would first delete all records before inserting a huge number of records into the same table...
Then when we were finished with the insertion job we would re-create the indexes...
We would save 90% of the time for this operation by dropping the indexes before starting the operation and re-creating the indexes afterwards...
This was a Sybase database, but the same principle applies to any database...
So be careful with indexes, they're FAR from "free"...
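In MySQL terms, the drop-then-recreate pattern we used looks roughly like this (the table and index names are hypothetical):

ALTER TABLE transactions DROP INDEX idx_transactions_account;
-- ... bulk insert the huge batch of records here ...
ALTER TABLE transactions ADD INDEX idx_transactions_account (account_id);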
Only for illustration, but I'm expecting to insert between 25 and 200 records a day.
With that kind of insertion rate, the cost of indexing an extra column will be negligible.
Without more details about the expected usage of the data in your table, worrying about indexes slowing you down smells a lot like premature optimization, which should be avoided.
If you are really concerned about it, then set up a test database and simulate performance in the worst-case scenarios. A test proving that it is or is not a problem will probably be much more useful than trying to guess and worry about what may happen. If there is a problem, you will be able to use your test setup to try different methods of fixing the issue.
The index is there to speed up retrieval of data, so the question should be "What data do I need to access quickly?". Without the index, some queries will do a full table scan (go through every row in the table) in order to find the data you want. With a significant number of records this will be a slow and expensive operation. If it is for a report that you run once a month then maybe that's okay; if it is for frequently accessed data then you will need the index to give your users a better experience.
If you find the speed of the insert operations are slow because of the index then this is a problem you can solve at the hardware level by throwing more CPUs, RAM and better hard drive technology at the problem.
What Pax said.
For the dimensions you describe, the only significant concern I can imagine is "What is the cost of failing to index multiple db columns?"
I once had a MySQL database table containing 25 million records, which made even a simple COUNT(*) query take minutes to execute. I ended up making partitions, separating them into a couple of tables. What I'm asking is: is there any pattern or design technique for handling this kind of problem (a huge number of records)? Are MSSQL or Oracle better at handling lots of records?
P.S.
The COUNT(*) problem stated above is just an example case. In reality the app does CRUD functionality and some aggregate queries (for reporting), but nothing really complicated. It's just that it takes quite a while (minutes) to execute some of these queries because of the table volume.
See Why MySQL could be slow with large tables and COUNT(*) vs COUNT(col)
Make sure you have an index on the column you're counting. If your server has plenty of RAM, consider increasing MySQL's buffer size. Make sure your disks are configured correctly -- DMA enabled, not sharing a drive or cable with the swap partition, etc.
What you're asking with "SELECT COUNT(*)" is not easy.
In MySQL, the MyISAM non-transactional engine optimises this by keeping a record count, so SELECT COUNT(*) will be very quick.
However, if you're using a transactional engine, SELECT COUNT(*) is basically saying:
Exactly how many records exist in this table in my transaction?
To do this, the engine needs to scan the entire table; it probably knows roughly how many records exist in the table already, but to get an exact answer for a particular transaction, it needs a scan. This isn't going to be fast using MySQL InnoDB, and it's not going to be fast in Oracle, or anything else. The whole table MUST be read (excluding things stored separately by the engine, such as BLOBs).
Having the whole table in RAM will make it a bit faster, but it's still not going to be fast.
If your application relies on frequent, accurate counts, you may want to make a summary table which is updated by a trigger or some other means.
If your application relies on frequent, less accurate counts, you could maintain summary data with a scheduled task (which may impact performance of other operations less).
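A rough sketch of the trigger-maintained summary table (MySQL; every name below is hypothetical):

CREATE TABLE row_counts (
  table_name VARCHAR(64) PRIMARY KEY,
  row_count  BIGINT NOT NULL
);
INSERT INTO row_counts VALUES ('big_table', (SELECT COUNT(*) FROM big_table));   -- seed it once

CREATE TRIGGER big_table_count_ins AFTER INSERT ON big_table
FOR EACH ROW UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'big_table';

CREATE TRIGGER big_table_count_del AFTER DELETE ON big_table
FOR EACH ROW UPDATE row_counts SET row_count = row_count - 1 WHERE table_name = 'big_table';

Reading the count is then instant, at the price of a little extra work on every insert and delete.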
Many performance issues around large tables relate to indexing problems, or a lack of indexing altogether. I'd definitely make sure you are familiar with indexing techniques and the specifics of the database you plan to use.
With regards to your slow COUNT(*) on the huge table, I would assume you were using the InnoDB table type in MySQL. I have some tables with over 100 million records using MyISAM under MySQL and the COUNT(*) is very quick.
With regards to MySQL in particular, there are even slight indexing differences between InnoDB and MyISAM tables which are the two most commonly used table types. It's worth understanding the pros and cons of each and how to use them.
What kind of access to the data do you need? I've used HBase (based on Google's BigTable) loaded with a vast amount of data (~30 million rows) as the backend for an application which could return results within a matter of seconds. However, it's not really appropriate if you need "real time" access - i.e. to power a website. Its column-oriented nature is also a fairly radical change if you're used to row-oriented DBMS.
Is count(*) on the whole table actually something you do a lot?
InnoDB will have to do a full table scan to count the rows, which is obviously a major performance issue if counting all of them is something you actually want to do. But that doesn't mean that other operations on the table will be slow.
With the right indexes, MySQL will be very fast at retrieving data from tables much bigger than that. The problem with indexes is that they can hurt insert speeds, particularly for large tables as insert performance drops dramatically once the space required for the index reaches a certain threshold - presumably the size it will keep in memory. But if you only need modest insert speeds, MySQL should do everything you need.
Any other database will have similar tradeoffs between retrieve speed and insert speed; they may or may not be better for your application. But I would look first at getting the indexes right, and maybe rewriting your queries, before you try other databases. For what it's worth, we picked MySQL originally because we found it performed best.
Note that MyISAM tables in MySQL store the total row count of the table. They maintain this because it's useful to the optimiser in some cases, but a side effect is that COUNT(*) on the whole table is really fast. That doesn't necessarily mean they're faster than InnoDB at anything else.
I answered a similar question in This Stackoverflow Posting in some detail, describing the merits of the architectures of both systems. To some extent it was done from a data warehousing point of view but many of the differences also matter on transactional systems.
However, 25 million rows is not a VLDB and if you are having performance problems you should look to indexing and tuning. You don't need to go to Oracle to support a 25 million row database - you've got about 3 orders of magnitude to go before you're truly in VLDB territory.
You are asking for a book's worth of an answer, so I propose you get a good book on databases. There are many.
To get you started, here are some database basics:
First, you need a great data model based not just on what data you need to store but on usage patterns. Good database performance starts with good schema design.
Second, place indices on columns based upon expected lookup AND update needs, as update performance is often overlooked.
Third, don't put functions in WHERE clauses if at all possible (see the sketch after this list).
Fourth, use an -ahem- RDBMS engine that is of quality design. I would respectfully submit that while it has improved greatly in the recent past, MySQL does not qualify. (Apologies to those who wish to argue it has finally made the grade in recent times.) There is no longer any need to choose between high price and quality; Postgres (aka PostgreSQL) is available open-source and is truly fantastic - and has all the plug-ins available to meet your needs.
Finally, learn what you are asking a database engine to do - gain some insight into internals - so you can better judge what kinds of things are expensive and why.
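A quick illustration of the third point (a hypothetical orders table; the first form uses a MySQL date function):

-- hard to index: the function hides the column from any index on order_date
SELECT * FROM orders WHERE YEAR(order_date) = 2023;

-- index-friendly rewrite: compare the bare column against a range
SELECT * FROM orders
WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01';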
I'm going to second #Mark Baker, and say that you need to build indices on your tables.
For queries other than the one you posted, you should also be aware that using constructs such as IN() is faster than a series of OR conditions. There are lots of little steps you can take to speed up individual queries.
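For example (table and column names are made up):

SELECT * FROM players WHERE team_id IN (1, 2, 3);
-- rather than
SELECT * FROM players WHERE team_id = 1 OR team_id = 2 OR team_id = 3;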
Indexing is key to performance with this number of records, but how you write the queries can make a big difference as well. Specific performance tuning methods vary by database, but in general: avoid returning more records or fields than you actually need, make sure all join fields are indexed (as well as common WHERE clause fields), and avoid cursors (although I think this is less of an issue in Oracle than SQL Server; I don't know about MySQL).
Hardware can also be a bottleneck especially if you are running things besides the database server on the same machine.
Performance tuning is a very technical subject and can't really be answered well in a format like this. I suggest you get a performance tuning book and read it. Here is a link to one for MySQL:
http://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/0596101716