I have a table whose compound clustered index (int, DateTime) was 99% fragmented.
After defragmenting and making sure that statistics were updated, I still get the same response time when I run this query:
SELECT *
FROM myTable
WHERE myIntField = 1000
AND myDateTimeField >= '2012-01-01'
AND myDateTimeField <= '2012-12-31 23:59:59.999'
Well, I do see a small response time improvement (maybe 5-10%), but I really expected my queries to be dramatically faster after that index rebuild and stats update.
The estimated execution plan is:
SELECT Cost: 0%
Clustered Index Seek (Clustered)[MyTable].[IX_MyCompoundIndex] Cost: 100%
Is this because the index is a clustered index? Am I missing something?
You should avoid SELECT * - probably even if you do need all of the columns in the table (which is rare).
Also, you are doing something very dangerous here. Did you know that your end range rounds up? With the datetime type, '2012-12-31 23:59:59.999' rounds to '2013-01-01 00:00:00.000', so you may be including data from midnight on 2013-01-01. Try:
AND myDateTimeColumn >= '20120101'
AND myDateTimeColumn < '20130101'
(This won't change performance, but it is easier to generate and is guaranteed to be accurate no matter what the underlying data type is.)
To eliminate network delays from your analysis of query time, you could consider SQL Sentry Plan Explorer, which lets you generate an actual plan by running the query against the server but discards the results, so they aren't an interfering factor.
Disclaimer: I work for SQL Sentry.
The execution time of the query is going to be spent reading enough pages of the index's B-tree to generate the result. Defragmenting the index puts adjacent rows together, reducing the number of pages that need to be read. It can also help by turning a largely random I/O pattern into a sequential one.
If your rows are wide and you don't get many rows per page, you won't see much reduction in the number of pages read.
If your index fill factor is low, you won't get as many rows per page.
If your pages are already in cache, you won't see any sequential vs. random I/O benefit.
If you have spare CPU capacity on the machine, you may benefit from using page compression. This essentially trades more CPU for less I/O.
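If you want to experiment with that, here is a minimal T-SQL sketch, assuming SQL Server, the default dbo schema, the index from the plan above, and an edition that supports page compression:
-- rebuild the clustered index with page compression
ALTER INDEX IX_MyCompoundIndex ON dbo.MyTable
REBUILD WITH (DATA_COMPRESSION = PAGE);
Rebuilding with this option also defragments the index again, so compare timings against a freshly rebuilt uncompressed copy to isolate the effect of compression.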
Related
For example, there is a table named paper, and I execute SQL like this:
SELECT paper.user_id, paper.name, paper.score FROM paper WHERE user_id IN (201, 205, 209, ...)
I observed that when this statement is executed, the index is only used when the number of values in the IN list is below a certain number, and that number is not fixed.
For example, when the total number of rows in the table is 4000 and the cardinality is 3939, the IN list must have fewer than 790 values for MySQL to use the index.
(Looking at MySQL EXPLAIN: if < 790, type = range; if > 790, type = ALL.)
When the total number of rows in the table is 1300000 and the cardinality is 1199166, the IN list must have fewer than 8500 values for MySQL to use the index.
The result of this experiment is very strange to me.
I imagined that to execute this IN query, the engine would first find IN(max) and IN(min), locate the pages where IN(max) and IN(min) are, and then exclude the pages before IN(min) and the pages after IN(max). That would definitely be faster than performing a full table scan.
Then, my test data can be summarized as follows:
Data in the table: 1 to 1300000
Data in the IN list: 900000 to 920000
My question is: in a table with 1300000 rows of data, why does MySQL decide that when the IN list has more than 8500 values, it should not use the index?
MySQL version 5.7.20
In fact, this magic number is 8452. When the total number of rows in my table is 600000, it is 8452. When the total number of rows is 1300000, it is still 8452. Following is my test screenshot.
When the IN list has 8452 values, this query only takes 0.099s.
Then view the execution plan: type = range.
If I increase the IN list from 8452 to 8453 values, the query takes 5.066s, even if the value I add is a duplicate.
Then view the execution plan: type = ALL.
This is really strange. It means that if I first execute the query with 8452 IN values and then execute another query for the remaining values, the total time is much less than directly executing the query with 8453 IN values.
Can anyone debug the MySQL source code to see what happens in this process?
Thanks very much.
Great question and nice find!
The query planner/optimizer has to decide whether it's going to seek the pages it needs to read, or start reading many more and scan for the ones it needs. The seek strategy is more memory and especially CPU intensive, while the scan is probably significantly more expensive in terms of I/O.
The bigger the table, the less attractive the seek strategy becomes. For a large table, a bigger part of the nonclustered index used for the seek has to come from disk, memory pressure rises, and the potential for sequential reads shrinks the longer the seek takes. Therefore the rows/results ratio up to which a seek is considered drops as the table size rises.
If this is a problem, there are a few things you could try to tune. But when this is a problem for you in production, it might be the right time to consider a server upgrade, optimize the queries and software involved, or simply adjust expectations.
'Harden' or (re)enforce the query plans you prefer
Tweak the engine (when this is a problem that affects most tables server/database settings maybe can be optimized)
Optimize nonclustered indexes
Provide query hints (see the sketch after this list)
Alter tables and datatypes
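As an example of the query-hints option, here is a hedged sketch in MySQL, using the paper table from the question and a hypothetical index name idx_user_id:
SELECT paper.user_id, paper.name, paper.score
FROM paper FORCE INDEX (idx_user_id) -- idx_user_id: hypothetical index on paper(user_id)
WHERE user_id IN (201, 205, 209);    -- substitute your full IN list
FORCE INDEX tells the optimizer to treat a table scan as very expensive, so only use it after confirming the index really is faster at your IN-list sizes.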
It is usually folly to do a query in 2 steps. That framework seems to be fetching ids in one step, then fetching the real stuff in a second step.
If the two queries are combined into a single one (with a JOIN), the Optimizer is mostly forced to do the random lookups.
"Range" is perhaps always the "type" for IN lookups; don't read anything into it. Whether IN looks at min and max to try to minimize disk hits -- I would expect this to be a 'recent' optimization. (I have not seen it in the Changelogs.)
Are those UUIDs with the dashes removed? They do not scale well to huge tables.
"Cardinality" is just an estimate. ANALYZE TABLE forces the recomputation of such stats. See if that changes the boundary, etc.
I have a query, as follows:
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/06/30 23:59:00'
It uses the index effectively. Screenshot of the EXPLAIN result:
There are 154500 records in date range '2016/06/01 00:00:00' AND '2016/07/26 23:59:00'.
But when I increase the date range as follows,
SELECT SUM(principalBalance) as pos, COUNT(id) as TotalCases,
SUM(amountPaid) as paid, COUNT(amountPaid) as paidCount,
SUM(amountPdc) as Pdc, SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails USE INDEX (updatedAt_caseDetails)
WHERE updatedAt BETWEEN '2016/06/01 00:00:00' AND '2016/07/30 23:59:00'
then it no longer uses the index. Screenshot of the EXPLAIN result:
There are 3089464 records in date range '2016/06/01 00:00:00' AND '2016/07/30 23:59:00'
After increasing the date range the query no longer uses the index, so it gets very slow, even though I am forcing it to use the index. I am not able to figure out why this is happening, since neither the query nor the indexing has changed. Can you please help me understand why this is happening?
Don't use USE INDEX or FORCE INDEX. This will slow down the query when most of the table is being accessed. In particular, the Optimizer will decide, rightly, to do a table scan if the index seems to point to more than about 20% of the rows. Using an index involves bouncing back and forth between the index and the data, whereas doing a table scan smoothly reads the data sequentially (albeit having to skip over many of the rows).
There is another solution to the real problem. I assume you are building "reports" summarizing data from a large Data Warehouse table?
Instead of always starting with raw data ('Fact' table), build and maintain a "Summary Table". For your data, it would probably have 1 row per day. Each night you would tally the SUMs and COUNTs for the various things for the day. Then the 'report' would sum the sums and sum the counts to get the desired tallies for the bigger date range.
More discussion: http://mysql.rjweb.org/doc.php/summarytables
Your 'reports' will run more than 10 times as fast, and you won't even be tempted to FORCE INDEX. After all, 60 rows should be a lot faster than 3089464.
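A minimal sketch of that idea, assuming a hypothetical summary table named caseDetails_daily and reusing the column names from the query above (adjust the data types to your schema):
-- caseDetails_daily is a hypothetical summary table: one row per day
CREATE TABLE caseDetails_daily (
    `day` DATE NOT NULL PRIMARY KEY,
    pos DECIMAL(18,2), totalCases INT,
    paid DECIMAL(18,2), paidCount INT,
    pdc DECIMAL(18,2), ptp DECIMAL(18,2), ptpCount INT
);
-- nightly roll-up of yesterday's rows
INSERT INTO caseDetails_daily
SELECT DATE(updatedAt), SUM(principalBalance), COUNT(id),
       SUM(amountPaid), COUNT(amountPaid),
       SUM(amountPdc), SUM(amountPtp), COUNT(amountPtp)
FROM caseDetails
WHERE updatedAt >= CURDATE() - INTERVAL 1 DAY
  AND updatedAt < CURDATE()
GROUP BY DATE(updatedAt);
-- the report then sums roughly 60 small rows instead of millions
SELECT SUM(pos), SUM(totalCases), SUM(paid), SUM(paidCount),
       SUM(pdc), SUM(ptp), SUM(ptpCount)
FROM caseDetails_daily
WHERE `day` >= '2016-06-01' AND `day` < '2016-07-31';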
less time (more likely)
Using an index might be inferior even when disk reads would be fewer (see below). Most disk drives support bulk read. That is, you request data from a certain block/page and from the n following pages. This is especially fast for almost all rotating disks, tapes and all other hard drives where accessing data in a sequential manner is more efficient than random access (like ... really more efficient).
Essentially you gain a time advantage by sequential read versus random access.
fewer disk reads (less likely)
Using an index is effective when you actually gain speed/efficiency. An index is good when it reduces the number of disk reads significantly and therefore needs less time. When reading the index plus reading the rows it points to results in almost as many disk reads as reading the whole table, using an index is probably unwise.
This will probably happen if your data is spread out enough (in respect to search criteria), so that you most likely have to read (almost) all pages/blocks anyway.
ideas for a fix
if you only access your table in this way (that is, the date is the most important search criterion), it might very much be worth the time to order the data on disk. I believe MySQL might provide such a feature ... (OPTIMIZE TABLE appears to do some of this)
this would decrease the query duration when the index is used (and make the index more likely to be used)
alternatives
see post from Rick James (essentially: store aggregates instead of repeatedly calculating them)
It has been a long time since I asked this question, but now I have a better solution that is working really smoothly for me. I hope my answer may help someone.
I used partitioning, and observed that the performance of the query is really high now. I altered the table by creating range partitioning on the updatedAt column.
Range Partitioning
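For reference, a hedged sketch of what that range partitioning could look like (the partition names are hypothetical, and MySQL requires the partitioning column to be part of every unique key, so the primary key may need to include updatedAt):
ALTER TABLE caseDetails
PARTITION BY RANGE (TO_DAYS(updatedAt)) (
    PARTITION p2016_06 VALUES LESS THAN (TO_DAYS('2016-07-01')),
    PARTITION p2016_07 VALUES LESS THAN (TO_DAYS('2016-08-01')),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);
With this layout, a WHERE clause on updatedAt lets MySQL prune to just the partitions that overlap the requested date range.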
I have the following query:
SELECT `Product_has_ProductFeature`.`productId`, `Product_has_ProductFeature`.`productFeatureId`, `Product_has_ProductFeature`.`productFeatureValueId`
FROM `Product_has_ProductFeature`
WHERE `productId` IN (...'18815', '18816', '18817', '18818', '18819', '18820', '18821', '18822', '18823', '18824', '18825', '18826', '18827', '18828', '18829', '18830', '18831', '18832', '18833', '18834'..)
I have around 50000 productIds. The execution takes 20 seconds. How can I make it faster?
This is more of a comment.
Returning 50,000 rows can take time, depending on your application, the row size, the network, and how busy the server is.
When doing comparisons, you should be sure the values are of the same type. So, if productId is numeric, then drop the single quotes.
If the values are all consecutive, then eschew the list and just do:
where productid >= 18815 and productid <= 18834
Finally, an index on productid is usually recommended. However, in some cases, an index can make performance worse. This depends on the size of your data and the size of memory available for the page and data cache.
MySQL implements IN efficiently (using a binary tree). It is possible that much of the overhead is in compiling the query rather than executing it. If you have the values in a table, a join is probably going to be more efficient.
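A hedged sketch of that last suggestion, loading the ids into a hypothetical temporary table and joining against it, with the column names from the question:
-- tmp_product_ids is a hypothetical scratch table for the id list
CREATE TEMPORARY TABLE tmp_product_ids (productId INT PRIMARY KEY);
INSERT INTO tmp_product_ids (productId) VALUES (18815), (18816), (18817); -- ... batch-insert the rest
SELECT phpf.productId, phpf.productFeatureId, phpf.productFeatureValueId
FROM Product_has_ProductFeature AS phpf
JOIN tmp_product_ids AS t ON t.productId = phpf.productId;
This also sidesteps the cost of parsing a 50000-value IN list on every execution.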
I'm using MySQL and have a situation where a given query calculates the revenue from a table of transactions. The selected transactions can span 1 day, 1 week or 1 month.
SELECT
revenue formula
FROM
product inner join
account on key_condition1 inner join
transaction on key_condition2
WHERE
tx.ENTRYDATE >= '2013-06-17 00:00:00' AND tx.ENTRYDATE < '2013-07-24 00:00:00'
GROUP BY product
When I supply one week in the WHERE clause, the query runs in 3-4 seconds. When I want the entries for one month, the query completes in 300-400 seconds, if ever.
The database we're talking about is quite big. It has about 3.5 million transactions.
At first I thought that the sheer number of transactions was causing the issue, but it doesn't seem so. Per week there are 110363 entries and per month 576910. My other idea (which seems very likely) is that, because of the join, the time can grow exponentially even though the join is not based on entry dates.
My question is: is the join "at fault" for the exponential growth? For the moment the join is unavoidable, but this could be fixed with some database refactoring.
Thanks for your opinion.
The results from EXPLAIN:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,LOANPRODUCT,index,PRIMARY,PRIMARY,98,NULL,1,
1,SIMPLE,LOANACCOUNT,ref,"PRIMARY,LOANACCOUNT_PRODUCTTYPEKEY",LOANACCOUNT_PRODUCTTYPEKEY,99,LOANPRODUCT.ENCODEDKEY,16559,"Using where; Using index"
1,SIMPLE,LOANTRANSACTION,ref,"LOANTRANSACTION_PARENTACCOUNTKEY,LOANTRANSACTION_REVERSALTRANSACTIONKEY,LOANTRANSACTION_ENTRYDATE",LOANTRANSACTION_PARENTACCOUNTKEY,99,LOANACCOUNT.ENCODEDKEY,7,"Using where"
There could be a couple of big reasons here:
indexing
waiting for other transactions
memory constraints
caching issue
Below is what I think about each:
Indexing
I don't think it's a completely missing index, since you are retrieving 5x more rows at 100x the time cost. If that were the issue, the scaling would be more or less linear with the number of rows. With no indexing, the scaling could even be better than linear if the query optimization is halfway decent. However, if you have conflicting indices, the optimizer will choose one or the other based on what it thinks is best. It's likely that it chose one for the 3-4 second case and the other for the 300-400 second case.
From your EXPLAIN result, it looks like you have conflicting indices. I'm going to guess that LOANTRANSACTION_PARENTACCOUNTKEY contains key_condition2, and LOANTRANSACTION_ENTRYDATE contains ENTRYDATE. Neither one has the other column. Thus, the optimizer has to choose one or the other. You should have an index that includes both. I would put ENTRYDATE first.
I am also going to guess that this EXPLAIN is from the slower query, since it's not using an index on LOANTRANSACTION to filter by ENTRYDATE. Hence, MySQL needs to read all those rows just to see whether they are in the range or not.
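If that guess is right, a hedged sketch of the combined index, assuming the underlying columns are named ENTRYDATE and PARENTACCOUNTKEY (adjust to the real column names; the index name is a placeholder):
ALTER TABLE LOANTRANSACTION
    ADD INDEX ix_entrydate_parentaccountkey (ENTRYDATE, PARENTACCOUNTKEY);
This gives the optimizer a single index that covers both the join key and the date filter, so it no longer has to pick one or the other.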
Waiting for Others
This is likely if other transactions are modifying the data. Try reading uncommitted to see if it speeds up. If so, then this is your issue.
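One way to test that in MySQL is to relax the isolation level for the current session before re-running the query (for diagnosis only, since it allows dirty reads):
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- re-run the revenue query in this session and compare the timings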
Memory
When you run out of memory, then all sorts of things slow down dramatically. See if 1 month scales to 2 months linearly, and if 1 week scales to .5 week linearly.
Caching
If your data is not in the cache, then that data will need to come from the disk, which is ridiculously slow compared to memory. This could very likely be your issue. If you rerun the query, the second run should be significantly faster. If your memory isn't big enough to contain the relevant rows, then your query will always be slow. See if your memory should be able to hold all the relevant tables or not.
I currently have a summary table to keep track of my users' post counts, and I run SELECTs on that table to sort them by counts, like WHERE count > 10, for example. Now I know having an index on columns used in WHERE clauses speeds things up, but since these fields will also be updated quite often, would indexing provide better or worse performance?
If you have a query like
SELECT count(*) as rowcount
FROM table1
GROUP BY name
Then you cannot put an index on count, you need to put an index on the group by field instead.
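For instance, with the query above the index would go on the name column (idx_name is just a placeholder name):
ALTER TABLE table1 ADD INDEX idx_name (name);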
If you have a field named count
Then putting an index on that column may speed up this query, or it may make no difference at all:
SELECT id, `count`
FROM table1
WHERE `count` > 10
Whether an index on count will speed up the query really depends on what percentage of the rows satisfy the where clause. If it's more than 30%, MySQL (or any SQL for that matter) will refuse to use an index.
It will just stubbornly insist on doing a full table scan. (i.e. read all rows)
This is because using an index requires reading 2 files (1 index file and then the real table file with the actual data).
If you select a large percentage of rows, reading the extra index file is not worth it and just reading all the rows in order will be faster.
If only a few rows pass the test, using an index will speed up this query a lot.
Know your data
Using EXPLAIN SELECT will tell you which indexes MySQL has available, which one it picked, and (kind of, sort of, in a complicated kind of way) why.
See: http://dev.mysql.com/doc/refman/5.0/en/explain.html
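For example, for the count query above:
EXPLAIN SELECT id, `count` FROM table1 WHERE `count` > 10;
The key column of the output shows which index (if any) was chosen, and rows shows roughly how many rows MySQL expects to examine.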
Indexes in general provide better read performance at the cost of slightly worse insert, update and delete performance. Usually the tradeoff is worth it depending on the width of the index and the number of indexes that already exist on the table. In your case, I would bet that the overall performance (reading and writing) will still be substantially better with the index than without but you would need to run tests to know for sure.
It will improve read performance and worsen write performance. If the tables are MyISAM and you have a lot of people posting in a short amount of time you could run into issues where MySQL is waiting for locks, eventually causing a crash.
There's no way of really knowing that without trying it. A lot depends on the ratio of reads to writes, storage engine, disk throughput, various MySQL tuning parameters, etc. You'd have to setup a simulation that resembles production and run before and after.
I think it's unlikely that the write performance will be a serious issue after adding the index.
But note that the index won't be used anyway if it is not selective enough - if, for example, more than 10% of your users have count > 10, the fastest query plan might be to skip the index and just scan the entire table.