I have a query that takes about 18 seconds to finish:
THE QUERY:
SELECT YEAR(c.date), MONTH(c.date), p.district_id, COUNT(p.owner_id)
FROM commission c
INNER JOIN partner p ON c.customer_id = p.id
WHERE (c.date BETWEEN '2018-01-01' AND '2018-12-31')
AND (c.company_id = 90)
AND (c.source = 'ACTUAL')
AND (p.id IN (3062, 3063, 3064, 3065, 3066, 3067, 3068, 3069, 3070, 3071,
3072, 3073, 3074, 3075, 3076, 3077, 3078, 3079, 3081, 3082, 3083, 3084,
3085, 3086, 3087, 3088, 3089, 3090, 3091, 3092, 3093, 3094, 3095, 3096,
3097, 3098, 3099, 3448, 3449, 3450, 3451, 3452, 3453, 3454, 3455, 3456,
3457, 3458, 3459, 3460, 3461, 3471, 3490, 3491, 6307, 6368, 6421))
GROUP BY YEAR(c.date), MONTH(c.date), p.district_id
The commission table has around 2.8 million records, of which 860,000+ belong to the current year, 2018. The partner table currently has 8,600+ records.
RESULT
| `YEAR(c.date)` | `MONTH(c.date)` | district_id | `COUNT(p.owner_id)` |
|----------------|-----------------|-------------|---------------|
| 2018 | 1 | 1 | 19154 |
| 2018 | 1 | 5 | 9184 |
| 2018 | 1 | 6 | 2706 |
| 2018 | 1 | 12 | 36296 |
| 2018 | 1 | 15 | 13085 |
| 2018 | 2 | 1 | 21231 |
| 2018 | 2 | 5 | 10242 |
| ... | ... | ... | ... |
55 rows retrieved starting from 1 in 18 s 374 ms
(execution: 18 s 368 ms, fetching: 6 ms)
EXPLAIN:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | extra |
|----|-------------|-------|------------|-------|------------------------------------------------------------------------------------------------------|----------------------|---------|-----------------|------|----------|----------------------------------------------|
| 1 | SIMPLE | p | null | range | PRIMARY | PRIMARY | 4 | | 57 | 100 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | c | null | ref | UNIQ_6F7146F0979B1AD62FC0CB0F5F8A7F73,IDX_6F7146F09395C3F3,IDX_6F7146F0979B1AD6,IDX_6F7146F0AA9E377A | IDX_6F7146F09395C3F3 | 5 | p.id | 6716 | 8.33 | Using where |
DDL:
create table if not exists commission (
id int auto_increment
primary key,
date date not null,
source enum('ACTUAL', 'EXPECTED') not null,
customer_id int null,
transaction_id varchar(255) not null,
company_id int null,
constraint UNIQ_6F7146F0979B1AD62FC0CB0F5F8A7F73 unique (company_id, transaction_id, source),
constraint FK_6F7146F09395C3F3 foreign key (customer_id) references partner (id),
constraint FK_6F7146F0979B1AD6 foreign key (company_id) references companies (id)
) collate=utf8_unicode_ci;
create index IDX_6F7146F09395C3F3 on commission (customer_id);
create index IDX_6F7146F0979B1AD6 on commission (company_id);
create index IDX_6F7146F0AA9E377A on commission (date);
I noticed that if I remove the partner IN condition, the query takes only 3 s. I tried to replace it with something crazy like this:
AND (',3062,3063,3064,3065,3066,3067,3068,3069,3070,3071,3072,3073,3074,3075,3076,3077,3078,3079,3081,3082,3083,3084,3085,3086,3087,3088,3089,3090,3091,3092,3093,3094,3095,3096,3097,3098,3099,3448,3449,3450,3451,3452,3453,3454,3455,3456,3457,3458,3459,3460,3461,3471,3490,3491,6307,6368,6421,'
LIKE CONCAT('%,', p.id, ',%'))
and the result was about 5 s... great! But it's a hack.
Why does this query take such a long time to execute when I use the IN clause? Workarounds, tips, links, etc. are welcome. Thanks!
MySQL generally uses only one index per table in a query. For this query you need a compound index covering all aspects of the search. The constant ("=") aspects of the WHERE clause should come before range aspects, like this:
ALTER TABLE commission
DROP INDEX IDX_6F7146F0979B1AD6,
ADD INDEX IDX_6F7146F0979B1AD6 (company_id, source, date)
Here's what the Optimizer sees in your query.
Checking whether to use an index for the GROUP BY:
Functions (YEAR()) in the GROUP BY, so no.
Multiple tables (c and p) mentioned, so no.
For a JOIN, the Optimizer will (almost always) start with one table, then reach into the other. So, let's look at the two options:
If starting with p:
Assuming you have PRIMARY KEY(id), there is not much to think about. It will simply use that index.
For each row selected from p, it will then look into c, and any variation of this INDEX would be optimal.
c: INDEX(company_id, source, customer_id, -- in any order (all are tested "=")
date) -- last, since it is tested as a range
If starting with c:
c: INDEX(company_id, source, -- in any order (all are tested "=")
date) -- last, since it is tested as a range
-- slightly better:
c: INDEX(company_id, source, -- in any order (all are tested "=")
date, -- last, since it is tested as a range
customer_id) -- really last -- added only to make it "covering".
The Optimizer will look at "statistics" to crudely decide which table to start with. So, add all the indexes I suggested.
A "covering" index is one that contains all the columns needed anywhere in the query. It is sometimes wise to extend a 'good' index with more columns to make it "covering".
But there is a monkey wrench in here. c.customer_id = p.id means that customer_id IN (...) effectively exists. But now there are two "range-like" constraints -- one is an IN, the other is a 'range'. In some newer versions, the Optimizer will happily jump around due to the IN and still be able to do "range" scans. So, I recommend this ordering:
Test(s) of column = constant
Test(s) with IN
One 'range' test (BETWEEN, >=, LIKE with trailing wildcard, etc)
Perhaps add more columns to make it "covering" -- but don't do this step if you end up with more than, say, 5 columns in the index.
Hence, for c, the following is optimal for the WHERE, and happens to be "covering".
INDEX(company_id, source, -- first, but in any order (all "=")
customer_id, -- "IN"
date) -- last, since it is tested as a range
p: (same as above)
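Spelled out as DDL, that commission index might look like this (the index name is only a placeholder):
ALTER TABLE commission
    ADD INDEX idx_company_source_customer_date (company_id, source, customer_id, date);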
Since there was an IN or "range", there is no use seeing if the index can also handle the GROUP BY.
A note on COUNT(x) -- it checks that x is NOT NULL. It is usually just as correct to say COUNT(*), which counts the number of rows without any extra checking.
This is a non-starter since it hides the indexed column (id) in a function:
AND (',3062,3063,3064,3065,3066,...6368,6421,'
LIKE CONCAT('%,', p.id, ',%'))
With your LIKE hack you are tricking the optimizer into using a different plan (most probably using the IDX_6F7146F0AA9E377A index on date first).
You should be able to see this in EXPLAIN.
I think the real issue in your case is the second line of the EXPLAIN: the server executes multiple functions (MONTH, YEAR) for each of the ~6,716 rows and then tries to group all of those rows. During this time, all of these rows have to be stored somewhere (in memory or on disk, depending on your server configuration).
SELECT COUNT(*) FROM commission WHERE (date BETWEEN '2018-01-01' AND '2018-12-31') AND company_id = 90 AND source = 'ACTUAL';
=> How many rows are we talking about?
If the number from the above query is much lower than 6,716, I'd try adding a covering index on the columns customer_id, company_id, source and date. I'm not sure about the best order, as it depends on the data you have (check the cardinality of these columns); I'd start with the index (date, company_id, source, customer_id). Also, I'd add a unique index (id, district_id, owner_id) on partner.
It is also possible to add generated stored columns _year and _month (if your server is a bit old, you can add normal columns and fill them in with a trigger) to get rid of the repeated function executions.
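As a rough sketch (this uses MySQL 5.7+ generated-column syntax; the column names are only suggestions):
ALTER TABLE commission
    ADD COLUMN _year SMALLINT AS (YEAR(date)) STORED,
    ADD COLUMN _month TINYINT AS (MONTH(date)) STORED;
-- the query could then GROUP BY c._year, c._month, p.district_id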
I have a SQL query
SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10 AND (timetaken >= 151 AND timetaken <= 217) AND uid != 1
LIMIT 8852, 1;
which fetches from a table with 1.5 million entries.
I have indexed using
alter table user_data add index a_idx (level, timetaken, uid);
The issue I am facing is that the query takes more than 30 seconds in some cases and less than 0.01 seconds in others.
Is there any issue with the indexing here?
Edit:
Added the explain query details
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
| 1 | SIMPLE | user_data | range | a_idx | a_idx | 30 | NULL | 24091 | Using where; Using index |
+----+-------------+------------------+-------+---------------+------------+---------+------+-------+--------------------------+
The data field in the table is a text field. Its length is greater than 255 characters in most cases. Does this cause an issue?
First of all you should try getting the execution plan of this query with EXPLAIN:
EXPLAIN SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10 AND (timetaken >= 151 AND timetaken <= 217) AND uid != 1
LIMIT 8852, 1;
This is a great slide to follow through on this topic:
http://www.slideshare.net/phpcodemonkey/mysql-explain-explained
Try adding different indexes (sketched as DDL below):
one on uid and level
a separate one on timetaken
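For example, as DDL (the index names are just placeholders):
ALTER TABLE user_data ADD INDEX uid_level_idx (uid, level);
ALTER TABLE user_data ADD INDEX timetaken_idx (timetaken);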
The problem is in the high offset. In order to select the 8853rd result, MySQL has to scan all 8852 rows before this.
By the way, using LIMIT without ORDER BY may lead to unexpected results.
In order to speed up queries with a high offset, you should move to a since..until (keyset) pagination strategy.
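A minimal sketch of that approach, assuming user_data has an auto-increment primary key id (not shown in the question): remember the id of the last row you displayed and ask for rows after it instead of using an offset.
SELECT level, data, string, soln, uid
FROM user_data
WHERE level = 10
  AND timetaken BETWEEN 151 AND 217
  AND uid != 1
  AND id > 123456   -- id of the last row of the previous page (placeholder value)
ORDER BY id
LIMIT 20;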
We have a MyISAM table with approximately 75 million rows that has 5 columns:
id (int),
user_id (int),
page_id (int),
type (enum with 6 strings),
date_created (datetime).
We have a primary key on the id column, a unique index on (user_id, page_id, date_created) and a composite index on (page_id, date_created).
The problem is that the query below takes up to 90 seconds to complete
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table`
WHERE `page_id`=301
and `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND page_id<>user_id
group by `user_id`
This is the explain of this query
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | table | range | page_id | page_id | 12 | NULL | 520024 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
EDIT:
At the suggestion of ypercube I tried adding a new index (page_id, user_id, date_created). However, MySQL does not use it by default, so I had to suggest it to the query optimizer. Here is the new query and its EXPLAIN:
SELECT SQL_NO_CACHE user_id, count(*) nr
FROM `table` USE INDEX (usridexp)
WHERE `page_id`=301
and `date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND page_id<>user_id
group by `user_id` ORDER BY NULL
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
| 1 | SIMPLE | table | ref | usridexp | usridexp | 4 | const | 3943444 | Using where; Using index |
+----+-------------+----------------------------+------+---------------+----------+---------+-------+---------+--------------------------+
Some changes that may improve the query:
Change COUNT(id) to COUNT(*). Since id is (I guess) the PRIMARY KEY and NOT NULL, the results will be identical.
Add an ORDER BY NULL after the GROUP BY clause. In MySQL, a GROUP BY operation also sorts the results, unless you specify otherwise.
The (page_id, date_created) index is probably the best one MySQL can use for this query, but you could also try (page_id, user_id, date_created) (can you also post the EXPLAIN after adding this index?)
Another thing not related to the performance of this query:
If your (user_id, page_id, date_created) index is UNIQUE and id is auto-generated (and not used for anything other than being the primary key), you can make that combination the PRIMARY KEY and drop the id column. One less index and 4 bytes less per row; a rough sketch follows.
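A sketch of that change, assuming id is not referenced anywhere else (try it on a copy of the table first):
ALTER TABLE `table`
    DROP COLUMN id,                                    -- also drops the old PRIMARY KEY
    ADD PRIMARY KEY (user_id, page_id, date_created);
-- the existing UNIQUE index on (user_id, page_id, date_created) then becomes redundant and can be dropped as well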
1) It depends on your data, but you should have multiple indexes available to allow MySQL to choose the best one, e.g. if the table had an index on page_id it wouldn't be scanning so many rows.
2) There is a way of optimising date searches. I haven't actually implemented this myself yet, but have a similar problem that I have thought about quite a bit.
Basically you are looking up data by day, but date comparisons are really slow. What you could do is create another table that stores the earliest and latest ID from the table for each day (a sketch of such a table appears after the example queries below). That table would need to be populated at the end of each day.
After that you could break your query into two parts:
i) Find the IDs to search by running two queries:
select earliestID from idCacheTable where date = '2012-01-03';
select latestID from idCacheTable where date = '2012-02-03';
ii) You can then search directly on the primary key of the table, without doing a date compare on each row, which would be waaaaaay faster.
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM table
WHERE page_id=301
and (id >= earliestID and id <= latestID)
AND page_id<>user_id
group by user_id;
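For completeness, a minimal sketch of the cache table itself (all of the names here are assumptions):
CREATE TABLE idCacheTable (
    `date` DATE NOT NULL PRIMARY KEY,
    earliestID INT NOT NULL,
    latestID INT NOT NULL
);
-- populated once at the end of each day, e.g. from a cron job:
INSERT INTO idCacheTable (`date`, earliestID, latestID)
SELECT CURDATE(), MIN(id), MAX(id)
FROM `table`
WHERE date_created >= CURDATE()
  AND date_created < CURDATE() + INTERVAL 1 DAY;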
The exact solution to your problem will depend on what your data looks like though, rather than one of those two things always being correct.
Sounds odd, but try adding a JOIN:
SELECT SQL_NO_CACHE user_id, count(id) nr
FROM `table` t
JOIN `table` t2 ON t.`user_id`= t2.`user_id`
WHERE t.`page_id`=301
and t.`date_created` BETWEEN '2012-01-03' AND '2012-02-03 23:59:59'
AND t.`page_id`<>t.`user_id`
group by t.`user_id`
For a similar problem, this made the query execute 20 times faster (3-4 s instead of 60+). The JOIN does not do anything smart; the speedup seems to come entirely from MySQL's internal implementation (tested on MySQL 5.1; the table has rare user_id duplicates).
I was trying to optimize a NOT IN clause in MySQL. Somehow I ended up with the following query:
SELECT @i:=(SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc');
SELECT * FROM word WHERE @i IS NULL OR word_id NOT IN (@i);
There is no relationship between the sent_question table and the word table, and I also cannot place an index on correct_option_word_id.
Can somebody please explain whether this method will even optimize the query or not?
UPDATE: As mentioned here, both methods, NOT IN and LEFT JOIN/IS NULL, are almost equally efficient. That's why I don't want to use the LEFT JOIN/IS NULL method.
UPDATE 2:
Explain results for original query:
EXPLAIN SELECT * FROM word WHERE word_id NOT IN (SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc');
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
| 1 | PRIMARY | word | ALL | NULL | NULL | NULL | NULL | 10 | Using where |
| 2 | DEPENDENT SUBQUERY | sent_question | ref | fk_question_subscriber1 | fk_question_subscriber1 | 48 | const | 1 | Using where |
+----+--------------------+---------------+------+-------------------------+-------------------------+---------+-------+------+-------------+
You're right that the NOT IN and LEFT JOIN/IS NULL methods are roughly equally efficient; unfortunately, there is no faster option, only slower ones (NOT EXISTS).
Here's your query, simplified:
SELECT *
FROM word
WHERE
word_id NOT IN (SELECT correct_option_word_id FROM sent_question WHERE msisdn='abc')
As you know, MySQL will do the subquery first and use the returned result set for the NOT IN clause. Then, it will scan through all of the rows in word to see if word_id is in the list for each row.
Unfortunately for this case, indexes are inclusive, not exclusive. They don't help with NOT queries. A covering index on word could potentially still be used to avoid accessing the actual table, and provide some IO benefits, but it won't be used in the traditional "lookup" sense. However, since you are returning all columns on the word table, it may not be viable to have such a large index.
The most important index that will be used here is an index on sent_question.msisdn for the subquery. Ensure that you have that index defined. A multi-column "covering" index on (msisdn, correct_option_word_id) would be best.
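For example (the index name is just a placeholder):
ALTER TABLE sent_question
    ADD INDEX idx_msisdn_correct_option (msisdn, correct_option_word_id);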
If you share your design, we can probably offer some design solutions for optimization.
I doubt it'll work at all.
Try
SELECT *
FROM word AS w
LEFT JOIN sent_question AS sq
ON w.word_id = sq.correct_option_word_id AND sq.msisdn='abc'
WHERE sq.correct_option_word_id IS NULL
Give this simple query a try
SELECT
sent_question.*,
word.word_id AS foundWord
FROM sent_question
LEFT JOIN word
ON word.word_id = sent_question.correct_option_word_id
WHERE sent_question.msisdn='abc'
-- GROUP BY sent_question.correct_option_word_id  -- this shouldn't be needed but is included for completeness
HAVING foundWord IS NULL
I am writing a query for a very specific report that contains a variable number of columns, based on specific relationships of an item. I am open to suggestions on how to change the query if need be, but I don't think it can be. I would prefer to keep this as a single query, as opposed to running it in a loop. The table that is being searched contains around 4 million records, and cannot be archived.
What I would like to know is why the DATEADD index is not being used in the subquery, although it is being used in the outer query, which is on the same table. I am aware that functions on a column stop MySQL from being able to use an index, but here the function is only applied to the value being compared against, not to the indexed column itself.
The result of the report is a number for each specific item (subquery) for each date in the range where something took place. The date range is generated dynamically. The subqueries should return the results for a single day.
We are using MySQL version 5.0.77, which we cannot change as it is managed by our Hosting Provider.
Here is the query:
SELECT DATE_FORMAT(DATEADD, '%d/%m/%y') AS DATEADD,
    (SELECT COUNT(ID)
     FROM ATABLE AS VT
     WHERE ELEMNAME = 'ANELEMENT' AND COMPID = 132
     AND VT.DATEADD BETWEEN CONCAT(DATE(V.DATEADD), " 00:00:00") AND CONCAT(DATE(V.DATEADD), " 23:59:59"))
    AS '132',
    (SELECT COUNT(ID)
     FROM ATABLE AS VT
     WHERE ELEMNAME = 'ANELEMENT' AND COMPID = 149
     AND VT.DATEADD BETWEEN CONCAT(DATE(V.DATEADD), " 00:00:00") AND CONCAT(DATE(V.DATEADD), " 23:59:59"))
    AS '149'
FROM ATABLE AS V
WHERE 1 = 1 AND COMPID = 132
    AND (V.DATEADD >= "2010-09-01 00:00:00"
    AND V.DATEADD <= "2010-10-26 23:59:59")
    AND 1 = 1
    AND ELEMNAME = 'ANELEMENT'
GROUP BY DATE_FORMAT(DATEADD, '%Y-%m-%d')
The number of times the subquery is run depends on the number of links this item has, and is determined when the query is built.
We have tried:
replacing the BETWEEN with
"VT.DATEADD >= DATE(V.DATEADD) and VT.DATEADD <= DATE(V.DATEADD) + 1"
however this doesn't work either. Changing it to
"VT.DATEADD = DATE(V.DATEADD)"
does use the index, but doesn't return the correct number of rows, as DATEADD is a datetime. If we change it to:
"VT.DATEADD >= "2010-09-01" AND VT.DATEADD <= "2010-09-02"
The output from Explain is
+----+--------------------+-------+-------+-------------------------+----------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+-------------------------+----------+---------+-------+-------+----------------------------------------------+
| 1 | PRIMARY | V | range | DATEADD,COMPID,ELEMNAME | DATEADD | 8 | NULL | 1386 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | VT | ref | COMPID,ELEMNAME | ELEMNAME | 103 | const | 44277 | Using where |
+----+--------------------+-------+-------+-------------------------+----------+---------+-------+-------+----------------------------------------------+
Using USE INDEX or FORCE INDEX (when the index is available but not used) results in a NULL key.
Without fixing this, the query runs incredibly slowly even over a tiny date range, and locks the database up.
I don't know if I'm oversimplifying what you want overall, but will this work for you? It appears you want to know how much activity there was for two "compid" values within a given date range.
SELECT
    DATE_FORMAT(DATEADD, '%Y-%m-%d'),
    SUM( if( compid = 132, 1, 0 ) ) as Count132,
    SUM( if( compid = 149, 1, 0 ) ) as Count149
from
    ATable
where
    elemname = "ANELEMENT"
    AND ( compid = 132 or compid = 149 )
    AND DATEADD BETWEEN "2010-09-01 00:00:00" AND "2010-10-26 23:59:59"
group by
    DATE_FORMAT(DATEADD, '%Y-%m-%d')