We have just moved from mariadb 5.5 to MySQL 8 and some of the update queries have suddenly become slow. On more investigation, we found that MySQL 8 does not use index when the subquery has group column.
For example, below is a sample database. Table users maintain the current balance of the users per type and table 'accounts' maintain the total balance history per day.
CREATE DATABASE 'test';
CREATE TABLE `users` (
`uid` int(10) unsigned NOT NULL DEFAULT '0',
`balance` int(10) unsigned NOT NULL DEFAULT '0',
`type` int(10) unsigned NOT NULL DEFAULT '0',
KEY (`uid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `accounts` (
`uid` int(10) unsigned NOT NULL AUTO_INCREMENT,
`balance` int(10) unsigned NOT NULL DEFAULT '0',
`day` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`uid`),
KEY `day` (`day`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Below is a explanation for the query to update accounts
mysql> explain update accounts a inner join (
select uid, sum(balance) balance, day(current_date()) day from users) r
on r.uid=a.uid and r.day=a.day set a.balance=r.balance;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| 1 | UPDATE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | no matching row in const table |
| 2 | DERIVED | users | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
2 rows in set, 1 warning (0.00 sec)
As you can see, mysql is not using index.
On more investigation, I found that if I remove sum() from the subquery, it starts using index. However, that's not the case with mariadb 5.5 which was correctly using the index in all the case.
Below are two select queries with and without sum(). I've used select query to cross check with mariadb 5.5 since 5.5 does not have explanation for update queries.
mysql> explain select * from accounts a inner join (
select uid, balance, day(current_date()) day from users
) r on r.uid=a.uid and r.day=a.day ;
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------+
| 1 | SIMPLE | a | NULL | ref | PRIMARY,day | day | 4 | const | 1 | 100.00 | NULL |
| 1 | SIMPLE | users | NULL | eq_ref | PRIMARY | PRIMARY | 4 | test.a.uid | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------+------+----------+-------+
2 rows in set, 1 warning (0.00 sec)
and with sum()
mysql> explain select * from accounts a inner join (
select uid, sum(balance) balance, day(current_date()) day from users
) r on r.uid=a.uid and r.day=a.day ;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | no matching row in const table |
| 2 | DERIVED | users | NULL | ALL | NULL | NULL | NULL | NULL | 1 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+--------------------------------+
2 rows in set, 1 warning (0.00 sec)
Below is output from mariadb 5.5
MariaDB [test]> explain select * from accounts a inner join (
select uid, sum(balance) balance, day(current_date()) day from users
) r on r.uid=a.uid and r.day=a.day ;
+------+-------------+------------+------+---------------+------+---------+-----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+------------+------+---------------+------+---------+-----------------------+------+-------------+
| 1 | PRIMARY | a | ALL | PRIMARY,day | NULL | NULL | NULL | 1 | |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | 10 | test.a.uid,test.a.day | 2 | Using where |
| 2 | DERIVED | users | ALL | NULL | NULL | NULL | NULL | 1 | |
+------+-------------+------------+------+---------------+------+---------+-----------------------+------+-------------+
3 rows in set (0.00 sec)
Any idea what are we doing wrong?
As others have commented, break your update query apart...
update accounts join
then your query
on condition of the join.
Your inner select query of
select uid, sum(balance) balance, day(current_date()) day from users
is the only thing that is running, getting some ID and the sum of all balances and whatever the current day. You never know which user is getting updated, let alone the correct amount. Start by getting your query to see your expected results per user ID. Although the context does not make sense that your users table has a "uid", but no primary key thus IMPLYING there is multiple records for the same "uid". The accounts (to me) implies ex: I am a bank representative and sign up multiple user accounts. Thus my active portfolio of client balances on a given day is the sum from users table.
Having said that, lets look at getting that answer
select
u.uid,
sum( u.balance ) allUserBalance
from
users u
group by
u.uid
This will show you per user what their total balance is as of right now. The group by now gives you the "ID" key to tie back to the accounts table. In MySQL, the syntax of a correlated update for this scenario would be... (I am using above query and giving alias "PQ" for PreQuery for the join)
update accounts a
JOIN
( select
u.uid,
sum( u.balance ) allUserBalance
from
users u
group by
u.uid ) PQ
-- NOW, the JOIN ON clause ties the Accounts ID to the SUM TOTALS per UID balance
on a.uid = PQ.uid
-- NOW you can SET the values
set Balance = PQ.allUserBalance,
Day = day( current_date())
Now, the above will not give a proper answer if you have accounts that no longer have user entries associated... such as all users get out. So, whatever accounts have no users, their balance and day record will be as of some prior day. To fix this, you could to a LEFT-JOIN such as
update accounts a
LEFT JOIN
( select
u.uid,
sum( u.balance ) allUserBalance
from
users u
group by
u.uid ) PQ
-- NOW, the JOIN ON clause ties the Accounts ID to the SUM TOTALS per UID balance
on a.uid = PQ.uid
-- NOW you can SET the values
set Balance = coalesce( PQ.allUserBalance, 0 ),
Day = day( current_date())
With the left-join and COALESCE(), if there is no record summation in the user table, it will set the account balance to zero.
Related
My query:
SELECT fd.*
FROM `fin_document` as fd
LEFT JOIN `fin_income` as fi ON fd.id=fi.document_id
WHERE fd.dt_payment < NOW()
HAVING SUM(fi.amount) < fd.total_amount
which is obviously not correct, has to retrieve all records from fin_document where dt_payment is earlier than NOW(). This part is ok. But I have to filter them by the payments made on this documents. One document can have more than 1 payment ( 2,3,4,5 ...). In fin_income are those payments. There is column document_id in fin_income table which is foreign key fin_income.document_id=fin_document.id. The problem ( at least for me ) is that I don't have a specific id criterion and the amount is made from all records from fin_income table. I also have to find records that still don't have payments on them ( they don't have rows in fin_income ).
fin_document:
+-------------------+---------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| dt_payment | date | YES | | NULL |
| total_amount | decimal(10,2) | NO | | 0.00 |
+-------------------+---------------------------+------+-----+---------+----------------+
fin_income:
+------------------+---------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+---------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| document_id | int(11) | YES | MUL | NULL | |
| amount | decimal(10,2) | YES | | 0.00 |
+------------------+---------------+------+-----+-------------------+----------------+
I'm not sure if I understand you correctly, but you can try this:
SELECT fd.*, SUM(IFNULL(fi.amount, 0)) as sum_amount, COUNT(fi.amount) as count_amount
FROM `fin_document` as fd
LEFT JOIN `fin_income` as fi ON fd.id=fi.document_id
WHERE fd.dt_payment < NOW()
GROUP BY fd.id
HAVING sum_amount < fd.total_amount # condition for searching by sum of payments
AND count_amount = {needed_count} # condition for searching by count of payments;
# documents without payments will have
# sum and count equal to 0
All aggregations are made in SELECT part, then all documents are grouped by id to avoid duplicates in result and make possible to use aggregation results (SUM, COUNT). And finally you can apply needed conditions (about date, paid sum or count of payments).
Note: pay attention that GROUP BY significantly increases execution time for a lot of data.
You may just need a correlated sub query to test income
drop table if exists fin_document,fin_income;
create table fin_document
(id int(11),
dt_payment date ,
total_amount decimal(10,2)
) ;
create table fin_income
( id int(11) ,
document_id int(11) ,
amount decimal(10,2)
);
insert into fin_document values
(1,'2019-05-31',1000),
(2,'2019-06-10',1000),
(3,'2019-07-10',1000);
insert into fin_income values
(1,1,5),(1,1,5);
SELECT fd.*,(select coalesce(sum(fi.amount),0) from fin_income fi where fd.id=fi.document_id) income
FROM `fin_document` as fd
WHERE fd.dt_payment < NOW() and
fd.total_amount > (select coalesce(sum(fi.amount),0) from fin_income fi where fd.id=fi.document_id);
+------+------------+--------------+--------+
| id | dt_payment | total_amount | income |
+------+------------+--------------+--------+
| 1 | 2019-05-31 | 1000.00 | 10.00 |
| 2 | 2019-06-10 | 1000.00 | 0.00 |
+------+------------+--------------+--------+
2 rows in set (0.00 sec)
for a project i'm working on, i have a single table with two dates meaning a range of dates and i needed a way to "multiply" my rows for every day in between the two dates.
So for instance i have start 2017-07-10, end 2017-07-14
I needed to have 4 lines with 2017-07-10, 2017-07-11, 2017-07-12, 2017-07-13
In order to do this i found here someone mentioning using a "calendar table" with all the dates for years.
So i built it, now i have these two simple tables:
CREATE TABLE `time_sample` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start` varchar(16) DEFAULT NULL,
`end` varchar(16) DEFAULT NULL,
PRIMARY KEY (`societa_id`),
KEY `start_idx` (`start`),
KEY `end_idx` (`end`)
) ENGINE=MyISAM AUTO_INCREMENT=222 DEFAULT CHARSET=latin1;
This table contains my date ranges, start and end are indexed, the primary key is an incremental int.
Sample Row:
id start end
1 2015-05-13 2015-05-18
Second table:
CREATE TABLE `time_dimension` (
`id` int(11) NOT NULL,
`db_date` date NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `td_dbdate_idx` (`db_date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This has a date indexed for every day for many years to come.
Sample row:
id db_date
20120101 2012-01-01
Now, i made the join:
select * from time_sample s join time_dimension t on (t.db_date >= start and t.db_date < end);
This takes 3ms. Even if my first table is HUGE, this query will always be very quick (max i've seen was 50ms with a lot of records).
The issue i have is while grouping results (i need them grouped for my application):
select * from time_sample s join time_dimension t on (t.db_date >= start and t.db_date < end) group by db_date;
This takes more than one second with not so many rows in the first table, increasing dramatically. Why is this happening and how can i avoid this?
Changing the data types doesn't help, having the second table with just one column doesn't help.
Can i have suggestions, please :(
I cannot replicate this result...
I have a calendar table with lots of dates: calendar(dt) where dt is a PRIMARY KEY DATE data type.
DROP TABLE IF EXISTS time_sample;
CREATE TABLE time_sample (
id int(11) NOT NULL AUTO_INCREMENT,
start date not NULL,
end date null,
PRIMARY KEY (id),
KEY (start,end)
);
INSERT INTO time_sample (start,end) VALUES ('2010-03-13','2010-05-09);
SELECT *
FROM calendar x
JOIN time_sample y
ON x.dt BETWEEN y.start AND y.end;
+------------+----+------------+------------+
| dt | id | start | end |
+------------+----+------------+------------+
| 2010-03-13 | 1 | 2010-03-13 | 2010-05-09 |
| 2010-03-14 | 1 | 2010-03-13 | 2010-05-09 |
| 2010-03-15 | 1 | 2010-03-13 | 2010-05-09 |
| 2010-03-16 | 1 | 2010-03-13 | 2010-05-09 |
...
| 2010-05-09 | 1 | 2010-03-13 | 2010-05-09 |
+------------+----+------------+------------+
58 rows in set (0.10 sec)
EXPLAIN
SELECT * FROM calendar x JOIN time_sample y ON x.dt BETWEEN y.start AND y.end;
+----+-------------+-------+--------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | y | system | start | NULL | NULL | NULL | 1 | |
| 1 | SIMPLE | x | range | PRIMARY | PRIMARY | 3 | NULL | 57 | Using where; Using index |
+----+-------------+-------+--------+---------------+---------+---------+------+------+--------------------------+
2 rows in set (0.00 sec)
Even with a GROUP BY, I'm struggling to reproduce the problem. Here's a simple COUNT...
SELECT SQL_NO_CACHE dt, COUNT(1) FROM calendar x JOIN time_sample y WHERE x.dt BETWEEN y.start AND y.end GROUP BY dt ORDER BY COUNT(1) DESC LIMIT 3;
+------------+----------+
| dt | COUNT(1) |
+------------+----------+
| 2010-04-03 | 2 |
| 2010-05-05 | 2 |
| 2010-03-13 | 2 |
+------------+----------+
3 rows in set (0.36 sec)
EXPLAIN
SELECT SQL_NO_CACHE dt, COUNT(1) FROM calendar x JOIN time_sample y WHERE x.dt BETWEEN y.start AND y.end GROUP BY dt ORDER BY COUNT(1) DESC LIMIT 3;
+----+-------------+-------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
| 1 | SIMPLE | y | index | start | start | 7 | NULL | 2 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | x | index | PRIMARY | PRIMARY | 3 | NULL | 1000001 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
I have two query for same output
Slow Query:
SELECT
*
FROM
account_range
WHERE
is_active = 1 AND '8033576667466317' BETWEEN range_start AND range_end;
Execution Time: ~800 ms.
Explain:
+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+
| 1 | SIMPLE | account_range | NULL | ALL | range_start,range_end,range_se_active_idx | NULL | NULL | NULL | 940712 | 2.24 | Using where |
+----+-------------+---------------+------------+------+-------------------------------------------+------+---------+------+--------+----------+-------------+
Very Fast Query: learnt from here
SELECT
*
FROM
account_range
WHERE
is_active = 1 AND
range_start = (SELECT
MAX(range_start)
FROM
account_range
WHERE
range_start <= '8033576667466317') AND
range_end = (SELECT
MIN(range_end)
FROM
account_range
WHERE
range_end >= '8033576667466317')
Execution Time: ~1ms
Explain:
+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+
| 1 | PRIMARY | account_range | NULL | ref | range_start,range_end,range_se_active_idx | range_se_active_idx | 125 | const,const,const | 1 | 100.00 | NULL |
| 3 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
| 2 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+---------------+------------+------+-------------------------------------------+---------------------+---------+-------------------+------+----------+------------------------------+
Table Structure:
CREATE TABLE account_range (
id int(11) unsigned NOT NULL AUTO_INCREMENT,
range_start varchar(20) NOT NULL,
range_end varchar(20) NOT NULL,
is_active tinyint(1) NOT NULL,
bank_name varchar(100) DEFAULT NULL,
addedon timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
updatedon timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
description text,
PRIMARY KEY (id),
KEY range_start (range_start),
KEY range_end (range_end),
KEY range_se_active_idx (range_start , range_end , is_active)
) ENGINE=InnoDB AUTO_INCREMENT=946132 DEFAULT CHARSET=utf8;
Please do explain Why doesn't MySql automatically optimizes BETWEEN query?
Update:
Realised my mistake from #kordirko answer. My table contains only non-overlapping ranges, so both queries are returning same results.
Such a comparision doesn't make sense, since you are comparing apples to oranges.
These two queries are not eqivalent, they give different resuts, thus MySql optimises them in a different way and their plans can differ.
See this simple example: http://sqlfiddle.com/#!9/98678/2
create table account_range(
is_active int,
range_start int,
range_end int
);
insert into account_range values
(1,-20,100), (1,10,30);
First query gives 2 rows:
select * from account_range
where is_active = 1 and 25 between range_start AND range_end;
| is_active | range_start | range_end |
|-----------|-------------|-----------|
| 1 | -20 | 100 |
| 1 | 10 | 30 |
Second query gives only 1 row:
SELECT * FROM account_range
WHERE
is_active = 1 AND
range_start = (SELECT MAX(range_start)
FROM account_range
WHERE range_start <= 25
) AND
range_end = (SELECT MIN(range_end)
FROM account_range
WHERE range_end >= 25
)
| is_active | range_start | range_end |
|-----------|-------------|-----------|
| 1 | 10 | 30 |
To speed this query up (the first one), two bitmap indexes can be used together with "bitmap and" operation - but MySql doesn't have such a feature.
Another option is a spatial index (for example GIN indexes in PostgreSql: http://www.postgresql.org/docs/current/static/textsearch-indexes.html).
And another option is a star transformation (or a star schema) - you need to "divide" this table into two "dimension" or "measures" tables and one "fact" table .... but this is too broad topic, if you want know more you can start from here: https://en.wikipedia.org/wiki/Star_schema
Second query is fast because MySQL is able to use available index.
SELECT * FROM account_range
WHERE
is_active = 1 AND
range_start = a_constant_value_1 AND
range_end = a_constant_value_2
Above query is fast because range_se_active_idx index can satisfy search criteria so it is used.
Both subqueries are also fast (see Select tables optimized away in EXPLAIN's output)
SELECT MAX(range_start) FROM account_range
WHERE range_start <= '8033576667466317'
SELECT MIN(range_end) FROM account_range
WHERE range_end >= '8033576667466317'
because range_start and range_end both are indexed, they are ordered.
With ordered data, for first sub query, MySQL basically just picks one record whose range_start equals 8033576667466317 or one below it (MAX(range_start)). For second sub query, MySQL picks one record whose range_end equals 8033576667466317 or one above it (MIN(range_end)).
For BETWEEN ... AND .. query, MySQL cannot find any indices because that is not a range search. It's basically same as
SELECT * FROM account_range
WHERE
is_active = 1 AND
range_start >= '8033576667466317' AND
range_end <= '8033576667466317';
It has to search records with range_start from 8033576667466317 to largest value and also from smallest range_end to 8033576667466317. All indices cannot satisfy this criteria so it has to scan table.
I believe it can be optimized if you can rewrite it into something like this:
SELECT * FROM account_range
WHERE
is_active = 1 AND
(range_start BETWEEN a_min_value AND a_max_value) AND
(range_end BETWEEN a_min_value AND a_max_value);
I have a table with currency exchange rates that I fill with data published by the ECB. That data contains gaps in the date dimension like e.g. holidays.
CREATE TABLE `imp_exchangerate` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`rate_date` date NOT NULL,
`currency` char(3) NOT NULL,
`rate` decimal(14,6) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `rate_date` (`rate_date`,`currency`),
KEY `imp_exchangerate_by_currency` (`currency`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
I also have a date dimension as youd expect in a data warehouse:
CREATE TABLE `d_date` (
`date_id` int(11) NOT NULL,
`full_date` date DEFAULT NULL,
---- etc.
PRIMARY KEY (`date_id`),
KEY `full_date` (`full_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Now I try to fill the gaps in the exchangerates like this:
SELECT
d.full_date,
currency,
(SELECT rate FROM imp_exchangerate
WHERE rate_date <= d.full_date AND currency = c.currency
ORDER BY rate_date DESC LIMIT 1) AS rate
FROM
d_date d,
(SELECT DISTINCT currency FROM imp_exchangerate) c
WHERE
d.full_date >=
(SELECT min(rate_date) FROM imp_exchangerate
WHERE currency = c.currency) AND
d.full_date <= curdate()
Explain says:
+------+--------------------+------------------+-------+----------------------------------------+------------------------------+---------+------------+------+--------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------------+------------------+-------+----------------------------------------+------------------------------+---------+------------+------+--------------------------------------------------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 201 | |
| 1 | PRIMARY | d | range | full_date | full_date | 4 | NULL | 6047 | Using where; Using index; Using join buffer (flat, BNL join) |
| 4 | DEPENDENT SUBQUERY | imp_exchangerate | ref | imp_exchangerate_by_currency | imp_exchangerate_by_currency | 3 | c.currency | 664 | |
| 3 | DERIVED | imp_exchangerate | range | NULL | imp_exchangerate_by_currency | 3 | NULL | 201 | Using index for group-by |
| 2 | DEPENDENT SUBQUERY | imp_exchangerate | index | rate_date,imp_exchangerate_by_currency | rate_date | 6 | NULL | 1 | Using where |
+------+--------------------+------------------+-------+----------------------------------------+------------------------------+---------+------------+------+--------------------------------------------------------------+
MySQL needs multiple hours to execute that query. Are there any Ideas how to improve that? I have tried with an index on rate without any noticable impact.
I have a solution for a while now: get rid of dependent subqueries. I had to think from different angles in mutliple places and here is the result:
SELECT
cd.date_id,
x.currency,
x.rate
FROM
imp_exchangerate x INNER JOIN
(SELECT
d.date_id,
max(rate_date) as rate_date,
currency
FROM
d_date d INNER JOIN
imp_exchangerate ON rate_date <= d.full_date
WHERE
d.full_date <= curdate()
GROUP BY
d.date_id,
currency) cd ON x.rate_date = cd.rate_date and x.currency = cd.currency
This query finishes in less then 10 minutes now compared to multiple hours for the original query.
Lesson learned: avoid dependent subqueries in MySQL like the plague!
In MySQL 5.0.75-0ubuntu10.2 I've got a fixed table layout like that:
Table parent with an id
Table parent2 with an id
Table children1 with a parentId
CREATE TABLE `Parent` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(200) default NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
CREATE TABLE `Parent2` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(200) default NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
CREATE TABLE `Children1` (
`id` int(11) NOT NULL auto_increment,
`parentId` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `parent` (`parentId`)
) ENGINE=InnoDB
A children has a parent in one of the tables Parent or Parent2. When I need to get a children I use a query like that:
select * from Children1 c
inner join (
select id as parentId from Parent
union
select id as parentId from Parent2
) p on p.parentId = c.parentId
Explaining this query yields:
+----+--------------+------------+-------+---------------+---------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+------------+-------+---------------+---------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | Parent | index | NULL | PRIMARY | 4 | NULL | 1 | Using index |
| 3 | UNION | Parent2 | index | NULL | PRIMARY | 4 | NULL | 1 | Using index |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+------------+-------+---------------+---------+---------+------+------+-----------------------------------------------------+
4 rows in set (0.00 sec)
which is reasonable given the layout.
Now the problem: The previous query is somewhat useless, since it returns no columns from the parent elements. In the moment I add more columns to the inner query no index will be used anymore:
mysql> explain select * from Children1 c inner join ( select id as parentId,name from Parent union select id as parentId,name from Parent2 ) p on p.parentId = c.parentId;
+----+--------------+------------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+------------+------+---------------+------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | Parent | ALL | NULL | NULL | NULL | NULL | 1 | |
| 3 | UNION | Parent2 | ALL | NULL | NULL | NULL | NULL | 1 | |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+------------+------+---------------+------+---------+------+------+-----------------------------------------------------+
4 rows in set (0.00 sec)
Can anyone explain why the (PRIMARY) indices are not used any more? Is there a workaround for this problem if possible without having to change the DB layout?
Thanks!
I think that the optimizer falls down once you start pulling out multiple columns in the derived query because of the possibility that it would need to convert data types on the union (not in this case, but in general). It may also be due to the fact that your query essentially wants to be a correlated derived subquery, which isn't possible (from dev.mysql.com):
Subqueries in the FROM clause cannot be correlated subqueries, unless used within the ON clause of a JOIN operation.
What you are trying to do (but isn't valid) is:
select * from Children1 c
inner join (
select id as parentId from Parent where Parent.id = c.parentId
union
select id as parentId from Parent2 where Parent.id = c.parentId
) p
Result: "Unknown column 'c.parentId' in 'where clause'.
Is there a reason you don't prefer two left joins and IFNULLs:
select *, IFNULL(p1.name, p2.name) AS name from Children1 c
left join Parent p1 ON p1.id = c.parentId
left join Parent2 p2 ON p2.id = c.parentId
The only difference between the queries is that in yours you'll get two rows if there is a parent in each table. If that's what you want/need then this will work well also and joins will be fast and always make use of the indexes:
(select * from Children1 c join Parent p1 ON p1.id = c.parentId)
union
(select * from Children1 c join Parent2 p2 ON p2.id = c.parentId)
My first thought is to insert a "significant" number of records in the tables and use ANALYZE TABLE to update the statistics. A table with 4 records will always be faster to read using a full scan rather then going via the index!
Further, you can try USE INDEX to force the usage of the index and look how the plan changes.
I will also recomend reading this documentation and see which bits are relevant
MYSQL::Optimizing Queries with EXPLAIN
This article can also be useful
7 ways to convince MySQL to use the right index