How to check if a value exists in another row? - mysql

This is driving me nuts. I have dumped the imdb db using imdbpy. I'm trying to find US movies that have actor data available, filtered by the first letter of the movie title.
Below is an example of a query that fetches the movies without the actor data check. This runs pretty quickly:
SELECT DISTINCT title.id, title.title, title.production_year
FROM title
INNER JOIN movie_info
    ON movie_info.movie_id = title.id
    AND movie_info.info_type_id = 8
    AND movie_info.info = 'USA'
WHERE title LIKE 'a%'
    AND title.kind_id = 1
LIMIT 75
The cast data is stored in a separate table called cast_info and contains about 22 million records. The nr_order column contains the order of credits for actors in a movie; for example, Tom Hanks would be 1 in Forrest Gump. There are typically dozens of rows for each movie_id.
So to check whether the actor data is available, there should be at least one row whose nr_order isn't NULL for that particular movie_id. If all the nr_order values for a movie_id are NULL, it does NOT contain the data I need.
To grab this information I used the query below:
SELECT DISTINCT title.id, title.title, title.production_year
FROM title
INNER JOIN movie_info
    ON movie_info.movie_id = title.id
    AND movie_info.info_type_id = 8
    AND movie_info.info = 'USA'
INNER JOIN cast_info
    ON cast_info.movie_id = title.id
    AND cast_info.nr_order = 1
WHERE title LIKE 'a%'
    AND title.kind_id = 1
LIMIT 75
For some reason the query becomes very slow: 0.3 to 0.7 seconds for the first query and about 6 to 10 seconds for the second. I added an index on cast_info.nr_order, but it didn't help.
The EXPLAIN output:
+----+-------------+------------+-------+-------------------------------------------+--------------------+---------+---------------+--------+------------------------------+
| id | select_type | table      | type  | possible_keys                             | key                | key_len | ref           | rows   | Extra                        |
+----+-------------+------------+-------+-------------------------------------------+--------------------+---------+---------------+--------+------------------------------+
|  1 | SIMPLE      | title      | range | PRIMARY,title_idx_title,fk_kind_type_id_4 | title_idx_title    | 257     | NULL          | 132801 | Using where; Using temporary |
|  1 | SIMPLE      | movie_info | ref   | movie_info_idx_mid,info_type_id           | movie_info_idx_mid | 4       | imdb.title.id |      4 | Using where; Distinct        |
|  1 | SIMPLE      | cast_info  | ref   | cast_info_idx_mid,nr_order                | cast_info_idx_mid  | 4       | imdb.title.id |     12 | Using where; Distinct        |
+----+-------------+------------+-------+-------------------------------------------+--------------------+---------+---------------+--------+------------------------------+
Any ideas would be very helpful!
EDIT: EXPLAIN from 1st query
+----+-------------+------------+-------+-------------------------------------------+--------------------+---------+---------------+--------+------------------------------+
| id | select_type | table      | type  | possible_keys                             | key                | key_len | ref           | rows   | Extra                        |
+----+-------------+------------+-------+-------------------------------------------+--------------------+---------+---------------+--------+------------------------------+
|  1 | SIMPLE      | title      | range | PRIMARY,title_idx_title,fk_kind_type_id_4 | title_idx_title    | 257     | NULL          | 132801 | Using where; Using temporary |
|  1 | SIMPLE      | movie_info | ref   | movie_info_idx_mid,info_type_id           | movie_info_idx_mid | 4       | imdb.title.id |      4 | Using where; Distinct        |
+----+-------------+------------+-------+-------------------------------------------+--------------------+---------+---------------+--------+------------------------------+

Since you're only concerned with whether there is or is not cast information available, you could try using EXISTS instead:
SELECT DISTINCT title.id, title.title, title.production_year
FROM title
INNER JOIN movie_info
    ON movie_info.movie_id = title.id
    AND movie_info.info_type_id = 8
    AND movie_info.info = 'USA'
WHERE title LIKE 'a%'
    AND title.kind_id = 1
    AND EXISTS (SELECT 1
                FROM cast_info
                WHERE cast_info.movie_id = title.id
                  AND cast_info.nr_order IS NOT NULL)
LIMIT 75
I'm not sure exactly what explains the behavior you're seeing, but the DISTINCT could be doing something funny with lots of rows on the join, or at least with lots of rows in the joined product (note the Distinct being applied to the cast_info table in the EXPLAIN).
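If the EXISTS version is still slow, a composite index may be worth a try so the subquery can be answered from the index alone. A sketch, assuming the schema above (the index name is illustrative):
ALTER TABLE cast_info
    ADD INDEX idx_movie_nr_order (movie_id, nr_order);
    -- (movie_id, nr_order) lets MySQL confirm "at least one non-NULL
    -- nr_order for this movie_id" without touching any table rows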

Related

Return null values in conditional join

I have a settings table and a settings_values table that cross-matches the value for every user. It's important that I return a NULL value in cases where the setting has not been turned on for a user. It's also important that the query is highly efficient, as it will be called often.
Ultimately, I have about 50 settings and thousands of users, so every time I run the query I should get 50 results, similar to this:
setting_id | setting_name | value | user_id
1          | blue         | true  | 35
2          | red          | NULL  | NULL
3          | yellow       | false | 35
4          | brown        | true  | 35
5          | black        | NULL  | NULL
I have a working solution below, but it's creating a bottleneck in my database requests, which is strange given that settings only has 50 rows and settings_values only has about 20,000.
My tables are structured similar to this:
settings:
setting_id | setting_name
With setting_id as the only index (Primary Key)
settings_values:
id | setting_id | value | user_id
With id being the Primary Key, setting_id being a foreign key reference to the previous table, value being a varchar, and user_id also having an index on it.
My current working code is the following:
SELECT
s.setting_id,
s.setting_name,
sv.value,
sv.user_id
FROM
cmssps_settings s
LEFT JOIN cmssps_settings_values sv ON
s.setting_id = sv.setting_id
AND sv.user_id = '35';
That seems to work OK apart from the bottleneck (I get my 5 rows returned), and I suspect it's the JOIN ON ... AND that's causing the slowdown. I've been trying to achieve the same result with the condition in the WHERE clause instead, like this, but I don't get my required NULL results (i.e. it only returns the 3 rows that satisfy the WHERE):
SELECT
s.setting_id,
s.setting_name,
sv.value,
sv.user_id
FROM
cmssps_settings s
LEFT JOIN cmssps_settings_values sv ON
s.setting_id = sv.setting_id
WHERE sv.user_id = '35';
When I EXPLAIN both I get this for the former:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | s | ALL | [NULL] | [NULL]| [NULL] | [NULL]| 49 | [NULL]
1 | SIMPLE | sv | ref | user_id | branch_id | 5 | const | 24 | Using where
When I do it for the latter I get a key value for my first table, which I assume indicates that it will execute faster:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | s | index | [NULL] | PRIMARY | 4 | [NULL]| 49 | [NULL]
1 | SIMPLE | sv | ref | user_id | branch_id | 5 | const | 24 | Using where
I've also attempted nesting the queries and using UNION but that seems like a regression.
Is there anything obvious as to why my initial approach is having such a negative impact on performance?
Is there a more efficient way to achieve the same outcome?
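One hedged suggestion, based only on the schema described above: give settings_values a composite index covering the whole join condition, so each settings row can probe it with a single index lookup. The index name is illustrative:
ALTER TABLE cmssps_settings_values
    ADD INDEX idx_setting_user (setting_id, user_id);
    -- covers both predicates in the LEFT JOIN's ON clause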

How to improve this MySQL Query using join?

I have got a simple query and it takes more than 14 seconds.
select
e.title, e.date, v.name, v.city, v.region, v.country
from seminar e force index for join (venueid)
left join venues v on e.venueid = v.id
where v.country = 'US'
and v.city = 'New York'
and v.region = 'NY'
and e.date > curdate()
and e.someid != 0
Note: an earlier version of this query selected count(e.id) for debugging purposes; in fact we get information from both tables.
Explain gives this:
+----+-------------+-------+-------------+--------------------------------------------------------------------------------------+--------------------------+---------+-----------------+------+--------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------------+--------------------------------------------------------------------------------------+--------------------------+---------+-----------------+------+--------------------------------------------------------+
| 1 | SIMPLE | v | index_merge | PRIMARY,city,country,region | city,region | 378,378 | NULL | 2 | Using intersect(city,region); Using where |
| 1 | SIMPLE | e | ref | venueid | venueid | 5 | v.id | 11 | Using where |
+----+-------------+-------+-------------+--------------------------------------------------------------------------------------+--------------------------+---------+-----------------+------+--------------------------------------------------------+
I have indexes on e.id, e.date, e.someid, as well as v.id, v.country, v.city and v.region.
I know the db-setup is a mess but that's what I have to deal with right now.
Why does the SQL take so long when the final result is only about 150 rows? The events table has about 1M entries and venues about 100K.
Both tables are MyISAM. Any ideas how to improve this?
Upon creating an index like this
create index location on venues (city, region, country)
it takes 20 seconds; the EXPLAIN is this:
+----+-------------+-------+------+--------------------------------------+--------------+---------+-------------------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+--------------------------------------+--------------+---------+-------------------+------+------------------------------------+
| 1 | SIMPLE | v | ref | PRIMARY,city,country,region,location | location | 765 | const,const,const | 410 | Using index condition; Using where |
| 1 | SIMPLE | e | ref | EventVenueID | venueid | 5 | v.id | 11 | Using where |
+----+-------------+-------+------+--------------------------------------+--------------+---------+-------------------+------+------------------------------------+
You have LEFT JOIN venues, but you have conditions in the WHERE clause on the joined venues row, so only matching rows will be returned; the join behaves like an INNER JOIN. However, that's a side issue; read on for why you don't need a join at all.
Next, if the city is vancouver, there's no need to also test for country or region.
Finally, if you're trying to find "how many future events are in Vancouver", you don't need a join, as the venue id can be resolved with a simple subquery!
Try this:
select count(*) as event_count
from events
where venueid in (select id from venues where city = 'vancouver')
and startdate > curdate()
and te_id != 0
MySQL will use the index on venueid without you having to use a hint. If it doesn't, execute this:
ANALYZE TABLE events;
which will update the statistics of the data distribution in the indexed columns. Note that if a lot of your events are in Vancouver, it's more efficient to not use an index (as most of the rows will have to be accessed anyway).
This would make the first part of the query faster:
INDEX(city, region, country)
I went another way, since it seems that MySQL can't handle these joins effectively:
- Created one big new table with all the columns I need from the join, so the seminars and events are in one table now
- Added indexes
Now the query is fast. Don't know why...
From 25 seconds, we are down to 0.08 seconds.
That's how I wanted it.
If anybody knows why, you are more than welcome to provide an answer.

Updating millions of records on inner joined subquery - optimization techniques

I'm looking for some advice on how I might better optimize this query.
For each _piece_detail record that:
- contains at least one matching _scan record on (zip, zip_4, zip_delivery_point, serial_number);
- belongs to a company from mailing_groups (through a chain of relationships);
- has either a first_scan_date_time that is greater than the MIN(scan_date_time) of the related _scan records, or a latest_scan_date_time that is less than the MAX(scan_date_time) of the related _scan records;
I will need to:
- set _piece_detail.first_scan_date_time to MIN(_scan.scan_date_time)
- set _piece_detail.latest_scan_date_time to MAX(_scan.scan_date_time)
Since I'm dealing with millions upon millions of records, I am trying to reduce the number of records that I actually have to search through. Here are some facts about the data:
- The _piece_detail table is partitioned by job_id, so it seems to make the most sense to run through these checks in the order of _piece_detail.job_id, _piece_detail.piece_id.
- The _scan table contains over 100,000,000 records right now and is partitioned by (zip, zip_4, zip_delivery_point, serial_number, scan_date_time), which is the same key that is used to match a _scan with a _piece_detail (aside from scan_date_time).
- Only about 40% of the _piece_detail records belong to a mailing_group, but we don't know which ones these are until we run through the full relationship of joins.
- Only about 30% of the _scan records belong to a _piece_detail with a mailing_group.
- There are typically between 0 and 4 _scan records per _piece_detail.
Now, I am having a hell of a time finding a decent way to execute this. I had originally started with something like this:
UPDATE _piece_detail
INNER JOIN (
SELECT _piece_detail.job_id, _piece_detail.piece_id, MIN(_scan.scan_date_time) as first_scan_date_time, MAX(_scan.scan_date_time) as latest_scan_date_time
FROM _piece_detail
INNER JOIN _container_quantity
ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN _container_summary
ON _container_quantity.container_id = _container_summary.container_id
AND _container_summary.job_id = _container_quantity.job_id
INNER JOIN _mail_piece_unit
ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
AND _container_quantity.job_id = _mail_piece_unit.job_id
INNER JOIN _header
ON _header.job_id = _piece_detail.job_id
INNER JOIN mailing_groups
ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
INNER JOIN _scan
ON _scan.zip = _piece_detail.zip
AND _scan.zip_4 = _piece_detail.zip_4
AND _scan.zip_delivery_point = _piece_detail.zip_delivery_point
AND _scan.serial_number = _piece_detail.serial_number
GROUP BY _piece_detail.job_id, _piece_detail.piece_id, _scan.zip, _scan.zip_4, _scan.zip_delivery_point, _scan.serial_number
) as t1 ON _piece_detail.job_id = t1.job_id AND _piece_detail.piece_id = t1.piece_id
SET _piece_detail.first_scan_date_time = t1.first_scan_date_time, _piece_detail.latest_scan_date_time = t1.latest_scan_date_time
WHERE _piece_detail.first_scan_date_time < t1.first_scan_date_time
OR _piece_detail.latest_scan_date_time > t1.latest_scan_date_time;
I thought that this may have been trying to load too much into memory at once and might not be using the indexes properly.
Then I thought that I might be able to avoid that huge joined subquery by adding two LEFT JOIN subqueries to get the min/max, like so:
UPDATE _piece_detail
INNER JOIN _container_quantity
ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN _container_summary
ON _container_quantity.container_id = _container_summary.container_id
AND _container_summary.job_id = _container_quantity.job_id
INNER JOIN _mail_piece_unit
ON _container_quantity.mpu_id = _mail_piece_unit.mpu_id
AND _container_quantity.job_id = _mail_piece_unit.job_id
INNER JOIN _header
ON _header.job_id = _piece_detail.job_id
INNER JOIN mailing_groups
ON _mail_piece_unit.mpu_company = mailing_groups.mpu_company
LEFT JOIN _scan fs ON (fs.zip, fs.zip_4, fs.zip_delivery_point, fs.serial_number) = (
SELECT zip, zip_4, zip_delivery_point, serial_number
FROM _scan
WHERE zip = _piece_detail.zip
AND zip_4 = _piece_detail.zip_4
AND zip_delivery_point = _piece_detail.zip_delivery_point
AND serial_number = _piece_detail.serial_number
ORDER BY scan_date_time ASC
LIMIT 1
)
LEFT JOIN _scan ls ON (ls.zip, ls.zip_4, ls.zip_delivery_point, ls.serial_number) = (
SELECT zip, zip_4, zip_delivery_point, serial_number
FROM _scan
WHERE zip = _piece_detail.zip
AND zip_4 = _piece_detail.zip_4
AND zip_delivery_point = _piece_detail.zip_delivery_point
AND serial_number = _piece_detail.serial_number
ORDER BY scan_date_time DESC
LIMIT 1
)
SET _piece_detail.first_scan_date_time = fs.scan_date_time, _piece_detail.latest_scan_date_time = ls.scan_date_time
WHERE _piece_detail.first_scan_date_time < fs.scan_date_time
OR _piece_detail.latest_scan_date_time > ls.scan_date_time
These are the explains when I convert them to SELECT statements:
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 844161 | NULL |
| 1 | PRIMARY | _piece_detail | eq_ref | PRIMARY,first_scan_date_time,latest_scan_date_time | PRIMARY | 18 | t1.job_id,t1.piece_id | 1 | Using where |
| 2 | DERIVED | _header | index | PRIMARY | date_prepared | 3 | NULL | 87 | Using index; Using temporary; Using filesort |
| 2 | DERIVED | _piece_detail | ref | PRIMARY,cqt_database_id,zip | PRIMARY | 10 | odms._header.job_id | 9703 | NULL |
| 2 | DERIVED | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity | unique | 14 | odms._header.job_id,odms._piece_detail.cqt_database_id | 1 | NULL |
| 2 | DERIVED | _mail_piece_unit | eq_ref | PRIMARY,company,job_id_mail_piece_unit | PRIMARY | 14 | odms._container_quantity.mpu_id,odms._header.job_id | 1 | Using where |
| 2 | DERIVED | mailing_groups | eq_ref | PRIMARY | PRIMARY | 27 | odms._mail_piece_unit.mpu_company | 1 | Using index |
| 2 | DERIVED | _container_summary | eq_ref | unique,container_id,job_id_container_summary | unique | 14 | odms._header.job_id,odms._container_quantity.container_id | 1 | Using index |
| 2 | DERIVED | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using index |
+----+-------------+---------------------+--------+----------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+--------+----------------------------------------------+
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
| 1 | PRIMARY | _header | index | PRIMARY | date_prepared | 3 | NULL | 87 | Using index |
| 1 | PRIMARY | _piece_detail | ref | PRIMARY,cqt_database_id,first_scan_date_time,latest_scan_date_time | PRIMARY | 10 | odms._header.job_id | 9703 | NULL |
| 1 | PRIMARY | _container_quantity | eq_ref | unique,mpu_id,job_id,job_id_container_quantity | unique | 14 | odms._header.job_id,odms._piece_detail.cqt_database_id | 1 | NULL |
| 1 | PRIMARY | _mail_piece_unit | eq_ref | PRIMARY,company,job_id_mail_piece_unit | PRIMARY | 14 | odms._container_quantity.mpu_id,odms._header.job_id | 1 | Using where |
| 1 | PRIMARY | mailing_groups | eq_ref | PRIMARY | PRIMARY | 27 | odms._mail_piece_unit.mpu_company | 1 | Using index |
| 1 | PRIMARY | _container_summary | eq_ref | unique,container_id,job_id_container_summary | unique | 14 | odms._header.job_id,odms._container_quantity.container_id | 1 | Using index |
| 1 | PRIMARY | fs | index | NULL | updated | 1 | NULL | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 1 | PRIMARY | ls | index | NULL | updated | 1 | NULL | 102462928 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 3 | DEPENDENT SUBQUERY | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort |
| 2 | DEPENDENT SUBQUERY | _scan | ref | PRIMARY | PRIMARY | 28 | odms._piece_detail.zip,odms._piece_detail.zip_4,odms._piece_detail.zip_delivery_point,odms._piece_detail.serial_number | 1 | Using where; Using index; Using filesort |
+----+--------------------+---------------------+--------+--------------------------------------------------------------------+---------------+---------+------------------------------------------------------------------------------------------------------------------------+-----------+-----------------------------------------------------------------+
Now, looking at the explains generated by each, I really can't tell which is giving me the best bang for my buck. The first one shows fewer total rows when multiplying the rows column, but the second appears to execute a bit quicker.
Is there anything that I could do to achieve the same results while increasing performance through modifying the query structure?
Disable index updates while doing the bulk update:
ALTER TABLE _piece_detail DISABLE KEYS;
UPDATE ....;
ALTER TABLE _piece_detail ENABLE KEYS;
Refer to the mysql docs : http://dev.mysql.com/doc/refman/5.0/en/alter-table.html
EDIT:
After looking at the MySQL docs I pointed to, I see the docs specify this for MyISAM tables, and it is not clear for other table types. Further solutions here: How to disable index in innodb
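For InnoDB specifically, a rough sketch of a partial substitute (these are standard MySQL session variables; whether they help depends on the workload, and constraints are not enforced while they are off):
SET unique_checks = 0;
SET foreign_key_checks = 0;
UPDATE ....;  -- the bulk update
SET unique_checks = 1;
SET foreign_key_checks = 1;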
There is something I was taught and strictly follow to this day: create as many temporary tables as you want, while avoiding the use of derived tables, especially in the case of UPDATE/DELETE/INSERT, because:
- you can't predict the indexes on derived tables
- the derived tables might not be held in memory if the result set is big
- the table (MyISAM) / rows (InnoDB) may be locked for longer each time the derived query runs; I prefer a temp table that has a primary-key join with the parent table
And most importantly, it makes your code look neat and readable.
My approach would be:
CREATE TEMPORARY TABLE xxx (...);
INSERT INTO xxx SELECT q FROM y INNER JOIN z ...;
UPDATE _piece_detail INNER JOIN xxx ON (...) SET ...;
Always reduce your downtime!
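A minimal sketch of that approach for this particular problem, assuming the column names from the question (the temp table name is illustrative, and the mailing_groups join chain plus the date-comparison filter from the original UPDATE are omitted for brevity):
CREATE TEMPORARY TABLE scan_bounds (
    job_id      INT NOT NULL,
    piece_id    INT NOT NULL,
    first_scan  DATETIME,
    latest_scan DATETIME,
    PRIMARY KEY (job_id, piece_id)
);

INSERT INTO scan_bounds
SELECT pd.job_id, pd.piece_id,
       MIN(s.scan_date_time), MAX(s.scan_date_time)
FROM _piece_detail pd
INNER JOIN _scan s
    ON s.zip = pd.zip
    AND s.zip_4 = pd.zip_4
    AND s.zip_delivery_point = pd.zip_delivery_point
    AND s.serial_number = pd.serial_number
GROUP BY pd.job_id, pd.piece_id;

-- primary-key join back to the parent table
UPDATE _piece_detail pd
INNER JOIN scan_bounds sb
    ON sb.job_id = pd.job_id
    AND sb.piece_id = pd.piece_id
SET pd.first_scan_date_time  = sb.first_scan,
    pd.latest_scan_date_time = sb.latest_scan;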
Why aren't you using sub-queries for each join? Including the inner joins?
INNER JOIN (SELECT field1, field2, field3 FROM _container_quantity ORDER BY 1,2,3) AS _container_quantity
ON _piece_detail.cqt_database_id = _container_quantity.cqt_database_id
AND _piece_detail.job_id = _container_quantity.job_id
INNER JOIN (SELECT field1, field2, field3 FROM _container_summary ORDER BY 1,2,3) AS _container_summary
ON _container_quantity.container_id = _container_summary.container_id
AND _container_summary.job_id = _container_quantity.job_id
You're definitely pulling a lot into memory by not limiting your selects on those inner joins. By using the order by 1,2,3 at the end of each sub-query you create an index on each sub-query. Your only index is on headers and you aren't joining on _headers....
A couple of suggestions to optimize this query: either create the indexes you need on each table, or use the sub-query join clauses to manually create the indexes you need on the fly.
Also remember that when you do a left join on a "temporary" table full of aggregates you are just asking for performance trouble.
Contains at least one matching _scan record on (zip, zip_4, zip_delivery_point, serial_number)
Umm...this is your first point in what you want to do, but none of these fields are indexed?
From your EXPLAIN results it seems that the subquery is going through all the rows twice. How about keeping the MIN/MAX from the first approach and using just one LEFT JOIN instead of two?
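A sketch of that single-join idea, again with the mailing_groups chain omitted and the original WHERE comparisons left to be re-added:
UPDATE _piece_detail
LEFT JOIN (
    SELECT zip, zip_4, zip_delivery_point, serial_number,
           MIN(scan_date_time) AS first_scan,
           MAX(scan_date_time) AS latest_scan
    FROM _scan
    GROUP BY zip, zip_4, zip_delivery_point, serial_number
) s ON s.zip = _piece_detail.zip
   AND s.zip_4 = _piece_detail.zip_4
   AND s.zip_delivery_point = _piece_detail.zip_delivery_point
   AND s.serial_number = _piece_detail.serial_number
SET _piece_detail.first_scan_date_time  = s.first_scan,
    _piece_detail.latest_scan_date_time = s.latest_scan
WHERE s.first_scan IS NOT NULL;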

MySQL Query Optimization; SELECT multiple fields vs. JOIN

We've got a relatively straightforward query that does LEFT JOINs across 4 tables. A is the "main" table or the top-most table in the hierarchy. B links to A, C links to B. Furthermore, X links to A. So the hierarchy is basically
A
C => B => A
X => A
The query is essentially:
SELECT
a.*, b.*, c.*, x.*
FROM
a
LEFT JOIN b ON b.a_id = a.id
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
WHERE
b.flag = true
ORDER BY
x.date DESC
LIMIT 25
Via EXPLAIN, I've confirmed that the correct indexes are in place, and that the built-in MySQL query optimizer is using those indexes correctly and properly.
So here's the strange part...
When we run the query as is, it takes about 1.1 seconds to run.
However, after doing some checking, it seems that if I removed most of the SELECT fields, I get a significant speed boost.
So if instead we made this into a two-step query process:
First query same as above except change the SELECT clause to only SELECT a.id instead of SELECT *
Second query also same as above, except change the WHERE clause to only do an a.id IN against the result of Query 1 instead of what we had before
The result is drastically different: 0.03 seconds for the first query and 0.02 for the second.
Doing this two-step query in code essentially gives us a 20x boost in performance.
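In code, the two-step version is roughly this (same simplified table names as above; the id list in the second query is filled in from the first query's result):
-- Step 1: fetch only the ids of the 25 rows we want
SELECT a.id
FROM a
LEFT JOIN b ON b.a_id = a.id
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
WHERE b.flag = true
ORDER BY x.date DESC
LIMIT 25;

-- Step 2: fetch the full rows for just those ids
SELECT a.*, b.*, c.*, x.*
FROM a
LEFT JOIN b ON b.a_id = a.id
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
WHERE a.id IN (/* ids from step 1 */)
ORDER BY x.date DESC;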
So here's my question:
Shouldn't this type of optimization already be done within the DB engine? Why does the difference in which fields that are actually SELECTed make a difference on the overall performance of the query?
At the end of the day, it's merely selecting the exact same 25 rows and returning the exact same full contents of those 25 rows. So, why the wide disparity in performance?
ADDED 2012-08-24 13:02 PM PDT
Thanks eggyal and invertedSpear for the feedback. First off, it's not a caching issue -- I've run tests running both queries multiple times (about 10 times) alternating between each approach. The result averages at 1.1 seconds for the first (single query) approach and .03+.02 seconds for the second (2 query) approach.
In terms of indexes, I thought I had done an EXPLAIN to ensure that we're going through the keys, and for the most part we are. However, I just did a quick check again, and one interesting thing to note:
The slower "single query" approach doesn't show the Extra note of "Using index" for the third line:
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | t1 | index | PRIMARY | shop_group_id_idx | 5 | NULL | 102 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t1.organization_id | 1 | Using where |
| 1 | SIMPLE | t0 | ref | bundle_idx,shop_id_idx | shop_id_idx | 4 | dbmodl_v18.t1.organization_id | 309 | |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t0.id | 1 | |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
While it does show "Using index" for when we query for just the IDs:
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | t1 | index | PRIMARY | shop_group_id_idx | 5 | NULL | 102 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | t2 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t1.organization_id | 1 | Using where |
| 1 | SIMPLE | t0 | ref | bundle_idx,shop_id_idx | shop_id_idx | 4 | dbmodl_v18.t1.organization_id | 309 | Using index |
| 1 | SIMPLE | t3 | eq_ref | PRIMARY | PRIMARY | 4 | dbmodl_v18.t0.id | 1 | |
+----+-------------+-------+--------+------------------------+-------------------+---------+-------------------------------+------+----------------------------------------------+
The strange thing is that both do list the correct index being used... but I guess it raises the questions:
Why are they different (considering all the other clauses are the exact same)? And is this an indication of why it's slower?
Unfortunately, the MySQL docs do not give much information for when the "Extra" column is blank/null in the EXPLAIN results.
More important than speed, you have a flaw in your query logic. When you test a LEFT JOINed column in the WHERE clause (other than testing for NULL), you force that join to behave as if it were an INNER JOIN. Instead, you'd want:
SELECT
a.*, b.*, c.*, x.*
FROM
a
LEFT JOIN b ON b.a_id = a.id
AND b.flag = true
LEFT JOIN c ON c.b_id = b.id
LEFT JOIN x ON x.a_id = a.id
ORDER BY
x.date DESC
LIMIT 25
My next suggestion would be to examine all of those .*'s in your SELECT. Do you really need all the columns from all the tables?

MySQL query optimization - distinct, order by and limit

I am trying to optimize the following query:
select distinct this_.id as y0_
from Rental this_
left outer join RentalRequest rentalrequ1_
on this_.id=rentalrequ1_.rental_id
left outer join RentalSegment rentalsegm2_
on rentalrequ1_.id=rentalsegm2_.rentalRequest_id
where
this_.DTYPE='B'
and this_.id<=1848978
and this_.billingStatus=1
and rentalsegm2_.endDate between 1273631699529 and 1274927699529
order by rentalsegm2_.id asc
limit 0, 100;
This query is run multiple times in a row for paginated processing of records (with a different limit each time). It returns the ids I need in the processing. My problem is that this query takes more than 3 seconds. I have about 2 million rows in each of the three tables.
Explain gives:
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | rentalsegm2_ | range | index_endDate,fk_rentalRequest_id_BikeRentalSegment | index_endDate | 9 | NULL | 449904 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rentalrequ1_ | eq_ref | PRIMARY,fk_rental_id_BikeRentalRequest | PRIMARY | 8 | solscsm_main.rentalsegm2_.rentalRequest_id | 1 | Using where |
| 1 | SIMPLE | this_ | eq_ref | PRIMARY,index_billingStatus | PRIMARY | 8 | solscsm_main.rentalrequ1_.rental_id | 1 | Using where |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+----------------------------------------------+
I tried to remove the distinct and the query ran three times faster. EXPLAIN without the distinct gives:
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
| 1 | SIMPLE | rentalsegm2_ | range | index_endDate,fk_rentalRequest_id_BikeRentalSegment | index_endDate | 9 | NULL | 451972 | Using where; Using filesort |
| 1 | SIMPLE | rentalrequ1_ | eq_ref | PRIMARY,fk_rental_id_BikeRentalRequest | PRIMARY | 8 | solscsm_main.rentalsegm2_.rentalRequest_id | 1 | Using where |
| 1 | SIMPLE | this_ | eq_ref | PRIMARY,index_billingStatus | PRIMARY | 8 | solscsm_main.rentalrequ1_.rental_id | 1 | Using where |
+----+-------------+--------------+--------+-----------------------------------------------------+---------------+---------+--------------------------------------------+--------+-----------------------------+
As you can see, the Using temporary is added when using distinct.
I already have an index on all fields used in the where clause.
Is there anything I can do to optimize this query?
Thank you very much!
Edit: I tried to ORDER BY this_.id as suggested, and the query was 5x slower. Here is the explain plan:
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | this_ | ref | PRIMARY,index_billingStatus | index_billingStatus | 5 | const | 782348 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rentalrequ1_ | ref | PRIMARY,fk_rental_id_BikeRentalRequest | fk_rental_id_BikeRentalRequest | 9 | solscsm_main.this_.id | 1 | Using where; Using index; Distinct |
| 1 | SIMPLE | rentalsegm2_ | ref | index_endDate,fk_rentalRequest_id_BikeRentalSegment | fk_rentalRequest_id_BikeRentalSegment | 8 | solscsm_main.rentalrequ1_.id | 1 | Using where; Distinct |
+----+-------------+--------------+------+-----------------------------------------------------+---------------------------------------+---------+------------------------------+--------+----------------------------------------------+
From the execution plan we see that the optimizer is smart enough to understand that you do not require OUTER JOINs here. Anyway, it is better to specify that explicitly.
The DISTINCT modifier means that you want to GROUP BY all fields in the SELECT part, that is, ORDER BY all of the specified fields and then discard duplicates. In other words, the order by rentalsegm2_.id asc clause does not make any sense here.
The query below should return the equivalent result:
select distinct this_.id as y0_
from Rental this_
join RentalRequest rentalrequ1_
on this_.id=rentalrequ1_.rental_id
join RentalSegment rentalsegm2_
on rentalrequ1_.id=rentalsegm2_.rentalRequest_id
where
this_.DTYPE='B'
and this_.id<=1848978
and this_.billingStatus=1
and rentalsegm2_.endDate between 1273631699529 and 1274927699529
limit 0, 100;
UPD
If you want the execution plan to start with RentalSegment, you will need to add the following indices to the database:
RentalSegment (endDate)
RentalRequest (id, rental_id)
Rental (id, DTYPE, billingStatus) or (id, billingStatus, DTYPE)
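For example (a sketch; the index names are illustrative):
ALTER TABLE RentalSegment ADD INDEX idx_end_date (endDate);
ALTER TABLE RentalRequest ADD INDEX idx_id_rental (id, rental_id);
ALTER TABLE Rental ADD INDEX idx_id_dtype_billing (id, DTYPE, billingStatus);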
The query then could be rewritten as the following:
SELECT this_.id as y0_
FROM RentalSegment rs
JOIN RentalRequest rr
JOIN Rental this_
WHERE rs.endDate between 1273631699529 and 1274927699529
AND rs.rentalRequest_id = rr.id
AND rr.rental_id <= 1848978
AND rr.rental_id = this_.id
AND this_.DTYPE='B'
AND this_.billingStatus = 1
GROUP BY this_.id
LIMIT 0, 100;
If the execution plan does not start from RentalSegment, you can force it with STRAIGHT_JOIN.
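For example, a sketch of the same query with the join order pinned:
SELECT STRAIGHT_JOIN this_.id AS y0_
FROM RentalSegment rs
JOIN RentalRequest rr ON rs.rentalRequest_id = rr.id
JOIN Rental this_ ON rr.rental_id = this_.id
WHERE rs.endDate BETWEEN 1273631699529 AND 1274927699529
  AND rr.rental_id <= 1848978
  AND this_.DTYPE = 'B'
  AND this_.billingStatus = 1
GROUP BY this_.id
LIMIT 0, 100;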
The reason the query without the distinct runs faster is that you have a limit clause. Without the distinct, the server only needs to look at the first hundred matches. However, some of those rows may have duplicate fields, so if you introduce the distinct clause, the server has to look at many more rows in order to find ones that do not have duplicate values.
BTW, why are you using OUTER JOIN?
Here for "rentalsegm2_" table, optimizer has chosen "index_endDate" index and its no of rows expected from this table is about 4.5 lakhs. Since there are other where conditions exist, you can check for "this_" table indexes . I mean you can check in "this_ table" for how much records affected for each where conditions.
In summary, you can try for alternate solutions by changing indices used by optimizer.
This can be obtained by "USE INDEX", "FORCE INDEX" commands.