MySQL JOIN two tables and ORDER BY seem not use proper indexes

I am having trouble with the performance of a MySQL query. To summarize, I would like to join table A to table B and order the results by two columns from table A. My approach is to create a combined index on (i) the column used to join tables A and B, and (ii) the two columns by which I would like to order the results. However, as soon as I join the two tables, the behavior is unexpected: the ORDER BY clause no longer seems to use the index.
I use the following query:
SELECT SQL_NO_CACHE *
FROM
t_patent_documents A USE INDEX (idx_t_patent_documents_result_order) INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
ORDER BY
A.language_id ASC,
A.result_order ASC
LIMIT 100
For table t_patent_documents, I have the combined index idx_t_patent_documents_result_order defined on the columns (publication_id, language_id, result_order). Furthermore, publication_id is the primary key of t_patent_documents. The explain plan is as follows:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|------|-------------------------------------|-------------------------------------|---------|------------------|-----------|----------|-----------------------------------|
| 1 | SIMPLE | B | NULL | ALL | NULL | NULL | NULL | NULL | 132162247 | 100.00 | "Using temporary; Using filesort" |
| 1 | SIMPLE | A | NULL | ref | idx_t_patent_documents_result_order | idx_t_patent_documents_result_order | 4 | B.publication_id | 1 | 100.00 | NULL |
If I do the following (without forcing the index):
SELECT SQL_NO_CACHE *
FROM
t_patent_documents A INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
ORDER BY
A.language_id ASC,
A.result_order ASC
LIMIT 100
Then the optimizer chooses to use the primary key only:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|--------|---------------|---------|---------|------------------|-----------|----------|-----------------------------------|
| 1 | SIMPLE | B | NULL | ALL | NULL | NULL | NULL | NULL | 132162247 | 100.00 | "Using temporary; Using filesort" |
| 1 | SIMPLE | A | NULL | eq_ref | "PRIMARY,idx_t_patent_documents_pubid_country_priority_dt,idx_t_patent_documents_pubid_priority_dt,idx_t_patent_documents_pubid_country_ucid,idx_t_patent_documents_pubid_ucid_priority_dt,idx_t_patent_documents_pubid_country_ucid_priority_dt,idx_t_patent_documents_result_order" | PRIMARY | 4 | B.publication_id | 1 | 100.00 | NULL |
Now, when I do not join table B on table A, but I ORDER BY the three columns on which I defined the index, i.e. (publication_id, language_id, result_order), it seems to pick up the indexes properly. The key_len here is indeed 14:
SELECT SQL_NO_CACHE *
FROM
t_patent_documents A USE INDEX(idx_t_patent_documents_result_order)
ORDER BY
A.publication_id ASC,
A.language_id ASC,
A.result_order ASC
LIMIT 100
This results in the following explain plan:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|---------------|-------------------------------------|---------|------|------|----------|-------|
| 1 | SIMPLE | A | NULL | index | NULL | idx_t_patent_documents_result_order | 14 | NULL | 100 | 100.00 | NULL |
Does someone understand this behaviour? Ideally, I would be able to join another table and still be able to quickly order the results.
Thanks in advance!
Update 1:
I also tried adding publication_id to the ORDER BY clause, although from an output perspective this makes no sense, because publication_id is unique. The query looks as follows:
SELECT SQL_NO_CACHE *
FROM
t_patent_documents A USE INDEX (idx_t_patent_documents_result_order) INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
ORDER BY
A.publication_id ASC,
A.language_id ASC,
A.result_order ASC
LIMIT 100
The resulting explain plan:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|------|-------------------------------------|-------------------------------------|---------|------------------|-----------|----------|-----------------------------------|
| 1 | SIMPLE | B | NULL | ALL | NULL | NULL | NULL | NULL | 132162247 | 100.00 | "Using temporary; Using filesort" |
| 1 | SIMPLE | A | NULL | ref | idx_t_patent_documents_result_order | idx_t_patent_documents_result_order | 4 | B.publication_id | 1 | 100.00 | NULL |
Update 2:
I also tried running the query without the multi-column index on all three columns, using instead an index on only the two columns I would like to sort by, (language_id, result_order), called x_temp_idx_1. This did have an effect in the sense that key_len is 14 again; however, the join part now takes forever. The query that I ran:
explain SELECT SQL_NO_CACHE A.language_id, A.result_order
FROM
t_patent_documents A USE INDEX(x_temp_idx_1) INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
ORDER BY
A.language_id ASC,
A.result_order ASC
LIMIT 100
The corresponding explain plan:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|---------------|--------------|---------|------|---------|----------|------------------------------------------------|
| 1 | SIMPLE | A | NULL | index | NULL | x_temp_idx_1 | 14 | NULL | 1795275 | 100.00 | "Using index; Using temporary; Using filesort" |
| 1 | SIMPLE | B | NULL | ALL | NULL | NULL | NULL | NULL | 2469412 | 10.00 | "Using where; Using join buffer (hash join)" |
Note that this query ran on another db (development) with less information. That's why the number of rows does not correspond to that of previous explain plans.
Update 3:
All the above queries are simplifications of the actual query I would like to run to not overcomplicate the question. In the real use case, I need to filter on a full-text index in table B. The table includes a full-text index on the column invention_title:
explain SELECT SQL_NO_CACHE *
FROM
t_patent_documents A INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
WHERE
MATCH(B.invention_title) AGAINST("+hydraulic" IN BOOLEAN MODE)
ORDER BY
A.language_id ASC,
A.result_order ASC
LIMIT 100
The resulting explain plan shows that it again uses only the primary key to do the join and fails to use the multi-column index. If I force the index again, the key_len is 4, and it does not actually seem to use the multi-column index:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|----------|---------------|-------------------------------------------------|---------|------------------|------|----------|----------------------------------------------------------------------|
| 1 | SIMPLE | B | NULL | fulltext | idx_ft_inv_title_int_content_combined_inv_title | idx_ft_inv_title_int_content_combined_inv_title | 0 | const | 1 | 100.00 | "Using where; Ft_hints: no_ranking; Using temporary; Using filesort" |
| 1 | SIMPLE | A | NULL | eq_ref | "PRIMARY,idx_t_patent_documents_pubid_country_priority_dt,idx_t_patent_documents_pubid_priority_dt,idx_t_patent_documents_pubid_country_ucid,idx_t_patent_documents_pubid_ucid_priority_dt,idx_t_patent_documents_pubid_country_ucid_priority_dt,idx_t_patent_documents_result_order,x_temp_idx_2" | PRIMARY | 4 | B.publication_id | 1 | 100.00 | NULL |

For your first query, remove the USE INDEX and change the index to this order:
(language_id, result_order, publication_id)
The idea is to have the INDEX match the ORDER BY. The optimizer gives preference to WHERE and GROUP BY, but you don't have either. So, the intent for your cases is to be able to stop scanning after LIMIT rows by having the index match the ORDER BY. This would work nicely for just A.
It may help to change the SELECT * to specify only the necessary columns.
If there is exactly 1 B row for each A row, then there is another optimization. Since this is not the case, the JOIN must be performed, then sort, and only finally deliver LIMIT rows.
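The index reordering described above could be sketched as follows. The index name idx_lang_order_pubid is hypothetical, and the selected columns are only an illustration of replacing SELECT *; whether the optimizer keeps this access path once B is joined depends on the join order it picks, so check the result with EXPLAIN.

```sql
-- Hypothetical index name; the column order follows the ORDER BY,
-- with publication_id last so the join column can be read from the index.
ALTER TABLE t_patent_documents
  ADD INDEX idx_lang_order_pubid (language_id, result_order, publication_id);

-- Single-table case: the index can deliver rows already sorted,
-- so the scan may stop after LIMIT rows.
EXPLAIN
SELECT A.publication_id, A.language_id, A.result_order
FROM t_patent_documents A
ORDER BY A.language_id ASC, A.result_order ASC
LIMIT 100;
```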
re: Update 3
That is as expected.
Use FT index to find the (hopefully) few rows matching.
Reach into the other table in a very efficient way, namely via the PK.
Grab the other columns needed.
Sort to achieve the ORDER BY.
Peel off LIMIT rows.
There is no further optimization. A possible drag on things is if * includes some bulky TEXT or BLOB columns.

As Rick James mentioned, the query seems to be as optimized as it can be. The main problem here is that the set of records remaining after filtering with the FULLTEXT index is large. The sort cannot use the indexes, because the PRIMARY key has been used to join table B to table A.
Because of this, I opted for another solution: limiting the number of records returned from B, so that the sort only happens on this subset. This solution is reasonable for my use case, because my query serves a front-end application where it is unlikely that the user will iterate through more than a couple of thousand records anyway. The LIMIT can be set loosely, such that it is highly unlikely that a user will ever reach the end, unless the provided filter criteria are so tight that the user finds all records meeting their criteria.
The query will therefore be:
SELECT SQL_NO_CACHE A.*
FROM
t_patent_documents A INNER JOIN
(SELECT publication_id FROM t_inv_title_int_content_combined WHERE MATCH(invention_title) AGAINST("+hydraulic" IN BOOLEAN MODE) LIMIT 10000) B on A.publication_id = B.publication_id
ORDER BY
A.language_id ASC,
A.result_order DESC
LIMIT 100
This performs like a charm.

Related

NOT IN subquery versus ON != Operation

I have two tables called ny_clean (3454602 entries) and pickup_0_ids_temp_table (2739268 entries), both of which have an id CHAR(11) column that is the primary key and has a BTREE index on top of it (MySQL 5.7).
The "id" column in pickup_0_ids_temp_table is a subset of ny_clean and I want to get a result which is ny_clean without the id values from pickup_0_ids_temp_table.
Option 1:
EXPLAIN
SELECT *
FROM pickup_0_ids_temp_table as t
JOIN ny_clean as n
ON n.id != t.id;
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
| 1 | SIMPLE | t | NULL | index | NULL | PRIMARY | 11 | NULL | 2734512 | 100.00 | Using index |
| 1 | SIMPLE | ny_clean | NULL | index | NULL | btree_pk_ny_clean | 11 | NULL | 3445904 | 90.00 | Using where; Using index; Using join buffer (Block Nested Loop) |
+----+-------------+----------+------------+-------+---------------+-------------------+---------+------+---------+----------+-----------------------------------------------------------------+
Option 2:
EXPLAIN
SELECT *
FROM ny_clean as n
WHERE n.id NOT IN (
SELECT id
FROM pickup_0_ids_temp_table);
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
| 1 | PRIMARY | n | NULL | ALL | NULL | NULL | NULL | NULL | 3445904 | 100.00 | Using where |
| 2 | DEPENDENT SUBQUERY | pickup_0_ids_temp_table | NULL | unique_subquery | PRIMARY,btree_pickup_0 | PRIMARY | 11 | func | 1 | 100.00 | Using index |
+----+--------------------+-------------------------+------------+-----------------+------------------------+---------+---------+------+---------+----------+-------------+
I then use one of the options inside this larger query
EXPLAIN
INSERT INTO y
SELECT id, pickup_longitude, pickup_latitude
FROM x
JOIN
(OPTION 1 OR 2) as z
ON z.id = x.id;
When I used Option 1 inside the larger query, it ran for two days and still had not finished. Option 2, on the other hand, did the job in less than 30 minutes.
My question: Why is that?
Following the MySQL documentation (https://dev.mysql.com/doc/refman/5.7/en/subquery-materialization.html), I would suspect that it is due to materialization of the subquery, but how would I check this?
And am I interpreting the EXPLAIN output wrong? Judging from it, I would expect Option 1 to be faster, since it uses an index on both tables.
Or does it have to do with the larger query?
Thanks in advance

Your option 1 doesn't do what you think it does.
If you have two tables
n.id t.id
1 1
2 2
3 3
ON n.id != t.id;
You get:
1,2
1,3
2,1
2,3
3,1
3,2
That is almost a cartesian product. So 3.4 million x 2.7 million is roughly 9.18 trillion (9.18 x 10^12) rows.
Then you try to do a JOIN, and because that materialized table doesn't have an index, it will take a very long time.
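For comparison, the result the question actually asks for (ny_clean without the ids in pickup_0_ids_temp_table) is an anti-join. A minimal sketch using the table and column names from the question; the two forms return the same rows here because id is a non-NULL primary key in both tables:

```sql
-- Rows of ny_clean whose id does not appear in pickup_0_ids_temp_table.
SELECT n.*
FROM ny_clean AS n
WHERE NOT EXISTS (
    SELECT 1
    FROM pickup_0_ids_temp_table AS t
    WHERE t.id = n.id
);

-- Same result written as a LEFT JOIN that keeps only unmatched rows.
SELECT n.*
FROM ny_clean AS n
LEFT JOIN pickup_0_ids_temp_table AS t ON t.id = n.id
WHERE t.id IS NULL;
```

Both forms look up each ny_clean row in the other table's primary key, rather than pairing every row with every non-matching row.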

mysql; how to analyse a slow query that occurs in only one database

I'm investigating some long-running queries in my PRODUCTION MySQL 5.7 database. One particular query is taking over 60 seconds.
My usual approach is to take a dump of the data from PROD, import it into a DEV database, reproduce the issue, then analyse and try out some tweaks to the query.
However, the exact same query in DEV is taking less than a second.
Obviously, the mysql configuration, table structure, record numbers, etc are all the same as in PROD.
The query itself is a select with joins across 3 tables with a where clause on each table; 2 of the tables have approx 15m records in them. My initial suspicion was the lack of indexes on the queried columns, but the fact that in DEV it runs very fast would appear to disprove that.
What can I do to shed some light on this?
EXPLAIN results of my query:
PROD
EXPLAIN select this_.id as y0_ from event this_ inner join member m1_ on this_.member_id=m1_.id inner join event_type et2_ on this_.type_id=et2_.id where m1_.submission_id=40646 and this_.status in ('SUPPRESSED') and et2_.name in ('Salary') order by m1_.ni_number asc, m1_.ident1 asc, m1_.ident2 asc, m1_.ident3 asc, m1_.id asc, et2_.name asc limit 15;
+----+-------------+-------+------------+--------+-------------------------------------+-------------------+---------+--------------------------+------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+-------------------------------------+-------------------+---------+--------------------------+------+----------+----------------------------------------------+
| 1 | SIMPLE | et2_ | NULL | ALL | PRIMARY | NULL | NULL | NULL | 17 | 10.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | this_ | NULL | ref | FK5C6729A2434DA80,FK5C6729AE4E22C6E | FK5C6729AE4E22C6E | 8 | iconnect.et2_.id | 4166 | 10.00 | Using where |
| 1 | SIMPLE | m1_ | NULL | eq_ref | PRIMARY,IND_submission_id | PRIMARY | 8 | iconnect.this_.member_id | 1 | 5.00 | Using where |
+----+-------------+-------+------------+--------+-------------------------------------+-------------------+---------+--------------------------+------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.00 sec)
DEV
EXPLAIN select this_.id as y0_ from event this_ inner join member m1_ on this_.member_id=m1_.id inner join event_type et2_ on this_.type_id=et2_.id where m1_.submission_id=40646 and this_.status in ('SUPPRESSED') and et2_.name in ('Salary') order by m1_.ni_number asc, m1_.ident1 asc, m1_.ident2 asc, m1_.ident3 asc, m1_.id asc, et2_.name asc limit 15;
+----+-------------+-------+------------+------+-------------------------------------+-------------------+---------+-----------------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+-------------------------------------+-------------------+---------+-----------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | et2_ | NULL | ALL | PRIMARY | NULL | NULL | NULL | 17 | 10.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | m1_ | NULL | ref | PRIMARY,IND_submission_id | IND_submission_id | 8 | const | 26644 | 100.00 | NULL |
| 1 | SIMPLE | this_ | NULL | ref | FK5C6729A2434DA80,FK5C6729AE4E22C6E | FK5C6729A2434DA80 | 8 | iconnect.m1_.id | 2 | 1.86 | Using where |
+----+-------------+-------+------------+------+-------------------------------------+-------------------+---------+-----------------+-------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.03 sec)
I have also spotted that the Cardinality of some of the indexes accessed by this query is massively different between DEV and PROD:
FK5C6729AE4E22C6E: DEV=9, PROD=3792
IND_submission_id: DEV=2490, PROD=74220
Could this be impacting performance in PROD?

The inefficiency came down to the tables containing more data than the sampled index pages could represent, so the index statistics were inaccurate. Increasing innodb_stats_persistent_sample_pages from 20 to 100 and then running ANALYZE TABLE changed the execution plan for the query to the expected one, after which the query took less than 1 second.
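A sketch of those steps, assuming the three tables from the query in the question (event, member, event_type) are the ones whose statistics need refreshing. SET GLOBAL requires sufficient privileges and, in MySQL 5.7, does not survive a server restart unless the setting is also placed in the configuration file.

```sql
-- Sample more index pages when InnoDB estimates index cardinality.
SET GLOBAL innodb_stats_persistent_sample_pages = 100;

-- Recompute the persistent statistics for the tables used by the query.
ANALYZE TABLE event, member, event_type;

-- Inspect the Cardinality column and compare it with the earlier values.
SHOW INDEX FROM event;
SHOW INDEX FROM member;
```

Re-running EXPLAIN on the slow query afterwards shows whether the join order has changed.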

How does adding GROUP BY make this query more efficient?

I don't understand mysql's EXPLAIN output for the following two queries.
In the first query mysql has to select 1238264 records first:
explain select
count(distinct utc.id)
from
user_to_company utc
inner join
users u
on utc.user_id=u.id
where
u.is_removed=false
order by
utc.user_id asc limit 20;
+----+-------------+--------+------+----------------------------+---------+---------+---------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+----------------------------+---------+---------+---------------------------------+---------+-------------+
| 1 | SIMPLE | u | ALL | PRIMARY | NULL | NULL | NULL | 1238264 | Using where |
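| 1 | SIMPLE | utc | ref | user_id,FKF513E0271C2D1677 | user_id | 8 | u.id | 1 | Using index |
+----+-------------+--------+------+----------------------------+---------+---------+---------------------------------+---------+-------------+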
In the second query, a GROUP BY was added, which makes mysql select only 20 records:
explain select
count(distinct utc.id)
from
user_to_company utc
inner join
users u
on utc.user_id=u.id
where
u.is_removed=false
group by
utc.user_id
order by
utc.user_id asc limit 20;
+----+-------------+--------+--------+----------------------------+--------------------+---------+-------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+--------+----------------------------+--------------------+---------+-------------------------+------+-------------+
| 1 | SIMPLE | utc | index | user_id,FKF513E0271C2D1677 | FKF513E0271C2D1677 | 8 | NULL | 20 | Using index |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 8 | utc.user_id | 1 | Using where |
+----+-------------+--------+--------+----------------------------+--------------------+---------+-------------------------+------+-------------+
For more info, there are 1333194 records in the users table and 1327768 records in user_to_company table.
How does adding the GROUP BY make mysql select only 20 records in the first pass?

The first query has to read all the data to find all the values of utc.id. It returns only one row, which is a summary for the whole table. So, it has to generate all the data.
The second query is producing a separate total for each utc.user_id. You have a limit clause and an index on utc.user_id. MySQL is, apparently, smart enough to recognize that it can go to the index to get the first 20 values of utc.user_id. It uses these to generate the counts.
I am surprised that MySQL is smart enough to do this (although the logic is documented pretty well here). But it makes perfect sense that the second query can be optimized this way where the first one cannot be.

MySQL using Filesort and Query is very slow?

I have a query :
SELECT listings.*, listingagents.agentid
FROM listings
LEFT JOIN listingagents ON (listingagents.id = listings.listingagentid)
LEFT JOIN ignore ON (ignore.system_key = listings.listingid)
WHERE ignore.id IS NULL
ORDER BY listings.id ASC
I am trying to improve the performance of this query since it is very slow and it is putting a heavy load on the MySQL server.
When I do a mysql explain, output shows :
+--------+-------------+---------------+--------+---------------+------------+---------+----------------------------+--------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+--------+-------------+---------------+--------+---------------+------------+---------+----------------------------+--------+-------------------------+
| 1 | SIMPLE | listings | ALL | NULL | NULL | NULL | NULL | 383360 | Using filesort |
| 1 | SIMPLE | listingagents | eq_ref | PRIMARY | PRIMARY | 4 | db.listings.listingagen... | 1 | |
| 1 | SIMPLE | ignore | ref | system_key | system_key | 1 | const | 404 | Using where; Not exists |
+--------+-------------+---------------+--------+---------------+------------+---------+----------------------------+--------+-------------------------+
I tried to do a simple query:
SELECT listings.*
FROM listings
ORDER BY listings.id ASC
And that query also has "Using filesort".
The fields "listings.id", "listingagents.id" and "ignore.id" are Primary Keys
The fields "listingagents.id" and "ignore.system_key" have indexes.
What can I do to improve the 1st query?

Try to reduce the range of listings rows scanned (currently 383360 rows) by adding a condition, e.g. id > x, or a LIMIT.
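One way to apply that suggestion is to read the rows in batches keyed on listings.id. This is only a sketch: the starting value 0 and the batch size of 1000 are placeholders, and the backticks around ignore are added because IGNORE is a MySQL keyword.

```sql
-- Fetch one batch at a time instead of all ~383k rows.
-- Replace 0 with the last listings.id returned by the previous batch.
SELECT listings.*, listingagents.agentid
FROM listings
LEFT JOIN listingagents ON (listingagents.id = listings.listingagentid)
LEFT JOIN `ignore` ON (`ignore`.system_key = listings.listingid)
WHERE `ignore`.id IS NULL
  AND listings.id > 0
ORDER BY listings.id ASC
LIMIT 1000;
```

Because the filter and the sort are both on the primary key of listings, each batch can in principle be read in key order; EXPLAIN will confirm whether the filesort disappears.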

Adding LIMIT to query increases query time by over 1000%

I am running the following query:
SELECT
MyField,
COUNT(*) AS MyCount
FROM
MyTable
NATURAL JOIN
AnotherTable
WHERE
Timestamp >= 1000 AND Timestamp <= 10000
GROUP BY
MyField
ORDER BY
MyCount DESC;
This runs fine and takes about 6 seconds to complete. If I want to limit the result to show only the 20 highest MyCounts, I add LIMIT 20 on to the end of the query. Suddenly it takes 6 minutes to complete!
The EXPLAIN output for the original query:
+----+-------------+-------------+--------+---------------------------+---------+---------+---------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+---------------------------+---------+---------+---------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | MyTable | ALL | mytable_fkey | NULL | NULL | NULL | 6858209 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | AnotherTable| eq_ref | PRIMARY | PRIMARY | 4 | test.MyTable.FKeyID | 1 | Using index |
+----+-------------+-------------+--------+---------------------------+---------+---------+---------------------------+---------+----------------------------------------------+
The EXPLAIN output for the query with LIMIT 20:
+----+-------------+-------------+--------+---------------------------+-------------------------+---------+---------------------------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+---------------------------+-------------------------+---------+---------------------------+---------+----------------------------------------------+
| 1 | SIMPLE | MyTable | index | mytable_fkey | myfield_timestamp_index | 771 | NULL | 6858209 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | AnotherTable| eq_ref | PRIMARY | PRIMARY | 4 | test.MyTable.FKeyID | 1 | Using index |
+----+-------------+-------------+--------+---------------------------+-------------------------+---------+---------------------------+---------+----------------------------------------------+
What is the explanation for this? Is there a better way I can limit the number of rows?

If you see Using temporary; Using filesort in your EXPLAIN output, you are probably missing a suitable index and you're getting killed because of it.
Make sure your JOIN and GROUP BY fields are both available in the same index.
