How does adding GROUP BY make this query more efficient? - mysql

I don't understand mysql's EXPLAIN output for the following two queries.
In the first query mysql has to select 1238264 records first:
explain select
count(distinct utc.id)
from
user_to_company utc
inner join
users u
on utc.user_id=u.id
where
u.is_removed=false
order by
utc.user_id asc limit 20;
+----+-------------+--------+------+----------------------------+---------+---------+---------------------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+----------------------------+---------+---------+---------------------------------+---------+-------------+
| 1 | SIMPLE | u | ALL | PRIMARY | NULL | NULL | NULL | 1238264 | Using where |
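| 1 | SIMPLE | utc | ref | user_id,FKF513E0271C2D1677 | user_id | 8 | u.id | 1 | Using index |
+----+-------------+--------+------+----------------------------+---------+---------+---------------------------------+---------+-------------+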
In the second query, a GROUP BY was added, which makes mysql select only 20 records:
explain select
count(distinct utc.id)
from
user_to_company utc
inner join
users u
on utc.user_id=u.id
where
u.is_removed=false
group by
utc.user_id
order by
utc.user_id asc limit 20;
+----+-------------+--------+--------+----------------------------+--------------------+---------+-------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+--------+----------------------------+--------------------+---------+-------------------------+------+-------------+
| 1 | SIMPLE | utc | index | user_id,FKF513E0271C2D1677 | FKF513E0271C2D1677 | 8 | NULL | 20 | Using index |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 8 | utc.user_id | 1 | Using where |
+----+-------------+--------+--------+----------------------------+--------------------+---------+-------------------------+------+-------------+
For more info, there are 1333194 records in the users table and 1327768 records in user_to_company table.
How does adding the GROUP BY make mysql select only 20 records in the first pass?

The first query has to read all the data to find all the values of utc.id. It returns only one row, which is a summary for the whole table. So, it has to generate all the data.
The second query is producing a separate total for each utc.user_id. You have a limit clause and an index on utc.user_id. MySQL is, apparently, smart enough to recognize that it can go to the index to get the first 20 values of utc.user_id. It uses these to generate the counts.
I am surprised that MySQL is smart enough to do this (although the logic is documented pretty well here). But it makes perfect sense that the second query can be optimized this way where the first one cannot be.
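As a small illustration of the first point (a sketch, not part of the original question): because the first query aggregates over the whole join, it produces a single row, so its ORDER BY and LIMIT 20 have nothing to order or cut off. Assuming the server accepts the original query at all, the following should return the same single row, and it still has to visit every matching record:

```sql
-- Same aggregate as the first query, without the ORDER BY / LIMIT,
-- which cannot reduce the work for a one-row result.
SELECT COUNT(DISTINCT utc.id)
FROM user_to_company utc
INNER JOIN users u ON utc.user_id = u.id
WHERE u.is_removed = false;
```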

Related

MySQL JOIN two tables and ORDER BY seem not use proper indexes

I am having trouble with the performance of a MySQL query. To summarize, I would like to join table A to table B, and order the results based on two columns from table A. My approach is to create a combined index on (i) the column used to join tables A and B, and (ii) the two columns on which I would like to order the results. However, as soon as I join the two tables, the behavior seems unexpected, and the ORDER BY clause does not seem to use the index anymore.
I use the following query:
SELECT SQL_NO_CACHE *
FROM
t_patent_documents A USE INDEX (idx_t_patent_documents_result_order) INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
ORDER BY
A.language_id ASC,
A.result_order ASC
LIMIT 100
For table t_patent_documents, I have the combined index idx_t_patent_documents_result_order defined on the columns (publication_id, language_id, result_order). Furthermore, publication_id is the primary key of t_patent_documents. The explain plan is as follows:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|------|-------------------------------------|-------------------------------------|---------|------------------|-----------|----------|-----------------------------------|
| 1 | SIMPLE | B | NULL | ALL | NULL | NULL | NULL | NULL | 132162247 | 100.00 | "Using temporary; Using filesort" |
| 1 | SIMPLE | A | NULL | ref | idx_t_patent_documents_result_order | idx_t_patent_documents_result_order | 4 | B.publication_id | 1 | 100.00 | NULL |
If I do the following (without forcing the index):
SELECT SQL_NO_CACHE *
FROM
t_patent_documents A INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
ORDER BY
A.language_id ASC,
A.result_order ASC
LIMIT 100
Then the optimizer chooses to use the primary key only:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|--------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|---------|------------------|-----------|----------|-----------------------------------|
| 1 | SIMPLE | B | NULL | ALL | NULL | NULL | NULL | NULL | 132162247 | 100.00 | "Using temporary; Using filesort" |
| 1 | SIMPLE | A | NULL | eq_ref | "PRIMARY,idx_t_patent_documents_pubid_country_priority_dt,idx_t_patent_documents_pubid_priority_dt,idx_t_patent_documents_pubid_country_ucid,idx_t_patent_documents_pubid_ucid_priority_dt,idx_t_patent_documents_pubid_country_ucid_priority_dt,idx_t_patent_documents_result_order" | PRIMARY | 4 | B.publication_id | 1 | 100.00 | NULL |
Now, when I do not join table B on table A, but I ORDER BY the three columns on which I defined the index, i.e. (publication_id, language_id, result_order), it seems to pick up the indexes properly. The key_len here is indeed 14:
SELECT SQL_NO_CACHE *
FROM
t_patent_documents A USE INDEX(idx_t_patent_documents_result_order)
ORDER BY
A.publication_id ASC,
A.language_id ASC,
A.result_order ASC
LIMIT 100
This results in the following explain plan:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|---------------|-------------------------------------|---------|------|------|----------|-------|
| 1 | SIMPLE | A | NULL | index | NULL | idx_t_patent_documents_result_order | 14 | NULL | 100 | 100.00 | NULL |
Does someone understand this behaviour? Ideally, I would be able to join another table and still be able to quickly order the results.
Thanks in advance!
Update 1:
I also tried adding publication_id in the ORDER BY clause. Although from an output perspective this would not make sense at all, because the publication_id is unique. The query would look as follows:
SELECT SQL_NO_CACHE *
FROM
t_patent_documents A USE INDEX (idx_t_patent_documents_result_order) INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
ORDER BY
A.publication_id ASC,
A.language_id ASC,
A.result_order ASC
LIMIT 100
The resulting explain plan:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|------|-------------------------------------|-------------------------------------|---------|------------------|-----------|----------|-----------------------------------|
| 1 | SIMPLE | B | NULL | ALL | NULL | NULL | NULL | NULL | 132162247 | 100.00 | "Using temporary; Using filesort" |
| 1 | SIMPLE | A | NULL | ref | idx_t_patent_documents_result_order | idx_t_patent_documents_result_order | 4 | B.publication_id | 1 | 100.00 | NULL |
Update 2:
I also tried running the query without the multi index on all three columns, but only on the two columns I would like to sort: (language_id, result_order) which is called x_temp_idx_1. This indeed seemed to have effect in the sense that key_len is 14 again, however, now the join part takes forever. The query that I ran:
explain SELECT SQL_NO_CACHE A.language_id, A.result_order
FROM
t_patent_documents A USE INDEX(x_temp_idx_1) INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
ORDER BY
A.language_id ASC,
A.result_order ASC
LIMIT 100
The corresponding explain plan:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|---------------|--------------|---------|------|---------|----------|------------------------------------------------|
| 1 | SIMPLE | A | NULL | index | NULL | x_temp_idx_1 | 14 | NULL | 1795275 | 100.00 | "Using index; Using temporary; Using filesort" |
| 1 | SIMPLE | B | NULL | ALL | NULL | NULL | NULL | NULL | 2469412 | 10.00 | "Using where; Using join buffer (hash join)" |
Note that this query ran on another db (development) with less information. That's why the number of rows does not correspond to that of previous explain plans.
Update 3:
All the above queries are simplifications of the actual query I would like to run to not overcomplicate the question. In the real use case, I need to filter on a full-text index in table B. The table includes a full-text index on the column invention_title:
explain SELECT SQL_NO_CACHE *
FROM
t_patent_documents A INNER JOIN
t_inv_title_int_content_combined B ON A.publication_id=B.publication_id
WHERE
MATCH(B.invention_title) AGAINST("+hydraulic" IN BOOLEAN MODE)
ORDER BY
A.language_id ASC,
A.result_order ASC
LIMIT 100
The resulting explain shows that it again uses only the primary key to do the join, but fails to use the multi-index. If I force the index again, the key_len is 4, and it does not actually seem to use the multi-index:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|----------|---------------|-----|---------|-----|------|----------|-------|
| 1 | SIMPLE | B | NULL | fulltext | idx_ft_inv_title_int_content_combined_inv_title | idx_ft_inv_title_int_content_combined_inv_title | 0 | const | 1 | 100.00 | "Using where; Ft_hints: no_ranking; Using temporary; Using filesort" |
| 1 | SIMPLE | A | NULL | eq_ref | "PRIMARY,idx_t_patent_documents_pubid_country_priority_dt,idx_t_patent_documents_pubid_priority_dt,idx_t_patent_documents_pubid_country_ucid,idx_t_patent_documents_pubid_ucid_priority_dt,idx_t_patent_documents_pubid_country_ucid_priority_dt,idx_t_patent_documents_result_order,x_temp_idx_2" | PRIMARY | 4 | B.publication_id | 1 | 100.00 | NULL |
For your first query, remove the USE INDEX and change the index to this order:
(language_id, result_order, publication_id)
The idea is to have the INDEX match the ORDER BY. The optimizer gives preference to WHERE and GROUP BY, but you don't have either. So, the intent for your cases is to be able to stop scanning after LIMIT rows by having the index match the ORDER BY. This would work nicely for just A.
It may help to change the SELECT * to specify only the necessary columns.
If there is exactly 1 B row for each A row, then there is another optimization. Since this is not the case, the JOIN must be performed, then sort, and only finally deliver LIMIT rows.
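The reordered index suggested above could be created and checked roughly like this. This is only a sketch: the index name idx_lang_order_pub is made up for illustration, and whether the optimizer actually picks the index depends on the data and the MySQL version.

```sql
-- Hypothetical index name; the column order follows the suggestion above
-- so that the index matches the ORDER BY.
ALTER TABLE t_patent_documents
  ADD INDEX idx_lang_order_pub (language_id, result_order, publication_id);

-- Check on table A alone whether the sort can be served from the index
-- (ideally no "Using filesort" in Extra and a small rows estimate).
EXPLAIN
SELECT A.publication_id, A.language_id, A.result_order
FROM t_patent_documents A
ORDER BY A.language_id ASC, A.result_order ASC
LIMIT 100;
```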
re: Update 3
That is as expected.
Use FT index to find the (hopefully) few rows matching.
Reach into the other table in a very efficient way, namely via the PK.
Grab the other columns needed.
Sort to achieve the ORDER BY.
Peel off LIMIT rows.
There is no further optimization. A possible drag on things is if * includes some bulky TEXT or BLOB columns.
As Rick James mentioned, the query seems to be optimized as it is. The main problem here is that the resultant set of records (after filtering using the FULLTEXT index) is large. The sort will not happen using the indexes, because the PRIMARY key has been used to join table B on table A.
Because of this, I opted for another solution: limiting the number of records returned from B, so that the sort only happens on this subset. This solution is reasonable for my use case, because my query serves a front-end application, where it is unlikely that the user will iterate through more than a couple of thousand records anyway. The LIMIT can be set loosely so that it is highly unlikely that a user will ever reach the end, unless the provided filter criteria are so tight that the user finds all records meeting their criteria.
The query will therefore be:
SELECT SQL_NO_CACHE A.*
FROM
t_patent_documents A INNER JOIN
(SELECT publication_id FROM t_inv_title_int_content_combined WHERE MATCH(invention_title) AGAINST("+hydraulic" IN BOOLEAN MODE) LIMIT 10000) B on A.publication_id = B.publication_id
ORDER BY
A.language_id ASC,
A.result_order DESC
LIMIT 100
This performs like a charm.

mysql; how to analyse a slow query that occurs in only one database

I'm investigating some long-running queries in my PRODUCTION mysql 5.7 database. One particular query is taking over 60 seconds.
My usual approach is to take a dump of the data from PROD, import it into a DEV database, reproduce the issue, then analyse and try out some tweaks to the query.
However, the exact same query in DEV is taking less than a second.
Obviously, the mysql configuration, table structure, record numbers, etc are all the same as in PROD.
The query itself is a select with joins across 3 tables with a where clause on each table; 2 of the tables have approx 15m records in them. My initial suspicion was the lack of indexes on the queried columns, but the fact that in DEV it runs very fast would appear to disprove that.
What can I do to shed some light on this?
EXPLAIN results of my query:
PROD
EXPLAIN select this_.id as y0_ from event this_ inner join member m1_ on this_.member_id=m1_.id inner join event_type et2_ on this_.type_id=et2_.id where m1_.submission_id=40646 and this_.status in ('SUPPRESSED') and et2_.name in ('Salary') order by m1_.ni_number asc, m1_.ident1 asc, m1_.ident2 asc, m1_.ident3 asc, m1_.id asc, et2_.name asc limit 15;
+----+-------------+-------+------------+--------+-------------------------------------+-------------------+---------+--------------------------+------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+-------------------------------------+-------------------+---------+--------------------------+------+----------+----------------------------------------------+
| 1 | SIMPLE | et2_ | NULL | ALL | PRIMARY | NULL | NULL | NULL | 17 | 10.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | this_ | NULL | ref | FK5C6729A2434DA80,FK5C6729AE4E22C6E | FK5C6729AE4E22C6E | 8 | iconnect.et2_.id | 4166 | 10.00 | Using where |
| 1 | SIMPLE | m1_ | NULL | eq_ref | PRIMARY,IND_submission_id | PRIMARY | 8 | iconnect.this_.member_id | 1 | 5.00 | Using where |
+----+-------------+-------+------------+--------+-------------------------------------+-------------------+---------+--------------------------+------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.00 sec)
DEV
EXPLAIN select this_.id as y0_ from event this_ inner join member m1_ on this_.member_id=m1_.id inner join event_type et2_ on this_.type_id=et2_.id where m1_.submission_id=40646 and this_.status in ('SUPPRESSED') and et2_.name in ('Salary') order by m1_.ni_number asc, m1_.ident1 asc, m1_.ident2 asc, m1_.ident3 asc, m1_.id asc, et2_.name asc limit 15;
+----+-------------+-------+------------+------+-------------------------------------+-------------------+---------+-----------------+-------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+-------------------------------------+-------------------+---------+-----------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | et2_ | NULL | ALL | PRIMARY | NULL | NULL | NULL | 17 | 10.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | m1_ | NULL | ref | PRIMARY,IND_submission_id | IND_submission_id | 8 | const | 26644 | 100.00 | NULL |
| 1 | SIMPLE | this_ | NULL | ref | FK5C6729A2434DA80,FK5C6729AE4E22C6E | FK5C6729A2434DA80 | 8 | iconnect.m1_.id | 2 | 1.86 | Using where |
+----+-------------+-------+------------+------+-------------------------------------+-------------------+---------+-----------------+-------+----------+----------------------------------------------+
3 rows in set, 1 warning (0.03 sec)
I have also spotted that the Cardinality of some of the indexes accessed by this query is massively different between DEV and PROD:
FK5C6729AE4E22C6E: DEV=9, PROD=3792
IND_submission_id: DEV=2490, PROD=74220
Could this be impacting performance in PROD?
The query inefficiency came down to the tables containing more data than the sampled index pages could represent. Increasing innodb_stats_persistent_sample_pages from 20 to 100 and then running ANALYZE TABLE changed the execution plan for the query to the expected one, after which the query took less than 1 second.
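A minimal sketch of those two steps, assuming the account is allowed to change global variables and that the three tables from the query (event, member, event_type) are the ones whose statistics need refreshing:

```sql
-- Sample more index pages when InnoDB recalculates persistent statistics.
SET GLOBAL innodb_stats_persistent_sample_pages = 100;

-- Recalculate statistics for the tables used by the slow query.
ANALYZE TABLE event, member, event_type;

-- Compare the Cardinality column with the values seen before.
SHOW INDEX FROM event;
SHOW INDEX FROM member;
```

SET GLOBAL does not survive a server restart, so the value would also need to go into the server configuration if the change is meant to be permanent.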

MySQL JOIN Rows Issues

I'm having some issues with a query that joins two tables. It runs through the tables far more times than I expect and I can't seem to find why it does this.
My Query is: SELECT * FROM indexAddress LEFT JOIN indexTx ON indexTx.address_id = indexAddress.id WHERE indexAddress.walletId = '2'
IndexTx contains rows with transactions and a field with the address ID (address_id)
IndexAddress contains the address data with the ID as primary key.
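| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|--------------|------------|------|---------------|------------|---------|-----------------|------|----------|-------|
| 1 | SIMPLE | indexAddress | NULL | ref | Wallet ID | Wallet ID | 4 | const | 121 | 100.00 | |
| 1 | SIMPLE | indexTx | NULL | ref | Address ID | Address ID | 4 | indexAddress.id | 23 | 100.00 | |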
My question is: why does the plan show 23 rows for indexTx and not just 1? 121 rows is expected, since it's the number of rows for the expected Wallet ID, but 23 is confusing me.
If one of the tables has more than one record with walletId = 2, the record will be duplicated.
You need to filter the data further or use DISTINCT.
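As a side check (a sketch that is not part of the answer above): the rows column in EXPLAIN is the optimizer's estimate of how many rows it expects to read from that table for each row of the preceding table, so 23 would be an estimated number of transactions per address. The actual distribution could be inspected with something like:

```sql
-- Count the transactions stored per address, highest first.
SELECT address_id, COUNT(*) AS tx_count
FROM indexTx
GROUP BY address_id
ORDER BY tx_count DESC
LIMIT 10;
```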

MySQL view taking too much time to select data

In the web page that I'm working on I need to show some statistics based on different user details which are in three tables. So I have the following query, which joins the user table to two more tables:
SELECT *
FROM `user` `u`
LEFT JOIN `subscriptions` `s` ON `u`.`user_id` = `s`.`user_id`
LEFT JOIN `devices` `ud` ON `u`.`user_id` = `ud`.`user_id`
GROUP BY `u`.`user_id`
When I execute the query with LIMIT 1000 it takes about 0.05 seconds and since I'm using the data from all the three tables in a lot of queries I've decided to put it inside a VIEW:
CREATE VIEW `user_details` AS ( the same query from above )
And now when I run:
SELECT * FROM user_details LIMIT 1000
it takes about 7-10 seconds.
So my question is: can I do something to optimize the view, since the query itself seems to be pretty quick, or should I use the whole query instead of the view?
Edit: this is what EXPLAIN SELECT * FROM user_details returns
+----+-------------+------------+--------+----------------+----------------+---------+------------------------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+--------+----------------+----------------+---------+------------------------+--------+-------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 322666 | |
| 2 | DERIVED | u | index | NULL | PRIMARY | 4 | NULL | 372587 | |
| 2 | DERIVED | s | eq_ref | PRIMARY | PRIMARY | 4 | db_users.u.user_id | 1 | |
| 2 | DERIVED | ud | ref | device_id_name | device_id_name | 4 | db_users.u.user_id | 1 | |
+----+-------------+------------+--------+----------------+----------------+---------+------------------------+--------+-------+
4 rows in set (8.67 sec)
this is what explain returns for the query:
+----+-------------+-------+--------+----------------+----------------+---------+------------------------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------+----------------+---------+------------------------+--------+-------+
| 1 | SIMPLE | u | index | NULL | PRIMARY | 4 | NULL | 372587 | |
| 1 | SIMPLE | s | eq_ref | PRIMARY | PRIMARY | 4 | db_users.u.user_id | 1 | |
| 1 | SIMPLE | ud | ref | device_id_name | device_id_name | 4 | db_users.u.user_id | 1 | |
+----+-------------+-------+--------+----------------+----------------+---------+------------------------+--------+-------+
3 rows in set (0.00 sec)
Views and joins are extremely bad when it comes to performance. This is more or less true for all relational database management systems. It sounds strange, since that is what those systems are designed for, but it is true nevertheless.
Try to avoid the joins if this query is in heavy use on your page: instead, create a real table (not a view) that is filled from the three tables. You can automate that process using triggers, so each time an entry is inserted into one of the original tables, a trigger takes care that the data is propagated to the physical user_details table.
This strategy certainly means a one time investment for the setup, but you definitely will get a much better performance.
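A rough sketch of what that could look like. The table name user_details_flat, the device_count column, the trigger names and the INT key type are all assumptions made for illustration, since the real column list is not shown in the question. Only inserts are handled here; updates and deletes would need their own triggers.

```sql
-- Hypothetical physical table replacing the view.
CREATE TABLE user_details_flat (
  user_id INT NOT NULL PRIMARY KEY,
  device_count INT NOT NULL DEFAULT 0  -- assumed summary column
);

-- One-time fill from the existing tables.
INSERT INTO user_details_flat (user_id, device_count)
SELECT u.user_id, COUNT(ud.user_id)
FROM `user` u
LEFT JOIN `devices` ud ON u.user_id = ud.user_id
GROUP BY u.user_id;

-- Keep the table in sync when a new user is inserted.
CREATE TRIGGER user_after_insert
AFTER INSERT ON `user`
FOR EACH ROW
  INSERT INTO user_details_flat (user_id) VALUES (NEW.user_id);

-- Keep the summary column in sync when a device is inserted.
CREATE TRIGGER devices_after_insert
AFTER INSERT ON `devices`
FOR EACH ROW
  UPDATE user_details_flat
  SET device_count = device_count + 1
  WHERE user_id = NEW.user_id;
```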

MySQL using Filesort and Query is very slow?

I have a query :
SELECT listings.*, listingagents.agentid
FROM listings
LEFT JOIN listingagents ON (listingagents.id = listings.listingagentid)
LEFT JOIN ignore ON (ignore.system_key = listings.listingid)
WHERE ignore.id IS NULL
ORDER BY listings.id ASC
I am trying to improve the performance of this query since it is very slow and it is putting a heavy load on the MySQL server.
When I do a mysql explain, output shows :
+--------+-------------+---------------+--------+---------------+------------+---------+----------------------------+--------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+--------+-------------+---------------+--------+---------------+------------+---------+----------------------------+--------+-------------------------+
| 1 | SIMPLE | listings | ALL | NULL | NULL | NULL | NULL | 383360 | Using filesort |
| 1 | SIMPLE | listingagents | eq_ref | PRIMARY | PRIMARY | 4 | db.listings.listingagen... | 1 | |
| 1 | SIMPLE | ignore | ref | system_key | system_key | 1 | const | 404 | Using where; Not exists |
+--------+-------------+---------------+--------+---------------+------------+---------+----------------------------+--------+-------------------------+
I tried to do a simple query:
SELECT listings.*
FROM listings
ORDER BY listings.id ASC
And that query also has "Using filesort;".
The fields "listings.id", "listingagents.id" and "ignore.id" are Primary Keys
The fields "listingagents.id" and "ignore.system_key" have indexes.
What can I do to improve the 1st query?
Try to reduce the range of listings rows being read (currently 383360 rows) by adding a condition, e.g. id > x, or a LIMIT.
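For example, the first query could be read in chunks. This is only a sketch: the starting id of 0 and the chunk size of 1000 are placeholder values, and the backticks around ignore are added here because IGNORE is a reserved word in MySQL.

```sql
SELECT listings.*, listingagents.agentid
FROM listings
LEFT JOIN listingagents ON (listingagents.id = listings.listingagentid)
LEFT JOIN `ignore` ON (`ignore`.system_key = listings.listingid)
WHERE `ignore`.id IS NULL
  AND listings.id > 0        -- placeholder: the last listings.id already processed
ORDER BY listings.id ASC
LIMIT 1000;                  -- placeholder chunk size
```

Each following chunk would replace 0 with the largest listings.id returned by the previous one.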