Fix Using where; Using temporary; Using filesort - mysql

I have two simple tables:
CREATE TABLE cat_urls (
Id int(11) NOT NULL AUTO_INCREMENT,
SIL_Id int(11) NOT NULL,
SiteId int(11) NOT NULL,
AsCatId int(11) DEFAULT NULL,
Href varchar(2048) NOT NULL,
ReferrerHref varchar(2048) NOT NULL DEFAULT '',
AddedOn datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
GroupId int(11) DEFAULT NULL,
PRIMARY KEY (Id),
INDEX SIL (SIL_Id, AsCatId)
)
CREATE TABLE products (
Id int(11) NOT NULL AUTO_INCREMENT,
CatUrlId int(11) NOT NULL,
Href varchar(2048) NOT NULL,
SiteIdentity varchar(2048) NOT NULL,
Price decimal(12, 2) NOT NULL,
IsAvailable bit(1) NOT NULL,
ClientCode varchar(256) NOT NULL,
PRIMARY KEY (Id),
INDEX CatUrl (CatUrlId)
)
And I have pretty simple query:
SELECT cu.Href, COUNT(p.CatUrlId) FROM cat_urls cu
JOIN products p ON p.CatUrlId=cu.Id
WHERE sil_id=4601038
GROUP by cu.Id
EXPLAIN says:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE cu ref PRIMARY,SIL SIL 4 const 303 Using where; Using temporary; Using filesort
1 SIMPLE p ref CatUrl CatUrl 4 blue_collar_logs.cu.Id 6 Using index
Please tell me is there any way to fix "Using where; Using temporary; Using filesort" and improve perfomance of this query?

It looks that, for some reason, MySQL chooses to use the index SIL on the first table and it uses it both for lookup (WHERE sil_id = 4601038) and grouping (GROUP BY cu.Id).
You can tell it to use the PK of the table
SELECT cu.Href, COUNT(p.CatUrlId) FROM cat_urls cu
USE INDEX FOR JOIN (PRIMARY)
JOIN products p ON p.CatUrlId=cu.Id
WHERE sil_id=4601038
GROUP by cu.Id
and it will produce this execution plan:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---+-------------+-------+-------+---------------+---------+---------+------------------+------+-------------
1 | SIMPLE | cu | index | PRIMARY | PRIMARY | 4 | NULL | 1 | Using where
1 | SIMPLE | p | ref | CatUrl | CatUrl | 4 | cbs-test-1.cu.Id | 1 | Using index
Ignore the values reported in column rows; they are not correct because my tables are empty.
Notice the Extra column now contains only Using where but also notice that the join type column changed from ref (very good) to index (full index scan, not quite good).
A better solution is to add an index on column SIL_Id. I know, SIL_Id is a prefix of index SIL(SIL_Id, AsCatId) and in theory another index on column SIL_Id is completely useless. But it seems it solves the issue on this case.
ALTER TABLE cat_urls
ADD INDEX (SIL_Id)
;
Now use it in the query:
SELECT cu.Href, COUNT(p.CatUrlId) FROM cat_urls cu
USE INDEX FOR JOIN (SIL_Id)
JOIN products p ON p.CatUrlId=cu.Id
WHERE sil_id=4601038
GROUP by cu.Id
The query execution plan looks much better now:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---+-------------+-------+------+---------------+--------+---------+------------------+------+-------------
1 | SIMPLE | cu | ref | SIL_Id | SIL_Id | 4 | const | 1 | Using where
1 | SIMPLE | p | ref | CatUrl | CatUrl | 4 | cbs-test-1.cu.Id | 1 | Using index
The drawback is that we have an extra index that is (theoretically) useless. It occupies storage space and it consumes processor cycles every time a row is added, deleted or have its SIL_Id field modified.

Related

How to use correct indexes with a double inner join query?

I have a query with 2 INNER JOIN statements, and only fetching a few column, but it is very slow even though I have indexes on all required columns.
My query
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
INNER JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
WHERE
com.prestataireLAD REGEXP '.*'
AND pe_nom REGEXP 'bordeaux|chambéry-annecy|grenoble|lyon|marseille|metz|montpellier|nancy|nice|nimes|rouen|strasbourg|toulon|toulouse|vitry|vitry bis 1|vitry bis 2|vlg'
AND com.date_livraison BETWEEN '2022-06-11 00:00:00'
AND '2022-07-08 00:00:00';
It takes around 20 seconds to compute and fetch 4123 rows.
The problem
In order to find what's wrong and why is it so slow, I've used the EXPLAIN statement, here is the output:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|--------|----------------------------|-------------|---------|------------------------|--------|----------|-------------|
| 1 | SIMPLE | dys | | ALL | id_commande,id_commande_2 | | | | 878588 | 100.00 | Using where |
| 1 | SIMPLE | com | | eq_ref | id_commande,date_livraison | id_commande | 110 | db.dys.id_commande | 1 | 7.14 | Using where |
| 1 | SIMPLE | pe | | ref | pe_id | pe_id | 5 | db.com.code_pe | 1 | 100.00 | Using where |
I can see that the dysfonctionnements JOIN is rigged, and doesn't use a key even though it could...
Table definitions
commandes (included relevant columns only)
CREATE TABLE `commandes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) NOT NULL DEFAULT '',
`date_commande` datetime NOT NULL,
`date_livraison` datetime NOT NULL,
`code_pe` int(11) NOT NULL,
`traitement_dysfonctionnement` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`),
KEY `date_livraison` (`date_livraison`),
KEY `traitement_dysfonctionnement` (`traitement_dysfonctionnement`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
dysfonctionnements (again, relevant columns only)
CREATE TABLE `dysfonctionnements` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) DEFAULT NULL,
`dysfonctionnement` varchar(150) DEFAULT NULL,
`responsable` varchar(50) DEFAULT NULL,
`reimputation` varchar(50) DEFAULT NULL,
`montant` float DEFAULT NULL,
`listRembArticles` text,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`,`dysfonctionnement`),
KEY `id_commande_2` (`id_commande`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
pe (again, relevant columns only)
CREATE TABLE `pe` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`pe_id` int(11) DEFAULT NULL,
`pe_nom` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `pe_nom` (`pe_nom`),
KEY `pe_id` (`pe_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Investigation
If I remove the db.pe table from the query and the WHERE clause on pe_nom, the query takes 1.7 seconds to fetch 7k rows, and with the EXPLAIN statement, I can see it is using keys as I expect it to do:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|----------------------------|----------------|---------|------------------------|--------|----------|-----------------------------------------------|
| 1 | SIMPLE | com | | range | id_commande,date_livraison | date_livraison | 5 | | 389558 | 100.00 | Using index condition; Using where; Using MRR |
| 1 | SIMPLE | dys | | ref | id_commande,id_commande_2 | id_commande_2 | 111 | ooshop.com.id_commande | 1 | 100.00 | |
I'm open to any suggestions, I see no reason not to use the key when it does on a very similar query and it definitely makes it faster...
I had a similar experience when MySQL optimiser selected a joined table sequence far from optimal. At that time I used MySQL specific STRAIGHT_JOIN operator to overcome default optimiser behaviour. In your case I would try this:
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
STRAIGHT_JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
Also, in your WHERE clause one of the REGEXP probably might be changed to IN operator, I assume it can use index.
Remove com.prestataireLAD REGEXP '.*'. The Optimizer probably won't realize that this has no impact on the resultset. If you are dynamically building the WHERE clause, then eliminate anything else you can.
id_commande_2 is redundant. In queries where it might be useful, the UNIQUE can take care of it.
These indexes might help:
com: INDEX(date_livraison, id_commande, code_pe)
pe: INDEX(pe_nom, pe_id)

Implementation of composit clustered index in MySQL

I need to create composit clustered index like: username, name, id. Is it real to implement such thing? I need to boost perfomance of query like where username = ? and name = ? by using clustered indexes in Innodb. But i think it wont work because id stay at 3rd place, and it wont be used.
It's fine to define a clustered index with multiple columns.
CREATE TABLE mytable (
username VARCHAR(64) NOT NULL,
name VARCHAR(64) NOT NULL,
id BIGINT
PRIMARY KEY (username, name, id)
);
If you query against the first two columns, it will use the clustered index, so it will avoid the overhead of lookups via secondary indexes.
But if you use EXPLAIN to report the optimizer's plan for the query, you'll see that the access is type: ref which means an index lookup, but not a unique index lookup. That is, it will potentially match multiple rows.
mysql> explain select * from mytable where username = 'user' and name = 'name';
+----+-------------+---------+------------+------+---------------+---------+---------+-------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+---------------+---------+---------+-------------+------+----------+-------+
| 1 | SIMPLE | mytable | NULL | ref | PRIMARY | PRIMARY | 516 | const,const | 1 | 100.00 | NULL |
+----+-------------+---------+------------+------+---------------+---------+---------+-------------+------+----------+-------+
When doing lookups against a PRIMARY KEY, we'd like to see type: eq_ref or type: const which means it is doing a unique lookup, and the query is guaranteed to match either 0 or 1 row.
mysql> explain select * from mytable where username = 'user' and name = 'name' and id = 1;
+----+-------------+---------+------------+-------+---------------+---------+---------+-------------------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------------------+------+----------+-------+
| 1 | SIMPLE | mytable | NULL | const | PRIMARY | PRIMARY | 524 | const,const,const | 1 | 100.00 | NULL |
+----+-------------+---------+------------+-------+---------------+---------+---------+-------------------+------+----------+-------+
Both queries are using the clustered index.
Re your comment:
InnoDB requires the auto-increment column be the first column of a key in the table. It doesn't have to be the primary key. So you can do this for example:
CREATE TABLE `mytable` (
`username` varchar(64) NOT NULL,
`name` varchar(64) NOT NULL,
`id` bigint NOT NULL AUTO_INCREMENT,
`x` int DEFAULT NULL,
PRIMARY KEY (`username`,`name`,`id`),
KEY (`id`)
) ENGINE=InnoDB;
Notice I added an extra KEY (id) to satisfy InnoDB's requirement. But in the primary key, id is still at the end.

SELECT statement optimization MySQL

I am looking for a way how to make my SELECT query even faster than it is now, because I have a feeling it should be possible to make it faster.
Here is the query
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
Here is EXPLAIN SELECT
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | tp | eq_ref | PRIMARY | PRIMARY | 4 | dev.r.id_pair | 1 | |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
Now the tables are
CREATE TABLE `tag_product` (
`id_pair` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_product` int(10) unsigned NOT NULL,
`id_user_tag` int(10) unsigned NOT NULL,
`status` tinyint(3) NOT NULL,
`percentile` decimal(8,4) unsigned NOT NULL,
`percentile_weighted` decimal(8,4) unsigned NOT NULL,
`elo` int(10) unsigned NOT NULL,
`date_add` datetime NOT NULL,
`date_upd` datetime NOT NULL,
PRIMARY KEY (`id_pair`),
UNIQUE KEY `id_product_id_user_tag` (`id_product`,`id_user_tag`),
KEY `status` (`status`),
KEY `id_user_tag` (`id_user_tag`),
CONSTRAINT `tag_product_ibfk_5` FOREIGN KEY (`id_user_tag`) REFERENCES `user_tag` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `tag_rating` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_customer` int(10) unsigned NOT NULL,
`id_pair` int(10) unsigned NOT NULL,
`id_duel` int(10) unsigned NOT NULL,
`value` tinyint(4) NOT NULL,
`date_add` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_duel_id_pair` (`id_duel`,`id_pair`),
KEY `id_pair_id_customer` (`id_pair`,`id_customer`),
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
KEY `id_customer_value_date_add` (`id_customer`,`value`,`date_add`),
CONSTRAINT `tag_rating_ibfk_3` FOREIGN KEY (`id_pair`) REFERENCES `tag_product` (`id_pair`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `tag_rating_ibfk_6` FOREIGN KEY (`id_duel`) REFERENCES `tag_rating_duel` (`id_duel`) ON DELETE CASCADE ON UPDATE CASCADE,
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The table tag_product has about 250k rows and the tag_rating has about 1M rows.
My issue is that the SQL query takes about 0.8s on average on my machine. I would like to make it ideally under 0.5s while also assuming the tables can get like 10 times bigger. The amount of rows taken into play should be about the same because I have a date condition (I only want less than a month old rows).
Is this possible to make faster just by some trick (aka not restructuring my tables)? When I slightly modify (dont join the smaller table) the statement as
SELECT r.id_customer, COUNT(*)
FROM tag_rating AS r USE INDEX (value_date_add)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer;
here is EXPLAIN SELECT
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
it takes about 0.25s which is great. So the JOIN makes it 3x slower. Is that inevitable? I feel like since I am joining via primary key it shouldnt make a query 3x slower.
---UPDATE---
This is actually my query. The number of different id_customer values is about 1 thousand and is expected to rise, the number of rows with value=1 is exactly half. So far the query performance seems to be slowing down linearly based on the number of rows in rating table
Using adding id_pair at the end of the id_customer_value_date_add or value_id_customer_date_add index doesnt help.
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (id_customer_value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.id_customer IN (2593179,1461878,2318871,2654090,2840415,2852531,2987432,3473275,3960453,3961798,4129734,4191571,4202912,4204817,4211263,4248789,765650,1341317,1430380,2116196,3367674,3701901,3995273,4118307,4136114,4236589,783262,913493,1034296,2626574,3574634,3785772,2825128,4157953,3331279,4180367,4208685,4287879,1038898,1445750,1975108,3658055,4185296,4276189,428693,4248631,1892448,3773855,2901524,3830868,3934786) AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
This is EXPLAIN SELECT
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| 1 | SIMPLE | r | range | id_customer_value_date_add | id_customer_value_date_add | 10 | | 558906 | Using where; Using index |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| 1 | SIMPLE | tp | eq_ref | PRIMARY,status | PRIMARY | 4 | dev.r.id_pair | 1 | Using where |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
Any tips are appreciated. Thank you
INDEX(value, date_add, id_customer, id_pair)
Would be "covering", giving an extra boost on performance for both queries. And also for Gordon's formulation.
At the same time, get rid of these:
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
because they might get in the way of the Optimizer picking the new index. Any other queries that were using those indexes will easily use the new index.
If you are not otherwise using tag_rating.id, get rid of it and promote the UNIQUE to PRIMARY KEY.
Try writing the query using a correlated subquery:
SELECT r.id_customer,
(SELECT ROUND(AVG(tp.percentile_weighted), 2)
FROM tag_product tp
WHERE tp.id_pair = r.id_pair
) AS percentile
FROM tag_rating AS r
WHERE r.value = 1 AND
r.date_add > '2020-08-08 11:56:00';
This eliminates the outer aggregation which should be faster.

mysql table join with max()

I have problem with query using JOIN and MAX/MIN. For Example:
SELECT Min(a.date), Max(a.date)
FROM a
INNER JOIN b ON b.ID = a.ID AND b.cID = 5
Its possible to add index or change this query result was better?
Below the result of explain
+----+-------------+----------+------+-----------------+-----+---------+-----------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+-----------------+-----+---------+-----------+--------+-----------------------+
| 1 | SIMPLE | b | ref | PRIMARY,cID | cID | 5 | const | 680648 | Using index |
| 1 | SIMPLE | a | ref | ID | ID | 5 | base.b.ID | 1 | Using index condition |
+----+-------------+----------+------+-----------------+-----+---------+-----------+--------+-----------------------+
Sorry, but I would not put here the whole table, and could make a lot of confusion.
CREATE TABLE `a` (
`ID` int(11) NOT NULL,
`date` datetime DEFAULT,
PRIMARY KEY (`ID`),
KEY `date` (`date`),
)
CREATE TABLE `b` (
`bID` int(11) NOT NULL,
`ID` int(11) NOT NULL,
`cID` int(11) DEFAULT,
PRIMARY KEY (`bID`),
KEY `cID` (`cID`),
)
b: INDEX(cID, ID)
will make that a "covering" index, so it will probably get through the 680648 rows faster. It should replace the current KEY(cID).
Key_len for b is 5. That disagrees with the table definition; something got simplified too much.

Mysql - Using temporary; Using filesort

I have two tables like this
CREATE TABLE `vendors` (
vid int(10) unsigned NOT NULL AUTO_INCREMENT,
updated timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (vid),
key(updated)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `products` (
vid int(10) unsigned NOT NULL default 0,
pid int unsigned default 0,
flag int(11) unsigned DEFAULT '0',
PRIMARY KEY (vid),
KEY (pid)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This is a simple query
> explain select vendors.vid, pid from products, vendors where pid=1 and vendors.vid=products.vid order by updated;
+------+-------------+----------+--------+---------------+---------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+--------+---------------+---------+---------+---------------------+------+----------------------------------------------+
| 1 | SIMPLE | products | ref | PRIMARY,pid | pid | 5 | const | 1 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | vendors | eq_ref | PRIMARY | PRIMARY | 4 | social.products.vid | 1 | |
+------+-------------+----------+--------+---------------+---------+---------+---------------------+------+----------------------------------------------+
I am wondering why mysql need to use temporary table and filesort for such a simple query. As you can see that ORDER BY field has index.
mysql fiddle here : http://sqlfiddle.com/#!9/3d9be/30
That will be the optimum query in that case, doesn't always have to go to the index for the fastest result. The optimiser may choose to use the index when the record count goes up. You can try inserting 10,000 dummy records and seeing if this is the case.
If I flip the conditions here, you will find it will use the index, since I have supplied the table where the where condition is joined on later in the query. We need to look at records in table products after the join is made, so in essence I've made it harder work, so the index is used. It'll still run in the same time. You can try pitting the 2 queries against each other to see what happens. Here it is:
EXPLAIN
SELECT vendors.vid, products.pid
FROM vendors
INNER JOIN products ON vendors.vid = products.vid
WHERE pid = 1
ORDER BY vendors.updated DESC
You can find a detailed explanation here: Fix Using where; Using temporary; Using filesort