Help: Optimize this query in MySQL - mysql

This is my tables, the AUTO_INCREMENT shows the size of each:
tbl_clientes:
CREATE TABLE `tbl_clientes` (
`int_clientes_id_pk` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`str_clientes_documento` varchar(255) DEFAULT NULL,
`str_clientes_nome_original` char(255) DEFAULT NULL,
PRIMARY KEY (`int_clientes_id_pk`),
UNIQUE KEY `str_clientes_documento` (`str_clientes_documento`),
KEY `str_clientes_nome_original` (`str_clientes_nome_original`),
KEY `nome_original_cliente_id` (`str_clientes_nome_original`,`int_clientes_id_pk`),
KEY `cliente_id_nome_original` (`int_clientes_id_pk`,`str_clientes_nome_original`)
) ENGINE=MyISAM AUTO_INCREMENT=2815520 DEFAULT CHARSET=utf8
tbl_clienteEnderecos:
CREATE TABLE `tbl_clienteEnderecos` (
`int_clienteEnderecos_id_pk` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`int_clienteEnderecos_cliente_id_fk` bigint(20) unsigned NOT NULL,
`str_clienteEnderecos_endereco` varchar(255) NOT NULL,
`str_clienteEnderecos_cep` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_numero` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_complemento` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_bairro` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_cidade` varchar(255) DEFAULT NULL,
`str_clienteEnderecos_uf` varchar(2) DEFAULT NULL,
`int_clienteEnderecos_correspondencia` tinyint(1) NOT NULL DEFAULT '0',
`int_clienteEnderecos_tipo` int(11) NOT NULL DEFAULT '1',
PRIMARY KEY (`int_clienteEnderecos_id_pk`),
KEY `int_clienteEnderecos_cliente_id_fk` (`int_clienteEnderecos_cliente_id_fk`),
KEY `str_clienteEnderecos_cidade` (`str_clienteEnderecos_cidade`),
KEY `str_clienteEnderecos_uf` (`str_clienteEnderecos_uf`),
KEY `uf_cidade` (`str_clienteEnderecos_uf`,`str_clienteEnderecos_cidade`)
) ENGINE=MyISAM AUTO_INCREMENT=1542038 DEFAULT CHARSET=utf8
Then I run this query to search, it will be fast, is using indexes:
EXPLAIN
SELECT * FROM tbl_clientes LEFT JOIN tbl_clienteEnderecos ON int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk
GROUP BY str_clientes_nome_original, int_clientes_id_pk
ORDER BY str_clientes_nome_original, int_clientes_id_pk
LIMIT 0,20
The result of EXPAIN is:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+------------------------------------+------------------------------------+---------+---------------------------------------------------+------+-------+
| 1 | SIMPLE | tbl_clientes | index | NULL | nome_original_cliente_id | 774 | NULL | 20 | |
| 1 | SIMPLE | tbl_clienteEnderecos | ref | int_clienteEnderecos_cliente_id_fk | int_clienteEnderecos_cliente_id_fk | 8 | mydb.tbl_clientes.int_clientes_id_pk | 1 | |
+----+-------------+----------------------+-------+------------------------------------+------------------------------------+---------+---------------------------------------------------+------+-------+
All right, but I need to filter by tbl_clienteEnderecos.str_clienteEnderecos_uf. It breaks all indexes, use temporary table and filesort (no index). Here's the query:
EXPLAIN
SELECT * FROM tbl_clientes LEFT JOIN tbl_clienteEnderecos ON int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk
WHERE str_clienteEnderecos_uf = "SP"
GROUP BY str_clientes_nome_original, int_clientes_id_pk
ORDER BY str_clientes_nome_original, int_clientes_id_pk
LIMIT 0,20
Look, this is the output of EXPLAIN:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+--------+----------------------------------------------------------------------+-----------+---------+---------------------------------------------------------------------------+--------+----------------------------------------------+
| 1 | SIMPLE | tbl_clienteEnderecos | ref | int_clienteEnderecos_cliente_id_fk,str_clienteEnderecos_uf,uf_cidade | uf_cidade | 9 | const | 670654 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | tbl_clientes | eq_ref | PRIMARY,cliente_id_nome_original | PRIMARY | 8 | mydb.tbl_clienteEnderecos.int_clienteEnderecos_cliente_id_fk | 1 | |
+----+-------------+----------------------+--------+----------------------------------------------------------------------+-----------+---------+---------------------------------------------------------------------------+--------+----------------------------------------------+
With this Using where; Using temporary; Using filesort it can't be fast. I've tried a lot of things, how optimize this query?
Is it time to switch to NoSQL/MongoDB?

MySQL will typically not use an index if it will not help narrow the results down enough. It appears that "SP" occurs in roughly 670654 rows. Since this is about 1/3 of your total rows, it is more efficient to read it in disk order.
You can try an index to tbl_clienteEnderecos:
KEY `test` (`str_clienteEnderecos_uf `, `int_clienteEnderecos_cliente_id_fk`)
This might be enough to get it to use the index.
What is the difference between these two columns? They look like they should be the same.
int_clienteEnderecos_id_pk
int_clienteEnderecos_cliente_id_fk
Edit
I understand what the names of the columns imply. I was just curious if the two values should be identical. If they are, it would simplify a few things and have them be joined on the primary key of the tables. I am not sure about the specific meaning of the tables involved, so I don't know if there is a 1-1 or 1-0 relationship between them or a one to many relationship.
I suggest trying to retrieve just the primary key of the tables that you want. For instance, instead of select * try:
EXPLAIN
SELECT int_clienteEnerecos_id_pk, int_clientes_id_pk
FROM tbl_clientes
LEFT JOIN tbl_clienteEnderecos ON int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk
WHERE str_clienteEnderecos_uf = "SP"
GROUP BY str_clientes_nome_original, int_clientes_id_pk
ORDER BY str_clientes_nome_original, int_clientes_id_pk
LIMIT 0,20
If this works out the way I hope it will, you sell see "from index" in the Extra column. If you need additional fields returned, you can either make another round trip to fetch them, or add them to your index. Or use a nested query to fetch them based on the results of the query above.
Also, why are you grouping by and ordering by the same thing? Are you expecting multiple matches of the foreign key?

I'd suggest giving the following a try; the subquery might use the key better than the join in this context. Take care, though; I couldn't swear on a stack of K & R's that the query is the same as your original.
SELECT *,
(SELECT *
FROM tbl_clienteEnderecos
WHERE int_clienteEnderecos_cliente_id_fk = int_clientes_id_pk AND
str_clienteEnderecos_uf = "SP") AS T2
FROM tbl_clientes
GROUP BY str_clientes_nome_original, int_clientes_id_pk
HAVING T2.int_clienteEnderecos_id_pk IS NOT NULL
ORDER BY str_clientes_nome_original, int_clientes_id_pk
LIMIT 0, 20

Related

SELECT statement optimization MySQL

I am looking for a way how to make my SELECT query even faster than it is now, because I have a feeling it should be possible to make it faster.
Here is the query
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
Here is EXPLAIN SELECT
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | tp | eq_ref | PRIMARY | PRIMARY | 4 | dev.r.id_pair | 1 | |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
Now the tables are
CREATE TABLE `tag_product` (
`id_pair` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_product` int(10) unsigned NOT NULL,
`id_user_tag` int(10) unsigned NOT NULL,
`status` tinyint(3) NOT NULL,
`percentile` decimal(8,4) unsigned NOT NULL,
`percentile_weighted` decimal(8,4) unsigned NOT NULL,
`elo` int(10) unsigned NOT NULL,
`date_add` datetime NOT NULL,
`date_upd` datetime NOT NULL,
PRIMARY KEY (`id_pair`),
UNIQUE KEY `id_product_id_user_tag` (`id_product`,`id_user_tag`),
KEY `status` (`status`),
KEY `id_user_tag` (`id_user_tag`),
CONSTRAINT `tag_product_ibfk_5` FOREIGN KEY (`id_user_tag`) REFERENCES `user_tag` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `tag_rating` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_customer` int(10) unsigned NOT NULL,
`id_pair` int(10) unsigned NOT NULL,
`id_duel` int(10) unsigned NOT NULL,
`value` tinyint(4) NOT NULL,
`date_add` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_duel_id_pair` (`id_duel`,`id_pair`),
KEY `id_pair_id_customer` (`id_pair`,`id_customer`),
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
KEY `id_customer_value_date_add` (`id_customer`,`value`,`date_add`),
CONSTRAINT `tag_rating_ibfk_3` FOREIGN KEY (`id_pair`) REFERENCES `tag_product` (`id_pair`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `tag_rating_ibfk_6` FOREIGN KEY (`id_duel`) REFERENCES `tag_rating_duel` (`id_duel`) ON DELETE CASCADE ON UPDATE CASCADE,
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The table tag_product has about 250k rows and the tag_rating has about 1M rows.
My issue is that the SQL query takes about 0.8s on average on my machine. I would like to make it ideally under 0.5s while also assuming the tables can get like 10 times bigger. The amount of rows taken into play should be about the same because I have a date condition (I only want less than a month old rows).
Is this possible to make faster just by some trick (aka not restructuring my tables)? When I slightly modify (dont join the smaller table) the statement as
SELECT r.id_customer, COUNT(*)
FROM tag_rating AS r USE INDEX (value_date_add)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer;
here is EXPLAIN SELECT
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
it takes about 0.25s which is great. So the JOIN makes it 3x slower. Is that inevitable? I feel like since I am joining via primary key it shouldnt make a query 3x slower.
---UPDATE---
This is actually my query. The number of different id_customer values is about 1 thousand and is expected to rise, the number of rows with value=1 is exactly half. So far the query performance seems to be slowing down linearly based on the number of rows in rating table
Using adding id_pair at the end of the id_customer_value_date_add or value_id_customer_date_add index doesnt help.
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (id_customer_value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.id_customer IN (2593179,1461878,2318871,2654090,2840415,2852531,2987432,3473275,3960453,3961798,4129734,4191571,4202912,4204817,4211263,4248789,765650,1341317,1430380,2116196,3367674,3701901,3995273,4118307,4136114,4236589,783262,913493,1034296,2626574,3574634,3785772,2825128,4157953,3331279,4180367,4208685,4287879,1038898,1445750,1975108,3658055,4185296,4276189,428693,4248631,1892448,3773855,2901524,3830868,3934786) AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
This is EXPLAIN SELECT
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| 1 | SIMPLE | r | range | id_customer_value_date_add | id_customer_value_date_add | 10 | | 558906 | Using where; Using index |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| 1 | SIMPLE | tp | eq_ref | PRIMARY,status | PRIMARY | 4 | dev.r.id_pair | 1 | Using where |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
Any tips are appreciated. Thank you
INDEX(value, date_add, id_customer, id_pair)
Would be "covering", giving an extra boost on performance for both queries. And also for Gordon's formulation.
At the same time, get rid of these:
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
because they might get in the way of the Optimizer picking the new index. Any other queries that were using those indexes will easily use the new index.
If you are not otherwise using tag_rating.id, get rid of it and promote the UNIQUE to PRIMARY KEY.
Try writing the query using a correlated subquery:
SELECT r.id_customer,
(SELECT ROUND(AVG(tp.percentile_weighted), 2)
FROM tag_product tp
WHERE tp.id_pair = r.id_pair
) AS percentile
FROM tag_rating AS r
WHERE r.value = 1 AND
r.date_add > '2020-08-08 11:56:00';
This eliminates the outer aggregation which should be faster.

Any possibility to speed up WHERE IN or replace it with faster alternative?

I am trying to speed up select in query below where I have over 1000 items in WHERE IN
table:
CREATE TABLE `user_item` (
`user_id` int(11) unsigned NOT NULL,
`item_id` int(11) unsigned NOT NULL,
PRIMARY KEY (`user_id`,`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
query:
SELECT
item_id
FROM
user_item
WHERE
user_id = 2
AND item_id IN(3433456,67584634,587345,...)
With 1000 items in IN list, query takes about 3 seconds to execute. is there any optimization that can be done in this case? There can be billions of rows in this table. Is there an alternative to doing this faster be it with another DB or programming method?
UPDATE:
Here's results of explain:
If I have 999 items in the IN(...) statement:
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | user_item | range | PRIMARY | PRIMARY | 8 | NULL | 999 | Using where; Using index |
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
If I have 1000 items in IN(...) statement:
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 1000 | |
| 1 | PRIMARY | user_item | eq_ref | PRIMARY | PRIMARY | 8 | const,tvc_0._col_1 | 1 | Using where; Using index |
| 2 | MATERIALIZED | <derived3> | ALL | NULL | NULL | NULL | NULL | 1000 | |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
Update 2
I want to explain why I need to do above:
I want to give the user the ability to list items ordered by sort_criteria_1, sort_criteria_2 or sort_criteria_3 and exclude from the list those items that have been marked by given (n) users in the user_item table.
Here's sample schema:
CREATE TABLE `user` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(45) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `item` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`file` varchar(45) NOT NULL,
`sort_criteria_1` int(11) DEFAULT NULL,
`sort_criteria_2` int(11) DEFAULT NULL,
`sort_criteria_3` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_sc1` (`sort_criteria_1`),
KEY `idx_sc2` (`sort_criteria_2`),
KEY `idx_sc3` (`sort_criteria_3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `user_item` (
`user_id` int(11) NOT NULL,
`item_id` int(11) NOT NULL,
PRIMARY KEY (`user_id`,`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Here's how I would get items ordered by sort_criteria_2 excluding ones that have record by users (300, 6, 1344, 24) in user_item table:
SELECT
i.id,
FROM
item i
LEFT JOIN user_item ui1 ON (i.id = ui1.item_id AND ui1.user_id = 300)
LEFT JOIN user_item ui2 ON (i.id = ui2.item_id AND ui2.user_id = 6)
LEFT JOIN user_item ui3 ON (i.id = ui3.item_id AND ui3.user_id = 1344)
LEFT JOIN user_item ui4 ON (i.id = ui4.item_id AND ui4.user_id = 24)
WHERE
ui1.item_id IS NULL
AND ui2.item_id IS NULL
AND ui3.item_id IS NULL
AND ui4.item_id IS NULL
ORDER BY
v.sort_criteria_2
LIMIT
800
Main problem with above approach is that more users I'm filtering by, more expensive query gets. I want the toll for filtering to be paid by client browser. So I would send list of items and list of matching user_item records per user to the client to filter by. This would help with sharding as well, since I would not have to have user_item tables or set of records on the same machine.
It's hard to tell exactly, but there could be lag on parsing your huge query because of many constant item_id values.
Have you tried getting just all the values by user_id ? As this field is first (main) in the PRIMARY KEY, relevant index would still be used.
Have you tried replacing constant list with a subquery ? Maybe you're interested in items of specific type, for example.
Make sure that you use Prepared statement concept - at least if your database and language support it. This would protect your code from possible SQL injections and enable database built-in query caching (if your database supports it).
Instead of putting the 1000 item_id's into IN-clause, you could put them into temporary table with index and join it with the user_item-table.
If you also have an index with both user_id and item_id, that would make the query fastest that it gets. The rest depends on the data distribution.

Slow inner join order query

I have a problem with the speed of query. Question is similar to this one, but can't find solution. Explain says that MySQL is using: Using index condition; Using where; Using temporary; Using filesort on companies table.
Mysql slow query: INNER JOIN + ORDER BY causes filesort
Slow query:
SELECT * FROM companies
INNER JOIN post_indices
ON companies.post_index_id = post_indices.id
WHERE companies.deleted_at is NULL
ORDER BY post_indices.id
LIMIT 1;
# 1 row in set (5.62 sec)
But if I remove where statement from query it is really fast:
SELECT * FROM companies
INNER JOIN post_indices
ON companies.post_index_id = post_indices.id
ORDER BY post_indices.id
LIMIT 1;
# 1 row in set (0.00 sec)
I've tried using different indexes on companies table:
index_companies_on_deleted_at
index_companeis_on_post_index_id
index_companies_on_deleted_at_and_post_index_id
index_companies_on_post_index_id_and_deleted_at
index_companies_on_deleted_at index is automatically selected by MySQL. Stats for same query using above indexes:
5.6 sec
3.4 sec
8.5 sec
3.5 sec
Any ideas how to improve my query speed? Again said - without where deleted_at is null condition query is instant..
Companies table has 1.3 mil of rows.
PostIndices table has 3k rows.
UPDATE 1:
Order by post_indices.id is used for simplicity since it's indexed already. But it will be used on other columns of join table (post_indices). So sort on companies.post_index_id wont solve this issue
UPDATE 2: for Rick James
Your query takes only 0.04 sec to accomplish. And explain says that index_companies_on_deleted_at_and_post_index_id index is used. So yes, it works better, but this doesn't solve my problem (need to order on post_indices columns, will do this in future, so id post_indices.id used for simplicity of example. In future it will be for example post_indices.city).
My query with WHERE, but without ORDER BY is instant.
UPDATE 3:
EXPLAIN query. Also I noticed that order of indexes matters. index_companies_on_deleted_at index is used if it's higher (created earlier) then index_companies_on_deleted_at_and_post_index_id. Otherwise later index is used. I mean automatically selected by MySQL.
mysql> EXPLAIN SELECT * FROM companies INNER JOIN post_indices ON post_indices.id = companies.post_index_id WHERE companies.deleted_at IS NULL ORDER BY post_indices.id LIMIT 1;
+----+-------------+--------------+------------+--------+----------------------------------------------------------------------------------------------------------------+-------------------------------+---------+------------------------------------------------------+--------+----------+---------------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+--------+----------------------------------------------------------------------------------------------------------------+-------------------------------+---------+------------------------------------------------------+--------+----------+---------------------------------------------------------------------+
| 1 | SIMPLE | companies | NULL | ref | index_companies_on_post_index_id,index_companies_on_deleted_at,index_companies_on_deleted_at_and_post_index_id | index_companies_on_deleted_at | 6 | const | 638692 | 100.00 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | post_indices | NULL | eq_ref | PRIMARY | PRIMARY | 4 | enbro_purecrm_eu_development.companies.post_index_id | 1 | 100.00 | NULL |
+----+-------------+--------------+------------+--------+----------------------------------------------------------------------------------------------------------------+-------------------------------+---------+------------------------------------------------------+--------+----------+---------------------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT * FROM companies USE INDEX(index_companies_on_post_index_id) INNER JOIN post_indices ON post_indices.id = companies.post_index_id WHERE companies.deleted_at IS NULL ORDER BY post_indices.id LIMIT 1;
+----+-------------+--------------+------------+--------+----------------------------------+---------+---------+------------------------------------------------------+---------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+--------+----------------------------------+---------+---------+------------------------------------------------------+---------+----------+----------------------------------------------+
| 1 | SIMPLE | companies | NULL | ALL | index_companies_on_post_index_id | NULL | NULL | NULL | 1277385 | 10.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | post_indices | NULL | eq_ref | PRIMARY | PRIMARY | 4 | enbro_purecrm_eu_development.companies.post_index_id | 1 | 100.00 | NULL |
+----+-------------+--------------+------------+--------+----------------------------------+---------+---------+------------------------------------------------------+---------+----------+----------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT * FROM companies USE INDEX(index_companies_on_deleted_at_and_post_index_id) INNER JOIN post_indices ON post_indices.id = companies.post_index_id WHERE companies.deleted_at IS NULL ORDER BY post_indices.id LIMIT 1;
+----+-------------+--------------+------------+--------+-------------------------------------------------+-------------------------------------------------+---------+------------------------------------------------------+--------+----------+--------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+--------+-------------------------------------------------+-------------------------------------------------+---------+------------------------------------------------------+--------+----------+--------------------------------------------------------+
| 1 | SIMPLE | companies | NULL | ref | index_companies_on_deleted_at_and_post_index_id | index_companies_on_deleted_at_and_post_index_id | 6 | const | 638692 | 100.00 | Using index condition; Using temporary; Using filesort |
| 1 | SIMPLE | post_indices | NULL | eq_ref | PRIMARY | PRIMARY | 4 | enbro_purecrm_eu_development.companies.post_index_id | 1 | 100.00 | NULL |
+----+-------------+--------------+------------+--------+-------------------------------------------------+-------------------------------------------------+---------+------------------------------------------------------+--------+----------+--------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
UPDATE 4:
I've removed non related columns:
| companies | CREATE TABLE `companies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`address` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`post_index_id` int(11) DEFAULT NULL,
`vat` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`note` text COLLATE utf8_unicode_ci,
`state` varchar(255) COLLATE utf8_unicode_ci NOT NULL DEFAULT 'new',
`deleted_at` datetime DEFAULT NULL,
`lead_list_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `index_companies_on_vat` (`vat`),
KEY `index_companies_on_post_index_id` (`post_index_id`),
KEY `index_companies_on_state` (`state`),
KEY `index_companies_on_deleted_at` (`deleted_at`),
KEY `index_companies_on_deleted_at_and_post_index_id` (`deleted_at`,`post_index_id`),
KEY `index_companies_on_lead_list_id` (`lead_list_id`),
CONSTRAINT `fk_rails_5fc7f5c6b9` FOREIGN KEY (`lead_list_id`) REFERENCES `lead_lists` (`id`),
CONSTRAINT `fk_rails_79719355c6` FOREIGN KEY (`post_index_id`) REFERENCES `post_indices` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2523518 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
| post_indices | CREATE TABLE `post_indices` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`county` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`postal_code` int(11) DEFAULT NULL,
`group_part` int(11) DEFAULT NULL,
`group_number` int(11) DEFAULT NULL,
`group_name` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`city` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3101 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
UPDATE 5:
Another developer tested same query on his local machine with exactly same data set (dump/restore). And he got totally different explain:
mysql> explain SELECT * FROM companies INNER JOIN post_indices ON companies.post_index_id = post_indices.id WHERE companies.deleted_at is NULL ORDER BY post_indices.id LIMIT 1;
+----+-------------+--------------+-------+----------------------------------------------------------------------------------------------------------------+-------------------------------------------------+---------+----------------------------------------------------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+----------------------------------------------------------------------------------------------------------------+-------------------------------------------------+---------+----------------------------------------------------+------+-----------------------+
| 1 | SIMPLE | post_indices | index | PRIMARY | PRIMARY | 4 | NULL | 1 | NULL |
| 1 | SIMPLE | companies | ref | index_companies_on_post_index_id,index_companies_on_deleted_at,index_companies_on_deleted_at_and_post_index_id | index_companies_on_deleted_at_and_post_index_id | 11 | const,enbro_purecrm_eu_development.post_indices.id | 283 | Using index condition |
+----+-------------+--------------+-------+----------------------------------------------------------------------------------------------------------------+-------------------------------------------------+---------+----------------------------------------------------+------+-----------------------+
2 rows in set (0,00 sec)
Same query on his PC is instant. Have no idea why it is happening.. I've also tried to use STRAIGHT_JOIN. When I force post_indices table to be read first by MySQL, it is blazing fast too. But still it is mistery for me, why same query on another machine is fast (mysql -v 5.6.27) and slow on my machine (mysql -v 5.7.10)
So it seems that problem is MySQL using wrong table as first table to read.
Does this work better?
SELECT * FROM companies AS c
INNER JOIN post_indices AS pi
ON c.post_index_id = pi.id
WHERE c.deleted_at is NULL
ORDER BY c.post_index_id -- Note
LIMIT 1;
INDEX(deleted_at, post_index_id) -- note
For that matter, how fast does it run with the WHERE, but without the ORDER BY?
Using the following optimizer hints, should force MySQL to use the plan that your colleague observed:
SELECT * FROM post_indices
STRAIGHT_JOIN companies FORCE INDEX(index_companies_on_deleted_at_and_post_index_id)
ON companies.post_index_id = post_indices.id
WHERE companies.deleted_at is NULL
ORDER BY post_indices.id
LIMIT 1;
If you will be sorting on other columns of post_indices, you will need an index on those columns to make this plan work well.
Note that what is the most optimal plan will depend on how frequent deleted_at is NULL. If deleted_at is frequently NULL, the above plan will be fast. If not, with the above plan one will have to run through many rows of post_indices before a match is found. Note also that for queries with OFFSET, the same plan may not be the most effective.
I think the issue here is that MySQL decides the join order without considering the effects of ORDER BY and LIMIT. In other words, it will choose the join order that it thinks is fastest to execute the full join.
Since there is a restriction on the companies table (deleted_at is NULL), I am not surprised that it will start with this table.

Optimizing an InnoDB table and a problematic query

I have a biggish InnoDB table which at this moment contains about 20 million rows with ~20000 new rows inserted every day. They contain messages for different topics.
CREATE TABLE IF NOT EXISTS `Messages` (
`ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`TopicID` bigint(20) unsigned NOT NULL,
`DATESTAMP` int(11) DEFAULT NULL,
`TIMESTAMP` int(10) unsigned NOT NULL,
`Message` mediumtext NOT NULL,
`Checksum` varchar(50) DEFAULT NULL,
`Nickname` varchar(80) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `TopicID` (`TopicID`,`Checksum`),
KEY `DATESTAMP` (`DATESTAMP`),
KEY `Nickname` (`Nickname`),
KEY `TIMESTAMP` (`TIMESTAMP`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=25195126 ;
NOTE: The Cheksum stores an MD5 checksum which prevents same messages inserted twice in the same topics. (nickname + timestamp + topicid + last 20 chars of message)
The site I'm building has a newsfeed in which users can select to view newest messages from different Nicknames from different forums. The query is as follows:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Topics.SubforumID = Subforums.ID
WHERE
Messages.Nickname = FollowedNicknames.Nickname AND
Messages.TopicID = Topics.ID AND Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
The TIMESTAMP contains an unix timestamp and DATESTAMP is simply a date generated from the unix timestamp for faster access via '=' operator instead of range scans with unix timestamps.
The problem is, this query takes about 13 seconds ( or more ) unbuffered. That is of course unacceptable for the intented usage. Adding the DATESTAMP seemed to speed things up, but not by much.
At this point, I don't really know what should I do. I've read about composite primary keys, but I am still unsure whether they would do any good and how to correctly implement one in this particular case.
I know that using BIGINTs may be a little overkill, but do they affect that much?
EXPLAIN:
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | FollowedNicknames | ALL | UserID,ForumID,Nickname | NULL | NULL | NULL | 8 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | Forums | eq_ref | PRIMARY | PRIMARY | 8 | database.FollowedNicknames.ForumiID | 1 | NULL |
| 1 | SIMPLE | Messages | ref | TopicID,DATETIME,Nickname | Nickname | 242 | database.FollowedNicknames.Nickname | 15 | Using where |
| 1 | SIMPLE | Topics | eq_ref | PRIMARY,SubforumID | PRIMARY | 8 | database.Messages.TopicID | 1 | NULL |
| 1 | SIMPLE | Subforums | eq_ref | PRIMARY,ForumID | PRIMARY | 8 | database.Topics.SubforumID | 1 | Using where |
+----+-------------+-----------------------+--------+---------------------------------------+------------+---------+-----------------------------------------------+------+----------------------------------------------+
You shouldn't be JOINing on a VARCHAR column (Nickname); you should use the user ID to join those tables. That is definitely slowing the query down and is probably the biggest issue. It would also be easier to follow if you wrote all of the JOINs explicitly instead of at the end in the WHERE clause like this:
SELECT
Messages.ID AS MessageID,
Messages.Message,
Messages.TIMESTAMP,
Messages.Nickname,
Topics.ID AS TopicID,
Topics.Title AS TopicTitle,
Forums.Title AS ForumTitle
FROM Messages
JOIN FollowedNicknames ON Messages.Nickname = FollowedNicknames.Nickname
AND FollowedNicknames.UserID = 'MYUSERID'
JOIN Forums ON Forums.ID = FollowedNicknames.ForumID
JOIN Subforums ON Subforums.ForumID = Forums.ID
JOIN Topics ON Messages.TopicID = Topics.ID
AND Topics.SubforumID = Subforums.ID
WHERE Messages.DATESTAMP = '2013619'
ORDER BY Messages.TIMESTAMP DESC
Instead of INT as the data type for the DATESTAMP column, I would use DATE. The Checksum column should probably use latin1_general_ci as the collation. I would use INT for the ID columns as long as their values are less than 2,000,000,000 since INT UNSIGNED can store values up to roughly 4,000,000,000. InnoDB is affected by the primary key much more than MyISAM and it could make a noticeable difference.

Excluding large sets of objects from a query on a table with fast changing order

I have a table of products with a score column, which has a B-Tree Index on it. I have a query which returns products that have not been shown to the user in the current session. I can't simply use simple pagination with LIMIT for it, because the result should be ordered by the score column, which can change between query calls.
My current solution works like this:
SELECT *
FROM products p
LEFT JOIN product_seen ps
ON (ps.session_id = ? AND p.product_id = ps.product_id )
WHERE ps.product_id is null
ORDER BY p.score DESC
LIMIT 30;
This works fine for the first few pages, but the response time grows linear to the number of products already shown in the session and hits the second mark by the time this number reaches ~300. Is there a way to fasten this up in MySQL? Or should I solve this problem in an entirely other way?
Edit:
These are the two tables:
CREATE TABLE `products` (
`product_id` int(15) NOT NULL AUTO_INCREMENT,
`shop` varchar(15) NOT NULL,
`shop_id` varchar(25) NOT NULL,
`shop_category_id` varchar(20) DEFAULT NULL,
`shop_subcategory_id` varchar(20) DEFAULT NULL,
`shop_designer_id` varchar(20) DEFAULT NULL,
`shop_designer_name` varchar(40) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`product_url` varchar(255) NOT NULL,
`name` varchar(255) NOT NULL,
`description` mediumtext NOT NULL,
`price_cents` int(10) NOT NULL,
`list_image_url` varchar(255) NOT NULL,
`list_image_height` int(4) NOT NULL,
`ending` timestamp NULL DEFAULT NULL,
`category_id` int(5) NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`included_at` timestamp NULL DEFAULT NULL,
`hearts` int(5) NOT NULL,
`score` decimal(10,5) NOT NULL,
`rand_field` decimal(16,15) NOT NULL,
`last_score_update` timestamp NULL DEFAULT NULL,
`active` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`product_id`),
UNIQUE KEY `unique_shop_id` (`shop`,`shop_id`),
KEY `score_index` (`active`,`score`),
KEY `included_at_index` (`included_at`),
KEY `active_category_score` (`active`,`category_id`,`score`),
KEY `active_category` (`active`,`category_id`,`product_id`),
KEY `active_products` (`active`,`product_id`),
KEY `active_rand` (`active`,`rand_field`),
KEY `active_category_rand` (`active`,`category_id`,`rand_field`)
) ENGINE=InnoDB AUTO_INCREMENT=55985 DEFAULT CHARSET=utf8
CREATE TABLE `product_seen` (
`seenby_id` int(20) NOT NULL AUTO_INCREMENT,
`session_id` varchar(25) NOT NULL,
`product_id` int(15) NOT NULL,
`last_seen` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`sorting` varchar(10) NOT NULL,
`in_category` int(3) DEFAULT NULL,
PRIMARY KEY (`seenby_id`),
KEY `last_seen_index` (`last_seen`),
KEY `session_id` (`session_id`,`seenby_id`),
KEY `session_id_2` (`session_id`,`sorting`,`seenby_id`)
) ENGINE=InnoDB AUTO_INCREMENT=17431 DEFAULT CHARSET=utf8
Edit 2:
The query above is a simplification, this is the real query with EXPLAIN:
EXPLAIN SELECT
DISTINCT p.product_id AS id,
p.list_image_url AS image,
p.list_image_height AS list_height,
hearts,
active AS available,
(UNIX_TIMESTAMP( ) - ulp.last_action) AS last_loved
FROM `looksandgoods`.`products` p
LEFT JOIN `looksandgoods`.`user_likes_products` ulp
ON ( p.product_id = ulp.product_id AND ulp.user_id =1 )
LEFT JOIN `looksandgoods`.`product_seen` sb
ON (sb.session_id = 'y7lWunZKKABgMoDgzjwDjZw1'
AND sb.sorting = 'trend'
AND p.product_id = sb.product_id )
WHERE p.active =1
AND sb.product_id IS NULL
ORDER BY p.score DESC
LIMIT 30 ;
Explain output, there is still a temp table and filesort, although the keys for the join exist:
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | p | range | score_index,active_category_score,active_category,active_products,active_rand,active_category_rand | score_index | 1 | NULL | 2299 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | ulp | ref | love_count_index,user_to_product_index,product_id | love_count_index | 9 | looksandgoods.p.product_id,const | 1 | |
| 1 | SIMPLE | sb | ref | session_id,session_id_2 | session_id | 77 | const | 711 | Using where; Not exists; Distinct |
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
New answer
I think the problem with the real query is the DISTINCT clause. The implication is that either or both of the product_seen and user_likes_products tables can join multiple rows for each product_id which could potentially appear in the result set (given the somewhat disturbing lack of UNIQUE KEYs on the product_seen table), and this is the reason you've included the DISTINCT clause. Unfortunately, it also means MySQL will have to create a temp table to process the query.
Before I go any further, if it's possible to do...
ALTER TABLE product_seen ADD UNIQUE KEY (session_id, product_id, sorting);
...and...
ALTER TABLE user_likes_products ADD UNIQUE KEY (user_id, product_id);
...then the DISTINCT clause is redundant, and removing it should eliminate the problem. N.B. I'm not suggesting you necessarily need to add these keys, but rather just to confirm that these fields are always unique.
If it's not possible, then there may be another solution, but I'd need to know a lot more about the tables involved in the joins.
Old answer
An EXPLAIN for your query yields...
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
| 1 | SIMPLE | p | ALL | NULL | NULL | NULL | NULL | 10 | Using filesort |
| 1 | SIMPLE | ps | ref | session_id | session_id | 27 | const | 1 | Using where; Not exists |
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
...which shows it's not using an index on the products table, so it's having to do a table scan and a filesort, which is why it's slow.
I noticed there's an index on (active, score) which you could use by changing the query to only show active products...
SELECT *
FROM products p
LEFT JOIN product_seen ps
ON (ps.session_id = ? AND p.product_id = ps.product_id )
WHERE p.active=TRUE AND ps.product_id is null
ORDER BY p.score DESC
LIMIT 30;
...which changes the EXPLAIN to...
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
| 1 | SIMPLE | p | range | score_index,active_products | score_index | 1 | NULL | 10 | Using where |
| 1 | SIMPLE | ps | ref | session_id | session_id | 27 | const | 1 | Using where; Not exists |
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
...which is now doing a range scan and no filesort, which should be much faster.
Or if you want it to also return inactive products, then you'll need to add an index on score only, with...
ALTER TABLE products ADD KEY (score);