How to prevent a full table scan when doing a simple JOIN? - mysql

I have two tables, TableA and TableB:
CREATE TABLE `TableA` (
`shared_id` int(10) unsigned NOT NULL default '0',
`foo` int(10) unsigned NOT NULL,
PRIMARY KEY (`shared_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
CREATE TABLE `TableB` (
`shared_id` int(10) unsigned NOT NULL auto_increment,
`bar` int(10) unsigned NOT NULL,
KEY `shared_id` (`shared_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1001 DEFAULT CHARSET=latin1
Here's my query:
SELECT TableB.bar
FROM TableB, TableA
WHERE TableA.foo = 1000
AND TableA.shared_id = TableB.shared_id;
Here's the problem:
mysql> explain SELECT TableB.bar FROM TableB, TableA WHERE TableA.foo = 1000 AND TableA.shared_id = TableB.shared_id;
+----+-------------+--------------+--------+---------------+---------+---------+------------------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+---------------+---------+---------+------------------------------------------+------+-------------+
| 1 | SIMPLE | TableB | ALL | shared_id | NULL | NULL | NULL | 1000 | |
| 1 | SIMPLE | TableA | eq_ref | PRIMARY | PRIMARY | 4 | MyDatabase.TableB.shared_id | 1 | Using where |
+----+-------------+--------------+--------+---------------+---------+---------+------------------------------------------+------+-------------+
Is there an index that I can add that will prevent the full table scan of TableB?

Runcible, your query could use some rewriting. You should always specify your JOIN conditions in an ON clause rather than in the WHERE clause.
Your query would become:
SELECT TableB.bar
FROM TableB
JOIN TableA
ON TableB.shared_id = TableA.shared_id
AND TableA.foo = 1000;
Not only do you want to add this index:
ALTER TABLE TableB ADD INDEX (shared_id, bar);
You'll also want to add an index to TableA as follows:
ALTER TABLE TableA ADD INDEX (foo, shared_id);
Do this, and provide the EXPLAIN output please.
Also note that by adding an index on (shared_id, bar) you just made your (shared_id) index redundant. Drop it.
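For example, both changes in one statement (a sketch; the composite index name shared_id_bar is illustrative, and the key being dropped is the shared_id key from the CREATE TABLE above):
ALTER TABLE TableB
ADD INDEX shared_id_bar (shared_id, bar),
DROP INDEX shared_id;
With (shared_id, bar), the index "covers" the query: MySQL can read bar directly from the index instead of scanning the table rows.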

SELECT statement optimization MySQL

I am looking for a way to make my SELECT query even faster than it is now, because I have a feeling it should be possible.
Here is the query
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
Here is EXPLAIN SELECT
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | tp | eq_ref | PRIMARY | PRIMARY | 4 | dev.r.id_pair | 1 | |
+----+-------------+-------+--------+----------------+----------------+---------+---------------+--------+---------------------------------------------------------------------+
Now the tables are
CREATE TABLE `tag_product` (
`id_pair` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_product` int(10) unsigned NOT NULL,
`id_user_tag` int(10) unsigned NOT NULL,
`status` tinyint(3) NOT NULL,
`percentile` decimal(8,4) unsigned NOT NULL,
`percentile_weighted` decimal(8,4) unsigned NOT NULL,
`elo` int(10) unsigned NOT NULL,
`date_add` datetime NOT NULL,
`date_upd` datetime NOT NULL,
PRIMARY KEY (`id_pair`),
UNIQUE KEY `id_product_id_user_tag` (`id_product`,`id_user_tag`),
KEY `status` (`status`),
KEY `id_user_tag` (`id_user_tag`),
CONSTRAINT `tag_product_ibfk_5` FOREIGN KEY (`id_user_tag`) REFERENCES `user_tag` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `tag_rating` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_customer` int(10) unsigned NOT NULL,
`id_pair` int(10) unsigned NOT NULL,
`id_duel` int(10) unsigned NOT NULL,
`value` tinyint(4) NOT NULL,
`date_add` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_duel_id_pair` (`id_duel`,`id_pair`),
KEY `id_pair_id_customer` (`id_pair`,`id_customer`),
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
KEY `id_customer_value_date_add` (`id_customer`,`value`,`date_add`),
CONSTRAINT `tag_rating_ibfk_3` FOREIGN KEY (`id_pair`) REFERENCES `tag_product` (`id_pair`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `tag_rating_ibfk_6` FOREIGN KEY (`id_duel`) REFERENCES `tag_rating_duel` (`id_duel`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
The table tag_product has about 250k rows and tag_rating has about 1M rows.
My issue is that the SQL query takes about 0.8s on average on my machine. I would like to get it under 0.5s, ideally, while also assuming the tables can grow about 10 times bigger. The number of rows in play should stay about the same, because I have a date condition (I only want rows less than a month old).
Is it possible to make this faster with some trick (i.e. without restructuring my tables)? When I slightly modify the statement (don't join the smaller table) as
SELECT r.id_customer, COUNT(*)
FROM tag_rating AS r USE INDEX (value_date_add)
WHERE
r.value = 1 AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer;
here is EXPLAIN SELECT
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
| 1 | SIMPLE | r | ref | value_date_add | value_date_add | 1 | const | 449502 | Using index condition; Using where; Using temporary; Using filesort |
+----+-------------+-------+------+----------------+----------------+---------+-------+--------+---------------------------------------------------------------------+
it takes about 0.25s, which is great. So the JOIN makes it 3x slower. Is that inevitable? I feel that since I am joining via the primary key, it shouldn't make the query 3x slower.
---UPDATE---
This is actually my query. The number of distinct id_customer values is about 1 thousand and is expected to rise; the number of rows with value=1 is exactly half. So far, query performance seems to slow down linearly with the number of rows in the rating table.
Adding id_pair to the end of the id_customer_value_date_add (or value_id_customer_date_add) index doesn't help.
SELECT r.id_customer, ROUND(AVG(tp.percentile_weighted), 2) AS percentile
FROM tag_rating AS r USE INDEX (id_customer_value_date_add)
JOIN tag_product AS tp ON (tp.id_pair = r.id_pair)
WHERE
r.value = 1 AND
r.id_customer IN (2593179,1461878,2318871,2654090,2840415,2852531,2987432,3473275,3960453,3961798,4129734,4191571,4202912,4204817,4211263,4248789,765650,1341317,1430380,2116196,3367674,3701901,3995273,4118307,4136114,4236589,783262,913493,1034296,2626574,3574634,3785772,2825128,4157953,3331279,4180367,4208685,4287879,1038898,1445750,1975108,3658055,4185296,4276189,428693,4248631,1892448,3773855,2901524,3830868,3934786) AND
r.date_add > '2020-08-08 11:56:00'
GROUP BY r.id_customer
This is EXPLAIN SELECT
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
| 1 | SIMPLE | r | range | id_customer_value_date_add | id_customer_value_date_add | 10 | | 558906 | Using where; Using index |
| 1 | SIMPLE | tp | eq_ref | PRIMARY,status | PRIMARY | 4 | dev.r.id_pair | 1 | Using where |
+----+-------------+-------+--------+----------------------------+----------------------------+---------+----------------------------------+--------+--------------------------+
Any tips are appreciated. Thank you.
INDEX(value, date_add, id_customer, id_pair)
would be "covering", giving an extra performance boost for both queries, and also for Gordon's formulation.
At the same time, get rid of these:
KEY `value` (`value`),
KEY `value_date_add` (`value`,`date_add`),
because they might get in the way of the Optimizer picking the new index. Any other queries that were using those indexes will easily use the new index.
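A sketch of those index changes in one statement (the new index name is illustrative):
ALTER TABLE tag_rating
ADD INDEX value_date_add_customer_pair (value, date_add, id_customer, id_pair),
DROP INDEX `value`,
DROP INDEX value_date_add;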
If you are not otherwise using tag_rating.id, get rid of it and promote the UNIQUE to PRIMARY KEY.
Try writing the query using a correlated subquery:
SELECT r.id_customer,
(SELECT ROUND(AVG(tp.percentile_weighted), 2)
FROM tag_product tp
WHERE tp.id_pair = r.id_pair
) AS percentile
FROM tag_rating AS r
WHERE r.value = 1 AND
r.date_add > '2020-08-08 11:56:00';
This eliminates the outer aggregation, which should be faster.

Any possibility to speed up WHERE IN or replace it with a faster alternative?

I am trying to speed up the SELECT query below, where I have over 1000 items in the WHERE IN list.
table:
CREATE TABLE `user_item` (
`user_id` int(11) unsigned NOT NULL,
`item_id` int(11) unsigned NOT NULL,
PRIMARY KEY (`user_id`,`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
query:
SELECT
item_id
FROM
user_item
WHERE
user_id = 2
AND item_id IN(3433456,67584634,587345,...)
With 1000 items in the IN list, the query takes about 3 seconds to execute. Is there any optimization that can be done in this case? There can be billions of rows in this table. Is there an alternative way to do this faster, be it with another DB or a programming method?
UPDATE:
Here are the results of EXPLAIN:
If I have 999 items in the IN(...) statement:
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | user_item | range | PRIMARY | PRIMARY | 8 | NULL | 999 | Using where; Using index |
+------+-------------+----------+-------+---------------+---------+---------+------+------+--------------------------+
If I have 1000 items in the IN(...) statement:
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| 1 | PRIMARY | <subquery2> | ALL | distinct_key | NULL | NULL | NULL | 1000 | |
| 1 | PRIMARY | user_item | eq_ref | PRIMARY | PRIMARY | 8 | const,tvc_0._col_1 | 1 | Using where; Using index |
| 2 | MATERIALIZED | <derived3> | ALL | NULL | NULL | NULL | NULL | 1000 | |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
Update 2
I want to explain why I need to do the above:
I want to give the user the ability to list items ordered by sort_criteria_1, sort_criteria_2, or sort_criteria_3, and to exclude from the list those items that have been marked by the given (n) users in the user_item table.
Here's sample schema:
CREATE TABLE `user` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(45) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `item` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`file` varchar(45) NOT NULL,
`sort_criteria_1` int(11) DEFAULT NULL,
`sort_criteria_2` int(11) DEFAULT NULL,
`sort_criteria_3` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_sc1` (`sort_criteria_1`),
KEY `idx_sc2` (`sort_criteria_2`),
KEY `idx_sc3` (`sort_criteria_3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `user_item` (
`user_id` int(11) NOT NULL,
`item_id` int(11) NOT NULL,
PRIMARY KEY (`user_id`,`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Here's how I would get items ordered by sort_criteria_2, excluding ones that have a record for users (300, 6, 1344, 24) in the user_item table:
SELECT
i.id
FROM
item i
LEFT JOIN user_item ui1 ON (i.id = ui1.item_id AND ui1.user_id = 300)
LEFT JOIN user_item ui2 ON (i.id = ui2.item_id AND ui2.user_id = 6)
LEFT JOIN user_item ui3 ON (i.id = ui3.item_id AND ui3.user_id = 1344)
LEFT JOIN user_item ui4 ON (i.id = ui4.item_id AND ui4.user_id = 24)
WHERE
ui1.item_id IS NULL
AND ui2.item_id IS NULL
AND ui3.item_id IS NULL
AND ui4.item_id IS NULL
ORDER BY
i.sort_criteria_2
LIMIT
800
The main problem with the above approach is that the more users I filter by, the more expensive the query gets. I want the toll for filtering to be paid by the client browser, so I would send a list of items, plus a list of matching user_item records per user, to the client to filter by. This would help with sharding as well, since I would not have to keep the user_item tables, or sets of records, on the same machine.
It's hard to tell exactly, but there could be lag in parsing your huge query because of the many constant item_id values.
Have you tried getting just all the values by user_id? As this field is the first (leading) one in the PRIMARY KEY, the relevant index would still be used.
Have you tried replacing the constant list with a subquery? Maybe you're interested in items of a specific type, for example.
Make sure that you use prepared statements, at least if your database and language support them. This protects your code from possible SQL injection and enables the database's built-in query caching (if your database supports it).
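A minimal sketch of MySQL's server-side prepared-statement syntax (most client libraries expose the same idea through ? placeholders; the statement name and user variables here are illustrative):
PREPARE get_items FROM
'SELECT item_id FROM user_item WHERE user_id = ? AND item_id IN (?, ?, ?)';
SET @uid = 2, @i1 = 3433456, @i2 = 67584634, @i3 = 587345;
EXECUTE get_items USING @uid, @i1, @i2, @i3;
DEALLOCATE PREPARE get_items;
Note that the number of placeholders is fixed at PREPARE time, which is one more reason a very long list fits better in a temporary table, as suggested below.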
Instead of putting the 1000 item_ids into the IN clause, you could put them into an indexed temporary table and join it with the user_item table.
If you also have an index with both user_id and item_id, that would make the query as fast as it gets. The rest depends on the data distribution.
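A sketch of the temporary-table approach (the name tmp_item is illustrative):
CREATE TEMPORARY TABLE tmp_item (
item_id int(11) unsigned NOT NULL,
PRIMARY KEY (item_id)
) ENGINE=InnoDB;
-- load the 1000 ids, ideally in a few batched INSERTs
INSERT INTO tmp_item (item_id) VALUES (3433456), (67584634), (587345);
SELECT ui.item_id
FROM user_item ui
JOIN tmp_item t ON t.item_id = ui.item_id
WHERE ui.user_id = 2;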

mysql group by joined column too slow

I have two tables, events and event_params.
The first table stores the events, with these columns:
CREATE TABLE `events` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`project` varchar(24) NOT NULL,
`event` varchar(24) NOT NULL,
`date` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `project` (`project`,`event`)
) ENGINE=InnoDB AUTO_INCREMENT=2915335 DEFAULT CHARSET=latin1
and the second stores the parameters for each event, with these columns:
CREATE TABLE `event_params` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`event_id` int(10) unsigned NOT NULL,
`name` varchar(24) NOT NULL,
`value` varchar(524) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`),
KEY `name` (`name`),
KEY `event_id` (`event_id`),
KEY `value` (`value`)
) ENGINE=InnoDB AUTO_INCREMENT=20789391 DEFAULT CHARSET=latin1
Now I want to get the count of events for each of the various values of a specified parameter.
I wrote this query for the campaign parameter, but it is too slow (15 seconds to respond):
SELECT
event_params.value as campaign,
count(*) as count
FROM `events`
left join event_params on event_params.event_id = events.id
and event_params.name = 'campaign'
WHERE events.project = 'foo'
GROUP by event_params.value
and here is the EXPLAIN query result:
+----+-------------+--------------+------------+------+---------------------+----------+---------+------------------+------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------------+----------+---------+------------------+------+----------+----------------------------------------------+
| 1 | SIMPLE | events | NULL | ref | project | project | 26 | const | 1 | 100.00 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | event_params | NULL | ref | name,event_id,value | event_id | 4 | events.events.id | 4 | 100.00 | Using where |
+----+-------------+--------------+------------+------+---------------------+----------+---------+------------------+------+----------+----------------------------------------------+
Can I speed up this query?
You may try adding the following index on the event_params table, which might speed up the join:
CREATE INDEX idx1 ON event_params (event_id, name, value);
The aggregation step probably can't be optimized much because the COUNT operation involves counting each record.
Move the "campaign" value into the main table, with a suitable length for the VARCHAR, and then:
SELECT
campaign,
count(*) as count
FROM `events`
WHERE project = 'foo'
GROUP by campaign
And have
INDEX(project, campaign)
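A sketch of that migration (the 64-character length is an assumption; the index name is illustrative):
ALTER TABLE `events`
ADD COLUMN campaign varchar(64) NOT NULL DEFAULT '',
ADD INDEX project_campaign (project, campaign);
-- one-time backfill from the EAV table:
UPDATE `events` e
JOIN event_params ep ON ep.event_id = e.id AND ep.name = 'campaign'
SET e.campaign = ep.value;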
A bit of advice when tempted to use EAV: Move the 'important' values into the main table; leave only the rarely used or rarely set 'values' in the other table. Also (assuming there are no dups), have
PRIMARY KEY(event_id, name)
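A sketch of that key change (it assumes (event_id, name) pairs are unique and that nothing references event_params.id):
ALTER TABLE event_params
DROP COLUMN id,
ADD PRIMARY KEY (event_id, name);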
More discussion: http://mysql.rjweb.org/doc.php/eav

mysql table join with max()

I have a problem with a query using JOIN and MAX/MIN. For example:
SELECT Min(a.date), Max(a.date)
FROM a
INNER JOIN b ON b.ID = a.ID AND b.cID = 5
Is it possible to add an index or change this query so that it performs better?
Below is the result of EXPLAIN:
+----+-------------+----------+------+-----------------+-----+---------+-----------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+-----------------+-----+---------+-----------+--------+-----------------------+
| 1 | SIMPLE | b | ref | PRIMARY,cID | cID | 5 | const | 680648 | Using index |
| 1 | SIMPLE | a | ref | ID | ID | 5 | base.b.ID | 1 | Using index condition |
+----+-------------+----------+------+-----------------+-----+---------+-----------+--------+-----------------------+
Sorry, but I would rather not post the whole table here; it could cause a lot of confusion.
CREATE TABLE `a` (
`ID` int(11) NOT NULL,
`date` datetime DEFAULT NULL,
PRIMARY KEY (`ID`),
KEY `date` (`date`)
)
CREATE TABLE `b` (
`bID` int(11) NOT NULL,
`ID` int(11) NOT NULL,
`cID` int(11) DEFAULT NULL,
PRIMARY KEY (`bID`),
KEY `cID` (`cID`)
)
b: INDEX(cID, ID)
will make that a "covering" index, so it will probably get through the 680648 rows faster. It should replace the current KEY(cID).
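A sketch of that change (the index name is illustrative):
ALTER TABLE b
ADD INDEX cID_ID (cID, ID),
DROP INDEX cID;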
Key_len for b is 5. That disagrees with the table definition; something got simplified too much.

Estimate/speedup huge table self-join on mysql

I have a huge table:
CREATE TABLE `messageline` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`hash` bigint(20) DEFAULT NULL,
`quoteLevel` int(11) DEFAULT NULL,
`messageDetails_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
KEY `hash_idx` (`hash`),
KEY `quote_level_idx` (`quoteLevel`),
CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to find duplicate lines this way:
create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
messageline ml1
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
But this query has already been running for more than a day. This is too long; a few hours would be OK. How can I speed this up? Thanks.
Explain:
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| 1 | SIMPLE | ml | ALL | hash_idx | NULL | NULL | NULL | 401798409 | |
| 1 | SIMPLE | ml1 | ref | hash_idx | hash_idx | 9 | skryb.ml.hash | 1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
You can find your duplicates like this:
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;
If it is still too long, add a condition to split the request on an indexed field:
WHERE messagedetails_id < 100000
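For example (the range boundary is illustrative; step through the id range in chunks):
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
WHERE messagedetails_id < 100000
GROUP BY messagedetails_id HAVING c > 1;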
Is it required to do this solely with SQL? For such a number of records you would be better off breaking this down into two steps.
First, run the following query:
CREATE TABLE duplicate_hashes
SELECT * FROM (
SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
GROUP_CONCAT(DISTINCT messagedetails_id) as messagedetails_ids
FROM messageline GROUP BY hash HAVING cnt > 1 ORDER BY NULL
) tmp
WHERE cnt > cnt_message_details
This will give you the duplicate IDs for each hash, and since you have an index on the hash field, the GROUP BY will be relatively fast. Now, by counting distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id:
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
Then use a script to check each record of the duplicate_hashes table.
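A sketch of the lookup that script would run per row (the hash literal is a placeholder for each value taken from duplicate_hashes):
SELECT id, messageDetails_id, quoteLevel
FROM messageline
WHERE hash = 1234567890;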