I can't find a way to speed up simple queries on a huge table.
I don't think I'm asking MySQL for anything crazy, even with this amount of data, and I can't understand why the following queries have such different execution times.
I have read every article I could find about big data in MySQL and field optimization, and I have already reduced query time by tightening field types, but I'm getting lost with these simple queries.
Here is an example on MySQL 5.1.69:
SELECT rv.`id_prd`,SUM(`quantite`)
FROM `report_ventes` AS rv
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 3.76 sec
Let's add a LEFT JOIN and another selected field :
SELECT rv.`id_prd`,SUM(`quantite`),`acl_cip_7`
FROM `report_ventes` AS rv
LEFT JOIN `report_produits` AS rp
ON (rv.`id_prd` = rp.`id_prd`)
WHERE `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 12.10 sec
Explain :
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
| 1 | SIMPLE | rv | ALL | periode | NULL | NULL | NULL | 16556188 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rp | eq_ref | PRIMARY | PRIMARY | 4 | main_reporting.rv.id_prd | 1 | Using index |
+----+-------------+-------+--------+---------------+---------+---------+--------------------------+----------+----------------------------------------------+
And let's add another WHERE clause:
SELECT rv.`id_prd`,SUM(`quantite`),`acl_cip_7`
FROM `report_ventes` AS rv
LEFT JOIN `report_produits` AS rp
ON (rv.`id_prd` = rp.`id_prd`)
WHERE rp.`id_clas_prd` LIKE '1%'
AND `periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`
Execution time : 21.00 sec
Explain :
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
| 1 | SIMPLE | rv | ALL | periode | NULL | NULL | NULL | 16556188 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | rp | eq_ref | PRIMARY,id_clas_prd | PRIMARY | 4 | main_reporting.rv.id_prd | 1 | Using where |
+----+-------------+-------+--------+---------------------+---------+---------+--------------------------+----------+----------------------------------------------+
And here are the table definitions:
report_produits : 80 000 rows
CREATE TABLE `report_produits` (
`id_prd` int(11) unsigned NOT NULL,
`acl_cip_7` int(7) NOT NULL,
`acl_cip_ean_13` varchar(255) DEFAULT NULL,
`lib_prd` varchar(255) DEFAULT NULL,
`id_clas_prd` char(7) NOT NULL DEFAULT '',
`id_lab_prd` int(11) unsigned NOT NULL,
`id_rbt_prd` int(11) unsigned NOT NULL,
`id_tva_prd` int(11) unsigned NOT NULL,
`t_gen` varchar(255) NOT NULL,
`id_grp_gen` varchar(16) NOT NULL DEFAULT '',
`id_liste_delivrance` int(11) unsigned NOT NULL,
PRIMARY KEY (`id_prd`),
KEY `index_lab` (`id_lab_prd`),
KEY `index_grp` (`id_grp_gen`),
KEY `id_clas_prd` (`id_clas_prd`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
report_ventes : 16 556 188 rows
CREATE TABLE `report_ventes` (
`id` int(13) NOT NULL AUTO_INCREMENT,
`periode` mediumint(6) DEFAULT NULL,
`id_phie` smallint(4) unsigned NOT NULL,
`id_prd` mediumint(8) unsigned NOT NULL,
`quantite` smallint(11) DEFAULT NULL,
`ca_ht` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `periode` (`periode`)
) ENGINE=MyISAM AUTO_INCREMENT=18491315 DEFAULT CHARSET=utf8;
There is no covering index, so MySQL decides that scanning the whole table is more efficient than using an index and then looking up the requested values row by row.
You are joining report_ventes on id_prd, but that column is not part of the primary key or of any other index on that table, so the server has to read the rows themselves to get its values. The server probably bypasses the periode index because it is not selective enough.
An index that includes the id_prd, periode and quantite columns could help. With such an index there is a chance that MySQL will use it, since it would be a covering index for this query.
Give it a try, but it's hard to be sure without testing it in the actual environment.
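A minimal sketch of the covering index suggested above. The index name idx_prd_periode_qte is made up for illustration, and the column order shown is only one option; whether the optimizer actually picks the index has to be checked with EXPLAIN on the real data.

```sql
-- Hypothetical index name; the columns are the three named in the answer above.
ALTER TABLE `report_ventes`
  ADD INDEX `idx_prd_periode_qte` (`id_prd`, `periode`, `quantite`);

-- Re-check the plan. If the index is used as a covering index,
-- the Extra column for rv should include "Using index".
EXPLAIN
SELECT rv.`id_prd`, SUM(rv.`quantite`)
FROM `report_ventes` AS rv
WHERE rv.`periode` BETWEEN 201301 AND 201312
GROUP BY rv.`id_prd`;
```

With id_prd as the leading column the GROUP BY can follow index order; putting periode first instead would favour the range filter. Which order wins depends on the data, so it is worth trying both.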
Basically, your indexes are not being used. I can't spot the precise reason without trying it on a server, but a common cause is that the compared values have different data types.
AND periode BETWEEN 201301 AND 201312
"periode" has datatype mediumint(6) and the literal "201301" possibly has datatype int(10).
LEFT JOIN `report_produits` AS rp ON (rv.`id_prd` = rp.`id_prd`)
Here the two data types differ as well: rv.id_prd is mediumint(8) unsigned and rp.id_prd is int(11) unsigned.
I have a query with 2 INNER JOINs that only fetches a few columns, but it is very slow even though I have indexes on all the required columns.
My query
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
INNER JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
WHERE
com.prestataireLAD REGEXP '.*'
AND pe_nom REGEXP 'bordeaux|chambéry-annecy|grenoble|lyon|marseille|metz|montpellier|nancy|nice|nimes|rouen|strasbourg|toulon|toulouse|vitry|vitry bis 1|vitry bis 2|vlg'
AND com.date_livraison BETWEEN '2022-06-11 00:00:00'
AND '2022-07-08 00:00:00';
It takes around 20 seconds to compute and fetch 4123 rows.
The problem
To find out what's wrong and why it is so slow, I ran EXPLAIN. Here is the output:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|--------|----------------------------|-------------|---------|------------------------|--------|----------|-------------|
| 1 | SIMPLE | dys | | ALL | id_commande,id_commande_2 | | | | 878588 | 100.00 | Using where |
| 1 | SIMPLE | com | | eq_ref | id_commande,date_livraison | id_commande | 110 | db.dys.id_commande | 1 | 7.14 | Using where |
| 1 | SIMPLE | pe | | ref | pe_id | pe_id | 5 | db.com.code_pe | 1 | 100.00 | Using where |
I can see that the dysfonctionnements join is the problem: it does a full scan and doesn't use a key even though it could...
Table definitions
commandes (included relevant columns only)
CREATE TABLE `commandes` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) NOT NULL DEFAULT '',
`date_commande` datetime NOT NULL,
`date_livraison` datetime NOT NULL,
`code_pe` int(11) NOT NULL,
`traitement_dysfonctionnement` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`),
KEY `date_livraison` (`date_livraison`),
KEY `traitement_dysfonctionnement` (`traitement_dysfonctionnement`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
dysfonctionnements (again, relevant columns only)
CREATE TABLE `dysfonctionnements` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`id_commande` varchar(36) DEFAULT NULL,
`dysfonctionnement` varchar(150) DEFAULT NULL,
`responsable` varchar(50) DEFAULT NULL,
`reimputation` varchar(50) DEFAULT NULL,
`montant` float DEFAULT NULL,
`listRembArticles` text,
PRIMARY KEY (`id`),
UNIQUE KEY `id_commande` (`id_commande`,`dysfonctionnement`),
KEY `id_commande_2` (`id_commande`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
pe (again, relevant columns only)
CREATE TABLE `pe` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`pe_id` int(11) DEFAULT NULL,
`pe_nom` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `pe_nom` (`pe_nom`),
KEY `pe_id` (`pe_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Investigation
If I remove the db.pe table from the query and the WHERE clause on pe_nom, the query takes 1.7 seconds to fetch 7k rows, and with the EXPLAIN statement, I can see it is using keys as I expect it to do:
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|----|-------------|-------|------------|-------|----------------------------|----------------|---------|------------------------|--------|----------|-----------------------------------------------|
| 1 | SIMPLE | com | | range | id_commande,date_livraison | date_livraison | 5 | | 389558 | 100.00 | Using index condition; Using where; Using MRR |
| 1 | SIMPLE | dys | | ref | id_commande,id_commande_2 | id_commande_2 | 111 | ooshop.com.id_commande | 1 | 100.00 | |
I'm open to any suggestions. I see no reason for MySQL not to use the key when it does so on a very similar query, where it definitely makes things faster...
I had a similar experience when the MySQL optimiser chose a join order that was far from optimal. At that time I used the MySQL-specific STRAIGHT_JOIN operator to override the default optimiser behaviour. In your case I would try this:
SELECT
dysfonctionnement,
montant,
listRembArticles,
case when dys.reimputation is not null then dys.reimputation else dys.responsable end as responsable_final
FROM
db.commandes AS com
STRAIGHT_JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
Also, one of the REGEXP conditions in your WHERE clause could probably be replaced with an IN list, which I assume can use an index.
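As a sketch of that idea, here is the same query with the pe_nom REGEXP replaced by an IN list. This assumes pe_nom stores exactly these names: the original unanchored REGEXP also matches any value that merely contains one of them, so the two conditions are only equivalent under that assumption.

```sql
SELECT
  dysfonctionnement,
  montant,
  listRembArticles,
  CASE WHEN dys.reimputation IS NOT NULL THEN dys.reimputation ELSE dys.responsable END AS responsable_final
FROM
  db.commandes AS com
  STRAIGHT_JOIN db.dysfonctionnements AS dys ON com.id_commande = dys.id_commande
  INNER JOIN db.pe AS pe ON com.code_pe = pe.pe_id
WHERE
  com.prestataireLAD REGEXP '.*'
  -- Exact-match list instead of REGEXP (assumes pe_nom holds exactly these values)
  AND pe.pe_nom IN ('bordeaux', 'chambéry-annecy', 'grenoble', 'lyon', 'marseille', 'metz',
                    'montpellier', 'nancy', 'nice', 'nimes', 'rouen', 'strasbourg', 'toulon',
                    'toulouse', 'vitry', 'vitry bis 1', 'vitry bis 2', 'vlg')
  AND com.date_livraison BETWEEN '2022-06-11 00:00:00'
                             AND '2022-07-08 00:00:00';
```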
Remove com.prestataireLAD REGEXP '.*'. The Optimizer probably won't realize that this has no impact on the resultset. If you are dynamically building the WHERE clause, then eliminate anything else you can.
id_commande_2 is redundant. In queries where it might be useful, the UNIQUE can take care of it.
These indexes might help:
com: INDEX(date_livraison, id_commande, code_pe)
pe: INDEX(pe_nom, pe_id)
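Spelled out as DDL, the suggestions above might look like the following. The new index names are invented for illustration; only the column lists come from the answer.

```sql
-- id_commande_2 duplicates the leading column of the UNIQUE key (id_commande, dysfonctionnement).
ALTER TABLE db.dysfonctionnements DROP INDEX id_commande_2;

-- Hypothetical index names; column lists as suggested above.
ALTER TABLE db.commandes ADD INDEX idx_livraison_commande_pe (date_livraison, id_commande, code_pe);
ALTER TABLE db.pe ADD INDEX idx_nom_id (pe_nom, pe_id);
```

Running EXPLAIN on the query again afterwards shows whether the optimizer actually picks them up.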
I'm trying to understand why these two queries are treated differently with regards to use of the primary keys in joins.
This query with a join on icd_codes (the SELECT query, without the EXPLAIN, of course) completes in 56 ms:
EXPLAIN
SELECT var.Var_ID,
var.Gene,
var.HGVSc,
pVCF_145K.PT_ID,
pVCF_145K.AD_ALT,
pVCF_145K.AD_REF,
icd_codes.ICD_NM,
icd_codes.PT_AGE
FROM public.variants_145K var
INNER JOIN public.pVCF_145K USING (Var_ID)
INNER JOIN public.icd_codes using (PT_ID)
# INNER JOIN public.demographics USING (PT_ID)
WHERE Gene IN ('SLC9A6', 'SLC9A7')
AND Canonical
AND impact = 'high'
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+
| 1 | SIMPLE | var | range | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125 | NULL | 280 | Using index condition; Using where |
| 1 | SIMPLE | pVCF_145K | ref | PRIMARY,pVCF_145K_PT_ID_index | PRIMARY | 326 | public.var.Var_ID | 268 | |
| 1 | SIMPLE | icd_codes | ref | PRIMARY | PRIMARY | 38 | public.pVCF_145K.PT_ID | 29 | |
+------+-------------+-----------+-------+------------------------------------------------------------------+---------------------------------+---------+------------------------+------+------------------------------------+
This query with a join on demographics takes over 11 minutes, and I'm not sure how to interpret the difference in the explain results. Why is it resorting to using the join buffer? How can I optimize this further?
EXPLAIN
SELECT variants_145K.Var_ID,
variants_145K.Gene,
variants_145K.HGVSc,
pVCF_145K.PT_ID,
pVCF_145K.AD_ALT,
pVCF_145K.AD_REF,
demographics.Sex,
demographics.Age
FROM public.variants_145K
INNER JOIN public.pVCF_145K USING (Var_ID)
# inner join public.icd_codes using (PT_ID)
INNER JOIN public.demographics USING (PT_ID)
WHERE Gene IN ('SLC9A6', 'SLC9A7')
AND Canonical
AND impact = 'high'
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+
| 1 | SIMPLE | variants_145K | range | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125 | NULL | 280 | Using index condition; Using where |
| 1 | SIMPLE | demographics | ALL | PRIMARY | NULL | NULL | NULL | 1916393 | Using join buffer (flat, BNL join) |
| 1 | SIMPLE | pVCF_145K | eq_ref | PRIMARY,pVCF_145K_PT_ID_index | PRIMARY | 364 | public.variants_145K.Var_ID,public.demographics.PT_ID | 1 | |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+---------+------------------------------------+
Adding a further filter on demographics (WHERE demographics.Platform IS NOT NULL), as shown below, reduces the run time to 38 seconds. However, there are queries where we do not use such filters, so it would be ideal if it could use the primary key on PT_ID in the joins.
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+
| 1 | SIMPLE | variants_145K | range | PRIMARY,variants_145K_Gene_index,variants_145K_Impact_Gene_index | variants_145K_Impact_Gene_index | 125 | NULL | 280 | Using index condition; Using where |
| 1 | SIMPLE | demographics | range | PRIMARY,Demographics_PLATFORM_index | Demographics_PLATFORM_index | 17 | NULL | 258544 | Using index condition; Using where; Using join buffer (flat, BNL join) |
| 1 | SIMPLE | pVCF_145K | eq_ref | PRIMARY,pVCF_145K_PT_ID_index | PRIMARY | 364 | public.variants_145K.Var_ID,public.demographics.PT_ID | 1 | |
+------+-------------+---------------+--------+------------------------------------------------------------------+---------------------------------+---------+-------------------------------------------------------+--------+------------------------------------------------------------------------+
The tables:
create table public.demographics # 1,916,393 rows
(
PT_ID varchar(9) not null
primary key,
Age float(3,1) null,
Status varchar(8) not null,
Sex varchar(7) not null,
Race_1 varchar(41) not null,
Race_2 varchar(41) not null,
Ethnicity varchar(22) not null,
Smoker_flag tinyint(1) not null,
Platform char(4) null,
MyCode_Consent tinyint(1) not null,
MR_ENC_DT date null,
Birthday date null,
Deathday date null,
max_unrelated_145K tinyint unsigned null
);
create index Demographics_PLATFORM_index
on public.demographics (Platform);
create table public.icd_codes # 116,220,141 rows
(
PT_ID varchar(9) not null,
ICD_CD varchar(8) not null,
ICD_NM varchar(217) not null,
DX_DT date not null,
PT_AGE float(3,1) unsigned not null,
CODE_SYSTEM char(7) not null,
primary key (PT_ID, ICD_CD, DX_DT)
);
create table public.pVCF_145K # 10,113,244,082 rows
(
Var_ID varchar(81) not null,
PT_ID varchar(9) not null,
GT tinyint unsigned not null,
GQ smallint unsigned not null,
AD_REF smallint unsigned not null,
AD_ALT smallint unsigned not null,
DP smallint unsigned not null,
FT varchar(30) null,
primary key (Var_ID, PT_ID)
);
create index pVCF_145K_PT_ID_index
on public.pVCF_145K (PT_ID);
create table public.variants_145K # 151,314,917 rows
(
Var_ID varchar(81) not null,
Gene varchar(22) null,
Feature varchar(18) not null,
Feature_type varchar(10) null,
HIGH_INF_POS tinyint(1) null,
Consequence varchar(26) not null,
rsid varchar(34) null,
Impact varchar(8) not null,
Canonical tinyint(1) not null,
Exon smallint unsigned null,
Intron smallint unsigned null,
HGVSc varchar(323) null,
HGVSp varchar(196) null,
AA_position smallint unsigned null,
gnomAD_NFE_MAF float null,
SIFT varchar(14) null,
PolyPhen varchar(17) null,
GHS_Hom mediumint(5) unsigned null,
GHS_Het mediumint(5) unsigned null,
GHS_WT mediumint(5) unsigned null,
IDT_MAF float null,
VCR_MAF float null,
UKB_MAF float null,
Chr tinyint unsigned not null,
Pos int(9) unsigned not null,
Ref varchar(298) not null,
Alt varchar(306) not null,
primary key (Var_ID, Feature)
);
create index variants_145K_Chr_Pos_Ref_Alt_index
on public.variants_145K (Chr, Pos, Ref, Alt);
create index variants_145K_Gene_index
on public.variants_145K (Gene);
create index variants_145K_Impact_Gene_index
on public.variants_145K (Impact, Gene);
create index variants_145K_rsid_index
on public.variants_145K (rsid);
This is on MariaDB 10.5.8 (innodb)
Thank you!
For var (variants_145K), INDEX(impact, canonical, gene) or INDEX(canonical, impact, gene) would be better than the existing (Impact, Gene) index.
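A sketch of the first variant, written in the same style as the existing index definitions. The index name is made up. The check query writes the Canonical test as an explicit equality, on the assumption that the column only holds 0 or 1; a bare "AND Canonical" may not let the optimizer use that index column.

```sql
-- Hypothetical index name; column order follows the first suggestion above.
create index variants_145K_Impact_Canonical_Gene_index
    on public.variants_145K (Impact, Canonical, Gene);

-- Check whether the new index is chosen for the filter.
EXPLAIN
SELECT Var_ID, Gene, HGVSc
FROM public.variants_145K
WHERE Gene IN ('SLC9A6', 'SLC9A7')
  AND Canonical = 1   -- assumes Canonical is only ever 0 or 1
  AND Impact = 'high';
```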
If you don't need it, remove INNER JOIN public.icd_codes USING (PT_ID). It is costly to reach into that table, and all it does is filter out any rows that fail in the JOIN.
Ditto for demographics.
The "join buffer" is not always a last resort; it is often the fast way, especially if most of the table is needed and the join buffer (join_buffer_size) is big enough.
More
Note that demographics has a single-column PRIMARY KEY(PT_ID), but the other table has a composite PK. This probably impacts whether the Optimizer will even consider using the "join buffer".
Depending on a lot of things (in the query and the data), the Optimizer may make the wrong choice between join_buffer and repeatedly doing lookups.
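If the optimizer does pick the join buffer when lookups would be cheaper, one way to test the alternative is to pin the join order with the STRAIGHT_JOIN select modifier, so that tables are joined in the order they are listed. This is only an experiment to compare timings, not a claim that it will be faster here.

```sql
-- STRAIGHT_JOIN forces the listed order: variants_145K, then pVCF_145K, then demographics.
-- demographics is then reached last, where a lookup by its primary key (PT_ID) is possible.
SELECT STRAIGHT_JOIN
       variants_145K.Var_ID,
       variants_145K.Gene,
       variants_145K.HGVSc,
       pVCF_145K.PT_ID,
       pVCF_145K.AD_ALT,
       pVCF_145K.AD_REF,
       demographics.Sex,
       demographics.Age
FROM public.variants_145K
         INNER JOIN public.pVCF_145K USING (Var_ID)
         INNER JOIN public.demographics USING (PT_ID)
WHERE Gene IN ('SLC9A6', 'SLC9A7')
  AND Canonical
  AND impact = 'high';
```

Prefixing the statement with EXPLAIN shows whether demographics now appears last with type eq_ref.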
I have two databases, one for dev and one for staging, both on the same machine. I'm having a problem with a query on two tables. Here are the schemas for the tables:
Table 1 schema:
Table: import_schedule_t
Create Table: CREATE TABLE `import_schedule_t` (
`id` int(11) NOT NULL,
`theater_id` int(11) NOT NULL,
`movie_code` varchar(20) NOT NULL,
`start_time` datetime NOT NULL,
`end_time` datetime NOT NULL,
`pc_url` varchar(250) NOT NULL,
`mb_url` varchar(250) NOT NULL,
`url_type` int(11) DEFAULT '0',
`active` int(11) DEFAULT '1',
`intime` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
`utime` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`schedule_date` datetime NOT NULL,
`movie_name` text NOT NULL,
`screen_name` text NOT NULL,
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
and Table 2 schema:
Table: wp_postmeta
Create Table: CREATE TABLE `wp_postmeta` (
`meta_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`post_id` bigint(20) unsigned NOT NULL DEFAULT '0',
`meta_key` varchar(255) DEFAULT NULL,
`meta_value` longtext,
PRIMARY KEY (`meta_id`),
KEY `post_id` (`post_id`),
KEY `meta_key` (`meta_key`(191))
) ENGINE=MyISAM AUTO_INCREMENT=1399270 DEFAULT CHARSET=utf8
Both tables are present in both of the databases I've mentioned. When I run this query:
SELECT DISTINCT movie_code,post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
ON wp_postmeta.meta_value = import_schedule_t.movie_code
AND wp_postmeta.meta_key='update_movie_id'
WHERE DATE_FORMAT(start_time, '%Y-%m-%d')>= DATE_FORMAT(NOW(),'%Y-%m-%d')
the dev database takes 20 seconds to finish it, but the staging database takes only 1.4 seconds.
Here is some sample data:
wp_postmeta table
+---------+---------+-----------------+------------+
| meta_id | post_id | meta_key | meta_value |
+---------+---------+-----------------+------------+
| 45150 | 74572 | update_movie_id | 74572 |
+---------+---------+-----------------+------------+
import_schedule_t table (omitted some of the fields)
+--------+------------+---------------------+---------------------+
| id | movie_code | start_time | end_time |
+--------+------------+---------------------+---------------------+
| 120884 | 74572 | 2015-07-04 12:50:00 | 2015-07-04 15:05:00 |
+--------+------------+---------------------+---------------------+
I have already tried looking at the indexes and optimizing the tables, but with no success: the query time on the dev database is still 20 seconds.
EXPLAIN EXTENDED on dev
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 23597 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461731 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
EXPLAIN EXTENDED on staging
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
| 1 | SIMPLE | import_schedule_t | ALL | NULL | NULL | NULL | NULL | 9311 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | wp_postmeta | ALL | NULL | NULL | NULL | NULL | 1461384 | 100.00 | Using where; Using join buffer |
+----+-------------+-------------------+------+---------------+------+---------+------+---------+----------+--------------------------------+
If both DBs are running on the same machine, with the same MySQL version, on the same hard drive, and with the very same structure and data, then it might be a fragmentation issue at the OS level. Take the servers down and defrag your disk.
On a side note: don't compare dates as strings. Dates are stored as numbers internally, and comparing them directly is much more efficient (WHERE start_time >= CURDATE()).
Also you can save some storage space if you define smaller ints for some fields (like the 'active' field). An int is a 4 byte number while a tinyint is 1 byte.
I would expect the dev system to have a lot more data in its tables (the EXPLAIN output shows about 23,597 rows in import_schedule_t on dev versus 9,311 on staging).
On another point, you should not use DATE_FORMAT here. It turns your dates into strings, which are much less efficient to compare than the native date values, and wrapping start_time in a function also prevents MySQL from using an index on that column. You should probably index the start_time field as well, to avoid scanning the whole table.
Anytime you have a query taking 20 seconds you should be suspicious you are doing something wrong! MySQL can do A LOT in 20 seconds.
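Putting both pieces of advice together, a sketch of the rewritten query could look like this. The index name idx_start_time is an assumption; the comparison against CURDATE() keeps every row whose start_time falls today or later, which is what the DATE_FORMAT version selected.

```sql
-- Hypothetical index name.
ALTER TABLE import_schedule_t ADD INDEX idx_start_time (start_time);

SELECT DISTINCT movie_code, post_id
FROM import_schedule_t
INNER JOIN wp_postmeta
   ON wp_postmeta.meta_value = import_schedule_t.movie_code
  AND wp_postmeta.meta_key = 'update_movie_id'
-- Compare the datetime column directly, so an index on start_time can be used.
WHERE start_time >= CURDATE();
```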
I have found that MySQL (Win 7 64, 5.6.14) does not use the index properly if the IN list comes from a subquery. The USER table contains 900k records.
If I use the IN (_SOME_TABLE_OUTPUT_) syntax, I get a full scan over all 900k users, and the query runs forever.
If I use the IN ('CONCRETE','VALUES') syntax, the index is used correctly.
How can I make MySQL use the index in the first case?
1st case:
explain SELECT gu.id FROM USER gu WHERE gu.uuid in
(select '11b6a540-0dc5-44e0-877d-b3b83f331231' union
select '11b6a540-0dc5-44e0-877d-b3b83f331232');
+----+--------------------+------------+-------+---------------+------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+-------+---------------+------+---------+------+--------+--------------------------+
| 1 | PRIMARY | gu | index | NULL | uuid | 257 | NULL | 829930 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | DEPENDENT UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | NULL | NULL | Using temporary |
+----+--------------------+------------+-------+---------------+------+---------+------+--------+--------------------------+
2nd case:
explain SELECT gu.id FROM USER gu WHERE gu.uuid in
('11b6a540-0dc5-44e0-877d-b3b83f331231');
+----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
| 1 | SIMPLE | gu | ref | uuid | uuid | 257 | const | 1 | Using where; Using index |
+----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
Table structure:
CREATE TABLE `USER` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`version` bigint(20) NOT NULL,
`email` varchar(255) DEFAULT NULL,
`uuid` varchar(255) NOT NULL,
`partner_id` bigint(20) NOT NULL,
`password` varchar(255) DEFAULT NULL,
`date_created` datetime DEFAULT NULL,
`last_updated` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique-email` (`partner_id`,`email`),
KEY `uuid` (`uuid`),
CONSTRAINT `fk_USER_partner` FOREIGN KEY (`partner_id`) REFERENCES `partner` (`id`) ON DELETE CASCADE,
CONSTRAINT `FKB2D9FEBE725C505E` FOREIGN KEY (`partner_id`) REFERENCES `partner` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3315452 DEFAULT CHARSET=latin1
FORCE INDEX and USE INDEX statements don't change anything.
Demonstration SQLfiddle: http://sqlfiddle.com/#!2/c607e1/2
I faced the same problem before, and it turned out that one table had a single column set to UTF-8 while the other tables were latin1. No matter what I did, MySQL insisted on using no indexes. The problem is quite well described in the blog post "Slow queries in MySQL due to collation problems". Once you fix the character set, I believe any of the queries will work.
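To check for such a mismatch, the character set and collation of every text column in the current schema can be listed from information_schema. The ALTER statement is only an illustration of the fix: some_table and some_column are placeholders, and the target type and character set have to match the real column.

```sql
-- List character set and collation for all text columns in the current database.
SELECT table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = DATABASE()
  AND character_set_name IS NOT NULL
ORDER BY table_name, column_name;

-- Placeholder names: align one mismatched column with the rest of the schema.
ALTER TABLE some_table
  MODIFY some_column VARCHAR(255) CHARACTER SET latin1 NOT NULL;
```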
An inner join on your virtual table might give you better performance. Try something along these lines.
SELECT gu.id
FROM USER gu
INNER JOIN (
select '11b6a540-0dc5-44e0-877d-b3b83f331231' uuid
union all
select '11b6a540-0dc5-44e0-877d-b3b83f331232') ids
on gu.uuid = ids.uuid;
I have a table of products with a score column that has a B-Tree index on it. I have a query that returns products which have not yet been shown to the user in the current session. I can't use plain LIMIT pagination for it, because the result must be ordered by the score column, which can change between query calls.
My current solution works like this:
SELECT *
FROM products p
LEFT JOIN product_seen ps
ON (ps.session_id = ? AND p.product_id = ps.product_id )
WHERE ps.product_id is null
ORDER BY p.score DESC
LIMIT 30;
This works fine for the first few pages, but the response time grows linearly with the number of products already shown in the session and reaches one second by the time that number is around 300. Is there a way to speed this up in MySQL? Or should I solve this problem in an entirely different way?
Edit:
These are the two tables:
CREATE TABLE `products` (
`product_id` int(15) NOT NULL AUTO_INCREMENT,
`shop` varchar(15) NOT NULL,
`shop_id` varchar(25) NOT NULL,
`shop_category_id` varchar(20) DEFAULT NULL,
`shop_subcategory_id` varchar(20) DEFAULT NULL,
`shop_designer_id` varchar(20) DEFAULT NULL,
`shop_designer_name` varchar(40) NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`product_url` varchar(255) NOT NULL,
`name` varchar(255) NOT NULL,
`description` mediumtext NOT NULL,
`price_cents` int(10) NOT NULL,
`list_image_url` varchar(255) NOT NULL,
`list_image_height` int(4) NOT NULL,
`ending` timestamp NULL DEFAULT NULL,
`category_id` int(5) NOT NULL,
`last_update` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`included_at` timestamp NULL DEFAULT NULL,
`hearts` int(5) NOT NULL,
`score` decimal(10,5) NOT NULL,
`rand_field` decimal(16,15) NOT NULL,
`last_score_update` timestamp NULL DEFAULT NULL,
`active` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`product_id`),
UNIQUE KEY `unique_shop_id` (`shop`,`shop_id`),
KEY `score_index` (`active`,`score`),
KEY `included_at_index` (`included_at`),
KEY `active_category_score` (`active`,`category_id`,`score`),
KEY `active_category` (`active`,`category_id`,`product_id`),
KEY `active_products` (`active`,`product_id`),
KEY `active_rand` (`active`,`rand_field`),
KEY `active_category_rand` (`active`,`category_id`,`rand_field`)
) ENGINE=InnoDB AUTO_INCREMENT=55985 DEFAULT CHARSET=utf8
CREATE TABLE `product_seen` (
`seenby_id` int(20) NOT NULL AUTO_INCREMENT,
`session_id` varchar(25) NOT NULL,
`product_id` int(15) NOT NULL,
`last_seen` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`sorting` varchar(10) NOT NULL,
`in_category` int(3) DEFAULT NULL,
PRIMARY KEY (`seenby_id`),
KEY `last_seen_index` (`last_seen`),
KEY `session_id` (`session_id`,`seenby_id`),
KEY `session_id_2` (`session_id`,`sorting`,`seenby_id`)
) ENGINE=InnoDB AUTO_INCREMENT=17431 DEFAULT CHARSET=utf8
Edit 2:
The query above is a simplification, this is the real query with EXPLAIN:
EXPLAIN SELECT
DISTINCT p.product_id AS id,
p.list_image_url AS image,
p.list_image_height AS list_height,
hearts,
active AS available,
(UNIX_TIMESTAMP( ) - ulp.last_action) AS last_loved
FROM `looksandgoods`.`products` p
LEFT JOIN `looksandgoods`.`user_likes_products` ulp
ON ( p.product_id = ulp.product_id AND ulp.user_id =1 )
LEFT JOIN `looksandgoods`.`product_seen` sb
ON (sb.session_id = 'y7lWunZKKABgMoDgzjwDjZw1'
AND sb.sorting = 'trend'
AND p.product_id = sb.product_id )
WHERE p.active =1
AND sb.product_id IS NULL
ORDER BY p.score DESC
LIMIT 30 ;
Here is the EXPLAIN output. There is still a temporary table and a filesort, although the keys for the join exist:
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | p | range | score_index,active_category_score,active_category,active_products,active_rand,active_category_rand | score_index | 1 | NULL | 2299 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | ulp | ref | love_count_index,user_to_product_index,product_id | love_count_index | 9 | looksandgoods.p.product_id,const | 1 | |
| 1 | SIMPLE | sb | ref | session_id,session_id_2 | session_id | 77 | const | 711 | Using where; Not exists; Distinct |
+----+-------------+-------+-------+----------------------------------------------------------------------------------------------------+------------------+---------+----------------------------------+------+----------------------------------------------+
New answer
I think the problem with the real query is the DISTINCT clause. The implication is that either or both of the product_seen and user_likes_products tables can join multiple rows for each product_id which could potentially appear in the result set (given the somewhat disturbing lack of UNIQUE KEYs on the product_seen table), and this is the reason you've included the DISTINCT clause. Unfortunately, it also means MySQL will have to create a temp table to process the query.
Before I go any further, if it's possible to do...
ALTER TABLE product_seen ADD UNIQUE KEY (session_id, product_id, sorting);
...and...
ALTER TABLE user_likes_products ADD UNIQUE KEY (user_id, product_id);
...then the DISTINCT clause is redundant, and removing it should eliminate the problem. N.B. I'm not suggesting you necessarily need to add these keys, but rather just to confirm that these fields are always unique.
If it's not possible, then there may be another solution, but I'd need to know a lot more about the tables involved in the joins.
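For reference, this is the real query with the DISTINCT removed. It is only safe under the assumption stated above, namely that (session_id, product_id, sorting) and (user_id, product_id) are unique, so that neither LEFT JOIN can produce more than one row per product.

```sql
SELECT
    p.product_id AS id,
    p.list_image_url AS image,
    p.list_image_height AS list_height,
    p.hearts,
    p.active AS available,
    (UNIX_TIMESTAMP() - ulp.last_action) AS last_loved
FROM `looksandgoods`.`products` p
LEFT JOIN `looksandgoods`.`user_likes_products` ulp
    ON (p.product_id = ulp.product_id AND ulp.user_id = 1)
LEFT JOIN `looksandgoods`.`product_seen` sb
    ON (sb.session_id = 'y7lWunZKKABgMoDgzjwDjZw1'
        AND sb.sorting = 'trend'
        AND p.product_id = sb.product_id)
WHERE p.active = 1
  AND sb.product_id IS NULL   -- anti-join: keep only products not yet seen
ORDER BY p.score DESC
LIMIT 30;
```

Running it under EXPLAIN should show whether "Using temporary" disappears from the Extra column.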
Old answer
An EXPLAIN for your query yields...
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
| 1 | SIMPLE | p | ALL | NULL | NULL | NULL | NULL | 10 | Using filesort |
| 1 | SIMPLE | ps | ref | session_id | session_id | 27 | const | 1 | Using where; Not exists |
+----+-------------+-------+------+---------------+------------+---------+-------+------+-------------------------+
...which shows it's not using an index on the products table, so it's having to do a table scan and a filesort, which is why it's slow.
I noticed there's an index on (active, score) which you could use by changing the query to only show active products...
SELECT *
FROM products p
LEFT JOIN product_seen ps
ON (ps.session_id = ? AND p.product_id = ps.product_id )
WHERE p.active=TRUE AND ps.product_id is null
ORDER BY p.score DESC
LIMIT 30;
...which changes the EXPLAIN to...
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
| 1 | SIMPLE | p | range | score_index,active_products | score_index | 1 | NULL | 10 | Using where |
| 1 | SIMPLE | ps | ref | session_id | session_id | 27 | const | 1 | Using where; Not exists |
+----+-------------+-------+-------+-----------------------------+-------------+---------+-------+------+-------------------------+
...which is now doing a range scan and no filesort, which should be much faster.
Or if you want it to also return inactive products, then you'll need to add an index on score only, with...
ALTER TABLE products ADD KEY (score);