MySQL fixing index so possible keys is not null on left join - mysql
This question follows on from the problem posted here. When I run EXPLAIN on my query
SELECT u_id, SUM(counts.s_count * tablename.weighted) AS total FROM tablename
LEFT JOIN (SELECT a_id, s_count FROM tablename WHERE u_id = 1) counts
ON tablename.a_id = counts.a_id
GROUP BY u_id ORDER BY total DESC LIMIT 0,100;
I get the response
+----+-------------+------------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows    | Extra                                        |
+----+-------------+------------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
|  1 | PRIMARY     | tablename  | index | NULL          | a_id    | 3       | NULL | 7222350 | Using index; Using temporary; Using filesort |
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL    | NULL    | NULL |      37 |                                              |
|  2 | DERIVED     | tablename  | ref   | PRIMARY       | PRIMARY | 4       |      |      37 | Using index                                  |
+----+-------------+------------+-------+---------------+---------+---------+------+---------+----------------------------------------------+
The table is created with:
CREATE TABLE IF NOT EXISTS tablename (
u_id INT NOT NULL,
a_id MEDIUMINT NOT NULL,
s_count MEDIUMINT NOT NULL,
weighted FLOAT NOT NULL,
INDEX (a_id),
PRIMARY KEY (u_id,a_id)
)ENGINE=INNODB;
How can I change the index or query so that it makes use of the key more effectively? Once the database grows to 7 million rows the query takes about 30 seconds.
Edit:
The table can be created and filled with dummy data using:
CREATE TABLE IF NOT EXISTS tablename ( u_id INT NOT NULL, a_id MEDIUMINT NOT NULL,s_count MEDIUMINT NOT NULL, weighted FLOAT NOT NULL,INDEX (a_id), PRIMARY KEY (u_id,a_id))ENGINE=INNODB;
INSERT INTO tablename (u_id,a_id,s_count,weighted ) VALUES (1,1,17,0.0521472392638),(1,2,80,0.245398773006),(1,3,2,0.00613496932515),(1,4,1,0.00306748466258),(1,5,1,0.00306748466258),(1,6,20,0.0613496932515),(1,7,3,0.00920245398773),(1,8,100,0.306748466258),(1,9,100,0.306748466258),(1,10,2,0.00613496932515),(2,1,1,0.00327868852459),(2,2,1,0.00327868852459),(2,3,100,0.327868852459),(2,4,200,0.655737704918),(2,5,1,0.00327868852459),(2,6,1,0.00327868852459),(2,7,0,0.0),(2,8,0,0.0),(2,9,0,0.0),(2,10,1,0.00327868852459),(3,1,15,0.172413793103),(3,2,40,0.459770114943),(3,3,0,0.0),(3,4,0,0.0),(3,5,0,0.0),(3,6,10,0.114942528736),(3,7,1,0.0114942528736),(3,8,20,0.229885057471),(3,9,0,0.0),(3,10,1,0.0114942528736);
You can hardly force MySQL to use an index for a join against the result of a subquery, but you can try to speed up the grouping by using a covering index (an index that contains enough data that the row it references never needs to be fetched).
Try adding a composite index on (u_id, a_id, weighted).
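A sketch of adding that index; the name Index_3 is only an assumption, chosen so that it matches the USE INDEX hint in the query below:

ALTER TABLE tablename ADD INDEX Index_3 (u_id, a_id, weighted);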
And you will probably need to give MySQL a hint to use the index:
SELECT u_id, SUM(counts.s_count * tablename.weighted) AS total
FROM tablename USE INDEX(Index_3)
LEFT JOIN (SELECT a_id, s_count FROM tablename WHERE u_id = 1) counts
ON tablename.a_id = counts.a_id
GROUP BY u_id ORDER BY total DESC LIMIT 0,100;
Related
Can't improve performance of query
I have the following 2 tables:

CREATE TABLE table1 (
  ID INT(11) NOT NULL AUTO_INCREMENT,
  AccountID INT NOT NULL,
  Type VARCHAR(50) NOT NULL,
  ValidForBilling BOOLEAN NULL DEFAULT false,
  MerchantCreationTime TIMESTAMP NOT NULL,
  PRIMARY KEY (ID),
  UNIQUE KEY (OrderID, Type)
);

with the index:

INDEX accID_type_merchCreatTime_vfb (AccountID, Type, MerchantCreationTime, ValidForBilling);

CREATE TABLE table2 (
  OrderID INT NOT NULL,
  AccountID INT NOT NULL,
  LineType VARCHAR(256) NOT NULL,
  CreationDate TIMESTAMP NOT NULL,
  CalculatedAmount NUMERIC(4,4) NULL,
  table1ID INT(11) NOT NULL
);

I'm running the following query:

SELECT COALESCE(SUM(CalculatedAmount), 0.0) AS CalculatedAmount
FROM table2
INNER JOIN table1 ON table1.ID = table2.table1ID
WHERE table1.ValidForBilling IS TRUE
  AND table1.AccountID = 388
  AND table1.Type = 'TPG_DISCOUNT'
  AND table1.MerchantCreationTime >= '2018-11-01T05:00:00'
  AND table1.MerchantCreationTime < '2018-12-01T05:00:00';

And it takes about 2 minutes to complete. I did EXPLAIN in order to try and improve the query performance and got the following output:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: table1
   partitions: NULL
         type: range
possible_keys: PRIMARY,i_fo_merchant_time_account,FO_AccountID_MerchantCreationTime,FO_AccountID_ExecutionTime,FO_AccountID_Type_ExecutionTime,FO_AccountID_Type_MerchantCreationTime,accID_type_merchCreatTime_vfb
          key: accID_type_merchCreatTime_vfb
      key_len: 61
          ref: NULL
         rows: 71276
     filtered: 100.00
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: table2
   partitions: NULL
         type: eq_ref
possible_keys: table1ID,i_oc_fo_id
          key: table1ID
      key_len: 4
          ref: finance.table1.ID
         rows: 1
     filtered: 100.00
        Extra: NULL

I see that I scan 71276 rows in table1 and I can't seem to make this number lower. Is there an index I can create to improve this query performance?
Move ValidForBilling before MerchantCreationTime in accID_type_merchCreatTime_vfb: columns used for equality (ref) lookups, such as ValidForBilling = TRUE, need to come before the column used for a range scan in an index. For table2, there already seems to be an index on table1ID; appending CalculatedAmount to it means the index can also supply the value used in the result:

CREATE INDEX tbl1IDCalcAmount ON table2 (table1ID, CalculatedAmount);
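A sketch of the suggested reordering, assuming the index is simply rebuilt under its existing name:

ALTER TABLE table1
  DROP INDEX accID_type_merchCreatTime_vfb,
  ADD INDEX accID_type_merchCreatTime_vfb (AccountID, Type, ValidForBilling, MerchantCreationTime);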
Optimize join query with SUM, Group By and ORDER By Clauses
I have the following database schema:

keywords(id, keyword, lang) : (about 8M records)
topics(id, topic, lang) : (about 2.6M records)
topic_keywords(topic_id, keyword_id, weight) : (200M records)

In a script, I have about 50-100 keywords with an additional field keyword_score, and I want to retrieve the top 20 topics that correspond to those keywords based on the following formula:

SUM(keyword_score * topic_weight)

The solution I currently implement in my script is:

I create a temporary table temporary_keywords(keyword_id, keyword_score).
I insert all 50-100 keywords into it with their keyword_score.
Then I execute the following query to retrieve the topics:

SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING keyword_id
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20

This solution works, but in some cases it takes up to 3 seconds to execute, which is too much for me. Is there a way to optimize this query, or should I redesign the data structure into a NoSQL database? Any other solutions or ideas beyond what is listed above are most appreciated.

UPDATE (SHOW CREATE TABLE)

CREATE TABLE `topic_keywords` (
  `topic_id` int(11) NOT NULL,
  `keyword_id` int(11) NOT NULL,
  `weight` float DEFAULT '0',
  PRIMARY KEY (`topic_id`,`keyword_id`),
  KEY `keyword_id_idx` (`keyword_id`,`topic_id`,`weight`)
)

CREATE TEMPORARY TABLE temporary_keywords (
  keyword_id INT PRIMARY KEY NOT NULL,
  keyword_score DOUBLE
)

EXPLAIN QUERY

+----+-------------+--------------------+------+----------------+----------------+---------+--------------------------------------+----------+---------------------------------+
| id | select_type | table              | type | possible_keys  | key            | key_len | ref                                  | rows     | Extra                           |
+----+-------------+--------------------+------+----------------+----------------+---------+--------------------------------------+----------+---------------------------------+
|  1 | SIMPLE      | temporary_keywords | ALL  | PRIMARY        | NULL           | NULL    | NULL                                 |      100 | Using temporary; Using filesort |
|  1 | SIMPLE      | topic_keywords     | ref  | keyword_id_idx | keyword_id_idx | 4       | topics.temporary_keywords.keyword_id | 10778853 | Using index                     |
+----+-------------+--------------------+------+----------------+----------------+---------+--------------------------------------+----------+---------------------------------+
Incorrect, but uncaught, syntax:

JOIN topic_keywords USING keyword_id --> JOIN topic_keywords USING (keyword_id)

If that does not fix it, please provide EXPLAIN FORMAT=JSON SELECT ...
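For reference, the query from the question with only that USING clause corrected:

SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING (keyword_id)
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20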
Improving speed of SQL query with MAX, WHERE, and GROUP BY on three different columns
I am attempting to speed up a query that takes around 60 seconds to complete on a table of ~20 million rows. For this example, the table has three columns (id, dateAdded, name). id is the primary key. The indexes I have added to the table are:

(dateAdded)
(name)
(id, name)
(id, name, dateAdded)

The query I am trying to run is:

SELECT MAX(id) AS id, name
FROM exampletable
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name
ORDER BY NULL;

The date is variable from query to query. The objective of this is to get the most recent entry for each name at or before the date added. When I use EXPLAIN on the query it tells me that it is using the (id, name, dateAdded) index.

+----+-------------+--------------+-------+------------------+----------------------------------+---------+------+----------+------------------------------------------------------------+
| id | select_type | table        | type  | possible_keys    | key                              | key_len | ref  | rows     | Extra                                                      |
+----+-------------+--------------+-------+------------------+----------------------------------+---------+------+----------+------------------------------------------------------------+
|  1 | SIMPLE      | exampletable | index | date_added_index | id_element_name_date_added_index | 162     | NULL | 22016957 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+--------------+-------+------------------+----------------------------------+---------+------+----------+------------------------------------------------------------+

Edit: Added two new indexes from comments:

(dateAdded, name, id)
(name, id)

+----+-------------+--------------+-------+--------------------------------------------+----------------------------+---------+------+----------+-------------------------------------------+
| id | select_type | table        | type  | possible_keys                              | key                        | key_len | ref  | rows     | Extra                                     |
+----+-------------+--------------+-------+--------------------------------------------+----------------------------+---------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | exampletable | index | date_added_index,date_added_name_id_index  | id__name_date_added_index  | 162     | NULL | 22040469 | Using where; Using index; Using temporary |
+----+-------------+--------------+-------+--------------------------------------------+----------------------------+---------+------+----------+-------------------------------------------+

Edit: Added create table script.

CREATE TABLE `exampletable` (
  `id` int(10) NOT NULL auto_increment,
  `dateAdded` timestamp NULL default CURRENT_TIMESTAMP,
  `name` varchar(50) character set utf8 default '',
  PRIMARY KEY (`id`),
  KEY `date_added_index` (`dateAdded`),
  KEY `name_index` USING BTREE (`name`),
  KEY `id_name_index` USING BTREE (`id`,`name`),
  KEY `id_name_date_added_index` USING BTREE (`id`,`dateAdded`,`name`),
  KEY `date_added_name_id_index` USING BTREE (`dateAdded`,`name`,`id`),
  KEY `name_id_index` USING BTREE (`name`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=22046064 DEFAULT CHARSET=latin1

Edit: Here is the EXPLAIN from the answer provided by HeavyE.
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <derived2>
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1732
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: example1
         type: ref
possible_keys: date_added_index,name_index,date_added_name_id_index,name_id_index,name_date_added_index
          key: date_added_name_id_index
      key_len: 158
          ref: maxDateByElement.dateAdded,maxDateByElement.name
         rows: 1
        Extra: Using where; Using index
*************************** 3. row ***************************
           id: 2
  select_type: DERIVED
        table: exampletable
         type: range
possible_keys: date_added_index,date_added_name_id_index
          key: name_date_added_index
      key_len: 158
          ref: NULL
         rows: 1743
        Extra: Using where; Using index for group-by
There is a great Stack Overflow post on optimizing the selection of rows with the max value in a column: https://stackoverflow.com/a/7745635/633063

This seems a little messy but works very well:

SELECT example1.name, MAX(example1.id)
FROM exampletable example1
INNER JOIN (
    SELECT name, MAX(dateAdded) dateAdded
    FROM exampletable
    WHERE dateAdded <= '2014-01-20 12:00:00'
    GROUP BY name
) maxDateByElement
    ON example1.name = maxDateByElement.name
    AND example1.dateAdded = maxDateByElement.dateAdded
GROUP BY name;
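The EXPLAIN shown above for this answer uses an index called name_date_added_index with "Using index for group-by" on the derived table; if no index leading with name followed by dateAdded exists yet, a sketch of adding one (the index name here is only an assumption):

ALTER TABLE exampletable ADD INDEX name_date_added_index (name, dateAdded);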
Why are you using indexes on so many keys? If your WHERE clause contains only one column, use an index on just that column: put separate indexes on dateAdded and on name and then hint them in the SQL statement like this:

SELECT MAX(id) AS id, name
FROM exampletable
    USE INDEX (date_added_index)
    USE INDEX FOR GROUP BY (name_index)
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name
ORDER BY NULL;

Here is the link if you want to know more. Please let me know whether it gives any positive results.
If the WHERE clause makes no difference, then it's either the MAX(id) or the name. I would test the indexes by eliminating MAX(id) completely and seeing whether the GROUP BY name alone is fast. Then I would swap in MIN(id) to see if it's any faster than MAX(id) (I have seen this make a difference). Also, you should test the ORDER BY NULL: try ORDER BY name DESC or ORDER BY name ASC. Clark Vera
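A sketch of the test variants described above (the WHERE clause is unchanged from the question; only the aggregate and the ORDER BY vary):

-- Baseline: grouping only, no aggregate on id
SELECT name FROM exampletable
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name ORDER BY NULL;

-- MIN(id) instead of MAX(id)
SELECT MIN(id) AS id, name FROM exampletable
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name ORDER BY NULL;

-- Explicit ordering instead of ORDER BY NULL
SELECT MAX(id) AS id, name FROM exampletable
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name ORDER BY name ASC;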
MySQL query to find items without matching record in JOIN is very slow
I've read a lot of questions about query optimization but none have helped me with this. As setup, I have 3 tables that represent an "entry" that can have zero or more "categories".

> show create table entries;

CREATE TABLE `entries` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT
  ...
  `name` varchar(255),
  `updated_at` timestamp NOT NULL,
  ...
  PRIMARY KEY (`id`),
  KEY `name` (`name`)
) ENGINE=InnoDB

> show create table entry_categories;

CREATE TABLE `entry_categories` (
  `ent_name` varchar(255),
  `cat_id` int(11),
  PRIMARY KEY (`ent_name`,`cat_id`),
  KEY `names` (`ent_name`)
) ENGINE=InnoDB

(The actual "category" table doesn't come into the question.)

Editing an "entry" in the application creates a new row in the entry table -- think like the history of a wiki page -- with the same name and a newer timestamp. I want to see how many uniquely-named entries don't have a category, which seems really straightforward:

SELECT COUNT(id)
FROM entries e
LEFT JOIN entry_categories c ON e.name = c.ent_name
WHERE c.ent_name IS NULL
GROUP BY e.name;

On my small dataset (about 6000 total entries, with about 4000 names, averaging about one category per named entry) this query takes over 24 seconds (!). I've also tried

SELECT COUNT(id)
FROM entries e
WHERE NOT EXISTS (
    SELECT ent_name
    FROM entry_categories c
    WHERE c.ent_name = e.name
)
GROUP BY e.name;

with similar results. This seems really, really slow to me, especially considering that finding entries in a single category with

SELECT COUNT(*)
FROM entries e
JOIN (
    SELECT ent_name AS name
    FROM entry_categories
    WHERE cat_id = 123
) c USING (name)
GROUP BY name;

runs in about 120ms on the same data. Is there a better way to find records in a table that don't have at least one corresponding entry in another table?

I'll try to transcribe the EXPLAIN results for each query:

> EXPLAIN {no category query};

+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows | Extra                                        |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | e     | index | NULL          | name  | 767     | NULL | 6222 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | c     | index | PRIMARY,names | names | 767     | NULL | 6906 | Using where; Using index; Not exists         |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+

> EXPLAIN {single category query}

+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
| id | select_type | table      | type  | possible_keys | key   | key_len | ref    | rows | Extra                           |
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL  | NULL    | NULL   | 2850 | Using temporary; Using filesort |
|  1 | PRIMARY     | e          | ref   | name          | name  | 767     | c.name |    1 | Using where; Using index        |
|  2 | DERIVED     | c          | index | NULL          | names | NULL    | NULL   | 6906 | Using where; Using index        |
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
Try:

SELECT name, SUM(e) count_entries
FROM (
    SELECT name, 1 e, 0 c FROM entries
    UNION ALL
    SELECT ent_name name, 0 e, 1 c FROM entry_categories
) s
GROUP BY name
HAVING SUM(c) = 0
First: remove the names key, as it duplicates the primary key (the ent_name column is the left-most column in the primary key, so the PK can be used to resolve the query). This should change the EXPLAIN output so the PK is used in the join.

The keys you are using to join are pretty large (a 255-character varchar column); it is better if you can use integers for this, even if it means introducing one more table (with an id-to-name mapping).

For some reason the query uses a filesort, even though you don't have an ORDER BY clause. Can you show the EXPLAIN results next to each query, including the single-category query, for further diagnosis?
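A sketch of the first suggestion, dropping the redundant index:

ALTER TABLE `entry_categories` DROP INDEX `names`;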
Estimate/speedup huge table self-join on mysql
I have a huge table:

CREATE TABLE `messageline` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `hash` bigint(20) DEFAULT NULL,
  `quoteLevel` int(11) DEFAULT NULL,
  `messageDetails_id` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
  KEY `hash_idx` (`hash`),
  KEY `quote_level_idx` (`quoteLevel`),
  CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

I need to find duplicate lines this way:

CREATE TABLE foundline AS
SELECT ml.messagedetails_id, ml.hash, ml.quotelevel
FROM messageline ml, messageline ml1
WHERE ml1.hash = ml.hash
  AND ml1.messagedetails_id != ml.messagedetails_id

But this request has already been running for more than a day. That is too long; a few hours would be OK. How can I speed this up? Thanks.

Explain:

+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key      | key_len | ref           | rows      | Extra       |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
|  1 | SIMPLE      | ml    | ALL  | hash_idx      | NULL     | NULL    | NULL          | 401798409 |             |
|  1 | SIMPLE      | ml1   | ref  | hash_idx      | hash_idx | 9       | skryb.ml.hash |         1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
You can find your duplicates like this:

SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id
HAVING c > 1;

If it is still too long, add a condition to split the request on an indexed field:

WHERE messagedetails_id < 100000
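A sketch of the split version with the extra condition in place (the 100000 cutoff is only the illustrative value from above; the query would be repeated over successive id ranges):

SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
WHERE messagedetails_id < 100000
GROUP BY messagedetails_id
HAVING c > 1;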
Is it required to do this solely with SQL? For such a number of records you would be better off breaking this down into 2 steps.

First, run the following query:

CREATE TABLE duplicate_hashes
SELECT * FROM (
    SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
           COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
           GROUP_CONCAT(DISTINCT messagedetails_id) AS messagedetails_ids
    FROM messageline
    GROUP BY hash
    HAVING cnt > 1
    ORDER BY NULL
) tmp
WHERE cnt > cnt_message_details

This will give you the duplicate IDs for each hash, and since you have an index on the hash field the grouping will be relatively fast. By counting distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id values:

where ml1.hash = ml.hash and ml1.messagedetails_id != ml.messagedetails_id

Second, use a script to check each record of the duplicate_hashes table.
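A hypothetical sketch of where that second step could start, simply reading back the rows the script would iterate over (duplicate_hashes is the table created above):

-- Largest groups of colliding hashes first; each row lists the ids sharing one hash
SELECT hash, cnt, cnt_message_details, ids, messagedetails_ids
FROM duplicate_hashes
ORDER BY cnt DESC
LIMIT 100;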