MySQL: fixing an index so possible_keys is not NULL on a LEFT JOIN

This question follows on from the problem posted here. When I run EXPLAIN on my query
SELECT u_id, SUM(counts.s_count * tablename.weighted) AS total FROM tablename
LEFT JOIN (SELECT a_id, s_count FROM tablename WHERE u_id = 1) counts
ON tablename.a_id = counts.a_id
GROUP BY u_id ORDER BY total DESC LIMIT 0,100;
I get the response:
+----+-------------+--------------------+-------+---------------+-----------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+-----------+---------+------+--------+----------------------------------------------+
| 1 | PRIMARY | tablename | index | NULL | a_id | 3 | NULL | 7222350| Using index; Using temporary; Using filesort |
| 1 | PRIMARY | [derived2] | ALL | NULL | NULL | NULL | NULL | 37 | |
| 2 | DERIVED | tablename | ref | PRIMARY | PRIMARY | 4 | | 37 | Using index |
+----+-------------+--------------------+-------+---------------+-----------+---------+------+--------+----------------------------------------------+
The table is created with:
CREATE TABLE IF NOT EXISTS tablename (
u_id INT NOT NULL,
a_id MEDIUMINT NOT NULL,
s_count MEDIUMINT NOT NULL,
weighted FLOAT NOT NULL,
INDEX (a_id),
PRIMARY KEY (u_id,a_id)
)ENGINE=INNODB;
How can I change the index or the query so that it makes better use of the key? Once the database grows to 7 million rows, the query takes about 30 seconds.
Edit:
The table can be created and populated with dummy data using:
CREATE TABLE IF NOT EXISTS tablename ( u_id INT NOT NULL, a_id MEDIUMINT NOT NULL,s_count MEDIUMINT NOT NULL, weighted FLOAT NOT NULL,INDEX (a_id), PRIMARY KEY (u_id,a_id))ENGINE=INNODB;
INSERT INTO tablename (u_id,a_id,s_count,weighted ) VALUES (1,1,17,0.0521472392638),(1,2,80,0.245398773006),(1,3,2,0.00613496932515),(1,4,1,0.00306748466258),(1,5,1,0.00306748466258),(1,6,20,0.0613496932515),(1,7,3,0.00920245398773),(1,8,100,0.306748466258),(1,9,100,0.306748466258),(1,10,2,0.00613496932515),(2,1,1,0.00327868852459),(2,2,1,0.00327868852459),(2,3,100,0.327868852459),(2,4,200,0.655737704918),(2,5,1,0.00327868852459),(2,6,1,0.00327868852459),(2,7,0,0.0),(2,8,0,0.0),(2,9,0,0.0),(2,10,1,0.00327868852459),(3,1,15,0.172413793103),(3,2,40,0.459770114943),(3,3,0,0.0),(3,4,0,0.0),(3,5,0,0.0),(3,6,10,0.114942528736),(3,7,1,0.0114942528736),(3,8,20,0.229885057471),(3,9,0,0.0),(3,10,1,0.0114942528736);

You can hardly force MySQL to use an index for a join with the results of a subquery, but you can try to speed up the grouping by using a covering index (an index that contains enough data that the row it references never has to be fetched).
Try adding a composite index on (u_id, a_id, weighted).
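A minimal sketch of adding that index; the name Index_3 is an assumption, chosen only so it matches the USE INDEX hint in the query below:
-- Index_3 is a hypothetical name; use whatever naming convention you prefer.
ALTER TABLE tablename ADD INDEX Index_3 (u_id, a_id, weighted);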
And you will probably need to give MySQL a hint to use the index:
SELECT u_id, SUM(counts.s_count * tablename.weighted) AS total
FROM tablename USE INDEX(Index_3)
LEFT JOIN (SELECT a_id, s_count FROM tablename WHERE u_id = 1) counts
ON tablename.a_id = counts.a_id
GROUP BY u_id ORDER BY total DESC LIMIT 0,100;

Related

Can't improve query performance

I have the following 2 tables:
CREATE TABLE table1 (
ID INT(11) NOT NULL AUTO_INCREMENT,
AccountID INT NOT NULL,
Type VARCHAR(50) NOT NULL,
ValidForBilling BOOLEAN NULL DEFAULT false,
MerchantCreationTime TIMESTAMP NOT NULL,
PRIMARY KEY (ID),
UNIQUE KEY (OrderID, Type)
);
with the index:
INDEX accID_type_merchCreatTime_vfb (AccountID, Type, MerchantCreationTime, ValidForBilling);
CREATE TABLE table2 (
OrderID INT NOT NULL,
AccountID INT NOT NULL,
LineType VARCHAR(256) NOT NULL,
CreationDate TIMESTAMP NOT NULL,
CalculatedAmount NUMERIC(4,4) NULL,
table1ID INT(11) NOT NULL
);
I'm running the following query:
SELECT COALESCE(SUM(CalculatedAmount), 0.0) AS CalculatedAmount
FROM table2
INNER JOIN table1 ON table1.ID = table2.table1ID
WHERE table1.ValidForBilling is TRUE
AND table1.AccountID = 388
AND table1.Type = 'TPG_DISCOUNT'
AND table1.MerchantCreationTime >= '2018-11-01T05:00:00'
AND table1.MerchantCreationTime < '2018-12-01T05:00:00';
And it takes about 2 minutes to complete.
I did EXPLAIN in order to try and improve the query performance and got the following output:
+----+-------------+------------------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------+----------------------+-------+----------+--------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------+----------------------+-------+----------+--------------------------+
| 1 | SIMPLE | table1 | NULL | range | PRIMARY,i_fo_merchant_time_account,FO_AccountID_MerchantCreationTime,FO_AccountID_ExecutionTime,FO_AccountID_Type_ExecutionTime,FO_AccountID_Type_MerchantCreationTime,accID_type_merchCreatTime_vfb | accID_type_merchCreatTime_vfb | 61 | NULL | 71276 | 100.00 | Using where; Using index |
| 1 | SIMPLE | table2 | NULL | eq_ref | table1ID,i_oc_fo_id | table1ID | 4 | finance.table1.ID | 1 | 100.00 | NULL |
+----+-------------+------------------+------------+--------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------+---------+----------------------+-------+----------+--------------------------+
I see that I scan 71276 rows in table1 and I can't seem to make this number lower.
Is there an index I can create to improve this query performance?
Move ValidForBilling before MerchantCreationTime in accID_type_merchCreatTime_vfb: the equality lookups (AccountID, Type, and ValidForBilling = TRUE) need to come before the range condition on MerchantCreationTime in the index.
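A sketch of that reordering; the new index name is illustrative, not from the original post:
-- Drop the old index and recreate it with the equality columns first (name is illustrative).
ALTER TABLE table1
  DROP INDEX accID_type_merchCreatTime_vfb,
  ADD INDEX accID_type_vfb_merchCreatTime (AccountID, Type, ValidForBilling, MerchantCreationTime);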
For table2, there already seems to be a table1ID index; appending CalculatedAmount to it lets the index also cover the amount being summed in the result:
CREATE INDEX tbl1IDCalcAmount ON table2 (table1ID, CalculatedAmount);

Optimize join query with SUM, Group By and ORDER By Clauses

I have the following database schema
keywords(id, keyword, lang): about 8M records
topics(id, topic, lang): about 2.6M records
topic_keywords(topic_id, keyword_id, weight): 200M records
In a script, I have about 50-100 keywords with an additional field keyword_score, and I want to retrieve the top 20 topics that correspond to those keywords based on the following formula: SUM(keyword_score * topic_weight).
The solution I currently implement in my script is:
Create a temporary table temporary_keywords(keyword_id, keyword_score)
Insert all 50-100 keywords into it with their keyword_score
Then execute the following query to retrieve the topics:
SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING keyword_id
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20
This solution works, but in some cases it takes up to 3 seconds to execute, which is too much for me.
Is there a way to optimize this query, or should I redesign the data structure into a NoSQL database?
Any other solutions or ideas beyond what is listed above are most appreciated
UPDATE (SHOW CREATE TABLE)
CREATE TABLE `topic_keywords` (
`topic_id` int(11) NOT NULL,
`keyword_id` int(11) NOT NULL,
`weight` float DEFAULT '0',
PRIMARY KEY (`topic_id`,`keyword_id`),
KEY `keyword_id_idx` (`keyword_id`,`topic_id`,`weight`)
)
CREATE TEMPORARY TABLE temporary_keywords
( keyword_id INT PRIMARY KEY NOT NULL,
keyword_score DOUBLE
)
EXPLAIN QUERY
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
| 1 | SIMPLE | temporary_keywords | ALL | PRIMARY | NULL | NULL | NULL | 100 | Using temporary; Using filesort |
| 1 | SIMPLE | topic_keywords | ref | keyword_id_idx | keyword_id_idx | 4 | topics.temporary_keywords.keyword_id | 10778853 | Using index |
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
Incorrect, but uncaught, syntax.
JOIN topic_keywords USING keyword_id
-->
JOIN topic_keywords USING(keyword_id)
If that does not fix it, please provide EXPLAIN FORMAT=JSON SELECT ...
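For clarity, a sketch of the full query with the corrected join syntax (identical to the original apart from the USING clause):
SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING (keyword_id)
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20;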

Improving speed of SQL query with MAX, WHERE, and GROUP BY on three different columns

I am attempting to speed up a query that takes around 60 seconds to complete on a table of ~20 million rows.
For this example, the table has three columns (id, dateAdded, name).
id is the primary key.
The indexes I have added to the table are:
(dateAdded)
(name)
(id, name)
(id, name, dateAdded)
The query I am trying to run is:
SELECT MAX(id) as id, name
FROM exampletable
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name
ORDER BY NULL;
The date is variable from query to query.
The objective of this is to get the most recent entry for each name at or before the date added.
When I use explain on the query it tells me that it is using the (id, name, dateAdded) index.
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
| 1 | SIMPLE | exampletable | index | date_added_index | id_element_name_date_added_index | 162 | NULL | 22016957 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+------------------+-------+------------------+----------------------------------------------+---------+------+----------+-----------------------------------------------------------+
Edit:
Added two new indexes from comments:
(dateAdded, name, id)
(name, id)
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
| 1 | SIMPLE | exampletable | index | date_added_index,date_added_name_id_index | id__name_date_added_index | 162 | NULL | 22040469 | Using where; Using index; Using temporary |
+----+-------------+------------------+-------+---------------------------------------------------------------+----------------------------------------------+---------+------+----------+-------------------------------------------+
Edit:
Added create table script.
CREATE TABLE `exampletable` (
`id` int(10) NOT NULL auto_increment,
`dateAdded` timestamp NULL default CURRENT_TIMESTAMP,
`name` varchar(50) character set utf8 default '',
PRIMARY KEY (`id`),
KEY `date_added_index` (`dateAdded`),
KEY `name_index` USING BTREE (`name`),
KEY `id_name_index` USING BTREE (`id`,`name`),
KEY `id_name_date_added_index` USING BTREE (`id`,`dateAdded`,`name`),
KEY `date_added_name_id_index` USING BTREE (`dateAdded`,`name`,`id`),
KEY `name_id_index` USING BTREE (`name`,`id`)
) ENGINE=MyISAM AUTO_INCREMENT=22046064 DEFAULT CHARSET=latin1
Edit:
Here is the Explain from the answer provided by HeavyE.
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1732 | Using temporary; Using filesort |
| 1 | PRIMARY | example1 | ref | date_added_index,name_index,date_added_name_id_index,name_id_index,name_date_added_index | date_added_name_id_index | 158 | maxDateByElement.dateAdded,maxDateByElement.name | 1 | Using where; Using index |
| 2 | DERIVED | exampletable | range | date_added_index,date_added_name_id_index | name_date_added_index | 158 | NULL | 1743 | Using where; Using index for group-by |
+----+-------------+--------------+-------+------------------------------------------------------------------------------------------+--------------------------+---------+--------------------------------------------------+------+---------------------------------------+
There is a great Stack Overflow post on optimization of Selecting rows with the max value in a column: https://stackoverflow.com/a/7745635/633063
This seems a little messy but works very well:
SELECT example1.name, MAX(example1.id)
FROM exampletable example1
INNER JOIN (
select name, max(dateAdded) dateAdded
from exampletable
where dateAdded <= '2014-01-20 12:00:00'
group by name
) maxDateByElement on example1.name = maxDateByElement.name AND example1.dateAdded = maxDateByElement.dateAdded
GROUP BY name;
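Note that the EXPLAIN above relies on a name_date_added_index that does not appear in the CREATE TABLE shown earlier; a sketch of adding it, with the column order assumed from the index name and the group-by-with-range access pattern:
-- Assumed definition: lets the derived subquery group by name and range-scan dateAdded from the index.
ALTER TABLE exampletable ADD INDEX name_date_added_index (name, dateAdded);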
Why are you using indexes over so many columns? If your WHERE clause contains only one column, use that index alone: put separate indexes on dateAdded and on name, and then use them in the SQL statement like this:
SELECT MAX(id) as id, name
FROM exampletable
USE INDEX (date_added_index) USE INDEX FOR GROUP BY (name_index)
WHERE dateAdded <= '2014-01-20 12:00:00'
GROUP BY name
ORDER BY NULL;
Here is the link if you want to know more. Please let me know whether it gives positive results.
If the WHERE clause makes no difference, then it's either the MAX(id) or the name.
I would test the indexes by eliminating MAX(id) completely and seeing whether the GROUP BY name is fast. Then I would try MIN(id) to see if it's any faster than MAX(id). (I have seen this make a difference.)
Also, you should test the ORDER BY NULL. Try ORDER BY name DESC, or ORDER BY name ASC.

MySQL query to find items without matching record in JOIN is very slow

I've read a lot of questions about query optimization but none have helped me with this.
As setup, I have 3 tables that represent an "entry" that can have zero or more "categories".
> show create table entries;
CREATE TABLE `entries` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT
...
`name` varchar(255),
`updated_at` timestamp NOT NULL,
...
PRIMARY KEY (`id`),
KEY `name` (`name`)
) ENGINE=InnoDB
> show create table entry_categories;
CREATE TABLE `entry_categories` (
`ent_name` varchar(255),
`cat_id` int(11),
PRIMARY KEY (`ent_name`,`cat_id`),
KEY `names` (`ent_name`)
) ENGINE=InnoDB
(The actual "category" table doesn't come into the question.)
Editing an "entry" in the application creates a new row in the entry table -- think like the history of a wiki page -- with the same name and a newer timestamp. I want to see how many uniquely-named Entries don't have a category, which seems really straightforward:
SELECT COUNT(id)
FROM entries e
LEFT JOIN entry_categories c
ON e.name=c.ent_name
WHERE c.ent_name IS NULL
GROUP BY e.name;
On my small dataset (about 6000 total entries, with about 4000 names, averaging about one category per named entry) this query takes over 24 seconds (!). I've also tried
SELECT COUNT(id)
FROM entries e
WHERE NOT EXISTS(
SELECT ent_name
FROM entry_categories c
WHERE c.ent_name = e.name
)
GROUP BY e.name;
with similar results. This seems really, really slow to me, especially considering that finding entries in a single category with
SELECT COUNT(*)
FROM entries e
JOIN (
SELECT ent_name as name
FROM entry_categories
WHERE cat_id = 123
)c
USING (name)
GROUP BY name;
runs in about 120ms on the same data. Is there a better way to find records in a table that don't have at least one corresponding entry in another table?
I'll try to transcribe the EXPLAIN results for each query:
> EXPLAIN {no category query};
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | e | index | NULL | name | 767 | NULL | 6222 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | c | index | PRIMARY,names | names | 767 | NULL | 6906 | Using where; using index; Not exists |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
> EXPLAIN {single category query}
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
| id | select_type | table      | type  | possible_keys | key   | key_len | ref    | rows | Extra                           |
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL  | NULL    | NULL   | 2850 | Using temporary; Using filesort |
|  1 | PRIMARY     | e          | ref   | name          | name  | 767     | c.name |    1 | Using where; Using index        |
|  2 | DERIVED     | c          | index | NULL          | names | 767     | NULL   | 6906 | Using where; Using index        |
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
Try:
select name, sum(e) count_entries from
(select name, 1 e, 0 c from entries
union all
select ent_name name, 0 e, 1 c from entry_categories) s
group by name
having sum(c) = 0
First, remove the names key: it is redundant with the primary key (ent_name is the left-most column of the primary key, so the PK can be used to resolve the query). This should change the EXPLAIN output to use the PK in the join.
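A minimal sketch of that first change:
-- The names index duplicates the leading column of the primary key, so it can simply be dropped.
ALTER TABLE entry_categories DROP INDEX names;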
The keys you are using to join are pretty large (a varchar(255) column). It is better if you can use integers for this, even if it means introducing one more table with a name-to-id mapping.
For some reason the query uses filesort, even though you don't have an ORDER BY clause.
Can you show the explain results next to each query, and the single category query, for further diagnosis?

Estimate/speed up huge table self-join on MySQL

I have a huge table:
CREATE TABLE `messageline` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`hash` bigint(20) DEFAULT NULL,
`quoteLevel` int(11) DEFAULT NULL,
`messageDetails_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
KEY `hash_idx` (`hash`),
KEY `quote_level_idx` (`quoteLevel`),
CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I need to find duplicate lines this way:
create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
messageline ml1
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
But this query has already been running for more than a day, which is too long; a few hours would be OK. How can I speed this up? Thanks.
Explain:
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| 1 | SIMPLE | ml | ALL | hash_idx | NULL | NULL | NULL | 401798409 | |
| 1 | SIMPLE | ml1 | ref | hash_idx | hash_idx | 9 | skryb.ml.hash | 1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
You can find your duplicates like this:
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;
If it is still too long, add a condition to split the request on an indexed field:
WHERE messagedetails_id < 100000
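For example, a sketch of the same aggregate restricted to one slice of the indexed field (run it repeatedly with different ranges to cover the table):
SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
WHERE messagedetails_id < 100000
GROUP BY messagedetails_id
HAVING c > 1;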
Is it required to do this solely with SQL? For such a number of records you would be better off breaking this down into 2 steps:
First, run the following query:
CREATE TABLE duplicate_hashes
SELECT * FROM (
SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
GROUP_CONCAT(DISTINCT messagedetails_id) as messagedetails_ids
FROM messageline GROUP BY hash HAVING cnt > 1 ORDER BY NULL
) tmp
WHERE cnt > cnt_message_details
This will give you the duplicate IDs for each hash, and since you have an index on the hash field the grouping will be relatively fast. By counting distinct messagedetails_id values and comparing, you implicitly fulfill the requirement for different messagedetails_id values:
where ml1.hash = ml.hash
and ml1.messagedetails_id!=ml.messagedetails_id
Use a script to check each record of the duplicate_hashes table.
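A minimal sketch of the kind of follow-up query such a script could run, expanding each duplicate hash back into its individual lines (illustrative, not from the original answer):
-- Pull the per-line rows behind every duplicate hash for inspection.
SELECT ml.messagedetails_id, ml.hash, ml.quotelevel
FROM duplicate_hashes dh
JOIN messageline ml ON ml.hash = dh.hash;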