I have the following database schema:
keywords(id, keyword, lang) (about 8M records)
topics(id, topic, lang) (about 2.6M records)
topic_keywords(topic_id, keyword_id, weight) (about 200M records)
In a script, I have about 50-100 keywords, each with an additional field keyword_score, and I want to retrieve the top 20 topics that correspond to those keywords based on the following formula: SUM(keyword_score * topic_weight)
The solution I currently use in my script is:
Create a temporary table temporary_keywords(keyword_id, keyword_score)
Insert all 50-100 keywords into it with their keyword_score
Execute the following query to retrieve the topics
SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING keyword_id
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20
This solution works, but in some cases it takes up to 3 seconds to execute, which is too much for me.
Is there a way to optimize this query, or should I redesign the data structure into a NoSQL database?
Any other solutions or ideas beyond what is listed above are most appreciated
UPDATE (SHOW CREATE TABLE)
CREATE TABLE `topic_keywords` (
`topic_id` int(11) NOT NULL,
`keyword_id` int(11) NOT NULL,
`weight` float DEFAULT '0',
PRIMARY KEY (`topic_id`,`keyword_id`),
KEY `keyword_id_idx` (`keyword_id`,`topic_id`,`weight`)
)
CREATE TEMPORARY TABLE temporary_keywords
( keyword_id INT PRIMARY KEY NOT NULL,
keyword_score DOUBLE
)
EXPLAIN QUERY
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
| 1 | SIMPLE | temporary_keywords | ALL | PRIMARY | NULL | NULL | NULL | 100 | Using temporary; Using filesort |
| 1 | SIMPLE | topic_keywords | ref | keyword_id_idx | keyword_id_idx | 4 | topics.temporary_keywords.keyword_id | 10778853 | Using index |
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
Incorrect, but uncaught, syntax.
JOIN topic_keywords USING keyword_id
-->
JOIN topic_keywords USING(keyword_id)
If that does not fix it, please provide EXPLAIN FORMAT=JSON SELECT ...
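For reference, here is a sketch of the whole flow with the USING syntax corrected, using the table and column names from the question (the keyword ids and scores are placeholders for illustration):

```sql
-- Temporary table as defined in the question
CREATE TEMPORARY TABLE temporary_keywords (
  keyword_id    INT PRIMARY KEY NOT NULL,
  keyword_score DOUBLE
);

-- Placeholder ids/scores; the script inserts its 50-100 real keywords here
INSERT INTO temporary_keywords (keyword_id, keyword_score) VALUES
  (101, 0.8),
  (102, 0.5);

SELECT tk.topic_id, SUM(tk.weight * t.keyword_score) AS score
FROM temporary_keywords t
JOIN topic_keywords tk USING (keyword_id)  -- parentheses are required by USING
GROUP BY tk.topic_id
ORDER BY score DESC
LIMIT 20;
```

With the parentheses in place, the join condition parses as intended and the keyword_id_idx covering index on topic_keywords can be used for the lookup.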
Related
I have three tables, bug, bugrule and bugtrace, for which relationships are:
bug 1--------N bugrule
id = bugid
bugrule 0---------N bugtrace
id = ruleid
Because I'm almost always interested in relations between bug <---> bugtrace I have created an appropriate VIEW which is used as part of several queries. Interestingly, queries using this VIEW have significantly worse performance than equivalent queries using the underlying JOIN explicitly.
VIEW definition:
CREATE VIEW bugtracev AS
SELECT t.*, r.bugid
FROM bugtrace AS t
LEFT JOIN bugrule AS r ON t.ruleid=r.id
WHERE r.version IS NULL
Execution plan for a query using the VIEW (bad performance):
mysql> explain
SELECT c.id,state,
(SELECT COUNT(DISTINCT(t.id)) FROM bugtracev AS t
WHERE t.bugid=c.id)
FROM bug AS c
WHERE c.version IS NULL
AND c.id<10;
+----+--------------------+-------+-------+---------------+--------+---------+-----------------+---------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+--------+---------+-----------------+---------+-----------------------+
| 1 | PRIMARY | c | range | id_2,id | id_2 | 8 | NULL | 3 | Using index condition |
| 2 | DEPENDENT SUBQUERY | t | index | NULL | ruleid | 9 | NULL | 1426004 | Using index |
| 2 | DEPENDENT SUBQUERY | r | ref | id_2,id | id_2 | 8 | bugapp.t.ruleid | 1 | Using where |
+----+--------------------+-------+-------+---------------+--------+---------+-----------------+---------+-----------------------+
3 rows in set (0.00 sec)
Execution plan for a query using the underlying JOIN directly (good performance):
mysql> explain
SELECT c.id,state,
(SELECT COUNT(DISTINCT(t.id))
FROM bugtrace AS t
LEFT JOIN bugrule AS r ON t.ruleid=r.id
WHERE r.version IS NULL
AND r.bugid=c.id)
FROM bug AS c
WHERE c.version IS NULL
AND c.id<10;
+----+--------------------+-------+-------+---------------+--------+---------+-------------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+--------+---------+-------------+--------+-----------------------+
| 1 | PRIMARY | c | range | id_2,id | id_2 | 8 | NULL | 3 | Using index condition |
| 2 | DEPENDENT SUBQUERY | r | ref | id_2,id,bugid | bugid | 8 | bugapp.c.id | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | t | ref | ruleid | ruleid | 9 | bugapp.r.id | 713002 | Using index |
+----+--------------------+-------+-------+---------------+--------+---------+-------------+--------+-----------------------+
3 rows in set (0.00 sec)
CREATE TABLE statements (reduced by irrelevant columns) are:
mysql> show create table bug;
CREATE TABLE `bug` (
`id` bigint(20) NOT NULL,
`version` int(11) DEFAULT NULL,
`state` varchar(16) DEFAULT NULL,
UNIQUE KEY `id_2` (`id`,`version`),
KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> show create table bugrule;
CREATE TABLE `bugrule` (
`id` bigint(20) NOT NULL,
`version` int(11) DEFAULT NULL,
`bugid` bigint(20) NOT NULL,
UNIQUE KEY `id_2` (`id`,`version`),
KEY `id` (`id`),
KEY `bugid` (`bugid`),
CONSTRAINT `bugrule_ibfk_1` FOREIGN KEY (`bugid`) REFERENCES `bug` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
mysql> show create table bugtrace;
CREATE TABLE `bugtrace` (
`id` bigint(20) NOT NULL,
`ruleid` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `ruleid` (`ruleid`),
CONSTRAINT `bugtrace_ibfk_1` FOREIGN KEY (`ruleid`) REFERENCES `bugrule` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
You ask why the VIEW-based query performs worse. For complex queries involving COUNT(DISTINCT val) and dependent subqueries, it's hard to know for sure.
You will probably fix most of your performance problem by getting rid of your dependent subquery, though. Try something like this:
SELECT c.id,state, cnt.cnt
FROM bug AS c
LEFT JOIN (
SELECT bugid, COUNT(DISTINCT id) cnt
FROM bugtracev
GROUP BY bugid
) cnt ON c.id = cnt.bugid
WHERE c.version IS NULL
AND c.id<10;
Why does this help? To satisfy the query the optimizer can choose to run the GROUP BY subquery just once, rather than many times. And, you can use EXPLAIN on the GROUP BY subquery to understand its performance.
You may also get a performance boost by creating a compound index on bugrule that matches the query in your view. Try this one:
CREATE INDEX bugrule_v ON bugrule (version, ruleid, bugid)
and try switching the last two columns like so:
CREATE INDEX bugrule_v2 ON bugrule (version, bugid, ruleid)
These indexes are called covering indexes because they contain all the columns needed to satisfy your query. version appears first because that helps optimize WHERE version IS NULL in your view definition. That makes it faster.
Pro tip: Avoid using SELECT * in views and queries, especially when you have performance problems. Instead, list the columns you actually need. The * may force the query optimizer to avoid a covering index, even when the index would help.
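As a sketch of that advice applied to the view above, assuming id and ruleid are the only bugtrace columns the outer queries actually need (adjust the column list to your real usage):

```sql
-- Redefine the view with an explicit column list instead of t.*,
-- so a covering index on (ruleid, id) or similar can satisfy it
CREATE OR REPLACE VIEW bugtracev AS
SELECT t.id, t.ruleid, r.bugid
FROM bugtrace AS t
LEFT JOIN bugrule AS r ON t.ruleid = r.id
WHERE r.version IS NULL;
```

The narrower the select list, the more likely the optimizer can answer queries against the view from an index alone.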
If you are using MySQL 5.6 or older, try upgrading to at least MySQL 5.7. According to What's New in MySQL 5.7?:
We have to a large extent unified the handling of derived tables and views. Until now, subqueries in the FROM clause (derived tables) were unconditionally materialized, while views created from the same query expressions were sometimes materialized and sometimes merged into the outer query. This behavior, beside being inconsistent, can lead to a serious performance penalty.
Can anyone suggest a good index to make this query run quicker?
SELECT
s.*,
sl.session_id AS session_id,
sl.lesson_id AS lesson_id
FROM
cdu_sessions s
INNER JOIN cdu_sessions_lessons sl ON sl.session_id = s.id
WHERE
(s.sort = '1') AND
(s.enabled = '1') AND
(s.teacher_id IN ('193', '1', '168', '1797', '7622', '19951'))
Explain:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | s | NULL | ALL | PRIMARY | NULL | NULL | NULL | 2993 | 0.50 | Using where
1 | SIMPLE | sl | NULL | ref | session_id,ix2 | ix2 | 4 | ealteach_main.s.id | 5 | 100.00 | Using index
cdu_sessions looks like this:
------------------------------------------------
id | int(11)
name | varchar(255)
map_location | enum('classroom', 'school'...)
sort | tinyint(1)
sort_titles | tinyint(1)
friend_gender | enum('boy', 'girl'...)
friend_name | varchar(255)
friend_description | varchar(2048)
friend_description_format | varchar(128)
friend_description_audio | varchar(255)
friend_description_audio_fid | int(11)
enabled | tinyint(1)
created | int(11)
teacher_id | int(11)
custom | int(1)
------------------------------------------------
cdu_sessions_lessons contains 3 fields - id, session_id and lesson_id
Thanks!
Without looking at the query plan, row counts, and data distribution for each table, it is hard to predict a good index to make it run faster.
But I would say that this might help:
> create index sessions_teacher_idx on cdu_sessions(teacher_id);
looking at where condition you could use a composite index for table cdu_sessions
create index idx1 on cdu_sessions(teacher_id, sort, enabled);
and looking to join and select for table cdu_sessions_lessons
create index idx2 on cdu_sessions_lessons(session_id, lesson_id);
First, write the query so no type conversions are necessary. All the comparisons in the where clause are to numbers, so use numeric constants:
SELECT s.*,
sl.session_id, -- unnecessary because s.id is in the result set
sl.lesson_id
FROM cdu_sessions s INNER JOIN
cdu_sessions_lessons sl
ON sl.session_id = s.id
WHERE s.sort = 1 AND
s.enabled = 1 AND
s.teacher_id IN (193, 1, 168, 1797, 7622, 19951);
Although it might not be happening in this specific case, mixing types can impede the use of indexes.
I removed the column aliases (AS session_id, for instance). These were redundant because the column name is already the alias, and the query wasn't changing the name.
For this query, first look at the WHERE clause. All the column references are from one table. These should go in the index, with the equality comparisons first:
create index idx_cdu_sessions_4 on cdu_sessions(sort, enabled, teacher_id, id)
I added id because it is also used in the JOIN.
Formally, id is not needed in the index if it is the primary key. However, I like to be explicit if I want it there.
Next you want an index for the second table. Only two columns are referenced from there, so they can both go in the index. The first column should be the one used in the join:
create index idx_cdu_sessions_lessons_2 on cdu_sessions_lessons(session_id, lesson_id);
I have a data warehouse with the following tables:
main
about 8 million records
CREATE TABLE `main` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cid` mediumint(8) unsigned DEFAULT NULL, -- the customer id
`iid` mediumint(8) unsigned DEFAULT NULL, -- the item id
`pid` tinyint(3) unsigned DEFAULT NULL, -- the period id
`qty` double DEFAULT NULL,
`sales` double DEFAULT NULL,
`gm` double DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=7978349 DEFAULT CHARSET=latin1
period
This table has about 50 records and has the following fields
id
month
year
customer
This has about 23,000 records and the following fields
id
number //This field is unique
name //This is simply a description field
The following query runs very fast (less than 1 second) and returns about 2,000 rows:
select count(*)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
But this query, which is the same as the previous one but sums instead of counts, is much slower (more than 45 seconds):
select sum(sales)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
When I explain each query, the ONLY difference I see is that on the 'count()'
query the 'Extra' field says 'Using index', while for the 'sum()' query this field is NULL.
Explain count() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | Using index |
Explain sum() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | NULL |
Why is the count() so much faster than sum()? Shouldn't it be using the index for both?
What can I do to make the sum() go faster?
Thanks in advance!
EDIT
All the tables show that it is using Engine InnoDB
Also, as a side note, if I just do a 'SELECT *' query, this runs very quickly (less than 2 seconds). I would expect that the 'SUM()' shouldn't take any longer than that since SELECT * has to retrieve the rows anyways...
SOLVED
This is what I've learned:
Since the sales field is not part of the index, it has to retrieve the records from the hard drive (which can be kind of slow).
I'm not too familiar with this, but it looks like I/O performance can be increased by switching to a SSD (Solid-state drive). I'll have to research this more.
For now, I think I'm going to create another layer of summary in order to get the performance I'm looking for.
I redefined my index on the main table to be (pid,cid,iid,sales,gm,qty) and now the sum() queries are running VERY fast!
Thanks everybody!
An index is a list of key values.
When you run the count() query, the actual data rows can be ignored and just the index used.
When you run the sum(sales) query, each row has to be read from disk to get the sales figure, hence much slower.
Additionally, the index can be read in bulk and processed in memory, while the row fetches will randomly thrash the drive, reading rows scattered across the disk.
Finally, the index itself may carry summary counts (to help with plan generation).
Update
You actually have three indexes on your table:
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
So you only have indexes on the columns id, pid, cid, and iid. (As an aside, most databases are smart enough to combine indexes, so you could probably consolidate your indexes somewhat.)
If you added another key like KEY idx_sales (id, sales), that could improve this query, but given that sales values change on every update, you would be adding extra write cost, which is likely a bad thing.
The simple answer is that count() is only counting rows. This can be satisfied by the index.
The sum() needs to identify each row and then fetch the page in order to get the sales column. This adds a lot of overhead -- about one page load per row.
If you add sales into the index, then it should also go very fast, because it will not have to fetch the original data.
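A minimal sketch of such a covering index, mirroring what the question's SOLVED update settled on (the index name is hypothetical, and the column order follows the existing idx_pci key):

```sql
-- Extend the existing (pid, cid, iid) key with the aggregated columns,
-- so sum(sales), sum(gm), or sum(qty) can be answered from the index
-- alone, without touching the base rows:
CREATE INDEX idx_pci_covering ON main (pid, cid, iid, sales, gm, qty);
```

After this, EXPLAIN should show "Using index" for the sum() query as well, since every referenced column is present in the index.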
I've read a lot of questions about query optimization but none have helped me with this.
As setup, I have 3 tables that represent an "entry" that can have zero or more "categories".
> show create table entries;
CREATE TABLE `entries` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT
...
`name` varchar(255),
`updated_at` timestamp NOT NULL,
...
PRIMARY KEY (`id`),
KEY `name` (`name`)
) ENGINE=InnoDB
> show create table entry_categories;
CREATE TABLE `entry_categories` (
`ent_name` varchar(255),
`cat_id` int(11),
PRIMARY KEY (`ent_name`,`cat_id`),
KEY `names` (`ent_name`)
) ENGINE=InnoDB
(The actual "category" table doesn't come into the question.)
Editing an "entry" in the application creates a new row in the entry table -- think like the history of a wiki page -- with the same name and a newer timestamp. I want to see how many uniquely-named Entries don't have a category, which seems really straightforward:
SELECT COUNT(id)
FROM entries e
LEFT JOIN entry_categories c
ON e.name=c.ent_name
WHERE c.ent_name IS NULL
GROUP BY e.name;
On my small dataset (about 6000 total entries, with about 4000 names, averaging about one category per named entry) this query takes over 24 seconds (!). I've also tried
SELECT COUNT(id)
FROM entries e
WHERE NOT EXISTS(
SELECT ent_name
FROM entry_categories c
WHERE c.ent_name = e.name
)
GROUP BY e.name;
with similar results. This seems really, really slow to me, especially considering that finding entries in a single category with
SELECT COUNT(*)
FROM entries e
JOIN (
SELECT ent_name as name
FROM entry_categories
WHERE cat_id = 123
)c
USING (name)
GROUP BY name;
runs in about 120ms on the same data. Is there a better way to find records in a table that don't have at least one corresponding entry in another table?
I'll try to transcribe the EXPLAIN results for each query:
> EXPLAIN {no category query};
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | e | index | NULL | name | 767 | NULL | 6222 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | c | index | PRIMARY,names | names | 767 | NULL | 6906 | Using where; using index; Not exists |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
> EXPLAIN {single category query}
+----+-------------+------------+-------+---------------+-------+---------+------+--------------------------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+-------+---------+------+--------------------------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 2850 | Using temporary; Using filesort |
| 1  | PRIMARY     | e          | ref   | name          | name  | 767     | c.name | 1    | Using where; Using index        |
| 2  | DERIVED     | c          | index | NULL          | names | 767     | NULL   | 6906 | Using where; Using index        |
+----+-------------+------------+-------+---------------+-------+---------+------+--------------------------+---------------------------------+
Try:
select name, sum(e) count_entries from
(select name, 1 e, 0 c from entries
union all
select ent_name name, 0 e, 1 c from entry_categories) s
group by name
having sum(c) = 0
First: remove the names key, as it duplicates the primary key (ent_name is the left-most column of the primary key, so the PK can be used to resolve the query). This should change the EXPLAIN output to use the PK in the join.
The keys you are joining on are pretty large (a varchar(255) column). It is better to use integers for this, even if that means introducing one more table (holding the entry id to name mapping).
For some reason the query uses filesort, even though you don't have an ORDER BY clause.
Can you show the EXPLAIN results next to each query, including the single-category query, for further diagnosis?
I am fighting some performance problems on a very simple table which seems to be very slow when fetching data by its primary key (bigint).
I have this table with 124 million entries:
CREATE TABLE `nodes` (
`id` bigint(20) NOT NULL,
`lat` float(13,7) NOT NULL,
`lon` float(13,7) NOT NULL,
PRIMARY KEY (`id`),
KEY `lat_index` (`lat`),
KEY `lon_index` (`lon`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
and a simple query which takes some ids from another table using an IN clause to fetch data from the nodes table, but it takes about an hour just to fetch a few rows from this table.
EXPLAIN shows that it's not using the PRIMARY key as an index; it's simply scanning the whole table. Why is that? id and the id column from the other table are both of type bigint(20).
mysql> EXPLAIN SELECT lat, lon FROM nodes WHERE id IN (SELECT node_id FROM ways_elements WHERE way_id = '4962890');
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
| 1 | PRIMARY | nodes | ALL | NULL | NULL | NULL | NULL | 124035228 | Using where |
| 2 | DEPENDENT SUBQUERY | ways_elements | ref | way_id | way_id | 8 | const | 2 | Using where |
+----+--------------------+-------------------+------+---------------+--------+---------+-------+-----------+-------------+
The query SELECT node_id FROM ways_elements WHERE way_id = '4962890' simply returns two node ids, so the whole query should only return two rows, but it takes more or less an hour.
Using FORCE INDEX (PRIMARY) didn't help. And even if it had, why does MySQL not pick that index on its own, since it's the primary key? EXPLAIN doesn't even mention anything in the possible_keys column, although select_type shows PRIMARY.
Am I doing something wrong?
How does this perform?
SELECT lat, lon FROM nodes t1 join ways_elements t2 on (t1.id=t2.node_id) WHERE t2.way_id = '4962890'
I suspect that your query is checking each row in nodes against each item in the IN clause.
This is what is called a correlated subquery (see the MySQL reference manual, or this popular question posted on Stack Overflow). A better query to use is:
SELECT lat,
lon
FROM nodes n
JOIN ways_elements w ON n.id = w.node_id
WHERE way_id = '4962890'