Need index for simple query please - mysql

Can anyone suggest a good index to make this query run quicker?
SELECT
s.*,
sl.session_id AS session_id,
sl.lesson_id AS lesson_id
FROM
cdu_sessions s
INNER JOIN cdu_sessions_lessons sl ON sl.session_id = s.id
WHERE
(s.sort = '1') AND
(s.enabled = '1') AND
(s.teacher_id IN ('193', '1', '168', '1797', '7622', '19951'))
Explain:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
1 | SIMPLE | s | NULL | ALL | PRIMARY | NULL | NULL | NULL | 2993 | 0.50 | Using where
1 | SIMPLE | sl | NULL | ref | session_id,ix2 | ix2 | 4 | ealteach_main.s.id | 5 | 100.00 | Using index
cdu_sessions looks like this:
------------------------------------------------
id | int(11)
name | varchar(255)
map_location | enum('classroom', 'school'...)
sort | tinyint(1)
sort_titles | tinyint(1)
friend_gender | enum('boy', 'girl'...)
friend_name | varchar(255)
friend_description | varchar(2048)
friend_description_format | varchar(128)
friend_description_audio | varchar(255)
friend_description_audio_fid | int(11)
enabled | tinyint(1)
created | int(11)
teacher_id | int(11)
custom | int(1)
------------------------------------------------
cdu_sessions_lessons contains 3 fields - id, session_id and lesson_id
Thanks!

Without looking at the query plan, row count, and distribution on each table, it is hard to predict a good index to make it run faster.
But I would say that this might help:
create index sessions_teacher_idx on cdu_sessions(teacher_id);

Looking at the WHERE condition, you could use a composite index for table cdu_sessions:
create index idx1 on cdu_sessions(teacher_id, sort, enabled);
and, looking at the JOIN and SELECT, one for table cdu_sessions_lessons:
create index idx2 on cdu_sessions_lessons(session_id, lesson_id);

First, write the query so no type conversions are necessary. All the comparisons in the WHERE clause are to numbers, so use numeric constants:
SELECT s.*,
sl.session_id, -- unnecessary because s.id is in the result set
sl.lesson_id
FROM cdu_sessions s INNER JOIN
cdu_sessions_lessons sl
ON sl.session_id = s.id
WHERE s.sort = 1 AND
s.enabled = 1 AND
s.teacher_id IN (193, 1, 168, 1797, 7622, 19951);
Although it might not be happening in this specific case, mixing types can impede the use of indexes.
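To illustrate with a hypothetical table (not from the question): when a string column is compared to a numeric constant, MySQL has to cast every row's value before comparing, so an index on that column cannot be used; comparing against a string constant can use it:
CREATE TABLE t (code VARCHAR(16), KEY (code));
EXPLAIN SELECT * FROM t WHERE code = 123;   -- numeric constant forces a per-row cast; index unusable (type=ALL)
EXPLAIN SELECT * FROM t WHERE code = '123'; -- string-to-string comparison; index usable (type=ref)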
I removed the column aliases (AS session_id, for instance). These were redundant because the alias was identical to the column name, so the query wasn't renaming anything.
For this query, first look at the WHERE clause. All the column references are from one table. These should go in the index, with the equality comparisons first:
create index idx_cdu_sessions_4 on cdu_sessions(sort, enabled, teacher_id, id);
I added id because it is also used in the JOIN.
Formally, id is not needed in the index if it is the primary key. However, I like to be explicit if I want it there.
Next you want an index for the second table. Only two columns are referenced from there, so they can both go in the index. The first column should be the one used in the join:
create index idx_cdu_sessions_lessons_2 on cdu_sessions_lessons(session_id, lesson_id);
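With both indexes in place, re-running the plan check should show the change (a sketch; the exact plan depends on the data and statistics):
EXPLAIN SELECT s.*, sl.lesson_id
FROM cdu_sessions s INNER JOIN
     cdu_sessions_lessons sl
     ON sl.session_id = s.id
WHERE s.sort = 1 AND
      s.enabled = 1 AND
      s.teacher_id IN (193, 1, 168, 1797, 7622, 19951);
-- expect s to switch from type=ALL to a ref/range scan on idx_cdu_sessions_4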

Related

Optimize join query with SUM, Group By and ORDER By Clauses

I have the following database schema
keywords(id, keyword, lang) :( about 8M records)
topics(id, topic, lang) : ( about 2.6M records)
topic_keywords(topic_id, keyword_id, weight) : (200M records)
In a script, I have about 50-100 keywords with an additional field keyword_score, and I want to retrieve the top 20 topics that correspond to those keywords based on the following formula: SUM(keyword_score * topic_weight)
The solution I currently implement in my script is:
I create a temporary table as follows: temporary_keywords(keyword_id, keyword_score)
Insert all 50-100 keywords into it with their keyword_score
Then execute the following query to retrieve topics
SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING keyword_id
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20
This solution works, but it takes in some cases up to 3 seconds to execute, which is too much for me.
I'm asking if there is a way to optimize this query, or should I redesign the data structure into a NoSQL database?
Any other solutions or ideas beyond what is listed above are most appreciated
UPDATE (SHOW CREATE TABLE)
CREATE TABLE `topic_keywords` (
`topic_id` int(11) NOT NULL,
`keyword_id` int(11) NOT NULL,
`weight` float DEFAULT '0',
PRIMARY KEY (`topic_id`,`keyword_id`),
KEY `keyword_id_idx` (`keyword_id`,`topic_id`,`weight`)
)
CREATE TEMPORARY TABLE temporary_keywords
( keyword_id INT PRIMARY KEY NOT NULL,
keyword_score DOUBLE
)
EXPLAIN QUERY
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
| 1 | SIMPLE | temporary_keywords | ALL | PRIMARY | NULL | NULL | NULL | 100 | Using temporary; Using filesort |
| 1 | SIMPLE | topic_keywords | ref | keyword_id_idx | keyword_id_idx | 4 | topics.temporary_keywords.keyword_id | 10778853 | Using index |
+----+-------------+--------------------+------+----------------------+----------------------+---------+--------------------------------------+----------+---------------------------------+
Incorrect, but uncaught, syntax.
JOIN topic_keywords USING keyword_id
-->
JOIN topic_keywords USING(keyword_id)
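With that fix applied, the full query from the question reads:
SELECT topic_id, SUM(weight * keyword_score) AS score
FROM temporary_keywords
JOIN topic_keywords USING (keyword_id)
GROUP BY topic_id
ORDER BY score DESC
LIMIT 20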
If that does not fix it, please provide EXPLAIN FORMAT=JSON SELECT ...

Why is COUNT() query from large table much faster than SUM()

I have a data warehouse with the following tables:
main
about 8 million records
CREATE TABLE `main` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`cid` mediumint(8) unsigned DEFAULT NULL, -- This is the customer id
`iid` mediumint(8) unsigned DEFAULT NULL, -- This is the item id
`pid` tinyint(3) unsigned DEFAULT NULL, -- This is the period id
`qty` double DEFAULT NULL,
`sales` double DEFAULT NULL,
`gm` double DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=7978349 DEFAULT CHARSET=latin1
period
This table has about 50 records and has the following fields
id
month
year
customer
This has about 23,000 records and the following fields
id
number //This field is unique
name //This is simply a description field
The following query runs very fast (less than 1 second) and returns a count of about 2,000:
select count(*)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
But this query is much slower (more than 45 seconds), which is the same as the previous one but sums instead of counts:
select sum(sales)
from mydb.main m
INNER JOIN mydb.period p ON p.id = m.pid
INNER JOIN mydb.customer c ON c.id = m.cid
WHERE p.year = 2013 AND c.number = 'ABC';
When I EXPLAIN each query, the ONLY difference I see is that for the COUNT() query the 'Extra' field says 'Using index', while for the SUM() query it is NULL.
Explain count() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | Using index |
Explain sum() query
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | c | const | PRIMARY,idx_customer | idx_customer | 11 | const | 1 | Using index |
| 1 | SIMPLE | p | ref | PRIMARY,idx_period | idx_period | 4 | const | 6 | Using index |
| 1 | SIMPLE | m | ref | idx_pci,idx_pic | idx_pci | 6 | mydb.p.id,const | 7 | NULL |
Why is the count() so much faster than sum()? Shouldn't it be using the index for both?
What can I do to make the sum() go faster?
Thanks in advance!
EDIT
All the tables show that it is using Engine InnoDB
Also, as a side note, if I just do a 'SELECT *' query, this runs very quickly (less than 2 seconds). I would expect that the 'SUM()' shouldn't take any longer than that, since SELECT * has to retrieve the rows anyway...
SOLVED
This is what I've learned:
Since the sales field is not part of the index, it has to retrieve the records from the hard drive (which can be kind of slow).
I'm not too familiar with this, but it looks like I/O performance can be increased by switching to an SSD (solid-state drive). I'll have to research this more.
For now, I think I'm going to create another layer of summary in order to get the performance I'm looking for.
I redefined my index on the main table to be (pid,cid,iid,sales,gm,qty) and now the sum() queries are running VERY fast!
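In DDL terms, that redefinition looks something like this (a sketch; assuming the key being replaced was idx_pci):
ALTER TABLE main
    DROP KEY idx_pci,
    ADD KEY idx_pci (pid, cid, iid, sales, gm, qty);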
Thanks everybody!
An index stores just the key columns, not the whole rows.
When you do the COUNT() query, the actual data in the database can be ignored and just the index used.
When you do the SUM(sales) query, each row has to be read from disk to get the sales figure, hence much slower.
Additionally, the index can be read in bulk and then processed in memory, while the row fetches randomly thrash the drive trying to read rows from across the disk.
Finally, the index itself may have summaries of the counts (to help with plan generation).
Update
You actually have three indexes on your table:
PRIMARY KEY (`id`),
KEY `idx_pci` (`pid`,`cid`,`iid`) USING HASH,
KEY `idx_pic` (`pid`,`iid`,`cid`) USING HASH
So you only have indexes on the columns id, pid, cid, iid. (As an aside, most databases are smart enough to combine indexes, so you could probably optimize your indexes somewhat)
If you added another key like KEY idx_sales(id, sales), that could improve performance, but given the likely distribution of numeric sales values, you would be adding extra cost to every update, which is likely a bad thing.
The simple answer is that count() is only counting rows. This can be satisfied by the index.
The sum() needs to identify each row and then fetch the page in order to get the sales column. This adds a lot of overhead -- about one page load per row.
If you add sales into the index, then it should also go very fast, because it will not have to fetch the original data.

MySQL query to find items without matching record in JOIN is very slow

I've read a lot of questions about query optimization but none have helped me with this.
As setup, I have 3 tables that represent an "entry" that can have zero or more "categories".
> show create table entries;
CREATE TABLE `entries` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT
...
`name` varchar(255),
`updated_at` timestamp NOT NULL,
...
PRIMARY KEY (`id`),
KEY `name` (`name`)
) ENGINE=InnoDB
> show create table entry_categories;
CREATE TABLE `entry_categories` (
`ent_name` varchar(255),
`cat_id` int(11),
PRIMARY KEY (`ent_name`,`cat_id`),
KEY `names` (`ent_name`)
) ENGINE=InnoDB
(The actual "category" table doesn't come into the question.)
Editing an "entry" in the application creates a new row in the entry table -- think like the history of a wiki page -- with the same name and a newer timestamp. I want to see how many uniquely-named Entries don't have a category, which seems really straightforward:
SELECT COUNT(id)
FROM entries e
LEFT JOIN entry_categories c
ON e.name=c.ent_name
WHERE c.ent_name IS NULL
GROUP BY e.name;
On my small dataset (about 6000 total entries, with about 4000 names, averaging about one category per named entry) this query takes over 24 seconds (!). I've also tried
SELECT COUNT(id)
FROM entries e
WHERE NOT EXISTS(
SELECT ent_name
FROM entry_categories c
WHERE c.ent_name = e.name
)
GROUP BY e.name;
with similar results. This seems really, really slow to me, especially considering that finding entries in a single category with
SELECT COUNT(*)
FROM entries e
JOIN (
SELECT ent_name as name
FROM entry_categories
WHERE cat_id = 123
)c
USING (name)
GROUP BY name;
runs in about 120ms on the same data. Is there a better way to find records in a table that don't have at least one corresponding entry in another table?
I'll try to transcribe the EXPLAIN results for each query:
> EXPLAIN {no category query};
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | e | index | NULL | name | 767 | NULL | 6222 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | c | index | PRIMARY,names | names | 767 | NULL | 6906 | Using where; using index; Not exists |
+----+-------------+-------+-------+---------------+-------+---------+------+------+----------------------------------------------+
> EXPLAIN {single category query}
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
| id | select_type | table      | type  | possible_keys | key   | key_len | ref    | rows | Extra                           |
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL  | NULL    | NULL   | 2850 | Using temporary; Using filesort |
|  1 | PRIMARY     | e          | ref   | name          | name  | 767     | c.name |    1 | Using where; Using index        |
|  2 | DERIVED     | c          | index | NULL          | names | NULL    | NULL   | 6906 | Using where; Using index        |
+----+-------------+------------+-------+---------------+-------+---------+--------+------+---------------------------------+
Try:
select name, sum(e) count_entries from
(select name, 1 e, 0 c from entries
union all
select ent_name name, 0 e, 1 c from entry_categories) s
group by name
having sum(c) = 0
First: remove the names key as it's the same as the primary key (as the ent_name column is the left-most in the primary key and the PK can be used to resolve the query). This should change the output of explain by using the PK in the join.
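As a sketch, that cleanup is:
ALTER TABLE entry_categories DROP KEY `names`;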
The keys you are using to join are pretty large (a 255-character varchar column). It is better to use integers for this, even if it means introducing one more table (with the room_id, room_name mapping).
For some reason the query uses a filesort, even though you don't have an ORDER BY clause. (In MySQL 5.x, GROUP BY implicitly sorts its result unless you add ORDER BY NULL, which can explain the filesort.)
Can you show the EXPLAIN results next to each query, including the single-category query, for further diagnosis?

MySQL join issue, query hangs

I have a table holding numeric data points with timestamps, like so:
CREATE TABLE `value_table1` (
`datetime` datetime NOT NULL,
`value` decimal(14,8) DEFAULT NULL,
KEY `datetime` (`datetime`)
) ENGINE=InnoDB;
My table holds a data point for every 5 seconds, so timestamps in the table will be, e.g.:
"2013-01-01 10:23:35"
"2013-01-01 10:23:40"
"2013-01-01 10:23:45"
"2013-01-01 10:23:50"
I have a few such value tables, and it is sometimes necessary to look at the ratio between two value series.
I therefore attempted a join, but it seems to not work:
SELECT value_table1.datetime, value_table1.value / value_table2.rate
FROM value_table1
JOIN value_table2
ON value_table1.datetime = value_table2.datetime
ORDER BY value_table1.datetime ASC;
Running EXPLAIN on the query shows:
+----+-------------+--------------+------+---------------+------+---------+------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+------+-------+---------------------------------+
| 1 | SIMPLE | value_table1 | ALL | NULL | NULL | NULL | NULL | 83784 | Using temporary; Using filesort |
| 1 | SIMPLE | value_table2 | ALL | NULL | NULL | NULL | NULL | 83735 | |
+----+-------------+--------------+------+---------------+------+---------+------+-------+---------------------------------+
Edit
Problem solved, no idea where my index disappeared to. EXPLAIN showed it, thanks!
As your explain shows, the query is not using indexes on the join. Without indexes, it has to scan every row in both tables to process the join.
First of all, make sure the columns used in the join are both indexed.
If they are, then it might be the column type that is causing issues. You could create an integer representation of the time, and then use that to join the two tables.
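A minimal sketch of the first suggestion, assuming value_table2 mirrors value_table1 and is simply missing its index:
ALTER TABLE value_table2 ADD KEY `datetime` (`datetime`);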

MySQL & nested set: slow JOIN (not using index)

I have two tables:
localities:
CREATE TABLE `localities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL,
`type` varchar(30) NOT NULL,
`parent_id` int(11) DEFAULT NULL,
`lft` int(11) DEFAULT NULL,
`rgt` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_localities_on_parent_id_and_type` (`parent_id`,`type`),
KEY `index_localities_on_name` (`name`),
KEY `index_localities_on_lft_and_rgt` (`lft`,`rgt`)
) ENGINE=InnoDB;
locatings:
CREATE TABLE `locatings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`localizable_id` int(11) DEFAULT NULL,
`localizable_type` varchar(255) DEFAULT NULL,
`locality_id` int(11) NOT NULL,
`category` varchar(50) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_locatings_on_locality_id` (`locality_id`),
KEY `localizable_and_category_index` (`localizable_type`,`localizable_id`,`category`),
KEY `index_locatings_on_category` (`category`)
) ENGINE=InnoDB;
localities table is implemented as a nested set.
Now, when user belongs to some locality (through some locating) he also belongs to all its ancestors (higher level localities). I need a query that will select all the localities that all the users belong to into a view.
Here is my try:
select distinct lca.*, lt.localizable_type, lt.localizable_id
from locatings lt
join localities lc on lc.id = lt.locality_id
left join localities lca on (lca.lft <= lc.lft and lca.rgt >= lc.rgt)
The problem here is that it takes way too much time to execute.
I consulted EXPLAIN:
+----+-------------+-------+--------+---------------------------------+---------+---------+----------------------------------+-------+----------+-----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+---------------------------------+---------+---------+----------------------------------+-------+----------+-----------------+
| 1 | SIMPLE | lt | ALL | index_locatings_on_locality_id | NULL | NULL | NULL | 4926 | 100.00 | Using temporary |
| 1 | SIMPLE | lc | eq_ref | PRIMARY | PRIMARY | 4 | bzzik_development.lt.locality_id | 1 | 100.00 | |
| 1 | SIMPLE | lca | ALL | index_localities_on_lft_and_rgt | NULL | NULL | NULL | 11439 | 100.00 | |
+----+-------------+-------+--------+---------------------------------+---------+---------+----------------------------------+-------+----------+-----------------+
3 rows in set, 1 warning (0.00 sec)
The last join obviously doesn’t use lft, rgt index as I expect it to. I’m desperate.
UPDATE:
After adding a condition as @cairnz suggested, the query still takes too much time to process.
UPDATE 2: Column names instead of the asterisk
Updated query:
SELECT DISTINCT lca.id, lt.`localizable_id`, lt.`localizable_type`
FROM locatings lt FORCE INDEX(index_locatings_on_category)
JOIN localities lc
ON lc.id = lt.locality_id
INNER JOIN localities lca
ON lca.lft <= lc.lft AND lca.rgt >= lc.rgt
WHERE lt.`category` != "Unknown";
Updated EXPLAIN:
+----+-------------+-------+--------+-----------------------------------------+-----------------------------+---------+---------------------------------+-------+----------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+--------+-----------------------------------------+-----------------------------+---------+---------------------------------+-------+----------+-------------------------------------------------+
| 1 | SIMPLE | lt | range | index_locatings_on_category | index_locatings_on_category | 153 | NULL | 2545 | 100.00 | Using where; Using temporary |
| 1 | SIMPLE | lc | eq_ref | PRIMARY,index_localities_on_lft_and_rgt | PRIMARY | 4 | bzzik_production.lt.locality_id | 1 | 100.00 | |
| 1 | SIMPLE | lca | ALL | index_localities_on_lft_and_rgt | NULL | NULL | NULL | 11570 | 100.00 | Range checked for each record (index map: 0x10) |
+----+-------------+-------+--------+-----------------------------------------+-----------------------------+---------+---------------------------------+-------+----------+-------------------------------------------------+
Any help appreciated.
Ah, it just occurred to me.
Since you are asking for everything in the table, MySQL decides to use a full table scan instead, as it deems that more efficient.
In order to get some key usage, add some filters so the query does not have to look at every row in all the tables.
Updating Answer:
Your second query does not make sense. You are left joining to lca, yet you have a filter on it; this negates the LEFT JOIN by itself. Also, you're looking for data in the last step of the query, meaning you will have to look through all of lt, lc and lca in order to find your data. And you have no index with left-most column 'type' on locatings, so you still need a full table scan to find your data.
If you had some sample data and example of what you are trying to achieve it would perhaps be easier to help.
Try experimenting with forcing the index (http://dev.mysql.com/doc/refman/5.1/en/index-hints.html) - maybe it's just an optimizer issue.
It looks like you're wanting the parents of the single result.
According to the person credited with defining Nested Sets in SQL, Joe Celko at http://www.ibase.ru/devinfo/DBMSTrees/sqltrees.html "This model is a natural way to show a parts explosion, because a final assembly is made of physically nested assemblies that break down into separate parts."
In other words, Nested Sets are used to filter children efficiently to an arbitrary number of independent levels within a single collection. You have two tables, but I don't see why the properties of the set "locatings" couldn't be de-normalized into "localities".
If the localities table had a geometry column, could I not find the one locality from a "locating" and then select on the one table using a single filter: parent.lft <= row.lft AND parent.rgt >= row.rgt?
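As a sketch of that single-filter lookup (using a hypothetical child id and the lft/rgt columns from the localities table above):
SELECT parent.*
FROM localities child
JOIN localities parent
  ON parent.lft <= child.lft AND parent.rgt >= child.rgt
WHERE child.id = 42; -- hypothetical id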
UPDATED
In this answer https://stackoverflow.com/a/1743952/3018894, there is an example from http://explainextended.com/2009/09/29/adjacency-list-vs-nested-sets-mysql/ where the following example gets all the ancestors to an arbitrary depth of 100000:
SELECT hp.id, hp.parent, hp.lft, hp.rgt, hp.data
FROM (
    SELECT @r AS _id,
           @level := @level + 1 AS level,
           (
               SELECT @r := NULLIF(parent, 0)
               FROM t_hierarchy hn
               WHERE id = _id
           )
    FROM (
        SELECT @r := 1000000,
               @level := 0
    ) vars,
    t_hierarchy hc
    WHERE @r IS NOT NULL
) hc
JOIN t_hierarchy hp
    ON hp.id = hc._id
ORDER BY level DESC