I've read multiple questions in here but none could help me so far. For the same query and table structure on my previous [unanswered] question Optimizing a SELECT … UNION … query with ORDER and LIMIT on a table with 5M+ rows besides having all the indexes defined, the query is still logged as "not using index".
SELECT `id`, `title`, `title_fa`
FROM
( SELECT `p`.`id` AS `id`, `p`.`title` AS `title`, `p`.`title_fa` AS `title_fa`,
`p`.`unique` AS `unique`, `p`.`date` AS `date`
FROM `articles` `p`
LEFT JOIN `authors` `a` ON `p`.`unique` = `a`.`unique`
WHERE 1
AND MATCH (`p`.`title`) AGAINST ('"heat"' IN BOOLEAN MODE)
UNION
SELECT `p`.`id` AS `id`, `p`.`title` AS `title`, `p`.`title_fa` AS `title_fa`,
`p`.`unique` AS `unique`, `p`.`date` AS `date`
FROM `articles` `p`
LEFT JOIN `authors` `a` ON `p`.`unique` = `a`.`unique`
WHERE 1
AND MATCH (`p`.`title_fa`) AGAINST ('"گرما"' IN BOOLEAN MODE)
) AS `subQuery`
GROUP BY `unique`
ORDER BY `date` DESC
LIMIT 0,10;
I don't know how should I use an index in the outer SELECT where it's grouping the two SELECTs using UNION.
Thanks
Update
This is the structure of the article table:
CREATE TABLE `articles` (
`id` int(10) unsigned NOT NULL,
`title` text COLLATE utf8_persian_ci NOT NULL,
`title_fa` text COLLATE utf8_persian_ci NOT NULL,
`description` text COLLATE utf8_persian_ci NOT NULL,
`description_fa` text COLLATE utf8_persian_ci NOT NULL,
`date` date NOT NULL,
`unique` tinytext COLLATE utf8_persian_ci NOT NULL,
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_persian_ci;
ALTER TABLE `articles`
ADD PRIMARY KEY (`id`),
ADD KEY `unique` (`unique`(128)),
ADD FULLTEXT KEY `TtlDesc` (`title`,`description`);
ADD FULLTEXT KEY `Title` (`title`);
ADD FULLTEXT KEY `faTtlDesc` (`title_fa`,`description_fa`);
ADD FULLTEXT KEY `faTitle` (`title_fa`);
MODIFY `id` int(10) unsigned NOT NULL AUTO_INCREMENT;
UPDATE 2:
Here is the output of EXPLAIN SELECT (I didn't know how to get it from phpMyAdmin any better! Sorry if it doesn't look good):
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 4 Using temporary; Using filesort
2 DERIVED p fulltext title title 0 NULL 1 Using where
3 UNION p fulltext title_fa title_fa 0 NULL 1 Using where
NULL UNION RESULT <union2,3> ALL NULL NULL NULL NULL NULL Using temporary
) ASsubQuery
It is a subquery, a derived table, and it is manifested coming out of a temporary table. It has no chance of index use.
As I wrote in this answer:
The document Derived Tables in MySQL 5.7 describes it well for
versions 5.6 and 5.7, where the latter will provide no penalty due to
the change in materialized derived table output being incorporated
into the outer query. In prior versions, substantial overhead was
endured with temporary tables with the derived.
When there is a MATCH clause, only a FULLTEXT index will be used.
Meanwhile, tips on syntax and pagination:
The usual pattern:
( SELECT ...
GROUP BY ... ORDER BY ... -- apply to result of inner SELECT
)
UNION ALL
( SELECT ...
GROUP BY ... ORDER BY ... -- apply to result of inner SELECT
)
GROUP BY ... ORDER BY ... -- apply to result of UNION
(If you need pagination, see my blog .)
Addenda
In the EXPLAIN... The 1st and 4th lines say ALL and NULL -- this indicates that no index was used in any way. In those cases, we are talking about 4 rows, and all 4 rows are needed. So, do not worry that no INDEX was used.
In the 2nd and 3rd lines, a FULLTEXT index was used.
The phrase Using index (which does not show in your EXPLAIN) does not mean "using some index", it means "using only the index". To elaborate... The data for a table is in one place, the index is in another. When all the necessary columns are in the index, the query does not need to reach over into the data. This is labeled as Using index, and it is termed a "covering index". This particular situation is not relevant for your query.
A similar phrase, Using index condition, means something else. It says that the WHERE clause can be handled by the storage engine, and does not need to involve the handler. Let's simply say that it is an optimization making things run a little faster.
Bottom line: You query is well written, and your indexes are fine for this query.
Maybe no UNION?
Try getting rid of the UNION and simply search for both strings at the same time:
FULLTEXT(title, title_fa)
MATCH (title, title_fa) AGAINST ('"heat" "گرما"' IN BOOLEAN MODE)
If that does not work, then explain what goes wrong.
Related
I have a query that is taking an embarrassingly long time. ~7 minutes embarrassing. I would really appreciate some help. Missing indexes? Rewrite the query? All of the above?
Many thanks
mysql Ver 14.14 Distrib 5.7.25, for Linux (x86_64)
The query looks like:
SELECT COUNT(*) AS count_all, name
FROM api_events ae
INNER JOIN products p on p.token=ae.product_token
WHERE (ae.created_at > '2019-01-21 12:16:53.853732')
GROUP BY name
Here are the two table definitions
api_events has ~31 million records
CREATE TABLE `api_events` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`api_name` varchar(200) NOT NULL,
`hostname` varchar(200) NOT NULL,
`controller_action` varchar(2000) NOT NULL,
`duration` decimal(12,5) NOT NULL DEFAULT '0.00000',
`view` decimal(12,5) NOT NULL DEFAULT '0.00000',
`db` decimal(12,5) NOT NULL DEFAULT '0.00000',
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
`product_token` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `product_token` (`product_token`)
) ENGINE=InnoDB AUTO_INCREMENT=64851218 DEFAULT CHARSET=latin1;
and
products has only 12 records
CREATE TABLE `products` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`code` varchar(30) NOT NULL,
`name` varchar(100) NOT NULL,
`description` varchar(2000) NOT NULL,
`token` varchar(50) NOT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=19 DEFAULT CHARSET=latin1;
You could improve the join performance adding index
create index idx1 on api_events(product_token, created_at);
create index idx2 on products(token);
You could also trying inverting the columns ofr api_events
create index idx1 on api_events(created_at, product_token);
and trying add redundancy to product index
create index idx2 on products(token, name);
For the query as stated, you needed
api_events: INDEX(created_at, product_token)
products: INDEX(token, name)
Because the WHERE mentions api_events, the Optimizer is likely to start with that table. created_at is in the WHERE, so the index starts with that, even though starting with a 'range' is usually wrong. In this case, the pair is "covering".
Then, INDEX(token, name) is also "covering".
"Covering" indexes give a small, but widely varying, amount of performance improvement.
What happens if you group by the token instead of the name?
SELECT ae.product_token, COUNT(*) AS count_all
FROM api_events ae
WHERE ae.created_at > '2019-01-21 12:16:53.853732')
GROUP BY ae.product_token;
For this query, an index on api_events(created_at, product_token) will probably help.
If this is faster, then you can bring in the name using a subquery.
It seems like the criteria on created_at is very selective (looking at only the past 7 days?). That's crying out to explore an index with created_at as a leading column.
The query is also referencing the product_token column from the same table, so we can include that column in the index, to make it a covering index.
api_events_IX ON api_events ( created_at, product_token )
Using that index, we can probably avoid looking at the vast majority of the 31 million rows, and quickly narrow in on the subset of rows we actually need to look at.
Using the index, the query will still need a "Using filesort" operation to satisfy the GROUP BY.
(My guess here is that the join to the 12 rows in product doesn't exclude a lot of rows... that on the vast majority of rows in api_event the product_token refers to a row that exists in product.
Use MySQL EXPLAIN to see the query execution plan.
A further possible refinement (to test the performance of) would be to do some of the aggregation in an inline view:
SELECT SUM(s.count_all) AS count_all
, p.name
FROM ( SELECT COUNT(*) AS count_all
, ae.product_token
FROM api_events ae
WHERE ae.created_at > '2019-01-21 12:16:53.853732'
GROUP
BY ae.product_token
) s
JOIN products p
ON p.token = s.product_token
GROUP
BY p.name
If the assumption about product_token is misinformed, if there are lots of rows in api_event that have product_token values that don't reference a row in product ... we might take a different tack ...
What is the proper indexing for this query.
I tried given different combinations of indexes for this query but it is still using from using tempory , using filesort etc.
Total table data - 7,60,346
product= 'Dresses' - Total rows = 122 554
CREATE TABLE IF NOT EXISTS `product_data` (
`table_id` int(11) NOT NULL AUTO_INCREMENT,
`id` int(11) NOT NULL,
`price` int(11) NOT NULL,
`store` varchar(255) NOT NULL,
`brand` varchar(255) DEFAULT NULL,
`product` varchar(255) NOT NULL,
`model` varchar(255) NOT NULL,
`size` varchar(50) NOT NULL,
`discount` varchar(255) NOT NULL,
`gender_id` int(11) NOT NULL,
`availability` int(11) NOT NULL,
PRIMARY KEY (`table_id`),
UNIQUE KEY `table_id` (`table_id`),
KEY `id` (`id`),
KEY `discount` (`discount`),
KEY `step_one` (`product`,`availability`),
KEY `step_two` (`product`,`availability`,`brand`,`store`),
KEY `step_three` (`product`,`availability`,`brand`,`store`,`id`),
KEY `step_four` (`brand`,`store`),
KEY `step_five` (`brand`,`store`,`id`)
) ENGINE=InnoDB ;
Query :
SELECT id ,store,brand FROM `product_data` WHERE product='dresses' and
availability='1' group by brand,store order by store limit 10;
excu..time :- (10 total, Query took 1.0941 sec)
EXPLAIN PLAN :
possible_keys :- step_one, step_two, step_three, step_four, step_five
key :- step_two
ref :- const,const
rows :- 229438
Extra :-Using where; Using temporary; Using filesort
I tried these indexes
Key step_one (product,availability)
Key step_two (product,availability,brand,store)
Key step_three (product,availability,brand,store,id)
Key step_four (brand,store)
Key step_five (brand,store,id)
The real problem is not the index, but the mismatch between GROUP BY and ORDER BY preventing taking advantage of LIMIT.
This
INDEX(product, availability, store, brand, id)
will be "covering" and in the right order. But note that I have swapped store and brand...
Change the query to
SELECT id ,store,brand
FROM `product_data`
WHERE product='dresses'
and availability='1'
GROUP BY store, brand -- change
ORDER BY store, brand -- change
limit 10;
That changes the GROUP BY to start with store, to reflect the ORDER BY ordering -- this avoid an extra sort. And it changes the ORDER BY to be identical to the GROUP BY so that the two can be combined.
Given those changes, the INDEX can now go all the way through to the LIMIT, thereby allowing the processing to look at only 10 rows, not a much larger set.
Anything less than all these changes will not be as efficient.
Further discussion:
INDEX(product, availability, -- these two can be in either order
store, brand, -- must match both `GROUP BY` and `ORDER BY`
id) -- tacked on (on the end) to make it "covering"
"Covering" means that all the columns for the SELECT are found in the INDEX, so no need to reach over into the data.
But... The whole query does not make sense because of the inclusion of id in the SELECT. If you want to find what stores have available dresses, then get rid of id. If you want to list all the available dresses, then change id to GROUP_CONCAT(id).
For the indexes, the best index is the step_two. The product field is used in where and has more variation than the availability-field.
Couple of notes about the query:
availability='1' should be availability=1 so that needless int->varchar conversion would be avoided.
"group by brand" should not be used as GROUP BY should only be used when you use aggregate functions as selected columns. What as it that you were trying to achieve with the group by?
Your group by clause doesn't really make sense without an aggregate function.
If you can re-write the query to
SELECT id ,store
FROM `product_data`
WHERE product='dresses'
and availability='1'
order by store limit 10;
Then an index on (product,availability,store) will remove all filesorts.
See SQLFiddle: http://sqlfiddle.com/#!9/60f33d/2
UPDATE:
The SQLFiddle makes your intention clear - you're using GROUP BY to simulate DISTINCT. I don't think you can get rid of the filesort and temporary table steps in your query if this is the case - but I also don't think those steps should be hugely expensive.
I would like to understand at what point in time will MySQL use the indexed column when using ORDER BY.
For example, the query
SELECT * FROM A
INNER JOIN B ON B.id = A.id
WHERE A.status = 1 AND A.name = 'Mike' AND A.created_on BETWEEN '2014-10-01 00:00:00' AND NOW()
ORDER BY A.accessed_on DESC
Based on my knowledge a good index for the above query is an index on table A (id, status, name created_on, accessed_on) and another on B.id.
I also understand that SQL execution follow the order below. but I am not sure how the order selection and order works.
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
Question
Is will it be better to start the index with the id column or in this case is does not matter since WHERE is executed first before the JOIN? or should it be
Second question the column accessed_on should it be at the beginning of the index combination, end or the middle? or should the id column come after all the columns in the WHERE clause?
I appreciate a detailed answer so I can understand the execution level of MySQL/SQL
UPDATED
I added few million records to both tables A and B then I have added multiple indexes to see which would be the best index. But, MySQL seems to like the index id_2 (ie. (status, name, created_on, id, accessed_on))
It seems to be applying the where and it will figure out that it would need and index on status, name, created_on then it apples the INNER JOIN and it will use the id index followed by the first 3. Finally, it will look for accessed_on as the last column. so the index (status, name, created_on, id, accessed_on) fits the same execution order
Here is the tables structures
CREATE TABLE `a` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`status` int(2) NOT NULL,
`name` varchar(255) NOT NULL,
`created_on` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`accessed_on` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `status` (`status`,`name`),
KEY `status_2` (`status`,`name`,`created_on`),
KEY `status_3` (`status`,`name`,`created_on`,`accessed_on`),
KEY `status_4` (`status`,`name`,`accessed_on`),
KEY `id` (`id`,`status`,`name`,`created_on`,`accessed_on`),
KEY `id_2` (`status`,`name`,`created_on`,`id`,`accessed_on`)
) ENGINE=InnoDB AUTO_INCREMENT=3135750 DEFAULT CHARSET=utf8
CREATE TABLE `b` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3012644 DEFAULT CHARSET=utf8
The best indexes for this query is: A(status, name, created_on) and B(id). These indexes will satisfy the where clause and use the index for the join to B.
This index will not be used for sorting. There are two major impediments to using any index for sorting. The first is the join. The second is the non-equality on created_on. Some databases might figure out to use an index on A(status, name, accessed_on), but I don't think MySQL is smart enough for that.
You don't want id as the first column in the index. This precludes using the index to filter on A, because id is used for the join rather than in the where.
Query :
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
pm_replies as r
LEFT JOIN users as u
ON u.uid = r.uid
WHERE
r.msg_id = '784351921943772258'
ORDER BY r.date DESC
i tried all index combinations i could think of, searched in google how best i could index this but nothing worked.
this query takes 0,33 on 500 returned items and counting...
EXPLAIN:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE r ALL index1 NULL NULL NULL 540 Using where; Using filesort
1 SIMPLE u eq_ref uid uid 8 site.r.uid 1
SHOW CREATE pm_replies
CREATE TABLE `pm_replies` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`reply_id` bigint(20) NOT NULL,
`msg_id` bigint(20) NOT NULL,
`uid` bigint(20) NOT NULL,
`body` text COLLATE utf8_unicode_ci NOT NULL,
`date` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `index1` (`msg_id`,`date`,`uid`)
) ENGINE=MyISAM AUTO_INCREMENT=541 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
SHOW CREATE users
CREATE TABLE `users` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`uid` bigint(20) NOT NULL,
`username` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`email` text CHARACTER SET latin1 NOT NULL,
`password` text CHARACTER SET latin1 NOT NULL,
`profile_picture` text COLLATE utf8_unicode_ci NOT NULL,
`date_registered` datetime NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uid` (`uid`),
UNIQUE KEY `username` (`username`)
) ENGINE=MyISAM AUTO_INCREMENT=2004 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
For the query as it is, the best indexes would seem to be...
pm_replies: (msg_id, date, uid)
users: (uid)
The important one is pm_replies. You use it to both filter your data (the filter column is first) then order your data (the order column is second).
The would be different if you removed the filter. Then you'd just want (date, uid) as your index.
The last field in the index just makes it a fraction friendlier to the join, the important part is actually the index on users.
There is a lot more that coudl be said on this, a whole chapter in a book at the very least, and several books if your wanted to. But I hope this helps.
EDIT
Not that my suggested index for pm_replies is one index covering three fields, and not just three indexes. This ensures that all the entries in the index are pre-sorted by those columns. It's like sorting data in Excel by three columns.
Having three separate indexes is like having the Excel data on three tabs. Each sorted by a different fields.
Only whith one index over three fields do you get this behaviour...
- You can select one 'bunch' of records with the same msg_id
- That whole 'bunch' are next to each other, no gaps, etc
- That whole 'bunch' are sorted in date order for that msg_id
- For any rows with the same date, they're ordered by user_id
(Again the user_id part is really very minor.)
Please try this:
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
pm_replies as r
LEFT JOIN users as u
ON (u.uid = r.uid AND r.msg_id = '784351921943772258')
ORDER BY r.date DESC
in my case it help.
Add date to your index1 key so that msg_id and date are both in the index.
What Dems is saying should be correct, but there is one additional detail if you are using InnoDB: perhaps you are paying the price of secondary indexes on clustered tables - essentially, accessing a row through the secondary index requires additional lookup trough the primary, i.e. clustering index. This "double lookup" might make the index less attractive to the query optimizer.
To alleviate this, try covering the all the fields in your select statement with the index:
pm_replies: (msg_id, date, uid, reply_id, body, date)
users: (uid, username, profile_picture)
It appears the optimizer is trying to force the index by ID to make the join to the user table. Since you are doing a left-join (which doesn't make sense since I would expect every entry to have a user ID, thus a normal INNER JOIN), I'll keep it left join.
So, I would try the following. Query just the replies based on the MESSAGE ID and order by the date descending on its own merits, THEN left join, such as
SELECT
r.reply_id,
r.msg_id,
r.uid,
r.body,
r.date,
u.username as username,
u.profile_picture as profile_picture
FROM
( select R2.*
from pm_replies R2
where r2.msg_id = '784351921943772258' ) r
LEFT JOIN users as u
ON u.uid = r.uid
ORDER BY
r.date DESC
In addition, since I don't have MySQL readily available, and can't remember if order by is allowed in a sub-query, if so, you can optimize the inner prequery (using alias "R2") and put the order by there, so it uses the (msgid, date) index and returns just that set... THEN joins to user table on the ID which no index is required at that point from the SOURCE result set, just the index on the user table to find the match.
I have web application that use a similar table scheme like below. simply I want to optimize the selection of articles. articles are selected based on the tag given. for example, if the tag is 'iphone' , the query should output all open articles about 'iphone' from the last month.
CREATE TABLE `article` (
`id` int(11) NOT NULL auto_increment,
`title` varchar(100) NOT NULL,
`body` varchar(200) NOT NULL,
`date` timestamp NOT NULL default CURRENT_TIMESTAMP,
`author_id` int(11) NOT NULL,
`section` varchar(30) NOT NULL,
`status` int(1) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
CREATE TABLE `tags` (
`name` varchar(30) NOT NULL,
`article_id` int(11) NOT NULL,
PRIMARY KEY (`name`,`article_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE `users` (
`id` int(11) NOT NULL auto_increment,
`username` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=3 ;
The following is my MySQL query
explain select article.id,users.username,article.title
from article,users,tags
where article.id=tags.article_id and tags.name = 'iphone4'
and article.author_id=users.id and article.status = '1'
and article.section = 'mobile'
and article.date > '2010-02-07 13:25:46'
ORDER BY tags.article_id DESC
the output is
id select_type table type possible_keys key key_len ref rows Extra <br>
1 SIMPLE tags ref PRIMARY PRIMARY 92 const 55 Using where; Using index <br>
1 SIMPLE article eq_ref PRIMARY PRIMARY 4 test.tags.article_id 1 Using where <br>
1 SIMPLE users eq_ref PRIMARY PRIMARY 4 test.article.author_id 1 <br>
is it possible to optimize it more?
This query may be optimized, depending on which condition is more selective: tags.name = 'iphone4' or article.date > '2010-02-07 13:25:46'
If there are less articles tagged iphone than those posted after Feb 7, then your original query is nice.
If there are many articles tagged iphone, but few those posted after Feb 7, then this query will be more efficient:
SELECT article.id, users.username, article.title
FROM tags
JOIN article
ON article.id = tags.article_id
AND article.status = '1'
AND article.section = 'mobile'
AND article.date > '2010-02-07 13:25:46'
JOIN users
ON users.id = article.author_id
WHERE tags.name = 'iphone4'
ORDER BY
tags.article_date DESC, tags.article_id DESC
Note that the ORDER BY condition has changed. This may or may not be what you want, however, generally the orders of id and date correspond to each other.
If you really need your original ORDER BY condition you may leave it but it will add a filesort (or just revert to your original plan).
In either case, create an index on
article (status, section, date, id)
the query should output all open articles about 'iphone' from the last month.
So the only query you are going to run on this data uses the tag and the date. You've got a index for the tag in the tags table, but the date is stored in a different table (article - you're a bit inconsistent with your naming schema). Adding an index on the article table using date would be no benefit at all. Using id,date (in that order) would help a little - but really the date needs to be denormalised into the tags table to get the query running really fast.
Unless you're regularly moving around bulk data sets - just add a datetime column with a default of the current timestamp to the tags table.
I expect that you may be wanting to interact with the data in lots of other ways - really you should set a low (no?) threshold for slow query logging then analyse the resulting data to identify where you're performance problems are (try looking at the queries with the highest values for duration^2*frequency first).
There's a script at the URL below which is useful for this analysis:
http://www.retards.org/projects/mysql/
You could index the additional fields in article that you are referencing in your select statement. In this case, I would suggest you create an index in article like this:
CREATE INDEX article_idx ON article (author_id, status, section, date);
Creating that index should speed up your query depending on how many overall records you are dealing with. From my understanding, properly creating indexes involves looking at the queries you've written and indexing the columns that are a part of your where clause. This helps the query optimizer better process the query in general. That does not mean create an index on each individual column, however, as its both inefficient to do so and ineffective. When possible, create multiple column indexes that represent your select statement.