How to speed up left join queries by indexing? - mysql

At the moment I am experiencing some slower MySQL queries in my application which I want to speed up. Unfortunately I’m not quite sure which is the correct way to do it.
I have the following (fictitious) tables: Book, Page and Word.
Word is child of Page by word_page_id
Page is child of Book by page_book_id
I already have individual indexes on page_book_id, word_page_id, book_user_id and book_flag_delete.
SELECT `book`.*, COUNT(word_id) AS `word_amount` FROM `book`
LEFT JOIN `page` ON page_book_id = book_id
LEFT JOIN `word` ON word_page_id = paragraph_id
WHERE (book_user_id = 1) AND (book_flag_delete IS NULL)
GROUP BY `book_id`
ORDER BY `book_id` ASC LIMIT 100
SELECT COUNT(DISTINCT `book_id`) AS `book_row_count` FROM `book`
LEFT JOIN `page` ON page_book_id = book_id
LEFT JOIN `word` ON word_page_id = page_id
WHERE (book_user_id = 59) AND (book_flag_delete IS NULL)
Any ideas how to speed up such queries?
Is there extra indexing involved?

Set indexes on the fields you use for joining.
Also make sure that the joined columns have the same data type, character set, and collation; otherwise the index may not be used.
EXPLAIN <query> shows the index actually chosen (the key column of the output) and the candidate indexes (the possible_keys column).

For this query:
SELECT b.*, COUNT(w.word_id) AS `word_amount`
FROM `book` b LEFT JOIN
`page` p
ON p.page_book_id = b.book_id LEFT JOIN
`word` w
ON w.word_page_id = p.paragraph_id
WHERE (b.book_user_id = 1) AND (b.book_flag_delete IS NULL)
GROUP BY b.`book_id`
ORDER BY b.`book_id` ASC
LIMIT 100;
The best indexes are: book(book_user_id, book_flag_delete, book_id), page(page_book_id, paragraph_id), and word(word_page_id, word_id).
However, the overall group by might be expensive. You might try writing the query as:
SELECT b.*,
(SELECT COUNT(w.word_id)
FROM `page` p JOIN
`word` w
ON w.word_page_id = p.paragraph_id
WHERE p.page_book_id = b.book_id
) AS `word_amount`
FROM `book` b
WHERE (b.book_user_id = 1) AND (b.book_flag_delete IS NULL)
ORDER BY b.`book_id` ASC
LIMIT 100;
The same indexes work here. However, this query should avoid a GROUP BY over all the data at once; instead, it uses the indexes for the aggregation.
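To check that the two formulations return the same counts, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The tiny data set, the index names, and the choice of paragraph_id as the page key are assumptions made for illustration; only the query shapes come from the answer above.

```python
import sqlite3

# Hypothetical miniature of the book/page/word schema from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book (book_id INTEGER PRIMARY KEY, book_user_id INTEGER, book_flag_delete INTEGER);
CREATE TABLE page (paragraph_id INTEGER PRIMARY KEY, page_book_id INTEGER);
CREATE TABLE word (word_id INTEGER PRIMARY KEY, word_page_id INTEGER);
-- Composite indexes in the spirit of the answer (index names are made up).
CREATE INDEX book_user_flag ON book (book_user_id, book_flag_delete, book_id);
CREATE INDEX page_book ON page (page_book_id, paragraph_id);
CREATE INDEX word_page ON word (word_page_id, word_id);
INSERT INTO book VALUES (1, 1, NULL), (2, 1, NULL), (3, 1, 1), (4, 2, NULL);
INSERT INTO page VALUES (10, 1), (11, 1), (12, 2), (13, 4);
INSERT INTO word VALUES (100, 10), (101, 10), (102, 11), (103, 13);
""")

# Form 1: join everything, then aggregate with GROUP BY.
group_by_sql = """
SELECT b.book_id, COUNT(w.word_id) AS word_amount
FROM book b
LEFT JOIN page p ON p.page_book_id = b.book_id
LEFT JOIN word w ON w.word_page_id = p.paragraph_id
WHERE b.book_user_id = 1 AND b.book_flag_delete IS NULL
GROUP BY b.book_id
ORDER BY b.book_id
LIMIT 100
"""

# Form 2: one correlated subquery per selected book, no outer GROUP BY.
subquery_sql = """
SELECT b.book_id,
       (SELECT COUNT(w.word_id)
        FROM page p
        JOIN word w ON w.word_page_id = p.paragraph_id
        WHERE p.page_book_id = b.book_id) AS word_amount
FROM book b
WHERE b.book_user_id = 1 AND b.book_flag_delete IS NULL
ORDER BY b.book_id
LIMIT 100
"""

grouped = conn.execute(group_by_sql).fetchall()
correlated = conn.execute(subquery_sql).fetchall()
print(grouped, correlated)
```

Both forms count three words for book 1 and zero for book 2. Which one is faster on real data depends on the optimizer and the data distribution, so compare them with EXPLAIN on your own tables.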

The optimal schema for a many-to-many mapping table is
CREATE TABLE XtoY (
# No surrogate id for this table
x_id MEDIUMINT UNSIGNED NOT NULL, -- For JOINing to one table
y_id MEDIUMINT UNSIGNED NOT NULL, -- For JOINing to the other table
# Include other fields specific to the 'relation'
PRIMARY KEY(x_id, y_id), -- When starting with X
INDEX (y_id, x_id) -- When starting with Y
) ENGINE=InnoDB;
The details on 'why' are in my index cookbook

In your SELECT, avoid the wildcard "*" and list only the columns you need. Also always use table aliases, so it is clear which table each column comes from.
select book1.column1, book1.column2, page1.column1
from book book1
left join page page1
on page1.page_book_id = book1.book_id
..... blah

Related

JOIN query taking long time and creating issue "converting HEAP to MyISAM"

My query is below. I use a join to fetch the data. Can you please suggest how I can solve the "converting HEAP to MyISAM" issue?
Can I use a subquery here instead? Please suggest how.
I join the users table to check whether the user exists. Can I rewrite the query without that join so that the "converting HEAP to MyISAM" issue goes away?
One more thing: sometimes I will not filter on a specific user_id, as I do here with user_id = 16082.
SELECT `user_point_logs`.`id`,
`user_point_logs`.`user_id`,
`user_point_logs`.`point_get_id`,
`user_point_logs`.`point`,
`user_point_logs`.`expire_date`,
`user_point_logs`.`point` AS `sum_point`,
IF(sum(`user_point_used_logs`.`point`) IS NULL, 0, sum(`user_point_used_logs`.`point`)) AS `minus`
FROM `user_point_logs`
JOIN `users` ON ( `users`.`id` = `user_point_logs`.`user_id` )
LEFT JOIN (SELECT *
FROM user_point_used_logs
WHERE user_point_log_id NOT IN (
SELECT DISTINCT return_id
FROM user_point_logs
WHERE return_id IS NOT NULL
AND user_id = 16082
)
)
AS user_point_used_logs
ON ( `user_point_logs`.`id` = `user_point_used_logs`.`user_point_log_used_id` )
WHERE expire_date >= 1563980400
AND `user_point_logs`.`point` >= 0
AND `users`.`id` IS NOT NULL
AND ( `user_point_logs`.`return_id` = 0
OR `user_point_logs`.`return_id` IS NULL )
AND `user_point_logs`.`user_id` = '16082'
GROUP BY `user_point_logs`.`id`
ORDER BY `user_point_logs`.`expire_date` ASC
DB FIDDLE HERE WITH STRUCTURE
Kindly try this. If it works, we can optimize further by adding a composite index.
SELECT
upl.id,
upl.user_id,
upl.point_get_id,
upl.point,
upl.expire_date,
upl.point AS sum_point,
COALESCE(SUM(upul.point), 0) AS minus -- sum of the used points, 0 when there are none
FROM user_point_logs upl
JOIN users u ON upl.user_id = u.id
LEFT JOIN (SELECT supul.user_point_log_used_id, supul.point FROM user_point_used_logs supul
LEFT JOIN user_point_logs supl ON supul.user_point_log_id = supl.return_id AND supl.user_id = 16082
WHERE supl.return_id IS NULL) AS upul -- anti-join: drop used-log rows that match a return
ON upl.id=upul.user_point_log_used_id
WHERE
upl.user_id = 16082 and coalesce(upl.return_id,0)= 0
and upl.expire_date >= 1563980400 -- tip: if its unix timestamp change the datatype and if possible use range between dates
#AND upl.point >= 0 -- since its NN by default removing this condition
#AND u.id IS NOT NULL -- removed since the inner join matches not null
GROUP BY upl.id
ORDER BY upl.expire_date ASC;
Edit:
Try adding an index on the return_id column of user_point_logs, since this column is used in the join inside the derived table.
Alternatively, use a composite index on user_id and return_id.
Indexes:
user_point_logs: (user_id, expire_date)
user_point_logs: (user_id, return_id)
OR is hard to optimize. Decide on only one way to say whatever is being said here, then get rid of the OR:
AND ( `user_point_logs`.`return_id` = 0
OR `user_point_logs`.`return_id` IS NULL )
DISTINCT is redundant:
NOT IN ( SELECT DISTINCT ... )
Change
IF(sum(`user_point_used_logs`.`point`) IS NULL, 0,
sum(`user_point_used_logs`.`point`)) AS `minus`
to
COALESCE( ( SELECT SUM(point) FROM user_point_used_logs ... ), 0) AS minus
and toss LEFT JOIN (SELECT * FROM user_point_used_logs ... )
Since a PRIMARY KEY is a key, the second of these is redundant and can be DROPped:
ADD PRIMARY KEY (`id`),
ADD KEY `id` (`id`) USING BTREE;
After all that, we may need another pass to further simplify and optimize it.

Optimize table to avoid using temporary and using filesort

I have a messages table
CREATE TABLE `messages` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`author` int(11) DEFAULT NULL,
`time` int(10) unsigned DEFAULT NULL,
`text` text CHARACTER SET latin1,
`dest` int(11) unsigned DEFAULT NULL,
`type` tinyint(4) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `author` (`author`),
KEY `dest` (`dest`)
) ENGINE=InnoDB AUTO_INCREMENT=2758 DEFAULT CHARSET=utf8;
I need to get messages between two users
SELECT
...
FROM
`messages` m
LEFT JOIN `people` p ON m.author = p.id
WHERE
(author = 1 AND dest = 2)
OR (author = 2 AND dest = 1)
ORDER BY
m.id DESC
LIMIT 0, 25
When I EXPLAIN this query I get
Please excuse any ignorance, but is there a way to optimize this table so that the query avoids the temporary table and filesort? For now it is not causing a problem, but I'm pretty sure it will become troublesome in the future.
First, I'm guessing the left join is not necessary. Second, consider using union all instead. Then one approach is:
(SELECT ...
FROM messages m JOIN
people p
ON m.author = p.id
WHERE author = 1 AND dest = 2
ORDER BY id DESC
LIMIT 25
)
UNION ALL
(SELECT ...
FROM messages m JOIN
people p
ON m.author = p.id
WHERE author = 2 AND dest = 1
ORDER BY id DESC
LIMIT 25
)
ORDER BY id DESC
LIMIT 0, 25
With this query, an index on messages(author, dest, id) should make it fast. (Note: you might need to include m.id in the SELECT list.)
To build on Gordon's answer:
SELECT m2..., p...
FROM
(
( SELECT id
FROM messages
WHERE author = 1
AND dest = 2
ORDER BY id DESC
LIMIT 75
)
UNION ALL
(
SELECT id
FROM messages
WHERE author = 2
AND dest = 1
ORDER BY id DESC
LIMIT 75
)
ORDER BY id DESC
LIMIT 50, 25 ) AS m1
JOIN messages AS m2 ON m2.id = m1.id
JOIN people p ON p.id = m2.author
ORDER BY m1.id DESC
Notes:
Gordon's index is now "covering". (This adds efficiency, thereby masking some of the other stuff I added.)
Lazy evaluation means that it does not need to shovel all the bulky fields of more than 25 rows around. Instead, only 25 need to be handled. Also, I avoid touching people to start with.
The code shows what "page 3" should look like. Note LIMIT 75 versus LIMIT 50,25.
"Pagination via OFFSET" has several problems. See my blog.
This formulation still will not avoid "filesort" and "using temp". But speed is the real goal, correct? ("Filesort" is a misnomer -- if you don't include that TEXT column, the sort will be done in RAM.)
When you add INDEX(author, dest, id), INDEX(author) becomes redundant; drop it.
UNION ALL is not the default (plain UNION means UNION DISTINCT), but it avoids an extra pass (and temp table) to de-duplicate the data.
There will still be 2 or 3 temp tables involved. See EXPLAIN FORMAT=JSON SELECT ... for details.
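The per-branch LIMIT idea from both answers can be tried out in a small sketch. It uses Python's sqlite3 module instead of MySQL, so the syntax differs slightly: SQLite does not accept parenthesized UNION branches with their own ORDER BY/LIMIT, so each branch is wrapped in a subquery. The 200 generated messages and the helper names page_with_or and page_with_union are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE messages (id INTEGER PRIMARY KEY, author INTEGER, dest INTEGER, text TEXT);
CREATE INDEX author_dest_id ON messages (author, dest, id);
""")
# Made-up traffic: 1->2, 2->1 and an unrelated 1->3 conversation.
pairs = [(1, 2), (2, 1), (1, 3)]
conn.executemany(
    "INSERT INTO messages (id, author, dest, text) VALUES (?, ?, ?, ?)",
    [(i, *pairs[i % 3], f"message {i}") for i in range(1, 201)],
)

# Original shape: one OR condition, one sort over all matching rows.
OR_SQL = """
SELECT id FROM messages
WHERE (author = 1 AND dest = 2) OR (author = 2 AND dest = 1)
ORDER BY id DESC
LIMIT ? OFFSET ?
"""

# Rewritten shape: each branch reads at most `per_branch` newest ids,
# then the small combined set is sorted and paged.
UNION_SQL = """
SELECT id FROM (SELECT id FROM messages WHERE author = 1 AND dest = 2
                ORDER BY id DESC LIMIT ?)
UNION ALL
SELECT id FROM (SELECT id FROM messages WHERE author = 2 AND dest = 1
                ORDER BY id DESC LIMIT ?)
ORDER BY id DESC
LIMIT ? OFFSET ?
"""

def page_with_or(page, size=25):
    offset = (page - 1) * size
    return [r[0] for r in conn.execute(OR_SQL, (size, offset))]

def page_with_union(page, size=25):
    offset = (page - 1) * size
    per_branch = offset + size  # page 3 needs 75 rows per branch, as in the answer
    params = (per_branch, per_branch, size, offset)
    return [r[0] for r in conn.execute(UNION_SQL, params)]

print(page_with_union(1)[:5], page_with_union(3)[:5])
```

Each branch has to fetch offset + size rows because, in the worst case, every row of the requested page comes from a single branch.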

MySQL Query is Extremely Slow

Hello, I am looking for ways to optimize this MySQL query. Basically I am fetching the articles for the user that belong to category_id = 25 and whose source_id is not in a table where I store the source ids from which the user has unsubscribed.
select
a.article_id,
a.article_title,
a.source_id,
a.article_publish_date,
a.article_details,
n.source_name
from sources n
INNER JOIN articles a
ON (a.source_id = n.source_id)
WHERE n.category_id = 25
AND n.source_id NOT IN(select
source_id
from news_sources_deselected
WHERE user_id = 5)
ORDER BY a.article_publish_date DESC
Schema for Articles Table
CREATE TABLE IF NOT EXISTS `articles` (
`article_id` int(255) NOT NULL auto_increment,
`article_title` varchar(255) NOT NULL,
`source_id` int(255) NOT NULL,
`article_publish_date` bigint(255) NOT NULL,
`article_details` text NOT NULL,
PRIMARY KEY (`article_id`),
KEY `source_id` (`source_id`),
KEY `article_publish_date` (`article_publish_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Contains articles.';
Structure for Sources table
CREATE TABLE IF NOT EXISTS `sources` (
`source_id` int(255) NOT NULL auto_increment,
`category_id` int(255) NOT NULL,
`source_name` varchar(255) character set latin1 NOT NULL,
`user_id` int(255) NOT NULL,
PRIMARY KEY (`source_id`),
KEY `category_id` (`category_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='News Sources.'
The articles table has around 0.3 Million records and sources table contains around 1000 records, the query takes around 180 seconds to execute.
Any help will be greatly appreciated.
Try using a derived table with an IS NULL condition. Your EXPLAIN shows a dependent subquery; avoid it and use a derived table instead, which should improve performance.
select
a.article_id,
a.article_title,
a.source_id,
a.article_publish_date,
a.article_details,
n.source_name
from sources n
INNER JOIN articles a
ON (a.source_id = n.source_id)
LEFT JOIN (SELECT *
FROM news_sources_deselected
WHERE user_id = 5) AS nsd
ON nsd.source_id = n.source_id
WHERE n.category_id = 25
AND nsd.source_id IS NULL
ORDER BY a.article_publish_date DESC
Use EXPLAIN in front of your query and analyze results.
Here you can find how to start your optimization work.
I see a few issues you could check.
You're not using relations despite using the InnoDB engine.
You're selecting fields without an index.
You're selecting all rows at once.
Do you need all those rows at once? Maybe consider splitting this query into multiple pages (paging)?
Try this query
select
a.article_id,
a.article_title,
a.source_id,
a.article_publish_date,
a.article_details,
n.source_name
from
sources n
INNER JOIN
articles a
ON
n.category_id = 25 AND
a.source_id = n.source_id
INNER JOIN
news_sources_deselected nsd
ON
nsd.user_id <> 5 AND n.source_id = nsd.source_id
ORDER BY
a.article_publish_date DESC
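I have removed the subquery and added news_sources_deselected to the join, accepting all source_id rows for users other than id 5. Note that this is not equivalent to the original NOT IN: a source that nobody has deselected drops out of the inner join, and a source deselected by both user 5 and another user is still returned.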
Or we can go for using only needed records for join as user raheelshan has mentioned
select
a.article_id,
a.article_title,
a.source_id,
a.article_publish_date,
a.article_details,
n.source_name
from
(select
*
from
sources
where
category_id = 25) n
INNER JOIN
articles a
ON
a.source_id = n.source_id
INNER JOIN
(select
*
from
news_sources_deselected
where
user_id <> 5) nsd
ON
n.source_id = nsd.source_id
ORDER BY
a.article_publish_date DESC
Hope this helps..
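The rewrites in this thread do not all return the same rows, which is easy to check on a toy data set. The sketch below uses Python's sqlite3 module rather than MySQL, with made-up rows and only the columns the queries touch. It compares the original NOT IN, the LEFT JOIN ... IS NULL form, and the INNER JOIN on user_id <> 5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (source_id INTEGER PRIMARY KEY, category_id INTEGER, source_name TEXT);
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, source_id INTEGER, article_publish_date INTEGER);
CREATE TABLE news_sources_deselected (user_id INTEGER, source_id INTEGER);
INSERT INTO sources VALUES (1, 25, 'A'), (2, 25, 'B'), (3, 25, 'C'), (4, 7, 'D');
INSERT INTO articles VALUES (101, 1, 30), (102, 2, 20), (103, 3, 10), (104, 4, 40);
-- User 5 deselected source 2; user 9 deselected sources 2 and 3.
INSERT INTO news_sources_deselected VALUES (5, 2), (9, 2), (9, 3);
""")

def article_ids(sql):
    return [row[0] for row in conn.execute(sql)]

# Original query shape: NOT IN with a subquery.
not_in = article_ids("""
SELECT a.article_id FROM sources n
JOIN articles a ON a.source_id = n.source_id
WHERE n.category_id = 25
  AND n.source_id NOT IN (SELECT source_id FROM news_sources_deselected WHERE user_id = 5)
ORDER BY a.article_publish_date DESC
""")

# Anti-join: LEFT JOIN the user's deselected sources, keep the unmatched rows.
anti_join = article_ids("""
SELECT a.article_id FROM sources n
JOIN articles a ON a.source_id = n.source_id
LEFT JOIN (SELECT source_id FROM news_sources_deselected WHERE user_id = 5) AS nsd
       ON nsd.source_id = n.source_id
WHERE n.category_id = 25 AND nsd.source_id IS NULL
ORDER BY a.article_publish_date DESC
""")

# INNER JOIN on user_id <> 5: answers a different question.
inner_join = article_ids("""
SELECT a.article_id FROM sources n
JOIN articles a ON n.category_id = 25 AND a.source_id = n.source_id
JOIN news_sources_deselected nsd ON nsd.user_id <> 5 AND n.source_id = nsd.source_id
ORDER BY a.article_publish_date DESC
""")

print(not_in, anti_join, inner_join)
```

On this data the first two forms agree, while the inner join loses the article from source 1 (nobody deselected it) and keeps the article from source 2 (which user 5 did deselect).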
I fixed the issue by partitioning the table, but I am still open to suggestions.

SQL database index design for inner join keyword search

I have this query
SELECT a.*
FROM entries a
INNER JOIN entries_keywords b ON a.id = b.entry_id
INNER JOIN keywords c ON b.keyword_id = c.id
WHERE c.key IN ('wake', 'up')
GROUP BY a.id
HAVING COUNT(*) = 2
but it's slow. How do I design indexes optimally to speed things up?
EDIT
This is the current schema
CREATE TABLE `entries` (`id` integer PRIMARY KEY AUTOINCREMENT, `sha` text);
CREATE TABLE `entries_keywords` (`id` integer PRIMARY KEY AUTOINCREMENT, `entry_id` integer REFERENCES `entries`, `keyword_id` integer REFERENCES `keywords`);
CREATE TABLE `keywords` (`id` integer PRIMARY KEY AUTOINCREMENT, `key` string);
CREATE INDEX `entries_keywords_entry_id_index` ON `entries_keywords` (`entry_id`);
CREATE INDEX `entries_keywords_entry_id_keyword_id_index` ON `entries_keywords` (`entry_id`, `keyword_id`);
CREATE INDEX `entries_keywords_keyword_id_index` ON `entries_keywords` (`keyword_id`);
CREATE INDEX `keywords_key_index` ON `keywords` (`key`);
I'm using Sqlite3, the query doesn't fail, but is slow.
Right now I'm using a query like this (one subquery for each keyword):
select *
from (
select *
from (entries) e
inner join entries_keywords ek on e.id = ek.entry_id
inner join keywords k on ek.keyword_id = k.id
where k.key = 'wake') e
inner join entries_keywords ek on e.id = ek.entry_id
inner join keywords k on ek.keyword_id = k.id
where k.key = 'up';
This is way faster but doesn't feel right since it's going to get ugly if I have a lot of keywords.
The key indexes required for that query
keywords(key)
entries_keywords(keyword_id,entry_id)
entries(id)
You must be using MySQL, because the SELECT a.* would otherwise fail.
EDIT after the 2nd comment about this statement, let me point out why select a.* will fail here - it's because of the GROUP BY.
To explain, because the criteria (WHERE) is on c.key, it needs to be indexed.
This then goes up the JOIN against b.keyword_id. We create an index to include b.entry_id so that it never has to look up against the table - the index alone can cover the columns required.
Finally, a.id=b.entry_id joins back to the entries table, so we index the id of that table.
It is quite likely entries(id) is already the primary key, but you may have entries_keywords indexed the other way around - it won't work to satisfy this join.
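Since the question is about Sqlite3, the suggested index can be tried directly with Python's sqlite3 module. This is only a sketch: the sample rows, the index name, and the helper functions build_query and entries_matching are made up. It builds the IN list for any number of keywords, so the query does not get uglier as keywords are added, and prints the EXPLAIN QUERY PLAN rows so you can see which indexes SQLite picks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (id INTEGER PRIMARY KEY AUTOINCREMENT, sha TEXT);
CREATE TABLE keywords (id INTEGER PRIMARY KEY AUTOINCREMENT, "key" TEXT);
CREATE TABLE entries_keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    entry_id INTEGER REFERENCES entries,
    keyword_id INTEGER REFERENCES keywords
);
CREATE INDEX keywords_key_index ON keywords ("key");
-- Index led by keyword_id, as suggested in the answer (name is made up).
CREATE INDEX entries_keywords_keyword_entry ON entries_keywords (keyword_id, entry_id);
INSERT INTO entries (id, sha) VALUES (1, 'aaa'), (2, 'bbb'), (3, 'ccc');
INSERT INTO keywords (id, "key") VALUES (1, 'wake'), (2, 'up'), (3, 'sleep');
INSERT INTO entries_keywords (entry_id, keyword_id)
VALUES (1, 1), (1, 2), (2, 1), (3, 2), (3, 3);
""")

def build_query(words):
    """Return (sql, params) for entries that carry every keyword in `words`."""
    placeholders = ", ".join("?" * len(words))
    sql = f"""
    SELECT a.id
    FROM entries a
    JOIN entries_keywords b ON a.id = b.entry_id
    JOIN keywords c ON b.keyword_id = c.id
    WHERE c."key" IN ({placeholders})
    GROUP BY a.id
    HAVING COUNT(*) = ?   -- assumes no duplicate (entry_id, keyword_id) rows
    ORDER BY a.id
    """
    return sql, (*words, len(words))

def entries_matching(words):
    sql, params = build_query(words)
    return [row[0] for row in conn.execute(sql, params)]

def query_plan(words):
    sql, params = build_query(words)
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

print(entries_matching(["wake", "up"]))
for step in query_plan(["wake", "up"]):
    print(step)
```

The exact plan text varies between SQLite versions, so treat the printed plan as a diagnostic rather than a guarantee that a particular index is used.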

How to optimize query looking for rows where conditional join rows do not exist?

I've got a table of keywords that I regularly refresh against a remote search API, and I have another table that gets a row each time I refresh one of the keywords. I use this table to block multiple processes from stepping on each other and refreshing the same keyword, as well as for stat collection. So when I spin up my program, it queries for all the keywords that don't have a request currently in process, and don't have a successful one within the last 15 mins, or whatever the interval is. All was working fine for a while, but now the keywords_requests table has almost 2 million rows in it and things are bogging down badly. I've got indexes on almost every column in the keywords_requests table, but to no avail.
I'm logging slow queries and this one is taking forever, as you can see. What can I do?
# Query_time: 20 Lock_time: 0 Rows_sent: 568 Rows_examined: 1826718
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT JOIN `keywords_requests` as KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
AND KeywordsRequest.source_id = '29'
AND KeywordsRequest.created > FROM_UNIXTIME(1234551323)
)
WHERE KeywordsRequest.id IS NULL
GROUP BY Keyword.id
ORDER BY KeywordsRequest.created ASC;
It seems your most selective index on Keywords is one on KeywordRequest.created.
Try to rewrite query this way:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
SELECT *
FROM `keywords_requests` as kr
WHERE created > FROM_UNIXTIME(1234567890) /* Happy unix_time! */
) AS KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
AND (KeywordsRequest.status = 'success' OR KeywordsRequest.status = 'active')
AND KeywordsRequest.source_id = '29'
)
WHERE keyword_id IS NULL;
It will (hopefully) hash join two not so large sources.
And Bill Karwin is right, you don't need the GROUP BY or ORDER BY
There is no fine control over the plans in MySQL, but you can try (try) to improve your query in the following ways:
Create a composite index on (keyword_id, status, source_id, created) and make it so:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN `keywords_requests` kr
ON (
kr.keyword_id = Keyword.id
AND kr.status = 'success'
AND kr.source_id = '29'
AND kr.created > FROM_UNIXTIME(1234567890)
)
WHERE kr.keyword_id IS NULL
UNION
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN `keywords_requests` kr
ON (
kr.keyword_id = Keyword.id
AND status = 'active'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
)
WHERE keyword_id IS NULL
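This ideally should use NESTED LOOPS on your index. Note that the UNION returns a keyword when either status is missing, whereas the original query requires both to be missing, so check the results before relying on this form.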
Create a composite index on (status, source_id, created) and make it so:
SELECT Keyword.id, Keyword.keyword
FROM `keywords` as Keyword
LEFT OUTER JOIN (
SELECT *
FROM `keywords_requests` kr
WHERE
status = 'success'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
UNION ALL
SELECT *
FROM `keywords_requests` kr
WHERE
status = 'active'
AND source_id = '29'
AND created > FROM_UNIXTIME(1234567890)
) AS KeywordsRequest
ON KeywordsRequest.keyword_id = Keyword.id
WHERE keyword_id IS NULL
This will hopefully use HASH JOIN on even more restricted hash table.
When diagnosing MySQL query performance, one of the first things you need to analyze is the report from EXPLAIN.
If you learn to read the information EXPLAIN gives you, then you can see where queries are failing to make use of indexes, or where they are causing expensive filesorts, or other performance red flags.
I notice in your query, the GROUP BY is irrelevant, since there will be only one NULL row returned from KeywordRequests. Also the ORDER BY is irrelevant, since you're ordering by a column that will always be NULL due to your WHERE clause. If you remove these clauses, you'll probably eliminate a filesort.
Also consider rewriting the query into other forms, and measure the performance of each. For example:
SELECT k.id, k.keyword
FROM `keywords` AS k
WHERE NOT EXISTS (
SELECT * FROM `keywords_requests` AS kr
WHERE kr.keyword_id = k.id
AND kr.status IN ('success', 'active')
AND kr.source_id = '29'
AND kr.created > FROM_UNIXTIME(1234551323)
);
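When measuring alternative forms, it is worth confirming first that they return the same keywords. The following sketch does that with Python's sqlite3 module instead of MySQL. The rows are made up, created is stored as an integer Unix timestamp (an assumption, to avoid FROM_UNIXTIME), and the index name is hypothetical. It compares the LEFT JOIN ... IS NULL form, the NOT EXISTS form, and the UNION of two per-status anti-joins suggested earlier in the thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT);
CREATE TABLE keywords_requests (
    id INTEGER PRIMARY KEY, keyword_id INTEGER, status TEXT,
    source_id INTEGER, created INTEGER
);
-- Compound index over the four filtered columns (name is made up).
CREATE INDEX kr_lookup ON keywords_requests (keyword_id, source_id, status, created);
INSERT INTO keywords VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');
INSERT INTO keywords_requests (keyword_id, status, source_id, created) VALUES
    (1, 'success', 29, 2000),  -- recent success: keyword 1 is blocked
    (2, 'active',  29, 2000),  -- in progress: keyword 2 is blocked
    (3, 'success', 29, 500),   -- too old to count
    (3, 'failed',  29, 2000),  -- wrong status
    (4, 'success', 30, 2000);  -- other source
""")
params = {"cutoff": 1000}

def keyword_ids(sql):
    return [row[0] for row in conn.execute(sql, params)]

left_join = keyword_ids("""
SELECT k.id FROM keywords k
LEFT JOIN keywords_requests kr
       ON kr.keyword_id = k.id AND kr.status IN ('success', 'active')
      AND kr.source_id = 29 AND kr.created > :cutoff
WHERE kr.id IS NULL
ORDER BY k.id
""")

not_exists = keyword_ids("""
SELECT k.id FROM keywords k
WHERE NOT EXISTS (
    SELECT 1 FROM keywords_requests kr
    WHERE kr.keyword_id = k.id AND kr.status IN ('success', 'active')
      AND kr.source_id = 29 AND kr.created > :cutoff)
ORDER BY k.id
""")

# UNION of one anti-join per status: "no success OR no active", a looser test.
union_of_anti_joins = keyword_ids("""
SELECT k.id FROM keywords k
LEFT JOIN keywords_requests kr
       ON kr.keyword_id = k.id AND kr.status = 'success'
      AND kr.source_id = 29 AND kr.created > :cutoff
WHERE kr.id IS NULL
UNION
SELECT k.id FROM keywords k
LEFT JOIN keywords_requests kr
       ON kr.keyword_id = k.id AND kr.status = 'active'
      AND kr.source_id = 29 AND kr.created > :cutoff
WHERE kr.id IS NULL
ORDER BY 1
""")

print(left_join, not_exists, union_of_anti_joins)
```

On this data the first two forms agree on keywords 3 and 4, while the UNION form also returns the blocked keywords 1 and 2, so only the first two are safe candidates to time against each other.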
Other tips:
Is kr.source_id an integer? If so, compare to the integer 29 instead of the string '29'.
Are there appropriate indexes on keyword_id, status, source_id, created? Perhaps even a compound index over all four columns would be best, since MySQL will usually use only one index per table in a given query.
You did a screenshot of your EXPLAIN output and posted a link in the comments. I see that the query is not using an index from Keywords, which makes sense since you're scanning every row in that table anyway. The phrase "Not exists" indicates that MySQL has optimized the LEFT OUTER JOIN a bit.
I think this should be improved over your original query. The GROUP BY/ORDER BY was probably causing it to save an intermediate data set as a temporary table, and sorting it on disk (which is very slow!). What you'd look for is "Using temporary; using filesort" in the Extra column of EXPLAIN information.
So you may have improved it enough already to mitigate the bottleneck for now.
I do notice that the possible keys probably indicate that you have individual indexes on four columns. You may be able to improve that by creating a compound index:
CREATE INDEX kr_cover ON keywords_requests
(keyword_id, created, source_id, status);
You can give MySQL a hint to use a specific index:
... FROM `keywords_requests` AS kr USE INDEX (kr_cover) WHERE ...
Dunno about MySQL but in MSSQL the lines of attack I would take are:
1) Create a covering index on KeywordsRequest status, source_id and created
2) UNION the results to get around the OR on KeywordsRequest.status
3) Use NOT EXISTS instead of the Outer Join (and try with UNION instead of OR too)
Try this
SELECT Keyword.id, Keyword.keyword
FROM keywords as Keyword
LEFT JOIN (select * from keywords_requests where source_id = '29' and (status = 'success' OR status = 'active')
AND created > FROM_UNIXTIME(1234551323)
) as KeywordsRequest
ON (
KeywordsRequest.keyword_id = Keyword.id
)
WHERE KeywordsRequest.id IS NULL -- the anti-join test belongs outside the derived table
GROUP BY Keyword.id
ORDER BY KeywordsRequest.created ASC;