I have a MySQL query which combines data from 3 tables, which I'm calling "first_table", "second_table", and "third_table", as shown below.
This query consistently shows up in the MySQL slow query log, even though all fields referenced in the query are indexed, and the actual amount of data in these tables is not large (< 1000 records, except for "third_table" which has more like 10,000 records).
I'm trying to determine if there is a better way to structure this query to achieve better performance, and which part of it is the most likely culprit for the slowdown.
Please note that "third_table.placements" is a JSON field type. All "label" fields are varchar(255), "id" fields are primary key integer fields, "sample_img" is an integer, "guid" is a string, "deleted" is an integer, and "timestamp" is a datetime.
SELECT DISTINCT first_table.id,
       first_table.label,
       (SELECT guid
          FROM second_table
         WHERE second_table.id = first_table.sample_img) AS guid,
       COUNT(third_table.id) AS related_count,
       SUM(JSON_LENGTH(third_table.placements)) AS placements_count
FROM first_table
LEFT JOIN third_table
       ON JSON_OVERLAPS(third_table.placements,
                        CAST(first_table.id AS CHAR))
WHERE first_table.deleted IS NULL
  AND third_table.deleted IS NULL
  AND UNIX_TIMESTAMP(third_table.timestamp) >= 1647586800
  AND UNIX_TIMESTAMP(third_table.timestamp) < 1648191600
GROUP BY first_table.id
ORDER BY LOWER(first_table.label) ASC
LIMIT 0, 1000
The biggest problem is that these are not sargable:
WHERE ... Unix_timestamp(third_table.timestamp) < 1648191600
ORDER BY Lower(first_table.label)
That is, don't hide a potentially indexed column inside a function call. Instead:
WHERE ... third_table.timestamp < FROM_UNIXTIME(1648191600)
and use a case-insensitive COLLATION for first_table.label; that is, any collation ending in _ci. (Please provide SHOW CREATE TABLE so I can point that out, and so I can check the vague "all fields are indexed", which usually indicates not knowing the benefits of "composite" indexes.)
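As an illustration only (a sketch, not a drop-in fix; utf8mb4 and the MySQL 8.0 collation name are assumptions, so verify against your SHOW CREATE TABLE first):

```sql
-- Sketch: give label a case-insensitive collation so ORDER BY label
-- no longer needs LOWER(). utf8mb4_0900_ai_ci is an assumption
-- (MySQL 8.0); pick the _ci collation matching your charset.
ALTER TABLE first_table
  MODIFY label VARCHAR(255)
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_0900_ai_ci;
```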
JSON_OVERLAPS(...) is probably also not sargable, but that gets trickier to fix. Please explain the structure of the JSON and the types of id and placements.
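If it turns out that placements holds a plain JSON array of integer ids (an assumption; this is why the structure matters), one option on MySQL 8.0.17+ is a multi-valued index, which JSON_OVERLAPS and MEMBER OF can use:

```sql
-- Sketch, assuming placements looks like [3, 17, 42]
ALTER TABLE third_table
  ADD INDEX idx_placements ((CAST(placements AS UNSIGNED ARRAY)));

-- The join condition could then be rewritten so the optimizer can use it:
--   ON first_table.id MEMBER OF (third_table.placements)
```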
Do you really need 1000 rows in the output? That is quite large for "pagination".
How big are the tables? UUIDs/GUIDs are notorious when the tables are too big to be cached in RAM.
It is almost never useful to have both SELECT DISTINCT and GROUP BY. Removing the DISTINCT may speed up the query by avoiding an extra sort.
Do you really want LEFT JOIN, not just JOIN? (I don't understand the query enough to make a guess.)
After you have fixed most of those, and if you still need help, I may have a way to get rid of the GROUP BY by adding a 'derived' table. Later. (Then I may be able to address the "json_overlaps" discussion.)
Please provide EXPLAIN SELECT ...
I query the wp_posts table in my database.
Current indexes are as below:
I run the following query:
SELECT MIN(post_parent)
FROM wp_posts
WHERE post_type = 'product_variation' and post_parent > 365191;
It examines 162880 rows, as you can see below:
However, the following query shows that only 67156 rows actually match:
SELECT COUNT(*)
FROM wp_posts
WHERE post_type = 'product_variation' and post_parent > 365191;
So why is my index not working as I expected?
P.S.: post_parent is a BIGINT.
Your index is "working", but it is not useful for this query.
Basically, one purpose of an index is to reduce the number of data pages that need to be read. When you have a query that reads 67,156 out of 162,880 rows, the optimizer figures that every data page needs to be read anyway. So why bother using the index?
As a note, for this query:
SELECT MIN(post_parent)
FROM wp_posts
WHERE post_type = 'product_variation' and post_parent > 365191;
The optimal index is on wp_posts(post_type, post_parent). I am guessing that this index would actually be used, because it is a covering index for the query. So reading the index has advantages over reading the original data pages.
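That index as DDL (the index name here is illustrative):

```sql
CREATE INDEX idx_type_parent ON wp_posts (post_type, post_parent);

-- Re-check the plan; Extra should now show "Using index"
EXPLAIN
SELECT MIN(post_parent)
FROM wp_posts
WHERE post_type = 'product_variation' AND post_parent > 365191;
```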
Your index is "working", but it is only partially useful for this query.
You did not disclose the exact details of the type_status_date index, but based on the name I assume it indexes post type, post status, and post date fields, in this particular order.
This means that type_status_date can only be used to narrow down the results based on post type, but cannot be used for looking up post parent, since the latter field is not part of the index. MySQL will generally use only a single index per table in a query.
If you executed a
SELECT COUNT(*)
FROM wp_posts
WHERE post_type = 'product_variation'
query, then the count will probably be a lot closer to the scanned rows count in the explain.
As Gordon explained, an index on post type, post parent fields would be more efficient in this case.
The "Rows" in EXPLAIN are an approximation; sometimes very far off.
INDEX(post_type, post_parent) is not only 'covering' (meaning that all the necessary columns are found in the index), but, together with MIN(), it will look at only one row to give the answer.
Explain will indicate "covering" by saying "Using index" (not "Using index condition").
EXPLAIN does not always lower the "Rows" to account for LIMIT, MIN, etc. So, again, you can't always trust it. See this for an accurate count: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#handler_counts
I have a table that holds a domain and an id.
the query is
select distinct domain
from user
where id = '1'
The index with column order idx_domain_id is faster than idx_id_domain.
If the order of execution is
(FROM clause, WHERE clause, GROUP BY clause, HAVING clause, SELECT clause, ORDER BY clause)
then I would expect the query to be faster when the index leads with the WHERE column rather than the selected one.
At 15:00 to 17:00 it shows the same query I am working on:
https://serversforhackers.com/laravel-perf/mysql-indexing-three
The table has 4.6 million rows.
Time using idx_domain_id:
Time after changing the order:
This is your query:
select distinct first_name
from user
where id = '1';
You are observing that user(first_name, id) is faster than user(id, first_name).
Why might this be the case? First, this could simply be an artifact of how you are doing the timing. If your table is really small (i.e. the data fits on a single data page), then indexes are generally not very useful for improving performance.
Second, if you are only running the queries once, then the first time you run the query, you might have a "cold cache". The second time, the data is already stored in memory, so it runs faster.
Other issues can come up as well. You don't specify what the timings are. Small differences can be due to noise and might be meaningless.
You don't provide enough information to give a more definitive explanation. That would include:
Repeated timings run on cold caches.
Size information on the table and the number of matching rows.
Layout information, particularly the type of id.
Explain plans for the two queries.
select distinct domain
from user
where id = '1'
Since id is the PRIMARY KEY, there is at most one row involved. Hence, the keyword DISTINCT is useless.
And the most useful index is what you already have, PRIMARY KEY(id). It will drill down the BTree to find id='1' and deliver the value of domain that is sitting right there.
On the other hand, consider
select distinct domain
from user
where something_else = '1'
Now, the obvious index is INDEX(something_else, domain). This is optimal for the WHERE clause, and it is "covering" (meaning that all the columns needed by the query exist in the index). Swapping the columns in the index will be slower. Meanwhile, since there could be multiple rows, DISTINCT means something. However, it is not the logical thing to use.
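That covering index as DDL (the index name is illustrative):

```sql
-- The = column (something_else) comes first; domain is appended
-- only so the index covers the query with no table lookup.
CREATE INDEX idx_se_domain ON user (something_else, domain);
```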
Concerning your title question (order of columns): The = columns in the WHERE clause should come first. (More details in the link below.)
DISTINCT means to gather all the rows, then de-duplicate them. Why go to that much effort when this gives the same answer:
select domain
from user
where something_else = '1'
LIMIT 1
This hits only one row, not all the 1s.
Read my Indexing Cookbook.
(And, yes, Gordon has a lot of good points.)
Table type: MyISAM
Rows: 120k
Data Length: 30MB
Index Length: 40MB
my.ini, MySQL 5.6.2 Windows
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 16M
Windows Server 2012, 12GB RAM, SSD 400MB/s
1 Slow Query:
SELECT article_id, title, author, content, pdate, MATCH(author, title, content)
AGAINST('Search Keyword') AS score FROM articles ORDER BY score DESC LIMIT 10;
Executing this query takes 352 ms and uses an index. After profiling, it shows that most of the time is spent on Creating sort index. (Complete detail: http://pastebin.com/raw/jT58DCN5)
2 Faster Query:
SELECT article_id, title, author, content, pdate, MATCH(author, title, content)
AGAINST('Search Keyword') AS score FROM articles LIMIT 10;
Executing this query takes 23 ms but does a full table scan; I don't like full table scans.
The problem/question is: Query #1 is the one that I need to use, since the sorting is very important.
Is there anything I can do to speed up that query, or rewrite it to achieve the same result as #1?
Appreciate any input and help.
Maybe you are simply expecting too much? 350ms for doing a
MATCH(author, title, content) AGAINST('Search Keyword')
ORDER BY
on 120k records doesn't sound too shabby to me; especially if content is 'biggish' ...
Keep in mind that for your "SLOW QUERY" to work, the system has to read every row, calculate the score and then in the end sort all scores, figure out the lowest 10 values and then return all relevant row information for it. If you leave out the ORDER BY then it simply picks the first 10 rows and only needs to calculate the score for those 10 rows.
That said, I think the EXPLAIN is a bit misleading in that it seems to blame everything on the SORT while most probably it is the MATCH that takes up most of the time. I'm guessing the MATCH() operator is executed in a 'lazy' manner and thus only is run when the data is asked for which in this case is while the sorting is happening.
To figure this out simply add a new column score and split up the query into 2 parts.
UPDATE articles SET score = MATCH() etc... => will take about 300 ms I guess
SELECT article_id, title, author, content, pdate, score FROM articles ORDER BY score DESC LIMIT 10; => will take about 50 ms I guess
Of course this is no working solution, but if I'm right it will show you that your problem is not with the SORT but rather with the full-text searching...
PS: you forgot to mention what the indexes are on the table; that might be useful to know too. Cf. https://dev.mysql.com/doc/refman/5.7/en/innodb-fulltext-index.html
Try these variations:
AGAINST('word')
AGAINST('+word')
AGAINST('+word' IN BOOLEAN MODE)
Try
SELECT ... MATCH ...,
FROM tbl
WHERE MATCH ... -- repeat the test here
...
The test is to eliminate the rows that do not match at all, thereby greatly diminishing the number of rows to sort. (Again, depending on + and BOOLEAN.)
(I usually use all three: +, BOOLEAN, and WHERE MATCH.)
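Combining all three suggestions, the query might look like this (the search keyword is a placeholder):

```sql
SELECT article_id, title, author, content, pdate,
       MATCH(author, title, content)
         AGAINST('+keyword' IN BOOLEAN MODE) AS score
FROM   articles
WHERE  MATCH(author, title, content)
         AGAINST('+keyword' IN BOOLEAN MODE)  -- discards non-matching rows before the sort
ORDER BY score DESC
LIMIT  10;
```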
key_buffer_size = 2G may help also.
You should consider moving to InnoDB, where FT is faster.
I'm going through an application and trying to optimize some queries and I'm really struggling with a few of them. Here's an example:
SELECT `Item`.*, `Source`.*, `Keyword`.*, `Author`.*
FROM `items` AS `Item`
JOIN `sources` AS `Source` ON (`Item`.`source_id` = `Source`.`id`)
JOIN `authors` AS `Author` ON (`Item`.`author_id` = `Author`.`id`)
JOIN `items_keywords` AS `ItemsKeyword` ON (`Item`.`id` = `ItemsKeyword`.`item_id`)
JOIN `keywords` AS `Keyword` ON (`Keyword`.`id` = `ItemsKeyword`.`keyword_id`)
JOIN `keywords_profiles` AS `KeywordsProfile` ON (`Keyword`.`id` = `KeywordsProfile`.`keyword_id`)
JOIN `profiles` AS `Profile` ON (`Profile`.`id` = `KeywordsProfile`.`profile_id`)
WHERE `KeywordsProfile`.`profile_id` IN (17)
GROUP BY `Item`.`id`
ORDER BY `Item`.`timestamp` DESC, `Item`.`id` DESC
LIMIT 0, 20;
This one is taking 10-30 seconds...in the tables referenced, there are about 500k author rows, and about 750k items and items_keywords rows. Everything else is less than 500 rows.
Here's the explain output:
http://img.skitch.com/20090220-fb52wd7jf58x41ikfxaws96xjn.jpg
EXPLAIN is relatively new to me, but I went through this line by line and it all seems fine. Not sure what else I can do, as I've got indexes on everything...what am I missing?
The server this sits on is just a 256 slice over at slicehost, but there's nothing else running on it and the CPU is at 0% before it's run. And yet it still cranks away on this query. Any ideas?
EDIT: Some further info; one of the things that makes this really frustrating is that if I repeatedly run this query, it takes less than .1 seconds. I'm assuming this is due to the query cache, but if I run RESET QUERY CACHE before it, it still runs extremely quickly. It's only after I wait a little while or run some other queries that the 10-30 second times return. All the tables are MyISAM...does this indicate that MySQL is loading stuff into memory and that's why it runs so much faster for a while?
EDIT 2: Thanks so much to everyone for your help...an update...I cut everything down to this:
SELECT i.id
FROM items AS i
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;
Consistently took 5-6 seconds, despite there only being 750k records in the DB. Once I dropped the 2nd column on the ORDER BY clause, it was pretty much instant. There's obviously several things going on here, but when I cut the query down to this:
SELECT i.id
FROM items AS i
JOIN items_keywords AS ik ON ( i.id = ik.item_id )
JOIN keywords AS k ON ( k.id = ik.keyword_id )
JOIN keywords_profiles AS kp ON ( k.id = kp.keyword_id )
WHERE kp.profile_id IN (139)
ORDER BY i.timestamp DESC
LIMIT 20;
It's still taking 10+ seconds...what else can I do?
Minor curiosity: on the explain, the rows column for items_keywords is always 1544, regardless of what profile_id I'm using in the query. Shouldn't it change depending on the number of items associated with that profile?
EDIT 3: Ok, this is getting ridiculous :). If I drop the ORDER BY clause entirely, things are very speedy and the temp table / filesort disappears from the explain. There's currently an index on the items.timestamp column, but is it not being used for some reason? I thought I remembered something about MySQL only using one index per table or something; should I create a multi-column index over all the columns on the items table that this query references (source_id, author_id, timestamp, etc)?
Try this and see how it does:
SELECT i.*, s.*, k.*, a.*
FROM items AS i
JOIN sources AS s ON (i.source_id = s.id)
JOIN authors AS a ON (i.author_id = a.id)
JOIN items_keywords AS ik ON (i.id = ik.item_id)
JOIN keywords AS k ON (k.id = ik.keyword_id)
WHERE k.id IN (SELECT kp.keyword_id
FROM keywords_profiles AS kp
WHERE kp.profile_id IN (17))
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;
I factored out a couple of the joins into a non-correlated subquery, so you wouldn't have to do a GROUP BY to map the result to distinct rows.
Actually, you may still get multiple rows per i.id in my example, depending on how many keywords map to a given item and also to profile_id 17.
The filesort reported in your EXPLAIN report is probably due to the combination of GROUP BY and ORDER BY using different fields.
I agree with @ʞɔıu's answer that the speedup is probably because of key caching.
It looks okay, every row in the explain is using an index of some sort. One possible worry is the filesort bit. Try running the query without the order by clause and see if that improves it.
Then, what I would do is gradually take out each join until you (hopefully) get that massive speed increase, then concentrate on why that's happening.
The reason I mention the filesort is because I can't see a mention of timestamp anywhere in the explain output (even though it's your primary sort criteria) - it might be requiring a full non-indexed sort.
UPDATE#1:
Based on edit#2, the query:
SELECT i.id
FROM items AS i
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;
takes 5-6 seconds. That's abhorrent. Try creating a composite index on both TIMESTAMP and ID and see if that improves it:
create index timestamp_id on items(timestamp,id);
select id from items order by timestamp desc,id desc limit 0,20;
select id from items order by timestamp,id limit 0,20;
select id from items order by timestamp desc,id desc;
select id from items order by timestamp,id;
On one of the tests, I've left off the descending bit (DB2 for one sometimes doesn't use indexes if they're in the opposite order). The other variation is to take off the limit in case that's affecting it.
For your query to run fast, you need:
Create an index: CREATE INDEX ix_timestamp_id ON items (timestamp, id)
Ensure that id's on sources, authors and keywords are primary keys.
Force MySQL to use this index for items, and perform NESTED LOOP joins for the other tables:
EXPLAIN EXTENDED
SELECT Item.*, Source . * , Keyword . * , Author . *
FROM items AS Item FORCE INDEX FOR ORDER BY (ix_timestamp_id)
JOIN items_keywords AS ItemsKeyword FORCE INDEX (ix_item_keyword) ON ( Item.id = ItemsKeyword.item_id AND ItemsKeyword.keyword_id IN
(
SELECT keyword_id
FROM keywords_profiles AS KeywordsProfile FORCE INDEX (ix_keyword_profile)
WHERE KeywordsProfile.profile_id = 17
)
)
JOIN sources AS Source FORCE INDEX (primary) ON ( Item.source_id = Source.id )
JOIN authors AS Author FORCE INDEX (primary) ON ( Item.author_id = Author.id )
JOIN keywords AS Keyword FORCE INDEX (primary) ON ( Keyword.id = ItemsKeyword.keyword_id )
ORDER BY Item.timestamp DESC, Item.id DESC
LIMIT 0, 20
As you can see, we get rid of GROUP BY, push the subquery into the JOIN condition and force PRIMARY KEYs to be used for joins.
That's how we ensure that NESTED LOOPS with items as a leading tables will be used for all joins.
As a result:
1, 'PRIMARY', 'Item', 'index', '', 'ix_timestamp_id', '12', '', 20, 2622845.00, ''
1, 'PRIMARY', 'Author', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.author_id', 1, 100.00, ''
1, 'PRIMARY', 'Source', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.source_id', 1, 100.00, ''
1, 'PRIMARY', 'ItemsKeyword', 'ref', 'PRIMARY', 'PRIMARY', '4', 'test.Item.id', 1, 100.00, 'Using where; Using index'
1, 'PRIMARY', 'Keyword', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.ItemsKeyword.keyword_id', 1, 100.00, ''
2, 'DEPENDENT SUBQUERY', 'KeywordsProfile', 'unique_subquery', 'PRIMARY', 'PRIMARY', '8', 'func,const', 1, 100.00, 'Using index; Using where'
And when we run this, we get:
20 rows fetched in 0,0038s (0,0019s)
There are 500k rows in items, 600k rows in items_keywords, 512 values in keywords, and 512 values in keywords_profiles (all with profile 17).
I would suggest you run a profiler on the query; then you can see how long each subquery took and where the time is being consumed. If you have phpMyAdmin, it's a simple checkbox you need to check to get this functionality, but my guess is you can get it manually from the MySQL terminal app as well. I haven't seen this EXPLAIN thing before; if it is in fact the profiling I am used to in phpMyAdmin, I apologize for the nonsense.
What is the GROUP BY clause achieving? There are no aggregate functions in the SELECT so the GROUP BY should be unnecessary
Some things to try:
Try not selecting all columns from all tables; select only what you need. Selecting everything precludes the use of covering indexes (look for "Using index" in the extra column) and in general will soak up a lot of needless IO.
That filesort looks a little troubling. Try removing the order by and replacing it with order by null -- group by implicitly sorts in mysql so you have to order by null to remove that implicit sort.
Try adding an index on item (timestamp, id) or (id, timestamp). Might do something about that filesort (you never know).
Why are you grouping by item id and not selecting any aggregate columns? If you group by a column and then select (much less order by) some other non-aggregate columns, the values of those columns will be chosen more or less arbitrarily. Unless item id is always unique for this query, in which case the group by will not accomplish anything.
Lastly, in my experience, MySQL sometimes will just inexplicably freak out if you give it too many joins to try to optimize. Try to figure out if there's some way you don't have to do so many joins all at once, i.e. split it up into multiple queries if you can.
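The ORDER BY NULL suggestion above, as a sketch: pre-8.0 MySQL sorts GROUP BY results implicitly, and ORDER BY NULL suppresses that sort (illustrative only, using a table from the question):

```sql
SELECT item_id, COUNT(*) AS keyword_count
FROM items_keywords
GROUP BY item_id
ORDER BY NULL;  -- skip the implicit sort that GROUP BY would otherwise do
```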
one of the things that makes this really frustrating is that if I repeatedly run this query, it takes less than .1 seconds. I'm assuming this is due to the query cache — add SQL_NO_CACHE after the SELECT keyword to disable the use of the query cache per this query
All the tables are MyISAM...does this indicate that MySQL is loading stuff into memory and that's why it runs so much faster for awhile — MyISAM uses a key buffer and only caches index data in memory, and relies on the OS to hopefully cache non-index data. Unlike Innodb, which caches everything in the buffer pool.
Is it possible you're having issues because of filesystem I/O? The EXPLAIN shows that there have to be 1544 rows fetched from the ItemsKeyword table. If you have to go to disk for each of those, you'll add about 10-15 seconds total to the run time (assuming a high-ish seek time because you're on a VM). Normally the tables are cached in RAM, or the data is stored close enough together on disk that reads can be combined. However, you're running on a VM with 256MB of RAM, so you may have no memory to spare for caching, and if your table file is fragmented enough, query performance could degrade this much.
You could probably get some idea of what's happening with I/O during the query by running something like pidstat -d 1 or iostat 1 in another shell on the server.
EDIT:
From looking at the query adding an index on (ItemsKeyword.item_id, ItemsKeyword.keyword_id) should fix it if my theory is right about it being a problem with the seeks for the ItemsKeyword table.
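That index as DDL (the index name is illustrative):

```sql
ALTER TABLE items_keywords
  ADD INDEX idx_item_keyword (item_id, keyword_id);
```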
MySQL loads a lot into different caches, including indexes and queries. In addition, your operating system will keep a file system cache that could speed up your query when executed repeatedly.
One thing to consider is how MySQL creates temporary tables during this type of query. As you can see in your explain, a temporary table is being created, probably for sorting of the results. Generally, MySQL will create these temporary tables in memory, except for 2 conditions. First, if they exceed the maximum size set in MySQL settings (max temp table size or heap size - check mysqlperformanceblogs.com for more info on these settings). The second and more important one is this:
Temporary tables will always be created on disk when TEXT or BLOB columns are selected in the query.
This can create a major performance hit, and even lead to an i/o bottleneck if your server is getting any amount of action.
Check to see if any of your columns are of this data type. If they are, you can try to rewrite the query so that a temporary table is not created (group by always causes them, I think), or try not selecting these out. Another strategy would be to break this up into several smaller queries that might execute in a fraction of the time.
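One common shape for the "several smaller queries" idea, assuming the wide TEXT/BLOB columns live on items (an assumption): resolve the 20 ids first without touching the wide columns, then fetch the full rows:

```sql
-- Step 1: ids only; the temp table stays narrow and can remain in memory
SELECT i.id
FROM items AS i
JOIN items_keywords    AS ik ON ik.item_id    = i.id
JOIN keywords_profiles AS kp ON kp.keyword_id = ik.keyword_id
WHERE kp.profile_id = 17
GROUP BY i.id
ORDER BY i.timestamp DESC, i.id DESC
LIMIT 0, 20;

-- Step 2: fetch the wide rows for just those 20 ids
-- SELECT * FROM items WHERE id IN ( ...ids from step 1... );
```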
Good luck!
I may be completely wrong but what happens when you change
WHERE kp.profile_id IN (139)
to
WHERE kp.profile_id = 139
Try this:
SELECT i.id
FROM ((items AS i
INNER JOIN items_keywords AS ik ON ( i.id = ik.item_id ))
INNER JOIN keywords AS k ON ( k.id = ik.keyword_id ))
INNER JOIN keywords_profiles AS kp ON ( k.id = kp.keyword_id AND kp.profile_id = 139)
ORDER BY i.timestamp DESC
LIMIT 20;
Looking at the pastie.org link in the comments to the question:
you're joining items.source_id int(4) to sources.id int(16)
also items.id int(16) to itemskeywords.item_id int(11)
I can't see any good reason for the two fields to have different widths in these cases
I realise that these are just display widths and that the actual range of numbers which the column can store is determined solely by the INT part but the MySQL 6.0 reference manual says:
Note that if you store larger values than the display width in an integer column, you may experience problems when MySQL generates temporary tables for some complicated joins, because in these cases MySQL assumes that the data fits into the original column width.
From the rough figures you quoted, it doesn't look as though you are exceeding the display width on any of the ID columns. You may as well tidy up these inconsistencies though just to eliminate another possible bug.
You might be as well to remove the display widths altogether if you don't have a need for them
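Tidying them up might look like this (the column list and NULL-ability are illustrative; match them to your actual schema):

```sql
-- Make the joined columns the same plain INT; display widths dropped
ALTER TABLE items   MODIFY source_id INT;
ALTER TABLE sources MODIFY id        INT NOT NULL AUTO_INCREMENT;
```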
edit:
I would hazard a guess that the original author of the database perhaps thought that int(4) meant "an integer with up to 4 digits" whereas it actually means "an integer between -2147483648 and 2147482647 displayed with at least 4 characters left-padded with spaces if need be"
Definitions like authors.refreshed int(20) or items.timestamp int(30) don't really make sense as there can only be 10 digits plus the sign in an int. Even a bigint can't exceed 20 characters. Perhaps the original author thought that int(4) was analogous to varchar(4)?
Try a backup copy of your tables. After that rename the original tables to something else, rename the new tables to the original and try again with your new-but-old-named tables...
Or you can try to repair the tables, but this doesn't always help.
Edit: Man, this was an old question...
The problem appears to be that it has to fully join across every single table before it even applies the WHERE clause. With 500k rows per table, you're looking at millions of rows being populated in memory. I would try changing the JOINs to LEFT JOINs USING ().