Table type: MyISAM
Rows: 120k
Data Length: 30MB
Index Length: 40MB
my.ini, MySQL 5.6.2 Windows
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 16M
Windows Server 2012, 12GB RAM, SSD 400MB/s
1. Slow Query:
SELECT article_id, title, author, content, pdate, MATCH(author, title, content)
AGAINST('Search Keyword') AS score FROM articles ORDER BY score DESC LIMIT 10;
Executing this query takes 352 ms and uses the index. Profiling shows that most of the time is spent on "Creating sort index". (Complete detail: http://pastebin.com/raw/jT58DCN5)
2. Faster Query:
SELECT article_id, title, author, content, pdate, MATCH(author, title, content)
AGAINST('Search Keyword') AS score FROM articles LIMIT 10;
Executing this query takes 23 ms but does a full table scan, and I don't like full table scans.
The problem / question is that Query #1 is the one I need to use, since the sorting is very important.
Is there anything I can do to speed up that query, or rewrite it so that it achieves the same result as #1?
Appreciate any input and help.
Maybe you are simply expecting too much? 350ms for doing a
MATCH(author, title, content) AGAINST('Search Keyword')
ORDER BY
on 120k records doesn't sound too shabby to me, especially if content is 'biggish' ...
Keep in mind that for your "SLOW QUERY" to work, the system has to read every row, calculate the score, sort all the scores, figure out the 10 highest values, and then return the relevant row information for them. If you leave out the ORDER BY, it simply picks the first 10 rows and only needs to calculate the score for those 10 rows.
That said, I think the EXPLAIN is a bit misleading in that it seems to blame everything on the SORT, while most probably it is the MATCH that takes up most of the time. I'm guessing the MATCH() operator is executed in a 'lazy' manner and thus is only run when the data is asked for, which in this case is while the sorting is happening.
To figure this out simply add a new column score and split up the query into 2 parts.
UPDATE articles SET score = MATCH(author, title, content) AGAINST('Search Keyword'); -- will take about 300 ms, I guess
SELECT article_id, title, author, content, pdate, score FROM articles ORDER BY score DESC LIMIT 10; -- will take about 50 ms, I guess
Of course this is no working solution, but if I'm right it will show you that your problem is not with the SORT but rather with the fulltext searching...
PS: you forgot to mention what the indexes on the table are; that might be useful to know too. cf https://dev.mysql.com/doc/refman/5.7/en/innodb-fulltext-index.html
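In case it helps, the existing indexes can be listed like this:
SHOW CREATE TABLE articles;  -- full table definition, including every index
SHOW INDEX FROM articles;    -- or a per-index listing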
Try these variations:
AGAINST('word')
AGAINST('+word')
AGAINST('+word' IN BOOLEAN MODE)
Try
SELECT ... MATCH ...,
FROM tbl
WHERE MATCH ... -- repeat the test here
...
The test is to eliminate the rows that do not match at all, thereby greatly diminishing the number of rows to sort. (Again, depending on + and BOOLEAN.)
(I usually use all three: +, BOOLEAN, and WHERE MATCH.)
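Putting those pieces together, the rewrite would look roughly like this (just a sketch; '+Search +Keyword' merely illustrates the + operator applied to your real search terms):
SELECT  article_id, title, author, content, pdate,
        MATCH(author, title, content)
            AGAINST('+Search +Keyword' IN BOOLEAN MODE) AS score
    FROM  articles
    WHERE  MATCH(author, title, content)
            AGAINST('+Search +Keyword' IN BOOLEAN MODE)
    ORDER BY  score DESC
    LIMIT  10;
The WHERE clause repeats the MATCH test, so only rows that actually match are handed to the sort.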
key_buffer_size = 2G may also help.
You should consider moving to InnoDB, where FT is faster.
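A minimal sketch of the conversion (this assumes a MySQL version whose InnoDB supports FULLTEXT; 5.6.2 predates that support, so a small upgrade within the 5.6 series would come first):
ALTER TABLE articles ENGINE=InnoDB;  -- table copy; the FULLTEXT index is rebuilt on the InnoDB copy
SHOW CREATE TABLE articles;          -- verify the engine and the index afterwards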
Related
We have a WordPress site with a database with 2 million records in wp_posts. New Relic is showing that we have several slow queries, but this one stands out:
SELECT wp_posts.ID
FROM wp_posts
WHERE ?=? AND wp_posts.post_type = ? AND ((wp_posts.post_status = ?)) ORDER BY wp_posts.post_title ASC
LIMIT ?, ?
And EXPLAIN shows this:
I don't know exactly how I can optimize this query, but I guess I need to add an index on (id, post_type, post_status, post_title)?
What you need is INDEX(type, status, title, ID). But that won't work, because you cannot have a TEXT column (which is what post_title is) in an index.
So there is no optimization and no useful index.
WP is not designed to scale. 2M rows is a lot, though not so bad as it would be if you were joining to postmeta or other tables.
This query needs to collect all the titles of a certain type and status, sort them, skip over the OFFSET, and finally deliver some IDs.
If you are using OFFSET because of "pagination", then you have a second problem -- the query will be slower and slower as the user pages through the list. (Notice the "skip" step I mentioned.)
If you could limit titles to under 255 (or maybe 191, depending on version and charset) characters, then the INDEX I mentioned would work, and it would work a lot faster. That would involve an ALTER TABLE to change post_title to VARCHAR(...), which would take some time and would lose data if any titles are truncated.
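A sketch of what that would look like (191 assumes utf8mb4 under the old 767-byte index-prefix limit; the index name here is made up, and any longer title would be truncated):
ALTER TABLE wp_posts MODIFY post_title VARCHAR(191) NOT NULL;  -- would truncate titles longer than 191 chars
ALTER TABLE wp_posts
    ADD INDEX type_status_title_id (post_type, post_status, post_title, ID);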
A possible solution... Since it is not reasonable to deliver 2M rows to the user, can we guess that type and/or status is rather selective? That is, are there really only at most a few hundred rows of interest? In that case, this may help:
INDEX(post_type, post_status) -- in either order.
I was wondering what would be faster, and what the tradeoffs are of using one or the other query?
SELECT * FROM table WHERE somecolumn = 'something' LIMIT 999;
vs.
SELECT * FROM table WHERE somecolumn = 'something';
Now, considering that the query will never return more than a couple of hundred rows, does using LIMIT 999 make any significant performance impact or not?
I'm looking into this option because my project will give the user some kind of option to limit results as he'd like, and he can leave the limit empty to show all, so it's easier for me to keep the LIMIT part of the query and just change the number.
Now, the table is really big, ranging from a couple of hundred thousand to a couple of million rows.
The exact query looks something like:
SELECT SUM(revenue) AS cost,
IF(ISNULL(headline) OR headline = '', 'undefined', headline
) AS headline
FROM `some_table`
WHERE ((date >= '2017-01-01')
AND (date <= '2017-12-31')
)
AND -- (sic)
GROUP BY `headline`
ORDER BY `cost` DESC
As I said before, this query will never return more than about a hundred rows.
Disk I/O, if any, is by far the most costly part of a query.
Fetching each row ranks next.
Almost everything else is insignificant.
However, if the existence of LIMIT can change what the Optimizer does, then there could be a significant difference.
In most cases, including the queries you gave, a too-big LIMIT has no impact.
In certain subqueries, a LIMIT will prevent the elimination of ORDER BY. A subquery is, by definition, a set, not an ordered set; so LIMIT is a kludge to prevent the optimization of removing ORDER BY.
If there is a composite index that includes all the columns needed for WHERE, GROUP BY, and ORDER BY, then the Optimizer can stop when the LIMIT is reached. Other situations go through tmp tables and sorts for GROUP BY and ORDER BY and can do the LIMIT only against a full set of rows.
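As a sketch of the favorable case (ordercolumn is a hypothetical column standing in for whatever you sort by; the index name is made up):
ALTER TABLE `table` ADD INDEX idx_col_order (somecolumn, ordercolumn);
-- With that index, the matching rows arrive already in ORDER BY order,
-- so the server can stop after 999 rows instead of sorting everything:
SELECT * FROM `table`
    WHERE somecolumn = 'something'
    ORDER BY ordercolumn
    LIMIT 999;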
Two caches were alluded to in the Comments so far.
"Query cache" -- This records exact queries and their result sets. If it is turned on and if it applicable, then the query comes back "instantly". By "exact", I include the existence and value of LIMIT.
To speed up all queries, data and index blocks are "cached" in RAM (see innodb_buffer_pool_size). This avoids disk I/O when a similar (not necessarily identical) query is run. See my first sentence, above.
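For instance, you can check how much RAM is currently allotted to that cache:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- value is in bytes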
I am reading High performance MySQL and I am a little confused about deferred join.
The book says that the following operation cannot be optimized by index(sex, rating) because the high offset requires them to spend most of their time scanning a lot of data that they will then throw away.
mysql> SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;
While a deferred join helps minimize the amount of work MySQL must do gathering data that it will only throw away.
SELECT <cols> FROM profiles INNER JOIN (
SELECT <primary key cols> FROM profiles
WHERE sex='M' ORDER BY rating LIMIT 100000, 10
) AS x USING(<primary key cols>);
Why does a deferred join minimize the amount of gathered data?
The example you presented assumes that InnoDB is used. Let's say that the PRIMARY KEY is just id.
INDEX(sex, rating)
is a "secondary key". Every secondary key (in InnoDB) includes the PK implicitly, so it is really an ordered list of (sex, rating, id) values. To get to the "data" (<cols>), it uses id to drill down the PK BTree (which contains the data, too) to find the record.
Fast Case: Hence,
SELECT id FROM profiles
WHERE sex='M' ORDER BY rating LIMIT 100000, 10
will do a "range scan" of 100010 'rows' in the index. This will be quite efficient for I/O, since all the information is consecutive, and nothing is wasted. (No, it is not smart enough to jump over 100000 rows; that would be quite messy, especially when you factor in the transaction_isolation_mode.) Those 100010 rows probably fit in about 1000 blocks of the index. Then it gets the 10 values of id.
With those 10 ids, it can do 10 joins ("NLJ" = "Nested Loop Join"). It is rather likely that the 10 rows are scattered around the table, possibly requiring 10 hits to the disk.
Let's "count the disk hits" (ignoring non-leaf nodes in the BTrees, which are likely to be cached anyway): 1000 + 10 = 1010. On ordinary disks, this might take 10 seconds.
Slow Case: Now let's look at the original query (SELECT <cols> FROM profiles WHERE sex='M' ORDER BY rating LIMIT 100000, 10;). Let's continue to assume INDEX(sex, rating) plus the implicit id on the end.
As before, it will index scan through the 100010 rows (est. 1000 disk hits). But as it goes, it is too dumb to do what was done above. It will reach over into the data to get the <cols>. This often (depending on caching) requires a random disk hit. This could be upwards of 100010 disk hits (if the table is huge and caching is not very useful).
Again, 100000 rows are tossed and 10 are delivered. Total 'cost': 100010 disk hits (worst case); at roughly 100 random reads per second, that is about 1000 seconds, or roughly 17 minutes.
Keep in mind that there are 3 editions of High performance MySQL; they were written over the past 13 or so years. You are probably using a much newer version of MySQL than they covered. I do not happen to know if the optimizer has gotten any smarter in this area. These, if available to you, may give clues:
EXPLAIN FORMAT=JSON SELECT ...;
OPTIMIZER TRACE...
My favorite "Handler" trick for studying how things work may be helpful:
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
You are likely to see numbers like 100000 and 10, or small multiples of such. But, keep in mind that a fast range scan of the index counts as 1 per row, and so does a slow random disk hit for a big set of <cols>.
Overview: To make this technique work, the subquery needs a "covering" index, with the columns correctly ordered.
"Covering" means that (sex, rating, id) contains all the columns touched. (We are assuming that <cols> contains other columns, perhaps bulky ones that won't work in an INDEX.)
"Correct" ordering of the columns: The columns are in just the right order to get all the way through the query. (See also my cookbook.)
First come any WHERE columns compared with = to constants. (sex)
Then comes the entire ORDER BY, in order. (rating)
Finally it is 'covering'. (id)
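Put together for this example (assuming id is the PRIMARY KEY, as above; the index name is made up):
ALTER TABLE profiles ADD INDEX sex_rating (sex, rating);
-- InnoDB appends the PK (id) to every secondary index, so this effectively
-- stores (sex, rating, id) -- everything the deferred-join subquery touches.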
From the description in the official documentation (https://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html):
If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. If a filesort must be done, all rows that match the query without the LIMIT clause are selected, and most or all of them are sorted, before the first row_count are found. After the initial rows have been found, MySQL does not sort any remainder of the result set.
From that, we could conclude that the two forms should make no difference.
But Percona suggests the deferred join and gives test data, without giving a reason. I think there may be some "bug" in MySQL when dealing with this kind of case, so we can just regard this as a useful experience.
[site_list] ~100,000 rows... 10mb in size.
site_id
site_url
site_data_most_recent_record_id
[site_list_data] ~ 15+ million rows and growing... about 600mb in size.
record_id
site_id
site_connect_time
site_speed
date_checked
columns in bold are unique index keys.
I need to return 50 most recently updated sites AND the recent data that goes with it - connect time, speed, date...
This is my query:
SELECT SQL_CALC_FOUND_ROWS
site_list.site_url,
site_list_data.site_connect_time,
site_list_data.site_speed,
site_list_data.date_checked
FROM site_list
LEFT JOIN site_list_data
ON site_list.site_data_most_recent_record_id = site_list_data.record_id
ORDER BY site_list_data.date_checked DESC
LIMIT 50
Without the ORDER BY and SQL_CALC_FOUND_ROWS (I need it for pagination), the query takes about 1.5 seconds; with them it takes over 2 seconds, which is not good enough, because the page where this data will be shown gets 20K+ pageviews/day and this query is apparently too heavy (the server almost dies when I put this live) and too slow.
MySQL experts, how would you do this? What if the table got to 100 million records? Caching this huge result into a temp table every 30 seconds is the only other solution I've got.
You need to add a heuristic to the query; you need to gate it to get reasonable performance. As written, it is effectively sorting your site_list_data table by date descending -- the ENTIRE table.
So, if you know that the top 50 will be within the last day or week, add an "AND date_checked > <boundary_date>" to the query. That should reduce the overall result set first, and THEN sort it.
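A sketch of the gated query (the date literal is only an illustrative boundary; note that filtering on the right-hand table also effectively turns the LEFT JOIN into an inner join, which is suggested below anyway):
SELECT SQL_CALC_FOUND_ROWS
       site_list.site_url,
       site_list_data.site_connect_time,
       site_list_data.site_speed,
       site_list_data.date_checked
    FROM site_list
    LEFT JOIN site_list_data
        ON site_list.site_data_most_recent_record_id = site_list_data.record_id
    WHERE site_list_data.date_checked > '2013-01-01'   -- illustrative boundary date
    ORDER BY site_list_data.date_checked DESC
    LIMIT 50;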
SQL_CALC_FOUND_ROWS is slow; use COUNT instead.
A couple of observations.
Both ORDER BY and SQL_CALC_FOUND_ROWS add to the cost of the query. ORDER BY clauses can potentially be helped by appropriate indexing -- do you have an index on your date_checked column? That could help.
What is your exact need for SQL_CALC_FOUND_ROWS? Consider replacing this with a separate query that uses COUNT instead. This can be vastly better assuming your Query Cache is enabled.
And if you can use COUNT, consider replacing your LEFT JOIN with an INNER JOIN as this will help performance as well.
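For example (a sketch; since the main query has no WHERE clause and record_id is unique, so the join does not fan out, counting the rows of site_list gives the same pagination total far more cheaply than SQL_CALC_FOUND_ROWS):
SELECT COUNT(*) FROM site_list;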
Good luck.
I have a database of a million rows which isn't a lot. They're all sorted by cities with a city_id (indexed). I wanted to show the most recent post with:
SELECT * FROM table FORCE INDEX(PRIMARY, city_id) WHERE city_id=1 ORDER BY `id` DESC LIMIT 0,4
id is also labeled primary. Before adding the force index it took 5.9 seconds. I found that solution on SO and it worked great. The query now takes 0.02 seconds.
The problem is that this only seems to work for city_id 1; when I change that city to 2 or 3 or anything else, it seems to be back to 6 seconds.
I'm not certain how MySQL works. Does it index better on frequent queries, or am I missing something here?
Do an explain on your query (with and without the force):
explain SELECT * FROM table WHERE city_id=1 ORDER BY `id` DESC LIMIT 0,4
and see what MySQL tells you about the cost of using a certain index. Concerning your indexing strategy and your FORCE: MySQL loves combined indexes but is usually not very good at combining separate indexes itself, and the PRIMARY index is always available, so there is no need to list it. Concerning your statement, I would use something like this and see if it improves the performance:
select * from table use index(city_id) where city_id=1 ORDER BY `id`
DESC LIMIT 0,4
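If that still is not fast enough, a combined index matching both the WHERE and the ORDER BY is the usual fix (a sketch; the index name is made up, and if the table is InnoDB with id as the PRIMARY KEY, a plain INDEX(city_id) already carries id implicitly):
ALTER TABLE `table` ADD INDEX idx_city_id_id (city_id, id);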
And always keep in mind: measuring the time of a statement on the command line more than once will always let the cache kick in, so it's pretty useless (and maybe that's the reason you get bad performance when you change the city_id to a different value).
You will find a lot of good tips and performance hints in the MySQL Performance Blog.