Simple query optimization (WHERE + ORDER + LIMIT) - mysql

I have this query that runs unbelievably slow (4 minutes):
SELECT * FROM `ad` WHERE `ad`.`user_id` = USER_ID ORDER BY `ad`.`id` desc LIMIT 20;
Ad table has approximately 10 million rows.
SELECT COUNT(*) FROM `ad` WHERE `ad`.`user_id` = USER_ID;
The count is 10k.
Table has following indexes:
PRIMARY KEY (`id`),
KEY `idx_user_id` (`user_id`,`status`,`sorttime`),
EXPLAIN gives this:
id: 1
select_type: SIMPLE
table: ad
type: index
possible_keys: idx_user_id
key: PRIMARY
key_len: 4
ref: NULL
rows: 4249
Extra: Using where
I don't understand why it takes so long. Also, this query is generated by an ORM (pagination), so it would be nice to optimize it from outside (maybe by adding an extra index).
BTW this query works fast:
select aa.*
from (select id from ad where user_id=USER_ID order by id desc limit 20) as a
join ad as aa on a.id = aa.id ;
Edit: I tried another user with far fewer rows (dozens) than the original one. I am wondering why the original query doesn't use idx_user_id:
EXPLAIN SELECT * FROM `ad` WHERE `ad`.`user_id` = ANOTHER_ID ORDER BY `ad`.`id` desc LIMIT 20;
id: 1
select_type: SIMPLE
table: ad
type: ref
possible_keys: idx_user_id
**key: idx_user_id**
key_len: 3
ref: const
rows: 84
Extra: Using where; Using filesort
Edit2: With Alexander's help I decided to try forcing MySQL to use the index I want, and the following query is much faster (1 sec instead of 4 mins):
SELECT *
FROM `ad` USE INDEX (idx_user_id)
WHERE `ad`.`user_id` = 1884774
ORDER BY `ad`.`id` desc LIMIT 20;

In the EXPLAIN output you can see that the key is PRIMARY with type index. This means the MySQL optimizer decided it would be faster to walk the table in primary key order (so the rows are already sorted by id) and pick out the first 20 records with the given user_id value than to use the idx_user_id key, which it considered as a possible key and then rejected.
In your second query the optimizer sees that the subquery needs only id values, so it decides to use the idx_user_id index instead, because that index lets it compute the list of required ids without touching the table rows. Then only 20 records are fetched by primary key lookup, which is a very fast operation for so few records.
As your query with ANOTHER_ID shows, MySQL's wrong decision was based on the number of rows for the original USER_ID value. That number was large enough that the optimizer guessed it would find the first 20 records with this user_id faster by reading the table rows themselves and skipping those with other user_id values.
If table rows are accessed through an index, random access operations are required. On a typical HDD, random access is about 100 times slower than a sequential scan. So for an index to be useful it must reduce the number of rows to less than about 1% of the total row count. If the rows for a specific USER_ID value account for more than 1% of the total, a full table scan may be more efficient than using the index when all of those rows must be retrieved. But the MySQL optimizer doesn't take into account the fact that only 20 of these rows will be retrieved. So it mistakenly decided not to use the index and to scan the table instead.
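To see where a given user falls relative to that rough 1% threshold, one can compute the fraction of the table that matches. This is a minimal sketch, assuming the user id 1884774 from the question's Edit2 is the slow user; the query itself scans the whole table, so it is meant as a one-off check rather than something to run regularly:

```sql
-- Fraction of `ad` rows that belong to one user.
-- In MySQL the comparison (user_id = ...) evaluates to 1 or 0,
-- so SUM() counts the matching rows.
SELECT SUM(user_id = 1884774)            AS user_rows,
       COUNT(*)                          AS total_rows,
       SUM(user_id = 1884774) / COUNT(*) AS fraction
FROM ad;
```

With the figures given in the question (about 10k rows out of about 10 million), the fraction works out to roughly 0.001, i.e. 0.1%.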
To make your query fast for any user_id value, you can add one more index that lets the query run in the fastest way possible:
create index idx_user_id_2 on ad(user_id, id);
This index allows MySQL to do both the filtering and the sorting. For that to work, the columns used for filtering must come first and the columns used for ordering second. MySQL should be smart enough to use this index, because it lets the query find all the required records without skipping any.
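One way to check that the new index is actually picked up is to re-run EXPLAIN after creating it. The following is a sketch, not verified output: the user id is the one from the question's Edit2, and the FORCE INDEX variant is only an assumed fallback in case the optimizer still prefers PRIMARY.

```sql
-- Run after: create index idx_user_id_2 on ad(user_id, id);
EXPLAIN
SELECT * FROM `ad`
WHERE `ad`.`user_id` = 1884774
ORDER BY `ad`.`id` DESC
LIMIT 20;
-- Things to look for: key = idx_user_id_2, type = ref,
-- and no "Using filesort" in the Extra column.

-- Fallback if the optimizer still chooses PRIMARY: an index hint.
SELECT * FROM `ad` FORCE INDEX (idx_user_id_2)
WHERE `ad`.`user_id` = 1884774
ORDER BY `ad`.`id` DESC
LIMIT 20;
```

Since the query is generated by an ORM, the hint is the less convenient option; the EXPLAIN check alone shows whether the extra index is enough.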

Related

MySQL: Efficiency of views containing GROUP BY

The fact that I haven't been able to come up (or research) a solution to this question means that I'm either too stupid to read the docs or it is in fact a complicated problem.
In a rather big database I often need a query like this:
SELECT ... WHERE condition GROUP BY something;
This takes a fraction of a second to complete. So I put this in a VIEW:
CREATE VIEW view_x AS SELECT ... GROUP BY something;
And when I then do
SELECT * FROM view_x WHERE condition;
it takes more than a minute to complete. Now it's easy to see why: in the plain SELECT, the DB engine first selects a few hundred results from millions of records and then aggregates and groups only the matching records. When using the view, it seems to first evaluate the entire dataset, aggregating and grouping everything, and then return only the records meeting the condition, throwing away the expensively calculated rest.
Is there a more intelligent VIEW solution, or do I have to use the full SELECT each time?
Thanks.
EDIT: Here's the original SQL code for the view:
CREATE VIEW v_status1 AS SELECT
FROM_UNIXTIME(J.ts_start) AS job_start,
J.id AS job_id, J.carrier, J.n_wafers,
count(W.id) AS n
FROM job AS J
JOIN wafer AS W ON J.id=W.job_id
GROUP BY J.carrier, J.n_wafers, W.status_id;
table job: 100k records, table wafer: 2M records.
Comparison is between these queries:
SELECT * FROM v_status1 WHERE carrier LIKE 'W96L00%'; -- very slow
versus the identical SELECT in the VIEW definition with the WHERE clause before the GROUP BY clause.
Some additional information: The query yields 9 records. Using the view it takes 19 seconds to execute. Using the direct query, it takes 0.000 seconds according to MySQL Workbench.
When I replace the WHERE clause in the direct query by a HAVING clause with the same condition at the end of the query, I end up at the same execution time as the query using the view.
Yes, I forgot some columns in the GROUP BY part. Putting them in doesn't make much of a difference.
Minimal example (5 seconds execution time):
CREATE VIEW v_status2 AS SELECT
job_id,
status_id,
count(id) AS n
FROM wafer
GROUP BY job_id, status_id;
This yields 2 records for a given job_id.
Well, I did the obvious and asked MySQL to EXPLAIN. The output is below. My interpretation is what I suspected all along: MySQL first builds a temporary table, doing all the hard work of aggregating and grouping, and then selects only the rows matching the selection criteria. In other words, MySQL is not intelligent enough to first analyze the view to find where it can efficiently cull the original dataset and only work on the remaining records.
BTW, this has nothing to do with joins and indexes. You can see the effect with any sufficiently large two-column table.
id  select_type  table       type    possible_keys                 key                  key_len  ref                 rows    Extra
1   PRIMARY      <derived2>  ALL     NULL                          NULL                 NULL     NULL                952929  Using where
2   DERIVED      WS          index   PRIMARY                       ix_waferstatus_text  123      NULL                9       Using index; Using temporary; Using filesort
2   DERIVED      W           ref     ix_wafer_job_id,wafer_ibfk_2  wafer_ibfk_2         5        jobwatch.WS.id      105881  Using where
2   DERIVED      J           eq_ref  PRIMARY,job_ibkf_2            PRIMARY              4        jobwatch.W.job_id   1       Using where
2   DERIVED      T           eq_ref  PRIMARY                       PRIMARY              4        jobwatch.J.tool_id  1
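The comparison described in the question can be reproduced with the minimal view v_status2. This is a sketch assuming a hypothetical job_id value of 12345; the second statement is simply the view's SELECT with the filter placed before the GROUP BY:

```sql
-- Through the view: as observed in the question, the grouped result
-- is built first and only then filtered.
SELECT * FROM v_status2 WHERE job_id = 12345;  -- 12345 is a made-up id

-- Direct query: rows are filtered first, then grouped.
SELECT job_id,
       status_id,
       COUNT(id) AS n
FROM wafer
WHERE job_id = 12345
GROUP BY job_id, status_id;
```

MySQL's view documentation notes that the MERGE algorithm cannot be used for views containing aggregate functions or GROUP BY, so such views fall back to the TEMPTABLE algorithm. That is consistent with the derived table shown in the EXPLAIN output above.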

Basic query is unexpectedly slow in MySQL

I am running a basic select on a table with 189,000 records. The table structure is:
items
id - primary key
ad_count - int, indexed
company_id - varchar, indexed
timestamps
the select query is:
select *
from `items`
where `company_id` is not null
and `ad_count` <= 100
order by `ad_count` desc, `items`.`id` asc
limit 50
On my production servers, just the MySQL portion of the execution takes 300 - 400ms
If I run an explain, I get:
select_type: SIMPLE
table: items
type: range
possible_keys: items_company_id_index,items_ad_count_index
key: items_company_id_index
key_len: 403
ref: NULL
rows: 94735
Extra: Using index condition; Using where; Using filesort
When fetching this data in our application, we paginate it in groups of 50, and the above query is "the first page".
I'm not too familiar with dissecting explain queries. Is there something I'm missing here?
An ORDER BY clause that mixes sort directions can force MySQL to create temporary tables and use filesort. MySQL up to and including v5.7 doesn't handle such cases well at all, and there is no point in indexing the ORDER BY columns for them, because the optimizer will not use the index for such a sort.
Therefore, if the application's requirements allow it, it's best to use the same direction for all columns in the ORDER BY clause.
So in this case:
order by `ad_count` desc, `items`.`id` asc
Will become:
order by `ad_count` desc, `items`.`id` desc
P.S. As a small pointer for further reading: it seems that MySQL 8.0 is going to change things, and these use cases might perform significantly better once it is released.
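For reference, MySQL 8.0 did add descending indexes, which is the change hinted at above. The following is a minimal sketch, assuming MySQL 8.0 or later and a hypothetical index name; in 5.7 and earlier the DESC keyword in an index definition is parsed but ignored:

```sql
-- Index whose key order matches the mixed-direction ORDER BY.
-- The index name is made up for this example.
CREATE INDEX items_ad_count_id_index ON items (ad_count DESC, id ASC);

-- The original query, unchanged; compare its EXPLAIN output
-- before and after creating the index.
EXPLAIN
SELECT *
FROM `items`
WHERE `company_id` IS NOT NULL
  AND `ad_count` <= 100
ORDER BY `ad_count` DESC, `items`.`id` ASC
LIMIT 50;
```

Whether the optimizer actually picks this index over items_company_id_index depends on the data, so the EXPLAIN output is worth checking rather than assuming.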
Try replacing items_company_id_index with a multi-column index on (company_id, ad_count).
DROP INDEX items_company_id_index ON items;
CREATE INDEX items_company_id_ad_count_index ON items (company_id, ad_count);
This will allow it to use the index to test both conditions in the WHERE clause. Currently, it's using the index just to find non-null company_id, and then doing a full scan of those records to test ad_count. If most records have non-null company_id, it's scanning most of the table.
You don't need to keep the old index on company_id alone: because of the way B-trees work, a multi-column index also serves as an index on any prefix of its columns.
I could be wrong here (depending on your SQL version this could be faster), but try an INNER JOIN with your company table.
Like:
Select *
From items
INNER JOIN companies ON companies.id = items.company_id
and items.ad_count <= 100
LIMIT 50;
Because you have many indexes, maintaining the B-trees slows the database down each time a new row is inserted. Maybe remove the index on ad_count? (This depends on how often you use that column in queries.)

mysql order by -id vs order by id desc

I wish to fetch the last 10 rows from the table of 1 M rows.
CREATE TABLE `test` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`updated_date` datetime NOT NULL,
PRIMARY KEY (`id`)
)
One way of doing this is -
select * from test order by -id limit 10;
**10 rows in set (0.14 sec)**
Another way of doing this is -
select * from test order by id desc limit 10;
**10 rows in set (0.00 sec)**
So I did an 'EXPLAIN' on these queries -
Here is the result for the query where I use 'order by desc'
EXPLAIN select * from test order by id desc limit 10;
And here is the result for the query where I use 'order by -id'
EXPLAIN select * from test order by -id limit 10;
I thought these would be the same, but it seems there are differences in the execution plan.
RDBMSs use heuristics to compute the execution plan; they cannot always determine that two statements are semantically equivalent, because that problem is too hard (in both theoretical and practical complexity).
So MySQL is not able to use the index, because you do not have an index on "-id", which is an expression applied to the field "id". It seems trivial, but RDBMSs must minimize the time spent computing plans, so they get stuck on simple problems.
When no optimization (such as using an index) can be found for a query, the system falls back to the implementation that works in every case: a scan of the full table.
As you can see in the EXPLAIN results:
1 : order by id
MySQL uses the index on id, so it needs to read only 10 rows in index order. It also doesn't need the filesort algorithm, because the index already provides the order.
2 : order by -id
MySQL does not use the index on id, so it needs to read all the rows (e.g. 455952) to get the expected result. In this case MySQL needs the filesort algorithm, because there is no index on the expression -id. So it will obviously take more time :)
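According to the MySQL documentation, MySQL cannot use an index to resolve the ORDER BY in cases such as these:
You use ORDER BY with an expression that includes terms other than the key column name: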
SELECT * FROM t1 ORDER BY ABS(key);
SELECT * FROM t1 ORDER BY -key;
You index only a prefix of a column named in the ORDER BY clause. In this case, the index cannot be used to fully resolve the sort order. For example, if you have a CHAR(20) column, but index only the first 10 bytes, the index cannot distinguish values past the 10th byte and a filesort will be needed.
The type of table index used does not store rows in order. For example, this is true for a HASH index in a MEMORY table.
For details, see: http://dev.mysql.com/doc/refman/5.7/en/order-by-optimization.html

Do Mysql sub queries use indexes too?

I have a table that I do fulltext searching on. It's already getting big with a relatively small number of users: 20 million rows.
Searches only ever need to cover rows that belong to the PKs relevant to the search, i.e. rows that belong to that user, and that is at most about 200 000 per user. I figured that if the fulltext search were done only on a subquery that first selects that user's rows, it should be super fast, e.g.
SELECT * FROM
(SELECT * FROM table1 WHERE userID = 2 ) AS r
WHERE MATCH (r.fullTextCol1) AGAINST ('+monkey* ' IN BOOLEAN MODE)
ORDER BY r.fullTextCol1, r.fullTextCol2 ASC LIMIT 0,50
However, this query takes 4 seconds.
EXPLAIN says...
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 185927 Using where; Using filesort
2 DERIVED table1 ref PRIMARY,unique unique 4 193082
My indexes are:
PRIMARY (userID, userSubList, userItemID)
FULLTEXT fullTextCol1
FULLTEXT fullTextCol2
The subquery seems to not use the userID index at all.
Is my thinking right in approaching it like this, i.e. sub-selecting the relevant user's rows to search on?
Thanks for your time and help.
Have you tried it like this?
SELECT *
FROM table1
WHERE userID = 2
AND MATCH (fullTextCol1) AGAINST ('+monkey* ' IN BOOLEAN MODE)
ORDER BY fullTextCol1, fullTextCol2 ASC LIMIT 0,50;
Or run it without ORDER BY to check whether the search itself or the ordering is slow (or both).
EDIT
In your case a composite index on (userID, fullTextCol1) is what is needed, but MySQL doesn't support it. Someone else has already answered this; see "Compound FULLTEXT index in MySQL".
Please let me know whether the above answer makes sense and what the result is.

Slow MySQL query with AS and subquery

I have a problem with this slow query that runs for 10+ seconds:
SELECT DISTINCT siteid,
       storyid,
       added,
       title,
       subscore1,
       subscore2,
       subscore3,
       ( 1 * subscore1 + 0.8 * subscore2 + 0.1 * subscore3 ) AS score
FROM articles
WHERE added > '2011-10-23 09:10:19'
  AND ( articles.feedid IN (SELECT userfeeds.siteid
                            FROM userfeeds
                            WHERE userfeeds.userid = '1234')
        OR ( articles.title REGEXP '[[:<:]]keyword1[[:>:]]' = 1
             OR articles.title REGEXP '[[:<:]]keyword2[[:>:]]' = 1 ) )
ORDER BY score DESC
LIMIT 0, 25
This outputs a list of stories based on the sites that a user added to his account. The ranking is determined by score, which is made up out of the subscore columns.
The query uses filesort and uses indices on PRIMARY and feedid.
Results of an EXPLAIN:
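id: 1
select_type: PRIMARY
table: articles
type: range
possible_keys: PRIMARY,added,storyid
key: PRIMARY
rows: 729263
Extra: Using where; Using filesort
id: 2
select_type: DEPENDENT SUBQUERY
table: userfeeds
type: index_subquery
possible_keys: storyid,userid,siteid_storyid
key: siteid
ref: func
rows: 1
Extra: Using where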
Any suggestions to improve this query? Thank you.
I would move the calculation logic to the client and only load the fields from the database. This makes your query and the calculation itself faster. It's not good style to do such things in SQL code.
Also, the regex is very slow; another search mode such as LIKE may be faster.
Looking at your EXPLAIN, it doesn't appear your query is using any index for the sort (hence the filesort). This is caused by sorting on the calculated column (score).
Another barrier is the size of the table (729263 rows). You don't want to create an index that is too wide, as it will take much more space and hurt the performance of your CUD (create, update, delete) operations. Ideally we would target the columns being selected, but here we can't, since the sort column is calculated. You can try creating a VIEW, removing the sort, or doing the sort at the application layer.