For example, I have the following table:
table Product
------------
id
category_id
processed
product_name
This table has single-column indexes on id, category_id, and processed, plus a composite index on (category_id, processed). The statistics on this table are:
select count(*) from Product; -- 50M records
select count(*) from Product where category_id=10; -- 1M records
select count(*) from Product where processed=1; -- 30M records
The simplest query I want to run is (select * is a must):
select * from Product
where category_id=10 and processed=1
order by id ASC LIMIT 100
Without the LIMIT, the query above returns only about 10,000 records.
I want to run this query multiple times. After each fetch, I update the processed field to 0 for the returned rows (so they will not appear in the next query). When I test on real data, the optimizer sometimes tries to use id as the key, which costs a lot of time.
How can I optimize the above query (in general terms)?
P/S: to avoid confusion, I know that the best index would be (category_id, processed, id). But I cannot change the indexes. My question is only about optimizing the query itself.
Thanks
For this query:
select *
from Product
where category_id = 10 and processed = 1
order by id asc
limit 100;
The optimal index is on product(category_id, processed, id). This is a single index with a three-part key, with the keys in this order.
Given that you have INDEX(category_id, processed), there is virtually no advantage in also having just INDEX(category_id). So DROP the latter.
That may have the beneficial side effect of pushing the Optimizer toward the composite INDEX(category_id, processed), which is at least "better" for the query.
Without touching the indexes, you could use a FORCE INDEX mentioning the composite index's name. But I don't recommend it. "It may help today, but hurt tomorrow, after the data changes."
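For concreteness, such a hint would look like the sketch below. The index name cat_proc is a placeholder; check SHOW INDEX FROM Product for the real name of the composite index.

```sql
SELECT *
FROM Product FORCE INDEX (cat_proc)  -- cat_proc is a hypothetical index name
WHERE category_id = 10 AND processed = 1
ORDER BY id ASC
LIMIT 100;
```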
Why do you say "But I cannot change the index"? Newer versions of MySQL/MariaDB make ADD/DROP INDEX much faster than older versions. Also, pt-online-schema-change provides a fast way.
Related
I am looking to understand how a query with both WHERE and ORDER BY can be indexed properly. Say I have a query like:
SELECT *
FROM users
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3
With an index on date_created, it seems like the execution plan will prefer to use the PRIMARY key and then sort the results itself. This seems to be very slow when it needs to sort a large amount of results.
I was reading through this guide on indexing for ordered queries which mentions an almost identical example and it mentions:
If the database uses a sort operation even though you expected a pipelined execution, it can have two reasons: (1) the execution plan with the explicit sort operation has a better cost value; (2) the index order in the scanned index range does not correspond to the order by clause.
This makes sense to me but I am unsure of a solution. Is there a way to index my particular query and avoid an explicit sort or should I rethink how I am approaching my query?
The Optimizer is caught between a rock and a hard place.
Plan A: Use an index starting with id; collect however many rows that is; sort them; then deliver only 3. The downside: If the list is large and the ids are scattered, it could take a long time to find all the candidates.
Plan B: Use an index starting with date_created, filtering on id until it gets 3 items. The downside: it may have to scan all the rows before finding 3.
If you know that the query will always work better with one query plan than the other, you can use an "index hint". But, when you get it wrong, it will be a slow query.
A partial answer... If * contains bulky columns, both approaches may be hauling around stuff that will eventually be tossed. So, let's minimize that:
SELECT u.*
FROM ( SELECT id
FROM users
WHERE id IN (1, 2, 3, 4)
ORDER BY date_created
LIMIT 3 -- not repeated
) AS x
JOIN users AS u USING(id)
ORDER BY date_created; -- repeated
Together with
INDEX(date_created, id),
INDEX(id, date_created)
Hopefully, the Optimizer will pick one of those "covering" indexes to perform the "derived table" (subquery). If so that will be somewhat efficiently performed. Then the JOIN will look up the rest of the columns for the 3 desired rows.
If you want to discuss further, please provide
SHOW CREATE TABLE.
How many ids you are likely to have.
Why you are not already JOINing to another table to get the ids.
Approximately how many rows in the table.
Your best bet might be to write this in a more complicated way:
SELECT u.*
FROM ((SELECT u.*
FROM users u
WHERE id = 1
ORDER BY date_created
LIMIT 3
) UNION ALL
(SELECT u.*
FROM users u
WHERE id = 2
ORDER BY date_created
LIMIT 3
) UNION ALL
(SELECT u.*
FROM users u
WHERE id = 3
ORDER BY date_created
LIMIT 3
) UNION ALL
(SELECT u.*
FROM users u
WHERE id = 4
ORDER BY date_created
LIMIT 3
)
) u
ORDER BY date_created
LIMIT 3;
Each of the subqueries will now use an index on users(id, date_created). The outer query is then sorting at most 12 rows, which should be trivial from a performance perspective.
You could create a composite index on (id, date_created) - that will give the engine the option of using an index for both steps - but the optimiser may still choose not to.
If there aren't many rows in your table or it thinks the resultset will be small it's quicker to sort after the fact than it is to traverse the index tree.
If you really think you know better than the optimiser (which you don't), you can use index hints to tell it what to do, but this is almost always a bad idea.
I have problem with MySQL ORDER BY, it slows down query and I really don't know why, my query was a little more complex so I simplified it to a light query with no joins, but it stills works really slow.
Query:
SELECT
W.`oid`
FROM
`z_web_dok` AS W
WHERE
W.`sent_eRacun` = 1 AND W.`status` IN(8, 9) AND W.`Drzava` = 'BiH'
ORDER BY W.`oid` ASC
LIMIT 0, 10
The table has 946,566 rows and takes about 500 MB of memory. The fields I am selecting are all indexed as follows:
oid - INT PRIMARY KEY AUTO_INCREMENT
status - INT INDEXED
sent_eRacun - TINYINT INDEXED
Drzava - VARCHAR(3) INDEXED
I am attaching screenshots: first the EXPLAIN of the query, next the query executed against the database, and finally the speed after I remove ORDER BY.
I have also tried sorting by a DATETIME field, which is also indexed, but I get the same slow query as when ordering by the primary key. This started today; it was always fast and light before.
What can cause something like this?
The kind of query you use here calls for a composite covering index. This one should handle your query very well.
CREATE INDEX someName ON z_web_dok (Drzava, sent_eRacun, status, oid);
Why does this work? You're looking for equality matches on the first three columns, and sorting on the fourth column. The query planner will use this index to satisfy the entire query. It can random-access the index to find the first row matching your query, then scan through the index in order to get the rows it needs.
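The effect can be demonstrated with SQLite (used here only because it ships with Python; MySQL's EXPLAIN output looks different, but the covering-index behavior is the same, and the schema below is a guess at the question's table):

```python
import sqlite3

# Minimal stand-in for z_web_dok (column types assumed from the question).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE z_web_dok (
    oid INTEGER PRIMARY KEY,
    status INTEGER,
    sent_eRacun INTEGER,
    Drzava TEXT
);
CREATE INDEX someName ON z_web_dok (Drzava, sent_eRacun, status, oid);
""")

plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT oid FROM z_web_dok
WHERE sent_eRacun = 1 AND status IN (8, 9) AND Drzava = 'BiH'
ORDER BY oid ASC LIMIT 10
""").fetchall()
for row in plan:
    print(row[-1])  # the plan mentions a COVERING INDEX: no table lookups needed
```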
Pro tip: Indexes on single columns are generally harmful to performance unless they happen to match the requirements of particular queries in your application, or are used for primary or foreign keys. You generally choose your indexes to match your most active, or your slowest, queries. Edit: You asked whether it's better to create specific indexes for each query in your application. The answer is yes.
There may be an even faster way. (Or it may not be any faster.)
The IN(8, 9) gets in the way of easily handling the WHERE..ORDER BY..LIMIT completely efficiently. The possible solution is to treat that as OR, then convert to UNION and do some tricks with the LIMIT, especially if you might also be using OFFSET.
( SELECT ... WHERE .. = 8 AND ... ORDER BY oid LIMIT 10 )
UNION ALL
( SELECT ... WHERE .. = 9 AND ... ORDER BY oid LIMIT 10 )
ORDER BY oid LIMIT 10
This will allow the covering index described by OJones to be fully used in each of the subqueries. Furthermore, each will provide up to 10 rows without any temp table or filesort. Then the outer part will sort up to 20 rows and deliver the 'correct' 10.
For OFFSET, see http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or
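To check that the UNION ALL rewrite returns the same rows as the original query, here is a small SQLite sketch (toy data; column names taken from the question, everything else assumed):

```python
import sqlite3

# Toy data standing in for z_web_dok (schema assumed from the question).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE z_web_dok (
    oid INTEGER PRIMARY KEY, status INTEGER,
    sent_eRacun INTEGER, Drzava TEXT)""")
rows = [(i, 8 if i % 2 else 9, 1, "BiH") for i in range(1, 101)]
conn.executemany("INSERT INTO z_web_dok VALUES (?,?,?,?)", rows)

plain = conn.execute("""
    SELECT oid FROM z_web_dok
    WHERE sent_eRacun = 1 AND status IN (8, 9) AND Drzava = 'BiH'
    ORDER BY oid LIMIT 10""").fetchall()

# Same query as UNION ALL of the two status values, each with its own LIMIT,
# then a final sort-and-limit over at most 20 rows.
union = conn.execute("""
    SELECT oid FROM (
        SELECT oid FROM z_web_dok
        WHERE sent_eRacun = 1 AND status = 8 AND Drzava = 'BiH'
        ORDER BY oid LIMIT 10
    ) UNION ALL
    SELECT oid FROM (
        SELECT oid FROM z_web_dok
        WHERE sent_eRacun = 1 AND status = 9 AND Drzava = 'BiH'
        ORDER BY oid LIMIT 10
    )
    ORDER BY oid LIMIT 10""").fetchall()

print(plain == union)  # True: the same 10 rows either way
```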
I have 50K records in my table. This is my query.
SELECT * FROM user_details WHERE user_id > 10 AND status = 1 LIMIT 0,10
When I EXPLAIN this query, it still traverses 24,959 rows. Can it be optimized further so that it traverses fewer rows?
You're filtering by both user_id > 10 and status = 1.
There is no index on status.
user_id > 10 will be nearly every row in the table. A full table scan will be faster than using the index.
So the optimizer has decided a full table scan will be fastest.
You can fix this by adding an index on status. A longer term solution might be to partition the table on status.
Side note: rather than relying on magic user_ids, consider adding an explicit field to user to indicate their role.
I am having a problem with the following task using MySQL. I have a table Records(id,enterprise, department, status). Where id is the primary key, and enterprise and department are foreign keys, and status is an integer value (0-CREATED, 1 - APPROVED, 2 - REJECTED).
Now, usually the application need to filter something for a concrete enterprise and department and status:
SELECT * FROM Records WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
The order by is required, since I have to provide the user with the most recent records. For this query I have created an index (enterprise, department, status), and everything works fine. However, for some privileged users the status should be omitted:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
This obviously breaks the index - it's still good for filtering, but not for sorting. So, what should I do? I don't want create a separate index (enterprise, department), so what if I modify the query like this:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
AND status IN (0,1,2)
ORDER BY id desc LIMIT 0,10;
MySQL definitely does use the index now, since it's provided with values of status. But how quick will the sorting by primary key be? Will it take the most recent 10 values for each status and then merge them, or will it first merge the ids for all statuses together and only then take the first ten (which I guess would be much slower)?
All of the queries will benefit from one composite index:
INDEX(enterprise, department, status, id)
enterprise and department can swapped, but keep the rest of the columns in that order.
The first query will use that index for both the WHERE and the ORDER BY, thereby be able to find the 10 rows without scanning the table or doing a sort.
The second query is missing status, so my index is less than perfect. This would be better:
INDEX(enterprise, department, id)
At that point, it works like above. (Note: If the table is InnoDB, then this 3-column index is identical to your 2-column INDEX(enterprise, department) -- the PK is silently included.)
The third query gets dicier because of the IN. Still, my 4-column index will be nearly the best. It will use the first 3 columns, but it won't be able to use id for the ORDER BY, and it won't be able to consume the LIMIT. Hence the EXPLAIN will say Using temporary and/or Using filesort. Don't worry, performance should still be nice.
My second index is not as good for the third query.
See my Index Cookbook.
"How quick will sorting by id be?" That depends on three things:
Whether the sort can be avoided (see above);
How many rows the query matches without the LIMIT;
Whether you are selecting TEXT columns.
I was careful to say whether the INDEX is used all the way through the ORDER BY, in which case there is no sort, and the LIMIT is folded in. Otherwise, all the rows (after filtering) are written to a temp table, sorted, then 10 rows are peeled off.
The "temp table" I just mentioned is necessary for various complex queries, such as those with subqueries, GROUP BY, ORDER BY. (As I have already hinted, sometimes the temp table can be avoided.) Anyway, the temp table comes in 2 flavors: MEMORY and MyISAM. MEMORY is favorable because it is faster. However, TEXT (and several other things) prevent its use.
If MEMORY is used then Using filesort is a misnomer -- the sort is really an in-memory sort, hence quite fast. For 10 rows (or even 100) the time taken is insignificant.
Let's say I have the following table:
Orders
======
OrderID
CustomerID
StatusID
DateCreated
And I have the following queries:
select CustomerID from Orders where OrderID = 100
select OrderID from Orders where CustomerID = 20
select OrderID, StatusID from Orders where CustomerID = 100 and OrderID = 1000
If I make the following index:
create nonclustered index Example
On dbo.Orders(OrderID,CustomerID)
Include(StatusID)
Does that take care of optimization of all 3 queries with one index? In other words, do composite indexes improve queries that use one of the items within the composite? Or should individual indexes be created just on those columns as well (ie OrderID, CustomerID) in order to satisfy queries 1 and 2?
This index will not help the second query, since it is not possible to first seek on the left-most column in the index.
Think about a phone book. First, try to find all the people with the last name Smith. Easy, right? Now, try to find all the people with the first name John. The way the "index" in a phone book works is LastName, FirstName. When you're trying to find all the Johns, having them first sorted by LastName is not helpful at all - you still have to look through the whole phone book, because there will be a John Anderson and a John Zolti and everything in between.
To get the most out of all three queries, and assuming these are the only three query forms that are in use, I would potentially suggest an additional index:
create nonclustered index Example2
On dbo.Orders(CustomerID) INCLUDE(OrderID)
(If OrderID is the primary key, you shouldn't need to INCLUDE it.)
However, this should be tested against your workload, since indexes aren't free - they do require additional disk space, and extra work to be maintained when you are running DML queries. And again, this assumes that the three queries listed in your question are the only shapes you use. If you have other queries with different output columns or different where clauses, all bets are off.
The answers above are correct, but they leave one thing out. You probably want a clustered index on OrderID. And if you create a clustered index on OrderID, then any nonclustered index on the table will automatically include that value, since nonclustered index entries point to the clustered index key on tables that have one. So you would want this:
create clustered index IX_ORDERS_ORDERID on ORDERS (OrderID)
go
create nonclustered index IX_ORDERS_CustomerID
On dbo.Orders(CustomerID)
Include(StatusID)
go
Now you have fast search on OrderID or CustomerID, and all your queries will run well. Do you know how to see the execution plan? In SQL Server Management Studio, go to Query → Include Actual Execution Plan, then run your queries. You'll see a graphical representation of which indexes are used, whether a seek or scan is executed, and so on.
create nonclustered index Example
On dbo.Orders(OrderID,CustomerID)
Include(StatusID)
Read this index as: Create a system maintained copy of OrderID, CustomerID, StatusID from the Orders table. Order this copy by OrderID and break ties with CustomerID.
select CustomerID from Orders where OrderID = 100
Since the index is ordered by OrderID, finding the first qualifying record is fast. Once we find the first record, we can continue reading in the index until we find one where OrderID isn't 100. Then we can stop. Since all the columns we want are in the index, we don't have to lookup into the actual table. Great!
select OrderID from Orders where CustomerID = 20
Since the index is ordered by OrderID and then by CustomerID, qualifying records could appear anywhere in the index. The first record might qualify (OrderID = 1, CustomerID = 20). The last record might qualify (OrderID = 1000000000, CustomerID = 20). We must read the whole index to find qualifying records. This is bad. A minor help: since all the columns we want are in the index, we don't have to lookup into the actual table. So, technically the second query is helped by the index - just not to the degree the other queries are helped.
select OrderID, StatusID from Orders where CustomerID = 100 and OrderID = 1000
Since the index is ordered by OrderID then by CustomerID, finding the first qualifying record is fast. Once we find the first record, we can continue reading in the index until we find a non-qualifying record. Then we can stop. Since all the columns we want are in the index, we don't have to lookup into the actual table. Great!
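The seek-vs-scan difference described above is easy to reproduce in SQLite (standing in for SQL Server here; SQLite has no INCLUDE clause, so the included column is simply appended to the index key):

```python
import sqlite3

# SQLite stand-in for the example: same column layout, and an index
# approximating Example (OrderID, CustomerID) INCLUDE (StatusID).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (
    OrderID INTEGER, CustomerID INTEGER,
    StatusID INTEGER, DateCreated TEXT);
CREATE INDEX Example ON Orders (OrderID, CustomerID, StatusID);
""")

def plan(sql):
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()[0][-1]

# Query 1: seeks on the leading index column (a SEARCH).
p1 = plan("SELECT CustomerID FROM Orders WHERE OrderID = 100")
# Query 2: CustomerID is not leftmost, so the whole index is read (a SCAN).
p2 = plan("SELECT OrderID FROM Orders WHERE CustomerID = 20")

print(p1)  # e.g. SEARCH ... USING COVERING INDEX Example (OrderID=?)
print(p2)  # e.g. SCAN ... (reads everything, even if the index covers it)
```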
do composite indexes improve queries that use one of the items within the composite?
Sometimes!
Or should individual indexes be created just on those columns as well (ie OrderID, CustomerID) in order to satisfy queries 1 and 2?
Sometimes not!
The real answer is nuanced by the fact that the order of the columns in the index declaration determines the order of records in the index. Some queries are helped by some orderings, while others aren't. You may need to complement your current index with one more to cover the CustomerID, OrderID case.
"since all the columns we want are in the index, we don't have to lookup into the actual table"-- so the index can be used for reading purposes although not used for seeking/finding purposes?
When an index (which is a copy of a portion of a table) includes all the information needed to resolve the query, the actual table does not need to be read. The index "covers" the query.