Will this composite index help all of these queries? - sql-server-2008

Let's say I have the following table:
Orders
======
OrderID
CustomerID
StatusID
DateCreated
And I have the following queries:
select CustomerID from Orders where OrderID = 100
select OrderID from Orders where CustomerID = 20
select OrderID, StatusID from Orders where CustomerID = 100 and OrderID = 1000
If I make the following index:
create nonclustered index Example
On dbo.Orders(OrderID,CustomerID)
Include(StatusID)
Does that take care of optimization of all 3 queries with one index? In other words, do composite indexes improve queries that use one of the items within the composite? Or should individual indexes be created just on those columns as well (ie OrderID, CustomerID) in order to satisfy queries 1 and 2?

This index will not help the second query, since it is not possible to first seek on the left-most column in the index.
Think about a phone book. First, try to find all the people with the last name Smith. Easy, right? Now, try to find all the people with the first name John. The way the "index" in a phone book works is LastName, FirstName. When you're trying to find all the Johns, having them first sorted by LastName is not helpful at all - you still have to look through the whole phone book, because there will be a John Anderson and a John Zolti and everything in between.
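The phone book analogy can be sketched in a few lines of Python. This is only an illustration under assumed sample names, not how SQL Server stores an index: a sorted list stands in for the index on (LastName, FirstName), and `bisect` stands in for a seek.

```python
from bisect import bisect_left, bisect_right

# A toy "phone book" kept sorted by (last_name, first_name), the same
# ordering a composite index on (LastName, FirstName) would have.
# The names are made-up sample data.
phone_book = sorted([
    ("Anderson", "John"),
    ("Anderson", "Mary"),
    ("Smith", "Alice"),
    ("Smith", "John"),
    ("Zolti", "John"),
])
last_names = [last for last, _ in phone_book]

def find_by_last_name(last):
    """Seek: binary-search to the block of matching last names."""
    lo = bisect_left(last_names, last)
    hi = bisect_right(last_names, last)
    return phone_book[lo:hi]

def find_by_first_name(first):
    """No seek possible: every entry has to be examined."""
    return [entry for entry in phone_book if entry[1] == first]

print(find_by_last_name("Smith"))
print(find_by_first_name("John"))
```

The Smiths sit next to each other, so a binary search finds them directly. The Johns are scattered across the whole list, so the second function has no choice but to look at every entry.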
To get the most out of all three queries, and assuming these are the only three query forms that are in use, I would potentially suggest an additional index:
create nonclustered index Example2
On dbo.Orders(CustomerID) INCLUDE(OrderID)
(If OrderID is the clustered primary key, you shouldn't need to INCLUDE it, because nonclustered indexes carry the clustering key automatically.)
However, this should be tested against your workload, since indexes aren't free - they do require additional disk space, and extra work to be maintained when you are running DML queries. And again, this assumes that the three queries listed in your question are the only shapes you use. If you have other queries with different output columns or different where clauses, all bets are off.

The answers above are correct, but they leave one thing out. You probably want a clustered index on OrderID. If you create one, any nonclustered index on the table will automatically include that value, because on a table with a clustered index, nonclustered index entries point to rows by their clustering key. So you would want this:
create clustered index IX_ORDERS_ORDERID on ORDERS (OrderID)
go
create nonclustered index IX_ORDERS_CustomerID
On dbo.Orders(CustomerID)
Include(StatusID)
go
Now you have a fast search on OrderID or CustomerID, and all your queries will run well. Do you know how to see the execution plan? In SQL Server Management Studio, go to Query > Include Actual Execution Plan, and then run your queries. You'll see a graphical representation of which indexes are used, whether a seek or a scan is executed, and so on.
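If no SQL Server instance is at hand, the seek-versus-scan distinction can still be observed with another engine. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite is not SQL Server, it has no INCLUDE clause (so StatusID is assumed here to be a third key column), and its planner output wording differs between versions. It models the composite index from the question, not the clustered layout suggested in this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderID INT, CustomerID INT, StatusID INT, DateCreated TEXT);
    -- Assumption: StatusID as a trailing key column replaces INCLUDE(StatusID).
    CREATE INDEX Example ON Orders (OrderID, CustomerID, StatusID);
""")

def plan(sql):
    """Return SQLite's plan description for a query (last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

queries = [
    "select CustomerID from Orders where OrderID = 100",
    "select OrderID from Orders where CustomerID = 20",
    "select OrderID, StatusID from Orders where CustomerID = 100 and OrderID = 1000",
]
for q in queries:
    print(plan(q))
```

In SQLite's vocabulary, SEARCH corresponds to a seek and SCAN to reading the whole index or table. The real execution plan in SQL Server remains the authority for your own workload.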

create nonclustered index Example
On dbo.Orders(OrderID,CustomerID)
Include(StatusID)
Read this index as: Create a system maintained copy of OrderID, CustomerID, StatusID from the Orders table. Order this copy by OrderID and break ties with CustomerID.
select CustomerID from Orders where OrderID = 100
Since the index is ordered by OrderID, finding the first qualifying record is fast. Once we find the first record, we can continue reading in the index until we find one where OrderID isn't 100. Then we can stop. Since all the columns we want are in the index, we don't have to lookup into the actual table. Great!
select OrderID from Orders where CustomerID = 20
Since the index is ordered by OrderID and then by CustomerID, qualifying records could appear anywhere in the index. The first record might qualify (OrderID = 1, CustomerID = 20). The last record might qualify (OrderID = 1000000000, CustomerID = 20). We must read the whole index to find qualifying records. This is bad. A minor help: since all the columns we want are in the index, we don't have to lookup into the actual table. So, technically the second query is helped by the index - just not to the degree the other queries are helped.
select OrderID, StatusID from Orders where CustomerID = 100 and OrderID = 1000
Since the index is ordered by OrderID then by CustomerID, finding the first qualifying record is fast. Once we find the first record, we can continue reading in the index until we find a non-qualifying record. Then we can stop. Since all the columns we want are in the index, we don't have to lookup into the actual table. Great!
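The three walkthroughs above can be mimicked with a sorted list of tuples. This is a rough model with made-up rows, not SQL Server's actual B-tree: it only counts how many index entries each query shape has to read, and it never touches the base rows because the index copy already holds every requested column.

```python
from bisect import bisect_left

# Hypothetical sample rows: (OrderID, CustomerID, StatusID, DateCreated).
orders = [(oid, oid % 50, oid % 3, "2012-01-01") for oid in range(1, 1001)]

# The index is a sorted copy of (OrderID, CustomerID, StatusID).
# DateCreated is not part of it.
index = sorted((o, c, s) for o, c, s, _ in orders)

def seek_order(order_id):
    """Seek to the first entry for order_id, read until one does not qualify."""
    pos = bisect_left(index, (order_id,))
    found, entries_read = [], 0
    while pos < len(index):
        entries_read += 1
        if index[pos][0] != order_id:
            break  # first non-qualifying entry: stop reading
        found.append(index[pos])
        pos += 1
    return found, entries_read

def scan_customer(customer_id):
    """CustomerID is not the leading column, so every entry is read."""
    found = [entry for entry in index if entry[1] == customer_id]
    return found, len(index)

print(seek_order(100))          # a couple of entries read
print(scan_customer(20)[1])     # the whole index read
```

The seek touches only the matching entry plus the one that ends the run, while the CustomerID lookup reads every entry in the index even though it never needs the table.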
do composite indexes improve queries that use one of the items within the composite?
Sometimes!
Or should individual indexes be created just on those columns as well (ie OrderID, CustomerID) in order to satisfy queries 1 and 2?
Sometimes not!
The real answer is nuanced by the fact that the order of the columns in the index declaration determines the order of records in the index. Some queries are helped by some orderings, while others aren't. You may need to complement your current index with one more to cover the CustomerID, OrderID case.
"since all the columns we want are in the index, we don't have to lookup into the actual table"-- so the index can be used for reading purposes although not used for seeking/finding purposes?
When an index (which is a copy of a portion of a table) includes all the information needed to resolve the query, the actual table does not need to be read. The index "covers" the query.

Related

SQL: Optimize the query on large table with indexing

For example, I have the following table:
table Product
------------
id
category_id
processed
product_name
This table has indexes on id, category_id, processed, and (category_id, processed). The statistics for this table are:
select count(*) from Product; -- 50M records
select count(*) from Product where category_id=10; -- 1M records
select count(*) from Product where processed=1; -- 30M records
The simplest query I want to run is this (select * is a must):
select * from Product
where category_id=10 and processed=1
order by id ASC LIMIT 100
Without the LIMIT, the above query returns only about 10,000 records.
I want to run the above query multiple times. After each batch I fetch, I update processed to 0 so that those rows do not appear in the next query. When I test on real data, the optimizer sometimes tries to use id as the key, which takes a long time.
How can I optimize the above query (in general terms)?
P.S. To avoid confusion: I know that the best index would be (category_id, processed, id), but I cannot change the indexes. My question is only about optimizing the query.
Thanks
For this query:
select *
from Product
where category_id = 10 and processed = 1
order by id asc
limit 100;
The optimal index is on product(category_id, processed, id). This is a single index with a three-part key, with the keys in this order.
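To see why the three-part key matters for ORDER BY id with a LIMIT, here is a rough Python model with made-up data (the row counts and value distribution are assumptions, and real MySQL behavior depends on its optimizer). It compares walking the rows in primary key order while filtering against reading the first 100 entries of a sorted (category_id, processed, id) list.

```python
from bisect import bisect_left

# Hypothetical data: (id, category_id, processed), already in id order
# like the primary key.
rows = [(i, i % 500, 1 if i % 3 else 0) for i in range(1, 200_001)]

def via_primary_key(limit):
    """Walk rows in id order, filter each one, stop after `limit` matches."""
    out, rows_read = [], 0
    for id_, category_id, processed in rows:
        rows_read += 1
        if category_id == 10 and processed == 1:
            out.append(id_)
            if len(out) == limit:
                break
    return out, rows_read

# Stand-in for an index on (category_id, processed, id).
composite = sorted((c, p, i) for i, c, p in rows)

def via_composite(limit):
    """Seek to (10, 1); the entries that follow are already in id order."""
    start = bisect_left(composite, (10, 1))
    entries = composite[start:start + limit]
    out = [i for c, p, i in entries if (c, p) == (10, 1)]
    return out, len(entries)

pk_ids, pk_read = via_primary_key(100)
ix_ids, ix_read = via_composite(100)
print(pk_read, ix_read)
```

Both paths return the same 100 ids, but the primary key walk has to examine many non-matching rows along the way, which is consistent with the slowdown described when the optimizer picks id as the key.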
Given that you have INDEX(category_id, processed), there is virtually no advantage in also having just INDEX(category_id). So DROP the latter.
That may have the beneficial side effect of pushing the Optimizer toward the composite INDEX(category_id, processed), which is at least "better" for the query.
Without touching the indexes, you could use a FORCE INDEX mentioning the composite index's name. But I don't recommend it. "It may help today, but hurt tomorrow, after the data changes."
Why do you say "But I cannot change the index."? Newer versions of MySQL/MariaDB make ADD/DROP INDEX much faster than older versions. Also, pt-online-schema-change provides a fast way to do it.

Choosing the optimal index for SQL Query

I'm doing some problem sets in my database management course and I can't figure this specific problem out.
We have the following relation:
Emp (id, name, age, sal, ...)
And the following query:
SELECT id
FROM Emp
WHERE age > (select max(sal) from Emp);
We are then supposed to choose the index that a good query optimizer would use. My answer would be to just use Emp(age), but the solution to the question is
Emp(age)
&
Emp(sal)
How come there are 2 indices? I can't seem to wrap my head around why you would need more than the age attribute.
Of course, you realize that the query is non-sensical, comparing age to sal (which is presumably a salary). That said, two indexes are appropriate for:
SELECT e.id
FROM Emp e
WHERE e.age > (select max(e2.sal) from Emp e2);
I added table aliases to emphasize that the query is referring to the Emp table twice.
To get the maximum sal from the table, you want an index on emp(sal). The maximum is a simple index lookup operation.
Then you want to compare this to age. For a comparison to age, you want an index on emp(age). This is an entirely separate reference to emp that has no reference to sal, so you cannot put the two columns in a single index.
The index on age may not be necessary. The query may be returning lots of rows -- and queries that return lots of rows don't generally benefit from a secondary index. The one case where it can benefit from the index is if age is a clustered index (that is, typically the first column in the primary key). However, I wouldn't recommend such an indexing structure.
you need both indexes to get optimal performance
1) the subquery (select max(sal) from Emp) will benefit from indexing Emp(sal) because on a tree-index, retrieving the max would be much quicker
2) the outer query needs to run a filtering on Emp(age), so that also benefits from a tree-index
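The two steps can be sketched with sorted lists standing in for the two tree indexes. The employee rows are invented for illustration, and a real optimizer may still decide not to use the age index if too many rows qualify.

```python
from bisect import bisect_right

# Hypothetical rows: (id, name, age, sal). Values are made up, and
# comparing age to sal is as nonsensical here as in the exercise.
emp = [(1, "Ann", 30, 20), (2, "Bob", 45, 40), (3, "Cy", 52, 35), (4, "Di", 61, 25)]

sal_index = sorted((sal, id_) for id_, _, _, sal in emp)   # stands in for Emp(sal)
age_index = sorted((age, id_) for id_, _, age, _ in emp)   # stands in for Emp(age)

# 1) max(sal): in a sorted index the maximum is simply the last entry.
max_sal = sal_index[-1][0]

# 2) age > max_sal: binary-search to the first age strictly above the
#    threshold, then read everything from there to the end.
start = bisect_right(age_index, (max_sal, float("inf")))
ids = [id_ for _, id_ in age_index[start:]]

print(max_sal, ids)
```

Each step reads a different sorted structure, which is why a single index on one of the two columns cannot serve both.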

How do you order the indexing columns in MySQL if you are using order by in your query?

I am reading an article about how Pinterest shards their MySQL database: https://medium.com/#Pinterest_Engineering/sharding-pinterest-how-we-scaled-our-mysql-fleet-3f341e96ca6f
And here they have an example of a table:
CREATE TABLE board_has_pins (
board_id INT,
pin_id INT,
sequence INT,
INDEX(board_id, pin_id, sequence)
) ENGINE=InnoDB;
And they are showing how they query from that table:
SELECT pin_id FROM board_has_pins
WHERE board_id=241294561224164665 ORDER BY sequence
LIMIT 50 OFFSET 150
What I don't understand here is the ordering of the index. Would it not make more sense if the index was like this since they are ordering by sequence and filtering by board_id?
INDEX(board_id, sequence, pin_id)
Am I missing something here or have I misunderstood how indexing works?
You are correct. The better index for this query is:
INDEX(board_id, sequence, pin_id)
The columns should be in this order:
1) Column(s) involved in equality comparisons. If there are multiple columns, their order does not matter.
2) Column(s) involved in the ORDER BY clause, in the same order they appear in the ORDER BY.
3) Other columns used to fetch values, like pin_id.
Once the equality conditions find the subset of matching rows, they are all tied with respect to their order, because they all have the same value for the column of the equality condition (board_id in this case).
The tie is resolved by the order of the next column in the index. If (and only if) the next column is the one used in the ORDER BY clause, then the rows can be read in index order, with no further work needed to sort them.
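A small Python model makes the tie-breaking concrete. The rows and the small board ids are made up, and sorted lists of tuples stand in for the two candidate indexes.

```python
from bisect import bisect_left

# Hypothetical rows: (board_id, pin_id, sequence).
rows = [(7, 501, 3), (7, 502, 1), (7, 503, 2), (8, 504, 1), (7, 505, 4)]

idx_bps = sorted((b, p, s) for b, p, s in rows)  # INDEX(board_id, pin_id, sequence)
idx_bsp = sorted((b, s, p) for b, p, s in rows)  # INDEX(board_id, sequence, pin_id)

def board_slice(index, board_id):
    """Equality on the leading column selects one contiguous run of entries."""
    lo = bisect_left(index, (board_id,))
    hi = bisect_left(index, (board_id + 1,))
    return index[lo:hi]

# With (board_id, pin_id, sequence) the run comes back in pin_id order,
# so an explicit sort by sequence is still required.
unsorted_run = board_slice(idx_bps, 7)
pins_after_sort = [p for _, p, s in sorted(unsorted_run, key=lambda e: e[2])]

# With (board_id, sequence, pin_id) the run is already in sequence order.
pins_in_index_order = [p for _, s, p in board_slice(idx_bsp, 7)]

# LIMIT 2 OFFSET 1 then becomes a plain slice of the run.
page = pins_in_index_order[1:3]
print(pins_after_sort, pins_in_index_order, page)
```

Both indexes give the same answer, but only the second one delivers it without a sort, and only there can LIMIT and OFFSET be applied by simply skipping and counting entries.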
I don't know the explanation for the Pinterest blog post you linked to. I guess it's a mistake, because the index is not optimal for the query they showed.

Using index with IN clause and ordering by primary key

I am having a problem with the following task using MySQL. I have a table Records(id, enterprise, department, status), where id is the primary key, enterprise and department are foreign keys, and status is an integer value (0 - CREATED, 1 - APPROVED, 2 - REJECTED).
Now, usually the application needs to filter for a specific enterprise, department, and status:
SELECT * FROM Records WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
The order by is required, since I have to provide the user with the most recent records. For this query I have created an index (enterprise, department, status), and everything works fine. However, for some privileged users the status should be omitted:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
This obviously breaks the index: it's still good for filtering, but not for sorting. So, what should I do? I don't want to create a separate index (enterprise, department), so what if I modify the query like this:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
AND status IN (0,1,2)
ORDER BY id desc LIMIT 0,10;
MySQL definitely does use the index now, since it's provided with values of status, but how quick will the sorting by primary key be? Will it take the 10 most recent values for each available status and then merge them, or will it first merge the ids for all statuses together and only then take the first ten (which I guess would be much slower)?
All of the queries will benefit from one composite index:
INDEX(enterprise, department, status, id)
enterprise and department can be swapped, but keep the rest of the columns in that order.
The first query will use that index for both the WHERE and the ORDER BY, thereby be able to find the 10 rows without scanning the table or doing a sort.
The second query is missing status, so my index is less than perfect. This would be better:
INDEX(enterprise, department, id)
At that point, it works like above. (Note: If the table is InnoDB, then this 3-column index is identical to your 2-column INDEX(enterprise, department) -- the PK is silently included.)
The third query gets dicier because of the IN. Still, my 4-column index will be nearly the best. It will use the first 3 columns, but it cannot do the ORDER BY id, so it won't use id, and it won't be able to consume the LIMIT. Hence the EXPLAIN will say Using temporary and/or Using filesort. Don't worry, performance should still be nice.
My second index is not as good for the third query.
See my Index Cookbook.
"How quick will sorting by id be"? That depends on three things:
1) Whether the sort can be avoided (see above);
2) How many rows are in the query without the LIMIT;
3) Whether you are selecting TEXT columns.
I was careful to say whether the INDEX is used all the way through the ORDER BY, in which case there is no sort, and the LIMIT is folded in. Otherwise, all the rows (after filtering) are written to a temp table, sorted, then 10 rows are peeled off.
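The difference between the two paths can be modeled in Python with invented rows. This only mirrors the behavior described in this answer (collect, sort, peel off 10); it makes no claim about MySQL's internal implementation.

```python
# Hypothetical rows: (id, enterprise, department, status).
records = [(i, 11, 21, i % 3) for i in range(1, 31)]

# Stand-in for INDEX(enterprise, department, status, id).
index = sorted((e, d, s, i) for i, e, d, s in records)

# status = 0: one run of the index, already ordered by id, so the
# 10 most recent rows are simply the tail of the run read backwards.
run_status0 = [i for e, d, s, i in index if (e, d, s) == (11, 21, 0)]
latest_status0 = run_status0[::-1][:10]

# status IN (0, 1, 2): three runs back to back. Each run is ordered by
# id, but the concatenation is not, so all matching ids are collected,
# sorted, and only then are 10 peeled off.
ids_in_index_order = [i for e, d, s, i in index
                      if (e, d) == (11, 21) and s in (0, 1, 2)]
latest_any_status = sorted(ids_in_index_order, reverse=True)[:10]

print(latest_status0)
print(ids_in_index_order[:12])
print(latest_any_status)
```

With a single status the LIMIT stops the read after 10 entries. With the IN list every matching entry is read before the sort, so the cost grows with the number of rows in the query without the LIMIT.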
The "temp table" I just mentioned is necessary for various complex queries, such as those with subqueries, GROUP BY, ORDER BY. (As I have already hinted, sometimes the temp table can be avoided.) Anyway, the temp table comes in 2 flavors: MEMORY and MyISAM. MEMORY is favorable because it is faster. However, TEXT (and several other things) prevent its use.
If MEMORY is used then Using filesort is a misnomer -- the sort is really an in-memory sort, hence quite fast. For 10 rows (or even 100) the time taken is insignificant.

What will be the behavior of the index in these two scenarios in relational databases like MySQL?

Let's say I have a table students with the following fields
id,student_id,test_type,score
Consider these two queries
select * from students where student_id = x and score > y
select * from students where student_id = x order by score
Let's say I have separate indexes on student_id and score but no composite index. Which indexes will the database use? Can the query use both indexes, or at most one?
If the student_id index restricts the results, can the score index still be used for sorting or filtering?
Or, if the database chooses the score index to sort or filter first, can the student_id index still be used for the student_id = x filter?
MySQL's optimizer would like the composite INDEX(student_id, score) for both queries.
Without the composite index... The optimizer almost never uses two indexes. The optimizer would pick between INDEX(student_id) and INDEX(score).
But there is another wrinkle -- if this table is InnoDB, and if it has PRIMARY KEY(student_id), then INDEX(score) implicitly has student_id tacked on the end. Hence INDEX(score) would be perfect for the first query.
Given two indexes, the optimizer looks at cardinality and various other things to pick between them.
More on creating the best index.
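As an illustration of why the composite INDEX(student_id, score) suits both queries, here is a minimal Python sketch. The rows are invented, student_id is assumed to be an integer, and a sorted list of tuples stands in for the index.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows: (id, student_id, test_type, score).
students = [
    (1, 7, "quiz", 55),
    (2, 7, "exam", 90),
    (3, 8, "quiz", 70),
    (4, 7, "lab", 72),
    (5, 9, "exam", 60),
]

# Stand-in for INDEX(student_id, score), with the row id carried along.
index = sorted((sid, score, id_) for id_, sid, _, score in students)

def scores_above(x, y):
    """student_id = x AND score > y: one contiguous range of the index."""
    lo = bisect_right(index, (x, y, float("inf")))
    hi = bisect_left(index, (x + 1,))
    return index[lo:hi]

def ordered_by_score(x):
    """student_id = x ORDER BY score: the run is already in score order."""
    lo = bisect_left(index, (x,))
    hi = bisect_left(index, (x + 1,))
    return index[lo:hi]

print(scores_above(7, 60))
print(ordered_by_score(7))
```

With two separate single-column lists, either lookup would have to pick one list and then filter or sort the remainder by hand, which is the trade-off the optimizer faces when it must choose between INDEX(student_id) and INDEX(score).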
Well, it definitely depends on your data set and database. Imagine the students table has 100 different ids that all share the same student_id. The student_id index would then be a poor choice, and an optimizer such as Teradata's would be smart enough to choose a better one, like score or id (most databases have similar built-in features). A composite index wouldn't necessarily be selected, because for this table I don't think it would help the fetch at all. The best way to pick a good index is to ask which column provides a highly selective value that is inexpensive to compare (an integer) and can eliminate a large chunk of the data. But yes, student_id would be the best index in this case. The query that ends with "and score > y" would also be quicker, because the WHERE clause is applied first and shrinks the data set.