Choosing the optimal index for SQL Query

I'm doing some problem sets in my database management course and I can't figure this specific problem out.
We have the following relation:
Emp (id, name, age, sal, ...)
And the following query:
SELECT id
FROM Emp
WHERE age > (select max(sal) from Emp);
We are then supposed to choose an index that a good query optimizer would use. My answer would be to just use Emp(age), but the solution to the question is
Emp(age)
&
Emp(sal)
How come there are 2 indices? I can't seem to wrap my head around why you would need more than the age attribute.

Of course, you realize that the query is nonsensical: it compares age to sal (which is presumably a salary). That said, two indexes are appropriate for:
SELECT e.id
FROM Emp e
WHERE e.age > (select max(e2.sal) from Emp e2);
I added table aliases to emphasize that the query is referring to the Emp table twice.
To get the maximum sal from the table, you want an index on emp(sal). The maximum is a simple index lookup operation.
Then you want to compare this to age. Well, for a comparison to age, you want an index on emp(age). This is an entirely separate reference to emp that has no reference to sal, so you cannot put the two columns in a single index.
The index on age may not be necessary. The query may be returning lots of rows -- and queries that return lots of rows don't generally benefit from a secondary index. The one case where it can benefit from the index is if age is a clustered index (that is, typically the first column in the primary key). However, I wouldn't recommend such an indexing structure.
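As a rough sketch of the two indexes discussed above, in MySQL syntax. The column types and the index names idx_emp_sal and idx_emp_age are assumptions made for illustration; the question only gives the column names.

```sql
-- Hypothetical column types; the question only names the columns.
CREATE TABLE Emp (
    id   INT PRIMARY KEY,
    name VARCHAR(100),
    age  INT,
    sal  INT
);

-- Serves the subquery: MAX(sal) sits at one end of this B-tree index.
CREATE INDEX idx_emp_sal ON Emp (sal);

-- Serves the outer range predicate on age.
CREATE INDEX idx_emp_age ON Emp (age);

-- Inspect the plan. The optimizer may still prefer a full scan for the
-- outer query if many rows satisfy the age predicate.
EXPLAIN
SELECT e.id
FROM Emp e
WHERE e.age > (SELECT MAX(e2.sal) FROM Emp e2);
```

Running the EXPLAIN with and without idx_emp_age on realistic data is the simplest way to see whether the second index earns its keep.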

You need both indexes to get optimal performance:
1) The subquery (select max(sal) from Emp) will benefit from indexing Emp(sal), because retrieving the max from a tree index is much quicker than scanning the table.
2) The outer query needs to filter on Emp(age), so that also benefits from a tree index.

Related

MySQL index key on table with more columns

In my script, I have a lot of SQL WHERE clauses, e.g.:
SELECT * FROM cars WHERE active=1 AND model='A3';
SELECT * FROM cars WHERE active=1 AND year=2017;
SELECT * FROM cars WHERE active=1 AND brand='BMW';
I am using different SQL clauses on same table because I need different data.
I would like to set index key on table cars, but I am not sure how to do it. Should I set separate keys for each column (active, model, year, brand) or should I set keys for groups (active,model and active,year and active,brand)?

WHERE a=1 AND y='m'
is best handled by INDEX(a,y) in either order. The optimal set of indexes is several pairs like that. However, I do not recommend having more than a few indexes. Try to limit it to queries that users actually make.
INDEX(a,b,c,d):
WHERE a=1 AND b=22 -- Index is useful
WHERE a=1 AND d=44 -- Index is less useful
Only the "left column(s)" of an index are used. Hence the second case uses a, but stops because b is not in the WHERE.
You might be tempted to also have (active, year, model). That combination works well for active AND year, active AND year AND model, but not active AND model (but no year).
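A minimal sketch of the "several pairs" approach for the cars table from the question, in MySQL syntax. The index names are hypothetical, and whether all three indexes are worth keeping depends on which queries users actually run.

```sql
-- One two-column index per query shape; index names are made up.
ALTER TABLE cars
    ADD INDEX idx_active_model (active, model),
    ADD INDEX idx_active_year  (active, year),
    ADD INDEX idx_active_brand (active, brand);

-- Each query can use the pair whose columns match its WHERE clause.
EXPLAIN SELECT * FROM cars WHERE active = 1 AND model = 'A3';
EXPLAIN SELECT * FROM cars WHERE active = 1 AND year = 2017;
EXPLAIN SELECT * FROM cars WHERE active = 1 AND brand = 'BMW';
```

The key column of each EXPLAIN row shows which index, if any, the optimizer picked.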
More on creating indexes: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
Since model implies a make, there is little use to put both of those in the same composite index.
year is not very selective, and users might want a range of years. These make it difficult to get an effective index on year.
How many rows will you have? If it is millions, we need to work harder to avoid performance problems. I'm leaning toward this, but only because the lookup would be more compact.

We use single-column indexing when we want to query on just one column, same as in your case, and multi-column (group) indexing when we have multiple conditions in the same WHERE clause.
Go for single indexing.
For a more detailed explanation, refer to this article: https://www.sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

Most efficient way to join "most recent row"

I know this question has been asked 100 times, and this isn't a "how do I do it", but an efficiency question - a topic I don't know much about.
From my internet reading I have settled on one way of solving the most recent problem that sounds like it's pretty efficient - LEFT JOIN a "max" table (grouped by the matching conditions) and then LEFT JOIN the row that matches the grouped conditions. Something like this:
Select employee.*, evaluation.* from employee
LEFT JOIN (select max(report_date) report_date, employee_id
from evaluation group by employee_id) most_recent_eval
on most_recent_eval.employee_id = employee.id
LEFT JOIN evaluation
on evaluation.employee_id = employee.id and evaluation.report_date = most_recent_eval.report_date
Are there problems with this that I don't know about? Is this doing 2 table scans (one to find the max, and one to find the row)? Does it have to do 2 full scans for every employee?
The reason I'm asking is that I am now looking at joining on 3 tables where I need the most recent row (evaluations, security clearance, and project) and it seems like any inefficiencies are going to be massively multiplied.
Can anyone give me some advice on this?

You should be in pretty good shape with the query pattern you propose.
One possible suggestion will help if your evaluation table has its own autoincrementing id column. You may be able to find the latest evaluation for each employee with this subquery:
SELECT MAX(id) id
FROM evaluation
GROUP BY employee_id
Then your join can look like this:
FROM employee
LEFT JOIN (
SELECT employee_id, MAX(id) id
FROM evaluation
GROUP BY employee_id
) most_recent_eval ON most_recent_eval.employee_id = employee.id
LEFT JOIN evaluation ON most_recent_eval.id = evaluation.id
This will work if your id values and your report_date values in your evaluation table are in the same order. Only you know if that's the case in your application. But if it is, this is a very helpful optimization.
Other than that, you may need to add some compound indexes to some tables to speed up your queries. Get them working correctly first. Read http://use-the-index-luke.com/ . Remember that lots of single-column indexes are generally harmful to MySQL query performance unless they're chosen to accelerate particular queries.
If you create a compound index on (employee_id, report_date), this subquery
select max(report_date) report_date, employee_id
from evaluation
group by employee_id
can be satisfied with an astonishingly efficient loose index scan. Similarly, if you're using InnoDB, the query
SELECT MAX(id) id
FROM evaluation
GROUP BY employee_id
can be satisfied by a loose index scan on a single-column index on employee_id, because InnoDB puts the primary key column implicitly into every index. (If you're using MyISAM, which does not do this, you need a compound index on (employee_id, id).)
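A short sketch of the compound index described above, in MySQL syntax. The index name idx_eval_emp_date is an assumption; the table and column names come from the question.

```sql
-- Compound index: employee_id first (the GROUP BY column),
-- report_date second (the column being MAX'd).
CREATE INDEX idx_eval_emp_date ON evaluation (employee_id, report_date);

-- When the optimizer chooses a loose index scan, the Extra column of
-- EXPLAIN typically reports "Using index for group-by".
EXPLAIN
SELECT employee_id, MAX(report_date) AS report_date
FROM evaluation
GROUP BY employee_id;
```

The same index should also help the final join back to evaluation, since that join matches on employee_id and report_date.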

Using index with IN clause and ordering by primary key

I am having a problem with the following task using MySQL. I have a table Records(id, enterprise, department, status), where id is the primary key, enterprise and department are foreign keys, and status is an integer value (0 - CREATED, 1 - APPROVED, 2 - REJECTED).
Now, usually the application needs to filter for a specific enterprise, department and status:
SELECT * FROM Records WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
The order by is required, since I have to provide the user with the most recent records. For this query I have created an index (enterprise, department, status), and everything works fine. However, for some privileged users the status should be omitted:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
ORDER BY id desc LIMIT 0,10;
This obviously breaks the index - it's still good for filtering, but not for sorting. So, what should I do? I don't want to create a separate index (enterprise, department), so what if I modify the query like this:
SELECT * FROM Records WHERE enterprise = 11 AND department = 21
AND status IN (0,1,2)
ORDER BY id desc LIMIT 0,10;
MySQL definitely does use the index now, since it's provided with values of status, but how quick will the sorting by primary key be? Will it take the recent 10 values for each status available, and then merge them, or will it first merge the ids for each status together, and only after that take the first ten (this way it's gonna be much slower I guess).

All of the queries will benefit from one composite index:
INDEX(enterprise, department, status, id)
enterprise and department can be swapped, but keep the rest of the columns in that order.
The first query will use that index for both the WHERE and the ORDER BY, thereby be able to find the 10 rows without scanning the table or doing a sort.
The second query is missing status, so my index is less than perfect. This would be better:
INDEX(enterprise, department, id)
At that point, it works like above. (Note: If the table is InnoDB, then this 3-column index is identical to your 2-column INDEX(enterprise, department) -- the PK is silently included.)
The third query gets dicier because of the IN. Still, my 4-column index will be nearly the best. It will use the first 3 columns, but not be able to do the ORDER BY id, so it won't use id. And it won't be able to consume the LIMIT. Hence the EXPLAIN will say Using temporary and/or Using filesort. Don't worry, performance should still be nice.
My second index is not as good for the third query.
See my Index Cookbook.
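To check the claims above against your own data, something like the following could be run in MySQL. The index names are hypothetical; the queries are the three from the question.

```sql
-- The two indexes discussed in this answer (names are made up).
ALTER TABLE Records
    ADD INDEX idx_ent_dep_status_id (enterprise, department, status, id),
    ADD INDEX idx_ent_dep_id        (enterprise, department, id);

-- Query 1: expected to filter and order entirely from the 4-column index.
EXPLAIN SELECT * FROM Records
WHERE status = 0 AND enterprise = 11 AND department = 21
ORDER BY id DESC LIMIT 0,10;

-- Query 2: expected to filter and order from the 3-column index.
EXPLAIN SELECT * FROM Records
WHERE enterprise = 11 AND department = 21
ORDER BY id DESC LIMIT 0,10;

-- Query 3: the IN list usually forces a sort; look for
-- "Using filesort" in the Extra column.
EXPLAIN SELECT * FROM Records
WHERE enterprise = 11 AND department = 21 AND status IN (0,1,2)
ORDER BY id DESC LIMIT 0,10;
```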
"How quick will sorting by id be?" That depends on three things:
Whether the sort can be avoided (see above);
How many rows are in the query without the LIMIT;
Whether you are selecting TEXT columns.
I was careful to say whether the INDEX is used all the way through the ORDER BY, in which case there is no sort, and the LIMIT is folded in. Otherwise, all the rows (after filtering) are written to a temp table, sorted, then 10 rows are peeled off.
The "temp table" I just mentioned is necessary for various complex queries, such as those with subqueries, GROUP BY, ORDER BY. (As I have already hinted, sometimes the temp table can be avoided.) Anyway, the temp table comes in 2 flavors: MEMORY and MyISAM. MEMORY is favorable because it is faster. However, TEXT (and several other things) prevent its use.
If MEMORY is used then Using filesort is a misnomer -- the sort is really an in-memory sort, hence quite fast. For 10 rows (or even 100) the time taken is insignificant.

What will be the behavior of the index in these two scenarios in relation databases like mysql?

Let's say I have a table students with the following fields
id,student_id,test_type,score
Consider these two queries
select * from students where student_id = x and score > y
select * from students where student_id = x order by score
Let's say I have indexes on both student_id and score but not a composite index. Which indexes will the database use? Will the query be able to use both indexes, or can at most one index be used?
Let's say the student_id index lets me restrict the results of the query. Will I then be able to use the score index for sorting or filtering?
Or, if the database chooses the score index to sort or filter first, will I be able to use the student_id index for the student_id = x filtering?

MySQL's optimizer would like the composite INDEX(student_id, score) for both queries.
Without the composite index... The optimizer almost never uses two indexes. The optimizer would pick between INDEX(student_id) and INDEX(score).
But there is another wrinkle -- If this table is InnoDB, and if it has PRIMARY KEY(student_id), then INDEX(score) implicitly has student_id tacked on the end. Hence INDEX(score) would be perfect for the first query.
Given two indexes, the optimizer looks at cardinality and various other things to pick between them.
More on creating the best index.
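A minimal sketch of the composite index suggested above, in MySQL syntax. The index name idx_students_sid_score and the literal values 42 and 70 are placeholders standing in for x and y from the question.

```sql
-- Equality column first, then the range / sort column.
CREATE INDEX idx_students_sid_score ON students (student_id, score);

-- Query 1: student_id narrows to one slice of the index,
-- and score > 70 is a range scan inside that slice.
EXPLAIN SELECT * FROM students WHERE student_id = 42 AND score > 70;

-- Query 2: rows for one student_id are already stored in score order,
-- so the ORDER BY should not need a separate sort step.
EXPLAIN SELECT * FROM students WHERE student_id = 42 ORDER BY score;
```

If the Extra column of the second EXPLAIN shows no "Using filesort", the index is satisfying the ORDER BY.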

Well, it definitely depends on your data set and database. Imagine the students table has 100 different ids but the same student_id for all of them. The student_id index would be considered bad, and the Teradata Query Optimizer would be smart enough to choose a better one like score or id. (That is if you are using Teradata DB, but most databases have built-in smart features like this.) A composite index certainly wouldn't be selected, because I don't think it would help the fetch at all for this table. The best way to select a good index is to ask which column can provide a solid, unique value that is inexpensive (an integer) and can eliminate a good partition or chunk of data if selected. But yes, student_id would be the best index in this case. Plus, the query that ends with "and score > y" would be quicker: the WHERE clause is always evaluated first, so the data set will be much smaller.

Will this composite index help all of these queries?

Let's say I have the following table:
Orders
======
OrderID
CustomerID
StatusID
DateCreated
And I have the following queries:
select CustomerID from Orders where OrderID = 100
select OrderID from Orders where CustomerID = 20
select OrderID, StatusID from Orders where CustomerID = 100 and OrderID = 1000
If I make the following index:
create nonclustered index Example
On dbo.Orders(OrderID,CustomerID)
Include(StatusID)
Does that take care of optimization of all 3 queries with one index? In other words, do composite indexes improve queries that use one of the items within the composite? Or should individual indexes be created just on those columns as well (ie OrderID, CustomerID) in order to satisfy queries 1 and 2?

This index will not help the second query, since it is not possible to first seek on the left-most column in the index.
Think about a phone book. First, try to find all the people with the last name Smith. Easy, right? Now, try to find all the people with the first name John. The way the "index" in a phone book works is LastName, FirstName. When you're trying to find all the Johns, having them first sorted by LastName is not helpful at all - you still have to look through the whole phone book, because there will be a John Anderson and a John Zolti and everything in between.
To get the most out of all three queries, and assuming these are the only three query forms that are in use, I would potentially suggest an additional index:
create nonclustered index Example2
On dbo.Orders(CustomerID) INCLUDE(OrderID)
(If OrderID is the primary key, you shouldn't need to INCLUDE it.)
However, this should be tested against your workload, since indexes aren't free - they do require additional disk space, and extra work to be maintained when you are running DML queries. And again, this assumes that the three queries listed in your question are the only shapes you use. If you have other queries with different output columns or different where clauses, all bets are off.

The above answers are correct, but they leave one thing out. You probably want a clustered index on OrderID. And if you create a clustered index on OrderID, then any non-clustered index on the table will automatically include that value, since non-clustered index entries point to the clustered index on tables where there is one. So you would want this:
create clustered index IX_ORDERS_ORDERID on ORDERS (OrderID)
go
create nonclustered index IX_ORDERS_CustomerID
On dbo.Orders(CustomerID)
Include(StatusID)
go
Now you have fast search on order id or customer id, and all your queries will run well. Do you know how to see the execution plan? In SQL Server Management Studio, go to Query > Include Actual Execution Plan, and then run your queries. You'll see a graphical representation of which indexes are used, whether a seek or scan is executed, etc.

create nonclustered index Example
On dbo.Orders(OrderID,CustomerID)
Include(StatusID)
Read this index as: Create a system maintained copy of OrderID, CustomerID, StatusID from the Orders table. Order this copy by OrderID and break ties with CustomerID.
select CustomerID from Orders where OrderID = 100
Since the index is ordered by OrderID, finding the first qualifying record is fast. Once we find the first record, we can continue reading in the index until we find one where OrderID isn't 100. Then we can stop. Since all the columns we want are in the index, we don't have to lookup into the actual table. Great!
select OrderID from Orders where CustomerID = 20
Since the index is ordered by OrderID and then by CustomerID, qualifying records could appear anywhere in the index. The first record might qualify (OrderID = 1, CustomerID = 20). The last record might qualify (OrderID = 1000000000, CustomerID = 20). We must read the whole index to find qualifying records. This is bad. A minor help: since all the columns we want are in the index, we don't have to lookup into the actual table. So, technically the second query is helped by the index - just not to the degree the other queries are helped.
select OrderID, StatusID from Orders where CustomerID = 100 and OrderID = 1000
Since the index is ordered by OrderID then by CustomerID, finding the first qualifying record is fast. Once we find the first record, we can continue reading in the index until we find a non-qualifying record. Then we can stop. Since all the columns we want are in the index, we don't have to lookup into the actual table. Great!
do composite indexes improve queries that use one of the items within the composite?
Sometimes!
Or should individual indexes be created just on those columns as well (ie OrderID, CustomerID) in order to satisfy queries 1 and 2?
Sometimes not!
The real answer is nuanced by the fact that the order of the columns in the index declaration determines the order of records in the index. Some queries are helped by some orderings, while others aren't. You may need to complement your current index with one more to cover the CustomerID, OrderID case.
"since all the columns we want are in the index, we don't have to lookup into the actual table"-- so the index can be used for reading purposes although not used for seeking/finding purposes?
When an index (which is a copy of a portion of a table) includes all the information needed to resolve the query, the actual table does not need to be read. The index "covers" the query.
