How to speed up this SQL index query? - mysql

Given the following SQL table:
Employee(ssn, name, dept, manager, salary)
You discover that the following query is significantly slower than
expected. There is an index on salary, and you have verified that
the query plan is using it.
SELECT *
FROM Employee
WHERE salary = 48000
Please give a possible reason why this query is slower than expected, and provide a tuning solution that
addresses that reason.
I have two ideas for why this query is slower than expected. One is that we are trying to SELECT * instead of SELECT Employee.salary which would slow down the query as we must search across all columns instead of one. Another idea is that the index on salary is non-clustered, and we want to use a clustered index, as the company could be very large and it would make sense to organize the table by the salary field.
Would either of those two solutions speed up this query? I.e. either change SELECT * to SELECT Employee.salary or explicitly set the index on salary to be clustered?

What indexes do you have now?
Is it really "slow"? What evidence do you have?
Comments on "SELECT * instead of SELECT Employee.salary" --
* is bad form because tomorrow you might add a column, thereby breaking any code that is expecting a certain number of columns in a certain order.
Dealing with * versus salary does not happen until after the row(s) is located.
Locating the row(s) is the costly part.
On the other hand, if you have INDEX(salary) and only look at salary then the index is "covering". That means that the "data" (the other columns) does not need to be fetched. Hence, faster. But this is probably beyond what your teacher has told you about yet.
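A minimal sketch of the covering-index point, assuming the index on salary is called idx_salary (the name is made up here; the question only says an index exists). Comparing the two EXPLAIN outputs is one way to see whether the table data is being fetched:

```sql
-- Assumed index name, for illustration only.
CREATE INDEX idx_salary ON Employee (salary);

-- Needs every column, so each matching index entry is followed by a
-- lookup into the table data.
EXPLAIN SELECT * FROM Employee WHERE salary = 48000;

-- Needs only salary, which the index already holds. If the index is
-- covering, the Extra column should show "Using index".
EXPLAIN SELECT salary FROM Employee WHERE salary = 48000;
```

If many employees earn 48000, the first form pays one table lookup per matching row, which is a plausible reason for the slowness in the question.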
Comments on "the index on salary is non-clustered, and we want to use a clustered index" --
In MySQL (not necessarily in other RDBMSs), InnoDB has exactly one PRIMARY KEY and it is always UNIQUE and "clustered".
That is, "clustered" implies "unique", which seems inappropriate for "salary".
In InnoDB a "secondary key" implicitly includes the column(s) of the PK (ssn?), with which it can reach over into the data.
"verified that the query plan" -- Have you learned about EXPLAIN SELECT ...?
More Tips on creating the optimal index for a given SELECT.

I will try to keep this as simple as I can.
You cannot simply make salary a clustered index unless you make it the primary key or a unique key, which makes no sense because two people can have the same salary.
There can be only one clustered index per table according to the MySQL documentation. By default, InnoDB uses the primary key as the clustered index.
If you do not define a PRIMARY KEY for your table, MySQL locates the
first UNIQUE index where all the key columns are NOT NULL and InnoDB
uses it as the clustered index.
To speed up your query I have a few suggestions, all based on secondary indexes:
If you only search for a salary by exact value, a hash-based index is a better option where the storage engine supports it (in MySQL, MEMORY tables do; InnoDB builds B-tree indexes only).
If you search using greater than, less than, or some other range, a B-tree index is the better choice.
The first option is faster than the second, but it is limited to equality comparisons.
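As a rough sketch of the hash versus B-tree distinction in MySQL syntax: the table EmployeeMem, its column types, and the index names below are all invented for illustration, and the MEMORY engine is used only because it accepts explicit hash indexes.

```sql
-- Hypothetical in-memory table; MEMORY tables accept USING HASH.
CREATE TABLE EmployeeMem (
  ssn    CHAR(9) NOT NULL PRIMARY KEY,
  salary INT     NOT NULL,
  INDEX idx_salary_hash USING HASH (salary)
) ENGINE = MEMORY;

-- Equality lookup: a hash index can serve this.
SELECT ssn FROM EmployeeMem WHERE salary = 48000;

-- Range lookup: a hash index cannot help, so add a B-tree index.
CREATE INDEX idx_salary_btree USING BTREE ON EmployeeMem (salary);
SELECT ssn FROM EmployeeMem WHERE salary BETWEEN 40000 AND 50000;
```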
Hope it helps.

Related

MySQL: composite index fulltext+btree?

I want a query that does a fulltext search on one field and then a sort on a different field (imagine searching some text document and order by publication date). The table has about 17M rows and they are more or less uniformly distributed in dates. This is to be used in a webapp request/response cycle, so the query has to finish in at most 200ms.
Schematically:
SELECT * FROM table WHERE MATCH(text) AGAINST('query') ORDER BY date=my_date DESC LIMIT 10;
One possibility is having a fulltext index on the text field and a btree on the publication date:
ALTER TABLE table ADD FULLTEXT index_name(text);
CREATE INDEX index_name ON table (date);
This doesn't work very well in my case. What happens is that MySQL evaluates two execution paths. One is using the fulltext index to find the relevant rows, and once they are selected use a FILESORT to sort those rows. The second is using the BTREE index to sort the entire table and then look for matches using a FULL TABLE SCAN. They're both bad. In my case MySQL chooses the former. The problem is that the first step can select some 30k results which it then has to sort, which means the entire query might take on the order of 10 seconds.
So I was thinking: do composite indexes of FULLTEXT+BTREE exist? If you know how a FULLTEXT index works, it first tokenizes the column you're indexing and then builds an index for the tokens. It seems reasonable to me to imagine a composite index such that the second index is a BTREE in dates for each token. Does this exist in MySQL and if so what's the syntax?
BONUS QUESTION: If it doesn't exist in MySQL, would PostgreSQL perform better in this situation?
Use IN BOOLEAN MODE.
The date index is not useful. There is no way to combine the two indexes.
Beware, if a user searches for something that shows up in 30K rows, the query will be slow. There is no straightforward way around it.
I suspect you have a TEXT column in the table? If so, there is hope. Instead of blindly doing SELECT *, let's first find the ids and get the LIMIT applied, then do the *.
SELECT a.*
    FROM tbl AS a
    JOIN ( SELECT date, id
               FROM tbl
               WHERE MATCH(...) AGAINST (...)
               ORDER BY date DESC
               LIMIT 10 ) AS x
        USING(date, id)
    ORDER BY date DESC;
Together with
PRIMARY KEY(date, id),
INDEX(id),
FULLTEXT(...)
This formulation and indexing should work like this:
Use FULLTEXT to find 30K rows, deliver the PK.
With the PK, sort 30K rows by date.
Pick the last 10, delivering date, id
Reach back into the table 10 times using the PK.
Sort again. (Yeah, this is necessary.)
More (Responding to a plethora of Comments):
The goal behind my reformulation is to avoid fetching all columns of 30K rows. Instead, it fetches only the PRIMARY KEY, then whittles that down to 10, then fetches * only 10 rows. Much less stuff shoveled around.
Concerning COUNT on an InnoDB table:
INDEX(col) makes it so that an index scan works for SELECT COUNT(*) or SELECT COUNT(col) without a WHERE.
Without INDEX(col), SELECT COUNT(*) will use the "smallest" index; but SELECT COUNT(col) will need a table scan.
A table scan is usually slower than an index scan.
Be careful of timing -- It is significantly affected by whether the index and/or table is already cached in RAM.
Another thing about FULLTEXT is the + in front of words -- to say that each word must exist, else there is no match. This may cut down on the 30K.
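To illustrate the + operator, here is a small sketch using the tbl and text names from the question. The search words are made-up examples, and how much the match count drops depends entirely on the data:

```sql
-- Natural-language mode: a row containing either word can match.
SELECT id
FROM tbl
WHERE MATCH(text) AGAINST('database tuning');

-- Boolean mode with +: both words must be present, so fewer rows
-- are likely to match and fewer rows need sorting.
SELECT id
FROM tbl
WHERE MATCH(text) AGAINST('+database +tuning' IN BOOLEAN MODE);
```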
The FULLTEXT index will deliver the date, id in random order, not PK order. Anyway, it is 'wrong' to assume any ordering, hence it is 'right' to add ORDER BY, then let the Optimizer toss it if it knows that it is redundant. And sometimes the Optimizer can take advantage of the ORDER BY (not in your case).
Removing just the ORDER BY, in many cases, makes a query run much faster. This is because it avoids fetching, say, 30K rows and sorting them. Instead it simply delivers "any" 10 rows.
(I have no experience with Postgres, so I cannot address that question.)

MySQL Index sometimes not being used

I have a table with 150k rows of data, and one column with a UNIQUE INDEX. It is of type VARCHAR(10) and stores 10-digit account numbers.
Now whenever I query, like a simple one:
SELECT * FROM table WHERE account_number LIKE '0103%'
It returns 30,000+ rows, and when I run EXPLAIN on the query it shows that no index is used.
But when I do:
SELECT * FROM table WHERE account_number LIKE '0104%'
It returns 4,000+ rows, with the index used.
Can anyone explain this?
I'm using MySQL 5.7 Percona XtraDB.
30k+/150k > 20%, so I guess the optimizer decides that a table scan is faster. From 8.2.1.19 Avoiding Full Table Scans:
The output from EXPLAIN shows ALL in the type column when MySQL uses a full table scan to resolve a query. This usually happens under the following conditions:
You are using a key with low cardinality (many rows match the key value) through another column. In this case, MySQL assumes that by using the key it probably will do many key lookups and that a table scan would be faster.
If you don't need all values try to use:
SELECT account_number FROM table WHERE account_number LIKE '0103%'
instead of SELECT *. Then your index becomes a covering index and the optimizer should always use it (as long as the WHERE condition is SARGable).
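One way to check this is to compare EXPLAIN output for the two forms. In this sketch, accounts is a placeholder table name (the question just calls it "table"):

```sql
-- Placeholder table name: accounts.
EXPLAIN SELECT * FROM accounts WHERE account_number LIKE '0103%';
-- type = ALL means a full table scan; type = range means the index
-- was used for the prefix match.

EXPLAIN SELECT account_number FROM accounts WHERE account_number LIKE '0103%';
-- Extra containing "Using index" indicates the index is covering,
-- so no table rows need to be read.
```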
Most databases use B-trees for indexing. In this case the optimizer does not use the index because it estimates that scanning the table without it is faster, as @lad2025 explained.
Your column is unique, so I think the cardinality of your index is high. But because your query uses a LIKE filter that matches so many rows, the optimizer chooses not to use the index.
You can try FORCE INDEX to see the result. You are using VARCHAR with a unique index; I would choose another data type or change the index type. If the column only contains numbers, change it to a numeric type. This will help optimize your query a lot.
In some cases when you have to use LIKE, you can use a full-text index.
If you need more help optimizing your query and table, give us more information, including what you want to fetch from the table.
lad2025 is correct. The database is attempting to make an intelligent optimization.
Benchmark with:
SELECT * FROM table FORCE INDEX(table_index) WHERE account_number LIKE '0103%'
and see who is smarter :-) You can always try your hand at questioning the optimizer. That's what index hints are for...
https://dev.mysql.com/doc/refman/5.7/en/index-hints.html

Combined Index performance with optional where clause

I have a table with the following columns:
id-> PK
customer_id-> index
store_id-> index
order_date-> index
last_modified-> index
other_columns...
other_columns...
I have three single-column indexes. I also have a customer_id_store_id index, which is a foreign key constraint referencing other tables.
id, customer_id, and store_id are char(36) UUIDs. order_date is a datetime and last_modified is a UNIX timestamp.
I want to gain some performance by removing all the indexes and adding one on (customer_id, store_id, order_date). Most queries will have these fields in the WHERE clause, but sometimes store_id will not be needed.
What is the best approach: adding "store_id IS NOT NULL" to the WHERE clause, or creating the index as (customer_id, order_date, store_id)?
I also frequently need to query the table by the last_modified field (the WHERE clause includes customer_id=, store_id=, last_modified>).
As I only have a single-column index on it and there are hundreds of customers inserting into and updating the table, the index often scans more rows than necessary. Is it better to create another index on (customer_id, store_id, last_modified) or leave it as it is? Or should I add this column to the previous index, making it a four-column composite index? But then order_date is irrelevant here, and omitting it might result in the index not being used as intended.
The query is fast for customers that don't have many rows, possibly using the customer_id index. But for customers with a large amount of data, this isn't optimal. More often than not I need only a few days of data.
Can anyone please advise on the best index in this scenario?
It is true that lots of single column indexes on a MySQL table are generally considered harmful.
A query with
WHERE customer_id=constant AND store_id=constant AND last_modified>=constant
will be accelerated by an index on (customer_id, store_id, last_modified). Why? The MySQL query planner can random-access the index to the first item it needs to retrieve, then scan the index sequentially. That same index works for
WHERE customer_id=constant AND store_id=constant
AND last_modified>=constant
AND last_modified< constant + INTERVAL 1 DAY
BUT, that index will not be useful for a query with just
WHERE store_id=constant AND last_modified>constant
or
WHERE customer_id=constant AND store_id IS NOT NULL AND last_modified>=constant
For the first of those query patterns you need (store_id, last_modified) to achieve the ability to sequentially scan the index.
The second of those query patterns requires two different range searches. One is something IS NOT NULL. That's a range search because it has to romp through all the non-null values in the column. The second range search is last_modified>=constant. That's a range search, because it starts with the first value of last_modified that meets the given criterion, and scans to the end of the index.
MySQL indexes are B-trees. That means, essentially, that they're sorted into a particular single order. So, an index is best for accelerating queries that require just one range search. So, the second query pattern is inherently hard to satisfy with an index.
A table can have multiple compound indexes designed to satisfy multiple different query patterns. That's usually the strategy for making large tables work well in practical applications. Each index imposes a little bit of performance penalty on updates and inserts. Indexes also take storage space. But storage is very cheap these days.
If you want to use a compound index to search on multiple criteria, these things must be true:
all but one of the criteria must be equality criteria like store_id = constant.
one criterion can be a range-scan criterion like last_modified >= constant or something IS NOT NULL.
the columns in the index must be ordered so that the columns involved in equality criteria all appear first, then the column involved in the range-scan criterion.
you may mention other columns after the range scan criterion. But they make up part of a covering index strategy (beyond the scope of this post).
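Applied to the question, the rules above might translate into something like the following. This is only a sketch: orders is a placeholder table name, the index names and literal values are invented, and which indexes are worth keeping depends on the real query mix.

```sql
-- Placeholder table name: orders. Equality columns first, range column last.

-- Serves: customer_id = ? AND store_id = ? AND last_modified > ?
CREATE INDEX idx_cust_store_modified
  ON orders (customer_id, store_id, last_modified);

-- Serves: customer_id = ? AND store_id = ? AND order_date range
CREATE INDEX idx_cust_store_date
  ON orders (customer_id, store_id, order_date);

-- Serves the case where store_id is omitted: customer_id = ? AND order_date range
CREATE INDEX idx_cust_date
  ON orders (customer_id, order_date);

-- Example query for the first index (UUIDs are dummy values;
-- last_modified is a UNIX timestamp, per the question).
SELECT *
FROM orders
WHERE customer_id = '00000000-0000-0000-0000-000000000001'
  AND store_id    = '00000000-0000-0000-0000-000000000002'
  AND last_modified >= UNIX_TIMESTAMP(NOW() - INTERVAL 3 DAY);
```

Running EXPLAIN on each real query is the way to confirm which index the planner actually picks.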
http://use-the-index-luke.com/ is a good basic intro to the black art of indexing.

Are there any advantage sorting on indexed fields

I want to perform a sort operation on some field. Is it advantageous to create an index on that field? For example:
SELECT * FROM `users` WHERE `age`=33 ORDER BY `name`
In this query, I know that having an index on age is helpful, but would it be better if I also maintained an index on name? Would there be a performance gain from indexing it? The ORDER BY operation is frequently needed for other queries as well.
An index on name alone would not be likely to help significantly with this query, but an index on (age, name) would.
While it's not entirely accurate, it's often instructive to think of an index as a list of rows sorted by the keys in the index (e.g, sorted by age first, then by name). In the case of your sample query, all the rows with age=33 would naturally come out of the composite index sorted by name, saving you from doing a separate sort. Having a separate index for name wouldn't help the same way.
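A short sketch of that composite index, with idx_age_name as an invented index name:

```sql
-- Composite index: rows are ordered by age, then by name within each age.
CREATE INDEX idx_age_name ON `users` (`age`, `name`);

-- With the index above, the age = 33 entries are already in name order.
-- If the optimizer uses it for both filtering and ordering, the Extra
-- column should not contain "Using filesort".
EXPLAIN SELECT * FROM `users` WHERE `age` = 33 ORDER BY `name`;
```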

Should I avoid COUNT all together in InnoDB?

Right now, I'm debating whether or not to use COUNT(id) or "count" columns. I heard that InnoDB COUNT is very slow without a WHERE clause because it needs to lock the table and do a full index scan. Is that the same behavior when using a WHERE clause?
For example, if I have a table with 1 million records. Doing a COUNT without a WHERE clause will require looking up 1 million records using an index. Will the query become significantly faster if adding a WHERE clause decreases the number of rows that match the criteria from 1 million to 500,000?
Consider the "Badges" page on SO, would adding a column in the badges table called count and incrementing it whenever a user earned that particular badge be faster than doing a SELECT COUNT(id) FROM user_badges WHERE user_id = 111?
Using MyISAM is not an option because I need the features of InnoDB to maintain data integrity.
SELECT COUNT(*) FROM tablename seems to do a full table scan.
SELECT COUNT(*) FROM tablename USE INDEX (colname) seems to be quite fast if the index available is NOT NULL, UNIQUE, and fixed-length. A non-UNIQUE index doesn't help much, if at all. Variable length indices (VARCHAR) seem to be slower, but that may just be because the index is physically larger. Integer UNIQUE NOT NULL indices can be counted quickly. Which makes sense.
MySQL really should perform this optimization automatically.
Performance of COUNT() is fine as long as you have an index that's used.
If you have a million records and the column in question is NOT NULL then a COUNT() will be a million quite easily. If NULL values are allowed, those aren't indexed so the number of records is easily obtained by looking at the index size.
If you're not specifying a WHERE clause, then the worst case is the primary key index will be used.
If you specify a WHERE clause, just make sure the column(s) are indexed.
I wouldn't say avoid, but it depends on what you are trying to do:
If you only need to provide an estimate, you could do SELECT MAX(id) FROM table. This is much cheaper, since it just needs to read the max value in the index.
If we consider the badges example you gave, InnoDB only needs to count up the number of badges that user has (assuming an index on user_id). I'd say in most cases that's not going to be more than 10-20, and it's not much harm at all.
It really depends on the situation. I probably would keep the count of the number of badges someone has on the main user table as a column (count_badges_awarded) simply because every time an avatar is shown, so is that number. It saves me having to do 2 queries.
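The two approaches could look roughly like this. The user_badges table, user_id, and count_badges_awarded come from the discussion above; the users table, its id column, badge_id, and the index name are assumptions made for the sketch.

```sql
-- Option 1: count on demand. With an index on user_id this only scans
-- the handful of index entries for that user.
CREATE INDEX idx_user_badges_user ON user_badges (user_id);
SELECT COUNT(*) FROM user_badges WHERE user_id = 111;

-- Option 2: keep a denormalized counter on the (assumed) users table,
-- updated in the same transaction that awards the badge.
START TRANSACTION;
INSERT INTO user_badges (user_id, badge_id) VALUES (111, 42);
UPDATE users
   SET count_badges_awarded = count_badges_awarded + 1
 WHERE id = 111;
COMMIT;

-- Reading the counter is then a single primary-key lookup.
SELECT count_badges_awarded FROM users WHERE id = 111;
```

The counter trades a slightly more expensive write for a cheaper read, which tends to pay off when the number is displayed far more often than it changes.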