Is it possible to optimize a query that uses the '<>' operator? - mysql

This is a follow-up to a previous question.
How can I optimize this query so that it does not perform a full table scan?
SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
explain SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | Employee | ALL | PRIMARY | NULL | NULL | NULL | 5000 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
(Employee.id is the primary key, in case that isn't clear.)

Have a covering index for name and id, and it should be able to fulfill the query using the index. This might be faster, because there's a good chance the entire index will already be in memory, while a table scan is more likely to need to go to disk.
Because of the low (non-existent) selectivity of your where clause you may need to provide a hint to get the database to use your index. I'm a SQL Server guy, so I'm not sure of the syntax needed in MySQL to hint an index, or even whether MySQL is able to take advantage of a covering index in this manner.
That said, I doubt you can get much improvement: you're returning every row but one. You should expect that to need to scan the table.

There are a lot of things to try; it really depends on how the database engine chooses to parse the query. Some options:
select employee.name from employee where employee.id not in (1000);
You could also try a union with a less than and then a greater than.
But in the specific example you are giving (which may just be too simple for your real case) a table scan isn't necessarily a bad thing. If all the records have to be returned except one, using an index may in fact be slower.

In traditional databases, you can't!
Of course, you could just omit all Employees with the given Id (when it is a key or has an index), but normally you will still have the vast majority of the table to read. So using an index might complicate things, and a full table scan (FTS) is normally the faster option.
When you have specialized databases, you could store the names of all employees adjacent to each other.
Edit: I have now seen Joel's other answer. Yes, this could be a way, since your special index is in fact a specialized form of storing part of the content. Good databases can use just the index content when it covers the columns needed, which is rather nice. Of course, you will end up with a so-called "full index scan" (but that is normally much faster than a full table scan).

Nothing you can do will increase performance. In this case the database must do a complete table scan, as you are asking for every record save one. Reading every page in an index on top of that would only reduce performance. Fortunately, even if you added an index, the database would be smart enough to ignore it...
EDIT to address @Juergen's comment.
Juergen, you are right about a covering index, but there are conflicting effects here. Any use of an index in a scenario like this has a cost: the query engine may have to perform one I/O operation for each level in the index, for each row it needs to examine. If there are, say, 5 levels in the index and 1M rows, that would be 5 million I/O operations, compared to only 1M I/Os to do a complete table scan. This is why, in this scenario, most query optimizers would ignore any available index and do the table scan anyway (unless you force them to use the index with a hint). The only mitigating factor is if EVERY attribute required by the query is in the index (a covering index) and the number of index rows per page on disk is sufficiently larger than the number of table rows per page to counteract the negative effect of having to traverse each level of the index for each row returned by the query.
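The arithmetic in that edit can be written out as a tiny cost model. This is only a sketch of the answer's own simplification (one I/O per index level per row, one I/O per row for a scan); the function names are made up for illustration, and real engines read pages rather than rows and usually keep the upper index levels cached.

```python
def index_walk_ios(rows: int, index_levels: int) -> int:
    """Worst case from the answer: one I/O per index level for every row examined."""
    return rows * index_levels


def table_scan_ios(rows: int) -> int:
    """The answer's simplification: one I/O per row for a full table scan."""
    return rows


rows, levels = 1_000_000, 5
index_cost = index_walk_ios(rows, levels)
scan_cost = table_scan_ios(rows)

# Print both costs and how many times more expensive the index walk is.
print(index_cost, scan_cost, index_cost // scan_cost)
```

Under these assumptions the index path costs `levels` times as much as the scan, which is the gap a covering index would have to close by packing more entries per page.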

Related

Two non-primary/unique indexes in a three column table

I've got a three-column table. It has a unique index, and another two indexes (on two different columns) for faster queries.
+-------------+-------------+----------+
| category_id | related_id | position |
+-------------+-------------+----------+
Sometimes the query is
SELECT * FROM table WHERE category_id = foo
and sometimes it's
SELECT * FROM table WHERE related_id = foo
So I decided to make both category_id and related_id an index for better performance. Is this bad practice? What are the downsides of this approach?
In case I already have 100,000 rows in that table and am inserting another 100,000, will it be overkill to have to refresh the index with every new insert? Would that operation then take too long? Thanks
There are no downsides if it's doing exactly what you want: you query on a specific column a lot, so you index that column. That's the whole point. If, on the other hand, you have a 60-column table and you're adding indexes to columns you never query on, then you are wasting resources, because those indexes need to be maintained on INSERT/UPDATE/DELETE operations.
If you have created an index for each column, you will definitely benefit from it.
Don't go for composite indexes (multiple-column indexes).
You can see the advantage of an index for your query yourself by using EXPLAIN (the statement provides information about how MySQL executes statements).
EXAMPLE:
EXPLAIN SELECT * FROM table WHERE category_id = foo;
Hope this will help.
~K
It's good to have indexes. Just understand that indexes take more disk space in exchange for faster searches.
It is in your best interest to index those fields which have few repeated values. For example, indexing a field that contains a Boolean flag might not be a good idea.
Since you are indexing IDs in your case, I don't think you will have any problem keeping the indexes that you have created.
Also, the inserts will be slower, but since you are storing IDs there won't be much of a difference in the time required to insert. Go ahead and do the insert.
My personal advice :
When you are inserting large number of rows in a single table in one go, don't insert them using a single query, unless mandatory. This would prevent your table from getting locked and inaccessible for a long time.
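A minimal sketch of that batching advice, using Python's built-in sqlite3 module as a stand-in for MySQL (locking behaviour differs between the two engines). The table name `category_relation`, the index names, the batch size of 1,000 and the sample data are all assumptions for illustration.

```python
import sqlite3
from itertools import islice


def batches(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring the three columns from the question.
conn.execute(
    "CREATE TABLE category_relation "
    "(category_id INTEGER, related_id INTEGER, position INTEGER)"
)
conn.execute("CREATE INDEX idx_category ON category_relation (category_id)")
conn.execute("CREATE INDEX idx_related ON category_relation (related_id)")

new_rows = ((i % 500, i, i % 10) for i in range(10_000))
batch_count = 0
for batch in batches(new_rows, 1_000):
    conn.executemany("INSERT INTO category_relation VALUES (?, ?, ?)", batch)
    conn.commit()  # commit per batch so no single transaction holds the table for long
    batch_count += 1

total = conn.execute("SELECT COUNT(*) FROM category_relation").fetchone()[0]
print(batch_count, total)
```

Whether smaller batches actually help depends on the storage engine and the concurrent workload, so it is worth measuring with your own data.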

Is it better to force index usage for an ORDER BY?

I'm currently trying to optimize a query generated by Doctrine 2 on this table:
CREATE TABLE `publication` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`global_order` int(11) NOT NULL,
`title` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
`slug` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
`type` varchar(7) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_AF3C6779B12CE9DB` (`global_order`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The query is
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
type is a discriminator column added by Doctrine. Although the WHERE clause is useless as type is always one of the IN values, I cannot remove it.
EXPLAIN shows me
+------+---------------+------+------+-----------------------------+
| type | possible_keys | key | rows | Extra |
+------+---------------+------+------+-----------------------------+
| ALL | NULL | NULL | 562 | Using where; Using filesort |
+------+---------------+------+------+-----------------------------+
(rows is different each time I execute the query)
After some reading I found I can force an index usage like this:
ALTER TABLE `publication` DROP INDEX `UNIQ_AF3C6779B12CE9DB` ,
ADD UNIQUE `UNIQ_AF3C6779B12CE9DB` ( `global_order` , `type` )
and
SELECT *
FROM publication
FORCE INDEX(UNIQ_AF3C6779B12CE9DB)
WHERE global_order > 0
AND type IN ('article', 'event', 'work')
ORDER BY global_order DESC
The WHERE clause is still useless, but this time EXPLAIN shows me
+-------+-----------------------+-----------------------+------+-------------+
| type | possible_keys | key | rows | Extra |
+-------+-----------------------+-----------------------+------+-------------+
| range | UNIQ_AF3C6779B12CE9DB | UNIQ_AF3C6779B12CE9DB | 499 | Using where |
+-------+-----------------------+-----------------------+------+-------------+
It seems better to me, but having to force an index does not seem to be common, so I wonder if it's really efficient for such a simple query.
Does anyone know what is the better way to perform this query?
Thanks!
If your query really is:
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
... and all entries (or nearly all) will match the IN clause, you're actually better off with no index at all. If you toss in a limit clause, then the index you'll want is actually on global_order, without the type field. The reason for this is, it actually costs something to read an index.
If you're going for the entire table, sequentially reading the table and sorting its rows in memory will be your cheapest plan. If you only need a few rows and most will match the where clause, going for the smallest index will do the trick.
To understand why, picture the disk IO involved.
Suppose you want the whole table without an index. To do this, you read data_page1, data_page2, data_page3, etc., visiting the various disk pages involved in order, until you reach the end of the table. You then sort and return.
If you want the top 5 rows without an index, you'd sequentially read the entire table as before, while heap-sorting the top 5 rows. Admittedly, that's a lot of reading and sorting for a handful of rows.
Suppose, now, that you want the whole table with an index. To do this, you read index_page1, index_page2, etc., sequentially. This then leads you to visit, say, data_page3, then data_page1, then data_page3 again, then data_page2, etc., in a completely random order (that by which the sorted rows appear in the data). The IO involved makes it cheaper to just read the whole mess sequentially and sort the grab bag in memory.
If you merely want the top 5 rows of an indexed table, in contrast, using the index becomes the correct strategy. In the worst case scenario you load 5 data pages in memory and move on.
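The "heap-sorting the top 5 rows" step mentioned above can be illustrated in memory. This is a sketch in plain Python, not what any particular engine does internally; the row contents are invented, with 562 rows chosen only to echo the EXPLAIN estimate in the question.

```python
import heapq
import random

random.seed(0)
# Hypothetical rows: (global_order, title); global_order is unique, as in the schema.
orders = random.sample(range(1, 100_000), 562)
rows = [(order, f"title-{order}") for order in orders]

# Full sort then slice: every row is sorted even though only five are returned.
top5_sorted = sorted(rows, key=lambda r: r[0], reverse=True)[:5]

# Bounded heap: one pass over the rows that keeps only the five largest seen so far.
top5_heap = heapq.nlargest(5, rows, key=lambda r: r[0])

print(top5_heap == top5_sorted)
```

Both approaches still read every row, which is the point of the paragraph above: without an index on global_order, a LIMIT saves sorting work but not the scan.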
A good SQL query planner, btw, will make its decision on whether to use an index or not based on how fragmented your data is. If fetching rows in order means zooming back and forth across the table, a good planner may decide that it's not worth using the index. In contrast, if the table is clustered using that same index, the rows are guaranteed to be in order, increasing the likelihood that it'll get used.
But then, if you join the same query with another table and that other table has an extremely selective where clause that can use a small index, the planner might decide it's actually better to, e.g. fetch all IDs of rows that are tagged as foo, hash join them with publications, and heap sort them in memory.
MySQL tries to determine the best way to run a given query, and decides whether or not to use indexes based on what it thinks is the best.
It isn't always correct. Sometimes manually forcing a query to use an index is faster, sometimes it's not.
If you run some testing with sample data in your specific situation, you should be able to see which method performs faster, and stick with that one.
Make sure you take into account query caching to get an accurate performance benchmark.
Forcing the use of an index is rarely the best answer. In general it is better to create and/or optimize the indices (indexes) so that MySQL chooses to use them. (It is even better to optimize the queries, but I understand you cannot do that here.)
When you are using something like Doctrine where you cannot optimize the queries and the indices don't help, your best bet is to focus on query caching. :-)

MySQL performance boost after create & drop index

I have a large MySQL, MyISAM table of around 4 million rows running in a core 2 duo, 8G RAM laptop.
This table has 30 columns including varchar, decimal and int types.
I have an index on a varchar(16). Let's call this column: "indexed_varchar_column".
My query is
SELECT 9 columns FROM the_table WHERE indexed_varchar_column = 'something';
It always returns around 5000 rows for every 'something' I query against.
An EXPLAIN to the query returns this:
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| 1 | SIMPLE | the_table | ref | many indexes including indexed_varchar_column | another_index NOT: indexed_varchar_column! | 19 | const | 5247 | Using where |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
The first thing I'm not sure about is why another_index is chosen. In fact it chooses an index which is a composite index of indexed_varchar_column and another 2 columns (which are among the selected ones). Perhaps this makes sense, since it may make things a bit faster by not having to read 2 of the columns in the query. The real QUESTION is the following one:
The query takes 5 seconds for every 'something' I match. On the 2nd time I query against 'something' it takes 0.15 secs (I guess because the query is being cached). When I run another query against 'something_new' it takes again 5 seconds. So, it is consistent.
THE PROBLEM IS: I discovered that creating an index (another composite index including my indexed_varchar_column) and dropping it again makes all further queries against a new 'something_other' take only 0.15 secs. Please note that I 1) create an index and 2) drop it again, so everything is in the same state as before.
I guess all the operations needed for building and dropping indices make the SQL engine cache something that is then reused. When I run EXPLAIN on a query after all this I get exactly the same as before.
How can I proceed to understand what is cached in the create-drop index procedure so that I can cache it without manipulating indices?
UPDATE:
Following a comment from Marc B that suggested that when MySQL creates an index it internally does a SELECT... I tried the following:
SELECT * FROM my_table;
It took 30 secs and returned 4 million rows. The good thing is that all further queries are very fast again (until I reboot the system). Please note that after rebooting the queries are slow again. I guess this is because MySQL is using some sort of OS caching.
Any ideas? How can I explicitly cache the table?
UPDATE 2:
Perhaps I should have mentioned that this table may be severely fragmented. It's 4 million rows but I remove lots of old fields regularly. I also add new ones. Since I had large gaps in IDs (for the rows deleted) every day I drop the primary index (ID) and create it again with consecutive numbers. The table may be then very fragmented and therefore IO must be an issue... Not sure what to do.
Thanks everybody for your help.
Finally I discovered (thanks to the hint of Marc B) that my table was severely fragmented after many INSERTs and DELETEs. I updated the question with this info some hours ago. There are two things that help:
1)
ALTER TABLE my_table ORDER BY indexed_varchar_column;
2) Running:
myisamchk --sort-records=4 my_table.MYI (where 4 corresponds to my index)
I believe both commands are equivalent. Queries are fast even after a system reboot.
I've put this ALTER TABLE ORDER BY command on a cron that is run everyday. It takes 2 minutes but it's worth it.
How many indexes do you have that contain the indexed_varchar_column? Do you have a single index for just the indexed_varchar_column?
Have you tried:
SELECT 9 columns FROM the_table USE INDEX (name_of_index) WHERE indexed_varchar_column = 'something';
What is the order of the columns in your composite index?
Your query must use (at least) a leftmost prefix of the index's columns.
If you have an index on foo, bar, and baz, it will not be usable as an index against bar or baz by themselves. Only (foo), (foo,bar), and (foo,bar,baz).
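The leftmost-prefix rule above can be expressed as a few lines of Python. This is a deliberately simplified sketch: the helper `usable_prefix` is hypothetical, it considers equality filters only, and it ignores range predicates and any optimizer-specific tricks.

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that equality filters on `equality_columns` can use."""
    prefix = []
    for column in index_columns:
        if column not in equality_columns:
            break  # the first missing column ends the usable prefix
        prefix.append(column)
    return prefix


index = ["foo", "bar", "baz"]
for filters in ({"foo"}, {"foo", "bar"}, {"bar"}, {"foo", "baz"}):
    print(sorted(filters), "->", usable_prefix(index, filters))
```

An empty result means the composite index cannot be used for lookups on that set of filters; a filter on foo and baz can only use the foo part.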
EXPLAIN is your friend here. It will tell you which index, if any, is being used by a query.
EDIT Here's a postgres explain of a simple left join query for comparison.
Nested Loop Left Join (cost=0.00..16.97 rows=13 width=103)
Join Filter: (pagesets.id = pages.pageset_id)
-> Index Scan using ix_pages_pageset_id on pages (cost=0.00..8.51 rows=13 width=80)
Index Cond: (pageset_id = 515)
-> Materialize (cost=0.00..8.27 rows=1 width=23)
-> Index Scan using pagesets_pkey on pagesets (cost=0.00..8.27 rows=1 width=23)
Index Cond: (id = 515)

The explain tells that the query is awful (it doesn't use a single key) but I'm using LIMIT 1. Is this a problem?

The explain command with the query:
explain SELECT * FROM leituras
WHERE categorias_id=75 AND
textos_id=190304 AND
cookie='3f203349ce5ad3c67770ebc882927646' AND
endereco_ip='127.0.0.1'
LIMIT 1
The result:
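+----+-------------+----------+------+---------------+--------+---------+--------+---------+-------------+
| id | select_type | table    | type | possible_keys | key    | key_len | ref    | rows    | Extra       |
+----+-------------+----------+------+---------------+--------+---------+--------+---------+-------------+
| 1  | SIMPLE      | leituras | ALL  | (null)        | (null) | (null)  | (null) | 1022597 | Using where |
+----+-------------+----------+------+---------------+--------+---------+--------+---------+-------------+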
Will it make any difference to add some keys to the table, even though the query will always return only one row?
In answer to your question, yes. You should add indexes where necessary (thanks @Col.Shrapnel) on the columns that appear in your WHERE clause - in this case, categorias_id, textos_id, cookie, and endereco_ip.
If you always perform a query using the same columns in the WHERE clause, it may be beneficial to add one index that comprises all of those columns, rather than adding individual indexes.
It still has to do a linear search over the table until it finds that one row. So adding indexes could noticeably improve performance.
Yes, indexes are even more important when you want to return only one row.
If you are returning half of the rows and your database system has to scan the entire table, you're still at 50% efficiency.
However, if you want to return just one row, and your database system has to scan 1022597 rows to find your row, your efficiency is minuscule.
LIMIT 1 does offer some efficiency in that it stops as soon as it finds the first matching row, but it obviously has to scan an enormous number of records to find that first row.
Adding an index for each of the columns in your WHERE clause allows your database system to avoid scanning rows that don't match your criteria. With adequate indexes, you'll see that the rows column in the explain will get closer to the actual number of returned rows.
Using a compound index that covers all four of the columns in your WHERE clause allows even better performance and less scanning, as the index will provide full coverage. Compound indexes do use a lot of memory and negatively affect insert performance, so you might only want to add a compound index if a large percentage of your queries repeatedly do a look up on the same columns, or if you rarely insert records, or it's just that important to you for that particular query to be fast.
Another way to improve performance is to return only the columns that you need rather than using SELECT *. If you had a compound index on those four columns, and you returned only those four columns, your database system wouldn't need to hit your records at all. The database system could get everything it needed right from the indexes.
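As a rough illustration of the compound-index advice, the sketch below uses Python's built-in sqlite3 module rather than MySQL, so the plan text and optimizer choices will not match MySQL's EXPLAIN exactly. The column types, the index name `idx_leituras_lookup` and the sample rows are assumptions; only the table name, column names and the query come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE leituras ("
    " id INTEGER PRIMARY KEY,"
    " categorias_id INTEGER, textos_id INTEGER,"
    " cookie TEXT, endereco_ip TEXT)"
)
conn.executemany(
    "INSERT INTO leituras (categorias_id, textos_id, cookie, endereco_ip)"
    " VALUES (?, ?, ?, ?)",
    [(i % 100, i, f"cookie-{i}", "127.0.0.1") for i in range(10_000)],
)

QUERY = (
    "SELECT * FROM leituras"
    " WHERE categorias_id = 75 AND textos_id = 190304"
    " AND cookie = '3f203349ce5ad3c67770ebc882927646'"
    " AND endereco_ip = '127.0.0.1' LIMIT 1"
)


def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


plan_before = plan(QUERY)  # no usable index yet: expect a full scan
conn.execute(
    "CREATE INDEX idx_leituras_lookup"
    " ON leituras (categorias_id, textos_id, cookie, endereco_ip)"
)
plan_after = plan(QUERY)  # the compound index should now be picked up

print(plan_before)
print(plan_after)
```

In MySQL the equivalent check would be to rerun EXPLAIN after adding the index and compare the key and rows columns.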

What indexes can be used to improve this query?

This query selects all the unique visitor sessions in a certain date range:
select distinct(accessid) from accesslog where date > '2009-09-01'
I have indexes on the following fields:
accessid
date
some other fields
Here's what explain looks like:
mysql> explain select distinct(accessid) from accesslog where date > '2009-09-01';
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| 1 | SIMPLE | accesslog | range | date,dateurl,dateaff | date | 3 | NULL | 64623 | Using where; Using temporary |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
mysql> explain select distinct(accessid) from accesslog;
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| 1 | SIMPLE | accesslog | index | NULL | accessid | 257 | NULL | 1460253 | Using index |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
Why doesn't the query with the date clause use the accessid index?
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Edit - Resolution
Reducing column width on accessid from varchar 255 to char 32 improved query time by ~75%.
Adding a date+accessid index had no effect on query time.
An index on (date,accessid) could help. However, before tweaking indices I'd recommend checking the type of your accessid column. EXPLAIN says the key is 257 bytes long, which sounds like a lot for an ID column. Are you using a VARCHAR(256) for accessid? If so, can't you use a more compact type? If it's a number, it should be INT (SMALLINT, BIGINT, whatever fits your needs) and if it's an alphanumeric ID, can it really be 256 chars long? If its length is fixed, can't you use CHAR (CHAR(32) for example) instead?
Your problem is that your condition is a range clause (on the date column).
A multi-column index of date->accessid likely won't help the situation, as MySQL can't use index columns after a range condition. In theory it should be able to use the index to cover the computation in this case, but this appears to be a shortcoming in MySQL: I've never gotten it to use a multi-column index successfully in this situation.
You can try creating an index on (date,accessid) hoping that it will use it to cover the query (so you won't need to hit any tables), but I don't hold much hope. There's not a great deal you can do.
Edit:
My answer is courtesy of High Performance MySQL - Second Edition, worth its weight in gold if you have to do serious MySQL development.
Why doesn't the query with the date clause use the accessid index?
Because using the date index is more efficient. That's because it's likely to pare the search space down faster.
At least one DBMS (DB2/z, I don't know much about MySQL) would benefit from an index on date+accessid since the access IDs would be sorted within dates in that index. That DBMS will use the date+accessid key to efficiently use the where clause to whittle down the search space and to return distinct values of accessid within that space.
Whether MySQL is that smart, I have no idea. My suggestion would be to try it and see (which is the best answer to most DB optimization questions).
The query uses the 'date' index because that's what you use in the where clause.
This is the only sensible option. If it used the accessid index, it would need to read all the accessid rows, then check the date for each one, and only then decide whether it was distinct.
If this is a really big table a compound index on date and accessid might help.
Why doesn't the query with the date clause use the accessid index?
Because using the date index allows it to ignore a large part of the data in the table. The chances are that the table holds mostly historical data, and a lot of it refers to dates a lot longer ago than the beginning of the current month, so the date criterion is selective and reduces the workload for the optimizer by allowing it to ignore most of the data.
If it used the accessid index, it would also have to read each row (as well as each index entry) to see whether the date meets the search criterion. This means reading the whole of the index and the whole of the table - in fact, it would do better in this context to ignore the index, but I started off with "if it used the accessid index".
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Depending on the sophistication of the optimizer, an index on (date, accessid) might improve things. It can do range searches on the leading column of the index, and the trailing column means that it does not have to refer to the data in the table to establish the accessid - the information is in the index. So, this might convert a query that accesses an index and a table into one that only accesses the index - which will reduce the amount of I/O needed and therefore improve the performance of the query.
If you have other criteria that need data from other columns, or you need to return more than just the unique accessid values, then you end up reading part of the table data; this is probably still a win compared with scanning the whole of the table.
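The index-only idea described above can be tried out with Python's built-in sqlite3 module. This is a stand-in, not MySQL: SQLite's optimizer may well behave differently from MySQL's here (the question's own resolution reports no gain from this index), and the column types, the index name `idx_date_accessid` and the sample data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accesslog "
    "(id INTEGER PRIMARY KEY, accessid TEXT, date TEXT, url TEXT)"
)
conn.executemany(
    "INSERT INTO accesslog (accessid, date, url) VALUES (?, ?, ?)",
    [(f"session-{i % 50}", f"2009-{(i % 12) + 1:02d}-15", "/") for i in range(1200)],
)
# Range column first, then the column the query returns.
conn.execute("CREATE INDEX idx_date_accessid ON accesslog (date, accessid)")

QUERY = "SELECT DISTINCT accessid FROM accesslog WHERE date > '2009-09-01'"

# The fourth column of EXPLAIN QUERY PLAN holds the human-readable plan step.
plan_details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
sessions = sorted(row[0] for row in conn.execute(QUERY))

print(plan_details)
print(len(sessions))
```

If the plan mentions the composite index without touching the table, the query is being answered from the index alone, which is the saving the answer describes.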
I have no way of testing it, but I would definitely try to add an index spanning both accessid and date.
Index optimization is often like alchemy. Different DBMSs behave differently, and sometimes you simply need to try (and fail) various combinations. I'm not saying it's not possible to reason. It is in many cases, but only up to a certain point. Often it's simply faster and easier to follow your instinct.