What indexes can be used to improve this query?

This query selects all the unique visitor sessions in a certain date range:
select distinct(accessid) from accesslog where date > '2009-09-01'
I have indexes on the following fields:
accessid
date
some other fields
Here's what explain looks like:
mysql> explain select distinct(accessid) from accesslog where date > '2009-09-01';
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| id | select_type | table     | type  | possible_keys        | key  | key_len | ref  | rows  | Extra                        |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
|  1 | SIMPLE      | accesslog | range | date,dateurl,dateaff | date | 3       | NULL | 64623 | Using where; Using temporary |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
mysql> explain select distinct(accessid) from accesslog;
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table     | type  | possible_keys | key      | key_len | ref  | rows    | Extra       |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
|  1 | SIMPLE      | accesslog | index | NULL          | accessid | 257     | NULL | 1460253 | Using index |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
Why doesn't the query with the date clause use the accessid index?
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Edit - Resolution
Reducing column width on accessid from varchar 255 to char 32 improved query time by ~75%.
Adding a date+accessid index had no effect on query time.

An index on (date,accessid) could help. However, before tweaking indices I'd recommend checking the type of your accessid column. EXPLAIN says the key is 257 bytes long, which sounds like a lot for an ID column. Are you using a VARCHAR(255) for accessid? If so, can you use a more compact type? If it's a number, it should be an INT (or SMALLINT, BIGINT, whatever fits your needs); and if it's an alphanumeric ID, can it really be 255 characters long? If its length is fixed, can't you use CHAR (CHAR(32), for example) instead?
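As a sketch of that change (assuming the IDs really are fixed-length 32-character strings, e.g. MD5 hex digests, and never NULL; verify against your data before altering anything):
-- Assumption: every accessid is exactly 32 characters and NOT NULL.
-- CHAR(32) shrinks each index entry considerably compared with the
-- 257-byte key_len reported by EXPLAIN above.
ALTER TABLE accesslog MODIFY accessid CHAR(32) NOT NULL;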

Your problem is that your condition is a range clause (on the date column).
A multi-column index of date->accessid likely won't help the situation, as MySQL can't use index columns that come after a range condition. In theory it should be able to use such an index to cover the computation in this case, but it appears to be a shortcoming in MySQL; I've never successfully gotten it to use a multi-column index in this situation.
You can try creating an index on (date,accessid), hoping that it will use it to cover the query (so you won't need to hit any tables), but I don't hold out much hope. There's not a great deal you can do.
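A minimal sketch of that attempt (the index name is mine):
ALTER TABLE accesslog ADD INDEX idx_date_accessid (`date`, accessid);
-- Success looks like "Using index" (covering) in the Extra column:
EXPLAIN SELECT DISTINCT accessid FROM accesslog WHERE `date` > '2009-09-01';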
Edit:
My answer is courtesy of High Performance MySQL - Second Edition, worth its weight in gold if you have to do serious MySQL development.

Why doesn't the query with the date clause use the accessid index?
Because using the date index is more efficient. That's because it's likely to pare the search space down faster.
At least one DBMS (DB2/z; I don't know much about MySQL) would benefit from an index on date+accessid, since the access IDs would be sorted within dates in that index. That DBMS can apply the where clause against the date+accessid key to whittle down the search space efficiently, and then return the distinct values of accessid within that space.
Whether MySQL is that smart, I have no idea. My suggestion would be to try it and see (which is the best answer to most DB optimization questions).

The query uses the 'date' index because that's what you use in the where clause.
This is the only sensible option: if it used the accessid index, it would need to read every accessid entry, then look up each row to check its date, and only then decide whether the accessid was distinct.
If this is a really big table a compound index on date and accessid might help.

Why doesn't the query with the date clause use the accessid index?
Because using the date index allows it to ignore a large part of the data in the table. The chances are that the table holds mostly historical data, and a lot of it refers to dates a lot longer ago than the beginning of the current month, so the date criterion is selective and reduces the workload for the optimizer by allowing it to ignore most of the data.
If it used the accessid index, it would also have to read each row (as well as each index entry) to see whether the date meets the search criterion. This means reading the whole of the index and the whole of the table; in fact, it would do better in this context to ignore the index, but I started off with "if it used the accessid index".
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Depending on the sophistication of the optimizer, an index on (date, accessid) might improve things. It can do range searches on the leading column of the index, and the trailing column means that it does not have to refer to the data in the table to establish the accessid; the information is in the index. So, this might convert a query that accesses an index and a table into one that only accesses the index, which will reduce the amount of I/O needed and therefore improve the performance of the query.
If you have other criteria that need data from other columns, or you need to return more than just the unique accessid values, then you end up reading part of the table data; this is probably still a win compared with scanning the whole of the table.

I have no way of testing it, but I would definitely try adding an index spanning both accessid and date.
Index optimization is often like alchemy. Different DBMSs behave differently, and sometimes you simply need to try (and fail with) various combinations. I'm not saying it's not possible to reason about it; it is, in many cases, but only up to a certain point. Often it's simply faster and easier to follow your instinct.

Related

What is the "Default order by" for a mysql Innodb query that omits the Order by clause?

So I understand, and have found posts indicating, that it is not recommended to omit the ORDER BY clause in a SQL query when you are retrieving data from the DBMS.
Resources & posts consulted (will be updated):
SQL Server UNION - What is the default ORDER BY Behaviour
When no 'Order by' is specified, what order does a query choose for your record set?
https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql
Questions:
See the logic of the question below if you want to know more.
My question is: under MySQL with the InnoDB engine, does anyone know how the DBMS effectively gives us the results?
I read that it is implementation dependent, OK, but is there a way to know it for my current implementation?
Where is this defined exactly?
Is it from MySQL, InnoDB, OS-dependent?
Isn't there some kind of list out there?
Most importantly, if I omit the order by clause and get my result, I can't be sure that this code will still work with newer database versions or that the DBMS will always give me the same result, can I?
Use case & logic:
I'm currently writing a CRUD API, and I have a table in my DB that doesn't contain an "id" field (there is a PK, though), so when I'm showing the results of that table without any search criteria, I don't really have a clue what I should use to order the results. I mean, I could use the PK or any field that is never null, but it wouldn't make the ordering meaningful. So I was wondering: as my CRUD is supposed to work for any table, and I don't want to solve this problem by adding an exception for this specific table, I could also simply omit the order by clause.
Final note:
As I read other posts, examples and code samples, I feel like I may be going too far with this. I understand that it is common knowledge that it's simply bad practice to omit the ORDER BY clause in a request, and that there is no reliable default order clause; indeed, there is no order at all unless you specify it.
I'd just love to know where this behavior is defined, and would love to learn how it works internally, or at least where it's determined (DBMS / storage engine / OS-dependent / other / multiple criteria). I think it would also benefit other people to know it, and to understand the inner mechanisms at play here.
Thanks for taking the time to read anyway! Have a nice day.
Without a clear ORDER BY, current versions of InnoDB return rows in the order of the index it reads from. Which index varies, but it always reads from some index. Even reading from the "table" is really an index—it's the primary key index.
As in the comments above, there's no guarantee this will remain the same in the next version of InnoDB. You should treat it as coincidental behavior; it is not documented, and the makers of MySQL don't promise not to change it.
Even if their implementation doesn't change, reading in index order can cause some strange effects that you might not expect, and which won't give you query result sets that make sense to you.
For example, the default index is the clustered index, PRIMARY. It means index order is the same as the order of values in the primary key (not the order in which you insert them).
mysql> create table mytable ( id int primary key, name varchar(20));
mysql> insert into mytable values (3, 'Hermione'), (2, 'Ron'), (1, 'Harry');
mysql> select * from mytable;
+----+----------+
| id | name     |
+----+----------+
|  1 | Harry    |
|  2 | Ron      |
|  3 | Hermione |
+----+----------+
But if your query uses another index to read the table (for example, if you only access columns covered by a secondary index), you'll get rows in that order:
mysql> alter table mytable add key (name);
mysql> select name from mytable;
+----------+
| name     |
+----------+
| Harry    |
| Hermione |
| Ron      |
+----------+
This shows it's reading the table by using an index-scan of that secondary index on name:
mysql> explain select name from mytable;
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table   | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | mytable | index | NULL          | name | 83      | NULL |    3 | Using index |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
In a more complex query, it can become very tricky to predict which index InnoDB will use for a given query. The choice can even change from day to day, as your data changes.
All this goes to show: You should just use ORDER BY if you care about the order of your query result set!
Bill's answer is good. But not complete.
If the query is a UNION, it will (I think) deliver first the results of the first SELECT (according to the rules), then the results of the second. Also, if the table is PARTITIONed, it is likely to do a similar thing.
GROUP BY may sort by the grouping expressions, thereby leading to a predictable order, or it may use a hashing technique, which scrambles the rows. I don't know how to predict which.
A derived table used to produce an ordered list whose order propagated into the parent query. But recently, the ORDER BY in that subquery is simply thrown away (unless there is a LIMIT)!
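For example, here's a quick sketch you can try against Bill's mytable above (behavior as of MySQL 5.7; treat it as something to verify, not a guarantee):
-- No LIMIT: recent versions merge the derived table and discard the
-- inner ORDER BY, so the outer query's row order is undefined.
SELECT * FROM (SELECT * FROM mytable ORDER BY name) AS dt;
-- With a LIMIT, the inner ORDER BY must be honored just to pick the rows.
SELECT * FROM (SELECT * FROM mytable ORDER BY name LIMIT 2) AS dt;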
Bottom Line: If you care about the order, add an ORDER BY, even if it seems unnecessary based on this Q & A.
MyISAM, in contrast, starts with this premise: The default order is the order in the .MYD file. But DELETEs leave gaps, UPDATEs mess with the gaps, and INSERTs prefer to fill in gaps over appending to the file. So, the row order is rather unpredictable. ALTER TABLE x ORDER BY y temporarily sets the .MYD order; this 'feature' does not work for InnoDB.

Why does it contain "Using where"?

Here is my table schema.
CREATE TABLE `usr_block_phone` (
`usr_block_phone_uid` BIGINT (20) UNSIGNED NOT NULL AUTO_INCREMENT,
`time` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`usr_uid` INT (10) UNSIGNED NOT NULL,
`block_phone` VARCHAR (20) NOT NULL,
`status` INT (4) NOT NULL,
PRIMARY KEY (`usr_block_phone_uid`),
KEY `block_phone` (`block_phone`),
KEY `usr_uid_block_phone` (`usr_uid`, `block_phone`) USING BTREE,
KEY `usr_uid` (`usr_uid`) USING BTREE
) ENGINE = INNODB DEFAULT CHARSET = utf8
And This is my SQL
SELECT
ubp.usr_block_phone_uid
FROM
usr_block_phone ubp
WHERE
ubp.usr_uid = 19
AND ubp.block_phone = '80000000001'
By the way, when I ran EXPLAIN, I got the following result.
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
| id   | select_type | table | type | possible_keys                           | key                 | key_len | ref         | rows | Extra                    |
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
|    1 | SIMPLE      | ubp   | ref  | block_phone,usr_uid_block_phone,usr_uid | usr_uid_block_phone | 66      | const,const |    1 | Using where; Using index |
+------+-------------+-------+------+-----------------------------------------+---------------------+---------+-------------+------+--------------------------+
Why is the index usr_uid_block_phone not working the way I expect?
I want the Extra column to show "Using index" only.
This table has 20000 rows now.
Your index is actually used; see the key column. As it stands, the query looks good and the execution plan is fine.
Note that the optimizer's choices on a 20,000-row table can differ from what you'd see in production; fill it with at least a hundred thousand rows (and ensure you still use a predicate that filters just one row) before drawing conclusions.
And a general piece of advice: it's next to impossible to predict how the optimiser will behave in a particular situation unless you're a MySQL DBMS developer yourself. So it's always better to test on a dataset that is as close (in terms of size and quality of data) to your production data as possible.
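A quick way to do that with this very table is to let it double itself until it reaches a realistic size, then look at the plan again (a sketch; add more value variety if your production data has it):
-- Each run doubles the row count (usr_block_phone_uid auto-increments).
INSERT INTO usr_block_phone (`time`, usr_uid, block_phone, status)
SELECT `time`, usr_uid, block_phone, status FROM usr_block_phone;
-- Refresh the index statistics, then re-check the plan.
ANALYZE TABLE usr_block_phone;
EXPLAIN SELECT usr_block_phone_uid FROM usr_block_phone
WHERE usr_uid = 19 AND block_phone = '80000000001';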
Both columns that are used in the WHERE clause (usr_uid and block_phone) are present in the usr_uid_block_phone index, and this makes it a possible key to be used to process the query. Even more, it is the index selected; but because of the small number of rows in the table, MySQL may decide that it is faster not to use an index.
The reason is in the expressions present in the SELECT clause:
SELECT
ubp.usr_block_phone_uid
Because the column usr_block_phone_uid is not present in the selected index, in order to process the query MySQL needs to read both the index (to determine what rows match the WHERE conditions) and the table data (to get the value of column usr_block_phone_uid of those rows).
It is faster to read only the table data, using the WHERE conditions to find the matching rows and get their usr_block_phone_uid column; that way it reads data from storage in just one place. If it used an index, it would have to read both the index data and the same table data.
The situation (and the report of EXPLAIN) changes when the table grows. At some point, reading information from the index (and using it to filter out rows) is compensated by the large number of rows that are filtered out (i.e. their data is not read from the storage).
The exact point when this happens is not fixed. It depends a lot on the structure of your table and how the values in the table are spread out. Even when the table is large, MySQL can decide to ignore the index in order to read less information from the storage medium. For example, if a large percentage (let's say 90%) of the table rows match the WHERE condition, it is more efficient to read all the table data (and ignore the index) than to read 90% of the table data and 90% of the index.
90% in the previous paragraph is a figure I made up for explanation purposes. I don't know how MySQL decides that it's better to ignore the index.

Stop MySQL after first match

I noticed that adding LIMIT 1 at the end of a query does not decrease the execution time. I have a few thousand records and a simple query. How do I make MySQL stop after the first match?
For example, these two queries both take approximately half a second:
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes
SELECT id,content FROM links WHERE LENGTH(content)<500 ORDER BY likes LIMIT 1
Edit: And here are the EXPLAIN results:
+----+-------------+-------+------+---------------+------+---------+------+-------+-----------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+-------+------+---------------+------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | links | ALL  | NULL          | NULL | NULL    | NULL | 38556 | Using where; Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+-------+-----------------------------+
The difference between the two queries' run times depends on the actual data.
There are several possible scenarios:
There are many records with LENGTH(content)<500
In this case, MySQL will start scanning all table rows (according to primary key order since you didn't provide any ORDER BY).
There is no index use since your WHERE condition can't be indexed.
Since there are relatively many rows with LENGTH(content)<500, the LIMIT query will return faster than the other one.
There are no records with LENGTH(content)<500
Again, MySQL will start scanning all table rows, but will have to go through all records to figure out none of them satisfies the condition.
Again no index can be used for the same reason.
In this case - the two queries will have exactly the same run time.
Anything between those two scenarios will have different run times, which will be farther apart as you have more valid records in the table.
Edit
Now that you added the ORDER BY, the answer is a bit different:
If there is an index on the likes column, ORDER BY will use it, and the total time is the time it takes to reach the first record that satisfies the WHERE condition (if 66% of the records do, then this should be faster than without the LIMIT).
If there is no index on the likes column, the ORDER BY will take most of the time: MySQL must scan the whole table to get all records that satisfy the WHERE, then order them by likes, and then take the first one.
In this case both queries will have similar run times (scanning and sorting the results takes much longer than returning 1 record or many records...)!
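If you want to test the indexed case, here's a sketch (the index name idx_likes is mine):
ALTER TABLE links ADD INDEX idx_likes (likes);
-- With the index, MySQL can walk rows in likes order and stop at the
-- first row whose content is shorter than 500 characters.
EXPLAIN SELECT id, content FROM links
WHERE LENGTH(content) < 500 ORDER BY likes LIMIT 1;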
Calling functions on column data forces a table scan; such expressions can't use an index. What you can do, if performance is a concern here, is create a derived column where you've saved this value in advance:
ALTER TABLE links ADD COLUMN content_length INT;
UPDATE links SET content_length = LENGTH(content);
ALTER TABLE links ADD INDEX idx_content_length (content_length);
Once denormalized and indexed like this, you'll be able to run the query much faster. Keep in mind you'll have to populate content_length each time you add a record.
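On MySQL 5.7 or later, a stored generated column is a sketch of the same idea without the manual backfill; the server keeps the value in sync on every insert and update:
-- If you already created the manual column/index above, drop them first.
ALTER TABLE links
ADD COLUMN content_length INT AS (LENGTH(content)) STORED,
ADD INDEX idx_content_length (content_length);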

MySQL Hash Indexes for Optimization

So maybe this is a noob question, but I'm messing with a couple of tables.
I have TABLE A, roughly 45,000 records.
I have TABLE B, roughly 1.5 million records.
I have a query:
UPDATE schema1.tablea a
INNER JOIN (
    SELECT DISTINCT
        ID, Lookup,
        IDpart1, IDpart2
    FROM schema1.tableb
    WHERE IDpart1 IS NOT NULL
      AND Lookup IS NOT NULL
    ORDER BY ID, Lookup
) b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
    a.Elg_IDpart2 = b.IDpart2
WHERE a.ID IS NOT NULL
  AND a.Elg_IDpart1 IS NULL
So I am forcing the index on ID, Lookup. Each table does have an index on those columns as well, but because of the sub-query I forced it.
It is taking FOREVER to run, and it really should take, I'd imagine, under 5 minutes...
My questions are in regard to the indexes, not the query.
I know that you can't use a hash index for ordered retrieval.
I currently have indexes on ID and Lookup separately, and as one combined index, and they are B-tree indexes. Based on my WHERE clause, is a hash index a fitting optimization technique?
Can I have a single hash index, and the rest of the indexes be B-tree indexes?
This is not a primary key field.
I would post my EXPLAIN, but I changed the names on these tables. Basically it is using the index only for ID... instead of using the ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps?
Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing?
The Lookup field maps the first and second part of the ID...
Any help or insight on this is appreciated.
UPDATE
An EXPLAIN on the UPDATE after I took out the sub-query.
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
| id | select_type | table | type | possible_keys               | key          | key_len | ref               | rows  | Extra       |
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
|  1 | SIMPLE      | m     | ALL  | Lookup_Idx,ID_Idx,ID_Lookup |              |         |                   | 44023 | Using where |
|  1 | SIMPLE      | c     | ref  | ID_LookupIdx                | ID_LookupIdx | 5       | schema1.tableb.ID |     4 | Using where |
+----+-------------+-------+------+-----------------------------+--------------+---------+-------------------+-------+-------------+
tablea relevant indexes:
ID_LookupIdx (ID, Lookup)
tableb relevant indexes:
ID (ID)
Lookup_Idx (Lookup)
ID_Lookup_Idx (ID, Lookup)
All of the indexes are normal B-trees.
Firstly, to deal with the specific questions that you raise:
I currently have indexes on ID and Lookup separately, and as one combined index, and they are B-tree indexes. Based on my WHERE clause, is a hash index a fitting optimization technique?
As documented under CREATE INDEX Syntax:
+----------------+--------------------------------+
| Storage Engine | Permissible Index Types        |
+----------------+--------------------------------+
| MyISAM         | BTREE                          |
| InnoDB         | BTREE                          |
| MEMORY/HEAP    | HASH, BTREE                    |
| NDB            | BTREE, HASH (see note in text) |
+----------------+--------------------------------+
Therefore, before even considering HASH indexing, one should be aware that it is only available in the MEMORY and NDB storage engines, so it may not even be an option for you.
Furthermore, be aware that indexes on combinations of ID and Lookup alone may not be optimal, as your WHERE predicate also filters on tablea.Elg_IDpart1 and tableb.IDpart1; you may benefit from indexing on those columns too.
Can I have a single hash index, and the rest of the indexes be B-tree indexes?
Provided that the desired index types are supported by the storage engine, you can mix them as you see fit.
instead of using the ID, Lookup, I would like to force it to use both, or at least turn it into a different kind of index and see if that helps?
You could use an index hint to force MySQL to use different indexes to those that the optimiser would otherwise have selected.
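In MySQL the hint goes immediately after the table reference. A sketch using the simplified join form discussed below (assuming the composite index really is named ID_Lookup_Idx, per your list):
UPDATE schema1.tablea a
JOIN schema1.tableb b FORCE INDEX (ID_Lookup_Idx) USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
    a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
  AND b.IDpart1 IS NOT NULL;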
Now I know MySQL is smart enough to determine which index is most appropriate, so is that what it's doing?
It is usually smart enough, but not always. In this case, however, it has probably determined that the cardinality of the indexes is such that it is better to use those that it has chosen.
Now, depending on the version of MySQL that you are using, tables derived from subqueries may not have any indexes upon them that can be used for further processing: consequently the join with b may require a full scan of that derived table (there's insufficient information in your question to determine exactly how much of a problem this might be, but schema1.tableb having 1.5 million records suggests it could be a significant factor).
See Subquery Optimization for more information.
One should therefore try to avoid using derived tables if at all possible. In this case, there does not appear to be any purpose to your derived table as one could simply join schema1.tablea and schema1.tableb directly:
UPDATE schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
AND a.ID IS NOT NULL
AND b.IDpart1 IS NOT NULL
AND b.Lookup IS NOT NULL
The only thing that has been lost is the filtering for DISTINCT records, but duplicate records will simply (attempt to) overwrite updated values with those same values again, which has no effect; the DISTINCT itself, on the other hand, may have proved very costly (especially with so many records in that table).
The ORDER BY in the derived table was also pointless, as it could not be relied upon to achieve any particular order for the UPDATE; it has been dropped above, which saves a sorting operation (in any case, MySQL does not permit ORDER BY in a multiple-table UPDATE).
One should also check the predicates in the WHERE clause: are they all necessary? The NOT NULL checks on a.ID and b.Lookup, for example, are superfluous, given that any such NULL records will be eliminated by the JOIN predicate.
Altogether, this leaves us with:
UPDATE schema1.tablea a
JOIN schema1.tableb b USING (ID, Lookup)
SET a.Elg_IDpart1 = b.IDpart1,
a.Elg_IDpart2 = b.IDpart2
WHERE a.Elg_IDpart1 IS NULL
AND b.IDpart1 IS NOT NULL
Only if performance is still unsatisfactory should one look further at the indexing. Are the relevant columns (i.e. those used in the JOIN and WHERE predicates) indexed? Are the indexes being selected for use by MySQL? Bear in mind that it can use only one index per table for lookups, covering both the JOIN predicate and the filter predicates, so perhaps you need an appropriate composite index. Check the query execution plan using EXPLAIN to investigate such issues further.
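For instance, here is a hypothetical pair of composite indexes to test with EXPLAIN (the names and column orders are my assumptions, not a prescription):
-- tablea: locate rows still needing the update, plus the join keys.
ALTER TABLE schema1.tablea ADD INDEX idx_a_pending (Elg_IDpart1, ID, Lookup);
-- tableb: seek by join key, then read IDpart1/IDpart2 from the index itself.
ALTER TABLE schema1.tableb ADD INDEX idx_b_source (ID, Lookup, IDpart1, IDpart2);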

Is it possible to optimize a query that uses the '<>' operator?

This a follow-up to a previous question.
How can I optimize this query so that it does not perform a full table scan?
SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
explain SELECT Employee.name FROM Employee WHERE Employee.id <> 1000;
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table       | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | Employee    | ALL  | PRIMARY       | NULL | NULL    | NULL | 5000 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
(Employee.id is the primary key, in case that isn't clear.)
Have a covering index for name and id, and it should be able to fulfill the query using the index alone. This might be faster, because there's a good chance the entire index is already in memory, while a table scan is more likely to need to go to disk.
Because of the low (non-existent) selectivity of your where clause, you may need to provide a hint to get the database to use your index. I'm a SQL Server guy, so I'm not sure of the syntax needed in MySQL to hint an index, or even whether MySQL is able to take advantage of a covering index in this manner.
That said, I doubt you can get much improvement: you're returning every row but one. You should expect that to need to scan the table.
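In MySQL, the equivalent sketch would be something like this (whether the optimizer picks the index on its own, or only under a hint, is something to measure):
-- Secondary index containing every column the query needs.
ALTER TABLE Employee ADD INDEX idx_id_name (id, name);
-- Optionally force the covering index, since the optimizer will likely
-- prefer a full table scan for a predicate this unselective:
SELECT name FROM Employee FORCE INDEX (idx_id_name) WHERE id <> 1000;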
There are a lot of things to try; it really depends on how the database engine chooses to execute the query. Some options:
select employee.name from employee where employee.id not in (1000);
You could also try a union with a less than and then a greater than.
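The union rewrite would look like this; both branches become primary-key range scans, and UNION ALL skips the duplicate-elimination pass (the two ranges cannot overlap):
SELECT name FROM Employee WHERE id < 1000
UNION ALL
SELECT name FROM Employee WHERE id > 1000;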
But in the specific example you are giving (which may just be too simple for your real case) a table scan isn't necessarily a bad thing. If all the records have to be returned except one, using an index may in fact be slower.
In traditional databases, you can't!
Of course, you could just skip all employees with the given id (when it is a key or has an index), but normally you would still have the vast majority of the table left under your feet. Using an index might therefore complicate things, so a full table scan is normally the faster option.
When you have specialized databases, you could store the names of all employees adjacent to each other.
Edit: I now saw Joel's answer. Yes, his way could work, since your special index is in fact a specialized form of storing a part of the content. Good databases can use the index content alone when it covers the columns needed, which is rather nice. Of course, you will end up with a so-called "full index scan" (normally much faster than a full table scan).
Nothing you can do will increase performance. In this case the database must do a complete table scan, as you are asking for every record save one. Reading every page in an index on top of that would only reduce performance. Fortunately, even if you added an index, the database would be smart enough to ignore it...
EDIT, to address Juergen's comment:
Juergen, you are right about a covering index, but there are conflicting effects here. Any use of an index in a scenario like this has a bad effect in one sense: the query engine may have to perform one I/O operation for each level in the index, for each row it needs to examine. If there are, say, 5 levels in the index and 1M rows, that would be 5 million I/O operations, compared to only 1M I/Os to do a complete table scan. This is why, in this scenario, most query optimizers would ignore any available index and do the table scan anyway (unless you force the index with a hint). The only mitigating factor is if EVERY attribute required by the query is in the index (a covering index) and the index rows are sufficiently smaller than the table rows, so that many more index entries fit on each page on disk, counteracting the cost of traversing each level of the index for each row returned by the query.