Why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?

First, I will describe a simplified version of the problem domain.
There is table strings:
CREATE TABLE strings (
value CHAR(3) COLLATE utf8_unicode_ci NOT NULL,
INDEX(value)
) ENGINE=InnoDB;
As you can see, it has a non-unique index on a CHAR(3) column.
The table is populated using the following script:
CREATE TABLE a_variants (
letter CHAR(1) COLLATE utf8_unicode_ci NOT NULL
) ENGINE=MEMORY;
INSERT INTO a_variants VALUES -- 60 variants of letter 'A'
('A'),('a'),('À'),('Á'),('Â'),('Ã'),('Ä'),('Å'),('à'),('á'),('â'),('ã'),
('ä'),('å'),('Ā'),('ā'),('Ă'),('ă'),('Ą'),('ą'),('Ǎ'),('ǎ'),('Ǟ'),('ǟ'),
('Ǡ'),('ǡ'),('Ǻ'),('ǻ'),('Ȁ'),('ȁ'),('Ȃ'),('ȃ'),('Ȧ'),('ȧ'),('Ḁ'),('ḁ'),
('Ạ'),('ạ'),('Ả'),('ả'),('Ấ'),('ấ'),('Ầ'),('ầ'),('Ẩ'),('ẩ'),('Ẫ'),('ẫ'),
('Ậ'),('ậ'),('Ắ'),('ắ'),('Ằ'),('ằ'),('Ẳ'),('ẳ'),('Ẵ'),('ẵ'),('Ặ'),('ặ');
INSERT INTO strings
SELECT CONCAT(a.letter, b.letter, c.letter) -- 60^3 variants of string 'AAA'
FROM a_variants a, a_variants b, a_variants c
UNION ALL SELECT 'BBB'; -- one variant of string 'BBB'
So, it contains 216000 indistinguishable (in terms of the utf8_unicode_ci collation) variants of string "AAA" and one variant of string "BBB":
SELECT value, COUNT(*) FROM strings GROUP BY value;
+-------+----------+
| value | COUNT(*) |
+-------+----------+
| AAA   |   216000 |
| BBB   |        1 |
+-------+----------+
As value is indexed, I expect the following two queries to have similar performance:
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB';
But in practice the first one is more than 300 times slower than the second! See:
+----------+------------+---------------------------------------------------------------+
| Query_ID | Duration   | Query                                                         |
+----------+------------+---------------------------------------------------------------+
|        1 | 0.11749275 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
|        2 | 0.00033325 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB' |
|        3 | 0.11718050 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
+----------+------------+---------------------------------------------------------------+
-- I ran the 'AAA' query twice here just to be sure.
If I change the size of the indexed column or change its type to VARCHAR, the performance problem still manifests itself. Meanwhile, in analogous situations where the non-unique index is not on a CHAR/VARCHAR column (e.g. INT), queries are as fast as expected.
So, the question is: why is the performance of MySQL queries so bad when using a CHAR/VARCHAR index?
I have a strong feeling that MySQL performs a full linear scan of all the values matched by the index key. But why does it do so when it could just return the count of the matched rows? Am I missing something, and is that scan really needed? Or is that a sad shortcoming of the MySQL optimizer?
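For anyone who wants to reproduce this, the chosen execution plans can be inspected directly (I have not included the EXPLAIN output here):
EXPLAIN SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
EXPLAIN SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB';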

Clearly, the issue is that the query is doing an index scan. The alternative approach would be to do two index lookups, for the first and last values that are the same, and then use meta information in the index for the calculation. Based on your observations, MySQL does both.
The rest of this answer is speculation.
The reason the performance is "only" 300 times slower, rather than 200,000 times slower, is because of overhead in reading the index. Actually scanning the entries is quite fast compared to other operations that are needed.
There is a fundamental difference between numbers and strings when it comes to comparisons. The engine can just look at the bit representations of two numbers and recognize whether they are the same or different. Unfortunately, for strings, you need to take encoding/collation into account. I think that is why it needs to look at the values.
It is possible that if you had 216,000 copies of exactly the same string, then MySQL would be able to do the count using metadata in the index. In other words, the indexer is smart enough to use metadata for exact equality comparisons. But, it is not smart enough to take encoding into account.

One of the things you may want to check on is the logical I/O of each query. I'm sure you'll see quite a difference. To count the number of 'BBB's in the table, probably only 3 or 4 LIOs are needed (depending on things like bucket size). To count the number of 'AAA's, essentially the entire table must be scanned, index or not. With 216k rows, that can add up to significantly more LIOs -- not to mention physical I/Os. Logical I/Os are faster than physical I/Os, but any I/O is a performance killer.
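If you want to measure that logical I/O yourself, here is a sketch (FLUSH STATUS resets the session counters; Handler_read_next counts the index entries scanned):
FLUSH STATUS;
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SHOW SESSION STATUS LIKE 'Handler_read%'; -- note Handler_read_next
FLUSH STATUS;
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB';
SHOW SESSION STATUS LIKE 'Handler_read%';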
As for text vs numbers, it is always easier and faster for software (any software, not just database engines) to compare numbers than text.

Related

What is the "Default order by" for a mysql Innodb query that omits the Order by clause?

So I understand, and have found posts indicating, that it is not recommended to omit the ORDER BY clause in an SQL query when you are retrieving data from the DBMS.
Resources & posts consulted (will be updated):
SQL Server UNION - What is the default ORDER BY Behaviour
When no 'Order by' is specified, what order does a query choose for your record set?
https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql
Questions:
See the logic of the question below if you want to know more.
My question is: under MySQL with the InnoDB engine, does anyone know how the DBMS effectively gives us the results?
I read that it is implementation-dependent, OK, but is there a way to know it for my current implementation?
Where is this defined exactly?
Is it from MySQL, InnoDB, or OS-dependent?
Isn't there some kind of list out there?
Most importantly, if I omit the ORDER BY clause and get my result, I can't be sure that this code will still work with newer database versions and that the DBMS will always give me the same result, can I?
Use case & logic:
I'm currently writing a CRUD API, and I have a table in my DB that doesn't contain an "id" field (there is a PK, though), so when I'm showing the results of that table without any search criteria, I don't really have a clue what I should use to order the results. I mean, I could use the PK or any field that is never null, but it wouldn't make the order meaningful. So I was wondering: as my CRUD is supposed to work for any table and I don't want to solve this problem by adding an exception for this specific table, I could also simply omit the ORDER BY clause.
Final note:
As I'm reading other posts, examples and code samples, I feel like I'm going too far. I understand that it is common knowledge that it's simply bad practice to omit the ORDER BY clause in a query, and that there is no reliable default order clause -- not to say that there is no order at all unless you specify it.
I'd just love to know where this is defined, and would love to learn how this works internally, or at least where it's defined (DBMS / storage engine / OS-dependent / other / multiple criteria). I think it would also benefit other people to know it and to understand the inner mechanisms in place here.
Thanks for taking the time to read anyway! Have a nice day.
Without a clear ORDER BY, current versions of InnoDB return rows in the order of the index it reads from. Which index varies, but it always reads from some index. Even reading from the "table" is really an index—it's the primary key index.
As noted in the comments above, there's no guarantee this will remain the same in the next version of InnoDB. You should treat it as coincidental behavior; it is not documented, and the makers of MySQL don't promise not to change it.
Even if their implementation doesn't change, reading in index order can cause some strange effects that you might not expect, and which won't give you query result sets that make sense to you.
For example, the default index is the clustered index, PRIMARY. It means index order is the same as the order of values in the primary key (not the order in which you insert them).
mysql> create table mytable ( id int primary key, name varchar(20));
mysql> insert into mytable values (3, 'Hermione'), (2, 'Ron'), (1, 'Harry');
mysql> select * from mytable;
+----+----------+
| id | name     |
+----+----------+
|  1 | Harry    |
|  2 | Ron      |
|  3 | Hermione |
+----+----------+
But if your query uses another index to read the table, like if you only access column(s) of a secondary index, you'll get rows in that order:
mysql> alter table mytable add key (name);
mysql> select name from mytable;
+----------+
| name     |
+----------+
| Harry    |
| Hermione |
| Ron      |
+----------+
This shows it's reading the table by using an index-scan of that secondary index on name:
mysql> explain select name from mytable;
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table   | type  | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | mytable | index | NULL          | name | 83      | NULL |    3 | Using index |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
In a more complex query, it can become very tricky to predict which index InnoDB will use for a given query. The choice can even change from day to day, as your data changes.
All this goes to show: You should just use ORDER BY if you care about the order of your query result set!
Bill's answer is good. But not complete.
If the query is a UNION, it will (I think) deliver first the results of the first SELECT (according to the rules), then the results of the second. Also, if the table is PARTITIONed, it is likely to do a similar thing.
GROUP BY may sort by the grouping expressions, thereby leading to a predictable order, or it may use a hashing technique, which scrambles the rows. I don't know how to predict which.
A derived table used to be an ordered list whose order propagated into the parent query's ordering. But more recently, the ORDER BY is thrown away in that subquery! (Unless there is a LIMIT.) See the sketch below.
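To illustrate that derived-table point against the mytable example above (a sketch; in MySQL 5.7 and later the inner ORDER BY can be optimized away unless a LIMIT is present, and even then the outer query's order is not guaranteed without its own ORDER BY):
SELECT name FROM (SELECT name FROM mytable ORDER BY name DESC) AS dt;          -- inner ORDER BY may be discarded
SELECT name FROM (SELECT name FROM mytable ORDER BY name DESC LIMIT 3) AS dt;  -- LIMIT preserves the inner ordering for materialization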
Bottom Line: If you care about the order, add an ORDER BY, even if it seems unnecessary based on this Q & A.
MyISAM, in contrast, starts with this premise: The default order is the order in the .MYD file. But DELETEs leave gaps, UPDATEs mess with the gaps, and INSERTs prefer to fill in gaps over appending to the file. So, the row order is rather unpredictable. ALTER TABLE x ORDER BY y temporarily sets the .MYD order; this 'feature' does not work for InnoDB.

How can I make this join of two huge MySQL tables finish?

I have two tables
table1:
column1: varchar(20)
column2: varchar(20)
column3: varchar(20)
table2:
column1: varchar(20)
column2: varchar(20)
column3: varchar(20) <- empty
column1 and column2 both have a separate Fulltext index in table1
both tables hold 20 million rows
I need to fill column3 of table2 by matching column1 & column2 from table2 to column1 & column2 from table1, then take the value in column3 from table1 and put it into column3 of table2. column1 & column2 might not match exactly, so the query I use for this is:
UPDATE table1, table2
SET table2.column3 = table1.column3
WHERE table2.column1 LIKE CONCAT('%', table1.column1, '%') AND
      table2.column2 LIKE CONCAT('%', table1.column2, '%');
This query never finishes. I let it run for 2 weeks and it still didn't produce any result. It utilized one CPU core at 100%, had little SSD I/O, and apparently needs to be optimized somehow.
I am open to any suggestions regarding query optimization, index optimization or even DBMS optimization (or even migration, if it helps) since I need to do queries like this more often in the future.
EDIT1
There are plenty of optimization guides; please use Google for that. You can increase the threads in the config (InnoDB). For the update itself I recommend first creating a temp table and then copying into db2.
I know that, but I couldn't quite solve my scenario with those guides. I also know that questions covering all possible permutations of this problem (huge databases, performance, bottlenecks, query design) are all around, also on Stack Overflow. However, to this day I couldn't figure out what the best way to proceed would be for this specific combination of problems, and I hoped to get help here. That being said:
- more threads would require sharding or partitioning in order to utilize more than one CPU core, which I would like to avoid if I can solve the problem by other means
- how would you propose to create such a temporary table here?
Why do you use the LIKE operator if you do not use wildcard characters? Replace them with =. Also, do you have a multi-column index on the 3 columns in the WHERE criteria in each of the tables? Please share the output of EXPLAIN as well, along with any existing indexes in the 2 tables.
I left those characters out in the example but want to use them once the basic query works, sorry for the confusion. I am not entirely sure how to put those wildcards into a column comparison, though.
I have two separate indexes; should I create a 2-column index instead? (There are only 2 columns in the WHERE criteria.)
Would you rather have the EXPLAIN of the structure I have now, or the EXPLAIN of the structure with a 2-column index?
I guess you say databases, but you are talking about tables, right?
Exactly, sorry for the confusion.
The query you wrote will perform 20m x 20m lookups (for each row in table 1, look up all rows in table 2). You can't write just anything and expect it to work because you have an SSD or a good CPU. If you arrived at this point, it's time to think before you start writing SQL. What is it that you need to do, what are the tools you have at your disposal, and what's the middle part that you don't know -- those are the questions you need to answer every time before you issue a 400-trillion-lookup query.
That is the scenario I am facing, though. I don't expect it to work at all as it is at the moment, to be honest, so I am looking for pointers that might make this a solvable scenario. The basic "update this where that matches" query apparently doesn't apply here, so I am trying to figure out a way to a more advanced solution. Any criticism is very welcome, so thank you for this input. How would you suggest proceeding here?
EDIT2
Give us some sample values and non-exact comparisons.
table1:
+---------+---------+-------------+---------+---------+---------+
| column1 | column2 | column3     | column4 | column5 | columnN |
+---------+---------+-------------+---------+---------+---------+
| John    | Doe_    | employee001 | xyz     | 12345   | ...     |
| Jim     | Doe     | employee002 | abc     | 67890   | ...     |
+---------+---------+-------------+---------+---------+---------+
table2:
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| John    | Doe     |         |
| Jim     | Doe     |         |
+---------+---------+---------+
Here, a LIKE query would fill both rows of table2 if it matched "Doe_" for "Doe". But by writing this down, I just realized that a LIKE query is not an option here, because the variations wouldn't be constrained to a suffix of column2 in table1; rather, various possible LIKEs would be required (leading AND trailing variants for both columns in both tables). This in turn would multiply the number of required matches.
So let's forget about the LIKE and concentrate on exact matching only.
FULLTEXT and LIKE have nothing to do with each other.
"Might not match exactly" -- You will need more limitations on this non-restriction. Else, any attempt at a query will continue to take weeks.
t2.c1 LIKE CONCAT('%', t1.c1, '%') requires checking every row of t1 against every row of t2; that's 400 trillion tests. No hardware can do that in a reasonable length of time.
FULLTEXT works with "words". If your c1 and c2 are strings of words, then there is some hope to use FULLTEXT. FULLTEXT is much faster than LIKE because it has an index structure based on words.
However, even FULLTEXT is nowhere near the speed of t2.c1 = t1.c1. Still, that would need a composite INDEX(c1, c2). Then it would be a full table scan (20M rows) of one table, plus 20M probes via a BTree index into the other table. That is about 40M operations -- a lot better than 400T for LIKE.
In order to proceed, please think through your definition of "Might not match exactly" and present the best you can live with.
Ok, since I decided to drop the LIKE requirement, what exactly do you propose to use as an index?
I read your post like this:
ALTER TABLE `table1` ADD FULLTEXT INDEX `indexname1` (`column1`, `column2`);
ALTER TABLE `table2` ADD FULLTEXT INDEX `indexname2` (`column1`, `column2`);
UPDATE `table1`, `table2`
SET `table2`.`column3` = `table1`.`column3`
WHERE CONCAT(`table1`.`column1`, `table1`.`column2`) = CONCAT(`table2`.`column1`, `table2`.`column2`);
Is this correct?
Two follow-up questions, though:
1) Is the update, in your opinion, as fast as, faster than, or slower than creating a new table, i.e.:
CREATE TABLE `merged` AS
SELECT `table1`.`column1`, `table1`.`column2`, `table1`.`column3`
FROM `table1`, `table2`
WHERE CONCAT(`table1`.`column1`, `table1`.`column2`) = CONCAT(`table2`.`column1`, `table2`.`column2`);
2) Would the indexes and/or the matching be case-sensitive? If yes, can I adapt the query without having to change column1 & column2 to all upper case (or all lower case)?
Edit
WHERE CONCAT(t1.c1, t1.c2) = CONCAT(t2.c1, t2.c2) is a lot worse than saying WHERE t1.c1 = t2.c1 AND t1.c2 = t2.c2. The latter will run fast with INDEX(c1,c2).
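Putting that together, a minimal sketch of the exact-match approach (the index name is mine, and it assumes the columns share the same character set and collation):
ALTER TABLE table1 ADD INDEX ix_c1_c2 (column1, column2);
UPDATE table2
JOIN table1
  ON table1.column1 = table2.column1
 AND table1.column2 = table2.column2
SET table2.column3 = table1.column3; -- roughly one full scan of one table plus an index probe into the other per row
If the columns use a case-insensitive (_ci) collation, both the index lookups and the equality matches would be case-insensitive, which also touches on the case-sensitivity question above.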
Try this:
1. Add a new column to db1 and db2 with a separator character that never appears in column1 or column2, for example #
ALTER TABLE `db1` ADD `column4` VARCHAR(41) NOT NULL; -- 20 + 1 + 20 characters
UPDATE db1 SET column4 = CONCAT(column1, '#', column2); -- string concatenation in MySQL is CONCAT(), not +
2. Do the same for db2. Then create a (BTREE) index on column4 (in db1 and db2).
ALTER TABLE `db1` ADD INDEX ( `column4` ) ;
ALTER TABLE `db2` ADD INDEX ( `column4` ) ;
3. Then run the following query:
UPDATE db1, db2 SET db2.column3 = db1.column3 WHERE db1.column4 = db2.column4;
It should run fast enough.
When it's done, just drop column4 and its index.

MySQL indexing char(1) columns

I have a table with a complex query that I'm trying to optimize.
I read most of the documentation on MySQL indexing, but in this case I'm not sure
what to do:
Data structure:
-- please don't comment on the field types and names; it is an outsourced project.
CREATE TABLE items(
record_id INT(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
solid CHAR(1) NOT NULL, -- only 'Y','N' values
optional CHAR(1) NULL, -- only 'Y','N', NULL values
data TEXT
);
Query:
SELECT * FROM items
WHERE record_id != 88
AND solid = 'Y'
AND optional !='N' -- 'Y' OR NULL
Of course there are extra joins and related data, but these are the biggest filters.
In the scenario of:
- 200 000+ records,
- 10% (from all) with solid = 'Y',
- 10% (from all) with optional !='N',
What would be good index for this query ?
or more precisely:
does the first check, record_id != 88, slow the query in any way?
(it only eliminates one result...?)
which is faster: (optional != 'N') or (optional = 'Y' OR optional IS NULL)?
as mentioned above optional = 'N' are 10% of the total count.
is there anything special about indexing a CHAR(1) column with only 2 possible values?
can I use this index: (record_id, solid, optional)?
can I create an index for specific values (solid = 'Y', optional != 'N')?
As @Jack requested, here is the current EXPLAIN result (out of 30 000 total rows, with 20 results):
+-------------+-------+--------------+---------+---------+------+-------+-------------+
| select_type | type  | possible_key | key     | key_len | ref  | rows  | Extra       |
+-------------+-------+--------------+---------+---------+------+-------+-------------+
| PRIMARY     | range | PRIMARY      | PRIMARY | 4       | NULL | 16228 | Using where |
+-------------+-------+--------------+---------+---------+------+-------+-------------+
This is an interesting question. Overall, your query has an estimated selectivity of about 1%. So, if 100 records fit on a page, then you would assume that each page would still have to be read, even with the index. Because a record is so small (depending on data that is), this is quite likely. From that perspective, an index is not worth it.
An index would be worth it under the following circumstances. The first is when the index is a covering index, meaning that you can satisfy the query with all the columns in the index. For example:
select count(*)
FROM items
WHERE record_id != 88 AND solid = 'Y' AND optional !='N' -- 'Y' OR NULL
Where the index is on solid, optional, record_id. The query doesn't need to go back to the original data pages.
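A minimal sketch of that covering index (the index name is mine):
ALTER TABLE items ADD INDEX ix_solid_optional_id (solid, optional, record_id); -- covers the COUNT(*) query above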
Another case would be when the index is a primary (or clustered) index. The data is stored in that order, so fetching a limited number of results would reduce the read overhead of the query. The downside to this is that updates and inserts are more expensive, because data actually has to move.
My best guess in your case is that an index would not be useful, unless data is quite large (in the kilobyte range).
You should try to put indexes on the columns that will do the most discrimination. Usually indexing a binary column is not very helpful, if the database is about evenly split between the values. But if the value you often search for only appears 10% of the time, it can be a useful index.
If any of the columns are indexed, they will usually be checked before doing any other WHERE processing. The order that you put the conditions in the WHERE clause is not generally relevant. You can use EXPLAIN to find out which indexes a query uses.

Is it better to force index usage for an ORDER BY?

I'm currently trying to optimize a query generated by Doctrine 2 on this table:
CREATE TABLE `publication` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`global_order` int(11) NOT NULL,
`title` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
`slug` varchar(63) COLLATE utf8_unicode_ci NOT NULL,
`type` varchar(7) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `UNIQ_AF3C6779B12CE9DB` (`global_order`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
The query is
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
type is a discriminator column added by Doctrine. Although the WHERE clause is useless as type is always one of the IN values, I cannot remove it.
EXPLAIN shows me
+------+---------------+------+------+-----------------------------+
| type | possible_keys | key  | rows | Extra                       |
+------+---------------+------+------+-----------------------------+
| ALL  | NULL          | NULL |  562 | Using where; Using filesort |
+------+---------------+------+------+-----------------------------+
(rows is different each time I execute the query)
After some reading I found I can force an index usage like this:
ALTER TABLE `publication` DROP INDEX `UNIQ_AF3C6779B12CE9DB` ,
ADD UNIQUE `UNIQ_AF3C6779B12CE9DB` ( `global_order` , `type` )
and
SELECT *
FROM publication
FORCE INDEX(UNIQ_AF3C6779B12CE9DB)
WHERE global_order > 0
AND type IN ('article', 'event', 'work')
ORDER BY global_order DESC
The WHERE clause is always useless, but this time EXPLAIN shows me
+-------+-----------------------+-----------------------+------+-------------+
| type  | possible_keys         | key                   | rows | Extra       |
+-------+-----------------------+-----------------------+------+-------------+
| range | UNIQ_AF3C6779B12CE9DB | UNIQ_AF3C6779B12CE9DB |  499 | Using where |
+-------+-----------------------+-----------------------+------+-------------+
It seems to me it's better, but it also seems uncommon to have to force an index, so I wonder whether it's really efficient for such a simple query.
Does anyone know what is the better way to perform this query?
Thanks!
If your query really is:
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
... and all entries (or nearly all) will match the IN clause, you're actually better off with no index at all. If you toss in a LIMIT clause, then the index you'll want is actually on global_order, without the type field. The reason for this is that it actually costs something to read an index.
If you're going for the entire table, sequentially reading the table and sorting its rows in memory will be your cheapest plan. If you only need a few rows and most will match the where clause, going for the smallest index will do the trick.
To understand why, picture the disk IO involved.
Suppose you want the whole table without an index. To do this, you read data_page1, data_page2, data_page3, etc., visiting the various disk pages involved in order, until you reach the end of the table. You then sort and return.
If you want the top 5 rows without an index, you'd sequentially read the entire table as before, while heap-sorting the top 5 rows. Admittedly, that's a lot of reading and sorting for a handful of rows.
Suppose, now, that you want the whole table with an index. To do this, you read index_page1, index_page2, etc., sequentially. This then leads you to visit, say, data_page3, then data_page1, then data_page3 again, then data_page2, etc., in a completely random order (that by which the sorted rows appear in the data). The IO involved makes it cheaper to just read the whole mess sequentially and sort the grab bag in memory.
If you merely want the top 5 rows of an indexed table, in contrast, using the index becomes the correct strategy. In the worst case scenario you load 5 data pages in memory and move on.
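To make the LIMIT case concrete for this table, here is a sketch (assuming the original single-column unique index on global_order; whether the optimizer actually walks that index backwards and stops early depends on its estimates, so verify with EXPLAIN):
SELECT *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC
LIMIT 5;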
A good SQL query planner, btw, will make its decision on whether to use an index or not based on how fragmented your data is. If fetching rows in order means zooming back and forth across the table, a good planner may decide that it's not worth using the index. In contrast, if the table is clustered using that same index, the rows are guaranteed to be in order, increasing the likelihood that it'll get used.
But then, if you join the same query with another table and that other table has an extremely selective where clause that can use a small index, the planner might decide it's actually better to, e.g. fetch all IDs of rows that are tagged as foo, hash join them with publications, and heap sort them in memory.
MySQL tries to determine the best way to run a given query, and decides whether or not to use indexes based on what it thinks is the best.
It isn't always correct. Sometimes manually forcing a query to use an index is faster, sometimes it's not.
If you run some testing with sample data in your specific situation, you should be able to see which method performs faster, and stick with that one.
Make sure you take into account query caching to get an accurate performance benchmark.
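For example, on versions that still have the query cache (it was removed in MySQL 8.0), the SQL_NO_CACHE hint keeps cached results out of the benchmark:
SELECT SQL_NO_CACHE *
FROM publication
WHERE type IN ('article', 'event', 'work')
ORDER BY global_order DESC;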
Forcing the use of an index is rarely the best answer. In general it is better to create and/or optimize the indices (indexes) so that MySQL chooses to use them. (It is even better to optimize the queries, but I understand you cannot do that here.)
When you are using something like Doctrine where you cannot optimize the queries and the indices don't help, your best bet is to focus on query caching. :-)

What indexes can be used to improve this query?

This query selects all the unique visitor sessions in a certain date range:
select distinct(accessid) from accesslog where date > '2009-09-01'
I have indexes on the following fields:
accessid
date
some other fields
Here's what explain looks like:
mysql> explain select distinct(accessid) from accesslog where date > '2009-09-01';
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
| id | select_type | table     | type  | possible_keys        | key  | key_len | ref  | rows  | Extra                        |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
|  1 | SIMPLE      | accesslog | range | date,dateurl,dateaff | date | 3       | NULL | 64623 | Using where; Using temporary |
+----+-------------+-----------+-------+----------------------+------+---------+------+-------+------------------------------+
mysql> explain select distinct(accessid) from accesslog;
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
| id | select_type | table     | type  | possible_keys | key      | key_len | ref  | rows    | Extra       |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
|  1 | SIMPLE      | accesslog | index | NULL          | accessid | 257     | NULL | 1460253 | Using index |
+----+-------------+-----------+-------+---------------+----------+---------+------+---------+-------------+
Why doesn't the query with the date clause use the accessid index?
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Edit - Resolution
Reducing the column width of accessid from VARCHAR(255) to CHAR(32) improved query time by ~75%.
Adding a date+accessid index had no effect on query time.
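For reference, the column change from the resolution would look roughly like this (the NOT NULL and character-set options are assumptions, since the original table definition isn't shown):
ALTER TABLE accesslog MODIFY accessid CHAR(32) NOT NULL; -- was VARCHAR(255)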
An index on (date, accessid) could help. However, before tweaking indices I'd recommend checking the type of your accessid column. EXPLAIN says the key is 257 bytes long, which sounds like a lot for an ID column. Are you using a VARCHAR(256) for accessid? If so, can't you use a more compact type? If it's a number, it should be INT (SMALLINT, BIGINT, whatever fits your needs), and if it's an alphanumeric ID, can it really be 256 chars long? If its length is fixed, can't you use CHAR (CHAR(32), for example) instead?
Your problem is that your condition is a range clause (on the date column).
A multi-column index of date->accessid likely won't help the situation, as MySQL can't use index columns after a range condition. In theory it should be able to use such an index to cover the computation in this case, but it appears to be a shortcoming in MySQL; I've never gotten it to use a multi-column index in this situation successfully.
You can try creating an index on (date,accessid) hoping that it will use it to cover the query (so you won't need to hit any tables), but I don't hold much hope. There's not a great deal you can do.
Edit:
My answer is courtesy of High Performance MySQL, Second Edition -- worth its weight in gold if you have to do serious MySQL development.
Why doesn't the query with the date clause use the accessid index?
Because using the date index is more efficient. That's because it's likely to pare the search space down faster.
At least one DBMS (DB2/z, I don't know much about MySQL) would benefit from an index on date+accessid since the access IDs would be sorted within dates in that index. That DBMS will use the date+accessid key to efficiently use the where clause to whittle down the search space and to return distinct values of accessid within that space.
Whether MySQL is that smart, I have no idea. My suggestion would be to try it and see (which is the best answer to most DB optimization questions).
The query uses the 'date' index because that's what you use in the WHERE clause.
This is the only sensible option; if it used the accessid index, it would need to read all the accessid rows, then check the date for each one, and only then decide whether it was distinct.
If this is a really big table a compound index on date and accessid might help.
Why doesn't the query with the date clause use the accessid index?
Because using the date index allows it to ignore a large part of the data in the table. The chances are that the table holds mostly historical data, and a lot of it refers to dates a lot longer ago than the beginning of the current month, so the date criterion is selective and reduces the workload for the optimizer by allowing it to ignore most of the data.
If it used the accessid index, it would also have to read each row (as well as each index entry) to see whether the date meets the search criterion. This means reading the whole of the index and the whole of the table - in fact, it would do better in the context to ignore the index, but I started of with "if it used the accessid index".
Are there any other indexes I can use to speed up queries for distinct accessid's in certain date spans?
Depending on the sophistication of the optimizer, an index on (date, accessid) might improve things. It can do range searches on the leading column of the index, and the trailing column means that it does not have to refer to the data in the table to establish the accessid - the information is in the index. So, this might convert a query that access an index and a table into one that only accesses the index - which will reduce the amount of I/O needed and therefore improve the performance of the query.
If you have other criteria that need data from other columns, or you need to return more than just the unique accessid values, then you end up reading part of the table data; this is probably still a win compared with scanning the whole of the table.
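A sketch of that compound index (the index name is mine; note that the question's edit above reports this particular index did not help in their case):
ALTER TABLE accesslog ADD INDEX ix_date_accessid (date, accessid);
SELECT DISTINCT accessid
FROM accesslog
WHERE date > '2009-09-01'; -- with the index above, this can potentially be resolved from the index alone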
I have no way of testing it, but I would definitely try to add an index spanning both accessid and date.
Index optimization is often like alchemy. Different DBMSs behave differently, and sometimes you simply need to try (and fail with) various combinations. I'm not saying it's not possible to reason about it. It is in many cases, but only up to a certain point. Often it's simply faster and easier to follow your instinct.