MariaDB optimisation of primary key not working

If you use COUNT on a NOT NULL column of a single table, without any WHERE clause, the optimizer just returns the number of rows in that table.
If you ask for a DISTINCT count on a UNIQUE NOT NULL column, like the PRIMARY KEY, the answer should be the same, but this time MariaDB does the calculation instead.
And if you LEFT JOIN other tables, still without any WHERE clause, the result should still be the number of rows in that table.
Is there a reason MariaDB does not use those optimizations? Is there any case where the DISTINCT count of an unfiltered primary key could give a result other than the number of rows in that table?
Case:
CREATE TABLE products (
our_article_id varchar(50) CHARACTER SET utf8 NOT NULL,
...,
PRIMARY KEY(our_article_id)
);
CREATE TABLE product_article_id (
article_id varchar(255) COLLATE utf8_bin NOT NULL,
our_article_id varchar(50) CHARACTER SET utf8 NOT NULL,
...,
PRIMARY KEY(article_id),
INDEX(our_article_id)
);
Count queries. 1st, basic count:
DESCRIBE SELECT COUNT(our_article_id) FROM products;
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
| 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
2nd, DISTINCT on the primary key:
DESCRIBE SELECT COUNT(DISTINCT our_article_id) FROM products;
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
| 1 | SIMPLE | products | index | NULL | PRIMARY | 152 | NULL | 225089 | Using index |
+------+-------------+----------+-------+---------------+---------+---------+------+--------+-------------+
3rd, DISTINCT on the PRIMARY KEY, and a LEFT JOIN without a WHERE clause:
DESCRIBE SELECT COUNT(DISTINCT our_article_id) FROM products LEFT JOIN product_article_id USING (our_article_id);
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+
| 1 | SIMPLE | products | index | NULL | PRIMARY | 152 | NULL | 225089 | Using index |
| 1 | SIMPLE | product_article_id | ref | PRIMARY | PRIMARY | 152 | testseek.products.our_article_id | 12579 | Using index |
+------+-------------+--------------------+-------+---------------+---------+---------+----------------------------------+--------+-------------+

"Is there a reason MariaDB does not use those optimizations?" -- There are a zillion missing optimizations in MySQL/MariaDB; this is one of them. Let's look at the history.
MySQL started about 2 decades ago as a lean and mean database engine. It focused on features that most people needed, while minimizing the overhead. This meant that a lot of rare optimizations were not in the early releases, and were only added over time if they seemed important enough.
Take the PRIMARY KEY, for example. It is defined as UNIQUE. It is BTree organized. And, with InnoDB, it is also defined as Clustered. Other vendors allow various combinations of clustering, non-BTree indexing, etc. MySQL decided that the limitations were "good enough" for "most" people.
Over the years, the 'worst' omissions have been gradually fixed. Transactions are probably the biggest and most important. They arrived in 2001(?), and with 8.0 (first development release in 2016) the system tables finally move off MyISAM, although the engine itself remains available.
4.1 (2002?) saw subqueries. Before that, creating a tmp table was "good enough". Now (8.0) subqueries are being one-upped by CTEs, which cover a few things that neither tmp tables nor subqueries can do efficiently.
There have been a huge number of optimizations put into MySQL 5.6 and 5.7 and MariaDB 10.x; you probably have not used more than a couple of them. The product is into "diminishing returns". It would damage its "lean and mean" heritage if it slowed down the optimizer to check for the next thousand extremely rare optimizations.
Meanwhile, guys like me spend a lot of time saying "MySQL/MariaDB doesn't have that; here's the workaround". In your case, it's the shorter COUNT(*). Since there is a clean workaround, it may be another decade before your suggestions are implemented. It is OK to file a bug report at bugs.mysql.com or jira.mariadb.org to suggest the optimizations.
Another, almost never needed case, is INDEX(a ASC, b DESC) as a way of optimizing ORDER BY a ASC, b DESC. That is coming with 8.0. But I doubt if more than one query in 5,000 really needs it. (I have seen a lot of queries.) I suggest that its rarity is why it took two decades to implement it. The lack of a clean workaround is why it did not take another decade.
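The COUNT(*) workaround mentioned above can be sketched as follows. This is only an illustration using the products table from the question. It relies on our_article_id being a NOT NULL primary key, so every row contributes exactly one distinct value. Whether the first statement is answered without reading any rows depends on the storage engine keeping an exact row count; the "Select tables optimized away" output in the question suggests that is the case there, but InnoDB, for example, does not do this.

```sql
-- Intended equivalent of COUNT(DISTINCT our_article_id): the primary key is
-- UNIQUE and NOT NULL, so each row is counted exactly once.
SELECT COUNT(*) FROM products;

-- A LEFT JOIN keeps every row of products at least once, and DISTINCT on the
-- primary key collapses any duplicates the join introduces, so this returns
-- the same number. It is the form the optimizer does not simplify by itself.
SELECT COUNT(DISTINCT our_article_id)
FROM products
LEFT JOIN product_article_id USING (our_article_id);
```

If the two statements ever return different numbers, that would point to a data or schema problem rather than to the rewrite.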

Related

How to properly use indexing in MySQL

I'm running a fairly simple auto catalog
CREATE TABLE catalog_auto (
id INT(10) UNSIGNED NOT NULL auto_increment,
make varchar(35),
make_t varchar(35),
model varchar(40),
model_t varchar(40),
model_year SMALLINT(4) UNSIGNED,
fuel varchar(35),
gearbox varchar(15),
wd varchar(5),
engine_cc SMALLINT(4) UNSIGNED,
variant varchar(40),
body varchar(30),
power_ps SMALLINT(4) UNSIGNED,
power_kw SMALLINT(4) UNSIGNED,
power_hp SMALLINT(4) UNSIGNED,
max_rpm SMALLINT(5) UNSIGNED,
torque SMALLINT(5) UNSIGNED,
top_spd SMALLINT(5) UNSIGNED,
seats TINYINT(2) UNSIGNED,
doors TINYINT(1) UNSIGNED,
weight_kg SMALLINT(5) UNSIGNED,
lkm_def TINYINT(3) UNSIGNED,
lkm_mix TINYINT(3) UNSIGNED,
lkm_urb TINYINT(3) UNSIGNED,
tank_cap TINYINT(3) UNSIGNED,
co2 SMALLINT(5) UNSIGNED,
PRIMARY KEY(id),
INDEX `gi`(`make`,`model`,`model_year`,`fuel`,`gearbox`,`wd`,`engine_cc`),
INDEX `mkt`(`make`,`make_t`),
INDEX `mdt`(`make`,`model`,`model_t`)
);
The table has about 60,000 rows so far, so nothing that simple queries, even without indexes, couldn't handle.
The point is, I'm trying to get the hang of using indexes, so I made a few, based on my most frequent queries.
Say I want engine_cc for a specific set of criteria, like so:
SELECT DISTINCT engine_cc FROM catalog_auto WHERE make='audi' AND model='a4' and model_year=2006 AND fuel='diesel' AND gearbox='manual' AND wd='front';
EXPLAIN says:
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------+------+--------------------------+
| 1 | SIMPLE | catalog_auto | ref | gi,mkt,mdt | gi | 408 | const,const,const,const,const,const | 8 | Using where; Using index |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------+------+--------------------------+
The query is using gi index as expected, no problem here.
After selecting the base criteria, I need the rest of the columns as well:
SELECT * FROM catalog_auto WHERE make='audi' AND model='a4' and model_year=2006 AND fuel='diesel' AND gearbox='manual' AND wd='front' AND engine_cc=1968;
EXPLAIN says:
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
| 1 | SIMPLE | catalog_auto | ref | gi,mkt,mdt | gi | 411 | const,const,const,const,const,const,const | 3 | Using where |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
It selected a key, but Extra no longer says "Using index". The query, however, is very fast (1 row in set (0.00 sec)), but since the table doesn't have that many rows, I assume it would be the same even without indexing.
Tried it like this:
SELECT * FROM catalog_auto WHERE id IN (SELECT id FROM catalog_auto WHERE make='audi' AND model='a6' AND model_year=2009);
Again, in EXPLAIN:
+----+--------------------+--------------+-----------------+--------------------+---------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+-----------------+--------------------+---------+---------+------+-------+-------------+
| 1 | PRIMARY | catalog_auto | ALL | NULL | NULL | NULL | NULL | 59060 | Using where |
| 2 | DEPENDENT SUBQUERY | catalog_auto | unique_subquery | PRIMARY,gi,mkt,mdt | PRIMARY | 4 | func | 1 | Using where |
+----+--------------------+--------------+-----------------+--------------------+---------+---------+------+-------+-------------+
Still NOT using any index, not even the PRIMARY KEY. Shouldn't this at least use the PRIMARY KEY?
Documentation says: MySQL can ignore a key even if it finds one, if it determines that a full table scan would be faster, depending on the query.
Is that the reason why it's not using any of the indexes? Is this a good practice? If not, how would you recommend indexing columns, for a SELECT * statement, to always use an index, given the above query.
I'm not much of a MySQL expert, so any pointers would be greatly appreciated.
Using MySQL 5.5 with InnoDB.
I'm basically giving the same answer as @DStanley, but I want to expand on it more than I can fit in a comment.
The "Using index" note means that the query is using only the index to get the columns it needs.
The absence of this note doesn't mean the query isn't using an index.
What you should look at is the key column in the EXPLAIN report:
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
| 1 | SIMPLE | catalog_auto | ref | gi,mkt,mdt | gi | 411 | const,const,const,const,const,const,const | 3 | Using where |
+----+-------------+--------------+------+---------------+------+---------+-------------------------------------------+------+-------------+
The key column says the optimizer chose to use the gi index. So it is using an index. And the ref column confirms that it is referencing all seven columns of that index.
The fact that it must fetch more of the columns to return * means it can't claim "Using [only] index".
Also read this excerpt from https://dev.mysql.com/doc/refman/5.6/en/explain-output.html:
Using index
The column information is retrieved from the table using only information in the index tree without having to do an additional seek to read the actual row. This strategy can be used when the query uses only columns that are part of a single index.
I think of this analogy, to a telephone book:
If you look up a business in a phone book, it's efficient because the book is alphabetized by the name. When you find it, you also have the phone number right there in the same entry. So if that's all you need, it's very quick. That's an index-only query.
If you want extra information about the business, like their hours or credentials or whether they carry a certain product, you have to do the extra step of using that phone number to call them and ask. That's a couple of extra minutes of time to get that information. But you were still able to find the phone number without having to read the entire phone book, so at least it didn't take hours or days. That's a query that used an index, but had to also go look up the row from the table to get other data.
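The difference between the two cases can be seen by comparing two EXPLAINs against the catalog_auto table from the question. This is a sketch, not output I have run: the exact plan depends on the server version and the table statistics, so treat the comments as what one would expect rather than a guarantee.

```sql
-- Every column in the SELECT list and the WHERE clause is part of index gi,
-- so Extra is expected to include "Using index" (an index-only read).
EXPLAIN SELECT make, model, model_year, fuel, gearbox, wd, engine_cc
FROM catalog_auto
WHERE make = 'audi' AND model = 'a4' AND model_year = 2006;

-- power_ps is not part of gi, so each match also needs a row lookup.
-- An index should still appear in the key column, but "Using index"
-- is expected to disappear from Extra.
EXPLAIN SELECT make, model, model_year, power_ps
FROM catalog_auto
WHERE make = 'audi' AND model = 'a4' AND model_year = 2006;
```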
I'm not a MySQL expert, but my guess is that the index was used for the row lookup, but the actual data has to be retrieved from the data pages, so an additional lookup is necessary.
In your first query, the data you ask for is available by looking only at the index keys. When you ask for columns that aren't in the index in the second and third queries, the engine uses the key to do a SEEK on the data tables, so it's still very fast.
With SQL performance, since the optimizer has a lot of freedom to choose the "best" plan, the proof is in the pudding when it comes to indexing. If adding an index makes a common query faster, great, use it. If not, then save the space and overhead of maintaining the index (or look for a better index).
Note that you don't get a free lunch: additional indexes can actually slow down a system, particularly if you have frequent inserts or updates on columns that are indexed, since the system will have to constantly maintain those indexes.

Is there a way to hint mysql to use Using index for group-by

I was busying myself with exploring GROUP BY optimizations on a classic "max salary per department" query, and suddenly got weird results. The dump below comes straight from my console. No commands were issued between these two EXPLAINs; only some time had passed.
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 1 | |
| 2 | DERIVED | emploee | index | NULL | dep_id | 8 | NULL | 84 | Using index |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 3 | |
| 2 | DERIVED | emploee | range | NULL | dep_id | 4 | NULL | 9 | Using index for group-by |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
As you may notice, it examined about ten times fewer rows in the second run. I assume it's because some internal counters changed. But I don't want to depend on these counters. So: is there a way to hint MySQL to always use the "Using index for group-by" behavior?
Or, if my speculation is wrong, is there any other explanation of the behavior, and how can I fix it?
CREATE TABLE `emploee` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`dep_id` int(11) NOT NULL,
`salary` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `dep_id` (`dep_id`,`salary`)
) ENGINE=InnoDB AUTO_INCREMENT=85 DEFAULT CHARSET=latin1
+-----------+
| version() |
+-----------+
| 5.5.19 |
+-----------+
Hm, showing the cardinality of the indexes may help, but keep in mind: range scans are usually slower than index scans there.
Because it thinks it can match the full index in the first one, it uses the full one. In the second one, it drops the full index and goes for a range, but guesses the total number of rows satisfying that larger range to be wildly lower than for the smaller full index, because the cardinality has changed. Compare it to this: why would "AA" match 84 rows, but "A[any character]" match only 9 (note that it uses 8 bytes of the key in the first, 4 bytes in the second)? The second one will in reality not read fewer rows; EXPLAIN just guesses the number of rows differently after an update of its index metadata. Note also that EXPLAIN does not tell you what a query will do, but what it probably will do.
Updating the cardinality can or will occur when:
The cardinality (the number of different key values) in every index of a table is calculated when a table is opened, at SHOW TABLE STATUS and ANALYZE TABLE and on other circumstances (like when the table has changed too much). Note that all tables are opened, and the statistics are re-estimated, when the mysql client starts if the auto-rehash setting is set on (the default).
So, assume 'at any point' due to 'changed too much', and yes, connecting with the mysql client can alter the server's behavior in choosing indexes. Also: a reconnect of the mysql client after it lost its connection to a timeout counts as connecting with auto-rehash, AFAIK. If you want to help MySQL find the proper method, run ANALYZE TABLE once in a while, especially after heavy updating. If you think the cardinality it guesses is often wrong, you can alter the number of pages it reads to estimate the statistics, but keep in mind that a higher number means a longer-running update of that cardinality, which is something you don't want to happen often when 'data has changed too much' on a table with a lot of operations.
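The steps described above could look like this for the emploee table from the question. The variable name is an assumption tied to the asker's version: on MySQL 5.5 with InnoDB the sampling depth is controlled by innodb_stats_sample_pages, while later versions use differently named variables, so check the manual for the version in use. The value 64 is only an illustrative choice.

```sql
-- Recompute the index statistics for the table.
ANALYZE TABLE emploee;

-- Inspect the Cardinality column that the optimizer bases its estimates on.
SHOW INDEX FROM emploee;

-- Assumption: MySQL 5.5 with InnoDB. Sample more index pages per estimate
-- (the default is 8). Higher values give steadier estimates but make every
-- statistics refresh slower. Changing a global variable needs SUPER privilege.
SET GLOBAL innodb_stats_sample_pages = 64;
```

Running SHOW INDEX before and after ANALYZE TABLE a few times shows how much the estimate for dep_id moves between samples.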
TL;DR: it guesses rows differently, but you'd actually prefer the first behavior if the data makes that possible.
Adding:
On this previously linked page, we can probably also find why especially dep_id might have this problem:
small values like 1 or 2 can result in very inaccurate estimates of cardinality
I could imagine the number of different dep_id's is typically quite small, and I've indeed observed a 'bouncing' cardinality on non-unique indexes with quite a small range compared to the number of rows in my own databases. It easily guesses a range of 1-10 in the hundreds and then down again the next time, just based on the specific sample pages it picks & some algorithm that tries to extrapolate that.

Speed up query: ORDER BY with LIMIT (indexing?)

I am trying to find the closest gene, given position information, from a gene table. Here is an example:
SELECT chrom, txStart, txEnd, name2, strand FROM
wgEncodeGencodeCompV12 WHERE chrom = 'chr1' AND txStart < 713885 AND
strand = '+' ORDER BY txStart DESC LIMIT 1;
My test runs have been pretty slow, which is problematic.
Here is an EXPLAIN output with default indexing (by chrom):
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | ref | chrom | chrom | 257 | const | 15843 | Using where; Using filesort |
Filesort is used and is probably causing all the sluggishness?
I tried speeding up the sorting by indexing (chrom, txStart, strand), or just txStart alone, but it only got slower (?). My reasoning is that txStart is not selective enough to be a good index, and that a whole-table scan is actually faster in this case?
Here is the EXPLAIN output with the additional indexing:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | range | chrom,closest_gene_lookup | closest_gene_lookup | 261 | NULL | 57 | Using where |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | wgEncodeGencodeCompV12 | range | chrom,txStart | txStart | 4 | NULL | 1571 | Using where |
Table structure
CREATE TABLE `wgEncodeGencodeCompV12` (
`bin` smallint(5) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`chrom` varchar(255) NOT NULL,
`strand` char(1) NOT NULL,
`txStart` int(10) unsigned NOT NULL,
`txEnd` int(10) unsigned NOT NULL,
`cdsStart` int(10) unsigned NOT NULL,
`cdsEnd` int(10) unsigned NOT NULL,
`exonCount` int(10) unsigned NOT NULL,
`exonStarts` longblob NOT NULL,
`exonEnds` longblob NOT NULL,
`score` int(11) default NULL,
`name2` varchar(255) NOT NULL,
`cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
`exonFrames` longblob NOT NULL,
KEY `chrom` (`chrom`,`bin`),
KEY `name` (`name`),
KEY `name2` (`name2`)
)
Is there a way to make this more efficient? I appreciate your time!
(Update) Solution:
Combining both commenters' suggestions significantly improved run time.
In your case (a query on a single table, no joins, no complicated stuff) it is important to understand the distribution of values in each column, and to understand how the database server utilizes the indexes. When you have a field with a rather big range of different values, that is the one that should be indexed. (E.g. an index on strand would just split the whole data into + or -, and downstream filters would have to process each row of either the + or the - result set; that's near the worst case.)
So far, we know that txStart has the most differentiated values distribution amongst the interesting columns of your query.
So, your query definitely should utilize an index query on that column! But a btree index, not a hash index (operators <, <=, > etc. are fast on btree, but not on hash).
Try again with just a single (btree) index on txStart (I know you already tried that, but please try again and avoid all secondary indexes etc..).
Multi-column indexes are nice, but their complexity makes them not as fast as plain single-column indexes, and MySQL's optimizer is rather stupid in selecting the optimal indexes ;-)
Another important factor could be the dynamic row size (because of using longblob columns). But I am not up-to-date on the current state of MySQL in that regard.
The index that you want is: wgEncodeGencodeCompV12(chrom, strand, txStart).
In general, you want the fields with equalities as the first columns in the index. Then add one field with the inequality.
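As a sketch of that rule applied to the query from the question: the index name below is made up for the example, and the expected effect (no "Using filesort", very few rows examined) is what this column order is designed to achieve, not something measured on the asker's data.

```sql
-- Index name is hypothetical; the column order is the point:
-- equality columns (chrom, strand) first, the range column (txStart) last.
ALTER TABLE wgEncodeGencodeCompV12
  ADD INDEX chrom_strand_txStart (chrom, strand, txStart);

-- With that index, the rows for chrom = 'chr1' and strand = '+' are already
-- stored in txStart order, so ORDER BY txStart DESC LIMIT 1 can read a single
-- entry from the end of the matching range instead of sorting.
EXPLAIN
SELECT chrom, txStart, txEnd, name2, strand
FROM wgEncodeGencodeCompV12
WHERE chrom = 'chr1' AND strand = '+' AND txStart < 713885
ORDER BY txStart DESC
LIMIT 1;
```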

MySQL index doesn't work

I have a weird problem with a MySQL index. I have a table views_video:
CREATE TABLE `views_video` (
`video_id` smallint(5) unsigned NOT NULL,
`record_date` date NOT NULL,
`region` char(2) NOT NULL DEFAULT '',
`views` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`video_id`,`record_date`,`region`),
KEY `video_id` (`video_id`)
)
The table contains 3.4 million records.
I run the EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 156
I got:
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
| 1 | SIMPLE | views_video | range | PRIMARY,video_id | video_id | 2 | NULL | 587984 | Using where |
+----+-------------+-------------+-------+------------------+----------+---------+------+--------+-------------+
But when I run the EXPLAIN on this query:
SELECT video_id, views FROM views_video where video_id <= 157
I got:
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
| 1 | SIMPLE | views_video | ALL | PRIMARY,video_id | NULL | NULL | NULL | 3412892 | Using where |
+----+-------------+-------------+------+------------------+------+---------+------+---------+-------------+
video_id is from 1 to 1034. There is nothing special between 156 and 157.
What happens here?
* update *
I have added more data to the database. Now video_id ranges from 1 to 1064, and the table has 3.8M records. The threshold has moved to between 114 and 115.
I'm guessing that with 3.4 million records, and only 1064 possible entries for your key, your selectivity is very low. (In other words, there are many duplicates, which makes it far less useful as a key.) The optimizer is taking its best guess if it is more efficient to use the key or not. You've found a threshold for that decision.
It might be the key population
Run these
SELECT (COUNT(1)/20) INTO @FivePctOfData FROM views_video;
SELECT COUNT(1) videoidcount, video_id FROM views_video
WHERE video_id <= 157 GROUP BY video_id;
The query optimizer probably took a vacation when one of the keys hit the 5% threshold.
You said there are 3.4 million rows. 5% would be 170,000. Perhaps this number was exceeded at some point in the query optimizer's life cycle on your query.
If you've added/deleted substantial data since creating the table, it's worthwhile to try ANALYZE TABLE on it. It frequently solves a lot of phantom indexing issues, and it's very fast even on large tables.
Update: Also, the number of distinct index values is very low compared to the number of rows in the table. MySQL won't use an index when a single index value points to too many rows. Try constraining the query further with another column that's part of the primary key.
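One way to check whether the optimizer's choice is actually the better one is to compare the default plan with a forced one on the views_video table from the question. This is a sketch: FORCE INDEX only tells the optimizer to treat a table scan as very expensive, it does not make the index access itself any cheaper, so the forced plan may well turn out slower when timed.

```sql
-- Refresh the statistics first, as suggested above.
ANALYZE TABLE views_video;

-- The plan the optimizer picks on its own for the query from the question.
EXPLAIN SELECT video_id, views FROM views_video WHERE video_id <= 157;

-- The same query with the video_id index forced, for comparison.
EXPLAIN SELECT video_id, views
FROM views_video FORCE INDEX (video_id)
WHERE video_id <= 157;
```

Running both SELECTs without EXPLAIN and comparing the timings is what settles the question for a given data set.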

MySQL datetime index is not working

Table structure:
+-------------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+----------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| total | int(11) | YES | | NULL | |
| thedatetime | datetime | YES | MUL | NULL | |
+-------------+----------+------+-----+---------+----------------+
Total rows: 137967
mysql> explain select * from out where thedatetime <= NOW();
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | out | ALL | thedatetime | NULL | NULL | NULL | 137967 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
The real query is much longer, with more table joins; the point is, I can't get the table to use the datetime index. This is going to be hard for me if I want to select all data up to a certain date. However, I noticed that I can get MySQL to use the index if I select a smaller subset of data.
mysql> explain select * from out where thedatetime <= '2008-01-01';
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
| 1 | SIMPLE | out | range | thedatetime | thedatetime | 9 | NULL | 15826 | Using where |
+----+-------------+-------------+-------+---------------+-------------+---------+------+-------+-------------+
mysql> select count(*) from out where thedatetime <= '2008-01-01';
+----------+
| count(*) |
+----------+
| 15990 |
+----------+
So, what can I do to make sure MySQL will use the index no matter what date I put in?
There are two things in play here -
Index is not selective enough - if the index covers more than approx. 30% of the rows, MySQL will decide a full table scan is more efficient. When you contract the range the index kicks in.
One index per table in a join
"The real query is much longer with more table joins, the point is ..."
The point is exactly because it has joins that it probably can't use that index. MySQL can use one index per table in a join (unless it qualifies for an index-merge optimization). If the primary key is already used for the join, thedatetime won't be used. In order to use it, you need to create a multi-column index on the join key + thedatetime index, in the correct order.
Check the EXPLAIN of the actual query to see which key MySQL uses for the join. Modify that index to include the thedatetime column as well, or create a new multi-column index from both (depending on what you use the join key for).
Everything works as it is supposed to. :)
Indexes are there to speed up retrieval. They do it using index lookups.
In your first query the index is not used because you are retrieving ALL rows, and in this case using the index is slower (look up index, get row, look up index, get row... times the number of rows is slower than getting all rows, i.e. a table scan).
In the second query you are retrieving only a portion of the data, and in this case a table scan is much slower.
The job of the optimizer is to use the statistics that the RDBMS keeps on the index to determine the best plan. In the first case the index was considered, but the planner (correctly) threw it away.
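If you want to see for yourself that the planner's decision is reasonable, you can force the index and compare. This is a sketch that assumes the index is named thedatetime, as the key column of the second EXPLAIN in the question suggests; when nearly all rows match, the forced plan is expected to be slower than the table scan, not faster.

```sql
-- `out` is quoted here because OUT is a reserved word in MySQL.
-- The plan the optimizer chooses by itself:
EXPLAIN SELECT * FROM `out` WHERE thedatetime <= NOW();

-- The same query with the thedatetime index forced, for comparison:
EXPLAIN SELECT * FROM `out` FORCE INDEX (thedatetime)
WHERE thedatetime <= NOW();
```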
EDIT
You might want to read something like this to get some concepts and keywords regarding mysql query planner.