MySQL InnoDB hash index optimizing

I was wondering if I could optimize this further; maybe someone has struggled with the same thing.
First of all, I have this table:
CREATE TABLE `site_url` (
`id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
`url_hash` CHAR(32) NULL DEFAULT NULL,
`url` VARCHAR(2048) NULL DEFAULT NULL,
PRIMARY KEY (`id`),
INDEX `url_hash` (`url_hash`)
)
ENGINE=InnoDB;
where I store site URIs (the domain is in a different table, but for the purposes of this question it shouldn't matter - I hope).
url_hash is the MD5 calculated from url.
All fields seem to be of a sensible length and the indexes look correct, but there is a lot of data in the table and I'm looking for further optimization.
Standard query looks like this:
select id from site_url where site_url.url_hash = MD5('something - often calculated in application rather than in mysql') and site_url.url = 'something - often calculated in application rather than in mysql'
describe gives:
+----+-------------+----------+------+---------------+----------+---------+-------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+----------+---------+-------+------+------------------------------------+
| 1 | SIMPLE | site_url | ref | url_hash | url_hash | 97 | const | 1 | Using index condition; Using where |
+----+-------------+----------+------+---------------+----------+---------+-------+------+------------------------------------+
But I'm wondering if I could help MySQL with that search. It must be the InnoDB engine, and I can't add a key on url because of its length.
A friend of mine told me to shorten the hash to 16 chars and store it as a number. Will an index on BIGINT be faster than one on CHAR(32)? My friend also suggested computing the MD5 and taking the first/last 16 chars of it, but I think that would produce a lot more collisions.
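For concreteness, a sketch of what that suggestion would look like (the url_hash64 column and index names are invented for illustration):
ALTER TABLE site_url
  ADD COLUMN url_hash64 BIGINT UNSIGNED NULL,
  ADD INDEX url_hash64_idx (url_hash64);
-- first 64 bits of the MD5, stored as an unsigned 64-bit integer
UPDATE site_url
SET url_hash64 = CAST(CONV(SUBSTRING(MD5(url), 1, 16), 16, 10) AS UNSIGNED);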
What are your thoughts about it?

This is your query:
select id
from site_url
where site_url.url_hash = MD5('something - often calculated in application rather than in mysql') and
site_url.url = 'something - often calculated in application rather than in mysql';
The best index for this query would be on site_url(url_hash, url, id). The caveat is that you might need to use a prefix unless you have the large prefix option set (see innodb_large_prefix).
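A hedged sketch of that index (the 191-character prefix is an assumption for when innodb_large_prefix is off; InnoDB appends the primary key id to every secondary index anyway, so it does not need to be listed explicitly):
ALTER TABLE site_url
  ADD INDEX url_hash_url (url_hash, url(191));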

If url_hash is the MD5 of url, why do you select by both keys?
select id from site_url where site_url.url_hash = MD5('something - often calculated in application rather than in mysql');
Actually you don't need the second check on site_url.url.
But if you want, you can select by both fields with the USE INDEX syntax:
select id from site_url USE INDEX (url_hash) where site_url.url_hash = MD5('something - often calculated in application rather than in mysql') and site_url.url = 'something - often calculated in application rather than in mysql';

Related

MySQL seeking pagination on big composite primary key

Let's say I have a MySQL table defined like this:
CREATE TABLE big_table (
primary_1 varbinary(1536),
primary_2 varbinary(1536),
ts timestamp(6),
...
PRIMARY KEY (primary_1, primary_2),
KEY ts_idx (ts)
)
I would like to implement efficient pagination (seeking pagination) as described in this blog post https://use-the-index-luke.com/sql/partial-results/top-n-queries
If I only use the first part of the primary key, the pipelined execution works fast and as expected:
mysql> explain select * from big_table order by ts, primary_1 limit 5;
+----+-------------+-------------------------------------+------------+-------+---------------+--------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------------------------+------------+-------+---------------+--------+---------+------+------+----------+-------+
| 1 | SIMPLE | big_table | NULL | index | NULL | ts_idx | 7 | NULL | 5 | 100.00 | NULL |
+----+-------------+-------------------------------------+------------+-------+---------------+--------+---------+------+------+----------+-------+
However, if I add the second part of the primary key to the ORDER BY clause everything slows down and filesort starts being used:
mysql> explain select * from big_table order by ts, primary_1, primary_2 limit 5;
+----+-------------+-------------------------------------+------------+------+---------------+------+---------+------+---------+----------+----------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------------------------------+------------+------+---------------+------+---------+------+---------+----------+----------------+
| 1 | SIMPLE | big_table | NULL | ALL | NULL | NULL | NULL | NULL | 6388499 | 100.00 | Using filesort |
+----+-------------+-------------------------------------+------------+------+---------------+------+---------+------+---------+----------+----------------+
Is it not possible to do this pipelined execution and ordering on a composite primary key? Or should the query be written in some special way?
Without prior knowledge about how MySQL works internally, there is no reason to assume that an index on just ts can be used to order by ts, primary_1 without doing an additional (file)sort on primary_1. Imagine e.g. the edge case that all values for ts are the same - the index will just give you all rows, which you then have to sort by primary_1.
Nevertheless, MySQL can make use of some additional information: InnoDB stores secondary indexes in a way that includes the primary key columns (to be able to find the actual row in the table). Since that information is there anyway, MySQL can just make use of it - and it does, by using Index Extensions. This basically extends the index ts to an index ts, primary_1, primary_2.
So this technical trick allows you to use the index on ts to order by ts, primary_1, primary_2. But since there is always a "but", here is the "but":
Use of index extensions by the optimizer is subject to the usual limits on the number of key parts in an index (16) and the maximum key length (3072 bytes).
The index on ts, primary_1, primary_2 would be longer than 3072 bytes. You also cannot, for example, create such an index manually. So this extension no longer works, and MySQL falls back to treating the index on ts as an index on just ts.
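For illustration, a minimal sketch under the assumption that shorter key columns keep the extended index within the 3072-byte limit:
CREATE TABLE big_table_short (
  primary_1 varbinary(128) NOT NULL,
  primary_2 varbinary(128) NOT NULL,
  ts timestamp(6) NOT NULL,
  PRIMARY KEY (primary_1, primary_2),
  KEY ts_idx (ts)
) ENGINE=InnoDB;
-- the extension applies here, so this can read ts_idx in order, without a filesort
EXPLAIN SELECT * FROM big_table_short ORDER BY ts, primary_1, primary_2 LIMIT 5;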
So why does it work for order by ts, primary_1? Well, even if, for those technical reasons, MySQL cannot create an internal index on ts, primary_1, primary_2, it could at least do it for ts, primary_1 without running into technical problems. MySQL actually doesn't do that though - but the MariaDB developers implemented this trick, so I assume you are actually using MariaDB. Nevertheless, the 3072-byte length restriction still applies, so ordering by both primary columns still won't work.
What can you do?
If you can shorten your primary keys a bit, the index extension will work again. Primary keys that long (and of that type) are uncommon and impractical anyway (not only for this use case), so maybe you can find a different primary key for your table.
If that is not an option, you may be able to utilize some prior knowledge about your data distribution: e.g. if you know that at most 10 values for ts can be the same, you can first pick the first n+10 rows (using the index), then order only those by the primary keys, as sketched below. If you usually only show the first few pages, this might speed up your specific situation. But you may want to ask a separate question for it with specific details.
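A rough sketch of that overfetch-then-sort idea, assuming at most 10 rows can share a ts value and a page size of 5:
SELECT * FROM (
  SELECT * FROM big_table
  ORDER BY ts
  LIMIT 15  -- page size 5 plus up to 10 ties, served by ts_idx
) AS candidates
ORDER BY ts, primary_1, primary_2
LIMIT 5;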

Can MySQL FIND_IN_SET or equivalent be made to use indices?

If I compare
explain select * from Foo where find_in_set(id,'2,3');
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | User | ALL | NULL | NULL | NULL | NULL | 4 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
with this one
explain select * from Foo where id in (2,3);
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | User | range | PRIMARY | PRIMARY | 8 | NULL | 2 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
It is apparent that FIND_IN_SET does not exploit the primary key.
I want to put a query such as the above into a stored procedure, with the comma-separated string as an argument.
Is there any way to make the query behave like the second version, in which the index is used, but without knowing the content of the id set at the time the query is written?
In reference to your comment:
@MarcB the database is normalized, the CSV string comes from the UI.
"Get me data for the following people: 101,202,303"
This answer has a narrow focus on just those numbers separated by a comma. Because, as it turns out, you were not even talking about FIND_IN_SET after all.
Yes, you can achieve what you want. You create a prepared statement that accepts a string as a parameter, like in this recent answer of mine. In that answer, look at the second block, which shows the CREATE PROCEDURE and its 2nd parameter accepting a string like (1,2,3). I will get back to this point in a moment.
Not that you need to see it, @spraff, but others might. The mission is to get the type != ALL, and the possible_keys and key columns of EXPLAIN to not show null, as you showed in your second block. For general reading on the topic, see the article Understanding EXPLAIN's Output and the MySQL manual page entitled EXPLAIN Extra Information.
Now, back to the (1,2,3) reference above. We know from your comment, and your second Explain output in your question that it hits the following desired conditions:
type = range (and in particular not ALL). See the docs above on this.
key is not null
These are precisely the conditions you have in your second Explain output, and the output that can be seen with the following query:
explain
select * from ratings where id in (2331425, 430364, 4557546, 2696638, 4510549, 362832, 2382514, 1424071, 4672814, 291859, 1540849, 2128670, 1320803, 218006, 1827619, 3784075, 4037520, 4135373, ... use your imagination ..., ..., 4369522, 3312835);
where I have 999 values in that in clause list. That is a sample from this answer of mine, in Appendix D, which generates such a random string of CSV, surrounded by open and close parentheses.
And note the Explain output for that 999-element in clause: it hits the same desired conditions (type = range, non-null key).
Objective achieved. You achieve this with a stored proc similar to the one I mentioned before in this link using a PREPARED STATEMENT (and those things use concat() followed by an EXECUTE).
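For reference, a minimal sketch of such a procedure (the routine name and parameter are invented here; the list is spliced into the SQL text, so it must be sanitized, as discussed below):
DELIMITER $$
CREATE PROCEDURE fetchByIdList(IN idList VARCHAR(4000))
BEGIN
  -- idList arrives in the form '(1,2,3)'; concat, prepare, execute
  SET @sql = CONCAT('SELECT * FROM ratings WHERE id IN ', idList);
  PREPARE stmt FROM @sql;
  EXECUTE stmt;
  DEALLOCATE PREPARE stmt;
END$$
DELIMITER ;
CALL fetchByIdList('(2331425,430364,4557546)');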
The index is used, and a Tablescan (meaning bad) is avoided. Further reading: The range Join Type, any reference you can find on MySQL's cost-based optimizer (CBO), and this answer from vladr (though dated), with an eye on the ANALYZE TABLE part, in particular after significant data changes. Note that ANALYZE can take a significant amount of time to run on ultra-huge datasets, sometimes many hours.
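For example, refreshing the statistics is a one-liner:
ANALYZE TABLE ratings;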
SQL Injection Attacks:
Strings passed to stored procedures are an attack vector for SQL injection attacks. Precautions must be in place to prevent them when using user-supplied data. If your routine is applied only against your own ids generated by your system, then you are safe. Note, however, that 2nd-level SQL injection attacks occur when data was put in place by routines that did not sanitize that data in a prior insert or update. Attacks put in place earlier via data and triggered later are a sort of time bomb.
So this answer is finished for the most part.
Below is a view of the same table with a minor modification, to show what a dreaded Tablescan would look like for the prior style of query, but against a non-indexed column called thing.
Take a look at our current table definition:
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
select min(id), max(id),count(*) as theCount from ratings;
+---------+---------+----------+
| min(id) | max(id) | theCount |
+---------+---------+----------+
| 1 | 5046213 | 4718592 |
+---------+---------+----------+
Note that the column thing was a nullable int column before.
update ratings set thing=id where id<1000000;
update ratings set thing=id where id>=1000000 and id<2000000;
update ratings set thing=id where id>=2000000 and id<3000000;
update ratings set thing=id where id>=3000000 and id<4000000;
update ratings set thing=id where id>=4000000 and id<5100000;
select count(*) from ratings where thing!=id;
-- 0 rows
ALTER TABLE ratings MODIFY COLUMN thing int not null;
-- current table definition (after above ALTER):
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
And then the Explain for that same style of query against the non-indexed column thing would show the dreaded Tablescan (type = ALL with null keys).
You can use the following technique to make use of the primary index.
Prerequisites:
You know the maximum number of items in the comma-separated string, and it is not large.
Description:
we convert the comma-separated string into an inline derived table
we inner join to that derived table
select @ids:='1,2,3,5,11,4', @maxCnt:=15;
SELECT *
FROM foo
INNER JOIN (
SELECT * FROM (SELECT @n:=@n+1 AS n FROM foo INNER JOIN (SELECT @n:=0) AS _a) AS _a WHERE _a.n <= @maxCnt
) AS k ON k.n <= LENGTH(@ids) - LENGTH(replace(@ids, ',','')) + 1
AND id = SUBSTRING_INDEX(SUBSTRING_INDEX(@ids, ',', k.n), ',', -1)
This is the trick used to extract the nth value from a comma-separated list:
SUBSTRING_INDEX(SUBSTRING_INDEX(@ids, ',', k.n), ',', -1)
Notes: @ids can be anything, including a column from another table or the same table.
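A quick worked example of that extraction:
-- the inner SUBSTRING_INDEX yields '1,2,3'; the outer one keeps the last element
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('1,2,3,5,11,4', ',', 3), ',', -1);
-- returns '3'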

Slow query, state = 'Sorting result'

I have this generated query:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from MessageDetails messagedet0_, MessageEntry messageent1_, MailSourceFile mailsource2_
where messagedet0_.id=messageent1_.messageDetails_id
and messageent1_.mailSourceFile_id=mailsource2_.id
order by mailsource2_.file, messageent1_.mboxOffset;
Explain says that there are no full scans and indexes are used:
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys |key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | mailsource2_ | index | PRIMARY |file | 384 | NULL | 1445 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | messageent1_ | ref | msf_idx,md_idx,FKBBB258CB60B94D38,FKBBB258CBF7C835B8 |msf_idx | 9 | skryb.mailsource2_.id | 2721 | Using where |
| 1 | SIMPLE | messagedet0_ | eq_ref | PRIMARY |PRIMARY | 8 | skryb.messageent1_.messageDetails_id | 1 | |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
CREATE TABLE `mailsourcefile` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`file` varchar(127) COLLATE utf8_bin DEFAULT NULL,
`size` bigint(20) DEFAULT NULL,
`archive_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `file` (`file`),
KEY `File_idx` (`file`),
KEY `Archive_idx` (`archive_id`),
KEY `FK7C3F816ECDB9F63C` (`archive_id`),
CONSTRAINT `FK7C3F816ECDB9F63C` FOREIGN KEY (`archive_id`) REFERENCES `archive` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1370 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I also have indexes for file and mboxOffset. SHOW FULL PROCESSLIST says that MySQL is sorting the result, and that takes more than a few minutes. The result set size is 5M records. How can I optimize this?
I don't think there is much optimization to do in the query itself. Joins would make it more readable, but IIRC MySQL nowadays is perfectly able to detect these kinds of constructs and plan the joins itself.
What would probably help is increasing both tmp_table_size and max_heap_table_size to allow the result set of this query to remain in memory, rather than having to write it to disk.
The maximum size for in-memory temporary tables is the minimum of the tmp_table_size and max_heap_table_size values
http://dev.mysql.com/doc/refman/5.5/en/internal-temporary-tables.html
The "using temporary" in the explain denotes that it is using a temporary table (see the link above again) - which will probably be written to disk due to the large amount of data (again, see the link above for more on this).
The file column alone is anywhere between 1 and 384 bytes, so let's take half of that for our estimate and ignore the rest of the columns; that leads to 192 bytes per row in the result set.
1445 * 2721 = 3,931,845 rows
* 192 = 754,914,240 bytes
/ 1024 ~= 737,221 kb
/ 1024 ~= 720 mb
This is certainly more than the max_heap_table_size (16,777,216 bytes) and most likely more than the tmp_table_size.
Not having to write such a result to disk will most certainly increase speed.
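A hedged example of raising those limits for the session (1G is an arbitrary figure; size it to your result set and available memory):
SET SESSION tmp_table_size      = 1024 * 1024 * 1024;
SET SESSION max_heap_table_size = 1024 * 1024 * 1024;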
Good luck!
Optimization is always tricky. In order to make a dent in your execution time, I think you probably need to do some sort of pre-cooking.
If the file names are similar, (e.g. /path/to/file/1, /path/to/file/2), sorting them will mean a lot of byte comparisons, probably compounded by the unicode encoding. I would calculate a hash of the filename on insertion (e.g. MD5()) and then sort using that.
If the files are already well distributed (e.g. postfix spool names), you probably need to come up with some scheme on insertion whereby either:
simply reading records from some joined table will automatically generate them in correct order; this may not save a lot of time, but it will give you some data quickly so you can start processing, or
find a way to provide a "window" on the data so that not all of it needs to be processed at once.
As @raheel shan said above, you may want to try some JOINs:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails messagedet0_
inner join
MessageEntry messageent1_
on
messagedet0_.id = messageent1_.messageDetails_id
inner join
MailSourceFile mailsource2_
on
messageent1_.mailSourceFile_id = mailsource2_.id
order by
mailsource2_.file,
messageent1_.mboxOffset
My apologies if the keys are off, but I think I've conveyed the point.
Write the query with joins, like:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails m0
inner join
MessageEntry m1
on
m0.id = m1.messageDetails_id
inner join
MailSourceFile m2
on
m1.mailSourceFile_id = m2.id
order by
m2.file,
m1.mboxOffset;
On seeing your explain I found 3 things which in my opinion are not good:
1. filesort in the Extra column
2. index in the type column
3. the key length, which is 384
If you reduce the key length you may get quicker retrieval; for that, consider the character set you use and partial (prefix) indexes, as sketched below.
Here you can FORCE INDEX for the ORDER BY and USE INDEX for the join (create appropriate multi-column indexes and assign them). Remember it is always good to order by columns present in the same table.
The index access type means it is scanning the entire index, which is not good.
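For example, a sketch of a prefix index that shrinks the 384-byte key (the 64-character prefix is an assumption; check that it stays selective for your file names):
ALTER TABLE mailsourcefile ADD KEY file_prefix_idx (`file`(64));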

MySQL EXPLAIN 'type' changes from 'range' to 'ref' when the date in the where statement is changed?

I've been testing out different ideas for optimizing some of the tables we have in our system at work. Today I came across a table that tracks every view on each vehicle in our system. Create table below.
SHOW CREATE TABLE vehicle_view_tracking;
CREATE TABLE `vehicle_view_tracking` (
`vehicle_view_tracking_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`public_key` varchar(45) NOT NULL,
`vehicle_id` int(10) unsigned NOT NULL,
`landing_url` longtext NOT NULL,
`landing_port` int(11) NOT NULL,
`http_referrer` longtext,
`created_on` datetime NOT NULL,
`created_on_date` date NOT NULL,
`server_host` longtext,
`server_uri` longtext,
`referrer_host` longtext,
`referrer_uri` longtext,
PRIMARY KEY (`vehicle_view_tracking_id`),
KEY `vehicleViewTrackingKeyCreatedIndex` (`public_key`,`created_on_date`),
KEY `vehicleViewTrackingKeyIndex` (`public_key`)
) ENGINE=InnoDB AUTO_INCREMENT=363439 DEFAULT CHARSET=latin1;
I was playing around with multi-column and single column indexes. I ran the following query:
EXPLAIN EXTENDED SELECT dealership_vehicles.vehicle_make, dealership_vehicles.vehicle_model, vehicle_view_tracking.referrer_host, count(*) AS count
FROM vehicle_view_tracking
LEFT JOIN dealership_vehicles
ON dealership_vehicles.dealership_vehicle_id = vehicle_view_tracking.vehicle_id
WHERE vehicle_view_tracking.created_on_date >= '2011-09-07' AND vehicle_view_tracking.public_key IN ('ab12c3')
GROUP BY (dealership_vehicles.vehicle_make) ASC , dealership_vehicles.vehicle_model, referrer_host
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | vehicle_view_tracking | range | vehicleViewTrackingKeyCreatedIndex,vehicleViewTrackingKeyIndex | vehicleViewTrackingKeyCreatedIndex | 50 | NULL | 23086 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dealership_vehicles | eq_ref | PRIMARY | PRIMARY | 8 | vehicle_view_tracking.vehicle_id | 1 | 100.00 | |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
(Execution time for actual select query was .309 seconds)
then I change the date in the where clause from '2011-09-07' to '2011-07-07' and got the following explain results
EXPLAIN EXTENDED SELECT dealership_vehicles.vehicle_make, dealership_vehicles.vehicle_model, vehicle_view_tracking.referrer_host, count(*) AS count
FROM vehicle_view_tracking
LEFT JOIN dealership_vehicles
ON dealership_vehicles.dealership_vehicle_id = vehicle_view_tracking.vehicle_id
WHERE vehicle_view_tracking.created_on_date >= '2011-07-07' AND vehicle_view_tracking.public_key IN ('ab12c3')
GROUP BY (dealership_vehicles.vehicle_make) ASC , dealership_vehicles.vehicle_model, referrer_host
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | vehicle_view_tracking | ref | vehicleViewTrackingKeyCreatedIndex,vehicleViewTrackingKeyIndex | vehicleViewTrackingKeyIndex | 47 | const | 53676 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dealership_vehicles | eq_ref | PRIMARY | PRIMARY | 8 | vehicle_view_tracking.vehicle_id | 1 | 100.00 | |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
(Execution time for actual select query was .670 seconds)
I see 4 main changes:
type changed from range to ref
key changed from vehicleViewTrackingKeyCreatedIndex to vehicleViewTrackingKeyIndex
key_len changed from 50 to 47 (caused by the change in key)
rows changed from 23086 to 53676 (caused by the change in key)
At this point, the execution time is only .6 seconds for the slow query; however, we only have about 10% of our vehicles in our database.
It's getting late and I may have overlooked something in the mysql docs but I can't seem to find why the key (and in turn the type and rows) are changing when the date is changed in the where clause.
The help is greatly appreciated. I searched for someone having the same/similar issue with a date causing this change and was not able to find anything. If I missed a previous post, please link me :-)
Different search strategies make sense for different data. In particular, index scans (such as range) often have to do a seek to actually read the row. At some point, doing all those seeks is slower than not using the index at all.
Take a trivial example, a table with three columns: id (primary key), name (indexed), birthday. Say it has a lot of data. If you ask MySQL to look for Bob's birthday, it can do that fairly quickly: first, it finds Bob in the name index (this takes a few seeks, log(n) where n is the row count), then one additional seek to read the actual row in the data file and read the birthday from it. That's very quick, and far quicker than scanning the entire table.
Next, consider doing a name like 'Z%'. That is probably a fairly small portion of the table, so it's still faster to find where the Zs start in the name index, then for each one seek into the data file to read the row. (This is a range scan.)
Finally, consider asking for all names starting with M-Z. That's probably around half the data. It could do a range scan, and then a lot of seeks, but seeking randomly over the datafile with the ultimate goal of reading half the rows isn't optimal: it'd be faster to just do a big sequential read over the data file. So, in this case, the index will be ignored.
This is what you're seeing, except in your case there is another key it can fall back on. (It's also possible that it might actually use the date index if it didn't have the other; it should pick whichever index will be quickest. Beware that MySQL's optimizer often makes errors in this.)
So, in short, this is expected. A query doesn't say how to retrieve the data; rather, it says what data to retrieve. The database's optimizer is supposed to find the quickest way to retrieve it.
You may find that an index on both columns, in the order (public_key, created_on_date), is preferred in both cases and speeds up your query. This is because MySQL can only ever use one index per table (per query). Also, the date goes at the end because a range scan can only be done efficiently on the last column used from an index.
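If the optimizer still prefers the single-column key for the wider date range, you can test the composite index explicitly (a sketch, not necessarily a win):
SELECT COUNT(*)
FROM vehicle_view_tracking FORCE INDEX (vehicleViewTrackingKeyCreatedIndex)
WHERE public_key IN ('ab12c3')
  AND created_on_date >= '2011-07-07';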
[InnoDB actually has another layer of indirection, I believe, but it'd just confuse the point. It doesn't make a difference to the explanation.]

Optimizing Datetime fields where indexes aren't being used as expected

I have a large, fast-growing log table in an application running with MySQL 5.0.77. I'm trying to find the best way to optimize queries that count instances within the last X days according to message type:
CREATE TABLE `counters` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kind` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_counters_on_kind` (`kind`),
KEY `index_counters_on_created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=302 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
For this test set, there are 668521 rows in the table. The query I'm trying to optimize is:
SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind;
Right now, that query takes between 3-5 seconds, and is being estimated as follows:
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | index_counters_on_created_at_idx | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
With the created_at index removed, it looks like this:
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | NULL | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
(Yes, for some reason the row estimate is larger than the number of rows in the table.)
So, apparently, there's no point to that index.
Is there really no better way to do this? I tried the column as a timestamp, and it just ended up slower.
Edit: I discovered that changing the query to use an interval instead of a specific date ends up using the index, cutting down the row estimate to about 20% of the query above:
SELECT kind, COUNT(id) FROM counters WHERE created_at >=
(NOW() - INTERVAL 7 DAY) GROUP BY kind;
I'm not entirely sure why that happens, but I'm fairly confident that if I understood it then the problem in general would make a lot more sense.
Why not use a concatenated index?
CREATE INDEX idx_counters_created_kind ON counters(created_at, kind);
This should result in an index-only scan ("Using index" in the Extra column), because COUNT(id) can be answered from the index alone: id is NOT NULL anyway, and InnoDB secondary indexes carry the primary key.
References:
Concatenated index vs. merging multiple indexes
Index-Only Scan
After reading the latest edit on the question, the problem seems to be that the parameter used in the WHERE clause was being interpreted by MySQL as a string rather than as a datetime value. This would explain why the index_counters_on_created_at index was not being selected by the optimizer: instead, it would result in a scan that converts the created_at values to a string representation and then does the comparison. I think this can be prevented by an explicit cast to datetime in the where clause:
where `created_at` >= convert({specific_date}, datetime)
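Applied to the query from the question (the date literal here is only a placeholder):
SELECT kind, COUNT(id)
FROM counters
WHERE created_at >= CONVERT('2011-09-01 00:00:00', DATETIME)
GROUP BY kind;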
My original comments still apply for the optimization part.
The real performance killer here is the kind column, because when doing the GROUP BY the database engine first needs to determine all the distinct values in the kind column, which results in a table or index scan. That's why the estimated row count is bigger than the total number of rows in the table: in one pass it will determine the distinct values in the kind column, and in a second pass it will determine which rows meet the created_at >= ? condition.
To make matters worse, the kind column is a varchar(255), which is too big to be efficient. Add to that the utf8 character set and utf8_unicode_ci collation, which increase the complexity of the comparisons needed to determine the unique values in that column.
This will perform a lot better if you change the type of the kind column to int, because integer comparisons are more efficient and simpler than unicode character comparisons. It would also help to have a catalog table for the kinds of messages, in which you store kind_id and a description. Then do the grouping on a join of the kind catalog table and a subquery of the log table that first filters by date:
select k.kind_id, count(*)
from
kind_catalog k
inner join (
select kind_id
from counters
where created_at >= ?
) c on k.kind_id = c.kind_id
group by k.kind_id
This will first filter the counters table by created_at >= ?, and it can benefit from the index on that column. Then it will join the result to the kind_catalog table, and if the SQL optimizer is good it will scan the smaller kind_catalog table for the grouping, instead of the counters table.
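A hedged sketch of the catalog refactor described above (table and column names are assumptions; you would index counters.kind_id afterwards):
CREATE TABLE kind_catalog (
  kind_id int NOT NULL AUTO_INCREMENT,
  description varchar(255) NOT NULL,
  PRIMARY KEY (kind_id),
  UNIQUE KEY kind_description (description)
);
-- seed the catalog from the existing log rows
INSERT INTO kind_catalog (description)
SELECT DISTINCT kind FROM counters;
-- add the numeric column and backfill it
ALTER TABLE counters ADD COLUMN kind_id int NULL;
UPDATE counters c
JOIN kind_catalog k ON k.description = c.kind
SET c.kind_id = k.kind_id;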