GROUP BY query optimization - mysql

Database is MySQL with MyISAM engine.
Table definition:
CREATE TABLE IF NOT EXISTS matches (
id int(11) NOT NULL AUTO_INCREMENT,
game int(11) NOT NULL,
user int(11) NOT NULL,
opponent int(11) NOT NULL,
tournament int(11) NOT NULL,
score int(11) NOT NULL,
finish tinyint(4) NOT NULL,
PRIMARY KEY ( id ),
KEY game ( game ),
KEY user ( user ),
KEY i_gfu ( game , finish , user )
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3149047 ;
I have set an index on (game, finish, user) but this GROUP BY query still needs 0.4 - 0.6 seconds to run:
SELECT user AS player
, COUNT( id ) AS times
FROM matches
WHERE finish = 1
AND game = 19
GROUP BY user
ORDER BY times DESC
The EXPLAIN output:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | matches | ref | game,i_gfu | i_gfu | 5 | const,const | 155855 | Using where; Using temporary; Using filesort |
Is there any way I can make it faster? The table has about 800K records.
EDIT: I changed COUNT(id) to COUNT(*) and the time dropped to 0.08 - 0.12 seconds. I think I had tried that before creating the index and forgot to change it back afterwards.
In the EXPLAIN output, "Using index" explains the speedup:
| rows | Extra |
| 168029 | Using where; Using index; Using temporary; Using filesort |
(Side question: is a speedup by a factor of 5 normal here?)
There are about 2,000 users, so the final sort, even though it uses filesort, doesn't hurt performance. I tried it without the ORDER BY and it still takes almost the same time.

Get rid of the 'game' key - it's redundant with 'i_gfu'. As 'id' is unique, COUNT(id) just returns the number of rows in each group, so you can get rid of that too and replace it with COUNT(*). Try it that way and paste the output of EXPLAIN:
SELECT user AS player, COUNT(*) AS times
FROM matches
WHERE finish = 1
AND game = 19
GROUP BY user
ORDER BY times DESC
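Dropping the redundant single-column index is a one-line change (a sketch; keep i_gfu in place, since it covers everything the game key did):
ALTER TABLE matches DROP KEY game;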

Eh, tough. Try reordering your index: put the user column first (i.e. make the index (user, finish, game)), as that increases the chance the GROUP BY can use the index. However, in general GROUP BY can only use indexes if you limit the aggregate functions used to MIN and MAX (see http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html and http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html). Your ORDER BY isn't really helping either.
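If you want to experiment with that ordering, here is a sketch of the DDL (the index name i_ufg is just an illustrative choice):
ALTER TABLE matches ADD KEY i_ufg (user, finish, game);
Check with EXPLAIN afterwards whether the GROUP BY actually picks it up; if not, the index can be dropped again just as easily.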

One of the shortcomings of this query is that you order by an aggregate. That means no rows can be returned until the full result set has been generated; no index can exist (for MySQL/MyISAM, anyway) to fix that.
You can denormalize your data fairly easily to overcome this, though. You could, for instance, add insert/update triggers that maintain a count in a summary table, with an index, so that you can start returning rows immediately.
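A minimal sketch of that idea, assuming a summary table keyed by (game, user) that counts only finished matches (all names here are illustrative, and you would also need matching UPDATE/DELETE triggers plus a one-off backfill):
CREATE TABLE match_counts (
game int NOT NULL,
user int NOT NULL,
times int NOT NULL DEFAULT 0,
PRIMARY KEY (game, user),
KEY i_game_times (game, times)
) ENGINE=MyISAM;

DELIMITER //
CREATE TRIGGER matches_ai AFTER INSERT ON matches
FOR EACH ROW
BEGIN
-- only finished matches are counted, mirroring the WHERE finish = 1 filter
IF NEW.finish = 1 THEN
INSERT INTO match_counts (game, user, times) VALUES (NEW.game, NEW.user, 1)
ON DUPLICATE KEY UPDATE times = times + 1;
END IF;
END//
DELIMITER ;

-- the original aggregate then becomes a plain indexed lookup:
SELECT user AS player, times FROM match_counts WHERE game = 19 ORDER BY times DESC;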

The EXPLAIN verifies the (game, finish, user) index was used in the query. That seems like the best possible index to me. Could it be a hardware issue? What is your system RAM and CPU?

I take it that the bulk of the time is spent extracting and, more importantly, sorting (twice, counting the sort skipped by reading the index) the 150k matching rows out of 800k. I doubt you can optimize it much more than it already is.

As others have noted, you may have reached the limit of your ability to tune the query itself. You should next check the settings of the max_heap_table_size and tmp_table_size variables on your server. The default is 16MB, which may be too small for your table.
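For example, to inspect and raise them (a sketch; 64MB is an arbitrary illustrative value, size it to your RAM and concurrency):
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';
SET GLOBAL tmp_table_size = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;
The in-memory temporary table size is capped by the smaller of the two, so raise both; the same values can be made permanent in my.cnf.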

Related

MariaDB 5.5.68, prevent toxic selects?

I know that this MariaDB version 5.5.68 is really out of date, but I have to stay with this old version for a while.
Is there a way to prevent toxic selects that can block MyISAM tables for a long time (minutes)? The thing is that the select takes a read lock on the whole MyISAM table, and further inserts have to wait until all read locks are gone. So a long-running select starts to block the system.
Take this example table:
CREATE TABLE `tbllog` (
`LOGID` bigint unsigned NOT NULL auto_increment,
`LOGSOURCE` smallint unsigned default NULL,
`USERID` int unsigned default NULL,
`LOGDATE` datetime default NULL,
`SUBPROVIDERID` int unsigned default NULL,
`ACTIONID` smallint unsigned default NULL,
`COMMENT` varchar(255) default NULL,
PRIMARY KEY (`LOGID`),
KEY `idx_LogDate` (`LOGDATE`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The following select works fine as long as there are fewer than about 1 million entries in the table (the customers set the date range):
SELECT *
FROM tbllog
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
LIMIT 500;
But it becomes toxic if there are 10 million entries or more in the table. Then it runs for minutes, consumes a lot of memory and starts blocking the app.
This is the query plan with ~600,000 entries in the table:
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | tbllog | index | idx_LogDate | PRIMARY | 8 | NULL | 624 | Using where |
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
The thing is that I need to know whether a query will become toxic before it executes, so I can warn the user that it might block the system for a while, or even deny execution.
I know that InnoDB might not have this issue, but I don't know the drawbacks of a switch yet and I think it might be best to stay for the moment.
I tried running a simple SELECT COUNT(*) FROM tbllog WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 first (dropping the LIMIT and ORDER BY), but it is not really much faster than the real query and produces double the load in the worst case.
I also considered a worker thread (as mentioned here). But that would be a significant change to the whole system, too. InnoDB would have less impact, I think.
Any ideas about this issue?
Your EXPLAIN report shows that it's doing an index-scan on the primary key index. I believe this is because the range of dates is too broad, so the optimizer thinks that it's not much help to use the index instead of simply reading the whole table. By doing an index-scan of the primary key (logid), the optimizer can at least ensure that the rows are read in the order you requested in your ORDER BY clause, so it can skip sorting.
If I test your query (I created the table and filled it with 1M rows of random data), but make it ignore the primary key index, I get this EXPLAIN report:
mysql> explain SELECT * FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 ORDER BY logid LIMIT 500;
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | tbllog | NULL | range | idx_LogDate | idx_LogDate | 6 | NULL | 271471 | 10.00 | Using index condition; Using where; Using filesort |
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
This makes it use the index on logdate, so it examines fewer rows, in proportion to how many rows the date range condition matches. But the resulting rows must be sorted ("Using filesort" in the Extra column) before it can apply the LIMIT.
This won't help at all if your range of dates covers the whole table anyway. In fact, it will be worse, because it will access rows indirectly by the logdate index, and then it will have to sort rows. This solution helps only if the range of dates in the query matches a small portion of the table.
A somewhat better index is a compound index on (subproviderid, logdate).
mysql> alter table tbllog add index (subproviderid, logdate);
mysql> explain SELECT * FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 ORDER BY logid LIMIT 500;
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
| 1 | SIMPLE | tbllog | NULL | range | idx_LogDate,SUBPROVIDERID | SUBPROVIDERID | 11 | NULL | 12767 | 100.00 | Using index condition; Using filesort |
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
In my test, this helps the estimate of rows examined drop from 271471 to 12767 because they're restricted by subproviderid, then by logdate. How effective this is depends on how frequently subproviderid=1 is matched. If that's matched by virtually all of the rows anyway, then it won't be any help. If there are many different values of subproviderid and they each have a small fraction of rows, then it will help more to add this to the index.
In my test, I made an assumption that there are 20 different values of subproviderid with equal frequency. That is, my random data inserted round(rand()*20) as the value of subproviderid on each row. Thus it is expected that adding subproviderid resulted in 1/20th of the examined rows in my test.
When choosing the order of columns in the index, columns referenced in equality conditions must be listed before columns referenced in range conditions.
There's no way to get a prediction of the runtime of a query. That's not something the optimizer can predict. You should block users from requesting a range of dates that will match too great a portion of the table.
For this
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
Add both of these and hope that the Optimizer picks the better one:
INDEX(subproviderid, logdate, logid)
INDEX(subproviderid, logid)
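As concrete DDL, that would be something like this (a sketch; the index names are illustrative):
ALTER TABLE tbllog
ADD INDEX sub_date_id (subproviderid, logdate, logid),
ADD INDEX sub_id (subproviderid, logid);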
Better yet would be to also change to this (assuming it is 'equivalent' for your purposes):
ORDER BY logdate, logid
Then that first index will probably work nicely.
You really should change to InnoDB. (Caution: the table is likely to triple in size.) With InnoDB, there would be another indexing option. And, with an updated version, you could do "instant" index adding. Meanwhile, MyISAM will take a lot of time to add those indexes.
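The engine switch itself is a single statement, though it rewrites the whole table, so expect it to take a while and to need the extra disk space mentioned above:
ALTER TABLE tbllog ENGINE=InnoDB;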
Try creating a multi-column index specifically for your query.
CREATE INDEX sub_date_logid ON tbllog (subproviderid, logdate, logid);
This index should satisfy the WHERE filters in your query directly. Then it should present the rows in logid order so your ORDER BY ... LIMIT clauses don't have to sort the whole table. Will this help on long-dead MariaDB 5.5 with MyISAM? Hard to say for sure.
If it doesn't solve your performance problem, keep the multicolumn index and try doing the ORDER BY...LIMIT on the logid values rather than all the rows.
-- MySQL/MariaDB don't allow LIMIT directly inside an IN (...) subquery,
-- so put the inner query in a derived table and join back to it:
SELECT t.*
FROM tbllog t
JOIN (
SELECT logid
FROM tbllog
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
LIMIT 500 ) AS first_batch ON t.logid = first_batch.logid
ORDER BY t.logid;
This can speed things up because it lets MariaDB sort just the logid values to find the ones it wants. Then the outer query fetches only the 500 rows needed for your result set. Less data to sort = faster.
One of the options, although an external one, would be to use ProxySQL. It has capabilities to shape the traffic. You can create rules deciding how to process queries that match them. You could, for example, create a query rule that would check if a query is accessing a given table (you can use regular expressions to match the query) and, for example, block that query or introduce a delay in execution.
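As a sketch of what such a rule could look like in ProxySQL's admin interface (the regular expression and error message are purely illustrative; you would tune the match to your real query pattern):
INSERT INTO mysql_query_rules (rule_id, active, match_digest, error_msg, apply)
VALUES (10, 1, '^SELECT .* FROM tbllog .*BETWEEN', 'Date range too wide - query blocked', 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;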
Another option could be to use pt-kill. It's a script that's part of the Percona Toolkit and it's intended to, well, kill queries. You can define which queries you want to kill (matching them by regular expressions, by how long they ran or in other ways).
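A hedged example of how pt-kill is typically invoked for this kind of guard (the pattern and thresholds are illustrative):
pt-kill --host 127.0.0.1 --user admin --ask-pass --match-info "SELECT .* FROM tbllog" --busy-time 60 --interval 10 --print --kill
This kills any matching SELECT that has been running for more than 60 seconds, checking every 10 seconds, and logs what it killed.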
Having said that, if SELECTs can be optimized by rewriting or adding proper indexes, that may be the best option to go for.

Very simple MySQL index query running very slowly

I have a very simple query that is running extremely slowly despite being indexed.
My table is as follows:
mysql> show create table mytable
CREATE TABLE `mytable` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`status` varchar(64) DEFAULT NULL,
`user_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `ix_status_user_id_start_time` (`status`,`user_id`,`start_time`),
### other columns and indices, not relevant
) ENGINE=InnoDB AUTO_INCREMENT=115884841 DEFAULT CHARSET=utf8
Then the following query takes more than 10 seconds to run:
select id from mytable USE INDEX (ix_status_user_id_start_time) where status = 'running';
There are about 7 million rows in the table, and approximately 200 rows have status 'running'.
I would expect this query to take less than a tenth of a second. It should find the first row in the index with status running. And then scan the next 200 rows until it finds the first non-running row. It should not need to look outside the index.
When I explain the query I get a very strange result:
mysql> explain select id from mytable USE INDEX (ix_status_user_id_start_time) where status = 'running';
+----+-------------+---------+------------+------+------------------------------+------------------------------+---------+-------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+------------------------------+------------------------------+---------+-------+---------+----------+-------------+
| 1 | SIMPLE | mytable | NULL | ref | ix_status_user_id_start_time | ix_status_user_id_start_time | 195 | const | 2118793 | 100.00 | Using index |
+----+-------------+---------+------------+------+------------------------------+------------------------------+---------+-------+---------+----------+-------------+
It is estimating a scan of more than 2 million rows! Also, the cardinality of the status index does not seem correct. There are only about 5 or 6 different statuses, not 344.
Other info
There are somewhat frequent insertions and updates to this table. About 2 rows inserted per second, and 10 statuses updated per second. I don't know how much impact this has, but I would not expect it to be 30 seconds worth.
If I query by both status and user_id, sometimes it is fast (sub 0.1s) and sometimes it is slow (> 1s), depending on the user_id. This does not seem to depend on the size of the result set (some users with 20 rows are quick, others with 4 are slow)
Can anybody explain what is going on here and how it can be fixed?
I am using mysql version 5.7.33
As already mentioned in the comments, you are using many indexes on a big table, so the memory required for these indexes is very high.
You can increase the index buffer size in my.cnf by setting innodb_buffer_pool_size to a higher value.
But it is probably more efficient to use fewer indexes and to avoid combined indexes unless they are absolutely needed.
My guess is that if you remove all the extra indexes and create only one on status, this query will run in under 1s.
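A sketch of both suggestions; the 4G figure is an arbitrary illustration, size innodb_buffer_pool_size to your available RAM:
# my.cnf
[mysqld]
innodb_buffer_pool_size = 4G
And the simpler single-column index (the name is illustrative):
ALTER TABLE mytable ADD INDEX ix_status (status);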

MySQL Many Rows Examined by Simple Query

There is a table tb_tag_article like
CREATE TABLE `tb_tag_article` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`tag_id` int(16) NOT NULL,
`article_id` int(16) NOT NULL,
PRIMARY KEY (`id`),
KEY `key_tag_id_article_id` (`tag_id`,`article_id`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=365944 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
When I query like this, the result is 5120.
SELECT count(*) FROM tb_tag_article WHERE tag_id = 43
But when I explain the query like this
EXPLAIN SELECT count(*) FROM tb_tag_article WHERE tag_id = 43
examined rows is 13634.
+------+-------------+----------------+------+-----------------------+-----------------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+------+-----------------------+-----------------------+---------+-------+-------+-------------+
| 1 | SIMPLE | tb_tag_article | ref | key_tag_id_article_id | key_tag_id_article_id | 4 | const | 13634 | Using index |
+------+-------------+----------------+------+-----------------------+-----------------------+---------+-------+-------+-------------+
The query uses the index, but the number of examined rows is greater than the count of real matching data.
What's the problem?
Q: What's the problem?
A: It doesn't look like there's any problem.
The value for the "rows" column in the EXPLAIN output is an estimate, not an exact number.
Ref: http://dev.mysql.com/doc/refman/5.5/en/explain-output.html
For evaluating the "cost" of each possible access path, the optimizer only needs estimates in order to compare the efficiency of using a range scan operation on index vs. a full scan of all rows in the table. The optimizer doesn't need "exact" counts of the total number rows in the table, or the number of rows that will satisfy a predicate.
For this simple query, there are only a couple of possible plans that MySQL will consider.
And that estimate of 13634 isn't that far off from the exact count of rows. It's off by a factor of about 2.7, but MySQL is coming up with the right execution plan: using the index, rather than checking every row in the table.
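If you want the estimate itself to be recomputed from a fresh sample of the index statistics, a plain ANALYZE TABLE does that (just a side note; nothing here needs fixing):
ANALYZE TABLE tb_tag_article;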
There is no problem.
From MySQL reference (http://dev.mysql.com/doc/refman/5.0/en/explain-output.html#explain_rows):
The rows column indicates the number of rows MySQL believes it must
examine to execute the query.
For InnoDB tables, this number is an estimate, and may not always be
exact.
It might also be that because it has to analyze the index it takes into account the number of records in the index plus the number of records in the table. But that's just a hypothesis.
Also, it looks like there was a bug in MySQL 5.1 with Explain Row estimation causing the number to be wildly off: http://bugs.mysql.com/bug.php?id=53761 Depending on your version, this might explain some of the oddities.
The main take-away from the documentation seems to be to take the EXPLAIN rows column with a grain of salt.

Slow query, state = 'Sorting result' mysql

I have generated query
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from MessageDetails messagedet0_, MessageEntry messageent1_, MailSourceFile mailsource2_
where messagedet0_.id=messageent1_.messageDetails_id
and messageent1_.mailSourceFile_id=mailsource2_.id
order by mailsource2_.file, messageent1_.mboxOffset;
EXPLAIN says that there are no full scans and indexes are used:
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys |key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | mailsource2_ | index | PRIMARY |file | 384 | NULL | 1445 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | messageent1_ | ref | msf_idx,md_idx,FKBBB258CB60B94D38,FKBBB258CBF7C835B8 |msf_idx | 9 | skryb.mailsource2_.id | 2721 | Using where |
| 1 | SIMPLE | messagedet0_ | eq_ref | PRIMARY |PRIMARY | 8 | skryb.messageent1_.messageDetails_id | 1 | |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
CREATE TABLE `mailsourcefile` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`file` varchar(127) COLLATE utf8_bin DEFAULT NULL,
`size` bigint(20) DEFAULT NULL,
`archive_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `file` (`file`),
KEY `File_idx` (`file`),
KEY `Archive_idx` (`archive_id`),
KEY `FK7C3F816ECDB9F63C` (`archive_id`),
CONSTRAINT `FK7C3F816ECDB9F63C` FOREIGN KEY (`archive_id`) REFERENCES `archive` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1370 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
Also I have indexes on file and mboxOffset. SHOW FULL PROCESSLIST says that MySQL is in the 'Sorting result' state and it takes more than a few minutes. The result set size is 5M records. How can I optimize this?
I don't think there is much optimization to do in the query itself. Joins would make it more readable, but iirc MySQL nowadays is perfectly able to detect these kinds of constructs and plan the joins itself.
What would probably help is to increase both tmp_table_size and max_heap_table_size so that the result set of this query can stay in memory rather than being written to disk.
The maximum size for in-memory temporary tables is the minimum of the tmp_table_size and max_heap_table_size values
http://dev.mysql.com/doc/refman/5.5/en/internal-temporary-tables.html
The "using temporary" in the explain denotes that it is using a temporary table (see the link above again) - which will probably be written to disk due to the large amount of data (again, see the link above for more on this).
The file column alone is anywhere between 1 and 384 bytes, so let's take half of that for our estimate and ignore the rest of the columns; that leads to 192 bytes per row in the result set.
1445 * 2721 = 3,931,845 rows
* 192 = 754,914,240 bytes
/ 1024 ~= 737,221 kb
/ 1024 ~= 720 mb
This is certainly more than the max_heap_table_size (16,777,216 bytes) and most likely more than the tmp_table_size.
Not having to write such a result to disk will most certainly increase speed.
Good luck!
Optimization is always tricky. In order to make a dent in your execution time, I think you probably need to do some sort of pre-cooking.
If the file names are similar, (e.g. /path/to/file/1, /path/to/file/2), sorting them will mean a lot of byte comparisons, probably compounded by the unicode encoding. I would calculate a hash of the filename on insertion (e.g. MD5()) and then sort using that.
If the files are already well distributed (e.g. postfix spool names), you probably need to come up with some scheme on insertion whereby either:
simply reading records from some joined table will automatically generate them in correct order; this may not save a lot of time, but it will give you some data quickly so you can start processing, or
find a way to provide a "window" on the data so that not all of it needs to be processed at once.
As #raheel shan said above, you may want to try some JOINs:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails messagedet0_
inner join
MessageEntry messageent1_
on
messagedet0_.id = messageent1_.messageDetails_id
inner join
MailSourceFile mailsource2_
on
messageent1_.mailSourceFile_id = mailsource2_.id
order by
mailsource2_.file,
messageent1_.mboxOffset
My apologies if the keys are off, but I think I've conveyed the point.
Write the query with joins, like:
select
m2.file as col_0_0_,
m0.messageId as col_1_0_,
m1.mboxOffset as col_2_0_,
m1.mboxOffsetEnd as col_3_0_,
m0.id as col_4_0_
from
MessageDetails m0
inner join
MessageEntry m1
on
m0.id = m1.messageDetails_id
inner join
MailSourceFile m2
on
m1.mailSourceFile_id = m2.id
order by
m2.file,
m1.mboxOffset;
On seeing your EXPLAIN I found three things which, in my opinion, are not good:
1. filesort in the Extra column
2. index in the type column
3. a key length of 384
If you reduce the key length you may get quicker retrieval; for that, consider the character set you use and partial (prefix) indexes.
Here you can force an index for the ORDER BY and use an index for the join (create appropriate multi-column indexes and assign them). Remember it is always good to order by columns in the same table.
The index access type means it is scanning the entire index, which is not good.
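As an illustration of the prefix-index idea (the 64-character length is an arbitrary choice; pick the shortest prefix that stays selective for your file names):
ALTER TABLE mailsourcefile ADD KEY file_prefix_idx (file(64));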

Optimizing Datetime fields where indexes aren't being used as expected

I have a large, fast-growing log table in an application running with MySQL 5.0.77. I'm trying to find the best way to optimize queries that count instances within the last X days according to message type:
CREATE TABLE `counters` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kind` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_counters_on_kind` (`kind`),
KEY `index_counters_on_created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=302 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
For this test set, there are 668521 rows in the table. The query I'm trying to optimize is:
SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind;
Right now, that query takes between 3-5 seconds, and is being estimated as follows:
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | index_counters_on_created_at_idx | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
With the created_at index removed, it looks like this:
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | NULL | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
(Yes, for some reason the row estimate is larger than the number of rows in the table.)
So, apparently, there's no point to that index.
Is there really no better way to do this? I tried the column as a timestamp, and it just ended up slower.
Edit: I discovered that changing the query to use an interval instead of a specific date ends up using the index, cutting down the row estimate to about 20% of the query above:
SELECT kind, COUNT(id) FROM counters WHERE created_at >=
(NOW() - INTERVAL 7 DAY) GROUP BY kind;
I'm not entirely sure why that happens, but I'm fairly confident that if I understood it then the problem in general would make a lot more sense.
Why not use a concatenated index?
CREATE INDEX idx_counters_created_kind ON counters(created_at, kind);
It should go for an index-only scan ("Using index" in Extra); COUNT(id) doesn't get in the way, because id is NOT NULL and, being the primary key, is included in every InnoDB secondary index anyway.
References:
Concatenated index vs. merging multiple indexes
Index-Only Scan
After reading the latest edit on the question, the problem seems to be that the parameter used in the WHERE clause was being interpreted by MySQL as a string rather than as a datetime value. This would explain why the index_counters_on_created_at index was not being selected by the optimizer: instead, the created_at values were converted to a string representation and then compared, which rules out a range scan on the index. I think this can be prevented by an explicit cast to datetime in the WHERE clause:
where `created_at` >= convert({specific_date}, datetime)
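Applied to the query from the question, that would look something like this (the literal date is only a placeholder for the parameter):
SELECT kind, COUNT(id)
FROM counters
WHERE created_at >= CONVERT('2011-02-01 00:00:00', DATETIME)
GROUP BY kind;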
My original comments still apply for the optimization part.
The real performance killer here is the kind column. When doing the GROUP BY, the database engine first needs to determine all the distinct values in the kind column, which results in a table or index scan. That's why the estimated rows figure is bigger than the total number of rows in the table: in one pass it determines the distinct values in the kind column, and in a second pass it determines which rows meet the created_at >= ? condition.
To make matters worse, the kind column is a varchar(255), which is too big to be efficient. Add to that the utf8 character set and utf8_unicode_ci collation, which increase the complexity of the comparisons needed to determine the unique values in that column.
This will perform a lot better if you change the type of the kind column to int, because integer comparisons are more efficient and simpler than unicode character comparisons. It would also help to have a catalog table for the kinds of messages, in which you store a kind_id and a description (a sketch of such a table follows at the end). Then do the grouping on a join of the kind catalog table and a subquery of the log table that first filters by date:
select k.kind_id, count(*)
from
kind_catalog k
inner join (
select kind_id
from counters
where created_at >= ?
) c on k.kind_id = c.kind_id
group by k.kind_id
This will first filter the counters table by created_at >= ? and can benefit from the index on that column. Then it joins that to the kind_catalog table and, if the SQL optimizer is good, it will scan the smaller kind_catalog table for the grouping, instead of the counters table.
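For reference, a minimal sketch of the catalog-table layout this answer assumes (names are illustrative; migrating the existing varchar values into kind_id is left out):
CREATE TABLE kind_catalog (
kind_id int NOT NULL AUTO_INCREMENT,
description varchar(255) NOT NULL,
PRIMARY KEY (kind_id)
);

ALTER TABLE counters
ADD COLUMN kind_id int,
ADD KEY index_counters_on_kind_id (kind_id);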