query becomes slow with GROUP BY - mysql

I have spent 4 hours googling, reading, searching, and trying all sorts of indexes and SQLyog. When I add the GROUP BY, the query goes from 0.002 seconds to 0.093 seconds. Is this normal and acceptable? Or can I alter the indexes and/or the query?
Table:
uniqueid int(11) NO PRI NULL auto_increment
ip varchar(64) YES NULL
lang varchar(16) YES MUL NULL
timestamp int(11) YES MUL NULL
correct decimal(12,2) YES NULL
user varchar(32) YES NULL
timestart int(11) YES NULL
timeend int(11) YES NULL
speaker varchar(64) YES NULL
postedAnswer int(32) YES NULL
correctAnswerINT int(32) YES NULL
Query:
SELECT
SQL_NO_CACHE
user,
lang,
COUNT(*) AS total,
SUM(correct) AS correct,
ROUND(SUM(correct) / COUNT(*) * 100) AS score,
TIMESTAMP
FROM
maths_score
WHERE TIMESTAMP > 1
AND lang = 'es'
GROUP BY USER
ORDER BY (
(SUM(correct) / COUNT(*) * 100) + SUM(correct)
) DESC
LIMIT 500
explain extended:
id select_type table type possible_keys key key_len ref rows filtered Extra
------ ----------- ----------- ------ ------------------------- -------------- ------- ------ ------ -------- ---------------------------------------------------------------------
1 SIMPLE maths_score ref scoretable,fulltablething fulltablething 51 const 10631 100.00 Using index condition; Using where; Using temporary; Using filesort
Current indexes (I have tried many)
Keyname Type Unique Packed Column Cardinality Collation Null Comment
uniqueid BTREE Yes No uniqueid 21262 A No
scoretable BTREE No No timestamp 21262 A Yes
lang 21262 A Yes
fulltablething BTREE No No lang 56 A Yes
timestamp 21262 A Yes
user 21262 A Yes

Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.
Do you have INDEX(lang, TIMESTAMP)? It is likely to help both versions of the query.
Without the GROUP BY, you get one row, correct? With the GROUP BY, you get many rows, correct? Guess what, it takes more time to deliver more rows.
In addition, the GROUP BY probably involves an extra sort. The ORDER BY involves a sort, but in one case there is only 1 row to sort, hence faster. If there are a million USERs, then the ORDER BY will need to sort a million rows, only to deliver 500.
Please provide EXPLAIN SELECT ... for each case -- you will see some of what I am saying.

So you ran the query without GROUP BY and got one result row in 0.002 secs. Then you added GROUP BY (and ORDER BY obviously) and ended up with multiple result rows in 0.093 secs.
In order to produce this result, the DBMS must somehow order your records by user, or create buckets per user, so as to get the record count, sum, etc. per user. This of course takes much more time than just running through the table, counting records and summing a value unconditionally. Finally, the DBMS must sort these results again. I am not surprised this runs much longer.
The most appropriate index for this query should be:
create index idx on maths_score (lang, timestamp, user, correct);
This is a covering index, starting with the columns in WHERE, continuing with the column in GROUP BY and ending with all other columns used in the query.
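To check that the new index actually covers the query, run EXPLAIN after creating it. A quick sanity check (the non-aggregated TIMESTAMP column is dropped from the select list here, since it is not in the GROUP BY):
EXPLAIN SELECT
user,
lang,
COUNT(*) AS total,
SUM(correct) AS correct,
ROUND(SUM(correct) / COUNT(*) * 100) AS score
FROM maths_score
WHERE timestamp > 1
AND lang = 'es'
GROUP BY user
ORDER BY ((SUM(correct) / COUNT(*) * 100) + SUM(correct)) DESC
LIMIT 500;
-- With the covering index in place, the Extra column should include "Using index";
-- "Using temporary; Using filesort" will remain, because the ORDER BY is on an aggregate.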

Related

MySQL query performance degrade with filesort

I have the below table with more than 190M records,
CREATE TABLE notification (
_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
recipient CHAR(11) NOT NULL,
recipient_group CHAR(11),
topic VARCHAR(25) NOT NULL,
identifier VARCHAR(60) NOT NULL,
timestamp TIMESTAMP(3) NOT NULL,
type VARCHAR(255) NOT NULL,
actioned BIT NOT NULL DEFAULT 0,
expiry_timestamp TIMESTAMP DEFAULT NULL,
INDEX recipient_recipient_group_timestamp_id (recipient, recipient_group, timestamp DESC, _id DESC),
INDEX topic_identifier (topic, identifier),
INDEX expiry_timestamp (expiry_timestamp),
UNIQUE recipient_recipient_group_topic_identifier (recipient, recipient_group, topic, identifier)
) CHARACTER SET ascii COLLATE ascii_bin;
Now I want to query all the notifications for a recipient belonging to a group based on the timestamp,
explain
select * from notification
where (recipient = 'recipient' and (recipient_group = 'group' or recipient_group is null)
and (expiry_timestamp > {ts '2018-06-26 08:00:00.0'} or expiry_timestamp is null)
and timestamp > {ts '1970-01-01 00:00:00.0'} and type in ('TYPE'))
order by timestamp desc, _id desc limit 10;
I have noticed that this query performs poorly when there are a large number of notifications for a user, as MySQL ends up using filesort for the ORDER BY on timestamp and _id.
+----+-------------+--------------+------------+-------------+----------------------------------------------------------------------------------------------------+--------------------------------------------+---------+-------------+------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+-------------+----------------------------------------------------------------------------------------------------+--------------------------------------------+---------+-------------+------+----------+----------------------------------------------------+
| 1 | SIMPLE | notification | NULL | ref_or_null | recipient_recipient_group_topic_identifier,recipient_recipient_group_timestamp_id,expiry_timestamp | recipient_recipient_group_topic_identifier | 23 | const,const | 2 | 5.01 | Using index condition; Using where; Using filesort |
+----+-------------+--------------+------------+-------------+----------------------------------------------------------------------------------------------------+--------------------------------------------+---------+-------------+------+----------+----------------------------------------------------+
Is there a way to improve the query performance may be adding/modifying an index?
Edit:
It seems that MySQL uses the index recipient_recipient_group_timestamp_id if I remove "or recipient_group is null" from the WHERE condition.
In general, you can't optimize an inequality condition with an index and also eliminate the filesort in the same query.
Think of a telephone book. It's sorted by last name, first name, then if there are still ties (people with the same name), it's sorted by the phone number. So if you want this query:
SELECT * FROM PhoneBook WHERE last_name=? AND first_name=?
ORDER BY phone_number;
Then the sorting will be a no-op, because if the first two are tied, the matching rows will naturally be stored in the requested order already. The query can skip the filesort if it simply reads the rows in the index order.
But if you query any type of inequality:
SELECT * FROM PhoneBook WHERE last_name=? AND first_name LIKE 'S%'
ORDER BY phone_number;
This matches multiple first names, so reading the matching rows in index order no longer resolves the tie; they aren't guaranteed to be sorted by phone number. The query has to sort the rows it matched.
The same is true of any other type of inequality or range search that can be indexed: !=, IN(), LIKE, >, etc.
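For the OR recipient_group IS NULL part specifically (which is what forces ref_or_null here, per the question's edit), a common workaround is to split the query into two branches that can each read the (recipient, recipient_group, timestamp, _id) index in order, then merge them. A sketch, not tested against this schema:
( SELECT * FROM notification
  WHERE recipient = 'recipient' AND recipient_group = 'group'
    AND timestamp > '1970-01-01 00:00:00'
    AND (expiry_timestamp > '2018-06-26 08:00:00' OR expiry_timestamp IS NULL)
    AND type IN ('TYPE')
  ORDER BY timestamp DESC, _id DESC LIMIT 10 )
UNION ALL
( SELECT * FROM notification
  WHERE recipient = 'recipient' AND recipient_group IS NULL
    AND timestamp > '1970-01-01 00:00:00'
    AND (expiry_timestamp > '2018-06-26 08:00:00' OR expiry_timestamp IS NULL)
    AND type IN ('TYPE')
  ORDER BY timestamp DESC, _id DESC LIMIT 10 )
ORDER BY timestamp DESC, _id DESC LIMIT 10;
-- Each branch can stop after 10 index-ordered rows, so the final sort
-- handles at most 20 rows instead of every row the user has.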

MariaDB 5.5.68, prevent toxic selects?

I know that this MariaDB version 5.5.68 is really out of date, but I have to stay with this old version for a while.
Is there a way to prevent toxic selects that can block MyISAM tables for a long time (minutes)? The thing is that such a select takes a read lock on the whole MyISAM table, and further inserts wait until all read locks are gone. So the long-running select starts to block the system.
Take this example table:
CREATE TABLE `tbllog` (
`LOGID` bigint unsigned NOT NULL auto_increment,
`LOGSOURCE` smallint unsigned default NULL,
`USERID` int unsigned default NULL,
`LOGDATE` datetime default NULL,
`SUBPROVIDERID` int unsigned default NULL,
`ACTIONID` smallint unsigned default NULL,
`COMMENT` varchar(255) default NULL,
PRIMARY KEY (`LOGID`),
KEY `idx_LogDate` (`LOGDATE`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The following select works fine as long as there are fewer than about 1 million entries in the table (the customers set the date range):
SELECT *
FROM tbllog
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
LIMIT 500;
But it becomes toxic if there are 10 million entries or more in the table. Then it starts to run for minutes, consumes a lot of memory and starts blocking the app.
This is the query plan with ~600,000 entries in the table:
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | tbllog | index | idx_LogDate | PRIMARY | 8 | NULL | 624 | Using where |
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
The thing is that I need to know whether the query will become toxic before execution. Then maybe I can warn the user that it might block the system for a while, or even deny execution.
I know that InnoDB might not have this issue, but I don't know the drawbacks of a switch yet and I think it might be best to stay for the moment.
I tried to do a simple SELECT COUNT(*) FROM tbllog WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 before (removing LIMIT and ORDER BY), but it is not really much faster than the real query and produces double the load in the worst case.
I also considered a worker thread (like mentioned here). But this is a significant change to the whole system, too. InnoDB would have less impact, I think.
Any ideas about this issue?
Your EXPLAIN report shows that it's doing an index-scan on the primary key index. I believe this is because the range of dates is too broad, so the optimizer thinks that it's not much help to use the index instead of simply reading the whole table. By doing an index-scan of the primary key (logid), the optimizer can at least ensure that the rows are read in the order you requested in your ORDER BY clause, so it can skip sorting.
If I test your query (I created the table and filled it with 1M rows of random data), but make it ignore the primary key index, I get this EXPLAIN report:
mysql> explain SELECT * FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 ORDER BY logid LIMIT 500;
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | tbllog | NULL | range | idx_LogDate | idx_LogDate | 6 | NULL | 271471 | 10.00 | Using index condition; Using where; Using filesort |
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
This makes it use the index on logdate, so it examines fewer rows, according to the proportion matched by the date range condition. But the resulting rows must be sorted ("Using filesort" in the Extra column) before it can apply the LIMIT.
This won't help at all if your range of dates covers the whole table anyway. In fact, it will be worse, because it will access rows indirectly by the logdate index, and then it will have to sort rows. This solution helps only if the range of dates in the query matches a small portion of the table.
A somewhat better index is a compound index on (subproviderid, logdate).
mysql> alter table tbllog add index (subproviderid, logdate);
mysql> explain SELECT * FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 ORDER BY logid LIMIT 500;
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
| 1 | SIMPLE | tbllog | NULL | range | idx_LogDate,SUBPROVIDERID | SUBPROVIDERID | 11 | NULL | 12767 | 100.00 | Using index condition; Using filesort |
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
In my test, this helps the estimate of rows examined drop from 271471 to 12767 because they're restricted by subproviderid, then by logdate. How effective this is depends on how frequently subproviderid=1 is matched. If that's matched by virtually all of the rows anyway, then it won't be any help. If there are many different values of subproviderid and they each have a small fraction of rows, then it will help more to add this to the index.
In my test, I made an assumption that there are 20 different values of subproviderid with equal frequency. That is, my random data inserted round(rand()*20) as the value of subproviderid on each row. Thus it is expected that adding subproviderid resulted in 1/20th of the examined rows in my test.
To choose the order of columns listed in the index, columns referenced in equality conditions must be listed before the column referenced in range conditions.
There's no way to get a prediction of the runtime of a query. That's not something the optimizer can predict. You should block users from requesting a range of dates that will match too great a portion of the table.
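That said, a cheap pre-flight heuristic is to run EXPLAIN first: it does not execute the query, and its rows column gives the optimizer's estimate of how many rows will be examined. It is only an estimate, and it can be misleadingly small when the optimizer gambles on a LIMIT-friendly index scan (as in the plan shown in the question), but it can still gate obviously huge date ranges. A sketch, with an assumed threshold:
EXPLAIN SELECT * FROM tbllog IGNORE INDEX(PRIMARY)
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid LIMIT 500;
-- Read the `rows` value from the result in the application. If it exceeds
-- a threshold you choose (say 1,000,000 -- an assumption to tune), warn the
-- user or refuse to run the real query.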
For this
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
Add both of these and hope that the Optimizer picks the better one:
INDEX(subproviderid, logdate, logid)
INDEX(subproviderid, logid)
Better yet would be to also change to this (assuming it is 'equivalent' for your purposes):
ORDER BY logdate, logid
Then that first index will probably work nicely.
You really should change to InnoDB. (Caution: the table is likely to triple in size.) With InnoDB, there would be another indexing option. And, with an updated version, you could do "instant" index adding. Meanwhile, MyISAM will take a lot of time to add those indexes.
Try creating a multi-column index specifically for your query.
CREATE INDEX sub_date_logid ON tbllog (subproviderid, logdate, logid);
This index should satisfy the WHERE filters in your query directly. Then it should present the rows in logid order so your ORDER BY ... LIMIT clauses don't have to sort the whole table. Will this help on long-dead MariaDB 5.5 with MyISAM? Hard to say for sure.
If it doesn't solve your performance problem, keep the multicolumn index and try doing the ORDER BY...LIMIT on the logid values rather than all the rows.
SELECT *
FROM tbllog
WHERE logid IN (
-- MySQL/MariaDB do not allow LIMIT directly inside an IN subquery,
-- so the LIMITed select is wrapped in a derived table:
SELECT logid FROM (
SELECT logid
FROM tbllog
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
LIMIT 500 ) AS first500 )
ORDER BY logid;
This can speed things up because it lets MariaDB sort just the logid values to find the ones it wants. Then the outer query fetches only the 500 rows needed for your result set. Less data to sort = faster.
One of the options, although an external one, would be to use ProxySQL. It has capabilities to shape the traffic. You can create rules deciding how to process queries that match them. You could, for example, create a query rule that would check if a query is accessing a given table (you can use regular expressions to match the query) and, for example, block that query or introduce a delay in execution.
Another option could be to use pt-kill. It's a script that's part of the Percona Toolkit and it's intended to, well, kill queries. You can define which queries you want to kill (matching them by regular expressions, by how long they ran or in other ways).
Having said that, if SELECTs can be optimized by rewriting or adding proper indexes, that may be the best option to go for.

Optimize selecting all rows from a table based on results from the same table?

I'll be the first to admit that I'm not great at SQL (and I probably shouldn't be treating it like a rolling log file), but I was wondering if I could get some pointers for improving some slow queries...
I have a large mysql table with 2M rows where I do two full table lookups based on a subset of the most recent data. When I load the page that contains these queries, I often find they take several seconds to complete, but the queries inside are quite quick.
PMA's (supposedly terrible) advisor pretty much throws the entire kitchen sink at me: temporary tables, too many sorts, joins without indexes (I don't even have any joins?), reading from a fixed position, reading the next position, temporary tables written to disk... That last one especially makes me wonder if it's a configuration issue, but I played with all the knobs, and even paid for a managed service, which didn't seem to help.
CREATE TABLE `archive` (
`id` bigint UNSIGNED NOT NULL,
`ip` varchar(15) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`service` enum('ssh','telnet','ftp','pop3','imap','rdp','vnc','sql','http','smb','smtp','dns','sip','ldap') CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`hostid` bigint UNSIGNED NOT NULL,
`date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
ALTER TABLE `archive`
ADD PRIMARY KEY (`id`),
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-date` (`service`,`date`),
ADD KEY `service-ip` (`service`,`ip`);
Adding indexes definitely helped (even though they're 4x the size of the actual data), but I'm kind of at a loss where I can optimize further. Initially I thought about caching the subquery results in PHP and using them twice for the main queries, but I don't think I have access to the result once I close the subquery. I looked into doing joins, but they look like they're meant for two or more separate tables; the subquery is on the same table, so I'm not sure that would even work. The queries are supposed to find the most active IPs/services, based on whether I have data from an IP in the past 24 hours...
SELECT service, COUNT(service) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY service HAVING total > 1
ORDER BY total DESC, service ASC LIMIT 10
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | service,ip,date-service,ip-date,ip-service,service-date,service-ip | ip-service | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44246 | 100.00 | Using where; Using index |
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
SELECT ip, COUNT(ip) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY ip HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | ip,date-ip,ip-date,ip-service,service-ip | ip-date | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44168 | 100.00 | Using where; Using index |
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
common subquery: 0.0351s
whole query 1: 1.4270s
whole query 2: 1.5601s
total page load: 3.050s (7 queries total)
Am I just doomed to terrible performance with this table?
Hopefully there's enough information here to get an idea of what's going on, but if anyone can help I would certainly appreciate it. I don't mind throwing more hardware at the issue, but when an 8c/16t server with 16GB can't handle 150MB of data, I'm not sure what will. Thanks in advance for reading my long-winded question.
You have the right indexes (as well as many other indexes) and your query both meets your specs and runs close to optimally. It's unlikely that you can make this much faster: it needs to look all the way back to the beginning of your table.
If you can change your spec so you only have to look back a limited amount of time like a year you'll get a good speedup.
Some possible minor tweaks.
use the latin1_bin collation for your ip column. It uses 8-bit characters and collates them by simple byte comparison, with no case-folding or other collation rules. That's plenty for IPv4 dotted-quad addresses (and IPv6 addresses). You'll get rid of a bit of overhead in matching and grouping. Or, even better,
If you know you will have nothing but IPv4 addresses, rework your ip column to store their binary representations ( that is, the INET_ATON() - generated value of each IPv4). You can fit those in the UNSIGNED INT 32-bit integer data type, making the lookup, grouping, and ordering even faster.
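A minimal migration sketch for that idea (the ip_num column and index name are assumptions; IPv4 only):
ALTER TABLE archive ADD COLUMN ip_num INT UNSIGNED;
UPDATE archive SET ip_num = INET_ATON(ip);
ALTER TABLE archive ADD KEY `ip_num-date` (ip_num, `date`);
-- Queries then filter, group, and order on ip_num directly
-- (ORDER BY ip_num replaces ORDER BY INET_ATON(ip)),
-- and use INET_NTOA(ip_num) only for display.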
It's possible to rework the way you gather these data. For example, you could arrange to gather at most one row per service per day. That will reduce the timeseries resolution of your data, but it will also make your queries much faster. Define your table something like this:
CREATE TABLE archive2 (
ip VARCHAR(15) COLLATE latin1_bin NOT NULL,
service ENUM ('ssh','telnet','ftp',
'pop3','imap','rdp',
'vnc','sql','http','smb',
'smtp','dns','sip','ldap') NOT NULL,
`date` DATE NOT NULL,
`count` INT NOT NULL,
hostid bigint UNSIGNED NOT NULL,
PRIMARY KEY (`date`, ip, service)
) ENGINE=InnoDB;
Then, when you insert a row, use this query:
INSERT INTO archive2 (`date`, ip, service, `count`, hostid)
VALUES (CURDATE(), ?ip, ?service, 1, ?hostid)
ON DUPLICATE KEY UPDATE
`count` = `count` + 1;
This will automatically increment your count column if the row for the ip, service, and date already exists.
Then your second query will look like:
SELECT ip, SUM(`count`) AS total
FROM archive2
WHERE ip IN (
SELECT ip FROM archive2
WHERE `date` > CURDATE() - INTERVAL 1 DAY
)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10;
The index of the primary key will satisfy this query.
First query
(I'm not convinced that it can be made much faster.)
(currently)
SELECT service, COUNT(service) AS total
FROM `archive`
WHERE ip IN (
SELECT DISTINCT ip
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
Notes:
COUNT(service) --> COUNT(*)
DISTINCT is not needed in IN (SELECT DISTINCT ...)
IN ( SELECT ... ) is often slow; rewrite using EXISTS ( SELECT 1 ... ) or JOIN (see below)
INDEX(date, IP) -- for subquery
INDEX(service, IP) -- for your outer query
INDEX(IP, service) -- for my outer query
Toss redundant indexes; they can get in the way. (See below)
It will have to gather all the possible results before getting to the ORDER BY and LIMIT. (That is, LIMIT has very little impact on performance for this query.)
CHARACTER SET utf8 COLLATE utf8_unicode_ci is gross overkill for IP addresses; switch to CHARACTER SET ascii COLLATE ascii_bin.
If you are running MySQL 8.0 (Or MariaDB 10.2), a WITH to calculate the subquery once, together with a UNION to compute the two outer queries, may provide some extra speed.
MariaDB has a "subquery cache" that might have the effect of skipping the second subquery evaluation.
By using DATETIME instead of TIMESTAMP, you will avoid two minor hiccups per year when daylight savings kicks in/out.
I doubt if hostid needs to be a BIGINT (8-bytes).
To switch to a JOIN, think of fetching the candidate rows first:
SELECT service, COUNT(*) AS total
FROM ( SELECT DISTINCT IP
FROM archive
WHERE `date` > NOW() - INTERVAL 24 HOUR
) AS x
JOIN archive USING(IP)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
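For the WITH note above, here is roughly how computing the subquery once for both outer queries might look on MySQL 8.0 / MariaDB 10.2+ (a sketch, stacking both result sets into one; untested against this data):
WITH recent AS (
SELECT DISTINCT ip
FROM archive
WHERE `date` > NOW() - INTERVAL 24 HOUR
)
( SELECT service AS item, COUNT(*) AS total
FROM recent JOIN archive USING(ip)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10 )
UNION ALL
( SELECT ip AS item, COUNT(*) AS total
FROM recent JOIN archive USING(ip)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC
LIMIT 10 );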
For any further discussion of any slow (but working) query, please provide both flavors of EXPLAIN:
EXPLAIN SELECT ...
EXPLAIN FORMAT=JSON SELECT ...
Drop these indexes:
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
Recommend only
ADD PRIMARY KEY (`id`),
-- as discussed:
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-ip` (`service`,`ip`),
-- maybe other queries need these:
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `service-date` (`service`,`date`),
The general rule here is that you don't need INDEX(a) when you also have INDEX(a,b). In particular, they may be preventing the use of better indexes; see the EXPLAINs.
Second query
Rewrite it
SELECT ip, COUNT(*) AS total
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC
LIMIT 10
It will use only INDEX(date, ip).

Optimizing Datetime fields where indexes aren't being used as expected

I have a large, fast-growing log table in an application running with MySQL 5.0.77. I'm trying to find the best way to optimize queries that count instances within the last X days according to message type:
CREATE TABLE `counters` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kind` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_counters_on_kind` (`kind`),
KEY `index_counters_on_created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=302 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
For this test set, there are 668521 rows in the table. The query I'm trying to optimize is:
SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind;
Right now, that query takes between 3-5 seconds, and is being estimated as follows:
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | index_counters_on_created_at_idx | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
With the created_at index removed, it looks like this:
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | NULL | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
(Yes, for some reason the row estimate is larger than the number of rows in the table.)
So, apparently, there's no point to that index.
Is there really no better way to do this? I tried the column as a timestamp, and it just ended up slower.
Edit: I discovered that changing the query to use an interval instead of a specific date ends up using the index, cutting down the row estimate to about 20% of the query above:
SELECT kind, COUNT(id) FROM counters WHERE created_at >=
(NOW() - INTERVAL 7 DAY) GROUP BY kind;
I'm not entirely sure why that happens, but I'm fairly confident that if I understood it then the problem in general would make a lot more sense.
Why not use a concatenated index?
CREATE INDEX idx_counters_created_kind ON counters(created_at, kind);
It should go for an index-only scan ("Using index" in the Extra column); since id is NOT NULL anyway, COUNT(id) is just a row count that the index can answer.
References:
Concatenated index vs. merging multiple indexes
Index-Only Scan
After reading the latest edit on the question, the problem seems to be that the parameter used in the WHERE clause was being interpreted by MySQL as a string rather than as a datetime value. This would explain why the index_counters_on_created_at index was not being selected by the optimizer: instead, the engine would scan, converting the created_at values to a string representation and then doing the comparison. I think this can be prevented by an explicit cast to datetime in the WHERE clause:
where `created_at` >= convert({specific_date}, datetime)
My original comments still apply for the optimization part.
The real performance killer here is the kind column. When doing the GROUP BY, the database engine first needs to determine all the distinct values in the kind column, which results in a table or index scan. That's why the estimated row count is bigger than the total number of rows in the table: in one pass it determines the distinct values in the kind column, and in a second pass it determines which rows meet the created_at >= ? condition.
To make matters worse, the kind column is a varchar(255), which is too big to be efficient. Add to that the utf8 character set and utf8_unicode_ci collation, which increase the complexity of the comparisons needed to determine the unique values in that column.
This will perform a lot better if you change the type of the kind column to int, because integer comparisons are more efficient and simpler than Unicode character comparisons. It would also help to have a catalog table for the kinds of messages, storing a kind_id and a description. Then do the grouping on a join of the kind catalog table and a subquery of the log table that first filters by date:
select k.kind_id, count(*)
from
kind_catalog k
inner join (
select kind_id
from counters
where created_at >= ?
) c on k.kind_id = c.kind_id
group by k.kind_id
This will first filter the counters table by created_at >= ? and can benefit from the index on that column. Then it will join that to the kind_catalog table, and if the SQL optimizer is good it will scan the smaller kind_catalog table for doing the grouping, instead of the counters table.
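A minimal sketch of the catalog table this answer assumes (the names are illustrative):
CREATE TABLE kind_catalog (
kind_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
description VARCHAR(255) NOT NULL
);
-- counters would then store kind_id INT instead of the varchar kind column,
-- ideally with INDEX(created_at, kind_id) so the date filter and the grouping
-- column come from the same index.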

GROUP BY query optimization

Database is MySQL with MyISAM engine.
Table definition:
CREATE TABLE IF NOT EXISTS matches (
id int(11) NOT NULL AUTO_INCREMENT,
game int(11) NOT NULL,
user int(11) NOT NULL,
opponent int(11) NOT NULL,
tournament int(11) NOT NULL,
score int(11) NOT NULL,
finish tinyint(4) NOT NULL,
PRIMARY KEY ( id ),
KEY game ( game ),
KEY user ( user ),
KEY i_gfu ( game , finish , user )
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3149047 ;
I have set an index on (game, finish, user) but this GROUP BY query still needs 0.4 - 0.6 seconds to run:
SELECT user AS player
, COUNT( id ) AS times
FROM matches
WHERE finish = 1
AND game = 19
GROUP BY user
ORDER BY times DESC
The EXPLAIN output:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | matches | ref | game,i_gfu | i_gfu | 5 | const,const | 155855 | Using where; Using temporary; Using filesort |
Is there any way I can make it faster? The table has about 800K records.
EDIT: I changed COUNT(id) to COUNT(*) and the time dropped to 0.08 - 0.12 seconds. I think I had tried that before creating the index and forgot to try it again afterwards.
In the EXPLAIN output, the Using index explains the speed-up:
| rows | Extra |
| 168029 | Using where; Using index; Using temporary; Using filesort |
(Side question: is a drop by a factor of 5 like this normal?)
There are about 2000 users, so the final sorting, even if it uses filesort, doesn't hurt performance. I tried without the ORDER BY and it still takes almost the same time.
Get rid of the game key - it's redundant with i_gfu. As id is unique, COUNT(id) just returns the number of rows in each group, so you can get rid of it and replace it with COUNT(*). Try it that way and paste the output of EXPLAIN:
SELECT user AS player, COUNT(*) AS times
FROM matches
WHERE finish = 1
AND game = 19
GROUP BY user
ORDER BY times DESC
Eh, tough. Try reordering your index: put the user column first (so make the index (user, finish, game)) as that increases the chance the GROUP BY can use the index. However, in general GROUP BY can only use indexes if you limit the aggregate functions used to MIN and MAX (see http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html and http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html). Your order by isn't really helping either.
One of the shortcomings of this query is that you order by an aggregate. That means that you can't return any rows until the full result set has been generated; no index can exist (for mysql myisam, anyway) to fix that.
You can denormalize your data fairly easily to overcome this, though; you could, for instance, add an insert/update trigger to maintain a count in an indexed summary table, so that you can start returning rows immediately.
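A hedged sketch of that denormalization (the summary table and trigger names are assumptions, and only finished matches are counted):
CREATE TABLE match_counts (
game int NOT NULL,
user int NOT NULL,
times int NOT NULL DEFAULT 0,
PRIMARY KEY (game, user),
KEY (game, times)
) ENGINE=MyISAM;
DELIMITER //
CREATE TRIGGER matches_after_insert AFTER INSERT ON matches
FOR EACH ROW
BEGIN
-- only finished matches contribute to the summary
IF NEW.finish = 1 THEN
INSERT INTO match_counts (game, user, times)
VALUES (NEW.game, NEW.user, 1)
ON DUPLICATE KEY UPDATE times = times + 1;
END IF;
END//
DELIMITER ;
-- The report then becomes an index-friendly read:
-- SELECT user, times FROM match_counts WHERE game = 19 ORDER BY times DESC;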
The EXPLAIN verifies the (game, finish, user) index was used in the query. That seems like the best possible index to me. Could it be a hardware issue? What is your system RAM and CPU?
I take it that the bulk of the time is spent on extracting and more importantly sorting (twice, including the one skipped by reading the index) 150k rows out of 800k. I doubt you can optimize it much more than it already is.
As others have noted, you may have reached the limit of your ability to tune the query itself. You should next check the settings of the max_heap_table_size and tmp_table_size variables on your server. The default is 16MB, which may be too small for your table.
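For reference, a quick way to inspect and (for testing) raise those limits; the 64MB value is just an example:
SHOW VARIABLES LIKE 'max_heap_table_size';
SHOW VARIABLES LIKE 'tmp_table_size';
-- In-memory temporary tables are capped by the smaller of the two values,
-- so raise both together:
SET SESSION max_heap_table_size = 64 * 1024 * 1024;
SET SESSION tmp_table_size = 64 * 1024 * 1024;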