Index Columns and Order - mysql

If I have a select statement like the statement below, what order and what columns should be included in an index?
SELECT MIN(BenchmarkID),
MIN(BenchmarkDateTime),
Currency1,
Currency2,
BenchmarkType
FROM Benchmark
INNER JOIN MyCurrencyPairs ON Currency1 = Pair1
AND Currency2 = Pair2
WHERE BenchmarkDateTime > IN_BeginningTime
GROUP BY Currency1, Currency2, BenchmarkType;
Items to note:
The Benchmark table will have billions of rows
The MyCurrencyPairs table is a local table that will have less than 10 records
IN_BeginningTime is an input parameter
Columns Currency1 and Currency2 are VARCHARs
Columns BenchmarkID and BenchmarkType are INTs
Column BenchmarkDateTime is a datetime (hopefully that was obvious)
I've created an index with Currency1, Currency2, BenchmarkType, BenchmarkDateTime, and BenchmarkID but I wasn't getting the speed I was wanting. Could I create a better index?
Edit #1: Someone requested the explain results below. Let me know if anything else is needed
Edit #2: Someone requested the DDL (I'm assuming this is the create statement) for the two tables:
(this benchmark table exists in the database)
CREATE TABLE `benchmark` (
`SequenceNumber` INT(11) NOT NULL,
`BenchmarkType` TINYINT(3) UNSIGNED NOT NULL,
`BenchmarkDateTime` DATETIME NOT NULL,
`Identifier` CHAR(6) NOT NULL,
`Currency1` CHAR(3) NULL DEFAULT NULL,
`Currency2` CHAR(3) NULL DEFAULT NULL,
`AvgBMBid` DECIMAL(18,9) NOT NULL,
`AvgBMOffer` DECIMAL(18,9) NOT NULL,
`AvgBMMid` DECIMAL(18,9) NOT NULL,
`MedianBMBid` DECIMAL(18,9) NOT NULL,
`MedianBMOffer` DECIMAL(18,9) NOT NULL,
`OpenBMBid` DECIMAL(18,9) NOT NULL,
`ClosingBMBid` DECIMAL(18,9) NOT NULL,
`ClosingBMOffer` DECIMAL(18,9) NOT NULL,
`ClosingBMMid` DECIMAL(18,9) NOT NULL,
`LowBMBid` DECIMAL(18,9) NOT NULL,
`HighBMOffer` DECIMAL(18,9) NOT NULL,
`BMRange` DECIMAL(18,9) NOT NULL,
`BenchmarkId` INT(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`BenchmarkId`),
INDEX `NextBenchmarkIndex01` (`Currency1`, `Currency2`, `BenchmarkType`),
INDEX `NextBenchmarkIndex02` (`BenchmarkDateTime`, `Currency1`, `Currency2`, `BenchmarkType`, `BenchmarkId`),
INDEX `BenchmarkOptimization` (`BenchmarkType`, `BenchmarkDateTime`, `Currency1`, `Currency2`)
)
(I'm creating the MyCurrencyPairs table in my routine)
CREATE TEMPORARY TABLE MyCurrencyPairs
(
Pair1 VARCHAR(50),
Pair2 VARCHAR(50)
) ENGINE=memory;
CREATE INDEX IDX_MyCurrencyPairs ON MyCurrencyPairs (Pair1, Pair2);

BenchmarkDateTime should be the first column in your index.
The rule is: if you use only part of a composite index, the part you use should be the leading part.
Secondly, the GROUP BY should match an index.
Your performance would be better if you could somehow make your query use "=" instead of ">", which turns the lookup into a range scan.
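As a rough illustration of the leading-part rule, the two EXPLAIN statements below can be compared against the NextBenchmarkIndex02 index from the question's DDL, which starts with BenchmarkDateTime. The literal values are placeholders, and the plans you actually get depend on your data and MySQL version, so treat this as a sketch rather than a guaranteed outcome.

```sql
-- NextBenchmarkIndex02 is
-- (BenchmarkDateTime, Currency1, Currency2, BenchmarkType, BenchmarkId).

-- Filter on the leading column: the optimizer can do a range scan on the index.
EXPLAIN
SELECT BenchmarkId
FROM benchmark FORCE INDEX (NextBenchmarkIndex02)
WHERE BenchmarkDateTime > '2019-01-01 00:00:00';

-- Filter only on a non-leading column: no seek is possible on this index,
-- so expect a scan of the whole index in most versions
-- (MySQL 8.0.13+ may use a skip scan instead).
EXPLAIN
SELECT BenchmarkId
FROM benchmark FORCE INDEX (NextBenchmarkIndex02)
WHERE Currency2 = 'USD';
```

Comparing the type and rows columns of the two plans shows how much of the index each query has to touch.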

The main problem is that MySQL can't directly use the index to handle the aggregation. This is due to the join with MyCurrencyPairs and the fact that you're asking for MIN(BenchmarkId) while also having the range condition on BenchmarkDateTime. These two need to be eliminated to get a better execution plan.
Let's have a look at the required indexes and the resulting query first:
ALTER TABLE benchmark
ADD KEY `IDX1` (
`Currency1`,
`Currency2`,
`BenchmarkType`,
`BenchmarkDateTime`
),
ADD KEY `IDX2` (
`Currency1`,
`Currency2`,
`BenchmarkType`,
`BenchmarkId`,
`BenchmarkDateTime`
);
SELECT
(
SELECT
BenchmarkId
FROM
benchmark FORCE KEY (IDX2)
WHERE
Currency1 = ob.Currency1 AND
Currency2 = ob.Currency2 AND
BenchmarkType = ob.BenchmarkType
AND BenchmarkDateTime > IN_BeginningTime
ORDER BY
Currency1, Currency2, BenchmarkType, BenchmarkId
LIMIT 1
) AS BenchmarkId,
ob.*
FROM
(
SELECT
MIN(BenchmarkDateTime),
Currency1,
Currency2,
BenchmarkType
FROM
benchmark
WHERE
BenchmarkDateTime > IN_BeginningTime
GROUP BY
Currency1, Currency2, BenchmarkType
) AS ob
INNER JOIN
MyCurrencyPairs ON Currency1 = Pair1 AND Currency2 = Pair2;
The first change is that the GROUP BY part happens in its own subquery. This means that it generates all combinations of Currency1, Currency2, BenchmarkType, even those that don't appear in MyCurrencyPairs, but unless there are lots of combinations, the fact that MySQL can now use an index to perform the operation should make this faster. This subquery uses IDX1 without requiring a temporary table or a filesort.
The second change is the isolation of the MIN(BenchmarkId) part into its own subquery. The sorting in that subquery can be handled using IDX2, so no sorting is required here either. Both the FORCE KEY (IDX2) hint and the appearance of the "fixed-value" columns Currency1, Currency2 and BenchmarkType in the ORDER BY clause are required to make the MySQL optimizer do the right thing. Again, this is a trade-off: if the final result set is large, the subqueries might turn out to be a loss, but I presume that there aren't that many rows.
Explaining that query gives the following query plan (uninteresting columns dropped for readability):
+----+--------------------+-----------------+-------+---------+------+---------------------------------------+
| id | select_type | table | type | key_len | rows | Extra |
+----+--------------------+-----------------+-------+---------+------+---------------------------------------+
| 1 | PRIMARY | <derived3> | ALL | NULL | 1809 | |
| 1 | PRIMARY | MyCurrencyPairs | ref | 106 | 2 | Using where |
| 3 | DERIVED | benchmark | range | 17 | 1225 | Using where; Using index for group-by |
| 2 | DEPENDENT SUBQUERY | benchmark | ref | 9 | 520 | Using where; Using index |
+----+--------------------+-----------------+-------+---------+------+---------------------------------------+
We see that all the interesting parts are properly covered by indexes, and we require neither temporary tables nor filesorts.
Timings on my test data show this version to be about 20 times as fast (1.07s vs. 0.05s), but I have only about 1.2 million rows in my benchmark table and the data distribution is likely way off, so YMMV.

Related

Optimize selecting all rows from a table based on results from the same table?

I'll be the first to admit that I'm not great at SQL (and I probably shouldn't be treating it like a rolling log file), but I was wondering if I could get some pointers for improving some slow queries...
I have a large mysql table with 2M rows where I do two full table lookups based on a subset of the most recent data. When I load the page that contains these queries, I often find they take several seconds to complete, but the queries inside are quite quick.
PMA's (supposedly terrible) advisor pretty much throws the entire kitchen sink at me, temporary tables, too many sorts, joins without indexes (I don't even have any joins?), reading from fixed position, reading next position, temporary tables written to disk... that last one especially makes me wonder if it's a configuration issue, but I played with all the knobs, and even paid for a managed service which didn't seem to help.
CREATE TABLE `archive` (
`id` bigint UNSIGNED NOT NULL,
`ip` varchar(15) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`service` enum('ssh','telnet','ftp','pop3','imap','rdp','vnc','sql','http','smb','smtp','dns','sip','ldap') CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`hostid` bigint UNSIGNED NOT NULL,
`date` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
ALTER TABLE `archive`
ADD PRIMARY KEY (`id`),
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-date` (`service`,`date`),
ADD KEY `service-ip` (`service`,`ip`);
Adding indexes definitely helped (even though they're 4x the size of the actual data), but I'm kind of at a loss as to where I can optimize further. Initially I thought about caching the subquery results in PHP and using them twice for the main queries, but I don't think I have access to the result once I close the subquery. I looked into doing joins, but they look like they're meant for two or more separate tables, and my subquery is from the same table, so I'm not sure that would even work. The queries are supposed to find the most active ip/services based on whether I have data from an ip in the past 24 hours...
SELECT service, COUNT(service) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY service HAVING total > 1
ORDER BY total DESC, service ASC LIMIT 10
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | service,ip,date-service,ip-date,ip-service,service-date,service-ip | ip-service | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44246 | 100.00 | Using where; Using index |
+----+--------------+-----------------+------------+-------+----------------------------------------------------------------------------+------------+---------+------------------------+-------+----------+---------------------------------+
SELECT ip, COUNT(ip) AS total FROM `archive`
WHERE ip IN
(SELECT DISTINCT ip FROM `archive` WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR))
GROUP BY ip HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | archive | NULL | ref | ip,date-ip,ip-date,ip-service,service-ip | ip-date | 47 | <subquery2>.ip | 5 | 100.00 | Using index |
| 2 | MATERIALIZED | archive | NULL | range | date,ip,date-ip,date-service,ip-date,ip-service | date-ip | 5 | NULL | 44168 | 100.00 | Using where; Using index |
+----+--------------+-----------------+------------+-------+---------------------------------------------------------------+---------+---------+------------------------+-------+----------+---------------------------------+
common subquery: 0.0351s
whole query 1: 1.4270s
whole query 2: 1.5601s
total page load: 3.050s (7 queries total)
Am I just doomed to terrible performance with this table?
Hopefully there's enough information here to get an idea of what's going, but if anyone can help I would certainly appreciate it. I don't mind throwing more hardware at the issue, but when an 8c/16t server with 16gb can't handle 150mb of data I'm not sure what will. Thanks in advance for reading my long winded question.
You have the right indexes (as well as many other indexes) and your query both meets your specs and runs close to optimally. It's unlikely that you can make this much faster: it needs to look all the way back to the beginning of your table.
If you can change your spec so you only have to look back a limited amount of time like a year you'll get a good speedup.
Some possible minor tweaks.
Use the latin1_bin collation for your ip column. It uses 8-bit characters and compares them by byte value, without any case folding. That's plenty for IPv4 dotted-quad addresses (and IPv6 addresses). You'll get rid of a bit of overhead in matching and grouping. Or, even better,
If you know you will have nothing but IPv4 addresses, rework your ip column to store their binary representations ( that is, the INET_ATON() - generated value of each IPv4). You can fit those in the UNSIGNED INT 32-bit integer data type, making the lookup, grouping, and ordering even faster.
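A minimal sketch of the integer representation, assuming the table holds only IPv4 addresses. The column name ip_num and the index name are hypothetical; nothing else about the table changes.

```sql
-- ip_num and `ip_num-service` are hypothetical names.
ALTER TABLE archive
    ADD COLUMN ip_num INT UNSIGNED NULL,
    ADD KEY `ip_num-service` (ip_num, service);

-- Backfill from the existing dotted-quad strings.
UPDATE archive SET ip_num = INET_ATON(ip);

-- Group and sort on the integer; convert back to text only for display.
SELECT INET_NTOA(ip_num) AS ip, COUNT(*) AS total
FROM archive
GROUP BY ip_num
ORDER BY total DESC, ip_num ASC
LIMIT 10;
```

Sorting on ip_num gives the same order as INET_ATON(ip) without calling the function for every row.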
It's possible to rework the way you gather these data. For example, you could arrange to gather at most one row per service per day. That will reduce the timeseries resolution of your data, but it will also make your queries much faster. Define your table something like this:
CREATE TABLE archive2 (
ip VARCHAR(15) COLLATE latin1_bin NOT NULL,
service ENUM ('ssh','telnet','ftp',
'pop3','imap','rdp',
'vnc','sql','http','smb',
'smtp','dns','sip','ldap') NOT NULL,
`date` DATE NOT NULL,
`count` INT NOT NULL,
hostid bigint UNSIGNED NOT NULL,
PRIMARY KEY (`date`, ip, service)
) ENGINE=InnoDB;
Then, when you insert a row, use this query:
INSERT INTO archive2 (`date`, ip, service, `count`, hostid)
VALUES (CURDATE(), ?ip, ?service, 1, ?hostid)
ON DUPLICATE KEY UPDATE
`count` = `count` + 1;
This will automatically increment your count column if the row for the ip, service, and date already exists.
Then your second query will look like:
SELECT ip, SUM(`count`) AS total
FROM archive2
WHERE ip IN (
SELECT ip FROM archive2
WHERE `date` > CURDATE() - INTERVAL 1 DAY
)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC LIMIT 10;
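The primary key, which starts with `date`, will satisfy the date-range subquery; the outer lookup by ip may additionally benefit from an index that starts with ip.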
First query
(I'm not convinced that it can be made much faster.)
(currently)
SELECT service, COUNT(service) AS total
FROM `archive`
WHERE ip IN (
SELECT DISTINCT ip
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
Notes:
COUNT(service) --> COUNT(*)
DISTINCT is not needed in IN (SELECT DISTINCT ...)
IN ( SELECT ... ) is often slow; rewrite using EXISTS ( SELECT 1 ... ) or JOIN (see below)
INDEX(date, IP) -- for subquery
INDEX(service, IP) -- for your outer query
INDEX(IP, service) -- for my outer query
Toss redundant indexes; they can get in the way. (See below)
It will have to gather all the possible results before getting to the ORDER BY and LIMIT. (That is, LIMIT has very little impact on performance for this query.)
CHARACTER SET utf8 COLLATE utf8_unicode_ci is gross overkill for IP addresses; switch to CHARACTER SET ascii COLLATE ascii_bin.
If you are running MySQL 8.0 (Or MariaDB 10.2), a WITH to calculate the subquery once, together with a UNION to compute the two outer queries, may provide some extra speed.
MariaDB has a "subquery cache" that might have the effect of skipping the second subquery evaluation.
By using DATETIME instead of TIMESTAMP, you will have two minor hiccups per year when daylight saving time kicks in/out.
I doubt if hostid needs to be a BIGINT (8-bytes).
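The WITH-plus-UNION idea from the notes above might be sketched as follows. This assumes MySQL 8.0+ or MariaDB 10.2+; the CTE name recent, the derived-table aliases, and the kind/name output columns are all made up for illustration. Whether the server really evaluates the CTE only once is up to the optimizer, so check the EXPLAIN output.

```sql
WITH recent AS (
    -- IPs seen in the last 24 hours, written once and referenced twice.
    SELECT DISTINCT ip
    FROM archive
    WHERE `date` > NOW() - INTERVAL 24 HOUR
)
SELECT kind, name, total
FROM (
    -- Top services among the recent IPs.
    SELECT 'service' AS kind, a.service AS name, COUNT(*) AS total
    FROM recent AS r
    JOIN archive AS a ON a.ip = r.ip
    GROUP BY a.service
    HAVING total > 1
    ORDER BY total DESC, a.service ASC
    LIMIT 10
) AS top_services
UNION ALL
SELECT kind, name, total
FROM (
    -- Top IPs among the recent IPs.
    SELECT 'ip' AS kind, a.ip AS name, COUNT(*) AS total
    FROM recent AS r
    JOIN archive AS a ON a.ip = r.ip
    GROUP BY a.ip
    HAVING total > 1
    ORDER BY total DESC, INET_ATON(a.ip) ASC
    LIMIT 10
) AS top_ips
ORDER BY kind, total DESC;
```

The final ORDER BY only groups the two result sets together; the tie-breaking order inside each top-10 list is decided by the inner ORDER BY clauses and is not preserved for equal totals.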
To switch to a JOIN, think of fetching the candidate rows first:
SELECT service, COUNT(*) AS total
FROM ( SELECT DISTINCT IP
FROM archive
WHERE `date` > NOW() - INTERVAL 24 HOUR
) AS x
JOIN archive USING(IP)
GROUP BY service
HAVING total > 1
ORDER BY total DESC, service ASC
LIMIT 10
For any further discussion of any slow (but working) query, please provide both flavors of EXPLAIN:
EXPLAIN SELECT ...
EXPLAIN FORMAT=JSON SELECT ...
Drop these indexes:
ADD KEY `service` (`service`),
ADD KEY `date` (`date`),
ADD KEY `ip` (`ip`),
Recommend only
ADD PRIMARY KEY (`id`),
-- as discussed:
ADD KEY `date-ip` (`date`,`ip`),
ADD KEY `ip-service` (`ip`,`service`),
ADD KEY `service-ip` (`service`,`ip`),
-- maybe other queries need these:
ADD KEY `date-service` (`date`,`service`),
ADD KEY `ip-date` (`ip`,`date`),
ADD KEY `service-date` (`service`,`date`),
The general rule here is that you don't need INDEX(a) when you also have INDEX(a,b). In particular, they may be preventing the use of better indexes; see the EXPLAINs.
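As a sketch, dropping the three single-column indexes listed above could be done in one statement. The index names are the ones from the question's DDL; on a large table the ALTER may take a while, so try it on a copy first.

```sql
-- Each of these is the leading column of a composite index that remains.
ALTER TABLE archive
    DROP INDEX `service`,
    DROP INDEX `date`,
    DROP INDEX `ip`;

-- Confirm which indexes are left.
SHOW INDEX FROM archive;
```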
Second query
Rewrite it
SELECT ip, COUNT(*) AS total
FROM `archive`
WHERE date > DATE_SUB(CURRENT_TIMESTAMP, INTERVAL 24 HOUR)
GROUP BY ip
HAVING total > 1
ORDER BY total DESC, INET_ATON(ip) ASC
LIMIT 10
It will use only INDEX(date, ip). Note that, unlike the original query, this counts only the rows from the last 24 hours.

MySQL database slowing down

I need some help figuring out a performance issue. A database containing a single table with a growing number of METARs (aviation weather reports) is slowing down after about 8 million records are present. This despite indexes being in use. Performance can be recovered by rebuilding indexes, but that's really slow and takes the database offline, so I've resorted to just dropping the table and recreating it (losing the last few weeks of data).
The behaviour is the same whether a query is run trying to retrieve an actual metar, or whether a simple select count(*) is executed.
The table creation syntax is as follows:
CREATE TABLE `metars` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`tstamp` timestamp NULL DEFAULT NULL,
`metar` varchar(255) DEFAULT NULL,
`icao` char(7) DEFAULT NULL,
`qnh` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `timestamp` (`tstamp`),
KEY `icao` (`icao`),
KEY `qnh` (`qnh`),
KEY `metar` (`metar`)
) ENGINE=InnoDB AUTO_INCREMENT=812803050 DEFAULT CHARSET=latin1;
Up to about 8 million records, a select count(*) returns in about 500ms. Then it gradually increases; currently, again at 14 million records, the count takes between 3 and 30 seconds. I was surprised to see that when explaining the count query, it's using the timestamp index, not the primary key. Using the primary key, this should be a matter of just a few ms to return the number of records:
mysql> explain select count(*) from metars;
+----+-------------+--------+-------+---------------+-----------+---------+------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+-----------+---------+------+----------+-------------+
| 1 | SIMPLE | metars | index | NULL | timestamp | 5 | NULL | 14693048 | Using index |
+----+-------------+--------+-------+---------------+-----------+---------+------+----------+-------------+
1 row in set (0.00 sec)
Forcing it to use the primary index is even slower:
mysql> select count(*) from metars use index(PRIMARY);
+----------+
| count(*) |
+----------+
| 14572329 |
+----------+
1 row in set (37.87 sec)
Oddly, the typical use case query is to get the weather for an airport nearest to a specific point in time which continues to perform very well, despite being more complex than a simple count:
mysql> SELECT qnh, metar from metars WHERE icao like 'KLAX' ORDER BY ABS(TIMEDIFF(tstamp, STR_TO_DATE('2019-10-10 00:00:00', '%Y-%m-%d %H:%i:%s'))) LIMIT 0,1;
+------+-----------------------------------------------------------------------------------------+
| qnh | metar |
+------+-----------------------------------------------------------------------------------------+
| 2980 | KLAX 092353Z 25012KT 10SM FEW015 20/14 A2980 RMK AO2 SLP091 T02000139 10228 20200 56007 |
+------+-----------------------------------------------------------------------------------------+
1 row in set (0.01 sec)
What am I doing wrong here?
InnoDB performs a plain COUNT(*) by traversing some index. It prefers the smallest index because that will require touching the least number of blocks.
The PRIMARY KEY is clustered with the data, so that index is actually the biggest.
What version are you using? TIMESTAMP changed at some point. Perhaps that explains why tstamp is used instead of qnh.
If you are purging old data by using DELETE, see http://mysql.rjweb.org/doc.php/partitionmaint for a faster way.
I assume the data is static; that is it is never UPDATEd? Consider building and maintaining a summary table, perhaps indexed by date. This could have various counts for each day. Then a fetch from that table would be much faster than hitting the raw data. More: http://mysql.rjweb.org/doc.php/summarytables
How many rows for KLAX? That query must fetch all of them in order to convert the timestamp before doing the LIMIT. If you had INDEX(icao, tstamp), you could find the next before or after a given time even faster.
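A sketch of the INDEX(icao, tstamp) idea: fetch the closest report at or before the target time and the closest one after it, then keep whichever is nearer. The index name icao_tstamp and the seconds_off alias are made up; the target time and airport are the ones from the question.

```sql
ALTER TABLE metars ADD INDEX icao_tstamp (icao, tstamp);

-- Each branch can read a single row from the end of an index range.
(SELECT qnh, metar,
        ABS(TIMESTAMPDIFF(SECOND, tstamp, '2019-10-10 00:00:00')) AS seconds_off
   FROM metars
  WHERE icao = 'KLAX' AND tstamp <= '2019-10-10 00:00:00'
  ORDER BY tstamp DESC
  LIMIT 1)
UNION ALL
(SELECT qnh, metar,
        ABS(TIMESTAMPDIFF(SECOND, tstamp, '2019-10-10 00:00:00')) AS seconds_off
   FROM metars
  WHERE icao = 'KLAX' AND tstamp > '2019-10-10 00:00:00'
  ORDER BY tstamp ASC
  LIMIT 1)
-- Of the (at most) two candidates, keep the nearer one.
ORDER BY seconds_off
LIMIT 1;
```

With the composite index in place, the single-column icao index becomes redundant for this query.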

MySQL shows "possible_keys" but does not use it

I have a table with more than a million entries and around 42 columns. I am trying to run SELECT query on this table which takes a minute to execute. In order to reduce the query execution time I added an index on the table, but the index is not being used.
The table structure is as follows. Though the table has 42 columns I am only showing here those that are relevant to my query
CREATE TABLE `tas_usage` (
`uid` int(11) NOT NULL AUTO_INCREMENT,
`userid` varchar(255) DEFAULT NULL,
`companyid` varchar(255) DEFAULT NULL,
`SERVICE` varchar(2000) DEFAULT NULL,
`runstatus` varchar(255) DEFAULT NULL,
`STATUS` varchar(2000) DEFAULT NULL,
`servertime` datetime DEFAULT NULL,
`machineId` varchar(2000) DEFAULT NULL,
PRIMARY KEY (`uid`)
) ENGINE=InnoDB AUTO_INCREMENT=2992891 DEFAULT CHARSET=latin1
The index that I have added is as follows
ALTER TABLE TAS_USAGE ADD INDEX last_quarter (SERVERTIME,COMPANYID(20),MACHINEID(20),SERVICE(50),RUNSTATUS(10));
My SELECT Query
EXPLAIN SELECT DISTINCT t1.COMPANYID, t1.USERID, t1.MACHINEID FROM TAS_USAGE t1
LEFT JOIN TAS_INVALID_COMPANY INVL ON INVL.COMPANYID = t1.COMPANYID
LEFT JOIN TAS_INVALID_MACHINE INVL_MAC_ID ON INVL_MAC_ID.MACHINEID = t1.MACHINEID
WHERE t1.SERVERTIME >= '2018-10-01 00:00:00' AND t1.SERVERTIME <= '2018-12-31 00:00:00' AND
INVL.companyId IS NULL AND INVL_MAC_ID.machineId IS NULL AND
t1.SERVICE NOT IN ('credentialtest%', 'webupdate%') AND
t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '');
EXPLAIN result is as follows
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
| 1 | SIMPLE | t1 | NULL | ALL | last_quarter | NULL | NULL | NULL | 1765296 | 15.68 | Using where; Using temporary |
| 1 | SIMPLE | INVL | NULL | ref | invalid_company_index | invalid_company_index | 502 | servicerunprod.t1.companyid | 1 | 100.00 | Using where; Not exists; Using index; Distinct |
| 1 | SIMPLE | INVL_MAC_ID | NULL | eq_ref | machineId | machineId | 502 | servicerunprod.t1.machineId | 1 | 100.00 | Using where; Not exists; Using index; Distinct |
+----+-------------+-------------+------------+--------+-----------------------+-----------------------+---------+-----------------------------+---------+----------+------------------------------------------------+
Explanation of my Query
I want to select all the records from table TAS_USAGE
which are between date range(including) 1st October 2018 and 31st Dec 2018 AND
which do not have columns COMPANYID and MACHINEID matching in tables TAS_INVALID_COMPANY and TAS_INVALID_MACHINE AND
which do not contain values ('credentialtest%', 'webupdate%') in SERVICE column and values ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '') in RUNSTATUS column
WHERE t1.SERVERTIME >= '2018-10-01 00:00:00'
AND t1.SERVERTIME <= '2018-12-31 00:00:00'
is strange. It covers 3 months minus 1 day plus 1 second. Suggest you rephrase thus:
WHERE t1.SERVERTIME >= '2018-10-01'
AND t1.SERVERTIME < '2018-10-01' + INTERVAL 3 MONTH
There are multiple possible reasons why the INDEX(servertime, ...) was not used and/or was not "useful" even if used:
If more than perhaps 20% of the table involved that daterange, using the index is likely to be less efficient than simply scanning the table. Using the index would involve bouncing between the index's BTree and the data's BTree.
Starting an index with a 'range' means that the rest of the index will not be used.
Index "prefixing" (foo(10)) is next to useless.
What you can do:
Normalize most of those string columns. How many "machines" do you have? Probably nowhere near 3 million. Replacing repeated strings with a small id (perhaps a 2-byte SMALLINT UNSIGNED with a max of 65K) will save a lot of space in this table. This, in turn, will speed up the query and eliminate the desire for index prefixing.
If normalizing is not practical because there really are upwards of 3 million distinct values, then see if you can shorten the VARCHARs. If you get them under 255, prefixing is no longer needed.
NOT IN is not optimizable. If you can invert the test and make it IN(...), more possibilities open up, such as INDEX(service, runstatus, servertime). If you have a new enough version of MySQL, I think the optimizer will hop around in the index on the two IN columns and use the index for the time range.
NOT IN ('credentialtest%', 'webupdate%') -- Is % part of the string? If you are using % as a wildcard, that construct will not work. You would need two LIKE clauses.
Reformulate the query thus:
SELECT t1.COMPANYID, t1.USERID, t1.MACHINEID
FROM TAS_USAGE t1
WHERE t1.SERVERTIME >= '2018-10-01'
AND t1.SERVERTIME < '2018-10-01' + INTERVAL 3 MONTH
AND t1.SERVICE NOT IN ('credentialtest%', 'webupdate%')
AND t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed',
'Failed Success', 'Success Failed', '')
AND NOT EXISTS( SELECT 1 FROM TAS_INVALID_COMPANY WHERE companyId = t1.COMPANYID )
AND NOT EXISTS( SELECT 1 FROM TAS_INVALID_MACHINE WHERE MACHINEID = t1.MACHINEID );
If the trio t1.COMPANYID, t1.USERID, t1.MACHINEID is unique, then get rid of DISTINCT.
Since there are only 6 (of 42) columns being used in this query, building a "covering" index will probably help:
INDEX(SERVERTIME, SERVICE, RUNSTATUS, COMPANYID, USERID, MACHINEID)
This is because the query can be performed entirely within the index. In this case, I deliberately put the range first.
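On the % question raised above: if the intent is a wildcard match, a sketch with two NOT LIKE clauses might look like the following. The first statement only demonstrates the difference on a made-up value ('webupdate-1'); the second applies the pattern to TAS_USAGE with the half-open date range.

```sql
-- NOT IN compares whole strings, so '%' is an ordinary character there.
-- 'webupdate-1' is a made-up sample value.
SELECT 'webupdate-1' NOT IN ('webupdate%')  AS not_in_result,    -- 1: not an exact match
       'webupdate-1' NOT LIKE 'webupdate%'  AS not_like_result;  -- 0: the pattern matches

-- Wildcard version of the SERVICE filter.
SELECT COUNT(*)
FROM TAS_USAGE t1
WHERE t1.SERVERTIME >= '2018-10-01'
  AND t1.SERVERTIME <  '2018-10-01' + INTERVAL 3 MONTH
  AND t1.SERVICE NOT LIKE 'credentialtest%'
  AND t1.SERVICE NOT LIKE 'webupdate%';
```

Rows where SERVICE is NULL are excluded by both forms, since the comparison evaluates to NULL rather than true.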
Focussing on the date range, MySQL basically has two options :
read the complete table consecutively and throw away records that do not fit the date range
use the index to identify the records in the date range and then look up each record in the table (using the primary key) individually ("random access")
Consecutive reads are significantly faster than random access, but you need to read more data. There will be some break-even point at which using an index becomes slower than simply reading everything, and MySQL assumes this is the case here. Whether that's the right choice will largely depend on how accurately it guessed how many records are actually in the range. If you make the range smaller, it should actually use the index at some point.
If you know that (or want to test if) using the index is faster, you can force MySQL to use it with
... FROM TAS_USAGE t1 force index (last_quarter) LEFT JOIN ...
You should test it with different ranges, and if you generate your query dynamically, only force the index when you are decently certain (as MySQL will not correct you if you e.g. specify a range that would include all rows).
There is one important way around the slow random access to the table, although it unfortunately does not work with your prefixed index, but I mention it in case you can reduce your field sizes (or change them to lookups/enums). You can include every column that MySQL needs to evaluate the query by using a covering index:
An index that includes all the columns retrieved by a query. Instead of using the index values as pointers to find the full table rows, the query returns values from the index structure, saving disk I/O.
As mentioned, since in a prefixed index, part of the data is missing, those columns unfortunately cannot be used to cover though.
Actually, they also cannot be used for much at all, especially not to filter records before doing the random access, as the complete value is required anyway to evaluate your where-condition for RUNSTATUS or SERVICE. So you could check whether e.g. RUNSTATUS is very significant (maybe 99% of your records are in status 'Failed') and, in that case, add an unprefixed index on just (SERVERTIME, RUNSTATUS); MySQL might even pick that index on its own then.
The distinct clause is the one that interferes with the index usage. Since the index cannot be used to help with the distinct, mysql decided against the use of index completely.
If you rearrange the order of fields in the select list, in the index, and in the where clause, mysql may decide to use it:
ALTER TABLE TAS_USAGE ADD INDEX last_quarter (COMPANYID(20),MACHINEID(20), SERVERTIME, SERVICE(50),RUNSTATUS(10));
SELECT DISTINCT t1.COMPANYID, t1.MACHINEID, t1.USERID FROM TAS_USAGE t1
LEFT JOIN TAS_INVALID_COMPANY INVL ON INVL.COMPANYID = t1.COMPANYID
LEFT JOIN TAS_INVALID_MACHINE INVL_MAC_ID ON INVL_MAC_ID.MACHINEID = t1.MACHINEID
WHERE
INVL.companyId IS NULL AND INVL_MAC_ID.machineId IS NULL AND
t1.SERVERTIME >= '2018-10-01 00:00:00' AND t1.SERVERTIME <= '2018-12-31 00:00:00' AND
t1.SERVICE NOT IN ('credentialtest%', 'webupdate%') AND
t1.RUNSTATUS NOT IN ('Failed', 'Failed Failed', 'Failed Success', 'Success Failed', '');
This way the COMPANYID and MACHINEID fields become the leftmost fields in the distinct, the where clause, and the index, although the prefix may still cause the index to be discarded. You may want to consider reducing your varchar(255) fields.

MySQL query with less than and ORDER BY DESC

I'm doing some work with a rather large set of data and am trying to create a query from every combination of four different pieces of data. All of those pieces combined form a staggering 122,000,000 rows. Then, I'm trying to find a weight that is less than a certain amount and sort by another value from highest to lowest.
I can use weight < x no problem.
I can use weight < x order by height ASC no problem.
I can even use weight < x order by height DESC when x is around both the upper and lower end. But once it starts creeping into the middle, it very quickly rises from seconds, to minutes, to "I'm not going to wait that long."
Any thoughts? (The names have been changed, but the types have not)
The Create:
CREATE TABLE combinations (
id bigint(20) unsigned NOT NULL auto_increment,
up smallint(2) NOT NULL,
`left` smallint(2) NOT NULL,
`right` smallint(2) NOT NULL,
down smallint(2) NOT NULL,
weight decimal(5,1) NOT NULL,
width smallint(3) NOT NULL,
forward decimal(6,2) NOT NULL,
backwards decimal(5,2) NOT NULL,
`in` decimal(7,2) NOT NULL,
`out` smallint(3) NOT NULL,
height smallint(3) NOT NULL,
diameter decimal(7,2) NOT NULL,
PRIMARY KEY (id)
);
The Index
ALTER TABLE combinations ADD INDEX weight_and_height(weight,height);
The Query
SELECT * FROM combinations WHERE weight < 20 ORDER BY height DESC limit 0,5;
The Explain
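+----+-------------+--------------+-------+-------------------+-------------------+---------+------+------+-------------+
| id | select_type | table        | type  | possible_keys     | key               | key_len | ref  | rows | Extra       |
+----+-------------+--------------+-------+-------------------+-------------------+---------+------+------+-------------+
|  1 | SIMPLE      | combinations | index | weight_and_height | weight_and_height | 5       | NULL |   10 | Using where |
+----+-------------+--------------+-------+-------------------+-------------------+---------+------+------+-------------+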
Your index is used only for filtering on weight. Here are the steps:
1. All the rows with weight < x (WHERE) are found (using any index starting with weight).
2. That set is sorted (ORDER BY height ...).
3. 0 (OFFSET) rows are skipped.
4. 5 (LIMIT) rows are delivered.
The potentially costly part is step 1. Probably in your example "20" was very early in the list. In fact the EXPLAIN estimated that the set had only 10 rows. For bigger values of x, step 1 takes longer. That is unavoidable.
All the rows from step 1 are processed; hence, the time for step 2 also varies. (MySQL 5.6 has an extra optimization that partially combines steps 2, 3 and 4.)
Are you really doing SELECT *? If, for example, you just wanted SELECT id, then INDEX(weight, height, id) would run a lot faster because the query can be completely performed in the index.
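As a rough sketch of that covering index (the index name weight_height_id is made up for illustration, and the query assumes only id is needed):

```sql
-- Hypothetical index name; the columns come from the question's table.
ALTER TABLE combinations ADD INDEX weight_height_id (weight, height, id);

-- If the index covers the query, the Extra column of EXPLAIN
-- should include "Using index".
EXPLAIN
SELECT id
FROM combinations
WHERE weight < 20
ORDER BY height DESC
LIMIT 0,5;
```

"Using index" only says the rows were read from the index alone. A "Using filesort" may still show up, because the range condition on weight keeps the index from returning rows in height order.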
If you really did need the query you mentioned, then this will run somewhat faster:
SELECT c.*
FROM (
    SELECT id
    FROM combinations
    WHERE weight < 20
    ORDER BY height DESC
    LIMIT 0,5
) ids
JOIN combinations AS c USING(id)
ORDER BY height DESC;
Notes:
The subquery is "Using index" as already mentioned.
Only 5 rows are delivered by the subquery.
The outer SELECT has a mere 5 rows to deal with.
id is indexed (because it is the PRIMARY KEY), so the JOIN is efficient.
(Re: the title) "less than" and "DESC" are not significant.

Optimizing Datetime fields where indexes aren't being used as expected

I have a large, fast-growing log table in an application running with MySQL 5.0.77. I'm trying to find the best way to optimize queries that count instances within the last X days according to message type:
CREATE TABLE `counters` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`kind` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_counters_on_kind` (`kind`),
KEY `index_counters_on_created_at` (`created_at`)
) ENGINE=InnoDB AUTO_INCREMENT=302 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
For this test set, there are 668521 rows in the table. The query I'm trying to optimize is:
SELECT kind, COUNT(id) FROM counters WHERE created_at >= ? GROUP BY kind;
Right now, that query takes between 3-5 seconds, and is being estimated as follows:
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | index_counters_on_created_at_idx | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+----------------------------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
With the created_at index removed, it looks like this:
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
| 1 | SIMPLE | counters | index | NULL | index_counters_on_kind | 258 | NULL | 1185531 | Using where |
+----+-------------+----------+-------+---------------+------------------------+---------+------+---------+-------------+
1 row in set (0.00 sec)
(Yes, for some reason the row estimate is larger than the number of rows in the table.)
So, apparently, there's no point to that index.
Is there really no better way to do this? I tried the column as a timestamp, and it just ended up slower.
Edit: I discovered that changing the query to use an interval instead of a specific date ends up using the index, cutting down the row estimate to about 20% of the query above:
SELECT kind, COUNT(id) FROM counters WHERE created_at >=
(NOW() - INTERVAL 7 DAY) GROUP BY kind;
I'm not entirely sure why that happens, but I'm fairly confident that if I understood it then the problem in general would make a lot more sense.
Why not use a concatenated index?
CREATE INDEX idx_counters_created_kind ON counters(created_at, kind);
That should allow an Index-Only Scan, which EXPLAIN reports as "Using index" in the Extra column. Since id is NOT NULL, COUNT(id) counts every matching row and needs nothing from outside the index.
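One way to check this, as a sketch. It assumes the index above has already been created, and the date literal is only a placeholder value:

```sql
-- Assumes idx_counters_created_kind (created_at, kind) exists.
-- The date below is a placeholder, not a value from the question.
EXPLAIN
SELECT kind, COUNT(id)
FROM counters
WHERE created_at >= '2010-01-01 00:00:00'
GROUP BY kind;
```

If the optimizer picks the new index, the key column should show idx_counters_created_kind and Extra should include "Using index". "Using temporary; Using filesort" can still appear there, since the rows come back in created_at order rather than grouped by kind.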
References:
Concatenated index vs. merging multiple indexes
Index-Only Scan
After reading the latest edit on the question, the problem seems to be that the parameter used in the WHERE clause was being interpreted by MySQL as a string rather than as a datetime value. That would explain why the optimizer did not select the index_counters_on_created_at index and instead did a scan, converting the created_at values to their string representation before comparing. I think this can be prevented by an explicit cast to datetime in the WHERE clause:
where `created_at` >= convert({specific_date}, datetime)
My original comments still apply for the optimization part.
The real performance killer here is the kind column. For the GROUP BY, the database engine first needs to determine all the distinct values in the kind column, which results in a table or index scan. That's why the estimated number of rows is bigger than the total number of rows in the table: in one pass it determines the distinct values in the kind column, and in a second pass it determines which rows meet the created_at >= ? condition.
To make matters worse, the kind column is a varchar(255), which is too big to be efficient. On top of that, it uses the utf8 character set and the utf8_unicode_ci collation, which increase the complexity of the comparisons needed to determine the unique values in that column.
This will perform a lot better if you change the type of the kind column to int, because integer comparisons are simpler and more efficient than Unicode character comparisons. It would also help to have a catalog table for the kinds of messages, in which you store the kind_id and a description. Then do the grouping on a join of the kind catalog table and a subquery of the log table that first filters by date:
select k.kind_id, count(*)
from kind_catalog k
inner join (
    select kind_id
    from counters
    where created_at >= ?
) c on k.kind_id = c.kind_id
group by k.kind_id
This will first filter the counters table by created_at >= ? and can benefit from the index on that column. Then it will join the result to the kind_catalog table, and if the SQL optimizer is good it will scan the smaller kind_catalog table to do the grouping, instead of the counters table.
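A minimal sketch of the schema change this implies, assuming a kind_id column is added next to kind and filled from a new catalog table. The names kind_catalog, description, uq_kind_catalog_description, and idx_counters_created_kind_id are illustrative and not taken from the question, and the statements have not been verified against 5.0.77:

```sql
-- Catalog of message kinds (hypothetical table and column names).
-- Same charset/collation as counters.kind so the join below compares like with like.
CREATE TABLE kind_catalog (
  kind_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  description VARCHAR(255) NOT NULL,
  PRIMARY KEY (kind_id),
  UNIQUE KEY uq_kind_catalog_description (description)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- Seed the catalog from the values already present in the log table.
INSERT INTO kind_catalog (description)
SELECT DISTINCT kind FROM counters WHERE kind IS NOT NULL;

-- Add the integer key to counters and fill it in from the catalog.
ALTER TABLE counters ADD COLUMN kind_id INT UNSIGNED NULL;

UPDATE counters c
JOIN kind_catalog k ON k.description = c.kind
SET c.kind_id = k.kind_id;

-- Index that serves both the date filter and the grouping column.
CREATE INDEX idx_counters_created_kind_id ON counters (created_at, kind_id);
```

On a fast-growing table the ALTER TABLE and UPDATE are expensive in their own right, so this is something to try on a copy first.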