I've got a table that refuses to use index, and it always uses filesort.
The table is:
CREATE TABLE `article` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Category_ID` int(11) DEFAULT NULL,
`Subcategory` int(11) DEFAULT NULL,
`CTimestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`Publish` tinyint(4) DEFAULT NULL,
`Administrator_ID` int(11) DEFAULT NULL,
`Position` tinyint(4) DEFAULT '0',
PRIMARY KEY (`ID`),
KEY `Subcategory` (`Subcategory`,`Position`,`CTimestamp`,`Publish`),
KEY `Category_ID` (`Category_ID`,`CTimestamp`,`Publish`),
KEY `Position` (`Position`,`Category_ID`,`Publish`),
KEY `CTimestamp` (`CTimestamp`),
CONSTRAINT `article_ibfk_1` FOREIGN KEY (`Category_ID`) REFERENCES `category` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=94290 DEFAULT CHARSET=utf8
The query is:
SELECT * FROM article ORDER BY `CTimestamp`;
The explain is:
+----+-------------+---------+------+---------------+------+---------+------+-------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------+---------------+------+---------+------+-------+----------------+
| 1 | SIMPLE | article | ALL | NULL | NULL | NULL | NULL | 63568 | Using filesort |
+----+-------------+---------+------+---------------+------+---------+------+-------+----------------+
When I remove the "ORDER BY" then all are working properly. All other indices (Subcategory, Position, etc) are working fine in other queries. Unfortunately, the timestamp refuses to be used, even with my simple select query. I'm sure I'm missing something important here.
How can I make MySQL use the timestamp index?
Thank you.
In this case, MySQL is not using your index for sorting, and it is a GOOD thing.
Why? Your table contains just 64k rows, average row width is about 26 bytes (if I added column sizes right), so total table size on disk should be around 2MB.
It is very cheap to read just 2MB of data from disk into memory (probably in just 1-2 disk operations or seeks) and then simply perform filesort in memory (probably variation of quicksort).
If MySQL did retrieval by index order as you wish, it would have to perform 64000 disk seek operations, one record after another! It would have been very, very slow.
Indexes can be good when you can use them to quickly jump to known location in huge file and read just small amount of data, like in WHERE clause. But, in this case, it is not good idea - and MySQL is not stupid!
If your table was very big (more than RAM size), then MySQL would certainly start using your index - and this is also good thing.
Well, you can always hint the index. Change your query to
SELECT * FROM article use index (CTimestamp);
This forces MySQL to use the index for the query. The EXPLAIN:
1, 'SIMPLE', 'article', 'ALL', '', '', '', '', 1, 100.00, ''
No filesort to see, and as the used index is CTimestamp, the result should be ordered accordingly.
Alternatively, you can keep your order by clause, but force the index usage:
SELECT * FROM article force index (CTimestamp) order by CTimestamp;
The problem is still strange, though. Have you considered posting it to the official MySQL help forums?
Edit: You seem to be in good company.
Edit: Forcing the index seems to work out well.
Related
I have a MySQL table structured like this:
CREATE TABLE `messages` (
`id` int NOT NULL AUTO_INCREMENT,
`author` varchar(250) COLLATE utf8mb4_unicode_ci NOT NULL,
`message` varchar(2000) COLLATE utf8mb4_unicode_ci NOT NULL,
`serverid` varchar(200) COLLATE utf8mb4_unicode_ci NOT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`guildname` varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=27769461 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
I need to query this table for various statistics using date ranges for Grafana graphs, however all of those queries are extremely slow, despite the table being indexed using a composite key of id and date.
"id" is auto-incrementing and date is also always increasing.
The queries generated by Grafana look like this:
SELECT
UNIX_TIMESTAMP(date) DIV 120 * 120 AS "time",
count(DISTINCT(serverid)) AS "servercount"
FROM messages
WHERE
date BETWEEN FROM_UNIXTIME(1615930154) AND FROM_UNIXTIME(1616016554)
GROUP BY 1
ORDER BY UNIX_TIMESTAMP(date) DIV 120 * 120
This query takes over 30 seconds to complete with 27 million records in the table.
Explaining the query results in this output:
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| 1 | SIMPLE | messages | NULL | ALL | PRIMARY | NULL | NULL | NULL | 26952821 | 11.11 | Using where; Using filesort |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
This indicates that MySQL is indeed using the composite primary key I created for indexing the data, but still has to scan almost the entire table, which I do not understand. How can I optimize this table for date range queries?
Plan A:
PRIMARY KEY(date, id), -- to cluster by date
INDEX(id) -- needed to keep AUTO_INCREMENT happy
Assiming the table is quite big, having date at the beginning of the PK puts the rows in the given date range all next to each other. This minimizes (somewhat) the I/O.
Plan B:
PRIMARY KEY(id),
INDEX(date, serverid)
Now the secondary index is exactly what is needed for the one query you have provided. It is optimized for searching by date, and it is smaller than the whole table, hence even faster (I/O-wise) than Plan A.
But, if you have a lot of different queries like this, adding a lot more indexes gets impractical.
Plan C: There may be a still better way:
PRIMARY KEY(id),
INDEX(server_id, date)
In theory, it can hop through that secondary index checking each server_id. But I am not sure that such an optimization exists.
Plan D: Do you need id for anything other than providing a unique PRIMARY KEY? If not, there may be other options.
The index on (id, date) doesn't help because the first key is id not date.
You can either
(a) drop the current index and index (date, id) instead -- when date is in the first place this can be used to filter for date regardless of the following columns -- or
(b) just create an additional index only on (date) to support the query.
I try to optimize this query :
select id_store from receipt where receiptDate between '20151109' and '20151116'
I execute this query with the command EXPLAIN. It appears that no key is used. The index of receiptDate is not used. What's wrong ?
Here's the structure of the table receipt :
CREATE TABLE receipt (
id_store tinyint(3) unsigned NOT NULL default '0',
id_receipt int(7) unsigned NOT NULL default '0',
id_product smallint(6) unsigned NOT NULL default '0',
receiptDate char(8) NOT NULL default '',
qty float NOT NULL default '0',
turnover float NOT NULL default '0',
PRIMARY KEY (id_store,id_receipt,id_product,receiptDate),
KEY NDX_1 (receiptDate),
) ENGINE=MEMORY;
Here's the result of the command EXPLAIN :
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
| 1 | SIMPLE | receipt | ALL |NDX_1 | | | |24789225| Using where |
+----+-------------+---------------------+--------+-----------------------------------------------------------+---------------------------------+---------+--------------------------------------+------+---------------------------------+
The table receipt contains 24.789.225 lines, with a average of 15.000 lines per day (receiptDate).
I execute the following query and I obtain 120.295 lines :
select count(*) from receipt where receiptDate between '20151109' and '20151116'
Thanks in advance for your help.
Since you indexed receiptDate, the database engine will use this index if the optimizer thinks it will improve performance. The optimizer takes its decisions based on statistics about your table. It creates these statistics in background, this is a mostly transparent process.
Now you are using a MEMORY engine. Because memory tables are supposed to be short lived, these special engines have very limited optimizer capabilities. You might want to force the query to use your index with the FORCE INDEX keyword.
Your date is stored as a CHAR(8), this is slow because the engine has to parse all height CHAR. You will obtain improved performance with an INT (convert the date to YYYYMMDD). Your queries should still work since the engine converts strings inputs to int automatically.
If you were going to use the InnoDB engine, then you should put the date as a primary key if possible, because the primary key is also the Clustered Index with this engine, meaning the data will be physically sorted by date on the storage.
Instead of an 8-byte (or 24-byte if utf8) CHAR(8) for receiptDate, use a 3-byte DATE datatype.
While you are at it, you could save another byte by making id_receipt MEDIUMINT UNSIGNED if it is only 7 digits.
Specify BTREE:
KEY `NDX_1` (`receiptDate`) USING BTREE
since MEMORY may be making it a HASH index. BTREE can handle ranges; HASH has to do a table scan.
I agree that InnoDB is likely to be better overall.
I looked through multiple similar posts trying to get input on how to redefine my index but can't figure this out. Every time i include the ORDER BY statement, it uses filesort to return the resultset.
Here's the table definition and query:
SELECT
`s`.`title`,
`s`.`price`,
`s`.`price_sale`
FROM `style` `s`
WHERE `s`.`isactive`=1 AND `s`.`department`='women'
ORDER
BY `s`.`ctime` DESC
CREATE TABLE IF NOT EXISTS `style` (
`id` mediumint(6) unsigned NOT NULL auto_increment,
`ctime` timestamp NOT NULL default CURRENT_TIMESTAMP,
`department` char(5) NOT NULL,
`isactive` tinyint(1) unsigned NOT NULL,
`price` float(8,2) unsigned NOT NULL,
`price_sale` float(8,2) unsigned NOT NULL,
`title` varchar(200) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_grid_default` (`isactive`,`department`,`ctime`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci AUTO_INCREMENT=47 ;
Also, here's the explain result set I get:
+----+-------------+-------+------+---------------+----------+---------+-------------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+-------------+------+-----------------------------+
| 1 | SIMPLE | s | ref | idx_grid | idx_grid | 6 | const,const | 3 | Using where; Using filesort |
+----+-------------+-------+------+---------------+----------+---------+-------------+------+-----------------------------+
Why does s.isactive not get used as an index?
MySQL (or any SQL for that matter) will not use a key if it has low cardinality.
In plain English, if many rows share the same value for a key, (My)SQL will not use the index, but just real the table instead.
A boolean field almost never gets picked as an index because of this; too many rows share the same value.
Why does MySQL not use the index on ctime?
ctime is included in a multi-field or composite index. MySQL will only use a composite index if you use all of it or a left-most part of it *)
If you sort on the middle or rightmost field(s) of a composite index, MySQL cannot use the index and will have to resort to filesort.
So a order by isactive , department will use an index;
order by department will not.
order by isactive will also not use an index, but that's because the cardinality of the boolean field isactive is too low.
*) there are some exceptions, but this covers 97% of cases.
Links:
Cardinality wikipedia: http://en.wikipedia.org/wiki/Cardinality_%28data_modeling%29
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
What does Using filesort mean in MySQL?
It does not mean you have a temporary file, it just mean a sort is done (bad name, ignore the 4 first letters).
from Baron Schwartz:
The truth is, filesort is badly named. Anytime a sort can’t be performed from an index, it’s a filesort. It has nothing to do with files. Filesort should be called “sort.” It is quicksort at heart.
I am having some issues with a group query with MySQL.
Question
Is there a reason why a query won't use a 10 character partial index on a varchar(255) field to optimize a group by?
Details
My setup:
CREATE TABLE `sessions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`ref_source` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`guid` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`initial_path` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`referrer_host` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`campaign` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_sessions_on_user_id` (`user_id`),
KEY `index_sessions_on_referrer_host` (`referrer_host`(10)),
KEY `index_sessions_on_initial_path` (`initial_path`(10)),
KEY `index_sessions_on_campaign` (`campaign`(10))
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
A number of columns and indexes are not shown here since they don't really impact the issue.
What I want to do is run a query to see all of the referring hosts and the number of session coming from each. I don't have a huge table, but it is big enough where I full table scans aren't fun. The query I want to run is:
SELECT COUNT(*) AS count_all, referrer_host AS referrer_host FROM `sessions` GROUP BY referrer_host;
The explain gives:
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| 1 | SIMPLE | sessions | ALL | NULL | NULL | NULL | NULL | 303049 | Using temporary; Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
I have a partial index on referrer_host, but it isn't using it. Even if I try to USE INDEX or FORCE INDEX it doesn't help. The explain is the same, as is the performance.
If I add a full index on referrer_host, instead of a 10 character partial index, everything is works better, if not instantly. (350ms vs. 10 seconds)
I have tested partial indexes that are bigger than the longest entry in the field to no avail as well. The full index is the only thing that seems to work.
with the full index, the query will find scan the entire index and return the number of records pointed to for each unique key. the table isn't touched.
with the partial index, the engine doesn't know the value of the referrer_host until it looks at the record. It has to scan the whole table!
if most of the values for referrer_host are less than 10 chars then in theory, the optimiser could use the index and then only check rows that have more than 10 chars. But, because this is not a clustered index it would have to make many non-sequential disk reads to find these records. It could end up being even slower, because a table scan will at least be a sequential read. Instead of making assumptions, the optimiser just does a scan.
Try this query:
EXPLAIN SELECT COUNT(referrer_host) AS count_all, referrer_host FROM `sessions` GROUP BY referrer_host;
Now the count will fail for the group by on referrer_host = null, but I'm uncertain if there's another way around this.
You're grouping on referrer_host for all the rows in the table. As your index doesn't include referrer_host (it contains the first 10 chars!), it's going to scan the whole table.
I'll bet that this is faster, though less detailed:
SELECT COUNT(*) AS count_all, substring(referrer_host,1,10) AS referrer_host FROM `sessions` GROUP BY referrer_host;
If you need the full referrer, index it.
The table looks like this:
CREATE TABLE `tweet_tweet` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`text` varchar(256) NOT NULL,
`created_at` datetime NOT NULL,
`created_date` date NOT NULL,
...
`positive_sentiment` decimal(5,2) DEFAULT NULL,
`negative_sentiment` decimal(5,2) DEFAULT NULL,
`entity_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `tweet_tweet_entity_created` (`entity_id`,`created_at`)
) ENGINE=MyISAM AUTO_INCREMENT=1097134 DEFAULT CHARSET=utf8
The explain on the query looks like this:
mysql> explain SELECT `tweet_tweet`.`entity_id`,
STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
COUNT(`tweet_tweet`.`id`) AS `tweet_count`
FROM `tweet_tweet`
WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| 1 | SIMPLE | tweet_tweet | ALL | NULL | NULL | NULL | NULL | 1097452 | Using where; Using temporary; Using filesort |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
1 row in set (0.00 sec)
About 300k rows are added to the table every day. The query runs about 4 seconds right now but I want to get it down to around 1 second and I'm afraid the query will take exponentially longer as the days go on. Total number of rows in tweet_tweet is currently only a little over 1M, but it will be growing fast.
Any thoughts on optimizing this? Do I need any more indexes? Should I be using something like Cassandra instead of MySQL? =)
You may try to reorder fields in the index (i.e. KEY tweet_tweet_entity_created (created_at, entity_id). That will allow mysql to use the index to reduce the quantity of actual rows that need to be grouped and ordered).
You're not using the index tweet_tweet_entity_created. Change your query to:
explain SELECT `tweet_tweet`.`entity_id`,
STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
COUNT(`tweet_tweet`.`id`) AS `tweet_count`
FROM `tweet_tweet` FORCE INDEX (tweet_tweet_entity_created)
WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;
You can read more about index hints in the MySQL manual http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
Sometimes MySQL's query optimizer needs a little help.
MySQL has a dirty little secret. When you create an index over multiple columns, only the first one is really "used". I've made tables that used Unique Keys and Foreign Keys, and I often had to set a separate index for one or more of the columns.
I suggest adding an extra index to just created_at at a minimum. I do not know if adding indexes to the aggregate columns will also speed things up.
if your mysql version 5.1 or higher ,you can consider partitioning option for large tables.
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html