MySQL partial indexes on varchar fields and group by optimization

I am having some issues with a group query with MySQL.
Question
Is there a reason why a query won't use a 10 character partial index on a varchar(255) field to optimize a group by?
Details
My setup:
CREATE TABLE `sessions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`ref_source` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`guid` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`initial_path` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`referrer_host` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`campaign` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_sessions_on_user_id` (`user_id`),
KEY `index_sessions_on_referrer_host` (`referrer_host`(10)),
KEY `index_sessions_on_initial_path` (`initial_path`(10)),
KEY `index_sessions_on_campaign` (`campaign`(10))
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
A number of columns and indexes are not shown here since they don't really impact the issue.
What I want to do is run a query to see all of the referring hosts and the number of sessions coming from each. I don't have a huge table, but it is big enough that full table scans aren't fun. The query I want to run is:
SELECT COUNT(*) AS count_all, referrer_host AS referrer_host FROM `sessions` GROUP BY referrer_host;
The explain gives:
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| 1 | SIMPLE | sessions | ALL | NULL | NULL | NULL | NULL | 303049 | Using temporary; Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
I have a partial index on referrer_host, but it isn't using it. Even if I try to USE INDEX or FORCE INDEX it doesn't help. The explain is the same, as is the performance.
If I add a full index on referrer_host, instead of a 10 character partial index, everything works much better, if not instantly (350ms vs. 10 seconds).
I have tested partial indexes that are bigger than the longest entry in the field to no avail as well. The full index is the only thing that seems to work.

With the full index, the query can scan the entire index and return the number of records pointed to for each unique key. The table isn't touched.
With the partial index, the engine doesn't know the full value of referrer_host until it looks at the record. It has to scan the whole table!
If most of the values for referrer_host are shorter than 10 chars then, in theory, the optimiser could use the index and only check the rows whose values may be longer than the 10-char prefix. But because this is not a clustered index, it would have to make many non-sequential disk reads to find those records. That could end up being even slower, because a table scan is at least a sequential read. Instead of making assumptions, the optimiser just does a scan.
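As a minimal sketch of the fix the asker reported (a full index instead of the 10-character prefix), the statements below swap the index and re-check the plan. Reusing the existing index name is an assumption made for illustration; any name works.

```sql
-- Replace the 10-character prefix index with a full-column index.
-- Keeping the same index name is an illustrative choice, not a requirement.
ALTER TABLE sessions
  DROP INDEX index_sessions_on_referrer_host,
  ADD INDEX index_sessions_on_referrer_host (referrer_host);

-- Re-run EXPLAIN to check whether the GROUP BY is now resolved from the index.
EXPLAIN
SELECT COUNT(*) AS count_all, referrer_host
FROM sessions
GROUP BY referrer_host;
```

If the optimiser picks the new index, the key column of the EXPLAIN output should name it instead of showing NULL.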

Try this query:
EXPLAIN SELECT COUNT(referrer_host) AS count_all, referrer_host FROM `sessions` GROUP BY referrer_host;
Note that COUNT(referrer_host) skips NULLs, so the group where referrer_host is NULL will report a count of 0. I'm uncertain if there's another way around this.
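The difference between the two COUNT forms for the NULL group can be seen on a small throwaway table. The table name count_demo and its rows are made up for this sketch.

```sql
-- Hypothetical scratch table: two rows with a host, three rows with NULL.
CREATE TEMPORARY TABLE count_demo (
  referrer_host VARCHAR(255) NULL
);

INSERT INTO count_demo (referrer_host)
VALUES ('example.com'), ('example.com'), (NULL), (NULL), (NULL);

-- COUNT(*) counts rows; COUNT(referrer_host) counts only non-NULL values.
SELECT referrer_host,
       COUNT(*)             AS count_rows,
       COUNT(referrer_host) AS count_non_null
FROM count_demo
GROUP BY referrer_host;
```

The 'example.com' group reports 2 for both counts, while the NULL group reports 3 rows and 0 non-NULL values.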

You're grouping on referrer_host for all the rows in the table. As your index doesn't include referrer_host (it contains the first 10 chars!), it's going to scan the whole table.
I'll bet that this is faster, though less detailed:
SELECT COUNT(*) AS count_all, SUBSTRING(referrer_host, 1, 10) AS referrer_prefix FROM `sessions` GROUP BY referrer_prefix;
If you need the full referrer, index it.

Related

MySQL Date Range Query Optimization

I have a MySQL table structured like this:
CREATE TABLE `messages` (
`id` int NOT NULL AUTO_INCREMENT,
`author` varchar(250) COLLATE utf8mb4_unicode_ci NOT NULL,
`message` varchar(2000) COLLATE utf8mb4_unicode_ci NOT NULL,
`serverid` varchar(200) COLLATE utf8mb4_unicode_ci NOT NULL,
`date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`guildname` varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL,
PRIMARY KEY (`id`,`date`)
) ENGINE=InnoDB AUTO_INCREMENT=27769461 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
I need to query this table for various statistics using date ranges for Grafana graphs, however all of those queries are extremely slow, despite the table being indexed using a composite key of id and date.
"id" is auto-incrementing and date is also always increasing.
The queries generated by Grafana look like this:
SELECT
UNIX_TIMESTAMP(date) DIV 120 * 120 AS "time",
count(DISTINCT(serverid)) AS "servercount"
FROM messages
WHERE
date BETWEEN FROM_UNIXTIME(1615930154) AND FROM_UNIXTIME(1616016554)
GROUP BY 1
ORDER BY UNIX_TIMESTAMP(date) DIV 120 * 120
This query takes over 30 seconds to complete with 27 million records in the table.
Explaining the query results in this output:
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
| 1 | SIMPLE | messages | NULL | ALL | PRIMARY | NULL | NULL | NULL | 26952821 | 11.11 | Using where; Using filesort |
+----+-------------+----------+------------+------+---------------+------+---------+------+----------+----------+-----------------------------+
This indicates that MySQL is indeed using the composite primary key I created for indexing the data, but still has to scan almost the entire table, which I do not understand. How can I optimize this table for date range queries?
Plan A:
PRIMARY KEY(date, id), -- to cluster by date
INDEX(id) -- needed to keep AUTO_INCREMENT happy
Assuming the table is quite big, having date at the beginning of the PK puts the rows in a given date range all next to each other. This minimizes (somewhat) the I/O.
Plan B:
PRIMARY KEY(id),
INDEX(date, serverid)
Now the secondary index is exactly what is needed for the one query you have provided. It is optimized for searching by date, and it is smaller than the whole table, hence even faster (I/O-wise) than Plan A.
But, if you have a lot of different queries like this, adding a lot more indexes gets impractical.
Plan C: There may be a still better way:
PRIMARY KEY(id),
INDEX(serverid, date)
In theory, it can hop through that secondary index checking each serverid. But I am not sure that such an optimization exists.
Plan D: Do you need id for anything other than providing a unique PRIMARY KEY? If not, there may be other options.
The index on (id, date) doesn't help because the first key is id not date.
You can either
(a) drop the current index and index (date, id) instead -- when date is in the first place this can be used to filter for date regardless of the following columns -- or
(b) just create an additional index only on (date) to support the query.
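A sketch of Plan B above (a variant of option (b) that also covers serverid) might look like the following. The index name idx_messages_date_serverid is hypothetical, and the sketch assumes the table's row format allows an index key of this length.

```sql
-- Secondary index that leads with date, so the range filter can use it,
-- and includes serverid so the query can be answered from the index alone.
-- The index name is a made-up example.
ALTER TABLE messages
  ADD INDEX idx_messages_date_serverid (`date`, serverid);

-- Re-check the plan for the Grafana query from the question.
EXPLAIN
SELECT UNIX_TIMESTAMP(`date`) DIV 120 * 120 AS `time`,
       COUNT(DISTINCT serverid) AS servercount
FROM messages
WHERE `date` BETWEEN FROM_UNIXTIME(1615930154) AND FROM_UNIXTIME(1616016554)
GROUP BY 1
ORDER BY 1;
```

If the index is chosen, the type column should change from ALL to range and the rows estimate should drop well below the full table size.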

mysql query optimization: select with counted subquery extremely slow

I have the following tables:
mysql> show create table rsspodcastitems \G
*************************** 1. row ***************************
Table: rsspodcastitems
Create Table: CREATE TABLE `rsspodcastitems` (
`id` char(20) NOT NULL,
`description` mediumtext,
`duration` int(11) default NULL,
`enclosure` mediumtext NOT NULL,
`guid` varchar(300) NOT NULL,
`indexed` datetime NOT NULL,
`published` datetime default NULL,
`subtitle` varchar(255) default NULL,
`summary` mediumtext,
`title` varchar(255) NOT NULL,
`podcast_id` char(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `podcast_id` (`podcast_id`,`guid`),
UNIQUE KEY `UKfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `IDXkcqf7wi47t3epqxlh34538k7c` (`indexed`),
KEY `IDXt2ofice5w51uun6w80g8ou7hc` (`podcast_id`,`published`),
KEY `IDXfb6nlyxvxf3i2ibwd8jx6k025` (`podcast_id`,`guid`),
KEY `published` (`published`),
FULLTEXT KEY `title` (`title`),
FULLTEXT KEY `summary` (`summary`),
FULLTEXT KEY `subtitle` (`subtitle`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> show create table station_cache \G
*************************** 1. row ***************************
Table: station_cache
Create Table: CREATE TABLE `station_cache` (
`Station_id` char(36) NOT NULL,
`item_id` char(20) NOT NULL,
`item_type` int(11) NOT NULL,
`podcast_id` char(20) NOT NULL,
`published` datetime NOT NULL,
KEY `Station_id` (`Station_id`,`published`),
KEY `IDX12n81jv8irarbtp8h2hl6k4q3` (`Station_id`,`published`),
KEY `item_id` (`item_id`,`item_type`),
KEY `IDXqw9yqpavo9fcduereqqij4c80` (`item_id`,`item_type`),
KEY `podcast_id` (`podcast_id`,`published`),
KEY `IDXkp2ehbpmu41u1vhwt7qdl2fuf` (`podcast_id`,`published`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
The "item_id" column of the second refers to the "id" column of the former (there isn't a foreign key between the two because the relationship is polymorphic, i.e. the second table may have references to entities that aren't in the first but in other tables that are similar but distinct).
I'm trying to get a query that lists the most recent items in the first table that do not have any corresponding items in the second. The highest performing query I've found so far is:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from rsspodcastitems i
having stations = 0
order by published desc
I've also considered using a where not exists (...) subquery to perform the restriction, but this was actually slower than the one I have above. But this is still taking a substantial length of time to complete. MySQL's query plan doesn't seem to be using the available indices:
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
| 1 | PRIMARY | i | ALL | NULL | NULL | NULL | NULL | 106978 | Using filesort |
| 2 | DEPENDENT SUBQUERY | station_cache | ALL | NULL | NULL | NULL | NULL | 44227 | Using where |
+----+--------------------+---------------+------+---------------+------+---------+------+--------+----------------+
Note that neither portion of the query is using a key, whereas it ought to be able to use KEY published (published) from the primary table and KEY item_id (item_id,item_type) for the subquery.
Any suggestions how I can get an appropriate result without waiting for several minutes?
I would expect the fastest query to be:
select i.*
from rsspodcastitems i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
)
order by published desc;
This would take advantage of an index on station_cache(item_id) and perhaps rsspodcastitems(published, id).
Your query could be faster if it returns a significant number of rows. Your phrasing of the query allows the index on rsspodcastitems(published) to avoid the file sort. If you remove the order by, the exists version should be faster.
I should note that I like your use of the having clause. When faced with this in the past, I have used a subquery:
select i.*,
(select count(station_id)
from station_cache
where item_id = i.id) as stations
from (select i.*
from rsspodcastitems i
order by published desc
) i
where not exists (select 1
from station_cache sc
where sc.item_id = i.id
);
This allows one index for sorting.
I prefer a slight variation on your method:
select i.*,
(exists (select 1
from station_cache sc
where sc.item_id = i.id
)
) as has_station
from rsspodcastitems i
having has_station = 0
order by published desc;
This should be slightly faster than the version with count().
You might want to detect and remove redundant indexes from your tables. Reviewing the CREATE TABLE output for both tables will help you discover several, including (podcast_id, guid), (Station_id, published), (item_id, item_type) and (podcast_id, published). There may be more.
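As a rough sketch, the duplicates visible in the two CREATE TABLE listings could be dropped as below. Which copy of each duplicate to keep is an assumption here; the generated-looking names are dropped and the readable ones kept, but a schema-generating tool may expect the opposite.

```sql
-- rsspodcastitems: three indexes cover (podcast_id, guid); keep one UNIQUE key.
ALTER TABLE rsspodcastitems
  DROP INDEX UKfb6nlyxvxf3i2ibwd8jx6k025,
  DROP INDEX IDXfb6nlyxvxf3i2ibwd8jx6k025;

-- station_cache: each column pair is indexed twice; keep one index per pair.
ALTER TABLE station_cache
  DROP INDEX IDX12n81jv8irarbtp8h2hl6k4q3,
  DROP INDEX IDXqw9yqpavo9fcduereqqij4c80,
  DROP INDEX IDXkp2ehbpmu41u1vhwt7qdl2fuf;
```

Running SHOW INDEX FROM on each table afterwards is a quick way to confirm that exactly one index per column combination remains.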
My eventual solution was to delete the full text indices and use an externally generated index table (produced by iterating over the words in the text, filtering stop words, and applying a stemming algorithm) to allow searching. I don't know why the full text indices were causing performance problems, but they seemed to slow down every query that touched the table even if they weren't used.

MySQL uses filesort on indexed TIMESTAMP column

I've got a table that refuses to use index, and it always uses filesort.
The table is:
CREATE TABLE `article` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`Category_ID` int(11) DEFAULT NULL,
`Subcategory` int(11) DEFAULT NULL,
`CTimestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`Publish` tinyint(4) DEFAULT NULL,
`Administrator_ID` int(11) DEFAULT NULL,
`Position` tinyint(4) DEFAULT '0',
PRIMARY KEY (`ID`),
KEY `Subcategory` (`Subcategory`,`Position`,`CTimestamp`,`Publish`),
KEY `Category_ID` (`Category_ID`,`CTimestamp`,`Publish`),
KEY `Position` (`Position`,`Category_ID`,`Publish`),
KEY `CTimestamp` (`CTimestamp`),
CONSTRAINT `article_ibfk_1` FOREIGN KEY (`Category_ID`) REFERENCES `category` (`ID`)
) ENGINE=InnoDB AUTO_INCREMENT=94290 DEFAULT CHARSET=utf8
The query is:
SELECT * FROM article ORDER BY `CTimestamp`;
The explain is:
+----+-------------+---------+------+---------------+------+---------+------+-------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------+---------------+------+---------+------+-------+----------------+
| 1 | SIMPLE | article | ALL | NULL | NULL | NULL | NULL | 63568 | Using filesort |
+----+-------------+---------+------+---------------+------+---------+------+-------+----------------+
When I remove the "ORDER BY" then all are working properly. All other indices (Subcategory, Position, etc) are working fine in other queries. Unfortunately, the timestamp refuses to be used, even with my simple select query. I'm sure I'm missing something important here.
How can I make MySQL use the timestamp index?
Thank you.
In this case, MySQL is not using your index for sorting, and it is a GOOD thing.
Why? Your table contains just 64k rows and the average row width is about 26 bytes (if I added the column sizes right), so the total table size on disk should be around 2MB.
It is very cheap to read just 2MB of data from disk into memory (probably in just 1-2 disk operations or seeks) and then simply perform the filesort in memory (probably a variation of quicksort).
If MySQL retrieved rows in index order as you wish, it would have to perform up to 64000 disk seek operations, one record after another! That would be very, very slow.
Indexes are good when you can use them to quickly jump to a known location in a huge file and read just a small amount of data, as in a WHERE clause. But in this case it is not a good idea, and MySQL is not stupid!
If your table were very big (more than RAM size), MySQL would likely start using your index, and this is also a good thing.
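Rather than adding up column sizes by hand, the size estimate can be checked against the server's own statistics. This is a sketch that assumes the article table lives in the currently selected schema; for InnoDB the row count and average row length are estimates.

```sql
-- Approximate row count, average row width and on-disk sizes for the table.
SELECT table_name,
       table_rows,
       avg_row_length,
       ROUND(data_length  / 1024 / 1024, 2) AS data_mb,
       ROUND(index_length / 1024 / 1024, 2) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'article';
```

If data_mb comes out at a few megabytes, a full scan followed by an in-memory sort is cheap, which fits the reasoning above.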
Well, you can always hint the index. Change your query to
SELECT * FROM article use index (CTimestamp);
This hints MySQL to consider the index for the query. The EXPLAIN:
1, 'SIMPLE', 'article', 'ALL', '', '', '', '', 1, 100.00, ''
No filesort appears, but note that the key column is empty, so the index is not actually being read; without an ORDER BY clause the order of the result is not guaranteed.
Alternatively, you can keep your order by clause, but force the index usage:
SELECT * FROM article force index (CTimestamp) order by CTimestamp;
The problem is still strange, though. Have you considered posting it to the official MySQL help forums?
Edit: You seem to be in good company.
Edit: Forcing the index seems to work out well.

mysql single table SELECT query ORDER BY causes FILESORT

I looked through multiple similar posts trying to get input on how to redefine my index but can't figure this out. Every time I include the ORDER BY clause, it uses filesort to return the result set.
Here's the query and the table definition:
SELECT
`s`.`title`,
`s`.`price`,
`s`.`price_sale`
FROM `style` `s`
WHERE `s`.`isactive`=1 AND `s`.`department`='women'
ORDER
BY `s`.`ctime` DESC
CREATE TABLE IF NOT EXISTS `style` (
`id` mediumint(6) unsigned NOT NULL auto_increment,
`ctime` timestamp NOT NULL default CURRENT_TIMESTAMP,
`department` char(5) NOT NULL,
`isactive` tinyint(1) unsigned NOT NULL,
`price` float(8,2) unsigned NOT NULL,
`price_sale` float(8,2) unsigned NOT NULL,
`title` varchar(200) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_grid_default` (`isactive`,`department`,`ctime`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_general_ci AUTO_INCREMENT=47 ;
Also, here's the explain result set I get:
+----+-------------+-------+------+---------------+----------+---------+-------------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+----------+---------+-------------+------+-----------------------------+
| 1 | SIMPLE | s | ref | idx_grid | idx_grid | 6 | const,const | 3 | Using where; Using filesort |
+----+-------------+-------+------+---------------+----------+---------+-------------+------+-----------------------------+
Why does s.isactive not get used as an index?
MySQL (or any SQL database for that matter) will usually not use a key if it has low cardinality.
In plain English, if many rows share the same value for a key, (My)SQL will tend to skip the index and just read the table instead.
A boolean field almost never gets picked as an index because of this; too many rows share the same value.
Why does MySQL not use the index on ctime?
ctime is included in a multi-field or composite index. MySQL will only use a composite index if you use all of it or a left-most part of it *)
If you sort on the middle or rightmost field(s) of a composite index, MySQL cannot use the index and will have to resort to filesort.
So an order by isactive, department will use an index;
order by department will not.
order by isactive will also not use an index, but that's because the cardinality of the boolean field isactive is too low.
*) there are some exceptions, but this covers 97% of cases.
Links:
Cardinality wikipedia: http://en.wikipedia.org/wiki/Cardinality_%28data_modeling%29
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
What does Using filesort mean in MySQL?
It does not mean you have a temporary file, it just means a sort is done (bad name, ignore the first 4 letters).
from Baron Schwartz:
The truth is, filesort is badly named. Anytime a sort can’t be performed from an index, it’s a filesort. It has nothing to do with files. Filesort should be called “sort.” It is quicksort at heart.

How to optimize a query that's using group by on a large number of rows

The table looks like this:
CREATE TABLE `tweet_tweet` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`text` varchar(256) NOT NULL,
`created_at` datetime NOT NULL,
`created_date` date NOT NULL,
...
`positive_sentiment` decimal(5,2) DEFAULT NULL,
`negative_sentiment` decimal(5,2) DEFAULT NULL,
`entity_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `tweet_tweet_entity_created` (`entity_id`,`created_at`)
) ENGINE=MyISAM AUTO_INCREMENT=1097134 DEFAULT CHARSET=utf8
The explain on the query looks like this:
mysql> explain SELECT `tweet_tweet`.`entity_id`,
STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
COUNT(`tweet_tweet`.`id`) AS `tweet_count`
FROM `tweet_tweet`
WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
| 1 | SIMPLE | tweet_tweet | ALL | NULL | NULL | NULL | NULL | 1097452 | Using where; Using temporary; Using filesort |
+----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
1 row in set (0.00 sec)
About 300k rows are added to the table every day. The query runs about 4 seconds right now but I want to get it down to around 1 second and I'm afraid the query will take exponentially longer as the days go on. Total number of rows in tweet_tweet is currently only a little over 1M, but it will be growing fast.
Any thoughts on optimizing this? Do I need any more indexes? Should I be using something like Cassandra instead of MySQL? =)
You may try to reorder the fields in the index, i.e. KEY tweet_tweet_entity_created (created_at, entity_id). That will allow MySQL to use the index to reduce the number of rows that need to be grouped and ordered.
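A sketch of that reordering follows. The new index name tweet_tweet_created_entity is made up so the name matches the new column order, and the old index is dropped on the assumption that no other query depends on it.

```sql
-- Swap the column order so the created_at range filter can use the index.
-- The new index name is hypothetical.
ALTER TABLE tweet_tweet
  DROP INDEX tweet_tweet_entity_created,
  ADD INDEX tweet_tweet_created_entity (created_at, entity_id);

-- Re-check the plan for the query from the question.
EXPLAIN
SELECT entity_id,
       STDDEV_POP(positive_sentiment) AS sentiment_stddev,
       AVG(positive_sentiment)        AS sentiment_avg,
       COUNT(id)                      AS tweet_count
FROM tweet_tweet
WHERE created_at > '2010-10-06 16:24:43'
GROUP BY entity_id
ORDER BY entity_id ASC;
```

Whether the optimiser actually picks the range scan depends on how many rows fall after the cutoff; if most of the table qualifies, a full scan may still be chosen.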
You're not using the index tweet_tweet_entity_created. Change your query to:
explain SELECT `tweet_tweet`.`entity_id`,
STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
COUNT(`tweet_tweet`.`id`) AS `tweet_count`
FROM `tweet_tweet` FORCE INDEX (tweet_tweet_entity_created)
WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;
You can read more about index hints in the MySQL manual http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
Sometimes MySQL's query optimizer needs a little help.
MySQL has a dirty little secret. When you create an index over multiple columns, it is only really "used" when the query constrains the leftmost column. I've made tables that used Unique Keys and Foreign Keys, and I often had to set a separate index for one or more of the other columns.
I suggest adding an extra index to just created_at at a minimum. I do not know if adding indexes to the aggregate columns will also speed things up.
If your MySQL version is 5.1 or higher, you can consider the partitioning option for large tables.
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html