CONTEXT
I have a large table full of "documents" that are updated by outside sources. When I notice the updates are more recent than my last touchpoint, I need to address these documents. I'm having some serious performance issues, though.
EXAMPLE CODE
select count(*) from documents;
gets me back 212,494,397 documents in 1 min 15.24 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
which is approximately the actual query, gets me 55,988,860 in 14 min 36.23 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;
notably also takes about 15 minutes (this was surprising to me).
THE PROBLEM
How do I perform the updated_at > last_indexed_at comparison in a more reasonable time?
DETAILS
I'm pretty certain that my query is, in some way, not sargable. Unfortunately, I can't find what about this query prevents it from being executed on a row-independent basis.
select count(*)
from documents
where last_indexed_at is null or updated_at > last_indexed_at;
doesn't do any better.
nor does
select count( distinct( id ) )
from documents
where last_indexed_at is null or updated_at > last_indexed_at limit 1;
nor does
select count( distinct( id ) )
from documents limit 1;
EDIT: FOLLOW UP REQUESTED DATA
This question only involves one table (thankfully) in a rails project, so we conveniently have the rails definition for the table.
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `documents` (
`id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL,
`fields` mediumtext COLLATE utf8mb4_unicode_ci,
`created_at` datetime(6) NOT NULL,
`updated_at` datetime(6) NOT NULL,
`last_indexed_at` datetime(6) DEFAULT NULL,
`deleted_at` datetime(6) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_documents_on_document_type` (`document_type`),
KEY `index_documents_on_locale` (`locale`),
KEY `index_documents_on_last_indexed_at` (`last_indexed_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
SELECT VERSION(); got me 5.7.27-30-log
And probably most important,
explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
gets me exactly
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| 1 | SIMPLE | documents | NULL | ALL | NULL | NULL | NULL | NULL | 208793754 | 100.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
Oh! MySQL 5.7 introduced Generated Columns — which gives us a way of indexing expressions! 🥳
If you do something like this:
ALTER TABLE documents
ADD COLUMN dirty BOOL GENERATED ALWAYS AS (COALESCE(updated_at > last_indexed_at, TRUE)) STORED,
ADD INDEX index_documents_on_dirty(dirty);
...and change the query to:
SELECT COUNT(*) FROM documents WHERE dirty;
...what results do you get?
Hopefully, we're moving the work of evaluating COALESCE(updated_at > last_indexed_at, TRUE) from Read time to Write time.
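As a quick sanity check (assuming the generated column and index above were created successfully), EXPLAIN should now pick the new index instead of a full scan:
EXPLAIN SELECT COUNT(*) FROM documents WHERE dirty;
-- hopefully 'key' now shows index_documents_on_dirty instead of NULL with type ALL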
Add a covering INDEX
If you had INDEX(last_indexed_at, updated_at), the 15-minute queries might run somewhat faster. (The order of the columns does not matter in this case.)
That assumes both of those columns really are plain columns in the table. If so, then the query must read every row. (I don't know if the term "sargable" covers this situation.)
The INDEX I suggest will be faster because it is "covering". By reading only the index, there is less I/O.
The repeated 15 minutes is probably because innodb_buffer_pool_size was not big enough to hold the entire table. So, it was I/O-bound. My INDEX will be smaller, hence (hopefully) small enough to fit in the buffer_pool. So, it will be faster, and even faster on the second run.
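A minimal sketch of the suggested index (the index name here is my own):
ALTER TABLE documents
ADD INDEX index_documents_on_last_indexed_and_updated (last_indexed_at, updated_at);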
Slow OR
OR is usually a terrible slowdown. But I don't think it matters here.
If you were to initialize last_indexed_at to some old date (say, '2000-01-01'), you could get rid of the COALESCE or OR.
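For example (a sketch; this assumes you are free to rewrite the NULLs and that the rest of your code treats the sentinel date as "never indexed"):
UPDATE documents SET last_indexed_at = '2000-01-01' WHERE last_indexed_at IS NULL;
-- on a table this size you may want to batch that UPDATE; afterwards the check needs no COALESCE or OR:
SELECT COUNT(*) FROM documents WHERE updated_at > last_indexed_at;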
Another way to clean it up is
SELECT SUM(last_indexed_at IS NULL) +
SUM(updated_at > last_indexed_at) AS "Need indexing"
FROM t;
This still needs the index. SUM(boolean expression) sees the expression as 0 (FALSE or NULL) or 1 (TRUE).
Meanwhile, I don't think the COUNT(DISTINCT id) is any different than COUNT(*). And the pair of SUMs should also give you the value.
Again, I am depending on "covering" being the trick.
"More than .." trick
In some situations, you don't really need the exact number, especially if it is just "more than some threshold".
SELECT 1 FROM tbl WHERE ... LIMIT 1000,1;
If it comes back with "1", there are at least 1000 rows. If it comes back empty (no row returned), then not.
That will still have to touch up to 1000 rows (hopefully in an index), but that is better than touching a million.
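Applied to the documents table from the question, that might look like:
SELECT 1
FROM documents
WHERE last_indexed_at IS NULL OR updated_at > last_indexed_at
LIMIT 1000,1;
-- a returned row means "at least 1000 documents need reindexing"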
You can, if you're on a recent MySQL version (5.7+), add a generated column to your table containing your search expression, then index that column.
ALTER TABLE t
ADD COLUMN needs_indexing TINYINT
GENERATED ALWAYS AS
(CASE WHEN last_indexed_at IS NULL THEN 1
WHEN updated_at > last_indexed_at THEN 1
ELSE 0 END) VIRTUAL;
ALTER TABLE t
ADD INDEX needs_indexing (needs_indexing);
This uses disk space for the index, but not in your table (the column is VIRTUAL).
Then you can do SELECT SUM(needs_indexing) FROM t to get the number of items matching your criterion.
But: you don't have to count all the items to know you need to reindex some items. Doing a COUNT(*) on a large InnoDB table is very expensive as you have discovered. You can do this:
SELECT EXISTS (SELECT 1 FROM t WHERE needs_indexing = 1) something_needs_indexing;
You'll get 1 or 0 from this query very quickly. 1 means you have at least one row meeting your criteria.
And, of course, your indexing job can do
SELECT * FROM t WHERE needs_indexing=1 LIMIT 1;
or whatever makes sense. That will also be fast.
Related
Okay, so as the title says: the queries were running fine 2 days ago, then all of a sudden yesterday the site was loading very slowly. I tracked it down to a couple of queries. For one of them I was able to add an index, which seems to have helped, but this one I just can't figure out. I tried running a repair and optimize on the tables, and that didn't help. I don't know what could have changed so much that a query would go from less than a second to 20+ seconds. Any help would be much appreciated.
SELECT city
FROM listings LEFT JOIN agencies
ON listings.agencyid_fk = agencies.agencyid
WHERE listingstatus IN (1,3) AND appid_fk = 101 AND active = 1
AND auction IS NULL AND agencies.multilist = 1
AND notagency IS NULL
GROUP BY city
ORDER BY city;
I wasn't sure how to export the EXPLAIN result to make it readable on here, so here it is as plain text:
+----+-------------+----------+--------+------------------------+----------+---------+---------------------------+-------+----------------------------------------------+
| id | select_type | table    | type   | possible_keys          | key      | key_len | ref                       | rows  | Extra                                        |
+----+-------------+----------+--------+------------------------+----------+---------+---------------------------+-------+----------------------------------------------+
|  1 | SIMPLE      | listings | ref    | appid_fk,listingstatus | appid_fk | 2       | const                     | 21699 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | agencies | eq_ref | PRIMARY,active         | PRIMARY  | 2       | mls2.listings.agencyid_fk |     1 | Using where                                  |
+----+-------------+----------+--------+------------------------+----------+---------+---------------------------+-------+----------------------------------------------+
And here are the tables...
listings table:
CREATE TABLE mls2.listings (
listingid INT(11) AUTO_INCREMENT NOT NULL,
appid_fk SMALLINT(3) NOT NULL DEFAULT '0',
agencyid_fk SMALLINT(3) NOT NULL DEFAULT '0',
listingstatus SMALLINT(3),
city VARCHAR(30) CHARACTER SET latin1 COLLATE latin1_swedish_ci,
multilist TINYINT(1),
auction TINYINT(1),
PRIMARY KEY (listingid)
) ENGINE = myisam ROW_FORMAT = DEFAULT CHARACTER SET latin1;
agencies table:
CREATE TABLE mls2.agencies (
agencyid SMALLINT(6) AUTO_INCREMENT NOT NULL,
multilist TINYINT(4) DEFAULT '0',
notagency TINYINT(1),
active TINYINT(1),
PRIMARY KEY (agencyid)
) ENGINE = myisam ROW_FORMAT = DEFAULT CHARACTER SET latin1;
Once you've taken on board the comments above, try adding the following indexes to your tables...
INDEX(city,listingstatus,appid_fk,auction)
INDEX(active,multilist,notagency)
In both cases, the order in which columns are arranged in the index may make a difference, so play around with those, although there are so few rows in the agencies table, that that one won't really matter.
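For example (a sketch; the index names are my own invention):
ALTER TABLE listings ADD INDEX ix_listings_city_status_app_auction (city, listingstatus, appid_fk, auction);
ALTER TABLE agencies ADD INDEX ix_agencies_active_multilist_notagency (active, multilist, notagency);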
Next, get rid of the GROUP BY clause, and write your query as follows.
SELECT DISTINCT l.city
FROM listings l
JOIN agencies a
ON a.agencyid = l.agencyid_fk
WHERE l.listingstatus IN (1,3)
AND l.appid_fk = 101
AND a.active = 1
AND l.auction IS NULL
AND a.multilist = 1
AND a.notagency IS NULL
ORDER BY city;
Note: Although irrelevant for this particular problem, your original question showed that this schema is desperately in need of normalisation.
I am running a MySQL database.
I have the following script:
DROP TABLE IF EXISTS `org_apiinteg_assets`;
DROP TABLE IF EXISTS `assessmentinstances`;
CREATE TABLE `org_apiinteg_assets` (
`id` varchar(20) NOT NULL default '0',
`instance_id` varchar(20) default NULL,
PRIMARY KEY (`id`)
) ENGINE= MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=1;
CREATE TABLE `assessmentinstances` (
`id` varchar(20) NOT NULL default '0',
`title` varchar(180) default NULL,
PRIMARY KEY (`id`)
) ENGINE= MyISAM DEFAULT CHARSET=utf8 PACK_KEYS=1;
INSERT INTO assessmentinstances(id, title) VALUES ('14026lvplotw6','One radio question survey');
INSERT INTO org_apiinteg_assets(id, instance_id) VALUES ('8kp9wgx43jflrgjfe','14026lvplotw6');
Looks like this
assessmentinstances
+---------------+---------------------------+
| id | title |
+---------------+---------------------------+
| 14026lvplotw6 | One radio question survey |
+---------------+---------------------------+
org_apiinteg_assets
+-------------------+---------------+
| id | instance_id |
+-------------------+---------------+
| 8kp9wgx43jflrgjfe | 14026lvplotw6 |
+-------------------+---------------+
And I then have the following query (I reduced it to the simplest failing query)
SELECT ai.id, COUNT(*) AS `count`
FROM assessmentinstances ai, org_apiinteg_assets a
WHERE a.instance_id = ai.id
AND ai.id = '14026lvplotw6'
AND a.id != '8kp9wgx43jflrgjfe';
When I run the query I get this
null, 0
Until now, all is good. Now, here is my issue, when I recreate both tables with ENGINE=InnoDB instead of ENGINE=MyISAM and run the same query again, I get this:
'14026lvplotw6','0'
So 2 things are confusing me:
Why don't I get the same result?
How can the COUNT(*) return 0 in the second case when it actually returns values for the row, and should therefore be 1?
I am lost, I'd appreciate if anybody could explain this behaviour to me.
EDIT:
Interestingly, if I add GROUP BY ai.id at the end of the query, it works fine in both cases and returns no rows.
This happens because you are using an aggregate function without GROUP BY. In this case the result for the non-aggregated column is unpredictable (typically it shows the first value encountered during the query).
Try adding a GROUP BY
SELECT ai.id, COUNT(*) AS `count`
FROM assessmentinstances ai, org_apiinteg_assets a
WHERE a.instance_id = ai.id
AND a.id != '8kp9wgx43jflrgjfe'
AND ai.id = '14026lvplotw6'
GROUP BY ai.id;
Remember that using aggregation in the presence of columns not mentioned in the GROUP BY is deprecated in standard SQL and is not allowed by most databases, nor by more recent versions of MySQL (starting from 5.7, where ONLY_FULL_GROUP_BY is part of the default sql_mode).
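You can check whether that mode is active on your server (a quick diagnostic, nothing more):
SELECT @@sql_mode;
-- look for ONLY_FULL_GROUP_BY in the returned list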
EXPLAIN SELECT for MyISAM returns: Impossible WHERE noticed after reading const tables. So MyISAM isn't processing any data at all.
For InnoDB there are two rows of EXPLAIN results: one Using index and one Using where. So InnoDB data is being scanned, and bits of it slip into the output, since there is no aggregate function specified for the first column and, AFAIK, it is not specified what should happen in such a situation. If you directly specify some aggregate function, then if there are no matching rows, it will return NULL. So, for example, SELECT min(ai.id), COUNT(*) ... would return NULL, 0.
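Spelled out against the tables from the question (this just makes the suggestion above concrete):
SELECT MIN(ai.id) AS id, COUNT(*) AS `count`
FROM assessmentinstances ai, org_apiinteg_assets a
WHERE a.instance_id = ai.id
AND ai.id = '14026lvplotw6'
AND a.id != '8kp9wgx43jflrgjfe';
-- returns NULL, 0 under both storage engines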
In MySQL slow query log I have the following query:
SELECT * FROM `news_items`
WHERE `ctime` > 1465013901 AND `feed_id` IN (1, 2, 9) AND
`moderated` = '1' AND `visibility` = '1'
ORDER BY `views` DESC
LIMIT 5;
Here is the result of EXPLAIN:
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
| 1 | SIMPLE | news_items | index | feed_id,ctime,ctime_2,feed_id_2,moderated,visibility,feed_id_3,cday_complex,feed_id_4 | views | 4 | NULL | 5 | Using where |
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
1 row in set (0.00 sec)
When I run this query manually, it takes like 0.00 sec, but for some reason it appears in MySQL's slow log taking 1-5 seconds sometimes. I believe it happens when the server is under high load.
Here is the table structure:
CREATE TABLE IF NOT EXISTS `news_items` (
`item_id` int(10) NOT NULL,
`category_id` int(10) NOT NULL,
`source_id` int(10) NOT NULL,
`feed_id` int(10) NOT NULL,
`title` varchar(255) CHARACTER SET utf8 NOT NULL,
`announce` varchar(255) CHARACTER SET utf8 NOT NULL,
`content` text CHARACTER SET utf8 NOT NULL,
`hyperlink` varchar(255) CHARACTER SET utf8 NOT NULL,
`ctime` varchar(11) CHARACTER SET utf8 NOT NULL,
`cday` tinyint(2) NOT NULL,
`img` varchar(100) CHARACTER SET utf8 NOT NULL,
`video` text CHARACTER SET utf8 NOT NULL,
`gallery` text CHARACTER SET utf8 NOT NULL,
`comments` int(11) NOT NULL DEFAULT '0',
`views` int(11) NOT NULL DEFAULT '0',
`visibility` enum('1','0') CHARACTER SET utf8 NOT NULL DEFAULT '0',
`pin` tinyint(1) NOT NULL,
`pin_dttm` varchar(10) CHARACTER SET utf8 NOT NULL,
`moderated` tinyint(1) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The index named "views" consists of one field only: views.
I also have many other indexes consisting of (for example):
feed_id + views + visibility + moderated
moderated + visibility + feed_id + ctime
moderated + visibility + feed_id + views + ctime
I used the fields in the order mentioned because that was the only way I could get MySQL to use the indexes. However, I never got "Using where; Using index" in EXPLAIN.
Any ideas on how to make EXPLAIN show me "Using index"?
If you have changed the storage engine to InnoDB and created the correct composite index, you can try this. The inner query only gets the item_id values for the first 5 rows. LIMIT is applied only after the complete SELECT is done, so it is better to do that part without dragging along the big columns, and then fetch the whole row only for those 5 ids.
SELECT idata.* FROM (
SELECT item_id FROM `news_items`
WHERE `ctime` > 1465013901 AND `feed_id` IN (1, 2, 9) AND
`moderated` = '1' AND `visibility` = '1'
ORDER BY `views` DESC
LIMIT 5 ) as i_ids
LEFT JOIN news_items AS idata ON idata.item_id = i_ids.item_id
ORDER BY `views` DESC;
If your table "also have many other indexes", why do they not show in the SHOW CREATE TABLE?
There are two ways that
WHERE `ctime` > 1465013901
AND `feed_id` IN (1, 2, 9)
AND `moderated` = '1'
AND `visibility` = '1'
ORDER BY `views` DESC
could use indexing:
INDEX(views) and hope that the desired 5 rows (see LIMIT) show up early.
INDEX(moderated, visibility, feed_id, ctime)
This last 'composite' index starts with the two columns (in either order) that are compared = constant, then moves on to the IN, and finally the "range" (ctime > const). Older versions won't get past the IN; newer versions will leapfrog through the IN values and make use of the range on ctime.
It is useless to include the ORDER BY columns in a composite index before all of the WHERE columns, and in your case it will not be useful to include views after them either, because of the "range" on ctime.
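As a sketch, creating that composite index might look like this (the index name is mine):
ALTER TABLE news_items ADD INDEX ix_mod_vis_feed_ctime (moderated, visibility, feed_id, ctime);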
The tip about 'lazy evaluation' that Bernd suggests will also help.
I agree that InnoDB would probably be better.
To answer your question: "Using index" means that MySQL will use only the index to satisfy your query. To do this we would need to create a "covering" index (an index which "covers" the query), i.e. one which covers the WHERE, the ORDER BY/GROUP BY, and all fields from the SELECT. However, you are doing SELECT *, so that will not be practical.
MySQL chooses the index on views because you have LIMIT 5 in the query. It does that because 1) the index is small and 2) it can avoid a filesort this way.
I believe the problem is not with the index but rather with ENGINE=MyISAM. MyISAM uses table-level locking, so any write to news_items locks the whole table. I would suggest converting the table to InnoDB.
Another possibility may be that, if the table is large, an index on (views) is not the best option.
If you use Percona Server you can enable slow log verbosity option and see the query plan for the slow query as described here: https://www.percona.com/doc/percona-server/5.5/diagnostics/slow_extended_55.html
I have a MySQL DB of ~1 million entries.
I run the query:
SELECT a.id as aid, a.title as atitle, a.slug, summary,
a.link as alink, author, published, image, a.cat as acat,
a.rss as arss, a.site as asite
FROM articles a
ORDER BY published DESC
LIMIT 616150, 50;
It takes about 5 minutes or more to load.
My TABLE and INDEXes:
CREATE TABLE IF NOT EXISTS `articles` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
`slug` varchar(255) NOT NULL,
`summary` text NOT NULL,
`link` text NOT NULL,
`author` varchar(255) NOT NULL,
`published` datetime NOT NULL,
`image` text NOT NULL,
`cat` int(11) NOT NULL,
`rss` int(11) NOT NULL,
`site` int(11) NOT NULL,
`bitly` varchar(255) NOT NULL,
`checked` tinyint(4) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `title` (`title`),
KEY `cat` (`cat`),
KEY `published` (`published`),
KEY `site` (`site`),
KEY `rss` (`rss`),
KEY `checked` (`checked`),
KEY `id_publ_index` (`id`,`published`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1230234;
What explain says:
mysql> EXPLAIN EXTENDED SELECT a.id as aid, a.title as atitle, a.slug, summary, a.link as alink, author, published, image, a.cat as acat, a.rss as arss, a.site as asite FROM articles a ORDER BY published DESC LIMIT 616150, 50;
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+----------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+----------+-------+
| 1 | SIMPLE | a | index | NULL | published | 8 | NULL | 616200 | 152.94 | |
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+----------+-------+
1 row in set, 1 warning (0.46 sec)
Any tips on how to optimize this query? Why does MySQL need to read all 616,200 rows and not just the 50 that were asked for?
Thank you for your time.
The reason that you are seeing the published key being used is because that is what you are ordering by. How often does this query need to run?
There is one simple thing that you can do to help this query run much, much faster:
Make better use of your published key. Use WHERE to define what range of dates you want to retrieve from your table.
The reason why you are reading 616,200 rows of your table right now is because you are not using the index to limit the range. MySQL needs to use your full index to:
Sort the first 616200 rows in DESC order and then
Finally limit the result to 50 rows.
If possible, you should filter the results of your database in a different way. Changing your results to be based on the WHERE (making more efficient use of your index) will be the quickest way.
For example:
SELECT a.id as aid, a.title as atitle, a.slug, summary,
a.link as alink, author, published, image, a.cat as acat,
a.rss as arss, a.site as asite
FROM articles a
WHERE published > '2010-01-01'
ORDER BY published DESC
LIMIT 6150, 50;
The sad part is that ORDER BY ... LIMIT does not scale well, and you will lose your speed very quickly (e.g., change your limit to 0, 50 and then to 900000, 50 and see how the speed is affected), so adding more information to your WHERE will help make your query much faster.
EDIT:
There is no way I can know what to display by date, so putting a WHERE is not possible. In addition, this query is run on a news aggregator that collects news every ... second. The limit is there so I can create paginated results.
Because you are inserting new posts, your LIMIT statement is going to cause the news items to jump around as a user pages through anyway. For instance, if I am on page one and three items get added before I press 'Next', then by the time I click 'Next', I will see the last three items from the previous page.
For the best possible user experience, you should try adding the ID of the last seen news item or the date of the last seen news item to the pagination somehow. This could be done either by sessions or part of the query URL, but it will allow you to have better use of your indexes.
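A sketch of that "seek" style of pagination (the cutoff value below is hypothetical; in practice you would carry it over from the last row of the previous page):
SELECT a.id as aid, a.title as atitle, a.slug, summary,
a.link as alink, author, published, image, a.cat as acat,
a.rss as arss, a.site as asite
FROM articles a
WHERE published < '2012-06-01 00:00:00'  -- last seen publish date (hypothetical)
ORDER BY published DESC
LIMIT 50;
This lets the published index seek directly to the cutoff instead of counting off 616,150 rows.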
I understand why the limit is there; the question is just how to fix the query being slow after a certain number of pages have been hit.
To efficiently fix your speed issues, you will need to make better use of an index, rather than relying on LIMIT to be your sole method of pagination. LIMIT is amazing, yes, but it is not optimized for retrieving records the way you are trying to do it, because you need to sort by the date.
Even though, as you say, "there is no way I can know what to display by date" (at least currently...), there must be a way for your application to limit what needs to be fetched from your db. In the same way, Facebook does not need to go through every member's individual posts just to show results on your wall. You need to find out how it can be made more efficient.
I am having some issues with a group query with MySQL.
Question
Is there a reason why a query won't use a 10 character partial index on a varchar(255) field to optimize a group by?
Details
My setup:
CREATE TABLE `sessions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`ref_source` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`guid` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`initial_path` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`referrer_host` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`campaign` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_sessions_on_user_id` (`user_id`),
KEY `index_sessions_on_referrer_host` (`referrer_host`(10)),
KEY `index_sessions_on_initial_path` (`initial_path`(10)),
KEY `index_sessions_on_campaign` (`campaign`(10))
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
A number of columns and indexes are not shown here since they don't really impact the issue.
What I want to do is run a query to see all of the referring hosts and the number of sessions coming from each. I don't have a huge table, but it is big enough that full table scans aren't fun. The query I want to run is:
SELECT COUNT(*) AS count_all, referrer_host AS referrer_host FROM `sessions` GROUP BY referrer_host;
The explain gives:
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| 1 | SIMPLE | sessions | ALL | NULL | NULL | NULL | NULL | 303049 | Using temporary; Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
I have a partial index on referrer_host, but it isn't using it. Even if I try to USE INDEX or FORCE INDEX it doesn't help. The explain is the same, as is the performance.
If I add a full index on referrer_host, instead of a 10 character partial index, everything works better, if not instantly (350 ms vs. 10 seconds).
I have tested partial indexes that are bigger than the longest entry in the field to no avail as well. The full index is the only thing that seems to work.
With the full index, the query can scan the entire index and return the number of records pointed to for each unique key; the table isn't touched.
With the partial index, the engine doesn't know the full value of referrer_host until it looks at the record. It has to scan the whole table!
If most of the values for referrer_host are less than 10 chars, then in theory the optimiser could use the index and then only check the rows that have more than 10 chars. But because this is not a clustered index, it would have to make many non-sequential disk reads to find those records. It could end up being even slower, because a table scan will at least be a sequential read. Instead of making assumptions, the optimiser just does a scan.
Try this query:
EXPLAIN SELECT COUNT(referrer_host) AS count_all, referrer_host FROM `sessions` GROUP BY referrer_host;
Now the count will come back as 0 for the referrer_host = NULL group, but I'm uncertain if there's another way around this.
You're grouping on referrer_host for all the rows in the table. As your index doesn't include referrer_host (it contains the first 10 chars!), it's going to scan the whole table.
I'll bet that this is faster, though less detailed:
SELECT COUNT(*) AS count_all, substring(referrer_host,1,10) AS referrer_host FROM `sessions` GROUP BY referrer_host;
If you need the full referrer, index it.
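For instance (a sketch; this swaps the 10-character prefix index for a full-column one, which fits within InnoDB's key length limit since the column is utf8 varchar(255)):
ALTER TABLE sessions
DROP INDEX index_sessions_on_referrer_host,
ADD INDEX index_sessions_on_referrer_host (referrer_host);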