Finding optimal index for this MySQL query

In MySQL slow query log I have the following query:
SELECT * FROM `news_items`
WHERE `ctime` > 1465013901 AND `feed_id` IN (1, 2, 9) AND
`moderated` = '1' AND `visibility` = '1'
ORDER BY `views` DESC
LIMIT 5;
Here is the result of EXPLAIN:
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
| 1 | SIMPLE | news_items | index | feed_id,ctime,ctime_2,feed_id_2,moderated,visibility,feed_id_3,cday_complex,feed_id_4 | views | 4 | NULL | 5 | Using where |
+----+-------------+------------+-------+---------------------------------------------------------------------------------------+-------+---------+------+------+-------------+
1 row in set (0.00 sec)
When I run this query manually, it takes 0.00 sec, but for some reason it appears in MySQL's slow log taking 1-5 seconds sometimes. I believe it happens when the server is under high load.
Here is the table structure:
CREATE TABLE IF NOT EXISTS `news_items` (
`item_id` int(10) NOT NULL,
`category_id` int(10) NOT NULL,
`source_id` int(10) NOT NULL,
`feed_id` int(10) NOT NULL,
`title` varchar(255) CHARACTER SET utf8 NOT NULL,
`announce` varchar(255) CHARACTER SET utf8 NOT NULL,
`content` text CHARACTER SET utf8 NOT NULL,
`hyperlink` varchar(255) CHARACTER SET utf8 NOT NULL,
`ctime` varchar(11) CHARACTER SET utf8 NOT NULL,
`cday` tinyint(2) NOT NULL,
`img` varchar(100) CHARACTER SET utf8 NOT NULL,
`video` text CHARACTER SET utf8 NOT NULL,
`gallery` text CHARACTER SET utf8 NOT NULL,
`comments` int(11) NOT NULL DEFAULT '0',
`views` int(11) NOT NULL DEFAULT '0',
`visibility` enum('1','0') CHARACTER SET utf8 NOT NULL DEFAULT '0',
`pin` tinyint(1) NOT NULL,
`pin_dttm` varchar(10) CHARACTER SET utf8 NOT NULL,
`moderated` tinyint(1) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The index named "views" consists of one field only -- views.
I also have many other indexes consisting of (for example):
feed_id + views + visibility + moderated
moderated + visibility + feed_id + ctime
moderated + visibility + feed_id + views + ctime
I used the fields in that order because it was the only way I could get MySQL to use those indexes. However, I never got "Using where; Using index" in EXPLAIN.
Any ideas on how to make EXPLAIN show "Using index"?

If you change the storage engine to InnoDB and create the correct composite index, you can try this. The inner query gets only the item_id values for the first 5 rows. LIMIT is applied after the complete SELECT is done, so it is better to do that part without dragging the big columns along, and then fetch the whole row only for those 5 rows.
SELECT idata.* FROM (
SELECT item_id FROM `news_items`
WHERE `ctime` > 1465013901 AND `feed_id` IN (1, 2, 9) AND
`moderated` = '1' AND `visibility` = '1'
ORDER BY `views` DESC
LIMIT 5 ) as i_ids
LEFT JOIN news_items AS idata ON idata.item_id = i_ids.item_id
ORDER BY `views` DESC;

If your table "also have many other indexes", why do they not show in the SHOW CREATE TABLE?
There are two ways that
WHERE `ctime` > 1465013901
AND `feed_id` IN (1, 2, 9)
AND `moderated` = '1'
AND `visibility` = '1'
ORDER BY `views` DESC
could use indexing:
INDEX(views) and hope that the desired 5 rows (see LIMIT) show up early.
INDEX(moderated, visibility, feed_id, ctime)
This last 'composite' index starts with the two columns (in either order) that are compared = constant, then moves on to IN and finally a "range" (ctime > const). Older versions won't get past IN; newer versions will leapfrog through the IN values and make use of the range on ctime. More discussion.
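As a sketch, the suggested composite index could be created like this (the index name is mine; column order follows the rule above: '=' columns first, then the IN column, then the range column):

```sql
ALTER TABLE news_items
    ADD INDEX mod_vis_feed_ctime (moderated, visibility, feed_id, ctime);
```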
It is useless to include the ORDER BY columns in a composite index before all of the WHERE columns. And in your case, appending views would not help either, because of the "range" test on ctime.
The tip about 'lazy evaluation' that Bernd suggests will also help.
I agree that InnoDB would probably be better. Conversion tips.

To answer your question: "Using index" means that MySQL will satisfy your query from the index alone. To get it, you need a "covering" index: one that covers the WHERE clause, the ORDER BY/GROUP BY, and every field in the SELECT. However, you are doing SELECT *, so that is not practical.
MySQL chooses the index on views because you have LIMIT 5 in the query. It does that because 1) the index is small and 2) it can avoid a filesort in this case.
I believe the problem is not with the index but rather with ENGINE=MyISAM. MyISAM uses table-level locking, so any write to news_items locks the whole table. I would suggest converting the table to InnoDB.
Another possibility: if the table is large, the index on (views) may not be the best option.
If you use Percona Server you can enable slow log verbosity option and see the query plan for the slow query as described here: https://www.percona.com/doc/percona-server/5.5/diagnostics/slow_extended_55.html
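The conversion to InnoDB suggested above can be sketched as a single statement (do this in a maintenance window; it rebuilds the table, and you should verify your backups first):

```sql
ALTER TABLE news_items ENGINE=InnoDB;
```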

Related

How to optimize datetime comparisons in mysql in where clause

CONTEXT
I have a large table full of "documents" that are updated by outside sources. When I notice the updates are more recent than my last touchpoint I need to address these documents. I'm having some serious performance issues though.
EXAMPLE CODE
select count(*) from documents;
gets me back 212,494,397 documents in 1 min 15.24 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
which is approximately the actual query, gets me 55,988,860 in 14 min 36.23 sec.
select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE) limit 1;
notably also takes about 15 minutes. (This was surprising to me.)
THE PROBLEM
How do I perform the updated_at > last_indexed_at in a more
reasonable time?
DETAILS
I'm pretty certain that my query is, in some way, not sargable. Unfortunately, I can't find what about this query prevents it from being executed on a row independent basis.
select count(*)
from documents
where last_indexed_at is null or updated_at > last_indexed_at;
doesn't do any better.
nor does
select count( distinct( id ) )
from documents
where last_indexed_at is null or updated_at > last_indexed_at limit 1;
nor does
select count( distinct( id ) )
from documents limit 1;
EDIT: FOLLOW UP REQUESTED DATA
This question only involves one table (thankfully) in a rails project, so we conveniently have the rails definition for the table.
/*!40101 SET @saved_cs_client = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `documents` (
`id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_id` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`document_type` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`locale` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`allowed_ids` text COLLATE utf8mb4_unicode_ci NOT NULL,
`fields` mediumtext COLLATE utf8mb4_unicode_ci,
`created_at` datetime(6) NOT NULL,
`updated_at` datetime(6) NOT NULL,
`last_indexed_at` datetime(6) DEFAULT NULL,
`deleted_at` datetime(6) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_documents_on_document_type` (`document_type`),
KEY `index_documents_on_locale` (`locale`),
KEY `index_documents_on_last_indexed_at` (`last_indexed_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
SELECT VERSION(); got me 5.7.27-30-log
And probably most important,
explain select count(*) from documents where COALESCE( updated_at > last_indexed_at, TRUE);
gets me exactly
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
| 1 | SIMPLE | documents | NULL | ALL | NULL | NULL | NULL | NULL | 208793754 | 100.00 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+-----------+----------+-------------+
Oh! MySQL 5.7 introduced Generated Columns — which gives us a way of indexing expressions! 🥳
If you do something like this:
ALTER TABLE documents
ADD COLUMN dirty BOOL GENERATED ALWAYS AS (COALESCE(updated_at > last_indexed_at, TRUE)) STORED,
ADD INDEX index_documents_on_dirty(dirty);
...and change the query to:
SELECT COUNT(*) FROM documents WHERE dirty;
...what results do you get?
Hopefully, we're moving the work of evaluating COALESCE(updated_at > last_indexed_at, TRUE) from Read time to Write time.
Add a covering INDEX
If you had INDEX(last_indexed_at, updated_at), the 15-minute queries might run somewhat faster. (The order of the columns does not matter in this case.)
This assumes both of those columns are in the table. If so, then the query must still read every row. (I don't know if the term "sargable" covers this situation.)
The INDEX I suggest will be faster because it is "covering". By reading only the index, there is less I/O.
The repeated 15 minutes is probably because innodb_buffer_pool_size was not big enough to hold the entire table. So, it was I/O-bound. My INDEX will be smaller, hence (hopefully) small enough to fit in the buffer_pool. So, it will be faster, and even faster on the second run.
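To sanity-check that theory, you can compare the buffer pool size against the table's data + index size; a rough sketch (the table name is taken from the question):

```sql
-- Current buffer pool size, in MB:
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;

-- Approximate data + index size of the table, in MB:
SELECT (DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 AS table_mb
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'documents';
```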
Slow OR
OR is usually a terrible slowdown. But I don't think it matters here.
If you were to initialize last_indexed_at to some old date (say, '2000-01-01'), you could get rid of the COALESCE or OR.
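A sketch of that clean-up (run the one-time backfill off-peak; on a 200M-row table it will take a while):

```sql
-- One-time backfill of the NULLs with a sentinel date:
UPDATE documents
SET last_indexed_at = '2000-01-01'
WHERE last_indexed_at IS NULL;

-- The test then simplifies to a plain range comparison:
SELECT COUNT(*)
FROM documents
WHERE updated_at > last_indexed_at;
```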
Another way to clean it up is
SELECT SUM(last_indexed_at IS NULL) +
SUM(updated_at > last_indexed_at) AS "Need indexing"
FROM t;
I still need the index. SUM(boolean expression) sees the expression as 0 (false or NULL) or 1 (TRUE).
Meanwhile, I don't think COUNT(DISTINCT id) is any different from COUNT(*). And the pair of SUMs should also give you the value.
Again, I am depending on "covering" being the trick.
"More than .." trick
In some situations, you don't really need the exact number, especially if it is just "more than some threshold".
SELECT 1 FROM tbl WHERE ... LIMIT 1000,1;
If it comes back with "1", there are at least 1000 rows. If it comes back empty (no row returned), then not.
That will still have to touch up to 1000 rows (hopefully in an index), but that is better than touching a million.
You can, if you're on a recent MySQL version (5.7+), add a generated column to your table containing your search expression, then index that column.
ALTER TABLE t
ADD COLUMN needs_indexing TINYINT
GENERATED ALWAYS AS
(CASE WHEN last_indexed_at IS NULL THEN 1
WHEN updated_at > last_indexed_at THEN 1
ELSE 0 END) VIRTUAL;
ALTER TABLE t
ADD INDEX needs_indexing (needs_indexing);
This uses drive space for the index, but not in your table.
Then you can do SELECT SUM(needs_indexing) FROM t to get the number of items matching your criterion.
But: you don't have to count all the items to know you need to reindex some items. Doing a COUNT(*) on a large InnoDB table is very expensive as you have discovered. You can do this:
SELECT EXISTS (SELECT 1 FROM t WHERE needs_indexing = 1) something_needs_indexing;
You'll get 1 or 0 from this query very quickly. 1 means you have at least one row meeting your criteria.
And, of course, your indexing job can do
SELECT * FROM t WHERE needs_indexing=1 LIMIT 1;
or whatever makes sense. That will also be fast.

Different data sizes in INFORMATION_SCHEMA.INNODB_BUFFER_PAGE and INFORMATION_SCHEMA.TABLES

I'm trying to understand if the table is being loaded to InnoDB buffer. For that I'm querying INFORMATION_SCHEMA.INNODB_BUFFER_PAGE table.
From what I see, the table is fully loaded. However, amount of data loaded (MB) into buffer is very different from the numbers displayed in INFORMATION_SCHEMA.TABLES.
For example:
SELECT TABLE_NAME, TABLE_ROWS
, CAST(DATA_LENGTH/POWER(1024,2) AS DECIMAL(5, 0)) AS DATA_LENGTH_MB
, CAST(DATA_FREE/POWER(1024,2) AS DECIMAL(5, 0)) AS DATA_FREE_MB
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = '<db_name>'
AND TABLE_NAME = '<table_name>';
| TABLE_NAME | TABLE_ROWS | DATA_LENGTH_MB | DATA_FREE_MB |
|-----------------------------------------------------------|
| <table_name> | 39735968 | 10516 | 548 |
So there is around 39.7 million records and 10.5 GB in data pages according to INFORMATION_SCHEMA.TABLES
However, when I'm running this:
SELECT p.TABLE_NAME, p.INDEX_NAME
, ROUND(SUM(DATA_SIZE)/POWER(1024,2)) AS DATA_SIZE_MB
, SUM(NUMBER_RECORDS) AS NUMBER_RECORDS
FROM INFORMATION_SCHEMA.INNODB_BUFFER_PAGE AS p
WHERE p.TABLE_NAME LIKE '`<db_name>`.`<table_name>`' AND p.INDEX_NAME = 'PRIMARY'
AND p.PAGE_TYPE = 'INDEX' AND p.PAGE_STATE = 'FILE_PAGE'
GROUP BY p.TABLE_NAME, p.INDEX_NAME
ORDER BY p.TABLE_NAME, p.INDEX_NAME;
I'm getting:
| TABLE_NAME | INDEX_NAME | DATA_SIZE_MB | NUMBER_RECORDS |
-----------------------------------------------------------------------
| <db_name>.<table_name> | PRIMARY | 3505 | 45224835 |
And finally,
SELECT COUNT(1) FROM <db_name>.<table_name>;
44947428
NUMBER_RECORDS is slightly greater than TABLE_ROWS in INFORMATION_SCHEMA.TABLES, so I assume the table is fully loaded into memory, and TABLE_ROWS is either approximate or not up to date.
But why is DATA_SIZE in INFORMATION_SCHEMA.INNODB_BUFFER_PAGE so different (3.5 GB vs. 10.5 GB)?
What am I missing? Is the data size in TABLES completely incorrect?
Database is running on Amazon RDS (Aurora MySQL 5.7) if that matters.
Thanks.
P.S. CREATE TABLE statement (columns names obfuscated, sorry : )
CREATE TABLE `table_name` (
`recid` BINARY(32) NOT NULL,
`col1` INT(11) UNSIGNED NOT NULL,
`col2` TINYINT(1) UNSIGNED NOT NULL,
`col3` VARCHAR(250) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col4` TINYINT(1) UNSIGNED NOT NULL,
`col5` VARCHAR(250) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col6` TINYINT(1) UNSIGNED NOT NULL,
`col7` TINYINT(1) UNSIGNED NOT NULL,
`col8` VARCHAR(100) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col9` TINYINT(1) UNSIGNED NOT NULL,
`col10` TINYINT(1) UNSIGNED NOT NULL,
`col11` VARCHAR(100) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col12` TINYINT(1) UNSIGNED NOT NULL DEFAULT '0',
`col13` TINYINT(1) UNSIGNED NOT NULL DEFAULT '1',
`col14` INT(11) UNSIGNED NULL DEFAULT NULL,
`col15` BINARY(32) NULL DEFAULT NULL,
`col16` CHAR(2) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col17` TINYINT(1) NULL DEFAULT NULL,
`col18` VARCHAR(50) NULL DEFAULT NULL COLLATE 'utf8_general_ci',
`col19` TINYINT(1) NULL DEFAULT NULL,
`col20` TINYINT(1) NULL DEFAULT NULL,
PRIMARY KEY (`recid`) USING BTREE,
UNIQUE INDEX `col3` (`col3`) USING BTREE,
INDEX `col5` (`col5`) USING BTREE,
INDEX `col8` (`col8`) USING BTREE
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
The Information Schema INNODB_BUFFER_PAGE table contains information about pages in the buffer pool.
Note the last 4 words.
That suggests that SUM from INNODB_BUFFER_PAGE may be smaller than what you get from INFORMATION_SCHEMA.TABLES.
(I don't know all the details, but here are some general statements.)
The buffer_pool may contain:
Some or all of the leaf nodes for a table.
Some or all of the non-leaf nodes for a table.
Ditto for leaf and non-leaf nodes for each non-PRIMARY index for a table.
Ditto for more tables.
TEXT and BLOB (and large VARCHAR) may be stored off-record. This greatly increases the disk space occupied. But I don't think this happens in your case. However, see below.
25% (tunable) of the buffer_pool is reserved for the "change buffer"; this is sort of a write cache for changes to secondary indexes.
Other stuff
A few percent of the buffer_pool is held in reserve or lost for other reasons.
Blocks are kicked out of the buffer_pool in [roughly] a least-recently-used order.
I don't know, but I would expect that it might not be possible to keep a table in the buffer_pool if that table is half the size of the buffer_pool.
Another thing to note... The Data_free metric for each table is just one of quite a few categories of "overhead" in a table.
Pre-allocated blocks (perhaps reflected in Data_free)
Unfilled blocks (perhaps no data or index block is 100% full)
Transactions lead to extra copies of rows -- these come and go, either in the undo/redo space or in the data blocks.
Block splits
etc.
A Rule of Thumb is that the disk space occupied by the data (Data_length) is 2x-3x the predicted size. ("Predicted" = adding up individual data sizes, such as 4 bytes for each INT.)
Wild idea
What is the ROW_FORMAT?
Your 3.5GB computation may be the on-record space, with all the VARCHARs stored off-record. The math almost works out.
Let's pursue 2 avenues of thought with
SELECT count(*),
AVG(LENGTH(col3)) AS avg3,
AVG(LENGTH(col5)) AS avg5,
... -- the rest of the VARCHARs
FROM table_name;
(I specifically want LENGTH(), not CHAR_LENGTH().)
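The distinction matters for multi-byte charsets: LENGTH() counts bytes (which is what determines on-disk size), while CHAR_LENGTH() counts characters. A tiny illustration, assuming a utf8 connection charset:

```sql
-- 'é' occupies 2 bytes in utf8 but is 1 character:
SELECT LENGTH('é') AS bytes, CHAR_LENGTH('é') AS chars;  -- 2, 1
```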
Sorry for the long delay. I have finally managed to confirm that there was in fact a data clean-up performed on the table in question. Around 60% of the records were deleted.
That should explain the difference between the size and n_leaf_pages values in the mysql.innodb_index_stats table. Not sure if that's normal behavior or not.
So, to answer my own question: to estimate how much a table would take in the InnoDB buffer pool, I should probably be looking at mysql.innodb_index_stats.size instead of INFORMATION_SCHEMA.TABLES.
SELECT TABLE_NAME, ROUND((stat_value*@@innodb_page_size)/POWER(1024,2)) AS DATA_SIZE_MB
FROM mysql.innodb_index_stats
WHERE database_name = 'db_name' AND index_name = 'PRIMARY' AND table_name = 'table_name'
AND stat_name = 'n_leaf_pages';
Thanks @Rick James for helping me with this one

Concurrent queries on composite index with order by id drastically slow

I have a table defined as follows:
| book | CREATE TABLE `book` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`provider_id` int(10) unsigned DEFAULT '0',
`source_id` varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
`title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`description` longtext COLLATE utf8_unicode_ci,
PRIMARY KEY (`id`),
UNIQUE KEY `provider` (`provider_id`,`source_id`),
KEY `idx_source_id` (`source_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1605425 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci |
When there are about 10 concurrent reads with the following SQL:
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '1037122800') ORDER BY `book`.`id` ASC LIMIT 1
it becomes slow, it takes about 100 ms.
however if I changed it to
SELECT * FROM `book` WHERE (provider_id = '1' AND source_id = '221630001') LIMIT 1
then it is normal, it takes several ms.
I don't understand why adding ORDER BY id makes the query much slower. Could anyone explain?
Try selecting only the columns you need (SELECT column_name, ...) instead of *, or refer to this:
Why is my SQL Server ORDER BY slow despite the ordered column being indexed?
I'm not a mysql expert, and not able to perform a detailed analysis, but my guess would be that because you are providing values for the UNIQUE KEY in the WHERE clause, the engine can go and fetch that row directly using an index.
However, when you ask it to ORDER BY the id column, which is a PRIMARY KEY, that changes the access path. The engine now guesses that since it has an index on id, and you want to order by id, it is better to fetch that data in PK order, which will avoid a sort. In this case though, it leads to a slower result, as it has to compare every row to the criteria (a table scan).
Note that this is just conjecture. You would need to EXPLAIN both statements to see what is going on.
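One way to test that conjecture is to force the composite key (the index name `provider` comes from the table definition above) and compare plans and timings:

```sql
-- See which index the optimizer picks for the slow form:
EXPLAIN SELECT * FROM book
WHERE provider_id = 1 AND source_id = '1037122800'
ORDER BY id ASC LIMIT 1;

-- Force the unique composite key; if this is fast, the PK-scan theory holds:
SELECT * FROM book FORCE INDEX (provider)
WHERE provider_id = 1 AND source_id = '1037122800'
ORDER BY id ASC LIMIT 1;
```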

Is there any way to optimize this SELECT query any further?

I have a MySQL table that is filled with mails from a postfix mail log. The table is updated very often, some times multiple times per second. Here's the SHOW CREATE TABLE output:
Create Table postfix_mails CREATE TABLE `postfix_mails` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`mail_id` varchar(20) COLLATE utf8_danish_ci NOT NULL,
`host` varchar(30) COLLATE utf8_danish_ci NOT NULL,
`queued_at` datetime NOT NULL COMMENT 'When the message was received by the MTA',
`attempt_at` datetime NOT NULL COMMENT 'When the MTA last attempted to relay the message',
`attempts` smallint(5) unsigned NOT NULL,
`from` varchar(254) COLLATE utf8_danish_ci DEFAULT NULL,
`to` varchar(254) COLLATE utf8_danish_ci NOT NULL,
`source_relay` varchar(100) COLLATE utf8_danish_ci DEFAULT NULL,
`target_relay` varchar(100) COLLATE utf8_danish_ci DEFAULT NULL,
`target_relay_status` enum('sent','deferred','bounced','expired') COLLATE utf8_danish_ci NOT NULL,
`target_relay_comment` varchar(4098) COLLATE utf8_danish_ci NOT NULL,
`dsn` varchar(10) COLLATE utf8_danish_ci NOT NULL,
`size` int(11) unsigned NOT NULL,
`delay` float unsigned NOT NULL,
`delays` varchar(50) COLLATE utf8_danish_ci NOT NULL,
`nrcpt` smallint(5) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `mail_signature` (`host`,`mail_id`,`to`),
KEY `from` (`from`),
KEY `to` (`to`),
KEY `source_relay` (`source_relay`),
KEY `target_relay` (`target_relay`),
KEY `target_relay_status` (`target_relay_status`),
KEY `mail_id` (`mail_id`),
KEY `last_attempt_at` (`attempt_at`),
KEY `queued_at` (`queued_at`)
) ENGINE=InnoDB AUTO_INCREMENT=111592 DEFAULT CHARSET=utf8 COLLATE=utf8_danish_ci
I want to know how many mails were relayed through a specific host on a specific date, so I'm using this query:
SELECT COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` LIKE '2016-04-11%'
AND `host` = 'mta03'
The query takes between 100 and 110 ms.
Currently the table contains about 70 000 mails, and the query returns around 31 000. This is only a couple of days' worth of mails, and I plan to keep at least a month. The query cache doesn't help much because the table is getting updated constantly.
I have tried doing this instead:
SELECT SQL_NO_CACHE COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` >= '2016-04-11'
AND `queued_at` < '2016-04-12'
AND `host` = 'mta03'
But the query takes the exact same time to run. I have made these changes to the MySQL configuration:
[mysqld]
query_cache_size = 128M
key_buffer_size = 256M
read_buffer_size = 128M
sort_buffer_size = 128M
innodb_buffer_pool_size = 4096M
And confirmed that they are all in effect (SHOW VARIABLES) but the query doesn't run any faster.
Am I doing something stupid that makes this query take this long? Can you spot any obvious or non-obvious ways to make it faster? Is there another database engine that works better than InnoDB in this scenario?
mysql> EXPLAIN SELECT SQL_NO_CACHE COUNT(*) as `count`
-> FROM `postfix_mails`
-> WHERE `queued_at` >= '2016-04-11'
-> AND `queued_at` < '2016-04-12'
-> AND `host` = 'mta03';
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
| 1 | SIMPLE | postfix_mails | ref | mail_signature,queued_at | mail_signature | 92 | const | 53244 | Using where |
+----+-------------+---------------+------+--------------------------+----------------+---------+-------+-------+-------------+
1 row in set (0.00 sec)
queued_at is a datetime value. Don't use LIKE. That converts it to a string, preventing the use of indexes and imposing a full-table scan. Instead, you want an appropriate index and to fix the query.
The query is:
SELECT COUNT(*) as `count`
FROM `postfix_mails`
WHERE `queued_at` >= '2016-04-11' AND `queued_at` < DATE_ADD('2016-04-11', interval 1 day) AND
`host` = 'mta03';
Then you want a composite index on postfix_mails(host, queued_at). The host column needs to be first.
Note: If your current version is counting 31,000 out of 70,000 emails, then an index will not be much help for that. However, this will make the code more scalable for the future.
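A sketch of the suggested composite index (the index name is mine):

```sql
ALTER TABLE postfix_mails
    ADD INDEX host_queued_at (host, queued_at);
```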
If you need your query to be really fast, you'll need to materialize it.
MySQL lacks a way to do that natively, so you'll have to create a table like that:
CREATE TABLE mails_host_day
(
host VARCHAR(30) NOT NULL,
day DATE NOT NULL,
mails BIGINT NOT NULL,
PRIMARY KEY (host, day)
)
and update it either in a trigger on postfix_mails or with a script once in a while:
INSERT
INTO mails_host_day (host, day, mails)
SELECT host, CAST(queued_at AS DATE), COUNT(*)
FROM postfix_mails
WHERE id > :last_sync_id
GROUP BY
host, CAST(queued_at AS DATE)
ON DUPLICATE KEY
UPDATE mails = mails + VALUES(mails)
This way, querying a host-day entry is a single primary key seek.
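For example, the per-host, per-day count then becomes a single primary key lookup:

```sql
SELECT mails
FROM mails_host_day
WHERE host = 'mta03' AND day = '2016-04-11';
```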
Note that the trigger-based solution will affect DML performance, while the script-based solution will return slightly stale data.
However, you can improve the freshness of the script-based solution if you union the most recent data with the stored results:
SELECT host, day, SUM(mails) AS mails
FROM (
SELECT host, day, mails
FROM mails_host_day
UNION ALL
SELECT host, CAST(queued_at AS DATE) AS day, COUNT(*) AS mails
FROM postfix_mails
WHERE id >= :last_sync_id
GROUP BY
host, CAST(queued_at AS DATE)
) q
It's not a single index seek anymore; however, if you run the update script often enough, there will be fewer recent records to read.
You have a unique key on 'host', 'mail_id', and 'to', however when the query engine tries to use that index, you aren't filtering on 'mail_id' and 'to', so it may not be as efficient. A solution could be to add another index just on 'host' or add AND 'mail_id' IS NOT NULL AND'to' IS NOT NULL to your query to fully utilize the existing unique index.
You could use pagination to speed up queries in PHP which is usually how I resolve anything that contains a large amount of data - but this depends on your Table hierarchy.
Integrate your LIMIT in the SQL query.
PHP:
$stmt = $db->prepare("SELECT COUNT(*) AS `count`
                      FROM `postfix_mails`
                      WHERE DATEDIFF(`queued_at`, '2016-04-11') = 0
                        AND mail_id < :limit");
$stmt->execute(array(':limit' => $_POST['limit']));
foreach ($stmt as $row)
{
    // normal output
}
jQuery:
$(document).ready( function() {
    var starting = 1;
    $('#next').click( function() {
        starting = starting + 10;
        $.post('phpfilehere.php', { limit: starting })
            .done( function(data) {
                $('#mail-output').html(data);
            });
    });
});
Here, each page shows 10 more emails; of course you can change this and modify it, and even add a search (I have an object I reuse for this in all my projects).
I just thought I'd share the idea; it also adds a real-time data flow to your site.
This was inspired by Facebook's "scroll for more" feed, which really isn't hard to build and is such a good way to query a lot of data.

MySQL partial indexes on varchar fields and group by optimization

I am having some issues with a group query with MySQL.
Question
Is there a reason why a query won't use a 10 character partial index on a varchar(255) field to optimize a group by?
Details
My setup:
CREATE TABLE `sessions` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`ref_source` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`guid` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`initial_path` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`referrer_host` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`campaign` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `index_sessions_on_user_id` (`user_id`),
KEY `index_sessions_on_referrer_host` (`referrer_host`(10)),
KEY `index_sessions_on_initial_path` (`initial_path`(10)),
KEY `index_sessions_on_campaign` (`campaign`(10))
) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
A number of columns and indexes are not shown here since they don't really impact the issue.
What I want to do is run a query to see all of the referring hosts and the number of sessions coming from each. I don't have a huge table, but it is big enough that full table scans aren't fun. The query I want to run is:
SELECT COUNT(*) AS count_all, referrer_host AS referrer_host FROM `sessions` GROUP BY referrer_host;
The explain gives:
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
| 1 | SIMPLE | sessions | ALL | NULL | NULL | NULL | NULL | 303049 | Using temporary; Using filesort |
+----+-------------+----------+------+---------------+------+---------+------+--------+---------------------------------+
I have a partial index on referrer_host, but it isn't using it. Even if I try to USE INDEX or FORCE INDEX it doesn't help. The explain is the same, as is the performance.
If I add a full index on referrer_host instead of the 10-character partial index, everything works much better, if not instantly (350 ms vs. 10 seconds).
I have tested partial indexes that are bigger than the longest entry in the field to no avail as well. The full index is the only thing that seems to work.
With the full index, the query can scan the entire index and return the number of records for each unique key; the table isn't touched.
With the partial index, the engine doesn't know the full value of referrer_host until it looks at the record. It has to scan the whole table!
If most of the values for referrer_host are less than 10 chars, then in theory the optimiser could use the index and then only check the rows that have more than 10 chars. But because this is not a clustered index, it would have to make many non-sequential disk reads to find those records. It could end up being even slower, because a table scan is at least a sequential read. Instead of making assumptions, the optimiser just does a scan.
Try this query:
EXPLAIN SELECT COUNT(referrer_host) AS count_all, referrer_host FROM `sessions` GROUP BY referrer_host;
Now the count will skip rows where referrer_host is NULL in the GROUP BY, but I'm uncertain whether there's another way around this.
You're grouping on referrer_host for all the rows in the table. As your index doesn't contain the full referrer_host (it has only the first 10 chars!), it's going to scan the whole table.
I'll bet that this is faster, though less detailed:
SELECT COUNT(*) AS count_all, substring(referrer_host,1,10) AS referrer_host FROM `sessions` GROUP BY referrer_host;
If you need the full referrer, index it.
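A sketch of swapping the partial index for a full one (index name taken from the table definition; at utf8, a 255-char key is 765 bytes, which fits under InnoDB's 767-byte prefix limit):

```sql
ALTER TABLE sessions
    DROP INDEX index_sessions_on_referrer_host,
    ADD INDEX index_sessions_on_referrer_host (referrer_host);
```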