Speed of MySQL query on tables containing BLOB depends on filesystem cache

I have a table with approximately 120k rows which contains a BLOB field (no more than 1MB per entry, usually much less). My problem is that whenever I run a query selecting any columns from this table (not including the BLOB one), if the filesystem cache is empty, it takes approximately 40 seconds to complete. All subsequent queries on the same table complete in less than 1 second (testing from the command-line client, on the server itself). The number of rows returned by the queries varies from an empty set to 60k+.
I have ruled out the query cache, so it has nothing to do with it.
The table is MyISAM, but I also tried changing it to InnoDB (and setting ROW_FORMAT=COMPACT), without any luck.
If I remove the BLOB column, the query is always fast.
So I would assume that the server reads the BLOBs (or parts of them) from disk and the filesystem caches them. The problem is that on a server with high traffic and limited memory, the filesystem cache is refreshed every once in a while, so this particular query keeps causing me trouble.
So my question is, is there a way to considerably speed things up, without removing the blob column from the table?
Here are two example queries, run one after the other, along with the EXPLAIN output, the indexes and the table definition:
mysql> SELECT ct.score FROM completed_tests ct where ct.status != 'deleted' and ct.status != 'failed' and score < 100;
Empty set (48.21 sec)
mysql> SELECT ct.score FROM completed_tests ct where ct.status != 'deleted' and ct.status != 'failed' and score < 99;
Empty set (1.16 sec)
mysql> explain SELECT ct.score FROM completed_tests ct where ct.status != 'deleted' and ct.status != 'failed' and score < 99;
+----+-------------+-------+-------+---------------+--------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------+---------+------+-------+-------------+
| 1 | SIMPLE | ct | range | status,score | status | 768 | NULL | 82096 | Using where |
+----+-------------+-------+-------+---------------+--------+---------+------+-------+-------------+
1 row in set (0.00 sec)
mysql> show indexes from completed_tests;
+-----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| completed_tests | 0 | PRIMARY | 1 | id | A | 583938 | NULL | NULL | | BTREE | |
| completed_tests | 1 | users_login | 1 | users_LOGIN | A | 11449 | NULL | NULL | YES | BTREE | |
| completed_tests | 1 | tests_ID | 1 | tests_ID | A | 140 | NULL | NULL | | BTREE | |
| completed_tests | 1 | status | 1 | status | A | 3 | NULL | NULL | YES | BTREE | |
| completed_tests | 1 | timestamp | 1 | timestamp | A | 291969 | NULL | NULL | | BTREE | |
| completed_tests | 1 | archive | 1 | archive | A | 1 | NULL | NULL | | BTREE | |
| completed_tests | 1 | score | 1 | score | A | 783 | NULL | NULL | YES | BTREE | |
| completed_tests | 1 | pending | 1 | pending | A | 1 | NULL | NULL | | BTREE | |
+-----------------+------------+-------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
mysql> show create table completed_tests;
+-----------------+--------------------------------------
| Table | Create Table |
+-----------------+--------------------------------------
| completed_tests | CREATE TABLE `completed_tests` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`users_LOGIN` varchar(100) DEFAULT NULL,
`tests_ID` mediumint(8) unsigned NOT NULL DEFAULT '0',
`test` longblob,
`status` varchar(255) DEFAULT NULL,
`timestamp` int(10) unsigned NOT NULL DEFAULT '0',
`archive` tinyint(1) NOT NULL DEFAULT '0',
`time_start` int(10) unsigned DEFAULT NULL,
`time_end` int(10) unsigned DEFAULT NULL,
`time_spent` int(10) unsigned DEFAULT NULL,
`score` float DEFAULT NULL,
`pending` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `users_login` (`users_LOGIN`),
KEY `tests_ID` (`tests_ID`),
KEY `status` (`status`),
KEY `timestamp` (`timestamp`),
KEY `archive` (`archive`),
KEY `score` (`score`),
KEY `pending` (`pending`)
) ENGINE=InnoDB AUTO_INCREMENT=117996 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED
1 row in set (0.00 sec)
I originally posted this as "mysql query slow at first fast afterwards", but I now have more information, so I am reposting it as a different question.
I also posted this on the MySQL forum, but I haven't heard back.
Thanks in advance as always

The design of BLOB (= TEXT) storage in MySQL seems to be totally flawed and counter-intuitive. I ran into the same problem a couple of times and was unable to find any authoritative explanation. The most detailed analysis I've finally found is this post from 2010: http://www.mysqlperformanceblog.com/2010/02/09/blob-storage-in-innodb/
The general belief and expectation is that BLOBs/TEXTs are stored outside the main row storage (e.g., see this answer). This is NOT TRUE, though. There are several issues here (I'm basing this on the article linked above):
If the size of a BLOB item is several KB, it is included directly in the row data. Consequently, even if you SELECT only non-BLOB columns, the engine still has to load all your BLOBs from disk. Say you have 1M rows with 100 bytes of non-BLOB data each and 5000 bytes of BLOB data. You SELECT all non-BLOB columns and expect that MySQL would read around 100-120 bytes per row from disk, which is 100-120 MB in total (plus 20 bytes per row for the BLOB address). However, the reality is that MySQL stores all BLOBs in the same disk blocks as the rows, so they all must be read together even if not used, and so the amount of data read from disk is around 5100 MB = 5 GB. This is 50 times more than you would expect, and means 50 times slower query execution.
Of course, this design has an advantage: when you need all the columns, including the BLOB one, a SELECT query is faster when BLOBs are stored with the row than when stored externally, because you avoid (sometimes) 1 additional page access per row. However, this is not a typical use case for BLOBs, and the DB engine should not be optimized towards it. If your data is so small that it fits in a row and you're fine with loading it in every query, whether needed or not, then you would use the VARCHAR type instead of BLOB/TEXT.
Even if for some reason (long row or long BLOB) the BLOB value is stored externally, its 768-byte prefix is still kept in the row itself. Let's take the previous example: you have 100 bytes of non-BLOB data in each row, but now the BLOB column holds items of 1 MB each, so they must be kept externally. A SELECT of non-BLOB columns will have to read roughly 870 bytes per row (the non-BLOB data plus the 768-byte prefix) instead of 100-120. That is again a roughly 7x larger disk transfer than you'd expect, and 7x slower query execution.
External BLOB storage is inefficient in its usage of disk space: it allocates space in blocks of 16 KB, and a single block cannot hold multiple items, so if your BLOBs are small and take, for instance, 8 KB each, the actual space allocated is twice that size.
I hope this design gets fixed one day: MySQL would store ALL BLOBs, big and small, in external storage, without any prefixes kept in the row, and with external storage allocation that is efficient for items of all sizes. Until this happens, separating out BLOB/TEXT columns seems the only reasonable solution: either into another table, or into the filesystem (each BLOB value kept as a file).
[UPDATE 2019-10-15]
The InnoDB documentation now provides a definitive answer to the issue discussed above:
https://dev.mysql.com/doc/refman/8.0/en/innodb-row-format.html
Storing 768-byte prefixes of BLOB/TEXT values inline does indeed happen with the COMPACT row format. According to the docs, "For each non-NULL variable-length field (...) The internal part is 768 bytes".
However, you can use DYNAMIC row format instead. With this format:
"InnoDB can store long variable-length column values (...) fully off-page, with the clustered index record containing only a 20-byte pointer to the overflow page. (...) TEXT and BLOB columns that are less than or equal to 40 bytes are stored in line."
Here, a BLOB value can occupy at most 40 bytes of inline storage, which is much better than the 768 bytes of COMPACT mode, and looks like a far more reasonable approach when you want to mix BLOB and non-BLOB types in a table and still be able to scan multiple rows quickly. Moreover, the extended (over 20 bytes) inline storage is used ONLY for values between 20 and 40 bytes in size; for larger values, only the 20-byte pointer is stored (no prefix), unlike in COMPACT mode. Hence, the extended 40-byte storage is rarely used in practice, and one can safely assume the average size of inline storage to be just 20 bytes (or less, if you tend to keep many small values of under 20 B in your BLOB). All in all, it seems the DYNAMIC row format, rather than COMPACT, should be the default choice in most cases to achieve good, predictable performance of BLOB columns in InnoDB.
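For illustration, a hedged sketch of converting a table like the asker's completed_tests to the DYNAMIC row format. Note this rebuilds the table, so expect it to take a while on large tables, and on MySQL 5.6 and earlier the Barracuda file format is also required:

-- Sketch only: this rebuilds the table.
-- On MySQL < 5.7 you may first need:
--   SET GLOBAL innodb_file_format = Barracuda;
ALTER TABLE completed_tests ROW_FORMAT=DYNAMIC;

-- Verify the row format actually in use afterwards
-- (the table is INNODB_SYS_TABLES on 5.7, INNODB_TABLES on 8.0):
SELECT name, row_format
FROM information_schema.innodb_sys_tables
WHERE name LIKE '%completed_tests%';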
An example of how to check the actual physical storage in InnoDB can be found here:
https://dba.stackexchange.com/a/210430/177276
As to MyISAM, it apparently does NOT provide off-page storage for BLOBs at all (just inline). Check here for more info:
https://dev.mysql.com/doc/refman/5.7/en/dynamic-format.html
https://forums.mysql.com/read.php?24,105964,267596#msg-267596

I researched this issue for a while. Many people recommend keeping the BLOB, with only a primary key, in a separate table, and storing the BLOBs' metadata in another table with a foreign key to the BLOB table. With this, the performance will be considerably higher.
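A minimal sketch of that layout, with hypothetical table and column names (not the asker's actual schema):

-- BLOB table: a primary key and the payload, nothing else.
CREATE TABLE file_blobs (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  payload LONGBLOB,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Metadata table: narrow rows, cheap to scan, with a FK to the BLOB table.
CREATE TABLE file_meta (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  blob_id INT UNSIGNED NOT NULL,
  name VARCHAR(255),
  status VARCHAR(32),
  PRIMARY KEY (id),
  KEY ix_blob_id (blob_id),
  CONSTRAINT fk_file_meta_blob FOREIGN KEY (blob_id) REFERENCES file_blobs (id)
) ENGINE=InnoDB;

-- Metadata-only queries never touch BLOB pages:
SELECT id, name FROM file_meta WHERE status <> 'deleted';

-- The payload is joined in only when actually needed:
SELECT m.name, b.payload
FROM file_meta m
JOIN file_blobs b ON b.id = m.blob_id
WHERE m.id = 42;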

Adding a composite index on the two relevant columns should allow these queries to be executed without accessing the table data directly.
CREATE INDEX `IX_score_status` ON `completed_tests` (`score`, `status`);
If you are able to switch to MariaDB, then you can make the most of its table elimination optimisations. This would allow you to split the BLOB field out into its own table and use a view with a LEFT JOIN to recreate your existing table structure. This way, the BLOB data will only be accessed if it is explicitly required by the executing query.
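A hedged sketch of that approach, assuming the test column has been dropped from completed_tests and moved into a hypothetical completed_tests_blob table keyed by the same id:

-- BLOB split out into its own table (hypothetical layout):
CREATE TABLE completed_tests_blob (
  id MEDIUMINT UNSIGNED NOT NULL,
  test LONGBLOB,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- View recreating the original shape; MariaDB's table elimination
-- should skip the LEFT JOIN entirely for queries that never read `test`,
-- since the join is on a primary key and cannot change the row count.
CREATE VIEW completed_tests_v AS
SELECT ct.*, b.test
FROM completed_tests ct
LEFT JOIN completed_tests_blob b ON b.id = ct.id;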

Just add an index (or indexes) on the fields used in the WHERE clauses of your queries against a table with BLOBs.
E.g. you have two tables with these fields:
users : USERID, NAME, ...
userphotos : BLOBID, BLOB, USERNO, ...
select * from userphotos where USERNO=123456;
Normally this works fine. But when you have many large images (e.g. BLOB, MEDIUMBLOB or LONGBLOB columns totalling more than 5 GB), this will take a long time (more than minutes), even though BLOBID is the primary key.
Somehow MySQL searches through the whole data, images included, if there is no index on the field of the BLOB table used in the WHERE clause. As your data grows larger and larger, that takes more and more time. If you create an index on the USERNO field, this will speed up your database, and the speed will be independent of the total data size.
Solution: add an index on USERNO in userphotos.
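A sketch of that suggested index (the index name is illustrative):

CREATE INDEX ix_userphotos_userno ON userphotos (USERNO);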
As an answer to your question, you should similarly create an index on ct.status.

Related

Very simple MySQL index query running very slowly

I have a very simple query that is running extremely slowly despite being indexed.
My table is as follows:
mysql> show create table mytable
CREATE TABLE `mytable` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_time` datetime DEFAULT NULL,
`status` varchar(64) DEFAULT NULL,
`user_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `ix_status_user_id_start_time` (`status`,`user_id`,`start_time`),
-- other columns and indexes, not relevant
) ENGINE=InnoDB AUTO_INCREMENT=115884841 DEFAULT CHARSET=utf8
Then the following query takes more than 10 seconds to run:
select id from mytable USE INDEX (ix_status_user_id_start_time) where status = 'running';
There are about 7 million rows in the table, and approximately 200 rows have status 'running'.
I would expect this query to take less than a tenth of a second: it should find the first row in the index with status 'running', then scan the next 200 rows until it finds the first non-running one. It should not need to look outside the index.
When I explain the query I get a very strange result:
mysql> explain select id from mytable USE INDEX (ix_status_user_id_start_time) where status = 'running';
+----+-------------+---------+------------+------+------------------------------+------------------------------+---------+-------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------+------------+------+------------------------------+------------------------------+---------+-------+---------+----------+-------------+
| 1 | SIMPLE | mytable | NULL | ref | ix_status_user_id_start_time | ix_status_user_id_start_time | 195 | const | 2118793 | 100.00 | Using index |
+----+-------------+---------+------------+------+------------------------------+------------------------------+---------+-------+---------+----------+-------------+
It is estimating a scan of more than 2 million rows! Also, the cardinality of the status index does not seem correct. There are only about 5 or 6 different statuses, not 344.
Other info
There are somewhat frequent insertions and updates to this table. About 2 rows inserted per second, and 10 statuses updated per second. I don't know how much impact this has, but I would not expect it to be 30 seconds worth.
If I query by both status and user_id, sometimes it is fast (sub 0.1s) and sometimes it is slow (> 1s), depending on the user_id. This does not seem to depend on the size of the result set (some users with 20 rows are quick, others with 4 are slow)
Can anybody explain what is going on here and how it can be fixed?
I am using mysql version 5.7.33
As already mentioned in the comments, you are using many indexes on a big table, so the memory required for these indexes is very high.
You can increase the index buffer size in my.cnf by setting innodb_buffer_pool_size to a higher value.
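A hedged sketch of inspecting and raising it; on MySQL 5.7.5+ the variable can be resized online, and the 4 GB value below is purely illustrative:

-- Current size, in bytes:
SELECT @@innodb_buffer_pool_size;

-- Resize online (MySQL 5.7.5+); pick a value that fits your RAM:
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;

-- Or persist it in my.cnf under [mysqld]:
--   innodb_buffer_pool_size = 4G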
But it is probably more efficient to use fewer indexes, and not to use composite indexes unless absolutely needed.
My guess is that if you remove all the indexes and create only one on status, this query will run in under 1 s.

Removing duplicate TEXTS from large mysql table

I have mysql table, which has structure
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| content | longtext | NO | | NULL | |
| valid | tinyint(1) | NO | | NULL | |
| created_at | timestamp | YES | | NULL | |
| updated_at | timestamp | YES | | NULL | |
+------------+------------------+------+-----+---------+----------------+
I need to remove duplicate entries by the content column. Everything would be easy if it weren't a LONGTEXT; the main issue is that entries in that column vary in length from 1 character to over 12,000 characters, and I have over 4,000,000 entries. A simple query like select id from table where content like "%stackoverflow%"; takes 15 s to execute. What would be the best approach to remove the duplicate entries without waiting two days for the query to finish?
md5 is your friend here. Make a separate hashvalues table (to avoid locking/contention with this table in production) with columns for the id and the hash. Index the hash column; note it cannot be the primary key, since duplicate hash values are exactly what you are looking for.
Once the new empty table is created, use MySQL's md5() function to populate it from your original data, with the original id and md5(content) as the field values. If necessary, you can populate the table in batches, if doing it all at once would take too long or slow things down too much.
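A minimal sketch of that setup, assuming the original table is called articles (its name is not given in the question):

-- Hash lookup table; hash is indexed but NOT unique, since duplicate
-- hashes are exactly what we are looking for.
CREATE TABLE hashvalues (
  id INT UNSIGNED NOT NULL,
  hash CHAR(32) NOT NULL,
  PRIMARY KEY (id),
  KEY ix_hash (hash)
);

-- Populate from the original data; can be run in id-range batches:
INSERT INTO hashvalues (id, hash)
SELECT id, MD5(content)
FROM articles;
-- ... WHERE id BETWEEN 1 AND 500000;  -- optional batching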
When the new table is fully populated with data, you can JOIN it to itself like this:
SELECT h1.*
FROM hashvalues h1
INNER JOIN hashvalues h2 on h1.hash = h2.hash and h1.id <> h2.id
This should be MUCH faster than comparing the content directly, since the database only has to compare pre-computed hash values. I'd expect it to run almost instantly. It will tell you which records are potential duplicates. There is still a potential for hash collisions, so you also need to compare the matches back to the original data to be sure, or include an originalcontent column in the new table that you can use with the query above. That done, you will know which records to remove.
This system can be even better if you can add a column to the original table that keeps the md5() hash of your content field up to date every time it changes. A generated column will work well for this if your MySQL version supports it (5.7+); otherwise, you can use a trigger. This column will allow you to re-run your duplicate check as needed, without all the extra work of the separate table.
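For example, a hedged sketch of the generated-column variant (again assuming the table is called articles):

-- Stored generated column, kept in sync automatically (MySQL 5.7+):
ALTER TABLE articles
  ADD COLUMN content_md5 CHAR(32) GENERATED ALWAYS AS (MD5(content)) STORED,
  ADD KEY ix_content_md5 (content_md5);

-- The duplicate check is then a simple self-join on the indexed column:
SELECT a1.id, a2.id
FROM articles a1
JOIN articles a2 ON a1.content_md5 = a2.content_md5 AND a1.id < a2.id;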
Finally, there are also the SHA(), SHA1() and SHA2() functions, which are more collision-resistant. However, md5() will be much faster, and the additional collision resistance isn't enough to remove the need to compare against the original data anyway. This also isn't a security situation where collision potential matters, so md5() is the better choice here. These aren't passwords, after all.

Slow query, state = 'Sorting result' mysql

I have a generated query:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from MessageDetails messagedet0_, MessageEntry messageent1_, MailSourceFile mailsource2_
where messagedet0_.id=messageent1_.messageDetails_id
and messageent1_.mailSourceFile_id=mailsource2_.id
order by mailsource2_.file, messageent1_.mboxOffset;
Explain says that there are no full scans and indexes are used:
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys |key | key_len | ref | rows | Extra |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | mailsource2_ | index | PRIMARY |file | 384 | NULL | 1445 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | messageent1_ | ref | msf_idx,md_idx,FKBBB258CB60B94D38,FKBBB258CBF7C835B8 |msf_idx | 9 | skryb.mailsource2_.id | 2721 | Using where |
| 1 | SIMPLE | messagedet0_ | eq_ref | PRIMARY |PRIMARY | 8 | skryb.messageent1_.messageDetails_id | 1 | |
+----+-------------+--------------+--------+------------------------------------------------------+---------+---------+--------------------------------------+------+----------------------------------------------+
CREATE TABLE `mailsourcefile` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`file` varchar(127) COLLATE utf8_bin DEFAULT NULL,
`size` bigint(20) DEFAULT NULL,
`archive_id` bigint(20) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `file` (`file`),
KEY `File_idx` (`file`),
KEY `Archive_idx` (`archive_id`),
KEY `FK7C3F816ECDB9F63C` (`archive_id`),
CONSTRAINT `FK7C3F816ECDB9F63C` FOREIGN KEY (`archive_id`) REFERENCES `archive` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1370 DEFAULT CHARSET=utf8 COLLATE=utf8_bin
I also have indexes on file and mboxOffset. SHOW FULL PROCESSLIST says that MySQL is sorting the result, and it takes more than a few minutes. The result set size is 5M records. How can I optimize this?
I don't think there is much optimization to do in the query itself. Explicit joins would make it more readable, but IIRC MySQL nowadays is perfectly able to detect this kind of construct and plan the joins itself.
What would probably help is to increase both tmp_table_size and max_heap_table_size, to allow the result set of this query to remain in memory rather than having to be written to disk.
The maximum size for in-memory temporary tables is the minimum of the tmp_table_size and max_heap_table_size values
http://dev.mysql.com/doc/refman/5.5/en/internal-temporary-tables.html
The "using temporary" in the explain denotes that it is using a temporary table (see the link above again) - which will probably be written to disk due to the large amount of data (again, see the link above for more on this).
The file column alone is anywhere between 1 and 384 bytes, so let's take half of that for our estimate and ignore the rest of the columns; that gives roughly 192 bytes per row in the result set.
1445 * 2721 = 3,931,845 rows
* 192 = 754,914,240 bytes
/ 1024 ≈ 737,221 KB
/ 1024 ≈ 720 MB
This is certainly more than the default max_heap_table_size (16,777,216 bytes) and most likely more than the tmp_table_size.
Not having to write such a result to disk will most certainly increase speed.
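A hedged sketch of raising both limits for the current session before running the query; the effective in-memory limit is the minimum of the two, so both must be raised (the 1 GB values are purely illustrative):

SET SESSION tmp_table_size = 1024 * 1024 * 1024;
SET SESSION max_heap_table_size = 1024 * 1024 * 1024;
-- then run the big SELECT in this same session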
Good luck!
Optimization is always tricky. In order to make a dent in your execution time, I think you probably need to do some sort of pre-cooking.
If the file names are similar, (e.g. /path/to/file/1, /path/to/file/2), sorting them will mean a lot of byte comparisons, probably compounded by the unicode encoding. I would calculate a hash of the filename on insertion (e.g. MD5()) and then sort using that.
If the files are already well distributed (e.g. postfix spool names), you probably need to come up with some scheme on insertion whereby either:
simply reading records from some joined table will automatically generate them in correct order; this may not save a lot of time, but it will give you some data quickly so you can start processing, or
find a way to provide a "window" on the data so that not all of it needs to be processed at once.
As @raheel shan said above, you may want to try some JOINs:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails messagedet0_
inner join
MessageEntry messageent1_
on
messagedet0_.id = messageent1_.messageDetails_id
inner join
MailSourceFile mailsource2_
on
messageent1_.mailSourceFile_id = mailsource2_.id
order by
mailsource2_.file,
messageent1_.mboxOffset
My apologies if the keys are off, but I think I've conveyed the point.
Write the query with joins, like this:
select
mailsource2_.file as col_0_0_,
messagedet0_.messageId as col_1_0_,
messageent1_.mboxOffset as col_2_0_,
messageent1_.mboxOffsetEnd as col_3_0_,
messagedet0_.id as col_4_0_
from
MessageDetails m0
inner join
MessageEntry m1
on
m0.id = m1.messageDetails_id
inner join
MailSourceFile m2
on
m1.mailSourceFile_id = m2.id
order by
m2.file,
m1.mboxOffset;
On seeing your EXPLAIN, I found 3 things which in my opinion are not good:
1. the filesort in the Extra column
2. "index" in the type column
3. the key length, which is 384
If you reduce the key length, you may get quicker retrieval; for that, consider the character set you use and partial (prefix) indexes.
Here you can use FORCE INDEX for the ORDER BY and USE INDEX for the joins (create appropriate multi-column indexes and assign them; see the sketch below). Remember, it is always good to order by columns present in the same table.
An access type of "index" means it is scanning the entire index, which is not good.
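A hedged sketch of the index-hint idea, using the existing unique index `file` on MailSourceFile; whether it actually avoids the filesort depends on your data and optimizer version:

-- Hint MySQL to read MailSourceFile in index order for the ORDER BY,
-- hoping to skip the sort entirely:
SELECT m2.file, m0.messageId, m1.mboxOffset, m1.mboxOffsetEnd, m0.id
FROM MailSourceFile m2 FORCE INDEX FOR ORDER BY (`file`)
INNER JOIN MessageEntry m1 ON m1.mailSourceFile_id = m2.id
INNER JOIN MessageDetails m0 ON m0.id = m1.messageDetails_id
ORDER BY m2.file, m1.mboxOffset;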

Improving MySQL Performance on a Run-Once Query with a Large Dataset

I previously asked a question on how to analyse large datasets (how can I analyse 13GB of data). One promising response was to add the data into a MySQL database using natural keys and thereby make use of INNODB's clustered indexing.
I've added the data to the database with a schema that looks like this:
TorrentsPerPeer
+----------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| ip | int(10) unsigned | NO | PRI | NULL | |
| infohash | varchar(40) | NO | PRI | NULL | |
+----------+------------------+------+-----+---------+-------+
The two fields together form the primary key.
This table represents known instances of peers downloading torrents. I'd like to be able to provide information on how many torrents can be found at each peer. I'm going to draw a histogram of how frequently each torrent count occurs (e.g. 20 peers have 2 torrents, 40 peers have 3, ...).
I've written the following query:
SELECT `count`, COUNT(`ip`)
FROM (SELECT `ip`, COUNT(`infohash`) AS `count`
FROM TorrentsPerPeer
GROUP BY `ip`) AS `counts`
GROUP BY `count`;
Here's the EXPLAIN for the sub-select:
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_length | ref | rows | Extra |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| 1 | SIMPLE | TorrentPerPeer | index | [Null] | PRIMARY | 126 | [Null] | 79262772 | Using index |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
I can't seem to do an EXPLAIN for the full query because it takes way too long. This bug suggests it's because EXPLAIN runs the subquery first.
This query is currently running (and has been for an hour). top is reporting that mysqld is only using ~5% of the available CPU whilst its RSIZE is steadily increasing. My assumption here is that the server is building temporary tables in RAM that it's using to complete the query.
My question is then; how can I improve the performance of this query? Should I change the query somehow? I've been altering the server settings in the my.cnf file to increase the INNODB buffer pool size, should I change any other values?
If it matters the table is 79'262'772 rows deep and takes up ~8GB of disk space. I'm not expecting this to be an easy query, maybe 'patience' is the only reasonable answer.
EDIT Just to add that the query has finished and it took 105mins. That's not unbearable, I'm just hoping for some improvements.
My hunch is that with an unsigned int and a varchar(40) (especially the varchar!) you now have a HUGE primary key, and it is making your index file too big to fit in whatever RAM you have for the InnoDB buffer pool. This would make InnoDB have to rely on disk to swap index pages as it searches, and that is a LOT of disk seeks and not a lot of CPU work.
One thing I did for a similar issue was to use something in between a truly natural key and a surrogate key. We took the 2 fields that are actually unique (one of which was also a varchar) and, in the application layer, made a fixed-width MD5 hash of them and used THAT as the key. Yes, it means more work for the app, but it makes for a much smaller index file, since you are no longer using an arbitrary-length field.
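A minimal sketch of that idea for this table; the hash could equally be computed in the application layer as described, and storing it as BINARY(16) makes it half the size of a hex CHAR(32):

-- Fixed-width 16-byte key instead of (int, varchar(40)):
CREATE TABLE TorrentsPerPeerHashed (
  pair_hash BINARY(16) NOT NULL,
  ip INT UNSIGNED NOT NULL,
  infohash VARCHAR(40) NOT NULL,
  PRIMARY KEY (pair_hash)
) ENGINE=InnoDB;

INSERT INTO TorrentsPerPeerHashed (pair_hash, ip, infohash)
SELECT UNHEX(MD5(CONCAT(ip, ':', infohash))), ip, infohash
FROM TorrentsPerPeer;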
OR, you could just use a server with tons of RAM and see if that makes the index fit in memory but I always like to make 'throw hardware at it' a last resort :)

MySQL EXPLAIN 'type' changes from 'range' to 'ref' when the date in the where statement is changed?

I've been testing out different ideas for optimizing some of the tables we have in our system at work. Today I came across a table that tracks every view on each vehicle in our system. Create table below.
SHOW CREATE TABLE vehicle_view_tracking;
CREATE TABLE `vehicle_view_tracking` (
`vehicle_view_tracking_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`public_key` varchar(45) NOT NULL,
`vehicle_id` int(10) unsigned NOT NULL,
`landing_url` longtext NOT NULL,
`landing_port` int(11) NOT NULL,
`http_referrer` longtext,
`created_on` datetime NOT NULL,
`created_on_date` date NOT NULL,
`server_host` longtext,
`server_uri` longtext,
`referrer_host` longtext,
`referrer_uri` longtext,
PRIMARY KEY (`vehicle_view_tracking_id`),
KEY `vehicleViewTrackingKeyCreatedIndex` (`public_key`,`created_on_date`),
KEY `vehicleViewTrackingKeyIndex` (`public_key`)
) ENGINE=InnoDB AUTO_INCREMENT=363439 DEFAULT CHARSET=latin1;
I was playing around with multi-column and single column indexes. I ran the following query:
EXPLAIN EXTENDED SELECT dealership_vehicles.vehicle_make, dealership_vehicles.vehicle_model, vehicle_view_tracking.referrer_host, count(*) AS count
FROM vehicle_view_tracking
LEFT JOIN dealership_vehicles
ON dealership_vehicles.dealership_vehicle_id = vehicle_view_tracking.vehicle_id
WHERE vehicle_view_tracking.created_on_date >= '2011-09-07' AND vehicle_view_tracking.public_key IN ('ab12c3')
GROUP BY (dealership_vehicles.vehicle_make) ASC , dealership_vehicles.vehicle_model, referrer_host
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | vehicle_view_tracking | range | vehicleViewTrackingKeyCreatedIndex,vehicleViewTrackingKeyIndex | vehicleViewTrackingKeyCreatedIndex | 50 | NULL | 23086 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dealership_vehicles | eq_ref | PRIMARY | PRIMARY | 8 | vehicle_view_tracking.vehicle_id | 1 | 100.00 | |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+------------------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
(Execution time for actual select query was .309 seconds)
Then I changed the date in the where clause from '2011-09-07' to '2011-07-07' and got the following explain results:
EXPLAIN EXTENDED SELECT dealership_vehicles.vehicle_make, dealership_vehicles.vehicle_model, vehicle_view_tracking.referrer_host, count(*) AS count
FROM vehicle_view_tracking
LEFT JOIN dealership_vehicles
ON dealership_vehicles.dealership_vehicle_id = vehicle_view_tracking.vehicle_id
WHERE vehicle_view_tracking.created_on_date >= '2011-07-07' AND vehicle_view_tracking.public_key IN ('ab12c3')
GROUP BY (dealership_vehicles.vehicle_make) ASC , dealership_vehicles.vehicle_model, referrer_host
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
| 1 | SIMPLE | vehicle_view_tracking | ref | vehicleViewTrackingKeyCreatedIndex,vehicleViewTrackingKeyIndex | vehicleViewTrackingKeyIndex | 47 | const | 53676 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dealership_vehicles | eq_ref | PRIMARY | PRIMARY | 8 | vehicle_view_tracking.vehicle_id | 1 | 100.00 | |
+----+-------------+-----------------------+--------+----------------------------------------------------------------+-----------------------------+---------+----------------------------------------------+-------+----------+----------------------------------------------+
(Execution time for actual select query was .670 seconds)
I see 4 main changes:
type changed from range to ref
key changed from vehicleViewTrackingKeyCreatedIndex to vehicleViewTrackingKeyIndex
key_len changed from 50 to 47 (caused by the change in key)
rows changed from 23086 to 53676 (caused by the change in key)
At this point, the execution time is only .6 seconds for the slower query; however, we only have about 10% of our vehicles in our database.
It's getting late and I may have overlooked something in the mysql docs, but I can't seem to find why the key (and in turn the type and rows) changes when the date in the where clause is changed.
The help is greatly appreciated. I searched for someone having the same/similar issue with a date causing this change and was not able to find anything. If I missed a previous post, please link me :-)
Different search strategies make sense for different data. In particular, index scans (such as range) often have to do a seek to actually read the row. At some point, doing all those seeks is slower than not using the index at all.
Take a trivial example, a table with three columns: id (primary key), name (indexed), birthday. Say it has a lot of data. If you ask MySQL to look for Bob's birthday, it can do that fairly quickly: first, it finds Bob in the name index (this takes a few seeks, log(n) where n is the row count), then one additional seek to read the actual row in the data file and read the birthday from it. That's very quick, and far quicker than scanning the entire table.
Next, consider doing a name like 'Z%'. That is probably a fairly small portion of the table, so it's still faster to find where the Zs start in the name index, then for each one seek into the data file to read the row. (This is a range scan.)
Finally, consider asking for all names starting with M-Z. That's probably around half the data. It could do a range scan, and then a lot of seeks, but seeking randomly over the datafile with the ultimate goal of reading half the rows isn't optimal: it'd be faster to just do a big sequential read over the data file. So, in this case, the index will be ignored.
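A sketch of those three cases against the hypothetical table just described:

CREATE TABLE people (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(64),
  birthday DATE,
  PRIMARY KEY (id),
  KEY ix_name (name)
);

-- 1. Point lookup: a few index seeks plus one row read. Very fast.
SELECT birthday FROM people WHERE name = 'Bob';

-- 2. Small range: range scan on ix_name, one data-file seek per match.
SELECT birthday FROM people WHERE name LIKE 'Z%';

-- 3. Huge range (~half the table): the optimizer will likely ignore
--    ix_name and do a sequential full-table scan instead.
SELECT birthday FROM people WHERE name >= 'M';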
This is what you're seeing, except that in your case there is another key it can fall back on. (It's also possible that it would actually use the date index if it didn't have the other; it should pick whichever index will be quickest. Beware that MySQL's optimizer often makes errors in this.)
So, in short, this is expected. A query doesn't say how to retrieve the data, rather it says what data to retrieve. The database's optimizer is supposed to find the quickest way to retrieve it.
You may find that an index on both columns, in the order (public_key, created_on_date), is preferred in both cases and speeds up your query. This is because MySQL can only ever use one index per table (per query). Also, the date goes at the end because a range scan can only be done efficiently on the last column used in an index.
[InnoDB actually has another layer of indirection, I believe, but it'd just confuse the point. It doesn't make a difference to the explanation.]