I recently encountered a problem involving a MySQL DBMS.
The Table is like this:
CREATE TABLE `orders` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(60) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`sex` enum('男','女') DEFAULT NULL,
`amount` float(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `name_i` (`name`),
KEY `sex` (`sex`)
) ENGINE=InnoDB AUTO_INCREMENT=5000001 DEFAULT CHARSET=utf8
As shown above, I created a single-column index on the column name.
I want to perform a range query on name, and the EXPLAIN output is:
mysql> explain select * from orders where name like '王%';
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
| 1 | SIMPLE | orders | NULL | range | name_i | name_i | 183 | NULL | 20630 | 100.00 | Using index condition; Using MRR |
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
1 row in set, 1 warning (0.10 sec)
So it should use the index name_i and finish the query in a flash (my classmate's run took 0.07 sec).
However, this is how it turned out:
| 4998119 | 王缝 | 27 | 男 | 159.21 |
| 4998232 | 王求葬 | 19 | 男 | 335.65 |
| 4998397 | 王倘予 | 49 | 女 | 103.39 |
| 4998482 | 王厚 | 77 | 男 | 960.69 |
| 4998703 | 王啄淋 | 73 | 女 | 458.85 |
| 4999106 | 王般埋 | 70 | 女 | 700.98 |
| 4999359 | 王胆具 | 31 | 女 | 362.83 |
| 4999510 | 王铁脾 | 31 | 女 | 973.09 |
| 4999880 | 王战万 | 59 | 女 | 127.28 |
| 4999928 | 王忆 | 42 | 女 | 72.47 |
+---------+--------+------+------+--------+
11160 rows in set (3.43 sec)
It seems not to use the index at all, because the rows come back sorted by the primary key id rather than by name (besides, it is far too slow compared to 0.07 sec).
Has anyone encountered this problem too?
What percentage of the table is "Kings" (王)? If it is more than about 20%, the optimizer will choose to do a table scan instead of using the index. (And this may actually be faster.) (Based on the comments, 0.22% of the table is Kings.)
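A quick sketch of how to measure that selectivity yourself:
SELECT COUNT(*) FROM orders WHERE name LIKE '王%';
SELECT COUNT(*) FROM orders;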
EXPLAIN and the execution of the query are separate things. Although I don't remember proving it, it is possible for EXPLAIN to say one thing while the query actually runs another way.
Do you have 5 million rows in the table? Was the cache 'cold' when you first ran it, so it had to fetch the 11,160 rows from disk? And the second time everything was in cache, hence much faster?
Was the table loaded in "alphabetical" (or whatever the equivalent ordering is for Chinese) order? If so, there is a good chance the ids and the names are in the same order.
Apparently you are using the utf8_general_ci collation? Maybe it does not sort Chinese well. (Provide a test case; I'll do some tests.)
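To see which collation is actually in effect for the column, a quick check:
SHOW FULL COLUMNS FROM orders;
The Collation column of that output shows what name uses.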
I do not understand why it mentioned MRR.
I, too, am baffled by "1 min 32.24 sec". The ORDER BY name should have further encouraged the Optimizer to use INDEX(name). Can you turn on the optimizer trace?
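For reference, the optimizer trace (available in MySQL 5.6 and later) is enabled per session; a sketch of the usual sequence:
SET optimizer_trace = 'enabled=on';
SELECT * FROM orders WHERE name LIKE '王%' ORDER BY name;
SELECT * FROM information_schema.OPTIMIZER_TRACE\G
SET optimizer_trace = 'enabled=off';
The trace shows the cost estimates behind the choice between the range scan and the table scan.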
To really see whether it used the index, do this:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
If the big number(s) look like the number of rows in the table, then it did a table scan. If they look more like 11160, then it used the index.
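For the query in question, that check would look like this (a sketch):
FLUSH STATUS;
SELECT * FROM orders WHERE name LIKE '王%';
SHOW SESSION STATUS LIKE 'Handler%';
A Handler_read_rnd_next near 5 million points to a full table scan; a Handler_read_next near 11160 points to the range scan on name_i.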
In my slow query log I am seeing slow queries like
# Time: 121107 16:34:02
# User@Host: web_node[web_node] @ localhost [127.0.0.1]
# Thread_id: 34436186 Schema: test_db Last_errno: 0 Killed: 0
# Query_time: 1.413751 Lock_time: 0.000222 Rows_sent: 203 Rows_examined: 203 Rows_affected: 0 Rows_read: 203
# Bytes_sent: 7553 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0
# InnoDB_trx_id: 9B04384
SET timestamp=1352334842;
SELECT id, email FROM test_data WHERE id IN (13089576,3002681,3117763,1622233,2941590,12305279,1732672,2446772,3189510,13084725,4943929,5855071,6572137,2266261,3003496,2024860,3336832,13758671,6477694,1796684,13001771,4690025,1071744,1017876,5175795,795988,1619821,2481819,2941090,4770802,13438250,3254708,2323402,526303,13219855,3313573,3190479,1733761,3300577,2941758,6474118,1733379,11523598,4205064,6521805,2492903,1860388,3337093,5205317,1213970,5442738,12194039,1214203,12970536,3076611,3126152,3677156,5305021,2751587,4954875,875480,2105172,5309382,12981920,5204330,13729768,3254503,5030441,2680750,590661,1338572,7272410,1860386,2567550,5434143,1918035,5329411,1683235,3254119,5175784,1855380,3336834,2102567,4749746,37269,3207031,6464336,2227907,2713471,3937600,2940442,2233821,5619141,5204711,5988803,5050821,10109926,5226877,5050275,1874115,13677832,5338699,2423773,6432937,6443660,1990611,6090667,6527411,6568731,3254846,3414049,2011907,5180984,12178711,8558260,3130655,5864745,2059318,3480233,2104948,2387703,1939395,5356002,2681209,1184622,1184456,10390165,510854,7983305,795991,2622393,4490187,9436477,5356051,2423464,5205318,1600499,13623229,3255205,12200483,6477706,3445661,5226284,1176639,13760962,2101681,6022818,12909371,1732457,2377496,7260091,12191702,2492899,2630691,13047691,1684470,9382108,2233737,13117701,1796698,2535914,4941741,4565958,1100410,2321180,13080467,813342,4563877,4689365,2104756,1102802,2714488,3188947,1599770,1558291,5592740,5233428,5204830,1574452,3188956,13693326,2102349,3704111,1748303,790889,9323280,4741494,2387900,5338213,3583795,2283942,3189482,3002296,4490123,3585020,962926,3481423,1600920,1682364,4693123,6487778,2677582,2377195);
When I run the slow query through the profiler using SQL_NO_CACHE, it says
203 rows in set (0.03 sec)
show profile for query 33;
+----------------------+----------+
| Status | Duration |
+----------------------+----------+
| starting | 0.000187 |
| checking permissions | 0.000012 |
| Opening tables | 0.000034 |
| System lock | 0.000016 |
| init | 0.000087 |
| optimizing | 0.000024 |
| statistics | 0.028694 |
| preparing | 0.000074 |
| executing | 0.000005 |
| Sending data | 0.001596 |
| end | 0.000009 |
| query end | 0.000008 |
| closing tables | 0.000014 |
| freeing items | 0.001600 |
| logging slow query | 0.000007 |
| cleaning up | 0.000011 |
+----------------------+----------+
When I run the query with EXPLAIN, it says
+----+-------------+------------------+-------+------------------+----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+------------------+----------+---------+------+------+--------------------------+
| 1 | SIMPLE | test_data | range | PRIMARY,id_email | id_email | 4 | NULL | 203 | Using where; Using index |
+----+-------------+------------------+-------+------------------+----------+---------+------+------+--------------------------+
The CREATE TABLE looks like
CREATE TABLE `test_data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`email` varchar(254) DEFAULT NULL,
`domain` varchar(254) DEFAULT NULL,
`age` smallint(6) DEFAULT NULL,
`gender` tinyint(1) DEFAULT NULL,
`location_id` int(11) unsigned DEFAULT NULL,
`created` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`unistall_date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`subscription_date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`active` tinyint(1) DEFAULT '1',
PRIMARY KEY (`id`),
UNIQUE KEY `email` (`email`),
KEY `domain` (`domain`),
KEY `id_email` (`id`,`email`),
KEY `email_id` (`email`,`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13848530 DEFAULT CHARSET=utf8
There is another query that runs regularly, selecting id and email for a list of email addresses; hence the (email, id) key. The email addresses need to be unique, hence the unique key. The table only has ~14M rows.
I thought maybe the indexes were getting too big for memory and causing swapping, but the box has 8 GB of RAM.
SELECT table_schema "Data Base Name",
       SUM(data_length + index_length) / 1024 / 1024 "Data Base Size in MB",
       SUM(index_length) / 1024 / 1024 "Index Size in MB"
FROM information_schema.TABLES
GROUP BY table_schema;
+--------------------+----------------------+------------------+
| Data Base Name | Data Base Size in MB | Index Size in MB |
+--------------------+----------------------+------------------+
| metrics | 3192.50000000 | 1594.42187500 |
| data | 8096.48437500 | 5639.51562500 |
| raw_data | 6000.35937500 | 745.07812500 |
| information_schema | 0.00878906 | 0.00878906 |
| mysql | 0.04319191 | 0.04101563 |
| performance_schema | 0.00000000 | 0.00000000 |
+--------------------+----------------------+------------------+
Setting innodb_file_per_table=1 in the my.cnf file appears to have solved the issue.
This improved the execution time; my understanding is that having a single file per table means the disk head doesn't need to move such large distances.
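Worth noting: innodb_file_per_table only applies to tables created after it is enabled; tables already in the shared system tablespace stay there until rebuilt. A sketch:
SHOW VARIABLES LIKE 'innodb_file_per_table';
-- rebuilding an existing table moves it into its own .ibd file
ALTER TABLE test_data ENGINE=InnoDB;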
Questions
If the query can be evaluated using the indexes, why does setting innodb_file_per_table=1 improve the performance?
Why isn't the query slow when it is run through the profiler without using the cache?
Should my primary key be (id, email) ?
Update
Originally there was no /etc/my.cnf file, so I created one with the following:
[mysqld]
server-id=1
max_connections=1500
key_buffer_size=50M
query_cache_limit=16M
query_cache_size=256M
thread_cache=16
table_open_cache=4096
sort_buffer_size=512K
join_buffer_size=8M
read_buffer_size=8M
skip_name_resolve=1
thread_cache_size=256
innodb_buffer_pool_size=6G
innodb_buffer_pool_instances=1
innodb_thread_concurrency=96
innodb_additional_mem_pool_size=32M
innodb_log_buffer_size=8M
innodb_flush_log_at_trx_commit=0
innodb_log_file_size=256M
innodb_flush_method=O_DIRECT
innodb_file_per_table=1
net_read_timeout=15
net_write_timeout=30
log-bin=mysql-bin
sync_binlog=0
datadir=/var/lib/mysql
You have too much data for your innodb_log_buffer.
What are the values of:
innodb_buffer_pool_size
innodb_log_file_size
All of InnoDB must run in memory. When you split up the files, it runs more efficiently because it swaps pages in and out of memory with fewer disk reads and writes; one larger file takes longer to scan for the data.
It's not swapping, because your innodb_buffer_pool_size constrains the amount of memory that MySQL loads.
The only way to fix your problem is to get more memory and allocate a large enough innodb_buffer_pool_size for all of your InnoDB tables and indexes.
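A sketch of how to check that, reusing the information_schema query from the question but restricted to InnoDB:
SELECT SUM(data_length + index_length) / 1024 / 1024 / 1024 "InnoDB Size in GB"
FROM information_schema.TABLES
WHERE engine = 'InnoDB';
If the result is much larger than innodb_buffer_pool_size (6G here), the working set cannot stay fully cached.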
This is almost driving me insane
I do the following query:
SELECT * FROM `photo_person` WHERE photo_person.photo_id IN (SELECT photo_id FROM photo_person WHERE `photo_person`.`person_id` ='1')
When I change the id, I get different processing times, although it's all the same query and tables.
By changing the person_id I get the following:
-- person_id=1 ( 3 total, Query took 0.4523 sec)
-- person_id=2 ( 99 total, Query took 0.1340 sec)
-- person_id=3 ( 470 total, Query took 0.0194 sec)
-- person_id=4 ( 1,869 total, Query took 0.0024 sec)
I do not understand how the query time goes down as the number of records/results goes up.
The table structure is very straightforward.
UPDATE: I have already disabled the MySQL query cache, so every time I run the query I get the same exact value (of course it varies at the millisecond level, but this can be neglected).
UPDATE: table is MyISAM
CREATE TABLE IF NOT EXISTS `photo_person` (
`entry_id` int(11) NOT NULL AUTO_INCREMENT,
`photo_id` int(11) NOT NULL DEFAULT '0',
`person_id` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`entry_id`),
UNIQUE KEY `PhotoID` (`photo_id`,`person_id`),
KEY `photo_id` (`photo_id`),
KEY `person_id` (`person_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=182072 ;
Here are the results of the profiling:
+----------+------------+-----------------------------+
| Query_ID | Duration |Query |
+----------+------------+-----------------------------+
| 1 | 0.45541200 | SELECT ...`person_id` ='1') |
| 2 | 0.44833700 | SELECT ...`person_id` ='2') |
| 3 | 0.45587800 | SELECT ...`person_id` ='3') |
| 4 | 0.45074900 | SELECT ...`person_id` ='4') |
+----------+------------+-----------------------------+
Now, since the numbers are the same, it must be the caching :(
So apparently the caching kicks in at a certain number of records or bytes.
mysql> SHOW VARIABLES LIKE "%cac%";
+------------------------------+------------+
| Variable_name | Value |
+------------------------------+------------+
| binlog_cache_size | 32768 |
| have_query_cache | YES |
| key_cache_age_threshold | 300 |
| key_cache_block_size | 1024 |
| key_cache_division_limit | 100 |
| max_binlog_cache_size | 4294963200 |
| query_cache_limit | 1024 |
| query_cache_min_res_unit | 4096 |
| query_cache_size | 1024 |
| query_cache_type | ON |
| query_cache_wlock_invalidate | OFF |
| table_definition_cache | 256 |
| table_open_cache | 64 |
| thread_cache_size | 8 |
+------------------------------+------------+
14 rows in set (0.00 sec)
How are you testing the query speeds? I suspect it's not an appropriate way. The more you query the table, the more likely MySQL is to do some aggressive pre-fetching on it, meaning further queries on the table will be faster even though they scan more data. The reason is that MySQL will not have to load the pages from disk, since it has already pre-fetched them into memory.
As other people have stated, the query cache could also mess up your test results, especially if they involved re-running the query several times in a row to get an "average" runtime.
Add SQL_NO_CACHE to your query to see if it is the cache that tricks you.
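Applied to the query from the question, that looks like:
SELECT SQL_NO_CACHE * FROM photo_person
WHERE photo_person.photo_id IN
      (SELECT photo_id FROM photo_person WHERE photo_person.person_id = '1');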
To see what is taking time, try using PROFILING like this:
mysql> SET profiling = 1;
mysql> Your select goes here;
mysql> SHOW PROFILES;
Also, try to use the simpler query:
SELECT * FROM photo_person WHERE `photo_person`.`person_id` ='1'
I don't know whether MySQL optimizes your query or not, but logically yours and this one are equivalent - except that yours uses a subquery. Always avoid subqueries where possible.
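If the subquery form is needed (it returns every person on the matching photos, not just person 1's rows), the usual subquery-free rewrite is a self-join; a sketch:
SELECT pp.*
FROM photo_person pp
JOIN photo_person filter ON filter.photo_id = pp.photo_id
WHERE filter.person_id = '1';
The UNIQUE KEY on (photo_id, person_id) guarantees the join introduces no duplicate rows.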
Given this table on a local MySQL 5.1 instance with query caching off:
show create table product_views\G
*************************** 1. row ***************************
Table: product_views
Create Table: CREATE TABLE `product_views` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`dateCreated` datetime NOT NULL,
`dateModified` datetime DEFAULT NULL,
`hibernateVersion` bigint(20) DEFAULT NULL,
`brandName` varchar(255) DEFAULT NULL,
`mfrModel` varchar(255) DEFAULT NULL,
`origin` varchar(255) NOT NULL,
`price` float DEFAULT NULL,
`productType` varchar(255) DEFAULT NULL,
`rebateDetailsViewed` tinyint(1) NOT NULL,
`rebateSearchZipCode` int(11) DEFAULT NULL,
`rebatesFoundAmount` float DEFAULT NULL,
`rebatesFoundCount` int(11) DEFAULT NULL,
`siteSKU` varchar(255) DEFAULT NULL,
`timestamp` datetime NOT NULL,
`uiContext` varchar(255) DEFAULT NULL,
`siteVisitId` bigint(20) NOT NULL,
`efficiencyLevel` varchar(255) DEFAULT NULL,
`siteName` varchar(255) DEFAULT NULL,
`clicks` varchar(1024) DEFAULT NULL,
`rebateFormDownloaded` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `siteVisitId` (`siteVisitId`,`siteSKU`),
KEY `FK52C29B1E3CAB9CC4` (`siteVisitId`),
KEY `rebateSearchZipCode_idx` (`rebateSearchZipCode`),
KEY `FIND_UNPROCESSED_IDX` (`siteSKU`,`siteVisitId`,`timestamp`),
CONSTRAINT `FK52C29B1E3CAB9CC4` FOREIGN KEY (`siteVisitId`) REFERENCES `site_visits` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=32909504 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
This query takes ~3s:
SELECT pv.id, pv.siteSKU
FROM product_views pv
CROSS JOIN site_visits sv
WHERE pv.siteVisitId = sv.id
AND pv.siteSKU = 'foo'
AND sv.siteId = 'bar'
AND sv.postProcessed = 1
AND pv.timestamp >= '2011-05-19 00:00:00'
AND pv.timestamp < '2011-06-18 00:00:00';
But this one (non-indexed column added to SELECT) takes ~65s:
SELECT pv.id, pv.siteSKU, pv.hibernateVersion
FROM product_views pv
CROSS JOIN site_visits sv
WHERE pv.siteVisitId = sv.id
AND pv.siteSKU = 'foo'
AND sv.siteId = 'bar'
AND sv.postProcessed = 1
AND pv.timestamp >= '2011-05-19 00:00:00'
AND pv.timestamp < '2011-06-18 00:00:00';
Nothing in the 'where' or 'from' clauses differs. All the extra time is spent in 'Sending data':
mysql> show profile for query 1;
+--------------------+-----------+
| Status | Duration |
+--------------------+-----------+
| starting | 0.000155 |
| Opening tables | 0.000029 |
| System lock | 0.000007 |
| Table lock | 0.000019 |
| init | 0.000072 |
| optimizing | 0.000032 |
| statistics | 0.000316 |
| preparing | 0.000034 |
| executing | 0.000002 |
| Sending data | 63.530402 |
| end | 0.000044 |
| query end | 0.000005 |
| freeing items | 0.000091 |
| logging slow query | 0.000002 |
| logging slow query | 0.000109 |
| cleaning up | 0.000004 |
+--------------------+-----------+
16 rows in set (0.00 sec)
I understand that using a non-indexed column in the WHERE clause would slow things down, but why here? What can be done to improve the latter case - given that I will actually want to SELECT * from product_views?
EXPLAIN Output
explain extended select pv.id, pv.siteSKU from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU='foo' and sv.siteId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
| id | select_type | table | type   | possible_keys                                       | key                  | key_len | ref                  | rows  | filtered | Extra                    |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
|  1 | SIMPLE      | pv    | ref    | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258     | const                | 41810 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | sv    | eq_ref | PRIMARY,post_processed_idx                          | PRIMARY              | 8       | clabs.pv.siteVisitId |     1 |   100.00 | Using where              |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)

mysql> explain extended select pv.id, pv.siteSKU, pv.hibernateVersion from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU='foo' and sv.siteId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
| id | select_type | table | type   | possible_keys                                       | key                  | key_len | ref                  | rows  | filtered | Extra       |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
|  1 | SIMPLE      | pv    | ref    | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258     | const                | 41810 |   100.00 | Using where |
|  1 | SIMPLE      | sv    | eq_ref | PRIMARY,post_processed_idx                          | PRIMARY              | 8       | clabs.pv.siteVisitId |     1 |   100.00 | Using where |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
UPDATE1: Splitting into 2 queries brings total time down to ~30s range
Not sure why, but splitting the latter query into the following reduces latency from 65s to ~30s:
1) SELECT pv.id .... //from, where clauses same as above
2) SELECT * FROM product_views where id in (idList); //idList
UPDATE2: TABLE SIZE
table has on the order of 10M rows
query returns about 3k rows
When you select only indexed columns, MySQL reads only the index and does not need to read the table data. This, as far as I remember, is called an index-covered query. However, when there are columns that are not present in the used index, MySQL needs to open the table and read the data from it. This is the reason index-covered queries are much faster.
See Using Covering Indexes to Improve Query Performance.
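This is also why the asker's UPDATE1 split helps: the first query is index-covered and collects only the ids, and the full rows are then fetched for just the ~3k matches. The same "deferred join" pattern can be written as a single statement; a sketch:
SELECT pv2.*
FROM product_views pv2
JOIN (
    SELECT pv.id
    FROM product_views pv
    JOIN site_visits sv ON pv.siteVisitId = sv.id
    WHERE pv.siteSKU = 'foo'
      AND sv.siteId = 'bar'
      AND sv.postProcessed = 1
      AND pv.timestamp >= '2011-05-19 00:00:00'
      AND pv.timestamp <  '2011-06-18 00:00:00'
) matches ON pv2.id = matches.id;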
As for the improvement: how many rows are in the table, how many does the query return, what is your buffer pool size, how much RAM is available, etc.?
From what I have read about SHOW PROFILE, 'Sending data' is a portion of the execution process and has almost nothing to do with sending actual data to the client. You can take a look at this thread.
Also, mysql docs says about "Sending data" :
The thread is reading and processing rows for a SELECT statement, and sending data to the client. Because operations occurring during this state tend to perform large amounts of disk access (reads), it is often the longest-running state over the lifetime of a given query.
In my opinion, MySQL would do better not to mix "reading and processing rows for a SELECT statement" and "sending data" together in one state, especially in a state called "Sending data", which causes lots of confusion.
I don't know MySQL internals at all, but Darhazer's explanation looks like the winner to me. When the non-indexed field is added, the entire row must be retrieved, and your rows are very wide. I can't quite tell from the names how (if at all) the table is denormalized, but I suspect it is. siteName and siteSKU smell like they belong in a site lookup table with an FK; rebatesFoundAmount and rebatesFoundCount sound like statistics that should come from a join to a separate product-rebate table; etc.
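For illustration only (the table and column names below are hypothetical, not part of the original schema), the kind of site lookup table this answer has in mind might look like:
CREATE TABLE site (
  id       INT NOT NULL AUTO_INCREMENT,  -- hypothetical surrogate key
  siteName VARCHAR(255) NOT NULL,
  PRIMARY KEY (id)
);
-- product_views would then carry a small site_id FK instead of repeating
-- varchar(255) values, making each row considerably narrower.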