Need help understanding how mysql indexes work - mysql

I have a table that looks like this:
CREATE TABLE `metric` (
`metricid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`host` varchar(50) NOT NULL,
`userid` int(10) unsigned DEFAULT NULL,
`lastmetricvalue` double DEFAULT NULL,
`receivedat` int(10) unsigned DEFAULT NULL,
`name` varchar(255) NOT NULL,
`sampleid` tinyint(3) unsigned NOT NULL,
`type` tinyint(3) unsigned NOT NULL DEFAULT '0',
`lastrawvalue` double NOT NULL,
`priority` tinyint(3) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`metricid`),
UNIQUE KEY `unique-metric` (`userid`,`host`,`name`,`sampleid`)
) ENGINE=InnoDB AUTO_INCREMENT=1000000221496 DEFAULT CHARSET=utf8
It has 177,892 rows at the moment, and when I run the following query:
select metricid, lastrawvalue, receivedat, name, sampleid
FROM metric m
WHERE m.userid = 8
AND (host, name, sampleid) IN (('localhost','0.4350799184758216cpu-3/cpu-nice',0),
('localhost','0.4350799184758216cpu-3/cpu-system',0),
('localhost','0.4350799184758216cpu-3/cpu-idle',0),
('localhost','0.4350799184758216cpu-3/cpu-wait',0),
('localhost','0.4350799184758216cpu-3/cpu-interrupt',0),
('localhost','0.4350799184758216cpu-3/cpu-softirq',0),
('localhost','0.4350799184758216cpu-3/cpu-steal',0),
('localhost','0.4350799184758216cpu-4/cpu-user',0),
('localhost','0.4350799184758216cpu-4/cpu-nice',0),
('localhost','0.4350799184758216cpu-4/cpu-system',0),
('localhost','0.4350799184758216cpu-4/cpu-idle',0),
('localhost','0.4350799184758216cpu-4/cpu-wait',0),
('localhost','0.4350799184758216cpu-4/cpu-interrupt',0),
('localhost','0.4350799184758216cpu-4/cpu-softirq',0),
('localhost','0.4350799184758216cpu-4/cpu-steal',0),
('localhost','_util/billing-bytes',0),('localhost','_util/billing-metrics',0));
it takes 0.87 seconds to return results, explain is:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: m
type: ref
possible_keys: unique-metric
key: unique-metric
key_len: 5
ref: const
rows: 85560
Extra: Using where
1 row in set (0.00 sec)
profile looks like this:
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000160 |
| checking permissions | 0.000010 |
| Opening tables | 0.000021 |
| exit open_tables() | 0.000008 |
| System lock | 0.000008 |
| mysql_lock_tables(): unlocking | 0.000005 |
| exit mysqld_lock_tables() | 0.000007 |
| init | 0.000068 |
| optimizing | 0.000018 |
| statistics | 0.000091 |
| preparing | 0.000042 |
| executing | 0.000005 |
| Sending data | 0.870180 |
| innobase_commit_low():trx_comm | 0.000012 |
| Sending data | 0.000111 |
| end | 0.000009 |
| query end | 0.000009 |
| ha_commit_one_phase(-1) | 0.000015 |
| innobase_commit_low():trx_comm | 0.000004 |
| ha_commit_one_phase(-1) | 0.000005 |
| query end | 0.000005 |
| closing tables | 0.000012 |
| freeing items | 0.000562 |
| logging slow query | 0.000005 |
| cleaning up | 0.000005 |
| sleeping | 0.000006 |
+--------------------------------+----------+
Which seems way too high for me. I've tried to replace the userid = 8 and (host, name, sampleid) IN part of the first query to (userid, host, name, sampleid) IN and this query runs about 0.5s - almost 2 times quicker, for reference, here's the query:
select metricid, lastrawvalue, receivedat, name, sampleid
FROM metric m
WHERE (userid, host, name, sampleid) IN ((8,'localhost','0.4350799184758216cpu-3/cpu-nice',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-system',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-idle',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-wait',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-interrupt',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-softirq',0),
(8,'localhost','0.4350799184758216cpu-3/cpu-steal',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-user',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-nice',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-system',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-idle',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-wait',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-interrupt',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-softirq',0),
(8,'localhost','0.4350799184758216cpu-4/cpu-steal',0),
(8,'localhost','_util/billing-bytes',0),
(8,'localhost','_util/billing-metrics',0));
its explain looks like this:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: m
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 171121
Extra: Using where
1 row in set (0.00 sec)
Next I've updated the table to contain a single joined column:
alter table `metric` add `forindex` varchar(120) not null default '';
update metric set forindex = concat(userid,`host`,`name`,sampleid);
alter table metric add index `forindex` (`forindex`);
Updated the query to have only 1 string searched:
select metricid, lastrawvalue, receivedat, name, sampleid
FROM metric m
WHERE (forindex) IN (('8localhost0.4350799184758216cpu-3/cpu-nice0'),
('8localhost0.4350799184758216cpu-3/cpu-system0'),
('8localhost0.4350799184758216cpu-3/cpu-idle0'),
('8localhost0.4350799184758216cpu-3/cpu-wait0'),
('8localhost0.4350799184758216cpu-3/cpu-interrupt0'),
('8localhost0.4350799184758216cpu-3/cpu-softirq0'),
('8localhost0.4350799184758216cpu-3/cpu-steal0'),
('8localhost0.4350799184758216cpu-4/cpu-user0'),
('8localhost0.4350799184758216cpu-4/cpu-nice0'),
('8localhost0.4350799184758216cpu-4/cpu-system0'),
('8localhost0.4350799184758216cpu-4/cpu-idle0'),
('8localhost0.4350799184758216cpu-4/cpu-wait0'),
('8localhost0.4350799184758216cpu-4/cpu-interrupt0'),
('8localhost0.4350799184758216cpu-4/cpu-softirq0'),
('8localhost0.4350799184758216cpu-4/cpu-steal0'),
('8localhost_util/billing-bytes0'),
('8localhost_util/billing-metrics0'));
And now I get the same results in 0.00 sec! Explain is:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: m
type: range
possible_keys: forindex
key: forindex
key_len: 362
ref: NULL
rows: 17
Extra: Using where
1 row in set (0.00 sec)
So to summarize, here are the results:
m.userid = X AND (host, name, sampleid) IN - index used, 85560 rows scanned, runs in 0.9s
(userid, host, name, sampleid) IN - index not used, 171121 rows scanned, runs in 0.5s
additional column with compound index replaced with an index over a concatenated utility column - index used, 17 rows scanned, runs in 0s
Why does second query run faster than the first? And why is the third query so much faster than the rest? Should I keep such a column for the sole purpose of faster searching?
Mysql version is:
mysqld Ver 5.5.34-55 for Linux on x86_64 (Percona XtraDB Cluster (GPL), wsrep_25.9.r3928)

Indexes help your search terms in the WHERE clause by narrowing down the search as much as possible. You can see this happening...
The rows field of EXPLAIN gives an estimate of how many rows the query will have to examine to find the rows that match your query. By comparing the rows reported in each EXPLAIN, you can see how much better your better-optimized query is:
rows: 85560 -- first query
rows: 171121 -- second query examines 2x more rows, but it was probably
-- faster because the data was buffered after the first query
rows: 17 -- third query examines 5,000x fewer rows than first query
You would also notice in the SHOW PROFILE details if you ran that for the third query that "Sending data" is a lot faster for the quicker query. This process state indicates how long it took to copy rows from the storage engine up to the SQL layer of MySQL. Even when doing memory-to-memory copying, this takes a while for so many thousands of rows. This is why indexes are so beneficial.
For more useful explanation, see my presentation How to Design Indexes, Really.

Related

mysql bulk table script execution

I have 120 tables in my project.
Now I have to migrate MSSQL to MySQL.
So I did all Queries to create those tables that are already worked.
Now my problem is when I execute this script in MSSQL it completes within a second.
But MySQL takes around 4 min to complete its execution.
I want to improve my performance in MySQL. But I don't know how to do that if anyone knows please help me.
Thank you
Here is my sample table Script
MySQL
CREATE TABLE `rb_tbl_bak` (
`BakPathId` int NOT NULL AUTO_INCREMENT,
`BakPath` varchar(500) CHARACTER SET utf8 COLLATE utf8_general_ci DEFAULT NULL,
`BakDate` datetime(3) DEFAULT NULL,
PRIMARY KEY (`BakPathId`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
MSSQL
--Create table and its columns
CREATE TABLE [dbo].[RB_Tbl_Bak] (
[BakPathId] [int] NOT NULL IDENTITY (1, 1),
[BakPath] [nvarchar](500) NULL,
[BakDate] [datetime] NULL);
GO
like this way, I have to complete for 120+ tables
Oh well, In this case, MySQL databases take time.
You can turn on profiling to get an idea of what takes so long. An example is given using Mysql's CLI:-
SET profiling = 1;
CREATE TABLE rb_tbl_back (id BIGINT UNSIGNED NOT NULL PRIMARY KEY);
SET profiling = 1;
You should get a response like this:-
mysql> SHOW PROFILES;
| Query_ID | Duration | Query |
+----------+------------+-------------------------------------------------------------+
| 1 | 0.00913800 | CREATE TABLE rb_tbl_back (id BIGINT UNSIGNED NOT NULL PRIMARY KEY) |
+----------+------------+-------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW PROFILE FOR QUERY 1;
+----------------------+----------+
| Status | Duration |
+----------------------+----------+
| starting | 0.000071 |
| checking permissions | 0.000007 |
| Opening tables | 0.001698 |
| System lock | 0.000043 |
| creating table | 0.007260 |
| After create | 0.000004 |
| query end | 0.000004 |
| closing tables | 0.000015 |
| freeing items | 0.000031 |
| logging slow query | 0.000002 |
| cleaning up | 0.000003 |
+----------------------+----------+
11 rows in set (0.00 sec)
If you read the profiling documentation, there are other flags for showing the profile of the query CPU, BLOCK IO, etc that might help you on the 'creating table' stage.
I got this answer from here

Is there a way to hint mysql to use Using index for group-by

I was busying myself with exploring GROUP BY optimizations. On a classical "max salary per departament" query. And suddenly weird results. The dump below goes straight from my console. NO COMMAND were issued between these two EXPLAINS. Only some time had passed.
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 1 | |
| 2 | DERIVED | emploee | index | NULL | dep_id | 8 | NULL | 84 | Using index |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
mysql> explain select name, t1.dep_id, salary
from emploee t1
JOIN ( select dep_id, max(salary) msal
from emploee
group by dep_id
) t2
ON t1.salary=t2.msal and t1.dep_id = t2.dep_id
order by salary desc;
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 4 | Using temporary; Using filesort |
| 1 | PRIMARY | t1 | ref | dep_id | dep_id | 8 | t2.dep_id,t2.msal | 3 | |
| 2 | DERIVED | emploee | range | NULL | dep_id | 4 | NULL | 9 | Using index for group-by |
+----+-------------+------------+-------+---------------+--------+---------+-------------------+------+---------------------------------+
3 rows in set (0.00 sec)
As you may notice, it examined ten times less rows in second run. I assume it's because some inner counters got changed. But I don't want to depend on these counters. So - is there a way to hint mysql to use "Using index for group by" behavior only?
Or - if my speculations are wrong - is there any other explanation on the behavior and how to fix it?
CREATE TABLE `emploee` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) DEFAULT NULL,
`dep_id` int(11) NOT NULL,
`salary` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `dep_id` (`dep_id`,`salary`)
) ENGINE=InnoDB AUTO_INCREMENT=85 DEFAULT CHARSET=latin1 |
+-----------+
| version() |
+-----------+
| 5.5.19 |
+-----------+
Hm, showing the cardinality of indexes may help, but keep in mind: range's are usually slower then indexes there.
Because it think it can match the full index in the first one, it uses the full one. In the second one, it drops the index and goes for a range, but guesses the total number of rows satisfying that larger range wildly lower then the smaller full index, because all cardinality has changed. Compare it to this: why would "AA" match 84 rows, but "A[any character]" match only 9 (note that it uses 8 bytes of the key in the first, 4 bytes in the second)? The second one will in reality not read less rows, EXPLAIN just guesses the number of rows differently after an update on it's metadata of indexes. Not also that EXPLAIN does not tell you what a query will do, but what it probably will do.
Updating the cardinality can or will occur when:
The cardinality (the number of different key values) in every index of a table is calculated when a table is opened, at SHOW TABLE STATUS and ANALYZE TABLE and on other circumstances (like when the table has changed too much). Note that all tables are opened, and the statistics are re-estimated, when the mysql client starts if the auto-rehash setting is set on (the default).
So, assume 'at any point' due to 'changed too much', and yes, connecting with the mysql client can alter the behavior in choosing indexes of a server. Also: reconnecting of the mysql client after it lost its connection after a timeout counts as connecting with auto-rehash AFAIK. If you want to give mysql help to find the proper method, run ANALYZE TABLE once in a while, especially after heavy updating. If you think the cardinality it guesses is often wrong, you can alter the number of pages it reads to guess some statistics, but keep in mind a higher number means a longer running update of that cardinality, and something you don't want to happen that often when 'data has changed to much' on a table with a lot of operations.
TL;DR: it guesses rows differently, but you'd actually prefer the first behavior if the data makes that possible.
Adding:
On this previously linked page, we can probably also find why especially dep_id might have this problem:
small values like 1 or 2 can result in very inaccurate estimates of cardinality
I could imagine the number of different dep_id's is typically quite small, and I've indeed observed a 'bouncing' cardinality on non-unique indexes with quite a small range compared to the number of rows in my own databases. It easily guesses a range of 1-10 in the hundreds and then down again the next time, just based on the specific sample pages it picks & some algorithm that tries to extrapolate that.

MySQL count(*) , Group BY and INNER JOIN

I have a really bad time with a query on MySQL 5.1.
I simplified the 2 tables I make a JOIN on :
CREATE TABLE `jobs` (
`id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`title` VARCHAR( 255 ) NOT NULL
) ENGINE = MYISAM ;
AND
CREATE TABLE `jobsCategories` (
`jobID` int(11) NOT NULL,
`industryID` int(11) NOT NULL,
KEY `jobID` (`jobID`),
KEY `industryID` (`industryID`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
The query is straight forward :
SELECT count(*) as nb,industryID
FROM jobs J
INNER JOIN jobsCategories C ON C.jobID=J.id
GROUP BY industryID
ORDER BY nb DESC;
I got around 150000 records into the jobs table, and 350000 records into the jobsCategories table, and I have 30 industries;
The query takes approximatively 50 seconds to execute !!!
Do you have any idea why it takes so long? How could I optimize the structure of this database? Profilling the query show me that 99% of the execution time is spend on copying on tmp tables.
EXPLAIN <query> gives me :
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: J
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 4
ref: NULL
rows: 178950
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: C
type: ref
possible_keys: jobID
key: jobID
key_len: 8
ref: J.id
rows: 1
Extra: Using where
2 rows in set (0.00 sec)
About the memory :
free -m :
total used free shared buffers cached
Mem: 2011 1516 494 0 8 1075
-/+ buffers/cache: 433 1578
Swap: 5898 126 5772
With the FORCE INDEX suggested below
select count(*) as nb, industryID
from
jobs J
inner join jobsCategories C force index (industryID) on (C.jobID = J.id )
group by industryID
order by nb DESC;
SHOW PROFILE;
gives me :
+----------------------+----------+
| Status | Duration |
+----------------------+----------+
| starting | 0.000095 |
| Opening tables | 0.000014 |
| System lock | 0.000008 |
| Table lock | 0.000007 |
| init | 0.000032 |
| optimizing | 0.000011 |
| statistics | 0.000032 |
| preparing | 0.000016 |
| Creating tmp table | 0.000031 |
| executing | 0.000003 |
| Copying to tmp table | 3.301305 |
| Sorting result | 0.000028 |
| Sending data | 0.000024 |
| end | 0.000003 |
| removing tmp table | 0.000009 |
| end | 0.000004 |
| query end | 0.000003 |
| freeing items | 0.000029 |
| logging slow query | 0.000003 |
| cleaning up | 0.000003 |
+----------------------+----------+
I guess my RAM (2Gb) is not large enough. How can I be certain this is the case?
Firstly I think that you don't need to join table jobs in order to get the same result (unless you have some garbage data in table jobsCategories):
select count(*) as nb, industryID
from jobsCategories
group by industryID
order by nb DESC;
Otherwise you may try to force index on industryID:
select count(*) as nb, industryID
from
jobs J
inner join jobsCategories C force index (industryID) on (C.jobID = J.id )
group by industryID
order by nb DESC;
change your tables to InnoDB =) InnoDB is good managing big tables and the COUNT(*) to make it faster
http://www.mysqlperformanceblog.com/2009/01/12/should-you-move-from-myisam-to-innodb/
Good Luck
EDIT:
after testing, it seems that MyISAM is faster than InnoDB when using COUNT(*) when there is no WHERE clause:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
anyway, i've tested your exact query simulating the tables that you have (150k Jobs and 300k JobsCategories) using MyISAM tables and it took 1.5 seconds so maybe your problem is elsewhere.. it's all i can tell you =P
Hope I'm not misinterpreting the reading, but from what I see, you don't need ANY join. Since your grouping is how many jobs fall under each respective industry, its all in your job categories table, why join to the actual job table for the title of the job since that is not even being returned
select IndustryID,
count(*) JobsPerIndustry
from JobCategories
group by IndustryID
EDIT PER COMMENT / FEEDBACK...
That definitely makes a difference... adding a criteria associated with a job... Ensure your Jobs table has an index on the element you are expecting to allow limiting based on... Then follow similar query like you originally had. Ensure your Jobs table has an index on CountryID.
SELECT
count(*) as nb,
industryID
FROM jobs J
JOIN jobsCategories C
ON J.ID = C.jobID
WHERE
J.countryID=1234
GROUP BY
industryID
ORDER BY
nb DESC;

SELECTing non-indexed column increases 'sending data' 25x - why and how to improve?

Given this table on local MySQL instance 5.1 with query caching off:
show create table product_views\G
*************************** 1. row ***************************
Table: product_views
Create Table: CREATE TABLE `product_views` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`dateCreated` datetime NOT NULL,
`dateModified` datetime DEFAULT NULL,
`hibernateVersion` bigint(20) DEFAULT NULL,
`brandName` varchar(255) DEFAULT NULL,
`mfrModel` varchar(255) DEFAULT NULL,
`origin` varchar(255) NOT NULL,
`price` float DEFAULT NULL,
`productType` varchar(255) DEFAULT NULL,
`rebateDetailsViewed` tinyint(1) NOT NULL,
`rebateSearchZipCode` int(11) DEFAULT NULL,
`rebatesFoundAmount` float DEFAULT NULL,
`rebatesFoundCount` int(11) DEFAULT NULL,
`siteSKU` varchar(255) DEFAULT NULL,
`timestamp` datetime NOT NULL,
`uiContext` varchar(255) DEFAULT NULL,
`siteVisitId` bigint(20) NOT NULL,
`efficiencyLevel` varchar(255) DEFAULT NULL,
`siteName` varchar(255) DEFAULT NULL,
`clicks` varchar(1024) DEFAULT NULL,
`rebateFormDownloaded` tinyint(1) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `siteVisitId` (`siteVisitId`,`siteSKU`),
KEY `FK52C29B1E3CAB9CC4` (`siteVisitId`),
KEY `rebateSearchZipCode_idx` (`rebateSearchZipCode`),
KEY `FIND_UNPROCESSED_IDX` (`siteSKU`,`siteVisitId`,`timestamp`),
CONSTRAINT `FK52C29B1E3CAB9CC4` FOREIGN KEY (`siteVisitId`) REFERENCES `site_visits` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=32909504 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
This query takes ~3s:
SELECT pv.id, pv.siteSKU
FROM product_views pv
CROSS JOIN site_visits sv
WHERE pv.siteVisitId = sv.id
AND pv.siteSKU = 'foo'
AND sv.siteId = 'bar'
AND sv.postProcessed = 1
AND pv.timestamp >= '2011-05-19 00:00:00'
AND pv.timestamp < '2011-06-18 00:00:00';
But this one (non-indexed column added to SELECT) takes ~65s:
SELECT pv.id, pv.siteSKU, pv.hibernateVersion
FROM product_views pv
CROSS JOIN site_visits sv
WHERE pv.siteVisitId = sv.id
AND pv.siteSKU = 'foo'
AND sv.siteId = 'bar'
AND sv.postProcessed = 1
AND pv.timestamp >= '2011-05-19 00:00:00'
AND pv.timestamp < '2011-06-18 00:00:00';
Nothing in 'where' or 'from' clauses is different. All the extra time is spent in 'sending data':
mysql> show profile for query 1;
+--------------------+-----------+
| Status | Duration |
+--------------------+-----------+
| starting | 0.000155 |
| Opening tables | 0.000029 |
| System lock | 0.000007 |
| Table lock | 0.000019 |
| init | 0.000072 |
| optimizing | 0.000032 |
| statistics | 0.000316 |
| preparing | 0.000034 |
| executing | 0.000002 |
| Sending data | 63.530402 |
| end | 0.000044 |
| query end | 0.000005 |
| freeing items | 0.000091 |
| logging slow query | 0.000002 |
| logging slow query | 0.000109 |
| cleaning up | 0.000004 |
+--------------------+-----------+
16 rows in set (0.00 sec)
I understand that using a non-indexed column in where clause would slow things down, but why here? What can be done to improve the latter case - given that I will actually want to SELECT(*) from product_views?
EXPLAIN Output
explain extended select pv.id, pv.siteSKU from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU='foo' and sv.sit eId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filt ered | Extra |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+--------------------------+ | 1 | SIMPLE | pv | ref | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258 | const | 41810 | 10
0.00 | Using where; Using index | | 1 | SIMPLE | sv | eq_ref | PRIMARY,post_processed_idx | PRIMARY | 8 | clabs.pv.siteVisitId | 1 | 10
0.00 | Using where |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+--------------------------+ 2 rows in set, 1 warning (0.00 sec)
mysql> explain extended select pv.id, pv.siteSKU, pv.hibernateVersion from product_views pv cross join site_visits sv where pv.siteVisitId=sv.id and pv.siteSKU= 'foo' and sv.siteId='bar' and sv.postProcessed=1 and pv.timestamp>='2011-05-19 00:00:00' and pv.timestamp<'2011-06-18 00:00:00';
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filt ered | Extra |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+-------------+ | 1 | SIMPLE | pv | ref | siteVisitId,FK52C29B1E3CAB9CC4,FIND_UNPROCESSED_IDX | FIND_UNPROCESSED_IDX | 258 | const | 41810 | 10
0.00 | Using where | | 1 | SIMPLE | sv | eq_ref | PRIMARY,post_processed_idx | PRIMARY | 8 | clabs.pv.siteVisitId | 1 | 10
0.00 | Using where |
+----+-------------+-------+--------+-----------------------------------------------------+----------------------+---------+----------------------+-------+-----
-----+-------------+ 2 rows in set, 1 warning (0.00 sec)
UPDATE1: Splitting into 2 queries brings total time down to ~30s range
Not sure why, but splitting the latter query into the following reduces lat. from 65s to ~30s:
1) SELECT pv.id .... //from, where clauses same as above
2) SELECT * FROM product_views where id in (idList); //idList
UPDATE2: TABLE SIZE
table has on the order of 10M rows
query returns about 3k rows
When you select only indexed columns, MySQL does read only the index, and does not need to read the table data. This, as far as I remember, is called index-covered query. However, when there are columns, that are not present in the used index, MySQL needs to open the table and read the data from it. This is the reason index-covered queries to be much faster.
See Using Covering Indexes to Improve Query Performance.
As for the improvement, how many rows are in the table, how much the query returns and what is your buffer pool size, how much RAM is available, etc.?
From what I have read about show profile, 'sending data' is a portion of execution process, and has almost nothing to do with sending actual data to the client. You can take a look on this thread
Also, mysql docs says about "Sending data" :
The thread is reading and processing rows for a SELECT statement, and sending data to the client. Because operations occurring during this state tend to perform large amounts of disk access (reads), it is often the longest-running state over the lifetime of a given query.
In my opinion, mysql would better not mix together "reading and processing rows for a SELECT statement" and "sending data" in one state, especially in state called "sending" data" which causes lots of confusion.
I'm don't know MySQL internals at all, but Darhazer's explanation looks like the winner to me. When the non-indexed field is added, the entire row must be retrieved. And your rows are very wide. I can't quite tell from the names how (if at all) it is denormalized, but I suspect it is. site name and site sku smell like they belong in a site lookup table with an FK. rebates found amount and rebates found count sound like statistics that should be coming from a join to a separate product rebate table. etc.

Why does MySQL not use an index when executing this query?

mysql> desc users;
+-------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| email | varchar(128) | NO | UNI | | |
| password | varchar(32) | NO | | | |
| screen_name | varchar(64) | YES | UNI | NULL | |
| reputation | int(10) unsigned | NO | | 0 | |
| imtype | varchar(1) | YES | MUL | 0 | |
| last_check | datetime | YES | MUL | NULL | |
| robotno | int(10) unsigned | YES | | NULL | |
+-------------+------------------+------+-----+---------+----------------+
8 rows in set (0.00 sec)
mysql> create index i_users_imtype_robotno on users(imtype,robotno);
Query OK, 24 rows affected (0.25 sec)
Records: 24 Duplicates: 0 Warnings: 0
mysql> explain select * from users where imtype!='0' and robotno is null;
+----+-------------+-------+------+------------------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------------+------+---------+------+------+-------------+
| 1 | SIMPLE | users | ALL | i_users_imtype_robotno | NULL | NULL | NULL | 24 | Using where |
+----+-------------+-------+------+------------------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)
But this way,it's used:
mysql> explain select * from users where imtype in ('1','2') and robotno is null;
+----+-------------+-------+-------+------------------------+------------------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------------+------------------------+---------+------+------+-------------+
| 1 | SIMPLE | users | range | i_users_imtype_robotno | i_users_imtype_robotno | 11 | NULL | 3 | Using where |
+----+-------------+-------+-------+------------------------+------------------------+---------+------+------+-------------+
1 row in set (0.01 sec)
Besides,this one also did not use index:
mysql> explain select id,email,imtype from users where robotno=1;
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | users | ALL | NULL | NULL | NULL | NULL | 24 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.00 sec)
SELECT *
FROM users
WHERE imtype != '0' and robotno is null
This condition is not satisified by a single contiguous range of (imtype, robotno).
If you have records like this:
imtype robotno
$ NULL
$ 1
0 NULL
0 1
1 NULL
1 1
2 NULL
2 1
, ordered by (imtype, robotno), then the records 1, 5 and 7 would be returned, while other records wouldn't.
You'll need create this index to satisfy the condition:
CREATE INDEX ix_users_ri ON users (robotno, imptype)
and rewrite your query a little:
SELECT *
FROM users
WHERE (
robotno IS NULL
AND imtype < '0'
)
OR
(
robotno IS NULL
AND imtype > '0'
)
, which will result in two contiguous blocks:
robotno imtype
--- first block start
NULL $
--- first block end
NULL 0
--- second block start
NULL 1
NULL 2
--- second block end
1 $
1 0
1 1
1 2
This index will also serve this query:
SELECT id, email, imtype
FROM users
WHERE robotno = 1
, which is not served now by any index for the same reason.
Actually, the index for this query:
SELECT *
FROM users
WHERE imtype in ('1', '2')
AND robotno is null
is used only for coarse filtering on imtype (note using where in the extra field), it doesn't range robotno's
You need an index that has robotno as the first column. Your existing index is (imtype,robotno). Since imtype is not in the where clause, it can't use that index.
An index on (robotno,imtype) could be used for queries with just robotno in the where clause, and also for queries with both imtype and robotno in the where clause (but not imtype by itself).
Check out the docs on how MySQL uses indexes, and look for the parts that talk about multi-column indexes and "leftmost prefix".
BTW, if you think you know better than the optimizer, which is often the case, you can force MySQL to use a specific index by appending
FORCE INDEX (index_name) after FROM users.
It's because 'robotno' is potentially a primary key, and it uses that instead of the index.
A database systems query planner determines whether to do an index scan or not by analyzing the selectivity of the query's where clause relative to the index. (Indexes are also used to join tables together, but you only have users here.)
The first query has where imtype != '0'. This would select nearly all of the rows in users, assuming you have a large number of distinct values of imtype. The inequality operator is inherently unselective. So the MySQL query planner is betting here that reading through the index won't help and that it may as well just do a sequential scan through the whole table, since it probably would have to do that anyway.
On the other hand, had you said where imtype ='0', equality is a highly selective operator, and MySQL would bet that by reading just a few index blocks it could avoid reading nearly all of the blocks of the users table itself. So it would pick the index.
In your second example, where imtype in ('1','2'), MySQL knows that the index will be highly selective (though only half as selective as where imtype = '0'), and it will again bet that using the index will lead to a big payoff, as you discovered.
In your third example, where robotno=1, MySQL probably can't effectively use the index on users(imtype,robotno) since it would need to read in all the index blocks to find the robotno=1 record numbers: the index is sorted by imtype first, then robotno. If you had another index on users(robotno), MySQL would eagerly use it though.
As a footnote, if you had two indexes, one on users(imtype), and the other on users(imtype,robotno), and your query was on where imtype = '0', either index would make your query fast, but MySQL would probably select users(imtype) simply because it's more compact and fewer blocks would need to be read from it.
I'm being very simplistic here. Early database systems would just look at imtype's datatype and make a very rough guess at the selectivity of your query, but people very quickly realized that giving the query planner interesting facts like the total size of the table, the number of ditinct values in each column, etc. would enable it to make much smarter decisions. For instance if you had a users table where imtype was only every '0' or '1', the query planner might choose the index, since in that case the where imtype != '0' is more selective.
Take a look at the MySQL UPDATE STATISTICS statement and you'll see that its query planner must be sophisticated. For that reason I'd hesitate a great deal before using the FORCE statement to dictate a query plan to it. Instead, use UPDATE STATISTICS to give the query planner improved information to base its decisions on.
Your index is over users(imtype,robotno). In order to use this index, either imtype or imtype and robotno must be used to qualify the rows. You are just using robotno in your query, thus it can't use this index.