I have a sales data table into which an average of 1,329,415 rows are inserted daily. I have to generate reports from the table daily in different formats, but queries against the table are far too slow. Here is my SHOW CREATE TABLE output.
CREATE TABLE `query_manager_table` (
`mtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`region_id` int(2) NOT NULL,
`rtslug` varchar(10) DEFAULT NULL,
`dsid` int(3) NOT NULL,
`dpid` int(3) NOT NULL,
`route_number` int(4) NOT NULL,
`route_id` int(11) NOT NULL,
`rtlid` int(11) NOT NULL,
`retailer_code` varchar(16) DEFAULT NULL,
`platform_code` varchar(16) DEFAULT NULL,
`prid` int(4) NOT NULL,
`skid` int(4) NOT NULL,
`group` int(4) NOT NULL,
`family` int(4) NOT NULL,
`volume` float DEFAULT NULL,
`value` float(7,2) DEFAULT NULL,
`date` date NOT NULL DEFAULT '0000-00-00',
`outlets` int(4) NOT NULL,
`visited` int(4) NOT NULL,
`channel` int(3) DEFAULT NULL,
`subchannel` int(3) DEFAULT NULL,
`tpg` int(4) DEFAULT NULL,
`ioq` int(10) DEFAULT NULL,
`sales_time` int(11) DEFAULT NULL,
PRIMARY KEY (`dpid`,`route_id`,`rtlid`,`prid`,`skid`,`date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY LIST (YEAR(date) * 100 + QUARTER(date))
(PARTITION y2017q1 VALUES IN (201701) ENGINE = InnoDB,
PARTITION y2017q2 VALUES IN (201702) ENGINE = InnoDB,
PARTITION y2017q3 VALUES IN (201703) ENGINE = InnoDB,
PARTITION y2017q4 VALUES IN (201704) ENGINE = InnoDB,
PARTITION y2018q1 VALUES IN (201801) ENGINE = InnoDB,
PARTITION y2018q2 VALUES IN (201802) ENGINE = InnoDB,
PARTITION y2018q3 VALUES IN (201803) ENGINE = InnoDB,
PARTITION y2018q4 VALUES IN (201804) ENGINE = InnoDB,
PARTITION y2019q1 VALUES IN (201901) ENGINE = InnoDB,
PARTITION y2019q2 VALUES IN (201902) ENGINE = InnoDB,
PARTITION y2019q3 VALUES IN (201903) ENGINE = InnoDB,
PARTITION y2019q4 VALUES IN (201904) ENGINE = InnoDB) */
Now I just want to know the sales by retailer from 1st September to 8th September, using the following query:
SELECT
query_manager_table.dpid,
query_manager_table.route_id,
query_manager_table.rtlid,
query_manager_table.prid,
SUM(query_manager_table.`volume`) AS sales,
1 AS memos
FROM
query_manager_table
WHERE
query_manager_table.date BETWEEN '2018-09-01'
AND '2018-09-08'
GROUP BY
query_manager_table.dpid,
query_manager_table.rtlid,
query_manager_table.date
But it takes about 500-700 seconds. I then added dpid IN (1,2,.....) AND prid IN (1,2,....) to the WHERE clause, since both fields are part of the primary key, and the output now comes back after about 300 seconds. What am I doing wrong? The EXPLAIN for the original query is:
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | ALL | PRIMARY | NULL | NULL | NULL | 129065467 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
When I add all dpid and prid values to the WHERE condition, the EXPLAIN looks like this:
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | range | PRIMARY | PRIMARY | 4 | NULL | 128002 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
Is there any way to optimize the table or the query?
If I run EXPLAIN PARTITIONS SELECT... for the first one, I get -
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | y2017q1,y2017q2,y2017q3,y2017q4,y2018q1,y2018q2,y2018q3,y2018q4,y2019q1,y2019q2,y2019q3,y2019q4 | ALL | PRIMARY | NULL | NULL | NULL | 127129410 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
For the 2nd one I get -
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | y2017q1,y2017q2,y2017q3,y2017q4,y2018q1,y2018q2,y2018q3,y2018q4,y2019q1,y2019q2,y2019q3,y2019q4 | range | PRIMARY | PRIMARY | 4 | NULL | 153424 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
INDEXes are used for efficiency in SELECTs.
The one PRIMARY KEY (in MySQL) is, by definition, a unique INDEX. It should have a minimal set of columns that uniquely identify a row.
Any unique index (including the PK) is also a "uniqueness constraint" -- this prevents inserting multiple rows with the same set of values.
Indexes are used "from the left". That is, with INDEX(a,b), if a is not useful, it won't get to the b.
PARTITION BY LIST is virtually useless. It rarely, if ever, improves performance. You have shown us a couple of queries; let's see more of the typical queries so we can help you with indexes and partitioning.
WHERE
query_manager_table.date BETWEEN '2018-09-01'
AND '2018-09-08'
begs for INDEX(date). In a composite index, the columns after a 'range' won't be reached. That is, in INDEX(date, x, y), testing date for a range (such as the 8 days in the WHERE), won't let it make use of x or y. On the other hand, WHERE date = '2018-09-01' AND x=1 will make use of more of the index.
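A minimal sketch of adding that index, assuming the table from the question. The index name idx_date is only illustrative, the SELECT list is trimmed to the grouped columns for brevity, and building an index on a table of this size may take a long time:

```sql
-- Hypothetical index name; `date` is backquoted as in the original DDL.
ALTER TABLE query_manager_table
  ADD INDEX idx_date (`date`);

-- Re-check the plan afterwards: "key" should no longer be NULL
-- if the optimizer decides the date range is selective enough.
EXPLAIN PARTITIONS
SELECT dpid, rtlid, `date`, SUM(`volume`) AS sales
FROM query_manager_table
WHERE `date` BETWEEN '2018-09-01' AND '2018-09-08'
GROUP BY dpid, rtlid, `date`;
```

Whether the optimizer actually picks the new index depends on how large a share of the table the date range covers.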
float(7,2) -- don't use the (m,n) option on FLOAT or DOUBLE. Instead, switch to DECIMAL.
INT is always 4 bytes. See TINYINT (1 byte), SMALLINT (2 bytes), etc. This, alone, may cut the table size in half.
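As a rough sketch of both datatype suggestions: the column choices below are assumptions, not something the question confirms. They are only safe if the existing values fit the smaller types, so check the actual ranges first:

```sql
-- Check the real value ranges before shrinking anything.
SELECT MIN(region_id), MAX(region_id),
       MIN(channel),   MAX(channel),
       MIN(subchannel), MAX(subchannel)
FROM query_manager_table;

-- Assumes the values above all fit in 0..255 (TINYINT UNSIGNED).
-- NULL-ability is kept exactly as in the original SHOW CREATE TABLE.
ALTER TABLE query_manager_table
  MODIFY COLUMN `value`    DECIMAL(7,2)     DEFAULT NULL,
  MODIFY COLUMN region_id  TINYINT UNSIGNED NOT NULL,
  MODIFY COLUMN channel    TINYINT UNSIGNED DEFAULT NULL,
  MODIFY COLUMN subchannel TINYINT UNSIGNED DEFAULT NULL;
```

The same idea would apply to the other INT columns, but the ones in the PRIMARY KEY are left alone here because changing them rebuilds the clustered index.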
To explain this:
PRIMARY KEY (`dpid`,`route_id`, ...
WHERE ... AND dpid IN (...) AND ...
manages to use the first (remember: 'leftmost') for the pseudo-range IN, but can't use anything else in the PK since route_id is next.
This explains why the second EXPLAIN has a smaller "Rows". Also, note the "4" in "key_len" -- that's the number of bytes in dpid.
After you have made some of those changes, come back so we can discuss using Summary Tables to speed things up. However, "modify" may lead to complexity in this optimization.
How much RAM do you have? What is the value of innodb_buffer_pool_size?
Don't use GUIDs unless you must; they slow actions on large tables down due to the randomness.
I would not combine actual data fields to make a primary key. I would have a single field, and use an auto-incrementing integer or perhaps a GUID for the value. Having to go through six fields to identify a unique record takes more time than going through one, and as you say you run the risk of duplicate fields if a user is entering key data.
If you have business reasons to make those six fields unique when taken together, you should also work up a routine to identify whether or not an inserted record duplicates an existing one with respect to these fields. If you are batch inserting, you'll want to do this after inserting the records rather than checking each one as you insert it. You'll also want to index these six fields, to speed up your query for duplicates.
As for your SELECT query, you'll probably want to index the fields in your WHERE clause. In any case, you'll want to read up on execution plans and experiment with different indexes and key structures (probably easier to do on a subset of your data). Google "mysql execution plan" for lots of information.
Related
I have a MySQL (v8.0.30) table that stores trades, the schema is the following:
CREATE TABLE `log_fill` (
`id` bigint NOT NULL AUTO_INCREMENT,
`orderId` varchar(63) NOT NULL,
`clientOrderId` varchar(36) NOT NULL,
`symbol` varchar(31) NOT NULL,
`executionId` varchar(255) DEFAULT NULL,
`executionSide` tinyint NOT NULL COMMENT '0 = long, 1 = short',
`executionSize` decimal(15,2) NOT NULL,
`executionPrice` decimal(21,8) unsigned NOT NULL,
`executionTime` bigint unsigned NOT NULL,
`executionValue` decimal(21,8) NOT NULL,
`executionFee` decimal(13,8) NOT NULL,
`feeAsset` varchar(63) DEFAULT NULL,
`positionSizeBeforeFill` decimal(21,8) DEFAULT NULL,
`apiKey` int NOT NULL,
`side` varchar(20) DEFAULT NULL,
`reconciled` tinyint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `executionId` (`executionId`,`executionSide`),
KEY `apiKey` (`apiKey`)
) ENGINE=InnoDB AUTO_INCREMENT=6522695 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
As you can see, there's a BTREE index on the column apiKey which stands for a user, this way I can quickly retrieve all trades for a specific user.
My goal is a query that returns positionSizeBeforeFill + executionSize for the last record, given an apiKey and a symbol. So I wrote the following:
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
However the execution is extremely slow (around 100ms). I've noticed that running either WHERE or ORDER BY (and not both together) drastically reduces execution time. For example
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
only takes 220 microseconds to execute. The number of records after filtering by apiKey and symbol is 388.
Similarly,
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
ORDER BY id DESC
takes 26 microseconds (on a 3 million records table).
All in all, separately running WHERE and ORDER BY takes microseconds of execution, when I combine them we scale up to milliseconds (around 1000x more).
Running EXPLAIN on the slow query it turns out it has to examine 116032 rows.
I tried to create a temporary table hoping for MySQL to perform sorting only on the filtered records, but the outcome is the same. Was wondering whether the problem might be the index (whose cardinality is 203), but how can it be the case when WHERE alone takes very little time? I could not find similar cases on other questions or forums. I think I just fail at understanding how InnoDB selects data, I thought it would first filter by WHERE and then perform ORDER BY on the filtered rows. How can I improve this? Thanks!
Edit: The EXPLAIN statement on the slow query returns
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+----------------------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where; Backward index scan |
The query with WHERE only
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+------+---------------+--------+---------+-------+--------+----------+-------------+
| 1 | SIMPLE | log_fill_tmp | NULL | ref | apiKey | apiKey | 4 | const | 116032 | 10.00 | Using where |
The query with ORDER BY only on the full table
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------------+------------+-------+---------------+---------+---------+------+---------+----------+---------------------+
| 1 | SIMPLE | log_fill_tmp | NULL | index | NULL | PRIMARY | 8 | NULL | 2503238 | 100.00 | Backward index scan |
26 microseconds for any query against a 3-million-row table is suspiciously fast. On MySQL versions before 8.0 that would imply the Query cache is enabled, and rerunning the timings with SELECT SQL_NO_CACHE ... would rule it out; 8.0 removed the Query cache, so check how the time is being measured. (Even milliseconds would be suspicious.) Were all 3M rows returned? Shoveling that many rows probably takes more than a second under any circumstances.
Meanwhile, to speed up the first two queries, add
INDEX(symbol, apikey, id)
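Spelled out as statements, this might look like the sketch below. The index name is made up, and the LIMIT 1 is an assumption based on the stated goal of fetching only the last record (the query in the question has no LIMIT):

```sql
-- Hypothetical index name; column order follows the suggestion above.
ALTER TABLE log_fill
  ADD INDEX idx_symbol_apikey_id (symbol, apiKey, id);

-- Both equality tests and the ORDER BY can be satisfied from the index,
-- so only the newest matching entry needs to be read.
SELECT positionSizeBeforeFill + executionSize
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
ORDER BY id DESC
LIMIT 1;
```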
EXPLAIN gives only approximate (eg, 116032) counts. A "cardinality" of 203 is also just an estimate, but it is used by the Optimizer in some situations. Please get the exact count just to check that there really are any rows:
SELECT COUNT(*)
FROM log_fill
WHERE apiKey = 90 AND symbol = 'ABCD'
With the ORDER BY id DESC alone, it will scan the entire B+Tree that holds the data. As it says, it will do a 'Backward index scan'. However, since the "index" is the PRIMARY KEY and the PK is clustered with the data, it is really referring to the data's BTree.
The EXPLAIN for the first query shows that it used the apiKey index and avoided the sort (ORDER BY) by reading that index backward: an InnoDB secondary index implicitly ends with the PRIMARY KEY, so the entries for one apiKey are already in id order. It then had to look up each of those rows to test symbol, ignoring any rows that did not match the WHERE.
I added a composite index on (apiKey, symbol) and now the query runs in as little as 0.2ms. To shorten the index creation time I reduced the number of records from 3M to 500K; the gain in time is about 97%, and I believe it will be even more on the full table. I thought that with just the apiKey index it would first filter by user, then by symbol, but I'm probably wrong.
Problem with MySQL version 5.7.18. Earlier versions of MySQL behave as expected.
Here are two tables. Table 1:
CREATE TABLE `test_events` (
`id` int(11) NOT NULL,
`event` int(11) DEFAULT '0',
`manager` int(11) DEFAULT '0',
`base_id` int(11) DEFAULT '0',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`client` int(11) DEFAULT '0',
`event_time` datetime DEFAULT '0000-00-00 00:00:00'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `test_events`
ADD PRIMARY KEY (`id`),
ADD KEY `client` (`client`),
ADD KEY `event_time` (`event_time`),
ADD KEY `manager` (`manager`),
ADD KEY `base_id` (`base_id`),
ADD KEY `create_time` (`create_time`);
And the second table:
CREATE TABLE `test_event_types` (
`id` int(11) NOT NULL,
`name` varchar(255) DEFAULT NULL,
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`base` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `test_event_types`
ADD PRIMARY KEY (`id`);
Let's try to select last event from base "314":
EXPLAIN SELECT `test_events`.`create_time`
FROM `test_events`
LEFT JOIN `test_event_types`
ON ( `test_events`.`event` = `test_event_types`.`id` )
WHERE base = 314
ORDER BY `test_events`.`create_time` DESC
LIMIT 1;
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | test_events | NULL | ALL | NULL | NULL | NULL | NULL | 434928 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | test_event_types | NULL | ALL | PRIMARY | NULL | NULL | NULL | 44 | 2.27 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
MySQL is not using an index and reads the whole table.
Without the WHERE clause:
EXPLAIN SELECT `test_events`.`create_time`
FROM `test_events`
LEFT JOIN `test_event_types`
ON ( `test_events`.`event` = `test_event_types`.`id` )
ORDER BY `test_events`.`create_time` DESC
LIMIT 1;
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
| 1 | SIMPLE | test_events | NULL | index | NULL | create_time | 4 | NULL | 1 | 100.00 | NULL |
| 1 | SIMPLE | test_event_types | NULL | eq_ref | PRIMARY | PRIMARY | 4 | m16.test_events.event | 1 | 100.00 | Using index |
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
Now it uses the index.
MySQL 5.5.55 uses the index in both cases. Why is this, and what can be done about it?
I don't know what difference you are seeing between your previous and current installations, but the server's behaviour makes sense.
SELECT test_events.create_time FROM test_events LEFT JOIN test_event_types ON ( test_events.event = test_event_types.id ) ORDER BY test_events.create_time DESC LIMIT 1;
In this query you do not have a where clause but you are fetching one row only. And that's after sorting by create_time which happens to have an index. And that index can be used for sorting. But let's see the second query.
SELECT test_events.create_time FROM test_events LEFT JOIN test_event_types ON ( test_events.event = test_event_types.id ) WHERE base = 314 ORDER BY test_events.create_time DESC LIMIT 1
You don't have an index on the base column, so no index can be used for that condition. To find the relevant records MySQL has to do a table scan. Having identified the relevant rows, they need to be sorted. But in this case the query planner has decided that it's just not worth it to use the index on create_time.
I see several problems with your setup, the first being not having an index on base, as already mentioned. But why is base a varchar? You appear to be storing integers in it.
ALTER TABLE test_events
ADD PRIMARY KEY (id),
ADD KEY client (client),
ADD KEY event_time (event_time),
ADD KEY manager (manager),
ADD KEY base_id (base_id),
ADD KEY create_time (create_time);
And making multiple single-column indexes like this doesn't make much sense in MySQL. That's because MySQL generally uses only one index per table in a query. You would be far better off with one or two indexes, possibly multi-column ones.
I think your ideal index would contain both create_time and event fields
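A sketch of such an index, with a made-up name. The column order (create_time first, so the index can still serve the ORDER BY) is an assumption, as the answer does not specify it:

```sql
-- Hypothetical index name; covers the sort column plus the join column.
ALTER TABLE test_events
  ADD INDEX idx_create_time_event (create_time, event);
```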
base = 314 with base VARCHAR... is a performance problem, because comparing a string column with a number prevents use of an index on it. Either put quotes around 314 or make base some integer type.
You appear not to need LEFT. If not, then do a plain JOIN so that the optimizer has the freedom to start with an INDEX(base), which is currently missing and needs to be added.
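Putting those two points together gives something like the sketch below. It assumes the LEFT really is unnecessary and that base stays a VARCHAR; the index name idx_base and the table aliases are illustrative only:

```sql
-- Hypothetical index name on the filtered column.
ALTER TABLE test_event_types
  ADD INDEX idx_base (base);

-- Plain JOIN, and a quoted literal so the VARCHAR column
-- is compared as a string and the index can be used.
SELECT e.create_time
FROM test_events AS e
JOIN test_event_types AS t
  ON e.event = t.id
WHERE t.base = '314'
ORDER BY e.create_time DESC
LIMIT 1;
```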
As for the differences between 5.5 and 5.6 and 5.7, there have been a number of Optimization changes; you may have encountered a regression. But I don't want to chase that until you have improved the query and indexes.
I stumbled upon the same scenario, where MySQL was using a table scan instead of an INDEX search.
This could be because of one of the reasons mentioned in the MySQL docs:
The table is so small that it is faster to perform a table scan than to bother with a key lookup. This is common for tables with fewer than 10 rows and a short row length.
mysql docs link
And when I checked the EXPLAIN of the query on the production server, which has a large number of rows, it used an INDEX search as expected.
It's one of the MySQL optimizations, under the hood :)
Greeting.
Let me show my table schema first:
CREATE TABLE `log_table` (
`rid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`dataId` int(10) unsigned NOT NULL DEFAULT '0',
`memberId` int(10) unsigned NOT NULL DEFAULT '0',
`clientId` int(10) unsigned NOT NULL DEFAULT '0',
`qty` int(11) NOT NULL DEFAULT '0',
`timestamp` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`typeA` tinyint(2) DEFAULT NULL,
`typeB` int(11) DEFAULT '0',
PRIMARY KEY (`rid`,`timestamp`),
KEY `idx_report1` (`timestamp`,`memberId`,`dataId`),
KEY `idx_report2` (`memberId`,`timestamp`),
KEY `idx_report3` (`dataId`,`timestamp`,`rid`),
KEY `idx_report4` (`timestamp`,`typeB`,`typeA`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY RANGE (year(`timestamp`))
(PARTITION p2014 VALUES LESS THAN (2015),
PARTITION p2015 VALUES LESS THAN (2016)
);
I'm using MariaDB 5.5 and this table contains 25 million records, so I decided to partition the table to prevent performance issues that may occur in the near future.
As you can see, it holds time-series log data, and there are 4 views on it. For example, one of the views uses the following query:
select typeB, typeA, count(*) as number from log_table where timestamp between '2015-1-1' and '2015-2-1' group by typeB, typeA;
AFAIK, this query loads data from p2015 only, thanks to partition pruning. But I saw not much difference in query execution time between the original table and the partitioned version (avg 1.94 sec vs 1.95 sec).
Hm, I thought it might be influenced by the number of rows in each partition. So how about smaller partitions, using to_days()?
PARTITION BY RANGE (to_days(`timestamp`))
(
...
PARTITION p_2015_01 VALUES LESS THAN (to_days('2015-2-1')),
PARTITION p_2015_02 VALUES LESS THAN (to_days('2015-3-1'))
...
)
Well, there's no effect. Could you let me know what I'm missing?
EDIT: sorry for my error in the query. By the way, EXPLAIN PARTITIONS doesn't help me.
The EXPLAIN results for both tables are:
// original
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | org_table | range | idx_report1,idx_report4 | idx_report4 | 8 | NULL | 8828000 | Using where; Using index; Using temporary; Using filesort |
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
//partition
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | log_table | range | idx_report1,idx_report4 | idx_report4 | 8 | NULL | 7902646 | Using where; Using index; Using temporary; Using filesort |
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
PARTITIONing does not help performance nearly as often as users think it will.
KEY `idx_report4` (`timestamp`,`typeB`,`typeA`)
without partitioning is optimal for the SELECT you provided. PARTITIONing will not speed it up any.
Since BETWEEN is "inclusive", WHERE timestamp BETWEEN '2015-1-1' AND '2015-2-1' actually hits two partitions when partitioning by month. Use EXPLAIN PARTITIONS SELECT ... to see that.
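One way to keep the query inside a single monthly partition is a half-open range, sketched below. It assumes the intent is "January 2015 only"; unlike the original BETWEEN, it excludes rows stamped exactly 2015-02-01 00:00:00:

```sql
-- Half-open range: includes all of January, excludes midnight on Feb 1.
EXPLAIN PARTITIONS
SELECT typeB, typeA, COUNT(*) AS number
FROM log_table
WHERE `timestamp` >= '2015-01-01'
  AND `timestamp` <  '2015-02-01'
GROUP BY typeB, typeA;
```

The "partitions" column of the output shows which partitions are actually touched.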
BY RANGE (TO_DAYS(...)) is probably better than BY RANGE (YEAR(...)), but still not useful for the given query.
Here is my discussion of the only 4 use cases where PARTITIONing helps performance: http://mysql.rjweb.org/doc.php/partitionmaint
If this type of query is important, consider "Summary Tables" as a way of greatly speeding up the application: http://mysql.rjweb.org/doc.php/datawarehouse and http://mysql.rjweb.org/doc.php/summarytables
I have two tables that are all the same, except one has a timestamp value column and the other has a datetime value column. Indexes are the same. Values are the same.
But when I run SELECT station, MAX(timestamp) AS max_timestamp FROM stations GROUP BY station; against the table with timestamps, it executes really fast, while against the datetime one I have yet to see the query finish. In both cases the timestamp column is indexed; only the type changes.
Where should I start looking? Or is datetime just not suitable for searching and indexing?
Here is what EXPLAIN gives :
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| 1 | SIMPLE | stations | range | NULL | stamp | 33 | NULL | 1511 | Using index for group-by |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
+----+-------------+--------+-------+---------------+---------+---------+------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+-------+---------------+---------+---------+------+---------+-------+
| 1 | SIMPLE |stations2 | index | NULL | station | 2 | NULL | 3025467 | |
+----+-------------+--------+-------+---------------+---------+---------+------+---------+-------+
And the SHOW:
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| stations | CREATE TABLE `stations` (
`station` varchar(10) COLLATE utf8_bin DEFAULT NULL,
`available` smallint(6) DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
UNIQUE KEY `stamp` (`station`,`timestamp`),
KEY `time` (`timestamp`),
KEY `timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin |
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| stations2 | CREATE TABLE `stations2` (
`station` smallint(5) unsigned NOT NULL,
`available` smallint(5) unsigned DEFAULT NULL,
`timestamp` datetime DEFAULT NULL,
KEY `station` (`station`),
KEY `timestamp` (`timestamp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You can see from the EXPLAIN that there is no key being used for selection (NULL for possible_keys). You don't have a WHERE clause, so this makes sense.
MySQL can utilize an index to determine MAX, and it can utilize an index to optimize GROUP BY. However, to be able to optimize both combined, you would need both the column in your MAX() function and the column in your GROUP BY clause to be in a compound index. In the first table, you have this compound index as a unique key called 'stamp'. The EXPLAIN result shows that MySQL is using that index.
On the second table, you don't have this compound index, so MySQL is having to perform a lot more work. It has to manually group the results and keep the MAX value for each station by manually scanning each row. If you add the same compound index on the second table, you will see similar performance between the two.
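A sketch of adding that compound index to the second table. The index name is invented, and it is created as a plain (non-unique) index because the question does not say whether (station, timestamp) pairs are unique in stations2:

```sql
-- Hypothetical index name; mirrors the `stamp` key of the first table.
ALTER TABLE stations2
  ADD INDEX idx_station_timestamp (station, `timestamp`);

-- With the index in place, Extra would be expected to show
-- "Using index for group-by", as it does for the first table.
EXPLAIN
SELECT station, MAX(`timestamp`) AS max_timestamp
FROM stations2
GROUP BY station;
```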
However, TIMESTAMP will still slightly outperform DATETIME because TIMESTAMP is treated as a single 4 byte integer value, which is processed faster than an 8 byte special DATETIME value. The larger the data set, the larger difference you will see.
I am trying to optimize a bigger query and ran into this wall when I realized this part of the query was doing a full table scan, which in my mind does not make sense considering the field in question is a primary key. I would assume that the MySQL Optimizer would use the index.
Here is the table:
CREATE TABLE userapplication (
application_id int(11) NOT NULL auto_increment,
userid int(11) NOT NULL default '0',
accountid int(11) NOT NULL default '0',
resume_id int(11) NOT NULL default '0',
coverletter_id int(11) NOT NULL default '0',
user_email varchar(100) NOT NULL default '',
account_name varchar(200) NOT NULL default '',
resume_name varchar(255) NOT NULL default '',
resume_modified datetime NOT NULL default '0000-00-00 00:00:00',
cover_name varchar(255) NOT NULL default '',
cover_modified datetime NOT NULL default '0000-00-00 00:00:00',
application_status tinyint(4) NOT NULL default '0',
application_created datetime NOT NULL default '0000-00-00 00:00:00',
application_modified timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
publishid int(11) NOT NULL default '0',
application_visible int(11) default '1',
PRIMARY KEY (application_id),
KEY publishid (publishid),
KEY application_status (application_status),
KEY userid (userid),
KEY accountid (accountid),
KEY application_created (application_created),
KEY resume_id (resume_id),
KEY coverletter_id (coverletter_id)
) ENGINE=MyISAM;
This simple query seems to do a full table scan:
SELECT * FROM userapplication WHERE application_id > 1025;
This is the output of the EXPLAIN:
+----+-------------+-------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | userapplication | ALL | PRIMARY | NULL | NULL | NULL | 784422 | Using where |
+----+-------------+-------------------+------+---------------+------+---------+------+--------+-------------+
Any ideas how to prevent this simple query from doing a full table scan? Or am I out of luck?
You'd probably be better off letting MySQL decide on the query plan. There is a good chance that an index scan would be less efficient than a full table scan.
There are two data structures on disk for this table:
The table itself; and
The primary key B-Tree index.
When you run a query the optimizer has two options about how to access the data:
SELECT * FROM userapplication WHERE application_id > 1025;
Using The Index
Scan the B-Tree index to find the address of all the rows where application_id > 1025
Read the appropriate pages of the table to get the data for these rows.
Not using the Index
Scan the entire table, and pick the appropriate records.
Choosing the best strategy
The job of the query optimizer is to choose the most efficient strategy for getting the data you want. If there are a lot of rows with an application_id > 1025, then it can actually be less efficient to use the index. For example, if 90% of the records have an application_id > 1025, then the query optimizer would have to scan around 90% of the leaf nodes of the B-Tree index and then read at least 90% of the table as well to get the actual data; this would involve reading more data from disk than just scanning the table.
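One way to see which side of that trade-off a given predicate falls on is to measure its selectivity directly. The query below is a minimal sketch against the userapplication table from the question; the column aliases are illustrative names, and the query itself reads every row, so it is meant as a one-off diagnostic rather than something to run routinely.

```sql
-- Fraction of rows matched by the predicate from the question.
-- In MySQL a comparison evaluates to 1 or 0, so SUM() counts the matches.
SELECT COUNT(*)                              AS total_rows,
       SUM(application_id > 1025)            AS matching_rows,
       SUM(application_id > 1025) / COUNT(*) AS matching_fraction
FROM userapplication;
```

If matching_fraction comes out close to 1, the full scan chosen by the optimizer is probably the cheaper plan. The exact crossover point depends on the storage engine and the MySQL version.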
MyISAM tables are not clustered: a PRIMARY KEY index is stored like a secondary index and requires an additional table lookup to get the other column values.
It is several times more expensive to traverse the index and do the lookups. If your condition is not very selective (that is, it yields a large share of the total records), MySQL will consider a table scan cheaper.
To prevent it from doing a table scan, you could add a hint:
SELECT *
FROM userapplication FORCE INDEX (PRIMARY)
WHERE application_id > 1025
However, this would not necessarily be more efficient.
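Because the extra table lookup is the expensive part, a query that needs only columns stored in the index can avoid it altogether. As a hedged sketch: selecting just application_id instead of * would typically let MySQL answer from the primary key index alone, which EXPLAIN reports as "Using index" in the Extra column. Whether that helps depends on which columns the surrounding query actually needs.

```sql
-- Only the indexed column is requested, so no lookup into the table data is needed.
EXPLAIN
SELECT application_id
FROM userapplication
WHERE application_id > 1025;
```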
MySQL definitely considers a full table scan cheaper than using the index; you can, however, force it to use your primary key as the preferred index with:
mysql> EXPLAIN SELECT * FROM userapplication FORCE INDEX (PRIMARY) WHERE application_id > 10;
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | userapplication | range | PRIMARY | PRIMARY | 4 | NULL | 24 | Using where |
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------------+
Note that if you use "USE INDEX" instead of "FORCE INDEX", which only hints to MySQL which index to consider, MySQL still prefers a full table scan:
mysql> EXPLAIN SELECT * FROM userapplication USE INDEX (PRIMARY) WHERE application_id > 10;
+----+-------------+-----------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | userapplication | ALL | PRIMARY | NULL | NULL | NULL | 34 | Using where |
+----+-------------+-----------------+------+---------------+------+---------+------+------+-------------+
If your WHERE is a "greater than" comparison, it probably returns quite a few entries (and can realistically return all of them), therefore full table scans are usually preferred.
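If the application only needs part of the result at a time, bounding the range is one way to make the predicate selective enough for the index to pay off. The sketch below assumes a hypothetical batch of 1000 ids; the upper bound is an illustrative value, not something from the question.

```sql
-- Hypothetical batch: ids 1026 through 2025 only.
EXPLAIN
SELECT *
FROM userapplication
WHERE application_id > 1025
  AND application_id <= 2025;
```

With a narrow range like this the optimizer will usually pick a range scan on PRIMARY, though the choice still depends on the table statistics.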
It should just be a case of typing:
SELECT * FROM userapplication WHERE application_id > 1025;
As detailed at this link. According to that guide, this should work when application_id is a numeric value; for non-numeric values, you should type:
SELECT * FROM userapplication WHERE application_id > '1025';
I don't think there's anything wrong with your SELECT; maybe it's a table configuration problem?
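One thing that could be checked on the table side is whether the index statistics the optimizer relies on are up to date. The statements below are a small sketch using standard MySQL commands; they do not change the query, and they may well confirm that the full scan is a reasonable choice.

```sql
-- Recalculate the key distribution statistics for the table.
ANALYZE TABLE userapplication;

-- List the indexes; the Cardinality column shows the estimated number
-- of distinct values the optimizer assumes for each index.
SHOW INDEX FROM userapplication;
```

Re-running EXPLAIN on the original query afterwards shows whether the refreshed statistics changed the plan.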