Greeting.
Let me show my table schema first:
CREATE TABLE `log_table` (
`rid` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`dataId` int(10) unsigned NOT NULL DEFAULT '0',
`memberId` int(10) unsigned NOT NULL DEFAULT '0',
`clientId` int(10) unsigned NOT NULL DEFAULT '0',
`qty` int(11) NOT NULL DEFAULT '0',
`timestamp` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`typeA` tinyint(2) DEFAULT NULL,
`typeB` int(11) DEFAULT '0',
PRIMARY KEY (`rid`,`timestamp`),
KEY `idx_report1` (`timestamp`,`memberId`,`dataId`),
KEY `idx_report2` (`memberId`,`timestamp`),
KEY `idx_report3` (`dataId`,`timestamp`,`rid`),
KEY `idx_report4` (`timestamp`,`typeB`,`typeA`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
PARTITION BY RANGE (year(`timestamp`))
(PARTITION p2014 VALUES LESS THAN (2015),
PARTITION p2015 VALUES LESS THAN (2016)
);
I'm using MariaDB 5.5 and this table contains 25 million records, so I decided to partition the table to head off performance issues that may occur in the near future.
As you can see, it's time-series log data, and it has 4 views. For example, one of the views uses the following query:
select typeB, typeA, count(*) as number from log_table where timestamp between '2015-1-1' and '2015-2-1' group by typeB, typeA;
AFAIK, thanks to partition pruning this query loads data from p2015 only. But I saw little difference in query execution time between the original table and the partitioned version (avg 1.94 sec vs 1.95 sec).
Hm, I thought it might be influenced by the number of rows in each partition. Then how about smaller partitions, using to_days()?
PARTITION BY RANGE (to_days(`timestamp`))
(
...
PARTITION p_2015_01 VALUES LESS THAN (to_days('2015-2-1')),
PARTITION p_2015_02 VALUES LESS THAN (to_days('2015-3-1'))
...
)
Well, there's no effect. Could you let me know what I'm missing?
EDIT: sorry for the error in my query. BTW, EXPLAIN PARTITIONS doesn't help me.
The EXPLAIN results for both tables are:
// original
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | org_table | range | idx_report1,idx_report4 | idx_report4 | 8 | NULL | 8828000 | Using where; Using index; Using temporary; Using filesort |
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
//partition
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | log_table | range | idx_report1,idx_report4 | idx_report4 | 8 | NULL | 7902646 | Using where; Using index; Using temporary; Using filesort |
+------+-------------+-----------+-------+-------------------------+-------------+---------+------+---------+-----------------------------------------------------------+
PARTITIONing does not help performance nearly as often as users think it will.
KEY `idx_report4` (`timestamp`,`typeB`,`typeA`)
without partitioning is optimal for the SELECT you provided. PARTITIONing will not speed it up any.
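Since BETWEEN is "inclusive", where timestamp between '2015-1-1' and '2015-2-1' also matches rows stamped exactly 2015-02-01 00:00:00, so with the monthly TO_DAYS() partitions it actually hits two partitions. Use EXPLAIN PARTITIONS SELECT ... to see that.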
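A minimal sketch of the half-open form, assuming the monthly TO_DAYS() partitioning from the question. The partition list that EXPLAIN PARTITIONS prints is not shown here because it depends on the actual partition definitions:

```sql
-- BETWEEN '2015-1-1' AND '2015-2-1' includes 2015-02-01 00:00:00,
-- which falls in the February partition.
-- A half-open range stops just short of February instead.
EXPLAIN PARTITIONS
SELECT typeB, typeA, COUNT(*) AS number
FROM log_table
WHERE `timestamp` >= '2015-01-01'
  AND `timestamp` <  '2015-02-01'
GROUP BY typeB, typeA;
```

Check the partitions column of the output to confirm which partitions are touched.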
BY RANGE (TO_DAYS(...)) is probably better than BY RANGE (YEAR(...)), but still not useful for the given query.
Here is my discussion of the only 4 use cases where PARTITIONing helps performance: http://mysql.rjweb.org/doc.php/partitionmaint
If this type of query is important, consider "Summary Tables" as a way of greatly speeding up the application: http://mysql.rjweb.org/doc.php/datawarehouse and http://mysql.rjweb.org/doc.php/summarytables
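As a rough illustration of the summary-table idea, here is a sketch for the report query in the question. The table name log_daily_summary, its index name, and the one-day refresh window are all hypothetical; the linked pages describe more complete designs.

```sql
-- Hypothetical daily rollup of log_table; one row per (day, typeB, typeA).
CREATE TABLE log_daily_summary (
  `day` DATE NOT NULL,
  typeB INT NULL,
  typeA TINYINT NULL,
  cnt   INT UNSIGNED NOT NULL,
  KEY idx_day (`day`, typeB, typeA)
) ENGINE=InnoDB;

-- Rebuild a single day (run after that day's rows are complete).
DELETE FROM log_daily_summary WHERE `day` = '2015-01-15';

INSERT INTO log_daily_summary (`day`, typeB, typeA, cnt)
SELECT DATE(`timestamp`), typeB, typeA, COUNT(*)
FROM log_table
WHERE `timestamp` >= '2015-01-15'
  AND `timestamp` <  '2015-01-16'
GROUP BY DATE(`timestamp`), typeB, typeA;

-- The report then reads the small table: sum the daily counts.
SELECT typeB, typeA, SUM(cnt) AS number
FROM log_daily_summary
WHERE `day` >= '2015-01-01'
  AND `day` <  '2015-02-01'
GROUP BY typeB, typeA;
```

The report here covers whole days in January, which differs slightly from the original BETWEEN, since that also included rows at exactly 2015-02-01 00:00:00.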
Related
I have a sales data table into which an average of 1,329,415 rows are inserted daily. I have to generate reports from the table daily in different formats, but queries against the table are far too slow. Here is my SHOW CREATE TABLE output.
CREATE TABLE `query_manager_table` (
`mtime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`region_id` int(2) NOT NULL,
`rtslug` varchar(10) DEFAULT NULL,
`dsid` int(3) NOT NULL,
`dpid` int(3) NOT NULL,
`route_number` int(4) NOT NULL,
`route_id` int(11) NOT NULL,
`rtlid` int(11) NOT NULL,
`retailer_code` varchar(16) DEFAULT NULL,
`platform_code` varchar(16) DEFAULT NULL,
`prid` int(4) NOT NULL,
`skid` int(4) NOT NULL,
`group` int(4) NOT NULL,
`family` int(4) NOT NULL,
`volume` float DEFAULT NULL,
`value` float(7,2) DEFAULT NULL,
`date` date NOT NULL DEFAULT '0000-00-00',
`outlets` int(4) NOT NULL,
`visited` int(4) NOT NULL,
`channel` int(3) DEFAULT NULL,
`subchannel` int(3) DEFAULT NULL,
`tpg` int(4) DEFAULT NULL,
`ioq` int(10) DEFAULT NULL,
`sales_time` int(11) DEFAULT NULL,
PRIMARY KEY (`dpid`,`route_id`,`rtlid`,`prid`,`skid`,`date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY LIST (YEAR(date) * 100 + QUARTER(date))
(PARTITION y2017q1 VALUES IN (201701) ENGINE = InnoDB,
PARTITION y2017q2 VALUES IN (201702) ENGINE = InnoDB,
PARTITION y2017q3 VALUES IN (201703) ENGINE = InnoDB,
PARTITION y2017q4 VALUES IN (201704) ENGINE = InnoDB,
PARTITION y2018q1 VALUES IN (201801) ENGINE = InnoDB,
PARTITION y2018q2 VALUES IN (201802) ENGINE = InnoDB,
PARTITION y2018q3 VALUES IN (201803) ENGINE = InnoDB,
PARTITION y2018q4 VALUES IN (201804) ENGINE = InnoDB,
PARTITION y2019q1 VALUES IN (201901) ENGINE = InnoDB,
PARTITION y2019q2 VALUES IN (201902) ENGINE = InnoDB,
PARTITION y2019q3 VALUES IN (201903) ENGINE = InnoDB,
PARTITION y2019q4 VALUES IN (201904) ENGINE = InnoDB) */
Now I just want to know the sales by retailer from 1st September to 9th September, using the following query:
SELECT
query_manager_table.dpid,
query_manager_table.route_id,
query_manager_table.rtlid,
query_manager_table.prid,
SUM(query_manager_table.`volume`) AS sales,
1 AS memos
FROM
query_manager_table
WHERE
query_manager_table.date BETWEEN '2018-09-01'
AND '2018-09-08'
GROUP BY
query_manager_table.dpid,
query_manager_table.rtlid,
query_manager_table.date
But it takes about 500-700 sec. I then added dpid IN (1,2,.....) AND prid IN (1,2,....), since both fields are part of the primary key, and the output comes back after 300 sec. What am I doing wrong?
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | ALL | PRIMARY | NULL | NULL | NULL | 129065467 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
When I add all dpid and prid values to the WHERE condition, the EXPLAIN looks like this:
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | range | PRIMARY | PRIMARY | 4 | NULL | 128002 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
Is there any way to optimize table or query?
If I run EXPLAIN PARTITIONS SELECT... for the first one then get -
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | y2017q1,y2017q2,y2017q3,y2017q4,y2018q1,y2018q2,y2018q3,y2018q4,y2019q1,y2019q2,y2019q3,y2019q4 | ALL | PRIMARY | NULL | NULL | NULL | 127129410 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+------+---------------+------+---------+------+-----------+----------------------------------------------+
For the 2nd one I get -
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
| 1 | SIMPLE | query_manager_table | y2017q1,y2017q2,y2017q3,y2017q4,y2018q1,y2018q2,y2018q3,y2018q4,y2019q1,y2019q2,y2019q3,y2019q4 | range | PRIMARY | PRIMARY | 4 | NULL | 153424 | Using where; Using temporary; Using filesort |
+----+-------------+---------------------+-------------------------------------------------------------------------------------------------+-------+---------------+---------+---------+------+--------+----------------------------------------------+
INDEXes are used for efficiency in SELECTs.
The one PRIMARY KEY (in MySQL) is, by definition, a unique INDEX. It should have a minimal set of columns that uniquely identify a row.
Any unique index (including the PK) is also a "uniqueness constraint" -- this prevents inserting multiple rows with the same set of values.
Indexes are used "from the left". That is, with INDEX(a,b), if a is not useful, it won't get to the b.
PARTITION BY LIST is virtually useless. It rarely, if ever, improves performance. You have shown us a couple of queries; let's see more of the typical queries so we can help you with indexes and partitioning.
WHERE
query_manager_table.date BETWEEN '2018-09-01'
AND '2018-09-08'
begs for INDEX(date). In a composite index, the columns after a 'range' won't be reached. That is, in INDEX(date, x, y), testing date for a range (such as the 8 days in the WHERE), won't let it make use of x or y. On the other hand, WHERE date = '2018-09-01' AND x=1 will make use of more of the index.
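A sketch of adding that index and re-checking the plan. The index name idx_date is made up, and the SELECT list has been trimmed to the GROUP BY columns for illustration; on a table of this size the ALTER may take a long time, so try it on a copy first.

```sql
-- Hypothetical index name; the column list is the point.
ALTER TABLE query_manager_table ADD INDEX idx_date (`date`);

-- Re-check which partitions and which key the optimizer picks.
EXPLAIN PARTITIONS
SELECT dpid, rtlid, `date`, SUM(volume) AS sales
FROM query_manager_table
WHERE `date` BETWEEN '2018-09-01' AND '2018-09-08'
GROUP BY dpid, rtlid, `date`;
```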
float(7,2) -- don't use the (m,n) option on FLOAT or DOUBLE. Instead, switch to DECIMAL.
INT is always 4 bytes. See TINYINT (1 byte), SMALLINT (2 bytes), etc. This, alone, may cut the table size in half.
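As an illustration of the two datatype points above, a sketch that shrinks a few non-key columns. The value ranges in the comments are assumptions, so check the real data (for example with SELECT MIN(col), MAX(col)) before changing any type.

```sql
ALTER TABLE query_manager_table
  MODIFY region_id  TINYINT UNSIGNED NOT NULL,      -- assumes values fit in 0..255
  MODIFY channel    TINYINT UNSIGNED DEFAULT NULL,  -- assumes values fit in 0..255
  MODIFY subchannel TINYINT UNSIGNED DEFAULT NULL,  -- assumes values fit in 0..255
  MODIFY `value`    DECIMAL(7,2) DEFAULT NULL;      -- exact replacement for float(7,2)
```

Each MODIFY restates the column's nullability and default, because MODIFY replaces the whole column definition.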
To explain this:
PRIMARY KEY (`dpid`,`route_id`, ...
WHERE ... AND dpid IN (...) AND ...
manages to use the first (remember: 'leftmost') for the pseudo-range IN, but can't use anything else in the PK since route_id is next.
This explains why the second EXPLAIN has a smaller "Rows". Also, note the "4" in "key_len" -- that's the number of bytes in dpid.
After you have made some of those changes, come back so we can discuss using Summary Tables to speed things up. However, "modify" may lead to complexity in this optimization.
How much RAM do you have? What is the value of innodb_buffer_pool_size?
Don't use GUIDs unless you must; they slow actions on large tables down due to the randomness.
I would not combine actual data fields to make a primary key. I would have a single field, and use an auto-incrementing integer or perhaps a GUID for the value. Having to go through six fields to identify a unique record takes more time than going through one, and as you say you run the risk of duplicate fields if a user is entering key data.
If you have business reasons to make those six fields unique when taken together, you should also work up a routine to identify whether or not an inserted record duplicates an existing one with respect to these fields. If you are batch inserting, you'll want to do this after inserting the records rather than checking each one as you insert it. You'll also want to index these six fields, to speed up your query for duplicates.
As for your SELECT query, you'll probably want to index the fields in your WHERE clause. In any case, you'll want to read up on execution plans and experiment with different indexes and key structures (probably easier to do on a subset of your data). Google "mysql execution plan" for lots of information.
Problem with MySQL version 5.7.18. Earlier versions of MySQL behave as expected.
Here are two tables. Table 1:
CREATE TABLE `test_events` (
`id` int(11) NOT NULL,
`event` int(11) DEFAULT '0',
`manager` int(11) DEFAULT '0',
`base_id` int(11) DEFAULT '0',
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`client` int(11) DEFAULT '0',
`event_time` datetime DEFAULT '0000-00-00 00:00:00'
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `test_events`
ADD PRIMARY KEY (`id`),
ADD KEY `client` (`client`),
ADD KEY `event_time` (`event_time`),
ADD KEY `manager` (`manager`),
ADD KEY `base_id` (`base_id`),
ADD KEY `create_time` (`create_time`);
And the second table:
CREATE TABLE `test_event_types` (
`id` int(11) NOT NULL,
`name` varchar(255) DEFAULT NULL,
`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`base` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
ALTER TABLE `test_event_types`
ADD PRIMARY KEY (`id`);
Let's try to select last event from base "314":
EXPLAIN SELECT `test_events`.`create_time`
FROM `test_events`
LEFT JOIN `test_event_types`
ON ( `test_events`.`event` = `test_event_types`.`id` )
WHERE base = 314
ORDER BY `test_events`.`create_time` DESC
LIMIT 1;
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | test_events | NULL | ALL | NULL | NULL | NULL | NULL | 434928 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | test_event_types | NULL | ALL | PRIMARY | NULL | NULL | NULL | 44 | 2.27 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+------------------+------------+------+---------------+------+---------+------+--------+----------+----------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
MySQL is not using an index and reads the whole table.
Without the WHERE clause:
EXPLAIN SELECT `test_events`.`create_time`
FROM `test_events`
LEFT JOIN `test_event_types`
ON ( `test_events`.`event` = `test_event_types`.`id` )
ORDER BY `test_events`.`create_time` DESC
LIMIT 1;
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
| 1 | SIMPLE | test_events | NULL | index | NULL | create_time | 4 | NULL | 1 | 100.00 | NULL |
| 1 | SIMPLE | test_event_types | NULL | eq_ref | PRIMARY | PRIMARY | 4 | m16.test_events.event | 1 | 100.00 | Using index |
+----+-------------+------------------+------------+--------+---------------+-------------+---------+-----------------------+------+----------+-------------+
2 rows in set, 1 warning (0.00 sec)
Now it uses the index.
MySQL 5.5.55 uses the index in both cases. Why is that, and what can I do about it?
I don't know why you are seeing a difference between your previous and current installations, but the server's behaviour makes sense.
SELECT test_events.create_time FROM test_events LEFT JOIN test_event_types ON ( test_events.event = test_event_types.id ) ORDER BY test_events.create_time DESC LIMIT 1;
In this query you do not have a WHERE clause, but you are fetching only one row, after sorting by create_time, which happens to have an index. That index can be used for the sort. But let's look at the second query.
SELECT test_events.create_time FROM test_events LEFT JOIN test_event_types ON ( test_events.event = test_event_types.id ) WHERE base = 314 ORDER BY test_events.create_time DESC LIMIT 1
You don't have an index on the base column, so no index can be used for that condition. To find the relevant records MySQL has to do a table scan. Having identified the relevant rows, it then needs to sort them. In this case the query planner has decided that it's just not worth using the index on create_time.
I see several problems with your setup, the first being the missing index on base, as already mentioned. But why is base a varchar? You appear to be storing integers in it.
ALTER TABLE test_events
ADD PRIMARY KEY (id),
ADD KEY client (client),
ADD KEY event_time (event_time),
ADD KEY manager (manager),
ADD KEY base_id (base_id),
ADD KEY create_time (create_time);
And making multiple single-column indexes like this doesn't make much sense in MySQL. That's because MySQL generally uses only one index per table in a query (index merge is the exception). You would be far better off with one or two indexes, possibly multi-column indexes.
I think your ideal index would contain both create_time and event fields
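A sketch of such a composite index. The name idx_event_ctime and the column order are assumptions; whether the optimizer actually picks it for this query depends on the data and the server version, so verify with EXPLAIN.

```sql
-- event first (used by the join), then create_time (used by the ORDER BY).
ALTER TABLE test_events
  ADD INDEX idx_event_ctime (event, create_time);
```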
base = 314 with base VARCHAR... is a performance problem. Either put quotes around 314 or make base some integer type.
You appear not to need LEFT. If not, then do a plain JOIN so that the optimizer has the freedom to start with an INDEX(base), which is currently missing and needs to be added.
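Putting the two suggestions together, a sketch that keeps base as VARCHAR, quotes the literal, and uses a plain JOIN. The index name idx_base is hypothetical, and this assumes rows without a matching event type are not wanted.

```sql
-- The missing index on the filtered column.
ALTER TABLE test_event_types ADD INDEX idx_base (base);

EXPLAIN
SELECT e.create_time
FROM test_event_types AS t
JOIN test_events AS e ON e.event = t.id
WHERE t.base = '314'            -- string literal, matching the VARCHAR column
ORDER BY e.create_time DESC
LIMIT 1;
```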
As for the differences between 5.5 and 5.6 and 5.7, there have been a number of Optimization changes; you may have encountered a regression. But I don't want to chase that until you have improved the query and indexes.
I stumbled upon the same scenario, where MySQL was using a table scan instead of an index search.
This could be for one of the reasons mentioned in the MySQL docs:
The table is so small that it is faster to perform a table scan than to bother with a key lookup. This is common for tables with fewer than 10 rows and a short row length.
mysql docs link
And when I checked the EXPLAIN of the query on the production server, which has a large number of rows, it used an index search as expected.
It's one of MySQL's optimizations under the hood :)
I have two tables like this
CREATE TABLE `vendors` (
vid int(10) unsigned NOT NULL AUTO_INCREMENT,
updated timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (vid),
key(updated)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `products` (
vid int(10) unsigned NOT NULL default 0,
pid int unsigned default 0,
flag int(11) unsigned DEFAULT '0',
PRIMARY KEY (vid),
KEY (pid)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This is a simple query
> explain select vendors.vid, pid from products, vendors where pid=1 and vendors.vid=products.vid order by updated;
+------+-------------+----------+--------+---------------+---------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------+--------+---------------+---------+---------+---------------------+------+----------------------------------------------+
| 1 | SIMPLE | products | ref | PRIMARY,pid | pid | 5 | const | 1 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | vendors | eq_ref | PRIMARY | PRIMARY | 4 | social.products.vid | 1 | |
+------+-------------+----------+--------+---------------+---------+---------+---------------------+------+----------------------------------------------+
I am wondering why MySQL needs to use a temporary table and filesort for such a simple query. As you can see, the ORDER BY field has an index.
mysql fiddle here : http://sqlfiddle.com/#!9/3d9be/30
That will be the optimum plan in that case; the optimizer doesn't always have to go to the index for the fastest result. It may choose to use the index when the record count goes up. You can try inserting 10,000 dummy records and seeing if this is the case.
If I flip the join order here, you will find that it uses the index, because the table that the WHERE condition applies to is now joined later in the query. We need to look at the records in products after the join is made, so in essence I've made the work harder, and the index is used. It'll still run in the same time. You can try pitting the two queries against each other to see what happens. Here it is:
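One way to load roughly 10,000 dummy rows for such a test, sketched for servers without recursive CTEs. The helper table digits and the generated values are made up for the experiment.

```sql
-- Hypothetical helper table holding 0..9.
CREATE TABLE digits (n TINYINT UNSIGNED NOT NULL PRIMARY KEY);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- 10 x 10 x 10 x 10 = 10,000 vendors, each with a different updated time.
INSERT INTO vendors (updated)
SELECT NOW() - INTERVAL (a.n + 10*b.n + 100*c.n + 1000*d.n) MINUTE
FROM digits a, digits b, digits c, digits d;

-- One product row per vendor; IGNORE skips vids that already exist.
INSERT IGNORE INTO products (vid, pid, flag)
SELECT vid, vid MOD 100, 0 FROM vendors;

-- Refresh index statistics before re-running EXPLAIN.
ANALYZE TABLE vendors, products;
```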
EXPLAIN
SELECT vendors.vid, products.pid
FROM vendors
INNER JOIN products ON vendors.vid = products.vid
WHERE pid = 1
ORDER BY vendors.updated DESC
You can find a detailed explanation here: Fix Using where; Using temporary; Using filesort
I've been trying to wrap my head around this for a good while, but had no luck. I have a simple queue system implemented on my small site and a cron job to check if there are any items in the queue. It's supposed to fetch several items ordered by priority and process them, but for some reason the priority index gets ignored. My create table syntax is
CREATE TABLE `site_queue` (
`row_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`task` tinyint(3) unsigned NOT NULL COMMENT '0 - email',
`priority` int(10) unsigned DEFAULT NULL,
`commands` text NOT NULL,
`added` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`row_id`),
KEY `task` (`task`),
KEY `priority` (`priority`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
The query to fetch queued items is
SELECT `row_id`, `task`, `commands` FROM `site_queue` ORDER BY `priority` DESC LIMIT 5;
The EXPLAIN query returns the following:
+----+-------------+------------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+---------------+------+---------+------+------+----------------+
| 1 | SIMPLE | site_queue | ALL | NULL | NULL | NULL | NULL | 1269 | Using filesort |
+----+-------------+------------+------+---------------+------+---------+------+------+----------------+
Can anyone offer some insight on what might be causing this?
Because when there are only a few rows (originally 4, then increased to 1k) there is no reason to use the index, since that would be slower (MySQL would have to read both index and data pages too many times).
So the rule of thumb for MySQL query optimization: test with a reasonably large amount of data. Ideally the size should be comparable to the real production data size.
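While the table is still small, one way to confirm that the index is at least usable for this ORDER BY is to compare against a forced plan. This is a diagnostic sketch only, not a suggestion to ship the hint.

```sql
-- Compare this plan with the unhinted EXPLAIN from the question.
EXPLAIN
SELECT row_id, task, commands
FROM site_queue FORCE INDEX (priority)
ORDER BY priority DESC
LIMIT 5;
```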
I am trying to optimize a bigger query and ran into this wall when I realized this part of the query was doing a full table scan, which in my mind does not make sense considering the field in question is a primary key. I would assume that the MySQL Optimizer would use the index.
Here is the table:
CREATE TABLE userapplication (
application_id int(11) NOT NULL auto_increment,
userid int(11) NOT NULL default '0',
accountid int(11) NOT NULL default '0',
resume_id int(11) NOT NULL default '0',
coverletter_id int(11) NOT NULL default '0',
user_email varchar(100) NOT NULL default '',
account_name varchar(200) NOT NULL default '',
resume_name varchar(255) NOT NULL default '',
resume_modified datetime NOT NULL default '0000-00-00 00:00:00',
cover_name varchar(255) NOT NULL default '',
cover_modified datetime NOT NULL default '0000-00-00 00:00:00',
application_status tinyint(4) NOT NULL default '0',
application_created datetime NOT NULL default '0000-00-00 00:00:00',
application_modified timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
publishid int(11) NOT NULL default '0',
application_visible int(11) default '1',
PRIMARY KEY (application_id),
KEY publishid (publishid),
KEY application_status (application_status),
KEY userid (userid),
KEY accountid (accountid),
KEY application_created (application_created),
KEY resume_id (resume_id),
KEY coverletter_id (coverletter_id)
) ENGINE=MyISAM ;
This simple query seems to do a full table scan:
SELECT * FROM userapplication WHERE application_id > 1025;
This is the output of the EXPLAIN:
+----+-------------+-------------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------------+------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | userapplication | ALL | PRIMARY | NULL | NULL | NULL | 784422 | Using where |
+----+-------------+-------------------+------+---------------+------+---------+------+--------+-------------+
Any ideas how to prevent this simple query from doing a full table scan? Or am I out of luck?
You'd probably be better off letting MySQL decide on the query plan. There is a good chance that doing an index scan would be less efficient than a full table scan.
There are two data structures on disk for this table
The table itself; and
The primary key B-Tree index.
When you run a query the optimizer has two options about how to access the data:
SELECT * FROM userapplication WHERE application_id > 1025;
Using The Index
Scan the B-Tree index to find the address of all the rows where application_id > 1025
Read the appropriate pages of the table to get the data for these rows.
Not using the Index
Scan the entire table, and pick the appropriate records.
Choosing the best strategy
The job of the query optimizer is to choose the most efficient strategy for getting the data you want. If there are a lot of rows with an application_id > 1025 then it can actually be less efficient to use the index. For example if 90% of the records have an application_id > 1025 then the query optimizer would have to scan around 90% of the leaf nodes of the b-tree index and then read at least 90% of the table as well to get the actual data; this would involve reading more data from disk than just scanning the table.
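To see which side of that trade-off this query falls on, one can measure how selective the condition actually is. A minimal sketch (the column aliases are made up):

```sql
-- In MySQL a comparison evaluates to 1 or 0, so SUM() counts the matches.
SELECT COUNT(*)                              AS total_rows,
       SUM(application_id > 1025)            AS matching_rows,
       SUM(application_id > 1025) / COUNT(*) AS matching_fraction
FROM userapplication;
```

If matching_fraction is close to 1, a full table scan is the expected plan.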
MyISAM tables are not clustered: the PRIMARY KEY index is stored like a secondary index and requires an additional table lookup to get the other values.
It is several times more expensive to traverse the index and do the lookups. If your condition is not very selective (yields a large share of total records), MySQL will consider a table scan cheaper.
To prevent it from doing a table scan, you could add a hint:
SELECT *
FROM userapplication FORCE INDEX (PRIMARY)
WHERE application_id > 1025
, though it would not necessarily be more efficient.
MySQL definitely considers a full table scan cheaper than using the index; you can, however, force it to use your primary key as the preferred index with:
mysql> EXPLAIN SELECT * FROM userapplication FORCE INDEX (PRIMARY) WHERE application_id > 10;
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | userapplication | range | PRIMARY | PRIMARY | 4 | NULL | 24 | Using where |
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------------+
Note that if you use "USE INDEX" instead of "FORCE INDEX", which only hints to MySQL which index to use, MySQL still prefers a full table scan:
mysql> EXPLAIN SELECT * FROM userapplication USE INDEX (PRIMARY) WHERE application_id > 10;
+----+-------------+-----------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | userapplication | ALL | PRIMARY | NULL | NULL | NULL | 34 | Using where |
+----+-------------+-----------------+------+---------------+------+---------+------+------+-------------+
If your WHERE is a "greater than" comparison, it probably returns quite a few entries (and can realistically return all of them), therefore full table scans are usually preferred.
It should be the case of just typing:
SELECT * FROM userapplication WHERE application_id > 1025;
As detailed at this link. According to that guide, it should work where the application_id is a numeric value, for non-numeric values, you should type:
SELECT * FROM userapplication WHERE application_id > '1025';
I don't think there's anything wrong with your SELECT, maybe it's a table configuration problem?