MySQL table structure
It confuses me: does the query range influence whether MySQL uses an index?
That is what happens. And it is actually an optimization.
When using a secondary key (such as INDEX(teacher_id)), the processing goes like this:
Reach into the index, which is a B+Tree. In such a structure, it is quite efficient to find a particular value (such as 1) and then scan forward (until 5000 or 10000).
For each entry, reach over into the data to fetch the row (SELECT *). This uses the PRIMARY KEY, a copy of which is in the secondary key. The PK and the data are clustered together; each lookup by one PK value is efficient (again, a BTree), but you need to do 5000 or 10000 of them. So the cost (time taken) adds up.
A "table scan" (ie, not using any INDEX) goes like this:
Start at the beginning of the table, walk through the B+Tree for the table (in PK order) until the end.
For each row, check the WHERE clause (a range on teacher_id).
If more than something like 20% of the table needs to be looked at, a table scan is actually faster than bouncing back and forth between the secondary index and the data.
So, "large" is somewhere around 20%. The actual value depends on table statistics, etc.
Bottom line: Just let the Optimizer do its thing; most of the time it knows best.
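If you want to see the tipping point for yourself, you can compare the optimizer's choice against a forced index read; this is a sketch, assuming the table and index named in this thread exist:

```sql
-- Let the optimizer choose; for a wide range it may prefer a full table scan:
EXPLAIN SELECT * FROM t_teacher_course_info
WHERE teacher_id > 1 AND teacher_id < 10000;

-- Force the secondary index, to compare the two plans side by side:
EXPLAIN SELECT * FROM t_teacher_course_info
FORCE INDEX (idx_teacher_id_last_update_time)
WHERE teacher_id > 1 AND teacher_id < 10000;
```

If the forced plan is genuinely slower when timed, the optimizer's table-scan decision was right.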
In brief: I use a MySQL database. When I execute
EXPLAIN
SELECT * FROM t_teacher_course_info
WHERE teacher_id >1 and teacher_id < 5000
it uses the index INDEX `idx_teacher_id_last_update_time` (`teacher_id`, `last_update_time`).
But if I change the range:
EXPLAIN
SELECT * FROM t_teacher_course_info
WHERE teacher_id >1 and teacher_id < 10000
id  select_type  table                  type  possible_keys            key   key_len  ref   rows    Extra
1   SIMPLE       t_teacher_course_info  ALL   idx_teacher_update_time  NULL  NULL     NULL  671082  Using where
it scans the whole table and does not use the index. Is there any MySQL configuration for this?
Maybe MySQL judges from the estimated row count whether to use the index.
Related
I have the following query:
select * from `tracked_employments`
where `tracked_employments`.`file_id` = 10006000
and `tracked_employments`.`user_id` = 1003230
and `tracked_employments`.`can_be_sent` = 1
and `tracked_employments`.`type` = 'jobchange'
and `tracked_employments`.`file_type` = 'file'
order by `tracked_employments`.`id` asc
limit 1000
offset 2000;
and this index
explain tells me that it does not use the index, but when I replace * with id it does use it. Why does it make a difference what columns I select?
Both you and Akina have misconceptions about how InnoDB indexing works.
Let me explain the two ways that query may be executed.
Case 1. Index is used.
This assumes the datatypes, etc, all match the 5-column composite index that seems to exist on the table. Note: because all the tests are for =, the order of the columns in the WHERE clause and the INDEX does not matter.
In InnoDB, id (or whatever column(s) are in the PRIMARY KEY) is implicitly appended to the secondary index.
The lookup will go directly (in the Index's BTree) to the first row that matches all 5 tests. From there, it will scan forward. Each 'row' in the index has the PK, so it can reach over into the data's BTree to find any other columns needed for * (cf SELECT *).
But, it must skip over 2000 rows before delivering the 1000 that are desired. This is done by actually stepping over each one, one at a time. That is, OFFSET is not necessarily fast.
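One common way around that OFFSET cost is "keyset" pagination: remember the last id delivered and seek past it, instead of stepping over 2000 rows. A sketch (the :last_seen_id placeholder is an assumption, not part of the original query):

```sql
select * from `tracked_employments`
where `tracked_employments`.`file_id` = 10006000
  and `tracked_employments`.`user_id` = 1003230
  and `tracked_employments`.`can_be_sent` = 1
  and `tracked_employments`.`type` = 'jobchange'
  and `tracked_employments`.`file_type` = 'file'
  and `tracked_employments`.`id` > :last_seen_id  -- max id from the previous page
order by `tracked_employments`.`id` asc
limit 1000;
```

Because the index can seek directly to `id > :last_seen_id`, each page costs roughly the same, no matter how deep you are.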
Case 2. Don't bother with the index.
This happens based on some nebulous analysis of the 3000 rows that need to be touched and the size of the table.
The rationale behind possibly scanning the table without using the index is that the bouncing between the index BTree and the data BTree may be more costly than simply scanning the data BTree. Note that the data BTree is already in the desired order -- namely by id. (Assuming that is the PK.) That avoids a sort of up to 1000 rows.
Also, certain datatype issues may prevent the use of the index.
I do need to ask what the client will do with 1000 rows all at once. If it is a web page, that seems awfully big.
Case 3 -- Just SELECT id
In this case, all the info is available in the index, so there is no need to reach into the data's BTree.
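You can confirm Case 3 with EXPLAIN: when the read is covered by the index, the Extra column should show "Using index". A sketch, assuming the 5-column composite index exists:

```sql
EXPLAIN SELECT id
FROM tracked_employments
WHERE file_id = 10006000 AND user_id = 1003230 AND can_be_sent = 1
  AND type = 'jobchange' AND file_type = 'file'
ORDER BY id ASC
LIMIT 1000 OFFSET 2000;
-- "Using index" in Extra means the data BTree is never touched.
```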
I have an InnoDB table with 750,000 records. Its primary key is a BIGINT.
When I do:
SELECT COUNT(*) FROM table;
it takes 900ms. explain shows that the index is not used.
When I do:
SELECT COUNT(*) FROM table WHERE pk >= 3000000;
it takes 400ms. explain shows that the index, in this case, is used.
I am looking to do fast counts where x >= pk >= y.
It is my understanding that since I use the primary key of the table, I am using a clustered index, and that therefore the rows are (physically?) ordered by this index. Should it then not be very, very fast to do this count? I was expecting the result to be available in a dozen milliseconds or so.
I have read that faster results can be expected if I select only a small part of the table. I am however interested in doing these counts of ranges. Perhaps I should organize my data in a different way?
In a different case, I have a table with spatial data and use an RTREE index, and then I use MBRContains to count matching rows (and on a secondary index). Surprisingly, this is faster than the simple case above.
In InnoDB, the PRIMARY KEY is "clustered" with the data. This means that the data is sorted by the PK and where pk BETWEEN x AND y must read all the rows from x through y.
So, how does it do a scan by PK? It must read the data blocks. They are bulky in that they have other columns.
But what about COUNT(*) without a WHERE? In this case, the Optimizer looks for the least-bulky index and counts the rows in it. So...
If you have a secondary index, it will use that.
If you only have the PK, then it will read the entire table to do the count.
That is, the artificial addition of a secondary index on the narrowest column is likely to speedup SELECT COUNT(*) FROM tbl.
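As a sketch of that trick (table and column names here are illustrative, not from the question):

```sql
-- A narrow secondary index; since every secondary index implicitly
-- carries the PK, the optimizer can count rows in this smaller BTree
-- instead of walking the bulky clustered data.
ALTER TABLE t ADD INDEX idx_narrow (small_col);

SELECT COUNT(*) FROM t;  -- EXPLAIN should now show idx_narrow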
But wait... Be sure to run each timing test twice. The first time (after a restart) must read the needed blocks from disk. Slow.
The second time all the blocks are likely to be sitting in RAM. Much faster.
SPATIAL and FULLTEXT indexing complicate this discussion, especially if the WHERE has 2 parts: one with a Spatial or Fulltext test, one with a regular test.
COUNT(1) and COUNT(*) are identical. COUNT(x) checks x for being NOT NULL before including the row in the tally.
I have one table, student_homework, and one of its composite indexes is uk_sid_lsnid_version(student_id, lesson_id, curriculum_version, type):
Table             Non_unique  Key_name              Seq_in_index  Column_name         Collation  Cardinality  Index_type
student_homework  0           uk_sid_lsnid_version  1             student_id          A          100          BTREE
student_homework  0           uk_sid_lsnid_version  2             lesson_id           A          100          BTREE
student_homework  0           uk_sid_lsnid_version  3             curriculum_version  A          100          BTREE
student_homework  0           uk_sid_lsnid_version  4             type                A          100          BTREE
Now I have this SQL:

select * from student_homework where student_id=100 and type=1

and the EXPLAIN result looks like:

id  select_type  table             type  possible_keys                                    key                   key_len  ref    rows  filtered  Extra
1   SIMPLE       student_homework  ref   uk_sid_lsnid_version,idx_student_id_update_time  uk_sid_lsnid_version  4        const  20    10.0      Using index condition

The chosen index is uk_sid_lsnid_version.
The question for me is how the query condition type works here. Does the DB engine scan all the (narrowed) records for it? In my understanding, the tree hierarchy is:
student_id
/ \
lesson_id lesson_id
/ \
curriculum_version curriculum_version
/ \
type type
For the query condition (student_id, type), student_id matches the root of the index tree. But type does not match the next index column, lesson_id, so the DB engine has to apply the type test to all records that have been filtered by student_id.
Is my understanding correct? If the subset of records for one student_id is large, the query is still expensive.
There is no difference between the query condition student_id = 100 and type = 0 and the condition type = 0 and student_id = 100.
To make full use of the composite index, would it be better to add a new composite index (student_id, type)?
Yes, your understanding is correct: MySQL will use the uk_sid_lsnid_version index to match on student_id only, while the filtering on type will be done on the reduced set of rows that match on student_id.
The hint is in the extra column of the explain result: Using index condition
Using index condition (JSON property: using_index_condition)
Tables are read by accessing index tuples and testing them first to determine whether to read full table rows. In this way, index information is used to defer (“push down”) reading full table rows unless it is necessary. See Section 8.2.1.6, “Index Condition Pushdown Optimization”.
Section 8.2.1.6, “Index Condition Pushdown Optimization” describes the steps of this technique as:
Get the next row's index tuple (but not the full table row).
Test the part of the WHERE condition that applies to this table and can be checked using only index columns. If the condition is not satisfied, proceed to the index tuple for the next row.
If the condition is satisfied, use the index tuple to locate and read the full table row.
Test the remaining part of the WHERE condition that applies to this table. Accept or reject the row based on the test result.
Whether it would be better to add another composite index on (student_id, type) is a question that cannot be objectively answered by us; you need to test it.
If the speed of the query with the current index is fine, then you probably do not need a new index. You also need to weigh how many other queries would use the new index; there is not much point in creating an index for just one query. And you need to weigh how selective the type field is: type fields with a limited list of values are often not selective enough. MySQL may decide to use index condition pushdown anyway, since a (student_id, type) index is not a covering index and MySQL would have to fetch the full row in either case.
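If testing does show the extra index pays off, adding it is a one-liner (a sketch using the names from the question; the index name is an assumption):

```sql
-- Lets both equality tests be resolved inside one index, instead of
-- pushing the type test down onto the wider uk_sid_lsnid_version index.
ALTER TABLE student_homework ADD INDEX idx_student_type (student_id, type);
```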
I have a large table (about 3 million records) that includes primarily these fields: rowID (int), a deviceID (varchar(20)), a UnixTimestamp in a format like 1536169459 (int(10)), powerLevel which has integers that range between 30 and 90 (smallint(6)).
I'm looking to pull out records within a certain time range (using UnixTimestamp) for a particular deviceID and with a powerLevel above a certain number. With over 3 million records, it takes a while. Is there a way to create an index that will optimize for this?
Create an index over:
DeviceId,
PowerLevel,
UnixTimestamp
When selecting, you will first narrow in to the set of records for your given Device, then it will narrow in to only those records that are in the correct PowerLevel range. And lastly, it will narrow in, for each PowerLevel, to the correct records by UnixTimestamp.
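That suggestion as DDL (a sketch; tbl and the index name are placeholders):

```sql
ALTER TABLE tbl ADD INDEX idx_dev_pwr_ts (DeviceId, PowerLevel, UnixTimestamp);
```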
If I understand you correctly, you hope to speed up this sort of query.
SELECT something
FROM tbl
WHERE deviceID = constant
AND start <= UnixTimestamp
AND UnixTimestamp < end
AND Power >= constant
You have one constant criterion (deviceID) and two range criteria (UnixTimestamp and Power). MySQL's indexes are BTREE (think: sorted in order), and MySQL can only do one index range scan per SELECT.
So, you should probably choose an index on (deviceID, UnixTimestamp, Power). To satisfy the query, MySQL will random-access the index to the entries for deviceID, then further random access to the first row meeting the UnixTimestamp start criterion.
It will then scan the index sequentially, and use the Power information from each index entry to decide whether it should choose each row.
You could also use (deviceID, Power, UnixTimestamp). But in this case MySQL will find the first entry matching the device and power criteria, then scan the index through entries with all timestamps to see which rows it should choose.
Your performance objective is to get MySQL to scan the fewest possible index entries, so it seems very likely the (deviceID, UnixTimestamp, Power) choice is superior. The index column on UnixTimestamp is probably more selective than the one on Power. (That's my guess.)
ALTER TABLE tbl ADD INDEX tbl_dev_ts_pwr (deviceID, UnixTimestamp, Power);
Look at Bill Karwin's tutorials. Also look at Markus Winand's https://use-the-index-luke.com
The suggested 3-column indexes are only partially useful. The Optimizer will use the first 2 columns, but ignore the third.
Better:
INDEX(DeviceId, PowerLevel),
INDEX(DeviceId, UnixTimestamp)
Why?
The optimizer will pick between those two based on which seems to be more selective. If the time range is 'narrow', then the second index will be used; if there are not many rows with the desired PowerLevel, then the first index will be used.
Even better...
The PRIMARY KEY... You probably have Id as the PK? Perhaps (DeviceId, UnixTimestamp) is unique? (Or can you have two readings for a single device in a single second??) If the pair is unique, get rid of Id completely and have
PRIMARY KEY(DeviceId, UnixTimestamp),
INDEX(DeviceId, PowerLevel)
Notes:
Getting rid of Id saves space, thereby providing a little bit of speed.
When using a secondary index, the execution spends time bouncing between the index's BTree and the data BTree (ordered by the PK). With PRIMARY KEY(Id), you are guaranteed to do that bouncing. Changing the PK as above avoids it; this may double the speed of the query.
(I am not sure the secondary index will ever be used.)
Another (minor) suggestion: Normalize the DeviceId so that it is (perhaps) a 2-byte SMALLINT UNSIGNED (range 0..64K) instead of VARCHAR(20). Even if this entails a JOIN, the query will run a little faster. And a bunch of space is saved.
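That normalization could be sketched like this (all table and column names here are illustrative assumptions):

```sql
-- Map each VARCHAR(20) device id to a 2-byte surrogate key:
CREATE TABLE device (
  device_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  device_code VARCHAR(20) NOT NULL,
  PRIMARY KEY (device_id),
  UNIQUE KEY uk_device_code (device_code)
);

-- Readings then store the small device_id; the JOIN recovers the code:
SELECT r.*
FROM readings r
JOIN device d ON d.device_id = r.device_id
WHERE d.device_code = 'ABC-123'
  AND r.UnixTimestamp >= 1536169459
  AND r.UnixTimestamp <  1536255859;
```

The 2-byte key shrinks both the readings table and every index that contains DeviceId.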
I know there are similar questions on this, but I have a specific question about why this query
EXPLAIN SELECT DISTINCT RSubdomain FROM R_Subdomains WHERE EmploymentState IN (0,1) AND RPhone='7853932120'
gives me this EXPLAIN output:
id  select_type  table        type   possible_keys  key         key_len  ref   rows  Extra
1   SIMPLE       RSubdomains  index  NULL           RSubdomain  767      NULL  3278  Using where
with an index on RSubdomain. But if I add a composite index on (EmploymentState, RPhone), I get this EXPLAIN output:
id  select_type  table        type   possible_keys    key              key_len  ref   rows  Extra
1   SIMPLE       RSubdomains  range  EmploymentState  EmploymentState  67       NULL  2     Using where; Using temporary
If I remove the DISTINCT on RSubdomain, the "Using temporary" disappears from the EXPLAIN output. What I don't understand is why, when I add the composite key (while keeping the key on RSubdomain), the DISTINCT ends up using a temp table, and which index scheme is better here. I see that far fewer rows are scanned with the combined key, but the query is of type range and it is also slower.
Q: why ... does the distinct end up using a temp table?
MySQL is doing a range scan on the index (i.e. reading index blocks) to locate the rows that satisfy the predicates (WHERE clause). Then MySQL has to look up the value of the RSubdomain column from the underlying table (it's not available in the index). To eliminate duplicates, MySQL needs to scan the values of RSubdomain that were retrieved. The "Using temporary" indicates that MySQL is materializing a resultset, which is processed in a subsequent step. (Likely, that's the set of RSubdomain values that was retrieved; given the DISTINCT, it's likely that MySQL is actually creating a temporary table with RSubdomain as a primary or unique key, and only inserting non-duplicate values.)
In the first case, it looks like the rows are being retrieved in order by RSubdomain (likely, that's the first column in the cluster key). That means MySQL needn't compare all the RSubdomain values; it only needs to check whether the last retrieved value matches the currently retrieved value to determine whether the value can be "skipped."
Q: which index schema is better here?
The optimum index for your query is likely a covering index:
... ON R_Subdomains (RPhone, EmploymentState, RSubdomain)
But with only 3278 rows, you aren't likely to see any performance difference.
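That covering index would be created like so (a sketch; the index name is an assumption):

```sql
ALTER TABLE R_Subdomains
  ADD INDEX idx_phone_state_sub (RPhone, EmploymentState, RSubdomain);
```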
FOLLOWUP
Unfortunately, MySQL does not provide the type of instrumentation provided in other RDBMS (like the Oracle event 10046 sql trace, which gives actual timings for resources and waits.)
Since MySQL is choosing to use the index when it is available, that is probably the most efficient plan. For the best efficiency, I'd perform an OPTIMIZE TABLE operation (for InnoDB tables and MyISAM tables with dynamic format, if there have been a significant number of DML changes, especially DELETEs and UPDATEs that modify the length of the row...) At the very least, it would ensure that the index statistics are up to date.
You might want to compare the plan of an equivalent statement that does a GROUP BY instead of a DISTINCT, i.e.
SELECT r.RSubdomain
FROM R_Subdomains r
WHERE r.EmploymentState IN (0,1)
AND r.RPhone='7853932120'
GROUP
BY r.RSubdomain
For optimum performance, I'd go with a covering index with RPhone as the leading column; that's based on an assumption about the cardinality of the RPhone column (close to unique values), opposed to only a few different values in the EmploymentState column. That covering index will give the best performance... i.e. the quickest elimination of rows that need to be examined.
But again, with only a couple thousand rows, it's going to be hard to see any performance difference. If the query was examining millions of rows, that's when you'd likely see a difference, and the key to good performance will be limiting the number of rows that need to be inspected.