Using datetime index in where clause MySQL - mysql

I have a table with 200 million rows where index is created in "created_at" column which is datetime datatype.
show create table [tablename] outputs:
create table `table`
(`created_at` datetime NOT NULL)
PRIMARY KEY (`id`)
KEY `created_at_index` (`created_at`)
ENGINE=InnoDB AUTO_INCREMENT=208512112 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci'
created_at ranges from 2020-04-01 ~ 2020-05-28.
I want to get only rows that are beyond 2020-05-15 23:00:00.
when I run :
EXPLAIN SELECT created_at
FROM table
where created_at >= '2020-05-15 23:00:00';
it says it outputs:
rows Extra
200mil Using Where
My understanding is that in RDMS if no index rows are not ordered but when you create an index on a column it is in sorted order therefore right after finding '2020-05-15 23:00:00' it will simply return all rows after that.
Also since its cardinalty is 7mil I'd thought using an index would be better than full table scan.
Is it because I've inputted date as a string? but when I try
where created_at >= date('2020-05-15 23:00:00');
still the same.
and
where created_at >= datetime('2020-05-15 23:00:00');
outputs syntax error.
Did mysql just decide it would be more efficient to do a full table scan?
EDIT:
using equals
EXPLAIN SELECT created_at
FROM table
where created_at = '2020-05-15';
outputs:
key_len ref rows Extra
5 const 51
In where clause if I change string to date('2020-05-15') it outputs:
key_len ref rows Extra
5 const 51 Using index condition
does this mean that first equal query one didn't use an index?

All of your queries would take advantage of an index on column created_at. MySQL always uses an index when it matches the predicate(s) of the where clause.
The output of your explains do indicate that you do not have this index in place, which is confirmed by the output of your create table.
Just create the index and your database will use it.
Here is a demo:
-- sample table, without the index
create table mytable(id int, created_at datetime);
-- the query does a full scan, as no index is available
explain select created_at from mytable where created_at >= '2020-05-15 23:00:00';
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
-: | :---------- | :------ | :--------- | :--- | :------------ | :--- | :------ | :--- | ---: | -------: | :----------
1 | SIMPLE | mytable | null | ALL | null | null | null | null | 1 | 100.00 | Using where
-- now add the index
create index idx_mytable_created_at on mytable(created_at);
-- the query uses the index
explain select created_at from mytable where created_at >= '2020-05-15 23:00:00';
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra
-: | :---------- | :------ | :--------- | :---- | :--------------------- | :--------------------- | :------ | :--- | ---: | -------: | :-----------------------
1 | SIMPLE | mytable | null | index | idx_mytable_created_at | idx_mytable_created_at | 6 | null | 1 | 100.00 | Using where; Using index

If the values are evenly distributed, about 25% of the rows are >= '2020-05-15 23:00:00' Yes, Mysql will prefer a full table scan to using the index when you have such a large percentage of the table needed.
See Why does MySQL not always use index for select query?
In a DATE context, date('2020-05-15 23:00:00') is the same as '2020-05-15'.
In a DATETIME context, datetime('2020-05-15 23:00:00') is the same as '2020-05-15 23:00:00'.
Using index means that the INDEX is "covering", which means that the entire query can be performed entirely in the index's BTree -- without reaching over to the data's BTree.
Using index condition means something quite different -- it has to do with a minor optimization relating to the two layers ("handler" and "engine") in MySQL's design. (More details in "ICP" aka "Index Condition Pushdown".)

Related

Why MySQL indexing is taking too much time for < operator?

This is my MYSQL table demo having more than 7 million rows;
+-------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id | varchar(42) | YES | MUL | NULL | |
| date | datetime | YES | MUL | NULL | |
| text | varchar(100) | YES | | NULL | |
+-------+--------------+------+-----+---------+-------+
I read that indexes work sequentially.
Case 1:
select * from demo where id="43984a7e-edcf-11ea-92c7-509a4cb89342" order by date limit 30;
I created (id, date) index and it is working fine and query is executing too fast.
But Hold on to see the below cases.
Case 2:
Below is my SQL query.
select * from demo where id>"43984a7e-edcf-11ea-92c7-509a4cb89342" order by date desc limit 30;
to execute the above query faster I created an index on (id, date). But it is taking more than 10 sec.
then I made another index on (date). This took less than 1 sec. Why the composite index(id, date) is too much slower than (date) index in this case ??
Case 3:
select * from demo where id<"43984a7e-edcf-11ea-92c7-509a4cb89342" order by date desc limit 30;
for this query, even the (date) index is taking more than 1.8 sec. Why < operator is not optimized with any index either it is (date) or(id, date).
and even this query is just going through around 300 rows and still taking more than 1.8 sec why?
mysql> explain select * from demo where id<"43984a7e-edcf-11ea-92c7-509a4cb89342" order by date desc limit 30;
+----+-------------+-------+------------+-------+-----------------------+------------+---------+------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+-----------------------+------------+---------+------+------+----------+-------------+
| 1 | SIMPLE | demo | NULL | index | demoindex1,demoindex2 | demoindex3 | 6 | NULL | 323 | 36.30 | Using where |
+----+-------------+-------+------------+-------+-----------------------+------------+---------+------+------+----------+-------------+
Any suggestions for how to create an index in Case 3 to optimize it?
In your first query, the index can be used for both the where clause and the ordering. So it will be very fast.
For the second query, the index can only be used for the where clause. Because of the inequality, information about the date is no longer in order. So the engine needs to explicitly order.
In addition, I imagine that the second query returns much more data than the first -- a fair amount of data if it take 10 seconds to sort it.

Why does an indexed mysql query filtered on less char values result in more rows examined?

When I run the following query, I see the expected rows examined as 40
EXPLAIN SELECT s.* FROM subscription s
WHERE s.current_period_end_date <= NOW()
AND s.status in ('A', 'T')
AND s.period_end_action in ('R','C')
ORDER BY s._id ASC limit 20;
+----+-------------+-------+-------+--------------------------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+--------------------------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | s | index | status,current_period_end_date | PRIMARY | 4 | NULL | 40 | Using where |
+----+-------------+-------+-------+--------------------------------+---------+---------+------+------+-------------+
But when I run this query that simply changes AND s.period_end_action in ('R','C') to AND s.period_end_action = 'C', I see the expected rows examined as 611
EXPLAIN SELECT s.* FROM subscription s
WHERE s.current_period_end_date <= NOW()
AND s.status in ('A', 'T')
AND s.period_end_action = 'C'
ORDER BY s._id ASC limit 20;
+----+-------------+-------+-------+--------------------------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+--------------------------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | s | index | status,current_period_end_date | PRIMARY | 4 | NULL | 611 | Using where |
+----+-------------+-------+-------+--------------------------------+---------+---------+------+------+-------------+
I have the following indexes on the subscription table:
_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
INDEX(status, period_end_action),
INDEX(current_period_end_date),
Any ideas? I don't understand why removing one of the period_end_action values would cause such a large increase in rows examined?
(I agree with others that EXPLAIN often has terrible row estimates.)
Actually the numbers might be reasonable (though I doubt it). The optimizer decided to do a table scan in both cases. And the query with fewer options for period_end_action probably has to scan farther to get the 20 rows. This is because it punted on using either of your secondary indexes.
These indexes are more likely to help your second query:
INDEX(period_end_action, _id)
INDEX(period_end_action, status)
INDEX(period_end_action, current_period_end_date)
The optimal index is usually starts with any columns tested by =.
Since there is no such thing for your first query, the Optimizer probably decided to scan in _id order so that it could avoid the "sort" mandated by ORDER BY.

MySQL- Improvement on count(*) aggregation with composite index keys

I have a table with the following structure with almost 120000 rows,
desc user_group_report
+------------------+----------+------+-----+-------------------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+----------+------+-----+-------------------+-------+
| user_id | int | YES | MUL | NULL | |
| group_id | int(11) | YES | MUL | NULL | |
| type_id | int(11) | YES | | NULL | |
| group_desc | varchar(128)| NO| | NULL |
| status | enum('open','close')|NO| | NULL | |
| last_updated | datetime | NO | | CURRENT_TIMESTAMP | |
+------------------+----------+------+-----+-------------------+-------+
I have indexes on the following keys :
user_group_type(user_id,group_id,group_type)
group_type(group_id,type_id)
user_type(user_id,type_id)
user_group(user_id,group_id)
My issue is I am running a count(*) aggregation on above table group by group_id and with a clause on type_id
Here is the query :
select count(*) user_count, group_id
from user_group_report
where type_id = 1
group by group_id;
and here is the explain plan (query taking 0.3 secs on average):
+----+-------------+------------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+
| 1 | SIMPLE | user_group_report | index | user_group_type,group_type,user_group | group_type | 10 | NULL | 119811 | Using where; Using index |
+----+-------------+------------------+-------+---------------------------------+---------+---------+------+--------+--------------------------+
Here as I understand the query almost does a full table scan because of complex indices and When I am trying to add an index on group_id, the rows in explain plan shows a less number (almost half the rows) but the time taking for query execution is increased to 0.4-0.5 secs.
I have tried different ways to add/remove indices but none of them is reducing the time taken.
Assuming the table structure cannot be changed and querying is independent of other tables, Can someone suggest me a better way to optimize the above query or If i am missing anything here.
PS:
I have already tried to modify the query to the following but couldn't find any improvement.
select count(user_id) user_count, group_id
from user_group_report
where type_id = 1
group by group_id;
Any little help is appreciated.
Edit:
As per the suggestions, I added a new index
type_group on (type_id,group_id)
This is the new explain plan. The number of rows in explain,reduced but the query execution time is still the same
+----+-------------+------------------+------+---------------------------------+---------+---------+-------+-------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------------------------+---------+---------+-------+-------+--------------------------+
| 1 | SIMPLE | user_group_report | ref | user_group_type,type_group,user_group | type_group | 5 | const | 59846 | Using where; Using index |
+----+-------------+------------------+------+---------------------------------+---------+---------+-------+-------+--------------------------+
EDIT 2:
Adding details as suggested in answers/comments
select count(*)
from user_group_report
where type_id = 1
This query itself is taking 0.25 secs to execute.
and here is the explain plan:
+----+-------------+------------------+------+---------------+---------+---------+-------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------------+------+---------------+---------+---------+-------+-------+-------------+
| 1 | SIMPLE | user_group_report | ref | type_group | type_group | 5 | const | 59866 | Using index |
+----+-------------+------------------+------+---------------+---------+---------+-------+-------+-------------+
I believe that your group_type is wrong. Try to switch the attributes.
create index ix_type_group on user_group_report(type_id,group_id)
This index is better for your query because you specify the type_id = 1 in the where clause. Therefore, the query processor finds the first record with type_id = 1 in your index and then it scans the records in the index with this type_id and performs the aggregation. With such index, only relevant records in the index are accessed which is not possible with the group_type index.
If type_id is selective (i.e. it reduces the search space significantly), creating an index on type_id, group_id should help significantly.
This is because it reduces the number of records that need to be grouped first (remove everything where type_id != 1), and only then does the grouping/summing.
EDIT:
Following on from the comments, it seems we need to figure out more about where the bottleneck is - finding the records, or grouping/summing.
The first step would be to measure the performance of:
select count(*)
from user_group_report
where type_id = 1
If that is significantly faster, the challenge is likely in the grouping than in finding the records. If that's just as slow, it's in finding the records in the first place.
Do most of the columns really need to be NULLable? Change to NOT NULL where applicable.
What percentage of the table has type_id = 1? If it is most of the table, then that would explain why you don't see much improvement. Meanwhile, the EXPLAIN seems to be thinking there are only two distinct values for type_id, hence it says only half the table will be scanned -- this number cannot be trusted.
To get more insight into what is going on, please do these:
EXPLAIN FORMAT=JSON SELECT...;
And
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
We can help interpret the data you get there. (Here is a brief discussion of such.)

Index on DATE(column)

if i use the following to compare a date with a datetime
SELECT * FROM `calendar` WHERE DATE(startTime) = '2010-04-29'
can i still use a index on startTime ?
after i create a index :
create index mi_date on calendar(startTime);
explain result :
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | calendar | index | NULL | mi_date | 6 | NULL | 25 | Using where; Using index |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
and with query
SELECT * FROM `calendar` WHERE startTime like '2010-04-29 %'
explain cmd :
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | calendar | index | mi_date | mi_date | 6 | NULL | 25 | Using where; Using index |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
the diference between the first explain is that column possible_keys is not null.
And there is also this query :
SELECT * FROM `calendar` WHERE startTime BETWEEN '2010-04-29 00:00:00' AND '2010-04-29 23:59:59'
explain :
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
| 1 | SIMPLE | calendar | range | mi_date | mi_date | 6 | NULL | 16 | Using where; Using index |
+----+-------------+--------------+-------+---------------+---------+---------+------+------+--------------------------+
the column type is range and the rows scanned are less just 16 which is the result of my query.
p.s : i have 25 rows in my table.
With EXPLAIN
mysql> EXPLAIN SELECT * FROM `calendar` WHERE DATE(startTime) = '2010-04-29'
you can check index using from mysql. But normaly MySQl dont' use a index using DATE().
Try
SELECT * FROM `calendar` WHERE startTime BETWEEN '2010-04-29 00:00:00' AND '2010-04-29 23:59:59'
You can know it adding "Explain" before your query, this will report with keys are being used. Usually the use of functions disables indexed searchs, but you could perfectly use indexes modifying your query:
SELECT * FROM `calendar` WHERE startTime like '2010-04-29 %'
Well, as you can see in the results of the Explain, in this case (yours, using date(..) ) mysql is already using the index. Check the ref field:
ref – Shows the columns or constants that are compared to the index
named in the key column. MySQL will either pick a constant value to be
compared or a column itself based on the query execution plan. You can
see this in the example given below.
Which is better? At this point they must be likely the same. You can try other queries and take a look at the rows field.
rows – lists the number of records that were examined to produce the
output. This Is another important column worth focusing on optimizing
queries, especially for queries that use JOIN and subqueries.
The on with less rows examined is the best.

Is index dependent on selected columns?

I am executing most of the queries based on the time. So i created index for the created time. But , The index only works , If I select the indexed columns only. Is mysql index is dependant the selected columns?.
My Assumption On Index
I thought index is like a telephone dictionary index page. Ex: If i want to find "Mark" . Index page shows which page character "M" starts in the directory. I think as same as the mysql works.
Table
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Name | varchar(100) | YES | | NULL | |
| OPERATION | varchar(100) | YES | | NULL | |
| PID | int(11) | YES | | NULL | |
| CREATED_TIME | bigint(20) | YES | | NULL | |
+--------------+--------------+------+-----+---------+----------------+
Indexes On the table.
+-----------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| IndexTest | 0 | PRIMARY | 1 | ID | A | 10261 | NULL | NULL | | BTREE | | |
| IndexTest | 1 | t_dx | 1 | CREATED_TIME | A | 410 | NULL | NULL | YES | BTREE | | |
+-----------+------------+----------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Queries Using Indexes:
explain select * from IndexTest where ID < 5;
+----+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | IndexTest | range | PRIMARY | PRIMARY | 4 | NULL | 4 | Using where |
+----+-------------+-----------+-------+---------------+---------+---------+------+------+-------------+
explain select CREATED_TIME from IndexTest where CREATED_TIME > UNIX_TIMESTAMP(CURRENT_DATE())*1000;
+----+-------------+-----------+-------+---------------+------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+------+---------+------+------+--------------------------+
| 1 | SIMPLE | IndexTest | range | t_dx | t_dx | 9 | NULL | 5248 | Using where; Using index |
+----+-------------+-----------+-------+---------------+------+---------+------+------+--------------------------+
Queries Not using Indexes
explain select count(distinct(PID)) from IndexTest where CREATED_TIME > UNIX_TIMESTAMP(CURRENT_DATE())*1000;
+----+-------------+-----------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | IndexTest | ALL | t_dx | NULL | NULL | NULL | 10261 | Using where |
+----+-------------+-----------+------+---------------+------+---------+------+-------+-------------+
explain select PID from IndexTest where CREATED_TIME > UNIX_TIMESTAMP(CURRENT_DATE())*1000;
+----+-------------+-----------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+------+---------------+------+---------+------+-------+-------------+
| 1 | SIMPLE | IndexTest | ALL | t_dx | NULL | NULL | NULL | 10261 | Using where |
+----+-------------+-----------+------+---------------+------+---------+------+-------+-------------+
Short answer: No.
Whether indexes are used depends on the expresion in your WHERE clause, JOINs etc, but not on the columns you select.
But no rule without an exception (or actually a long list of those):
Long answer: Usually not
There are a number of factors used by the MySQL Optimizer in order to determine whether it should use an index.
The optimizer may decide to ignore an index if...
another (otherwise non-optimal) saves it from accessing the table data at all
it fails to understand that an expression is a constant
its estimates suggest it will return the full table anyway
if its use will cause the creation of a temporary file
... and tons of other reasons, some of which seem not to be documented anywhere
Sometimes the choices made by said optimizer are... erm... lets call them sub-optimal. Now what do you do in those cases?
You can help the optimizer by doing an OPTIMIZE TABLE and/or ANALYZE TABLE. That is easy to do, and sometimes helps.
You can make it use a certain index with the USE INDEX(indexname) or FORCE INDEX(indexname) syntax
You can make it ignore a certain index with the IGNORE INDEX(indexname) syntax
More details on Index Hints, Optimize Table and Analyze Table on the MySQL documentation website.
Actually, it makes no difference wether you select the column or not. Indexes are used for lookups, meaning for reducing really fast the number of records you need retrieved. That makes it usually useful in situations where: you have joins, you have where conditions. Also indexes help alot in ordering.
Updating and deleting can be sped up quite alot using indexes on the where conditions as well.
As an example:
table: id int pk ai, col1 ... indexed, col2 ...
select * from table -> does not use a index
select id from table where col1 = something -> uses the col1 index although it is not selected.
Looking at the second query, mysql does a lookup in the index, locates the records, then in this case stops and delivers (both id and col1 have index and id happens to be pk, so no need for a secondary lookup).
Situation changes a little in this case:
select col2 from table where col1 = something
This will make internally 2 lookups: 1 for the condition, and 1 on the pk for delivering the col2 data. Please notice that again, you don't need to select the col1 column to use the index.
Getting back to your query, the problem lies with: UNIX_TIMESTAMP(CURRENT_DATE())*1000;
If you remove that, your index will be used for lookups.
Is mysql index is dependant the selected columns?.
Yes, absolutely.
For example:
MySQL cannot use the index to perform lookups if the columns do not form a leftmost
prefix of the index. Suppose that you have the SELECT statements shown here:
SELECT * FROM tbl_name WHERE col1=val1;
SELECT * FROM tbl_name WHERE col1=val1 AND col2=val2;
SELECT * FROM tbl_name WHERE col2=val2;
SELECT * FROM tbl_name WHERE col2=val2 AND col3=val3;
If an index exists on (col1, col2, col3), only the first two queries use the index.
The third and fourth queries do involve indexed columns, but (col2) and (col2, col3)
are not leftmost prefixes of (col1, col2, col3).
Have a read through the extensive documentation.
for mysql query , the answer is yes, but not all
the query:
explain select * from IndexTest where ID < 5;
use the table cluster index if you use innodb, its table's primary key, so it use primary for query
the second query:
select CREATED_TIME from IndexTest where CREATED_TIME >
UNIX_TIMESTAMP(CURRENT_DATE())*1000;
this one is just fetch the index column that mysql does not need to fetch data from table but just index, so your explain result got "Using Index"
the query:
select count(distinct(PID)) from IndexTest where CREATED_TIME >
UNIX_TIMESTAMP(CURRENT_DATE())*1000;
it look like this
select PID from IndexTest where
CREATE_TIME>UNIX_TIMESTAMP(CURRENT_DATE())*1000 group by PID
mysql can use index to fetch data from database also, but mysql thinks this query it no need to use index to fetch data, because of the where condition filter, mysql thinks that use index fetch data is more expensive than scan all table, you can use force index also
the same reason for your last query
hopp this answer can help you
indexing helps speed the search for that particular column and associated data rather than the table data. So you have to include the indexed column to speed up select.