One day I answered a question on SO (it was accepted as correct), but the answer left me with a nagging doubt.
In short, the user had a table with these fields:
id INT PRIMARY KEY
dt DATETIME (with an INDEX)
lt DOUBLE
The query SELECT DATE(dt),AVG(lt) FROM table GROUP BY DATE(dt) was really slow.
We told him that (part of) the problem was grouping on DATE(dt), but the database was on a production server and it wasn't possible to split that field.
So another field da DATE (with an INDEX) was added, filled automatically with DATE(dt) by a trigger. The query SELECT da,AVG(lt) FROM table GROUP BY da was a bit faster, but with about 8 million records it still took about 60s!
I tried it on my PC and discovered that, after removing the index on da, the query took only 7s, while grouping on DATE(dt) without the index took 13s.
I've always thought an index on a column used for grouping could only speed the query up, not slow it down (8 times slower!).
Why? What is the reason?
Thanks a lot.
Because you still need to read all the data from both the index and the data file. Since you're not using any WHERE condition, you will always get a query plan that accesses all the data, row by row, and there is nothing you can do about that.
If performance is important for this query and it is performed often, I'd suggest caching the results in a temporary table and updating it hourly (daily, etc.).
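A minimal sketch of what such a cache table could look like, assuming the original table is called stats (the real table name was not given in the question) and reusing the da and lt columns:

CREATE TABLE daily_avg (
  da date NOT NULL PRIMARY KEY,  -- one row per day
  avg_lt double NOT NULL         -- precalculated AVG(lt) for that day
);

-- re-run this from a cron job or scheduled event every hour/day;
-- REPLACE overwrites rows whose day already exists
REPLACE INTO daily_avg
SELECT da, AVG(lt)
FROM stats
GROUP BY da;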
Why does it become slower? Because the data in the index is already sorted, and when MySQL calculates the cost of executing the query it thinks it will be better to use the already-sorted data, then group it, then calculate the aggregates. In this case, it is wrong.
I think this is because of this or a similar MySQL bug: Index degrades sort performance and optimizer does not honor IGNORE INDEX
I remember the question, as I was going to answer it but got distracted by something else. The problem was that his table design wasn't taking advantage of a clustered primary key index.
I would have re-designed the table creating a composite clustered primary key with the date as the leading part of the index. The sm_id field is still just a sequential unsigned int to guarantee uniqueness.
drop table if exists speed_monitor;
create table speed_monitor
(
created_date date not null,
sm_id int unsigned not null,
load_time_secs double(10,4) not null default 0,
primary key (created_date, sm_id)
)
engine=innodb;
+------+----------+
| year | count(*) |
+------+----------+
| 2009 | 22723200 | 22 million
| 2010 | 31536000 | 31 million
| 2011 | 5740800 | 5 million
+------+----------+
select
created_date,
count(*) as counter,
avg(load_time_secs) as avg_load_time_secs
from
speed_monitor
where
created_date between '2010-01-01' and '2010-12-31'
group by
created_date
order by
created_date
limit 7;
-- cold runtime
+--------------+---------+--------------------+
| created_date | counter | avg_load_time_secs |
+--------------+---------+--------------------+
| 2010-01-01 | 86400 | 1.66546802 |
| 2010-01-02 | 86400 | 1.66662466 |
| 2010-01-03 | 86400 | 1.66081309 |
| 2010-01-04 | 86400 | 1.66582251 |
| 2010-01-05 | 86400 | 1.66522316 |
| 2010-01-06 | 86400 | 1.66859480 |
| 2010-01-07 | 86400 | 1.67320440 |
+--------------+---------+--------------------+
7 rows in set (0.23 sec)
I am using a MySQL server via an Amazon cloud service, with default settings. The table involved, mytable, is of InnoDB type and has about 1 billion rows.
The query is:
select count(*), avg(`01`) from mytable where `date` = "2017-11-01";
It takes almost 10 minutes to execute. I have an index on date. The EXPLAIN of this query is:
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| 1 | SIMPLE | mytable | ref | date | date | 3 | const | 1411576 | NULL |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
The indexes from this table are:
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable | 0 | PRIMARY | 1 | ESI | A | 60398679 | NULL | NULL | | BTREE | | |
| mytable | 0 | PRIMARY | 2 | date | A | 1026777555 | NULL | NULL | | BTREE | | |
| mytable | 1 | lse_cd | 1 | lse_cd | A | 1919210 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | zone | 1 | zone | A | 732366 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | date | 1 | date | A | 85564796 | NULL | NULL | | BTREE | | |
| mytable | 1 | ESI_index | 1 | ESI | A | 6937686 | NULL | NULL | | BTREE | | |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
If I remove AVG():
select count(*) from mytable where `date` = "2017-11-01";
It only takes 0.15 sec to return the count. The count for this specific query is 692792; the counts are similar for other dates.
I don't have an index over 01. Is that an issue? Why does AVG() take so long to compute? There must be something I didn't do properly.
Any suggestion is appreciated!
To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast; after all, that is what indexes are made for) and then read the subsequent entries of the index until it finds the next date. Depending on the datatype of esi, this adds up to reading some MB of data to count your 700k rows. Reading some MB does not take much time (and that data might even already be cached in the buffer pool, depending on how often you use the index).
To calculate the average for a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. (It is made worse by the fact that "some bytes" really means an innodb_page_size page (16KB by default), so you may have to read up to 700k * 16KB = 11GB, compared to "some MB" for count(*); and depending on your memory configuration, some of this data might not be cached and has to be read from disk.)
A solution to this is to include all used columns in the index (a "covering index"), e.g. create an index on date, 01. Then MySQL does not need to access the table itself, and can proceed, similar to the first method, by just reading the index. The size of the index will increase a bit, so MySQL will need to read "some more MB" (and perform the avg-operation), but it should still be a matter of seconds.
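For example, something along these lines (the index name is just a placeholder):

ALTER TABLE mytable ADD INDEX idx_date_01 (`date`, `01`);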
In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the avg for several columns at the same time, you would need a covering index on all of them, e.g. date, 01, 02, ..., 24, to prevent table access. Be aware that an index that contains all columns requires as much storage space as the table itself (and it will take a long time to create such an index), so whether it is worth those resources depends on how important this query is.
To stay under the MySQL limit of 16 columns per index, you could split it into two indexes (and two queries). Create e.g. the indexes date, 01, ..., 12 and date, 13, ..., 24, then use
select * from (select `date`, avg(`01`), ..., avg(`12`)
from mytable where `date` = ...) as part1
cross join (select avg(`13`), ..., avg(`24`)
from mytable where `date` = ...) as part2;
Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.
If you only ever average over a single column, you could add 24 separate indexes (on date, 01, then date, 02, ...). In total they will require even more space, but each individual index is smaller and might be a little faster. The buffer pool might still favour the full index, depending on factors like usage patterns and memory configuration, so you may have to test it.
Since date is part of your primary key, you could also consider changing the primary key to date, esi. If you find the dates by the primary key, you would not need an additional step to access the table data (as you already access the table), so the behaviour would be similar to the covering index. But this is a significant change to your table and can affect all other queries (that e.g. use esi to locate rows), so it has to be considered carefully.
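Roughly, that change would look like the statement below. Note that it rebuilds the whole table, which will take a very long time on a billion rows, so treat it as a sketch rather than something to run casually on production:

ALTER TABLE mytable
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`date`, ESI);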
As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
For MyISAM tables, COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.
For example:
SELECT COUNT(*) FROM student;
https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html#function_count
If you add AVG() or anything else, you lose this optimization.
I have a MySQL table where the details of the check-in activity performed by all users are captured. Below is the structure with sample data.
Table Name: check_in
+-----------+--------------------+---------------------+
| id | user_id | time |
+-----------+--------------------+---------------------+
| 1 | 10001 | 2016-04-02 12:04:02 |
| 2 | 10001 | 2016-04-02 11:04:02 |
| 3 | 10002 | 2016-10-27 23:56:17 |
| 4 | 10001 | 2016-04-02 10:04:02 |
| 5 | 10002 | 2016-10-27 22:56:17 |
| 6 | 10002 | 2016-10-27 21:56:17 |
+-----------+--------------------+---------------------+
On the dashboard, I have to display each user and the time of their latest check-in activity (a sample dashboard view is shown below).
User 1's last check-in was at 2016-04-02 12:04:02
User 2's last check-in was at 2016-10-27 23:56:17
What is the best and most efficient way to write the query to pull this data?
I have written the query below, but it takes 5-8 seconds to execute. Note: this table has hundreds of thousands of rows.
select user_id, max(time) as last_check_in_at from check_in group by user_id
Your SQL query looks optimized to me.
The reason it is slow is probably that you do not have indexes on the user_id and time columns.
Try adding the following indexes to your table:
ALTER TABLE `check_in` ADD INDEX `user_id` (`user_id`)
ALTER TABLE `check_in` ADD INDEX `time` (`time`)
and then execute your SQL query again to see if it makes a difference.
The indexes should allow the SQL engine to quickly group the relevant records by user_id and also quickly determine the maximum time.
Indexes will also help to quickly sort data by time (as suggested by Rajesh)
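To check whether the new indexes are actually being used, compare the EXPLAIN output before and after adding them, for example:

EXPLAIN SELECT user_id, MAX(`time`) AS last_check_in_at
FROM check_in
GROUP BY user_id;

The key and Extra columns show which index (if any) MySQL decides to use for the grouping.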
Simply use ORDER BY.
Note: please don't use time as a column name, because it may be a reserved keyword in some languages/databases.
select chk.user_id user_id, chk.time as last_check_in_at from
check_in chk group by chk.user_id ORDER BY chk.time
You need to use ORDER BY in the query.
Please try this query:
select `user_id`,`time` as last_check_in_at from check_in group by user_id order by `time` DESC
It works for me.
I have an INNODB table levels:
+--------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id | int(9) | NO | PRI | NULL | |
| level_name | varchar(20) | NO | | NULL | |
| user_id | int(10) | NO | | NULL | |
| user_name | varchar(45) | NO | | NULL | |
| rating | decimal(5,4) | NO | | 0.0000 | |
| votes | int(5) | NO | | 0 | |
| plays | int(5) | NO | | 0 | |
| date_published | date | NO | MUL | NULL | |
| user_comment | varchar(255) | NO | | NULL | |
| playable_character | int(2) | NO | | 1 | |
| is_featured | tinyint(1) | NO | MUL | 0 | |
+--------------------+--------------+------+-----+---------+-------+
There are ~4 million rows. Because of the front-end functionality, I need to query this table with a variety of filters and sorts. They are on playable_character, rating, plays, and date_published. The date_published can be filtered to show the last day, week, month, or anytime (the last 3 years). There's also paging. So, depending on the user's choices, the queries can look, for example, like one of these:
SELECT * FROM levels
WHERE playable_character = 0 AND
date_published BETWEEN date_sub(now(), INTERVAL 3 YEAR) AND now()
ORDER BY date_published DESC
LIMIT 0, 1000;
SELECT * FROM levels
WHERE playable_character = 4 AND
date_published BETWEEN date_sub(now(), INTERVAL 1 WEEK) AND now()
ORDER BY rating DESC
LIMIT 4000, 1000;
SELECT * FROM levels
WHERE playable_character = 5 AND
date_published BETWEEN date_sub(now(), INTERVAL 1 MONTH) AND now()
ORDER BY plays DESC
LIMIT 1000, 1000;
I started out with an index idx_date_char(date_published, playable_character) that works great on the first example query here -- basically anything that's ordering by date_published. Using EXPLAIN, I get 'using index condition', which is good. I think I understand why the index works, since the same two indexed columns exist in the WHERE and ORDER BY clauses.
My problem is with queries that ORDER by plays or rating. I understand I'm introducing a third column, but for the life of me I can't get an index that works well, despite trying just about every variation I could think of: composite indexes of all three or four in every order, and so on. Maybe the query could be written differently?
I should add that rating and plays are always queried as DESC. Only date_published may be either DESC or ASC.
Any suggestions greatly appreciated. TIA.
It seems you would make good use of data sorted in this way for each of the queries:
playable_character, date_published
playable_character, date_published, rating
playable_character, date_published, plays
Bear in mind that the data you need sorted for the first query happens to be a subset of the data the second and third queries need, so we can get rid of that first index.
Also note that adding DESC or ASC to an index is syntactically correct but doesn't actually change anything, as that feature is not currently supported (it is expected to be supported in the future, which is why the syntax is there). All indexes are stored in ascending order. More information here.
So these are the indexes that you should create:
ALTER TABLE levels ADD INDEX (playable_character, date_published, rating)
ALTER TABLE levels ADD INDEX (playable_character, date_published, plays)
That should make the 3 queries up there run faster than Forrest Gump.
The columns used in your where clause AND order by should be part of the index. I would have an index on
( playable_character, date_published DESC, rating DESC, plays DESC )
The reason I would put playable_character FIRST is that you want that ID leading, then all the dates in question within it. The rating and plays columns are just along for the ride to assist the ORDER BY clause.
Think of the index like this. If you have it ordered by date_published, then playable_character, picture a room full of boxes, one box per date. Within the box for a given date, the rows are in order of character. So, with 3 years' worth of data to go through, you have to open every box from the last 3 years and find the character you are looking for.
Now, think of it the other way around. Each box is for one character, and within it, all the dates are pre-sorted. So you go to one box, open it, move to the date in question and grab the records in the X-Y range you want. Now you can apply a simple ORDER BY to those records.
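Spelled out as a statement, that would be something like the following (the index name is just a placeholder; as noted in the other answers, DESC in an index definition is parsed but ignored, so I leave it out):

ALTER TABLE levels
  ADD INDEX idx_char_date_rating_plays (playable_character, date_published, rating, plays);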
When your query includes a range predicate like BETWEEN, the order of columns in your index is important.
First, include one or more columns referenced by equality predicates.
Next, include one column referenced by a range predicate.
Any further columns in the index after the column referenced by a range predicate cannot be used for other range predicates or for sorting.
If you have no range predicate, you can add a column for sort order.
So your first query can benefit from an index on (playable_character, date_published). The sorting should be a no-op because the optimizer will just fetch rows in the index order.
The second and third queries are bound to do a filesort, because you have a range predicate and then you're sorting by a different column. If you had had only equality predicates, you would be able to use the third column to avoid the filesort, but that doesn't work when you have a range predicate.
The best you can hope for is that the conditions reduce the size of the result set so that it can sort in memory without doing too many sort merge passes. You can help this by increasing sort_buffer_size, but be careful not to increase it too much, because it's allocated per thread.
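For example, to experiment with a larger sort buffer for the current session only (the 4MB value is just an illustration, not a recommendation):

SET SESSION sort_buffer_size = 4 * 1024 * 1024;  -- in bytes, allocated per thread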
The ASC/DESC keywords in index definitions make no difference in MySQL.
See http://dev.mysql.com/doc/refman/5.6/en/create-index.html:
These keywords are permitted for future extensions for specifying ascending or descending index value storage. Currently, they are parsed but ignored; index values are always stored in ascending order.
I am using MySQL 5.6 on FreeBSD and have just recently switched from MyISAM tables to InnoDB to gain the advantages of foreign key constraints and transactions.
After the switch, I discovered that a query on a table with 100,000 rows that was previously taking .003 seconds, was now taking 3.6 seconds. The query looked like this:
SELECT *
FROM USERS u
JOIN MIGHT_FLOCK mf ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
I noticed that if I removed the ORDER BY clause, the execution time dropped back down to .003 seconds, so the problem is obviously in the sorting.
I then discovered that if I added back the ORDER BY but removed indexes on the columns referred to in the query (STATUS and ACCESS_ID), the query execution time would take the normal .003 seconds.
Then I discovered that if I added back the indexes on the STATUS and ACCESS_ID columns, but used IGNORE INDEX (STATUS,ACCESS_ID), the query would still execute in the normal .003 seconds.
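Roughly, that IGNORE INDEX variant of the query would look like this:

SELECT *
FROM USERS u IGNORE INDEX (STATUS, ACCESS_ID)
JOIN MIGHT_FLOCK mf ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8
ORDER BY mf.STREAK DESC LIMIT 0,100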
Is there something about InnoDB and sorting results when referencing an indexed column in a WHERE clause that I don't understand?
Or am I doing something wrong?
EXPLAIN for the slow query returns the following results:
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
| 1 | SIMPLE | u | ref | PRIMARY,STATUS,ACCESS_ID | STATUS | 2 | const | 53902 | Using index condition; Using where; Using temporary; Using filesort |
| 1 | SIMPLE | mf | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.u.USER_ID | 1 | NULL |
+----+-------------+-------+--------+--------------------------+---------+---------+---------------------+-------+---------------------------------------------------------------------+
EXPLAIN for the fast query returns the following results:
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
| 1 | SIMPLE | mf | index | PRIMARY | STREAK | 2 | NULL | 100 | NULL |
| 1 | SIMPLE | u | eq_ref | PRIMARY | PRIMARY | 4 | PRO_MIGHT.mf.USER_ID | 1 | Using where |
+----+-------------+-------+--------+---------------+---------+---------+----------------------+------+-------------+
Any help would be greatly appreciated.
In the slow case, MySQL is making an assumption that the index on STATUS will greatly limit the number of users it has to sort through. MySQL is wrong. Presumably most of your users are ACTIVE. MySQL is picking up 50k user rows, checking their ACCESS_ID, joining to MIGHT_FLOCK, sorting the results and taking the first 100 (out of 50k).
In the fast case, you have told MySQL it can't use either index on USERS. MySQL is using its next-best index, it is taking the first 100 rows from MIGHT_FLOCK using the STREAK index (which is already sorted), then joining to USERS and picking up the user rows, then checking that your users are ACTIVE and have an ACCESS_ID at or above 8. This is much faster because only 100 rows are read from disk (x2 for the two tables).
I would recommend:
drop the index on STATUS unless you frequently need to retrieve INACTIVE users (not ACTIVE users). This index is not helping you.
Read this question to understand why your sorts are so slow. You can probably tune InnoDB for better sort performance to prevent this kind of problem.
If you have very few users with ACCESS_ID at or above 8 you should see a dramatic improvement already. If not you might have to use STRAIGHT_JOIN in your select clause.
Example below:
SELECT *
FROM MIGHT_FLOCK mf
STRAIGHT_JOIN USERS u ON (u.USER_ID = mf.USER_ID)
WHERE u.STATUS = 'ACTIVE' AND u.ACCESS_ID >= 8 ORDER BY mf.STREAK DESC LIMIT 0,100
STRAIGHT_JOIN forces MySQL to access the MIGHT_FLOCK table before the USERS table based on the order in which you specify those two tables in the query.
To answer the question "Why did the behaviour change" you should start by understanding the statistics that MySQL keeps on each index: http://dev.mysql.com/doc/refman/5.6/en/myisam-index-statistics.html. If statistics are not up to date or if InnoDB is not providing sufficient information to MySQL, the query optimiser can (and does) make stupid decisions about how to join tables.
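If stale statistics turn out to be part of the problem, refreshing them is cheap (a sketch; InnoDB statistics are only estimates, so the optimiser may still pick a poor plan):

ANALYZE TABLE USERS, MIGHT_FLOCK;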
Table structure:
CREATE TABLE `mytable` (
`id` varchar(8) NOT NULL,
`event` varchar(32) NOT NULL,
`event_date` date NOT NULL,
`event_time` time NOT NULL,
KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
The data in this table looks like this:
id | event | event_date | event_time
---------+------------+-------------+-------------
ref1 | someevent1 | 2010-01-01 | 01:23:45
ref1 | someevent2 | 2010-01-01 | 02:34:54
ref1 | someevent3 | 2010-01-18 | 01:23:45
ref2 | someevent4 | 2012-10-05 | 22:23:21
ref2 | someevent5 | 2012-11-21 | 11:22:33
The table contains about 500,000,000 records similar to these.
The query I'd like to ask about here looks like this:
SELECT *
FROM `mytable`
WHERE `id` = 'ref1'
ORDER BY event_date DESC,
event_time DESC
LIMIT 0, 500
The EXPLAIN output looks like:
select_type: SIMPLE
table: E
type: ref
possible_keys: id
key: id
key_len: 27
ref: const
rows: 17024 (a common example)
Extra: Using where; Using filesort
Purpose:
This query is generated by a website; the LIMIT values are for a page-navigation element, so if the user wants to see older entries, they get adjusted to 500, 500, then 1000, 500 and so on.
Since some values in the id field can appear in quite a lot of rows, more rows will of course lead to a slower query. Profiling those slow queries showed me that the reason is the sorting: most of the time during the query, the MySQL server is busy sorting the data. Indexing the fields event_date and event_time didn't change that very much.
Example SHOW PROFILE Result, sorted by duration:
state | duration/sec | percentage
---------------|--------------|-----------
Sorting result | 12.00145 | 99.80640
Sending data | 0.01978 | 0.16449
statistics | 0.00289 | 0.02403
freeing items | 0.00028 | 0.00233
...
Total | 12.02473 | 100.00000
Now the question:
Before delving deeper into MySQL variables like sort_buffer_size and other server configuration options: can you think of any way to change the query or the sorting behaviour so that sorting isn't such a big performance eater anymore, while the purpose of this query stays in place?
I don't mind a bit of out-of-the-box thinking.
Thank you in advance!
As I wrote in a comment, a multi-column index (id, event_date desc, event_time desc) may help.
If this table will grow fast, you should consider adding an option in the application for the user to select data for a particular date range.
Example: the first step always returns 500 records, but to fetch older records the user sets a date range for the data and then uses pagination.
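A rough sketch of what such a date-bounded page could look like (the date range is just an example):

SELECT *
FROM mytable
WHERE id = 'ref1'
  AND event_date BETWEEN '2012-10-01' AND '2012-10-31'
ORDER BY event_date DESC, event_time DESC
LIMIT 0, 500;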
Indexing is most likely the solution; you just have to do it right. See the mysql reference page for this.
The most effective way to do it is to create a three-part index on (id, event_date, event_time). You can specify event_date desc, event_time desc in the index, but I don't think it's necessary.
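For example (the index name is just a placeholder):

ALTER TABLE mytable ADD INDEX idx_id_date_time (id, event_date, event_time);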
I would start by doing what sufleR suggests - the multi-column index on (id, event_date desc, event_time desc).
However, according to http://dev.mysql.com/doc/refman/5.0/en/create-index.html, the DESC keyword is supported, but doesn't actually do anything. That's a bit of a pain - so try it, and see if it improves the performance, but it probably won't.
If that's the case, you may have to cheat by adding a "sort_value" column with an automatically decrementing value (you'd pretty much have to do this in the application layer; I don't think you can auto-decrement in MySQL), and add that column to the index.
You'd end up with:
id | event | event_date | event_time | sort_value
---------+------------+-------------+------------+-----------
ref1 | someevent1 | 2010-01-01 | 01:23:45 | 0
ref1 | someevent2 | 2010-01-01 | 02:34:54 | -1
ref1 | someevent3 | 2010-01-18 | 01:23:45 | -2
ref2 | someevent4 | 2012-10-05 | 22:23:21 | -3
ref2 | someevent5 | 2012-11-21 | 11:22:33 | -4
and an index on id and sort_value.
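For example (a sketch; the column default and index name are placeholders):

ALTER TABLE mytable
  ADD COLUMN sort_value int NOT NULL DEFAULT 0,
  ADD INDEX idx_id_sort (id, sort_value);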
Dirty, but the only other suggestion is to reduce the number of records matching the where clause in other ways - for instance, by changing the interface not to return 500 records, but records for a given date.