Optimizing a query in MySQL 5.6

I have an INNODB table levels:
+--------------------+--------------+------+-----+---------+-------+
| Field              | Type         | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+-------+
| id                 | int(9)       | NO   | PRI | NULL    |       |
| level_name         | varchar(20)  | NO   |     | NULL    |       |
| user_id            | int(10)      | NO   |     | NULL    |       |
| user_name          | varchar(45)  | NO   |     | NULL    |       |
| rating             | decimal(5,4) | NO   |     | 0.0000  |       |
| votes              | int(5)       | NO   |     | 0       |       |
| plays              | int(5)       | NO   |     | 0       |       |
| date_published     | date         | NO   | MUL | NULL    |       |
| user_comment       | varchar(255) | NO   |     | NULL    |       |
| playable_character | int(2)       | NO   |     | 1       |       |
| is_featured        | tinyint(1)   | NO   | MUL | 0       |       |
+--------------------+--------------+------+-----+---------+-------+
There are ~4 million rows. Because of the front-end functionality, I need to query this table with a variety of filters and sorts. They are on playable_character, rating, plays, and date_published. The date_published can be filtered to show the last day, week, month, or anytime (the last 3 years). There's also paging. So, depending on the user's choices, the queries can look, for example, like one of these:
SELECT * FROM levels
WHERE playable_character = 0 AND
date_published BETWEEN date_sub(now(), INTERVAL 3 YEAR) AND now()
ORDER BY date_published DESC
LIMIT 0, 1000;
SELECT * FROM levels
WHERE playable_character = 4 AND
date_published BETWEEN date_sub(now(), INTERVAL 1 WEEK) AND now()
ORDER BY rating DESC
LIMIT 4000, 1000;
SELECT * FROM levels
WHERE playable_character = 5 AND
date_published BETWEEN date_sub(now(), INTERVAL 1 MONTH) AND now()
ORDER BY plays DESC
LIMIT 1000, 1000;
I started out with an index idx_date_char(date_published, playable_character) that works great on the first example query here -- basically anything that's ordering by date_published. Using EXPLAIN, I get 'using index condition', which is good. I think I understand why the index works, since the same two indexed columns exist in the WHERE and ORDER BY clauses.
My problem is with queries that ORDER by plays or rating. I understand I'm introducing a third column, but for the life of me I can't get an index that works well, despite trying just about every variation I could think of: composite indexes of all three or four in every order, and so on. Maybe the query could be written differently?
I should add that rating and plays are always queried as DESC. Only date_published may be either DESC or ASC.
Any suggestions greatly appreciated. TIA.

It seems you would make good use of data sorted in this way for each of the queries:
playable_character, date_published
playable_character, date_published, rating
playable_character, date_published, plays
Bear in mind that the ordering the first query needs happens to be a prefix of what the second and third queries need, so a separate index for it is redundant and we can get rid of it.
Also note that adding DESC or ASC to an index definition is syntactically correct but doesn't actually change anything, as that feature is not currently supported (the syntax is accepted so it can be supported in the future). All indexes are stored in ascending order. More information is in the MySQL CREATE INDEX documentation quoted further down.
So these are the indexes that you should create:
ALTER TABLE levels ADD INDEX (playable_character, date_published, rating);
ALTER TABLE levels ADD INDEX (playable_character, date_published, plays);
That should make the 3 queries up there run faster than Forrest Gump.
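To check whether the optimizer actually picks them up, a quick hedged verification is to wrap, say, the second query from the question in EXPLAIN:
EXPLAIN SELECT * FROM levels
WHERE playable_character = 4 AND
date_published BETWEEN date_sub(now(), INTERVAL 1 WEEK) AND now()
ORDER BY rating DESC
LIMIT 4000, 1000;
If the key column shows the new index but Extra still says "Using filesort", the index helped with the filtering but not with the sort (see the discussion of range predicates in a later answer).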

The columns used in your where clause AND order by should be part of the index. I would have an index on
( playable_character, date_published DESC, rating DESC, plays DESC )
The reason I would put playable_character FIRST is that you want to match on that value first, then on the dates in question. The rating and plays columns are just along for the ride to assist the ORDER BY clause.
Think of the index like this. If you have it ordered by date_published, then playable_character, think of a room full of boxes. Each box has a date. Within the box for a given date, the records are in order of character. So, with 3 years' worth of data to go through, you have to open every box from the last 3 years and find the character you are looking for.
Now, think of it like this. Each box is for one character, and within it, all that character's dates are pre-sorted. So, you go to one box, open it, move to the date in question, and grab the records in the X-Y range you want. Now you can apply a simple ORDER BY to those records.
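As a minimal sketch of the single composite index this answer describes (the index name is my own invention, and note that in MySQL 5.6 the DESC keywords in the definition are parsed but ignored):
ALTER TABLE levels ADD INDEX idx_char_date_rating_plays
(playable_character, date_published, rating, plays);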

When your query includes a range predicate like BETWEEN, the order of columns in your index is important.
First, include one or more columns referenced by equality predicates.
Next, include one column referenced by a range predicate.
Any further columns in the index after the column referenced by a range predicate cannot be used for other range predicates or for sorting.
If you have no range predicate, you can add a column for sort order.
So your first query can benefit from an index on (playable_character, date_published). The sorting should be a no-op because the optimizer will just fetch rows in the index order.
The second and third queries are bound to do a filesort, because you have a range predicate and then you're sorting by a different column. If you had had only equality predicates, you would be able to use the third column to avoid the filesort, but that doesn't work when you have a range predicate.
The best you can hope for is that the conditions reduce the size of the result set so that it can sort in memory without doing too many sort merge passes. You can help this by increasing sort_buffer_size, but be careful not to increase it too much, because it's allocated per thread.
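For example, a hedged per-session bump; the 4 MB figure is purely illustrative, so tune it against your own workload:
SET SESSION sort_buffer_size = 4 * 1024 * 1024;
Setting it per session avoids raising the allocation for every connection on the server.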
The ASC/DESC keywords in index definitions make no difference in MySQL.
See http://dev.mysql.com/doc/refman/5.6/en/create-index.html:
These keywords are permitted for future extensions for specifying ascending or descending index value storage. Currently, they are parsed but ignored; index values are always stored in ascending order.

Related

MySQL Left Join / explain why the original order of the first table has not been kept? [duplicate]

I have a MySQL db with a table 'difficulties' with a few records. If I do "select * from difficulties" I get them back in the order they were added, ordered by primary key id:
mysql> select * from difficulties;
+----+-------+-----------+--------+----------+-----------+
| id | value | name      | letter | low_band | high_band |
+----+-------+-----------+--------+----------+-----------+
|  1 |     1 | very_easy | VE     |        1 |         1 |
|  2 |     2 | easy      | E      |        2 |         5 |
|  3 |     3 | medium    | M      |        6 |        10 |
|  4 |     4 | hard      | H      |       11 |        12 |
|  5 |     0 | na        | NA     |        0 |         0 |
+----+-------+-----------+--------+----------+-----------+
However, if I do "select name from difficulties" I get them back in a different order:
mysql> select name from difficulties;
+-----------+
| name      |
+-----------+
| easy      |
| hard      |
| medium    |
| na        |
| very_easy |
+-----------+
My question is: what determines this order? Is there any logic to it? Is it something like "the order the files representing the records happen to be in within the filesystem" or something else that is to all intents and purposes random?
thanks, max
This is correct and by design: if you don't ask for sorting, the server doesn't bother with sorting (sorting can be an expensive operation), and it will return the rows in whatever order it sees fit. Without a requested order, the way the records are ordered can even differ from one query to the next (although that's not too likely).
The order is definitely not random - it's just whatever way the rows come out of the query, and as you see, even minor modifications can change this un-order significantly. This "undefined" ordering is implementation dependent, unpredictable and should not be relied upon.
If you want the elements to be ordered, use the ORDER BY clause (that's its purpose) - e.g.
SELECT name FROM difficulties ORDER BY name ASC;
That will always return the result sorted by name, in ascending order. Or, if you want them ordered by the primary key, last on top, use:
SELECT name FROM difficulties ORDER BY id DESC;
You can even sort by function - if you actually want random order, do this (caveat: horrible performance with largish tables):
SELECT name FROM difficulties ORDER BY RAND();
For more details see the MySQL documentation on ORDER BY.
As Piskvor said, MySQL will order the results however it finds most convenient. To address the "why" part of your question, the different result orders are probably a side effect of different execution plans. If you have an index on the name column, the second query would make use of it but the first would not.
Without the ORDER BY clause, the results are returned in random order. However, it seems logical that the easiest (and fastest) way for the db engine is to return data as it's stored. That's why the first resultset is ordered by PK (no fragmentation, so logical order matches physical order). In the second case I would assume that you have an index on the field name, and for the query select name from difficulties this index is covering, so the db engine scans this index, which is why you see the results ordered by name. Anyway, you shouldn't rely on such "default" ordering.
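A hedged way to see the two plans side by side, assuming an index named idx_name on difficulties(name) (that name is an assumption, not from the question):
EXPLAIN SELECT * FROM difficulties;
EXPLAIN SELECT name FROM difficulties;
The first should show a full table scan, returning rows in storage order; the second should show the name index with "Using index" in Extra, which is why those rows come back ordered by name.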
select name from difficulties should return the values in alphabetical order, as it is a text field, and select * from difficulties will return in numeric order, I believe. Don't hold me to that, though.
The best thing to do is use ORDER BY if you care about what order things come back in.

Very simple AVG() aggregation query on MySQL server takes a ridiculously long time

I am using a MySQL server via an Amazon cloud service, with default settings. The table involved, mytable, is an InnoDB table and has about 1 billion rows.
The query is:
select count(*), avg(`01`) from mytable where `date` = "2017-11-01";
Which takes almost 10 min to execute. I have an index on date. The EXPLAIN of this query is:
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table         | type | possible_keys | key  | key_len | ref   | rows    | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
|  1 | SIMPLE      | mytable       | ref  | date          | date | 3       | const | 1411576 | NULL  |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
The indexes from this table are:
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table         | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable       |          0 | PRIMARY   |            1 | ESI         | A         |    60398679 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          0 | PRIMARY   |            2 | date        | A         |  1026777555 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          1 | lse_cd    |            1 | lse_cd      | A         |     1919210 |     NULL | NULL   | YES  | BTREE      |         |               |
| mytable       |          1 | zone      |            1 | zone        | A         |      732366 |     NULL | NULL   | YES  | BTREE      |         |               |
| mytable       |          1 | date      |            1 | date        | A         |    85564796 |     NULL | NULL   |      | BTREE      |         |               |
| mytable       |          1 | ESI_index |            1 | ESI         | A         |     6937686 |     NULL | NULL   |      | BTREE      |         |               |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
If I remove AVG():
select count(*) from mytable where `date` = "2017-11-01";
It only takes 0.15 sec to return the count. The count for this specific query is 692792; the counts are similar for other dates.
I don't have an index on 01. Is that the issue? Why does AVG() take so long to compute? There must be something I didn't do properly.
Any suggestion is appreciated!
To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast, after all that is what indexes are made for) and then read the subsequent entries of the index until it finds the next date. Depending on the datatype of esi, this will sum up to reading some MB of data to count your 700k rows. Reading some MB does not take much time (and that data might even already be cached in the buffer pool, depending on how often you use the index).
To calculate the average of a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. It is made worse by the fact that "some bytes" really means a whole page of innodb_page_size (16KB by default), so you may have to read up to 700k * 16KB = 11GB, compared to "some MB" for count(*); and depending on your memory configuration, some of this data might not be cached and has to be read from disk.
A solution to this is to include all used columns in the index (a "covering index"), e.g. create an index on date, 01. Then MySQL does not need to access the table itself, and can proceed, similar to the first method, by just reading the index. The size of the index will increase a bit, so MySQL will need to read "some more MB" (and perform the avg-operation), but it should still be a matter of seconds.
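A minimal sketch of that covering index (the index name idx_date_01 is an assumption):
ALTER TABLE mytable ADD INDEX idx_date_01 (`date`, `01`);
Afterwards, EXPLAIN for the original query should show "Using index" in the Extra column, meaning no table access is needed.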
In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the avg for several columns at the same time, you would need a covering index on all of them, e.g. date, 01, 02, ..., 24, to prevent table access. Be aware that an index that contains all columns requires as much storage space as the table itself (and it will take a long time to create such an index), so whether it is worth those resources depends on how important this query is.
To avoid the MySQL-limit of 16 columns per index, you could split it into two indexes (and two queries). Create e.g. the indexes date, 01, .., 12 and date, 13, .., 24, then use
select * from (select `date`, avg(`01`), ..., avg(`12`)
               from mytable where `date` = ...) as part1
cross join (select avg(`13`), ..., avg(`24`)
            from mytable where `date` = ...) as part2;
Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.
If you only ever average over a single column, you could add 24 separate indexes (on (date, 01), (date, 02), ...). In total they will require even more space, but each might be a little bit faster as it is smaller individually. But the buffer pool might still favour the full index, depending on factors like usage patterns and memory configuration, so you may have to test it.
Since date is part of your primary key, you could also consider changing the primary key to date, esi. If you find the dates by the primary key, you would not need an additional step to access the table data (as you already access the table), so the behaviour would be similar to the covering index. But this is a significant change to your table and can affect all other queries (that e.g. use esi to locate rows), so it has to be considered carefully.
As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
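A hedged sketch of such a summary table; the names and the single 01 column are illustrative only:
CREATE TABLE mytable_daily_avg (
`date` date NOT NULL PRIMARY KEY,
row_count bigint unsigned NOT NULL,
avg_01 double NOT NULL
);
INSERT INTO mytable_daily_avg
SELECT `date`, COUNT(*), AVG(`01`) FROM mytable GROUP BY `date`;
Reads then become a single primary-key lookup per date instead of a 700k-row index scan.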
For MyISAM tables, COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.
For example:
SELECT COUNT(*) FROM student;
https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html#function_count
If you add AVG() or anything else, you lose this optimization.

MySQL query results returned are semi-random / inconsistently ordered

I'm working with an ndb cluster setup that uses proxysql. There are 4 mysql servers, 4 data nodes, and 2 management nodes. The following happens when I access one of the mysql servers directly, so I think that I can safely rule out proxysql as the root cause, but beyond that I'm just lost.
Here's a table I set up to help illustrate my problem:
mysql> describe delain;
+----------+-------------+------+-----+---------+----------------+
| Field    | Type        | Null | Key | Default | Extra          |
+----------+-------------+------+-----+---------+----------------+
| album_id | tinyint(2)  | NO   | PRI | NULL    | auto_increment |
| album    | varchar(30) | YES  |     | NULL    |                |
+----------+-------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)
It contains the following data; note that I specified an order by clause:
mysql> select * from delain order by album_id;
+----------+-------------------------+
| album_id | album                   |
+----------+-------------------------+
|        1 | Lucidity                |
|        2 | April Rain              |
|        3 | We Are the Others       |
|        4 | The Human Contradiction |
|        5 | Moonbathers             |
+----------+-------------------------+
5 rows in set (0.00 sec)
If I don't specify an order clause, the results returned are seemingly random, such as this:
mysql> select * from delain;
+----------+-------------------------+
| album_id | album                   |
+----------+-------------------------+
|        3 | We Are the Others       |
|        5 | Moonbathers             |
|        1 | Lucidity                |
|        2 | April Rain              |
|        4 | The Human Contradiction |
+----------+-------------------------+
5 rows in set (0.00 sec)
When I repeat the query (sans order clause) I get a different ordering pretty much every time. It doesn't seem to be truly random, but there sure as heck isn't any sort of discernible pattern to me.
Why is this happening? My experience with mysql has always been that the default ordering is essentially according to the primary key, but this is also the first time I've used an ndb cluster in particular; I don't know if there's a difference there, or if there's a setting inside a config file that got missed or what. Any help is greatly appreciated!
This is standard SQL behavior.
https://mariadb.com/kb/en/library/sql-99/order-by-clause/ says in part:
An ORDER BY clause may optionally appear after a query expression: it specifies the order rows should have when returned from that query (if you omit the clause, your DBMS will return the rows in some random order).
(emphasis mine)
It'd be more accurate to say it will return the rows in some arbitrary order, instead of random order. Random implies that the order will change from one execution to the next.
In the case of InnoDB, the order tends to be the index order in which the rows were accessed. The index it reads is not necessarily the primary key. So the order is unchanging and somewhat predictable if you know something about the internals. But it's not random.
In the case of MyISAM, the order tends to be the order the rows are stored in the table, which can vary depending on the order the rows were inserted, and also depending on where there was space in the file at the time of insertion, after row deletions.
In the case of NDB, I don't know as much about its internals, so I can't describe its rule for "default" order, but it's still true that without an explicit ORDER BY, the storage engine is allowed to return rows in whatever order it wants to.
For NDB, the order depends on timing. A plain
SELECT * from table;
is implemented as a parallelised full table scan within the data nodes and their database threads, with one MySQL thread receiving the results. So with a filtered query like
SELECT * from table where filter_column = 2;
the filter gets evaluated in many threads in parallel. Each of those threads returns rows to the MySQL thread in an order that depends on the OS scheduler, networking, and many other things. So there is no default ordering unless you use ORDER BY.
So for NDB the order is truly random, not just arbitrary. You'll see this in the NDB test suites using MTR: queries mostly use SELECT * from table ORDER BY some_field;

MySQL performance improvements for a sorted query in a large table

Table structure:
CREATE TABLE `mytable` (
  `id` varchar(8) NOT NULL,
  `event` varchar(32) NOT NULL,
  `event_date` date NOT NULL,
  `event_time` time NOT NULL,
  KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
The data in this table looks like this:
id       | event      | event_date  | event_time
---------+------------+-------------+-------------
ref1     | someevent1 | 2010-01-01  | 01:23:45
ref1     | someevent2 | 2010-01-01  | 02:34:54
ref1     | someevent3 | 2010-01-18  | 01:23:45
ref2     | someevent4 | 2012-10-05  | 22:23:21
ref2     | someevent5 | 2012-11-21  | 11:22:33
The table contains about 500,000,000 records similar to this.
The query I'd like to ask about here looks like this:
SELECT *
FROM `mytable`
WHERE `id` = 'ref1'
ORDER BY event_date DESC,
event_time DESC
LIMIT 0, 500
The EXPLAIN output looks like:
select_type: SIMPLE
table: E
type: ref
possible_keys: id
key: id
key_len: 27
ref: const
rows: 17024 (a common example)
Extra: Using where; Using filesort
Purpose:
This query is generated by a website; the LIMIT values come from a page-navigation element, so if the user wants to see older entries, they get adjusted to 500, 500, then 1000, 500, and so on.
Since some values of the id field can occur in quite a lot of rows, more matching rows will of course lead to a slower query. Profiling those slow queries showed me the reason is the sorting: most of the time during the query, the MySQL server is busy sorting the data. Indexing the fields event_date and event_time didn't change that very much.
Example SHOW PROFILE Result, sorted by duration:
state          | duration/sec | percentage
---------------|--------------|-----------
Sorting result | 12.00145     | 99.80640
Sending data   | 0.01978      | 0.16449
statistics     | 0.00289      | 0.02403
freeing items  | 0.00028      | 0.00233
...
Total          | 12.02473     | 100.00000
Now the question:
Before delving way deeper into MySQL variables like sort_buffer_size and other server configuration options, can you think of any way to change the query or the sorting behaviour so that sorting isn't such a big performance eater anymore, while the purpose of this query stays in place?
I don't mind a bit of out-of-the-box-thinking.
Thank you in advance!
As I wrote in a comment, a multi-column index (id, event_date desc, event_time desc) may help.
If this table will grow fast, you should consider adding an option in the application for the user to select data for a particular date range.
Example: the first step always returns 500 records, but to see further records the user sets a date range for the data and then pages within it, as sketched below.
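A hedged sketch of that flow; the boundary dates are placeholders:
SELECT *
FROM mytable
WHERE id = 'ref1'
AND event_date BETWEEN '2012-10-01' AND '2012-10-31'
ORDER BY event_date DESC, event_time DESC
LIMIT 0, 500;
The date range caps how many rows ever reach the sort, so the filesort stays small no matter how common the id is.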
Indexing is most likely the solution; you just have to do it right. See the mysql reference page for this.
The most effective way to do it is to create a three-part index on (id, event_date, event_time). You can specify event_date desc, event_time desc in the index, but I don't think it's necessary.
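In DDL form, a minimal sketch (the index name is mine):
ALTER TABLE mytable ADD INDEX idx_id_date_time (id, event_date, event_time);
With id pinned by the equality in the WHERE clause, the remaining index order (event_date, event_time) matches the ORDER BY (scanned backwards for DESC), so the filesort should disappear from EXPLAIN.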
I would start by doing what sufleR suggests - the multi-column index on (id, event_date desc, event_time desc).
However, according to http://dev.mysql.com/doc/refman/5.0/en/create-index.html, the DESC keyword is supported, but doesn't actually do anything. That's a bit of a pain - so try it, and see if it improves the performance, but it probably won't.
If that's the case, you may have to cheat by creating a "sort_column", with an automatically decrementing value (pretty sure you'd have to do this in the application layer, I don't think you can decrement in MySQL), and add that column to the index.
You'd end up with:
id       | event      | event_date  | event_time | sort_value
---------+------------+-------------+------------+------------
ref1     | someevent1 | 2010-01-01  | 01:23:45   | 0
ref1     | someevent2 | 2010-01-01  | 02:34:54   | -1
ref1     | someevent3 | 2010-01-18  | 01:23:45   | -2
ref2     | someevent4 | 2012-10-05  | 22:23:21   | -3
ref2     | someevent5 | 2012-11-21  | 11:22:33   | -4
and an index on (id, sort_value). One way to maintain the decrementing value inside MySQL is sketched below.
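One hedged sketch of keeping the decrementing value inside MySQL uses a one-row counter table plus a trigger; every name below is made up for illustration:
CREATE TABLE sort_seq (next_val BIGINT NOT NULL);
INSERT INTO sort_seq VALUES (0);
DELIMITER //
CREATE TRIGGER mytable_sort_value
BEFORE INSERT ON mytable
FOR EACH ROW
BEGIN
  -- serializes inserts on the counter row; acceptable for moderate write rates
  UPDATE sort_seq SET next_val = next_val - 1;
  SET NEW.sort_value = (SELECT next_val FROM sort_seq);
END//
DELIMITER ;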
Dirty, but the only other suggestion is to reduce the number of records matching the where clause in other ways - for instance, by changing the interface not to return 500 records, but records for a given date.

Why can an index make a query really slow?

A while ago I answered a question on SO (accepted as correct), but the answer left me with a great doubt.
In short, the user had a table with these fields:
id INT PRIMARY KEY
dt DATETIME (with an INDEX)
lt DOUBLE
The query SELECT DATE(dt), AVG(lt) FROM table GROUP BY DATE(dt) was really slow.
We told him that (part of) the problem was using DATE(dt) as the grouping expression, but the db was on a production server and it wasn't possible to split that field.
So (with a trigger) another field da DATE (with an INDEX) was added, filled automatically with DATE(dt). The query SELECT da, AVG(lt) FROM table GROUP BY da was a bit faster, but with about 8 million records it still took about 60s!!!
I tried it on my PC and finally discovered that after removing the index on the field da the query took only 7s, while using DATE(dt) after removing its index took 13s.
I've always thought an index on the column used for grouping could really speed the query up, not the contrary (8 times slower!!!).
Why? Which is the reason?
Thanks a lot.
Because you still need to read all the data, from both the index and the data file. Since you're not using any WHERE condition, you will always get a query plan that accesses all the data, row by row, and there is nothing you can do about that.
If performance is important for this query and it is performed often, I'd suggest caching the results in some temporary table and updating it hourly (daily, etc.).
Why it becomes slower: because the index data is already sorted, and when MySQL calculates the cost of executing the query it thinks it will be better to use the already sorted data, then group it, then calculate the aggregates. But in this case it is not.
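A hedged way to reproduce the comparison without actually dropping anything, assuming the index on da is named idx_da:
SELECT da, AVG(lt)
FROM `table` IGNORE INDEX (idx_da)
GROUP BY da;
If that run matches the 7s timing, the optimizer's cost estimate was the culprit; though note the bug report in the next answer, where IGNORE INDEX itself is not always honored.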
I think this is because of this or a similar MySQL bug: Index degrades sort performance and optimizer does not honor IGNORE INDEX
I remember the question, as I was going to answer it but got distracted with something else. The problem was that his table design wasn't taking advantage of a clustered primary key index.
I would have re-designed the table creating a composite clustered primary key with the date as the leading part of the index. The sm_id field is still just a sequential unsigned int to guarantee uniqueness.
drop table if exists speed_monitor;
create table speed_monitor
(
  created_date date not null,
  sm_id int unsigned not null,
  load_time_secs double(10,4) not null default 0,
  primary key (created_date, sm_id)
)
engine=innodb;
+------+----------+
| year | count(*) |
+------+----------+
| 2009 | 22723200 |  -- 22 million
| 2010 | 31536000 |  -- 31 million
| 2011 |  5740800 |  -- 5 million
+------+----------+
select
created_date,
count(*) as counter,
avg(load_time_secs) as avg_load_time_secs
from
speed_monitor
where
created_date between '2010-01-01' and '2010-12-31'
group by
created_date
order by
created_date
limit 7;
-- cold runtime
+--------------+---------+--------------------+
| created_date | counter | avg_load_time_secs |
+--------------+---------+--------------------+
| 2010-01-01   |   86400 |         1.66546802 |
| 2010-01-02   |   86400 |         1.66662466 |
| 2010-01-03   |   86400 |         1.66081309 |
| 2010-01-04   |   86400 |         1.66582251 |
| 2010-01-05   |   86400 |         1.66522316 |
| 2010-01-06   |   86400 |         1.66859480 |
| 2010-01-07   |   86400 |         1.67320440 |
+--------------+---------+--------------------+
7 rows in set (0.23 sec)