A simple query like the one below, properly indexed on a table populated with roughly 2M rows is taking 95 rows in set (2.06 sec) a lot longer to complete than I was hoping for.
As this is my first experience with tables this size, am I looking into normal behavior?
Query:
SELECT t.id, t.symbol, t.feed, t.time,
FLOOR(UNIX_TIMESTAMP(t.time)/(60*15)) as diff
FROM data as t
WHERE t.symbol = 'XYZ'
AND DATE(t.time) = '2011-06-02'
AND t.feed = '1M'
GROUP BY diff
ORDER BY t.time ASC;
...and Explain:
+----+-------------+-------+------+--------------------+--------+---------+-------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+--------------------+--------+---------+-------+--------+----------------------------------------------+
| 1 | SIMPLE | t | ref | unique,feed,symbol | symbol | 1 | const | 346392 | Using where; Using temporary; Using filesort |
+----+-------------+-------+------+--------------------+--------+---------+-------+--------+----------------------------------------------+
Try this:
...
AND t.time >= '2011-06-02' AND t.time < '2011-06-03'
...
Otherwise, your index(es) are wrong for this query. I'd expect one on (symbol, feed, time, id) or (feed, symbol, time, id) to cover it.
Edit, after comment:
If you put a function or processing on a column, any index is liable to be ignored. The index is on x not f(x) basically.
This change allows the index to be used because we now do a <= x < y to ignore the time part, not takeofftime(x)
Related
I have a query running on MySQL (v5.5 -- I know it's old but it's what I have to work with for now). The table A below has ~16 million rows and B has ~700,000. The query looks something like this:
SELECT A.id, A.x, A.y, A.z, B.foo FROM A STRAIGHT_JOIN B ON A.id = B.id
where A.x = 53 ORDER BY A.y desc LIMIT 0, 30;
There's an index setup on A.id as well as on B.id.
There's also an index setup on (A.x, A.y) (this key/index is called DocsByType).
This query has worked great so far, it's performance has always been sub-second or thereabouts. Recently though, I have a need to occasionally check against an additional possible value for A.x in the where clause. The following query is now performing very poorly, on average taking ~15 secs to complete:
SELECT A.id, A.x, A.y, A.z, B.foo FROM A STRAIGHT_JOIN B ON A.id = B.id
where (A.x = 18 or A.x = 53) ORDER BY A.y desc LIMIT 0, 30;
The explain for the fast query with only one comparison looks like this:
+----+-------------+-------+------+-----------------------------------------------------+------------+---------+----------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+-----------------------------------------------------+------------+---------+----------------+---------+-------------+
| 1 | SIMPLE | A | ref | Documents1,Documents2,Documents3,DocsByType,KEY_AID | DocsByType | 4 | const | 1870603 | Using where |
| 1 | SIMPLE | B | ref | KEY_BID | KEY_BID | 4 | mydb.B.id | 1 | |
+----+-------------+-------+------+-----------------------------------------------------+------------+---------+----------------+---------+-------------+
The explain for the multi-comparison query looks like this:
+----+-------------+-------+-------+-----------------------------------------------------+------------+---------+----------------+---------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+-----------------------------------------------------+------------+---------+----------------+---------+-----------------------------+
| 1 | SIMPLE | A | range | Documents1,Documents2,Documents3,DocsByType,KEY_AID | DocsByType | 4 | NULL | 1878693 | Using where; Using filesort |
| 1 | SIMPLE | B | ref | KEY_BID | KEY_BID | 4 | mydb.B.id | 1 | |
+----+-------------+-------+-------+-----------------------------------------------------+------------+---------+----------------+---------+-----------------------------+
I can see that there's a filesort operation that's not in the first query. Also the type is "range" instead of "ref", and the ref is "NULL" instead of "const". Removing the order by clause fixes it completely, so that it completes in less than a second, but it's important that the results are sorted.
Query optimization is not my strong suit. I would have thought that it would have worked exactly the same given that the column is already indexed. Can anyone explain why this behaves the way it does and suggest a way to optimize the query? Please also note that the new query might need to use 3, 4 or even 5 possible values for the where clause (but always against the same column).
I've also tried running the queries using MySQL 5.8 but the result is the same. My table is using the MyISAM engine.
Suppose you have a big list of people's names. And the goal is to find the first 30 Smiths (ordered by first name). The first query is fast because it is essentially doing the WHERE, ORDER BY and LIMIT all at once:
The second is messier because it is effectively done thus:
Find the first names of all the 'Smiths',
Find the first names of all the 'Joneses'
Sort the first names and show the first 30
There are two things to speed up your slow query:
( SELECT A.id, A.x, A.y, A.z, B.foo FROM A JOIN B ON A.id = B.id
where (A.x = 18)
ORDER BY A.y desc LIMIT 30 )
UNION ALL
( SELECT A.id, A.x, A.y, A.z, B.foo FROM A JOIN B ON A.id = B.id
where (A.x = 53) -- Note
ORDER BY A.y desc LIMIT 30 )
ORDER BY A.y desc LIMIT 0, 30 -- Yes, repeated
Comments:
STRAIGHT_JOIN is unnecessary, JOIN will happen to do the same thing
Each subquery will use INDEX(x,y) and make use of LIMIT.
ALL is faster than the default, and is appropriate in this case
If you need to "paginate", the limits need to be handled as described here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or
Any number of UNIONs can be tacked together. However, at some point, the cost of all the unions will outweigh the benefit. (It is not practical to try to predict where the cutoff is.)
It would be faster to do the LIMIT 30 before JOINing to B. That way, you would do only 30 lookups in B; my way needs 60; your original query needed lots more.
I have a problematic query that I know how to write faster, but technically the SQL is invalid and it has no guarantee of working correctly into the future.
The original, slow query looks like this:
SELECT sql_no_cache DISTINCT r.field_1 value
FROM table_middle m
JOIN table_right r on r.id = m.id
WHERE ((r.field_1) IS NOT NULL)
AND (m.kind IN ('partial'))
ORDER BY r.field_1
LIMIT 26
This takes about 37 seconds. Explain output:
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | rows | Extra |
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | r | range | PRIMARY,index_field_1 | index_field_1 | 9 | 1544595 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | m | eq_ref | PRIMARY,index_kind | PRIMARY | 4 | 1 | Using where; Distinct |
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
The faster version looks like this; the order by clause is pushed down into a subquery, which is joined on and is in turn limited with distinct:
SELECT sql_no_cache DISTINCT value
FROM (
SELECT r.field_1 value
FROM table_middle m
JOIN table_right r ON r.id = m.id
WHERE ((r.field_1) IS NOT NULL)
AND (m.kind IN ('partial'))
ORDER BY r.field_1
) t
LIMIT 26
This takes about 2.7 seconds. Explain output:
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | rows | Extra |
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | 1346348 | Using temporary |
| 2 | DERIVED | m | ref | PRIMARY,index_kind | index_kind | 99 | 1539558 | Using where; Using index; Using temporary; Using filesort |
| 2 | DERIVED | r | eq_ref | PRIMARY,index_field_1 | PRIMARY | 4 | 1 | Using where |
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
There are three million rows in table_right and table_middle, and all mentioned columns are individually indexed. The query should be read as having an arbitrary where clause - it's dynamically generated. The query can't be rewritten in any way that prevents the where clause being easily replaced, and similarly the indexes can't be changed - MySQL doesn't support enough indexes for the number of potential filter field combinations.
Has anyone seen this problem before - specifically, select / distinct / order by / limit performing very poorly - and is there another way to write the same query with good performance that doesn't rely on unspecified implementation behaviour?
(AFAIK MariaDB, for example, ignores order by in a subquery because it should not logically affect the set-theoretic semantics of the query.)
For the more incredulous
Here's how you can create a version of database for yourself! This should output a SQL script you can run with mysql command-line client:
#!/usr/bin/env ruby
puts "create database testy;"
puts "use testy;"
puts "create table table_right(id int(11) not null primary key, field_0 int(11), field_1 int(11), field_2 int(11));"
puts "create table table_middle(id int(11) not null primary key, field_0 int(11), field_1 int(11), field_2 int(11));"
puts "begin;"
3_000_000.times do |x|
puts "insert into table_middle values (#{x},#{x*10},#{x*100},#{x*1000});"
puts "insert into table_right values (#{x},#{x*10},#{x*100},#{x*1000});"
end
puts "commit;"
Indexes aren't important for reproducing the effect. The script above is untested; it's an approximation of a pry session I had when reproducing the problem manually.
Replace the m.kind in ('partial') with m.field_1 > 0 or something similar that's trivially true. Observe the large difference in performance between the two different techniques, and how the sorting semantics are preserved (tested using MySQL 5.5). The unreliability of the semantics are, of course, precisely the reason I'm asking the question.
Please provide SHOW CREATE TABLE. In the absence of that, I will guess that these are missing and may be useful:
m: (kind, id)
r: (field_1, id)
You can turn off MariaDB's ignoring of the subquery's ORDER BY.
SELECT archive.id, archive.file, archive.create_timestamp, archive.spent
FROM archive LEFT JOIN submissions ON archive.id = submissions.id
WHERE submissions.id is NULL
AND archive.file is not NULL
AND archive.create_timestamp < DATE_SUB(NOW(), INTERVAL 6 month)
AND spent = 0
ORDER BY archive.create_timestamp ASC LIMIT 10000
EXPLAIN result:
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
| 1 | SIMPLE | archive | range | create_timestamp,file_in | create_timestamp | 4 | NULL | 111288502 | Using where |
| 1 | SIMPLE | submissions | eq_ref | PRIMARY | PRIMARY | 4 | production.archive.id | 1 | Using where; Using index; Not exists |
+----+-------------+--------------------+--------+--------------------------------+------------------+---------+--------------------------------------------+-----------+--------------------------------------+
I've tried hinting use of indexes for archive table with:
USE INDEX (create_timestamp,file_in)
Archive table is huge, ~150mil records.
Any help with speeding up this query would be greatly appreciated.
You want to use a composite index. For this query:
create index archive_file_spent_createts on archive(file, spent, createtimestamp);
In such an index, you want the where conditions with = to come first, followed by up to one column with an inequality. In this case, I'm not sure if MySQL will use the index for the order by.
This assumes that the where conditions do, indeed, significantly reduce the size of the data.
I recently started moving my application from one host to another. From my home computer, to a virtual machine in the cloud. When testing the performance on the new node I noticed severe degradation. Comparing the results of the same query, with the same data, with the same version of mysql.
On my home computer:
mysql> SELECT id FROM events WHERE id in (SELECT distinct event AS id FROM results WHERE status='Inactive') AND (DATEDIFF(NOW(), startdate) < 30) AND (DATEDIFF(NOW(), startdate) > -1) AND status <> 10 AND (form = 'IndSingleDay' OR form = 'IndMultiDay');
+------+
| id |
+------+
| 8238 |
| 8369 |
+------+
2 rows in set (0,57 sec)
and on the new machine:
mysql> SELECT id FROM events WHERE id in (SELECT distinct event AS id FROM results WHERE status='Inactive') AND (DATEDIFF(NOW(), startdate) < 30) AND (DATEDIFF(NOW(), startdate) > -1) AND status <> 10 AND (form = 'IndSingleDay' OR form = 'IndMultiDay');
+------+
| id |
+------+
| 8369 |
+------+
1 row in set (26.70 sec)
Which means 46 times slower. That is not okay. I tried to get an explanation to why it was so slow. For my home computer:
mysql> explain SELECT id FROM events WHERE id in (SELECT distinct event AS id FROM results WHERE status='Inactive') AND (DATEDIFF(NOW(), startdate) < 30) AND (DATEDIFF(NOW(), startdate) > -1) AND status <> 10 AND (form = 'IndSingleDay' OR form = 'IndMultiDay');
+----+--------------+-------------+--------+---------------+------------+---------+-------------------+---------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+-------------+--------+---------------+------------+---------+-------------------+---------+-------------+
| 1 | SIMPLE | events | ALL | PRIMARY | NULL | NULL | NULL | 5370 | Using where |
| 1 | SIMPLE | <subquery2> | eq_ref | <auto_key> | <auto_key> | 5 | eventor.events.id | 1 | NULL |
| 2 | MATERIALIZED | results | ALL | idx_event | NULL | NULL | NULL | 1319428 | Using where |
+----+--------------+-------------+--------+---------------+------------+---------+-------------------+---------+-------------+
3 rows in set (0,00 sec)
And for my virtual node:
mysql> explain SELECT id FROM events WHERE id in (SELECT distinct event AS id FROM results WHERE status='Inactive') AND (DATEDIFF(NOW(), startdate) < 30) AND (DATEDIFF(NOW(), startdate) > -1) AND status <> 10 AND (form = 'IndSingleDay' OR form = 'IndMultiDay');
+----+--------------------+---------+----------------+---------------+-----------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+---------+----------------+---------------+-----------+---------+------+------+-------------+
| 1 | PRIMARY | events | ALL | NULL | NULL | NULL | NULL | 7297 | Using where |
| 2 | DEPENDENT SUBQUERY | results | index_subquery | idx_event | idx_event | 5 | func | 199 | Using where |
+----+--------------------+---------+----------------+---------------+-----------+---------+------+------+-------------+
2 rows in set (0.00 sec)
As you can see the results differ. I have not been able to figure out what the difference is. From all other point of views, the two system setups look similar.
In this case, the most likely problem is the processing of the subquery. This changed between some recent versions of MySQL (older versions do a poor job of optimizing the subqueries, the newest version does a better job).
One simple solution is to replace the in with exists and a correlated subquery:
SELECT id
FROM events
WHERE exists (SELECT 1
FROM results
WHERE status='Inactive' and results.event = events.id
) AND
(DATEDIFF(NOW(), startdate) < 30) AND (DATEDIFF(NOW(), startdate) > -1) AND status <> 10 AND (form = 'IndSingleDay' OR form = 'IndMultiDay');
This should work well in both versions, especially if you have an index on results(status, event).
The difference between 5.5 and 5.6 because of the new optimizations for handling subqueries explains (as discussed in comments) the difference in performance, but this conclusion also masks the fact that the original query is not written optimally to begin with. There does not seem to be a need for a subquery here at all.
The "events" table needs an index on (status,form,startdate) and the "results" table needs an index on (status) and another index on (event).
SELECT DISTINCT e.id
FROM events e
JOIN results r ON r.event = e.id AND r.status = 'Inactive'
WHERE (e.form = 'IndSingleDay' OR e.form = 'IndMultiDay')
AND e.status != 10
AND start_date > DATE_SUB(DATE(NOW()), INTERVAL 30 DAY)
AND start_date < DATE_SUB(DATE(NOW()), INTERVAL 2 DAY);
You might have to tweak the values "30" and "2" to get precisely the same logic, but the important principle here is that you never want to use a column as an argument to a function in the WHERE clause if it can be avoided by rewriting the expression another way, because the optimizer can't look "backwards" through the function to discover the actual range of raw values that you are wanting it to find. Instead, it has to evaluate the function against all of the possible data that it can't otherwise eliminate.
Using functions to derive constant values for comparison to the column, as shown above, allows the optimizer to realize that it's actually looking for a range of start_date values, and narrow down the possible rows accordingly, assuming an index exists on the values in question.
If I've decoded your query correctly, this version should be faster than any subquery if the indexes are in place.
I have this query
SELECT l.licitatii_id,
l.nume,
l.data_publicarii,
l.data_limita
FROM licitatii_ue l
INNER JOIN domenii_licitatii dl
ON l.licitatii_id = dl.licitatii_id
AND dl.tip_licitatie = '2'
INNER JOIN domenii d
ON dl.domenii_id = d.domenii_id
AND d.status = 1
AND d.tip_domeniu = '1'
WHERE l.status = 1
AND Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
< '1300683793'
GROUP BY l.licitatii_id
ORDER BY data_publicarii DESC
Explain outputs:
+-----+--------------+--------+---------+-------------------------------------+----------+----------+---------------------------+-------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | d | ALL | PRIMARY,key_status_tip_domeniu | NULL | NULL | NULL | 120 | 85.83 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dl | ref | PRIMARY,tip_licitatie,licitatii_id | PRIMARY | 4 | web61db1.d.domenii_id | 6180 | 100.00 | Using where; Using index |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | web61db1.dl.licitatii_id | 1 | 100.00 | Using where |
+-----+--------------+--------+---------+-------------------------------------+----------+----------+---------------------------+-------+-----------+----------------------------------------------+
As you see type=ALL for d table
now if I add LIMIT 100 to the query
plan changes to range:
+-----+--------------+--------+---------+-------------------------------------+-------------------------+----------+---------------------------+-------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | d | range | PRIMARY,key_status_tip_domeniu | key_status_tip_domeniu | 9 | NULL | 103 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dl | ref | PRIMARY,tip_licitatie,licitatii_id | PRIMARY | 4 | web61db1.d.domenii_id | 6180 | 100.00 | Using where; Using index |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | web61db1.dl.licitatii_id | 1 | 100.00 | Using where |
+-----+--------------+--------+---------+-------------------------------------+-------------------------+----------+---------------------------+-------+-----------+----------------------------------------------+
Why does this happen?
Can this query be optimized more, both queries take 13 seconds.
Table schema is visible on gist github
MySQL chooses domenii as the leading table for the join.
This table is filtered on (status, tip_domeniu) = (1, 1).
It does not seem to be a very selective condition, so normally a full table scan with filtering would be preferred over the index scan.
We can see that MySQL expects 120 records to be returned from domanii for which this condition would hold.
When you add a LIMIT, the number of records expected to be processed is decreased, and MySQL considers the index scan more efficient for this.
Note that this condition:
Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita) AS DATE))) < '1300683793'
is not sargable, so you deprive the optimizer to use an index on data_limita.
Create the following indexes:
licitatii_ue (status, data_limita)
licitatii_ue (status, data_publicarii)
and rewrite the query like this:
SELECT l.licitatii_id,
l.nume,
l.data_publicarii,
l.data_limita
FROM licitatii_ue l
JOIN domenii_licitatii dl
ON l.licitatii_id = dl.licitatii_id
AND dl.tip_licitatie = '2'
JOIN domenii d
ON dl.domenii_id = d.domenii_id
AND d.status = 1
AND d.tip_domeniu = '1'
WHERE l.status = 1
AND l.data_limita < FROM_UNIXTIME(((1300683793 - 86400) div 86400) * 86400)
GROUP BY
l.licitatii_id
ORDER BY
data_publicarii DESC
Ah, the mysteries of the query optimizer are many and unknowable...
At a quick glance, the most obvious thing to optimize might be the
AND Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
clause.
depending on the number of records in the licitatii_ue table, this looks like an expensive operation, and it will bypass any indices available.
ALL is table scan, range is range scan (due to LIMIT). Nothing bad with that, actually it also causes a key to be used (key_status_tip_domeniu).
The reason you are slow is, most likely, that you are using ORDER BY data_publicarii DESC (this is easy to test, just drop the ORDER BY and benchmark the query; would expect few orders of magnitude).
Mysql admits (under Extra column of explain) that it is using filesort (needed for order by because it can't or does not know how to use an index). Adding yet another index to the mix might help, especially if you confirm that ORDER BY is making it slow.
EDIT
Actually, you do have a cardinal sin in your query:
Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
< '1300683793'
Avoid applying any functions to your field values if you can apply them to a constant. So switch it around and rewrite it as
l.data_limita < some_function('1300683793')
However complext the some_function would be, it will be calculated only once. Mysql planner will know it is a constant. The way you wrote it would force mysql to apply unix_timestamp, timestampadd, cast and from_unixtime to value of data_limita from each row. Now in I/O bound systems this will usually just burn some extra CPU cycles while waiting for the disks to spin around (however, it might get significant, your system might get CPU bound and it is just a bad thing). Biggest difference is that you loose possibility to use an index on data_limita.
Finally, all your indexes are singe field indexes and mysql does some index merging, but is not stellar in it. You might want to try creating indexes that cover all your conditions and sorting order (in order of selectivity for target query).