Optimizing MySQL select distinct order by limit safely

I have a problematic query. I know how to write it so that it runs much faster, but technically the faster SQL is invalid and has no guarantee of working correctly in the future.
The original, slow query looks like this:
SELECT sql_no_cache DISTINCT r.field_1 value
FROM table_middle m
JOIN table_right r on r.id = m.id
WHERE ((r.field_1) IS NOT NULL)
AND (m.kind IN ('partial'))
ORDER BY r.field_1
LIMIT 26
This takes about 37 seconds. Explain output:
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | rows | Extra |
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
| 1 | SIMPLE | r | range | PRIMARY,index_field_1 | index_field_1 | 9 | 1544595 | Using where; Using index; Using temporary; Using filesort |
| 1 | SIMPLE | m | eq_ref | PRIMARY,index_kind | PRIMARY | 4 | 1 | Using where; Distinct |
+----+-------------+-------+--------+-----------------------+---------------+---------+---------+-----------------------------------------------------------+
The faster version looks like this; the ORDER BY clause is pushed down into a derived-table subquery, and the outer query applies DISTINCT and LIMIT to its result:
SELECT sql_no_cache DISTINCT value
FROM (
SELECT r.field_1 value
FROM table_middle m
JOIN table_right r ON r.id = m.id
WHERE ((r.field_1) IS NOT NULL)
AND (m.kind IN ('partial'))
ORDER BY r.field_1
) t
LIMIT 26
This takes about 2.7 seconds. Explain output:
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | rows | Extra |
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | 1346348 | Using temporary |
| 2 | DERIVED | m | ref | PRIMARY,index_kind | index_kind | 99 | 1539558 | Using where; Using index; Using temporary; Using filesort |
| 2 | DERIVED | r | eq_ref | PRIMARY,index_field_1 | PRIMARY | 4 | 1 | Using where |
+----+-------------+------------+--------+-----------------------+------------+---------+---------+-----------------------------------------------------------+
There are three million rows in table_right and table_middle, and all of the columns mentioned are individually indexed. The query should be read as having an arbitrary where clause - it's dynamically generated. The query can't be rewritten in any way that prevents the where clause from being easily replaced, and similarly the indexes can't be changed - MySQL doesn't allow enough indexes per table to cover every potential combination of filter fields.
Has anyone seen this problem before - specifically, select / distinct / order by / limit performing very poorly - and is there another way to write the same query with good performance that doesn't rely on unspecified implementation behaviour?
(AFAIK MariaDB, for example, ignores order by in a subquery because it should not logically affect the set-theoretic semantics of the query.)
For the more incredulous
Here's how you can create a version of the database for yourself! This should output a SQL script you can run with the mysql command-line client:
#!/usr/bin/env ruby
# Emit SQL that creates the test database and loads three million rows into each table.
puts "create database testy;"
puts "use testy;"
puts "create table table_right(id int(11) not null primary key, field_0 int(11), field_1 int(11), field_2 int(11));"
puts "create table table_middle(id int(11) not null primary key, field_0 int(11), field_1 int(11), field_2 int(11));"
puts "begin;"
3_000_000.times do |x|
  puts "insert into table_middle values (#{x},#{x*10},#{x*100},#{x*1000});"
  puts "insert into table_right values (#{x},#{x*10},#{x*100},#{x*1000});"
end
puts "commit;"
Indexes aren't important for reproducing the effect. The script above is untested; it's an approximation of a pry session I had when reproducing the problem manually.
Replace the m.kind in ('partial') with m.field_1 > 0 or something similar that's trivially true. Observe the large difference in performance between the two techniques, and how the sorting semantics are preserved (tested using MySQL 5.5). The unreliability of those semantics is, of course, precisely the reason I'm asking the question.

Please provide SHOW CREATE TABLE. In the absence of that, I will guess that these composite indexes are missing and may be useful:
m: (kind, id)
r: (field_1, id)
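Spelled out as DDL, assuming the real tables do have those columns (the index names are illustrative):
ALTER TABLE table_middle ADD INDEX idx_middle_kind_id (kind, id);       -- illustrative name
ALTER TABLE table_right  ADD INDEX idx_right_field_1_id (field_1, id);  -- illustrative name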
You can turn off MariaDB's ignoring of the subquery's ORDER BY.

Related

Speed up MYSQL query that has order by

I have 300,000 rows in 'tblmessages', and I'm trying to run this query.
If I don't use "order by msgId desc" it runs very fast, but when I add the ordering it becomes very, very slow.
What am I missing?
SELECT msgId
FROM tblmessages
left join wp_users as a on a.ID = (CASE WHEN (msgFromUserId=1) then tblmessages.msgToUserId else tblmessages.msgFromUserId END)
left join tblforum_users u1 on u1.user_ID =(CASE WHEN (msgFromUserId=1) then tblmessages.msgToUserId else tblmessages.msgFromUserId END)
where (msgFromUserId=1 or msgToUserId=1)
order by msgId desc limit 0,20
EXPLAIN output:
+----+-------------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+----------+-----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+----------+-----------------------------------------------------+
| 1 | SIMPLE | tblmessages | NULL | index_merge | IX_tblmessages_from,IX_tblmessages_to | IX_tblmessages_from,IX_tblmessages_to | 5,5 | NULL | 726454 | 100.00 | Using union(IX_tblmessages_from,IX_tblmessages_to); Using where; Using filesort |
| 1 | SIMPLE | a | NULL | eq_ref | PRIMARY | PRIMARY | 8 | func | 1 | 100.00 | Using where; Using index |
| 1 | SIMPLE | u1 | NULL | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using where; Using index |
+----+-------------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+----------+-----------------------------------------------------+
and more information about tblmessages:
Data 12.1 GiB
Index 190.8 MiB
Total 12.3 GiB
There are 2 reasons for this:
If I don't use "order by msgId desc" it runs very fast, but when I add the ordering it becomes very, very slow.
The first is that your query has a LIMIT 20 at the end of it. Ordering the results (via ORDER BY) requires every row that satisfies the WHERE clause to be fetched and sorted before the correct 20 rows can be returned.
The second is somewhat subjective, but I'll say your tables don't have the indexing necessary to support optimization of this query. Put another way, a query like this will generally run slowly because of the multiple joins on CASE expressions. My advice is to either simplify the query into multiple queries, or update your question with SHOW CREATE TABLE statements for the relevant tables along with an INFORMATION_SCHEMA query that summarizes indexes, partitions, etc. From there we could potentially update the design (e.g. by adding multi-column indexes).
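As a sketch of the "multiple queries" idea (an illustration only, not a drop-in replacement: it fetches just the candidate msgIds, and the wp_users / tblforum_users joins would then be applied to those 20 rows), splitting the OR over msgFromUserId / msgToUserId into a UNION lets each branch use its own index and stop early:
-- each branch can use its single-column index and stop after 20 rows
(SELECT msgId FROM tblmessages WHERE msgFromUserId = 1 ORDER BY msgId DESC LIMIT 20)
UNION
(SELECT msgId FROM tblmessages WHERE msgToUserId = 1 ORDER BY msgId DESC LIMIT 20)
ORDER BY msgId DESC
LIMIT 20;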

Why does this MySQL query perform poorly (DEPENDENT SUBQUERY)?

explain select id, nome from bea_clientes where id in (
select group_concat(distinct(bea_clientes_id)) as list
from bea_agenda
where bea_clientes_id>0
and bea_agente_id in(300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022)
)
When I run the above (without the EXPLAIN), MySQL simply churns away, using a DEPENDENT SUBQUERY, which makes this slow as hell. The question is why the optimizer evaluates the subquery for each id in bea_clientes. I even put the IN argument inside a GROUP_CONCAT, believing the result would be treated as a plain string and so avoid the repeated scans.
I thought this was no longer a problem for MySQL server 5.5+?
Testing on MariaDB gives the same behaviour.
Is this a known bug? I know I can rewrite this as a join, but it's still terrible.
EXPLAIN output, generated by phpMyAdmin 4.4.14 / MySQL 5.6.26:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
|----|--------------------|--------------|-------|-------------------------------|---------------|---------|------|-------|------------------------------------|
| 1 | PRIMARY | bea_clientes | ALL | NULL | NULL | NULL | NULL | 30432 | Using where |
| 2 | DEPENDENT SUBQUERY | bea_agenda | range | bea_clientes_id,bea_agente_id | bea_agente_id | 5 | NULL | 2352 | Using index condition; Using where |
Obviously this is hard to test without the data, but try something like the query below.
Subqueries are just not handled well by MySQL (though it's my preferred engine).
I'd also recommend indexing the relevant columns, which will improve performance for both queries.
For clarity, I'd also advise writing queries out over multiple lines.
select t2.id, t2.nome
from (
    select group_concat(distinct bea_clientes_id) as list
    from bea_agenda
    where bea_clientes_id > 0
    and bea_agente_id in (300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022)
) as t1
join (
    select id, nome from bea_clientes
) as t2
on find_in_set(t2.id, t1.list) -- match each client id against the comma-separated list
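For comparison, the join rewrite the asker mentions would avoid the list-building step entirely; a minimal sketch, assuming bea_clientes.id is the key being matched:
-- straight join, no GROUP_CONCAT needed
SELECT DISTINCT c.id, c.nome
FROM bea_clientes c
JOIN bea_agenda a ON a.bea_clientes_id = c.id
WHERE a.bea_clientes_id > 0
AND a.bea_agente_id IN (300006,300007,300008,300009,300010,300011,300012,300013,300014,300018,300019,300020,300021,300022);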

mysql slow complex query with order by

The query below is very slow even without the order by, and I can't figure out why. I'm guessing it's the where on date_affidavit_filed, but how can I make it fast with the order by as well? Perhaps a subselect on the job ids that match the where, passed into the rest of the query, but I still need to order by the servername as shown. Any suggestions?
explain select sql_no_cache court_county, job.id as jid, job_status,
DATE_FORMAT(job.datetime_served, '%m/%d/%Y') as dserved ,
CONCAT(server.namefirst, ' ', server.namelast) as servername, client_name,
DATE_FORMAT(job.datetime_received, '%m/%d/%Y') as dtrec ,
DATE_FORMAT(job.datetime_give2server, '%m/%d/%Y') as dtg2s,
DATE_FORMAT(date_kase_filed, '%m/%d/%Y') as dkf,
DATE_FORMAT(job.date_sent_to_court, '%m/%d/%Y') as dtstc ,
TO_DAYS(datetime_served )-TO_DAYS(date_kase_filed) as totaldays from job
left join kase on kase.id=job.kase_id
left join server on job.server_id=server.id
left join client on kase.client_id=client.id
left join LUcourt on LUcourt.id=kase.court_id
where date_affidavit_filed is not null and date_affidavit_filed !='' order by servername;
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
| 1 | SIMPLE | job | ALL | date_affidavit_filed | NULL | NULL | NULL | 365212 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | kase | eq_ref | PRIMARY | PRIMARY | 4 | pserve.job.kase_id | 1 | |
| 1 | SIMPLE | server | eq_ref | PRIMARY | PRIMARY | 4 | pserve.job.server_id | 1 | |
| 1 | SIMPLE | client | eq_ref | PRIMARY | PRIMARY | 4 | pserve.kase.client_id | 1 | |
| 1 | SIMPLE | LUcourt | eq_ref | PRIMARY | PRIMARY | 4 | pserve.kase.court_id | 1 | |
+----+-------------+---------+--------+----------------------+---------+---------+-----------------------+--------+----------------------------------------------+
Check that you have indexes on the following columns: job.kase_id and job.server_id.
Also, you are ordering by a calculated field, which is not optimal. Perhaps order by an indexed field instead.
If you need to preserve that exact sort, you might want to add a column to the DB for that value and populate it with the appropriate values, or set up a trigger on the DB to populate it for you automatically.
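A minimal DDL sketch for the join-column indexes mentioned above (the index names are illustrative):
CREATE INDEX idx_job_kase_id ON job (kase_id);     -- illustrative name
CREATE INDEX idx_job_server_id ON job (server_id); -- illustrative name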
This can speed up the order by:
CREATE INDEX namefull ON server (namefirst,namelast);
if you also write ORDER BY server.namefirst, server.namelast instead of ORDER BY servername, which should produce the same output.
You can also create indexes in each table on the columns you are left joining on; that can improve the performance of your query too.
When you write,
where date_affidavit_filed is not null and date_affidavit_filed !=''
you are practically selecting most of the rows - or at least so many that it is not worthwhile to go through the index. The query planner sees that there is an index involving date_affidavit_filed, but decides not to use it even though the WHERE clause only involves date_affidavit_filed; so we know it's not a matter of which key to pick, it must be a cardinality issue.
| 1 | SIMPLE | job | ALL | date_affidavit_filed | NULL | NULL | NULL | 365212 | Using where; Using temporary; Using filesort |
You can try optimizing this by creating an index on
date_affidavit_filed, kase_id, server_id
in that order. How many rows are returned by the query?
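As DDL, that suggestion would look something like this (the index name is illustrative):
CREATE INDEX idx_job_affidavit_kase_server ON job (date_affidavit_filed, kase_id, server_id);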
You are really selecting everything that isn't empty.
That essentially means everything.
I don't know how many rows of data you have, but that's a lot to go through.
Try narrowing your query to a date range or a specific client.
If you really need everything, don't output it one row at a time; build up one big string, with all the formatting, in the software you use for output, and once you've finished looping through the results and constructed the data you want, output it all in one go.
You could also use paging.
Just add limit 0,30 on page 1, limit 30,30 on page two, and so on, and let the end user walk through the pages.
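For example, with a page size of 30 (a sketch against a deliberately simplified version of the query; any stable ORDER BY will do):
-- page 1
SELECT id FROM job WHERE date_affidavit_filed IS NOT NULL ORDER BY id LIMIT 0, 30;
-- page 2
SELECT id FROM job WHERE date_affidavit_filed IS NOT NULL ORDER BY id LIMIT 30, 30;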

SQL Query Optimization

I am trying to speed up this Django app (note: I didn't design this... I'm just stuck maintaining it) and the biggest bottleneck seems to be these queries generated by the admin. We have a content class that 4-5 other sub-classes inherit from, and any time the master list is pulled up in the admin, a query like this is generated:
SELECT `content_content`.`id`,
`content_content`.`issue_id`,
`content_content`.`slug`,
`content_content`.`section_id`,
`content_content`.`priority`,
`content_content`.`group_id`,
`content_content`.`rotatable`,
`content_content`.`pub_status`,
`content_content`.`created_on`,
`content_content`.`modified_on`,
`content_content`.`old_pk`,
`content_content`.`content_type_id`,
`content_image`.`content_ptr_id`,
`content_image`.`caption`,
`content_image`.`kicker`,
`content_image`.`pic`,
`content_image`.`crop_x`,
`content_image`.`crop_y`,
`content_image`.`crop_side`,
`content_issue`.`id`,
`content_issue`.`special_issue_name`,
`content_issue`.`web_publish_date`,
`content_issue`.`issue_date`,
`content_issue`.`fm_name`,
`content_issue`.`arts_name`,
`content_issue`.`comments`,
`content_section`.`id`,
`content_section`.`name`,
`content_section`.`audiodizer_id`
FROM `content_image`
INNER JOIN `content_content`
ON `content_image`.`content_ptr_id` = `content_content`.`id`
INNER JOIN `content_issue`
ON `content_content`.`issue_id` = `content_issue`.`id`
INNER JOIN `content_section`
ON `content_content`.`section_id` = `content_section`.`id`
WHERE NOT ( `content_content`.`pub_status` = -1 )
ORDER BY `content_issue`.`issue_date` DESC LIMIT 30
I ran an EXPLAIN on this and got the following:
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
| 1 | SIMPLE | content_image | ALL | PRIMARY | NULL | NULL | NULL | 40499 | Using temporary; Using filesort |
| 1 | SIMPLE | content_content | eq_ref | PRIMARY,issue_id,content_content_issue_id,content_content_section_id,content_content_pub_status | PRIMARY | 4 | content_image.content_ptr_id | 1 | Using where |
| 1 | SIMPLE | content_section | eq_ref | PRIMARY | PRIMARY | 4 | content_content.section_id | 1 | |
| 1 | SIMPLE | content_issue | eq_ref | PRIMARY | PRIMARY | 4 | content_content.issue_id | 1 | |
+----+-------------+-----------------+--------+-------------------------------------------------------------------------------------------------+---------+---------+--------------------------------------+-------+---------------------------------+
Now, from what I've read, I need to somehow figure out how to make the query to content_image not be terrible; however, I'm drawing a blank on where to start.
Currently, judging by the execution plan, MySQL is starting with content_image, retrieving all rows, and only thereafter using primary keys on the other tables: content_image has a foreign key to content_content, and content_content has foreign keys to content_issue and content_section. Also, only after all the joins are complete can it make much use of the ORDER BY content_issue.issue_date DESC LIMIT 30, since it can't tell which of these joins might fail, and therefore, how many records from content_issue will really be needed before it can get the first thirty rows of output.
So, I would try the following:
Change JOIN content_issue to JOIN (SELECT * FROM content_issue ORDER BY issue_date DESC LIMIT 30) content_issue. This will allow MySQL, if it starts with content_issue and works its way to the other tables, to grab a very small subset of content_issue.
Note: properly speaking, this changes the semantics of the query: it means that only records from at most the last 30 content_issues will be retrieved, and therefore that if some of those issues don't have published contents with images, then fewer than 30 records will be retrieved. I don't have enough information about your data to gauge whether this change of semantics would actually change the results you get.
Also note: I'm not suggesting to remove the ORDER BY content_issue.issue_date DESC LIMIT 30 from the end of the query. I think you want it in both places.
Add an index on content_issue.issue_date, to optimize the above subquery.
Add an index on content_image.content_ptr_id, so MySQL can work its way from content_content to content_image without doing a full table scan.
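Putting those suggestions together, here is a hedged sketch (same table and column names as the query above; the index names are illustrative and the select list is abbreviated):
-- indexes suggested above (names are illustrative)
CREATE INDEX idx_content_issue_issue_date ON content_issue (issue_date);
CREATE INDEX idx_content_image_content_ptr ON content_image (content_ptr_id);

-- the join against a pre-limited derived content_issue
SELECT content_content.id, content_issue.issue_date  -- in practice, the full column list from the original query
FROM content_image
INNER JOIN content_content
ON content_image.content_ptr_id = content_content.id
INNER JOIN (SELECT * FROM content_issue ORDER BY issue_date DESC LIMIT 30) content_issue
ON content_content.issue_id = content_issue.id
INNER JOIN content_section
ON content_content.section_id = content_section.id
WHERE NOT (content_content.pub_status = -1)
ORDER BY content_issue.issue_date DESC
LIMIT 30;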

Why does the scan type change from ALL to range when using LIMIT on SQL queries + optimize the query

I have this query
SELECT l.licitatii_id,
l.nume,
l.data_publicarii,
l.data_limita
FROM licitatii_ue l
INNER JOIN domenii_licitatii dl
ON l.licitatii_id = dl.licitatii_id
AND dl.tip_licitatie = '2'
INNER JOIN domenii d
ON dl.domenii_id = d.domenii_id
AND d.status = 1
AND d.tip_domeniu = '1'
WHERE l.status = 1
AND Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
< '1300683793'
GROUP BY l.licitatii_id
ORDER BY data_publicarii DESC
EXPLAIN output:
+-----+--------------+--------+---------+-------------------------------------+----------+----------+---------------------------+-------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | d | ALL | PRIMARY,key_status_tip_domeniu | NULL | NULL | NULL | 120 | 85.83 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dl | ref | PRIMARY,tip_licitatie,licitatii_id | PRIMARY | 4 | web61db1.d.domenii_id | 6180 | 100.00 | Using where; Using index |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | web61db1.dl.licitatii_id | 1 | 100.00 | Using where |
+-----+--------------+--------+---------+-------------------------------------+----------+----------+---------------------------+-------+-----------+----------------------------------------------+
As you can see, type=ALL for the d table.
Now, if I add LIMIT 100 to the query, the plan changes to range:
+-----+--------------+--------+---------+-------------------------------------+-------------------------+----------+---------------------------+-------+-----------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
| 1 | SIMPLE | d | range | PRIMARY,key_status_tip_domeniu | key_status_tip_domeniu | 9 | NULL | 103 | 100.00 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | dl | ref | PRIMARY,tip_licitatie,licitatii_id | PRIMARY | 4 | web61db1.d.domenii_id | 6180 | 100.00 | Using where; Using index |
| 1 | SIMPLE | l | eq_ref | PRIMARY | PRIMARY | 4 | web61db1.dl.licitatii_id | 1 | 100.00 | Using where |
+-----+--------------+--------+---------+-------------------------------------+-------------------------+----------+---------------------------+-------+-----------+----------------------------------------------+
Why does this happen?
Can this query be optimized further? Both queries take 13 seconds.
The table schema is visible in a gist on GitHub.
MySQL chooses domenii as the leading table for the join.
This table is filtered on (status, tip_domeniu) = (1, 1).
It does not seem to be a very selective condition, so normally a full table scan with filtering would be preferred over the index scan.
We can see that MySQL expects 120 records to be returned from domenii for which this condition would hold.
When you add a LIMIT, the number of records expected to be processed is decreased, and MySQL considers the index scan more efficient for this.
Note that this condition:
Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita) AS DATE))) < '1300683793'
is not sargable, so you deprive the optimizer of the ability to use an index on data_limita.
Create the following indexes:
licitatii_ue (status, data_limita)
licitatii_ue (status, data_publicarii)
and rewrite the query like this:
SELECT l.licitatii_id,
l.nume,
l.data_publicarii,
l.data_limita
FROM licitatii_ue l
JOIN domenii_licitatii dl
ON l.licitatii_id = dl.licitatii_id
AND dl.tip_licitatie = '2'
JOIN domenii d
ON dl.domenii_id = d.domenii_id
AND d.status = 1
AND d.tip_domeniu = '1'
WHERE l.status = 1
AND l.data_limita < FROM_UNIXTIME(((1300683793 - 86400) div 86400) * 86400)
GROUP BY
l.licitatii_id
ORDER BY
data_publicarii DESC
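The two suggested indexes, spelled out as DDL (the index names are illustrative):
CREATE INDEX idx_licitatii_status_limita ON licitatii_ue (status, data_limita);
CREATE INDEX idx_licitatii_status_publicarii ON licitatii_ue (status, data_publicarii);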
Ah, the mysteries of the query optimizer are many and unknowable...
At a quick glance, the most obvious thing to optimize might be the
AND Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
clause.
Depending on the number of records in the licitatii_ue table, this looks like an expensive operation, and it will bypass any available indexes.
ALL is a table scan; range is a range scan (chosen because of the LIMIT). There is nothing bad about that - in fact it also causes a key to be used (key_status_tip_domeniu).
The reason you are slow is, most likely, the ORDER BY data_publicarii DESC (this is easy to test: just drop the ORDER BY and benchmark the query; I would expect a difference of a few orders of magnitude).
MySQL admits (in the Extra column of EXPLAIN) that it is using filesort (needed for the ORDER BY because it can't, or doesn't know how to, use an index for it). Adding yet another index to the mix might help, especially if you confirm that the ORDER BY is what makes it slow.
EDIT
Actually, you do have a cardinal sin in your query:
Unix_timestamp(TIMESTAMPADD(DAY, 1, CAST(From_unixtime(l.data_limita)
AS DATE)))
< '1300683793'
Avoid applying any functions to your field values if you can apply them to a constant. So switch it around and rewrite it as
l.data_limita < some_function('1300683793')
However complex some_function may be, it will be calculated only once, and the MySQL planner will know it is a constant. The way you wrote it forces MySQL to apply unix_timestamp, timestampadd, cast and from_unixtime to the value of data_limita in every row. On I/O-bound systems this will usually just burn some extra CPU cycles while waiting for the disks to spin around (although it can become significant: your system might become CPU-bound, and that is simply a bad thing). The biggest difference is that you lose the possibility of using an index on data_limita.
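To make the shape concrete: the exact expression for the cutoff has to reproduce the original day arithmetic, so treat the lines below purely as an illustration of moving every function call onto the constant side.
-- computed once; the column stays bare, so an index on data_limita becomes usable
SET @cutoff := UNIX_TIMESTAMP(DATE(FROM_UNIXTIME(1300683793)));  -- illustrative day arithmetic only
-- ...and then in the query:
-- WHERE l.status = 1 AND l.data_limita < @cutoff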
Finally, all your indexes are single-column indexes, and MySQL does some index merging, but it is not stellar at it. You might want to try creating indexes that cover all of your conditions and the sort order (with columns ordered by their selectivity for the target query).