Problem
I have two queries where one needs the result of the other one. My first guess was to use an independent subquery:
SELECT P2.*
FROM ExampleTable P2
WHERE P2.delivery_start >= (
SELECT MIN(P1.delivery_start)
FROM ExampleTable P1
WHERE 1641288602 < P1.delivery_end
);
The entire query takes 5-6 seconds which is way to long for my application. Running these queries after another takes only around 800ms for both:
SELECT MIN(P1.delivery_start)
FROM ExampleTable P1
WHERE 1641288602 < P1.delivery_end;
SELECT P2.*
FROM ExampleTable P2
WHERE P2.delivery_start >= 1641286800;
I am using Mariadb 10.2 and have indices on both delivery_start and delivery_end.
What I have tried
I have used a CTE instead of a subquery which resulted in the same performance. Using a Variable with SET yields similar results as to running both queries separately, so thats what I will use for the time being.
I ran EXPLAIN on all 3 Queries:
1. Query with subquery
id
select_type
table
type
possible_keys
key
key_len
ref
rows
Extra
1
PRIMARY
P2
ALL
delivery_start
NULL
NULL
NULL
6388282
Using where
2
SUBQUERY
P1
range
delivery_end
delivery_end
4
NULL
36378
Using index condition
2. Separate Queries
id
select_type
table
type
possible_keys
key
key_len
ref
rows
Extra
1
SIMPLE
P1
range
delivery_end
delivery_end
4
NULL
36432
Using index condition
id
select_type
table
type
possible_keys
key
key_len
ref
rows
Extra
1
SIMPLE
P2
range
delivery_start
delivery_start
4
NULL
35944
Using index condition
Question
I think the issue is shown in the first EXPLAIN table as it has type ALL which means that the database performs a full table scan. My question is simply: why? Is the optimizer not able to figure out that the subquery produces a number with which we only need a range type query? And why does it not use any index?
The problem is described in the MariaDB docs:
In all remaining cases when NULL cannot be substituted with FALSE, it
is not possible to use index lookups. This is not a limitation in the
server, but a consequence of the NULL semantics in the ANSI SQL
standard.
There is a full examination here:
https://mariadb.com/kb/en/non-semi-join-subquery-optimizations/
The result of your subquery can potentially return a NULL in the case no rows were found. Hence, MariaDB cannot use the index for the parent query.
You must rewrite your subquery in a way that it will always return a row with a non-NULL scalar or stick with two separate queries. However, what should happen if your first query returns NULL? With a compound statement you could put an if around the second query and don't even execute it if the first returns NULL.
Replace these
INDEX(delivery_start)
INDEX(delivery_end)
with these:
INDEX(delivery_start, delivery_end)
INDEX(delivery_end, delivery_start)
The second one will help significantly with the subquery. Then the first may help with the outer query.
(If those don't help, please add SHOW CREATE TABLE, EXPLAIN SELECT ... and table sizes.)
Related
I have a pretty long insert query that inserts data from a select query in a table. The problem is that the select query takes too long to execute. The table is MyISAM and the select locks the table which affects other users who also use the table. I have found that problem of the query is a join.
When I remove this part of the query, it takes less then a second to execute but when I leave this part the query takes more than 15 minutes:
LEFT JOIN enq_217 Pex_217
ON e.survey_panelId = Pex_217.survey_panelId
AND e.survey_respondentId = Pex_217.survey_respondentId
AND Pex_217.survey_respondentId != 0
db.table_1 contains 5,90,145 rows and e contains 4,703 rows.
Explain Output:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY e ALL survey_endTime,survey_type NULL NULL NULL 4703 Using where
1 PRIMARY Pex_217 ref survey_respondentId,idx_table_1 idx_table_1 8 e.survey_panelId,e.survey_respondentId 2 Using index
2 DEPENDENT SUBQUERY enq_11525_timing eq_ref code code 80 e.code 1
How can I edit this part of the query to be faster?
I suggest creating an index on the table db.table_1 for the fields panelId and respondentId
You want an index on the table. The best index for this logic is:
create index idx_table_1 on table_1(panelId, respondentId)
The order of these two columns in the index should not matter.
You might want to include other columns in the index, depending on what the rest of the query is doing.
Note: a single index with both columns is different from two indexes with each column.
Why is it a LEFT join?
How many rows in Pex_217?
Run ANALYZE TABLE on each table used. (This sometimes helps MyISAM; rarely is needed for InnoDB.)
Since the 'real problem' seems to be that the query "holds up other users", switch to InnoDB.
Tips on conversion
The JOIN is not that bad (with the new index -- note Using index): 4703 rows scanned, then reach into the other table's index about 2 times each.
Perhaps the "Dependent subquery" is the costly part. Let's see that.
Recently, we switched from MyISAM to InnoDB. I tested the whole application and there are generally no problems except for one thing - using COUNT(*) in combination with 2 or more WHERE conditions.
So, here's the problem. The query below takes half a second which is not acceptable. After all InnoDB shouldn't be slower than MyISAM when using COUNT() + WHERE, but that's exactly what is happening here.
Both project_id and status_id are indexed columns. The table has 350K records.
SELECT COUNT(*) FROM respondents WHERE project_id='366' AND status_id='42'
And here is what EXPLAIN says:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE respondents index_merge project_id,status_id project_id,status_id 4,1 NULL 8343 Using intersect(project_id,status_id); Using where...
When I use only one condition after WHERE (either project_id='366' or status_id='42'), it works fine.
I'm thinking, this whole intersecting thing could be the root of the problem. But then what can I do about it? What do you think?
The index merge can be fixed by a compound index
ALTER TABLE respondents ADD KEY(project_id,status_id)
Assuming the data distribution is not very skewed,so this index will be useful.(the project_id='366' AND status_id='42' will not return more than 50% of rows)
Also make sure that your column types match the search.Are project_id and status_id really VARCHAR? If not remove the quotes.
I am running a query (taking 3 minutes to execute)-
SELECT c.eveDate,c.hour,SUM(c.dataVolumeDownLink)+SUM(c.dataVolumeUpLink)
FROM cdr c
WHERE c.evedate>='2013-10-19'
GROUP BY c.hour;
with explain plan -
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE c ALL evedate_index,eve_hour_index 31200000 Using where; Using temporary; Using filesort
I am using table(Myisam)-
Primary key(id,evedate),
with weekly 8 partitions with key evedate,
index key- on evedate,
composite index on evedate,hour.
I have changed the mysql tuning parameters from my.ini as (4GB RAM)-
tmp_table_size=200M
key_buffer_size=400M
read_rnd_buffer_size=2M
But still its using temporary table and file sort. Please let me know what should I do to exclude this.
After adding new composite index(evedate,msisdn)
I have found some changes in few queries they were not using any temporary case, even in above query if I omit group by clause, its not using temporary table.
You cannot do anything. MySql is not able to optimize this query and avoid temporaty table.
According to this link: http://dev.mysql.com/doc/refman/5.7/en/group-by-optimization.html
there are two methods MySql is using to optimize GROUP BY.
The first method - Loose Index Scan - cannot be used for your query because this condition is not meet:
The only aggregate functions used in the select list (if any) are MIN() and MAX() ....
Your query contains SUM, therefore MySql cannot use the above optimalization method.
The second method - Tight Index Scan - cannot be used for your query, because this condition is not meet:
For this method to work, it is sufficient that there is a constant equality condition for all columns in a query referring to parts of the key coming before or in between parts of the GROUP BY key.
Your query is using only a range operator : WHERE c.evedate>='2013-10-19', there is no any equality condition in the WHERE clause, therefore this method cannot be used to optimize the query.
I wanted to find all hourly records that have a successor in a ~5m row table.
I tried :
SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD( date_time, INTERVAL 1 HOUR)) date_offset
FROM my_table) offset_dates
ON date_time = date_offset
and
SELECT DISTINCT(date_time)
FROM my_table
WHERE date_time IN (SELECT DISTINCT(DATE_ADD(date_time, INTERVAL 1 HOUR))
FROM my_table)
The first one completes in a few seconds, the seconds hangs for hours.
I can understand that the sooner is better but why such a huge performance gap?
-------- EDIT ---------------
Here are the EXPLAIN for both queries
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1710 Using temporary
1 PRIMARY my_table ref PRIMARY PRIMARY 8 offset_dates.date_offset 555 Using index
2 DERIVED my_table index NULL PRIMARY 13 NULL 5644204 Using index; Using temporary
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY my_table range NULL PRIMARY 8 NULL 9244 Using where; Using index for group-by
2 DEPENDENT SUBQUERY my_table index NULL PRIMARY 13 NULL 5129983 Using where; Using index; Using temporary
In general, a query using a join will perform better than an equivalent query using IN (...), because the former can take advantage of indexes while the latter can't; the entire IN list must be scanned for each row which might be returned.
(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)
You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.
I would prefix both queries by explain, and then compare the difference in the access plans. You will probably find that the first query looks at far fewer rows than the second.
But my hunch is that the JOIN is applied more immediately than the WHERE clause. So, in the WHERE clause you are getting every record from my_table, applying an arithmetic function, and then sorting them because select distinct usually requires a sort and sometimes it creates a temporary table in memory or on disk. The # of rows examined is probably the product of the size of each table.
But in the JOIN clause, a lot of the rows that are being examined and sorted in the WHERE clause are probably eliminated beforehand. You probably end up looking at far fewer rows... and the database probably takes easier measures to accomplish it.
But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance
'IN' clause is usually slow for huge tables. As far as I remember, for the second statement you printed out - it will simply loop through all rows of my_table (unless you have index there) checking each row for match of WHERE clause. In general IN is treated as a set of OR clauses with all set elements in it.
That's why, I think, using temporary tables that are created in background of JOIN query is faster.
Here are some helpful links about that:
MySQL Query IN() Clause Slow on Indexed Column
inner join and where in() clause performance?
http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
Another things to consider is that with your IN style, very little future optimization is possible compared to the JOIN. With the join you can possibly add an index, which, who knows, it depends on the data set, it might speed things up by a 2, 5, 10 times. With the IN, it's going to run that query.
Hi I am not understanding , why the subquery of given query is converting into dependent subquery.
Although the subquery is not dependent(not using primary query table) on main query.
I know that this query can be optimized using joins,but here i just want to know the reason of this
MYSQL Version 5.5
EXPLAIN SELECT id FROM `cab_request_histories`
WHERE cab_request_histories.id = any(SELECT id
FROM cab_requests
WHERE cab_requests.request_type = 'pickup')
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY cab_request_histories index NULL PRIMARY 4 NULL 20
2 DEPENDENT SUBQUERY cab_requests unique_subquery PRIMARY PRIMARY 4 func 1
I suspect that the ANY keyword will require MySQL to pass the values from outside the subquery to inside it to evaluate whether the result is true.
Mysql optimizer uses EXIST strategy for this query, effectively changing it to something like:
SELECT id FROM cab_request_histories
WHERE EXISTS
( SELECT 'this one is dependent' FROM cab_requests
WHERE cab_requests.request_type = 'pickup'
AND cab_requests.id = cab_request_histories.id )
You can see what optimizer does with your query using EXPLAIN EXTENDED your_query followed by SHOW WARNINGS.
This type of optimization is described in http://dev.mysql.com/doc/refman/5.5/en/subquery-optimization-with-exists.html.