Huge performance difference between two similar SQL queries - mysql

I have two SQL queries that provides the same output.
My first intuition was to use this:
SELECT * FROM performance_dev.report_golden_results
where id IN (SELECT max(id) as 'id' from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id)
Now, this took something like 70 secs to complete!
Searching for another solution I tried something similar:
SELECT * FROM performance_dev.report_golden_results e
join (SELECT max(id) as 'id'
from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id) s
ON s.id = e.id;
Surprisingly, this took 0.05 secs to complete!!!
how come these two are so different?
thanks!

First thing which Might Cause the Time Lag is that MySQL uses 'semi-join' strategy for Subqueries.The Semi Join includes Following Steps :
If a subquery meets the preceding criteria, MySQL converts it to a
semi-join and makes a cost-based choice from these strategies:
Convert the subquery to a join, or use table pullout and run the query
as an inner join between subquery tables and outer tables. Table
pullout pulls a table out from the subquery to the outer query.
Duplicate Weedout: Run the semi-join as if it was a join and remove
duplicate records using a temporary table.
FirstMatch: When scanning the inner tables for row combinations and
there are multiple instances of a given value group, choose one rather
than returning them all. This "shortcuts" scanning and eliminates
production of unnecessary rows.
LooseScan: Scan a subquery table using an index that enables a single
value to be chosen from each subquery's value group.
Materialize the subquery into a temporary table with an index and use
the temporary table to perform a join. The index is used to remove
duplicates. The index might also be used later for lookups when
joining the temporary table with the outer tables; if not, the table
is scanned.
But giving an explicit join reduces these efforts which might be the Reason.
I hope it helped!

MySQL does not consider the first query as subject for semi-join optimization (MySQL converts semi joins to classic joins with some kind of optimization: first match, duplicate weedout ...)
Thus a full scan will be made on the first table and the subquery will be evaluated for each row generated by the outer select: hence the bad performances.
The second one is a classic join, what will happen in this case that MySQL will compute the result of derived query and then matches only values from this query with values from first query satisfying the condition, hence no full scan is needed on the first table (I assumed here that id is an indexed column).
The question right now is why MySQL does not consider the first query as subject to semi-join optimization: the answer is documented in MySQL https://dev.mysql.com/doc/refman/5.6/en/semijoins.html
In MySQL, a subquery must satisfy these criteria to be handled as a semijoin:
It must be an IN (or =ANY) subquery that appears at the top level of the WHERE or ON clause, possibly as a term in an AND expression. For example:
SELECT ...
FROM ot1, ...
WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, ot_i and it_i represent tables in the outer and inner parts of the query, and oe_i and ie_i represent expressions that refer to columns in the outer and inner tables.
It must be a single SELECT without UNION constructs.
It must not contain a GROUP BY or HAVING clause.
It must not be implicitly grouped (it must contain no aggregate functions).
It must not have ORDER BY with LIMIT.
The STRAIGHT_JOIN modifier must not be present.
The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
Your subquery use GROUP BY hence semi-join optimization was not applied.

Related

Is it possible to further optimize this MySQL query?

I was running a query of this kind of query:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow so i split it into 2 queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table.c2
WHERE
-- conditions
Which works much better but now i am going though the tables twice so i was wondering if there was any further optimizations for instance getting set of entries that satisfies the condition (table1.c1 = table.c1 OR table1.c2 = table2.c2) and then query on it. That would bring me back to the first thing i was doing but maybe there is another solution i don't have in mind. So is there anything more to do with it or is it already optimal?
Splitting the query into two separate ones is usually better in MySQL since it rarely uses "Index OR" operation (Index Merge in MySQL lingo).
There are few items I would concentrate for further optimization, all related to indexing:
1. Filter the rows faster
The predicate in the WHERE clause should be optimized to retrieve the fewer number of rows. And, they should be analized in terms of selectivity to create indexes that can produce the data with the fewest filtering as possible (less reads).
2. Join access
Retrieving related rows should be optimized as well. According to selectivity you need to decide which table is more selective and use it as a driving table, and consider the other one as the nested loop table. Now, for the latter, you should create an index that will retrieve rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the rows from the driving and/or secondary tables in them. This way the InnoDB engine won't need to read two indexes per table, but a single one.
Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
OP shows query patterns only. So I cannot predict does the technique is effective or not, and I suggest OP to test my variant on his structure and data array.
Nico Haase > what you've changed
I have added one more condition to 2nd subquery - see added comment in the code.
Nico Haase > and why?
This replaces UNION DISTINCT with UNION ALL and eliminates combined rowset sorting for duplicates remove.

Need some clarification on indexes (WHERE, JOIN)

We are facing some performance issues in some reports that work on millions of rows. I tried optimizing sql queries, but it only reduces the time of execution to half.
The next step is to analyse and modify or add some indexes, therefore i have some questions:
1- the sql queries contain a lot of joins: do i have to create an index for each foreignkey?
2- Imagine the request SELECT * FROM A LEFT JOIN B on a.b_id = b.id where a.attribute2 = 'someValue', and we have an index on the table A based on b_id and attribute2: does my request use this index for the where part ( i know if the two conditions were on the where clause the index will be used).
3- If an index is based on columns C1, C2 and C3, and I decided to add an index based on C2, do i need to remove the C2 from the first index?
Thanks for your time
You can use EXPLAIN query to see what MySQL will do when executing it. This helps a LOT when trying to figure out why its slow.
JOIN-ing happens one table at a time, and the order is determined by MySQL analyzing the query and trying to find the fastest order. You will see it in the EXPLAIN result.
Only one index can be used per JOIN and it has to be on the table being joined. In your example the index used will be the id (primary key) on table B. Creating an index on every FK will give MySQL more options for the query plan, which may help in some cases.
There is only a difference between WHERE and JOIN conditions when there are NULL (missing rows) for the joined table (there is no difference at all for INNER JOIN). For your example the index on b_id does nothing. If you change it to an INNER JOIN (e.g. by adding b.something = 42 in the where clause), then it might be used if MySQL determines that it should do the query in reverse (first b, then a).
No.. It is 100% OK to have a column in multiple indexes. If you have an index on (A,B,C) and you add another one on (A) that will be redundant and pointless (because it is a prefix of another index). An index on B is perfectly fine.

LEFT OUTER JOIN GIVING UNEXPECTED RESULT

I'm making a simple query :
SELECT * FROM Bench LEFT JOIN ASSIGNED_DOM
ON (Bench.B_id=ASSIGNED_DOM.B_id) WHERE Bench.B_type=0 ;
As expected all the lines of Bench table are returned BUT If I try to get the B_id field I discovered that was put to NULL.
Then I have tried with this other query that should be totally equivalent:
SELECT * FROM Bench LEFT JOIN ASSIGNED_DOM USING (B_id) WHERE Bench.B_type=0 ;
But in that case the B_id field is returned correctly.
What's wrong with the first query? What the difference between the two ?
The two queries are not equivalent. According to the documentation,
Natural joins and joins with USING, including outer join variants, are processed according to the SQL:2003 standard.
Redundant columns of a NATURAL join do not appear.
That specifically comes down to the following difference:
A USING clause can be rewritten as an ON clause that compares corresponding columns. However, although USING and ON are similar, they are not quite the same.
With respect to determining which rows satisfy the join condition, both joins are semantically identical.
With respect to determining which columns to display for SELECT * expansion, the two joins are not semantically identical. The USING join selects the coalesced value of corresponding columns, whereas the ON join selects all columns from all tables.
So your first query has two columns of the same name, bench.B_id, ASSIGNED_DOM.B_id, while the second one just has one, coalesce(bench.B_id, ASSIGNED_DOM.B_id) as B_id.
It will depend on your application/framework how exactly the first case will be handled. E.g. the MySQL client or phpmyadmin will just display all columns. Some frameworks may alter the names in some way to make them unique.
php in particular (and I assume you are using this) will not though: if you use $row['B_id'], it will return the last occurance (although that behaviour is not specified), so in your case you will get ASSIGNED_DOM.B_id. You can however still access both columns with their index (e.g. $row[0], $row[1]), but just one of those with their identical column name.
To prevent such problems, you can/should use aliases, e.g. select bench.B_id as bench_B_id, ASSIGNED_DOM.B_id as ASSIGNED_DOM_B_id, ....
Values in second table overwrites values in the first one if column name is the same. Try to use an alias in your query
SELECT Bench.B_id AS bid1, ASSIGNED_DOM.B_id AS bid2
FROM Bench
LEFT JOIN ASSIGNED_DOM ON (Bench.B_id=ASSIGNED_DOM.B_id)
WHERE Bench.B_type=0;

Is using correlated subquery better than join? (indexing perspective)

I read somewhere something like this:
Indexes will be used per queries.
So as you know, this is two queries:
SELECT m1.*, (SELECT 1 FROM mytable2 m2 WHERE col2 = ?) AS sth
FROM mytable1 m1 WHERE col1 = ?
Well query above can use two indexes: mytable1(col1), mytable2(col2). Because of being two separated queries.
Now take a look at this one: (the same as previous query, just uses join instead of subquery)
SELECT m1.*, m2.1 AS sth
FROM mytable1 m1
JOIN mytable2 m2 ON m2.col2 = ?
WHERE m1.col1 = ?
But this ^ query, is just one query. So it can use just one index. Is my understanding right? So using subquery is better for indexing, right?
But this ^ query, is just one query. So it can use just one index. Is my understanding right? So using subquery is better for indexing, right?
You misunderstand. MySQL can use one index per table reference.
So in this case, it can use both indexes: mytable1(col1), mytable2(col2).
You can even use two different indexes from the same table, if you do a self-join or a UNION or a subquery. Each time you reference the table counts as a separate table reference.
SELECT m1.*, m2.1 AS sth
FROM mytable1 m1
JOIN mytable2 m2 ON m2.col2 = ?
WHERE m1.col1 = ?
Regardless of indexing, this is a strange query. You have no condition that relates mytable1 to mytable2. So you're doing a Cartesian product between the two tables. One or both table may be selecting a single row, depending on your conditions for col1 and col2. But it's still a Cartesian product, so if the conditions on both tables return multiple rows, you'll get result set with a lot of repetition.
This is too long for a comment.
The two queries are different, in multiple respects:
The first returns all rows in mytable1 that match the where condition, regardless of whether there is a match in the second table. The second only returns rows that match.
The first fails with an error if the subquery returns more than one row. The second returns multiple rows that match.
As a consequence, the first could return NULL for sth, the second cannot.
My advice is to first learn to write the query that meets your functional needs. Then worry about performance.
As for your question, both correlated subqueries and joins can make use of an index. The idea that correlated subqueries are always worse than joins is an old-wives' tale (no offense to old wives) that should be forgotten.
Generally speaking, it all depends. At the end of the day, SQL Server will create execution plans, and depending on how it interprets your query, one might be better than the other. Having that said, generally, join is better.

MySQL Is adding subqueries to IN a good practice as to the performance?

Does this query
SELECT * FROM `people` WHERE `country` IN (
SELECT `id` FROM `country` WHERE `available`
)
search for available coutries for each person or it does it once and is as fast as possible?
Maybe it's a good idea to use INNER JOIN to ignore those people whose countries are not available?
Or maybe it's a good idea to split this into two queries? (first select all available countries)?
What's the best method?
If you are using a newer version of MySQL and be careful not to add much complexity on the subquery, the MySQL query optimizer will automatically replace the subquery by a join, see this:
http://dev.mysql.com/doc/refman/5.6/en/subquery-optimization.html
In MySQL, a subquery must satisfy these criteria to be handled as a
semi-join:
It must be an IN (or =ANY) subquery that appears at the top level of
the WHERE or ON clause, possibly as a term in an AND expression. For
example:
SELECT ... FROM ot1, ... WHERE (oe1, ...) IN (SELECT ie1, ... FROM
it1, ... WHERE ...); Here, ot_i and it_i represent tables in the outer
and inner parts of the query, and oe_i and ie_i represent expressions
that refer to columns in the outer and inner tables.
It must be a single SELECT without UNION constructs.
It must not contain a GROUP BY or HAVING clause or aggregate
functions.
It must not have ORDER BY with LIMIT.
The number of outer and inner tables together must be less than the
maximum number of tables permitted in a join.
The subquery may be correlated or uncorrelated. DISTINCT is permitted,
as is LIMIT unless ORDER BY is also used.