Is using correlated subquery better than join? (indexing perspective)

Is using correlated subquery better than join? (indexing perspective) - mysql

I read somewhere something like this:
Indexes will be used per queries.
So as you know, this is two queries:
SELECT m1.*, (SELECT 1 FROM mytable2 m2 WHERE col2 = ?) AS sth
FROM mytable1 m1 WHERE col1 = ?
Well query above can use two indexes: mytable1(col1), mytable2(col2). Because of being two separated queries.
Now take a look at this one: (the same as previous query, just uses join instead of subquery)
SELECT m1.*, m2.1 AS sth
FROM mytable1 m1
JOIN mytable2 m2 ON m2.col2 = ?
WHERE m1.col1 = ?
But this ^ query, is just one query. So it can use just one index. Is my understanding right? So using subquery is better for indexing, right?

But this ^ query, is just one query. So it can use just one index. Is my understanding right? So using subquery is better for indexing, right?
You misunderstand. MySQL can use one index per table reference.
So in this case, it can use both indexes: mytable1(col1), mytable2(col2).
You can even use two different indexes from the same table, if you do a self-join or a UNION or a subquery. Each time you reference the table counts as a separate table reference.
SELECT m1.*, m2.1 AS sth
FROM mytable1 m1
JOIN mytable2 m2 ON m2.col2 = ?
WHERE m1.col1 = ?
Regardless of indexing, this is a strange query. You have no condition that relates mytable1 to mytable2. So you're doing a Cartesian product between the two tables. One or both table may be selecting a single row, depending on your conditions for col1 and col2. But it's still a Cartesian product, so if the conditions on both tables return multiple rows, you'll get result set with a lot of repetition.

This is too long for a comment.
The two queries are different, in multiple respects:
The first returns all rows in mytable1 that match the where condition, regardless of whether there is a match in the second table. The second only returns rows that match.
The first fails with an error if the subquery returns more than one row. The second returns multiple rows that match.
As a consequence, the first could return NULL for sth, the second cannot.
My advice is to first learn to write the query that meets your functional needs. Then worry about performance.
As for your question, both correlated subqueries and joins can make use of an index. The idea that correlated subqueries are always worse than joins is an old-wives' tale (no offense to old wives) that should be forgotten.

Generally speaking, it all depends. At the end of the day, SQL Server will create execution plans, and depending on how it interprets your query, one might be better than the other. Having that said, generally, join is better.

Related

Should I create 2 indexes for the same column to speed up a join?

I am new to database index and I've just read about what an index is, differences between clustered and non clustered and what composite index is.
So for a inner join query like this:
SELECT columnA
FROM table1
INNER JOIN table2
ON table1.columnA= table2.columnA;
In order to speed up the join, should I create 2 indexes, one for table1.columnA and the other for table2.columnA , or just creating 1 index for table1 or table2?
One is good enough? I don't get it, for example, if I select some data from table2 first and based on the result to join on columnA, then I am looping through results one by one from table2, then an index from table2.columnA is totally useless here, because I don't need to find anything in table2 now. So I am needing a index for table1.columnA.
And vice versa, I need a table2.columnA if I select some results from table1 first and want to join on columnA.
Well, I don't know how in reality "select xxxx first then join based on ..." looks like, but that scenario just came into my mind. It would be much appreciated if someone could also give a simple example.

One index is sufficient, but the question is which one?
It depends on how the MySQL optimizer decides to order the tables in the join.
For an inner join, the results are the same for table1 INNER JOIN table2 versus table2 INNER JOIN table1, so the optimizer may choose to change the order. It is not constrained to join the tables in the order you specified them in your query.
The difference from an indexing perspective is whether it will first loop over rows of table1, and do lookups to find matching rows in table2, or vice-versa: loop over rows of table2 and do lookups to find rows in table1.
MySQL does joins as "nested loops". It's as if you had written code in your favorite language like this:
foreach row in table1 {
look up rows in table2 matching table1.column_name
}
This lookup will make use of the index in table2. An index in table1 is not relevant to this example, since your query is scanning every row of table1 anyway.
How can you tell which table order is used? You can use EXPLAIN. It will show you a row for each table reference in the query, and it will present them in the join order.
Keep in mind the presence of an index in either table may influence the optimizer's choice of how to order the tables. It will try to pick the table order that results in the least expensive query.
So maybe it doesn't matter which table you add the index to, because whichever one you put the index on will become the second table in the join order, because it makes it more efficient to do the lookup that way. Use EXPLAIN to find out.

90% of the time in a properly designed relational database, one of the two columns you join together is a primary key, and so should have a clustered index built for it.
So as long as you're in that case, you don't need to do anything at all. The only reason to add additional non-clustered indices is if you're also further filtering the join with a where clause at the end of your statement, you need to make sure both the join columns and the filtered columns are in a correct index together (ie correct sort order, etc).

Is it possible to further optimize this MySQL query?

I was running a query of this kind of query:
SELECT
-- fields
FROM
table1 JOIN table2 ON (table1.c1 = table.c1 OR table1.c2 = table2.c2)
WHERE
-- conditions
But the OR made it very slow so i split it into 2 queries:
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table.c1
WHERE
-- conditions
UNION
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table.c2
WHERE
-- conditions
Which works much better but now i am going though the tables twice so i was wondering if there was any further optimizations for instance getting set of entries that satisfies the condition (table1.c1 = table.c1 OR table1.c2 = table2.c2) and then query on it. That would bring me back to the first thing i was doing but maybe there is another solution i don't have in mind. So is there anything more to do with it or is it already optimal?

Splitting the query into two separate ones is usually better in MySQL since it rarely uses "Index OR" operation (Index Merge in MySQL lingo).
There are few items I would concentrate for further optimization, all related to indexing:
1. Filter the rows faster
The predicate in the WHERE clause should be optimized to retrieve the fewer number of rows. And, they should be analized in terms of selectivity to create indexes that can produce the data with the fewest filtering as possible (less reads).
2. Join access
Retrieving related rows should be optimized as well. According to selectivity you need to decide which table is more selective and use it as a driving table, and consider the other one as the nested loop table. Now, for the latter, you should create an index that will retrieve rows in an optimal way.
3. Covering Indexes
Last but not least, if your query is still slow, there's one more thing you can do: use covering indexes. That is, expand your indexes to include all the rows from the driving and/or secondary tables in them. This way the InnoDB engine won't need to read two indexes per table, but a single one.

Test
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c1 = table2.c1
WHERE
-- conditions
UNION ALL
SELECT
-- fields
FROM
table1 JOIN table2 ON table1.c2 = table2.c2
WHERE
-- conditions
/* add one more condition which eliminates the rows selected by 1st subquery */
AND table1.c1 != table2.c1
Copied from the comments:
Nico Haase > What do you mean by "test"?
OP shows query patterns only. So I cannot predict does the technique is effective or not, and I suggest OP to test my variant on his structure and data array.
Nico Haase > what you've changed
I have added one more condition to 2nd subquery - see added comment in the code.
Nico Haase > and why?
This replaces UNION DISTINCT with UNION ALL and eliminates combined rowset sorting for duplicates remove.

SQL query takes too much time (3 joins)

I'm facing an issue with an SQL Query. I'm developing a php website, and to avoid making too much queries, I prefer to make a big one looking like :
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards sold by the member without any bet on it. In the result, I need all the information to display them.
Here is the approximate rows in each table :
mercato : 1200
cartes_joueur : 800 000
carte_joueur_base : 62
membres : 2000
mercato_enchere : 15 000
I tried to reduce them (in dev environment) by deleting old data; but the query still needs 10~15 seconds to execute (which is way too long on a website )
Thanks for your help.

Let's take a look.
The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
You may not have useful indexes on your tables for accelerating this. We can't tell. Please notice that MySQL can't exploit multiple indexes in a single query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_jouer, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third column in ON conditions.
I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_jouer and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to tables in the mercato_encheres table. The mismatching rows get NULL values for the second table. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.

Huge performance difference between two similar SQL queries

I have two SQL queries that provides the same output.
My first intuition was to use this:
SELECT * FROM performance_dev.report_golden_results
where id IN (SELECT max(id) as 'id' from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id)
Now, this took something like 70 secs to complete!
Searching for another solution I tried something similar:
SELECT * FROM performance_dev.report_golden_results e
join (SELECT max(id) as 'id'
from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id) s
ON s.id = e.id;
Surprisingly, this took 0.05 secs to complete!!!
how come these two are so different?
thanks!

First thing which Might Cause the Time Lag is that MySQL uses 'semi-join' strategy for Subqueries.The Semi Join includes Following Steps :
If a subquery meets the preceding criteria, MySQL converts it to a
semi-join and makes a cost-based choice from these strategies:
Convert the subquery to a join, or use table pullout and run the query
as an inner join between subquery tables and outer tables. Table
pullout pulls a table out from the subquery to the outer query.
Duplicate Weedout: Run the semi-join as if it was a join and remove
duplicate records using a temporary table.
FirstMatch: When scanning the inner tables for row combinations and
there are multiple instances of a given value group, choose one rather
than returning them all. This "shortcuts" scanning and eliminates
production of unnecessary rows.
LooseScan: Scan a subquery table using an index that enables a single
value to be chosen from each subquery's value group.
Materialize the subquery into a temporary table with an index and use
the temporary table to perform a join. The index is used to remove
duplicates. The index might also be used later for lookups when
joining the temporary table with the outer tables; if not, the table
is scanned.
But giving an explicit join reduces these efforts which might be the Reason.
I hope it helped!

MySQL does not consider the first query as subject for semi-join optimization (MySQL converts semi joins to classic joins with some kind of optimization: first match, duplicate weedout ...)
Thus a full scan will be made on the first table and the subquery will be evaluated for each row generated by the outer select: hence the bad performances.
The second one is a classic join, what will happen in this case that MySQL will compute the result of derived query and then matches only values from this query with values from first query satisfying the condition, hence no full scan is needed on the first table (I assumed here that id is an indexed column).
The question right now is why MySQL does not consider the first query as subject to semi-join optimization: the answer is documented in MySQL https://dev.mysql.com/doc/refman/5.6/en/semijoins.html
In MySQL, a subquery must satisfy these criteria to be handled as a semijoin:
It must be an IN (or =ANY) subquery that appears at the top level of the WHERE or ON clause, possibly as a term in an AND expression. For example:
SELECT ...
FROM ot1, ...
WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, ot_i and it_i represent tables in the outer and inner parts of the query, and oe_i and ie_i represent expressions that refer to columns in the outer and inner tables.
It must be a single SELECT without UNION constructs.
It must not contain a GROUP BY or HAVING clause.
It must not be implicitly grouped (it must contain no aggregate functions).
It must not have ORDER BY with LIMIT.
The STRAIGHT_JOIN modifier must not be present.
The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
Your subquery use GROUP BY hence semi-join optimization was not applied.

MySQL nested queries execution

I am writing a nested MySQL query where a subquery returns more than one row and hence the query can not be executed.
Can anyone suggest me a solution for this problem?
Thanks in advance.

An error about a subquery returning more than one value says to me that you're attempting a straight value comparison, like this:
WHERE col = (SELECT col2 FROM TABLE_2)
The solution depends on the data coming from the subquery - do you want the query to use all the values being returned? If yes, then change the equals sign for an IN:
WHERE col IN (SELECT col2 FROM TABLE_2)
Otherwise, you need to correct the subquery so it only ever returns one value. The MAX or MIN aggregate functions are a possibliity - they'll return the highest or lowest value. It could just be a matter of correlating the subquery:
FROM TABLE_1 t1
WHERE t1.col = (SELECT MAX(t2.col2)
FROM TABLE_2 t2
WHERE t2.fk_col = t1.id) -- correlated example
As Tabhaza points out, a subquery generally doesn't return more than one column (though some databases support tuple matching), in which case you need to define a derived table/inline view and join to it.
Would've been nice to have more information on the issue you're having...

Try joining to a derived table rather than doing a subquery; it will allow you to return multiple fields:
SELECT a.Field1, a.Field2, b.Field3, b.Field4
FROM table1 a INNER JOIN
(SELECT c.Field3, c.Field4, c.Key FROM table2 as c) as b ON a.Key = b.Key
WHERE ...

this sounds like a logic problem, not a syntax problem.
why is the subquery returning more than one row?
why do you have that in a place that requires only one row?
you need to restructure something to fit these two things together. without any indication of your system, your query, or your intent, it is very hard to help further.

If the database says you are returning more than one row, you should listen to what it says and change your query so that it only returns one row.
This is a problem in your logic.
Change the query so that it only returns one row.
Think about why the query is returning more than one row, and determine how to get the query to return just the single row you need from that result.

Use a LIMIT clause on the subquery so it always returns a maximum of 1 row

You could add a LIMIT 1 to the subquery so the top query only considers the first result. You can also sort the results from the subquery before doing the LIMIT, to return the result with the highest/lowest X. But make sure that that's actually what you want to happen, as the multi-row subquery is often a symptom of an underlying problem.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008