MySQL 5.7 vs. 8 difference in LEFT JOIN processing - mysql

We are checking if we can upgrade our project database from MySQL 5.7 to v.8. The system is 7 years old and has tons of code... Today we got a slightly strange bug which did not appear on 5.7 (I wonder why). The buggy request is the following:
SELECT TableA.Amount, SUM(TableB.Amount) AS Amount2
FROM
TableA LEFT JOIN TableB ON TableA.ReservID = TableB.ReservID
WHERE
TableB.InvoiceID IS NULL
AND TableB.InvoiceStatusID = 2
AND TableB.PersonID = 389
AND TableB.PersonTypeID = 1
AND TableA.ReservID = 4657;
There is one record in TableA and no records in TableB for the given conditions.
I know that WHERE conditions are applied after joining the tables. So it is not a suprise for me that the query return NULL, NULL on MySQL8. But our developer (who's still sure that this query is Ok) just showed me that it returns 67667.65, NULL on MySQL 5.7!
So I got 2 questions at ones. 1. Why it works on 5.7 when all data must be filtered out by the WHERE conditions on non-existent (all null in joint table) Table2 fields? 2. Is there a way to make MySQL8 work in the same 'tolerant' way as I am sure there are many such 'genius' queries all over our old code?

The problem in your query is not the (left) join. While it makes it less clear to the reader that your left join is treated as a join, having the comparisons in the where clause is completely valid sql. Every database will treat your left join correctly as a join, and I don't think that MySQL (5.7 or 8.0) would give you a different result if you replace left join with a join, as the internal representation would not change.
Your query has a problem with the aggregation. select colA, sum(colB) without using group by colA will leave the value of colA unclear, see MySQL Handling of GROUP BY:
SELECT name, MAX(age) FROM t;
Without GROUP BY, there is a single group and it is nondeterministic which name value to choose for the group.
MySQL is about the only database system that will even allow you to run this query, a very special behaviour that generates a lot of questions on stackoverflow. Most other databases will complain about that column listed in the select - exactly for the reason you face right now: they don't really know what to return.
So the value you get for tablea.amount is basically random. While it usually depends on how MySQL executes the query internally (so it could depend on some optimization setting), it looks unlikely that you can convince MySQL 8 to return the number value there - and especially to make sure it is consistent in all similar queries you may have.
And I want to emphasize: the value null in MySQL 8 is also not deterministic - it could be something different.
To make a proper query, use an aggregate function for tablea.amount too, and depending on your requirements, fix your left join, e.g.
SELECT MAX(TableA.Amount) AS Amount,
SUM(TableB.Amount) AS Amount2
FROM TableA LEFT JOIN TableB
ON TableA.ReservID = TableB.ReservID
AND TableB.InvoiceID IS NULL
AND TableB.InvoiceStatusID = 2
AND TableB.PersonID = 389
AND TableB.PersonTypeID = 1
WHERE TableA.ReservID = 4657
This should give you the behaviour from MySQL 5.7, e.g. <value for amount>, null. If you use join, you should get null, null. Both cases will be deterministic.
Although using just SELECT TableA.Amount, SUM(...) ... LEFT JOIN ... (with a proper left join) will return the amount instead of null for MySQL 8 too, it is still not valid sql! MySQL will only allow it because ReservID = 4657 limits it to a single row in TableA using the primary key. So if you have to check all queries anyway, fix it properly.

i dont know the reason, why its working on 5.7. but to get your expected result you can do:
SELECT
TableA.Amount,
SUM(TableB.Amount) AS Amount2
FROM TableA
LEFT JOIN TableB
ON TableA.ReservID = TableB.ReservID
AND TableB.InvoiceID IS NULL
AND TableB.InvoiceStatusID = 2
AND TableB.PersonID = 389
AND TableB.PersonTypeID = 1
WHERE TableA.ReservID = 4657;

Related

Join Performances When Searching For NULL Value

I need to find a value that exists in LoyaltyTransactionBasketItemStores table but not in DimProductConsolidate table. I need the item code and its corresponding company. This is my query
SELECT
A.ProductReference, A.CompanyCode
FROM
(SELECT ProductReference, CompanyCode FROM dwhdb.LoyaltyTransactionsBasketItemsStores GROUP BY ProductReference) A
LEFT JOIN
(SELECT LoyaltyVariantArticleCode FROM dwhdb.DimProductConsolidate) B ON B.LoyaltyVariantArticleCode = A.ProductReference
WHERE
B.LoyaltyVariantArticleCode IS NULL
It is a pretty straight forward query. But when I run it, it's taking 1 hour and still not finish. Then I use EXPLAIN and this is the result
But when I remove the CompanyCode from my query, its performance is increasing a lot. This is the EXPLAIN result
I want to know why is this happening and is there any way to get ProductReference and its company with a lot more better performance?
Your current query is rife with syntax and structural errors. I would use exists logic here:
SELECT a.ProductReference, a.CompanyCode
FROM dwhdb.LoyaltyTransactionsBasketItemsStores a
WHERE NOT EXISTS (SELECT 1 FROM dwhdb.DimProductConsolidate b
WHERE b.LoyaltyVariantArticleCode = a.ProductReference);
Your current query is doing a GROUP BY in the first subquery, but you never select aggregates, but rather other non aggregate columns. On most other databases, and even on MySQL in strict mode, this syntax is not allowed. Also, there is no need to have 2 subqueries here. Rather, just select from the basket table and then assert that matching records do not exist in the other table.

Optimizing INNER JOIN across multiple tables

I have trawled many of the similar responses on this site and have improved my code at several stages along the way. Unfortunately, this 3-row query still won't run.
I have one table with 100k+ rows and about 30 columns of which I can filter down to 3-rows (in this example) and then perform INNER JOINs across 21 small lookup tables.
In my first attempt, I was lazy and used implicit joins.
SELECT `master_table`.*, `lookup_table`.`data_point` x 21
FROM `lookup_table` x 21
WHERE `master_table`.`indexed_col` = "value"
AND `lookup_table`.`id` = `lookup_col` x 21
The query looked to be timing out:
#2013 - Lost connection to MySQL server during query
Following this, I tried being explicit about the joins.
SELECT `master_table`.*, `lookup_table`.`data_point` x 21
FROM `master_table`
INNER JOIN `lookup_table` ON `lookup_table`.`id` = `master_table`.`lookup_col` x 21
WHERE `master_table`.`indexed_col` = "value"
Still got the same result. I then realised that the query was probably trying to perform the joins first, then filter down via the WHERE clause. So after a bit more research, I learned how I could apply a subquery to perform the filter first and then perform the joins on the newly created table. This is where I got to, and it still returns the same error. Is there any way I can improve this query further?
SELECT `temp_table`.*, `lookup_table`.`data_point` x 21
FROM (SELECT * FROM `master_table` WHERE `indexed_col` = "value") as `temp_table`
INNER JOIN `lookup_table` ON `lookup_table`.`id` = `temp_table`.`lookup_col` x 21
Is this the best way to write up this kind of query? I tested the subquery to ensure it only returns a small table and can confirm that it returns only three rows.
First, at its most simple aspect you are looking for
select
mt.*
from
Master_Table mt
where
mt.indexed_col = 'value'
That is probably instantaneous provided you have an index on your master table on the given indexed_col in the first position (in case you had a compound index of many fields)…
Now, if I am understanding you correctly on your different lookup columns (21 in total), you have just simplified them for redundancy in this post, but actually doing something in the effect of
select
mt.*,
lt1.lookupDescription1,
lt2.lookupDescription2,
...
lt21.lookupDescription21
from
Master_Table mt
JOIN Lookup_Table1 lt1
on mt.lookup_col1 = lt1.pk_col1
JOIN Lookup_Table2 lt2
on mt.lookup_col2 = lt2.pk_col2
...
JOIN Lookup_Table21 lt21
on mt.lookup_col21 = lt21.pk_col21
where
mt.indexed_col = 'value'
I had a project well over a decade ago dealing with a similar situation... the Master table had about 21+ million records and had to join to about 30+ lookup tables. The system crawled and queried died after running a query after more than 24 hrs.
This too was on a MySQL server and the fix was a single MySQL keyword...
Select STRAIGHT_JOIN mt.*, ...
By having your master table in the primary position, where clause and its criteria directly on the master table, you are good. You know the relationships of the tables. Do the query in the exact order I presented it to you. Don't try to think for me on this and try to optimize based on a subsidiary table that may have smaller record count and somehow think that will help the query faster... it won't.
Try the STRAIGHT_JOIN keyword. It took the query I was working on and finished it in about 1.5 hrs... it was returning all 21 million rows with all corresponding lookup key descriptions for final output, hence still needed a longer duration than just 3 records.
First, don't use a subquery. Write the query as:
SELECT mt.*, lt.`data_point`
FROM `master_table` mt INNER JOIN
`lookup_table` l
ON l.`id` = mt.`lookup_col`
WHERE mt.`indexed_col` = value;
The indexes that you want are master_table(value, lookup_col) and lookup_table(id, data_point).
If you are still having performance problems, then there are multiple possibilities. High among them is that the result set is simply too big to return in a reasonable amount of time. To see if that is the case, you can use select count(*) to count the number of returned rows.

Using a join in Mysql update statement instead of sub-query

I'm currently using the following to update a table of mine:
UPDATE Check_Dictionary
SET InDict = "No" WHERE (Leagues, Matches, Line) IN (SELECT * FROM (
SELECT Leagues, Matches, Line FROM Check_Dictionary
WHERE InDict = "No")as X)
However, when I have large data sets (40k+ rows) this seems to be fairly inefficient/slow. All of the searching I'm doing suggests that joins are far more efficient for this sort of thing than a sub-query. However, being a mysql newbie I'm not sure of the best way to do it.
My table may have multiple rows where the League/Matches/Line fields are the same. Generally the InDict field on these rows will be "Yes". However, if one of them is "No" I need to update all of the other rows with the same League/Matches/Line columns to "No" as well (so they all have a value of "No").
Would using a join in Mysql update statement instead of sub-query be more efficient?
How can I do it using a join?
I would think a join should be faster, but it depends on indexing and other things, you should try it for yourself to see which performs better (and maybe use explain to analyze the queries).
As for syntax, any of these should work:
UPDATE Check_Dictionary c1
JOIN (
SELECT Leagues, Matches, Line
FROM Check_Dictionary
WHERE InDict = "No"
) AS X USING (Leagues, Matches, Line)
SET InDict = "No"
UPDATE Check_Dictionary AS c1
JOIN Check_Dictionary AS c2 USING (Leagues, Matches, Line)
SET c1.InDict = "No"
WHERE c2.InDict = "No"
The update join query given by "jpw" was correct you can use it, I don't want to repeat. Having said, i just want to post join is faster than subquery obviously especially if you want to update 40K+ rows. Below is the data from MySQL documentation says about the same.
A LEFT [OUTER] JOIN can be faster than an equivalent subquery because the server might be able to optimize it better—a fact that is not specific to MySQL Server alone. Prior to SQL-92, outer joins did not exist, so subqueries were the only way to do certain things. Today, MySQL Server and many other modern database systems offer a wide range of outer join types.
MySQL Server supports multiple-table DELETE statements that can be used to efficiently delete rows based on information from one table or even from many tables at the same time. Multiple-table UPDATE statements are also supported. See Section 13.2.2, “DELETE Syntax”, and Section 13.2.10, “UPDATE Syntax”.
Source : http://dev.mysql.com/doc/refman/5.1/en/rewriting-subqueries.html

SQL Join condition optimization

I had an issue with a SQL query that was slow and lead me to a 500 Internal Server Error.
After playing a bit with the query I had found out that moving join conditions outside makes the query work and the error vanish.
Now I wonder:
is the second query is equivalent to the first one ?
what was my bug mistake, why it was so slow ?
Here the queries.
Original (slow) SQL:
SELECT *
FROM `TABLE_1` AS `main_table`
INNER JOIN `TABLE_2` AS `w`
ON main_table.entity_id = w.product_id
LEFT JOIN `TABLE_3` AS `ur`
ON main_table.entity_id = ur.product_id
AND ur.category_id IS NULL
AND ur.store_id = 1
AND ur.is_system = 1
WHERE (w.website_id = '1')
Faster SQL:
SELECT *
FROM `TABLE_1` AS `main_table`
INNER JOIN `TABLE_2` AS `w`
ON main_table.entity_id = w.product_id
LEFT JOIN `TABLE_3` AS `ur`
ON main_table.entity_id = ur.product_id
WHERE (w.website_id = '1')
AND ur.category_id IS NULL
AND ur.store_id = 1
AND ur.is_system = 1
Is the 2nd query equivalent to the 1st?
Depending on the data in TABLE_3, the 2nd query is NOT equivalent to the first, since you're using a LEFT JOIN.
Consider what happens when you have a record in TABLE_1, with an entity_id that does not match any rows in TABLE_3: Your first query would still return this record from TABLE_1, with NULL values for all the columns of TABLE_3. The 2nd query applies filters to the columns from TABLE_3, but since these are all NULLs, the record now gets filtered out.
In general, query 2 would thus return fewer records than query 1 (unless, of course, all your entity_id's match a product_id from TABLE_3).
What was my bug mistake, why it was so slow?
Both queries are valid SQL, so none of them should give you a 500 Internal Server Error. This does not even look like a MySQL error, so maybe the bug arose elsewhere? Perhaps you have a web application that is unable to handle the NULL's returned from the 1st query? It's impossible to answer the question without more details about the error.
As for the speed of the queries, this depends a lot on what indexes are defined. Generally, I would not expect the 2nd query to be faster than the 1st, but that depends. As I mentioned above, the 1st query might return more records than the 2nd. If you're not querying the database directly, but instead through some application, could it be the application slowing down your query?

How to optimize a MySQL update which contains an "in" subquery?

How do I optimize the following update because the sub-query is being executed for each row in table a?
update
a
set
col = 1
where
col_foreign_id not in (select col_foreign_id in b)
You could potentially use an outer join where there are no matching records instead of your not in:
update table1 a
left join table2 b on a.col_foreign_id = b.col_foreign_id
set a.col = 1
where b.col_foreign_id is null
This should use a simple select type rather than a dependent subquery.
Your current query (or the one that actually works since the example in the OP doesn't look like it would) is potentially dangerous in that a NULL in b.col_foreign_id would cause nothing to match, and you'd update no rows.
not exists would also be something to look at if you want to replace not in.
I can't tell you that this will make your query any faster, but there is some good info here. You'll have to test in your environment.
Here's a SQL Fiddle illuminating the differences between in, exists, and outer join (check the rows returned, null handling, and execution plans).