Join Performances When Searching For NULL Value

Join Performances When Searching For NULL Value - mysql

I need to find a value that exists in LoyaltyTransactionBasketItemStores table but not in DimProductConsolidate table. I need the item code and its corresponding company. This is my query
SELECT
A.ProductReference, A.CompanyCode
FROM
(SELECT ProductReference, CompanyCode FROM dwhdb.LoyaltyTransactionsBasketItemsStores GROUP BY ProductReference) A
LEFT JOIN
(SELECT LoyaltyVariantArticleCode FROM dwhdb.DimProductConsolidate) B ON B.LoyaltyVariantArticleCode = A.ProductReference
WHERE
B.LoyaltyVariantArticleCode IS NULL
It is a pretty straight forward query. But when I run it, it's taking 1 hour and still not finish. Then I use EXPLAIN and this is the result
But when I remove the CompanyCode from my query, its performance is increasing a lot. This is the EXPLAIN result
I want to know why is this happening and is there any way to get ProductReference and its company with a lot more better performance?

Your current query is rife with syntax and structural errors. I would use exists logic here:
SELECT a.ProductReference, a.CompanyCode
FROM dwhdb.LoyaltyTransactionsBasketItemsStores a
WHERE NOT EXISTS (SELECT 1 FROM dwhdb.DimProductConsolidate b
WHERE b.LoyaltyVariantArticleCode = a.ProductReference);
Your current query is doing a GROUP BY in the first subquery, but you never select aggregates, but rather other non aggregate columns. On most other databases, and even on MySQL in strict mode, this syntax is not allowed. Also, there is no need to have 2 subqueries here. Rather, just select from the basket table and then assert that matching records do not exist in the other table.

Related

SQL query takes too much time (3 joins)

I'm facing an issue with an SQL Query. I'm developing a php website, and to avoid making too much queries, I prefer to make a big one looking like :
select m.*, cj.*, cjb.*, me.pseudo as pseudo_acheteur
from mercato m
JOIN cartes_joueur cj
ON m.ID_carte = cj.ID_carte_joueur
JOIN cartes_joueur_base cjb
ON cj.ID_carte_joueur_base = cjb.ID_carte_joueur_base
JOIN membres me
ON me.ID_membre = cj.ID_membre
where not exists (select * from mercato_encheres me where me.ID_mercato = m.ID_mercato)
and cj.ID_membre = 2
and m.status <> 'cancelled'
ORDER BY total_carac desc, cj.level desc, cjb.nom_carte asc
This should return all cards sold by the member without any bet on it. In the result, I need all the information to display them.
Here is the approximate rows in each table :
mercato : 1200
cartes_joueur : 800 000
carte_joueur_base : 62
membres : 2000
mercato_enchere : 15 000
I tried to reduce them (in dev environment) by deleting old data; but the query still needs 10~15 seconds to execute (which is way too long on a website )
Thanks for your help.

Let's take a look.
The use of * in SELECT clauses is harmful to query performance. Why? It's wasteful. It needlessly adds to the volume of data the server must process, and in the case of JOINs, can force the processing of columns with duplicate values. If you possibly can do so, try to enumerate the columns you need.
You may not have useful indexes on your tables for accelerating this. We can't tell. Please notice that MySQL can't exploit multiple indexes in a single query, so to make a query fast you often need a well-chosen compound index. I suggest you try defining the index (ID_membre, ID_carte_jouer, ID_carte_joueur_base) on your cartes_joueur table. Why? Your query matches for equality on the first of those columns, and then uses the second and third column in ON conditions.
I have often found that writing a query with the largest table (most rows) first helps me think clearly about optimizing. In your case your largest table is cartes_jouer and you are choosing just one ID_membre value from that table. Your clearest path to optimization is the knowledge that you only need to examine approximately 400 rows from that table, not 800 000. An appropriate compound index will make that possible, and it's easiest to imagine that index's columns if the table comes first in your query.
You have a correlated subquery -- this one.
where not exists (select *
from mercato_encheres me
where me.ID_mercato = m.ID_mercato)
MySQL's query planner can be stupidly literal-minded when it sees this, running it thousands of times. In your case it's even worse: it's got SELECT * in it: see point 1 above.
It should be refactored to use the LEFT JOIN ... IS NULL pattern. Here's how that goes.
select whatever
from mercato m
JOIN ...
JOIN ...
LEFT JOIN mercato_encheres mench ON mench.ID_mercato = m.ID_mercato
WHERE mench.ID_mercato IS NULL
and ...
ORDER BY ...
Explanation: The use of LEFT JOIN rather than ordinary inner JOIN allows rows from the mercato table to be preserved in the output even when the ON condition does not match them to tables in the mercato_encheres table. The mismatching rows get NULL values for the second table. The mench.ID_mercato IS NULL condition in the WHERE clause then selects only the mismatching rows.

Query takes too long to run

I am running the below query to retrive the unique latest result based on a date field within a same table. But this query takes too much time when the table is growing. Any suggestion to improve this is welcome.
select
t2.*
from
(
select
(
select
id
from
ctc_pre_assets ti
where
ti.ctcassettag = t1.ctcassettag
order by
ti.createddate desc limit 1
) lid
from
(
select
distinct ctcassettag
from
ctc_pre_assets
) t1
) ro,
ctc_pre_assets t2
where
t2.id = ro.lid
order by
id
Our able may contain same row multiple times, but each row with different time stamp. My object is based on a single column for example assettag I want to retrieve single row for each assettag with latest timestamp.

It's simpler, and probably faster, to find the newest date for each ctcassettag and then join back to find the whole row that matches.
This does assume that no ctcassettag has multiple rows with the same createddate, in which case you can get back more than one row per ctcassettag.
SELECT
ctc_pre_assets.*
FROM
ctc_pre_assets
INNER JOIN
(
SELECT
ctcassettag,
MAX(createddate) AS createddate
FROM
ctc_pre_assets
GROUP BY
ctcassettag
)
newest
ON newest.ctcassettag = ctc_pre_assets.ctcassettag
AND newest.createddate = ctc_pre_assets.createddate
ORDER BY
ctc_pre_assets.id
EDIT: To deal with multiple rows with the same date.
You haven't actually said how to pick which row you want in the event that multiple rows are for the same ctcassettag on the same createddate. So, this solution just chooses the row with the lowest id from amongst those duplicates.
SELECT
ctc_pre_assets.*
FROM
ctc_pre_assets
WHERE
ctc_pre_assets.id
=
(
SELECT
lookup.id
FROM
ctc_pre_assets lookup
WHERE
lookup.ctcassettag = ctc_pre_assets.ctcassettag
ORDER BY
lookup.createddate DESC,
lookup.id ASC
LIMIT
1
)
This does still use a correlated sub-query, which is slower than a simple nested-sub-query (such as my first answer), but it does deal with the "duplicates".
You can change the rules on which row to pick by changing the ORDER BY in the correlated sub-query.
It's also very similar to your own query, but with one less join.

Nested queries are always known to take longer time than a conventional query since. Can you append 'explain' at the start of the query and put your results here? That will help us analyse the exact query/table which is taking longer to response.
Check if the table has indexes. Unindented tables are not advisable(until unless obviously required to be unindented) and are alarmingly slow in executing queries.
On the contrary, I think the best case is to avoid writing nested queries altogether. Bette, run each of the queries separately and then use the results(in array or list format) in the second query.

First some questions that you should at least ask yourself, but maybe also give us an answer to improve the accuracy of our responses:
Is your data normalized? If yes, maybe you should make an exception to avoid this brutal subquery problem
Are you using indexes? If yes, which ones, and are you using them to the fullest?
Some suggestions to improve the readability and maybe performance of the query:
- Use joins
- Use group by
- Use aggregators
Example (untested, so might not work, but should give an impression):
SELECT t2.*
FROM (
SELECT id
FROM ctc_pre_assets
GROUP BY ctcassettag
HAVING createddate = max(createddate)
ORDER BY ctcassettag DESC
) ro
INNER JOIN ctc_pre_assets t2 ON t2.id = ro.lid
ORDER BY id
Using normalization is great, but there are a few caveats where normalization causes more harm than good. This seems like a situation like this, but without your tables infront of me, I can't tell for sure.
Using distinct the way you are doing, I can't help but get the feeling you might not get all relevant results - maybe someone else can confirm or deny this?
It's not that subqueries are all bad, but they tend to create massive scaleability issues if written incorrectly. Make sure you use them the right way (google it?)
Indexes can potentially save you for a bunch of time - if you actually use them. It's not enough to set them up, you have to create queries that actually uses your indexes. Google this as well.

MySQL(version 5.5): Why `JOIN` is faster than `IN` clause?

[Summary of the question: 2 SQL statements produce same results, but at different speeds. One statement uses JOIN, other uses IN. JOIN is faster than IN]
I tried a 2 kinds of SELECT statement on 2 tables, named booking_record and inclusions. The table inclusions has a many-to-one relation with table booking_record.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
with 300,000+ rows in booking_record-table and 6,100,000+ rows in inclusions-table; the 2nd statement delivered 127 rows in just 0.08 seconds, but the 1st statement took nearly 21 minutes for same records.
Why JOIN is so much faster than IN clause?

This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. What would happen is that the subquery would be run each time for every row in the outer query. Lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the distinct from the in subquery.
This is one of the reasons that I prefer exists instead of in, if you care about performance.

EXPLAIN should give you some clues (Mysql Explain Syntax
I suspect that the IN version is constructing a list which is then scanned by each item (IN is generally considered a very inefficient construct, I only use it if I have a short list of items to manually enter).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.

You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
id = foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)

only select the row if the field value is unique

I sort the rows on date. If I want to select every row that has a unique value in the last column, can I do this with sql?
So I would like to select the first row, second one, third one not, fourth one I do want to select, and so on.

What you want are not unique rows, but rather one per group. This can be done by taking the MIN(pk_artikel_Id) and GROUP BY fk_artikel_bron. This method uses an IN subquery to get the first pk_artikel_id and its associated fk_artikel_bron for each unique fk_artikel_bron and then uses that to get the remaining columns in the outer query.
SELECT * FROM tbl
WHERE pk_artikel_id IN
(SELECT MIN(pk_artikel_id) AS id FROM tbl GROUP BY fk_artikel_bron)
Although MySQL would permit you to add the rest of the columns in the SELECT list initially, avoiding the IN subquery, that isn't really portable to other RDBMS systems. This method is a little more generic.
It can also be done with a JOIN against the subquery, which may or may not be faster. Hard to say without benchmarking it.
SELECT *
FROM tbl
JOIN (
SELECT
fk_artikel_bron,
MIN(pk_artikel_id) AS id
FROM tbl
GROUP BY fk_artikel_bron) mins ON tbl.pk_artikel_id = mins.id

This is similar to Michael's answer, but does it with a self-join instead of a subquery. Try it out to see how it performs:
SELECT * from tbl t1
LEFT JOIN tbl t2
ON t2.fk_artikel_bron = t1.fk_artikel_bron
AND t2.pk_artikel_id < t1.pk_artikel_id
WHERE t2.pk_artikel_id IS NULL
If you have the right indexes, this type of join often out performs subqueries (since derived tables don't use indexes).

This non-standard, mysql-only trick will select the first row encountered for each value of pk_artikel_bron.
select *
...
group by pk_artikel_bron
Like it or not, this query produces the output asked for.
Edited
I seem to be getting hammered here, so here's the disclaimer:
This only works for mysql 5+
Although the mysql specification says the row returned using this technique is not predictable (ie you could get any row as the "first" encountered), in fact in all cases I've ever seen, you'll get the first row as per the order selected, so to get a predictable row that works in practice (but may not work in future releases but probably will), select from an ordered result:
select * from (
select *
...
order by pk_artikel_id) x
group by pk_artikel_bron

MySQL: Include COUNT of SELECT Query Results as a Column (Without Grouping)

I have a simple report sending framework that basically does the following things:
It performs a SELECT query, it makes some text-formatted tables based on the results, it sends an e-mail, and it performs an UPDATE query.
This system is a generalization of an older one, in which all of the operations were hard coded. However, in pushing all of the logic of what I'd like to do into the SELECT query, I've run across a problem.
Before, I could get most of the information for my text tables by saying:
SELECT Name, Address FROM Databas.Tabl WHERE Status='URGENT';
Then, when I needed an extra number for the e-mail, also do:
SELECT COUNT(*) FROM Databas.Tabl WHERE Status='URGENT' AND TimeLogged='Noon';
Now, I no longer have the luxury of multiple SELECT queries. What I'd like to do is something like:
SELECT Tabl.Name, Tabl.Address, COUNT(Results.UID) AS Totals
FROM Databas.Tabl
LEFT JOIN Databas.Tabl Results
ON Tabl.UID = Results.UID
AND Results.TimeLogged='Noon'
WHERE Status='URGENT';
This, at least in my head, says to get a total count of all the rows that were SELECTed and also have some conditional.
In reality, though, this gives me the "1140 - Mixing of GROUP columns with no GROUP columns illegal if no GROUP BY" error. The problem is, I don't want to GROUP BY. I want this COUNT to redundantly repeat the number of results that SELECT found whose TimeLogged='Noon'. Or I want to remove the AND clause and include, as a column in the result of the SELECT statement, the number of results that that SELECT statement found.
GROUP BY is not the answer, because that causes it to get the COUNT of only the rows who have the same value in some column. And COUNT might not even be the way to go about this, although it's what comes to mind. FOUND_ROWS() won't do the trick, since it needs to be part of a secondary query, and I only get one (plus there's no LIMIT involved), and ROW_COUNT() doesn't seem to work since it's a SELECT statement.
I may be approaching it from the wrong angle entirely. But what I want to do is get COUNT-type information about the results of a SELECT query, as well as all the other information that the SELECT query returned, in one single query.
=== Here's what I've got so far ===
SELECT Tabl.Name, Tabl.Address, Results.Totals
FROM Databas.Tabl
LEFT JOIN (SELECT COUNT(*) AS Totals, 0 AS Bonus
FROM Databas.Tabl
WHERE TimeLogged='Noon'
GROUP BY NULL) Results
ON 0 = Results.Bonus
WHERE Status='URGENT';
This does use sub-SELECTs, which I was initially hoping to avoid, but now realize that hope may have been foolish. Plus it seems like the COUNTing SELECT sub-queries will be less costly than the main query since the COUNT conditionals are all on one table, but the real SELECT I'm working with has to join on multiple different tables for derived information.
The key realizations are that I can GROUP BY NULL, which will return a single result so that COUNT(*) will actually catch everything, and that I can force a correlation to this column by just faking a Bonus column with 0 on both tables.
It looks like this is the solution I will be using, but I can't actually accept it as an answer until tomorrow. Thanks for all the help.

SELECT Tabl.Name, Tabl.Address, Results.Totals
FROM Databas.Tabl
LEFT JOIN (SELECT COUNT(*) AS Totals, 0 AS Bonus
FROM Databas.Tabl
WHERE TimeLogged='Noon'
GROUP BY NULL) Results
ON 0 = Results.Bonus
WHERE Status='URGENT';
I figured this out thanks to ideas generated by multiple answers, although it's not actually the direct result of any one. Why this does what I need has been explained in the edit of the original post, but I wanted to be able to resolve the question with the proper answer in case anyone else wants to perform this silly kind of operation. Thanks to all who helped.

You could probably do a union instead. You'd have to add a column to the original query and select 0 in it, then UNION that with your second query, which returns a single column. To do that, the second query must also select empty fields to match the first.
SELECT Cnt = 0, Name, Address FROM Databas.Tabl WHERE Status='URGENT'
UNION ALL
SELECT COUNT(*) as Cnt, Name='', Address='' FROM Databas.Tabl WHERE Status='URGENT' AND TimeLogged='Noon';
It's a bit of a hack, but what you're trying to do isn't ideal...

Does this do what you need?
SELECT Tabl.Name ,
Tabl.Address ,
COUNT(Results.UID) AS GrandTotal,
COUNT(CASE WHEN Results.TimeLogged='Noon' THEN 1 END) AS NoonTotal
FROM Databas.Tabl
LEFT JOIN Databas.Tabl Results
ON Tabl.UID = Results.UID
WHERE Status ='URGENT'
GROUP BY Tabl.Name,
Tabl.Address
WITH ROLLUP;

The API you're using to access the database should be able to report to you how many rows were returned - say, if you're running perl, you could do something like this:
my $sth = $dbh->prepare("SELECT Name, Address FROM Databas.Tabl WHERE Status='URGENT'");
my $rv = $sth->execute();
my $rows = $sth->rows;

Grouping by Tabl.id i dont believe would mess up the results. Give it a try and see if thats what you want.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008