Left Outer Join with Missing Records

I'm selecting data from several tables, but the main idea is that a product may or may not have a discount record associated with it, as either a percent-off or dollar-off amount. I'm using a left outer join (which may be incorrect) and am getting back the same dollar and percent values regardless of whether the records exist.
The query looks something like:
SELECT Items.ItemID, Items.name, Items.price,
ItemDiscounts.percentOff, ItemDiscounts.dollarOff,
ItemAttributes.ColorName, ItemStuff.StuffID
FROM Items, ItemAttributes, ItemStuff
LEFT OUTER JOIN ItemDiscounts
ON ItemDiscounts.ItemID = ItemID
AND (
ItemDiscounts.percentOff > 0
OR ItemDiscounts.dollarOff > 0
)
WHERE Items.ItemID = ItemAttributes.ItemID
AND ItemStuff.ItemID = Items.ItemID
GROUP BY ItemStuff.StuffID
The weird part is that in all results, percentOff returns "1" and dollarOff returns "0", regardless of whether each item has its own associated discount record. For kicks, I changed ItemDiscounts.percentOff > 0 to ItemDiscounts.percentOff > 1; then dollarOff changed to all 2's and percentOff was all 0's.
I'm somewhat baffled on this, so any help would be appreciated.

You have an unqualified reference to ItemID in your ON clause... not clear why that's not throwing an "ambiguous column" exception. (Apparently, it's not ambiguous to MySQL, and MySQL is determining which ItemID is being referenced; the odds are good that it's not the one you intended.)
Also, your query includes references to the ItemStuff rowsource, but there is no such rowsource shown in your query.
I also suspect that the behavior of the GROUP BY is giving you a result set that doesn't meet your expectation. (More than likely, right now, it's masking the real problem in your query, which could be a CROSS JOIN operation that you didn't intend.)
I suggest you try your query without the GROUP BY clause, and confirm the resultset is what you would expect absent the GROUP BY clause.
NOTE: Most other relational database engines will throw an exception with a GROUP BY like the one in your query. They (basically) require that every non-aggregate in the SELECT list be included in the GROUP BY. You can get MySQL to behave the same way by including ONLY_FULL_GROUP_BY in sql_mode. MySQL is more liberal, but the result set you get back may not conform to your expectation.
NOTE: I don't see how this query is passing a semantics check, and is returning any resultset at all, given the references to a non-existent ItemStuff rowsource.
For improved readability, I recommend that you not use the comma as the join operator, and that you instead use the JOIN keyword. I also recommend that you move the join predicates from the WHERE clause to an ON clause. I also prefer to give an alias to each rowsource, and use that alias to qualify the columns from it.
Given what you show in your query, I'd write (the parts I can make sense of) like this:
SELECT i.ItemID
, i.name
, i.price
, d.percentOff
, d.dollarOff
, a.ColorName
FROM Items i
JOIN ItemAttributes a
ON a.ItemID = i.ItemID
LEFT
JOIN ItemDiscounts d
ON d.ItemID = i.ItemID
AND ( d.percentOff > 0 OR d.dollarOff > 0 )
I've omitted ItemStuff.StuffID from the SELECT list, because I don't see any ItemStuff rowsource.
I also exclude the WHERE clause, because I don't see any ItemStuff rowsource in your query.
-- WHERE ItemStuff.ItemID = i.ItemID
I omit the GROUP BY because, again, I don't see any ItemStuff rowsource in your query, and because the behavior of the GROUP BY is likely not what I expect, but is rather masking a problem in my query.
-- GROUP BY ItemStuff.StuffID
UPDATE:
@Kyle, the fact your query "timed out" leads me to believe you are generating WAY MORE rows than you expect, like you have a Cartesian product (every row from a table is being "matched" to every row in some other table). With 10,000 rows in one table and 10,000 rows in the other table, that will generate 100,000,000 rows.
I think the GROUP BY clause is masking the real problem.
I recommend that for development, you include the PRIMARY KEY of each table as the leading columns in your result set. I would add some reasonable predicates to the driving table (e.g. i.ItemID IN (2,3,5,7)) to limit the size of the result set, and ORDER BY the primary key... that should help you identify an unintended Cartesian product.
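As a rough illustration of the two points above, here is a minimal sketch that runs the rewritten query and then counts the rows of a comma join with no join predicate. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table contents are invented sample data, not the asker's.

```python
import sqlite3

# SQLite stands in for MySQL here; all rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, name TEXT, price REAL);
    CREATE TABLE ItemAttributes (ItemID INTEGER, ColorName TEXT);
    CREATE TABLE ItemDiscounts (ItemID INTEGER, percentOff REAL, dollarOff REAL);
    INSERT INTO Items VALUES (1, 'mug', 8.0), (2, 'cap', 12.0), (3, 'pen', 2.0);
    INSERT INTO ItemAttributes VALUES (1, 'red'), (2, 'blue'), (3, 'black');
    INSERT INTO ItemDiscounts VALUES (1, 10, 0), (3, 0, 1);
""")

# Every column is qualified with an alias and each join has its own ON clause.
joined = conn.execute("""
    SELECT i.ItemID, i.name, d.percentOff, d.dollarOff, a.ColorName
      FROM Items i
      JOIN ItemAttributes a ON a.ItemID = i.ItemID
      LEFT JOIN ItemDiscounts d
        ON d.ItemID = i.ItemID
       AND (d.percentOff > 0 OR d.dollarOff > 0)
     ORDER BY i.ItemID
""").fetchall()

# A comma join without a predicate pairs every row with every other row.
cartesian_count = conn.execute(
    "SELECT COUNT(*) FROM Items, ItemAttributes"
).fetchone()[0]

for row in joined:
    print(row)
print("rows without a join predicate:", cartesian_count)
```

Item 2 has no discount row, so its percentOff and dollarOff come back as NULL (None in Python), while the predicate-free comma join yields 3 x 3 = 9 rows from two 3-row tables.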

Do you get what you want when you remove these lines from your query?
AND (
ItemDiscounts.percentOff > 0
OR ItemDiscounts.dollarOff > 0
)

Once you specify an absolute value for the possibly-null side of an outer join, your WHERE clause has to account for it.
Try it with the following clause:
AND (
ItemDiscounts.percentOff > 0
OR ItemDiscounts.percentOff is null
OR ItemDiscounts.dollarOff > 0
OR ItemDiscounts.dollarOff is null
)
Additionally, you're specifying a GROUP BY without an aggregate. This makes no sense in most cases. You probably want ORDER BY to sort.
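The difference between filtering the outer-joined table in WHERE, in WHERE with NULL checks, and in the ON clause can be seen in a small experiment. This is only a sketch under assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL, and the rows are invented.

```python
import sqlite3

# SQLite stands in for MySQL; the rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Items (ItemID INTEGER PRIMARY KEY);
    CREATE TABLE ItemDiscounts (ItemID INTEGER, percentOff REAL, dollarOff REAL);
    INSERT INTO Items VALUES (1), (2), (3);
    -- item 1 has a real discount, item 2 has none, item 3 has an all-zero row
    INSERT INTO ItemDiscounts VALUES (1, 10, 0), (3, 0, 0);
""")

def item_ids(sql):
    """Run a query and return just the ItemID column as a list."""
    return [row[0] for row in conn.execute(sql)]

base = """
    SELECT i.ItemID
      FROM Items i
      LEFT JOIN ItemDiscounts d ON d.ItemID = i.ItemID
"""

# Filter in WHERE: rows where d.* is NULL fail the test, so item 2 disappears.
where_only = item_ids(
    base + " WHERE d.percentOff > 0 OR d.dollarOff > 0 ORDER BY i.ItemID")

# Filter in WHERE with explicit NULL checks: item 2 is kept, item 3 is not.
where_with_null = item_ids(base + """
    WHERE d.percentOff > 0 OR d.percentOff IS NULL
       OR d.dollarOff > 0 OR d.dollarOff IS NULL
    ORDER BY i.ItemID""")

# Filter in ON: every item is kept; non-qualifying discounts just become NULL.
filter_in_on = item_ids(
    base + " AND (d.percentOff > 0 OR d.dollarOff > 0) ORDER BY i.ItemID")

print(where_only, where_with_null, filter_in_on)
```

Note that the IS NULL variant and the ON-clause variant are not interchangeable: an item whose only discount row is all zeros is dropped by the former and kept (with NULL discount columns) by the latter.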

Related

MySQL performance - cross join vs left join

I am wondering how MySQL (or its underlying engine) processes the queries.
There are two queries below (one uses left join and the other uses cross join), which eventually give the same result.
My question is, how come the processing time of the two queries is similar?
I expected the first query to run quicker because, with a left join, the size of the intermediate "table" won't expand, while the second query makes it relatively larger (my assumption is that the computer needs to build the result of the cross join from multiple tables before it can go ahead and apply the where clause).
select s.*, a.score as score_01, b.score as score_02
from student s
left join (select * from sc where cid = '01') a using (sid)
left join (select * from sc where cid = '02') b using (sid)
where a.score > b.score;
select s.*, a.score as score_01, b.score as score_02
from student s
,(select * from sc where cid = '01') a
,(select * from sc where cid = '02') b
where a.score > b.score and a.sid = b.sid and s.sid = a.sid;
I tried both queries and expected the processing time of the first to be shorter, but that is not the case.

Add this to sc:
INDEX(sid, cid, score)
Better yet, if you have a useless id on the table, replace it with
PRIMARY KEY(sid, cid)
(Assuming that pair is Unique.)
With either of those fixes, I expect both of your queries to run at similar speed, and faster than currently.
For further discussion, please provide SHOW CREATE TABLE.
Addressing some of the Comments
MySQL ignores the keywords INNER, OUTER, and CROSS. So, it is up to the WHERE to figure out whether it is "inner" or "outer".
MySQL throws the ON and WHERE conditions together (except when it matters for LEFT), then decides what is used for filtering (WHERE) so it may be able to do that first. Then other conditions (which belonged in ON) help it get to the 'next' table.
So... Please use ON to say how the tables are related; use WHERE for filtering. (And don't use the old comma-join.)
That is, MySQL will [usually] look at one table at a time, doing a "Nested Loop Join" (NLJ) to get to the next.
There are many possible ways to evaluate a JOIN; MySQL ponders which one might be best, then uses that.
The order of non-LEFT JOINs does not matter, nor does the order of expressions AND'd together in WHERE.
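To check that the join style really does not change the answer, the following sketch runs three formulations of the question's query against the same invented data: the LEFT JOIN version, the comma-join version, and an explicit JOIN ... ON version. SQLite (via Python's sqlite3) is used as a stand-in for MySQL, and the third formulation is my own rewrite, not something from the question.

```python
import sqlite3

# SQLite is a stand-in for MySQL; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sc (sid INTEGER, cid TEXT, score INTEGER,
                     PRIMARY KEY (sid, cid));
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO sc VALUES (1, '01', 80), (1, '02', 70),
                          (2, '01', 60), (2, '02', 90),
                          (3, '01', 50);
""")

queries = {
    # LEFT JOINs, but the WHERE test on a.score and b.score rejects NULLs.
    "left_join": """
        SELECT s.sid, a.score, b.score
          FROM student s
          LEFT JOIN (SELECT * FROM sc WHERE cid = '01') a ON a.sid = s.sid
          LEFT JOIN (SELECT * FROM sc WHERE cid = '02') b ON b.sid = s.sid
         WHERE a.score > b.score
         ORDER BY s.sid""",
    # Old-style comma join with every condition in WHERE.
    "comma_join": """
        SELECT s.sid, a.score, b.score
          FROM student s,
               (SELECT * FROM sc WHERE cid = '01') a,
               (SELECT * FROM sc WHERE cid = '02') b
         WHERE a.score > b.score AND a.sid = b.sid AND s.sid = a.sid
         ORDER BY s.sid""",
    # ON says how the tables relate; WHERE does the filtering.
    "explicit_join": """
        SELECT s.sid, a.score, b.score
          FROM student s
          JOIN sc a ON a.sid = s.sid
          JOIN sc b ON b.sid = s.sid
         WHERE a.cid = '01' AND b.cid = '02' AND a.score > b.score
         ORDER BY s.sid""",
}

results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

All three return only student 1 (80 versus 70). Student 3 has no row for course '02', so the comparison with NULL removes that row even in the LEFT JOIN version.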
In some situations, a HAVING expression can be (and is) moved to the WHERE clause.
Although FROM comes before WHERE, the two get somewhat tangled up together. But, in general, the clauses are required to be in a certain order, and that order is logically the order that things have to happen in.
It is up to the Optimizer to combine steps. For example
WHERE a = 1
ORDER BY b
and the table has INDEX(a,b) -- The index will be used to do both, essentially at the same time. Ditto for
SELECT a, MAX(b)
...
GROUP BY a
ORDER BY a
can hop through the BTree index on (a,b) and deliver the results without an extra sort pass for either the GROUP BY or ORDER BY.
SELECT x is executed after WHERE y = 'abc' -- Well, in some sense it is. But if you have INDEX(y,x), the Optimizer is smart enough to grab the x values while it is performing the WHERE.
When a WHERE references more than one table of a JOIN, the Optimizer has a quandary. Which table should it start its NLJ with? It has some statistics to help make the decision, but it does not always get it right. It will usually
1. filter on one of the tables;
2. NLJ to get to the next table, meanwhile throwing in any WHERE clauses for that table in with the ON clause;
3. repeat for other tables.
When there is both a WHERE and an ORDER BY, the Optimizer will usually filter, then sort. But sometimes (not always correctly) it will decide to use an index for the ORDER BY (thereby eliminating the sort) and filter as it reads the table. LIMIT, which is logically done last, further muddies the decision.
MySQL does not have FULL OUTER JOIN. It can be simulated with two JOIN and a UNION. (It is only very rarely needed.)
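One way to read "two JOIN and a UNION" is sketched below: a LEFT JOIN in each direction, combined with UNION. This is an illustration under assumptions, not the only formulation. The tables left_t and right_t are hypothetical, and SQLite (via Python's sqlite3) stands in for MySQL, where the second half would more commonly be written as a RIGHT JOIN.

```python
import sqlite3

# SQLite stands in for MySQL; left_t and right_t are hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_t  (id INTEGER PRIMARY KEY, lv TEXT);
    CREATE TABLE right_t (id INTEGER PRIMARY KEY, rv TEXT);
    INSERT INTO left_t  VALUES (1, 'a'), (2, 'b');
    INSERT INTO right_t VALUES (2, 'x'), (3, 'y');
""")

# First half keeps every left_t row, second half keeps every right_t row;
# UNION merges them and removes the matched rows that appear in both halves.
full_outer = set(conn.execute("""
    SELECT l.id, l.lv, r.id, r.rv
      FROM left_t l
      LEFT JOIN right_t r ON r.id = l.id
    UNION
    SELECT l.id, l.lv, r.id, r.rv
      FROM right_t r
      LEFT JOIN left_t l ON l.id = r.id
""").fetchall())

for row in sorted(full_outer, key=repr):
    print(row)
```

Because UNION removes duplicates, this version also collapses genuinely duplicated rows in the inputs; when that matters, a UNION ALL whose second half keeps only the unmatched rows is the usual alternative.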

Mysql - Sum by other column value

Here's the problem. I have a long but not very complex query:
SELECT SUM(x.value)
FROM valuetable AS x
LEFT JOIN jointable_1 AS y
LEFT JOIN jointable_2 AS z
etc
...
GROUP BY y.id, z.id
There are n left joins, and I need to keep it this way, because a new left join must be possible at any time. I obviously get duplicated values in the SUM, since the joined tables can have multiple matching rows, and I cannot break any of them out into a subquery because the WHERE conditions must stay flexible. I need only one x.value per x.id in the SUM; that's also obvious.
-I cannot add x.id to GROUP BY, since I need one row with the sum per y.id.
-I cannot use the calculation:
SUM(x.value)*COUNT(DISTINCT x.id)/COUNT(*)
since there can be any number of x.values in sum, as different x.id-s have different amount of joins.
-I cannot go for DISTINCT x.value, since any x.id can have any x.value and they can contain same value.
-I don't know how to create a subquery for sum, since I cannot use the aggregated value (for example GROUP_CONCAT(DISTINCT x.id)) in subquery, or can I?
Anyways, that's it. I know I can rearrange the query (subqueries instead of joins, a different FROM), but I want to leave that as the last resort. Is there a way to achieve what I want?

Sorry to say, there's no general way to do what you want without subqueries (or maybe views).
A bit of jargon: "Cardinality". For our purpose it's the number of rows in a table or a result set. (For our purpose a result set is a kind of virtual table.)
For aggregate functions like SUM(col) and COUNT(*) to give good results, we must attend to the cardinality of the table being summarized. This kind of thing
SELECT DATE(sale_time) sale_date,
store_id,
SUM(sale_amount) total_sales
FROM sale
GROUP BY DATE(sale_time), store_id
summarizes the same cardinality of result table as the underlying table, so it generates useful results.
But, if we do this
SELECT DATE(sale.sale_time) sale_date,
sale.store_id,
SUM(sale.sale_amount) total_sales,
COUNT(promo.promo_id) promos
FROM sale
LEFT JOIN promo ON sale.store_id = promo.store_id
AND DATE(sale.sale_time) = promo.promo_date
GROUP BY DATE(sale.sale_time), sale.store_id
we wreck the cardinality of the summarized result set. This will never work unless we know for sure that each store had either zero or one promo records for each given day. Why not? The LEFT JOIN operation affects the cardinality of the virtual table being summarized. That means some sale_amount values may show up in the SUM more than once, and therefore the SUM won't be correct, or trustworthy.
How can you prevent LEFT JOIN operations from messing up your cardinality? Make sure your LEFT JOIN's ON clause matches each row on the left to exactly zero rows, or exactly one row, on the right. That is, make sure your (virtual) tables on either side of the JOIN have appropriate cardinality.
(In entity-relationship jargon, your SUM fails because you join two entities with a one-to-many relationship before you do the sum.)
The theoretically cleanest way to do it is to perform both aggregate operations before the join. This joins two virtual tables in a way that the LEFT JOIN is either one-to-none or one-to-one
SELECT sales.sale_date,
sales.store_id,
sales.total_sales,
promos.promo_count
FROM (
SELECT DATE(sale_time) sale_date,
store_id,
SUM(sale_amount) total_sales
FROM sale
GROUP BY DATE(sale_time), store_id
) sales
LEFT JOIN (
SELECT store_id,
promo_date,
COUNT(*) promo_count
FROM promo
GROUP BY store_id, promo_date
) promos ON sales.store_id = promos.store_id
AND sales.sale_date = promos.promo_date
Although this SQL is complex, most servers handle this kind of pattern efficiently.
Troubleshooting tip: If you see SUM() ... FROM ... JOIN ... GROUP BY all at the same level of a query, you may have cardinality problems.
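The inflated SUM is easy to reproduce. The sketch below compares the join-then-aggregate query with the aggregate-then-join query on two sales and two promos for the same store and day. It makes several assumptions: SQLite (via Python's sqlite3) stands in for MySQL, the rows are invented, and a plain sale_date column replaces DATE(sale_time) to keep the example short.

```python
import sqlite3

# SQLite stands in for MySQL; rows are invented, and sale_date is a
# simplification of DATE(sale_time) from the answer above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale  (sale_id INTEGER PRIMARY KEY, store_id INTEGER,
                        sale_date TEXT, sale_amount INTEGER);
    CREATE TABLE promo (promo_id INTEGER PRIMARY KEY, store_id INTEGER,
                        promo_date TEXT);
    INSERT INTO sale  VALUES (1, 1, '2024-01-01', 100), (2, 1, '2024-01-01', 50);
    INSERT INTO promo VALUES (1, 1, '2024-01-01'), (2, 1, '2024-01-01');
""")

# Join first, aggregate second: each sale row is repeated once per promo.
naive = conn.execute("""
    SELECT sale.sale_date, sale.store_id,
           SUM(sale.sale_amount) AS total_sales,
           COUNT(promo.promo_id) AS promos
      FROM sale
      LEFT JOIN promo ON sale.store_id = promo.store_id
                     AND sale.sale_date = promo.promo_date
     GROUP BY sale.sale_date, sale.store_id
""").fetchall()

# Aggregate each table first, then join: at most one row on each side.
fixed = conn.execute("""
    SELECT sales.sale_date, sales.store_id,
           sales.total_sales, promos.promo_count
      FROM (SELECT sale_date, store_id, SUM(sale_amount) AS total_sales
              FROM sale
             GROUP BY sale_date, store_id) sales
      LEFT JOIN (SELECT store_id, promo_date, COUNT(*) AS promo_count
                   FROM promo
                  GROUP BY store_id, promo_date) promos
        ON sales.store_id = promos.store_id
       AND sales.sale_date = promos.promo_date
""").fetchall()

print("join then aggregate:", naive)
print("aggregate then join:", fixed)
```

With two promos on the day, the first query doubles the sales total (300 instead of 150) and counts four promos instead of two; the pre-aggregated version reports the true figures.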

Where clause doesn't limit records when aggregate used in select

I have this select in my MySQL DB:
select r.ID, r.ReservationDate, SUM(p.Amount) AS Amount
from Reservations r
join Payments p
on r.ID = p.ReservationID
where r.ConfirmationNumber = '123456'
and p.CCLast4 = '3506'
and r.ID = 54321
It gives me exactly 1 record -- the correct record -- as expected. But if I change the CCLast4 (3506) to any old number/string I want, I still get the record back, but Amount is null. I would expect no record at all because the where clause no longer matches. If I change the ConfirmationNumber or the ID, as expected I get back no results. But CCLast4 is being completely ignored.
If I remove the aggregate: SUM(p.Amount) AS Amount - all is good, and the CCLast4 demands the correct number before returning the string.
I don't understand why the aggregate causes the where clause related to the Payments table (CCLast4 column) to be ignored.
How can I change the query so that I can use the aggregate in the select AND all the where clauses are honored?

This is actually the expected behaviour. From the manual:
Without GROUP BY, there is a single group and it is nondeterministic which [non-aggregated column] value to choose for the group.
Although not very clearly stated, this means that you always get one group (so one row), even for an empty table.
It is also worth emphasizing that the values that MySQL chooses for the non-aggregated columns in your select, r.ID and r.ReservationDate, are in fact nondeterministic and will specifically vary across MySQL versions (e.g. they will usually be null for MySQL 8.0 while they will usually contain existing values for earlier versions).
The solution is similarly subtle - add a group by (so the quoted sentence does not apply anymore):
...
where r.ConfirmationNumber = '123456'
and p.CCLast4 = 'xxx'
and r.ID = 54321
group by r.ID, r.ReservationDate
should give you 0 rows.
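The one-row-versus-zero-rows behaviour can be reproduced with a tiny experiment. This sketch uses SQLite through Python's sqlite3 module as a stand-in for MySQL (SQLite also accepts non-aggregated columns in such a query), and the table contents are invented.

```python
import sqlite3

# SQLite stands in for MySQL; the single reservation and payment are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Reservations (ID INTEGER PRIMARY KEY, ReservationDate TEXT,
                               ConfirmationNumber TEXT);
    CREATE TABLE Payments (ReservationID INTEGER, CCLast4 TEXT, Amount REAL);
    INSERT INTO Reservations VALUES (54321, '2024-05-01', '123456');
    INSERT INTO Payments VALUES (54321, '3506', 75.0);
""")

base = """
    SELECT r.ID, r.ReservationDate, SUM(p.Amount) AS Amount
      FROM Reservations r
      JOIN Payments p ON r.ID = p.ReservationID
     WHERE r.ConfirmationNumber = '123456'
       AND p.CCLast4 = 'xxx'      -- deliberately matches no payment
       AND r.ID = 54321
"""

# No GROUP BY: the empty input still forms one group, so one row comes back.
without_group_by = conn.execute(base).fetchall()

# With GROUP BY: no input rows means no groups, so no rows come back.
with_group_by = conn.execute(
    base + " GROUP BY r.ID, r.ReservationDate").fetchall()

print("without GROUP BY:", without_group_by)
print("with GROUP BY:   ", with_group_by)
```

The first query returns a single row whose Amount is NULL even though the WHERE clause matched nothing; the second returns an empty result.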

mySQL bringing back result it should not

I have a table filled with tasting notes written by users, and another table that holds ratings that other users give to each tasting note.
The query that brings up all notes that are written by other users that you have not yet rated looks like this:
SELECT tastingNotes.userID, tastingNotes.beerID, tastingNotes.noteID, tastingNotes.note, COALESCE(sum(tasteNoteRate.Score), 0) as count,
CASE
WHEN tasteNoteRate.userVoting = 1162 THEN 1
ELSE 0
END AS userScored
FROM tastingNotes
left join tasteNoteRate on tastingNotes.noteID = tasteNoteRate.noteID
WHERE tastingNotes.userID != 1162
Group BY tastingNotes.noteID
HAVING userScored < 1
ORDER BY count, userScored
User 1162 has written a note for note 113. In the tasteNoteRate table it shows up as:
noteID | userVoting | score
113 | 1162 | 0
but it is still returned each time the above query is run....

MySQL allows you to use group by in a rather special way without complaining, see the documentation:
If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard SQL use of GROUP BY permits the select list, HAVING condition, or ORDER BY list to refer to nonaggregated columns even if the columns are not functionally dependent on GROUP BY columns. [...] In this case, the server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate, which is probably not what you want.
This behaviour was the default behaviour prior to MySQL 5.7.
In your case that means, if there is more than one row in tasteNoteRate for a specific noteID, so if anyone else has already voted for that note, userScored, which is using tasteNoteRate.userVoting without an aggregate function, will be based on a random row - likely the wrong one.
You can fix that by using an aggregate:
select ...,
max(CASE
WHEN tasteNoteRate.userVoting = 1162 THEN 1
ELSE 0
END) AS userScored
from ...
or, because the result of a comparison (to something other than null) is either 1 or 0, you can also use a shorter version:
select ...,
coalesce(max(tasteNoteRate.userVoting = 1162),0) AS userScored
from ...
To be prepared for an upgrade to MySQL 5.7 (and enabled ONLY_FULL_GROUP_BY), you should also already group by all non-aggregate columns in your select-list: group by tastingNotes.userID, tastingNotes.beerID, tastingNotes.noteID, tastingNotes.note.
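Here is a small sketch of the aggregated userScored flag in action. It assumes SQLite (via Python's sqlite3) as a stand-in for MySQL, uses invented rows, trims the select list to the columns needed for the demonstration, and renames the count alias to score_total.

```python
import sqlite3

# SQLite stands in for MySQL; notes and ratings are invented sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tastingNotes (noteID INTEGER PRIMARY KEY, userID INTEGER,
                               note TEXT);
    CREATE TABLE tasteNoteRate (noteID INTEGER, userVoting INTEGER,
                                Score INTEGER);
    INSERT INTO tastingNotes VALUES
        (113, 7, 'hoppy'),       -- already rated by user 1162
        (114, 8, 'malty'),       -- rated only by someone else
        (115, 1162, 'own note'), -- written by user 1162
        (116, 9, 'crisp');       -- no ratings at all
    INSERT INTO tasteNoteRate VALUES (113, 55, 1), (113, 1162, 0), (114, 55, 1);
""")

# MAX() over the CASE looks at every rating in the group, so the flag is 1
# whenever any rating row for the note belongs to user 1162.
unrated = conn.execute("""
    SELECT n.noteID,
           COALESCE(SUM(r.Score), 0) AS score_total,
           MAX(CASE WHEN r.userVoting = 1162 THEN 1 ELSE 0 END) AS userScored
      FROM tastingNotes n
      LEFT JOIN tasteNoteRate r ON n.noteID = r.noteID
     WHERE n.userID != 1162
     GROUP BY n.noteID
    HAVING userScored < 1
     ORDER BY n.noteID
""").fetchall()

print(unrated)
```

Note 113 no longer appears, because the group contains user 1162's rating no matter which row the server happens to read first, while notes 114 and 116 are still listed.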
A different way of writing your query (amongst others) would be to do the grouping of tasteNoteRate in a subquery, so you don't have to group by all the columns of tastingNotes:
select tastingNotes.*,
coalesce(rates.count, 0) as count,
coalesce(rates.userScored,0) as userScored
from tastingNotes
left join (
select tasteNoteRate.noteID,
sum(tasteNoteRate.Score) as count,
max(tasteNoteRate.userVoting = 1162) as userScored
from tasteNoteRate
group by tasteNoteRate.noteID
) rates
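on tastingNotes.noteID = rates.noteID
where tastingNotes.userID != 1162
and coalesce(rates.userScored, 0) = 0
order by count;
This also allows you to get the notes the user voted on by changing coalesce(rates.userScored, 0) = 0 in the where-clause to = 1 (or remove it to get both). The condition belongs in the where-clause rather than the on-clause: in the on-clause of a left join it would only blank out the rates columns, and the note itself would still be returned.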

Change to an inner join.
The tasteNoteRate table is being left joined to the tastingNotes, which means that the full tastingNotes table (matching the where) is returned, and then expanded by the matching fields in the tasteNoteRate table. If tasteNoteRate is not satisfied, it doesn't prevent tastingNotes from returning the matched fields. The inner join will take the intersection.
See here for more explanation of the types of joins:
What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?
Make sure to create an index on noteID in both tables or this query and use case will quickly explode.
Note: Based on what you've written as the use case, I'm still not 100% certain that you want to join on noteID. As it is, it will try to give you a joined table on all the notes joined with all the ratings for all users ever. I think the CASE...END is just going to interfere with the query optimizer and turn it into a full scan + join. Why not just add another clause to the where..."and tasteNoteRate.userVoting = 1162"?
If these tables are not 1-1, as it looks like (given the sum() and "group by"), then you will be faced with an exploding problem with the current query. If every note can have 10 different ratings, and there are 10 notes, then there are 100 candidate result rows. If it grows to 1000 and 1000, you will run out of memory fast. Eliminating a few rows that the userID hasn't voted on will remove like what 10 rows from eventually 1,000,000+, and then sum and group them?
The other way you can do it is to reverse the left join:
select ...,sum()... from tasteNoteRate ... left join tastingNotes using (noteID) where userID != xxx group by noteID, that way you only get tastingNotes information for other users' notes.
Maybe that helps, maybe not, but yeah, SCHEMA and specific use cases/example data would be helpful.
With this kind of "ratings of ratings", sometimes it's better to maintain a summary table of the vote totals and just track which notes the user has already voted on. That is, don't sum them all up in the select query. Instead, sum it up in the insert...on duplicate key update (total = total + 1). At least that's how I handle the problem in some user ranking tables. They just grow so big so fast.
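A minimal sketch of such a summary table follows. It is an assumption-heavy illustration: the table noteVoteTotals and the helper record_vote are hypothetical names, and because SQLite (via Python's sqlite3) stands in for MySQL, the upsert is written with SQLite's INSERT ... ON CONFLICT ... DO UPDATE, which plays the role of MySQL's INSERT ... ON DUPLICATE KEY UPDATE.

```python
import sqlite3

# SQLite stands in for MySQL; noteVoteTotals is a hypothetical summary table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE noteVoteTotals (
        noteID INTEGER PRIMARY KEY,
        total  INTEGER NOT NULL
    )
""")

def record_vote(note_id, score):
    """Add one vote to the running total, creating the row on first use.

    SQLite spelling of the upsert; in MySQL the equivalent would be
    INSERT ... ON DUPLICATE KEY UPDATE total = total + VALUES(total).
    """
    conn.execute(
        """
        INSERT INTO noteVoteTotals (noteID, total) VALUES (?, ?)
        ON CONFLICT(noteID) DO UPDATE SET total = total + excluded.total
        """,
        (note_id, score),
    )

record_vote(113, 1)
record_vote(113, 1)
record_vote(114, 1)

# Reading the totals is now a primary-key lookup instead of a SUM over votes.
totals = dict(conn.execute("SELECT noteID, total FROM noteVoteTotals"))
print(totals)
```

The trade-off is that every write path has to keep the summary in step with the detail rows, in exchange for cheap reads.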

SQL JOIN Query to return rows where we did NOT find a match in joined table

More of a theory/logic question but what I have is two tables: links and options. Links is a table where I add rows that represent a link between a product ID (in a separate products table) and an option. The options table holds all available options.
What I'm trying to do (but struggling to create the logic for) is to join the two tables, returning only the rows where there is no option link in the links table, therefore representing which options are still available to add to the product.
Is there a feature of SQL that might help me here? I'm not tremendously experienced with SQL yet.

Your table design sounds fine.
If this query returns the id values of the "options" linked to a particular "product"...
SELECT k.option_id
FROM links k
WHERE k.product_id = 'foo'
Then this query would get the details of all the options related to the "product"
SELECT o.id
, o.name
FROM options o
JOIN links k
ON k.option_id = o.id
WHERE k.product_id = 'foo'
Note that we can actually move the "product_id='foo'" predicate from the WHERE clause to the ON clause of the JOIN, for an equivalent result, e.g.
SELECT o.id
, o.name
FROM options o
JOIN links k
ON k.option_id = o.id
AND k.product_id = 'foo'
(Not that it makes any difference here, but it would make a difference if we were using an OUTER JOIN: in the WHERE clause, it would negate the "outer-ness" of the join, and make it equivalent to an INNER JOIN.)
But, none of that answers your question, it only sets the stage for answering your question:
How do we get the rows from "options" that are NOT linked to particular product?
The most efficient approach is (usually) an anti-join pattern.
That is, we get all the rows from "options", along with any matching rows from "links" (for a particular product_id, in your case). That result set will include the rows from "options" that don't have a matching row in "links".
The "trick" is to filter out all the rows that had matching row(s) found in "links". That will leave us with only the rows that didn't have a match.
To filter those rows, we use a predicate in the WHERE clause that checks whether a match was found. We do that by checking a column that we know for certain will be NOT NULL if a matching row was found. And we know for certain that column will be NULL if a matching row was NOT found.
Something like this:
SELECT o.id
, o.name
FROM options o
LEFT
JOIN links k
ON k.option_id = o.id
AND k.product_id = 'foo'
WHERE k.option_id IS NULL
The "LEFT" keyword specifies an "outer" join operation, we get all the rows from "options" (the table on the "left" side of the JOIN) even if a matching row is not found. (A normal inner join would filter out rows that didn't have a match.)
The "trick" is in the WHERE clause... if we found a matching row from links, we know that the "option_id" column returned from "links" would not be NULL. It can't be NULL if it "equals" something, and we know it had to "equals" something because of the predicate in the ON clause.
So, we know that the rows from options that didn't have a match will have a NULL value for that column.
It takes a bit to get your brain wrapped around it, but the anti-join quickly becomes a familiar pattern.
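To make the two steps of the anti-join visible, the sketch below first runs the LEFT JOIN on its own and then adds the IS NULL test. It assumes SQLite (via Python's sqlite3) as a stand-in for MySQL, and the options and links rows are invented.

```python
import sqlite3

# SQLite stands in for MySQL; the options and links rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    INSERT INTO options VALUES (1, 'small'), (2, 'medium'), (3, 'large');
    INSERT INTO links VALUES ('foo', 1), ('bar', 2), ('bar', 3);
""")

# Step 1: the outer join keeps every option; unmatched ones get NULL.
outer_joined = conn.execute("""
    SELECT o.id, o.name, k.option_id
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
       AND k.product_id = 'foo'
     ORDER BY o.id
""").fetchall()

# Step 2: keep only the rows where no match was found.
not_linked = conn.execute("""
    SELECT o.id, o.name
      FROM options o
      LEFT JOIN links k
        ON k.option_id = o.id
       AND k.product_id = 'foo'
     WHERE k.option_id IS NULL
     ORDER BY o.id
""").fetchall()

print("outer join:", outer_joined)
print("anti-join: ", not_linked)
```

Options 2 and 3 are linked to product 'bar' but not to 'foo', so they show up with a NULL option_id in the first result and are exactly the rows the second query returns.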
The "anti-join" pattern isn't the only way to get the result set. There are a couple of other approaches.
One option is to use a query with a "NOT EXISTS" predicate with a correlated subquery. This is somewhat easier to understand, but doesn't usually perform as well:
SELECT o.id
, o.name
FROM options o
WHERE NOT EXISTS ( SELECT 1
FROM links k
WHERE k.option_id = o.id
AND k.product_id = 'foo'
)
That says get me all rows from the options table. But for each row, run a query, and see if a matching row "exists" in the links table. (It doesn't matter what is returned in the select list, we're only testing whether it returns at least one row... I use a "1" in the select list to remind me I'm looking for "1 row".)
This usually doesn't perform as well as the anti-join, but sometimes it does run faster, especially if other predicates in the WHERE clause of the outer query filter out nearly every row, and the subquery only has to run for a couple of rows. (That is, when we only have to check a few needles in a haystack. When we need to process the whole stack of hay, the anti-join pattern is usually faster.)
And the beginner query you're most likely to see is a NOT IN (subquery). I'm not even going to give an example of that. If you've got a list of literals, then by all means, use a NOT IN. But with a subquery, it's rarely the best performer, though it does seem to be the easiest to understand.
Oh, what the hay, I'll give a demo of that as well (not that I'm encouraging you to do it this way):
SELECT o.id
, o.name
FROM options o
WHERE o.id NOT IN ( SELECT k.option_id
FROM links k
WHERE k.product_id = 'foo'
AND k.option_id IS NOT NULL
GROUP BY k.option_id
)
That subquery (inside the parens) gets a list of all the option_id values associated with a product.
Now, for each row in options (in the outer query), we can check the id value to see if it's in that list returned by the subquery.
If we have a guarantee that option_id will never be NULL, we can omit the predicate that tests for "option_id IS NOT NULL". (In the more general case, when a NULL creeps into the resultset, then the outer query can't tell if o.id is in the list or not, and the query doesn't return any rows; so I usually include that, even when it's not required.) The GROUP BY isn't strictly necessary either, especially if there's a unique constraint (guaranteed uniqueness) on the (product_id,option_id) tuple.
But, again, don't use that NOT IN (subquery), except for testing, unless there's some compelling reason to (for example, it manages to perform better than the anti-join.)
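The NULL pitfall with NOT IN (subquery) is worth seeing once. In this sketch a NULL option_id has been deliberately added for product 'foo'. SQLite (via Python's sqlite3) is used as a stand-in for MySQL, and the rows are invented.

```python
import sqlite3

# SQLite stands in for MySQL; note the deliberate NULL option_id for 'foo'.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE options (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE links (product_id TEXT, option_id INTEGER);
    INSERT INTO options VALUES (1, 'small'), (2, 'medium'), (3, 'large');
    INSERT INTO links VALUES ('foo', 1), ('foo', NULL), ('bar', 2);
""")

# Without the guard, "o.id NOT IN (1, NULL)" is never true, so nothing returns.
not_in_unguarded = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE o.id NOT IN (SELECT k.option_id
                          FROM links k
                         WHERE k.product_id = 'foo')
     ORDER BY o.id
""").fetchall()

# With the IS NOT NULL guard the list is just (1), and the query behaves.
not_in_guarded = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE o.id NOT IN (SELECT k.option_id
                          FROM links k
                         WHERE k.product_id = 'foo'
                           AND k.option_id IS NOT NULL)
     ORDER BY o.id
""").fetchall()

# NOT EXISTS is unaffected by the NULL, since it only tests for matching rows.
not_exists = conn.execute("""
    SELECT o.id, o.name
      FROM options o
     WHERE NOT EXISTS (SELECT 1
                         FROM links k
                        WHERE k.option_id = o.id
                          AND k.product_id = 'foo')
     ORDER BY o.id
""").fetchall()

print(not_in_unguarded, not_in_guarded, not_exists)
```

The unguarded NOT IN returns nothing at all, while the guarded NOT IN and the NOT EXISTS version both return options 2 and 3.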
You're unlikely to notice any performance differences with small sets, the overhead of transmitting the statement, parsing it, generating an access plan, and returning results dwarfs the actual "execution" time of the plan. It's with larger sets that the differences in "execution" time become apparent.
EXPLAIN SELECT ... is a really good way to get a handle on the execution plans, to see what MySQL is really doing with your statement.
Appropriate indexes, especially covering indexes, can noticeably improve performance of some statements.

Yes, you can do a LEFT JOIN (if MySQL; there are variations in other dialects) which will include rows in links which do NOT have a match in options. Then test if options.someColumn IS NULL and you will have exactly the rows in links which had no "matching" row in options.

Try something along the lines of this
To count
SELECT Links.linkId, Count(*)
FROM Links
LEFT JOIN Options ON Links.optionId = Options.optionId
Where Options.optionId IS NULL
Group by Links.linkId
To see the lines
SELECT Links.linkId
FROM Links
LEFT JOIN Options ON Links.optionId = Options.optionId
Where Options.optionId IS NULL
