Does MySQL automatically use the coalesce function during a join between tables? - mysql

During a table join, when does MySQL use this function?
The single result column that replaces two common columns is defined
using the coalesce operation. That is, for two t1.a and t2.a the
resulting single join column a is defined as a = COALESCE(t1.a, t2.a),
where:
COALESCE(x, y) = (CASE WHEN x IS NOT NULL THEN x ELSE y END)
https://dev.mysql.com/doc/refman/8.0/en/join.html
I know what the function does, but I want to know when it is used during the join operation. This just makes no sense to me! Can someone show me an example?

That is in reference to redundant column elimination during natural join and join with using. Describing how the columns are excluded from display.
The order of operation is described above the section you referenced.
First, coalesced common columns of the two joined tables, in the order in which they occur in the first table
Second, columns unique to the first table, in order in which they occur in that table
Third, columns unique to the second table, in order in which they occur in that table
Example
t1
| a | b | c |
| 1 | 1 | 1 |
t2
| a | b | d |
| 1 | 1 | 1 |
The join with using
SELECT * FROM t1 JOIN t2 USING (b);
Would result in, t1.b being coalesced (due to USING), followed by the columns unique to the first table, followed by those in the second table.
| b | a | c | a | d |
| 1 | 1 | 1 | 1 | 1 |
Whereas a natural join
SELECT * FROM t1 NATURAL JOIN t2;
Would result in, the t1 columns (or rather common columns from both tables) being coalesced, followed by the unique columns of the first table, followed by those in the second table.
| a | b | c | d |
| 1 | 1 | 1 | 1 |

Related

Fetch distinct count of all columns in a single MySQL query

I am trying to fetch distinct count of all columns in a single query. Consider the below table.
COL1 | COL2 | COL3
A | 5 | C
B | 5 | C
C | 5 | C
C | 5 | C
D | 7 | C
Expected result
DC_COL1 | DC_COL2 | DC_COL3 #DC - Distinct count
4 | 2 | 1
Though the above result can not be achieved (AFAIK) in a single query (single full table scan) using valid group by functions, what are the optimisations that could be done here?
Firing individual queries for each column might result in full table scan for each column. Though the entire table might have come to the buffer pool during the distinct count query for the first column but it will still be a performance issue on large tables.
It can be done in a single table scan:
SELECT
COUNT(DISTINCT COL1) DC_COL1,
COUNT(DISTINCT COL2) DC_COL2,
COUNT(DISTINCT COL3) DC_COL3
FROM tablename

SQL: Update table by mapping two columns to each other

I have the following two tables:
Table A
+-------------------+
|___User___|__Value_|
| 3 | a |
| 4 | b |
| 5 | c |
|____6_____|__d_____|
Table B
+-------------------+
|___User___|__Value_|
| 1 | |
| 4 | |
| 5 | |
|____9_____|________|
My job is to take user from Table A (and their correspondings value) and then map it to Table B and insert those values in there. So from the above example Table B should look like this after running the script:
Table B
+-------------------+
|___User___|__Value_|
| 1 | |
| 4 | b |
| 5 | c |
|____9_____|________|
My question is how can I construct an SQL query that will do this for me in an efficient way, if Table A contains 300,000 + entries and Table B contains 70,000 entries?
NOTES: In Table A the User field is not unique and neither is the Value field. However in Table B, both the User and Value fields are unique and should not appear more than once. Neither are primary keys for either tables.
Could be this
update table_b as b
inner join table_a as a on a.User = b.User
set b.value = a.value
In real-world situations, it would be more likely that you want a predictable value, such as the greatest value for any given user. In that case you would want
update table_b as b
inner join (
select user, max(value) from table_a
group by user ) as a_max on a.user = b.user
set b.value = a_max.value
Your question is unclear about what to do about any values that are already in b. If you use a left join, then these will explicitly be set to NULL:
update table_b b left join
table_a a
on a.User = b.User
set b.value = a.value;
If you want to keep the existing values for non-matches, then use inner join.
Note that this might be inefficient, but should be ok if an index exists on a(user).
If you had very few users in a and lots and lots of duplicates, then you might want to aggregate a before doing the join.

SQL query to find missing entries in

I have a database in which I need to find some missing entries and fill them in.
I have a table called "menu", each restaurant has multiple dishes and each dish has 4 different language entries (actually 8 in the main database but for simplicity lets go with 4), I need to find out which dishes for a particular restaurant are missing any language entries.
select * from menu where restaurantid = 1
i get stuck there, something along the lines of where language 1 or 2 or 3 or 4 doesn't exist which is the complicated bit because I need to see the languages that exist in order to see the language that's missing because I can't display something that isn't there. I hope that makes sense?
In the example table below restaurant 2 dishid 2 is missing language 3, that's what i need to find.
+--------------+--------+----------+-----------+
| RestaurantID | DishID | DishName | Language |
+--------------+--------+----------+-----------+
| 1 | 1 | Soup | 1 |
| 1 | 1 | Soúp | 2 |
| 1 | 1 | Soupe | 3 |
| 1 | 1 | Soupa | 4 |
| 1 | 2 | Bread | 1 |
| 1 | 2 | Bréad | 2 |
| 1 | 2 | Breade | 3 |
| 1 | 1 | Breada | 4 |
| 2 | 1 | Dish1 | 1 |
| 2 | 1 | Dísh1 | 2 |
| 2 | 1 | Disha1 | 3 |
| 2 | 1 | Dishe1 | 4 |
| 2 | 2 | Dish2 | 1 |
| 2 | 2 | Dísh2 | 2 |
| 2 | 2 | Dishe2 | 4 |
+--------------+--------+----------+-----------+
An anti-join pattern is usually the most efficient, in terms of performance.
Your particular case is a little more tricky, in that you need to "generate" rows that are missing. If every (ResturantID,DishID) should have 4 rows, with Language values of 1,2,3 and 4, we can generate that set of all rows with a CROSS JOIN operation.
The next step is to apply an anti-join... a LEFT OUTER JOIN to the rows that exist in the menu table, so we get all the rows from the CROSS JOIN set, along with matching rows.
The "trick" is to use a predicate in the WHERE clause that filters out rows where we found a match, so we are left rows that didn't have a match.
(It seems a bit strange at first, but once you get your brain wrapped around the anti-join pattern, it becomes familiar.)
So a query of this form should return the specified result set.
SELECT d.RestaurantID
, d.DishID
, lang.id AS missing_language
FROM (SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
) lang
CROSS
JOIN (SELECT e.RestaurantID, e.DishID
FROM menu e
GROUP BY e.RestaurantID, e.DishID
) d
LEFT
JOIN menu m
ON m.RestaurantID = d.RestaurantID
AND m.DishID = d.DishID
AND m.Language = lang.id
WHERE m.RestaurantID IS NULL
ORDER BY 1,2,3
Let's unpack that bit.
First we get a set containing the numbers 1 thru 4.
Next we get a set containing the (RestaurantID, DishID) distinct tuples. (For each distinct Restaurant, a distinct list of DishID, as long as there is at least one row for any Language for that combination.)
We do a CROSS JOIN, matching every row from set one (lang) with every row from set (d), to generate a "complete" set of every (RestaurantID, DishID, Language) we want to have.
The next part is the anti-join... the left outer join to menu to find which of the rows from the "complete" set has a matching row in menu, and filtering out all the rows that had a match.
That may be a little confusing. If we think of that CROSS JOIN operation producing a temporary table that looks like the menu table, but containing all possible rows... we can think of it in terms of pseudocode:
create temporary table all_menu_rows (RestaurantID, MenuID, Language) ;
insert into all_menu_rows ... all possible rows, combinations ;
Then the anti-join pattern is a little easier to see:
SELECT r.RestaurantID
, r.DishID
, r.Language
FROM all_menu_rows r
LEFT
JOIN menu m
ON m.RestaurantID = r.RestaurantID
AND m.DishID = r.DishID
AND m.Language = r.Language
WHERE m.RestaurantID IS NULL
ORDER BY 1,2,3
(But we don't have to incur the extra overhead of creating and populating the temporary table, we can do that right in the query.)
Of course, this isn't the only approach. We could use a NOT EXISTS predicate instead of an anti-join, though this is not usually as efficient. The first part of the query is the same, to generate the "complete" set of rows we expect to have; what differs is how we identify whether or not there is a matching row in the menu table:
SELECT d.RestaurantID
, d.DishID
, lang.id AS missing_language
FROM (SELECT 1 AS id UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
) lang
CROSS
JOIN (SELECT e.RestaurantID, e.DishID
FROM menu e
GROUP BY e.RestaurantID, e.DishID
) d
WHERE NOT EXISTS ( SELECT 1
FROM menu m
WHERE m.RestaurantID = d.RestaurantID
AND m.DishID = d.DishID
AND m.Language = lang.id
)
ORDER BY 1,2,3
For each row in the "complete" set (generated by the CROSS JOIN operation), we're going to run a correlated subquery that checks whether a matching row is found. The NOT EXISTS predicate returns TRUE if no matching row is found. (This is a little easier to understand, but it usually doesn't perform as well as the anti-join pattern.)
You can use the following statement if each menu item should have a record on each language (8 in real life 4 in example). You can change the number 4 to 8 if you want to see all menu items per restaurant that doesn't have all 8 entries.
SELECT RestaurantID,DishID, COUNT( * )
FROM Menu
GROUP BY RestaurantID,DishID
HAVING COUNT( * ) <4

How to `SELECT` and manufacture missing rows from previous values?

I have the following (simplified) result from SELECT * FROM table ORDER BY tick,refid:
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 3 3333
3 3 333333
Note the "missing" rows for refid 1 (tick 3) and refid 2 (ticks 2 and 3)
If possible, how can I make a query to add these missing rows using the most recent prior value for that refid? "Most recent" means the value for the row with the same refid as the missing row and largest tick such that the tick is less than the tick for the missing row. e.g.
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 2 22
2 3 3333
3 1 1111
3 2 22
3 3 333333
Additional conditions:
All refids will have values at tick=1.
There may be many 'missing' ticks for a refid in sequence, (as above for refid 2).
There are many refids and it's not known which will have sparse data where.
There will be many ticks beyond 3, but all sequential. In the correct result, each refid will have a result for each tick.
Missing rows are not known in advance - this will be run on multiple databases, all with the same structure, and different "missing" rows.
I'm using MySQL and cannot change db just now. Feel free to post answer in another dialect, to help discussion, but I'll select an answer in MySQL dialect over others.
Yes, I know this can be done in the code, which I've implemented. I'm just curious if it can be done with SQL.
What value should be returned when a given tick-refid combination does not exist? In this solution, I simply returned the lowest value for that given refid.
Revision
I've updated the logic to determine what value to use in the case of a null. It should be noted that I'm assuming that ticks+refid is unique in the table.
Select Ticks.tick
, Refs.refid
, Case
When Table.value Is Null
Then (
Select T2.value
From Table As T2
Where T2.refid = Refs.refId
And T2.tick = (
Select Max(T1.tick)
From Table As T1
Where T1.tick < Ticks.tick
And T1.refid = T2.refid
)
)
Else Table.value
End As value
From (
Select Distinct refid
From Table
) As Refs
Cross Join (
Select Distinct tick
From Table
) As Ticks
Left Join Table
On Table.tick = Ticks.tick
And Table.refid = Refs.refid
If you know in advance what your 'tick' and 'refid' values are,
Make a helper table that contains all possible tick and refid values.
Then left join from the helper table on tick and refid to your data table.
If you don't know exactly what your 'tick' and 'refid' values are, you maybe could still use this method, but instead of a static helper table, it would have to be dynamically generated.
The following has too many sub-selects for my taste, but it generates the desired result in MySQL, as long as every tick and every refid occurs separately at least once in the table.
Start with a query that generates every pair of tick and refid. The following uses the table to generate the pairs, so if any tick never appears in the underlying table, it will also be missing from the generated pairs. The same holds true for refids, though the restriction that "All refids will have values at tick=1" should ensure the latter never happens.
SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
Using this, generate every missing tick, refid pair, along with the largest tick that exists in the table by equijoining on refid and θ≥-joining on tick. Group by the generated tick, refid since only one row for each pair is desired. The key to filtering out existing tick, refid pairs is the HAVING clause. Strictly speaking, you can leave out the HAVING; the resulting query will return existing rows with their existing values.
SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
One final select from the above as a sub-select, joined to the original table to get the value for the given ctick, returns the new rows for the table.
INSERT INTO chadwick
SELECT missing.tick, missing.refid, c.value
FROM (SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
) AS missing
JOIN chadwick AS c ON missing.ctick = c.tick AND missing.refid=c.refid
;
Performance on the sample table, along with (tick, refid) and (refid, tick) indices:
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 1 | PRIMARY | c | ALL | tick_ref,ref_tick | NULL | NULL | NULL | 6 | Using where; Using join buffer |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 9 | Using temporary; Using filesort |
| 2 | DERIVED | c | ref | tick_ref,ref_tick | ref_tick | 5 | tr.refid | 1 | Using where; Using index |
| 3 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 3 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 3 | Using join buffer |
| 5 | DERIVED | chadwick | index | NULL | tick_ref | 10 | NULL | 6 | Using index |
| 4 | DERIVED | chadwick | ref | tick_ref | tick_ref | 5 | | 2 | Using where; Using index |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
As I said, too many sub-selects. A temporary table may help matters.
To check for missing ticks:
SELECT clo.tick+1 AS missing_tick
FROM chadwick AS chi
RIGHT JOIN chadwick AS clo ON chi.tick = clo.tick+1
WHERE chi.tick IS NULL;
This will return at least one row with tick equal to 1 + the largest tick in the table. Thus, the largest value in this result can be ignored.
In order to have the list of pairs (tick, refid) to insert get a whole list:
SELECT a.tick, b.refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
Now substract from that query the existing ones:
SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
Now you can join with t to obtain the final query (note that I use inner join + left join to obtain previous result but you could adapt):
INSERT INTO t(tick, refid, value)
SELECT c.tick, c.refid, t1.value
FROM ( SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
) c
INNER JOIN t t1 ON t1.refid = c.refid and t1.tick < c.tick
LEFT JOIN t t2 ON t2.refid = c.refid AND t1.tick < t2.tick AND t2.tick < c.tick
WHERE t2.tick IS NULL

Can I specify the order in which the joins occur?

Say I have three tables, A, B and C. Conceptually A (optionally) has one B, and B (always) has one C.
Table A:
a_id
... other stuff
Table B:
a_fk_id (foreign key to A.a_id, unique, primary, not null)
c_fk_id (foreign key to C.c_id, not null)
... other stuff
Table C:
c_id
... other stuff
I want to select All records from A as well as their associated records from B and C if present. However, the B and C data must only occur in the result if both B and C are present.
I feel like I want to do:
SELECT *
FROM
A
LEFT JOIN B on A.a_id=B.a_fk_id
INNER JOIN C on B.c_fk_id=C.c_id
But Joins seem to be left associative (the first join happens before the second join), so this will not give records from A that don't have an entry in C.
AFAICT I must use sub queries, something along the lines of:
SELECT *
FROM
A
LEFT JOIN (
SELECT * FROM B INNER JOIN C ON B.c_fk_id=C.c_id
) as tmp ON A.id = tmp.a_fk_id
but once I have a couple of such relationships in a query (in reality I may have two or three nested), I'm worried both about code complexity and about the query optimizer.
Is there a way for me to specify the join order, other than this subquery method?
Thanks.
In SQL Server you can do
SELECT *
FROM a
LEFT JOIN b
INNER JOIN c
ON b.c_fk_id = c.c_id
ON a.id = b.a_fk_id
The position of the ON clause means that the LEFT JOIN on b logically happens last. As far as I know this is standard (claimed to be ANSI prescribed here) but I'm sure the downvotes will notify me if it doesn't work in MySQL!
Edit: And that's what I get for talking faster than I think. My previous solution doesn't work because 'c' hasn't been joined yet. Let's try this again.
We can use a WHERE clause to limit the results to only those that match the criteria you're looking for, where C has a valid (IS NOT NULL) or B does not have a value (IS NULL). Like this:
SELECT *
FROM a
LEFT JOIN b ON (b.a = a.a)
LEFT JOIN c ON (c.b = b.b)
WHERE (c.c IS NOT NULL OR b.b IS NULL);
Without WHERE Results:
mysql> SELECT * FROM a LEFT JOIN b ON (b.a = a.a) LEFT JOIN c ON (c.b = b.b);
+------+------+------+------+------+
| a | a | b | c | b |
+------+------+------+------+------+
| 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 2 | NULL | NULL |
| 2 | 2 | 3 | 2 | 3 |
| 3 | NULL | NULL | NULL | NULL |
| 4 | NULL | NULL | NULL | NULL |
+------+------+------+------+------+
With WHERE Results:
mysql> SELECT * FROM a LEFT JOIN b ON (b.a = a.a) LEFT JOIN c ON (c.b = b.b) WHERE (c.c IS NOT NULL OR b.b IS NULL);
+------+------+------+------+------+
| a | a | b | c | b |
+------+------+------+------+------+
| 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 3 | 2 | 3 |
| 3 | NULL | NULL | NULL | NULL |
| 4 | NULL | NULL | NULL | NULL |
+------+------+------+------+------+
Yes, you use the STRAIGHT_JOIN for this.
When using this keyword the join will occur in the exact order that you specify.
See: http://dev.mysql.com/doc/refman/5.5/en/join.html
Well, I thought up another solution as well, and I'm posting it for completeness (Though I'm actually using Martin's answer).
Use a RIGHT JOIN:
SELECT
*
FROM
b
INNER JOIN c ON b.c_fk_id = c.c_id
RIGHT JOIN a ON a.id = b.a_fk_id
I'm pretty sure every piece I've read about JOINS said that RIGHT JOINs were pointless, but there you are.