Am I understanding what Left Join is supposed to do?
I have a query. Call it Query A. It returns 19 records.
I have another query, Query B. It returns 1,400 records.
I insert Query B into Query A as a left join, so Query A becomes:
SELECT *
FROM tableA
LEFT JOIN (<<entire SQL of Query B>>) AS tableB ON tableA.id = tableB.id
Now, a Left Join means everything from Table A, and only records from Table B where they match. So no matter what, this mixed query should not return more than the 19 records that the original Query A returns. What I actually get is 1,000 records.
Am I fundamentally misunderstanding how LEFT JOIN works?
You're not misunderstanding LEFT JOIN itself so much as the results it implies. If you have only one row in A, and 1000 rows in B that reference the id of that single row in A, your result will have 1000 rows. You're overlooking that the relation may be one-to-many. The size of the "left" table/subquery (subject to WHERE conditions) is the lower bound for the number of results, not the upper bound.
Yes, you are misunderstanding slightly.
Now, a Left Join means everything from Table A, and only records from Table B where they match.
So far, so good: the data from Table B will be included only if it matches against Table A according to the rules you specify in the ON clause.
So no matter what, this mixed query should not return more than the 19 records that the original Query A returns.
This seems like it makes sense, until you realise that more than one row in Table B can match the same row in Table A.
Let's say you have 2 rows in Table A, one with A_ID=1 and one with A_ID=3; and 10 rows in Table B; 5 of the rows in Table B have A_ID=1, and 5 have A_ID=2. All the rows in Table B have different values for B_ID.
If you use a Left Join with the condition that A_ID must match, which rows will you get?
The row from Table A with A_ID=3 will show up once, with a NULL value for B_ID, because there is no row in Table B to match it.
The 5 rows from Table B with A_ID=2 won't show up at all, because they don't match any rows from Table A.
The 5 rows from Table B with A_ID=1 will all show up, each partnered with the 1 row from Table A with A_ID=1.
So you get 6 results, even though there were only 2 rows in Table A.
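The example above can be reproduced with a short script. This is only a sketch: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the table and column names (table_a, table_b, a_id, b_id) are made up to mirror the description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (a_id INTEGER);
    CREATE TABLE table_b (b_id INTEGER, a_id INTEGER);
    INSERT INTO table_a VALUES (1), (3);
""")
# 10 rows in table_b: b_id 1..5 have a_id = 1, b_id 6..10 have a_id = 2.
conn.executemany(
    "INSERT INTO table_b VALUES (?, ?)",
    [(b_id, 1 if b_id <= 5 else 2) for b_id in range(1, 11)],
)

rows = conn.execute("""
    SELECT table_a.a_id, table_b.b_id
    FROM table_a
    LEFT JOIN table_b ON table_a.a_id = table_b.a_id
    ORDER BY table_a.a_id, table_b.b_id
""").fetchall()

for row in rows:
    print(row)  # the a_id = 3 row comes back with b_id = None (NULL)
print(len(rows), "rows")
```

The row with a_id=1 appears five times, the row with a_id=3 appears once with a NULL b_id, and nothing with a_id=2 appears, giving 6 rows in total.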
Related
Leetcode problem link: https://leetcode.com/problems/rising-temperature/
The solution that I don't understand:
SELECT
weather.id AS 'Id'
FROM
weather
JOIN
weather w ON DATEDIFF(weather.recordDate, w.recordDate) = 1
AND weather.Temperature > w.Temperature
;
Here weather and w (an alias of weather) are the same table, and DATEDIFF is comparing dates, but I don't understand this. If weather and w are the same table, doesn't that mean DATEDIFF is comparing each row with itself? The solution is correct, which means the two rows are not the same. How?
The table is the same and the columns being compared are the same, but the conditions are not satisfied by the same row taken from both copies of the table.
Simply put, a join without conditions is the Cartesian product of the rows of the tables involved, which means that a self join of a table with 3 rows returns 9 rows.
When you set conditions, the result set is filtered and only the combined rows that satisfy them will be returned. In the exercise, the conditions are a relation between the row of the first table instance and the second instance.
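To make the "9 rows, then filtered" idea concrete, here is a small sketch. It uses Python's sqlite3 module with an in-memory database rather than MySQL, and the table t with its id and val columns is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, val INTEGER);
    INSERT INTO t VALUES (1, 10), (2, 20), (3, 15);
""")

# No join condition: every row is paired with every row, including itself.
all_pairs = conn.execute(
    "SELECT t1.id, t2.id FROM t t1 CROSS JOIN t t2"
).fetchall()

# A condition relating the two copies keeps only the pairs that satisfy it.
filtered = conn.execute("""
    SELECT t1.id, t2.id
    FROM t t1
    JOIN t t2 ON t1.val > t2.val
    ORDER BY t1.id, t2.id
""").fetchall()

print(len(all_pairs))  # 3 x 3 pairs
print(filtered)        # only pairs where the first copy has the larger val
```

A row paired with itself never survives the filter here, because a value is never greater than itself.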
It is better if you alias both copies of the table:
SELECT w1.id
FROM weather w1 JOIN weather w2
ON DATEDIFF(w1.recordDate, w2.recordDate) = 1 AND w1.Temperature > w2.Temperature;
This query pairs every row of the table with every row of the same table whose recordDate is the previous day and whose Temperature is lower, and returns the id of the first copy for each pair that satisfies the conditions.
In fact every row of the table is compared against every row, including itself, but a row paired with itself is rejected because the conditions fail: the DATEDIFF of a date with itself is 0, not 1.
Also, note that your query may return the same id more than once if the table can hold more than one row for the same date, because a row may then have more than 1 matching row where the date is 1 day earlier and the temperature is lower.
So maybe you want:
SELECT DISTINCT w1.id
.....................
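Here is a runnable sketch of the self join on some made-up sample rows. It runs on SQLite through Python's sqlite3 module, so DATEDIFF (a MySQL function) is replaced by a difference of julianday() values; that substitution and the sample data are my assumptions, not part of the original solution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weather (id INTEGER, recordDate TEXT, Temperature INTEGER);
    INSERT INTO weather VALUES
        (1, '2015-01-01', 10),
        (2, '2015-01-02', 25),
        (3, '2015-01-03', 20),
        (4, '2015-01-04', 30);
""")

# w1 is "today", w2 is "yesterday"; a row paired with itself has a date
# difference of 0, so it can never satisfy the first condition.
rising_ids = [
    row[0]
    for row in conn.execute("""
        SELECT DISTINCT w1.id
        FROM weather w1
        JOIN weather w2
          ON julianday(w1.recordDate) - julianday(w2.recordDate) = 1
         AND w1.Temperature > w2.Temperature
        ORDER BY w1.id
    """)
]
print(rising_ids)
```

With this data, ids 2 and 4 are returned: each is warmer than the row dated one day earlier, while id 3 is cooler than id 2 and id 1 has no previous day.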
I am using MySQL through R. I am working with two tables within the same database and I noticed something strange that I can't explain. To be more specific, when I try to make a connection between the tables using a foreign key the result is not what it should be.
One table is called Genotype_microsatellites, the second table is called Records_morpho. They are connected through the foreign key sample_id.
If I only select records with certain characteristics from the Genotype_microsatellites table using the following command...
Gen_msat <- dbGetQuery(mydb, 'SELECT *
FROM Genotype_microsatellites
WHERE CIDK113a >= 0')
...the query returns 546 observations for 52 variables, exactly what I would expect. Now, I want to do a query that adds a little more info to my results, specifically by including data from the Records_morpho table. I, therefore, use the following code:
Gen_msat <- dbGetQuery(mydb, 'SELECT Genotype_microsatellites.*,
Records_morpho.net_mass_g,
Records_morpho.svl_mm
FROM Genotype_microsatellites
INNER JOIN Records_morpho ON Genotype_microsatellites.sample_id = Records_morpho.sample_id
WHERE CIDK113a >= 0')
The problem is that now the output has 890 observations and 54 variables! Some sample_id values (i.e., the rows or individuals in the data frame) are showing up multiple times, which shouldn't be the case. I have tried to fix this using SELECT DISTINCT, but the problem wouldn't go away.
Any help would be much appreciated.
Sounds like it is working as intended, that is how joins work. With A JOIN B ON A.x = B.y you get every row from A combined with every row from B that has a y matching the A row's x. If there are 3 rows in B that match one row in A, you will get three result rows for those. The A row's data will be repeated for each B row match.
To go a little further: if neither x nor y is unique, and you have two rows in A with the same x value and three rows in B with that same value in y, they will produce six result rows.
As you mentioned, DISTINCT does not make this problem go away, because DISTINCT operates across the whole result row. It will only merge result rows if the values in all selected fields are the same on those result rows. Similarly, if you have a query on a single table that has duplicate rows, DISTINCT will merge those rows despite them being separate rows, as they do not have distinct sets of values.
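A quick sketch of both points, run on SQLite through Python's sqlite3 module instead of MySQL. The tables a and b and their columns are hypothetical; they are not the Genotype_microsatellites and Records_morpho tables from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER, name TEXT);
    CREATE TABLE b (y INTEGER, info TEXT);
    INSERT INTO a VALUES (7, 'a1'), (7, 'a2');
    INSERT INTO b VALUES (7, 'b1'), (7, 'b2'), (7, 'b3');
""")

def count(sql):
    """Run a query and return how many rows it produces."""
    return len(conn.execute(sql).fetchall())

# Two matching rows in a times three matching rows in b.
joined = count("SELECT a.name, b.info FROM a JOIN b ON a.x = b.y")

# DISTINCT compares whole result rows; every (name, info) pair differs.
distinct_all = count("SELECT DISTINCT a.name, b.info FROM a JOIN b ON a.x = b.y")

# DISTINCT only collapses rows once the selected columns are identical.
distinct_x = count("SELECT DISTINCT a.x FROM a JOIN b ON a.x = b.y")

print(joined, distinct_all, distinct_x)
```

If the extra rows are unwanted, the usual fix is to decide which single row from the second table each sample should get (for example by aggregating or filtering it) rather than relying on DISTINCT.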
I have this query:
SELECT s.*
FROM #mcmodlist_servers s
LEFT OUTER JOIN #mcmodlist_tag_server ts
ON ts.server_id = s.id
(don't mind the #mcmodlist_ bits, it's converted by PHP into the actual table names).
When executed as written as above it gives a result of 5 records, as it should, but when I add LIMIT 10 it suddenly returns 4.
But wait, it gets even better: If I change it to LIMIT 12 there's suddenly 5 records again (LIMIT 11 still returns 4).
Left outer should join only if it has a matching record and otherwise return null, right?
Why is LIMIT behaving like this? It works just fine without the JOIN clause.
I think if you run the query in a MySQL client, with limit 10, you will find that it is in fact returning 10 rows in the resultset.
I suspect that there are multiple rows in #mcmodlist_tag_server with a server_id that matches a row from #mcmodlist_servers. When there are multiple matching rows, you are going to get "duplicate" rows from #mcmodlist_servers.
Given that there are no columns returned from the #mcmodlist_tag_server table, and that this is an OUTER join, it's not at all clear why this table would be included in the query at all.
And no, LEFT JOIN does not mean what you said it means.
Q: Left outer should join only if it has a matching record and otherwise return null, right?
A: No. That's not what LEFT JOIN means. A LEFT JOIN returns every pairing of a row from the left side with a matching row from the right side, just like an INNER JOIN. In addition, if a row from the left side doesn't have a matching row from the right side, that row is still returned once, and the columns from the right-side table consist of NULL placeholders.
The LIMIT clause applies to the total number of rows returned in the resultset. It does not mean the number of distinct rows from a given table.
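The symptoms in the question can be reproduced with invented data. This is a sketch under several assumptions: it runs on SQLite through Python's sqlite3 module rather than MySQL, the table names servers and tag_server stand in for the prefixed names, the tag counts per server are made up, and an ORDER BY is added so the result is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE servers (id INTEGER PRIMARY KEY);
    CREATE TABLE tag_server (server_id INTEGER, tag_id INTEGER);
    INSERT INTO servers VALUES (1), (2), (3), (4), (5);
    -- Hypothetical tags: servers 1-3 have three each, server 4 has two,
    -- server 5 has none.
    INSERT INTO tag_server VALUES
        (1, 1), (1, 2), (1, 3),
        (2, 1), (2, 2), (2, 3),
        (3, 1), (3, 2), (3, 3),
        (4, 1), (4, 2);
""")

def limited(limit):
    """Return (rows in the result set, distinct server ids among them)."""
    rows = conn.execute(
        """
        SELECT s.id
        FROM servers s
        LEFT OUTER JOIN tag_server ts ON ts.server_id = s.id
        ORDER BY s.id
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
    return len(rows), len({r[0] for r in rows})

for n in (10, 11, 12):
    print(n, limited(n))
```

With this data the join produces 12 rows in total, so LIMIT 10 and LIMIT 11 cut the result off before server 5 is reached (4 distinct servers), and LIMIT 12 includes it (5 distinct servers). Code that collapses rows by server id would then report 4, 4 and 5 records.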
I have 2 tables with a common field. On one table the common field has an index,
while on the other it does not. Running a query like the following:
SELECT *
FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
performs much worse than running the opposite:
SELECT *
FROM table_without_index
LEFT JOIN table_with_index ON table_without_index.comcol = table_with_index.comcol
WHERE 1
Could anybody explain why, and the logic behind the use of indexes in this case?
You can prepend your queries with EXPLAIN to find out how MySQL will use the indexes and in which order it will join the tables.
Take a look at the documentation of the EXPLAIN output format to see how to interpret the result.
Because of the LEFT JOIN, the order of the tables cannot be changed. MySQL needs to include in the final result set all the rows from the left table, whether or not they have matches in the right table.
With INNER JOINs, MySQL is free to reorder the tables and usually puts the table with fewer rows first, because this way it has a smaller number of rows to analyze.
Let's take this query (it's your query with shorter names for the tables):
SELECT *
FROM a
LEFT JOIN b ON a.col = b.col
WHERE 1
How MySQL runs this query:
1. It gets the first row from table a that matches the query conditions. If there are conditions in the WHERE or join clauses that use only fields of table a and constant values, then an index that contains some or all of these fields is used to select only the rows that match the conditions.
2. After a row from table a is selected, it goes to the next table in the execution plan (this is table b in our query). It has to select all the rows that match the WHERE condition(s) AND the JOIN condition(s). More specifically, the row(s) selected from table b must match the condition b.col = X, where X is the value of column col for the row currently selected from table a in step 1. It finds the first matching row, then goes to the next table. Since there is no "next table" in this query, it puts the pair of rows (from a and b) into the result set, then discards the row from b and searches for the next one, repeating this step until it has found all the rows from b that match the row currently selected from a (in step 1).
3. If step 2 cannot find any row from b that matches the row currently selected from a, the LEFT JOIN forces MySQL to make up a row (having the columns of b) full of NULLs; it combines this with the current row from a and puts the resulting row into the result set.
4. After all the matching rows from b are processed, MySQL discards the current row from a, selects the next row from a that matches the WHERE and join conditions, and starts over with the selection of matching rows from b (step 2).
5. This process loops until all the rows from a are processed.
Remarks:
The meaning of "first row" on step 1 depends on a lot of factors. For example, if there is an index on table a that contains all the columns (of table a) specified in the query then MySQL will not read the table data but will use the index instead. In this case, the order of the rows is given by the index. In other cases the rows are read from the table data and the order is provided by the order they are stored on the storage medium.
This simple query doesn't have any WHERE condition (WHERE 1 is always TRUE) and also there is no condition in the JOIN clause that contains only columns from a. All the rows from table a are included in the result set and that leads to a full table scan or an index scan, if possible.
On step 2, if table b has an index on column col then MySQL uses the index to find the rows from b that have value X on column col. This is a fast operation. If table b does not have an index on column col then MySQL needs to perform a full table scan of table b. That means it has to read all the rows of table b in order to find those having values X on column col. This is a very slow and resource consuming operation.
Because there is no condition on rows of table a, MySQL cannot use an index of table a to filter the rows it selects. On the other hand, when it needs to select the rows from table b (on step 2), it has a condition to match (b.col = X) and it could use an index to speed up the selection, given such an index exists on table b.
This explains the big difference in performance between your two queries. Moreover, because of the LEFT JOIN, your two queries are not equivalent; they can produce different results.
Disclaimer: Please note that the above list of steps is an overly simplified explanation of how the execution of a query works. It attempts to put it in simple words and skip the many technical aspects of what happens behind the scene.
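The simplified steps and the remark about the index on b can be sketched in plain Python. This is an illustration of the idea only, not MySQL's actual implementation: rows are dictionaries, the function names are made up, a Python dict stands in for "an index on b.col" (a real MySQL index is typically a B-tree), and the join column is assumed never to be NULL.

```python
from collections import defaultdict

def left_join_scan(a_rows, b_rows, col):
    """LEFT JOIN when b has no index: scan all of b for every row of a."""
    result, comparisons = [], 0
    for a in a_rows:                  # take the next row from a
        matched = False
        for b in b_rows:              # full scan of b for this row of a
            comparisons += 1
            if b[col] == a[col]:
                result.append((a, b))
                matched = True
        if not matched:               # no match: pad the b side with NULL
            result.append((a, None))
    return result, comparisons

def left_join_indexed(a_rows, b_rows, col):
    """LEFT JOIN when b can be looked up by value instead of scanned."""
    index = defaultdict(list)         # stand-in for an index on b.col
    for b in b_rows:
        index[b[col]].append(b)
    result = []
    for a in a_rows:
        matches = index.get(a[col], [])   # direct lookup, no scan of b
        if matches:
            result.extend((a, b) for b in matches)
        else:
            result.append((a, None))
    return result

a_rows = [{"col": 1}, {"col": 2}]
b_rows = [{"col": 1, "v": "x"}, {"col": 1, "v": "y"}, {"col": 3, "v": "z"}]

scan_result, comparisons = left_join_scan(a_rows, b_rows, "col")
indexed_result = left_join_indexed(a_rows, b_rows, "col")
print(len(scan_result), comparisons)
```

Both functions return the same rows, but the scan version does one comparison per pair of rows (rows in a times rows in b), which is what makes the unindexed right-hand table expensive as the tables grow.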
Hints about how to make your query run faster can be found in the MySQL documentation, section 8. Optimization
To check what the MySQL query optimizer is doing, please show the EXPLAIN plan of these two queries, like this:
EXPLAIN
SELECT * FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
and
EXPLAIN
SELECT *
FROM table_without_index
LEFT JOIN table_with_index ON table_without_index.comcol = table_with_index.comcol
WHERE 1
How is it possible for these two queries to be different? The first query didn't include all the rows from my left table, so I put the conditions within the join part.
Query 1
SELECT COUNT(*) as opens, hours.hour as point
FROM hours
LEFT OUTER JOIN tracking ON hours.hour = HOUR(FROM_UNIXTIME(tracking.open_date))
WHERE tracking.campaign_id = 83
AND tracking.open_date < 1299538799
AND tracking.open_date > 1299452401
GROUP BY hours.hour
Query 2
SELECT COUNT(*) as opens, hours.hour as point
FROM hours
LEFT JOIN tracking ON hours.hour = HOUR(FROM_UNIXTIME(tracking.open_date))
AND tracking.campaign_id = 83
AND tracking.open_date < 1299538799
AND tracking.open_date > 1299452401
GROUP BY hours.hour
The difference is that the first query gives me 18 rows, with no rows for points 17 to 22. But when I run the second query, it shows the full 24 rows, but the rows between 17 and 22 have a value of 1! I would have expected it to be 0 or NULL. If it really is 1, should it not have appeared in the first query?
How has this happened?
The first JOIN is really an INNER JOIN: the outer-joined table should not appear in the WHERE clause the way it does in the top query. Also, instead of COUNT(*), count a column from the outer-joined table.
You're using COUNT(*), which will count every row in your result set (as it's written), since even without data in tracking, you do have data in hours.
Try changing COUNT(*) to COUNT(tracking.open_date) (or any non-nullable column within tracking; it doesn't matter which one).
COUNT(*) counts the number of rows in the query result.
You can use COUNT(tracking.open_date) instead, or basically any column from the tracking table (the right table).
The problem is that the first query does an outer join, which yields some rows containing NULL in all columns from the tracking table. It then applies the WHERE filter on those tracking columns, and since they are NULL, the corresponding rows are filtered out of the result set.
The second query applies those conditions in the ON clause, while the rows are being matched, so hours without a matching tracking row are kept, with NULLs in the tracking columns.
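Here is a small sketch that shows all three behaviours side by side. It is simplified and makes several assumptions: it runs on SQLite through Python's sqlite3 module rather than MySQL, there are only three hours, the tracking table has a made-up open_hour column in place of HOUR(FROM_UNIXTIME(open_date)), the date range conditions are left out, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hours (hour INTEGER);
    CREATE TABLE tracking (campaign_id INTEGER, open_hour INTEGER);
    INSERT INTO hours VALUES (0), (1), (2);
    -- Campaign 83 has opens in hours 0 and 1 only; hour 2 belongs to 99.
    INSERT INTO tracking VALUES (83, 0), (83, 0), (83, 1), (99, 2);
""")

def run(count_expr, filter_keyword):
    """Build the query with the campaign filter in WHERE or in ON (AND)."""
    return conn.execute(f"""
        SELECT hours.hour, {count_expr}
        FROM hours
        LEFT JOIN tracking ON hours.hour = tracking.open_hour
        {filter_keyword} tracking.campaign_id = 83
        GROUP BY hours.hour
        ORDER BY hours.hour
    """).fetchall()

where_filter = run("COUNT(*)", "WHERE")                   # like Query 1
on_count_star = run("COUNT(*)", "AND")                    # like Query 2
on_count_col = run("COUNT(tracking.campaign_id)", "AND")  # suggested fix

print(where_filter)
print(on_count_star)
print(on_count_col)
```

With the filter in WHERE, hour 2 disappears. With the filter in ON, hour 2 is kept but COUNT(*) counts its NULL-padded row as 1. Counting a tracking column gives 0 for that hour, because COUNT(column) skips NULLs.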