What actually happens during table JOINs? - mysql

I'm trying to see if my understanding of JOINs is correct.
For the following query:
SELECT * FROM tableA
join tableB on tableA.someId = tableB.someId
join tableC on tableA.someId = tableC.someId;
Does the RDMS basically execute similar pseudocode as follows:
List tempResults
for each A_record in tableA
for each B_record in tableB
if (A_record.someId = B_record.someId)
tempResults.add(A_record)
List results
for each Temp_Record in tempResults
for each C_record in tableC
if (Temp_record.someId = C_record.someId)
results.add(C_record)
return results;
So basically the more records with the same someId tableA has with tableB and tableC, the more records the RDMS have the scan? If all 3 tables have records with same someId, then essentially a full table scan is done on all 3 tables?
Is my understanding correct?

Each vendor's query processor is of course written (coded) slightly differently, but they probably share many common techniques. Implementing a join can be done in a variety of ways, and which one is chosen, in any vendor's implementation, will be dependent on the specific situation, but factors that will be considered include whether the data is already sorted by the join attribute, the relative number of records in each table (a join between 20 records in one set of data with a million records in the other will be done differently than one where each set of records is of comparable size). I do not know the internals for MySQL, but for SQL server, there are three different join techniques, a Merge Join, a Loop Join, and a Hash Join. Take a look at this.

Related

SQL most efficient way to check if rows from one table are also present in another

I have two DB tables each containing email addresses
One is mssql with 1.500.000.000 entries
One is mysql with 70.000.000 entries
I now want to check how many identical email addresses are present in both tables.
i.e. the same address is present in both tables.
Which approach would be the fastest:
1. Download both datasets as csv, load it into memory and compare in program code
2. Use the DB queries to get the overlapping resultset.
if 2 is better: What would be a suggested SQL query?
I would go with a DBQuery. Set up a linked server connection between the two DBs (probably on the MSSQL side), and use a simple inner join query to produce the list of e-mails that occur in both tables:
select a.emailAddress
from MSDBServ.DB.dbo.Table1 a
join MySqlServ.DB..Table2 b
on a.EmailAddress = b.EmailAddress
Finding the set difference, that's going to take more processor power (and it's going to produce at least 1.4b results in the best-case scenario of every MySql row matching an MSSQL row), but the query isn't actually that much different. You still want a join, but now you want that join to return all records from both tables whether they could be joined or not, and then you specifically want the results that aren't joined (in which case one side's field will be null):
select a.EmailAddress, b.EmailAddress
from MSDBServ.DB.dbo.Table1 a
full join MySqlServ.DB..Table2 b
on a.EmailAddress = b.EmailAddress
where a.EmailAddress IS NULL OR b.EmailAddress IS NULL
You could do a sql query to check how many identical email addresses are present in two databases: first number is how many duplicates, second value is the email address.
SELECT COUNT(emailAddr),emailAddr FROM table1 A
INNER JOIN
table2 B
ON A.emailAddr = B.emailAddr
Table1 has the 70,000,000 email addresses, table2 has the 1,500,000,000. I use Oracle so the Upper function may or may not have an equivalent in MySQL.
Select EmailAddress from table1 where Upper(emailaddress) in (select Upper(emailaddress) from table2)
Quicker than comparing spreadsheets and this assumes both tables are in the same database.

Conditionals in WHEREs or JOINs?

Lets say I have the following query:
SELECT occurs.*, events.*
FROM occurs
INNER JOIN events ON (events.event_id = occurs.event_id)
WHERE event.event_state = 'visible'
Another way to do the same query and get the same results would be:
SELECT occurs.*, events.*
FROM occurs
INNER JOIN events ON (events.event_id = occurs.event_id
AND event.event_state = 'visible')
My question. Is there a real difference? Is one way faster than the other? Why would I choose one way over the other?
For an INNER JOIN, there's no conceptual difference between putting a condition in ON and in WHERE. It's a common practice to use ON for conditions that connect a key in one table to a foreign key in another table, such as your event_id, so that other people maintaining your code can see how the tables relate.
If you suspect that your database engine is mis-optimizing a query plan, you can try it both ways. Make sure to time the query several times to isolate the effect of caching, and make sure to run ANALYZE TABLE occurs and ANALYZE TABLE events to provide more info to the optimizer about the distribution of keys. If you do find a difference, have the database engine EXPLAIN the query plans it generates. If there's a gross mis-optimization, you can create an Oracle account and file a feature request against MySQL to optimize a particular query better.
But for a LEFT JOIN, there's a big difference. A LEFT JOIN is often used to add details from a separate table if the details exist or return the rows without details if they do not. This query will return result rows with NULL values for b.* if no row of b matches both conditions:
SELECT a.*, b.*
FROM a
LEFT JOIN b
ON (condition_one
AND condition_two)
WHERE condition_three
Whereas this one will completely omit results that do not match condition_two:
SELECT a.*, b.*
FROM a
LEFT JOIN b ON some_condition
WHERE condition_two
AND condition_three
Code in this answer is dual licensed: CC BY-SA 3.0 or the MIT License as published by OSI.

Join TableA, TableB and TableC to get data from TableA and TableC

Why I am getting so many records for this
SELECT e.OneColumn, fb.OtherColumn
FROM dbo.TABLEA FB
INNER JOIN dbo.TABLEB eo ON Fb.Primary = eo.foregin
INNER JOIN dbo.TABLEC e ON eo.Primary =e.Foreign
WHERE FB.SomeOtherColumn = 0
When I am running this I am getting Millions of records which is not the correct case, all tables has less number of records.
I need to get the columns from TableA and TableC and because they are not joined logically so I have to use TableB to act as bridge
EDIT
Below is the count:
TABLEA = 273551
TABLEB = 384412
TABKEC = 13046
Above Query = After 2 minutes I have forcefully canceled the query.. till that time the count was 11437613
Any suggestion?
To figure out what is going on in such a query where the results are not as expected, I tend to do this. First I change to a SELECT * (Note this is only for figuring out the problem, do not use SELECT * on production, ever!) Then I add an order by for the ID frield from tableA if there is not one in the query.
So now I run the query up to the first table including any where conditions that are from the first table. I comment out the rest. I note the number of records returned.
Now I add in the second table and any where conditions from it. If I am expecting a one to relationship, and if this query doesn't return the smae number of records, then I look at the data that is being returned to see if I can figure out why. Since the contents are ordered by the table1 ID, you can ususally see examples of some records that are duplicated fairly easily and then scroll over until you find the field that causes the differnce. Often this means that you need some sort of addtional where clause or aggregation on the fields in the next table to limit to only one record. JUSt note down the problem at this point though as you may be able tomake the change more effectively in the next join.
So add inteh the third table and again, not the number of records and then look closely at the data where the id from A is repeated. LOok at the columns you intend to return, are they always teh same for an id? If they are differnt then you do not havea one-one relationship and you need to understand that either theri is a data integrity problem or you are mistaken in thinking there is a one-to-one. If tehy are the same, then a derived table may be in order. You only need the ids from tableb so the join could look something like this:
JOIN (SELECT MIn(Primary), foreign FROM TABLEB GROUP BY foreign) EO ON Fb.Primary = eo.foreign
Hope this helps.

MySQL join with two conditions

With a single query, I need to get a list of ALL objects A and an additional column returning "1" when there is an association to an Object C in Table B, or returning "0" when there is no associaton to an Object C in Table B.
Table A holds all objects A
Table B holds all objects A associated to another object C.
I know the ID of object C.
Currently I am using a Query with LEFT JOIN and two conditions with AND in the JOIN.
For the return value column I am using "(TableB.id IS NOT NULL) as associated".
Table A will likely hold only between dozens up to a hundred records.
Table B will likely hold between thousands to hundreds of thousands records.
Table C will likely hold between thousands to hundreds of thousands records.
TableA.id is index
TableB.tablea_id is index
TableB.id is index
TableB.tablec_id is index
TableC.id is index
My query currently looks like this:
SELECT TableA.name, TableA.code, (TableB.id IS NOT NULL) AS associated
FROM TableA
LEFT JOIN TableB ON TableA.id=TableB.tablea_id AND TableB.tablec_id = $input
Is there any performance concern about the method I use for the SQL query or a better way of achieving the desired result ?
Your query looks fine as far as I can see.
However, you should keep in mind that conditions on join should be for matching foreign keys. Following this practice at least makes your queries more readable and maintainable.

Conditional Select Statement in MS Access

I want to create a query in MS Access that will display information from two tables based on the values in one table. Both of these tables have the same exact columns. One has set records and the other one has records a visitor can insert/edit/delete. For the purpose of this question I will call the tables TableA and TableB. TableA has the predetermined records and can not be changed. Multiple users will be using these records. Visitors would add records to TableB. I need a query that will display the records from TableA unless a visitor adds a record to TableB and then it displays that record. The field I need to join on is CategoryID. So what I need is basically like this;
If TableB.CategoryID Is Not Null Then
Select * From TableB
Else
Select * From TableA
End If
Thanks for any assistance anyone can provide.
JW
You get part of the way there by unioning the individual table queries; that works if there's nothing in B, but shows the A records if there is.
So suppose we created a table just like A, say A2, but with an added column: the number of records in B. And then we select all of the records in A2 where this new column 0, and only the columns originally in A; call this A3.
Now consider the union of A3 & B. If B is empty, we get A. If B is not empty, then none of the records from A2 will be chosen for A3, and we'e left with just B.
That is easier than it seems at first. You'll have to join both tables on CategoryID and then conditionally select the right item like this:
SELECT tA.CategoryID, IIF(tB.CategoryID IS NULL, tA.txtEntry, tB.txtEntry) AS EntryText,
tB.CategoryID IS NULL AS bOriginalEntry
FROM TableA AS tA LEFT JOIN TableB AS tB ON tA.CategoryID=tB.CategoryID
However there is one caveat: If TableB is empty then the join is producing an empty set! Just populate TableB with at least one record (preferably one with an invalid CategoryID so it won't join with a valid record in TableA.
The bOriginalEntry is just a boolean expression to show whether the EntryText stems from TableA or TableB.
I found this thread searching for a similar problem. Note to self and others.
You can use the Join Types to cope with potential different values in a conditional select,
MS Access doesn't have the full range of JOIN that MS SQL has, but you can "fudge" it.
eg
Full outer joins: all the data, combined where feasible
In some systems, an outer join can include all rows from both tables, with rows combined when they correspond. This is called a full outer join, and Access doesn’t explicitly support them. However, you can use a cross join and criteria to achieve the same effect.
https://support.office.com/en-us/article/join-tables-and-queries-3f5838bd-24a0-4832-9bc1-07061a1478f6#typesofjoins