Record-doubling problem on a simple left join - mysql

I'm running this query:
CREATE TABLE
SELECT people.*, Sheet1.department
FROM people LEFT JOIN Sheet1 ON people.depno = Sheet1.depno
On a set of tables detailing employee records.
The goal is to create a new table that has all the "people" data, plus a human-readable department name. Simple, right?
The problem is that each record in the resulting table appears to be duplicated exactly (with literally every field being the same), turning a roughly 23,000-record table into a roughly 46,000-record table. I say "roughly" because it's not an exact doubling -- there's a difference of about a hundred records.
Some details: The "people" table contains 15 fields, including the "depno" field, which is an integer indicating department.
The "Sheet1" table is, as one would guess, a table generated from an imported xls file containing two fields: the shared "depno" and a new "department" (the latter being a verbose department name corresponding to the depno in question). There are 44 records in the "Sheet1" table.
Thanks in advance for any pointers on this. Let me know what other information you can use from me.
Update: Here's the code I ended up using, from my response to Johan (thanks again to everyone who worked on this):
CREATE TABLE morebetter
SELECT people.*, Sheet1.department FROM people
LEFT JOIN Sheet1 ON people.depno = Sheet1.depno
GROUP BY id

Sounds like the Sheet1.depno field isn't unique?
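One way to check that suspicion is to count how often each depno occurs in the lookup table. The following is a minimal sketch using Python's built-in sqlite3 module on an invented miniature of the two tables; the rows and the `find_duplicate_keys` helper are assumptions for illustration, not the asker's data. The GROUP BY ... HAVING query itself is plain SQL and should also run in MySQL.

```python
import sqlite3

def find_duplicate_keys(conn):
    """Return (depno, occurrences) for every depno listed more than once in Sheet1."""
    return conn.execute(
        "SELECT depno, COUNT(*) FROM Sheet1 "
        "GROUP BY depno HAVING COUNT(*) > 1 ORDER BY depno"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, depno INTEGER);
    CREATE TABLE Sheet1 (depno INTEGER, department TEXT);
    INSERT INTO people VALUES (1, 'Ann', 10), (2, 'Bob', 20), (3, 'Cy', NULL);
    -- Hypothetical import accident: department 10 was pasted in twice.
    INSERT INTO Sheet1 VALUES (10, 'Accounting'), (10, 'Accounting'), (20, 'Sales');
""")

duplicates = find_duplicate_keys(conn)
joined_rows = conn.execute(
    "SELECT people.*, Sheet1.department "
    "FROM people LEFT JOIN Sheet1 ON people.depno = Sheet1.depno"
).fetchall()

print(duplicates)        # depno 10 is listed twice
print(len(joined_rows))  # 4 rows from 3 people: Ann is doubled
```

In this toy data, people whose depno is missing or listed only once in Sheet1 still appear a single time, which may be why the real result was not an exact doubling.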

Sheet1.depno is not unique (it sounds like each depno is listed twice in Sheet1), so every people row matches two Sheet1 rows; that's why you're getting the doubling.
Change the SELECT part to
SELECT DISTINCT people.*, Sheet1.department
FROM people LEFT JOIN Sheet1 ON people.depno = Sheet1.depno
This will eliminate duplicate rows.
In MySQL you can also write
SELECT people.*, Sheet1.department
FROM people LEFT JOIN Sheet1 ON people.depno = Sheet1.depno
GROUP BY people.id
which works slightly differently.
The first query eliminates rows with duplicate output; the second keeps one row per people.id, even if people.id does not appear in the output.
I like the second form, because it makes explicit which duplicates you're trying to eliminate and you don't need to tweak the output.
It may also execute slightly faster.
Note that it relies on MySQL accepting selected columns that are neither aggregated nor in the GROUP BY; with the ONLY_FULL_GROUP_BY SQL mode enabled (the default since MySQL 5.7.5) the query is rejected.
***Warning***
The GROUP BY version will collapse every repeated people.id it finds, but if the other fields in the select are not identical it will just pick one of them arbitrarily!
In other words: if the outcome of the SELECT DISTINCT is different from the GROUP BY version, that means that MySQL is silently dropping non-duplicate rows.
This may or may not be what you want!
In order to be safe, do a GROUP BY on all fields that you care about!
If the GROUP BY is on a unique key, then it's pointless to include further fields from the same table as that unique key.
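The difference between the two forms is easiest to see on a tiny example. This is a sketch in Python's sqlite3 module, which (like MySQL in its permissive mode) tolerates non-aggregated columns alongside GROUP BY. It groups by the people table's id, as in the asker's final query; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, depno INTEGER);
    CREATE TABLE Sheet1 (depno INTEGER, department TEXT);
    INSERT INTO people VALUES (1, 'Ann', 10), (2, 'Bob', 20);
    -- depno 10 appears twice with different spellings; depno 20 twice, identical.
    INSERT INTO Sheet1 VALUES (10, 'Accounting'), (10, 'Accts.'),
                              (20, 'Sales'), (20, 'Sales');
""")

join = "FROM people LEFT JOIN Sheet1 ON people.depno = Sheet1.depno"

distinct_rows = conn.execute(
    f"SELECT DISTINCT people.*, Sheet1.department {join}"
).fetchall()
grouped_rows = conn.execute(
    f"SELECT people.*, Sheet1.department {join} GROUP BY people.id"
).fetchall()

print(len(distinct_rows))  # 3: Ann keeps both spellings, Bob collapses to one row
print(len(grouped_rows))   # 2: one row per person; Ann's department is an arbitrary pick
```

When the two counts differ, as here, the GROUP BY form has discarded a row that was not an exact duplicate.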

Related

Subquery or join to get one string from the same two tables multiple times

Related to Join vs. sub-query but a different type of situation, and just trying to understand how this works.
I had to make a view where I get a bunch of employee codes from one table, and I have to get their names from a different table - the same two tables every time. I arranged my query like this:
SELECT
(SELECT name from emptable where empcod = code1) as emp1, code1,
(SELECT name from emptable where empcod = code2) as emp2, code2,
[repeat 6 times]
FROM codetable
It is more complicated than this, and more tables are joined, but this is the element I want to ask about. My boss says joining like so is better:
SELECT e1.name, c.code1, e2.name, c.code2, e3.name, code3 [etc]
FROM codetable c
INNER JOIN emptable e1 ON e1.empcod = c.code1
INNER JOIN emptable e2 ON e2.empcod = c.code2
INNER JOIN emptable e3 ON e3.empcod = c.code3
My reasoning, aside from not having to go search in the joins which table gets whose name and why, is the way I understand the join goes like this:
Take whole table A
Take whole table B
Combine all the data from both tables according to the 'ON' section of the join
select one single string from this complete combination of two whole tables from which I need no other data
I think it's obvious that this seems like it would take up a lot of resources. I understand the subquery as
Get one datum from table A (the employee code)
Match this one datum to every record from table B until you find a match
As soon as you get a match, bring back this one single datum from this other table (the employee's name)
Understanding that in the table of employees, the employee code is a primary key and cannot be duplicated, so every subquery can only ever give me one single string back.
It seems to me that comparing ONE number from one table to ONE number from another table and retrieving ONE string related to that number would be less resource-intensive than matching ALL of the data in two whole tables together in order to get this one string. But I figure I don't know what these databases are doing behind the scenes, and a lot of people seem to prefer joins. Can anyone explain this to me, if I'm understanding it wrong? The other posts I find here about similar situations tend to want more information from more tables; I'm not immediately finding anything about matching the same two tables six or seven times to retrieve one single string for every configuration.
Thanks in advance.
So as ScaisEdge explained, a join only gets executed once - and thus only spends time and resources once - no matter how many rows you have, whereas each of the six subselects gets executed once for every row. If you have 100 rows, you're either executing six joins once or executing six subselects 100 times.
It makes sense that this would be more resource-intensive, and I did not explain clearly enough that my case involves only one row at a time - in which case I guess the difference would be negligible anyway.
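Performance aside, the two forms do not always return the same rows. A scalar subselect yields NULL when a code has no matching employee, which is what a LEFT JOIN does, while the INNER JOIN version drops that row entirely. Here is a small sketch with Python's sqlite3 module; the two-code table layout and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emptable (empcod INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE codetable (id INTEGER PRIMARY KEY, code1 INTEGER, code2 INTEGER);
    INSERT INTO emptable VALUES (1, 'Ann'), (2, 'Bob');
    -- Row 2 references employee 9, which does not exist in emptable.
    INSERT INTO codetable VALUES (1, 1, 2), (2, 2, 9);
""")

subquery_rows = conn.execute("""
    SELECT (SELECT name FROM emptable WHERE empcod = c.code1) AS emp1, c.code1,
           (SELECT name FROM emptable WHERE empcod = c.code2) AS emp2, c.code2
    FROM codetable c ORDER BY c.id
""").fetchall()

left_join_rows = conn.execute("""
    SELECT e1.name, c.code1, e2.name, c.code2
    FROM codetable c
    LEFT JOIN emptable e1 ON e1.empcod = c.code1
    LEFT JOIN emptable e2 ON e2.empcod = c.code2
    ORDER BY c.id
""").fetchall()

inner_join_rows = conn.execute("""
    SELECT e1.name, c.code1, e2.name, c.code2
    FROM codetable c
    INNER JOIN emptable e1 ON e1.empcod = c.code1
    INNER JOIN emptable e2 ON e2.empcod = c.code2
    ORDER BY c.id
""").fetchall()

print(subquery_rows == left_join_rows)  # same rows, NULL for the unknown code
print(len(inner_join_rows))             # the row with the unknown code is gone
```

So if some codes can be empty or unmatched, the join rewrite that mirrors the subselects is the LEFT JOIN one.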

SQL most efficient way to check if rows from one table are also present in another

I have two DB tables each containing email addresses
One is mssql with 1.500.000.000 entries
One is mysql with 70.000.000 entries
I now want to check how many identical email addresses are present in both tables.
i.e. the same address is present in both tables.
Which approach would be the fastest:
1. Download both datasets as csv, load it into memory and compare in program code
2. Use the DB queries to get the overlapping resultset.
if 2 is better: What would be a suggested SQL query?
I would go with a DBQuery. Set up a linked server connection between the two DBs (probably on the MSSQL side), and use a simple inner join query to produce the list of e-mails that occur in both tables:
select a.emailAddress
from MSDBServ.DB.dbo.Table1 a
join MySqlServ.DB..Table2 b
on a.EmailAddress = b.EmailAddress
Finding the set difference, that's going to take more processor power (and it's going to produce at least 1.4b results in the best-case scenario of every MySql row matching an MSSQL row), but the query isn't actually that much different. You still want a join, but now you want that join to return all records from both tables whether they could be joined or not, and then you specifically want the results that aren't joined (in which case one side's field will be null):
select a.EmailAddress, b.EmailAddress
from MSDBServ.DB.dbo.Table1 a
full join MySqlServ.DB..Table2 b
on a.EmailAddress = b.EmailAddress
where a.EmailAddress IS NULL OR b.EmailAddress IS NULL
You could run a SQL query to check which email addresses are present in both tables: the first column is how many times the address matched, the second is the email address.
SELECT COUNT(A.emailAddr), A.emailAddr FROM table1 A
INNER JOIN
table2 B
ON A.emailAddr = B.emailAddr
GROUP BY A.emailAddr
Table1 has the 70,000,000 email addresses, table2 has the 1,500,000,000. I use Oracle so the Upper function may or may not have an equivalent in MySQL.
Select EmailAddress from table1 where Upper(emailaddress) in (select Upper(emailaddress) from table2)
Quicker than comparing spreadsheets and this assumes both tables are in the same database.
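If only the number of shared addresses is needed, a single aggregate over the join is enough. The sketch below uses Python's sqlite3 module with both tables in one in-memory database and a handful of invented addresses; the table and column names follow the answers above, and the case-insensitive comparison via UPPER is an assumption about what "identical" should mean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (emailAddr TEXT);
    CREATE TABLE table2 (emailAddr TEXT);
    INSERT INTO table1 VALUES ('a@example.com'), ('B@example.com'), ('c@example.com');
    INSERT INTO table2 VALUES ('b@example.com'), ('c@example.com'),
                              ('c@example.com'), ('d@example.com');
""")

# COUNT(DISTINCT ...) keeps an address repeated in table2 from being counted twice.
(overlap,) = conn.execute("""
    SELECT COUNT(DISTINCT UPPER(A.emailAddr))
    FROM table1 A
    INNER JOIN table2 B ON UPPER(A.emailAddr) = UPPER(B.emailAddr)
""").fetchone()

print(overlap)
```

On tables of the size in the question, wrapping the column in UPPER may prevent an ordinary index on the address column from being used, so storing a normalized copy of the address could be worth considering.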

include null and zero in count() from related table

I would like to list in table (staging) the number of related records from table (studies).
So far this statement works well but returns only the rows where there are >0 related records:
SELECT staging.*,
COUNT(studies.PMID) AS refcount
FROM studies
LEFT JOIN staging
ON studies.rs_number = staging.rs
GROUP BY staging.idstaging;
How can I adjust this statement to list ALL rows in table (staging) including where there are zero or null related records from table (studies)?
Thank you
You have the tables in the wrong order in the LEFT JOIN:
SELECT staging.*, COUNT(studies.PMID) AS refcount
FROM staging LEFT JOIN
studies
ON studies.rs_number = staging.rs
GROUP BY staging.idstaging;
LEFT JOIN keeps everything in the first ("left") table and all matching rows in the second. If you want to keep everything in the staging table, then put it first.
And, in case anyone wants to complain about the use of staging.* with GROUP BY: this particular usage is (presumably) ANSI compliant because staging.idstaging is (presumably) a unique id in that table.
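The corrected table order can be tried out on a few rows. This is a sketch with Python's sqlite3 module; the column types and sample values are assumptions, only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging (idstaging INTEGER PRIMARY KEY, rs TEXT);
    CREATE TABLE studies (PMID INTEGER, rs_number TEXT);
    INSERT INTO staging VALUES (1, 'rs100'), (2, 'rs200'), (3, 'rs300');
    -- rs300 has no related studies at all.
    INSERT INTO studies VALUES (111, 'rs100'), (112, 'rs100'), (113, 'rs200');
""")

refcounts = conn.execute("""
    SELECT staging.idstaging, staging.rs, COUNT(studies.PMID) AS refcount
    FROM staging
    LEFT JOIN studies ON studies.rs_number = staging.rs
    GROUP BY staging.idstaging
    ORDER BY staging.idstaging
""").fetchall()

for row in refcounts:
    print(row)
```

The row for rs300 comes back with a refcount of 0 because COUNT(studies.PMID) ignores the NULL produced by the unmatched left join; COUNT(*) would report 1 for it instead.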

How to combine 5 tables together with same ID in a query?

I have 5 different tables T_DONOR, T_RECIPIENT_1, T_RECIPIENT_2, T_RECIPIENT_3, and T_RECIPIENT_4. All 5 tables have the same CONTACT_ID.
This is the T_DONOR table:
T_RECIPIENT_1:
T_RECIPIENT_2:
This is what I want the final table to look like with more recipients and their information to the right.
T_RECIPIENT_3 and T_RECIPIENT_4 are the same as T_RECIPIENT_1 and T_RECIPIENT_2 except that they have a different RECIPIENT ID and different names. I want to combine all 5 of these tables so that on one line I have the DONOR_CONTACT_ID with the donor's information, followed by all of the recipients' information.
The problem is that when I try to run a query, it does not work because not all of the Donors have all of the recipient fields filled, so the query will run and give a blank table. Some instances I have a Donor with 4 Recipients and other times I have a Donor with only 1 Recipient so this causes a problem. I've tried running queries where I connect them with the DONOR_CONTACT_ID but this will only work if all of the RECIPIENT fields are filled. Any suggestions on what to do? Is there a way I could manipulate this in VBA? I only know some VBA, I'm not an expert.
First I think you want all rows from T_DONOR. And then you want to pull in information from the recipient tables when they include DONOR_CONTACT_ID matches. If that is correct, LEFT JOIN T_DONOR to the other tables.
Start with a simpler set of fields; you can add in the "name" fields after you get the joins set to correctly return the rest of the data you need.
SELECT
d.DONOR_CONTACT_ID,
r1.RECIPIENT_1,
r2.RECIPIENT_1
FROM
(T_DONOR AS d
LEFT JOIN T_RECIPIENT_1 AS r1
ON d.ORDER_NUMBER = r1.ORDER_NUMBER)
LEFT JOIN T_RECIPIENT_2 AS r2
ON d.ORDER_NUMBER = r2.ORDER_NUMBER;
Notice the parentheses in the FROM clause. The db engine requires them for any query which includes more than one join. If possible, set up your joins in Design View of the query designer. The query designer knows how to add parentheses to keep the db engine happy.
Here is a version without aliased table names in case it's easier to understand and set up in the query designer ...
SELECT
T_DONOR.DONOR_CONTACT_ID,
T_RECIPIENT_1.RECIPIENT_1,
T_RECIPIENT_2.RECIPIENT_1
FROM
(T_DONOR
LEFT JOIN T_RECIPIENT_1
ON T_DONOR.ORDER_NUMBER = T_RECIPIENT_1.ORDER_NUMBER)
LEFT JOIN T_RECIPIENT_2
ON T_DONOR.ORDER_NUMBER = T_RECIPIENT_2.ORDER_NUMBER;
SELECT T_DONOR.ORDER_NUMBER, T_DONOR.DONOR_CONTACT_ID, T_DONOR.FIRST_NAME, T_DONOR.LAST_NAME, T_RECIPIENT_1.RECIPIENT_1, T_RECIPIENT_1.FIRST_NAME, T_RECIPIENT_1.LASTNAME
FROM T_DONOR
LEFT JOIN T_RECIPIENT_1
ON T_DONOR.DONOR_CONTACT_ID = T_RECIPIENT_1.DONOR_CONTACT_ID
This shows you how to JOIN the first recipient table, you should be able to follow the same structure for the other three...
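To see why the kind of join matters here, compare inner and left joins on a donor who has only one recipient. This is a sketch using Python's sqlite3 module rather than Access; the reduced column set and the sample names are assumptions, since the real table layouts are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE T_DONOR (DONOR_CONTACT_ID INTEGER PRIMARY KEY, LAST_NAME TEXT);
    CREATE TABLE T_RECIPIENT_1 (DONOR_CONTACT_ID INTEGER, LAST_NAME TEXT);
    CREATE TABLE T_RECIPIENT_2 (DONOR_CONTACT_ID INTEGER, LAST_NAME TEXT);
    INSERT INTO T_DONOR VALUES (1, 'Adams'), (2, 'Baker');
    INSERT INTO T_RECIPIENT_1 VALUES (1, 'Clark'), (2, 'Davis');
    -- Only donor 1 has a second recipient.
    INSERT INTO T_RECIPIENT_2 VALUES (1, 'Evans');
""")

select = "SELECT d.DONOR_CONTACT_ID, d.LAST_NAME, r1.LAST_NAME, r2.LAST_NAME FROM T_DONOR d"
on1 = "T_RECIPIENT_1 r1 ON d.DONOR_CONTACT_ID = r1.DONOR_CONTACT_ID"
on2 = "T_RECIPIENT_2 r2 ON d.DONOR_CONTACT_ID = r2.DONOR_CONTACT_ID"
order = "ORDER BY d.DONOR_CONTACT_ID"

# SQLite accepts chained joins without parentheses; Access would need them.
inner_rows = conn.execute(f"{select} INNER JOIN {on1} INNER JOIN {on2} {order}").fetchall()
left_rows = conn.execute(f"{select} LEFT JOIN {on1} LEFT JOIN {on2} {order}").fetchall()

print(inner_rows)  # donor 2 is missing because there is no second recipient
print(left_rows)   # donor 2 is kept, with NULL in the second recipient column
```

With inner joins, a donor disappears as soon as one recipient table has no row for them, which matches the blank results described in the question.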

Join TableA, TableB and TableC to get data from TableA and TableC

Why am I getting so many records for this query?
SELECT e.OneColumn, fb.OtherColumn
FROM dbo.TABLEA FB
INNER JOIN dbo.TABLEB eo ON Fb.Primary = eo.Foreign
INNER JOIN dbo.TABLEC e ON eo.Primary =e.Foreign
WHERE FB.SomeOtherColumn = 0
When I run this I get millions of records, which cannot be correct; every table has fewer records than that.
I need columns from TableA and TableC, and because they are not logically joined I have to use TableB as a bridge.
EDIT
Below is the count:
TABLEA = 273551
TABLEB = 384412
TABLEC = 13046
Above query = after 2 minutes I forcibly canceled the query; by that time the count was 11437613
Any suggestion?
To figure out what is going on in a query like this, where the results are not as expected, I tend to do the following. First I change it to a SELECT * (note this is only for figuring out the problem; do not use SELECT * in production, ever!). Then I add an ORDER BY on the ID field from tableA if there is not one in the query.
Now I run the query up to the first table only, including any WHERE conditions that belong to the first table. I comment out the rest. I note the number of records returned.
Next I add in the second table and any WHERE conditions from it. If I am expecting a one-to-one relationship and this query doesn't return the same number of records, I look at the data being returned to see if I can figure out why. Since the contents are ordered by the tableA ID, you can usually spot examples of duplicated records fairly easily and then scroll over until you find the field that causes the difference. Often this means that you need some sort of additional WHERE clause or aggregation on the fields in the next table to limit it to only one record. Just note down the problem at this point, though, as you may be able to make the change more effectively in the next join.
Then add in the third table and again note the number of records, and look closely at the data where the ID from A is repeated. Look at the columns you intend to return: are they always the same for an ID? If they are different, then you do not have a one-to-one relationship, and you need to understand that either there is a data integrity problem or you are mistaken in thinking it is one-to-one. If they are the same, then a derived table may be in order. You only need the IDs from TABLEB, so the join could look something like this:
JOIN (SELECT MIN(Primary) AS Primary, foreign FROM TABLEB GROUP BY foreign) EO ON Fb.Primary = eo.foreign
Hope this helps.
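The step-by-step counting described above can be sketched on a toy version of the three tables. This uses Python's sqlite3 module; the column names (a_id, b_id, c_id), the sample rows, and the `count` helper are all assumptions standing in for the question's placeholder names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLEA (a_id INTEGER PRIMARY KEY, other_column TEXT);
    CREATE TABLE TABLEB (b_id INTEGER PRIMARY KEY, a_id INTEGER);
    CREATE TABLE TABLEC (c_id INTEGER PRIMARY KEY, b_id INTEGER, one_column TEXT);
    INSERT INTO TABLEA VALUES (1, 'x'), (2, 'y');
    -- Three bridge rows for a_id 1, one for a_id 2.
    INSERT INTO TABLEB VALUES (10, 1), (11, 1), (12, 1), (13, 2);
    -- Two TABLEC rows each for bridge rows 10 and 11, one for 13.
    INSERT INTO TABLEC VALUES (100, 10, 'p'), (101, 10, 'q'),
                              (102, 11, 'r'), (103, 11, 's'), (104, 13, 't');
""")

def count(from_clause):
    """Count the rows produced by the given FROM clause."""
    return conn.execute(f"SELECT COUNT(*) FROM {from_clause}").fetchone()[0]

# Add one table at a time and watch where the row count grows.
step1 = count("TABLEA a")
step2 = count("TABLEA a JOIN TABLEB b ON b.a_id = a.a_id")
step3 = count("TABLEA a JOIN TABLEB b ON b.a_id = a.a_id "
              "JOIN TABLEC c ON c.b_id = b.b_id")

# Derived table: keep one bridge row (the lowest b_id) per a_id.
dedup = count(
    "TABLEA a "
    "JOIN (SELECT MIN(b_id) AS b_id, a_id FROM TABLEB GROUP BY a_id) b "
    "ON b.a_id = a.a_id "
    "JOIN TABLEC c ON c.b_id = b.b_id"
)

print(step1, step2, step3, dedup)
```

Keeping only the lowest bridge row also discards the TABLEC rows that hang off the other bridge rows, so this shortcut only fits when those rows really are redundant.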