INNER JOIN in MySQL returns multiple entries of the same row - mysql

I am using MySQL through R. I am working with two tables within the same database and I noticed something strange that I can't explain. To be more specific, when I try to make a connection between the tables using a foreign key the result is not what it should be.
One table is called Genotype_microsatellites, the second table is called Records_morpho. They are connected through the foreign key sample_id.
If I only select records with certain characteristics from the Genotype_microsatellites table using the following command...
Gen_msat <- dbGetQuery(mydb, 'SELECT *
FROM Genotype_microsatellites
WHERE CIDK113a >= 0')
...the query returns 546 observations for 52 variables, exactly what I would expect. Now, I want to do a query that adds a little more info to my results, specifically by including data from the Records_morpho table. I, therefore, use the following code:
Gen_msat <- dbGetQuery(mydb, 'SELECT Genotype_microsatellites.*,
Records_morpho.net_mass_g,
Records_morpho.svl_mm
FROM Genotype_microsatellites
INNER JOIN Records_morpho ON Genotype_microsatellites.sample_id = Records_morpho.sample_id
WHERE CIDK113a >= 0')
The problem is that now the output has 890 observation and 54 variables!! Some sample_id values (i.e., the rows or individuals in the data frame ) are showing up multiple times, which shouldn't be the case. I have tried to fix this using SLECT DISTINCT, but the problem wouldn't go away.
Any help would be much appreciated.

Sounds like it is working as intended, that is how joins work. With A JOIN B ON A.x = B.y you get every row from A combined with every row from B that has a y matching the A row's x. If there are 3 rows in B that match one row in A, you will get three result rows for those. The A row's data will be repeated for each B row match.
To go a little further, if x is not unique and y is not unique. And you have two x with the same value, and three y with that value, they will produce six result rows.
As you mentioned DISTINCT does not make this problem go away because DISTINCT operates across the result row. It will only merge result rows if the values in all selected fields are the same on those result rows. Similarly, if you have a query on a single table that has duplicate rows, DISTINCT will merge those rows despite them being separate rows, as they do not have distinct sets of values.

Related

How to merge two duplicate columns from different tables AND keep duplicate rows - MYSQL

I'm new to MySQL (I've looked all over the other questions about merging columns and I think my issue might be slightly different). I am trying to Join two tables, and only data from 3 of the columns ("QuantityOnHand", "WarehouseID", and "WarehouseCity") from these two tables, but, one of the columns in each of the two tables are identical ("WarehouseID"), with identical information. When I run my query, I get 4 columns (because the column "WarehouseID" is duplicated). However, I need to only have one column that shows "WarehouseID". I figured out how to get past the ambiguous problem of Joining two tables that have the same name of "WarehouseID", but I don't know how to merge the duplicate column (I'm not even sure if merging is correct, I just need to drop the duplicate column) My query looks like this:
SELECT i.QuantityOnHand, i.WarehouseID, w.WarehouseID, w.WarehouseCity
FROM inventory i
JOIN warehouse w ON i.WarehouseID = w.WarehouseID
WHERE i.QuantityOnHand >= 200
ORDER BY i.QuantityOnHand desc;
To drop the duplicate ID, you simply need to omit it from the SELECT portion of your statement. You don't have to select a column simply because it appears in your join clause. I believe you could achieve what you're looking for with the statement:
SELECT i.QuantityOnHand, w.WarehouseID, w.WarehouseCity
FROM inventory i
JOIN warehouse w ON i.WarehouseID = w.WarehouseID
WHERE i.QuantityOnHand >= 200
ORDER BY i.QuantityOnHand desc;
Note that I just omitted one of the WarehouseID columns from your statement. Is this what you were looking for?

sql workbench adding rows after running a join

Hello everyone i have a quick question, i am running mysql workbench and after joining two tables i get as results 10000 rows. Considering that the first dataset got 6000 rows and the second 450, it'clearly wrong. i'm clearly doing something wrong but i can't figure what is that and why it is happening
I am selecting some column from the first data set and match it against the second one against sv3 and sv4 columns
Can you tell me what i am doing wrong?
the code
select media.Timestamp, media.Campaign, media.Media, media.sv3, media.sv4
from media
inner join media_1
on media.sv3=media_1.sv3 and on media.sv4=media_1.sv4
JOIN queries yielding more results than their source records is not necessarily a sign something is outright wrong; but can be an indicator of something amiss (queries that need to behave that way exist, but are relatively rare).
The source of your issue is likely because you are joining on a value that is non-unique in both tables. As a simple example: If table X has two records with field A = 5, and table Y has three records with field A = 5, and they are JOINed on field A; those records will produce six results.
This may mean there is a problem with your source data, or you may just need to query it in a different manner. I notice you are only selecting fields from media and none from media_1; this query may yield the results you are expecting:
SELECT media.Timestamp, media.Campaign, media.Media, media.sv3, media.sv4
FROM media
WHERE (sv3, sv4) IN (SELECT sv3, sv4 FROM media_1)

Mysql join query with where condition and distinct records

I have two tables called tc_revenue and tc_rates.
tc_revenue contains :- code, revenue, startDate, endDate
tc_rate contains :- code, tier, payout, startDate, endDate
Now I need to get records where code = 100 and records should be unique..
I have used this query
SELECT *
FROM task_code_rates
LEFT JOIN task_code_revenue ON task_code_revenue.code = task_code_rates.code
WHERE task_code_rates.code = 105;
But I am getting repeated records help me to find the correct solution.
eg:
in this example every record is repeated 2 time
Thanks
Use a group by for whatever field you need unique. For example, if you want one row per code, then:
SELECT * FROM task_code_rates LEFT JOIN task_code_revenue ON task_code_revenue.code = task_code_rates.code
where task_code_rates.code = 105
group by task_code_revenue.code, task_code_revenue.tier
If code admits duplicates in both tables and you perform join only using code, then you will get the cartessian product between all matching rows from one table and all matching rows from the other.
If you have 5 records with code 100 in first table and 2 records with code 100 in second table, you'll get 5 times 2 results, all combinations between matching rows from the left and the right.
Unless you have duplicates inside one (or both) tables, all 10 results will differ in colums coming either from one table, the other or both.
But if you were expecting to get two combined rows and three rows from first table with nulls for second table columns, this will not happen.
This is how joins work, and anyway, how should the database decide which rows to combine if it didn't generate all combinations and let you decide in where clause?
Maybe you need to add more criteria to the ON clause, such as also matching dates?

Mysql index use

I have 2 tables with a common field. On one table the common field has an index
while on the other not. Running a query as the following :
SELECT *
FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
the query is way less performing than running the opposite :
SELECT *
FROM table_without_index
LEFT JOIN table_with_indexON table_without_index.comcol = table_with_index.comcol
WHERE 1
Anybody could explain me why and the logic behind the use of indexes in this case?
You can prepend your queries with EXPLAIN to find out how MySQL will use the indexes and in which order it will join the tables.
Take a look at the documentation of the EXPLAIN output format to see how to interpret the result.
Because of the LEFT JOINs, the order of the tables cannot be changed. MySQL needs to include in the final result set all the rows from the left table, whether or not they have matches in the right table.
On INNER JOINs, MySQL usually swaps the tables and puts the table having less rows first because this way it has a smaller number of rows to analyze.
Let's take this query (it's your query with shorter names for the tables):
SELECT *
FROM a
LEFT JOIN b ON a.col = b.col
WHERE 1
How MySQL runs this query:
It gets the first row from table a that matches the query conditions. If there are conditions in the WHERE or join clauses that use only fields of table a and constant values then an index that contain some or all of these fields is used to filter only the rows that matches the conditions.
After a row from table a was selected it goes to the next table from the execution plan (this is table b in our query). It has to select all the rows that match the WHERE condition(s) AND the JOIN condition(s). More specifically, the row(s) selected from table b must match the condition b.col = X where X is the value of column col for the row currently selected from table a on step 1. It finds the first matching row then goes to the next table. Since there is no "next table" in this query, it will put the pair of rows (from a and b) into the result set then discard the row from b and search for the next one, repeating this step until it finds all the rows from b that match the row currently selected from a (on step 1).
If on step 2 cannot find any row from b that match the row currently selected from a, the LEFT JOIN forces MySQL to make up a row (having the columns of b) full of NULLs and together with the current row from a it creates a row puts it into the result set.
After all the matching rows from b were processed, MySQL discards the current row from a, selects the next row from a that matches the WHERE and join conditions and starts over with the selection of matching rows from b (step 2).
This process loops until all the rows from a are processed.
Remarks:
The meaning of "first row" on step 1 depends on a lot of factors. For example, if there is an index on table a that contains all the columns (of table a) specified in the query then MySQL will not read the table data but will use the index instead. In this case, the order of the rows is given by the index. In other cases the rows are read from the table data and the order is provided by the order they are stored on the storage medium.
This simple query doesn't have any WHERE condition (WHERE 1 is always TRUE) and also there is no condition in the JOIN clause that contains only columns from a. All the rows from table a are included in the result set and that leads to a full table scan or an index scan, if possible.
On step 2, if table b has an index on column col then MySQL uses the index to find the rows from b that have value X on column col. This is a fast operation. If table b does not have an index on column col then MySQL needs to perform a full table scan of table b. That means it has to read all the rows of table b in order to find those having values X on column col. This is a very slow and resource consuming operation.
Because there is no condition on rows of table a, MySQL cannot use an index of table a to filter the rows it selects. On the other hand, when it needs to select the rows from table b (on step 2), it has a condition to match (b.col = X) and it could use an index to speed up the selection, given such an index exists on table b.
This explains the big difference of performance between your two queries. More, because of the LEFT JOIN, your two queries are not equivalent, they produce different results.
Disclaimer: Please note that the above list of steps is an overly simplified explanation of how the execution of a query works. It attempts to put it in simple words and skip the many technical aspects of what happens behind the scene.
Hints about how to make your query run faster can be found on MySQL documentation, section 8. Optimization
To check what's going on with MySQL Query optimizer please show EXPLAIN plan of these two queries. Goes like this:
EXPLAIN
SELECT * FROM table_with_index
LEFT JOIN table_without_index ON table_with_index.comcol = table_without_index.comcol
WHERE 1
and
EXPLAIN
SELECT *
FROM table_without_index
LEFT JOIN table_with_indexON table_without_index.comcol = table_with_index.comcol
WHERE 1

Valid SQL without JOIN?

I came across the following SQL statement and I was wondering if it was valid:
SELECT COUNT(*)
FROM
registration_waitinglist,
registration_registrationprofile
WHERE
registration_registrationprofile.activation_key = "ALREADY_ACTIVATED"
What does the two tables separated by a comma mean?
When you SELECT data from multiple tables you obtain the Cartesian Product of all the tuples from these tables. It can be illustrated in the following way:
This means you get each row from the first table paired with all the rows from the second table. Most of the time, it is not what you want. If you really want it, then it's clearer to use the CROSS JOIN notation:
SELECT * FROM A CROSS JOIN B;
In this context, it means that you are going to be joining every row from registration_waitinglist to every row in registration_registrationprofile
It's called a cartesian join
That query is 'syntactically' correct, meaning it will run. What the query will return is the entire product of every row in registration_waitinglist x registration_registrationprofile.
For example, if there were 2 rows in waitinglist and 3 rows in profile, then 6 rows will be returned.
From a practical matter, this is almost always a logical error and not intended. With rare exception, there should be either join criteria or criteria in the where clause.