MySQL join query with WHERE condition and distinct records - mysql

I have two tables called tc_revenue and tc_rates.
tc_revenue contains: code, revenue, startDate, endDate
tc_rates contains: code, tier, payout, startDate, endDate
Now I need to get the records where code = 100, and the records should be unique.
I have used this query:
SELECT *
FROM task_code_rates
LEFT JOIN task_code_revenue ON task_code_revenue.code = task_code_rates.code
WHERE task_code_rates.code = 105;
But I am getting repeated records. Please help me find the correct solution.
For example, in my results every record is repeated 2 times.
Thanks

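Use a GROUP BY on whatever fields you need to be unique. For example, if you want one row per code and tier, then:
SELECT * FROM task_code_rates LEFT JOIN task_code_revenue ON task_code_revenue.code = task_code_rates.code
WHERE task_code_rates.code = 105
GROUP BY task_code_rates.code, task_code_rates.tier
Note that tier is a column of the rates table, and that SELECT * combined with GROUP BY typically only runs when ONLY_FULL_GROUP_BY is disabled; otherwise, select the grouped columns and aggregate the rest.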

If code admits duplicates in both tables and you join using only code, then you will get the Cartesian product of all matching rows from one table and all matching rows from the other.
If you have 5 records with code 100 in the first table and 2 records with code 100 in the second table, you'll get 5 times 2 results: all combinations of matching rows from the left and the right.
Unless you have duplicate rows inside one (or both) tables, all 10 results will differ in the columns coming from one table, the other, or both.
But if you were expecting to get two combined rows plus three rows from the first table with NULLs in the second table's columns, this will not happen.
This is how joins work. And anyway, how should the database decide which rows to combine if it didn't generate all combinations and let you narrow them down with your conditions?
Maybe you need to add more criteria to the ON clause, such as also matching the dates?
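To make the row multiplication concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows are invented for illustration, and the assumption that rates and revenue rows share the same startDate/endDate periods is mine, not something the question confirms.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; all data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task_code_rates   (code INT, tier INT, payout REAL, startDate TEXT, endDate TEXT);
    CREATE TABLE task_code_revenue (code INT, revenue REAL, startDate TEXT, endDate TEXT);
""")

# Two date periods: rates has two tiers per period, revenue has one row per period.
periods = [("2020-01-01", "2020-06-30"), ("2020-07-01", "2020-12-31")]
for start, end in periods:
    conn.execute("INSERT INTO task_code_revenue VALUES (105, 1000, ?, ?)", (start, end))
    for tier in (1, 2):
        conn.execute("INSERT INTO task_code_rates VALUES (105, ?, 10.0, ?, ?)", (tier, start, end))

# Join on code only: every rates row pairs with every revenue row for that code.
rows_code_only = conn.execute("""
    SELECT * FROM task_code_rates r
    LEFT JOIN task_code_revenue v ON v.code = r.code
    WHERE r.code = 105
""").fetchall()

# Join on code and dates: each rates row pairs only with the revenue row of its period.
rows_with_dates = conn.execute("""
    SELECT * FROM task_code_rates r
    LEFT JOIN task_code_revenue v
           ON v.code = r.code
          AND v.startDate = r.startDate
          AND v.endDate = r.endDate
    WHERE r.code = 105
""").fetchall()

print(len(rows_code_only))   # 4 rates rows x 2 revenue rows = 8
print(len(rows_with_dates))  # one revenue match per rates row = 4
```

With the extra ON criteria every rates row appears once, which is the "unique records" outcome the question asks for, provided the periods really do line up like this.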

Related

DATEDIFF in MySQL

Leetcode problem link: https://leetcode.com/problems/rising-temperature/
The solution that I don't understand:
SELECT
weather.id AS 'Id'
FROM
weather
JOIN
weather w ON DATEDIFF(weather.recordDate, w.recordDate) = 1
AND weather.Temperature > w.Temperature
;
Here weather and w (an alias of weather) are the same table, and DATEDIFF is comparing dates, but I don't understand this. If weather and w are the same table, doesn't that mean that DATEDIFF is comparing the same rows? The solution is correct, which means the two rows are not the same. How?
The table is the same and the columns being compared are the same, but the pairs that satisfy the conditions are made of two different rows, one from each copy of the table.
Simply put, a table join is a Cartesian product of the rows of the tables involved, which means that an unfiltered self join of a table with 3 rows returns 9 rows.
When you set conditions, the result set is filtered and only the combined rows that satisfy them are returned. In the exercise, the conditions relate a row of the first table instance to a row of the second instance.
It is better if you alias both copies of the table:
SELECT w1.id
FROM weather w1 JOIN weather w2
ON DATEDIFF(w1.recordDate, w2.recordDate) = 1 AND w1.Temperature > w2.Temperature;
What this query does is join every row of the table with every other row of the same table whose recordDate is the previous day and whose Temperature is lower, and it returns the id from the first copy.
In fact, every row of the table is compared against every row, including itself, but a row paired with itself is rejected because the conditions fail for it: the DATEDIFF is 0, not 1.
Also, note that your query may return the same id more than once, because for a given row there may be more than one other row where the date is 1 day earlier and the temperature is lower.
So maybe you want:
SELECT DISTINCT w1.id
.....................
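A small runnable sketch of the self join, using Python's sqlite3 in place of MySQL. SQLite has no DATEDIFF, so the "exactly one day apart" condition is written here with SQLite's date(..., '+1 day') instead; that substitution and the four sample rows are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; sample rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (id INT, recordDate TEXT, Temperature INT)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?, ?)",
    [(1, "2015-01-01", 10), (2, "2015-01-02", 25),
     (3, "2015-01-03", 20), (4, "2015-01-04", 30)],
)

# Without conditions, the self join pairs every row with every row: 4 x 4 = 16.
pair_count = conn.execute(
    "SELECT COUNT(*) FROM weather w1 CROSS JOIN weather w2"
).fetchone()[0]

# With conditions, only pairs (today, yesterday) where today is warmer survive.
# date(w2.recordDate, '+1 day') plays the role of DATEDIFF(w1, w2) = 1 here.
warmer_ids = [row[0] for row in conn.execute("""
    SELECT DISTINCT w1.id
    FROM weather w1
    JOIN weather w2
      ON w1.recordDate = date(w2.recordDate, '+1 day')
     AND w1.Temperature > w2.Temperature
    ORDER BY w1.id
""")]

print(pair_count)  # 16
print(warmer_ids)  # [2, 4]
```

A row paired with itself never passes the filter, because its date is not one day after its own date.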

INNER JOIN in MySQL returns multiple entries of the same row

I am using MySQL through R. I am working with two tables within the same database and I noticed something strange that I can't explain. To be more specific, when I try to make a connection between the tables using a foreign key the result is not what it should be.
One table is called Genotype_microsatellites, the second table is called Records_morpho. They are connected through the foreign key sample_id.
If I only select records with certain characteristics from the Genotype_microsatellites table using the following command...
Gen_msat <- dbGetQuery(mydb, 'SELECT *
FROM Genotype_microsatellites
WHERE CIDK113a >= 0')
...the query returns 546 observations for 52 variables, exactly what I would expect. Now, I want to do a query that adds a little more info to my results, specifically by including data from the Records_morpho table. I, therefore, use the following code:
Gen_msat <- dbGetQuery(mydb, 'SELECT Genotype_microsatellites.*,
Records_morpho.net_mass_g,
Records_morpho.svl_mm
FROM Genotype_microsatellites
INNER JOIN Records_morpho ON Genotype_microsatellites.sample_id = Records_morpho.sample_id
WHERE CIDK113a >= 0')
The problem is that now the output has 890 observations and 54 variables! Some sample_id values (i.e., the rows or individuals in the data frame) are showing up multiple times, which shouldn't be the case. I have tried to fix this using SELECT DISTINCT, but the problem wouldn't go away.
Any help would be much appreciated.
Sounds like it is working as intended, that is how joins work. With A JOIN B ON A.x = B.y you get every row from A combined with every row from B that has a y matching the A row's x. If there are 3 rows in B that match one row in A, you will get three result rows for those. The A row's data will be repeated for each B row match.
To go a little further: if neither x nor y is unique, and you have two x with the same value and three y with that value, they will produce six result rows.
As you mentioned, DISTINCT does not make this problem go away, because DISTINCT operates across the whole result row. It will only merge result rows if the values in all selected fields are the same on those rows. Similarly, if you have a query on a single table that contains duplicate rows, DISTINCT will merge those rows despite them being separate rows, as they do not have distinct sets of values.
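The DISTINCT behaviour can be checked with a small sketch, again using Python's sqlite3 as a stand-in for MySQL. The table names genotype and morpho, the locus column, and the rows are simplified, hypothetical versions of the tables in the question.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; tables and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genotype (sample_id INT, locus INT);
    CREATE TABLE morpho   (sample_id INT, net_mass_g REAL);
    INSERT INTO genotype VALUES (1, 150), (2, 152);
    -- sample 1 was measured three times (two identical values), sample 2 once
    INSERT INTO morpho VALUES (1, 10.1), (1, 10.4), (1, 10.4), (2, 8.0);
""")

join_sql = "FROM genotype g INNER JOIN morpho m ON g.sample_id = m.sample_id"

# Plain join: one result row per matching morpho row.
joined = conn.execute(f"SELECT g.*, m.net_mass_g {join_sql}").fetchall()

# DISTINCT over all selected columns only merges rows that are identical throughout.
distinct_all = conn.execute(f"SELECT DISTINCT g.*, m.net_mass_g {join_sql}").fetchall()

# DISTINCT over the genotype columns alone collapses back to one row per sample.
distinct_genotype = conn.execute(f"SELECT DISTINCT g.* {join_sql}").fetchall()

print(len(joined))             # 4
print(len(distinct_all))       # 3
print(len(distinct_genotype))  # 2
```

If one row per sample_id is needed together with the morpho values, the usual options are to aggregate the morpho rows (for example with AVG or MAX) or to pick one measurement per sample before joining.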

Counting rows in multiple tables causes a large delay

I have 3 tables with mainly string data and unique id column:
categories ~45 rows
clientfuncs ~800 rows
serverfuncs ~600 rows
All tables have a unique auto-increment primary key column 'id'.
I try to count the rows in one query:
SELECT COUNT(categories.id), COUNT(serverfuncs.id), COUNT(clientfuncs.id) FROM categories, serverfuncs, clientfuncs
It takes 1.5 - 1.7 s.
And when I try
SELECT COUNT(categories.id), COUNT(serverfuncs.id) FROM categories, serverfuncs
or
SELECT COUNT(categories.id), COUNT(clientfuncs.id) FROM categories, clientfuncs
or
SELECT COUNT(clientfuncs.id), COUNT(serverfuncs.id) FROM clientfuncs, serverfuncs
it takes 0.005 - 0.01 s (as it should).
Can someone explain the reason for this?
You're doing a cross join of 45*800*600 rows. You'll notice that when you check the result of the counts :-)
Try this instead:
SELECT
(SELECT COUNT(*) FROM categories),
(SELECT COUNT(*) FROM serverfuncs),
(SELECT COUNT(*) FROM clientfuncs);
The queries produce a Cartesian product, since no join condition is applied:
Query 1: 800*600*45 = 21.6 million rows
Query 2: 45*600 = 27,000 rows
Query 3: 45*800 ...
It's because your query is joining the tables (the commas in the last part of the query are shorthand for a join) rather than counting them individually. So your queries with only two tables will be quicker.
First of all, do you really want to use three tables in the FROM clause to compute counts that are specific to each table? This will cause the SELECT statement to produce a Cartesian product of the three tables, with a total of 45 x 800 x 600 rows, from which the counts are computed. Hence each categories.id value will be counted many times over, and the same goes for the other counts. If you use only the first two tables in the FROM clause, the Cartesian product will contain only 45 x 800 rows, far fewer than the three tables produce. Hence the queries with two tables are much faster. Primary keys are of no use in this case.
It is better to use three separate statements to get the count from each table.
If you still insist on getting counts at one shot, you may use the following syntax:
SELECT (SELECT COUNT(categories.id) FROM categories),
(SELECT COUNT(serverfuncs.id) FROM serverfuncs),
(SELECT COUNT(clientfuncs.id) FROM clientfuncs);
This works if your RDBMS supports SELECT statements without a FROM clause. It gives correct counts and should be very fast.
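Here is a scaled-down sketch of the difference, using Python's sqlite3 as a stand-in for MySQL. The table sizes (3, 4 and 5 rows instead of 45, 600 and 800) are chosen only to keep the numbers easy to check.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; row counts are scaled down.
conn = sqlite3.connect(":memory:")
sizes = {"categories": 3, "serverfuncs": 4, "clientfuncs": 5}
for table, n in sizes.items():
    conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
    conn.executemany(f"INSERT INTO {table} VALUES (?)", [(i,) for i in range(1, n + 1)])

# Comma-separated tables form a Cartesian product, so every COUNT sees 3*4*5 rows.
cross_counts = conn.execute("""
    SELECT COUNT(categories.id), COUNT(serverfuncs.id), COUNT(clientfuncs.id)
    FROM categories, serverfuncs, clientfuncs
""").fetchone()

# Scalar subqueries count each table on its own.
separate_counts = conn.execute("""
    SELECT (SELECT COUNT(*) FROM categories),
           (SELECT COUNT(*) FROM serverfuncs),
           (SELECT COUNT(*) FROM clientfuncs)
""").fetchone()

print(cross_counts)     # (60, 60, 60)
print(separate_counts)  # (3, 4, 5)
```

So the three-table version is not just slow: all three of its counts equal the size of the product, not the size of any one table.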

Valid SQL without JOIN?

I came across the following SQL statement and I was wondering if it was valid:
SELECT COUNT(*)
FROM
registration_waitinglist,
registration_registrationprofile
WHERE
registration_registrationprofile.activation_key = "ALREADY_ACTIVATED"
What do the two tables separated by a comma mean?
When you SELECT data from multiple tables, you obtain the Cartesian product of all the tuples from these tables.
This means you get each row from the first table paired with all the rows from the second table. Most of the time, it is not what you want. If you really want it, then it's clearer to use the CROSS JOIN notation:
SELECT * FROM A CROSS JOIN B;
In this context, it means that you are going to be joining every row from registration_waitinglist to every row in registration_registrationprofile.
It's called a Cartesian join.
That query is 'syntactically' correct, meaning it will run. What the query will return is the entire product of every row in registration_waitinglist x registration_registrationprofile.
For example, if there were 2 rows in waitinglist and 3 rows in profile, the product would contain 6 rows.
As a practical matter, this is almost always a logical error and not intended. With rare exceptions, there should be join criteria, either in an ON clause or in the WHERE clause.
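As a quick check that the comma form and CROSS JOIN behave the same, and that a WHERE filter on one table still leaves the other table multiplying the result, here is a sketch with Python's sqlite3 standing in for MySQL. The id columns and the sample rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE registration_waitinglist (id INT);
    CREATE TABLE registration_registrationprofile (id INT, activation_key TEXT);
    INSERT INTO registration_waitinglist VALUES (1), (2);
    INSERT INTO registration_registrationprofile VALUES
        (1, 'ALREADY_ACTIVATED'), (2, 'ALREADY_ACTIVATED'), (3, 'abc123');
""")

where = "WHERE p.activation_key = 'ALREADY_ACTIVATED'"

def count(from_clause):
    """Run SELECT COUNT(*) over the given FROM clause with the shared filter."""
    return conn.execute(f"SELECT COUNT(*) FROM {from_clause} {where}").fetchone()[0]

comma_count = count("registration_waitinglist w, registration_registrationprofile p")
cross_count = count("registration_waitinglist w CROSS JOIN registration_registrationprofile p")
profile_count = count("registration_registrationprofile p")

print(comma_count)    # 2 waiting-list rows x 2 activated profiles = 4
print(cross_count)    # 4, same as the comma form
print(profile_count)  # 2 activated profiles when counted on their own
```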

Two SQL Joins, Two Different Results

How is it possible for these two queries to give different results? The first query didn't include all the rows from my left table, so I put the conditions within the join part.
Query 1
SELECT COUNT(*) as opens, hours.hour as point
FROM hours
LEFT OUTER JOIN tracking ON hours.hour = HOUR(FROM_UNIXTIME(tracking.open_date))
WHERE tracking.campaign_id = 83
AND tracking.open_date < 1299538799
AND tracking.open_date > 1299452401
GROUP BY hours.hour
Query 2
SELECT COUNT(*) as opens, hours.hour as point
FROM hours
LEFT JOIN tracking ON hours.hour = HOUR(FROM_UNIXTIME(tracking.open_date))
AND tracking.campaign_id = 83
AND tracking.open_date < 1299538799
AND tracking.open_date > 1299452401
GROUP BY hours.hour
The difference is that the first query gives me 18 rows, with no rows for points 17 to 22. But when I run the second query, it shows the full 24 rows, yet the rows between 17 and 22 have a value of 1! I would have expected 0 or NULL. If it really is 1, should those rows not have appeared in the first query?
How has this happened?
The first JOIN is really an INNER JOIN: the outer joined table should not appear in the WHERE clause the way it does in the top query. Also, instead of COUNT(*), count a column from the outer joined table.
You're using COUNT(*), which will count every row in your result set (as it's written), since even without data in tracking, you do have data in hours.
Try changing COUNT(*) to COUNT(tracking.open_date) (or any non-nullable column within tracking; it doesn't matter which one).
COUNT(*) counts the number of rows in the query result.
You can use COUNT(tracking.open_date), or basically any column from the tracking table (the right table).
The problem is that the first query will do an outer join, with some rows containing NULL in all columns from the tracking table. Then it will apply a filter on those tracking columns, and since they are NULL, the corresponding rows will be filtered out of the result set.
The second query applies the same conditions as part of the join, so it performs a proper outer join and keeps every row from hours.
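The three behaviours discussed here (filter in WHERE, filter in ON with COUNT(*), filter in ON with COUNT of a tracking column) can be compared side by side in a small sketch. It uses Python's sqlite3 as a stand-in for MySQL, and it assumes a hypothetical precomputed open_hour column in place of HOUR(FROM_UNIXTIME(open_date)), which SQLite does not provide.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL; data is hypothetical.
# open_hour replaces HOUR(FROM_UNIXTIME(tracking.open_date)) from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hours (hour INT);
    CREATE TABLE tracking (id INT, campaign_id INT, open_hour INT);
    INSERT INTO hours VALUES (0), (1), (2);
    INSERT INTO tracking VALUES (1, 83, 0), (2, 83, 0), (3, 99, 1);
""")

def run(sql):
    return conn.execute(sql).fetchall()

# Filter in WHERE: NULL-extended rows fail the test, so hours 1 and 2 disappear.
where_filter = run("""
    SELECT hours.hour, COUNT(*) FROM hours
    LEFT JOIN tracking ON hours.hour = tracking.open_hour
    WHERE tracking.campaign_id = 83
    GROUP BY hours.hour ORDER BY hours.hour
""")

# Filter in ON with COUNT(*): all hours kept, but each NULL-extended row counts as 1.
on_filter_count_star = run("""
    SELECT hours.hour, COUNT(*) FROM hours
    LEFT JOIN tracking ON hours.hour = tracking.open_hour
                      AND tracking.campaign_id = 83
    GROUP BY hours.hour ORDER BY hours.hour
""")

# Filter in ON with COUNT(tracking.id): NULLs are not counted, so empty hours give 0.
on_filter_count_col = run("""
    SELECT hours.hour, COUNT(tracking.id) FROM hours
    LEFT JOIN tracking ON hours.hour = tracking.open_hour
                      AND tracking.campaign_id = 83
    GROUP BY hours.hour ORDER BY hours.hour
""")

print(where_filter)          # [(0, 2)]
print(on_filter_count_star)  # [(0, 2), (1, 1), (2, 1)]
print(on_filter_count_col)   # [(0, 2), (1, 0), (2, 0)]
```

The last form is the one that matches the expectation in the question: every hour is listed, and hours without matching opens show 0.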