INNER JOIN and GROUP BY to prevent duplicate results - mysql

Context:
I'm working on a simple ORM (for PHP) that automates most queries, based on a static configuration.
From the table and entity definitions, the library handles joins automatically and generates the appropriate field and table aliases. No problem for LEFT joins but INNER may result in duplicated results in case of relation One-to-Many.
My thought was to automatically add a GROUP BY clause (on the auto-increment key) if necessary.
The question
Is it correct to consider that I need to add a GROUP BY clause if (and only if) the join's ON and WHERE conditions don't match a unique key of the joined table?
Example
A very simple example, where I want to select all events with (at least) an associated Showing.
If there is another way to do it without INNER JOIN, I'm interested to know how :)
CREATE TABLE `Event` (
`Id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`Name` VARCHAR(255) NOT NULL
);
INSERT INTO `Event` (`Name`) VALUES ('My cool event');
CREATE TABLE `Showing` (
`Id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
`EventId` INT UNSIGNED NOT NULL,
`Place` VARCHAR(50) NOT NULL,
FOREIGN KEY (`EventId`) REFERENCES `Event`(`Id`),
UNIQUE (`EventId`, `Place`)
);
INSERT INTO `Showing` (`EventId`, `Place`) VALUES (1, 'School');
INSERT INTO `Showing` (`EventId`, `Place`) VALUES (1, 'Park');
-- Correct queries
SELECT t.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId` WHERE t1.`Place` = 'School';
SELECT t.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId` AND t1.`Place` = 'School';
-- Query leading to duplicate values
SELECT t.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId`;
-- Group by query to prevent duplicate values
SELECT t.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId` GROUP BY t.`Id`;
Thanks !

(this should be a comment but it's a bit long)
No problem for LEFT joins but INNER may result in duplicated results in case of relation One-to-Many
It's clear from that sentence that at least one of us is very confused about how a relational database works, and how object-relation mapping should work.
Query leading to duplicate values
The rows produced are not duplicates - you've written the query so it doesn't show you why they are different:
SELECT t1.Place, t.*
FROM Event t
INNER JOIN Showing t1
ON t.Id=t1.EventId;
If you're not interested in the data from 'showing' then why is it in your query? If you have events without related showing records then you should be using an 'EXISTS' - not a join (consider where you have a single event but 3 million showings)
SELECT t.*
FROM `Event` t
WHERE EXISTS (SELECT 1
FROM Showing t1
WHERE t1.EventId=t.Id);
If you are strictly implementing ORM, then you probably shouldn't be writing queries with joins at all - but IMHO, the scenario is better served by using factories.
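To see the difference between the join and the EXISTS form on the question's own data, here is a minimal sketch. It runs against an in-memory SQLite database through Python's sqlite3 module rather than MySQL (an assumption made only so the example is self-contained), and the column types are simplified accordingly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Event (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Showing (
    Id INTEGER PRIMARY KEY,
    EventId INTEGER NOT NULL REFERENCES Event(Id),
    Place TEXT NOT NULL,
    UNIQUE (EventId, Place)
);
INSERT INTO Event (Name) VALUES ('My cool event');
INSERT INTO Showing (EventId, Place) VALUES (1, 'School'), (1, 'Park');
""")

# The inner join returns one row per matching Showing.
joined = conn.execute(
    "SELECT t.* FROM Event t INNER JOIN Showing t1 ON t.Id = t1.EventId"
).fetchall()

# EXISTS only asks whether at least one Showing exists, so each Event appears once.
exists_rows = conn.execute(
    "SELECT t.* FROM Event t "
    "WHERE EXISTS (SELECT 1 FROM Showing t1 WHERE t1.EventId = t.Id)"
).fetchall()

print(len(joined))
print(exists_rows)
```

The join produces two rows for the single event, while the EXISTS query produces one, with no GROUP BY needed.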

The data is saying that "My Cool Event" is happening at the park, and at the school. If you inner join the tables you will get more than one result.
Do this query to see what is going on:
Select t.*, t1.* FROM `Event` t INNER JOIN `Showing` t1 ON t.Id=t1.`EventId`;
That is the same query as your duplicate query, but selecting columns from both tables.
The first line of results says the event is happening at the park. The second line says that the same event is happening at the school.

Related

Selecting Max record in nested Join more efficiently

I am trying to figure out the most efficient method of writing the query below. Right now it is using a user table of 3k records, scheduleday of 12k records, and scheduleuser of 300k records.
The method I am using works, but it is not fast. It is plenty fast for 100 records or fewer, but not at the size I need to display. I know there must be a more efficient way of running this: if I take out the nested select, it runs in .00025 seconds. Add the nested select, and we're pushing 9+ seconds.
All I am trying to do is get the most recent date a user was scheduled. The scheduleuser table only tells the scheduleid and dayid. This is then looked up in scheduleday to get the date. I can't use max(scheduleuser.rec) because the order entered may not be in date order.
The result of this query would be:
Bob 4/6/2022
Ralph 4/7/2022
Please note this query works perfectly fine, I am looking for ways to make it more efficient.
Percona Server Mysql 5.5
SELECT
(
SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser su1
LEFT JOIN scheduleday ON scheduleday.scheduleid=su1.scheduleid AND scheduleday.dayid=su1.dayid WHERE su1.idUser=users.idUser
)
as lastsecheduledate, users.usersName
FROM users
users
idUser  usersName
1       bob
2       ralph

scheduleday
scheduleid  dayid  ddate
1           1      4/5/2022
1           2      4/6/2022
1           3      4/7/2022

scheduleuser (su1)
rec  idUser  dayid  scheduleid
1    1       2      1
1    2       3      1
1    1       1      1
As requested, full query
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
(SELECT MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) FROM scheduleuser
LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid WHERE scheduleuser.iduser=users.iduser
) as lastsecheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
As requested, create tables
CREATE TABLE `users` (
`idUser` int(11) NOT NULL,
`usersName` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `users`
ADD PRIMARY KEY (`idUser`);
ALTER TABLE `users`
MODIFY `idUser` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleday` (
`rec` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`ddate` varchar(255) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleday`
ADD PRIMARY KEY (`rec`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleday`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
CREATE TABLE `scheduleuser` (
`rec` int(11) NOT NULL,
`idUser` int(11) NOT NULL,
`dayid` int(11) NOT NULL,
`scheduleid` int(11) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
ALTER TABLE `scheduleuser`
ADD PRIMARY KEY (`rec`),
ADD KEY `idUser` (`idUser`),
ADD KEY `dayid` (`dayid`),
ADD KEY `scheduleid` (`scheduleid`);
ALTER TABLE `scheduleuser`
MODIFY `rec` int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
I think my recommendation would be to do that subquery once with a GROUP BY and join it. Something like
SELECT users.iduser, users.adminName, users.firstname, users.lastname, users.lastLogin, users.area, users.type, users.terminationdate, users.termreason, users.cellphone,
lsd.lastscheduledate AS lastsecheduledate,
IFNULL(userrating.rating,'0.00') as userrating, IFNULL(location.area,'') as userarea, IFNULL(usertypes.name,'') as usertype, IFNULL(useropen.iduser,0) as useropen
FROM users
LEFT JOIN (SELECT iduser, MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y')) lastscheduledate FROM scheduleuser LEFT JOIN scheduleday ON scheduleday.scheduleid=scheduleuser.scheduleid AND scheduleday.dayid=scheduleuser.dayid
GROUP BY iduser
) lsd
ON lsd.iduser=users.iduser
LEFT JOIN userrating ON userrating.iduser=users.iduser
LEFT JOIN location ON location.idarea=users.area
LEFT JOIN usertypes ON usertypes.idtype=users.type
LEFT JOIN useropen ON useropen.iduser=users.iduser
WHERE
users.type<>0 AND users.active=1
ORDER BY users.firstName
This will likely be more efficient since the DB can do the query once for all users, likely using your scheduleuser.iduser index.
If you are using something like above and it's still not performant, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (scheduleid, dayid);
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid);
This would ensure it can do the entire join in the subquery with the indexes. Of course, there are tradeoffs to adding more indexes, so depending on your data profile it might not be worth it (and it might not actually improve anything).
If you are using your original query, I might suggest experimenting with:
ALTER TABLE scheduleuser ADD INDEX (iduser, scheduleid, dayid);
ALTER TABLE scheduleday ADD INDEX (scheduleid, dayid);
This would allow it to do the subquery (both the JOIN and the WHERE) without touching the actual scheduleuser table at all. Again, I say "experiment" since there are tradeoffs and this might not actually improve things much.
When you nest a query in the SELECT as you're doing, that query will get evaluated for each record in the result set because its WHERE clause is utilizing a column from outside the query. You really just want to calculate a result set of max dates only once and join your users on after it is done:
select usersName, last_scheduled
from users
left join (select su.iduser, max(str_to_date(sd.ddate, '%m/%d/%Y')) as last_scheduled
from scheduleuser as su left join scheduleday as sd on su.dayid = sd.dayid
and su.scheduleid = sd.scheduleid
group by su.iduser) recents on users.iduser = recents.iduser
I've obviously left your other columns off and just given you the name and date, but this is the general principle.
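As a rough check of this "aggregate once, then join" shape, the sketch below loads the question's sample rows into an in-memory SQLite database via Python's sqlite3 module. SQLite has no STR_TO_DATE, so a hypothetical helper `mdy_to_iso` is registered as a SQL function to stand in for it; that helper and the simplified column types are assumptions of this example, not part of the original schema.

```python
import sqlite3
from datetime import datetime


def mdy_to_iso(text):
    """Hypothetical stand-in for STR_TO_DATE: '4/6/2022' -> '2022-04-06'."""
    if text is None:  # unmatched rows from the LEFT JOIN
        return None
    return datetime.strptime(text, "%m/%d/%Y").strftime("%Y-%m-%d")


conn = sqlite3.connect(":memory:")
conn.create_function("mdy_to_iso", 1, mdy_to_iso)
conn.executescript("""
CREATE TABLE users (idUser INTEGER PRIMARY KEY, usersName TEXT);
CREATE TABLE scheduleday (scheduleid INTEGER, dayid INTEGER, ddate TEXT);
CREATE TABLE scheduleuser (idUser INTEGER, dayid INTEGER, scheduleid INTEGER);
INSERT INTO users VALUES (1, 'bob'), (2, 'ralph');
INSERT INTO scheduleday VALUES
    (1, 1, '4/5/2022'), (1, 2, '4/6/2022'), (1, 3, '4/7/2022');
INSERT INTO scheduleuser VALUES (1, 2, 1), (2, 3, 1), (1, 1, 1);
""")

# The derived table is computed once for all users, then joined by idUser.
rows = conn.execute("""
SELECT u.usersName, recents.last_scheduled
FROM users u
LEFT JOIN (
    SELECT su.idUser, MAX(mdy_to_iso(sd.ddate)) AS last_scheduled
    FROM scheduleuser su
    LEFT JOIN scheduleday sd
      ON sd.scheduleid = su.scheduleid AND sd.dayid = su.dayid
    GROUP BY su.idUser
) recents ON recents.idUser = u.idUser
ORDER BY u.idUser
""").fetchall()

print(rows)
```

The result matches the expected output in the question (bob on 4/6/2022, ralph on 4/7/2022), here shown as ISO date strings.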
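Potential bug:
MAX(STR_TO_DATE(scheduleday.ddate, '%m/%d/%Y'))
Do not change this to
STR_TO_DATE(MAX(scheduleday.ddate), '%m/%d/%Y')
The second form takes the MAX of the raw strings, which are compared character by character, so '4/7/2022' sorts after '1/5/2023'. Use it (or a bare MAX(ddate)) and you will be in for a rude surprise next January. Storing ddate in a DATE column instead of a VARCHAR would avoid the conversion altogether.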
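The two orderings of MAX and STR_TO_DATE are not interchangeable, which can be checked outside the database. The following sketch uses plain Python with made-up sample dates (an assumption for illustration) to compare the maximum of the raw m/d/Y strings with the maximum of the parsed dates.

```python
from datetime import datetime

# Hypothetical sample values in the same m/d/Y text format as ddate.
ddates = ["4/7/2022", "12/1/2022", "1/5/2023"]

# Equivalent of STR_TO_DATE(MAX(ddate), ...): the strings are compared as text.
string_max = max(ddates)

# Equivalent of MAX(STR_TO_DATE(ddate, ...)): the values are compared as dates.
date_max = max(datetime.strptime(d, "%m/%d/%Y") for d in ddates)

print(string_max)
print(date_max.date())
```

The text comparison picks '4/7/2022' because '4' sorts after '1', while the date comparison correctly picks 5 January 2023.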
Possible better indexes. Switch from MyISAM to InnoDB. The following indexes assume InnoDB; they may not work as well in MyISAM.
users: INDEX(active, type)
userrating: INDEX(iduser, rating)
location: INDEX(idarea, area)
usertypes: INDEX(idtype, name)
useropen: INDEX(iduser)
scheduleday: INDEX(scheduleid, dayid, ddate)
scheduleuser: INDEX(iduser, scheduleid, dayid)
users: INDEX(iduser)
When adding a composite index, DROP index(es) with the same leading columns.
That is, when you have both INDEX(a) and INDEX(a,b), toss the former.

complex SQL query - one table

I am new to SQL.
I was wondering if there is a way to form a complex (I think) query of a certain form, regarding a single table - or a simple query for the same effect.
Let's say I have a table of voice actor candidates, with different attributes (columns) - name and characteristics.
Let's say I have two different actor evaluators (Stewie and Griffin), and all the candidates were evaluated by at least one of them (one, or both). The evaluators evaluate the actors, and the table is built.
The rows in the table are per-evaluation, not per-person, meaning that some candidates have two separate rows, one from each evaluation.
The evaluator's name is also an attribute, a column.
Can I make a query that will choose all candidates that were evaluated by both evaluators? (and let's say show all these rows, an even number then)
(There is no attribute "evaluated by both" - that's the core)
I think it should find all rows with evaluator Stewie, then search the entire table for rows with the corresponding candidates' names, and get those with evaluator Griffin.
Summary
A table with people - names and characteristics. One or two rows per person. Each row was filled according to a different observer. There is an attribute "Is Nice". How to find all people that were observed by two observers, one marked "Yes" and one "No" under "Is Nice"?
Update
It will take me some time to check all the answers (as not enough experience yet), and I will update what worked for me.
Can I make a query that will choose all candidates that were evaluated
by both evaluators?
(and let's say show all these rows, an even number then)
There are multiple ways to do this. You can check the existence of other evaluator's evaluation, using EXISTS:
SELECT * FROM Candidate AS C1 WHERE EXISTS (SELECT * FROM Candidate AS C2 WHERE C1.id = C2.id AND C1.evaluator != C2.evaluator)
Or, you could join the table to itself: (The checks for evaluators should be changed as appropriate)
SELECT C1.candidateName FROM Candidate AS C1 JOIN Candidate AS C2 USING (id) WHERE C1.evaluator = 'Stewie' AND C2.evaluator = 'Griffin'
How to find all people that were observed by two observers, one marked
"Yes" and one "No" under "Is Nice"?
For this one, you add another condition to the queries above, that checks if one evaluation was "Yes" and the other one was "No".
You seem to want group by and having. Since a person cannot have more than two rows, and there are only two distinct possible values for isnice (yes or no), we can phrase the query as:
select name
from people
group by name
having max(isnice) <> min(isnice)
This filters for names that have (at least) two different values in isnice. Starting from the above assumptions, this is sufficient to ensure that the person was evaluated more than once, and that isnice has (at least) two different values.
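A small sketch of the GROUP BY / HAVING approach, run against an in-memory SQLite database through Python's sqlite3 module. The `people` table layout, the `evaluator` column name, and the sample rows are assumptions made up for illustration; the question does not give the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (name TEXT, evaluator TEXT, isnice TEXT);
INSERT INTO people VALUES
    ('Ann', 'Stewie', 'Yes'), ('Ann', 'Griffin', 'No'),
    ('Ben', 'Stewie', 'Yes'), ('Ben', 'Griffin', 'Yes'),
    ('Cat', 'Griffin', 'No');
""")

# Candidates evaluated by both evaluators: two distinct evaluator values per name.
both = [r[0] for r in conn.execute(
    "SELECT name FROM people GROUP BY name "
    "HAVING COUNT(DISTINCT evaluator) = 2 ORDER BY name")]

# Candidates whose evaluations disagree: the smallest and largest isnice differ.
disagree = [r[0] for r in conn.execute(
    "SELECT name FROM people GROUP BY name "
    "HAVING MAX(isnice) <> MIN(isnice) ORDER BY name")]

print(both)
print(disagree)
```

Ann and Ben were each evaluated twice, but only Ann received one "Yes" and one "No".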
So, I read the problem very carefully, and came up with my own solution.
Please verify the code below if this is what you were really asking for?
--Create Candidates Table
CREATE TABLE tbl_candidates
(
c_id INT PRIMARY KEY NOT NULL IDENTITY(1,1),
c_name VARCHAR(30),
)
--Create Evaluators Table
CREATE TABLE tbl_evaluators
(
e_id INT PRIMARY KEY NOT NULL IDENTITY(1,1),
e_name VARCHAR(30),
)
--Create Evaluations Table
CREATE TABLE tbl_evaluations
(
ee_id INT PRIMARY KEY NOT NULL IDENTITY(1,1),
ee_title VARCHAR(30) NOT NULL,
ee_remarks VARCHAR(30) NOT NULL,
ee_date date NOT NULL,
c_id INT FOREIGN KEY (c_id) REFERENCES tbl_candidates(c_id) NOT NULL,
e_id1 INT FOREIGN KEY (e_id1) REFERENCES tbl_evaluators(e_id) NOT NULL,
e_id2 INT FOREIGN KEY (e_id2) REFERENCES tbl_evaluators(e_id),
IsNice VARCHAR(4)
)
--Populate data & check to verify
INSERT INTO tbl_candidates (c_name) VALUES ('Sam') , ('Smith')
SELECT * FROM tbl_candidates
INSERT INTO tbl_evaluators (e_name) VALUES ('Stewie'),('Griffin')
SELECT * FROM tbl_evaluators
INSERT INTO tbl_evaluations
(ee_title,ee_remarks,ee_date,c_id,e_id1,e_id2,IsNice)
VALUES
('Some Title','Some Comment','2020-6-12',1,1,NULL,'No'),
('Some Title','Some Comment','2020-6-12',2,1,2,'Yes'),
('Some Title','Some Comment','2020-6-12',3,2,NULL,'No')
--finally comparing whether we have the matching data of our input vs tables combined data display
select * from tbl_evaluations
select ee_id,ee_title,c_name,ee_remarks,e1.e_name,e2.e_name,ee_date,IsNice from tbl_evaluations ee
left join tbl_candidates c on c.c_id = ee.c_id left join tbl_evaluators e1 on e1.e_id = ee.e_id1 left join tbl_evaluators e2 on e2.e_id = ee.e_id2
This is surely not the best way to write it, but my first thought is
SELECT * FROM evaluations
WHERE PrName IN (
SELECT PrName
FROM evaluations
WHERE IsNice ='No')
AND PrName IN (
SELECT PrName
FROM evaluations
WHERE IsNice ='Yes')

Using WHERE on grouped rows after UNION statement

I have a database schema with two tables, song and editedsong. These tables are identical, except for one extra column in editedsong called deleted. The editedsong table contains a reference to the id in the song table. I want to find all the songs which aren't deleted.
I have a UNION-statement in which I GROUP on the id of the result of two SELECT-statements. I want to exclude results where the deleted column has the value 1. An example of the setup can be seen here.
CREATE TABLE if not exists song
(
id int(11) NOT NULL auto_increment ,
title varchar(255),
PRIMARY KEY (id)
);
CREATE TABLE if not exists editedsong
(
id int(11) NOT NULL auto_increment ,
title varchar(255),
deleted tinyint(1),
PRIMARY KEY (id)
);
INSERT INTO song (id, title) VALUES
(1, 'Born in the USA');
INSERT INTO editedsong (id, title, deleted) VALUES
(1, 'Born in the USA', 1);
And the query is here:
SELECT * FROM
((SELECT *, 0 AS deleted FROM song WHERE id=1)
UNION
(SELECT * FROM editedsong WHERE id=1)) AS song
WHERE song.deleted!=1
GROUP BY song.id
The UNION-statement is used instead of a join as there is a LOT of text in these two tables and a join results in writing to disk. This is a simplified form of the real query, but it reproduces the problem I'm experiencing. I would expect the query to yield no results as the GROUP BY should preserve the first row and throw away all following. Why doesn't it do this? Is it because the WHERE is executed before the GROUP BY? If it is, what is a good way to overcome this problem?
http://sqlfiddle.com/#!2/5cdb6c/3
The reason that the code in the SQLFiddle doesn't work is that the WHERE clause is excluding the deleted record from editedsong before the GROUP BY is executed.
You can use HAVING to apply criteria after a GROUP BY clause.
This appears to work:
SELECT *, max(deleted) as md FROM
((SELECT *, 0 AS deleted FROM song)
UNION
(SELECT * FROM editedsong)) AS song
-- WHERE song.deleted!=1
GROUP BY song.id
HAVING md != 1
This returns the record from song, not the record from editedsong for records that haven't been deleted. If you want the other, reverse the order of the items in the UNION clause.
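The effect of filtering before versus after grouping can be reproduced with a short sketch. It uses an in-memory SQLite database through Python's sqlite3 module instead of MySQL, lists the columns explicitly instead of SELECT *, and adds a second, hypothetical song ('Thunder Road') that has no edited row; all of these are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE song (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE editedsong (id INTEGER PRIMARY KEY, title TEXT, deleted INTEGER);
INSERT INTO song VALUES (1, 'Born in the USA'), (2, 'Thunder Road');
INSERT INTO editedsong VALUES (1, 'Born in the USA', 1);
""")

union_sql = """
    SELECT id, title, 0 AS deleted FROM song
    UNION
    SELECT id, title, deleted FROM editedsong
"""

# WHERE runs before GROUP BY: the deleted row is dropped, the original survives.
where_ids = [r[0] for r in conn.execute(
    f"SELECT id FROM ({union_sql}) AS s "
    "WHERE deleted != 1 GROUP BY id ORDER BY id")]

# HAVING runs after GROUP BY: the whole group for a deleted song is dropped.
having_ids = [r[0] for r in conn.execute(
    f"SELECT id FROM ({union_sql}) AS s "
    "GROUP BY id HAVING MAX(deleted) != 1 ORDER BY id")]

print(where_ids)
print(having_ids)
```

With WHERE, song 1 still appears because its undeleted copy from `song` passes the filter; with HAVING, only song 2 remains.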
This syntax for GROUP BY is unusual, and I'm surprised it's supported. Most database systems I've worked with require every field in the output to have some treatment specified (MAX, COUNT, GROUP BY, etc). So a SELECT * is incompatible with GROUP BY. MySQL must be making some assumption or have some default behaviours here, but I think most servers wouldn't like it (me either).

Current revision of entity in MySQL

Suppose I have the following table
CREATE TABLE `entities` (
`id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
`timestamp` TIMESTAMP NOT NULL
DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`data` VARCHAR(255),
PRIMARY KEY (`id`,`timestamp`)
);
Each entity would normally only be referenced by id, except that there are multiple revisions for each entity, disambiguated by timestamp. The majority of my queries will be selecting the most recent revision, with only a small handful inserting new revisions, and even fewer selecting all past revisions. I expect only about a dozen revisions per id on average.
What is the most efficient (in terms of performance and storage space) method of selecting the most recent revision? Is there an accepted practice for this problem?
As I see it, there are two methods: (1) Create views around a GROUP BY
CREATE VIEW groupedEntities AS
SELECT id, max(timestamp) AS maxt FROM entities GROUP BY id;
CREATE VIEW currentEntities AS
SELECT a.id, data, timestamp FROM groupedEntities AS a
INNER JOIN entities AS b ON b.id=a.id AND b.timestamp=a.maxt
WHERE timestamp <= CURRENT_TIMESTAMP;
SELECT * FROM currentEntities WHERE id=?;
Note the <=CURRENT_TIMESTAMP allows 'deleting' an entity by setting a timestamp to the distant future. And (2) Create a separate table to store current revisions
CREATE TABLE currentEntities (
`id` INT(10) UNSIGNED PRIMARY KEY,
`timestamp` TIMESTAMP,
CONSTRAINT FOREIGN KEY (`id`, `timestamp`)
REFERENCES `entities` (`id`,`timestamp`)
);
SELECT e.* FROM currentEntities c INNER JOIN entities e ON e.id=c.id AND e.timestamp=c.timestamp WHERE c.id=?;
Or some other option (3)?
Views will eat your lunch in terms of performance, because of the way that MySQL handles views. Specifically, MySQL materializes an intermediate MyISAM table for a view, and does not "push" predicates from an outer query into a view (stored or inline).
The option of having a separate table that holds the frequently used "current" revisions would be the better option of the two you present. That does add complexity, keeping everything in sync, different queries to get current vs. historical, and the overhead of extra inserts, etc.
Given just the original table (storing all the historical revisions in the same table as the current revision, with no separate table for just the most recent revision):
A query with an inline view with a predicate INSIDE the view definition will give the best performance:
SELECT e.id
, e.timestamp
, e.data
FROM `entities` e
JOIN ( SELECT m.id
, MAX(m.timestamp) AS `timestamp`
FROM `entities` m
WHERE m.id = ?
GROUP BY m.id
) c
ON c.id = e.id
AND c.timestamp = e.timestamp
The EXPLAIN output should show "Using where; Using index" on the step to materialize the inline view (derived table). The join predicate on the outer query is by primary key, which is optimal for the retrieval of the data column.
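The shape of that query can be tried out with a minimal sketch against an in-memory SQLite database (Python's sqlite3 module). The function name `current_revision`, the text timestamps, and the sample rows are assumptions for illustration; SQLite's planner also differs from MySQL's, so this shows only the result, not the EXPLAIN behaviour described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (
    id INTEGER NOT NULL,
    timestamp TEXT NOT NULL,
    data TEXT,
    PRIMARY KEY (id, timestamp)
);
INSERT INTO entities VALUES
    (1, '2020-01-01 00:00:00', 'v1'),
    (1, '2020-02-01 00:00:00', 'v2'),
    (2, '2020-01-15 00:00:00', 'other');
""")


def current_revision(entity_id):
    """Return the newest row for one id, or None if the id is unknown."""
    return conn.execute("""
        SELECT e.id, e.timestamp, e.data
        FROM entities e
        JOIN (SELECT m.id, MAX(m.timestamp) AS timestamp
              FROM entities m
              WHERE m.id = ?          -- predicate inside the derived table
              GROUP BY m.id) c
          ON c.id = e.id AND c.timestamp = e.timestamp
    """, (entity_id,)).fetchone()


print(current_revision(1))
print(current_revision(3))
```

Because the id filter sits inside the derived table, only the revisions of that one entity are aggregated before the join back to fetch `data`.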

Deleting multiple rows from multiple tables

I have three tables:
`MEMBERS`
with
`NAME` varchar(24) UNIQUE KEY
`LAST_LOGGED_IN` int(11) - It is a timestamp!
`HOMES`
with
`OWNER` varchar(24)
`CARS`
with
`OWNER` varchar(24)
I use InnoDB for these tables, now my actual question is: How do I remove rows within all the tables if the UNIX_TIMESTAMP()-MEMBERS.LAST_LOGGED_IN > 864000?
I'm trying to remove inactive members' rows, and this is the hardest thing yet. I have about 40K rows, and increasing. I clean it regularly with DELETE FROM MEMBERS WHERE UNIX_TIMESTAMP()-LAST_LOGGED_IN> 864000
Any help would be greatly appreciated! Thanks!!
If you have already removed rows from the MEMBERS table, and you want to remove the rows from the other two tables where the value of the OWNER column does not match a NAME value from any row in the MEMBERS table:
DELETE h.*
FROM `HOMES` h
LEFT
JOIN `MEMBERS` m
ON m.`NAME` = h.`OWNER`
WHERE m.`NAME` IS NULL
DELETE c.*
FROM `CARS` c
LEFT
JOIN `MEMBERS` m
ON m.`NAME` = c.`OWNER`
WHERE m.`NAME` IS NULL
(N.B. these statements will also remove rows from the HOMES and CARS tables where the OWNER column has a NULL value.)
I strongly recommend that you test these statements using a SELECT before you run the DELETE. Replace the keyword DELETE with the keyword SELECT, i.e.
-- DELETE h.*
SELECT h.*
FROM `HOMES` h
LEFT
JOIN `MEMBERS` m
ON m.`NAME` = h.`OWNER`
WHERE m.`NAME` IS NULL
Going forward, if you want to keep these tables "in sync", you may consider defining FOREIGN KEY constraints with the ON DELETE CASCADE option.
Or, you can use a DELETE statement that removes rows from all three tables:
DELETE m.*, h.*, c.*
FROM `MEMBERS` m
LEFT
JOIN `HOMES` h
ON h.`OWNER` = m.`NAME`
LEFT
JOIN `CARS` c
ON c.`OWNER` = m.`NAME`
WHERE UNIX_TIMESTAMP()-m.`LAST_LOGGED_IN` > 864000
N.B. the predicate there cannot make use of an index on the LAST_LOGGED_IN column. An equivalent predicate with a reference to the "bare" column will be able to use an index:
WHERE m.`LAST_LOGGED_IN` < UNIX_TIMESTAMP()-864000
or an equivalent:
WHERE m.`LAST_LOGGED_IN` < UNIX_TIMESTAMP(NOW() - INTERVAL 10 DAY)
For best performance, you would need indexes on both HOMES and CARS with a leading column of OWNER, e.g.
... ON `HOMES` (`OWNER`)
... ON `CARS` (`OWNER`)
I don't use InnoDB so I had to look it up, but it does appear to support Referential Integrity. If you set relationships and then turn on ON DELETE CASCADE, the database itself will enforce the rules... i.e., when you delete a Member, the DBMS will take care of deleting the associated Homes and Cars.
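Here is a minimal sketch of the cascading delete, using an in-memory SQLite database through Python's sqlite3 module as a stand-in for InnoDB. The simplified column types, the sample member names, and the fixed `now` value are assumptions for illustration. Note that SQLite needs `PRAGMA foreign_keys = ON` per connection, which InnoDB does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
CREATE TABLE MEMBERS (NAME TEXT PRIMARY KEY, LAST_LOGGED_IN INTEGER);
CREATE TABLE HOMES (OWNER TEXT REFERENCES MEMBERS(NAME) ON DELETE CASCADE);
CREATE TABLE CARS (OWNER TEXT REFERENCES MEMBERS(NAME) ON DELETE CASCADE);
""")

now = 1_700_000_000  # fixed "current" Unix time so the example is repeatable
conn.executemany(
    "INSERT INTO MEMBERS VALUES (?, ?)",
    [("active", now - 3600), ("inactive", now - 900_000)],
)
conn.executemany("INSERT INTO HOMES VALUES (?)", [("active",), ("inactive",)])
conn.executemany("INSERT INTO CARS VALUES (?)", [("inactive",)])

# A single DELETE on MEMBERS; the bare column keeps the predicate index-friendly.
conn.execute("DELETE FROM MEMBERS WHERE LAST_LOGGED_IN < ? - 864000", (now,))

homes = [r[0] for r in conn.execute("SELECT OWNER FROM HOMES")]
cars = [r[0] for r in conn.execute("SELECT OWNER FROM CARS")]
print(homes, cars)
```

Only the inactive member is deleted from MEMBERS, and that member's rows in HOMES and CARS disappear with it.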