I am trying to figure out how to find duplicates based on several different columns and tables.
I've got these tables:
products
tags (lists tagid and productid - one row per tag)
groups (lists groupid and productid - one row per groupid)
I want to find exact matches in my products table on the columns productName, brandid and origin. To count a row as a duplicate, the products must also have exactly the same tags (column: tagid) and groups (column: groupid) assigned.
Every product may have multiple tags and multiple groups.
This is what I've come up with... but it's not quite doing what I need it to.
SQLFiddle
http://sqlfiddle.com/#!9/43f19/1
In my SQL fiddle example I have listed 10 different products.
For example, products 1,2 are exact matches and thus should be listed as a duplicate.
Product number 3 has only one group assigned and thus differs from products 1 and 2, even though every other parameter matches (it should not be listed). My intention with the dupid column is to list the first entry of each set of duplicates.
id | name | brandid | origin | tags | groups | dupid
1 | prod | 1 | England | 1,2 | 1,2 | 1
2 | prod | 1 | England | 1,2 | 1,2 | 1
3 | prod | 1 | England | 1,2 | 1 | 3
Complete set of items that should be listed as exact matches in my SQL fiddle are:
id 1
id 2
id 4
id 5
My guess is that this fails because I have not managed to include the tags and groups correctly in my comparison.
SELECT m.*,dup.id AS dupid,GROUP_CONCAT(DISTINCT t.tagid ORDER BY t.tagid ASC) AS alltags,GROUP_CONCAT(DISTINCT g.groupid ORDER BY g.groupid ASC) AS groups
FROM `products` m
JOIN (SELECT id,`productName`, brandid, origin, COUNT(*) AS c FROM products
GROUP BY `productName`, brandid, origin HAVING c > 1) dup ON m.`productName` = dup.`productName` AND m.brandid = dup.brandid AND m.origin = dup.origin
LEFT JOIN tags AS t ON t.productid = m.id
LEFT JOIN groups AS g ON g.productid = m.id
GROUP BY m.id
ORDER BY `productName`,brandid,origin
Any help and/or advice on how to achieve this is highly appreciated.
My guess is that the subquery is missing an aggregate function on the id field. You also need to group by productName, brandid and origin rather than by id, so try this:
SELECT m.*,dup.id AS dupid,GROUP_CONCAT(DISTINCT t.tagid ORDER BY t.tagid ASC) AS alltags,GROUP_CONCAT(DISTINCT g.groupid ORDER BY g.groupid ASC) AS groups
FROM `products` m
JOIN (SELECT min(id) as id,`productName`, brandid, origin, COUNT(*) AS c FROM products
GROUP BY `productName`, brandid, origin HAVING c > 1) dup ON m.`productName` = dup.`productName` AND m.brandid = dup.brandid AND m.origin = dup.origin
LEFT JOIN tags AS t ON t.productid = m.id
LEFT JOIN groups AS g ON g.productid = m.id
GROUP BY m.`productName`,m.brandid,m.origin
ORDER BY m.`productName`,m.brandid,m.origin
EDIT : You can use this query:
SELECT tt.*
FROM(
SELECT m.*,GROUP_CONCAT(DISTINCT t.tagid ORDER BY t.tagid ASC) AS alltags,GROUP_CONCAT(DISTINCT g.groupid ORDER BY g.groupid ASC) AS groups
FROM `products` m
LEFT JOIN tags AS t ON t.productid = m.id
LEFT JOIN groups AS g ON g.productid = m.id
GROUP BY m.id) tt
INNER JOIN
(SELECT productName,brandid,origin,alltags,groups
FROM
(SELECT m.*,GROUP_CONCAT(DISTINCT t.tagid ORDER BY t.tagid ASC) AS alltags,GROUP_CONCAT(DISTINCT g.groupid ORDER BY g.groupid ASC) AS groups
FROM `products` m
LEFT JOIN tags AS t ON t.productid = m.id
LEFT JOIN groups AS g ON g.productid = m.id
GROUP BY m.id) s
GROUP BY productName,brandid,origin,alltags,groups
HAVING COUNT(*) > 1) ss
ON(tt.productName = ss.productName and tt.brandid = ss.brandid and tt.origin = ss.origin
and tt.alltags = ss.alltags and tt.groups = ss.groups)
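The idea behind the query above (one signature per product, built from its own columns plus its full set of tags and groups, then grouping equal signatures) can be cross-checked outside the database. The following is only a minimal Python sketch, not the MySQL solution itself: the helper name find_duplicates is hypothetical, and the data covers just products 1 to 3 from the sample table in the question.

```python
from collections import defaultdict

# Sample rows taken from the question's table (products 1 to 3 only).
products = {
    1: ("prod", 1, "England"),
    2: ("prod", 1, "England"),
    3: ("prod", 1, "England"),
}
tags = [(1, 1), (2, 1), (1, 2), (2, 2), (1, 3), (2, 3)]  # (tagid, productid)
groups = [(1, 1), (2, 1), (1, 2), (2, 2), (1, 3)]         # (groupid, productid)


def find_duplicates(products, tags, groups):
    """Map every duplicated product id to the lowest id sharing its signature."""
    tag_sets, group_sets = defaultdict(set), defaultdict(set)
    for tagid, productid in tags:
        tag_sets[productid].add(tagid)
    for groupid, productid in groups:
        group_sets[productid].add(groupid)

    # Signature = product columns + complete tag set + complete group set.
    by_signature = defaultdict(list)
    for pid in sorted(products):
        signature = (products[pid], frozenset(tag_sets[pid]), frozenset(group_sets[pid]))
        by_signature[signature].append(pid)

    # Keep only signatures shared by more than one product.
    return {pid: ids[0] for ids in by_signature.values() if len(ids) > 1 for pid in ids}


duplicates = find_duplicates(products, tags, groups)
print(duplicates)
```

Products 1 and 2 share a signature and map to dupid 1, while product 3 is left out because its group set differs.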
A seemingly generic SQL query really left me clueless.
Here's the case.
I have 3 generic tables (simplified versions here):
Movie
id | title
-----------------------
1 | Evil Dead
-----------------------
2 | Bohemian Rhapsody
....
Genre
id | title
-----------------------
1 | Horror
-----------------------
2 | Comedy
....
Rating
id | title
-----------------------
1 | PG-13
-----------------------
2 | R
....
And 2 many-to-many tables to connect them:
Movie_Genre
movie_id | genre_id
Movie_Rating
movie_id | rating_id
The initial challenge was to write a query which allows me to fetch movies that belong to multiple genres (e.g. horror comedies or sci-fi action).
Thankfully, I was able to find this solution here
MySQL: Select records where joined table matches ALL values
However, what would be the correct option to fetch records that belong to multiple many-to-many tables? E.g. rated R horror comedies. Is there any way to do so without subquery (or a single one only)?
One method uses correlated subqueries:
select m.*
from movie m
where (select count(*)
from movie_genre mg
where mg.movie_id = m.id
) > 1 and
(select count(*)
from movie_rating mr
where mr.movie_id = m.id
) > 1 ;
With indexes on movie_genre(movie_id) and movie_rating(movie_id) this probably has quite reasonable performance.
The above is possibly the most efficient method. However, if you wanted to avoid subqueries, one method would be:
select mg.movie_id
from movie_genre mg join
movie_rating mr
on mg.movie_id = mr.movie_id
group by mg.movie_id
having count(distinct mg.genre_id) > 1 and
count(distinct mr.rating_id) > 1;
More efficient than the above is aggregating before the join:
select mg.movie_id
from (select movie_id
from movie_genre
group by movie_id
having count(*) >= 2
) mg join
(select movie_id
from movie_rating
group by movie_id
having count(*) >= 2
) mr
on mg.movie_id = mr.movie_id;
Although you state that you want to avoid subqueries, the irony is that the version with no subqueries probably has the worst performance of these three options.
E.g. rated R horror comedies
You can join all the tables together, aggregate by movie and filter with a HAVING clause:
select m.id, m.title
from movie m
inner join movie_genre mg on mg.movie_id = m.id
inner join genre g on g.id = mg.genre_id
inner join movie_rating mr on mr.movie_id = m.id
inner join rating r on r.id = mr.rating_id
group by m.id, m.title
having
max(r.title = 'R') = 1
and max(g.title = 'Horror') = 1
and max(g.title = 'Comedy') = 1
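The join-and-HAVING approach above can be tried on a small data set. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL (SQLite also evaluates a comparison such as r.title = 'R' to 0 or 1); the table names follow the question, and the movies and their genre/rating assignments are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genre  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE rating (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE movie_genre  (movie_id INTEGER, genre_id INTEGER);
    CREATE TABLE movie_rating (movie_id INTEGER, rating_id INTEGER);

    -- Made-up sample data, for illustration only.
    INSERT INTO movie  VALUES (1, 'Movie A'), (2, 'Movie B'), (3, 'Movie C');
    INSERT INTO genre  VALUES (1, 'Horror'), (2, 'Comedy');
    INSERT INTO rating VALUES (1, 'PG-13'), (2, 'R');
    -- Movie A: Horror + Comedy, R.  Movie B: Horror only, R.  Movie C: Horror + Comedy, PG-13.
    INSERT INTO movie_genre  VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2);
    INSERT INTO movie_rating VALUES (1, 2), (2, 2), (3, 1);
""")

# Each condition must hold for at least one joined row of the movie.
r_rated_horror_comedies = conn.execute("""
    SELECT m.id, m.title
    FROM movie m
    JOIN movie_genre mg ON mg.movie_id = m.id
    JOIN genre g ON g.id = mg.genre_id
    JOIN movie_rating mr ON mr.movie_id = m.id
    JOIN rating r ON r.id = mr.rating_id
    GROUP BY m.id, m.title
    HAVING MAX(r.title = 'R') = 1
       AND MAX(g.title = 'Horror') = 1
       AND MAX(g.title = 'Comedy') = 1
    ORDER BY m.id
""").fetchall()
print(r_rated_horror_comedies)
```

Only Movie A satisfies all three conditions; Movie B lacks the Comedy genre and Movie C lacks the R rating.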
You can also use a couple of exists conditions along with correlated subqueries:
select m.*
from movie m
where
exists (
select 1
from movie_rating mr
inner join rating r on r.id = mr.rating_id
where mr.movie_id = m.id and r.title = 'R')
and exists (
select 1
from movie_genre mg
inner join genre g on g.id = mg.genre_id
where mg.movie_id = m.id and g.title = 'Horror'
)
and exists (
select 1
from movie_genre mg
inner join genre g on g.id = mg.genre_id
where mg.movie_id = m.id and g.title = 'Comedy'
)
This is the query:
SELECT a.id, a.userName,if(o.userId=1,'C',if(i.userId=1,'I','N')) AS relation
FROM tbl_users AS a
LEFT JOIN tbl_contacts AS o ON a.id = o.contactId
LEFT JOIN tbl_invites AS i ON a.id = i.invitedId
ORDER BY relation
This returns the output as follows:
+----+--------------+-------------+
| ID | USERNAME | RELATION |
+----+--------------+-------------+
| 1 | ray | C |
+----+--------------+-------------+
| 2 | john | I |
+----+--------------+-------------+
| 1 | ray | N |
+----+--------------+-------------+
I need to remove the third row from the select query by checking if possible that id is duplicate. The priority is as follows:
C -> I -> N. So since there is already a "ray" with a C, I don't want it again with an I or N.
I tried adding distinct(a.id) but it doesn't work. How do I do this?
Why doesn't DISTINCT work for this?
From the specs you gave, all you have to do is group by id and userName, then pick the lowest value of relation you can find (since C < I < N):
SELECT a.id, a.userName, MIN(if(o.userId=1,'C',if(i.userId=1,'I','N'))) AS relation
FROM tbl_users AS a
LEFT JOIN tbl_contacts AS o ON a.id = o.contactId
LEFT JOIN tbl_invites AS i ON a.id = i.invitedId
GROUP BY a.id, a.username
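To see the MIN() approach at work, here is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL. The table layouts and rows are assumptions chosen to reproduce the duplicate "ray" row from the question, and IF() is written as a CASE expression because SQLite has no IF() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_users    (id INTEGER PRIMARY KEY, userName TEXT);
    CREATE TABLE tbl_contacts (userId INTEGER, contactId INTEGER);
    CREATE TABLE tbl_invites  (userId INTEGER, invitedId INTEGER);

    -- Assumed data: ray is a contact of users 1 and 3, john was invited by user 1.
    INSERT INTO tbl_users    VALUES (1, 'ray'), (2, 'john');
    INSERT INTO tbl_contacts VALUES (1, 1), (3, 1);
    INSERT INTO tbl_invites  VALUES (1, 2);
""")

# Without GROUP BY, ray appears twice ('C' and 'N'); MIN keeps the
# alphabetically lowest relation per user, which matches C -> I -> N.
relations = conn.execute("""
    SELECT a.id, a.userName,
           MIN(CASE WHEN o.userId = 1 THEN 'C'
                    WHEN i.userId = 1 THEN 'I'
                    ELSE 'N' END) AS relation
    FROM tbl_users AS a
    LEFT JOIN tbl_contacts AS o ON a.id = o.contactId
    LEFT JOIN tbl_invites AS i ON a.id = i.invitedId
    GROUP BY a.id, a.userName
    ORDER BY a.id
""").fetchall()
print(relations)
```

This only works because the priority order happens to coincide with alphabetical order.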
There are multiple ways to get the group-wise maximum/minimum as you can see in this manual page.
The first one suits you best if the order of the rows cannot be defined alphabetically.
If the desired order were, say, z-a-m (see Rams' comment), you would need the FIELD() function.
So your answer is
SELECT
a.id,
a.userName,
if(o.userId=1,'C',if(i.userId=1,'I','N')) AS relation
FROM tbl_users a
LEFT JOIN tbl_contacts AS o ON a.id = o.contactId
LEFT JOIN tbl_invites AS i ON a.id = i.invitedId
WHERE
if(o.userId=1,'C',if(i.userId=1,'I','N')) = (
SELECT
if(o.userId=1,'C',if(i.userId=1,'I','N')) AS relation
FROM tbl_users aa
LEFT JOIN tbl_contacts AS o ON aa.id = o.contactId
LEFT JOIN tbl_invites AS i ON aa.id = i.invitedId
WHERE aa.id = a.id AND aa.userName = a.userName
ORDER BY FIELD(relation, 'N', 'I', 'C') DESC
LIMIT 1
)
Note that you could also write ORDER BY FIELD(relation, 'C', 'I', 'N'), which is more readable and intuitive. I reversed it because, if relation could ever contain a value such as 'X', FIELD() would return 0 for it, since 'X' is not among the parameters, and it would therefore sort before 'C'. Sorting descending with the parameter order reversed prevents this.
I have organizations. Each organization can have members and projects.
I want to get list of organizations with number of members and projects.
For example,
Organization | Members | Projects | Action
------------------------------------------
Org 1 | 5 | 6 | Delete - Edit
Org 2 | 2 | 9 | Delete - Edit
I am using this query,
SELECT COUNT(m.id) as members, COUNT(p.id) as projects,
o.status,o.organization_name,o.logo, o.id as id
from tbl_organizations o
LEFT JOIN tbl_organization_members m ON (o.id = m.organization_id)
LEFT JOIN tbl_projects p ON (o.id = p.organization_id)
WHERE o.status= 'active' AND o.created_by= 1
But the number of projects in the output is equal to the number of members.
How can I get the sample output above with a query?
Try this way:
SELECT o.id as id, o.organization_name, cnt_m as members, cnt_p as projects
from tbl_organizations o
LEFT JOIN (
SELECT organization_id, COUNT(id) cnt_m
FROM tbl_organization_members
GROUP BY organization_id
) m ON (o.id = m.organization_id)
LEFT JOIN (
SELECT organization_id, COUNT(id) cnt_p
FROM tbl_projects
GROUP BY organization_id
) p ON (o.id = p.organization_id)
WHERE o.status= 'active' AND o.created_by= 1
This way you JOIN to an already aggregated version of member/project tables, so as to get the count of members/projects per organization_id.
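The difference between joining the raw tables and joining pre-aggregated ones can be reproduced with a tiny data set. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the column layouts and rows are assumptions, and COALESCE is added only so that an organization without members or projects would show 0 instead of NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_organizations (id INTEGER PRIMARY KEY, organization_name TEXT,
                                    status TEXT, created_by INTEGER);
    CREATE TABLE tbl_organization_members (id INTEGER PRIMARY KEY, organization_id INTEGER);
    CREATE TABLE tbl_projects (id INTEGER PRIMARY KEY, organization_id INTEGER);

    -- Made-up data: one organization with 2 members and 3 projects.
    INSERT INTO tbl_organizations VALUES (1, 'Org 1', 'active', 1);
    INSERT INTO tbl_organization_members VALUES (1, 1), (2, 1);
    INSERT INTO tbl_projects VALUES (1, 1), (2, 1), (3, 1);
""")

# Joining both child tables directly yields 2 x 3 = 6 rows, so both counts are 6.
naive_counts = conn.execute("""
    SELECT COUNT(m.id), COUNT(p.id)
    FROM tbl_organizations o
    LEFT JOIN tbl_organization_members m ON o.id = m.organization_id
    LEFT JOIN tbl_projects p ON o.id = p.organization_id
    WHERE o.status = 'active' AND o.created_by = 1
""").fetchone()

# Aggregating each child table first keeps one row per organization.
fixed_counts = conn.execute("""
    SELECT COALESCE(m.cnt_m, 0), COALESCE(p.cnt_p, 0)
    FROM tbl_organizations o
    LEFT JOIN (SELECT organization_id, COUNT(id) AS cnt_m
               FROM tbl_organization_members GROUP BY organization_id) m
           ON o.id = m.organization_id
    LEFT JOIN (SELECT organization_id, COUNT(id) AS cnt_p
               FROM tbl_projects GROUP BY organization_id) p
           ON o.id = p.organization_id
    WHERE o.status = 'active' AND o.created_by = 1
""").fetchone()

print(naive_counts, fixed_counts)
```

The naive query counts every member/project combination, which explains why both columns showed the same number in the question.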
Group by the organisation columns and count distinct IDs:
SELECT o.status,o.organization_name, o.logo, o.id as id,
COUNT(distinct m.id) as members, COUNT(distinct p.id) as projects
from tbl_organizations o
LEFT JOIN tbl_organization_members m ON (o.id = m.organization_id)
LEFT JOIN tbl_projects p ON (o.id = p.organization_id)
WHERE o.status= 'active'
AND o.created_by= 1
GROUP BY o.status, o.organization_name, o.logo, o.id
You can use a correlated subquery:
SELECT
o.id as Organization,
(SELECT COUNT(*) FROM tbl_organization_members WHERE organization_id = o.id) as members,
(SELECT COUNT(*) FROM tbl_projects WHERE organization_id = o.id) as projects
FROM
tbl_organizations o
WHERE
o.status= 'active' AND o.created_by = 1
I have this query which lists out IDs from "pages" on our site.
SELECT mdl_page.id
FROM mdl_page, mdl_log, mdl_user
WHERE mdl_log.module = "page"
AND mdl_log.action = "view"
AND mdl_user.id = mdl_log.userid
AND mdl_log.info = mdl_page.id
AND mdl_log.course = 178
The result is simple:
| ID |
|-----|
| 3 |
| 4 |
| 7 |
| 11 |
Notice the jumps in the count. I'm trying to get something like this:
| ID | NEXT ID |
|-----|---------|
| 3 | 4 |
| 4 | 7 |
| 7 | 11 |
| 11 | 12 |
Can anyone point me in the right direction for this?
UPDATE
One twist, the system (not my own) I have to run the query through only allows queries that begin with 'SELECT'.
I can think of two ways. The first uses a correlated subquery: the subquery compares its id against the id from the main query, sorts in ascending order and limits the result to one row.
SELECT
p.id ,
(SELECT
p1.id
FROM mdl_page p1
JOIN mdl_log l1 ON (l1.info = p1.id)
JOIN mdl_user u1 ON (u1.id = l1.userid)
WHERE l1.module = "page"
AND l1.action = "view"
AND l1.course = 178
AND p1.id > p.id
ORDER BY p1.id ASC LIMIT 1) NEXT_ID
FROM mdl_page p
JOIN mdl_log l ON (l.info = p.id)
JOIN mdl_user u ON (u.id = l.userid)
WHERE l.module = "page" AND l.action = "view" AND l.course = 178
ORDER BY p.id
The second way is a rank query. I left join the same query to itself with a less-than condition, ON (t.id < t1.id), which produces multiple rows per id, such as (3,4), (3,7), (3,11). Only the first pair, (3,4), is wanted, so I number the rows within each group using user variables, and the outer query keeps only the row with rownum = 1 for each group.
SELECT t3.id,t3.NEXT_ID FROM (
SELECT t.id id, t1.id NEXT_ID ,
@r:= CASE WHEN @g = t.id THEN @r +1 ELSE 1 END rownum,
@g:= t.id
FROM
(SELECT
p.id
FROM
mdl_page p
JOIN mdl_log l ON (l.info = p.id)
JOIN mdl_user u ON (u.id = l.userid)
WHERE l.module = "page"
AND l.action = "view"
AND l.course = 178
ORDER BY p.id
) t
LEFT JOIN
(SELECT
p.id
FROM
mdl_page p
JOIN mdl_log l ON (l.info = p.id)
JOIN mdl_user u ON (u.id = l.userid)
WHERE l.module = "page"
AND l.action = "view"
AND l.course = 178
ORDER BY p.id ) t1 ON (t.`id` < t1.id)
CROSS JOIN (SELECT @g:=0,@r:=0) t2
ORDER BY t.`ID` , t1.ID
) t3
WHERE t3.rownum = 1
In the result set, next_id is NULL for 11 when no record with an id greater than 11 exists; in other words, the last record has NULL in the next_id column.
ID NEXT_ID
3 4
4 7
7 11
11 NULL
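The correlated-subquery variant can be tried in isolation on the four ids from the question. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the single-column table viewed_page is hypothetical and replaces the mdl_page/mdl_log/mdl_user joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-column table holding the ids returned by the original query.
conn.execute("CREATE TABLE viewed_page (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO viewed_page VALUES (?)", [(3,), (4,), (7,), (11,)])

# For every id, the subquery picks the smallest id that is greater than it.
next_ids = conn.execute("""
    SELECT v.id,
           (SELECT v2.id
            FROM viewed_page v2
            WHERE v2.id > v.id
            ORDER BY v2.id ASC
            LIMIT 1) AS next_id
    FROM viewed_page v
    ORDER BY v.id
""").fetchall()
print(next_ids)
```

The last row comes back with None (NULL) as its next id, matching the result set above.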
Perhaps you should create a temporary table, which is pretty much the same as the query you're running and erase the first line?
Then run your query and join it with the temporary table?
I have to group my anime index according to their AniDB ID and show the values in a DESCENDING order according to file auto increment id.
Here's what I did currently:
SELECT
f.id, f.category, f.anidb, f.mal_id, COUNT( * ) AS dupes, f.filename,
a.titles, a.synopsis, a.episodes, a.image, a.rating,
c.name as cat_name, c.id as categoryid
FROM table_files f
LEFT JOIN table_anidb a ON a.id = f.anidb
LEFT JOIN table_categories c ON c.id = f.category
GROUP BY a.id ORDER BY f.id DESC
PROBLEM:
I have 8 episodes of Naruto. Episode 8's ID is 204, and episode 1 has ID 160. The query returns this:
id | anidb | filename | dupes | cat_name
--------------------------------------------------------
201 | 8692 | SAO | 1 | Series
200 | 9251 | RYO | 1 | Movie
.....
.......
160 | 239 | Naruto ep.1 | 8 | Series
But I want Naruto episode 8 to be shown at the top of the results, instead of episode 1 at the bottom.
How do I group by anidb and mal_id at the same time with an OR logic? So that the grouping can be done even if there is not any anidb ID provided.
Ad. 1.
Since id, anidb and filename are all in one table, I'm afraid you can't get away from joining to a subquery:
SQLFiddle
SELECT f.id, f.anidb, f.filename
FROM files f
JOIN
(SELECT MAX(id) as id FROM files GROUP BY anidb) AS f2
ON f2.id = f.id
ORDER BY f.id DESC
(data flattened for the sake of readability, but you can get the general idea)
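The join against MAX(id) per anidb can be checked with the rows from the question. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the flattened files table has only three columns, and the single episode-8 row stands in for the remaining Naruto episodes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, anidb INTEGER, filename TEXT);

    -- Rows based on the question; only two of the Naruto episodes are included.
    INSERT INTO files VALUES
        (160, 239,  'Naruto ep.1'),
        (200, 9251, 'RYO'),
        (201, 8692, 'SAO'),
        (204, 239,  'Naruto ep.8');
""")

# The subquery finds the newest file id per anidb; the join fetches that row.
latest_files = conn.execute("""
    SELECT f.id, f.anidb, f.filename
    FROM files f
    JOIN (SELECT MAX(id) AS id FROM files GROUP BY anidb) AS f2
      ON f2.id = f.id
    ORDER BY f.id DESC
""").fetchall()
print(latest_files)
```

Naruto is now represented by episode 8 (id 204) and therefore sorts to the top.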
Ad. 2.
As for the second problem, you really just have to add second grouping column to the above joined subquery:
SQLFiddle
SELECT f.id, f.anidb, f.mal_id, f.filename
FROM files f
JOIN
(SELECT MAX(id) as id FROM files GROUP BY anidb, mal_id) AS f2 on f2.id = f.id
ORDER BY f.id DESC
Note that GROUP BY treats NULLs as equal to each other (even though the comparison NULL = NULL is not true), so rows with a NULL anidb are grouped by their mal_id, and rows where both columns are NULL fall into a single group.
For the first problem you can use ORDER BY dupes
SELECT
f.id, f.category, f.anidb, f.mal_id, COUNT( * ) AS dupes, f.filename,
a.titles, a.synopsis, a.episodes, a.image, a.rating,
c.name as cat_name, c.id as categoryid
FROM table_files f
LEFT JOIN table_anidb a ON a.id = f.anidb
LEFT JOIN table_categories c ON c.id = f.category
GROUP BY a.id ORDER BY dupes DESC
For the second problem you can use CASE to check if f.anidb is null
SELECT
f.id, f.category, f.anidb, f.mal_id, COUNT( * ) AS dupes, f.filename,
a.titles, a.synopsis, a.episodes, a.image, a.rating,
c.name as cat_name, c.id as categoryid
FROM table_files f
LEFT JOIN table_anidb a ON a.id = f.anidb
LEFT JOIN table_categories c ON c.id = f.category
GROUP BY
(CASE WHEN f.anidb IS NULL THEN f.mal_id ELSE f.anidb END )
ORDER BY dupes DESC