How to avoid distinct - mysql

I have a query that works when I use DISTINCT. However, I have a feeling I could rewrite it in a way that avoids DISTINCT, which would make the query easier (quicker) for the database to process.
If there is no point in rewriting the query, please explain why; if there is, please look at the simplified query below and give me a hint on how to reformulate it so that I don't get duplicates in the first place.
SELECT Us.user_id, COUNT( DISTINCT Or.order_id ) AS orders
FROM users AS Us
LEFT JOIN events AS Ev ON Ev.user_id = Us.user_id
LEFT JOIN orders AS Or ON Or.event_id = Ev.event_id
OR Or.user_id = Us.user_id
GROUP BY Us.user_id
Short description of the query: I have tables of users, their events, and orders. Orders sometimes have user_id set, but mostly it is NULL and the order has to be connected to the user via the events table.
Edit:
These are results of the simplified query I wrote, first without distinct and then including distinct.
user_id orders
3952 263
3953 7
3954 2
3955 6
3956 1
3957 0
...
user_id orders
3952 79
3953 7
3954 2
3955 6
3956 1
3957 0
...
Problem fixed:
SELECT COALESCE( Or.user_id, Ev.user_id ) AS user, COUNT( Or.order_id ) AS orders
FROM orders AS Or
LEFT JOIN events AS Ev ON Ev.event_id = Or.event_id
GROUP BY COALESCE( Or.user_id, Ev.user_id )

If an order can be associated with multiple events, or a user with an event multiple times, then it is possible for the same order to be associated with the same user multiple times. In this scenario, using DISTINCT will count that order only once per user whereas omitting it will count that order once for each association with the user.
If you're after the former, then your existing query is your best option.
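To make the duplication concrete, here is a minimal sketch that runs the question's queries on a tiny made-up data set. It uses Python's built-in sqlite3 as a stand-in for MySQL, the rows are invented for illustration, and the alias o replaces Or because OR is a reserved word. User 1 has two events and one order attached directly through user_id, so that order matches once per event row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (user_id INTEGER PRIMARY KEY);
CREATE TABLE events (event_id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, event_id INTEGER, user_id INTEGER);
INSERT INTO users  VALUES (1), (2);
INSERT INTO events VALUES (10, 1), (11, 1), (20, 2);
-- order 100 is linked through an event, order 101 directly through user_id
INSERT INTO orders VALUES (100, 10, NULL), (101, NULL, 1), (102, 20, NULL);
""")

# The question's join, with an optional DISTINCT inside COUNT.
joined = """
SELECT us.user_id, COUNT({distinct} o.order_id) AS orders
FROM users AS us
LEFT JOIN events AS ev ON ev.user_id = us.user_id
LEFT JOIN orders AS o ON o.event_id = ev.event_id OR o.user_id = us.user_id
GROUP BY us.user_id
ORDER BY us.user_id
"""
plain = conn.execute(joined.format(distinct="")).fetchall()
deduped = conn.execute(joined.format(distinct="DISTINCT")).fetchall()

# The "Problem fixed" query: start from orders so each order appears once.
from_orders = conn.execute("""
SELECT COALESCE(o.user_id, ev.user_id) AS user, COUNT(o.order_id) AS orders
FROM orders AS o
LEFT JOIN events AS ev ON ev.event_id = o.event_id
GROUP BY COALESCE(o.user_id, ev.user_id)
ORDER BY 1
""").fetchall()

print(plain)        # [(1, 3), (2, 1)]  order 101 is counted once per event of user 1
print(deduped)      # [(1, 2), (2, 1)]
print(from_orders)  # [(1, 2), (2, 1)]
```

Note that the orders-first version returns no row for a user without orders, whereas the original LEFT JOIN from users returns such users with a count of 0.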

You are not selecting anything from the users table or the events table, so why join them? Your last "OR" clause explicitly references a user_id column on orders. I would hope your orders table has an index on the user ID placing the order; then you could just do
select
user_id,
count(*) as Orders
from
orders
group by
user_id

Related

Left join with 1 result for each row in the left table - without GROUP BY

I have 2 tables: (I deleted irrelevant fields)
Traffic: id, impressions, country
Events: trafficID,sells
For each traffic row there might be 0 or more rows in events.
When I select all rows from the traffic table and left join the events table to get the total SUM of sells for each traffic row, some rows in the result set are duplicated, because a few traffic rows have more than one event.
The easy solution is GROUP BY traffic.id.
Now let's say I want to group by country and select the SUM of impressions and sells for each country. If I GROUP BY traffic.id I won't get the result set I want, and if I don't group by traffic.id I will get duplicate traffic rows and thus a wrong SUM.
Any elegant way to solve this? I am using php pdo mysql, innodb engine in case it is relevant.
Group the Events table prior to joining?
SELECT t.country, SUM(t.impressions), SUM(e.totalSells)
FROM Traffic t LEFT JOIN (
SELECT trafficID, SUM(sells) AS totalSells
FROM Events
GROUP BY trafficID
) e ON e.trafficID = t.id
GROUP BY t.country
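A minimal sketch of why grouping Events first helps, using Python's built-in sqlite3 in place of MySQL and a few invented rows. Traffic row 1 has two events, so the plain join repeats its impressions; the pre-aggregated subquery guarantees at most one joined row per traffic row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Traffic (id INTEGER PRIMARY KEY, impressions INTEGER, country TEXT);
CREATE TABLE Events  (trafficID INTEGER, sells INTEGER);
INSERT INTO Traffic VALUES (1, 100, 'US'), (2, 50, 'US'), (3, 70, 'DE');
-- traffic row 1 has two events, row 2 has one, row 3 has none
INSERT INTO Events VALUES (1, 2), (1, 3), (2, 4);
""")

# Joining first: traffic row 1 appears twice, so its impressions are double counted.
naive = conn.execute("""
SELECT t.country, SUM(t.impressions), SUM(e.sells)
FROM Traffic t LEFT JOIN Events e ON e.trafficID = t.id
GROUP BY t.country
ORDER BY t.country
""").fetchall()

# Aggregating Events per trafficID first: one joined row per traffic row.
pre_aggregated = conn.execute("""
SELECT t.country, SUM(t.impressions), SUM(e.totalSells)
FROM Traffic t LEFT JOIN (
    SELECT trafficID, SUM(sells) AS totalSells
    FROM Events
    GROUP BY trafficID
) e ON e.trafficID = t.id
GROUP BY t.country
ORDER BY t.country
""").fetchall()

print(naive)           # [('DE', 70, None), ('US', 250, 9)]
print(pre_aggregated)  # [('DE', 70, None), ('US', 150, 9)]
```

A country with no events at all gets NULL (None here) for the sells total; wrapping the sum in COALESCE(..., 0) would turn that into 0 if that is preferred.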

SQL Query to count sessions without repeating lines

I have two tables that are joined on a field called user_id. The first table, sessions, can have multiple rows per user for the same day. I'm trying to find a way of selecting the total number of sessions for a day while counting each user only once.
Example:
Table sessions
ID | user_id | datestart
1 1 2014-08-05
2 1 2014-08-05
3 2 2014-08-05
As you can see there are two lines that are repeated (the first and second). If I query SELECT COUNT(sess.id) AS total this will retrieve 3, but I want it to retrieve 2 because the first two lines have the same user_id so it must count as one.
Using the clause Group By will retrieve two different lines: 2 and 1, which is also incorrect.
You can view a full example working at SQLFiddle.
Is there any way of solving this with a query alone, or do I need to do it in application code?
I think you are looking for count(distinct):
SELECT COUNT(distinct user_id) AS total
FROM sessions sess INNER JOIN
users user
ON user.id = sess.user_id
WHERE user.equipment_id = 1 AND
sess.datestart = CURDATE();
If I understand the problem correctly, you want the number of users with sessions, rather than number of unique sessions. Use DISTINCT:
SELECT COUNT(DISTINCT(user_id)) FROM sessions,users WHERE user_id=users.id
Try this way:
SELECT COUNT(distinct sess.user_id) AS total
FROM sessions AS sess
INNER JOIN users AS user ON user.id = sess.user_id
WHERE user.equipment_id = 1 AND sess.datestart = CURDATE()
Sql Fiddle
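For reference, a minimal sketch of the three counts discussed in this thread, run against the example rows from the question. It uses Python's built-in sqlite3 rather than MySQL, leaves out the users join for brevity, and binds a fixed date instead of CURDATE() so the result does not depend on the day it is run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (id INTEGER PRIMARY KEY, user_id INTEGER, datestart TEXT);
INSERT INTO sessions VALUES
    (1, 1, '2014-08-05'),
    (2, 1, '2014-08-05'),
    (3, 2, '2014-08-05');
""")
day = ("2014-08-05",)

# COUNT(id) counts session rows.
total_rows = conn.execute(
    "SELECT COUNT(id) FROM sessions WHERE datestart = ?", day
).fetchone()[0]

# GROUP BY user_id returns one count per user, not one overall total.
per_user = conn.execute(
    "SELECT COUNT(id) FROM sessions WHERE datestart = ? "
    "GROUP BY user_id ORDER BY user_id", day
).fetchall()

# COUNT(DISTINCT user_id) counts each user once.
total_users = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM sessions WHERE datestart = ?", day
).fetchone()[0]

print(total_rows)   # 3
print(per_user)     # [(2,), (1,)]
print(total_users)  # 2
```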

MySQL - 3 tables, is this complex join even possible?

I have three tables: users, groups and relation.
Table users with fields: usrID, usrName, usrPass, usrPts
Table groups with fields: grpID, grpName, grpMinPts
Table relation with fields: uID, gID
A user can be placed in a group in two ways:
if he collects the group's minimum number of points (users.usrPts > groups.grpMinPts ORDER BY groups.grpMinPts DESC LIMIT 1)
if his relation to the group is added manually in the relation table (user ID stored as uID and group ID stored as gID)
Can I write one single query that determines, for every user (or for one specific user), which group he belongs to, where a manual relation (from the relation table) takes priority over the group derived from comparing usrPts with grpMinPts? Also, I do not want a user to be shown twice (once with his group by points and once with his related group).
Thanks in advance! :) I tried:
SELECT * FROM users LEFT JOIN (relation LEFT JOIN groups ON relation.gID = groups.grpID) ON users.usrID = relation.uID
Using this I managed to extract the manually specified relations (from the relation table), but I have no idea how to include user points while respecting the priority described above. I know how to do this with a few separate queries in PHP, which is simple, but I am curious whether it can be done in one single query.
EDIT TO ADD:
Thanks to the really educational COALESCE technique that @GordonLinoff provided, I managed to get this query working as I expected. So, here it goes:
SELECT o.usrID, o.usrName, o.usrPass, o.usrPts, t.grpID, t.grpName
FROM (
SELECT u.*, COALESCE(relationgroupid,groupid) AS thegroupid
FROM (
SELECT u.*, (
SELECT grpID
FROM groups g
WHERE u.usrPts > g.grpMinPts
ORDER BY g.grpMinPts DESC
LIMIT 1
) AS groupid, (
SELECT grpUID
FROM relation r
WHERE r.userUID = u.usrID
) AS relationgroupid
FROM users u
)u
)o
JOIN groups t ON t.grpID = o.thegroupid
Also, if you are wondering, as I did, whether this approach is faster or slower than running three queries and processing the results in PHP: it is slightly faster. The average time to execute this query and show the results on a web page is 14 ms; three simple queries, processed in PHP and shown on a web page, took 21 ms. The averages are based on 10 runs, and the execution time was essentially constant across them.
Here is an approach that uses correlated subqueries to get each of the values. It then chooses the appropriate one using the precedence rule that if the relations exist use that one, otherwise use the one from the groups table:
select u.*,
coalesce(relationgroupid, groupid) as thegroupid
from (select u.*,
(select grpid from groups g where u.usrPts > g.grpMinPts order by g.grpMinPts desc limit 1
) as groupid,
(select gID from relation r where r.uID = u.usrID
) as relationgroupid
from users u
) u
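A runnable sketch of the same precedence rule, using Python's built-in sqlite3 instead of MySQL and the column names given in the question (usrID, usrPts, grpID, grpMinPts, uID, gID). The sample users, groups and points are invented. The table name is quoted as "groups" here; as far as I know GROUPS is also a reserved word in MySQL 8.0, so it may need backticks there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users    (usrID INTEGER PRIMARY KEY, usrName TEXT, usrPts INTEGER);
CREATE TABLE "groups" (grpID INTEGER PRIMARY KEY, grpName TEXT, grpMinPts INTEGER);
CREATE TABLE relation (uID INTEGER, gID INTEGER);
INSERT INTO "groups" VALUES (1, 'Bronze', 0), (2, 'Silver', 100), (3, 'Gold', 500);
INSERT INTO users    VALUES (1, 'ann', 150), (2, 'bob', 600), (3, 'cy', 50);
-- cy has only Bronze-level points but is manually assigned to Gold
INSERT INTO relation VALUES (3, 3);
""")

rows = conn.execute("""
SELECT u.usrName, g.grpName
FROM (
    SELECT u.*,
           -- highest group whose minimum the user's points exceed
           (SELECT grpID FROM "groups" g
             WHERE u.usrPts > g.grpMinPts
             ORDER BY g.grpMinPts DESC LIMIT 1) AS groupid,
           -- manually assigned group, NULL when there is none
           (SELECT gID FROM relation r WHERE r.uID = u.usrID) AS relationgroupid
    FROM users u
) u
-- COALESCE picks the manual relation first and falls back to the points group
JOIN "groups" g ON g.grpID = COALESCE(u.relationgroupid, u.groupid)
ORDER BY u.usrID
""").fetchall()

print(rows)  # [('ann', 'Silver'), ('bob', 'Gold'), ('cy', 'Gold')]
```

One caveat: if a user can have more than one row in relation, the scalar subquery would need its own ORDER BY and LIMIT 1, since MySQL rejects a scalar subquery that returns more than one row.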
Try something like this
select user.name, group.name
from group
join relation on relation.gid = group.gid
join user on user.uid = relation.uid
union
select user.name, g1.name
from group g1
join group g2 on g2.minpts > g1.minpts
join user on user.pts between g1.minpts and g2.minpts

MySQL Join Query - joining tables into themselves many times

I have 4 queries I need to execute in order to suggest items to users based on items they've already expressed an interest in:
Select 5 random items the user already likes
SELECT item_id
FROM user_items
WHERE user_id = :user_person
ORDER BY RAND()
LIMIT 5
Select 50 people who like the same items
SELECT user_id
FROM user_items
WHERE user_id != :user_person
AND item_id = :selected_item_list
LIMIT 50
SELECT all items that the original user likes
SELECT item_id
FROM user_items
WHERE user_id = :user_person
SELECT 5 items the user doesn't already like to suggest to the user
SELECT item_id
FROM user_items
WHERE user_id = :user_id_list
AND item_id != :item_id_list
LIMIT 5
What I would like to know is how I would execute this as one query.
There are a few reasons for me wanting to do this:
at the moment, I have to execute the 'select 50 people' query 5 times and pick the top 50 people from it
I then have to execute the 'select 5 items' query 50 * (number of items the initial user likes) times
Once the query has been executed, I intend to store the result in a cookie (if the user consents to cookies; otherwise they don't get the 'item suggestion' at all), with the key being a hash of the query, meaning it will only fire once a day / once a week (that's why I return 5 suggestions and select a key at random to display)
Basically, if anybody knows how to write these queries as one query, could you show me and explain what is going on in the query?
This will select all items you need:
SELECT DISTINCT ui_items.item_id
FROM user_items AS ui_own
JOIN user_items AS ui_others ON ui_own.item_id = ui_others.item_id
JOIN user_items AS ui_items ON ui_others.user_id = ui_items.user_id
WHERE ui_own.user_id = :user_person
AND ui_others.user_id <> :user_person
AND ui_items.item_id <> ui_own.item_id
(Please check whether the results are exactly the same as with your version; I tested it on a very small fake data set.)
Next you just cache this list and show 5 items randomly, because ORDER BY RAND() is VERY inefficient (non-deterministic query => no caching)
EDIT: Added the DISTINCT to not show duplicate rows.
You can also return the most popular suggestions in descending order of popularity by removing DISTINCT and adding the following to the end of the query:
GROUP BY ui_items.item_id
ORDER BY COUNT(*) DESC
LIMIT 20
This returns the 20 most popular items.
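Following up on the request to check the results, here is a small sketch on invented data, using Python's built-in sqlite3 in place of MySQL. On this data set the condition ui_items.item_id <> ui_own.item_id only excludes the one item that produced the match, so items the user already likes can come back through a different shared item. The second query swaps in a NOT IN subquery to exclude everything the user already likes; that replacement is my assumption about the intended behaviour, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_items (user_id INTEGER, item_id INTEGER,
                         PRIMARY KEY (user_id, item_id));
INSERT INTO user_items VALUES
    (1, 1), (1, 2),          -- the user we want suggestions for
    (2, 1), (2, 2), (2, 3),  -- shares items 1 and 2
    (3, 2), (3, 3), (3, 4),  -- shares item 2
    (4, 5);                  -- shares nothing
""")
params = {"user_person": 1}

# Query as posted: returns items 1 and 2, which user 1 already likes.
as_posted = sorted(row[0] for row in conn.execute("""
SELECT DISTINCT ui_items.item_id
FROM user_items AS ui_own
JOIN user_items AS ui_others ON ui_own.item_id = ui_others.item_id
JOIN user_items AS ui_items ON ui_others.user_id = ui_items.user_id
WHERE ui_own.user_id = :user_person
  AND ui_others.user_id <> :user_person
  AND ui_items.item_id <> ui_own.item_id
""", params))

# Variant: exclude every item the user already likes, ranked by popularity.
ranked = conn.execute("""
SELECT ui_items.item_id, COUNT(*) AS votes
FROM user_items AS ui_own
JOIN user_items AS ui_others ON ui_own.item_id = ui_others.item_id
JOIN user_items AS ui_items ON ui_others.user_id = ui_items.user_id
WHERE ui_own.user_id = :user_person
  AND ui_others.user_id <> :user_person
  AND ui_items.item_id NOT IN (
        SELECT item_id FROM user_items WHERE user_id = :user_person)
GROUP BY ui_items.item_id
ORDER BY COUNT(*) DESC, ui_items.item_id
LIMIT 20
""", params).fetchall()

print(as_posted)  # [1, 2, 3, 4]
print(ranked)     # [(3, 3), (4, 1)]
```

The votes column counts one vote per (shared item, other user) pair, so a user who shares several items with you weighs more heavily; whether that weighting is wanted is a design choice.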

MySQL Group By and HAVING

I'm a MySQL query noobie so I'm sure this is a question with an obvious answer.
But, I was looking at these two queries. Will they return different result sets? I understand that the sorting process would commence differently, but I believe they will return the same results with the first query being slightly more efficient?
Query 1: HAVING, then AND
SELECT user_id
FROM forum_posts
GROUP BY user_id
HAVING COUNT(id) >= 100
AND user_id NOT IN (SELECT user_id FROM banned_users)
Query 2: WHERE, then HAVING
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN(SELECT user_id FROM banned_users)
GROUP BY user_id
HAVING COUNT(id) >= 100
Actually the first query will be less efficient (HAVING applied after WHERE).
UPDATE
Some pseudo code to illustrate how your queries are executed ([very] simplified version).
First query:
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_users
3. Group, count, etc.
4. Exclude records from the first result set if they are present in the second
Second query:
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_users
3. Exclude records from the first result set if they are present in the second
4. Group, count, etc.
The order of steps 1 and 2 is not important; MySQL can choose whatever it thinks is better. The important difference is in steps 3 and 4. HAVING is applied after GROUP BY. Grouping is usually more expensive than joining (excluding records can be considered a join operation in this case), so the fewer records there are to group, the better the performance.
You already have answers saying that the two queries return the same results, and various opinions on which one is more efficient.
My opinion is that there will be a difference in efficiency (speed) only if the optimizer comes up with different plans for the two queries. I think the optimizers in the latest MySQL versions are smart enough to find the same plan for either query, so there will be no difference at all, but of course one can test this, either by looking at the execution plans with EXPLAIN or by running the two queries against some test tables.
I would use the second version in any case, just to play safe.
Let me add that:
COUNT(*) is usually more efficient than COUNT(notNullableField) in MySQL. Until that is fixed in future MySQL versions, use COUNT(*) where applicable.
Therefore, you can also use:
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN
( SELECT user_id FROM banned_users )
GROUP BY user_id
HAVING COUNT(*) >= 100
There are also other ways to get the same intermediate result as NOT IN before GROUP BY is applied.
Using LEFT JOIN / NULL :
SELECT fp.user_id
FROM forum_posts AS fp
LEFT JOIN banned_users AS bu
ON bu.user_id = fp.user_id
WHERE bu.user_id IS NULL
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Using NOT EXISTS :
SELECT fp.user_id
FROM forum_posts AS fp
WHERE NOT EXISTS
( SELECT *
FROM banned_users AS bu
WHERE bu.user_id = fp.user_id
)
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Which of the 3 methods is faster depends on your table sizes and a lot of other factors, so best is to test with your data.
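As a quick sanity check that these formulations agree, here is a sketch that runs the HAVING version from the question and the three WHERE-side variants on a toy data set. It uses Python's built-in sqlite3 rather than MySQL, so it says nothing about MySQL's execution plans or timings, and the post threshold is lowered from 100 to 2 (bound as the hypothetical parameter :min_posts) to keep the sample data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE forum_posts  (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
CREATE TABLE banned_users (user_id INTEGER PRIMARY KEY);
-- user 1: 3 posts, user 2: 2 posts (banned), user 3: 1 post
INSERT INTO forum_posts (user_id) VALUES (1), (1), (1), (2), (2), (3);
INSERT INTO banned_users VALUES (2);
""")

queries = {
    "having_then_and": """
        SELECT user_id FROM forum_posts
        GROUP BY user_id
        HAVING COUNT(id) >= :min_posts
           AND user_id NOT IN (SELECT user_id FROM banned_users)""",
    "where_not_in": """
        SELECT user_id FROM forum_posts
        WHERE user_id NOT IN (SELECT user_id FROM banned_users)
        GROUP BY user_id
        HAVING COUNT(*) >= :min_posts""",
    "left_join_null": """
        SELECT fp.user_id FROM forum_posts AS fp
        LEFT JOIN banned_users AS bu ON bu.user_id = fp.user_id
        WHERE bu.user_id IS NULL
        GROUP BY fp.user_id
        HAVING COUNT(*) >= :min_posts""",
    "not_exists": """
        SELECT fp.user_id FROM forum_posts AS fp
        WHERE NOT EXISTS (SELECT * FROM banned_users AS bu
                          WHERE bu.user_id = fp.user_id)
        GROUP BY fp.user_id
        HAVING COUNT(*) >= :min_posts""",
}

# Run every variant with the same threshold and collect the rows it returns.
results = {name: conn.execute(sql, {"min_posts": 2}).fetchall()
           for name, sql in queries.items()}

for name, rows in results.items():
    print(name, rows)  # every variant prints [(1,)]
```

To compare performance on real data, the same four statements can be prefixed with EXPLAIN in MySQL, as suggested above.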
HAVING conditions are applied to the grouped by results, and since you group by user_id, all of their possible values will be present in the grouped result, so the placing of the user_id condition is not important.
To me, the second query is more efficient because it reduces the number of records that GROUP BY and HAVING have to process.
Alternatively, you may try the following query to avoid using IN:
SELECT `fp`.`user_id`
FROM `forum_posts` `fp`
LEFT JOIN `banned_users` `bu` ON `fp`.`user_id` = `bu`.`user_id`
WHERE `bu`.`user_id` IS NULL
GROUP BY `fp`.`user_id`
HAVING COUNT(`fp`.`id`) >= 100
Hope this helps.
No, it does not give the same results.
The first query filters records by the COUNT(id) condition first.
The other query filters records and then applies the HAVING clause.
The second query is written correctly.