MySQL, DISTINCT in SUM operation - mysql

I am currently trying to calculate the number of unique user visits in my application, broken down by user gender. Here is an example query that counts all visits (not unique ones):
SELECT
DATE(v.visited_at) AS visit_date,
SUM(IF(u.gender = 'M', 1, 0)) AS male_visit,
SUM(IF(u.gender = 'F', 1, 0)) AS female_visit,
SUM(IF(u.gender = '' OR u.gender IS NULL, 1, 0)) AS unknown_visit
FROM
visits v
INNER JOIN users u ON v.user_id = u.id
WHERE
DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
AND v.duration > 30
GROUP BY
DATE(v.visited_at)
I tried using subqueries with COUNT(DISTINCT), and it works, but it's 4 times slower.
SELECT
DATE(visited_at) as visit_date,
(SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE u.gender = 'M' AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS male_visit,
(SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE u.gender = 'F' AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS female_visit,
(SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE (u.gender = '' OR u.gender IS NULL) AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS unknown_visit
FROM
visits v
WHERE
DATE(visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
GROUP BY
DATE(visited_at)
Any suggestions on this?

COUNT(DISTINCT) is always going to be slower than COUNT(). You can try:
SELECT DATE(v.visited_at) AS visit_date,
COUNT(DISTINCT CASE WHEN u.gender = 'M' THEN u.id END) AS male_visit,
COUNT(DISTINCT CASE WHEN u.gender = 'F' THEN u.id END) AS female_visit,
COUNT(DISTINCT CASE WHEN u.gender = '' OR u.gender IS NULL THEN u.id END) AS unknown_visit
FROM visits v INNER JOIN
users u
ON v.user_id = u.id
WHERE DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY) AND
v.duration > 30
GROUP BY DATE(v.visited_at);
I don't know if it will be much faster, though.
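To check that the conditional COUNT(DISTINCT CASE ...) form really counts each user once per day, here is a small sketch. It runs the same shape of query against an in-memory SQLite database through Python's sqlite3 module, as a stand-in for MySQL; the sample rows are made up, and the 30-day filter is left out so the result does not depend on the current date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE visits (user_id INTEGER, visited_at TEXT, duration INTEGER);
INSERT INTO users VALUES (1, 'M'), (2, 'F'), (3, NULL), (4, 'M');
INSERT INTO visits VALUES
  (1, '2024-05-01 09:00:00', 45),
  (1, '2024-05-01 17:30:00', 60),  -- same user, same day: counted once
  (2, '2024-05-01 10:15:00', 90),
  (3, '2024-05-01 11:00:00', 31),
  (4, '2024-05-01 12:00:00', 10),  -- too short, filtered out
  (4, '2024-05-02 08:00:00', 40);
""")

# CASE yields u.id only for the matching gender and NULL otherwise;
# COUNT(DISTINCT ...) ignores the NULLs and collapses repeated ids.
rows = conn.execute("""
SELECT DATE(v.visited_at) AS visit_date,
       COUNT(DISTINCT CASE WHEN u.gender = 'M' THEN u.id END) AS male_visit,
       COUNT(DISTINCT CASE WHEN u.gender = 'F' THEN u.id END) AS female_visit,
       COUNT(DISTINCT CASE WHEN u.gender = '' OR u.gender IS NULL
                           THEN u.id END) AS unknown_visit
FROM visits v
JOIN users u ON v.user_id = u.id
WHERE v.duration > 30
GROUP BY DATE(v.visited_at)
ORDER BY visit_date
""").fetchall()

for row in rows:
    print(row)
```

User 1 visits twice on the first day but only adds 1 to male_visit, which is the behaviour the SUM(IF(...)) version lacks.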

Going by your query, there are 2 tables (users and visits).
Query:
SELECT
DATE(v.visited_at) AS visit_date,
u.gender,
COUNT(DISTINCT v.user_id) AS total_count
FROM
visits v
INNER JOIN users u ON v.user_id = u.id
WHERE
DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
AND v.duration > 30
GROUP BY u.gender, DATE(v.visited_at)
ORDER BY DATE(v.visited_at) ASC;
This query gives you the count of unique users per gender for each date, with one row per gender and date rather than one column per gender.

This type of query is likely to be slow, especially if the table has a large number of rows: the filter wraps the column in DATE(), so MySQL cannot use an index on it and has to perform a full table scan.
Optimising your database structure is likely to offer you performance gains much in excess of anything you will get trying to query it like this.
One suggestion is to partition the table by date ranges. Doing so can greatly reduce query execution time, because instead of a full table scan MySQL can simply ignore any partitions outside the query's date range. The bigger the table, the more benefit you will see; I would expect anything from 2x to 10x faster.
If you were to replace your gender column with 3 columns (male, female and unknown), you could replace the 3 subqueries containing the slow COUNT(DISTINCT ...) expressions with a single query with fewer conditions. You can also add the user id to the GROUP BY clause to remove the need for COUNT(DISTINCT), since you can group by more than one column.
Finally, you could add a database trigger. Either add an extra column that the trigger sets to 1 when a logged visit has a duration over 30 and is the user's first visit of the day, or create a new calendar table for visits and have the trigger increment a counter in it whenever a logged row amounts to a unique visit for the day.
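The idea of adding the user id to the grouping so that no COUNT(DISTINCT) is needed could look roughly like the sketch below. It is only one reading of that suggestion: an inner query collapses visits to one row per (day, user), and the outer query then does plain conditional sums. It runs on SQLite through Python's sqlite3 as a stand-in for MySQL, with made-up rows and without the 30-day filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, gender TEXT);
CREATE TABLE visits (user_id INTEGER, visited_at TEXT, duration INTEGER);
INSERT INTO users VALUES (1, 'M'), (2, 'F'), (3, NULL), (4, 'M');
INSERT INTO visits VALUES
  (1, '2024-05-01 09:00:00', 45),
  (1, '2024-05-01 17:30:00', 60),
  (2, '2024-05-01 10:15:00', 90),
  (3, '2024-05-01 11:00:00', 31),
  (4, '2024-05-02 08:00:00', 40);
""")

# Step 1 (derived table d): one row per day and user, so duplicates are gone.
# Step 2 (outer query): ordinary SUM over 0/1 flags, no DISTINCT required.
rows = conn.execute("""
SELECT d.visit_date,
       SUM(CASE WHEN u.gender = 'M' THEN 1 ELSE 0 END) AS male_visit,
       SUM(CASE WHEN u.gender = 'F' THEN 1 ELSE 0 END) AS female_visit,
       SUM(CASE WHEN u.gender = '' OR u.gender IS NULL
                THEN 1 ELSE 0 END) AS unknown_visit
FROM (
    SELECT DATE(visited_at) AS visit_date, user_id
    FROM visits
    WHERE duration > 30
    GROUP BY DATE(visited_at), user_id
) AS d
JOIN users u ON u.id = d.user_id
GROUP BY d.visit_date
ORDER BY d.visit_date
""").fetchall()

for row in rows:
    print(row)
```

Whether this beats COUNT(DISTINCT) on a real table depends on the data and indexes, so it is worth comparing both with EXPLAIN on the actual schema.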

Related

Query! how can i get total meals and cost from three tables for each user! This is my first question so sorry for the inconvenience

I have three tables: users, meals, expense. I want to calculate each user's meals and expenses between two dates with a single query.
I have tried in many ways, like:
SELECT u.name as BorderName , SUM(e.expenseAmount) as expense
FROM users u
INNER JOIN expenses e on e.user_id=u.id
WHERE e.expenseDate BETWEEN '2019-04-01' AND '2019-04-30'
GROUP BY e.user_id
Do the aggregation in inline views, with each inline view returning a single row per user_id, to avoid a semi-Cartesian product (partial cross product).
Something like this:
SELECT u.name
, IFNULL(t.tot_expenseamount,0) AS tot_expense_amount
, IFNULL(n.tot_noofmeal,0) AS tot_no_of_meal
FROM users u
LEFT
JOIN ( SELECT e.user_id
, SUM(e.expenseamount) AS tot_expenseamount
FROM expense e
WHERE e.expensedate >= '2019-04-01' + INTERVAL 0 MONTH
AND e.expensedate < '2019-04-01' + INTERVAL 1 MONTH
GROUP
BY e.user_id
) t
ON t.user_id = u.id
LEFT
JOIN ( SELECT m.user_id
, SUM(m.noofmeal) AS tot_noofmeal
FROM meal m
WHERE m.mealdate >= '2019-04-01' + INTERVAL 0 MONTH
AND m.mealdate < '2019-04-01' + INTERVAL 1 MONTH
GROUP
BY m.user_id
) n
ON n.user_id = u.id
ORDER
BY u.name
Note that the inline view queries n and t each return a single row per user_id, with the total expense amount or the total number of meals. For testing, each inline view query can be executed separately to verify the results.
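To see why the pre-aggregation matters, here is a small sketch comparing a naive double join with the inline-view version. It uses SQLite through Python's sqlite3 as a stand-in for MySQL; the user names and amounts are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE expense (user_id INTEGER, expenseamount INTEGER, expensedate TEXT);
CREATE TABLE meal    (user_id INTEGER, noofmeal INTEGER, mealdate TEXT);
INSERT INTO users VALUES (1, 'Asha'), (2, 'Bo');
INSERT INTO expense VALUES (1, 100, '2019-04-03'), (1, 50, '2019-04-10');
INSERT INTO meal VALUES (1, 2, '2019-04-03'), (1, 3, '2019-04-04'),
                        (1, 1, '2019-04-05');
""")

# Naive: joining both detail tables pairs every expense row with every
# meal row of the same user (2 x 3 = 6 rows for Asha), inflating both sums.
naive = conn.execute("""
SELECT u.name,
       IFNULL(SUM(e.expenseamount), 0),
       IFNULL(SUM(m.noofmeal), 0)
FROM users u
LEFT JOIN expense e ON e.user_id = u.id
LEFT JOIN meal m ON m.user_id = u.id
GROUP BY u.id, u.name
ORDER BY u.name
""").fetchall()

# Inline views: each subquery returns at most one row per user_id,
# so the outer joins cannot multiply rows.
preaggregated = conn.execute("""
SELECT u.name,
       IFNULL(t.tot_expenseamount, 0),
       IFNULL(n.tot_noofmeal, 0)
FROM users u
LEFT JOIN (SELECT user_id, SUM(expenseamount) AS tot_expenseamount
           FROM expense
           WHERE expensedate >= '2019-04-01' AND expensedate < '2019-05-01'
           GROUP BY user_id) t ON t.user_id = u.id
LEFT JOIN (SELECT user_id, SUM(noofmeal) AS tot_noofmeal
           FROM meal
           WHERE mealdate >= '2019-04-01' AND mealdate < '2019-05-01'
           GROUP BY user_id) n ON n.user_id = u.id
ORDER BY u.name
""").fetchall()

print("naive:         ", naive)
print("pre-aggregated:", preaggregated)
```

The naive version reports 450 and 12 for the first user, while the real totals are 150 and 6.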

MySQL: "Subquery returns more than 1 row" when selecting results of multiple subqueries

I'm getting a "Subquery returns more than 1 row" error while running a query that's meant to return results of two subqueries. Why is returning more than one row a problem here, and how can I get around this problem?
Data tables and relevant fields look like this:
Accounts
id
Meetings
account_id
assigned_user_id
start_date
Users
id
last_name
A meeting is assigned to an account and to a user. I'm trying to create a table that will display the quantities of meetings per assigned user per account where the meeting start date is within different date ranges. The date ranges should be arranged in the same row, as a table with these headings:
Account | User's Last Name | Meetings 1-31 days in the future | Meetings 31-60 days in the future
This is my query:
SELECT
(SELECT
a.name
FROM
accounts AS a
JOIN
meetings AS m ON a.id = m.account_id
AND date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 60 DAY)
JOIN
users AS u ON m.assigned_user_id = u.id
WHERE
m.status = 'Planned'
AND m.deleted = 0
GROUP BY a.id, u.id) AS 'Account',
(SELECT
u.last_name
FROM
accounts AS a
JOIN
meetings AS m ON a.id = m.account_id
AND date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 60 DAY)
JOIN
users AS u ON m.assigned_user_id = u.id
WHERE
m.status = 'Planned'
AND m.deleted = 0
GROUP BY a.id, u.id) AS 'Name',
(SELECT
COUNT(m.id)
FROM
accounts AS a
JOIN
meetings AS m ON a.id = m.account_id
AND date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 30 DAY)
JOIN
users AS u ON m.assigned_user_id = u.id
WHERE
m.status = 'Planned'
AND m.deleted = 0
GROUP BY a.id, u.id) AS 'Meetings 1-30 days',
(SELECT
COUNT(m.id)
FROM
accounts AS a
JOIN
meetings AS m ON a.id = m.account_id
AND m.date_start BETWEEN DATE_ADD(CURDATE(),INTERVAL 31 DAY) AND DATE_ADD(CURDATE(),INTERVAL 60 DAY)
JOIN
users AS u ON m.assigned_user_id = u.id
WHERE
m.status = 'Planned'
AND m.deleted = 0
GROUP BY a.id, u.id) AS 'Meetings 31-60 days'
Columns containing the names of accounts and names of users had to be added as subqueries in order to avoid "Operand should contain 1 column(s)" errors. Columns corresponding to the counts of meetings had to be subqueries because no single row of the joined table can fit both date ranges at the same time. Each subquery returns the expected results when run individually. But I get "Subquery returns more than 1 row" when the subqueries are put together as shown. I tried assigning different aliases to each subquery, but that did not help.
SQL queries do not return nested result sets, so an expression (such as a subquery) used in a SELECT clause cannot have multiple values, as that would "nest" its values. You more likely just need to use conditional aggregation, like so:
SELECT a.id, u.id, a.name, u.last_name
, COUNT(CASE WHEN m.date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 30 DAY) THEN 1 ELSE NULL END) AS `Meetings 1-30 days`
, COUNT(CASE WHEN m.date_start BETWEEN DATE_ADD(CURDATE(),INTERVAL 31 DAY) AND DATE_ADD(CURDATE(),INTERVAL 60 DAY) THEN 1 ELSE NULL END) AS `Meetings 31-60 days`
-- , COUNT(CASE WHEN ... THEN 1 ELSE NULL END) AS ...   (same pattern for any further date range)
FROM accounts AS a
JOIN meetings AS m ON a.id = m.account_id
JOIN users AS u ON m.assigned_user_id = u.id
WHERE m.status = 'Planned' AND m.deleted = 0
AND m.date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 60 DAY)
GROUP BY a.id, u.id, a.name, u.last_name
;
Notes: ELSE NULL is implied and can be omitted; it is just there for clarity. Aggregate functions such as COUNT ignore NULL values. The only time NULLs change the outcome is when a group contains nothing but NULLs: COUNT then returns 0, whereas SUM, MIN, MAX and AVG return NULL.
Sidenote: You could have continued with your query in a form similar to what you originally had; if you included the grouping fields in the subqueries' results, the subqueries could have been joined together (but that would have been a lot of redundant joining of accounts, meetings, and users).
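Here is a small runnable sketch of the conditional aggregation above. It is an approximation: it uses SQLite through Python's sqlite3 rather than MySQL, replaces CURDATE() with a fixed :today parameter so the output is repeatable, and uses SQLite's date(..., '+N days') in place of DATE_ADD. The account and user names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users    (id INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE meetings (id INTEGER PRIMARY KEY, account_id INTEGER,
                       assigned_user_id INTEGER, date_start TEXT,
                       status TEXT, deleted INTEGER);
INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO users VALUES (1, 'Lee'), (2, 'Ng');
INSERT INTO meetings VALUES
  (1, 1, 1, '2024-01-05', 'Planned', 0),
  (2, 1, 1, '2024-01-20', 'Planned', 0),
  (3, 1, 1, '2024-02-10', 'Planned', 0),
  (4, 1, 2, '2024-02-15', 'Planned', 0),
  (5, 2, 2, '2024-01-10', 'Held',    0);  -- not 'Planned', excluded
""")

# Each CASE returns 1 inside its date range and NULL outside it;
# COUNT skips the NULLs, so every range gets its own column in one pass.
rows = conn.execute("""
SELECT a.name, u.last_name,
       COUNT(CASE WHEN m.date_start BETWEEN :today AND date(:today, '+30 days')
                  THEN 1 END) AS meetings_1_30,
       COUNT(CASE WHEN m.date_start BETWEEN date(:today, '+31 days')
                                        AND date(:today, '+60 days')
                  THEN 1 END) AS meetings_31_60
FROM accounts AS a
JOIN meetings AS m ON a.id = m.account_id
JOIN users AS u ON m.assigned_user_id = u.id
WHERE m.status = 'Planned' AND m.deleted = 0
  AND m.date_start BETWEEN :today AND date(:today, '+60 days')
GROUP BY a.id, u.id, a.name, u.last_name
ORDER BY a.name, u.last_name
""", {"today": "2024-01-01"}).fetchall()

for row in rows:
    print(row)
```

The second user has no meeting in the first range and gets a 0 there, not a NULL and not a missing row.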

MySQL: Detect existence of at least 1 record in large joined table

I have two tables:
users (id, name)
user_activities (id, user_id, activity_id, created_at)
The user_activities table is very large with over 300 million rows.
I am trying to detect which users have done any activity between a given date range. In other words, rows on the user table, where a joined row exists on the user_activities table between a certain created_at range.
I can do this with an INNER JOIN, GROUP BY and WHERE clause, but the query runs for a long time, as I believe it's hitting all user_activities rows within my date range.
I don't really care how many activities there are, just whether there are more than zero. So I am grouping to get a count (e.g. 210 activities) when I could actually stop after finding just 1.
Is there a more efficient way to do this rather than grouping all user_activity rows to count them?
For info, here's the current query, which works fine but takes a long time:
SELECT u.id, u.name, COUNT(ua.id) AS activity_count
FROM users u
INNER JOIN user_activity ua ON u.id=ua.user_id
WHERE ua.created_at > '2017-01-01' AND ua.created_at < '2017-03-01'
GROUP BY u.id
HAVING activity_count > 0;
Thanks in advance!
You can try this version:
SELECT u.id, u.name,
(SELECT COUNT(*)
FROM user_activity ua
WHERE u.id = ua.user_id AND
ua.created_at > '2017-01-01' AND
ua.created_at < '2017-03-01'
) as activity_count
FROM users u
HAVING activity_count > 0;
For performance you want an index on user_activity(user_id, created_at).
EDIT:
If you just want existence, then use the same index and this should be much faster:
SELECT u.id, u.name
FROM users u
WHERE EXISTS (SELECT 1
FROM user_activity ua
WHERE u.id = ua.user_id AND
ua.created_at > '2017-01-01' AND
ua.created_at < '2017-03-01'
);
Whereas your query does complex processing and then aggregation of a bunch of data, this should scan the users table, and just look up in the index whether an appropriate activity exists for the user.
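As a sanity check that the EXISTS form returns the same users as the JOIN plus GROUP BY form, here is a minimal sketch. It runs on SQLite through Python's sqlite3 as a stand-in for MySQL, with a handful of made-up rows; it says nothing about timing on a 300-million-row table, only about equivalence of the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE user_activity (id INTEGER PRIMARY KEY, user_id INTEGER,
                            activity_id INTEGER, created_at TEXT);
-- the index suggested above: user first, then the date column
CREATE INDEX idx_ua_user_created ON user_activity (user_id, created_at);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cy'), (4, 'Dee');
INSERT INTO user_activity (user_id, activity_id, created_at) VALUES
  (1, 10, '2017-01-15 10:00:00'),
  (1, 11, '2017-02-01 09:30:00'),
  (1, 12, '2017-02-20 18:45:00'),
  (2, 10, '2017-03-05 12:00:00'),  -- outside the range
  (4, 13, '2017-01-02 08:00:00');
""")

# Original approach: join every matching activity row, then count per user.
grouped = conn.execute("""
SELECT u.id, u.name, COUNT(ua.id) AS activity_count
FROM users u
JOIN user_activity ua ON u.id = ua.user_id
WHERE ua.created_at > '2017-01-01' AND ua.created_at < '2017-03-01'
GROUP BY u.id, u.name
ORDER BY u.id
""").fetchall()

# EXISTS approach: one probe per user, which can stop at the first match.
existing = conn.execute("""
SELECT u.id, u.name
FROM users u
WHERE EXISTS (SELECT 1
              FROM user_activity ua
              WHERE ua.user_id = u.id
                AND ua.created_at > '2017-01-01'
                AND ua.created_at < '2017-03-01')
ORDER BY u.id
""").fetchall()

print(grouped)
print(existing)
```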
Use an EXISTS clause, so the DBMS sees that it suffices to find one record per user in the given date range.
SELECT id, name
FROM users u
where exists
(
select *
from user_activity ua
where ua.user_id = u.id
and ua.created_at > '2017-01-01' AND ua.created_at < '2017-03-01'
);
With this index:
create index idx on user_activity(user_id, created_at);
To get users who have done activities for a given date range
select u.id, u.name from users u
where exists ( SELECT 1 FROM user_activity ua
where ua.user_id = u.id
and ua.created_at > '2017-01-01' AND
ua.created_at < '2017-03-01')
Create index for user_activity(created_at)
If it's only for testing existence, then:
SELECT u.id,
EXISTS(
SELECT 1
FROM user_activity AS ua
WHERE u.id = ua.user_id
AND ua.created_at > '2017-01-01'
AND ua.created_at < '2017-03-01'
) AS ret
FROM users AS u
For each user, this will simply return the column
ret = 1
if at least one row satisfies the given conditions, else it will return
ret = 0

How to count the number of rows from left join with conditioning

This is my query:
SELECT claims.id,
COUNT(CASE WHEN claims.sold_at BETWEEN (now() - INTERVAL '7.day') AND now() END FROM claims) week1
FROM users
LEFT JOIN claims ON claims.sales_user_id = users.id
WHERE users.office_id = 2
What I am trying to do is join the claims table and also get a count of how many claims were sold (sold_at) within a certain date period.
This query gives an error, and I'm not sure how to fix or approach it properly.
Try this:
select
users.id, count(claims.id) as count
from users
left join claims on claims.sales_user_id = users.id
where
users.office_id = 2 and
claims.sold_at between (current_timestamp - interval '7' day) and current_timestamp
group by users.id;
Your query should look like this
SELECT claims.id,
COUNT(CASE
WHEN claims.sold_at BETWEEN (now() - INTERVAL '7.day') AND now()
THEN claims.id
ELSE null END) week1
FROM users
LEFT JOIN claims ON claims.sales_user_id = users.id
WHERE users.office_id = 2
Normally for this kind of thing you use SUM(IF()).
Something like this:-
SELECT claims.id,
SUM(IF(claims.sold_at BETWEEN DATE_SUB(now(), INTERVAL 7 DAY) AND now(), 1, 0)) AS week1
FROM users
LEFT JOIN claims ON claims.sales_user_id = users.id
WHERE users.office_id = 2
But on its own this makes no sense. Why is claims.id being brought back? An aggregate function without a GROUP BY clause means only a single row is returned, so the claims.id returned will be arbitrary. But I expect the id is the unique key, so grouping by it seems wrong in this case.
select
users.id,
count(
claims.sold_at between (now() - interval '7.day') and now()
or null
) week1
from
users
left join
claims on claims.sales_user_id = users.id
where users.office_id = 2
group by users.id
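One difference between the grouped answers is worth checking on sample data: whether the date condition sits in the WHERE clause or inside the COUNT. The sketch below uses SQLite through Python's sqlite3 as a stand-in, with a fixed :now parameter instead of now() and SQLite's datetime(..., '-7 days') for the interval; the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, office_id INTEGER);
CREATE TABLE claims (id INTEGER PRIMARY KEY, sales_user_id INTEGER,
                     sold_at TEXT);
INSERT INTO users VALUES (1, 2), (2, 2), (3, 3);
INSERT INTO claims (sales_user_id, sold_at) VALUES
  (1, '2024-06-09 08:00:00'),  -- inside the last 7 days
  (1, '2024-05-01 08:00:00');  -- older
""")
params = {"now": "2024-06-10 12:00:00"}

# Date filter in WHERE: rows produced by the LEFT JOIN for users without
# a recent claim are filtered away, so those users vanish from the result.
filtered = conn.execute("""
SELECT users.id, COUNT(claims.id)
FROM users
LEFT JOIN claims ON claims.sales_user_id = users.id
WHERE users.office_id = 2
  AND claims.sold_at BETWEEN datetime(:now, '-7 days') AND :now
GROUP BY users.id
ORDER BY users.id
""", params).fetchall()

# Date condition inside COUNT: every user of the office stays in the
# result, and users without a recent claim get 0.
conditional = conn.execute("""
SELECT users.id,
       COUNT(CASE WHEN claims.sold_at BETWEEN datetime(:now, '-7 days') AND :now
                  THEN claims.id END) AS week1
FROM users
LEFT JOIN claims ON claims.sales_user_id = users.id
WHERE users.office_id = 2
GROUP BY users.id
ORDER BY users.id
""", params).fetchall()

print(filtered)
print(conditional)
```

Which result is wanted depends on whether users with no recent sales should appear with a zero.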

finding closest date from multiple tables mysql

I have many tables that log the users' actions on a forum; each log event has its date.
I need a query that gives me all the users that weren't active during the last year.
I have the following query (working query):
SELECT *
FROM (questions AS q
INNER JOIN Answers AS a
INNER JOIN bestAnswerByPoll AS p
INNER JOIN answerThumbRank AS t
INNER JOIN notes AS n
INNER JOIN interestingQuestion AS i ON q.user_id = a.user_id
AND a.user_id = p.user_id
AND p.user_id = t.user_id
AND t.user_id = n.user_id
AND n.user_id = i.user_id)
WHERE DATEDIFF(CURDATE(),q.date)>365
AND DATEDIFF(CURDATE(),a.date)>365
AND DATEDIFF(CURDATE(),p.date)>365
AND DATEDIFF(CURDATE(),t.date)>365
AND DATEDIFF(CURDATE(),n.date)>365
AND DATEDIFF(CURDATE(),i.date)>365
What I'm doing in that query is joining all the tables on the user id and then checking each
date column individually to see if it's been more than a year.
I was wondering if there is a way to make it simpler, something like finding the max of all the dates (the latest date) and comparing just that one to the current date.
If you want the best performance, you should not use greatest(). Instead do something like this:
SELECT *
FROM questions q
JOIN Answers a ON q.user_id = a.user_id
JOIN bestAnswerByPoll p ON a.user_id = p.user_id
JOIN answerThumbRank t ON p.user_id = t.user_id
JOIN notes n ON t.user_id = n.user_id
JOIN interestingQuestion i ON n.user_id = i.user_id
WHERE q.date < curdate() - interval 1 year
AND a.date < curdate() - interval 1 year
AND p.date < curdate() - interval 1 year
AND t.date < curdate() - interval 1 year
AND n.date < curdate() - interval 1 year
AND i.date < curdate() - interval 1 year
You want to avoid datediff() so that MySQL can do index lookups for the date column comparisons. To make sure that the index lookup works, you should create a compound (multi-column) index on (user_id, date) for each one of your tables.
In this compound index, the first part (user_id) will be used for faster joins, and the second part (date) will be used for faster date comparisons. If you replace * in your SELECT * with only the columns mentioned above (like user_id only), you might be able to get index-only scans, which will be super-fast.
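The effect of keeping the date column bare can be illustrated with a query planner. The sketch below is only an analogy: it asks SQLite (through Python's sqlite3) for its plan rather than MySQL, uses julianday() arithmetic as a stand-in for DATEDIFF(), and the exact wording of the plan text varies between SQLite versions. On a real MySQL server the check would be EXPLAIN on the actual query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE notes (user_id INTEGER, date TEXT);
-- compound index as suggested above: user_id first, then date
CREATE INDEX idx_notes_user_date ON notes (user_id, date);
""")

def plan(sql, params):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " | ".join(row[3] for row in rows)

# Column wrapped in a function: only the user_id part of the index can
# be used as a lookup; the date test is evaluated row by row.
wrapped_plan = plan(
    "SELECT user_id FROM notes "
    "WHERE user_id = :uid AND julianday(:today) - julianday(date) > 365",
    {"uid": 1, "today": "2024-06-10"},
)

# Bare column compared with a precomputed cutoff: the date part of the
# index can also bound the lookup.
bare_plan = plan(
    "SELECT user_id FROM notes WHERE user_id = :uid AND date < :cutoff",
    {"uid": 1, "cutoff": "2023-06-10"},
)

print("wrapped:", wrapped_plan)
print("bare:   ", bare_plan)
```

In the second plan the index condition mentions both user_id and date, while in the first it mentions user_id only.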
UPDATE: MySQL versions before 8.0 do not support the WITH clause for common table expressions, unlike PostgreSQL and some other databases. But you can still factor out the common expression as follows:
SELECT *
FROM questions q
JOIN Answers a ON q.user_id = a.user_id
JOIN bestAnswerByPoll p ON a.user_id = p.user_id
JOIN answerThumbRank t ON p.user_id = t.user_id
JOIN notes n ON t.user_id = n.user_id
JOIN interestingQuestion i ON n.user_id = i.user_id,
(SELECT curdate() - interval 1 year AS year_ago) x
WHERE q.date < x.year_ago
AND a.date < x.year_ago
AND p.date < x.year_ago
AND t.date < x.year_ago
AND n.date < x.year_ago
AND i.date < x.year_ago
In MySQL, you can use the greatest() function:
WHERE DATEDIFF(CURDATE(), greatest(q.date, a.date, p.date, t.date, n.date, i.date)) > 365
This will help with readability. It would not affect performance.