Finding the closest date from multiple tables in MySQL

I have many tables that log users' actions on a forum; each log event has its date.
I need a query that gives me all the users who were not active during the last year.
I have the following (working) query:
SELECT *
FROM (questions AS q
INNER JOIN Answers AS a
INNER JOIN bestAnswerByPoll AS p
INNER JOIN answerThumbRank AS t
INNER JOIN notes AS n
INNER JOIN interestingQuestion AS i ON q.user_id = a.user_id
AND a.user_id = p.user_id
AND p.user_id = t.user_id
AND t.user_id = n.user_id
AND n.user_id = i.user_id)
WHERE DATEDIFF(CURDATE(),q.date)>365
AND DATEDIFF(CURDATE(),a.date)>365
AND DATEDIFF(CURDATE(),p.date)>365
AND DATEDIFF(CURDATE(),t.date)>365
AND DATEDIFF(CURDATE(),n.date)>365
AND DATEDIFF(CURDATE(),i.date)>365
What I'm doing in that query is joining all the tables on the user ID and then checking each
date column individually to see whether it's been more than a year.
I was wondering if there is a way to make it simpler, something like finding the max of all the dates (the latest date) and comparing just that one to the current date.

If you want to get best performance, you cannot use greatest(). Instead do something like this:
SELECT *
FROM questions q
JOIN Answers a ON q.user_id = a.user_id
JOIN bestAnswerByPoll p ON a.user_id = p.user_id
JOIN answerThumbRank t ON p.user_id = t.user_id
JOIN notes n ON t.user_id = n.user_id
JOIN interestingQuestion i ON n.user_id = i.user_id
WHERE q.date < curdate() - interval 1 year
AND a.date < curdate() - interval 1 year
AND p.date < curdate() - interval 1 year
AND t.date < curdate() - interval 1 year
AND n.date < curdate() - interval 1 year
AND i.date < curdate() - interval 1 year
You want to avoid datediff() so that MySQL can use an index for the date column comparisons. To make sure the index lookup works, you should create a compound (multi-column) index on (user_id, date) for each of your tables.
In this compound index, the first part (user_id) will be used for faster joins, and the second part (date) will be used for faster date comparisons. If you replace * in your SELECT * with only the columns mentioned above (such as user_id only), you might be able to get index-only scans, which will be very fast.
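As a rough sketch of the compound indexes described above, the statements could look like the following. The index names are hypothetical; the table and column names are taken from the query, and the sketch assumes every table really has user_id and date columns.

```sql
-- Hypothetical index names; one (user_id, date) index per table in the join.
-- user_id comes first so the join can use it; date comes second for the range test.
CREATE INDEX idx_questions_user_date   ON questions           (user_id, `date`);
CREATE INDEX idx_answers_user_date     ON Answers             (user_id, `date`);
CREATE INDEX idx_poll_user_date        ON bestAnswerByPoll    (user_id, `date`);
CREATE INDEX idx_thumbrank_user_date   ON answerThumbRank     (user_id, `date`);
CREATE INDEX idx_notes_user_date       ON notes               (user_id, `date`);
CREATE INDEX idx_interesting_user_date ON interestingQuestion (user_id, `date`);
```

Running EXPLAIN on the query before and after creating the indexes is the simplest way to check whether MySQL actually picks them up.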
UPDATE: MySQL versions before 8.0 do not support the WITH clause for common table expressions as PostgreSQL and some other databases do. You can still factor out the common expression as follows:
SELECT *
FROM questions q
JOIN Answers a ON q.user_id = a.user_id
JOIN bestAnswerByPoll p ON a.user_id = p.user_id
JOIN answerThumbRank t ON p.user_id = t.user_id
JOIN notes n ON t.user_id = n.user_id
JOIN interestingQuestion i ON n.user_id = i.user_id,
(SELECT curdate() - interval 1 year AS year_ago) x
WHERE q.date < x.year_ago
AND a.date < x.year_ago
AND p.date < x.year_ago
AND t.date < x.year_ago
AND n.date < x.year_ago
AND i.date < x.year_ago

In MySQL, you can use the greatest() function:
WHERE DATEDIFF(CURDATE(), greatest(q.date, a.date, p.date, t.date, n.date, i.date)) > 365
This will help with readability. It would not affect performance.
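The "latest date across all tables" idea from the question can also be sketched without joining the tables at all, by stacking the log rows with UNION ALL and taking MAX() per user. This is a different technique from greatest(), not a rewrite of it, and it assumes each table has user_id and date columns as in the original query. The aliases all_events and last_activity are made up for illustration.

```sql
-- Sketch: collect every (user_id, date) pair from all log tables,
-- then keep only users whose most recent event is more than a year old.
SELECT user_id, MAX(`date`) AS last_activity
FROM (
    SELECT user_id, `date` FROM questions
    UNION ALL
    SELECT user_id, `date` FROM Answers
    UNION ALL
    SELECT user_id, `date` FROM bestAnswerByPoll
    UNION ALL
    SELECT user_id, `date` FROM answerThumbRank
    UNION ALL
    SELECT user_id, `date` FROM notes
    UNION ALL
    SELECT user_id, `date` FROM interestingQuestion
) AS all_events
GROUP BY user_id
HAVING MAX(`date`) < CURDATE() - INTERVAL 1 YEAR;
```

Note the difference in behaviour: the join-based queries only return users who have rows in every one of the six tables, while this sketch returns any user who appears in at least one of them. Also keep in mind that greatest() returns NULL if any of its arguments is NULL.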

Related

MySQL: "Subquery returns more than 1 row" when selecting results of multiple subqueries

I'm getting a "Subquery returns more than 1 row" error while running a query that's meant to return results of two subqueries. Why is returning more than one row a problem here, and how can I get around this problem?
Data tables and relevant fields look like this:
Accounts
id
Meetings
account_id
assigned_user_id
start_date
Users
id
last_name
A meeting is assigned to an account and to a user. I'm trying to create a table that will display the quantities of meetings per assigned user per account where the meeting start date is within different date ranges. The date ranges should be arranged in the same row, as a table with these headings:
Account | User's Last Name | Meetings 1-31 days in the future | Meetings 31-60 days in the future
This is my query:
SELECT
(SELECT
a.name
FROM
accounts AS a
JOIN
meetings AS m ON a.id = m.account_id
AND date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 60 DAY)
JOIN
users AS u ON m.assigned_user_id = u.id
WHERE
m.status = 'Planned'
AND m.deleted = 0
GROUP BY a.id, u.id) AS 'Account',
(SELECT
u.last_name
FROM
accounts AS a
JOIN
meetings AS m ON a.id = m.account_id
AND date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 60 DAY)
JOIN
users AS u ON m.assigned_user_id = u.id
WHERE
m.status = 'Planned'
AND m.deleted = 0
GROUP BY a.id, u.id) AS 'Name',
(SELECT
COUNT(m.id)
FROM
accounts AS a
JOIN
meetings AS m ON a.id = m.account_id
AND date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 30 DAY)
JOIN
users AS u ON m.assigned_user_id = u.id
WHERE
m.status = 'Planned'
AND m.deleted = 0
GROUP BY a.id, u.id) AS 'Meetings 1-30 days',
(SELECT
COUNT(m.id)
FROM
accounts AS a2
JOIN
meetings AS m ON a.id = m.account_id
AND m.date_start BETWEEN DATE_ADD(CURDATE(),INTERVAL 31 DAY) AND DATE_ADD(CURDATE(),INTERVAL 60 DAY)
JOIN
users AS u ON m.assigned_user_id = u.id
WHERE
m.status = 'Planned'
AND m.deleted = 0
GROUP BY a.id, u.id) AS 'Meetings 31-60 days'
Columns containing the names of accounts and names of users had to be added as subqueries in order to avoid "Operand should contain 1 column(s)" errors. Columns corresponding to the counts of meetings had to be subqueries because no single row of the joined table can fit both date ranges at the same time. Each subquery returns the expected results when run individually. But I get "Subquery returns more than 1 row" when the subqueries are put together as shown. I tried assigning different aliases to each subquery, but that did not help.
SQL queries do not return nested result sets, so an expression (such as a subquery) used in a SELECT clause cannot have multiple values, as that would "nest" its values. You more likely just need to use conditional aggregation, like so:
SELECT a.id, u.id, a.name, u.last_name
, COUNT(CASE WHEN m.date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 30 DAY) THEN 1 ELSE NULL END) AS `Meetings 1-30 days`
, COUNT(CASE WHEN m.date_start BETWEEN DATE_ADD(CURDATE(),INTERVAL 31 DAY) AND DATE_ADD(CURDATE(),INTERVAL 60 DAY) THEN 1 ELSE NULL END) AS `Meetings 31-60 days`
-- , COUNT(CASE WHEN [condition] THEN 1 ELSE NULL END) AS [alias]   (pattern for any further date ranges)
FROM accounts AS a
JOIN meetings AS m ON a.id = m.account_id
JOIN users AS u ON m.assigned_user_id = u.id
WHERE m.status = 'Planned' AND m.deleted = 0
AND m.date_start BETWEEN CURDATE() AND DATE_ADD(CURDATE(),INTERVAL 60 DAY)
GROUP BY a.id, u.id, a.name, u.last_name
;
Notes: ELSE NULL is technically automatic and can be omitted; it is there only for clarity. Aggregate functions such as COUNT ignore NULL values. When an aggregate encounters only NULL values, COUNT returns 0, while functions such as SUM, MIN, and MAX return NULL.
Sidenote: You could have continued with your query in a form similar to what you originally had; if you included the grouping fields in the subqueries' results, the subqueries could have been joined together (but that would have been a lot of redundant joining of accounts, meetings, and users).
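A small self-contained check of how aggregates treat the NULLs produced by a CASE without a matching branch. The derived table t and its column n are made up for illustration and need no real tables.

```sql
-- Three rows (1, 2, 3) built inline, so this runs without any schema.
SELECT COUNT(CASE WHEN n > 1  THEN 1 END) AS cnt_matching,  -- 2: rows 2 and 3 match
       COUNT(CASE WHEN n > 10 THEN 1 END) AS cnt_none,      -- 0: COUNT of only NULLs is 0
       SUM(CASE WHEN n > 10 THEN 1 END)   AS sum_none       -- NULL: SUM of only NULLs is NULL
FROM (SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3) AS t;
```

This is why COUNT(CASE ...) is convenient for per-range counts: ranges with no meetings come back as 0 rather than NULL.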

MySQL, DISTINCT in SUM operation

I am currently trying to calculate the number of unique user visits in my application, broken down by user gender. Here is an example query that counts all visits (not unique):
SELECT
DATE(v.visited_at) AS visit_date,
SUM(IF(u.gender = 'M', 1, 0)) AS male_visit,
SUM(IF(u.gender = 'F', 1, 0)) AS female_visit,
SUM(IF(u.gender = '' OR u.gender IS NULL, 1, 0)) AS unknown_visit
FROM
visits v
INNER JOIN users u ON v.user_id = u.id
WHERE
DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
AND v.duration > 30
GROUP BY
DATE(v.visited_at)
I tried using subqueries with COUNT(DISTINCT). It works, but it's 4 times slower.
SELECT
DATE(visited_at) as visit_date,
(SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE u.gender = 'M' AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS male_visit,
(SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE u.gender = 'F' AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS female_visit,
(SELECT COUNT(DISTINCT u.id) FROM visits v JOIN users u ON v.user_id = u.id WHERE (u.gender = '' OR u.gender IS NULL) AND DATE(v.visited_at) = visit_date AND v.duration > 30) AS unknown_visit
FROM
visits v
WHERE
DATE(visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
GROUP BY
DATE(visited_at)
Any suggestion on this?
COUNT(DISTINCT) is always going to be slower than COUNT(). You can try:
SELECT DATE(v.visited_at) AS visit_date,
COUNT(DISTINCT CASE WHEN u.gender = 'M' THEN u.id END) AS male_visit,
COUNT(DISTINCT CASE WHEN u.gender = 'F' THEN u.id END) AS female_visit,
COUNT(DISTINCT CASE WHEN u.gender = '' OR u.gender IS NULL THEN u.id END) AS unknown_visit
FROM visits v INNER JOIN
users u
ON v.user_id = u.id
WHERE DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY) AND
v.duration > 30
GROUP BY DATE(v.visited_at);
I don't know if it will be much faster, though.
There are two tables in the query (users and visits).
Query
SELECT
DATE(v.visited_at) AS visit_date,
u.gender,
COUNT(DISTINCT v.user_id) AS total_count
FROM
visits v
INNER JOIN users u ON v.user_id = u.id
WHERE
DATE(v.visited_at) >= DATE_SUB(SYSDATE(), INTERVAL 30 DAY)
AND v.duration > 30
GROUP BY u.gender, DATE(v.visited_at)
ORDER BY DATE(v.visited_at) ASC;
This query gives you the count of unique users per gender for each date.
This type of query is likely to be slow, especially on a table with a large number of rows: because the filter applies DATE() to the visited_at column, MySQL generally cannot use an index on that column and has to perform a full table scan.
Optimising your database structure is likely to offer performance gains well in excess of anything you will get from tuning the query itself.
One suggestion is to partition the table by date range. Doing so can greatly reduce query execution time, because instead of a full table scan MySQL can simply ignore any partitions outside the queried date range. The bigger the table, the more benefit you will see; I would expect anything from 2x to 10x faster.
If you were to replace your gender column with three columns (male, female and unknown), you could replace the three queries containing the slow COUNT(DISTINCT ...) expressions with a single query with fewer conditions. You can also add the user ID to the GROUP BY clause to remove the need for COUNT(DISTINCT), since you can group by more than one column.
Finally, you could add a database trigger. Either add an extra column that the trigger sets to 1 when a visit is logged, if the duration is over 30 and it is the user's first visit of the day, or create a new calendar table for visits and have the trigger increment the value in it each time a logged visit counts as a unique visit for the day.
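Before restructuring the schema, it may be worth trying a filter that compares the bare visited_at column to a constant, so that an ordinary index can be used for the date range. The sketch below assumes visited_at is a DATETIME or TIMESTAMP column; the index name is hypothetical, and the 30-day boundary falls at midnight, so it differs slightly from the SYSDATE()-based filter in the question.

```sql
-- Hypothetical index name; supports a range scan on the visit timestamp.
CREATE INDEX idx_visits_visited_at ON visits (visited_at);

SELECT DATE(v.visited_at) AS visit_date,
       COUNT(DISTINCT CASE WHEN u.gender = 'M' THEN u.id END) AS male_visit,
       COUNT(DISTINCT CASE WHEN u.gender = 'F' THEN u.id END) AS female_visit,
       COUNT(DISTINCT CASE WHEN u.gender = '' OR u.gender IS NULL THEN u.id END) AS unknown_visit
FROM visits v
INNER JOIN users u ON v.user_id = u.id
-- The column is compared directly (no DATE() around it), so the index is usable.
WHERE v.visited_at >= CURDATE() - INTERVAL 30 DAY
  AND v.duration > 30
GROUP BY DATE(v.visited_at);
```

Whether the optimizer actually chooses the index depends on how much of the table falls inside the last 30 days, so checking the plan with EXPLAIN is advisable.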

How to count the number of rows from left join with conditioning

This is my query:
SELECT claims.id,
COUNT(CASE WHEN claims.sold_at BETWEEN (now() - INTERVAL '7.day') AND now() END FROM claims) week1
FROM users
LEFT JOIN claims ON claims.sales_user_id = users.id
WHERE users.office_id = 2
What I am trying to do is join the claims table and also get a count of how many claims have a sold_at within a certain date period.
This query gives an error, and I'm not sure how to fix or approach it properly.
Try this (the date filter goes in the ON clause so that users without matching claims are kept with a count of 0):
select
users.id, count(claims.id) as count
from users
left join claims on claims.sales_user_id = users.id
and claims.sold_at between (current_timestamp - interval '7' day) and current_timestamp
where
users.office_id = 2
group by users.id;
Your query should look like this
SELECT claims.id,
COUNT(CASE
WHEN claims.sold_at BETWEEN (now() - INTERVAL '7.day') AND now()
THEN claims.id
ELSE null END) week1
FROM users
LEFT JOIN claims ON claims.sales_user_id = users.id
WHERE users.office_id = 2
Normally for this kind of thing you use SUM(IF()).
Something like this:-
SELECT claims.id,
SUM(IF(claims.sold_at BETWEEN DATE_SUB(now(), INTERVAL 7 DAY) AND now(), 1, 0)) AS week1
FROM users
LEFT JOIN claims ON claims.sales_user_id = users.id
WHERE users.office_id = 2
But on its own this makes no sense. Why is claims.id being brought back? An aggregate function without a GROUP BY clause means only a single row is returned, so the claims.id returned will be arbitrary. And I expect the id is the unique key, so grouping by it seems wrong in this case.
select
users.id,
count(
claims.sold_at between (now() - interval '7.day') and now()
or null
) week1
from
users
left join
claims on claims.sales_user_id = users.id
where users.office_id = 2
group by users.id

Search on specific users in the table

I have following query:
SELECT *
FROM users
WHERE id NOT
IN (
SELECT user_id
FROM `bids`
WHERE DATE_SUB( DATE_ADD( CURDATE( ) , INTERVAL 7
DAY ) , INTERVAL 14
DAY ) <= created
)
AND id NOT
IN (
SELECT user_id
FROM coupon_used WHERE code = 'ACTNOW'
)
AND id
IN (
SELECT user_id
FROM accounts
)
I just want to take specific users and search only among them, instead of searching all users in the table. For example, if I have a list of users with IDs 1, 2, 3, 4, 5, I want to search only those users.
Just add a WHERE clause using IN()
SELECT *
FROM users
WHERE id IN(1,2,3,4,5)
I believe using left outer joins will simplify your query and hopefully improve performance
SELECT users.*
FROM users
LEFT OUTER JOIN bids on bids.user_id = users.id AND DATE_SUB(DATE_ADD(CURDATE(), INTERVAL 7 DAY), INTERVAL 14 DAY) <= bids.created
LEFT OUTER JOIN coupon_used on coupon_used.user_id = users.id AND coupon_used.code = 'ACTNOW'
INNER JOIN accounts on accounts.user_id = users.id
WHERE bids.id is null AND coupon_used.id is null
AND users.id in (1,2,3,4,5)
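Another way to express the same filters is with NOT EXISTS / EXISTS, sketched below as an alternative to both the IN subqueries and the outer joins. It assumes the table and column names from the question; the aliases u, b, c, and a are made up. The date expression is simplified: adding 7 days and then subtracting 14 is the same as subtracting 7.

```sql
SELECT u.*
FROM users u
WHERE u.id IN (1, 2, 3, 4, 5)
  -- no bid created in the last 7 days
  AND NOT EXISTS (SELECT 1
                  FROM bids b
                  WHERE b.user_id = u.id
                    AND b.created >= CURDATE() - INTERVAL 7 DAY)
  -- has never used the ACTNOW coupon
  AND NOT EXISTS (SELECT 1
                  FROM coupon_used c
                  WHERE c.user_id = u.id
                    AND c.code = 'ACTNOW')
  -- has at least one account
  AND EXISTS (SELECT 1
              FROM accounts a
              WHERE a.user_id = u.id);
```

Unlike a join, EXISTS never multiplies rows, so a user with several accounts still appears once. NOT EXISTS also behaves predictably when user_id can be NULL in the subquery tables, which is a known pitfall of NOT IN.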

SELECT statement subquery

EDIT: I need the overall total in there subtracting both direct and indirect minutes.
I'm trying to SUM M.minutes under the alias "dminutes". Then I want to take the SUM of M.minutes again and subtract the M.minutes of rows whose task_type column is "indirect" (giving the result the alias "inminutes"). However, it's showing me NULL, so the syntax is wrong. Suggestions?
table = tasks
column = task_type
Example:
M.minutes total = 60 minutes
M.minutes (with "direct" task_type column value) = 50 minutes (AS dminutes)
M.minutes (with "indirect" task_type column value) = 10 minutes (AS inminutes)
SQL statement:
SELECT
U.user_name,
SUM(M.minutes) as dminutes,
ROUND(SUM(M.minutes))-(SELECT (SUM(M.minutes)) from summary s WHERE ta.task_type='indirect') as inminutes
FROM summary S
JOIN users U ON U.user_id = S.user_id
JOIN tasks TA ON TA.task_id = S.task_id
JOIN minutes M ON M.minutes_id = S.minutes_id
WHERE DATE(submit_date) = curdate()
AND TIME(submit_date) BETWEEN '00:00:01' and '23:59:59'
GROUP BY U.user_name
LIMIT 0 , 30
I think something like this should work.
You might have to tweak it a little.
SELECT direct.duser_id, indirect.iminutes, direct.dminutes,
direct.dminutes - indirect.iminutes FROM
(SELECT U.user_id AS iuser_id, SUM(M.minutes) AS iminutes
FROM summary S
JOIN users U
ON U.user_id = S.user_id
JOIN minutes M
ON M.minutes_id = S.minutes_id
JOIN tasks TA
ON TA.task_id = S.task_id
WHERE TA.task_type='indirect'
AND DATE(submit_date) = curdate()
AND TIME(submit_date) BETWEEN '00:00:01' and '23:59:59'
GROUP BY U.user_id) AS indirect
JOIN
(SELECT U.user_id AS duser_id, SUM(M.minutes) AS dminutes
FROM summary S
JOIN users U
ON U.user_id = S.user_id
JOIN minutes M
ON M.minutes_id = S.minutes_id
JOIN tasks TA
ON TA.task_id = S.task_id
WHERE TA.task_type='direct'
AND DATE(submit_date) = curdate()
AND TIME(submit_date) BETWEEN '00:00:01' and '23:59:59'
GROUP BY U.user_id) AS direct
WHERE indirect.iuser_id = direct.duser_id
SUM is a nasty little function:
http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html#function_sum
Returns the sum of expr. If the return set has no rows, SUM() returns
NULL. The DISTINCT keyword can be used to sum only the distinct values
of expr.
SUM() returns NULL if there were no matching rows.
Try wrapping SUM in COALESCE or IFNULL:
... COALESCE( SUM(whatever), 0) ...
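Putting the COALESCE advice together with conditional aggregation, the direct, indirect, and total minutes could be computed in a single pass, roughly as sketched below. The table and column names come from the question; the alias total_minutes is made up, and submit_date is left unqualified because the question does not say which table it belongs to.

```sql
SELECT U.user_name,
       -- CASE yields NULL for non-matching rows; SUM skips them,
       -- and COALESCE turns an all-NULL sum into 0.
       COALESCE(SUM(CASE WHEN TA.task_type = 'direct'   THEN M.minutes END), 0) AS dminutes,
       COALESCE(SUM(CASE WHEN TA.task_type = 'indirect' THEN M.minutes END), 0) AS inminutes,
       COALESCE(SUM(M.minutes), 0) AS total_minutes
FROM summary S
JOIN users U   ON U.user_id = S.user_id
JOIN tasks TA  ON TA.task_id = S.task_id
JOIN minutes M ON M.minutes_id = S.minutes_id
WHERE DATE(submit_date) = CURDATE()
GROUP BY U.user_name
LIMIT 0, 30;
```

With the example figures from the question, this would be expected to give dminutes = 50, inminutes = 10, and total_minutes = 60 for that user, and a user with no indirect tasks gets 0 instead of NULL.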