Calculate unique items seen by users via sql - mysql

I need help with the following case.
The data that users want to see is fetched through paginated requests, and these requests are later stored in the database in the following form:
+----+---------+-------+--------+
| id | user id | first | amount |
+----+---------+-------+--------+
| 1 | 1 | 0 | 5 |
| 2 | 1 | 10 | 10 |
| 3 | 1 | 10 | 5 |
| 4 | 1 | 15 | 10 |
| 5 | 2 | 0 | 10 |
| 6 | 2 | 0 | 5 |
| 7 | 2 | 10 | 5 |
+----+---------+-------+--------+
The table is ordered by user id asc, first asc, amount desc.
The task is to write an SQL statement that calculates the total number of unique items each user has seen.
For the first user total amount must be 20, since the request with id=1 returned first 5 items, with id=2 returned another 10 items. Request with id=3 returns data already 'seen' by request with id=2. Request with id=4 intersects with id=2, but still returns 5 'unseen' pieces of data.
For the second user total amount must be 15.
The SQL statement should produce the following output:
+---------+-------+
| user id | total |
+---------+-------+
| 1 | 20 |
+---------+-------+
| 2 | 15 |
+---------+-------+
I am using MySQL 5.7, so window functions are not available to me. I have been stuck on this task for a day already and still cannot get the desired output. If it is not possible with this setup, I will end up calculating the results in the application code. I would appreciate any suggestions or help with this task, thank you!

This is a type of gaps and islands problem. In this case, use a cumulative max to determine if one request intersects with a previous request. If not, that is the beginning of an "island" of adjacent requests. A cumulative sum of the beginnings assigns an "island", then an aggregation counts each island.
So, the islands look like this:
select userid, min(first), max(first + amount) as last
from (select t.*,
sum(case when prev_last >= first then 0 else 1 end) over
(partition by userid order by first) as grp
from (select t.*,
max(first + amount) over (partition by userid order by first range between unbounded preceding and 1 preceding) as prev_last
from t
) t
) t
group by userid, grp;
You then want this summed by userid, so that is one more level of aggregation:
with islands as (
select userid, min(first) as first, max(first + amount) as last
from (select t.*,
sum(case when prev_last >= first then 0 else 1 end) over
(partition by userid order by first) as grp
from (select t.*,
max(first + amount) over (partition by userid order by first range between unbounded preceding and 1 preceding) as prev_last
from t
) t
) t
group by userid, grp
)
select userid, sum(last - first) as total
from islands
group by userid;
Here is a db<>fiddle.
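For anyone who ends up doing this in application code, as the question considers, here is a minimal Python sketch of the same gaps-and-islands idea. The function name `unique_items_seen` and the tuple layout are my own choices for illustration, not part of the answer above. It treats each request as the half-open range [first, first + amount) and, like the `prev_last >= first` test, keeps a request in the current island when it starts at or before the island's end.

```python
from collections import defaultdict


def unique_items_seen(requests):
    """requests: iterable of (user_id, first, amount) tuples (hypothetical layout)."""
    by_user = defaultdict(list)
    for user_id, first, amount in requests:
        # Each request covers the half-open range [first, first + amount).
        by_user[user_id].append((first, first + amount))

    totals = {}
    for user_id, ranges in by_user.items():
        ranges.sort()
        total = 0
        island_start, island_end = ranges[0]
        for start, end in ranges[1:]:
            if start > island_end:
                # Gap found: close the current island and start a new one.
                total += island_end - island_start
                island_start, island_end = start, end
            else:
                # Overlapping or adjacent: extend the island (the "cumulative max").
                island_end = max(island_end, end)
        total += island_end - island_start
        totals[user_id] = total
    return totals


requests = [
    (1, 0, 5), (1, 10, 10), (1, 10, 5), (1, 15, 10),
    (2, 0, 10), (2, 0, 5), (2, 10, 5),
]
totals = unique_items_seen(requests)
print(totals)  # {1: 20, 2: 15}
```

For user 1 the islands are [0, 5) and [10, 25), which gives 5 + 15 = 20.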

This logic is similar to Gordon's, but runs on older releases of MySQL, too.
select userid
-- overall length minus gaps
,max(maxlast)-min(minfirst) + sum(gaplen) as total
from
(
select userid
,prevlast
,min(first) as minfirst -- first of group
,max(last) as maxlast -- last of group
-- if there was a gap, calculate length of gap
,min(case when prevlast < first then prevlast - first else 0 end) as gaplen
from
(
select t.*
,first + amount as last -- last value in range
,( -- maximum end of all previous rows
select max(first + amount)
from t as t2
where t2.userid = t.userid
and t2.first < t.first
) as prevlast
from t
) as dt
group by userid, prevlast
) as dt
group by userid
order by userid
See fiddle
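To check the numbers without a MySQL server, the query above can be run through Python's built-in sqlite3 module. This is only a sketch: SQLite is not MySQL, and I renamed the columns to `first_item` and `last_item` and the outer alias to `dt2` (my own names) to stay clear of keyword clashes. Otherwise the query is unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, userid INTEGER, first_item INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, 1, 0, 5), (2, 1, 10, 10), (3, 1, 10, 5), (4, 1, 15, 10),
        (5, 2, 0, 10), (6, 2, 0, 5), (7, 2, 10, 5),
    ],
)

QUERY = """
select userid,
       -- overall length minus gaps (gaplen is zero or negative)
       max(maxlast) - min(minfirst) + sum(gaplen) as total
from (
    select userid, prevlast,
           min(first_item) as minfirst,
           max(last_item) as maxlast,
           min(case when prevlast < first_item
                    then prevlast - first_item else 0 end) as gaplen
    from (
        select t.*,
               first_item + amount as last_item,
               (select max(t2.first_item + t2.amount)
                from t as t2
                where t2.userid = t.userid
                  and t2.first_item < t.first_item) as prevlast
        from t
    ) as dt
    group by userid, prevlast
) as dt2
group by userid
order by userid
"""

totals = conn.execute(QUERY).fetchall()
print(totals)  # [(1, 20), (2, 15)]
```

For user 1 the only gap is between 5 and 10, so the total is 25 - 0 - 5 = 20.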

Related

How can I count summed rows as one row in LIMIT?

I want to select user's notifications according to these rules:
all unread notifications
always 2 read notifications
at least 15 notifications (by default)
Here is my query which gets user's notifications ids:
( SELECT id FROM events -- all unread messages
WHERE author_id = ? AND seen = 0
) UNION
( SELECT id FROM events -- 2 read messages
WHERE author_id = ? AND seen <> 0
ORDER BY date_time desc
LIMIT 2
) UNION
( SELECT id FROM events -- at least 15 rows by default
WHERE author_id = ?
ORDER BY seen, date_time desc
LIMIT 15
)
And then I select the matched ids in query above plus other info like this: (I don't want to combine these two queries because of some reasons in reality)
SELECT SUM(score) score, post_id, title, content, date_time
FROM events
WHERE id IN ($ids)
GROUP BY post_id, title, content, date_time
ORDER BY seen, MAX(date_time) desc
It works fine.
The problem is: when the first query selects 15 rows that all have the same post_id, the second query sums them up and shows them as one notification row with the total score.
I guess I have to add that SUM() to the first query as well? And that GROUP BY? Any idea?
An example of the problem: if a user earns 15 upvotes, the first query selects them as 15 notifications, and the second query turns them into one notification. How can I get 15 separate notifications? (The notifications that will be summed in the second query should be counted as one notification in the first query. How?)
As you finally want 15 rows per group, you should have rules on groups rather than on messages in my opinion.
You can aggregate your data per group and then check whether the group shall be in your results. You'd do this in the HAVING clause with conditional aggregation, i.e. an aggregation function used on a conditional expression. This is one method to count unread messages for example:
SUM(CASE WHEN seen = 0 THEN 1 ELSE 0 END)
This is another:
COUNT(CASE WHEN seen = 0 THEN 1 END)
(The ELSE branch is omitted and defaults to null, which is not counted.)
In MySQL these expressions are even simpler, because false equals 0 and true equals 1. So in MySQL you'd count with:
SUM(seen = 0)
You can use other aggregation functions, too:
HAVING MAX(seen = 0) = 0 -- no unread messages
HAVING MIN(seen = 0) = 1 -- no read messages
Now let's select all groups with at least one unread message:
SELECT SUM(score) AS score, post_id, title, content, date_time
FROM events
GROUP BY post_id, title, content, date_time
HAVING SUM(seen = 0) > 0;
(We could also use HAVING MAX(seen = 0) = 1.)
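To see that the three counting expressions agree, here is a small sketch using Python's built-in sqlite3 module. SQLite happens to evaluate `seen = 0` to 0 or 1 as MySQL does, but it is not MySQL, and the five sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, post_id INTEGER, seen INTEGER)")
# Hypothetical sample data: post 11 has two unread rows, post 22 has none.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 11, 0), (2, 11, 0), (3, 11, 1), (4, 22, 1), (5, 22, 1)],
)

rows = conn.execute(
    """
    SELECT post_id,
           SUM(CASE WHEN seen = 0 THEN 1 ELSE 0 END) AS unread_sum_case,
           COUNT(CASE WHEN seen = 0 THEN 1 END)      AS unread_count_case,
           SUM(seen = 0)                             AS unread_sum_bool
    FROM events
    GROUP BY post_id
    ORDER BY post_id
    """
).fetchall()
print(rows)  # [(11, 2, 2, 2), (22, 0, 0, 0)]

# Groups with at least one unread message, filtered in HAVING.
unread_groups = conn.execute(
    "SELECT post_id FROM events GROUP BY post_id HAVING SUM(seen = 0) > 0"
).fetchall()
print(unread_groups)  # [(11,)]
```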
Now apply your UNION approach to get all groups with at least one unread message, plus as many other groups as necessary to reach at least 15 groups:
(
SELECT SUM(score) AS score, post_id, title, content, date_time, SUM(seen = 0) as unread
FROM events
GROUP BY post_id, title, content, date_time
HAVING SUM(seen = 0) > 0
)
UNION
(
SELECT SUM(score) AS score, post_id, title, content, date_time, SUM(seen = 0) as unread
FROM events
GROUP BY post_id, title, content, date_time
ORDER BY SUM(seen = 0) DESC, date_time DESC
LIMIT 15
)
ORDER BY (unread = 0), date_time DESC;
If you want the single IDs for above groups, then use IN:
SELECT id
FROM events
WHERE (post_id, title, content, date_time) IN
(
SELECT post_id, title, content, date_time
FROM (<above query>) q
);
This is not an answer, but too long for a comment:
You think the rules are all clear, but are they? Let's say it's not at least 15 but only at least 5 rows you want in your final results. From the following table you'd want the IDs 1, 2, 3, and 4, because these are unread. But what about the others?
id | score | post_id | title | content | date_time | seen
---+-------+---------+-------+---------+---------------------+-----
1 | 10 | 11 | hello | it's me | 2018-01-11 12:34:56 | 0
2 | 20 | 22 | hello | it's me | 2018-01-12 12:34:56 | 0
3 | 30 | 33 | hello | it's me | 2018-01-13 12:34:56 | 0
4 | 40 | 44 | hello | it's me | 2018-01-14 12:34:56 | 0
5 | 50 | 11 | hello | it's me | 2018-01-11 12:34:56 | 1
6 | 60 | 22 | hello | it's me | 2018-01-12 12:34:56 | 1
7 | 70 | 44 | hello | it's me | 2018-01-14 12:34:56 | 1
8 | 80 | 55 | hello | it's me | 2018-01-05 12:34:56 | 1
9 | 90 | 55 | hello | it's me | 2018-01-05 12:34:56 | 1
Does it matter that there are read notifications for the same groups? Does it matter that they are newer than notifications 8 and 9? Or will you simply add ID 8 (or 9?) to the set and be done?
No matter whether you select IDs 1, 2, 3, 4, and say 8 or you select all rows, you'd end up with five groups. So please tell us which IDs you'd select and why.

Select promoted items grouped by another attribute

From table like below:
id | node_id | promoted | group_type | created_at |status
------------------------------------------------------------------
8 | 4321 | 1 | 3 | 2018-01-08 13:29:55| 1
4 | 4321 | 0 | 3 | 2018-01-06 11:22:53| 1
3 | 4321 | 0 | 1 | 2018-01-05 23:19:02| 1
2 | 4321 | 1 | 1 | 2018-01-05 21:20:15| 1
1 | 4321 | 1 | 3 | 2018-01-05 11:09:51| 1
I have to get one id and group_type value for each group_type.
If there is a promoted item in the group, the query should return its id and group_type.
If there is more than one promoted item in the group, the most recent promoted record should be returned.
If there is no promoted item in the group, the query should return the most recent record.
Using query below I managed to get almost what I need
SELECT a.id, a.group_type, a.promoted, a.created_at
FROM (
SELECT group_type, MAX(promoted) AS max_promoted
FROM nodes
WHERE node_id=4321 AND status=1
GROUP BY group_type
) AS g
INNER JOIN nodes AS a
ON a.group_type = g.group_type AND a.promoted = g.max_promoted
WHERE node_id= 4321 AND status=1 ORDER BY created_at
Unfortunately when there is more than one promoted item in the group I get both.
Any idea how to get only one promoted item per group?
EDIT:
If there is more than one group, query should return multiple rows but one per every group.
You can limit the result of the query by adding LIMIT 0,1 at the end of the query.
Since you have ordered your result, it will work.
For more information about LIMIT, see: https://dev.mysql.com/doc/refman/5.7/en/limit-optimization.html
Edited: You should order the items in descending order to get the latest one on top, and limit the number of rows as required (1, 2, and so on). A UNION also helps to fall back to the latest record when there is no promoted one. The final LIMIT returns only the single required row. Here's your query:
(SELECT a.id, a.group_type, a.promoted, a.created_at
FROM (
SELECT group_type, MAX(promoted) AS max_promoted
FROM nodes
WHERE node_id=4321 and status=1
GROUP BY group_type
) AS g
INNER JOIN nodes AS a
ON a.group_type = g.group_type AND a.promoted = g.max_promoted
WHERE node_id= 4321 and status=1 ORDER BY created_at desc
limit 1)
union
(select a.id, a.group_type, a.promoted, a.created_at from nodes a order by created_at desc limit 1)
limit 1
Hope it helps!
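As a different take on the "one row per group" requirement, here is a sketch that is not taken from the answer above. It uses a correlated subquery that, for each group_type, picks the row sorted first by promoted DESC, then created_at DESC. It is run through Python's sqlite3 module on the sample rows from the question; I have not tested it on MySQL, so treat it as an assumption that it carries over unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE nodes (
           id INTEGER, node_id INTEGER, promoted INTEGER,
           group_type INTEGER, created_at TEXT, status INTEGER)"""
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)",
    [
        (8, 4321, 1, 3, "2018-01-08 13:29:55", 1),
        (4, 4321, 0, 3, "2018-01-06 11:22:53", 1),
        (3, 4321, 0, 1, "2018-01-05 23:19:02", 1),
        (2, 4321, 1, 1, "2018-01-05 21:20:15", 1),
        (1, 4321, 1, 3, "2018-01-05 11:09:51", 1),
    ],
)

picked = conn.execute(
    """
    SELECT n.id, n.group_type
    FROM nodes AS n
    WHERE n.node_id = 4321 AND n.status = 1
      AND n.id = (
          -- best row of the group: promoted first, then most recent
          SELECT m.id
          FROM nodes AS m
          WHERE m.node_id = n.node_id
            AND m.status = 1
            AND m.group_type = n.group_type
          ORDER BY m.promoted DESC, m.created_at DESC
          LIMIT 1)
    ORDER BY n.group_type
    """
).fetchall()
print(picked)  # [(2, 1), (8, 3)]
```

Group 3 has two promoted rows (ids 8 and 1), and the more recent one, id 8, is returned. A group without any promoted row would fall back to its most recent record because of the same ordering.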

MySQL SUM previous row by date column using Union

I am hoping I am just stumped because it's the end of the work day on a Monday, and someone here can give me a hand.
Basically I have 2 tables that have invoice information and a table that has payment information. Using the following I get the first part of my display.
SELECT d.id, i.id as invid, i.company_id, d.total, created, adjustment FROM tbl_finance_invoices as i
LEFT JOIN tbl_finance_invoice_details as d ON d.invoice_id = i.id
WHERE company_id = '69350'
UNION
SELECT id, 0, comp_id, amount_paid, uploaded_date, 'paid' FROM tbl_finance_invoice_paid_items
WHERE comp_id = '69350'
ORDER BY created
What I want to do is:
Create a new column called "Balance" that adds total to the previous total by the created column regardless of how the rest of the table is sorted.
To give a quick example, my current output is something like:
id | invid | company_id | total | created | adjustment
12 | 16 | 1 | 40 | 01/01/16| 0
100| 0 | 1 | 10 | 01/05/16| 0
50 | 20 | 1 | 50 | 05/01/16| 0
What my goal is would be:
id | invid | company_id | total | created | adjustment | balance |Notes
12 | 16 | 1 | 40 | 01/01/16| 0 | 40 | 0 + 40
100| 0 | 1 | 10 | 01/05/16| 1 | 50 | 40 + 10
50 | 20 | 1 | 50 | 05/01/16| 0 | 100 | 50 + 50
And regardless of sorting by id, invid, total, created, etc, the balance would always be tied to the created date.
So if I added a "Where adjustment = '1'" to my sql, I would get:
100| 0 | 1 | 10 | 01/05/16| 1 | 50 | 40 + 10
Since the OP confirmed my understanding in comments, I'm basing my answer on the following assumption:
The running total would be tied to the order of created_date. The
running total would only be affected by company id as a filtering
criterion, all other filters should be disregarded for that
calculation.
Since the running total may have a different ORDER BY and different filtering criteria than the rest of the query, the running total calculation has to be placed in a subquery.
The other assumption I have to make is that there cannot be more than one invoice with the same created date for a single customer id, since the original query in the OP does not have any GROUP BY or summing either.
I prefer the approach suggested by @OMG Ponies in this post on SO, where he initializes the MySQL variable holding the running total in a subquery, so there is no need to initialize the variable in a separate SET statement.
SELECT d.id, i.id as invid, i.company_id, rt.total, rt.cumulative_sum, rt.created, adjustment
FROM tbl_finance_invoices as i
LEFT JOIN tbl_finance_invoice_details as d ON d.invoice_id = i.id
LEFT JOIN
(SELECT d.total, created, @running_total := @running_total + d.total AS cumulative_sum
FROM tbl_finance_invoices as i
LEFT JOIN tbl_finance_invoice_details as d ON d.invoice_id = i.id
JOIN (SELECT @running_total := 0) r -- no join condition, so this produces a Cartesian join
WHERE company_id = '69350'
ORDER BY created) rt
ON i.created = rt.created -- this is also an assumption, I do not know which original table holds the created field
WHERE company_id = '69350' and adjustment = 1
ORDER BY d.id
If you need to take the amounts from the tbl_finance_invoice_paid_items into account as well, then you need to add that to the subquery.
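The key point, computing the balance over all rows in created order before applying any other filter, can be sketched in a few lines of Python. The rows are taken from the goal table in the question. The ISO dates assume the original values are in MM/DD/YY format, and the helper name `with_balance` is made up for this example.

```python
from itertools import accumulate

# Rows from the question's goal table; dates assumed to be MM/DD/YY.
rows = [
    {"id": 12, "invid": 16, "total": 40, "created": "2016-01-01", "adjustment": 0},
    {"id": 100, "invid": 0, "total": 10, "created": "2016-01-05", "adjustment": 1},
    {"id": 50, "invid": 20, "total": 50, "created": "2016-05-01", "adjustment": 0},
]


def with_balance(rows):
    """Attach a running balance ordered by 'created', ignoring other filters."""
    ordered = sorted(rows, key=lambda r: r["created"])
    balances = accumulate(r["total"] for r in ordered)
    return [dict(r, balance=b) for r, b in zip(ordered, balances)]


balanced = with_balance(rows)
print([r["balance"] for r in balanced])  # [40, 50, 100]

# Filtering happens afterwards, so the balance of the remaining row is unchanged.
adjusted = [r for r in balanced if r["adjustment"] == 1]
print([(r["id"], r["balance"]) for r in adjusted])  # [(100, 50)]
```

This mirrors the role of the subquery: the running total is fixed first, and the outer filter on adjustment only decides which rows are shown.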

MAX function in MySQL does not return proper key value

I have a table called tbl_user_sal:
| id | user_id | salary | date |
| 1 | 1 | 1000 | 2014-12-01 |
| 2 | 1 | 2000 | 2014-12-02 |
Now I want to get the id of the maximum date. I used the following query:
SELECT MAX(date) AS from_date, id, user_id, salary
FROM tbl_user_sal
WHERE user_id = 1
But it gave me this output:
| id | user_id | salary | from_date |
| 1 | 1 | 2000 | 2014-12-02 |
Which is correct as far as the max date being 2014-12-02, but the corresponding id is not correct. This happens for other records as well. I used order by to check but that was not successful either. Can anyone shed some light on this?
Note: For my needs, the row with the max date does not necessarily have the max id. A record can have the max date but an older id.
If you only want to retrieve that information for a single user, which you seem to, because of your WHERE clause, just use ORDER BY and LIMIT:
SELECT *
FROM tbl_user_sal
WHERE user_id = 1
ORDER BY date DESC
LIMIT 1
If you want to do that for every user, however, you will have to get a little bit fancier. Something like this should do it:
SELECT t2.id, user_id, date
-- find max date for each user_id
FROM (SELECT user_id, MAX(date) AS date
FROM tbl_user_sal
GROUP BY user_id) AS t1
-- join ids for each max date/user_id combo
JOIN tbl_user_sal AS t2
USING (user_id, date)
-- limit to 1 id for every user_id
GROUP BY
user_id
You are missing a GROUP BY clause. Try this:
select max(awrd_date) as from_date,awrd_id
from tbl_user_sal
where awrd_user_id = 106
group by awrd_id
What I believe you should do here is have a subquery that pulls the max date, and your outer query looks for the row with that date.
It looks like this:
SELECT *
FROM myTable
WHERE date = (SELECT MAX(date) FROM myTable);
Additional things may need to be added if you want to search for a specific user_id, or get the largest date for each user_id, but this gives your expected results for this example here.
Here is the SQL Fiddle.
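As a sketch of the "additional things" mentioned above, the subquery can be correlated on user_id to get the latest row for each user. The example below runs through Python's sqlite3 module rather than MySQL, and the two rows for user 2 are hypothetical, added so that the row with the latest date is not the one with the highest id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_user_sal (id INTEGER, user_id INTEGER, salary INTEGER, date TEXT)"
)
conn.executemany(
    "INSERT INTO tbl_user_sal VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1000, "2014-12-01"),
        (2, 1, 2000, "2014-12-02"),
        (3, 2, 1500, "2014-12-05"),  # hypothetical row: latest date, lower id
        (4, 2, 1800, "2014-12-03"),  # hypothetical row
    ],
)

# Single user: order by date and keep the first row.
latest_for_user = conn.execute(
    """SELECT id, user_id, salary, date
       FROM tbl_user_sal
       WHERE user_id = 1
       ORDER BY date DESC
       LIMIT 1"""
).fetchone()
print(latest_for_user)  # (2, 1, 2000, '2014-12-02')

# Every user: compare each row's date with that user's maximum date.
latest_per_user = conn.execute(
    """SELECT t.id, t.user_id, t.salary, t.date
       FROM tbl_user_sal AS t
       WHERE t.date = (SELECT MAX(t2.date)
                       FROM tbl_user_sal AS t2
                       WHERE t2.user_id = t.user_id)
       ORDER BY t.user_id"""
).fetchall()
print(latest_per_user)  # [(2, 1, 2000, '2014-12-02'), (3, 2, 1500, '2014-12-05')]
```

If two rows of the same user share the maximum date, the second query returns both of them.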

MySQL: group by consecutive days and count groups

I have a database table which holds each user's checkins in cities. I need to know how many days a user has been in a city, and then, how many visits a user has made to a city (a visit consists of consecutive days spent in a city).
So, consider I have the following table (simplified, containing only the DATETIMEs - same user and city):
datetime
-------------------
2011-06-30 12:11:46
2011-07-01 13:16:34
2011-07-01 15:22:45
2011-07-01 22:35:00
2011-07-02 13:45:12
2011-08-01 00:11:45
2011-08-05 17:14:34
2011-08-05 18:11:46
2011-08-06 20:22:12
The number of days this user has been to this city would be 6 (30.06, 01.07, 02.07, 01.08, 05.08, 06.08).
I thought of doing this using SELECT COUNT(id) FROM table GROUP BY DATE(datetime)
Then, for the number of visits this user has made to this city, the query should return 3 (30.06-02.07, 01.08, 05.08-06.08).
The problem is that I have no idea how shall I build this query.
Any help would be highly appreciated!
You can find the first day of each visit by finding checkins where there was no checkin the day before.
select count(distinct date(start_of_visit.datetime))
from checkin start_of_visit
left join checkin previous_day
on start_of_visit.user = previous_day.user
and start_of_visit.city = previous_day.city
and date(start_of_visit.datetime) - interval 1 day = date(previous_day.datetime)
where previous_day.id is null
There are several important parts to this query.
First, each checkin is joined to any checkin from the previous day. But since it's an outer join, if there was no checkin the previous day the right side of the join will have NULL results. The WHERE filtering happens after the join, so it keeps only those checkins from the left side where there are none from the right side. LEFT OUTER JOIN/WHERE IS NULL is really handy for finding where things aren't.
Then it counts distinct checkin dates to make sure it doesn't double-count if the user checked in multiple times on the first day of the visit. (I actually added that part on edit, when I spotted the possible error.)
Edit: I just re-read your proposed query for the first question. Your query would get you the number of checkins on a given date, instead of a count of dates. I think you want something like this instead:
select count(distinct date(datetime))
from checkin
where user='some user' and city='some city'
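The "no checkin the day before" idea is easy to verify outside the database. Here is a minimal Python sketch on the timestamps from the question. The function name `days_and_visits` is my own, and the data is for a single user and city, as in the question.

```python
from datetime import datetime, timedelta

checkins = [
    "2011-06-30 12:11:46",
    "2011-07-01 13:16:34",
    "2011-07-01 15:22:45",
    "2011-07-01 22:35:00",
    "2011-07-02 13:45:12",
    "2011-08-01 00:11:45",
    "2011-08-05 17:14:34",
    "2011-08-05 18:11:46",
    "2011-08-06 20:22:12",
]


def days_and_visits(timestamps):
    """Return (distinct days, visits) for one user in one city."""
    days = {datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date() for ts in timestamps}
    # A day starts a visit when there is no check-in on the previous day,
    # which is what the LEFT JOIN ... WHERE ... IS NULL expresses.
    visit_starts = [d for d in days if d - timedelta(days=1) not in days]
    return len(days), len(visit_starts)


result = days_and_visits(checkins)
print(result)  # (6, 3)
```

Using a set of dates plays the role of COUNT(DISTINCT ...): several check-ins on the first day of a visit still count as one start.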
Try to apply this code to your task -
CREATE TABLE visits(
user_id INT(11) NOT NULL,
dt DATETIME DEFAULT NULL
);
INSERT INTO visits VALUES
(1, '2011-06-30 12:11:46'),
(1, '2011-07-01 13:16:34'),
(1, '2011-07-01 15:22:45'),
(1, '2011-07-01 22:35:00'),
(1, '2011-07-02 13:45:12'),
(1, '2011-08-01 00:11:45'),
(1, '2011-08-05 17:14:34'),
(1, '2011-08-05 18:11:46'),
(1, '2011-08-06 20:22:12'),
(2, '2011-08-30 16:13:34'),
(2, '2011-08-31 16:13:41');
SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;
SELECT v.user_id,
COUNT(DISTINCT(DATE(dt))) number_of_days,
MAX(days) number_of_visits
FROM
(SELECT user_id, dt,
@i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
@last_dt := DATE(dt),
@last_user := user_id
FROM
visits
ORDER BY
user_id, dt
) v
GROUP BY
v.user_id;
----------------
Output:
+---------+----------------+------------------+
| user_id | number_of_days | number_of_visits |
+---------+----------------+------------------+
| 1 | 6 | 3 |
| 2 | 2 | 1 |
+---------+----------------+------------------+
Explanation:
To understand how it works let's check the subquery, here it is.
SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;
SELECT user_id, dt,
@i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS
days,
@last_dt := DATE(dt) lt,
@last_user := user_id lu
FROM
visits
ORDER BY
user_id, dt;
As you can see, the query returns all rows and ranks them by visit. This is a known ranking method based on variables; note that the rows are ordered by the user and date fields. The query produces the following data set, where the days column holds the rank of the visit each row belongs to -
+---------+---------------------+------+------------+----+
| user_id | dt | days | lt | lu |
+---------+---------------------+------+------------+----+
| 1 | 2011-06-30 12:11:46 | 1 | 2011-06-30 | 1 |
| 1 | 2011-07-01 13:16:34 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-01 15:22:45 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-01 22:35:00 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-02 13:45:12 | 1 | 2011-07-02 | 1 |
| 1 | 2011-08-01 00:11:45 | 2 | 2011-08-01 | 1 |
| 1 | 2011-08-05 17:14:34 | 3 | 2011-08-05 | 1 |
| 1 | 2011-08-05 18:11:46 | 3 | 2011-08-05 | 1 |
| 1 | 2011-08-06 20:22:12 | 3 | 2011-08-06 | 1 |
| 2 | 2011-08-30 16:13:34 | 1 | 2011-08-30 | 2 |
| 2 | 2011-08-31 16:13:41 | 1 | 2011-08-31 | 2 |
+---------+---------------------+------+------------+----+
Then we group this data set by user and use aggregate functions:
'COUNT(DISTINCT(DATE(dt)))' - counts the number of days
'MAX(days)' - the number of visits, it is a maximum value for the days field from our subquery.
That is all;)
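If the variable juggling is hard to follow, the same row-by-row scan can be written as an ordinary loop. The Python sketch below is my own illustration, with made-up function names `rank_visits` and `summarize`. It keeps the previous user and previous day in local variables, the way @last_user and @last_dt are used above, and then aggregates per user.

```python
from datetime import datetime

visits = [
    (1, "2011-06-30 12:11:46"), (1, "2011-07-01 13:16:34"),
    (1, "2011-07-01 15:22:45"), (1, "2011-07-01 22:35:00"),
    (1, "2011-07-02 13:45:12"), (1, "2011-08-01 00:11:45"),
    (1, "2011-08-05 17:14:34"), (1, "2011-08-05 18:11:46"),
    (1, "2011-08-06 20:22:12"), (2, "2011-08-30 16:13:34"),
    (2, "2011-08-31 16:13:41"),
]


def rank_visits(rows):
    """Scan rows ordered by (user_id, dt) and number each user's visits."""
    last_user, last_day, rank = None, None, 0
    ranked = []
    for user_id, dt in sorted(rows):  # ISO timestamps sort chronologically
        day = datetime.strptime(dt, "%Y-%m-%d %H:%M:%S").date()
        if last_user != user_id:
            rank = 1  # new user: restart the numbering
        elif (day - last_day).days > 1:
            rank += 1  # more than one day since the last check-in: new visit
        ranked.append((user_id, day, rank))
        last_user, last_day = user_id, day
    return ranked


def summarize(ranked):
    """Per user: (number of distinct days, number of visits)."""
    summary = {}
    for user_id, day, rank in ranked:
        days, max_rank = summary.get(user_id, (set(), 0))
        days.add(day)
        summary[user_id] = (days, max(max_rank, rank))
    return {user: (len(days), n) for user, (days, n) in summary.items()}


result = summarize(rank_visits(visits))
print(result)  # {1: (6, 3), 2: (2, 1)}
```

The loop makes the dependency on row order explicit, which is the part the SQL version leaves to ORDER BY in the subquery.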
Using the sample data provided by Devart, the inner "PreQuery" works with SQL variables. By defaulting @LUser to -1 (a probably non-existent user ID), the IF() test detects any difference between the last user and the current one. As soon as a new user appears, the row gets a value of 1. Likewise, if the new check-in date is more than 1 day after the last date, the row gets a value of 1. The subsequent columns then set @LUser and @LDate to the values of the record just tested, ready for the next row. Finally, the outer query sums and counts these rows to produce the final results for the Devart data set:
User ID | Distinct Visits | Total Days
1       | 3               | 9
2       | 1               | 2
select PreQuery.User_ID,
sum( PreQuery.NextVisit ) as DistinctVisits,
count(*) as TotalDays
from
( select v.user_id,
if( @LUser <> v.User_ID OR @LDate < ( date( v.dt ) - Interval 1 day ), 1, 0 ) as NextVisit,
@LUser := v.user_id,
@LDate := date( v.dt )
from
Visits v,
( select @LUser := -1, @LDate := date(now()) ) AtVars
order by
v.user_id,
v.dt ) PreQuery
group by
PreQuery.User_ID
For the first sub-task:
select count(*)
from (
select TO_DAYS(p.d)
from p
group by TO_DAYS(p.d)
) t
I think you should consider changing the database structure. You could add a visits table and a visit_id column to your checkins table. Each time you register a new checkin, you check whether there is a checkin from the day before. If there is, you add the new checkin with the visit_id of yesterday's checkin. If not, you add a new visit to visits and a new checkin with the new visit_id.
Then you could get your data in one query with something like this:
SELECT COUNT(id) AS number_of_days, COUNT(DISTINCT visit_id) number_of_visits FROM checkin GROUP BY user, city
It's not optimal, but it is still better than working with the current structure, and it will work. If the results can come from separate queries, it will also run very fast.
The drawbacks, of course, are that you will need to change the database structure, do some more scripting, and convert the current data to the new structure (i.e. you will need to add visit_id to the existing data).