Complicated overlap in Mysql query - mysql

Here is my problem: I have a MySQL table with the following columns and sample data:
id | user | starting date | ending date | activity code
1 | Andy | 2010-04-01 | 2010-05-01 | 3
2 | Andy | 1988-11-01 | 1991-03-01 | 3
3 | Andy | 2005-06-01 | 2008-08-01 | 3
4 | Andy | 2005-08-01 | 2008-11-01 | 3
5 | Andy | 2005-06-01 | 2010-05-01 | 4
6 | Ben | 2010-03-01 | 2011-06-01 | 3
7 | Ben | 2010-03-01 | 2010-05-01 | 4
8 | Ben | 2005-04-01 | 2011-05-01 | 3
As you can see in this table, users can share the same activity code and have similar dates or periods. For a given user, periods may or may not overlap, and several overlapping periods can exist in the table.
What I want is a MySQL query that returns the following result:
new id | user | starting date | ending date | activity code
1 | Andy | 2010-04-01 | 2010-05-01 | 3 => ok, no overlap period
2 | Andy | 1988-11-01 | 1991-03-01 | 3 => ok, no overlap period
3 | Andy | 2005-06-01 | 2008-11-01 | 3 => same user, same activity but ending date coming from row 4 as extended period
4 | Andy | 2005-06-01 | 2010-05-01 | 4 => ok other activity code
5 | Ben | 2005-04-01 | 2011-06-01 | 3 => ok other user, but as overlap period rows 6 and 8 for the same user and activity, I take the widest range
6 | Ben | 2010-03-01 | 2010-05-01 | 4 => ok other activity for second user
In other words, for the same user and activity code, if there is no overlap, I need the starting and ending dates as they are. If there is an overlap for the same user and activity code, I need the lowest starting date and the highest ending date from the related rows. I need this for all the users and activity codes in the table, in SQL for MySQL.
I hope it is clear enough and that someone can help me, because I have tried code from several solutions posted on this site and others without success.

I have a somewhat convoluted (strictly MySQL-specific) solution:
SET @user = NULL;
SET @activity = NULL;
SET @interval_id = 0;
SELECT
  MIN(inn.`starting date`) AS start,
  MAX(inn.`ending date`) AS end,
  inn.user,
  inn.`activity code`
FROM
  (SELECT
    IF(user <> @user OR `activity code` <> @activity,
       @interval_id := @interval_id + 1, NULL),
    IF(user <> @user OR `activity code` <> @activity,
       @interval_end := STR_TO_DATE('',''), NULL),
    @user := user,
    @activity := `activity code`,
    @interval_id := IF(`starting date` > @interval_end,
                       @interval_id + 1,
                       @interval_id) AS interval_id,
    @interval_end := IF(`starting date` < @interval_end,
                        GREATEST(@interval_end, `ending date`),
                        `ending date`) AS interval_end,
    t.*
  FROM Table1 t
  ORDER BY t.user, t.`activity code`, t.`starting date`, t.`ending date`) inn
GROUP BY inn.user, inn.`activity code`, inn.interval_id;
The underlying idea was shamelessly borrowed from the 1st answer to this question.
You can use this SQL Fiddle to review the results and try different source data.
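For anyone who wants to sanity-check the expected output outside MySQL, here is a minimal Python sketch of the same kind of sweep: sort by user, activity code and starting date, then extend the current period while the next row starts before it ends. The function name merge_periods and the tuple layout are my own assumptions, the dates are ISO strings (which compare correctly as text), and periods that merely touch are treated as overlapping.

```python
from itertools import groupby


def merge_periods(rows):
    """Merge overlapping periods per (user, activity code).

    rows: iterable of (user, activity, start, end) with ISO date strings.
    Returns a list of merged (user, activity, start, end) tuples.
    """
    merged = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2], r[3]))
    for (user, activity), group in groupby(ordered, key=lambda r: (r[0], r[1])):
        current_start = current_end = None
        for _, _, start, end in group:
            if current_end is not None and start <= current_end:
                # Overlap: keep the earliest start, extend to the latest end.
                current_end = max(current_end, end)
            else:
                # Gap: close the previous period (if any) and open a new one.
                if current_end is not None:
                    merged.append((user, activity, current_start, current_end))
                current_start, current_end = start, end
        merged.append((user, activity, current_start, current_end))
    return merged


sample = [
    ("Andy", 3, "2005-06-01", "2008-08-01"),
    ("Andy", 3, "2005-08-01", "2008-11-01"),
]
print(merge_periods(sample))  # [('Andy', 3, '2005-06-01', '2008-11-01')]
```

Because the rows are sorted first, a single pass is enough even when three or more periods chain into each other.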

Here is a solution - (see http://sqlfiddle.com/#!2/fda3d/15)
SELECT DISTINCT summarized.`user`
, summarized.activity_code
, summarized.true_begin
, summarized.true_end
FROM (
SELECT t1.id,t1.`user`,t1.activity_code
, MIN(LEAST(t1.`starting`, COALESCE(overlap.`starting` ,t1.`starting`))) as true_begin
, MAX(GREATEST(t1.`ending`, COALESCE(overlap.`ending` ,t1.`ending`))) as true_end
FROM t1
LEFT JOIN t1 AS overlap
ON t1.`user` = overlap.`user`
AND t1.activity_code = overlap.activity_code
AND overlap.`ending` >= t1.`starting`
AND overlap.`starting` <= t1.`ending`
AND overlap.id <> t1.id
GROUP BY t1.id, t1.`user`, t1.activity_code) AS summarized;
I am not sure how it will perform with a large data set with many overlaps. You will definitely need an index on the user and activity_code fields - probably the starting and ending date fields also as part of that index.

Related

How get conditional count while using group by in mysql?

Mysql newbie here.
I have a table( name:'audit_webservice_aua' ) like this:
+---------+------------------------------------+-----------------+-------------------------+
| auditId | device_code                        | response_status | request_date            |
+---------+------------------------------------+-----------------+-------------------------+
| 10001   | 0007756-gyy66-4c6e-a59d-xxxccyyyt1 | P               | 2020-03-02 00:00:08.785 |
| 10002   | 0007756-gyy66-4c6e-a59d-xxxccyyyt2 | F               | 2020-04-06 00:00:08.785 |
| 10003   | 0007756-gyy66-4c6e-a59d-xxxccyyyt3 | F               | 2020-04-01 00:01:08.785 |
| 10004   | 0007756-gyy66-4c6e-a59d-xxxccyyyt1 | P               | 2020-05-02 00:02:08.785 |
| 10005   | 0007756-gyy66-4c6e-a59d-xxxccyyyt1 | P               | 2020-05-09 00:03:08.785 |
| 10006   | 0007756-gyy66-4c6e-a59d-xxxccyyyt2 | P               | 2020-05-09 01:00:08.785 |
| 10007   | 0007756-gyy66-4c6e-a59d-xxxccyyyt7 | F               | 2020-06-06 02:00:08.785 |
+---------+------------------------------------+-----------------+-------------------------+
Every time a new request is made, the table above stores the requesting device_code, the response_status and the request time.
I need a result set that contains, for each day between two given dates, each device_code with its total_trans, total_successful and total_failure, plus the date.
The query I have written is as follows:
SELECT DATE_FORMAT(aua.request_date,'%b') as month ,
YEAR(aua.request_date) as year,
DATE_FORMAT(aua.request_date,'%Y-%m-%d') as date,
(select count(aua.audit_id) )as total_trans ,
(select count(aua.audit_id) where aua.response_status <> 'P') as total_failure ,
(select count(aua.audit_id) where aua.response_status = 'P') as total_successful ,
aua.device_code as deviceCode
FROM audit_webservice_aua aua where DATE_FORMAT(aua.request_date,'%Y-%m-%d') between '2020-04-16' and '2020-07-17'
group by dates,deviceCode ;
In the above code I'm trying to get results between '2020-03-02' and '2020-06-06', but the count I'm getting is not correct.
Any help would be appreciated.
Thank you in advance.
I think you just want conditional aggregation:
SELECT DATE_FORMAT(aua.request_date,'%b') as month ,
YEAR(aua.request_date) as year,
DATE_FORMAT(aua.request_date, '%Y-%m-%d') as date,
COUNT(aua.audit_id) as total_trans ,
SUM(aua.response_status <> 'P') as total_failure,
SUM(aua.response_status = 'P') as total_successful,
aua.device_code as deviceCode
FROM audit_webservice_aua aua
WHERE DATE_FORMAT(aua.request_date, '%Y-%m-%d') between '2020-04-16' and '2020-07-17'
GROUP BY month, year, date, deviceCode ;
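I would also advise you to change the WHERE clause so that the column is compared directly (which lets an index on request_date be used) and the upper bound is exclusive:
WHERE aua.request_date >= '2020-04-16' AND
aua.request_date < '2020-07-18'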
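If it helps to see the idea away from SQL, here is a small Python sketch of conditional aggregation combined with a half-open date range. The function name daily_counts, the tuple layout and the shortened device codes are assumptions of mine, not anything from the table definition.

```python
from collections import defaultdict
from datetime import datetime


def daily_counts(rows, start, end_exclusive):
    """Count total, successful and failed requests per (day, device).

    rows: iterable of (device_code, response_status, request_datetime).
    Only rows with start <= request_datetime < end_exclusive are counted.
    """
    counts = defaultdict(lambda: {"total": 0, "successful": 0, "failure": 0})
    for device_code, status, requested_at in rows:
        if not (start <= requested_at < end_exclusive):
            continue
        bucket = counts[(requested_at.date(), device_code)]
        bucket["total"] += 1
        # One pass over the data: each row adds to exactly one of the two counters.
        bucket["successful" if status == "P" else "failure"] += 1
    return dict(counts)


rows = [
    ("t1", "P", datetime(2020, 5, 9, 0, 3, 8)),
    ("t1", "F", datetime(2020, 5, 9, 7, 0, 0)),
    ("t2", "P", datetime(2020, 7, 17, 23, 59, 0)),
]
for key, value in daily_counts(rows, datetime(2020, 4, 16), datetime(2020, 7, 18)).items():
    print(key, value)
```

The exclusive upper bound is what keeps the request made late on 2020-07-17 in the result without any date formatting.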

MYSQL query winning streak including score

I have a query-generated table that counts up the winning streak as long as the player keeps winning. When the player gets a positive score, the streak rises by 1; if he gets a negative score, the streak falls back to 0. The table looks like this:
+--------+------------------+--------+--------+
| player | timestamp | points | streak |
+--------+------------------+--------+--------+
| John | 22/11/2012 23:01 | -2 | 0 |
| John | 22/11/2012 23:02 | 3 | 1 |
| John | 22/11/2012 23:04 | 5 | 2 |
| John | 22/11/2012 23:05 | -2 | 0 |
| John | 22/11/2012 23:18 | 15 | 1 |
| John | 23/11/2012 23:20 | 5 | 2 |
| Chris | 27/11/2012 22:12 | 20 | 1 |
| Chris | 27/11/2012 22:14 | -12 | 0 |
| Chris | 27/11/2012 22:17 | 4 | 1 |
| Chris | 27/11/2012 22:18 | -4 | 0 |
| Chris | 27/11/2012 22:20 | 10 | 1 |
| Chris | 27/11/2012 22:21 | 20 | 2 |
| Chris | 27/11/2012 22:22 | 90 | 3 |
+--------+------------------+--------+--------+
I would like to get each player's maximum streak, which is easy to get of course, but I would also like to include the points that the player scored in that particular streak. So, for the above example the result would have to look like this:
+--------+--------+-----------+
| player | points | maxstreak |
+--------+--------+-----------+
| John | 20 | 2 |
| Chris | 120 | 3 |
+--------+--------+-----------+
Any ideas on how I could achieve this? Thanks in advance!
I have not had a chance to actually try this, but it SHOULD work using MySQL variables...
The innermost query just reads from your scores table and forces the data into player and timestamp order. From that, I process the rows sequentially with MySQL variables. On each new record, if I am on a different "Player" (which should ACTUALLY be based on an ID instead of a name), I reset the streak, points, maxStreak and maxStreakPoints, THEN set the last player to whoever is about to be processed.
Immediately after that, I check the streak status, points, etc.
Once everything has been tabulated, the OUTERMOST query gets each player's highest max streak and max streak points.
SELECT
  Final.Player,
  MAX( Final.MaxStreak ) MaxStreak,
  MAX( Final.MaxStreakPoints ) MaxStreakPoints
FROM
  (
  SELECT
    PreOrd.Player,
    PreOrd.TimeStamp,
    PreOrd.Points,
    @nStreak := case when PreOrd.Points < 0 then 0
                     when PreOrd.Player = @cLastPlayer then @nStreak +1
                     else 1 end Streak,
    @nStreakPoints := case when @nStreak = 1 then PreOrd.Points
                           when @nStreak > 1 then @nStreakPoints + PreOrd.Points
                           else 0 end StreakPoints,
    @nMaxStreak := case when PreOrd.Player != @cLastPlayer then @nStreak
                        when @nStreak > @nMaxStreak then @nStreak
                        else @nMaxStreak end MaxStreak,
    @nMaxStreakPoints := case when PreOrd.Player != @cLastPlayer then @nStreakPoints
                              when @nStreak >= @nMaxStreak and @nStreakPoints > @nMaxStreakPoints then @nStreakPoints
                              else @nMaxStreakPoints end MaxStreakPoints,
    @cLastPlayer := PreOrd.Player PlayerChange
  FROM
    ( select
        S.Player,
        S.TimeStamp,
        S.Points
      from
        Scores2 S
      ORDER BY
        S.Player,
        S.TimeStamp,
        S.`index` ) PreOrd,
    ( select
        @nStreak := 0,
        @nStreakPoints := 0,
        @nMaxStreak := 0,
        @nMaxStreakPoints := 0,
        @cLastPlayer := '~' ) SQLVars
  ) as Final
group by
  Final.Player
Now, this could give false max streak points. For example, if a player scores 90 points on a single score and later has a streak of 3 with 10 points each (30 in total), the 90 could be reported instead of the 30. Still thinking on that though... :)
Here's what I get when I add the index column you made available in the data supplied:
SQL Fiddle Showing my solution...
My recommendation is to store additional information when you calculate the streak. For instance, you could store the time stamp when the streak began.
A less-serious recommendation is to switch to another database, that supports window functions. This would be much easier.
The approach is to find when the streak began and then sum up everything between that time and the max streak. To do this, we'll use a correlated subquery:
select t.*,
       (select max(timestamp) from t t2 where t2.timestamp <= t.timestamp and t2.player = t.player and t2.streak = 0
       ) as StreakStartTimeStamp
from t
where t.streak = (select max(streak) from t t2 where t.player = t2.player)
Now, we will embed this query as a subquery, so we can add the appropriate times:
select t.player,
       sum(t.points)
from t join
     (select t.*,
             (select max(timestamp) from t t2 where t2.timestamp <= t.timestamp and t2.player = t.player and t2.streak = 0
             ) as StreakStartTimeStamp
      from t
      where t.streak = (select max(streak) from t t2 where t.player = t2.player)
     ) s
     on t.player = s.player
    -- only the rows after the streak began, up to the row holding the max streak
    and t.timestamp > s.StreakStartTimeStamp
    and t.timestamp <= s.timestamp
group by t.player
I haven't tested this query, so there are probably some syntax errors. However, the approach should work. You may want to have indexes on the table, on streak and timestamp for performance reasons.
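To check the expected numbers independently of either query, here is a rough Python sketch that walks the scores in time order and remembers, per player, the longest streak together with the points scored in it. The function name best_streaks is made up for illustration, and two things are assumptions on my part: a score of 0 ends a streak, and when two streaks have the same length the one with more points wins.

```python
def best_streaks(scores):
    """Return {player: (max_streak, points_in_that_streak)}.

    scores: iterable of (player, points), in time order for each player.
    """
    current = {}  # player -> (length, points) of the streak in progress
    best = {}     # player -> (length, points) of the best streak so far
    for player, points in scores:
        length, total = current.get(player, (0, 0))
        if points > 0:
            length, total = length + 1, total + points
        else:
            length, total = 0, 0  # a non-positive score resets the streak
        current[player] = (length, total)
        # Tuple comparison: a longer streak wins, ties go to the higher points.
        if length > 0 and (length, total) > best.get(player, (0, 0)):
            best[player] = (length, total)
    return best


john = [("John", p) for p in (-2, 3, 5, -2, 15, 5)]
print(best_streaks(john))  # {'John': (2, 20)}
```

Keeping the streak length and its points together in one tuple avoids taking the maximum of each column separately, which is where unrelated values can get mixed.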

Optimize nested query to single query

I have a (MySQL) table containing dates of the last scan of hosts combined with a report ID:
+--------------+---------------------+--------+
| host | last_scan | report |
+--------------+---------------------+--------+
| 112.86.115.0 | 2012-01-03 01:39:30 | 4 |
| 112.86.115.1 | 2012-01-03 01:39:30 | 4 |
| 112.86.115.2 | 2012-01-03 02:03:40 | 4 |
| 112.86.115.2 | 2012-01-03 04:33:47 | 5 |
| 112.86.115.1 | 2012-01-03 04:20:23 | 5 |
| 112.86.115.6 | 2012-01-03 04:20:23 | 5 |
| 112.86.115.2 | 2012-01-05 04:29:46 | 8 |
| 112.86.115.6 | 2012-01-05 04:17:35 | 8 |
| 112.86.115.5 | 2012-01-05 04:29:48 | 8 |
| 112.86.115.4 | 2012-01-05 04:17:37 | 8 |
+--------------+---------------------+--------+
I want to select a list of all hosts with the date of the last scan and the corresponding report id. I have built the following nested query, but I am sure it can be done in a single query:
SELECT rh.host, rh.report, rh.last_scan
FROM report_hosts rh
WHERE rh.report = (
SELECT rh2.report
FROM report_hosts rh2
WHERE rh2.host = rh.host
ORDER BY rh2.last_scan DESC
LIMIT 1
)
GROUP BY rh.host
Is it possible to do this with a single, non-nested query?
No, but you can do a JOIN in your query
SELECT x.*
FROM report_hosts x
INNER JOIN (
SELECT host,MAX(last_scan) AS last_scan FROM report_hosts GROUP BY host
) y ON x.host=y.host AND x.last_scan=y.last_scan
Your query does a filesort, which is very inefficient. My solution doesn't. It is highly advisable to create an index on this table:
ALTER TABLE `report_hosts` ADD INDEX ( `host` , `last_scan` ) ;
Otherwise your query will do a filesort twice.
If you want to select from the report_hosts table only once then you could use a sort of 'RANK OVER PARTITION' method (available in Oracle but not, sadly, in MySQL). Something like this should work:
select h.host, h.last_scan as most_recent_scan, h.report
from
(
  select rh.*,
         -- `rank` is quoted because RANK is a reserved word in MySQL 8.0
         case when @curHost != rh.host then @rank := 1 else @rank := @rank+1 end as `rank`,
         case when @curHost != rh.host then @curHost := rh.host end
  from report_hosts rh
  -- initialise with non-NULL values so the first comparison is not NULL
  cross join (select @rank := 0, @curHost := '') t
  order by host asc, last_scan desc
) h
where h.`rank` = 1;
Granted it is still nested but it does avoid the 'double select' problem. Not sure if it will be more efficient or not - kinda depends what indexes you have and volume of data.
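For comparison, the single-pass "keep the newest row per host" idea looks like this in Python. This is only a sketch: latest_scan_per_host is a hypothetical name, and it assumes the timestamps are 'YYYY-MM-DD HH:MM:SS' strings, which sort chronologically as plain text.

```python
def latest_scan_per_host(rows):
    """Return {host: (last_scan, report)} for the most recent scan of each host.

    rows: iterable of (host, last_scan, report).
    """
    latest = {}
    for host, last_scan, report in rows:
        # Keep a row only if the host is new or this scan is more recent.
        if host not in latest or last_scan > latest[host][0]:
            latest[host] = (last_scan, report)
    return latest


rows = [
    ("112.86.115.2", "2012-01-03 02:03:40", 4),
    ("112.86.115.2", "2012-01-05 04:29:46", 8),
    ("112.86.115.2", "2012-01-03 04:33:47", 5),
]
print(latest_scan_per_host(rows))  # {'112.86.115.2': ('2012-01-05 04:29:46', 8)}
```

Unlike the ranked subquery, this does not need the input to be sorted, because each host only ever keeps its newest row.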

Sorting some rows by average with SQL

All right, so here's a challenge for all you SQL pros:
I have a table with two columns of interest, group and birthdate. Only some rows have a group assigned to them.
I now want to print all rows sorted by birthdate, but I also want all rows with the same group to end up next to each other. The only semi-sensible way of doing this would be to use the groups' average birthdates for all the rows in the group when sorting. The question is, can this be done with pure SQL (MySQL in this instance), or will some scripting logic be required?
To illustrate, with the given table:
id | group | birthdate
---+-------+-----------
1 | 1 | 1989-12-07
2 | NULL | 1990-03-14
3 | 1 | 1987-05-25
4 | NULL | 1985-09-29
5 | NULL | 1988-11-11
and let's say that the "average" of 1987-05-25 and 1989-12-07 is 1988-08-30 (this can be found by averaging the UNIX timestamp equivalents of the dates and then converting back to a date. This average doesn't have to be completely correct!).
The output should then be:
id | group | birthdate | [sort_by_birthdate]
---+-------+------------+--------------------
4 | NULL | 1985-09-29 | 1985-09-29
3 | 1 | 1987-05-25 | 1988-08-30
1 | 1 | 1989-12-07 | 1988-08-30
5 | NULL | 1988-11-11 | 1988-11-11
2 | NULL | 1990-03-14 | 1990-03-14
Any ideas?
Cheers,
Jon
I normally program in T-SQL, so please forgive me if I don't translate the date functions perfectly to MySQL:
SELECT
    T.id,
    T.`group`
FROM
    Some_Table T
LEFT OUTER JOIN (
    SELECT
        `group`,
        -- MySQL's DATEDIFF(a, b) returns a - b in days, so birthdate comes first
        '1970-01-01' +
            INTERVAL ROUND(AVG(DATEDIFF(birthdate, '1970-01-01'))) DAY AS avg_birthdate
    FROM
        Some_Table T2
    GROUP BY
        `group`
) SQ ON SQ.`group` = T.`group`
ORDER BY
    COALESCE(SQ.avg_birthdate, T.birthdate),
    T.`group`
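To double-check the ordering from the question, here is a short Python sketch of the same sort: rows that belong to a group are sorted by the group's average birthdate, all other rows by their own birthdate. The name sort_by_group_average is hypothetical, and averaging day ordinals with integer division is my own simplification (the question says the average does not have to be exact).

```python
from collections import defaultdict
from datetime import date


def sort_by_group_average(rows):
    """Sort (id, group, birthdate) rows, keeping rows of one group together."""
    ordinals = defaultdict(list)
    for _id, group, birthdate in rows:
        if group is not None:
            ordinals[group].append(birthdate.toordinal())
    # Average birthdate per group, rounded down to a whole day.
    average = {g: date.fromordinal(sum(v) // len(v)) for g, v in ordinals.items()}

    def key(row):
        _id, group, birthdate = row
        sort_date = average.get(group, birthdate)
        # Tie-breakers keep a group contiguous and ordered by birthdate inside.
        return (sort_date, group is None, 0 if group is None else group, birthdate)

    return sorted(rows, key=key)


rows = [
    (1, 1, date(1989, 12, 7)),
    (2, None, date(1990, 3, 14)),
    (3, 1, date(1987, 5, 25)),
    (4, None, date(1985, 9, 29)),
    (5, None, date(1988, 11, 11)),
]
print([row[0] for row in sort_by_group_average(rows)])  # [4, 3, 1, 5, 2]
```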

MySQL: group by consecutive days and count groups

I have a database table which holds each user's checkins in cities. I need to know how many days a user has been in a city, and then, how many visits a user has made to a city (a visit consists of consecutive days spent in a city).
So, consider I have the following table (simplified, containing only the DATETIMEs - same user and city):
datetime
-------------------
2011-06-30 12:11:46
2011-07-01 13:16:34
2011-07-01 15:22:45
2011-07-01 22:35:00
2011-07-02 13:45:12
2011-08-01 00:11:45
2011-08-05 17:14:34
2011-08-05 18:11:46
2011-08-06 20:22:12
The number of days this user has been to this city would be 6 (30.06, 01.07, 02.07, 01.08, 05.08, 06.08).
I thought of doing this using SELECT COUNT(id) FROM table GROUP BY DATE(datetime)
Then, for the number of visits this user has made to this city, the query should return 3 (30.06-02.07, 01.08, 05.08-06.08).
The problem is that I have no idea how to build this query.
Any help would be highly appreciated!
You can find the first day of each visit by finding checkins where there was no checkin the day before.
select count(distinct date(start_of_visit.datetime))
from checkin start_of_visit
left join checkin previous_day
on start_of_visit.user = previous_day.user
and start_of_visit.city = previous_day.city
and date(start_of_visit.datetime) - interval 1 day = date(previous_day.datetime)
where previous_day.id is null
There are several important parts to this query.
First, each checkin is joined to any checkin from the previous day. But since it's an outer join, if there was no checkin the previous day the right side of the join will have NULL results. The WHERE filtering happens after the join, so it keeps only those checkins from the left side where there are none from the right side. LEFT OUTER JOIN/WHERE IS NULL is really handy for finding where things aren't.
Then it counts distinct checkin dates to make sure it doesn't double-count if the user checked in multiple times on the first day of the visit. (I actually added that part on edit, when I spotted the possible error.)
Edit: I just re-read your proposed query for the first question. Your query would get you the number of checkins on a given date, instead of a count of dates. I think you want something like this instead:
select count(distinct date(datetime))
from checkin
where user='some user' and city='some city'
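The same "a visit starts on a day with no checkin the day before" rule is easy to verify in Python. This is just a sketch for one user and one city; days_and_visits is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def days_and_visits(checkins):
    """Return (number_of_days, number_of_visits) for a list of datetimes."""
    days = {c.date() for c in checkins}
    # A visit starts on every day whose previous calendar day has no checkin.
    visit_starts = {d for d in days if d - timedelta(days=1) not in days}
    return len(days), len(visit_starts)


checkins = [
    datetime(2011, 6, 30, 12, 11, 46),
    datetime(2011, 7, 1, 13, 16, 34),
    datetime(2011, 7, 1, 15, 22, 45),
    datetime(2011, 7, 1, 22, 35, 0),
    datetime(2011, 7, 2, 13, 45, 12),
    datetime(2011, 8, 1, 0, 11, 45),
    datetime(2011, 8, 5, 17, 14, 34),
    datetime(2011, 8, 5, 18, 11, 46),
    datetime(2011, 8, 6, 20, 22, 12),
]
print(days_and_visits(checkins))  # (6, 3)
```

Working on the set of distinct dates plays the same role as COUNT(DISTINCT ...) in the queries: several checkins on one day count once.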
Try to apply this code to your task -
CREATE TABLE visits(
user_id INT(11) NOT NULL,
dt DATETIME DEFAULT NULL
);
INSERT INTO visits VALUES
(1, '2011-06-30 12:11:46'),
(1, '2011-07-01 13:16:34'),
(1, '2011-07-01 15:22:45'),
(1, '2011-07-01 22:35:00'),
(1, '2011-07-02 13:45:12'),
(1, '2011-08-01 00:11:45'),
(1, '2011-08-05 17:14:34'),
(1, '2011-08-05 18:11:46'),
(1, '2011-08-06 20:22:12'),
(2, '2011-08-30 16:13:34'),
(2, '2011-08-31 16:13:41');
SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;
SELECT v.user_id,
  COUNT(DISTINCT(DATE(dt))) number_of_days,
  MAX(days) number_of_visits
FROM
  (SELECT user_id, dt,
    @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
    @last_dt := DATE(dt),
    @last_user := user_id
  FROM
    visits
  ORDER BY
    user_id, dt
  ) v
GROUP BY
  v.user_id;
----------------
Output:
+---------+----------------+------------------+
| user_id | number_of_days | number_of_visits |
+---------+----------------+------------------+
| 1 | 6 | 3 |
| 2 | 2 | 1 |
+---------+----------------+------------------+
Explanation:
To understand how it works let's check the subquery, here it is.
SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;
SELECT user_id, dt,
  @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
  @last_dt := DATE(dt) lt,
  @last_user := user_id lu
FROM
  visits
ORDER BY
  user_id, dt;
As you can see, the query returns all rows and ranks them by visit. This is a well-known ranking method based on variables; note that the rows are ordered by the user and date fields. The query numbers each user's visits and outputs the following data set, where the days column holds the visit number -
+---------+---------------------+------+------------+----+
| user_id | dt | days | lt | lu |
+---------+---------------------+------+------------+----+
| 1 | 2011-06-30 12:11:46 | 1 | 2011-06-30 | 1 |
| 1 | 2011-07-01 13:16:34 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-01 15:22:45 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-01 22:35:00 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-02 13:45:12 | 1 | 2011-07-02 | 1 |
| 1 | 2011-08-01 00:11:45 | 2 | 2011-08-01 | 1 |
| 1 | 2011-08-05 17:14:34 | 3 | 2011-08-05 | 1 |
| 1 | 2011-08-05 18:11:46 | 3 | 2011-08-05 | 1 |
| 1 | 2011-08-06 20:22:12 | 3 | 2011-08-06 | 1 |
| 2 | 2011-08-30 16:13:34 | 1 | 2011-08-30 | 2 |
| 2 | 2011-08-31 16:13:41 | 1 | 2011-08-31 | 2 |
+---------+---------------------+------+------------+----+
Then we group this data set by user and use aggregate functions:
'COUNT(DISTINCT(DATE(dt)))' - counts the number of days
'MAX(days)' - the number of visits, it is a maximum value for the days field from our subquery.
That is all;)
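If the variable juggling is hard to follow, the same row-by-row scan can be written as a Python sketch: remember the last day seen for each user and start a new visit whenever the gap is more than one day. The name visits_per_user is an assumption of mine, not part of the query.

```python
from datetime import datetime


def visits_per_user(rows):
    """Return {user_id: (number_of_days, number_of_visits)}.

    rows: iterable of (user_id, datetime); they are sorted here by user and time.
    """
    stats = {}
    last_day = {}
    for user_id, dt in sorted(rows):
        day = dt.date()
        days, visits = stats.get(user_id, (0, 0))
        previous = last_day.get(user_id)
        if previous is None:
            days, visits = 1, 1  # first checkin of this user
        elif day != previous:
            days += 1
            if (day - previous).days > 1:
                visits += 1  # a gap of more than one day starts a new visit
        stats[user_id] = (days, visits)
        last_day[user_id] = day
    return stats


rows = [
    (2, datetime(2011, 8, 30, 16, 13, 34)),
    (2, datetime(2011, 8, 31, 16, 13, 41)),
]
print(visits_per_user(rows))  # {2: (2, 1)}
```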
Using the sample data provided by Devart, the inner "PreQuery" works with SQL variables. Because @LUser defaults to -1 (a presumably non-existent user ID), the IF() test detects any difference between the last user and the current one. As soon as a new user appears, NextVisit gets a value of 1. Likewise, if the last date is more than 1 day before the new check-in date, it gets a value of 1. The subsequent columns then reset @LUser and @LDate to the values of the record just tested, ready for the next row. Finally, the outer query sums the flags and counts the rows, giving these results for the Devart data set:
User ID Distinct Visits Total Days
1 3 9
2 1 2
select PreQuery.User_ID,
       sum( PreQuery.NextVisit ) as DistinctVisits,
       count(*) as TotalDays
from
  ( select v.user_id,
           if( @LUser <> v.User_ID OR @LDate < ( date( v.dt ) - Interval 1 day ), 1, 0 ) as NextVisit,
           @LUser := v.user_id,
           @LDate := date( v.dt )
    from
      Visits v,
      ( select @LUser := -1, @LDate := date(now()) ) AtVars
    order by
      v.user_id,
      v.dt ) PreQuery
group by
  PreQuery.User_ID
For the first sub-task (counting the days):
select count(*)
from (
select TO_DAYS(p.d)
from p
group by TO_DAYS(p.d)
) t
I think you should consider changing the database structure. You could add a visits table and a visit_id column to your checkins table. Each time you register a new checkin, check whether there is a checkin from the previous day. If there is, add the new checkin with the visit_id of yesterday's checkin. If not, add a new visit to visits and add the checkin with the new visit_id.
Then you could get your data in one query with something like this:
SELECT COUNT(DISTINCT DATE(datetime)) AS number_of_days, COUNT(DISTINCT visit_id) number_of_visits FROM checkin GROUP BY user, city
It's not optimal, but it is still better than working with the current structure, and it will work. If the results can come from separate queries, it will also be very fast.
The drawbacks, of course, are that you need to change the database structure, do some more scripting, and convert the current data to the new structure (i.e. add visit_id to the existing rows).