Calculating average time between dates in SQL - mysql

Using MySQL, I'm trying to figure out how to answer the question: What is the average number of months between users creating their Nth project?
Expected result:
| project count | Average # months |
|---------------|------------------|
| 1 | 0 | # On average, it took 0 months to create the first project (nothing to compare to)
| 2 | 12 | # On average, it takes a user 12 months to create their second project
| 3 | 3 | # On average, it takes a user 3 months to create their third project
My MySQL table represents projects created by users. The table can be summarized as:
| user_id | project created at |
|---------|--------------------|
| 1 | Jan 1, 2020 1:00 pm|
| 1 | Feb 2, 2020 3:45 am|
| 1 | Nov 6, 2020 0:01 am|
| 1 | Mar 4, 2021 5:01 pm|
|------------------------------|
| 2 | Another timestamp |
| 2 | Another timestamp |
| 2 | Another timestamp |
| 2 | Another timestamp |
| 2 | Another timestamp |
| 2 | Another timestamp |
|------------------------------|
| ... | Another timestamp |
| ... | Another timestamp |
Some users will have one project while some may have hundreds.
Edit: Current Implementation
with
paid_self_serve_projects_presentation as (
    select
        `Paid Projects`.owner_email as `Owner Email`,
        row_number() over (partition by `Paid Projects`.owner_uuid order by created_at) as `Project Count`,
        day(`Paid Projects`.created_at) as `Created Day`,
        month(`Paid Projects`.created_at) as `Created Month`,
        year(`Paid Projects`.created_at) as `Created Year`,
        `Paid Projects`.created_at as `Created`
    from self_service_paid_projects as `Paid Projects`
    order by `Paid Projects`.owner_uuid, `Paid Projects`.created_at
)
select `Projects`.* from paid_self_serve_projects_presentation as `Projects`

You can use window functions. I am thinking row_number() to enumerate the projects of each user ordered by creation date, and lag() to get the date when the previous project was created:
select rn, avg(datediff(created_at, lag_created_at)) avg_diff_days
from (
    select t.*,
        row_number() over(partition by user_id order by created_at) rn,
        lag(created_at, 1, created_at) over(partition by user_id order by created_at) lag_created_at
    from mytable t
) t
group by rn
This gives you the average difference in days, which is somewhat more accurate than months. If you really want months, use timestampdiff(month, lag_created_at, created_at) instead of datediff(), but be aware that the function returns an integer value, so there is a loss of precision.
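If you want to sanity-check the query on a small sample, here is a minimal Python sketch of the same idea: number each user's projects by creation date, take the gap in whole days to the previous project (0 for the first one, like the lag() default), and average the gaps per project number. The function name and the rows for user 2 are made up for illustration; only user 1's first three timestamps come from the question.

```python
from collections import defaultdict
from datetime import datetime


def average_gap_days(rows):
    """rows: iterable of (user_id, created_at). Returns {project_number: avg_days}."""
    by_user = defaultdict(list)
    for user_id, created_at in rows:
        by_user[user_id].append(created_at)

    gaps = defaultdict(list)  # project number -> list of gaps in days
    for dates in by_user.values():
        dates.sort()
        for n, created_at in enumerate(dates, start=1):
            # Same role as lag(created_at, 1, created_at): first project compares to itself.
            previous = dates[n - 2] if n > 1 else created_at
            # Same role as datediff(): difference between the calendar dates.
            gaps[n].append((created_at.date() - previous.date()).days)
    return {n: sum(v) / len(v) for n, v in sorted(gaps.items())}


rows = [
    (1, datetime(2020, 1, 1, 13, 0)),
    (1, datetime(2020, 2, 2, 3, 45)),
    (1, datetime(2020, 11, 6, 0, 1)),
    (2, datetime(2020, 3, 1, 9, 0)),   # hypothetical user
    (2, datetime(2020, 3, 11, 9, 0)),  # hypothetical user
]
result = average_gap_days(rows)
print(result)
```

Dividing the day averages by roughly 30.44 gives an approximate number of months without the integer truncation of timestampdiff().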

Related

How to get daily active users considering previous subscription days?

I am using MySQL to do some data analysis on subscribers, and I would like to work out the number of daily active subscribers since the service launched.
I have a subscription table like the one below:
id | subscriptiondate | unsubscriptiondate
---|------------------|--------------------
1  | 2020-02-12       | null
2  | 2020-03-20       | 2020-04-01
3  | 2020-03-10       | null
4  | 2020-04-02       | null
and I expect a result like:
date       | active_user
-----------|------------
2020-02-12 | 1
2020-03-10 | 2
2020-03-20 | 3
2020-04-02 | 3
One subscriber opted out on 2020-04-01, which is why there are 3 active subscribers on 2020-04-02.
Here is my SQL script. Could someone check it and help me achieve my goal?
SELECT
    COUNT(distinct id) AS active_user,
    date(subscriptiondate) as day
FROM
    subscriptions
WHERE
    subscriptiondate in (select subscriptiondate from subscriptions where subscriptiondate <= date(subscriptiondate))
    AND (unsubscriptiondate is NULL or unsubscriptiondate > date(subscriptiondate))
GROUP BY
    day
ORDER BY day ASC
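You can "unpivot" the table and aggregate with a cumulative sum. Each subscription counts as +1 on its start date and -1 on its end date; rows that have no unsubscription date are skipped in the second branch so they do not produce a NULL date:
select date, sum(inc) as change_on_date,
       sum(sum(inc)) over (order by date) as active_on_day
from ((select subscriptiondate as date, 1 as inc from subscriptions
      ) union all
      (select unsubscriptiondate, -1 from subscriptions
       where unsubscriptiondate is not null
      )
     ) s
group by date;
Note that this also returns a row for each unsubscription date (2020-04-01 with 2 active users in the sample data).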
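To see what the cumulative sum is doing, here is a small Python sketch of the same "+1 on subscribe, -1 on unsubscribe" idea, run on the sample rows from the question. The function name is made up, and open-ended subscriptions (None) are simply skipped when recording the -1 events, which is an assumption about how they should be treated.

```python
from collections import Counter
from datetime import date
from itertools import accumulate


def active_by_date(subscriptions):
    """subscriptions: iterable of (subscribed_on, unsubscribed_on or None)."""
    change = Counter()
    for start, end in subscriptions:
        change[start] += 1      # one more active user from this date
        if end is not None:
            change[end] -= 1    # one fewer active user from this date
    days = sorted(change)
    running = accumulate(change[d] for d in days)  # cumulative sum over dates
    return list(zip(days, running))


subs = [
    (date(2020, 2, 12), None),
    (date(2020, 3, 20), date(2020, 4, 1)),
    (date(2020, 3, 10), None),
    (date(2020, 4, 2), None),
]
timeline = active_by_date(subs)
for day, active in timeline:
    print(day, active)
```

The timeline contains one entry per date on which the count changes, so the unsubscription date shows up as its own row.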

MYSQL - Finding row delta using date rows that include holidays

I have a table (call it SAMPLE) with a date field (call it date) and a field holding a cumulative running total (call it X).
Note: the DATE field does not include weekends or holidays.
I can find the day-to-day delta by taking any value of X and subtracting the value from the row above it.
Here's my current query:
select
    a.date,
    a.X - b.X as 'Daily Total'
from SAMPLE as a
left join SAMPLE as b
    on b.date = if(weekday(a.date) = 0, a.date - interval 3 day, a.date - interval 1 day);
The problem is that this works only until I hit a holiday. If the previous working day was a holiday, the result is NULL because no row exists for a.date - interval 1 day (or a.date - interval 3 day on a Monday). What's the best way to handle holidays?
These are the current results:
+------------+---------------+
| date | X |
+------------+---------------+
| 2018-03-26 | -40105.00 |
| 2018-03-27 | 28470.00 |
| 2018-03-28 | 5265.00 |
| 2018-03-29 | -23010.00 |
| 2018-04-02 | NULL |
| 2018-04-03 | -24830.00 |
| 2018-04-04 | -21970.00 |
| 2018-04-05 | -9620.00 |
| 2018-04-06 | 36465.00 |
Thanks in advance!!
Sort the table by date and assign each row a sequence number from 1 to n. Then subtract the previous row's value from the current row's value. The first row has no predecessor, so it simply returns its own X.
select rnk2.`date`,
       case when rnk1.r1 = 1 and rnk2.r2 = 1 then rnk1.X else rnk2.X - rnk1.X end as 'Daily Total'
from (
    select `date`, X, @r1 := @r1 + 1 as r1
    from samples, (select @r1 := 0) a
    order by `date`) rnk1
inner join (
    select `date`, X, @r2 := @r2 + 1 as r2
    from samples, (select @r2 := 0) b
    order by `date`) rnk2
    on (rnk1.r1 = 1 and rnk2.r2 = 1) or (rnk1.r1 + 1 = rnk2.r2)
order by rnk2.`date`
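The point of numbering the rows is that each row is compared with the previous row that actually exists, not with the previous calendar day, so missing weekends and holidays no longer matter. A rough Python sketch of that logic follows; the function name and the cumulative totals are invented for illustration (the question only shows the deltas), with 2018-03-30 left out as the holiday.

```python
from datetime import date


def daily_deltas(rows):
    """rows: iterable of (day, cumulative_total). Returns [(day, delta)] in date order."""
    deltas = []
    previous_total = None
    for day, total in sorted(rows):
        # First row has no predecessor, so it keeps its own value.
        delta = total if previous_total is None else total - previous_total
        deltas.append((day, delta))
        previous_total = total
    return deltas


sample = [
    (date(2018, 3, 29), 100),  # Thursday (hypothetical totals)
    (date(2018, 4, 2), 130),   # Monday; Friday 2018-03-30 is missing
    (date(2018, 4, 3), 125),
]
deltas = daily_deltas(sample)
print(deltas)
```

On MySQL 8.0 or later, a window function such as lag(X) over (order by date) would express the same "previous existing row" comparison without user variables.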

Obtain a list of month and records from the current date(month) in two different years

I have the following set of records in a table in a MySQL database:
Date | Number_of_leaves
10th-December-2015 | 10 leaves
6th-August-2015 | 10 leaves
15th-September-2015 | 14 leaves
15th-January-2016: | 100 leaves
7th-November-2015: | 4 leaves
9th-October -2015: | 200 leaves
How can I return a list of months and their records for just the past 4 months, counting backwards from Jan-2016? In other words, I need a result for the past 4 months, including the current one, like this:
January 2016 | 100 leaves
December 2015 | 10 leaves
November 2015 | 4 leaves
October 2015 | 200 leaves
This is the kind of result I need: each month with its year and the number of leaves collected in that month.
Schema
create table xyz
( id int auto_increment primary key,
theDate date not null,
leaves int not null
);
-- truncate table xyz;
insert xyz(theDate,leaves) values
('2016-04-10',444510),
('2016-02-10',55510),
('2015-12-10',10),
('2015-08-06',10),
('2015-09-15',14),
('2016-01-15',100),
('2015-11-07',4),
('2015-10-09',200);
Query 1
select month(theDate) as m,
year(theDate) as y,
sum(leaves) as leaves
from xyz
where theDate<='2016-02-01'
group by month(theDate),year(theDate)
order by theDate desc
limit 4;
or
Query 2
select concat(monthname(theDate),' ',year(theDate)) as 'Month/Year',
sum(leaves) as leaves
from xyz
where theDate<='2016-02-01'
group by month(theDate),year(theDate)
order by theDate desc
limit 4;
+---------------+--------+
| Month/Year | leaves |
+---------------+--------+
| January 2016 | 100 |
| December 2015 | 10 |
| November 2015 | 4 |
| October 2015 | 200 |
+---------------+--------+
Here `op` stands for your table name. First use str_to_date to convert the string to a date. The nested IFNULL calls are needed because your dates come in different formats:
select * FROM (
SELECT *,
IFNULL(
IFNULL(
str_to_date(Date,'%D-%b-%Y'),str_to_date(Date,'%d-%M-%Y')) ,
str_to_date(Date,'%D-%M-%Y')
)
f_date
FROM `op`
order by number_of_leaves DESC,f_date ASC
) tab
group by month(tab.f_date) LIMIT 5

MySQL grouping by date range with multiple joins

I currently have quite a messy query, which joins data from multiple tables involving two subqueries. I now have a requirement to group this data by DAY(), WEEK(), MONTH(), and QUARTER().
I have three tables: days, qos and employees. An employee is self-explanatory, a day is a summary of an employee's performance on a given day, and qos is a random quality inspection, which can be performed many times a day.
At the moment, I am selecting all employees, and LEFT JOINing day and qos, which works well. However, I now need to group the data in order to break down a team's or individual's performance over a date range.
Taking this data:
Employee
id | name
------------------
1 | Bob Smith
Day
id | employee_id | day_date | calls_taken
---------------------------------------------
1 | 1 | 2011-03-01 | 41
2 | 1 | 2011-03-02 | 24
3 | 1 | 2011-04-01 | 35
Qos
id | employee_id | qos_date | score
----------------------------------------
1 | 1 | 2011-03-03 | 85
2 | 1 | 2011-03-03 | 95
3 | 1 | 2011-04-01 | 91
If I were to start by grouping by DAY(), I would need to see the following results:
Day__date | Day__Employee__id | Day__calls | Day__qos_score
------------------------------------------------------------
2011-03-01 | 1 | 41 | NULL
2011-03-02 | 1 | 24 | NULL
2011-03-03 | 1 | NULL | 90
2011-04-01 | 1 | 35 | 91
As you see, Day__calls should be SUM(calls_taken) and Day__qos_score is AVG(score). I've tried using a similar method as above, but as the date isn't known until one of the tables has been joined, it's only displaying a record where there's a day saved.
Is there any way of doing this, or am I going about things the wrong way?
Edit: As requested, here's what I've come up with so far. However, it only shows dates where there's a day.
SELECT COALESCE(`day`.day_date, qos.qos_date) AS Day__date,
employee.id AS Day__Employee__id,
`day`.calls_taken AS Day__Day__calls,
qos.score AS Day__Qos__score
FROM faults_employees `employee`
LEFT JOIN (SELECT `day`.employee_id AS employee_id,
SUM(`day`.calls_taken) AS `calls_in`,
FROM faults_days AS `day`
WHERE employee.id = 7
GROUP BY (`day`.day_date)
) AS `day`
ON `day`.employee_id = `employee`.id
LEFT JOIN (SELECT `qos`.employee_id AS employee_id,
AVG(qos.score) AS `score`
FROM faults_qos qos
WHERE employee.id = 7
GROUP BY (qos.qos_date)
) AS `qos`
ON `qos`.employee_id = `employee`.id AND `qos`.qos_date = `day`.day_date
WHERE employee.id = 7
GROUP BY Day__date
ORDER BY `day`.day_date ASC
The solution I'm coming up with looks like:
SELECT
    `date`,
    `employee_id`,
    SUM(`union`.`calls_taken`) AS `calls_taken`,
    AVG(`union`.`score`) AS `score`
FROM ( -- select from union table
    (SELECT -- first select all calls taken, leaving qos_score null
        `day`.`day_date` AS `date`,
        `day`.`employee_id`,
        `day`.`calls_taken`,
        NULL AS `score`
    FROM `employee`
    LEFT JOIN
        `day`
        ON `day`.`employee_id` = `employee`.`id`
    )
    UNION ALL -- union both tables; ALL keeps identical rows that a plain UNION would drop
    (
    SELECT -- now select qos score, leaving calls taken null
        `qos`.`qos_date` AS `date`,
        `qos`.`employee_id`,
        NULL AS `calls_taken`,
        `qos`.`score`
    FROM `employee`
    LEFT JOIN
        `qos`
        ON `qos`.`employee_id` = `employee`.`id`
    )
) `union`
GROUP BY `union`.`date`, `union`.`employee_id` -- group union table by date and employee
For the UNION to work, both SELECTs must return the same columns, so the day branch selects NULL as score and the qos branch selects NULL as calls_taken. Without these placeholders, calls_taken and score would end up in the same column of the union.
After this, the outer query selects the required fields from the union'd table with the aggregate functions SUM() and AVG(), grouping by date.
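As a cross-check of the approach, here is a minimal Python sketch that stacks the two sample tables from the question into one list (padding the missing column with None, the equivalent of NULL) and then aggregates per date and employee. The function name and tuple layout are assumptions made for the example.

```python
from collections import defaultdict

# (employee_id, date, value) rows copied from the sample tables in the question
days = [(1, "2011-03-01", 41), (1, "2011-03-02", 24), (1, "2011-04-01", 35)]
qos = [(1, "2011-03-03", 85), (1, "2011-03-03", 95), (1, "2011-04-01", 91)]


def summarize(days, qos):
    # Stack both sources as (date, employee_id, calls, score), like the UNION.
    stacked = [(d, emp, calls, None) for emp, d, calls in days]
    stacked += [(d, emp, None, score) for emp, d, score in qos]

    groups = defaultdict(lambda: {"calls": [], "scores": []})
    for d, emp, calls, score in stacked:
        if calls is not None:
            groups[(d, emp)]["calls"].append(calls)
        if score is not None:
            groups[(d, emp)]["scores"].append(score)

    result = []
    for (d, emp), g in sorted(groups.items()):
        # Like SUM()/AVG() in SQL: None when the group has no non-NULL values.
        total_calls = sum(g["calls"]) if g["calls"] else None
        avg_score = sum(g["scores"]) / len(g["scores"]) if g["scores"] else None
        result.append((d, emp, total_calls, avg_score))
    return result


summary = summarize(days, qos)
for row in summary:
    print(row)
```

The 2011-03-03 row has no calls but an average score of 90, which matches the result the question asks for.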

MySQL: group by consecutive days and count groups

I have a database table which holds each user's checkins in cities. I need to know how many days a user has been in a city, and then, how many visits a user has made to a city (a visit consists of consecutive days spent in a city).
So, consider I have the following table (simplified, containing only the DATETIMEs - same user and city):
datetime
-------------------
2011-06-30 12:11:46
2011-07-01 13:16:34
2011-07-01 15:22:45
2011-07-01 22:35:00
2011-07-02 13:45:12
2011-08-01 00:11:45
2011-08-05 17:14:34
2011-08-05 18:11:46
2011-08-06 20:22:12
The number of days this user has been to this city would be 6 (30.06, 01.07, 02.07, 01.08, 05.08, 06.08).
I thought of doing this using SELECT COUNT(id) FROM table GROUP BY DATE(datetime)
Then, for the number of visits this user has made to this city, the query should return 3 (30.06-02.07, 01.08, 05.08-06.08).
The problem is that I have no idea how shall I build this query.
Any help would be highly appreciated!
You can find the first day of each visit by finding checkins where there was no checkin the day before.
select count(distinct date(start_of_visit.datetime))
from checkin start_of_visit
left join checkin previous_day
on start_of_visit.user = previous_day.user
and start_of_visit.city = previous_day.city
and date(start_of_visit.datetime) - interval 1 day = date(previous_day.datetime)
where previous_day.id is null
There are several important parts to this query.
First, each checkin is joined to any checkin from the previous day. But since it's an outer join, if there was no checkin the previous day the right side of the join will have NULL results. The WHERE filtering happens after the join, so it keeps only those checkins from the left side where there are none from the right side. LEFT OUTER JOIN/WHERE IS NULL is really handy for finding where things aren't.
Then it counts distinct checkin dates to make sure it doesn't double-count if the user checked in multiple times on the first day of the visit. (I actually added that part on edit, when I spotted the possible error.)
Edit: I just re-read your proposed query for the first question. Your query would get you the number of checkins on a given date, instead of a count of dates. I think you want something like this instead:
select count(distinct date(datetime))
from checkin
where user='some user' and city='some city'
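For comparison, the same two counts can be sketched in a few lines of Python: the number of days is the number of distinct calendar dates, and a visit starts on every date whose previous day has no check-in. The function name is made up; the timestamps are the ones from the question.

```python
from datetime import datetime, timedelta


def days_and_visits(checkins):
    """checkins: datetimes for one user and one city. Returns (days, visits)."""
    day_set = {c.date() for c in checkins}  # distinct calendar dates
    # A visit starts on any day that has no check-in on the day before.
    visits = sum(1 for d in day_set if d - timedelta(days=1) not in day_set)
    return len(day_set), visits


checkins = [
    datetime(2011, 6, 30, 12, 11, 46),
    datetime(2011, 7, 1, 13, 16, 34),
    datetime(2011, 7, 1, 15, 22, 45),
    datetime(2011, 7, 1, 22, 35, 0),
    datetime(2011, 7, 2, 13, 45, 12),
    datetime(2011, 8, 1, 0, 11, 45),
    datetime(2011, 8, 5, 17, 14, 34),
    datetime(2011, 8, 5, 18, 11, 46),
    datetime(2011, 8, 6, 20, 22, 12),
]
counts = days_and_visits(checkins)
print(counts)
```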
Try to apply this code to your task -
CREATE TABLE visits(
user_id INT(11) NOT NULL,
dt DATETIME DEFAULT NULL
);
INSERT INTO visits VALUES
(1, '2011-06-30 12:11:46'),
(1, '2011-07-01 13:16:34'),
(1, '2011-07-01 15:22:45'),
(1, '2011-07-01 22:35:00'),
(1, '2011-07-02 13:45:12'),
(1, '2011-08-01 00:11:45'),
(1, '2011-08-05 17:14:34'),
(1, '2011-08-05 18:11:46'),
(1, '2011-08-06 20:22:12'),
(2, '2011-08-30 16:13:34'),
(2, '2011-08-31 16:13:41');
SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;
SELECT v.user_id,
  COUNT(DISTINCT(DATE(dt))) number_of_days,
  MAX(days) number_of_visits
FROM
  (SELECT user_id, dt,
        @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
        @last_dt := DATE(dt),
        @last_user := user_id
  FROM
    visits
  ORDER BY
    user_id, dt
  ) v
GROUP BY
  v.user_id;
----------------
Output:
+---------+----------------+------------------+
| user_id | number_of_days | number_of_visits |
+---------+----------------+------------------+
| 1 | 6 | 3 |
| 2 | 2 | 1 |
+---------+----------------+------------------+
Explanation:
To understand how it works let's check the subquery, here it is.
SET @i = 0;
SET @last_dt = NULL;
SET @last_user = NULL;
SELECT user_id, dt,
        @i := IF(@last_user IS NULL OR @last_user <> user_id, 1, IF(@last_dt IS NULL OR (DATE(dt) - INTERVAL 1 DAY) > DATE(@last_dt), @i + 1, @i)) AS days,
        @last_dt := DATE(dt) lt,
        @last_user := user_id lu
FROM
  visits
ORDER BY
  user_id, dt;
As you can see, the query returns all rows and ranks them by visit number. This is a well-known ranking method based on user variables; note that the rows are ordered by the user and date fields. The query outputs the following data set, where the days column holds the visit number for each row:
+---------+---------------------+------+------------+----+
| user_id | dt | days | lt | lu |
+---------+---------------------+------+------------+----+
| 1 | 2011-06-30 12:11:46 | 1 | 2011-06-30 | 1 |
| 1 | 2011-07-01 13:16:34 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-01 15:22:45 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-01 22:35:00 | 1 | 2011-07-01 | 1 |
| 1 | 2011-07-02 13:45:12 | 1 | 2011-07-02 | 1 |
| 1 | 2011-08-01 00:11:45 | 2 | 2011-08-01 | 1 |
| 1 | 2011-08-05 17:14:34 | 3 | 2011-08-05 | 1 |
| 1 | 2011-08-05 18:11:46 | 3 | 2011-08-05 | 1 |
| 1 | 2011-08-06 20:22:12 | 3 | 2011-08-06 | 1 |
| 2 | 2011-08-30 16:13:34 | 1 | 2011-08-30 | 2 |
| 2 | 2011-08-31 16:13:41 | 1 | 2011-08-31 | 2 |
+---------+---------------------+------+------------+----+
Then we group this data set by user and use aggregate functions:
'COUNT(DISTINCT(DATE(dt)))' - counts the number of days
'MAX(days)' - the number of visits, it is a maximum value for the days field from our subquery.
That is all;)
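If the variable juggling is hard to follow, the same ranking can be written as an ordinary loop. The Python sketch below is only an illustration of the logic (the function name is made up): walk the rows in user and date order, restart the counter at 1 for a new user, and bump it whenever the gap to the previous check-in day is more than one day.

```python
from datetime import datetime, timedelta


def rank_visits(rows):
    """rows: (user_id, datetime) pairs. Returns [(user_id, datetime, visit_no)]."""
    ranked = []
    last_user = None
    last_day = None
    visit_no = 0
    for user_id, dt in sorted(rows):  # same role as ORDER BY user_id, dt
        day = dt.date()
        if last_user is None or user_id != last_user:
            visit_no = 1              # new user: restart the counter
        elif day - timedelta(days=1) > last_day:
            visit_no += 1             # gap of more than one day: new visit
        ranked.append((user_id, dt, visit_no))
        last_user, last_day = user_id, day
    return ranked


rows = [
    (1, datetime(2011, 6, 30, 12, 11, 46)),
    (1, datetime(2011, 7, 1, 13, 16, 34)),
    (1, datetime(2011, 7, 1, 15, 22, 45)),
    (1, datetime(2011, 7, 1, 22, 35, 0)),
    (1, datetime(2011, 7, 2, 13, 45, 12)),
    (1, datetime(2011, 8, 1, 0, 11, 45)),
    (1, datetime(2011, 8, 5, 17, 14, 34)),
    (1, datetime(2011, 8, 5, 18, 11, 46)),
    (1, datetime(2011, 8, 6, 20, 22, 12)),
    (2, datetime(2011, 8, 30, 16, 13, 34)),
    (2, datetime(2011, 8, 31, 16, 13, 41)),
]
ranked = rank_visits(rows)
visit_numbers = [visit_no for _, _, visit_no in ranked]
print(visit_numbers)
```

Taking the maximum visit number per user then gives the number of visits, just like MAX(days) in the outer query.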
Using the sample data provided by Devart, the inner "PreQuery" works with SQL variables. Because @LUser defaults to -1 (an ID that presumably does not exist), the IF() test detects any difference between the last user and the current one. As soon as a new user appears, the row gets a value of 1. Likewise, if the last date is more than one day before the new check-in date, the row gets a value of 1; otherwise it gets 0. The subsequent columns then reset @LUser and @LDate to the values of the row just tested, ready for the next row. Finally, the outer query sums the flags and counts the rows per user, which gives these results for the Devart data set (note that count(*) counts check-in rows, so user 1 shows 9):
User ID | Distinct Visits | Total Days
--------|-----------------|-----------
1       | 3               | 9
2       | 1               | 2
select PreQuery.User_ID,
       sum( PreQuery.NextVisit ) as DistinctVisits,
       count(*) as TotalDays
from
    ( select v.user_id,
             if( @LUser <> v.User_ID OR @LDate < ( date( v.dt ) - Interval 1 day ), 1, 0 ) as NextVisit,
             @LUser := v.user_id,
             @LDate := date( v.dt )
      from
          Visits v,
          ( select @LUser := -1, @LDate := date(now()) ) AtVars
      order by
          v.user_id,
          v.dt ) PreQuery
group by
    PreQuery.User_ID
For the first sub-task (counting the distinct days):
select count(*)
from (
select TO_DAYS(p.d)
from p
group by TO_DAYS(p.d)
) t
I think you should consider changing the database structure. You could add a visits table and a visit_id column to your checkins table. Each time you register a new checkin, check whether there is a checkin from the day before. If there is, add the new checkin with the visit_id of yesterday's checkin. If not, add a new visit to visits and add the new checkin with the new visit_id.
Then you could get your data in one query with something like this:
SELECT COUNT(id) AS number_of_days, COUNT(DISTINCT visit_id) number_of_visits FROM checkin GROUP BY user, city
It's not optimal, but it is still better than working with the current structure, and it will work. If the results can come from separate queries, it will also be very fast.
The drawbacks, of course, are that you will need to change the database structure, do some more scripting, and convert the current data to the new structure (i.e. you will need to add visit_id to the existing data).