MySQL pro rata counting of distinct values in last period - mysql

This one is quite tricky; I've been scratching my head all day.
I have a table containing billing periods
ID | DATEBEGIN | DATEEND | CUSTOMER_ID
=======+===========+==========+=================
1 | 1/1/2011 | 30/1/2011 | 1
I have a table containing 'sub customers'
ID | NAME
=======+===========
1 | JOHN
2 | JACK
3 | Elwood
I have a table containing items purchased on a subscription (wholesale account)
ID | DATE | SUBCUSTOMER_ID | CUSTOMER ID
=======+===========+================+==============
1 | 15/1/2011 | 1 | 1
2 | 18/1/2011 | 1 | 1
3 | 25/1/2011 | 2 | 1
4 | 28/1/2011 | 3 | 1
I want to count 'credits' to deduct from their account. The subscription is per 'sub customer'.
At the end of the billing period (30/1/2011 in the first table), I need to count the distinct sub customers (there are 3). They are charged pro rata from the first purchase they make during the billing period.
Days Having Subscription | SUBCUSTOMER_ID | Pro Rata Rate | CUSTOMER_ID
==========================+===================+==================+==============
3 | 3 | 3/30 | 1
6 | 2 | 6/30 | 1
16 | 1 | 16/30 | 1
The output should therefore be
CUSTOMER_ID | BILLING CREDITS
============+========================
1 | 25/30
I have to count it pro rata; otherwise it would be unfair to bill a full period even if they purchase an item one day before the billing date.
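As a cross-check of the arithmetic in the tables above, here is a minimal Python sketch of the pro rata rule. The function name `billing_credits` and the in-memory data layout are assumptions made for illustration; it also assumes both the first purchase day and the last day of the period are billed, which is what the 16, 6 and 3 day figures imply.

```python
from datetime import date
from fractions import Fraction


def billing_credits(period_begin, period_end, purchases):
    """Sum each sub customer's pro rata share of one billing period.

    purchases: list of (purchase_date, subcustomer_id) pairs for one customer.
    """
    days_in_period = (period_end - period_begin).days + 1
    first_purchase = {}
    for purchase_date, subcustomer_id in purchases:
        if period_begin <= purchase_date <= period_end:
            earlier = first_purchase.get(subcustomer_id)
            if earlier is None or purchase_date < earlier:
                first_purchase[subcustomer_id] = purchase_date
    # Count from the first purchase day through the last day of the period.
    billed_days = sum((period_end - d).days + 1 for d in first_purchase.values())
    return Fraction(billed_days, days_in_period)


purchases = [
    (date(2011, 1, 15), 1),
    (date(2011, 1, 18), 1),
    (date(2011, 1, 25), 2),
    (date(2011, 1, 28), 3),
]
result = billing_credits(date(2011, 1, 1), date(2011, 1, 30), purchases)
print(result)  # 5/6, the reduced form of 25/30
```

Using `Fraction` keeps the result exact, so it can be compared directly with the expected 25/30.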

SELECT customerId, SUM(DATEDIFF(dateend, fp) + 1) / (DATEDIFF(dateend, datebegin) + 1)
FROM (
SELECT b.*, MIN(date) AS fp
FROM billing b
JOIN purchase p
ON p.customerId = b.customerId
AND p.date >= b.datebegin
AND p.date < b.dateend + INTERVAL 1 DAY
GROUP BY
b.id, p.subcustomerId
) q
GROUP BY
customerId

SELECT customer_id, sum(pro_rated_date) / max(days_in_billing_cycle)
FROM (SELECT min(date),
subcustomer_id,
datebegin,
dateend,
items.customer_id,
datediff(dateend, datebegin) + 1 AS days_in_billing_cycle,
-- + 1 so that the first purchase day itself is billed (16 + 6 + 3 = 25)
datediff(dateend, min(date)) + 1 AS pro_rated_date
FROM items
JOIN
billing_period
ON items.date BETWEEN billing_period.datebegin
AND billing_period.dateend
AND items.customer_id = billing_period.customer_id
GROUP BY subcustomer_id) AS billing
GROUP BY customer_id

Related

How to get a Cohort analysis using MySQL?

Here is my current query:
SELECT DATEDIFF(created_at, '2020-07-01') DIV 6 period,
user_id FROM transactions
WHERE DATE(created_at) >= '2020-07-01'
GROUP BY user_id, DATEDIFF(created_at, '2020-07-01') DIV 6
ORDER BY period
It returns a list of users that have had at least one transaction per period (one period = 6 days). Here is a simplified version of the current output:
// res_table
+--------+---------+
| period | user_id |
+--------+---------+
| 0 | 1111 |
| 0 | 2222 |
| 0 | 3333 |
| 1 | 7777 |
| 1 | 1111 |
| 2 | 2222 |
| 2 | 1111 |
| 2 | 8888 |
| 2 | 3333 |
+--------+---------+
Now I need to know, for each period, how many of its users had at least one transaction again in each later period (in marketing terms, I'm trying to picture the retention rate as a cohort chart). So the calculation needs a Cartesian-style pairing of periods, like a self-join.
Here is the expected result:
+---------+---------+------------+
| periodX | periodY | percentage |
+---------+---------+------------+
| 0 | 0 | 100% | -- it means 3 users exist in period 0 and logically all of them exist in period 0. So 3/3=100%
| 0 | 1 | 33% | -- It means 3 users exist in period 0, and just 1 of them exist in period 1. So 1/3=33%
| 0 | 2 | 66% | -- It means 3 users exist in period 0, and just 2 of them exist in period 2. So 2/3=66%
| 1 | 1 | 100% | -- It means 1 user (only #7777; #1111 is ignored because it already appeared in a previous period) exists in period 1 and logically it exists in period 1. So 1/1=100%
| 1 | 2 | 0% |
| 2 | 2 | 100% |
+---------+---------+------------+
Is it possible to do this using MySQL purely?
You can use window functions:
SELECT first_period, period, COUNT(*),
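-- divide by the cohort size, i.e. the count in the cohort's own first period
COUNT(*) / FIRST_VALUE(COUNT(*)) OVER (PARTITION BY first_period ORDER BY period) as ratio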
FROM (SELECT DATEDIFF(created_at, '2020-07-01') DIV 6 period,
user_id,
MIN(MIN(DATEDIFF(created_at, '2020-07-01') DIV 6) OVER (PARTITION BY user_id)) as first_period
FROM transactions
WHERE DATE(created_at) >= '2020-07-01'
GROUP BY user_id, DATEDIFF(created_at, '2020-07-01') DIV 6
) u
GROUP BY first_period, period
ORDER BY first_period, period;
This does not include missing periods. That is a little trickier, because you need to enumerate all of them:
with periods as (
select 0 as period union all
select 1 as period union all
select 2 as period
)
select p1.period, p2.period, COUNT(u.user_id)
from periods p1 join
periods p2
on p1.period <= p2.period left join
(SELECT DATEDIFF(created_at, '2020-07-01') DIV 6 period,
user_id,
MIN(MIN(DATEDIFF(created_at, '2020-07-01') DIV 6) OVER (PARTITION BY user_id)) as first_period
FROM transactions
WHERE DATE(created_at) >= '2020-07-01'
GROUP BY user_id, DATEDIFF(created_at, '2020-07-01') DIV 6
) u
ON p1.period = u.first_period AND p2.period = u.period
GROUP BY p1.period, p2.period;
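
To check what such a query should return, the cohort logic can be sketched in plain Python. This is only an illustration: `cohort_retention` is a hypothetical helper, and the input is the simplified `(period, user_id)` list from the question. Note that with the rows exactly as listed, users 1111, 2222 and 3333 all reappear in period 2, so the sketch reports 3/3 for the pair (0, 2).

```python
from collections import defaultdict


def cohort_retention(rows):
    """rows: list of (period, user_id). Returns {(first_period, period): ratio}."""
    first_period = {}
    for period, user in rows:
        first_period[user] = min(period, first_period.get(user, period))

    cohort_users = defaultdict(set)  # first_period -> users whose first period it is
    active = defaultdict(set)        # (first_period, period) -> cohort users seen then
    for period, user in rows:
        fp = first_period[user]
        cohort_users[fp].add(user)
        active[(fp, period)].add(user)

    periods = sorted({p for p, _ in rows})
    result = {}
    for fp in sorted(cohort_users):
        for p in periods:
            if p >= fp:
                # Missing pairs fall back to an empty set, giving a ratio of 0.
                result[(fp, p)] = len(active[(fp, p)]) / len(cohort_users[fp])
    return result


rows = [
    (0, 1111), (0, 2222), (0, 3333),
    (1, 7777), (1, 1111),
    (2, 2222), (2, 1111), (2, 8888), (2, 3333),
]
retention = cohort_retention(rows)
for (fp, p), ratio in retention.items():
    print(fp, p, f"{ratio:.0%}")
```

Each user is assigned to the cohort of their first period only, which is why user 1111 does not count towards the period 1 cohort.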

Select duplicate payment pending orders with no corresponding paid order

My Table structure is:
orders
+------+-------------+----------------+-------------+
| id | customer_id | payment_status | created_on|
+------+-------------+----------------+-------------+
| 1 | 1 | unpaid | 2018-12-28 |
| 2 | 1 | unpaid | 2018-12-29 |
| 3 | 2 | unpaid | 2018-12-29 |
| 4 | 2 | unpaid | 2018-12-29 |
| 5 | 4 | paid | 2018-12-30 |
| 6 | 3 | unpaid | 2018-12-30 |
+------+-------------+----------------+-------------+
order_items
+------+-----------+-------------+----------+-------+
| id | order_id | product_id | quantity | price |
+------+-----------+-------------+----------+-------+
| 1 | 1 | 4 | 2 | 20.50 |
| 2 | 1 | 5 | 2 | 25.00 |
| 3 | 2 | 4 | 2 | 20.50 |
| 4 | 2 | 5 | 2 | 25.00 |
| 5 | 3 | 1 | 1 | 20.00 |
| 6 | 3 | 2 | 1 | 25.00 |
| 7 | 4 | 1 | 1 | 20.00 |
| 8 | 4 | 2 | 1 | 25.00 |
| 9 | 5 | 4 | 2 | 20.50 |
| 10 | 5 | 5 | 2 | 25.00 |
| 11 | 6 | 3 | 4 | 15.00 |
+------+-----------+-------------+----------+-------+
customer
+-----+---------------+----------+
| id | email | name |
+-----+---------------+----------+
| 1 | abc@mail.com | user 1 |
| 2 | xyz@mail.com | user 2 |
| 3 | pqr@mail.com | user 3 |
| 4 | abc@mail.com | user 4 |
+-----+---------------+----------+
Q: I want the orders that belong to one customer email, have pending status, and have no paid order under that customer within a week
Expected output: 1
single order with no corresponding paid order within a week
+------+-------------+----------------+-------------+
| id | customer_id | payment_status | created_on|
+------+-------------+----------------+-------------+
| 3 | 2 | unpaid | 2018-12-29 |
| 4 | 2 | unpaid | 2018-12-29 |
| 6 | 3 | unpaid | 2018-12-30 |
+------+-------------+----------------+-------------+
Q: I want the orders where there are 2 orders with the same products and the same quantities under one customer email, with pending status and no paid order under that customer within a week
Expected output: 2
two orders with no corresponding paid order within a week
+------+-------------+----------------+-------------+
| id | customer_id | payment_status | created_on|
+------+-------------+----------------+-------------+
| 3 | 2 | unpaid | 2018-12-29 |
| 4 | 2 | unpaid | 2018-12-29 |
+------+-------------+----------------+-------------+
The first query is dubious -- Did you really mean email or customer_id? The latter should be how you designed the schema to distinguish one "customer" from another. Think that through. (And fix the data to make it clear.) Meanwhile, I will assume customer_id distinguishes customers.
I can't wrap my head around the purpose of the first query. You are looking for customers that paid for a later Order, but have not paid for an earlier order? Or looking for mis-postings in the database? Anyway, here is a shot at it:
SELECT Unpd.id, Unpd.customer_id, Unpd.payment_status, Unpd.created_on
FROM Orders AS Unpd
WHERE Unpd.payment_status = 'unpaid'
AND NOT EXISTS
(
SELECT 1
FROM Orders AS Pd
WHERE Pd.customer_id = Unpd.customer_id
AND Pd.payment_status = 'paid'
AND Pd.created_on > NOW() - INTERVAL 1 WEEK
)
Second query. I rephrase it as: Locate two (or more) orders (paid or unpaid) by the same customer on the same day (but not checking that the items are the same):
SELECT O2.id, O2.customer_id, O2.payment_status, O2.created_on
FROM
(
SELECT O.customer_id, O.created_on
FROM Orders AS O
GROUP BY O.customer_id, O.created_on
HAVING COUNT(*) >= 2
) AS MultipleInOneDay
JOIN Orders AS O2 USING (customer_id, created_on)
I agree completely with Rick about cleaning up the schema
If I've read you correctly, currently your customer table effectively just adds the columns email and name to your orders table
Q1
Assuming you want within a week of today's date and ID fields cannot be null
SELECT ou.*
FROM orders ou /** orders unpaid */
JOIN customer cu /** customer unpaid */
ON cu.id = ou.customer_id
WHERE ou.payment_status = 'unpaid'
AND NOT EXISTS (
SELECT 1
FROM orders op /** orders paid */
JOIN customer cp /** customer paid */
ON cp.id = op.customer_id
WHERE op.payment_status = 'paid'
AND op.created_on > CURDATE() - INTERVAL 1 WEEK /** or >= if required */
AND cp.email = cu.email
)
N.B. As it is more than a week since the paid orders in your examples, you will have to adjust the temporal condition to see the same results
Q2
Same assumptions as Q1, plus assumption that a product_id can only appear once per order
SELECT ou.*
FROM orders ou /** orders unpaid */
JOIN customer cu /** customer unpaid */
ON cu.id = ou.customer_id
JOIN (
SELECT GROUP_CONCAT(oudc.id) orders_csv
FROM (
SELECT oui.id,
cui.email,
GROUP_CONCAT(oiui.product_id ORDER BY oiui.product_id) products,
GROUP_CONCAT(oiui.quantity ORDER BY oiui.product_id) quantity
FROM orders oui /** orders unpaid internal */
JOIN customer cui /** customer unpaid internal */
ON cui.id = oui.customer_id
JOIN order_items oiui /** order items unpaid internal */
ON oiui.order_id = oui.id
WHERE oui.payment_status = 'unpaid'
GROUP BY oui.id,
cui.email
) oudc /** orders unpaid dupe check */
GROUP BY oudc.email,
oudc.products,
oudc.quantity
HAVING COUNT(*) = 2 /** or >=2 if required */
) oud /** orders unpaid dupes */
ON FIND_IN_SET(ou.id, oud.orders_csv) > 0
WHERE ou.payment_status = 'unpaid'
AND NOT EXISTS (
SELECT 1
FROM orders op /** orders paid */
JOIN customer cp /** customer paid */
ON cp.id = op.customer_id
WHERE op.payment_status = 'paid'
AND op.created_on > CURDATE() - INTERVAL 1 WEEK /** or >= if required */
AND cp.email = cu.email
)
N.B. As it is more than a week since the paid orders in your examples, you will have to adjust the temporal condition to see the same results
This query is only roughly tested and is probably woefully slow. I suggest you run each nested select query individually (starting at the deepest) to see what is happening. Basically it concatenates the orders into 1 row each, then concatenates duplicate orders with the same email into 1 row each, then checks for orders in this row using a similar logic to Q1
If you can have the same product_id more than once per order, you can normalize with a further nested grouping select within my orders unpaid dupe check subquery
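The approach described above (collapse each order to one row, then group identical rows per email) can be sketched in Python as a cross-check. This is only an illustration on the sample data: `duplicate_unpaid_orders` is a hypothetical helper, emails are reduced to their local part, and the one-week window is ignored, so any paid order under an email excludes that email.

```python
from collections import defaultdict

orders = {  # order id -> (customer_id, payment_status)
    1: (1, "unpaid"), 2: (1, "unpaid"), 3: (2, "unpaid"),
    4: (2, "unpaid"), 5: (4, "paid"), 6: (3, "unpaid"),
}
order_items = [  # (order_id, product_id, quantity)
    (1, 4, 2), (1, 5, 2), (2, 4, 2), (2, 5, 2), (3, 1, 1), (3, 2, 1),
    (4, 1, 1), (4, 2, 1), (5, 4, 2), (5, 5, 2), (6, 3, 4),
]
emails = {1: "abc", 2: "xyz", 3: "pqr", 4: "abc"}  # customer id -> email local part


def duplicate_unpaid_orders(orders, order_items, emails):
    # Signature of an order: its sorted (product_id, quantity) pairs.
    items = defaultdict(list)
    for order_id, product_id, quantity in order_items:
        items[order_id].append((product_id, quantity))

    # Assumption: every paid order counts as "within a week".
    paid_emails = {emails[c] for c, status in orders.values() if status == "paid"}

    groups = defaultdict(list)  # (email, signature) -> unpaid order ids
    for order_id, (customer_id, status) in orders.items():
        email = emails[customer_id]
        if status == "unpaid" and email not in paid_emails:
            groups[(email, tuple(sorted(items[order_id])))].append(order_id)

    return sorted(o for ids in groups.values() if len(ids) >= 2 for o in ids)


print(duplicate_unpaid_orders(orders, order_items, emails))  # [3, 4]
```

Sorting the `(product_id, quantity)` pairs plays the same role as the `ORDER BY` inside `GROUP_CONCAT`: two orders with the same lines in a different order still get the same signature.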
SQLfiddle
I have also created an SQLfiddle to demonstrate these 2 queries in action on your example data. I have, however, adjusted the dates of the example orders so that they depend on the current date

How to join two tables with average function and where clause? SQL

I have two tables below with the following information
project.analytics
| proj_id | list_date | state
| 1 | 03/05/10 | CA
| 2 | 04/05/10 | WA
| 3 | 03/05/10 | WA
| 4 | 04/05/10 | CA
| 5 | 03/05/10 | WA
| 6 | 04/05/10 | CA
employees.analytics
| employee_id | proj_id | worked_date
| 20 | 1 | 3/12/10
| 30 | 1 | 3/11/10
| 40 | 2 | 4/15/10
| 50 | 3 | 3/16/10
| 60 | 3 | 3/17/10
| 70 | 4 | 4/18/10
What query can I write to determine the average number of unique employees who have worked on the project in the first 7 days that it was listed by month and state?
Desired output:
| list_date | state | # Unique Employees of projects first 7 day list
| March | CA | 1
| April | WA | 2
| July | WA | 2
| August | CA | 1
My Attempt
select
month(list_date),
state_name,
count(*) as Projects
from projects
group by
month(list_date),
state_name;
I understand the next steps are to subtract the worked_date - list_date and if value is <7 then average count of employees from the 2nd table but I'm not sure what query functions to use.
You could use a CASE with a DISTINCT to COUNT the unique employees that worked within the first 7 days of the list_date.
Once you have that total of employees per project, then you can calculate those averages per month & state.
SELECT
MONTHNAME(list_date) as `ListMonth`,
state,
AVG(TotalUniqEmp7Days) AS `Average Unique Employees of projects first 7 day list`
FROM
(
SELECT
proj.proj_id,
proj.list_date,
proj.state,
COUNT(DISTINCT CASE
WHEN emp.worked_date BETWEEN proj.list_date and DATE_ADD(proj.list_date, INTERVAL 6 DAY)
THEN emp.employee_id
END) AS TotalUniqEmp7Days
-- , COUNT(DISTINCT emp.employee_id) AS TotalUniqEmp
FROM project.analytics proj
LEFT JOIN employees.analytics emp ON emp.proj_id = proj.proj_id
GROUP BY proj.proj_id, proj.list_date, proj.state
) AS ProjectTotals
GROUP BY YEAR(list_date), MONTH(list_date), MONTHNAME(list_date), state;
A Sql Fiddle test can be found here
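To see what the two-step aggregation yields on the sample rows, here is a small Python sketch. It is an illustration only: the helper name `avg_unique_employees_first_week` is made up, and the dates are assumed to be MM/DD/YY. Projects with no work logged in their first 7 days count as 0, mirroring the LEFT JOIN.

```python
from collections import defaultdict
from datetime import date, timedelta

projects = [  # (proj_id, list_date, state), dates assumed to be MM/DD/YY
    (1, date(2010, 3, 5), "CA"), (2, date(2010, 4, 5), "WA"),
    (3, date(2010, 3, 5), "WA"), (4, date(2010, 4, 5), "CA"),
    (5, date(2010, 3, 5), "WA"), (6, date(2010, 4, 5), "CA"),
]
work = [  # (employee_id, proj_id, worked_date)
    (20, 1, date(2010, 3, 12)), (30, 1, date(2010, 3, 11)),
    (40, 2, date(2010, 4, 15)), (50, 3, date(2010, 3, 16)),
    (60, 3, date(2010, 3, 17)), (70, 4, date(2010, 4, 18)),
]


def avg_unique_employees_first_week(projects, work):
    buckets = defaultdict(list)  # (year, month, state) -> per-project counts
    for proj_id, list_date, state in projects:
        window_end = list_date + timedelta(days=6)  # 7 days including list_date
        staff = {e for e, p, d in work
                 if p == proj_id and list_date <= d <= window_end}
        buckets[(list_date.year, list_date.month, state)].append(len(staff))
    return {key: sum(counts) / len(counts) for key, counts in buckets.items()}


averages = avg_unique_employees_first_week(projects, work)
for key in sorted(averages):
    print(key, averages[key])
```

Under these assumptions only employee 30 on project 1 falls inside a first-week window, so March/CA averages 1.0 and the other three groups average 0.0.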
I think this is the code that you want
select
p.list_date, p.state,
emp.no_of_unique_emp
from project.analytics p
inner join (
select
e.proj_id,
count(distinct e.employee_id) as no_of_unique_emp
from employees.analytics e
inner join project.analytics pl
on pl.proj_id = e.proj_id
-- keep only work logged in the first 7 days after listing (MySQL DATEDIFF)
where datediff(e.worked_date, pl.list_date) between 0 and 6
group by e.proj_id
) emp
on emp.proj_id = p.proj_id

Mysql: Compare and filter time entries

I have a table with user ids and login dates.
id | customer | timestamp
1 | 1 | 2017-02-10 11:30:28
2 | 1 | 2017-02-11 11:30:28
3 | 2 | 2017-02-12 11:30:28
4 | 3 | 2017-02-13 11:30:28
5 | 1 | 2017-02-14 11:30:28
Now I want to get the count of the longest continuous streak of logins per customer.
I got to the point, where the difference is calculated correctly for one customer.
SELECT a.id aId,
b.id bId,
a.customer,
a.timestamp aTime,
b.timestamp bTime,
DATEDIFF(b.timestamp, a.timestamp) as diff
FROM logins a
INNER JOIN logins b
ON a.customer = b.customer AND a.id < b.id
WHERE b.customer = 7
GROUP BY a.id
How can I do this for the whole table and count the following logins with a difference under 1 day?
The wanted result for this example should be:
customer | days of continuous login
1 | 2
2 | 1
3 | 1
You can do this with a LEFT JOIN
Query
SELECT
logins.customer
, COUNT(*) as "longest continuous streak of logins"
FROM (
SELECT
login1.*
FROM
login login1
LEFT JOIN
login login2
ON
login1.timestamp < login2.timestamp
AND
# Only JOIN if date diff is less than or equal to 1 day
DATEDIFF(login2.timestamp, login1.timestamp) <= 1
WHERE
login2.id IS NOT NULL
AND
login2.customer IS NOT NULL
AND
login2.timestamp IS NOT NULL
ORDER BY
login1.customer
)
AS logins
GROUP BY
logins.customer
Result
| customer | longest continuous streak of logins |
|----------|-------------------------------------|
| 1 | 2 |
| 2 | 1 |
| 3 | 1 |
see demo http://www.sqlfiddle.com/#!9/ad581/17
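The join above does not compare `customer`, so it may be worth cross-checking the result with a per-customer computation. The following Python sketch is one way to do that; `longest_streaks` is a hypothetical helper, and it assumes a streak means logins on consecutive calendar days (the same granularity `DATEDIFF` works at).

```python
from collections import defaultdict
from datetime import datetime, timedelta


def longest_streaks(logins):
    """logins: list of (customer, timestamp). Returns {customer: longest run of days}."""
    days = defaultdict(set)
    for customer, ts in logins:
        days[customer].add(ts.date())

    result = {}
    for customer, login_days in days.items():
        best = 0
        for day in login_days:
            # Only start counting at the first day of a run.
            if day - timedelta(days=1) in login_days:
                continue
            length = 1
            while day + timedelta(days=length) in login_days:
                length += 1
            best = max(best, length)
        result[customer] = best
    return result


logins = [
    (1, datetime(2017, 2, 10, 11, 30, 28)),
    (1, datetime(2017, 2, 11, 11, 30, 28)),
    (2, datetime(2017, 2, 12, 11, 30, 28)),
    (3, datetime(2017, 2, 13, 11, 30, 28)),
    (1, datetime(2017, 2, 14, 11, 30, 28)),
]
streaks = longest_streaks(logins)
print(streaks)  # {1: 2, 2: 1, 3: 1}
```

On the sample rows this gives the wanted result: customer 1 has a two-day run (10 and 11 February), and the login on 14 February starts a new run of length 1.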

Get the balance of my users in the same table

Help please, I have a table like this:
| ID | userId | amount | type |
-------------------------------------
| 1 | 10 | 10 | expense |
| 2 | 10 | 22 | income |
| 3 | 3 | 25 | expense |
| 4 | 3 | 40 | expense |
| 5 | 3 | 63 | income |
I'm looking for a way to use one query and retrieve the balance of each user.
The hard part is that the amounts have to be added for incomes and subtracted for expenses.
This would be the result table:
| userId | balance |
--------------------
| 10 | 12 |
| 3 | -2 |
You need to get the totals of income and expense per user with subqueries, then join them so you can subtract expense from income
SELECT a.UserID,
(b.totalIncome - a.totalExpense) `balance`
FROM
(
SELECT userID, SUM(amount) totalExpense
FROM myTable
WHERE type = 'expense'
GROUP BY userID
) a INNER JOIN
(
SELECT userID, SUM(amount) totalIncome
FROM myTable
WHERE type = 'income'
GROUP BY userID
) b on a.userID = b.userid
SQLFiddle Demo
This is easiest to do with a single group by:
select userId,
sum(case when type = 'income' then amount else - amount end) as balance
from t
group by userId
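The signed-sum idea behind the single group by can be checked on the sample rows with a few lines of Python. This is a sketch only; the function name `balances` and the tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def balances(rows):
    """rows: list of (user_id, amount, type). Income adds, anything else subtracts."""
    totals = defaultdict(int)
    for user_id, amount, kind in rows:
        totals[user_id] += amount if kind == "income" else -amount
    return dict(totals)


rows = [
    (10, 10, "expense"),
    (10, 22, "income"),
    (3, 25, "expense"),
    (3, 40, "expense"),
    (3, 63, "income"),
]
print(balances(rows))  # {10: 12, 3: -2}
```

Unlike the inner join of two subqueries, a signed sum also keeps users who have only expenses or only incomes.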
You could have 2 sub-queries, each grouped by id: one sums the incomes, the other the expenses. Then you could join these together, so that each row had an id, the sum of the expenses and the sum of the income(s), from which you can easily compute the balance.