I am trying to create features for my ML work on grocery customer data.
The data contains the transactions that users make when buying groceries.
I am trying to find the names of the users who have made consecutive transactions within a 30-second time frame. This is important for building a profile of such users.
For example, if the data looks like this:
   User   Datetime             Amount
1  Mary   2020-11-30 10:10:20  24
2  Jacob  2020-11-30 12:12:12  43.2
3  Alice  2020-11-30 11:11:11  75.29
4  Mary   2020-11-30 10:10:45  34
5  Mary   2020-11-30 10:11:15  21
6  Alice  2020-11-30 11:11:41  100
the correct answer would be Alice, as only Alice had more than one transaction with every consecutive pair 30 seconds apart.
Mary might appear to be a probable answer, but not all of her consecutive transactions had a 30-second gap: the gaps were 25 and 30 seconds. So the correct answer we need is Alice.
One method is to use lag() to get the time of the previous transaction. The following returns the transactions that are within 30 seconds of the previous one:
select t.*
from (select t.*,
             lag(datetime) over (partition by user order by datetime) as prev_datetime
      from t
     ) t
where prev_datetime > datetime - interval '30 second';
This syntax uses standard SQL; date/time functions vary among databases, so the exact syntax depends on the database you are using.
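To make the lag() step concrete, here is a minimal Python sketch of the same comparison on the sample rows from the question. The function name `within_window` and the `inclusive` flag are illustrative assumptions, not part of the SQL above; the flag is there only because the strict `>` in the query leaves out a gap of exactly 30 seconds.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Sample rows from the question: (user, datetime)
transactions = [
    ("Mary", "2020-11-30 10:10:20"),
    ("Jacob", "2020-11-30 12:12:12"),
    ("Alice", "2020-11-30 11:11:11"),
    ("Mary", "2020-11-30 10:10:45"),
    ("Mary", "2020-11-30 10:11:15"),
    ("Alice", "2020-11-30 11:11:41"),
]


def within_window(transactions, seconds=30, inclusive=False):
    """Return (user, datetime) rows that follow the same user's previous
    transaction by less than `seconds` (or at most `seconds` if inclusive)."""
    # Sorting by (user, time) plays the role of PARTITION BY user ORDER BY datetime.
    parsed = sorted((user, datetime.strptime(ts, FMT)) for user, ts in transactions)
    limit = timedelta(seconds=seconds)
    hits = []
    prev_user = prev_time = None
    for user, ts in parsed:
        if user == prev_user:  # a previous row exists in this partition
            gap = ts - prev_time
            if gap < limit or (inclusive and gap == limit):
                hits.append((user, ts.strftime(FMT)))
        prev_user, prev_time = user, ts
    return hits


strict_hits = within_window(transactions)
inclusive_hits = within_window(transactions, inclusive=True)
```

With the strict comparison only Mary's 25-second gap qualifies; with `inclusive=True` the 30-second gaps of Mary and Alice are returned as well, which is closer to how the question uses "within 30 seconds".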
It is unclear how you want to summarize this to get Alice but not Mary.
If you need every gap between consecutive transactions to be exactly 30 seconds, you can use (MySQL syntax, where a boolean comparison sums as 0 or 1):
select user
from (select t.*,
             lag(datetime) over (partition by user order by datetime) as prev_datetime
      from t
     ) t
group by user
having sum(prev_datetime <> datetime - interval 30 second) = 0;
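As a cross-check of the "every gap is exactly 30 seconds" reading, the following Python sketch applies that rule to the sample rows. The helper name `users_with_exact_gap` is hypothetical, and users with a single transaction are skipped on the assumption that they have no gap to test.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sample rows from the question: (user, datetime)
rows = [
    ("Mary", "2020-11-30 10:10:20"),
    ("Jacob", "2020-11-30 12:12:12"),
    ("Alice", "2020-11-30 11:11:11"),
    ("Mary", "2020-11-30 10:10:45"),
    ("Mary", "2020-11-30 10:11:15"),
    ("Alice", "2020-11-30 11:11:41"),
]


def users_with_exact_gap(rows, gap=timedelta(seconds=30)):
    """Users with at least two transactions whose consecutive gaps all equal `gap`."""
    by_user = defaultdict(list)
    for user, ts in rows:
        by_user[user].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
    result = []
    for user, times in by_user.items():
        times.sort()
        # Differences between each transaction and the one before it.
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if gaps and all(g == gap for g in gaps):
            result.append(user)
    return sorted(result)


exact_gap_users = users_with_exact_gap(rows)
```

On the sample data this keeps Alice (one gap of 30 seconds), drops Mary (gaps of 25 and 30 seconds), and drops Jacob (no gap at all).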
Related
I have two tables in my schema. The first contains a list of recurring appointments - default_appointments. The second table is actual_appointments - these can be generated from the defaults or individually created so not linked to any default entry.
Example:
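default_appointments
id | day_of_week | user_id | appointment_start_time | appointment_end_time
---+-------------+---------+------------------------+---------------------
1  | 1           | 1       | 10:00:00               | 16:00:00
2  | 4           | 1       | 11:30:00               | 17:30:00
3  | 6           | 5       | 09:00:00               | 17:00:00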
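actual_appointments
id | default_appointment_id | user_id | appointment_start   | appointment_end
---+------------------------+---------+---------------------+--------------------
1  | 1                      | 1       | 2021-09-13 10:00:00 | 2021-09-13 16:00:00
2  | NULL                   | 1       | 2021-09-13 11:30:00 | 2021-09-13 13:30:00
3  | 6                      | 5       | 2021-09-18 09:00:00 | 2021-09-18 17:00:00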
I'm looking to calculate the total minutes that were scheduled in against the total that were actually created/generated. So ultimately I'd end up with a query result with this data:
user_id | appointment_date | total_planned_minutes | total_actual_minutes
--------+------------------+-----------------------+---------------------
1       | 2021-09-13       | 360                   | 480
1       | 2021-09-16       | 360                   | 0
5       | 2021-09-18       | 480                   | 480
What would be the best approach here? Hopefully the above makes sense.
Edit
OK, so the default_appointments table contains all appointments that are "standard" and are automatically generated. These are the appointments that "should" happen every week. So e.g. ID 1: this appointment should occur between 10am and 4pm every Monday. ID 2 should occur between 11:30am and 5:30pm every Thursday.
The actual_appointments table contains a list of all of the appointments which did actually occur. Basically, a default_appointment automatically generates an instance of itself in the actual_appointments table when it is initially set up. The corresponding default_appointment_id indicates that the row links to a default and has not been changed, so the times on both remain the same. The user is free to change appointments that were generated by a default, which sets the default_appointment_id to NULL, or to add new appointments unrelated to a default.
So, if on a Monday (day_of_week = 1) I would normally have a default appointment from 10am to 4pm, the total minutes I should have planned based on the defaults are 360, regardless of what is in the actual_appointments table: I should be planned for those 360 minutes every Monday without fail. If I then say in the system that I didn't actually have an appointment from 10am to 4pm and change it to 10am to 2pm, the actual_appointments table will contain the actual time for the day, and the actual minutes appointed would be 240.
What I need is to group each of these by the date and user to understand how much time the user had planned for appointments in the default_appointments table vs how much they actually appointed.
Adjusted based on new detail in the question.
Note: I used day_of_week values compatible with default MySQL behavior, where Monday = 2.
The first CTE term (args) provides the search parameters, start date and number of days. The second CTE term (drange) calculates the dates in the range to allow generation of the scheduled appointments within that range.
allrows combines the scheduled and actual appointments via UNION to prepare for aggregation. There are other ways to set this up.
Finally, we aggregate the results per user_id and date.
The test case:
Working Test Case (Updated)
WITH RECURSIVE args (startdate, days) AS (
        SELECT DATE('2021-09-13'), 7
     )
   , drange (adate, days) AS (
        -- one row per date in the requested range
        SELECT startdate, days-1 FROM args UNION ALL
        SELECT adate + INTERVAL '1' DAY, days-1 FROM drange WHERE days > 0
     )
   , allrows AS (
        -- planned minutes from the weekly defaults
        SELECT da.user_id
             , dr.adate
             , ROUND(TIME_TO_SEC(TIMEDIFF(da.appointment_end_time, da.appointment_start_time))/60, 0) AS planned
             , 0 AS actual
          FROM drange AS dr
          JOIN default_appointments AS da
            ON da.day_of_week = dayofweek(adate)
         -- UNION ALL, so that identical rows (same user, date and duration) are not collapsed
         UNION ALL
        -- actual minutes from the appointments that took place
        SELECT user_id
             , DATE(appointment_start) AS xdate
             , 0 AS planned
             , TIMESTAMPDIFF(MINUTE, appointment_start, appointment_end)
          FROM drange AS dr
          JOIN actual_appointments aa
            ON DATE(appointment_start) = dr.adate
     )
SELECT user_id, adate
     , SUM(planned) AS planned
     , SUM(actual)  AS actual
  FROM allrows
 GROUP BY adate, user_id
;
Result:
+---------+------------+---------+--------+
| user_id | adate | planned | actual |
+---------+------------+---------+--------+
| 1 | 2021-09-13 | 360 | 480 |
| 1 | 2021-09-16 | 360 | 0 |
| 5 | 2021-09-18 | 480 | 480 |
+---------+------------+---------+--------+
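The same three steps (generate the dates, expand the weekly defaults, add the actual appointments, then aggregate) can be sketched in plain Python to check the arithmetic. This is only an illustration under assumptions: the names `mysql_dayofweek` and `planned_vs_actual` are made up here, and the day_of_week values are the adjusted ones the answer mentions (MySQL DAYOFWEEK numbering, Monday = 2), not the values in the question's table.

```python
from collections import defaultdict
from datetime import date, datetime, time, timedelta

# (user_id, day_of_week, start, end); day_of_week uses MySQL DAYOFWEEK():
# 1 = Sunday, 2 = Monday, ..., 7 = Saturday (assumed adjustment of the sample data)
default_appointments = [
    (1, 2, time(10, 0), time(16, 0)),
    (1, 5, time(11, 30), time(17, 30)),
    (5, 7, time(9, 0), time(17, 0)),
]
# (user_id, appointment_start, appointment_end)
actual_appointments = [
    (1, datetime(2021, 9, 13, 10, 0), datetime(2021, 9, 13, 16, 0)),
    (1, datetime(2021, 9, 13, 11, 30), datetime(2021, 9, 13, 13, 30)),
    (5, datetime(2021, 9, 18, 9, 0), datetime(2021, 9, 18, 17, 0)),
]


def mysql_dayofweek(d):
    """Map a date to MySQL's DAYOFWEEK() numbering (Sunday = 1)."""
    return d.isoweekday() % 7 + 1


def planned_vs_actual(start, days):
    """Return {(user_id, date): (planned_minutes, actual_minutes)} for the range."""
    totals = defaultdict(lambda: [0, 0])
    end = start + timedelta(days=days)
    for offset in range(days):  # equivalent of the drange CTE
        day = start + timedelta(days=offset)
        for user_id, dow, t0, t1 in default_appointments:
            if dow == mysql_dayofweek(day):
                span = datetime.combine(day, t1) - datetime.combine(day, t0)
                totals[(user_id, day)][0] += int(span.total_seconds() // 60)
    for user_id, t0, t1 in actual_appointments:
        if start <= t0.date() < end:
            totals[(user_id, t0.date())][1] += int((t1 - t0).total_seconds() // 60)
    return {key: tuple(value) for key, value in sorted(totals.items())}


report = planned_vs_actual(date(2021, 9, 13), 7)
```

For the week starting 2021-09-13 this yields the same three rows as the query result above.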
I have a set of inventory data where the amount increases at a given rate. For example, the inventory increases by ten units every day. However, from time to time there will be an inventory reduction that could be any amount. I need a query that can find me the most recent inventory reduction and return to me the sum of that deduction.
My table holds the date and amount for numerous item IDs. In theory, what I am trying to do is select all amounts and dates for a given item ID, and then find the most recent day-to-day drop in inventory and its size. Because multiple items are tracked, there is no guarantee that the id column will be consecutive for a set of items.
Researching to find a solution to this has been completely overwhelming. It seems like window functions might be a good route to try, but I have never used them and don't even really have a concept of where to start.
While I could easily return the amounts and do the calculation in PHP, I feel the right thing to do here is harness SQL but my experience with more complex queries is limited.
ID  | ItemID | Date       | Amount
1   | 2      | 2019-05-05 | 25
7   | 2      | 2019-05-06 | 26
34  | 2      | 2019-05-07 | 14
35  | 2      | 2019-05-08 | 15
67  | 2      | 2019-05-09 | 16
89  | 2      | 2019-05-10 | 5
105 | 2      | 2019-05-11 | 6
Given the data above, it would be nice to see a result like:
item id | date       | reduction
2       | 2019-05-10 | 11
This is because the most recent inventory reduction is between id 67 and 89 and the amount of the reduction is 11 on May 10th 2019.
In MySQL 8+, you can use lag():
select t.*, (prev_amount - amount) as reduction
from (select t.*,
             lag(amount) over (partition by itemid order by date) as prev_amount
      from t
     ) t
where prev_amount > amount
order by date desc
limit 1;
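For readers new to window functions, the logic of the query can be traced in a few lines of Python: compare each day's amount with the previous day's amount for the same item and keep the latest drop. The function name `latest_reduction` is an illustrative assumption.

```python
# Sample rows from the question: (id, item_id, date, amount)
rows = [
    (1, 2, "2019-05-05", 25),
    (7, 2, "2019-05-06", 26),
    (34, 2, "2019-05-07", 14),
    (35, 2, "2019-05-08", 15),
    (67, 2, "2019-05-09", 16),
    (89, 2, "2019-05-10", 5),
    (105, 2, "2019-05-11", 6),
]


def latest_reduction(rows, item_id):
    """Return (item_id, date, reduction) for the most recent drop, or None."""
    # ISO dates sort correctly as strings, so this orders the item's history by date.
    history = sorted((d, amount) for _id, item, d, amount in rows if item == item_id)
    latest = None
    for (_prev_date, prev_amount), (d, amount) in zip(history, history[1:]):
        if prev_amount > amount:  # same test as "where prev_amount > amount"
            latest = (item_id, d, prev_amount - amount)
    return latest


result = latest_reduction(rows, 2)
```

Here `result` is `(2, "2019-05-10", 11)`, matching the expected output in the question; the earlier drop of 12 on 2019-05-07 is overwritten because a later one exists.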
We have a database for patients that shows the details of their various visits to our office, such as their weight during that visit. I want to generate a report that returns the visit (a row from the table) based on the difference between the date of that visit and the patient's first visit being the largest value possible but not exceeding X number of days.
That's confusing, so let me try an example. Let's say I have the following table called patient_visits:
visit_id | created | patient_id | weight
---------+---------------------+------------+-------
1 | 2006-08-08 09:00:05 | 10 | 180
2 | 2006-08-15 09:01:03 | 10 | 178
3 | 2006-08-22 09:05:43 | 10 | 177
4 | 2006-08-29 08:54:38 | 10 | 176
5 | 2006-09-05 08:57:41 | 10 | 174
6 | 2006-09-12 09:02:15 | 10 | 173
In my query, if I were wanting to run this report for "30 days", I would want to return the row where visit_id = 5, because it's 28 days into the future, and the next row is 35 days into the future, which is too much.
I've tried a variety of things, such as joining the table to itself, or creating a subquery in the WHERE clause to try to return the max value of created WHERE it is equal to or less than created + 30 days, but I seem to be at a loss at this point. As a last resort, I can just pull all of the data into a PHP array and build some logic there, but I'd really rather not.
The bigger picture is this: The database has about 5,000 patients, each with any number of office visits. I want to build the report to tell me what the average weight loss has been for all patients combined when going from their first visit to X days out (that is, X days from each individual patient's first visit, not an arbitrary X-day period). I'm hoping that if I can get the above resolved, I'll be able to work the rest out.
You can get the date of the first visit and of the last visit within the window using a query like this (the interval arithmetic is written in MySQL syntax; adjust it for your database):
select
    first_visits.patient_id,
    first_visits.first_date,
    max(next_visit.created) as next_date
from (
    -- earliest visit per patient
    select patient_id, min(created) as first_date
    from patient_visits
    group by patient_id
) as first_visits
inner join patient_visits next_visit
    on next_visit.patient_id = first_visits.patient_id
    -- keep only visits within 30 days of the first visit
    and next_visit.created between first_visits.first_date
                               and first_visits.first_date + interval 30 day
group by first_visits.patient_id, first_visits.first_date
So basically you need to find start date using grouping by patient_id and then join patient_visits and find max date that is within the 30 days window.
Then you can join the result to patient_visits to get start and end weights and calculate the loss.
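The two steps described above (find each patient's first visit, then the latest visit inside the window, then compare weights) can be sketched in Python on the sample table. The helper name `loss_within` is hypothetical, and the window is assumed to be inclusive of the cutoff.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Sample rows from the question: (visit_id, created, patient_id, weight)
visits = [
    (1, "2006-08-08 09:00:05", 10, 180),
    (2, "2006-08-15 09:01:03", 10, 178),
    (3, "2006-08-22 09:05:43", 10, 177),
    (4, "2006-08-29 08:54:38", 10, 176),
    (5, "2006-09-05 08:57:41", 10, 174),
    (6, "2006-09-12 09:02:15", 10, 173),
]


def loss_within(visits, days):
    """Return {patient_id: (last_visit_id_in_window, weight_loss)}."""
    by_patient = {}
    for visit_id, created, patient_id, weight in visits:
        entry = (datetime.strptime(created, FMT), visit_id, weight)
        by_patient.setdefault(patient_id, []).append(entry)
    result = {}
    for patient_id, history in by_patient.items():
        history.sort()
        first_time, _first_id, first_weight = history[0]
        cutoff = first_time + timedelta(days=days)
        # Latest visit that is still within `days` of the first visit.
        _last_time, last_id, last_weight = max(v for v in history if v[0] <= cutoff)
        result[patient_id] = (last_id, first_weight - last_weight)
    return result


report = loss_within(visits, 30)
```

For a 30-day window patient 10 maps to visit 5 with a loss of 6; averaging the second element over all patients would give the combined figure the question asks for.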
EDIT: The original post follows, but it's a bit long and wordy. This edit presents a simplified question.
I'm trying to SUM 1 column multiple times; from what I've found, my options are either CASE or (SELECT). I am trying to SUM based on a date range and I can't figure out if CASE allows that.
table.number | table.date
2 2014/12/18
2 2014/12/19
3 2015/01/11
3 2015/01/12
7 2015/02/04
7 2015/02/05
As separate queries, it would look like this:
SELECT SUM(number) as alpha FROM table WHERE date >= 2014/12/01 AND date<= DATE_ADD (2014/12/01, INTERVAL 4 WEEKS)
SELECT SUM(number) as beta FROM table WHERE date >= 2014/12/29 AND date<= DATE_ADD (2014/12/01, INTERVAL 4 WEEKS)
SELECT SUM(number) as gamma FROM table WHERE date >= 2014/01/19 AND date<= DATE_ADD (2014/12/01, INTERVAL 4 WEEKS)
Looking for result set
alpha | beta | gamma
2 6 14
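CASE does allow a date range in its condition, so one column can be summed several times in a single pass. Below is a small sketch using Python's built-in sqlite3 module rather than MySQL. The table name `payments`, the column name `pay_date`, and the three consecutive 4-week windows starting on 2014-12-01 are assumptions made for the example, not values confirmed by the question.

```python
import sqlite3
from datetime import date, timedelta

# Sample rows from the question, with dates rewritten in ISO format
rows = [
    (2, "2014-12-18"), (2, "2014-12-19"),
    (3, "2015-01-11"), (3, "2015-01-12"),
    (7, "2015-02-04"), (7, "2015-02-05"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (number INTEGER, pay_date TEXT)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)

# Assumed budgeting periods: consecutive 4-week windows from 2014-12-01
first_start = date(2014, 12, 1)
bounds = [(first_start + timedelta(weeks=4 * i)).isoformat() for i in range(4)]

# One SUM per window; rows outside a window contribute 0 to that column.
query = """
SELECT
    SUM(CASE WHEN pay_date >= ? AND pay_date < ? THEN number ELSE 0 END) AS alpha,
    SUM(CASE WHEN pay_date >= ? AND pay_date < ? THEN number ELSE 0 END) AS beta,
    SUM(CASE WHEN pay_date >= ? AND pay_date < ? THEN number ELSE 0 END) AS gamma
FROM payments
"""
params = (bounds[0], bounds[1], bounds[1], bounds[2], bounds[2], bounds[3])
alpha, beta, gamma = conn.execute(query, params).fetchone()
```

The windows are half-open (start inclusive, end exclusive) so that a payment on a boundary date is counted in exactly one period. In MySQL the same shape should work with DATE_ADD(..., INTERVAL 4 WEEK) in place of the precomputed bounds.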
ORIGINAL:
I'm trying to return SUM of payments that will be due within my budgeting time frame (4 weeks) for the current budgeting period and 2 future periods. Some students pay every 4 weeks, others every 12. Here are the relevant fields in my tables:
client.name | client.ppid | client.last_payment
john | 1 | 12/01/14
jack | 2 | 11/26/14
jane | 3 | 10/27/14
pay_profile.id | pay_profile.price | pay_profile.interval (in weeks)
1 140 4
2 399 4
3 1 12
pay_history.name | pay_history.date | pay_history.amount
john | 12/02/14 | 140
jerry | more historical | data
budget.period_start |
12/01/14
I think the most efficient way of doing this is:
1.) SUM all students who pay every 4 weeks as base_pay
2.) SUM all students who pay every 12 weeks and whose DATEADD(client.last_payment, INTERVAL pay_profile.interval WEEKS) is >= budget.period_start and <= DATEADD(budget.period_start, INTERVAL 28 DAYS) as accounts_receivable
3.) As the above step will miss people who've already paid in this budgeting period (as this updates their last_payment date, putting them out of the range specified in #2), I'll also need to SUM pay_history.date for the range above as well, as paid_in_full
4.) Repeat step 2 above, adjusting the range and column name for future periods (i.e. accounts_receivable_2)
5.) Use PHP to SUM base_pay, accounts_receivable, and pay_history, repeating the process for future periods.
I'm guessing the easiest way would be to use CASE, which I've not done before. Here was my best guess, which fails due to a syntax error. I'm assuming I can use DATE_ADD in the WHEN statement.
SELECT
CASE
DATE_ADD(client.last_payment, INTERVAL pay_profile.interval WEEK) >= budget.period_start
AND
DATE_ADD(client.last_payment, INTERVAL pay_profile.interval WEEK) <=
DATE_ADD(budget.period_start,INTERVAL 28 DAY) THEN SUM(pay_profile.price) as base_pay
FROM client
LEFT OUTER JOIN pay_profile ON client.ppid = pay_profile.ppid
LEFT OUTER JOIN budget ON client.active = 1
WHERE
client.active = 1
Thanks.
I have a table with 4 columns:
ID, USER_ID, SOURCE, CREATED_DATE
In that table is the following data:
ID USER_ID SOURCE CREATED_DATE
1 25 PURCHASE 2012-01-01 12:30:00
2 26 PLEDGE 2012-01-01 12:40:00
3 25 PLEDGE 2012-01-01 12:50:00
4 25 PURCHASE 2012-01-14 12:00:00
Now as you can see, I have 4 rows of data, and two unique users. User (25) made 3 transactions (two purchases and one pledge), user (26) made one transaction – (one pledge)
Here is what I am trying to achieve:
I need to select ALL transactions from this table, but I want to select a UNIQUE user for each REQUEST TYPE (source), and that row needs to be the EARLIEST TRANSACTION.
My expected result data would be:
ID USER_ID SOURCE CREATED_DATE
1 25 PURCHASE 2012-01-01 12:30:00
2 26 PLEDGE 2012-01-01 12:40:00
3 25 PLEDGE 2012-01-01 12:50:00
User (25) made TWO PURCHASES (one on 2012-01-01 and one on 2012-01-14) – the first is the one that gets returned.
This is the SQL I have come up with so far:
SELECT
Supporter.user_id,
MIN(Supporter.created) as created,
Supporter.*,
Supporter.source
FROM
supporters AS Supporter
GROUP BY Supporter.source
ORDER BY Supporter.created ASC
Now, this gets me really close, except it only selects ONE of the user IDs (the one with two items: a pledge and a purchase). If I could figure out how to select the data on both users, that would be what I need to do! Can anyone see what I am possibly doing wrong here, or missing?
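You need to group by source and by user id.
Something like this (note that selecting Supporter.* next to an aggregate only runs in MySQL when ONLY_FULL_GROUP_BY is disabled, and the non-grouped columns are not guaranteed to come from the earliest row):
SELECT
    Supporter.user_id,
    MIN(Supporter.created) as created,
    Supporter.*,
    Supporter.source
FROM
    supporters AS Supporter
GROUP BY Supporter.user_id, Supporter.source
ORDER BY Supporter.created ASC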
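If the full earliest row (including its ID) is needed for each user and source, one common alternative is to compute the minimum per group in a subquery and join it back to the table. The sketch below uses Python's built-in sqlite3 module on the sample rows; the column name `created` follows the query in the question, and the aliases `s`, `f` and `first_created` are made up for the example.

```python
import sqlite3

# Sample rows from the question: (id, user_id, source, created)
rows = [
    (1, 25, "PURCHASE", "2012-01-01 12:30:00"),
    (2, 26, "PLEDGE", "2012-01-01 12:40:00"),
    (3, 25, "PLEDGE", "2012-01-01 12:50:00"),
    (4, 25, "PURCHASE", "2012-01-14 12:00:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE supporters (id INTEGER, user_id INTEGER, source TEXT, created TEXT)"
)
conn.executemany("INSERT INTO supporters VALUES (?, ?, ?, ?)", rows)

# The subquery finds the earliest timestamp per (user_id, source);
# the join then picks up the complete row that carries that timestamp.
query = """
SELECT s.id, s.user_id, s.source, s.created
FROM supporters AS s
JOIN (
    SELECT user_id, source, MIN(created) AS first_created
    FROM supporters
    GROUP BY user_id, source
) AS f
  ON f.user_id = s.user_id
 AND f.source = s.source
 AND f.first_created = s.created
ORDER BY s.created
"""
earliest = conn.execute(query).fetchall()
```

On the sample data this returns rows 1, 2 and 3 and leaves out the second purchase by user 25. If two rows of the same user and source share the same earliest timestamp, both are returned, so a tie-breaker would be needed in that case.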