SQL - Calculating variable moving average over variable lengths - mysql

FIRST: This question is NOT a duplicate. I have asked this on here already and it was closed as a duplicate. While it is similar to other threads on stackoverflow, it is actually far more complex. Please read the post before assuming it is a duplicate:
I am trying to calculate variable moving averages crossover with variable dates.
That is: I want to prompt the user for 3 values and 1 option. The input is through a web front end so I can build/edit the query based on input or have multiple queries if needed.
X = 1st moving average term (N day moving average. Any number 1-N)
Y = 2nd moving average term. (N day moving average. Any number 1-N)
Z = Number of days back from the present to search for the occurrence of:
option = Over/Under: (> or <. X passing over Y, or X passing Under Y)
X day moving average passing over OR under Y day moving average
within the past Z days.
My database is structured:
tbl_daily_data (id, stock_id, date, adj_close)
And:
tbl_stocks (stock_id, symbol)
I have a btree index on:
daily_data(stock_id, date, adj_close)
stock_id
I am stuck on this query and having a lot of trouble writing it. If the variables were fixed it would seem trivial but because X, Y, Z are all 100% independent of each other (could look, for example for 5 day moving average within the past 100 days, or 100 day moving average within the past 5) I am having a lot of trouble coding it.
Please help! :(
Edit: I've been told some more context might be helpful?
We are creating an open stock analytic system where users can perform trend analysis. I have a database containing 3500 stocks and their price histories going back to 1970.
This query will be running every day in order to find stocks that match certain criteria
for example:
10 day moving average crossing over 20 day moving average within 5 days
20 day crossing UNDER 10 day moving average within 5 days
55 day crossing UNDER 22 day moving average within 100 days
But each user may be interested in a different analysis so I cannot just store the moving average with each row, it must be calculated.

I am not sure if I fully understand the question ... but something like this might help you get where you need to go: sqlfiddle
SET @X := 5;
SET @Y := 3;
SET @Z := 25;
SET @option := 'under';
SELECT * FROM (
    SELECT T1.stock_id,
           DATEDIFF(CURRENT_DATE(), T1.date) AS days_ago,
           T1.adj_close,
           (
               -- average over the @X most recent rows of the same stock, ending at T1.date
               SELECT AVG(T2.adj_close) AS moving_average
               FROM tbl_daily_data T2
               WHERE T2.stock_id = T1.stock_id
                 AND (
                       SELECT COUNT(*)
                       FROM tbl_daily_data T3
                       WHERE T3.stock_id = T1.stock_id
                         AND T3.date BETWEEN T2.date AND T1.date
                     ) BETWEEN 1 AND @X
           ) AS move_av_1,
           (
               -- same calculation over the @Y most recent rows
               SELECT AVG(T2.adj_close) AS moving_average
               FROM tbl_daily_data T2
               WHERE T2.stock_id = T1.stock_id
                 AND (
                       SELECT COUNT(*)
                       FROM tbl_daily_data T3
                       WHERE T3.stock_id = T1.stock_id
                         AND T3.date BETWEEN T2.date AND T1.date
                     ) BETWEEN 1 AND @Y
           ) AS move_av_2
    FROM tbl_daily_data T1
    WHERE DATEDIFF(CURRENT_DATE(), T1.date) <= @Z
) x
WHERE
    CASE WHEN @option = 'over' AND move_av_1 > move_av_2 THEN 1 ELSE 0 END +
    CASE WHEN @option = 'under' AND move_av_2 > move_av_1 THEN 1 ELSE 0 END > 0
ORDER BY stock_id, days_ago;
Based on the answer by @Tom H here: How do I calculate a moving average using MySQL?
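For readers who want to check the crossover logic outside the database, here is a minimal Python sketch. It is not the query above: it assumes one stock's closing prices are already loaded in date order, treats X, Y and Z as row counts (trading days) rather than calendar days, and the names trailing_average and crossed_within are hypothetical. Unlike the query, which reports every row where one average is above the other, it looks for the day on which the ordering of the two averages actually flips.

```python
from typing import List, Optional


def trailing_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Average of the last `window` prices ending at each index (None until enough rows exist)."""
    out: List[Optional[float]] = []
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            running -= prices[i - window]  # drop the price that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


def crossed_within(prices: List[float], x: int, y: int, z: int, option: str = "over") -> bool:
    """True if the x-row average crossed over (or under) the y-row average in the last z rows."""
    avg_x = trailing_average(prices, x)
    avg_y = trailing_average(prices, y)
    for i in range(max(1, len(prices) - z), len(prices)):
        values = (avg_x[i - 1], avg_y[i - 1], avg_x[i], avg_y[i])
        if None in values:
            continue  # not enough history for one of the averages yet
        prev_x, prev_y, cur_x, cur_y = values
        if option == "over" and prev_x <= prev_y and cur_x > cur_y:
            return True
        if option == "under" and prev_x >= prev_y and cur_x < cur_y:
            return True
    return False


prices = [10, 10, 10, 10, 9, 8, 12, 14]
print(crossed_within(prices, x=2, y=4, z=3, option="over"))
```

Because X, Y and Z are plain arguments here, any combination (for example a 5-row average checked over the last 100 rows, or the reverse) goes through the same code path.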

Related

Efficient SQL Query to calculate portion of a row in half hourly time series that has occurred

I have a table that looks like this:
id  slot              total
----------------------------
1   2022-12-01T12:00  100
2   2022-12-01T12:30  150
3   2022-12-01T13:00  200
There's an index on slot already. The table has ~100mil rows (and a bunch more columns not shown here)
I want to sum the total up to the current moment in time (EDIT: WASN'T CLEAR INITIALLY, I WILL PROVIDE A LOWER SLOT BOUND, SO THE SUM WILL BE OVER SOME NUMBER OF DAYS/WEEKS, NOT OVER FULL TABLE). Let's say the time is currently 2022-12-01T12:45. If I run select * from my_table where slot < CURRENT_TIMESTAMP(),
then I get back records 1 and 2.
However, in my data, the records represent forecasted sales within a time slot. I want to find the forecasts as of 2022-12-01T12:45, and so I want to find the proportion of the half hour slot of record 2 that has elapsed, and return that proportion of the total.
As of 2022-12-01T12:45 (assuming minute granularity), 50% of row 2 has elapsed, so I would expect the total to return as 150 / 2 = 75.
My current query works, but is slow. What are some ways I can optimise this, or other approaches I can take?
Also, how can we extend this solution to be generalised to any interval frequency? Maybe tomorrow we change our forecasting model and the data comes in sporadically. The hardcoded 30 would not work in that case.
select sum(fraction * total) as t from (
    select total,
           LEAST(
               timestampdiff(minute, slot, current_timestamp()),
               30
           ) / 30 as fraction
    from my_table
    where slot <= current_timestamp()
) as elapsed
Consider computing your sum first, then remove the last element partial total. In order to keep the last element total, I'd prefer applying window functions instead of aggregations, and limit the output to the last row.
SET @current_time = CURRENT_TIMESTAMP();
WITH cte AS (
    SELECT slot,
           SUM(total) OVER(ORDER BY slot) AS total,
           total AS rowtotal
    FROM my_table
    WHERE slot < @current_time
    ORDER BY slot DESC
    LIMIT 1
)
SELECT slot,
       total - (30 - TIMESTAMPDIFF(MINUTE, slot, @current_time))
               / 30 * rowtotal AS total
FROM cte;
Check the demo here.
Note1: Adding an index on the slot field is likely to boost this query performance.
Note2: If your query runs over millions of rows, the current timestamp may change while the query is running. You could store it in a variable before the query is run (or in another cte).
Create a btree index on the slot column, as it has high selectivity.
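The proration described in the question (count a slot fully once it has elapsed, and only the elapsed share of the slot currently in progress) can be sketched in Python as follows. This is an illustration under stated assumptions rather than a replacement for the SQL: rows are assumed to be (slot_start, total) pairs already fetched for the desired lower bound, and prorated_total and its slot_length parameter are hypothetical names. Passing the slot length as a parameter is one way to avoid the hardcoded 30.

```python
from datetime import datetime, timedelta


def prorated_total(rows, now, slot_length=timedelta(minutes=30)):
    """Sum `total` over slots that have started, weighting each by its elapsed share."""
    result = 0.0
    for slot_start, total in rows:
        if slot_start > now:
            continue  # slot has not started yet
        elapsed = min(now - slot_start, slot_length)  # cap at one full slot
        result += total * (elapsed / slot_length)     # timedelta / timedelta is a float
    return result


rows = [
    (datetime(2022, 12, 1, 12, 0), 100),
    (datetime(2022, 12, 1, 12, 30), 150),
    (datetime(2022, 12, 1, 13, 0), 200),
]
print(prorated_total(rows, datetime(2022, 12, 1, 12, 45)))  # 100 + 150 / 2
```

With the sample table and a current time of 12:45, row 1 counts in full and row 2 contributes half of its total.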

How to query available item leases based on a date range in MySQL?

We have a business that rents out international phone numbers to customers when traveling. When a customer places an order, we want to display the phone numbers that are available for their booking dates, based on their start_date and end_date, i.e. the numbers that are not already occupied.
Since these phone numbers are rented out, I need to select from the table ONLY those numbers that are not rented out yet for dates that would interfere with the current customers dates.
I also don't want to rent out any phone number prior to 7 days after its end date. Meaning, If a customer booked a phone number for 1-1-2020 through 1-20-2020, I don't want this phone number to be booked by another customer before 1-27-2020. I want the phone number to have a 7 day window of being clear.
I have a table with the phone numbers and a table with the orders that is related to the phone numbers table via phone_number_id. The orders table has the current customers start_date and end_date for travel without the phone number id saved yet to it. The orders table also has the start_date and end_date for all other customers dates of travel as well as which phone_number_id was assigned/booked up for their travel dates.
What would the MySQL query look like to select the phone numbers that are available for the current customer's dates?
This is the query I have built so far:
SELECT x.id
, x.area_code
, x.phone_number
, y.start_date
, y.end_date
FROM vir_num_table x
LEFT
JOIN orderitemsdetail_table y
ON y.vn_id = x.id
WHERE y.start_date BETWEEN '2020-01-11' AND '2020-01-18'
OR y.start_date IS NULL
I've built this query, but I'm stuck on how to add the end_date logic.
Any help would be appreciated! Thanks in advance.
The way I'd approach the problem, conceptually, is as a cross product of the set of all phone numbers with the reservation timeframe, and then exclude those where there's a conflicting reservation.
A conflict would be an overlap: an existing reservation that has a start_date before the end of the proposed reservation AND has an end_date on or after the start of the proposed reservation.
I'd do an anti-join pattern, something like this:
SELECT pn.phone_number
FROM phone_number pn
LEFT
JOIN reservation rs
ON rs.phone_number = pn.phone_number
AND rs.start_dt <= '2019-12-27' + INTERVAL +7 DAY
AND rs.end_dt > '2019-12-20' + INTERVAL -7 DAY
WHERE rs.phone_number IS NULL
That essentially says get all rows from phone number, along with matching rows from reservations (rows that overlap), but then exclude all the rows that had a match, leaving just phone_number rows that did not have a match.
We can change a < test to <= (or subtract 8 days instead of 7) to tailor the "7 day" window; we can tweak this as we run the query through the test cases.
We can achieve an equivalent result using NOT EXISTS and a correlated subquery. Some people find this easier to comprehend than the anti-join, but it's essentially the same query doing the same thing: get all rows from phone_number, but exclude the rows where there is a matching (overlapping) row in reservation.
SELECT pn.phone_number
FROM phone_number pn
WHERE NOT EXISTS
( SELECT 1
FROM reservation rs
WHERE rs.phone_number = pn.phone_number
AND rs.start_dt <= '2019-12-27' + INTERVAL +7 DAY
AND rs.end_dt > '2019-12-20' + INTERVAL -7 DAY
)
There are several questions on StackOverflow about checking for overlap, or no overlap, of date ranges.
See e.g.
How to check if two date ranges overlap in mysql?
PHP/SQL - How can I check if a date user input is in between an existing date range in my database?
MySQL query to select distinct rows based on date range overlapping
EDIT
Based on the SQL added as an edit to the question, I'd do the query like this:
SELECT pn.`id`
, pn.`area_code`
, pn.`phone_number`
FROM `vir_num_table` pn
LEFT
JOIN `orderitemsdetail_table` rs
ON rs.vn_id = pn.id
AND rs.start_date <= '2020-01-18' + INTERVAL +7 DAY
AND rs.end_date > '2020-01-11' + INTERVAL -7 DAY
WHERE rs.vn_id IS NULL
There are two "tricky" parts. The first is the anti-join and understanding how that works (an outer join to return all rows from vir_num_table, then exclude any rows that have a matching row in reservations). The second tricky part is checking for the overlap, coming up with the conditions r.start <= p.end AND r.end >= p.start, then tweaking whether we want to include the equals as an overlap, and tweaking the extra seven days (easiest to me to just subtract the 7 days from the beginning of the proposed reservation).
... now occurs to me like we need to add a guard period of 7 days on the end of the reservation period as well, doh!
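To make the overlap test with a guard period on both sides concrete, here is a small Python sketch of the same check. It assumes reservations are available as (phone_number, start_date, end_date) tuples, and the function names conflicts and available_numbers are hypothetical. The boundary choices (<= on the start side, > on the end side) mirror the queries above and can be tightened or loosened in the same way.

```python
from datetime import date, timedelta


def conflicts(res_start, res_end, new_start, new_end, buffer_days=7):
    """True if an existing reservation overlaps the proposed dates widened by the buffer."""
    buffer = timedelta(days=buffer_days)
    return res_start <= new_end + buffer and res_end > new_start - buffer


def available_numbers(numbers, reservations, new_start, new_end, buffer_days=7):
    """Anti-join in plain Python: keep numbers that have no conflicting reservation."""
    blocked = {
        number
        for number, res_start, res_end in reservations
        if conflicts(res_start, res_end, new_start, new_end, buffer_days)
    }
    return [number for number in numbers if number not in blocked]


reservations = [
    ("A", date(2020, 1, 1), date(2020, 1, 20)),
    ("C", date(2020, 2, 10), date(2020, 2, 20)),
]
print(available_numbers(["A", "B", "C"], reservations, date(2020, 1, 27), date(2020, 2, 5)))
```

With these sample rows, number A (booked through 1-20-2020) becomes bookable again from 1-27-2020, which matches the 7 day rule in the question, while C is excluded because its next booking starts within 7 days of the proposed end date.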
Here's a query plus sorting algo to choose the optimal phone number selection for maximum utilization efficiency (i.e. getting as close as possible to exactly 7 days before and after each use).
I set it to give open ends a weight of 9, so that "near perfect" fits (7-8 days before or after) would be selected ahead of open-ended numbers. This will yield a slight efficiency improvement, as open numbers can accommodate any reservation. You can adjust this for your needs. If you set this to 0, for example, it would always select open numbers first.
SELECT ph.phone_number,
       COALESCE(
           MIN(
               IF(res.start_date > '2020-01-18',
                  NULL, -- ignore before-comparison for reservations starting and ending after date range
                  DATEDIFF('2020-01-11', res.end_date))
           ), 9) AS open_days_before,
       COALESCE(
           MIN(
               IF(res.end_date < '2020-01-11',
                  NULL, -- ignore after-comparison for reservations starting and ending before date range
                  DATEDIFF(res.start_date, '2020-01-18'))
           ), 9) AS open_days_after
FROM phone_number ph
LEFT JOIN reservation res
       ON res.phone_number = ph.phone_number
      AND res.end_date >= CURRENT_DATE() - INTERVAL 6 DAY
GROUP BY ph.phone_number
HAVING open_days_before >= 7
   AND open_days_after >= 7
ORDER BY open_days_before + open_days_after
LIMIT 1
Edit: updated to add grouping, because I realize this is an aggregate problem.
Edit 2: bug fix, changed MAX to MIN
Edit 3: added res.end_date >= CURRENT_DATE - INTERVAL 6 DAY to ignore past reservations, limiting aggregate data and treating phone number with no reservations between 6 days ago and the beginning of the new order as "open on the front-end"
Edit 4: added IF conditions to eliminate reservations outside the given before-or-after comparison ranges (e.g. comparing reservations after the selected range from influencing the "open days before" number), to prevent negative numbers, except when there's overlap with the selected range.
Based on the info you've added then you shouldn't need to check the start date of phone numbers which have been booked out.
Your customer provides you with a start date and an end date.
You only rent out phone numbers 7 days after their last lease ended.
All you need to do is fetch phone numbers which either:
- Are not rented out and therefore aren't in the orderitems table
- OR have an end_date which is at least 7 days before the new customer's start date.
Here you go:
SELECT
`main_table`.`id`,
`main_table`.`area_code`,
`main_table`.`phone_number`,
`orderitemsdetail_table`.`start_date`,
`orderitemsdetail_table`.`end_date`
FROM
`vir_num_table` AS `main_table`
LEFT JOIN
`orderitemsdetail_table` AS `orderitemsdetail_table` ON main_table.id = orderitemsdetail_table.vn_id
WHERE
(DATE_ADD(orderitemsdetail_table.end_date, INTERVAL 7 DAY) < '<CUSTOMER START DATE>'
AND orderitemsdetail_table.start_date > '<CUSTOMER END DATE>')
OR orderitemsdetail_table.id IS NULL

MySQL Daily Time Coverage Without Gaps

I have a table like the following example:
What I need to do is return the coverage (number of hours an operator/s were onsite) for each day. The challenge is that I need to ignore gaps in coverage and not double count hours where two operators were signed in at the same time. For instance, the image below is a visual representation of the table.
The logic of the image is as follows:
Operator A: Signed in at 10 and signed out at noon for a total of 2 hours
Operator B: Signed in at 1 and signed out at 3 for a total of 2 hours
Operator A: Came back and signed in at 2 and signed out at 5 for a total of 3 hours, but 1 hour overlaps with operator B so I cannot count that 1 hour, otherwise I will be double counting coverage
Therefore the total coverage time without overlaps is 6 hours, which is the value I need the query to produce. So far I can ignore double counting by taking the max and min dates of each day and subtracting the two:
SELECT YEAR, WEEK, SUM(HOURS)
FROM
(SELECT
YEAR(SignedIn) AS YEAR,
WEEK(SignedIn) AS WEEK,
DAY(SignedIn) AS DAY,
time_to_sec(timediff(MAX(SignedOut), MIN(SignedIn)))/ 3600 AS HOURS
FROM OperatorLogs
GROUP BY YEAR, WEEK, DAY) As VirtualTable
GROUP BY YEAR, WEEK
Which produces 7 because it takes the first sign-in (10 AM) and calculates the hours up until the last sign-out (4:00 PM). However, it includes the gap in coverage (12 - 1) which should not be included. I am unsure of how to remove that time from the total hours while also not double counting when there is overlap, i.e. from 2-3 there should only be 1 hour of coverage even though two separate operators are on site each putting in an hour. Any help is appreciated.
Sorry, work interrupted me.
Here's my working solution, I'm not convinced it's optimal due to the (relatively) expensive nature of the joins, but I've optimised it slightly based on the soft-rule that "shifts" never span multiple days.
SELECT
calendar_date,
SUM(coverage_seconds) / 3600 AS coverage_hours
FROM
(
-- Signins that didn't happen within another operators shift
SELECT DISTINCT
DATE(e.signedin) AS calendar_date,
-(UNIX_TIMESTAMP(e.signedin) MOD 86400) AS coverage_seconds
FROM
OperatorLogs e
LEFT JOIN
OperatorLogs o
ON o.signedin >= DATE(e.signedin)
AND o.signedin < e.signedin
AND o.signedout >= e.signedin
WHERE
o.signedin IS NULL
UNION ALL
-- Signouts that didn't happen within another operators shift
SELECT DISTINCT
DATE(e.signedout) AS calendar_date,
+(UNIX_TIMESTAMP(e.signedout) MOD 86400) AS coverage_seconds
FROM
OperatorLogs e
LEFT JOIN
OperatorLogs o
ON o.signedin >= DATE(e.signedout)
AND o.signedin <= e.signedout
AND o.signedout > e.signedout
WHERE
o.signedin IS NULL
)
AS coverage_markers
GROUP BY
calendar_date
;
Feel free to test it with more rigorous data...
https://www.db-fiddle.com/f/4RgWVhcdNEro21rUksVdXD/0
(As a note, to make your sample data match your excel image, your first shift should have started at 9am)
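As a cross-check for whatever query is used, the expected coverage for one day can be computed with the usual merge-overlapping-intervals approach. The Python sketch below is an illustration only: it assumes shifts are given as (start_hour, end_hour) pairs on a 24-hour clock within a single day, and coverage_hours is a hypothetical helper name.

```python
def coverage_hours(shifts):
    """Total hours covered by at least one shift, ignoring gaps and not double counting overlaps."""
    covered = 0.0
    current_start = current_end = None
    for start, end in sorted(shifts):
        if current_end is None or start > current_end:
            # A gap: close the block that was open (if any) and start a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching shift: extend the open block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Shifts from the description: 10-12, 1-3 PM and 2-5 PM.
print(coverage_hours([(10, 12), (13, 15), (14, 17)]))
```

For the three shifts described in the question this gives 6 hours: 2 hours for the first block and 4 hours for the merged 1 PM to 5 PM block.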

SQL Calculating Moving Average Crossover of variable lengths [duplicate]

This question already has answers here:
How to calculated multiple moving average in MySQL
(3 answers)
Closed 9 years ago.
I am trying to calculate moving averages crossover with variable dates.
My database is structured:
id
stock_id
date
closing_price
And:
stock_id
symbol
For example, I'd like to find out if the average price going back X days ever gets greater than the average price going back Y days within the past Z days. Each of those time periods is variable. This needs to be run for every stock in the database (about 3000 stocks with prices going back 100 years).
I'm a bit stuck on this. What I currently have is a mess of SQL subqueries that don't work because they can't account for the fact that X, Y, and Z can all be any value (0-N). That is, in the past 5 days I could be looking for a stock where the 40 day average is > than 5, or the 5 > 40. Or I could be looking over the past 40 days to find stocks where the 10 day moving average is > 30 day moving average.
This question is different from the other questions as there is variable short and long dates as well as a variable term.
Please see these earlier posts on Stack Overflow:
How to calculated multiple moving average in MySQL
Calculate moving averages in SQL
These posts have solutions to your question.
I think the most direct way to do a moving average in MySQL is using a correlated subquery. Here is an example:
select p.*,
       (select avg(p2.closing_price)
        from prices p2
        where p2.stock_id = p.stock_id and
              p2.date between p.date - interval x day and p.date
       ) as MvgAvg_X,
       (select avg(p2.closing_price)
        from prices p2
        where p2.stock_id = p.stock_id and
              p2.date between p.date - interval y day and p.date
       ) as MvgAvg_Y
from prices p
You need to fill in the values for x and y.
For performance reasons, you will want an index on prices(stock_id, date, closing_price).
If you have an option for another database, Oracle, Postgres, and SQL Server 2012 all offer much better performing solutions for this problem.
In Postgres, you can write this as:
select p.*,
       avg(p.closing_price) over (partition by p.stock_id order by p.date rows x preceding) as AvgX,
       avg(p.closing_price) over (partition by p.stock_id order by p.date rows y preceding) as AvgY
from prices p

MySQL: Average interval between records

Assume this table:
id date
----------------
1 2010-12-12
2 2010-12-13
3 2010-12-18
4 2010-12-22
5 2010-12-23
How do I find the average intervals between these dates, using MySQL queries only?
For instance, the calculation on this table will be
(
( 2010-12-13 - 2010-12-12 )
+ ( 2010-12-18 - 2010-12-13 )
+ ( 2010-12-22 - 2010-12-18 )
+ ( 2010-12-23 - 2010-12-22 )
) / 4
----------------------------------
= ( 1 DAY + 5 DAY + 4 DAY + 1 DAY ) / 4
= 2.75 DAY
Intuitively, what you are asking should be equivalent to the interval between the first and last dates, divided by the number of dates minus 1.
Let me explain more thoroughly. Imagine the dates are points on a line (+ are dates present, - are dates missing, the first date is the 12th, and I changed the last date to Dec 24th for illustration purposes):
++----+---+-+
Now, what you really want to do, is evenly space your dates out between these lines, and find how long it is between each of them:
+--+--+--+--+
To do that, you simply take the number of days between the last and first days, in this case 24 - 12 = 12, and divide it by the number of intervals you have to space out, in this case 4: 12 / 4 = 3.
With a MySQL query
SELECT DATEDIFF(MAX(dt), MIN(dt)) / (COUNT(dt) - 1) FROM a;
This works on this table (with your values it returns 2.75; with the sample rows below, where the last date is Dec 24th, it returns 3):
CREATE TABLE IF NOT EXISTS `a` (
`dt` date NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `a` (`dt`) VALUES
('2010-12-12'),
('2010-12-13'),
('2010-12-18'),
('2010-12-22'),
('2010-12-24');
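The shortcut works because the consecutive gaps telescope: their sum is just the last date minus the first. A short Python sketch, using the question's dates and a hypothetical helper name average_interval_days, shows both calculations agreeing.

```python
from datetime import date


def average_interval_days(dates):
    """(last - first) / (count - 1); None when there are fewer than two dates."""
    ordered = sorted(dates)
    if len(ordered) < 2:
        return None
    return (ordered[-1] - ordered[0]).days / (len(ordered) - 1)


dates = [
    date(2010, 12, 12),
    date(2010, 12, 13),
    date(2010, 12, 18),
    date(2010, 12, 22),
    date(2010, 12, 23),
]

# The long way: average the gaps between consecutive dates.
gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
print(sum(gaps) / len(gaps), average_interval_days(dates))
```

Both expressions evaluate to 2.75 for the question's data.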
If the ids are uniformly incremented without gaps, join the table to itself on id+1:
SELECT d.id, d.date, n.date, DATEDIFF(n.date, d.date) -- next date minus current date
FROM dates d
JOIN dates n ON (n.id = d.id + 1)
Then GROUP BY and average as needed.
If the ids are not uniform, do an inner query to assign ordered ids first.
I guess you'll also need to add a subquery to get the total number of rows.
Alternatively
Create an aggregate function that keeps track of the previous date, and a running sum and count. You'll still need to select from a subquery to force the ordering by date (actually, I'm not sure if that's guaranteed in MySQL).
Come to think of it, this is a much better way of doing it.
And Even Simpler
Just noting that Vegard's solution is much better.
The following query returns the correct result
SELECT AVG(
DATEDIFF(i.date, (SELECT MAX(date)
FROM intervals WHERE date < i.date)
)
)
FROM intervals i
but it runs a dependent subquery which might be really inefficient with no index and on a larger number of rows.
You need to do a self join, get the differences using the DATEDIFF function, and then take the average.