Fastest way to get closest data from multiple tables based on time - mysql

I have three tables, with the following setup:
TEMPERATURE_1
  time
  zone (FK)
  temperature
TEMPERATURE_2
  time
  zone (FK)
  temperature
TEMPERATURE_3
  time
  zone (FK)
  temperature
The data in each table is updated periodically, but not necessarily concurrently (i.e., the time entries are not identical).
I want to be able to access the closest reading from each table for each time, i.e.:
TEMPERATURES
  time
  zone (FK)
  temperature_1
  temperature_2
  temperature_3
In other words, for every unique time across my three tables, I want a row in the TEMPERATURES table, where the temperature_n values are the temperature reading closest in time from each original table.
At the moment, I've set this up using two views:
create view temptimes
as select time, zone
from temperature_1
union
select time, zone
from temperature_2
union
select time, zone
from temperature_3;
create view temperatures
as select tt.time,
tt.zone,
(select temperature
from temperature_1
order by abs(timediff(time, tt.time))
limit 1) as temperature_1,
(select temperature
from temperature_2
order by abs(timediff(time, tt.time))
limit 1) as temperature_2,
(select temperature
from temperature_3
order by abs(timediff(time, tt.time))
limit 1) as temperature_3
from temptimes as tt
order by tt.time;
This approach works, but it is too slow to use in production (it takes several minutes even for small data sets of ~1000 records per temperature table).
I'm not great with SQL, so I'm sure I'm missing the correct way to do this. How should I approach the problem?

The expensive part is where the correlated subqueries have to compute the time difference for every single row of each temperature_* table to find just one closest row for one column of one row in the main query.
It would be dramatically faster if you could just pick one row after and one row before the current time according to an index and only compute the time difference for these two candidates. All you need for that to be fast is an index on the column time in your tables.
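For example (a sketch; the index names are made up and you may already have something equivalent in place):
-- one index per table; the names are arbitrary
CREATE INDEX temperature_1_time_idx ON temperature_1 (time);
CREATE INDEX temperature_2_time_idx ON temperature_2 (time);
CREATE INDEX temperature_3_time_idx ON temperature_3 (time);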
I am ignoring the column zone, since its role remains unclear in the question and it just adds more noise to the core problem. It should be easy to add to the query.
Without an additional view, this query does all at once:
SELECT time
      ,COALESCE(temp1
               ,CASE WHEN timediff(time, time1a) > timediff(time1b, time) THEN
                          (SELECT t.temperature
                           FROM   temperature_1 t
                           WHERE  t.time = y.time1b)
                     ELSE
                          (SELECT t.temperature
                           FROM   temperature_1 t
                           WHERE  t.time = y.time1a)
                END) AS temp1
      ,COALESCE(temp2
               ,CASE WHEN timediff(time, time2a) > timediff(time2b, time) THEN
                          (SELECT t.temperature
                           FROM   temperature_2 t
                           WHERE  t.time = y.time2b)
                     ELSE
                          (SELECT t.temperature
                           FROM   temperature_2 t
                           WHERE  t.time = y.time2a)
                END) AS temp2
      ,COALESCE(temp3
               ,CASE WHEN timediff(time, time3a) > timediff(time3b, time) THEN
                          (SELECT t.temperature
                           FROM   temperature_3 t
                           WHERE  t.time = y.time3b)
                     ELSE
                          (SELECT t.temperature
                           FROM   temperature_3 t
                           WHERE  t.time = y.time3a)
                END) AS temp3
FROM  (
      SELECT time
            ,max(t1) AS temp1
            ,max(t2) AS temp2
            ,max(t3) AS temp3
            ,CASE WHEN max(t1) IS NULL THEN
                  (SELECT t.time FROM temperature_1 t
                   WHERE  t.time < x.time
                   ORDER  BY t.time DESC LIMIT 1) ELSE NULL END AS time1a
            ,CASE WHEN max(t1) IS NULL THEN
                  (SELECT t.time FROM temperature_1 t
                   WHERE  t.time > x.time
                   ORDER  BY t.time LIMIT 1) ELSE NULL END AS time1b
            ,CASE WHEN max(t2) IS NULL THEN
                  (SELECT t.time FROM temperature_2 t
                   WHERE  t.time < x.time
                   ORDER  BY t.time DESC LIMIT 1) ELSE NULL END AS time2a
            ,CASE WHEN max(t2) IS NULL THEN
                  (SELECT t.time FROM temperature_2 t
                   WHERE  t.time > x.time
                   ORDER  BY t.time LIMIT 1) ELSE NULL END AS time2b
            ,CASE WHEN max(t3) IS NULL THEN
                  (SELECT t.time FROM temperature_3 t
                   WHERE  t.time < x.time
                   ORDER  BY t.time DESC LIMIT 1) ELSE NULL END AS time3a
            ,CASE WHEN max(t3) IS NULL THEN
                  (SELECT t.time FROM temperature_3 t
                   WHERE  t.time > x.time
                   ORDER  BY t.time LIMIT 1) ELSE NULL END AS time3b
      FROM  (
            SELECT time, temperature AS t1, NULL AS t2, NULL AS t3 FROM temperature_1
            UNION ALL
            SELECT time, NULL AS t1, temperature AS t2, NULL AS t3 FROM temperature_2
            UNION ALL
            SELECT time, NULL AS t1, NULL AS t2, temperature AS t3 FROM temperature_3
            ) AS x
      GROUP BY time
      ) y
ORDER BY time;
->sqlfiddle
Explanation
Subquery x replaces your view temptimes and brings the temperatures into the result. If all three tables are in sync and have temperatures for all the same points in time, the rest is not even needed and the query is extremely fast.
For every point in time where one of the three tables has no row, the temperature is being fetched as instructed: take the "closest" one from each table.
Subquery y aggregates the rows from x and fetches the previous time (time1a) and the next time (time1b) relative to the current time from each table where the temperature is missing. These lookups should be fast using the index.
The final query fetches the temperature from the row with the closest time for each temperature that's actually missing.
This query could be simpler if MySQL allowed references to columns more than one level above the current subquery, but it cannot. It works just fine in PostgreSQL: ->sqlfiddle
It also would be simpler if one could return more than one column from a correlated subquery, but I don't know how to do that in MySQL.
And it would be much simpler with CTEs and window functions, but MySQL (before version 8.0) doesn't know these modern SQL features (unlike other relevant RDBMS).
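If you are on MySQL 8.0.14 or later, a LATERAL derived table is also an option: it is allowed to reference tt.time directly, so each candidate row can be fetched straight from the index. A sketch for temperature_1 only (the other two tables follow the same pattern, and it assumes the temptimes view from the question):
-- sketch: requires MySQL 8.0.14+ (LATERAL derived tables); uses the temptimes view from the question
SELECT tt.time
      ,CASE WHEN b.time IS NULL THEN a.temperature
            WHEN a.time IS NULL THEN b.temperature
            WHEN timediff(tt.time, b.time) <= timediff(a.time, tt.time) THEN b.temperature
            ELSE a.temperature
       END AS temperature_1
FROM   temptimes tt
LEFT   JOIN LATERAL (SELECT t.time, t.temperature
                     FROM   temperature_1 t
                     WHERE  t.time <= tt.time
                     ORDER  BY t.time DESC LIMIT 1) b ON true
LEFT   JOIN LATERAL (SELECT t.time, t.temperature
                     FROM   temperature_1 t
                     WHERE  t.time > tt.time
                     ORDER  BY t.time LIMIT 1) a ON true
ORDER  BY tt.time;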

The reason this is slow is that it requires three full table scans (one per table, for every output row) to calculate and order the differences.
I assume that you already have indexes on the time and zone columns - at the moment they won't help because of the table-scan problem.
There are a number of options to avoid this depending on what you need and what the data collection rates are.
You have already said that the data is collected periodically but not concurrently. This suggests a few options.
To what level of significance do you need the temperature data - the day, the hour, the minute, etc.? Store the time and zone info to that level of significance only (or have another column that does) and do your queries on that.
If you know that the 3 closest times will be within a certain time frame (an hour, a day, etc.), put in a WHERE clause to limit the calculation to those times that are potential candidates, as sketched below. You are effectively constructing histogram-type buckets - you will need a calendar table to do this efficiently.
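For example, applied to one of the correlated subqueries from the question's view (the one-hour window is an assumption - pick whatever bound matches your collection rate):
(select temperature
 from temperature_1
 -- assumption: readings are never more than an hour apart
 where time between tt.time - interval 1 hour
               and tt.time + interval 1 hour
 order by abs(timediff(time, tt.time))
 limit 1) as temperature_1,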
Make the comparison unidirectional, i.e. limit consideration to only those times after the time you are looking for; so if you are looking for 12:00:00 then 13:45:32 is a candidate but 11:59:59 isn't.
I understand what you are trying to accomplish - ask yourself why, and whether a simpler solution will meet your needs.

My suggestion is that you don't take the closest time, but you take the first time on or before a given time. The reason for this is simple: generally the data for a given time is what is known at that time. Incorporating future information is generally not a good idea for most purposes.
With this change, you can modify your query to take advantage of an index on time. The problem with your current query is that wrapping time in a function (abs(timediff(...))) prevents the index from being used.
So, if you want the most recent temperature, use this instead for each variable:
(select temperature
from temperature_1 t2
where t2.time <= tt.time
order by t2.time desc
limit 1
) as temperature_1,
Actually, you can also construct it like this:
(select time
from temperature_1 t2
where t2.time <= tt.time
order by t2.time desc
limit 1
) as time_1,
And then join the information for the temperature back in. This will be efficient, with the use of an index.
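A minimal sketch of that join-back, assuming the temptimes view from the question (shown for temperature_1 only):
select tt.time, tt.zone, t1.temperature as temperature_1
from temptimes tt
left join temperature_1 t1
       on t1.time = (select max(t2.time)
                     -- nearest reading at or before tt.time, found via the index on time
                     from temperature_1 t2
                     where t2.time <= tt.time)
order by tt.time;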
With that in mind, you could actually have two variables time_1_before and time_1_after, for the best time on or before and the best time on or after. You can use logic in the select to choose the nearest value. The joins back to the temperature should be efficient using an index.
But, I will reiterate, I think the last temperature on or before may be the best choice.

Related

condition in SELECT in mysql

I have a query that looks like this
SELECT customer, totalvolume
FROM orders
WHERE deliverydate BETWEEN '2020-01-01' AND CURDATE()
Is there any way to select totalvolume for a specific date range and make it a separate column?
So, for example, I already have totalvolume. I'd also like to add totalvolume for the previous month as a separate column (totalvolume where deliverydate BETWEEN '2020-08-01' AND '2020-08-31'). Is there a function for that?
Simply use two copies of the table:
SELECT t1.customer, t1.totalvolume, t2.totalvolume previousvolume
FROM orders t1
LEFT JOIN orders t2 ON t1.customer = t2.customer
AND t1.deliverydate = t2.deliverydate + INTERVAL 1 MONTH
WHERE t1.deliverydate BETWEEN '2020-08-01' AND '2020-08-31';
You can do it with a case/when construct in your columns and just expand your WHERE clause. Sometimes I do this with secondary @variables to simplify the clauses. Something like:
SELECT
      o.customer,
      sum( case when o.deliveryDate < @beginOfMonth
                then o.TotalVolume else 0 end ) PriorMonthVolume,
      sum( case when o.deliveryDate >= @beginOfMonth
                then o.TotalVolume else 0 end ) ThisMonthVolume,
      sum( o.totalvolume ) TwoMonthsVolume
FROM
      ( select @myToday := date(curdate()),
               @beginOfMonth := date_sub( @myToday, interval dayOfMonth( @myToday ) -1 day ),
               @beginLastMonth := date_sub( @beginOfMonth, interval 1 month ) ) SqlVars,
      orders o
WHERE
      o.deliverydate >= @beginLastMonth
group by
      o.customer
To start, the "from" clause of the query alias "SqlVars" will dynamically create 3 variables and return a single row for that set. With no JOIN condition, is always a 1:1 ratio for everything in the orders table. Nice thing, you don't have to pre-declare variables and the #variables are available for the query.
By querying for all records on or after the beginning of the LAST month, you get all records for both months in question. The sum( case/when ) can now use those variables as the demarcation point for the respective volume totals.
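If you prefer to avoid user variables entirely, the same case/when idea works with literal month boundaries (a sketch; the dates here are examples - substitute the range you actually need):
SELECT
      o.customer,
      -- example literals: August 2020 as the prior month, September 2020 as the current one
      sum( case when o.deliverydate <  '2020-09-01'
                then o.totalvolume else 0 end ) PriorMonthVolume,
      sum( case when o.deliverydate >= '2020-09-01'
                then o.totalvolume else 0 end ) ThisMonthVolume
FROM  orders o
WHERE o.deliverydate >= '2020-08-01'
group by
      o.customer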
I know you mentioned this was a simplified query, so this may not be a perfect answer to what you need, but it may help you look at it from a different querying perspective.

SQL query Compare two WHERE clauses using same table

I am looking to compare two sets of data that are stored in the same table. I am sorry if this is a duplicate SO post; I have read some other posts but have not been able to apply them to solve my problem.
I am running a query to show all Athletes and times for the most recent date (2017-05-20):
SELECT `eventID`,
`location`,
`date`,
`barcode`,
`runner`,
`Gender`,
`time` FROM `TableName` WHERE `date`='2017-05-20'
I would like to compare the time achieved on the 20th May with the previous time for each athlete.
SELECT `time` FROM `TableName` WHERE `date`='2017-05-13'
How can I structure my query to show all of the ATHLETES, TIME on the 13th, and TIME on the 20th?
I have tried some methods, for example UNION ALL.
You can get the previous time using a correlated subquery:
SELECT t.*,
(SELECT t2.time
FROM TableName t2
WHERE t2.runner = t.runner AND t2.eventId = t.eventId AND
t2.date < t.date
ORDER BY t2.date DESC
LIMIT 1
) prev_time
FROM `TableName` t
WHERE t.date = '2017-05-20';
For performance, you want an index on (runner, eventid, date, time).
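If you literally want one row per athlete with the time on the 13th and the time on the 20th side by side, conditional aggregation is another option (a sketch; it assumes runner identifies the athlete and that there is at most one time per runner per date):
SELECT runner,
       MAX(CASE WHEN `date` = '2017-05-13' THEN `time` END) AS time_13th,
       MAX(CASE WHEN `date` = '2017-05-20' THEN `time` END) AS time_20th
FROM TableName
WHERE `date` IN ('2017-05-13', '2017-05-20')
GROUP BY runner;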

SQL Query to group and add time between consecutive rows

Need help with SQL Query (MySQL)
Say I have a table with data as..
The table has the Latitude and Longitude locations logged for a person at some time intervals (TIME column), and the DISTANCE_TRAVELLED column has the distance travelled from the previous record.
If I want to know how many minutes a person was not moving (i.e. DISTANCE_TRAVELLED <= 0.001), what query should I use?
Can we also group the data by date? Basically I want to know how many minutes the person was idle on a specific day.
You need to get the previous time for each record. I like to do this using a correlated subquery:
select t.*,
(select t2.time
from table t2
where t2.device = t.device and t2.time < t.time
order by time desc
limit 1
) as prevtime
from table t;
Now you can get the number of minutes not moved, as something like:
select t.*, TIMESTAMPDIFF(MINUTE, prevtime, time) as minutes
from (select t.*,
(select t2.time
from table t2
where t2.device = t.device and t2.time < t.time
order by time desc
limit 1
) as prevtime
from table t
) t
The rest of what you request is just adding the appropriate where clause or group by clause. For instance:
select device, date(time), sum(TIMESTAMPDIFF(MINUTE, prevtime, time)) as minutes
from (select t.*,
(select t2.time
from table t2
where t2.device = t.device and t2.time < t.time
order by time desc
limit 1
) as prevtime
from table t
) t
where distance_travelled <= 0.001
group by device, date(time)
EDIT:
For performance, create an index on table(device, time).
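On MySQL 8.0 or later, the correlated subquery can be replaced with the LAG() window function; a sketch using the same placeholder names as above:
select device, date(time),
       sum(TIMESTAMPDIFF(MINUTE, prevtime, time)) as minutes
from (select t.*,
             -- previous time for the same device, computed by a window function
             lag(time) over (partition by device order by time) as prevtime
      from `table` t
     ) t
where distance_travelled <= 0.001
group by device, date(time)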

MySql -- Determine periods of missing data with query

I have a database that's set up like this:
(Schema Name)
Historical
-CID int UQ AI NN
-ID Int PK
-Location Varchar(255)
-Status Varchar(255)
-Time datetime
So an entry might look like this
433275 | 97 | MyLocation | OK | 2013-08-20 13:05:54
My question is, if I'm expecting 5-minute interval data from each of my sites, how can I determine how long a site has been down?
For example, if MyLocation didn't send in the 5-minute interval data from 13:05:54 until 14:05:54, it would have missed 60 minutes worth of intervals. How could I find this downtime and report on it easily?
Thanks,
*Disclaimer: I'm assuming that your time column determines the order of the entries in your table and that you can't easily (and without heavy performance loss) self-join the table on the auto_increment column, since it can contain gaps.*
Either you create a table containing simply datetime values and do something like
SELECT d.datetimevalue
FROM datetime_table d
LEFT JOIN your_table y ON DATE_FORMAT(d.datetimevalue, '%Y-%m-%d %H:%i:00') = DATE_FORMAT(y.`time`, '%Y-%m-%d %H:%i:00')
WHERE y.some_column IS NULL
(the date_format() function is used here to get rid of the seconds part in the datetime values).
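Building that helper table is simple; on MySQL 8.0+ a recursive CTE can generate a 5-minute grid, for example (a sketch - the table name, column name and date range are made up):
-- sketch: MySQL 8.0+; names and the one-day range are placeholders
CREATE TABLE datetime_table (datetimevalue DATETIME PRIMARY KEY);
INSERT INTO datetime_table (datetimevalue)
WITH RECURSIVE grid AS (
  SELECT TIMESTAMP('2013-08-20 00:00:00') AS dt
  UNION ALL
  SELECT dt + INTERVAL 5 MINUTE FROM grid WHERE dt < '2013-08-20 23:55:00'
)
SELECT dt FROM grid;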
Or you use user-defined variables:
SELECT * FROM (
  SELECT
    y.*,
    TIMESTAMPDIFF(MINUTE, @prevDT, `Time`) AS timedifference,
    @prevDT := `Time`
  FROM your_table y ,
  (SELECT @prevDT := (SELECT MIN(`Time`) FROM your_table)) vars
  ORDER BY `Time`
) sq
WHERE timedifference > 5
EDIT: I thought you wanted to scan the whole table (or parts of it) for rows where the time difference to the previous row is greater than 5 minutes. To check for a specific ID (and still with the same assumptions as in the disclaimer) you'd have to take a different approach:
SELECT
TIMESTAMPDIFF(MINUTE, (SELECT `Time` FROM your_table sy WHERE sy.ID < y.ID ORDER BY ID DESC LIMIT 1), `Time`) AS timedifference
FROM your_table y
WHERE ID = whatever
EDIT 2:
When you say "if the ID is currently down", is there already an entry in your table or not? If not, you can simply check how many minutes ago the last entry arrived via
SELECT TIMESTAMPDIFF(MINUTE, (SELECT MAX(`Time`) FROM your_table WHERE ID = whatever), NOW());
So I assume you are going to have some sort of cron job running to check this table. If that is the case, you can simply check for the highest time value for each id/location and compare it against the current time to flag any IDs whose most recent time is older than the specified threshold. You can do that like this:
SELECT id, location, MAX(time) as most_recent_time
FROM Historical
GROUP BY id, location
HAVING most_recent_time < DATE_SUB(NOW(), INTERVAL 5 MINUTE)
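To keep that check cheap as Historical grows, it helps to have an index that leads with the grouping column, so each group's latest time can come straight from the index; for example (hypothetical index name):
-- hypothetical index name; (ID, Time) lets MySQL resolve MAX(`Time`) per ID without a full scan
CREATE INDEX idx_historical_id_time ON Historical (ID, `Time`);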
Something like this:
SELECT h1.ID, h1.location, h1.time, min(h2.time)
FROM Historical h1 LEFT JOIN Historical h2
ON (h1.ID = h2.ID AND h2.CID > h1.CID)
WHERE now() > h1.time + INTERVAL 301 SECOND
GROUP BY h1.ID, h1.location, h1.time
HAVING min(h2.time) IS NULL
OR min(h2.time) > h1.time + INTERVAL 301 SECOND

How to calculate a moving average in MySQL in a correlated subquery?

I want to create a timeline report that shows, for each date in the timeline, a moving average of the latest N data points in a data set that has some measures and the dates they were measured. I have a calendar table populated with every day to provide the dates. I can calculate a timeline to show the overall average prior to that date fairly simply with a correlated subquery (the real situation is much more complex than this, but it can essentially be simplified to this):
SELECT c.date
, ( SELECT AVG(m.value)
FROM measures as m
WHERE m.measured_on_dt <= c.date
) as `average_to_date`
FROM calendar c
WHERE c.date between date1 AND date2 -- graph boundaries
ORDER BY c.date ASC
I've spent days reading around this and I've not found any good solutions. Some have suggested that LIMIT might work in the subquery (LIMIT is supported in subqueries in the current version of MySQL); however, LIMIT applies to the returned set, not to the rows going into the aggregate, so adding it makes no difference.
Nor can I write a non-aggregated SELECT with a LIMIT and then aggregate over that, because a correlated subquery is not allowed inside a FROM clause. So this (sadly) WON'T work:
SELECT c.date
, ( SELECT AVG(last_5.value)
FROM ( SELECT m.value
FROM measures as m
WHERE m.measured_on_dt <= c.date
ORDER BY m.measured_on_dt DESC
LIMIT 5
) as `last_5`
) as `average_last_5`
FROM calendar c
WHERE c.date between date1 AND date2 -- graph boundaries
ORDER BY c.date ASC
I'm thinking I need to avoid the subquery approach completely and see if I can do this with a clever join / row-numbering technique using user variables, but while I'm working on that I thought I'd ask if anyone knew a better method.
UPDATE:
Okay, I've got a solution working, which I've simplified for this example. It relies on some user-variable trickery to number the measures backwards from the calendar date. It also does a cross product with the calendar table (instead of a subquery), but this has the unfortunate side-effect of breaking the row-numbering trick (user variables are evaluated when they're sent to the client, not when the row is evaluated), so to work around this I've had to nest the query one level, order the results, and then apply the row-numbering trick to that set, which then works.
This query only returns calendar dates for which there are measures, so if you wanted the whole timeline you'd simply select the calendar and LEFT JOIN to this result set.
set @day = 0;
set @num = 0;
set @LIMIT = 5;
SELECT full_date
, AVG(value) as recent_N_AVG
FROM
( SELECT *
, @num := if(@day = full_date, @num + 1, 1) as day_row_number
, @day := full_date as dummy
FROM
( SELECT c.full_date
, m.value
, m.measured_on_dt
FROM calendar c
JOIN measures as m
WHERE m.measured_on_dt <= c.full_date
AND c.full_date BETWEEN date1 AND date2
ORDER BY c.full_date ASC, measured_on_dt DESC
) as full_data
) as numbered
WHERE day_row_number <= @LIMIT
GROUP BY full_date
The row numbering trick can be generalised to more complex data (my measures are in several dimensions which need aggregating up).
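For what it's worth, on MySQL 8.0+ the row-numbering trick is no longer needed; a window frame over the last N rows does the same thing directly (a sketch - as described above, LEFT JOIN the calendar to this result if you need the full timeline):
SELECT m.measured_on_dt,
       -- average of the 5 most recent measurements up to and including this one
       AVG(m.value) OVER (ORDER BY m.measured_on_dt
                          ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS recent_5_avg
FROM measures m
ORDER BY m.measured_on_dt;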
If your timeline is continuous (1 value each day) you could improve your first attempt like this:
SELECT c.date,
( SELECT AVG(m.value)
FROM measures as m
WHERE m.measured_on_dt
BETWEEN DATE_SUB(c.date, INTERVAL 5 day) AND c.date
) as `average_to_date`
FROM calendar c
WHERE c.date between date1 AND date2 -- graph boundaries
ORDER BY c.date ASC
If your timeline has holes in it, this would result in fewer than 5 values going into the average.