Speed up the SQL query - mysql

I have stored temperatures in a MySQL database. The table is called temperatures. It contains, for example, the columns dtime and temperature. The first one is the time the temperature was measured (the column type is DATETIME) and the latter, well, apparently the temperature (the type is FLOAT).
At the moment I use the following query to fetch the temperatures in a certain period.
SELECT dtime, temperature
FROM temperatures
WHERE dtime BETWEEN "2012-11-15 00:00:00" AND "2012-11-30 23:59:59"
ORDER BY dtime DESC
I'd like to add the average temperature of the day in the results. I tried the following.
SELECT
dtime AS cPVM,
temperature,
(
SELECT AVG(temperature)
FROM temperatures
WHERE DATE(dtime) = DATE(cPVM)
) AS avg
FROM temperatures
WHERE dtime BETWEEN "2012-11-15 00:00:00" AND "2012-11-30 23:59:59"
ORDER BY dtime DESC
Works ok, but this is really, really slow. Fetching the results in that period takes about 5 seconds, when the first one (without the averages) is done in 0.03 seconds.
SELECT DATE(dtime), AVG(temperature)
FROM temperatures
WHERE DATE(dtime) BETWEEN "2012-11-15" AND "2012-11-30"
GROUP BY DATE(dtime)
ORDER BY dtime DESC
This one however is done in 0.04 seconds.
How do I fetch the average temperatures more efficiently?

Use a join instead of a correlated subquery:
SELECT dtime, temperature, avg_temperature
FROM temperatures
JOIN (
SELECT DATE(dtime) AS date_dtime, AVG(temperature) AS avg_temperature
FROM temperatures
WHERE dtime >= '2012-11-15' AND dtime < '2012-12-01'
GROUP BY DATE(dtime)
) AS avg_t
ON date_dtime = DATE(dtime)
WHERE dtime >= '2012-11-15' AND dtime < '2012-12-01'
ORDER BY dtime DESC
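A note on why the correlated version is slow: DATE(dtime) = DATE(cPVM) wraps the column in a function, so an index on dtime cannot narrow the search, and the subquery looks at every row of the table for each row of the result. If you would rather keep the subquery form, a range comparison is one possible sketch. It assumes an index on dtime (the index name below is made up), and it is worth checking with EXPLAIN whether your MySQL version really uses the index here:

```sql
-- Assumption: no index on dtime exists yet; the index name is hypothetical.
CREATE INDEX idx_temperatures_dtime ON temperatures (dtime);

SELECT t.dtime,
       t.temperature,
       (
         SELECT AVG(t2.temperature)
         FROM temperatures AS t2
         -- Half-open range covering the calendar day of the outer row.
         -- No function is applied to t2.dtime, so an index on it is usable.
         WHERE t2.dtime >= DATE(t.dtime)
           AND t2.dtime <  DATE(t.dtime) + INTERVAL 1 DAY
       ) AS avg_temperature
FROM temperatures AS t
WHERE t.dtime >= '2012-11-15' AND t.dtime < '2012-12-01'
ORDER BY t.dtime DESC;
```

The join above is still likely to be the better choice, since it computes each daily average once instead of once per row.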

Since your first query is very efficient already, let's use it as a starting point. Depending on the size of the result sets it produces, querying the results of your first query can still be very efficient.
Your third query also seems to run very efficiently, so you can fall back to that if my proposed query doesn't perform well enough. I like my proposed query because you can take the original query as a parameter of sorts (minus the ORDER BY) and plug it in; it then shows the average temperature for each day in the date range of the original query:
SELECT
DATE(dtime) AS day_of_interest,
AVG(temperature) AS avg_temperature
FROM
(
-- Your first query is here, minus the ORDER BY clause
SELECT
dtime,
temperature
FROM
temperatures
WHERE
dtime BETWEEN "2012-11-15 00:00:00" AND "2012-11-30 23:59:59"
-- ORDER BY irrelevant inside subqueries, only slows you down
-- ORDER BY
-- dtime DESC
) AS temperatures_of_interest
GROUP BY
day_of_interest
ORDER BY
day_of_interest DESC
If this query runs "efficiently enough" ™ for you, then this could potentially be an easier solution to code up and automate than perhaps some others.
Hope this helps!

Related

MySQL, how to get time difference between rows

I'm trying to get time difference between rows. I tried this but it's not working.
SELECT id,locationDate,
TIMESTAMPDIFF(SECOND,
(SELECT MAX(locationDate) FROM location WHERE locationDate< t.locationDate),
created_at
) as secdiff
FROM location t where tagCode = 24414 AND locationDate >= '2017-05-10 16:00:01' and locationDate <= '2017-05-10 16:59:59';
What should I do for calculating time difference between rows ?
You can reach the sample structure and data from sqlfiddle
I am guessing you just want a correlated subquery:
select l.id, l.locationDate,
TIMESTAMPDIFF(SECOND,
(SELECT MAX(l2.locationDate)
FROM location l2
WHERE l2.locationDate < l.locationDate AND
l2.tagCode = l.tagCode
),
l.locationDate
) as secdiff
from location l
where l.tagCode = 24414 and
l.locationDate > '2017-05-10 16:00:00' and
l.locationDate < '2017-05-10 17:00:00';
I modified the date/time constants to be a bit more reasonable (from my perspective). If you really care about one second before or after a time, then you can use your original formulation.
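On MySQL 8.0 or later, window functions offer another way to write this. The following is a sketch using LAG(), not a drop-in replacement: window functions are evaluated after the WHERE clause, so the first row inside the selected hour gets NULL instead of the difference to the last row before 16:00. Rows that share the same locationDate also behave differently (the previous row may have an equal time, giving 0).

```sql
-- Requires MySQL 8.0+ (window functions).
SELECT l.id,
       l.locationDate,
       TIMESTAMPDIFF(SECOND,
                     -- Previous locationDate for the same tag, in time order.
                     LAG(l.locationDate) OVER (PARTITION BY l.tagCode
                                               ORDER BY l.locationDate),
                     l.locationDate) AS secdiff
FROM location AS l
WHERE l.tagCode = 24414
  AND l.locationDate > '2017-05-10 16:00:00'
  AND l.locationDate < '2017-05-10 17:00:00'
ORDER BY l.locationDate;
```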

how to delete every record except one per hour

I have a mysql table with millions of sensor records with the following structure:
datanumber (auto increment),
stationid (int),
sensortype (int),
measuredate (datetime),
data (medtext)
Each station adds a record every 2-10 minutes per sensortype (2-5 sensors).
I would like to keep only one record per hour, per sensor, per station
and this too only if measuredate is older than 1 year.
I understand how to select data older than one year but I have no clue on how to delete rows except one for each hour. It does not really matter if it's the first, last or a random value which is kept at each hour. I also do not need to calculate average values or something, just strip down the amount of records stored
You should be able to do something like
SELECT * FROM observations WHERE <old> GROUP BY sensortype, stationid, EXTRACT(YEAR_MONTH FROM measuredate), EXTRACT(DAY_HOUR FROM measuredate);
GROUP BY will collapse the records in each group into one. You could select this into a new table if you want.
If you need to actually delete all the redundant old records, just select the datanumbers using the above query, and then delete all records NOT IN(<those ids>).
If you are going to be deleting a very large number of rows, then one approach recommended by the MySQL docs is to select the rows you want to retain into a temporary table, and then perform an atomic table renaming. Maybe like this:
INSERT INTO
sensordata_squeezed
SELECT
datanumber,
stationid,
sensortype,
measuredate,
data
FROM
sensordata
WHERE
measuredate < DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
GROUP BY
DATE_ADD(DATE(measuredate), INTERVAL HOUR(measuredate) HOUR),
stationid,
sensortype
UNION ALL
SELECT
datanumber,
stationid,
sensortype,
measuredate,
data
FROM
sensordata
WHERE
measuredate >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
;
RENAME TABLE
sensordata TO sensordata_old,
sensordata_squeezed TO sensordata
;
DROP TABLE sensordata_old
;
Note well: that relies on MySQL's documented behavior with respect to selecting columns from an aggregate query that are neither grouping columns nor aggregate functions of the groups: it chooses an indeterminate value from each group. (This is an extension to standard SQL.) I am supposing that within each group, all the nonaggregated column values will come from the same row; you should check because that part is not documented, and this approach depends on that to maintain data integrity.
This approach allows you to avoid both large, expensive joins, and large numbers of subqueries.
Do note that however you do this, you are going to have to work around issues of how to avoid losing data that comes in while this operation is running, as it is likely to take a long time.
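Two practical details the statements above take for granted, sketched here under assumptions: the target table sensordata_squeezed has to exist before the INSERT, and the GROUP BY part is only accepted when the ONLY_FULL_GROUP_BY SQL mode is off (it is part of the default sql_mode since MySQL 5.7.5). The row-count check at the end is just a suggestion, not part of the approach described above.

```sql
-- Create an empty table with the same columns and indexes as the original.
CREATE TABLE sensordata_squeezed LIKE sensordata;

-- Check whether ONLY_FULL_GROUP_BY is active in this session.
SELECT @@SESSION.sql_mode;

-- After the INSERT and before the RENAME, compare row counts as a sanity check.
SELECT
  (SELECT COUNT(*) FROM sensordata)          AS rows_before,
  (SELECT COUNT(*) FROM sensordata_squeezed) AS rows_after;
```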
This would be a lead-pipe cinch if we could use row_number over( ... ) but a solution for MySQL is not difficult. For problems like this, look to see if we can query a list of just the rows we want to delete. That sounds easy enough. First, we want to have a list of each hour of each day and the first (least) entry for that hour:
select Date( MeasureDate ) TheDate, Hour( MeasureDate ) TheHour, Min( MeasureDate ) MinTime
from T
group by TheDate, TheHour;
So we just have to join the table back to this result set:
select T.*
from T
join(
select Date( MeasureDate ) TheDate, Hour( MeasureDate ) TheHour, Min( MeasureDate ) MinTime
from T
group by TheDate, TheHour
) as T1
on T1.MinTime = T.MeasureDate
This gives us all the rows we want to keep. So use a left join to invert the results:
select T.*
from T
left join(
select Date( MeasureDate ) TheDate, Hour( MeasureDate ) TheHour, Min( MeasureDate ) MinTime
from T
group by TheDate, TheHour
) as T1
on T1.MinTime = T.MeasureDate
where T1.MinTime is null;
Change the select to delete et voilà:
delete TDel
from T TDel
left join(
select Date( MeasureDate ) TheDate, Hour( MeasureDate ) TheHour, Min( MeasureDate ) MinTime
from T
group by TheDate, TheHour
) as T1
on T1.MinTime = TDel.MeasureDate
where T1.MinTime is null;
You can add other fields such as SensorType as appropriate to keep first entry of each hour per sensor or however you want to tune it. SqlFiddle
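As a sketch of that tuning for the table in the question: the version below keeps one row per station, sensor and hour, and only touches rows older than one year. The table name sensordata is borrowed from the previous answer (the question does not name the table), and keeping the smallest datanumber per group is my choice, since the question says any row will do. Run the inner SELECT on its own first, or try the statement on a copy of the table.

```sql
DELETE d
FROM sensordata AS d
LEFT JOIN (
    -- One row to keep per station, sensor and hour: the smallest datanumber.
    SELECT MIN(datanumber) AS keep_id
    FROM sensordata
    WHERE measuredate < CURDATE() - INTERVAL 1 YEAR
    GROUP BY stationid, sensortype, DATE(measuredate), HOUR(measuredate)
) AS k
  ON k.keep_id = d.datanumber
WHERE d.measuredate < CURDATE() - INTERVAL 1 YEAR  -- leave newer rows alone
  AND k.keep_id IS NULL;                           -- not the row chosen to keep
```

Matching on the auto-increment datanumber instead of the measurement time avoids keeping several rows when two sensors happen to report at the same second.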

MySQL: Select corresponding row for maximum value grouped by date

I have a table with hourly temperatures:
id timestamp temperature
I want to select the highest temperature AND the corresponding timestamp for each day.
select max(temperature), timestamp from table group by date(from_unixtime(timestamp))
does not work because it always returns the timestamp of the first row (but I need the timestamp from the row with the highest temperature).
Any suggestions would be greatly appreciated.
Try this one....
select max(temperature), timestamp from temp group by UNIX_TIMESTAMP(date(timestamp));
Use a sub query like this;
select * from temp WHERE temperature=(select min(temperature) from temp)
Select the max temperature for each date, and then put that inside a join back to the table on the temperature and date, which will allow you to select the rows that match for temperature and date. A join will be faster than a subquery in most situations, and you can't always group inside a subquery, anyway.
Use date() to get the date part from the timestamp, and from_unixtime() will get a MySQL timestamp from a unix timestamp stored as either a string or an integer.
SELECT temperature, timestamp
FROM temp t
JOIN (
SELECT date(from_unixtime(timestamp)) as dt,
max(temperature) as maxTemp
FROM temp
GROUP BY date(from_unixtime(timestamp))
) m ON (
t.temperature = m.maxTemp AND
date(from_unixtime(t.timestamp)) = m.dt
)
However, I would suggest changing the table to store the timestamp as timestamp instead of varchar or int, and doing the conversion once when the data is inserted, instead of having to put it throughout the query. It will make things easier to read and maintain in the long term. Here's the same query if you change timestamp to be an actual timestamp:
SELECT temperature, timestamp
FROM temp t
JOIN (
SELECT date(timestamp) as dt,
max(temperature) as maxTemp
FROM temp
GROUP BY date(timestamp)
) m ON (
t.temperature = m.maxTemp AND
date(t.timestamp) = m.dt
)
Just a little easier to read, and probably faster, depending on how much data you have. You could also write that with an implicit join, which might be easier to read still. Just depends on your taste.
SELECT temperature, timestamp
FROM temp t, (
SELECT date(timestamp) as dt,
max(temperature) as maxTemp
FROM temp
GROUP BY date(timestamp)
) m
WHERE t.temperature = m.maxTemp
AND date(t.timestamp) = m.dt
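One thing to be aware of with the join: if the highest temperature occurs more than once on a day, every matching row is returned. A possible sketch that keeps only the earliest such timestamp per day, assuming the column is a real TIMESTAMP as in the second query above (the alias first_timestamp is made up):

```sql
SELECT m.dt,
       m.maxTemp        AS temperature,
       MIN(t.timestamp) AS first_timestamp   -- earliest time the maximum was reached
FROM temp AS t
JOIN (
    SELECT DATE(timestamp)  AS dt,
           MAX(temperature) AS maxTemp
    FROM temp
    GROUP BY DATE(timestamp)
) AS m
  ON t.temperature = m.maxTemp
 AND DATE(t.timestamp) = m.dt
GROUP BY m.dt, m.maxTemp
ORDER BY m.dt;
```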

Fastest way to get closest data from multiple tables based on time

I have three tables, with the following setup:
TEMPERATURE_1
time
zone (FK)
temperature
TEMPERATURE_2
time
zone (FK)
temperature
TEMPERATURE_3
time
zone (FK)
temperature
The data in each table is updated periodically, but not necessarily concurrently (ie, the time entries are not identical).
I want to be able to access the closest reading from each table for each time, ie:
TEMPERATURES
time
zone (FK)
temperature_1
temperature_2
temperature_3
In other words, for every unique time across my three tables, I want a row in the TEMPERATURES table, where the temperature_n values are the temperature reading closest in time from each original table.
At the moment, I've set this up using two views:
create view temptimes
as select time, zone
from temperature_1
union
select time, zone
from temperature_2
union
select time, zone
from temperature_3;
create view temperatures
as select tt.time,
tt.zone,
(select temperature
from temperature_1
order by abs(timediff(time, tt.time))
limit 1) as temperature_1,
(select temperature
from temperature_2
order by abs(timediff(time, tt.time))
limit 1) as temperature_2,
(select temperature
from temperature_3
order by abs(timediff(time, tt.time))
limit 1) as temperature_3
from temptimes as tt
order by tt.time;
This approach works, but is too slow to use in production (it takes minutes+ for small data sets of ~1000 records for each temperature).
I'm not great with SQL, so I'm sure I'm missing the correct way to do this. How should I approach the problem?
The expensive part is where the correlated subqueries have to compute the time difference for every single row of each temperature_* table to find just one closest row for one column of one row in the main query.
It would be dramatically faster if you could just pick one row after and one row before the current time according to an index and only compute the time difference for these two candidates. All you need for that to be fast is an index on the column time in your tables.
I am ignoring the column zone, since its role remains unclear in the question and it just adds noise to the core problem. It should be easy to add to the query.
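As a sketch of that idea for one table: create the index (the index names are made up), then fetch the single row before and the single row after a point in time. The literal timestamp is only an example value.

```sql
-- Hypothetical index names; one index per temperature table.
CREATE INDEX idx_temperature_1_time ON temperature_1 (time);
CREATE INDEX idx_temperature_2_time ON temperature_2 (time);
CREATE INDEX idx_temperature_3_time ON temperature_3 (time);

-- Closest row at or before the given time.
SELECT time, temperature
FROM temperature_1
WHERE time <= '2012-11-15 12:00:00'
ORDER BY time DESC
LIMIT 1;

-- Closest row after the given time.
SELECT time, temperature
FROM temperature_1
WHERE time > '2012-11-15 12:00:00'
ORDER BY time
LIMIT 1;
```

Each of these only has to look at one index entry, and only those two candidates need their time difference compared.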
Without an additional view, this query does all at once:
SELECT time
,COALESCE(temp1
,CASE WHEN timediff(time, time1a) > timediff(time1b, time) THEN
(SELECT t.temperature
FROM temperature_1 t
WHERE t.time = y.time1b)
ELSE
(SELECT t.temperature
FROM temperature_1 t
WHERE t.time = y.time1a)
END) AS temp1
,COALESCE(temp2
,CASE WHEN timediff(time, time2a) > timediff(time2b, time) THEN
(SELECT t.temperature
FROM temperature_2 t
WHERE t.time = y.time2b)
ELSE
(SELECT t.temperature
FROM temperature_2 t
WHERE t.time = y.time2a)
END) AS temp2
,COALESCE(temp3
,CASE WHEN timediff(time, time3a) > timediff(time3b, time) THEN
(SELECT t.temperature
FROM temperature_3 t
WHERE t.time = y.time3b)
ELSE
(SELECT t.temperature
FROM temperature_3 t
WHERE t.time = y.time3a)
END) AS temp3
FROM (
SELECT time
,max(t1) AS temp1
,max(t2) AS temp2
,max(t3) AS temp3
,CASE WHEN max(t1) IS NULL THEN
(SELECT t.time FROM temperature_1 t
WHERE t.time < x.time
ORDER BY t.time DESC LIMIT 1) ELSE NULL END AS time1a
,CASE WHEN max(t1) IS NULL THEN
(SELECT t.time FROM temperature_1 t
WHERE t.time > x.time
ORDER BY t.time LIMIT 1) ELSE NULL END AS time1b
,CASE WHEN max(t2) IS NULL THEN
(SELECT t.time FROM temperature_2 t
WHERE t.time < x.time
ORDER BY t.time DESC LIMIT 1) ELSE NULL END AS time2a
,CASE WHEN max(t2) IS NULL THEN
(SELECT t.time FROM temperature_2 t
WHERE t.time > x.time
ORDER BY t.time LIMIT 1) ELSE NULL END AS time2b
,CASE WHEN max(t3) IS NULL THEN
(SELECT t.time FROM temperature_3 t
WHERE t.time < x.time
ORDER BY t.time DESC LIMIT 1) ELSE NULL END AS time3a
,CASE WHEN max(t3) IS NULL THEN
(SELECT t.time FROM temperature_3 t
WHERE t.time > x.time
ORDER BY t.time LIMIT 1) ELSE NULL END AS time3b
FROM (
SELECT time, temperature AS t1, NULL AS t2, NULL AS t3 FROM temperature_1
UNION ALL
SELECT time, NULL AS t1, temperature AS t2, NULL AS t3 FROM temperature_2
UNION ALL
SELECT time, NULL AS t1, NULL AS t2, temperature AS t3 FROM temperature_3
) AS x
GROUP BY time
) y
ORDER BY time;
->sqlfiddle
Explain
Subquery x replaces your view temptimes and brings the temperature into the result. If all three tables are in sync and have temperatures for all the same points in time, the rest is not even needed and extremely fast.
For every point in time where one of the three tables has no row, the temperature is being fetched as instructed: take the "closest" one from each table.
Subquery y aggregates the rows from x and fetches the previous time (time1a) and the next time (time1b) according to the current time from each table where the temperature is missing. These lookups should be fast using the index.
The final query fetches the temperature from the row with the closest time for each temperature that's actually missing.
This query could be simpler if MySQL allowed referencing columns from more than one level above the current subquery. But it does not. It works just fine in PostgreSQL: ->sqlfiddle
It also would be simpler if one could return more than one column from a correlated subquery, but I don't know how to do that in MySQL.
And it would be much simpler with CTEs and window functions, but MySQL (before version 8.0) doesn't support these modern SQL features (unlike other relevant RDBMS).
The reason that this is slow is that it requires 3 table scans to calculate and order the differences.
I assume that you already have indexes on the time and zone columns - at the moment they won't help because of the table scan problem.
There are a number of options to avoid this depending on what you need and what the data collection rates are.
You have already said that the data is collected periodically but not concurrently. This suggests a few options.
To what level of significance do you need the temp data - the day, the hour, the minute etc. Store the time zone info to that level of significance only (or have another column that does) and do your queries on that.
If you know that the 3 closest times will be within a certain time frame (hour, day etc) put in a where clause to limit the calculation to those times that are potential candidates. You are effectively constructing histogram type buckets - you will need a calendar table to do this efficiently.
Make the comparison unidirectional i.e. limit consideration to only those times after the time you are looking for, so if you are looking for 12:00:00 then 13:45:32 is a candidate but 11:59:59 isn't.
I understand what you are trying to accomplish - ask yourself why, and whether a simpler solution will meet your needs.
My suggestion is that you don't take the closest time, but you take the first time on or before a given time. The reason for this is simple: generally the data for a given time is what is known at that time. Incorporating future information is generally not a good idea for most purposes.
With this change, you can modify your query to take advantage of an index on time. The problem with an index on your query is that the function precludes the use of the index.
So, if you want the most recent temperature, use this instead for each variable:
(select temperature
from temperature_1 t2
where t2.time <= tt.time
order by t2.time desc
limit 1
) as temperature_1,
Actually, you can also construct it like this:
(select time
from temperature_1 t2
where t2.time <= tt.time
order by t2.time desc
limit 1
) as time_1,
And then join the information for the temperature back in. This will be efficient, with the use of an index.
With that in mind, you could actually have two variables time_1_before and time_1_after, for the best time on or before and the best time on or after. You can use logic in the select to choose the nearest value. The joins back to the temperature should be efficient using an index.
But, I will reiterate, I think the last temperature on or before may be the best choice.
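For completeness, here is a rough sketch of the before/after variant for one of the three tables. It reuses the temptimes view from the question, ignores zone, and assumes the time columns are DATETIME or TIMESTAMP so that TIMESTAMPDIFF applies. The alias c and the tie-breaking rule (prefer the earlier reading) are my own choices.

```sql
SELECT c.time,
       (SELECT t.temperature
        FROM temperature_1 AS t
        WHERE t.time = CASE
                         -- Only one neighbour exists: use it.
                         WHEN c.time_1_after  IS NULL THEN c.time_1_before
                         WHEN c.time_1_before IS NULL THEN c.time_1_after
                         -- Both exist: take the nearer one, earlier on a tie.
                         WHEN TIMESTAMPDIFF(SECOND, c.time_1_before, c.time)
                              <= TIMESTAMPDIFF(SECOND, c.time, c.time_1_after)
                           THEN c.time_1_before
                         ELSE c.time_1_after
                       END
        LIMIT 1) AS temperature_1
FROM (
    SELECT tt.time,
           -- Best time on or before, and best time on or after; both can use an index on time.
           (SELECT MAX(t.time) FROM temperature_1 AS t WHERE t.time <= tt.time) AS time_1_before,
           (SELECT MIN(t.time) FROM temperature_1 AS t WHERE t.time >= tt.time) AS time_1_after
    FROM temptimes AS tt
) AS c
ORDER BY c.time;
```

The same pattern would be repeated for temperature_2 and temperature_3.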

MySQL Query not selecting correct date range

I'm currently trying to run a SQL query to export data between certain dates. The query runs fine, but the date selection doesn't work and I can't figure out what's wrong.
SELECT
title AS Order_No,
FROM_UNIXTIME(entry_date, '%d-%m-%Y') AS Date,
status AS Status,
field_id_59 AS Transaction_ID,
field_id_32 AS Customer_Name,
field_id_26 AS Sub_Total,
field_id_28 AS VAT,
field_id_31 AS Discount,
field_id_27 AS Shipping_Cost,
(field_id_26+field_id_28+field_id_27-field_id_31) AS Total
FROM
exp_channel_data AS d NATURAL JOIN
exp_channel_titles AS t
WHERE
t.channel_id = 5 AND FROM_UNIXTIME(entry_date, '%d-%m-%Y') BETWEEN '01-05-2012' AND '31-05-2012' AND status = 'Shipped'
ORDER BY
entry_date DESC
As explained in the manual, date literals should be in YYYY-MM-DD format. Also, bearing in mind the point made by @ypercube in his answer, you want:
WHERE t.channel_id = 5
AND entry_date >= UNIX_TIMESTAMP('2012-05-01')
AND entry_date < UNIX_TIMESTAMP('2012-06-01')
AND status = 'Shipped'
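To see why the original condition misbehaves, note that FROM_UNIXTIME(entry_date, '%d-%m-%Y') returns a string, so BETWEEN compares text character by character rather than comparing dates. A small illustration with made-up literal values:

```sql
-- Day-first strings: a date in June sorts "between" the two May strings,
-- because only the leading day digits decide the comparison.
SELECT '15-06-2012' BETWEEN '01-05-2012' AND '31-05-2012' AS day_first_match;   -- 1

-- Year-first strings sort in chronological order, so the same test fails as expected.
SELECT '2012-06-15' BETWEEN '2012-05-01' AND '2012-05-31' AS year_first_match;  -- 0
```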
Besides the date format there is another issue. To use any index on entry_date effectively, you should not apply functions to that column when you use it in conditions in WHERE, GROUP BY or HAVING clauses (you can apply the formatting in the SELECT list if you need a format other than the default). An effective way to write that part of the query would be:
( entry_date >= '2012-05-01'
AND entry_date < '2012-06-01'
)
It works with DATE, DATETIME and TIMESTAMP columns.