How to delete every record except one per hour in MySQL

I have a MySQL table with millions of sensor records with the following structure:
datanumber (auto increment),
stationid (int),
sensortype (int),
measuredate (datetime),
data (mediumtext)
Each station adds a record every 2-10 minutes per sensor type (2-5 sensors).
I would like to keep only one record per hour, per sensor, per station, and only for records whose measuredate is older than 1 year.
I understand how to select data older than one year, but I have no clue how to delete all rows except one for each hour. It does not matter whether the first, the last, or a random value is kept for each hour. I also do not need to calculate averages or anything similar; I just want to reduce the number of records stored.

You should be able to do something like
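SELECT * FROM observations WHERE <old> GROUP BY sensortype, stationid, EXTRACT(YEAR_MONTH FROM measuredate), EXTRACT(DAY_HOUR FROM measuredate);
GROUP BY will collapse the records in each group into one. (With ONLY_FULL_GROUP_BY enabled, which is the default since MySQL 5.7.5, select an aggregate such as MIN(datanumber) instead of *.) You could select this into a new table if you want.
If you need to actually delete all the redundant old records, select just the datanumbers using the above query, and then delete the old records whose datanumber is NOT IN (<those ids>). Keep the <old> condition on the DELETE as well, otherwise the recent records would be deleted too.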
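The select-the-ids-then-delete idea can be tried out on a small data set. The following is only a sketch: it uses Python's sqlite3 module as a stand-in for MySQL so that it runs without a server, strftime('%Y-%m-%d %H', ...) replaces the MySQL EXTRACT calls as the hour bucket, the table is named sensordata as elsewhere in this thread, and the fixed CUTOFF value is an assumption made for reproducibility.

```python
import sqlite3

# Fixed cutoff so the example is reproducible; in MySQL this would be
# something like DATE_SUB(CURDATE(), INTERVAL 1 YEAR).
CUTOFF = "2023-01-01 00:00:00"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensordata ("
    " datanumber INTEGER PRIMARY KEY, stationid INTEGER, sensortype INTEGER,"
    " measuredate TEXT, data TEXT)"
)
conn.executemany(
    "INSERT INTO sensordata VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 1, "2020-01-01 10:02:00", "a"),  # old, station 1, sensor 1, hour 10
        (2, 1, 1, "2020-01-01 10:30:00", "b"),  # old, same hour -> redundant
        (3, 1, 1, "2020-01-01 11:05:00", "c"),  # old, hour 11
        (4, 1, 2, "2020-01-01 10:10:00", "d"),  # old, other sensor
        (5, 1, 1, "2024-06-01 10:02:00", "e"),  # recent, must survive
        (6, 1, 1, "2024-06-01 10:30:00", "f"),  # recent, must survive
    ],
)

# Delete old rows whose id is not the smallest id of their
# (station, sensor, hour) group. The age filter appears on the DELETE
# itself as well, so recent rows are never touched.
conn.execute(
    """
    DELETE FROM sensordata
    WHERE measuredate < :cutoff
      AND datanumber NOT IN (
          SELECT MIN(datanumber)
          FROM sensordata
          WHERE measuredate < :cutoff
          GROUP BY stationid, sensortype, strftime('%Y-%m-%d %H', measuredate)
      )
    """,
    {"cutoff": CUTOFF},
)
conn.commit()

remaining_ids = [
    row[0] for row in conn.execute("SELECT datanumber FROM sensordata ORDER BY datanumber")
]
print(remaining_ids)
```

Note that MySQL itself rejects a DELETE whose subquery reads directly from the table being deleted from; there the subquery has to be wrapped in a derived table, or the ids selected into a separate table first.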

If you are going to be deleting a very large number of rows, then one approach recommended by the MySQL docs is to select the rows you want to retain into a temporary table, and then perform an atomic table renaming. Maybe like this:
CREATE TABLE sensordata_squeezed LIKE sensordata;
INSERT INTO
sensordata_squeezed
SELECT
datanumber,
stationid,
sensortype,
measuredate,
data
FROM
sensordata
WHERE
measuredate < DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
GROUP BY
DATE_ADD(DATE(measuredate), INTERVAL HOUR(measuredate) HOUR),
stationid,
sensortype
UNION ALL
SELECT
datanumber,
stationid,
sensortype,
measuredate,
data
FROM
sensordata
WHERE
measuredate >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
;
RENAME TABLE
sensordata TO sensordata_old,
sensordata_squeezed TO sensordata
;
DROP TABLE sensordata_old
;
Note well: that relies on MySQL's documented behavior with respect to selecting columns from an aggregate query that are neither grouping columns nor aggregate functions of the groups: it chooses an indeterminate value from each group. (This is an extension to standard SQL.) I am supposing that within each group, all the nonaggregated column values will come from the same row; you should check because that part is not documented, and this approach depends on that to maintain data integrity.
This approach allows you to avoid both large, expensive joins, and large numbers of subqueries.
Do note that however you do this, you are going to have to work around issues of how to avoid losing data that comes in while this operation is running, as it is likely to take a long time.
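To see the copy-and-swap flow end to end, here is a small sketch run against SQLite through Python's sqlite3 module (a stand-in for MySQL, chosen only so the example is runnable). Instead of relying on which row a GROUP BY happens to return, it picks MIN(datanumber) per group and joins back to fetch the whole row, which sidesteps the undocumented behaviour mentioned above. The helper name squeeze_by_copy and the fixed cutoff are assumptions made for illustration.

```python
import sqlite3


def squeeze_by_copy(conn, cutoff):
    """Copy the rows to keep into a new table, then swap the tables."""
    cur = conn.cursor()
    # Empty copy of the column layout. (MySQL: CREATE TABLE ... LIKE sensordata,
    # which also copies keys and indexes; this SQLite form copies columns only.)
    cur.execute("CREATE TABLE sensordata_squeezed AS SELECT * FROM sensordata WHERE 0")
    cur.execute(
        """
        INSERT INTO sensordata_squeezed
        SELECT s.*
        FROM sensordata AS s
        JOIN (
            SELECT MIN(datanumber) AS keep_id
            FROM sensordata
            WHERE measuredate < :cutoff
            GROUP BY stationid, sensortype, strftime('%Y-%m-%d %H', measuredate)
        ) AS k ON k.keep_id = s.datanumber
        UNION ALL
        SELECT * FROM sensordata WHERE measuredate >= :cutoff
        """,
        {"cutoff": cutoff},
    )
    # MySQL swaps both names atomically with one RENAME TABLE statement;
    # here two renames inside one transaction approximate that.
    cur.execute("ALTER TABLE sensordata RENAME TO sensordata_old")
    cur.execute("ALTER TABLE sensordata_squeezed RENAME TO sensordata")
    cur.execute("DROP TABLE sensordata_old")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensordata ("
    " datanumber INTEGER PRIMARY KEY, stationid INTEGER, sensortype INTEGER,"
    " measuredate TEXT, data TEXT)"
)
conn.executemany(
    "INSERT INTO sensordata VALUES (?, ?, ?, ?, ?)",
    [
        (1, 7, 1, "2020-03-05 08:01:00", "a"),
        (2, 7, 1, "2020-03-05 08:45:00", "b"),  # same station, sensor and hour as row 1
        (3, 7, 1, "2020-03-05 09:15:00", "c"),
        (4, 8, 1, "2020-03-05 08:20:00", "d"),  # other station
        (5, 7, 1, "2024-03-05 08:01:00", "e"),  # recent
        (6, 7, 1, "2024-03-05 08:45:00", "f"),  # recent, same hour, still kept
    ],
)

squeeze_by_copy(conn, "2023-01-01 00:00:00")

kept_ids = [r[0] for r in conn.execute("SELECT datanumber FROM sensordata ORDER BY datanumber")]
tables = [r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")]
print(kept_ids, tables)
```

Rows that arrive in the original table between the INSERT and the rename are not handled by this sketch, which is exactly the issue raised above.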

This would be a lead-pipe cinch if we could use ROW_NUMBER() OVER (...), which MySQL only supports from version 8.0, but a solution for older versions is not difficult. For problems like this, see whether we can query a list of just the rows we want to delete. First, we want a list of each hour of each day and the first (least) entry for that hour:
select Date( MeasureDate ) TheDate, Hour( MeasureDate ) TheHour, Min( MeasureDate ) MinTime
from T
group by TheDate, TheHour;
So we just have to join the table back to this result set:
select T.*
from T
join(
select Date( MeasureDate ) TheDate, Hour( MeasureDate ) TheHour, Min( MeasureDate ) MinTime
from T
group by TheDate, TheHour
) as T1
on T1.MinTime = T.MeasureDate
This gives us all the rows we want to keep. So use a left join to invert the results:
select T.*
from T
left join(
select Date( MeasureDate ) TheDate, Hour( MeasureDate ) TheHour, Min( MeasureDate ) MinTime
from T
group by TheDate, TheHour
) as T1
on T1.MinTime = T.MeasureDate
where T1.MinTime is null;
Change the select to a delete, et voilà:
delete TDel
from T TDel
left join(
select Date( MeasureDate ) TheDate, Hour( MeasureDate ) TheHour, Min( MeasureDate ) MinTime
from T
group by TheDate, TheHour
) as T1
on T1.MinTime = TDel.MeasureDate
where T1.MinTime is null;
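You can add other fields such as SensorType and StationId to both the group by list and the join condition to keep the first entry of each hour per sensor and station, and add a condition on MeasureDate to restrict the delete to rows older than a year.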

Related

Group by help ( grouping by multiple, have duplicates)

So I have a task where I need to group my results by date and by provider_name, but currently my query lists the same dates and providers multiple times. (I need one row per provider per day, 25 days in all, so that my table shows how many messages each provider got that day and how much they earned.)
This needs to be my result: [image: result table]
But this is what I'm currently getting.
This is my code currently:
SELECT date_format( time, '%Y-%m-%d' ) AS Date, provider_name, COUNT( message_id ) AS Messages_count, SUM( price ) AS Total_price
FROM mobile_log_messages_sms
INNER JOIN service_instances ON service_instances.service_instance_id = mobile_log_messages_sms.service_instance_id
INNER JOIN mobile_providers ON mobile_providers.network_code = mobile_log_messages_sms.network_code
WHERE time
BETWEEN '2017-02-26 00:00:00'
AND '2017-03-22 00:00:00'
AND price IS NOT NULL
AND price <> ''
AND service IS NOT NULL
AND service <> ''
AND enabled IS NOT NULL
AND enabled >=1
GROUP BY provider_name, time
ORDER BY time DESC
Can you tell me where I've messed up? I really can't figure out the answer.
Try like this:
....
GROUP BY provider_name, date_format( time, '%Y-%m-%d' )
ORDER BY Date DESC
You are grouping by time, which groups the results by the full datetime including hour, minute and second; that is why you get several rows with different counts for the same day. Group by the day instead.
The time column is a datetime, so the rows are grouped by both date and time rather than just by date.
Change GROUP BY statement to
GROUP BY provider_name, date_format( time, '%Y-%m-%d' )
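The difference between grouping by the full datetime and grouping by the day can be seen on a few rows. This is a sketch only: it runs on SQLite through Python's sqlite3 module rather than MySQL, so date(time) stands in for date_format(time, '%Y-%m-%d'), and the table sms with its three columns is a simplified, hypothetical version of the tables in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sms (time TEXT, provider_name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sms VALUES (?, ?, ?)",
    [
        ("2017-03-01 08:00:00", "A", 1.5),
        ("2017-03-01 09:30:00", "A", 2.0),  # same provider and day, different time
        ("2017-03-01 10:00:00", "B", 1.0),
        ("2017-03-02 08:00:00", "A", 3.0),
    ],
)

# Grouping by the raw datetime: every distinct second is its own group.
by_datetime = conn.execute(
    "SELECT time, provider_name, COUNT(*) FROM sms GROUP BY provider_name, time"
).fetchall()

# Grouping by the day part only: one row per provider per day.
by_day = conn.execute(
    """
    SELECT date(time) AS day, provider_name,
           COUNT(*) AS messages_count, SUM(price) AS total_price
    FROM sms
    GROUP BY provider_name, date(time)
    ORDER BY day DESC, provider_name
    """
).fetchall()

print(len(by_datetime))
for row in by_day:
    print(row)
```

With the raw datetime the two messages of provider A on 2017-03-01 stay in separate groups; grouping by the day merges them into a single row with a count of 2.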

Get percentage of total when using GROUP BY in SQL query

I have a SQL query that I'm using to return the number of training sessions recorded by a client on each day of the week (during the last year).
SELECT COUNT(*) total_sessions
, DAYNAME(log_date) day_name
FROM programmes_results
WHERE log_date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
AND log_date <= CURDATE()
AND client_id = 7171
GROUP
BY day_name
ORDER
BY FIELD(day_name, 'MONDAY', 'TUESDAY', 'WEDNESDAY', 'THURSDAY', 'FRIDAY', 'SATURDAY', 'SUNDAY')
I would like to then plot a table showing these values as a percentage of the total, as opposed to as a 'count' for each day. However I'm at a bit of a loss as to how to do that without another query (which I'd like to avoid).
Any thoughts?
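Use a derived table. Grouping the outer query by day_name and total_sessions would make SUM(total_sessions) equal to each row's own count, so every percentage would be 100; the total has to be taken across all rows instead, for example with a window function (MySQL 8.0 or later):
select day_name, total_sessions, total_sessions / sum(total_sessions) over () * 100 percentage
from (
query from your question goes here
) temp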
You can add the number of trainings per day in your client application to get the total count. This way you definitely avoid having a 2nd query to get the total.
Use the WITH ROLLUP modifier in the query to get the total returned in the last row (before MySQL 8.0.12, WITH ROLLUP cannot be combined with ORDER BY, so sort in the client there):
...GROUP BY day_name WITH ROLLUP ORDER BY ...
Use a subquery to return the overall count within each row:
SELECT ..., t.total_count
...FROM programmes_results INNER JOIN (SELECT COUNT(*) as total_count FROM programmes_results WHERE <same where criteria>) as t -- no join condition
...
This will have the largest performance impact on the database, however, it enables you to have the total number in each row.
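As a rough illustration of the last option, the sketch below joins the per-day counts to a one-row subquery holding the overall count and divides. It runs on SQLite via Python's sqlite3 module instead of MySQL, so strftime('%w', ...) (0 = Sunday) stands in for DAYNAME; the one-year date filter from the question is left out to keep the data small, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE programmes_results (log_date TEXT, client_id INTEGER)")
conn.executemany(
    "INSERT INTO programmes_results VALUES (?, ?)",
    [
        ("2024-01-01", 7171),  # Monday
        ("2024-01-08", 7171),  # Monday
        ("2024-01-02", 7171),  # Tuesday
        ("2024-01-03", 7171),  # Wednesday
        ("2024-01-01", 9999),  # other client, must not be counted
    ],
)

rows = conn.execute(
    """
    SELECT d.day_number,
           d.total_sessions,
           100.0 * d.total_sessions / t.total_count AS percentage
    FROM (
        SELECT strftime('%w', log_date) AS day_number, COUNT(*) AS total_sessions
        FROM programmes_results
        WHERE client_id = :client
        GROUP BY day_number
    ) AS d
    CROSS JOIN (
        -- same WHERE criteria as above, no join condition
        SELECT COUNT(*) AS total_count
        FROM programmes_results
        WHERE client_id = :client
    ) AS t
    ORDER BY d.day_number
    """,
    {"client": 7171},
).fetchall()

for row in rows:
    print(row)
```

The filter has to be repeated in the total subquery; if the two WHERE clauses drift apart, the percentages no longer add up to 100.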

MySQL Date difference between two rows

I have a TABLE with Columns: USER_ID,TIMESTAMP and ACTION
Every row tells me which user did what action at a certain time-stamp.
Example:
Alice starts the application at 2014-06-12 16:37:46
Alice stops the application at 2014-06-12 17:48:55
I want a list of users with the time difference between the first row in which they start the application and the last row in which they close it.
Here is how I'm trying to do it:
SELECT USER_ID,DATEDIFF(
(SELECT timestamp FROM MOBILE_LOG WHERE ACTION="START_APP" AND USER_ID="Alice" order by TIMESTAMP LIMIT 1),
(SELECT timestamp FROM MOBILE_LOG WHERE ACTION ="CLOSE_APP" AND USER_ID="Alice" order by TIMESTAMP LIMIT 1)
) AS Duration FROM MOBILE_LOG AS t WHERE USER_ID="Alice";
I ask for the DATEDIFF between two SELECT queries, but I just get a list of rows for Alice, each with -2 as Duration.
Am I on the right track?
I think you should group this table by USER_ID and find the minimum timestamp of "START_APP" and the maximum timestamp of "CLOSE_APP" for each user. You should also pass the CLOSE_APP time to DATEDIFF first and the START_APP time second, so that the result is positive.
SELECT USER_ID,
DATEDIFF(MAX(CASE WHEN ACTION="CLOSE_APP" THEN timestamp END),
MIN(CASE WHEN ACTION="START_APP" THEN timestamp END)
) AS Duration
FROM MOBILE_LOG AS t
GROUP BY USER_ID
SQLFiddle demo
SELECT user_id, start_time, close_time, DATEDIFF(close_time, start_time) duration
FROM
(SELECT MIN(timestamp) start_time, user_id FROM MOBILE_LOG WHERE action="START_APP" GROUP BY user_id) start_action
JOIN
(SELECT MAX(timestamp) close_time, user_id FROM MOBILE_LOG WHERE ACTION ="CLOSE_APP" GROUP BY user_id) close_action
USING (user_id)
WHERE USER_ID="Alice";
You make two "tables" with the earliest time for start for each user, and the latest time for close for each user. Then join them so that the actions of the same user are together.
Now that you have everything setup you can easily subtract between them.
You get an integer value because you use the function DATEDIFF, which returns the number of days between two dates. If you want the hours, minutes and seconds between them, use TIMEDIFF instead.
Try this:
select t1.USER_ID, TIMEDIFF(t2.timestamp, t1.timestamp)
from MOBILE_LOG t1, MOBILE_LOG t2
where (t1.USER_ID, t1.action, t1.timestamp) in (select USER_ID, action, max(timestamp) from MOBILE_LOG t where t.ACTION = "START_APP" group by USER_ID, action)
and (t2.USER_ID, t2.action, t2.timestamp) in (select USER_ID, action, max(timestamp) from MOBILE_LOG t where t.ACTION = "CLOSE_APP" group by USER_ID, action)
and t1.USER_ID = t2.USER_ID
It will show you the difference between the two latest timestamps (start and close) for every user.
P.S.: Sorry, I wrote this without a database to test against, so there may be some mistakes. If you have problems with (t1.USER_ID, t1.action, t1.timestamp) in (select ...), split it in two: where t1.action in (select ...) and t1.timestamp in (select ...)
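The difference between a day count and a time difference is easy to check on sample data. The sketch below is an approximation on SQLite (via Python's sqlite3), not MySQL: the day count is emulated by subtracting the date parts, which mirrors how MySQL's DATEDIFF only looks at the date part of its arguments, and the elapsed time is computed in seconds from strftime('%s', ...) in place of TIMEDIFF. The user Bob and his timestamps are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mobile_log (user_id TEXT, timestamp TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO mobile_log VALUES (?, ?, ?)",
    [
        ("Alice", "2014-06-12 16:37:46", "START_APP"),
        ("Alice", "2014-06-12 17:48:55", "CLOSE_APP"),
        ("Bob", "2014-06-12 23:50:00", "START_APP"),  # hypothetical user
        ("Bob", "2014-06-13 00:10:00", "CLOSE_APP"),  # 20 minutes later, next day
    ],
)

durations = conn.execute(
    """
    SELECT user_id,
           -- date parts only, like DATEDIFF(close_time, start_time)
           CAST(julianday(date(close_time)) - julianday(date(start_time)) AS INTEGER) AS days,
           -- full elapsed time in seconds
           CAST(strftime('%s', close_time) AS INTEGER)
             - CAST(strftime('%s', start_time) AS INTEGER) AS seconds
    FROM (
        SELECT user_id,
               MIN(CASE WHEN action = 'START_APP' THEN timestamp END) AS start_time,
               MAX(CASE WHEN action = 'CLOSE_APP' THEN timestamp END) AS close_time
        FROM mobile_log
        GROUP BY user_id
    )
    ORDER BY user_id
    """
).fetchall()

for row in durations:
    print(row)
```

Alice's session of a little over an hour counts as 0 days, while Bob's 20-minute session counts as 1 day because it crosses midnight, so a day count is a poor measure of session length.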

MySQL - How to select rows with the min(timestamp) per hour of a given date

I have a table of production readings and need to get a result set containing a row for the min(timestamp) for EACH hour.
The column layout is quite simple:
ID,TIMESTAMP,SOURCE_ID,SOURCE_VALUE
The data sample would look like:
123,'2013-03-01 06:05:24',PMPROD,12345678.99
124,'2013-03-01 06:15:17',PMPROD,88888888.99
125,'2013-03-01 06:25:24',PMPROD,33333333.33
126,'2013-03-01 06:38:14',PMPROD,44444444.44
127,'2013-03-01 07:12:04',PMPROD,55555555.55
128,'2013-03-01 10:38:14',PMPROD,44444444.44
129,'2013-03-01 10:56:14',PMPROD,22222222.22
130,'2013-03-01 15:28:02',PMPROD,66666666.66
Records are added to this table throughout the day and the source_value is already calculated, so no sum is needed.
I can't figure out how to get a row for the min(timestamp) for each hour of the current_date.
select *
from source_readings
use index(ID_And_Time)
where source_id = 'PMPROD'
and date(timestamp)=CURRENT_DATE
and timestamp =
( select min(timestamp)
from source_readings use index(ID_And_Time)
where source_id = 'PMPROD'
)
The above code, of course, gives me one record. I need one record for the min(hour(timestamp)) of the current_date.
My result set should contain the rows for IDs: 123,127,128,130. I've played with it for hours. Who can be my hero? :)
Try below:
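-- Join on the minimum timestamp of each hour; a bare ID in the grouped subquery could come from any row of that hour.
SELECT source_readings.* FROM source_readings
JOIN
(
SELECT DATE_FORMAT(timestamp, '%Y-%m-%d %H') as current_hour, MIN(timestamp) as min_timestamp
FROM source_readings
WHERE source_id = 'PMPROD'
AND DATE(timestamp) = CURRENT_DATE
GROUP BY current_hour
) As reading_min
ON source_readings.timestamp = reading_min.min_timestamp
WHERE source_readings.source_id = 'PMPROD'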
SELECT a.*
FROM Table1 a
INNER JOIN
(
SELECT DATE(TIMESTAMP) date,
HOUR(TIMESTAMP) hour,
MIN(TIMESTAMP) min_date
FROM Table1
GROUP BY DATE(TIMESTAMP), HOUR(TIMESTAMP)
) b ON DATE(a.TIMESTAMP) = b.date AND
HOUR(a.TIMESTAMP) = b.hour AND
a.timestamp = b.min_date
With window function:
WITH ranked AS (
SELECT *, ROW_NUMBER() OVER(PARTITION BY HOUR(timestamp) ORDER BY timestamp) rn
FROM source_readings -- original table
WHERE date(timestamp)=CURRENT_DATE AND source_id = 'PMPROD' -- your custom filter
)
SELECT * -- this will contain `rn` column. you can select only necessary columns
FROM ranked
WHERE rn=1
I haven't tested it, but the basic idea is:
1) ROW_NUMBER() OVER(PARTITION BY HOUR(timestamp) ORDER BY timestamp)
This will give each row a number, starting from 1 for each hour, increasing by timestamp. The result might look like:
|rest of columns |rn
123,'2013-03-01 06:05:24',PMPROD,12345678.99,1
124,'2013-03-01 06:15:17',PMPROD,88888888.99,2
125,'2013-03-01 06:25:24',PMPROD,33333333.33,3
126,'2013-03-01 06:38:14',PMPROD,44444444.44,4
127,'2013-03-01 07:12:04',PMPROD,55555555.55,1
128,'2013-03-01 10:38:14',PMPROD,44444444.44,1
129,'2013-03-01 10:56:14',PMPROD,22222222.22,2
130,'2013-03-01 15:28:02',PMPROD,66666666.66,1
2) Then in the main query we select only the rows with rn=1, in other words the rows that have the lowest timestamp in each hourly partition (the first row of each hour after sorting by timestamp).
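The join-back variant (group by hour, take MIN(timestamp), join to the table on that value) can be checked against the sample rows from the question. This is a sketch on SQLite through Python's sqlite3 module, not MySQL: strftime('%Y-%m-%d %H', ...) plays the role of DATE_FORMAT, and a fixed date parameter replaces CURRENT_DATE so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_readings ("
    " id INTEGER PRIMARY KEY, timestamp TEXT, source_id TEXT, source_value REAL)"
)
conn.executemany(
    "INSERT INTO source_readings VALUES (?, ?, ?, ?)",
    [
        (123, "2013-03-01 06:05:24", "PMPROD", 12345678.99),
        (124, "2013-03-01 06:15:17", "PMPROD", 88888888.99),
        (125, "2013-03-01 06:25:24", "PMPROD", 33333333.33),
        (126, "2013-03-01 06:38:14", "PMPROD", 44444444.44),
        (127, "2013-03-01 07:12:04", "PMPROD", 55555555.55),
        (128, "2013-03-01 10:38:14", "PMPROD", 44444444.44),
        (129, "2013-03-01 10:56:14", "PMPROD", 22222222.22),
        (130, "2013-03-01 15:28:02", "PMPROD", 66666666.66),
    ],
)

first_per_hour = [
    row[0]
    for row in conn.execute(
        """
        SELECT r.id
        FROM source_readings AS r
        JOIN (
            -- earliest timestamp of every hour of the chosen day
            SELECT strftime('%Y-%m-%d %H', timestamp) AS hour_bucket,
                   MIN(timestamp) AS min_ts
            FROM source_readings
            WHERE source_id = :source AND date(timestamp) = :day
            GROUP BY hour_bucket
        ) AS m ON r.timestamp = m.min_ts
        WHERE r.source_id = :source
        ORDER BY r.id
        """,
        {"source": "PMPROD", "day": "2013-03-01"},
    )
]
print(first_per_hour)
```

If two rows of the same source share the exact minimum timestamp of an hour, both are returned; the ROW_NUMBER() approach returns only one of them.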

MySQL query to count items by week for the current 52-weeks?

I have a query that I'd like to change so that it gives me the counts for the current 52 weeks. This query makes use of a calendar table I've made which contains a list of dates in a fixed range. The query as it stands is selecting max and min dates and not necessarily the last 52 weeks.
I'm wondering how to keep my calendar table current such that I can get the last 52-weeks (i.e, from right now to one year ago). Or is there another way to make the query independent of using a calendar table?
Here's the query:
SELECT calendar.datefield AS date, IFNULL(SUM(purchaseyesno),0) AS item_sales
FROM items_purchased join items on items_purchased.item_id=items.item_id
RIGHT JOIN calendar ON (DATE(items_purchased.purchase_date) = calendar.datefield)
WHERE (calendar.datefield BETWEEN (SELECT MIN(DATE(purchase_date))
FROM items_purchased) AND (SELECT MAX(DATE(purchase_date)) FROM items_purchased))
GROUP BY week(date)
thoughts?
Some people dislike this approach but I tend to use a dummy table that contains values from 0 - 1000 and then use a derived table to produce the ranges that are needed -
CREATE TABLE dummy (`num` INT NOT NULL);
INSERT INTO dummy VALUES (0), (1), (2), (3), (4), (5), .... (999), (1000);
If you have a table with an auto-incrementing id and plenty of rows you could generate it from that -
CREATE TABLE `dummy`
SELECT id AS `num` FROM `some_table` WHERE `id` <= 1000;
Just remember to insert the 0 value. A derived table over the dummy table then produces the dates of the last 365 days:
SELECT CURRENT_DATE - INTERVAL num DAY
FROM dummy
WHERE num < 365
So, applying this approach to your query you could do something like this -
SELECT WEEK(calendar.datefield) AS `week`, IFNULL(SUM(purchaseyesno),0) AS item_sales
FROM items_purchased join items on items_purchased.item_id=items.item_id
RIGHT JOIN (
SELECT (CURRENT_DATE - INTERVAL num DAY) AS datefield
FROM dummy
WHERE num < 365
) AS calendar ON (DATE(items_purchased.purchase_date) = calendar.datefield)
WHERE calendar.datefield >= (CURRENT_DATE - INTERVAL 1 YEAR)
GROUP BY week(datefield) -- shouldn't this be datefield instead of date?
I too typically "simulate" a table on the fly by using MySQL user variables (@variables) and joining to ANY table in your system that has AT least as many rows as the number of weeks you want. NOTE: when dealing with dates, I typically use the date part only, which implies a time of 12:00:00 am. Also, by advancing the start date by 7 days to get the "EndOfWeek", you can apply a BETWEEN clause for records within a given time period, such as your weekly needs.
I've applied such a sample to coordinate the join by date on a per-week basis:
select
DynamicCalendar.StartOfWeek,
COALESCE( SUM( IP.PurchaseYesNo ), 0 ) as Item_Sales
from
( select
@weekNum := @weekNum +1 as WeekNum,
@startDate as StartOfWeek,
@startDate := date_add( @startDate, interval 1 week ) EndOfWeek
from
( select @weekNum := 0,
@startDate := date(date_sub(now(), interval 1 year ))) sqlv,
AnyTableThatHasAtLeast52Records
limit
52 ) DynamicCalendar
LEFT JOIN items_purchased IP
on IP.Purchase_Date between DynamicCalendar.StartOfWeek
AND DynamicCalendar.EndOfWeek
group by
DynamicCalendar.StartOfWeek
This is under the premise that your "PurchaseYesNo" value is in your purchased table directly. If so, no need to join to the ITEMS table. If the field IS in the items table, then I would just tack on a LEFT JOIN for your items table and get value from that.
However you could use the dynamicCalendar context in MANY conditions.
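The generated-calendar idea can be sketched with a small numbers table: each number becomes the start of one week, and purchases are left-joined into the half-open range from that start up to, but not including, the next one, so weeks without sales still appear with 0. The sketch runs on SQLite via Python's sqlite3 module rather than MySQL; the fixed START date and the sample purchases are assumptions for reproducibility (in MySQL the start would be derived from CURRENT_DATE as in the answers above), and purchaseyesno is assumed to live in items_purchased.

```python
import sqlite3

START = "2023-01-02"  # first day of the 52-week window (fixed for the example)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dummy (num INTEGER NOT NULL)")
conn.executemany("INSERT INTO dummy VALUES (?)", [(n,) for n in range(52)])
conn.execute("CREATE TABLE items_purchased (purchase_date TEXT, purchaseyesno INTEGER)")
conn.executemany(
    "INSERT INTO items_purchased VALUES (?, ?)",
    [
        ("2023-01-03 14:00:00", 1),  # week 0
        ("2023-01-09 09:00:00", 1),  # week 1 (first day of that week)
        ("2023-01-10 18:30:00", 1),  # week 1
        ("2023-12-31 23:59:00", 1),  # week 51 (last day of the window)
        ("2024-01-05 12:00:00", 1),  # outside the window
    ],
)

weekly = conn.execute(
    """
    SELECT date(:start, '+' || (d.num * 7) || ' days') AS start_of_week,
           COALESCE(SUM(p.purchaseyesno), 0) AS item_sales
    FROM dummy AS d
    LEFT JOIN items_purchased AS p
           ON date(p.purchase_date) >= date(:start, '+' || (d.num * 7) || ' days')
          AND date(p.purchase_date) <  date(:start, '+' || ((d.num + 1) * 7) || ' days')
    WHERE d.num < 52
    GROUP BY d.num
    ORDER BY d.num
    """,
    {"start": START},
).fetchall()

print(len(weekly), weekly[0], weekly[1], weekly[2], weekly[51])
```

Using >= and < rather than BETWEEN keeps a purchase that falls exactly on a week boundary from being counted in two adjacent weeks.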