Reduce historical time series by averaging over time intervals - mysql

I have a weather observation table with frequent entries; for simplicity, let's consider just the temperature observations. The observations can be somewhat sporadic, but sometimes up to half a dozen occur within a single clock-hour interval. My goal is to run a procedure at, say, hourly intervals that finds historical hours containing multiple temperature observations, computes the average temperature and time, and then replaces all those observations with the single averaged observation.
I have managed to compose a MySQL query that produces an averaged temperature value for each hour interval (shown below), but I need help taking this one step further: actually replacing each hour's observation entries with the single new averaged entry.
SELECT stationcode, AVG(temperature) AS t_avg, COUNT(temperature) AS t_count,
       FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(obs_datetime))) AS datetime_avg,
       MINUTE(obs_datetime) AS minute, HOUR(obs_datetime) AS hour,
       DAY(obs_datetime) AS day, MONTH(obs_datetime) AS month, YEAR(obs_datetime) AS year
FROM obs_table
WHERE stationcode='AT301'
GROUP BY hour, day, month, year
HAVING COUNT(*) > 1
ORDER BY datetime_avg DESC
I imagine the solution might involve a join or a temporary table. Can anyone provide sample code or hints on how I can go about this?
Adding the following in response to a request for the table structure:
--
-- Table structure for table `obs_table`
--
CREATE TABLE `obs_table` (
`rec_id` bigint(12) UNSIGNED NOT NULL,
`stationcode` varchar(8) NOT NULL,
`obs_datetime` datetime NOT NULL,
`temperature` float DEFAULT NULL,
`temp_dewpt` float DEFAULT NULL,
`rel_humidity` float DEFAULT NULL,
`wind_dir_degs` float DEFAULT NULL,
`wind_avg_kmh` float DEFAULT NULL,
`wind_gust_kmh` float DEFAULT NULL,
`pressure_hpa` float DEFAULT NULL,
`visibility_m` float DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`icon` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
--
-- Indexes for table `obs_table`
--
ALTER TABLE `obs_table`
ADD PRIMARY KEY (`rec_id`),
ADD UNIQUE KEY `stationcode` (`stationcode`,`obs_datetime`);
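One possible way to take the query above the extra step, sketched here rather than taken from an accepted answer: average each multi-observation hour into a temporary table, delete the original rows for those hours, and insert the averaged rows back. The temporary table name obs_hourly_avg is made up for this example; it assumes rec_id is populated automatically (e.g. AUTO_INCREMENT, which the dump above does not show) and that the other sensor columns may be left NULL in the averaged row, so adapt as needed.
START TRANSACTION;

-- 1. Average each hour that holds more than one observation.
CREATE TEMPORARY TABLE obs_hourly_avg AS
SELECT stationcode,
       DATE_FORMAT(obs_datetime, '%Y-%m-%d %H') AS hour_bucket,
       FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(obs_datetime))) AS obs_datetime,
       AVG(temperature) AS temperature
FROM obs_table
WHERE stationcode = 'AT301'
GROUP BY stationcode, hour_bucket
HAVING COUNT(*) > 1;

-- 2. Delete the original observations belonging to those hours.
DELETE o
FROM obs_table AS o
JOIN obs_hourly_avg AS a
  ON a.stationcode = o.stationcode
 AND a.hour_bucket = DATE_FORMAT(o.obs_datetime, '%Y-%m-%d %H');

-- 3. Re-insert a single averaged observation per hour.
INSERT INTO obs_table (stationcode, obs_datetime, temperature)
SELECT stationcode, obs_datetime, temperature
FROM obs_hourly_avg;

DROP TEMPORARY TABLE obs_hourly_avg;
COMMIT;
Running the whole thing inside a transaction keeps the delete and re-insert atomic, so a failure between the two steps cannot lose the hour's data.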

Related

How do I structure sensor data in SQL?

I want to store the sensor data from several weather stations in an SQL database so that it can be viewed through a Django web page.
To keep the explanation simple, I read a few sensors (bools and float values) from each weather station every few minutes. I also want to store the timestamp of each reading.
What is the best way to structure this data in an SQL database? I would like to keep the system running for years, so it has to be able to store several hundred thousand values. I also need to read these values for display in graphs.
For a start, you can have two tables: stations and readings.
The stations table has an auto-increment id field and any other info about the stations that you need or have, e.g.:
CREATE TABLE `stations` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `name` VARCHAR(255) NOT NULL DEFAULT '0',
  `lat` DOUBLE NOT NULL DEFAULT '0',
  `lng` DOUBLE NOT NULL DEFAULT '0',
  -- ...other things
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;
The readings table contains entries for a single report from a station at a given time (I'm guessing the values will be averaged over a few minutes):
CREATE TABLE `readings` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `time` TIMESTAMP NOT NULL,
  `id_station` INT UNSIGNED NOT NULL,
  `temp` DOUBLE NULL,
  `humidity` DOUBLE NULL,
  `wind_speed` DOUBLE NULL,
  `wind_dir` DOUBLE NULL,
  `pressure` DOUBLE NULL,
  `squirrel_count` DOUBLE NULL,
  -- ...other things
  PRIMARY KEY (`id`),
  INDEX `time` (`time`),
  INDEX `id_station` (`id_station`)
) ENGINE=InnoDB;
Depending on the number of readings and how 'big' your server is, you might be able to use this table directly to aggregate (e.g. daily) and create a chart, or you might need to pre-aggregate the data in another table for reporting.
E.g.: once a day a script or stored procedure runs a query which aggregates the data for the previous day and inserts it into another table. The second table will be almost the same as the first, except that it will contain only daily averages (instead of few-minute-ish ones).
You can use pre-aggregation to create multiple tables with different granularity (hourly, daily, weekly...) as needed for your reports. How many and which ones you will need depends on how fast you want it to run and on the hardware you have.
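A minimal sketch of that daily roll-up, assuming a summary table named readings_daily (a name made up here, not part of the answer) and the readings table defined above:
-- Hypothetical summary table: one row per station per day.
CREATE TABLE readings_daily (
  `day` DATE NOT NULL,
  `id_station` INT UNSIGNED NOT NULL,
  `temp_avg` DOUBLE NULL,
  `humidity_avg` DOUBLE NULL,
  PRIMARY KEY (`day`, `id_station`)
) ENGINE=InnoDB;

-- Run once a day (from cron or a MySQL EVENT) to roll up the previous day.
INSERT INTO readings_daily (`day`, id_station, temp_avg, humidity_avg)
SELECT DATE(`time`), id_station, AVG(temp), AVG(humidity)
FROM readings
WHERE `time` >= CURDATE() - INTERVAL 1 DAY
  AND `time` < CURDATE()
GROUP BY DATE(`time`), id_station;
Long-range charts then read readings_daily instead of scanning the raw readings table.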

MySQL Query Optimization on a Big Table

I am working with MySQL, querying a table that has 12 million records covering a year of data.
The query has to select a certain kind of data (currency, company, type, etc.) and then provide a daily average for certain fields of that data, so we can graph it afterwards.
The dream is to be able to do this in real time, i.e. with a response time under 10 seconds; however, at the moment it's not looking bright at all, as it takes between 4 and 6 minutes.
For example, one of the WHERE filters comes up with 150k records, split roughly 500 per day, and then we average three fields (which are not in the WHERE clause) using AVG() and GROUP BY.
Now, on the raw data, the query is
SELECT
`Valorizacion`.`fecha`, AVG(tir) AS `tir`, AVG(tirBase) AS `tirBase`, AVG(precioPorcentajeValorPar) AS `precioPorcentajeValorPar`
FROM `Valorizacion` USE INDEX (ix_mercado2)
WHERE
(Valorizacion.fecha >= '2011-07-17' ) AND
(Valorizacion.fecha <= '2012-07-18' ) AND
(Valorizacion.plazoResidual >= 365 ) AND
(Valorizacion.plazoResidual <= 3650000 ) AND
(Valorizacion.idMoneda_cache IN ('UF')) AND
(Valorizacion.idEmisorFusionado_cache IN ('ABN AMRO','WATTS', ...)) AND
(Valorizacion.idTipoRA_cache IN ('BB', 'BE', 'BS', 'BU'))
GROUP BY `Valorizacion`.`fecha` ORDER BY `Valorizacion`.`fecha` asc;
248 rows in set (4 min 28.82 sec)
The index is made over all the WHERE clause fields, in the order
(fecha, idTipoRA_cache, idMoneda_cache, idEmisorFusionado_cache, plazoResidual)
Selecting the "where" registers, without using group by or AVG
149670 rows in set (58.77 sec)
And selecting the records, grouping, and doing just a COUNT(*) instead of the averages takes
248 rows in set (35.15 sec)
That is probably because it doesn't need to go to disk to fetch the data; it is obtained directly from the index.
So as far as it goes, I'm inclined to tell my boss "I'm sorry, but it can't be done", but before doing so I'm asking you whether there is something I could do to improve this. I think I could improve the index lookup time by moving the column with the biggest cardinality to the front of the index and so on, but even after that, the time it takes to access the disk for each record and compute the AVG seems too much.
Any ideas?
-- EDIT, the table structure
CREATE TABLE `Valorizacion` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`idInstrumento` int(11) NOT NULL,
`fecha` date NOT NULL,
`tir` decimal(10,4) DEFAULT NULL,
`tirBase` decimal(10,4) DEFAULT NULL,
`plazoResidual` double NOT NULL,
`duracionMacaulay` double DEFAULT NULL,
`duracionModACT365` double DEFAULT NULL,
`precioPorcentajeValorPar` decimal(20,15) DEFAULT NULL,
`valorPar` decimal(20,15) DEFAULT NULL,
`convexidad` decimal(20,15) DEFAULT NULL,
`volatilidad` decimal(20,15) DEFAULT NULL,
`montoCLP` double DEFAULT NULL,
`tirACT365` decimal(10,4) DEFAULT NULL,
`tipoVal` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idEmisorFusionado_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idMoneda_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idClasificacionRA_cache` int(11) DEFAULT NULL,
`idTipoRA_cache` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`fechaPrepagable_cache` date DEFAULT NULL,
`tasaEmision_cache` decimal(10,4) DEFAULT NULL,
PRIMARY KEY (`id`,`fecha`),
KEY `ix_FechaNemo` (`fecha`,`idInstrumento`) USING BTREE,
KEY `ix_mercado_stackover` (`idMoneda_cache`,`idTipoRA_cache`,`idEmisorFusionado_cache`,`plazoResidual`)
) ENGINE=InnoDB AUTO_INCREMENT=12933194 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
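One thing worth testing before giving up, not something from the answer below but a direct follow-up to the observation that the COUNT(*) variant is served from the index, is a covering index that also carries the three averaged columns, so the whole query can be resolved without any row lookups. The index name ix_cobertura is made up here.
-- Hypothetical covering index: the leading columns serve the WHERE filters,
-- and tir / tirBase / precioPorcentajeValorPar are appended only so the
-- query can be answered entirely from the index (no row lookups).
ALTER TABLE Valorizacion
  ADD INDEX ix_cobertura (fecha, idMoneda_cache, idTipoRA_cache,
                          idEmisorFusionado_cache, plazoResidual,
                          tir, tirBase, precioPorcentajeValorPar);
Then run the query without the USE INDEX hint and check EXPLAIN for "Using index" in the Extra column; whether this gets anywhere near the 10-second goal still has to be measured.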
Selecting 150K records out of 12M records and performing aggregate functions on them will not be fast no matter what you try to do.
You are probably dealing with primarily historical data as your sample query is for a year of data. A better approach may be to pre-calculate your daily averages and put them into separate tables. Then you may query those tables for reporting, graphs, etc. You will need to decide when and how to run such calculations so that you don't need to re-run them again on the same data.
When your requirement is to do analysis and reporting on millions of historical records, you should consider a data warehouse approach (http://en.wikipedia.org/wiki/Data_warehouse) rather than a simple database approach.
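A sketch of that pre-calculation, with a made-up summary table Valorizacion_diaria (and index name to match). Sums and counts are stored instead of averages so that averaging over any subset of companies stays exact; it assumes tir, tirBase and precioPorcentajeValorPar are non-NULL for the rows of interest, and a range filter like plazoResidual would have to be fixed at roll-up time or built into the summary key.
-- Hypothetical daily summary: one row per date and filter combination.
CREATE TABLE Valorizacion_diaria (
  fecha DATE NOT NULL,
  idMoneda_cache VARCHAR(20),
  idTipoRA_cache VARCHAR(20),
  idEmisorFusionado_cache VARCHAR(20),
  n INT NOT NULL,
  tir_sum DECIMAL(24,4),
  tirBase_sum DECIMAL(24,4),
  precioPorcentajeValorPar_sum DECIMAL(34,15),
  KEY ix_dia (fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache)
) ENGINE=InnoDB;

-- Nightly job: roll up yesterday's rows once.
INSERT INTO Valorizacion_diaria
  (fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache,
   n, tir_sum, tirBase_sum, precioPorcentajeValorPar_sum)
SELECT fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache,
       COUNT(*), SUM(tir), SUM(tirBase), SUM(precioPorcentajeValorPar)
FROM Valorizacion
WHERE fecha = CURDATE() - INTERVAL 1 DAY
GROUP BY fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache;

-- Reports then read the small table instead of scanning 12M rows:
SELECT fecha,
       SUM(tir_sum) / SUM(n) AS tir,
       SUM(tirBase_sum) / SUM(n) AS tirBase,
       SUM(precioPorcentajeValorPar_sum) / SUM(n) AS precioPorcentajeValorPar
FROM Valorizacion_diaria
WHERE fecha BETWEEN '2011-07-17' AND '2012-07-18'
  AND idMoneda_cache IN ('UF')
  AND idTipoRA_cache IN ('BB', 'BE', 'BS', 'BU')
  -- plus idEmisorFusionado_cache IN (...) as needed
GROUP BY fecha
ORDER BY fecha;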

MySQL Query list

I'm going to try to explain this the best I can; I will quickly provide more information if needed.
I'm storing data for each hour, in military time. I only need to store a day's worth of data. My table structure is below:
CREATE TABLE `onlinechart` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`user` varchar(100) DEFAULT NULL,
`daytime` varchar(10) DEFAULT NULL,
`maxcount` smallint(20) DEFAULT NULL,
`lastupdate` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=innodb AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
The "user" column is unique to each user. So I will have list for each user.
The "daytime" column I'm having it store the day and hour together. So as for today and hour it would be "2116" so the day is 21 and the current hour is 16.
The "maxcount" column is what data for each hour. I'm tracking just one total number each hour.
The "lastupdate" column is just a timestamp im using to delete data that is 24 hours+ old.
I have the script running in PHP fine for the tracking. It keeps a total of 24 rows of data for each user and deletes anything older then 24hours. My problem is how would I go about a query that would start from the current hour/day and pull that past 24 hours maxcount and display them in order.
Thanks
You will run into an issue handling this at the end of the month (and year). It's advisable that you switch to MySQL's native timestamp type (described here: http://dev.mysql.com/doc/refman/5.0/en/datetime.html). Then you can grab the maxcount values by doing something such as:
SELECT * FROM onlinechart WHERE daytime >= ? ORDER BY maxcount
The question mark should be replaced by the current timestamp minus 86400 (the number of seconds in a day).
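A small sketch of what that looks like once daytime is a native DATETIME/TIMESTAMP column; the 'some_user' value is just a placeholder, and the ordering here is by time, which is usually what a rolling 24-hour chart needs:
-- Assumes daytime has been converted to DATETIME (one row per user per hour).
SELECT `user`, daytime, maxcount
FROM onlinechart
WHERE `user` = 'some_user'                -- placeholder value
  AND daytime >= NOW() - INTERVAL 24 HOUR
ORDER BY daytime;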

MySQL: Precision of a Datefield

Since I launched a podcast recently, I wanted to analyse our download data. But some clients seem to send multiple requests, so I want to count only one request per IP and User-Agent every 15 minutes. The best thing I could come up with is the following query, which counts one request per IP and User-Agent every hour. Any ideas how to solve that problem in MySQL?
SELECT episode, podcast, DATE_FORMAT(date, '%d.%m.%Y %k') as blurry_date, useragent, ip FROM downloaddata GROUP BY ip, useragent, blurry_date
This is the table I've got
CREATE TABLE `downloaddata` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`podcast` varchar(255) DEFAULT NULL,
`episode` int(4) DEFAULT NULL,
`source` varchar(255) DEFAULT NULL,
`useragent` varchar(255) DEFAULT NULL,
`referer` varchar(255) DEFAULT NULL,
`filetype` varchar(15) DEFAULT NULL,
`ip` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=216 DEFAULT CHARSET=utf8;
Personally I'd recommend collecting every request, and then only taking one every 15 minutes with a DISTINCT query, or perhaps counting the number every 15 minutes.
If you are determined to throw data away so it can never be analysed, though:
Quick and simple is to store just the date and have an int column holding the 15-minute period of the day,
hour part of the time * 4 + minute part / 15
Date-part functions (HOUR(), MINUTE()) are what you want to look up. The thing is, each time you want to record a request you'll have to check whether that client already has an entry in the current 15-minute period. Extra work, extra complexity, and less / lower-quality data...
FLOOR(MINUTE(date)/15) will give you the quarter hour (0-3). Ensure that this, together with the date and hour, is unique (or ensure FLOOR(UNIX_TIMESTAMP(date)/(15*60)) is unique).
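A sketch of the "collect every request, deduplicate at query time" idea from the first answer, combined with the UNIX_TIMESTAMP bucket above; the per-episode totals count each IP/User-Agent at most once per quarter hour (column names follow the downloaddata table):
-- Inner query: one row per episode / ip / useragent / 15-minute bucket.
-- Outer query: deduplicated download counts per episode.
SELECT episode, podcast, COUNT(*) AS downloads
FROM (
  SELECT DISTINCT episode, podcast, ip, useragent,
         FLOOR(UNIX_TIMESTAMP(`date`) / (15 * 60)) AS bucket
  FROM downloaddata
) AS dedup
GROUP BY episode, podcast;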

Counting records of a large table based on date format

For reference, this is my current table:
CREATE TABLE `impression` (
`impressionid` bigint(19) unsigned NOT NULL AUTO_INCREMENT,
`creationdate` datetime NOT NULL,
`ip` int(4) unsigned DEFAULT NULL,
`canvas2d` tinyint(1) DEFAULT '0',
`canvas3d` tinyint(1) DEFAULT '0',
`websockets` tinyint(1) DEFAULT '0',
`useragentid` int(10) unsigned NOT NULL,
PRIMARY KEY (`impressionid`),
UNIQUE KEY `impressionsid_UNIQUE` (`impressionid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=447267 ;
It keeps a record of all the impressions on a certain page. After one day of running, it has gathered 447266 views. That is a lot of records.
Now I want the number of visitors per minute. I can easily get that like this:
SELECT COUNT( impressionid ) AS visits, DATE_FORMAT( creationdate, '%m-%d %H%i' ) AS DATE
FROM `impression`
GROUP BY DATE
This query takes a long time, of course. Right now around 56 seconds.
So I'm wondering what to do next. Do I:
Create an index on creationdate (I don't know if that will help, since I'm grouping by a function of that column)
Create new fields that store hours and minutes separately.
The last one would cause there to be duplicate data, and I hate that. But maybe it's the only way in this case?
Or should I go about it in some different way?
If you run this query often, you could denormalize the calculated value into a separate column (perhaps maintained by a trigger on insert/update) and then group by that.
Your idea of hours and minutes is a good one too, since it lets you group a few different ways other than just minutes. It's still denormalization, but it's more versatile.
Denormalization is fine, as long as it's justified and understood.
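For example, a sketch only: the column, index and trigger names below are made up, and the trigger suggestion from the answer is used to keep the new column in sync.
-- Hypothetical minute-bucket column, filled by a trigger so the report
-- query can group on (and index) it directly.
ALTER TABLE impression
  ADD COLUMN creation_minute CHAR(10) NOT NULL DEFAULT '',
  ADD INDEX ix_creation_minute (creation_minute);

-- Backfill existing rows once.
UPDATE impression SET creation_minute = DATE_FORMAT(creationdate, '%m-%d %H%i');

DELIMITER //
CREATE TRIGGER impression_set_minute
BEFORE INSERT ON impression
FOR EACH ROW
BEGIN
  SET NEW.creation_minute = DATE_FORMAT(NEW.creationdate, '%m-%d %H%i');
END//
DELIMITER ;

-- The per-minute report then groups on the plain indexed column:
SELECT COUNT(impressionid) AS visits, creation_minute AS `DATE`
FROM impression
GROUP BY creation_minute;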