Since I launched a podcast recently, I wanted to analyse our download data. But some clients seem to send multiple requests, so I want to count only one request per IP and user agent every 15 minutes. The best thing I could come up with is the following query, which counts one request per IP and user agent per hour. Any ideas how to solve that problem in MySQL?
SELECT episode, podcast, DATE_FORMAT(date, '%d.%m.%Y %k') AS blurry_date, useragent, ip
FROM downloaddata
GROUP BY ip, useragent, blurry_date
This is the table I've got
CREATE TABLE `downloaddata` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`podcast` varchar(255) DEFAULT NULL,
`episode` int(4) DEFAULT NULL,
`source` varchar(255) DEFAULT NULL,
`useragent` varchar(255) DEFAULT NULL,
`referer` varchar(255) DEFAULT NULL,
`filetype` varchar(15) DEFAULT NULL,
`ip` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=216 DEFAULT CHARSET=utf8;
Personally I'd recommend collecting every request, and then only taking one per 15 minutes with a distinct query, or perhaps counting the number per 15 minutes.
If you are determined to throw data away so it can never be analysed, though: quick and simple is to just store the date and add an int column holding the 15-minute period of the day,
hour part of the time * 4 + minute part / 15 (integer division)
Date-part functions (HOUR() and MINUTE() in MySQL) are what you want to look up. Thing is, each time you want to record a request, you'll have to check whether that client already has one in the current 15-minute period. Extra work, extra complexity and less / lower-quality data...
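A sketch of that write-time approach (the bucket15 column, the key name and the sample values below are made up, and any existing duplicates would have to be removed before the unique key can be added):

-- Add a column holding the global 15-minute bucket of the request time.
ALTER TABLE downloaddata ADD COLUMN bucket15 int unsigned NOT NULL DEFAULT 0;

-- Backfill existing rows.
UPDATE downloaddata SET bucket15 = FLOOR(UNIX_TIMESTAMP(date) / (15 * 60));

-- Allow at most one row per IP, user agent, episode and 15-minute window.
ALTER TABLE downloaddata ADD UNIQUE KEY uniq_req_15min (ip, useragent, episode, bucket15);

-- For new requests, let the unique key silently drop repeats within the window.
INSERT IGNORE INTO downloaddata (date, podcast, episode, useragent, ip, bucket15)
VALUES (NOW(), 'examplecast', 7, 'ExamplePlayer/1.0', '203.0.113.5',
        FLOOR(UNIX_TIMESTAMP(NOW()) / (15 * 60)));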
MINUTE(date)/15 will give you the quarter hour (0-3). Make sure that value together with the date is unique (or ensure UNIX_TIMESTAMP(date)/(15*60) is unique).
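If you keep collecting every request instead (as recommended above) and de-duplicate at query time, a sketch of the full count could look like this (the subquery keeps one row per IP, user agent and 15-minute bucket; the bucket expression is the UNIX_TIMESTAMP variant from above):

SELECT episode, podcast, COUNT(*) AS downloads
FROM (
    -- one row per IP + user agent + 15-minute bucket
    SELECT episode, podcast, ip, useragent,
           FLOOR(UNIX_TIMESTAMP(date) / (15 * 60)) AS bucket
    FROM downloaddata
    GROUP BY episode, podcast, ip, useragent, bucket
) AS deduped
GROUP BY episode, podcast;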
I have a weather observation table with frequent entries - for simplicity, let's consider just the temperature observations. The observations can be somewhat sporadic, but sometimes up to half a dozen occur in each clock-hour interval. My goal is to run a procedure at, say, hourly intervals to find those historical hours that contain multiple temperature observations, compute the average temperature and time, and then replace all those observations with the single averaged observation.
I have managed to compose a MySQL query which creates an averaged temperature value for the hour interval (shown below), but I need assistance to take this one step further by actually replacing each hour's observation entries with the single new averaged entry.
SELECT stationcode, AVG(temperature) AS t_avg, COUNT(temperature) AS t_count,
       FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(obs_datetime))) AS datetime_avg,
       MINUTE(obs_datetime) AS minute, HOUR(obs_datetime) AS hour,
       DAY(obs_datetime) AS day, MONTH(obs_datetime) AS month, YEAR(obs_datetime) AS year
FROM obs_table
WHERE stationcode='AT301'
GROUP BY hour, day, month, year
HAVING COUNT(*) > 1
ORDER BY datetime_avg DESC
I am imagining that the solution might involve a join or a temporary table. Can anyone provide any sample code or hints as to how I can go about this?
Adding the following due to a request for the table structure:
--
-- Table structure for table `obs_table`
--
CREATE TABLE `obs_table` (
`rec_id` bigint(12) UNSIGNED NOT NULL,
`stationcode` varchar(8) NOT NULL,
`obs_datetime` datetime NOT NULL,
`temperature` float DEFAULT NULL,
`temp_dewpt` float DEFAULT NULL,
`rel_humidity` float DEFAULT NULL,
`wind_dir_degs` float DEFAULT NULL,
`wind_avg_kmh` float DEFAULT NULL,
`wind_gust_kmh` float DEFAULT NULL,
`pressure_hpa` float DEFAULT NULL,
`visibility_m` float DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`icon` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
--
-- Indexes for table `obs_table`
--
ALTER TABLE `obs_table`
ADD PRIMARY KEY (`rec_id`),
ADD UNIQUE KEY `stationcode` (`stationcode`,`obs_datetime`);
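One way to do what the question asks (a sketch only, with made-up names: it stages the hourly averages in a temporary table, deletes the original rows for those hours, then inserts one averaged row per hour; it assumes rec_id is AUTO_INCREMENT or otherwise filled in, and the other measurement columns of the new rows are left NULL):

-- 1) Averaged rows per station and hour (only hours with more than one observation).
CREATE TEMPORARY TABLE obs_hourly_avg AS
SELECT stationcode,
       FROM_UNIXTIME(ROUND(AVG(UNIX_TIMESTAMP(obs_datetime)))) AS obs_datetime,
       AVG(temperature) AS temperature,
       CAST(DATE_FORMAT(MIN(obs_datetime), '%Y-%m-%d %H:00:00') AS DATETIME) AS hour_start
FROM obs_table
WHERE stationcode = 'AT301'
GROUP BY stationcode, DATE_FORMAT(obs_datetime, '%Y-%m-%d %H')
HAVING COUNT(*) > 1;

-- 2) Remove the original observations in those hours.
DELETE o
FROM obs_table o
JOIN obs_hourly_avg a
  ON o.stationcode = a.stationcode
 AND o.obs_datetime >= a.hour_start
 AND o.obs_datetime <  a.hour_start + INTERVAL 1 HOUR;

-- 3) Re-insert the single averaged observation per hour.
INSERT INTO obs_table (stationcode, obs_datetime, temperature)
SELECT stationcode, obs_datetime, temperature
FROM obs_hourly_avg;

DROP TEMPORARY TABLE obs_hourly_avg;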
I have a MySQL DB with a table definition like this:
CREATE TABLE `minute_data` (
`date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`open` decimal(10,2) DEFAULT NULL,
`high` decimal(10,2) DEFAULT NULL,
`low` decimal(10,2) DEFAULT NULL,
`close` decimal(10,2) DEFAULT NULL,
`volume` decimal(10,2) DEFAULT NULL,
`adj_close` varchar(45) DEFAULT NULL,
`symbol` varchar(10) NOT NULL DEFAULT '',
PRIMARY KEY (`symbol`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
It stores 1 minute data points from the stock market. The primary key is a combination of the symbol and date columns. This way I always have only 1 data point for each symbol at any time.
I am wondering why the following query takes so long that I can't even wait for it to finish:
select distinct date from test.minute_data where date >= "2013-01-01"
order by date asc limit 100;
However, I can run select count(*) from minute_data; and that finishes very quickly.
I know that it must have something to do with the fact that there are over 374 million rows of data in the table, and my desktop computer is pretty far from a super computer.
Does anyone know something I can try to speed up this query? Do I need to abandon all hope of using a MySQL table this big?
Thanks a lot!
When you have a composite index on 2 columns, like your (symbol, date) primary key, searching and grouping by a prefix of the key will be fast. But searching for something that doesn't include the first column of the index requires scanning all rows or using some other index.
You can either change your primary key to (date, symbol), if you don't usually need to search for a symbol without a date, or add an additional index on date:
alter table minute_data add index (date)
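To illustrate the prefix rule against the minute_data table above (the queries and the 'AAPL' symbol are just examples, not from the original post):

-- These can seek on the (symbol, date) primary key: the WHERE clause covers a left prefix.
SELECT `date`, `close` FROM minute_data WHERE symbol = 'AAPL';
SELECT `date`, `close` FROM minute_data WHERE symbol = 'AAPL' AND `date` >= '2013-01-01';

-- This one cannot seek on (symbol, date), because date alone is not a left prefix;
-- that is why the extra index on date helps.
SELECT DISTINCT `date` FROM minute_data WHERE `date` >= '2013-01-01'
ORDER BY `date` ASC LIMIT 100;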
I am working with MySQL, querying a table that has 12 million records covering a year of data.
The query has to select a certain kind of data (currency, issuer, type, etc.) and then provide a daily average for certain fields of that data, so we can graph it afterwards.
The dream is to be able to do this in real time, i.e. with a response time under 10 seconds; however, at the moment it's not looking bright at all, as the query takes between 4 and 6 minutes.
For example, one of the WHERE filters comes up with 150k records, split into about 500 per day, and then we average three fields (which are not in the WHERE clause) using AVG() and GROUP BY.
Now, down to the raw data, the query is:
SELECT
`Valorizacion`.`fecha`, AVG(tir) AS `tir`, AVG(tirBase) AS `tirBase`, AVG(precioPorcentajeValorPar) AS `precioPorcentajeValorPar`
FROM `Valorizacion` USE INDEX (ix_mercado2)
WHERE
(Valorizacion.fecha >= '2011-07-17' ) AND
(Valorizacion.fecha <= '2012-07-18' ) AND
(Valorizacion.plazoResidual >= 365 ) AND
(Valorizacion.plazoResidual <= 3650000 ) AND
(Valorizacion.idMoneda_cache IN ('UF')) AND
(Valorizacion.idEmisorFusionado_cache IN ('ABN AMRO','WATTS', ...)) AND
(Valorizacion.idTipoRA_cache IN ('BB', 'BE', 'BS', 'BU'))
GROUP BY `Valorizacion`.`fecha` ORDER BY `Valorizacion`.`fecha` asc;
248 rows in set (4 min 28.82 sec)
The index is built over all the WHERE-clause fields, in the order
(fecha, idTipoRA_cache, idMoneda_cache, idEmisorFusionado_cache, plazoResidual)
Selecting just the "where" records, without the GROUP BY or AVG, takes
149670 rows in set (58.77 sec)
And selecting the records, grouping, and just doing a COUNT(*) instead of the averages takes
248 rows in set (35.15 sec)
That is probably because it doesn't need to go to disk to fetch the data; it's obtained directly from the index.
So as it stands I'm inclined to tell my boss "I'm sorry, but it can't be done", but before doing so I'm asking you whether you think there is something I could do to improve this. I think I could improve the index lookup time by moving the column with the biggest cardinality to the front of the index and so on, but even after that, the time it takes to access the disk for each record and compute the AVG seems too much.
Any ideas?
-- EDIT, the table structure
CREATE TABLE `Valorizacion` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`idInstrumento` int(11) NOT NULL,
`fecha` date NOT NULL,
`tir` decimal(10,4) DEFAULT NULL,
`tirBase` decimal(10,4) DEFAULT NULL,
`plazoResidual` double NOT NULL,
`duracionMacaulay` double DEFAULT NULL,
`duracionModACT365` double DEFAULT NULL,
`precioPorcentajeValorPar` decimal(20,15) DEFAULT NULL,
`valorPar` decimal(20,15) DEFAULT NULL,
`convexidad` decimal(20,15) DEFAULT NULL,
`volatilidad` decimal(20,15) DEFAULT NULL,
`montoCLP` double DEFAULT NULL,
`tirACT365` decimal(10,4) DEFAULT NULL,
`tipoVal` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idEmisorFusionado_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idMoneda_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idClasificacionRA_cache` int(11) DEFAULT NULL,
`idTipoRA_cache` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`fechaPrepagable_cache` date DEFAULT NULL,
`tasaEmision_cache` decimal(10,4) DEFAULT NULL,
PRIMARY KEY (`id`,`fecha`),
KEY `ix_FechaNemo` (`fecha`,`idInstrumento`) USING BTREE,
KEY `ix_mercado_stackover` (`idMoneda_cache`,`idTipoRA_cache`,`idEmisorFusionado_cache`,`plazoResidual`)
) ENGINE=InnoDB AUTO_INCREMENT=12933194 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Selecting 150K records out of 12M records and performing aggregate functions on them will not be fast no matter what you try to do.
You are probably dealing with primarily historical data as your sample query is for a year of data. A better approach may be to pre-calculate your daily averages and put them into separate tables. Then you may query those tables for reporting, graphs, etc. You will need to decide when and how to run such calculations so that you don't need to re-run them again on the same data.
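For example, a pre-aggregated daily table might look roughly like this (a sketch only: the table name, the choice of dimension columns and the daily refresh window are assumptions; it also ignores the plazoResidual range filter, which you would have to bucket or add as another dimension, and it assumes the *_cache columns are not NULL for the rows you report on):

CREATE TABLE Valorizacion_daily_avg (
    fecha date NOT NULL,
    idMoneda_cache varchar(20) NOT NULL,
    idTipoRA_cache varchar(20) NOT NULL,
    idEmisorFusionado_cache varchar(20) NOT NULL,
    avg_tir decimal(10,4) DEFAULT NULL,
    avg_tirBase decimal(10,4) DEFAULT NULL,
    avg_precioPorcentajeValorPar decimal(20,15) DEFAULT NULL,
    row_count int NOT NULL,
    PRIMARY KEY (fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache)
) ENGINE=InnoDB;

-- Run once per day (event or cron) for the previous day's data.
-- row_count lets you re-weight the averages if you later combine groups.
INSERT INTO Valorizacion_daily_avg
SELECT fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache,
       AVG(tir), AVG(tirBase), AVG(precioPorcentajeValorPar), COUNT(*)
FROM Valorizacion
WHERE fecha = CURDATE() - INTERVAL 1 DAY
GROUP BY fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache;

Reports and graphs then read from Valorizacion_daily_avg, which stays tiny compared to the raw table.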
When your requirement is to do analysis and reporting on millions of historical records you need to consider a data warehouse approach http://en.wikipedia.org/wiki/Data_warehouse rather than a simple database approach.
For reference, this is my current table:
CREATE TABLE `impression` (
`impressionid` bigint(19) unsigned NOT NULL AUTO_INCREMENT,
`creationdate` datetime NOT NULL,
`ip` int(4) unsigned DEFAULT NULL,
`canvas2d` tinyint(1) DEFAULT '0',
`canvas3d` tinyint(1) DEFAULT '0',
`websockets` tinyint(1) DEFAULT '0',
`useragentid` int(10) unsigned NOT NULL,
PRIMARY KEY (`impressionid`),
UNIQUE KEY `impressionsid_UNIQUE` (`impressionid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=447267 ;
It keeps a record of all the impressions on a certain page. After one day of running, it has gathered 447266 views. That's a lot of records.
Now I want the number of visitors per minute. I can easily get it like this:
SELECT COUNT( impressionid ) AS visits, DATE_FORMAT( creationdate, '%m-%d %H%i' ) AS DATE
FROM `impression`
GROUP BY DATE
This query takes a long time, of course. Right now around 56 seconds.
So I'm wondering what to do next. Do I:
Create an index on creationdate (I don't know if that will help, since I'm grouping on a function of that column)
Create new fields that store the hours and minutes separately.
The last one would mean duplicating data, and I hate that. But maybe it's the only way in this case?
Or should I go about it in some different way?
If you run this query often, you could denormalize the calculated value into a separate column (perhaps maintained by a trigger on insert/update) and then group by that (see the sketch below).
Your idea of hours and minutes is a good one too, since it lets you group in a few different ways other than just by minute. It's still denormalization, but it's more versatile.
Denormalization is fine, as long as it's justified and understood.
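A minimal sketch of that idea (the column, index and trigger names are made up; the point is that the GROUP BY then runs on a plain, indexed column instead of a DATE_FORMAT expression evaluated for every row):

ALTER TABLE impression
    ADD COLUMN creationminute datetime NULL,
    ADD INDEX idx_creationminute (creationminute);

-- Backfill existing rows: creationdate truncated to the minute.
UPDATE impression
SET creationminute = DATE_FORMAT(creationdate, '%Y-%m-%d %H:%i:00');

-- Keep it in sync for new rows.
CREATE TRIGGER impression_set_minute
BEFORE INSERT ON impression
FOR EACH ROW
SET NEW.creationminute = DATE_FORMAT(NEW.creationdate, '%Y-%m-%d %H:%i:00');

-- Visitors per minute now groups on an indexed column.
SELECT creationminute, COUNT(*) AS visits
FROM impression
GROUP BY creationminute;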
I have a table which collects data for web page performance. There are multiple machines testing multiple sites in 10-minute intervals, so currently I have about 700 000 rows (920 MB), with +/- 50 000 new rows daily.
Table source:
SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";
CREATE TABLE `http_perf_raw_log` (
`run_dt` int(11) DEFAULT NULL,
`dataset` varchar(64) DEFAULT NULL,
`runner` varchar(64) DEFAULT NULL,
`site` varchar(128) DEFAULT NULL,
`machine` varchar(32) DEFAULT NULL,
`called_url` varchar(1024) DEFAULT NULL,
`method` varchar(8) DEFAULT NULL,
`url` varchar(1024) DEFAULT NULL,
`content_type` varchar(64) DEFAULT NULL,
`http_code` int(11) DEFAULT NULL,
`header_size` int(11) DEFAULT NULL,
`request_size` int(11) DEFAULT NULL,
`filetime` int(11) DEFAULT NULL,
`ssl_verify_result` int(11) DEFAULT NULL,
`redirect_count` int(11) DEFAULT NULL,
`total_time` decimal(6,4) DEFAULT NULL,
`namelookup_time` decimal(6,4) DEFAULT NULL,
`connect_time` decimal(6,4) DEFAULT NULL,
`pretransfer_time` decimal(6,4) DEFAULT NULL,
`starttransfer_time` decimal(6,4) DEFAULT NULL,
`redirect_time` decimal(6,4) DEFAULT NULL,
`size_upload` int(11) DEFAULT NULL,
`size_download` int(11) DEFAULT NULL,
`speed_download` int(11) DEFAULT NULL,
`speed_upload` int(11) DEFAULT NULL,
`download_content_length` int(11) DEFAULT NULL,
`upload_content_length` int(11) DEFAULT NULL,
`certinfo` varchar(1024) DEFAULT NULL,
`request_header` varchar(1024) DEFAULT NULL,
`return_content` varchar(4096) DEFAULT NULL,
`return_headers` varchar(2048) DEFAULT NULL,
KEY `run_dt_idx` (`run_dt`),
KEY `dataset_idx` (`dataset`),
KEY `runner_idx` (`runner`),
KEY `site_idx` (`site`),
KEY `machine_idx` (`machine`),
KEY `total_time_idx` (`total_time`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
For aggregating stats (with 1 hour resolution), I created a view:
CREATE OR REPLACE VIEW http_perf_stats (dataset, runner, site, machine, day, hour, calls, total_time, namelookup_time, connect_time, pretransfer_time, starttransfer_time, size_download) AS
SELECT dataset, runner, site, machine,
DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
DATE_FORMAT(run_dt, '%k') AS hour,
COUNT(*) AS calls,
SUM(total_time),
SUM(namelookup_time),
SUM(connect_time),
SUM(pretransfer_time),
SUM(starttransfer_time),
SUM(size_download)
FROM http_perf_raw_log GROUP BY runner, site, machine, day, hour ORDER BY `day` DESC
But the performance of the VIEW (and the underlying SELECT) is terrible; it takes about 4 seconds.
So, my questions:
1. Is using GROUP BY in a VIEW a good idea at all? And if not, what is a better alternative?
2. Is there (I imagine yes, I am not an SQL expert :/) a way to optimize this SELECT (by changing the query or the structure of http_perf_raw_log)?
Remove the GROUP BY from the VIEW and use it in the SELECT that calls the VIEW.
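For instance (a sketch; the http_perf_rows view name and the trimmed column list are assumptions), the view becomes a plain per-row projection and the caller does the grouping:

CREATE OR REPLACE VIEW http_perf_rows AS
SELECT dataset, runner, site, machine,
       DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
       DATE_FORMAT(run_dt, '%k') AS hour,
       total_time, namelookup_time, connect_time,
       pretransfer_time, starttransfer_time, size_download
FROM http_perf_raw_log;

SELECT runner, site, machine, day, hour,
       COUNT(*) AS calls,
       SUM(total_time) AS total_time,
       SUM(size_download) AS size_download
FROM http_perf_rows
GROUP BY runner, site, machine, day, hour;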
In this case it might be a good idea to only create statistics periodically (once per hour for example).
I'd do that as follows. Run the following code once to create a table structure.
CREATE TABLE http_perf_stats AS
SELECT dataset, runner, site, machine,
    DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
    DATE_FORMAT(run_dt, '%k') AS hour,
    COUNT(*) AS calls,
    SUM(total_time) AS total_time,
    SUM(namelookup_time) AS namelookup_time,
    SUM(connect_time) AS connect_time,
    SUM(pretransfer_time) AS pretransfer_time,
    SUM(starttransfer_time) AS starttransfer_time,
    SUM(size_download) AS size_download
FROM http_perf_raw_log
GROUP BY runner, site, machine, day, hour
ORDER BY `day` DESC
Make some modifications like changing field types, default values, adding a primary key, and perhaps adding some indexes, so that you can access and query this table in a fast way.
From then on, update the table like this:
START TRANSACTION;
DELETE FROM http_perf_stats;
INSERT INTO http_perf_stats
SELECT dataset, runner, site, machine,
DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
DATE_FORMAT(run_dt, '%k') AS hour,
COUNT(*) AS calls,
SUM(total_time),
SUM(namelookup_time),
SUM(connect_time),
SUM(pretransfer_time),
SUM(starttransfer_time),
SUM(size_download)
FROM http_perf_raw_log
GROUP BY runner, site, machine, day, hour
ORDER BY `day` DESC;
COMMIT;
Several ways to do this:
Create a MySQL event (see http://dev.mysql.com/doc/refman/5.1/en/create-event.html); that's how I would do it (see the sketch after this list)
Create a cron job (Unix-flavoured systems) or a Windows scheduled task
Do a "lazy" update: when somebody requests this list, run the code above if the last time it ran was more than x minutes/hours ago. That way it works more like a cache: slow on the first request, fast afterwards. And you won't slow the server down unless somebody is actually interested in this data.
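A sketch of the first option, the MySQL event (the event name is made up; it assumes the event scheduler is enabled with SET GLOBAL event_scheduler = ON, that http_perf_stats exists as created above, and the DELIMITER lines are only needed in the mysql command-line client):

DELIMITER //
CREATE EVENT refresh_http_perf_stats
ON SCHEDULE EVERY 1 HOUR
DO
BEGIN
  -- same refresh as above, wrapped in the event body
  DELETE FROM http_perf_stats;
  INSERT INTO http_perf_stats
  SELECT dataset, runner, site, machine,
         DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
         DATE_FORMAT(run_dt, '%k') AS hour,
         COUNT(*), SUM(total_time), SUM(namelookup_time), SUM(connect_time),
         SUM(pretransfer_time), SUM(starttransfer_time), SUM(size_download)
  FROM http_perf_raw_log
  GROUP BY runner, site, machine, day, hour;
END //
DELIMITER ;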
The view is just another SELECT query, abstracted away to make querying the result set easier. If the underlying SELECT is slow, so is the view. And reading through and summing 1 GB of data in four seconds doesn't sound slow at all to me.