I inherited a system that keeps track of temperature data over time. I had asked a previous question about it: (What is the most efficient way to store a collection of temperature values into MYSQL?)
The system has a separate table which is used to keep track of dates (shown below). It contains several descriptor columns for each day. I am skeptical of the benefits this kind of structure provides, as it seems to add extra weight to do the same thing a few date functions and a bit of math can do.
I was told by the creator of the system that it is better to select a range of data by using the DATE_ID with operators instead of a date function.
For example: Let's say you want to collect all temperature information from June 1st, 2012 till the end of 2012, you could do the following.
1) Get the date ID that corresponds to June 1st, 2012. Let's say the ID was 23000
2) Get the date ID that corresponds to the end of the year by using something like:
SELECT DATE_ID FROM DATE_REPRESENTATION WHERE DATE_ID >= 23000 AND END_YEAR_FLAG = 1 LIMIT 1;
Let's say that one was 23213
3) Now we would have 2 DATE_IDs, which we could just use like so:
SELECT * FROM temperature_readings WHERE DATE_ID BETWEEN 23000 AND 23213;
I feel that it might be better to properly index the 'temperature_readings' table and use date functions. For example:
SELECT ...... actual_date BETWEEN DATE('2012-06-01') AND LAST_DAY(DATE_ADD(DATE('2012-06-01'), INTERVAL (12 - MONTH(DATE('2012-06-01'))) MONTH))
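To make it concrete, this is roughly what I have in mind (just a sketch; it assumes temperature_readings has an actual_date DATE/DATETIME column that can be indexed):

-- hypothetical index plus the simplified range query
ALTER TABLE temperature_readings
    ADD INDEX idx_actual_date (actual_date);

SELECT *
FROM temperature_readings
WHERE actual_date BETWEEN '2012-06-01' AND '2012-12-31';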
Is there a better solution than what is currently in use in terms of improving the overall performance? In the previous question, I mention that the system uses the data to produce graphs and alerts based on the data selected by date ranges (daily, weekly, monthly, yearly, or a range that a user can specify).
Current table:
CREATE TABLE `DATE_REPRESENTATION` (
`DATE_ID` int(10) NOT NULL,
`DAY_DATE` timestamp NULL DEFAULT NULL,
`DATE_DESC_LONG` varchar(18) DEFAULT NULL,
`MB_DATE_M_D_YYYY` varchar(18) DEFAULT NULL,
`WEEKDAY` varchar(9) DEFAULT NULL,
`WEEKDAY_ABBREV` char(4) DEFAULT NULL,
`WEEKDAY_NUM` decimal(1,0) DEFAULT NULL,
`WEEK` char(13) DEFAULT NULL,
`WEEK_NUM` decimal(4,0) DEFAULT NULL,
`WEEK_NUM_ABS` decimal(4,0) DEFAULT NULL,
`MONTH_LONG` varchar(9) DEFAULT NULL,
`MONTH_ABBREV` char(3) DEFAULT NULL,
`MONTH_NUM` decimal(2,0) DEFAULT NULL,
`MONTH_NUM_ABS` decimal(5,0) DEFAULT NULL,
`QUARTER` char(1) DEFAULT NULL,
`QUARTER_NUM` decimal(1,0) DEFAULT NULL,
`QUARTER_NUM_ABS` decimal(5,0) DEFAULT NULL,
`YEAR4` decimal(4,0) DEFAULT NULL,
`BEG_WEEK_FLAG` decimal(1,0) DEFAULT NULL,
`END_WEEK_FLAG` decimal(1,0) DEFAULT NULL,
`BEG_MONTH_FLAG` decimal(1,0) DEFAULT NULL,
`END_MONTH_FLAG` decimal(1,0) DEFAULT NULL,
`BEG_QUARTER_FLAG` decimal(1,0) DEFAULT NULL,
`END_QUARTER_FLAG` decimal(1,0) DEFAULT NULL,
`BEG_YEAR_FLAG` decimal(1,0) DEFAULT NULL,
`END_YEAR_FLAG` decimal(1,0) DEFAULT NULL,
PRIMARY KEY (`DATE_ID`),
UNIQUE KEY `DATEID_PK` (`DATE_ID`),
KEY `timeStampky` (`DAY_DATE`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
A DATE should be stored internally as just a number, so the only thing I can imagine is that the previous developer used to store dates as CHAR and suffered for it :)
When MySQL calculates the BETWEEN values, it will do that once, so there will be little math to be done. Add in the standard optimizations (preparing, parameterizing, indexing, etc), and you should be fine.
The formulas might be a little illegible. Maybe you could wrap them in a stored procedure, so you could call GET_LAST_DAY_OF_QUARTER(date) instead of putting all the date math in the SELECT.
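A minimal sketch of such a wrapper, written here as a stored function since it returns a value (the name and the exact date math are just illustrative):

CREATE FUNCTION GET_LAST_DAY_OF_QUARTER(d DATE) RETURNS DATE DETERMINISTIC
    -- jump from Jan 1 of d's year to the last month of d's quarter, then take that month's last day
    RETURN LAST_DAY(DATE_ADD(MAKEDATE(YEAR(d), 1), INTERVAL QUARTER(d) * 3 - 1 MONTH));

After that, the range query stays readable: WHERE actual_date BETWEEN '2012-06-01' AND GET_LAST_DAY_OF_QUARTER('2012-06-01').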
Related
I am processing a MySQL table with 40K rows. Current execution time is around 2 seconds with the table indexed. Could someone guide me on how to optimize this query and table better, and how to get rid of "Using where; Using temporary; Using filesort"? Any help is appreciated.
The GROUP BY will be for the following cases...
LS_CHG_DTE_OCR
LS_CHG_DTE_OCR/RES_STATE_HSE
LS_CHG_DTE_OCR/RES_STATE_HSE/RES_CITY_HSE
LS_CHG_DTE_OCR/RES_STATE_HSE/RES_CITY_HSE/POSTAL_CDE_HSE
Thanks in advance
SELECT DATE_FORMAT(`LS_CHG_DTE_OCR`, '%Y-%b') AS fmt_date,
SUM(IF(`TYPE`='Connect',COUNT_SUBS,0)) AS connects,
SUM(IF(`TYPE`='Disconnect',COUNT_SUBS,0)) AS disconnects,
SUM(IF(`TYPE`='Connect',ROUND(REV,2),0)) AS REV,
SUM(IF(`TYPE`='Upgrade',COUNT_SUBS,0)) AS upgrades,
SUM(IF(`TYPE`='Downgrade',COUNT_SUBS,0)) AS downgrades,
SUM(IF(`TYPE`='Upgrade',ROUND(REV,2),0)) AS upgradeRev FROM `hsd`
WHERE LS_CHG_DTE_OCR!='' GROUP BY MONTH(LS_CHG_DTE_OCR) ORDER BY LS_CHG_DTE_OCR ASC
CREATE TABLE `hsd` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`SYS_OCR` varchar(255) DEFAULT NULL,
`PRIN_OCR` varchar(255) DEFAULT NULL,
`SERV_CDE_OHI` varchar(255) DEFAULT NULL,
`DSC_CDE_OHI` varchar(255) DEFAULT NULL,
`LS_CHG_DTE_OCR` datetime DEFAULT NULL,
`SALESREP_OCR` varchar(255) DEFAULT NULL,
`CHANNEL` varchar(255) DEFAULT NULL,
`CUST_TYPE` varchar(255) DEFAULT NULL,
`LINE_BUS` varchar(255) DEFAULT NULL,
`ADDR1_HSE` varchar(255) DEFAULT NULL,
`RES_CITY_HSE` varchar(255) DEFAULT NULL,
`RES_STATE_HSE` varchar(255) DEFAULT NULL,
`POSTAL_CDE_HSE` varchar(255) DEFAULT NULL,
`ZIP` varchar(100) DEFAULT NULL,
`COUNT_SUBS` double DEFAULT NULL,
`REV` double DEFAULT NULL,
`TYPE` varchar(255) DEFAULT NULL,
`lat` varchar(100) DEFAULT NULL,
`long` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx` (`LS_CHG_DTE_OCR`,`CHANNEL`,`CUST_TYPE`,`LINE_BUS`,`RES_CITY_HSE`,`RES_STATE_HSE`,`POSTAL_CDE_HSE`,`ZIP`,`COUNT_SUBS`,`TYPE`)
) ENGINE=InnoDB AUTO_INCREMENT=402342 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
Using where; Using temporary; Using filesort
The only condition you apply is LS_CHG_DTE_OCR != ''. Other than that, you are doing a full table scan because of the aggregations. Index-wise you can't do much here.
I ran into the same problem. I had fully optimized my queries (I had joins and more conditions) but the table kept growing and with it query time. Finally I decided to mirror the data to ElasticSearch. In my case it cut down query time to about 1/20th to 1/100th (for different queries).
The only possible index for that SELECT is INDEX(LS_CHG_DTE_OCR), but it is unlikely to be used.
Perform the WHERE -- If there are a lot of '' values, then the index may be used for filtering.
GROUP BY MONTH(...) -- You might be folding the same month from multiple years. The Optimizer can't tell, so it will punt on using the index.
ORDER BY LS_CHG_DTE_OCR -- This is done after the GROUP BY; the ORDER BY can't be performed until the data is gathered -- too late for any index. However, if multiple years are folded together, you could get some strange results. Cure it by making the ORDER BY be the same as the GROUP BY. This will also prevent an extra sort that is caused by the GROUP BY and ORDER BY being different.
Yeah, if that idx you added has all the columns in the SELECT, then it is a "covering index". But it won't help any because of the comments above. "Using index" won't help a lot.
GROUP BY LS_CHG_DTE_OCR/RES_STATE_HSE -- Eh? Divide a DATETIME by a VARCHAR? That sounds like a disaster.
This table will grow even bigger over time, correct? Consider building and maintaining Summary Table(s) with month as part of the PRIMARY KEY.
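A rough sketch of what such a summary table could look like (table and column names are illustrative, not from your schema):

CREATE TABLE hsd_monthly_summary (
    yr_month CHAR(7)     NOT NULL,   -- e.g. '2013-05'
    `TYPE`   VARCHAR(20) NOT NULL,
    subs     DOUBLE      NOT NULL,
    rev      DOUBLE      NOT NULL,
    PRIMARY KEY (yr_month, `TYPE`)
) ENGINE=InnoDB;

INSERT INTO hsd_monthly_summary (yr_month, `TYPE`, subs, rev)
    SELECT DATE_FORMAT(LS_CHG_DTE_OCR, '%Y-%m'), `TYPE`,
           SUM(COUNT_SUBS), SUM(ROUND(REV, 2))
    FROM hsd
    WHERE LS_CHG_DTE_OCR != ''
    GROUP BY 1, 2
    ON DUPLICATE KEY UPDATE subs = VALUES(subs), rev = VALUES(rev);

The monthly report can then pivot on `TYPE` exactly as your current query does, but against far fewer rows.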
I am working with MySQL, querying a table that has 12 million records covering a year of the said data.
The query has to select a certain kind of data (currency, company, type, etc.) and then provide a daily average for certain fields of that data, so we can graph it afterwards.
The dream is to be able to do this in real time, with a response time of less than 10 seconds; however, at the moment it's not looking bright at all, as it's taking between 4 and 6 minutes.
For example, one of the WHERE queries comes up with 150k records, split roughly 500 per day, and then we average three fields (which are not in the WHERE clause) using AVG() and GROUP BY.
Now, on the raw data, the query is
SELECT
`Valorizacion`.`fecha`, AVG(tir) AS `tir`, AVG(tirBase) AS `tirBase`, AVG(precioPorcentajeValorPar) AS `precioPorcentajeValorPar`
FROM `Valorizacion` USE INDEX (ix_mercado2)
WHERE
(Valorizacion.fecha >= '2011-07-17' ) AND
(Valorizacion.fecha <= '2012-07-18' ) AND
(Valorizacion.plazoResidual >= 365 ) AND
(Valorizacion.plazoResidual <= 3650000 ) AND
(Valorizacion.idMoneda_cache IN ('UF')) AND
(Valorizacion.idEmisorFusionado_cache IN ('ABN AMRO','WATTS', ...)) AND
(Valorizacion.idTipoRA_cache IN ('BB', 'BE', 'BS', 'BU'))
GROUP BY `Valorizacion`.`fecha` ORDER BY `Valorizacion`.`fecha` asc;
248 rows in set (4 min 28.82 sec)
The index is made over all the where clause fields in the order
(fecha, idTipoRA_cache, idMoneda_cache, idEmisorFusionado_cache, plazoResidual)
Selecting the "where" registers, without using group by or AVG
149670 rows in set (58.77 sec)
And selecting the records, grouping, and just doing a COUNT(*) instead of the average takes
248 rows in set (35.15 sec)
This is probably because it doesn't need to go to disk to search for the data; it is obtained directly from the index.
So as far as it goes, I'm of a mind to tell my boss "I'm sorry, but it can't be done", but before doing so I come to you guys asking if you think there is something I could do to improve this. I think I could improve the index search time by moving the column with the biggest cardinality to the front of the index and so on, but even after that, the time it takes to access the disk for each record and do the AVG seems like too much.
Any ideas?
-- EDIT, the table structure
CREATE TABLE `Valorizacion` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`idInstrumento` int(11) NOT NULL,
`fecha` date NOT NULL,
`tir` decimal(10,4) DEFAULT NULL,
`tirBase` decimal(10,4) DEFAULT NULL,
`plazoResidual` double NOT NULL,
`duracionMacaulay` double DEFAULT NULL,
`duracionModACT365` double DEFAULT NULL,
`precioPorcentajeValorPar` decimal(20,15) DEFAULT NULL,
`valorPar` decimal(20,15) DEFAULT NULL,
`convexidad` decimal(20,15) DEFAULT NULL,
`volatilidad` decimal(20,15) DEFAULT NULL,
`montoCLP` double DEFAULT NULL,
`tirACT365` decimal(10,4) DEFAULT NULL,
`tipoVal` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idEmisorFusionado_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idMoneda_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idClasificacionRA_cache` int(11) DEFAULT NULL,
`idTipoRA_cache` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`fechaPrepagable_cache` date DEFAULT NULL,
`tasaEmision_cache` decimal(10,4) DEFAULT NULL,
PRIMARY KEY (`id`,`fecha`),
KEY `ix_FechaNemo` (`fecha`,`idInstrumento`) USING BTREE,
KEY `ix_mercado_stackover` (`idMoneda_cache`,`idTipoRA_cache`,`idEmisorFusionado_cache`,`plazoResidual`)
) ENGINE=InnoDB AUTO_INCREMENT=12933194 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Selecting 150K records out of 12M records and performing aggregate functions on them will not be fast no matter what you try to do.
You are probably dealing with primarily historical data as your sample query is for a year of data. A better approach may be to pre-calculate your daily averages and put them into separate tables. Then you may query those tables for reporting, graphs, etc. You will need to decide when and how to run such calculations so that you don't need to re-run them again on the same data.
When your requirement is to do analysis and reporting on millions of historical records you need to consider a data warehouse approach http://en.wikipedia.org/wiki/Data_warehouse rather than a simple database approach.
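A minimal sketch of that pre-calculation idea, keeping daily sums and counts so that averages can still be recombined later (all names below are illustrative, and this only works for filters that are part of the summary key; the plazoResidual range, for example, would need its own bucketing):

CREATE TABLE Valorizacion_daily (
    fecha                   DATE        NOT NULL,
    idMoneda_cache          VARCHAR(20) NOT NULL,
    idTipoRA_cache          VARCHAR(20) NOT NULL,
    idEmisorFusionado_cache VARCHAR(20) NOT NULL,
    cnt                     INT         NOT NULL,
    sum_tir                 DECIMAL(20,4),
    sum_tirBase             DECIMAL(20,4),
    PRIMARY KEY (fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache)
) ENGINE=InnoDB;

-- reporting query: the weighted average is rebuilt from the daily sums
SELECT fecha,
       SUM(sum_tir) / SUM(cnt)     AS tir,
       SUM(sum_tirBase) / SUM(cnt) AS tirBase
FROM Valorizacion_daily
WHERE fecha BETWEEN '2011-07-17' AND '2012-07-18'
  AND idMoneda_cache = 'UF'
GROUP BY fecha
ORDER BY fecha;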
Since I launched a podcast recently, I wanted to analyse our download data. But some clients seem to send multiple requests, so I want to count only one request per IP and User-Agent every 15 minutes. The best thing I could come up with is the following query, which counts one request per IP and User-Agent every hour. Any ideas how to solve that problem in MySQL?
SELECT episode, podcast, DATE_FORMAT(date, '%d.%m.%Y %k') as blurry_date, useragent, ip FROM downloaddata GROUP BY ip, useragent, blurry_date
This is the table I've got
CREATE TABLE `downloaddata` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`date` datetime NOT NULL,
`podcast` varchar(255) DEFAULT NULL,
`episode` int(4) DEFAULT NULL,
`source` varchar(255) DEFAULT NULL,
`useragent` varchar(255) DEFAULT NULL,
`referer` varchar(255) DEFAULT NULL,
`filetype` varchar(15) DEFAULT NULL,
`ip` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=216 DEFAULT CHARSET=utf8;
Personally I'd recommend collecting every request, and then only taking one every 15 mins with a distinct query, or perhaps counting the number every 15 mins.
If you are determined to throw data away so it can never be analysed, though:
Quick and simple is to store just the date and have an int column which holds the 15-minute period,
Hour part of current time * 4 + Minute part / 15
DatePart functions are what you want to look up. The thing is, each time you want to record a request, you'll have to check whether that client already has one in the current 15-minute period. Extra work, extra complexity and less / lower-quality data...
MINUTE(date) DIV 15 will give you the quarter hour (0-3). Ensure that, along with the date and hour, it is unique (or ensure UNIX_TIMESTAMP(date) DIV (15*60) is unique).
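For example, something along these lines keeps every raw request but reports only one per IP + User-Agent per 15-minute slot (a sketch; 900 = 15 * 60 seconds):

SELECT episode, podcast, COUNT(*) AS downloads
FROM (
    -- collapse duplicate requests within each 15-minute bucket
    SELECT DISTINCT episode, podcast, ip, useragent,
           FLOOR(UNIX_TIMESTAMP(`date`) / 900) AS slot
    FROM downloaddata
) AS one_per_slot
GROUP BY episode, podcast;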
I have a table which collects data for web pages performance. There are multiple machines, testing multiple sites in 10 minutes intervals, so currently I have about 700 000 rows (920 MB) with +/- 50 000 new rows daily.
Table source:
SET SQL_MODE="NO_AUTO_VALUE_ON_ZERO";
CREATE TABLE `http_perf_raw_log` (
`run_dt` int(11) DEFAULT NULL,
`dataset` varchar(64) DEFAULT NULL,
`runner` varchar(64) DEFAULT NULL,
`site` varchar(128) DEFAULT NULL,
`machine` varchar(32) DEFAULT NULL,
`called_url` varchar(1024) DEFAULT NULL,
`method` varchar(8) DEFAULT NULL,
`url` varchar(1024) DEFAULT NULL,
`content_type` varchar(64) DEFAULT NULL,
`http_code` int(11) DEFAULT NULL,
`header_size` int(11) DEFAULT NULL,
`request_size` int(11) DEFAULT NULL,
`filetime` int(11) DEFAULT NULL,
`ssl_verify_result` int(11) DEFAULT NULL,
`redirect_count` int(11) DEFAULT NULL,
`total_time` decimal(6,4) DEFAULT NULL,
`namelookup_time` decimal(6,4) DEFAULT NULL,
`connect_time` decimal(6,4) DEFAULT NULL,
`pretransfer_time` decimal(6,4) DEFAULT NULL,
`starttransfer_time` decimal(6,4) DEFAULT NULL,
`redirect_time` decimal(6,4) DEFAULT NULL,
`size_upload` int(11) DEFAULT NULL,
`size_download` int(11) DEFAULT NULL,
`speed_download` int(11) DEFAULT NULL,
`speed_upload` int(11) DEFAULT NULL,
`download_content_length` int(11) DEFAULT NULL,
`upload_content_length` int(11) DEFAULT NULL,
`certinfo` varchar(1024) DEFAULT NULL,
`request_header` varchar(1024) DEFAULT NULL,
`return_content` varchar(4096) DEFAULT NULL,
`return_headers` varchar(2048) DEFAULT NULL,
KEY `run_dt_idx` (`run_dt`),
KEY `dataset_idx` (`dataset`),
KEY `runner_idx` (`runner`),
KEY `site_idx` (`site`),
KEY `machine_idx` (`machine`),
KEY `total_time_idx` (`total_time`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
For aggregating stats (with 1 hour resolution), I created a view:
CREATE OR REPLACE VIEW http_perf_stats (dataset, runner, site, machine, day, hour, calls, total_time, namelookup_time, connect_time, pretransfer_time, starttransfer_time, size_download) AS
SELECT dataset, runner, site, machine,
DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
DATE_FORMAT(run_dt, '%k') AS hour,
COUNT(*) AS calls,
SUM(total_time),
SUM(namelookup_time),
SUM(connect_time),
SUM(pretransfer_time),
SUM(starttransfer_time),
SUM(size_download)
FROM http_perf_raw_log GROUP BY runner, site, machine, day, hour ORDER BY `day` DESC
But the performance of the VIEW (and the underlying SELECT) is terrible - it takes about 4 seconds.
So, my questions:
1. Is using GROUP BY in a VIEW a good idea at all? And if not, what is a better alternative?
2. Is there (I imagine yes, I am not an SQL expert :/) a way to optimize this SELECT (by changing the query or the structure of http_perf_raw_log)?
Remove the GROUP BY from the VIEW and use it in the SELECT that calls the VIEW.
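A sketch of what that restructuring could look like (only a couple of the summed columns shown, for brevity; run_dt is formatted exactly as in the original view):

CREATE OR REPLACE VIEW http_perf_flat AS
    SELECT dataset, runner, site, machine,
           DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
           DATE_FORMAT(run_dt, '%k')       AS hour,
           total_time, size_download
    FROM http_perf_raw_log;

SELECT dataset, runner, site, machine, day, hour,
       COUNT(*)           AS calls,
       SUM(total_time)    AS total_time,
       SUM(size_download) AS size_download
FROM http_perf_flat
GROUP BY dataset, runner, site, machine, day, hour;

This by itself won't make the aggregation cheaper; it just moves the GROUP BY to where the caller can see and control it.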
In this case it might be a good idea to only create statistics periodically (once per hour for example).
I'd do that as follows. Run the following code once to create a table structure.
CREATE TABLE http_perf_stats AS
SELECT dataset, runner, site, machine,
DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
DATE_FORMAT(run_dt, '%k') AS hour,
COUNT(*) AS calls,
SUM(total_time),
SUM(namelookup_time),
SUM(connect_time),
SUM(pretransfer_time),
SUM(starttransfer_time),
SUM(size_download)
FROM http_perf_raw_log
GROUP BY runner, site, machine, day, hour
ORDER BY `day` DESC
Make some modifications like changing field types, default values, adding a primary key, and perhaps adding some indexes, so that you can access and query this table in a fast way.
From then on, update the table like this:
START TRANSACTION;
DELETE FROM http_perf_stats;
INSERT INTO http_perf_stats
SELECT dataset, runner, site, machine,
DATE_FORMAT(run_dt, '%Y-%m-%d') AS day,
DATE_FORMAT(run_dt, '%k') AS hour,
COUNT(*) AS calls,
SUM(total_time),
SUM(namelookup_time),
SUM(connect_time),
SUM(pretransfer_time),
SUM(starttransfer_time),
SUM(size_download)
FROM http_perf_raw_log
GROUP BY runner, site, machine, day, hour
ORDER BY `day` DESC;
COMMIT;
Several ways to do this:
Create a MySQL event (see http://dev.mysql.com/doc/refman/5.1/en/create-event.html); that's how I would do it, and a sketch follows after this list
Create a cron job (Unix-flavoured systems) or Windows scheduler task
Do a "lazy" update. When somebody requests this list, run the code above if the last time it was run was longer than x minutes/hours ago. That way it works more like a cache: slow on the first request, fast after. But you won't slow the server down unless somebody is interested in this.
The view is just another SELECT query, abstracted away to make querying the result set easier. If the underlying SELECT is slow, so is the view. Reading through and summing up 1 GB of data in four seconds doesn't sound slow at all to me.
I will try to explain myself quickly.
I have a table called 'artikli' which has about 1M records.
On this table I run a lot of different queries, but one in particular is causing problems (long execution time) when ORDER BY is present.
This is my table structure:
CREATE TABLE IF NOT EXISTS artikli (
id int(11) NOT NULL,
name varchar(250) NOT NULL,
datum datetime NOT NULL,
kategorije_id int(11) default NULL,
id_valute int(11) default NULL,
podogovoru int(1) default '0',
cijena decimal(10,2) default NULL,
valuta int(1) NOT NULL default '0',
cijena_rezerva decimal(10,0) NOT NULL,
cijena_kupi decimal(10,0) default NULL,
cijena_akcija decimal(10,2) NOT NULL,
period int(3) NOT NULL default '30',
dostupnost enum('svugdje','samobih','samomojgrad','samomojkanton') default 'svugdje',
zemlja varchar(10) NOT NULL,
slike varchar(500) NOT NULL,
od_s varchar(34) default NULL,
od_id int(10) unsigned default NULL,
vrsta int(1) default '0',
trajanje datetime default NULL,
izbrisan int(1) default '0',
zakljucan int(1) default '0',
prijava int(3) default '0',
izdvojen decimal(1,0) NOT NULL default '0',
izdvojen_kad datetime NOT NULL,
izdvojen_datum datetime NOT NULL,
sajt int(1) default '0',
PRIMARY KEY (id),
KEY brend (brend),
KEY kanton (kanton),
KEY datum (datum),
KEY cijena (cijena),
KEY kategorije_id (kategorije_id,podogovoru,sajt,izdvojen,izdvojen_kad,datum)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
And this is the query:
SELECT artikli.datum as brojx,
artikli.izdvojen as i,
artikli.izdvojen_kad as ii,
artikli.cijena as cijena, artikli.name
FROM artikli
WHERE artikli.izbrisan=0 and artikli.prodano!=3
and artikli.zavrseno=0 and artikli.od_id!=0
and (artikli.sajt=0 or (artikli.sajt=1 and artikli.dostupnost='svugdje'))
and kategorije_id IN (18)
ORDER by i DESC, ii DESC, brojx DESC
LIMIT 0,20
What I want to do is avoid the filesort, which is very slow.
It would have been a big help if you'd provided the explain plan for the query.
Why do you think it's the filesort which is causing the problem? Looking at the query, you seem to be applying a lot of filtering - which should reduce the output set significantly - but none of it can use the available indexes.
artikli.izbrisan=0 and artikli.prodano!=3
and artikli.zavrseno=0 and artikli.od_id!=0
and (artikli.sajt=0 or (artikli.sajt=1 and artikli.dostupnost='svugdje'))
and kategorije_id IN (18)
Although I don't know what the pattern of your data is, I suspect that you might get a lot more benefit by adding an index on:
kategorije_id,izbrisan,sajt
Are all those other indexes really being used already?
Although you'd get a LOT more bang for your buck by denormalizing all those booleans (assuming that the table is normalised to start with and there are no hidden functional dependencies in there).
C.
The problem is that you don't have an index on the izdvojen, izdvojen_kad and datum columns that are used by the ORDER BY.
Note that the large index you have starting with kategorije_id can't be used for sorting (although it will help somewhat with the where clause) because the columns you are sorting by are at the end of the index.
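If you want to try it, the sort-friendly index would be something along these lines (whether the optimizer actually picks it still depends on the WHERE filters):

ALTER TABLE artikli
    ADD INDEX idx_order_cols (izdvojen, izdvojen_kad, datum);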
Actually, the ORDER BY is not the basis for the index you want... it's the CRITERIA that you mostly want the index to match. Filter the smaller set of data out first and you'll be left with a smaller slice of the table. I would change the WHERE clause a bit, but you'll know your data best. Put your smallest expected condition first and ensure an index is based on that... something like
WHERE
artikli.izbrisan = 0
and artikli.zavrseno = 0
and artikli.kategorije_id IN (18)
and artikli.prodano != 3
and artikli.od_id != 0
and ( artikli.sajt = 0
or ( artikli.sajt = 1
and artikli.dostupnost='svugdje')
)
and having a compound index on (izbrisan, zavrseno, kategorije_id). I've moved the other != comparisons to the end, as they are not specific key values; instead, they are ALL EXCEPT the value in question.
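As a sketch (the index name is arbitrary, and it assumes the zavrseno column exists in the full table - it appears in your query but not in the trimmed CREATE TABLE above):

ALTER TABLE artikli
    ADD INDEX idx_criteria (izbrisan, zavrseno, kategorije_id);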