I am working with MySQL, querying a table that has 12 million rows covering one year of data.
The query has to select certain kinds of data (currency, issuer, type, etc.) and then provide a daily average for certain fields of that data, so we can graph it afterwards.
The dream is to be able to do this in real time, i.e. with a response time under 10 seconds. At the moment it's not looking bright at all, as it takes between 4 and 6 minutes.
For example, one of the WHERE clauses matches 150k rows, about 500 per day, and then we average three fields (which are not in the WHERE clause) using AVG() and GROUP BY.
Now, to the raw data. The query is:
SELECT
`Valorizacion`.`fecha`, AVG(tir) AS `tir`, AVG(tirBase) AS `tirBase`, AVG(precioPorcentajeValorPar) AS `precioPorcentajeValorPar`
FROM `Valorizacion` USE INDEX (ix_mercado2)
WHERE
(Valorizacion.fecha >= '2011-07-17' ) AND
(Valorizacion.fecha <= '2012-07-18' ) AND
(Valorizacion.plazoResidual >= 365 ) AND
(Valorizacion.plazoResidual <= 3650000 ) AND
(Valorizacion.idMoneda_cache IN ('UF')) AND
(Valorizacion.idEmisorFusionado_cache IN ('ABN AMRO','WATTS', ...)) AND
(Valorizacion.idTipoRA_cache IN ('BB', 'BE', 'BS', 'BU'))
GROUP BY `Valorizacion`.`fecha` ORDER BY `Valorizacion`.`fecha` asc;
248 rows in set (4 min 28.82 sec)
The index covers all the WHERE clause fields, in the order
(fecha, idTipoRA_cache, idMoneda_cache, idEmisorFusionado_cache, plazoResidual)
Selecting the rows matched by the WHERE clause, without GROUP BY or AVG, takes
149670 rows in set (58.77 sec)
And selecting the rows, grouping, and just doing a COUNT(*) instead of the averages takes
248 rows in set (35.15 sec)
That is probably because it doesn't need to go to disk to fetch the data; the result is obtained directly from the index.
So as things stand I'm inclined to tell my boss "I'm sorry, but it can't be done". Before doing so, I'm asking whether you think there is something I could do to improve this. I think I could improve the index search time by moving the column with the highest cardinality to the front and so on, but even after that, the time it takes to access the disk for each record and compute the AVG seems too much.
Any ideas?
-- EDIT, the table structure
CREATE TABLE `Valorizacion` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`idInstrumento` int(11) NOT NULL,
`fecha` date NOT NULL,
`tir` decimal(10,4) DEFAULT NULL,
`tirBase` decimal(10,4) DEFAULT NULL,
`plazoResidual` double NOT NULL,
`duracionMacaulay` double DEFAULT NULL,
`duracionModACT365` double DEFAULT NULL,
`precioPorcentajeValorPar` decimal(20,15) DEFAULT NULL,
`valorPar` decimal(20,15) DEFAULT NULL,
`convexidad` decimal(20,15) DEFAULT NULL,
`volatilidad` decimal(20,15) DEFAULT NULL,
`montoCLP` double DEFAULT NULL,
`tirACT365` decimal(10,4) DEFAULT NULL,
`tipoVal` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idEmisorFusionado_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idMoneda_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idClasificacionRA_cache` int(11) DEFAULT NULL,
`idTipoRA_cache` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`fechaPrepagable_cache` date DEFAULT NULL,
`tasaEmision_cache` decimal(10,4) DEFAULT NULL,
PRIMARY KEY (`id`,`fecha`),
KEY `ix_FechaNemo` (`fecha`,`idInstrumento`) USING BTREE,
KEY `ix_mercado_stackover` (`idMoneda_cache`,`idTipoRA_cache`,`idEmisorFusionado_cache`,`plazoResidual`)
) ENGINE=InnoDB AUTO_INCREMENT=12933194 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Selecting 150K records out of 12M records and performing aggregate functions on them will not be fast no matter what you try to do.
You are probably dealing with primarily historical data as your sample query is for a year of data. A better approach may be to pre-calculate your daily averages and put them into separate tables. Then you may query those tables for reporting, graphs, etc. You will need to decide when and how to run such calculations so that you don't need to re-run them again on the same data.
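One way the pre-calculation could look, as a minimal sketch. The table name ValorizacionDiaria and its column names are hypothetical, and the sketch assumes the plazoResidual range from the question is fixed, so it can be applied once at load time. It stores sums and counts rather than averages, so the daily average over any combination of currency, issuer and type can still be computed exactly.

```sql
-- Hypothetical summary table: one row per day and per filter combination.
CREATE TABLE ValorizacionDiaria (
  fecha                   DATE        NOT NULL,
  idMoneda_cache          VARCHAR(20) NOT NULL,
  idEmisorFusionado_cache VARCHAR(20) NOT NULL,
  idTipoRA_cache          VARCHAR(20) NOT NULL,
  sum_tir                 DECIMAL(20,4),
  cnt_tir                 INT NOT NULL,
  sum_tirBase             DECIMAL(20,4),
  cnt_tirBase             INT NOT NULL,
  sum_precio              DECIMAL(30,15),
  cnt_precio              INT NOT NULL,
  PRIMARY KEY (fecha, idMoneda_cache, idEmisorFusionado_cache, idTipoRA_cache)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- Load one finished day (for example from a nightly job).
-- COUNT(col) skips NULLs, which matches what AVG(col) does.
INSERT INTO ValorizacionDiaria
  (fecha, idMoneda_cache, idEmisorFusionado_cache, idTipoRA_cache,
   sum_tir, cnt_tir, sum_tirBase, cnt_tirBase, sum_precio, cnt_precio)
SELECT fecha, idMoneda_cache, idEmisorFusionado_cache, idTipoRA_cache,
       SUM(tir), COUNT(tir),
       SUM(tirBase), COUNT(tirBase),
       SUM(precioPorcentajeValorPar), COUNT(precioPorcentajeValorPar)
FROM Valorizacion
WHERE fecha = '2012-07-18'
  AND plazoResidual >= 365
  AND plazoResidual <= 3650000
  AND idMoneda_cache IS NOT NULL
  AND idEmisorFusionado_cache IS NOT NULL
GROUP BY fecha, idMoneda_cache, idEmisorFusionado_cache, idTipoRA_cache;

-- Report query: rebuild the averages from the stored sums and counts.
-- An idEmisorFusionado_cache filter can be added in the same way.
SELECT fecha,
       SUM(sum_tir)     / SUM(cnt_tir)     AS tir,
       SUM(sum_tirBase) / SUM(cnt_tirBase) AS tirBase,
       SUM(sum_precio)  / SUM(cnt_precio)  AS precioPorcentajeValorPar
FROM ValorizacionDiaria
WHERE fecha >= '2011-07-17'
  AND fecha <= '2012-07-18'
  AND idMoneda_cache IN ('UF')
  AND idTipoRA_cache IN ('BB', 'BE', 'BS', 'BU')
GROUP BY fecha
ORDER BY fecha;
```

If the plazoResidual range has to stay free-form, this layout would not be enough on its own; a bucket column for plazoResidual would be one possible extension.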
When your requirement is to do analysis and reporting on millions of historical records you need to consider a data warehouse approach http://en.wikipedia.org/wiki/Data_warehouse rather than a simple database approach.
I am trying to partition a table with 60 million rows of data based on year.
Specifications:
MySQL 5.7.1
OS : windows
ALTER TABLE full_data PARTITION BY RANGE (YEAR(ProcessDate))(
PARTITION years VALUES LESS THAN (2019)
) ;
The process has been running for the past day. Could you please help me improve the performance?
CREATE TABLE `full_data` (
Mobile bigint(11) DEFAULT NULL,
Name varchar(200) DEFAULT NULL,
Barcode varchar(200) DEFAULT NULL,
Batch varchar(200) DEFAULT NULL,
Carton varchar(500) DEFAULT NULL,
Doctype varchar(500) DEFAULT NULL,
Rack varchar(500) DEFAULT NULL,
ProcessDate datetime DEFAULT NULL,
KEY Mobile (Mobile,Barcode),
KEY MobileBarcode (Mobile,Barcode)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Such an ALTER will take hours to copy all the data from the existing table into a temp table, then swap it into place. Disk I/O is taking the time.
But why do that partitioning? Only in rare cases will it provide any performance benefit. How many years do you have? What queries do you run against the table? Please provide SHOW CREATE TABLE.
If you are having performance problems, let's start with EXPLAIN SELECT ...
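As a rough illustration of that first step, something like the following could be run. The SELECT is a made-up example, since the real queries against full_data were not posted, and the index name is an assumption.

```sql
-- Hypothetical report query; replace it with one the application really runs.
EXPLAIN
SELECT Mobile, Barcode, ProcessDate
FROM full_data
WHERE ProcessDate >= '2018-01-01'
  AND ProcessDate <  '2019-01-01';

-- If EXPLAIN reports a full table scan (type = ALL), an ordinary index on the
-- date column may serve the range filter without any partitioning.
ALTER TABLE full_data ADD INDEX idx_processdate (ProcessDate);
```

Adding an index to a table of this size also takes time, so it is worth checking the EXPLAIN output first.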
For self-education I am developing an invoicing system for an electricity company. I have multiple time series tables with different intervals. One table represents consumption, two others represent prices. A third price table still has to be incorporated. I am now running calculation queries, but they are slow. I would like to improve the query speed, especially since these are only the first calculations and the queries will only become more complicated. Please note that this is the first database I have created, so a simplified explanation is preferred. Thanks for any help provided.
I have indexed DATE, PERIOD_FROM and PERIOD_UNTIL in each table. This sped up the process from 60 seconds to 5 seconds.
The structure of the tables is the following:
CREATE TABLE `apxprice` (
`APX_id` int(11) NOT NULL AUTO_INCREMENT,
`DATE` date DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`PRICE` decimal(10,2) DEFAULT NULL,
PRIMARY KEY (`APX_id`)
) ENGINE=MyISAM AUTO_INCREMENT=28728 DEFAULT CHARSET=latin1
CREATE TABLE `imbalanceprice` (
`imbalanceprice_id` int(11) NOT NULL AUTO_INCREMENT,
`DATE` date DEFAULT NULL,
`PTU` tinyint(3) DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`UPWARD_INCIDENT_RESERVE` tinyint(1) DEFAULT NULL,
`DOWNWARD_INCIDENT_RESERVE` tinyint(1) DEFAULT NULL,
`UPWARD_DISPATCH` decimal(10,2) DEFAULT NULL,
`DOWNWARD_DISPATCH` decimal(10,2) DEFAULT NULL,
`INCENTIVE_COMPONENT` decimal(10,2) DEFAULT NULL,
`TAKE_FROM_SYSTEM` decimal(10,2) DEFAULT NULL,
`FEED_INTO_SYSTEM` decimal(10,2) DEFAULT NULL,
`REGULATION_STATE` tinyint(1) DEFAULT NULL,
`HOUR` int(2) DEFAULT NULL,
PRIMARY KEY (`imbalanceprice_id`),
KEY `DATE` (`DATE`,`PERIOD_FROM`,`PERIOD_UNTIL`)
) ENGINE=MyISAM AUTO_INCREMENT=117427 DEFAULT CHARSET=latin1
CREATE TABLE `powerload` (
`powerload_id` int(11) NOT NULL AUTO_INCREMENT,
`EAN` varchar(18) DEFAULT NULL,
`DATE` date DEFAULT NULL,
`PERIOD_FROM` time DEFAULT NULL,
`PERIOD_UNTIL` time DEFAULT NULL,
`POWERLOAD` int(11) DEFAULT NULL,
PRIMARY KEY (`powerload_id`)
) ENGINE=MyISAM AUTO_INCREMENT=61039 DEFAULT CHARSET=latin1
Now when running this query:
SELECT i.DATE, i.PERIOD_FROM, i.TAKE_FROM_SYSTEM, i.FEED_INTO_SYSTEM,
a.PRICE, p.POWERLOAD, sum(a.PRICE * p.POWERLOAD)
FROM imbalanceprice i, apxprice a, powerload p
WHERE i.DATE = a.DATE
and i.DATE = p.DATE
AND i.PERIOD_FROM >= a.PERIOD_FROM
and i.PERIOD_FROM = p.PERIOD_FROM
AND i.PERIOD_FROM < a.PERIOD_UNTIL
AND i.DATE >= '2018-01-01'
AND i.DATE <= '2018-01-31'
group by i.DATE
I have run the query with EXPLAIN and get the following result:
select_type: SIMPLE for all tables
partitions: NULL for all tables
possible_keys: a, p = NULL; i = DATE
key: a, p = NULL; i = DATE
key_len: a, p = NULL; i = 8
ref: a, p = NULL; i = timeseries.a.DATE, timeseries.p.PERIOD_FROM
rows: a = 28727; p = 61038; i = 1
filtered: a = 100; p = 10; i = 100
Extra (a): Using where; Using temporary; Using filesort
Extra (b): Using where; Using join buffer (Block Nested Loop)
Extra (c): NULL
Ideally I would run a more complicated query for a whole year, grouped by month for example, with all price tables incorporated. However, this would be too slow. I have indexed DATE, PERIOD_FROM and PERIOD_UNTIL in each table. The calculation result must not change: in this case, quarter-hourly consumption of two meters multiplied by hourly prices.
"Categorically speaking," the first thing you should look at is indexes.
Your clauses such as WHERE i.DATE = a.DATE ... are categorically known as INNER JOINs, and the SQL engine needs to have the ability to locate the matching rows "instantly." (That is to say, without looking through the entire table!)
FYI: Just like any index in real-life – here I would be talking about "library card catalogs" if we still had such a thing – indexes will assist both "equal to" and "less/greater than" queries. The index takes the computer directly to a particular point in the data, whether that's a "hit" or a "near miss."
Finally, the EXPLAIN verb is very useful: put that word in front of your query, and the SQL engine should "explain to you" exactly how it intends to carry out your query. (The SQL engine looks at the structure of the database to make that decision.) Although the EXPLAIN output is ... (heh) ... "not exactly standardized," it will help you to see if the computer thinks that it needs to do something very time-wasting in order to deliver your answer.
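To make this concrete, here is a sketch under assumptions. The posted table definitions show a (DATE, PERIOD_FROM, PERIOD_UNTIL) key only on imbalanceprice, so the two ALTER statements add comparable indexes to the other tables; the index names are made up. The query is the same join written with explicit JOIN ... ON clauses and reduced to the grouped column and the sum.

```sql
-- Hypothetical index names; the columns are the ones used in the join conditions.
ALTER TABLE apxprice  ADD INDEX idx_date_period (`DATE`, PERIOD_FROM, PERIOD_UNTIL);
ALTER TABLE powerload ADD INDEX idx_date_period (`DATE`, PERIOD_FROM);

-- Same join conditions as the original query, in explicit JOIN syntax.
EXPLAIN
SELECT i.`DATE`,
       SUM(a.PRICE * p.POWERLOAD) AS total
FROM imbalanceprice i
JOIN apxprice a
  ON  a.`DATE` = i.`DATE`
  AND i.PERIOD_FROM >= a.PERIOD_FROM
  AND i.PERIOD_FROM <  a.PERIOD_UNTIL
JOIN powerload p
  ON  p.`DATE` = i.`DATE`
  AND p.PERIOD_FROM = i.PERIOD_FROM
WHERE i.`DATE` >= '2018-01-01'
  AND i.`DATE` <= '2018-01-31'
GROUP BY i.`DATE`;
```

In the EXPLAIN output, a non-NULL value in the key column for a and p would suggest the new indexes are being picked up.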
I have an interesting problem trying to select rows from a table where there are multiple possibilities for a VARCHAR column in my where clause.
Here's my table (which has around 7 million rows):
CREATE TABLE `search_upload_detailed_results` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`surId` bigint(20) DEFAULT NULL,
`company` varchar(100) DEFAULT NULL,
`country` varchar(45) DEFAULT NULL,
`clei` varchar(100) DEFAULT NULL,
`partNumber` varchar(100) DEFAULT NULL,
`mfg` varchar(100) DEFAULT NULL,
`cond` varchar(45) DEFAULT NULL,
`price` float DEFAULT NULL,
`qty` int(11) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`description` varchar(500) DEFAULT NULL,
`status` varchar(45) DEFAULT NULL,
`fileId` bigint(20) DEFAULT NULL,
`nmId` bigint(20) DEFAULT NULL,
`quoteRequested` tinyint(1) DEFAULT '0',
PRIMARY KEY (`id`),
KEY `sudr.surId` (`surId`),
KEY `surd.clei` (`clei`),
KEY `surd.pn` (`partNumber`),
KEY `surd.fileId` (`fileId`),
KEY `surd.price` (`price`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I'm trying to match on the partNumber column. The problem is that the partNumber is stored in different formats, and can be entered in the search form in multiple formats.
Example: Part Number '300-1231-932' could be:
300-1231-932
3001231932
300 1231 932
A simple select like this takes 0.0008 seconds.
select avg(price) as price from search_upload_detailed_results where
partNumber LIKE '3001231932%' and price > 0;
But it doesn't give me all of the matches that I need. So I wrote this query.
select avg(price) as price from search_upload_detailed_results
where REPLACE(REPLACE(partNumber,'-',''),' ','') LIKE REPLACE(REPLACE('3001231932%','-',''),' ','') and price > 0;
This gives me all of the correct matches, but it's super slow at 3.3 seconds.
I played around with some things, trying to reduce the number of rows I'm doing the replace on, and came up with this.
select avg(price) as price from search_upload_detailed_results
where price > 0 AND
partNumber LIKE('300%') AND
REPLACE(REPLACE(partNumber,'-',''),' ','') LIKE REPLACE(REPLACE('3001231932%','-',''),' ','');
It takes 0.4 seconds to execute. Pretty fast, but could still be a bit time consuming in a multi-part search.
I would like to get it a little faster, but this is as far as I could get. Are there any other ways to optimize this query?
UPDATE to show explain for the 3rd query:
# id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, search_upload_detailed_results, range, surd.pn,surd.price, surd.pn, 103, , 89670, Using where
The obvious solution is to just store the part number with no extra characters in the table. Then remove these characters from the user input, and just do a simple WHERE partnumber = #input query.
If that's not possible, you can add that as an additional column. In MySQL 5.7 you can use a generated column; in earlier versions you can use a trigger that fills in this column.
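A minimal sketch of the generated-column variant, assuming MySQL 5.7 or later. The column name partNumberNorm and the index name are hypothetical. The column is declared STORED on the assumption that the table stays on MyISAM, where an index needs a stored rather than a virtual generated column.

```sql
-- Hypothetical normalized copy of partNumber with dashes and spaces removed.
ALTER TABLE search_upload_detailed_results
  ADD COLUMN partNumberNorm VARCHAR(100)
    AS (REPLACE(REPLACE(partNumber, '-', ''), ' ', '')) STORED;

ALTER TABLE search_upload_detailed_results
  ADD INDEX idx_partNumberNorm (partNumberNorm);

-- The search term is normalized in the application before it is bound here.
SELECT AVG(price) AS price
FROM search_upload_detailed_results
WHERE partNumberNorm LIKE '3001231932%'
  AND price > 0;
```

The LIKE with a fixed prefix can then use the new index, in the same way as the fast first query in the question.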
I would like to get it a little faster, but this is as far as I could get. Are there any other ways to optimize this query?
As Barmar has said, the best solution if you really need speed (is 3.3s slow?) is to have a column with the untransformed data in it (hopefully now standardised), that'll allow you to query it without specifying all the different types of part numbers.
Example: Part Number '300-1231-932' could be:
300-1231-932 ||
3001231932 ||
300 1231 932
I think you should worry about the presentation of your data, having all those different 'formats' will make it difficult - can you format to one standard (before it reaches the DB)?
Here's my table (which has around 7 million rows):
Don't forget your index!
As mentioned elsewhere, the problem is the table format. If this is a non-negotiable then another alternative is:
If there are a few formats, but not too many, and they are well known (e.g. the three you've shown), then the query can be made to run faster by explicitly precalculating them all and searching for any of them.
select avg(price) as price from search_upload_detailed_results where
partNumber IN ('300-1231-932', '3001231932', '300 1231 932')
This will take the best advantage of the index you presumably have on partNumber.
You may also find that a carefully chosen regular expression covers all the formats in one condition, though MySQL generally evaluates REGEXP row by row rather than through the index.
select avg(price) as price from search_upload_detailed_results where
partNumber REGEXP '^300[- ]?1231[- ]?932';
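A hedged variation on the same idea: on the assumption that REGEXP alone is checked against every row, an indexable prefix condition can narrow the candidates first, much like the third query in the question does with REPLACE.

```sql
SELECT AVG(price) AS price
FROM search_upload_detailed_results
WHERE partNumber LIKE '300%'                     -- can use the index on partNumber
  AND partNumber REGEXP '^300[- ]?1231[- ]?932'  -- checked only on the remaining rows
  AND price > 0;
```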
Say I have a table like the one below:
CREATE TABLE `hadoop_apps` (
`clusterId` smallint(5) unsigned NOT NULL,
`appId` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
`user` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`queue` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
`appName` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`submitTime` datetime NOT NULL COMMENT 'App submission time',
`finishTime` datetime DEFAULT NULL COMMENT 'App completion time',
`elapsedTime` int(11) DEFAULT NULL COMMENT 'App duration in milliseconds',
PRIMARY KEY (`clusterId`,`appId`,`submitTime`),
KEY `hadoop_apps_ibk_finish` (`finishTime`),
KEY `hadoop_apps_ibk_queueCluster` (`queue`,`clusterId`),
KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
mysql> SELECT COUNT(*) FROM hadoop_apps;
This returns a count of 158593816.
So I am trying to understand what is inefficient about the below query and how I can improve it.
mysql> SELECT * FROM hadoop_apps WHERE DATE(finishTime)='10-11-2013';
Also, what's the difference between these two queries?
mysql> SELECT * FROM hadoop_apps WHERE user='foobar';
mysql> SELECT * FROM hadoop_apps HAVING user='foobar';
WHERE DATE(finishTime)='10-11-2013';
This is a problem for the optimizer because anytime you put a column into a function like this, the optimizer doesn't know if the order of values returned by the function will be the same as the order of values input to the function. So it can't use an index to speed up lookups.
To solve this, refrain from putting the column inside a function call like that, if you want the lookup against that column to use an index.
Also, you should use MySQL standard date format: YYYY-MM-DD.
WHERE finishTime BETWEEN '2013-10-11 00:00:00' AND '2013-10-11 23:59:59'
What is the difference between [conditions in WHERE and HAVING clauses]?
The WHERE clause is for filtering rows.
The HAVING clause is for filtering results after applying GROUP BY.
See SQL - having VS where
If WHERE works, it is preferred over HAVING. The former is done earlier in the processing, thereby cutting down on the amount of data to shovel through. OK, in your one example, there may be no difference between them.
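A small illustration of the two clauses working together; the cluster id and the threshold of 100 are arbitrary values chosen for the example.

```sql
SELECT `user`, COUNT(*) AS apps
FROM hadoop_apps
WHERE clusterId = 1        -- filters rows before they are grouped
GROUP BY `user`
HAVING COUNT(*) > 100;     -- filters groups after aggregation
```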
I cringe whenever I see a DATETIME in a UNIQUE key (your PK). Can't the app have two rows in the same second? Is that a risk you want to take?
Even changing to DATETIME(6) (microseconds) could be risky.
Regardless of what you do in that area, I recommend this pattern for testing:
WHERE finishTime >= '2013-10-11'
AND finishTime < '2013-10-11' + INTERVAL 1 DAY
It works "correctly" for DATE, DATETIME, and DATETIME(6), etc. Other flavors add an extra midnight or miss parts of a second. And it avoids hassles with leapdays, etc, if the interval is more than a single day.
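To see the effect on the table from the question, the two forms can be compared with EXPLAIN. This is only a sketch; the actual plans depend on the data and the MySQL version.

```sql
-- Column wrapped in a function: the index on finishTime is not expected to be used.
EXPLAIN
SELECT * FROM hadoop_apps
WHERE DATE(finishTime) = '2013-10-11';

-- Half-open range on the bare column: eligible for a range scan on
-- hadoop_apps_ibk_finish.
EXPLAIN
SELECT * FROM hadoop_apps
WHERE finishTime >= '2013-10-11'
  AND finishTime <  '2013-10-11' + INTERVAL 1 DAY;
```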
KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
is bad. It won't get past user(8). And prefixing like that is often useless. Let's see the query that tempted you to build that key; we'll come up with a better one.
158M rows with 4 varchars. And they sound like values that don't have many distinct values? Build lookup tables and replace them with SMALLINT UNSIGNED (2 bytes, 0..64K range) or other small id. This will significantly shrink the table, thereby making it faster.
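A sketch of what one such lookup table could look like for the queue column. The table name hadoop_queues and the queueId column are assumptions, and the final query presumes hadoop_apps has already been changed to store queueId instead of queue.

```sql
-- Hypothetical lookup table: one row per distinct queue name.
CREATE TABLE hadoop_queues (
  queueId SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  queue   VARCHAR(35) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (queueId),
  UNIQUE KEY uq_queue (queue)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

-- Fill it once from the existing data.
INSERT INTO hadoop_queues (queue)
SELECT DISTINCT queue FROM hadoop_apps;

-- With a queueId column in hadoop_apps (hypothetical), the name is
-- recovered through a join.
SELECT q.queue, COUNT(*) AS apps
FROM hadoop_apps a
JOIN hadoop_queues q ON q.queueId = a.queueId
GROUP BY q.queue;
```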
I have a MySQL database with a table definition like this:
CREATE TABLE `minute_data` (
`date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`open` decimal(10,2) DEFAULT NULL,
`high` decimal(10,2) DEFAULT NULL,
`low` decimal(10,2) DEFAULT NULL,
`close` decimal(10,2) DEFAULT NULL,
`volume` decimal(10,2) DEFAULT NULL,
`adj_close` varchar(45) DEFAULT NULL,
`symbol` varchar(10) NOT NULL DEFAULT '',
PRIMARY KEY (`symbol`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
It stores 1 minute data points from the stock market. The primary key is a combination of the symbol and date columns. This way I always have only 1 data point for each symbol at any time.
I am wondering why the following query takes so long that I can't even wait for it to finish:
select distinct date from test.minute_data where date >= "2013-01-01"
order by date asc limit 100;
However I can select count(*) from minute_data; and that finishes very quickly.
I know that it must have something to do with the fact that there are over 374 million rows of data in the table, and my desktop computer is pretty far from a super computer.
Does anyone know something I can try to speed up this query? Do I need to abandon all hope of using a MySQL table this big?
Thanks a lot!
When you have a composite index on 2 columns, like your (symbol, date) primary key, searching and grouping by a prefix of the key will be fast. But searching for something that doesn't include the first column in the index requires scanning all rows or using some other index.
You can change your primary key to (date, symbol) if you don't usually need to search for symbol without date, or you can add an additional index on date:
alter table minute_data add index (date)
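The first option, reordering the primary key, could be sketched as follows. Rebuilding a key over roughly 374 million rows is likely to take a long time, so this is something to try on a copy first.

```sql
-- Swap the column order so that date leads the primary key.
ALTER TABLE minute_data
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`date`, symbol);

-- Check whether the query from the question now uses a key led by date.
EXPLAIN
SELECT DISTINCT `date`
FROM minute_data
WHERE `date` >= '2013-01-01'
ORDER BY `date` ASC
LIMIT 100;
```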