Say i have a table like below:
CREATE TABLE `hadoop_apps` (
`clusterId` smallint(5) unsigned NOT NULL,
`appId` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
`user` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`queue` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
`appName` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`submitTime` datetime NOT NULL COMMENT 'App submission time',
`finishTime` datetime DEFAULT NULL COMMENT 'App completion time',
`elapsedTime` int(11) DEFAULT NULL COMMENT 'App duration in milliseconds',
PRIMARY KEY (`clusterId`,`appId`,`submitTime`),
KEY `hadoop_apps_ibk_finish` (`finishTime`),
KEY `hadoop_apps_ibk_queueCluster` (`queue`,`clusterId`),
KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
mysql> SELECT COUNT(*) FROM hadoop_apps;
This returns a count of 158,593,816.
So I am trying to understand what is inefficient about the below query and how I can improve it.
mysql> SELECT * FROM hadoop_apps WHERE DATE(finishTime)='10-11-2013';
Also, what's the difference between these two queries?
mysql> SELECT * FROM hadoop_apps WHERE user='foobar';
mysql> SELECT * FROM hadoop_apps HAVING user='foobar';
WHERE DATE(finishTime)='10-11-2013';
This is a problem for the optimizer: once you wrap an indexed column in a function like this, the optimizer can't assume the function's output preserves the ordering of its input, so it can't use an index on that column to speed up the lookup.
To solve this, refrain from putting the column inside a function call like that, if you want the lookup against that column to use an index.
Also, you should use MySQL standard date format: YYYY-MM-DD.
WHERE finishTime BETWEEN '2013-10-11 00:00:00' AND '2013-10-11 23:59:59'
What is the difference between [conditions in WHERE and HAVING clauses]?
The WHERE clause is for filtering rows.
The HAVING clause is for filtering results after applying GROUP BY.
See SQL - having VS where
If WHERE works, it is preferred over HAVING. The former is done earlier in the processing, thereby cutting down on the amount of data to shovel through. OK, in your one example, there may be no difference between them.
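A small illustration of the distinction (a hypothetical query, not from the original post): WHERE trims individual rows before any grouping, while HAVING filters the aggregated groups afterwards.
SELECT queue, COUNT(*) AS apps
FROM hadoop_apps
WHERE user = 'foobar'      -- row filter; can use an index
GROUP BY queue
HAVING COUNT(*) > 100;     -- group filter; applied to the aggregates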
I cringe whenever I see a DATETIME in a UNIQUE key (your PK). Can't the app have two rows in the same second? Is that a risk you want to take?
Even changing to DATETIME(6) (microseconds) could be risky.
Regardless of what you do in that area, I recommend this pattern for testing:
WHERE finishTime >= '2013-10-11'
AND finishTime < '2013-10-11' + INTERVAL 1 DAY
It works "correctly" for DATE, DATETIME, and DATETIME(6), etc. Other flavors add an extra midnight or miss parts of a second. And it avoids hassles with leapdays, etc, if the interval is more than a single day.
KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
is bad. The index can't be used past the user(8) prefix, and prefixing like that is often useless anyway. Let's see the query that tempted you to build that key; we'll come up with a better one.
158M rows with 4 VARCHAR columns, and they sound like columns with few distinct values. Build lookup tables and replace them with SMALLINT UNSIGNED (2 bytes, range 0..64K) or another small id. This will significantly shrink the table, thereby making it faster.
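A minimal sketch of that normalization for the queue column (table and column names here are assumptions, not part of the original schema; user, appName, etc. would follow the same pattern):
-- Hypothetical lookup table
CREATE TABLE queues (
  queueId SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  queue   VARCHAR(35) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (queueId),
  UNIQUE KEY (queue)
) ENGINE=InnoDB;
-- Populate it from the existing data; hadoop_apps would then carry a
-- 2-byte queueId column instead of the 35-character VARCHAR.
INSERT IGNORE INTO queues (queue)
  SELECT DISTINCT queue FROM hadoop_apps;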
I have an interesting problem trying to select rows from a table where there are multiple possibilities for a VARCHAR column in my where clause.
Here's my table (which has around 7 million rows):
CREATE TABLE `search_upload_detailed_results` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`surId` bigint(20) DEFAULT NULL,
`company` varchar(100) DEFAULT NULL,
`country` varchar(45) DEFAULT NULL,
`clei` varchar(100) DEFAULT NULL,
`partNumber` varchar(100) DEFAULT NULL,
`mfg` varchar(100) DEFAULT NULL,
`cond` varchar(45) DEFAULT NULL,
`price` float DEFAULT NULL,
`qty` int(11) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`description` varchar(500) DEFAULT NULL,
`status` varchar(45) DEFAULT NULL,
`fileId` bigint(20) DEFAULT NULL,
`nmId` bigint(20) DEFAULT NULL,
`quoteRequested` tinyint(1) DEFAULT '0',
PRIMARY KEY (`id`),
KEY `sudr.surId` (`surId`),
KEY `surd.clei` (`clei`),
KEY `surd.pn` (`partNumber`),
KEY `surd.fileId` (`fileId`),
KEY `surd.price` (`price`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I'm trying to match on the partNumber column. The problem is that the partNumber is stored in different formats, and can also be entered in the search form in multiple formats.
Example: Part Number '300-1231-932' could be:
300-1231-932
3001231932
300 1231 932
A simple select like this takes 0.0008 seconds.
select avg(price) as price from search_upload_detailed_results where
partNumber LIKE '3001231932%' and price > 0;
But it doesn't give me all of the matches that I need. So I wrote this query.
select avg(price) as price from search_upload_detailed_results
where REPLACE(REPLACE(partNumber,'-',''),' ','') LIKE REPLACE(REPLACE('3001231932%','-',''),' ','') and price > 0;
This gives me all of the correct matches, but it's super slow at 3.3 seconds.
I played around with some things, trying to reduce the number of rows I'm doing the replace on, and came up with this.
select avg(price) as price from search_upload_detailed_results
where price > 0 AND
partNumber LIKE('300%') AND
REPLACE(REPLACE(partNumber,'-',''),' ','') LIKE REPLACE(REPLACE('3001231932%','-',''),' ','');
It takes 0.4 seconds to execute. Pretty fast, but could still be a bit time consuming in a multi-part search.
I would like to get it a little faster, but this is as far as I could get. Are there any other ways to optimize this query?
UPDATE to show explain for the 3rd query:
# id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, search_upload_detailed_results, range, surd.pn,surd.price, surd.pn, 103, , 89670, Using where
The obvious solution is to just store the part number with no extra characters in the table. Then remove these characters from the user input, and just do a simple WHERE partnumber = #input query.
If that's not possible, you can add that as an additional column. In MySQL 5.7 you can use a generated column; in earlier versions you can use a trigger that fills in this column.
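A sketch of the generated-column route (column and index names are my assumptions): MySQL 5.7 can keep a stored column computed from partNumber and index it, so the normalization cost is paid at write time rather than on every query.
ALTER TABLE search_upload_detailed_results
  ADD COLUMN partNumberClean VARCHAR(100)
    AS (REPLACE(REPLACE(partNumber, '-', ''), ' ', '')) STORED,
  ADD INDEX idx_pn_clean (partNumberClean);
-- Strip the same characters from the user input and search the clean column:
SELECT AVG(price) AS price
FROM search_upload_detailed_results
WHERE partNumberClean LIKE '3001231932%' AND price > 0;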
I would like to get it a little faster, but this is as far as I could get. Are there any other ways to optimize this query?
As Barmar has said, the best solution if you really need speed (is 3.3s slow?) is to have a column holding the data in a single, pre-normalised form; that'll allow you to query it without specifying all the different part number formats.
Example: Part Number '300-1231-932' could be:
300-1231-932
3001231932
300 1231 932
I think you should worry about the presentation of your data: having all those different 'formats' will make it difficult. Can you normalise to one standard format (before it reaches the DB)?
Here's my table (which has around 7 million rows):
Don't forget your index!
As mentioned elsewhere, the problem is the format of the data in the table. If this is non-negotiable, then another alternative is:
If there are a few formats, but not too many, and they are well known (e.g. the three you've shown), then the query can be made to run faster by explicitly precalculating them all and searching for any of them.
select avg(price) as price from search_upload_detailed_results where
partNumber IN ('300-1231-932', '3001231932', '300 1231 932')
This will take the best advantage of the index you presumably have on partNumber.
You may find that MySQL can make good use of the indexes for carefully selected regular expressions.
select avg(price) as price from search_upload_detailed_results where
partNumber REGEXP '^300[- ]?1231[- ]?932';
I have a table for storing stats. By the end of the day it is populated with about 10 million rows, which are then copied to a daily stats table and deleted. For this reason I can't have an auto-incrementing primary key.
This is the table structure:
CREATE TABLE `stats` (
`shop_id` int(11) NOT NULL,
`title` varchar(255) CHARACTER SET latin1 NOT NULL,
`created` datetime NOT NULL,
`mobile` tinyint(1) NOT NULL DEFAULT '0',
`click` tinyint(1) NOT NULL DEFAULT '0',
`conversion` tinyint(1) NOT NULL DEFAULT '0',
`ip` varchar(20) CHARACTER SET latin1 NOT NULL,
KEY `shop_id` (`shop_id`,`created`,`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
I have a key on (shop_id, created, ip), but I'm not sure which columns I should use to create the optimal index to increase lookup speed further.
The query below takes about 12 seconds with no key and about 1.5 seconds using the index above:
SELECT DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane')) AS `date`, COUNT(*) AS `views`
FROM `stats`
WHERE `created` <= '2017-07-18 09:59:59'
AND `shop_id` = '17515021'
AND `click` != 1
AND `conversion` != 1
GROUP BY DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane'))
ORDER BY DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane'));
If there is no column (or combination of columns) that is guaranteed unique, then do have an AUTO_INCREMENT id. Don't worry about truncating/deleting. (However, if the id does not reset, you probably need to use BIGINT, not INT UNSIGNED to avoid overflow.)
Don't use id as the primary key, instead, PRIMARY KEY(shop_id, created, id), INDEX(id).
That unconventional PK will help with performance (it keeps each shop's rows clustered together in created order) while still being unique, thanks to the addition of id. The INDEX(id) is to keep AUTO_INCREMENT happy. (Whether you DELETE hourly or daily is a separate issue.)
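A sketch of the table with those key changes applied (illustration only, not a drop-in migration):
CREATE TABLE `stats` (
  `id` bigint unsigned NOT NULL AUTO_INCREMENT,
  `shop_id` int(11) NOT NULL,
  `title` varchar(255) CHARACTER SET latin1 NOT NULL,
  `created` datetime NOT NULL,
  `mobile` tinyint(1) NOT NULL DEFAULT '0',
  `click` tinyint(1) NOT NULL DEFAULT '0',
  `conversion` tinyint(1) NOT NULL DEFAULT '0',
  `ip` varchar(20) CHARACTER SET latin1 NOT NULL,
  PRIMARY KEY (`shop_id`,`created`,`id`),  -- clusters each shop's rows in time order; unique thanks to id
  KEY `id` (`id`)                          -- keeps AUTO_INCREMENT satisfied
) ENGINE=InnoDB DEFAULT CHARSET=utf8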
Build a Summary table based on each hour (or minute). It will contain the count for such -- 400K/hour or 7K/minute. Augment it each hour (or minute) so that you don't have to do all the work at the end of the day.
The summary table can also filter on click and/or conversion. Or it could keep both, if you need them.
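A possible shape for that summary table and its hourly refresh (table name, granularity, and the literal hour boundary are assumptions):
CREATE TABLE `stats_hourly` (
  `shop_id` int(11) NOT NULL,
  `hr` datetime NOT NULL,              -- created, truncated to the hour
  `views` int NOT NULL,
  `clicks` int NOT NULL,
  `conversions` int NOT NULL,
  PRIMARY KEY (`shop_id`,`hr`)
) ENGINE=InnoDB;
-- Run once per hour for the hour that just closed:
INSERT INTO stats_hourly (shop_id, hr, views, clicks, conversions)
SELECT shop_id,
       DATE_FORMAT(created, '%Y-%m-%d %H:00:00'),
       COUNT(*), SUM(click), SUM(conversion)
FROM stats
WHERE created >= '2017-07-18 09:00:00'
  AND created <  '2017-07-18 09:00:00' + INTERVAL 1 HOUR
GROUP BY shop_id, DATE_FORMAT(created, '%Y-%m-%d %H:00:00')
ON DUPLICATE KEY UPDATE
  views = VALUES(views), clicks = VALUES(clicks), conversions = VALUES(conversions);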
If click/conversion have only two states (0 & 1), don't say != 1, say = 0; the optimizer is much better at = than at !=.
If they are 2-state and you change to =, then this becomes viable and much better: INDEX(shop_id, click, conversion, created) -- created must be last.
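A sketch of the query rewritten along those lines, together with that index (assuming click and conversion really are only ever 0 or 1):
ALTER TABLE `stats`
  ADD KEY `shop_click_conv_created` (`shop_id`,`click`,`conversion`,`created`);
SELECT DATE(CONVERT_TZ(`created`, 'UTC', 'Australia/Brisbane')) AS `date`, COUNT(*) AS `views`
FROM `stats`
WHERE `shop_id` = 17515021
  AND `click` = 0
  AND `conversion` = 0
  AND `created` <= '2017-07-18 09:59:59'
GROUP BY `date`
ORDER BY `date`;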
Don't bother with TZ when summarizing into the Summary table; apply the conversion later.
Better yet, don't use DATETIME, use TIMESTAMP so that you won't need to convert (assuming you have TZ set correctly).
After all that, if you still have issues, start over on the Question; there may be further tweaks.
In your WHERE clause, put first the column that returns the smallest set of results, then the next most selective, and so on, and create the index in the same column order.
You have
WHERE created <= '2017-07-18 09:59:59'
AND shop_id = '17515021'
AND click != 1
AND conversion != 1
If created returns a smaller set than the other 3 columns, then you are good; otherwise put the most selective column in the first position of your WHERE clause, choose the second column by the same reasoning, and create the index to match your WHERE clause.
If you think the order is fine, then create an index:
KEY created_shopid_click_conversion (created, shop_id, click, conversion);
Ok, I have the following MySQL table structure:
CREATE TABLE `creditlog` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`memberId` int(10) unsigned NOT NULL,
`quantity` decimal(10,2) unsigned DEFAULT NULL,
`timeAdded` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`reference` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `memberId` (`memberId`),
KEY `timeAdded` (`timeAdded`));
And I'm querying it like this:
SELECT SUM(quantity) FROM creditlog where timeAdded>'2016-09-01' AND timeAdded<'2016-10-01' AND memberId IN (3,6,8,9,11)
Now, I also add USE INDEX (timeAdded) because, given the number of entries, it is more convenient. Explaining the above query shows:
type -> range,
key -> timeAdded,
rows -> 921294
extra -> using where
Meanwhile if I use the memberId INDEX it shows:
type -> range,
key -> memberId,
rows -> 1707849
extra -> using where
Now, my question is: is it possible to combine these 2 indexes somehow so they are used together and reduce the number of rows the query has to scan, since I'll also need to add more conditions (on other columns)?
MySQL almost never uses two indexes in a single query; it is just not cost effective. However, composite indexes are often very efficient. You need this order: INDEX(memberId, timeAdded).
Build the index this way...
First include column(s) that are in the WHERE clause tested with =. (None, in your case.)
Any column(s) with IN.
One 'range', such as <, BETWEEN, etc.
Then move on to all the fields of the GROUP BY or ORDER BY. (Not relevant here.)
There are a lot of exceptions and caveats. Some are given in my cookbook.
(Contrary to popular opinion, cardinality is almost never relevant in designing an index.)
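Applying those rules to the query above gives the same answer as before: there are no '=' columns, memberId is the IN column, and timeAdded is the range, so a sketch would be:
ALTER TABLE creditlog ADD INDEX member_time (memberId, timeAdded);
-- memberId (IN) first groups each member's rows together in the index;
-- timeAdded (range) last lets the range scan run within each member's group.
SELECT SUM(quantity)
FROM creditlog
WHERE memberId IN (3,6,8,9,11)
  AND timeAdded > '2016-09-01' AND timeAdded < '2016-10-01';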
Here is a way to compare two indexes (even with a table that is too small to get reliable timings):
FLUSH STATUS;
SELECT SQL_NO_CACHE ...;
SHOW SESSION STATUS LIKE 'Handler%';
(repeat for other query/index)
Smaller numbers almost always indicate better.
"timeAdded>'2016-09-01' AND timeAdded<'2016-10-01'" -- That excludes midnight on the first day. I recommend this pattern:
timeAdded >= '2016-09-01'
AND timeAdded < '2016-09-01' + INTERVAL 1 MONTH
That also avoids computing dates.
That smells like a common query. Have you considered building and maintaining Summary tables? The equivalent query would probably run 10 times as fast.
I have a problem similar to
SQL: selecting rows where column value changed from previous row
The accepted answer by ypercube, which I adapted to:
CREATE TABLE `schange` (
`PersonID` int(11) NOT NULL,
`StateID` int(11) NOT NULL,
`TStamp` datetime NOT NULL,
KEY `tstamp` (`TStamp`),
KEY `personstate` (`PersonID`, `StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `states` (
`StateID` int(11) NOT NULL AUTO_INCREMENT,
`State` varchar(100) NOT NULL,
`Available` tinyint(1) NOT NULL,
`Otherstatuseshere` tinyint(1) NOT NULL,
PRIMARY KEY (`StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT
COALESCE((@statusPre <> s.Available), 1) AS statusChanged,
c.PersonID,
c.TStamp,
s.*,
@statusPre := s.Available
FROM schange c
INNER JOIN states s USING (StateID),
(SELECT @statusPre:=NULL) AS d
WHERE PersonID = 1 AND TStamp > "2012-01-01" AND TStamp < "2013-01-01"
ORDER BY TStamp ;
The query itself worked just fine in testing, and with the right mix of temporary tables I was able to generate reports with daily availability sums from a huge pile of data in virtually no time at all.
The real problem came when I discovered that the tables were using the MyISAM engine, which we have completely abandoned; I recreated the tables to use InnoDB and noticed the query no longer works as expected.
After some bashing of head into wall I discovered that MyISAM seems to go over the columns of each row in order (selecting statusChanged before updating @statusPre), while InnoDB seems to do all the variable assignments first and only after that populate the result rows, regardless of whether the assignment happens in the SELECT or WHERE clauses, in functions (COALESCE, comparisons, etc.), subqueries or otherwise.
Trying to accomplish this in a query without variables always seems to end the same way: a subquery requiring exponentially more time the more rows are in the set, resulting in an excruciating minutes- (or hours-) long wait to get the beginning and ending events for one status, while a finished report should include daily sums for several.
Can this type of query work on the InnoDB engine, and if so, how should one go about it?
Or is the only feasible option to go for a database product that supports WITH statements?
Removing
KEY personstate (PersonID, StateID)
fixes the problem.
No idea why, though. It was not really required anyway; the timestamp key is the more important one and speeds up the query nicely.
I am working with MySQL, querying a table that has 12 million records covering a year of data.
The query has to select a certain kind of data (currency, issuer, type, etc.) and then provide a daily average for certain fields of that data, so we can graph it afterwards.
The dream is to be able to do this in real time, i.e. with a response time under 10 seconds; however, at the moment it's not looking bright at all, as it's taking between 4 and 6 minutes.
For example, one of the WHERE clauses comes up with 150k records, split into about 500 per day, and then we average three fields (which are not in the WHERE clause) using AVG() and GROUP BY.
Now, on to the raw data. The query is:
SELECT
`Valorizacion`.`fecha`, AVG(tir) AS `tir`, AVG(tirBase) AS `tirBase`, AVG(precioPorcentajeValorPar) AS `precioPorcentajeValorPar`
FROM `Valorizacion` USE INDEX (ix_mercado2)
WHERE
(Valorizacion.fecha >= '2011-07-17' ) AND
(Valorizacion.fecha <= '2012-07-18' ) AND
(Valorizacion.plazoResidual >= 365 ) AND
(Valorizacion.plazoResidual <= 3650000 ) AND
(Valorizacion.idMoneda_cache IN ('UF')) AND
(Valorizacion.idEmisorFusionado_cache IN ('ABN AMRO','WATTS', ...)) AND
(Valorizacion.idTipoRA_cache IN ('BB', 'BE', 'BS', 'BU'))
GROUP BY `Valorizacion`.`fecha` ORDER BY `Valorizacion`.`fecha` asc;
248 rows in set (4 min 28.82 sec)
The index is made over all the where clause fields in the order
(fecha, idTipoRA_cache, idMoneda_cache, idEmisorFusionado_cache, plazoResidual)
Selecting the "where" registers, without using group by or AVG
149670 rows in set (58.77 sec)
And selecting the records, grouping, and just doing a COUNT(*) instead of the averages takes
248 rows in set (35.15 sec)
That is probably because it doesn't need to go to disk to fetch the data; it is obtained directly from the index.
So as it stands I am inclined to tell my boss "I'm sorry, but it can't be done", but before doing so I'm coming to you guys to ask whether there is something I could do to improve this. I think I could improve the index search time by moving the column with the highest cardinality to the front of the index and so on, but even after that, the time it takes to access the disk for each record and do the AVG seems like too much.
Any ideas?
-- EDIT, the table structure
CREATE TABLE `Valorizacion` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`idInstrumento` int(11) NOT NULL,
`fecha` date NOT NULL,
`tir` decimal(10,4) DEFAULT NULL,
`tirBase` decimal(10,4) DEFAULT NULL,
`plazoResidual` double NOT NULL,
`duracionMacaulay` double DEFAULT NULL,
`duracionModACT365` double DEFAULT NULL,
`precioPorcentajeValorPar` decimal(20,15) DEFAULT NULL,
`valorPar` decimal(20,15) DEFAULT NULL,
`convexidad` decimal(20,15) DEFAULT NULL,
`volatilidad` decimal(20,15) DEFAULT NULL,
`montoCLP` double DEFAULT NULL,
`tirACT365` decimal(10,4) DEFAULT NULL,
`tipoVal` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idEmisorFusionado_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idMoneda_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idClasificacionRA_cache` int(11) DEFAULT NULL,
`idTipoRA_cache` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`fechaPrepagable_cache` date DEFAULT NULL,
`tasaEmision_cache` decimal(10,4) DEFAULT NULL,
PRIMARY KEY (`id`,`fecha`),
KEY `ix_FechaNemo` (`fecha`,`idInstrumento`) USING BTREE,
KEY `ix_mercado_stackover` (`idMoneda_cache`,`idTipoRA_cache`,`idEmisorFusionado_cache`,`plazoResidual`)
) ENGINE=InnoDB AUTO_INCREMENT=12933194 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Selecting 150K records out of 12M records and performing aggregate functions on them will not be fast no matter what you try to do.
You are probably dealing with primarily historical data as your sample query is for a year of data. A better approach may be to pre-calculate your daily averages and put them into separate tables. Then you may query those tables for reporting, graphs, etc. You will need to decide when and how to run such calculations so that you don't need to re-run them again on the same data.
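A hedged sketch of such a pre-calculated table (all names are assumptions; storing SUM and COUNT rather than AVG lets you re-aggregate correctly later, and the plazoResidual range filter would need to become an extra dimension or bucket in this table):
CREATE TABLE Valorizacion_daily (
  fecha date NOT NULL,
  idMoneda_cache varchar(20) NOT NULL,
  idTipoRA_cache varchar(20) NOT NULL,
  idEmisorFusionado_cache varchar(20) NOT NULL,
  registros int NOT NULL,
  sum_tir decimal(20,4) NOT NULL,
  sum_tirBase decimal(20,4) NOT NULL,
  sum_precioPorcentajeValorPar decimal(30,15) NOT NULL,
  PRIMARY KEY (fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache)
) ENGINE=InnoDB;
-- Refresh once per day for the day that just closed:
INSERT INTO Valorizacion_daily
SELECT fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache,
       COUNT(*), SUM(tir), SUM(tirBase), SUM(precioPorcentajeValorPar)
FROM Valorizacion
WHERE fecha = '2012-07-18'
GROUP BY fecha, idMoneda_cache, idTipoRA_cache, idEmisorFusionado_cache;
-- The report then reads the small table instead of 12M rows:
SELECT fecha,
       SUM(sum_tir)/SUM(registros) AS tir,
       SUM(sum_tirBase)/SUM(registros) AS tirBase,
       SUM(sum_precioPorcentajeValorPar)/SUM(registros) AS precioPorcentajeValorPar
FROM Valorizacion_daily
WHERE fecha >= '2011-07-17' AND fecha <= '2012-07-18'
  AND idMoneda_cache IN ('UF')
  AND idEmisorFusionado_cache IN ('ABN AMRO','WATTS')  -- plus the rest of the issuer list
  AND idTipoRA_cache IN ('BB','BE','BS','BU')
GROUP BY fecha
ORDER BY fecha;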
When your requirement is to do analysis and reporting on millions of historical records you need to consider a data warehouse approach http://en.wikipedia.org/wiki/Data_warehouse rather than a simple database approach.