Faster way to match a string in MySQL using replace - mysql

I have an interesting problem trying to select rows from a table where there are multiple possibilities for a VARCHAR column in my where clause.
Here's my table (which has around 7 million rows):
CREATE TABLE `search_upload_detailed_results` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`surId` bigint(20) DEFAULT NULL,
`company` varchar(100) DEFAULT NULL,
`country` varchar(45) DEFAULT NULL,
`clei` varchar(100) DEFAULT NULL,
`partNumber` varchar(100) DEFAULT NULL,
`mfg` varchar(100) DEFAULT NULL,
`cond` varchar(45) DEFAULT NULL,
`price` float DEFAULT NULL,
`qty` int(11) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`description` varchar(500) DEFAULT NULL,
`status` varchar(45) DEFAULT NULL,
`fileId` bigint(20) DEFAULT NULL,
`nmId` bigint(20) DEFAULT NULL,
`quoteRequested` tinyint(1) DEFAULT '0',
PRIMARY KEY (`id`),
KEY `sudr.surId` (`surId`),
KEY `surd.clei` (`clei`),
KEY `surd.pn` (`partNumber`),
KEY `surd.fileId` (`fileId`),
KEY `surd.price` (`price`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I'm trying to match on the partNumber column. The problem is that the partNumber is stored in different formats, and can be entered into the search form in multiple formats.
Example: Part Number '300-1231-932' could be:
300-1231-932
3001231932
300 1231 932
A simple select like this takes 0.0008 seconds.
select avg(price) as price from search_upload_detailed_results where
partNumber LIKE '3001231932%' and price > 0;
But it doesn't give me all of the matches that I need. So I wrote this query.
select avg(price) as price from search_upload_detailed_results
where REPLACE(REPLACE(partNumber,'-',''),' ','') LIKE REPLACE(REPLACE('3001231932%','-',''),' ','') and price > 0;
This gives me all of the correct matches, but it's super slow at 3.3 seconds.
I played around with some things, trying to reduce the number of rows I'm doing the replace on, and came up with this.
select avg(price) as price from search_upload_detailed_results
where price > 0 AND
partNumber LIKE('300%') AND
REPLACE(REPLACE(partNumber,'-',''),' ','') LIKE REPLACE(REPLACE('3001231932%','-',''),' ','');
It takes 0.4 seconds to execute. Pretty fast, but could still be a bit time consuming in a multi-part search.
I would like to get it a little faster, but this is as far as I could get. Are there any other ways to optimize this query?
UPDATE to show explain for the 3rd query:
# id, select_type, table, type, possible_keys, key, key_len, ref, rows, Extra
1, SIMPLE, search_upload_detailed_results, range, surd.pn,surd.price, surd.pn, 103, , 89670, Using where

The obvious solution is to just store the part number with no extra characters in the table. Then remove these characters from the user input, and just do a simple WHERE partnumber = #input query.
If that's not possible, you can add that as an additional column. In MySQL 5.7 you can use a generated column; in earlier versions you can use a trigger that fills in this column.
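A sketch of the generated-column variant, using SQLite (3.31+) for illustration since it is easy to run; in MySQL 5.7 the column would be declared roughly as `partNumberClean varchar(100) AS (REPLACE(REPLACE(partNumber,'-',''),' ','')) STORED` with a KEY on it. The column name `partNumberClean` is made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE search_upload_detailed_results (
        id INTEGER PRIMARY KEY,
        partNumber TEXT,
        -- normalized copy kept in sync by the database itself
        partNumberClean TEXT GENERATED ALWAYS AS
            (REPLACE(REPLACE(partNumber, '-', ''), ' ', '')) STORED,
        price REAL
    )
""")
conn.execute("CREATE INDEX idx_pn_clean "
             "ON search_upload_detailed_results(partNumberClean)")
conn.executemany(
    "INSERT INTO search_upload_detailed_results (partNumber, price) "
    "VALUES (?, ?)",
    [("300-1231-932", 10.0), ("3001231932", 20.0),
     ("300 1231 932", 30.0), ("999-0000-111", 99.0)],
)

# Strip the separators from the user's input too, then use a plain,
# indexable equality comparison on the generated column.
user_input = "300-1231-932"
needle = user_input.replace("-", "").replace(" ", "")
row = conn.execute(
    "SELECT AVG(price) FROM search_upload_detailed_results "
    "WHERE partNumberClean = ? AND price > 0", (needle,)
).fetchone()
print(row[0])  # all three spellings match -> (10+20+30)/3 = 20.0
```

The key point is that the REPLACE work happens once per insert, not once per row per query, so the search itself is a plain index lookup.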

I would like to get it a little faster, but this is as far as I could get. Are there any other ways to optimize this query?
As Barmar has said, the best solution if you really need speed (is 3.3s slow?) is to have a column containing the normalised data (hopefully now standardised); that'll allow you to query it without specifying all the different forms of the part number.
Example: Part Number '300-1231-932' could be:
300-1231-932
3001231932
300 1231 932
I think you should worry about the presentation of your data; having all those different 'formats' will make things difficult. Can you normalise to one standard (before it reaches the DB)?
Here's my table (which has around 7 million rows):
Don't forget your index!

As mentioned elsewhere, the problem is the table format. If this is a non-negotiable then another alternative is:
If there are a few formats, but not too many, and they are well known (e.g. the three you've shown), then the query can be made to run faster by explicitly precalculating them all and searching for any of them.
select avg(price) as price from search_upload_detailed_results where
partNumber IN ('300-1231-932', '3001231932', '300 1231 932')
This will take the best advantage of the index you presumably have on partNumber.
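If the spellings follow a known pattern, the IN list can be generated from the user's input rather than typed by hand. A sketch, with SQLite standing in for MySQL; `variants` is a hypothetical helper and is deliberately naive (real part-number formats may need more cases):

```python
import sqlite3

def variants(part: str) -> list[str]:
    """Precompute the handful of known spellings: bare, dashed, spaced."""
    bare = part.replace("-", "").replace(" ", "")
    dashed = part.replace(" ", "-") if " " in part else part
    spaced = dashed.replace("-", " ")
    return sorted({bare, dashed, spaced})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (partNumber TEXT, price REAL)")
conn.execute("CREATE INDEX idx_pn ON results(partNumber)")
conn.executemany("INSERT INTO results VALUES (?, ?)",
                 [("300-1231-932", 10.0), ("3001231932", 20.0),
                  ("300 1231 932", 30.0), ("400-0000-000", 5.0)])

vs = variants("300-1231-932")
placeholders = ",".join("?" * len(vs))
avg = conn.execute(
    f"SELECT AVG(price) FROM results "
    f"WHERE partNumber IN ({placeholders}) AND price > 0", vs
).fetchone()[0]
print(avg)  # matches all three spellings -> 20.0
```

Each value in the IN list is an exact, index-friendly lookup, so this stays fast as long as the set of formats is small and known in advance.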

You may find that MySQL can make good use of the indexes for carefully selected regular expressions.
select avg(price) as price from search_upload_detailed_results where
partNumber REGEXP '^300[- ]?1231[- ]?932';

Related

Efficient MySQL query for huge set of data

Say i have a table like below:
CREATE TABLE `hadoop_apps` (
`clusterId` smallint(5) unsigned NOT NULL,
`appId` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
`user` varchar(64) COLLATE utf8_unicode_ci NOT NULL,
`queue` varchar(35) COLLATE utf8_unicode_ci NOT NULL,
`appName` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`submitTime` datetime NOT NULL COMMENT 'App submission time',
`finishTime` datetime DEFAULT NULL COMMENT 'App completion time',
`elapsedTime` int(11) DEFAULT NULL COMMENT 'App duration in milliseconds',
PRIMARY KEY (`clusterId`,`appId`,`submitTime`),
KEY `hadoop_apps_ibk_finish` (`finishTime`),
KEY `hadoop_apps_ibk_queueCluster` (`queue`,`clusterId`),
KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
mysql> SELECT COUNT(*) FROM hadoop_apps;
This would return me a count 158593816
So I am trying to understand what is inefficient about the below query and how I can improve it.
mysql> SELECT * FROM hadoop_apps WHERE DATE(finishTime)='10-11-2013';
Also, what's the difference between these two queries?
mysql> SELECT * FROM hadoop_apps WHERE user='foobar';
mysql> SELECT * FROM hadoop_apps HAVING user='foobar';
WHERE DATE(finishTime)='10-11-2013';
This is a problem for the optimizer, because any time you put a column inside a function like this, the optimizer doesn't know whether the function preserves the ordering of the column's values, so it can't use an index to speed up the lookup.
To solve this, refrain from putting the column inside a function call like that, if you want the lookup against that column to use an index.
Also, you should use MySQL standard date format: YYYY-MM-DD.
WHERE finishTime BETWEEN '2013-10-11 00:00:00' AND '2013-10-11 23:59:59'
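The rewrite can be sanity-checked with a quick sketch, with SQLite standing in for MySQL (the sargability principle is the same in both): the function-wrapped predicate and the plain range select the same rows, but only the range can be answered from an index on `finishTime`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hadoop_apps (appId TEXT, finishTime TEXT)")
conn.execute("CREATE INDEX idx_finish ON hadoop_apps(finishTime)")
conn.executemany("INSERT INTO hadoop_apps VALUES (?, ?)",
                 [("a1", "2013-10-11 00:30:00"),
                  ("a2", "2013-10-11 23:59:59"),
                  ("a3", "2013-10-12 00:00:00")])

# Non-sargable: wrapping the column in a function hides it from the index.
slow = conn.execute(
    "SELECT appId FROM hadoop_apps WHERE DATE(finishTime) = '2013-10-11'"
).fetchall()

# Sargable: a plain range on the stored value can use the index directly.
fast = conn.execute(
    "SELECT appId FROM hadoop_apps "
    "WHERE finishTime >= '2013-10-11 00:00:00' "
    "AND finishTime <= '2013-10-11 23:59:59'"
).fetchall()
print(slow == fast, [r[0] for r in fast])  # True ['a1', 'a2']
```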
What is the difference between [conditions in WHERE and HAVING clauses]?
The WHERE clause is for filtering rows.
The HAVING clause is for filtering results after applying GROUP BY.
See SQL - having VS where
If WHERE works, it is preferred over HAVING. The former is done earlier in the processing, thereby cutting down on the amount of data to shovel through. OK, in your one example, there may be no difference between them.
I cringe whenever I see a DATETIME in a UNIQUE key (your PK). Can't the app have two rows in the same second? Is that a risk you want to take?
Even changing to DATETIME(6) (microseconds) could be risky.
Regardless of what you do in that area, I recommend this pattern for testing:
WHERE finishTime >= '2013-10-11'
AND finishTime < '2013-10-11' + INTERVAL 1 DAY
It works "correctly" for DATE, DATETIME, DATETIME(6), etc. Other flavors add an extra midnight or miss parts of a second. And it avoids hassles with leap days, etc., if the interval is more than a single day.
KEY `hadoop_apps_ibk_userCluster` (`user`(8),`clusterId`)
is bad. It won't get past user(8). And prefixing like that is often useless. Let's see the query that tempted you to build that key; we'll come up with a better one.
158M rows with 4 varchars. And they sound like values that don't have many distinct values? Build lookup tables and replace them with SMALLINT UNSIGNED (2 bytes, 0..64K range) or other small id. This will significantly shrink the table, thereby making it faster.
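A sketch of that lookup-table normalisation, with SQLite for illustration; the `queues` table and `queue_id` helper are made-up names for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE queues (queueId INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE hadoop_apps (appId TEXT, queueId INTEGER REFERENCES queues);
""")

def queue_id(name: str) -> int:
    # Insert-if-missing, then look up the small surrogate id.
    conn.execute("INSERT OR IGNORE INTO queues (name) VALUES (?)", (name,))
    return conn.execute("SELECT queueId FROM queues WHERE name = ?",
                        (name,)).fetchone()[0]

for app, q in [("a1", "etl"), ("a2", "adhoc"), ("a3", "etl")]:
    conn.execute("INSERT INTO hadoop_apps VALUES (?, ?)", (app, queue_id(q)))

# Reports join back to the lookup table; each fact row now stores a
# couple of bytes instead of a repeated string.
rows = conn.execute("""
    SELECT q.name, COUNT(*) FROM hadoop_apps a
    JOIN queues q ON q.queueId = a.queueId
    GROUP BY q.name ORDER BY q.name
""").fetchall()
print(rows)  # [('adhoc', 1), ('etl', 2)]
```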

mysql select distinct date takes FOREVER on database w/ 374 million rows

I have a MYSQL DB with table definition like this:
CREATE TABLE `minute_data` (
`date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`open` decimal(10,2) DEFAULT NULL,
`high` decimal(10,2) DEFAULT NULL,
`low` decimal(10,2) DEFAULT NULL,
`close` decimal(10,2) DEFAULT NULL,
`volume` decimal(10,2) DEFAULT NULL,
`adj_close` varchar(45) DEFAULT NULL,
`symbol` varchar(10) NOT NULL DEFAULT '',
PRIMARY KEY (`symbol`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
It stores 1 minute data points from the stock market. The primary key is a combination of the symbol and date columns. This way I always have only 1 data point for each symbol at any time.
I am wondering why the following query takes so long that I can't even wait for it to finish:
select distinct date from test.minute_data where date >= "2013-01-01"
order by date asc limit 100;
However I can select count(*) from minute_data; and that finishes very quickly.
I know that it must have something to do with the fact that there are over 374 million rows of data in the table, and my desktop computer is pretty far from a super computer.
Does anyone know something I can try to speed up with query? Do I need to abandon all hope of using a MySQL table this big??
Thanks a lot!
When you have a composite index on 2 columns, like your (symbol, date) primary key, searching and grouping by a prefix of the key will be fast. But searching for something that doesn't include the first column of the index requires scanning all rows or using some other index.
You can either change your primary key to (date, symbol) if you don't usually need to search for symbol without date. Or you can add an additional index on date:
alter table minute_data add index (date)
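The leftmost-prefix rule behind this advice shows up directly in a query plan. A small sketch, with SQLite standing in for MySQL (the prefix rule is the same in both): before the single-column index exists, the plan cannot use it for the date-only query; afterwards it can.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE minute_data (
        symbol TEXT NOT NULL,
        date   TEXT NOT NULL,
        close  REAL,
        PRIMARY KEY (symbol, date)
    )
""")
conn.executemany("INSERT INTO minute_data VALUES (?, ?, ?)",
                 [("AAPL", "2013-01-02 09:30", 1.0),
                  ("MSFT", "2013-01-02 09:30", 2.0),
                  ("AAPL", "2013-01-03 09:30", 3.0)])

query = ("SELECT DISTINCT date FROM minute_data "
         "WHERE date >= '2013-01-01' ORDER BY date LIMIT 100")

# Plan text before and after adding the dedicated index on `date`.
plan_before = " ".join(r[3] for r in
                       conn.execute("EXPLAIN QUERY PLAN " + query))
conn.execute("CREATE INDEX idx_date ON minute_data(date)")
plan_after = " ".join(r[3] for r in
                      conn.execute("EXPLAIN QUERY PLAN " + query))

print("idx_date" in plan_before, "idx_date" in plan_after)  # False True
```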

MySQL query usually fast, sometimes very slow (when run on my website)

I have a query on my MySQL database that is used to return products (on an e-commerce site) after the user performs a search in a free-form text box.
Until recently, user searches have been running fast. However, in the last few days, the searches are periodically very slow. This occurs for about 3 or 4 hours (spread randomly throughout the day) each day.
I had thought that there was a problem with my server. But now I have moved to another server, and the same thing still happens.
I suspect that the query I use is quite inefficient, and maybe this is the cause. But I don't understand why usually the query can run fast, and at other times be very slow. I would assume that an inefficient query would always be slow.
The query runs on two tables. If the search is for "blue jeans", then the query would look like this:
SELECT I.itemURL, I.itemTitle, I.itemPrice, I.itemReduced, I.itemFileName, I.itemBrand, I.itemStore, I.itemID, I.itemColour, I.itemSizes, I.itemBrandEn
FROM Item AS I, Search AS S
WHERE I.itemID = S.itemID
AND (S.searchStringTEXT LIKE '% blue %' OR S.searchStringTEXT LIKE 'blue %' OR S.searchStringTEXT LIKE '% blue')
AND (S.searchStringTEXT LIKE '% jeans %' OR S.searchStringTEXT LIKE 'jeans %' OR S.searchStringTEXT LIKE '% jeans')
Item is the table containing all products on the site. It has around 100,000 rows.
Search is a table containing product ids, and the tags associated to each product id. The tags are in the column "searchStringTEXT", and are separated by spaces. E.g., an entry in this column may be something like "jeans blue calvin klein small".
The search above will find all items that have both the tag "jeans" and "blue" attached.
In theory, Search should have the same number of rows as Item; but, due to a problem that I haven't fixed yet, it has about 500 fewer rows, so these items are effectively excluded from searches.
The create table details for both tables are as follows:
CREATE TABLE `Search` (
`itemID` int(11) NOT NULL,
`searchStringTEXT` varchar(255) DEFAULT NULL,
`searchStringVARCHAR` varchar(1000) DEFAULT NULL,
PRIMARY KEY (`itemID`),
KEY `indexSearch_837` (`itemID`) USING BTREE,
KEY `indexSearch_837_text` (`searchStringTEXT`)
) ENGINE=InnoDB DEFAULT CHARSET=latin5
and
CREATE TABLE `Item_8372` (
`itemID` int(11) NOT NULL AUTO_INCREMENT,
`itemURL` varchar(2000) DEFAULT NULL,
`itemTitle` varchar(500) DEFAULT NULL,
`itemFileName` varchar(200) DEFAULT NULL,
`itemPictureURL` varchar(2000) DEFAULT NULL,
`itemReduced` int(11) DEFAULT NULL,
`itemPrice` int(11) DEFAULT NULL,
`itemStore` varchar(500) DEFAULT NULL,
`itemBrand` varchar(500) CHARACTER SET latin1 DEFAULT NULL,
`itemShopCat` varchar(500) DEFAULT NULL,
`itemTimestamp` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`itemCat` varchar(200) DEFAULT NULL,
`itemSubCat` varchar(200) DEFAULT NULL,
`itemSubSubCat` varchar(200) DEFAULT NULL,
`itemSubSubSubCat` varchar(200) DEFAULT NULL,
`itemColour` varchar(200) DEFAULT NULL,
`itemSizes` varchar(200) DEFAULT NULL,
`itemBrandEn` varchar(500) DEFAULT NULL,
`itemReduction` float DEFAULT NULL,
`ItemPopularity` int(6) DEFAULT NULL,
PRIMARY KEY (`itemID`),
KEY `indexItem_8372_ts` (`itemTimestamp`) USING BTREE,
KEY `indexItem_8372_pop` (`ItemPopularity`),
KEY `indexItem_8372_red` (`itemReduction`),
KEY `indexItem_8372_price` (`itemReduced`)
) ENGINE=InnoDB AUTO_INCREMENT=970846 DEFAULT CHARSET=latin5
In the title of the question I say "(when run on my website)", because I find that the query's speed is consistent when run locally. But maybe this is just because I haven't tested it as much locally.
I'm thinking of changing the Search table so it's a MyISAM table, and then I can use a full text search instead of "LIKE". But I'd still like to figure out why I am experiencing what I am experiencing with the current setup.
Any ideas/suggestions much appreciated,
Andrew
Edit:
Here is the EXPLAIN result for the SELECT statement:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE I ALL PRIMARY NULL NULL NULL 81558
1 SIMPLE S eq_ref PRIMARY,indexSearch_837,indexSearch_837_text PRIMARY 4 I.itemID 1 Using where
Using LIKE will result in slow queries, especially the number of times you use it. This is because a LIKE pattern with a leading wildcard cannot use an index, so it scans all the rows in the table, which even with a modest table can cause a problem (add to that the number of times you used LIKE with different variations of the text to match).
It only takes a few people simultaneously loading a page that runs this query to really slow your site down. Furthermore, the query may be running multiple times on your server in background threads from earlier, causing the longer-term slowdowns you're seeing.
If you're going to be performing text searches regularly, consider a RAM-based indexing solution for your search such as Sphinx. You could perform the text search there (it will be very fast compared to MySQL) and then retrieve the needed rows from the MySQL tables after.
You might want to try using REGEXP (reference). It's also better practice to stop using implicit joins.
SELECT I.itemurl,
I.itemtitle,
I.itemprice,
I.itemreduced,
I.itemfilename,
I.itembrand,
I.itemstore,
I.itemid,
I.itemcolour,
I.itemsizes,
I.itembranden
FROM item AS I
INNER JOIN search AS S
ON I.itemid = S.itemid
WHERE S.searchstringtext REGEXP ' blue | blue|blue '
AND S.searchstringtext REGEXP ' jeans | jeans|jeans '
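For illustration, here is the same whole-word predicate sketched against SQLite, which needs a user-defined REGEXP function registered first (MySQL's REGEXP is built in). The `\b` word boundaries are a tighter alternative to the space-based alternations above:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite's REGEXP operator delegates to a user function named "regexp".
conn.create_function("REGEXP", 2,
                     lambda pattern, s: re.search(pattern, s) is not None)
conn.execute("CREATE TABLE search (itemID INTEGER, searchStringTEXT TEXT)")
conn.executemany("INSERT INTO search VALUES (?, ?)",
                 [(1, "jeans blue calvin klein small"),
                  (2, "jeans black slim"),
                  (3, "blue shirt")])

# Both tags must match as whole words.
rows = conn.execute(
    "SELECT itemID FROM search "
    "WHERE searchStringTEXT REGEXP ? AND searchStringTEXT REGEXP ?",
    (r"\bblue\b", r"\bjeans\b")
).fetchall()
print([r[0] for r in rows])  # [1]
```

Note that like LIKE with a leading wildcard, REGEXP still scans every row; the gain is correctness and fewer predicates, not index use.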

MySQL Query Optimization on a Big Table

I am working with MySQL, querying a table that has 12 million records covering a year of the data in question.
The query has to select certain kind of data (coin, enterprise, type, etc..) and then provide a daily average for certain fields of that data, so we can graph it afterwards.
The dream is to be able to do this in real time, with a response time under 10 seconds; at the moment, however, it's not looking bright at all, taking between 4 and 6 minutes.
For example, one of the WHERE filters comes up with 150k records, split into about 500 per day, and then we average three fields (which are not in the WHERE clause) using AVG() and GROUP BY.
Now, to the raw data, the query is
SELECT
`Valorizacion`.`fecha`, AVG(tir) AS `tir`, AVG(tirBase) AS `tirBase`, AVG(precioPorcentajeValorPar) AS `precioPorcentajeValorPar`
FROM `Valorizacion` USE INDEX (ix_mercado2)
WHERE
(Valorizacion.fecha >= '2011-07-17' ) AND
(Valorizacion.fecha <= '2012-07-18' ) AND
(Valorizacion.plazoResidual >= 365 ) AND
(Valorizacion.plazoResidual <= 3650000 ) AND
(Valorizacion.idMoneda_cache IN ('UF')) AND
(Valorizacion.idEmisorFusionado_cache IN ('ABN AMRO','WATTS', ...)) AND
(Valorizacion.idTipoRA_cache IN ('BB', 'BE', 'BS', 'BU'))
GROUP BY `Valorizacion`.`fecha` ORDER BY `Valorizacion`.`fecha` asc;
248 rows in set (4 min 28.82 sec)
The index is made over all the where clause fields in the order
(fecha, idTipoRA_cache, idMoneda_cache, idEmisorFusionado_cache, plazoResidual)
Selecting just the records matched by the WHERE clause, without GROUP BY or AVG, takes
149670 rows in set (58.77 sec)
And selecting the records, grouping, and doing a COUNT(*) instead of the average takes
248 rows in set (35.15 sec)
That is probably because it doesn't need to go to disk to fetch the data; it comes directly from the index.
So as far as it goes, I'm inclined to tell my boss "I'm sorry, but it can't be done", but before doing so I'm coming to you to ask whether there is anything I could do to improve this. I think I could improve the index search time by moving the index with the biggest cardinality to the front, and so on, but even after that, the time it takes to access the disk for each record and compute the AVG seems too much.
Any ideas?
-- EDIT, the table structure
CREATE TABLE `Valorizacion` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`idInstrumento` int(11) NOT NULL,
`fecha` date NOT NULL,
`tir` decimal(10,4) DEFAULT NULL,
`tirBase` decimal(10,4) DEFAULT NULL,
`plazoResidual` double NOT NULL,
`duracionMacaulay` double DEFAULT NULL,
`duracionModACT365` double DEFAULT NULL,
`precioPorcentajeValorPar` decimal(20,15) DEFAULT NULL,
`valorPar` decimal(20,15) DEFAULT NULL,
`convexidad` decimal(20,15) DEFAULT NULL,
`volatilidad` decimal(20,15) DEFAULT NULL,
`montoCLP` double DEFAULT NULL,
`tirACT365` decimal(10,4) DEFAULT NULL,
`tipoVal` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idEmisorFusionado_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idMoneda_cache` varchar(20) COLLATE utf8_unicode_ci DEFAULT NULL,
`idClasificacionRA_cache` int(11) DEFAULT NULL,
`idTipoRA_cache` varchar(20) COLLATE utf8_unicode_ci NOT NULL,
`fechaPrepagable_cache` date DEFAULT NULL,
`tasaEmision_cache` decimal(10,4) DEFAULT NULL,
PRIMARY KEY (`id`,`fecha`),
KEY `ix_FechaNemo` (`fecha`,`idInstrumento`) USING BTREE,
KEY `ix_mercado_stackover` (`idMoneda_cache`,`idTipoRA_cache`,`idEmisorFusionado_cache`,`plazoResidual`)
) ENGINE=InnoDB AUTO_INCREMENT=12933194 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Selecting 150K records out of 12M records and performing aggregate functions on them will not be fast no matter what you try to do.
You are probably dealing with primarily historical data as your sample query is for a year of data. A better approach may be to pre-calculate your daily averages and put them into separate tables. Then you may query those tables for reporting, graphs, etc. You will need to decide when and how to run such calculations so that you don't need to re-run them again on the same data.
When your requirement is to do analysis and reporting on millions of historical records you need to consider a data warehouse approach http://en.wikipedia.org/wiki/Data_warehouse rather than a simple database approach.
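A minimal sketch of the pre-aggregation approach, with SQLite for illustration; the table and column names follow the question, the `daily_avg` summary table is a made-up name, and when to run the rollup (nightly, on load, etc.) is left open:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Valorizacion "
             "(fecha TEXT, idMoneda_cache TEXT, tir REAL)")
conn.executemany("INSERT INTO Valorizacion VALUES (?, ?, ?)",
                 [("2012-07-17", "UF", 2.0), ("2012-07-17", "UF", 4.0),
                  ("2012-07-18", "UF", 6.0), ("2012-07-18", "CLP", 9.0)])

# Roll the raw rows up into one row per (day, dimension) once; any
# dimension you filter on later must be a column of the summary table.
conn.executescript("""
    CREATE TABLE daily_avg (fecha TEXT, idMoneda_cache TEXT, avg_tir REAL,
                            PRIMARY KEY (fecha, idMoneda_cache));
    INSERT INTO daily_avg
    SELECT fecha, idMoneda_cache, AVG(tir)
    FROM Valorizacion GROUP BY fecha, idMoneda_cache;
""")

# The reporting query now reads one row per day instead of ~500.
rows = conn.execute(
    "SELECT fecha, avg_tir FROM daily_avg "
    "WHERE idMoneda_cache = 'UF' ORDER BY fecha"
).fetchall()
print(rows)  # [('2012-07-17', 3.0), ('2012-07-18', 6.0)]
```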

mySQL query optimisation, range/composite index/group by

The basic form of the query is:
EXPLAIN SELECT SUM(impressions) as impressions, SUM(clicks) as clicks, SUM(cost) as cost, SUM(conversions) as conversions, keyword_id FROM `keyword_track` WHERE user_id=1 AND campaign_id=543 AND `recorded`>1325376071 GROUP BY keyword_id
It seems that I can index user_id, campaign_id and keyword_id and get the GROUP BY without a filesort, although a range index on recorded would cut down the rows much more aggressively; this example has a big time range, but other queries have a much smaller one.
Table looks like:
CREATE TABLE IF NOT EXISTS `keyword_track` (
`track_id` int(11) NOT NULL auto_increment,
`user_id` int(11) NOT NULL,
`campaign_id` int(11) NOT NULL,
`adgroup_id` int(11) NOT NULL,
`keyword_id` int(11) NOT NULL,
`recorded` int(11) NOT NULL,
`impressions` int(11) NOT NULL,
`clicks` int(11) NOT NULL,
`cost` decimal(10,2) NOT NULL,
`conversions` int(11) NOT NULL,
`max_cpc` decimal(3,2) NOT NULL,
`quality_score` tinyint(4) NOT NULL,
`avg_position` decimal(2,1) NOT NULL,
PRIMARY KEY (`track_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ;
I have left out any keys I currently have. Basically, my question is: what would be the best way to index the range while still indexing at least campaign_id, ideally without a filesort (although that might be an acceptable tradeoff for a range index on the recorded time)?
Whenever we have a range constraint and a grouping/ordering constraint on different attributes of a table, we can take advantage of either fast filtering or fast ordering for the result set, but not BOTH.
My answer is...
If your range constraint really cuts the huge number of records down to a small result set, it is better to index to support the range constraint, i.e. (user_id, campaign_id, recorded).
If not, i.e. if a really big number of rows remain even after the range condition is applied and they still have to be sorted, then go for an index that supports the ordering, i.e. (user_id, campaign_id, keyword_id).
To better understand this, have a look at the below link where the same thing is explained very clearly.
http://explainextended.com/2009/04/01/choosing-index/
The best index for you in this case is the composite one: (user_id, campaign_id, recorded).
Though this will not help avoid the filesort, as long as you have a > comparison on recorded and group by a field that isn't included in the index at all.