MySQL RDS performance of aggregation functions

We have a query that performs some aggregation on one column.
The filtering of the data seems to be pretty fast, but the aggregation seems to take too much time.
This query returns ~1.5 million rows and runs for 0.6 seconds (if we want to return the data to the client it takes ~2 minutes; we tested this with the pymysql Python library, using an unbuffered cursor so we can distinguish between query run time and fetch time):
SELECT *
FROM t_data t1
WHERE (t1.to_date = '2019-03-20')
AND (t1.period = 30)
AND (label IN ('aa','bb') )
AND ( id IN (
SELECT id
FROM t_location_data
WHERE (to_date = '2019-03-20') AND (period = 30)
AND ( country = 'Narniya' ) ) )
But if we run this query:
SELECT MAX(val) val_max,
AVG(val) val_avg,
MIN(val) val_min
FROM t_data t1
WHERE (t1.to_date = '2019-03-20')
AND (t1.period = 30)
AND (label IN ('aa','bb') )
AND ( id IN (
SELECT id
FROM t_location_data
WHERE (to_date = '2019-03-20') AND (period = 30)
AND ( country = 'Narniya' ) ) )
We see that the query takes 40 seconds to run, while fetching the results in this case obviously takes less than a second.
Any help with this terrible performance of the aggregation functions over RDS Aurora? Why does calculating MAX, MIN, and AVG over 1.5 million rows take so long? (For comparison, the same calculation over those numbers in Python takes less than 1 second.)
NOTE: We added a random number to each SELECT to make sure we do not get cached values.
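For example, something like this, where the literal is regenerated on every run (illustrative; the exact form we used doesn't matter):
SELECT MAX(val) val_max,
       AVG(val) val_avg,
       MIN(val) val_min,
       0.52473 AS rnd  -- random literal, different on each run, so no cached result can match
FROM t_data t1
WHERE ...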
We use Aurora RDS:
1 instance of db.r5.large (2 vCPU + 16 GB RAM)
MySQL Engine version: 5.6.10a
Create table:
Create Table: CREATE TABLE `t_data` (
`id` varchar(256) DEFAULT NULL,
`val2` int(11) DEFAULT NULL,
`val3` int(11) DEFAULT NULL,
`val` int(11) DEFAULT NULL,
`val4` int(11) DEFAULT NULL,
`tags` varchar(256) DEFAULT NULL,
`val7` int(11) DEFAULT NULL,
`label` varchar(32) DEFAULT NULL,
`val5` varchar(64) DEFAULT NULL,
`val6` int(11) DEFAULT NULL,
`period` int(11) DEFAULT NULL,
`to_date` varchar(64) DEFAULT NULL,
`data_line_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`data_line_id`),
UNIQUE KEY `id_data` (`to_date`,`period`,`id`),
KEY `index1` (`to_date`,`period`,`id`),
KEY `index3` (`to_date`,`period`,`label`)
) ENGINE=InnoDB AUTO_INCREMENT=218620560 DEFAULT CHARSET=latin1
Create Table: CREATE TABLE `t_location_data` (
`id` varchar(256) DEFAULT NULL,
`country` varchar(256) DEFAULT NULL,
`state` varchar(256) DEFAULT NULL,
`city` varchar(256) DEFAULT NULL,
`latitude` float DEFAULT NULL,
`longitude` float DEFAULT NULL,
`val8` int(11) DEFAULT NULL,
`val9` tinyint(1) DEFAULT NULL,
`period` int(11) DEFAULT NULL,
`to_date` varchar(64) DEFAULT NULL,
`location_line_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`location_line_id`),
UNIQUE KEY `id_location_data` (`to_date`,`period`,`id`,`latitude`,`longitude`),
KEY `index1` (`to_date`,`period`,`id`,`country`),
KEY `index2` (`country`,`state`,`city`),
KEY `index3` (`to_date`,`period`,`country`,`state`)
) ENGINE=InnoDB AUTO_INCREMENT=315944737 DEFAULT CHARSET=latin1
Parameters:
innodb_buffer_pool_size / 1024 / 1024 / 1024: 7.7900 (≈ 7.79 GB)
innodb_buffer_pool_instances: 8
UPDATE:
Adding the val index (as suggested by @rick-james) improved the query dramatically (~2 seconds), but only if I delete the AND ( id IN (SELECT id FROM t_location_data ...)) condition. If I leave it in, the query runs for about 25 seconds. Better than before, but still not good.

Indexes needed:
t_data: INDEX(period, to_date, label, val)
t_data: INDEX(period, label, to_date, val)
t_location_data: INDEX(period, country, to_date, id)
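In DDL form, something like this (index names are illustrative):
ALTER TABLE t_data
    ADD INDEX idx_period_todate_label_val (period, to_date, label, val),
    ADD INDEX idx_period_label_todate_val (period, label, to_date, val);
ALTER TABLE t_location_data
    ADD INDEX idx_period_country_todate_id (period, country, to_date, id);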
Also, change from the slow IN ( SELECT ... ) to a JOIN:
FROM t_data AS d
JOIN t_location_data AS ld USING(id)
WHERE ...
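Roughly, applied to the aggregate query above (a sketch, not tested; note that if t_location_data can hold several rows per id, which its UNIQUE key suggests, the JOIN will duplicate t_data rows and skew AVG, in which case an EXISTS semi-join is safer):
SELECT MAX(d.val) AS val_max,
       AVG(d.val) AS val_avg,
       MIN(d.val) AS val_min
FROM t_data AS d
JOIN t_location_data AS ld USING(id)    -- replaces the IN ( SELECT ... )
WHERE d.to_date = '2019-03-20'
  AND d.period = 30
  AND d.label IN ('aa','bb')
  AND ld.to_date = '2019-03-20'
  AND ld.period = 30
  AND ld.country = 'Narniya';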
Better yet, since the tables are 1:1 (is that correct?), combine the tables so as to eliminate the JOIN. If id is not the PRIMARY KEY in each table, you really need to provide SHOW CREATE TABLE and should change the name(s).

Related

Slow search query with a one to many join

My problem is a slow search query with a one-to-many relationship between the tables. My tables look like this.
Table Assignment
CREATE TABLE `Assignment` (
`Id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`ProjectId` int(10) unsigned NOT NULL,
`AssignmentTypeId` smallint(5) unsigned NOT NULL,
`AssignmentNumber` varchar(30) NOT NULL,
`AssignmentNumberExternal` varchar(50) DEFAULT NULL,
`DateStart` datetime DEFAULT NULL,
`DateEnd` datetime DEFAULT NULL,
`DateDeadline` datetime DEFAULT NULL,
`DateCreated` datetime DEFAULT NULL,
`Deleted` datetime DEFAULT NULL,
`Lat` double DEFAULT NULL,
`Lon` double DEFAULT NULL,
PRIMARY KEY (`Id`),
KEY `idx_assignment_assignment_type_id` (`AssignmentTypeId`),
KEY `idx_assignment_assignment_number` (`AssignmentNumber`),
KEY `idx_assignment_assignment_number_external`
(`AssignmentNumberExternal`)
) ENGINE=InnoDB AUTO_INCREMENT=5280 DEFAULT CHARSET=utf8;
Table ExtraFields
CREATE TABLE `ExtraFields` (
`assignment_id` int(10) unsigned NOT NULL,
`name` varchar(30) NOT NULL,
`value` text,
PRIMARY KEY (`assignment_id`,`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
My search query
SELECT
`q1`.`Id`, COL_5_72, COL_5_73, COL_5_74, COL_5_75, COL_5_76,
COL_5_77 FROM (
SELECT
`Assignment`.`Id`,
`Assignment`.`AssignmentNumber` AS COL_5_72,
`Assignment`.`AssignmentNumberExternal` AS COL_5_73 ,
`AssignmentType`.`Name` AS COL_5_74,
`Assignment`.`DateStart` AS COL_5_75,
`Assignment`.`DateEnd` AS COL_5_76,
`Assignment`.`DateDeadline` AS COL_5_77,
CASE WHEN `ExtraField`.`Name` = "WorkDistrict" THEN
`ExtraField`.`Value` END AS COL_5_78
FROM `Assignment`
LEFT JOIN `ExtraFields` as `ExtraField` on
`ExtraField`.`assignment_id` = `Assignment`.`Id`
WHERE `Assignment`.`Deleted` IS NULL -- Assignment should not be removed.
AND (1=1) -- Add assignment filters.
) AS q1
GROUP BY `q1`.`Id`
HAVING 1 = 1
AND COL_5_78 LIKE '%Amsterdam East%'
ORDER BY COL_5_72 ASC, COL_5_73 ASC;
When the table is only around 3500 records my query takes a couple of seconds to execute and return the results.
What is a better way to search in the related data? Should I just add a JSON field to the Assignment table and use the MySQL 5.7 JSON query features? Or did I make a mistake in designing my database?
You are selecting from a subquery, which forces MySQL to create an unindexed temporary table for each execution. Remove the subquery (you really don't need it here) and it will be much faster.
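A sketch of the flattened version (the AssignmentType join is left out because its definition wasn't shown; with the WorkDistrict filter pushed into the join, the GROUP BY is no longer needed, since (assignment_id, name) is the primary key of ExtraFields):
SELECT
    `Assignment`.`Id`,
    `Assignment`.`AssignmentNumber` AS COL_5_72,
    `Assignment`.`AssignmentNumberExternal` AS COL_5_73,
    `Assignment`.`DateStart` AS COL_5_75,
    `Assignment`.`DateEnd` AS COL_5_76,
    `Assignment`.`DateDeadline` AS COL_5_77
FROM `Assignment`
LEFT JOIN `ExtraFields` AS `ExtraField`
       ON `ExtraField`.`assignment_id` = `Assignment`.`Id`
      AND `ExtraField`.`name` = 'WorkDistrict'
WHERE `Assignment`.`Deleted` IS NULL
  AND `ExtraField`.`value` LIKE '%Amsterdam East%'  -- makes the LEFT JOIN behave as INNER, matching the original HAVING
ORDER BY COL_5_72 ASC, COL_5_73 ASC;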

Optimize sql query to speed up a search which currently takes around 85 seconds

I have a database with nearly 2.7 million records. I need to fetch records from it, and for that I am using the queries below.
For the results:
SELECT r3.original_image_title, r3.uuid, r3.original_image_URL
FROM `image_attributes` AS r1
INNER JOIN `filenames` AS r3
WHERE r1.`uuid` = r3.`uuid`
  AND r3.`status` = 1
  AND r1.status = 1
  AND (r1.`attribute_name` LIKE "Quvenzhané Wallis%"
       OR r3.original_image_URL LIKE "Quvenzhané Wallis%")
GROUP BY r3.`uuid`
LIMIT 0, 20
For the total count:
SELECT COUNT(DISTINCT r1.`uuid`) AS count
FROM `image_attributes` AS r1
INNER JOIN `filenames` AS r3
WHERE r1.`uuid` = r3.`uuid`
  AND r3.`status` = 1
  AND r1.status = 1
  AND (r1.`attribute_name` LIKE "Quvenzhané Wallis%"
       OR r3.original_image_URL LIKE "Quvenzhané Wallis%")
The table structures are as below:
CREATE TABLE IF NOT EXISTS `image_attributes` (
`index` int(11) NOT NULL AUTO_INCREMENT,
`attribute_name` text NOT NULL,
`attribute_type` varchar(255) NOT NULL,
`uuid` varchar(255) NOT NULL,
`status` tinyint(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`index`),
KEY `attribute_type` (`attribute_type`),
KEY `uuid` (`uuid`),
KEY `status` (`status`),
KEY `attribute_name` (`attribute_name`(50))
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=2730431 ;
CREATE TABLE IF NOT EXISTS `filenames` (
`index` int(11) NOT NULL AUTO_INCREMENT,
`original_image_title` text NOT NULL,
`original_image_URL` text NOT NULL,
`uuid` varchar(255) NOT NULL,
`status` tinyint(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`index`),
KEY `uuid` (`uuid`),
KEY `original_image_URL` (`original_image_URL`(50))
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=591967 ;
Please suggest how I can optimize the queries to make the search results faster.
I would recommend a book called 'High Performance MySQL'. There is a section on optimizing databases and queries, or something like that.
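Beyond the book, one common rewrite (a sketch, not from the book) for an OR that spans two tables is a UNION, so each branch can use its own index; UNION also de-duplicates, roughly matching the GROUP BY uuid intent:
SELECT r3.original_image_title, r3.uuid, r3.original_image_URL
FROM image_attributes AS r1
JOIN filenames AS r3 ON r1.uuid = r3.uuid
WHERE r1.status = 1 AND r3.status = 1
  AND r1.attribute_name LIKE 'Quvenzhané Wallis%'      -- can use KEY attribute_name(50)
UNION
SELECT r3.original_image_title, r3.uuid, r3.original_image_URL
FROM image_attributes AS r1
JOIN filenames AS r3 ON r1.uuid = r3.uuid
WHERE r1.status = 1 AND r3.status = 1
  AND r3.original_image_URL LIKE 'Quvenzhané Wallis%'  -- can use KEY original_image_URL(50)
LIMIT 0, 20;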

how to make my select query faster in mysql?

I've got a table in mysql:
CREATE TABLE `pdd_data` (
`pdd_id` INT(16) NOT NULL primary key auto_increment,
`vin` varchar(32) NOT NULL,
`time` TIMESTAMP NOT NULL,
`cmd` varchar(16) NOT NULL,
`data` varchar(128) NOT NULL
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
I inserted 1,000,000 records into pdd_data, and I'll be using queries like the ones below frequently in the future:
select * from pdd_data where cmd = 4599;
select * from pdd_data where vin = 400;
select * from pdd_data where vin = 400 and cmd = 4599;
Currently, the query time is about 1.20s~1.90s. Could anyone give me some suggestions on how to make these queries faster?
P.S. I created a table with an index:
CREATE TABLE `pdd_data1` (
`pdd_id` INT(16) NOT NULL primary key auto_increment,
`vin` varchar(32) NOT NULL,
`time` TIMESTAMP NOT NULL,
`cmd` varchar(16) NOT NULL,
`data` varchar(128) NOT NULL,
index idx_vin_cmd (vin(32), cmd(16))
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
But there was no improvement in the SELECT query times.
My suggestion is: do not use SELECT *. Instead of SELECT *, use SELECT pdd_id, vin, time, cmd, data. This will definitely reduce your execution time.
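A sketch combining that with one more observation (not in the answer above): vin and cmd are VARCHAR, but the queries compare them to the numbers 400 and 4599, and comparing a string column to a numeric literal prevents MySQL from using an index on that column. Quoting the values lets idx_vin_cmd be considered:
-- String literals let the (vin, cmd) index be used;
-- numeric literals force a per-row string-to-number comparison instead.
SELECT pdd_id, vin, time, cmd, data FROM pdd_data1 WHERE vin = '400' AND cmd = '4599';
SELECT pdd_id, vin, time, cmd, data FROM pdd_data1 WHERE vin = '400';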

mysql join not use index for 'between' operator

So basically I have three tables:
CREATE TABLE `cdIPAddressToLocation` (
`IPADDR_FROM` int(10) unsigned NOT NULL COMMENT 'Low end of the IP Address block',
`IPADDR_TO` int(10) unsigned NOT NULL COMMENT 'High end of the IP Address block',
`IPLOCID` int(10) unsigned NOT NULL COMMENT 'The Location ID for the IP Address range',
PRIMARY KEY (`IPADDR_TO`),
KEY `Index_2` USING BTREE (`IPLOCID`),
KEY `Index_3` USING BTREE (`IPADDR_FROM`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `cdIPLocation` (
`IPLOCID` int(10) unsigned NOT NULL default '0',
`Country` varchar(4) default NULL,
`Region` int(10) unsigned default NULL,
`City` varchar(90) default NULL,
`PostalCode` varchar(10) default NULL,
`Latitude` float NOT NULL,
`Longitude` float NOT NULL,
`MetroCode` varchar(4) default NULL,
`AreaCode` varchar(4) default NULL,
`State` varchar(45) default NULL,
`Continent` varchar(10) default NULL,
PRIMARY KEY (`IPLOCID`)
) ENGINE=MyISAM AUTO_INCREMENT=218611 DEFAULT CHARSET=latin1;
and
CREATE TABLE `data` (
`IP` varchar(50),
`SCORE` int
);
My task is to join these three tables and find the location data for given IP address.
My query is as follows:
select
t.ip,
l.Country,
l.State,
l.City,
l.PostalCode,
l.Latitude,
l.Longitude,
t.score
from
(select
ip, inet_aton(ip) ipv, score
from
data
order by score desc
limit 5) t
join
cdIPAddressToLocation a ON t.ipv between a.IPADDR_FROM and a.IPADDR_TO
join
cdIPLocation l ON l.IPLOCID = a.IPLOCID
While this query works, it's very, very slow; it took about 100 seconds to return the result on my dev box.
I'm using mysql 5.1, the cdIPAddressToLocation has 5.9 million rows and cdIPLocation table has about 0.3 million rows.
When I checked the execution plan, I found it's not using any index on the table 'cdIPAddressToLocation', so for each row in the 'data' table it does a full table scan against 'cdIPAddressToLocation'.
This is very weird to me. I mean, since there are already two indexes on 'cdIPAddressToLocation', on the columns 'IPADDR_FROM' and 'IPADDR_TO', the execution plan should exploit them to improve performance, so why didn't it use them?
Or is there something wrong with my query?
Please help, thanks a lot.
Have you tried using a composite index on the columns cdIPAddressToLocation.IPADDR_FROM and cdIPAddressToLocation.IPADDR_TO?
Multiple-Column Indexes
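In DDL form, that suggestion would be something like this (index name is illustrative):
ALTER TABLE cdIPAddressToLocation
    ADD INDEX idx_ipaddr_from_to (IPADDR_FROM, IPADDR_TO);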

Joining multiple tables makes the query run too long

I have several tables, containing (among others) the following fields:
tweets:
--------------------------
tweet_id ticker created_at
--------------------------
1 1 1298063318
2 1 1298053197
stocks:
---------------------------------
ticker date close volume
---------------------------------
1 1313013600 12.25 40370600
1 1312927200 11.60 37281300
wiki:
-----------------------
ticker date views
-----------------------
1 1296514800 550
1 1296601200 504
I want to compile an overview of the number of tweets, close, volume, and views per day (for rows identified by ticker = 1). The tweets table is leading, meaning that if there is a date on which there are no tweets, the close, volume, and views for that day don't matter. In other words, I want the output of a query to be like:
-------------------------------------
date tweets close volume views
-------------------------------------
2011-02-13 4533 12.25 40370600 550
2011-02-14 6534 11.60 53543564 340
2011-02-16 5333 13.10 56464333 664
In this example output, there were no tweets on 2011-02-15, so there is no need for the rest of the data of that day. My query thus far goes:
SELECT
DATE_FORMAT(FROM_UNIXTIME(tweets.created_at), '%Y-%m-%d') AS date,
COUNT(tweets.tweet_id) AS tweets,
stocks.close,
stocks.volume,
wiki.views
FROM tweets
LEFT JOIN stocks ON tweets.ticker = stocks.ticker
LEFT JOIN wiki ON tweets.ticker = wiki.ticker
WHERE tweets.ticker = 1
GROUP BY date
ORDER BY date ASC
Could someone verify if this query is correct? It doesn't run into any errors but it freezes my PC. Perhaps I should set an index here or there, possibly on the "ticker" columns?
[edit]
As requested, the table definitions:
CREATE TABLE `stocks` (
`ticker` int(3) NOT NULL,
`date` int(10) NOT NULL,
`open` decimal(8,2) NOT NULL,
`high` decimal(8,2) NOT NULL,
`low` decimal(8,2) NOT NULL,
`close` decimal(8,2) NOT NULL,
`volume` int(8) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tweets` (
`tweet_id` int(11) NOT NULL AUTO_INCREMENT,
`ticker` varchar(5) NOT NULL,
`id_str` varchar(18) NOT NULL,
`created_at` int(10) NOT NULL,
`from_user` int(11) NOT NULL,
`text` text NOT NULL,
PRIMARY KEY (`tweet_id`),
KEY `id_str` (`id_str`),
KEY `from_user` (`from_user`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `wiki` (
`ticker` int(3) NOT NULL,
`date` int(11) NOT NULL,
`views` int(6) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I hope this helps.
You are right about indices: without an index on ticker, you have to do a full search in all tables, and if they are big that's going to take a lot of time.
I suggest that you turn on logging of queries that run without an index, at least every now and then, to find queries that are slow now or will be slow when the data grows.
Check slow queries with EXPLAIN SELECT ... and learn how to interpret the results (not easy, but important) to understand where to put new indices.
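As a sketch, the kind of indexes meant here (names are illustrative; stocks and wiki get ticker plus date, since the join should also match on the day):
ALTER TABLE tweets ADD INDEX idx_ticker (ticker);
ALTER TABLE stocks ADD INDEX idx_ticker_date (ticker, `date`);
ALTER TABLE wiki   ADD INDEX idx_ticker_date (ticker, `date`);
Then run EXPLAIN SELECT ... on the query and check the key column to see whether the new indexes are actually picked.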
I do believe you should check the joins between the tables. Your query does not indicate which stocks rows (or wiki rows) are to be matched to the date of the tweets. Based on the example data, the match is done against all stocks and wiki rows that have the same ticker id.
Do the stocks and wiki tables have only one row per day for a given ticker? Assuming this is the case, a more logical query would look like this:
SELECT
DATE_FORMAT(FROM_UNIXTIME(t.created_at), '%Y-%m-%d') AS date,
COUNT(t.tweet_id) AS tweets,
s.close,
s.volume,
w.views
FROM tweets t
LEFT JOIN stocks s ON t.ticker = s.ticker
and FROM_UNIXTIME(t.created_at,'%Y-%m-%d')=FROM_UNIXTIME(s.date,'%Y-%m-%d')
LEFT JOIN wiki w ON t.ticker = w.ticker
and FROM_UNIXTIME(t.created_at,'%Y-%m-%d')=FROM_UNIXTIME(w.date,'%Y-%m-%d')
WHERE t.ticker = 1
GROUP BY date, s.close, s.volume, w.views
ORDER BY date ASC
If there is more than one row in stocks/wiki for a certain day for one ticker, then you need to apply an aggregate function to those columns as well, and change COUNT(t.tweet_id) to COUNT(DISTINCT t.created_at).
I think that one of the problems is the date calculation:
DATE_FORMAT(FROM_UNIXTIME(tweets.created_at), '%Y-%m-%d') date
Try adding this field to the tweets table to avoid that CPU cost.
Edit: you can use something like this:
CREATE TABLE `stocks` (
`ticker` int(3) NOT NULL,
`date` int(10) NOT NULL,
`open` decimal(8,2) NOT NULL,
`high` decimal(8,2) NOT NULL,
`low` decimal(8,2) NOT NULL,
`close` decimal(8,2) NOT NULL,
`volume` int(8) NOT NULL,
`day_date` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tweets` (
`tweet_id` int(11) NOT NULL AUTO_INCREMENT,
`ticker` varchar(5) NOT NULL,
`id_str` varchar(18) NOT NULL,
`created_at` int(10) NOT NULL,
`from_user` int(11) NOT NULL,
`text` text NOT NULL,
`day_date` varchar(10) NOT NULL,
PRIMARY KEY (`tweet_id`),
KEY `id_str` (`id_str`),
KEY `from_user` (`from_user`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `wiki` (
`ticker` int(3) NOT NULL,
`date` int(11) NOT NULL,
`views` int(6) NOT NULL,
`day_date` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
SELECT
tweets.day_date AS date,
COUNT(tweets.tweet_id) AS tweets,
stocks.close as close,
stocks.volume as volume,
wiki.views as views
FROM tweets
LEFT JOIN stocks ON tweets.ticker = stocks.ticker
and tweets.day_date = stocks.day_date
LEFT JOIN wiki ON tweets.ticker = wiki.ticker
and tweets.day_date = wiki.day_date
WHERE tweets.ticker = 1
GROUP BY date, close, volume, views
ORDER BY date ASC