Slow SELECT from a table with millions of records (MySQL)

I have a standalone table. We insert its data through a weekly job and retrieve the data in our search module.
The table has around 4 million records (and will get bigger). When I execute a straightforward SELECT query, it takes a long time (around 15 seconds). I am using MySQL.
Here is my table structure:
CREATE TABLE `myTable` (
`myTableId` int(11) NOT NULL AUTO_INCREMENT,
`date` varchar(255) DEFAULT NULL,
`startTime` int(11) DEFAULT NULL,
`endTime` int(11) DEFAULT NULL,
`price` decimal(19,4) DEFAULT NULL,
`total` decimal(19,4) DEFAULT NULL,
`taxes` decimal(19,4) DEFAULT NULL,
`persons` int(11) NOT NULL DEFAULT '0',
`length` int(11) DEFAULT NULL,
`totalPerPerson` decimal(19,4) DEFAULT NULL,
`dayId` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`myTableId`)
);
When I run the following statement, it takes around 15 seconds to retrieve the results.
How can I optimize it to be faster?
SELECT
tt.testTableId,
(SELECT
totalPerPerson
FROM
myTable mt
WHERE
mt.venueId = tt.venueId
ORDER BY totalPerPerson ASC
LIMIT 1) AS minValue
FROM
testTable tt
WHERE
status is NULL;
Please note that testTable has only around 15 records.

This is the query:
SELECT tt.testTableId,
(SELECT mt.totalPerPerson
FROM myTable mt
WHERE mt.venueId = tt.venueId
ORDER BY mt.totalPerPerson ASC
LIMIT 1
) as minValue
FROM testTable tt
WHERE status is NULL;
For the subquery, you want an index on myTable(venueId, totalPerPerson). For the outer query, an index is unnecessary because testTable has only about 15 rows. However, if that table were larger, you would want an index on testTable(status, venueId, testTableId).

Using MIN and GROUP BY may be faster.
SELECT tt.testTableId, MIN(mt.totalPerPerson) AS minValue
FROM testTable tt
INNER JOIN myTable mt ON tt.venueId = mt.venueId
WHERE tt.status IS NULL
GROUP BY tt.testTableId;
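The two forms above are not guaranteed to return the same rows, so it may be worth checking them side by side before switching. The following is a minimal sketch using Python's sqlite3 module as an in-memory stand-in for MySQL. The venueId column (which the posted CREATE TABLE does not show), the index name idx_venue_total, and all sample rows are assumptions made for illustration.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL. The venueId column and all sample rows
# are assumptions for illustration; they are not taken from the original schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE testTable (testTableId INTEGER PRIMARY KEY, venueId INTEGER, status TEXT);
    CREATE TABLE myTable (myTableId INTEGER PRIMARY KEY, venueId INTEGER, totalPerPerson REAL);
    -- Composite index matching the subquery's filter column and sort column.
    CREATE INDEX idx_venue_total ON myTable (venueId, totalPerPerson);
    INSERT INTO testTable VALUES (1, 10, NULL), (2, 20, NULL), (3, 30, NULL), (4, 10, 'done');
    INSERT INTO myTable (venueId, totalPerPerson) VALUES (10, 50.0), (10, 35.0), (20, 80.0);
""")

# Form 1: the correlated subquery from the question.
subquery_rows = conn.execute("""
    SELECT tt.testTableId,
           (SELECT mt.totalPerPerson FROM myTable mt
             WHERE mt.venueId = tt.venueId
             ORDER BY mt.totalPerPerson ASC LIMIT 1) AS minValue
    FROM testTable tt
    WHERE tt.status IS NULL
    ORDER BY tt.testTableId
""").fetchall()

# Form 2: MIN + GROUP BY with an INNER JOIN (drops venues with no myTable rows).
inner_join_rows = conn.execute("""
    SELECT tt.testTableId, MIN(mt.totalPerPerson) AS minValue
    FROM testTable tt
    INNER JOIN myTable mt ON tt.venueId = mt.venueId
    WHERE tt.status IS NULL
    GROUP BY tt.testTableId
    ORDER BY tt.testTableId
""").fetchall()

# Form 3: the same aggregate with a LEFT JOIN keeps unmatched testTable rows.
left_join_rows = conn.execute("""
    SELECT tt.testTableId, MIN(mt.totalPerPerson) AS minValue
    FROM testTable tt
    LEFT JOIN myTable mt ON tt.venueId = mt.venueId
    WHERE tt.status IS NULL
    GROUP BY tt.testTableId
    ORDER BY tt.testTableId
""").fetchall()

print(subquery_rows)
print(inner_join_rows)
print(left_join_rows)
```

In this sample, testTableId 3 has no matching myTable rows: the subquery and the LEFT JOIN report it with a NULL minimum, while the INNER JOIN omits it. The forms can also differ when totalPerPerson contains NULLs, because MIN ignores NULLs while ORDER BY ... ASC LIMIT 1 sorts them first in MySQL.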

Related

MySQL RDS performance of aggregation functions

We have a query that performs some aggregation on one column.
The filtering of the data seems to be pretty fast, but the aggregation seems to take too much time.
This query returns ~1.5 million rows. It runs for 0.6 seconds (returning the data to the client takes ~2 minutes). We tested this with the pymysql Python library, using an unbuffered cursor so that we could distinguish between query run time and fetch time:
SELECT *
FROM t_data t1
WHERE (t1.to_date = '2019-03-20')
AND (t1.period = 30)
AND (label IN ('aa','bb') )
AND ( id IN (
SELECT id
FROM t_location_data
WHERE (to_date = '2019-03-20') AND (period = 30)
AND ( country = 'Narniya' ) ) )
But if we run this query:
SELECT MAX(val) val_max,
AVG(val) val_avg,
MIN(val) val_min
FROM t_data t1
WHERE (t1.to_date = '2019-03-20')
AND (t1.period = 30)
AND (label IN ('aa','bb') )
AND ( id IN (
SELECT id
FROM t_location_data
WHERE (to_date = '2019-03-20') AND (period = 30)
AND ( country = 'Narniya' ) ) )
We see that the query takes 40 seconds to run, and the time to fetch the results in this case is obviously less than a second.
Any help with this terrible performance of the aggregation functions over RDS Aurora? Why does calculating MAX, MIN and AVG over 1.5 million rows take so long? (By comparison, the same calculation in Python on the same numbers takes less than 1 second.)
NOTE: We added random number to each select to make sure we do not get cached values.
We use Aurora RDS:
1 instance of db.r5.large (2 vCPU + 16 GB RAM)
MySQL Engine version: 5.6.10a
Create table:
CREATE TABLE `t_data` (
`id` varchar(256) DEFAULT NULL,
`val2` int(11) DEFAULT NULL,
`val3` int(11) DEFAULT NULL,
`val` int(11) DEFAULT NULL,
`val4` int(11) DEFAULT NULL,
`tags` varchar(256) DEFAULT NULL,
`val7` int(11) DEFAULT NULL,
`label` varchar(32) DEFAULT NULL,
`val5` varchar(64) DEFAULT NULL,
`val6` int(11) DEFAULT NULL,
`period` int(11) DEFAULT NULL,
`to_date` varchar(64) DEFAULT NULL,
`data_line_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`data_line_id`),
UNIQUE KEY `id_data` (`to_date`,`period`,`id`),
KEY `index1` (`to_date`,`period`,`id`),
KEY `index3` (`to_date`,`period`,`label`)
) ENGINE=InnoDB AUTO_INCREMENT=218620560 DEFAULT CHARSET=latin1
CREATE TABLE `t_location_data` (
`id` varchar(256) DEFAULT NULL,
`country` varchar(256) DEFAULT NULL,
`state` varchar(256) DEFAULT NULL,
`city` varchar(256) DEFAULT NULL,
`latitude` float DEFAULT NULL,
`longitude` float DEFAULT NULL,
`val8` int(11) DEFAULT NULL,
`val9` tinyint(1) DEFAULT NULL,
`period` int(11) DEFAULT NULL,
`to_date` varchar(64) DEFAULT NULL,
`location_line_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`location_line_id`),
UNIQUE KEY `id_location_data` (`to_date`,`period`,`id`,`latitude`,`longitude`),
KEY `index1` (`to_date`,`period`,`id`,`country`),
KEY `index2` (`country`,`state`,`city`),
KEY `index3` (`to_date`,`period`,`country`,`state`)
) ENGINE=InnoDB AUTO_INCREMENT=315944737 DEFAULT CHARSET=latin1
Parameters:
##innodb_buffer_pool_size/1024/1024/1024: 7.7900
##innodb_buffer_pool_instances: 8
UPDATE:
Adding the val index (as suggested by #rick-james) improved the query dramatically (it took ~2 seconds), but only if I delete the AND ( id IN (SELECT id FROM t_location_data.. condition. If I leave it in, the query runs for ~25 seconds: better than before, but still not good.
Indexes needed:
t_data: INDEX(period, to_date, label, val)
t_data: INDEX(period, label, to_date, val)
t_location_data: INDEX(period, country, to_date, id)
Also, change from the slow IN ( SELECT ... ) to a JOIN:
FROM t_data AS d
JOIN t_location_data AS ld USING(id)
WHERE ...
Better yet, since the tables are 1:1 (is that correct?), combine the tables so as to eliminate the JOIN. If id is not the PRIMARY KEY in each table, you really need to provide SHOW CREATE TABLE and should change the name(s).
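Whether the JOIN is a drop-in replacement for IN ( SELECT ... ) depends on the 1:1 question raised above. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with made-up sample rows and only the columns needed for the demonstration, to show how the aggregate changes when one id has two location rows. The unique key on (to_date, period, id, latitude, longitude) in the posted schema would permit that.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_data (id TEXT, val INTEGER, label TEXT, period INTEGER, to_date TEXT);
    CREATE TABLE t_location_data (id TEXT, country TEXT, latitude REAL, longitude REAL,
                                  period INTEGER, to_date TEXT);
    INSERT INTO t_data VALUES ('a', 10, 'aa', 30, '2019-03-20'),
                              ('b', 30, 'bb', 30, '2019-03-20');
    -- id 'a' appears twice with different coordinates.
    INSERT INTO t_location_data VALUES ('a', 'Narniya', 1.0, 1.0, 30, '2019-03-20'),
                                       ('a', 'Narniya', 2.0, 2.0, 30, '2019-03-20'),
                                       ('b', 'Narniya', 3.0, 3.0, 30, '2019-03-20');
""")

IN_SQL = """
    SELECT AVG(val) FROM t_data t1
    WHERE t1.to_date = '2019-03-20' AND t1.period = 30 AND label IN ('aa', 'bb')
      AND id IN (SELECT id FROM t_location_data
                 WHERE to_date = '2019-03-20' AND period = 30 AND country = 'Narniya')
"""

JOIN_SQL = """
    SELECT AVG(d.val) FROM t_data AS d
    JOIN t_location_data AS ld
      ON ld.id = d.id AND ld.to_date = d.to_date AND ld.period = d.period
    WHERE d.to_date = '2019-03-20' AND d.period = 30 AND d.label IN ('aa', 'bb')
      AND ld.country = 'Narniya'
"""

in_avg = conn.execute(IN_SQL).fetchone()[0]          # each t_data row counted once
join_avg_dup = conn.execute(JOIN_SQL).fetchone()[0]  # row 'a' counted twice

# Remove the second location row for 'a' so the tables are 1:1 on the join key.
conn.execute("DELETE FROM t_location_data WHERE id = 'a' AND latitude = 2.0")
join_avg_unique = conn.execute(JOIN_SQL).fetchone()[0]

print(in_avg, join_avg_dup, join_avg_unique)
```

With the duplicate location row present, the JOIN averages 10, 10 and 30 instead of 10 and 30. Once the tables are 1:1 on (to_date, period, id), both forms agree.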

MySQL Order by subquery column

I have a problem with an SQL query. The idea is to select all loans that are past their payment date (status 1/2/3) by between 8 and 21 days, with a value calculated from payment_day until now.
I have already written a query, but I can't use the columns days_after_payment and days_after_part_payment in the WHERE clause. I would like to have a single column, like days_after_payment, based on the loan type.
SELECT l.*,
(SELECT SUM(`value`) FROM `loan_part` WHERE `loan_id` = l.id AND `paid`=0) AS left_to_pay,
-(DATEDIFF((SELECT date FROM `loan_part` WHERE `loan_id` = l.id AND `paid`=0 AND `date`<CURDATE() ORDER BY `date` LIMIT 1), NOW())) AS days_after_part_payment,
-(DATEDIFF(l.payment_date, NOW())) AS days_after_payment
FROM loan l
WHERE (l.type=1 or l.type=2) AND (l.status=1 OR l.status=2 OR l.status=3)
GROUP BY l.client_id
ORDER BY
CASE l.type
WHEN 1 THEN days_after_payment
WHEN 2 THEN days_after_part_payment
ELSE 1 END
ASC
CREATE TABLE IF NOT EXISTS `loan` (
`id` int(11) NOT NULL,
`value` int(11) NOT NULL,
`client_id` int(11) NOT NULL,
`status` int(11) NOT NULL,
`type` int(11) NOT NULL,
`payment_date` date DEFAULT NULL
) ENGINE=MyISAM AUTO_INCREMENT=2068 DEFAULT CHARSET=utf8;
CREATE TABLE IF NOT EXISTS `loan_part` (
`id` int(10) unsigned NOT NULL,
`loan_id` int(11) NOT NULL,
`value` float NOT NULL,
`date` date DEFAULT NULL,
`paid` tinyint(1) NOT NULL DEFAULT '0'
) ENGINE=MyISAM AUTO_INCREMENT=1751 DEFAULT CHARSET=utf8;
Update 1: I had to cut unnecessary columns and translate the names into English from my native language.
ORDER BY 7
"7" means the 7th column in the SELECT list. Positional references work in GROUP BY as well. I had to look at the table definition to count how many columns l.* expands to.
How come id is not declared AUTO_INCREMENT?
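As a rough illustration of positional ordering, here is a sketch using Python's sqlite3 module as a stand-in for MySQL, with made-up rows. The loan table as posted has six columns, so in a SELECT list that starts with l.* the first computed column (left_to_pay) is position 7. The two date-difference columns from the question are left out here, since DATEDIFF is MySQL-specific.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loan (id INTEGER, value INTEGER, client_id INTEGER,
                       status INTEGER, type INTEGER, payment_date TEXT);
    CREATE TABLE loan_part (id INTEGER, loan_id INTEGER, value REAL, date TEXT, paid INTEGER);
    INSERT INTO loan VALUES (1, 900, 11, 1, 1, '2015-01-10'),
                            (2, 500, 12, 2, 2, '2015-01-05'),
                            (3, 700, 13, 3, 1, '2015-01-20');
    INSERT INTO loan_part VALUES (1, 1, 300, '2015-01-10', 0),
                                 (2, 2, 100, '2015-01-05', 0),
                                 (3, 3, 150, '2015-01-20', 0),
                                 (4, 3, 50,  '2015-02-20', 0),
                                 (5, 1, 600, '2014-12-10', 1);
""")

BASE_SQL = """
    SELECT l.*,
           (SELECT SUM(lp.value) FROM loan_part lp
             WHERE lp.loan_id = l.id AND lp.paid = 0) AS left_to_pay
    FROM loan l
"""

# l.* contributes columns 1-6, so the computed column is number 7.
rows_by_position = conn.execute(BASE_SQL + " ORDER BY 7").fetchall()
# Ordering by the alias gives the same result and survives column changes.
rows_by_alias = conn.execute(BASE_SQL + " ORDER BY left_to_pay").fetchall()

print([row[0] for row in rows_by_position])
```

A positional reference silently points at a different column if the SELECT list or the table definition changes, so ordering by the alias may be the safer choice where the database accepts it.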

select count, group by and having optimization

I have this query
SELECT
t2.counter_id,
t2.hash_counter,
count(1) AS cnt
FROM
table1 t1
RIGHT JOIN
table2 t2 USING(counter_id)
WHERE
t2.hash_id = 973
GROUP BY
t1.counter_id
HAVING
cnt < 8000
Here are the tables.
CREATE TABLE `table1` (
`id` varchar(255) NOT NULL,
`platform` varchar(32) DEFAULT NULL,
`version` varchar(10) DEFAULT NULL,
`edition` varchar(2) NOT NULL DEFAULT 'us',
`counter_id` int(11) NOT NULL,
`created_on` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `counter_id` (`counter_id`)
) ENGINE=InnoDB
CREATE TABLE `table2` (
`counter_id` int(11) NOT NULL AUTO_INCREMENT,
`hash_id` int(11) DEFAULT NULL,
`hash_counter` int(11) DEFAULT NULL,
PRIMARY KEY (`counter_id`),
UNIQUE KEY `counter_key` (`hash_id`,`hash_counter`)
) ENGINE=InnoDB
The EXPLAIN output shows "Using index; Using temporary; Using filesort" for table t2. Is there any way to get rid of the temporary table and filesort? Or are there any other ideas for optimizing this query?
Your comment above gives more insight into what you want. It is always better to explain more about what you are trying to achieve - just looking at the non-working SQL leads people down the wrong path.
So, you want to know which table2 rows have < 8000 table1 rows?
Why not this:
select *
from table2 as t2
where hash_id = 973
and (select count(*) from table1 as t1 where t1.counter_id = t2.counter_id) < 8000
;
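The direction of the comparison is easy to get backwards here, so a small check may help. This sketch uses Python's sqlite3 module as a stand-in for MySQL, assumes the goal is "table2 rows with fewer than N matching table1 rows", and uses a threshold of 3 in place of 8000 so the made-up sample data can stay tiny. The index name is hypothetical.

```python
import sqlite3

THRESHOLD = 3  # stand-in for 8000 so the sample data can stay small

# SQLite in-memory stand-in for MySQL; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id TEXT PRIMARY KEY, counter_id INTEGER NOT NULL);
    CREATE INDEX idx_table1_counter_id ON table1 (counter_id);
    CREATE TABLE table2 (counter_id INTEGER PRIMARY KEY, hash_id INTEGER,
                         hash_counter INTEGER, UNIQUE (hash_id, hash_counter));
    INSERT INTO table2 VALUES (1, 973, 1), (2, 973, 2), (3, 973, 3), (4, 5, 1);
    -- counter 1 has five rows, counter 2 has two, counter 3 has none.
    INSERT INTO table1 VALUES ('a', 1), ('b', 1), ('c', 1), ('d', 1), ('e', 1),
                              ('f', 2), ('g', 2), ('h', 4);
""")

# Count below the threshold: what "cnt < 8000" in the question asks for.
below = conn.execute("""
    SELECT t2.counter_id, t2.hash_counter
    FROM table2 AS t2
    WHERE t2.hash_id = 973
      AND (SELECT COUNT(*) FROM table1 AS t1
            WHERE t1.counter_id = t2.counter_id) < ?
    ORDER BY t2.counter_id
""", (THRESHOLD,)).fetchall()

# Threshold on the left of "<" selects the opposite set (count above threshold).
above = conn.execute("""
    SELECT t2.counter_id, t2.hash_counter
    FROM table2 AS t2
    WHERE t2.hash_id = 973
      AND ? < (SELECT COUNT(*) FROM table1 AS t1
                WHERE t1.counter_id = t2.counter_id)
    ORDER BY t2.counter_id
""", (THRESHOLD,)).fetchall()

print(below, above)
```

Counter 3, which has no table1 rows at all, is included in the "below" result with an effective count of 0. The correlated COUNT(*) also needs no GROUP BY, which is what removes the temporary table from the plan in the first place.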

Improving the MySQL Query

I have the following query, which filters the rows with replyAutoId=0 and then fetches the most recent record for each propertyId. The query takes 0.23225 sec to fetch just 5,435 of 21,369 rows, and I want to improve this. Is there a better way of writing this query? Any suggestions?
SELECT pc1.* FROM (SELECT * FROM propertyComment WHERE replyAutoId=0) as pc1
LEFT JOIN propertyComment as pc2
ON pc1.propertyId= pc2.propertyId AND pc1.updatedDate < pc2.updatedDate
WHERE pc2.propertyId IS NULL
The SHOW CREATE TABLE propertyComment Output:
CREATE TABLE `propertyComment` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`propertyId` int(11) NOT NULL,
`agentId` int(11) NOT NULL,
`comment` longtext COLLATE utf8_unicode_ci NOT NULL,
`replyAutoId` int(11) NOT NULL,
`updatedDate` datetime NOT NULL,
`contactDate` date NOT NULL,
`status` enum('Y','N') COLLATE utf8_unicode_ci NOT NULL DEFAULT 'N',
`clientStatusId` int(11) NOT NULL,
`adminsId` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `propertyId` (`propertyId`),
KEY `agentId` (`agentId`),
KEY `status` (`status`),
KEY `adminsId` (`adminsId`),
KEY `replyAutoId` (`replyAutoId`)
) ENGINE=MyISAM AUTO_INCREMENT=21404 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
Try to get rid of the nested query.
The following query should give the same result as your original query:
SELECT pc1.*
FROM propertyComment AS pc1
LEFT JOIN propertyComment AS pc2
ON pc1.propertyId = pc2.propertyId AND pc1.updatedDate < pc2.updatedDate
WHERE pc1.replyAutoId = 0 AND pc2.propertyId IS NULL
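The claim that the flattened query matches the nested one can be checked on a few rows. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL, with made-up rows and only the columns the query touches. The composite index idx_property_updated is a hypothetical addition that covers the self-join condition; it is not part of the posted schema.

```python
import sqlite3

# SQLite in-memory stand-in for MySQL; all rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE propertyComment (id INTEGER PRIMARY KEY, propertyId INTEGER NOT NULL,
                                  replyAutoId INTEGER NOT NULL, updatedDate TEXT NOT NULL);
    -- Hypothetical composite index covering both columns of the join condition.
    CREATE INDEX idx_property_updated ON propertyComment (propertyId, updatedDate);
    INSERT INTO propertyComment VALUES
        (1, 100, 0, '2020-01-01 10:00:00'),
        (2, 100, 0, '2020-02-01 10:00:00'),
        (3, 200, 0, '2020-01-15 10:00:00'),
        (4, 200, 5, '2020-03-01 10:00:00'),
        (5, 300, 0, '2020-01-10 10:00:00');
""")

# Original form: filter in a derived table, then anti-join on "a later row exists".
NESTED_SQL = """
    SELECT pc1.id
    FROM (SELECT * FROM propertyComment WHERE replyAutoId = 0) AS pc1
    LEFT JOIN propertyComment AS pc2
      ON pc1.propertyId = pc2.propertyId AND pc1.updatedDate < pc2.updatedDate
    WHERE pc2.propertyId IS NULL
    ORDER BY pc1.id
"""

# Flattened form: the filter on pc1 moves into the outer WHERE clause.
FLAT_SQL = """
    SELECT pc1.id
    FROM propertyComment AS pc1
    LEFT JOIN propertyComment AS pc2
      ON pc1.propertyId = pc2.propertyId AND pc1.updatedDate < pc2.updatedDate
    WHERE pc1.replyAutoId = 0 AND pc2.propertyId IS NULL
    ORDER BY pc1.id
"""

nested_ids = [row[0] for row in conn.execute(NESTED_SQL)]
flat_ids = [row[0] for row in conn.execute(FLAT_SQL)]
print(nested_ids, flat_ids)
```

Both forms return the same ids on this data. Note that property 200 is missing from both results: its most recent comment has a non-zero replyAutoId, and pc2 is not filtered, so the replyAutoId=0 row for that property always has a later row.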
SELECT pc1.* FROM (SELECT * FROM propertyComment WHERE replyAutoId=0) as pc1
LEFT JOIN (SELECT propertyID, updatedDate from propertyComment order by 1,2) as pc2
ON pc1.propertyId= pc2.propertyId AND pc1.updatedDate < pc2.updatedDate
WHERE pc2.propertyId IS NULL
You also don't have any indexes?
If you do have one on the primary key, you're not joining on it, so why include it?
Why not select only the columns you're interested in from table B (pc2)? This limits the number of columns you're pulling from table B. Since you're pulling everything from table A (pc1) where replyAutoId = 0, it wouldn't make much sense to limit the columns there. This should speed it up a little.

Joining multiple tables makes the query run too long

I have several tables, containing (among others) the following fields:
tweets:
--------------------------
tweet_id ticker created_at
--------------------------
1 1 1298063318
2 1 1298053197
stocks:
---------------------------------
ticker date close volume
---------------------------------
1 1313013600 12.25 40370600
1 1312927200 11.60 37281300
wiki:
-----------------------
ticker date views
-----------------------
1 1296514800 550
1 1296601200 504
I want to compile an overview of the number of tweets, close, volume and views per day (for rows identified by ticker = 1). The tweets table is leading, meaning that if there is a date on which there are no tweets, the close, volume and views for that day don't matter. In other words, I want the output of a query to be like:
-------------------------------------
date tweets close volume views
-------------------------------------
2011-02-13 4533 12.25 40370600 550
2011-02-14 6534 11.60 53543564 340
2011-02-16 5333 13.10 56464333 664
In this example output, there were no tweets on 2011-02-15, so there is no need for the rest of the data of that day. My query thus far goes:
SELECT
DATE_FORMAT(FROM_UNIXTIME(tweets.created_at), '%Y-%m-%d') AS date,
COUNT(tweets.tweet_id) AS tweets,
stocks.close,
stocks.volume,
wiki.views
FROM tweets
LEFT JOIN stocks ON tweets.ticker = stocks.ticker
LEFT JOIN wiki ON tweets.ticker = wiki.ticker
WHERE tweets.ticker = 1
GROUP BY date
ORDER BY date ASC
Could someone verify if this query is correct? It doesn't run into any errors but it freezes my PC. Perhaps I should set an index here or there, possibly on the "ticker" columns?
[edit]
As requested, the table definitions:
CREATE TABLE `stocks` (
`ticker` int(3) NOT NULL,
`date` int(10) NOT NULL,
`open` decimal(8,2) NOT NULL,
`high` decimal(8,2) NOT NULL,
`low` decimal(8,2) NOT NULL,
`close` decimal(8,2) NOT NULL,
`volume` int(8) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tweets` (
`tweet_id` int(11) NOT NULL AUTO_INCREMENT,
`ticker` varchar(5) NOT NULL,
`id_str` varchar(18) NOT NULL,
`created_at` int(10) NOT NULL,
`from_user` int(11) NOT NULL,
`text` text NOT NULL,
PRIMARY KEY (`tweet_id`),
KEY `id_str` (`id_str`),
KEY `from_user` (`from_user`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `wiki` (
`ticker` int(3) NOT NULL,
`date` int(11) NOT NULL,
`views` int(6) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
I hope this helps.
You are right about indices: without an index on ticker, the WHERE search has to scan every table, and if the tables are big that is going to take a lot of time.
I suggest that you turn on logging of queries that run without an index, at least every now and then, to find queries that are already slow or will become slow as the data grows.
Check slow queries with EXPLAIN SELECT ..., and learn how to interpret the results (not easy, but important) to understand where to put new indices.
I believe you should check the joins between the tables. Your query does not indicate which stocks row (or wiki row) is to be matched to the date of the tweets. Based on the example data, the match is made against all stocks and wiki rows that have the same ticker.
Do the stocks and wiki tables have only one row per day for each ticker? Assuming this is the case, a more logical query would look like this:
SELECT
DATE_FORMAT(FROM_UNIXTIME(t.created_at), '%Y-%m-%d') AS date,
COUNT(t.tweet_id) AS tweets,
s.close,
s.volume,
w.views
FROM tweets t
LEFT JOIN stocks s ON t.ticker = s.ticker
and FROM_UNIXTIME(t.created_at,'%Y-%m-%d')=FROM_UNIXTIME(s.date,'%Y-%m-%d')
LEFT JOIN wiki w ON t.ticker = w.ticker
and FROM_UNIXTIME(t.created_at,'%Y-%m-%d')=FROM_UNIXTIME(w.date,'%Y-%m-%d')
WHERE t.ticker = 1
GROUP BY date, s.close, s.volume, w.views
ORDER BY date ASC
If there is more than one row in stocks/wiki for a given day and ticker, then you need to apply an aggregate function to those columns as well and change COUNT(t.tweet_id) to COUNT(distinct t.created_at).
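To see why the ticker-only join is both slow and wrong, it can help to count rows on a tiny data set. The sketch below uses Python's sqlite3 module as a stand-in for MySQL, with SQLite's date(x, 'unixepoch') (UTC) standing in for FROM_UNIXTIME(x, '%Y-%m-%d'). The timestamps and the ts helper are made up for illustration, and ticker is stored as an integer in all three tables.

```python
import sqlite3
from datetime import datetime, timezone


def ts(day, hour):
    """Unix timestamp for February <day> 2011 at <hour>:00 UTC (illustrative data)."""
    return int(datetime(2011, 2, day, hour, tzinfo=timezone.utc).timestamp())


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, ticker INTEGER, created_at INTEGER);
    CREATE TABLE stocks (ticker INTEGER, date INTEGER, close REAL, volume INTEGER);
    CREATE TABLE wiki (ticker INTEGER, date INTEGER, views INTEGER);
""")
# Three tweets on Feb 13, one on Feb 14; one stocks row and one wiki row per day.
conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)",
                 [(1, 1, ts(13, 9)), (2, 1, ts(13, 12)),
                  (3, 1, ts(13, 18)), (4, 1, ts(14, 10))])
conn.executemany("INSERT INTO stocks VALUES (?, ?, ?, ?)",
                 [(1, ts(13, 0), 12.25, 40370600), (1, ts(14, 0), 11.60, 37281300)])
conn.executemany("INSERT INTO wiki VALUES (?, ?, ?)",
                 [(1, ts(13, 0), 550), (1, ts(14, 0), 504)])

# Joining on ticker alone pairs every tweet with every stocks and wiki row.
ticker_only = conn.execute("""
    SELECT date(t.created_at, 'unixepoch') AS day, COUNT(t.tweet_id) AS tweets
    FROM tweets t
    LEFT JOIN stocks s ON t.ticker = s.ticker
    LEFT JOIN wiki w ON t.ticker = w.ticker
    WHERE t.ticker = 1
    GROUP BY day
    ORDER BY day
""").fetchall()

# Matching on the calendar day as well keeps one joined row per tweet.
date_matched = conn.execute("""
    SELECT date(t.created_at, 'unixepoch') AS day, COUNT(t.tweet_id) AS tweets,
           s.close, s.volume, w.views
    FROM tweets t
    LEFT JOIN stocks s ON t.ticker = s.ticker
         AND date(t.created_at, 'unixepoch') = date(s.date, 'unixepoch')
    LEFT JOIN wiki w ON t.ticker = w.ticker
         AND date(t.created_at, 'unixepoch') = date(w.date, 'unixepoch')
    WHERE t.ticker = 1
    GROUP BY day, s.close, s.volume, w.views
    ORDER BY day
""").fetchall()

print(ticker_only)
print(date_matched)
```

With only two stocks rows and two wiki rows, the ticker-only join already inflates each tweet count by a factor of four. On real tables the intermediate result grows as tweets x stocks x wiki per ticker, which is consistent with the query appearing to freeze.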
I think that one of the problems is the date calculation:
DATE_FORMAT(FROM_UNIXTIME(tweets.created_at), '%Y-%m-%d') date
Try adding this value to the tweets table as a stored column, to avoid the CPU cost of computing it for every row.
edit:
You can use something like this:
CREATE TABLE `stocks` (
`ticker` int(3) NOT NULL,
`date` int(10) NOT NULL,
`open` decimal(8,2) NOT NULL,
`high` decimal(8,2) NOT NULL,
`low` decimal(8,2) NOT NULL,
`close` decimal(8,2) NOT NULL,
`volume` int(8) NOT NULL,
`day_date` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `tweets` (
`tweet_id` int(11) NOT NULL AUTO_INCREMENT,
`ticker` varchar(5) NOT NULL,
`id_str` varchar(18) NOT NULL,
`created_at` int(10) NOT NULL,
`from_user` int(11) NOT NULL,
`text` text NOT NULL,
`day_date` varchar(10) NOT NULL,
PRIMARY KEY (`tweet_id`),
KEY `id_str` (`id_str`),
KEY `from_user` (`from_user`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE `wiki` (
`ticker` int(3) NOT NULL,
`date` int(11) NOT NULL,
`views` int(6) NOT NULL,
`day_date` varchar(10) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
SELECT
tweets.day_date AS date,
COUNT(tweets.tweet_id) AS tweets,
stocks.close as close,
stocks.volume as volume,
wiki.views as views
FROM tweets
LEFT JOIN stocks ON tweets.ticker = stocks.ticker
and tweets.day_date = stocks.day_date
LEFT JOIN wiki ON tweets.ticker = wiki.ticker
and tweets.day_date = wiki.day_date
WHERE tweets.ticker = 1
GROUP BY date, close, volume, views
ORDER BY date ASC