How to optimize GROUP BY on calculated field (to use index)? - mysql

I have a large (nearly 10M records) data table which, for performance reasons, has a secondary aggregation companion table. The aggregation table is regularly populated with the not-yet-aggregated data:
REPLACE INTO aggregate (channel_id, type, timestamp, value, count)
SELECT channel_id, 'day' AS type, MAX(timestamp) AS timestamp, SUM(value) AS value, COUNT(timestamp) AS count FROM data
WHERE timestamp < UNIX_TIMESTAMP(DATE_FORMAT(NOW(), "%Y-%m-%d")) * 1000
AND timestamp >= IFNULL((SELECT UNIX_TIMESTAMP(DATE_ADD(FROM_UNIXTIME(MAX(timestamp)/1000, "%Y-%m-%d"),
INTERVAL 1 day)) * 1000 FROM aggregate WHERE type = 'day'), 0)
GROUP BY channel_id, YEAR(FROM_UNIXTIME(timestamp/1000)), DAYOFYEAR(FROM_UNIXTIME(timestamp/1000));
I've found that the SELECT part of the statement is pretty slow (2+ seconds on a fast PC) even when no data is returned. As the aggregation needs to run on embedded devices, this is a concern. Here is the plan:
id  select_type  table      type   key      key_len  rows     Extra
1   PRIMARY      data       ALL    NULL     NULL     9184560  Using where; Using temporary; Using filesort
2   SUBQUERY     aggregate  index  ts_uniq  22       1940     Using where; Using index
The sub-query itself is instant. Apparently data doesn't use the channel_id/timestamp index due to the calculation in the GROUP BY clause:
CREATE TABLE `data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`channel_id` int(11) DEFAULT NULL,
`timestamp` bigint(20) NOT NULL,
`value` double NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ts_uniq` (`channel_id`,`timestamp`),
KEY `IDX_ADF3F36372F5A1AA` (`channel_id`)
) ENGINE=MyISAM AUTO_INCREMENT=10432870 DEFAULT CHARSET=latin1;
Can the query be further optimized?
Update: adding requested information
SHOW INDEXES FROM data;
Table  Non_unique  Key_name    Seq_in_index  Column_name  Collation  Cardinality  Null  Index_type
data   0           PRIMARY     1             id           A          9184560            BTREE
data   0           ts_uniq     1             channel_id   A          164          YES   BTREE
data   0           ts_uniq     2             timestamp    A          9184560            BTREE
data   1           IDX_ADF3..  1             channel_id   A          164          YES   BTREE
CREATE TABLE `aggregate` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`channel_id` int(11) NOT NULL,
`type` varchar(8) NOT NULL,
`timestamp` bigint(20) NOT NULL,
`value` double NOT NULL,
`count` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ts_uniq` (`channel_id`,`type`,`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=1941 DEFAULT CHARSET=latin1;
I've also noticed that the query becomes instant when changing the GROUP BY to channel_id, timestamp. Unfortunately, adding the date calculations as columns is not desirable, as the grouping is calculated dynamically.
I'm failing to understand why the GROUP BY should be such a problem when there isn't even any data to be grouped. I've tried running
SELECT channel_id, 'day' AS type, MAX(timestamp) AS timestamp, SUM(value) AS value, COUNT(timestamp) AS count FROM data
WHERE timestamp < UNIX_TIMESTAMP(DATE_FORMAT(NOW(), "%Y-%m-%d")) * 1000
AND timestamp >= IFNULL((SELECT UNIX_TIMESTAMP(DATE_ADD(FROM_UNIXTIME(MAX(timestamp)/1000, "%Y-%m-%d"), INTERVAL 1 day)) * 1000
FROM aggregate WHERE type = 'day'), 0)
which is just as slow, so the GROUP BY doesn't seem to be the problem?
Update 2
Digging further down that road shows that
SELECT channel_id, 'day' AS type, timestamp, value, 1 FROM data
WHERE timestamp >= (SELECT UNIX_TIMESTAMP(DATE_ADD(FROM_UNIXTIME(MAX(timestamp)/1000, "%Y-%m-%d"),
INTERVAL 1 day)) * 1000 FROM aggregate WHERE type = 'day');
is still slow (1.4 sec), so it's not a GROUP BY problem at all.
Update 3
And this is still slow:
SELECT channel_id, 'day' AS type, timestamp, value, 1 FROM data WHERE timestamp >= 1380837600000;
So the problem is that the inner comparison is on timestamp alone, which cannot make use of the (channel_id, timestamp) index even though those columns are part of the GROUP BY clause.
Which leads to the question of how to force that index.

Add a year and dayofyear column to the data table, and have an index on (channel_id, year, dayofyear). Populate the two new columns when you insert a row.
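A minimal sketch of that change (the column types and index name here are illustrative, and the backfill assumes timestamp is stored in milliseconds, as above):
ALTER TABLE `data`
ADD COLUMN `year` SMALLINT NOT NULL DEFAULT 0,
ADD COLUMN `dayofyear` SMALLINT NOT NULL DEFAULT 0,
ADD INDEX `idx_channel_year_doy` (`channel_id`, `year`, `dayofyear`);
-- backfill existing rows once (timestamp is in milliseconds)
UPDATE `data`
SET `year` = YEAR(FROM_UNIXTIME(`timestamp` / 1000)),
`dayofyear` = DAYOFYEAR(FROM_UNIXTIME(`timestamp` / 1000));
The aggregation query can then GROUP BY channel_id, year, dayofyear and use the new index; new rows must set the two columns at insert time.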

How can I optimize this query with an index?

I have this query:
select `price`, `asset_id`
from `history_average_pairs`
where `currency_id` = 1
and date(`created_at`) >= DATE_SUB(NOW(), INTERVAL 7 DAY)
group by hour(created_at), date(created_at), asset_id
order by `created_at` asc
And this table:
CREATE TABLE IF NOT EXISTS history_average_pairs (
id bigint(20) unsigned NOT NULL,
asset_id bigint(20) unsigned NOT NULL,
currency_id bigint(20) unsigned NOT NULL,
market_cap bigint(20) NOT NULL,
price double(20,6) NOT NULL,
volume bigint(20) NOT NULL,
circulating bigint(20) NOT NULL,
change_1h double(8,2) NOT NULL,
change_24h double(8,2) NOT NULL,
change_7d double(8,2) NOT NULL,
created_at timestamp NOT NULL DEFAULT current_timestamp(),
updated_at timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
total_supply bigint(20) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
ALTER TABLE history_average_pairs
ADD PRIMARY KEY (id),
ADD KEY history_average_pairs_currency_id_asset_id_foreign (currency_id,asset_id);
ALTER TABLE history_average_pairs
MODIFY id bigint(20) unsigned NOT NULL AUTO_INCREMENT;
It contains more than 10,000,000 rows, and the query takes:
Showing rows 0 - 24 (32584 total, Query took 27.8344 seconds.)
But without currency_id = 1, it takes like 4 sec.
UPDATE 1
Okay, I updated the key from (currency_id, asset_id) to (currency_id, asset_id, created_at) and now it takes:
Showing rows 0 - 24 (32784 total, Query took 6.4831 seconds.)
It's much faster. Any suggestions to make it even faster?
The GROUP BY here is meant to take only the first row for every hour.
For example:
19:01:10
19:02:14
19:23:15
I need only 19:01:10
You can rephrase the filtering predicate to avoid using expressions on columns. For example:
select max(`price`) as max_price, `asset_id`
from `history_average_pairs`
where `currency_id` = 1
and created_at >= date_add(curdate(), interval - 7 day)
group by hour(created_at), date(created_at), asset_id
order by `created_at` asc
Then, this query could be much faster if you added the index:
create index ix1 on `history_average_pairs` (`currency_id`, created_at);
You must make the test "sargable"; change
date(`created_at`) >= DATE_SUB(NOW(), INTERVAL 7 DAY)
to
created_at >= CURDATE() - INTERVAL 7 DAY
Then the optimal index is
INDEX(currency_id, -- 1st because of "=" test
created_at, -- 2nd to finish out WHERE
asset_id) -- only for "covering"
When designing an index, it is usually best to handle the WHERE first.
The GROUP BY cannot use the index. Did you really want the hour first?
"I need only 19:01:10" is unclear, so I have not factored that in. Where's the date? Where's the asset_id? See "only_full_group_by". Do you need "groupwise max"?
Making the ORDER BY have the same columns as the GROUP BY avoids a sort. (In your query, the order may be slightly different, but it probably does not matter.)
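A sketch of that alignment, keeping the question's grouping expressions:
SELECT MAX(price) AS max_price, asset_id
FROM history_average_pairs
WHERE currency_id = 1
AND created_at >= CURDATE() - INTERVAL 7 DAY
GROUP BY hour(created_at), date(created_at), asset_id
ORDER BY hour(created_at), date(created_at), asset_id; -- same expressions as GROUP BY, no extra sort pass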
Datatype issues...
BIGINT takes 8 bytes; INT takes only 4 bytes and is usually big enough. Shrinking the table provides some speed.
double(8,2) takes 8 bytes -- Don't use (m,n) on FLOAT or DOUBLE; it adds an extra rounding. Perhaps you meant DECIMAL(8,2), which takes 4 bytes.
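A hypothetical ALTER along those lines (verify the real value ranges first; market caps and volumes may well exceed INT):
ALTER TABLE history_average_pairs
MODIFY market_cap int NOT NULL, -- was bigint(20): 8 bytes -> 4 bytes
MODIFY volume int NOT NULL,
MODIFY circulating int NOT NULL,
MODIFY change_1h decimal(8,2) NOT NULL, -- exact, 4 bytes, no extra rounding
MODIFY change_24h decimal(8,2) NOT NULL,
MODIFY change_7d decimal(8,2) NOT NULL;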

Why does this SQL query visit all rows and run very slowly?

The table 'reading' contains readings taken every 40s, for today. The query returns averages for 180s periods. 'time_stamp' is indexed. The query below returns a reasonable number of rows (a few hundred) but visits ALL rows and gets slower the bigger the table gets. The WHERE clause does not seem to restrict it to today's rows only.
EXPLAIN SELECT
DATE_FORMAT(time_stamp, '%Y-%m-%dT%T+00:00') ,
AVG(temp_c)
FROM reading
WHERE DATE(time_stamp) = CURDATE()
GROUP BY round(UNIX_TIMESTAMP(time_stamp) / 180)
Table schema:
CREATE TABLE reading (
id bigint(20) NOT NULL AUTO_INCREMENT,
time_stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
temp_c float NOT NULL,
pressure_hpa float NOT NULL,
wind_speed_kt int(11) NOT NULL,
wind_dir_degree int(11) NOT NULL,
rain_mm float NOT NULL,
rain_day_mm float NOT NULL,
wind_gust_kt int(11) NOT NULL,
humidity float DEFAULT NULL,
PRIMARY KEY (id),
KEY time_stamp (time_stamp),
KEY time_stamp_idx (time_stamp)
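-- note: time_stamp_idx duplicates the time_stamp key above; one of the two is redundant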
) ENGINE=InnoDB AUTO_INCREMENT=1747097 DEFAULT CHARSET=latin1;
When the query in the question is executed, the MySQL optimizer does not choose an index scan (possibly because of the cost estimate); instead a full table scan is initiated, and the issue appears to be the WHERE DATE(time_stamp) = CURDATE() predicate.
Having changed your WHERE clause to time_stamp >= CURDATE(), I saw the index being used and far fewer rows fetched, avoiding the full scan.
Hence, your final query will be:
EXPLAIN SELECT
DATE_FORMAT(time_stamp, '%Y-%m-%dT%T+00:00') ,
AVG(temp_c)
FROM reading
WHERE time_stamp >= CURDATE()
GROUP BY round(UNIX_TIMESTAMP(time_stamp) / 180);
I suspect date(time_stamp) isn't that efficient with an index. A similar topic was discussed here (see ypercube's answer).
The above query could be improved further by choosing an alternative to round(UNIX_TIMESTAMP(time_stamp) / 180), as UNIX_TIMESTAMP(time_stamp) doesn't use the index, but I'm not pursuing that further.
Hope this helps!

MySQL subquery count with calendar table slow

I have a sales table in MySQL (InnoDB). It's roughly 1 million records big. I would like to show some nice charts. Fetching the right data is not a problem. Fetching it fast is...
So I'd like to count the number of sales in table A, grouped per day (later on also per month and year), for a period A to Z. Concretely: for the last 30 days, I'd like to know for each day how many sales records we have in the DB.
MySQL would have to return something like this:
date, count
2017-04-01, 2482
2017-04-02, 1934
2017-04-03, 2701
...
The structure of the sales table is basically like this:
CREATE TABLE `sales` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `contacts_created_at_index` (`created_at`),
KEY `contacts_deleted_at_index` (`deleted_at`),
KEY `ind_created_at_deleted_at` (`created_at`,`deleted_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Some days (datapoints) might not have any results, but I don't want gaps in the data, so I also have a 'calendar' table:
CREATE TABLE `time_dimension` (
`id` int(11) NOT NULL,
`db_date` date NOT NULL,
`year` int(11) NOT NULL,
`month` int(11) NOT NULL,
`day` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `td_ymd_idx` (`year`,`month`,`day`),
UNIQUE KEY `td_dbdate_idx` (`db_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Fetching 30 rows (30 days) with a count per day takes 30 secs...
This is the first query I tried:
SELECT
`db_date` AS `date`,
(SELECT
COUNT(1)
FROM
sales
WHERE
DATE(created_at) = db_date) AS count
FROM
`time_dimension`
WHERE
`db_date` >= '2017-04-11'
AND `db_date` <= '2017-04-25'
ORDER BY `db_date` ASC
But like I said, it's really slow (11.9 secs). I tried all kinds of other approaches, but without luck. For example:
SELECT time_dimension.db_date AS DATE,
COUNT(1) AS count
FROM sales RIGHT JOIN time_dimension ON (DATE(sales.created_at) =
time_dimension.db_date)
WHERE
(time_dimension.db_date BETWEEN '2017-03-11' AND '2017-04-11')
GROUP BY
DATE
A query for just 1 datapoint takes only 5.4ms:
SELECT COUNT(1) FROM sales WHERE created_at BETWEEN '2017-04-11 00:00:00' AND '2017-04-25 23:59:59'
I haven't checked innodb_buffer_pool_size on my local machine; I will check that as well. Any ideas on how to make queries like this fast? In the future I will also need extra WHERE clauses and joins to filter the set of sales records.
Thanks,
Nick
You could try counting the sales data first, then joining the counts with your calendar table:
SELECT time_dimension.db_date AS date,
by_date.sale_count
FROM time_dimension
LEFT JOIN (SELECT DATE(sales.created_at) sale_date,
COUNT(1) AS sale_count
FROM sales
WHERE created_at BETWEEN '2017-03-11 00:00:00' AND
'2017-04-11 23:59:59'
GROUP BY DATE(sales.created_at)) by_date
ON time_dimension.db_date = by_date.sale_date
WHERE time_dimension.db_date BETWEEN '2017-03-11' AND '2017-04-11'
The problematic part of your query is the data type conversion DATE(created_at), which effectively prevents MySQL from using the index on created_at.
Your one-datapoint query avoids that, which is why it is fast.
To fix this, you should check whether created_at falls within the range of a specific day, like this:
created_at BETWEEN db_date AND DATE_ADD(db_date,INTERVAL 1 DAY)
This way MySQL will be able to make use of the index on it (do a range lookup), as appropriate.
WHERE DATE(created_at) = db_date)
-->
WHERE created_at >= db_date
AND created_at < db_date + INTERVAL 1 DAY
This avoids including midnight of the second day (as BETWEEN does).
It works for all flavors: DATE, DATETIME, DATETIME(6).
It does not hide created_at inside a function where the index cannot see it.
For time_dimension, get rid of PRIMARY KEY (id) and change UNIQUE(db_date) to the PK.
After making these changes, your original subquery may be competitive with the LEFT JOIN ( SELECT ... ). (It depends on which version of MySQL.)
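For illustration, here is the original subquery rewritten with the sargable predicate (a sketch):
SELECT
`db_date` AS `date`,
(SELECT COUNT(*)
FROM sales
WHERE created_at >= db_date
AND created_at < db_date + INTERVAL 1 DAY) AS `count`
FROM `time_dimension`
WHERE `db_date` >= '2017-04-11'
AND `db_date` <= '2017-04-25'
ORDER BY `db_date` ASC;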

Speed up this MySQL query

I have a query which is getting slower and slower because there are more and more records in my table. So I'm trying to speed things up.
Database size:
Records: 1,200,000
Data: 22.9 MiB
Index: 46.8 MiB
Total: 69.7 MiB
The purpose of the query is counting the number of records that exist that match the conditions. The conditions are a date (current date) and a status number. See query below:
SELECT
COUNT(id) AS total
FROM
order_process
WHERE
DATE(datetime) = CURDATE() AND
status = '7';
At the moment, this query takes 800ms. And I need to run this query multiple times with different dates. These are all in the same script, so script execution currently exceeds 3 seconds. How can I speed this up?
What have I already done:
Created indexes (indexes on status and on datetime individually don't speed up the query).
Tested the InnoDB engine (which was slower; this table is mostly read).
To make it complete, below is the current table setup.
CREATE TABLE IF NOT EXISTS `order_process` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`order_id` int(11) NOT NULL,
`status` int(11) NOT NULL,
`datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`remark` text NOT NULL,
PRIMARY KEY (`id`),
KEY `orderid` (`order_id`),
KEY `datetime` (`datetime`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
When you use the DATE() function on a TIMESTAMP/DATETIME column, MySQL can't use the index even when the column is indexed.
So you need to construct the query as:
where
datetime >= concat(CURDATE(),' 00:00:00')
and datetime <= concat(CURDATE(),' 23:59:59')
and status = '7'
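Beyond that rewrite, a composite index could serve both conditions; a sketch (my addition, assuming status is reasonably selective):
ALTER TABLE `order_process`
ADD INDEX `idx_status_datetime` (`status`, `datetime`); -- equality column first, then the range column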

MySQL with big tables: how to optimize this query?

I have a table using InnoDB that stores all messages sent by my system. Currently the table has 40 million rows and grows by about 3/4 million rows per month.
My query basically selects messages sent by a user within a date range. Here is a simplified CREATE TABLE:
CREATE TABLE `log` (
`id` int(10) NOT NULL DEFAULT '0',
`type` varchar(10) NOT NULL DEFAULT '',
`timeLogged` int(11) NOT NULL DEFAULT '0',
`orig` varchar(128) NOT NULL DEFAULT '',
`rcpt` varchar(128) NOT NULL DEFAULT '',
`user` int(10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `timeLogged` (`timeLogged`),
KEY `user` (`user`),
KEY `user_timeLogged` (`user`,`timeLogged`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Note: I have the individual indexes too because of other queries.
The query looks like this:
SELECT COUNT(*) FROM log WHERE timeLogged BETWEEN 1282878000 AND 1382878000 AND user = 20
The issue is that this query takes from 2 to 10 minutes, depending on the user and server load, which is too long to wait for a page to load. I have the MySQL cache enabled and caching in the application, but the problem is that when a user searches new ranges, it won't hit the cache.
My questions are:
Would changing the user_timeLogged index make any difference?
Is this a problem with MySQL and big databases? I mean, do Oracle or other DBs also suffer from this problem?
AFAIK, my indexes are correctly created and this query shouldn't take so long.
Thanks to anyone who helps!
You're using InnoDB but not taking full advantage of the InnoDB clustered index (primary key), as it looks like your typical query is of the form:
select <fields> from <table> where user_id = x and <datefield> between y and z
not
select <fields> from <table> where id = x
The following article should help you optimise your table design for this query:
http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
If you understand the article correctly, you should find yourself with something like the following:
drop table if exists user_log;
create table user_log
(
user_id int unsigned not null,
created_date datetime not null,
log_type_id tinyint unsigned not null default 0, -- 1 byte vs varchar(10)
...
...
primary key (user_id, created_date, log_type_id)
)
engine=innodb;
Here are some query performance stats from the above design:
Counts
select count(*) as counter from user_log
counter
=======
37770394
select count(*) as counter from user_log where
created_date between '2010-09-01 00:00:00' and '2010-11-30 00:00:00'
counter
=======
35547897
User and date based queries (all queries run with cold buffers)
select count(*) as counter from user_log where user_id = 4755
counter
=======
7624
runtime = 0.215 secs
select count(*) as counter from user_log where
user_id = 4755 and created_date between '2010-09-01 00:00:00' and '2010-11-30 00:00:00'
counter
=======
7404
runtime = 0.015 secs
select
user_id,
created_date,
count(*) as counter
from
user_log
where
user_id = 4755 and created_date between '2010-09-01 00:00:00' and '2010-11-30 00:00:00'
group by
user_id, created_date
order by
counter desc
limit 10;
runtime = 0.031 secs
Hope this helps :)
COUNT(*) is not loaded from the table cache because you have a WHERE clause. Using EXPLAIN, as @jason mentioned, try changing it to COUNT(id) and see if that helps.
I could be wrong, but I also think that your indexes have to be in the same order as your WHERE clause. Since your WHERE clause uses timeLogged before user, your index should be KEY `user_timeLogged` (`timeLogged`, `user`).
Again, EXPLAIN will tell you whether this index change makes a difference.
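For example, prefixing the question's query:
EXPLAIN SELECT COUNT(*) FROM log
WHERE timeLogged BETWEEN 1282878000 AND 1382878000 AND user = 20;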