How can I optimize a query with a WHERE clause and an index? - mysql

I have this query:
select `price`, `asset_id`
from `history_average_pairs`
where `currency_id` = 1
and date(`created_at`) >= DATE_SUB(NOW(), INTERVAL 7 DAY)
group by hour(created_at), date(created_at), asset_id
order by `created_at` asc
And this table:
CREATE TABLE IF NOT EXISTS history_average_pairs (
id bigint(20) unsigned NOT NULL,
asset_id bigint(20) unsigned NOT NULL,
currency_id bigint(20) unsigned NOT NULL,
market_cap bigint(20) NOT NULL,
price double(20,6) NOT NULL,
volume bigint(20) NOT NULL,
circulating bigint(20) NOT NULL,
change_1h double(8,2) NOT NULL,
change_24h double(8,2) NOT NULL,
change_7d double(8,2) NOT NULL,
created_at timestamp NOT NULL DEFAULT current_timestamp(),
updated_at timestamp NOT NULL DEFAULT current_timestamp() ON UPDATE current_timestamp(),
total_supply bigint(20) unsigned NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
ALTER TABLE history_average_pairs
ADD PRIMARY KEY (id),
ADD KEY history_average_pairs_currency_id_asset_id_foreign (currency_id,asset_id);
ALTER TABLE history_average_pairs
MODIFY id bigint(20) unsigned NOT NULL AUTO_INCREMENT;
The table contains more than 10,000,000 rows, and the query takes:
Showing rows 0 - 24 (32584 total, Query took 27.8344 seconds.)
But without currency_id = 1, it takes about 4 seconds.
UPDATE 1
Okay, I updated the key from (currency_id, asset_id) to (currency_id, asset_id, created_at), and now it takes:
Showing rows 0 - 24 (32784 total, Query took 6.4831 seconds.)
It's much faster. Any suggestions to make it even faster?
The GROUP BY here is meant to take only the first row for every hour.
For example:
19:01:10
19:02:14
19:23:15
I need only 19:01:10

You can rephrase the filtering predicate to avoid using expressions on columns. For example:
select max(`price`) as max_price, `asset_id`
from `history_average_pairs`
where `currency_id` = 1
and created_at >= date_add(curdate(), interval - 7 day)
group by hour(created_at), date(created_at), asset_id
order by `created_at` asc
Then, this query could be much faster if you added the index:
create index ix1 on `history_average_pairs` (`currency_id`, created_at);

You must make the test "sargable"; change
date(`created_at`) >= DATE_SUB(NOW(), INTERVAL 7 DAY)
to
created_at >= CURDATE() - INTERVAL 7 DAY
Then the optimal index is
INDEX(currency_id, -- 1st because of "=" test
created_at, -- 2nd to finish out WHERE
asset_id) -- only for "covering"
When designing an index, it is usually best to handle the WHERE first.
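Spelled out as DDL, that index might look like this (a sketch; the index name is my own invention):
ALTER TABLE history_average_pairs
    ADD INDEX idx_currency_created_asset (currency_id, created_at, asset_id);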
The GROUP BY cannot use the index. Did you really want the hour first?
"I need only 19:01:10" is unclear, so I have not factored that in. Where's the date? Where's the asset_id? See "only_full_group_by". Do you need "groupwise max"?
Making the ORDER BY have the same columns as the GROUP BY avoids a sort. (In your query, the order may be slightly different, but it probably does not matter.)
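If what you actually want is the first row of each hour per asset (a "groupwise min"), here is a minimal sketch, assuming MySQL 8.0+ window functions and the column names from your question:
SELECT price, asset_id, created_at
FROM (
    SELECT price, asset_id, created_at,
           ROW_NUMBER() OVER (
               PARTITION BY asset_id, DATE(created_at), HOUR(created_at)
               ORDER BY created_at
           ) AS rn   -- rn = 1 marks the earliest row in each asset/hour bucket
    FROM history_average_pairs
    WHERE currency_id = 1
      AND created_at >= CURDATE() - INTERVAL 7 DAY
) AS t
WHERE rn = 1
ORDER BY created_at;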
Datatype issues...
BIGINT takes 8 bytes; INT takes only 4 bytes and is usually big enough. Shrinking the table provides some speed.
double(8,2) takes 8 bytes -- Don't use (m,n) on FLOAT or DOUBLE; it adds an extra rounding. Perhaps you meant DECIMAL(8,2), which takes 4 bytes.
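For example, the three change_* columns could be converted like this (a sketch; confirm the value ranges fit before shrinking anything, and apply the same idea to the BIGINT columns where INT or INT UNSIGNED is large enough):
ALTER TABLE history_average_pairs
    MODIFY change_1h  DECIMAL(8,2) NOT NULL,
    MODIFY change_24h DECIMAL(8,2) NOT NULL,
    MODIFY change_7d  DECIMAL(8,2) NOT NULL;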

Related

Why does this SQL query visit all rows and run so slowly?

The table 'reading' contains readings taken every 40s, for today. The query returns averages for 180s periods. 'time_stamp' is indexed. The query below returns a reasonable number of rows (a few hundred) but visits ALL rows and gets slower the bigger the table gets. The WHERE clause does not seem to restrict it to today's rows only.
EXPLAIN SELECT
DATE_FORMAT(time_stamp, '%Y-%m-%dT%T+00:00') ,
AVG(temp_c)
FROM reading
WHERE DATE(time_stamp) = CURDATE()
GROUP BY round(UNIX_TIMESTAMP(time_stamp) / 180)
Table schema:
CREATE TABLE reading (
id bigint(20) NOT NULL AUTO_INCREMENT,
time_stamp timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
temp_c float NOT NULL,
pressure_hpa float NOT NULL,
wind_speed_kt int(11) NOT NULL,
wind_dir_degree int(11) NOT NULL,
rain_mm float NOT NULL,
rain_day_mm float NOT NULL,
wind_gust_kt int(11) NOT NULL,
humidity float DEFAULT NULL,
PRIMARY KEY (id),
KEY time_stamp (time_stamp),
KEY time_stamp_idx (time_stamp)
) ENGINE=InnoDB AUTO_INCREMENT=1747097 DEFAULT CHARSET=latin1;
EXPLAIN SELECT
DATE_FORMAT(time_stamp, '%Y-%m-%dT%T+00:00') ,
AVG(temp_c)
FROM reading
WHERE DATE(time_stamp) = CURDATE()
GROUP BY round(UNIX_TIMESTAMP(time_stamp) / 180)
When the above query is executed, the MySQL optimizer does not choose an index scan (possibly because of the cost estimate); instead a full table scan is initiated, and the issue appears to be the WHERE DATE(time_stamp) = CURDATE() predicate.
After changing your WHERE clause to time_stamp >= CURDATE(), I saw the index being used and far fewer rows fetched, avoiding the full scan.
Hence, your final query will be:
EXPLAIN SELECT
DATE_FORMAT(time_stamp, '%Y-%m-%dT%T+00:00') ,
AVG(temp_c)
FROM reading
WHERE time_stamp >= CURDATE()
GROUP BY round(UNIX_TIMESTAMP(time_stamp) / 180);
I suspect DATE(time_stamp) prevents efficient use of the index. A similar topic was discussed here (see ypercube's answer).
The query could be improved further by choosing an alternative to round(UNIX_TIMESTAMP(time_stamp) / 180), since that GROUP BY expression cannot use an index either, but I am not going further here.
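For example, one alternative (my own sketch, not part of the original answer) is to label each 180-second bucket by its start time and group on that alias:
-- FLOOR aligns buckets to 180-second boundaries (ROUND would shift them by 90s);
-- the WHERE range still uses the time_stamp index, the GROUP BY still needs a temporary table.
SELECT
    FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(time_stamp) / 180) * 180) AS bucket_start,
    AVG(temp_c) AS avg_temp_c
FROM reading
WHERE time_stamp >= CURDATE()
GROUP BY bucket_start;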
Hope this helps!

MySQL subquery count with calendar table slow

I have a sales table in MySQL (InnoDB). It's +- 1 million records big. I would like to show some nice charts. Fetching the right data is not a problem. Fetching it fast is...
I would like to count the number of sales, grouped per day (later also per month and year), for a given period. Concretely: for the last 30 days, I want to know for each day how many sales records are in the DB.
So MySQL would have to return something like this:
date, count
2017-04-01, 2482
2017-04-02, 1934
2017-04-03, 2701
...
The structure of the sales table is basically like this:
CREATE TABLE `sales` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`deleted_at` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `contacts_created_at_index` (`created_at`),
KEY `contacts_deleted_at_index` (`deleted_at`),
KEY `ind_created_at_deleted_at` (`created_at`,`deleted_at`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Some days (datapoints) might not have any results, but I don't want gaps in the data, so I also have a 'calendar' table:
CREATE TABLE `time_dimension` (
`id` int(11) NOT NULL,
`db_date` date NOT NULL,
`year` int(11) NOT NULL,
`month` int(11) NOT NULL,
`day` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `td_ymd_idx` (`year`,`month`,`day`),
UNIQUE KEY `td_dbdate_idx` (`db_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Fetching 30 rows (30 days) with a count per day takes 30 secs...
This is the first query I tried:
SELECT
`db_date` AS `date`,
(SELECT
COUNT(1)
FROM
sales
WHERE
DATE(created_at) = db_date) AS count
FROM
`time_dimension`
WHERE
`db_date` >= '2017-04-11'
AND `db_date` <= '2017-04-25'
ORDER BY `db_date` ASC
But like I said, it's really slow (11.9 secs). I tried all kinds of other approaches, but without luck. For example:
SELECT time_dimension.db_date AS DATE,
COUNT(1) AS count
FROM sales RIGHT JOIN time_dimension ON (DATE(sales.created_at) =
time_dimension.db_date)
WHERE
(time_dimension.db_date BETWEEN '2017-03-11' AND '2017-04-11')
GROUP BY
DATE
A query for just 1 datapoint takes only 5.4ms:
SELECT COUNT(1) FROM sales WHERE created_at BETWEEN '2017-04-11 00:00:00' AND '2017-04-25 23:59:59'
I haven't checked innodb_buffer_pool_size on my local machine; I will check that as well. Any ideas on how to make queries like this fast? In the future I will also need WHERE clauses and joins to filter the set of sales records.
Thanks.
Nick
You could try counting the sales data first, then joining the count result with your calendar table:
SELECT time_dimension.db_date AS date,
by_date.sale_count
FROM time_dimension
LEFT JOIN (SELECT DATE(sales.created_at) sale_date,
COUNT(1) AS sale_count
FROM sales
WHERE created_at BETWEEN '2017-03-11 00:00:00' AND
'2017-04-11 23:59:59'
GROUP BY DATE(sales.created_at)) by_date
ON time_dimension.db_date = by_date.sale_date
WHERE time_dimension.db_date BETWEEN '2017-03-11' AND '2017-04-11'
The problematic part of your query is the data type conversion DATE(created_at), which effectively prevents MySQL from using the index on created_at.
Your single-datapoint query avoids that, which is why it runs fast.
To fix this, check whether created_at falls within the range of a specific day, like this:
created_at BETWEEN db_date AND DATE_ADD(db_date,INTERVAL 1 DAY)
This way MySQL will be able to use the index (do a range lookup), as appropriate.
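Applied to your original correlated subquery, that looks roughly like this (a sketch using the same table and column names; not tested against your data):
SELECT
    td.db_date AS `date`,
    (SELECT COUNT(1)
       FROM sales s
      WHERE s.created_at BETWEEN td.db_date
                             AND DATE_ADD(td.db_date, INTERVAL 1 DAY)
            -- note: BETWEEN also counts rows at exactly the next day's midnight
    ) AS `count`
FROM time_dimension td
WHERE td.db_date >= '2017-04-11'
  AND td.db_date <= '2017-04-25'
ORDER BY td.db_date ASC;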
WHERE DATE(created_at) = db_date)
-->
WHERE created_at >= db_date
AND created_at < db_date + INTERVAL 1 DAY
This avoids including midnight of the second day (as BETWEEN does).
It works for all flavors: DATE, DATETIME, DATETIME(6).
It does not hide created_at inside a function where the index cannot see it.
For time_dimension, get rid of PRIMARY KEY (id) and change UNIQUE(db_date) to the PK.
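In DDL terms, that change might look like this (a sketch; it assumes nothing else references time_dimension.id):
ALTER TABLE time_dimension
    DROP PRIMARY KEY,            -- was on the surrogate id
    DROP KEY td_dbdate_idx,      -- the old UNIQUE on db_date becomes redundant
    ADD PRIMARY KEY (db_date),
    DROP COLUMN id;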
After making these changes, your original subquery may be competitive with the LEFT JOIN ( SELECT ... ). (It depends on which version of MySQL.)

MySQL select optimization

A table with a few million rows, something like this:
CREATE TABLE my_table (
`CONTVISITID` bigint(20) NOT NULL AUTO_INCREMENT,
`NODE_ID` bigint(20) DEFAULT NULL,
`CONT_ID` bigint(20) DEFAULT NULL,
`NODE_NAME` varchar(50) DEFAULT NULL,
`CONT_NAME` varchar(100) DEFAULT NULL,
`CREATE_TIME` datetime DEFAULT NULL,
`HITS` bigint(20) DEFAULT NULL,
`UPDATE_TIME` datetime DEFAULT NULL,
`CLIENT_TYPE` varchar(20) DEFAULT NULL,
`TYPE` bigint(1) DEFAULT NULL,
`PLAY_TIMES` bigint(20) DEFAULT NULL,
`FIRST_PUBLISH_TIME` bigint(20) DEFAULT NULL,
PRIMARY KEY (`CONTVISITID`),
KEY `cont_visit_contid` (`CONT_ID`),
KEY `cont_visit_createtime` (`CREATE_TIME`),
KEY `cont_visit_publishtime` (`FIRST_PUBLISH_TIME`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=57676834 DEFAULT CHARSET=utf8
I have a query that I managed to optimize to the following, starting from a flat select:
SELECT a.cont_id, SUM(a.hits)
FROM (
SELECT cont_id,hits,type,first_publish_time
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time>1398310263000
AND type=1) as a group by a.cont_id
order by sum(HITS) DESC LIMIT 10;
Can this be further optimized?
Edit:
I started with a flat select, like I mentioned before. By "flat select" I mean a single select without the composite (derived) select my current query uses, i.e. the kind of single select that someone responded with. A single select is twice as slow, so it is not viable in my case.
Edit 2: A DBA friend suggested changing the query to this:
SELECT a.cont_id, SUM(a.hits)
FROM (
SELECT cont_id,hits
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time>1398310263000
AND type=1) as a group by a.cont_id
order by sum(HITS) DESC LIMIT 10;
As I do not need the extra fields (type, first_publish_time), the temporary table is smaller, and this brings the query down to about 1/4 of the total time of the fastest version I had. He also suggested adding a composite index on (create_time, cont_id, hits). He says that with this index I will get really good performance, but I have not done that yet, as this is a production DB and the ALTER might affect replication. I will post the results once it is done.
Add these two indexes:
INDEX(type, first_publish_time)
INDEX(type, create_time)
Then do
SELECT cont_id, SUM(hits) AS tot_hits
FROM my_table
where create_time > '2017-03-10 00:00:00'
AND first_publish_time > 1398310263000
AND type = 1
group by cont_id
order by tot_hits DESC
LIMIT 10;
Start the index with any "=" filters (type, in this case); then you get one chance to use a range.
The reason for 2 indexes: the Optimizer will look at statistics and decide which one looks better based on the values given.
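As concrete DDL, those two indexes might look like this (a sketch; the index names are my own):
ALTER TABLE my_table
    ADD INDEX idx_type_first_publish (`TYPE`, `FIRST_PUBLISH_TIME`),
    ADD INDEX idx_type_create_time   (`TYPE`, `CREATE_TIME`);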
Consider shrinking the BIGINTs (8 bytes) to some smaller INT type. Saving space will help speed, especially if the table is too big to be cached.
For further discussion, please provide EXPLAIN SELECT ...;.

Speed up this MySQL query

I have a query which is getting slower and slower because there are more and more records in my table. So I'm trying to speed things up.
Database size:
Records: 1,200,000
Data: 22.9 MiB
Index: 46.8 MiB
Total: 69.7 MiB
The purpose of the query is counting the number of records that exist that match the conditions. The conditions are a date (current date) and a status number. See query below:
SELECT
COUNT(id) AS total
FROM
order_process
WHERE
DATE(datetime) = CURDATE() AND
status = '7';
At the moment, this query takes 800 ms, and I need to run it multiple times with different dates. These all run in the same script, so total script execution is over 3 seconds at the moment. How can I speed this up?
What have I already done:
Created indexes (indexes on status and on datetime individually don't speed up the query).
Tested the InnoDB engine (which was slower; this table is mostly read).
To make it complete, below the current table setup.
CREATE TABLE IF NOT EXISTS `order_process` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`order_id` int(11) NOT NULL,
`status` int(11) NOT NULL,
`datetime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`remark` text NOT NULL,
PRIMARY KEY (`id`),
KEY `orderid` (`order_id`),
KEY `datetime` (`datetime`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
When you use the DATE() function on a timestamp/datetime column, MySQL cannot use the index on that column, even if it is indexed.
So you need to construct the query as:
where
datetime >= concat(CURDATE(),' 00:00:00')
and datetime <= concat(CURDATE(),' 23:59:59')
and status = '7'
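Put together, the full count query might look like this (a sketch; the half-open upper bound is my own variation, and the composite index mentioned in the comment is an assumption, not part of the answer above):
-- A composite index on (status, `datetime`) may let this be resolved from the index alone (assumption).
SELECT COUNT(id) AS total
FROM order_process
WHERE `datetime` >= CURDATE()
  AND `datetime` <  CURDATE() + INTERVAL 1 DAY
  AND status = 7;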

How to optimize GROUP BY on calculated field (to use index)?

I have a large (nearly 10M records) data table which, for performance reasons, has a secondary aggregation companion table. The aggregation table is regularly populated with the data that has not yet been aggregated:
REPLACE INTO aggregate (channel_id, type, timestamp, value, count)
SELECT channel_id, 'day' AS type, MAX(timestamp) AS timestamp, SUM(value) AS value, COUNT(timestamp) AS count FROM data
WHERE timestamp < UNIX_TIMESTAMP(DATE_FORMAT(NOW(), "%Y-%m-%d")) * 1000
AND timestamp >= IFNULL((SELECT UNIX_TIMESTAMP(DATE_ADD(FROM_UNIXTIME(MAX(timestamp)/1000, "%Y-%m-%d"),
INTERVAL 1 day)) * 1000 FROM aggregate WHERE type = 'day'), 0)
GROUP BY channel_id, YEAR(FROM_UNIXTIME(timestamp/1000)), DAYOFYEAR(FROM_UNIXTIME(timestamp/1000));
I've found that the SELECT part of the statement is pretty slow (2+ seconds on a fast PC) even when no data is being returned. As the aggregation needs to run on embedded devices, this is a concern. Here is the plan:
id select_type table type key key_len rows Extra
1 PRIMARY data ALL 9184560 Using where; Using temporary; Using filesort
2 SUBQUERY aggregate index ts_uniq 22 1940 Using where; Using index
The sub-query itself is instant. Apparently data doesn't use the channel_id/timestamp index due to the calculation in the GROUP BY clause:
CREATE TABLE `data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`channel_id` int(11) DEFAULT NULL,
`timestamp` bigint(20) NOT NULL,
`value` double NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ts_uniq` (`channel_id`,`timestamp`),
KEY `IDX_ADF3F36372F5A1AA` (`channel_id`)
) ENGINE=MyISAM AUTO_INCREMENT=10432870 DEFAULT CHARSET=latin1;
Can the query be further optimized?
Update: adding requested information
SHOW INDEXES FROM data;
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Null Index_type
data 0 PRIMARY 1 id A 9184560 BTREE
data 0 ts_uniq 1 channel_id A 164 YES BTREE
data 0 ts_uniq 2 timestamp A 9184560 BTREE
data 1 IDX_ADF3.. 1 channel_id A 164 YES BTREE
CREATE TABLE `aggregate` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`channel_id` int(11) NOT NULL,
`type` varchar(8) NOT NULL,
`timestamp` bigint(20) NOT NULL,
`value` double NOT NULL,
`count` int(11) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `ts_uniq` (`channel_id`,`type`,`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=1941 DEFAULT CHARSET=latin1;
I've also noticed that the query becomes instant when changing the GROUP BY to channel_id, timestamp. Unfortunately adding the data calculations as columns is not desirable as the grouping is dynamically calculated.
I'm failing to understand why the GROUP BY index should be such a problem when there isn't even any data to be grouped. I've tried running
SELECT channel_id, 'day' AS type, MAX(timestamp) AS timestamp, SUM(value) AS value, COUNT(timestamp) AS count FROM data
WHERE timestamp < UNIX_TIMESTAMP(DATE_FORMAT(NOW(), "%Y-%m-%d")) * 1000
AND timestamp >= IFNULL((SELECT UNIX_TIMESTAMP(DATE_ADD(FROM_UNIXTIME(MAX(timestamp)/1000, "%Y-%m-%d"), INTERVAL 1 day)) * 1000
FROM aggregate WHERE type = 'day'), 0)
which is just as slow, so the GROUP BY doesn't seem to be the problem?
Update 2
Digging further down that road shows that
SELECT channel_id, 'day' AS type, timestamp, value, 1 FROM data
WHERE timestamp >= (SELECT UNIX_TIMESTAMP(DATE_ADD(FROM_UNIXTIME(MAX(timestamp)/1000, "%Y-%m-%d"),
INTERVAL 1 day)) * 1000 FROM aggregate WHERE type = 'day');
is still slow (1.4 sec), so it is not a GROUP BY problem at all.
Update 3
And this is still slow:
SELECT channel_id, 'day' AS type, timestamp, value, 1 FROM data WHERE timestamp >= 1380837600000;
So the problem is that the comparison is on timestamp, which cannot make use of the (channel_id, timestamp) index, even though those columns are part of the GROUP BY clause.
Which leads to the question: how can I force that index to be used?
Add year and dayofyear columns to the data table, and create an index on (channel_id, year, dayofyear). Populate the two new columns when you insert a row.
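A sketch of that change (column and index names are my own assumptions; backfill existing rows once, then also set the two columns on every INSERT):
ALTER TABLE `data`
    ADD COLUMN `year` SMALLINT NOT NULL DEFAULT 0,
    ADD COLUMN `dayofyear` SMALLINT NOT NULL DEFAULT 0,
    ADD INDEX `idx_channel_year_doy` (`channel_id`, `year`, `dayofyear`);
-- One-time backfill:
UPDATE `data`
   SET `year` = YEAR(FROM_UNIXTIME(`timestamp` / 1000)),
       `dayofyear` = DAYOFYEAR(FROM_UNIXTIME(`timestamp` / 1000));
-- The aggregation query can then GROUP BY channel_id, `year`, `dayofyear` and use the new index.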