I have the following query that counts the number of vessels in each zone for each week:
SELECT zone,
DATE_FORMAT(creation_date, '%Y%u') AS date,
COUNT(DISTINCT vessel_imo) AS vessel_count
FROM vessel_position
WHERE zone IS NOT NULL
AND creation_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH)
GROUP BY zone, date;
The table has about 40 million rows. The execution plan for this is:
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
| 1 | SIMPLE | vessel_position | NULL | range | creation_date,zone | zone | 5 | NULL | 21190904 | 50.00 | Using where; Using index; Using filesort |
+----+-------------+-----------------+------------+-------+--------------------+------+---------+------+----------+----------+------------------------------------------+
Columns vessel_imo, zone and creation_date are each indexed. The primary key is the composite key (vessel_imo, creation_date).
When I look at the query profile, I can see that a large amount of time is spent in Creating sort index.
Is there anything I can do to improve this query further?
Assuming the data, once inserted, does not change, build and maintain a Summary Table.
The table would have three columns: the zone, the week, and the count-distinct for that week. At the start of each week, build only the rows for the previous week (one per zone; skip NULL). Then build a query to work against that table -- it will be extremely fast since it will be fetching far fewer rows.
Meanwhile, adding INDEX(creation_date, zone, vessel_imo) as a secondary index will make the weekly task reasonably efficient (~52 times as fast as your current query, since it scans one week instead of a year).
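A minimal sketch of what that could look like; the summary table name, its columns, and the week boundaries below are assumptions to adapt to your schema:
CREATE TABLE zone_week_summary (
    zone          INT     NOT NULL,
    year_week     CHAR(6) NOT NULL,   -- same '%Y%u' format as the original query
    vessel_count  INT     NOT NULL,
    PRIMARY KEY (zone, year_week)
);

-- Weekly task: add the rows for the week that just ended.
-- %u weeks start on Monday, so the previous week runs from last Monday up to this Monday.
INSERT INTO zone_week_summary (zone, year_week, vessel_count)
SELECT zone,
       DATE_FORMAT(creation_date, '%Y%u'),
       COUNT(DISTINCT vessel_imo)
FROM vessel_position
WHERE zone IS NOT NULL
  AND creation_date >= DATE_SUB(CURDATE(), INTERVAL WEEKDAY(CURDATE()) + 7 DAY)
  AND creation_date <  DATE_SUB(CURDATE(), INTERVAL WEEKDAY(CURDATE()) DAY)
GROUP BY zone, DATE_FORMAT(creation_date, '%Y%u');

-- The reporting query then reads a few thousand summary rows instead of 40 million:
SELECT zone, year_week, vessel_count
FROM zone_week_summary
WHERE year_week >= DATE_FORMAT(DATE_SUB(CURDATE(), INTERVAL 12 MONTH), '%Y%u');
Note that COUNT(DISTINCT vessel_imo) per week cannot be derived later by summing smaller buckets, which is why the weekly task aggregates the raw rows directly.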
It depends on how selective your filtering condition is and on your table structure. Does the filtering condition select 20% of the rows, 5%, 1%, 0.1%?
If your answer is less than 5% then the following index could help:
create index ix1_date_zone on vessel_position (creation_date, zone);
If your table has many and/or heavy columns, then this option could still be slow, depending on how selective your filtering condition is.
Otherwise, you could try a more expensive index that covers the query, to avoid reading the table at all:
create index ix2_date_zone_imo on vessel_position
(creation_date, zone, vessel_imo);
This index is more expensive to maintain -- that is, inserts, updates, and deletes become slower -- but it would be faster for your SELECT.
Try both options and pick the best for your needs.
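If you are unsure how selective the filter actually is, a quick check along these lines (a sketch against the table from the question) returns the fraction of rows it matches:
SELECT
    (SELECT COUNT(*)
     FROM vessel_position
     WHERE zone IS NOT NULL
       AND creation_date >= DATE_SUB(CURDATE(), INTERVAL 12 MONTH))
    /
    (SELECT COUNT(*) FROM vessel_position) AS filter_selectivity;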
SET @mystartdate = DATE_SUB(CURDATE(), INTERVAL 12 MONTH);
SELECT zone, DATE_FORMAT(creation_date, '%Y%u') AS date,
COUNT(DISTINCT vessel_imo) AS vessel_count
FROM vessel_position
WHERE creation_date >= @mystartdate
AND zone > 0
GROUP BY zone, date;
This may produce results in less time. Please post your comparative timings of the second run of each (old and suggested).
Please also post the new EXPLAIN SELECT … to confirm that the index on creation_date is now used.
Unless old data is allowed to change, why do you have to gather 12 months of history? The numbers from more than a month ago are NOT going to change.
I know that this MariaDB version 5.5.68 is really out of date, but I have to stay with this old version for a while.
Is there a way to prevent toxic SELECTs that may block MyISAM tables for a long time (minutes)? The problem is that such a SELECT takes a read lock on the whole MyISAM table, and further INSERTs have to wait until all read locks are gone. So the long-running SELECT starts to block the system.
Take this example table:
CREATE TABLE `tbllog` (
`LOGID` bigint unsigned NOT NULL auto_increment,
`LOGSOURCE` smallint unsigned default NULL,
`USERID` int unsigned default NULL,
`LOGDATE` datetime default NULL,
`SUBPROVIDERID` int unsigned default NULL,
`ACTIONID` smallint unsigned default NULL,
`COMMENT` varchar(255) default NULL,
PRIMARY KEY (`LOGID`),
KEY `idx_LogDate` (`LOGDATE`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
The following SELECT works fine as long as there are fewer than about 1 million entries in the table (the customers set the date range):
SELECT *
FROM tbllog
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
LIMIT 500;
But it becomes toxic if there are 10 million entries or more in the table. Then it runs for minutes, consumes a lot of memory and starts blocking the app.
This is the query plan with ~600,000 entries in the table:
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | tbllog | index | idx_LogDate | PRIMARY | 8 | NULL | 624 | Using where |
+------+-------------+--------+-------+---------------+---------+---------+------+------+-------------+
The thing is, I need to know whether a query will become toxic before executing it. Then I could warn the user that it might block the system for a while, or even refuse to run it.
I know that InnoDB might not have this issue, but I don't know the drawbacks of a switch yet and I think it might be best to stay for the moment.
I tried to do a simple SELECT COUNT(*) FROM tbllog WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 before (removing LIMIT and ORDER BY), but it is not really much faster than the real query and produces double the load in the worst case.
I also considered a worker thread (as mentioned here), but that would be a significant change to the whole system, too. Switching to InnoDB would probably have less impact, I think.
Any ideas about this issue?
Your EXPLAIN report shows that it's doing an index-scan on the primary key index. I believe this is because the range of dates is too broad, so the optimizer thinks that it's not much help to use the index instead of simply reading the whole table. By doing an index-scan of the primary key (logid), the optimizer can at least ensure that the rows are read in the order you requested in your ORDER BY clause, so it can skip sorting.
If I test your query (I created the table and filled it with 1M rows of random data), but make it ignore the primary key index, I get this EXPLAIN report:
mysql> explain SELECT * FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 ORDER BY logid LIMIT 500;
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
| 1 | SIMPLE | tbllog | NULL | range | idx_LogDate | idx_LogDate | 6 | NULL | 271471 | 10.00 | Using index condition; Using where; Using filesort |
+----+-------------+--------+------------+-------+---------------+-------------+---------+------+--------+----------+----------------------------------------------------+
This makes it use the index on logdate, so it examines fewer rows, according to the proportion matched by the date range condition. But the resulting rows must be sorted ("Using filesort" in the Extra column) before it can apply the LIMIT.
This won't help at all if your range of dates covers the whole table anyway. In fact, it will be worse, because it will access rows indirectly by the logdate index, and then it will have to sort rows. This solution helps only if the range of dates in the query matches a small portion of the table.
A somewhat better index is a compound index on (subproviderid, logdate).
mysql> alter table tbllog add index (subproviderid, logdate);
mysql> explain SELECT * FROM tbllog IGNORE INDEX(PRIMARY) WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00' AND subproviderid=1 ORDER BY logid LIMIT 500;
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
| 1 | SIMPLE | tbllog | NULL | range | idx_LogDate,SUBPROVIDERID | SUBPROVIDERID | 11 | NULL | 12767 | 100.00 | Using index condition; Using filesort |
+----+-------------+--------+------------+-------+---------------------------+---------------+---------+------+-------+----------+---------------------------------------+
In my test, this helps the estimate of rows examined drop from 271471 to 12767 because they're restricted by subproviderid, then by logdate. How effective this is depends on how frequently subproviderid=1 is matched. If that's matched by virtually all of the rows anyway, then it won't be any help. If there are many different values of subproviderid and they each have a small fraction of rows, then it will help more to add this to the index.
In my test, I made an assumption that there are 20 different values of subproviderid with equal frequency. That is, my random data inserted round(rand()*20) as the value of subproviderid on each row. Thus it is expected that adding subproviderid resulted in 1/20th of the examined rows in my test.
When choosing the order of columns in the index, columns referenced in equality conditions must be listed before columns referenced in range conditions.
There's no way to get a prediction of the runtime of a query. That's not something the optimizer can predict. You should block users from requesting a range of dates that will match too great a portion of the table.
For this
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
AND subproviderid=1
ORDER BY logid
Add both of these and hope that the Optimizer picks the better one:
INDEX(subproviderid, logdate, logid)
INDEX(subproviderid, logid)
Better yet would be to also change to this (assuming it is 'equivalent' for your purposes):
ORDER BY logdate, logid
Then that first index will probably work nicely.
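A sketch of both suggestions combined (the index names are made up, and the ORDER BY change assumes that logdate, logid ordering is acceptable for your report):
ALTER TABLE tbllog
    ADD INDEX idx_sub_date_logid (subproviderid, logdate, logid),
    ADD INDEX idx_sub_logid      (subproviderid, logid);

SELECT *
FROM tbllog
WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
  AND subproviderid = 1
ORDER BY logdate, logid
LIMIT 500;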
You really should change to InnoDB. (Caution: the table is likely to triple in size.) With InnoDB, there would be another indexing option. And, with an updated version, you could do "instant" index adding. Meanwhile, MyISAM will take a lot of time to add those indexes.
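If you do decide to convert, the statement itself is simple, though it rewrites the whole table, so allow for the time and the extra disk space:
ALTER TABLE tbllog ENGINE=InnoDB;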
Try creating a multi-column index specifically for your query.
CREATE INDEX sub_date_logid ON tbllog (subproviderid, logdate, logid);
This index should satisfy the WHERE filters in your query directly. Then it should present the rows in logid order so your ORDER BY ... LIMIT clauses don't have to sort the whole table. Will this help on long-dead MariaDB 5.5 with MyISAM? Hard to say for sure.
If it doesn't solve your performance problem, keep the multicolumn index and try doing the ORDER BY...LIMIT on the logid values rather than all the rows.
SELECT t.*
FROM tbllog t
JOIN (
    SELECT logid
    FROM tbllog
    WHERE logdate BETWEEN '2021-01-01 00:00:00' AND '2022-10-25 00:00:00'
    AND subproviderid=1
    ORDER BY logid
    LIMIT 500 ) AS picked
ON picked.logid = t.logid
ORDER BY t.logid;
This can speed things up because it lets MariaDB sort just the logid values inside the derived table to find the ones it wants. Then the outer query fetches only the 500 rows needed for your result set. Less data to sort = faster. (The inner query is written as a derived table because MariaDB does not allow a LIMIT directly inside an IN (...) subquery.)
One of the options, although an external one, would be to use ProxySQL. It has capabilities to shape the traffic. You can create rules deciding how to process queries that match them. You could, for example, create a query rule that would check if a query is accessing a given table (you can use regular expressions to match the query) and, for example, block that query or introduce a delay in execution.
Another option could be to use pt-kill. It's a script that's part of the Percona Toolkit and it's intended to, well, kill queries. You can define which queries you want to kill (matching them by regular expressions, by how long they ran or in other ways).
Having said that, if SELECTs can be optimized by rewriting or adding proper indexes, that may be the best option to go for.
The table is currently a 4+ million (~50 GB) row table and growing rapidly.
We don't want to include any rows where the EndTime is invalid and thus less than StartTime, because there's at least 1,000 rows where it's zero.
My question is what kind index would be best for these three queries?
I'm guessing maybe a composite index with EndTime first and StartTime second?
The StartTime and EndTime fields both contain unix timestamps like: 1401951888
SELECT AVG(EndTime-StartTime) FROM sessions WHERE EndTime>StartTime;
SELECT MAX(EndTime-StartTime) FROM sessions WHERE EndTime>StartTime;
SELECT MIN(EndTime-StartTime) FROM sessions WHERE EndTime>StartTime;
+----------------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+------------+------+-----+---------+-------+
| Uuid | char(36) | NO | PRI | NULL | |
| StartTime | int(11) | YES | | NULL | |
| EndTime | int(11) | YES | | NULL | |
+----------------------+------------+------+-----+---------+-------+
The table is currently a 4+ million (~50 GB) row table and growing rapidly.
4M rows with just those 3 columns and it's 50GB? Wow... is there a problem somewhere?
We don't want to include any rows where the EndTime is invalid and thus less than StartTime, because there's at least 1,000 rows where it's zero.
Since there are no other conditions, the query will have to process the entire table, minus 1000 rows. Therefore, any index will be useless.
Unless the table has lots more columns than you showed, in which case the only use for the index will be to be much smaller than the table on-disk, therefore much faster to scan.
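In that case, a narrow two-column index is enough for these three queries to scan; a sketch (the index name is made up):
ALTER TABLE sessions ADD INDEX idx_start_end (StartTime, EndTime);

-- The aggregates reference only these two columns, so they can be
-- computed from the index alone (still a full index scan, but far less I/O):
SELECT AVG(EndTime - StartTime) FROM sessions WHERE EndTime > StartTime;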
Now, in recent versions of MySQL, you can create functional indexes on virtual columns! Therefore, you can create an index on:
endTime - startTime
If your MAX() and MIN() use the index, they will be effectively instantaneous, since finding the min/max in a sorted index only requires looking at the first or last entry. However, your AVG() will, of course, have to examine all matching rows to compute the average.
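For example, a sketch assuming MySQL 5.7+ with InnoDB (the column and index names are made up):
ALTER TABLE sessions
    ADD COLUMN duration INT AS (EndTime - StartTime) VIRTUAL,
    ADD INDEX idx_duration (duration);

-- duration > 0 is the same filter as EndTime > StartTime.
-- MIN/MAX read from the ends of the index; AVG still scans every matching entry.
SELECT MIN(duration) FROM sessions WHERE duration > 0;
SELECT MAX(duration) FROM sessions WHERE duration > 0;
SELECT AVG(duration) FROM sessions WHERE duration > 0;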
First, here is the query I have:
SELECT
COUNT(*) as velocity_count,
SUM(`disbursements`.`amount`) as summation_amount
FROM `disbursements`
WHERE
`disbursements`.`accumulation_hash` = '40ad7f250cf23919bd8cc4619850a40444c5e90c978f88635a09ccf66a82ffb38e39ea51cdfd651b0ebdac5f5ca37cd7a17e0f60fea6cbce1397ccff5fa37346'
AND `disbursements`.`caller_id` = 1
AND `disbursements`.`active` = 1
AND (version_hash != '86b4111677294b27a1805643d193b8d437b6ddb170b4ed5dec39aa89bf070d160cbbcd697dfc1988efea8429b1f1557625bf956180c65d3dcd3a318280e0d2da')
AND (`disbursements`.`created_at` BETWEEN '2012-12-15 23:33:22'
AND '2013-01-14 23:33:22') LIMIT 1
Explain extended returns the following:
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
| 1 | SIMPLE | disbursements | range | unique_request_index,index_disbursements_on_caller_id,disbursement_summation_index,disbursement_velocity_index,disbursement_version_out_index | disbursement_summation_index | 1543 | NULL | 191422 | 100.00 | Using where; Using index |
+----+-------------+---------------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------+------------------------------+---------+------+--------+----------+--------------------------+
The actual query counts about 95,000 rows. If I explain another query that hits ~50 rows the explain is identical, just with fewer rows estimated.
The index being chosen covers accumulation_hash, caller_id, active, version_hash, created_at, amount in that order.
I've tried playing around with doing COUNT(id) or COUNT(caller_id) since these are non-null fields and return the same thing as count(*), but it doesn't have any impact on the plan or the run time of the actual query.
This is also a heavy insert table, essentially every single query will have had a row inserted or updated since the last time it was run, so the mysql query cache isn't entirely useful.
Before I go and make some sort of bucketed time-sequence cache with something like memcache or redis, is there an obvious solution to getting this to work much faster? A normal ~50 row query returns in 5 ms; the ones across 90k+ rows are taking 500-900 ms, and I really can't afford anything much past 100 ms.
I should point out the dates are a rolling 30 day window that needs to be essentially real time. Expiration could probably happen with ~1 minute granularity, but new items need to be seen immediately upon commit. I'm also on RDS, Read IOPS are essentially 0, and cpu is about 60-80%. When I'm not querying the giant 90,000+ record items, CPU typically stays below 10%.
You could try an index that has created_at before version_hash (you might get a better shot at an index range scan; it's not clear how that non-equality predicate on version_hash affects the plan, but I suspect it disables a range scan on the created_at column).
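For illustration only, a hypothetical reordering of that index (the same columns as the existing one, with created_at moved ahead of version_hash so the range predicate can use it, and the key still covering the query):
ALTER TABLE disbursements
    ADD INDEX idx_created_before_hash
        (accumulation_hash, caller_id, active, created_at, version_hash, amount);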
Other than that, the query and the index look about as good as you are going to get, the EXPLAIN output shows the query being satisfied from the index.
And the performance of the statement doesn't sound too unreasonable, given that it's aggregating 95,000+ rows, especially given the key length of 1543 bytes. That's a much larger size than I normally deal with.
What are the datatypes of the columns in the index, and what is the cluster key or primary key?
accumulation_hash - 128-character representation of 512-bit value
caller_id - integer or numeric (?)
active - integer or numeric (?)
version_hash - another 128-characters
created_at - datetime (8bytes) or timestamp (4bytes)
amount - numeric or integer
95,000 rows at 1543 bytes each is on the order of 140MB of data.
I've got a database of ~10 million entries, each of which contains a date stored as DATE.
I've indexed that column using a non-unique BTREE.
I'm running a query that counts the number of entries for each distinct year:
SELECT DISTINCT(YEAR(awesome_date)) as year, COUNT(id) as count
FROM all_entries
WHERE awesome_date IS NOT NULL
GROUP BY YEAR(awesome_date)
ORDER BY year DESC;
The query takes about 90 seconds to run at the moment, and the EXPLAIN output demonstrates why:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
----------------------------------------------------------------------------------------------------------------------------------------
1 | SIMPLE | all_entries | ALL | awesome_date | | | | 9759848 | Using where; Using temporary; Using filesort
If I FORCE KEY(awesome_date) that drops the rows count down to ~8 million and the key_len = 4, but is still Using where; Using temporary; Using filesort.
I also run queries selecting DISTINCT(MONTH(awesome_date)) and DISTINCT(DAY(awesome_date)) with the relevant WHERE conditions restricting them to a particular year or month.
Other than storing the year, month and day information in separate columns, is there a way of speeding up this query and/or avoiding temporary tables and filesort?
Without splitting the date to 3 columns, you could:
First, you should remove the DISTINCT; it is useless here, since the GROUP BY already returns one row per distinct year.
Remove the ORDER BY year; it will improve speed (a bit). Change the GROUP BY to GROUP BY YEAR(awesome_date) DESC (this works in the MySQL dialect only).
Change the COUNT(id) to COUNT(*) (assuming that id can never be NULL, this is faster in many MySQL versions).
In all, the query will become:
SELECT YEAR(awesome_date) AS year
     , COUNT(*) AS cnt        --- not good practise to use reserved words for aliases
FROM all_entries
WHERE awesome_date IS NOT NULL
GROUP BY YEAR(awesome_date) DESC ;
Even better (faster) solutions are:
your proposal to split the column into 3 (year, month, day)
change from MySQL to MariaDB (a MySQL fork) and use a PERSISTENT virtual column for the year, and add an index on that column (see the sketch after this list).
stay in MySQL and add a persistent year column yourself - by using triggers.
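A sketch of the MariaDB option from the list above (the column and index names are made up):
ALTER TABLE all_entries
    ADD COLUMN awesome_year SMALLINT AS (YEAR(awesome_date)) PERSISTENT,
    ADD INDEX idx_awesome_year (awesome_year);

SELECT awesome_year, COUNT(*) AS cnt
FROM all_entries
WHERE awesome_year IS NOT NULL
GROUP BY awesome_year DESC;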
I've got a very large table (~100Million Records) in MySQL that contains information about files. One of the pieces of information is the modified date of each file.
I need to write a query that will count the number of files that fit into specified date ranges. To do that I made a small table that specifies these ranges (all in days) and looks like this:
DateRanges
range_id   range_name   range_start   range_end
1          0-90         0             90
2          91-180       91            180
3          181-365      181           365
4          366-1095     366           1095
5          1096+        1096          999999999
And wrote a query that looks like this:
SELECT r.range_name,
       SUM(IF(DATEDIFF(CURDATE(), t.file_last_access) > r.range_start
              AND DATEDIFF(CURDATE(), t.file_last_access) < r.range_end,
              1, 0)) AS FileCount
FROM `DateRanges` r, `HugeFileTable` t
GROUP BY r.range_name
However, quite predictably, this query takes forever to run. I think that is because I am asking MySQL to go through the HugeFileTable 5 times, each time performing the DATEDIFF() calculation on each file.
What I want to do instead is to go through the HugeFileTable record by record only once, and for each file increment the count in the appropriate range_name running total. I can't figure out how to do that....
Can anyone help out with this?
Thanks.
EDIT: MySQL Version: 5.0.45, Tables are MyISAM
EDIT2: Here's the DESCRIBE that was asked for in the comments
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE r ALL NULL NULL NULL NULL 5 Using temporary; Using filesort
1 SIMPLE t ALL NULL NULL NULL NULL 96506321
First, create an index on HugeFileTable.file_last_access.
Then try the following query:
SELECT r.range_name, COUNT(t.file_last_access) as FileCount
FROM `DateRanges` r
JOIN `HugeFileTable` t
ON (t.file_last_access BETWEEN
    CURDATE() - INTERVAL r.range_end DAY AND
    CURDATE() - INTERVAL r.range_start DAY)
GROUP BY r.range_name;
Here's the EXPLAIN plan that I got when I tried this query on MySQL 5.0.75 (edited down for brevity):
+-------+-------+------------------+----------------------------------------------+
| table | type | key | Extra |
+-------+-------+------------------+----------------------------------------------+
| t | index | file_last_access | Using index; Using temporary; Using filesort |
| r | ALL | NULL | Using where |
+-------+-------+------------------+----------------------------------------------+
It's still not going to perform very well. By using GROUP BY, the query incurs a temporary table, which may be expensive. Not much you can do about that.
But at least this query eliminates the Cartesian product that you had in your original query.
update: Here's another query that uses a correlated subquery but I have eliminated the GROUP BY.
SELECT r.range_name,
(SELECT COUNT(*)
FROM `HugeFileTable` t
WHERE t.file_last_access BETWEEN
CURDATE() - INTERVAL r.range_end DAY AND
CURDATE() - INTERVAL r.range_start DAY
) as FileCount
FROM `DateRanges` r;
The EXPLAIN plan shows no temporary table or filesort (at least with the trivial amount of rows I have in my test tables):
+----+--------------------+-------+-------+------------------+--------------------------+
| id | select_type | table | type | key | Extra |
+----+--------------------+-------+-------+------------------+--------------------------+
| 1 | PRIMARY | r | ALL | NULL | |
| 2 | DEPENDENT SUBQUERY | t | index | file_last_access | Using where; Using index |
+----+--------------------+-------+-------+------------------+--------------------------+
Try this query on your data set and see if it performs better.
Well, start by making sure that file_last_access is indexed in the HugeFileTable table.
I'm not sure if this is possible/better, but try to compute the date limits first (files from date A to date B), then use a query with >= and <=. It will, theoretically at least, improve the performance.
The comparison would be something like:
t.file_last_access >= StartDate AND t.file_last_access <= EndDate
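A sketch of that idea using user variables, shown here for the 0-90 day bucket only (the bucket values are just an example):
SET @range_start_date = CURDATE() - INTERVAL 90 DAY;
SET @range_end_date   = CURDATE();

SELECT COUNT(*) AS FileCount
FROM HugeFileTable t
WHERE t.file_last_access >= @range_start_date
  AND t.file_last_access <= @range_end_date;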
You could get a small improvement by removing CURDATE() and putting a literal date in the query, since otherwise that function is run twice for each row in your SQL.