Why is my MySQL group by so slow?

I am querying against a table partitioned by month and approaching 20M rows. I need to group by DATE(transaction_utc) as well as country_id. If I turn off the GROUP BY and aggregates, just over 40k rows come back, which isn't too many; however, adding the GROUP BY makes the query substantially slower, unless said GROUP BY is on the transaction_utc column, in which case it gets FAST.
I've been trying to optimize the first query below by tweaking the query and/or the indexes, and got to the point below (about 2x as fast as initially); however, I'm still stuck with a 5s query for summarizing 45k rows, which seems way too slow.
For reference, this box is a brand new 24-logical-core, 64GB-RAM, MariaDB 5.5.x server with far more InnoDB buffer pool available than index space on the server, so there shouldn't be any RAM or CPU pressure.
So, I'm looking for ideas on what is causing this slowdown and suggestions on speeding it up. Any feedback would be greatly appreciated! :)
Ok, onto the details...
The following query (the one I actually need) takes approx 5 seconds (+/-), and returns less than 100 rows.
SELECT lss.`country_id` AS CountryId
, Date(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
GROUP BY lss.`country_id`, DATE(lss.`transaction_utc`)
EXPLAIN SELECT for the same query is as follows. Notice that it's not using the transaction_utc key. Shouldn't it be using my covering index instead?
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE lss ref idx_unique,transaction_utc,country_id idx_unique 50 const 1208802 Using where; Using temporary; Using filesort
1 SIMPLE c eq_ref PRIMARY PRIMARY 4 georiot.lss.country_id 1
Now onto a couple of other options that I've tried while attempting to determine what's going on...
The following query (changed group by) takes about 5 seconds (+/-), and returns only 3 rows:
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
GROUP BY lss.`country_id`
The following query (removed group by) takes 4-5 seconds (+/-) and returns 1 row:
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
The following query takes 0.00X seconds (+/-) and returns ~45k rows. This shows me that, at most, we're only trying to group 45K rows into fewer than 100 groups (as in my initial query):
SELECT lss.`country_id` AS CountryId
, DATE(lss.`transaction_utc`) AS TransactionDate
, c.`name` AS CountryName, lss.`country_id` AS CountryId
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE ( lss.`transaction_utc` BETWEEN '2012-09-26' AND '2012-10-26' AND lss.`username` = 'someuser' )
GROUP BY lss.`transaction_utc`
TABLE SCHEMA:
CREATE TABLE IF NOT EXISTS `sales` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`user_linkshare_account_id` int(11) unsigned NOT NULL,
`username` varchar(16) NOT NULL,
`country_id` int(4) unsigned NOT NULL,
`order` varchar(16) NOT NULL,
`raw_tracking_code` varchar(255) DEFAULT NULL,
`transaction_utc` datetime NOT NULL,
`processed_utc` datetime NOT NULL ,
`sku` varchar(16) NOT NULL,
`sale_original` decimal(10,4) NOT NULL,
`sale_usd` decimal(10,4) NOT NULL,
`quantity` int(11) NOT NULL,
`commission_original` decimal(10,4) NOT NULL,
`commission_usd` decimal(10,4) NOT NULL,
`original_currency` char(3) NOT NULL,
PRIMARY KEY (`id`,`transaction_utc`),
UNIQUE KEY `idx_unique` (`username`,`order`,`processed_utc`,`sku`,`transaction_utc`),
KEY `raw_tracking_code` (`raw_tracking_code`),
KEY `idx_usd_amounts` (`sale_usd`,`commission_usd`),
KEY `idx_countries` (`country_id`),
KEY `transaction_utc` (`transaction_utc`,`username`,`country_id`,`sale_usd`,`commission_usd`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE ( TO_DAYS(`transaction_utc`))
(PARTITION pOLD VALUES LESS THAN (735112) ENGINE = InnoDB,
PARTITION p201209 VALUES LESS THAN (735142) ENGINE = InnoDB,
PARTITION p201210 VALUES LESS THAN (735173) ENGINE = InnoDB,
PARTITION p201211 VALUES LESS THAN (735203) ENGINE = InnoDB,
PARTITION p201212 VALUES LESS THAN (735234) ENGINE = InnoDB,
PARTITION pMAX VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */ AUTO_INCREMENT=19696320 ;

The offending part is probably the GROUP BY DATE(transaction_utc). You also claim to have a covering index for this query but I see none. Your 5-column index has all the columns used in the query but not in the best order (which is: WHERE - GROUP BY - SELECT).
So the engine, finding no useful index, would have to evaluate this function for all 20M rows. Actually, it finds an index that starts with username (the idx_unique) and uses that, so it has to evaluate the function for (only) 1.2M rows. If you had an index on (transaction_utc) or on (username, transaction_utc), it would choose the most useful of the three.
Can you afford to change the table structure by splitting the column into date and time parts?
If you can, then an index on (username, country_id, transaction_date), or (changing the order of the two columns used for grouping) on (username, transaction_date, country_id), would be quite efficient.
A covering index on (username, country_id, transaction_date, sale_usd, commission_usd) would be even better.
If you want to keep the current structure, try changing the order inside your 5-column index to:
(username, country_id, transaction_utc, sale_usd, commission_usd)
or to:
(username, transaction_utc, country_id, sale_usd, commission_usd)
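For instance, a minimal sketch of the first variant (the new index name is just illustrative, and it assumes the existing 5-column index is safe to rebuild):
ALTER TABLE `sales`
    -- drop the old 5-column index and re-add it with the WHERE columns leading
    DROP INDEX `transaction_utc`,
    ADD INDEX `idx_username_country_utc`
        (`username`, `country_id`, `transaction_utc`, `sale_usd`, `commission_usd`);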
Since you are using MariaDB, you can use the VIRTUAL columns feature, without changing the existing columns:
Add a virtual (persistent) column and the appropriate index:
ALTER TABLE sales
    ADD COLUMN transaction_date DATE NOT NULL
        AS (DATE(transaction_utc))
        PERSISTENT,
    ADD INDEX special_IDX
        (username, country_id, transaction_date, sale_usd, commission_usd) ;
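With that in place, the original query could filter and group on the new column directly. A sketch (untested; note that filtering on transaction_date instead of transaction_utc gives up the TO_DAYS(transaction_utc) partition pruning and includes the whole end day of the BETWEEN range):
SELECT lss.`country_id` AS CountryId
, lss.`transaction_date` AS TransactionDate
, c.`name` AS CountryName
, COALESCE(SUM(lss.`sale_usd`),0) AS SaleUSD
, COALESCE(SUM(lss.`commission_usd`),0) AS CommissionUSD
FROM `sales` lss
JOIN `countries` c ON lss.`country_id` = c.`country_id`
WHERE lss.`username` = 'someuser'
AND lss.`transaction_date` BETWEEN '2012-09-26' AND '2012-10-26'
GROUP BY lss.`country_id`, lss.`transaction_date`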

Related

Optimize COUNT(*)

I have a table items from which I'm selecting 40 rows at a time ordered by the popularity of the item.
The popularity score is simply downloads/impressions.
Query:
SELECT id, name
FROM items
ORDER BY (SELECT COUNT(*) FROM downloads WHERE item = items.id)/
(SELECT COUNT(*) FROM impressions WHERE item = items.id)
LIMIT 40;
The problem is that the query takes forever to complete (ranging from 2 to 10 seconds).
At the moment we have 25K items, 18M impressions, and 560k downloads.
We already tried adding the fields downloads and impressions to the table items and keeping the counts updated using triggers (after an insert into the impressions and downloads tables, we increment the values), but we've had some issues with deadlocking.
Is there a better way to optimize this query?
Thanks.
Edit
Here's the output of EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY items ALL NULL NULL NULL NULL 20496 Using filesort
3 DEPENDENT SUBQUERY impressions ref PRIMARY PRIMARY 4 db.items.id 74 Using index
2 DEPENDENT SUBQUERY downloads ref PRIMARY PRIMARY 4 db.items.id 274 Using index
Tables:
CREATE TABLE `items` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(35) DEFAULT '',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=24369 DEFAULT CHARSET=utf8mb4;
CREATE TABLE `impressions` (
`item` int(10) unsigned NOT NULL,
`user` char(36) NOT NULL DEFAULT '',
PRIMARY KEY (`item`,`user`),
CONSTRAINT `impression_ibfk_1` FOREIGN KEY (`item`) REFERENCES `items` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
CREATE TABLE `downloads` (
`item` int(10) unsigned NOT NULL,
`user` char(36) NOT NULL DEFAULT '',
PRIMARY KEY (`item`,`user`),
CONSTRAINT `download_ibfk_1` FOREIGN KEY (`item`) REFERENCES `items` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
I think the following query can resolve your problem:
SELECT
    downloads.item, items.name, downloads.cnt/impressions.cnt AS rate
FROM (
    SELECT item, COUNT(*) AS cnt FROM downloads GROUP BY item
) AS downloads
JOIN (
    SELECT item, COUNT(*) AS cnt FROM impressions GROUP BY item
) impressions ON impressions.item = downloads.item
JOIN items ON items.id = downloads.item
ORDER BY rate DESC
LIMIT 40;
Also make sure the downloads and impressions tables are indexed by the item field; here, their composite primary keys already lead with item, so that is covered.
Not solvable with that approach.
There are two solutions:
Keep counters (by item.id) for impressions and downloads.
Summary tables.
Counters: This involves adding an extra column for each counter to the items table, or building a parallel table with id and the various counters. For a really high volume of counts, the latter avoids some clashes between various queries.
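An illustrative sketch of the counter-column variant (the column names are invented, and since trigger-maintained counters already caused deadlocks here, the increments would likely live in application code):
-- hypothetical counter columns on items:
ALTER TABLE items
    ADD COLUMN downloads_cnt INT UNSIGNED NOT NULL DEFAULT 0,
    ADD COLUMN impressions_cnt INT UNSIGNED NOT NULL DEFAULT 0;
-- one atomic increment per event, issued by the application:
UPDATE items SET downloads_cnt = downloads_cnt + 1 WHERE id = 42;
-- the popularity query then touches only the items table:
SELECT id, name
FROM items
ORDER BY downloads_cnt / impressions_cnt
LIMIT 40;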
Summary Tables: Build and incrementally augment a table (or tables) that summarizes counts like these, plus perhaps other SUMs, COUNTs, etc. The table would perhaps be augmented daily with the previous day's information. Then "sum the counts" to get the grand totals; this will be much faster than your current query.
More on Summary Tables: http://mysql.rjweb.org/doc.php/summarytables
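A hypothetical shape for such a summary table (all names invented; augmenting it daily would also require a timestamp column on the base tables, which the posted schemas do not have):
CREATE TABLE daily_item_stats (
    day DATE NOT NULL,
    item INT UNSIGNED NOT NULL,
    downloads INT UNSIGNED NOT NULL,
    impressions INT UNSIGNED NOT NULL,
    PRIMARY KEY (item, day)
) ENGINE=InnoDB;
-- "sum the counts" to get the grand totals:
SELECT item, SUM(downloads) / SUM(impressions) AS rate
FROM daily_item_stats
GROUP BY item
ORDER BY rate
LIMIT 40;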
I'd count downloads and impressions first and then get the top 40:
with d as (select item, count(*) as total from downloads group by item)
, i as (select item, count(*) as total from impressions group by item)
, top40 as (select item from d join i using (item) order by d.total / i.total limit 40)
select *
from items
where id in
(
select item from top40
);
The WITH clause is available as of MySQL 8. In earlier versions, you'd work with subqueries instead.
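For example, a pre-8.0 sketch of the same idea using derived tables (MySQL does not allow LIMIT directly inside an IN subquery, hence the extra join):
select it.*
from (
    select d.item
    from (select item, count(*) as total from downloads group by item) as d
    join (select item, count(*) as total from impressions group by item) as i
        using (item)
    order by d.total / i.total
    limit 40
) as top40
join items it on it.id = top40.item;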
As item is a foreign key in downloads and impressions and id is the primary key in items, I suppose there is an index on them. Otherwise create one:
create unique index idx1 on items(id);
create index idx2 on downloads(item);
create index idx3 on impressions(item);

MySQL query with IN clause loses performance

I have a table to store data from csv files. It is a large table (over 40 million rows). This is its structure:
CREATE TABLE `imported_lines` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`day` date NOT NULL,
`name` varchar(256) NOT NULL,
`origin_id` int(11) NOT NULL,
`time` time(3) NOT NULL,
`main_index` tinyint(4) NOT NULL DEFAULT 0,
`transaction_index` tinyint(4) NOT NULL DEFAULT 0,
`data` varchar(4096) NOT NULL,
`error` bit(1) NOT NULL,
`expressions_applied` bit(1) NOT NULL,
`count_records` smallint(6) NOT NULL DEFAULT 0,
`client_id` tinyint(4) NOT NULL DEFAULT 0,
`receive_date` datetime(3) NOT NULL,
PRIMARY KEY (`id`,`client_id`),
UNIQUE KEY `uq` (`client_id`,`name`,`origin_id`,`receive_date`),
KEY `dh` (`day`,`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (`client_id`) PARTITIONS 15 */
When I perform a SELECT with a one-day filter, it returns data very quickly (0.4 s). But as I increase the date range, it becomes slower, until it hits a timeout error.
This is the query:
SELECT origin_id, error, main_index, transaction_index,
expressions_applied, name, day,
COUNT(id) AS total, SUM(count_records) AS sum_records
FROM imported_lines FORCE INDEX (dh)
WHERE client_id = 1
AND day >= '2017-07-02' AND day <= '2017-07-03'
AND name IN ('name1', 'name2', 'name3', ...)
GROUP BY origin_id, error, main_index, transaction_index, expressions_applied, name, day;
I think the IN clause may be hurting performance. I also tried adding the uq index to this query, which gave a little gain (FORCE INDEX (dh, uq)).
I also tried INNER JOIN (SELECT name FROM providers WHERE id = 2) prov ON prov.name = il.name, but that doesn't result in a quicker query either.
EDIT
EXPLAINing the query
id - 1
select_type - SIMPLE
table - imported_lines
type - range
possible_keys - uq, dh
key - dh
key_len - 261
ref - NULL
rows - 297988
extra - Using where; Using temporary; Using filesort
Any suggestions on what I should do?
I have made a few changes: I added a new index with multiple columns (as suggested by @Uueerdo) and rewrote the query as another user suggested too (but he deleted his answer).
I ran a few EXPLAIN PARTITIONS on the queries, tested with SQL_NO_CACHE to guarantee the cache wasn't used, and searching a whole month of data now takes 1.8s.
It's so much faster!
This is what I did:
ALTER TABLE `imported_lines` DROP INDEX dh;
ALTER TABLE `imported_lines` ADD INDEX dhc (`day`, `name`, `client_id`);
Query:
SELECT origin_id, error, main_index, transaction_index,
expressions_applied, name, day,
COUNT(id) AS total, SUM(count_records) AS sum_records
FROM imported_lines il
INNER JOIN (
SELECT id FROM imported_lines
WHERE client_id = 1
AND day >= '2017-07-01' AND day <= '2017-07-31'
AND name IN ('name1', 'name2', 'name3', ...)
) AS il_filter
ON il_filter.id = il.id
WHERE il.client_id = 1
GROUP BY origin_id, error, main_index, transaction_index, expressions_applied, name, day;
I realized that with the INNER JOIN, EXPLAIN PARTITIONS showed the query began to use the index. Also, with WHERE il.client_id = 1, the query reduces the number of partitions to look up.
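For reference, the pruning can be verified with something along these lines (a sketch):
EXPLAIN PARTITIONS
SELECT id FROM imported_lines
WHERE client_id = 1
AND day >= '2017-07-01' AND day <= '2017-07-31'
AND name IN ('name1', 'name2', 'name3');
-- the partitions column should list only the hash partition that holds client_id = 1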
Thanks for your help!

How to properly index a table

We used to index our tables based on the WHERE clause. That worked fine in our MSSQL days, but now we are using MySQL and things are different: subqueries have terrible performance. Consider this table:
# 250K records per day
create table t_101(
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`transaction_date` datetime not null,
`memo_1` nvarchar(255) not null,
`memo_2` nvarchar(255) not null,
`product_id` bigint not null,
#many more columns
PRIMARY KEY (`id`),
key `index.t_101.101`(`transaction_date`, `product_id`, `memo_1`),
key `index.t_101.102`(`transaction_date`, `product_id`, `memo_2`)
)ENGINE=MyIsam;
A temporary table where I store condition values:
# 150 records
create temporary table `temporary.user.accessibleProducts`
(
product_id bigint not null,
PRIMARY KEY (`product_id`)
)Engine=MyIsam;
And this is the original query:
select
COUNT(a.id) as rowCount_la1,
COUNT(DISTINCT a.product_id) as productCount
from t_101 a
where a.transaction_date = '2017-05-01'
and a.product_id in(select xa.product_id from `temporary.user.accessibleProducts` xa)
and a.memo_1 <> '';
It takes 7 seconds to execute, while this query:
select
COUNT(a.id) as rowCount_la1,
COUNT(DISTINCT a.product_id) as productCount
from t_101 a
inner join `temporary.user.accessibleProducts` b on b.product_id = a.product_id
where a.transaction_date = '2017-05-01'
and a.memo_1 <> '';
takes 0.063 seconds to execute. Even though 0.063 seconds is acceptable, I'm worried about the indexes. Given the above, how do I index t_101 properly?
We are using MySQL 5.5.42.

Optimize a query

How can I make my response time faster? The average response time is approximately 0.2s (8,039 records in my items table and 81 records in my tracking table).
Query
SELECT a.name, b.cnt FROM `items` a LEFT JOIN
(SELECT guid, COUNT(*) cnt FROM tracking WHERE
date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day ) GROUP BY guid) b ON
a.`id` = b.guid WHERE a.`type` = 'streaming' AND a.`state` = 1
ORDER BY b.cnt DESC LIMIT 15 OFFSET 75
Tracking table structure
CREATE TABLE `tracking` (
`id` bigint(11) NOT NULL AUTO_INCREMENT,
`guid` int(11) DEFAULT NULL,
`ip` int(11) NOT NULL,
`date` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `i1` (`ip`,`guid`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=4303 DEFAULT CHARSET=latin1;
Items table structure
CREATE TABLE `items` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`guid` int(11) DEFAULT NULL,
`type` varchar(255) DEFAULT NULL,
`name` varchar(255) DEFAULT NULL,
`embed` varchar(255) DEFAULT NULL,
`url` varchar(255) DEFAULT NULL,
`description` text,
`tags` varchar(255) DEFAULT NULL,
`date` int(11) DEFAULT NULL,
`vote_val_total` float DEFAULT '0',
`vote_total` float(11,0) DEFAULT '0',
`rate` float DEFAULT '0',
`icon` text CHARACTER SET ascii,
`state` int(11) DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=9258 DEFAULT CHARSET=latin1;
Your query, as written, doesn't make much sense. It produces all possible combinations of rows in your two tables and then groups them.
You may want this:
SELECT a.*, b.cnt
FROM `items` a
LEFT JOIN (
SELECT guid, COUNT(*) cnt
FROM tracking
WHERE `date` > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day)
GROUP BY guid
) b ON a.guid = b.guid
ORDER BY b.cnt DESC
The high-volume data in this query come from the relatively large tracking table. So, you should add a compound index to it, using the columns (date, guid). This will allow your query to random-access the index by date and then scan it for guid values.
ALTER TABLE tracking ADD INDEX guid_summary (`date`, guid);
I suppose you'll see a nice performance improvement.
Pro tip: Don't use SELECT *. Instead, give a list of the columns you want in your result set. For example,
SELECT a.guid, a.name, a.description, b.cnt
Why is this important?
First, it makes your software more resilient against somebody adding columns to your tables in the future.
Second, it tells the MySQL server to sling around only the information you want. That can improve performance really dramatically, especially when your tables get big.
Since tracking has significantly fewer rows than items, I will propose the following.
SELECT i.name, c.cnt
FROM
(
SELECT guid, COUNT(*) cnt
FROM tracking
WHERE date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 day )
GROUP BY guid
) AS c
JOIN items AS i ON i.id = c.guid
WHERE i.type = 'streaming'
  AND i.state = 1
ORDER BY c.cnt DESC
LIMIT 15 OFFSET 75;
It will fail to display any items for which cnt is 0. (Your version displays the items with NULL for the count.)
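If the zero-count items matter, a LEFT JOIN variant of the same idea keeps them (a sketch):
SELECT i.name, COALESCE(c.cnt, 0) AS cnt
FROM items AS i
LEFT JOIN
(
    SELECT guid, COUNT(*) cnt
    FROM tracking
    WHERE date > UNIX_TIMESTAMP(NOW() - INTERVAL 1 DAY)
    GROUP BY guid
) AS c ON c.guid = i.id
WHERE i.type = 'streaming'
AND i.state = 1
ORDER BY cnt DESC
LIMIT 15 OFFSET 75;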
Composite indexes needed:
items: The PRIMARY KEY(id) is sufficient.
tracking: INDEX(date, guid) -- "covering"
Other issues:
If ip is an IP-address, it needs to be INT UNSIGNED. But that covers only IPv4, not IPv6.
It seems like date is not just a "date", but really a date+time. Please rename it to avoid confusion.
float(11,0) -- Don't use FLOAT for integers. Don't use (m,n) on FLOAT or DOUBLE. INT UNSIGNED makes more sense here (see the sketch after this list).
OFFSET is naughty when it comes to performance -- it must scan over the skipped records. But, in your query, there is no way to avoid collecting all the possible rows, sorting them, stepping over 75, and only finally delivering 15 rows. (And, with no more than 81, it won't be a full 15.)
What version are you using? There have been important changes to the Optimization of LEFT JOIN ( SELECT ... ). Please provide EXPLAIN SELECT for each query under discussion.
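For the float(11,0) point above, the type change would look something like this (a sketch; check that existing values fit first):
ALTER TABLE items
    MODIFY `vote_total` INT UNSIGNED DEFAULT 0;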

MySQL InnoDB: Long query execution time

I'm having trouble running this SQL:
I think it's an index problem, but I don't know, because I didn't make this database and I'm just a simple programmer.
The problem is that the table has 64,260 records, so the query goes crazy when executing; I have to stop MySQL and start it again because the computer freezes.
Thanks.
EDIT: table Schema
CREATE TABLE IF NOT EXISTS `value_magnitudes` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`value` float DEFAULT NULL,
`magnitude_id` int(11) DEFAULT NULL,
`sdi_belongs_id` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`reading_date` datetime DEFAULT NULL,
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=1118402 ;
Query
select * from value_magnitudes
where id in
(
SELECT min(id)
FROM value_magnitudes
WHERE magnitude_id = 234
and date(reading_date) >= '2013-04-01'
group by date(reading_date)
)
First, add an index on (magnitude_id, reading_date):
ALTER TABLE value_magnitudes
    ADD INDEX magnitude_id__reading_date__IX   -- just a name for the index
        (magnitude_id, reading_date) ;
Then try this variation:
SELECT vm.*
FROM value_magnitudes AS vm
JOIN
( SELECT MIN(id) AS id
FROM value_magnitudes
WHERE magnitude_id = 234
AND reading_date >= '2013-04-01' -- changed so index is used
GROUP BY DATE(reading_date)
) AS vi
ON vi.id = vm.id ;
The GROUP BY DATE(reading_date) will still need to apply the function to all the selected (through the index) rows, and that cannot be improved unless you follow @jurgen's advice and split the column into date and time columns.
Since you want results for every day, you need to extract the date from the datetime column with the date() function. That makes indexes useless.
You can split the reading_date column into separate date and time columns. Then you can run the query without a function, and indexes will work.
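A sketch of that split (the new column names are invented; test on a copy of the table first):
ALTER TABLE value_magnitudes
    ADD COLUMN reading_day DATE NULL,
    ADD COLUMN reading_time TIME NULL;
-- backfill from the existing datetime column:
UPDATE value_magnitudes
SET reading_day = DATE(reading_date),
    reading_time = TIME(reading_date);
-- an index on the plain DATE column is then usable without any function:
ALTER TABLE value_magnitudes
    ADD INDEX idx_magnitude_day (magnitude_id, reading_day);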
Additionally you can change the query into a join
select *
from value_magnitudes v
inner join
(
SELECT min(id) as id
FROM value_magnitudes
WHERE magnitude_id = 234
and reading_date >= '2013-04-01'
group by reading_date
) x on x.id = v.id
For starters, I would change your query to:
select * from value_magnitudes where id = (
select min(id) from value_magnitudes
where magnitude_id = 234
and DATE(reading_date) >= '2013-04-01'
)
You don't need to use the IN clause when the subquery is only going to return one record.
Then, I would make sure you have an index on magnitude_id and reading_date (probably a two-field index), as that's what you are querying against in the subquery. Without that index, you are scanning the table each time.
Also, if possible, change magnitude_id and reading_date to NOT NULL. Null values and indexes are not a great fit.
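That change would look something like this (a sketch; it assumes no NULLs remain in those columns, or that you backfill them first):
ALTER TABLE value_magnitudes
    MODIFY magnitude_id INT(11) NOT NULL,
    MODIFY reading_date DATETIME NOT NULL;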