how to speed up query when using count and group by - mysql

I have two tables named seller and item. They are connected through a third table (seller_item) using a "n" to "m" foreign key relation.
Now I am trying to meet this requirement: "As a seller, I want a list of my competitors with a count of the items that both I and they are selling".
So a list of all sellers with the count of overlapping items in relation to one specific seller.
Also I want this to be sorted by count and limited.
But the query is using temp table and filesort which is very slow.
Explain says:
Using where; Using index; Using temporary; Using filesort
How can I speed this up ?
Here is the query:
SELECT
COUNT(*) AS itemCount,
s.sellerName
FROM
seller s,
seller_item si
WHERE
si.itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
AND
si.sellerId=s.id
GROUP BY
sellerName
ORDER BY
itemCount DESC
LIMIT 50;
the table defs:
CREATE TABLE `seller` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`sellerName` varchar(50) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_index` (`sellerName`)
) ENGINE=InnoDB
contains about 200.000 rows
--
CREATE TABLE `item` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`itemName` varchar(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_index` (`itemName`)
) ENGINE=InnoDB
contains about 100.000.000 rows
--
CREATE TABLE `seller_item` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`sellerId` bigint(20) unsigned NOT NULL,
`itemId` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `sellerId` (`sellerId`,`itemId`),
KEY `item_id` (`itemId`),
CONSTRAINT `fk_1` FOREIGN KEY (`sellerId`) REFERENCES `seller` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION,
CONSTRAINT `fk_2` FOREIGN KEY (`itemId`) REFERENCES `item` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=InnoDB
contains about 170.000.000 rows
Database is Mysql Percona 5.6
Output of EXPLAIN:
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
| 1 | SIMPLE | s | index | PRIMARY,unique_index | unique_index | 152 | NULL | 1 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | si | ref | sellerId,item_id | sellerId | 8 | tmp.s.id | 1 | Using index |
| 1 | SIMPLE | seller_item | eq_ref | sellerId,item_id | sellerId | 16 | const,tmp.si.itemId | 1 | Using where; Using index |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+

I doubt it's feasible to make a query like that run fast in realtime on a database of your size, especially for sellers with lots of popular items in stock.
You should materialize it. Create a table like this
CREATE TABLE
matches
(
seller INT NOT NULL,
competitor INT NOT NULL,
matches INT NOT NULL,
PRIMARY KEY
(seller, competitor)
)
and update it in batches in a cron script:
DELETE
FROM matches
WHERE seller = :seller;
INSERT
INTO matches (seller, competitor, matches)
SELECT si.sellerId, sc.sellerId, COUNT(*) cnt
FROM seller_item si
JOIN seller_item sc
ON sc.itemId = si.itemId
AND sc.sellerId <> si.sellerId
WHERE si.sellerId = :seller
GROUP BY
si.sellerId, sc.sellerId
ORDER BY
cnt DESC
LIMIT 50;
You should also make (sellerId, itemId) the PRIMARY KEY of seller_item. As it stands, finding the sellers of an item takes two lookups instead of one: first the id by itemId through KEY (itemId), then the sellerId by id through the PRIMARY KEY (id).
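To make the refresh step concrete, here is a minimal sketch of the delete-then-insert batch for one seller. It runs against an in-memory SQLite database as a stand-in for MySQL, so it only illustrates the logic, not the performance. The helper name refresh_matches, the index name item_seller, and the sample rows are assumptions made for the illustration; the column names sellerId and itemId come from the question's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seller_item (
    sellerId INTEGER NOT NULL,
    itemId   INTEGER NOT NULL,
    PRIMARY KEY (sellerId, itemId)
);
CREATE INDEX item_seller ON seller_item (itemId, sellerId);
CREATE TABLE matches (
    seller     INTEGER NOT NULL,
    competitor INTEGER NOT NULL,
    matches    INTEGER NOT NULL,
    PRIMARY KEY (seller, competitor)
);
""")
# Hypothetical sample data: seller 1 shares two items with seller 2, one with seller 3.
conn.executemany(
    "INSERT INTO seller_item (sellerId, itemId) VALUES (?, ?)",
    [(1, 10), (1, 11), (1, 12), (2, 10), (2, 11), (3, 12), (4, 99)],
)

def refresh_matches(conn, seller_id, top_n=50):
    """Rebuild the materialized competitor list for one seller (hypothetical helper)."""
    with conn:  # DELETE and INSERT are committed together, or rolled back together
        conn.execute("DELETE FROM matches WHERE seller = ?", (seller_id,))
        conn.execute(
            """
            INSERT INTO matches (seller, competitor, matches)
            SELECT si.sellerId, sc.sellerId, COUNT(*) AS cnt
            FROM seller_item si
            JOIN seller_item sc
              ON sc.itemId = si.itemId AND sc.sellerId <> si.sellerId
            WHERE si.sellerId = ?
            GROUP BY si.sellerId, sc.sellerId
            ORDER BY cnt DESC
            LIMIT ?
            """,
            (seller_id, top_n),
        )

refresh_matches(conn, 1)
top = conn.execute(
    "SELECT competitor, matches FROM matches WHERE seller = ? "
    "ORDER BY matches DESC, competitor",
    (1,),
).fetchall()
print(top)
```

Reading the top competitors afterwards is a plain primary-key range scan on matches, which is the point of materializing. Running the refresh twice leaves the table unchanged, so a cron job can safely repeat it.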

I believe you're under a misimpression about your ability to eliminate the Using temporary; Using filesort steps to satisfy your query. Queries of the form
SELECT COUNT(*), grouping_value
FROM table
GROUP BY grouping_value
ORDER BY COUNT(*)
LIMIT n
always use a temporary in-memory result set, and always sort that resultset. That's because the result set doesn't exist anywhere until the query runs, and it has to be sorted before the LIMIT clause can be satisfied.
"Filesort" is somewhat misnamed. It doesn't necessarily mean the sorting is happening on a file in the file system, just that a temporary resultset is being sorted. If that resultset is massive, the sort can spill out of RAM into the filesystem, but it doesn't have to. Please read this. https://www.percona.com/blog/2009/03/05/what-does-using-filesort-mean-in-mysql/ Don't get distracted by the Using filesort item in your EXPLAIN results.
One of the tricks to getting better performance from this sort of query is to minimize the size of the sorted results. You've already filtered them down to the stuff you want; that's good.
But, you can still arrange to sort less stuff, by sorting just the seller.id and the count, then joining the (longer) sellerName in after you know the exact fifty rows you need. That also has the benefit of letting you do your aggregating with just the seller_item table, rather than with the resultset that comes from joining the two.
Here's what I mean. This subquery generates the list of fifty sellerId values you need. All it has to sort is the count and sellerId. That's faster than sorting the count and sellerName because there's less data, and fixed-length data, to shuffle around in the sort operation.
SELECT COUNT(*) AS itemCount,
sellerId
FROM seller_item
WHERE itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
GROUP BY SellerId
ORDER BY COUNT(*) DESC
LIMIT 50
Notice that this sorts a big result set, then discards most of it. It gives you the exact fifty seller id values you need.
You can make this even faster by filtering out more rows by adding HAVING COUNT(*) > 1 right after your GROUP BY clause, but that changes the meaning of your query and may not meet your business requirements.
Once you have those fifty items, you can retrieve the seller names. The whole query looks like this:
SELECT s.sellerName, c.itemCount
FROM seller s
JOIN (
SELECT COUNT(*) AS itemCount, sellerId
FROM seller_item
WHERE itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
GROUP BY SellerId
ORDER BY COUNT(*) DESC
LIMIT 50
) c ON c.sellerId = s.id
ORDER BY c.itemCount DESC
Your indexing effort should be spent trying to make the inner queries fast. The outer query will be fast no matter what; it's only handling fifty rows, and using an indexed id value to look up other values.
The inmost query is SELECT itemId FROM seller_item WHERE sellerId = 4711. This will benefit greatly from your existing (sellerId, itemId) compound index: it can random-access and then scan that index, which is very quick.
The SELECT COUNT(*)... query will benefit from a (itemId, sellerId) compound index. That part of your query is the hard and slow part, but still, this index will help.
Look, others have mentioned this, and so will I. Having both a unique composite key (sellerId, itemId) and a primary key id on that seller_item table is, with respect, incredibly wasteful.
It makes your updates and inserts slower.
It means your table is organized as a tree based on the meaningless id rather than the meaningful value pair.
If you make one of the two indexes I mentioned the primary key, and create the other one without making it unique, you'll have a much more efficient table. These many-to-many join tables don't need, and should not have, surrogate keys.
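As a sanity check on the rewrite, the sketch below runs the original query and the "aggregate first, join the names later" query on the same tiny data set and compares the results. It uses an in-memory SQLite database as a stand-in for MySQL, so it says nothing about speed; it only shows that both forms return the same rows. The table layout follows the suggestion above (composite primary key plus a reversed secondary index); the index name item_seller and the sample sellers are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seller (id INTEGER PRIMARY KEY, sellerName TEXT NOT NULL UNIQUE);
CREATE TABLE seller_item (
    sellerId INTEGER NOT NULL,
    itemId   INTEGER NOT NULL,
    PRIMARY KEY (sellerId, itemId)           -- no surrogate id column
);
CREATE INDEX item_seller ON seller_item (itemId, sellerId);
INSERT INTO seller VALUES (4711, 'acme'), (1, 'bolt'), (2, 'cog'), (3, 'dyn');
INSERT INTO seller_item VALUES
    (4711, 1), (4711, 2), (4711, 3),
    (1, 1), (1, 2),
    (2, 3),
    (3, 9);
""")

# The query from the question: join first, then group and sort by name.
original = conn.execute("""
SELECT s.sellerName, COUNT(*) AS itemCount
FROM seller s, seller_item si
WHERE si.itemId IN (SELECT itemId FROM seller_item WHERE sellerId = 4711)
  AND si.sellerId = s.id
GROUP BY s.sellerName
ORDER BY itemCount DESC
LIMIT 50
""").fetchall()

# Deferred join: aggregate and sort on sellerId only, then fetch the names.
deferred = conn.execute("""
SELECT s.sellerName, c.itemCount
FROM seller s
JOIN (
    SELECT COUNT(*) AS itemCount, sellerId
    FROM seller_item
    WHERE itemId IN (SELECT itemId FROM seller_item WHERE sellerId = 4711)
    GROUP BY sellerId
    ORDER BY COUNT(*) DESC
    LIMIT 50
) c ON c.sellerId = s.id
ORDER BY c.itemCount DESC
""").fetchall()

print(original)
print(deferred)
```

Note that both forms include seller 4711 itself in the output, as the original query does. When counts tie, the two queries may break the tie differently unless a second ORDER BY column is added.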

Reformulation
I think this is what you really wanted:
SELECT si2.sellerId, COUNT(DISTINCT si2.itemId) AS itemCount
FROM seller_item si1
JOIN seller_item si2 ON si2.itemId = si1.itemId
WHERE si1.sellerId = 4711
GROUP BY si2.sellerId
ORDER BY itemCount DESC
LIMIT 50;
(Note: DISTINCT is probably unnecessary.)
In words: For seller #4711, find the items he sells, then find which sellers are selling nearly the same set of items. (I did not try to filter out #4711 from the resultset.)
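A small sketch of why DISTINCT is probably unnecessary: as long as (sellerId, itemId) is unique, each shared item produces exactly one joined row per competitor, so COUNT(*) and COUNT(DISTINCT si2.itemId) agree. The example also adds the filter that removes #4711 from the result. It runs on in-memory SQLite rather than MySQL, and the sample rows and the index name item_seller are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seller_item (
    sellerId INTEGER NOT NULL,
    itemId   INTEGER NOT NULL,
    PRIMARY KEY (sellerId, itemId)   -- the pair is unique, as in the question
);
CREATE INDEX item_seller ON seller_item (itemId, sellerId);
INSERT INTO seller_item VALUES
    (4711, 1), (4711, 2), (4711, 3),
    (100, 1), (100, 2), (100, 3),
    (200, 2), (200, 7),
    (300, 8);
""")

QUERY = """
SELECT si2.sellerId, {count_expr} AS itemCount
FROM seller_item si1
JOIN seller_item si2 ON si2.itemId = si1.itemId
WHERE si1.sellerId = 4711
  AND si2.sellerId <> 4711          -- leave the seller out of their own list
GROUP BY si2.sellerId
ORDER BY itemCount DESC
LIMIT 50
"""

with_distinct = conn.execute(
    QUERY.format(count_expr="COUNT(DISTINCT si2.itemId)")
).fetchall()
with_star = conn.execute(QUERY.format(count_expr="COUNT(*)")).fetchall()
print(with_distinct)
print(with_star)
```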
More efficient N:M
But there is still an inefficiency. Let's dissect your many-to-many mapping table (seller_item).
It has an id which is probably not used for anything. Get rid of it.
Then promote UNIQUE(sellerId, itemId) to PRIMARY KEY(sellerId, itemId).
Now change INDEX(itemId) to INDEX(itemId, sellerId) so that the last stage of the query can be "using index".
Blog discussing that further.
You have a very large dataset; you have debugged your app. Consider removing the FOREIGN KEYs; they are somewhat costly.
Getting sellerName
It may be possible to JOIN to seller to get sellerName. But try it with just sellerId first. Then add the name. Verify that the count does not inflate (that often happens) and that the query does not slow down.
If either thing goes wrong, then do
SELECT s.sellerName, x.itemCount
FROM ( .. the above query .. ) AS x
JOIN seller AS s ON s.id = x.sellerId;
(Optionally you could add ORDER BY sellerName.)

I'm not sure how fast this would be on your database but I'd write the query like this.
select * from (
select seller.sellerName,
count(otherSellersItems.itemId) itemCount from (
select sellerId, itemId from seller_item where sellerId != 4711
) otherSellersItems
inner join (
select itemId from seller_item where sellerId = 4711
) thisSellersItems
on otherSellersItems.itemId = thisSellersItems.itemId
inner join seller
on otherSellersItems.sellerId = seller.id
group by seller.sellerName
) itemsSoldByOtherSellers
order by itemCount desc
limit 50 ;

Since we are limiting the (potentially large) resultset to at most 50 rows, I would put off getting the sellername until after we have the counts, so we only need to get 50 seller names.
First, we get the itemcount by seller_id
SELECT so.seller_id
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
For improved performance, I would make a suitable covering index available for the join to so. e.g.
CREATE UNIQUE INDEX seller_item_UX2 ON seller_item(item_id,seller_id)
By using a "covering index", MySQL can satisfy the query entirely from the index pages, without a need to visit the pages in the underlying table.
Once the new index is created, I would drop the index on the singleton item_id column, since that index is now redundant. (Any query that could make effective use of that index will be able to make effective use of the composite index which has item_id as the leading column.)
There's no getting around a "Using filesort" operation. MySQL has to evaluate the COUNT() aggregate on each row, before it can perform a sort. There's no way (given the current schema) for MySQL to return the rows in order using an index to avoid a sort operation.
Once we have that set of (at most) fifty rows, then we can get the sellername.
To get the sellername, we could either use a correlated subquery in the SELECT list, or a join operation.
1) Using a correlated subquery in SELECT list, e.g.
SELECT so.seller_id
, ( SELECT s.sellername
FROM seller s
WHERE s.seller_id = so.seller_id
ORDER BY s.seller_id, s.sellername
LIMIT 1
) AS sellername
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
(We know that subquery will be executed (at most) fifty times, once for each row returned by the outer query. Fifty executions (with a suitable index available) isn't that bad, at least compared to 50,000 executions.)
Or, 2) using a join operation, e.g.
SELECT c.seller_id
, s.sellername
, c.itemcount
FROM (
SELECT so.seller_id
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
) c
JOIN seller s
ON s.seller_id = c.seller_id
ORDER BY c.itemcount DESC, c.seller_id DESC
(Again, we know the inline view c will return (at most) fifty rows, and getting fifty sellername values (using a suitable index) should be fast.)
SUMMARY TABLE
If we denormalized the implementation, and added a summary table containing item_id (as the primary key) and a "count" of the number of sellers of that item_id, our query could take advantage of that.
As an illustration of what that might look like:
CREATE TABLE item_seller_count
( item_id BIGINT NOT NULL PRIMARY KEY
, seller_count BIGINT NOT NULL
) Engine=InnoDB
;
INSERT INTO item_seller_count (item_id, seller_count)
SELECT d.item_id
, COUNT(*)
FROM seller_item d
GROUP BY d.item_id
ORDER BY d.item_id
;
CREATE UNIQUE INDEX item_seller_count_IX1
ON item_seller_count (seller_count, item_id)
;
The new summary table will become "out of sync" when rows are inserted/updated/deleted from the seller_item table.
And populating this table would take resources. But having this available would speed up queries of the type we're working on.

Related

Query time that I can't understand in MYSQL

I'm new to this platform, even this is my first question. Sorry for my bad English, I use translate. Let me know if I have used inappropriate language.
my table is like this
CREATE TABLE tbl_records (
id int(11) NOT NULL,
data_id int(11) NOT NULL,
value double NOT NULL,
record_time datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
ALTER TABLE tbl_records
ADD PRIMARY KEY (id),
ADD KEY data_id (data_id),
ADD KEY record_time (record_time);
ALTER TABLE tbl_records
MODIFY id int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
my first query
It takes 0.0096 seconds
SELECT b.* FROM tbl_records b
INNER JOIN
(SELECT MAX(id) AS id FROM tbl_records GROUP BY data_id) a
ON a.id=b.id;
my second query
It takes 2.4957 seconds
SELECT MAX(id) AS id FROM tbl_records GROUP BY data_id;
When I do these operations over and over again, the result is similar.
There are 20 million rows in the table.
Why is the one with the subquery faster?
Also what I really need is MAX(record_time) but
SELECT b.* FROM tbl_records b
INNER JOIN
(SELECT MAX(record_time) AS id FROM tbl_records GROUP BY data_id) a
ON a.id=b.id
It takes minutes when I run it.
I also need records such as hourly, daily, and monthly. I couldn't see much performance difference between GROUP BY SUBSTR(record_time,1,10) or GROUP BY DATE_FORMAT(record_time,'%Y%m%d') both take minutes.
What am I doing wrong?
The first query can be simplified to the following, provided you only want the single newest row rather than one row per data_id:
SELECT * FROM tbl_records
ORDER BY id DESC
LIMIT 1;
The second:
SELECT id FROM tbl_records
ORDER BY data_id DESC
LIMIT 1;
I don't know what the third is trying to do. This does not make sense: MAX(record_time) AS id -- it is a DATETIME that will subsequently be compared to an INT in ON a.id=b.id.
Another option for turning a DATETIME into a DATE is simply DATE(record_time). But it will not be significantly faster.
If the goal is to build daily counts and subtotals, then there is a much better way. Build and incrementally maintain a Summary table.
(responding to Comment)
The GROUP BY that you have is improper and probably incorrect. I took the liberty of changing from id to data_id:
SELECT b.*
FROM
( SELECT data_id, MAX(record_time) AS max_time
FROM tbl_records
GROUP BY data_id
) AS a
JOIN tbl_records AS b
ON a.data_id = b.data_id
AND a.max_time = b.record_time
And have
INDEX(data_id, record_time)
Can there be duplicate times for one data_id? To discuss that and other "groupwise-max" queries, see http://mysql.rjweb.org/doc.php/groupwise_max
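Here is a minimal runnable sketch of that groupwise-max pattern: find MAX(record_time) per data_id in a derived table, then join back to fetch the full rows. It uses an in-memory SQLite database in place of MySQL, with record_time stored as ISO-formatted text so that string comparison matches time order. The index name data_time and the four sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_records (
    id          INTEGER PRIMARY KEY,
    data_id     INTEGER NOT NULL,
    value       REAL NOT NULL,
    record_time TEXT NOT NULL
);
-- Composite index so the per-data_id maximum can be read from the index.
CREATE INDEX data_time ON tbl_records (data_id, record_time);
INSERT INTO tbl_records (id, data_id, value, record_time) VALUES
    (1, 1, 10.0, '2023-01-01 00:00:00'),
    (2, 1, 11.0, '2023-01-02 00:00:00'),
    (3, 2, 20.0, '2023-01-03 00:00:00'),
    (4, 2, 21.0, '2023-01-01 12:00:00');   -- highest id, but not the newest time
""")

latest = conn.execute("""
SELECT b.id, b.data_id, b.value, b.record_time
FROM ( SELECT data_id, MAX(record_time) AS max_time
       FROM tbl_records
       GROUP BY data_id
     ) AS a
JOIN tbl_records AS b
  ON b.data_id = a.data_id
 AND b.record_time = a.max_time
ORDER BY b.data_id
""").fetchall()
print(latest)
```

Row 4 shows why MAX(id) and MAX(record_time) are not interchangeable: it has the largest id for data_id 2 but an older timestamp. If two rows of one data_id share the same maximum time, this query returns both of them.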

How to count duplicates by multiple records in subtable

Say I have the following table structure:
products
id | name | price
products_ean
id | product_id | ean
A product can (unfortunately) have multiple EAN numbers. Two products can have one or more of the same EAN numbers.
What is the best practice to count the amount of duplicate products by comparing multiple EAN numbers from the products_ean table?
I've tried something like the following, but that makes the query really slow:
SELECT
`products`.`name`,
(
SELECT
COUNT(*)
FROM
`products_ean`
WHERE
`ean` IN(
SELECT
`ean`
FROM
`products_ean`
WHERE
`product_id` = `products`.`id`
) AND `products_ean`.`product_id` != `products`.`id`
GROUP BY `product_id`
) AS `ProductEANCount`
FROM
`products`
LIMIT 12
Using joins is the simplest way to generate related information. I GROUP BY products.id, so the EANs are the aggregated field, because they are the only values that can repeat. I've added an optional HAVING clause at the end of the query to keep only the results with 2 or more.
SELECT p.id, name, price, count(ean) as eans
FROM products p
JOIN products_ean e
ON p.id = e.product_id
GROUP BY p.id
HAVING eans >= 2
On query efficiency, making (product_id, ean) the composite primary key of the products_ean table is probably the most efficient option. Since that pair is unique, it's not obvious why the products_ean.id column is needed.
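If the goal is to count, for each product, how many other products share at least one EAN with it, a self-join on products_ean is one possible approach. The sketch below is an illustration under assumptions, not the answer's query: it assumes the composite (product_id, ean) key suggested above, adds a hypothetical reversed index ean_product, and runs on in-memory SQLite rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL);
CREATE TABLE products_ean (
    product_id INTEGER NOT NULL,
    ean        TEXT NOT NULL,
    PRIMARY KEY (product_id, ean)
);
CREATE INDEX ean_product ON products_ean (ean, product_id);
INSERT INTO products VALUES (1, 'A', 1.0), (2, 'B', 2.0), (3, 'C', 3.0), (4, 'D', 4.0);
INSERT INTO products_ean VALUES
    (1, '111'), (1, '222'),
    (2, '111'), (2, '222'),
    (3, '222'),
    (4, '999');
""")

rows = conn.execute("""
SELECT p.name, COUNT(DISTINCT other.product_id) AS duplicates
FROM products p
LEFT JOIN products_ean mine
       ON mine.product_id = p.id
LEFT JOIN products_ean other
       ON other.ean = mine.ean
      AND other.product_id <> p.id
GROUP BY p.id, p.name
ORDER BY p.id
""").fetchall()
print(rows)
```

COUNT(DISTINCT ...) matters here: products A and B share two EANs, so without DISTINCT each would be counted twice for the other. The LEFT JOINs keep products with no duplicates in the result with a count of 0.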

MySQL - indexing , order of fields

My database size is growing, so I have decided to add indexes (never done that before). So I need some advice. e.g.
SELECT field1, field2
FROM mytable1
WHERE account_id = :account_id and product_id = :product_id and version_name = :version_name
or
SELECT field1, field2
FROM mytable1
WHERE version_name = :version_name and product_id = :product_id and account_id = :account_id
An account can have many products, and a product can have millions of versions. Which order of conditions in the WHERE clause is faster? version_name is a string.
If I have version_id available, which is the primary key, should I always put that first?
Should I add an index on account_id and product_id together?
Does it get all the rows for the first condition in the WHERE clause, and then filter the result by the second condition, and so on,
or
does it scan every row for all three fields, given that no index has been added?
The order of the clauses does not matter. What matters is the order of the fields in the index: if you have an index on 1. account_id, 2. product_id, that index will only be used when the WHERE clause has a condition on account_id, the leading column.
For your queries you can put an index on account_id, product_id and version_name.
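The effect can be observed with a query plan. The sketch below uses SQLite's EXPLAIN QUERY PLAN on an in-memory database as a stand-in, since its planner follows the same leftmost-prefix idea; MySQL's own EXPLAIN output looks different and its optimizer may decide differently on real data. The column types, the index name idx_acc_prod_ver, and the helper plan are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable1 (
    version_id   INTEGER PRIMARY KEY,
    account_id   INTEGER NOT NULL,
    product_id   INTEGER NOT NULL,
    version_name TEXT NOT NULL,
    field1       TEXT,
    field2       TEXT
);
CREATE INDEX idx_acc_prod_ver ON mytable1 (account_id, product_id, version_name);
""")

def plan(sql, params):
    """Return the human-readable plan text for a query (hypothetical helper)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column holds the detail text

# Same three conditions, written in two different orders.
plan_a = plan(
    "SELECT field1, field2 FROM mytable1 "
    "WHERE account_id = ? AND product_id = ? AND version_name = ?",
    (1, 2, "v1"),
)
plan_b = plan(
    "SELECT field1, field2 FROM mytable1 "
    "WHERE version_name = ? AND product_id = ? AND account_id = ?",
    ("v1", 2, 1),
)
# Only the last index column is constrained, so the leading column is missing.
plan_c = plan(
    "SELECT field1, field2 FROM mytable1 WHERE version_name = ?",
    ("v1",),
)

print(plan_a)
print(plan_b)
print(plan_c)
```

The first two plans should both search through idx_acc_prod_ver, regardless of how the WHERE clause is written. The third has no condition on account_id, so the index cannot be used for a lookup and the table is scanned.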

MySQL query running very slow on Composite key table

I've got a composite key table CUSTOMER_PRODUCT_XREF
__________________________________________________________________
|CUSTOMER_ID (PK NN VARCHAR(191)) | PRODUCT_ID(PK NN VARCHAR(191))|
-------------------------------------------------------------------
In my batch program I need to select 500 updated customers, get the PRODUCT_IDs purchased by each customer as a comma-separated list, and update our SOLR index. In my query I'm selecting 500 customers and doing a left join to CUSTOMER_PRODUCT_XREF
SELECT
customer.*, group_concat(xref.PRODUCT_ID separator ', ')
FROM
CUSTOMER customer
LEFT JOIN CUSTOMER_PRODUCT_XREF xref ON customer.CUSTOMER_ID=xref.CUSTOMER_ID
group by customer.CUSTOMER_ID
LIMIT 500;
EDIT: EXPLAIN QUERY
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE customer ALL PRIMARY NULL NULL NULL 74236 Using where; Using temporary; Using filesort
1 SIMPLE xref index NULL PRIMARY 1532 NULL 121627 Using where; Using index; Using join buffer (Block Nested Loop)
I got lost connection exception after 20 minutes running the above query.
I tried with the following (sub query) and it took 1.7 seconds to get result but still slow.
SELECT
customer.*, (SELECT group_concat(PRODUCT_ID separator ', ')
FROM CUSTOMER_PRODUCT_XREF xref
WHERE customer.CUSTOMER_ID=xref.CUSTOMER_ID
GROUP BY customer.CUSTOMER_ID)
FROM
CUSTOMER customer
LIMIT 500;
EDIT: EXPLAIN QUERY produces
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY customer ALL NULL NULL NULL NULL 74236 NULL
2 DEPENDENT SUBQUERY xref index NULL PRIMARY 1532 NULL 121627 Using where; Using index; Using temporary; Using filesort
Question
CUSTOMER_PRODUCT_XREF already has both columns set as PRIMARY_KEY and NOT_NULL but why is my query still very slow ? I thought having Primary Key on a column was enough to build an index for it. Do I need further indexing ?
DATABASE INFO:
All the ID's in my database are VARCHAR(191) because the id's can contain alphabets.
I'm using utf8mb4_unicode_ci character encoding
I'm using SET group_concat_max_len := @@max_allowed_packet to get the maximum number of product_ids for each customer. I'd prefer using group_concat in one main query so that I don't have to do multiple separate queries to get the products for each customer.
Your original version of the query is doing the join first and then sorting all the resulting data -- which is probably pretty big given how large the fields are.
You can "fix" that version by selecting 500 customers first and then doing the join:
SELECT c.*, group_concat(xref.PRODUCT_ID separator ', ')
FROM (select c.*
from CUSTOMER c
order by c.customer_id
limit 500
) c LEFT JOIN
CUSTOMER_PRODUCT_XREF xref
ON c.CUSTOMER_ID=xref.CUSTOMER_ID
group by c.CUSTOMER_ID ;
An alternative that might or might not have a big impact would be to do the aggregation by customer in a subquery and join that, as in:
SELECT c.*, xref.products
FROM (select c.*
from CUSTOMER c
order by c.customer_id
limit 500
) c LEFT JOIN
(select customer_id, group_concat(xref.PRODUCT_ID separator ', ') as products
from CUSTOMER_PRODUCT_XREF xref
group by customer_id
) xref
ON c.CUSTOMER_ID=xref.CUSTOMER_ID;
What you have discovered is that the MySQL optimizer does not recognize this situation (where the limit has a big impact on performance). Some other database engines do a better job of optimization in this case.
Alright the speed of the queries in my question shot up when I created an index just on the CUSTOMER_ID in CUSTOMER_PRODUCT_XREF table.
So I've got two indexes now
PRIMARY_KEY_INDEX on PRODUCT_ID and CUSTOMER_ID
CUSTOMER_ID_INDEX on CUSTOMER_ID

Why is this mySQL query extremely slow?

Given is a mySQL table named "orders_products" with the following relevant fields:
products_id
orders_id
Both fields are indexed.
I am running the following query:
SELECT products_id, count( products_id ) AS counter
FROM orders_products
WHERE orders_id
IN (
SELECT DISTINCT orders_id
FROM orders_products
WHERE products_id = 85094
)
AND products_id != 85094
GROUP BY products_id
ORDER BY counter DESC
LIMIT 4
This query takes extremely long, around 20 seconds. The database is not very busy otherwise, and performs well on other queries.
I am wondering, what causes the query to be so slow?
The table is rather big (around 1,5 million rows, size around 210 mb), could this be a memory issue?
Is there a way to tell exactly what is taking mySQL so long?
Output of Explain:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY orders_products range products_id products_id 4 NULL 1577863 Using where; Using temporary; Using filesort
2 DEPENDENT SUBQUERY orders_products ref orders_id,products_id products_id 4 const 2 Using where; Using temporary
Queries that use WHERE ID IN (subquery) perform notoriously badly with MySQL.
With most cases of such queries however, it is possible to rewrite them as a JOIN, and this one is no exception:
SELECT
t2.products_id,
count(t2.products_id) AS counter
FROM orders_products t1
JOIN orders_products t2
ON t2.orders_id = t1.orders_id
AND t2.products_id != 85094
WHERE t1.products_id = 85094
GROUP BY t2.products_id
ORDER BY counter DESC
LIMIT 4
If you want to return rows where there are no other products (and show a zero count for them), change the join to a LEFT JOIN.
Note how the first instance of the table has the WHERE products_id = X, which allows an index lookup and immediately reduces the number of rows. The second instance of the table has the target data; it is looked up on the orders_id field (again fast) and filtered in the join condition so that only the other products are counted.
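One caveat worth checking before swapping IN for a JOIN: the two forms only return the same counts while each (orders_id, products_id) pair appears once. The sketch below compares both queries on an in-memory SQLite database standing in for MySQL, then inserts a duplicate order line to show the JOIN count inflating. The sample rows are assumptions, as is the premise that the table permits duplicate pairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders_products (
    orders_id   INTEGER NOT NULL,
    products_id INTEGER NOT NULL
);
CREATE INDEX idx_orders ON orders_products (orders_id);
CREATE INDEX idx_products ON orders_products (products_id);
INSERT INTO orders_products VALUES
    (1, 85094), (1, 5), (1, 6),
    (2, 85094), (2, 5),
    (3, 6), (3, 7);
""")

IN_QUERY = """
SELECT products_id, COUNT(products_id) AS counter
FROM orders_products
WHERE orders_id IN (SELECT DISTINCT orders_id
                    FROM orders_products
                    WHERE products_id = 85094)
  AND products_id != 85094
GROUP BY products_id
ORDER BY counter DESC
LIMIT 4
"""

JOIN_QUERY = """
SELECT t2.products_id, COUNT(t2.products_id) AS counter
FROM orders_products t1
JOIN orders_products t2
  ON t2.orders_id = t1.orders_id
 AND t2.products_id != 85094
WHERE t1.products_id = 85094
GROUP BY t2.products_id
ORDER BY counter DESC
LIMIT 4
"""

in_result = conn.execute(IN_QUERY).fetchall()
join_result = conn.execute(JOIN_QUERY).fetchall()

# Product 85094 now appears twice in order 1 (a duplicate order line).
conn.execute("INSERT INTO orders_products VALUES (1, 85094)")
in_after_dup = conn.execute(IN_QUERY).fetchall()
join_after_dup = conn.execute(JOIN_QUERY).fetchall()

print(in_result, join_result)
print(in_after_dup, join_after_dup)
```

If duplicate lines can occur, a unique key on (orders_id, products_id) or a DISTINCT derived table for t1 keeps the JOIN form equivalent to the IN form.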
Give these a try:
MySQL does not optimize IN with a subquery - join the tables together.
Your query contains a != condition, which is very difficult to deal with. Can you narrow down the products and use multiple lookups rather than an inequality comparison?