I have two tables named seller and item. They are connected through a third table (seller_item) using a "n" to "m" foreign key relation.
Now I am trying to meet this requirement: "As a seller, I want a list of my competitors with a count of the items that both I and they are selling".
So a list of all sellers with the count of overlapping items in relation to one specific seller.
Also I want this to be sorted by count and limited.
But the query uses a temporary table and a filesort, which is very slow.
Explain says:
Using where; Using index; Using temporary; Using filesort
How can I speed this up?
Here is the query:
SELECT
COUNT(*) AS itemCount,
s.sellerName
FROM
seller s,
seller_item si
WHERE
si.itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
AND
si.sellerId=s.id
GROUP BY
sellerName
ORDER BY
itemCount DESC
LIMIT 50;
the table defs:
CREATE TABLE `seller` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`sellerName` varchar(50) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_index` (`sellerName`)
) ENGINE=InnoDB
contains about 200.000 rows
--
CREATE TABLE `item` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`itemName` varchar(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_index` (`itemName`)
) ENGINE=InnoDB
contains about 100.000.000 rows
--
CREATE TABLE `seller_item` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`sellerId` bigint(20) unsigned NOT NULL,
`itemId` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `sellerId` (`sellerId`,`itemId`),
KEY `item_id` (`itemId`),
CONSTRAINT `fk_1` FOREIGN KEY (`sellerId`) REFERENCES `seller` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION,
CONSTRAINT `fk_2` FOREIGN KEY (`itemId`) REFERENCES `item` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=InnoDB
contains about 170.000.000 rows
The database is MySQL (Percona) 5.6.
Output of EXPLAIN:
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
| 1 | SIMPLE | s | index | PRIMARY,unique_index | unique_index | 152 | NULL | 1 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | si | ref | sellerId,item_id | sellerId | 8 | tmp.s.id | 1 | Using index |
| 1 | SIMPLE | seller_item | eq_ref | sellerId,item_id | sellerId | 16 | const,tmp.si.itemId | 1 | Using where; Using index |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
I doubt it's feasible to make a query like that run fast in realtime on a database of your size, especially for sellers with lots of popular items in stock.
You should materialize it. Create a table like this:
CREATE TABLE
matches
(
seller INT NOT NULL,
competitor INT NOT NULL,
matches INT NOT NULL,
PRIMARY KEY
(seller, competitor)
)
and update it in batches in a cron script:
DELETE
FROM matches
WHERE seller = :seller;
INSERT
INTO matches (seller, competitor, matches)
SELECT si.sellerId, sc.sellerId, COUNT(*) cnt
FROM seller_item si
JOIN seller_item sc
ON sc.itemId = si.itemId
AND sc.sellerId <> si.sellerId
WHERE si.sellerId = :seller
GROUP BY
si.sellerId, sc.sellerId
ORDER BY
cnt DESC
LIMIT 50;
You also need to make (sellerId, itemId) the PRIMARY KEY on seller_item. The way it is now, finding a seller by item requires two lookups instead of one: first the id by item using KEY (itemId), then the seller by id using the PRIMARY KEY (id).
I believe you're mistaken about your ability to eliminate the Using temporary; Using filesort steps from your query. Queries of the form
SELECT COUNT(*), grouping_value
FROM table
GROUP BY grouping_value
ORDER BY COUNT(*)
LIMIT n
always use a temporary result set, and always sort that result set. That's because the result set doesn't exist anywhere until the query runs, and it has to be sorted before the LIMIT clause can be satisfied.
"Filesort" is somewhat misnamed. It doesn't necessarily mean the sorting is happening on a file in the file system, just that a temporary result set is being sorted. If that result set is massive, the sort can spill out of RAM into the file system, but it doesn't have to. See https://www.percona.com/blog/2009/03/05/what-does-using-filesort-mean-in-mysql/ for details. Don't get distracted by the Using filesort item in your EXPLAIN results.
One of the tricks to getting better performance from this sort of query is to minimize the size of the sorted results. You've already filtered them down to the stuff you want; that's good.
But, you can still arrange to sort less stuff, by sorting just the seller.id and the count, then joining the (longer) sellerName in after you know the exact fifty rows you need. That also has the benefit of letting you do your aggregating with just the seller_item table, rather than with the resultset that comes from joining the two.
Here's what I mean. This subquery generates the list of fifty sellerId values you need. All it has to sort is the count and sellerId. That's faster than sorting the count and sellerName because there's less data, and fixed-length data, to shuffle around in the sort operation.
SELECT COUNT(*) AS itemCount,
sellerId
FROM seller_item
WHERE itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
GROUP BY sellerId
ORDER BY COUNT(*) DESC
LIMIT 50
Notice that this sorts a big result set, then discards most of it. It gives you the exact fifty seller id values you need.
You can make this even faster by filtering out more rows by adding HAVING COUNT(*) > 1 right after your GROUP BY clause, but that changes the meaning of your query and may not meet your business requirements.
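As a sketch of where that clause would go (same table and seller id as above; whether `> 1` is an acceptable threshold is an assumption about your requirements):

```sql
SELECT COUNT(*) AS itemCount,
       sellerId
  FROM seller_item
 WHERE itemId IN
       (SELECT itemId FROM seller_item WHERE sellerId = 4711)
 GROUP BY sellerId
HAVING COUNT(*) > 1        -- drop sellers that overlap on only one item
 ORDER BY COUNT(*) DESC
 LIMIT 50;
```

HAVING is applied after the grouping, so it shrinks the set that has to be sorted, not the set that has to be aggregated.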
Once you have those fifty seller ids, you can retrieve the seller names. The whole query looks like this:
SELECT s.sellerName, c.itemCount
FROM seller s
JOIN (
SELECT COUNT(*) AS itemCount, sellerId
FROM seller_item
WHERE itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
GROUP BY sellerId
ORDER BY COUNT(*) DESC
LIMIT 50
) c ON c.sellerId = s.id
ORDER BY c.itemCount DESC
Your indexing effort should be spent trying to make the inner queries fast. The outer query will be fast no matter what; it's only handling fifty rows, and using an indexed id value to look up other values.
The innermost query is SELECT itemId FROM seller_item WHERE sellerId = 4711. This will benefit greatly from your existing (sellerId, itemId) compound index: it can random-access and then scan that index, which is very quick.
The SELECT COUNT(*)... query will benefit from a (itemId, sellerId) compound index. That part of your query is the hard and slow part, but still, this index will help.
Look, others have mentioned this, and so will I. Having both a unique composite key (sellerId, itemId) and a primary key id on that seller_item table is, with respect, incredibly wasteful.
It makes your updates and inserts slower.
It means your table is organized as a tree based on the meaningless id rather than the meaningful value pair.
If you make one of the two indexes I mentioned the primary key, and create the other one without making it unique, you'll have a much more efficient table. These many-to-many join tables don't need, and should not have, surrogate keys.
Reformulation
I think this is what you really wanted:
SELECT si2.sellerId, COUNT(DISTINCT si2.itemId) AS itemCount
FROM seller_item si1
JOIN seller_item si2 ON si2.itemId = si1.itemId
WHERE si1.sellerId = 4711
GROUP BY si2.sellerId
ORDER BY itemCount DESC
LIMIT 50;
(Note: DISTINCT is probably unnecessary, because the unique key on (sellerId, itemId) means each seller/item pair can appear only once.)
In words: For seller #4711, find the items he sells, then find which other sellers sell the most of those same items. (I did not try to filter out #4711 from the result set.)
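If seller #4711 should not show up in his own competitor list, one possible variation is to exclude him in the join condition. This is a sketch, not something I have run against your data; it also uses COUNT(*) instead of COUNT(DISTINCT ...), relying on the unique key on (sellerId, itemId):

```sql
SELECT si2.sellerId, COUNT(*) AS itemCount
  FROM seller_item si1
  JOIN seller_item si2
    ON si2.itemId = si1.itemId
   AND si2.sellerId <> 4711      -- leave the seller out of his own list
 WHERE si1.sellerId = 4711
 GROUP BY si2.sellerId
 ORDER BY itemCount DESC
 LIMIT 50;
```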
More efficient N:M
But there is still an inefficiency. Let's dissect your many-to-many mapping table (seller_item).
It has an id which is probably not used for anything. Get rid of it.
Then promote UNIQUE(sellerId, itemId) to PRIMARY KEY(sellerId, itemId).
Now change INDEX(itemId) to INDEX(itemId, sellerId) so that the last stage of the query can be "using index".
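The three steps above could be combined into one ALTER, roughly like this. Treat it as an untested sketch: the index name item_seller is made up, and on a table with about 170 million rows the statement rebuilds the whole table, so try it on a copy first.

```sql
ALTER TABLE seller_item
  DROP COLUMN id,                            -- also removes the old PRIMARY KEY (id)
  DROP INDEX sellerId,                       -- the old UNIQUE (sellerId, itemId)
  ADD PRIMARY KEY (sellerId, itemId),
  ADD INDEX item_seller (itemId, sellerId),  -- hypothetical name; replaces item_id
  DROP INDEX item_id;
```

Both foreign keys still have an index that starts with their column: the new primary key for sellerId and the new secondary index for itemId.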
Blog discussing that further.
You have a very large dataset. Assuming you have debugged your app, consider removing the FOREIGN KEYs; they are somewhat costly.
Getting sellerName
It may be possible to JOIN to seller to get sellerName. But try it with just sellerId first. Then add the name. Verify that the count does not inflate (that often happens) and that the query does not slow down.
If either thing goes wrong, then do
SELECT s.sellerName, x.itemCount
FROM ( .. the above query .. ) AS x
JOIN seller AS s ON s.id = x.sellerId;
(Optionally you could add ORDER BY sellerName.)
I'm not sure how fast this would be on your database but I'd write the query like this.
select * from (
select seller.sellerName,
count(otherSellersItems.itemId) itemCount from (
select sellerId, itemId from seller_item where sellerId != 4711
) otherSellersItems
inner join (
select itemId from seller_item where sellerId = 4711
) thisSellersItems
on otherSellersItems.itemId = thisSellersItems.itemId
inner join seller
on otherSellersItems.sellerId = seller.id
group by seller.sellerName
) itemsSoldByOtherSellers
order by itemCount desc
limit 50 ;
Since we are limiting the (potentially large) resultset to at most 50 rows, I would put off getting the sellername until after we have the counts, so we only need to get 50 seller names.
First, we get the itemcount by seller_id
SELECT so.seller_id
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
For improved performance, I would make a suitable covering index available for the join to so. e.g.
CREATE UNIQUE INDEX seller_item_UX2 ON seller_item(item_id,seller_id)
By using a "covering index", MySQL can satisfy the query entirely from the index pages, without a need to visit the pages in the underlying table.
Once the new index is created, I would drop the index on the singleton item_id column, since that index is now redundant. (Any query that could make effective use of that index will be able to make effective use of the composite index which has item_id as the leading column.)
There's no getting around a "Using filesort" operation. MySQL has to evaluate the COUNT() aggregate on each row, before it can perform a sort. There's no way (given the current schema) for MySQL to return the rows in order using an index to avoid a sort operation.
Once we have that set of (at most) fifty rows, then we can get the sellername.
To get the sellername, we could either use a correlated subquery in the SELECT list, or a join operation.
1) Using a correlated subquery in SELECT list, e.g.
SELECT so.seller_id
, ( SELECT s.sellername
FROM seller s
WHERE s.seller_id = so.seller_id
ORDER BY s.seller_id, s.sellername
LIMIT 1
) AS sellername
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
(We know that subquery will be executed (at most) fifty times, once for each row returned by the outer query. Fifty executions (with a suitable index available) isn't that bad, at least compared to 50,000 executions.)
Or, 2) using a join operation, e.g.
SELECT c.seller_id
, s.sellername
, c.itemcount
FROM (
SELECT so.seller_id
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
) c
JOIN seller s
ON s.seller_id = c.seller_id
ORDER BY c.itemcount DESC, c.seller_id DESC
(Again, we know the inline view c will return (at most) fifty rows, and getting fifty sellername values (using a suitable index) should be fast.)
SUMMARY TABLE
If we denormalized the implementation, and added a summary table containing item_id (as the primary key) and a "count" of the number of sellers of that item_id, our query could take advantage of that.
As an illustration of what that might look like:
CREATE TABLE item_seller_count
( item_id BIGINT NOT NULL PRIMARY KEY
, seller_count BIGINT NOT NULL
) Engine=InnoDB
;
INSERT INTO item_seller_count (item_id, seller_count)
SELECT d.item_id
, COUNT(*)
FROM seller_item d
GROUP BY d.item_id
ORDER BY d.item_id
;
CREATE UNIQUE INDEX item_seller_count_IX1
ON item_seller_count (seller_count, item_id)
;
The new summary table will become "out of sync" when rows are inserted/updated/deleted from the seller_item table.
And populating this table would take resources. But having this available would speed up queries of the type we're working on.
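One possible way to keep the summary table in sync is a pair of triggers on seller_item. This is only a sketch: the trigger names are made up, and the column names follow the convention used in this answer rather than the original schema.

```sql
-- Hypothetical trigger names; a single-statement body needs no DELIMITER change.
CREATE TRIGGER seller_item_ai AFTER INSERT ON seller_item
FOR EACH ROW
  INSERT INTO item_seller_count (item_id, seller_count)
  VALUES (NEW.item_id, 1)
  ON DUPLICATE KEY UPDATE seller_count = seller_count + 1;

CREATE TRIGGER seller_item_ad AFTER DELETE ON seller_item
FOR EACH ROW
  UPDATE item_seller_count
     SET seller_count = seller_count - 1
   WHERE item_id = OLD.item_id;
```

Note that in MySQL, rows removed by a cascading foreign key action do not activate triggers, so with the ON DELETE CASCADE constraints on seller_item the counts can still drift. An UPDATE that changes item_id is not covered by this sketch either. A periodic rebuild from seller_item would correct both.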
I have the following query:
SELECT *
FROM s
JOIN b ON s.borrowerId = b.id
JOIN (
SELECT MIN(id) AS id
FROM tbl
WHERE dealId IS NULL
GROUP BY borrowerId, created
) s2 ON s.id = s2.id
Is there a simple way to optimize this so that I can do the JOIN directly and utilize indexes?
UPDATE
The created field is part of the GROUP BY clause because, due to limitations of our MySQL version and the ORM being used, multiple records can have the same created timestamp. As a result, I need to find the first record for each combination of borrowerId and created.
Typically I might attempt something like this:
SELECT *
FROM s
INNER JOIN b ON s.borrowerId = b.id
LEFT OUTER JOIN s2
ON s.borrowerId = s2.borrowerId
AND s.created = s2.created
AND s.id <> s2.id
AND s.id < s2.id
WHERE s2.id IS NULL
AND s.dealId IS NULL;
But I'm not sure if that works 100% the way I want.
EXPLAIN from MySQL outputs the following:
1 PRIMARY b ALL NULL NULL NULL NULL 129690
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 317751 Using join buffer
1 PRIMARY s eq_ref PRIMARY,borrowerId_2,borrowerId PRIMARY 4 s2.id 1 Using where
2 DERIVED statuses ref dealId dealId 5 183987 Using where; Using temporary; Using filesort
As you can see, it has to query a massive number of records to build the subquery data set and when joining to the derived subquery, no indexes are found and so no indexes are used.
The first query needs this composite index:
INDEX(borrowerId, created, id)
Note that MySQL rarely uses two indexes for one SELECT, but a composite index is often very handy.
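As a sketch, adding that index could look like the following. The table name is an assumption (the query calls it s and tbl, the EXPLAIN output shows statuses), and the index names are made up.

```sql
-- Assumption: the real table is `statuses`, as shown in the EXPLAIN output.
ALTER TABLE statuses
  ADD INDEX borrower_created_id (borrowerId, created, id);

-- Possible variant, since the subquery also filters on dealId IS NULL:
-- lead with dealId so the filter and the grouping can use the same index.
-- Compare both with EXPLAIN before keeping either one.
-- ALTER TABLE statuses
--   ADD INDEX deal_borrower_created_id (dealId, borrowerId, created, id);
```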
The second query seems grossly inefficient.
Please provide SHOW CREATE TABLE for each table.
I am trying to run this query:
delete FROM `customer` where customer_id
not in (SELECT distinct customer_id FROM `order`);
But it takes a long time and then times out partway through, because the customer table has more than a million records.
How can I make it fast? Is there any alternative?
edit
Here is EXPLAIN of same with SELECT:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY customer index NULL PRIMARY 4 NULL 127659 Using where; Using index
2 DEPENDENT SUBQUERY order ALL NULL NULL NULL NULL 25141 Using where
Try:
DELETE t1 FROM customer t1 LEFT JOIN `order` t2 ON t1.customer_id = t2.customer_id WHERE t2.customer_id IS NULL;
First you can remove DISTINCT from this DELETE command. It's not necessary here. So try this
delete FROM `customer` where customer_id
not in (SELECT customer_id FROM `order`);
Second, please check the execution plan for this rewritten query:
EXPLAIN SELECT customer_id FROM `customer` where customer_id
not in (SELECT customer_id FROM `order`);
This will show how MySQL executes this query and what's going on.
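The EXPLAIN output in the question shows the subquery on `order` as a dependent subquery with type ALL and no possible_keys, which means a full scan of `order` for each customer row. A sketch of one way to address that, assuming `order`.customer_id really has no index yet (the index name is made up):

```sql
-- Hypothetical index name; assumes `order`.customer_id is not yet indexed.
ALTER TABLE `order` ADD INDEX idx_order_customer_id (customer_id);

-- Equivalent delete written with NOT EXISTS; unlike NOT IN, it still
-- behaves as intended if `order`.customer_id contains NULLs.
DELETE FROM customer
 WHERE NOT EXISTS (SELECT 1
                     FROM `order` o
                    WHERE o.customer_id = customer.customer_id);
```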
Given is a MySQL table named "orders_products" with the following relevant fields:
products_id
orders_id
Both fields are indexed.
I am running the following query:
SELECT products_id, count( products_id ) AS counter
FROM orders_products
WHERE orders_id
IN (
SELECT DISTINCT orders_id
FROM orders_products
WHERE products_id = 85094
)
AND products_id != 85094
GROUP BY products_id
ORDER BY counter DESC
LIMIT 4
This query takes extremely long, around 20 seconds. The database is not very busy otherwise, and performs well on other queries.
I am wondering, what causes the query to be so slow?
The table is rather big (around 1.5 million rows, size around 210 MB). Could this be a memory issue?
Is there a way to tell exactly what is taking MySQL so long?
Output of Explain:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY orders_products range products_id products_id 4 NULL 1577863 Using where; Using temporary; Using filesort
2 DEPENDENT SUBQUERY orders_products ref orders_id,products_id products_id 4 const 2 Using where; Using temporary
Queries that use WHERE ID IN (subquery) perform notoriously badly with MySQL.
With most cases of such queries however, it is possible to rewrite them as a JOIN, and this one is no exception:
SELECT
t2.products_id,
count(t2.products_id) AS counter
FROM orders_products t1
JOIN orders_products t2
ON t2.orders_id = t1.orders_id
AND t2.products_id != 85094
WHERE t1.products_id = 85094
GROUP BY t2.products_id
ORDER BY counter DESC
LIMIT 4
If you want to return rows where there are no other products (and show a zero count for them), change the join to a LEFT JOIN.
Note how the first instance of the table carries the WHERE products_id = X condition, which allows an index lookup and immediately reduces the number of rows. The second instance holds the target data; it is looked up on the orders_id field (again fast) and filtered in the join condition so that only the other products are counted.
Give these a try:
MySQL does not optimize IN with a subquery - join the tables together.
Your query contains a != condition, which is very difficult to deal with. Can you narrow down the products and use multiple lookups rather than an inequality comparison?
I have a many-to-many query that I'd like to optimize.
What indexes should I create for it?
SELECT (SELECT COUNT(post_id)
FROM posts
WHERE post_status = 1) as total,
p.*,
GROUP_CONCAT(t.tag_name) tagged
FROM tags_relation tr
JOIN posts p ON p.post_id = tr.rel_post_id
JOIN tags t ON t.tag_id = tr.rel_tag_id
WHERE p.post_status=1
GROUP BY p.post_id
EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY p ALL PRIMARY NULL NULL NULL 5 Using where; Using filesort
You can take a look at the query execution plan using the EXPLAIN statement. This will show you whether a full table scan is happening or whether MySQL was able to find an index to retrieve the data. From that point on you can optimize further.
Edit
Based on your query execution plan, the first optimization step is to check that your tables have a primary key defined. Then you can set an index on the post_status and tag_name columns.
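As a sketch, those indexes could be added like this. The index names are made up, and the last statement is an extra assumption on my part: it only makes sense if the tags_relation link table does not already have a key covering both columns.

```sql
-- Hypothetical index names.
ALTER TABLE posts ADD INDEX idx_post_status (post_status);
ALTER TABLE tags  ADD INDEX idx_tag_name (tag_name);

-- Assumption: tags_relation has no composite key on these columns yet.
ALTER TABLE tags_relation
  ADD INDEX idx_rel_post_tag (rel_post_id, rel_tag_id);
```

Run EXPLAIN again afterwards to see which of them the optimizer actually picks up.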