Our server company advised we switch to Percona when setting up our new DB servers, so we're currently on Percona Server (GPL), Release 82.0 (version 5.6.36-82.0), Revision 58e846a. There's one behavior I'm trying to wrap my head around that we definitely weren't experiencing before with MySQL, so I thought I'd reach out:
This is a query we run fairly regularly to pull an article from our DB:
SELECT * FROM table_a a, table_b b
WHERE a.id = b.id AND a.status_field = 'open' AND b.filter_field = 'no_filter' AND b.view_field = 'article'
ORDER BY a.unixtimestamp DESC LIMIT 1
This used to complete very quickly, but under Percona the combination of the WHERE conditions from table b and the ordering from table a makes the whole query take ~3 s. I don't fully understand this behavior.
If I alter it to:
SELECT * FROM table_a a, table_b b
WHERE a.id = b.id AND a.status_field = 'open' AND b.filter_field = 'no_filter' AND b.view_field = 'article'
ORDER BY b.unixtimestamp DESC LIMIT 1
Then it completes very quickly (< 0.05 s).
Is this expected behavior with Percona?
I just wanted to know before changing any DB structure to compensate.
Edit:
For the explain, I simplified the query and it still has the same issue (id = entry_id):
Slow Query (1.5122389793):
SELECT * FROM table_a a, table_b b
WHERE a.id = b.id AND b.special_filter = 'no_filter'
ORDER BY a.id DESC LIMIT 1
Slow Query explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table_b ref PRIMARY,entry_id,special_filter special_filter 26 const 130733 Using where; Using temporary; Using filesort
1 SIMPLE table_a eq_ref PRIMARY PRIMARY 4 db_name.table_b.entry_id 1 Using index
Fast Query (0.0006549358):
SELECT * FROM table_a a, table_b b
WHERE a.id = b.id AND b.special_filter = 'no_filter'
ORDER BY b.id DESC LIMIT 1
Fast Query explain:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table_b ref PRIMARY,entry_id,special_filter special_filter 26 const 130733 Using where
1 SIMPLE table_a eq_ref PRIMARY PRIMARY 4 db_name.table_b.entry_id 1 Using index
I've tried to omit as much info from the tables as possible for security reasons, but if I'm grossly missing something I can add it back in.
table_a:
Relevant Keys:
table_a 0 PRIMARY 1 entry_id A 321147 BTREE
Create table:
table_a CREATE TABLE table_a ( entry_id int(10) unsigned NOT NULL AUTO_INCREMENT) ENGINE=InnoDB AUTO_INCREMENT=356198 DEFAULT CHARSET=utf8 DELAY_KEY_WRITE=1
table_b:
Relevant Keys:
table_b 0 PRIMARY 1 entry_id A 261467 BTREE
table_b 1 entry_id 1 entry_id A 261467 BTREE
table_b 1 special_filter 1 special_filter A 14 8 BTREE
Create Table:
table_b CREATE TABLE table_b ( entry_id int(10) unsigned NOT NULL DEFAULT '0', special_filter text NOT NULL, ) ENGINE=InnoDB DEFAULT CHARSET=utf8 DELAY_KEY_WRITE=1
Both tables have ~350k rows.
It seems to be a MySQL optimizer issue where it chooses not to join on the primary key under some conditions. It's closely related to, or the same as, this issue: https://dba.stackexchange.com/questions/53274/mysql-innodb-issue-not-using-indexes-correctly-percona-5-6
Explicitly writing the query with STRAIGHT_JOIN and the tables in a specific order solves the issue. Writing USE INDEX (PRIMARY) after the joined table's name is easier, though, and doesn't rely on the table order.
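For reference, here is roughly what those rewrites look like against the simplified query above (sketches using the same anonymized names):
-- Force the join order: drive from table_a so ORDER BY a.id DESC
-- can walk the primary key and stop at the first qualifying row.
SELECT * FROM table_a a
STRAIGHT_JOIN table_b b ON a.id = b.id
WHERE b.special_filter = 'no_filter'
ORDER BY a.id DESC LIMIT 1;

-- Or hint the index on table_b; this doesn't depend on the table order.
SELECT * FROM table_a a
JOIN table_b b USE INDEX (PRIMARY) ON a.id = b.id
WHERE b.special_filter = 'no_filter'
ORDER BY a.id DESC LIMIT 1;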
I have two tables named seller and item. They are connected through a third table (seller_item) with an n-to-m (many-to-many) foreign key relation.
Now I am trying to satisfy this requirement: "As a seller, I want a list of my competitors with a count of the items that I am selling and that they are selling as well."
So a list of all sellers with the count of overlapping items in relation to one specific seller.
Also I want this to be sorted by count and limited.
But the query uses a temporary table and a filesort, which is very slow.
Explain says:
Using where; Using index; Using temporary; Using filesort
How can I speed this up?
Here is the query:
SELECT
COUNT(*) AS itemCount,
s.sellerName
FROM
seller s,
seller_item si
WHERE
si.itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
AND
si.sellerId=s.id
GROUP BY
sellerName
ORDER BY
itemCount DESC
LIMIT 50;
the table defs:
CREATE TABLE `seller` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `sellerName` varchar(50) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_index` (`sellerName`)
) ENGINE=InnoDB
contains about 200,000 rows
--
CREATE TABLE `item` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `itemName` varchar(20) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique_index` (`itemName`)
) ENGINE=InnoDB
contains about 100,000,000 rows
--
CREATE TABLE `seller_item` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`sellerId` bigint(20) unsigned NOT NULL,
`itemId` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `sellerId` (`sellerId`,`itemId`),
KEY `item_id` (`itemId`),
CONSTRAINT `fk_1` FOREIGN KEY (`sellerId`) REFERENCES `seller` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION,
CONSTRAINT `fk_2` FOREIGN KEY (`itemId`) REFERENCES `item` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=InnoDB
contains about 170,000,000 rows
The database is Percona Server (MySQL) 5.6.
Output of EXPLAIN:
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table       | type   | possible_keys        | key          | key_len | ref                 | rows | Extra                                        |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
|  1 | SIMPLE      | s           | index  | PRIMARY,unique_index | unique_index | 152     | NULL                |    1 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | si          | ref    | sellerId,item_id     | sellerId     | 8       | tmp.s.id            |    1 | Using index                                  |
|  1 | SIMPLE      | seller_item | eq_ref | sellerId,item_id     | sellerId     | 16      | const,tmp.si.itemId |    1 | Using where; Using index                     |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
I doubt it's feasible to make a query like that run fast in realtime on a database of your size, especially for sellers with lots of popular items in stock.
You should materialize it. Create a table like this
CREATE TABLE matches
(
    seller INT NOT NULL,
    competitor INT NOT NULL,
    matches INT NOT NULL,
    PRIMARY KEY (seller, competitor)
)
and update it in batches in a cron script:
DELETE
FROM matches
WHERE seller = :seller;

INSERT
INTO matches (seller, competitor, matches)
SELECT si.sellerId, sc.sellerId, COUNT(*) cnt
FROM seller_item si
JOIN seller_item sc
    ON sc.itemId = si.itemId
    AND sc.sellerId <> si.sellerId
WHERE si.sellerId = :seller
GROUP BY si.sellerId, sc.sellerId
ORDER BY cnt DESC
LIMIT 50;
You also need to make (sellerId, itemId) the PRIMARY KEY on seller_item. The way it is now, finding a seller by item requires two lookups instead of one: first the id by itemId using KEY item_id (itemId), then the seller row by id using the PRIMARY KEY (id).
I believe you're under a misimpression about your ability to eliminate the Using temporary; Using filesort steps to satisfy your query. Queries of the form
SELECT COUNT(*), grouping_value
FROM table
GROUP BY grouping_value
ORDER BY COUNT(*)
LIMIT n
always use a temporary in-memory result set, and always sort that resultset. That's because the result set doesn't exist anywhere until the query runs, and it has to be sorted before the LIMIT clause can be satisfied.
"Filesort" is somewhat misnamed. It doesn't necessarily mean the sorting is happening on a file in the file system, just that a temporary resultset is being sorted. If that resultset is massive, the sort can spill out of RAM into the filesystem, but it doesn't have to. Please read this. https://www.percona.com/blog/2009/03/05/what-does-using-filesort-mean-in-mysql/ Don't get distracted by the Using filesort item in your EXPLAIN results.
One of the tricks to getting better performance from this sort of query is to minimize the size of the sorted results. You've already filtered them down to the stuff you want; that's good.
But, you can still arrange to sort less stuff, by sorting just the seller.id and the count, then joining the (longer) sellerName in after you know the exact fifty rows you need. That also has the benefit of letting you do your aggregating with just the seller_item table, rather than with the resultset that comes from joining the two.
Here's what I mean. This subquery generates the list of fifty sellerId values you need. All it has to sort is the count and sellerId. That's faster than sorting the count and sellerName because there's less data, and fixed-length data, to shuffle around in the sort operation.
SELECT COUNT(*) AS itemCount,
sellerId
FROM seller_item
WHERE itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
GROUP BY sellerId
ORDER BY COUNT(*) DESC
LIMIT 50
Notice that this sorts a big result set, then discards most of it. It gives you the exact fifty seller id values you need.
You can make this even faster by filtering out more rows by adding HAVING COUNT(*) > 1 right after your GROUP BY clause, but that changes the meaning of your query and may not meet your business requirements.
Once you have those fifty items, you can retrieve the seller names. The whole query looks like this:
SELECT s.sellerName, c.itemCount
FROM seller s
JOIN (
SELECT COUNT(*) AS itemCount, sellerId
FROM seller_item
WHERE itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
GROUP BY sellerId
ORDER BY COUNT(*) DESC
LIMIT 50
) c ON c.sellerId = s.id
ORDER BY c.itemCount DESC
Your indexing effort should be spent trying to make the inner queries fast. The outer query will be fast no matter what; it's only handling fifty rows, and using an indexed id value to look up other values.
The innermost query is SELECT itemId FROM seller_item WHERE sellerId = 4711. This will benefit greatly from your existing (sellerId, itemId) compound index: it can random-access and then scan that index, which is very quick.
The SELECT COUNT(*)... query will benefit from a (itemId, sellerId) compound index. That part of your query is the hard and slow part, but still, this index will help.
Look, others have mentioned this, and so will I. Having both a unique composite key (sellerId, itemId) and a primary key id on that seller_item table is, with respect, incredibly wasteful.
It makes your updates and inserts slower.
It means your table is organized as a tree based on the meaningless id rather than the meaningful value pair.
If you make one of the two indexes I mentioned the primary key, and create the other one without making it unique, you'll have a much more efficient table. These many-to-many join tables don't need, and should not have, surrogate keys.
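As a sketch, that restructuring might look like the statement below. It assumes nothing in your application depends on seller_item.id, and it should be rehearsed on a copy first: rebuilding the clustered index of a 170-million-row table is expensive, and the foreign keys need a supporting index at all times (the new keys provide one for each).
ALTER TABLE seller_item
    DROP PRIMARY KEY,
    DROP COLUMN id,        -- the surrogate key; nothing should reference it
    DROP KEY sellerId,     -- the old UNIQUE(sellerId, itemId) becomes the PK
    ADD PRIMARY KEY (sellerId, itemId),
    DROP KEY item_id,
    ADD KEY item_seller (itemId, sellerId);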
Reformulation
I think this is what you really wanted:
SELECT si2.sellerId, COUNT(DISTINCT si2.itemId) AS itemCount
FROM seller_item si1
JOIN seller_item si2 ON si2.itemId = si1.itemId
WHERE si1.sellerId = 4711
GROUP BY si2.sellerId
ORDER BY itemCount DESC
LIMIT 50;
(Note: DISTINCT is probably unnecessary.)
In words: For seller #4711, find the items he sells, then find which sellers are selling nearly the same set of items. (I did not try to filter out #4711 from the resultset.)
More efficient N:M
But there is still an inefficiency. Let's dissect your many-to-many mapping table (seller_item).
It has an id which is probably not used for anything. Get rid of it.
Then promote UNIQUE(sellerId, itemId) to PRIMARY KEY(sellerId, itemId).
Now change INDEX(itemId) to INDEX(itemId, sellerId) so that the last stage of the query can be "using index".
Blog discussing that further.
You have a very large dataset; you have debugged your app. Consider removing the FOREIGN KEYs; they are somewhat costly.
Getting sellerName
It may be possible to JOIN to seller to get sellerName. But try it with just sellerId first. Then add the name. Verify that the count does not inflate (that often happens) and that the query does not slow down.
If either thing goes wrong, then do
SELECT s.sellerName, x.itemCount
    FROM ( .. the above query .. ) AS x
    JOIN seller AS s ON s.id = x.sellerId;
(Optionally you could add ORDER BY sellerName.)
I'm not sure how fast this would be on your database but I'd write the query like this.
select * from (
select seller.sellerName,
count(otherSellersItems.itemId) itemCount from (
select sellerId, itemId from seller_item where sellerId != 4711
) otherSellersItems
inner join (
select itemId from seller_item where sellerId = 4711
) thisSellersItems
on otherSellersItems.itemId = thisSellersItems.itemId
inner join seller
on otherSellersItems.sellerId = seller.id
group by seller.sellerName
) itemsSoldByOtherSellers
order by itemCount desc
limit 50 ;
Since we are limiting the (potentially large) resultset to at most 50 rows, I would put off getting the sellername until after we have the counts, so we only need to get 50 seller names.
First, we get the itemcount by seller_id
SELECT so.seller_id
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
For improved performance, I would make a suitable covering index available for the join to so, e.g.:
CREATE UNIQUE INDEX seller_item_UX2 ON seller_item(item_id,seller_id)
By using a "covering index", MySQL can satisfy the query entirely from the index pages, without a need to visit the pages in the underlying table.
Once the new index is created, I would drop the index on the singleton item_id column, since that index is now redundant. (Any query that could make effective use of that index will be able to make effective use of the composite index which has item_id as the leading column.)
There's no getting around a "Using filesort" operation. MySQL has to evaluate the COUNT() aggregate on each row, before it can perform a sort. There's no way (given the current schema) for MySQL to return the rows in order using an index to avoid a sort operation.
Once we have that set of (at most) fifty rows, then we can get the sellername.
To get the sellername, we could either use a correlated subquery in the SELECT list, or a join operation.
1) Using a correlated subquery in SELECT list, e.g.
SELECT so.seller_id
, ( SELECT s.sellername
FROM seller s
WHERE s.seller_id = so.seller_id
ORDER BY s.seller_id, s.sellername
LIMIT 1
) AS sellername
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
(We know that subquery will be executed (at most) fifty times, once for each row returned by the outer query. Fifty executions (with a suitable index available) isn't that bad, at least compared to 50,000 executions.)
Or, 2) using a join operation, e.g.
SELECT c.seller_id
, s.sellername
, c.itemcount
FROM (
SELECT so.seller_id
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
) c
JOIN seller s
ON s.seller_id = c.seller_id
ORDER BY c.itemcount DESC, c.seller_id DESC
(Again, we know the inline view c will return (at most) fifty rows, and getting fifty sellername values (using a suitable index) should be fast.)
SUMMARY TABLE
If we denormalized the implementation and added a summary table containing item_id (as the primary key) and a count of the number of sellers of that item_id, our query could take advantage of that.
As an illustration of what that might look like:
CREATE TABLE item_seller_count
( item_id BIGINT NOT NULL PRIMARY KEY
, seller_count BIGINT NOT NULL
) Engine=InnoDB
;
INSERT INTO item_seller_count (item_id, seller_count)
SELECT d.item_id
, COUNT(*)
FROM seller_item d
GROUP BY d.item_id
ORDER BY d.item_id
;
CREATE UNIQUE INDEX item_seller_count_IX1
ON item_seller_count (seller_count, item_id)
;
The new summary table will become "out of sync" when rows are inserted/updated/deleted from the seller_item table.
And populating this table would take resources. But having this available would speed up queries of the type we're working on.
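One way to keep it approximately in sync without full rebuilds would be a pair of triggers on seller_item (a sketch, following this answer's snake_case column naming; at this row count a periodic batch refresh may well be cheaper):
CREATE TRIGGER seller_item_ai AFTER INSERT ON seller_item
FOR EACH ROW
    INSERT INTO item_seller_count (item_id, seller_count)
    VALUES (NEW.item_id, 1)
    ON DUPLICATE KEY UPDATE seller_count = seller_count + 1;

CREATE TRIGGER seller_item_ad AFTER DELETE ON seller_item
FOR EACH ROW
    UPDATE item_seller_count
       SET seller_count = seller_count - 1
     WHERE item_id = OLD.item_id;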
I have the following query:
SELECT *
FROM s
JOIN b ON s.borrowerId = b.id
JOIN (
SELECT MIN(id) AS id
FROM tbl
WHERE dealId IS NULL
GROUP BY borrowerId, created
) s2 ON s.id = s2.id
Is there a simple way to optimize this so that I can do the JOIN directly and utilize indexes?
UPDATE
The created field is part of the GROUP BY clause because, due to limitations of our version of MySQL and the ORM being used, it is possible to have multiple records with the same created timestamp value. As a result I need to find the first record for each combination of borrowerId and created.
Typically I might attempt something like this:
SELECT *
FROM s
INNER JOIN b ON s.borrowerId = b.id
LEFT OUTER JOIN s AS s2
ON s.borrowerId = s2.borrowerId
AND s.created = s2.created
AND s2.id < s.id
WHERE s2.id IS NULL
AND s.dealId IS NULL;
But I'm not sure if that works 100% the way I want.
EXPLAIN from MySQL outputs the following:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY b ALL NULL NULL NULL NULL 129690
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 317751 Using join buffer
1 PRIMARY s eq_ref PRIMARY,borrowerId_2,borrowerId PRIMARY 4 s2.id 1 Using where
2 DERIVED statuses ref dealId dealId 5 183987 Using where; Using temporary; Using filesort
As you can see, it has to query a massive number of records to build the subquery data set and when joining to the derived subquery, no indexes are found and so no indexes are used.
The first query needs this composite index:
INDEX(borrowerId, created, id)
Note that MySQL rarely uses two indexes for one SELECT, but a composite index is often very handy.
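Assuming the derived table's source is the (anonymized) tbl, that would be something like:
ALTER TABLE tbl ADD INDEX borrower_created_id (borrowerId, created, id);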
The second query seems grossly inefficient.
Please provide SHOW CREATE TABLE for each table.
I've got a composite key table CUSTOMER_PRODUCT_XREF
+----------------------------------+---------------------------------+
| CUSTOMER_ID (PK NN VARCHAR(191)) | PRODUCT_ID (PK NN VARCHAR(191)) |
+----------------------------------+---------------------------------+
In my batch program I need to select 500 updated customers, get the PRODUCT_IDs purchased by each CUSTOMER as a comma-separated list, and update our SOLR index. In my query I'm selecting 500 customers and doing a left join to CUSTOMER_PRODUCT_XREF:
SELECT
customer.*, group_concat(xref.PRODUCT_ID separator ', ')
FROM
CUSTOMER customer
LEFT JOIN CUSTOMER_PRODUCT_XREF xref ON customer.CUSTOMER_ID=xref.CUSTOMER_ID
group by customer.CUSTOMER_ID
LIMIT 500;
EDIT: EXPLAIN QUERY
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE customer ALL PRIMARY NULL NULL NULL 74236 Using where; Using temporary; Using filesort
1 SIMPLE xref index NULL PRIMARY 1532 NULL 121627 Using where; Using index; Using join buffer (Block Nested Loop)
I got a lost-connection exception after the above query had been running for 20 minutes.
I tried the following (using a subquery) instead; it took 1.7 seconds to get the result, but that is still slow.
SELECT
customer.*, (SELECT group_concat(PRODUCT_ID separator ', ')
FROM CUSTOMER_PRODUCT_XREF xref
WHERE customer.CUSTOMER_ID=xref.CUSTOMER_ID
GROUP BY customer.CUSTOMER_ID)
FROM
CUSTOMER customer
LIMIT 500;
EDIT: EXPLAIN QUERY produces
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY customer ALL NULL NULL NULL NULL 74236 NULL
2 DEPENDENT SUBQUERY xref index NULL PRIMARY 1532 NULL 121627 Using where; Using index; Using temporary; Using filesort
Question
CUSTOMER_PRODUCT_XREF already has both columns set as PRIMARY KEY and NOT NULL, so why is my query still very slow? I thought having a primary key on a column was enough to build an index for it. Do I need further indexing?
DATABASE INFO:
All the IDs in my database are VARCHAR(191) because the IDs can contain letters.
I'm using the utf8mb4_unicode_ci collation.
I'm using SET group_concat_max_len := @@max_allowed_packet to allow the maximum number of product IDs for each customer. I prefer using group_concat in one main query so that I don't have to run multiple separate queries to get the products for each customer.
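For example, applied per session before the batch query runs:
SET SESSION group_concat_max_len = @@max_allowed_packet;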
Your original version of the query is doing the join first and then sorting all the resulting data -- which is probably pretty big given how large the fields are.
You can "fix" that version by selecting 500 hundred customers first and then doing the join:
SELECT c.*, group_concat(xref.PRODUCT_ID separator ', ')
FROM (select c.*
      from CUSTOMER c
      order by c.customer_id
      limit 500
     ) c LEFT JOIN
     CUSTOMER_PRODUCT_XREF xref
     ON c.CUSTOMER_ID = xref.CUSTOMER_ID
group by c.CUSTOMER_ID;
An alternative that might or might not have a big impact would be to do the aggregation by customer in a subquery and join to that, as in:
SELECT c.*, xref.products
FROM (select c.*
      from CUSTOMER c
      order by c.customer_id
      limit 500
     ) c LEFT JOIN
     (select customer_id, group_concat(xref.PRODUCT_ID separator ', ') as products
      from CUSTOMER_PRODUCT_XREF xref
      group by customer_id
     ) xref
     ON c.CUSTOMER_ID = xref.CUSTOMER_ID;
What you have discovered is that the MySQL optimizer does not recognize this situation (where the limit has a big impact on performance). Some other database engines do a better job of optimization in this case.
Alright, the speed of the queries in my question shot up when I created an index on just the CUSTOMER_ID column of the CUSTOMER_PRODUCT_XREF table.
So I've got two indexes now:
PRIMARY_KEY_INDEX on PRODUCT_ID and CUSTOMER_ID
CUSTOMER_ID_INDEX on CUSTOMER_ID
Since the composite primary key apparently leads with PRODUCT_ID, a join on CUSTOMER_ID alone could not use it; the dedicated single-column index gives the join a usable access path.
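For reference, the added index would have been created with something like this (the index name is illustrative):
CREATE INDEX CUSTOMER_ID_INDEX ON CUSTOMER_PRODUCT_XREF (CUSTOMER_ID);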
I hold a set of nodes in one MySQL table (table1) and a set of edges in another (table2). Nodes have primary keys, and edges reference them as "foreign keys".
**table1**
id  label
1   node1
2   node2
3   node3

**table2**
FK_first  FK_second  rel
1         3          guardian
2         1          guardian
1         3          times
I know the DB design is not perfect, but it's simple...
Now I want the number of 'rel' entries for every node, with a query like:
SELECT
label,
COUNT( rel ) as freq
FROM
`table1`
LEFT JOIN table2 ON (id=FK_first OR id=FK_second)
GROUP BY label
ORDER BY freq DESC
I have about 1000 nodes and 2000 edges. A query joining on only one key, e.g. ON (id=FK_first), is way faster (<1 sec); with ON (id=FK_first OR id=FK_second) the query needs about 6 sec, which is very slow.
I would appreciate some comments to speed this up a bit :-)
LEFT JOIN table2 ON (id=FK_first OR id=FK_second) ~6 sec
LEFT JOIN table2 ON (id=FK_first) ~0.16 sec
LEFT JOIN table2 ON (id=FK_second) ~0.16 sec
LEFT JOIN table2 ON id IN (FK_first,FK_second) ~6 sec
EXPLAIN 1:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table1 ALL NULL NULL NULL NULL 2571 Using temporary; Using filesort
1 SIMPLE table2 ALL FK_first,FK_second,FK_first_2 NULL NULL NULL 3858
EXPLAIN 2:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table1 index NULL PRIMARY 2 NULL 2571 Using index; Using temporary; Using filesort
1 SIMPLE table2 ref FK_first,FK_first_2 FK_first_2 4 table1.id 1
Try doing two joins and moving the "OR" into the COUNT() function:
For every row, this joins table2 once on FK_first, then again on FK_second (if it is not already joined to that row via FK_first). Then in the COUNT, we count only the rows where either join's rel column is non-null.
SELECT
    label,
    -- COUNT ignores NULLs, so this counts rows where either join matched
    COUNT( COALESCE(table2A.rel, table2B.rel) ) AS freq
FROM
    `table1`
LEFT JOIN
    table2 AS table2A
    ON id = table2A.FK_first
LEFT JOIN
    table2 AS table2B
    ON id = table2B.FK_second
    -- null-safe inequality: still join via FK_second when the first join found nothing
    AND NOT table2A.FK_first <=> table2B.FK_first
GROUP BY label
ORDER BY freq DESC