How to count duplicates by multiple records in subtable - mysql

Say I have the following table structure:
products
id | name | price
products_ean
id | product_id | ean
A product can (unfortunately) have multiple EAN numbers. Two products can have one or more of the same EAN numbers.
What is the best practice to count the amount of duplicate products by comparing multiple EAN numbers from the products_ean table?
I've tried something like the following, but that makes the query really slower:
SELECT
`products`.`name`,
(
SELECT
COUNT(*)
FROM
`products_ean`
WHERE
`ean` IN(
SELECT
`ean`
FROM
`products_ean`
WHERE
`product_id` = `products`.`id`
) AND `products_ean`.`product_id` != `products`.`id`
GROUP BY `product_id`
) AS `ProductEANCount`
FROM
`products`
LIMIT 12

Using joins is the simplest way to generate related information. I've GROUP BY the product.id which means the eans are the aggregated field because those are the only ones that can duplicate. I've added HAVING part after the query to select only those results with 2 or more (its optional).
SELECT p.id, name, price, count(ean) as eans
FROM products p
JOIN products_ean e
ON p.id = e.product_id
GROUP BY p.id
HAVING eans >= 2
On query efficiency, having the product_id,ean as a composite primary key for the products_ean table is probably most efficient. Since that is unique its not obvious why the products_ean.id column is needed.

Related

How to match up dates on 2 different tables and join to an ID table?

I have a table full of product ids and their attributes. I want to join sales data and receipt data from 2 different tables and a different row for each id and date combo. So I want the output to look like this:
I tried joining the product id table and sales table to the receipt table but I'm not sure how to get the dates from the sales and receipts table to match up. Not sure of how to approach this. Thanks!
Calculate the counts for each table and combine them usung UNION ALL
select
product_id
,sales_date
-- combine the counts from both tables
,sum(sales_count)
,sum(receipt_count)
from
(
-- get the counts for the sales table
select
product_id
,sales_date
,count(*) as sales_count
-- UNION needs the same number of columns in both Select -> adding a dummy column
-- 1st Select determines the datatype
,cast(0 as int) as receipt_count
from sales
group by product_id, sales_date
UNION ALL
-- get the counts for the receipts table
select
product_id
,receipt_date
,0
,count(*)
from receipts
group by product_id, receipt_date
) as dt
group by product_id, receipt_date
select p.product_id, s.sales_date, s.sales_count, r.receipt_count
from
products p,
(select count(*) sales_count, sales_date, product_id from sales group by 2,3) s
(select count(*) receipt_count, receipt_date, product_id from receipts group by 2,3) r
where
p.product_id = s.product_id
and p.product_id = r.product_id
and s.sales_date=r.receipt_date
;

how to speed up query when using count and group by

I have two tables named seller and item. They are connected through a third table (seller_item) using a "n" to "m" foreign key relation.
Now I a try to answer the requirement: "I as a seller want a list of my competitors with a count of items I am selling and they are selling as well".
So a list of all sellers with the count of overlapping items in relation to one specific seller.
Also I want this to be sorted by count and limited.
But the query is using temp table and filesort which is very slow.
Explain says:
Using where; Using index; Using temporary; Using filesort
How can I speed this up ?
Here is the query:
SELECT
COUNT(*) AS itemCount,
s.sellerName
FROM
seller s,
seller_item si
WHERE
si.itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
AND
si.sellerId=s.id
GROUP BY
sellerName
ORDER BY
itemCount DESC
LIMIT 50;
the table defs:
CREATE TABLE `seller` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`sellerName` varchar(50) NOT NULL
PRIMARY KEY (`id`),
UNIQUE KEY `unique_index` (`sellerName`),
) ENGINE=InnoDB
contains about 200.000 rows
--
CREATE TABLE `item` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`itemName` varchar(20) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `unique_index` (`itemName`),
) ENGINE=InnoDB
contains about 100.000.000 rows
--
CREATE TABLE `seller_item` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`sellerId` bigint(20) unsigned NOT NULL,
`itemId` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `sellerId` (`sellerId`,`itemId`),
KEY `item_id` (`itemId`),
CONSTRAINT `fk_1` FOREIGN KEY (`sellerId`) REFERENCES `seller` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION,
CONSTRAINT `fk_2` FOREIGN KEY (`itemId`) REFERENCES `item` (`id`) ON DELETE CASCADE ON UPDATE NO ACTION
) ENGINE=InnoDB
contains about 170.000.000 rows
Database is Mysql Percona 5.6
Output of EXPLAIN:
+----+-------------+-------------+--------+----------------------+----- ---------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
| 1 | SIMPLE | s | index | PRIMARY,unique_index | unique_index | 152 | NULL | 1 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | si | ref | sellerId,item_id | sellerId | 8 | tmp.s.id | 1 | Using index |
| 1 | SIMPLE | seller_item | eq_ref | sellerId,item_id | sellerId | 16 | const,tmp.si.itemId | 1 | Using where; Using index |
+----+-------------+-------------+--------+----------------------+--------------+---------+---------------------+------+----------------------------------------------+
I doubt it's feasible to make a query like that run fast in realtime on a database of your size, especially for sellers with lots of popular items in stock.
You should materialize it. Create a table like this
CREATE TABLE
matches
(
seller INT NOT NULL,
competitor INT NOT NULL,
matches INT NOT NULL,
PRIMARY KEY
(seller, competitor)
)
and update it in batches in a cron script:
DELETE
FROM matches
WHERE seller = :seller
INSERT
INTO matches (seller, competitor, matches)
SELECT si.seller, sc.seller, COUNT(*) cnt
FROM seller_item si
JOIN seller_item sc
ON sc.item = si.item
AND sc.seller <> si.seller
WHERE si.seller = :seller
GROUP BY
si.seller, sc.seller
ORDER BY
cnt DESC
LIMIT 50
You also need to make (seller, item) the PRIMARY KEY on seller_item. The way it is now, finding a seller by item requires two lookups instead of one: first id by item using KEY (item), then seller by id using the PRIMARY KEY (id)
I believe you're under a misimpression about your ability to eliminate the Using temporary; Using filesort steps to satisfy your query. Queries of the form
SELECT COUNT(*), grouping_value
FROM table
GROUP BY grouping_value
ORDER BY COUNT(*)
LIMIT n
always use a temporary in-memory result set, and always sort that resultset. That's because the result set doesn't exist anywhere until the query runs, and it has to be sorted before the LIMIT clause can be satisfied.
"Filesort" is somewhat misnamed. It doesn't necessarily mean the sorting is happening on a file in the file system, just that a temporary resultset is being sorted. If that resultset is massive, the sort can spill out of RAM into the filesystem, but it doesn't have to. Please read this. https://www.percona.com/blog/2009/03/05/what-does-using-filesort-mean-in-mysql/ Don't get distracted by the Using filesort item in your EXPLAIN results.
One of the tricks to getting better performance from this sort of query is to minimize the size of the sorted results. You've already filtered them down to the stuff you want; that's good.
But, you can still arrange to sort less stuff, by sorting just the seller.id and the count, then joining the (longer) sellerName in after you know the exact fifty rows you need. That also has the benefit of letting you do your aggregating with just the seller_item table, rather than with the resultset that comes from joining the two.
Here's what I mean. This subquery generates the list of fifty sellerId values you need. All it has to sort is the count and sellerId. That's faster than sorting the count and sellerName because there's less data, and fixed-length data, to shuffle around in the sort operation.
SELECT COUNT(*) AS itemCount,
sellerId
FROM seller_item
WHERE itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
GROUP BY SellerId
ORDER BY COUNT(*) DESC
LIMIT 50
Notice that this sorts a big result set, then discards most of it. It gives you the exact fifty seller id values you need.
You can make this even faster by filtering out more rows by adding HAVING COUNT(*) > 1 right after your GROUP BY clause, but that changes the meaning of your query and may not meet your business requirements.
Once you have those fifty items, you can retrieve the seller names. The whole query looks like this:
SELECT s.sellerName, c.itemCount
FROM seller s
JOIN (
SELECT COUNT(*) AS itemCount, sellerId
FROM seller_item
WHERE itemId IN
(SELECT itemId FROM seller_item WHERE sellerId = 4711)
GROUP BY SellerId
ORDER BY COUNT(*) DESC
LIMIT 50
) c ON c.sellerId = s.id
ORDER BY c.itemCount DESC
Your indexing effort should be spent trying to make the inner queries fast. The outer query will be fast no matter what; it's only handling fifty rows, and using an indexed id value to look up other values.
The inmost query is SELECT itemId FROM seller_item WHERE sellerId = 4711. This will benefit greatly from your existing (sellerId, itemId) compound index: it can random-access and then scan that index, which is very quick.
The SELECT COUNT(*)... query will benefit from a (itemId, sellerId) compound index. That part of your query is the hard and slow part, but still, this index will help.
Look, others have mentioned this, and so will I. Having both a unique composite key (sellerId, itemId) and a primary key id on that seller_item table is, with respect, incredibly wasteful.
It makes your updates and inserts slower.
It means your table is organized as a tree based on the meaningless id rather than the meaningful value pair.
If you make one of the two indexes I mentioned the primary key, and create the other one without making it unique, you'll have a much more efficient table. These many-to-many join tables don't need, and should not have, surrogate keys.
Reformulation
I think this is what you really wanted:
SELECT si2.sellerId, COUNT(DISTINCT si2.itemId) AS itemCount
FROM seller_item si1
JOIN seller_item si2 ON si2.itemId = si1.itemId
WHERE si1.sellerId = 4711
GROUP BY si2.sellerId
ORDER BY itemCount DESC
LIMIT 50;
(Note: DISTINCT is probably unnecessary.)
In words: For seller #4711, find the items he sells, then find which sellers are selling nearly the same set of items. (I did not try to filter out #4711 from the resultset.)
More efficient N:M
But there is still an inefficiency. Let's dissect your many-to-many mapping table (seller_item).
It has an id which is probably not used for anything. Get rid of it.
Then promote UNIQUE(sellerId, itemId) to PRIMARY KEY(sellerId, itemId).
Now change INDEX(itemId) to INDEX(itemId, sellerId) so that the last stage of the query can be "using index".
Blog discussing that further.
You have a very large dataset; you have debugged your app. Consider removing the FOREIGN KEYs; they are somewhat costly.
Getting sellerName
It may be possible to JOIN to sellers to get sellerName. But try it with just sellerId first. Then add the name. Verify that the count does not inflate (that often happens) and that the query does not slow down.
If either thing goes wrong, then do
SELECT s.sellerName, x.itemCount
FROM ( .. the above query .. ) AS x
JOIN sellers AS s USING(sellerId);
(Optionally you could add ORDER BY sellerName.)
I'm not sure how fast this would be on your database but I'd write the query like this.
select * from (
select seller.sellerName,
count(otherSellersItems.itemId) itemCount from (
select sellerId, itemId from seller_item where sellerId != 4711
) otherSellersItems
inner join (
select itemId from seller_item where sellerId = 4711
) thisSellersItems
on otherSellersItems.itemId = thisSellersItems.itemId
inner join seller
on otherSellersItems.sellerId = seller.id
group by seller.sellerName
) itemsSoldByOtherSellers
order by itemCount desc
limit 50 ;
Since we are limiting the (potentially large) resultset to at most 50 rows, I would put off getting the sellername until after we have the counts, so we only need to get 50 seller names.
First, we get the itemcount by seller_id
SELECT so.seller_id
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
For improved performance, I would make a suitable covering index available for the join to so. e.g.
CREATE UNIQUE INDEX seller_item_UX2 ON seller_item(item_id,seller_id)
By using a "covering index", MySQL can satisfy the query entirely from the index pages, without a need to visit the pages in the underlying table.
Once the new index is created, I would drop the index on the singleton item_id column, since that index is now redundant. (Any query that could make effective use of that index will be able to make effective use of the composite index which has item_id as the leading column.)
There's no getting around a "Using filesort" operation. MySQL has to evaluate the COUNT() aggregate on each row, before it can perform a sort. There's no way (given the current schema) for MySQL to return the rows in order using an index to avoid a sort operation.
Once we have that set of (at most) fifty rows, then we can get the sellername.
To get the sellername, we could either use a correlated subquery in the SELECT list, or a join operation.
1) Using a correlated subquery in SELECT list, e.g.
SELECT so.seller_id
, ( SELECT s.sellername
FROM seller s
WHERE s.seller_id = so.seller_id
ORDER BY s.seller_id, s.sellername
LIMIT 1
) AS sellername
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
(We know that subquery will be executed (at most) fifty times, once for each row returned by the outer query. Fifty executions (with a suitable index available) isn't that bad, at least compared to 50,000 executions.)
Or, 2) using a join operation, e.g.
SELECT c.seller_id
, s.sellername
, c.itemcount
FROM (
SELECT so.seller_id
, COUNT(*) AS itemcount
FROM seller_item si
JOIN seller_item so
ON so.item_id = si.item_id
WHERE si.seller_id = 4711
GROUP BY so.seller_id
ORDER BY COUNT(*) DESC, so.seller_id DESC
LIMIT 50
) c
JOIN seller s
ON s.seller_id = c.seller_id
ORDER BY c.itemcount DESC, c.seller_id DESC
(Again, we know the the inline view c will return (at most) fifty rows, and getting fifty sellername (using a suitable index) should be fast.
SUMMARY TABLE
If we denormalized the implementation, and added summary table containing item_id (as the primary key) and a "count" of the number of sellers of that item_id, our query could take advantage of that.
As an illustration of what that might look like:
CREATE TABLE item_seller_count
( item_id BIGINT NOT NULL PRIMARY KEY
, seller_count BIGINT NOT NULL
) Engine=InnoDB
;
INSERT INTO item_seller_count (item_id, seller_count)
SELECT d.item_id
, COUNT(*)
FROM seller_item d
GROUP BY d.item_id
ORDER BY d.item_id
;
CREATE UNIQUE INDEX item_seller_count_IX1
ON item_seller_count (seller_count, item_id)
;
The new summary table will become "out of sync" when rows are inserted/updated/deleted from the seller_item table.
And populating this table would take resources. But having this available would speed up queries of the type we're working on.

Single query instead of multiple select statements

Hello I am stuck in a SQL statement situation. Can anyone help me with the statement:
Table 1.
Customers
customer_id | customer_name | custom_url | customer_desc
Table 2.
Categories
category_id | customer_id | category_name
select customer_id where category_name="Realtor";
gives me a list of customer ids. Now I use them to find all their names, desc and url from the customers table.
I need to do this in a single query and process the customer details to display on the UX. I am really stuck how to do this.
Here is what i have (which I know is wrong)
select * from customers where customer_id = Loop (select customer_id from categories where category_name="Realtor");
Modify your query to a JOIN query like
select c.*
from customers c
join categories ca on c.customer_id = ca.customer_id
where ca.category_name='Realtor';
You can do what you want using IN or ANY or EXISTS:
select *
from customers
where customer_id = ANY (select customer_id
from categories
where category_name = 'Realtor'
);
Or:
select *
from customers
where customer_id IN (select customer_id
from categories
where category_name = 'Realtor'
);

Rewrite this IN query using JOIN for speedy results

I have this query which works but it takes 30 mintues to calculate. I know IN is slow, but looking for a join alternative.
SELECT *, COUNT(*) as Results from member_preferences_products_data
WHERE member_preferences_products_data.Member_ID IN (SELECT Member_ID from member_preferences_products_data WHERE Product_ID = '623')
GROUP by Product_ID
ORDER by Results asc
LIMIT 10
A direct replacement for you query would be
SELECT *, COUNT(*) as Results
FROM member_preferences_products_data
INNER JOIN
(
SELECT DISTINCT Member_ID
FROM member_preferences_products_data
WHERE Product_ID = '623'
) Sub1
ON member_preferences_products_data.Member_ID = Sub1.Member_ID
GROUP by Product_ID
ORDER by Results asc
LIMIT 10
However I am a bit confused about what you are trying to find. It seems you want a count of the rows for each product id where that product id has been bought by a member who bought product id 623.
Also, if the field Product_ID a string or an INT. If an INT the quotes are not required, but if it is a string then it will probably be slower than an INT field assuming the values are numeric

MySQL select join where AND where

I have two tables in my database:
Products
id (int, primary key)
name (varchar)
ProductTags
product_id (int)
tag_id (int)
I would like to select products having all given tags. I tried:
SELECT
*
FROM
Products
JOIN ProductTags ON Products.id = ProductTags.product_id
WHERE
ProductTags.tag_id IN (1, 2, 3)
GROUP BY
Products.id
But it gives me products having any of given tags, instead of having all given tags. Writing WHERE tag_id = 1 AND tag_id = 2 is pointless, because no rows will be returned.
This type of problem is known as relational division
SELECT Products.*
FROM Products
JOIN ProductTags ON Products.id = ProductTags.product_id
WHERE ProductTags.tag_id IN (1,2,3)
GROUP BY Products.id /*<--This is OK in MySQL other RDBMSs
would want the whole SELECT list*/
HAVING COUNT(DISTINCT ProductTags.tag_id) = 3 /*Assuming that there is a unique
constraint on product_id,tag_id you
don't need the DISTINCT*/
you need to have a group by / count to ensure all are accounted for
select Products.*
from Products
join ( SELECT Product_ID
FROM ProductTags
where ProductTags.tag_id IN (1,2,3)
GROUP BY Products.id
having count( distinct tag_id ) = 3 ) PreQuery
on ON Products.id = PreQuery.product_id
The MySQL WHERE fieldname IN (1,2,3) is essentially shorthand for WHERE fieldname = 1 OR fieldname = 2 OR fieldname = 3. So if you aren't getting the desired functionality with WHERE ... IN then try switching to ORs. If that still doesn't give you the results you want, then perhaps WHERE ... IN is not the function you need to use.