mysql aggregate functions in query with two joins gives unexpected results

Given the following (very simplified) MySQL table structure:
products
    id
product_categories
    id
    product_id
    status (integer)
product_tags
    id
    product_id
    some_other_numeric_value
I am trying to find every product that is associated with a certain product_tag and that is related to at least one category whose status attribute is 1.
I tried the following query:
SELECT *
FROM `product` p
JOIN `product_categories` pc
ON p.`product_id` = pc.`product_id`
JOIN `product_tags` pt
ON p.`product_id` = pt.`product_id`
WHERE pt.`some_value` = 'some comparison value'
GROUP BY p.`product_id`
HAVING SUM( pc.`status` ) > 0
ORDER BY SUM( pt.`some_other_numeric_value` ) DESC
Now my problem is: the SUM(pt.some_other_numeric_value) returns unexpected values.
I realized that if the product in question has more than one relation to the product_categories table, then every relation to the product_tags table is counted as many times as there are relations to the product_categories table!
For example: if the product with id=1 has relations to product_categories with ids 2, 3 and 4, and relations to product_tags with ids 5 and 6, then GROUP_CONCAT(pt.id) gives 5,6,5,6,5,6 instead of the expected 5,6.
At first I suspected it was a problem with the join type (left join, right join, inner join, and so on), so I tried every join type that I know of, but to no avail. I also tried to include more id fields in the GROUP BY clause, but this didn't solve the problem either.
Can somebody explain to me what is actually going wrong here?

You join a "main" (product) table to two tables (tags and categories) via 1:n relationships, so this is expected: you are creating a mini Cartesian product. For a product with several associated tags or categories, each tag row is repeated once per category row (and vice versa), so multiple rows are created in the result set. If you then GROUP BY, the aggregate functions return wrong results.
One way to avoid this is to remove one of the two joins, which is a valid strategy if you don't need results from that table. Say you don't need anything in the SELECT list from the product_categories table. Then you can use a semi-join (the EXISTS subquery) to that table:
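The row multiplication can be reproduced in isolation. The following is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the sample rows (three categories, two tags with values 10 and 20) are made up for illustration.

```python
import sqlite3

def fan_out_demo():
    """Join one product to two independent 1:n tables and aggregate."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE product (product_id INTEGER PRIMARY KEY);
        CREATE TABLE product_categories (id INTEGER PRIMARY KEY, product_id INTEGER, status INTEGER);
        CREATE TABLE product_tags (id INTEGER PRIMARY KEY, product_id INTEGER, some_other_numeric_value INTEGER);
        -- hypothetical sample data: 1 product, 3 categories, 2 tags
        INSERT INTO product VALUES (1);
        INSERT INTO product_categories VALUES (2, 1, 1), (3, 1, 1), (4, 1, 0);
        INSERT INTO product_tags VALUES (5, 1, 10), (6, 1, 20);
    """)
    row = con.execute("""
        SELECT COUNT(*), SUM(pt.some_other_numeric_value)
        FROM product p
        JOIN product_categories pc ON pc.product_id = p.product_id
        JOIN product_tags pt ON pt.product_id = p.product_id
        GROUP BY p.product_id
    """).fetchone()
    con.close()
    return row

# 3 categories x 2 tags = 6 joined rows, so each tag value is summed 3 times
joined_rows, inflated_sum = fan_out_demo()
print(joined_rows, inflated_sum)
```

The true sum of the tag values is 30; the joined query reports three times that because every tag row appears once per category row.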
SELECT p.*,
SUM( pt.`some_other_numeric_value` )
FROM `product` p
JOIN `product_tags` pt
ON p.`product_id` = pt.`product_id`
WHERE pt.`some_value` = 'some comparison value'
AND EXISTS
( SELECT *
FROM product_categories pc
WHERE pc.product_id = p.product_id
AND pc.status = 1
)
GROUP BY p.`product_id`
ORDER BY SUM( pt.`some_other_numeric_value` ) DESC ;
Another way to circumvent this problem is - after the GROUP BY MainTable.pk - to use DISTINCT inside the COUNT() or GROUP_CONCAT() aggregate functions. This works for those, but not for SUM(): SUM(DISTINCT ...) would also collapse values that are legitimately equal. So, it's not useful in your specific query.
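A small sketch of why DISTINCT helps COUNT() but not SUM(), again using sqlite3 in place of MySQL. The data is hypothetical: two tags that happen to carry the same value 10, joined against three categories.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY);
    CREATE TABLE product_categories (id INTEGER PRIMARY KEY, product_id INTEGER);
    CREATE TABLE product_tags (id INTEGER PRIMARY KEY, product_id INTEGER, some_other_numeric_value INTEGER);
    -- hypothetical sample data: both tags carry the same value
    INSERT INTO product VALUES (1);
    INSERT INTO product_categories VALUES (2, 1), (3, 1), (4, 1);
    INSERT INTO product_tags VALUES (5, 1, 10), (6, 1, 10);
""")
tag_count, naive_sum, distinct_sum = con.execute("""
    SELECT COUNT(DISTINCT pt.id),                     -- correct: 2 tags
           SUM(pt.some_other_numeric_value),          -- inflated by the 3 categories
           SUM(DISTINCT pt.some_other_numeric_value)  -- collapses the two equal values
    FROM product p
    JOIN product_categories pc ON pc.product_id = p.product_id
    JOIN product_tags pt ON pt.product_id = p.product_id
    GROUP BY p.product_id
""").fetchone()
con.close()
print(tag_count, naive_sum, distinct_sum)
```

The correct sum here would be 20; neither SUM variant produces it, while COUNT(DISTINCT pt.id) is right because the ids are unique.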
A third option - which works always - is to first group by the two (or more) side tables and then join to the main table. Something like this in your case:
SELECT p.* ,
COALESCE(pt.sum_other_values, 0) AS sum_other_values,
COALESCE(pt.cnt, 0) AS tags_count,
COALESCE(pc.cnt, 0) AS categories_count,
COALESCE(category_titles, '') AS category_titles
FROM `product` p
JOIN
( SELECT product_id
, COUNT(*) AS cnt
, GROUP_CONCAT(title) AS category_titles
FROM `product_categories` pc
WHERE status = 1
GROUP BY product_id
) AS pc
ON p.`product_id` = pc.`product_id`
JOIN
( SELECT product_id
, COUNT(*) AS cnt
, SUM(some_other_numeric_value) AS sum_other_values
FROM `product_tags` pt
WHERE some_value = 'some comparison value'
GROUP BY product_id
) AS pt
ON p.`product_id` = pt.`product_id`
ORDER BY sum_other_values DESC ;
The COALESCE() calls are not strictly needed there - they are just in case you change the inner joins to LEFT outer joins.
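The pre-aggregation approach can be checked on a tiny data set. This is a sketch under assumptions: sqlite3 stands in for MySQL, the some_value filter and the title column are left out, and the rows are invented (product 2 has only a status 0 category, so it should be filtered out).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY);
    CREATE TABLE product_categories (id INTEGER PRIMARY KEY, product_id INTEGER, status INTEGER);
    CREATE TABLE product_tags (id INTEGER PRIMARY KEY, product_id INTEGER, some_other_numeric_value INTEGER);
    -- hypothetical sample data
    INSERT INTO product VALUES (1), (2);
    INSERT INTO product_categories VALUES (2, 1, 1), (3, 1, 1), (4, 1, 0), (5, 2, 0);
    INSERT INTO product_tags VALUES (5, 1, 10), (6, 1, 20), (7, 2, 99);
""")
# Each side table is collapsed to one row per product BEFORE joining,
# so the join to product is 1:1 and nothing is multiplied.
rows = con.execute("""
    SELECT p.product_id, pt.sum_other_values, pc.cnt AS categories_count
    FROM product p
    JOIN (SELECT product_id, COUNT(*) AS cnt
          FROM product_categories
          WHERE status = 1
          GROUP BY product_id) AS pc
      ON pc.product_id = p.product_id
    JOIN (SELECT product_id, SUM(some_other_numeric_value) AS sum_other_values
          FROM product_tags
          GROUP BY product_id) AS pt
      ON pt.product_id = p.product_id
    ORDER BY pt.sum_other_values DESC
""").fetchall()
con.close()
print(rows)
```

Product 1 comes back with the correct sum of 30 and two active categories; product 2 drops out because the inner join finds no status 1 category for it.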

Instead of ordering by the SUM() expression directly, you could give it an alias in the SELECT list and order by that alias:
SELECT * ,SUM( pt.`some_other_numeric_value` ) as sumvalues
FROM `product` p
JOIN `product_categories` pc
ON p.`product_id` = pc.`product_id`
JOIN `product_tags` pt
ON p.`product_id` = pt.`product_id`
WHERE pt.`some_value` = 'some comparison value'
GROUP BY p.`product_id`
HAVING SUM( pc.`status` ) > 0
ORDER BY sumvalues DESC

Related

optimizing SQL counts

I have to select a list of Catalogs from one table, and perform counts in two other tables: Stores and Categories. The counters should show how many Stores and Categories are linked to each Catalog.
I have managed to get the functionality I need using this SQL query:
SELECT `catalog`.`id` AS `id`,
`catalog`.`name` AS `name`,
(
SELECT COUNT(*)
FROM `category`
WHERE `category`.`catalog_id` = `catalog`.`id`
AND `category`.`is_archive` = 0
AND `category`.`company_id` = 2
) AS `category_count`,
(
SELECT COUNT(*)
FROM `store`
WHERE `store`.`catalog_id` = `catalog`.`id`
AND `store`.`is_archive` = 0
AND `store`.`company_id` = 2
) AS `store_count`
FROM `catalog`
WHERE `catalog`.`company_id` = 2
AND `catalog`.`is_archive` = 0
ORDER BY `catalog`.`id` ASC;
This works as expected. But I don't like using subqueries, as they are slow and this query may perform badly on large lists. Is there any way to optimize this SQL using JOINs?
Thanks in advance.

You can make this a lot faster by refactoring the dependent subqueries in your SELECT clause into, as you mention, JOINed aggregate subqueries.
The first subquery you can write this way.
SELECT COUNT(*) num, catalog_id, company_id
FROM category
WHERE is_archive = 0
GROUP BY catalog_id, company_id
The second one like this.
SELECT COUNT(*) num, catalog_id, company_id
FROM store
WHERE is_archive = 0
GROUP BY catalog_id, company_id
Then, use those in your main query as if they were tables containing the counts you want.
SELECT catalog.id,
catalog.name,
category.num category_count,
store.num store_count
FROM catalog
LEFT JOIN (
SELECT COUNT(*) num, catalog_id, company_id
FROM category
WHERE is_archive = 0
GROUP BY catalog_id, company_id
) category ON catalog.id = category.catalog_id
AND catalog.company_id = category.company_id
LEFT JOIN (
SELECT COUNT(*) num, catalog_id, company_id
FROM store
WHERE is_archive = 0
GROUP BY catalog_id, company_id
) store ON catalog.id = store.catalog_id
AND catalog.company_id = store.company_id
WHERE catalog.is_archive = 0
AND catalog.company_id = 2
ORDER BY catalog.id ASC;
This is faster than your example because each subquery need only run once, rather than once per catalog entry. It also has the nice feature that you only need say WHERE catalog.company_id = 2 once. The MySQL optimizer knows what to do with that.
I suggest LEFT JOIN operations so you'll still see catalog entries even if they're not mentioned in your category or store tables.
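One detail worth checking with the LEFT JOIN version: a catalog with no matching rows gets NULL rather than 0 for its count. The sketch below runs the same query shape in sqlite3 with made-up rows; wrapping the counts in COALESCE and the short aliases cat and st are my additions, not part of the query above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE catalog  (id INTEGER PRIMARY KEY, name TEXT, company_id INTEGER, is_archive INTEGER);
    CREATE TABLE category (id INTEGER PRIMARY KEY, catalog_id INTEGER, company_id INTEGER, is_archive INTEGER);
    CREATE TABLE store    (id INTEGER PRIMARY KEY, catalog_id INTEGER, company_id INTEGER, is_archive INTEGER);
    -- hypothetical sample data: catalog 2 has no categories and no stores
    INSERT INTO catalog  VALUES (1, 'spring', 2, 0), (2, 'empty', 2, 0);
    INSERT INTO category VALUES (1, 1, 2, 0), (2, 1, 2, 0), (3, 1, 2, 1);
    INSERT INTO store    VALUES (1, 1, 2, 0);
""")
catalog_counts = con.execute("""
    SELECT catalog.id,
           COALESCE(cat.num, 0) AS category_count,  -- NULL becomes 0 for unmatched catalogs
           COALESCE(st.num, 0)  AS store_count
    FROM catalog
    LEFT JOIN (SELECT COUNT(*) AS num, catalog_id, company_id
               FROM category
               WHERE is_archive = 0
               GROUP BY catalog_id, company_id) cat
           ON catalog.id = cat.catalog_id AND catalog.company_id = cat.company_id
    LEFT JOIN (SELECT COUNT(*) AS num, catalog_id, company_id
               FROM store
               WHERE is_archive = 0
               GROUP BY catalog_id, company_id) st
           ON catalog.id = st.catalog_id AND catalog.company_id = st.company_id
    WHERE catalog.is_archive = 0
      AND catalog.company_id = 2
    ORDER BY catalog.id
""").fetchall()
con.close()
print(catalog_counts)
```

With COALESCE the output matches what the original correlated subqueries return, including zero counts for the empty catalog.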

Subqueries are fine, but you can simplify your query:
SELECT c.id, c.name,
COUNT(*) OVER (PARTITION BY c.catalog_id) as category_count,
(SELECT COUNT(*)
FROM store s
WHERE s.catalog_id = c.id AND
s.is_archive = 0 AND
s.company_id = c.company_id
) AS store_count
FROM catalog c
WHERE c.company_id = 2 AND c.is_archive = 0
ORDER BY c.id ASC;
For performance, you want indexes on:
catalog(company_id, is_archive, id)
store(catalog_id, company_id, is_archive)
Because of the filtering in the outer query, a correlated subquery is probably the best performing way to get the results from store.
Also note some changes to the query:
I removed the backticks. They are unnecessary and just clutter the query.
An expression like c.id as id is redundant. The expression is given id as the alias anyway.
I changed the s.company_id = 2 to s.company_id = c.company_id. It seems like a correlation clause.

Mysql (doctrine) - count inner join having count > X

I have SQL to count products with specific properties. I am using it in the products filter. SQL is very long, but here is the primary part:
SELECT COUNT(products.id) as products_count, property_items.description, property_items.id as id
FROM property_items
INNER JOIN product_properties ON property_items.id = product_properties.property_item_id
INNER JOIN products ON product_properties.product_id
INNER JOIN product_properties pp ON products.id = pp.product_id AND (pp.property_item_id IN ($ids))
GROUP BY property_items.id
HAVING COUNT(pp.id) >= $countIds
This works perfectly when I have only one element in $ids, but when I choose more than one, the result is wrong. It looks like the SQL returns the count of all products with any property from $ids, but I need to count only products that contain all of the properties.
My idea: first get all available properties. For each property, join the products that contain this property, then go back to all properties of each product to check whether the product also contains the already-selected properties. Or is this a bad idea? I need to keep property_items as the primary (FROM) table.
I need to get result in this format:
=============================
id|description|products_count
=============================
1 |lorem ipsum|10
-----------------------------
2 |dolore sit |2
Thanks for any idea.

Try using SELECT COUNT(DISTINCT products.id) AS cnt

You can get the product ids that have all the properties by doing:
SELECT pp.product_id
FROM property_items pi INNER JOIN
product_properties pp
ON pi.id = pp.property_item_id INNER JOIN
products p
ON pp.product_id = p.id
WHERE pp.property_item_id IN ($ids)
GROUP BY pp.product_id
HAVING COUNT(DISTINCT pp.property_item_id) = $countIds -- has all of them
Note that I rationalized the joins. I think your simplification of the query wasn't quite right. I also added table aliases, so the query is easier to write and to read.
If you want the count of such products, use a subquery:
SELECT COUNT(*)
FROM (SELECT pp.product_id
FROM property_items pi INNER JOIN
product_properties pp
ON pi.id = pp.property_item_id INNER JOIN
products p
ON pp.product_id = p.id
WHERE find_in_set(pp.property_item_id, $ids)
GROUP BY pp.product_id
HAVING COUNT(DISTINCT pp.property_item_id) = $countIds -- has all of them
) AS t;
Your problem is probably because of this line:
WHERE pp.property_item_id IN ($ids)
If you are passing $ids as a comma-separated string, then your query will not work. Note the replacement above.
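An alternative to find_in_set is to bind each id as its own parameter, so the IN list is a real list rather than one string. The sketch below shows the idea with Python's sqlite3 and its ? placeholders; the function name products_with_all and the sample rows are hypothetical, and a Doctrine/MySQL application would use its own parameter binding instead.

```python
import sqlite3

def products_with_all(con, item_ids):
    """Return ids of products that have every property item in item_ids."""
    ids = sorted(set(item_ids))          # duplicates would break the count check
    if not ids:
        return []
    placeholders = ", ".join("?" for _ in ids)   # one ? per id, e.g. "?, ?"
    sql = f"""
        SELECT pp.product_id
        FROM product_properties pp
        WHERE pp.property_item_id IN ({placeholders})
        GROUP BY pp.product_id
        HAVING COUNT(DISTINCT pp.property_item_id) = ?   -- has all of them
        ORDER BY pp.product_id
    """
    return [row[0] for row in con.execute(sql, [*ids, len(ids)])]

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product_properties (id INTEGER PRIMARY KEY, product_id INTEGER, property_item_id INTEGER);
    -- hypothetical sample data: product 2 lacks property item 11
    INSERT INTO product_properties (product_id, property_item_id) VALUES
        (1, 10), (1, 11), (2, 10), (3, 10), (3, 11), (3, 12);
""")
both = products_with_all(con, [10, 11])
only_ten = products_with_all(con, [10])
print(both, only_ten)
```

Only the placeholder string is interpolated into the SQL text; the id values themselves are always passed as bound parameters.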

Anti Join with group/conditions

Note: I have simplified the question since both that and the answer have become I believe more complex than intended.
I want to do an anti-join that has a condition other than just not existing in the first table.
Table: Product / Manufacturer
Widget / Acme
Paddle / Acme
Ball / Acme
Gas / Exxon
Pump / Exxon
Table: Customer / Product
Karen / Ball
Bob / Paddle
Karen / Gas
Bob / Pump
A "normal" anti-join would find out which products have not been ordered via
Select T1.Product from `Product / Manufacturer` as T1
Left Join `Customer / Product` as T2
On T2.Product = T1.Product
Where T2.Product is NULL
However what I am looking for is which customers didn't order which products, in essence:
Select Products from `Product / Manufacturer`
where Manufacturer = 'Acme' that do not exist in `Customer / Product`
where Customer = 'Karen'
and
Select Products from `Product / Manufacturer`
where Manufacturer = 'Exxon' that do not exist in `Customer / Product`
where Customer = 'Karen'
and
Select Products from `Product / Manufacturer`
where Manufacturer = 'Acme' that do not exist in `Customer / Product`
where Customer = 'Bob'
and
Select Products from `Product / Manufacturer`
where Manufacturer = 'Exxon' that do not exist in `Customer / Product`
where Customer = 'Bob'
But as one query since there are 100s of "Customers" and 100s of Manufacturers.

If you want to exclude all products for a manufacturer for which no product from that manufacturer appears in any order...
Then that means that you only want to include only products from certain manufacturers...
Which manufacturers have had a product appear in an order ?
SELECT r.manufacturer
FROM product r
JOIN orders s
ON s.product = r.product
GROUP BY r.manufacturer
You can wrap that query in parens and include it as an inline view ...
SELECT p.*
FROM ( SELECT r.manufacturer
FROM product r
JOIN orders s
ON s.product = r.product
GROUP BY r.manufacturer
) q
JOIN product p
ON p.manufacturer = q.manufacturer
LEFT
JOIN orders o
ON o.product = p.Product
WHERE o.product IS NULL
There are other query patterns that will return an equivalent result.
FOLLOWUP
NOTE: The "breakdown by gender/hour" part wasn't made clear in the original specification.
The query pattern is very much the same. Use an inline view query to return a distinct list of manufacturers for each gender/hour.
Then join that set to the product table, to get every product from those manufacturers. That will include products that were ordered, as well as products that weren't ordered.
Then apply the anti-join pattern, to exclude the products that were ordered by gender/hour.
SELECT q.gender
, q.hour
, p.manufacturer
, p.product
FROM ( SELECT s.gender
, s.hour
, r.manufacturer
FROM orders s
JOIN product r
ON r.product = s.product
GROUP
BY s.gender
, s.hour
, r.manufacturer
) q
JOIN product p
ON p.manufacturer = q.manufacturer
LEFT
JOIN orders o
ON o.gender = q.gender
AND o.hour = q.hour
AND o.product = p.product
WHERE o.product IS NULL
If that's not clear, consider that the following query returns an equivalent set. The inline view query t returns the set of all products from a manufacturer, by gender/hour.
This query is somewhat less efficient (at least in MySQL) due to the additional inline view. And while longer, it may be more understandable, since the view query t makes explicit the set of all possible rows that could be returned... every product by manufacturer/gender/hour. (To see that set, the view query t can be pulled out and run separately to see what it returns.)
In the outermost query, t is referenced as if it were a table. If t were replaced by a simple table reference, the query would just be a simple anti-join: all rows from t, excluding rows that have a match.
SELECT t.gender
, t.hour
, t.manufacturer
, t.product
FROM (
SELECT q.gender
, q.hour
, q.manufacturer
, p.product
FROM ( SELECT s.gender
, s.hour
, r.manufacturer
FROM orders s
JOIN product r
ON r.product = s.product
GROUP
BY s.gender
, s.hour
, r.manufacturer
) q
JOIN product p
ON p.manufacturer = q.manufacturer
) t
LEFT
JOIN orders o
ON o.gender = t.gender
AND o.hour = t.hour
AND o.product = t.product
WHERE o.product IS NULL
I recommend you get the set of rows returned first, before you futz with adding a GROUP BY and a GROUP_CONCAT aggregate to collapse the rows.
If you want to group multiple values of "hour" into just "am" or "pm", you can use an expression (in place of "hour") that returns "am" or "pm". (Think in terms of that expression being another column in the table; but instead of referencing a column in the table, you use an expression that derives the value from other columns in the table.)
IF(x.hour<12,'am','pm')
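The same anti-join pattern can be tried against the simplified tables from the question, with customer taking the place of gender/hour. This is a sketch in sqlite3; the table names product and orders and the column names are assumptions chosen to mirror the queries above, and the rows are the ones listed in the question.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (product TEXT PRIMARY KEY, manufacturer TEXT);
    CREATE TABLE orders  (customer TEXT, product TEXT);
    INSERT INTO product VALUES
        ('Widget', 'Acme'), ('Paddle', 'Acme'), ('Ball', 'Acme'),
        ('Gas', 'Exxon'), ('Pump', 'Exxon');
    INSERT INTO orders VALUES
        ('Karen', 'Ball'), ('Bob', 'Paddle'), ('Karen', 'Gas'), ('Bob', 'Pump');
""")
not_ordered = con.execute("""
    SELECT q.customer, p.manufacturer, p.product
    FROM ( -- distinct manufacturers each customer has ordered from
           SELECT s.customer, r.manufacturer
           FROM orders s
           JOIN product r ON r.product = s.product
           GROUP BY s.customer, r.manufacturer
         ) q
    JOIN product p                 -- every product of those manufacturers
      ON p.manufacturer = q.manufacturer
    LEFT JOIN orders o             -- anti-join: keep rows with no matching order
      ON o.customer = q.customer
     AND o.product  = p.product
    WHERE o.product IS NULL
    ORDER BY q.customer, p.manufacturer, p.product
""").fetchall()
con.close()
for row in not_ordered:
    print(row)
```

Each output row is one customer/product pair where the customer ordered something from that manufacturer but not that particular product.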

Retrieve records from multiple tables some distinct, some not

I have 4 tables in an existing mysql database of a directory type site.
Table mt_links contains basic info for each listing
Table mt_cl contains which listing above is in what category (I only want cat_id=1)
Table mt_cfvalues contains more details for each listing. It can have repeated values.
Table mt_images contains image names for each listing.
I want all records from mt_links where the mt_cl cat_id=1, and for each of those records, I need all records in mt_cfvalues and mt_images matching the link_id.
I set up a select with Group_Concat and left joins, but ended up with repeating values in my results. I added Distinct, which cured the repeating values, but mt_cfvalues can have records with the same value, so now I'm missing a value I should have.
SELECT a.link_id,
a.link_name,
a.link_desc,
GROUP_CONCAT(DISTINCT b.value ORDER BY b.cf_ID) AS details,
GROUP_CONCAT(DISTINCT c.filename ORDER BY c.ordering) AS images
FROM mt_links a
LEFT JOIN mt_cfvalues b ON a.link_id = b.link_ID
LEFT JOIN mt_images c ON b.link_id = c.link_ID
LEFT JOIN mt_cl d ON a.link_id = d.link_ID WHERE d.cat_ID = '1'
GROUP BY a.link_id
I put together a SQLFiddle here: http://www.sqlfiddle.com/#!2/f39e9/1
Is there an easier way? How do I fix the repeating / no repeating issue?

Here is one way of accomplishing what you seek. Because the two subqueries return independent results, you can't combine the GROUP BY, which is why you were getting duplicates.
SELECT a.link_id,
a.link_name,
a.link_desc,
cvf.details,
imgs.images
FROM mt_links a
LEFT JOIN (
SELECT link_ID, GROUP_CONCAT(value ORDER BY cf_ID) AS details
FROM mt_cfvalues
GROUP BY link_ID
) cvf ON cvf.link_ID = a.link_id
LEFT JOIN (
SELECT link_ID, GROUP_CONCAT(filename ORDER BY ordering) AS images
FROM mt_images
GROUP BY link_ID
) imgs ON imgs.link_ID = a.link_id
INNER JOIN mt_cl d ON a.link_id = d.link_ID
WHERE d.cat_ID = '1'
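To see the difference on a small case, here is a sketch in sqlite3 with invented rows: one listing with the value 'red' stored twice and two images. The ORDER BY inside GROUP_CONCAT is omitted because older SQLite builds do not accept it, so the order of the concatenated items is not guaranteed in this sketch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE mt_links    (link_id INTEGER PRIMARY KEY, link_name TEXT);
    CREATE TABLE mt_cl       (link_id INTEGER, cat_id INTEGER);
    CREATE TABLE mt_cfvalues (cf_id INTEGER, link_id INTEGER, value TEXT);
    CREATE TABLE mt_images   (link_id INTEGER, filename TEXT, ordering INTEGER);
    -- hypothetical sample data: the value 'red' is legitimately stored twice
    INSERT INTO mt_links    VALUES (1, 'listing');
    INSERT INTO mt_cl       VALUES (1, 1);
    INSERT INTO mt_cfvalues VALUES (1, 1, 'red'), (2, 1, 'red');
    INSERT INTO mt_images   VALUES (1, 'a.jpg', 1), (1, 'b.jpg', 2);
""")
# Joined approach with DISTINCT: the repeated value is lost.
(distinct_details,) = con.execute("""
    SELECT GROUP_CONCAT(DISTINCT b.value)
    FROM mt_links a
    LEFT JOIN mt_cfvalues b ON a.link_id = b.link_id
    LEFT JOIN mt_images c ON a.link_id = c.link_id
    GROUP BY a.link_id
""").fetchone()
# Pre-aggregated approach: each side table is concatenated on its own.
details, images = con.execute("""
    SELECT cvf.details, imgs.images
    FROM mt_links a
    LEFT JOIN (SELECT link_id, GROUP_CONCAT(value) AS details
               FROM mt_cfvalues GROUP BY link_id) cvf ON cvf.link_id = a.link_id
    LEFT JOIN (SELECT link_id, GROUP_CONCAT(filename) AS images
               FROM mt_images GROUP BY link_id) imgs ON imgs.link_id = a.link_id
    INNER JOIN mt_cl d ON a.link_id = d.link_id
    WHERE d.cat_id = 1
""").fetchone()
con.close()
print(distinct_details, details, images)
```

The DISTINCT version keeps a single 'red', while the pre-aggregated version keeps both copies and still lists each image once.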

SQL query inner joins and limit to max 3 most recent

I have the following query:
SELECT * FROM `product` INNER JOIN `shop`
ON `product`.shop_id= `shop`.id;
I wanted to get all of the products from all the shops I have, but I wanted to get 3 products max from each shop. Is there a way to specify MAX on each joins?
Here's my product table:
Here's my shop table:

Try this:
SELECT *
FROM (SELECT *
FROM (SELECT *, IF(@shop = (@shop:=p.shop_id), @id:=@id + 1, @id := 1) temp
FROM `product` p, (SELECT @shop:=0, @id:=1) AS A
ORDER BY p.shop_id, p.updated DESC) AS B
WHERE temp <= 3) AS C
INNER JOIN `shop` s ON C.shop_id= s.id;
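On MySQL 8.0 and later, the same per-shop numbering can be written with the ROW_NUMBER() window function instead of user variables. The sketch below is not the query above but a swapped-in technique, run here through sqlite3 (which needs SQLite 3.25 or newer for window functions); the name and updated values are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE shop    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product (id INTEGER PRIMARY KEY, shop_id INTEGER, updated TEXT);
    -- hypothetical sample data: shop 1 has five products, shop 2 has two
    INSERT INTO shop VALUES (1, 'north'), (2, 'south');
    INSERT INTO product VALUES
        (1, 1, '2024-01-01'), (2, 1, '2024-01-02'), (3, 1, '2024-01-03'),
        (4, 1, '2024-01-04'), (5, 1, '2024-01-05'),
        (6, 2, '2024-01-01'), (7, 2, '2024-01-02');
""")
top_three = con.execute("""
    SELECT s.name, r.id
    FROM ( -- number each shop's products from most to least recently updated
           SELECT p.id, p.shop_id, p.updated,
                  ROW_NUMBER() OVER (PARTITION BY p.shop_id
                                     ORDER BY p.updated DESC) AS rn
           FROM product p
         ) AS r
    JOIN shop s ON s.id = r.shop_id
    WHERE r.rn <= 3
    ORDER BY s.id, r.updated DESC
""").fetchall()
con.close()
print(top_three)
```

A shop with fewer than three products simply returns all of them, as shop 2 does here.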

Query:
SELECT *
FROM `product` p
INNER JOIN `shop` s
ON `p`.shop_id= `s`.id
WHERE p.id IN (SELECT p2.id
FROM `product` p2
WHERE p2.shop_id = s.id
ORDER BY p2.updated DESC
LIMIT 3)
OR maybe:
SELECT *
FROM `product` p
INNER JOIN `shop` s
ON `p`.shop_id= `s`.id
WHERE EXISTS (SELECT p2.id
FROM `product` p2
WHERE p2.shop_id = s.id
ORDER BY p2.updated DESC
LIMIT 3)

Specifying limits within a subquery is a bit challenging in MySQL (not impossible, but a bit complicated).
If you just want the three most recent product ids for each shop, and you can live with them on one row, then you can use group_concat(). The query is much simpler:
SELECT shop.*,
substring_index(group_concat(product.id order by product.updated desc), ',', 3) as ThreeProducts
FROM `product` INNER JOIN
`shop`
ON `product`.shop_id= `shop`.id
group by shop.id;
The results will place the product ids in a single field like this: '1,2,3'.

It is important to know the table definitions (primary keys, foreign keys, etc.) to come up with SQL that solves the problem. From the images it is not clear whether product.id is unique. I suspect there may be a data model issue here.
If the tables are not sufficiently normalized, it will be very difficult (sometimes impossible) to read the appropriate data back.
Reasonably normalized tables should look like this:
Product(id primary key, ....)
Shop(id primary key,....)
and a relation table, say:
Shop_Product (shop_id references Shop(id), prod_id references Product(id), ...)
It would be easier to help if you could post the table definitions.

Try using LIMIT in your query. It may work.
