Should I use GROUP BY when I use IN clause? - mysql

I'm working on someone else's project. There is a query like this:
SELECT posts.id, posts.title, posts.body, posts.keywords
FROM posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE tags.name IN ( :keywords )
GROUP BY posts.id
The new policy is to replace IN with =. So the query I've written looks like this:
SELECT posts.id, posts.title, posts.body, posts.keywords
FROM posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE tags.name = :keyword
GROUP BY posts.id
Now I want to know, is GROUP BY redundant in this case? I say so because I think the reason of GROUP BY is omitting duplicate posts which are matched by each keyword.

First things first, when using GROUP BY within a SELECT statement every column that is not included within the grouping clause should be wrapped up with an aggregate function.
Just because MySQL allows this kind of odd behaviour doesn't make it best practices. Other DBMS for example PostgreSQL wouldn't allow this query to execute at all.
Saying that, how it works internally within MySQL is just that you get a unique record for each posts.id, but random values from potentially different rows for all non-aggregated and non-grouped column.
You should be using DISTINCT from what I can see.
Answer to your question
Replacing IN with = doesn't affect grouping at all, so you are free to go with it especially if you are not passing list but a single value to that query, but GROUP BY is not redundant in any case (or should be completely removed in both). It would change the output you receive.
If you, for instance, grouped by a unique column within a table and join that to a table with 1:1 relationship GROUP BY would be redundant. As a second example constructing proper WHERE clause with conditions might make it redundant as well.

Related

Why does COUNT(*) does the same as COUNT(column) in this query?

I am finishing a SQL course and they have a query example that I don't quite understand.
I have
these tables
and I need to get how many 'etiquetas' (tags in Spanish), are in each post, so they have this solution:
SELECT posts.titulo, COUNT(*) num_etiquetas
FROM posts
INNER JOIN posts_etiquetas ON posts.id = posts_etiquetas.post_id
INNER JOIN etiquetas ON etiquetas.id = posts_etiquetas.etiqueta_id
GROUP BY posts.id
ORDER BY num_etiquetas DESC;
I have been trying to understand this query and two questions came up:
Why COUNT(asterisk) does the same as COUNT(etiquetas.nombre)? For me only the latter makes sense, I don't quite understand why COUNT(asterisk) works in the given solution, isn't COUNT(*) supposed to count the total number of rows? maybe the issue is that I don't really understand how GROUP BY really works.
Why does deleting this line doesn't change the result? What is its use in the original solution?
INNER JOIN etiquetas ON etiquetas.id = posts_etiquetas.etiqueta_id

MySql: order by along with group by - performance

I have the performance problem with query that have order by and group by. I have checked similar problems on SO but I did not find the solution to this:(
I have something like this in my db schema:
pattern has many pattern_file belongs to project_template which belongs to project
Now I want to get projects filtered by some data(additional tables that I join) and want to get the result ordered for example by projects.priority and grouped by patterns.id. I have tried many things and to get the desired result I've figured out this query:
SELECT DISTINCT `projects`.* FROM `projects`
INNER JOIN `project_templates` ON `project_templates`.`project_id` = `projects`.`id`
INNER JOIN `pattern_files` ON `pattern_files`.`id` = `project_templates`.`pattern_file_id`
INNER JOIN `patterns` ON `patterns`.`id` = `pattern_files`.`pattern_id`
...[ truncated ]
INNER JOIN (SELECT DISTINCT projects.id FROM `projects` INNER JOIN `project_templates` ON `project_templates`.`project_id` = `projects`.`id`
INNER JOIN `pattern_files` ON `pattern_files`.`id` = `project_templates`.`pattern_file_id`
INNER JOIN `patterns` ON `patterns`.`id` = `pattern_files`.`pattern_id`
...[ truncated ]
WHERE [here my conditions] ORDER BY [here my order]) P
ON P.id = projects.id
WHERE [here my conditions]
GROUP BY patterns.id
ORDER BY [here my order]
From my research I have to INNER JOIN with subquery to conquer the problem "ORDER BY before GROUPing BY" => then I have put the same conditions on the outer query for performance purpose. The order by I had to use again in the outer query too, otherwise the result will be sorted by default.
Now there is real performance problem as I have about 6k projects and when I run this query without any conditions it takes about 15s :/ When I narrow the result by specify the conditions the time drastically dropped down. I've found somewhere that the subquery is run for every outer query row result which could be true when you watch at the execution time :/
Could you please give some advice how I can optimize the query? I do not work much with sql so maybe I do it from the wrong side from the very beginning?
P.S. I have tried WHERE projects.id IN (Select project.id FROM projects ....) and that discarded the performance issue but also discarded the ORDER BY before GROUPing BY
EDIT.
I want to retrieve list of projects, but I want also to filter it and order, and finally I want to get patterns.id unique(that is why I use the group by).
order by in your inner query (p) doesn't make sense (any inner sort will only
have an arbitrary effect).
#Solarflare Unfortunately it does. group by will take first row from grouped result. It preserve the order for join. Well, I believe that it is specific to MySql. Furthermore to keep the order from subquery I could use ORDER BY NULL in outer query :-)
Also, select projects.* ... group by pattern.id is fishy (although MySQL, in contrast to every other dbms, allows you to do this)
so we can assume I retrieve only projects.id, but from docs:
MySQL extends the use of GROUP BY to permit selecting fields that are not mentioned in the GROUP BY clause

SQL query with JOIN

I'm creating a product filter for e-commerce store. I have a product table, characteristics table and a table in which I store product_id, characteristic_id and a single filter value.
shop_products - id, name
shop_characteristics - id, values (json)
shop_values - product_id, characteristic_id, value
I can build a query to get all the products by a single value like this:
SELECT `p`.* FROM `shop_products` `p`
LEFT JOIN `shop_values` `fv` ON `p`.`id` = `fv`.`product_id`
WHERE ((`fv`.`characteristic_id`=3) AND (`fv`.`value`='outdoor'))
It works fine. Also, I can modify this query and get all the products by multiple values that belong to the very same characteristics group (have identical characteristics_id) like this:
SELECT `p`.* FROM `shop_products` `p`
LEFT JOIN `shop_values` `fv` ON `p`.`id` = `fv`.`product_id`
WHERE ((`fv`.`characteristic_id`=3) AND (`fv`.`value`='outdoor'))
OR ((`fv`.`characteristic_id`=3) AND (`fv`.`value`='indoor'))
but when I try to create a query for multiple conditions with different characteristic_id I get nothing
SELECT `p`.* FROM `shop_products` `p`
LEFT JOIN `shop_values` `fv` ON `p`.`id` = `fv`.`product_id`
WHERE ((`fv`.`characteristic_id`=3) AND (`fv`.`value`='outdoor'))
AND ((`fv`.`characteristic_id`=5) AND (`fv`.`value`='white'))
My guess it does not work because of AND operator that I am using wrong in this case due to there are no records in shop_values table that have both characteristic_id 3 and 5.
So my question is how to combine or modify my query to get all related products or maybe it is a flaw to store data like this and I need to create a different kind of shop_values table?
Use aggregation. You can also use tuples with the in clause. So:
SELECT p.*
FROM shop_products p JOIN
shop_values v
ON p.id = v.product_id
WHERE (v.characteristic_id, v.value) IN ( (3, 'outdoor'), (5, 'white'))
GROUP BY p.id
HAVING COUNT(DISTINCT v.characteristic_id) = 2;
Notes:
Unnecessarily escaping column and table aliases (with backticks) just makes the query harder to write and to read.
In general, using SELECT p.* and GROUP BY p.id is really, really bad form. The one exception is when you are grouping by a unique or primary key. This latter form is actually supported in the ANSI standard.
A LEFT JOIN is not needed. You need to find matches between the tables for the logic to work.
The use of AND and OR is fine for the WHERE clause. MySQL happens to support tuples with IN, which somewhat simplifies the logic.

How to write mysql subselect properly with conditions and limiting

I have three Tables:
Posts:
id, title, authorId, text
authors:
id, name, country
Comments:
id, authorId, text, postId
I want to run a mysql command which selects the first 5 posts which were written by authors, whose country is 'Ireland'. In the same call, I want to retrieve all the comments for those five posts, and also the author info.
I've tried the following:
SELECT posts.id as 'posts.id', posts.title as 'posts.title' (etc. etc. list all fields in three table)
FROM
(SELECT * FROM posts, authors WHERE authors.country = 'ireland' AND authors.id = posts.authorId LIMIT 0, 5 ) as posts
LEFT JOIN
comments ON comments.postId = posts.id,
authors
WHERE
authors.id = posts.authorId
I had to include every field with an alias ^ because there was a duplicate for id, and more fields in future may become duplicates as I'm looking for a generic solution.
My two questions are:
1) I am getting a duplicate field entry from within my subselect for id, so do I have to list out all my fields as aliases again within the subselect or is there only one field I need for a subselect
2) Is there a way to auto-alias my call? At the moment I've just aliased every field in the main select but can it do this for me so there are no duplicates?
Sorry if this isn't very clear it's a bit of a messy problem! Thanks.
You are doing an unnecessary join back to the author table in your query. You get all the fields you want in the posts subquery. I would rename this to something other than an existing table, perhaps pa to indicate posts and authors.
You say you want the first 5 posts, but have no order clause. A better form of the query is:
SELECT pa.id as 'posts.id', pa.title as 'posts.title' (etc. etc. list all fields in three table)
FROM (SELECT *
FROM posts join
authors
on authors.id = posts.authorId
WHERE authors.country = 'ireland'
order by post.date
LIMIT 0, 5
) pa LEFT JOIN
comments c
ON c.postId = pa.id
Note that this returns the first five posts and their authors (as specified in the question). But one author may be responsible for all five posts.
In MySQL, you can use * and it will get rid of duplicate aliases in the from clause. I think this is dangerous. It is better to list all the columns you want.
To answer your questions:
You can select as many (or as few) columns as you need from a sub-query
You do not need to join the authors table again since you already selected all fields in the sub-query (and so get rid of duplicate columns names).
A few additional remarks...
... about the JOIN syntax
Prefer the form
FROM t1 JOIN t2 ON (t1.fk = t2.pk)
to the obsolete, obscure
FROM t1, t2 WHERE t1.fk = t2.pk
... about the use of a LIMIT clause without an ORDER BY clause
The order in which rows are returned by a SELECT statement without an ORDER BY clause is undefined. Therefore, a LIMIT n clause without an ORDER BY clause could return any n rows in theory.
Your final query should look like this:
SELECT *
FROM (
SELECT *
FROM posts
JOIN authors ON (authors.id = posts.authorId )
WHERE authors.country = 'ireland'
ORDER BY posts.id DESC -- assuming this column is monotonically increasing
LIMIT 5
) AS last_posts
LEFT JOIN comments ON ( comments.postId = last_posts .id )

Using SELECT within SELECT in mysql query

It is common to use SELECT within SELECT to reduce the number of queries; but as I examined this leads to slow query (which is obviously harmful for mysql performance). I had a simple query as
SELECT something
FROM posts
WHERE id IN (
SELECT tag_map.id
FROM tag_map
INNER JOIN tags
ON tags.tag_id=tag_map.tag_id
WHERE tag IN ('tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6')
)
This leads to slow queries of "query time 3-4s; lock time about 0.000090s; with about 200 rows examined".
If I split the SELECT queries, each of them will be quite fast; but this will increase the number of queries which is not good at high concurrency.
Is it the usual situation, or something is wrong with my coding?
In MySQL, doing a subquery like this is a "correlated query". This means that the results of the outer SELECT depend on the result of the inner SELECT. The outcome is that your inner query is executed once per row, which is very slow.
You should refactor this query; whether you join twice or use two queries is mostly irrelevant. Joining twice would give you:
SELECT something
FROM posts
INNER JOIN tag_map ON tag_map.id = posts.id
INNER JOIN tags ON tags.tag_id = tag_map.tag_id
WHERE tags.tag IN ('tag1', ...)
For more information, see the MySQL manual on converting subqueries to JOINs.
Tip: EXPLAIN SELECT will show you how the optimizer plans on handling your query. If you see DEPENDENT SUBQUERY you should refactor, these are mega-slow.
You could improve it by using the following:
SELECT something
FROM posts
INNER JOIN tag_map ON tag_map.id = posts.id
INNER JOIN tags
ON tags.tag_id=tag_map.tag_id
WHERE <tablename>.tag IN ('tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6')
Just make sure you only select what you need and do not use *; also state in which table you have the tag column so you can substitute <tablename>
Join does filtering of results. First join will keep results having 1st ON condition satisfied and then 2nd condition gives final result on 2nd ON condition.
SELECT something
FROM posts
INNER JOIN tag_map ON tag_map.id = posts.id
INNER JOIN tags ON tags.tag_id = tag_map.tag_id AND tags.tag IN ('tag1', 'tag2', 'tag3', 'tag4', 'tag5', 'tag6');
You can see these discussions on stack overflow :
question1
question2
Join helps to decrease time complexity and increases stability of server.
Information for converting sub queries to joins:
link1
link2
link3