Order of WHERE clause is affecting result - mysql

I have a strange problem with one select.
Is it possible that the order in WHERE clause can influence the result?
Here is my select:
select u.userName, u.fullName, g.uuid as groupUuid, g.name as `group`,
m.number as module, count(distinct b.uuid) as buildCount, max(b.datetime),
count(distinct e.buildId) as errorBuildCount, e.id as errorId
from User u
inner join GROUP_USER GU on GU.user_id = u.id
inner join `Group` g on g.id = GU.group_id
inner join Course c on c.id = g.courseId
left outer join Build b on b.userId = u.id
left outer join Module m on m.id = b.moduleId
left outer join Error e on e.buildId = b.id
where c.uuid = 'HMUUcabR1S4GRTIwt3wWxzCO' and g.uuid = 'abcdefghijklmnopqrstuvwz'
group by u.userName,m.number,c.uuid, g.uuid
order by g.id asc, u.fullName asc, m.number asc
This produces the following result:
http://dl.dropbox.com/u/4892450/sqlSelectProblem/select1.PNG
When I use this condition:
where g.uuid = 'abcdefghijklmnopqrstuvwz' and c.uuid = 'HMUUcabR1S4GRTIwt3wWxzCO'
(different order) I get a different result (see errorId column):
http://dl.dropbox.com/u/4892450/sqlSelectProblem/select2.PNG
Could you please help me? Is the whole select wrong, or can it be a mysql bug?

The only difference between the results is the errorId column. Selecting columns that are neither grouped nor aggregated is not allowed by the SQL standard (SQL-92) and will not even run in most database engines, so the engine's behavior in this situation is not specified. According to the docs (thanks to Marcus Adams):
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate.
You can get errorId as an aggregate value:
MAX(e.id) as errorId
or include it in GROUP BY list:
group by u.userName,m.number,c.uuid, g.uuid,e.Id
Then your query results should be stable.
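A minimal sketch of the aggregate fix. It runs against SQLite through Python's sqlite3 module, which is only a stand-in for MySQL so the example is self-contained. The reduced Error table and the helper error_ids are hypothetical. It loads the same rows in two different orders and checks that MAX(id) gives the same errorId either way:

```python
import sqlite3

def error_ids(rows):
    """Load (id, buildId) rows and return one aggregated row per build."""
    con = sqlite3.connect(":memory:")
    # Hypothetical, reduced version of the Error table from the question.
    con.execute("CREATE TABLE Error (id INTEGER, buildId INTEGER)")
    con.executemany("INSERT INTO Error VALUES (?, ?)", rows)
    # MAX(id) depends only on the contents of each group, not on row order.
    result = con.execute(
        "SELECT buildId, COUNT(*) AS errorCount, MAX(id) AS errorId "
        "FROM Error GROUP BY buildId ORDER BY buildId"
    ).fetchall()
    con.close()
    return result

rows = [(1, 10), (2, 10), (3, 20)]
forward = error_ids(rows)
backward = error_ids(list(reversed(rows)))
print(forward)
print(forward == backward)
```

Adding e.id to the GROUP BY list instead would return one row per error rather than one per build, so the two fixes are not interchangeable.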
Further reading:
Why does MySQL add a feature that conflicts with SQL standards? - detailed explanation of differences between sql standard and mysql implementation. (Thanks to GarethD)

You've got two different JOIN trees in your code, essentially:
         user
        /    \
group_user   build
     /          \
  group        module
    |            |
  course       error
Such a construct leads to undefined results, especially if the joins in one branch have a different number of matching records than the joins in the other branch. MySQL has to try and fill in the missing bits, and guesses. Changing the order of your WHERE clause can and WILL change the full results because you're changing the way MySQL does its guesses.
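One way to see what the two branches do to the row count is the sketch below. It uses SQLite via Python's sqlite3 module as a stand-in for MySQL, and the three tiny tables and their contents are hypothetical. A user with 2 group memberships and 3 builds comes out as 2 x 3 joined rows, so any nonaggregated column has several candidate rows to be taken from, while COUNT(DISTINCT ...) still reports the true numbers:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical miniature of the two join branches: one user,
# two group memberships on one side, three builds on the other.
con.executescript("""
CREATE TABLE user (id INTEGER);
CREATE TABLE group_user (user_id INTEGER, group_id INTEGER);
CREATE TABLE build (id INTEGER, userId INTEGER);
INSERT INTO user VALUES (1);
INSERT INTO group_user VALUES (1, 100), (1, 200);
INSERT INTO build VALUES (10, 1), (11, 1), (12, 1);
""")

# Each branch multiplies the rows produced by the other branch.
row = con.execute("""
SELECT COUNT(*), COUNT(DISTINCT b.id), COUNT(DISTINCT gu.group_id)
FROM user u
JOIN group_user gu ON gu.user_id = u.id
LEFT JOIN build b ON b.userId = u.id
""").fetchone()
con.close()
print(row)
```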

Group by all non-aggregated columns. That is best practice in most cases, and not doing so could very possibly be distorting your answers.

Related

MySQL: Optimizing Sub-queries

I have this query I need to optimize further since it requires too much cpu time and I can't seem to find any other way to write it more efficiently. Is there another way to write this without altering the tables?
SELECT category, b.fruit_name, u.name
, r.count_vote, r.text_c
FROM Fruits b, Customers u
, Categories c
, (SELECT * FROM
(SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r
WHERE b.fruit_id = r.fruit_id
AND u.customer_id = r.customer_id
AND category = "Fruits";
This is your query re-written with explicit joins:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN
(
SELECT * FROM
(
SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r on r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
CROSS JOIN Categories c
WHERE c.category = 'Fruits';
(I am guessing here that the category column belongs to the categories table.)
There are some parts that look suspicious:
Why do you cross join the Categories table, when you don't even display a column of the table?
What is ORDER BY fruit_id, count_vote DESC, r_id supposed to do? Sub query results are considered unordered sets, so an ORDER BY is superfluous and can be ignored by the DBMS. What do you want to achieve here?
SELECT * FROM [ reviews ] GROUP BY fruit_id is invalid. If you group by fruit_id, what count_vote and what r.text_c do you expect to get for the ID? You don't tell the DBMS (which would be something like MAX(count_vote) and MIN(r.text_c), for instance). MySQL should throw an error, but with ONLY_FULL_GROUP_BY disabled it silently replaces count_vote, r.text_c by ANY_VALUE(count_vote), ANY_VALUE(r.text_c) instead. This means you get arbitrarily picked values for a fruit.
The answer hence to your question is: Don't try to speed it up, but fix it instead. (Maybe you want to place a new request showing the query and explaining what it is supposed to do, so people can help you with that.)
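If the intent behind ORDER BY fruit_id, count_vote DESC, r_id followed by GROUP BY fruit_id was "the top-voted review per fruit, lowest r_id winning ties" (an assumption, since the question does not say), one deterministic way to express it is a NOT EXISTS filter. The sketch below runs on SQLite through Python's sqlite3 as a stand-in for MySQL, and the sample rows are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Column names follow the question; the data is hypothetical.
con.executescript("""
CREATE TABLE Reviews (r_id INTEGER, fruit_id INTEGER, customer_id INTEGER,
                      count_vote INTEGER, text_c TEXT);
INSERT INTO Reviews VALUES
  (1, 1, 7, 5, 'ok'),
  (2, 1, 8, 9, 'great'),
  (3, 2, 7, 4, 'sour'),
  (4, 2, 9, 4, 'fine');
""")

# Keep a review only if no other review of the same fruit beats it:
# more votes wins, and on equal votes the lower r_id wins.
top = con.execute("""
SELECT r.fruit_id, r.r_id, r.count_vote, r.text_c
FROM Reviews r
WHERE NOT EXISTS (
    SELECT 1
    FROM Reviews better
    WHERE better.fruit_id = r.fruit_id
      AND (better.count_vote > r.count_vote
           OR (better.count_vote = r.count_vote AND better.r_id < r.r_id))
)
ORDER BY r.fruit_id
""").fetchall()
con.close()
print(top)
```

Exactly one row per fruit survives, so the result can be joined to Fruits and Customers without a GROUP BY.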
Your Categories table does not seem to be joined or related to the others, so it produces a Cartesian product with all the other rows.
If you want distinct results, don't use GROUP BY but DISTINCT, so you can avoid an unnecessary subquery,
and you don't need an ORDER BY in a subquery.
SELECT category
, b.fruit_name
, u.name
, r.count_vote
, r.text_c
FROM Fruits b
INNER JOIN (
SELECT distinct fruit_id, count_vote, text_c, customer_id
FROM Reviews
) r ON b.fruit_id = r.fruit_id
INNER JOIN Customers u ON u.customer_id = r.customer_id
INNER JOIN Categories c ON ?????? /* Your Categories table seems not joined/related to the others */
WHERE category = 'Fruits';
For better readability you should use explicit join syntax and avoid the old join syntax based on comma-separated table names and WHERE conditions.
The next time you want help optimizing a query, please include the table/index structure, an indication of the cardinality of the indexes and the EXPLAIN plan for the query.
There appears to be absolutely no reason for a single sub-query here, let alone 2. Using sub-queries mostly prevents the DBMS optimizer from doing its job. So your biggest win will come from eliminating these sub-queries.
The CROSS JOIN creates a deliberate Cartesian join. It's also unclear whether any attributes from this table are actually required for the result, whether it is there to produce multiples of the same row in the output, or whether it is just an error.
The attribute category in the last line of your query is not attributed to any of the tables (but I suspect it comes from the categories table).
Further, your code uses a GROUP BY clause with no aggregation function. This will produce non-deterministic results and is a bug. Assuming that you are not exploiting a side-effect of that, the query can be re-written as:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN Reviews r
ON r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
ORDER BY r.fruit_id, count_vote DESC, r_id;
Since there are no predicates other than joins in your query, there is no scope for further optimization beyond ensuring there are indexes on the join predicates.
As all too frequently, the biggest benefit may come from simply asking the question of why you need to retrieve every single row in the tables in a single query.

Why can HAVING reference COUNT(column_name) but not the column_name itself?

I am trying to better understand how MySQL works. I came across a problem with subgroups. From question Unknown column in 'having clause', I understand why this code will return an error:
SELECT b.Title, b.Isbn
FROM Book AS b
INNER JOIN Writing AS w ON w.Book_id = b.ID
GROUP BY b.ID
HAVING w.Author_id = 1 AND b.Title LIKE "%Head%"
That error is: "Unknown column 'w.Author_id' in 'having clause'" because:
The SQL standard requires that HAVING must reference only columns in the GROUP BY clause or columns used in aggregate functions. However, MySQL supports an extension to this behavior, and permits HAVING to refer to columns in the SELECT list and columns in outer subqueries as well.
But if, instead of w.Author_id = 1, I use COUNT(w.Author_id) > 1, the code executes and works correctly:
SELECT b.Title, b.Isbn
FROM Book AS b
INNER JOIN Writing AS w ON w.Book_id = b.ID
GROUP BY b.ID
HAVING COUNT(w.Author_id) > 1 AND b.Title LIKE "%Head%"
So, my question is: what is it about COUNT() that makes w.Author_id accessible? I apologize if this is a silly/obvious question - I'm still a novice at SQL.
The documentation seems pretty clear on this subject. But let me see if I can explain it better for you.
The HAVING clause is essentially a WHERE clause that "takes place" after the GROUP BY. That is, the aggregation has already happened, so the data that is available is the aggregated data.
In your example, there is no Author_Id returned by the aggregation. And MySQL doesn't know how to generate one.
However, COUNT(w.Author_Id) is an aggregated result. MySQL can just add that (conceptually) to the results returned by the aggregation and filter on it.
Your query is equivalent to:
SELECT Title, Isbn
FROM (SELECT b.Title, b.Isbn, COUNT(*) as cnt
FROM Book b JOIN
Writing w
ON w.Book_id = b.ID
GROUP BY b.ID
) b
WHERE cnt > 1 AND b.Title LIKE '%Head%';
That said, the query is better written as:
SELECT b.Title, b.Isbn
FROM Book b JOIN
Writing w
ON w.Book_id = b.ID
WHERE b.Title LIKE '%Head%'
GROUP BY b.ID
HAVING COUNT(*) > 1;
You can filter on the title before aggregating and that is usually much more efficient.
When you put w.Author_id in the expression COUNT(w.Author_id) > 1 you are satisfying the requirement for HAVING clause that
HAVING must reference only columns in the GROUP BY clause or columns used in aggregate functions
while the Author_id column alone does not, because it is neither part of the GROUP BY clause nor inside an aggregate function (which COUNT is).
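The split between row-level and group-level filters can be sketched as follows. SQLite, via Python's sqlite3, is used here only as a stand-in for MySQL, and the sample books are invented. A condition on a single row (w.Author_id = 1) goes in WHERE, before grouping; a condition on the group (COUNT(w.Author_id) > 1) goes in HAVING, after grouping:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Table and column names follow the question; the rows are hypothetical.
con.executescript("""
CREATE TABLE Book (ID INTEGER PRIMARY KEY, Title TEXT, Isbn TEXT);
CREATE TABLE Writing (Book_id INTEGER, Author_id INTEGER);
INSERT INTO Book VALUES (1, 'Head First SQL', 'isbn-1'),
                        (2, 'Head Games', 'isbn-2'),
                        (3, 'Other', 'isbn-3');
INSERT INTO Writing VALUES (1, 1), (1, 2), (2, 1), (3, 1), (3, 2);
""")

# Row-level filter: evaluated per joined row, before GROUP BY.
by_author = con.execute("""
SELECT b.Title
FROM Book b JOIN Writing w ON w.Book_id = b.ID
WHERE w.Author_id = 1 AND b.Title LIKE '%Head%'
GROUP BY b.ID
ORDER BY b.ID
""").fetchall()

# Group-level filter: evaluated once per group, after GROUP BY.
multi_author = con.execute("""
SELECT b.Title, COUNT(w.Author_id) AS authors
FROM Book b JOIN Writing w ON w.Book_id = b.ID
WHERE b.Title LIKE '%Head%'
GROUP BY b.ID
HAVING COUNT(w.Author_id) > 1
""").fetchall()
con.close()
print(by_author)
print(multi_author)
```

Note that SQLite is more permissive than MySQL about bare columns in HAVING, so this sketch shows where each condition belongs rather than reproducing the original error message.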

MySql: order by along with group by - performance

I have a performance problem with a query that has ORDER BY and GROUP BY. I have checked similar problems on SO but I did not find a solution to this :(
I have something like this in my db schema:
pattern has many pattern_file belongs to project_template which belongs to project
Now I want to get projects filtered by some data(additional tables that I join) and want to get the result ordered for example by projects.priority and grouped by patterns.id. I have tried many things and to get the desired result I've figured out this query:
SELECT DISTINCT `projects`.* FROM `projects`
INNER JOIN `project_templates` ON `project_templates`.`project_id` = `projects`.`id`
INNER JOIN `pattern_files` ON `pattern_files`.`id` = `project_templates`.`pattern_file_id`
INNER JOIN `patterns` ON `patterns`.`id` = `pattern_files`.`pattern_id`
...[ truncated ]
INNER JOIN (SELECT DISTINCT projects.id FROM `projects` INNER JOIN `project_templates` ON `project_templates`.`project_id` = `projects`.`id`
INNER JOIN `pattern_files` ON `pattern_files`.`id` = `project_templates`.`pattern_file_id`
INNER JOIN `patterns` ON `patterns`.`id` = `pattern_files`.`pattern_id`
...[ truncated ]
WHERE [here my conditions] ORDER BY [here my order]) P
ON P.id = projects.id
WHERE [here my conditions]
GROUP BY patterns.id
ORDER BY [here my order]
From my research I have to INNER JOIN with subquery to conquer the problem "ORDER BY before GROUPing BY" => then I have put the same conditions on the outer query for performance purpose. The order by I had to use again in the outer query too, otherwise the result will be sorted by default.
Now there is real performance problem as I have about 6k projects and when I run this query without any conditions it takes about 15s :/ When I narrow the result by specify the conditions the time drastically dropped down. I've found somewhere that the subquery is run for every outer query row result which could be true when you watch at the execution time :/
Could you please give some advice how I can optimize the query? I do not work much with sql so maybe I do it from the wrong side from the very beginning?
P.S. I have tried WHERE projects.id IN (Select project.id FROM projects ....) and that discarded the performance issue but also discarded the ORDER BY before GROUPing BY
EDIT.
I want to retrieve list of projects, but I want also to filter it and order, and finally I want to get patterns.id unique(that is why I use the group by).
order by in your inner query (p) doesn't make sense (any inner sort will only have an arbitrary effect).
@Solarflare Unfortunately it does. group by will take the first row from the grouped result. It preserves the order for the join. Well, I believe that it is specific to MySQL. Furthermore, to keep the order from the subquery I could use ORDER BY NULL in the outer query :-)
Also, select projects.* ... group by pattern.id is fishy (although MySQL, in contrast to every other dbms, allows you to do this)
so we can assume I retrieve only projects.id, but from docs:
MySQL extends the use of GROUP BY to permit selecting fields that are not mentioned in the GROUP BY clause

My SQL query is returning results but they are repeated ~50 times. I don't understand why

The query I'm using calls on a few tables in the database and works fine. However, when I add line 10 to the mix it returns 50 or more repeated results. I'm still somewhat new to SQL and Sequel Pro so I'm sure the solution isn't too complicated but I am truly stumped right now.
Here is the code:
SELECT c.first_name, c.last_name, ca.company, ca.city, ca.state, ct.certificate_number, ct.certificate_date
FROM customer c, customer_type ctype, cust_address ca, certification ct, cust_prof_cert cp
WHERE ca.id_customer = c.id_customer LIKE cp.prof_cert_id_prof_cert
AND c.customer_type_id_customer_type = ctype.id_customer_type
AND ct.customer_id_customer = c.id_customer
AND ca.id_customer = c.id_customer
AND ctype.customer_type IN('CIRA','CIRA, CDBV')
AND ct.course_type_id_course_type = 1
AND ct.certificate_number IS NOT NULL
AND cp.prof_cert_id_prof_cert = "1"
ORDER BY ct.certificate_number ASC, c.last_name ASC;
Thank you for your time.
By writing your SQL like that you are not relating the data, just selecting it. I would recommend changing your SQL to use JOINs.
SELECT Orders.OrderID, Customers.CustomerName, Orders.OrderDate
FROM Orders
INNER JOIN Customers
ON Orders.CustomerID=Customers.CustomerID;
Here is an article that might be able to help you a bit: w3schools, Joins
Here's your query using the SQL92 syntax for joins. You should use this syntax instead of the SQL89 "comma-style" joins.
SELECT c.first_name, c.last_name, ca.company, ca.city, ca.state,
ct.certificate_number, ct.certificate_date
FROM customer AS c
INNER JOIN customer_type AS ctype ON c.customer_type_id_customer_type = ctype.id_customer_type
INNER JOIN cust_address AS ca ON ca.id_customer = c.id_customer
INNER JOIN certification AS ct ON ct.customer_id_customer = c.id_customer
INNER JOIN cust_prof_cert AS cp -- what's this join condition?
WHERE ca.id_customer = c.id_customer LIKE cp.prof_cert_id_prof_cert
AND ctype.customer_type IN('CIRA','CIRA, CDBV')
AND ct.course_type_id_course_type = 1
AND ct.certificate_number IS NOT NULL
AND cp.prof_cert_id_prof_cert = '1'
ORDER BY ct.certificate_number ASC, c.last_name ASC;
A few weird things I notice in this query:
The first term in the WHERE clause is strange. You should know that LIKE has higher precedence than = so this might not be doing what you think it's doing. It's as if you wrote
WHERE ca.id_customer = (c.id_customer LIKE cp.prof_cert_id_prof_cert)
Which means evaluate the LIKE and produce a 0 or a 1 to represent the boolean condition. Then look for a ca.id_customer matching that 0 or 1.
Given that strange term, I can find no other join condition for the cp table. The default join if you give no restriction for it is that every row matches every row in the joined tables. So if you have 50 rows where cp.prof_cert_id_prof_cert = 1, then it will effectively multiply the results from the rest of the joined tables by 50.
This is called a Cartesian product, or in MySQL parlance it's counted in SHOW STATUS as a Full join.
ctype.customer_type IN('CIRA','CIRA, CDBV') You have quoted the second and third strings together. Basically, this means you are trying to match the column against two strings, one of which happens to contain a comma.
You probably meant to write ctype.customer_type IN('CIRA','CIRA','CDBV') so the column may match any of these three values.
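The row multiplication described above can be reproduced with a small sketch. It uses SQLite via Python's sqlite3 as a stand-in for MySQL. The linking column customer_id on cust_prof_cert is hypothetical (the question never shows how that table relates to customer), and the rows are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Reduced, hypothetical versions of two tables from the question.
# cust_prof_cert.customer_id is an assumed linking column.
con.executescript("""
CREATE TABLE customer (id_customer INTEGER, last_name TEXT);
CREATE TABLE cust_prof_cert (customer_id INTEGER, prof_cert_id_prof_cert INTEGER);
INSERT INTO customer VALUES (1, 'Ames'), (2, 'Baker');
INSERT INTO cust_prof_cert VALUES (1, 1), (2, 1), (3, 1);
""")

# No condition links cp to c: every customer pairs with every cp row.
unrestricted = con.execute("""
SELECT COUNT(*)
FROM customer c, cust_prof_cert cp
WHERE cp.prof_cert_id_prof_cert = 1
""").fetchone()[0]

# With a join condition each customer matches only its own cp rows.
joined = con.execute("""
SELECT COUNT(*)
FROM customer c
JOIN cust_prof_cert cp ON cp.customer_id = c.id_customer
WHERE cp.prof_cert_id_prof_cert = 1
""").fetchone()[0]
con.close()
print(unrestricted, joined)
```

With 2 customers and 3 matching cust_prof_cert rows, the unrestricted version returns 2 x 3 rows, which is the same effect as the roughly 50-fold repetition in the question.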
I would suggest not querying multiple tables in your FROM clause; I believe this is the cause of your duplicate rows. If you separate the tables into individual inner or left joins (whichever you need), you can match the keys of each table explicitly instead of having SQL attempt to do this automatically.

ORDER BY is being ignored in subquery join?

I have 3 tables: users, projects, and files. Here's the relevant columns:
users: [userid](int)
projects: [userid](int) [projectid](int) [modified](datetime)
files: [userid](int) [projectid](int) [fileid](int) [filecreated](datetime)
I'm using a query to list all projects, but I also want to include the most recent file from another table. My approach to this was using a subquery to join on.
Here's what I came up with, but my problem is that it's returning the oldest file:
SELECT * FROM projects
INNER JOIN users ON projects.userid = users.userid
JOIN (SELECT filename,projectid FROM files
GROUP BY files.projectid
ORDER BY filecreated DESC) AS f
ON projects.projectid = f.projectid
ORDER BY modified DESC
I would think ORDER BY filecreated DESC would solve this, but it seems completely ignored.
I'm fairly new to SQL, perhaps I'm not approaching this the right way?
Your problem is here, in your subquery:
(SELECT filename,projectid FROM files
GROUP BY files.projectid
ORDER BY filecreated DESC) AS f
Since you're mixing grouped and non-grouped columns like that, I assume you're using MySQL. Remember that the ORDER BY clause is applied after the GROUP BY clause, so it cannot influence which row's values MySQL picks for each group. You cannot rely on the fact that MySQL allows such syntax (in standard SQL this query is simply invalid).
To fix that you need to get properly formed records in your subquery. That could be done, for example:
SELECT
files.filename,
files.projectid
FROM
(SELECT
MAX(filecreated) AS max_date,
projectid
FROM
files
GROUP BY
projectid) AS files_dates
LEFT JOIN
files
ON files_dates.max_date=files.filecreated AND files_dates.projectid=files.projectid
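A runnable sketch of the same MAX-then-join-back technique, using SQLite through Python's sqlite3 as a stand-in for MySQL. The files table is reduced to three columns and the rows are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Reduced files table from the question; the rows are hypothetical.
con.executescript("""
CREATE TABLE files (projectid INTEGER, filename TEXT, filecreated TEXT);
INSERT INTO files VALUES
  (1, 'a.txt', '2020-01-01 10:00:00'),
  (1, 'b.txt', '2020-03-01 10:00:00'),
  (2, 'c.txt', '2020-02-01 10:00:00');
""")

# Step 1 (derived table): newest timestamp per project.
# Step 2 (join back): fetch the full row carrying that timestamp.
latest = con.execute("""
SELECT f.projectid, f.filename
FROM (SELECT projectid, MAX(filecreated) AS max_date
      FROM files
      GROUP BY projectid) AS files_dates
JOIN files f
  ON f.projectid = files_dates.projectid
 AND f.filecreated = files_dates.max_date
ORDER BY f.projectid
""").fetchall()
con.close()
print(latest)
```

If two files of one project share the same newest timestamp, this approach returns both of them, so a tie-breaker (for example on fileid) may be needed.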
I assume you want a list of projects with the latest file and the user that created it:
SELECT p.projectid, u.username, f.filename, f.filecreated
FROM projects p
LEFT OUTER JOIN files f
ON f.projectid = p.projectid
-- keep only the newest file of each project (TOP 1 is not MySQL syntax
-- and would pick a single row overall, not one per project)
AND f.filecreated = (SELECT MAX(f2.filecreated) FROM files f2
WHERE f2.projectid = p.projectid)
LEFT OUTER JOIN users u ON u.userid = f.userid
ORDER BY p.modified DESC