Filtering Condition in n-Table Joins - MySQL

We seem to have a need for a multi-table JOIN operation and I am referring to some notes from an RDBMS class that I took several years ago. In this class the instructor graphically depicted the structure of a generic N-table JOIN query.
The figure seems to conform to examples of multi-table JOINs that I have seen, but I have a question. Does the WHERE clause, which provides the filtering, necessarily have to be the last clause in the query? Intuitively, it seems we could impose filtering conditions before a subsequent JOIN clause, to properly scope the data before feeding it into the next JOIN operation.

Syntactically, the where clause has to be at the end. But the query plan will take it into account and use it to filter wherever possible. Note that just because you specify the from and joins in a given order doesn't mean the query will actually execute that way; it may rearrange them to whatever order it thinks will work best (unless you specify straight_join).
That said, having the where at the end can make some queries harder to read.
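For illustration, here is a sketch with hypothetical tables t1, t2, and t3. The condition on t1 is written last, but the optimizer is free to apply it before performing the joins:

SELECT t1.id, t2.name, t3.val
FROM t1
JOIN t2 ON t2.t1_id = t1.id
JOIN t3 ON t3.t2_id = t2.id
WHERE t1.type = 'x';  -- written last, but may be applied to t1 before either join

SELECT STRAIGHT_JOIN t1.id, t2.name  -- pins the join order to the written order
FROM t1
JOIN t2 ON t2.t1_id = t1.id
WHERE t1.type = 'x';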

SQL queries consist of a sequence of clauses. The diagram you have is rather misleading. Common clauses -- and the order in which they must appear for a valid query -- are:
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
Note that JOIN is not a clause. It is an operator, and one that appears only in the FROM clause.
So, the answer to your question is that the WHERE clause immediately follows the FROM clause. The only "sort-of" exception is when a WINDOW clause is included; syntactically, that falls between the HAVING and the ORDER BY.
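For example, a query that uses all of these clauses (hypothetical customers and orders tables) must arrange them in exactly this order:

SELECT c.country, COUNT(*) AS order_count
FROM customers c JOIN
     orders o
     ON o.customer_id = c.id        -- JOIN is an operator inside the FROM clause
WHERE o.created_at >= '2020-01-01'  -- WHERE immediately follows the FROM clause
GROUP BY c.country
HAVING COUNT(*) > 10
ORDER BY order_count DESC;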
Next, multi-table joins are often quite efficient, and there is no reason whatsoever to discourage their use. Support for joins, in fact, is one of the key features that relational databases are designed around.
And finally: what actually gets executed is not the string that you write. A query describes the result set you want; it does not describe the processing. SQL is a descriptive language, not a procedural language.
The SQL engine has two steps to convert your query string to an executable form (typically a directed acyclic graph). One is to compile the query, and the second is to optimize the query. So, where filtering actually occurs . . . that depends on what the optimizer decides. And where it occurs has little relationship to what you think of when you think of SQL queries (DAGs don't generally have nodes called "select" or "join").
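If you want to see what the optimizer decided, you can ask for the plan. A minimal sketch, assuming MySQL 8.0+ and the hypothetical customers/orders tables above:

EXPLAIN FORMAT=TREE
SELECT c.country, COUNT(*) AS order_count
FROM customers c JOIN
     orders o
     ON o.customer_id = c.id
WHERE o.created_at >= '2020-01-01'
GROUP BY c.country;

The tree output shows where each filter and join actually lands in the plan, which rarely mirrors the textual order of the query.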


Does DISTINCT automatically sort the result in MySQL?

Here is the GROUP_CONCAT() tutorial on GeeksForGeeks.
In "Queries 2", the output is in ascending order, but there is no ORDER BY clause.
Here is the picture of "Queries 2".
Could anyone tell me why?
Any help would be really appreciated!
This is one of those oddballs where there is likely an implicit sort happening behind the scenes, which MySQL uses to optimize the DISTINCT execution.
You can test this yourself pretty easily:
CREATE TABLE t1 (c1 VARCHAR(50));
INSERT INTO t1 VALUES ('zebra'),('giraffe'),('cattle'),('fox'),('octopus'),('yak');
SELECT GROUP_CONCAT(c1) FROM t1;
SELECT GROUP_CONCAT(DISTINCT c1) FROM t1;
GROUP_CONCAT(c1)
zebra,giraffe,cattle,fox,octopus,yak
GROUP_CONCAT(DISTINCT c1)
cattle,fox,giraffe,octopus,yak,zebra
It's not uncommon to find sorted results where no ORDER BY was specified. The output of window functions is a good example of this.
Imagine you were tasked, as a human, with picking only the distinct items from a list. You would likely first sort the list and then pick out duplicates, right? And when you hand the list back to the person who requested it, you wouldn't scramble the data back up to be unsorted, I would assume. Why do the extra work? What you are seeing here is a byproduct of the optimized execution path chosen by the MySQL server.
The key takeaway is "byproduct". If I specifically wanted the output of GROUP_CONCAT to be sorted, I would specify exactly what I want and would not rely on this implicit sorting behavior. We can't guess what the execution path will be. An RDBMS makes a lot of decisions when SQL is submitted in order to optimize execution, and depending on data size and the other steps a statement needs, this behavior may appear in one SQL statement and not another. Likewise, it may hold one day and not the next.
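For example, GROUP_CONCAT accepts its own ORDER BY (and SEPARATOR) modifiers, so against the t1 table above the sorted output can be made explicit instead of accidental:

SELECT GROUP_CONCAT(DISTINCT c1 ORDER BY c1 SEPARATOR ',') FROM t1;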
TL;DR Never omit an ORDER BY clause from a query if you rely on the order for something.
Does DISTINCT automatically sort the result in MySQL?
No. NO! Be careful!
SQL is all about sets of rows. Without ORDER BY clauses, SQL queries return the rows of their result sets in an "unpredictable" order. "Unpredictable" is like random, but worse. If the order is truly random, you have a chance to catch any ordering problem when you're testing. Unpredictable means the server returns rows in any convenient order. This means everything works as you expect until some day in the future when it doesn't, without warning. (MySQL might start using some kind of parallel algorithm in the future.)
Now it is true that DISTINCT result sets from modestly sized tables are often generated using a sorting / deduplicating algorithm in the server. But that is an implementation detail. MySQL and other table servers are complex enough that relying on implementation details is not wise. The good news: if you include an ORDER BY clause requesting the same order that the implementation produces anyway, performance is usually unchanged.
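A sketch using the t1 table from the earlier answer: if the server was going to sort for deduplication anyway, the explicit ORDER BY typically adds no cost:

SELECT DISTINCT c1 FROM t1 ORDER BY c1;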
SQL is declarative, not procedural. We specify what we want, not how to get it. It's probably the only declarative language most of us ever see, so it's easy to make the mistake of thinking it is procedural.

Is it wrong to move part of your WHERE clause to a HAVING clause if it runs faster?

In my long, complicated query that is not using aggregation, I have moved one of the ANDed where clause parts to a new HAVING clause.
Logically, the result is the same: rows are filtered before being returned.
Semantically, the result may be different in some way I don't understand.
But performance-wise, this runs 3x faster. I understand this is because the thing I moved is doing an expensive NOT EXISTS (SELECT ...). Previously the server was spending time evaluating this for rows that could be excluded using the other simpler rules.
Are there any official or unofficial rules I have broken with this optimization approach?
No, there are no rules as such.
Since the joins are processed before the WHERE clause, they reduce the number of rows that will be checked against the WHERE clause.
It is usually somewhat frowned upon, because you could miss some rows that are needed.
So basically you can do it, but you have to check that all the wanted rows are there.
The order of WHERE clauses ANDed together --> The optimizer is free to rearrange, however
There are some exceptions: FULLTEXT search first; subqueries last. (I am not sure of this.)
Referencing aggregations --> must be in HAVING
Otherwise WHERE and HAVING have the same semantics.
WHERE is logically done before GROUP BY; HAVING is done after.
It seems that you have discovered that NOT EXISTS is more efficient if it is somehow forced to come after other tests; and moving it to HAVING seems to have achieved that.
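Here is a sketch of the kind of rewrite the question describes, using hypothetical orders and refunds tables. Note that using HAVING as a plain row filter without GROUP BY is a MySQL/MariaDB extension, not standard SQL:

-- Before: the expensive NOT EXISTS sits in WHERE alongside the cheap test
SELECT o.id, o.status
FROM orders o
WHERE o.status = 'open'
  AND NOT EXISTS ( SELECT 1 FROM refunds r WHERE r.order_id = o.id );

-- After: the cheap test stays in WHERE; the expensive test moves to HAVING,
-- so it is only evaluated for rows that survived the WHERE
SELECT o.id, o.status
FROM orders o
WHERE o.status = 'open'
HAVING NOT EXISTS ( SELECT 1 FROM refunds r WHERE r.order_id = o.id );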
Submit a bug report (jira.mariadb.com) suggesting that you have found a case where the Optimizer is not juggling the clauses as well as it should.
If you show us the actual query, we might be able to dig deeper.

Does the order of JOIN vs WHERE in SQL affect performance?

In SQL, how much does the order of JOIN vs WHERE affect the performance of a query?
a) SELECT […] FROM A JOIN ( SELECT […] FROM B WHERE CONDITION ) ON […]
b) SELECT […] FROM A JOIN ( SELECT […] FROM B ) ON […] WHERE CONDITION
My inner feeling tells me that option a) should be more performant: doing the join first and then running a WHERE seems much less performant than first running a WHERE on one table and joining from the results. But I'm not sure, as this depends on the internal optimizations of the SQL engine itself.
It would be nice to know if the behavior is the same for both MySQL and PostgreSQL, and also whether it depends on other clauses such as GROUP BY or ORDER BY.
Postgres has a smart optimizer, so the two versions should have similar execution plans in most cases (I'll return to that in a moment).
MySQL has a tendency to materialize subqueries. Although this has gotten better in more recent versions, I still recommend avoiding it. Materializing subqueries prevents the use of indexes and can have a significant impact on performance.
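A sketch with hypothetical tables A and B: instead of wrapping B in a derived table that may get materialized, apply the condition directly so the indexes on B remain usable:

-- Instead of: ... JOIN (SELECT ... FROM B WHERE condition) b ON ...
SELECT a.col1, b.col2
FROM A a JOIN
     B b
     ON b.a_id = a.id
WHERE b.status = 'active';  -- condition applied to B directly, no materialization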
One caveat: If the subquery is complicated, then it might be better to filter as part of the subquery. For instance, if it is an aggregation, then filtering before aggregating usually results in better performance. That said, Postgres is smart about pushing conditions into the subquery. So, if the outer filtering is on a key used in aggregation, Postgres is smart enough to push the condition into the subquery.
All other factors being equal, I would expect the A version to perform better than the B version, as you also seem to expect. The main reason for this is that the A version lets the database throw out rows using the WHERE clause in the subquery. Then the join only has to involve a smaller intermediate table. The exact difference in performance between the two would depend on the underlying data and the actual queries. Note that it is even possible that both queries could be optimized under the hood to the same or very similar execution plan.

Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?

I visited an interesting job interview recently. There I was asked a question about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away, that this can be optimized using an INNER JOIN with another table (possibly temporary one), which will contain only those scalars. My answer was accepted and there was a note from the reviewer, that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed too trivial and too widely used for a modern RDBMS not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during an index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop; no joins are done. So, I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database, and my tests supported that position. But I didn't take care about test purity, and that Postgres instance was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, that is a performance match, I think. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem indeed existed in MySQL in some cases. But other than that, I didn't find any other problems related to bad IN () treatment. I didn't find any proof of the contrary, unfortunately. If you did, please kick me.
SQLite:
The documentation page hints at some problems, but I tend to believe the things described there are really at a conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other constraints; it was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR statements (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
Of course, in cases where a list of scalars is longer than the database protocol packet allows, an INNER JOIN might be the only solution available.
I think that in some cases query parsing time alone (if the query was not prepared) can kill performance.
Also, databases may be unable to prepare an IN (?) query with a variable number of placeholders, which leads to reparsing it again and again (and that may kill performance). Actually, I never tried it, but I think that even in such cases query parsing and planning is not huge compared to query execution.
But other than that, I do not see other problems. Well, other than the problem of just HAVING this problem. If you have queries that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
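A sketch of that approach, with hypothetical names: load the scalars into an indexed temporary table and join against it:

CREATE TEMPORARY TABLE tmp_ids (x INT PRIMARY KEY);
INSERT INTO tmp_ids VALUES (1), (2), (3);  -- ... and thousands more values

SELECT t.*
FROM big_table t
INNER JOIN tmp_ids USING (x);  -- equivalent to WHERE t.x IN (1, 2, 3, ...)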
Any description of optimization is definitely database specific. However, MySQL is quite specific about how it optimizes IN:
Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic, and in that case we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
Either way, this means formatting the prepared statement dynamically (since the number of placeholders is dynamic too), and it also results in excessive hard parsing: as many unique queries as there are distinct IN-list lengths (IN (?), IN (?,?), ...).
I would either load these values into a table and use a join, as you mentioned (unless the loading overhead is too high), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (an array) coming from memory (PL/SQL, Java, etc.).
If the number of values is large, I would consider using EXISTS (SELECT 1 FROM mytable m WHERE m.key = x.key) or EXISTS (SELECT x FROM foo(params)) instead of IN. In such cases EXISTS provides better performance than IN.

MySQL - SELECT, JOIN

A few months ago I was programming a simple application in PHP with another guy. We needed to perform a SELECT from multiple tables based on a userid and another value that you needed to get from the row that was selected by userid.
My first idea was to create multiple SELECTs and parse all the output in the PHP script (with all those mysql_num_rows() and similar functions for checking), but then the guy told me he'd do it. "Okay, no problem!" I thought, just that much less for me to write. Well, what a surprise when I found out he did it with just one SQL statement:
SELECT
d.uid AS uid, p.pasmo_cas AS pasmo, d.pasmo AS id_pasmo ...
FROM
table_values AS d, sectors AS p
WHERE
d.userid='$userid' and p.pasmo_id=d.pasmo
ORDER BY
datum DESC, p.pasmo_id DESC
(shortened piece of the statement (...))
Mostly I need to know the differences between this method (is it the right way to do this?) and JOIN - when should I use which one?
Also, any references to explanations and examples of these two would come in pretty handy (not from the MySQL reference manual, though - I'm really a novice at this kind of stuff and it's written pretty roughly there.)
The comma (,) notation was replaced in the ANSI-92 standard, and so is in one sense now 20 years out of date.
Also, when doing OUTER JOINs and other more complex queries, the JOIN notation is much more explicit, readable, and (in my opinion) debuggable.
As a general principle, avoid the comma notation and use JOIN.
In terms of precedence, a JOIN's ON clause happens before the WHERE clause. This allows things like a LEFT JOIN b ON a.id = b.id WHERE b.id IS NULL to check for cases where there is NOT a matching row in b.
Using the comma notation is similar to processing the WHERE and ON conditions at the same time.
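For example, here is the query from the question rewritten in the explicit JOIN notation; it produces the same result set:

SELECT
d.uid AS uid, p.pasmo_cas AS pasmo, d.pasmo AS id_pasmo ...
FROM
table_values AS d
INNER JOIN sectors AS p ON p.pasmo_id = d.pasmo
WHERE
d.userid='$userid'
ORDER BY
datum DESC, p.pasmo_id DESC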
This definitely looks like the ideal scenario for a join, so you can avoid returning more data than you actually need. This: http://www.w3schools.com/sql/sql_join.asp or this: http://en.wikipedia.org/wiki/Join_(SQL) should help you get started with joins. I'm also happy to help you write the statement if you can give me a brief outline of the columns / data in each table (primarily I need two matching columns to join on).
The use of the WHERE clause is a valid approach but, as @Dems noted, it has been superseded by the JOIN syntax.
However, I would argue that in some cases, use of the WHERE clauses to achieve joins can be more readable and understandable than using JOINs.
You should make yourself familiar with both methods of joining tables.