Let's say I have a query like this:
SELECT bla WHERE foo LIKE '%bar%' AND boo = 'bar' AND whatvr IN ('foo', 'bar')...
I was wondering if MySQL continues to check all conditions when retrieving results.
For example, if foo is not LIKE '%bar%', will it still check whether boo = 'bar', and so on?
Would it be any faster if I put conditions that are less likely to be true at the end?
I'm sorry if this seems to be a stupid question, I'm a complete noob when it comes to SQL :)
I don't think there are any guarantees about whether or not multiple conditions will be short-circuited, but...
In general, you should treat the query optimiser as a black box and assume -- unless you have evidence to the contrary -- that it will do its job properly. The optimiser's job is to ensure that the requested data is retrieved as efficiently as possible. If the most efficient plan involves short-circuiting then it'll do it; if it doesn't then it won't.
(Of course, query optimisers aren't perfect. If you have evidence that a query isn't being executed optimally then it's often worth re-ordering and/or re-stating the query to see if anything changes.)
What you're looking for is documentation on MySQL's short-circuit evaluation. However, the best I have been able to find is reports from people who could not find that documentation either, but who claim to have tested it and found it to be true, i.e., MySQL short-circuits.
Would it be any faster if I put conditions that are less likely to be true at the end?
No, the optimizer will try and optimize (!) the order of processing. So, as for the order of tests, you should not assume anything.
I would not count on that: see Where Optimisations. That link explains that other criteria take precedence over the order in which the conditions are written.
You can't rely on MySQL evaluating conditions from left to right (unlike most programming languages). This is because the "WHERE clause optimizer" looks for conditions on indexed columns and uses that subset first.
For query optimization see the chapter Optimizing SELECT Statements in the MySQL reference manual.
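One way to see that the written order of ANDed conditions need not matter is to compare the plans for both orderings. Here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. SQLite is not MySQL and MySQL's optimizer may behave differently; the table t and the index idx_boo are made up for illustration.

```python
import sqlite3

# Hypothetical table mirroring the question's columns; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (bla TEXT, foo TEXT, boo TEXT, whatvr TEXT)")
conn.execute("CREATE INDEX idx_boo ON t (boo)")


def plan(where_clause):
    """Return the planner's detail strings for a SELECT with this WHERE clause."""
    rows = conn.execute("EXPLAIN QUERY PLAN SELECT bla FROM t WHERE " + where_clause)
    return [row[3] for row in rows]  # column 3 holds the human-readable detail


# Same two conditions, written in opposite orders.
plan_like_first = plan("foo LIKE '%bar%' AND boo = 'bar'")
plan_eq_first = plan("boo = 'bar' AND foo LIKE '%bar%'")

print(plan_like_first)
print(plan_eq_first)
```

In MySQL itself the equivalent check is to put EXPLAIN in front of both versions of the query and compare the output.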
If it does short-circuit when the first condition fails (which is most likely), it would be best to put the conditions that are most likely to fail first!
Let's say we have 3 conditions, and all must be true (separated by "AND").
slow case:
1. never fail. All rows are looked through and success.
2. sometimes fail. All rows are looked through and still success.
3. often fail. All rows are looked through and this time we fail.
Result: It took a while, but can't find a match.
fast case:
1. often fail. All rows are looked through and matching fail.
2. sometimes fail. NOT looked through, because searching ended due to short-circuit.
3. never fail. NOT looked through, because searching ended due to short-circuit.
Result: Quickly figured, no match.
Correct me if I'm wrong.
I can imagine that all conditions are checked for each row looked at, which makes this matter a lot less.
If your fields are ordered in the same order as your conditions, you could maybe measure the difference.
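The slow case / fast case reasoning above can be put into numbers with a toy model. This is only a sketch of per-row short-circuit AND in plain Python, not of MySQL's executor (which may reorder the conditions anyway); the three predicates are invented stand-ins for "never", "sometimes" and "often" failing conditions.

```python
# Toy model of per-row short-circuit AND; this is NOT MySQL's actual executor.
rows = range(1000)


def count_evaluations(predicates):
    """Count predicate calls when each row stops at its first failing test."""
    calls = 0
    matches = 0
    for row in rows:
        for predicate in predicates:
            calls += 1
            if not predicate(row):
                break  # short-circuit: skip the remaining predicates for this row
        else:
            matches += 1  # no predicate failed, so the row matches
    return calls, matches


def never_fails(row):
    return True


def sometimes_fails(row):
    return row % 2 == 0  # fails for half of the rows


def often_fails(row):
    return row % 100 == 0  # fails for 99% of the rows


slow_calls, slow_matches = count_evaluations([never_fails, sometimes_fails, often_fails])
fast_calls, fast_matches = count_evaluations([often_fails, sometimes_fails, never_fails])

print(slow_calls, fast_calls)  # 2500 1020
```

Both orderings return the same 10 rows; only the number of predicate evaluations differs, and only in this model where the written order is the evaluation order.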
In my long, complicated query that is not using aggregation, I have moved one of the ANDed where clause parts to a new HAVING clause.
Logically, the result is the same, rows are filtered before returned.
Semantically, the result may be different in some way I don't understand.
But performance-wise, this runs 3x faster. I understand this is because the thing I moved is doing an expensive NOT EXISTS (SELECT ...). Previously the server was spending time evaluating this for rows that could be excluded using the other simpler rules.
Are there any official or unofficial rules I have broken with this optimization approach?
No, there are no rules as such.
As the joins come before the WHERE clause, you would reduce the number of rows that will be checked against the WHERE clause.
It is usually somewhat frowned upon, because you could miss some rows that are needed.
So basically you can do it, but you have to check that all the wanted rows are there.
The order of WHERE clauses ANDed together --> The optimizer is free to rearrange, however
There are some exceptions: FULLTEXT search first; subqueries last. (I am not sure of this.)
Referencing aggregations --> must be in HAVING
Otherwise WHERE and HAVING have the same semantics.
WHERE is logically done before GROUP BY; HAVING is done after.
It seems that you have discovered that NOT EXISTS is more efficient if it is somehow forced to come after other tests; and moving it to HAVING seems to have achieved that.
Submit a bug report (jira.mariadb.com) suggesting that you have found a case where the Optimizer is not juggling the clauses as well as it should.
If you show us the actual query, we might be able to dig deeper.
My queries are running slowly, with some indexes chosen over others. I am trying to find a tool or a guide that would help me figure out why MySQL decided to prefer one index or one table (in the case of joins) over others, so that I can fine-tune the index or the query.
Till now, I haven't come across an article which explains it in detail or a tool which can provide me with the details of it.
Any inputs will be appreciated. Thanks a ton in advance!
As the Optimizer gets more sophisticated, it gets harder to understand what it is doing. The latest improvements involve "cost-based" analysis of possible execution methods. For many queries, it is obvious that one index would be better than another.
There are 4 views into what is going on:
EXPLAIN is quite limited. It does not handle LIMIT very well, nor does it necessarily say which step uses the filesort, or even if there are multiple filesorts. In general, the "Rows" column is useful, but in certain situations, it is useless. Simple rule: A big number is a bad sign.
EXPLAIN EXTENDED + SHOW WARNINGS; provides the rewritten version of the query. This does not do a lot. It does give clues of the distinction between ON and WHERE in JOINs.
EXPLAIN FORMAT=JSON provides more details on the cost-based analysis and spells out various steps, including filesorts.
"Optimizer trace" goes a bit further. (It is rather tedious to read.)
As for a "visualization", no. Anyway, EXPLAIN and its friends only work with what they have. That is they do not give clues of "what if you added INDEX(a,b)". That is what is really needed. Nor does it effectively point out that you should not "hide an indexed column in a function call". Example: WHERE DATE(dt) = '2019-01-23'. Note that some 'operators' are effectively function calls.
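To illustrate the "hidden indexed column" point with the DATE(dt) example, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in here and MySQL's EXPLAIN output looks different; the table events and the index idx_dt are assumptions made up for the demo. The second query expresses the same day as a range on the bare column.

```python
import sqlite3

# Hypothetical table; SQLite stands in for MySQL in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, dt TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_dt ON events (dt)")


def plan(sql):
    """Return the query plan details for a statement as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Indexed column hidden inside a function call.
wrapped = plan("SELECT * FROM events WHERE DATE(dt) = '2019-01-23'")

# Same day written as a half-open range on the bare column.
ranged = plan(
    "SELECT * FROM events WHERE dt >= '2019-01-23' AND dt < '2019-01-24'"
)

print(wrapped)  # a full scan of the table
print(ranged)   # a search using idx_dt
```

The range form is the usual rewrite: the column stands alone on one side of the comparison, so an index on it can be considered.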
I've seen a few "graphical Explains", but they seem to be nothing more than boxing up the rows of EXPLAIN and drawing lines between them.
I have been struggling with these problems for many years, and have written a partial answer -- namely a "cookbook". It approaches indexing from the other direction -- explaining what index to add for a given SELECT. Unfortunately, it only works for simpler queries. http://mysql.rjweb.org/doc.php/index_cookbook_mysql
I work on a lot of performance questions on this forum in hopes of getting more insight into what to add to the cookbook next. For now, you can help me by posting your tough query, together with EXPLAIN SELECT and SHOW CREATE TABLE(s).
Some random comments:
"Index merge intersection" is perhaps never as good as a composite index.
"Index merge union" is almost never used. You might be able to turn OR into UNION effectively.
The newer Optimizer creates an index on the fly for "derived tables" (JOIN ( SELECT ... )). But this is rarely as efficient as rewriting the query to avoid such a subquery when it returns lots of rows. (Again, none of the EXPLAINs will point you this way.)
Something often forgotten about (but does show up as an unexplained large "Rows"): COLLATIONs must match.
A trick to make use of the PK clustering: PRIMARY KEY(foo, id), INDEX(id)
Nothing (except experience) says how nearly useless "prefix" indexing is (INDEX(bar(10))).
FORCE INDEX is handy for experimenting, but almost always a bad idea for production.
In a SELECT with a JOIN, and a WHERE that mentions only one of the tables, the Optimizer will usually pick the table mentioned in WHERE as the first table. Then it will do a "Nested Loop Join" into the other table(s).
(I should add some of this to my Cookbook. 'Stay tuned'. Update: Done.)
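The comment above about turning OR into UNION can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in and only shows that the two forms return the same rows; whether the UNION form is actually faster in MySQL has to be checked with EXPLAIN and timing on real data. The table t, its columns and both indexes are hypothetical.

```python
import sqlite3

# Hypothetical table with one index per column; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, a INTEGER, b INTEGER)")
conn.execute("CREATE INDEX idx_a ON t (a)")
conn.execute("CREATE INDEX idx_b ON t (b)")
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    [(i % 10, i % 7) for i in range(1000)],
)

# Original form: one query with OR across two differently indexed columns.
or_ids = [row[0] for row in conn.execute(
    "SELECT id FROM t WHERE a = 3 OR b = 5 ORDER BY id"
)]

# Rewritten form: each branch can use its own index; UNION removes duplicates.
union_ids = [row[0] for row in conn.execute(
    "SELECT id FROM t WHERE a = 3 "
    "UNION "
    "SELECT id FROM t WHERE b = 5 "
    "ORDER BY id"
)]

print(or_ids == union_ids)
```

UNION (not UNION ALL) is used so that rows matching both branches are returned only once, as with OR.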
I recently went to an interesting job interview. There I was asked a question about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a simple list of scalars.
I answered right away, that this can be optimized using an INNER JOIN with another table (possibly temporary one), which will contain only those scalars. My answer was accepted and there was a note from the reviewer, that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
But when I walked out, I started to have some doubts. The condition seemed rather trivial and widely used for modern RDBMS not to be able to optimize it. So, I started some digging.
PostgreSQL:
It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during the index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop. No joins are done. So I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that position. But I didn't care about test purity and Postgres was running under Vagrant, so I might be wrong.
MSSQL Server:
MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, that is a performance match, I think. I didn't do any tests since I don't have any experience with this RDBMS.
MySQL Server:
The 13th of these slides says that before 5.0 this problem did indeed occur in MySQL in some cases. But other than that, I didn't find any other problem related to bad IN () treatment. I didn't find any proof of the opposite either, unfortunately. If you did, please kick me.
SQLite:
Documentation page hints some problems, but I tend to believe things described there are really at conceptual level. No other information was found.
So, I'm starting to think I misunderstood my interviewer or misused Google ;) Or, may be, it's because we didn't set any conditions and our talk became a little vague (we didn't specify any concrete RDBMS or other conditions. That was just abstract talk).
It looks like the days when databases rewrote IN() as a set of OR statements (which can sometimes cause problems with NULL values in lists, btw) are long gone. Or not?
Of course, in cases where a list of scalars is longer than the maximum allowed database protocol packet, INNER JOIN might be the only solution available.
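For reference, the two approaches being compared look roughly like this. It is a sketch using Python's built-in sqlite3 module as a stand-in (not PostgreSQL, SQL Server or MySQL); the tables items and wanted_ids are made up, and the list is kept at 500 values to stay under SQLite's bound-parameter limit in older versions.

```python
import sqlite3

# Hypothetical data; SQLite stands in for a real server here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"item{i}") for i in range(10000)],
)

wanted = list(range(0, 10000, 20))  # 500 scalars

# Variant 1: a long IN list, one placeholder per scalar.
placeholders = ", ".join(["?"] * len(wanted))
in_rows = conn.execute(
    f"SELECT id FROM items WHERE id IN ({placeholders}) ORDER BY id",
    wanted,
).fetchall()

# Variant 2: load the scalars into a temporary table with a primary key, then join.
conn.execute("CREATE TEMP TABLE wanted_ids (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted_ids VALUES (?)", [(i,) for i in wanted])
join_rows = conn.execute(
    "SELECT items.id FROM items "
    "INNER JOIN wanted_ids ON wanted_ids.id = items.id "
    "ORDER BY items.id"
).fetchall()

print(len(in_rows), len(join_rows))
```

Both variants return the same rows. Which one is faster depends on the engine, so it has to be measured per RDBMS.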
I think in some cases query parsing time (if it was not prepared) alone can kill performance.
Also, databases could be unable to prepare an IN(?) query, which will lead to reparsing it again and again (which may kill performance). Actually, I never tried, but I think that even in such cases query parsing and planning is not huge compared to query execution.
But other than that I do not see other problems. Well, other than the problem of just HAVING this problem. If you have queries, that contain thousands of IDs inside, something is wrong with your architecture.
Do you?
Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.
Any description of optimization is definitely database specific. However, MySQL is quite specific about how it optimizes in:
Returns 1 if expr is equal to any of the values in the IN list, else
returns 0. If all values are constants, they are evaluated according
to the type of expr and sorted. The search for the item then is done
using a binary search. This means IN is very quick if the IN value
list consists entirely of constants.
This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.
I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
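The difference between the two lookup strategies mentioned above (a binary search over a sorted constant list versus sequential OR-style comparisons) can be sketched in plain Python. This only illustrates the idea and is not how any server implements it; the list values are arbitrary.

```python
from bisect import bisect_left


def in_sorted(sorted_values, x):
    """Membership test by binary search; sorted_values must already be sorted."""
    i = bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x


def in_sequential(values, x):
    """Membership test by comparing one value after another, like a chain of ORs.

    Returns (found, number_of_comparisons).
    """
    comparisons = 0
    for value in values:
        comparisons += 1
        if value == x:
            return True, comparisons
    return False, comparisons


in_list = [42, 7, 19, 3, 88, 61, 25, 10]  # arbitrary constants, as written in the query
sorted_list = sorted(in_list)             # sorted once, before any row is tested

print(in_sorted(sorted_list, 19))   # True
print(in_sequential(in_list, 42))   # (True, 1)
print(in_sequential(in_list, 10))   # (True, 8)
```

The sequential version wins only when the value being looked up tends to sit near the front of the list, which is the case described above where common elements appear first.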
I think it is bad application design. The values used with the IN operator are most probably not hardcoded but dynamic. In such a case we should always use prepared statements, the only reliable mechanism to prevent SQL injection.
In each case it will result in dynamically formatting the prepared statement (as number of placeholders is dynamic too) and it will also result in having excessive hard parsing (as many unique queries as we have number of IN values - IN (?), IN (?,?), ...).
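The point about one distinct statement per list length can be seen in a short sketch. It uses Python's built-in sqlite3 module with "?" placeholders purely as an illustration; the helper in_query and the table t are hypothetical names.

```python
import sqlite3


def in_query(n):
    """Build the statement text for an IN list of n values (n >= 1)."""
    return "SELECT id FROM t WHERE id IN (" + ", ".join(["?"] * n) + ")"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

# Each list length yields a different statement text, so each one is parsed separately.
statements = {in_query(n) for n in (1, 2, 3)}

# The values themselves are still bound, never concatenated into the SQL.
found = sorted(row[0] for row in conn.execute(in_query(3), (2, 5, 99)))

print(len(statements), found)  # 3 [2, 5]
```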
I would either load these values into a table and use a join as you mentioned (unless loading them is too much overhead), or use an Oracle pipelined function, IN foo(params), where the params argument can be a complex structure (array) coming from memory (PLSQL/Java etc).
If the number of values is larger I would consider using EXISTS (select from mytable m where m.key=x.key) or EXISTS (select x from foo(params)) instead of IN. In such a case EXISTS provides better performance than IN.
I have a MYSQL question:
Can anybody tell me how to measure whether an IN() clause is becoming nonperformant or not?
So far I have a table which holds about 5.000 rows, and the IN() will check up to 100 IDs. It may grow up to 50.000 in the next two years.
Thanks
NOTE
By nonperformant I mean ineffective, slow, badly performing, ...
UPDATE
It's a decision-finding problem, so the EXPLAIN command in MySQL does not answer my question. When the performance is bad, I can see it myself. But I want to know it before I start designing in a way which might be wrong...
UPDATE
I am searching for a measuring technique for general purpose.
You would use the EXPLAIN statement to check how the query is being executed. It displays information from the optimizer about the query execution plan, how it would process the statement, and how tables are joined and in which order.
There are many times that a JOIN can be used in place of an IN, which should yield better performance. Additionally, indices make a significant difference on how fast the query runs.
We would need to see your query and an EXPLAIN at the very least.
You can use the MySQL EXPLAIN statement to get the query plan. Just enter EXPLAIN in front of your SELECT and see what it says. You will need to learn how to read it, but it is very helpful in identifying whether a query is as fast as you would expect.
MySQL also does not have the best query optimizer. In my experience it is sometimes faster to run 100 simple and fast queries than to run a complicated join. This is a rare case, but I have gotten performance increases from it.
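If the goal is to know before committing to a design, one general-purpose technique is to generate test data at the expected sizes and time the query. Below is a rough sketch with Python's built-in sqlite3 module as a stand-in; the numbers it prints say nothing about MySQL, so the same experiment would have to be run against a MySQL test database. The helper time_in_query and the table t are hypothetical, and the sizes are the ones from the question (5.000 and 50.000 rows, 100 IDs).

```python
import sqlite3
import time


def time_in_query(table_rows, in_list_size, repeats=20):
    """Fill a throwaway table, run an IN() query repeatedly, return (count, avg seconds)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(
        "INSERT INTO t VALUES (?, ?)", ((i, "x") for i in range(table_rows))
    )
    # Spread the wanted IDs evenly over the table.
    step = max(1, table_rows // in_list_size)
    ids = list(range(0, table_rows, step))[:in_list_size]
    sql = "SELECT COUNT(*) FROM t WHERE id IN (" + ", ".join(["?"] * len(ids)) + ")"

    start = time.perf_counter()
    for _ in range(repeats):
        (count,) = conn.execute(sql, ids).fetchone()
    elapsed = (time.perf_counter() - start) / repeats
    conn.close()
    return count, elapsed


# Today's size and the size expected in two years.
results = {rows: time_in_query(rows, 100) for rows in (5_000, 50_000)}
for rows, (count, seconds) in results.items():
    print(f"{rows} rows: matched {count}, {seconds * 1000:.3f} ms per query")
```

Running it at both sizes shows how the timing scales, which is usually more informative for a design decision than a single number.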
I am sending queries to a very large database (meaning many entities/tables).
So I have some queries which include some 7 to 8 joins.
The problem is, that I do not know, how many entries the tables will have in near future. It could be between 1.000 to 100.000 rows each table (or even more).
I think about splitting my queries to perform two or three queries consecutively instead of one mega-query.
Is there a common/recommended limit on the number of JOINs in a MySQL query?
How can I measure/calculate which type of splitting would be a good variant (depending on count-of-rows in the tables, and so on)?
I have many JOINs on the same field (foreign key) of the same table. Is there a way to optimize that as well? (One row in that table has many relations/connections.)
thanks ;)
UPDATE:
I saw it too late. Somebody was so nice and changed the title of the question.
Because of my bad English I wrote performant - meaning having good performance. I didn't mean to perform!
Please consider this in your answers. Thanks again!
You probably want to learn about EXPLAIN, which will show you what MySQL's plan is for executing your query, e.g.
EXPLAIN SELECT foo FROM bar NATURAL JOIN baz
will tell you how MySQL would execute the query SELECT foo FROM bar NATURAL JOIN baz
From the EXPLAIN results you may see opportunities to add indexes to the database that will help your queries if they're slow, and in some cases you may be able to add hints to the query, e.g. telling MySQL to prefer one index over another, if you have the expertise to know that.
In general you will gain nothing from trying to "split up" a query unless your "splitting up" actually completely changes the semantics of what will need to be executed. e.g. if your query is fetching six unrelated things from the database, and you re-write this as six separate queries each fetching one thing, the aggregate time taken to execute will probably be no better (and may be much worse) for your separate queries.
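To make the "split up" comparison concrete, here is a sketch of one joined query against the same work done as many small queries. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the author and book tables are invented for the example. It only demonstrates that both approaches return the same rows while the split version issues far more statements; timing them is left to a test against real data.

```python
import sqlite3

# Hypothetical schema: 100 authors with 10 books each; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)")
conn.execute("CREATE INDEX idx_book_author ON book (author_id)")
conn.executemany(
    "INSERT INTO author VALUES (?, ?)", [(i, f"author{i}") for i in range(100)]
)
conn.executemany(
    "INSERT INTO book VALUES (?, ?, ?)",
    [(i, i % 100, f"book{i}") for i in range(1000)],
)

# One statement: the database does the joining.
joined = conn.execute(
    "SELECT a.name, b.title FROM author a "
    "JOIN book b ON b.author_id = a.id "
    "ORDER BY a.id, b.id"
).fetchall()

# Split version: one query for the authors, then one more query per author.
split = []
statements_issued = 1
authors = conn.execute("SELECT id, name FROM author ORDER BY id").fetchall()
for author_id, name in authors:
    statements_issued += 1
    for (title,) in conn.execute(
        "SELECT title FROM book WHERE author_id = ? ORDER BY id", (author_id,)
    ):
        split.append((name, title))

print(joined == split, statements_issued)  # True 101
```

Against a real server every one of those extra statements also costs a network round trip, which an in-memory demo like this does not show.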
Use 'desc (query);' to get a sense of how MySQL will treat your query. You are generally better off having MySQL do the joining and optimizing than doing it yourself. That's what it's good at.
This will also tell you where indexing is working or needs to be augmented.