How to judge the complexity of SQL queries - mysql

Any resource where it is explained how to judge the complexity of SQL queries would be much appreciated.

(By "complexity", I assume you mean "slowness"?) Some tips:
Subqueries may or may not slow down a query a lot.
GROUP BY and ORDER BY, when both are present but differ, usually require two sorts.
Usually only a single index is used per SELECT.
OR is almost always inefficient. Switching to UNION allows multiple indexes to be used efficiently (see the sketch after this list).
UNION ALL, with a few restrictions, is more efficient than UNION DISTINCT (which requires a de-duplication pass).
Non-sargable expressions cannot use an index, and hence are severely inefficient.
Only if the entire WHERE, GROUP BY, and ORDER BY are handled by a single index can LIMIT be handled efficiently. (Otherwise it must collect all the rows, sort them, and only then peel off a few.)
Entity-Attribute-Value schema is inefficient.
UUIDs and GUIDs are inefficient on very large tables.
A composite index is often better than a single-column index.
A "covering" index is somewhat better.
Sometimes, especially when a LIMIT is involved, it is better to turn the query inside-out. That is, start with a subquery that finds the few ids you need, then reach back into the same table and into other tables to get the rest of the desired columns (also sketched after this list).
"Windowing functions" are poorly implemented in MySQL 8 and MariaDB 10.2. They are useful for "groupwise-max" and "hierarchical schemas". Until the Optimizer improves, I declare them to be "complex".
Recent versions have recognized "row constructors"; previously they were a performance hit.
Having an AUTO_INCREMENT id hurts performance in certain cases; helps in others.
EXPLAIN (or EXPLAIN FORMAT=JSON) tells you what is going on now; it fails to tell you how to rewrite the query or what better index to add.
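A minimal sketch of two of those tips (the OR-to-UNION rewrite and the inside-out LIMIT trick), using a hypothetical table t with an id primary key and indexed columns a and b:
-- OR usually prevents index use; each side of the UNION can use its own index:
SELECT * FROM t WHERE a = 1
UNION ALL
SELECT * FROM t WHERE b = 2 AND a <> 1;  -- the extra condition avoids duplicates
-- Inside-out LIMIT: find the few ids first, then reach back for the wide columns:
SELECT t.*
FROM ( SELECT id FROM t ORDER BY a LIMIT 10 ) AS ids
JOIN t USING(id);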
More indexing tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql In that link, see "Handler counts" for a good way to measure the complexity of specific queries. I use it for comparing query formulations, etc., even without populating a large table to get usable timings.
Give me a bunch of Queries; I'll point out the complexities, if any, in each.

Check out the official MySQL documentation on Query Execution Plan:
https://dev.mysql.com/doc/refman/5.7/en/execution-plan-information.html
You could use the EXPLAIN command to get more information about your query.
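For example (the table and column here are hypothetical):
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 42;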

Related

Can I use index in MySQL in this way? [duplicate]

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved of late in ways I can't imagine, the "IN" query shown will produce a result that is effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each item in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
If the items in the list are non-unique, however (and I'd guess that in the example a "TypeId" is a foreign key), then I'm more interested in the distribution. I'm wondering whether the optimiser will check the stats for each value in the list. Say it checks the first value and finds it's in 20% of the rows (of a table large enough to matter); it will probably table scan. But will the same query plan be used for the other two values, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching: beware of a subquery in the IN clause. It's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so; in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
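A minimal sketch of that restatement; the EmployeeType table and its Active column are invented for illustration, and the two forms are equivalent assuming EmployeeTypeId is unique in EmployeeType (so the join does not duplicate rows):
-- IN with a subquery:
SELECT EmployeeId
FROM Employee
WHERE EmployeeTypeId IN (SELECT EmployeeTypeId FROM EmployeeType WHERE Active = 1);
-- The same requirement restated as a join:
SELECT e.EmployeeId
FROM Employee e
JOIN EmployeeType et ON et.EmployeeTypeId = e.EmployeeTypeId
WHERE et.Active = 1;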
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try to work out the best way to deal with it?
Whether an index is used doesn't depend so much on the type of query as on the type and distribution of the data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used instead of a table scan if:
The query won't access more than a certain percentage of the indexed rows (say ~10%, though this varies between DBMSs).
Conversely, if there are a lot of rows but relatively few unique values in the column, a table scan may also be faster.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt; your DBMS's documentation is your friend.
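A sketch of that datatype gotcha; t and int_col are hypothetical, and the exact behaviour varies by DBMS and version:
-- The float literal forces a cast, which can defeat the index on int_col:
SELECT * FROM t WHERE int_col = 10.0;
-- A literal matching the column's type keeps the index usable:
SELECT * FROM t WHERE int_col = 10;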
#Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds))
    .List<Employee>();
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
Select EmployeeId From Employee USE INDEX (EmployeeTypeId)
This query will search using the index you have created; in MySQL, the USE INDEX hint goes right after the table name. It works for me. Please give it a try.

Mysql 5.7 query execution visualization (how does mysql decide the execution of query)

My queries are running slow, with some indexes chosen over others. I am trying to find a tool or a guide with which I can figure out why MySQL gave preference to one index or one table (in the case of joins) over others, so that I can fine-tune the index or the query.
So far, I haven't come across an article which explains it in detail, or a tool which can provide me with those details.
Any inputs will be appreciated. Thanks a ton in advance!
As the Optimizer gets more sophisticated, it gets harder to understand what it is doing. The latest improvements involve "cost-based" analysis of possible execution methods. For many queries, it is obvious that one index would be better than another.
There are 4 views into what is going on:
EXPLAIN is quite limited. It does not handle LIMIT very well, nor does it necessarily say which step uses the filesort, or even if there are multiple filesorts. In general, the "Rows" column is useful, but in certain situations, it is useless. Simple rule: A big number is a bad sign.
EXPLAIN EXTENDED + SHOW WARNINGS; provides the rewritten version of the query. This does not add a lot, but it does give clues about the distinction between ON and WHERE in JOINs.
EXPLAIN FORMAT=JSON provides more details on the cost-based analysis and spells out various steps, including filesorts.
"Optimizer trace" goes a bit further. (It is rather tedious to read.)
As for a "visualization", no. Anyway, EXPLAIN and its friends only work with what they have. That is they do not give clues of "what if you added INDEX(a,b)". That is what is really needed. Nor does it effectively point out that you should not "hide an indexed column in a function call". Example: WHERE DATE(dt) = '2019-01-23'. Note that some 'operators' are effectively function calls.
I've seen a few "graphical Explains", but they seem to be nothing more than boxing up the rows of EXPLAIN and drawing lines between them.
I have been struggling with these problems for many years, and have written a partial answer -- namely a "cookbook". It approaches indexing from the other direction -- explaining what index to add for a given SELECT. Unfortunately, it only works for simpler queries. http://mysql.rjweb.org/doc.php/index_cookbook_mysql
I work on a lot of performance questions on this forum in hopes of getting more insight into what to add to the cookbook next. For now, you can help me by posting your tough query, together with EXPLAIN SELECT and SHOW CREATE TABLE(s).
Some random comments:
"Index merge intersection" is perhaps always not as good as a composite index.
"Index merge union" is almost never used. You might be able to turn OR into UNION effectively.
The newer Optimizer creates an index on the fly for "derived tables" (JOIN ( SELECT ... )). But this is rarely as efficient as rewriting the query to avoid such a subquery when it returns lots of rows. (Again, none of the EXPLAINs will point you this way.)
Something often forgotten (but which does show up as an unexplained large "Rows"): COLLATIONs must match.
A trick to make use of the PK clustering: PRIMARY KEY(foo, id), INDEX(id)
Nothing (except experience) says how nearly useless "prefix" indexing is (INDEX(bar(10))).
FORCE INDEX is handy for experimenting, but almost always a bad idea for production.
In a SELECT with a JOIN, and a WHERE that mentions only one of the tables, the Optimizer will usually pick the table mentioned in WHERE as the first table. Then it will do a "Nested Loop Join" into the other table(s).
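A minimal sketch of the first of those comments, with a hypothetical table t and columns a and b:
-- Two single-column indexes invite "index merge intersection":
ALTER TABLE t ADD INDEX(a), ADD INDEX(b);
-- For WHERE a = ? AND b = ?, a composite index is usually better:
ALTER TABLE t ADD INDEX(a, b);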
(I should add some of this to my Cookbook. 'Stay tuned'. Update: Done.)

Is query cost the best metric for MySQL query optimization?

I am in the process of optimizing the queries in my MySQL database. While using Visual Explain and looking at various query costs, I'm repeatedly finding counter-intuitive values. Operations which use more efficient lookups (e.g. key lookup) seem to have a higher query cost than ostensibly less efficient operations (e.g. full table scan or full index scan).
Examples of this can even be seen in the MySQL manual, in the section regarding Visual Explain on this page:
The query cost for the full table scan is a fraction of the key-lookup-based query costs. I see exactly the same scenario in my own database.
All this seems perfectly backwards to me, and raises this question: should I use query cost as the standard when optimizing a query? Or have I fundamentally misunderstood query cost?
MySQL does not have very good metrics relating to Optimization. One of the better ones is EXPLAIN FORMAT=JSON SELECT ..., but it is somewhat cryptic.
Some 'serious' flaws:
Rarely does anything account for a LIMIT.
Statistics on indexes are crude and do not allow for uneven distribution. (Histograms are coming 'soon'.)
Very little is done about whether data/indexes are currently cached, and nothing about whether you have a spinning drive or SSD.
I like the following because it lets me compare two formulations/indexes/etc. even for small tables, where timing is next to useless:
FLUSH STATUS;
-- perform the query here
SHOW SESSION STATUS LIKE 'Handler%';
It provides exact counts (unlike EXPLAIN) of reads, writes (to temp table), etc. Its main flaw is in not differentiating how long a read/write took (due to caching, index lookup, etc). However, it is often very good at pointing out whether a query did a table/index scan versus lookup versus multiple scans.
The regular EXPLAIN fails to point out multiple sorts, such as might happen with GROUP BY and ORDER BY. And "Using filesort" does not necessarily mean anything is written to disk.

Queries executing in orders [duplicate]

My thinking is that if I put the ANDs that filter out a greater number of rows before those that filter out just a few, my query should run quicker, since the selection set is much smaller between the AND conditions.
But does the order of the ANDs in the WHERE clause of an SQL statement really affect performance that much, or are the engines already optimized for this?
It really depends on the optimiser.
It shouldn't matter because it's the optimiser's job to figure out the optimal way to run your query regardless of how you describe it.
In practice, no optimiser is perfect so you might find that re-ordering the clauses does make a difference to particular queries. The only way to know for sure is to test it yourself with your own schema, data etc.
Most SQL engines are optimized to do this work for you. However, I have found situations in which trying to carve down the largest table first can make a big difference; it doesn't hurt!
A lot depends on how the indices are set up. If an index exists which combines the two keys, the optimizer should be able to answer the query with a single index search. Otherwise, if independent indices exist for both keys, the optimizer may get a list of the records satisfying each key and merge the lists. If an index exists for one condition but not the other, the optimizer should filter using the indexed list first. In any of those scenarios, it shouldn't matter in what order the conditions are listed.
If no index applies, the order in which the conditions are specified may affect the order of evaluation, but since the database is going to have to fetch every single record to satisfy the query, the time spent fetching will likely dwarf the time spent evaluating the conditions.
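One way to verify this on your own schema; the orders table and its columns are hypothetical:
-- Both orderings should yield the same plan; compare the EXPLAIN output:
EXPLAIN SELECT * FROM orders WHERE status = 'shipped' AND customer_id = 42;
EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';
-- An index combining the two keys lets a single index search answer the query:
ALTER TABLE orders ADD INDEX(customer_id, status);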

Does this field need an index?

I currently have a summary table to keep track of my users' post counts, and I run SELECTs on that table to sort them by counts, like WHERE count > 10, for example. Now I know having an index on columns used in WHERE clauses speeds things up, but since these fields will also be updated quite often, would indexing provide better or worse performance?
If you have a query like
SELECT count(*) as rowcount
FROM table1
GROUP BY name
then you cannot put an index on count (it is a computed value, not a column); you need to put an index on the GROUP BY field instead.
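A minimal sketch, assuming the table1 and name from the query above:
ALTER TABLE table1 ADD INDEX(name);  -- lets the GROUP BY read the index in order instead of sorting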
If you have an actual field named count, then putting an index on it may speed up the following query, or it may make no difference at all:
SELECT id, `count`
FROM table1
WHERE `count` > 10
Whether an index on count will speed up the query really depends on what percentage of the rows satisfy the WHERE clause. If it's more than about 30%, MySQL (or any RDBMS, for that matter) will refuse to use the index.
It will just stubbornly insist on doing a full table scan. (i.e. read all rows)
This is because using an index requires reading 2 files (1 index file and then the real table file with the actual data).
If you select a large percentage of rows, reading the extra index file is not worth it and just reading all the rows in order will be faster.
If only a few rows pass the test, using an index will speed this query up a lot.
Know your data
Using EXPLAIN SELECT will tell you which indexes MySQL has available, which one it picked, and (in a somewhat convoluted way) why.
See: http://dev.mysql.com/doc/refman/5.0/en/explain.html
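A minimal sketch of adding the index and then checking whether MySQL actually picks it (table1 and `count` as in the query above):
ALTER TABLE table1 ADD INDEX(`count`);
EXPLAIN SELECT id, `count` FROM table1 WHERE `count` > 10;
-- The "key" column of the EXPLAIN output shows whether the new index was chosen.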
Indexes in general provide better read performance at the cost of slightly worse insert, update and delete performance. Usually the tradeoff is worth it depending on the width of the index and the number of indexes that already exist on the table. In your case, I would bet that the overall performance (reading and writing) will still be substantially better with the index than without but you would need to run tests to know for sure.
It will improve read performance and worsen write performance. If the tables are MyISAM and you have a lot of people posting in a short amount of time you could run into issues where MySQL is waiting for locks, eventually causing a crash.
There's no way of really knowing that without trying it. A lot depends on the ratio of reads to writes, storage engine, disk throughput, various MySQL tuning parameters, etc. You'd have to set up a simulation that resembles production and run it before and after.
I think it's unlikely that the write performance will be a serious issue after adding the index.
But note that the index won't be used anyway if it is not selective enough: if, for example, more than 10% of your users have count > 10, the fastest query plan might be to skip the index and just scan the entire table.