Clarification of join order for creation of temporary tables - mysql

I have a large query in MySQL that involves joining multiple tables together. It's too slow, so I've run "explain" and see that it's creating a temporary table, which I suspect is taking most of the execution time. I found some related information:
1. The MySQL docs describe conditions under which a temporary table might be created. ("The server creates temporary tables under conditions such as these..." [Emphasis added])
2. This related SO question, Using index, using temporary, using filesort - how to fix this?, which provides a link to the doc and applies it in a specific case.
3. This related SO question, Order of join conditions important?, which talks about the order in which joins are evaluated.
My query does not appear to meet any of the conditions listed in the docs (#1), at least in the order in which I wrote the joins. By experimentation, however, I find that if I remove my order by clause, the temporary table is not created. That makes me look at this rule from the doc:
Evaluation of statements that contain an ORDER BY clause and a different GROUP BY clause, or for which the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.
This is the same rule that came into play in the example at #2 above, but there the OP explicitly had columns from multiple tables in the order by clause, so that's at least superficially different.
Moreover, when I look at the output from explain, it appears that the table I listed first is not the one the optimizer uses first. Here is a pseudo-query as an example:
select * from A
join B on A.c1=B.c1
join C on A.c2=C.c2
where A.c3='value'
order by A.c4
I would say that my order by clause does use only columns from the "first table in the join queue" based on the order that I've written the query. On the other hand, the output from explain suggests that it first considers table B then A.
So here are the questions:
Does the quoted rule above for use of temporary tables refer to the order that I write the tables or the order that the software chooses to evaluate them?
If it's the order that I write them, does this mean that the order of the joins does impact performance? (Seems to contradict the claims at #3 above.)
If it's the order that the software chooses to evaluate them, is there any way to coerce or trick it into selecting an order that avoids the temporary table?

It refers to the order in which the optimiser evaluates them (the join queue). The optimiser may not even be aware of the order of the tables in your SQL statement.
No, it does not contradict what's been written in #3, since the answer explicitly writes (emphasis is mine):
has no effect on the result
The result and performance are two different things. Actually, there is an upvoted comment to the answer saying that
But it might affect the query plan (=> performance)
You can tell the optimiser which table to process first by using straight_join:
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer puts the tables in the wrong order.
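For example, here is a sketch applied to the question's pseudo-query, forcing A to be read before B (the SELECT STRAIGHT_JOIN modifier would instead force the entire FROM-clause order):
-- Sketch: force A to be read before B; the placement of C is still left to the optimizer.
select * from A
straight_join B on A.c1=B.c1
join C on A.c2=C.c2
where A.c3='value'
order by A.c4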
However, you need to be careful with that, because you tie the optimiser's hands. See this SO topic discussing the advantages and disadvantages of straight_join.
Number of records, where criteria, indexes - they all play their part in the optimiser's decision about the processing order of tables. There is no magic bullet; you need to play around a bit, and you can probably trick the optimiser into changing the order of the tables.

select * from A
join B on A.c1=B.c1
join C on A.c2=C.c2
where A.c3='value'
order by A.c4
The optimizer will use various heuristics to decide in which order to look at the tables. In this case, it will start with A because of the filter in the WHERE clause.
This "composite" index on A should eliminate the temp table and filesort for the ORDER BY: INDEX(c3, c4). No, that is not the same as INDEX(c3), INDEX(c4).
After getting the rows from A, either B or C could be reached into ("Nested Loop Join"). These indexes are important: B: (c1) and C: (c2).
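A minimal sketch of those suggested indexes (index names are arbitrary; table and column names come from the pseudo-query):
ALTER TABLE A ADD INDEX idx_a_c3_c4 (c3, c4);  -- composite: filter on c3, then sort by c4
ALTER TABLE B ADD INDEX idx_b_c1 (c1);         -- join column used when reaching into B
ALTER TABLE C ADD INDEX idx_c_c2 (c2);         -- join column used when reaching into C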
STRAIGHT_JOIN and FORCE INDEX are usually a bad idea and should be used only as a last resort. It may help today's query, but hurt tomorrow's.
EXPLAIN FORMAT=JSON SELECT ... gives more information, sometimes even pointing out that two tmp tables are needed.
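For instance, against the pseudo-query above (a sketch):
EXPLAIN FORMAT=JSON
select * from A
join B on A.c1=B.c1
join C on A.c2=C.c2
where A.c3='value'
order by A.c4;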
More tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql

Related

The process order of SQL order by, group by, distinct and aggregation function?

Query like:
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;
So, what is the processing order of order by, group by, distinct and the aggregation function?
Maybe a different order will produce the same result but with different performance. I want to merge multiple results; I got the SQL and parsed it, so I want to know the order that standard SQL uses.
This is bigger than just group by/aggregation/order by. You want to have a sense of how a query engine creates a result set. At a high level, that means creating an execution plan, retrieving data from the table into the query's working set, manipulating the data to match the requested result set, and then returning the result set back to the caller. For very simple queries, or queries that are well matched to the table design (or table schemas that are well-designed for the queries you'll need to run), this can mean streaming data from a table or index directly back to the caller. More often, it means thinking at a more detailed level, where you roughly follow these steps:
Look at the query to determine which tables will be needed.
Look at joins and subqueries, to determine which of those tables depend on other tables.
Look at the conditions on the joins and in the where clause, in conjunction with indexes, to determine how much data from each table will be needed, and how much work it will take to extract the portions of each table that you need (how well the query matches up with your indexes or the table as stored on disk).
Based on the information collected from steps 1 through 3, figure out the most efficient way to retrieve the data needed for the select list, regardless of the order in which tables are included in the query and regardless of any ORDER BY clause. For this step, "most efficient" is defined as the method that keeps the working set as small as possible for as long as possible.
Begin to iterate over the records indicated by step 4. If there is a GROUP BY clause, each record has to be checked against the existing discovered groups before the engine can determine whether or not a new row should be generated in the working set. Often, the most efficient way to do this is for the query engine to conduct an effective ORDER BY step here, such that all the potential rows for the results are materialized into the working set, which is then ordered by the columns in the GROUP BY clause and condensed so that duplicate rows are removed. Aggregate function results for each group are updated as the records for that group are discovered.
Once all of the indicated records are materialized, such that the results of any aggregate functions are known, HAVING clauses can be evaluated.
Now, finally, the ORDER BY can be factored in, as well.
The records remaining in the working set are returned to the caller.
And as complicated as that was, it's only the beginning. It doesn't begin to account for windowing functions, common table expressions, cross apply, pivot, and on and on. It is, however, hopefully enough to give you a sense of the kind of work the database engine needs to do.
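As a rough sketch, here is the question's query annotated with the logical evaluation order described above (the optimizer is free to reorder the physical work):
SELECT DISTINCT MAX(age), area   -- 5. select list / aggregates finalized; 6. DISTINCT applied
FROM T_USER                      -- 1. source table(s) read into the working set
-- (a WHERE clause would run here as step 2, filtering rows before grouping)
GROUP BY area                    -- 3. rows grouped; MAX(age) accumulated per group
-- (a HAVING clause would run here as step 4, filtering whole groups)
ORDER BY area;                   -- 7. final ordering of the result set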

Rewrite a group-by over a randomly-ordered sub-query using only one select

Here's the thing. I have 3 tables, and I'm doing this query:
select t.nomefile, t.tipo_nome, t.ordine
from
(select nomefile, tipo_nome, categorie.ordine
from t_gallerie_immagini_new as immagini
join t_gallerie_new as collezioni on collezioni.id=immagini.id_ref
join t_gallerie_tipi as categorie on collezioni.type=categorie.id
order by RAND()
) as t
group by t.tipo_nome
order by t.ordine
It's applied to 3 tables, all in 1-N relationships, which need to be joined so that I can then take 1 random result for each distinct value in the higher-level table. This query works just fine; the problem is that I'm being asked to rewrite this query USING ONLY ONE SELECT. I've come up with another way of doing this with only one select, but the thing is that according to SQL syntax the GROUP BY must come before the ORDER BY, so it's pointless to order by random when you already have only the first record for each value in the higher-level table.
Does anyone have a clue how to write this query using only one select?
Generally, if I am not much mistaken, an ORDER BY clause in the subquery of a query like this has to do with a technique that allows you to pull non-GROUP BY columns (in the outer query) according to the order specified. So you may be out of luck here, because that means the subquery is important to this query.
Well, because in this specific case the order chosen is BY RAND() and not by a specific column/set of columns, you may have a very rough equivalent by doing both the joins and the grouping on the same level, like this:
select nomefile, tipo_nome, categorie.ordine
from t_gallerie_immagini_new as immagini
join t_gallerie_new as collezioni on collezioni.id=immagini.id_ref
join t_gallerie_tipi as categorie on collezioni.type=categorie.id
group by tipo_nome
order by categorie.ordine
You must understand, though, why this is not an exact equivalent. The thing is, MySQL does allow you to pull non-GROUP BY columns in a GROUP BY query, but if they are not correlated to the GROUP BY columns, then the values returned would be... no, not random, the term used by the manual is indeterminate. On the other hand, the technique mentioned in the first paragraph takes advantage of the fact that if the row set is ordered explicitly and unambiguously prior to grouping, then the non-GROUP BY column values will always be the same*. So indeterminateness has to do with the fact that "normally" rows are not ordered explicitly before grouping.
Now you can probably see the difference. The original version orders the rows explicitly. Even if it's BY RAND(), it is intentionally so, to ensure (as much as possible) different results in the output most of the times. But the modified version is "robbed" of the explicit ordering, and so you are likely to get identical results for many executions in a row, even if they are kind of "random".
So, in general, I consider your problem unsolvable for the above stated reasons, and if you choose to use something like the suggested modified version, then just be aware that it is likely to behave slightly differently from the original.
* The technique may not be well documented, by the way, and may have been discovered empirically rather than by following the manuals.
I was not able to understand the reasons behind the request to rewrite this query; however, I found out that there is a solution which uses the "select" word only once. Here's the query:
SELECT g.type,
       SUBSTRING_INDEX(GROUP_CONCAT(i.nomefile ORDER BY RAND()), ',', 1) AS nomefile
FROM t_gallerie_new g
JOIN t_gallerie_immagini_new i ON g.id = i.id_ref
GROUP BY g.type;
for anyone interested in this question.
NOTE: The use of GROUP_CONCAT has a couple of downsides: it is not recommended with medium/large tables since it can increase the server-side payload. Also, there is a limit on the size of the string returned by GROUP_CONCAT (1024 bytes by default), so it's necessary to modify a parameter on the MySQL server to be able to receive a bigger string from this function.
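That parameter is group_concat_max_len; here is a sketch of raising it for the current session (the value shown is arbitrary):
-- Check the current limit (1024 bytes by default).
SHOW VARIABLES LIKE 'group_concat_max_len';
-- Raise it for this session only.
SET SESSION group_concat_max_len = 100000;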

How does JOIN work in MySQL?

Although the question title is a duplicate of many discussions, I did not find an answer to this question:
Consider a simple join for normalized tables of tags as
SELECT tags.tag
FROM tags
INNER JOIN tag_map
ON tags.tag_id=tag_map.tag_id
WHERE article_id=xx
Does JOIN work with the entire tags and tag_map tables and then filter the created (JOINed) table to find rows matching the WHERE clause for the article id,
OR does JOIN only join the rows of the tag_map table in which article_id=xx?
The latter method should be quite a bit faster!
It will do the former; to my knowledge, WHEREs are explicitly performed on the resulting JOINed table. (Disclaimer: MySQL may optimize this in some cases, I don't know.)
To force the latter behaviour and execute the WHERE first, you can add an extra filter to your JOIN ON statement:
SELECT tags.tag
FROM tags
INNER JOIN tag_map
ON tags.article_id=xx
AND tags.tag_id=tag_map.tag_id
WHERE article_id=xx
The joins work on ONLY those records of the first table that qualify under the WHERE clause. That said, you are doing a join to tag_map, but your where clause does not specify which alias "article_id" is associated with. It's typically better to always qualify your fields with the table name or alias they come from.
So, if article_id is coming from TAGS, then it will first look at that list as the primary set of records, optimized with an index if one exists, and return a small set. From that, the join is applied to tag_map and will grab all records that match the join "ON" condition.
Just to clarify something. If the JOIN was applied FIRST, before the WHERE clause optimization, queries would take forever. The join basically PREPARES the relationship before the record selection actually occurs. Hence, the execution plan that shows the indexes that would be used.
It depends on the engine. Earlier versions of many database engines would generate the join results first and then filter. Newer versions generate an execution plan that achieves the fastest results. Tests would have to be done with the db engine, reviewing execution plans for your version/database, to find "what is best".
Assuming it is simple or inner join:
The answer is: in the relational model, the first answer is correct. It creates a table that contains every row from the first table crossed with every row from the second table, so if you have N rows in the first and M in the second, it will create a table with NxM rows and then eliminate those where the conditions do not match.
Now, that is the mathematical model, but in implementation, depending on the engine, it will use some smarter way, typically choosing one table that seems faster and joining from there using a (hopefully indexed) join field. But this varies between engines: there is a lot of documentation on that (google it), and some people, the poster of this answer included, are paid to optimize join queries...
In the case of MySQL (just noticed the tag) you can use the following syntax:
EXPLAIN [EXTENDED] SELECT select_options
as explained here, and MySQL will tell you how it would execute such a query. It is faster than reading the documentation.
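For example, a sketch using the question's query (assuming article_id lives in tag_map; 123 stands in for xx):
EXPLAIN
SELECT tags.tag
FROM tags
INNER JOIN tag_map ON tags.tag_id = tag_map.tag_id
WHERE tag_map.article_id = 123;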
You can always check Execution plan to see how your query gets executed step by step. In MySQL I don't know whether it can be presented graphically using any third party tools (as you can on MS SQL out of the box with Management Studio) but you can still check it using explain language constructs. Check documentation.
Not knowing your table schema
If article_id is a column of the tags table, then the tag_map table isn't scanned at all unless the join column in the FK table is nullable.
If article_id is indexed (i.e. a primary key) then the index is scanned...
etc...
What I'd like to say is that we'd need your table schema definition to tell you some details. We can't know how your schema works.

Composite Primary and Cardinality

I have some questions on Composite Primary Keys and the cardinality of the columns. I searched the web, but did not find any definitive answer, so I am trying again. The questions are:
Context: Large (50M - 500M rows) OLAP Prep tables, not NOSQL, not Columnar. MySQL and DB2
1) Does the order of keys in a PK matter?
2) If the cardinality of the columns varies heavily, which should be used first. For example, if I have CLIENT/CAMPAIGN/PROGRAM where CLIENT is highly cardinal, CAMPAIGN is moderate, PROGRAM is almost like a bitmap index, what order is the best?
3) What order is the best for Join, if there is a Where clause and when there is no Where Clause (for views)
Thanks in advance.
You've got "MySQL and DB2". This answer is for DB2, MySQL has none of this.
Yes, of course that is logical, but the optimiser takes much more than just that into account.
Generally, the order of the columns in the WHERE clause (join) does not (and should not) matter.
However, there are two items related to the order of predicates which may be the reason for your question.
What does matter is the order of the columns in the index against which the WHERE clause is processed. Yes, there it is best to specify the columns in order from highest cardinality to lowest. That allows the optimiser to target a smaller range of rows.
And along those lines, do not bother implementing single-column indices on low-cardinality columns (they are useless). If the index is correct, then it will be used more often.
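A sketch of that column ordering using the question's columns (the table and index names are hypothetical; CREATE INDEX works on both DB2 and MySQL):
-- Highest-cardinality column first, lowest last.
CREATE INDEX ix_prep_client_campaign_program
    ON olap_prep (client, campaign, program);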
The order of tables being joined (not of the columns in the join) matters very much; it is probably the most important consideration. In fact, Join Transitive Closure is automatic: the optimiser evaluates all possible join orders and chooses what it thinks is the best, based on statistics (which is why UPDATE STATS is so important).
Regardless of the number of rows in the tables, if you are joining 100 rows from table_A on a bad index with 1,000,000 rows in table_B on a good index, you want the order A:B, not B:A. If you are getting less than the maximum IOPS, you may want to do something about it.
The correct sequence of steps is, no surprise:
check that the index is correct as per (1). Do not just add another index, correct the ones you have.
check that update stats is being executed regularly
always try the default operation of the optimiser first. Set stats on and measure I/Os. Use representative sets of values (that the user will use in production).
check the showplan, to ensure that the code is correct. Of course, that will also identify the join order chosen.
if the performance is not good enough, and you believe that the join order chosen by the optimiser for those sets of values is sub-optimal, SET JTC OFF (syntax depends on your version of DB2), then specify the order that you want in the WHERE clause. Measure I/Os. Use representative sets of values.
form an opinion. Choose whichever gives better performance overall. Never tune for single queries.
1) Does the order of keys in a PK matter?
Yes, it changes the ordering of the records in the index that is used to police the PRIMARY KEY.
2) If the cardinality of the columns varies heavily, which should be used first. For example, if I have CLIENT/CAMPAIGN/PROGRAM where CLIENT is highly cardinal, CAMPAIGN is moderate, PROGRAM is almost like a bitmap index, what order is the best?
For select queries, this totally depends on the queries you are going to use. If you are searching on all three columns at once, the order is not important; if you are searching on only one or two of the columns, those should be the leading columns in the index.
For inserts, it is better to make the leading column match the order in which the records are inserted.
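For instance, assuming a primary key declared as (client, campaign, program), here is a sketch of which lookups can use its leading columns (the table name is hypothetical):
-- Can use the composite PK index (leading column(s) present):
SELECT * FROM olap_prep WHERE client = 1;
SELECT * FROM olap_prep WHERE client = 1 AND campaign = 2;
SELECT * FROM olap_prep WHERE client = 1 AND campaign = 2 AND program = 3;
-- Cannot seek on the PK index (leading column missing):
SELECT * FROM olap_prep WHERE program = 3;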
3) What order is the best for Join, if there is a Where clause and when there is no Where Clause (for views)
Again, this depends on the WHERE clause.

MySQL FULLTEXT Search Across >1 Table

As a more general case of this question, because I think it may be of interest to more people... What's the best way to perform a fulltext search across two tables? Assume there are three tables: one for programs (with submitter_id) and one each for tags and descriptions, with object_id foreign keys referring to records in programs. We want the submitter_id of programs with certain text in their tags OR descriptions. We have to use MATCH AGAINST for reasons that I won't go into here. Don't get hung up on that aspect.
programs
    id
    submitter_id
tags_programs
    object_id
    text
descriptions_programs
    object_id
    text
The following works and executes in 20ms or so:
SELECT p.submitter_id
FROM programs p
WHERE p.id IN
(SELECT t.object_id
FROM titles_programs t
WHERE MATCH (t.text) AGAINST ('china')
UNION ALL
SELECT d.object_id
FROM descriptions_programs d
WHERE MATCH (d.text) AGAINST ('china'))
but I tried to rewrite this as a JOIN as follows, and it runs for a very long time; I had to kill it after 60 seconds.
SELECT p.id
FROM descriptions_programs d, tags_programs t, programs p
WHERE (d.object_id=p.id AND MATCH (d.text) AGAINST ('china'))
OR (t.object_id=p.id AND MATCH (t.text) AGAINST ('china'))
Just out of curiosity I replaced the OR with AND. That also runs in a few milliseconds, but it's not what I need. What's wrong with the second query above? I can live with the UNION and subselects, but I'd like to understand.
Join after the filters (i.e. join the results); don't try to join and then filter.
The reason is that you lose use of your fulltext index.
Clarification in response to the comment: I'm using the word join generically here, not as JOIN but as a synonym for merge or combine.
I'm essentially saying you should use the first (faster) query, or something like it. The reason it's faster is that each of the subqueries is sufficiently uncluttered that the db can use that table's full text index to do the select very quickly. Joining the two (presumably much smaller) result sets (with UNION) is also fast. This means the whole thing is fast.
The slow version winds up walking through lots of data testing it to see if it's what you want, rather than quickly winnowing the data down and only searching through rows you are likely to actually want.
Just in case you don't know: MySQL has a built-in statement called EXPLAIN that can be used to see what's going on under the surface. There are a lot of articles about this, so I won't go into any detail, but for each table it provides an estimate of the number of rows it will need to process. If you look at the "rows" column in the EXPLAIN result for the second query, you'll probably see that the number of rows is quite large, and certainly a lot larger than for the first one.
The net is full of warnings about using subqueries in MySQL, but it turns out that many times the developer is smarter than the MySQL optimizer. Filtering results in some manner before joining can cause major performance boosts in many cases.
If you join both tables you end up with lots of records to inspect. Just as an example, if both tables have 100,000 records, fully joining them gives you 10,000,000,000 records (10 billion!).
If you change the OR to AND, then you allow the engine to filter out all records from the descriptions_programs table which don't match 'china', and only then join with titles_programs.
Anyway, that's not what you need, so I'd recommend sticking to the UNION way.
The union is the proper way to go. The join will pull in both fulltext indexes at once and can multiply the number of checks actually performed.