How does JOIN work in MySQL? - mysql

Although the question title duplicates many discussions, I did not find an answer to this question:
Consider a simple join for normalized tables of tags as
SELECT tags.tag
FROM tags
INNER JOIN tag_map
ON tags.tag_id=tag_map.tag_id
WHERE article_id=xx
Does JOIN work on the entire tags and tag_map tables and then filter the resulting (JOINed) table with the WHERE clause to find the rows for the article id,
OR will JOIN only join the rows of the tag_map table in which article_id=xx?
The latter method should be considerably faster!

It will do the former; to my knowledge, WHERE clauses are explicitly applied to the resulting JOINed table. (Disclaimer: MySQL may optimize this in some cases; I don't know.)
To force the latter behaviour and execute the WHERE first, you can add an extra filter to your JOIN's ON clause:
SELECT tags.tag
FROM tags
INNER JOIN tag_map
ON tags.article_id=xx
AND tags.tag_id=tag_map.tag_id
WHERE article_id=xx
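For what it's worth, with an INNER JOIN the filter returns the same rows whether it sits in WHERE or in ON; only the plan can differ, and that is the optimizer's call. Here is a minimal sketch of that, using Python's built-in sqlite3 as a self-contained stand-in for MySQL (the planner is different, but inner-join result semantics are standard SQL). The schema, with article_id living in tag_map, and the sample rows are assumptions, since the question does not show the tables.

```python
import sqlite3

# Assumed schema: article_id lives in tag_map (the question never says).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE tag_map (article_id INTEGER, tag_id INTEGER);
INSERT INTO tags VALUES (1, 'mysql'), (2, 'join'), (3, 'php');
INSERT INTO tag_map VALUES (10, 1), (10, 2), (11, 3);
""")

# Filter written in the WHERE clause, as in the question.
filter_in_where = """
SELECT tags.tag FROM tags
INNER JOIN tag_map ON tags.tag_id = tag_map.tag_id
WHERE tag_map.article_id = ?
ORDER BY tags.tag
"""

# Same filter moved into the ON clause.
filter_in_on = """
SELECT tags.tag FROM tags
INNER JOIN tag_map ON tags.tag_id = tag_map.tag_id
                  AND tag_map.article_id = ?
ORDER BY tags.tag
"""

rows_where = conn.execute(filter_in_where, (10,)).fetchall()
rows_on = conn.execute(filter_in_on, (10,)).fetchall()
print(rows_where)
print(rows_on)
```

Both queries return the tags of article 10. Whether one form runs faster on a given MySQL version is something only EXPLAIN on the real tables can tell you.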

Joins work ONLY on those records of the first table that qualify under the WHERE clause. That said, you are doing a join to tag_map, but your WHERE clause does not specify which table or alias "Article_ID" is associated with. It's typically better to always qualify your fields with the table name or alias they are coming from.
So, if article_id is coming from TAGS, it will first look at that table as the primary set of records, use an index if one exists, and return a small set. From that, the join is applied to tag_map and will grab all records that match the join's "ON" condition.
Just to clarify something: if the JOIN were applied FIRST, before the WHERE clause optimization, queries would take forever. The join basically PREPARES the relationship before the record selection actually occurs. Hence the execution plan, which shows the indexes that would be used.

It depends on the engine. Earlier versions of many database engines would generate the join results first and then filter. Newer versions generate an execution plan that achieves the fastest results. You would have to test with your DB engine, reviewing the execution plans for your version/database, to find "what is best".

Assuming it is a simple or inner join:
The answer is: in the relational model, the first description is correct. The join creates a table that contains every row from the first table crossed with every row from the second, so if you have N rows in the first and M in the second, it creates a table with NxM rows and then eliminates those where the conditions do not match.
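To make the "cross everything, then eliminate" model concrete, here is a small sketch using Python's sqlite3 (a stand-in chosen only because it runs anywhere; the tables and rows are made up for illustration). It counts the NxM product and checks that filtering the product gives exactly what INNER JOIN gives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE tag_map (article_id INTEGER, tag_id INTEGER);
INSERT INTO tags VALUES (1, 'mysql'), (2, 'join'), (3, 'php');
INSERT INTO tag_map VALUES (10, 1), (10, 2), (11, 3), (12, 1);
""")

n = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
m = conn.execute("SELECT COUNT(*) FROM tag_map").fetchone()[0]

# The full cross product has N x M rows.
product_size = conn.execute(
    "SELECT COUNT(*) FROM tags CROSS JOIN tag_map"
).fetchone()[0]

# Relational model: build the product, then keep only the matching rows.
cross_then_filter = conn.execute("""
SELECT tags.tag, tag_map.article_id
FROM tags CROSS JOIN tag_map
WHERE tags.tag_id = tag_map.tag_id
ORDER BY tag_map.article_id, tags.tag
""").fetchall()

# The same thing written as an INNER JOIN.
inner_join = conn.execute("""
SELECT tags.tag, tag_map.article_id
FROM tags INNER JOIN tag_map ON tags.tag_id = tag_map.tag_id
ORDER BY tag_map.article_id, tags.tag
""").fetchall()

print(n, m, product_size)
print(inner_join)
```

The two result sets are identical; the model only defines what the answer must be, not how the engine has to compute it.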
Now, that is the mathematical model. In an implementation, depending on the engine, it will use some smarter way, typically choosing the table that seems faster to start from and joining from there, hopefully using an indexed join field. But this differs between engines: there is a lot of documentation on that (google it), and some people, the poster of this answer included, are paid to optimize join queries...
In the case of MySQL (I just noticed the tag), you can use the following syntax:
EXPLAIN [EXTENDED] SELECT select_options
(see the EXPLAIN page of the MySQL manual), and MySQL will tell you how it would execute such a query. It is faster than reading the documentation.
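If you want to see the idea of "ask the engine for its plan" without a MySQL server at hand, the sketch below does the analogous thing in SQLite through Python's sqlite3. Note the assumptions: SQLite spells it EXPLAIN QUERY PLAN rather than EXPLAIN, its planner is not MySQL's, and the schema and the index name tag_map_article are made up for the example.

```python
import sqlite3

def plan_details(conn, sql, params=()):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE tag_map (article_id INTEGER, tag_id INTEGER);
""")

QUERY = """
SELECT tags.tag FROM tags
INNER JOIN tag_map ON tags.tag_id = tag_map.tag_id
WHERE tag_map.article_id = ?
"""

# Plan before and after adding an index on the filtered column.
plan_without_index = plan_details(conn, QUERY, (10,))
conn.execute("CREATE INDEX tag_map_article ON tag_map (article_id)")
plan_with_index = plan_details(conn, QUERY, (10,))

for step in plan_without_index + ["--"] + plan_with_index:
    print(step)
```

Once the index exists, the plan names it, which is the same kind of evidence MySQL's EXPLAIN gives you in its key column.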

You can always check the execution plan to see how your query gets executed step by step. I don't know whether MySQL can present it graphically using third-party tools (as you can on MS SQL out of the box with Management Studio), but you can still check it using the EXPLAIN language constructs. Check the documentation.
Not knowing your table schema, I can only guess:
If article_id belongs to table tags, then the tag_map table isn't scanned at all unless the join column in the FK table is nullable.
If article_id is indexed (e.g. it is the primary key), then the index is scanned...
etc...
What I'd like to say is that we'd need your table schema definition to tell you any details. We can't know how your schema works.

Related

Clarification of join order for creation of temporary tables

I have a large query in mysql that involves joining multiple tables together. It's too slow, so I've done "explain" and see that it's creating a temporary table, which I suspect of taking most of the execution time. I found some related information:
1. The mysql docs describe conditions when a temporary table might be created. ("The server creates temporary tables under conditions such as these..." [Emphasis added])
2. This related SO question Using index, using temporary, using filesort - how to fix this?, which provides a link to the doc and applies it in a specific case.
3. This related SO question Order of join conditions important? that talks about the order in which joins are evaluated.
My query does not appear to meet any of the conditions listed in the docs #1, in the order that the joins were written by me. By experimentation, however, I find that if I remove my order by clause, the temporary table is not created. That makes me look at this rule from the doc:
Evaluation of statements that contain an ORDER BY clause and a different GROUP BY clause, or for which the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue.
This is the same rule that played in the example at #2 above, but in #2, the OP explicitly had columns from multiple tables in the order by clause, so that's at least superficially different.
Moreover, when I look at the output from explain, it appears that the table that I listed first is not used first by the optimizer. Putting down a pseudo-query for example:
select * from A
join B on A.c1=B.c1
join C on A.c2=C.c2
where A.c3='value'
order by A.c4
I would say that my order by clause does use only columns from the "first table in the join queue" based on the order that I've written the query. On the other hand, the output from explain suggests that it first considers table B then A.
So here are the questions:
1. Does the quoted rule above for use of temporary tables refer to the order in which I write the tables or the order in which the software chooses to evaluate them?
2. If it's the order in which I write them, does this mean that the order of the joins does impact performance? (This seems to contradict the claims at #3 above.)
3. If it's the order in which the software chooses to evaluate them, is there any way to coerce or trick it into selecting an order that doesn't use the temporary table?
It refers to the order in which the optimiser evaluates them (join queue). The optimiser may not even be aware of the order of the tables in your sql statement.
No, it does not contradict what's been written in #3, since the answer explicitly writes (emphasis is mine):
has no effect on the result
The result and performance are two different things. Actually, there is an upvoted comment to the answer saying that
But it might affect the query plan (=> performance)
You can tell the optimiser which table to process first by using straight_join:
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer puts the tables in the wrong order.
However, you need to be careful with that because you tie the optimiser's hands. See this SO topic discussing the advantages and disadvantages of straight_join.
Number of records, where criteria, indexes - they all play their part in the optimiser's decision of the processing order of tables. There is no magic bullet, you need to play around a bit and probably you can trick the optimiser to change the order of the tables.
select * from A
join B on A.c1=B.c1
join C on A.c2=C.c2
where A.c3='value'
order by A.c4
The optimizer will use various heuristics to decide which order to look at the tables. In this case, it will start with A because of the filters (WHERE...).
This "composite" index on A should eliminate the tmp&filesort for ORDER BY: INDEX(c3, c4). Note that this is not the same as INDEX(c3), INDEX(c4).
After getting the rows from A, either B or C could be reached into ("Nested Loop Join"). These indexes are important: B: (c1) and C: (c2).
STRAIGHT_JOIN and FORCE INDEX are usually a bad idea, and should be used only as a last resort. They may help today's query, but hurt tomorrow's.
EXPLAIN FORMAT=JSON SELECT ... gives more information, sometimes even pointing out that two tmp tables are needed.
More tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
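Here is a small sketch of the composite-index point, run against SQLite through Python's sqlite3 because that needs no server. Treat it as an analogy only: SQLite reports the extra sort as "USE TEMP B-TREE FOR ORDER BY" where MySQL would show "Using temporary; Using filesort", and the two optimizers are not the same. The column types, the sample rows and the index names are assumptions; the pseudo-query comes from the question (selecting only A.c4 to keep the output short).

```python
import sqlite3

QUERY = """
SELECT A.c4 FROM A
JOIN B ON A.c1 = B.c1
JOIN C ON A.c2 = C.c2
WHERE A.c3 = 'value'
ORDER BY A.c4
"""

def plan_details(conn, sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (c1 INTEGER, c2 INTEGER, c3 TEXT, c4 INTEGER);
CREATE TABLE B (c1 INTEGER);
CREATE TABLE C (c2 INTEGER);
CREATE INDEX b_c1 ON B (c1);  -- index for reaching into B
CREATE INDEX c_c2 ON C (c2);  -- index for reaching into C
INSERT INTO A VALUES (1, 1, 'value', 30), (1, 1, 'value', 10), (1, 1, 'other', 20);
INSERT INTO B VALUES (1);
INSERT INTO C VALUES (1);
""")

plan_before = plan_details(conn, QUERY)

# Composite index: equality column first, then the ORDER BY column.
conn.execute("CREATE INDEX a_c3_c4 ON A (c3, c4)")
plan_after = plan_details(conn, QUERY)

rows = conn.execute(QUERY).fetchall()
print(plan_before)
print(plan_after)
print(rows)
```

With the (c3, c4) index, the rows matching c3 are already stored in c4 order, so the engine can read them sorted instead of sorting afterwards. Two separate single-column indexes cannot give that guarantee.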

Return number of related records for the results of a query

I have 2 related tables, Tapes & Titles related through the Fields TapeID
Tapes.TapeID & Titles.TapeID
I want to be able to query the Tapes Table on the Column Artist and then return the number of titles for each of the matching Artist records
My Query is as follows
SELECT Tapes.Artist,COUNT(Titles.TapeID)
FROM Tapes
INNER JOIN Titles on Titles.TapeID=Tapes.TapeID
GROUP BY Tapes.Artist
HAVING Tapes.Artist LIKE '<ArtistName>%'
The query appears to run then seems to go into an indefinite loop
I get no syntax errors and no results
Please point out the error in my query
Here are two likely culprits for this poor performance. The first would be the lack of an index on Tapes.TapeId. Based on the naming, I would expect this to be the primary key on the Tapes table. If there are no indexes, then you could get poor performance.
The second would involve the selectivity of the having clause. As written, MySQL is going to aggregate all the data for the group by and then filter out the groups. In many cases, this would not make much of a difference. But, if you have lots of data and the condition is selective (meaning few rows match), then moving it to a where clause would make a difference.
There are definitely other possibilities. For instance, the server could be processing other queries. An update query could be locking one of the tables. Or, the columns TapeId could have different types in the two tables.
You can modify your question to include the definition of the two tables. Also, put explain before the query and include the output in the question. This indicates the execution plan chosen by MySQL.
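As a rough illustration of the second point, the sketch below runs the HAVING form and the WHERE form side by side. It uses Python's sqlite3 as a stand-in for MySQL, and the table definitions, the Titles index and the sample rows are assumptions, since the question does not include them. Both forms return the same groups; the difference is that WHERE discards non-matching artists before any aggregation happens.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tapes (TapeID INTEGER PRIMARY KEY, Artist TEXT);
CREATE TABLE Titles (TitleID INTEGER PRIMARY KEY, TapeID INTEGER, Title TEXT);
CREATE INDEX titles_tapeid ON Titles (TapeID);  -- index on the join column
INSERT INTO Tapes VALUES (1, 'Beatles'), (2, 'Beach Boys'), (3, 'Queen');
INSERT INTO Titles (TapeID, Title) VALUES
    (1, 'Help!'), (1, 'Yesterday'), (2, 'Kokomo'), (3, 'Bohemian Rhapsody');
""")

# Original shape: aggregate every artist, then throw groups away.
having_sql = """
SELECT Tapes.Artist, COUNT(Titles.TapeID)
FROM Tapes
INNER JOIN Titles ON Titles.TapeID = Tapes.TapeID
GROUP BY Tapes.Artist
HAVING Tapes.Artist LIKE ?
ORDER BY Tapes.Artist
"""

# Suggested shape: filter rows first, aggregate only what is left.
where_sql = """
SELECT Tapes.Artist, COUNT(Titles.TapeID)
FROM Tapes
INNER JOIN Titles ON Titles.TapeID = Tapes.TapeID
WHERE Tapes.Artist LIKE ?
GROUP BY Tapes.Artist
ORDER BY Tapes.Artist
"""

pattern = "Bea%"
having_rows = conn.execute(having_sql, (pattern,)).fetchall()
where_rows = conn.execute(where_sql, (pattern,)).fetchall()
print(having_rows)
print(where_rows)
```

Passing the pattern as a bound parameter also sidesteps the quoting problem in the original statement.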

MySQL - SELECT, JOIN

A few months ago I was programming a simple application in PHP with some other guy. There we needed to perform a SELECT from multiple tables based on a userid and another value that you needed to get from the row that was selected by userid.
My first idea was to create multiple SELECTs and parse all the output in the PHP script (with all that mysql_num_rows() and similar functions for checking), but then the guy told me he'll do that. "Okay no problem!" I thought, just that much less for me to write. Well, what a surprise when I found out he did it with just one SQL statement:
SELECT
d.uid AS uid, p.pasmo_cas AS pasmo, d.pasmo AS id_pasmo ...
FROM
table_values AS d, sectors AS p
WHERE
d.userid='$userid' and p.pasmo_id=d.pasmo
ORDER BY
datum DESC, p.pasmo_id DESC
(shortened piece of the statement (...))
Mostly I need to know the differences between this method (is it the right way to do this?) and JOIN - when should I use which one?
Also any references to explanations and examples of these two would come in pretty handy (not from the MySQL ref though - I'm really a novice in this kind of stuff and it's written pretty roughly there.)
The comma (,) notation was superseded by the explicit JOIN syntax in the ANSI-92 standard, and so is in one sense now 20 years out of date.
Also, when doing OUTER JOINs and other more complex queries, the JOIN notation is much more explicit, readable, and (in my opinion) debuggable.
As a general principle, avoid , and use JOIN.
In terms of precedence, a JOIN's ON clause happens before the WHERE clause. This allows things like a LEFT JOIN b ON a.id = b.id WHERE b.id IS NULL to check for cases where there is NOT a matching row in b.
Using , notation is similar to processing the WHERE and ON conditions at the same time.
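The LEFT JOIN ... WHERE b.id IS NULL pattern can be tried out with the sketch below (Python's sqlite3 as a stand-in for MySQL; tables a and b and their rows are made up). It also shows why it matters that ON is applied before WHERE: moving the NULL test into the ON clause changes the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY);
CREATE TABLE b (id INTEGER PRIMARY KEY);
INSERT INTO a VALUES (1), (2), (3);
INSERT INTO b VALUES (2);
""")

# ON pairs the rows first; WHERE then keeps only the rows of a
# that found no partner in b (their b.id was filled with NULL).
unmatched = conn.execute("""
SELECT a.id FROM a
LEFT JOIN b ON a.id = b.id
WHERE b.id IS NULL
ORDER BY a.id
""").fetchall()

# With the NULL test inside ON, no row of b can ever satisfy the condition,
# so every row of a survives with a NULL partner.
null_test_in_on = conn.execute("""
SELECT a.id FROM a
LEFT JOIN b ON a.id = b.id AND b.id IS NULL
ORDER BY a.id
""").fetchall()

print(unmatched)
print(null_test_in_on)
```

This distinction only exists for outer joins, which is one reason the explicit JOIN notation is easier to reason about than the comma form.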
This definitely looks like the ideal scenario for a join so you can avoid returning more data than you actually need. This: http://www.w3schools.com/sql/sql_join.asp or this: http://en.wikipedia.org/wiki/Join_(SQL) should help you get started with joins. I'm also happy to help you write the statement if you can give me a brief outline of the columns / data in each table (primarily I need two matching columns to join on).
The use of the WHERE clause is a valid approach, but as @Dems noted, it has been superseded by the JOIN syntax.
However, I would argue that in some cases, use of the WHERE clauses to achieve joins can be more readable and understandable than using JOINs.
You should make yourself familiar with both methods of joining tables.
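To compare the two notations directly, here is a sketch that runs the question's statement in comma form and in INNER JOIN form. It uses Python's sqlite3 in place of MySQL and PHP; the column types, the sample rows, and the guess that datum belongs to table_values are all assumptions, because the question only shows a shortened statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sectors (pasmo_id INTEGER PRIMARY KEY, pasmo_cas TEXT);
CREATE TABLE table_values (uid INTEGER PRIMARY KEY, userid TEXT,
                           pasmo INTEGER, datum TEXT);
INSERT INTO sectors VALUES (1, '08:00'), (2, '12:00');
INSERT INTO table_values VALUES
    (100, 'u1', 1, '2012-01-01'),
    (101, 'u1', 2, '2012-01-02'),
    (102, 'u2', 1, '2012-01-03');
""")

# Comma notation: the join condition sits in WHERE next to the filter.
comma_sql = """
SELECT d.uid AS uid, p.pasmo_cas AS pasmo, d.pasmo AS id_pasmo
FROM table_values AS d, sectors AS p
WHERE d.userid = ? AND p.pasmo_id = d.pasmo
ORDER BY d.datum DESC, p.pasmo_id DESC
"""

# JOIN notation: the join condition moves to ON, the filter stays in WHERE.
join_sql = """
SELECT d.uid AS uid, p.pasmo_cas AS pasmo, d.pasmo AS id_pasmo
FROM table_values AS d
INNER JOIN sectors AS p ON p.pasmo_id = d.pasmo
WHERE d.userid = ?
ORDER BY d.datum DESC, p.pasmo_id DESC
"""

comma_rows = conn.execute(comma_sql, ("u1",)).fetchall()
join_rows = conn.execute(join_sql, ("u1",)).fetchall()
print(comma_rows)
print(join_rows)
```

For inner joins the two forms describe the same result, so the choice is about readability. The sketch binds the user id as a parameter instead of interpolating '$userid' into the string, which is the safer habit in any client language.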

Sorting items filtered by tag

I want to implement a very common feature - filtering some items by tag. There are many tutorials on the internet with examples of how to do it. The query is quite simple and fast (assuming proper indexes exist).
But usually the filtered items need to be sorted by some field. For example, when you filter questions by tag on SO, you get your results sorted.
To accomplish this task (assuming we need to sort by rating), one could write:
SELECT item.id FROM item
INNER JOIN taggeditem ON taggeditem.item_id = item.id
WHERE
taggeditem.tag_id = 1234
ORDER BY item.rating DESC
We have indexes (taggeditem.tag_id), (item.id), (item.rating)
The problem with this query is that mysql can't use index on item.rating, because the key used to fetch the rows is not the same as the one used in the ORDER BY (MySQL: ORDER BY Optimization). This leads to using a temporary table and filesort, which in turn leads to slow execution time.
The solution I came up with is to denormalize sort field to the taggeditem table, so that I could create index (tag_id, item_rating) on taggeditem.
I've searched for similar questions at SO, and found only this one: Mysql slow query: INNER JOIN + ORDER BY causes filesort. The solution was the same.
So, I want to ask, is this a common solution to this problem? Is it a good practice to denormalize a bunch of sort fields to taggeditem, such as created, rating? At SO you can sort using 4 different parameters (newest, hot, votes, active) - does it mean that they denormalized fields which are used to sort results?
Are there any alternatives to this solution?
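For reference, the denormalization described above can be sketched as follows. This runs on SQLite via Python's sqlite3 purely so that it is self-contained; SQLite reports a sort step as "USE TEMP B-TREE FOR ORDER BY" rather than MySQL's "Using temporary; Using filesort", and its planner is a different one. The column item_rating is the copied sort field from the proposed index; the types, index names and rows are assumptions.

```python
import sqlite3

def plan_details(conn, sql):
    # The last column of each EXPLAIN QUERY PLAN row is the readable step.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, rating INTEGER);
-- item_rating is a denormalized copy of item.rating and must be kept in sync.
CREATE TABLE taggeditem (item_id INTEGER, tag_id INTEGER, item_rating INTEGER);
CREATE INDEX item_rating_idx ON item (rating);
CREATE INDEX taggeditem_tag_rating ON taggeditem (tag_id, item_rating);
INSERT INTO item VALUES (1, 5), (2, 9), (3, 7);
INSERT INTO taggeditem VALUES (1, 1234, 5), (2, 1234, 9), (3, 99, 7);
""")

# Normalized form from the question: filter on one table, sort on the other.
NORMALIZED = """
SELECT item.id FROM item
INNER JOIN taggeditem ON taggeditem.item_id = item.id
WHERE taggeditem.tag_id = 1234
ORDER BY item.rating DESC
"""

# Denormalized form: filter and sort columns live in one composite index.
DENORMALIZED = """
SELECT item_id FROM taggeditem
WHERE tag_id = 1234
ORDER BY item_rating DESC
"""

normalized_rows = conn.execute(NORMALIZED).fetchall()
denormalized_rows = conn.execute(DENORMALIZED).fetchall()
plan_normalized = plan_details(conn, NORMALIZED)
plan_denormalized = plan_details(conn, DENORMALIZED)
print(plan_normalized)
print(plan_denormalized)
```

The price of the faster read is the write path: every change to item.rating has to be propagated to all taggeditem rows of that item.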
There is a standard alternative - change server system variables.
For example, you can experiment with sort_buffer_size value (default 2MB).
More about it.
As soon as you're using a JOIN, and filter out on the joined table, you're stuck with bad performance.
As you said, the only way to avoid this is to create a denormalized table.
For SO's sorts, I think they have no such issue: they just have to sort answers by a column of the answers table (something like SELECT * FROM answers WHERE question_id = 1234 ORDER BY answer_date, with an index on (question_id, answer_date)).
I'm also looking for such solutions, with multi-valued columns, and that's really difficult (denormalized data would be huge, as it needs to cross all values in the multi-valued columns)

MySQL FULLTEXT Search Across >1 Table

As a more general case of this question because I think it may be of interest to more people...What's the best way to perform a fulltext search on two tables? Assume there are three tables, one for programs (with submitter_id) and one each for tags and descriptions with object_id: foreign keys referring to records in programs. We want the submitter_id of programs with certain text in their tags OR descriptions. We have to use MATCH AGAINST for reasons that I won't go into here. Don't get hung up on that aspect.
programs
id
submitter_id
tags_programs
object_id
text
descriptions_programs
object_id
text
The following works and executes in 20 ms or so:
SELECT p.submitter_id
FROM programs p
WHERE p.id IN
(SELECT t.object_id
FROM tags_programs t
WHERE MATCH (t.text) AGAINST ('china')
UNION ALL
SELECT d.object_id
FROM descriptions_programs d
WHERE MATCH (d.text) AGAINST ('china'))
but I tried to rewrite this as a JOIN as follows and it runs for a very long time. I have to kill it after 60 seconds.
SELECT p.id
FROM descriptions_programs d, tags_programs t, programs p
WHERE (d.object_id=p.id AND MATCH (d.text) AGAINST ('china'))
OR (t.object_id=p.id AND MATCH (t.text) AGAINST ('china'))
Just out of curiosity I replaced the OR with AND. That also runs in a few milliseconds, but it's not what I need. What's wrong with the second query above? I can live with the UNION and subselects, but I'd like to understand.
Join after the filters (e.g. join the results), don't try to join and then filter.
The reason is that you lose use of your fulltext index.
Clarification in response to the comment: I'm using the word join generically here, not as JOIN but as a synonym for merge or combine.
I'm essentially saying you should use the first (faster) query, or something like it. The reason it's faster is that each of the subqueries is sufficiently uncluttered that the db can use that table's full text index to do the select very quickly. Joining the two (presumably much smaller) result sets (with UNION) is also fast. This means the whole thing is fast.
The slow version winds up walking through lots of data testing it to see if it's what you want, rather than quickly winnowing the data down and only searching through rows you are likely to actually want.
Just in case you don't know: MySQL has a built-in statement called EXPLAIN that can be used to see what's going on under the surface. There are a lot of articles about this, so I won't go into any detail, but for each table it provides an estimate of the number of rows it will need to process. If you look at the "rows" column in the EXPLAIN result for the second query you'll probably see that the number of rows is quite large, and certainly a lot larger than for the first one.
The net is full of warnings about using subqueries in MySQL, but it turns out that many times the developer is smarter than the MySQL optimizer. Filtering results in some manner before joining can cause major performance boosts in many cases.
If you join both tables you end up having lots of records to inspect. Just as an example, if both tables have 100,000 records, fully joining them gives you 10,000,000,000 records (10 billion!).
If you replace the OR with AND, you allow the engine to filter out all records from table descriptions_programs that don't match 'china', and only then join with tags_programs.
Anyway, that's not what you need, so I'd recommend sticking to the UNION way.
The union is the proper way to go. The join will pull in both full-text indexes at once and can multiply the number of checks actually performed.
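To see the blow-up on a toy scale, here is a sketch of both query shapes. Big caveats: it runs on SQLite through Python's sqlite3, and it uses LIKE '%china%' as a stand-in for MATCH ... AGAINST, so it says nothing about fulltext index usage; it only shows how many row combinations each shape has to consider. The table contents are invented, and p.id is selected in both queries so the results can be compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE programs (id INTEGER PRIMARY KEY, submitter_id INTEGER);
CREATE TABLE tags_programs (object_id INTEGER, text TEXT);
CREATE TABLE descriptions_programs (object_id INTEGER, text TEXT);
INSERT INTO programs VALUES (1, 100), (2, 200), (3, 300);
INSERT INTO tags_programs VALUES (1, 'china travel'), (2, 'food'), (3, 'music');
INSERT INTO descriptions_programs VALUES
    (1, 'about asia'), (2, 'a trip to china'), (3, 'jazz');
""")

# Filter each table on its own, then combine the small id lists.
union_ids = conn.execute("""
SELECT p.id FROM programs p
WHERE p.id IN (SELECT t.object_id FROM tags_programs t
               WHERE t.text LIKE '%china%'
               UNION ALL
               SELECT d.object_id FROM descriptions_programs d
               WHERE d.text LIKE '%china%')
ORDER BY p.id
""").fetchall()

# Three-way comma join with OR: neither branch constrains the other table,
# so every combination of rows is a candidate.
or_join_rows = conn.execute("""
SELECT p.id
FROM descriptions_programs d, tags_programs t, programs p
WHERE (d.object_id = p.id AND d.text LIKE '%china%')
   OR (t.object_id = p.id AND t.text LIKE '%china%')
""").fetchall()

candidate_rows = conn.execute("""
SELECT COUNT(*) FROM descriptions_programs, tags_programs, programs
""").fetchone()[0]

print(union_ids)
print(len(or_join_rows), candidate_rows)
```

Even with three rows per table, the OR version has 27 combinations to test and returns each matching program several times; with real table sizes that product is what makes the query run for minutes.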