INNER JOIN execution/evaluation order - mysql

I was wondering how exactly inner joins works in mysql.
If I do
SELECT * FROM A a
INNER JOIN B b ON a.row = b.row
INNER JOIN C c ON c.row2 = b.row2
WHERE name='Paul';
Does it do the joins first, then pick the ones where name = paul? Because when I do it this way, it is SUPER DUPER slow.
is there a way to do something along the lines:
SELECT * FROM (A a WHERE name='paul')
INNER JOIN B b ON a.row = b.row
INNER JOIN C c ON c.row2 = b.row2]
When I try it that way, I just get an error.
or alternately, is it better to just have 3 separate queries, one for A, B and C? example:
string query1 = "SELECT * FROM A WHERE name = 'paul'";
//send query, get data reader
string query2 = "SELECT * FROM b WHERE b = " + query1.b;
//send query, get data reader
string query3 = "SELECT * FROM C WHERE c = " + query1.c;
//send query, get data reader
Obviously this is just pseudo code, but I think it illustrates the point.
Which way is faster/recommended?
Edit
Table structure:
**tblTimesheet**
int timesheetID (primary key)
datetime date
varchar username
int projectID
string description
float hours
**tblProjects**
int projectID (primary key)
string project name
int clientID
**tblClients**
int clientID
string clientName
The join that I want is:
select * from tblTimesheet time
INNER JOIN tblProject proj on time.projectID = proj.projectID
INNER JOIN tblClient client on proj.clientID = client.clientID
WHERE username = 'paul';
something like that

You are probably missing an index on a key table; you can use the MySql EXPLAIN keyword to help in finding out where your query is slow.
To answer another section of your question;
is there a way to do something along the lines:
SELECT * FROM (A a WHERE name='paul')
INNER JOIN B b ON a.row = b.row
INNER JOIN C c ON c.row2 = b.row2]
You can use a SubQuery;
SELECT *
FROM (SELECT * FROM tblTimesheet WHERE username = 'Paul') AS time
INNER JOIN tblProject proj on time.projectID = proj.projectID
INNER JOIN tblClient client on proj.clientID = client.clientID
What this query is effectively doing is attempting to prefilter the fields the JOIN will operate on. Rather than join all the fields to together, and then filter those down, it only attempts to JOIN fields from tblTimesheet where the name is 'Paul' first.
However, the query optimizer should already be doing this so this query should perform similarly to your original query.
For more help with indexes, the understanding of which will aid you greatly in database development, start by looking at a tutorial like this one.

It's fantastically unlikely a join will be slower than three database hits. Reordering the clauses shouldn't have an impact either if MySQL's query optimizer is at all competent. Are the columns in the WHERE / ON clauses indexed?

I think you'll find the query optimiser will give you the best possible query most of the time. You need to look at the execution plan to find out why the query is slow - my guess is lack of indexes.
When MySql looks in these tables, it will usually do it in the best way to get the best speed - a simple join as you've illustrated won't confuse the query optimiser, but missing indexes can cause the database engine to scan tables instead of looking up values (i.e. it needs to walk through the table row by row to match the criteria you specified)
An index ensures that the engine doesn't need to go searching down to the leaf page level and will usually speed up queries
What's the table structure here or is this all hypothetical?
The general rule of thumb with SQL is - try it and see!

Use your first query, the mySql query optimizer should pick the fastest strategy
if you want it to be faster, make sure that there is an index on the name column

Related

Optimizing MySQL query with subselect

I am trying to make the following query run faster than 180 secs:
SELECT
x.di_on_g AS deviceid, SUM(1) AS amount
FROM
(SELECT
g.device_id AS di_on_g
FROM
guide g
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
WHERE
g.operator_id IN (1 , 1)
AND g.locale_id = 1
AND (g.device_id IN ("many (~1500) comma separated IDs coming from my code"))
GROUP BY g.device_id , g.guide_type_id) x
GROUP BY x.di_on_g
ORDER BY amount;
Screenshot from EXPLAIN:
https://ibb.co/da5oAF
Even if I run the subquery as separate query it is still very slow...:
SELECT
g.device_id AS di_on_g
FROM
guide g
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
WHERE
g.operator_id IN (1 , 1)
AND g.locale_id = 1
AND (g.device_id IN (("many (~1500) comma separated IDs coming from my code")
Screenshot from EXPLAIN:
ibb.co/gJHRVF
I have indexes on g.device_id and on other appropriate places.
Indexes:
SHOW INDEX FROM guide;
ibb.co/eVgmVF
SHOW INDEX FROM operator_guide_type;
ibb.co/f0TTcv
SHOW INDEX FROM operator_device;
ibb.co/mseqqF
I tried creating a new temp table for the ids and using a JOIN to replace the slow IN clause but that didn't make the query much faster.
All IDs are Integers and I tried creating a new temp table for the ids that come from my code and JOIN that table instead of the slow IN clause but that didn't make the query much faster. (10 secs faster)
None of the tables have more then 300,000 rows and the mysql configuration is good.
And the visual plan:
Query Plan
Any help will be appreciated !
Let's focus on the subquery. The main problem is "inflate-deflate", but I will get to that in a moment.
Add the composite index:
INDEX(locale_id, operator_id, device_id)
Why the duplicated "1" in
g.operator_id IN (1 , 1)
Why does the GROUP BY have 2 columns, when you select only 1? Is there some reason for using GROUP BY instead of DISTINCT. (The latter seems to be your intent.)
The only reason for these
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
would be to verify that there are guides and devices in those other table. Is that correct? Are these the PRIMARY KEYs, hence unique?: ogt.guide_type_id and od.device_id. If so, why do you need the GROUP BY? Based on the EXPLAIN, it sounds like both of those are related 1:many. So...
SELECT g.device_id AS di_on_g
FROM guide g
WHERE EXISTS( SELECT * FROM operator_guide_type WHERE guide_type_id = g.guide_type_id )
AND EXISTS( SELECT * FROM operator_device WHERE device_id = g.device_id
AND g.operator_id IN (1)
AND g.locale_id = 1
AND g.device_id IN (...)
Notes:
The GROUP BY is no longer needed.
The "inflate-deflate" of JOIN + GROUP BY is gone. The Explain points this out -- 139K rows inflated to 61M -- very costly.
EXISTS is a "semijoin", meaning that it does not collect all matches, but stops when it finds any match.
"the mysql configuration is good" -- How much RAM do you have? What Engine is the table? What is the value of innodb_buffer_pool_size?

MySQL LEFT JOIN order of ON conditions

I have a MySQL query with inner joins and one left join and a lot of data in my database, and it's running quite slow. This is roughly my query:
SELECT
main_table.*
FROM
main_table
INNER JOIN
...
LEFT JOIN
second_table ON (main_table.id = second_table.ref_id AND second_table.type = 'foo' AND second_table.bar IS NULL
WHERE
second_table.id IS NULL
;
An entry from main_table may have one or more referenced entries in second_table. I want to get all results from main_table, that either have no results in second_table, or only has irrelevant data in the second table (type 'foo' or bar is NULL).
Taking a look into the EXPLAIN, MySQL searches for bar IS NULL first, followed by type = 'foo', that would still result in many thousands of result, whereas checking for ref_id first would only leave very few results to check the other conditions on.
I only have an index on ref_id, not for type or bar and I don't feel the need to index them if I could just get the query search for ref_id first.
--EDIT: I noticed that on the copy of the database (where it has the actual data and runs slow) does also have an index on type and bar individually, so that's probably why MySQL prefers bar over the other keys. I'm considering a key spanning multiple fields.--
Does anybody have an idea how to optimize this kind of query? Is it possible to force MySQL using a certain order in the ON conditions?
"Solution": I added an index spanned over all the relevant fields.
I don't consider this being a real solution, because I believe, it would also have been faster if the JOIN was done on the indexed ref_id first. It probably did so when that was the only index, however my colleague had the idea to add an index separately on the other fields as well for some reason, probably needed somewhere else in our application.
What happens if you move the "Irrelevant" rows to the where part?
Seems to me the DB should have an easier time joining the tables, and will use the index
Something like
SELECT
main_table.*
FROM
main_table
INNER JOIN
...
LEFT JOIN
second_table ON main_table.id = second_table.ref_id
WHERE
second_table.id IS NULL OR
(second_table.type = 'foo' AND second_table.bar IS NULL)
In MYSQL JOIN is faster then LEFT JOIN so you can write your query like this.
SELECT
main_table.*
FROM
main_table
INNER JOIN
...
LEFT JOIN (SELECT main_table.*,second_table.* FROM main_table
JOIN second_table ON main_table.id = second_table.ref_id AND
second_table.type = 'foo' AND second_table.bar IS NULL) AS main_table2 ON
main_table2.id = main_table.id
WHERE
second_table.id IS NULL;

how to properly use JOIN?

I have two tables and a foreign_key index table:
table xymply_locations
id
name
lat
lng
table xymply_categories
id
name
table xymply_categoryf_key
locid
catid
and i want to select the categories that are assigned to locid 1. How do I do this, I tried
SELECT *
FROM `xymply_categoryf_key`, xymply_categories
JOIN `xymply_categories` ON
xymply_categories.id = xymply_categoryf_key.catid
WHERE locid = 1;
but I get "Not unique table/alias: 'xymply_categories' " and I'm wondering why...?
You're mixing implicit (all tables listed in the FROM clause) and explicit JOIN styles in your code, hence the error.
SELECT xc.id, xc.name
FROM xymply_categories xc
INNER JOIN xymply_categoryf_key xck
ON xc.id = xck.catid
WHERE xck.locid = 1;
In your query, you're selecting from two tables. One of them is xymply_categoryf_key, the other is a JOIN of two instances of xymply_categories. You're using two instances of the same table, so when you write xymply_categories.id it is not clear which instance you mean - the one that is the first argument of JOIN, or that one which is the second argument? That's what "Not unique table/alias" means. If I understand correctly what you want to do, try this:
SELECT c.id, c.name FROM xymply_categories c, xymply_categoryf_key k WHERE c.id = k.catid AND k.locid = 1;
This was done without JOIN, although the evaluation of
WHERE c.id = k.catid
maybe would be faster with JOIN, I am not sure. Also, note the usage of k and c as aliases for the tables xymply_categoryf_key (k for key) and xymply_categories c (c for categories). This is how to avoid the problem of "Not unique table/alias" which occured to you before. In your case, you would use e.g.
xymply_categories a JOIN xymply_categories b WHERE a.id = ...
So, although I gave an example how to write the query without using JOIN - as I mentioned, using JOIN will maybe produce a faster query. Therefore, all you should do is to add the aliases.
Because you are "joining" xymply_catagories twice so the db wants an alias for the tables in order to know which one to go to when selecting a column.
you can do joins several ways depending on what you want. A straight inner join (which appears to be what you want) can be
select * from xymply_categoryf_key a, xymply_categories b where a.catid = b.id
WHERE b.locid = 1;
or you can also do an explicit inner join as Joe Stefanelli shows. Either of these gives you the records where there is matching info from each table.

Query efficiency (multiple selects)

I have two tables - one called customer_records and another called customer_actions.
customer_records has the following schema:
CustomerID (auto increment, primary key)
CustomerName
...etc...
customer_actions has the following schema:
ActionID (auto increment, primary key)
CustomerID (relates to customer_records)
ActionType
ActionTime (UNIX time stamp that the entry was made)
Note (TEXT type)
Every time a user carries out an action on a customer record, an entry is made in customer_actions, and the user is given the opportunity to enter a note. ActionType can be one of a few values (like 'designatory update' or 'added case info' - can only be one of a list of options).
What I want to be able to do is display a list of records from customer_records where the last ActionType was a certain value.
So far, I've searched the net/SO and come up with this monster:
SELECT * FROM (
SELECT * FROM (
SELECT * FROM `customer_actions` ORDER BY `EntryID` DESC
) list1 GROUP BY `CustomerID`
) list2 WHERE `ActionType`='whatever' LIMIT 0,30
Which is great - it lists each customer ID and their last action. But the query is extremely slow on occasions (note: there are nearly 20,000 records in customer_records). Can anyone offer any tips on how I can sort this monster of a query out or adjust my table to give faster results? I'm using MySQL. Any help is really appreciated, thanks.
Edit: To be clear, I need to see a list of customers who's last action was 'whatever'.
To filter customers by their last action, you could use a correlated sub-query...
SELECT
*
FROM
customer_records
INNER JOIN
customer_actions
ON customer_actions.CustomerID = customer_records.CustomerID
AND customer_actions.ActionDate = (
SELECT
MAX(ActionDate)
FROM
customer_actions AS lookup
WHERE
CustomerID = customer_records.CustomerID
)
WHERE
customer_actions.ActionType = 'Whatever'
You may find it more efficient to avoid the correlated sub-query as follows...
SELECT
*
FROM
customer_records
INNER JOIN
(SELECT CustomerID, MAX(ActionDate) AS ActionDate FROM customer_actions GROUP BY CustomerID) AS last_action
ON customer_records.CustomerID = last_action.CustomerID
INNER JOIN
customer_actions
ON customer_actions.CustomerID = last_action.CustomerID
AND customer_actions.ActionDate = last_action.ActionDate
WHERE
customer_actions.ActionType = 'Whatever'
I'm not sure if I understand the requirements but it looks to me like a JOIN would be enough for that.
SELECT cr.CustomerID, cr.CustomerName, ...
FROM customer_records cr
INNER JOIN customer_actions ca ON ca.CustomerID = cr.CustomerID
WHERE `ActionType` = 'whatever'
ORDER BY
ca.EntryID
Note that 20.000 records should not pose a performance problem
Please note that I've adapted Lieven's answer (I made a separate post as this was too long for a comment). Any credit for the solution itself goes to him, I'm just trying to show you some key points for improving performance.
If speed is a concern then the following should give you some suggestions for improving it:
select top 100 -- Change as required
cr.CustomerID ,
cr.CustomerName,
cr.MoreDetail1,
cr.Etc
from customer_records cr
inner join customer_actions ca
on ca.CustomerID = cr.CustomerID
where ca.ActionType = 'x'
order by cr.CustomerID
A few notes:
In some cases I find left outer joins to be faster then inner joins - It would be worth measuring performance for both for this query
Avoid returning * wherever possible
You don't have to reference 'cr.x' in the initial select but it's a good habit to get into for when you start working on large queries that can have multiple joins in them (this will make a lot of sense once you start doing this
When using joins always join on a primary key
Maybe I'm missing something but what's wrong with a simple join and a where clause?
Select ActionType, ActionTime, Note
FROM Customer_Records CR
INNER JOIN customer_Actions CA
ON CR.CustomerID = CA.CustomerID
Where ActionType = 'added case info'

In what order are MySQL JOINs evaluated?

I have the following query:
SELECT c.*
FROM companies AS c
JOIN users AS u USING(companyid)
JOIN jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123;
I have the following questions:
Is the USING syntax synonymous with ON syntax?
Are these joins evaluated left to right? In other words, does this query say: x = companies JOIN users; y = x JOIN jobs; z = y JOIN useraccounts;
If the answer to question 2 is yes, is it safe to assume that the companies table has companyid, userid and jobid columns?
I don't understand how the WHERE clause can be used to pick rows on the companies table when it is referring to the alias "j"
Any help would be appreciated!
USING (fieldname) is a shorthand way of saying ON table1.fieldname = table2.fieldname.
SQL doesn't define the 'order' in which JOINS are done because it is not the nature of the language. Obviously an order has to be specified in the statement, but an INNER JOIN can be considered commutative: you can list them in any order and you will get the same results.
That said, when constructing a SELECT ... JOIN, particularly one that includes LEFT JOINs, I've found it makes sense to regard the third JOIN as joining the new table to the results of the first JOIN, the fourth JOIN as joining the results of the second JOIN, and so on.
More rarely, the specified order can influence the behaviour of the query optimizer, due to the way it influences the heuristics.
No. The way the query is assembled, it requires that companies and users both have a companyid, jobs has a userid and a jobid and useraccounts has a userid. However, only one of companies or user needs a userid for the JOIN to work.
The WHERE clause is filtering the whole result -- i.e. all JOINed columns -- using a column provided by the jobs table.
I can't answer the bit about the USING syntax. That's weird. I've never seen it before, having always used an ON clause instead.
But what I can tell you is that the order of JOIN operations is determined dynamically by the query optimizer when it constructs its query plan, based on a system of optimization heuristics, some of which are:
Is the JOIN performed on a primary key field? If so, this gets high priority in the query plan.
Is the JOIN performed on a foreign key field? This also gets high priority.
Does an index exist on the joined field? If so, bump the priority.
Is a JOIN operation performed on a field in WHERE clause? Can the WHERE clause expression be evaluated by examining the index (rather than by performing a table scan)? This is a major optimization opportunity, so it gets a major priority bump.
What is the cardinality of the joined column? Columns with high cardinality give the optimizer more opportunities to discriminate against false matches (those that don't satisfy the WHERE clause or the ON clause), so high-cardinality joins are usually processed before low-cardinality joins.
How many actual rows are in the joined table? Joining against a table with only 100 values is going to create less of a data explosion than joining against a table with ten million rows.
Anyhow... the point is... there are a LOT of variables that go into the query execution plan. If you want to see how MySQL optimizes its queries, use the EXPLAIN syntax.
And here's a good article to read:
http://www.informit.com/articles/article.aspx?p=377652
ON EDIT:
To answer your 4th question: You aren't querying the "companies" table. You're querying the joined cross-product of ALL four tables in your FROM and USING clauses.
The "j.jobid" alias is just the fully-qualified name of one of the columns in that joined collection of tables.
In MySQL, it's often interesting to ask the query optimizer what it plans to do, with:
EXPLAIN SELECT [...]
See "7.2.1 Optimizing Queries with EXPLAIN"
Here is a more detailed answer on JOIN precedence. In your case, the JOINs are all commutative. Let's try one where they aren't.
Build schema:
CREATE TABLE users (
name text
);
CREATE TABLE orders (
order_id text,
user_name text
);
CREATE TABLE shipments (
order_id text,
fulfiller text
);
Add data:
INSERT INTO users VALUES ('Bob'), ('Mary');
INSERT INTO orders VALUES ('order1', 'Bob');
INSERT INTO shipments VALUES ('order1', 'Fulfilling Mary');
Run query:
SELECT *
FROM users
LEFT OUTER JOIN orders
ON orders.user_name = users.name
JOIN shipments
ON shipments.order_id = orders.order_id
Result:
Only the Bob row is returned
Analysis:
In this query the LEFT OUTER JOIN was evaluated first and the JOIN was evaluated on the composite result of the LEFT OUTER JOIN.
Second query:
SELECT *
FROM users
LEFT OUTER JOIN (
orders
JOIN shipments
ON shipments.order_id = orders.order_id)
ON orders.user_name = users.name
Result:
One row for Bob (with the fulfillment data) and one row for Mary with NULLs for fulfillment data.
Analysis:
The parenthesis changed the evaluation order.
Further MySQL documentation is at https://dev.mysql.com/doc/refman/5.5/en/nested-join-optimization.html
SEE http://dev.mysql.com/doc/refman/5.0/en/join.html
AND start reading here:
Join Processing Changes in MySQL 5.0.12
Beginning with MySQL 5.0.12, natural joins and joins with USING, including outer join variants, are processed according to the SQL:2003 standard. The goal was to align the syntax and semantics of MySQL with respect to NATURAL JOIN and JOIN ... USING according to SQL:2003. However, these changes in join processing can result in different output columns for some joins. Also, some queries that appeared to work correctly in older versions must be rewritten to comply with the standard.
These changes have five main aspects:
The way that MySQL determines the result columns of NATURAL or USING join operations (and thus the result of the entire FROM clause).
Expansion of SELECT * and SELECT tbl_name.* into a list of selected columns.
Resolution of column names in NATURAL or USING joins.
Transformation of NATURAL or USING joins into JOIN ... ON.
Resolution of column names in the ON condition of a JOIN ... ON.
Im not sure about the ON vs USING part (though this website says they are the same)
As for the ordering question, its entirely implementation (and probably query) specific. MYSQL most likely picks an order when compiling the request. If you do want to enforce a particular order you would have to 'nest' your queries:
SELECT c.*
FROM companies AS c
JOIN (SELECT * FROM users AS u
JOIN (SELECT * FROM jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123)
)
as for part 4: the where clause limits what rows from the jobs table are eligible to be JOINed on. So if there are rows which would join due to the matching userids but don't have the correct jobid then they will be omitted.
1) Using is not exactly the same as on, but it is short hand where both tables have a column with the same name you are joining on... see: http://www.java2s.com/Tutorial/MySQL/0100__Table-Join/ThekeywordUSINGcanbeusedasareplacementfortheONkeywordduringthetableJoins.htm
It is more difficult to read in my opinion, so I'd go spelling out the joins.
3) It is not clear from this query, but I would guess it does not.
2) Assuming you are joining through the other tables (not all directly on companyies) the order in this query does matter... see comparisons below:
Origional:
SELECT c.*
FROM companies AS c
JOIN users AS u USING(companyid)
JOIN jobs AS j USING(userid)
JOIN useraccounts AS us USING(userid)
WHERE j.jobid = 123
What I think it is likely suggesting:
SELECT c.*
FROM companies AS c
JOIN users AS u on u.companyid = c.companyid
JOIN jobs AS j on j.userid = u.userid
JOIN useraccounts AS us on us.userid = u.userid
WHERE j.jobid = 123
You could switch you lines joining jobs & usersaccounts here.
What it would look like if everything joined on company:
SELECT c.*
FROM companies AS c
JOIN users AS u on u.companyid = c.companyid
JOIN jobs AS j on j.userid = c.userid
JOIN useraccounts AS us on us.userid = c.userid
WHERE j.jobid = 123
This doesn't really make logical sense... unless each user has their own company.
4.) The magic of sql is that you can only show certain columns but all of them are their for sorting and filtering...
if you returned
SELECT c.*, j.jobid....
you could clearly see what it was filtering on, but the database server doesn't care if you output a row or not for filtering.