I was given this query to update a report, and it was taking a long time to run on my computer.
select
c.category_type, t.categoryid, t.date, t.clicks
from transactions t
join category c
on c.category_id = t.categoryid
I asked the DBA if there were any issues with the query, and the DBA optimized the query in this manner:
select
(select category_type
from category c where c.category_id = t.categoryid) category_type,
categoryid,
date, clicks
from transactions t
He described the first subquery as a "point-in-select". I have never heard of this before. Can someone explain this concept?
I want to note that the two queries are not the same, unless the following is true:
transactions.categoryid is always present in category.
category has no duplicate values of category_id.
In practice, these would be true (in most databases). The first query should be using a left join version for closer equivalence:
select c.category_type, t.categoryid, t.date, t.clicks
from transactions t left join
category c
on c.category_id = t.categoryid;
Still not exactly the same, but more similar.
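To make the difference concrete, here is a minimal sketch that runs all three versions against a tiny data set. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL, and the sample rows are made up. Only the "missing category" case is shown, since engines differ in how they treat a scalar subquery that returns several rows.

```python
import sqlite3

# SQLite stands in for MySQL here; the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (category_id INTEGER, category_type TEXT);
    CREATE TABLE transactions (categoryid INTEGER, date TEXT, clicks INTEGER);
    INSERT INTO category VALUES (1, 'books');
    INSERT INTO transactions VALUES (1, '2021-01-01', 10);
    INSERT INTO transactions VALUES (2, '2021-01-02', 20);  -- no category 2 exists
""")

# Original query: an inner join drops transactions without a matching category.
inner_rows = conn.execute("""
    SELECT c.category_type, t.categoryid, t.date, t.clicks
    FROM transactions t
    JOIN category c ON c.category_id = t.categoryid
    ORDER BY t.categoryid
""").fetchall()

# DBA's query: the scalar subquery yields NULL when there is no match.
subquery_rows = conn.execute("""
    SELECT (SELECT category_type FROM category c
            WHERE c.category_id = t.categoryid) AS category_type,
           categoryid, date, clicks
    FROM transactions t
    ORDER BY categoryid
""").fetchall()

# Left join: keeps every transaction, like the scalar subquery.
left_rows = conn.execute("""
    SELECT c.category_type, t.categoryid, t.date, t.clicks
    FROM transactions t
    LEFT JOIN category c ON c.category_id = t.categoryid
    ORDER BY t.categoryid
""").fetchall()

print(len(inner_rows), len(subquery_rows), len(left_rows))
```

With this data, the inner join returns one row while the other two return both transactions, with a NULL category_type for the unmatched one.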
Finally, both versions should make use of an index on category(category_id), and I would expect the performance to be very similar in MySQL.
Your DBA's query is not equivalent to yours, as others noted. A scalar subquery in the SELECT list is valid standard SQL, but your version is much preferable for its simplicity alone.
It's usually not advantageous to re-write queries for performance. It can help sometimes, but the DBMS is supposed to execute logically equivalent queries equivalently. Failure to do so is a flaw in the query planner.
Performance issues are often a function of physical design. In your case, I would look for indexes on the category and transactions tables that contain categoryid as the first column. If neither exists, your join is O(mn) because the category table must be scanned for each transaction row.
Not being a MySQL user, I can only advise you to get query planner output and look for indexing opportunities.
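As a rough illustration of reading planner output, the sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for MySQL's EXPLAIN. The index name idx_category_id is hypothetical, and the exact wording of the plan depends on the engine and version, so treat this as a pattern rather than expected MySQL output.

```python
import sqlite3

# SQLite stands in for MySQL; idx_category_id is a hypothetical index name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (category_id INTEGER, category_type TEXT);
    CREATE TABLE transactions (categoryid INTEGER, date TEXT, clicks INTEGER);
    CREATE INDEX idx_category_id ON category (category_id);
""")

query = """
    SELECT c.category_type, t.categoryid, t.date, t.clicks
    FROM transactions t
    JOIN category c ON c.category_id = t.categoryid
"""

# The last column of each plan row is a human-readable description of one step.
plan_details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
for detail in plan_details:
    print(detail)

# True if some step of the plan mentions the index on the join column.
uses_index = any("idx_category_id" in detail for detail in plan_details)
print("uses index:", uses_index)
```

If no step mentions an index on the join column, that is the indexing opportunity to look at first.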
Is Query 1 more optimized than Query 2, say for a larger database, even slightly? Or am I just doubling the work with an additional WHERE clause?
Query 1:
SELECT sample_data
FROM table1 INNER JOIN table2 ON table1.key = table2.key
WHERE table1.key = table2.key;
Query 2:
SELECT sample_data
FROM table1 INNER JOIN table2 ON table1.key = table2.key;
I ask because I read an article saying that using filters in JOIN clauses improves performance.
Is Query 1 more optimized say for example for a larger database than Query 2?
No, it is not more optimized. Query 2 is the correct way to handle the JOIN. Query 1 does the same thing, but with extra verbiage for the MySQL server software to scrub out as it figures out how to satisfy your query.
The advice at the Adobe documentation about filtering both tables in a join does not relate to the join's ON-condition. Their example says to do this...
SELECT whatever, whatever
FROM table1
JOIN table2 ON table2.table1_id = table1.table1_id
WHERE table1.date >= '2021-01-01'
AND table2.date >= '2021-01-01' /* THIS LINE IS WHAT THEY SUGGEST */
Their suggestion, from 2015, has to do with filtering non-join attributes from both tables. It's a suggestion to use to optimize a query if it just isn't fast enough for you. And, in my experience, it's not a very good suggestion. Ignore it, at least for now. More recent MySQL versions have gotten more efficient.
Let me add to this. SQL is a so-called "declarative" language. You declare what you want and the MySQL server figures out how to get it for you. SQL software is getting really good at doing that; keep in mind that MySQL is now a quarter century old. In that time its programmers have been continuously making it smarter at figuring out how to get stuff. You probably can't outsmart it. But you may need to add indexes when your tables get really big. https://use-the-index-luke.com/
Other languages are "procedural": you, as a programmer, spell out a procedure for getting what you want. You don't need to do that for SQL.
I like to put it this way:
ON is where you specify how the tables are related.
WHERE is for filtering.
That makes it easy for a human reading the query to understand it.
In reality (for MySQL), JOIN (aka INNER JOIN) treats ON and WHERE identically. That is, there is no performance difference. Your Query 1 unnecessarily specifies the "relation" twice.
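A quick way to convince yourself that the repeated condition changes nothing is to run both queries on the same data. This sketch assumes SQLite via Python's sqlite3 as a stand-in for MySQL; the rows are made up, and the column name is quoted because key is a keyword.

```python
import sqlite3

# SQLite stands in for MySQL; "key" is quoted because it is a keyword.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 ("key" INTEGER, sample_data TEXT);
    CREATE TABLE table2 ("key" INTEGER);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table2 VALUES (2), (3), (4);
""")

# Query 1: join condition stated in ON and repeated in WHERE.
rows_query1 = conn.execute("""
    SELECT sample_data
    FROM table1 INNER JOIN table2 ON table1."key" = table2."key"
    WHERE table1."key" = table2."key"
    ORDER BY sample_data
""").fetchall()

# Query 2: join condition stated once, in ON.
rows_query2 = conn.execute("""
    SELECT sample_data
    FROM table1 INNER JOIN table2 ON table1."key" = table2."key"
    ORDER BY sample_data
""").fetchall()

print(rows_query1 == rows_query2)
```

Both queries return the rows for keys 2 and 3; the extra WHERE only restates what ON already requires.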
Also, MySQL's Optimizer is smart enough to realize when two columns have the same value. For example,
SELECT ...
FROM a
JOIN bb ON a.foo = bb.foo
WHERE a.foo = 123
If the Optimizer decides that starting with the filter bb.foo = 123 is more optimal, it will do so. Note: This is not the same as the example you showed; it joins on one thing (id) but filters on another (date). The two queries there are not equivalent!
LEFT JOIN necessarily treats ON and WHERE differently. (But that is another topic.)
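For the curious, here is a small sketch of that LEFT JOIN difference. It assumes SQLite via Python's sqlite3 as a stand-in for MySQL; the status column and the rows are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; the status column and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (foo INTEGER);
    CREATE TABLE bb (foo INTEGER, status TEXT);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO bb VALUES (1, 'active'), (2, 'inactive');
""")

# Filter inside ON: every row of a survives; non-matching rows get NULLs.
filter_in_on = conn.execute("""
    SELECT a.foo, bb.status
    FROM a
    LEFT JOIN bb ON a.foo = bb.foo AND bb.status = 'active'
    ORDER BY a.foo
""").fetchall()

# Filter inside WHERE: applied after the join, so NULL-extended rows are dropped.
filter_in_where = conn.execute("""
    SELECT a.foo, bb.status
    FROM a
    LEFT JOIN bb ON a.foo = bb.foo
    WHERE bb.status = 'active'
    ORDER BY a.foo
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```

The ON version keeps both rows of a (the second with a NULL status), while the WHERE version keeps only the row with an active match.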
In MySQL, which of these inner join queries is the most efficient and the best to use?
1.
select t01.uname, t02.deptname
from user t01, department t02
where t01.deptid = t02.deptid
and t01.uid = '001'
2.
select t01.uname, t02.deptname
from user t01, department t02
where t01.uid = '001'
and t01.deptid = t02.deptid
3.
select t01.uname, t02.deptname
from user t01 inner join department t02 on t01.deptid = t02.deptid
and t01.uid = '001'
4.
select t01.uname, t02.deptname
from user t01 inner join department t02 on t01.deptid = t02.deptid
where t01.uid = '001'
My MySQL version is 5.1.
All of those are functionally equivalent. Even the separation between WHERE clause and JOIN condition will not change the results when working entirely with INNER joins (it can matter with OUTER joins). Additionally, all of those should work out into the exact same query plan (effectively zero performance difference). The order in which you include items does not matter. The query engine is free to optimize as it sees best fit within the functional specification of the query. Even when you identify specific behavior with regards to order, you shouldn't count on it. The specification allows for tomorrow's patch to change today's behavior in this area. Remember: the whole point of SQL is to be set-based and declarative: you tell the database what you want it to do, not how you want it to do it.
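The functional equivalence is easy to check on a toy data set. This is only a sketch, assuming SQLite via Python's sqlite3 instead of MySQL 5.1, with invented rows; it checks results, not query plans.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (deptid INTEGER, deptname TEXT);
    CREATE TABLE user (uid TEXT, uname TEXT, deptid INTEGER);
    INSERT INTO department VALUES (1, 'sales'), (2, 'support');
    INSERT INTO user VALUES ('001', 'alice', 1), ('002', 'bob', 2);
""")

# The four variants from the question, in the same order.
variants = [
    """select t01.uname, t02.deptname
       from user t01, department t02
       where t01.deptid = t02.deptid and t01.uid = '001'""",
    """select t01.uname, t02.deptname
       from user t01, department t02
       where t01.uid = '001' and t01.deptid = t02.deptid""",
    """select t01.uname, t02.deptname
       from user t01 inner join department t02
       on t01.deptid = t02.deptid and t01.uid = '001'""",
    """select t01.uname, t02.deptname
       from user t01 inner join department t02 on t01.deptid = t02.deptid
       where t01.uid = '001'""",
]

results = [conn.execute(sql).fetchall() for sql in variants]
for rows in results:
    print(rows)
```

All four return the same single row for user '001'.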
Now that correctness and performance are out of the way, we're down to matters of style: things like programmer productivity and readability/maintainability of the code. In that regard, option #4 in that list is by far the best choice, with #3 the next best, especially as you start to get into more complicated queries. Just don't use the A,B syntax anymore; it's been obsolete since the 1992 version of the SQL standard. Always write out the full INNER JOIN (or LEFT JOIN/RIGHT JOIN/CROSS JOIN etc).
All that said, while order does (or, at least, should) not matter to performance, I do find it helpful when I'm writing SQL to use a convention in my approach that does dictate the order. This helps me identify errors or false assumptions later when debugging and troubleshooting. This general guide that I try to follow is to behave as if the order does matter, and then with that in mind try to keep the working set of memory needed by the database to fulfill the query as small as possible for as long as possible: start with smaller tables first and then join to the larger; when considering table size, take into account conditions in the WHERE clause that match up with an index; prefer the inner joins before outer when you have the choice; list join conditions to favor indexes (especially primary/clustered keys) first, and other conditions on the join second.
What is the difference between these two mysql queries
select t.id,
(select count(c.id) from comment c where c.topic_id = t.id) as comments_count
from topic t;
AND
select t.id, comments.count from topic t
left join
(
select count(c.id) count,c.topic_id from comment c group by topic_id
) as comments on t.id = comments.topic_id
I know there's not much information. I just wanted to know when to use a subquery versus a joined subquery, and what the difference between them is.
Thanks
This is a good question, but I would also add a third option (the more standard way of doing this):
select t.id, count(c.topic_id) as count
from topic t left join
comment c
on t.id = c.topic_id
group by t.id;
The first way is often the most efficient in MySQL. MySQL can take advantage of an index on comment(topic_id) to generate the count. This may be true in other databases as well, but it is particularly noticeable in MySQL, which often does not use indexes for group by in practice.
The second query does the aggregation and then a join. The subquery is materialized, adding additional overhead, and then the join cannot use an index on comment. It could possibly use an index on topic, but the left join may make that option less likely. (You would need to check the execution plan in your environment.)
The third option would be equivalent to the first in many databases, but not in MySQL. It does the join to comment (taking advantage of an index on comment(topic_id), if available). However, it then incurs the overhead of a file sort for the final aggregation.
Reluctantly, I must admit that the first choice is often the best in terms of performance in MySQL, particularly if the right indexes are available. Without indexes, any of the three might be the best choice. For instance, without indexes, the second is the best if comment is empty or has very few topics.
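Performance aside, the three forms also differ slightly in what they return for a topic with no comments. The sketch below assumes SQLite via Python's sqlite3 as a stand-in for MySQL, with made-up rows; the alias cnt replaces count in the derived table purely to keep the example unambiguous.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topic (id INTEGER PRIMARY KEY);
    CREATE TABLE comment (id INTEGER PRIMARY KEY, topic_id INTEGER);
    INSERT INTO topic VALUES (1), (2);
    INSERT INTO comment VALUES (10, 1), (11, 1);  -- topic 2 has no comments
""")

# 1. Correlated subquery in the SELECT list.
correlated = conn.execute("""
    SELECT t.id,
           (SELECT COUNT(c.id) FROM comment c WHERE c.topic_id = t.id) AS comments_count
    FROM topic t
    ORDER BY t.id
""").fetchall()

# 2. Aggregate in a derived table, then left join to it.
derived = conn.execute("""
    SELECT t.id, comments.cnt
    FROM topic t
    LEFT JOIN (
        SELECT COUNT(c.id) AS cnt, c.topic_id FROM comment c GROUP BY c.topic_id
    ) AS comments ON t.id = comments.topic_id
    ORDER BY t.id
""").fetchall()

# 3. Left join first, then aggregate.
grouped = conn.execute("""
    SELECT t.id, COUNT(c.topic_id) AS cnt
    FROM topic t
    LEFT JOIN comment c ON t.id = c.topic_id
    GROUP BY t.id
    ORDER BY t.id
""").fetchall()

print(correlated, derived, grouped)
```

The first and third forms report 0 for the topic without comments, while the derived-table form reports NULL unless you wrap the count in COALESCE.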
I am starting to learn SQL, and I find that we can often achieve the same result with either JOINs or inner SELECT statements.
Question1 (broad): When are JOINs faster than inner selects, and vice versa?
Question2 (narrow): Can you explain what causes the performance difference among the three queries below?
P.S. There is a very nice site which calculates query performance, but I can't understand its estimation results.
Query1:
SELECT DISTINCT maker
FROM Product pro INNER JOIN Printer pri
on pro.model = pri.model
Query2:
SELECT DISTINCT maker
FROM Product
WHERE model IN (
SELECT model FROM Printer
)
Query3:
SELECT distinct maker
FROM Product pro, Printer pri
WHERE pro.model = pri.model
When the server evaluates a JOIN, it matches rows on the join condition, reading only the columns it needs to find the matching value in the other table and filtering out everything else. This is usually done with a dedicated operation.
With a subquery, the server needs to evaluate the plan for the subquery before it can match the join condition. So unless the subquery makes up for that extra effort by filtering out a lot of noise, you get better performance without it.
Servers are quite smart, and they try to shave off everything they don't need in order to evaluate the join. Then they try to use every index they can to get the best performance, where "best" means the best plan they can find in a limited amount of time, so that planning time itself doesn't kill the performance.
Added after the comment of the OP
The O(n) estimate depends on the complexity of the query and the subquery. If you are interested in how the query plan is built, you'll have to dig through the documentation of your database of choice, and you probably will not find much if the database is not open source.
In layman's terms:
a simple join is evaluated on one level, the main query plan
a subquery is evaluated on two levels, the subquery plan and the main query plan.
Some database IDEs can display a visual representation of the whole plan, which usually helps in understanding some of these points (I don't know whether MySQL has that).
Query1 is faster in general, but the RDBMS could optimize Query2 to give approximately the same performance.
If the IN subquery is rather complicated, with dependencies on the main table(s), it could be executed for each retrieved row to check the condition.
Normally, INNER JOIN is used to join values from two different tables, whereas an inner SELECT is used to select a particular value from a different table and use the result to produce a single output.
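One reason the join and the IN form can be planned differently is that they are only equivalent here because of DISTINCT. The sketch below assumes SQLite via Python's sqlite3 as a stand-in, with invented rows in which one printer model appears twice.

```python
import sqlite3

# SQLite stands in for the asker's database; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Product (maker TEXT, model TEXT);
    CREATE TABLE Printer (model TEXT);
    INSERT INTO Product VALUES ('A', 'p1'), ('B', 'pc1'), ('C', 'p3');
    INSERT INTO Printer VALUES ('p1'), ('p1'), ('p3');  -- model p1 listed twice
""")

def run(sql):
    """Run a query and return all rows as a list of tuples."""
    return conn.execute(sql).fetchall()

# With DISTINCT, the join (Query1) and the IN subquery (Query2) agree.
join_distinct = run("""
    SELECT DISTINCT maker
    FROM Product pro INNER JOIN Printer pri ON pro.model = pri.model
    ORDER BY maker
""")
in_distinct = run("""
    SELECT DISTINCT maker
    FROM Product
    WHERE model IN (SELECT model FROM Printer)
    ORDER BY maker
""")

# Without DISTINCT, the join yields one row per matching printer,
# while IN yields one row per qualifying product.
join_plain = run("""
    SELECT maker
    FROM Product pro INNER JOIN Printer pri ON pro.model = pri.model
""")
in_plain = run("""
    SELECT maker FROM Product WHERE model IN (SELECT model FROM Printer)
""")

print(join_distinct, in_distinct, len(join_plain), len(in_plain))
```

So the IN form only has to establish that a match exists, whereas the join produces every match and relies on DISTINCT to collapse them afterwards.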
I came across different ways of writing the query, as shown below:
Type-I
SELECT JS.JobseekerID
, JS.FirstName
, JS.LastName
, JS.Currency
, JS.AccountRegDate
, JS.LastUpdated
, JS.NoticePeriod
, JS.Availability
, C.CountryName
, S.SalaryAmount
, DD.DisciplineName
, DT.DegreeLevel
FROM Jobseekers JS
INNER JOIN Countries C
ON JS.CountryID = C.CountryID
INNER JOIN SalaryBracket S
ON JS.MinSalaryID = S.SalaryID
INNER JOIN DegreeDisciplines DD
ON JS.DegreeDisciplineID = DD.DisciplineID
INNER JOIN DegreeType DT
ON JS.DegreeTypeID = DT.DegreeTypeID
WHERE JS.ShowCV = 'Yes'
Type-II
SELECT JS.JobseekerID
, JS.FirstName
, JS.LastName
, JS.Currency
, JS.AccountRegDate
, JS.LastUpdated
, JS.NoticePeriod
, JS.Availability
, C.CountryName
, S.SalaryAmount
, DD.DisciplineName
, DT.DegreeLevel
FROM Jobseekers JS, Countries C, SalaryBracket S, DegreeDisciplines DD
, DegreeType DT
WHERE
JS.CountryID = C.CountryID
AND JS.MinSalaryID = S.SalaryID
AND JS.DegreeDisciplineID = DD.DisciplineID
AND JS.DegreeTypeID = DT.DegreeTypeID
AND JS.ShowCV = 'Yes'
I am using a MySQL database.
Both work really well, but I am wondering:
Which is the best practice to use all the time, in any situation?
Performance-wise, which one is better? (Say the database has millions of records.)
Are there any advantages of one over the other?
Is there any tool where I can check which query is better?
Thanks in advance
1- It's a no-brainer: use Type I.
2- Type II joins are also called 'implicit joins', whereas Type I joins are called 'explicit joins'. With a modern DBMS, you will not have any performance problem with normal queries. But I think that with some big, complex multi-join queries, the DBMS could have issues with implicit joins. Using only explicit joins could improve your execution plan, and so give faster results!
3- So performance could be an issue, but perhaps most importantly, readability is improved for later maintenance. An explicit join states exactly what you want to join and on which field, whereas an implicit join doesn't show whether a condition is a join or a filter. The WHERE clause is for filtering, not for joining!
And a big point in favor of explicit joins: outer joins are really annoying to write as implicit joins. A query that mixes multiple joins with outer joins is so hard to read that explicit joins are THE solution.
4- Execution plans are what you need (see the docs).
Some duplicates :
Explicit vs implicit SQL joins
SQL join: where clause vs. on clause
INNER JOIN ON vs WHERE clause
In most of the code I've seen, these queries are written like your Type-II, but I think Type-I is better because of readability (and it is more logical: a join is a join, so you should write it as a join, although the second one is just another writing style for inner joins).
In performance, there shouldn't be a difference (if there is one, I think Type-I would be a bit faster).
Look at the EXPLAIN syntax:
http://dev.mysql.com/doc/refman/5.1/en/explain.html
My suggestion.
Populate all your tables with some amount of records. Open the MySQL console and run both SQL commands one by one. You can see the execution time in the console.
For the two queries you mentioned (each with only inner joins) any modern database's query optimizer should produce exactly the same query plan, and thus the same performance.
For MySQL, if you prefix the query with EXPLAIN, it will spit out information about the query plan (instead of running the query). If the information from both queries is the same, then the query plan is the same, and the performance will be identical. From the MySQL Reference Manual:
EXPLAIN returns a row of information for each table used in the SELECT statement. The tables are listed in the output in the order that MySQL would read them while processing the query. MySQL resolves all joins using a nested-loop join method. This means that MySQL reads a row from the first table, and then finds a matching row in the second table, the third table, and so on. When all tables are processed, MySQL outputs the selected columns and backtracks through the table list until a table is found for which there are more matching rows. The next row is read from this table and the process continues with the next table.
When the EXTENDED keyword is used, EXPLAIN produces extra information that can be viewed by issuing a SHOW WARNINGS statement following the EXPLAIN statement. This information displays how the optimizer qualifies table and column names in the SELECT statement, what the SELECT looks like after the application of rewriting and optimization rules, and possibly other notes about the optimization process.
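The same plan-comparison idea can be sketched outside MySQL. The example below assumes SQLite via Python's sqlite3, where EXPLAIN QUERY PLAN plays the role of EXPLAIN, and uses a cut-down, hypothetical version of the Jobseekers and Countries tables; the plan text itself will look different in MySQL.

```python
import sqlite3

# SQLite stands in for MySQL; the schema is a cut-down, hypothetical version.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Countries (CountryID INTEGER PRIMARY KEY, CountryName TEXT);
    CREATE TABLE Jobseekers (JobseekerID INTEGER PRIMARY KEY,
                             CountryID INTEGER, ShowCV TEXT);
    INSERT INTO Countries VALUES (1, 'France'), (2, 'Japan');
    INSERT INTO Jobseekers VALUES (100, 1, 'Yes'), (101, 2, 'No');
""")

explicit_join = """
    SELECT JS.JobseekerID, C.CountryName
    FROM Jobseekers JS
    INNER JOIN Countries C ON JS.CountryID = C.CountryID
    WHERE JS.ShowCV = 'Yes'
"""
implicit_join = """
    SELECT JS.JobseekerID, C.CountryName
    FROM Jobseekers JS, Countries C
    WHERE JS.CountryID = C.CountryID
      AND JS.ShowCV = 'Yes'
"""

def plan(sql):
    """Return the human-readable steps of the plan chosen for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

explicit_plan = plan(explicit_join)
implicit_plan = plan(implicit_join)
explicit_rows = conn.execute(explicit_join).fetchall()
implicit_rows = conn.execute(implicit_join).fetchall()

print(explicit_plan == implicit_plan, explicit_rows == implicit_rows)
```

If the two plans print as identical, the two spellings cost the same on that engine, and the choice comes down to readability.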
As to which syntax is better? That's up to you, but once you move beyond inner joins to outer joins, you'll need to use the newer syntax, since there's no standard for describing outer joins using the older implicit join syntax.