What is the difference between subquery and a joined subquery? - mysql

What is the difference between these two mysql queries
select t.id,
(select count(c.id) from comment c where c.topic_id = t.id) as comments_count
from topic;
AND
select t.id,comments.count from topic
left join
(
select count(c.id) count,c.topic_id from comment c group by topic_id
) as comments on t.id = comments.topic_id
I know theres not much information. Just wanted to know when to use a subquery and joined subquery and whats the difference between them.
Thanks

This is a good question, but I would also add a third option (the more standard way of doing this):
select t.id, count(c.topic_id) as count
from topic left join
comment c
on t.id = c.topic_id
group by t.id;
The first way is often the most efficient in MySQL. MySQL can take advantage of an index on comment(topic_id) to generate the count. This may be true in other databases as well, but it is particularly noticeable in MySQL which does not use indexes for group by in practice.
The second query does the aggregation and then a join. The subquery is materialized, adding additional overhead, and then the join cannot use an index on comment. It could possibly use an index on topic, but the left join may make that option less likely. (You would need to check the execution plan in your environment.)
The third option would be equivalent to the first in many databases, but not in MySQL. It does the join to comment (taking advantage of an index on comment(topic_id), if available). However, it then incurs the overhead of a file sort for the final aggregation.
Reluctantly, I must admit that the first choice is often the best in terms of performance in MySQL, particularly if the right indexes are available. Without indexes, any of the three might be the best choice. For instance, without indexes, the second is the best if comments is empty or has very few topics.

Related

MySQL performance - cross join vs left join

I am wondering how MySQL (or its underlying engine) processes the queries.
There are two set queries below (one uses left join and the other one uses cross join), which eventually will give the same result.
My question is, how come the processing time of the two sets of queries are similar?
What I expected is that the first set query will run quicker because the computer is dealing with left join so the size of the "table" won't be expanding, while the second set of queries makes the size of the "table" (what I assume is that the computer needs to get the result of the cross-join from multiple tables before it can go ahead and do the where clause) relatively larger.
select s.*, a.score as score_01, b.score as score_02
from student s
left join (select \* from sc where cid = '01') a using (sid)
left join (select \* from sc where cid = '02') b using (sid)
where a.score > b.score;
select s.*, a.score as score_01, b.score as score_02
from student s
,(select * from sc where cid = '01') a
,(select * from sc where cid = '02') b
where a.score > b.score and a.sid = b.sid and s.sid = a.sid;
I tried both sets of queries and expected the processing time for the first set query will be shorter, but it is not the case.
Add this to sc:
INDEX(sid, cid, score)
Better yet, if you have a useless id on side replace it with
PRIMARY KEY(sid, cid)`
(Assuming that pair is Unique.)
With either of those fixes, I expect both of your queries run at similar speed, and faster than currently.
For further discussion, please provide SHOW CREATE TABLE.
Addressing some of the Comments
MySQL ignores the keywords INNER, OUTER, and CROSS. So, it up to the WHERE to figure whether it is "inner" or "outer".
MySQL throws the ON and WHERE conditions together (except when it matters for LEFT), then decides what is used for filtering (WHERE) so it may be able to do that first. Then other conditions (which belonged in ON) help it get to the 'next' table.
So... Please use ON to say how the tables are related; use WHERE for filtering. (And don't use the old comma-join.)
That is, MySQL will [usually] look at one table at a time, doing a "Nested Loop Join" (NLJ) to get to the next.
There are many possible ways to evaluate a JOIN; MySQL ponders which one might be best, then uses that.
The order of non-LEFT JOINs does not matter, nor does the order of expressions AND'd together in WHERE.
In some situations, a HAVING expression can (and is) moved to the WHERE clause.
Although FROM comes before WHERE, the two get somewhat tangled up together. But, in general, the clauses are required to be in a certain order, and that order is logically the order that things have to happen in.
It is up to the Optimizer to combine steps. For example
WHERE a = 1
ORDER BY b
and the table has INDEX(a,b) -- The index will be used to do both, essentially at the same time. Ditto for
SELECT a, MAX(b)
...
GROUP BY a
ORDER BY a
can hop through the BTree index on (a,b) and deliver the results without an extra sort pass for either the GROUP BY or ORDER BY.
SELECT x is executed after WHERE y = 'abc' -- Well, in some sense it is. But if you have INDEX(y,x), the Optimizer is smart enough to grab the x values while it is performing the WHERE.
When a WHERE references more than one table of a JOIN, the Optimizer has a quandary. Which table should it start its NLJ with? It has some statistics to help make the decision, but it does not always get it right. It will usually
filter on one of the tables
NLJ to get to the next table, meanwhile throwing in any WHERE clauses for that table in with the ON clause.
Repeat for other tables.
When there is both a WHERE and an ORDER BY, the Optimizer will usually filter filter, then sort. But sometimes (not always correctly) it will decide to use an index for the ORDER BY (thereby eliminating the sort) and filter as it reads the table. LIMIT, which is logically done last further muddies the decision.
MySQL does not have FULL OUTER JOIN. It can be simulated with two JOIN and a UNION. (It is only very rarely needed.)

Does a JOIN query that selects on the joined table benefit from an index?

This is a pretty simple question, however I'm having trouble finding a straight answer for it on the internet.
Let's say I have two tables:
article - id, some other properties
localisation - id, articleId, locale, title, content
and 1 article has many localisations, and theres an index on locale and we want to filter by locale.
My question is, does querying by article and joining on localisation with a where clause, like this:
SELECT * FROM article AS a JOIN localisation AS l ON a.id = l.articleId WHERE l.locale = 5;
does benefit the same from the locale index as querying by the reverse:
SELECT * FROM localisation AS l JOIN article AS a ON l.articleId = a.id WHERE l.locale = 5;
Or do I need to do the latter to make proper use of my index? Assuming the cardinality is correct of course.
By default, the order you specify tables in your query isn't necessarily the order they will be joined.
Inner join is commutative. That is, A JOIN B and B JOIN A produce the same result.
MySQL's optimizer knows this fact, and it can reorder the tables if it estimates it would be a less expensive query if it joined the tables in the opposite order to that which you listed them in your query. You can specify an optimizer hint to prevent it from reordering tables, but by default this behavior is enabled.
Using EXPLAIN will tell you which table order the optimizer prefers use for a given query. There may be edge cases where the optimizer chooses something you didn't expect. Some of the optimizer's estimate depends on the frequency of data values in your table, so you should test in your own environment.
P.S.: I expect this query would probably benefit from a compound index on the pair of columns: (locale, articleId).

What is a "point-in-select" in MySQL?

I was given this query to update a report, and it was taking a long time to run on my computer.
select
c.category_type, t.categoryid, t.date, t.clicks
from transactions t
join category c
on c.category_id = t.categoryid
I asked the DBA if there were any issues with the query, and the DBA optimized the query in this manner:
select
(select category_type
from category c where c.category_id = t.categoryid) category_type,
categoryid,
date, clicks
from transactions t
He described the first subquery as a "point-in-select". I have never heard of this before. Can someone explain this concept?
I want to note that the two queries are not the same, unless the following is true:
transactions.categoryid is always present in category.
category has no duplicate values of category_id.
In practice, these would be true (in most databases). The first query should be using a left join version for closer equivalence:
select c.category_type, t.categoryid, t.date, t.clicks
from transactions t left join
category c
on c.category_id = t.categoryid;
Still not exactly the same, but more similar.
Finally, both versions should make use of an index on category(category_id), and I would expect the performance to be very similar in MySQL.
Your DBA's query is not the same, as others noted, and afaik nonstandard SQL. Yours is much preferable just for its simplicity alone.
It's usually not advantageous to re-write queries for performance. It can help sometimes, but the DBMS is supposed to execute logically equivalent queries equivalently. Failure to do so is a flaw in the query planner.
Performance issues are often a function of physical design. In your case, I would look for indexes on the category and transactions tables that contain categoryid as first column. If neither exist, your join is O(mn) because the category table must be scanned for each transaction row.
Not being a MySQL user, I can only advise you to get query planner output and look for indexing opportunities.

Complex query optimization improve speed

I have the following query that i would like to optimize:
SELECT
*, #rownum := #rownum + 1 AS rank
FROM (
SELECT
SUM(a.id = 1) as KILLS,
SUM(a.id = 2) as DEATHS,
SUM(a.id = 3) as WINS,
tb1.totalPlaytime,
p.playerName
FROM
(
SELECT
player_id,
SUM(pg.timeEnded - pg.timeStarted) as totalPlaytime
FROM playergame pg
INNER JOIN player p
ON pg.player_id = p.id
WHERE pg.game_id IN(1, 2, 3)
GROUP BY
p.id
ORDER BY
p.playerName ASC
) tb1
INNER JOIN playeraction pa
ON pa.player_id = tb1.player_id
INNER JOIN action a
ON pa.action_id = a.id
INNER JOIN player p
ON pa.player_id = p.id
GROUP BY
p.id
ORDER BY
KILLS DESC) tb2
WHERE tb2.playerName LIKE "%"
Somehow i am having the feeling that this is not suited for mysql. I keep a lot of actions in different tables for a good statistical approach but this slows down everything. (perhaps big data?)
This is my model
Now i tried doing the following:
Combining joins in a view
I Combined the many JOINS into a VIEW. This gave me no improvements.
Index the tables
I indexed the frequently used keys, this did speed up but i can't manage to get the entire resultset below 0.613s.
Start from the action table and use left joins
This gave me a somewhat different approach but yet the joins keep being slow (the first example is still the fastest)
indexes:
Any hints, tips, additions, improvements are welcome
I removed my previous answer as it was wrong and did not help, and here I am just summarizing our conversation in the comments with additional comments from myself
There are several ways to speed up the query.
Make sure you are not making any redundant queries.
Do as few joins as possible.
Make indexes on multiple columns if possible.
Make indexes clustered if needed/possible http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
Regarding the query you wrote in the question:
Remove ORDER BY in the inner query
Remove INNER JOIN in the inner query and replace GROUP BY p.id by GROUP BY player_id
Few words on where indexes make sense and where not.
In your case it would not make sense to have index on gameid on table playergame because that probably would return loads of rows. So that is all what you can do about the most inner query.
The joins can also be a bit optimized if you know what you expect from the tables, i.e., the amount of data they may face. you may think of it as a question are you building database behind a MMO game of FPS. MMO will have millions of users per game, FPS will have only a few. Also different types of games may have different actions. That would imply that you may try to optimize the query by making the index more precise. If you are able to define in the inner join of action that gameid IN (...) then creating an index on tuple (gameid, id) might help.
Wildcart in WHERE clause. You may try to create an index on playername but it will only work if you look with a wildcard at the end of your search string, for one in the beginning you would need a separate index, and hope that query optimizer will be smart enough to switch between them each time you make a query.
Keep in mind that more indexes imply slowed insert and delete, so keep only as few as possible.
Another thing would be redesigning the structure a bit. You may still keep the database normalized, but maybe it would be usefull to have a table with summary of some games. You may have a table with summary of games that happened before yesterday, and your query would only summarize the data for today, and then join both tables if needed. Then you could optimize it by either creating and index on timestamp or partitioning table by day. Everything depends on the load you expect.
The topic is rather deep, so everything depends on what is the story behind the data.

Where is better to put 'on' conditions in multiple joins? (mysql)

I have multiple joins including left joins in mysql. There are two ways to do that.
I can put "ON" conditions right after each join:
select * from A join B ON(A.bid=B.ID) join C ON(B.cid=C.ID) join D ON(c.did=D.ID)
I can put them all in one "ON" clause:
select * from A join B join C join D ON(A.bid=B.ID AND B.cid=C.ID AND c.did=D.ID)
Which way is better?
Is it different if I need Left join or Right join in my query?
For simple uses MySQL will almost inevitably execute them in the same manner, so it is a manner of preference and readability (which is a great subject of debate).
However with more complex queries, particularly aggregate queries with OUTER JOINs that have the potential to become disk and io bound - there may be performance and unseen implications in not using a WHERE clause with OUTER JOIN queries.
The difference between a query that runs for 8 minutes, or .8 seconds may ultimately depend on the WHERE clause, particularly as it relates to indexes (How MySQL uses Indexes): The WHERE clause is a core part of providing the query optimizer the information it needs to do it's job and tell the engine how to execute the query in the most efficient way.
From How MySQL Optimizes Queries using WHERE:
"This section discusses optimizations that can be made for processing
WHERE clauses...The best join combination for joining the tables is
found by trying all possibilities. If all columns in ORDER BY and
GROUP BY clauses come from the same table, that table is preferred
first when joining."
For each table in a join, a simpler WHERE is constructed to get a fast
WHERE evaluation for the table and also to skip rows as soon as
possible
Some examples:
Full table scans (type = ALL) with NO Using where in EXTRA
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id AND cr.role = cr2.role AND cr.util = 1000
[Err] Out of memory
Uses where to optimize results, with index (Using where,Using index):
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
WHERE cr.role = cr2.role
AND cr.util = 1000
515661 rows in set (0.124s)
****Combination of ON/WHERE - Same result - Same plan in EXPLAIN*******
[SQL] SELECT cr.id,cr2.role FROM CReportsAL cr
LEFT JOIN CReportsCA cr2
ON cr.id = cr2.id
AND cr.role = cr2.role
WHERE cr.util = 1000
515661 rows in set (0.121s)
MySQL is typically smart enough to figure out simple queries like the above and will execute them similarly but in certain cases it will not.
Outer Join Query Performance:
As both LEFT JOIN and RIGHT JOIN are OUTER JOINS (Great in depth review here) the issue of the Cartesian product arises, the avoidance of Table Scans must be avoided, so that as many rows as possible not needed for the query are eliminated as fast as possible.
WHERE, Indexes and the query optimizer used together may completely eliminate the problems posed by cartesian products when used carefully with aggregate functions like AVERAGE, GROUP BY, SUM, DISTINCT etc. orders of magnitude of decrease in run time is achieved with proper indexing by the user and utilization of the WHERE clause.
Finally
Again, for the majority of queries, the query optimizer will execute these in the same manner - making it a manner of preference but when query optimization becomes important, WHERE is a very important tool. I have seen some performance increase in certain cases with INNER JOIN by specifying an indexed col as an additional ON..AND ON clause but I could not tell you why.
Put the ON clause with the JOIN it applies to.
The reasons are:
readability: others can easily see how the tables are joined
performance: if you leave the conditions later in the query, you'll get way more joins happening than need to - it's like putting the conditions in the where clause
convention: by following normal style, your code will be more portable and less likely to encounter problems that may occur with unusual syntax - do what works