Join vs subquery to count nested objects - mysql

Let's say my model contains 2 tables: persons and addresses. One person can have 0, 1 or more addresses. I'm trying to execute a query that lists all persons and includes the number of addresses each one has. Here are the 2 queries I have to achieve that:
SELECT
    persons.*,
    COUNT(addresses.id) AS number_of_addresses
FROM `persons`
LEFT JOIN addresses ON persons.id = addresses.person_id
GROUP BY persons.id
and
SELECT
    persons.*,
    (SELECT COUNT(*)
     FROM addresses
     WHERE addresses.person_id = persons.id) AS number_of_addresses
FROM `persons`
And I was wondering if one is better than the other in terms of performance.

The way to determine performance characteristics is to actually run the queries and see which is better.
If you have no indexes, then the first is probably better. If you have an index on addresses(person_id), then the second is probably better.
The reason is a little complicated. The basic reason is that GROUP BY (in MySQL) uses a sort, and sorts are O(n * log(n)) in complexity. So the time to do a sort grows faster than the data does (not much faster, but a bit faster). The consequence is that many small per-person aggregations are faster than one big aggregation over all the data.
That is conceptual. In practice, MySQL will use the index for the correlated subquery, so it is often faster than the overall GROUP BY, which does not make use of an index.
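For example, a minimal sketch of that index (the name is illustrative, not from the question):
CREATE INDEX idx_addresses_person_id ON addresses (person_id);
With the index in place, EXPLAIN on both queries should show the correlated subquery probing the index once per person, while the GROUP BY version, per the reasoning above, may not benefit as much.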

I think the first query is optimal, and more optimization can be provided by changing the table structure. For example, define both the person_id and address_id fields (order is important) as a composite primary key in the addresses table to make the join faster.
MySQL's (InnoDB) table storage structure is an index-organized table (clustered index), so the primary key index is much faster than a normal secondary index, especially in join operations.
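As a sketch of that structure, assuming InnoDB (the answer's address_id corresponds to addresses.id in the question's queries):
CREATE TABLE addresses (
    person_id INT NOT NULL,
    address_id INT NOT NULL,
    -- ... other address columns ...
    PRIMARY KEY (person_id, address_id)  -- clustered index: rows stored ordered by person_id first
) ENGINE=InnoDB;
With this layout, all of a person's addresses are physically adjacent, so the join reads them with a single range scan of the clustered index.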

Related

Does a JOIN query that selects on the joined table benefit from an index?

This is a pretty simple question; however, I'm having trouble finding a straight answer for it on the internet.
Let's say I have two tables:
article - id, some other properties
localisation - id, articleId, locale, title, content
One article has many localisations, there's an index on locale, and we want to filter by locale.
My question is: does querying article and joining on localisation with a WHERE clause, like this:
SELECT * FROM article AS a JOIN localisation AS l ON a.id = l.articleId WHERE l.locale = 5;
benefit from the locale index just the same as querying in the reverse order:
SELECT * FROM localisation AS l JOIN article AS a ON l.articleId = a.id WHERE l.locale = 5;
Or do I need to do the latter to make proper use of my index? Assuming the cardinality is correct of course.
By default, the order you specify tables in your query isn't necessarily the order they will be joined.
Inner join is commutative. That is, A JOIN B and B JOIN A produce the same result.
MySQL's optimizer knows this fact, and it can reorder the tables if it estimates it would be a less expensive query if it joined the tables in the opposite order to that which you listed them in your query. You can specify an optimizer hint to prevent it from reordering tables, but by default this behavior is enabled.
Using EXPLAIN will tell you which table order the optimizer prefers to use for a given query. There may be edge cases where the optimizer chooses something you didn't expect. Some of the optimizer's estimates depend on the frequency of data values in your table, so you should test in your own environment.
P.S.: I expect this query would probably benefit from a compound index on the pair of columns: (locale, articleId).
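A sketch of that index and a way to check it (the index name is illustrative):
CREATE INDEX idx_locale_article ON localisation (locale, articleId);
EXPLAIN SELECT * FROM article AS a JOIN localisation AS l ON a.id = l.articleId WHERE l.locale = 5;
If the optimizer behaves as described above, the EXPLAIN output should show localisation as the leading table using this index, whichever way you wrote the FROM clause.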

MySQL search query optimization from two many-to-many tables

I have three tables.
tbl_post for a table of posts. (post_idx, post_created, post_title, ...)
tbl_mention for a table of mentions. (mention_idx, mention_name, mention_img, ...)
tbl_post_mention for a unique many-to-many relation between the two tables. (post_idx, mention_idx)
For example,
PostA can have MentionA and MentionB.
PostB can have MentionA and MentionC.
PostC cannot have MentionC and MentionC (the relation is unique, so the same mention cannot appear twice on one post).
tbl_post has about a million rows, tbl_mention has fewer than a hundred rows, and tbl_post_mention has a couple of million rows. All three tables are heavily loaded with foreign keys, unique indices, etc.
I am trying to make two separate search queries.
Search for post ids with all of the given mention ids [AND condition]
Search for post ids with any of the given mention ids [OR condition]
Then join with tbl_post and tbl_mention to populate the results with meaningful data, order them, and return the top n. In the end, I hope to have a list of n posts with all the data required for my service to display on the front end.
Here are the respective simplified queries:
SELECT post_idx
FROM
    (SELECT post_idx, COUNT(*) AS c
     FROM tbl_post_mention
     WHERE mention_idx IN (1,95)
     GROUP BY post_idx) AS A
WHERE c >= 2;
The problem with this query is that it is already inefficient before the joins and ordering. This process alone takes 0.2 seconds.
SELECT DISTINCT post_idx
FROM tbl_post_mention
WHERE mention_idx in (1,95);
This is a simple index range scan, but because of the IN statement, the query becomes expensive again once you start joining it with other tables.
I tried more complex and "clever" queries and tried indexing different sets of columns, to no avail. Are there special syntaxes that I could use in this case? Maybe a clever trick? Partitioning? Or am I missing some fundamental concept here... :(
Send help.
The query you want is this:
SELECT post_idx
FROM tbl_post_mention
WHERE mention_idx in (1,95)
GROUP BY post_idx
HAVING COUNT(*) >= 2
The HAVING clause does your post-GROUP BY filtering.
The index that will help you is this:
CREATE INDEX mentionsdex ON tbl_post_mention (mention_idx, post_idx);
It covers your query by allowing rapid lookup by mention_idx then grouping by post_idx.
Often so-called join tables with two columns -- like your tbl_post_mention -- work most efficiently when they have a pair of indexes with the columns in opposite orders.
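For example (the first statement is the index created above; the second, with an illustrative name, covers lookups in the opposite direction, such as listing the mentions of a given post):
CREATE INDEX mentionsdex ON tbl_post_mention (mention_idx, post_idx);
CREATE INDEX postsdex ON tbl_post_mention (post_idx, mention_idx);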

Most efficient way to join "most recent row"

I know this question has been asked 100 times, and this isn't a "how do I do it", but an efficiency question - a topic I don't know much about.
From my internet reading, I have settled on one way of solving the most-recent-row problem that sounds like it's pretty efficient: LEFT JOIN a "max" table (grouped by the matching conditions) and then LEFT JOIN the row that matches the grouped conditions. Something like this:
SELECT employee.*, evaluation.*
FROM employee
LEFT JOIN (SELECT MAX(report_date) report_date, employee_id
           FROM evaluation
           GROUP BY employee_id) most_recent_eval
    ON most_recent_eval.employee_id = employee.id
LEFT JOIN evaluation
    ON evaluation.employee_id = employee.id
    AND evaluation.report_date = most_recent_eval.report_date
Are there problems with this that I don't know about? Is this doing 2 table scans (one to find the max, and one to find the row)? Does it have to do 2 full scans for every employee?
The reason I'm asking is that I am now looking at joining on 3 tables where I need the most recent row (evaluations, security clearance, and project) and it seems like any inefficiencies are going to be massively multiplied.
Can anyone give me some advice on this?
You should be in pretty good shape with the query pattern you propose.
One possible suggestion that will help if your evaluation table has its own autoincrementing id column: you may be able to find the latest evaluation for each employee with this subquery:
SELECT employee_id, MAX(id) id
FROM evaluation
GROUP BY employee_id
Then your join can look like this:
FROM employee
LEFT JOIN (
    SELECT employee_id, MAX(id) id
    FROM evaluation
    GROUP BY employee_id
) most_recent_eval ON most_recent_eval.employee_id = employee.id
LEFT JOIN evaluation ON most_recent_eval.id = evaluation.id
This will work if your id values and your report_date values in your evaluation table are in the same order. Only you know if that's the case in your application. But if it is, this is a very helpful optimization.
Other than that, you may need to add some compound indexes to some tables to speed up your queries. Get them working correctly first. Read http://use-the-index-luke.com/. Remember that lots of single-column indexes are generally harmful to MySQL query performance unless they're chosen to accelerate particular queries.
If you create a compound index on (employee_id, report_date), this subquery
select max(report_date) report_date, employee_id
from evaluation
group by employee_id
can be satisfied with an astonishingly efficient loose index scan. Similarly, if you're using InnoDB, the query
SELECT MAX(id) id
FROM evaluation
GROUP BY employee_id
can be satisfied by a loose index scan on a single-column index on employee_id. (If you're using MyISAM, you need an explicit compound index on (employee_id, id); it is InnoDB that implicitly appends the primary key column to every secondary index.)
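A sketch of the compound index and a way to verify the loose index scan (the index name is illustrative):
CREATE INDEX eval_employee_date ON evaluation (employee_id, report_date);
EXPLAIN SELECT employee_id, MAX(report_date) report_date
FROM evaluation
GROUP BY employee_id;
When the loose index scan is chosen, the Extra column of the EXPLAIN output reads "Using index for group-by".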

Is JOIN less efficient than two SQL queries?

I have two tables
Table A (primary key is id)
id, firstname, lastname, zip, state
Table B
some_field, business name, zip, id
I need to get the first name and last name associated with the id using the id from Table B (note this is the same id as in Table A)
I did a JOIN on Table A and Table B so that I could get the first name and last name.
A friend of mine said I should not use JOIN this way, and that I should have just done two separate queries. Does that make any sense?
Does JOIN do anything that makes the process slower than two separate queries? How could two separate queries ever be faster than one query?
Q: Does this make sense?
A: No, without some valid reasons, it doesn't make sense.
Q: Does JOIN do anything that makes the process slower than two separate queries?
A: Yes, there are some things that can make a join slower, so we can't rule out that possibility. We can't make a blanket statement that "two separate queries will be faster" or that a "join will be slower".
An equijoin of two tables that are properly indexed is likely to be more efficient. But performance is best gauged by actually executing the statements, at expected production volumes of data, and observing and measuring performance.
Some of the things that could potentially make a join slower: a complicated join predicate (columns wrapped in functions, inequality comparisons, compound predicates combined with OR), more tables involved so that the optimizer has more join paths and operations to consider when coming up with an execution plan, or a join that produces a huge intermediate result that is later collapsed with a GROUP BY. (In short, it is possible to write a horrendously inefficient statement that uses a join operation, but it is usually not the join operation that is the culprit. This list is just a sampling, not exhaustive.)
A JOIN is the normative pattern for the use case you describe. It's not clear why your friend recommended that you avoid a JOIN operation, or what reason your friend gave.
If your main query is primarily against (the unfortunately named) Table_B, and you are wanting to do a lookup of first_name and last_name from Table_A, the JOIN is suited to that.
If you are only returning one row (or a few rows) from Table_B, then an extra roundtrip for another query to get first_name and last_name won't be a problem. But if you are returning thousands of rows from Table_B, then executing thousands of separate, singleton queries against Table_A is going to kill performance and scalability.
If your friend is concerned that a value in the foreign key column in Table_B won't match a value in the id column of Table_A, or there is a NULL value in the foreign key column, your friend would be right to point out that an inner join would prevent the row from Table_B from being returned.
In that case, we'd use an outer join, so we can return the row from Table_B even when a matching row from Table_A is not found.
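A minimal sketch of that outer join, using the answer's Table_A/Table_B naming and the column names from the question (treating Table_B's id as the foreign key to Table_A):
SELECT b.*, a.firstname, a.lastname
FROM Table_B b
LEFT JOIN Table_A a ON a.id = b.id;
Rows from Table_B with no matching Table_A row still come back, with NULL for firstname and lastname.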
Your friend might also be concerned about performance of the JOIN operation, possibly because your friend has been burned by not having suitable indexes defined.
Assuming that a suitable index exists on Table_A (with a leading column of id), and that id is UNIQUE in Table_A... then the performance of a single query with a simple equijoin between a single-column foreign key and a single-column primary key will likely be more efficient than running a boatload of separate statements.
Or perhaps your friend is concerned about issues with an immature ORM framework, one that doesn't efficiently handle the results returned from a join query.
If the database is being implemented in a way that the two tables could be on separate database servers, then using a JOIN would fly in the face of that design. And if that was the design intent, a separation of the tables, then the application should also be using a separate connection for each of the two tables.
Unless your friend can provide some specific reason for avoiding a JOIN operation, my recommendation is that you ignore his advice.
(There has to be a good reason to avoid a JOIN operation. I suspect that maybe your friend doesn't understand how relational databases work.)
In your case it doesn't make any big difference, because you just have an id as a foreign key, which has an index anyway. Since it's indexed, it will be efficient, and a join on it is the best approach.
It becomes more complicated based on what you want, what the fields are, what you want to accomplish, etc.
So, yes: no big difference in your case.

When to use straight_join?

In what order does MySQL join the tables, how is that order chosen, and when does STRAIGHT_JOIN come in handy?
MySQL is only capable of doing nested loops (possibly using indexes), so if both join tables are indexed, the time for the join is calculated as A * log(B) if A is leading and B * log(A) if B is leading.
It is easy to see that the table with fewer records satisfying the WHERE condition should be made leading.
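As a rough illustration (taking base-2 logarithms): if A has 1,000,000 matching rows and B has 1,000, leading with B costs about 1,000 * log(1,000,000) ≈ 20,000 index probes, while leading with A costs about 1,000,000 * log(1,000) ≈ 10,000,000.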
There are some other factors that affect join performance, such as WHERE conditions, ORDER BY and LIMIT clauses, etc. MySQL tries to predict the cost of each join order and, if its statistics are up to date, does it quite well.
STRAIGHT_JOIN is useful when the statistics are not accurate (say, naturally skewed) or in case of bugs in the optimizer.
For instance, the following spatial join:
SELECT *
FROM a
JOIN b
ON MBRContains(a.area, b.area)
is subject to a join swap (the smaller table is made leading); however, MBRContains is not converted to MBRWithin, so the resulting plan does not make use of the index.
In this case you should explicitly set the join order using STRAIGHT_JOIN.
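For instance, a minimal sketch, assuming the listed order is the one whose plan can use the spatial index:
SELECT *
FROM a
STRAIGHT_JOIN b
ON MBRContains(a.area, b.area)
STRAIGHT_JOIN forces a to be read first and b to be joined to it, so the optimizer cannot swap the tables back.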
Others have described how the optimizer picks the table that meets the criteria with the smaller result set, but that may not always work out. I was working with a government contract/grants database. The main table had some 14+ million records, but it also had over 20 lookup tables (states, congressional districts, type of business classification, owner ethnicity, etc.).
With these smaller tables present, the optimizer joined from one of the small lookups back to the master table and then joined all the others. It CHOKED the database, and the query was cancelled after 30+ hours. Since my primary table was listed FIRST and all the subsequent tables were lookups joined AFTER it, just adding STRAIGHT_JOIN at the top FORCED the order I had listed, and the complex query was running again in about 2 hours (expected, given all it had to do).
I've found that getting whatever your primary basis is to the top, with all the subsequent extras joined later, definitely helps.
The order of the tables is chosen by the optimizer. STRAIGHT_JOIN comes in handy when the optimizer gets it wrong, which is not that often. I used it only once, in a big join where the optimizer put one particular table first in the join order (I saw it in the EXPLAIN SELECT output), so I reordered the FROM clause and used STRAIGHT_JOIN so that table was joined later. It helped a lot to speed up the query.