MySQL polymorphic join condition with OR not using index - mysql

I have tables departments, employees, and emails in MySQL 5.6.17 (for a Rails app). Each department has many employees, and both departments and employees have many emails. I want to sort departments by the number of emails to the entire department and individual employees within the department. My attempt:
SELECT departments.*, COUNT(DISTINCT employees.id) AS employees_count, COUNT(DISTINCT emails.id) AS emails_count
FROM departments
LEFT OUTER JOIN employees
ON employees.department_id = departments.id AND employees.is_employed = true
LEFT OUTER JOIN emails
ON (emails.emailable_id = departments.id AND emails.emailable_type = 'department')
OR (emails.emailable_id = employees.id AND emails.emailable_type = 'employee')
GROUP BY departments.id
ORDER BY emails_count DESC
LIMIT 20;
Unfortunately, this query takes over 3 minutes to complete. Since this query will be used in a web interface, that's not a workable timeframe. An EXPLAIN gives:
+----+-------------+-------------+-------+-------------------------------------------------+----------------------------------+---------+-------------------------------+-------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+-------------------------------------------------+----------------------------------+---------+-------------------------------+-------+------------------------------------------------+
| 1 | SIMPLE | departments | index | PRIMARY | PRIMARY | 4 | NULL | 37468 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | employees | ref | index_employees_on_department_id | index_employees_on_department_id | 5 | development_db.departments.id | 5 | Using where |
| 1 | SIMPLE | emails | ALL | index_emails_on_emailable_id_and_emailable_type | NULL | NULL | NULL | 10278 | Range checked for each record (index map: 0x2) |
+----+-------------+-------------+-------+-------------------------------------------------+----------------------------------+---------+-------------------------------+-------+------------------------------------------------+
The index on emails is, then, not being used. This index is used successfully when I join emails only to departments or only to employees, but not with both at once.
Why is this? What can I do about this? Is there a more efficient way to query the desired data?

It might help to do the aggregation first before the joins:
SELECT d.*, e.employees_count, em.emails_count
FROM departments d LEFT OUTER JOIN
(SELECT e.department_id, count(*) as employees_count
FROM employees e
WHERE e.is_employed = true
GROUP BY e.department_id
) e
ON e.department_id = d.id LEFT OUTER JOIN
(SELECT department_id, count(distinct id) as emails_count
FROM (SELECT em.emailable_id as department_id, em.id
FROM emails em
WHERE em.emailable_type = 'department'
UNION ALL
SELECT e.department_id, em.id
FROM emails em JOIN
employees e
ON em.emailable_id = e.id AND em.emailable_type = 'employee'
) ee
GROUP BY department_id
) em
ON em.department_id = d.id
ORDER BY emails_count DESC
LIMIT 20;
You also want an index on emails(emailable_id, emailable_type, id) and on emails(emailable_type, emailable_id, id).
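A minimal sketch of the "split the OR into a UNION ALL, aggregate, then join" idea, assuming a tiny made-up data set. SQLite (through Python's sqlite3 module) stands in for MySQL here, so this only checks that the counts come out right; it says nothing about MySQL's query plan. The index name, the sample rows, and the extra is_employed filter in the second branch are assumptions added for illustration.

```python
import sqlite3

# Hypothetical miniature schema; SQLite is only a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employees (id INTEGER PRIMARY KEY, department_id INTEGER, is_employed INTEGER);
CREATE TABLE emails (id INTEGER PRIMARY KEY, emailable_id INTEGER, emailable_type TEXT);
CREATE INDEX idx_emails_type_id ON emails (emailable_type, emailable_id, id);

INSERT INTO departments VALUES (1, 'Sales'), (2, 'Ops'), (3, 'Empty');
INSERT INTO employees VALUES (10, 1, 1), (11, 1, 1), (20, 2, 1);
INSERT INTO emails VALUES
  (100, 1, 'department'),
  (101, 10, 'employee'),
  (102, 11, 'employee'),
  (103, 20, 'employee');
""")

# Each branch of the UNION ALL has a single, index-friendly join condition,
# replacing the OR in the original ON clause.
rows = conn.execute("""
SELECT d.id, COALESCE(ec.emails_count, 0) AS emails_count
FROM departments d
LEFT JOIN (
    SELECT department_id, COUNT(*) AS emails_count
    FROM (
        SELECT em.emailable_id AS department_id, em.id
        FROM emails em
        WHERE em.emailable_type = 'department'
        UNION ALL
        SELECT e.department_id, em.id
        FROM emails em
        JOIN employees e
          ON em.emailable_id = e.id AND em.emailable_type = 'employee'
        WHERE e.is_employed = 1
    ) AS ee
    GROUP BY department_id
) AS ec ON ec.department_id = d.id
ORDER BY emails_count DESC, d.id
LIMIT 20
""").fetchall()

print(rows)
```

Because an email row belongs to exactly one branch, COUNT(*) is enough inside the derived table and no DISTINCT is needed in this sketch.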

Related

Multiple left join optimization

Tables:
employee
employee_orgn
composite primary key (employee_id, orgn_id)
two indexes: one on employee_id, one on orgn_id
orgn
Some employees have no organization.
SQL:
explain SELECT DISTINCT
e.*
FROM
employee e
LEFT JOIN
employee_orgn eo ON eo.employee_id = e.id
LEFT JOIN
orgn o ON o.id = eo.orgn_id
WHERE
e.state != 'deleted'
AND e.state != 'hidden'
AND (o.state != 'hidden' OR o.state IS NULL)
ORDER BY e.id DESC
EXPLAIN output:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | e | ALL | NULL | NULL | NULL | NULL | 12792 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | eo | index | PRIMARY | idx_orgn_id | 8 | NULL | 13226 | Using index; Distinct |
| 1 | SIMPLE | o | eq_ref | PRIMARY | PRIMARY | 8 | eo.orgn_id | 1 | Using where; Distinct |
Questions:
With these left joins, does MySQL's nested-loop join examine on the order of 10^8 row combinations?
Why is a temporary table used, and why is the sort a filesort?
Why does the second row use a covering index?
I hope someone can explain this EXPLAIN output and suggest how to optimize the query.
Thanks in advance.
You shouldn't put conditions on tables that are left joined in the WHERE clause. Instead, put them in the ON clause. Then you don't need to use OR o.state IS NULL, which causes optimizer problems.
SELECT DISTINCT
e.*
FROM
employee e
LEFT JOIN
employee_orgn eo ON eo.employee_id = e.id
LEFT JOIN
orgn o ON o.id = eo.orgn_id AND o.state != 'hidden'
WHERE
e.state NOT IN ('deleted', 'hidden')
ORDER BY e.id DESC
I would recommend re-writing the query -- to get rid of the select distinct. I think this is the logic you want:
SELECT e.*
FROM employee e
WHERE e.state not in ('deleted', 'hidden')
AND NOT EXISTS (SELECT 1
FROM employee_orgn eo JOIN
orgn o
ON o.id = eo.orgn_id AND o.state = 'hidden'
WHERE eo.employee_id = e.id
)
ORDER BY e.id DESC;
For this query, you want an index on employee_orgn(employee_id, orgn_id) and orgn(id, state).
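The three formulations in this thread (filter in WHERE, filter in ON, and NOT EXISTS) do not necessarily return the same rows, so it is worth checking which one matches the intended logic. The sketch below compares them on a small made-up data set, with SQLite standing in for MySQL; the state values 'active' and 'visible' and all rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data; SQLite is only a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE orgn (id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE employee_orgn (employee_id INTEGER, orgn_id INTEGER,
                            PRIMARY KEY (employee_id, orgn_id));
INSERT INTO employee VALUES
  (1, 'active'), (2, 'active'), (3, 'active'), (4, 'deleted'), (5, 'active');
INSERT INTO orgn VALUES (1, 'visible'), (2, 'hidden');
-- employee 3 has no organization; employee 5 is in a visible and a hidden one
INSERT INTO employee_orgn VALUES (1, 1), (2, 2), (5, 1), (5, 2);
""")

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

# 1. The question's form: condition on the left-joined table in WHERE.
where_filter = ids("""
SELECT DISTINCT e.id
FROM employee e
LEFT JOIN employee_orgn eo ON eo.employee_id = e.id
LEFT JOIN orgn o ON o.id = eo.orgn_id
WHERE e.state NOT IN ('deleted', 'hidden')
  AND (o.state != 'hidden' OR o.state IS NULL)
ORDER BY e.id DESC
""")

# 2. Condition moved into the ON clause: no employee is filtered by orgn state.
on_filter = ids("""
SELECT DISTINCT e.id
FROM employee e
LEFT JOIN employee_orgn eo ON eo.employee_id = e.id
LEFT JOIN orgn o ON o.id = eo.orgn_id AND o.state != 'hidden'
WHERE e.state NOT IN ('deleted', 'hidden')
ORDER BY e.id DESC
""")

# 3. NOT EXISTS: drops any employee attached to at least one hidden organization.
not_exists = ids("""
SELECT e.id
FROM employee e
WHERE e.state NOT IN ('deleted', 'hidden')
  AND NOT EXISTS (SELECT 1
                  FROM employee_orgn eo
                  JOIN orgn o ON o.id = eo.orgn_id AND o.state = 'hidden'
                  WHERE eo.employee_id = e.id)
ORDER BY e.id DESC
""")

print(where_filter, on_filter, not_exists)
```

On this data the three lists differ for employee 2 (only a hidden organization) and employee 5 (one visible, one hidden), which is exactly where the intended rule has to be decided.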

Can I be selective on what rows I join on in MySQL

Suppose I have two tables, people and emails. emails has a person_id, an address, and an is_primary:
people:
id
emails:
person_id
address
is_primary
To get all email addresses per person, I can do a simple join:
select * from people join emails on people.id = emails.person_id
What if I only want (at most) one row from the right table for each row in the left table? And, if a particular person has multiple emails and one is marked as is_primary, is there a way to prefer which row to use when joining?
So, if I have
people: emails:
------ -----------------------------------------
| id | | id | person_id | address | is_primary |
------ -----------------------------------------
| 1 | | 1 | 1 | a#b.c | true |
| 2 | | 2 | 1 | b#b.c | false |
| 3 | | 3 | 2 | c#b.c | true |
| 4 | | 4 | 4 | d#b.c | false |
------ -----------------------------------------
is there a way to get this result:
------------------------------------------------
| people.id | emails.id | address | is_primary |
------------------------------------------------
| 1 | 1 | a#b.c | true | // chosen over b#b.c because it's primary
| 2 | 3 | c#b.c | true |
| 3 | null | null | null | // no email for person 3
| 4 | 4 | d#b.c | false | // no primary email for person 4
------------------------------------------------
You've slightly misunderstood how left/right joins work.
This join
select * from people join emails on people.id = emails.person_id
will get you every column from both tables for all records that match your ON condition.
The left join
select * from people left join emails on people.id = emails.person_id
will give you every record from people, regardless if there's a corresponding record in emails or not. When there's not, the columns from the emails table will just be NULL.
If a person has multiple emails, the result will contain multiple records for that person. Beginners often wonder why the data then appears duplicated.
If you want to restrict the data to the rows where is_primary has the value 1, you can do so in the WHERE clause when you're doing an inner join (your first query, although you omitted the inner keyword).
When you have a left/right join query, you have to put this filter in the ON clause. If you put it in the WHERE clause, you implicitly turn the left/right join into an inner join, because the WHERE clause filters out the NULL rows that I mentioned above. Or you could write the query like this, although, unlike the ON-clause filter, it still drops people whose only emails are non-primary:
select * from people left join emails on people.id = emails.person_id
where (emails.is_primary = 1 or emails.is_primary is null)
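To make the ON-versus-WHERE distinction concrete, here is a small sketch that runs the three placements of the is_primary filter against the sample rows from the question. SQLite stands in for MySQL, and is_primary is stored as 0/1; both are assumptions of this sketch.

```python
import sqlite3

# Sample rows from the question; SQLite is only a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY);
CREATE TABLE emails (id INTEGER PRIMARY KEY, person_id INTEGER,
                     address TEXT, is_primary INTEGER);
INSERT INTO people VALUES (1), (2), (3), (4);
INSERT INTO emails VALUES
  (1, 1, 'a#b.c', 1), (2, 1, 'b#b.c', 0), (3, 2, 'c#b.c', 1), (4, 4, 'd#b.c', 0);
""")

# Filter in ON: every person is kept, non-primary emails become NULLs.
filter_in_on = conn.execute("""
SELECT p.id, e.address
FROM people p
LEFT JOIN emails e ON p.id = e.person_id AND e.is_primary = 1
ORDER BY p.id
""").fetchall()

# Filter in WHERE: the NULL-extended rows fail the test, so this behaves
# like an inner join.
filter_in_where = conn.execute("""
SELECT p.id, e.address
FROM people p
LEFT JOIN emails e ON p.id = e.person_id
WHERE e.is_primary = 1
ORDER BY p.id
""").fetchall()

# Filter in WHERE with "OR ... IS NULL": person 3 (no emails) survives,
# but person 4 (only a non-primary email) is still dropped.
filter_or_null = conn.execute("""
SELECT p.id, e.address
FROM people p
LEFT JOIN emails e ON p.id = e.person_id
WHERE (e.is_primary = 1 OR e.is_primary IS NULL)
ORDER BY p.id
""").fetchall()

print(filter_in_on)
print(filter_in_where)
print(filter_or_null)
```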
EDIT after clarification:
Paul Spiegel's answer is good, hence my upvote, but I'm not sure it performs well, since it uses a dependent subquery. So I wrote this query. Which one is faster may depend on your data, so try both answers.
select
p.*,
coalesce(e1.address, e2.address) AS address
from people p
left join emails e1 on p.id = e1.person_id and e1.is_primary = 1
left join (
select person_id, address
from emails e
where id = (select min(id) from emails where emails.is_primary = 0 and emails.person_id = e.person_id)
) e2 on p.id = e2.person_id
Use a correlated subquery with LIMIT 1 in the ON clause of the LEFT JOIN:
select *
from people p
left join emails e
on e.person_id = p.id
and e.id = (
select e1.id
from emails e1
where e1.person_id = e.person_id
order by e1.is_primary desc, -- true first
e1.id -- If e1.is_primary is ambiguous
limit 1
)
order by p.id
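As a quick check that the correlated-subquery join picks at most one email per person and prefers the primary one, the sketch below runs it on the question's sample rows. SQLite stands in for MySQL and is_primary is stored as 0/1; both are assumptions, and this says nothing about performance on a real table.

```python
import sqlite3

# Sample rows from the question; SQLite is only a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY);
CREATE TABLE emails (id INTEGER PRIMARY KEY, person_id INTEGER,
                     address TEXT, is_primary INTEGER);
INSERT INTO people VALUES (1), (2), (3), (4);
INSERT INTO emails VALUES
  (1, 1, 'a#b.c', 1), (2, 1, 'b#b.c', 0), (3, 2, 'c#b.c', 1), (4, 4, 'd#b.c', 0);
""")

# The subquery returns the id of the single "best" email for the person:
# primary first, lowest id as the tie-breaker.
rows = conn.execute("""
SELECT p.id, e.id, e.address, e.is_primary
FROM people p
LEFT JOIN emails e
  ON e.person_id = p.id
 AND e.id = (SELECT e1.id
             FROM emails e1
             WHERE e1.person_id = e.person_id
             ORDER BY e1.is_primary DESC, e1.id
             LIMIT 1)
ORDER BY p.id
""").fetchall()

for row in rows:
    print(row)
```

The result has exactly one row per person, matching the table the question asks for.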

MySQL Relational Division Query Performance

I have students that are associated many-to-many with groups via a join table groups_students. Each group has a group_type, which can either be a permission_group or not (boolean on group_types table).
I also have users, which are also associated many-to-many with groups via groups_users.
I want to return all students for which a particular user is associated with ALL the student's permission groups.
I've been led to believe this requires relational division and here's where I am with it:
SELECT DISTINCT gs.student_id
FROM groups_students AS gs
INNER JOIN groups ON groups.id = gs.group_id
INNER JOIN groups_users gu ON gu.group_id = groups.id
INNER JOIN group_types ON group_types.id = groups.group_type_id
WHERE group_types.permission_group = 1
AND gu.user_id = 37
AND NOT EXISTS (
SELECT * FROM groups_students AS gs2
WHERE gs2.student_id = gs.student_id
AND NOT EXISTS (
SELECT gu2.group_id
FROM groups_users AS gu2
WHERE gu2.group_id = gs2.group_id AND gu2.user_id = gu.user_id
)
)
This works, but on my live database with over 20,000 rows in groups_students, it takes over 3 seconds.
Can I make it faster? I read about doing relational division with COUNT but I couldn't relate it to my scenario. Am I able to make cheap gains to bring this query well under half a second execution time or am I looking at a major restructure?
Edit - English language description: Students belong to classes (groups), and users have permission to view certain classes. I need to find the students for whom a particular user has permission to view all of the student's (permission) classes.
EXPLAIN for the slow query:
+----+--------------------+-------------+--------+--------------------------------------------------------------+--------------------------------------------------+---------+-----------------------------+------+--------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------+--------+--------------------------------------------------------------+--------------------------------------------------+---------+-----------------------------+------+--------------------------------+
| 1 | PRIMARY | gu | ref | index_groups_users_on_user_id,index_groups_users_on_group_id | index_groups_users_on_user_id | 5 | const | 1181 | Using where; Using temporary |
| 1 | PRIMARY | groups | eq_ref | PRIMARY | PRIMARY | 4 | my_db.gu.group_id | 1 | |
| 1 | PRIMARY | group_types | ALL | PRIMARY | NULL | NULL | NULL | 3 | Using where; Using join buffer |
| 1 | PRIMARY | gs | ref | index_groups_students_on_group_id_and_student_id | index_groups_students_on_group_id_and_student_id | 4 | my_db.groups.id | 9 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | gs2 | ref | index_groups_students_on_student_id_and_group_id | index_groups_students_on_student_id_and_group_id | 4 | my_db.gs.student_id | 8 | Using where; Using index |
| 3 | DEPENDENT SUBQUERY | gu2 | ref | index_groups_users_on_user_id,index_groups_users_on_group_id | index_groups_users_on_group_id | 5 | my_db.gs2.group_id | 99 | Using where |
+----+--------------------+-------------+--------+--------------------------------------------------------------+--------------------------------------------------+---------+-----------------------------+------+--------------------------------+
"I want to return all students for which a particular user is associated with ALL the student's permission groups."
I don't really follow your query; it seems so complicated for this purpose. Instead, I think of it as follows:
Generate all students and their permissions
Generate all permissions for user 37
(outer) Join these together on permissions
Be sure that all permissions for a particular student are in the u37 group
The resulting query is:
select student_id
from (SELECT gs.student_id, g.id as group_id
FROM groups_students gs INNER JOIN
groups g
ON g.id = gs.group_id INNER JOIN
groups_users gu
ON gu.group_id = g.id INNER JOIN
group_types gt
ON gt.id = g.group_type_id
where gt.permission_group = 1
) s left outer join
(select g.id as group_id
from groups_users gu INNER JOIN
groups g
on gu.group_id = g.id INNER JOIN
group_types gt
ON gt.id = g.group_type_id
where gu.user_id = 37 and gt.permission_group = 1
) u37
on s.group_id = u37.group_id
group by s.student_id
having count(*) = count(u37.group_id);
Note: You can do this without the subqueries. Despite their overhead, I think they make the query much more understandable.
A simpler version of Gordon's idea...
SELECT gs.student_id
FROM groups_students gs
JOIN groups g
ON g.id = gs.group_id
JOIN group_types gt
ON gt.id = g.group_type_id
LEFT
JOIN groups_users gu
ON gu.group_id = gs.group_id
AND gu.user_id = 37
WHERE gt.permission_group
GROUP
BY student_id
HAVING COUNT(student_id) = COUNT(user_id)
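The COUNT-comparison form of relational division can be checked on a handful of rows. The sketch below uses made-up data, with SQLite standing in for MySQL; the table name groups is backtick-quoted because GROUPS is a reserved word in MySQL 8.0.2 and later. The student, group, and user ids are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data; SQLite is only a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE group_types (id INTEGER PRIMARY KEY, permission_group INTEGER);
CREATE TABLE `groups` (id INTEGER PRIMARY KEY, group_type_id INTEGER);
CREATE TABLE groups_students (student_id INTEGER, group_id INTEGER);
CREATE TABLE groups_users (user_id INTEGER, group_id INTEGER);

INSERT INTO group_types VALUES (1, 1), (2, 0);
-- groups 1, 2 and 4 are permission groups; group 3 is not
INSERT INTO `groups` VALUES (1, 1), (2, 1), (3, 2), (4, 1);
INSERT INTO groups_students VALUES
  (1, 1), (1, 2),   -- student 1: permission groups 1 and 2
  (2, 1), (2, 3),   -- student 2: permission group 1 only
  (3, 2), (3, 4),   -- student 3: permission groups 2 and 4
  (4, 3);           -- student 4: no permission groups
INSERT INTO groups_users VALUES (37, 1), (37, 2), (99, 4);
""")

# COUNT(*) counts every permission group of the student; COUNT(gu.user_id)
# counts only those that user 37 is also in. Equal counts mean "all of them".
student_ids = [row[0] for row in conn.execute("""
SELECT gs.student_id
FROM groups_students gs
JOIN `groups` g ON g.id = gs.group_id
JOIN group_types gt ON gt.id = g.group_type_id
LEFT JOIN groups_users gu
  ON gu.group_id = gs.group_id AND gu.user_id = 37
WHERE gt.permission_group = 1
GROUP BY gs.student_id
HAVING COUNT(*) = COUNT(gu.user_id)
ORDER BY gs.student_id
""")]

print(student_ids)
```

Note that a student with no permission groups at all (student 4 here) is not returned by this form, because the WHERE clause removes all of that student's rows before grouping.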
I don't understand why you use subqueries. They are generally slow and should be avoided if possible. Maybe I do not understand your requirements correctly, but I would come up with something like this:
SELECT DISTINCT gs.student_id
FROM groups_students AS gs
INNER JOIN groups ON groups.id = gs.group_id
INNER JOIN groups_users gu ON gu.group_id = groups.id
INNER JOIN group_types ON group_types.id = groups.group_type_id
LEFT JOIN groups_students AS gs2 ON gs2.student_id = gs.student_id
LEFT JOIN groups_users AS gu2 ON gu2.group_id = gs2.group_id AND gu2.user_id = gu.user_id
WHERE group_types.permission_group = 1
AND gu.user_id = 37
AND gs2.student_id IS NULL
AND gu2.group_id IS NULL
You can force something to not exist by using a left join and checking that a column of the right-hand table (use the primary key) is NULL.

Making large SQL query efficient

I'm stuck on a rather complex query.
I'm looking to write a query that shows the "top five customers" as well as some key metrics (counts with conditions) about each of those customers. Each of the different metrics uses a totally different join structure.
+-----------+------------+ +-----------+------------+ +-----------+------------+
| customer | | | metricn | | | metricn_lineitem |
+-----------+------------+ +-----------+------------+ +-----------+------------+
| id | Name | | id | customer_id| |id |metricn_id |
| 1 | Customer1 | | 1 | 1 | | 1 | 1 |
| 2 | Customer2 | | 2 | 2 | | 2 | 1 |
+-----------+------------+ +-----------+------------+ +-----------+------------+
The issue is that I always want to group by this customer table.
I first tried to put all of my joins into the original query, but its performance was abysmal. I then tried using subqueries, but I couldn't get them to group by the original customer id.
Here's a sample query
SELECT
customer.name,
(SELECT COUNT(metric1_lineitem.id)
FROM metric1 INNER JOIN metric1_lineitem
ON metric1_lineitem.metric1_id = metric1.id
WHERE metric1.customer_id = customer_id
) as metric_1,
(SELECT COUNT(metric2_lineitem.id)
FROM metric2 INNER JOIN metric2_lineitem
ON metric2_lineitem.metric2_id = metric2.id
WHERE metric2.customer_id = customer_id
) as metric_2
FROM customer
GROUP BY customer.name
SORT BY COUNT(metric1.id) DESC
LIMIT 5
Any advice? Thanks!
SELECT name, metric_1, metric_2
FROM customer AS c
LEFT JOIN (SELECT customer_id, COUNT(*) AS metric_1
FROM metric1 AS m
INNER JOIN metric1_lineitem AS l ON m.id = l.metric1_id
GROUP BY customer_id) m1
ON m1.customer_id = c.id
LEFT JOIN (SELECT customer_id, COUNT(*) AS metric_2
FROM metric2 AS m
INNER JOIN metric2_lineitem AS l ON m.id = l.metric2_id
GROUP BY customer_id) m2
ON m2.customer_id = c.id
ORDER BY metric_1 DESC
LIMIT 5
You should also avoid using COUNT(columnname) when you can use COUNT(*) instead. The former has to test every value to see if it's null.
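One caveat when swapping COUNT(columnname) for COUNT(*): the two are only interchangeable when the column cannot be NULL in the rows being counted. After a LEFT JOIN they differ, as the sketch below shows on made-up rows (SQLite standing in for MySQL; table contents are assumptions).

```python
import sqlite3

# Hypothetical sample data; SQLite is only a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY);
CREATE TABLE metric1 (id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customer VALUES (1), (2);
INSERT INTO metric1 VALUES (1, 1), (2, 1);   -- customer 2 has no metric1 rows
""")

# COUNT(*) counts the NULL-extended row that the LEFT JOIN produces for
# customer 2; COUNT(m.id) skips it because m.id is NULL there.
rows = conn.execute("""
SELECT c.id, COUNT(*) AS count_star, COUNT(m.id) AS count_col
FROM customer c
LEFT JOIN metric1 m ON m.customer_id = c.id
GROUP BY c.id
ORDER BY c.id
""").fetchall()

print(rows)
```

Inside the inner-joined derived tables of the query above, no NULL-extended rows exist, so COUNT(*) is safe there.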
Although your data structure may be lousy, your query may not be so bad, with two exceptions. I don't think you need the aggregation on the outer level. Also, the "correlations" in the WHERE clause (such as metric1.customer_id = customer_id) are not doing anything, because the unqualified customer_id resolves to a column of the subquery's own tables. You need metric1.customer_id = c.id, where c is the outer customer table:
SELECT c.name,
(SELECT COUNT(metric1_lineitem.id)
FROM metric1 INNER JOIN
metric1_lineitem
ON metric1_lineitem.metric1_id = metric1.id
WHERE metric1.customer_id = c.id
) as metric_1,
(SELECT COUNT(metric2_lineitem.id)
FROM metric2 INNER JOIN
metric2_lineitem
ON metric2_lineitem.metric2_id = metric2.id
WHERE metric2.customer_id = c.id
) as metric_2
FROM customer c
ORDER BY metric_1 DESC
LIMIT 5;
How can you make this run faster? One way is to introduce indexes. I would recommend metric1(customer_id), metric2(customer_id), metric1_lineitem(metric1_id) and metric2_lineitem(metric2_id).
This may be faster than the aggregation method (proposed by Barmar) because MySQL is inefficient with aggregations. This should allow the aggregations to take place only using indexes instead of the base tables.
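Whichever form turns out faster on real data, the correlated-subquery form and the pre-aggregated join form should agree on the counts. The sketch below checks that on a few made-up rows, with SQLite standing in for MySQL; the index names and sample rows are assumptions, and the COALESCE is added so that customers with no rows show 0 in both forms.

```python
import sqlite3

# Hypothetical sample data and index names; SQLite is only a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE metric1 (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE metric1_lineitem (id INTEGER PRIMARY KEY, metric1_id INTEGER);
CREATE TABLE metric2 (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE metric2_lineitem (id INTEGER PRIMARY KEY, metric2_id INTEGER);
CREATE INDEX idx_m1_customer ON metric1 (customer_id);
CREATE INDEX idx_m2_customer ON metric2 (customer_id);
CREATE INDEX idx_m1l_metric ON metric1_lineitem (metric1_id);
CREATE INDEX idx_m2l_metric ON metric2_lineitem (metric2_id);

INSERT INTO customer VALUES (1, 'Customer1'), (2, 'Customer2');
INSERT INTO metric1 VALUES (1, 1), (2, 2);
INSERT INTO metric1_lineitem VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO metric2 VALUES (1, 1);
INSERT INTO metric2_lineitem VALUES (1, 1), (2, 1), (3, 1);
""")

# Form 1: one correlated subquery per metric.
correlated = conn.execute("""
SELECT c.name,
       (SELECT COUNT(*) FROM metric1 m
        JOIN metric1_lineitem l ON l.metric1_id = m.id
        WHERE m.customer_id = c.id) AS metric_1,
       (SELECT COUNT(*) FROM metric2 m
        JOIN metric2_lineitem l ON l.metric2_id = m.id
        WHERE m.customer_id = c.id) AS metric_2
FROM customer c
ORDER BY metric_1 DESC
LIMIT 5
""").fetchall()

# Form 2: aggregate each metric once, then join the small results.
pre_aggregated = conn.execute("""
SELECT c.name,
       COALESCE(m1.metric_1, 0) AS metric_1,
       COALESCE(m2.metric_2, 0) AS metric_2
FROM customer c
LEFT JOIN (SELECT m.customer_id, COUNT(*) AS metric_1
           FROM metric1 m
           JOIN metric1_lineitem l ON l.metric1_id = m.id
           GROUP BY m.customer_id) m1 ON m1.customer_id = c.id
LEFT JOIN (SELECT m.customer_id, COUNT(*) AS metric_2
           FROM metric2 m
           JOIN metric2_lineitem l ON l.metric2_id = m.id
           GROUP BY m.customer_id) m2 ON m2.customer_id = c.id
ORDER BY metric_1 DESC
LIMIT 5
""").fetchall()

print(correlated)
print(pre_aggregated)
```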

MySQL grouping query optimization

I have three tables: categories, articles, and article_events, with the following structure
categories: id, name (100,000 rows)
articles: id, category_id (6000 rows)
article_events: id, article_id, status_id (20,000 rows)
The highest article_events.id for each article row describes the current status of each article.
I'm returning a table of categories and how many articles are in them with a most-recent-event status_id of '1'.
What I have so far works, but is fairly slow (10 seconds) with the size of my tables. Wondering if there's a way to make this faster. All the tables have proper indexes as far as I know.
SELECT c.id,
c.name,
SUM(CASE WHEN e.status_id = 1 THEN 1 ELSE 0 END) article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN (
SELECT article_id, MAX(id) event_id
FROM article_events
GROUP BY article_id
) most_recent ON most_recent.article_id = a.id
LEFT JOIN article_events e ON most_recent.event_id = e.id
GROUP BY c.id
Basically I have to join to the events table twice, since asking for the status_id along with the MAX(id) just returns the first status_id it finds, and not the one associated with the MAX(id) row.
Any way to make this better? or do I just have to live with 10 seconds? Thanks!
Edit:
Here's my EXPLAIN for the query:
ID | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---------------------------------------------------------------------------------------------------------------------------
1 | PRIMARY | c | index | NULL | PRIMARY | 4 | NULL | 124044 | Using index; Using temporary; Using filesort
1 | PRIMARY | a | ref | category_id | category_id | 4 | c.id | 3 |
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 6351 |
1 | PRIMARY | e | eq_ref | PRIMARY | PRIMARY | 4 | most_recent.event_id | 1 |
2 | DERIVED | article_events | ALL | NULL | NULL | NULL | NULL | 19743 | Using temporary; Using filesort
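If you can replace subqueries with JOINs, the query often performs better, because derived tables can't use indexes (MySQL 5.6 and later can add an index to a materialized derived table automatically, but older versions cannot). Here's your query without subqueries: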
SELECT c.id,
c.name,
SUM(CASE WHEN ae1.status_id = 1 THEN 1 ELSE 0 END) AS article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN article_events ae1
ON ae1.article_id = a.id
LEFT JOIN article_events ae2
ON ae2.article_id = a.id
AND ae2.id > ae1.id
WHERE ae2.id IS NULL
GROUP BY c.id
You'll want to experiment with the indexes and use EXPLAIN to test, but here's my guess (I'm assuming id fields are primary keys and you are using InnoDB):
categories: `name`
articles: `category_id`
article_events: (`article_id`, `id`)
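The self-join trick here (join each event to any later event of the same article, then keep only rows where no later event exists) can be verified on a few rows. This is a sketch under assumptions: SQLite stands in for MySQL, the sample rows and index name are made up, and the status test is done in the SELECT list so that categories without a matching article still appear with a count of 0.

```python
import sqlite3

# Hypothetical sample data; SQLite is only a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE articles (id INTEGER PRIMARY KEY, category_id INTEGER);
CREATE TABLE article_events (id INTEGER PRIMARY KEY, article_id INTEGER, status_id INTEGER);
CREATE INDEX idx_events_article_id ON article_events (article_id, id);

INSERT INTO categories VALUES (1, 'news'), (2, 'sports'), (3, 'empty');
INSERT INTO articles VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO article_events VALUES
  (1, 1, 2), (2, 1, 1),   -- article 1: latest event has status 1
  (3, 2, 1), (4, 2, 2),   -- article 2: latest event has status 2
  (5, 3, 1);              -- article 3: latest event has status 1
""")

# ae1 is a candidate event; ae2 is any later event of the same article.
# "ae2.id IS NULL" keeps only the candidate that has no later event.
rows = conn.execute("""
SELECT c.id,
       SUM(CASE WHEN ae1.status_id = 1 THEN 1 ELSE 0 END) AS article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN article_events ae1 ON ae1.article_id = a.id
LEFT JOIN article_events ae2
  ON ae2.article_id = a.id AND ae2.id > ae1.id
WHERE ae2.id IS NULL
GROUP BY c.id
ORDER BY c.id
""").fetchall()

print(rows)
```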
Didn't try it, but I'm thinking this will save a bit of work for the database:
SELECT ae.article_id AS ref_article_id,
MAX(ae.id) event_id,
ae.status_id,
(select a.category_id from articles a where a.id = ref_article_id) AS cat_id,
(select c.name from categories c where c.id = cat_id) AS cat_name
FROM article_events ae
GROUP BY ae.article_id
Hope that helps
EDIT:
By the way... Keep in mind that joins have to go through each row, so you should start your selection from the small end and work your way up, if you can help it. In this case, the query has to run through 100,000 records, and join each one, then join those 100,000 again, and again, and again, even if values are null, it still has to go through those.
Hope this all helps...
I don't like that index on categories.id is used, as you're selecting the whole table.
Try running:
ANALYZE TABLE categories;
ANALYZE TABLE article_events;
and re-run the query.