Performance between the subquery and join? - mysql

Both queries generate a list of department IDs along with the number
of employees assigned to each department.
I can get the result with either a join or a subquery, but I'm keen to know
how the two compare in terms of performance: which is better, the join or the subquery?
I've added Explain Plan screenshots for both queries, but I don't understand what they mean.
Using Join
SELECT d.dept_id, d.name, COUNT(e.emp_id) AS num_employees
FROM department d INNER JOIN employee e ON e.dept_id = d.dept_id
GROUP BY d.dept_id, d.name;
Using Subquery
SELECT d.dept_id, d.name, e_cnt.how_many num_employees
FROM department d INNER JOIN
(SELECT dept_id, COUNT(*) how_many
FROM employee
GROUP BY dept_id) e_cnt
ON d.dept_id = e_cnt.dept_id;

The join is clearly better, as you can see in your execution plan. :P
The subselect uses an index to build the initial derived table (COUNT(*), dept_id), and then uses a buffer table to join that result to the outer SELECT statement.
The inner join uses the indexes on both department and employee to find the matching rows, which saves you the creation of the buffer table and the initial index seek.
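Before comparing plans, it can help to confirm that both forms return the same rows and then print the plan for each. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so that it runs anywhere. The sample rows and the index name idx_employee_dept are invented, and SQLite's EXPLAIN QUERY PLAN output differs from MySQL's EXPLAIN, so treat this as a method rather than a benchmark.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        dept_id INTEGER REFERENCES department(dept_id)
    );
    CREATE INDEX idx_employee_dept ON employee(dept_id);
    -- Invented sample data: 'Admin' has no employees.
    INSERT INTO department VALUES (1, 'Operations'), (2, 'Loans'), (3, 'Admin');
    INSERT INTO employee VALUES (1, 1), (2, 1), (3, 2), (4, 1);
""")

join_sql = """
    SELECT d.dept_id, d.name, COUNT(e.emp_id) AS num_employees
    FROM department d
    INNER JOIN employee e ON e.dept_id = d.dept_id
    GROUP BY d.dept_id, d.name
    ORDER BY d.dept_id
"""
derived_sql = """
    SELECT d.dept_id, d.name, e_cnt.how_many AS num_employees
    FROM department d
    INNER JOIN (SELECT dept_id, COUNT(*) AS how_many
                FROM employee
                GROUP BY dept_id) e_cnt
        ON d.dept_id = e_cnt.dept_id
    ORDER BY d.dept_id
"""

join_rows = conn.execute(join_sql).fetchall()
derived_rows = conn.execute(derived_sql).fetchall()
print(join_rows == derived_rows)

# Print each plan; the last column of every plan row is the readable detail.
for label, sql in (("join", join_sql), ("derived table", derived_sql)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", step[-1])
```

On a real MySQL server the equivalent step would be to prefix each query with EXPLAIN and compare the rows it reports for the two forms.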

Related

Is CTE better for optimization than sub-queries in sql/mysql?

I'm giving one example
-- sub-query
SELECT p.first_name, p.last_name,
d.department_count, s.total_sales
FROM persons as p
INNER JOIN
(
SELECT department_id,
COUNT(people) as department_count
FROM department as d
WHERE department_type = 'sales'
GROUP BY department_id
) as d ON d.department_id = p.department_id
LEFT OUTER JOIN
(
SELECT person_id,
SUM(sales) as total_sales
FROM orders
WHERE orders.department_id = d.department_id
GROUP BY person_id
) as s ON s.person_id = p.person_id
-- cte
WITH deps as
(
SELECT department_id,
COUNT(people) as department_count
FROM department as d
WHERE department_type = 'sales'
GROUP BY department_id
), sales as
(
SELECT person_id,
SUM(sales) as total_sales
FROM orders
WHERE orders.department_id = d.department_id
GROUP BY person_id
)
SELECT p.first_name, p.last_name,
d.department_count, s.total_sales
FROM persons as p
INNER JOIN deps as d
ON d.department_id = p.department_id
LEFT OUTER JOIN sales as s
ON s.person_id = p.person_id
But I'd also like an answer for the general case. It may sometimes depend on the dataset and the objective, but which one is usually better for optimization/performance when running the query? Moreover, if one of these approaches takes fewer lines than the other, will that make the execution faster?
Both examples you show will be executed by MySQL using temporary tables. That is, the result of either the subquery or the CTE will be stored in a temporary table that lives for the duration of the query and is dropped automatically when the query ends.
Temporary tables are used for other types of queries in MySQL too. You can read more about them here: https://dev.mysql.com/doc/refman/8.0/en/internal-temporary-tables.html
Temporary tables are often associated with performance overhead. It takes time for the temporary table to be created and filled with rows from the result of the subquery or CTE. This is unavoidable.
If you can run a different query to get the result you want without creating a temporary table, that's almost always better for performance. But in the examples you show, I don't think it's possible to do in a single query.
Almost every general rule about performance has exceptions, so you really need to be careful to evaluate performance on a case by case basis. Performance optimization is a complex subject.
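To see that the derived-table and CTE spellings are interchangeable as far as results go, here is a minimal sketch. It runs on SQLite through Python's sqlite3 module rather than on MySQL (an assumption made for portability), uses a cut-down, invented version of the persons/orders schema, and names the CTE sales_by_person purely for illustration. It says nothing about timing; that still has to be measured on the real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical versions of the tables in the question.
    CREATE TABLE persons (person_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        person_id INTEGER,
        sales     INTEGER
    );
    INSERT INTO persons VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 100), (2, 1, 50), (3, 2, 70);
""")

# Aggregate written as a derived table in the FROM clause.
derived_sql = """
    SELECT p.first_name, s.total_sales
    FROM persons AS p
    LEFT OUTER JOIN (SELECT person_id, SUM(sales) AS total_sales
                     FROM orders
                     GROUP BY person_id) AS s
        ON s.person_id = p.person_id
    ORDER BY p.person_id
"""

# The same aggregate written as a common table expression.
cte_sql = """
    WITH sales_by_person AS (
        SELECT person_id, SUM(sales) AS total_sales
        FROM orders
        GROUP BY person_id
    )
    SELECT p.first_name, s.total_sales
    FROM persons AS p
    LEFT OUTER JOIN sales_by_person AS s ON s.person_id = p.person_id
    ORDER BY p.person_id
"""

derived_rows = conn.execute(derived_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(derived_rows == cte_rows)
```

Since the rows are identical, the choice between the two is mostly about readability; the number of lines in the SQL text does not by itself decide how fast the query runs.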
These indexes may help:
orders: INDEX(department_id, person_id)
p: INDEX(department_id, first_name, last_name, person_id)
s: INDEX(person_id, total_sales)
d: INDEX(department_type, department_id)
Typically COUNT(*) is better than COUNT(col): COUNT(col) has to check every value for NULL, while COUNT(*) simply counts rows.
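The two forms are not only different in cost, they can also return different numbers when the column is nullable. A small sketch, run on SQLite through Python's sqlite3 module as a stand-in for MySQL, with an invented department table whose people column contains NULLs:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data: some rows have no value in the people column.
    CREATE TABLE department (department_id INTEGER, people TEXT);
    INSERT INTO department VALUES
        (1, 'Ann'), (1, NULL), (1, 'Bob'),
        (2, NULL);
""")

counts = conn.execute("""
    SELECT department_id,
           COUNT(*)      AS row_count,        -- counts every row in the group
           COUNT(people) AS non_null_people   -- skips rows where people IS NULL
    FROM department
    GROUP BY department_id
    ORDER BY department_id
""").fetchall()

for department_id, row_count, non_null_people in counts:
    print(department_id, row_count, non_null_people)
```

If the column is declared NOT NULL the two counts always agree, so COUNT(*) states the intent more directly.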

Get the each count of data by combining two tables in MySQL

I want to get a count per group by combining two tables in MySQL. Here is the scenario: I have the tables emp_tab(name, dept_id) and dept_tab(dept_id, dept_name). I want to write a query that shows the number of employees in each department along with the department name.
Code I tried:
SELECT dept_tab.dept_name, number
FROM emp_tab
INNER JOIN dept_tab ON emp_tab.dept_id=dept_tab.dept_id;
My attempt was not successful. Can you please show me how I can solve this? I am a beginner with MySQL.
Two things:
You need to use a group by and count function
Your join was joining an invalid table
SELECT dept_tab.dept_name, COUNT(*) as number
FROM emp_tab
INNER JOIN dept_tab ON emp_tab.dept_id=dept_tab.dept_id
GROUP BY dept_tab.dept_name
You can use JOIN and GROUP BY on dept_name to count the number of employees.
In your question, what is Customerstable? I assume that is dept_tab?
SELECT
d.dept_name,
COUNT(e.dept_id) AS cnt -- counts matched employees only, so empty departments show 0
FROM
dept_tab d
LEFT JOIN emp_tab e
ON e.dept_id = d.dept_id
GROUP BY d.dept_name ;
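The choice between INNER JOIN and LEFT JOIN, and what goes inside COUNT(), decides whether departments without employees appear and with which number. The sketch below runs the variants side by side on SQLite through Python's sqlite3 module (a stand-in for MySQL); the rows, including the empty Legal department, are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept_tab (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    CREATE TABLE emp_tab (name TEXT, dept_id INTEGER);
    -- Invented data: 'Legal' has no employees.
    INSERT INTO dept_tab VALUES (1, 'Sales'), (2, 'HR'), (3, 'Legal');
    INSERT INTO emp_tab VALUES ('a', 1), ('b', 1), ('c', 2);
""")

# INNER JOIN: departments without employees drop out of the result.
inner_rows = conn.execute("""
    SELECT d.dept_name, COUNT(*) AS cnt
    FROM emp_tab e
    INNER JOIN dept_tab d ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
""").fetchall()

# LEFT JOIN + COUNT(e.dept_id): empty departments are kept and counted as 0.
left_rows = conn.execute("""
    SELECT d.dept_name, COUNT(e.dept_id) AS cnt
    FROM dept_tab d
    LEFT JOIN emp_tab e ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
""").fetchall()

# LEFT JOIN + COUNT(*): the NULL-extended row is counted, so 'Legal' shows 1.
left_star_rows = conn.execute("""
    SELECT d.dept_name, COUNT(*) AS cnt
    FROM dept_tab d
    LEFT JOIN emp_tab e ON e.dept_id = d.dept_id
    GROUP BY d.dept_name
    ORDER BY d.dept_name
""").fetchall()

print(inner_rows)
print(left_rows)
print(left_star_rows)
```

With an outer join, counting a column from the optional side is what keeps empty groups at zero.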

Trying to get a row count in a subquery

I have two tables, one is departments and the other is employees. The department id is a foreign key in the employees table. The employee table has a name and a flag saying if the person is part-time. I can have zero or more employees in a department. I'm trying to figure out how to get a list of all departments that have at least one employee and whose employees are all part-time. I think this has to be some kind of subquery. Here's what I have so far:
SELECT dept.name
,dept.id
,employee.deptid
,count(employee.is_parttime)
FROM employee
,dept
WHERE dept.id = employee.deptid
AND employee.is_parttime = 1
GROUP BY employee.is_parttime
I would really appreciate any help at this point.
You must join (properly) the tables and group by department with a condition in the HAVING clause:
select d.name, d.id, count(e.id) total
from dept d inner join employee e
on d.id = e.deptid
group by d.name, d.id
having total = sum(e.is_parttime)
The inner join returns only departments with at least 1 employee.
The column is_parttime (I guess) is a flag with values 0 or 1 so by summing it the result is the number of employees that are part time in the department and this number is compared to the total number of employees of the department.
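To make the comparison in the HAVING clause concrete, the sketch below first lists the two aggregates per department and then applies the filter. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL; the sample rows are invented and assume is_parttime holds only 0 or 1, as the answer guesses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (
        id INTEGER PRIMARY KEY,
        deptid INTEGER,
        is_parttime INTEGER   -- assumed to be a 0/1 flag
    );
    -- Invented data: A is all part-time, B is mixed, C is all full-time,
    -- D has no employees at all.
    INSERT INTO dept VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO employee VALUES
        (1, 1, 1), (2, 1, 1),
        (3, 2, 1), (4, 2, 0),
        (5, 3, 0);
""")

# Per-department totals next to the number of part-time employees.
grouped = conn.execute("""
    SELECT d.name, COUNT(e.id) AS total, SUM(e.is_parttime) AS parttime
    FROM dept d
    INNER JOIN employee e ON d.id = e.deptid
    GROUP BY d.name, d.id
    ORDER BY d.id
""").fetchall()

# Keep only the groups where every employee is part-time.
all_parttime = conn.execute("""
    SELECT d.name
    FROM dept d
    INNER JOIN employee e ON d.id = e.deptid
    GROUP BY d.name, d.id
    HAVING COUNT(e.id) = SUM(e.is_parttime)
    ORDER BY d.id
""").fetchall()

print(grouped)
print(all_parttime)
```

Department D never shows up because the inner join produces no rows for it, which is exactly the "at least one employee" requirement.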
As a preliminary aside, I recommend expressing joins with the JOIN keyword, and segregating join conditions from filter conditions. Doing so would make the original query look like so:
select dept.name, dept.id, employee.deptid, count(employee.is_parttime)
from employee
join dept on dept.id = employee.deptid
where employee.is_parttime = 1
group by employee.is_parttime
It doesn't make much practical difference for inner joins, but it does make the structure of the data and the logic of the query a bit clearer. On the other hand, it does make a difference for outer joins, and there is value in consistency.
As for the actual question, yes, one can rewrite the original query using a subquery or an inline view to produce the requested result. (An "inline view" is technically what one should call an embedded query used as a table in the FROM clause, but some people lump these in with subqueries.)
Example using a subquery
select dept.name, dept.id
from dept
where dept.id in (
select deptid
from employee
group by deptid
having count(*) = sum(is_parttime)
)
Example using an inline view
select dept.name, dept.id
from dept
join (
select deptid
from employee
group by deptid
having count(*) = sum(is_parttime)
) pt_dept
on dept.id = pt_dept.deptid
In each case, the subquery / inline view does most of the work. It aggregates employees by department, then filters the groups (HAVING clause) to select only those in which the part-time employee count is the same as the total count. Naturally, departments without any employees will not be represented. If a list of department IDs would suffice for a list of departments, then that's actually all you need. To get the department names too, however, you need to combine that with data from the dept table, as demonstrated in the two example queries.
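As a quick check that the IN subquery and the inline view are two wrappers around the same aggregate, here is a sketch that runs both on SQLite through Python's sqlite3 module (used as a stand-in for MySQL). The rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (id INTEGER PRIMARY KEY, deptid INTEGER, is_parttime INTEGER);
    -- Invented data: A and C are all part-time, B is mixed, D is empty.
    INSERT INTO dept VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO employee VALUES
        (1, 1, 1), (2, 1, 1),
        (3, 2, 1), (4, 2, 0),
        (5, 3, 1);
""")

# Form 1: the aggregate is used as an IN subquery in the WHERE clause.
in_rows = conn.execute("""
    SELECT dept.name, dept.id
    FROM dept
    WHERE dept.id IN (
        SELECT deptid
        FROM employee
        GROUP BY deptid
        HAVING COUNT(*) = SUM(is_parttime)
    )
    ORDER BY dept.id
""").fetchall()

# Form 2: the same aggregate is used as an inline view and joined.
view_rows = conn.execute("""
    SELECT dept.name, dept.id
    FROM dept
    JOIN (
        SELECT deptid
        FROM employee
        GROUP BY deptid
        HAVING COUNT(*) = SUM(is_parttime)
    ) pt_dept ON dept.id = pt_dept.deptid
    ORDER BY dept.id
""").fetchall()

print(in_rows)
print(view_rows)
```

Because the inner query returns each deptid at most once, the join cannot duplicate department rows, so the two forms agree.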

Selecting count(column) from different table

I have three tables: area, vehicle and employee.
ward_no is the foreign key in vehicle and employee.
I want to select the number of vehicles and the number of employees and display them along with the other details of the area.
The query I used is:
select a.* ,count(v.vid) as vehicles,count(e.eid) as employees from area a,vehicle v,employee e where v.ward_no=a.ward_no and e.ward_no=a.ward_no group by a.name;
But the output is not what I want. I get the same value in both count columns instead of the total number of vehicles/employees in that particular area.
I'm new to MySQL.
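The default behavior of COUNT(expr) is to count every row in which expr is not NULL.
In your case, joining both vehicle and employee to area pairs every vehicle with every employee in the same ward, so each value is counted repeatedly and both counts come out the same.
Try adding DISTINCT inside the count: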
select a.* ,count(DISTINCT v.vid) as vehicles,count(DISTINCT e.eid) as employees
from area a,vehicle v,employee e
where v.ward_no=a.ward_no and e.ward_no=a.ward_no group by a.name;
Also, it's better to use explicit JOIN rather than implicit, like this:
select a.* ,count(DISTINCT v.vid) as vehicles,count(DISTINCT e.eid) as employees
from area a JOIN vehicle v ON v.ward_no=a.ward_no
JOIN employee e ON e.ward_no=a.ward_no
group by a.name;
There is a chance that you are getting the same vehicle and employee multiple times because of the joins. Use DISTINCT inside COUNT() to count unique vehicles and employees:
SELECT
a.*,
COUNT(DISTINCT v.vid) AS vehicles,
COUNT(DISTINCT e.eid) AS employees
FROM
`area` a
JOIN vehicle v
ON v.ward_no = a.ward_no
JOIN employee e
ON e.ward_no = a.ward_no
GROUP BY a.name
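The row multiplication behind the identical counts is easy to reproduce. The sketch below runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, with invented rows: the North ward has 2 vehicles and 3 employees, so the double join yields 2 x 3 = 6 rows for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE area (ward_no INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE vehicle (vid INTEGER PRIMARY KEY, ward_no INTEGER);
    CREATE TABLE employee (eid INTEGER PRIMARY KEY, ward_no INTEGER);
    -- Invented data: North has 2 vehicles and 3 employees, South has 1 of each.
    INSERT INTO area VALUES (1, 'North'), (2, 'South');
    INSERT INTO vehicle VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO employee VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

# Plain COUNT: every vehicle is paired with every employee of the ward,
# so both columns report the size of that cross product.
plain_rows = conn.execute("""
    SELECT a.name, COUNT(v.vid) AS vehicles, COUNT(e.eid) AS employees
    FROM area a
    JOIN vehicle v ON v.ward_no = a.ward_no
    JOIN employee e ON e.ward_no = a.ward_no
    GROUP BY a.name
    ORDER BY a.name
""").fetchall()

# COUNT(DISTINCT ...): each vehicle and each employee is counted once.
distinct_rows = conn.execute("""
    SELECT a.name,
           COUNT(DISTINCT v.vid) AS vehicles,
           COUNT(DISTINCT e.eid) AS employees
    FROM area a
    JOIN vehicle v ON v.ward_no = a.ward_no
    JOIN employee e ON e.ward_no = a.ward_no
    GROUP BY a.name
    ORDER BY a.name
""").fetchall()

print(plain_rows)
print(distinct_rows)
```

Note that with inner joins an area that has no vehicles or no employees disappears from the result entirely; LEFT JOINs would keep it with a count of 0.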

Optimize SQL: Customers that haven't ordered for x days

I have created this SQL in order to find customers that haven't ordered for X days.
It is returning a result set, so this post is mainly just to get a second opinion on it, and possible optimizations.
SELECT o.order_id,
o.order_status,
o.order_created,
o.user_id,
i.identity_firstname,
i.identity_email,
(SELECT COUNT(*)
FROM orders o2
WHERE o2.user_id=o.user_id
AND o2.order_status=1) AS order_count,
(SELECT o4.order_created
FROM orders o4
WHERE o4.user_id=o.user_id
AND o4.order_status=1
ORDER BY o4.order_created DESC LIMIT 1) AS last_order
FROM orders o
INNER JOIN user_identities ui ON o.user_id=ui.user_id
INNER JOIN identities i ON ui.identity_id=i.identity_id
AND i.identity_email!=''
INNER JOIN subscribers s ON i.identity_id=s.identity_id
AND s.subscriber_status=1
AND s.subsriber_type=e
AND s.subscription_id=1
WHERE DATE(o.order_created) = "2013-12-14"
AND o.order_status=1
AND o.user_id NOT IN
(SELECT o3.user_id
FROM orders o3
WHERE o3.user_id=o.user_id
AND o3.order_status=1
AND DATE(o3.order_created) > "2013-12-14")
Can you guys find any potential problems with this SQL? Dates are dynamically inserted.
The final SQL that I put in production, will basically only include o.order_id, i.identity_id and o.order_count - this order_count will need to be correct. The other selected fields and 'last_order' subquery will not be included, it's only for testing.
This should give me a list of users whose last order was on that particular day and who are newsletter subscribers. I am particularly in doubt about the correctness of the NOT IN part in the WHERE clause, and the order_count subquery.
There are several problems:
A. Using functions on indexable columns
You are searching for orders by comparing DATE(order_created) with some constant. This is a terrible idea, because a) the DATE() function is executed for every row (CPU) and b) the database can't use an index on the column (assuming one existed)
B. Using WHERE ID NOT IN (...)
Using a NOT IN (...) is almost always a bad idea, because optimizers usually have trouble with this construct, and often get the plan wrong. You can almost always express it as an outer join with a WHERE condition that filters for misses using an IS NULL condition for a joined column (and adds the side benefit of not needing DISTINCT, because there's only ever one miss returned)
C. Leaving joins that filter out large portions of rows too late
The earlier you can mask off rows by not making joins, the better. You can do this by joining less-likely-to-match tables earlier in the joined table list, and by putting non-key conditions into the join rather than the WHERE clause to get the rows excluded as early as possible. Some optimizers do this anyway, but I've often found they don't.
D. Avoid correlated subqueries like the plague!
You have several correlated subqueries - ones that are executed for every row of the main table. That's really an incredibly bad idea. Again, sometimes the optimizer can craft them into a join, but why rely (hope) on that? Most correlated subqueries can be expressed as a join; your examples are no exception.
With the above in mind, there are some specific changes:
o2 and o4 are the same join, so o4 may be dispensed with entirely - just use o2 after conversion to a join
DATE(order_created) = "2013-12-14" should be written as order_created between "2013-12-14 00:00:00" and "2013-12-14 23:59:59"
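To illustrate the point about wrapping an indexed column in a function, the sketch below compares the DATE() filter with a plain range on the column. It runs on SQLite through Python's sqlite3 module, which is only a stand-in for MySQL; the index name idx_orders_created and the rows are invented, and timestamps are stored as text. It writes the range in half-open form (>= start of day, < start of next day), a common alternative to BETWEEN that also covers fractional seconds if the column stores them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_created TEXT);
    CREATE INDEX idx_orders_created ON orders(order_created);
    -- Invented rows around the day boundary.
    INSERT INTO orders VALUES
        (1, '2013-12-13 22:10:00'),
        (2, '2013-12-14 00:00:00'),
        (3, '2013-12-14 23:59:59'),
        (4, '2013-12-15 00:00:00');
""")

# Function applied to the column: evaluated per row, index order is unusable.
func_sql = """
    SELECT order_id FROM orders
    WHERE DATE(order_created) = '2013-12-14'
    ORDER BY order_id
"""
# Bare column compared against constants: a range the index can serve.
range_sql = """
    SELECT order_id FROM orders
    WHERE order_created >= '2013-12-14' AND order_created < '2013-12-15'
    ORDER BY order_id
"""

func_rows = conn.execute(func_sql).fetchall()
range_rows = conn.execute(range_sql).fetchall()
print(func_rows == range_rows)

# Show how the engine intends to execute each filter.
for label, sql in (("DATE(column)", func_sql), ("range on column", range_sql)):
    plan = [step[-1] for step in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    print(label, plan)
```

On MySQL the same comparison can be made by running EXPLAIN on both versions and checking which one reports a range access on the order_created index.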
This query should be what you want:
SELECT
o.order_id,
o.order_status,
o.order_created,
o.user_id,
i.identity_firstname,
i.identity_email,
count(o2.user_id) AS order_count,
max(o2.order_created) AS last_order
FROM orders o
LEFT JOIN orders o2 ON o2.user_id = o.user_id AND o2.order_status=1
LEFT JOIN orders o3 ON o3.user_id = o.user_id
AND o3.order_status=1
AND o3.order_created >= "2013-12-15 00:00:00"
JOIN user_identities ui ON o.user_id=ui.user_id
JOIN identities i ON ui.identity_id=i.identity_id AND i.identity_email != ''
JOIN subscribers s ON i.identity_id=s.identity_id
AND s.subscriber_status=1
AND s.subsriber_type=e
AND s.subscription_id=1
WHERE o.order_created between "2013-12-14 00:00:00" and "2013-12-14 23:59:59"
AND o.order_status=1
AND o3.order_created IS NULL -- This gets only missed joins on o3
GROUP BY
o.order_id,
o.order_status,
o.order_created,
o.user_id,
i.identity_firstname,
i.identity_email;
The o3.order_created IS NULL condition in the WHERE clause is how you achieve the same as NOT IN (...) using a LEFT JOIN.
Disclaimer: Not tested.
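Since the query above is untested, here is a much smaller sketch of just the anti-join idea, checked against the NOT IN form. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, uses a cut-down orders table with invented rows, and leaves out the identity and subscriber joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id INTEGER,
        order_created TEXT
    );
    -- Invented data: user 1 ordered again after 2013-12-14, users 2 and 3 did not.
    INSERT INTO orders VALUES
        (1, 1, '2013-12-14 09:00:00'),
        (2, 1, '2013-12-20 10:00:00'),
        (3, 2, '2013-12-14 11:00:00'),
        (4, 3, '2013-12-10 08:00:00'),
        (5, 3, '2013-12-14 12:00:00');
""")

# NOT IN form: exclude users that appear in the set of later orders.
not_in_rows = conn.execute("""
    SELECT o.order_id, o.user_id
    FROM orders o
    WHERE o.order_created >= '2013-12-14' AND o.order_created < '2013-12-15'
      AND o.user_id NOT IN (SELECT o3.user_id
                            FROM orders o3
                            WHERE o3.order_created >= '2013-12-15')
    ORDER BY o.order_id
""").fetchall()

# Anti-join form: LEFT JOIN to later orders and keep only the misses.
anti_join_rows = conn.execute("""
    SELECT o.order_id, o.user_id
    FROM orders o
    LEFT JOIN orders o3
           ON o3.user_id = o.user_id
          AND o3.order_created >= '2013-12-15'
    WHERE o.order_created >= '2013-12-14' AND o.order_created < '2013-12-15'
      AND o3.order_id IS NULL   -- no later order was found for this user
    ORDER BY o.order_id
""").fetchall()

print(not_in_rows)
print(anti_join_rows)
```

The condition on the later date has to stay in the ON clause; moving it to WHERE would turn the LEFT JOIN back into an inner join and remove the misses the query is looking for.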
I can't really comment on the results, as you have not posted any table declarations or example data, but your query has 3 correlated subqueries, which is likely to make it perform poorly (OK, one of those is for last_order and is only for testing).
Eliminating the correlated subqueries and replacing them with joins would give something like this:
SELECT o.order_id,
o.order_status,
o.order_created,
o.user_id,
i.identity_firstname,
i.identity_email,
Sub1.order_count,
Sub2.last_order
FROM orders o
INNER JOIN user_identities ui ON o.user_id=ui.user_id
INNER JOIN identities i ON ui.identity_id=i.identity_id
AND i.identity_email!=''
INNER JOIN subscribers s ON i.identity_id=s.identity_id
AND s.subscriber_status=1
AND s.subsriber_type=e
AND s.subscription_id=1
LEFT OUTER JOIN
(
SELECT user_id, COUNT(*) AS order_count
FROM orders
WHERE order_status=1
GROUP BY user_id
) Sub1
ON o.user_id = Sub1.user_id
LEFT OUTER JOIN
(
SELECT user_id, MAX(order_created) as last_order
FROM orders
WHERE order_status=1
GROUP BY user_id
) AS Sub2
ON o.user_id = Sub2.user_id
LEFT OUTER JOIN
(
SELECT DISTINCT user_id
FROM orders
WHERE order_status=1
AND DATE(order_created) > "2013-12-14"
) Sub3
ON o.user_id = Sub3.user_id
WHERE DATE(o.order_created) = "2013-12-14"
AND o.order_status=1
AND Sub3.user_id IS NULL