Subquery in SELECT or Subquery in JOIN? - mysql

I have a MySQL query of this form:
SELECT
employee.name,
totalpayments.totalpaid
FROM
employee
JOIN (
SELECT
paychecks.employee_id,
SUM(paychecks.amount) totalpaid
FROM
paychecks
GROUP BY
paychecks.employee_id
) totalpayments on totalpayments.employee_id = employee.id
I've recently found that this returns MUCH faster in this form:
SELECT
employee.name,
(
SELECT
SUM(paychecks.amount)
FROM
paychecks
WHERE
paychecks.employee_id = employee.id
) totalpaid
FROM
employee
It surprises me that there would be a difference in speed, and that the lower query would be faster. I prefer the upper form for development, because I can run the subquery independently.
Is there a way to get the "best of both worlds": fast results AND the ability to run the subquery in isolation?

Likely, the correlated subquery is able to make effective use of an index, which is why it's fast, even though that subquery has to be executed multiple times.
For the first query, the inline view causes MySQL to create a derived table, and for large sets that is effectively a MyISAM table.
In MySQL 5.6.x and later, the optimizer may choose to add an index on the derived table, if that would allow a ref operation and the estimated cost of the ref operation is lower than the nested loops scan.
I recommend you try using EXPLAIN to see the access plan. (Based on your report of performance, I suspect you are running on MySQL version 5.5 or earlier.)
The two statements are not entirely equivalent when there are rows in employee for which there are no matching rows in paychecks: the first query omits those employees, while the second returns them with a NULL totalpaid.
An equivalent result could be obtained entirely avoiding a subquery:
SELECT e.name
, SUM(p.amount) AS total_paid
FROM employee e
JOIN paychecks p
ON p.employee_id = e.id
GROUP BY e.id
(Use an inner join to get a result equivalent to the first query, use a LEFT outer join to be equivalent to the second query. Wrap the SUM() aggregate in an IFNULL function if you want to return a zero rather than a NULL value when no matching row with a non-null value of amount is found in paychecks.)
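The difference between the inner join, the LEFT join and the IFNULL variant can be checked on toy data. This is only a sketch: it uses Python's built-in sqlite3 as a stand-in for MySQL (an assumption, chosen so the snippet runs anywhere), and the three employees and their amounts are made up. It shows result equivalence only, not performance.

```python
import sqlite3

# Hypothetical toy data: Cy has no paychecks at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE paychecks (employee_id INTEGER, amount INTEGER);
    INSERT INTO employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO paychecks VALUES (1, 100), (1, 150), (2, 200);
""")

# Inner join: same rows as the derived-table query, so Cy is dropped.
inner_rows = conn.execute("""
    SELECT e.name, SUM(p.amount) AS total_paid
    FROM employee e
    JOIN paychecks p ON p.employee_id = e.id
    GROUP BY e.id
    ORDER BY e.id
""").fetchall()

# LEFT join: same rows as the correlated subquery, Cy is kept with NULL.
left_rows = conn.execute("""
    SELECT e.name, SUM(p.amount) AS total_paid
    FROM employee e
    LEFT JOIN paychecks p ON p.employee_id = e.id
    GROUP BY e.id
    ORDER BY e.id
""").fetchall()

# The correlated subquery from the question, for comparison.
corr_rows = conn.execute("""
    SELECT e.name,
           (SELECT SUM(p.amount) FROM paychecks p
             WHERE p.employee_id = e.id) AS total_paid
    FROM employee e
    ORDER BY e.id
""").fetchall()

# IFNULL turns the NULL total into a zero.
zero_rows = conn.execute("""
    SELECT e.name, IFNULL(SUM(p.amount), 0) AS total_paid
    FROM employee e
    LEFT JOIN paychecks p ON p.employee_id = e.id
    GROUP BY e.id
    ORDER BY e.id
""").fetchall()

print(inner_rows)
print(left_rows)
print(zero_rows)
```

On MySQL itself, running EXPLAIN on each form is still the way to see which access plan the optimizer actually picks.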

Conceptually, a join is a Cartesian product: every record of table A is combined with every record of table B. The intermediate result has
number of records of table A * number of records of table B = rows in the new table
10 * 10 = 100
and out of those 100 records, only the ones that match the join condition and the filters are returned by the query. (In practice the optimizer uses indexes and join algorithms to avoid building the full product.)
In the nested query, the inner query is a simple one, and only the records it returns become the input to the outer query, which is why a nested query can be faster than a join here.

Related

Left join not returning all results

I am attempting to join the two tables below to show all the columns for the incident table and just a count of the corresponding records from the tickets table with the incident_id as the same in the incidents table.
As you can see below, none of the tickets have an incident_id assigned yet. The goal of my query is to show all of the records in the incident table with a count of the ticket_ids assigned to that incident. I thought that this would work but it's returning only one row:
SELECT inc.incident_id, inc.title, inc.date_opened, inc.date_closed, inc.status, inc.description, issue_type, COUNT(ticket_id) as example_count
FROM fin_incidents AS inc
LEFT OUTER JOIN fin_tickets ON inc.incident_id = fin_tickets.incident_id;
What query can I use to return all of the incidents and their count of tickets, even if that count is 0?
Images:
Incident Table
Tickets Table
Result of my query
Your query should not work at all -- and would fail in the more recent versions of MySQL. The reason is that it is missing a GROUP BY clause:
SELECT inc.incident_id, inc.title, inc.date_opened,
inc.date_closed, inc.status, inc.description, inc.issue_type,
COUNT(t.ticket_id) as example_count
FROM fin_incidents inc LEFT OUTER JOIN
fin_tickets t
ON inc.incident_id = t.incident_id
GROUP BY inc.incident_id, inc.title, inc.date_opened,
inc.date_closed, inc.status, inc.description, inc.issue_type
You have an aggregation query with no GROUP BY. Such a query returns exactly one row, even if the tables referred to are empty.
Your code is not a valid aggregation query. You have an aggregate function in the SELECT clause (the COUNT()), but no GROUP BY clause. When this is executed with the SQL mode ONLY_FULL_GROUP_BY disabled, MySQL gives you a single row with an overall count of the tickets that are related to an incident, and arbitrary values from one incident row. If that SQL mode were enabled, you would get an error instead.
I find that the logic you want is simpler expressed with a correlated subquery:
select i.*,
(select count(*) from fin_tickets t where t.incident_id = i.incident_id) as example_count
from fin_incidents i
This query will take advantage of an index on fin_tickets(incident_id) - if you have defined a foreign key (as you should have), that index is already there.
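To see the one-row behaviour and both fixes side by side, here is a small sketch. It runs against Python's built-in sqlite3 rather than MySQL (an assumption made so it is self-contained; SQLite also tolerates the missing GROUP BY, much like MySQL with ONLY_FULL_GROUP_BY disabled), and the incidents and tickets are invented, with fewer columns than the real tables.

```python
import sqlite3

# Hypothetical toy data: incident 1 has two tickets, 2 and 3 have none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fin_incidents (incident_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE fin_tickets (ticket_id INTEGER PRIMARY KEY, incident_id INTEGER);
    INSERT INTO fin_incidents VALUES (1, 'Outage'), (2, 'Slow login'), (3, 'Typo');
    INSERT INTO fin_tickets VALUES (10, 1), (11, 1), (12, NULL);
""")

# Aggregate without GROUP BY: the whole join collapses into a single row.
no_group = conn.execute("""
    SELECT inc.incident_id, COUNT(t.ticket_id) AS example_count
    FROM fin_incidents inc
    LEFT OUTER JOIN fin_tickets t ON inc.incident_id = t.incident_id
""").fetchall()

# With GROUP BY: one row per incident, zero counts included.
grouped = conn.execute("""
    SELECT inc.incident_id, inc.title, COUNT(t.ticket_id) AS example_count
    FROM fin_incidents inc
    LEFT OUTER JOIN fin_tickets t ON inc.incident_id = t.incident_id
    GROUP BY inc.incident_id, inc.title
    ORDER BY inc.incident_id
""").fetchall()

# Correlated subquery: same rows, no GROUP BY needed.
correlated = conn.execute("""
    SELECT i.*,
           (SELECT COUNT(*) FROM fin_tickets t
             WHERE t.incident_id = i.incident_id) AS example_count
    FROM fin_incidents i
    ORDER BY i.incident_id
""").fetchall()

print(len(no_group), grouped, correlated)
```

COUNT(t.ticket_id) skips the NULLs produced by the outer join, which is why incidents without tickets come back with 0 rather than 1.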

Huge performance difference between two similar SQL queries

I have two SQL queries that provide the same output.
My first intuition was to use this:
SELECT * FROM performance_dev.report_golden_results
where id IN (SELECT max(id) as 'id' from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id)
Now, this took something like 70 secs to complete!
Searching for another solution I tried something similar:
SELECT * FROM performance_dev.report_golden_results e
join (SELECT max(id) as 'id'
from performance_dev.report_golden_results
group by platform_id, release_id, configuration_id) s
ON s.id = e.id;
Surprisingly, this took 0.05 secs to complete!!!
How come these two are so different?
thanks!
The first thing that might cause the time lag is that MySQL uses a 'semi-join' strategy for subqueries. Semi-join handling works as follows:
If a subquery meets the preceding criteria, MySQL converts it to a
semi-join and makes a cost-based choice from these strategies:
Convert the subquery to a join, or use table pullout and run the query
as an inner join between subquery tables and outer tables. Table
pullout pulls a table out from the subquery to the outer query.
Duplicate Weedout: Run the semi-join as if it was a join and remove
duplicate records using a temporary table.
FirstMatch: When scanning the inner tables for row combinations and
there are multiple instances of a given value group, choose one rather
than returning them all. This "shortcuts" scanning and eliminates
production of unnecessary rows.
LooseScan: Scan a subquery table using an index that enables a single
value to be chosen from each subquery's value group.
Materialize the subquery into a temporary table with an index and use
the temporary table to perform a join. The index is used to remove
duplicates. The index might also be used later for lookups when
joining the temporary table with the outer tables; if not, the table
is scanned.
Writing an explicit join spares MySQL this work, which might be the reason.
I hope it helped!
MySQL does not consider the first query as subject for semi-join optimization (MySQL converts semi joins to classic joins with some kind of optimization: first match, duplicate weedout ...)
Thus a full scan will be made on the first table and the subquery will be evaluated for each row generated by the outer select, hence the bad performance.
The second one is a classic join. In this case MySQL computes the result of the derived query once and then matches only the values from this query against the rows of the first table that satisfy the condition, hence no full scan is needed on the first table (I assume here that id is an indexed column).
The question now is why MySQL does not consider the first query for semi-join optimization. The answer is in the MySQL documentation: https://dev.mysql.com/doc/refman/5.6/en/semijoins.html
In MySQL, a subquery must satisfy these criteria to be handled as a semijoin:
It must be an IN (or =ANY) subquery that appears at the top level of the WHERE or ON clause, possibly as a term in an AND expression. For example:
SELECT ...
FROM ot1, ...
WHERE (oe1, ...) IN (SELECT ie1, ... FROM it1, ... WHERE ...);
Here, ot_i and it_i represent tables in the outer and inner parts of the query, and oe_i and ie_i represent expressions that refer to columns in the outer and inner tables.
It must be a single SELECT without UNION constructs.
It must not contain a GROUP BY or HAVING clause.
It must not be implicitly grouped (it must contain no aggregate functions).
It must not have ORDER BY with LIMIT.
The STRAIGHT_JOIN modifier must not be present.
The number of outer and inner tables together must be less than the maximum number of tables permitted in a join.
Your subquery uses GROUP BY, hence semi-join optimization was not applied.
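For what it's worth, the two forms really do return the same rows, so the rewrite is safe. A minimal sketch on invented data, using Python's built-in sqlite3 as a stand-in for MySQL (an assumption; it does not reproduce the 70 s vs 0.05 s timing, which depends on MySQL's optimizer):

```python
import sqlite3

# Hypothetical toy data: three (platform, release, configuration) groups.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE report_golden_results (
        id INTEGER PRIMARY KEY,
        platform_id INTEGER,
        release_id INTEGER,
        configuration_id INTEGER
    );
    INSERT INTO report_golden_results VALUES
        (1, 1, 1, 1), (2, 1, 1, 1),
        (3, 1, 2, 1), (6, 1, 2, 1),
        (4, 2, 1, 1), (5, 2, 1, 1);
""")

# Form 1: IN with a grouped subquery (not eligible for semi-join in MySQL).
in_ids = [row[0] for row in conn.execute("""
    SELECT id FROM report_golden_results
    WHERE id IN (SELECT MAX(id)
                 FROM report_golden_results
                 GROUP BY platform_id, release_id, configuration_id)
    ORDER BY id
""")]

# Form 2: join against the same grouped query as a derived table.
join_ids = [row[0] for row in conn.execute("""
    SELECT e.id
    FROM report_golden_results e
    JOIN (SELECT MAX(id) AS id
          FROM report_golden_results
          GROUP BY platform_id, release_id, configuration_id) s
      ON s.id = e.id
    ORDER BY e.id
""")]

print(in_ids, join_ids)
```

On MySQL, running EXPLAIN on both forms should show the difference: a DEPENDENT SUBQUERY select_type for the slow form versus a DERIVED table for the fast one.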

Subqueries and the possibility to reference outside with correlated subqueries

I am refreshing my SQL.
I was reading about subqueries and the possibility to reference outside with correlated subqueries.
Example:
SELECT *
FROM ORDERS O
WHERE 'ROAD BIKE' =
(SELECT DESCRIPTION FROM PART P WHERE P.PARTNUM = O.PARTNUM)
This is equivalent with a join:
SELECT O.ORDEREDON, O.NAME,
O.PARTNUM, O.QUANTITY, O.REMARKS
FROM ORDERS O, PART P
WHERE P.PARTNUM = O.PARTNUM AND P.DESCRIPTION = 'ROAD BIKE'
My problem is that I didn't get the first form and when/why we use it. When are outside referenced queries useful?
Orders have a reference to the part number, so the Orders table has a foreign key to the part numbers.
We want all the Orders where the part number is for "Road Bike".
The first form first does a sub-query on every record, to check if O.PARTNUM is a part number for "Road Bike".
The way to think of it is that the main query goes through every record in the Orders table. For each record, it runs the sub-query, in which that record's PARTNUM field is used: the sub-query finds the record in the PART table with that PARTNUM and selects its DESCRIPTION field. The WHERE clause of the main query then checks whether "Road Bike" equals the DESCRIPTION returned from the sub-query.
I would recommend against using the first form, as it is a correlated query, and you should avoid correlated queries for performance reasons, so use the second form. A better version of the first form is:
SELECT *
FROM ORDERS O
WHERE O.PARTNUM =
(SELECT P.PARTNUM FROM PART P WHERE DESCRIPTION = 'ROAD BIKE')
This is not a correlated query. The database can do the subquery once, get the PARTNUM for the record with "ROAD BIKE" as the DESCRIPTION, and then run the main query with the condition WHERE O.PARTNUM equals the result of the sub-query.
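The three forms discussed here (correlated subquery, join, non-correlated subquery) can be compared on a toy data set. This is a sketch only: it uses Python's built-in sqlite3 instead of a real server, the rows are invented, and it assumes exactly one part is described as 'ROAD BIKE' (with several such parts, MySQL rejects the scalar `=` form with a "Subquery returns more than 1 row" error).

```python
import sqlite3

# Hypothetical toy data: part 10 is the only ROAD BIKE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PART (PARTNUM INTEGER PRIMARY KEY, DESCRIPTION TEXT);
    CREATE TABLE ORDERS (NAME TEXT, PARTNUM INTEGER, QUANTITY INTEGER);
    INSERT INTO PART VALUES (10, 'ROAD BIKE'), (20, 'TANDEM');
    INSERT INTO ORDERS VALUES
        ('TRUE WHEEL', 10, 2), ('BIKE SPEC', 20, 1), ('LE SHOPPE', 10, 5);
""")

# Correlated: the subquery refers to O.PARTNUM from the outer query.
correlated = [row[0] for row in conn.execute("""
    SELECT O.NAME
    FROM ORDERS O
    WHERE 'ROAD BIKE' =
          (SELECT DESCRIPTION FROM PART P WHERE P.PARTNUM = O.PARTNUM)
    ORDER BY O.NAME
""")]

# Join: the same condition expressed across both tables.
joined = [row[0] for row in conn.execute("""
    SELECT O.NAME
    FROM ORDERS O, PART P
    WHERE P.PARTNUM = O.PARTNUM AND P.DESCRIPTION = 'ROAD BIKE'
    ORDER BY O.NAME
""")]

# Non-correlated: the subquery stands alone and can be evaluated once.
non_correlated = [row[0] for row in conn.execute("""
    SELECT O.NAME
    FROM ORDERS O
    WHERE O.PARTNUM =
          (SELECT P.PARTNUM FROM PART P WHERE DESCRIPTION = 'ROAD BIKE')
    ORDER BY O.NAME
""")]

print(correlated, joined, non_correlated)
```

The non-correlated subquery can also be run on its own, since nothing in it refers to ORDERS.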
In short, you should avoid correlated subqueries like the plague.
Correlated subqueries execute the inner query once for every row in the outer table. This results in terrible performance (a 1 million row outer table will result in the inner query executing 1 million times!)
A join on the other hand is quite efficient and databases are very good at optimising them.
If possible, always express your query as a join in preference to a correlated subquery.
A scenario where a subquery might be appropriate is something like this:
select some fields
from some tables
where some conditions are met
and somefield = (select min(something) from etc)
However, I don't know if that's a correlated subquery. Semantics aren't my strong point.

MySQL: Using GROUP BY vs subselect in columns list

In my application I have two MySQL tables, 'units' and 'impressions', in a one-to-many relationship. I need to fetch a list of all ad units from the units table and also fetch the impressions count for each ad unit.
I have two SELECT queries to do this task (simplified for this example), first using sub-select:
SELECT
(SELECT COUNT(*) FROM impressions WHERE impression_unit_id = unit_id) AS impressions_count,
unit_id
FROM units;
and second using GROUP BY:
SELECT
COUNT(impression_id) AS impressions_count,
unit_id
FROM units
LEFT JOIN impressions ON impression_unit_id = unit_id
GROUP BY unit_id;
Sub-select query runs for each record (ad unit) so GROUP BY looks smarter but it has one JOIN more. Which one to prefer for performance?
The GROUP BY query will perform better. The query optimizer might optimize the first query to use a join, but I wouldn't count on it since it is written to use a dependent sub-query, which will be much slower. As long as the tables are properly indexed, JOINs should not be a major concern for performance.
The first query, if it doesn't get optimized to use a JOIN, will have to run the sub-query for each row in the units table, whereas the JOIN query does it all in one operation.
To find out how the query gets optimized, run an EXPLAIN of both queries. If the first one uses a dependent sub-query, it will be slower.

MySQL query optimization and/or tweaks

I have the following query, in which both tables are huge. The query is very slow and I need your ideas on how to optimize it. Or do you have any other solution?
SELECT c.EstablishmentID,
(SELECT COUNT(ID)
FROM cleanpoi
WHERE EstablishmentID=c.EstablishmentID OR EstablishmentID
IN (SELECT ChildEstablishmentID
FROM crawlerchildren
WHERE ParentEstablishmentID=c.EstablishmentID)
) POI
FROM crawler c
GROUP BY c.EstablishmentID
BTW, I have the appropriate indexes applied.
UPDATE:
Okay, I have attached the explain result.
Try it by using JOIN (this assumes cleanpoi.ID is unique, so that COUNT(DISTINCT ...) counts each POI once):
SELECT c.EstablishmentID, COUNT(DISTINCT p.ID) AS POI
FROM crawler c
LEFT JOIN crawlerchildren cc
ON cc.ParentEstablishmentID = c.EstablishmentID
LEFT JOIN cleanpoi p
ON p.EstablishmentID = c.EstablishmentID
OR p.EstablishmentID = cc.ChildEstablishmentID
GROUP BY c.EstablishmentID
IN() and NOT IN() subqueries are poorly optimized:
MySQL executes the subquery as a dependent subquery for each row in the outer query. This is a frequent cause of serious performance problems in MySQL 5.5 and older versions. The query probably should be rewritten as a JOIN or a LEFT OUTER JOIN, respectively.
Non-deterministic GROUP BY:
The SQL retrieves columns that are neither in an aggregate function nor the GROUP BY expression, so these values will be non-deterministic in the result.
GROUP BY or ORDER BY on different tables:
This will force the use of a temporary table and filesort, which can be a huge performance problem and can consume large amounts of memory and temporary space on disk.
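Before swapping the correlated subquery for joins, it is worth checking that a join-based rewrite counts the same POIs, children included. The sketch below does that on invented rows with Python's built-in sqlite3 (a stand-in for MySQL, so it says nothing about speed). It assumes cleanpoi.ID is unique and crawler.EstablishmentID has no duplicates in the toy data, which is why the outer GROUP BY is left off the correlated form.

```python
import sqlite3

# Hypothetical toy data: establishment 1 has children 11 and 12,
# establishment 2 has child 21, establishment 3 has nothing.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crawler (EstablishmentID INTEGER);
    CREATE TABLE crawlerchildren (ParentEstablishmentID INTEGER,
                                  ChildEstablishmentID INTEGER);
    CREATE TABLE cleanpoi (ID INTEGER PRIMARY KEY, EstablishmentID INTEGER);
    INSERT INTO crawler VALUES (1), (2), (3);
    INSERT INTO crawlerchildren VALUES (1, 11), (1, 12), (2, 21);
    INSERT INTO cleanpoi VALUES
        (1, 1), (2, 1), (3, 11), (4, 12), (5, 12), (6, 21), (7, 99);
""")

# Correlated form, as in the question.
correlated = conn.execute("""
    SELECT c.EstablishmentID,
           (SELECT COUNT(p.ID)
              FROM cleanpoi p
             WHERE p.EstablishmentID = c.EstablishmentID
                OR p.EstablishmentID IN
                   (SELECT ChildEstablishmentID
                      FROM crawlerchildren
                     WHERE ParentEstablishmentID = c.EstablishmentID)) AS POI
    FROM crawler c
    ORDER BY c.EstablishmentID
""").fetchall()

# Join form: DISTINCT undoes the row multiplication caused by the
# join to crawlerchildren.
joined = conn.execute("""
    SELECT c.EstablishmentID, COUNT(DISTINCT p.ID) AS POI
    FROM crawler c
    LEFT JOIN crawlerchildren cc
           ON cc.ParentEstablishmentID = c.EstablishmentID
    LEFT JOIN cleanpoi p
           ON p.EstablishmentID = c.EstablishmentID
           OR p.EstablishmentID = cc.ChildEstablishmentID
    GROUP BY c.EstablishmentID
    ORDER BY c.EstablishmentID
""").fetchall()

print(correlated, joined)
```

Whether the join form is actually faster on the real tables is something only EXPLAIN and a timing run on MySQL can tell; the OR in the join condition may still prevent good index use.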