What is faster, subselect or distinct (MySQL)?

What is faster, subselect or distinct (MySQL)? - mysql

Let me describe my doubt. I have system where I have three entities, Doctor, Patient and Appointment. An appointment has the doctor's id and patient Id.
I need now to retrieve all the patients which have an appointment with a concrete doctor, and I'm not sure what will be faster, a distinct or a subselect for the id's, these are the queries:
using distinct->
SELECT DISTINCT patient.id, patient.name, patient.surname FROM
appointment INNER JOIN patient ON patient.id = appointment.patientid WHERE
appointment.doctorid = #id;
using subselect->
SELECT patient.id, patient.name, patient.surname FROM patient
WHERE patient.id IN (select appointment.patientid FROM appointment
WHERE appointment.doctorid = #id);
Not sure it this will affect, the system will run on a MariaDB cluster.

As with any performance question, you should test on your data and your hardware. The suspect problem in the first version the DISTINCT after the JOIN; this can require a lot of extra processing.
You can write the second as:
SELECT p.id, p.name, p.surname
FROM patient p
WHERE p.id IN (select a.patientid FROM appointment a WHERE a.doctorid = #id);
For this, you want an index on appointment(doctorid, patientid).
You might consider this version as well:
select p.id, p.name, p.surname
from patient p join
(select distinct appointment.patientid
from appointment
where appointment.doctorid = #id
) a
on p.id = a.patientid;
This specifically wants the same index. This pushes the distinct so it is only operating on a single table, meaning that MySQL may be able to use the index for that operation.
And this one:
SELECT p.id, p.name, p.surname
FROM patient p
WHERE EXISTS (select 1
from appointment a
where a.doctorid = #id and a.patientid = p.id
);
This query wants an index on appointment(patientid, doctorid). It requires a full table scan of patient with a fast index lookup on each row. That could often be the fastest approach, depending on the data.
Note: which query performs better may also depends on the size and distribution of the data.

Neither.
These suffer from "inflate-deflate". That is, the JOIN leads to more rows in a temp table, only to prune back to what you need. This is costly. (And it can give wrong answers for COUNT and SUM.)
SELECT DISTINCT ... JOIN ...
and
SELECT ... JOIN ... GROUP BY ...
This performs poorly because of optimizer limitations:
... IN ( SELECT ... )
This is what you want:
SELECT ...
FROM ( SELECT id FROM ... WHERE ... )
JOIN ...
It is especially good if the subquery needs DISTINCT, GROUP BY, and/or LIMIT. This is because it will create a small set of rows before doing the JOIN, thereby decreasing the number of JOINs needed.

I think The appointment should have an id to join through ... so here is a code ... I hope it helps
SELECT patient.id, patient.name, patient.surname FROM patient
INNER JOIN appointment ON appointment.id = patient.patientid
INNER JOIN doctor ON doctor.id = appointment.id
WHERE appointment.doctorid = #id

Related

Is CTE better for optimization than sub-queries in sql/mysql?

I'm giving one example
-- sub-query
SELECT p.first_name, p.last_name,
d.department_count, s.total_sales
FROM persons as p
INNER JOIN
(
SELECT department_id,
COUNT(people) as department_count
FROM department as d
WHERE department_type = 'sales'
GROUP BY department_id
) as d ON d.department_id = p.department_id
LEFT OUTER JOIN
(
SELECT person_id,
SUM(sales) as total_sales
FROM orders
WHERE orders.department_id = d.department_id
GROUP BY person_id
) as s ON s.person_id = p.person_id
-- cte
WITH deps as
(
SELECT department_id,
COUNT(people) as department_count
FROM department as d
WHERE department_type = 'sales'
GROUP BY department_id
), sales as
(
SELECT person_id,
SUM(sales) as total_sales
FROM orders
WHERE orders.department_id = d.department_id
GROUP BY person_id
)
SELECT p.first_name, p.last_name,
d.department_count, s.total_sales
FROM persons as p
INNER JOIN deps as d
ON d.department_id = p.department_id
LEFT OUTER JOIN sales as s
ON s.person_id = p.person_id
but I'm also wanting the answer in overall case. In some cases it may depend on the dataset and objective? But usually, which one is better for optimization/performance when running the query? Moreover, if there's few less lines in any of these procedure compared to the other, will that make the execution faster?

Both examples you show will be executed by MySQL using temporary tables. That is, the result of both the subquery or the CTE will be stored in a temporary table that lives for the duration of the query, then automatically dropped when the query ends.
Temporary tables are used for other types of queries in MySQL. You can read more about them here: https://dev.mysql.com/doc/refman/8.0/en/internal-temporary-tables.html
Temporary tables are often associated with performance overhead. It takes time for the temporary table to be created and filled with rows from the result of the subquery or CTE. This is unavoidable.
If you can run a different query to get the result you want without creating a temporary table, that's almost always better for performance. But in the examples you show, I don't think it's possible to do in a single query.
Almost every general rule about performance has exceptions, so you really need to be careful to evaluate performance on a case by case basis. Performance optimization is a complex subject.

These indexes may help:
orders: INDEX(department_id, person_id)
p: INDEX(department_id, first_name, last_name, person_id)
s: INDEX(person_id, total_sales)
d: INDEX(department_type, department_id)
Typically COUNT(*) is better than COUNT(col)

MySQL: Optimizing Sub-queries

I have this query I need to optimize further since it requires too much cpu time and I can't seem to find any other way to write it more efficiently. Is there another way to write this without altering the tables?
SELECT category, b.fruit_name, u.name
, r.count_vote, r.text_c
FROM Fruits b, Customers u
, Categories c
, (SELECT * FROM
(SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r
WHERE b.fruit_id = r.fruit_id
AND u.customer_id = r.customer_id
AND category = "Fruits";

This is your query re-written with explicit joins:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN
(
SELECT * FROM
(
SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r on r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
CROSS JOIN Categories c
WHERE c.category = 'Fruits';
(I am guessing here that the category column belongs to the categories table.)
There are some parts that look suspicious:
Why do you cross join the Categories table, when you don't even display a column of the table?
What is ORDER BY fruit_id, count_vote DESC, r_id supposed to do? Sub query results are considered unordered sets, so an ORDER BY is superfluous and can be ignored by the DBMS. What do you want to achieve here?
SELECT * FROM [ revues ] GROUP BY fruit_id is invalid. If you group by fruit_id, what count_vote and what r.text_c do you expect to get for the ID? You don't tell the DBMS (which would be something like MAX(count_vote) and MIN(r.text_c)for instance. MySQL should through an error, but silently replacescount_vote, r.text_cbyANY_VALUE(count_vote), ANY_VALUE(r.text_c)` instead. This means you get arbitrarily picked values for a fruit.
The answer hence to your question is: Don't try to speed it up, but fix it instead. (Maybe you want to place a new request showing the query and explaining what it is supposed to do, so people can help you with that.)

Your Categories table seems not joined/related to the others this produce a catesia product between all the rows
If you want distinct resut don't use group by but distint so you can avoid an unnecessary subquery
and you dont' need an order by on a subquery
SELECT category
, b.fruit_name
, u.name
, r.count_vote
, r.text_c
FROM Fruits b
INNER JOIN Customers u ON u.customer_id = r.customer_id
INNER JOIN Categories c ON ?????? /Your Categories table seems not joined/related to the others /
INNER JOIN (
SELECT distinct fruit_id, count_vote, text_c, customer_id
FROM Reviews
) r ON b.fruit_id = r.fruit_id
WHERE category = "Fruits";
for better reading you should use explicit join syntax and avoid old join syntax based on comma separated tables name and where condition

The next time you want help optimizing a query, please include the table/index structure, an indication of the cardinality of the indexes and the EXPLAIN plan for the query.
There appears to be absolutely no reason for a single sub-query here, let alone 2. Using sub-queries mostly prevents the DBMS optimizer from doing its job. So your biggest win will come from eliminating these sub-queries.
The CROSS JOIN creates a deliberate cartesian join - its also unclear if any attributes from this table are actually required for the result, if it is there to produce multiples of the same row in the output, or just an error.
The attribute category in the last line of your query is not attributed to any of the tables (but I suspect it comes from the categories table).
Further, your code uses a GROUP BY clause with no aggregation function. This will produce non-deterministic results and is a bug. Assuming that you are not exploiting a side-effect of that, the query can be re-written as:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN Reviews r
ON r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
ORDER BY r.fruit_id, count_vote DESC, r_id;
Since there are no predicates other than joins in your query, there is no scope for further optimization beyond ensuring there are indexes on the join predicates.
As all too frequently, the biggest benefit may come from simply asking the question of why you need to retrieve every single row in the tables in a single query.

Best way to structure SQL queries with many inner joins?

I have an SQL query that needs to perform multiple inner joins, as follows:
SELECT DISTINCT adv.Email, adv.Credit, c.credit_id AS creditId, c.creditName AS creditName, a.Ad_id AS adId, a.adName
FROM placementlist pl
INNER JOIN
(SELECT Ad_id, List_id FROM placements) AS p
ON pl.List_id = p.List_id
INNER JOIN
(SELECT Ad_id, Name AS adName, credit_id FROM ad) AS a
ON ...
(few more inner joins)
My question is the following: How can I optimize this query? I was under the impression that, even though the way I currently query the database creates small temporary tables (inner SELECT statements), it would still be advantageous to performing an inner join on the unaltered tables as they could have about 10,000 - 100,000 entries (not millions). However, I was told that this is not the best way to go about it but did not have the opportunity to ask what the recommended approach would be.
What would be the best approach here?

To use derived tables such as
INNER JOIN (SELECT Ad_id, List_id FROM placements) AS p
is not recommendable. Let the dbms find out by itself what values it needs from
INNER JOIN placements AS p
instead of telling it (again) by kinda forcing it to create a view on the table with the two values only. (And using FROM tablename is even much more readable.)
With SQL you mainly say what you want to see, not how this is going to be achieved. (Well, of course this is just a rule of thumb.) So if no other columns except Ad_id and List_id are used from table placements, the dbms will find its best way to handle this. Don't try to make it use your way.
The same is true of the IN clause, by the way, where you often see WHERE col IN (SELECT DISTINCT colx FROM ...) instead of simply WHERE col IN (SELECT colx FROM ...). This does exactly the same, but with DISTINCT you tell the dbms "make your subquery's rows distinct before looking for col". But why would you want to force it to do so? Why not have it use just the method the dbms finds most appropriate?
Back to derived tables: Use them when they really do something, especially aggregations, or when they make your query more readable.
Moreover,
SELECT DISTINCT adv.Email, adv.Credit, ...
doesn't look to good either. Yes, sometimes you need SELECT DISTINCT, but usually you wouldn't. Most often it is just a sign that you haven't thought your query through.
An example: you want to select clients that bought product X. In SQL you would say: where a purchase of X EXISTS for the client. Or: where the client is IN the set of the X purchasers.
select * from clients c where exists
(select * from purchases p where p.clientid = c.clientid and product = 'X');
Or
select * from clients where clientid in
(select clientid from purchases where product = 'X');
You don't say: Give me all combinations of clients and X purchases and then boil that down so I just get each client once.
select distinct c.*
from clients c
join purchases p on p.clientid = c.clientid and product = 'X';
Yes, it is very easy to just join all tables needed and then just list the columns to select and then just put DISTINCT in front. But it makes the query kind of blurry, because you don't write the query as you would word the task. And it can make things difficult when it comes to aggregations. The following query is wrong, because you multiply money earned with the number of money-spent records and vice versa.
select
sum(money_spent.value),
sum(money_earned.value)
from user
join money_spent on money_spent.userid = user.userid
join money_earned on money_earned.userid = user.userid;
And the following may look correct, but is still incorrect (it only works when the values happen to be unique):
select
sum(distinct money_spent.value),
sum(distinct money_earned.value)
from user
join money_spent on money_spent.userid = user.userid
join money_earned on money_earned.userid = user.userid;
Again: You would not say: "I want to combine each purchase with each earning and then ...". You would say: "I want the sum of money spent and the sum of money earned per user". So you are not dealing with single purchases or earnings, but with their sums. As in
select
sum(select value from money_spent where money_spent.userid = user.userid),
sum(select value from money_earned where money_earned.userid = user.userid)
from user;
Or:
select
spent.total,
earned.total
from user
join (select userid, sum(value) as total from money_spent group by userid) spent
on spent.userid = user.userid
join (select userid, sum(value) as total from money_earned group by userid) earned
on earned.userid = user.userid;
So you see, this is where derived tables come into play.

MySQL alternative to using a subquery

So let's say I have the following tables Person and Wage. It's a 1-N relation, where a person can have more then one wage.
**Person**
id
name
**Wage**
id
person_id
amount
effective_date
Now, I want to query a list of all persons and their latest wages. I can get the results by doing the following query:
SELECT
p.*,
( SELECT w.amount
FROM wages a w
WHERE w.person_id = p.id
ORDER BY w.effective_date
LIMIT 1
) as wage_amount,
( SELECT w.effective_date
FROM wages a w
WHERE w.person_id = p.id
ORDER BY w.effective_date
LIMIT 1
) as effective_date
FROM person as p
The problem is, my query will have multiple sub-queries from different tables. I want to make it as efficient as possible. Is there an alternative to using sub-queries that would be faster and give me the same results?

Proper indexing would probably make your version work efficiently (that is, an index on wages(person_id, effective_date)).
The following produces the same results with a single subquery:
SELECT p.*, w.amount, w.effective_date
from person p left outer join
(select person_id, max(effective_date) as maxdate
from wages
group by personid
) maxw
on maxw.person_id = p.id left outer join
wages w
on w.person_id = p.id and w.effective_date = maxw.maxdate;
And this version might make better us of indexes than the above version:
SELECT p.*, w.amount, w.effective_date
from person p left outer join
wages w
on w.person_id = p.id
where not exists (select * from wages w2 where w2.effective_date > w.effective_date);
Note that these version will return multiple rows for a single person, when there are two "wages" with the same maximum effective date.

Subqueries can be a good solution like Sam S mentioned in his answer but it really depends on the subquery, the dbms you are using, and your indexes. See this question and answers for a good discussion on the performance of subqueries vs. joins: Join vs. sub-query
If performance is an issue for you, you must consider using the EXPLAIN command of your dbms. It will show you how the query is being built and where the bottlenecks are. Based on its results, you might consider rewriting your query some other way.
For instance, it was usually the case that a join would yield better performance, so you could rewrite your query according to this answer: https://stackoverflow.com/a/2111420/362298 and compare their performance.
Note that creating the right indexes will also make a big difference.
Hope it helps.

Subqueries are very efficient as long as you make sure you use indexes. Try running EXPLAIN on your query and see if it uses correct indexes

SELECT p.name, w.amount, MAX(w.effective_date) FROM Person p LEFT JOIN Wage
w ON w.person_id = p.id GROUP BY p.name
I didn't test this query.

MySQL is not using INDEX in subquery

I have these tables and queries as defined in sqlfiddle.
First my problem was to group people showing LEFT JOINed visits rows with the newest year. That I solved using subquery.
Now my problem is that that subquery is not using INDEX defined on visits table. That is causing my query to run nearly indefinitely on tables with approx 15000 rows each.
Here's the query. The goal is to list every person once with his newest (by year) record in visits table.
Unfortunately on large tables it gets real sloooow because it's not using INDEX in subquery.
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id
Does anyone know how to force MySQL to use INDEX already defined on visits table?

Your query:
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id;
First, is using non-standard SQL syntax (items appear in the SELECT list that are not part of the GROUP BY clause, are not aggregate functions and do not sepend on the grouping items). This can give indeterminate (semi-random) results.
Second, ( to avoid the indeterminate results) you have added an ORDER BY inside a subquery which (non-standard or not) is not documented anywhere in MySQL documentation that it should work as expected. So, it may be working now but it may not work in the not so distant future, when you upgrade to MySQL version X (where the optimizer will be clever enough to understand that ORDER BY inside a derived table is redundant and can be eliminated).
Try using this query:
SELECT
p.*, v.*
FROM
people AS p
LEFT JOIN
( SELECT
id_people
, MAX(year) AS year
FROM
visits
GROUP BY
id_people
) AS vm
JOIN
visits AS v
ON v.id_people = vm.id_people
AND v.year = vm.year
ON v.id_people = p.id;
The: SQL-fiddle
A compound index on (id_people, year) would help efficiency.
A different approach. It works fine if you limit the persons to a sensible limit (say 30) first and then join to the visits table:
SELECT
p.*, v.*
FROM
( SELECT *
FROM people
ORDER BY name
LIMIT 30
) AS p
LEFT JOIN
visits AS v
ON v.id_people = p.id
AND v.year =
( SELECT
year
FROM
visits
WHERE
id_people = p.id
ORDER BY
year DESC
LIMIT 1
)
ORDER BY name ;

Why do you have a subquery when all you need is a table name for joining?
It is also not obvious to me why your query has a GROUP BY clause in it. GROUP BY is ordinarily used with aggregate functions like MAX or COUNT, but you don't have those.
How about this? It may solve your problem.
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
If you need to show the person, the most recent visit, and the note from the most recent visit, you're going to have to explicitly join the visits table again to the summary query (virtual table) like so.
SELECT a.id, a.name, a.year, v.note
FROM (
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
)a
JOIN visits v ON (a.id = v.id_people and a.year = v.year)
Go fiddle: http://www.sqlfiddle.com/#!2/d67fc/20/0
If you need to show something for people that have never had a visit, you should try switching the JOIN items in my statement with LEFT JOIN.
As someone else wrote, an ORDER BY clause in a subquery is not standard, and generates unpredictable results. In your case it baffled the optimizer.
Edit: GROUP BY is a big hammer. Don't use it unless you need it. And, don't use it unless you use an aggregate function in the query.
Notice that if you have more than one row in visits for a person and the most recent year, this query will generate multiple rows for that person, one for each visit in that year. If you want just one row per person, and you DON'T need the note for the visit, then the first query will do the trick. If you have more than one visit for a person in a year, and you only need the latest one, you have to identify which row IS the latest one. Usually it will be the one with the highest ID number, but only you know that for sure. I added another person to your fiddle with that situation. http://www.sqlfiddle.com/#!2/4f644/2/0
This is complicated. But: if your visits.id numbers are automatically assigned and they are always in time order, you can simply report the highest visit id, and be guaranteed that you'll have the latest year. This will be a very efficient query.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT id_people, max(id) id
FROM visits
GROUP BY id_people
)m
JOIN people p ON (p.id = m.id_people)
JOIN visits v ON (m.id = v.id)
http://www.sqlfiddle.com/#!2/4f644/1/0 But this is not the way your example is set up. So you need another way to disambiguate your latest visit, so you just get one row per person. The only trick we have at our disposal is to use the largest id number.
So, we need to get a list of the visit.id numbers that are the latest ones, by this definition, from your tables. This query does that, with a MAX(year)...GROUP BY(id_people) nested inside a MAX(id)...GROUP BY(id_people) query.
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON (p.id_people = v.id_people AND p.year = v.year)
GROUP BY v.id_people
The overall query (http://www.sqlfiddle.com/#!2/c2da2/1/0) is this.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON ( p.id_people = v.id_people
AND p.year = v.year)
GROUP BY v.id_people
)m
JOIN people p ON (m.id_people = p.id)
JOIN visits v ON (m.id = v.id)
Disambiguation in SQL is a tricky business to learn, because it takes some time to wrap your head around the idea that there's no inherent order to rows in a DBMS.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008