I have millions of customers, and when I use a LEFT JOIN and then sort by a column, the query takes 4-5 seconds. Here is my query:
SELECT c.id AS id, o.description AS office_description, ... , d.type AS document_type, d.number AS document_number
FROM customers c INNER JOIN offices o ON (c.id_office = o.id)
INNER JOIN company cp ON (o.id_company = cp.id)
LEFT JOIN documents d ON (c.id = d.id_customer)
WHERE c.archive = 0
ORDER BY office_description
LIMIT 10
When I remove the documents columns from my SELECT, the query is very fast.
Here is the EXPLAIN output for the query:
I have 1 million customers; the other tables (company / office / documents) have only 1 row each.
I set indexes on c.archive and o.description, plus primary keys and foreign keys of course. Here are the structures of these tables: http://sqlfiddle.com/#!9/a222f9
So I tried to build my query like this:
SELECT A.*, d.*
FROM (
SELECT c.id AS id, o.description AS office_description, ...
FROM customers c INNER JOIN offices o ON (c.id_office = o.id)
INNER JOIN company cp ON (o.id_company = cp.id)
WHERE c.archive = 0
ORDER BY o.description
LIMIT 10
) A LEFT JOIN documents d ON (A.id = d.id_customer)
And now, wow, it's very fast.
But I don't know if it's the best way to reduce the lag or if I'm doing something wrong. I'd like to know if you know a better way to do that.
I hope there is an easier way, because it will be complicated to use this query in my Phalcon project.
An explanation...
Your faster query can find the 10 rows before looking in documents. So, it needs only 10 probes into that table.
In the original query, the Optimizer was not too smart. It planned to execute the query as if there were no LIMIT. It decided to optimize the join to documents by fetching the entire table into the "join buffer" in RAM and building a hash index on it. While this would help some queries like yours, it was a big waste for the mere 10 rows that you needed.
So, your reformulation convinced the Optimizer to do it a better way.
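To make the "find the 10 rows first, then probe documents" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The tiny schema, the data, and the extra `c.id` tie-breaker in the ORDER BY are assumptions added for illustration (without a tie-breaker, the 10 rows picked are not deterministic when descriptions repeat). It only shows that both formulations return the same rows when each customer has at most one document; it says nothing about MySQL's timings.

```python
import sqlite3

# Hypothetical miniature of the schema: one office, 1000 customers,
# at most one document per customer (all names and data are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offices   (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE customers (id INTEGER PRIMARY KEY, id_office INTEGER, archive INTEGER);
CREATE TABLE documents (id INTEGER PRIMARY KEY, id_customer INTEGER, type TEXT);
INSERT INTO offices VALUES (1, 'Head office');
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, 1, ?)",
    [(i, 1 if i % 7 == 0 else 0) for i in range(1, 1001)],  # every 7th is archived
)
conn.executemany(
    "INSERT INTO documents (id_customer, type) VALUES (?, 'passport')",
    [(i,) for i in range(1, 1001, 3)],  # customers 1, 4, 7, ... have one document
)

# Original shape: join everything, then sort and limit.
join_then_limit = """
SELECT c.id, o.description, d.type
FROM customers c
JOIN offices o ON c.id_office = o.id
LEFT JOIN documents d ON c.id = d.id_customer
WHERE c.archive = 0
ORDER BY o.description, c.id
LIMIT 10
"""

# Rewritten shape: pick the 10 customers first, then look up their documents.
limit_then_join = """
SELECT a.id, a.description, d.type
FROM (
    SELECT c.id, o.description
    FROM customers c
    JOIN offices o ON c.id_office = o.id
    WHERE c.archive = 0
    ORDER BY o.description, c.id
    LIMIT 10
) a
LEFT JOIN documents d ON a.id = d.id_customer
ORDER BY a.description, a.id
"""

slow_rows = conn.execute(join_then_limit).fetchall()
fast_rows = conn.execute(limit_then_join).fetchall()
print(slow_rows == fast_rows)
```

Note that the two shapes are only interchangeable while a customer has at most one document: with several documents per customer, the original limits joined rows, whereas the rewrite limits customers.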
If you had needed only one column from d, there is another way:
SELECT ...,
( SELECT col FROM d WHERE ... ) AS col,
... ((without the LEFT JOIN at all))
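A small sketch of that single-column variant, again with Python's sqlite3 standing in for MySQL. The table contents and the `LIMIT 1` inside the subquery are assumptions for illustration; the `LIMIT 1` guards against a customer with several documents, since a scalar subquery must yield at most one value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE documents (id INTEGER PRIMARY KEY, id_customer INTEGER, number TEXT);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO documents VALUES (10, 1, 'A-100'), (11, 3, 'C-300');
""")

# No LEFT JOIN: the document column comes from a correlated scalar subquery,
# which is evaluated only for the rows that are actually returned.
scalar_rows = conn.execute("""
SELECT c.id,
       c.name,
       (SELECT d.number
        FROM documents d
        WHERE d.id_customer = c.id
        LIMIT 1) AS document_number
FROM customers c
ORDER BY c.name
LIMIT 2
""").fetchall()
print(scalar_rows)
```

A customer without a document simply gets NULL (None in Python) in that column, which matches what the LEFT JOIN would have produced.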
As for an "easier" way, especially one that can be reverse-engineered into some 3rd-party package, I doubt it. (Packages tend to be crutches for getting started in databases. As you are finding out, you eventually need to learn more than they can teach you.)
A separate inefficiency:
WHERE c.archive = 0
ORDER BY o.description
LIMIT ...
If the archived rows had been removed from c, then the optimal execution would be to find the first 10 rows of o. Instead it must do a lengthy JOIN before sorting and limiting. (This is a common problem with "soft deletes". Neither MySQL nor the 3rd party package can optimize it.)
Related
I have this query I need to optimize further since it requires too much cpu time and I can't seem to find any other way to write it more efficiently. Is there another way to write this without altering the tables?
SELECT category, b.fruit_name, u.name
, r.count_vote, r.text_c
FROM Fruits b, Customers u
, Categories c
, (SELECT * FROM
(SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r
WHERE b.fruit_id = r.fruit_id
AND u.customer_id = r.customer_id
AND category = "Fruits";
This is your query re-written with explicit joins:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN
(
SELECT * FROM
(
SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r on r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
CROSS JOIN Categories c
WHERE c.category = 'Fruits';
(I am guessing here that the category column belongs to the categories table.)
There are some parts that look suspicious:
Why do you cross join the Categories table, when you don't even display a column of the table?
What is ORDER BY fruit_id, count_vote DESC, r_id supposed to do? Sub query results are considered unordered sets, so an ORDER BY is superfluous and can be ignored by the DBMS. What do you want to achieve here?
SELECT * FROM [ reviews ] GROUP BY fruit_id is invalid. If you group by fruit_id, what count_vote and what r.text_c do you expect to get for the ID? You don't tell the DBMS (which would be something like MAX(count_vote) and MIN(r.text_c), for instance). MySQL should throw an error, but with the ONLY_FULL_GROUP_BY SQL mode disabled it silently replaces count_vote, r.text_c by ANY_VALUE(count_vote), ANY_VALUE(r.text_c) instead. This means you get arbitrarily picked values for a fruit.
The answer hence to your question is: Don't try to speed it up, but fix it instead. (Maybe you want to place a new request showing the query and explaining what it is supposed to do, so people can help you with that.)
Your Categories table does not seem to be joined/related to the others; this produces a Cartesian product between all the rows.
If you want distinct results, don't use GROUP BY but DISTINCT, so you can avoid an unnecessary subquery,
and you don't need an ORDER BY on a subquery.
SELECT category
, b.fruit_name
, u.name
, r.count_vote
, r.text_c
FROM Fruits b
INNER JOIN (
SELECT distinct fruit_id, count_vote, text_c, customer_id
FROM Reviews
) r ON b.fruit_id = r.fruit_id
INNER JOIN Customers u ON u.customer_id = r.customer_id
INNER JOIN Categories c ON ?????? /* Your Categories table seems not joined/related to the others */
WHERE category = "Fruits";
For better readability, you should use explicit join syntax and avoid the old join syntax based on comma-separated table names and WHERE conditions.
The next time you want help optimizing a query, please include the table/index structure, an indication of the cardinality of the indexes and the EXPLAIN plan for the query.
There appears to be absolutely no reason for a single sub-query here, let alone 2. Using sub-queries mostly prevents the DBMS optimizer from doing its job. So your biggest win will come from eliminating these sub-queries.
The CROSS JOIN creates a deliberate Cartesian join. It's also unclear whether any attributes from this table are actually required for the result, whether it is there to produce multiples of the same row in the output, or whether it is just an error.
The attribute category in the last line of your query is not attributed to any of the tables (but I suspect it comes from the categories table).
Further, your code uses a GROUP BY clause with no aggregation function. This will produce non-deterministic results and is a bug. Assuming that you are not exploiting a side-effect of that, the query can be re-written as:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN Reviews r
ON r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
ORDER BY r.fruit_id, count_vote DESC, r_id;
Since there are no predicates other than joins in your query, there is no scope for further optimization beyond ensuring there are indexes on the join predicates.
As all too frequently, the biggest benefit may come from simply asking the question of why you need to retrieve every single row in the tables in a single query.
I have a query that looks like this:
select `adverts`.*
from `adverts`
inner join `advert_category` on `advert_category`.`advert_id` = `adverts`.`id`
inner join `advert_location` on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
and `advert_category`.`category_id` = ?
order by `updated_at` desc
The problem here is I have a huge database and this response is absolutely ravaging my database.
What I really need is to do the first join, and then do the where clause. This will whittle down my response from like 100k queries to less than 10k; then I want to do the other join, in order to whittle down the responses again so I can get the advert_location on the category items.
Doing it as is just isn't viable.
So, how do I go about using a join and a where condition, and then after getting that response doing a further join with a where condition?
Thanks
This is your query, written a bit simpler so I can read it:
select a.*
from adverts a inner join
advert_category ac
on ac.advert_id = a.id inner join
advert_location al
on al.advert_id = a.id
where al.location_id = ? and
ac.category_id = ?
order by a.updated_at desc;
I am speculating that advert_category and advert_location have multiple rows per advert. In that case, you are getting a Cartesian product for each advert.
A better way to write the query uses exists:
select a.*
from adverts a
where exists (select 1
from advert_location al
where al.advert_id = a.id and al.location_id = ?
) and
exists (select 1
from advert_category ac
where ac.advert_id = a.id and ac.category_id = ?
)
order by a.updated_at desc;
For this version, you want indexes on advert_location(advert_id, location_id), advert_category(advert_id, category_id), and probably adverts(updated_at, id).
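Here is a rough sketch of the difference between the JOIN and EXISTS versions, run through Python's sqlite3 as a stand-in for MySQL. The sample rows are made up, and the repeated (advert_id, location_id) and (advert_id, category_id) pairs are an assumption chosen to provoke the row multiplication; if those pairs are unique in your data, both forms return the same adverts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE adverts (id INTEGER PRIMARY KEY, title TEXT, updated_at TEXT);
CREATE TABLE advert_location (advert_id INTEGER, location_id INTEGER);
CREATE TABLE advert_category (advert_id INTEGER, category_id INTEGER);

-- The indexes suggested for the EXISTS version.
CREATE INDEX idx_al ON advert_location (advert_id, location_id);
CREATE INDEX idx_ac ON advert_category (advert_id, category_id);
CREATE INDEX idx_adverts_updated ON adverts (updated_at, id);

INSERT INTO adverts VALUES
    (1, 'bike', '2024-01-02'), (2, 'car', '2024-01-03'), (3, 'boat', '2024-01-01');
-- Advert 1 is linked twice to the same location and category (assumed duplicates).
INSERT INTO advert_location VALUES (1, 5), (1, 5), (2, 5), (3, 9);
INSERT INTO advert_category VALUES (1, 7), (1, 7), (2, 7), (3, 7);
""")

join_sql = """
SELECT a.id
FROM adverts a
JOIN advert_category ac ON ac.advert_id = a.id
JOIN advert_location al ON al.advert_id = a.id
WHERE al.location_id = ? AND ac.category_id = ?
ORDER BY a.updated_at DESC
"""

exists_sql = """
SELECT a.id
FROM adverts a
WHERE EXISTS (SELECT 1 FROM advert_location al
              WHERE al.advert_id = a.id AND al.location_id = ?)
  AND EXISTS (SELECT 1 FROM advert_category ac
              WHERE ac.advert_id = a.id AND ac.category_id = ?)
ORDER BY a.updated_at DESC
"""

join_ids = [row[0] for row in conn.execute(join_sql, (5, 7))]
exists_ids = [row[0] for row in conn.execute(exists_sql, (5, 7))]
print(join_ids)    # advert 1 is repeated once per matching pair of link rows
print(exists_ids)  # each matching advert appears once
```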
You can write the 1st join in a Derived Table including a WHERE-condition and then do the 2nd join (but a decent optimizer might resolve the Derived Table again and do what it thinks is best based on statistics):
select adverts.*
from
(
select `adverts`.*
from `adverts`
inner join `advert_category`
on `advert_category`.`advert_id` =`adverts`.`id`
where `advert_category`.`category_id` = ?
) as adverts
inner join `advert_location`
on `adverts`.`id` = `advert_location`.`advert_id`
where `advert_location`.`location_id` = ?
order by `updated_at` desc
MySQL will reorder inner joins for you during optimization, regardless of how you wrote them in your query. Inner join is the same in either direction (in algebra this is called commutative), so this is safe to do.
You can see the result of join reordering if you use EXPLAIN on your query.
If you don't like the order MySQL chose for your joins, you can override it with this kind of syntax:
from `adverts`
straight_join `advert_category` ...
https://dev.mysql.com/doc/refman/5.7/en/join.html says:
STRAIGHT_JOIN is similar to JOIN, except that the left table is always read before the right table. This can be used for those (few) cases for which the join optimizer processes the tables in a suboptimal order.
Once the optimizer has decided on the join order, it always does one join at a time, in that order. This is called the nested-loop join method.
There isn't really any way to "do the join then do the where clause". Conditions are combined together when looking up rows for joined tables. But this is a good thing, because you can then create a compound index that helps match rows based on both join conditions and where conditions.
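As a rough illustration of the compound-index point, the sketch below uses Python's sqlite3 and its EXPLAIN QUERY PLAN instead of MySQL's EXPLAIN, so the plan text and the planner's choices are SQLite's, not MySQL's. The index name `idx_al_location_advert` and the column order (WHERE column first, join column second) are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE adverts (id INTEGER PRIMARY KEY, title TEXT, updated_at TEXT);
CREATE TABLE advert_location (advert_id INTEGER, location_id INTEGER);
-- One index covering both the WHERE column and the join column.
CREATE INDEX idx_al_location_advert ON advert_location (location_id, advert_id);
""")

plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT a.*
FROM adverts a
JOIN advert_location al ON al.advert_id = a.id
WHERE al.location_id = ?
""", (5,)).fetchall()

# The human-readable description is the last column of each plan row.
plan_details = [row[-1] for row in plan]
for detail in plan_details:
    print(detail)

# True when the planner picked the compound index for advert_location.
uses_compound_index = any("idx_al_location_advert" in d for d in plan_details)
print(uses_compound_index)
```

In MySQL you would run EXPLAIN on the real query and check the key column for the equivalent index.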
PS: When asking a query optimization question, you should include the EXPLAIN output, and also run SHOW CREATE TABLE <tablename> for each table, and include the result. Then we don't have to guess at the columns and indexes in your table.
I've done some searching without success and I want to know if there is some better way to rewrite sql query because this OR condition in the LEFT JOIN kills the performance:(
For e.g.:
SELECT DISTINCT * FROM computers
LEFT JOIN monitors ON computers.brand = monitors.brand
LEFT JOIN keyboards ON computers.type = keyboards.type
LEFT JOIN accessories ON accessories.id = keyboards.id OR accessories.id = monitors.id
GROUP BY computers.id
ORDER BY computers.id DESC
Sorry for dumb question, but is it possible to rewrite OR statement to improve performance?
I doubt it will make any difference, but you could try this:
SELECT DISTINCT *
FROM computers
LEFT JOIN monitors ON computers.brand = monitors.brand
LEFT JOIN keyboards ON computers.type = keyboards.type
LEFT JOIN accessories ON accessories.id IN (keyboards.id, monitors.id)
GROUP BY computers.id
ORDER BY computers.id DESC
You could also join to the same table twice, if you are comfortable having two sets of accessories columns (perhaps using coalesce() a bunch in the SELECT list):
SELECT DISTINCT * FROM computers
LEFT JOIN monitors ON computers.brand = monitors.brand
LEFT JOIN keyboards ON computers.type = keyboards.type
LEFT JOIN accessories a1 ON a1.id = keyboards.id
LEFT JOIN accessories a2 ON a2.id = monitors.id
GROUP BY computers.id
ORDER BY computers.id DESC
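To show what "two sets of accessories columns" looks like in practice, here is a minimal sketch in Python's sqlite3 (a stand-in for MySQL). The sample rows and the explicit column list are assumptions; the DISTINCT and GROUP BY from the question are left out so the shape of the result is easy to see.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE computers   (id INTEGER PRIMARY KEY, brand TEXT, type TEXT);
CREATE TABLE monitors    (id INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE keyboards   (id INTEGER PRIMARY KEY, type TEXT);
CREATE TABLE accessories (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO computers   VALUES (1, 'acme', 'desktop');
INSERT INTO monitors    VALUES (100, 'acme');
INSERT INTO keyboards   VALUES (200, 'desktop');
INSERT INTO accessories VALUES (100, 'stand'), (200, 'wrist rest');
""")

# OR in the join condition: one output row per matching accessory.
or_rows = conn.execute("""
SELECT c.id, a.name
FROM computers c
LEFT JOIN monitors m  ON c.brand = m.brand
LEFT JOIN keyboards k ON c.type = k.type
LEFT JOIN accessories a ON a.id = k.id OR a.id = m.id
ORDER BY a.id
""").fetchall()

# Two plain equality joins: one output row with two accessory columns,
# plus COALESCE to pick the first non-NULL one.
two_join_rows = conn.execute("""
SELECT c.id,
       a1.name AS keyboard_accessory,
       a2.name AS monitor_accessory,
       COALESCE(a1.name, a2.name) AS any_accessory
FROM computers c
LEFT JOIN monitors m  ON c.brand = m.brand
LEFT JOIN keyboards k ON c.type = k.type
LEFT JOIN accessories a1 ON a1.id = k.id
LEFT JOIN accessories a2 ON a2.id = m.id
""").fetchall()

print(or_rows)
print(two_join_rows)
```

The two forms are not row-for-row identical: the OR version stacks matches as rows, the two-join version spreads them across columns, so check which shape your application expects.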
And, fwiw, this query would not be legal in most modern database engines. If you want to GROUP BY a field, the ANSI SQL standard says you can't also just put * (even with DISTINCT) in the SELECT list, because you haven't specified which values to keep and which to discard as the database rolls up the group... the results are undefined, and that's a bad thing.
You are doing SELECT DISTINCT *, so it's checking that your entire record is unique across all rows it gets, which is 3 tables' worth. It's probably going to be already unique; if your primary keys and unique indexes are set up correctly it's definitely unique, so just take it out.
If your primary keys and indexes aren't set up, do that first. Primary key on fields named id.
That, and SELECT * incurs a big overhead since it has to figure out what the rest of your columns are.
Guessing without knowing what the table structure actually is: since you are grouping by computers.id, put that in your SELECT instead and take it out of your GROUP BY.
SELECT DISTINCT computers.id
I have an SQL query that needs to perform multiple inner joins, as follows:
SELECT DISTINCT adv.Email, adv.Credit, c.credit_id AS creditId, c.creditName AS creditName, a.Ad_id AS adId, a.adName
FROM placementlist pl
INNER JOIN
(SELECT Ad_id, List_id FROM placements) AS p
ON pl.List_id = p.List_id
INNER JOIN
(SELECT Ad_id, Name AS adName, credit_id FROM ad) AS a
ON ...
(few more inner joins)
My question is the following: How can I optimize this query? I was under the impression that, even though the way I currently query the database creates small temporary tables (inner SELECT statements), it would still be advantageous to performing an inner join on the unaltered tables as they could have about 10,000 - 100,000 entries (not millions). However, I was told that this is not the best way to go about it but did not have the opportunity to ask what the recommended approach would be.
What would be the best approach here?
To use derived tables such as
INNER JOIN (SELECT Ad_id, List_id FROM placements) AS p
is not recommended. Let the dbms find out by itself what values it needs from
INNER JOIN placements AS p
instead of telling it (again) by kinda forcing it to create a view on the table with the two values only. (And using FROM tablename is even much more readable.)
With SQL you mainly say what you want to see, not how this is going to be achieved. (Well, of course this is just a rule of thumb.) So if no other columns except Ad_id and List_id are used from table placements, the dbms will find its best way to handle this. Don't try to make it use your way.
The same is true of the IN clause, by the way, where you often see WHERE col IN (SELECT DISTINCT colx FROM ...) instead of simply WHERE col IN (SELECT colx FROM ...). This does exactly the same, but with DISTINCT you tell the dbms "make your subquery's rows distinct before looking for col". But why would you want to force it to do so? Why not have it use just the method the dbms finds most appropriate?
Back to derived tables: Use them when they really do something, especially aggregations, or when they make your query more readable.
Moreover,
SELECT DISTINCT adv.Email, adv.Credit, ...
doesn't look too good either. Yes, sometimes you need SELECT DISTINCT, but usually you wouldn't. Most often it is just a sign that you haven't thought your query through.
An example: you want to select clients that bought product X. In SQL you would say: where a purchase of X EXISTS for the client. Or: where the client is IN the set of the X purchasers.
select * from clients c where exists
(select * from purchases p where p.clientid = c.clientid and product = 'X');
Or
select * from clients where clientid in
(select clientid from purchases where product = 'X');
You don't say: Give me all combinations of clients and X purchases and then boil that down so I just get each client once.
select distinct c.*
from clients c
join purchases p on p.clientid = c.clientid and product = 'X';
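A quick sketch of these three formulations side by side, using Python's sqlite3 with made-up rows (the `name` column and the data are assumptions). It shows that EXISTS and IN return each client once on their own, while the join needs DISTINCT to undo the duplication it created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clients   (clientid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE purchases (clientid INTEGER, product TEXT);
INSERT INTO clients   VALUES (1, 'Ann'), (2, 'Bob');
-- Ann bought X twice, Bob never bought X.
INSERT INTO purchases VALUES (1, 'X'), (1, 'X'), (2, 'Y');
""")

exists_rows = conn.execute("""
SELECT c.clientid, c.name FROM clients c
WHERE EXISTS (SELECT 1 FROM purchases p
              WHERE p.clientid = c.clientid AND p.product = 'X')
""").fetchall()

in_rows = conn.execute("""
SELECT clientid, name FROM clients
WHERE clientid IN (SELECT clientid FROM purchases WHERE product = 'X')
""").fetchall()

# The join produces one row per purchase before DISTINCT collapses them.
join_rows = conn.execute("""
SELECT c.clientid, c.name FROM clients c
JOIN purchases p ON p.clientid = c.clientid AND p.product = 'X'
""").fetchall()

join_distinct_rows = conn.execute("""
SELECT DISTINCT c.clientid, c.name FROM clients c
JOIN purchases p ON p.clientid = c.clientid AND p.product = 'X'
""").fetchall()

print(exists_rows, in_rows, join_rows, join_distinct_rows)
```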
Yes, it is very easy to just join all tables needed and then just list the columns to select and then just put DISTINCT in front. But it makes the query kind of blurry, because you don't write the query as you would word the task. And it can make things difficult when it comes to aggregations. The following query is wrong, because you multiply money earned with the number of money-spent records and vice versa.
select
sum(money_spent.value),
sum(money_earned.value)
from user
join money_spent on money_spent.userid = user.userid
join money_earned on money_earned.userid = user.userid;
And the following may look correct, but is still incorrect (it only works when the values happen to be unique):
select
sum(distinct money_spent.value),
sum(distinct money_earned.value)
from user
join money_spent on money_spent.userid = user.userid
join money_earned on money_earned.userid = user.userid;
Again: You would not say: "I want to combine each purchase with each earning and then ...". You would say: "I want the sum of money spent and the sum of money earned per user". So you are not dealing with single purchases or earnings, but with their sums. As in
select
(select sum(value) from money_spent where money_spent.userid = user.userid),
(select sum(value) from money_earned where money_earned.userid = user.userid)
from user;
Or:
select
spent.total,
earned.total
from user
join (select userid, sum(value) as total from money_spent group by userid) spent
on spent.userid = user.userid
join (select userid, sum(value) as total from money_earned group by userid) earned
on earned.userid = user.userid;
So you see, this is where derived tables come into play.
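The multiplication effect is easy to reproduce with a few rows. The sketch below uses Python's sqlite3 with invented amounts; the table is called `users` here rather than `user` purely to stay clear of reserved words (an assumption for the example). One user has 2 spending rows and 3 earning rows, so the plain join yields 2 x 3 = 6 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users        (userid INTEGER PRIMARY KEY);
CREATE TABLE money_spent  (userid INTEGER, value INTEGER);
CREATE TABLE money_earned (userid INTEGER, value INTEGER);
INSERT INTO users        VALUES (1);
INSERT INTO money_spent  VALUES (1, 10), (1, 20);             -- true total: 30
INSERT INTO money_earned VALUES (1, 100), (1, 100), (1, 50);  -- true total: 250
""")

join_tail = """
FROM users u
JOIN money_spent s  ON s.userid = u.userid
JOIN money_earned e ON e.userid = u.userid
"""

# Wrong: every spent row is paired with every earned row before summing.
naive = conn.execute(
    "SELECT SUM(s.value), SUM(e.value)" + join_tail).fetchone()

# Still wrong: DISTINCT drops the second, legitimate 100 from the earnings.
with_distinct = conn.execute(
    "SELECT SUM(DISTINCT s.value), SUM(DISTINCT e.value)" + join_tail).fetchone()

# Right: aggregate each table on its own, then join the totals.
pre_aggregated = conn.execute("""
SELECT spent.total, earned.total
FROM users u
JOIN (SELECT userid, SUM(value) AS total FROM money_spent  GROUP BY userid) spent
  ON spent.userid = u.userid
JOIN (SELECT userid, SUM(value) AS total FROM money_earned GROUP BY userid) earned
  ON earned.userid = u.userid
""").fetchone()

print(naive, with_distinct, pre_aggregated)
```

With these rows the naive join reports 90 and 500, the DISTINCT variant 30 and 150, and only the pre-aggregated version gives the true 30 and 250.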
So let's say I have the following tables Person and Wage. It's a 1-N relation, where a person can have more than one wage.
**Person**
id
name
**Wage**
id
person_id
amount
effective_date
Now, I want to query a list of all persons and their latest wages. I can get the results by doing the following query:
SELECT
p.*,
( SELECT w.amount
FROM wages AS w
WHERE w.person_id = p.id
ORDER BY w.effective_date DESC
LIMIT 1
) as wage_amount,
( SELECT w.effective_date
FROM wages AS w
WHERE w.person_id = p.id
ORDER BY w.effective_date DESC
LIMIT 1
) as effective_date
FROM person as p
The problem is, my query will have multiple sub-queries from different tables. I want to make it as efficient as possible. Is there an alternative to using sub-queries that would be faster and give me the same results?
Proper indexing would probably make your version work efficiently (that is, an index on wages(person_id, effective_date)).
The following produces the same results with a single subquery:
SELECT p.*, w.amount, w.effective_date
from person p left outer join
(select person_id, max(effective_date) as maxdate
from wages
group by person_id
) maxw
on maxw.person_id = p.id left outer join
wages w
on w.person_id = p.id and w.effective_date = maxw.maxdate;
And this version might make better use of indexes than the above version:
SELECT p.*, w.amount, w.effective_date
from person p left outer join
wages w
on w.person_id = p.id
where not exists (select * from wages w2 where w2.person_id = w.person_id and w2.effective_date > w.effective_date);
Note that these versions will return multiple rows for a single person when there are two "wages" with the same maximum effective date.
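Here is a small sketch of both approaches, including the tie case, using Python's sqlite3 as a stand-in. The rows are invented, and the NOT EXISTS variant below correlates on `person_id` so that "later wage" means a later wage of the same person (stated here as an assumption about the intent). Bob has two wages on the same latest date, and Cy has none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE wages  (id INTEGER PRIMARY KEY, person_id INTEGER,
                     amount INTEGER, effective_date TEXT);
CREATE INDEX idx_wages_person_date ON wages (person_id, effective_date);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO wages VALUES
    (1, 1, 1000, '2020-01-01'),
    (2, 1, 1200, '2021-01-01'),
    (3, 2,  900, '2021-06-01'),
    (4, 2,  950, '2021-06-01');  -- tie on Bob's latest date
""")

# Variant 1: find each person's latest date, then join back to the wage rows.
max_date_rows = conn.execute("""
SELECT p.name, w.amount, w.effective_date
FROM person p
LEFT JOIN (SELECT person_id, MAX(effective_date) AS maxdate
           FROM wages
           GROUP BY person_id) maxw ON maxw.person_id = p.id
LEFT JOIN wages w ON w.person_id = p.id AND w.effective_date = maxw.maxdate
ORDER BY p.id, w.id
""").fetchall()

# Variant 2: keep a wage row only if the same person has no later one.
not_exists_rows = conn.execute("""
SELECT p.name, w.amount, w.effective_date
FROM person p
LEFT JOIN wages w ON w.person_id = p.id
WHERE NOT EXISTS (SELECT 1 FROM wages w2
                  WHERE w2.person_id = w.person_id
                    AND w2.effective_date > w.effective_date)
ORDER BY p.id, w.id
""").fetchall()

print(max_date_rows)
print(not_exists_rows)
```

Both variants return Bob twice because of the tie; if you need exactly one row per person, you have to add a tie-breaker of your own (for example the highest wage id).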
Subqueries can be a good solution like Sam S mentioned in his answer but it really depends on the subquery, the dbms you are using, and your indexes. See this question and answers for a good discussion on the performance of subqueries vs. joins: Join vs. sub-query
If performance is an issue for you, you must consider using the EXPLAIN command of your dbms. It will show you how the query is being built and where the bottlenecks are. Based on its results, you might consider rewriting your query some other way.
For instance, it was usually the case that a join would yield better performance, so you could rewrite your query according to this answer: https://stackoverflow.com/a/2111420/362298 and compare their performance.
Note that creating the right indexes will also make a big difference.
Hope it helps.
Subqueries are very efficient as long as you make sure you use indexes. Try running EXPLAIN on your query and see if it uses correct indexes
SELECT p.name, w.amount, MAX(w.effective_date)
FROM Person p
LEFT JOIN Wage w ON w.person_id = p.id
GROUP BY p.name
I didn't test this query.