I have a query which actually runs two queries on a table. I query the whole table, a datediff and then a subquery which tells me the sum of hours each unit spent in certain operational steps. The main query limits the results to the REP depot so technically I don't need to put that same criteria on the subquery since repair_order is unique.
Would it be faster, slower or no difference to apply the depot filter on the subquery?
SELECT
*,
DATEDIFF(date_shipped, date_received) as htg_days,
(SELECT SUM(t3.total_days) FROM report_tables.cycle_time_days as t3 WHERE t1.repair_order=t3.repair_order AND (operation='MFG' OR operation='ENG' OR operation='ENGH' OR operation='HOLD') GROUP BY t3.repair_order) as subt_days
FROM
report_tables.cycle_time_days as t1
WHERE
YEAR(t1.date_shipped)=2010
AND t1.depot='REP'
GROUP BY
repair_order
ORDER BY
date_shipped;
I run into this with a lot of situations but I never know if it would be better to put the filter in the sub query, main query or both.
In this example, it would actually alter the query if you moved your WHERE clause to filter by REP into the subquery. So it wouldn't be about performance at that point, it would be about getting the same result set. In general, though, if you will get the same exact result set by moving a WHERE clause elsewhere in a complex query, it is better to do so at the most atomic level possible, ie, in the subquery. Then the subquery returns a smaller result set to the main query before the main query has to process it.
The answer to your question will vary depending on your schema, the complexity of your queries, the reliability of your data, etc. A general rule of thumb is to try to process the least amount of data possible, which generally means filtering it at the lowest level possible as well.
When you want to optimize a query the absolute number one place to start is to use the EXPLAIN output to see what optimizations the query parser was able to figure out and check to see what the weakest link is in the query plan. Resolve that, rinse, repeat.
You can also use explain's "extended" keyword to see the actual query it built to run which will reveal more about its usage of your criteria. In some cases, it will optimize away duplicate conditions between parent/subqueries. In other cases, it may push the conditions down from the parent in to the subquery. In some cases for (too) complex queries I've seen the it repeat the condition when it was only specified in the query once. Thankfully, you don't have to guess, mysql's explain plan will reveal all, albeit sometimes in cryptic ways.
I usually use a derived table as a "driver or aggregating" query then join that result back onto whatever table that i want to pull data from:
select
t1.*,
datediff(t1.date_shipped, t1.date_received) as htg_days,
subt_days.total_days
from
cycle_time_days as t1
inner join
(
-- aggregating/driver query
select
repair_order,
sum(total_days) as total_days
from
cycle_time_days
where
year(date_shipped) = 2010 and depot = 'REP' and
operation in ('MFG','ENG','ENGH','HOLD') -- covering index on date, depot, op ???
group by
repair_order -- indexed ??
having
total_days > 14 -- added for demonstration purposes
order by
total_days desc limit 10
) as subt_days on t1.repair_order = subt_days.repair_order
order by
t1.date_shipped;
Related
i've got an optimisation problem with my query, once I use the aggregate GROUP BY in my query with a JSON_OBJECT(), the performances are heavily affected, and it seems that the JSON_OBJECT() function is called for EVERY row in the table, even if there is a LIMIT.
Once there is no more GROUP BY, the query is executed really fast. I abstracted the query i'm using to the easiest, but I need to GROUP BY cause
I'm using JSON_ARRAYAGG() for another join.
I got ~25k rows in my table and it takes 10x less time when removing the group by aggregate
select JSON_OBJECT('id',`b`.`id`) as bw
from a
left join `b` on `a`.`id` = `b`.`id_a`
group by `a`.`id`
LIMIT 1;
In general, JSON should be used for storing structured data that only the app needs to look inside. It is clumsy and probably very inefficient for MySQL to pick apart JSON for use with WHERE, GROUP BY, etc.
As for GROUP BY (or ORDER BY) plus LIMIT 1:
With just the LIMIT, MySQL simply peels of the first row it finds. -- much faster, but which row you get is unpredictable.
With Group or Order, it may have to gather all possible rows, juggle them (grouping or sorting), and only then peel off 1 row. -- much slower.
It sounds like you have an "array" of things in each JSON? The RDBMS equivalent involves a second table to handle all those arrays -- one element per row. Switching to that may lead to much faster code. (I don't understand your data well enough to give you a concrete suggestion.)
I have a case where I do a select from another select and the order of the returned rows is changed if I add a where clause.
Example:
SELECT t.id
FROM (
SELECT t.id
FROM table1 t
ORDER BY
t.viewsTotal ASC
LIMIT 20
OFFSET 0
) base
INNER JOIN table1 t ON base.id = t.id
LEFT JOIN table2 t2 ON t2.id = t1.secondTableId
# WHERE t2.someBoolColumn = FALSE
;
Now, the order is the same for the inner select and the outer select, but if I uncomment the where condition, the outer select will change the ordering.
How can I prevent this from happening?
Lets assume the following for a given example:
I can not do one select.
I do not know what order has been applied to an inner select when doing an outer select. So, if I order from a joined table, I wouldn't know that I need to join it here.
More info on my use case
There is a query builder that provides inner select, and it may apply order by a third table that is joined to that inner select, if i would like to apply the same order i would need to know what tables were joined, and in the case of this poor query builder i do not have that knowledge
tl;dr If you want a particular order in your result set, use ORDER BY.
The ordering of rows in a result set from any RDMS server without an ORDER BY clause is formally unpredictable. Unpredictable is like random, except worse. Random ordering implies you'll get your rows in a different order every time you run the query. Truly random ordering, if it existed, would make it hard for simple unit tests to pass when your assumptions about ordering fail.
Unpredictable means you'll get them in the same order, until you don't. That means your unit tests will pass, and your system tests will pass, and your system will fail six months into production, if it depends on result set ordering.
Why is this so? A server's query planner is free to use any algorithm at its disposal to satisfy the queries you give it. These algorithms work differently for different types of table and different sizes of table. If you don't constrain the query planner by specifying the result set ordering, it may pick some algorithm that gives an ordering that appears strange to you the programmer.
Query planners have, literally, thousands of programmer years' worth of optimizations built in to them.
For people used to the procedural ways of thinking encouraged by all kinds of programming languages, it's sometimes hard to switch your thinking to the declarative / descriptive mode used by SQL. With SQL (at least clean SQL without stuff like SELECT #a := #a+1 and other hacks) you're simply describing the result set you want. The server generates results matching your specification.
I would suggest you not rely on the implicit ordering produced my SQL (because there is no implicit ordering as per Bohemian's comment). Rather, you should use an ORDER BY statement and select one of your columns in the query by which you should order your results. That way you can ensure that the results are always presented in the same way regardless of the WHERE clauses.
I've got a complex query I have to run in an application that is giving me some performance trouble. I've simplified it here. The database is MySQL 5.6.35 on CentOS.
SELECT a.`po_num`,
Count(*) AS item_count,
Sum(b.`quantity`) AS total_quantity,
Group_concat(`web_sku` SEPARATOR ' ') AS web_skus
FROM `order` a
INNER JOIN `order_item` b
ON a.`order_id` = b.`order_key`
WHERE `store` LIKE '%foobar%'
LIMIT 200 offset 0;
The key part of this query is where I've placed "foobar" as a placeholder. If this value is something like big_store, the query takes much longer (roughly 0.4 seconds in the query provided here, much longer in the query I'm actually using) than if the value is small_store (roughly 0.1 seconds in the query provided). big_store would return significantly more results if there were not limit.
But there is a limit and that's what surprises me. Both datasets have more than the LIMIT, which is only 200. It appears to me that MySQL performing the select functions COUNT, SUM, GROUP_CONCAT for all big_store/small_store rows and then applies the LIMIT retroactively. I would imagine that it'd be best to stop when you get to 200.
Could it not do the select functions COUNT, SUM, GROUP_CONCAT actions after grabbing the 200 rows it will use, making my query much much quicker? This seems feasible to me except in cases where there's an ORDER BY on one of those rows.
Does MySQL not use LIMIT to optimize a query select functions? If not, is there a good reason for that? If so, did I make a mistake in my thinking above?
It can stop short due to the LIMIT, but that is not a reasonable query since there is no ORDER BY.
Without ORDER BY, it will pick whatever 200 rows it feels like and stop short.
With an ORDER BY, it will have to scan the entire table that contains store (please qualify columns with which table they come from!). This is because of the leading wildcard. Only then can it trim to 200 rows.
Another problem -- Without a GROUP BY, aggregates (SUM, etc) are performed across the entire table (or at least those that remain after filtering). The LIMIT does not apply until after that.
Perhaps what you are asking about is MariaDB 5.5.21's "LIMIT_ROWS_EXAMINED".
Think of it this way ... All of the components of a SELECT are done in the order specified by the syntax. Since LIMIT is last, it does not apply until after the other stuff is performed.
(There are a couple of exceptions: (1) SELECT col... must be done after FROM ..., since it would not know which table(s); (2) The optimizer readily reorders JOINed table and clauses in WHERE ... AND ....)
More details on that query.
The optimizer peeks ahead, and sees that the WHERE is filtering on order (that is where store is, yes?), so it decides to start with the table order.
It fetches all rows from order that match %foobar%.
For each such row, find the row(s) in order_item. Now it has some number of rows (possibly more than 200) with which to do the aggregates.
Perform the aggregates - COUNT, SUM, GROUP_CONCAT. (Actually this will probably be done as it gathers the rows -- another optimization.)
There is now 1 row (with an unpredictable value for a.po_num).
Skip 0 rows for the OFFSET part of the LIMIT. (OK, another out-of-order thingie.)
Deliver up to 200 rows. (There is only 1.)
Add ORDER BY (but no GROUP BY) -- big deal, sort the 1 row.
Add GROUP BY (but no ORDER BY) in, now you may have more than 200 rows coming out, and it can stop short.
Add GROUP BY and ORDER BY and they are identical, then it may have to do a sort for the grouping, but not for the ordering, and it may stop at 200.
Add GROUP BY and ORDER BY and they are not identical, then it may have to do a sort for the grouping, and will have to re-sort for the ordering, and cannot stop at 200 until after the ORDER BY. That is, virtually all the work is performed on all the data.
Oh, and all of this gets worse if you don't have the optimal index. Oh, did I fail to insist on providing SHOW CREATE TABLE?
I apologize for my tone. I have thrown quite a few tips in your direction; please learn from them.
This is question is simple i know but i would like to somebody explain me to see if i am getting right.
This simple query does MYSQL always executes the semantics from left to right?
SELECT c03,c04,c05,count(*) as c FROM table
where status='Active' and c04 is not null and c05 is not null
group by c03,c04,c05
having c>1 order by c desc limit 10;
The engine starts from left to right filtering every record by
* Status later comparing c04 and later c05
* Later groups the results
* later filters again the result applying the c>1 filter
* Later sorts and fetch the first 10 results and disposing the others
Or have some other optimizations suppose not index are using...
I guess Status later comparing c04 and later c05 depends on the available indexes and the cardinality of the columns, and is not always evaluated in that order. MySQL will try to evaluate the constraint first, which promises to reduce the result set the most at the lowest cost, e.g. constraints using columns without indexes might be evaluated last, because the evaluation is expensive.
The rest leaves little room for other orderings, but some steps may gain benefit from appropriate indexes.
So firstly here's my query: (NOTE:I know SELECT * is bad practice I just switched it in to make the query more readable)
SELECT pcln_cities.*,COUNT(pcln_hotels.cityid) AS hotelcount
FROM pcln_cities
LEFT OUTER JOIN pcln_hotels ON pcln_hotels.cityid=pcln_cities.cityid
WHERE pcln_cities.state_name='California' GROUP BY pcln_cities.cityid
ORDER BY hotelcount DESC
LIMIT 5
So I know that to solve things like that you add EXPLAIN to the beginning of the query but I'm not 100% sure how to read the results, so here they are:
alt text http://www.andrew-g-johnson.com/query-results.JPG
Bonus points to an answer that tells me what to look for in the EXPLAIN results
EDIT The cities tables has the following indexes (or is it indices?)
cityid
state_name
and I just added one with both as I thought it might help (it didn't)
The hotels tables has the following indexes (or is it indices?)
cityid
Hmm, there's something not very right in your query.
You use an aggregate function (count), but you simply group by on id.
Normally, you should group on all columns in your select list, that are not an aggregate function.
As you've specified the query now, IMHO, the DBMS can never correctly determine which values he should display for those columns that are not an aggregate ...
It would be more correct if your query was written like:
select cityname, count(*)
from city inner join hotel on hotel.city_id = city_id
group by cityname
order by count(*) desc
If you do not have an index on the cityName, and you filter on cityname, it will improve performance if you put an index on that column.
In short: adding an index on columns that you regularly use for filtering or for sorting, may improve performance.
(That is simply put offcourse, you can use it as a 'guideline', but every situation is different. Sometimes it can be helpfull to add an index which spans multiple columns.
Also, remember that if you update or insert a record, indexes need to be updated as well, so there's a slight performance cost in adding/updating/deleting records)
Another thing that could improve performance, is using an inner join instead of an outer join. I don't think that it is necessary to use an outer join here.
It looks like you don't have an index on pcln_cities.state_name, or pcln_cities.cityid? Try adding them.
Given that you've updated your question to say that you do have these indexes, I can only suggest that your database currently has a preponderance of cities in California, so the query optimizer decided it would be easier to do a table scan and throw out the non-California ones than to use the index to pick out the California ones.
Your query looks fine. Is there a chance that something else has a lock on a record that you need? Are the tables especially big? I doubt that data is the problem as there are not that many hotels...
I've run in to similar issues with MySQL. After spending over a year tuning, patching, and thinking I'm a SQL dummy, I switched to SQL Server Express. The exact same queries with the exact same data would run 2-5 orders of magnitude faster in SQL Server Express. MySQL seemed to have an especially difficult time with moderately complex queries (5+ tables). I think the MySQL optimizer became retarded after SUN bought the organization...