I have a case where I do a select from another select and the order of the returned rows is changed if I add a where clause.
Example:
SELECT t.id
FROM (
    SELECT t.id
    FROM table1 t
    ORDER BY t.viewsTotal ASC
    LIMIT 20 OFFSET 0
) base
INNER JOIN table1 t ON base.id = t.id
LEFT JOIN table2 t2 ON t2.id = t.secondTableId
# WHERE t2.someBoolColumn = FALSE
;
Now, the order is the same for the inner select and the outer select, but if I uncomment the where condition, the outer select will change the ordering.
How can I prevent this from happening?
Let's assume the following for the given example:
I cannot do this in a single select.
When writing the outer select, I do not know what ordering was applied to the inner select. So if the inner select orders by a column from a joined table, I have no way of knowing that I need to join that table again in the outer query.
More info on my use case
There is a query builder that produces the inner select, and it may apply an ORDER BY from a third table joined inside that inner select. If I wanted to apply the same ordering in the outer query, I would need to know which tables were joined, and with this (admittedly poor) query builder I do not have that knowledge.
tl;dr If you want a particular order in your result set, use ORDER BY.
The ordering of rows in a result set from any RDBMS, absent an ORDER BY clause, is formally unpredictable. Unpredictable is like random, except worse. Random ordering implies you'd get your rows in a different order every time you run the query. Truly random ordering, if it existed, would at least make simple unit tests fail as soon as your assumptions about ordering were wrong.
Unpredictable means you'll get them in the same order, until you don't. That means your unit tests will pass, and your system tests will pass, and your system will fail six months into production, if it depends on result set ordering.
Why is this so? A server's query planner is free to use any algorithm at its disposal to satisfy the queries you give it. These algorithms work differently for different types of table and different sizes of table. If you don't constrain the query planner by specifying the result set ordering, it may pick some algorithm that gives an ordering that appears strange to you the programmer.
Query planners have, literally, thousands of programmer years' worth of optimizations built in to them.
For people used to the procedural ways of thinking encouraged by all kinds of programming languages, it's sometimes hard to switch your thinking to the declarative / descriptive mode used by SQL. With SQL (at least clean SQL without stuff like SELECT @a := @a+1 and other hacks) you're simply describing the result set you want. The server generates results matching your specification.
I would suggest you not rely on any implicit ordering produced by SQL (because, as per Bohemian's comment, there is no implicit ordering). Rather, use an ORDER BY clause and pick one of the columns in your query to order the results by. That way you can ensure that the results are always presented in the same order regardless of the WHERE clauses.
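Applied to the query from the question, a minimal sketch of that advice (assuming viewsTotal is the column the inner select sorts by) is to expose the sort column from the derived table and repeat the ORDER BY in the outer query:

SELECT base.id
FROM (
    SELECT t.id, t.viewsTotal    -- expose the sort column to the outer query
    FROM table1 t
    ORDER BY t.viewsTotal ASC
    LIMIT 20 OFFSET 0
) base
INNER JOIN table1 t ON base.id = t.id
LEFT JOIN table2 t2 ON t2.id = t.secondTableId
WHERE t2.someBoolColumn = FALSE
ORDER BY base.viewsTotal ASC;    -- re-assert the ordering explicitly

This way the outer query does not need to know which tables the inner select joined; it only needs the sort column(s) that the derived table carries out.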
Related
I've got an optimisation problem with my query: once I use GROUP BY together with JSON_OBJECT(), performance is heavily affected, and it seems that the JSON_OBJECT() function is called for EVERY row in the table, even if there is a LIMIT.
Once there is no more GROUP BY, the query executes really fast. I abstracted the query I'm using down to the simplest case, but I need the GROUP BY because I'm using JSON_ARRAYAGG() for another join.
I have ~25k rows in my table and it takes 10x less time when the GROUP BY aggregate is removed.
select JSON_OBJECT('id',`b`.`id`) as bw
from a
left join `b` on `a`.`id` = `b`.`id_a`
group by `a`.`id`
LIMIT 1;
In general, JSON should be used for storing structured data that only the app needs to look inside. It is clumsy and probably very inefficient for MySQL to pick apart JSON for use with WHERE, GROUP BY, etc.
As for GROUP BY (or ORDER BY) plus LIMIT 1:
With just the LIMIT, MySQL simply peels off the first row it finds -- much faster, but which row you get is unpredictable.
With GROUP BY or ORDER BY, it may have to gather all possible rows, juggle them (grouping or sorting), and only then peel off one row -- much slower.
It sounds like you have an "array" of things in each JSON? The RDBMS equivalent involves a second table to handle all those arrays -- one element per row. Switching to that may lead to much faster code. (I don't understand your data well enough to give you a concrete suggestion.)
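As a rough illustration of why the LIMIT doesn't help here, and one way around it, here is a hedged sketch that applies the LIMIT inside a derived table so the aggregate only ever sees the rows of the one group you asked for (assuming the grouping is driven by table a, as in the posted query):

-- sketch: limit first, then join and aggregate only the surviving rows
SELECT JSON_ARRAYAGG(JSON_OBJECT('id', b.id)) AS bw
FROM (
    SELECT id FROM a ORDER BY id LIMIT 1    -- LIMIT applied before the join
) AS a1
LEFT JOIN b ON a1.id = b.id_a
GROUP BY a1.id;

Whether the optimizer can do this rewrite for you may vary by MySQL version; doing it by hand is the safe option.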
My requirement is: I have a table, I need to group by one of the fields and get the latest record in each group. I searched for a scheme on the Internet and found this:
SELECT *
FROM (
    SELECT *
    FROM record r
    WHERE r.id IN (xx, xx, xx)
    HAVING 1
    ORDER BY r.time DESC
) a
GROUP BY a.id
The result is correct, but I can't understand the meaning of "having 1" after the where clause. I hope a friend can give me an answer. Thank you very much.
It does nothing, just like having true would. Presumably it is a placeholder where sometimes additional conditions are applied? But since there is no group by or use of aggregate functions in the subquery, any having conditions are going to be treated no differently than where conditions.
Normally you select rows and apply where conditions, then any grouping (explicit, or implicit as in select count(*)) occurs, and the having clause can specify further constraints after the grouping.
Note that your query is not guaranteed to give the results you want; the ORDER BY in the subquery in theory has no effect on the outer query, and the optimizer may skip it. It is possible the presence of HAVING makes a difference to the optimizer, but that is not something you should rely on, certainly not from one version of MySQL to another.
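As an aside, a more dependable pattern for "latest record per group" on MySQL 8.0+ is a window function. A minimal sketch against the table from the question (assuming id is the grouping field, as in the posted query, and keeping the OP's xx placeholders):

SELECT *
FROM (
    SELECT r.*,
           ROW_NUMBER() OVER (PARTITION BY r.id ORDER BY r.time DESC) AS rn
    FROM record r
    WHERE r.id IN (xx, xx, xx)
) AS ranked
WHERE rn = 1;    -- rn = 1 is the newest row in each group

Unlike the ORDER BY-in-a-subquery trick, this does not depend on unspecified optimizer behavior.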
I was playing around with SQLite and I ran into an odd performance issue with CROSS JOINs on very small data sets: any cross join I do in SQLite takes about 3x as long as, or longer than, the same cross join in MySQL. For example, here is what the query would look like for 3,000 rows in MySQL:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
Does SQLite use a different algorithm than other client-server databases for doing cross joins or other types of joins? I have had a lot of luck using SQLite on a single table/database, but whenever joining tables, it seems to become a bit more problematic.
Does SQLite use a different algorithm than other client-server databases for doing cross joins or other types of joins?
Yes. The algorithm used by SQLite is very simple. In SQLite, joins are executed as nested loop joins. The database goes through one table, and for each row, searches matching rows from the other table.
SQLite is unable to figure out how to use an index to speed up the join, and without indices, a k-way join takes time proportional to N^k. MySQL, for example, creates some "ghostly" indexes which help the iteration process go faster.
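For instance, giving SQLite an index on the join column turns the inner loop's full scan into an index lookup. A hedged sketch with hypothetical tables, just to illustrate the nested-loop-plus-index idea:

-- hypothetical schema
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- for each customers row, the matching orders rows are now found
-- via idx_orders_customer instead of a full scan of orders
SELECT c.name, COUNT(*)
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id;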
It has been commented already by Shawn that this question would need much more detail in order to get a really accurate answer.
However, as a general answer, please be aware that this note in the SQLite documentation states that the algorithm used to perform CROSS JOINs may be suboptimal (by design!), and that their usage is generally discouraged:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite. The "CROSS JOIN" join operator produces the same result as the "INNER JOIN", "JOIN" and "," operators, but is handled differently by the query optimizer in that it prevents the query optimizer from reordering the tables in the join. An application programmer can use the CROSS JOIN operator to directly influence the algorithm that is chosen to implement the SELECT statement. Avoid using CROSS JOIN except in specific situations where manual control of the query optimizer is desired. Avoid using CROSS JOIN early in the development of an application as doing so is a premature optimization. The special handling of CROSS JOIN is an SQLite-specific feature and is not a part of standard SQL.
This clearly indicates that the SQLite query planner handles CROSS JOINs differently than other RDBMS.
Note: nevertheless, I am unsure that this really applies to your use case, where both derived tables being joined have the same number of records.
Why MySQL might be faster: It uses the optimization that it calls "Using join buffer (Block Nested Loop)".
But... There are many things that are "wrong" with the query. I would hate for you to draw a conclusion on comparing DB engines based on your findings.
It could be that one DB will create an index to help with the join, even if none were already there.
SELECT * probably hauls around all the columns, unless the Optimizer is smart enough to toss all the columns except for territory.
A LIMIT without an ORDER BY gives you an arbitrary set of rows. You might think that the result set is necessarily 3000 rows with the value "3000" in each, but it is perfectly valid for the engine to come up with other results. (And depending on what you ORDER BY, the result still may not be deterministic.)
Having a COUNT(*) without a column saying what it is counting (territory) seems unrealistic.
You have the same subquery twice. Some engine may be smart enough to evaluate it only once. Or you could reformulate it with WITH to (possibly) give the optimizer a big hint to do so. (I think the example below shows how it would be reformulated in MySQL 8.0 or MariaDB 10.2; I don't know about SQLite.)
If you are pitting one DB against the other, please use multiple queries that relate to your application.
This is not necessarily a "small" dataset, since the intermediate table (unless optimized away) has 9,000,000 rows.
I doubt if I have written more than one cross join in a hundred queries, maybe a thousand. Its performance is hardly worth worrying about.
WITH w AS ( SELECT territory FROM main_s LIMIT 3000 )
SELECT COUNT(*)
FROM w AS x1
JOIN w AS x2
GROUP BY x1.territory;
As noted above, using CROSS JOIN in SQLite restricts the optimiser from reordering tables, so that you can influence the order of the nested loops that perform the join.
However, that's a red herring here: you are limiting rows in both sub-selects to 3000 rows, and it's the same table, so there is no optimisation to be had there anyway.
Let's see what your query actually does:
select COUNT(*) from (
select * from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
It says: produce an intermediate result set of 9 million rows (3000 x 3000), group them on x.territory, and return the count of the size of each group.
So let's say the row size of your table is 100 bytes.
You say, for each of 3000 rows of 100 bytes, give me 3000 rows of 100 bytes.
Hence you get 9 million rows of 200 bytes length, an intermediate result set of 1.8GB.
So here are some optimisations you could make.
select COUNT(*) from (
select territory from main_s limit 3000
) x cross join (
select * from main_s limit 3000
) x2 group by x.territory
You don't use anything other than territory from x, so select just that. Let's assume it is 8 bytes, so now you create an intermediate result set of:
9M x 108 = 972MB
So we nearly halve the amount of data. Let's try the same for x2.
But wait, you are not using any data fields from x2. You are just using it to multiply the result set by 3000. If we do this directly we get:
select COUNT(*) * 3000 from (
select territory from main_s limit 3000
) group by territory
The intermediate result set is now:
3000 x 8 = 24KB which is now 0.001% of the original.
Further, now that SELECT * is not being used, it's possible the optimiser will be able to use an index on main_s that includes territory as a covering index (meaning it doesn't need to look up the row to get territory).
This is done when there is a WHERE clause: the optimiser will try to choose a covering index that also satisfies the query without row lookups. However, it is not explicit in the documentation whether this is also done when WHERE is not used.
If you determined a covering index was not being used (assuming one exists), then counterintuitively (because sorting takes time), you could use ORDER BY territory in the sub-select to cause the covering index to be used:
select COUNT(*) * 3000 from (
select territory from main_s order by territory limit 3000
) group by territory
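If no such index exists yet, creating one is straightforward. A sketch, assuming territory is a plain column of main_s:

-- a single-column index on territory is, by itself, covering for
-- a query that only reads territory
CREATE INDEX idx_main_s_territory ON main_s(territory);
ANALYZE;    -- refresh the statistics the query planner uses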
Check the optimiser documentation here:
https://www.sqlite.org/draft/optoverview.html
To summarise:
The optimiser uses the structure of your query to look for hints and clues about how the query may be optimised to run quicker.
These clues take the form of keywords and clauses such as WHERE, ORDER BY, JOIN (ON), etc.
Your query as written provides none of these clues.
If I understand your question correctly, you are interested in why other SQL systems are able to optimise your query as written.
The most likely reasons seem to be:
Ability to eliminate unused columns from sub selects (likely)
Ability to use covering indexes without WHERE or ORDER BY (likely)
Ability to eliminate unused sub selects (unlikely)
But this is a theory that would need testing.
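Testing it is cheap, though: SQLite's EXPLAIN QUERY PLAN shows which strategy the planner picked. A sketch against the rewritten query; if a covering index is being used, the output will say "USING COVERING INDEX":

EXPLAIN QUERY PLAN
select COUNT(*) * 3000 from (
    select territory from main_s order by territory limit 3000
) group by territory;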
SQLite uses CROSS JOIN as a flag to the query planner that disables table reordering. The docs are quite clear:
Programmers can force SQLite to use a particular loop nesting order for a join by using the CROSS JOIN operator instead of just JOIN, INNER JOIN, NATURAL JOIN, or a "," join. Though CROSS JOINs are commutative in theory, SQLite chooses to never reorder the tables in a CROSS JOIN. Hence, the left table of a CROSS JOIN will always be in an outer loop relative to the right table.
https://www.sqlite.org/optoverview.html#crossjoin
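In other words, in SQLite the operand order of a CROSS JOIN is itself the tuning knob. A sketch with hypothetical tables:

-- hypothetical tables: small_lookup is forced to drive the outer loop
SELECT big.id
FROM small_lookup CROSS JOIN big_table AS big
WHERE big.lookup_id = small_lookup.id;
-- with a plain JOIN, the planner would be free to swap the two tables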
I have followed the tutorial over at tizag for the MySQL MAX() function and have written the query below, which does exactly what I need. The only trouble is I need to JOIN it to two more tables so I can work with all the rows I need.
$query = "SELECT idproducts, MAX(date) FROM results GROUP BY idproducts ORDER BY MAX(date) DESC";
I have this query below, which has the JOIN I need and works:
$query = ("SELECT *
FROM operators
JOIN products
ON operators.idoperators = products.idoperator JOIN results
ON products.idProducts = results.idproducts
ORDER BY drawndate DESC
LIMIT 20");
Could someone show me how to merge the top query with the JOIN element from my second query? I am new to PHP and MySQL, this being my first adventure into a computer language. I have read and tried real hard to get those two queries to work, but I am at a brick wall: I cannot work out how to add the JOIN element to the first query.
Could some kind person take pity on a newb and help me?
Try this query.
SELECT *
FROM operators
JOIN products
    ON operators.idoperators = products.idoperator
JOIN
(
    SELECT idproducts, MAX(date) AS maxdate
    FROM results
    GROUP BY idproducts
) AS t
    ON products.idproducts = t.idproducts
ORDER BY drawndate DESC
LIMIT 20
JOINs function somewhat independently of aggregation functions; they just change the intermediate result set upon which the aggregate functions operate. I like to point to the way the MySQL documentation is written: it uses the term 'table_reference' in the SELECT syntax, and expands on what that means in the JOIN syntax. Basically, any simple query which has a table specified can expand that table to a complete JOIN clause and the query will operate the same basic way, just with a modified intermediate result set.
I say "intermediate result-set" to hint at the mindset which helped me understand JOINS and aggregation. Understanding the order in which MySQL builds your final result is critical to knowing how to reliably get the results you want. Generally, it starts by looking at the first row of the first table you specify after 'FROM', and decides if it might match by looking at 'WHERE' clauses. If it is not immediately discardable, it attempts to JOIN that row to the first JOIN specified, and repeats the "will this be discarded by WHERE?". This repeats for all JOINs, which either add rows to your results set, or remove them, or leaves just the one, as appropriate for your JOINs, WHEREs and data. This process builds what I am referring to when I say "intermediate result-set". Somewhere between starting and finishing your complete query, MySQL has in it's memory a potentially massive table-like structure of data which it built using the process I just described. Only then does it begin to aggregate (GROUP) the results according to your criteria.
So for your query, it depends on what specifically you are going for (it is not entirely clear in the OP). If you simply want the MAX(date) from the second query, you can add that expression to the SELECT clause and then add an aggregation spec to the end:
SELECT *, MAX(date)
FROM operators
...
GROUP BY idproducts
ORDER BY ...
Alternatively, you can add the JOIN section of the second query to the first.
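For reference, a sketch of that merge: the first query's grouping and MAX(date) ordering, plus the JOINs from the second query (this assumes date lives in results, as the first query implies):

SELECT results.idproducts, MAX(results.date) AS latest_date
FROM operators
JOIN products ON operators.idoperators = products.idoperator
JOIN results ON products.idProducts = results.idproducts
GROUP BY results.idproducts
ORDER BY latest_date DESC
LIMIT 20;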
I have a query which actually runs two queries on a table. I query the whole table, a datediff and then a subquery which tells me the sum of hours each unit spent in certain operational steps. The main query limits the results to the REP depot so technically I don't need to put that same criteria on the subquery since repair_order is unique.
Would it be faster, slower or no difference to apply the depot filter on the subquery?
SELECT
*,
DATEDIFF(date_shipped, date_received) as htg_days,
(SELECT SUM(t3.total_days)
 FROM report_tables.cycle_time_days AS t3
 WHERE t1.repair_order = t3.repair_order
   AND t3.operation IN ('MFG', 'ENG', 'ENGH', 'HOLD')
 GROUP BY t3.repair_order) AS subt_days
FROM
report_tables.cycle_time_days as t1
WHERE
YEAR(t1.date_shipped)=2010
AND t1.depot='REP'
GROUP BY
repair_order
ORDER BY
date_shipped;
I run into this in a lot of situations, but I never know if it would be better to put the filter in the subquery, the main query, or both.
In this example, moving the WHERE clause that filters by REP into the subquery would actually alter the query, so it wouldn't be about performance at that point; it would be about getting the same result set. In general, though, if you will get the same exact result set by moving a WHERE clause elsewhere in a complex query, it is better to do so at the most atomic level possible, i.e., in the subquery. Then the subquery returns a smaller result set to the main query before the main query has to process it.
The answer to your question will vary depending on your schema, the complexity of your queries, the reliability of your data, etc. A general rule of thumb is to try to process the least amount of data possible, which generally means filtering it at the lowest level possible as well.
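To make the trade-off concrete, here is a sketch of the posted query with the depot filter repeated inside the correlated subquery. Per the OP, repair_order is unique, so the extra condition should not change the result set, only (possibly) the amount of data the subquery touches:

SELECT
    *,
    DATEDIFF(date_shipped, date_received) AS htg_days,
    (SELECT SUM(t3.total_days)
     FROM report_tables.cycle_time_days AS t3
     WHERE t1.repair_order = t3.repair_order
       AND t3.depot = 'REP'    -- filter repeated at the lowest level
       AND t3.operation IN ('MFG', 'ENG', 'ENGH', 'HOLD')
     GROUP BY t3.repair_order) AS subt_days
FROM report_tables.cycle_time_days AS t1
WHERE YEAR(t1.date_shipped) = 2010
  AND t1.depot = 'REP'
GROUP BY repair_order
ORDER BY date_shipped;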
When you want to optimize a query the absolute number one place to start is to use the EXPLAIN output to see what optimizations the query parser was able to figure out and check to see what the weakest link is in the query plan. Resolve that, rinse, repeat.
You can also use EXPLAIN's EXTENDED keyword to see the actual query MySQL built to run, which will reveal more about how it uses your criteria. In some cases, it will optimize away duplicate conditions between parent and subqueries. In other cases, it may push the conditions down from the parent into the subquery. In some cases, for (too) complex queries, I've seen it repeat the condition when it was only specified in the query once. Thankfully, you don't have to guess; MySQL's explain plan will reveal all, albeit sometimes in cryptic ways.
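For example (on older MySQL versions; from 5.7 on, plain EXPLAIN behaves the way EXPLAIN EXTENDED did), run the EXPLAIN and then SHOW WARNINGS in the same session:

EXPLAIN EXTENDED
SELECT *, DATEDIFF(date_shipped, date_received) AS htg_days
FROM report_tables.cycle_time_days AS t1
WHERE YEAR(t1.date_shipped) = 2010
  AND t1.depot = 'REP';

SHOW WARNINGS;    -- the optimizer's rewritten query appears here as a Note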
I usually use a derived table as a "driver or aggregating" query, then join that result back onto whatever table I want to pull data from:
select
t1.*,
datediff(t1.date_shipped, t1.date_received) as htg_days,
subt_days.total_days
from
cycle_time_days as t1
inner join
(
-- aggregating/driver query
select
repair_order,
sum(total_days) as total_days
from
cycle_time_days
where
year(date_shipped) = 2010 and depot = 'REP' and
operation in ('MFG','ENG','ENGH','HOLD') -- covering index on date, depot, op ???
group by
repair_order -- indexed ??
having
total_days > 14 -- added for demonstration purposes
order by
total_days desc limit 10
) as subt_days on t1.repair_order = subt_days.repair_order
order by
t1.date_shipped;