Is mysql’s lag function non-deterministic without an ORDER BY? - mysql

I’ve been working on a query using the lag function. My initial query appeared to be returning the correct data, even though I left out the ORDER BY in the OVER clause. I was partitioning over several columns.
Then I added a WHERE clause and was surprised to find that the result set returned contained rows which were not in the unfiltered query.
My question is: is there a use case for using lag without an ORDER BY in the OVER clause? I also read in the documentation that lag does not even require an OVER clause, but it seems to me that without the OVER and the ORDER BY, lag would return random values from the column. Am I missing something?
It seems like omitting OVER, or omitting the ORDER BY in the OVER clause, should be an error.
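To make the question concrete, here is a minimal sketch that runs both forms of the query. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption: SQLite 3.25 or later, not MySQL), and the readings table and its data are hypothetical. With ORDER BY inside OVER, "previous row" is well defined; without it, the query is still accepted but the result depends on whatever order the engine happens to read the rows in.

```python
import sqlite3

# Hypothetical table and data, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [("a", 3, 30), ("a", 1, 10), ("b", 2, 200), ("a", 2, 20), ("b", 1, 100)],
)

# ORDER BY inside OVER defines what "previous row" means in each partition.
ordered = conn.execute(
    """
    SELECT sensor, ts, value,
           LAG(value) OVER (PARTITION BY sensor ORDER BY ts) AS prev_value
    FROM readings
    ORDER BY sensor, ts
    """
).fetchall()

# Without ORDER BY in OVER the query is still valid, but prev_value depends
# on the order in which the engine happens to process each partition.
unordered = conn.execute(
    """
    SELECT sensor, ts, value,
           LAG(value) OVER (PARTITION BY sensor) AS prev_value
    FROM readings
    """
).fetchall()

for row in ordered:
    print(row)
```

Only the first result set is something a caller can rely on; the second may change when a WHERE clause, an index, or the execution plan changes.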

Related

Why would an ending GROUP BY without an aggregate slow down my query?

In a program I inherited, I recently came across a long SQL query that joined 8 tables and 3 views. Using EXPLAIN, I saw it was 7 unique key lookups and 4 non-unique key lookups. On average it took 18 seconds to fetch 350 rows (there are a number of reasons for that; for one, each of those 3 views is made up of other views). Then I noticed a GROUP BY tableone.id without any aggregate. I deleted that and now it runs in milliseconds.
Now the confusing part is that I then looked up why MySQL allows a GROUP BY statement without an aggregate function and learned it's actually an optimizing technique (Why does MySQL allow "group by" queries WITHOUT aggregate functions?).
This was obviously NOT the case in my situation. So why is that? When is a dangling GROUP BY a hindrance and not an optimization?
The GROUP BY clause, even without an actual aggregate function being used, implies additional processing for the RDBMS, in order to check if some records need to be aggregated. Thus the boost that you are seeing when removing an unnecessary GROUP BY.
The link you shared explains that this somewhat loose behavior from MySQL might have been designed as a way to shorten the syntax of aggregate queries, where grouping by one field would imply that other fields are also being grouped, and possibly as an optimization as well. Anyway, this does not properly fit your use case, where you don't actually need aggregation.
Selecting nonaggregated columns that are not named in the GROUP BY clause is no longer allowed by default starting from MySQL 5.7 (as of 5.7.5 the ONLY_FULL_GROUP_BY SQL mode is enabled by default), for obvious reasons.
In versions before 5.7, the GROUP BY clause works by extracting an arbitrary value for every column that is not inside an aggregate function. Besides giving an unpredictable result for these columns, this requires scanning all the rows to extract the result, which degrades performance.

The process order of SQL order by, group by, distinct and aggregation function?

Query like:
SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area;
So, what is the process order of order by, group by, distinct and aggregation function ?
Maybe a different order will give the same result but different performance. I want to merge multiple results; I have the SQL and have parsed it, so I want to know the order that standard SQL uses.
This is bigger than just group by/aggregation/order by. You want to have a sense of how a query engine creates a result set. At a high level, that means creating an execution plan, retrieving data from the table into the query's working set, manipulating the data to match the requested result set, and then returning the result set back to the caller. For very simple queries, or queries that are well matched to the table design (or table schemas that are well-designed for the queries you'll need to run), this can mean streaming data from a table or index directly back to the caller. More often, it means thinking at a more detailed level, where you roughly follow these steps:
1. Look at the query to determine which tables will be needed.
2. Look at joins and subqueries, to determine which of those tables depend on other tables.
3. Look at the conditions on the joins and in the where clause, in conjunction with indexes, to determine how much space from each table will be needed, and how much work it will take to extract the portions of each table that you need (how well the query matches up with your indexes or the table as stored on disk).
4. Based on the information collected from steps 1 through 3, figure out the most efficient way to retrieve the data needed for the select list, regardless of the order in which tables are included in the query and regardless of any ORDER BY clause. For this step, "most efficient" is defined as the method that keeps the working set as small as possible for as long as possible.
5. Begin to iterate over the records indicated by step 4. If there is a GROUP BY clause, each record has to be checked against the existing discovered groups before the engine can determine whether or not a new row should be generated in the working set. Often, the most efficient way to do this is for the query engine to conduct an effective ORDER BY step here, such that all the potential rows for the results are materialized into the working set, which is then ordered by the columns in the GROUP BY clause and condensed so that duplicate rows are removed. Aggregate function results for each group are updated as the records for that group are discovered.
6. Once all of the indicated records are materialized, such that the results of any aggregate functions are known, HAVING clauses can be evaluated.
7. Now, finally, the ORDER BY can be factored in, as well.
8. The records remaining in the working set are returned to the caller.
And as complicated as that was, it's only the beginning. It doesn't begin to account for windowing functions, common table expressions, cross apply, pivot, and on and on. It is, however, hopefully enough to give you a sense of the kind of work the database engine needs to do.
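As a rough illustration of the logical order described above, here is a small Python model of the query from the question (SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area). It is only a sketch of the logical sequence (group, aggregate, remove duplicates, sort), not of how any real engine plans or executes the query; the run_query function and the sample rows are hypothetical.

```python
from itertools import groupby

def run_query(rows):
    """Logical model of:
    SELECT DISTINCT max(age), area FROM T_USER GROUP BY area ORDER BY area
    """
    by_area = lambda r: r["area"]

    # GROUP BY area: sort first so that rows of the same group are adjacent,
    # mirroring the "effective ORDER BY" an engine may perform while grouping.
    grouped = groupby(sorted(rows, key=by_area), key=by_area)

    # Aggregation and select list: one output row per group.
    projected = [(max(r["age"] for r in members), area) for area, members in grouped]

    # DISTINCT: drop duplicate output rows (a no-op here, since each area
    # already yields exactly one row).
    distinct = list(dict.fromkeys(projected))

    # ORDER BY area: applied last, on the final rows.
    return sorted(distinct, key=lambda row: row[1])

t_user = [
    {"area": "north", "age": 30},
    {"area": "south", "age": 25},
    {"area": "north", "age": 41},
]
result = run_query(t_user)
print(result)
```

A real optimizer is free to reorder or merge these steps as long as the result is the same, which is why the grouping sort here can also satisfy the final ORDER BY.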

MySQL: Does LIMIT reduce the number of calls to user-defined functions?

I have a computationally expensive user-defined function that I need to use against a large dataset. I neither sort nor ask for a row count (no FOUND_ROWS). If I specify LIMIT as part of the query, does the MySQL engine actually stop calling the function after getting the LIMIT rows, or does it run the function against the entire dataset regardless? Example:
select cols, .. where fingerprint_match(col, arg) > score limit 5;
Ideally, fingerprint_match would be called as few as 5 times if first (random) rows resulted in a passing score.
As documented under Optimizing LIMIT Queries:
MySQL sometimes optimizes a query that has a LIMIT row_count clause and no HAVING clause:
[ deletia ]
As soon as MySQL has sent the required number of rows to the client, it aborts the query unless you are using SQL_CALC_FOUND_ROWS.
I believe the query will stop processing as soon as the specified number of matches are found but ONLY IF there is no ORDER BY clause. Otherwise it must find and sort all matches before applying the limit.
The only evidence I have for this is the statement in the docs that "LIMIT 0 quickly returns an empty set. This can be useful for checking the validity of a query.". This suggests to me that it doesn't bother applying the where clause to any rows once the limit has already been satisfied.
http://dev.mysql.com/doc/refman/5.6/en/limit-optimization.html
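One way to check this kind of claim empirically is to count how often the function is actually invoked. The sketch below does that with Python's built-in sqlite3 module, so it demonstrates the measuring technique on SQLite and says nothing definitive about MySQL; the prints table, its columns, and the always-passing fingerprint_match stub are assumptions made for illustration.

```python
import sqlite3

def count_calls(sql, total_rows=1000):
    """Run sql against a throwaway table and report (rows returned, UDF calls)."""
    conn = sqlite3.connect(":memory:")
    counter = {"n": 0}

    def fingerprint_match(value, arg):
        # Stub standing in for the expensive function: count the call and
        # return a score that always passes the filter.
        counter["n"] += 1
        return 100

    conn.create_function("fingerprint_match", 2, fingerprint_match)
    # No index and no primary key, so ORDER BY payload needs a real sort.
    conn.execute("CREATE TABLE prints (col INTEGER, payload INTEGER)")
    conn.executemany(
        "INSERT INTO prints VALUES (?, ?)", [(i, -i) for i in range(total_rows)]
    )
    rows = conn.execute(sql).fetchall()
    conn.close()
    return len(rows), counter["n"]

unsorted_rows, unsorted_calls = count_calls(
    "SELECT col FROM prints WHERE fingerprint_match(col, 1) > 50 LIMIT 5"
)
sorted_rows, sorted_calls = count_calls(
    "SELECT col FROM prints WHERE fingerprint_match(col, 1) > 50 "
    "ORDER BY payload LIMIT 5"
)
print("without ORDER BY:", unsorted_calls, "calls")
print("with ORDER BY:   ", sorted_calls, "calls")
```

The same idea carries over to MySQL by having the function record its invocations (for example in a session variable or a log) and comparing the counts with and without ORDER BY.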

Rewrite a group-by over a randomly-ordered sub-query using only one select

Here's the thing. I have 3 tables, and I'm running this query:
select t.nomefile, t.tipo_nome, t.ordine
from
(select nomefile, tipo_nome, categorie.ordine
from t_gallerie_immagini_new as immagini
join t_gallerie_new as collezioni on collezioni.id=immagini.id_ref
join t_gallerie_tipi as categorie on collezioni.type=categorie.id
order by RAND()
) as t
group by t.tipo_nome
order by t.ordine
It's applied to 3 tables, all in a 1-N relationship, which need to be joined so that 1 random result is taken for each distinct value in the higher-level table. This query works just fine; the problem is that I'm being asked to rewrite this query USING ONLY ONE SELECT. I've come up with another way of doing this with only one select, but according to SQL syntax the GROUP BY must come before the ORDER BY, so it's pointless to order by random when you already have only the first record for each value in the higher-level table.
Does anyone have a clue on how to write this query using only one select?
Generally, if I am not much mistaken, an ORDER BY clause in the subquery of a query like this has to do with a technique that allows you to pull non-GROUP BY columns (in the outer query) according to the order specified. And so you may be out of luck here, because that means the subquery is important to this query.
Well, because in this specific case the order chosen is BY RAND() and not by a specific column/set of columns, you may have a very rough equivalent by doing both the joins and the grouping on the same level, like this:
select nomefile, tipo_nome, categorie.ordine
from t_gallerie_immagini_new as immagini
join t_gallerie_new as collezioni on collezioni.id=immagini.id_ref
join t_gallerie_tipi as categorie on collezioni.type=categorie.id
group by tipo_nome
order by categorie.ordine
You must understand, though, why this is not an exact equivalent. The thing is, MySQL does allow you to pull non-GROUP BY columns in a GROUP BY query, but if they are not correlated to the GROUP BY columns, then the values returned would be... no, not random, the term used by the manual is indeterminate. On the other hand, the technique mentioned in the first paragraph takes advantage of the fact that if the row set is ordered explicitly and unambiguously prior to grouping, then the non-GROUP BY column values will always be the same*. So indeterminateness has to do with the fact that "normally" rows are not ordered explicitly before grouping.
Now you can probably see the difference. The original version orders the rows explicitly. Even if it's BY RAND(), it is intentionally so, to ensure (as much as possible) different results in the output most of the time. But the modified version is "robbed" of the explicit ordering, and so you are likely to get identical results for many executions in a row, even if they are kind of "random".
So, in general, I consider your problem unsolvable for the above stated reasons, and if you choose to use something like the suggested modified version, then just be aware that it is likely to behave slightly differently from the original.
* The technique may not be well documented, by the way, and may have been found rather empirically than by following manuals.
I was not able to understand the reasons behind the request to rewrite this query; however, I found out that there is a solution which uses the word "select" only once. Here's the query:
SELECT g.type, SUBSTRING_INDEX(GROUP_CONCAT(
i.nomefile ORDER BY
RAND()),',',1) nomefile
FROM t_gallerie_new g JOIN t_gallerie_immagini_new i ON g.id=i.id_ref
GROUP BY g.type;
I'm posting it for anyone interested in this question.
NOTE: The use of GROUP_CONCAT has a couple of downsides. It is not recommended to use this function on medium or large tables, since it could increase the server-side load. Also, there is a limit on the size of the string returned by GROUP_CONCAT, 1024 by default, so it's necessary to modify a parameter in the MySQL server (group_concat_max_len) to be able to receive a longer string from this function.
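To show what the SUBSTRING_INDEX(GROUP_CONCAT(... ORDER BY RAND()), ',', 1) expression computes, here is a small Python model of the same steps: shuffle each group's values, join them with commas, and keep the text before the first comma. This is only a sketch of the logic, not of MySQL's implementation; the one_random_per_group function and the sample rows are hypothetical.

```python
import random
from collections import defaultdict

def one_random_per_group(rows, rng):
    """Model of the GROUP_CONCAT trick for (group_key, filename) pairs."""
    groups = defaultdict(list)
    for group_key, filename in rows:
        groups[group_key].append(filename)

    picked = {}
    for group_key, filenames in groups.items():
        shuffled = filenames[:]
        rng.shuffle(shuffled)                # ORDER BY RAND() inside GROUP_CONCAT
        concatenated = ",".join(shuffled)    # GROUP_CONCAT with its default ',' separator
        picked[group_key] = concatenated.split(",", 1)[0]  # SUBSTRING_INDEX(..., ',', 1)
    return picked

rows = [(1, "a.jpg"), (1, "b.jpg"), (2, "c.jpg")]
result = one_random_per_group(rows, random.Random(0))
print(result)
```

The model also makes one more caveat visible: because the value is cut at the first comma, a file name that itself contains a comma would come back truncated.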

MySQL LIMIT operator - equal efficiency when not using it?

How is it that when I use LIMIT, MySQL checks the same number of rows? And how do I solve this?
The WHERE clause is processed first. Once the matches are found, the limit is applied on the result set, so all of the rows have to be evaluated to determine if they match the conditions before the limit can be applied.
The EXPLAIN output is misleading. MySQL will evaluate the query using the WHERE clause and such, but it'll stop after it finds LIMIT number of matching rows (1 in this case). This is a known issue with MySQL: http://bugs.mysql.com/bug.php?id=50168
To clarify... the limit clause will work as expected... it's only the explain output that's inaccurate.