There are several good posts on how to number rows within groups with MySQL, but how does the code actually work? I'm unclear on what MySQL evaluates first in the code below.
For instance, placing @yearqt := yearqt as bloc before the IF() call produces different results, and I'm unclear on the role of the s1 subquery in initializing the @ variables: when are they updated as MySQL runs through the data rows? Is the ORDER BY clause run before the SELECT?
The code below selects three random records per yearqt group. There may be other ways to do this, but the question pertains to how the code works, not how I could do this differently or whether I can do this more efficiently. Thank you.
select * from (
select customer_id , yearqt , a ,
IF(@yearqt = yearqt , @rownum := @rownum + 1 , @rownum := 1) as rownum ,
@yearqt := yearqt as bloc
from
( select customer_id , yearqt , rand(123) as a from tbl
order by rand(123)
) a join ( select @rownum := 0 , @yearqt := '' ) s1
order by yearqt
) s2
where rownum <= 3
order by bloc
This question is related to how the engine retrieves SQL SELECT query results. The order is roughly the following:
Calculate explain plan
Calculate sets and join them using plan's directives (FROM / JOIN phase)
Apply WHERE clause
Apply GROUP BY/HAVING clause
Apply ORDER BY clause
Projection phase: every row returned is ordered and can now be 'displayed'.
So, with respect to the variables, you can now see why there's a subquery to initialize them. This subquery is evaluated only once, at the beginning of the process.
After that, the projection phase appears to evaluate each selected expression in the order you wrote it, which is why moving @yearqt := yearqt as bloc up one position changes the outcome of the IF() expression. Since each row is projected once, any work you do on the variables is done as many times as there are rows in the final result set.
The purpose of this
join ( select @rownum := 0 , @yearqt := '' ) s1
is to initialize the user-defined variables at the beginning of statement execution. Because this is a rowsource for the outer query (MySQL calls it a derived table) this will be executed BEFORE the outer query runs. We aren't really interested in what this query returns, except that it returns a single row, because of the JOIN operation.
So this inline view s1 could be omitted from the query and be replaced by a couple of SET statements that are executed immediately before the query:
SET @rownum := 0;
SET @yearqt := '';
But then we'd have three separate statements to run, and we'd get different output from the query if these weren't run, if those variables were set to some other value. By including this in the query itself, it's a single statement, and we remove the dependency on separate SET statements.
This is the query that's really doing the work, whittled down to just the two expressions that matter in this case
SELECT IF(@yearqt = t.yearqt , @rownum := @rownum + 1 , @rownum := 1) as rownum
, @yearqt := t.yearqt as bloc
FROM ( ... ) t
ORDER BY t.yearqt
Some key points that make this "work"
MySQL processes the expressions in the SELECT list in the order that they appear in the SELECT list.
MySQL processes the rows in the order specified in the ORDER BY.
The references to user-defined variables are evaluated for each row, not once at the beginning of the statement.
Note that the MySQL Reference Manual points out that this behavior is not guaranteed. (So, it may change in a future release.)
So, the processing of that can be described as
for the first expression:
compare the value of the yearqt column from the current row with current value of @yearqt user-defined variable
set the value of @rownum user-defined variable
return the result of the IF() expression in the resultset
for the second expression:
set the value of the @yearqt user-defined variable to the value of the yearqt column from the current row
return the value of the yearqt column in the resultset
The net effect is that for each row processed, we're comparing the value in the yearqt column to the value from the "previously" processed row, and we're saving the current value to compare to the next row.
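The steps above can be mimicked outside the database. What follows is a minimal Python sketch, not MySQL itself: the sample yearqt values, the function name number_rows, and the assign_first flag are all made up for illustration. It assumes the rows arrive already ordered and that the two SELECT-list expressions are evaluated left to right for each row, which is the behavior described above and which the manual does not guarantee.

```python
# Simulates the per-row evaluation described above; this is not MySQL itself.
# The sample rows and the name number_rows are hypothetical.

def number_rows(rows, assign_first=False):
    """Return (yearqt, rownum) pairs, mimicking the two SELECT-list expressions."""
    rownum, prev_yearqt = 0, ""          # the s1 subquery: runs once, before any row
    out = []
    for yearqt in sorted(rows):          # ORDER BY yearqt
        if assign_first:                 # "@yearqt := yearqt" placed before the IF()
            prev_yearqt = yearqt
        # IF(@yearqt = yearqt, @rownum := @rownum + 1, @rownum := 1)
        rownum = rownum + 1 if prev_yearqt == yearqt else 1
        if not assign_first:             # original order: assignment after the IF()
            prev_yearqt = yearqt
        out.append((yearqt, rownum))
    return out

rows = ["2011Q1", "2011Q1", "2011Q2", "2011Q2", "2011Q2"]
original = number_rows(rows)                     # numbering restarts per group
swapped = number_rows(rows, assign_first=True)   # comparison is always true
print(original)
print(swapped)
```

With the assignment moved ahead of the IF(), the comparison sees the current row's value on both sides, so the counter never resets. That is why the position of the bloc expression changes the result.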
Related
Why do I have to assign to a session variable for it to have the right number in a query like this:
SELECT @row_number := @row_number + 1, name FROM cities;
Instead of something like:
SELECT @row_number, name FROM cities;
In the second form it returns what I'm guessing is the last row number. Maybe even the value of a COUNT(*). It's almost as if the value is somehow closed over. What is going on in these two queries?
You have the @row_number variable. Every time the SQL below reads a record, it increments the variable by one and shows the result.
SELECT @row_number := @row_number + 1, name FROM cities;
If you are using MySQL 8.0+, you can use the ROW_NUMBER() window function to achieve the same result:
select row_number() over (order by <pk>) rn, name from cities;
Turning back to SELECT @row_number, name FROM cities;: here you are not incrementing @row_number, so every row shows the same value, namely whatever was last assigned to @row_number.
PS: please also note that your query has no ORDER BY clause, which may lead to inconsistent row numbering.
You need to assign to it in order to add 1 to the value on each row. If you don't do this, you get the same value on every row, which isn't a row number. It will be whatever was left from the last time you assigned the variable, which might be the total number of rows from a previous query that was correctly incrementing.
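To make the "left over from last time" effect concrete, here is a small Python sketch. It only imitates a session variable with a dictionary; the session dict, the city names, and both function names are assumptions made for illustration.

```python
# Imitates a session variable that outlives the statement that set it.
session = {"row_number": 0}            # hypothetical session state
cities = ["Lima", "Oslo", "Pune"]      # made-up rows

def with_assignment():
    """SELECT @row_number := @row_number + 1, name FROM cities;"""
    out = []
    for name in cities:
        session["row_number"] += 1     # assignment happens once per row
        out.append((session["row_number"], name))
    return out

def without_assignment():
    """SELECT @row_number, name FROM cities;"""
    # The variable is only read, so every row sees the same stale value.
    return [(session["row_number"], name) for name in cities]

first = with_assignment()
second = without_assignment()
print(first)
print(second)
```

In this sketch the second call reports 3 on every row, because that is the value the first call left behind. It resembles a row count only by coincidence.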
If you're using MySQL 8.x you can replace this use of session variables with the ROW_NUMBER() function.
Instead of a session variable, on MySQL 8+ you can use ROW_NUMBER(); on versions below MySQL 8 you can do this:
SELECT @row_number := @row_number + 1, name
FROM cities,
(SELECT @row_number:= 0) AS x;
I was going through this answer How do you select every n-th row from mysql. In that I am not able to understand the initialisation in the following subquery.
SELECT
@row := @row +1 AS rownum, [column name]
FROM (
SELECT @row :=0) r, [table name]
How exactly the initialisation of
SELECT @row :=0
is working?
Is some kind of join happening between table ‘r’ and ‘table name’?
If I change above query as below, would there be any difference in the performance?
SET @row = 0;
SELECT
@row := @row +1 AS rownum, [column name] FROM [table name]
Please share your thoughts.
Performance will be equivalent if you use two statements and initialize the user-defined variable in a separate statement.
Instead of the SET statement, we could do
SELECT @row := 0
which would achieve the same result, assigning a value to the user-defined variable @row. The difference would be that MySQL needs to prepare a resultset to be returned to the client. We avoid that with the SET statement, which doesn't return a resultset.
With two separate statement executions, there's the overhead of sending an extra statement: parsing tokens, syntax check, semantic check, ... and returning the status to the client. It's a small amount of overhead. We aren't going to notice it onesie-twosie.
So performance will be equivalent.
I strongly recommend ditching the oldschool comma syntax for join operation, and using the JOIN keyword instead.
Consider the query:
SELECT t.foo
FROM r
CROSS
JOIN t
ORDER BY t.foo
What happens when the table r is guaranteed to contain exactly one row?
The query is equivalent to:
SELECT t.foo
FROM t
ORDER BY t.foo
We can use a SELECT query in place of a table or view. Consider for example:
SELECT v.foo
FROM ( SELECT t.foo
FROM t
) v
Also consider what happens with this query:
SELECT @foo := 0
There is no FROM clause (or Oracle-style FROM dual), so the query will return a single row. The expression in the SELECT list is evaluated... the constant value 0 is assigned to the user-defined variable @foo.
Consider this query:
SELECT 'bar'
FROM ( SELECT @foo := 0 ) r
Before the outer query runs, the SELECT inside the parens is executed. (MySQL calls it a "derived table" but more generically it's an inline view definition.) The net effect is that the constant 0 is assigned to the user-defined variable, and a single row is returned. So the outer query returns a single row.
If we understand that, we have what we need to understand what is happening here:
SELECT t.mycol
FROM ( SELECT @row := 0 ) r
CROSS
JOIN mytable t
ORDER
BY t.mycol
Inline view r is evaluated, the SELECT returns a single row, and the value 0 is assigned to user-defined variable @row. Since r is guaranteed to return a single row, we know that the Cartesian product (cross join) with mytable will result in one row for each row in mytable, effectively yielding just a copy of mytable.
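The "one row times N rows is N rows" argument can be checked with a tiny Python sketch that uses itertools.product as a stand-in for the cross join. The table contents are made up, and nothing here touches a real database.

```python
from itertools import product

# r stands in for ( SELECT @row := 0 ): exactly one row.
r = [{"row": 0}]
# mytable is a made-up three-row table.
mytable = [{"mycol": "b"}, {"mycol": "a"}, {"mycol": "c"}]

# CROSS JOIN: every row of r paired with every row of mytable.
joined = [t for _, t in product(r, mytable)]
result = sorted(row["mycol"] for row in joined)   # ORDER BY t.mycol

# If r returned no rows at all, the cross join would be empty too.
joined_with_empty_r = list(product([], mytable))

print(result)
print(joined_with_empty_r)
```

The empty case shows why it matters that r returns exactly one row: zero rows would wipe out the result, and more than one would duplicate every row of mytable.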
To answer the question that wasn't asked:
The benefit of doing the initialization within the statement rather than a separate statement is that we now have a single statement that stands alone. It knocks out a dependency i.e. doesn't require a separate execution of a SET statement to assign the user defined variable. Which also cuts out a roundtrip to the database to prepare and execute a separate statement.
I am trying to optimize following query.
SELECT t3.*,
(SELECT SUM(t4.col_sum)
FROM (...) t4
WHERE t4.timestamp BETWEEN CONCAT(SUBSTR(t3.timestamp, 1, 11), "00:00:00") AND t3.timestamp)
AS cum_sum
FROM (...) t3
Here (...) is a placeholder for a long query that returns two columns: timestamp and col_sum. I want to add a third column to that result: a cumulative sum of col_sum.
The problem is I am putting same big query in two places (...)
Is there a way I can obtain a result and use the result in those two/multiple places (...)?
One method is to use a temporary table.
Probably a more efficient method is to use variables to calculate a cumulative sum. It would be something like:
select t.*,
(@c := if(@t = left(t.timestamp, 11), @c + t.col_sum,
if(@t := left(t.timestamp, 11), t.col_sum, t.col_sum)
)
) as cumesum
from (. . .) t cross join
(select @t := '', @c := 0) vars
order by t.timestamp;
The above query orders the rows by timestamp. The variable @t keeps track of the first 11 characters in the timestamp -- as I read your logic, you want to do the cumulative sum only within a group where this is constant.
The variable @c keeps track of the cumulative sum, restarting from the current row's col_sum when a new "first 11 characters" value is encountered. The logic looks a bit complicated, but it is best to put all variable assignments in a single expression, because MySQL does not guarantee the order of evaluation of expressions.
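As a rough illustration of the grouping logic, here is a Python sketch of a running sum that restarts whenever the first 11 characters of the timestamp change, with each row's own col_sum included (matching the BETWEEN range in the question). The sample rows and the name cumulative_by_day are assumptions, and the loop only imitates what the variables do row by row.

```python
# Made-up (timestamp, col_sum) rows standing in for the result of (...).
rows = [
    ("2015-01-01 09:00:00", 5),
    ("2015-01-01 10:00:00", 7),
    ("2015-01-02 08:00:00", 2),
    ("2015-01-02 12:00:00", 4),
]

def cumulative_by_day(rows):
    """Running sum of col_sum that restarts when the date prefix changes."""
    group, total = "", 0                 # like (select @t := '', @c := 0)
    out = []
    for ts, col_sum in sorted(rows):     # like order by t.timestamp
        key = ts[:11]                    # like left(t.timestamp, 11)
        if key == group:
            total += col_sum             # same group: keep accumulating
        else:
            group, total = key, col_sum  # new group: restart from this row
        out.append((ts, col_sum, total))
    return out

result = cumulative_by_day(rows)
for row in result:
    print(row)
```

Keeping the comparison and both updates inside one branch mirrors the advice to put all variable assignments in a single expression.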
i'm preparing a presentation about one of our apps and was asking myself the following question: "based on the data stored in our database, how much growth has happened over the last couple of years?"
so i'd like to basically show in one output/graph, how much data we're storing since beginning of the project.
my current query looks like this:
SELECT DATE_FORMAT(created,'%y-%m') AS label, COUNT(id) FROM table GROUP BY label ORDER BY label;
the example output would be:
11-03: 5
11-04: 200
11-05: 300
unfortunately, this query is missing the accumulation. i would like to receive the following result:
11-03: 5
11-04: 205 (200 + 5)
11-05: 505 (200 + 5 + 300)
is there any way to solve this problem in mysql without the need of having to call the query in a php-loop?
Yes, there's a way to do that. One approach uses MySQL user-defined variables (and behavior that is not guaranteed)
SELECT s.label
, s.cnt
, @tot := @tot + s.cnt AS running_subtotal
FROM ( SELECT DATE_FORMAT(t.created,'%y-%m') AS `label`
, COUNT(t.id) AS cnt
FROM articles t
GROUP BY `label`
ORDER BY `label`
) s
CROSS
JOIN ( SELECT @tot := 0 ) i
Let's unpack that a bit.
The inline view aliased as s returns the same resultset as your original query.
The inline view aliased as i returns a single row. We don't really care what it returns (except that we need it to return exactly one row because of the JOIN operation); what we care about is the side effect, a value of zero gets assigned to the @tot user variable.
Since MySQL materializes the inline view as a derived table, before the outer query runs, that variable gets initialized before the outer query runs.
For each row processed by the outer query, the value of cnt is added to @tot.
The return of s.cnt in the SELECT list is entirely optional, it's just there as a demonstration.
N.B. The MySQL reference manual specifically states that this behavior of user-defined variables is not guaranteed.
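For comparison, the accumulation being asked for can be sketched in a few lines of Python with itertools.accumulate, using the three sample months from the question. This only illustrates the arithmetic the running_subtotal column performs; it is not a replacement for the query.

```python
from itertools import accumulate

# (label, cnt) pairs as returned by the inner query, already ordered by label.
monthly = [("11-03", 5), ("11-04", 200), ("11-05", 300)]

labels = [label for label, _ in monthly]
counts = [cnt for _, cnt in monthly]

# Each element is the sum of all counts up to and including that row,
# which is what adding cnt to the variable on every row produces.
running = list(accumulate(counts))

report = list(zip(labels, counts, running))
for label, cnt, total in report:
    print(f"{label}: {cnt} (running total {total})")
```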
Guys I want to use analytical function lag in mysql. In Oracle it is supported but I can't do it in Mysql. So can anybody help me how to perform lag operation in Mysql?
For example
UID Operation
1 Logged-in
2 View content
3 Review
I want to use lag function so that my output would be as follows
UID Operation Lagoperation
1 Logged-in
2 View content Logged-in
3 Review View content
Does Mysql support lag function???
You can emulate it with user variables:
select uid, operation, previous_operation from (
select
y.*
, @prev AS previous_Operation
, @prev := Operation
from
your_table y
, (select @prev:=NULL) vars
order by uid
) subquery_alias
Here you initialize your variable(s). It's the same as writing SET @prev:=NULL; before writing your query.
, (select @prev:=NULL) vars
Then the order of these expressions in the select clause is important:
, @prev AS previous_Operation
, @prev := Operation
The first just displays the variable's value, the second assigns the value of the current row to the variable.
It's also important to have an ORDER BY clause, as the output is otherwise not deterministic.
All this is put into a subquery purely for aesthetic reasons, to filter out this
, @prev := Operation
column.
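The read-then-assign order can be sketched in Python using the three sample rows from the question. The function name with_lag is hypothetical, and the loop merely imitates what the two SELECT-list expressions do on each row.

```python
# Sample rows from the question: (uid, operation).
rows = [(1, "Logged-in"), (2, "View content"), (3, "Review")]

def with_lag(rows):
    """Attach the previous row's operation to each row, like LAG(operation)."""
    prev = None                        # like (select @prev := NULL)
    out = []
    for uid, operation in sorted(rows):      # like order by uid
        out.append((uid, operation, prev))   # first: read the variable
        prev = operation                     # then: assign the current value
    return out

lagged = with_lag(rows)
for row in lagged:
    print(row)
```

Swapping the two statements inside the loop would make every row report its own operation, which is why the order of the two expressions matters.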