MySQL running total computed column message - mysql

I am using the following query to compute a running balance for my credit card. My bank does not provide running balance column on its website, so I am using a SQL query to compute it myself. This code works for me, except my IDE shows the following warning message, "Setting user variables within expressions is deprecated and will be removed in a future release. Consider alternatives: 'SET variable=expression, ...', or 'SELECT expression(s) INTO variables(s)'".
SET #running_sum = 100;
SELECT
post_date,
category,
amount,
(#running_sum := #running_sum + amount) AS running_total
FROM credit_card;
I also noticed the following statement under MySQL version 8 section 9.4 User-Defined Variables, "Previous releases of MySQL made it possible to assign a value to a user variable in statements other than SET. This functionality is supported in MySQL 8.0 for backward compatibility but is subject to removal in a future release of MySQL".
From these messages it appears to me that MySQL does not like that I am using this expression within my query (#running_sum := #running_sum + amount). However, I can't fully understand how am I supposed to re-write this query, so I can achieve the same result of computing running balance, and comply with MySQL version 8 code style.
Thank you in advance for your response.

Without ORDER BY the order of the result set is unspecified and your query could return different results on each execution.
Assuming that the result set should be sorted by post_date and the sum should start at 0 (and not at 100) the following query based on window functions yields the same result set:
SELECT
`post_date`,
`category`,
`amount`,
SUM(`amount`) OVER (ORDER BY `post_date` ASC) AS `running_total`
FROM `credit_card`;
Why the functionality was deprecated:
Important Change: Setting user variables in statements other than SET
is now deprecated due to issues that included those listed here:
The order of evaluation for expressions involving user variables was
undefined.
The default result type of a variable is based on its type at the
beginning of the statement, which could have unintended effects when a
variable holding a value of one type at the beginning of a statement
was assigned a new value of a different type in the same statement.
HAVING, GROUP BY, and ORDER BY clauses, when referring to a variable
that was assigned a value in the select expression list, did not work
as expected because the expression was evaluated on the client and so
it was possible for stale column values from a previous row to be
used.

Related

Is there an equivalent way to do MySQL's JSONOBJECTAGG() in MariaDB?

The server I was working on for a data project crashed and I am now recreating the database. I used to be working on a MySQL database, and now I'm using MariaDB. I have never used MariaDB before.
Previously, I used the following command to insert some data into one table from another:
CREATE TABLE collaborators_list
SELECT awards.id, awards.researcher_name, awards.organization_id,
JSON_OBJECTAGG(awards.fiscal_year, coapplicants.coapplicant_name,
coapplicants.organization_number)
AS 'coapplicants_list' FROM awards
INNER JOIN coapplicants
ON awards.id=coapplicants.id
GROUP BY awards.researcher_name, awards.organization_id;
Basically, I want to do the same thing in MariaDB. I tried looking here:
https://mariadb.com/kb/en/library/json-functions/
but unless I am misreading something, none of these is what I really want...
Help!
No, MariaDB still does not support JSON_ARRAYAGG and JSON_OBJECTAGG functions. A JIRA ticket has been raised for requesting this feature: https://jira.mariadb.org/browse/MDEV-16620
Now, from the docs of JSON_OBJECTAGG():
It takes only two column names or expressions as arguments, the
first of these being used as a key and the second as a value.
An error occurs if any key name is NULL or the number of arguments is
not equal to 2.
However, you are specifying three arguments in JSON_OBJECTAGG(awards.fiscal_year, coapplicants.coapplicant_name, coapplicants.organization_number); so your attempted query will not work as well.
Now, in the absence of the required functions, we can utilize Group_Concat() with Concat(). I am assuming that you need only first two arguments (as explained in previous para).
GROUP_CONCAT( DISTINCT CONCAT('"', awards.fiscal_year, '": "',
coapplicants.coapplicant_name, '"')
SEPARATOR ', ')
Note that, in case of string getting very very long, Group_Concat() may truncate it. So, you can increase the allowed length, by executing the following query, before the above query:
SET SESSION group_concat_max_len = ##max_allowed_packet;

Common Table Expressions -- Using a Variable in the Predicate

I've written a common table expression to return hierarchical information and it seems to work without issue if I hard code a value into the WHERE statement. If I use a variable (even if the variable contains the same information as the hard coded value), I get the error The maximum recursion 100 has been exhausted before statement completion.
This is easier shown with a simple example (note, I haven't included the actual code for the CTE just to keep things clearer. If you think it's useful, I can certainly add it).
This Works
WITH Blder
AS
(-- CODE IS HERE )
SELECT
*
FROM Blder as b
WHERE b.PartNo = 'ABCDE';
This throws the Max Recursion Error
DECLARE #part CHAR(25);
SET #part = 'ABCDE'
WITH Blder
AS
(-- CODE IS HERE )
SELECT
*
FROM Blder as b
WHERE b.PartNo = #part;
Am I missing something silly? Or does the SQL engine handle hardcoded values and parameter values differently in this type of scenario?
Kindly put semicolon at the end of your variable assignment statement
SET #part ='ABCDE';
Your SELECT statement is written incorrectly: the SQL Server Query Optimizer is able to optimize away the potential cycle if fed the literal string, but not when it's fed a variable, which uses the plan that developed from the statistics.
SQL Server 2016 improved on the Query Optimizer, so if you could migrate your DB to SQL Server 2016 or newer, either with the DB compatibility level set to 130 or higher (for SQL Server 2016 and up), or have it kept at 100 (for SQL Server 2008) but with OPTION (USE HINT ('ENABLE_QUERY_OPTIMIZER_HOTFIXES')) added to the bottom of your SELECT statement, you should get the desired result without the max recursion error.
If you are stuck on SQL Server 2008, you could also add OPTION (RECOMPILE) to the bottom of your SELECT statement to create an ad hoc query plan that would be similar to the one that worked correctly.

MySQL 5.7 RAND() and IF() without LIMIT leads to unexpected results

I have the following query
SELECT t.res, IF(t.res=0, "zero", "more than zero")
FROM (
SELECT table.*, IF (RAND()<=0.2,1, IF (RAND()<=0.4,2, IF (RAND()<=0.6,3,0))) AS res
FROM table LIMIT 20) t
which returns something like this:
That's exactly what you would expect. However, as soon as I remove the LIMIT 20 I receive highly unexpected results (there are more rows returned than 20, I cut it off to make it easier to read):
SELECT t.res, IF(t.res=0, "zero", "more than zero")
FROM (
SELECT table.*, IF (RAND()<=0.2,1, IF (RAND()<=0.4,2, IF (RAND()<=0.6,3,0))) AS res
FROM table) t
Side notes:
I'm using MySQL 5.7.18-15-log and this is a highly abstracted example (real query is much more difficult).
I'm trying to understand what is happening. I do not need answers that offer work arounds without any explanations why the original version is not working. Thank you.
Update:
Instead of using LIMIT, GROUP BY id also works in the first case.
Update 2:
As requested by zerkms, I added t.res = 0 and t.res + 1 to the second example
The problem is caused by a change introduced in MySQL 5.7 on how derived tables in (sub)queries are treated.
Basically, in order to optimize performance, some subqueries are executed at different times and / or multiple times leading to unexpected results when your subquery returns non-deterministic results (like in my case with RAND()).
There are two easy (and likewise ugly) workarounds to get MySQL to "materialize" (aka return deterministic results) these subqueries: Use LIMIT <high number> or GROUP BY id both of which force MySQL to materialize the subquery and return the expected results.
The last option is turn off derived_merge in the optimizer_switch variable: derived_merge=off (make sure to leave all the other parameters as they are).
Further readings:
https://mysqlserverteam.com/derived-tables-in-mysql-5-7/
Subquery's rand() column re-evaluated for every repeated selection in MySQL 5.7/8.0 vs MySQL 5.6

Why the order of evaluation for expressions involving user variables is undefined?

From MySQL Manual the output of the following query is not guaranteed to be same always.
SET #a := 0;
SELECT
#a AS first,
#a := #a + 1 AS second,
#a := #a + 1 AS third,
#a := #a + 1 AS fourth,
#a := #a + 1 AS fifth,
#a := #a + 1 AS sixth;
Output:
first second third fourth fifth sixth
0 1 2 3 4 5
Quoting from the Manual:
However,the order of evaluation for expressions involving user
variables is undefined;
I want to know the story behind.
So my question is : Why the order of evaluation for expressions involving user variables is undefined?
The order of evaluation of expressions in the select is undefined. For the most part, you only notice this when you have variables, because the errors result in erroneous information.
Why? The SQL standard does not require the order of evaluation, so each database is free to decide how to evaluate the expressions. Typically such decisions are left to the optimizer.
TL;DR MySQL user-defined variables are not intended to be used that way. An SQL statement describes a result set, not a series of operations. The documentation isn't clear about what variable assignments even mean. But you can't both read and write a variable. And assignment order within SELECT clause is not defined. And all you can assume is that assignments in an outer SELECT clause are done for some one output row.
Almost all the code you see like yours has undefined behaviour. Some sensible people demonstrate via the implementation code for operators & optimization what a particular implementation actually does. But that behaviour can't be relied on for the next release.
Read the documentation. Reading and writing the same variable is undefined. When it's not done, any variable read is fixed within a statement. There is no order to assignments. For SELECTs with only DETERMINISTIC functions (whose values are determined by argument values) the result is defined by a conceptual evaluation execution. But there is no connection between that and user variable. What an assignment ever means is not clear: the documention says "each select expression is evaluated only when sent to the client". This seems to be saying that there's no guarantee a row is even "selected" except in the sense of put into a result set per an outermost SELECT clause. The order of assignments in a SELECT is not defined. And even if assignments are conceptually done for every row, they can only depend on the row value, so that's the same as saying the assignment is done only once, for some row. And since assignment order is not defined, that row can be any row. So assuming that that is what the documentation means, all you can expect is that if you don't read and write from the same variable in a SELECT statement then each variable assignment in the outermost SELECT will have happened in some order for one output row.
It depends on database's optimizer's decision. That's why it's uncertain. But mostly optimizer decides as the way we predict the result.

Could this simple T-SQL update fail when running on multiple processors?

Assuming that all values of MBR_DTH_DT evaluate to a Date data type other than the value '00000000', could the following UPDATE SQL fail when running on multiple processors if the CAST were performed before the filter by racing threads?
UPDATE a
SET a.[MBR_DTH_DT] = cast(a.[MBR_DTH_DT] as date)
FROM [IPDP_MEMBER_DEMOGRAPHIC_DECBR] a
WHERE a.[MBR_DTH_DT] <> '00000000'
I am trying to find the source of the following error
Error: 2014-01-30 04:42:47.67
Code: 0xC002F210
Source: Execute csp_load_ipdp_member_demographic Execute SQL Task
Description: Executing the query "exec dbo.csp_load_ipdp_member_demographic" failed with the following error: "Conversion failed when converting date and/or time from character string.". Possible failure reasons: Problems with the query, "ResultSet" property not set correctly, parameters not set correctly, or connection not established correctly.
End Error
It could be another UPDATE or INSERT query, but the otehrs in question appear to have data that is proeprly typed from what I see,, so I am left onbly with the above.
No, it simply sounds like you have bad data in the MBR_DTH_DT column, which is VARCHAR but should be a date (once you clean out the bad data).
You can identify those rows using:
SELECT MBR_DTH_DT
FROM dbo.IPDP_MEMBER_DEMOGRAPHIC_DECBR
WHERE ISDATE(MBR_DTH_DT) = 0;
Now, you may only get rows that happen to match the where clause you're using to filter (e.g. MBR_DTH_DT = '00000000').
This has nothing to do with multiple processors, race conditions, etc. It's just that SQL Server can try to perform the cast before it applies the filter.
Randy suggests adding an additional clause, but this is not enough, because the CAST can still happen before any/all filters. You usually work around this by something like this (though it makes absolutely no sense in your case, when everything is the same column):
UPDATE dbo.IPDP_MEMBER_DEMOGRAPHIC_DECBR
SET MBR_DTH_DT = CASE
WHEN ISDATE(MBR_DTH_DT) = 1 THEN CAST(MBR_DTH_DT AS DATE)
ELSE MBR_DTH_DT END
WHERE MBR_DTH_DT <> '00000000';
(I'm not sure why in the question you're using UPDATE alias FROM table AS alias syntax; with a single-table update, this only serves to make the syntax more convoluted.)
However, in this case, this does you absolutely no good; since the target column is a string, you're just trying to convert a string to a date and back to a string again.
The real solution: stop using strings to store dates, and stop using token strings like '00000000' to denote that a date isn't available. Either use a dimension table for your dates or just live with NULL already.
Not likely. Even with multiple processors, there is no guarantee the query will processed in parallel.
Why not try something like this, assuming you're using SQL Server 2012. Even if you're not, you could write a UDF to validate a date like this.
UPDATE a
SET a.[MBR_DTH_DT] = cast(a.[MBR_DTH_DT] as date)
FROM [IPDP_MEMBER_DEMOGRAPHIC_DECBR] a
WHERE a.[MBR_DTH_DT] <> '00000000' And IsDate(MBR_DTH_DT) = 1
Most likely you have bad data are are not aware of it.
Whoops, just checked. IsDate has been available since SQL 2005. So try using it.