Is it possible to define a Spark SQL expression as a named function?

I would like to create/register a named function in Spark SQL whose body is defined as a Spark SQL expression. What I want is almost identical to a Spark UDF, but:
the logic is simple enough that it can be expressed entirely as a Spark SQL function
I don't want to use a UDF written in Scala or Python because that would incur ser/des overhead (especially so in Python, which I am using, but that shouldn't be relevant here since I am asking how to do this explicitly in Spark SQL)
I don't want to write the function as a DataFrame transformation in Pyspark or Scala Spark because I want this code to be immediately portable between Spark environments (pluggable as a string into spark.sql()) regardless of whether running on Python or Scala
I want the function to be callable as easily as select my_function(x) from tbl (identical to how a UDF named my_function would be called) since this is part of a simplified Spark interface for data scientists/analysts who are extensively familiar with SQL but not with other programming languages. The interface also exposes some project-specific UDFs we have written, but the analysts shouldn't have to care whether a function they're calling is a UDF or a spark expression under the hood. Effectively, I want a UDF's ease-of-callability without its performance overhead.
I have seen that certain built-in Spark functions accept anonymous lambda functions as arguments. For example, transform accepts a list and a lambda function to apply to every element of that list:
select
transform(
array(2022, 1950, 1701),
y -> (y div 100) + 1
) as centuries
/*
creates:
+------------+
| centuries|
+------------+
|[21, 20, 18]|
+------------+
*/
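For reference, the arithmetic inside the lambda can be checked outside Spark. This is only a plain-Python sketch: year_to_century is the hypothetical name used in this question, and // stands in for Spark SQL's div, which behaves the same way for these positive years.

```python
def year_to_century(y: int) -> int:
    # Integer-divide by 100, then add 1 (same formula as the lambda above).
    return y // 100 + 1

centuries = [year_to_century(y) for y in [2022, 1950, 1701]]
print(centuries)  # [21, 20, 18]
```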
However, I cannot find out how to define a lambda function with a name inside the spark function registry using this lambda syntax.
I effectively want to do this, all in Spark SQL:
CREATE FUNCTION year_to_century AS y -> y div 100 + 1
/*
assuming there exists a table named year_tbl:
|year |
|_____|
|2022 |
|1950 |
|1701 |
*/
select year, year_to_century(year) as century from year_tbl
and the result would look like this:
year | century
_____|________
2022 | 21
1950 | 20
1701 | 18
But I don't know what the correct syntax is for the "CREATE FUNCTION ..." line. Does Spark SQL even allow for such a thing?

Related

why aren't anonymous functions in Lua expressions?

Can anyone explain to me why the anonymous function construct in Lua isn't a fully fledged expression? To me this seems an oddity: it goes (slightly) against the idea that functions should be first class objects, and is (not often but occasionally) an inconvenience in what is mostly a really well-thought out and elegant language.
example, using the command line Lua, with workaround
Lua 5.3.3 Copyright (C) 1994-2016 Lua.org, PUC-Rio
> function(x) return x*x end (2)
stdin:1: <name> expected near '('
> square = function(x) return x*x end
> square(2)
4
Lua's function call syntax has some syntactic sugar built into it. You can call functions with 3 things:
A parenthesized list of values.
A table constructor (the function will take the table as a single argument).
A string literal.
Lua wants to be somewhat regular in its grammar. So if there's a thing which you can call as a function in one of these ways, then it should make sense to be able to call it in any of these ways.
Consider the following code:
local value = function(args)
  --does some stuff
end "I'm a literal" .. foo
If we allow arbitrary, unparenthesized expressions to be called just like any other function call, then this means to create a function, invoke it with the string literal, concatenate the result of that function call with foo, and store that in value.
But... do we actually want that to work? That is, do we want people to be able to write that and have it be valid Lua code?
If such code is considered unsightly or confusing, then there are a few options.
1. Lua could just not have function calls with string literals. You're only saving 2 parentheses, after all. Maybe even don't allow table constructors as well, though those are less unsightly and far less confusing. Make everyone use parentheses for all function calls.
2. Lua could make it so that only in the cases of lambdas are function calls with string literals prevented. This would require substantially de-regularizing the grammar.
3. Lua could force you to parenthesize any construct where calling a function is not an obviously intended result of the preceding text.
Now, one might argue that table_name[var_name] "literal" is already rather confusing as to what is going on. But again, preventing that specifically would require de-regularizing the grammar. You'd have to add in all of these special cases where something like name "literal" is a function call but name.name "literal" is not. So option 2 is out.
The ability to call a function with a string literal is hardly limited to Lua. JavaScript can do it, but you have to use a specific literal syntax to get it. Plus, being able to type require "module_name" feels like a good idea. Since such calls are considered an important piece of syntactic sugar, supported by several languages, option #1 is out.
So your only option is #3: make people parenthesize expressions that they want to call.
Oh I see.. round brackets are needed, sorry.
(function(x) return x*x end) (2)
I still don't see why it is designed like that.
Short Answer
To call a function, the function expression must be either a name, an indexed value, another function call, or an expression inside parentheses.
Long Answer
I don't know why it's designed that way, but I did look up the grammar to see exactly how it works. Here's the entry for a function call:
functioncall ::= prefixexp args | prefixexp ‘:’ Name args
"args" is just a list of arguments in parentheses. The relevant part is "prefixexp".
prefixexp ::= var | functioncall | ‘(’ exp ‘)’
Ok, so we can call another "functioncall". "exp" is just a normal expression:
exp ::= nil | false | true | Numeral | LiteralString | ‘...’ | functiondef | prefixexp | tableconstructor | exp binop exp | unop exp
So we can call any expression as long as it's inside parentheses. "functiondef" covers anonymous functions:
functiondef ::= function funcbody
funcbody ::= ‘(’ [parlist] ‘)’ block end
So an anonymous function is an "exp", but not a "prefixexp", so we do need parentheses around it.
What is "var"?
var ::= Name | prefixexp ‘[’ exp ‘]’ | prefixexp ‘.’ Name
"var" is either a name or an indexed value (usually a table). Note that the indexed value must be a "prefixexp", which means a string literal or table constructor must be in parentheses before we can index them.
To sum up: A called function must be either a name, an indexed value, a function call, or some other expression inside parentheses.
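The rule can be modelled as a small lookup. The following is only a Python sketch of the grammar quoted above, not anything taken from the Lua implementation; the kind names mirror the productions, and callable_without_parens and make_callable are hypothetical helper names.

```python
# Expression kinds that the grammar above lists under "prefixexp":
# var (a Name or an indexed value), functioncall, or '(' exp ')'.
PREFIXEXP_KINDS = {"Name", "indexed", "functioncall", "parenthesized"}

def callable_without_parens(kind: str) -> bool:
    """True if an expression of this kind is a prefixexp, so args may follow it."""
    return kind in PREFIXEXP_KINDS

def make_callable(kind: str) -> str:
    """Wrapping any other exp in parentheses turns it into a prefixexp."""
    return kind if callable_without_parens(kind) else "parenthesized"

for kind in ["Name", "functioncall", "functiondef", "LiteralString", "tableconstructor"]:
    print(kind, callable_without_parens(kind))
```

Under this model "functiondef" (an anonymous function) is an exp but not a prefixexp, which matches the error shown in the question.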
The big question is: Why is "prefixexp" treated differently from "exp"? I don't know. I suspect it has something to do with keeping function calls and indexing outside the normal operator precedence, but I don't know why that's necessary.

How to force addition using meta_table in lua?

I have defined addition using metatable as follows.
local matrix_meta = {}
matrix_meta.__add = function( ... )
  return matrix.add( ... )
end
I want to add variables using matrix_meta add. The following commands work well.
matrix(p)+q
matrix(p)+matrix(q)
p+matrix(q)
However the following code doesn't work.
p+q
The reason is obvious: p and q are not recognized as matrix objects, so Lua throws an error about attempting to perform arithmetic on table values. I am curious how to force matrix addition for such variables. That is, is it possible to execute something in Lua like env-Matrix: p+q or matrix_meta.__add: p,q so that p and q are automatically recognized as matrix objects? So the problem is to perform addition in a matrix environment where variables are treated as matrix objects. Note that I don't want this for only two variables; there may be more than two.
As defined in your comment
local p={{2,4,6},{8,10,12},{14,16,20}}
local q={{1,2,3},{8,10,12},{14,16,20}}
So unless you do something like
local p = setmetatable({{2,4,6},{8,10,12},{14,16,20}}, matrix_meta)
p and q are just regular Lua tables with no metamethods.
Arithmetic operations are not defined for Lua tables. Hence the error message.
If you don't like the Lua operators or its syntax, consider using another programming language.
It wouldn't hurt to write something like m({2,4,6},{8,10,12},{14,16,20}) instead of {{2,4,6},{8,10,12},{14,16,20}}.
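The same point can be illustrated outside Lua. Below is a rough Python analogy, not a Lua solution: the Matrix class is hypothetical, and Python's __add__/__radd__ play the role that __add in matrix_meta plays in the question. Operator overloading only applies once at least one operand is a wrapped object; two plain containers keep their built-in behaviour.

```python
class Matrix:
    """Minimal wrapper that gives nested lists element-wise addition."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    @staticmethod
    def _rows(other):
        # Accept either another Matrix or a plain nested list.
        return other.rows if isinstance(other, Matrix) else other

    def __add__(self, other):
        return Matrix([[a + b for a, b in zip(r1, r2)]
                       for r1, r2 in zip(self.rows, self._rows(other))])

    # Used when the left operand is a plain list: p + Matrix(q).
    __radd__ = __add__


p = [[2, 4, 6], [8, 10, 12], [14, 16, 20]]
q = [[1, 2, 3], [8, 10, 12], [14, 16, 20]]

print((Matrix(p) + q).rows)   # element-wise sum
print((p + Matrix(q)).rows)   # element-wise sum via __radd__
print(len(p + q))             # plain lists: built-in concatenation, 6 rows
```

As in Lua, p + q never reaches the matrix code, because neither operand carries it.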

functions in Module (Fortran) [duplicate]

I use Intel Visual Fortran. According to Chapman's book, declaring the type of a function in the routine that calls it is NECESSARY. But look at this piece of code:
module mod
  implicit none
contains
  function fcn ( i )
    implicit none
    integer :: fcn
    integer, intent (in) :: i
    fcn = i + 1
  end function
end module
program prog
  use mod
  implicit none
  print *, fcn ( 3 )
end program
It runs without that declaration in the calling routine (here prog). In fact, when I declare the function's type in the program prog or in any other unit, the compiler gives this error:
error #6401: The attributes of this name conflict with those made accessible by a USE statement. [FCN] Source1.f90 15
What am I doing wrong? Or, if I am right, how can this be justified?
You must be working with a very old copy of Chapman's book, or possibly misinterpreting what it says. Certainly a calling routine must know the type of a called function, and in Fortran-before-90 it was the programmer's responsibility to ensure that the calling function had that information.
However, since the 90 standard and the introduction of modules there are other, and better, ways to provide information about the called function to the calling routine. One of those ways is to put the called functions into a module and to use-associate the module. When your program follows this approach the compiler takes care of matters. This is precisely what your code has done and it is not only correct, it is a good approach, in line with modern Fortran practice.
Association is Fortran-standard-speak for the way(s) in which names (such as fcn) become associated with entities, such as the function called fcn. Use-association is the way implemented by writing use module in a program unit, thereby making all the names in module available to the unit which uses module. A simple use statement makes all the entities in the module known under their module-defined names. The use statement can be modified by an only clause, which means that only some module entities are made available. Individual module entities can be renamed in a use statement, thereby associating a different name with the module entity.
The error message you get if you include a (re-)declaration of the called function's type in the calling routine arises because the compiler will only permit one declaration of the called function's type.

SSIS Processing money fields with what looks like signs over the last digit

I have a fixed length flat file input file. The records look like this
40000003858172870114823 0010087192017092762756014202METFORMIN HCL ER 500 MG 0000001200000300900000093E00000009E00000000{0000001{00000104{JOHN DOE 196907161423171289 2174558M2A2 000 xxxx YYYYY 100000000000 000020170915001 00010000300 000003zzzzzz 000{000000000{000000894{ aaaaaaaaaaaaaaa P2017092700000000{00000000{00000000{00000000{ 0000000{00000{ F89863 682004R0900001011B2017101109656 500 MG 2017010100000000{88044828665760
If you look just before the JOHN DOE you will see a field that represents a money field. It looks like 00000104{.
This looks like the type of field I used to process from a mainframe many years ago. How do I handle this in SSIS? If the { on the end is in fact a 0, then I want the field to be a string that reads 0000010.40.
I have other money fields that look like, e.g., 00000159E. If my memory serves me correctly, that would be 0000015.95.
I can't find anything on how to do this transform.
Thanks,
Dick Rosenberg
import the values as strings
00000159E
00000104{
in derived column do your transforms with replace:
replace(replace(col,"E","5"),"{","0")
in another derived column cast to money and divide by 100
(DT_CY)(drvCol) / 100
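The two replace calls above cover only E and {. As a sketch of the general rule, here is the conventional signed-overpunch (zoned decimal) table in Python: { and A to I mean a positive last digit 0 to 9, and } and J to R mean a negative last digit 0 to 9. This assumes the file follows that convention and that every money field has two implied decimal places; decode_overpunch is a hypothetical helper, not an SSIS function, though the same lookup could be ported to a Script Component.

```python
from decimal import Decimal

# Conventional overpunch characters; the index in each string is the digit value.
POSITIVE = "{ABCDEFGHI"
NEGATIVE = "}JKLMNOPQR"

def decode_overpunch(field: str, scale: int = 2) -> Decimal:
    """Decode a zoned-decimal string whose last character carries the sign."""
    body, last = field[:-1], field[-1]
    if last in POSITIVE:
        sign, digit = 1, POSITIVE.index(last)
    elif last in NEGATIVE:
        sign, digit = -1, NEGATIVE.index(last)
    elif last.isdigit():
        sign, digit = 1, int(last)  # unsigned field, no overpunch
    else:
        raise ValueError(f"unexpected sign character: {last!r}")
    # Shift the decimal point left by `scale` places (assumed to be 2 here).
    return Decimal(sign * int(body + str(digit))).scaleb(-scale)

print(decode_overpunch("00000104{"))  # 10.40
print(decode_overpunch("00000159E"))  # 15.95
```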
I think you will need to either use a Script Component source in the data flow, or use a Derived Column transformation or Script Component transformation. I'd recommend a Script Component either way as it sounds like your custom logic will be fairly complex.
I have written a few detailed answers about how to implement a Script component source:
SSIS import a Flat File to SQL with the first row as header and last row as a total
How can I load in a pipe (|) delimited text file that has columns that sometimes contain line breaks?
Essentially, you need to locate the string, "00000104{", for example, and then convert it into decimal/money form before adding it into the data flow (or during it if you're using a Derived Column Transformation).
This could also be done in a Script Component transformation, which would function in a similar way to the Derived Column transformation, only you'd perhaps have a bit more scope for complex logic. Also in a Script Component transformation (as opposed to a source), you'd already have all of your other fields in place from the Flat File Source.

is it possible to work with local variables inside linq2sql for F#?

Linq example
<@ seq { for a in db.ArchiveAnalogs do
           for d in db.Deltas do
             let ack = ref false
             for ac in db.Acks do
               if a.Date > ac.StartDate && a.Date < ac.EndDate then
                 ack := true
             yield
               if a.ID = d.ID && a.Value > d.DeltaLimit then
                 Some(a.Date, d.AboveMessage, !ack)
               elif a.ID = d.ID && a.Value < d.DeltaLimit then
                 Some(a.Date, d.BelowMessage, !ack)
               else None }
@> |> query |> Seq.choose id |> Array.ofSeq
Error message :
The following construct was used in query but is not recognised by the F#-to-LINQ query translator:
Sequential (Call (None,
Void op_ColonEquals[Boolean](Microsoft.FSharp.Core.FSharpRef`1[System.Boolean], Boolean),
[Call (None,
Microsoft.FSharp.Core.FSharpRef`1[System.Boolean] Ref[Boolean](Boolean),
[Value (false)]), Value (true)]),
Call (None,
System.Collections.Generic.IEnumerable`1[Microsoft.FSharp.Core.FSharpOption`1[System.Tuple`3[System.DateTime,System.String,System.Boolean]]] Empty[FSharpOption`1](),
[]))
This is not a valid query expression. Check the specification of permitted queries and consider moving some of the query out of the quotation
No, it's not possible to do what you want. LINQ to SQL enables queries in the form of "expression trees" to be run against a database. The F# wrapper converts F# quotations into language-agnostic .NET expression trees, which are then fed to the standard LINQ to SQL machinery. There are two issues with what you're trying to do:
Expression trees only support a limited set of operations, and in .NET 3.5 this does not include assignments. Therefore, even though an F# quotation can contain assignments to local variables, there's no way for the query operator to translate this into a .NET expression tree.
Even if assignments could be expressed in an expression tree, how would they be translated to a SQL query by the LINQ to SQL infrastructure? The main idea behind LINQ to SQL is to ensure that the query is compiled to SQL and run on the database, but there's no obvious way to translate your logic into a database query.
Just to give some details about the error message: you can actually use let bindings inside LINQ to SQL computations. The problem is that you're using an imperative construct and imperative sequencing of expressions.
Your query contains something like this (where body is some expression that returns unit):
for v in e1 do
  body
yield e3
This is not allowed, because LINQ to SQL works on expression trees of the kind C# produces, and such a tree can contain just a single expression. To represent this, we'd need support for statements (but that doesn't quite make sense in the database world).
To remove the statement, the F# translator converts this to something (very roughly) like:
Special.Sequence
  (fun () -> for v in e1 do body)
  (fun () -> e3)
Now it is just a single expression (The Sequence method lives somewhere in F# libraries, so that you can compile F# quotations using LINQ and run them dynamically), but the LINQ to SQL translator doesn't understand Sequence and cannot deal with it.
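To make the statement-versus-expression distinction concrete, here is a plain-Python analogy (not F#, and not tested against LINQ to SQL). The mutable ack flag set inside a loop becomes a single existence test, so the whole query is one expression. The record types and the alerts function are hypothetical stand-ins for the tables in the question; whether the F# translator accepts an equivalent such as Seq.exists is an assumption that would need checking.

```python
from collections import namedtuple

# Hypothetical stand-ins for the ArchiveAnalogs, Deltas and Acks tables.
Analog = namedtuple("Analog", "id date value")
Delta = namedtuple("Delta", "id limit above_message below_message")
Ack = namedtuple("Ack", "start_date end_date")

def alerts(analogs, deltas, acks):
    """One expression: no assignment and no statement sequencing."""
    return [
        (a.date,
         d.above_message if a.value > d.limit else d.below_message,
         # Replaces the mutable flag: is the date inside any ack window?
         any(ac.start_date < a.date < ac.end_date for ac in acks))
        for a in analogs
        for d in deltas
        if a.id == d.id and a.value != d.limit
    ]

analogs = [Analog(1, 5, 12.0), Analog(1, 20, 3.0), Analog(2, 7, 1.0)]
deltas = [Delta(1, 10.0, "above", "below")]
acks = [Ack(0, 10)]
print(alerts(analogs, deltas, acks))
```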