Use custom checks in the Palantir Foundry transform decorator - palantir-foundry

I can specify checks in the transform decorator, such as Primary Key. Can I also specify a custom check which applies a lambda function, for example? Thanks!
I read the documentation and couldn't find an existing check type that conforms to my use case.
EDIT:
Here's a code example of what I am trying to accomplish. For example, I want to check if an array column only contains distinct elements. The transform check should raise a warning if my UDF returns false. This is how I would implement the check with an extra column (rather than using checks):
from pyspark.sql import functions as F
from pyspark.sql import types as T

# UDF that returns True when all elements of the array are distinct
@F.udf(returnType=T.BooleanType())
def check_for_distinct_array_elements(arr):
    return len(set(arr)) == len(arr)

df = (
    df
    .withColumn('my_array_col1', F.array(F.lit('first'), F.lit('second'), F.lit('third')))
    .withColumn('my_array_col2', F.array(F.lit('first'), F.lit('first')))
    .withColumn('custom_check1', check_for_distinct_array_elements(F.col('my_array_col1')))
    .withColumn('custom_check2', check_for_distinct_array_elements(F.col('my_array_col2')))
)

You can create custom checks; here is an example for your described use case. I simply extend the _SizeExpectation expectation to change how the value to check is computed, and add an eq function in order to work around the expectation-factory step.
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output
from pyspark.sql import Column
from pyspark.sql import types as T
from transforms.expectations.utils._expectation_utils import check_columns_exist, check_column_type
from transforms.expectations.evaluator import EvaluationTarget
from transforms.expectations.column._column_expectation import _SizeExpectation
from transforms.api import Check
import operator


class CountDuplicatesExpectation(_SizeExpectation):
    def __init__(self, col, op=None, threshold=None):
        super(CountDuplicatesExpectation, self).__init__(col, op, threshold)
        self._col = col
        self._op = op
        self._threshold = threshold

    @check_columns_exist
    @check_column_type([T.ArrayType], lambda self: self._col)
    def value(self, target: EvaluationTarget) -> Column:
        return F.size(self._col) - F.size(F.array_distinct(self._col))

    def eq(self, target):
        return CountDuplicatesExpectation(self._col, operator.eq, target)


@transform_df(
    Output(
        "<output-rid>",
        checks=[
            Check(CountDuplicatesExpectation('values').eq(0), 'custom check', 'WARN'),
        ]
    ),
    ...
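For completeness, here is a minimal sketch of what the full transform might look like; the input rid, the source_df parameter name, and the compute body are placeholder assumptions of mine, not part of the original answer:

@transform_df(
    Output(
        "<output-rid>",
        checks=[
            Check(CountDuplicatesExpectation('values').eq(0), 'custom check', 'WARN'),
        ]
    ),
    source_df=Input("<input-rid>"),
)
def compute(source_df):
    # produce the array column that the expectation above inspects
    return source_df.withColumn(
        'values', F.array(F.lit('first'), F.lit('second'), F.lit('third'))
    )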

Related

Assign unique IDs in parallel [duplicate]

I have a JDBC connection between Apache Spark and PostgreSQL, and I want to insert some data into my database. When I use append mode I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?
Scala:
If all you need is unique numbers you can use zipWithUniqueId and recreate the DataFrame. First, some imports and dummy data:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(
  ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")
Extract schema for further usage:
val schema = df.schema
Add id field:
val rows = df.rdd.zipWithUniqueId.map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}
Create DataFrame:
val dfWithPK = sqlContext.createDataFrame(
  rows, StructType(StructField("id", LongType, false) +: schema.fields))
The same thing in Python:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType

row = Row("foo", "bar")
df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()
row_with_index = Row(*["id"] + df.columns)

def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)

df_with_pk = (df.rdd
    .zipWithUniqueId()
    .map(lambda x: f(*x))
    .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
If you prefer consecutive numbers, you can replace zipWithUniqueId with zipWithIndex, but it is a little bit more expensive.
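For reference, a minimal sketch of that swap on the Python side, reusing the f helper defined above:

# same pipeline as above, but with consecutive ids 0, 1, 2, ...
df_with_consecutive_pk = (df.rdd
    .zipWithIndex()
    .map(lambda x: f(*x))
    .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))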
Directly with DataFrame API:
(universal Scala, Python, Java, R with pretty much the same syntax)
Previously I missed the monotonicallyIncreasingId function, which should work just fine as long as you don't require consecutive numbers:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar| id|
// +---+----+-----------+
// | a|-1.0|17179869184|
// | b|-2.0|42949672960|
// | c|-3.0|60129542144|
// +---+----+-----------+
While useful, monotonicallyIncreasingId is non-deterministic. Not only may the ids differ from execution to execution, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.
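One common workaround (not from the original answer) is to materialise the ids once before any filtering, sketched below in Python with the toy columns from above; note that cache() is only a soft guarantee, and checkpointing or writing to storage is stronger:

from pyspark.sql.functions import monotonically_increasing_id

df_with_id = df.withColumn("id", monotonically_increasing_id())
df_with_id.cache()       # pin the plan so the ids are not recomputed on a different plan
df_with_id.count()       # force evaluation
filtered = df_with_id.filter("bar < -1.5")   # ids stay aligned with df_with_id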
Note:
It is also possible to use rowNumber window function:
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()
Unfortunately:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
So unless you have a natural way to partition your data and ensure uniqueness, it is not particularly useful at this moment.
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
Note that the 2nd argument of df.withColumn is monotonically_increasing_id(), not monotonically_increasing_id.
I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those desiring consecutive integers.
In this case, we're using pyspark and relying on dictionary comprehension to map the original row object to a new dictionary which fits a new schema including the unique index.
from pyspark.sql.types import StructType, StructField, IntegerType

# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)

# need to zip together with a unique integer;
# first create a new schema with the uuid field appended
newSchema = StructType([StructField("uuid", IntegerType(), False)]
                       + dfNoIndex.schema.fields)

# zip with the index, map it to a dictionary which includes the new field
df = (dfNoIndex.rdd
      .zipWithIndex()
      .map(lambda row_id: {k: v
                           for k, v
                           in list(row_id[0].asDict().items()) + [("uuid", row_id[1])]})
      .toDF(newSchema))
For anyone else who doesn't require integer types, concatenating the values of several columns whose combinations are unique across the data can be a simple alternative. You have to handle nulls since concat/concat_ws won't do that for you. You can also hash the output if the concatenated values are long:
import pyspark.sql.functions as sf

unique_id_sub_cols = ["a", "b", "c"]

df = df.withColumn(
    "UniqueId",
    sf.md5(
        sf.concat_ws(
            "-",
            *[
                sf.when(sf.col(sub_col).isNull(), sf.lit("Missing")).otherwise(
                    sf.col(sub_col)
                )
                for sub_col in unique_id_sub_cols
            ]
        )
    ),
)

Python-Sqlalchemy Binary Column Type HEX() and UNHEX()

I'm attempting to learn Sqlalchemy and utilize an ORM. One of my columns stores file hashes as binary. In SQL, the select would simply be
SELECT type, column FROM table WHERE hash = UNHEX('somehash')
How do I achieve a select like this (ideally with an insert example, too) using my ORM? I've begun reading about column overrides, but I'm confused/not certain that that's really what I'm after.
e.g.
res = session.query.filter(Model.hash == __something__? )
Thoughts?
Only for selects and inserts
Well, for select you could use:
>>> from sqlalchemy import func
>>> session = (...)
>>> (...)
>>> engine = create_engine('sqlite:///:memory:', echo=True)
>>> q = session.query(Model.id).filter(Model.some == func.HEX('asd'))
>>> print q.statement.compile(bind=engine)
SELECT model.id
FROM model
WHERE model.some = HEX(?)
For insert:
>>> from sqlalchemy import func
>>> session = (...)
>>> (...)
>>> engine = create_engine('sqlite:///:memory:', echo=True)
>>> m = Model(hash=func.HEX('asd'))
>>> session.add(m)
>>> session.commit()
INSERT INTO model (hash) VALUES (HEX(%s))
A better approach: Custom column that converts data by using sql functions
But I think the best option for you is a custom column in SQLAlchemy that uses process_bind_param, process_result_value, bind_expression and column_expression; see this example.
Check the code below; it creates a custom column that I think fits your needs:
from sqlalchemy.types import VARCHAR
from sqlalchemy import func

class HashColumn(VARCHAR):
    def bind_expression(self, bindvalue):
        # convert the bind's type from String to HEX encoded
        return func.HEX(bindvalue)

    def column_expression(self, col):
        # convert select value from HEX encoded to String
        return func.UNHEX(col)
You could model your table like:
from sqlalchemy import Column, types
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Model(Base):
    __tablename__ = "model"
    id = Column(types.Integer, primary_key=True)
    col = Column(HashColumn(20))

    def __repr__(self):
        return "Model(col=%r)" % self.col
Some usage:
>>> (...)
>>> session = create_session(...)
>>> (...)
>>> model = Model(col='Iuri Diniz')
>>> session.add(model)
>>> session.commit()
this issues this query:
INSERT INTO model (col) VALUES (HEX(?)); -- ('Iuri Diniz',)
More usage:
>>> session.query(Model).first()
Model(col='Iuri Diniz')
this issues this query:
SELECT
model.id AS model_id, UNHEX(model.col) AS model_col
FROM model
LIMIT ? ; -- (1,)
A bit more:
>>> session.query(Model).filter(Model.col == "Iuri Diniz").first()
Model(col='Iuri Diniz')
this issues this query:
SELECT
model.id AS model_id, UNHEX(model.col) AS model_col
FROM model
WHERE model.col = HEX(?)
LIMIT ? ; -- ('Iuri Diniz', 1)
Extra: Custom column that converts data by using python types
Maybe you want to use some beautiful custom type and want to convert it between python and the database.
In the following example I convert UUIDs between Python and the database (the code is based on this link):
import uuid
from sqlalchemy.types import TypeDecorator, VARCHAR

class UUID4(TypeDecorator):
    """Portable UUID implementation

    >>> str(UUID4())
    'VARCHAR(36)'
    """
    impl = VARCHAR(36)

    def process_bind_param(self, value, dialect):
        if value is None:
            return value
        else:
            if not isinstance(value, uuid.UUID):
                return str(uuid.UUID(value))
            else:
                # hexstring
                return str(value)

    def process_result_value(self, value, dialect):
        if value is None:
            return value
        else:
            return uuid.UUID(value)
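A minimal usage sketch, with a table and column name of my own choosing (Item and token are not from the original answer):

from sqlalchemy import Column, Integer
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Item(Base):
    __tablename__ = "item"
    id = Column(Integer, primary_key=True)
    # stored as a 36-character string, returned to Python as uuid.UUID
    token = Column(UUID4, default=lambda: str(uuid.uuid4()))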
I wasn't able to get @iuridiniz's Custom column solution to work because of the following error:
sqlalchemy.exc.StatementError: (builtins.TypeError) encoding without a string argument
For an expression like:
m = Model(col='FFFF')
session.add(m)
session.commit()
I solved it by overriding process_bind_param, which processes the parameter before passing it to bind_expression for interpolation into your query language.
from sqlalchemy.types import VARCHAR
from sqlalchemy import func

class HashColumn(VARCHAR):
    def process_bind_param(self, value, dialect):
        # encode value as binary
        if value:
            return bytes(value, 'utf-8')

    def bind_expression(self, bindvalue):
        # convert the bind's type from String to HEX encoded
        return func.HEX(bindvalue)

    def column_expression(self, col):
        # convert select value from HEX encoded to String
        return func.UNHEX(col)
And then defining the table is the same:
from sqlalchemy import Column, types
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Model(Base):
    __tablename__ = "model"
    id = Column(types.Integer, primary_key=True)
    col = Column(HashColumn(20))

    def __repr__(self):
        return "Model(col=%r)" % self.col
I really like iuridiniz's approach, "A better approach: Custom column that converts data by using sql functions", but I had some trouble making it work when using BINARY and VARBINARY to store hex strings in MySQL 5.7. I tried different things, but SQLAlchemy kept complaining about the encoding, and/or the use of func.HEX and func.UNHEX in contexts where they couldn't be used. Using Python 3 and SQLAlchemy 1.2.8, I managed to make it work by extending the base class and replacing its processors, so that SQLAlchemy does not require a function from the database to bind the data and compute the result; rather, it is done within Python, as follows:
import codecs
from sqlalchemy.types import VARBINARY

class VarBinaryHex(VARBINARY):
    """Extend VARBINARY to handle hex strings."""

    impl = VARBINARY

    def bind_processor(self, dialect):
        """Return a processor that decodes hex values."""
        def process(value):
            return codecs.decode(value, 'hex')
        return process

    def result_processor(self, dialect, coltype):
        """Return a processor that encodes hex values."""
        def process(value):
            return codecs.encode(value, 'hex')
        return process

    def adapt(self, impltype):
        """Produce an adapted form of this type, given an impl class."""
        return VarBinaryHex()
The idea is to replace HEX and UNHEX, which require DBMS intervention, with Python functions that do just the same: encode and decode a hex string exactly as HEX and UNHEX do. If you connect to the database directly, you can use HEX and UNHEX, but from SQLAlchemy the codecs.encode and codecs.decode functions do the work for you.
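For illustration, a quick sketch of the round trip those processors perform (plain Python, no database involved; the variable names are mine):

import codecs

stored = codecs.decode(b"aabb", "hex")   # bind side: hex text -> raw bytes b'\xaa\xbb'
loaded = codecs.encode(stored, "hex")    # result side: raw bytes -> hex text b'aabb'
assert loaded == b"aabb"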
I bet that, if anybody were interested, by writing the appropriate processors one could even manage the hex values as integers from the Python perspective, allowing you to store integers greater than BIGINT allows.
Some considerations:
BINARY could be used instead of VARBINARY if the length of the hex string is known.
Depending on what you are going to do, it might be worth un-/capitalising the string in the constructor of the class that uses this column type, so that you work with a consistent capitalisation right at the moment of object initialization, i.e. 'aa' != 'AA' but 0xaa == 0xAA (see the sketch after this list).
As said before, you could consider a processor that converts DB binary hex values to Python integers.
When using VARBINARY, be careful because 'aa' != '00aa'.
If you use BINARY, let's say your column is col = Column(BinaryHex(length=4)), take into account that any value you provide with fewer than length bytes will be padded with zeros. I mean, if you set obj.col = 'aabb' and commit it, when you later retrieve it from the database, what you get is obj.col == 'aabb0000', which is something quite different.
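A minimal, hypothetical sketch of that capitalisation point (the Artifact model and its columns are names of mine, not from the answer above):

from sqlalchemy import Column, types
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Artifact(Base):
    __tablename__ = "artifact"
    id = Column(types.Integer, primary_key=True)
    checksum = Column(VarBinaryHex(32))

    def __init__(self, checksum=None, **kwargs):
        # normalise once at construction time, so 'AABB' and 'aabb' compare equal later
        if checksum is not None:
            checksum = checksum.lower()
        super().__init__(checksum=checksum, **kwargs)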

How to use MySQL's standard deviation (STD, STDDEV, STDDEV_POP) function inside SQLAlchemy?

I need to use the STD function of MySQL through SQLAlchemy, but after a couple of minutes of searching, it looks like there is no func.<> way of using this one in SQLAlchemy. Is it not supported, or am I missing something?
Found this issue while coding some aggregates on SQLAlchemy.
Citing the docs:
Any name can be given to func. If the function name is unknown to SQLAlchemy, it will be rendered exactly as is. For common SQL functions which SQLAlchemy is aware of, the name may be interpreted as a generic function which will be compiled appropriately to the target database.
Basically, func will generate a function matching whatever attribute you access on func, if it is not a common function that SQLAlchemy is aware of (like func.count).
To keep the advantages of the RDBMS abstraction that comes with any ORM, I always suggest using ANSI functions to decouple the code from the DB engine.
For a working sample you can add a connection string and execute the following code:
from sqlalchemy.orm import sessionmaker
from sqlalchemy import func, create_engine, Column
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.types import DateTime, Integer, String

# Add your connection string
engine = create_engine('My Connection String')
Base = declarative_base(engine)
Session = sessionmaker(bind=engine)
db_session = Session()

# Make sure to have a table foo in the db with foo_id, bar, baz columns
class Foo(Base):
    __tablename__ = 'foo'
    __table_args__ = {'autoload': True}

query = db_session.query(
    func.count(Foo.bar).label('count_agg'),
    func.avg(Foo.foo_id).label('avg_agg'),
    func.stddev(Foo.foo_id).label('stddev_agg'),
    func.stddev_samp(Foo.foo_id).label('stddev_samp_agg')
)

print(query.statement.compile())
It will generate the following SQL
SELECT count(foo.bar) AS count_agg,
       avg(foo.foo_id) AS avg_agg,
       stddev(foo.foo_id) AS stddev_agg,
       stddev_samp(foo.foo_id) AS stddev_samp_agg
FROM foo

setting a different default value for a column

How do I generate a different default value for a column in a SQLAlchemy model? In the following example, I am getting the same default value for every new instance of the model object.
import random, string

def randomword():
    length = 10
    return ''.join(random.choice(string.lowercase) for i in range(length))

class ModelFoo(AppBase):
    temp = Column("temp", String, default=randomword())
default=randomword() is wrong. Since the function has already been called, its result is a constant; it is not a callable any more. Pass a callable if you want a different value on every execution:
import random, string
from sqlalchemy import create_engine, Column, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker

Base = declarative_base()
engine = create_engine('sqlite:///foo.db')
Session = sessionmaker(bind=engine)
sess = Session()

def randomword():
    return ''.join(random.choice(string.lowercase) for i in xrange(10))

class Foo(Base):
    __tablename__ = 'foo'
    key = Column(String, primary_key=True, default=randomword)

Base.metadata.create_all(engine)
Demo:
>>> sess.add(Foo())
>>> sess.add(Foo())
>>> sess.add(Foo())
>>> sess.flush()
>>> [foo.key for foo in sess.query(Foo)]
[u'aerpkwsaqx', u'cxnjlgrshh', u'dszcgrbfxn']
default=randomword will solve the issue.
Not useful for your case, but there is another kind of default called server_default, which sits at the DB level. So even if you are inserting rows manually, server_default gets applied.
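A minimal sketch of server_default (the Bar model and its status column are illustrative names of mine):

from sqlalchemy import Column, String, text
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Bar(Base):
    __tablename__ = 'bar'
    key = Column(String, primary_key=True)
    # the default is part of the CREATE TABLE DDL, so it also applies to rows
    # inserted outside SQLAlchemy, e.g. a hand-written INSERT statement
    status = Column(String, server_default=text("'new'"))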

Unique Sequential Number to column

I need to create a sequence, but in a generic way, without using the Sequence class.
USN = Column(Integer, nullable=False, default=nextusn, server_onupdate=nextusn)
This function nextusn needs to generate the func.max(table.USN) value over the rows in the model.
I tried using this:
class nextusn(expression.FunctionElement):
    type = Numeric()
    name = 'nextusn'

@compiles(nextusn)
def default_nextusn(element, compiler, **kw):
    return select(func.max(element.table.c.USN)).first()[0] + 1
but in this context element does not know element.table. Is there a way to resolve this?
this is a little tricky, for these reasons:
your SELECT MAX() will return NULL if the table is empty; you should use COALESCE to produce a default "seed" value. See below.
the whole approach of inserting the rows with SELECT MAX is not at all safe for concurrent use - so you need to make sure only one INSERT statement at a time runs against the table, or you may get constraint violations (you should definitely have a constraint of some kind on this column).
from the SQLAlchemy perspective, you need your custom element to be aware of the actual Column element. We can achieve this either by assigning the "nextusn()" function to the Column after the fact, or below I'll show a more sophisticated approach using events.
I don't understand what you're going for with "server_onupdate=nextusn". "server_onupdate" in SQLAlchemy doesn't actually run any SQL for you, this is a placeholder if for example you created a trigger; but also the "SELECT MAX(id) FROM table" thing is an INSERT pattern, I'm not sure that you mean for anything to be happening here on an UPDATE.
The @compiles extension needs to return a string, running the select() there through compiler.process(). See below.
example:
from sqlalchemy import Column, Integer, create_engine, select, func, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.sql.expression import ColumnElement
from sqlalchemy.schema import ColumnDefault
from sqlalchemy.ext.compiler import compiles
from sqlalchemy import event


class nextusn_default(ColumnDefault):
    "Container for a nextusn() element."

    def __init__(self):
        super(nextusn_default, self).__init__(None)


@event.listens_for(nextusn_default, "after_parent_attach")
def set_nextusn_parent(default_element, parent_column):
    """Listen for when nextusn_default() is associated with a Column,
    assign a nextusn().
    """
    assert isinstance(parent_column, Column)
    default_element.arg = nextusn(parent_column)


class nextusn(ColumnElement):
    """Represent "SELECT MAX(col) + 1 FROM TABLE"."""

    def __init__(self, column):
        self.column = column


@compiles(nextusn)
def compile_nextusn(element, compiler, **kw):
    return compiler.process(
        select([
            func.coalesce(func.max(element.column), 0) + 1
        ]).as_scalar()
    )


Base = declarative_base()

class A(Base):
    __tablename__ = 'a'

    id = Column(Integer, default=nextusn_default(), primary_key=True)
    data = Column(String)

e = create_engine("sqlite://", echo=True)
Base.metadata.create_all(e)

# will normally pre-execute the default so that we know the PK value;
# result.inserted_primary_key will be available
e.execute(A.__table__.insert(), data='single row')

# will run the default expression inline within the INSERT
e.execute(A.__table__.insert(), [{"data": "multirow1"}, {"data": "multirow2"}])

# will also run the default expression inline within the INSERT,
# result.inserted_primary_key will not be available
e.execute(A.__table__.insert(inline=True), data='single inline row')