Creating a primary key data health expectation in Palantir Foundry Code Repositories

I have a dataset that is the output of a Python transform defined in a Palantir Foundry Code Repository. It has a primary key, but since the data may change over time, I want to validate that this primary key continues to hold.
How can I create a data health expectation or check to ensure the primary key holds in the future?

You can define data expectations in your Python transform, for example:
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

@transform_df(
    Output("/path/to/output", checks=[
        Check(E.primary_key("thing_id"), "primary_key: thing_id"),
    ]),
    source_df=Input("/path/to/input"),
)
def compute(source_df):
    return source_df.select("thing_id", "thing_name").distinct()
More information is available in the Palantir Foundry documentation on defining data expectations.
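If you want to attach more than one check to the same output, or control whether a violation fails the build or only warns, something like the following sketch should work. This is illustrative only: it assumes your transforms version supports the on_error argument on Check and the E.col(...).non_null() expectation, so verify both names against the documentation above.
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

@transform_df(
    Output("/path/to/output", checks=[
        # Fail the build if the primary key no longer holds.
        Check(E.primary_key("thing_id"), "primary_key: thing_id", on_error="FAIL"),
        # Only warn (and surface it in data health) if thing_name is ever null.
        Check(E.col("thing_name").non_null(), "non_null: thing_name", on_error="WARN"),
    ]),
    source_df=Input("/path/to/input"),
)
def compute(source_df):
    return source_df.select("thing_id", "thing_name").distinct()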

Related

How can I read and write column descriptions and typeclasses in Foundry transforms?

I want to read the column descriptions and typeclasses from my upstream datasets, then I want to simply pass them through to my downstream datasets.
How can I do this in Python Transforms?
If you upgrade your repository to at least 1.206.0, you'll be able to access a new feature inside the Transforms Python API: reading and writing column descriptions and typeclasses.
The column_typeclasses property gives back a structured Dict<str, List<Dict<str, str>>>; for example, a column called tags could have a column_typeclasses value of {'tags': [{"name": "my_name", "kind": "my_kind"}]}. A typeclass always consists of two components, a name and a kind, which are present in every dictionary of the list shown above. These are the only two keys that can appear in this dict, and the corresponding value for each key must be a str. The column_descriptions property, by contrast, is a plain Dict<str, str> mapping each column name to its description.
Full documentation is in the works for this feature, so stay tuned.
from transforms.api import transform, Input, Output

@transform(
    my_output=Output("ri.foundry.main.dataset.my-output-dataset"),
    my_input=Input("ri.foundry.main.dataset.my-input-dataset"),
)
def my_compute_function(my_input, my_output):
    recent = my_input.dataframe().limit(10)
    existing_typeclasses = my_input.column_typeclasses
    existing_descriptions = my_input.column_descriptions
    my_output.write_dataframe(
        recent,
        column_descriptions=existing_descriptions,
        column_typeclasses=existing_typeclasses,
    )
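If you want to set new metadata rather than pass it through, write_dataframe accepts dictionaries you build yourself in the same keyword arguments. A minimal sketch based on the shapes described above; the column name, description text, and typeclass values here are made-up examples:
from transforms.api import transform, Input, Output

@transform(
    my_output=Output("ri.foundry.main.dataset.my-output-dataset"),
    my_input=Input("ri.foundry.main.dataset.my-input-dataset"),
)
def my_compute_function(my_input, my_output):
    my_output.write_dataframe(
        my_input.dataframe(),
        # Dict<str, str>: column name -> description text.
        column_descriptions={"tags": "Free-form labels attached to each record"},
        # Dict<str, List<Dict<str, str>>>: each typeclass is exactly a name and a kind.
        column_typeclasses={"tags": [{"name": "my_name", "kind": "my_kind"}]},
    )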

How to save JSON to Table Storage in Azure without deserialization?

I have a message coming to me from the outside world in JSON format. I know for sure which fields are going to be the partition key and row key in the JSON, but I don't want to deserialize all the properties of that object. How can I save this string using the available Table API SDK?
E.g., this is my input string:
{
    "name": "John Doe",
    "age": 38,
    "country": "USA",
    "phone_number": "+123456789",
    "current_balance": 100500,
    "subscribed": false
}
For that particular example, I would like to use the name as the row key and the country as the partition key. At the same time, I don't want to deserialize the whole entity, but I do want to store all the data available. However, using the Microsoft.Azure.Cosmos.Table package I found no way to just dump this data directly to the table. What is the preferable way to do that?
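Purely as an illustration of the idea described above (keep the JSON as an untyped property bag and only pick out the two key fields), here is a sketch using the Python azure-data-tables package rather than the .NET Microsoft.Azure.Cosmos.Table package the question is about; the package choice, connection string, and table name are assumptions:
import json
from azure.data.tables import TableClient

raw = '{"name":"John Doe","age":38,"country":"USA","phone_number":"+123456789","current_balance":100500,"subscribed":false}'
doc = json.loads(raw)  # parse into a plain dict, not a typed model

entity = {
    "PartitionKey": doc["country"],  # country as partition key
    "RowKey": doc["name"],           # name as row key
    # keep every remaining property as-is, without a model class
    **{k: v for k, v in doc.items() if k not in ("country", "name")},
}

client = TableClient.from_connection_string("<connection-string>", table_name="people")  # placeholders
client.create_entity(entity=entity)
In the .NET SDKs, the DynamicTableEntity type plays a comparable role of an entity with an arbitrary property dictionary.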

Explicitly providing a schema in the form of JSON in Spark / MongoDB integration

When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema
As a shortcut, there is sample code showing how one can provide the MongoDB Spark connector with a sample schema:
case class Character(name: String, age: Int)
val explicitDF = MongoSpark.load[Character](sparkSession)
explicitDF.printSchema()
I have a collection which has a constant document structure. I can provide a sample JSON, but creating a sample object manually is impossible (30k properties in a document, 1.5 MB average size). Is there a way for Spark to infer the schema just from that JSON and circumvent the MongoDB connector's initial sampling, which is quite exhaustive?
Spark is able to infer the schema, especially from sources that have one, such as an RDBMS. For instance, for an RDBMS it executes a simple query returning nothing but the table columns with their types (SELECT * FROM $table WHERE 1=0).
For the sampling, it will read all documents unless you specify the configuration option called samplingRatio, like this:
sparkSession.read.option("samplingRatio", 0.1)
With the above, Spark will only read 10% of the data. You can of course set any value you want. But be careful: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others do not), the schema deduced by Spark may be incomplete and in the end you may miss some data.
Some time ago I wrote a post about schema projection if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read
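Another option the question hints at is deriving the schema from the sample JSON itself and handing it to the connector explicitly, so no sampling pass over the collection is needed. A rough PySpark sketch; the file path, URI, and format name are placeholders and depend on your connector version:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Let Spark infer the schema from the single sample document only.
sample_schema = spark.read.json("/path/to/sample_document.json").schema

# Pass that schema explicitly when reading from MongoDB so the connector
# does not have to sample the collection to deduce it.
df = (spark.read
      .format("com.mongodb.spark.sql.DefaultSource")  # format name varies by connector version
      .option("uri", "mongodb://host:27017/db.collection")
      .schema(sample_schema)
      .load())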

How can we do feature selection on JSON data?

I have a large dataset in JSON format from which I want to extract the important attributes which capture the most variance. I want to extract these attributes to build a search engine on the dataset, with these attributes being the hash key.
The main question being asked here is how to do feature selection on JSON data.
You could read the data into a pandas DataFrame object with the pandas.read_json() function, and then use this DataFrame object to gain insight into your data. For example:
import pandas

data = pandas.read_json(json_file)
data.head()   # Displays the top five rows
data.info()   # Displays a description of the data
Or you can use matplotlib on this DataFrame to plot a histogram for each numerical attribute:
import matplotlib.pyplot as plt
data.hist(bins=50, figsize=(20, 15))
plt.show()
If you are interested in the correlation of attributes, you can use the pandas.plotting.scatter_matrix() function (pandas.scatter_matrix() in older pandas versions).
You have to manually pick the attributes that best fit your task; these tools help you understand the data and gain insight into it.
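As a concrete illustration of the correlation step, here is a small sketch; the file path is hypothetical and the numeric-column filter assumes your JSON has at least some numeric attributes:
import pandas
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

data = pandas.read_json("data.json")             # hypothetical path
numeric = data.select_dtypes(include="number")   # keep numeric attributes only

print(numeric.corr())                            # pairwise correlation matrix
scatter_matrix(numeric, figsize=(12, 8))         # scatter plots + per-column histograms
plt.show()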

How to create an S3 Key object without validation?

Is there a way to create a Key for a connection without validation using boto? The docs say there is a validate parameter, but it doesn't exist in the 2.23 source (which is supposedly the same version as the docs).
I need a workaround to avoid doing the lookup on the key.
The get_key() method in boto.s3.bucket.Bucket performs a HEAD request on the object to verify that it exists. If you are sure the object exists and don't want the overhead of the HEAD request, simply create the Key object directly like this:
import boto.s3
from boto.s3.key import Key
conn = boto.s3.connect_to_region('us-east-1')
bucket = conn.get_bucket('mybucket', validate=False)
key = Key(bucket, 'mykeyname')
This avoids the HEAD request and still allows you to perform normal operations on the Key object. Note, however, that the HEAD request retrieves certain metadata about the Key in question such as its content-type, size, ETag, etc. The Key object constructed directly will not have that information available.
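As a usage sketch (the region, bucket, key name, and payload are placeholders), the directly constructed Key behaves like any other Key for reads and writes:
import boto.s3
from boto.s3.key import Key

conn = boto.s3.connect_to_region('us-east-1')
bucket = conn.get_bucket('mybucket', validate=False)  # no bucket lookup
key = Key(bucket, 'mykeyname')                        # no HEAD request issued

key.set_contents_from_string('hello world')           # upload
print(key.get_contents_as_string())                   # download (returns bytes)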