Copying files from a Foundry dataset to another Foundry dataset - palantir-foundry

I have two Foundry datasets that contain raw files (let's say XML or CSV files). I would like to merge these two within a transform to create a new dataset containing the files from both.
(This particular example came about because an API schema was updated, which required merging the existing data with the new version.)
For example:
A: csv1, csv2, csv3, csv4, csv5 (source)
B: csv1, csv2, csv3 (target)

Because Foundry datasets store raw files, a simple Python transform using shutil.copyfileobj should do the trick. This is documented in the Palantir docs under transforms/python-raw-file-access#writing-files:
import shutil

for file_status in in_source.filesystem().ls(glob='*.csv'):
    # Stream each raw file from the source dataset into the output dataset
    with in_source.filesystem().open(file_status.path, 'rb') as in_f:
        with out.filesystem().open(file_status.path, 'wb') as out_f:
            shutil.copyfileobj(in_f, out_f)
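For completeness, here is a minimal end-to-end sketch of such a transform that merges both inputs. The dataset paths and the second input name are assumptions for illustration; the copy loop itself follows the documented pattern above.

import shutil
from transforms.api import transform, Input, Output

@transform(
    out=Output("/path/to/merged_output"),    # hypothetical output dataset
    in_source=Input("/path/to/dataset_A"),   # hypothetical source dataset (new schema)
    in_target=Input("/path/to/dataset_B"),   # hypothetical target dataset (existing data)
)
def merge_raw_files(out, in_source, in_target):
    # Copy every raw csv from both inputs into the output dataset;
    # files present in both inputs are written twice under the same name, last write wins.
    for dataset in (in_source, in_target):
        for file_status in dataset.filesystem().ls(glob='*.csv'):
            with dataset.filesystem().open(file_status.path, 'rb') as in_f:
                with out.filesystem().open(file_status.path, 'wb') as out_f:
                    shutil.copyfileobj(in_f, out_f)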

Related

Databricks Delta tables from json files: Ignore initial load when running COPY INTO

I am working with Databricks on AWS. I have mounted an S3 bucket as /mnt/bucket-name/. This bucket contains json files under the prefix jsons. I create a Delta table from these json files as follows:
%python
df = spark.read.json('/mnt/bucket-name/jsons')
df.write.format('delta').save('/mnt/bucket-name/delta')
%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
So far, so good. Then new json files arrive in the bucket. In order to update the Delta table, I run the following:
%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
This does indeed update the Delta table, but it duplicates the rows contained in the initial load, i.e. the rows in df are now contained in table_name twice. I have the following workaround, whereby I create an empty dataframe with the correct schema:
%python
df_schema = spark.read.json('/mnt/bucket-name/jsons').schema
df = spark.createDataFrame([], df_schema)
df.write.format('delta').save('/mnt/bucket-name/delta')
%sql
CREATE TABLE IF NOT EXISTS default.table_name
USING DELTA
LOCATION '/mnt/bucket-name/delta'
%sql
COPY INTO default.table_name
FROM '/mnt/bucket-name/jsons'
FILEFORMAT = JSON
This works and there is no duplication, but it seems neither elegant nor efficient, since spark.read.json('/mnt/bucket-name/jsons').schema reads all the json files, even though only the schema needs to be inferred. (The schema of the json files can be assumed to be stable.) Is there a way to tell COPY INTO to ignore the initial json files? There's the option modifiedAfter, but that would be cumbersome and doesn't sit well with idempotency. I also considered recreating the dataframe and then running df.write.format('delta').mode('append').save('/mnt/bucket-name/delta') followed by REFRESH TABLE default.table_name, but this seems inefficient, since why should the initial json files be read again? Edit: this method also duplicates the initial load.
Or is there a way to circumvent using a Spark dataframe entirely and create a Delta table from the json files directly? I have searched for such a solution but to no avail.
One last point: Schema inference is crucial and so I do not want a solution that requires the schema of the json files to be written out manually.
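If you keep the empty-table workaround, one way to make it cheaper (a sketch, assuming a single representative file such as /mnt/bucket-name/jsons/sample.json exists and the schema is stable, as stated) is to infer the schema from that one file instead of scanning the whole prefix:

%python
# Infer the schema from a single representative json file rather than every file under the prefix
sample_schema = spark.read.json('/mnt/bucket-name/jsons/sample.json').schema
spark.createDataFrame([], sample_schema) \
    .write.format('delta').save('/mnt/bucket-name/delta')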

Can Pyarrow non-legacy parquet datasets read and write to Azure Blob? (legacy system and Dask are able to)

Is it possible to read a parquet dataset from Azure Blob using the new, non-legacy pyarrow.dataset API?
I can read and write to blob storage with the old system, where fs is an fsspec filesystem:
pq.write_to_dataset(table=table.replace_schema_metadata(),
                    root_path=path,
                    partition_cols=[
                        'year',
                        'month',
                    ],
                    filesystem=fs,
                    version='2.0',
                    flavor='spark',
                    )
With Dask, I am able to read the data using storage options:
ddf = dd.read_parquet(path='abfs://analytics/iag-cargo/zendesk/ticket-metric-events',
                      storage_options={
                          'account_name': base.login,
                          'account_key': base.password,
                      })
But when I try using
import pyarrow.dataset as ds
dataset = ds.dataset()
Or
dataset = pq.ParquetDataset(path_or_paths=path, filesystem=fs, use_legacy_dataset=False)
I run into errors about invalid filesystem URIs. I have tried every combination I could think of, but I can't work out why Dask and the legacy system can read and write these files while the new API can't.
I'd like to test the row filtering and non-Hive partitioning.
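For reference, here is a sketch of what typically works with the non-legacy pyarrow.dataset API, assuming fs is the same fsspec/adlfs filesystem used above and the container-relative path matches the Dask example: wrap the fsspec filesystem explicitly instead of passing an abfs:// URI.

import pyarrow.dataset as ds
from pyarrow.fs import PyFileSystem, FSSpecHandler

# Wrap the fsspec (adlfs) filesystem so the new dataset API accepts it,
# and pass a container-relative path rather than an abfs:// URI.
pa_fs = PyFileSystem(FSSpecHandler(fs))
dataset = ds.dataset('analytics/iag-cargo/zendesk/ticket-metric-events',
                     filesystem=pa_fs,
                     format='parquet',
                     partitioning='hive')

# Row filtering with the new API, e.g. restrict to a single partition
table = dataset.to_table(filter=(ds.field('year') == 2021))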

Using NIFI how do you apply attributes to files in a zip from a json file also contained in the same zip?

Using Apache NiFi, I'd like to process a zip which contains a category.json file and a number of data files, as illustrated below.
somefile.zip
├──category.json
├──datafile-1
├──datafile-2
├──...
├──datafile-n
Example category.json
{
"category": "history",
"rating" : 5
}
What I'd like to do is unpack the files and apply the category.json data as attributes to each datafile.
What would be the best way to handle this problem?
Maybe not the best approach, but one way to do it:
1) Unzip the archive.
2) Use RouteOnAttribute based on the category.json filename.
3) Retrieve the category as an attribute from the category.json flowfile.
4) Zip all the files again, keeping the attributes.
5) Unzip again, keeping the attributes; all your flowfiles will then have the category attribute.
I'd recommend starting with a combination of ListFile and FetchFile (or GetFile on its own) to retrieve the archive, UnpackContent to extract the component files from the zip, RouteOnAttribute using the flowfile filename attribute to separate out the flowfile containing category.json, and EvaluateJsonPath to read the JSON content of that flowfile and populate selected values into attributes.
From there, it's unclear if your question is how to update the NiFi flowfile attributes for each flowfile containing one of the data files from that archive, or apply the extracted JSON to the data files on disk somewhere.
Assuming the former, you could either write the extracted JSON into a variable or parameter (use ExecuteScript to do so) and use UpdateAttribute to apply those attributes onto the other flowfiles resulting from the CompressContent processor.

Loading an entity relation triple csv as nodes

Suppose I have a csv file with data in the format (Subject, relation, Object).
Is it possible to load this into neo4j as a graph modeled such that the subject and object become nodes and the relation between them is the relation from the triple?
Essentially while loading from the csv, I want to load the subject and object as individual nodes and the relation is the one joining them.
(subject)-[:relation]->(object)
My csv is in the format
ent1,state,ent2
a,is,b
.
.
.
Yes, it's possible. You need to install the APOC plugin in Neo4j and then use apoc.merge.relationship, since the relationship type cannot be set dynamically from a column value in plain Cypher.
Refer to the following query to load the data; add/modify the required details (such as the file path) in the query.
LOAD CSV FROM "file:///path-to-file" AS line
MERGE (sub:Subject {name:line[0]})
MERGE (obj:Object {name:line[2]})
WITH sub, obj, line
CALL apoc.merge.relationship(sub,line[1],{},{},obj) YIELD rel
RETURN COUNT(*);
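If the first row of your csv is a header (as the sample suggests), here is a minimal sketch of running the same load through the official Neo4j Python driver, using WITH HEADERS so the header row is not merged as data. The connection details and the triples.csv filename are assumptions.

from neo4j import GraphDatabase

# Hypothetical local connection details
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD_TRIPLES = """
LOAD CSV WITH HEADERS FROM 'file:///triples.csv' AS line
MERGE (sub:Subject {name: line.ent1})
MERGE (obj:Object {name: line.ent2})
WITH sub, obj, line
CALL apoc.merge.relationship(sub, line.state, {}, {}, obj) YIELD rel
RETURN count(rel) AS created
"""

with driver.session() as session:
    created = session.run(LOAD_TRIPLES).single()["created"]
    print("merged", created, "relationships")

driver.close()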

running nested jobs in spark

I am using PySpark. I have a list of gzipped json files on s3 which I have to access, transform, and then export in parquet to s3. Each json file contains around 100k lines, so parallelizing within a single file won't make much sense (but I am open to it); however, there are around 5k files, which I do have to parallelize over. My approach is: pass the json file list to the script -> parallelize the list -> run a map (this is where I am getting blocked). How do I access and transform the json, create a DataFrame out of the transformed json, and dump it as parquet into s3?
To read json in a distributed fashion, you will need to parallelize your keys as you mention. To do this while reading from s3, you'll need to use boto3. Below is a skeleton sketch of how to do so. You'll likely need to modify distributedJsonRead to fit your use case.
import boto3
import json
from pyspark.sql import Row

def distributedJsonRead(s3Key):
    # Fetch one object from s3 on the executor and parse its json body into a Row
    s3obj = boto3.resource('s3').Object(bucket_name='bucketName', key=s3Key)
    contents = json.loads(s3obj.get()['Body'].read().decode('utf-8'))
    return Row(**contents)

pkeys = sc.parallelize(keyList)  # keyList is a list of s3 keys
dataRdd = pkeys.map(distributedJsonRead)
Boto3 Reference: http://boto3.readthedocs.org/en/latest/guide/quickstart.html
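To finish the pipeline the question describes (a sketch; the output location is an assumption, and the s3 scheme may need to be s3a:// or s3n:// depending on your cluster), the resulting RDD of Rows can be turned into a DataFrame and written out as one merged parquet dataset:

# Build a DataFrame from the Rows produced above and write a single merged parquet dataset
dataDf = sqlContext.createDataFrame(dataRdd)
dataDf.write.parquet('s3://bucketName/output/parquet')  # hypothetical output prefix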
Edit: to address the 1:1 mapping of input files to output files
Later on, it's likely that having a merged parquet data set would be easier to work with. But if this is the way you need to do it, you could try something like this
for k in keyList:
    rawtext = sqlContext.read.json(k)  # or whichever method you need to use to read in the data
    outpath = k[:-4] + 'parquet'
    rawtext.write.parquet(outpath)
I don't think you will be able to parallelize these operations if you want a 1:1 mapping of json to parquet files. Spark's read/write functionality is designed to be called from the driver, and needs access to sc and sqlContext. This is another reason why having one parquet directory is likely the way to go.