Saving CSV files in Foundry Transforms - palantir-foundry

Is it possible to save a CSV file in Foundry Code Repositories transforms-python language rather than saving them in the Parquet format?

Yes, this can be done by specifying the output format when calling the write_dataframe function. You can also include compression options in the same call. For example:
@transform(
    my_input=Input('/path/to/input/dataset'),
    my_output=Output('/path/to/output/dataset')
)
def compute_function(my_input, my_output):
    my_output.write_dataframe(
        my_input.dataframe(),
        output_format="csv",
        options={
            "compression": "gzip"
        }
    )

Related

Code Repository - What exactly is CTX in pyspark for a code repo?

I have seen the use of ctx in a code repo; what exactly is it? Is it a built-in library? When would I use it?
I've seen it in examples such as the following:
df = ctx.spark.createdataframe(...
For Code Repositories transformations, you can optionally include a parameter ctx, which gives you access to the underlying infrastructure running your job. Typically, you'll use the ctx.spark_session attribute to make your own pyspark.sql.DataFrame objects from plain Python objects, like:
from transforms.api import transform_df, Output
from pyspark.sql import types as T


@transform_df(
    Output("/my/output")
)
def my_compute_function(ctx):
    schema = T.StructType(
        [
            T.StructField("name", T.StringType(), True)
        ]
    )
    return ctx.spark_session.createDataFrame([["Alex"]], schema=schema)
You'll find a full API description in the documentation for the transforms.api.TransformContext class, where attributes such as spark_session and parameters are available for you to use.
Note: the spark_session attribute has type pyspark.sql.SparkSession
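For a slightly fuller picture of how ctx sits alongside a regular input, here is a minimal sketch; the dataset paths, the country_code column, and the lookup values are made-up placeholders:
from transforms.api import transform_df, Input, Output
from pyspark.sql import types as T


@transform_df(
    Output("/my/enriched_output"),   # hypothetical output path
    source=Input("/my/input"),       # hypothetical input path
)
def my_compute_function(ctx, source):
    # Build a small in-memory lookup table with the underlying SparkSession
    schema = T.StructType([
        T.StructField("country_code", T.StringType(), True),
        T.StructField("country_name", T.StringType(), True),
    ])
    lookup = ctx.spark_session.createDataFrame(
        [["FR", "France"], ["DE", "Germany"]], schema=schema
    )
    # Enrich the input (assumed to have a country_code column) with the lookup
    return source.join(lookup, on="country_code", how="left")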

How can I ingest an Excel spreadsheet with multiple tabs?

I would like to ingest Excel files from a remote folder or an SFTP server. The ingestion works with CSV files but not with XLS or XLSX files.
The code below provides functions to transform an XLS/XLSX file into a Spark DataFrame.
To use these functions you need to:
1. Copy/paste the functions below into your repository (in a utils.py file, for instance).
2. Create a new transform script.
3. In the transform script, copy/paste the example transform and modify the parameters.
Example transform to use the functions:
from transforms.api import transform, Input, Output
# Adjust the import path to wherever you saved the helper functions (utils.py)
from myproject.datasets.utils import write_datasets_from_excel_sheets

# Parameters for ingesting an Excel file with multiple tabs
SHEETS_PARAMETERS = {
    # Each of these blocks takes one tab of your Excel file ("Artists" here) and,
    # reading column labels from the "header" row, writes a dataset to the path
    # provided ("/Studio/studio_datasource/artists")
    "Artists": {
        "output_dataset_path": "/Studio/studio_datasource/artists",
        "header": 7
    },
    "Records": {
        "output_dataset_path": "/Studio/studio_datasource/records",
        "header": 0
    },
    "Albums": {
        "output_dataset_path": "/Studio/studio_datasource/albums",
        "header": 1
    }
}

# Define the dictionary of outputs needed by the transform's decorator
outputs = {
    sheet_parameter["output_dataset_path"]: Output(sheet_parameter["output_dataset_path"])
    for sheet_parameter in SHEETS_PARAMETERS.values()
}


@transform(
    my_input=Input("/Studio/studio_datasource/excel_file"),
    **outputs
)
def my_compute_function(my_input, ctx, **outputs):
    # Attach the output objects to the parameters
    for sheetname, parameters in SHEETS_PARAMETERS.items():
        output_dataset_path = parameters["output_dataset_path"]
        parameters["output_dataset"] = outputs[output_dataset_path]
    # Transform the sheets to datasets
    write_datasets_from_excel_sheets(my_input, SHEETS_PARAMETERS, ctx)
Functions:
import pandas as pd
import tempfile
import shutil


def normalize_column_name(cn):
    """
    Remove forbidden characters from the column names
    """
    invalid_chars = " ,;{}()\n\t="
    for c in invalid_chars:
        cn = cn.replace(c, "_")
    return cn


def get_dataframe_from_excel_sheet(fp, ctx, sheet_name, header):
    """
    Generate a Spark dataframe from a sheet in an Excel file available in Foundry
    Arguments:
        fp:
            TemporaryFile object that allows reading the file that contains the Excel file
        ctx:
            Context object available in a transform
        sheet_name:
            Name of the sheet
        header:
            Row (0-indexed) to use for the column labels of the parsed DataFrame.
            If a list of integers is passed, those row positions will be combined into a MultiIndex.
            Use None if there is no header.
    """
    # Using UTF-8 encoding is safer
    dataframe = pd.read_excel(
        fp,
        sheet_name,
        header=header,
        encoding="utf-8"
    )
    # Cast all the dataframes as string
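    # --- NOTE: the original answer is truncated at this point. What follows is a ---
    # --- hedged reconstruction of how the helpers could be completed, not the    ---
    # --- original author's code.                                                 ---
    dataframe = dataframe.astype(str)
    # Build the Spark DataFrame from the pandas DataFrame via the transform context
    return ctx.spark_session.createDataFrame(dataframe)


def write_datasets_from_excel_sheets(my_input, sheets_parameters, ctx):
    """
    Sketch of the missing driver function (assumed implementation): copy the Excel
    file out of the input dataset's filesystem, then build and write one dataset per
    configured sheet. Assumes the input dataset contains exactly one Excel file.
    """
    input_fs = my_input.filesystem()
    excel_file_path = list(input_fs.ls())[0].path
    with tempfile.NamedTemporaryFile() as tmp:
        # Download the Excel file to a local temporary file
        with input_fs.open(excel_file_path, "rb") as remote_file:
            shutil.copyfileobj(remote_file, tmp)
        tmp.flush()
        for sheet_name, parameters in sheets_parameters.items():
            tmp.seek(0)
            df = get_dataframe_from_excel_sheet(tmp, ctx, sheet_name, parameters["header"])
            # Clean the column names before writing
            df = df.toDF(*[normalize_column_name(c) for c in df.columns])
            parameters["output_dataset"].write_dataframe(df)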

How to write CSV file with headers using akka stream alpakka?

I can't seem to find it, hence I turn to Slack to ask: is there a way to write a CSV file with its headers using Akka Streams Alpakka?
The only thing I see is https://doc.akka.io/docs/alpakka/current/data-transformations/csv.html#csv-formatting
But I don't see the reverse of the CSV-to-map operation, i.e. formatting a map as CSV with headers.
My use case is that I need to read a few CSV files, filter their content, and write the cleaned content to a corresponding file originalcsvfilename-cleaned.csv.
If it is not directly supported, any recommendation?
You can do something like this:
def csv_header(elem: T): List[String] = ???
def csv_line(elem: T): List[String] = ???

def firstTrueIterator(): Iterator[Boolean] = (Iterator single true) ++ (Iterator continually false)
def firstTrueSource: Source[Boolean, _] = Source fromIterator firstTrueIterator

def processData(elem: T, firstRun: Boolean): List[List[String]] = {
  if (firstRun) {
    List(
      csv_header(elem),
      csv_line(elem)
    )
  } else {
    List(csv_line(elem))
  }
}

val finalSource = source
  .zipWith(firstTrueSource)(processData)
  .mapConcat(identity)
  .via(CsvFormatting.format())

How to properly merge multiple FlowFiles?

I use MergeContent 1.3.0 in order to merge FlowFiles from 2 sources: 1) from ListenHTTP and 2) from QueryElasticsearchHTTP.
The problem is that the merging result is a list of JSON strings. How can I convert them into a single JSON string?
{"event-date":"2017-08-08T00:00:00"}{"event-date":"2017-02-23T00:00:00"}{"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
I would like to get this result:
{"event-date":["2017-08-08T00:00:00","2017-02-23T00:00:00"],"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
Is it possible?
UPDATE:
After changing data structure in Elastic, I was able to come up with the following output result of MergeContent. Now I have a common field eid in both JSON strings. I would like to merge these strings by eid in order to get a single JSON file. Which operator should I use?
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
I need to get the following output:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4,"dates":{"event-date":["2017-08-08","2017-02-23"]}}
It was suggested to use ExecuteScript to merge the files. However, I cannot figure out how to do this. This is what I tried:
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class ModJSON(StreamCallback):
    def __init__(self):
        pass

    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        obj = json.loads(text)
        newObj = {
            "eid": obj['eid'],
            "zid": obj['zid'],
            ...
        }
        outputStream.write(bytearray(json.dumps(newObj, indent=4).encode('utf-8')))

flowFile1 = session.get()
flowFile2 = session.get()
if (flowFile1 != None and flowFile2 != None):
    # WHAT SHOULD I PUT HERE??
    flowFile = session.write(flowFile, ModJSON())
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0] + '_translated.json')
    session.transfer(flowFile, REL_SUCCESS)
session.commit()
An example of how to read multiple files from the incoming queue using filtering:
Assume you have multiple pairs of flow files with the following content:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}
and
{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
The same value of the eid field provides the link between the files in a pair.
Before merging, we have to extract the value of the eid field and put it into an attribute of the flow file for fast filtering.
Use the EvaluateJsonPath processor with the following properties:
Destination : flowfile-attribute
eid : $.eid
After this you'll have a new eid attribute on each flow file.
Then use the ExecuteScript processor with the Groovy language and the following code:
import org.apache.nifi.flowfile.FlowFile
import org.apache.nifi.processor.FlowFileFilter
import org.apache.nifi.processor.io.OutputStreamCallback
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

//get first flow file
def ff0 = session.get()
if(!ff0) return

def eid = ff0.getAttribute('eid')

//try to find files with the same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter(){
    public FlowFileFilterResult filter(FlowFile ff) {
        if( eid == ff.getAttribute('eid') ) return FlowFileFilterResult.ACCEPT_AND_CONTINUE
        return FlowFileFilterResult.REJECT_AND_CONTINUE
    }
})

//let's assume you require one additional file in the queue with the same attribute
if( !ffList || ffList.size()<1 ){
    //if there are fewer than required,
    //rollback the current session and penalize the retrieved files so they go to the end of the incoming queue
    //with the pre-configured penalty delay (default 30 sec)
    session.rollback(true)
    return
}

//let's put all files in one list to simplify later iterations
ffList.add(ff0)

if( ffList.size()>2 ){
    //unexpected situation: you have more files than expected
    //redirect all of them to failure
    session.transfer(ffList, REL_FAILURE)
    return
}

//create an empty map (aka json object)
def json = [:]
//iterate through the files, parse and merge their content
ffList.each{ff->
    session.read(ff).withStream{rawIn->
        def fjson = new JsonSlurper().parse(rawIn)
        json.putAll(fjson)
    }
}

//create a new flow file and write the merged json as its content
def ffOut = session.create()
ffOut = session.write(ffOut, {rawOut->
    rawOut.withWriter("UTF-8"){writer->
        new JsonBuilder(json).writeTo(writer)
    }
} as OutputStreamCallback )

//set mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")

session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)
Joining together two different types of data is not really what MergeContent was made to do.
You would need to write a custom processor, or custom script, that understood your incoming data formats and created the new output.
If you have ListenHttp connected to QueryElasticSearchHttp, meaning that you are triggering the query based on the flow file coming out of ListenHttp, then you may want to make a custom version of QueryElasticSearchHttp that takes the content of the incoming flow file and joins it together with any of the outgoing results.
Here is where the query result is currently written to a flow file:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/QueryElasticsearchHttp.java#L360
Another option is to use ExecuteScript and write a script that could take multiple flow files and merge them together in the way you described.
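In that spirit, here is a rough Jython sketch of what such an ExecuteScript body might look like. It simply takes two flow files from the queue (so it assumes each pair arrives together; matching on the eid attribute, as in the Groovy example above, is more robust), parses both JSON payloads, merges the two dictionaries, and writes the result to a new flow file. Treat it as an illustration rather than a tested script.
import json
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import InputStreamCallback, OutputStreamCallback


class JsonReader(InputStreamCallback):
    """Reads a flow file's content and keeps the parsed JSON object."""
    def __init__(self):
        self.obj = None

    def process(self, inputStream):
        self.obj = json.loads(IOUtils.toString(inputStream, StandardCharsets.UTF_8))


class JsonWriter(OutputStreamCallback):
    """Writes a Python dict as the JSON content of a flow file."""
    def __init__(self, obj):
        self.obj = obj

    def process(self, outputStream):
        outputStream.write(bytearray(json.dumps(self.obj).encode('utf-8')))


flowFile1 = session.get()
flowFile2 = session.get()
if flowFile1 is not None and flowFile2 is not None:
    # Parse both payloads and merge the dictionaries (the second wins on key collisions)
    reader1, reader2 = JsonReader(), JsonReader()
    session.read(flowFile1, reader1)
    session.read(flowFile2, reader2)
    merged = dict(reader1.obj)
    merged.update(reader2.obj)

    # Write the merged JSON to a new flow file and drop the originals
    outFlowFile = session.create()
    outFlowFile = session.write(outFlowFile, JsonWriter(merged))
    outFlowFile = session.putAttribute(outFlowFile, "mime.type", "application/json")
    session.remove(flowFile1)
    session.remove(flowFile2)
    session.transfer(outFlowFile, REL_SUCCESS)
elif flowFile1 is not None:
    # Only one half of the pair is available; put it back and try again later
    session.rollback()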

Flask read uploaded json file

I'm uploading a JSON file via Flask, but I'm having trouble actually reading what is in the file.
# named fJson b/c of other json imports
from flask import json as fJson
from flask import request


@app.route('/upload', methods=['GET', 'POST'])
def upload():
    if request.method == 'POST':
        file = request.files['file']
        # data = fJson.load(file)
        # myfile = file.read()
I'm trying to deal with this by using the 'file' variable. I looked at http://flask.pocoo.org/docs/0.10/api/#flask.json.load, but I get the error "No JSON object could be decoded". I also looked at "Read file data without saving it in Flask", which recommended using file.read(), but that didn't work either; it returns either "None" or "".
Request.files
A MultiDict with files uploaded as part of a POST or PUT request. Each file is stored as a FileStorage object. It basically behaves like a standard Python file object, with the difference that it also has a save() function that can store the file on the filesystem.
http://flask.pocoo.org/docs/0.10/api/#flask.Request.files
You don't need to use json, just use read(), like this:
if request.method == 'POST':
    file = request.files['file']
    myfile = file.read()
For some reason the file's position was at the end. Doing
file.seek(0)
before calling read() or load() fixes the problem.
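Putting the pieces together, a minimal sketch of the upload view might look like the following; the route and the "file" field name are taken from the question, and the final return line is just a placeholder:
from flask import Flask, request
from flask import json as fJson

app = Flask(__name__)


@app.route('/upload', methods=['GET', 'POST'])
def upload():
    if request.method == 'POST':
        file = request.files['file']   # werkzeug FileStorage object
        file.seek(0)                   # rewind in case the stream was already read
        data = fJson.load(file)        # parse the uploaded JSON
        return fJson.dumps(data)       # echo it back for demonstration
    return 'POST a file field named "file" to this endpoint'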