I have a dataset with an XML column, and I am trying to export the individual XMLs as files, with the filename taken from another column, using Code Workbook.
I filtered the rows I want using the code below:
def prepare_input(xml_with_debug):
    from pyspark.sql import functions as F

    filter_column = "key"
    filter_value = "test_key"
    df_filtered = xml_with_debug.filter(filter_value == F.col(filter_column))

    approx_number_of_rows = 1
    sample_percent = float(approx_number_of_rows) / df_filtered.count()
    df_sampled = df_filtered.sample(False, sample_percent, seed=0)

    important_columns = ["key", "xml"]
    # Cast by type name; StringType belongs to pyspark.sql.types, not to functions
    return df_sampled.select([F.col(c).cast("string").alias(c) for c in important_columns])
It works up to here. For the last part I tried the following in a Python task, but it complained about the parameters (I must have set it up wrongly). Even if it worked, I think the result would be a single file.
from transforms.api import transform, Input, Output


@transform(
    output=Output("/path/to/python_csv"),
    my_input=Input("/path/to/input")
)
def my_compute_function(output, my_input):
    output.write_dataframe(my_input.dataframe().coalesce(1), output_format="csv", options={"header": "true"})
I am trying to set it up in GUI like below
My question, I guess, is: what should the code in the last Python task (write_file), after prepare_input, look like so that I extract the individual XMLs (and, if possible, zip them into a single file for download)?
You can access the output dataset filesystem and write files into it in whatever format you want.
The documentation for that can be found here: https://www.palantir.com/docs/foundry/code-workbook/transforms-unstructured/#writing-files
(If you want to do it from a code repository it's very similar https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/#writing-files)
By doing that you can create multiple different files or you can create a single zip file and write it into a dataset.
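As a rough illustration of the zip option, the sketch below packs (filename, XML) pairs into one in-memory archive using only the Python standard library. The helper name zip_xml_rows and the sample rows are hypothetical, and the final write into the output dataset is shown only as a comment, assuming the filesystem open() call described in the linked documentation.

```python
import io
import zipfile


def zip_xml_rows(rows):
    """Pack (filename, xml_text) pairs into a single in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for filename, xml_text in rows:
            # One archive member per row; the name comes from the "key" column
            archive.writestr(f"{filename}.xml", xml_text)
    return buffer.getvalue()


# Hypothetical rows, standing in for the collected ("key", "xml") pairs
rows = [("test_key", "<root><a>1</a></root>"), ("other_key", "<root/>")]
zip_bytes = zip_xml_rows(rows)

# Assumption, not executed here: in Foundry the bytes would be written through
# the output filesystem, along the lines of
#     with output.filesystem().open("xmls.zip", "wb") as f:
#         f.write(zip_bytes)
```

Collecting rows to the driver like this is only reasonable for a small, filtered selection such as the one produced by prepare_input.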
Related
How to read and write the same dataset in a transform? I have an input dataset (input_ds1) and another input dataset (input_ds2). When I output to one of these dataset's paths (ex.dataset2 in code below) the check fails, with a cyclical dependency error.
Below I attached an example:
@transform(
    input_ds1=Input('Other Namespace/Other/Foundry_support_test/dataset1'),
    input_ds2=Input('/Other Namespace/Other/Foundry_support_test/dataset2'),
    output=Output('/Other Namespace/Other/Foundry_support_test/dataset2'),
)
def compute(input_ds1, input_ds2, output):
    ...
It is possible to read and write the contents of the output dataset with the @incremental() decorator. With it you can read the previous version of any dataset and avoid the cyclical dependency error.
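from transforms.api import transform, incremental, Input, Output


@incremental()
@transform(
    input_ds1=Input('Other Namespace/Other/Foundry_support_test/dataset1'),
    output=Output('/Other Namespace/Other/Foundry_support_test/dataset2'),
)
def compute(input_ds1, output):
    # Read the previous version of the output instead of declaring it as an input
    input_ds2 = output.dataframe('previous')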
Incremental transforms are designed for other use cases but offer many features. More details are in the incremental documentation: https://www.palantir.com/docs/foundry/transforms-python/incremental-reference/
I have a folder where I will upload one file every month. The file will have the same format in every month.
First problem
The idea is to concatenate all the files in this folder into one file. Currently I am hardcoding the filenames (filename[0], filename[1], filename[2]..) but imagine later I will have 50 files, should I explicitly add them to the transform_df decorator? Is there any other method to handle this?
Second problem:
Currently I have, let's say, 4 files (2021_07, 2021_08, 2021_09, 2021_10), and I want to avoid changing the code whenever I add a new file, for example the one containing the 2021_12 data.
If I add input_5 = Input(path_to_2021_12_do_not_exists), the code does not run and gives an error.
How can I write the code so that it covers future files and ignores an input that does not exist yet, without manually adding a new value to my code each month?
Thank you
# from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
from functools import reduce

input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]


@transform_df(
    Output("/Company/Project_name/Data/clean/File_concat"),
    input_1=Input(filenames[0]),
    input_2=Input(filenames[1]),
    input_3=Input(filenames[2]),
    input_4=Input(filenames[3]),
)
def compute(input_1, input_2, input_3, input_4):
    input_dfs = [input_1, input_2, input_3, input_4]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df

    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

    dfs = reduce(DataFrame.unionByName, dfs)
    return dfs
This question comes up a lot. The simple answer is that you don't: defining datasets and executing a build on them are two different steps, executed at different stages.
Whenever you commit your code and run the checks, your overall Python code is executed during the renderShrinkwrap stage, except for the body of the compute function. This allows Foundry to discover which datasets exist and publish them.
Publishing involves creating your dataset and putting whatever is inside your compute function into the jobspec of the dataset, so Foundry knows what code to execute whenever you run a build.
Once you hit build on the dataset, Foundry will only pick up whatever is on the jobspec and execute it. Any other code has already run during your checks, and it has run just once.
So any dynamic input/output would require you to re-run checks on your repo, which means that some code change would have had to happen since the Checks is part of the CI process, not part of the build.
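The split between the two stages can be pictured with a toy model in plain Python. This is not Foundry's implementation: toy_transform and registry are hypothetical names, and the input paths are plain strings. It only shows that decorator arguments are fixed when the module is first executed, while the function body runs later.

```python
registry = {}


def toy_transform(**inputs):
    """Toy stand-in for a transform decorator: records its inputs at import time."""
    def decorator(compute):
        # Runs once, when the module is executed (comparable to the checks stage)
        registry[compute.__name__] = {"inputs": dict(inputs), "compute": compute}
        return compute
    return decorator


input_dir = "/Company/Project_name/"
suffixes = ["2021_07", "2021_08"]
inputs = {f"input_{i}": input_dir + "DataInput1_" + s for i, s in enumerate(suffixes)}


@toy_transform(**inputs)
def compute(**kwargs):
    # Only this body runs later (comparable to the build stage)
    return sorted(kwargs)


# Changing the list after registration has no effect on the recorded inputs
suffixes.append("2021_09")
registered_inputs = registry["compute"]["inputs"]
build_result = compute(**registered_inputs)
```

In this model, picking up the third suffix would require executing the module again, which corresponds to re-running the checks.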
Taking a step back, assuming each of your input files has the same schema, Foundry would expect you to have all of those files in the same dataset as append transactions.
This might not be possible though, if for instance, the only indication of the "year" of the data is embedded in the filename, but your sample code would indicate that you expect all these datasets to have the same schema and easily union together.
You can do this manually through the Dataset Preview - just use the Upload File button or drag-and-drop the new file into the Preview window - or, if it's an "end user" workflow, with a File Upload Widget in a Workshop app. You may need to coordinate with your Foundry support team if this widget isn't available.
A bit late to the post, but for anyone interested in an answer to most of the question: dynamically determining file names from within a folder is not doable, but some level of dynamic input is possible, as follows:
# from pyspark.sql import functions as F
from transforms.api import transform, Input, Output
from pyspark.sql.functions import to_date, year, col
from pyspark.sql.types import StringType
from myproject.datasets import utils
from pyspark.sql import DataFrame
# from functools import reduce
from transforms.verbs.dataframes import union_many  # use this instead of reduce

input_dir = '/Company/Project_name/'
prefix_filename = 'DataInput1_'
suffixes = ['2021_07', '2021_08', '2021_09', '2021_10', '2021_11', '2021_12']
filenames = [input_dir + prefix_filename + suffixe for suffixe in suffixes]
inputs = {'input{}'.format(index): Input(filename) for (index, filename) in enumerate(filenames)}


@transform(
    output=Output("/Company/Project_name/Data/clean/File_concat"),
    **inputs
)
def compute(output, **kwargs):
    # Extract dataframes from input datasets
    input_dfs = [dataset_df.dataframe() for dataset_name, dataset_df in kwargs.items()]
    dfs = []

    def transformation_input(df):
        # some transformation
        return df

    for input_df in input_dfs:
        dfs.append(transformation_input(input_df))

    # dfs = reduce(DataFrame.unionByName, dfs)
    unioned_dfs = union_many(*dfs)
    # With @transform the result must be written explicitly; a return value is not saved
    output.write_dataframe(unioned_dfs)
Couple points:
Created dynamic input dict.
That dict is read into the transform using **kwargs.
Using transform decorator not transform_df, we can extract the dataframes.
(not in question) Combine multiple dataframes using union_many function from transforms_verbs library.
I have a text widget in which the user needs to enter a batch id, say "201906" (a year and month), and the data gets processed for that batch. How do I read the valid values from a CSV or a file name located in an ADLS container and use them in a Databricks dropdown widget, so that the user cannot enter a batch id that must not be processed? Basically, I want to offer the user only the batches that may be processed, rather than a free-text field.
This is straightforward: you can use the local file API to access a file on DBFS, like this (replace dbfs:/ with /dbfs/ in the path to access the file):
with open("/dbfs/tmp/my-batches.txt") as f:
    batches = [l.strip() for l in f.readlines() if l.strip() != ""]

dbutils.widgets.dropdown(name="batches", label="Select batch",
                         choices=batches, defaultValue=batches[0])
This will give you what you need.
You can achieve the same with the Spark API. It could be a bit slower, but it does not require the storage account to be mounted: you can use abfss://, wasbs://, and other supported protocols:
dbutils.widgets.removeAll()
df = spark.read.text("/tmp/my-batches.txt")
batches = [r[0].strip() for r in df.collect() if r[0].strip() != ""]
dbutils.widgets.dropdown(name="batches", label="Select batch",
choices=batches, defaultValue=batches[0])
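Both variants pass the file lines straight to the widget. If the file might contain duplicates or malformed entries, the cleaning step could be pulled into a small function, sketched below in plain Python. The name parse_batches and the six-digit YYYYMM rule are assumptions based on the "201906" example in the question.

```python
import re


def parse_batches(lines):
    """Return unique, well-formed batch ids in YYYYMM form, preserving order."""
    seen = set()
    batches = []
    for line in lines:
        batch = line.strip()
        # Keep only six-digit ids with a valid month (assumed format, e.g. "201906")
        if re.fullmatch(r"\d{4}(0[1-9]|1[0-2])", batch) and batch not in seen:
            seen.add(batch)
            batches.append(batch)
    return batches


# Hypothetical file contents: a blank line, a duplicate, and a malformed id
batches = parse_batches(["201906\n", "\n", "201907\n", "201906\n", "2019-13\n"])
```

The resulting list can be passed as choices to dbutils.widgets.dropdown. An empty list should be handled before indexing batches[0] for the default value.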
I have a data feed that gives a large .txt file (50-75GB) every day. The file contains several different schemas within it, where each row corresponds to one schema. I would like to split this into partitioned datasets for each schema, how can I do this efficiently?
The largest problem you need to solve is the iteration speed to recover your schemas, which can be challenging for a file at this scale.
Your best tactic here will be to get an example 'notional' file with each of the schemas you want to recover as a line within it, and to add this as a file within your repository. When you add this file into your repo (alongside your transformation logic), you will then be able to push it into a dataframe, much as you would with the raw files in your dataset, for quick testing iteration.
First, make sure you specify txt files as a part of your package contents, this way your tests will discover them (this is covered in documentation under Read a file from a Python repository):
You can read other files from your repository into the transform context. This might be useful in setting parameters for your transform code to reference.
To start, in your Python repository, edit setup.py:
setup(
    name=os.environ['PKG_NAME'],
    # ...
    package_data={
        '': ['*.txt']
    }
)
I am using a txt file with the following contents:
my_column, my_other_column
some_string,some_other_string
some_thing,some_other_thing,some_final_thing
This text file is at the following path in my repository: transforms-python/src/myproject/datasets/raw.txt
Once you have configured the text file to be shipped with your logic, and after you have included the file itself in your repository, you can then include the following code. This code has a couple of important functions:
It keeps raw file parsing logic completely separate from the stage of reading the file into a Spark DataFrame. This is so that the way this DataFrame is constructed can be left to the test infrastructure, or to the run time, depending on where you are running.
Keeping this logic separate ensures that the row-by-row parsing you want to do is its own testable function, instead of living purely within your my_compute_function.
This code uses the Spark-native spark_session.read.text method, which will be orders of magnitude faster than row-by-row parsing of a raw txt file. This will ensure the parallelized DataFrame is what you operate on, not a single file, line by line, inside your executors (or worse, your driver).
from transforms.api import transform, Input, Output
from pkg_resources import resource_filename


def raw_parsing_logic(raw_df):
    return raw_df


@transform(
    my_output=Output("/txt_tests/parsed_files"),
    my_input=Input("/txt_tests/dataset_of_files"),
)
def my_compute_function(my_input, my_output, ctx):
    all_files_df = None
    for file_status in my_input.filesystem().ls('**/**'):
        raw_df = ctx.spark_session.read.text(my_input.filesystem().hadoop_path + "/" + file_status.path)
        parsed_df = raw_parsing_logic(raw_df)
        all_files_df = parsed_df if all_files_df is None else all_files_df.unionByName(parsed_df)
    my_output.write_dataframe(all_files_df)


def test_my_compute_function(spark_session):
    file_path = resource_filename(__name__, "raw.txt")
    raw_df = raw_parsing_logic(
        spark_session.read.text(file_path)
    )
    assert raw_df.count() > 0
    raw_columns_set = set(raw_df.columns)
    expected_columns_set = {"value"}
    assert len(raw_columns_set.intersection(expected_columns_set)) == 1
Once you have this code up and running, your test_my_compute_function method will be very fast to iterate on, so that you can perfect your schema recovery logic. This will make it substantially easier to get your dataset building at the very end, but without any of the overhead of a full build.
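Before writing the Spark version of raw_parsing_logic, the splitting rule itself can be tried out on the notional file in plain Python. The sketch below assumes that schemas can be told apart by their number of comma-separated fields, which the sample file suggests but which may not hold for the real feed. The function name group_by_field_count is hypothetical.

```python
from collections import defaultdict


def group_by_field_count(lines, delimiter=","):
    """Group parsed lines by how many fields they contain."""
    groups = defaultdict(list)
    for line in lines:
        fields = [field.strip() for field in line.split(delimiter)]
        groups[len(fields)].append(fields)
    return dict(groups)


# The contents of the notional raw.txt shown earlier
sample_lines = [
    "my_column, my_other_column",
    "some_string,some_other_string",
    "some_thing,some_other_thing,some_final_thing",
]
groups = group_by_field_count(sample_lines)
```

This is only for prototyping the rule on a handful of lines. On the real dataset the same rule would be expressed as DataFrame column operations (for instance with pyspark.sql.functions.split and size) so that the work stays parallelized.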
For web2py there are generic views, e.g. for JSON.
I could not find a sample for CSV.
Looking at the web2py manual, sections 10.1.2 and 10.1.6, it is written:
'.. define a "generic.csv" file, but one would have to specify the name of the object to be serialized ("animals" in the example)'
Looking at the generic pdf view
{{
import os
from gluon.contrib.generics import pdf_from_html
filename = '%s/%s.html' % (request.controller,request.function)
if os.path.exists(os.path.join(request.folder,'views',filename)):
html=response.render(filename)
else:
html=BODY(BEAUTIFY(response._vars))
pass
=pdf_from_html(html)
}}
and also the specified CSV view (manual chapter 10.1.6):
{{
import cStringIO
stream=cStringIO.StringIO()
animals.export_to_csv_file(stream)
response.headers['Content-Type']='application/vnd.ms-excel'
response.write(stream.getvalue(), escape=False)
}}
Massimo writes: 'web2py does not provide a "generic.csv";'
He is not fully against it, but...
So let's try to build one and deactivate it when necessary.
The generic view should look similar to the following (better called pseudocode, as it does not work):
{{
import os
from gluon.contrib.generics export export_to_csv_file(stream)
filename = '%s/%s' % (request.controller,request.function)
if os.path.exists(os.path.join(request.folder,'views',filename)):
csv=response.render(filename)
else:
csv=BODY(BEAUTIFY(response._vars))
pass
= export_to_csv_file(stream)
}}
What's wrong?
Or is there a sample?
Is there a reason not to have a generic CSV view?
Adapting the generic.pdf code as literally as above would not work for CSV output, because the generic.pdf code first executes the standard HTML template and then simply converts the generated HTML to a PDF. This approach does not make sense for CSV, as CSV requires data of a particular structure.
As stated in the documentation:
Notice that one could also define a "generic.csv" file, but one would
have to specify the name of the object to be serialized ("animals" in
the example). This is why we do not provide a "generic.csv" file.
The execution of a view is triggered by a controller action returning a dictionary. The keys of the dictionary become available as variables in the view execution environment (the entire dictionary is also available as response._vars). If you want to create a generic.csv view, you therefore need to establish some conventions about what variables are in the returned dictionary as well as the possible structure(s) of the returned data.
For example, the controller could return something like dict(data=mydata). The code in generic.csv would then access the data variable and could convert it to CSV. In that case, there would have to be some convention about the structure of data -- perhaps it could be required to be a list of dictionaries or a DAL Rows object (or optionally either one).
Another possible convention is for the controller to return something like dict(columns=mycolumns, rows=myrows), where columns is a list of column names and rows is a list of lists containing the data for each row.
The point is, there is no universal convention for what the controller might return and how that can be converted into CSV, so you first need to decide on some conventions and then write generic.csv accordingly.
For example, here is a very simple generic.csv that would work only if the controller returns dict(rows=myrows), where myrows is a DAL Rows object:
{{
import cStringIO
stream=cStringIO.StringIO()
rows.export_to_csv_file(stream)
response.headers['Content-Type']='application/vnd.ms-excel'
response.write(stream.getvalue(), escape=False)
}}
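The other convention mentioned above, a list of dictionaries, could be serialized along the following lines. This is a minimal sketch in plain Python 3 using the standard csv module and io.StringIO rather than cStringIO. The function name rows_to_csv and the sample field names are illustrative and not part of web2py.

```python
import csv
import io


def rows_to_csv(rows, columns=None):
    """Serialize a list of dictionaries to CSV text."""
    if columns is None:
        # Default to the keys of the first row, in their existing order
        columns = list(rows[0].keys()) if rows else []
    stream = io.StringIO()
    writer = csv.DictWriter(stream, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return stream.getvalue()


# Hypothetical data in the shape produced by a select(...).as_list() call
data = [{"cellnr": 1, "volt_act": 3.71}, {"cellnr": 2, "volt_act": 3.69}]
csv_text = rows_to_csv(data)
```

A generic.csv view built on this convention would still need to know which key of the returned dictionary holds the list, which is the naming problem the documentation describes.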
I tried:
# Sample from the web2py manual, 10.1.1, page 464
def count():
    session.counter = (session.counter or 0) + 1
    return dict(counter=session.counter, now=request.now)

# and my own creation from a SQL table (if possible, used for both JSON and CSV):
def csv_rt_bat_c_x():
    battdat = db().select(db.csv_rt_bat_c.rec_time, db.csv_rt_bat_c.cellnr,
                          db.csv_rt_bat_c.volt_act, db.csv_rt_bat_c.id).as_list()
    return dict(battdat=battdat)
Both times I get an error when trying CSV. It works for /default/count.json but not for /default/count.csv.
I suppose the requirement:
dict(rows=myrows)
"where myrows is a DAL Rows object" is not met.