How can I save the current values of the output of a Contour path as a snapshot dataset in Foundry?
You can save the results of your analysis as a dataset in Foundry by clicking Save as Dataset at the bottom of your Contour path. If the path already has an associated dataset, it will appear at the bottom of the path; click Update to refresh it with new logic or data. Note that if you change the logic in a Contour analysis, building the resulting dataset from the Dataset app or Data Lineage will not pick up the logic change; you must use the Update button for that. Builds triggered from the Dataset app or Data Lineage will, however, pick up data updates.
Having developed a pipeline transform, I might want to write a test that applies the new transform to a given small dataset and compares the result with an expected result.
How can I conveniently select a small portion of an existing dataset, preferably including the schema and some rows of interest? Usually I am looking at a given dataset using a preview, like in this example. Is it possible to select some rows and reuse them directly, either
within the source code, to create a PySpark DataFrame in my test file like
spark_session.createDataFrame([
    (0, 1, 2)
], ['col_a', 'col_b', 'col_c'])
or as a CSV file which I can upload to my test directory and load from within the test
...?
Basically, what is the most convenient way to obtain data for the tasks described here from existing datasets, which might be too large to be exported as CSV directly?
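For illustration, here is a minimal sketch of the kind of test I have in mind (clean_data, its module path, and the pytest spark_session fixture are placeholders, not real names from my repository):

from myproject.datasets.clean import clean_data  # placeholder transform under test


def test_clean_data(spark_session):
    # Hand-crafted rows copied from a preview of the real dataset
    input_df = spark_session.createDataFrame(
        [(0, 1, 2)],
        ['col_a', 'col_b', 'col_c'],
    )
    expected_df = spark_session.createDataFrame(
        [(0, 1, 4)],
        ['col_a', 'col_b', 'col_c'],
    )

    result_df = clean_data(input_df)

    assert result_df.columns == expected_df.columns
    assert result_df.collect() == expected_df.collect()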
Is it possible to create an output depending on the date and time? I would like to use a dynamic name for the datasets that will be built every day, for example. This would allow me to keep track of the dataset, and the date would be displayed in the path. I attached an example below.
Ideally - the output will be named ".../datasets/dataset_2023_03_09_HH_MM"
Dynamic naming for transform outputs is not possible.
The technical reason is that the inputs/outputs/transforms are fixed at CI time. When you press "commit" in Authoring or merge a PR, a CI job is kicked off. In this CI job, all the relations between inputs and outputs are determined, including the links between unique identifiers and dataset names. Output datasets that don't exist yet are created, and a "jobspec" is added to them. A "jobspec" is a snippet of JSON that describes to Foundry how a particular dataset is generated.
Anytime you press the "build" button on a dataset (or build the dataset through a schedule or similar), the jobspec is consulted (not updated). It contains a reference to the repository, revision, source file and entry point of the function that builds this dataset. From there the build is orchestrated and kicks off, invoking your function to produce the final output.
Therefore, when the date changes and the build is triggered, an error will be raised because the jobspecs are no longer valid: the link between the unique identifier and the dataset name is broken.
To address your need for further analysis based on the date of the day, the optimal way to proceed is to:
add a new column to your dataset which includes the date at which the build has run
build your dataset incrementally, specifying snapshot_inputs=[your_dataset] to ensure you are adding all existing rows from your input dataset each day
perform your analysis by filtering on the date column
Please find here the complete documentation for Incremental transforms and Snapshot Input.
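As a rough sketch of what those steps can look like in code (the paths and the input name below are placeholders, not your actual datasets):

from transforms.api import transform_df, incremental, Input, Output
from pyspark.sql import functions as F


@incremental(snapshot_inputs=['source_df'])
@transform_df(
    Output('/path/to/output_with_build_date'),
    source_df=Input('/path/to/your_dataset'),
)
def compute(source_df):
    # Stamp every row with the date of the current build; each daily run
    # appends the full input again, so downstream analysis can filter on build_date
    return source_df.withColumn('build_date', F.current_date())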
I have an SSIS package with two loops: one over Excel files and one over the sheets in each file. Inside the for-each-sheet loop I have a Data Flow Task that uses a variable for the sheet name, with an Excel source and an ODBC destination.
The table in the db has all the columns I need such as userid, username, productname, supportname.
However, some sheets only have the columns username and productname, while others have userid, username, productname, and supportname.
How can I load the Excel files? Can I add a Derived Column task that checks whether a column exists and, if not, adds it with a default value and then maps it to the destination?
thanks
SSIS is not an "any format goes at run-time" data loading engine. There was a conscious design decision to make the fastest possible ETL tool, and one of those requirements was defining a contract between the data source's shape and the destination. That's why you'll inevitably run into the VS_NEEDSNEWMETADATA error: something has altered the shape, and the package needs to be edited in designer mode to update the columns and sizes.
If you want to write the C# to make a generic Excel ingest engine, more power to you.
An alternative approach would be to have multiple data flows defined within your file and worksheet looping construct. The trick would be to conditionally enable them based on the available column set.
Columns "username and productname" detected, enable DFT UserName and ProductName. And that DFT will have default values, or a lookup, for UserId, SupportName, etc
All columns present, enable DFT All.
Finally, Azure Data Factory can "slurp and burp" whatever source to whatever destination. Perhaps that might be a better fit for your problem.
I would like to be able to add a column to the generated datasets that contains the version number of the dataset. Having the dataset version present in the dataset itself would let consumers track which applications use the dataset (even if they use it outside Foundry) and reproduce results in those applications, since the original dataset version used to create them is always recoverable. Is this possible?
I'm preparing a data quality report based on a couple of Contour analyses and would like to take daily snapshots of the reported incorrect records. Then I want to show these daily numbers as another report in the same dashboard to see the progress on data quality.
The main questions for me are:
can a Contour analysis be used as a source for data storage/computation
how to store these numbers on a daily basis (e.g. in a Fusion spreadsheet, Code Workbook, etc.)
Here's one process for setting up daily snapshots of a dataset derived from a Contour analysis:
Ensure that the Contour analysis results are saved as a dataset. Let's call this dataset mydataset:
Create a Python Transform that performs daily snapshots and stores them in a dataset named mydataset_daily_snapshots:
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F
@transform_df(
    Output("/output/path/for/mydataset_daily_snapshots"),
    my_input=Input("/path/to/mydataset"),
)
def compute(my_input):
    # 'asof_timestamp' records the snapshot time for each row on the current date
    out_df = my_input.withColumn('asof_timestamp', F.current_timestamp())
    # Optional: create a primary key for each row, in case you want to create an Ontology object later on for use in Workshop
    out_df = out_df.withColumn('primary_key', F.concat_ws('-', 'id', 'asof_timestamp'))
    return out_df
Create Build Schedules on both mydataset and mydataset_daily_snapshots that build the datasets daily (or as frequently as desired), so that mydataset_daily_snapshots will have data snapshots for each day. Ensure you check Force build so that snapshots will always be built, even if the source data has not changed:
You can then use the mydataset_daily_snapshots dataset within another Contour analysis to show the changes in the data over time in a Report, or create an Ontology object from it and use Workshop to show the change over time.
Something to keep in mind is that this dataset can potentially get very large very quickly -- any filtering to keep it smaller (e.g. limiting snapshots to just the incorrect records, or to a sum of incorrect records for the day) is a good idea.
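For instance, here is a minimal sketch of storing a per-day count instead of every row, assuming a hypothetical boolean column is_incorrect in mydataset (adjust the column name and paths to your data):

from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F


@transform_df(
    Output("/output/path/for/mydataset_daily_counts"),
    my_input=Input("/path/to/mydataset"),
)
def compute(my_input):
    # Keep only one row per build: the snapshot date and the number of incorrect records
    return (
        my_input
        .withColumn('asof_date', F.current_date())
        .groupBy('asof_date')
        .agg(F.sum(F.col('is_incorrect').cast('int')).alias('incorrect_record_count'))
    )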