Snapshotting of Contour results in Palantir Foundry

I'm preparing a data quality report based on a couple of Contour analyses and would like to take daily snapshots of the reported incorrect records. Then I want to show these daily numbers as another report in the same dashboard to track progress on data quality.
The main questions for me are:
Can a Contour analysis be used as a source for data storage/computation?
How can I store these numbers on a daily basis (e.g. in a Fusion spreadsheet, a Code Workbook, etc.)?

Here's one process for setting up daily snapshots of a dataset derived from a Contour analysis:
Ensure that the Contour analysis results are saved as a dataset. Let's call this dataset mydataset.
Create a Python Transform that performs daily snapshots and stores them in a dataset named mydataset_daily_snapshots:
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F

@transform_df(
    Output("/output/path/for/mydataset_daily_snapshots"),
    my_input=Input("/path/to/mydataset"),
)
def compute(my_input):
    # 'asof_timestamp' records when this snapshot of the row was taken
    out_df = my_input.withColumn('asof_timestamp', F.current_timestamp())
    # Optional: build a primary key for each row, in case you want to create
    # an Ontology object later on for use in Workshop.
    out_df = out_df.withColumn('primary_key', F.concat_ws('-', 'id', 'asof_timestamp'))
    return out_df
Create Build Schedules on both mydataset and mydataset_daily_snapshots that build the datasets daily (or as frequently as desired), so that mydataset_daily_snapshots will have a data snapshot for each day. Ensure you check Force build so that snapshots will always be built, even if the source data has not changed.
You can then use the mydataset_daily_snapshots dataset within another Contour analysis to show the changes in the data over time in a Report, or create an Ontology object from it and use Workshop to show the change over time.
Something to keep in mind is that this dataset can potentially get very large very quickly, so any filtering that keeps it smaller (e.g. limiting snapshots to just the incorrect records, or to a daily count of incorrect records) is a good idea.
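For instance, if only the daily count is needed, the transform body above could aggregate instead of copying every row. A minimal sketch, assuming mydataset has a boolean is_incorrect column (a hypothetical name -- substitute your own flag):

def compute(my_input):
    # Keep one row per day instead of a full copy of the input.
    # 'is_incorrect' is an assumed column name.
    out_df = my_input.withColumn('asof_date', F.current_date())
    out_df = out_df.groupBy('asof_date').agg(
        F.sum(F.col('is_incorrect').cast('int')).alias('incorrect_count')
    )
    return out_df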

Related

How do I create an output transform with a dynamic date and time in Code Repository?

Is it possible to create an output depending on the date and time? I would like to use a dynamic name for the datasets that will be built every day, for example. This would allow me to keep track of the dataset, and the date would be displayed in the path. I attached an example below.
Ideally, the output would be named ".../datasets/dataset_2023_03_09_HH_MM".
Dynamic naming for transform outputs is not possible.
The technical reason is that the inputs/outputs/transforms are fixed at CI time. When you press "commit" in Authoring, or when you merge a PR, a CI job is kicked off. In this CI job, all the relations between inputs and outputs are determined, including the links between unique identifiers and dataset names. Output datasets that don't exist yet are created, and a "jobspec" is added to them. A "jobspec" is a snippet of JSON that describes to Foundry how a particular dataset is generated.
Anytime you press the "build" button on a dataset (or build the dataset through a schedule or similar), the jobspec is consulted (not updated). It contains a reference to the repository, revision, source file and entry point of the function that builds this dataset. From there the build is orchestrated and kicks off, invoking your function to produce the final output.
Therefore, if the dataset name changed with the date, triggering a build would raise an error because the jobspecs would no longer be valid: the link between the unique identifier and the dataset name is broken.
To address your needs of further analysis based on the date of the day, the optimal way to proceed is to:
add a new column to your dataset which includes the date at which the build has run
build your dataset incrementally, specifying snapshot_inputs=[your_dataset] to ensure you append all existing rows from your input dataset each day (see the sketch after this list)
perform your analysis by filtering on the date column
Please find here the complete documentation for Incremental transforms and Snapshot Input.
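Putting those steps together, a minimal sketch of such an incremental transform (the paths and the 'build_date' column name are assumptions):

from transforms.api import transform, incremental, Input, Output
from pyspark.sql import functions as F

# snapshot_inputs tells the incremental decorator to read the full input on
# every run, while the output keeps accumulating rows across builds.
@incremental(snapshot_inputs=['source_df'])
@transform(
    out=Output("/output/path/for/daily_history"),
    source_df=Input("/path/to/mydataset"),
)
def compute(out, source_df):
    df = source_df.dataframe()
    # Record the date this build ran, so the analysis can filter on it.
    df = df.withColumn('build_date', F.current_date())
    # Under @incremental the default write mode is 'modify', which appends
    # these rows to the previously built output instead of replacing it.
    out.write_dataframe(df)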

SSIS import from excel files with multiple sheets and different amount of columns

I have an SSIS package with two loops: one over Excel files and one over the sheets in each file. Inside the per-sheet loop I have a data flow task that uses a variable for the sheet name, with an Excel source and an ODBC destination.
The table in the db has all the columns I need such as userid, username, productname, supportname.
However, some sheets have only the columns username and productname, while others have userid, username, productname, supportname.
How can I load the Excel files? Can I add a Derived Column task that checks whether a column exists, adds it with a default value if not, and then maps it to the destination?
thanks
SSIS is not an any-format-goes-at-run-time data loading engine. There was a conscious design decision to make the fastest possible ETL tool, and one of the requirements was a defined contract between the data source's shape and the destination. That's why you'll inevitably run into the VS_NEEDSNEWMETADATA error: something has altered the shape, and the package needs to be edited in designer mode to update the columns and sizes.
If you want to write the C# to make a generic Excel ingest engine, more power to you.
An alternative approach would be to have multiple data flows defined within your file and worksheet looping construct. The trick would be to conditionally enable them based on the available column set.
Columns "username and productname" detected, enable DFT UserName and ProductName. And that DFT will have default values, or a lookup, for UserId, SupportName, etc
All columns present, enable DFT All.
Finally, Azure Data Factory can "slurp and burp" whatever source to whatever destination. Perhaps that might be a better fit for your problem.
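If scripting outside SSIS is acceptable, the "add missing columns with a default" idea is also only a few lines of Python with pandas. A sketch, assuming the four destination columns from the question and made-up default values:

import pandas as pd

EXPECTED = ['userid', 'username', 'productname', 'supportname']
DEFAULTS = {'userid': -1, 'supportname': 'unknown'}  # assumed defaults

def load_workbook(path):
    frames = []
    # sheet_name=None loads every sheet in the file as a dict of DataFrames
    for sheet, df in pd.read_excel(path, sheet_name=None).items():
        df.columns = [str(c).lower() for c in df.columns]
        for col in EXPECTED:
            if col not in df.columns:
                df[col] = DEFAULTS.get(col)  # add missing column with default
        frames.append(df[EXPECTED])
    # one frame with a uniform shape, ready to bulk-insert via ODBC
    return pd.concat(frames, ignore_index=True)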

How can I save a Contour analysis as a dataset in Foundry?

How can I save the current values of the output of a Contour path as a snapshot dataset in Foundry?
You can save the results of your analysis as a dataset in Foundry by clicking Save as Dataset at the bottom of your Contour path. If the path already has an associated dataset, the dataset will appear at the bottom of your Contour path. Click Update to update that dataset with new logic or data. Note that if you change the logic in a Contour analysis, building a resulting dataset using Dataset-app or Data Lineage will not update the logic used to build the dataset. You must use the Update button to pick up logic changes. If you build through Dataset-app or Data Lineage, data updates will be picked up.

MySQL Storing Reports

I am looking for a way to store auto-generated reports. There are about 10-15 columns and 100-3000 rows depending on the report but each report is consistent in column count.
I am looking for a way to organise and store these reports as one large group without creating an entire new database and thousands of tables to store each individual report.
The reports need to be queryable so they can be subdivided by team/area/person etc as each report can be a combination of 3-4 different sub-reports depending on how you split/sort the data.
I am using Python to collect and sort the data from the database, so using MariaDB/MySQL would be preferred, but I'm happy to use something else if there is a pre-existing connector library for it.
To sum up, I need something similar to an Excel spreadsheet, with each table being a sheet and the sheet name being the date it was generated, so I can select reports by the date they were generated.
Think through the goals.
Is this a legal issue, i.e. you need to produce an unalterable report as something "official", a la a non-editable PDF?
Or, at the opposite extreme, do you need to be able to generate (or regenerate) any report for any timeframe?
Is performance an issue? (Either perceived or real)
I like to build and maintain summary table(s) for any "data warehouse" application, and to build "reports" that take a date range and a small number of other parameters, with report generation fast enough that it does not matter if multiple people pull reports at random times.
15 columns and 3000 rows is usually excessive. If pulling a report is trivial enough, it can be less "massive": just get the parts you want, without such bulk.
http://mysql.rjweb.org/doc.php/summarytables
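As one concrete shape for this, a single shared table keyed by the generation date plays the role of "one sheet per date". A sketch using mysql-connector-python, with entirely hypothetical report columns:

import mysql.connector  # the mysql-connector-python package

DDL = """
CREATE TABLE IF NOT EXISTS report_rows (
    report_date DATE NOT NULL,   -- plays the role of the 'sheet name'
    team    VARCHAR(64),
    area    VARCHAR(64),
    person  VARCHAR(64),
    metric  INT,
    KEY (report_date), KEY (team), KEY (person)
)
"""

def store_report(conn, report_date, rows):
    # One shared table instead of thousands: select by report_date to get
    # back exactly one generated report, or by team/area/person to subdivide.
    cur = conn.cursor()
    cur.execute(DDL)
    cur.executemany(
        "INSERT INTO report_rows VALUES (%s, %s, %s, %s, %s)",
        [(report_date, *row) for row in rows],
    )
    conn.commit()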

Access (possible VBA) Export to multiple files based on column content

I currently have a dataset that contains information for arrivals at over 100 airports. My end goal is to create an Excel file with a tab for each of those airports, containing only the rows for that airport.
I could of course create 100 queries and export them in a macro. However, the list may change over time, and while I can amend the query that creates the initial file, I'd rather not have to tweak the downstream process each time.
I cannot amend the source file process, so I do not want to export 105 initial files each time.
I am looking for a process that will export based on the contents of the data.
As your question is pretty general, here's a general approach I might take.
Query the data to get a distinct list of airports.
Loop over list from #1, building a set of data just for that airport.
Export the data from #2 to Excel.
The details on how to do each of those is going to depend on what the initial dataset looks like and is beyond the scope of what Stack Overflow is for.
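That said, if exporting from outside Access is an option, here is what that loop looks like in Python with pandas, assuming the data can be queried into a DataFrame with an 'Airport' column (a hypothetical name):

import pandas as pd

def export_by_airport(df, out_path):
    # One worksheet per distinct airport, each holding only that airport's rows.
    with pd.ExcelWriter(out_path) as writer:
        for airport, rows in df.groupby('Airport'):
            # Excel caps sheet names at 31 characters
            rows.to_excel(writer, sheet_name=str(airport)[:31], index=False)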