In Palantir Foundry's Data Connection tool, what's the difference between the transaction type options? - palantir-foundry

When setting up a file-based sync in Data Connection, I see there are a few different options for 'Transaction Type'. What's the difference between them? When might I use them?

From the Foundry docs:
Transaction types
The way dataset files are modified in a transaction depends on the transaction type. There are four possible transaction types: SNAPSHOT, APPEND, UPDATE, and DELETE.
SNAPSHOT
A SNAPSHOT transaction replaces the current view of the dataset with a completely new set of files.
SNAPSHOT transactions are the simplest transaction type, and are the basis of batch pipelines.
APPEND
An APPEND transaction adds new files to the current dataset view.
An APPEND transaction cannot modify existing files in the current dataset view. If an APPEND transaction is opened and existing files are overwritten, then attempting to commit the transaction will fail.
APPEND transactions are the basis of incremental pipelines. By only syncing new data into Foundry and only processing this new data throughout the pipeline, changes to large datasets can be processed end-to-end in a performant way. However, building and maintaining incremental pipelines comes with additional complexity. Learn more about incremental pipelines.
UPDATE
An UPDATE transaction, like an APPEND, adds new files to a dataset view, but may also overwrite the contents of existing files.
DELETE
A DELETE transaction removes files that are in the current dataset view.
Note that committing a DELETE transaction does not delete the underlying file from the backing file system—it simply removes the file reference from the dataset view.
In practice, DELETE transactions are mostly used to enable data retention workflows. By deleting files on a dataset based on a retention policy—typically based on the age of the file—data can be removed from Foundry, both to minimize storage costs and to comply with data governance requirements.
Data Connection doesn't let you create a sync with a DELETE transaction type, because a sync that purely deletes data doesn't really make sense! If you'd like to delete data from your sync'd dataset, you can use a SNAPSHOT transaction to do so, but note that previous versions of the dataset will still include those files.
You can combine an APPEND or UPDATE transaction type with file-based sync filters to only ingest the newly changed files on each run of your sync.
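To make the APPEND semantics concrete on the consuming side, here is a minimal sketch of a Python transform that reads a dataset fed by an APPEND-style sync incrementally. The dataset paths are hypothetical, and the sync's transaction type itself is configured in the Data Connection UI rather than in code.

```python
# Minimal sketch (hypothetical paths): consuming an APPEND-fed dataset incrementally.
from transforms.api import transform, incremental, Input, Output

@incremental()  # only look at data added since the last successful build
@transform(
    out=Output("/Project/clean/events_clean"),
    source=Input("/Project/raw/events_sync"),  # dataset populated by an APPEND sync
)
def compute(source, out):
    new_rows = source.dataframe()  # in incremental mode, only the not-yet-processed rows
    out.set_mode("modify")         # append to the output rather than snapshotting it
    out.write_dataframe(new_rows)
```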

Related

What exactly triggers a pipeline job to run in Code Repositories?

Want to understand when the pipeline job runs so I can more effectively understand the pipeline build process. Does it check the code change from the master branch of the Code Repository?
Building a job on the pipeline builds the artifact that was delivered to the instances, not what has been merged onto master.
They should be the same, but there is a checking process after the merge onto master and before the delivery of the artifact, much like you would have in a regular Git/Jenkins/Artifactory setup, so there is a delay.
Moreover, if these checks don't pass, your change, even though merged onto master, will never appear on the pipeline.
To add a bit more precision to what Kevin Zhang wrote: there is also the possibility to trigger a job using an API call, although that is not the most common approach.
Also, you can combine the different events to say things like:
- Before work hours: build only if the schedule of the morning update has succeeded.
- During work hours: build every hour if an input has new data, and if a schedule has run successfully or another dataset has been updated.
- After hours: build whenever an input has new data.
It can also help you create loops: for example, if a huge amount of data arrives in input B and it impacts your sync toward the Ontology, or a time series, etc., you could create a job that takes a limited number of rows from input B, logs their IDs in a table so they are not picked up again, and processes those rows; when output C is updated you rerun the job, and when there are no more rows you update output D.
You can also add a schedule on the job that produces input B from input A, stating that it should rerun only when output D is updated.
This would enable you to process a number of files from a source, process the data from those files chunk by chunk and then take another batch of files and iterate.
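As a rough illustration of that loop, the sketch below shows one shape such a chunking job could take as a Python transform. The dataset paths, the "id" column, the separate "previous log" input, and the batch size are all made up for illustration, and the rerun-when-updated behaviour still comes from the schedules described above.

```python
# Illustrative sketch of the chunked loop described above; all names are hypothetical.
from transforms.api import transform, Input, Output

BATCH_SIZE = 100000  # rows of input B to process per run

@transform(
    output_c=Output("/Project/loop/output_C"),
    processed_log=Output("/Project/loop/processed_ids"),
    input_b=Input("/Project/loop/input_B"),
    already_done=Input("/Project/loop/processed_ids_previous"),  # log written by the previous run
)
def compute(output_c, processed_log, input_b, already_done):
    b = input_b.dataframe()
    done = already_done.dataframe().select("id")

    # Take one chunk of rows whose ids have not been logged yet.
    chunk = b.join(done, on="id", how="left_anti").limit(BATCH_SIZE)

    output_c.write_dataframe(chunk)                                # process this chunk
    processed_log.write_dataframe(done.union(chunk.select("id")))  # remember their ids
```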
By naming your schedules functionally you can have a more controlled build of your pipeline and a finer grain of data governance, and you can also add audit or log tables based on these schedules, which makes debugging and auditing much easier.
You would have a trace of when and where a specific source update has reached.
Of course, you only need such precision if your pipeline is complex: many different sources, updated at different times and updating multiple parts of your pipeline.
For instance, if you are unifying data from a client that was previously separated into many silos, or if it is a multinational group of many different local or global entities, like big car manufacturers.
It depends on what type of trigger you’ve set up.
If your schedule is a single cron schedule (i.e. by scheduled time), the build will not look at the master branch of the repo; it'll just build according to the cron schedule.
If your schedule contains an event trigger (e.g. one of the 4 event types: Job Spec Put, Transaction Committed, Job Succeeded, and Schedule Ran Successfully), then it'll trigger based on the event, where only the Job Spec Put event type triggers based on the master branch code change.

Is it possible to merge two Debezium connectors?

Dealing with large MySQL DBs (tens of TB), I find myself having to split up a connector so I can process more rows at a time (every single connector can process one table at a time).
Once the initial sync is complete, and it switches to incremental, what is the cleanest way of merging the two connectors?
Is it even possible?
Since both connectors are created with different database.server.name values, their associated topics are likely prefixed with different values and so attempting to merge multiple connectors would be highly cumbersome and error-prone.
What I would suggest is, if you have a large volume of data that you need to snapshot, don't rely on the initial snapshot phase to capture the data. Instead, configure a single connector to use the schema_only snapshot mode so that the schemas get captured before streaming starts. Then you can leverage incremental snapshots, which run in parallel with streaming, to capture the 10 TB of data concurrently.
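As a rough sketch of that suggestion, the configuration below registers a single MySQL connector with schema_only snapshots and a signalling table enabled for incremental snapshots. All hostnames, credentials, and database/table names are placeholders rather than values from the question.

```python
# Sketch: register one Debezium MySQL connector via the Kafka Connect REST API.
import requests

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.example.internal",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",          # single prefix for all topics
        "database.include.list": "inventory",
        "snapshot.mode": "schema_only",               # capture schemas only, then stream
        "signal.data.collection": "inventory.debezium_signal",  # table used to trigger incremental snapshots
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory",
    },
}

requests.post("http://connect:8083/connectors", json=connector)
```

Once the connector is streaming, an incremental snapshot is kicked off by inserting an execute-snapshot signal row into the signalling table, and it runs alongside regular streaming.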

I notice my Transform job has many ExecuteStats stages. Is there any way to avoid these?

I'm performance-optimizing my pipeline, and when I open Job Tracker for my Transform job, I notice that there are several stages at the beginning of the job for something called ExecuteStats.scala. Is there any way to optimize my job by removing / skipping these? They typically take tens of seconds, and they occur every time I run my transformation.
This stage type is executed when your files don't yet have statistics computed on them, i.e. if you have ingested non-parquet files (or, more generally, files that don't carry summary statistics).
Let's imagine you uploaded a .csv file via Data Connection or manually in the Foundry UI. When you do this, you apply a schema, and Spark is able to read the file and run computations on top of it. However, Spark needs to understand the distributions of values on the file contents in order to make estimations of join strategies, AQE optimizations, and other related things. Therefore, before you are able to run any computation, each .csv file has a stage executed on it to compute these stats.
This means every time you run a downstream transformation on these non-parquet files, it re-runs the statistics. You can imagine how Spark's tendency to re-run stages when running larger jobs can mean this stats problem is magnified.
Instead, you can inject a step immediately after the .csv file whereby you perform a select * and repartition(1) and write out a single parquet file (if that is the appropriate number of files for your .csv size), and Foundry will compute statistics on the contents one time. Then, your downstream transformations should use this new input instead of the .csv, and you'll see the ExecuteStats.scala stage isn't run anymore.
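A minimal sketch of that intermediate step as a Python transform is shown below; the dataset paths are hypothetical, and the single output file is just the example count from the answer.

```python
# Hypothetical paths: write the CSV-backed dataset out once as a Foundry dataset
# (Parquet by default) so statistics are computed a single time.
from transforms.api import transform, Input, Output

@transform(
    out=Output("/Project/staging/events_parquet"),
    raw_csv=Input("/Project/raw/events_csv"),
)
def compute(out, raw_csv):
    df = raw_csv.dataframe()                 # reads the .csv via its applied schema
    out.write_dataframe(df.repartition(1))   # a single output file, as suggested above
```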

Extract daily data changes from mysql and rollout to timeseries DB

In MySQL, using the binlog, we can extract the data changes. But I need only the latest changes made during that time / day, and I need to feed that data into a timeseries DB (planning to go with Druid).
While reading the binlog, is there any mechanism to avoid duplicates and keep only the latest changes?
My intention is to get the entire MySQL DB backed up every day into a timeseries DB. It helps to debug my application for past dates by referring to the actual data present on that day.
Kafka, by design, is an append-only log (no updates).
A Kafka Connect source connector will continuously capture all the changes from the binlog into a Kafka topic. The connector stores its position in the binlog and will only write new changes into Kafka as they become available in MySQL.
For consuming from Kafka, as one option, you can use a sink connector that will write all the changes to your target. Or, instead of the Kafka Connect sink connector, some independent process can read (consume) from Kafka. For Druid specifically, you may look at https://www.confluent.io/hub/imply/druid-kafka-indexing-service.
The consumer (either a connector or some independent process) will store its position (offset) in the Kafka topic, and will only write new changes into the target (Druid) as they become available in Kafka.
The process described above captures all the changes and allows you to view the source (MySQL) data at any point in time in the target (Druid). It is best practice to have all the changes available in the target; use your target's functionality to limit the view of data to a certain time of day, if needed.
If, for example, there is a huge number of daily changes to a record in MySQL and you'd like to write only the latest status as of a specific time of day to the target, you'll still need to read all the changes from MySQL. Create an additional daily process that reads all the changes since the prior run, filters only the latest records, and writes them to the target.
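To illustrate that additional daily process, here is a hypothetical PySpark sketch that keeps only the newest change per primary key since the prior run; the column names (pk, changed_at), the cutoff date, and the paths are placeholders.

```python
# Hypothetical sketch: keep only the latest change per primary key since the prior run.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-latest-changes").getOrCreate()

changes = spark.read.parquet("s3://cdc-landing/mysql-changes/")  # all changes consumed from Kafka

latest_per_key = (
    changes
    .filter(F.col("changed_at") >= F.lit("2024-01-01"))  # changes since the prior run
    .withColumn("rn", F.row_number().over(
        Window.partitionBy("pk").orderBy(F.col("changed_at").desc())))
    .filter(F.col("rn") == 1)   # keep only the newest change per primary key
    .drop("rn")
)

latest_per_key.write.mode("append").parquet("s3://druid-staging/daily-latest/")
```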

Spark: avoid task restart when writing

I have a Spark application that reads CSVs and writes Parquet files.
In some cases (too little allocated memory, a lost executor), the Parquet tasks may fail and retry; I noticed that in this case there are duplicated records, i.e. some CSVs were written to Parquet files multiple times as tasks retried.
What is the state of the art to avoid such duplicates? I already use --conf spark.yarn.maxAppAttempts=1, but this works only for jobs, not tasks. Should the application fail if one stage fails, or is there any way to roll back?
Spark uses FileOutputCommitter to manage staging output files and final output files.
The behavior of FileOutputCommitter has direct impact on the performance of jobs that write data. It has two methods, commitTask and commitJob.
Apache Spark 2.0 and higher versions use Apache Hadoop 2, which uses the value of mapreduce.fileoutputcommitter.algorithm.version to control how commitTask and commitJob work.
Currently Spark ships with two default Hadoop commit algorithms — version 1 & version 2.
In version 1, commitTask moves data generated by a task from the task temporary directory to the job temporary directory, and when all tasks complete, commitJob moves the data from the job temporary directory to the final destination. This ensures transactional writes at the job level.
In version 2, commitTask moves data generated by a task directly to the final destination, and commitJob is basically a no-op. This provides transactional writes only at the task level; you may see duplicates if the job is re-submitted.
In your case, set dataframe.write.option("mapreduce.fileoutputcommitter.algorithm.version", "1") to ensure transactional writes at the job level.
Reference: https://databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html
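For reference, here is a small sketch of the two places this setting is commonly applied: session-wide via the spark.hadoop. prefix, or per write as suggested above. The paths are placeholders.

```python
# Sketch: pinning commit algorithm v1 either globally or for a single write.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("csv-to-parquet")
    # Session-wide: Hadoop properties are prefixed with "spark.hadoop."
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
    .getOrCreate()
)

df = spark.read.option("header", "true").csv("hdfs:///landing/csv/")

# Per-write, as suggested in the answer above.
(
    df.write
    .option("mapreduce.fileoutputcommitter.algorithm.version", "1")
    .mode("overwrite")
    .parquet("hdfs:///warehouse/parquet/")
)
```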