Foundry has a concept of changelog datasets, which I am using in order to speed up my ontology syncs. However, I've been told to always build datasets from the 'snapshot' version of the dataset, rather than the 'changelog' version. Why is this?
In summary: changelog datasets, by design, include previous versions of the same row. Unless your transform is designed to handle this, it will behave as if there were incorrect or duplicated input data.
Each time a changelog dataset is built, any changes to the input data are appended to it as new rows. This is done so that Foundry's Object Storage can then simply apply the diff against the currently synced data, minimising the amount of data that needs to be synced.
This means the changelog dataset is designed to contain multiple entries for each single row in the input dataset: every time an input row changes, the changelog dataset appends another entry containing the new version of that row.
Unless your transform expects this:
- You will, in effect, end up processing multiple versions of the same row, which could appear as if you were working with outdated and/or duplicated rows.
- Unless your transform always incrementally passes through just the appended rows using an APPEND transaction, your output won't preserve the benefits of the append-only behaviour of changelog datasets.
This might mean you have multiple entries for a single row in a SNAPSHOT transaction. If you then synced that data to Object Storage, you could end up seeing multiple or outdated versions of rows in your ontology.
As a result, unless your transform is designed to handle the format of changelog datasets, it's best to build off the 'snapshot' version of datasets.
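For completeness: a transform that does want to consume the changelog version usually has to collapse it back down to one row per key first. Below is a minimal PySpark sketch of that idea; `primary_key` and `ordering` are placeholder column names, since the actual key and version/ordering columns depend on your changelog dataset's schema.

```python
# Minimal sketch: keep only the latest version of each row in a changelog-style
# DataFrame. "primary_key" and "ordering" are placeholder column names.
from pyspark.sql import Window
from pyspark.sql import functions as F

def latest_version_only(changelog_df, key_col="primary_key", ordering_col="ordering"):
    w = Window.partitionBy(key_col).orderBy(F.col(ordering_col).desc())
    return (
        changelog_df
        .withColumn("__rank", F.row_number().over(w))
        .filter(F.col("__rank") == 1)
        .drop("__rank")
    )
```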
Related
Before I go much further down this thought process, I wanted to check whether the idea was feasible at all. Essentially, I have two datasets, each of which will consist of ~500K records. For the sake of discussion, we can assume they will be presented as CSV files for ingestion.
Basically, what I'll need to do is take records from the first dataset, do a lookup against the second dataset, and then merge the two and produce an output CSV file with the results. The expected number of records after the merge will be in the range of 1.5-2M.
So, my questions are:
1. Will Power Automate allow me to work with CSV datasets of those sizes?
2. Will the "Apply to each" operator function across a dataset that large?
3. Will Power Automate allow me to produce an export CSV file of that size?
4. Will the process actually complete, or will it eventually just hit some sort of internal timeout error?
I know that I can use more traditional services like SQL Server Integration Services for this, but I'm wondering whether Power Automate has matured enough to handle this level of ETL operation.
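To make the operation concrete, here is the kind of lookup-and-merge I mean, sketched in pandas (the file names and the join key are hypothetical placeholders); the question is whether Power Automate can do the equivalent at this scale.

```python
# The kind of lookup-and-merge described above, sketched in pandas.
# File names and the join key ("lookup_key") are hypothetical placeholders.
import pandas as pd

first = pd.read_csv("first_dataset.csv")    # ~500K records
second = pd.read_csv("second_dataset.csv")  # ~500K records

# A one-to-many join like this is what pushes the output to 1.5-2M records.
merged = first.merge(second, on="lookup_key", how="inner")
merged.to_csv("merged_output.csv", index=False)
```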
I'm trying to read all (or multiple) datasets from a single directory in a single PySpark transform. Is it possible to iterate over all the datasets in a path, without hardcoding individual datasets as inputs?
I'd like to dynamically fetch different columns from multiple datasets without having to hardcode individual input datasets.
This doesn't work, since you will get inconsistent results every time you run CI. It will break TLLV (transforms-level logic versioning) by making it impossible to tell when the logic has actually changed, and thus when a dataset should be marked as stale.
You will have to write out the logical paths of each dataset you wish to transform, even if they are passed into a generated transform. There needs to be at least some consistent record of which datasets were targeted by which commit.
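One way to keep that record while still avoiding copy-pasted logic is to generate the transforms from an explicit list of paths checked into the repository. A rough sketch, assuming the standard Foundry transforms Python API (`transform_df`, `Input`, `Output`); the paths and the output naming rule are placeholders:

```python
# Sketch of a transform generator: every target dataset is still listed
# explicitly in SOURCE_PATHS (so CI and TLLV see a consistent set of inputs),
# but the shared logic is defined only once. Paths are placeholders.
from transforms.api import transform_df, Input, Output

SOURCE_PATHS = [
    "/Project/raw/dataset_a",
    "/Project/raw/dataset_b",
]

def make_transform(source_path):
    @transform_df(
        Output(source_path.replace("/raw/", "/clean/")),
        source=Input(source_path),
    )
    def compute(source):
        # Whatever shared logic you need, applied to each listed dataset.
        return source.dropDuplicates()

    return compute

# These then get registered with the project's pipeline as usual.
TRANSFORMS = [make_transform(path) for path in SOURCE_PATHS]
```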
Another tactic to achieve what you're looking for is to make a single long dataset that is the unpivoted version of the input datasets. That way, you can simply APPEND new rows / files to this dataset, which lets you accept arbitrary inputs, assuming your transform is constructed in such a way as to handle this.
My rule of thumb is this: if you need dynamic schemas or dynamic counts of datasets, then you're better off using dynamic files / row counts in a single dataset.
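If you go the single-long-dataset route, the transform can simply list whatever files have been appended to that one input at build time. A rough sketch, assuming the Foundry file-access API on transform inputs (`filesystem().ls()` / `open()`) and placeholder dataset paths; the unpivoting here is deliberately simplistic:

```python
# Sketch: one input dataset accumulates appended CSV files; a single transform
# lists whatever files are present and unpivots them into a long
# (file, column, value) format. Dataset paths are placeholders.
import csv

from pyspark.sql import Row
from transforms.api import transform, Input, Output

@transform(
    raw=Input("/Project/raw/accumulated_files"),
    out=Output("/Project/clean/unpivoted"),
)
def unpivot_files(ctx, raw, out):
    rows = []
    for f in raw.filesystem().ls(glob="*.csv"):
        with raw.filesystem().open(f.path) as fh:
            for record in csv.DictReader(fh):
                for column, value in record.items():
                    rows.append(Row(source_file=f.path, column=column, value=value))
    out.write_dataframe(ctx.spark_session.createDataFrame(rows))
```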
I am essentially creating a data warehouse.
For the warehouse to remain consistent with the source data, I have to pull changes daily from the source MySQL DBs.
My problem is that some source MySQL tables have no 'lastupdated' or equivalent column.
How can I pull changes in this scenario?
In a data warehouse, in order to capture changes in the target system, there has to be a way to identify the changed and new records on the source side. This is normally done with the help of a flag or 'lastupdated' column. If neither of these is present and the table is small, you can consider truncating and reloading the entire table from source to target, but this may not be feasible for large tables.
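Another common technique when there is no flag or 'lastupdated' column is full-extract comparison: pull the table in full and compare row hashes against the previous extract to find new and changed rows (this, too, gets expensive for very large tables). A rough pandas sketch, with hypothetical key and column handling:

```python
# Full-extract comparison sketch: hash the non-key columns of each row and
# compare against the previous extract to find new and changed rows.
# Key/column names are whatever your table uses; this is illustrative only.
import hashlib

import pandas as pd

def with_row_hash(df, key_cols):
    non_key = [c for c in df.columns if c not in key_cols]
    hashes = df[non_key].astype(str).agg("|".join, axis=1).map(
        lambda s: hashlib.sha256(s.encode()).hexdigest()
    )
    return df.assign(_hash=hashes)

def detect_changes(current, previous, key_cols):
    cur = with_row_hash(current, key_cols)
    prev = with_row_hash(previous, key_cols)[key_cols + ["_hash"]]
    merged = cur.merge(prev, on=key_cols, how="left", suffixes=("", "_prev"))
    new_rows = merged[merged["_hash_prev"].isna()]
    changed_rows = merged[
        merged["_hash_prev"].notna() & (merged["_hash"] != merged["_hash_prev"])
    ]
    drop = ["_hash", "_hash_prev"]
    return new_rows.drop(columns=drop), changed_rows.drop(columns=drop)
```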
You can also refer to some other techniques described in the blog below:
https://www.hvr-software.com/blog/change-data-capture/
I am working on a data warehouse project with a lot of source systems producing flat files, which we load into our staging tables using SSIS. We are currently using the Flat File Source component.
However, after a while we need an extra column in one of the files, and from a certain date the file specification changes to add that extra column. This happens quite frequently, and over time quite a lot of versions accumulate.
According to the answers I can find here and on the rest of the internet, the agreed method to handle this scenario seems to be to set up a new flat file source in a new, separate data flow for this version, to keep re-runnability of the ETL process for old files.
The method is outlined here, for example: SSIS pkg with flat-file connection with fewer columns will fail
In our specific setup, the changes are always additional columns (old columns are never removed), and for logical reasons the new columns cannot be mandatory if we are to keep re-runnability for the older files in their separate data flows.
I don't think the method of creating a duplicate data flow handling the same set of columns over and over again is a good answer for a data warehouse project like ours. I would prefer a source component that takes the latest file version and has the ability to mark columns as "not mandatory", delivering nulls if they are missing.
Is anybody aware of an SSIS flat file component that is more flexible in handling old file versions, or does anyone have a better solution for this problem?
I assume that such a component would need to approach the files on a named column basis rather than the existing left-to-right approach?
Any thoughts or suggestions are welcome!
The following will lose efficiency when processing (over having separate data flows), but will provide you with the flexibility to handle multiple file types within a single data flow.
You can arrange your flat file connection to return whole lines rather than individual columns, by specifying only the row delimiter. Connect this to a Flat File Source component, which will output a single column per row. Each row now holds an entire line from a file of one of the many file versions you are aware of; the next step is to determine which version you have.
Consume the output from the Flat File Source with a Script Component. Pass in the single column and pass out the superset of all possible columns. We have lost the metadata normally gleaned from a file source, so you will need to build up the column names / types / sizes in the Script Component's output columns.
Within the Script Component, take the line and break it into its constituent columns. You will have to perform a pattern match (perhaps using Regex.Match from System.Text.RegularExpressions) to identify where each new column starts. Hopefully the file is well formed, which will help; beware of quotes and commas within text columns.
You can now determine the file version from the number of columns you have, and default the missing columns. Set the row's output columns to pass out the constituent parts. You may want to add a column recording the file version to your output.
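To illustrate the parse-and-default step: a real Script Component would be written in C# or VB.NET, but the Python below shows the logic, with a hypothetical superset of columns and a CSV parser standing in for the regex matching.

```python
# Illustration of the Script Component logic only: parse one raw line, map it
# onto the superset of all known columns, and default whatever is missing.
# SUPERSET_COLUMNS is hypothetical; older file versions are assumed to be
# prefixes of newer ones (columns are only ever added on the right).
import csv
import io

SUPERSET_COLUMNS = ["order_id", "customer_id", "amount", "currency", "channel"]

def parse_line(line):
    # A CSV parser copes with quotes and embedded commas in text columns.
    values = next(csv.reader(io.StringIO(line)))
    row = dict(zip(SUPERSET_COLUMNS, values))
    # Columns not present in this file version default to None (NULL downstream).
    for missing in SUPERSET_COLUMNS[len(values):]:
        row[missing] = None
    # The column count identifies which file version the line came from.
    row["file_version_columns"] = len(values)
    return row
```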
The rest of the process should be able to load your table with a single data flow as you have catered for all file types within your script.
I would not recommend that you take the above on lightly. The benefit of SSIS is somewhat reduced when you have to code up all the columns / types etc.; however, it will provide you with a single data flow to handle every file version, and it can be extended as new columns are added.
I have a scenario where Data Stream B is dependent on Data Stream A. Whenever there is a change in Data Stream A, Stream B needs to be re-processed. So a common process is required to identify changes across data streams and trigger the re-processing tasks.
Is there a good way to do this besides triggers?
Your question is rather unclear, and I think any answer depends very heavily on what your data looks like, how you load it, how you can identify changes, whether you need to show multiple versions of one fact or dimension value to users, and so on.
Here is a short description of how we handle it, it may or may not help you:
1. We load raw data incrementally each day, i.e. we load all data generated in the last 24 hours in the source system (I'm glossing over timing issues, but they aren't important here).
2. We insert the raw data into a loading table; that table already contains all data that we have previously loaded from the same source.
3. If rows are completely new (i.e. the PK value in the raw data is new), they are processed normally.
4. If we find a row where we already have the PK in the table, we know it is an updated version of data that we've already processed.
5. Where we find updated data, we flag it for special processing and re-generate any data depending on it (this is all done in stored procedures).
I think you're asking how to do step 5, but it depends on the data that changes and what your users expect to happen. For example, if one item in an order changes, we re-process the entire order to ensure that the order-level values are correct. If a customer's address changes, we have to re-assign them to a new sales region.
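To make steps 3-5 concrete, here is a rough sketch of the primary-key comparison and flagging; the table and column names are hypothetical, and in our case this logic actually lives in stored procedures rather than Python.

```python
# Sketch of steps 3-5: split incoming raw rows into brand-new rows (unknown PK)
# and updated rows (PK already in the loading table), flagging updates for
# re-processing. Names are hypothetical; our real implementation is in
# stored procedures.
import pandas as pd

def classify_incoming(incoming: pd.DataFrame,
                      loading_table: pd.DataFrame,
                      pk: str = "order_id"):
    known = incoming[pk].isin(set(loading_table[pk]))
    new_rows = incoming[~known].assign(needs_reprocess=False)
    updated_rows = incoming[known].assign(needs_reprocess=True)
    return new_rows, updated_rows
```

Everything flagged as needing re-processing then drives the regeneration of the dependent data, e.g. re-processing the whole order or re-assigning the customer.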
There is no generic way to identify data changes and process them, because everyone's data and requirements are different and everyone has a different toolset and different constraints and so on.
If you can make your question more specific then maybe you'll get a better answer, e.g. if you already have a working solution based on triggers then why do you want to change? What problem are you having that is making you look for an alternative?