SSIS Slowly changing dimension column

I'm using a Slowly Changing Dimension in SSIS, and I'm using a single column called active, of type BIT, to mark the latest records instead of start-date and end-date columns.
My problem is the following: I want to set the active value to 0 for records that are no longer present in the source file.
For example, imagine my DWH is empty and the source file contains the following data (salary is the attribute being historized):
employee_ID|NAME|salary
117|a|100
125|b|150
378|c|200
Now once I load those into my DWH I get the following data:
employee_code|employee_ID|NAME|salary|active
1|117|a|100|1
2|125|b|150|1
3|378|c|200|1
Everything is good so far, but now imagine I get a new source file where the data looks like this:
employee_ID|NAME|salary
117|a|120
125|b|150
When I load this data into the data warehouse I get the following:
employee_code|employee_ID|NAME|salary|active
1|117|a|100|0
2|125|b|150|1
3|378|c|200|1
4|117|a|120|1
Everything makes sense: employee a's salary has changed, so a new record is added to the DWH and the old record's active value is set to 0. Employee b's salary stayed the same, so there is no need to add a new record. However, employee c no longer exists in the source file (he quit or got fired). I want to know if there is a way to set the active value to 0 in such a situation.
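The built-in SCD transform only reacts to rows that arrive in the data flow, so it cannot detect deletes on its own. One common workaround is to land the source file in a staging table and, after the data flow, run an Execute SQL Task that closes out any active dimension rows whose business key no longer appears in the staging data. A minimal sketch, assuming hypothetical table names dim_employee and stg_employee:

-- Hedged sketch: deactivate active dimension rows whose employee_ID is absent
-- from the current load (dim_employee and stg_employee are assumed names)
UPDATE d
SET    d.active = 0
FROM   dim_employee AS d
WHERE  d.active = 1
  AND NOT EXISTS (SELECT 1
                  FROM   stg_employee AS s
                  WHERE  s.employee_ID = d.employee_ID);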

Data factory copy based off last high water mark value (Dynamic date)

I'm currently working on a project where I need the data factory pipeline to copy based off the last run date.
The process breakdown....
Data is ingested into a storage account
The ingested data is in the directory format topic/yyyy/mm/dd, i.e. multiple files are brought into a single directory each day, so the files are partitioned by year, month and day.
The process currently filters on the last high-water-mark date, which is updated each time the pipeline runs (it triggers daily at 4am). Once the copy succeeds, a Set Variable activity increases the high-water-mark value by 1 (i.e. one day). However, files are not brought over on the weekends, and this is the problem.
The date value (HWM) will not increase if no files are brought over, so the pipeline keeps looping over the same date.
How do I get the pipeline to advance, or look for the next file in that directory, given that I use the HWM as the directory path to the file, and copy and update the HWM value dynamically only when the copy has completed? Current update logic:
Current lookup of the HWM value and the directory path used to copy the files:
Instead of adding 1 to the last high-water-mark value, you can update the watermark with the current UTC time. That way, even when the pipeline is not triggered on some days, the data will still be copied to the correct destination folder. I have tried to reproduce this in my environment, and below is the approach.
A watermark table is created initially with the watermark value '1970-01-01'.
This table is referenced in the Lookup activity.
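For reference, a minimal sketch of such a watermark table (the column names match the update statement further down; the data types are assumptions):

-- Hypothetical watermark table used by the Lookup activity
CREATE TABLE watermark_table (
    tab_name        VARCHAR(50) PRIMARY KEY,
    watermark_value DATETIME
);
INSERT INTO watermark_table (tab_name, watermark_value)
VALUES ('tab1', '1970-01-01');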
A Copy Data activity is added, and in the source the query is given as:
select * from tab1 where lastmodified > '@{activity('Lookup1').output.firstRow.watermark_value}'
In the sink, Blob storage is used. In order to get a year/month/day folder structure,
@concat(formatDateTime(utcnow(),'yyyy'),'/',formatDateTime(utcnow(),'MM'),'/',formatDateTime(utcnow(),'dd'))
is given as the folder path.
The file is copied to a path like the one below.
Once the file is copied, the watermark value is updated with the current UTC date:
update watermark_table
set watermark_value='@{formatDateTime(utcnow(),'yyyy-MM-dd')}'
where tab_name='tab1'
When the pipeline is triggered the next day, rows modified after the watermark value are copied, and once the file is copied, the current UTC date is saved as the new watermark value.
I think, reading the post a couple of times, what I understood is:
You already have watermark logic.
On the weekend, when there are NO files in the folder, the current logic does NOT increment the watermark, and so you are facing issues.
If I understand the ask correctly, please use the dayOfWeek() function. Add an If Condition and let the current logic execute only when the day of the week is Monday (2) to Friday (6).
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expressions-usage#dayofweek
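The linked dayOfWeek() is the mapping data flow function (Sunday = 1, hence Monday through Friday being 2-6). If the check is done with an If Condition activity in the pipeline itself, the pipeline expression language also has a dayOfWeek() function, but it is zero-based (Sunday = 0), so a weekday check would look roughly like this sketch:

@and(greaterOrEquals(dayOfWeek(utcnow()), 1), lessOrEquals(dayOfWeek(utcnow()), 5))

With that as the If Condition expression, the existing copy-and-update logic would go inside the True branch.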

Is there a way to store database modifications with a versioning feature (for later comparison of versions)?

I'm working on a project where users can upload Excel files into a MySQL database. Those files are the main source of our data, as they come directly from the contractors working with the company. They contain a large number of rows (23,000 on average per file) and 100 columns per row!
The problem I am currently facing is that the same file can be changed by someone (either the contractor or the company), and when it is re-uploaded, my system should detect the changes, update the actual data, and record the action (the fact that a cell went from one value to another :: oldValue -> newValue) so we can go back and run a comparison of versions (e.g. 3 re-uploads === 3 versions; oldValue in version 1 vs newValue in version 5).
I developed a small mechanism for saving the changes: I have a table to save import metadata (each time a user imports a file, a new row is inserted into this table) and another table for saving the actual changes.
Versioning data
I save the id of the row that has changed, as well as the id and the table where the actual data was modified (uploading a file results in insertions into multiple tables, so whenever a change occurs I need to know in which table it happened). I also save the new value and the old value, which will help me restore the "archived data".
To restore a version: SELECT * FROM Archive WHERE idImport = ${versionNumber}
To restore a version for one row: SELECT * FROM Archive WHERE idImport = ${versionNumber} AND rowId = ${rowId}
To restore all versions for one row: SELECT * FROM Archive WHERE rowId = ${rowId}
To restore versions for one table: SELECT * FROM Archive WHERE tableName = ${table}
Etc.
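For context, a minimal sketch of the archive table these queries assume (column names beyond the ones mentioned above are guesses):

-- Hypothetical layout of the Archive table described in the post
CREATE TABLE Archive (
    idArchive INT AUTO_INCREMENT PRIMARY KEY,
    idImport  INT NOT NULL,          -- upload/version that produced the change
    rowId     INT NOT NULL,          -- id of the changed row in the target table
    tableName VARCHAR(64) NOT NULL,  -- table in which the change happened
    fieldName VARCHAR(64) NOT NULL,  -- column that changed
    oldValue  TEXT,
    newValue  TEXT
);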
Now, with this structure, I'm struggling to restore a version or to run a comparison between two versions, which makes me think that I've come up with the wrong approach, since it makes the job hard! I would like to know whether anyone has done this before, or what a good approach would look like.
Cases when things get really messy:
The rows that have changed in one version might not have changed in the other version (I am working on a "time machine" to search other versions when this happens).
The rows have changed in both versions, but not the same fields. (Say we have a user table, and the data of the user with id 15 changed in the 2nd and 5th uploads. Great! Now, in the second version only the name was changed, but in the fifth version the address was changed. When comparing these two versions, we run into a problem constructing our data array: the name went from "some" -> NULL (the name was never null; there were no name changes in the 5th version) and the address went from NULL -> "some", which is obviously wrong.)
My actual approach (PHP)
<?php
// Join the two record sets and compare them
$deltaRow = [];
foreach ($firstRecord as $frecord) {
    // Retrieve the first record's fields that have changed
    $fFields = $frecord->fieldName;
    // Check whether the same record has changed in the second version as well
    $sId = array_search($frecord->idRecord, $secondRecord);
    if ($sId !== false) {
        $srecord = $secondRecord[$sId];
        // Retrieve the second record's fields that have changed
        $sFields = $srecord->fieldName;
        // Compare the fields of the two records
        foreach ($fFields as $fField) {
            $sfId = array_search($fField, $sFields);
            // The same field of the same record was changed in both versions (perfect case)
            if ($sfId !== false) {
                $deltaRow[$fField]["oldValue"] = $frecord->deltaValue;
                $deltaRow[$fField]["newValue"] = $srecord->deltaValue;
                // Remove the checked field from the second version's list to avoid re-checking
                unset($sFields[$sfId]);
            }
            // The changed field in V1 was not found in V2 -> look a value up
            else {
                $deltaRow[$fField]["oldValue"] = $frecord->deltaValue;
                $deltaRow[$fField]["newValue"] = $this->valueLookUp();
            }
        }
        $dataArray[] = $deltaRow;
        $deltaRow = [];
        // Remove the checked record from the second version's set to avoid re-checking
        unset($secondRecord[$sId]);
    }
}
I don't know how to deal with that. As I said, I'm working on a value-lookup algorithm, so when no data is found in a version I will try to find it in the versions between these two, so I can construct my data array. I would be very happy if anyone could give me some hints, ideas or improvements so I can go further with this.
Thank you!
Is there a way to store database modifications with a versioning feature (for eventual versions comparaison [sic!])?
What constitutes versioning depends on the database itself and how you make use of it.
As far as a relational database is concerned (e.g. MariaDB), this boils down to the so-called normal forms, which are numbered.
On Database Normalization: 5th Normal Form and Beyond you can find the following guidance:
Beyond 5th normal form you enter the heady realms of domain key normal form, a kind of theoretical ideal. Its practical use to a database designer os [sic!] similar to that of infinity to a bookkeeper - i.e. it exists in theory but is not going to be used in practice. Even the most demanding owner is not going to expect that of the bookkeeper!
One strategy to step into these realms is to reach the 5th normal form first (do this just in theory, by going through all the normal forms, and study database normalization).
Additionally, you can implement versioning outside of, and in addition to, the database itself, e.g. by creating your own versioning system. Reading about what you can do with normalization will help you find better ways to structure and handle the database data for your versioning needs.
However, as written, it depends on what you want and need, so no straightforward "code" answer can be given to such a general question.

LabVIEW - writing data from multiple DAQ Assistants in the same .csv-file

I have the following problem with my VI, which I could not solve by myself or research:
When running the VI, the data should be stored in a .csv file. In the pictures, you can see the block diagram. When running, it produces the following file:
Test Steady State
T_saug_1/T_saug_2/Unbelegt/Unbelegt/T_ND/T_HD/T_Wasser_ein/T_Wasser_aus/T_front/T_back/T-right/T-left
18,320 18,491 20,873 20,838 20,463 20,969 20,353 20,543 20,480 20,618
20,618 20,238
As you can see, the data gets stored only in the first column (in the preview of the post it looks like a row, but it is really a column; "Test Steady State" is the header). But these temperatures are not the temperatures of the first sensor; it somehow stored the value of every sensor in the respective row. When the first row was filled, it stopped storing data entirely. I did not figure out how I could attach a file here, otherwise I would have done so... I want to store the data for each sensor in the associated column.
Another problem I have: the waveform-chart, which shows all the temperatures, only updates every 4-6 seconds. Not only is the interval between every update not always the same, but from my understanding it should update every second since the while-loop has a wait-timer set to 1000ms. I don't know what my mistake here is...
Please let me know if you have any ideas on how to solve the problems I have or suggestions where I could find answers to my questions. I am very new to LabVIEW, I am sorry if this question is silly.
With best regards and thank you for the patient help,
lempy.
csv-file
Block diagram
DAQ-Assis. for PT100
DAQ-Ass. for TC
The Write Delimited Spreadsheet VI has two boolean inputs: Append to file? and transpose?
Append to file? is not wired for the first write, so it defaults to FALSE. That means the file is overwritten on each iteration. For the second and third calls it is set to TRUE, so that data is appended.
The simplest solution is to put the first two write functions outside the main loop. This overwrites the file with the headers at the start of the VI, and values will then be appended as desired.
transpose? will swap rows and columns. Wire TRUE to it, and check if it works.
About your second question:
A loop runs only as fast as the slowest process inside it. If the graph updates only every 6 s, something takes 6 s to complete. My guess is that the temperature readings take that long...

Slowly Changing Dimension Transform in SSIS won't update

I used the following from a CSV to test the SCD. I thought it would recognize the LocationIDs and update the records where necessary. But it did not. It only inserts new records.
I'm using Visual Studio 2010 and SQL Server 2012 with Windows Authentication (I assume it's not a permissions issue, because it doesn't seem to be acknowledging the changes to the historical data at all, as you can see in the picture of the executed package). I also have Windows 7 Home Premium.
There were a lot of NULLs in the original, and this set also has changes, but the changes are not committed. Also notice that when I add a new location, both rows are added even though the LocationIDs are the same.
Input into the SSIS package. Look, no NULLs! But the data above was not updated.
LocationID,Locations,Address,City,State,Zip,Phone,Country,Region
9,Pluto Disney,5000 Out this World,PlanetRock,PL,85338,(902) 504-1747,US,SolarSystem
1,Disney Lend,159 Mickey Mouse Road,Orlando,FL,58741,(201) 345-1234,US,North
2,Disney Werld,98532 Donald Duck Boulevard,Los Angelos,SA,75523,(601) 375-1345,US,South
3,Disney Pleyground,449 Smoke Mountain Lane,Atlanta,GA,24747,(804) 375-1126,US,East
4,Cajun Desney,Jazz Land Avenue,New Orleans,LA,88888,(904) 325-1237,US,West
5,Wild West Desney,Magic Kingdom Street,Somewhere West,CO,21543,(804) 346-1274,US,Northwest
3,Disney Super Playground,449 Smoke Mountain Lane,Atlanta,GA,24747,(864) 375-1526,US,East
4,Cajen Disney,Jazz Land Avenue,New Orleans,LA,88888,(904) 525-1237,US,West
6,Winter Disney,0 Ice Land Avenue,New Orleans,LA,85588,(900) 507-1297,US,North
2,Disney World,98532 Donald Duck Boulevard,Los Angelos,CA,75523,(671) 375-1345,US,South
7,Desert Disney,100 Melting Pot Way,Phoenix,AZ,85338,(902) 504-1747,US,Southwest
9,Plutian Disney,5000 Out this World,PlanetRock,PL,85338,(902) 504-1747,US,SolarSystem
10,Martian Disney,3000 Rover Drive,RedRock,M,85338,(902) 504-1747,US,SolarSystem
Here are the pictures from my SCD Package
This is where I map all my incoming attributes to the Database attributes.
Almost all the data is historical, but NO UPDATES.
For the next one I've tried different values; it doesn't make a difference which one I pick or if I deselect them all.
I've kept this the same (never changed)
I've enabled and disabled this one. No Results
The finished Screen
OK, I figured it out. It took some thinking through.
If "Fail the transformation if it detects changes in fixed attributes" is selected, as it is below, then the whole package will fail. If you deselect it, the package will run, and the SCD transform will let changes go through except for rows where it detects a change in a fixed attribute. SO WHAT THIS MEANS is that it does not ERROR OUT or cancel the whole package the way it does when checked, but it STILL DOESN'T ignore the fixed-attribute change or allow the other changes to take effect for a row whose fixed attribute has changed.
The problem is that the book I have suggests using a Derived Column transform to create a DateCreated column with a GETDATE() function in the Expression column, to record when the row was originally created. The author then suggests setting this column as a Fixed attribute (even though it isn't actually fixed, since it always enters the SCD with the current date). The SCD detects that the DateCreated column's value is different from the one in the database, and so all those rows fail to update because of that one change.
So it was the book's fault.

How to create an SSIS package which creates three text files, using the same variables, where each text file is only created when matching data is found?

There are only 3 files that can be created: "File_1", "File_2" and "File_3". The same variable names are used in each instance (User::FileDirectory and User::File_name), but because the actual value of the variable changes, a new file is created. However, the files are only created if there is data to go into them, i.e. if there are no records to populate a file, it will not be created at all. When the files are created, the date the file was created should also be added to the filename, e.g. File1_22102011.txt.
OK, if the above was a little confusing, the following is how it works:
All the files use the same variable, but it is reset before each file is created.
• First it populates a result set in memory with the first SQL selection (ID number, First_Name and Main_Name). It sets the file variable to "File_1". If there are records in the result set, it creates and writes to this filename.
• Then it creates a new result set with the second selection (Contract No). It sets the variable to "File_2". If there are records in this new result set, a new file will be created from the variable (which now has a new value).
• Finally a third result set is created (Contract_no, ExperianNo, Entity_ID_Number, First_Name, Main_Name), and the file variable is set to "File_3". Again, if there are records in the result set, this file will be created and written to.
I have worked on a few methods to achieve this, but they have all failed, so a little help would be greatly appreciated.
While what you have works, I think it'd be rather painful to maintain.
I would approach it as 3 sequence containers running in parallel. Each container would have a data flow and two file tasks hanging off it, based on the success of the parent and the value of a row count variable. If the row count variable is 0, delete the file. If it's greater than 0, rename it to File_n.
As you can see, I have a container for the first file. The data flow creates an output a.txt file. Based on the value of the variable @RowCount1, it will either delete the empty file or rename it to File_1.
Each data flow would look like a source query, a Row Count transformation and a flat file destination with a temporary name (a.txt, b.txt, c.txt). As a file is always created, even if it's empty, we need to delete or rename it afterwards, which is accomplished by the file operation tasks.
In my opinion, this approach will be cleaner, as it allows you to test and debug each item independently rather than dealing with an in-memory dataset.
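For the date suffix asked about in the question (File1_22102011.txt), the rename target used by the File System Task could come from a variable whose expression appends a ddMMyyyy stamp. A rough sketch of such an SSIS expression, reusing User::FileDirectory and User::File_name from the question (everything else here is an assumption, not the poster's exact setup):

@[User::FileDirectory] + @[User::File_name] + "_"
  + RIGHT("0" + (DT_WSTR, 2) DAY(GETDATE()), 2)
  + RIGHT("0" + (DT_WSTR, 2) MONTH(GETDATE()), 2)
  + (DT_WSTR, 4) YEAR(GETDATE()) + ".txt"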