Change Record Integration - SSIS

We need to perform an end-of-day process to extract the daily transactions from System A and transfer only the changes to System B.
The problem is that System A can only provide its full set of transactions.
My initial thought was to use a staging table (SQL Server) that persists the data from System A and is then used for comparison on each execution of the end-of-day process. This can all be done with table joins to identify the required UPDATEs, INSERTs and DELETEs.
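Roughly, the comparison I have in mind would look like this (just a sketch: StagingToday holds the latest full extract from System A, StagingPrevious holds the copy from the prior run, and all table/column names are placeholders):
-- INSERTs: rows in today's extract that were not there last time
SELECT s.TransactionId
FROM StagingToday AS s
LEFT JOIN StagingPrevious AS p ON p.TransactionId = s.TransactionId
WHERE p.TransactionId IS NULL
-- DELETEs: rows from the prior run that are missing from today's extract
SELECT p.TransactionId
FROM StagingPrevious AS p
LEFT JOIN StagingToday AS s ON s.TransactionId = p.TransactionId
WHERE s.TransactionId IS NULL
-- UPDATEs: rows present in both where a tracked column has changed
SELECT s.TransactionId
FROM StagingToday AS s
JOIN StagingPrevious AS p ON p.TransactionId = s.TransactionId
WHERE s.Amount <> p.Amount OR s.Status <> p.Status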
Not being an SSIS expert, I understand this could be done in SSIS using Lookups to identify the additions, updates and deletions.
Question:
Is the SSIS solution a better approach, and why (maintainability, scalability, extensibility)?
Which would perform better? Any experience with these two options?
Is there any alternative option?

Since you need the full set of transactions from System A, that limits your options as far as the source goes. I recommend pulling that data down to a Raw File Destination. This will help you as you develop, since you can just rerun the tasks that need that data over and over without refetching it. Also, make sure the source data is sorted by the originating system: SSIS is very weak at sorting unless you use a 3rd-party component (which may be a career-limiting decision in some cases).
Anyway, let's assume that you have that sorted Raw File lying around. Next thing you do is toss that into a Data Flow as a Raw File Source. Then, have an OLEDB (or whatever) source that represents System B. You could use Raw File for that, also, if you like. Make sure that the data from System B is sorted using the same columns you used to sort System A.
Mark the sources with IsSorted=True, and set the SortKeyPosition on the appropriate columns in the output metadata. This tells SSIS that the data is pre-sorted, and it will permit you to join on your key columns. Otherwise, you may wait days for SSIS to sort big sets.
Add a Multicast to both System A's and System B's sources, because we want to use each of them twice.
Next, add a Merge Join to join the two Raw File Sources together. Make System A the left input. System B will become the right input when you connect it to the Merge Join. SSIS will automatically set the JOIN up on those sorted columns that you marked in the previous step. Set the Merge Join to use LEFT JOIN. This way, we can find rows in System A that do not exist in System B, and we can compare existing rows to see if they were changed.
Next up, add a Conditional Split. In there, you can define 2 output buffers based on conditions.
NewRows: ISNULL(MyRightTable.PrimaryKey)
UpdatedRows: [Whatever constitutes an updated row]
The default output will take whatever rows do not meet those 2 conditions, and may be safely ignored.
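For the UpdatedRows condition, compare the non-key columns coming from the two sides of the join. For instance, if (hypothetically) Amount and Status are the columns that can change, the expression could be something like:
UpdatedRows: !ISNULL(MyRightTable.PrimaryKey) && (MyLeftTable.Amount != MyRightTable.Amount || MyLeftTable.Status != MyRightTable.Status)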
We are not done.
Add another Merge Join.
This time, make an input from System B's MultiCast the left input. Make an input from System A's MultiCast the right input. Again, SSIS will set the join keys up for you. Set the Merge Join to use Left Join.
Add a Conditional Split down this flow, and the only thing you need is this:
DeletedRows: ISNULL(MyRightTable.PrimaryKey)
The default output will take all of the other rows, which can be ignored.
Now you have 3 buffers.
2 of them come out of your first Merge Join, and represent New Rows and Updated Rows.
1 of them comes out of your second Merge Join, and represents Deleted Rows.
Take action.

Related

How to get changes made to a MySQL table during a time interval?

I am essentially creating a data warehouse.
For the warehouse to remain consistent with the source data, I have to pull changes daily from the source MySQL DBs.
My problem is that, in some source MySQL tables, there is no 'lastupdated' equivalent column.
How can I pull changes in this scenario?
In a data warehouse, in order to capture the changes in the target system, there has to be a way to identify the changed and new records on the source side. This is normally done with the help of some flag or last-updated column. But if neither of these is present and the table is small, you can consider truncating and reloading the entire data set from source to target. That may not be very feasible for large tables, though.
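For a small table, that full reload is as simple as something like the following (schema and table names are made up, and this assumes the source and warehouse schemas are reachable from the same connection; otherwise export and import the data instead):
TRUNCATE TABLE dw.currency;
INSERT INTO dw.currency (currency_id, currency_code, currency_name)
SELECT currency_id, currency_code, currency_name
FROM src.currency;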
You can also refer to some other techniques mentioned in the blog below:
https://www.hvr-software.com/blog/change-data-capture/

MySQL update table at the same time

Let's say in MySQL I want to update a column in one of the tables. I need to SELECT the record, change the value, and after that UPDATE it back to the database. In some cases I can't do these two operations in one SQL query or nest them in a subquery (due to a MySQL limitation), so I have to load the value into a program (let's say Java), change it, and then put it back into the database.
For example, program A gets a column's value, wants to increase it by one, and then put it back. At the same time, program B wants to do the same thing. Before program A puts back the increased value, program B has already read the wrong value (program B is supposed to get the value after program A's increase, but since it runs at the same time as program A, it retrieves the same value as A).
Now my question is: what are good ways to handle this kind of problem?
My other question is: I believe MySQL isn't a single-threaded system, but if two identical queries (updating the same table, same column and same record) come in at the same time, how does MySQL handle this situation? Which one will it schedule first and which one later?
Moreover, could anyone explain a bit how MySQL's multithreading support works? One connection, one thread? So all the statements created under that connection are scheduled in the same queue?
If you're using InnoDB, you can use transactions to provide fine-grained mutual exclusion.
If you're using MyISAM, you can use LOCK TABLES to prevent B from accessing the table until A finishes making its changes.
If two clients try to update the same field at the same time, it's unpredictable which one will win the race. The database has internal mutual exclusion to serialize the two queries, but the specific order is essentially random.
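For the InnoDB case, the usual pattern for a read-modify-write like your counter example is SELECT ... FOR UPDATE inside a transaction, so that B blocks on the row until A commits (the counters table and its columns here are purely illustrative):
START TRANSACTION;
-- lock the row; a concurrent session doing the same SELECT ... FOR UPDATE will wait here
SELECT counter_value FROM counters WHERE id = 1 FOR UPDATE;
-- compute the new value in the application if needed, then write it back
UPDATE counters SET counter_value = counter_value + 1 WHERE id = 1;
COMMIT;
(For a pure increment, a single UPDATE counters SET counter_value = counter_value + 1 is already atomic on its own; the explicit lock matters when the new value has to be computed outside the database.)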

SSIS lookups - unknown code determination

Multiple SSIS packages import data into multiple reference tables (e.g. Country, Currency, Language, etc.).
Each of these tables has a special value for the case where an incoming data code is not part of the codes found in these reference tables (e.g. Language has an Unknown code). This is done to keep referential integrity and to track incoming codes that are unknown to our system. It is completely normal and we need to keep this process.
How can SSIS easily determine that an incoming value is not part of the referenced set of codes? When this happens, how can it assign the Unknown code to it?
Is there a way to do this globally over several columns?
I am trying to avoid using a Lookup task for each column in the source.
Thanks for your time.
The only possible way I see is a Merge Join (with a full join) against the codes table, followed by a Derived Column to transform the NULLs into whatever you want.
But why don't you want to use a Lookup? Just because of the number of columns you have to look up, and you are worried about performance? If that's the problem, I suggest you try implementing the Lookups with the Full Cache option configured. This way, the lookup query (the codes in your example) will be executed only once and the result will be kept in memory.
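Either way, the logic you end up with is the set-based equivalent of a left join with a fallback value, i.e. something like this in T-SQL (table names and the 'UNK' codes are made up):
SELECT s.RowId,
       COALESCE(l.LanguageCode, 'UNK') AS LanguageCode,
       COALESCE(c.CountryCode, 'UNK') AS CountryCode
FROM Staging AS s
LEFT JOIN dbo.RefLanguage AS l ON l.LanguageCode = s.LanguageCode
LEFT JOIN dbo.RefCountry AS c ON c.CountryCode = s.CountryCode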
Use a Lookup transformation. That's the easiest way of achieving this.

Loading a fact table in SSIS when obtaining the dimension key isn't easy

I have a fact table that needs a join to a dimension table; however, obtaining that relationship from the source data isn't easy. The fact table is loaded from a source table that has around a million rows, so in accordance with best practice, I'm using a previous run date to select only the source rows that have been added since the previous run. After getting the rows I wish to load, I need to go through three other tables in order to do the lookup to the dimension table. Each of the three tables also has around a million rows.
I've read that best practice says not to extract source data that you know won't be needed. And best practice also says to be as light-touch as possible on the source system and therefore avoid SQL joins. But in my case, those two best practices become mutually exclusive. If I only extract changed rows from the intermediary tables then I'll need to do a join in the source query. If I extract all the rows from the source system then I'm extracting much more data than I need, and that may cause SSIS memory/performance problems.
I'm leaning towards a join in the extraction of the source data but I've been unable to find any discussions on the merits and drawbacks of that approach. Would that be correct or incorrect? (The source tables and the DW tables are in Oracle).
Can you stage the three source tables that you are referencing? You may not need them in the DW, but you could have them sitting in a staging database purely for this purpose. You would still need to keep them up to date, but assuming you can just pull over the changes, this may not be too bad.
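As a rough sketch (every name here is hypothetical, and it assumes each of those tables exposes some column you can filter changes on and that the source tables are reachable from the staging schema; otherwise the pull would be an SSIS data flow instead), the daily pull into staging and the local key resolution could look like:
-- incremental pull, one statement per bridging table
INSERT INTO stg_bridge1 (bridge1_id, fact_ref, dim_ref, modified_date)
SELECT bridge1_id, fact_ref, dim_ref, modified_date
FROM src_bridge1
WHERE modified_date > :previous_run_date;
-- the dimension key lookup then runs entirely against local copies
SELECT f.fact_id, d.dimension_key
FROM stg_fact f
JOIN stg_bridge1 b1 ON b1.fact_ref = f.fact_id
JOIN stg_bridge2 b2 ON b2.bridge1_ref = b1.bridge1_id
JOIN stg_bridge3 b3 ON b3.bridge2_ref = b2.bridge2_id
JOIN dim_table d ON d.natural_key = b3.dim_natural_key;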

SSIS - Bulk Update at Database Field Level

Here's our mission:
Receive files from clients. Each file contains anywhere from 1 to 1,000,000 records.
Records are loaded to a staging area and business-rule validation is applied.
Valid records are then pumped into an OLTP database in a batch fashion, with the following rules:
If the record does not exist (we have a key, so this isn't an issue), create it.
If the record exists, optionally update each database field. The decision is made based on one of 3 factors...I don't believe it's important what those factors are.
Our main problem is finding an efficient method of optionally updating the data at a field level. This is applicable across ~12 different database tables, with anywhere from 10 to 150 fields in each table (original DB design leaves much to be desired, but it is what it is).
Our first attempt has been to introduce a table that mirrors the staging environment (1 field in staging for each system field) and contains a masking flag. The value of the masking flag represents the 3 factors.
We've then put an UPDATE similar to...
UPDATE OLTPTable1 SET Field1 = CASE
    WHEN Mask.Field1 = 0 THEN Staging.Field1
    WHEN Mask.Field1 = 1 THEN COALESCE( Staging.Field1 , OLTPTable1.Field1 )
    WHEN Mask.Field1 = 2 THEN COALESCE( OLTPTable1.Field1 , Staging.Field1 )
...
As you can imagine, the performance is rather horrendous.
Has anyone tackled a similar requirement?
We're a MS shop using a Windows Service to launch SSIS packages that handle the data processing. Unfortunately, we're pretty much novices at this stuff.
If you are using SQL Server 2008, look into the MERGE statement; it may be suitable for your upsert needs here.
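A rough sketch with made-up names (the mask-driven CASE logic would go in the UPDATE SET list; here the mask value is assumed to travel with the staging row, otherwise join the mask table into the USING query):
MERGE dbo.OLTPTable1 AS t
USING dbo.Staging1 AS s
    ON t.KeyField = s.KeyField
WHEN MATCHED THEN
    UPDATE SET t.Field1 = CASE
        WHEN s.MaskField1 = 0 THEN s.Field1
        WHEN s.MaskField1 = 1 THEN COALESCE(s.Field1, t.Field1)
        WHEN s.MaskField1 = 2 THEN COALESCE(t.Field1, s.Field1)
        ELSE t.Field1
    END
WHEN NOT MATCHED BY TARGET THEN
    INSERT (KeyField, Field1)
    VALUES (s.KeyField, s.Field1);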
Can you use a Conditional Split for the input to send the rows to a different processing stage dependent upon the factor that is matched? Sounds like you may need to do this for each of the 12 tables but potentially you could do some of these in parallel.
I took a look at the merge tool, but I'm not sure it would allow for the flexibility to indicate which data source takes precedence based on a predefined set of rules.
This function is critical to allow for a system that lets multiple members utilize the process that can have very different needs.
From what I have read, the Merge function is more of a sorted union.
We do use an approach similar to what you describe in our product for external system inputs (we handle a couple of hundred target tables with up to 240 columns). Like you describe, there's anywhere from 1 to a million or more rows.
Generally, we don't try to set up a single mass update, we try to handle one column's values at a time. Given that they're all a single type representing the same data element, the staging UPDATE statements are simple. We generally create scratch tables for mapping values and it's a simple
UPDATE target SET target.column = mapping.resultcolumn WHERE target.sourcecolumn = mapping.sourcecolumn.
Setting up the mappings is a little involved, but we again deal with one column at a time while doing that.
I don't know how you define 'horrendous'. For us, this process is done in batch mode, generally overnight, so absolute performance is almost never an issue.
EDIT:
We also do these in configurable-size batches, so the working sets & COMMITs are never huge. Our default is 1,000 rows per batch, but some specific situations have benefited from batches of up to 40,000 rows. We also add indexes to the working data for specific tables.
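For illustration, a batching loop like the one described could look roughly like this in T-SQL (table and column names are invented):
DECLARE @BatchSize INT = 1000;
WHILE 1 = 1
BEGIN
    UPDATE TOP (@BatchSize) t
    SET t.TargetColumn = m.ResultColumn,
        t.IsProcessed = 1
    FROM dbo.WorkingTable AS t
    JOIN dbo.ValueMapping AS m ON m.SourceValue = t.SourceColumn
    WHERE t.IsProcessed = 0;

    IF @@ROWCOUNT = 0 BREAK; -- nothing left to map
END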