SSIS lookups - unknown code determination

Multiple SSIS packages import data into multiple reference tables (e.g. Country, Currency, Language, etc.).
Each of these tables has a special value for the case where an incoming code is not among the codes in the reference table (e.g. Language has an Unknown code). This is done to keep referential integrity and to track incoming codes that are unknown to our system. This is completely intentional and we need to keep this process.
How can SSIS easily determine that an incoming value is not part of the referenced set of codes, and when that happens, how can it assign the Unknown code instead?
Is there a way to do this globally over several columns?
I am trying to avoid using a Lookup task for each column in the source.
Thanks for your time.

The only approach I can see is a Merge Join (configured as a full outer join) against the codes table, followed by a Derived Column transformation to turn the NULLs into whatever value you want.
But why don't you want to use a Lookup? Is it just because of the number of columns you have to look up and you're worried about performance? If that's the problem, I suggest you try implementing the lookups with the Full Cache option. That way the lookup query (the codes table in your example) is executed only once and the result set is kept in memory.
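For illustration, here is the set-based pattern that either approach (Merge Join + Derived Column, or a Lookup with its no-match output redirected) reproduces inside the data flow. This is only a sketch; the table and column names (dbo.Staging_Sales, dbo.RefLanguage, dbo.RefCountry) and the 'UNK' Unknown code are hypothetical stand-ins for your own reference tables:

    -- Hypothetical names throughout; 'UNK' stands in for each table's Unknown code.
    SELECT s.SaleId,
           COALESCE(l.LanguageCode, 'UNK') AS LanguageCode,  -- no match: fall back to Unknown
           COALESCE(c.CountryCode,  'UNK') AS CountryCode
    FROM dbo.Staging_Sales AS s
    LEFT JOIN dbo.RefLanguage AS l ON l.LanguageCode = s.LanguageCode
    LEFT JOIN dbo.RefCountry  AS c ON c.CountryCode  = s.CountryCode;

In the data flow the same substitution happens per column: either the Lookup's no-match output feeds a Derived Column that supplies the Unknown code, or the joined column is tested for NULL and replaced.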

Use a Lookup transformation. That's the easiest way of achieving this.

Related

SSIS Data Flow: duplicated rule problem after lookup

I have a data flow in which I need to get a column value from 'SQL tableA' and do a lookup in 'SQL tableB' using that value. If the lookup finds a match between the two tables, I need to get the value of another column from 'SQL tableA' and put it in 'SQL tableC' (the table that will be persisted). If the lookup fails, that column value will be NULL.
My problem: after this behaviour, the rest of my flow is the same, so I have two identical flows duplicated below the Lookup. This is terrible for readability and maintenance.
What can I do to resolve this situation with as little performance loss as possible?
The data model is legacy, so changing the data model is impossible.
Best Regards,
Luis
The way I see it, there are really three options:
1) Use UNION ALL and possibly sacrifice performance for modularity. There may in fact be no performance issue; you should test and see.
2) If possible, implement all of this in a stored procedure. You can implement code reuse there and it will quite possibly run much faster (a rough sketch follows this list).
3) Build a custom transformation component that implements those last three steps. This option appeals to all programmers, but it may have the worst performance and, in my opinion, will just cause issues down the track. If you're writing reams of C# code inside SSIS, you'll eventually reach a point where it's easier to just build a standalone app.
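For option 2, a minimal sketch of what such a stored procedure might look like, assuming hypothetical names (dbo.tableA, dbo.tableB, dbo.tableC, KeyColumn, OtherColumn) taken from the question's description; the Lookup plus the duplicated downstream flow collapses into one set-based statement:

    CREATE PROCEDURE dbo.usp_LoadTableC   -- hypothetical proc name
    AS
    BEGIN
        SET NOCOUNT ON;

        INSERT INTO dbo.tableC (KeyColumn, OtherColumn)
        SELECT a.KeyColumn,
               -- lookup succeeded: carry the value from tableA; lookup failed: NULL
               CASE WHEN b.KeyColumn IS NOT NULL THEN a.OtherColumn ELSE NULL END
        FROM dbo.tableA AS a
        LEFT JOIN dbo.tableB AS b
               ON b.KeyColumn = a.KeyColumn;
    END;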
It would be much easier to answer if you explained:
What you're really doing
    slowly changing dimension?
    data cleansing?
    adding reference data?
    spamming?
What those three activities are
    sending an email?
    calling a web service?
    calling some other API?
What your constraints are
    Is all of this data on one server, and can you create stored procs and tables?

Aggregate Transformation vs Sort (remove Duplicate) in SSIS

I'm trying to populate dimension tables on a regular basis and I've thought of two ways of getting distinct values for my dimension:
Using an Aggregate transformation and then using the "Group by" operation.
Using a Sort transformation while removing duplicates.
I'm not sure which one is better (more efficient), or which one is adopted more widely in the industry.
I tried to perform some tests using dummy data, but I can't quite get a solid answer.
P.S. Using SELECT DISTINCT from the source is not an option here.
My first choice would always be to correct this in my source query if possible. I realise that isn't always an option, but for the sake of completeness for future readers: I would first check whether I had a problem in my source query that was creating duplicates. Whenever a DISTINCT seems necessary, I first see whether there's actually a problem with the query that needs resolving.
My second choice would be a DISTINCT - if it were possible - because this is one of those cases where it will probably be quicker to resolve in SQL than in SSIS; but I realise that's not an option for you.
From that point, you're getting into a situation where you might need to try out the remaining options. Aside from using an Aggregate or Sort in SSIS, you could also dump the results into a staging table and then have a separate data flow whose source query does use a DISTINCT. Aggregate and Sort are both blocking transformations in SSIS, so using a staging table might end up being faster; which is fastest for you will depend on a number of factors, including the nature of your data and of your infrastructure. You might also want to keep in mind what else is running in parallel if you use the SSIS options, as they can be memory-hungry.
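As a rough sketch of that staging-table variant, assuming a hypothetical staging table dbo.Staging_Customer and dimension dbo.DimCustomer:

    -- Data flow 1: land the raw rows in dbo.Staging_Customer with no blocking transforms.
    -- Data flow 2 (or an Execute SQL Task): de-duplicate on the way into the dimension.
    INSERT INTO dbo.DimCustomer (CustomerCode, CustomerName)
    SELECT DISTINCT CustomerCode, CustomerName
    FROM dbo.Staging_Customer;

Whether this beats the in-memory Aggregate or Sort is exactly the kind of thing you'd need to test against your own data volumes.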
If your data is (or can be) sorted in your source or source query, then there's also a clever idea in the link below for creating "semi-blocking" versions of Aggregate and Sort using script components:
http://social.technet.microsoft.com/wiki/contents/articles/30703.ssis-implementing-a-faster-distinct-sort-or-aggregate-transformation.aspx

update target table given DateCreated and DateUpdated columns in source table

What is the most efficient way of updating a target table given the fact that the source table contains a DateTimeCreated and DateTimeUpdated column?
I would like to keep the source and target in sync while avoiding a truncate. I am looking for a best-practice pattern for this situation.
I'll avoid a "best practice" answer, but give enough detail for you to make an appropriate choice. There are two main methods for updating a table in SSIS while avoiding a TRUNCATE-and-LOAD:
1) Use an OLE DB Command
This method is good if:
you have a reliable DateTimeUpdated column,
there are not many rows to update,
there are not a lot of columns to update
there are not many columns added in the data flow (e.g. by Derived Column transforms),
and the update statement is fairly straightforward.
This method performs poorly with many columns because it performs a row-by-row update. Relying on an audit date column can be a great way to reduce the number of rows to update, but it can also cause problems if rows are updated in the source system and the audit column is not changed. I recommend only trusting it if it is maintained by a trigger, or if you can be certain that no human can perform updates on the table.
Additionally, this component falls short when there are a lot of columns to map or a lot of transforms going on in the data flow. For example, if you are converting all string columns from Unicode to non-Unicode, you may have many additional columns in the mix that make mapping and maintenance a pain. The mapping tool in this component is fine for about 10 columns; it starts to get confusing very quickly after that, especially because you are mapping to numbered parameters rather than column names.
Lastly, if you are doing anything complex in the update statement, it is better kept in SQL code than maintained in the component's editor, which has no IntelliSense and is generally painful to use.
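For context, the statement inside an OLE DB Command is a parameterised update with positional ? markers, which is what makes the mapping awkward as the column count grows. A minimal sketch with hypothetical table and column names:

    -- SqlCommand property of the OLE DB Command (hypothetical names).
    -- Each ? surfaces as Param_0, Param_1, ... and is mapped to an input column by position.
    UPDATE dbo.TargetTable
    SET    ColumnA = ?,
           ColumnB = ?,
           DateTimeUpdated = ?
    WHERE  BusinessKey = ?;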
2) Stage the data and perform the update in an Execute SQL Task after the data flow
This method is good for all the reasons the OLE DB Command is bad, but it has some disadvantages as well. There is more code to maintain:
a couple of T-SQL tasks,
a proc,
and a staging table.
This also means it takes more time to set up. However, it performs very well, the code is far easier to read and understand, and ongoing maintenance is simpler as well.
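A minimal sketch of the set-based update the Execute SQL Task (or the proc it calls) would run, assuming hypothetical staging and target tables keyed on BusinessKey:

    -- Runs after the data flow has landed the candidate rows in dbo.Staging_Target.
    UPDATE t
    SET    t.ColumnA = s.ColumnA,
           t.ColumnB = s.ColumnB,
           t.DateTimeUpdated = s.DateTimeUpdated
    FROM   dbo.TargetTable    AS t
    JOIN   dbo.Staging_Target AS s
           ON s.BusinessKey = t.BusinessKey
    WHERE  s.DateTimeUpdated > t.DateTimeUpdated;  -- touch only rows that actually changed

One statement updates the whole batch, which is why this approach scales so much better than the row-by-row OLE DB Command.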
Please see my notes from this other question that I happened to answer today on the same subject: SSIS Compare tables content and update another

Loading a fact table in SSIS when obtaining the dimension key isn't easy

I have a fact table that needs a join to a dimension table however obtaining that relationship from the source data isn't easy. The fact table is loaded from a source table that has around a million rows, so in accordance with best practice, I'm using a previous run date to only select the source rows that have been added since the previous run. After getting the rows I wish to load I need to go through 3 other tables in order to be able to do the lookup to the dimension table. Each of the 3 tables also has around a million rows.
I've read that best practice says not to extract source data that you know won't be needed. Best practice also says to have as light a touch as possible on the source system and therefore to avoid SQL joins. But in my case those two best practices become mutually exclusive: if I only extract changed rows from the intermediary tables then I'll need to do a join in the source query, and if I extract all the rows from the source system then I'm extracting much more data than I need, which may cause SSIS memory/performance problems.
I'm leaning towards a join in the extraction of the source data but I've been unable to find any discussions on the merits and drawbacks of that approach. Would that be correct or incorrect? (The source tables and the DW tables are in Oracle).
Can you stage the 3 source tables that you are referencing? You may not need them in the DW, but you could have them sitting in a staging database purely for this purpose. You would still need to keep these up to date, but assuming you can just pull over the changes, this may not be too bad.
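A rough sketch of that idea, assuming the three intermediary tables are mirrored into hypothetical local staging tables (stg.Intermediate1..3) that are refreshed incrementally, so the dimension-key join runs locally instead of against the Oracle source (all names and audit columns here are made up):

    -- @LastRunDate would normally come from a package variable or parameter.
    DECLARE @LastRunDate datetime2 = '2024-01-01';

    -- Incrementally refresh one staging copy (updates/merges omitted for brevity;
    -- ora.Intermediate1 stands in for however you surface the Oracle table).
    INSERT INTO stg.Intermediate1 (Key1, Key2, LastUpdated)
    SELECT src.Key1, src.Key2, src.LastUpdated
    FROM   ora.Intermediate1 AS src
    WHERE  src.LastUpdated > @LastRunDate;

    -- With all three intermediary tables staged, resolve the dimension key locally.
    SELECT f.FactBusinessKey, d.DimKey
    FROM   stg.FactSource    AS f
    JOIN   stg.Intermediate1 AS i1 ON i1.Key1 = f.Key1
    JOIN   stg.Intermediate2 AS i2 ON i2.Key2 = i1.Key2
    JOIN   stg.Intermediate3 AS i3 ON i3.Key3 = i2.Key3
    JOIN   dbo.DimSomething  AS d  ON d.NaturalKey = i3.NaturalKey;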

Change Record Integration

We have the need to perform an end of day process to extract the daily transactions from System A and transfer only the changes to System B.
The problem is that System A can only provide the full set of transactions available in System A.
My initial thought was to use a staging table (SQL Server) which will persist the data from System A and then be used for comparison on each execution of the end-of-day process. This can all be done using table joins to identify the required UPDATEs, INSERTs and DELETEs.
Not being an SSIS expert, I understand this could also be done in SSIS using Lookups to identify the additions, updates and deletions.
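For reference, a minimal sketch of that join-based comparison, assuming a hypothetical staging table holding today's full extract from System A and a persisted copy of the previous extract, both keyed on TransactionId:

    -- New rows: in today's extract but not in the persisted copy.
    SELECT s.*
    FROM   stg.TransactionsToday       AS s
    LEFT JOIN dbo.TransactionsPrevious AS p ON p.TransactionId = s.TransactionId
    WHERE  p.TransactionId IS NULL;

    -- Deleted rows: in the persisted copy but no longer in today's extract.
    SELECT p.*
    FROM   dbo.TransactionsPrevious AS p
    LEFT JOIN stg.TransactionsToday AS s ON s.TransactionId = p.TransactionId
    WHERE  s.TransactionId IS NULL;

    -- Updated rows: present in both, with at least one differing column.
    SELECT s.*
    FROM   stg.TransactionsToday    AS s
    JOIN   dbo.TransactionsPrevious AS p ON p.TransactionId = s.TransactionId
    WHERE  s.Amount <> p.Amount OR s.Status <> p.Status;  -- compare whichever columns matter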
Question:
Is the SSIS solution a better approach, and why (maintainability, scalability, extensibility)?
Which would perform better? Any experience with these 2 options?
Is there any alternative option?
Since you need the full set of transactions from System A, that limits your options as far as source goes. I recommend pulling that data down to a Raw File Destination. This will help you as you develop, since you can just run the tasks that need that data over and over again without refetching. Also, make sure that the source data is sorted by the origin machine. SSIS is very weak with sorting unless you use a 3rd party component (which may be a career-limiting decision in some cases).
Anyway, let's assume that you have that sorted Raw File lying around. The next thing you do is toss that into a Data Flow as a Raw File Source. Then have an OLE DB (or whatever) source that represents System B. You could use a Raw File for that too, if you like. Make sure that the data from System B is sorted on the same columns you used to sort System A.
Mark the sources with IsSorted=True, and set the SortKeyPosition on the appropriate columns in the metadata. This tells SSIS that the data is pre-sorted, and it will permit you to JOIN on your key columns. Otherwise, you may wait days for SSIS to sort big sets.
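For example, if one of the sources is a query rather than a Raw File, the sort can simply come from the query itself (hypothetical table and key names), with the sort metadata set to match:

    -- Source query for System B (hypothetical), sorted on the same key used for System A.
    SELECT TransactionId, Amount, Status
    FROM   dbo.SystemB_Transactions
    ORDER  BY TransactionId;
    -- Then, in the Advanced Editor, set IsSorted = True on the source output
    -- and SortKeyPosition = 1 on TransactionId.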
Add MultiCasts to both System A and System B's sources, because we want to leverage them twice.
Next, add a Merge Join to join the two Raw File Sources together. Make System A the left input. System B will become the right input when you connect it to the Merge Join. SSIS will automatically set the JOIN up on those sorted columns that you marked in the previous step. Set the Merge Join to use LEFT JOIN. This way, we can find rows in System A that do not exist in System B, and we can compare existing rows to see if they were changed.
Next up, add a Conditional Split. In there, you can define 2 output buffers based on conditions.
NewRows: ISNULL(MyRightTable.PrimaryKey)
UpdatedRows: [Whatever constitutes an updated row]
The default output will take whatever rows do not meet those 2 conditions, and may be safely ignored.
We are not done.
Add another Merge Join.
This time, make an input from System B's MultiCast the left input. Make an input from System A's MultiCast the right input. Again, SSIS will set the join keys up for you. Set the Merge Join to use Left Join.
Add a Conditional Split down this flow, and the only thing you need is this:
DeletedRows: ISNULL(MyRightTable.PrimaryKey)
The default output will take all of the other rows, which can be ignored.
Now you have 3 buffers.
2 of them come out of your first Merge Join, and represent New Rows and Updated Rows.
1 of them comes out of your second Merge Join, and represents Deleted Rows.
Take action.
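As a sketch of that last step, one common pattern (assuming System B is a SQL Server table, with hypothetical staging tables stg.UpdatedRows and stg.DeletedRows) is to send the New Rows buffer straight to a destination and to stage the Updated and Deleted buffers, applying them afterwards with set-based SQL:

    -- After the data flow: apply the staged changes to System B.
    UPDATE b
    SET    b.Amount = u.Amount,
           b.Status = u.Status
    FROM   dbo.SystemB_Transactions AS b
    JOIN   stg.UpdatedRows AS u ON u.TransactionId = b.TransactionId;

    DELETE b
    FROM   dbo.SystemB_Transactions AS b
    JOIN   stg.DeletedRows AS d ON d.TransactionId = b.TransactionId;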