I have a SSIS package with 2 data flow tasks. The first data flow task is filling values into a dimension table. The second data flow task is filling surrogate keys into the fact table. The fact table is referencing the previous filled dimension table via surrogate key. However, a further SSIS package is doing exactly the same, but with data from another data source. Both SSIS packages are fired by SQLServer Agent in low frequency (each 20 - 40 seconds).
I am worrying about consistency. If I had a single SSIS package that loads the data into the dimension table and fact table, I wouln't have to because it would be possible to create the control flow to enforce the following sequence:
Fill the Dimension table with data from data source 1
Fill the Fact table with data from data source 1 (correct surrogate key to Dim)
Fill the Dimension table with data from data source 2
Fill the Fact table with data from data source 2 (correct surrogate key to Dim)
So in this case the primary key of the Dimension table as well as the corresponding surrogate key in the fact table could be auto-incremented simply in SQL Server DB and everything would be fine.
But since I have 2 SSIS packages, each running independently on a multi-core ETL server in low frequency, I am worrying about the case when the following will happen:
Both packages are starting approximately at the same time
Fill the Dimension table with data from data source 1
Fill the Dimension table with data from data source 2
Fill the Fact table with data from data source 2 (surrogate key to wrong Dim record)
Fill the Fact table with data from data source 1 (surrogate key to wrong Dim record)
Are there any common best practises or, on the other hand, is such a handling necessary or does SQL Server handle such situation by default e.g. by forbid packages to be processed in parallel? Maybe a Write Lock on both tables during the start of each SSIS-package could be satisfactory but in this case I am worrying that this could result in a failure thrown by the other SSIS-package if it cannot reach the destination tables. I am new to SSIS and I would like to know my options about any good techniques to avoid this situation (if necessary).
One option is to use transactions in SSIS. You can embed in the transaction the critical part of the ETLs.
But I'm not sure to understand what makes you think there could be a problem. If you use an identity column on your dimension table, there can not be duplicates, no matter how many threads insert at the same time. In your step 4 and 5, how could you get a surrogate to a wrong record ? Please illustrate your question with an example of how you plan to match your fact with your Dim record.
If I understand your query properly,another option you can use is to make them one package and use sequence containers if you don't want to do this you can still combine them in the control flow with an execute SSIS package task,that way you can control the flow and the one package will only run after the other.The only disadvantage to this is that the package needs to initialize again when executed so it would probably be better proformance wise to just combine then and create data sources for them in the same package.
Related
I have a problem where I need to have the dimension key for all rows in the data stream.
I use a lookup component to find the dimension key of the records
Records with no dimension key (lookup no match output) are redirect to a different output because they need to be inserted.
the no match output is multicated
the new records are inserted into the dimension.
a second lookup component should be executed after the records are inserted
number 5 fails because I don't know how to wait for the ADO NET Destination to finish...
Is there any way to solve this other than dump the stream into raw files and use other data flow to resume the task?
I think I understand what you're doing now. Normally you would load the dimension fully first in its own data flow, then after this is fully complete, you load the fact, using the already populated dimension. You appear to be trying to load the fact and dimension in one data flow.
The only reason you'd do this in one data flow is if you couldn't seperate your distinct dimensions from your facts and your fact source is so huge you don't want to go through it twice. Normally you can preload a dimension without a large fact source but this is not always the case. It depends on your source data.
You could use a SEQUENCE (http://technet.microsoft.com/en-us/library/ff878091.aspx) to do this in one data flow. This is a way of autogenerating a number without needing to insert it into a table, but your dimension would need to rely on a sequence instead of an identity. But you'd need to call it in some kind of inline script component or you might be able to trick a lookup component. It would be very slow.
Instead you should try building all of your dimensions in a prior load so that when you arrive at the fact load, all dimensions are already there.
In short the question is: Do you really need to do this in one big step? Can you prebuild your dimension in a prior data flow?
Thomas Kejser wrote a blogpost with a nice solution to this early arriving fact / late arriving dimension problem.
http://blogs.msdn.com/b/sqlcat/archive/2009/05/13/assigning-surrogate-keys-to-early-arriving-facts-using-integration-services.aspx
Basically you use a second lookup with a partial cache. Whenever the partial lookup cache receives a non-matched row it will call a SQL statement and fetch data to populate the lookup cache. If you use a stored proc in this SQL statement you can first add it to dimension table and then use the SELECT statement to alter the cache.
I am new to SSIS in data warehouse. I am using Microsoft business intelligence studio.
I have 5 Dimensions each having some PK.
I have a Fact table that contains all the PK of Dimensions, means their is a foreign key relationship exist ( as in star schema).
Now what is the best practice to load the fact table.
What i have done is write a cross join query between 5 Dimensions and the resultant set is dumped to the fact table. But i don't think this is a good practice.
I am completely new to MS SSIS. so plz describe suggestions in detail.
thanks
Take a look at Microsoft Project Real examples. Also get a Kimball book and read-up on loading fact tables -- the topic covers several chapters.
I would echo #Damir's points about Project Real and Kimball. I am a fan of both.
I guess to give you some more thoughts, to answer your question,
load your date dimension and other "static" dimensions as a one off load
load records into all your dimensions to take care of NULL and UNKNOWN values
load your dimensions. For your dimensions, decide on a column by column basis what you want as type 1 or type 2 changing dimension columns. Be cautious and choose them mostly as type 1 unless there is a good reason.
[edited] load your fact table by joining your staging transaction data which will go into a fact table to your new dimension tables using the business keys, thus looking up the dimension's foreign keys as you go. e.g. sales transactions will have a store number (the business key), which you would want to look up in DimStore (already loaded in the previous step), which would give you the kStore of DimStore, then you would record kstore against that transaction in FactSalesTransaction.
Other general things you should consider (not related to your question, but if yo uare stating out you should consider)
Data archiving. How long will you keep data online? / when will it be deleted?
Table partitioning. If you have very large Fact tables(s), you should consider partitioning on a date or subject area basis. Date is quite nice, as you can do some interesting things with regard to dropping old partitions when the data is too old as part of the standard load process.
Having the DWH as a snowflaked schema, then using a set of views to flatten the snoflake into a star. This is particularly useful when putting an OLAP cube on top of a SQL DWH, as it simplifies the cube design.
How are you going to manage different environments (Dev/Test/etc/Prod)? Using one of the SQL Server configuration styles is imperative.
Build a template SSIS package with all the variables you need and the configration/connection strings you want. It will save loads of time to do that now, rather than having to rework packages when you discover new things. Do trivial prototypes initially to prove your methodology!
I created an SSIS package so I can import data from a legacy FoxPro database at scheduled intervals. A copy of the FoxPro databaseis installed for several customers. Overall, the package is working very well and accomplishing all that I need.
However, I have one annoying situation where at least one customer (maybe more) has a modified FP database, where they increased the length of one column in one table. When I run the package on such a customer, it fails because of truncation.
I thought I could just give myself some wiggle room and change the length from 3 to 10. That way the mutants with a length of 10 would be accommodated, as well as everyone else using 3. However, SSIS complains when the column lengths don't match, period.
I suppose I have a few options:
On the task, set 'ValidateExternalMetadata' to false. However, I'm not sure that is the most responsible option... or is it?
Get our implementation team to change the length to 10 for all customers. This could be a problem, but at least it would be their problem.
Create a copy of the task that works for solutions with the different column length. Implementation will likely use the wrong package at some point, and everyone will ask me why I didn't just give them a single package that couldn't handle all scenarios and blame this on me.
Use some other approach you might be able to fill me in on.
If you are using the Visual FoxPro OleDB, and you are concerned about the columns widths, you can explicitly force them by using PADR() during your call. I don't know how many tables / queries this impacts but would guarantee you get your expected character column lengths. If dealing with numeric, decimal, date/time, logical (boolean), should not be an issue... Anyhow, you could do this as your select to get the data
select
t1.Fld1,
t1.Fld2,
padr( t1.CharFld3, 20 ) CharFld3,
padr( t1.CharFld4, 5 ) CharFld4,
t1.OtherFld5,
padr( t1.CharFld6, 35 ) CharFld5
from
YourTable t1
where
SomeCondition
This will force character based (implied sample) fields "CharFld3", "CharFld4", "CharFld6" to a force width of 20, 5 and 35 respectively regardless of the underlying structure length. Now, if someone updates the structure LONGER than what you have it will be truncated down to proper length, but won't crash. Additionally, if they have a shorter column length, it will be padded out to the full size you specify via the PADR() function (pad right).
I'm weak on the FoxPro side, but...
You could create a temporary table that meets the SSIS expectations. Create a task that would use FoxPro instructions to copy the data from the problem table to the temporary table. Alter your data flow to work with the temp table.
You can create the preliminary steps (create temp table and transfer to temp table) as SSIS tasks so that flow control is managed by your SSIS package.
I am trying to improve the performance of a SSIS Package.
One thing I got startted with is to filter the reference table of the Lookups. Until now, I was using a table as a reference table for that lookup.
First improvment was to change the table to a SQL clause that is selecting just the columns I need from that table.
Next, I want to load in this table just the records I know I'll use for sure. If I'm maintaining it in this state, I will get to load 300 000 lines or more (huge lines with binary content of around 500 kb each) and use just around 100 of them.
I would put some filters in the SQL query that sets the reference table of the lookup, BUT, in that filter I need to use ALL the ids of the rows loaded in my OLE DB source.
Is there any way to do this?
I thought of loading each row at a time using a OleDB Command instead of a Lookup, but except of beeing time consuming, I might get to load the same thing 100 times for 100 different rows, when I could just load it once in the lookup and use it 100 times...
Enableing the cache still would be another option that still doesn't sound very good, because it would slow us down - we are already terribly slow.
Any ideeas are greatly appreaciated.
One possibility is to first stream the distinct IDs to a permanent/temporary table in one data flow and then use it in your lookup (with a join) in a later data flow (you probably have to defer validation).
In many of our ETL packages, we first stream the data into a Raw file, handling all the type conversions and everything on the way there. Then, when all these conversions were successful, then we handle creating new dimensions and then the facts linking to the dimensions.
Here's our mission:
Receive files from clients. Each file contains anywhere from 1 to 1,000,000 records.
Records are loaded to a staging area and business-rule validation is applied.
Valid records are then pumped into an OLTP database in a batch fashion, with the following rules:
If record does not exist (we have a key, so this isn't an issue), create it.
If record exists, optionally update each database field. The decision is made based on one of 3 factors...I don't believe it's important what those factors are.
Our main problem is finding an efficient method of optionally updating the data at a field level. This is applicable across ~12 different database tables, with anywhere from 10 to 150 fields in each table (original DB design leaves much to be desired, but it is what it is).
Our first attempt has been to introduce a table that mirrors the staging environment (1 field in staging for each system field) and contains a masking flag. The value of the masking flag represents the 3 factors.
We've then put an UPDATE similar to...
UPDATE OLTPTable1 SET Field1 = CASE
WHEN Mask.Field1 = 0 THEN Staging.Field1
WHEN Mask.Field1 = 1 THEN COALESCE( Staging.Field1 , OLTPTable1.Field1 )
WHEN Mask.Field1 = 2 THEN COALESCE( OLTPTable1.Field1 , Staging.Field1 )
...
As you can imagine, the performance is rather horrendous.
Has anyone tackled a similar requirement?
We're a MS shop using a Windows Service to launch SSIS packages that handle the data processing. Unfortunately, we're pretty much novices at this stuff.
If you are using SQL Server 2008, look into the MERGE statement, this may be suitable for your Upsert needs here.
Can you use a Conditional Split for the input to send the rows to a different processing stage dependent upon the factor that is matched? Sounds like you may need to do this for each of the 12 tables but potentially you could do some of these in parallel.
I took a look at the merge tool, but I’m not sure it would allow for the flexibility to indicate which data source takes precedence based off of a predefined set of rules.
This function is critical to allow for a system that lets multiple members utilize the process that can have very different needs.
From what I have read the Merge function is more of a sorted union.
We do use an approach similar to what you describe in our product for external system inputs. (we handle a couple of hundred target tables with up to 240 columns) Like you describe, there's anywhere from 1 to a million or more rows.
Generally, we don't try to set up a single mass update, we try to handle one column's values at a time. Given that they're all a single type representing the same data element, the staging UPDATE statements are simple. We generally create scratch tables for mapping values and it's a simple
UPDATE target SET target.column = mapping.resultcolumn WHERE target.sourcecolumn = mapping.sourcecolumn.
Setting up the mappings is a little involved, but we again deal with one column at a time while doing that.
I don't know how you define 'horrendous'. For us, this process is done in batch mode, generally overnight, so absolute performance is almost never an issue.
EDIT:
We also do these in configurable-size batches, so the working sets & COMMITs are never huge. Our default is 1000 rows in a batch, but some specific situations have benefited from up to 40 000 row batches. We also add indexes to the working data for specific tables.