I have a problem where I need to have the dimension key for all rows in the data stream.
1. I use a Lookup component to find the dimension key of the records.
2. Records with no dimension key (the Lookup's no match output) are redirected to a different output because they need to be inserted.
3. The no match output is multicast.
4. The new records are inserted into the dimension.
5. A second Lookup component should be executed after the records are inserted.
Number 5 fails because I don't know how to wait for the ADO NET Destination to finish...
Is there any way to solve this other than dumping the stream into raw files and using another data flow to resume the task?
I think I understand what you're doing now. Normally you would load the dimension fully first in its own data flow, then after this is fully complete, you load the fact, using the already populated dimension. You appear to be trying to load the fact and dimension in one data flow.
The only reason you'd do this in one data flow is if you couldn't separate your distinct dimensions from your facts and your fact source is so huge that you don't want to go through it twice. Normally you can preload a dimension without reading a large fact source, but this is not always the case. It depends on your source data.
You could use a SEQUENCE (http://technet.microsoft.com/en-us/library/ff878091.aspx) to do this in one data flow. This is a way of autogenerating a number without needing to insert it into a table, but your dimension would need to rely on a sequence instead of an identity. But you'd need to call it in some kind of inline script component or you might be able to trick a lookup component. It would be very slow.
Instead you should try building all of your dimensions in a prior load so that when you arrive at the fact load, all dimensions are already there.
In short the question is: Do you really need to do this in one big step? Can you prebuild your dimension in a prior data flow?
Thomas Kejser wrote a blogpost with a nice solution to this early arriving fact / late arriving dimension problem.
http://blogs.msdn.com/b/sqlcat/archive/2009/05/13/assigning-surrogate-keys-to-early-arriving-facts-using-integration-services.aspx
Basically you use a second lookup with a partial cache. Whenever the partial cache lookup receives a row it cannot match, it runs its SQL statement against the database and adds the result to the lookup cache. If that SQL statement calls a stored procedure, the procedure can first insert the missing member into the dimension table and then return it with a SELECT, so the new key ends up in the cache.
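The "insert on miss, then select" idea can be sketched outside SSIS. The following is only an illustration in Python with an in-memory SQLite database, not the code from the blog post: the table `dim_customer`, its columns, and the function `make_key_lookup` are hypothetical names, and the dictionary stands in for the partial cache.

```python
import sqlite3


def make_key_lookup(conn):
    """Return a lookup function with a partial-cache-like miss handler."""
    cache = {}  # business key -> surrogate key, filled only on demand

    def get_or_create_key(business_key):
        if business_key in cache:
            return cache[business_key]
        # Cache miss: this is the work the stored procedure would do.
        row = conn.execute(
            "SELECT customer_key FROM dim_customer WHERE business_key = ?",
            (business_key,),
        ).fetchone()
        if row is None:
            # Early-arriving fact: add the missing dimension member first.
            cursor = conn.execute(
                "INSERT INTO dim_customer (business_key) VALUES (?)",
                (business_key,),
            )
            key = cursor.lastrowid
        else:
            key = row[0]
        cache[business_key] = key
        return key

    return get_or_create_key


# Hypothetical dimension table with one member already loaded.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    "customer_key INTEGER PRIMARY KEY, "
    "business_key TEXT NOT NULL UNIQUE)"
)
conn.execute("INSERT INTO dim_customer (business_key) VALUES ('A')")

lookup = make_key_lookup(conn)
fact_business_keys = ["A", "B", "B", "C"]
surrogate_keys = [lookup(bk) for bk in fact_business_keys]
dimension_rows = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
print(surrogate_keys, dimension_rows)
```

Every fact row leaves the lookup with a key, and the second "B" is served from the cache without touching the database again.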
The first time our SSIS packages run they do a full load, and incremental loads afterwards. All Lookups are using Full cache, but it may not be handy for incremental loads, as some of the Lookup tables contain millions of records, and the incremental load may be small.
Is it possible to dynamically set, based on some parameter, whether a Lookup should use Full Cache, Partial Cache or No Cache?
Solution
Because the database and SSIS packages are on the same server, partial cache with indexes on the lookup columns is as fast as full cache for full loads, and even faster for incremental loads.
I can't see a way to do this, but potential alternatives would be:
Build two different data flow tasks, one for each cache type, and then use a variable/logic to decide which one to run by setting expressions on your precedence constraints.
Use an SQL statement from a variable as your Lookup connection, and select only the records you need, e.g., if it's based on a date range or something like that. You could then build the SQL statement before each execution.
Here is a trick I am using. Create 2 lookups. First is a lookup with full cache and redirect on no match. Its no match branch goes into the second lookup that is exactly the same but in partial cache mode. The no match policy on this one is what you actually need. Match results from both lookups combine back together in a union all.
Now the tricky part. Go out of the data flow to the control flow that contains it. When you select that Data Flow Task, you can find Expressions in its properties. Among those expressions there is an entry of the form [<lookup component name>].[SqlCommand]. There you should build an expression for your full cache lookup that adds an always-false condition to the query when you want to use the partial cache lookup instead.
Here is a simple example. Let's say you have an IsFullLookup boolean variable that you set to true for a large initial load and to false for small incremental loads. And let's say your lookup query looks like this.
SELECT Value, Key FROM MyLookupTable
Then the expression may look something like this
"SELECT Value, Key FROM MyLookupTable" + ( @[User::IsFullLookup] ? "" : " WHERE 0=1" )
How this works. If your IsFullLookup variable is true, the query remains unchanged. It loads the whole table into the cache, and all matching values go straight to the union all. Non-matching values go to the partial cache lookup. That's the downside of this method, so it might not be the best choice if you have many non-matching values on full loads.
If your IsFullLookup is false, full cache produces 0 rows quite quickly. All lookups also produce no match very quickly and your rows are all redirected to the partial cache lookup that does its job the regular way. In this case you get pure partial cache lookup with minimal overhead.
This technique makes your packages uglier than before. But if you work with large lookup tables it is totally worth it. In your everyday load you do not spend extra time and memory on large lookups. And in the rare case when you need a full (or just a large) load, your partial cache does not become a disaster.
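The behaviour of the two chained lookups can be simulated in a few lines. This is only a sketch in Python, not SSIS code: the class `TwoStageLookup`, the function `build_lookup_query`, and the `partial_queries` counter are hypothetical, a dictionary stands in for MyLookupTable, and the partial cache here does not remember misses.

```python
def build_lookup_query(is_full_lookup):
    """Mirror the SqlCommand expression: add WHERE 0=1 for incremental loads."""
    base = "SELECT Value, Key FROM MyLookupTable"
    return base if is_full_lookup else base + " WHERE 0=1"


class TwoStageLookup:
    """Full cache lookup whose no match output feeds a partial cache lookup."""

    def __init__(self, reference, is_full_lookup):
        self.reference = reference  # stands in for MyLookupTable: Value -> Key
        # With WHERE 0=1 the full cache is simply empty.
        self.full_cache = dict(reference) if is_full_lookup else {}
        self.partial_cache = {}
        self.partial_queries = 0  # database round trips made by stage two

    def lookup(self, value):
        if value in self.full_cache:  # stage one: match output
            return self.full_cache[value]
        if value in self.partial_cache:  # stage two: already cached
            return self.partial_cache[value]
        self.partial_queries += 1  # stage two: query the table for this value
        key = self.reference.get(value)
        if key is not None:
            self.partial_cache[value] = key
        return key  # None means no match in both lookups


reference = {"alice": 1, "bob": 2, "carol": 3}

full = TwoStageLookup(reference, is_full_lookup=True)
full_keys = [full.lookup(v) for v in ["alice", "bob", "dave"]]

incremental = TwoStageLookup(reference, is_full_lookup=False)
incremental_keys = [incremental.lookup(v) for v in ["bob", "bob", "dave"]]

print(build_lookup_query(False))
print(full_keys, full.partial_queries)
print(incremental_keys, incremental.partial_queries)
```

In the full load only the unmatched value reaches the second stage, while in the incremental load every distinct value costs one query there, which is the trade-off described above.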
I have an SSIS package with 2 data flow tasks. The first data flow task fills values into a dimension table. The second data flow task fills surrogate keys into the fact table. The fact table references the previously filled dimension table via a surrogate key. However, a further SSIS package does exactly the same, but with data from another data source. Both SSIS packages are fired by SQL Server Agent in low frequency (every 20 to 40 seconds).
I am worried about consistency. If I had a single SSIS package that loads the data into the dimension table and fact table, I wouldn't have to be, because it would be possible to create the control flow to enforce the following sequence:
1. Fill the Dimension table with data from data source 1
2. Fill the Fact table with data from data source 1 (correct surrogate key to Dim)
3. Fill the Dimension table with data from data source 2
4. Fill the Fact table with data from data source 2 (correct surrogate key to Dim)
So in this case the primary key of the Dimension table as well as the corresponding surrogate key in the fact table could be auto-incremented simply in SQL Server DB and everything would be fine.
But since I have 2 SSIS packages, each running independently on a multi-core ETL server in low frequency, I am worrying about the case when the following will happen:
1. Both packages are starting approximately at the same time
2. Fill the Dimension table with data from data source 1
3. Fill the Dimension table with data from data source 2
4. Fill the Fact table with data from data source 2 (surrogate key to wrong Dim record)
5. Fill the Fact table with data from data source 1 (surrogate key to wrong Dim record)
Are there any common best practices for this? Or is such handling unnecessary because SQL Server deals with the situation by default, e.g. by not allowing the packages to be processed in parallel? Maybe a write lock on both tables at the start of each SSIS package would be enough, but in that case I worry that the other SSIS package could fail if it cannot reach the destination tables. I am new to SSIS and I would like to know my options and any good techniques to avoid this situation (if necessary).
One option is to use transactions in SSIS. You can wrap the critical part of each ETL in a transaction.
But I'm not sure I understand what makes you think there could be a problem. If you use an identity column on your dimension table, there cannot be duplicate keys, no matter how many threads insert at the same time. In your steps 4 and 5, how could you get a surrogate key to a wrong record? Please illustrate your question with an example of how you plan to match your fact with your Dim record.
If I understand your question properly, another option is to make them one package and use sequence containers. If you don't want to do this, you can still combine them in the control flow with an Execute Package Task; that way you control the flow, and one package will only run after the other. The only disadvantage is that the child package needs to initialize again when executed, so it would probably be better performance-wise to just combine them and create data sources for both in the same package.
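To make the identity-column argument concrete, here is a small simulation. It is only a sketch under assumptions: Python with an in-memory SQLite database stands in for SQL Server, the tables `dim` and `fact` and the helper functions are hypothetical, and facts are matched to the dimension by (source, business key) rather than by "the last inserted key". The loads run in the interleaved order the question worries about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim (
        dim_key INTEGER PRIMARY KEY,          -- generated key, like an identity
        source TEXT NOT NULL,
        business_key TEXT NOT NULL,
        UNIQUE (source, business_key)
    );
    CREATE TABLE fact (
        fact_id INTEGER PRIMARY KEY,
        dim_key INTEGER NOT NULL REFERENCES dim (dim_key),
        amount INTEGER NOT NULL
    );
    """
)


def load_dimension(source, business_keys):
    """Insert dimension members that are not present yet."""
    for bk in business_keys:
        conn.execute(
            "INSERT OR IGNORE INTO dim (source, business_key) VALUES (?, ?)",
            (source, bk),
        )


def load_fact(source, rows):
    """Resolve each fact's surrogate key by business key, then insert it."""
    for bk, amount in rows:
        dim_key = conn.execute(
            "SELECT dim_key FROM dim WHERE source = ? AND business_key = ?",
            (source, bk),
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO fact (dim_key, amount) VALUES (?, ?)", (dim_key, amount)
        )


# Interleaved order: both dimension loads first, then source 2 facts, then source 1.
load_dimension("src1", ["X", "Y"])
load_dimension("src2", ["X", "Z"])
load_fact("src2", [("X", 20), ("Z", 30)])
load_fact("src1", [("X", 10), ("Y", 15)])

resolved = conn.execute(
    "SELECT d.source, d.business_key, f.amount "
    "FROM fact f JOIN dim d ON d.dim_key = f.dim_key "
    "ORDER BY d.source, d.business_key"
).fetchall()
print(resolved)
```

Because each fact finds its dimension row through the business key, the order in which the two packages run does not change which row it points to.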
I have a scenario where Data Stream B is dependent on Data Stream A. Whenever there is a change in Data Stream A, Stream B has to be re-processed. So a common process is required to identify the changes across data streams and trigger the re-processing tasks.
Is there a good way to do this besides triggers?
Your question is rather unclear and I think any answer depends very heavily on what your data looks like, how you load it, how you can identify changes, if you need to show multiple versions of one fact or dimension value to users etc.
Here is a short description of how we handle it, it may or may not help you:
1. We load raw data incrementally daily, i.e. we load all data generated in the last 24 hours in the source system (I'm glossing over timing issues, but they aren't important here)
2. We insert the raw data into a loading table; that table already contains all data that we have previously loaded from the same source
3. If rows are completely new (i.e. the PK value in the raw data is new) they are processed normally
4. If we find a row where we already have the PK in the table, we know it is an updated version of data that we've already processed
5. Where we find updated data, we flag it for special processing and re-generate any data depending on it (this is all done in stored procedures)
I think you're asking how to do step 5, but it depends on the data that changes and what your users expect to happen. For example, if one item in an order changes, we re-process the entire order to ensure that the order-level values are correct. If a customer address changes, we have to re-assign him to a new sales region.
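The new-versus-updated split and the order example can be sketched as follows. This is an illustration in Python, not the stored procedures mentioned above: the row layout (`item_id` as the PK, `order_id` as the parent) and both function names are assumptions made for the example.

```python
def split_new_and_updated(loaded_pks, incoming_rows):
    """Rows whose PK is already loaded count as updates, the rest as new."""
    new_rows, updated_rows = [], []
    for row in incoming_rows:
        if row["item_id"] in loaded_pks:
            updated_rows.append(row)
        else:
            new_rows.append(row)
    return new_rows, updated_rows


def orders_to_reprocess(updated_rows):
    """A changed item means its whole order is re-processed."""
    return sorted({row["order_id"] for row in updated_rows})


# PKs already present in the (hypothetical) loading table.
loaded_pks = {101, 102, 103}

# Rows delivered by today's incremental load.
incoming = [
    {"item_id": 102, "order_id": 7, "qty": 5},
    {"item_id": 104, "order_id": 8, "qty": 1},
    {"item_id": 103, "order_id": 7, "qty": 2},
]

new_rows, updated_rows = split_new_and_updated(loaded_pks, incoming)
reprocess = orders_to_reprocess(updated_rows)
print([r["item_id"] for r in new_rows], reprocess)
```

What "re-process" means for each dependent data set still has to be decided case by case; the sketch only shows how the changed rows and their parents are identified.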
There is no generic way to identify data changes and process them, because everyone's data and requirements are different and everyone has a different toolset and different constraints and so on.
If you can make your question more specific then maybe you'll get a better answer, e.g. if you already have a working solution based on triggers then why do you want to change? What problem are you having that is making you look for an alternative?
I have a data flow to migrate rows from a database to a new version of the database. One of the changes we are making is to replace user name strings with an integer identifier.
I'm using a Lookup component to replace the Manager and Trader names with their numeric ID, but one of the transforms seems to be performing very slowly compared to the other. The following screenshot shows how far behind the Lookup Trader component is compared to the Lookup Manager component.
As you can see, approx 22m rows have been passed to the Lookup Manager component and it is "only" 100k rows behind, but the data passed to the Lookup Trader component is almost 8m rows behind.
The lookups contain the same query to get the user name and ID for all the traders and managers (they are maintained in the same table) and are both set to use Full Cache. They are looking up the same data type (string) and both add a new field to the flow of type INT.
I don't understand why one component is performing so much faster than the other when they are essentially the same. The Warning icons on both components are shown because I have set the error action to Fail Component while debugging even though there is an error output connected. Later, I'll redirect the errors to a flat file.
My question is twofold: why is one performing much slower than the other and, more importantly, how do I find out why?
After the comment from Damien_The_Unbeliever about components later in the data flow, I did a test by extracting the two lookup components and directing their output to a Trash Destination; I also reversed the order of the components to see if one was operating faster than the other.
All was running well until both lookup components paused and the input raced ahead to nearly 5m rows but overall the data was processed quickly and the Lookup Trader component did not work more slowly than the Lookup Manager.
So it would seem the later components are causing a bottleneck in the flow and therefore making it appear the Lookup Trader was the culprit.
I am trying to improve the performance of an SSIS package.
One thing I started with is filtering the reference table of the Lookups. Until now, I was using a whole table as the reference table for that lookup.
The first improvement was to replace the table with a SQL query that selects just the columns I need from that table.
Next, I want to load from this table just the records I know I'll use for sure. If I leave it as it is, I will load 300,000 rows or more (huge rows with binary content of around 500 KB each) and use just around 100 of them.
I would put some filters in the SQL query that defines the reference table of the lookup, BUT in that filter I need to use ALL the ids of the rows loaded in my OLE DB source.
Is there any way to do this?
I thought of loading one row at a time using an OLE DB Command instead of a Lookup, but apart from being time consuming, I might end up loading the same thing 100 times for 100 different rows, when I could load it once in the lookup and use it 100 times...
Enabling the cache would be another option, but it still doesn't sound very good, because it would slow us down, and we are already terribly slow.
Any ideas are greatly appreciated.
One possibility is to first stream the distinct IDs to a permanent/temporary table in one data flow and then use it in your lookup (with a join) in a later data flow (you probably have to defer validation).
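The staging idea can be sketched like this. It is only an illustration in Python with an in-memory SQLite database standing in for the real source and reference tables: `reference_data`, `source_rows`, and `staged_ids` are hypothetical names, and the final dictionary plays the role of the lookup's cache.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reference_data (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE source_rows (row_id INTEGER PRIMARY KEY, ref_id INTEGER);
    CREATE TABLE staged_ids (id INTEGER PRIMARY KEY);
    """
)

# A large reference table of which only a few rows are actually needed.
conn.executemany(
    "INSERT INTO reference_data (id, payload) VALUES (?, ?)",
    [(i, f"payload-{i}") for i in range(1, 1001)],
)
conn.executemany(
    "INSERT INTO source_rows (ref_id) VALUES (?)",
    [(5,), (5,), (42,), (5,), (900,)],
)

# First data flow: stream the distinct IDs into a staging table.
conn.execute("INSERT INTO staged_ids (id) SELECT DISTINCT ref_id FROM source_rows")

# Second data flow: the lookup query joins to the staged IDs, so only the
# rows that will be used are loaded into the cache.
lookup_cache = dict(
    conn.execute(
        "SELECT r.id, r.payload "
        "FROM reference_data r JOIN staged_ids s ON s.id = r.id"
    ).fetchall()
)

resolved = [
    lookup_cache[ref_id]
    for (ref_id,) in conn.execute("SELECT ref_id FROM source_rows ORDER BY row_id")
]
print(len(lookup_cache), resolved)
```

Each needed reference row is read once and reused for every source row that points to it, which is the behaviour the question wanted to keep from the Lookup.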
In many of our ETL packages, we first stream the data into a Raw file, handling all the type conversions and everything else on the way there. Once all these conversions have succeeded, we create the new dimension members and then load the facts that link to the dimensions.