In a Data Flow Task, suppose I have a source, a couple of transforms, and a destination.
Say there are 1 million records to be read by the source, and it has reached row number 10,000. Will the 10,000 rows already read get passed to the next transform, or will the subsequent components wait for the previous one to finish processing all the rows, so that the transform only runs once all 1 million have been read?
It depends!
Quick definitions:
Synchronous: One row in, one row out. Input lineage id is the same as output lineage id
Asynchronous: N row(s) in, M row(s) out.
Async, Fully blocking: All data must arrive before new data can leave
Async, Partially blocking: Enough data must arrive before new data can leave
All synchronous
OLE DB Source -> Derived Column -> OLE DB Destination
All synchronous components. With 1M source rows, 10k rows at a time flow from the source to the Derived Column to the destination. Lather, rinse, repeat.
Asynchronous, fully blocking
OLE DB Source ->
                 Aggregate -> OLE DB Destination
Aggregate is an asynchronous, fully blocking component. With 1M source rows, 10k rows flow from the source to the Aggregate (let's assume we're getting the maximum sales amount grouped by sales id). It computes the maximum amount for the 10k rows it has, but it can't release them downstream because row 10,001 might have a larger value, so it holds and stores the values until the source signals that it has no more rows to send.
Only then can the Aggregate component release the results to the downstream consumers of data.
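To make the difference concrete, here is a rough Python sketch (not SSIS code) of the two behaviours. Everything in it is an assumption for illustration: the (sales_id, amount) row shape, the function names, and the use of 100,000 rows instead of 1M to keep it quick. It counts how many 10k-row buffers have to be read before each kind of component can hand anything downstream.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Tuple[int, float]   # (sales_id, amount): a hypothetical row shape
Buffer = List[Row]


def source(total_rows: int, rows_per_buffer: int) -> Iterator[Buffer]:
    """Yield rows in fixed-size buffers, the way a source feeds the pipeline."""
    for start in range(0, total_rows, rows_per_buffer):
        stop = min(start + rows_per_buffer, total_rows)
        yield [(i % 3, float(i)) for i in range(start, stop)]


def counted(buffers: Iterable[Buffer], log: List[int]) -> Iterator[Buffer]:
    """Record the size of every buffer pulled from upstream."""
    for buf in buffers:
        log.append(len(buf))
        yield buf


def derived_column(buffers: Iterable[Buffer]) -> Iterator[Buffer]:
    """Synchronous: edit the buffer in place and pass it on immediately."""
    for buf in buffers:
        for i, (sales_id, amount) in enumerate(buf):
            buf[i] = (sales_id, amount * 2)
        yield buf


def aggregate_max(buffers: Iterable[Buffer]) -> Iterator[Buffer]:
    """Fully blocking: release nothing until the input is exhausted."""
    best: Dict[int, float] = {}
    for buf in buffers:
        for sales_id, amount in buf:
            if sales_id not in best or amount > best[sales_id]:
                best[sales_id] = amount
    # Only now, with every row seen, is the maximum per sales_id known.
    yield sorted(best.items())


sync_reads: List[int] = []
first_sync = next(derived_column(counted(source(100_000, 10_000), sync_reads)))

blocking_reads: List[int] = []
first_agg = next(aggregate_max(counted(source(100_000, 10_000), blocking_reads)))

print(len(sync_reads))      # buffers read before the first synchronous output
print(len(blocking_reads))  # buffers read before the first aggregate output
```

The synchronous transform produces output after reading a single buffer, while the aggregate has to pull all ten before it yields anything.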
I show the Aggregate as not being "in line" with the source because at this point in a data flow there is a rift between the data before and the data after. If it had been Source -> Derived Column -> Aggregate, the Derived Column would work on the same memory address (yay pointers) as the Source. Asynchronous components copy the data in memory into a separate memory space. So, instead of being able to allocate 1 GB to your data flow as a whole, it has to spend 0.5 GB on the first half and 0.5 GB on the last half.
If you rename a column "upstream" from an asynchronous component, you can tell where the "break" in data lineage is because that column won't be available to the final destination until you modify all the async components between the source and the destination.
Asynchronous, partially blocking
OLE DB Source 1 -->
                    Merge Join -> OLE DB Destination
OLE DB Source 2 -->
Merge Join is an asynchronous, partially blocking component. You can usually spot the partially blocking components because they require sorted input. For an Aggregate, we have to have all the data before we can say "this is the maximum value." With a Merge Join, since we know that both streams are sorted on the key, we can release rows as soon as the current key has run out of matches. Assume I have the Merge Join in an INNER JOIN configuration, with rows A, B, C coming from the db1 feed and A, C from db2. While A matches A, we release rows to the destination. We exhaust the As and move to the next key. Source 1 provides B, Source 2 provides C. They don't match, so B is discarded and the next row from Source 1 is retrieved. That row is C, which matches, so it continues on.
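The A, B, C versus A, C walk-through can be sketched in Python as a plain sorted-merge inner join. This is an illustration of the general technique only, not how SSIS implements it; the function name is made up, and in-memory lists stand in for the two sorted streams.

```python
from typing import Iterator, List, Tuple


def merge_inner_join(left: List[str], right: List[str]) -> Iterator[Tuple[str, str]]:
    """Inner join two key lists that are ALREADY sorted ascending."""
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1              # left key has no partner: discard it
        elif left[i] > right[j]:
            j += 1              # right key has no partner: discard it
        else:
            key = left[i]
            # Find the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end] == key:
                j_end += 1
            # The key is exhausted on both sides, so its matches can be
            # released now without waiting for the rest of the input.
            for left_row in left[i:i_end]:
                for right_row in right[j:j_end]:
                    yield (left_row, right_row)
            i, j = i_end, j_end


matches = list(merge_inner_join(["A", "B", "C"], ["A", "C"]))
print(matches)
```

Because the generator yields as soon as a key is finished, the A match is available before B or C has been looked at, which is the "partially blocking" part.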
It depends (again)
OLE DB Source -> Script Component -> OLE DB Destination
OLE DB Source ->
                 Script Component -> OLE DB Destination
A script component operates the way you define it. The default is synchronous but you can make it async.
Jorg has a good table flagging the components into the various buckets: https://jorgklein.com/2008/02/28/ssis-non-blocking-semi-blocking-and-fully-blocking-components/comment-page-2/
The comment asks "What about a lookup transform?"
As the referenced article shows, Lookup is in the Synchronous column. However, if one is looking for performance bottlenecks (async components are usually the first place I look), we often point out that the default Lookup will cache all the data in the reference table in the PreExecute phase. If your reference table has 10, 100, or 1,000,000 rows, who cares. However long it takes to run SELECT * FROM MyTable and stream that from the database to the machine SSIS is running on is the performance penalty you pay.
However, if you work at a mutual fund company and have a trade table that records prices for all of your stocks for all time, maybe don't try and pull that data back for a lookup transform, hypothetically speaking of course. Maybe you only needed to get the most recent settlement price so don't be lazy and pull more data than you'll ever need and/or crash the machine.
Related
I have a Data Flow Task that moves a bunch of data from multiple sources to multiple destinations. About 50 in all. The data moved is from one database to another with varying rows and columns in each flow.
While I believe I understand the basic idea behind the Data Flow Task's DefaultBufferMaxRows and DefaultBufferSize as it relates to Rows per Batch and Maximum insert commit size of the Destination, it's not clear to me what happens when there are multiple unrelated source and destination flows.
What I'm wondering is which of the following makes the most sense:
Divide out all the source and destination flows into separate Data Flow Tasks
Divide them into groups that have roughly the same size and number of rows
Leave as is and just make sure to set the properties with enough Buffer Rows and Buffer Size while setting the Rows per batch and Maximum insert commit size to the individual destination
I believe I read somewhere that it's better to have each source and destination in its own Data Flow Task, but I am unable to find the link at this time.
Most examples I've been able to locate online seem to always be for one source to one or more destinations, or just one to one.
Let me start from the basics. A Data Flow Task is a task that organizes a pipeline of data from a data source to a data destination. It is unique in SSIS because it runs the data manipulation inside SSIS itself; all other tasks call external systems to do something with data outside SSIS.
On the relationship between DefaultBufferMaxRows and DefaultBufferSize on one side, and Rows per Batch and Maximum insert commit size of the destination on the other: there is no direct relation. DefaultBufferMaxRows and DefaultBufferSize are properties of the Data Flow pipeline; the pipeline processes rows in batches, and these properties control the batch size. They govern the RAM consumption and performance of the Data Flow Task.
On the other hand, Rows per Batch and Maximum insert commit size are properties of the data destination, namely the OLE DB Destination, and only in Fast Load mode; they control the performance of the destination itself. You may have a Data Flow with a Flat File Destination, which has no Rows per Batch, but the Data Flow will still have the DefaultBufferMaxRows and DefaultBufferSize properties.
Typical usage from my experience:
DefaultBufferMaxRows and DefaultBufferSize control the batch size of the Data Flow pipeline. Tuning them is a tradeoff: bigger batches mean less overhead on batch handling, i.e. less execution time, but more RAM consumption. More RAM consumption means that you might run out of RAM, in which case DFT data buffers will be swapped to disk.
In SSIS 2016+ there is a "magical setting" AutoAdjustBufferSize which tells the engine to autogrow the buffer.
Values for these properties are usually determined during performance tests in a QA environment. In development, use the defaults.
Rows per Batch and Maximum insert commit size control log growth and the ability to roll back all changes. Do not change these unless you really need to. The defaults are generally OK; I have changed them rarely, and only for a specific reason. More on its functions.
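As a back-of-the-envelope illustration of the buffer tradeoff, the number of rows that fit in one pipeline buffer can be approximated as the smaller of DefaultBufferMaxRows and DefaultBufferSize divided by the row width. The Python sketch below assumes the stock defaults (10,000 rows and 10 MB); the function name and the row widths are made up, and the real engine applies more rules than this simple division.

```python
DEFAULT_BUFFER_MAX_ROWS = 10_000        # SSIS default for DefaultBufferMaxRows
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024  # SSIS default for DefaultBufferSize, in bytes


def rows_per_buffer(estimated_row_bytes: int,
                    max_rows: int = DEFAULT_BUFFER_MAX_ROWS,
                    buffer_bytes: int = DEFAULT_BUFFER_SIZE) -> int:
    """Approximate rows per buffer: capped by row count AND by bytes.

    A simplification for illustration; not the engine's exact algorithm.
    """
    if estimated_row_bytes <= 0:
        raise ValueError("estimated_row_bytes must be positive")
    return max(1, min(max_rows, buffer_bytes // estimated_row_bytes))


narrow = rows_per_buffer(200)     # hypothetical 200-byte row: row cap applies
wide = rows_per_buffer(5_000)     # hypothetical 5,000-byte row: byte cap applies
print(narrow, wide)
```

With narrow rows the row-count limit is what binds, so raising only DefaultBufferSize changes nothing; with wide rows it is the byte limit, so raising only DefaultBufferMaxRows changes nothing.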
On package design:
1 pair of Source-Destination per DFT (Data Flow Task). This is optimal: it gives you the most control in terms of tuning, execution order, etc. You can also take advantage of parallel execution of tasks by the SSIS engine. BTW, it simplifies debugging and support.
Division in groups. You can group DFTs in Sequence Containers and define common properties via expressions and variables. But use this only if you really need to, because it complicates your design.
All Source-Destination pairs in one DFT. I would recommend against it; it is complex and error prone.
As a bottom line, keep it simple: 1 pair of Source-Destination per DFT, and play with your parameters only if you have to.
I have a SSIS package with 2 data flow tasks. The first data flow task is filling values into a dimension table. The second data flow task is filling surrogate keys into the fact table. The fact table is referencing the previous filled dimension table via surrogate key. However, a further SSIS package is doing exactly the same, but with data from another data source. Both SSIS packages are fired by SQLServer Agent in low frequency (each 20 - 40 seconds).
I am worried about consistency. If I had a single SSIS package that loaded the data into the dimension table and the fact table, I wouldn't have to be, because it would be possible to build the control flow to enforce the following sequence:
Fill the Dimension table with data from data source 1
Fill the Fact table with data from data source 1 (correct surrogate key to Dim)
Fill the Dimension table with data from data source 2
Fill the Fact table with data from data source 2 (correct surrogate key to Dim)
So in this case the primary key of the Dimension table as well as the corresponding surrogate key in the fact table could be auto-incremented simply in SQL Server DB and everything would be fine.
But since I have 2 SSIS packages, each running independently on a multi-core ETL server in low frequency, I am worrying about the case when the following will happen:
Both packages are starting approximately at the same time
Fill the Dimension table with data from data source 1
Fill the Dimension table with data from data source 2
Fill the Fact table with data from data source 2 (surrogate key to wrong Dim record)
Fill the Fact table with data from data source 1 (surrogate key to wrong Dim record)
Are there any common best practices for this? Or is such handling even necessary, or does SQL Server handle the situation by default, e.g. by forbidding packages from being processed in parallel? Maybe a write lock on both tables at the start of each SSIS package would be enough, but in that case I worry that the other SSIS package would fail if it cannot reach the destination tables. I am new to SSIS and would like to know my options for avoiding this situation (if that is necessary).
One option is to use transactions in SSIS. You can wrap the critical part of each ETL in a transaction.
But I'm not sure I understand what makes you think there could be a problem. If you use an identity column on your dimension table, there cannot be duplicates, no matter how many threads insert at the same time. In your steps 4 and 5, how could you get a surrogate key pointing to the wrong record? Please illustrate your question with an example of how you plan to match your fact rows with your Dim records.
If I understand your question properly, another option is to make them one package and use Sequence Containers. If you don't want to do that, you can still combine them in the control flow with an Execute Package Task; that way you control the flow, and one package will only run after the other. The only disadvantage is that the child package needs to initialize again when executed, so it would probably be better performance-wise to just combine them and create data sources for both in the same package.
As per the attached, we have a Balanced Data Distributor set up in a data transformation covering about 2 million rows. The script tasks are identical: each one opens a connection to Oracle and executes first a delete and then an insert. (This isn't relevant, but it's done that way due to parameter issues with the OLE DB Command and the Microsoft OLE DB Provider for Oracle...)
The issue I'm running into is no matter how large I make my buffers or how many concurrent executions I configure, the BDD will not execute more than five concurrent processes at a time.
I've pulled back hundreds of thousands of rows in a larger buffer, and it just gets divided 5 ways. I've tried this on multiple machines - the current shot is from a 16 core server with -1 concurrent executions configured on the package - and no matter what, it's always 5 parallel jobs.
5 is better than 1, but with 2.5 million rows to insert/update, 15 rows per second at 5 concurrent executions isn't much better than 2-3 rows per second with 1 concurrent execution.
Can I force the BDD to use more paths, and if so how?
Short answer:
Yes, the BDD can make use of more than five paths. You shouldn't need to do anything special to force it; by definition it should do that for you automatically. Then why isn't it using more than 5 paths? Because your source is producing data faster than your destination can consume it, which causes backpressure. To resolve it, you have to tune your destination components.
Long answer:
In theory, "the BDD takes input data and routes it in equal proportions to its outputs, however many there are." In your setup, there are 10 outputs. So input data should be distributed equally to all 10 outputs at the same time, and you should see 10 paths executing at the same time. Again, in theory.
But another concept of the BDD is that "instead of routing individual rows, the BDD operates on buffers of data." This means the data flow engine initiates a buffer, fills it with as many rows as possible, and moves that buffer to the next component (the script destination in your case). As you can see, 5 buffers are used, each with the same number of rows. If additional buffers had been started, you would have seen more paths being used. SSIS couldn't use additional buffers, and therefore additional paths, because of a mechanism called backpressure, which kicks in when the source produces data faster than the destination can consume it. If that were allowed to continue, all memory would be used up by the source data and SSIS would have none left for the transformation and destination components. To avoid this, SSIS limits the number of active buffers. The limit is set to 5 (and can't be changed), which is exactly the number of threads you're seeing.
PS: The text within quotes is from this article
There is a property in SSIS data flow tasks called EngineThreads which determines how many flows can be run concurrently, and its default value is 5 (in SSIS 2012 its default value is 10, so I'm assuming you're using SSIS 2008 or earlier.) The optimal value is dependent on your environment, so some testing will probably be required to figure out what to put there.
Here's a Jamie Thomson article with a bit more detail.
Another interesting thing I've discovered via this article on CodeProject.
[T]his component uses an internal buffer of 9,947 rows (as per the
experiment, I found so) and it is pre-set. There is no way to override
this. As a proof, instead of 10 lac rows, we will use only 9,947 (Nine
thousand nine forty seven ) rows in our input file and will observe
the behavior. After running the package, we will find that all the
rows are being transferred to the first output component and the other
components received nothing.
Now let us increase the number of rows in our input file from 9,947 to
9,948 (Nine thousand nine forty eight). After running the package, we
find that the first output component received 9,947 rows while the
second output component received 1 row.
So I notice in your first buffer run that you pulled 50,000 records. Those got divided into 9,984 record buckets and passed to each output. So essentially the BDD takes the records it gets from the buffer and passes them out in ~10,000 record increments to each output. So in this case perhaps your source is the bottleneck.
Perhaps you'll need to split your original Source query in half and create two BDD-driven data flows to in essence double your parallel throughput.
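To picture the buffer-at-a-time behaviour from the quoted experiment, here is a small Python sketch. It assumes whole buffers are handed to the outputs round-robin, which is my reading of the quote and not a documented algorithm; the function name and the choice of 4 outputs are hypothetical.

```python
from typing import List


def distribute_buffers(total_rows: int, rows_per_buffer: int, outputs: int) -> List[int]:
    """Rows received per output when WHOLE buffers are handed out in turn.

    Round-robin order is an assumption made for illustration.
    """
    if rows_per_buffer <= 0 or outputs <= 0:
        raise ValueError("rows_per_buffer and outputs must be positive")
    received = [0] * outputs
    buffer_index = 0
    remaining = total_rows
    while remaining > 0:
        size = min(rows_per_buffer, remaining)   # last buffer may be partial
        received[buffer_index % outputs] += size
        remaining -= size
        buffer_index += 1
    return received


exactly_one_buffer = distribute_buffers(9_947, 9_947, 4)
one_row_over = distribute_buffers(9_948, 9_947, 4)
print(exactly_one_buffer)
print(one_row_over)
```

With 9,947 rows everything lands on the first output, and with 9,948 rows the second output receives a single row, matching what the quoted article observed. The practical takeaway is the same as above: an output only gets work when there is a whole extra buffer to give it.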
We have the need to perform an end of day process to extract the daily transactions from System A and transfer only the changes to System B.
The problem is that System A can only provide the full set of transactions available in System A.
My initial thought was to use a staging table (SQL Server) that persists the data from System A and is then used for comparison on each execution of the end-of-day process. This can all be done using table joins to identify the required UPDATEs, INSERTs, and DELETEs.
Not being an SSIS expert, I understand this could also be done in SSIS using Lookups to identify the additions, updates, and deletions.
Question:
Is the SSIS solution a better approach, and why (maintainability, scalability, extensibility)?
Which would be better performing? Any experience on these 2 options?
Is there any alternative option?
Since you need the full set of transactions from System A, that limits your options as far as source goes. I recommend pulling that data down to a Raw File Destination. This will help you as you develop, since you can just run the tasks that need that data over and over again without refetching. Also, make sure that the source data is sorted by the origin machine. SSIS is very weak with sorting unless you use a 3rd party component (which may be a career-limiting decision in some cases).
Anyway, let's assume that you have that sorted Raw File lying around. Next thing you do is toss that into a Data Flow as a Raw File Source. Then, have an OLEDB (or whatever) source that represents System B. You could use Raw File for that, also, if you like. Make sure that the data from System B is sorted using the same columns you used to sort System A.
Mark the source outputs with IsSorted=True, and set the SortKeyPosition value on the appropriate columns in the metadata. This tells SSIS that the data is pre-sorted, and it will permit you to JOIN on your key columns. Otherwise, you may wait days for SSIS to sort big sets.
Add MultiCasts to both System A and System B's sources, because we want to leverage them twice.
Next, add a Merge Join to join the two Raw File Sources together. Make System A the left input. System B will become the right input when you connect it to the Merge Join. SSIS will automatically set the JOIN up on those sorted columns that you marked in the previous step. Set the Merge Join to use LEFT JOIN. This way, we can find rows in System A that do not exist in System B, and we can compare existing rows to see if they were changed.
Next up, add a Conditional Split. In there, you can define 2 output buffers based on conditions.
NewRows: ISNULL(MyRightTable.PrimaryKey)
UpdatedRows: [Whatever constitutes an updated row]
The default output will take whatever rows do not meet those 2 conditions, and may be safely ignored.
We are not done.
Add another Merge Join.
This time, make an input from System B's MultiCast the left input. Make an input from System A's MultiCast the right input. Again, SSIS will set the join keys up for you. Set the Merge Join to use Left Join.
Add a Conditional Split down this flow, and the only thing you need is this:
DeletedRows: ISNULL(MyRightTable.PrimaryKey)
The default output will take all of the other rows, which can be ignored.
Now you have 3 buffers.
2 of them come out of your first Merge Join, and represent New Rows and Updated Rows.
1 of them comes out of your second Merge Join, and represents Deleted Rows.
Take action.
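For anyone who wants to see the logic outside the designer, the two left joins and their conditional splits can be sketched in Python. This is an illustration only: dicts keyed by primary key stand in for the sorted inputs, the sample rows are invented, and "whatever constitutes an updated row" is assumed to be a simple inequality of the row values.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def left_join(left: Dict[int, str],
              right: Dict[int, str]) -> Iterator[Tuple[int, str, Optional[str]]]:
    """LEFT JOIN on the key; the right side is None when there is no match."""
    for key in sorted(left):
        yield key, left[key], right.get(key)


def classify_changes(system_a: Dict[int, str],
                     system_b: Dict[int, str]) -> Tuple[List[int], List[int], List[int]]:
    new_rows: List[int] = []
    updated_rows: List[int] = []
    deleted_rows: List[int] = []

    # First Merge Join: A LEFT JOIN B, then split into new and updated rows.
    for key, a_row, b_row in left_join(system_a, system_b):
        if b_row is None:
            new_rows.append(key)        # in A, missing from B
        elif a_row != b_row:
            updated_rows.append(key)    # assumed definition of "updated"
        # anything else is unchanged and falls to the default output

    # Second Merge Join: B LEFT JOIN A, then split out the deleted rows.
    for key, _b_row, a_row in left_join(system_b, system_a):
        if a_row is None:
            deleted_rows.append(key)    # in B, no longer in A

    return new_rows, updated_rows, deleted_rows


# Hypothetical sample data: key -> row contents.
system_a = {1: "alpha", 2: "beta-changed", 4: "delta"}
system_b = {1: "alpha", 2: "beta", 3: "gamma"}

new_rows, updated_rows, deleted_rows = classify_changes(system_a, system_b)
print(new_rows, updated_rows, deleted_rows)
```

The three lists correspond to the three buffers: insert the new rows into System B, update the changed ones, and delete the ones that have disappeared from System A.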
I have a data flow to migrate rows from a database to a new version of the database. One of the changes we are making is to replace user name strings with an integer identifier.
I'm using a Lookup component to replace the Manager and Trader names with their numeric ID but one of the transforms seems to be performing very slowly compared to the other. In the following screen shot it shows how far behind the Lookup Trader component is compared to the Lookup Manager component.
As you can see, approx. 22m rows have been passed to the Lookup Manager component and it is "only" 100k rows behind, but the data passed to the Lookup Trader component is almost 8m rows behind.
The lookups contain the same query to get the user name and ID for all the traders and managers (they are maintained in the same table) and are both set to use Full Cache. They are looking up the same data type (string) and both add a new field to the flow of type INT.
I don't understand why one component is performing so much faster than the other when they are essentially the same. The Warning icons on both components are shown because I have set the error action to Fail Component while debugging even though there is an error output connected. Later, I'll redirect the errors to a flat file.
My question is twofold: why is one performing much slower than the other and, more importantly, how do I find out why?
After the comment from Damien_The_Unbeliever about components later in the data flow, I did a test by extracting the two lookup components and directing their output to a Trash Destination; I also reversed the order of the components to see if one was operating faster than the other.
All was running well until both lookup components paused and the input raced ahead to nearly 5m rows but overall the data was processed quickly and the Lookup Trader component did not work more slowly than the Lookup Manager.
So it would seem the later components are causing a bottleneck in the flow and therefore making it appear the Lookup Trader was the culprit.