I have a data flow task within SSIS 2008 that pulls about 2,000,000 rows with an OLE DB source, then goes one-by-one through 24 lookup transformations replacing "code" data with either its pre-defined equivalent from a dimension table, or an "unknown" value.
This one-by-one process, with the entire flow running through each transformation, has become a major bottleneck in the execution of the package, and I need a way to speed it up. Any ideas?
I've tried multicasting the data set to each of the 24 different lookups (so that only the necessary column is sent to each one), but when I then merge them back together with a Union All, the transformation doesn't seem to like the various data types and tends to throw errors no matter how I configure it. Is there another option I'm missing?
I would do it all in pure TSQL: insert the 2 million rows into a staging table and use UPDATE statements to set the values you need. That will almost certainly be much faster than a row-by-row lookup process, and you can also put indexes on the staging table if necessary.
After the data is updated, you can push it on to the destination table in another data flow or, if the staging and destination tables are on the same server, just use INSERT ... SELECT ... to do it.
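For example, here is a minimal sketch of one such update, assuming a hypothetical staging table dbo.StagingFact and a hypothetical dimension table dbo.DimColor (your table and column names will differ):
--Replace one raw code with its dimension equivalent, or 'Unknown' when no match exists
UPDATE STG
SET STG.ColorCode = COALESCE(DIM.ColorDescription, 'Unknown')
FROM dbo.StagingFact AS STG
LEFT JOIN dbo.DimColor AS DIM
ON DIM.ColorCode = STG.ColorCode;
Repeat (or combine) similar statements for the other code columns; as noted above, indexes on the join columns can help.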
Personally, I always avoid SSIS transformations if there's an easy way to do it in TSQL; performance is better and I find TSQL code easier to maintain than SSIS packages. Having said that, SSIS is a great control-flow tool for getting data from different places, delivering it to a staging database where you can work on it, and executing procedures or scripts to transform the data.
It is possible that the bottleneck is not where you think it is. The destination component might be what is slowing down the package: the transformations wait until each batch of data has been inserted into the destination, which makes it look as if the transformations shown in yellow are running slowly. In my experience, lookup transformations themselves are actually very fast.
The following example reads 1 million rows from a flat file source and inserts them into SQL Server. Even though it uses only one lookup, I have provided it here to give you an idea of how multiple destination components can help. Having multiple destinations accept the data processed by the various transformations will speed up the package.
I hope this example gives you an idea about how you can improve your package performance.
Step-by-step process:
In the SQL Server database, create two tables, namely dbo.ItemInfo and dbo.Staging. The create-table queries are available under the Scripts section, and the structure of these tables is shown in screenshot #1. ItemInfo holds the actual data and the Staging table holds the staging data used to compare and update the actual records. The Id column in both tables is an auto-generated identity column. The IsProcessed column in ItemInfo is used to identify and delete records that are no longer valid.
Create an SSIS package and create 5 variables as shown in screenshot #2. I have used the .txt extension for the tab-delimited files, hence the value *.txt in the variable FileExtension. The FilePath variable is assigned a value at run time. The FolderLocation variable denotes where the files are located. The SQLPostLoad and SQLPreLoad variables denote the stored procedures used during the pre-load and post-load operations. Scripts for these stored procedures are provided under the Scripts section.
Create an OLE DB connection pointing to the SQL Server database. Create a flat file connection as shown in screenshots #3 and #4. The Flat File Connection Columns section below contains the column-level information, and screenshot #5 shows a preview of the column data.
Configure the Control Flow as shown in screenshot #6. Configure the tasks Pre Load, Post Load and Loop Files as shown in screenshots #7 - #10. Pre Load truncates the staging table and sets the IsProcessed flag to false for all rows in the ItemInfo table. Post Load applies the changes and deletes rows in the database that are not found in the file. Refer to the stored procedures used in those tasks to understand what each of these Execute SQL tasks does.
Double-click on the Load Items data flow task and configure it as shown in screenshot #11. Read File is a flat file source configured to use the flat file connection. Row Count is a derived column transformation and its configuration is shown in screenshot #12. Check Exist is a lookup transformation and its configurations are shown in screenshots #13 - #15. The Lookup No Match Output is redirected to Destination Split on the left side, and the Lookup Match Output is redirected to Staging Split on the right side. Destination Split and Staging Split have the exact same configuration, shown in screenshot #16. The reason for 9 different destinations for both the destination and staging tables is to improve the performance of the package.
All the destination tasks 0 - 8 are configured to insert data into table dbo.ItemInfo as shown in screenshot #17. All the staging tasks 0 - 8 are configured to insert data into dbo.Staging as shown in screenshot #18.
On the Flat File connection manager, set the ConnectionString property to use the variable FilePath as shown in screenshot #19. This will enable the package to use the value set in the variable as it loops through each file in a folder.
Test scenarios:
Test results may vary from machine to machine. In this scenario, the file was located locally on the machine; files on a network might perform slower. These figures are provided just to give you an idea, so please take them with a grain of salt.
The package was executed on a 64-bit machine with a single-core Xeon CPU at 2.5 GHz and 3.00 GB of RAM.
Loaded a flat file with 1 million rows. The package executed in about 2 minutes 47 seconds. Refer to screenshots #20 and #21.
Used the queries provided under the Test queries section to modify the data to simulate updates, deletes and creation of new records during the second run of the package.
Loaded the same file containing the 1 million rows after the following queries were executed in the database. The package executed in about 1 minute 35 seconds. Refer to screenshots #22 and #23. Please note the number of rows redirected to the destination and staging tables in screenshot #22.
Hope that helps.
Test queries:
--These records will be deleted during next run
--because item ids won't match with file data.
--(111111 row(s) affected)
UPDATE dbo.ItemInfo SET ItemId = 'DEL_' + ItemId WHERE Id % 9 IN (3)
--These records will be modified back to their original item type of 'General'
--during the next run because that is the data present in the file.
--(222222 row(s) affected)
UPDATE dbo.ItemInfo SET ItemType = 'Testing' + ItemId WHERE Id % 9 IN (2,6)
--These records will be reloaded into the table from the file.
--(111111 row(s) affected)
DELETE FROM dbo.ItemInfo WHERE Id % 9 IN (5,9)
Flat File Connection Columns
Name InputColumnWidth DataType OutputColumnWidth
---------- ---------------- --------------- -----------------
Id 8 string [DT_STR] 8
ItemId 11 string [DT_STR] 11
ItemName 21 string [DT_STR] 21
ItemType 9 string [DT_STR] 9
Scripts: (to create both tables and stored procedures)
CREATE TABLE [dbo].[ItemInfo](
[Id] [int] IDENTITY(1,1) NOT NULL,
[ItemId] [varchar](255) NOT NULL,
[ItemName] [varchar](255) NOT NULL,
[ItemType] [varchar](255) NOT NULL,
[IsProcessed] [bit] NULL,
CONSTRAINT [PK_ItemInfo] PRIMARY KEY CLUSTERED ([Id] ASC),
CONSTRAINT [UK_ItemInfo_ItemId] UNIQUE NONCLUSTERED ([ItemId] ASC)) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Staging](
[Id] [int] IDENTITY(1,1) NOT NULL,
[ItemId] [varchar](255) NOT NULL,
[ItemName] [varchar](255) NOT NULL,
[ItemType] [varchar](255) NOT NULL,
CONSTRAINT [PK_Staging] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO
CREATE PROCEDURE [dbo].[PostLoad]
AS
BEGIN
SET NOCOUNT ON;
UPDATE ITM
SET ITM.ItemName = STG.ItemName
, ITM.ItemType = STG.ItemType
, ITM.IsProcessed = 1
FROM dbo.ItemInfo ITM
INNER JOIN dbo.Staging STG
ON ITM.ItemId = STG.ItemId;
DELETE FROM dbo.ItemInfo
WHERE IsProcessed = 0;
END
GO
CREATE PROCEDURE [dbo].[PreLoad]
AS
BEGIN
SET NOCOUNT ON;
TRUNCATE TABLE dbo.Staging;
UPDATE dbo.ItemInfo
SET IsProcessed = 0;
END
GO
Screenshots #1 - #23: (images omitted)
Related
I want to be able to update a specific column of a table using data from another table. Here's what the two tables look like, along with the DB type and the SSIS component used to get each table's data (by the way, both ID and Code are unique).
Table1(ID, Code, Description) [T-SQL DB accessed using ADO NET Source component]
Table2(..., Code, Description,...) [MySQL DB accessed using ODBC Source component]
I want to update the column Table1.Description using Table2.Description by first matching the rows on the Code column (because Table1.Code is the same as Table2.Code).
What I tried:
Doing a Merge Join transformation using the Code column, but I couldn't figure out how to get the result back into the table; since Table1 has relationships, I can't simply drop the table and replace it with the new one.
Using a Lookup transformation, but since the two tables are not in the same kind of database, it didn't allow me to create the lookup table's connection manager (which in my case would be for MySQL).
I'm still new to SSIS, but any ideas or help would be greatly appreciated.
My solution is based on @Akina's comments. Although using a linked server would definitely have fit, my requirement is to make an SSIS package to take care of migrating some old data.
The first and last are SQL tasks, while the Migrate ICDDx is the DFT that transfers the data to a staging table created during the first SQL task.
Here's the SQL command that gets executed during Create Staging Table:
DROP TABLE IF EXISTS [tempdb].[##stagedICDDx];
CREATE TABLE ##stagedICDDx (
ID INT NOT NULL,
Code VARCHAR(15) NOT NULL,
Description NVARCHAR(500) NOT NULL,
........
);
And here's the SQL command (based on @Akina's comment) for transferring the data from the staging table to the final table (inside Transfer Staged):
UPDATE [MyDB].[dbo].[ICDDx]
SET [ICDDx].[Description] = [##stagedICDDx].[Description]
FROM [dbo].[##stagedICDDx]
WHERE [ICDDx].[Code]=[##stagedICDDx].[Code]
GO
Here's the DFT used (both the TSQL and MySQL sources return sorted output using ORDER BY Code, so I didn't have to insert Sort components before the Merge Join):
Note: By the way, you have to set up the connection manager to retain/reuse the same connection (the RetainSameConnection property) so that the global temporary table doesn't get dropped before we transfer data to it. If all goes well, then after the Transfer Staged SQL Task the connection is closed and the global temporary table is deleted.
How can we insert new data or update existing data from one table to another, from MySQL to SQL Server, using SSIS and without using a Lookup?
A common way to do this is to insert the new data into an empty temporary table, and then run a SQL MERGE command (using a separate Execute SQL Task).
The MERGE command is very powerful and can do updates, inserts or even deletes. See the full description of MERGE here:
https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-2017
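As a rough illustration only (the table and column names below are made up, not taken from the question), the MERGE from a temporary table into the destination could look something like this:
MERGE dbo.DestinationTable AS D
USING dbo.TempSourceTable AS S
ON D.BusinessKey = S.BusinessKey
WHEN MATCHED THEN
UPDATE SET D.ColumnA = S.ColumnA, D.ColumnB = S.ColumnB
WHEN NOT MATCHED BY TARGET THEN
INSERT (BusinessKey, ColumnA, ColumnB)
VALUES (S.BusinessKey, S.ColumnA, S.ColumnB);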
The design for this will look like the following:
You will have 4 tables and 1 view: Source, TMP_Dest (exactly like the source, with no PK), CHG_Dest (for changes, exactly like the destination, with no PK), Dest (which has the PK), and the view FV_TMP_Dest (this is in case the destination looks different than the source - different field types).
SSIS package:
1. Use an Execute SQL Task to truncate TMP_Dest, because it is just a temporary table for the extracted data.
2. Use an Execute SQL Task to truncate CHG_Dest, because it is also just a temporary table for the extracted data.
3. Use one Data Flow Task for loading data from Source to TMP_Dest.
4. Define two variables, OperationIDInsert = 1 and OperationIDUpdate = 2 (the values are not important, you can set them as you want); you will use them in step 5 below.
5. Use another Data Flow Task in which you will have:
on the left side, an OLE DB Source in which you will extract data from the view, ordered by the PK (do not forget to set the SortKeyPosition in the Advanced Editor for the PK fields)
on the right side, an OLE DB Source in which you will extract data from Dest, ordered by the PK (do not forget to set the SortKeyPosition in the Advanced Editor for the PK fields)
a Merge Join (LEFT JOIN) between these two sources
on the left side (the "insert side") you will have a Derived Column in which you assign the OperationIDInsert variable as the expression, and an OLE DB Destination for inserting the data into the CHG_Dest table. In this way, you insert the data that has to be inserted into the destination table and you know this because you have the OperationIDInsert value.
on the right side you will do the same thing but using the OperationIDUpdate value
6. Use an Execute SQL Task in the Control Flow that runs a SQL MERGE. Based on the PK fields and the OperationIDInsert/OperationIDUpdate values, you will either insert the data or update it; a rough sketch is given below.
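Here is a rough sketch of that final step, written as separate INSERT and UPDATE statements rather than a single MERGE, and using hypothetical column names (Id as the PK, a single OperationID column in CHG_Dest holding the 1 or 2 value); adjust it to your real schema:
--Rows flagged with OperationIDInsert (1) are new and get inserted
INSERT INTO dbo.Dest (Id, ColumnA, ColumnB)
SELECT C.Id, C.ColumnA, C.ColumnB
FROM dbo.CHG_Dest AS C
WHERE C.OperationID = 1;
--Rows flagged with OperationIDUpdate (2) already exist and get updated
UPDATE D
SET D.ColumnA = C.ColumnA, D.ColumnB = C.ColumnB
FROM dbo.Dest AS D
INNER JOIN dbo.CHG_Dest AS C ON C.Id = D.Id
WHERE C.OperationID = 2;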
Hope this will help you. Let me know if you need additional info.
After I run a query I get the following error:
OLE DB provider "IBMDASQL" for linked server "DB2400OLEDB" returned message "SQL7008: TABLE1 in STAGING not valid for operation.
Cause . . . . . : **The reason code is 3.**
Reason codes are:
1 -- TABLE1 has no members.
2 -- TABLE1 has been saved with storage free. ***
3 -- TABLE1 not journaled, no authority to the journal, or the journal state is *STANDBY. Files with an RI constraint action of CASCADE, SET NULL, or SET DEFAULT must be journaled to the same journal.***
4 and 5 -- TABLE1 is in or being created into production library but the user has debug mode UPDPROD(*NO).
6 -- Schema being created, but user in debug mode with UPDPROD(*NO).
7 -- A based-on table used in creation of a view is not valid. Either the table is program described table or it is in a temporary schema.
8 -- Based-on table resides in a different ASP than ASP of object being created.
9 -- Index is currently held or is not valid.
10 -- A constraint or trigger is being added to an invalid type of table, or the maximum number of triggers has been reached, or all nodes of the distributed table are not at the same release level.
11 -- Distributed table is being created in schema QTEMP, or a view is being created over more than one distributed table.
12 -- Table could not be created in QTEMP, QSYS, QSYS2, or SYSIBM because it contains a column of type DATALINK having the FILE LINK CONTROL option.
13 -- The table contains a DATALINK column or a LOB column that conflicts with the data dictionary.
14 -- A DATALINK, LOB, or IDENTITY column cannot be added to a non SQL table.
15 -- Attempted to create or change an object using a commitment definition in a different ASP.
16 -- Sequence TABLE1 in STAGING was incorrectly modified with a CL command.
17 -- The table is not usable because it contains partial transactions.
Recovery . . . :
Do one of the following based on the reason code:
1 -- Add a member to TABLE1 (ADDPFM).
2 -- Restore TABLE1 (RSTOBJ).
3 -- Start journaling on TABLE1 (STRJRNPF), get access to the journal, or change the ...
IBMDASQL does support commitment control; it's IBMDA400 that does not (according to the http://www-01.ibm.com/support/docview.wss?uid=nas8N1014514 link). If the table is not journaled, then the transactions must run with commitment control disabled. That's a bad practice. The table should be journaled. If it causes significant performance issues, it's almost certainly because the system is too small for its workload or it's configured poorly.
I am not a DBA but I do work for a small company as the IT person. I have to replicate a database from staging to production. I have created an SSIS package to do this but it takes hours to run. This isn't a large data warehouse type of project, either, it's a pretty straightforward Upsert. I'm assuming that I am the weak link in how I designed it.
Here's my procedure:
Truncate staging tables (EXECUTE SQL TASK)
Pull data from a development table into staging (Data Flow Task)
Run a data flow task
OLE DB Source
Conditional Split Transformation (Condition used: [!]ISNULL(is_new_flag))
If new insert, if existing update
The data flow task is mimicked a few times to change tables/values, but the flow is the same. I've read several things about OLE DB components being slow at updates and have tried a few things, but I haven't gotten it to run very quickly.
I'm not sure what other details to give, but I can give anything that's asked for.
Sample package using SSIS 2008 R2 that inserts or updates using batch operation:
Here is a sample package written in SSIS 2008 R2 that illustrates how to perform inserts and updates between two databases using batch operations.
Using the OLE DB Command transformation will slow down the update operations in your package because it does not perform batch operations; every row is updated individually.
The sample uses two databases, namely Source and Destination. In my example both databases reside on the same server, but the logic can still be applied for databases residing on different servers and locations.
I created a table named dbo.SourceTable in my source database Source.
CREATE TABLE [dbo].[SourceTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL,
[IsActive] [bit] NULL
)
I also created two tables named dbo.DestinationTable and dbo.StagingTable in my destination database Destination.
CREATE TABLE [dbo].[DestinationTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL
)
GO
CREATE TABLE [dbo].[StagingTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL
)
GO
Inserted about 1.4 million rows into the table dbo.SourceTable with unique values in the RowNumber column. The tables dbo.DestinationTable and dbo.StagingTable were empty to begin with. All the rows in the table dbo.SourceTable have the IsActive flag set to false.
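The original answer does not show how those source rows were generated; a minimal sketch (not the author's exact script) that produces a similar volume of test data could be:
--Generate roughly 1.4 million rows with unique RowNumber values, all inactive
INSERT INTO dbo.SourceTable (RowNumber, CreatedOn, ModifiedOn, IsActive)
SELECT TOP (1400000)
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)),
GETDATE(),
GETDATE(),
0
FROM sys.all_columns AS A
CROSS JOIN sys.all_columns AS B;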
Created an SSIS package with two OLE DB connection managers, each connecting to Source and Destination databases. Designed the Control Flow as shown below:
First Execute SQL Task executes the statement TRUNCATE TABLE dbo.StagingTable against the destination database to truncate the staging tables.
The next section explains how the Data Flow Task is configured.
Second Execute SQL Task executes the below given SQL statement that updates data in dbo.DestinationTable using the data available in dbo.StagingTable, assuming that there is a unique key that matches between those two tables. In this case, the unique key is the column RowNumber.
Script to update:
UPDATE D
SET D.CreatedOn = S.CreatedOn
, D.ModifiedOn = S.ModifiedOn
FROM dbo.DestinationTable D
INNER JOIN dbo.StagingTable S
ON D.RowNumber = S.RowNumber
I have designed the Data Flow Task as shown below.
OLE DB Source reads data from dbo.SourceTable using the SQL command SELECT RowNumber,CreatedOn, ModifiedOn FROM Source.dbo.SourceTable WHERE IsActive = 1
A Lookup transformation is used to check if the RowNumber value already exists in the table dbo.DestinationTable.
If the record does not exist, it will be redirected to the OLE DB Destination named Insert into destination table, which inserts the row into dbo.DestinationTable.
If the record exists, it will be redirected to the OLE DB Destination named Insert into staging table, which inserts the row into dbo.StagingTable. The data in the staging table is then used in the second Execute SQL Task to perform a batch update.
To activate a few rows for the OLE DB Source, I ran the query below:
UPDATE dbo.SourceTable
SET IsActive = 1
WHERE (RowNumber % 9 = 1)
OR (RowNumber % 9 = 2)
The first execution of the package looked as shown below. All the rows were directed to the destination table because it was empty. The execution of the package on my machine took about 3 seconds.
Ran the row count query again to find the row counts in all three tables.
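The row count query itself is not shown in the answer; it is presumably something simple along the lines of:
SELECT 'SourceTable' AS TableName, COUNT(*) AS RowCnt FROM Source.dbo.SourceTable
UNION ALL
SELECT 'DestinationTable', COUNT(*) FROM Destination.dbo.DestinationTable
UNION ALL
SELECT 'StagingTable', COUNT(*) FROM Destination.dbo.StagingTable;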
To activate a few more rows for the OLE DB Source, I ran the query below:
UPDATE dbo.SourceTable
SET IsActive = 1
WHERE (RowNumber % 9 = 3)
OR (RowNumber % 9 = 5)
OR (RowNumber % 9 = 6)
OR (RowNumber % 9 = 7)
The second execution of the package looked as shown below. The 314,268 rows that were previously inserted during the first execution were redirected to the staging table, and 628,766 new rows were inserted directly into the destination table. The execution of the package on my machine took about 12 seconds. The 314,268 rows in the destination table were then updated in the second Execute SQL Task with the data from the staging table.
Ran the row count query again to find the row counts in all three tables.
I hope that gives you an idea to implement your solution.
The two things I'd look at are your inserts (ensure you are using either the "Table or View - fast load" or "Table name or view name variable - fast load") and your updates.
As you have correctly determined, the update logic is usually where performance falls down and that is due to the OLE DB component firing singleton updates for each row flowing through it. The usual approach people take to overcome this is to write all the updates to a staging table, much as your Insert logic does. Then follow up your Data Flow Task with an Execute SQL Task to perform a bulk Update.
If you are of a mind to acquire 3rd-party tools, PragmaticWorks offers an Upsert destination.
I have a simple data flow.
The source is a small flat file with approximately 16k rows in it.
The destination is an OLE DB destination, a SQL 2008 table with a 3 part Unique key on it.
The data flow goes through some simple transformations: Row Count, Derived Column, Data Conversion, etc.
All simple and all that works fine.
My problem is that within this data there are 2 rows which are duplicates in terms of the primary key, plus the 2 rows they duplicate, so 4 rows in total. On the OLE DB destination I have set the error output to Redirect row, and the rows are sent to an Error table which has enough columns for me to identify the bad rows.
The problem is that even though there are 4 culprits, the transformation keeps writing 1268 rows to the error table.
Any ideas?
Thanks.
**
Just to add: if I remove the 2 duplicate rows, the whole file imports successfully, all 16,875 rows.
There is no question that only 2 rows violate the key, but the error redirection affects 1268.
**
I have found the solution.
The problem goes away if you load the data using the Data access mode 'Table or view' in the OLE DB Destination rather than 'Table or view - fast load'.
The only relevant comment I can find is on MSDN:
Any constraint failure at the destination causes the entire batch of rows defined by FastLoadMaxInsertCommitSize to fail.
So it seems that the batch size was 1268 in my case, and the 2 duplicate rows that violated the key caused the whole batch to be redirected to the error destination table.
Are you sure that the other rows are errors due to the PK violation? There are a couple of additional columns (ErrorCode, ErrorColumn) available through the error path. This may show that you have different issues.
In SQL 2008 you can redirect failing rows to, for example, a flat file destination. Go to the destination OLE DB task, then to the error output (select all fields in the window). In the combo box below, choose Redirect row, click Apply, then OK.
Next, drag the precedence constraint (red arrow) from the OLE DB destination to a new Flat File task and configure this task (don't change the default mapping of the columns).
Now you should be able to find the error rows more easily.
Eric