How do I optimize an Upsert (Update and Insert) operation within an SSIS package? - sql-server-2008

I am not a DBA but I do work for a small company as the IT person. I have to replicate a database from staging to production. I have created an SSIS package to do this but it takes hours to run. This isn't a large data warehouse type of project, either, it's a pretty straightforward Upsert. I'm assuming that I am the weak link in how I designed it.
Here's my procedure:
Truncate staging tables (EXECUTE SQL TASK)
Pull data from a development table into staging (Data Flow Task)
Run a data flow task
OLE DB Source
Conditional Split Transformation (condition used: !ISNULL(is_new_flag))
If new insert, if existing update
The data flow task is mimicked a few times to change tables/values, but the flow is the same. I've read several things about OLE DB components being slow at updates and have tried a few fixes, but I haven't gotten it to run very quickly.
I'm not sure what other details to give, but I can give anything that's asked for.

Sample package using SSIS 2008 R2 that inserts or updates using batch operations:
Here is a sample package, written in SSIS 2008 R2, that illustrates how to perform inserts and updates between two databases using batch operations.
Using an OLE DB Command will slow down the update operations in your package because it does not perform batch operations; every row is updated individually.
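For context, an OLE DB Command configured to do the updates fires a parameterized statement once for every row that flows through it, roughly like the sketch below (the ? parameters are mapped to input columns; the table is the one from this sample). The batch approach described next replaces these per-row statements with a single set-based UPDATE.
-- Executed once per row when an OLE DB Command handles the updates
UPDATE dbo.DestinationTable
SET    CreatedOn  = ?,
       ModifiedOn = ?
WHERE  RowNumber  = ?;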
The sample uses two databases, named Source and Destination. In my example, both databases reside on the same server, but the logic can still be applied to databases residing on different servers and locations.
I created a table named dbo.SourceTable in my source database Source.
CREATE TABLE [dbo].[SourceTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL,
[IsActive] [bit] NULL
)
Also, created two tables named dbo.DestinationTable and dbo.StagingTable in my destination database Destination.
CREATE TABLE [dbo].[DestinationTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL
)
GO
CREATE TABLE [dbo].[StagingTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL
)
GO
Inserted about 1.4 million rows into the table dbo.SourceTable, with unique values in the RowNumber column. The tables dbo.DestinationTable and dbo.StagingTable were empty to begin with. All the rows in dbo.SourceTable have the IsActive flag set to false.
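The original does not show how the test rows were generated; a set-based insert along these lines could be used (a sketch, assuming sys.all_objects cross-joined with itself yields enough rows on the instance):
-- Generate ~1.4 million test rows with unique RowNumber values and IsActive = 0
;WITH Numbers AS (
    SELECT TOP (1400000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNumber
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.SourceTable (RowNumber, CreatedOn, ModifiedOn, IsActive)
SELECT RowNumber, GETDATE(), GETDATE(), 0
FROM Numbers;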
Created an SSIS package with two OLE DB connection managers, each connecting to Source and Destination databases. Designed the Control Flow as shown below:
First Execute SQL Task executes the statement TRUNCATE TABLE dbo.StagingTable against the destination database to truncate the staging table.
Next section explains how the Data Flow Task is configured.
Second Execute SQL Task executes the SQL statement given below, which updates data in dbo.DestinationTable using the data available in dbo.StagingTable, assuming that there is a unique key that matches between those two tables. In this case, the unique key is the column RowNumber.
Script to update:
UPDATE D
SET D.CreatedOn = S.CreatedOn
, D.ModifiedOn = S.ModifiedOn
FROM dbo.DestinationTable D
INNER JOIN dbo.StagingTable S
ON D.RowNumber = S.RowNumber
I have designed the Data Flow Task as shown below.
OLE DB Source reads data from dbo.SourceTable using the SQL command SELECT RowNumber,CreatedOn, ModifiedOn FROM Source.dbo.SourceTable WHERE IsActive = 1
Lookup transformation is used to check if the RowNumber value already exists in the table dbo.DestinationTable
If the record does not exist, it will be redirected to the OLE DB Destination named Insert into destination table, which inserts the row into dbo.DestinationTable.
If the record exists, it will be redirected to the OLE DB Destination named Insert into staging table, which inserts the row into dbo.StagingTable. The data in the staging table is then used by the second Execute SQL Task to perform the batch update.
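For reference, the Lookup transformation here only needs the key column from the destination table; a query along these lines would do (a sketch, since the exact configuration is shown only in the screenshots), joined to the pipeline on the RowNumber column with the no-match output redirected:
SELECT RowNumber
FROM dbo.DestinationTable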
To activate some rows for the OLE DB Source, I ran the query below:
UPDATE dbo.SourceTable
SET IsActive = 1
WHERE (RowNumber % 9 = 1)
OR (RowNumber % 9 = 2)
First execution of the package looked as shown below. All the rows were directed to the destination table because it was empty. The execution of the package on my machine took about 3 seconds.
Ran a row count query to find the row counts in all three tables.
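The row count query itself is not shown in the original; something along these lines would do:
-- Row counts in all three tables
SELECT 'dbo.SourceTable' AS TableName, COUNT(*) AS NumberOfRows FROM Source.dbo.SourceTable
UNION ALL
SELECT 'dbo.DestinationTable', COUNT(*) FROM Destination.dbo.DestinationTable
UNION ALL
SELECT 'dbo.StagingTable', COUNT(*) FROM Destination.dbo.StagingTable;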
To activate a few more rows for the OLE DB Source, I ran the query below:
UPDATE dbo.SourceTable
SET IsActive = 1
WHERE (RowNumber % 9 = 3)
OR (RowNumber % 9 = 5)
OR (RowNumber % 9 = 6)
OR (RowNumber % 9 = 7)
Second execution of the package looked as shown below. The 314,268 rows that were previously inserted during the first execution were redirected to the staging table. 628,766 new rows were inserted directly into the destination table. The execution of the package on my machine took about 12 seconds. The 314,268 matching rows in the destination table were then updated by the second Execute SQL Task using the data in the staging table.
Ran the row count query again to find the row counts in all three tables.
I hope that gives you an idea to implement your solution.

The two things I'd look at are your inserts (ensure you are using either the "Table or View - fast load" or the "Table name or view name variable - fast load" access mode) and your updates.
As you have correctly determined, the update logic is usually where performance falls down and that is due to the OLE DB component firing singleton updates for each row flowing through it. The usual approach people take to overcome this is to write all the updates to a staging table, much as your Insert logic does. Then follow up your Data Flow Task with an Execute SQL Task to perform a bulk Update.
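A minimal sketch of that pattern, using placeholder table and column names (your Data Flow inserts the would-be updates into the staging table; the Execute SQL Task then applies them in one statement):
-- Set-based update run by the Execute SQL Task after the Data Flow Task
UPDATE T
SET    T.Col1 = S.Col1,
       T.Col2 = S.Col2
FROM   dbo.TargetTable AS T
INNER JOIN dbo.UpdateStaging AS S
        ON S.BusinessKey = T.BusinessKey;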
If you are open to acquiring third-party tools, PragmaticWorks offers an Upsert destination.

Related

Update a table (that has relationships) using another table in SSIS

I want to be able to update a specific column of a table using data from another table. Here's what the two tables look like, along with the DB type and the SSIS components used to get each table's data (by the way, both ID and Code are unique).
Table1(ID, Code, Description) [T-SQL DB accessed using ADO NET Source component]
Table2(..., Code, Description,...) [MySQL DB accessed using ODBC Source component]
I want to update the column Table1.Description using the Table2.Description by matching them with the right Code first (because Table1.Code is the same as Table2.Code).
What I tried:
Doing a Merge Join transformation using the Code column, but I couldn't figure out how to reinsert the result because Table1 has relationships, so I can't simply drop the table and replace it with the new one.
Using a Lookup transformation, but since the two tables don't come from the same kind of source it didn't allow me to create the lookup table's connection manager (which in my case would be MySQL).
I'm still new to SSIS, but any ideas or help would be greatly appreciated.
My solution is based on @Akina's comments. Although using a linked server would've definitely fit, my requirement is to make an SSIS package to take care of migrating some old data.
The first and last tasks are Execute SQL Tasks, while Migrate ICDDx is the DFT that transfers the data to a staging table created during the first SQL task.
Here are the SQL commands that get executed during Create Staging Table:
DROP TABLE IF EXISTS [tempdb].[##stagedICDDx];
CREATE TABLE ##stagedICDDx (
ID INT NOT NULL,
Code VARCHAR(15) NOT NULL,
Description NVARCHAR(500) NOT NULL,
........
);
And here's the SQL command (based on @Akina's comment) for transferring from staged to final (inside Transfer Staged):
UPDATE [MyDB].[dbo].[ICDDx]
SET [ICDDx].[Description] = [##stagedICDDx].[Description]
FROM [dbo].[##stagedICDDx]
WHERE [ICDDx].[Code]=[##stagedICDDx].[Code]
GO
Here's the DFT used (both the T-SQL and MySQL sources return sorted output using ORDER BY Code, so I didn't have to insert Sort components before the Merge Join):
Note: you have to set up the connection manager to retain/reuse the same connection (RetainSameConnection = True) so that the global temporary table doesn't get dropped before we transfer data into it. If all goes well, then after the Transfer Staged SQL Task the connection is closed and the global temporary table is dropped.

delete entry in DB without reference

We have the below requirement:
Currently, we get the data from source (another server, another team, another DB) into a temp DB (via batch jobs) and after we get data into our temp DB, we process the data, transform and update our primary DB with the difference (i.e. the records that changed or the newly added records).
Source->tempDB (daily recreated)->delta->primaryDB
Requirement:
- To delete the data in the primary DB once it's deleted in the source.
Ex: suppose a record with ID=1 is created in source, it comes to temp DB and eventually makes it to primary DB. When this record is deleted in source, it should get deleted in primary DB also.
Challenge:
How do we delete from the primary DB when there is nothing to refer to in the temp DB (since the record is already deleted in the source, nothing comes into the temp DB)?
Naive approach:
- We can clean up the primary DB before every transform and load afresh. However, it takes a significant amount of time to clean up and populate the primary DB every time.
You could create triggers on each table that fill a history table with deleted entries. Synch that over to your temp DB and use it to delete the corresponding rows in your primary DB.
You either want one "delete-history" table per table, or a combined history table that also includes the name of the table that triggered the deletion.
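A minimal sketch of such a trigger, using illustrative table and column names (dbo.Orders and its Id column stand in for one of your source tables):
-- Combined history table shared by the delete triggers
CREATE TABLE dbo.DeleteHistory (
    TableName SYSNAME  NOT NULL,
    DeletedId INT      NOT NULL,
    DeletedOn DATETIME NOT NULL DEFAULT GETDATE()
);
GO
-- One trigger per source table
CREATE TRIGGER dbo.trg_Orders_Delete ON dbo.Orders
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.DeleteHistory (TableName, DeletedId)
    SELECT 'Orders', d.Id
    FROM deleted AS d;
END
GO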
You might want to look into SQL Compare or other tools for synching tables.
If you have access to tempDB and primeDB at the same time (same server or linked servers), you could also try something like:
DELETE P
FROM primeDB.dbo.Tablename P
WHERE NOT EXISTS (
    SELECT 1
    FROM tempDB.dbo.Tablename T
    WHERE T.Id = P.Id
)
which will perform awfully - ask your db designers.
In this scenario, if the temp DB and the primary DB have no direct reference to each other, you can use event notifications at the database level to track the changes.
Here is a link I found on the same topic:
https://www.mssqltips.com/sqlservertip/2121/event-notifications-in-sql-server-for-tracking-changes/

How to insert rows daily using SSIS so that Identity column will record count of rows?

I have an SSIS package scheduled every day. The purpose of the package is to copy one table's data from each of 100 databases on server A to server B. The number of databases increases day by day, so tomorrow the total will be 101 databases and the following day 102.
The package truncates all the data from the table and then reloads it from the 100 databases, plus the new 101st database. Executing the package through a SQL job is taking ages.
The table has the same column structure in every database, with an Identity RowID column. Instead of loading every database from the start each day, I need the package to load only the new databases, i.e. 101, 102, 103 and so on, so that the Identity RowID column keeps incrementing.
Is there any way to do this so that it will take less time?
Thanks.
If you only need to transfer the newer databases and ignore the old ones, here is a way to do it:
Create a new log table for logging purposes. It could have a few columns such as DatabaseName and ImportedDate (see the sketch after this list).
Create a new variable to hold the name of the database being processed.
Add an Execute SQL Task before the actual transfer task to check whether the database being processed already exists in that log table. The command will look something like this:
IF NOT EXISTS (SELECT DatabaseName FROM logTable WHERE DatabaseName = ?)
BEGIN
    SELECT 1
END
Set the result set to Single row.
Create a new variable to map that result in the Execute SQL Task.
Use that result variable in a precedence constraint expression to control whether to process this database or move on and check another one.
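A minimal sketch of the log table itself and of a logging statement the package could run after each transfer (names are illustrative; the ? parameter maps to the database-name variable):
-- Hypothetical log table
CREATE TABLE dbo.logTable (
    DatabaseName SYSNAME  NOT NULL,
    ImportedDate DATETIME NOT NULL DEFAULT GETDATE()
);
GO
-- Run after a database has been transferred so it is skipped on later runs
INSERT INTO dbo.logTable (DatabaseName) VALUES (?);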
Hope this speeds up your process.

Any way to specify pre and post SSRS execution script?

I want to run some setup SQLs before the content of my report is being processed and then at the end run some cleanup SQLs. e.g. some ALTER statements at the beginning and revert the ALTER at the end.
These should be run per report, and users will be accessing the reports via the web URL of the report server. I wonder if these SQL statements can be configured in the report definition file (.rdl) using BIDS, or whether I can configure this on the SSRS server side or in the underlying database. And how?
First I should say that you may not have the best process if you need to ALTER a table back and forth for a query but I know that crazy stuff is sometimes necessary.
You can add DDL statements to your dataset query.
Here's a query from a dataset I have that creates a temp table and runs some other processing before SELECTing the data needed.
CREATE TABLE #TEMP_CENSUS(
GEO_DATA GEOMETRY NOT NULL,
VALUE DECIMAL(12, 4) NOT NULL DEFAULT 0,
NAME NVARCHAR(50) NULL,
GEO NVARCHAR(250) NULL ) ON [PRIMARY]
INSERT INTO #TEMP_CENSUS(GEO_DATA, VALUE, NAME)
exec dbo.CreateHeatMap 20, 25, ...
Unfortunately, you want other operations after your data is selected. For your reverting ALTER statements, you would want to create another dataset using the same source with the alter statements.
In your DataSource, check the Use Single Transaction box so that the two datasets will be performed in order (as they appear in the Dataset list) so your first dataset will ALTER the tables you need then SELECT your data. Then the second query will run to unALTER (re/de -ALTER?) the tables. You may need to add a SELECT of some sort to the second dataset query so it has some data so SSRS doesn't freak out - I haven't had to run any DDL without returning data (yet).
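For example, the second dataset's query could look something like the sketch below; the ALTER here is just a placeholder for whatever you need to revert, and the trailing SELECT returns a trivial row so the dataset has a result set:
-- Revert whatever the first dataset changed (placeholder statement)
ALTER TABLE dbo.SomeTable DROP COLUMN TempColumn;
-- Give SSRS something to bind to
SELECT 1 AS CleanupDone;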

SSIS Lookup Transformation Stack

I have a data flow task within SSIS 2008 that pulls about 2,000,000 rows with an OLE DB source, then goes one-by-one through 24 lookup transformations replacing "code" data with either its pre-defined equivalent from a dimension table, or an "unknown" value.
This one-by-one process, with the entire flow running through each transformation has become a major bottleneck in the execution of the package and I need a way to speed up the process. Any ideas?
I've tried to multicast the data set to each of the 24 different lookups (so that only the necessary column is sent to it) but when I then run them all into a union all the task seems to not like the various data types and tends to throw errors no matter how I configure it. Is there another option I'm missing?
I would do it all in pure TSQL: insert the 2 million rows into a staging table and use UPDATE statements to set the values you need. That will almost certainly be much faster than a row-by-row lookup process, and you can also put indexes on the staging table if necessary.
After the data is updated, you can push it on to the destination table in another data flow or, if the staging and destination tables are on the same server, just use INSERT ... SELECT ... to do it.
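A minimal sketch of that approach, using illustrative table and column names (one dimension shown; the same pattern repeats for each of the 24 lookups):
-- 1. The 2 million rows have already been landed in dbo.StagingFacts.
-- 2. Replace each code with its dimension value, defaulting to 'Unknown'.
UPDATE S
SET    S.ProductName = COALESCE(D.ProductName, 'Unknown')
FROM   dbo.StagingFacts AS S
LEFT JOIN dbo.DimProduct AS D
       ON D.ProductCode = S.ProductCode;
-- 3. Push the finished rows on to the destination table.
INSERT INTO dbo.FactTable (ProductName, Amount)
SELECT ProductName, Amount
FROM   dbo.StagingFacts;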
Personally, I always avoid SSIS transformations if there's an easy way to do it in TSQL; performance is better and I find TSQL code easier to maintain than SSIS packages. Having said that, SSIS is a great control-flow tool for getting data from different places, delivering it to a staging database where you can work on it and executing procedures or scripts to transform the data.
It is possible that the bottleneck may not be where you think it is. It could be the destination component that is slowing down the performance of the package. The package transformations wait until the batch of data is inserted into the destination, which makes it look as though the transformations that appear in yellow are performing slowly. Actually, lookup transformations are really fast, in my experience.
The following example reads 1 million rows from a flat file source and inserts them into SQL Server. Even though it uses only one lookup, the reason I have provided this example is to give you an idea about having multiple destination components. Having multiple destinations accept the data processed by the various transformations will speed up the package.
I hope this example gives you an idea about how you can improve your package performance.
Step-by-step process:
In the SQL Server database, create two tables named dbo.ItemInfo and dbo.Staging. The create table queries are available under the Scripts section. The structure of these tables is shown in screenshot #1. ItemInfo will hold the actual data, and the Staging table will hold the staging data used to compare against and update the actual records. The Id column in both of these tables is an auto-generated identity column. The IsProcessed column in the table ItemInfo will be used to identify and delete the records that are no longer valid.
Create an SSIS package and create 5 variables as shown in screenshot #2. I have used the .txt extension for the tab-delimited files, hence the value *.txt in the variable FileExtension. The FilePath variable will be assigned a value at run-time. The FolderLocation variable denotes where the files are located. The SQLPreLoad and SQLPostLoad variables denote the stored procedures used during the pre-load and post-load operations. Scripts for these stored procedures are provided under the Scripts section.
Create an OLE DB connection pointing to the SQL Server database. Create a flat file connection as shown in screenshots #3 and #4. Flat File Connection Columns section contains column level information. Screenshot #5 shows the columns data preview.
Configure the Control Flow Task as shown in screenshot #6. Configure the tasks Pre Load, Post Load and Loop Files as shown in screenshots #7 - #10. Pre Load will truncate the staging table and set the IsProcessed flag to false for all rows in the ItemInfo table. Post Load will apply the updates and will delete rows in the database that are not found in the file. Refer to the stored procedures used in those tasks to understand what these Execute SQL Tasks do.
Double-click on the Load Items data flow task and configure it as shown in screenshot #11. Read File is a flat file source configured to use the flat file connection. Row Count is a derived column transformation and its configuration is shown in screenshot #12. Check Exist is a lookup transformation and its configurations are shown in screenshots #13 - #15. The Lookup No Match Output is redirected to the Destination Split, and the Lookup Match Output is redirected to the Staging Split. Destination Split and Staging Split have the exact same configuration, as shown in screenshot #16. The reason for 9 different destinations for both the destination and staging tables is to improve the performance of the package.
All the destination tasks 0 - 8 are configured to insert data into table dbo.ItemInfo as shown in screenshot #17. All the staging tasks 0 - 8 are configured to insert data into dbo.Staging as shown in screenshot #18.
On the Flat File connection manager, set the ConnectionString property to use the variable FilePath as shown in screenshot #19. This will enable the package to use the value set in the variable as it loops through each file in a folder.
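For reference, the Check Exist lookup only needs the business key from the destination table; a query along these lines would do (a sketch; the exact configuration is shown only in the screenshots), matched against the ItemId input column:
SELECT ItemId
FROM dbo.ItemInfo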
Test scenarios:
Test results may vary from machine to machine. In this scenario, the file was located locally on the machine; files on a network might perform more slowly. These numbers are provided just to give you an idea, so please take them with a grain of salt.
The package was executed on a 64-bit machine with a single-core Xeon 2.5 GHz CPU and 3 GB of RAM.
Loaded a flat file with 1 million rows. The package executed in about 2 minutes 47 seconds. Refer to screenshots #20 and #21.
Used the queries provided under the Test queries section to modify the data, simulating updates, deletes and creation of new records for the second run of the package.
Loaded the same file containing the 1 million rows after the following queries were executed in the database. The package executed in about 1 minute 35 seconds. Refer to screenshots #22 and #23. Please note the number of rows redirected to the destination and staging tables in screenshot #22.
Hope that helps.
Test queries:
--These records will be deleted during next run
--because item ids won't match with file data.
--(111111 row(s) affected)
UPDATE dbo.ItemInfo SET ItemId = 'DEL_' + ItemId WHERE Id % 9 IN (3)
--These records will be modified to their original item type of 'General'
--because that is the data present in the file.
--(222222 row(s) affected)
UPDATE dbo.ItemInfo SET ItemType = 'Testing' + ItemId WHERE Id % 9 IN (2,6)
--These records will be reloaded into the table from the file.
--(111111 row(s) affected)
DELETE FROM dbo.ItemInfo WHERE Id % 9 IN (5,9)
Flat File Connection Columns
Name InputColumnWidth DataType OutputColumnWidth
---------- ---------------- --------------- -----------------
Id 8 string [DT_STR] 8
ItemId 11 string [DT_STR] 11
ItemName 21 string [DT_STR] 21
ItemType 9 string [DT_STR] 9
Scripts: (to create both tables and stored procedures)
CREATE TABLE [dbo].[ItemInfo](
[Id] [int] IDENTITY(1,1) NOT NULL,
[ItemId] [varchar](255) NOT NULL,
[ItemName] [varchar](255) NOT NULL,
[ItemType] [varchar](255) NOT NULL,
[IsProcessed] [bit] NULL,
CONSTRAINT [PK_ItemInfo] PRIMARY KEY CLUSTERED ([Id] ASC),
CONSTRAINT [UK_ItemInfo_ItemId] UNIQUE NONCLUSTERED ([ItemId] ASC)) ON [PRIMARY]
GO
CREATE TABLE [dbo].[Staging](
[Id] [int] IDENTITY(1,1) NOT NULL,
[ItemId] [varchar](255) NOT NULL,
[ItemName] [varchar](255) NOT NULL,
[ItemType] [varchar](255) NOT NULL,
CONSTRAINT [PK_Staging] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO
CREATE PROCEDURE [dbo].[PostLoad]
AS
BEGIN
SET NOCOUNT ON;
UPDATE ITM
SET ITM.ItemName = STG.ItemName
, ITM.ItemType = STG.ItemType
, ITM.IsProcessed = 1
FROM dbo.ItemInfo ITM
INNER JOIN dbo.Staging STG
ON ITM.ItemId = STG.ItemId;
DELETE FROM dbo.ItemInfo
WHERE IsProcessed = 0;
END
GO
CREATE PROCEDURE [dbo].[PreLoad]
AS
BEGIN
SET NOCOUNT ON;
TRUNCATE TABLE dbo.Staging;
UPDATE dbo.ItemInfo
SET IsProcessed = 0;
END
GO