I want to update a specific column of a table using data from another table. Here is what the two tables look like, along with the DB type and the SSIS component used to read each table's data (btw, both ID and Code are unique).
Table1(ID, Code, Description) [T-SQL DB accessed using ADO NET Source component]
Table2(..., Code, Description,...) [MySQL DB accessed using ODBC Source component]
I want to update the column Table1.Description with Table2.Description, matching rows on Code first (Table1.Code holds the same values as Table2.Code).
What I tried:
Doing a Merge Join transformation on the Code column, but I couldn't figure out how to write the result back: Table1 has relationships, so I can't simply drop the table and replace it with the new one.
Using a Lookup transformation, but since the two tables are not on the same database type, it didn't let me create the lookup table's connection manager (which in my case would be for MySQL).
I'm still new to SSIS, but any ideas or help would be greatly appreciated.
My solution is based on @Akina's comments. Although a linked server would definitely have fit, my requirement is to build an SSIS package that takes care of migrating some old data.
The first and last tasks are Execute SQL Tasks, while Migrate ICDDx is the Data Flow Task (DFT) that transfers the data to a staging table created by the first Execute SQL Task.
Here are the SQL commands executed during Create Staging Table:
DROP TABLE IF EXISTS [tempdb].[##stagedICDDx];
CREATE TABLE ##stagedICDDx (
ID INT NOT NULL,
Code VARCHAR(15) NOT NULL,
Description NVARCHAR(500) NOT NULL,
........
);
And here's the SQL command (based on @Akina's comment) for transferring from the staging table to the final table (inside Transfer Staged):
UPDATE [MyDB].[dbo].[ICDDx]
SET [ICDDx].[Description] = [##stagedICDDx].[Description]
FROM [dbo].[##stagedICDDx]
WHERE [ICDDx].[Code]=[##stagedICDDx].[Code]
GO
Here's the DFT used (both the T-SQL and MySQL sources return sorted output using ORDER BY Code, so I didn't have to insert Sort components before the Merge Join):
Note: you have to set up the connection manager to retain/reuse the same connection (RetainSameConnection = True) so that the temporary table doesn't get deleted before we transfer data to it. If all goes well, the connection is closed after the Transfer Staged SQL Task and the global temporary table is deleted.
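For readers who want to try the staging idea outside SSIS, here is a minimal Python sketch of the same three steps (create staging table, load it, run one set-based update matched on Code). It uses an in-memory SQLite database as a stand-in for SQL Server, which is an assumption made purely so the example runs anywhere; the function name update_descriptions and the plain stagedICDDx table (instead of a global temporary table) are hypothetical.

```python
import sqlite3


def update_descriptions(conn, staged_rows):
    """Stage (Code, Description) pairs, then update ICDDx in one statement.

    Returns the number of ICDDx rows that were updated.
    """
    cur = conn.cursor()
    # Step 1: recreate the staging table (mirrors "Create Staging Table").
    cur.execute("DROP TABLE IF EXISTS stagedICDDx")
    cur.execute(
        "CREATE TABLE stagedICDDx ("
        "Code TEXT PRIMARY KEY, Description TEXT NOT NULL)"
    )
    # Step 2: stand-in for the data flow that copies rows from the source.
    cur.executemany(
        "INSERT INTO stagedICDDx (Code, Description) VALUES (?, ?)",
        staged_rows,
    )
    # Step 3: set-based update; only rows whose Code is staged are touched.
    cur.execute(
        """
        UPDATE ICDDx
        SET Description = (
            SELECT s.Description FROM stagedICDDx AS s
            WHERE s.Code = ICDDx.Code
        )
        WHERE EXISTS (
            SELECT 1 FROM stagedICDDx AS s WHERE s.Code = ICDDx.Code
        )
        """
    )
    updated = cur.rowcount
    # Clean up the staging table once the transfer is done.
    cur.execute("DROP TABLE stagedICDDx")
    conn.commit()
    return updated
```

Rows in ICDDx keep their ID and relationships because only the Description column is rewritten; staged codes with no match are simply ignored.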
How can we insert new data or update existing data from one table to another, from MySQL to SQL Server, using SSIS and without using a Lookup?
A common way to do this is to insert the new data into an empty temporary table and then run a SQL MERGE command (using a separate Execute SQL Task).
The MERGE command is very powerful and can do updates, inserts or even deletes. See the full description of MERGE here:
https://learn.microsoft.com/en-us/sql/t-sql/statements/merge-transact-sql?view=sql-server-2017
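As a rough illustration of what the MERGE step does, here is a small Python sketch against an in-memory SQLite database. SQLite has no MERGE statement, so the sketch emulates the "when matched, update" and "when not matched, insert" branches with two separate statements; that substitution, the table names Staging and Target, and the function name merge_from_staging are all assumptions for illustration, not the T-SQL syntax from the linked page.

```python
import sqlite3


def merge_from_staging(conn):
    """Upsert Target from Staging; returns (updated, inserted) row counts."""
    cur = conn.cursor()
    # "WHEN MATCHED THEN UPDATE": overwrite rows whose Code is staged.
    cur.execute(
        """
        UPDATE Target
        SET Description = (
            SELECT s.Description FROM Staging AS s
            WHERE s.Code = Target.Code
        )
        WHERE EXISTS (
            SELECT 1 FROM Staging AS s WHERE s.Code = Target.Code
        )
        """
    )
    updated = cur.rowcount
    # "WHEN NOT MATCHED THEN INSERT": add staged rows missing from Target.
    cur.execute(
        """
        INSERT INTO Target (Code, Description)
        SELECT s.Code, s.Description
        FROM Staging AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM Target AS t WHERE t.Code = s.Code
        )
        """
    )
    inserted = cur.rowcount
    conn.commit()
    return updated, inserted
```

The update runs before the insert so that freshly inserted rows are not counted as updates.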
The design for this will look like below:
You will have 4 tables and 1 view: Source, TMP_Dest (exactly like Source, with no PK), CHG_Dest (for changes, exactly like the destination, with no PK), Dest (will have a PK), and FV_TMP_Dest (the view, for the case where the destination looks different from the source, e.g. different field types).
SSIS package :
1. Use an Execute SQL Task to truncate TMP_Dest, because it is just a temporary table for the extracted data.
2. Use an Execute SQL Task to truncate CHG_Dest, because it is also just a temporary table (for the changes).
3. Use one Data Flow Task to load data from Source into TMP_Dest.
4. Define two variables, OperationIDInsert=1 and OperationIDUpdate=2 (the values are not important, you can set them as you want). You will use them in step 5 below.
5. Use another Data Flow Task in which you will have:
   a. on the left side, an OLE DB Source that extracts data from the view, ordered by PK (do not forget to set SortKeyPosition in the Advanced Editor for the PK fields)
   b. on the right side, an OLE DB Source that extracts data from Dest, ordered by PK (do not forget to set SortKeyPosition in the Advanced Editor for the PK fields)
   c. a LEFT JOIN between the two (rows with no match in Dest are inserts, matched rows are updates)
   d. on the left side ("insert side"), a Derived Column whose expression is the OperationIDInsert variable and an OLE DB Destination that inserts the data into the CHG_Dest table. This way, you store the rows that have to be inserted into the destination table, and you can recognize them by the OperationIDInsert value.
   e. on the right side, the same thing but using the OperationIDUpdate variable
6. Use an Execute SQL Task in the Control Flow to run a SQL MERGE. Based on the PK fields and the OperationIDInsert/OperationIDUpdate values, it will either insert the data or update it.
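The change-detection step in the second data flow can be sketched in plain Python. This is only an illustration under stated assumptions: both inputs are already sorted by key (as the sorted OLE DB sources would be), rows are simple (key, value) tuples, and the names classify_changes, OPERATION_ID_INSERT and OPERATION_ID_UPDATE are hypothetical stand-ins for the package variables.

```python
OPERATION_ID_INSERT = 1  # stand-in for the OperationIDInsert variable
OPERATION_ID_UPDATE = 2  # stand-in for the OperationIDUpdate variable


def classify_changes(source_rows, dest_keys):
    """Left merge join of two inputs that are both sorted by key.

    source_rows: iterable of (key, value) tuples sorted by key.
    dest_keys:   iterable of keys already present in Dest, sorted.
    Returns a list of (key, value, operation_id) rows for CHG_Dest.
    """
    changes = []
    dest_iter = iter(dest_keys)
    current = next(dest_iter, None)
    for key, value in source_rows:
        # Advance the destination side until it catches up with the source.
        while current is not None and current < key:
            current = next(dest_iter, None)
        if current is not None and current == key:
            # Key exists in Dest: the row is an update.
            changes.append((key, value, OPERATION_ID_UPDATE))
        else:
            # No match on the right side of the join: the row is an insert.
            changes.append((key, value, OPERATION_ID_INSERT))
    return changes
```

Because both sides are consumed in key order, each input is read only once, which is the reason the sources must be sorted before a merge join.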
Hope this will help you. Let me know if you need additional info.
I have a simple data flow in SSIS (defined in Visual Studio 2013) that uses SQL to extract data from table A on one SQL Server instance and add it to table B on another SQL Server instance.
What is the best-practice pattern for truncating the data in table B? A truncate statement like this:
TRUNCATE TABLE B
run after the select statement for table A? I'm asking especially for the case where you have a fairly big table to 'transmit'.
One thing I have done in cases like that is to create two copies of the same table and then a view that points to one or the other that has the name of the current table.
The SSIS package then determines which table is in use and sets the connection for the table to populate to the other table.
Then an Execute SQL Task truncates the table not currently in use. You may also want to drop any indexes at this point.
Then a dataflow populates the table not currently in use.
Then recreate any indexes you dropped.
Finally, an Execute SQL Task drops and recreates the view so that it uses the table you just populated instead of the other one.
Total down time of the table being referenced? Generally less than a second for the drop and create view no matter how long it takes to populate the table.
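The two-table-plus-view pattern above can be sketched in Python with an in-memory SQLite database. This is a hedged illustration, not the answerer's package: the table names B_1 and B_2, the columns id and payload, and the function refresh_with_view_swap are hypothetical, SQLite's DELETE stands in for TRUNCATE TABLE, and the sketch detects the active table by reading the view definition, which is just one possible way to do it.

```python
import sqlite3


def refresh_with_view_swap(conn, rows):
    """Load rows into the idle copy of the table, then repoint view B.

    Returns the name of the table the view points to afterwards.
    """
    cur = conn.cursor()
    # Determine which physical table the view currently exposes.
    view_sql = cur.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'view' AND name = 'B'"
    ).fetchone()[0]
    active = "B_1" if "B_1" in view_sql else "B_2"
    idle = "B_2" if active == "B_1" else "B_1"
    # Empty the idle table (SQLite has no TRUNCATE, so DELETE is used).
    cur.execute(f"DELETE FROM {idle}")
    # Stand-in for the data flow: readers still see the active table.
    cur.executemany(f"INSERT INTO {idle} (id, payload) VALUES (?, ?)", rows)
    # The only moment readers are affected: drop and recreate the view.
    cur.execute("DROP VIEW B")
    cur.execute(f"CREATE VIEW B AS SELECT id, payload FROM {idle}")
    conn.commit()
    return idle
```

However long the load takes, queries against B keep returning the previous data until the final drop and create.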
I have 2 databases (A) with the same name on different servers (B & C). Both databases have the same schema (SQL Server 2008 R2).
Task 1: Copy (transfer) both databases to a 3rd server (D) with the names A_B and A_C.
Task 2: Merge both databases into one database (A_D). (I don't know how I will handle keys.)
Task 3: On a daily basis I have to get data from servers B & C and put it on the centralized server D.
Any help would be appreciated.
Thanks.
Ritesh
Here are a few ideas:
Task 1: Transfer the databases by doing a backup and restore to server D.
Task 2: I think this will involve ETL processes and creating new surrogate keys in database A_D. Keep keys from original source in a data source id column. I think a MERGE statement would be helpful.
Task 3: Leverage logic in Task 2
Update for Task 2:
Say a source table Table1 in both source databases has a key column named Table1_ID. In database A_D, add the columns Table1_SourceID and Table1_Source. Populate Table1_SourceID with the key from the source database, and use Table1_Source to indicate the source database.
Table1_ID is then the key for Table1 and is unique within database A_D. This accounts for key collisions between the source databases, and you can still trace each row back to its source database.
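To make the key handling concrete, here is a small Python sketch of the consolidation: every incoming row gets a new surrogate Table1_ID, while the original key and the source name are kept alongside it. The function name consolidate, the Name column, and the in-memory list representation are assumptions for illustration only.

```python
from itertools import count


def consolidate(sources):
    """Merge rows from several sources under new surrogate keys.

    sources: dict mapping a source name (e.g. "A_B") to a list of
             (source_id, name) rows from that source's Table1.
    Returns a list of dicts shaped like rows of Table1 in A_D.
    """
    next_id = count(1)  # new surrogate key sequence for A_D
    merged = []
    for source_name, rows in sources.items():
        for source_id, name in rows:
            merged.append(
                {
                    "Table1_ID": next(next_id),    # unique within A_D
                    "Table1_SourceID": source_id,  # key in the source db
                    "Table1_Source": source_name,  # which source db
                    "Name": name,
                }
            )
    return merged
```

The pair (Table1_Source, Table1_SourceID) stays unique even when both sources use the same key values, so it can serve as the match key for the daily loads in Task 3.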
Task 1: Create destination Databases with no structures. I'd use tasks -> export function on the source databases with create structures option in SSMS. After export you will have exact copies in destination.
Task 2: In each table of A_D create a new key column (SurKey). It has to be a combination of values which will give unique values in the whole table. E.g. source table abbreviation + PK column + date.
For each table create two Data Flows in SSIS Package, which will load data from A_B and A_C. Put a Derived Column component, which will add a new column - SurKey.
In A_B DataFlow put the A_B as an abbreviation, A_C in the second one.
Task 3: Use Data Flows you created. Script a Job in SSMS, add it to the daily plan.
I am not a DBA but I do work for a small company as the IT person. I have to replicate a database from staging to production. I have created an SSIS package to do this but it takes hours to run. This isn't a large data warehouse type of project, either, it's a pretty straightforward Upsert. I'm assuming that I am the weak link in how I designed it.
Here's my procedure:
Truncate staging tables (EXECUTE SQL TASK)
Pull data from a development table into staging (Data Flow Task)
Run a data flow task
OLE DB Source
Conditional Split Transformation (Condition used: [!]ISNULL(is_new_flag))
If new insert, if existing update
The data flow task is repeated a few times to change tables/values, but the flow is the same. I've read several things about OLE DB components being slow for updates and have tried a few things, but haven't gotten it to run very quickly.
I'm not sure what other details to give, but I can give anything that's asked for.
Sample package using SSIS 2008 R2 that inserts or updates using batch operation:
Here is a sample package written in SSIS 2008 R2 that illustrates how to perform insert, update between two databases using batch operations.
Using OLE DB Command will slow down the update operations on your package because it does not perform batch operations. Every row is updated individually.
The sample uses two databases, namely Source and Destination. In my example, both databases reside on the same server, but the logic can still be applied to databases residing on different servers and locations.
I created a table named dbo.SourceTable in my source database Source.
CREATE TABLE [dbo].[SourceTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL,
[IsActive] [bit] NULL
)
Also, created two tables named dbo.DestinationTable and dbo.StagingTable in my destination database Destination.
CREATE TABLE [dbo].[DestinationTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL
)
GO
CREATE TABLE [dbo].[StagingTable](
[RowNumber] [bigint] NOT NULL,
[CreatedOn] [datetime] NOT NULL,
[ModifiedOn] [datetime] NOT NULL
)
GO
Inserted about 1.4 million rows in the table dbo.SourceTable with unique values into RowNumber column. The tables dbo.DestinationTable and dbo.StagingTable were empty to begin with. All the rows in the table dbo.SourceTable have the flag IsActive set to false.
Created an SSIS package with two OLE DB connection managers, each connecting to Source and Destination databases. Designed the Control Flow as shown below:
First Execute SQL Task executes the statement TRUNCATE TABLE dbo.StagingTable against the destination database to truncate the staging tables.
Next section explains how the Data Flow Task is configured.
Second Execute SQL Task executes the below given SQL statement that updates data in dbo.DestinationTable using the data available in dbo.StagingTable, assuming that there is a unique key that matches between those two tables. In this case, the unique key is the column RowNumber.
Script to update:
UPDATE D
SET D.CreatedOn = S.CreatedOn
, D.ModifiedOn = S.ModifiedOn
FROM dbo.DestinationTable D
INNER JOIN dbo.StagingTable S
ON D.RowNumber = S.RowNumber
I have designed the Data Flow Task as shown below.
OLE DB Source reads data from dbo.SourceTable using the SQL command SELECT RowNumber,CreatedOn, ModifiedOn FROM Source.dbo.SourceTable WHERE IsActive = 1
Lookup transformation is used to check if the RowNumber value already exists in the table dbo.DestinationTable
If the record does not exist, it will be redirected to the OLE DB Destination named as Insert into destination table, which inserts the row into dbo.DestinationTable
If the record exists, it will be redirected to the OLE DB Destination named as Insert into staging table, which inserts the row into dbo.StagingTable. This data in staging table will be used in the second Execute SQL Task to perform batch update.
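The flow described above (truncate staging, look up each source row, route it to the destination or to staging, then run one batch update) can be sketched in Python with SQLite. This is an illustrative stand-in rather than the SSIS package itself: the function name run_package is hypothetical, the lookup is modelled as an in-memory set of keys, and SQLite's DELETE replaces TRUNCATE TABLE.

```python
import sqlite3


def run_package(conn, source_rows):
    """One package run; source_rows are (RowNumber, CreatedOn, ModifiedOn).

    Returns (rows inserted into the destination, rows sent to staging).
    """
    cur = conn.cursor()
    # First Execute SQL Task: empty the staging table.
    cur.execute("DELETE FROM StagingTable")
    # Lookup transformation: which RowNumber values already exist?
    existing = {
        row[0] for row in cur.execute("SELECT RowNumber FROM DestinationTable")
    }
    new_rows = [r for r in source_rows if r[0] not in existing]
    matched = [r for r in source_rows if r[0] in existing]
    # No-match output: insert straight into the destination table.
    cur.executemany("INSERT INTO DestinationTable VALUES (?, ?, ?)", new_rows)
    # Match output: park the rows in staging for the batch update.
    cur.executemany("INSERT INTO StagingTable VALUES (?, ?, ?)", matched)
    # Second Execute SQL Task: one set-based update instead of row by row.
    cur.execute(
        """
        UPDATE DestinationTable
        SET CreatedOn = (SELECT S.CreatedOn FROM StagingTable AS S
                         WHERE S.RowNumber = DestinationTable.RowNumber),
            ModifiedOn = (SELECT S.ModifiedOn FROM StagingTable AS S
                          WHERE S.RowNumber = DestinationTable.RowNumber)
        WHERE RowNumber IN (SELECT RowNumber FROM StagingTable)
        """
    )
    conn.commit()
    return len(new_rows), len(matched)
```

On a first run against an empty destination every row takes the insert path; on later runs the rows that already exist go through staging and are updated in a single statement.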
To activate a few rows for the OLE DB Source, I ran the query below:
UPDATE dbo.SourceTable
SET IsActive = 1
WHERE (RowNumber % 9 = 1)
OR (RowNumber % 9 = 2)
First execution of the package looked as shown below. All the rows were directed to destination table because it was empty. The execution of the package on my machine took about 3 seconds.
Ran the row count query to find the row counts in all three tables.
To activate a few more rows for the OLE DB Source, I ran the query below:
UPDATE dbo.SourceTable
SET IsActive = 1
WHERE (RowNumber % 9 = 3)
OR (RowNumber % 9 = 5)
OR (RowNumber % 9 = 6)
OR (RowNumber % 9 = 7)
Second execution of the package looked as shown below. 314,268 rows that were previously inserted during first execution were redirected to staging table. 628,766 new rows were directly inserted into the destination table. The execution of the package on my machine took about 12 seconds. 314,268 rows in destination table were updated in the second Execute SQL Task with the data using staging table.
Ran the row count query again to find the row counts in all three tables.
I hope that gives you an idea to implement your solution.
The two things I'd look at are your inserts (ensure you are using either the "Table or View - fast load" or "Table name or view name variable - fast load") and your updates.
As you have correctly determined, the update logic is usually where performance falls down and that is due to the OLE DB component firing singleton updates for each row flowing through it. The usual approach people take to overcome this is to write all the updates to a staging table, much as your Insert logic does. Then follow up your Data Flow Task with an Execute SQL Task to perform a bulk Update.
If you are open to acquiring third-party tools, PragmaticWorks offers an Upsert destination.
I have finally reached the data migration part of my project and am now trying to move data from MySQL to SQL Server.
SQL Server has a new schema (the mapping is not always one to one).
I am trying to use SSIS for the conversion, which I started learning this morning.
We have customer and customer location tables in MySQL and equivalent tables in SQL Server. In SQL Server all my tables now have a surrogate key column (GUID), and I am generating it in a Script Component.
Also note that I do have a primary key in the current MySQL tables.
What I am looking for is how to add child records to the customer location table with the newly created GUID as the parent key.
I see that SSIS has a Foreach Loop container. Is this of any use here?
If not, another possibility I can think of is to create two Data Flow Tasks and [somehow], just before the master data is sent to the Destination component [table] in the primary Data Flow Task, add a variable with the newly created GUID and another with the old PrimaryID, which will be used to build the source for the child records' Data Flow Task.
Maybe, to simplify, this can also be done once the Data Flow Task for the master is complete: the Data Flow Task for the child then reads this master data and inserts the child records from MySQL into the SQL Server table. This would mean, though, that I have to load all my parent table records back into memory.
I know this is all too confusing, mainly because I am very confused :-(, so bear with me, and if you want more information let me know.
I have been through many links that I found through Google search, but none of them really explains (or I was not able to understand) how the process is carried out.
Please advise
regards,
Mar
**Edit 1**
After further searching and refining keywords, I found this link on SO and am going through it to see if it can be used in my scenario:
How to load parent child data found in EDI 823 lockbox file using SSIS?
OK, here is what I would do. Put the MySQL data into staging tables in SQL Server that have identity columns set up and an extra column for the eventual GUID, which will start out as null. Now your records have a primary key.
Next comes the sneaky trick. Pick a required field (we use last_name) and, instead of the real data, insert the value from the id field in the staging table. Now you have a record that has both the GUID and the id in it. Update the GUID field in the staging table by joining to it on the ID and the required field you picked. Now update the last_name field with the real data.
To avoid the sneaky trick, and if this is only a one-time upload, add a column to your tables that contains the staging table id. Again, you can use this to get the GUID for inserting into related tables. Then, when you are done, drop the extra column.
You are aware that there are performance issues involved with using GUIDs? Make sure not to make them the clustered index (as the PK they will be by default unless you specify differently), and use newsequentialid() to populate them. Why are you using GUIDs? If an identity would work, it is usually better to use it.
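The "keep the old id next to the new GUID" idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the rows are plain tuples, the column names CustomerGuid, LegacyId, LocationGuid and Address and the function migrate_parent_child are hypothetical, and uuid.uuid4() produces random GUIDs rather than the sequential ones that newsequentialid() would give on SQL Server.

```python
import uuid


def migrate_parent_child(customers, locations):
    """Assign GUIDs to parents and resolve child rows through the old id.

    customers: list of (old_id, name) rows from the legacy parent table.
    locations: list of (old_customer_id, address) rows from the child table.
    Returns (new_customers, new_locations) as lists of dicts.
    """
    guid_by_old_id = {}
    new_customers = []
    for old_id, name in customers:
        guid = str(uuid.uuid4())  # random GUID; not sequential
        guid_by_old_id[old_id] = guid
        # LegacyId plays the role of the temporary "staging id" column.
        new_customers.append(
            {"CustomerGuid": guid, "LegacyId": old_id, "Name": name}
        )
    new_locations = [
        {
            "LocationGuid": str(uuid.uuid4()),
            # Look up the parent's new GUID through its old primary key.
            "CustomerGuid": guid_by_old_id[old_customer_id],
            "Address": address,
        }
        for old_customer_id, address in locations
    ]
    return new_customers, new_locations
```

In the database version the dictionary is replaced by a join from the child staging table to the parent table on the legacy id column, which can be dropped once the load is finished.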