I have an SSIS application that needs to get data from 2 databases of different servers (not link). I need to get the match names and DOB records between 2 database then use the results to insert/update a table.
My initial approach is to use OLE DB source then Merge Join and put the results to recordset. Then on controlflow, use the results of the recordset to insert/update a table. But I can't see the recordset at the control flow.
Alternative solution is to create temp tables. But the temp tables are not visible since they reside at the tempdb database of each servers.
What is a better approach for this problem?
what do you mean by put the results to recordset?
If you join two sources on the data flow using a join, that "recordset" on the join will only be available during the current dataflow. You cant use it on the control flow after the data flow is finisehd.
why cant you just insert the resultset on the destination DB? You can perform any other transform operation on the same data flow and insert the result on the destination database.
Or, if you really need to do something that can only be done on the control flow before insert the data, you can yes, insert the recordset on a temp table on the destination using a oleDBDestination and access in on another dataflow (not a very good approach, though)
In this case, I would keep a database around for work table or create a schema for those work tables.
Next, add a SQL control flow task that truncates the table that will hold the intermediate result. After this, load the intermediate result set into the table, do the operation and optionally, truncate the table again.
The recordset destination is fine for smaller datasets. But if you plan to use it for larger datasets that dont fit memory it will be very slow.
If you dont have a database/schema that can serve as a workspace, you could use RAW files to hold the intermediate result. Those are very fast too.
Related
I have a table in database 1 with columns x and y. I have another table in database 2 with columns x and y. I want to update all the y columns in database 1 to the y columns in database 2 where the x columns in database 1 match the x columns in database 2.
This seems like an unbelievably trivial task, but I can't figure out how to do it in SSIS. I have an OLE DB Source and Destination in my data flow task and I have the 2 columns mapped, but it keeps trying to insert instead of update, and it fails because there are a bunch of other non-nullable columns in the destination that I don't have mapped.
The problem with using SSIS to do data transformation is that both the source and target data sets need to be pulled up into memory on the ETL server, the transformation needs to happen there, and then the results have to be written back down to the destination server.
It's network intensive. It's memory intensive. It's just less than ideal. That's also why you're having trouble figuring it out. On a server, it's just an UPDATE statement, but getting it up into SSIS requires many more steps than just that, and absent third party tools, there's no out of the box method to do anything other than row by row updates.
In your situation, where your source data is comparatively lightweight, I would suggest that the most efficient approach would be to use SSIS to move the source data from the source server to the target server and drop it into a working/holding/intermediate table. SSIS is absolutely awesome at moving data from point A to point B. Then, after the Data Flow, use an Execute SQL task to either call an UPDATE stored procedure, or go ahead and write the UPDATE statement in the package.
Doing it that way off-loads the DML from the ETL server to the SQL Server, which is designed for exactly that kind of work. Sort of a "let everybody do what they're good at" approach, if you will.
OK so rather than trying to directly map the data from DB2 into DB1, it's probably a better idea to stage the data from DB2 into DB1 and then update the table of interest in DB1.
The best way to do this is to create a new table in DB1 that stores all of the data from the table in DB2. Let's call this table 'staging'. Use SSIS to do a flat insert from the table in DB2 to your new 'staging' table in DB1 and then create an UPDATE stored procedure in your DB1 database to update existing entries in your endpoint table based on the entries you now have in your 'staging' table. You can trigger the SP from SSIS to run after your staging table is populated. You can cut out the 'staging' table here if you have a synonym from DB1 that references the table in DB2.
SSIS is more about bulk movement of data than it is updating what already exists. Use stored procedures for anything in between.
Imagine that you want to save in a variable the number of rows the were updated or deleted in a table.
This is the steps that i did:
First, in the Control flow i created a Data Flow Task.
Them, in the Data Flow, i created a source(in my case is a excel file), then i proceeded to create two variables to count those rows- countDeleted and countUpdated, then connected the variables to two row count transformations, and them connected my destination (OLE DB).
Now in the control flow, what do i do??
Create a SQL execute task?? or a Script task?? What is the best way to do it?? What is the piece of code to use??
Thanks for youy help.
PS: i only have 4 weeks off SSIS, sorry for my noobieness :)
An OLD DB destination only inserts. It can't UPDATE or DELETE
What's your logic for updating or deleting?
If you're just starting out and reading about doing things in SSIS you will eventually find advice to use the OLE DB Command to perform row by row delete and inserts.
In my opinion this is to be avoided. It does not scale (works fine for small recorsets then fails for large recordsets), and it is difficult to maintain parameter mappings in the OLE DB Command. Although you should try it anyway to familiarise yourself with it.
My advice is to load the Excel data into a staging table, perform batch DELETE and UPDATE statements to load the data and use ##ROWCOUNT to capture the records updated.
For example;
Your existing described dataflow can be used to load into a table called StagingTable
Before your dataflow you should run an Execute SQL Task (This is in the Control Flow pane, not the Data Flow pane) that clears the staging table:
TRUNCATE TABLE StagingTable;
So first get that working - repeatedly running your package clears the staging table then loads Excel into it without creating duplicates
This in itself is a challenge as Excel is a terrible data interchange format.
Once you have that working, you add an execute SQL task to the end that runs some SQL that deletes the records you want and captures the count. For example:
DELETE FROM MyFinalTable WHERE PriamryKey IN (SELECT PrimaryKey FROM StagingTable);
SELECT ##ROWCOUNT;
Then you follow the instructions here to load that back to your SSIS variable
http://microsoft-ssis.blogspot.com/2011/03/rowcount-for-execute-sql-statement.html
What are you doing with this row count? Are you writing it to a logging table? Save
yourself the bother of pulling it back into an SSIS variable and just write it directly:
DELETE FROM MyFinalTable WHERE PriamryKey IN (SELECT PrimaryKey FROM StagingTable);
INSERT INTO LogTable(Table,Operation,Type)
SELECT 'MyFinalTable','Delete', ##ROWCOUNT;
In my experience it is not a good idea to build convoluted logic into SSIS packages if you can instead do in a database. Although it does depend on the person who has to eventually maintain it. Hopefully you can appreciate that this T-SQL approach is a more straightforward code based approach as opposed to having to dig around in property pages and events and other places inside SSIS packages.
I assume that you're using an Execute SQL Task for the updates and deletes? As #Nick.McDermaid mentioned, using an OLE DB Command within a Data Flow presents various issues when performing DML. You can find the number of rows updated, inserted, or deleted in a table through an Execute SQL Task by using the ExecValueVariable property of this task. Set the variable that will hold the row count to this property and it will return the number of affected rows. Note that is will only return the number of rows impacted by the last statement in the Execute SQL Task, regardless of batches (i.e. GO separators) are in the component.
I am writing a basic file dump from one database to another. I am using SSIS 2008 and creating several packages to transform the data I have from a MSSQL 2010 database to a MYSQL 5.1 database.
All the connections are set up and records can be tranfered between the two databases but I would like to use temp tables in the transform processes and use the temp table as the MSSQL source in a dataflow task to dump the table in an awaiting MYSQL table.
I have been having problems setting this up. I am using an OLEDB connection and have set the RetainSameConnection property as well as the DelayValidation property to true. When setting up the source figure as the source from the MSSQL database I cannot find the temp table I have created in an earlier task from the control flow. I am using the same connection manager for these two tasks.
Anyone have any ideas or experience with this?
As a simple example one task does..
SELECT *
INTO #TMP
FROM CUSTOMERS
(This is a simplified example and I relize in this case I could just use the Customers table so bear with me)
Is it possible to use this temp table in a dataflow operation as the source table?
As I mentioned in my comment, not much of a solution and more of a workaround. SSIS uses the shape of result sets to bind properties in tasks. As temp tables are not always available in the database this can cause errors in SSIS even if you set DelayValidation to true.
My solution is to create an SSIS schema in whichever database you're connecting to. The reasons for doing so are security and clear separation of objects that are only used within SSIS packages - primarily staging tables.
Instead of throwing tables in your dbo schema (you shouldn't be anyway, shame on you) you'd create them in the SSIS schema. A typical data flow would truncate the table when it begins, load values and perform whatever operations are required, optionally truncating it when complete. As long as the table is always available SSIS can examine the shape of result sets.
You should not use temp tables as the source as it will not recognize the columns for the select. use table variables or CTEs instead.
I would like to bring in an XML source and do data conversion and update it in a table. Data from this table will be used to update another table. How to accomplish this in SSIS?
I understand the first two steps. But lost after that.
XML Source (under dataflow task)
Data Conversion
OLE DB Destination? (If I use OLE DB Destination, then I cannot use that as a source again to update another table). What component should I be using to accomplish this?
TIA
Within a dataflow you can split the records to go to multiple tables using either a conditional split (if you want some records to go one way and some to go another way) or a mulicast task if you want all records to go to both destinations. We use a multicast to create two staging tables, one where the raw data from the file will stay and one where the data will be cleaned and transformed before going into our prod tables. This enables us to easily research if some problem data that came in was due to our transformation process (a bug) or bad data being sent (a problem at the client end, but which might require more steps to handle if they can't fix).
You can also have multiple data flows that all have the same source. Or you can insert to one staging table and then have a second data flow or exec SQL task to move that data to where you want it.
Use the OLE DB Destination to inject your XML source data into your staging table. Then, in your control flow use an Execute SQL task after your data flow task to execute a stored procedure or T-SQL script to move your data from the staging table into the production table(s) and truncate the staging table if required.
I've found that SSIS is great for ETL work, but moving data around inside a DB or aggregation work is best carried out using T-SQL in stored procs. Easier to write, control and you know you're not going to have any RBAR shenanigans you can happen upon in a DFT.
YMMV
I'm fairly new to SSIS,
I'm importing from an XLS spreadsheet into a database table. Along the way I want to select a record from a table, but it is NOT a lookup, ie: a straight SELECT with no join from input source. Then I want to merge this along with the other rows from the XLS.
What is the best way to do this? Variables? OLE DB commands?
Thanks
You could use an OLE DB command but the important thing to remember about this is that it is fired on a per-row basis and could potentially be slow. You can still use a lookup for this purpose, but make sure that you use set the error output to ignore lookup errors for the cases when the lookup transformation does not contain an value for the match you are looking for.
You could also use a merge transformation with an outer join condition rather than an inner join.
If the record that you are retrieving from the database table is not dependent on the data within the row from the spreadsheet then it will probably be the same for each row - is that what you are hoping for?
In this case, I would consider using an Execute SQL Task in the Control Flow to retrieve the record and save it to a variable. You can use a Script Component in the Data Flow to copy the values in the record from the variable to the appropriate fields in each row. This will mean that the lookup data is retrieved only once and not once per row which is slow as jn29098 said above.
If the target for your Data Flow is the same database as the one from which you are extracting the 'lookup' record then you could also consider using an Execute SQL Task (in the Control Flow) to add the lookup values once the spreadsheet data has arrived in the database (once the Data Flow has completed). This would be much more efficient.