First, I am newer to SSIS. I've been working with it for a while, but not very often and only for very basic simple things.
I have an SSIS package set up to load about 30 tables from a "live" data source. This has been working fine and we add new tables and update existing tables all the time. We use these tables to query for reports. This was not a planned out project, it was/is being done on the fly.
Then I was told I need to create "point in time" tables. Basically, duplicate tables from above need to be loaded 3 times every year: once in the beginning, middle and end of year. The data cannot change, but just like the "live" tables, it will be a work in progress with new fields/tables being added. I was under the gun to get it done now (along with a bunch of other stuff of course) and I did it in a way that "works for now", but cannot obviously stay this way. Here is basically what I did:
I created synonyms and revised all the stored procedures to use the synonyms
I set up an SSIS package with the following steps:
Execute SQL Task - Truncate tables
Execute SQL Task - Point synonyms to beginning of the year data source
Data Flow Task - Load tables
Execute SQL Task - Point synonyms to middle of the year data source
Data Flow Task - Load tables
Execute SQL Task - Point synonyms to end of the year data source
Data Flow Task - Load tables
Obviously it cannot stay this way as I now have 4 copies of the same Data Flow Tasks. What a nightmare if I need to add fields or a new table! At least I am able to use the same stored procedures though. Can someone please help me with an idea for a better set up?
Related
Imagine that you want to save in a variable the number of rows the were updated or deleted in a table.
This is the steps that i did:
First, in the Control flow i created a Data Flow Task.
Them, in the Data Flow, i created a source(in my case is a excel file), then i proceeded to create two variables to count those rows- countDeleted and countUpdated, then connected the variables to two row count transformations, and them connected my destination (OLE DB).
Now in the control flow, what do i do??
Create a SQL execute task?? or a Script task?? What is the best way to do it?? What is the piece of code to use??
Thanks for youy help.
PS: i only have 4 weeks off SSIS, sorry for my noobieness :)
An OLD DB destination only inserts. It can't UPDATE or DELETE
What's your logic for updating or deleting?
If you're just starting out and reading about doing things in SSIS you will eventually find advice to use the OLE DB Command to perform row by row delete and inserts.
In my opinion this is to be avoided. It does not scale (works fine for small recorsets then fails for large recordsets), and it is difficult to maintain parameter mappings in the OLE DB Command. Although you should try it anyway to familiarise yourself with it.
My advice is to load the Excel data into a staging table, perform batch DELETE and UPDATE statements to load the data and use ##ROWCOUNT to capture the records updated.
For example;
Your existing described dataflow can be used to load into a table called StagingTable
Before your dataflow you should run an Execute SQL Task (This is in the Control Flow pane, not the Data Flow pane) that clears the staging table:
TRUNCATE TABLE StagingTable;
So first get that working - repeatedly running your package clears the staging table then loads Excel into it without creating duplicates
This in itself is a challenge as Excel is a terrible data interchange format.
Once you have that working, you add an execute SQL task to the end that runs some SQL that deletes the records you want and captures the count. For example:
DELETE FROM MyFinalTable WHERE PriamryKey IN (SELECT PrimaryKey FROM StagingTable);
SELECT ##ROWCOUNT;
Then you follow the instructions here to load that back to your SSIS variable
http://microsoft-ssis.blogspot.com/2011/03/rowcount-for-execute-sql-statement.html
What are you doing with this row count? Are you writing it to a logging table? Save
yourself the bother of pulling it back into an SSIS variable and just write it directly:
DELETE FROM MyFinalTable WHERE PriamryKey IN (SELECT PrimaryKey FROM StagingTable);
INSERT INTO LogTable(Table,Operation,Type)
SELECT 'MyFinalTable','Delete', ##ROWCOUNT;
In my experience it is not a good idea to build convoluted logic into SSIS packages if you can instead do in a database. Although it does depend on the person who has to eventually maintain it. Hopefully you can appreciate that this T-SQL approach is a more straightforward code based approach as opposed to having to dig around in property pages and events and other places inside SSIS packages.
I assume that you're using an Execute SQL Task for the updates and deletes? As #Nick.McDermaid mentioned, using an OLE DB Command within a Data Flow presents various issues when performing DML. You can find the number of rows updated, inserted, or deleted in a table through an Execute SQL Task by using the ExecValueVariable property of this task. Set the variable that will hold the row count to this property and it will return the number of affected rows. Note that is will only return the number of rows impacted by the last statement in the Execute SQL Task, regardless of batches (i.e. GO separators) are in the component.
I have created many SSIS packages in the past, though the need for this one is a bit different than the others which I have written.
Here's the quick description of the business need:
We have a small database on our end sourced from a 3rd party vendor, and this needs to be overwritten nightly.
The source of this data is a bunch of flat files (CSV) from the 3rd party vendor.
Current setup: we truncate the tables of this database, and we then insert the new data from the files, all via SSIS.
Problem: There are times when the files fail to come, and what happens is that we truncate the old data, though we don't have the fresh data set. This leaves us without a database where we would prefer to have yesterday's data over no data at all.
Desired Solution: I would like some sort of mechanism to see if the new data truly exists (these files) prior to truncating our current data.
What I have tried: I tried to capture the data from the files and add them to an ADO recordset and only proceeding if this part was successful. This doesn't seem to work for me, as I have all the data capture activities in one data flow and I don't see a way for me to reuse that data. It would seem wasteful of resources for me to do that and let the in-memory tables just sit there.
What have you done in a similar situation?
If files are not present update some flags like IsFile1Found to false and pass these flags to stored procedure which truncates on conditional basis.
If file is empty then Using powershell through Execute Process Task you can extract first two rows if there are two rows (header + data row) then it means data file is not empty. Then you can truncate the table and import the data.
other approach could be
you can load data into some staging table and from these staging table insert data to the destination table using SQL stored procedure and truncate these staging tables after data is moved to all the destination table. In this way before truncating destination table you can check if staging tables are empty or not.
I looked around and found that some others were struggling with the same issue, though none of them had a very elegant solution, nor do I.
What I ended up doing was to create a flat file connection to each file of interest and have a task count records and save to a variable. If a file isn't there, the package fails and you can stop execution at that point. There are some of these files whose actual count is interesting to me, though for the most part, I don't care. If you don't care what the counts are, you can keep recycling the same variable; this will reduce the creation of variables on your end (I needed 31). In order to preserve resources (read: reduce package execution time), I excluded all but one of the columns in each data source; it made a tremendous difference.
Process - first off, i would like to just use a program to enter data into the SQL table but due to certain ... issues ... this is not an option.
Excel sheet has set column headers - they do not change - excel data for columns does change. new records are added and other are removed daily. i have a project in SSDT 2015:
clear temp table - delete from tbldatemp
data flow task
a. source: excel sheet - access mode: table or view
b. Data conversion 0-0 - some conversions necessary for sql to accept data
c. destination - tbldatemp
update main tblda - insert into where not exists a.[blah] = b.[blah] etc etc etc
so every day the data in excel changes. i have a vb.net program that executes the update package with the click of a button - and every day it doesn't work. so i open up the project in SSDT and see a nice yellow ! triangle on my data flow task stating that columns are out of synchronization. when i open up the data flow task, the ! is now on the excel source stating columns are out of synchronization.
i have looked for DAYS trying to figure out how to fix this or get around it and cannot figure it out. please help!!!! Thank you
I have a package that is essentially trying to copy 26 tables from oracle to sql server.
Its not a complete table copy, we are looking for records that belong to certain 'Regions' of our company.
I pull the data from oracle
I started just doing this with elbow grease, but each of the 26 tables required several variable to do the deletes, the fetches etc.
Long story short, I decided to use variables to represent the table names (source, temp and target).
This allowed me to copy/paste one sequence and effectively bypass a lot of click click in bids.
The problem I am running into is that the meta data seems to be very fragile. Sequences all seem to run on their owwn, but when i run the whole package, it breaks. And never in the same place.
Is this approach just a bad idea w/ SSIS?
So just to take this off the board....
Each sequence container had the following ops
Script task - set variables
Execute SQL task - delete from temp
data flow SourceToTemp -
ole db source - used a generic select * from tbl to temp_tbl
derived column1 - insert a timestamp column
oledb destination - map all the columns into a temp table (**THIS IS THE BIG PROBLEM CHILD)
Execute SQL task - delete from target
Execute SQL task - insert target select from temp
the oleDB destination is the piece that kept breaking.
Since it references variables, I had to be very careful at design time to set the variables correctly before opening one of the data flows.
I am pretty sure this is the problem. Since I can not say w/ certainty when SSIS refreshes meta data in the design environment, I cant be sure if/when sequence X refreshed while the variables were set to support sequence Y.
So while it conceptually should work at run time, dev time is a change control night mare.
I have changed all the oleDB destinations to point to a hard table name. this is really a small concession since there are 4 sql statements that are still driven by variables. (saving me a lot of clicking and typing)
This small change has eliminated the 'shifting sands' problem.
Take-a-way lesson : dont have an oledb destination be basesd on a variable.
thanks for the comments
I would like to bring in an XML source and do data conversion and update it in a table. Data from this table will be used to update another table. How to accomplish this in SSIS?
I understand the first two steps. But lost after that.
XML Source (under dataflow task)
Data Conversion
OLE DB Destination? (If I use OLE DB Destination, then I cannot use that as a source again to update another table). What component should I be using to accomplish this?
TIA
Within a dataflow you can split the records to go to multiple tables using either a conditional split (if you want some records to go one way and some to go another way) or a mulicast task if you want all records to go to both destinations. We use a multicast to create two staging tables, one where the raw data from the file will stay and one where the data will be cleaned and transformed before going into our prod tables. This enables us to easily research if some problem data that came in was due to our transformation process (a bug) or bad data being sent (a problem at the client end, but which might require more steps to handle if they can't fix).
You can also have multiple data flows that all have the same source. Or you can insert to one staging table and then have a second data flow or exec SQL task to move that data to where you want it.
Use the OLE DB Destination to inject your XML source data into your staging table. Then, in your control flow use an Execute SQL task after your data flow task to execute a stored procedure or T-SQL script to move your data from the staging table into the production table(s) and truncate the staging table if required.
I've found that SSIS is great for ETL work, but moving data around inside a DB or aggregation work is best carried out using T-SQL in stored procs. Easier to write, control and you know you're not going to have any RBAR shenanigans you can happen upon in a DFT.
YMMV