I have SSIS packages that extract fact tables into staging tables. I have a control table which contains the last extract date for each table, so the packages extract rows where the date is greater than the control table date. The problem I have is that I want to redirect rows with errors to an error file in the data flow task of the package. If I do that, the package will not fail (so I can't roll back) and some rows might actually go through, which, if I continue with the process, will ultimately reach my fact table. Now, the next time I run the package, if I had updated the control table I will miss the rows which had errors; if I had not updated the control table with the date, I will re-extract the rows which went through. What is the best practice for this?
How about adding a Row Count transformation onto the error branch? It sounds like you are using the transaction option in SSIS, so put the Data Flow in a Sequence Container and, after the Data Flow, evaluate the value of your row count variable. If it's greater than zero, roll back/abort processing.
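As a rough sketch (the variable name, message and use of an Execute SQL Task are my assumptions, not part of the original answer), that post-Data-Flow check could be an Execute SQL Task whose ? parameter is mapped to the error-branch row count variable, failing the task so the surrounding transaction rolls back:
-- ? is parameter-mapped to e.g. User::ErrorRowCount on the task's Parameter Mapping page.
-- Raising a severity-16 error fails the task, which aborts the container and rolls back the transaction.
IF ? > 0
    RAISERROR('Rows were redirected to the error output; aborting the load.', 16, 1);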
Imagine that you want to save in a variable the number of rows that were updated or deleted in a table.
These are the steps that I did:
First, in the Control Flow I created a Data Flow Task.
Then, in the Data Flow, I created a source (in my case an Excel file), created two variables to count those rows - countDeleted and countUpdated - connected them to two Row Count transformations, and then connected my destination (OLE DB).
Now, in the Control Flow, what do I do?
Do I create an Execute SQL Task or a Script Task? What is the best way to do it? What is the piece of code to use?
Thanks for your help.
PS: I only have 4 weeks of SSIS experience, sorry for my noobieness :)
An OLE DB destination only inserts. It can't UPDATE or DELETE.
What's your logic for updating or deleting?
If you're just starting out and reading about doing things in SSIS, you will eventually find advice to use the OLE DB Command to perform row-by-row deletes and inserts.
In my opinion this is to be avoided. It does not scale (it works fine for small recordsets, then fails for large ones), and it is difficult to maintain parameter mappings in the OLE DB Command. Although you should try it anyway to familiarise yourself with it.
My advice is to load the Excel data into a staging table, perform batch DELETE and UPDATE statements to load the data, and use @@ROWCOUNT to capture the number of records affected.
For example:
The data flow you've already described can be used to load into a table called StagingTable
Before your dataflow you should run an Execute SQL Task (This is in the Control Flow pane, not the Data Flow pane) that clears the staging table:
TRUNCATE TABLE StagingTable;
So first get that working - repeatedly running your package should clear the staging table and then load Excel into it without creating duplicates.
This in itself is a challenge as Excel is a terrible data interchange format.
Once you have that working, you add an Execute SQL Task to the end that runs some SQL that deletes the records you want and captures the count. For example:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
SELECT @@ROWCOUNT;
Then you follow the instructions here to load that back to your SSIS variable
http://microsoft-ssis.blogspot.com/2011/03/rowcount-for-execute-sql-statement.html
What are you doing with this row count? Are you writing it to a logging table? Save yourself the bother of pulling it back into an SSIS variable and just write it directly:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
INSERT INTO LogTable([Table], Operation, RowsAffected)
SELECT 'MyFinalTable', 'Delete', @@ROWCOUNT;
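The UPDATE half could follow the same pattern; here is a sketch with hypothetical column names:
UPDATE f
SET    f.SomeColumn = s.SomeColumn
FROM   MyFinalTable AS f
       INNER JOIN StagingTable AS s ON s.PrimaryKey = f.PrimaryKey;
-- @@ROWCOUNT still refers to the previous statement, so log it straight away.
INSERT INTO LogTable([Table], Operation, RowsAffected)
SELECT 'MyFinalTable', 'Update', @@ROWCOUNT;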
In my experience it is not a good idea to build convoluted logic into SSIS packages if you can instead do it in a database, although it does depend on the person who eventually has to maintain it. Hopefully you can appreciate that this T-SQL approach is a more straightforward, code-based approach, as opposed to having to dig around in property pages, events and other places inside SSIS packages.
I assume that you're using an Execute SQL Task for the updates and deletes? As @Nick.McDermaid mentioned, using an OLE DB Command within a Data Flow presents various issues when performing DML. You can find the number of rows updated, inserted, or deleted in a table through an Execute SQL Task by using the ExecValueVariable property of this task. Set the variable that will hold the row count to this property and it will return the number of affected rows. Note that it will only return the number of rows impacted by the last statement in the Execute SQL Task, regardless of whether batches (i.e. GO separators) are present in the component.
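To illustrate that last-statement behaviour (the table names and WHERE condition below are hypothetical), one workaround is to capture the count you care about into a T-SQL variable and hand it back yourself as the final statement, e.g. via a single-row result set:
DECLARE @DeletedRows INT;

DELETE FROM dbo.FinalTable
WHERE  LoadDate < '2020-01-01';          -- hypothetical condition
SET @DeletedRows = @@ROWCOUNT;           -- capture the DELETE's count immediately

TRUNCATE TABLE dbo.StagingTable;         -- clean-up; this is now the last DML statement

SELECT @DeletedRows AS DeletedRows;      -- return the DELETE count explicitly as a result set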
I created 4 Data Flow Tasks, shown in the image in the order I want them to run:
Customers
Locations
Orders
Items
Each Data Flow Task consists of:
Flat File data source
Data Conversion
OLE DB Destination
When I run the package, according to the Execution Results tab in the image, the tasks do not run in the order I wanted, but:
Customers
Items
Locations
Orders
The execution completes successfully, though. I was able to verify that all data was successfully inserted into the DB.
Questions:
Does the sequence of table migration matter when tables have foreign keys?
If yes, then why didn't the above execution throw any errors?
How to specify the order of execution?
Thanks.
The order you've specified is the order in which it has actually run. The Execution Results tab is showing them alphabetically.
If you'd like to see the order and prove it to yourself, add some tasks in the sequence to write to a table with datetimes in, or make or inspect a log (using the SSIS Catalog if you're using a Project deployment, or a Log Provider if using Package deployment).
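A minimal sketch of that idea (the audit table below is hypothetical): give each Data Flow an Execute SQL Task immediately before or after it that stamps a row, then compare the timestamps.
-- Run once to create the audit table.
CREATE TABLE dbo.PackageRunLog
(
    LogId    INT IDENTITY(1,1) PRIMARY KEY,
    TaskName VARCHAR(100) NOT NULL,
    LoggedAt DATETIME2(3) NOT NULL DEFAULT SYSDATETIME()
);

-- Statement for the Execute SQL Task next to the Customers data flow;
-- repeat with 'Locations', 'Orders' and 'Items'.
INSERT INTO dbo.PackageRunLog (TaskName) VALUES ('Customers');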
The order is exactly the way you set it. But if it's important to you to be sure it runs in a specific order, you can set the precedence constraint to Completion.
This way the second task will not start until the first completes.
I have an SSIS project with 3 SSIS packages: one is a parent package which calls the other 2 packages based on some condition. In the parent package I have a Foreach Loop container which reads multiple .csv files from a location; based on the file name, one of the two child packages is executed and the data is uploaded into tables in MS SQL Server 2008. Since multiple files are read, if any of the files generates an error in the child packages, I have to log the details of the error (like the file name, error message, row number etc.) in a custom database table, delete all the records that got uploaded into the table, and read the next file. The package should not stop for files which are valid and don't generate any errors when they are read.
Say a file has 100 rows and there is a problem at row number 50: then we need to log the error details in a table, delete rows 1 to 49 which got uploaded into the database table, and the package should start executing the next file.
How can I achieve this in SSIS?
You will have to set TransactionOption=Required on your Foreach Loop container and TransactionOption=Supported on the control flow items within it. This will allow your transactions to be rolled back if any complications happen in your child packages. More information on the 'TransactionOption' property can be found at http://msdn.microsoft.com/en-us/library/ms137690.aspx
Custom logging can be performed within the child packages by redirecting the error output of your destination to your preferred error destination. However, this redirection logging only occurs on insertion errors. So if you wish to catch errors that occur anywhere in your child package, you will have to set up an 'OnError' event handler or utilize the built-in error logging for SSIS (SSIS -> Logging..)
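A rough sketch of what the custom-logging statement might look like in an Execute SQL Task inside the OnError event handler (the log table and the file-name variable are assumptions; the ? placeholders would be parameter-mapped, e.g. to System::PackageName, System::ErrorDescription and a User variable holding the current file name from the Foreach loop):
-- ? placeholders are filled in on the task's Parameter Mapping page (OLE DB connection assumed).
INSERT INTO dbo.PackageErrorLog (PackageName, FileName, ErrorMessage, LoggedAt)
VALUES (?, ?, ?, SYSDATETIME());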
I suggest you try creating two data flows in your loop container. The main idea here is to have a set of three tables (temp, error log and final) to handle the error situations better and more easily. In the same flow you do the following:
1st dataflow:
This should read the .csv file and load the data into a temp table. If the file is processed with errors you simply truncate the temp table. In addition, you should also configure the flat file source output to redirect errors to an error log table.
2nd dataflow:
On the other hand, if processing is error-free, you need to transfer the rows from the temp table into the destination table. So here the OLE DB source is the temp table and the OLE DB destination is the final table.
Don't forget to truncate the temp table in both cases, as the next file will need an empty table.
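If you'd rather do the temp-to-final transfer as a set-based step instead of the second data flow, an Execute SQL Task could run something roughly like this (table and column names are placeholders):
-- Move the validated rows into the final table, then empty the temp table for the next file.
INSERT INTO dbo.FinalTable (PrimaryKey, Col1, Col2)
SELECT PrimaryKey, Col1, Col2
FROM   dbo.TempTable;

TRUNCATE TABLE dbo.TempTable;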
Let's break this down a bit.
I assume that you have a data flow that processes an individual file at a time. The data flow would read the input file via a source connection, transform it and then load the data into the destination. You would basically need to implement the Error Handler flow in your transformations by choosing "Redirect Row". Details on the Error Flow are available here: https://learn.microsoft.com/en-us/sql/integration-services/data-flow/error-handling-in-data.
If you need to skip an entire file due to a bad format, you will need to implement a Precedence Constraint for failure on the file system task.
My suggestion would be to get a copy of the exam preparation book for exam 70-463 - it has great practice examples on exactly the kind of scenarios that you have run into.
We do something similar with Excel files.
We have an ErrorsFound variable which is reset each time a new file is read within the for each loop.
A script component validates each row of the data and sets the ErrorsFound variable to true if an error is found, and builds up a string containing any error details.
Then - based on the ErrorsFound variable - either the data is imported or the error is recorded in a log table.
It gets a bit more tricky when the Excel files are filled in badly enough that the process can't read them at all - for example, when text is entered in a date, number or currency field. In this case we use the OnError event handler of the Data Flow Task to record an error in the log, but we won't know which row(s) caused the problem.
I have a bit of a problem. When I set up an SSIS package and fire it off, it shows me the number of rows going into the SQL table, but when I query the table there are almost 40,000 rows missing compared to the last count after the conditional split that I have in the package.
What causes this problem? Even if I load a normal table or view it still does the same thing. But here I have to use the fast load option as a lot of source files are being loaded. This is only testing before sending it to production and I am stuck at the moment. Is there a way I can work around this problem and get all the data that is supposed to be pumped into the table? Please also take note that the conditional split removes any NULL values, as seen in the first picture.
Check the Error Output (under Connection Manager and Mappings) within Destination Component. If the Error setting is set to Ignore Failure or Redirect Row, the component will succeed, but only the successful rows will be inserted.
What is the data source? Try checking your data and make sure you don't have any terminators stored in one of the rows.
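If any of the suspect data is already in a table, a quick (entirely hypothetical) check for embedded row terminators in a text column could look like this:
-- Look for carriage returns or line feeds hiding inside a text column.
SELECT *
FROM   dbo.YourTable
WHERE  SomeTextColumn LIKE '%' + CHAR(13) + '%'
   OR  SomeTextColumn LIKE '%' + CHAR(10) + '%';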
I have no idea whether this can be done or not, but basically, I have the following data flow:
Extracts the data from an XML file (works fine)
Simply splits the records based on an enclosed condition (works fine)
Had to add a derived column object due to some character set issues (there might be better methods, but it works)
Now "Step 4" is where I'm running into a scenario where I'd only like to insert the values that have a corresponding match in my database, for instance, the XML has about 6000 records, and from those, I have maybe 10 of them that I need to match back against and insert them instead of inserting all 6000 of them and doing the compare after the fact (which I could also do, but was hoping there'd be another method). I was thinking that I might be able to perform a sql insert command within the OLE DB DESTINATION object where the ID value in the file matches, but that's what I'm not 100% clear on or if it's even possible for that matter. Should I simply go the temp table route and scrub the data after the fact, or can I do this directly in the destination piece? Any suggestions would be greatly appreciated.
EDIT
Thanks to the last comment from billinkc, I managed to get a bit closer, where I can identify the matches and use that result set, but somehow it seems to be running the data flow twice, which is strange... I took the Lookup object out to see whether it was causing it, and it seems to be the case. Any reason why it would run this entire flow twice with the addition of the Lookup? I should have a total of 8 matches, which I confirmed with the data viewer output, but then it seems to run a second time for the same file.
Is there a reason you can't use a Lookup transformation to find existing records? Configure it so that it routes non-matching records to the no-match output, and then only connect the match-found connector to the "Navigator Staging Manager Funds".
I believe that answers what you've asked, but I wonder if you're expressing the right desire. My assumption is that the lookup would go against the existing destination, so the lookup returns the id 10 for a row. All of the out-of-the-box destinations in SSIS only perform inserts, so the row that found a match would now get doubled. As you are looking for existing rows, that usually implies you'd want to perform an update to an existing row. If that's the case, there is a specially designed transformation, the OLE DB Command; it is the component that allows for updates. There is a performance problem with that component: it issues a single update statement per row flowing through it. For 10 rows, I think it'd be fine. Otherwise, the pattern you'd use is to write all the new rows (inserts) into your destination table and then write all of your changed rows (updates) into a second staging-type table. After the data flow is complete, use an Execute SQL Task to perform a set-based update statement.
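A sketch of that final set-based step, assuming the changed rows landed in a hypothetical staging table called UpdatesStaging:
-- Apply all staged changes to the destination in a single statement.
UPDATE d
SET    d.Col1 = s.Col1,
       d.Col2 = s.Col2
FROM   dbo.DestinationTable AS d
       INNER JOIN dbo.UpdatesStaging AS s ON s.Id = d.Id;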
There are third party options that handle combined upserts. I know Pragmatic Works has an option and there are probably others on the tasks and components site.