I have a package that is essentially trying to copy 26 tables from Oracle to SQL Server.
It's not a complete table copy; we are looking for records that belong to certain 'Regions' of our company.
I pull the data from Oracle.
I started out just doing this with elbow grease, but each of the 26 tables required several variables to do the deletes, the fetches, etc.
Long story short, I decided to use variables to represent the table names (source, temp and target).
This allowed me to copy/paste one sequence and effectively bypass a lot of clicking around in BIDS.
The problem I am running into is that the metadata seems to be very fragile. The sequences all seem to run fine on their own, but when I run the whole package, it breaks, and never in the same place.
Is this approach just a bad idea w/ SSIS?
So just to take this off the board....
Each sequence container had the following ops:
Script task - set variables
Execute SQL task - delete from temp
Data flow SourceToTemp -
OLE DB source - used a generic SELECT * FROM tbl into temp_tbl
Derived column - inserts a timestamp column
OLE DB destination - maps all the columns into a temp table (THIS IS THE BIG PROBLEM CHILD)
Execute SQL task - delete from target
Execute SQL task - insert into target, selecting from temp (both are sketched below)
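For reference, those last two Execute SQL tasks run variable-driven statements roughly along these lines; the real table names come from the variables set by the script task, and the column list and region filter here are only placeholders:

DELETE FROM TargetTable
WHERE Region IN (SELECT Region FROM TempTable);   -- remove only the regions being reloaded

INSERT INTO TargetTable (Col1, Col2, LoadTimestamp)
SELECT Col1, Col2, LoadTimestamp
FROM TempTable;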
The OLE DB destination is the piece that kept breaking.
Since it references variables, I had to be very careful at design time to set the variables correctly before opening one of the data flows.
I am pretty sure this is the problem. Since I cannot say with certainty when SSIS refreshes metadata in the design environment, I can't be sure if/when sequence X refreshed while the variables were set to support sequence Y.
So while it conceptually should work at run time, dev time is a change control nightmare.
I have changed all the OLE DB destinations to point to a hard-coded table name. This is really a small concession, since there are 4 SQL statements that are still driven by variables (saving me a lot of clicking and typing).
This small change has eliminated the 'shifting sands' problem.
Takeaway lesson: don't have an OLE DB destination based on a variable.
thanks for the comments
Imagine that you want to save in a variable the number of rows that were updated or deleted in a table.
These are the steps that I did:
First, in the Control Flow I created a Data Flow Task.
Then, in the Data Flow, I created a source (in my case an Excel file), proceeded to create two variables to count those rows - countDeleted and countUpdated - connected the variables to two Row Count transformations, and then connected my destination (OLE DB).
Now, in the Control Flow, what do I do?
Create an Execute SQL Task? Or a Script Task? What is the best way to do it? What is the piece of code to use?
Thanks for your help.
PS: I only have 4 weeks of SSIS experience, sorry for my noobieness :)
An OLE DB destination only inserts. It can't UPDATE or DELETE.
What's your logic for updating or deleting?
If you're just starting out and reading about doing things in SSIS, you will eventually find advice to use the OLE DB Command to perform row-by-row deletes and inserts.
In my opinion this is to be avoided. It does not scale (works fine for small recordsets, then fails for large recordsets), and it is difficult to maintain parameter mappings in the OLE DB Command. Although you should try it anyway to familiarise yourself with it.
My advice is to load the Excel data into a staging table, perform batch DELETE and UPDATE statements to load the data, and use @@ROWCOUNT to capture the records updated.
For example:
Your existing described dataflow can be used to load into a table called StagingTable
Before your dataflow you should run an Execute SQL Task (This is in the Control Flow pane, not the Data Flow pane) that clears the staging table:
TRUNCATE TABLE StagingTable;
So first get that working - repeatedly running your package clears the staging table, then loads Excel into it without creating duplicates.
This in itself is a challenge as Excel is a terrible data interchange format.
Once you have that working, you add an execute SQL task to the end that runs some SQL that deletes the records you want and captures the count. For example:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
SELECT @@ROWCOUNT;
Then you follow the instructions here to load that back to your SSIS variable
http://microsoft-ssis.blogspot.com/2011/03/rowcount-for-execute-sql-statement.html
What are you doing with this row count? Are you writing it to a logging table? Save yourself the bother of pulling it back into an SSIS variable and just write it directly:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
INSERT INTO LogTable([Table],Operation,[Type])
SELECT 'MyFinalTable','Delete', @@ROWCOUNT;
In my experience it is not a good idea to build convoluted logic into SSIS packages if you can instead do it in the database. Although it does depend on the person who has to eventually maintain it. Hopefully you can appreciate that this T-SQL approach is a more straightforward, code-based approach as opposed to having to dig around in property pages and events and other places inside SSIS packages.
I assume that you're using an Execute SQL Task for the updates and deletes? As @Nick.McDermaid mentioned, using an OLE DB Command within a Data Flow presents various issues when performing DML. You can find the number of rows updated, inserted, or deleted in a table through an Execute SQL Task by using the ExecValueVariable property of this task. Set the variable that will hold the row count to this property and it will return the number of affected rows. Note that it will only return the number of rows impacted by the last statement in the Execute SQL Task, regardless of whether batches (i.e. GO separators) are in the component.
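For example, if the task's SQL contains more than one statement (hypothetical table names below), only the final statement's count would come back through ExecValueVariable:

DELETE FROM dbo.StagingTable WHERE LoadDate < DATEADD(day, -7, GETDATE());   -- this count is not surfaced
UPDATE dbo.FinalTable SET IsCurrent = 0 WHERE IsCurrent = 1;                 -- last statement, so this is the count the variable receives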
I am implementing an SSIS package and currently trying to do the following.
Truncate the destination table
Fetch the data by executing the stored procedure and insert it into the destination table.
I have created an Execute SQL Task to address step 1 and a data flow with an OLE DB source and OLE DB destination to address the second point. It has been working successfully so far, but isn't working for one of my stored procedures that uses temp tables.
When I edit the OLE DB source and click the preview button, I get the error 'no column returned'.
I know that SSIS has an issue with generating columns while executing stored procedures that depend on temp tables. I have converted the stored proc to use table variables and it's now able to return columns in SSIS when I do a preview. The only downside is that the stored procedure is taking a longer time to execute. It's taking 1 hour 15 mins as compared to 15 mins while using temp tables.
I did see a suggestion to use SET FMTONLY before executing the stored procedure as an alternate solution to changing to table variables, but that didn't seem to work as I am getting syntax or permission denied errors.
Could somebody tell me a solution to my problem which does not compromise on performance?
Sounds like you've already read all the approaches to using Temp tables in SSIS, including the IF 1=0... trick? If you haven't seen that one yet, google it.
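In case it helps, that trick boils down to putting a never-executed SELECT at the top of the proc so the metadata parser has a column list to read. A minimal sketch, with made-up column names:

IF 1 = 0
BEGIN
    -- never runs, but gives SSIS a result shape (column names and data types) to latch on to
    SELECT
        CAST(NULL AS int)          AS CustomerID,
        CAST(NULL AS varchar(100)) AS CustomerName,
        CAST(NULL AS datetime)     AS OrderDate;
END;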
You say that using Table Variables causes your stored procedure to take about 5 times longer than using Temp Tables. The most likely reason for that is that you are indexing your temp tables but not your table variables. If you didn't know that table variables can be indexed, they can. You might try that.
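If you want to try that, indexes on table variables are declared inline as constraints; a minimal sketch with made-up columns:

DECLARE @Work TABLE
(
    CustomerID int NOT NULL PRIMARY KEY,     -- the PRIMARY KEY constraint gives you a clustered index
    Region     varchar(10) NOT NULL,
    UNIQUE (Region, CustomerID)              -- a UNIQUE constraint gives you an additional index
);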
Finally, a solution that you haven't mentioned is that you can replace your temporary table with a real table that gets truncated when you're done using it.
Short comment:
Try EXEC WITH RESULT SETS and specify the metadata yourself for a proc with temp tables; or use the Script Component as a source and specify the Output columns yourself.
Long comment:
Technically speaking, it is the driver/database you are using in SSIS that would decide the behavior when working with temp tables.
Metadata is an important factor when using SSIS's pipeline components. By metadata, I mean the names of the columns, their data types etc that a pipeline component uses. When designing a data flow, someone/something should provide this metadata to the components that require it.
In most cases, SSIS automatically retrieves the metadata. Components that do not connect to an external data source, like Conditional Split etc., get their metadata from the other components they are connected to. For the pipeline components that connect to an external data source (like OLE DB Source, OLE DB Destination, Lookup etc.), SSIS provides a mechanism to get this metadata without human involvement. This mechanism involves the driver connecting to the database and retrieving the metadata of the output. If the driver/database is capable of returning the metadata, then that metadata is used. If the driver/database is incapable, then you get the errors you are seeing. The rest of my comments are based on the assumption that you are using a SQL Server database in your question.
When working with a SQL Server database in SSIS, typically, we use the native client drivers provided by Microsoft. When trying to get the metadata, these drivers try to get it without actually executing the SQL statement (actual execution can have side effects, and might take more than a few seconds/minutes/hours, and you don't want side effects and long wait times during package design time). So to get the metadata, the driver relies on the metadata of the actual objects used in the SQL command. If the command uses a physical table or view, SQL Server already has the metadata available and can supply it to the driver. If it is a temp table, SQL Server does not have the metadata until it can create the temp table. If using the FMTONLY option, you can use it in such a way as to create the temp tables but avoid any heavy processing/side effects, and thus be able to retrieve metadata without penalties. Post 2012, these native client drivers rely on newer functionality to retrieve metadata than the drivers before 2012. In 2012 and after, the driver uses the sp_describe_first_result_set proc to retrieve metadata. So, whether you can get metadata or not is determined by the ability of the sp_describe_first_result_set proc.
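If you want to see what that proc can (or cannot) work out for your command, you can call it directly; a sketch against a hypothetical proc name:

-- returns the column metadata the post-2012 driver would see, or an error if it cannot be determined
EXEC sp_describe_first_result_set
     @tsql = N'EXEC dbo.usp_MyTempTableProc',
     @params = NULL,
     @browse_information_mode = 0;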
So while SSIS can automatically get the metadata (because of the driver/database), it does not automatically get the metadata in some cases (again because of the driver/database). In cases involving the second scenario, some other process (typically a human) can help the driver infer metadata or provide the metadata to the component directly.
To help the driver, in the case of SQL Server 2012 and after, you can use the WITH RESULT SETS clause to specify the output metadata. When this clause is present, the driver will use it and doesn't try to query the metadata from system objects, and thus avoids the error which you would otherwise get. If you are using the drivers that came with SQL Server 2008, you can use FMTONLY. This option is at the driver/database level.
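A minimal sketch of that clause as the OLE DB Source's SQL command, assuming a hypothetical proc and column list:

EXEC dbo.usp_MyTempTableProc
WITH RESULT SETS
(
    (
        CustomerID   int           NOT NULL,
        CustomerName varchar(100)  NULL,
        OrderTotal   decimal(18,2) NULL
    )
);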
Another option could be to use a Script Component as the Source and in the Output columns, you can specify the columns/metadata. SSIS would not try to retrieve metadata from the datasource in this case, but would rely on the definitions you provided in the Output section of the Script Component.
As you can see, both options involve a human (or some other process) specifying the metadata instead of SSIS trying to retrieve the metadata in an automated fashion. I would prefer the first option if working with SQL Server and the second option if working with databases like MySql.
I have created many SSIS packages in the past, though the need for this one is a bit different than the others which I have written.
Here's the quick description of the business need:
We have a small database on our end sourced from a 3rd party vendor, and this needs to be overwritten nightly.
The source of this data is a bunch of flat files (CSV) from the 3rd party vendor.
Current setup: we truncate the tables of this database, and we then insert the new data from the files, all via SSIS.
Problem: There are times when the files fail to come, and what happens is that we truncate the old data even though we don't have the fresh data set. This leaves us with an empty database, when we would prefer to have yesterday's data over no data at all.
Desired Solution: I would like some sort of mechanism to see if the new data truly exists (these files) prior to truncating our current data.
What I have tried: I tried to capture the data from the files, add it to an ADO recordset, and only proceed if this part was successful. This doesn't seem to work for me, as I have all the data capture activities in one data flow and I don't see a way for me to reuse that data. It would seem wasteful of resources to do that and let the in-memory tables just sit there.
What have you done in a similar situation?
If the files are not present, update some flags like IsFile1Found to false and pass these flags to a stored procedure which truncates on a conditional basis.
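A rough sketch of such a procedure (the flag parameters and table names are just examples):

CREATE PROCEDURE dbo.usp_TruncateIfFilesFound
    @IsFile1Found bit,
    @IsFile2Found bit
AS
BEGIN
    -- only clear the vendor tables when every expected file actually arrived
    IF @IsFile1Found = 1 AND @IsFile2Found = 1
    BEGIN
        TRUNCATE TABLE dbo.VendorTable1;
        TRUNCATE TABLE dbo.VendorTable2;
    END
END;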
If a file might be empty, then using PowerShell through an Execute Process Task you can extract the first two rows; if there are two rows (header + data row), then the data file is not empty, and you can truncate the table and import the data.
Another approach could be:
You can load the data into staging tables, insert the data from the staging tables into the destination tables using a SQL stored procedure, and truncate the staging tables after the data is moved to all the destination tables. This way, before truncating a destination table, you can check whether the staging tables are empty or not.
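For example, the stored procedure that moves the data could guard the truncate like this (table and column names are only illustrative):

IF EXISTS (SELECT 1 FROM dbo.StagingVendorTable)
BEGIN
    TRUNCATE TABLE dbo.VendorTable;

    INSERT INTO dbo.VendorTable (Col1, Col2)
    SELECT Col1, Col2
    FROM dbo.StagingVendorTable;

    TRUNCATE TABLE dbo.StagingVendorTable;   -- clear staging once the move is done
END;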
I looked around and found that some others were struggling with the same issue, though none of them had a very elegant solution, nor do I.
What I ended up doing was to create a flat file connection to each file of interest and have a task count records and save to a variable. If a file isn't there, the package fails and you can stop execution at that point. There are some of these files whose actual count is interesting to me, though for the most part, I don't care. If you don't care what the counts are, you can keep recycling the same variable; this will reduce the creation of variables on your end (I needed 31). In order to preserve resources (read: reduce package execution time), I excluded all but one of the columns in each data source; it made a tremendous difference.
First, I am newer to SSIS. I've been working with it for a while, but not very often and only for very basic simple things.
I have an SSIS package set up to load about 30 tables from a "live" data source. This has been working fine and we add new tables and update existing tables all the time. We use these tables to query for reports. This was not a planned out project, it was/is being done on the fly.
Then I was told I need to create "point in time" tables. Basically, duplicate tables from above need to be loaded 3 times every year: once in the beginning, middle and end of year. The data cannot change, but just like the "live" tables, it will be a work in progress with new fields/tables being added. I was under the gun to get it done now (along with a bunch of other stuff of course) and I did it in a way that "works for now", but cannot obviously stay this way. Here is basically what I did:
I created synonyms and revised all the stored procedures to use the synonyms
I set up an SSIS package with the following steps:
Execute SQL Task - Truncate tables
Execute SQL Task - Point synonyms to beginning of the year data source (the synonym SQL is sketched after this list)
Data Flow Task - Load tables
Execute SQL Task - Point synonyms to middle of the year data source
Data Flow Task - Load tables
Execute SQL Task - Point synonyms to end of the year data source
Data Flow Task - Load tables
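For reference, each of those synonym-pointing tasks is just a drop-and-recreate statement along these lines (the synonym, database, and table names are placeholders):

IF OBJECT_ID('dbo.SourceTableSyn', 'SN') IS NOT NULL
    DROP SYNONYM dbo.SourceTableSyn;
CREATE SYNONYM dbo.SourceTableSyn FOR BeginningOfYearDB.dbo.SourceTable;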
Obviously it cannot stay this way as I now have 4 copies of the same Data Flow Tasks. What a nightmare if I need to add fields or a new table! At least I am able to use the same stored procedures though. Can someone please help me with an idea for a better set up?
Unfortunately I don't have a repro for my issue, but I thought I would try to describe it in case it sounds familiar to someone... I am using SSIS 2005, SP2.
My package has a package-scope user variable - let's call it user_var
The first step in the control flow is an Execute SQL task which runs a stored procedure. All that SP does is insert a record in a SQL table (with an identity column) and then go back and get the max ID value. The Execute SQL task saves this output into user_var.
The control flow then has a Data Flow Task - it goes and gets some source data, has a derived column which sets a column called run_id to user_var, and saves the data to a SQL destination.
In most cases (this template is used for many packages, running every day) this all works great. All of the destination records created get set with a correct run_id.
However, in some cases, there is a set of the destination data that does not get run_id equal to user_var, but instead gets a value of 0 (0 is the default value for user_var).
I have 2 instances where this has happened, but I can't make it happen on demand. In both cases, it was just less than 10,000 records that have run_id = 0. Since SSIS writes data out in 10,000-record blocks, this really makes me think that, for the first set of data written out, user_var was not yet set. Then, after that first block, for the rest of the data, run_id is set to a correct value.
But control passed on to my data flow from the Execute SQL task - it would have seemed reasonable to me that it wouldn't go on until the SP had completed and user_var was set. Maybe it just runs the SP, but doesn't wait for it to complete?
In both cases where this has happened there seemed to be a few packages hitting the table to get a new user_var at about the same time. And in both cases lots of data was written (40 million rows, 60 million rows) - my thinking is that that means the writes were happening for a while.
Sorry to be both long-winded AND vague. A winning combination! Does this sound familiar to anyone? Thanks.
Updating to show the SP I use to get the user_var:
CREATE PROCEDURE [dbo].[sp_GetRunIDForPackage] (@pkg varchar(50)) AS
-- add a new entry for this run of this package - the RUN_ID is an IDENTITY column and so
-- will get created for us
INSERT INTO shared.STAGE_LOAD_JOB( EFFECTIVE_TS, EXECUTED_BY )
VALUES( getdate(), @pkg )
-- now go back into the table and get the new RUN_ID for this package
SELECT MAX( RUN_ID )
FROM shared.STAGE_LOAD_JOB
WHERE EXECUTED_BY = @pkg
Is this variable being accessed lots of times, from lots of places? Do you have a bunch of parallel data flows using the same variable?
We've encountered a bug in both SQL 2005 and 2008 whereby a "race condition" causes the variable to be inaccessible from some threads, and the default value is used. In our case, the variable was our "base folder" location for packages, causing our overall execution control package to not find its sub-packages.
More detail here: SSIS Intermittent variable error: The system cannot find the file specified
Unfortunately, the work-around is to hard-code a default value into the variable that will work when the race condition happens. Easy for us (set the base folder to be correct for our prod environment), but it looks a lot harder for your issue.
Perhaps you could use multiple variables (one for each data flow), and a bunch of Execute SQL tasks to populate those variables? REALLY ugly, but it should help.
Did you check the value of user_var before getting to the Derived Column Component? It sounds like user_var may be 0 so you are doing run_id = user_var; run_id = 0. I may be naive to think it is that simple but that's the first thing I would check.
Given the procedure code, you might want to replace this:
SELECT MAX( RUN_ID )
FROM shared.STAGE_LOAD_JOB
WHERE EXECUTED_BY = @pkg
with this:
SELECT SCOPE_IDENTITY();
The SCOPE_IDENTITY() function returns the last identity value that was inserted within the current scope, which here is the procedure. Not sure if this will solve the problem, but I find it best to work through all of these issues as they might have unrelated consequences.
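Putting that together, the revised procedure would look roughly like this (same table and parameter names as above; only the final lookup changes):

CREATE PROCEDURE [dbo].[sp_GetRunIDForPackage] (@pkg varchar(50)) AS
-- add a new entry for this run of this package - RUN_ID is an IDENTITY column
INSERT INTO shared.STAGE_LOAD_JOB( EFFECTIVE_TS, EXECUTED_BY )
VALUES( getdate(), @pkg )
-- return the identity generated by the INSERT above, scoped to this procedure,
-- so concurrent packages cannot pick up each other's RUN_ID
SELECT SCOPE_IDENTITY() AS RUN_ID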