I am using SSIS for data warehousing to import data from different sources like flat files, .xls files, and other SQL Server instances.
In my scenario I have 50 data flow tasks which execute in parallel in one package (control flow). These data flows are independent, meaning they fetch data from different tables and files into my warehouse DB.
Sometimes the structure of a source table or file changes, and then my package fails with a validation error.
I need a solution by which I can skip only the broken data flow task while the other data flow tasks complete their work. I would prefer not to create a separate package for each data flow task.
Please advise what to do in such a situation.
Regards
Shakti
I highly advise putting each of these into a separate package, and then using a scheduling tool or a master package to call each one individually. It will make this solution much more maintainable.
If you insist on having them all in one package, you can use the "FailParentOnFailure", "FailPackageOnFailure", and "MaximumErrorCount" properties to let a data flow fail while the container ignores the error, allowing everything else to run. You really probably shouldn't do that, though: failures can happen for any number of reasons, and having separate packages that run in parallel makes finding the error during a scheduled run much easier.
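For the single-package route, the relevant settings might look roughly like this (a sketch only; the property names are real SSIS properties, but the container layout is an assumption):

```
-- On each Data Flow Task:
FailPackageOnFailure = False
FailParentOnFailure  = False

-- On the Sequence Container (or package) holding them:
MaximumErrorCount    = 0    -- 0 means "no limit", so errors do not stop the container
```

Note that with these settings a failed data flow is silently swallowed, so you would want logging (e.g. an OnError event handler) to know it happened.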
SSIS newbie here.
I have an SSIS package I created based on the wizard. I added a SQL task to run the script I was running previously separately, in order to reduce the process to one step. The script uses lots of temp tables, and one global ##temp at the end to make the result accessible outside the process.
When I try to execute the package, I get a complex "Package Validation Error" (error code 0x80040E14). I think the operative part of the error message is "Invalid object name '##roster5'."
I just realized it was the Data Flow task that was throwing the error, so I tried to put another SQL Task before everything else to create the table so that the Data Flow task would see that the table is there; but it still gives me the error: "Invalid object name '##ROSTER_MEMBER_NEW5'."
What am I missing/doing wrong? I don't know what I don't know. It seems like this shouldn't be that complicated (As a newbie, I know that this is probably a duplicate of...something, but I don't know how else to ask the question.)
Based on your responses, another option would be to add a T-SQL step in a SQL Agent job that executes stand-alone T-SQL. You would need to rethink the flow control of your original SSIS package and split that into 2 separate packages. The first SSIS package would execute all that is needed before the T-SQL step, the next step would execute the actual T-SQL needed to aggregate, then the last step would call the second package, which would complete the process.
I'm offering this advice with the caveat that it isn't advisable. What would work best is to communicate with your DBA, who will be able to offer you a service account to execute your SSIS package with the elevated privileges needed to truncate the permanent staging table your process will need.
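For reference, the workaround people often describe for keeping a global temp table visible to the Data Flow is roughly this (a sketch, untested against this package, and the column list is a made-up placeholder): create the table in an up-front Execute SQL Task, set RetainSameConnection = True on the connection manager so every task reuses the same session, and set DelayValidation = True on the Data Flow task so it does not try to validate the table before it exists.

```sql
-- Up-front Execute SQL Task: create the global temp table so later
-- tasks (with DelayValidation = True) can see it on the same connection.
IF OBJECT_ID('tempdb..##ROSTER_MEMBER_NEW5') IS NOT NULL
    DROP TABLE ##ROSTER_MEMBER_NEW5;

CREATE TABLE ##ROSTER_MEMBER_NEW5 (
    MemberId   INT,            -- placeholder columns: substitute the real schema
    MemberName VARCHAR(100)
);
```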
I actually want to post a non-answer. I tried to follow the advice above as well as I could, but nothing worked. My script was supposed to run, and then the data pump was supposed to, essentially, copy the contents of a global temp table to another server/table. I was doing this as two steps and tried to use SSIS to do it all in one step. There wasn't really a need to pass values from component to component within SSIS. It doesn't seem like this should be that hard.
In any event, as I said, nothing worked. OK, let me tell you what I think happened. After making a lot of mistakes, a lot of undos, and a lot of unsuccessful attempts, something started working. One thing I think contributed is that I had set the ResultSetType to ResultSetType_None, since I wouldn't be using any results from that step. If anyone thinks that's not what happened, I'm happy to hear the actuality, since I want to learn.
I consider this a non-answer, because I have little confidence that I'm right, or that I got it by anything other than an accident.
This question is going to be a purely organizational question about SSIS project best practice for medium-sized imports.
So I have a source database which is continuously being enriched with new data. Then I have a staging database into which I sometimes load the data from the source database, so I can work on a copy of the source while migrating the current system. I am actually using an SSIS Visual Studio project to import this data.
My issue is that I have realised the actual design of my project is not really optimal, and now I would like to move this project to SQL Server so I can schedule the import instead of running the Visual Studio project manually. That means the project needs to be cleaned up and optimized.
So basically, for each table, the process is simple: truncate the table, extract from the source, and load into the destination. And I have about 200 tables. Extractions cannot be parallelized, as the source database only accepts one connection at a time. So how would you design such a project?
I read in the Microsoft documentation that they recommend using one Data Flow per package, but managing 200 different packages seems quite impossible, especially since I will have to chain them for the scheduled import. On the other hand, a single package with 200 Data Flows seems unmanageable too...
Edit 21/11:
The first approach I wanted to use when starting this project was to extract my tables automatically by iterating over a list of table names. This could have worked out well if my source and destination tables had all had the same schema object names, but since the source and destination databases are from different vendors (BTrieve and Oracle), they also have different naming restrictions. For example, BTrieve does not reserve names and allows names longer than 30 characters, whereas Oracle has reserved words and limits names to 30 characters. So that is how I ended up manually creating 200 data flows with semi-automatic column mapping (most were automatic).
When generating the CREATE TABLE queries for the destination database, I created a reusable C# library containing the methods to generate the new schema object names, just in case the methodology could be automated. If there were a custom package-generation tool that could use an external .NET library, then this might do the trick.
Have you looked into BIDS Helper's BIML (Business Intelligence Markup Language) as a package generation tool? I've used it to create multiple packages that all follow the same basic truncate-extract-load pattern. If you need slightly more cleverness than what's built into BIML, there's BimlScript, which adds the ability to embed C# code into the processing.
From your problem description, I believe you'd be able to write one BIML file and have that generate two hundred individual packages. You could probably use it to generate one package with two hundred data flow tasks, but I've never tried pushing SSIS that hard.
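As a rough illustration of the idea (untested; the table list, connection names, and Biml element layout here are assumptions, and in practice the list would come from metadata rather than being hard-coded), a BimlScript file generating one truncate-extract-load package per table might look like:

```xml
<#@ template language="C#" #>
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Packages>
    <# foreach (var table in new[] { "Customers", "Orders" }) { #>
    <Package Name="Load_<#=table#>" ConstraintMode="Linear">
      <Tasks>
        <!-- Step 1: truncate the destination table -->
        <ExecuteSQL Name="Truncate <#=table#>" ConnectionName="Staging">
          <DirectInput>TRUNCATE TABLE dbo.<#=table#>;</DirectInput>
        </ExecuteSQL>
        <!-- Step 2: extract from source, load into destination -->
        <Dataflow Name="Load <#=table#>">
          <Transformations>
            <OleDbSource Name="Source" ConnectionName="SourceDb">
              <DirectInput>SELECT * FROM <#=table#>;</DirectInput>
            </OleDbSource>
            <OleDbDestination Name="Dest" ConnectionName="Staging">
              <ExternalTableOutput Table="dbo.<#=table#>" />
            </OleDbDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
    <# } #>
  </Packages>
</Biml>
```

The `<# #>` blocks are BimlScript's embedded C#; replacing the hard-coded array with a metadata query is where the reusable name-mapping library could plug in.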
You can basically create 10 child packages, each having 20 data flow tasks, and create a master package which triggers these child packages. Using parent-to-child configuration, create a single XML configuration file. Define precedence constraints in the master package to execute the child packages serially. This way, maintainability will be better than with 200 packages or a single package with 200 data flow tasks.
The following link may be useful to you.
Single SSIS Package for Staging Process
Hope this helps!
I don't entirely understand the purpose of control flow in an SSIS package. In all of the packages I've created, I simply add a data flow component to control flow and then the rest of the logic is located within the data flow.
I've seen examples of more complicated control flows (e.g., a Foreach Loop container that iterates over lines in an Excel file), but I am looking for an example where the logic could not also be implemented in the data flow. I could just as easily create a connection to the Excel file within the data flow.
I'm trying to get a better understanding of when I would need to (or should) implement logic in control flow vs using the data flow to do it all.
What prompted me to start looking into control flow and its purpose is that I'd like to refactor SSIS data flows, as well as break packages down into smaller packages, in order to make it easier to support concurrent development.
I'm trying to wrap my mind around how I might use control flow for these purposes.
A data flow defines a flow of data from a source to a destination. Within it, you do not start on one task and move to the next; data flows between your selected components (sources, transformations, destinations).
Moreover, within a data flow task, you cannot perform operations such as iteration, executing other components, etc.
A control flow defines a workflow of tasks to be executed, often in a particular order (assuming you include precedence constraints). The looping example is a good example of a control-flow requirement, but you can also execute standalone SQL scripts, call into COM interfaces, execute .NET components, or send an email. A control flow task may not actually have anything whatsoever to do with a database or a file.
A control flow task does nothing in itself TO the data. It executes something that may (or may not) act upon data somewhere. The data flow task IS doing something with data: it defines the data's movement and transformation.
It should be obvious when to use control flow logic versus data flow logic, as each will be the only way to do its job. In your example, you cite the foreach container and state that you could connect to the spreadsheet in the data flow. Sure, for one spreadsheet; but how would you do it for multiple spreadsheets in a folder? In the data flow alone, you simply can't!
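Concretely, looping over every spreadsheet in a folder is a Foreach Loop container job in the control flow. A sketch of the configuration (the folder path and variable name here are made up for illustration):

```
Foreach Loop container (Foreach File enumerator):
  Folder:           C:\Imports\          -- hypothetical path
  Files:            *.xls
  Variable Mappings: index 0 -> User::FileName

Inside the loop: a Data Flow Task whose Excel connection manager carries a
property expression on its connection string built from @[User::FileName],
so each iteration reads a different file.
```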
Hope this helps.
Data flow - is just for moving data from one source to another.
Control flow - provides the logic for when data flow components are run and how they are run. A control flow can also perform looping, call stored procedures, move files, manage error handling, check a condition and call different tasks (including data flows) depending on the result, process a cube, trigger another process, etc.
If you're moving data from one location to another and it's the same each time, not based on any other condition, then you can get away with a package with just a data flow task, but in most cases packages are more complex than that.
We use the control flow for many things. First, all the data concerning the import is stored in tables, so we run procs at the start and end of the data flow so that our logging works correctly. We loop through sets of files, move files to archive locations, rename them with the date, and delete them from processing locations. We have a separate program that moves files and validates them for the correct columns and size; we run a proc to make sure a file has been validated before it goes into the data flow. Sometimes we have a requirement to send an email when a file is processed, or to send a report of records which could not be processed; these emails are put into the control flow. Sometimes we have clean-up steps that are more easily accomplished with a stored proc, and thus we put those steps in the control flow.
Trying to give a basic answer: the control flow performs operations, such as executing a SQL statement or sending an email. When a control flow task completes, it has either failed or succeeded.
The data flow, on the other hand, is one of the control flow items and offers the ability to move, modify, and manipulate data.
I'd like to pass a Data Flow from one package to another for the following reasons:
It would help in refactoring common logic in SSIS packages.
It would enable concurrent development of larger SSIS packages.
At first glance, the Execute Package Task sounded promising, but it looks like I can only pass fairly simple variables in and out of the package.
Is there a way to do this using SSIS?
cozyroc.com offers a third-party tool that can do this, I believe.
A bit of clarity, Paul: are you talking about 1) code reusability, or 2) allowing the results from one DFT to be used in another DFT?
The first you can't do in "native" SSIS, I believe - there is no set of DFT modules that you can call from other packages - but I would approach it by building a set of packages that are quite simple:
initialisation routines
DFT
cleanup
Then have variables passed to the child package, such as the table to be processed and the variable(s) to be selected from the source table.
It would require a very clever schema and some clever thinking about what the common DFT would do. But I think it would be possible.
The second is not possible without jumping through a few hoops, like saving result sets to temporary tables and then re-reading those tables into later DFTs; but then you would lose the data actually flowing through the task.
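A minimal sketch of that hoop-jumping, assuming a permanent handoff table you create yourself (the table name and columns here are hypothetical):

```sql
-- Hypothetical handoff table shared between two data flow tasks.
IF OBJECT_ID('dbo.DftHandoff') IS NULL
    CREATE TABLE dbo.DftHandoff (Id INT, Payload VARCHAR(200));  -- placeholder schema

TRUNCATE TABLE dbo.DftHandoff;
-- DFT 1: its destination writes rows into dbo.DftHandoff
-- DFT 2: its source reads   SELECT Id, Payload FROM dbo.DftHandoff;
```

The cost, as noted, is that the rows land on disk between the two DFTs instead of streaming through memory.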
What is the best way to design a SSIS package? I'm loading multiple dimensions and facts as part of a project. Would it be better to:
Have 1 package and 1 data flow with all data extract and load logic in 1 dataflow?
Have 1 package and multiple data flows with each data flow taking on the logic for 1 dimension?
Have 1 package per dimension and then a master package that calls them all?
After doing some research, 2 and 3 appear to be the more viable options. Are there any experts out there who want to share their experience and/or propose an alternative?
Microsoft's Project Real is an excellent example of many best practices:
Package Design and Config for Dimensional Modeling
Package logging
Partitioning
It's based on SQL 2005 but is very applicable to 2008. It supports your option #3.
You could also consider having multiple packages called by a SQL Server Agent job.
I would often go for option 3. This is the method used in the Kimball Microsoft Data Warehouse Toolkit book, which is worth a read.
http://www.amazon.co.uk/Microsoft-Data-Warehouse-Toolkit-Intelligence/dp/0471267155/ref=sr_1_1?ie=UTF8&s=books&qid=1245347732&sr=8-1
I think the answer is not quite as clear-cut... In the same way that there is often no "best" design for a DWH, I think there is no one "best" package method.
It is quite dependent on the number of dimensions and the number of related dimensions and the structure of data in your staging area.
I quite like the Project Real approaches (mentioned above), and especially thought the package logging was quite well done. I think I have read somewhere that Denali (SQL 2011) will have SSIS logging/tracking built in, but I am not sure of the details.
From a calling perspective, I would go for one SQL Agent job that calls a master package, which then calls all the child packages and manages the error handling/logic/emailing etc. between them, utilising log/error tables to track and manage the package flow. SSIS allows much more complex sets of logic than SQL Agent (e.g. call this child package only if all of tasks A, B and C have finished and task D has not).
Further, I would go for one package per snowflaked dimension, as usually one source table in the staging data will generate a number of snowflaked dimensions (e.g. DimProduct, DimProductCategory, DimProductSubCategory). It would make sense to have the data read in once in one data flow task (DFT) and written out to multiple tables. I would use one container per dimension for separation of logic.
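The read-once, write-to-many pattern inside that DFT is typically a Multicast transformation. A sketch of the layout (component and table names assumed for illustration):

```
OLE DB Source  (one query against the staging Product table)
  -> Multicast
       -> (dedupe/lookup as needed) -> OLE DB Destination: DimProductCategory
       -> (dedupe/lookup as needed) -> OLE DB Destination: DimProductSubCategory
       -> (lookup surrogate keys)   -> OLE DB Destination: DimProduct
```

Each Multicast output receives a full copy of the rows, so the staging table is only read once per load.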