Run 100+ SSIS packages in parallel from a parent package

I have 100+ child packages and I need to run them in parallel from a parent package. For this I would have to create 100+ Execute Package Tasks and then 100+ File Connections. That doesn't look appealing to me, and it is repetitive and error prone. Is there any other way to do this? Keep two things in mind:
Child package execution should be in parallel (so no For Loop container and the like)
I am using checkpoint-based restartability and hence need the control flow items to exist at design time (so no script-based solutions either)
UPDATE: Even if you have massive hardware, Windows limits the number of concurrent processes you can start simultaneously due to an inherent design issue. Though I achieved parallel execution using jobs, I had to limit it to 25 parallel packages at a time to avoid random failures caused by that Windows limit.

Does it have to be file connections? Have you looked at the option of storing the packages in the SSIS package store and referencing them from there?
You would still have your 100+ components, but not your 100+ file connections.

I give up. There is no way AFAIK. I decided to create 100+ jobs, one job per package, all using the same schedule. Creating the jobs was easier using dynamic SQL.
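For anyone taking the same route, here is a minimal sketch of scripting that job creation from C# against the standard msdb Agent procedures. The package folder, connection string, and shared schedule name are all hypothetical:

    using System.Data.SqlClient;
    using System.IO;

    class JobGenerator
    {
        static void Main()
        {
            // Hypothetical folder of child packages and a connection to msdb.
            string[] packages = Directory.GetFiles(@"C:\Packages", "*.dtsx");
            using (var conn = new SqlConnection("Server=.;Database=msdb;Integrated Security=true"))
            {
                conn.Open();
                foreach (string pkg in packages)
                {
                    string jobName = "Run_" + Path.GetFileNameWithoutExtension(pkg);
                    var cmd = conn.CreateCommand();
                    // Standard msdb procedures: create the job, add an SSIS step,
                    // attach a shared schedule, and target the local server.
                    cmd.CommandText = @"
    EXEC msdb.dbo.sp_add_job @job_name = @job;
    EXEC msdb.dbo.sp_add_jobstep @job_name = @job, @step_name = N'Run package',
         @subsystem = N'SSIS', @command = @cmdline;
    EXEC msdb.dbo.sp_attach_schedule @job_name = @job, @schedule_name = N'SharedSchedule';
    EXEC msdb.dbo.sp_add_jobserver @job_name = @job;";
                    cmd.Parameters.AddWithValue("@job", jobName);
                    cmd.Parameters.AddWithValue("@cmdline", "/FILE \"" + pkg + "\"");
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }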

You could create the package dynamically with EzAPI.
http://blogs.msdn.com/b/mattm/archive/2008/12/30/ezapi-alternative-package-creation-api.aspx
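EzAPI is a friendlier wrapper over the raw SSIS object model. To give a flavor of the generation approach, here is a rough sketch using the underlying object model (Microsoft.SqlServer.Dts.Runtime) directly rather than EzAPI itself; the package folder and output path are hypothetical:

    using System.IO;
    using Microsoft.SqlServer.Dts.Runtime;

    class ParentBuilder
    {
        static void Main()
        {
            var app = new Application();
            var parent = new Package();

            foreach (string file in Directory.GetFiles(@"C:\Packages", "*.dtsx"))
            {
                // One FILE connection manager per child package.
                ConnectionManager cm = parent.Connections.Add("FILE");
                cm.Name = Path.GetFileNameWithoutExtension(file);
                cm.ConnectionString = file;

                // One Execute Package Task per child, pointed at that connection.
                var th = (TaskHost)parent.Executables.Add("STOCK:ExecutePackageTask");
                th.Name = "Run " + cm.Name;
                th.Properties["Connection"].SetValue(th, cm.Name);
            }

            // No precedence constraints between the tasks, so SSIS is free to run them in parallel.
            app.SaveToXml(@"C:\Packages\Parent.dtsx", parent, null);
        }
    }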

Related

SSIS ETL solution needs to import 600,000 small simple files every hour. What would be optimal Agent scheduling interval?

The hardware, infrastructure, and redundancy are not in the scope of this question.
I am building an SSIS ETL solution that needs to import ~600,000 small, simple files per hour. With my current design, SQL Agent runs the SSIS package, and it takes "n" files and processes them.
The number of files per batch, "n", is configurable.
The SQL Agent schedule for the SSIS package execution is configurable.
I wonder if the above approach is the right choice. Or, alternatively, should I have an infinite loop in the SSIS package that keeps taking and processing files?
So the question boils down to a choice between an infinite loop and batch+schedule. Is there any other, better option?
Thank you
In a similar situation, I run an agent job every minute and process all files present. If the job takes 5 minutes to run because there are a lot of files, the agent skips the scheduled runs until the first one finishes, so there is no worry that two processes will conflict with each other.
Is SSIS the right tool?
Maybe. Let's start with the numbers:
600000 files / 60 minutes = 10,000 files per minute
600000 files / (60 minutes * 60 seconds) = 167 files per second.
Regardless of what technology you use, you're looking at some extremes here. Windows NTFS starts to choke at around 10k files in a folder, so you'll need to employ some folder strategy to keep that count down, in addition to regular maintenance.
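As a purely hypothetical illustration of such a folder strategy, incoming files could be bucketed into subfolders by a hash of the file name so that no single directory grows too large:

    using System;
    using System.IO;

    class FolderBucketer
    {
        // Route a file into one of 256 subfolders based on its name,
        // keeping any single NTFS directory well under the ~10k file mark.
        static string BucketPath(string root, string fileName)
        {
            int bucket = Math.Abs(fileName.GetHashCode()) % 256;
            string dir = Path.Combine(root, bucket.ToString("D3"));
            Directory.CreateDirectory(dir); // no-op if it already exists
            return Path.Combine(dir, fileName);
        }

        static void Main()
        {
            // Hypothetical landing zone and file name.
            Console.WriteLine(BucketPath(@"C:\Landing", "invoice_000123.txt"));
        }
    }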
In 2008, the SSIS team managed to load 1TB in 30 minutes, all sourced from disk, so SSIS can perform very well. It can also perform really poorly, which is how I've managed to gain ~36k SO Unicorn points.
6 years is a lifetime in the world of computing, so you may not need to take measures as drastic as the SSIS team did to set their benchmark, but you will need to look at their approach. I know you've stated that hardware is outside the scope of discussion, but it very much belongs in it. If the file system (SAN, NAS, local disk, flash, or whatever) can't serve 600k files an hour, then you'll never be able to clear your work queue.
Your goal is to get as many workers as possible engaged in processing these files. The Work Pile Pattern can be pretty effective to this end. Basically, a process asks: is there work to be done? If so, it takes a piece and goes to work on it. Then you scale up the number of workers asking for and doing work. The challenge is to ensure you have some mechanism that prevents workers from processing the same file. Maybe that's as simple as filtering by directory or file name, or some other mechanism that is right for your situation.
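A minimal sketch of one such claim mechanism, assuming a hypothetical dbo.FileQueue table that lists pending files; the READPAST hint lets concurrent workers skip rows another worker has already locked, so no two workers grab the same file:

    using System.Data.SqlClient;

    class Worker
    {
        // Claim up to 100 unprocessed files in one atomic statement;
        // OUTPUT returns exactly the rows this worker claimed.
        static void ClaimAndProcess(string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            {
                conn.Open();
                var cmd = new SqlCommand(@"
    UPDATE TOP (100) q
    SET    q.ClaimedBy = @@SPID
    OUTPUT inserted.FilePath
    FROM   dbo.FileQueue AS q WITH (ROWLOCK, READPAST)
    WHERE  q.ClaimedBy IS NULL;", conn);

                using (var rdr = cmd.ExecuteReader())
                {
                    while (rdr.Read())
                    {
                        string filePath = rdr.GetString(0);
                        // ... process the file here, then mark it done ...
                    }
                }
            }
        }
    }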
I think you're headed down this approach based on your problem definition with the agent jobs that handle N files, but I wanted to give your pattern a name for further research.
I would agree with Joe C's answer: schedule the SQL Agent job to run as frequently as needed. If it's already running, it won't spawn a second process. Perhaps you're going to have multiple agents that all start every minute - AgentFolderA, AgentFolderB... AgentFolderZZH - each launching a master package that then has subprocesses looking for work.
Use a WMI Event Watcher Task to detect whether a new file has arrived; as the next step you can call the job scheduler or execute the SSIS package directly.
More details on the WMI Event Watcher Task:
https://msdn.microsoft.com/en-us/library/ms141130%28v=sql.105%29.aspx
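For flavor, the same detection can be expressed outside SSIS with System.Management; this sketch uses the classic CIM_DirectoryContainsFile event query (the watched folder is hypothetical):

    using System;
    using System.Management;

    class FileArrivalWatcher
    {
        static void Main()
        {
            // Fire when a file appears in c:\incoming, polling every 10 seconds.
            string wql = @"SELECT * FROM __InstanceCreationEvent WITHIN 10
    WHERE TargetInstance ISA 'CIM_DirectoryContainsFile'
    AND TargetInstance.GroupComponent = 'Win32_Directory.Name=""c:\\\\incoming""'";

            using (var watcher = new ManagementEventWatcher(new WqlEventQuery(wql)))
            {
                watcher.WaitForNextEvent(); // blocks until a file arrives
                Console.WriteLine("New file detected; start the job or package here.");
            }
        }
    }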

Getting the maximum concurrent executables at runtime in SSIS

I have an SSIS package that executes some SQL tasks over a big list of servers. Since the number is quite big, I am trying to split the workload and process it in parallel. The problem is that I need to know exactly how many parts I can split it into, depending on the number of logical processors of the machine that runs it.
Is there any way to get the number of logical processors in SSIS so the work can be organized based on that?
A C# Script Task can return System.Environment.ProcessorCount: https://msdn.microsoft.com/en-us/library/system.environment.processorcount.aspx
Or, if you want more specific details, it looks like you need to execute WMI queries; see How to find the Number of CPU Cores via .NET/C#?
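Inside the Script Task, that might look like the following; the variable name User::ProcessorCount is hypothetical and would need to be listed in the task's ReadWriteVariables:

    // SSIS Script Task body (C#).
    public void Main()
    {
        // Hand the logical processor count back to the package.
        Dts.Variables["User::ProcessorCount"].Value = System.Environment.ProcessorCount;
        Dts.TaskResult = (int)ScriptResults.Success;
    }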

SSIS: Package Sequence and Performance

I have a main SSIS package that runs all my other packages. Even though some packages are not dependent on each other, is it always better for performance to put them in a sequence, or is it better to run them at the same time (no sequence)?
As Eric has mentioned, it truly depends on what the packages do, but if the packages touch different tables, in my limited experience I have seen better results with running packages in parallel. I would advise you to go by the dependencies and arrange the packages in sequence containers based on which ones can run in parallel. The SSIS engine does a pretty good job of running parallel tasks.

Executing multiple instances of the same package simultaneously through an application?

I have an SSIS package which is executed from an application. Can the same package be called multiple times simultaneously?
Yes, you can run the package simultaneously, but keep in mind that this can sometimes lead to a deadlock on the processed data (depending on the flow).
The package property MaxConcurrentExecutables defines how many executables can run at once. The default is -1, which means it depends on the number of available cores (threads), but you can change it to whatever you want.
Yes, that can be done.
You can use packages like classes and call multiple instances simultaneously in a parent package, providing each instance with a different set of parameters: inside a sequence container, call multiple instances of the same package with Execute Package Tasks that share the same package connection.
The maximum number of tasks that can run simultaneously depends on the number of cores in your machine:
Number of concurrent tasks = number of cores + 2
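If you launch the package from code, you can also override that cap explicitly; a small sketch against the SSIS object model, with a hypothetical package path:

    using Microsoft.SqlServer.Dts.Runtime;

    class Launcher
    {
        static void Main()
        {
            var app = new Application();
            Package parent = app.LoadPackage(@"C:\Packages\Parent.dtsx", null);

            // Default is -1, which SSIS interprets as (number of cores + 2).
            parent.MaxConcurrentExecutables = 16;

            DTSExecResult result = parent.Execute();
        }
    }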

SSIS Best Practice - Do 1 of 2 dozen things

I have an SSIS package that is processing a queue.
I currently have a single package that is broken into 3 containers:
1. gather some metadata
2. do the work
3. re-examine metadata, update the queue with what we think happened (success or flavor of failure)
I am not super happy with the speed; part of it is that I am running on a hamster-powered server, but that is out of my control.
The middle piece may offer an opportunity for an improvement...
There are 20 tables that may need to be updated.
Each queue item will update 1 table.
I currently have a sequence that contains 20 sequence containers.
They all do essentially the same thing, but I couldn't figure out a way to abstract them.
The first box in each is an empty Script Task. There is a conditional flow to 'the guts' if there is a match on table name.
So I open up 20 sequence containers, 20 empty Script Tasks, and do 20 T/F checks.
Watching the yellow/green light show, this seems to be slow.
Is there a more efficient way? The only way I can think of to make it better is to have the 20 empty scripts outside the sequence containers; what that would save is opening the container. I can't believe that opening a sequence container is all that expensive. Does it possibly re-verify every task in the container every time?
Just fishing, if anyone has any thoughts I would be very happy to hear them.
Thanks
Greg
Your main issue right now is that you are running this in BIDS. BIDS is designed to make development and debugging of packages easy, so yes, to your point, it validates all of the objects as it runs. Plus, the "yellow/green light show" is more overhead to show you what is happening in the package as it runs. You will get much better performance when you run it with DTExec or as part of a scheduled task from SQL Server. Are you logging your packages? If so, run from the server and look at the logs to verify how long the process actually takes on the server. If it is still taking too long at that point, then you can implement some of registered user's ideas.
Are you running each of the tasks in parallel? If it has to cycle through all 60 objects serially, then your major room for improvement is running them in parallel. If you are trying to parallelize the process, you have a few options:
Create all 60 objects, as 20 chains of 3. This is labor intensive to set up, but it is the easiest to troubleshoot and allows you to customize it when necessary. Obviously this does not abstract anything away!
Create a parent package and a child package. The child package would contain the structure of what you want to execute. The parent package contains 20 Execute Package tasks. This is similar to 1, but it offers the advantage that you only have one set of code to maintain for the 3-task sequence container. This likely means you will move to a table-driven metadata model. This works well in SSIS with the CozyRoc Data Flow Plus task if you are transferring data from one server to another. If you are doing everything on the same server, then you're really probably organizing stored procedure executions which would be easy to do with this model.
Create a package that uses the CozyRoc Parallel Task and Data Flow Plus. This can allow you to encapsulate all the logic in one package and execute all of them in parallel. WARNING: I tried this approach in SQL Server 2008 R2 with great success. However, when SQL Server 2012 was released, the CozyRoc Parallel Task did not behave for me the way it did in previous versions, due to some under-the-cover changes in SSIS. I logged this as a bug with CozyRoc, but as best as I know this issue has not been resolved (as of 4/1/2013). Also, this model may abstract away too much of the ETL and make initial loads and troubleshooting individual table loads in the future more difficult.
Personally, I use solution 1, since any of my team members can implement this code successfully. Metadata-driven solutions are sexy, but much harder to code correctly.
May I suggest wrapping your 20 updates in a single stored procedure? Not knowing how variable your input data is, I don't know how suitable this is, but it is my first reaction.
Well - here is what I did...
I added a dummy task at the 'top' of the parent sequence container. From it I added 20 flow links, one to each child sequence container (CSC). Now each CSC gets opened only if necessary.
My throughput did increase by about 30% (26 rpm -> 34 rpm on minimal sampling).
I could go with either zman's answer or registered user's. Both were helpful. I chose zman's because the real answer always starts with looking at the log to see exactly how long something takes (green/yellow is not real reliable in my experience).
thanks