I have an SSIS package that executes a SQL task against a large list of servers. Because the list is so large, I am trying to split the workload and process it in parallel. The problem is that I need to know exactly how many parts to split it into, depending on the number of logical processors on the machine that runs it.
Is there any way to get the number of logical processors in SSIS so the work can be organized based on that?
Use a C# script task that returns System.Environment.ProcessorCount: https://msdn.microsoft.com/en-us/library/system.environment.processorcount.aspx
Or, if you want more specific details, it looks like you need to execute WMI queries; see "How to find the Number of CPU Cores via .NET/C#?".
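Once the processor count is known, splitting the server list is simple arithmetic. The following is a minimal sketch in Python rather than C#, using os.cpu_count() as a stand-in for Environment.ProcessorCount (both report logical processors). The function name split_workload and the server names are made up for illustration.

```python
import os

def split_workload(servers, parts=None):
    """Split `servers` into `parts` nearly equal chunks.

    If `parts` is omitted, the logical processor count is used.
    """
    if parts is None:
        parts = os.cpu_count() or 1  # os.cpu_count() may return None
    # Never create more chunks than there are servers, and at least one.
    parts = max(1, min(parts, len(servers) or 1))
    # Round-robin slicing keeps chunk sizes within one item of each other.
    return [servers[i::parts] for i in range(parts)]

if __name__ == "__main__":
    servers = [f"server{i:02d}" for i in range(10)]  # hypothetical names
    for chunk in split_workload(servers, parts=4):
        print(chunk)
```

Each chunk can then be handed to one parallel branch of the package.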
Related
I have an SSIS package that is executed from an application. Can the same package be called simultaneously?
Yes, you can run the package simultaneously, but keep in mind that this can sometimes lead to deadlocks on the processed data (depending on the flow).
The package property MaxConcurrentExecutables defines how many executables can run at once. The default is -1, in which case the limit depends on the available cores (threads), but you can change it to whatever you want.
Yes, that can be done.
You can treat packages like classes and run multiple instances simultaneously from a parent package: place several Execute Package tasks in a Sequence container, point them all at the same package through the same package connection, and give each one a different set of parameters.
The maximum number of tasks that can run simultaneously depends on the number of cores in your machine:
Number of concurrent tasks = number of cores + 2
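As a worked example of that rule, here is a small Python sketch that resolves a MaxConcurrentExecutables-style setting to an actual task count. The function name is made up, and rejecting zero or other negative values is an assumption of this sketch.

```python
def effective_max_concurrent(setting, logical_processors):
    """Resolve a MaxConcurrentExecutables-style setting to a task count."""
    if setting == -1:
        # Default: number of cores + 2, as described above.
        return logical_processors + 2
    if setting < 1:
        # Assumption of this sketch: other non-positive values are invalid.
        raise ValueError("setting must be -1 or a positive integer")
    return setting

if __name__ == "__main__":
    print(effective_max_concurrent(-1, 4))  # quad-core machine, default setting
    print(effective_max_concurrent(8, 4))   # explicit override
```

So on a quad-core machine the default allows six tasks at once, while an explicit value is used as given.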
I am trying to design an SSIS package where the first step gets data from a table and, for each record, executes a VB script using an Execute Process task, in parallel, based on the output from step 1.
I understand that SSIS supports a For loop and parallel processing for repetitive tasks, but I cannot use a For loop because it is not parallel, and I cannot design the parallel tasks up front because their number depends on the input data. Step 1 could return 0, 1 or 10 records (which have to be executed in parallel).
We don't have the ability to use the Script component.
Any suggestions are much appreciated.
thanks
SSIS is pretty restricted when it comes to parallel execution. If you can't use script components / script tasks, it's even worse.
However, you can still create a fixed number of Execute Process tasks and control, via parameters / variables, how many of them are executed and which values are passed to them. But as you might guess, this leaves the nagging question, "What if I need several more tasks?"
You might want to consider purchasing a third-party component; there are several available on the net.
I have 100+ child packages and I need to run them in parallel from a parent package. For this I will have to create 100+ Execute Package tasks and then 100+ file connections. This doesn't look appealing to me, and it is repetitive and error prone. Is there any other way to do this? Keep two things in mind:
Child package execution should be in parallel (so no For loop and the like).
I am using checkpoint-based restartability and hence need the control flow items at compile time (so no script component based solutions either).
UPDATE: Even if you have massive hardware, Windows limits the number of concurrent tasks you can start simultaneously due to an inherent design issue. Though I achieved parallel execution using jobs, I had to limit it to 25 parallel packages at a time to avoid random failures due to the Windows issue.
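The 25-at-a-time cap from the update can be illustrated outside SSIS with a bounded worker pool. This is only a sketch of the throttling idea in Python, not the job-based setup described above. The names run_throttled and run_package are made up, and the package paths would be your own.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_package(package_path):
    """Run one package with dtexec and return its exit code."""
    return subprocess.run(["dtexec", "/F", package_path]).returncode

def run_throttled(items, worker, max_parallel=25):
    """Call worker(item) for every item, at most max_parallel at a time.

    Results are returned in the same order as the input items.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(worker, items))

if __name__ == "__main__":
    # Stand-in worker so the sketch runs without SSIS installed.
    print(run_throttled(range(5), lambda n: n * n, max_parallel=2))
```

Passing run_package as the worker, together with a list of .dtsx paths, would start at most 25 dtexec processes at once.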
Does it have to be file connections? Have you looked at the option of storing the packages in the SSIS package store and referencing them from there?
You would still have your 100+ components, but not your 100+ file connections.
I give up. There is no way AFAIK. I decided to create 100+ jobs, one job per package and using the same schedule. Creating jobs was easier using Dynamic SQL.
You could create the package dynamically with EzAPI.
http://blogs.msdn.com/b/mattm/archive/2008/12/30/ezapi-alternative-package-creation-api.aspx
I use Hudson to automate the testing of a very large, important product. I want my testing hosts to run as many concurrent builds as they can theoretically support, with one exception: Excel tests, of which at most one may run per machine at any time. Any number of non-Excel tests can run concurrently.
Background:
Most of my tests are normal unit-tests - the sort of thing that I can easily run in parallel. Unfortunately a substantial and time consuming part of my unit-testing plan consists of tests which have been implemented in Excel.
You might think it crazy to implement a test in Excel, but there's an important reason: most of our users access our system via Excel. Excel has its own quirky ways of handling data, so the only way to guarantee that our stuff works for Excel users is to literally reg-test our application in Excel.
I've written a test-runner tool which allows me to easily fire off a group of Excel tests: each test is a single .xls file, and each group is a folder full of Excel files. I've got about 30 groups which need to be run for an end-to-end test. My tool converts the result of each test into JUnit-style XML which Hudson is able to understand. The tests use the pywin32 (win32com) library to automate Excel. When run on their own they are reliable.
I've got a group of computers which are dedicated to running tests. Each machine is quad-core and can theoretically run quite a lot of stuff at once. Unfortunately, I've found that COM cannot be used to safely control more than one Excel instance per machine at a time.
That is to say, if a second build starts which tries to talk to Excel via COM, it might interfere with the one which is already running and cause both tests to fail.
I can run as many other non-Excel processes as the machine will allow, but I need to find a way to keep Hudson from launching more than one process that requires Excel on any one machine concurrently.
Sounds like the Locks and Latches plugin might help you.
http://hudson.gotdns.com/wiki/display/HUDSON/Locks+and+Latches+plugin
Isn't Hudson Java?
Since you've tagged this post python, I'll point out that buildbot has slave locks to limit individual steps on individual slaves (or you can use them as coarser locks if you'd like).
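Whichever CI tool holds the lock, the underlying idea is one exclusive, machine-wide lock around the Excel step. Purely to illustrate that idea, here is a minimal Python sketch that a test runner could wrap around its Excel work. The lock file name and the helper machine_lock are made up, and a crashed process would leave a stale lock file behind, which this sketch does not handle.

```python
import os
import tempfile
import time
from contextlib import contextmanager

# Hypothetical lock file location, shared by all builds on one machine.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "excel_tests.lock")

@contextmanager
def machine_lock(path=LOCK_PATH, timeout=None, poll=0.5):
    """Hold an exclusive lock file; wait (or time out) while it exists."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails atomically if the file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if deadline is not None and time.monotonic() >= deadline:
                raise TimeoutError(f"lock {path} is still held")
            time.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the holder's PID
        yield
    finally:
        os.close(fd)
        os.remove(path)

if __name__ == "__main__":
    with machine_lock():
        print("Excel tests would run here, one build at a time")
```

Non-Excel tests simply never take the lock, so they remain free to run concurrently.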
I have a million rows in a database table. For each row I have to run a custom exe, parse the output and update another database table.
How can I process multiple rows in parallel?
I now have a simple dataflow task ->GetData->Run Script (Run Process , Parse Output)->Store Data
For 6000 rows it took 3 hours. Way too much.
There is a single bottleneck here: running the process for each row. Increasing "EngineThreads" would not help at all, as there will be only one thread running this particular script transform anyway. The time spent in other transforms probably does not matter at all. Processes are heavyweight objects, and running thousands of them will never be cheap.
I can think of following ideas to make it better:
1) The best way to fix it is to convert your custom EXE into an assembly and call it from the script transform - to avoid the overhead of creating processes, parsing the output etc.
2) If you have to use the separate processes, you can try to run these processes in parallel. It will help if the process mostly waits for some input/output (i.e. it is I/O bound). If the processes are memory bound or CPU bound, you would not win much by running them in parallel.
2A) Complex script, simple package.
To run them in parallel, modify the ProcessInput method in your script to start the process asynchronously, and don't wait for the process completion - move to the next row and create the next process. Subscribe to process output and process Exited event, so you know when it has finished. Limit the number of processes run in parallel - otherwise you'll run out of memory. Wait until all the processes are done before returning from ProcessInput call.
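The control flow of 2A (start without waiting, cap the number in flight, wait for everything before returning) can be sketched as follows. This is Python with asyncio rather than a C# ProcessInput method, so it only illustrates the pattern. The executable name custom.exe, its argument, and the function names are assumptions.

```python
import asyncio

async def run_exe(row):
    """Run the (hypothetical) tool for one row and return its output."""
    proc = await asyncio.create_subprocess_exec(
        "custom.exe", str(row),  # hypothetical executable and argument
        stdout=asyncio.subprocess.PIPE,
    )
    out, _ = await proc.communicate()  # returns once the process has exited
    return out.decode().strip()

async def process_rows(rows, run_one=run_exe, max_parallel=8):
    """Start one task per row, never more than max_parallel at once,
    and return only after every row has finished."""
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(row):
        async with gate:  # blocks while max_parallel rows are in flight
            return await run_one(row)

    return await asyncio.gather(*(guarded(r) for r in rows))

if __name__ == "__main__":
    async def fake(row):  # stand-in so the sketch runs without custom.exe
        await asyncio.sleep(0.01)
        return row * 2

    print(asyncio.run(process_rows(range(5), run_one=fake, max_parallel=2)))
```

The semaphore plays the role of the "limit the number of processes" rule, and gather is the "wait until all the processes are done" step.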
2B) Simple script, complex package.
Keep the current sequential script, but partition the data using SSIS. Add conditional split transform, and split the input stream into multiple streams, based on some hash expression - something that will make each output to receive approximately the same amount of data. The number of streams equals the number of process instances you want to run in parallel. Add your script transform to each output of conditional split. Now you should also increase "Engine Threads" property :) and these transforms will run in parallel. (Note: based on tag, I assume you use SSIS 2008. You'll need to insert additional Union All transforms to make it work in SSIS 2005).
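The partitioning in 2B amounts to routing each row by a key modulo the number of streams. Here is a minimal Python sketch of that routing, assuming each row has an integer key. The function name and the choice of modulo as the "hash expression" are illustrative.

```python
def partition(rows, n_streams, key=lambda row: row):
    """Route each row to one of n_streams lists by key(row) % n_streams."""
    streams = [[] for _ in range(n_streams)]
    for row in rows:
        # Consecutive integer keys spread evenly across the streams.
        streams[key(row) % n_streams].append(row)
    return streams

if __name__ == "__main__":
    for i, stream in enumerate(partition(range(10), 3)):
        print(i, stream)
```

In a Conditional Split the same idea would be one condition per output, each testing the key modulo the stream count against a different remainder.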
This should make it perform better, but millions of processes is a lot. You'll hardly get really good performance here.
If you are executing this process using the "data flow" container, then there is a property on it called "EngineThreads" which defaults to a value of 5. You can set it to a higher number like 20, which will devote more threads to processing those rows.
That is just a performance tweak or optimisation; if your SSIS package is still running really slowly, then I would look at the architecture and design of your package.