I am trying to design an SSIS package in which the first step reads data from a table and then, for each record, runs a VB script through an Execute Process task. The scripts should run in parallel, driven by the output of step 1.
I understand SSIS supports For Loop containers and parallel processing for repetitive tasks, but I cannot use a For Loop because it is not parallel, and I cannot lay out parallel tasks at design time because their number depends on the input data. Step 1 could return 0, 1, or 10 records (all of which have to be executed in parallel).
We don't have the ability to use the Script Component.
Any suggestions are much appreciated.
thanks
SSIS is pretty restricted when it comes to parallel execution. If you can't use script components or script tasks, it's even worse.
However, you can still create a fixed number of Execute Process tasks and use a parameter or variable to control how many of them run and which values are passed to them. As you might guess, this leaves the bitter taste of the question "What if I need several more tasks?".
You might want to consider purchasing a third-party component; several are available online.
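The fixed-slot idea can be sketched outside SSIS to make the trade-off concrete. The snippet below is a minimal illustration in plain Python, not SSIS code: the function name `assign_to_slots` and the `enabled`/`argument` fields are hypothetical stand-ins for the package variables that would enable or disable each pre-built task and carry its input value.

```python
def assign_to_slots(records, slot_count):
    """Map each record to one of a fixed number of pre-built task slots.

    Hypothetical helper: each slot stands for one Execute Process task,
    'enabled' for the variable that decides whether it runs, and
    'argument' for the value passed to it.
    """
    if len(records) > slot_count:
        # This is the "what if I need several more tasks?" case.
        raise ValueError(
            f"{len(records)} records but only {slot_count} task slots"
        )
    slots = []
    for index in range(slot_count):
        if index < len(records):
            slots.append({"enabled": True, "argument": records[index]})
        else:
            # Unused slots are switched off instead of being removed.
            slots.append({"enabled": False, "argument": None})
    return slots
```

With 10 slots and 3 records, three slots are enabled and seven are switched off; an eleventh record cannot be placed, which is exactly the limitation mentioned above.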
I can see I'm not the only person who has hit timeouts with the SSIS Transfer Database Objects task. However, using it for the extract phase of an ETL must be fairly common, so I'm trying to establish the usual or accepted way to do this.
I have a web application that uses Entity Framework to generate ~250 tables, some of which occasionally have schema updates.
The bulk of the transform and load portion of our ETL is handled by a series of stored procedures. These read from a copy of the application's tables, which is initially loaded by the Transfer Database Objects task.
Initially, we set up an SSIS package that simply ran the Transfer Database Objects task, and then kicked off the stored proc. That meant that the job was fairly resilient to change, and the only changes required were changes to the stored proc, if and when a schema update affected the tables that were used therein.
Unfortunately, as one of our application instances has grown over time, the Transfer Database Objects task is reaching the point where I'm regularly seeing Timeout errors. Those don't appear to be connection timeouts, or anything I can control on the server side, and from what I can see, I can't amend the CommandTimeout on the underlying SMO stuff within that Task.
I can see that some people manually craft their extract, running a separate Data Flow task to pull the information from each table, which has the obvious bonus that these can run in parallel. In my case, however, that means an initial chunk of work to craft 250ish of these, plus a maintenance task whenever the schema changes on the source database, no matter how minor.
I've come across Biml, which looked like a possible way to at least ease that overhead, but it doesn't appear to run on VS2017 yet.
Does anyone follow any particular patterns for this? Or, if I do need individual Data Flow tasks, is there some way to automate the schema update, perhaps using some kind of SSIS automation and something from Entity Framework?
It turns out the easiest way around this is to write a clone of the Transfer task, but with appropriate additions to allow more control over batching and timeouts etc. Details are available in this article: https://blogs.msdn.microsoft.com/mattm/2007/04/18/roll-your-own-transfer-sql-server-objects-task/
I have a pipeline that takes data from a MySQL server and inserts it into Datastore using the Dataflow runner.
It works fine as a batch job that executes once. The thing is that I want to get new data from the MySQL server into Datastore in near real time, but JdbcIO provides a bounded source (as it is the result of a query), so my pipeline executes only once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline rerun automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it allows you to efficiently find the elements that have appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)), which gives you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it takes a PCollection of query parameters, so it can be triggered every time a new element appears in a PCollection.
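The "find what appeared since the last sync" step can be sketched independently of Beam. The following is a minimal illustration in plain Python, using an in-memory SQLite database as a stand-in for MySQL; the `events` table, its `inserted_at` column, and the function names are all assumptions made for the example. In a real pipeline, each tick of the periodic sequence would trigger one such query with the last saved timestamp as its parameter.

```python
import sqlite3


def fetch_new_rows(conn, last_seen):
    """Return rows inserted after last_seen, plus the updated watermark."""
    rows = conn.execute(
        "SELECT id, payload, inserted_at FROM events "
        "WHERE inserted_at > ? ORDER BY inserted_at",
        (last_seen,),
    ).fetchall()
    # Keep the old watermark when nothing new has arrived.
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


def demo():
    """Simulate two polling ticks against a hypothetical events table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "id INTEGER PRIMARY KEY, payload TEXT, inserted_at INTEGER)"
    )
    conn.executemany(
        "INSERT INTO events (payload, inserted_at) VALUES (?, ?)",
        [("a", 1), ("b", 2)],
    )
    first, watermark = fetch_new_rows(conn, 0)
    # A row arrives between the two ticks.
    conn.execute(
        "INSERT INTO events (payload, inserted_at) VALUES (?, ?)", ("c", 3)
    )
    second, watermark = fetch_new_rows(conn, watermark)
    conn.close()
    return [r[1] for r in first], [r[1] for r in second], watermark
```

The second tick only sees the row that arrived after the first one, which is why an insertion timestamp (or a similar monotonically increasing column) makes this approach cheap.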
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (also emitting the records to Pub/Sub) is a very valid approach too.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline rerun automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
I have an SSIS package that executes some SQL tasks over a big list of servers. Since the list is quite long, I am trying to split the workload and process it in parallel. The problem is that I need to know exactly how many parts I can split it into, depending on the number of logical processors on the machine that runs it.
Is there any way to get the number of logical processors in SSIS so the work can be organized based on that?
Use a C# Script Task that returns System.Environment.ProcessorCount: https://msdn.microsoft.com/en-us/library/system.environment.processorcount.aspx
Or, if you want more specific details, it looks like you need to execute WMI queries; see the question "How to find the Number of CPU Cores via .NET/C#?".
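Once the processor count is known, splitting the server list is simple arithmetic. Here is a minimal sketch in Python rather than C#: `os.cpu_count()` plays the role of System.Environment.ProcessorCount, and the `split_evenly` helper and the example server names are hypothetical.

```python
import os


def split_evenly(items, parts):
    """Split items into at most `parts` contiguous chunks of similar size."""
    # Never create more chunks than there are items, and at least one.
    parts = max(1, min(parts, len(items) or 1))
    base, extra = divmod(len(items), parts)
    chunks = []
    start = 0
    for index in range(parts):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if index < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# os.cpu_count() can return None, so fall back to a single worker.
logical_processors = os.cpu_count() or 1
servers = [f"server{n:02d}" for n in range(1, 11)]  # hypothetical names
workload = split_evenly(servers, logical_processors)
```

Each chunk in `workload` could then feed one parallel branch of the package. Whether one branch per logical processor is the right ratio depends on how much time the SQL tasks spend waiting on the remote servers.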
I have an SSIS package that is processing a queue.
I currently have a single package that is broken into 3 containers:
1. gather some meta data
2. do the work
3. re-examine meta data, update the queue w/ what we think happened (success or flavor of failure)
I am not super happy with the speed, part of it is that I am running on a hamster powered server, but that is out of my control.
The middle piece may offer an opportunity for an improvement...
There are 20 tables that may need to be updated.
Each queue item will update 1 table.
I currently have a sequence that contains 20 sequence containers.
They all do essentially the same thing, but I couldn't figure out a way to abstract them.
The first box in each is an empty script action. There is a conditional flow to 'the guts' if there is a match on tablename.
So I open up 20 sequence tasks, 20 empty script tasks and do 20 T/F checks.
Watching the yellow/green light show, this seems to be slow.
Is there a more efficient way? The only way I can think of to make it better is to have the 20 empty scripts outside the sequence containers. What that would save is opening the container. I can't believe it is all that expensive to open a sequence container. Does it possibly reverify every task in the container every time?
Just fishing, if anyone has any thoughts I would be very happy to hear them.
Thanks
Greg
Your main issue right now is that you are running this in BIDS. BIDS is designed to make developing and debugging packages easy, so yes, to your point, it validates all of the objects as it runs. The "yellow/green light show" also adds overhead, since it exists to show you what is happening in the package as it runs. You will get much better performance when you run the package with DTExec or as part of a scheduled task from SQL Server. Are you logging your packages? If so, run from the server and look at the logs to verify how long the process actually takes there. If it still takes too long at that point, you can implement some of registered user's ideas.
Are you running each of the tasks in parallel? If the package has to cycle through all 60 objects serially, then your biggest room for improvement is running them in parallel. If you want to parallelize the processes, there are a few solutions:
1. Create all 60 objects as 20 chains of 3 objects each. This is labor intensive to set up, but it is the easiest to troubleshoot and allows you to customize it when necessary. Obviously this does not abstract away anything!
2. Create a parent package and a child package. The child package would contain the structure of what you want to execute. The parent package contains 20 Execute Package tasks. This is similar to solution 1, but it offers the advantage that you only have one set of code to maintain for the 3-task sequence container. This likely means you will move to a table-driven metadata model. This works well in SSIS with the CozyRoc Data Flow Plus task if you are transferring data from one server to another. If you are doing everything on the same server, then you're probably just organizing stored procedure executions, which would be easy to do with this model.
3. Create a package that uses the CozyRoc Parallel Task and Data Flow Plus. This can allow you to encapsulate all the logic in one package and execute all of them in parallel. WARNING: I tried this approach in SQL Server 2008 R2 with great success. However, when SQL Server 2012 was released, the CozyRoc Parallel Task did not behave for me the way it did in previous versions, due to some under-the-covers changes in SSIS. I logged this as a bug with CozyRoc, but as best as I know this issue has not been resolved (as of 4/1/2013). Also, this model may abstract away too much of the ETL and make initial loads and troubleshooting individual table loads more difficult in the future.
Personally, I use solution 1 since any of my team members can implement this code successfully. Metadata driven solutions are sexy, but much harder to code correctly.
May I suggest wrapping your 20 updates in a single stored procedure. Not knowing how variable your input data is, I don't know how suitable this is, but this is my first reaction.
Well, here is what I did.
I added a dummy task at the 'top' of the parent sequence container. From that I added 20 flow links to each of the child sequence containers (CSC). Now each CSC gets opened only if necessary.
My throughput did increase by about 30% (26 rpm to 34 rpm on minimal sampling).
I could go with either zman's answer or registeredUser's. Both were helpful. I chose zman's because the real answer always starts with looking at the log to see exactly how long something takes (green/yellow is not very reliable in my experience).
thanks
I have a million rows in a database table. For each row I have to run a custom exe, parse the output, and update another database table.
How can I process multiple rows in parallel?
I now have a simple data flow task: GetData -> Run Script (Run Process, Parse Output) -> Store Data.
For 6000 rows it took 3 hours. Way too much.
There is a single bottleneck here: running a process for each row. Increasing "EngineThreads" would not help at all, as only one thread will be running this particular script transform anyway. The time spent in other transforms probably does not matter at all. Processes are heavyweight objects, and running thousands of them will never be cheap.
I can think of the following ideas to make it better:
1) The best way to fix it is to convert your custom EXE into an assembly and call it from the script transform - to avoid the overhead of creating processes, parsing the output etc.
2) If you have to use the separate processes, you can try to run these processes in parallel. It will help if the process mostly waits for some input/output (i.e. it is I/O bound). If the processes are memory bound or CPU bound, you would not win much by running them in parallel.
2A) Complex script, simple package.
To run them in parallel, modify the ProcessInput method in your script to start each process asynchronously and not wait for it to complete; move on to the next row and create the next process. Subscribe to the process output and the process Exited event so you know when it has finished. Limit the number of processes running in parallel, otherwise you'll run out of memory. Wait until all the processes are done before returning from the ProcessInput call.
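The same pattern (start several processes, cap how many run at once, wait for all of them before returning) can be sketched in Python. This is an illustration of the idea and not the SSIS script itself: a thread pool stands in for the asynchronous start and the Exited event, and a small inline Python command stands in for the custom exe, whose real command line and output format are not known here.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the custom exe: it echoes its argument in upper case.
FAKE_EXE = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]


def run_one(argument):
    """Run the external program for one row and return its parsed output."""
    completed = subprocess.run(
        FAKE_EXE + [argument],
        capture_output=True,
        text=True,
        check=True,  # raise if the process exits with a non-zero code
    )
    return completed.stdout.strip()


def run_all(arguments, max_parallel=4):
    """Process every row, with at most max_parallel processes alive at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() keeps the results in input order; leaving the with-block
        # waits for all outstanding processes to finish.
        return list(pool.map(run_one, arguments))
```

The `max_parallel` cap is the important part: it bounds memory use, and for CPU-bound executables there is little to gain from setting it above the number of cores.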
2B) Simple script, complex package.
Keep the current sequential script, but partition the data using SSIS. Add a Conditional Split transform and split the input stream into multiple streams based on some hash expression, chosen so that each output receives approximately the same amount of data. The number of streams equals the number of process instances you want to run in parallel. Add your script transform to each output of the Conditional Split. Now you should also increase the "EngineThreads" property :) and these transforms will run in parallel. (Note: based on the tag, I assume you use SSIS 2008. You'll need to insert additional Union All transforms to make it work in SSIS 2005.)
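The partitioning step amounts to routing each row by a key modulo the number of streams. Below is a minimal sketch in Python of what the Conditional Split would compute, assuming the rows carry an integer `id` column; that column name and the `partition_by_modulo` helper are hypothetical.

```python
def partition_by_modulo(rows, stream_count, key):
    """Route each row to one of stream_count outputs using key(row) % count."""
    streams = [[] for _ in range(stream_count)]
    for row in rows:
        streams[key(row) % stream_count].append(row)
    return streams


rows = [{"id": n} for n in range(10)]
streams = partition_by_modulo(rows, 3, key=lambda row: row["id"])
sizes = [len(stream) for stream in streams]
```

With sequential ids the streams come out nearly equal in size. If the key is skewed or is not an integer, a proper hash of the key would be a better choice than a plain modulo.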
This should make it perform better, but millions of processes is a lot. You'll hardly get really good performance here.
If you are executing this process in a Data Flow task, the task has a property called "EngineThreads", which defaults to 5 in SSIS 2005 and 10 in SSIS 2008 and later. You can set it to a higher number, such as 20, which lets the engine devote more threads to processing those rows.
That is just a performance tweak or optimisation. If your SSIS package is still running really slowly, I would look at the architecture and design of the package.