SSIS package finished successfully while skipping several tasks

My ETL package finished successfully without executing the last several tasks:
Then I tried to run one task of the same type and skip the others:
After that I created a separate package with the last five tasks, and they ran great, as expected!
Question:
What happened to the flow in the first two figures? Why does the package skip several tasks without any warnings, errors, etc.?
Thanks a lot for any answers and ideas about this strange behavior!
[UPDATE] Answered by @Peter_R:
I changed both sp_updatestats inputs from AND to OR and everything is OK. The arrows changed to dotted ones:

The Logical AND constraint requires all preceding tasks to complete before running, so your SP_UpdateStats will not run until both ProcessFull and MeasureGroupSet Loop have completed.
I am guessing that after Deploy Data the Expression is designed to split the workflow depending on a condition you have set. Because of this, you will never have both ProcessFull and MeasureGroupSet Loop running in parallel, meaning that the SP_UpdateStats task will never run.
If you change both constraints connecting to SP_UpdateStats to Logical OR, it will run after either ProcessFull or MeasureGroupSet Loop has completed.
This is still the case if one of the preceding tasks is disabled; slightly odd, but still the case.

Using SQLAlchemy objects in a long running Celery task

I have a long-running task that updates some SQLAlchemy objects. A session is opened at the start of the task, updates are made along the way, and the transaction is committed at the end. The problem is that the task runs so long that the connection will have closed (timed out, "gone away", whatever you want to call it) before the commit can even happen. This causes the commit to fail and the whole task to fail.
This seems like absolutely the correct way to write to a DB for short tasks or non-Celery things, but it is certainly a problem if the tasks take too long.
Is there some other recommended pattern? Should the Celery task avoid using the SQLAlchemy objects directly, and instead use some sort of static class whose data is used to update the actual SQLAlchemy objects only at the end of the task? That is the only possible solution I have come up with. I would like to know if there are others, or if my idea has other problematic implications.
Open the transaction only when it must occur. The session should therefore be closed while the actual task is running, but the objects should be kept around and edited by the task. Then use merge to re-attach them at the end. Make sure the session is created with expire_on_commit set to False.
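A minimal sketch of that pattern, assuming SQLAlchemy 1.4+ and a typical sessionmaker setup (the Item model, connection URL, and field names are illustrative placeholders, not from the question):

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.orm import sessionmaker, declarative_base

    Base = declarative_base()

    class Item(Base):  # hypothetical model, for illustration only
        __tablename__ = "items"
        id = Column(Integer, primary_key=True)
        status = Column(String)

    engine = create_engine("sqlite:///example.db")  # placeholder URL
    # expire_on_commit=False keeps attribute values readable after the
    # session that loaded them is closed.
    Session = sessionmaker(bind=engine, expire_on_commit=False)

    def long_running_task(item_id):
        # Short-lived session just to load the object, then release the connection.
        session = Session()
        item = session.get(Item, item_id)
        session.close()  # item is now detached, but its attributes stay loaded

        # ... the hours of work happen here, editing the detached object freely ...
        item.status = "processed"

        # Fresh session (and fresh connection) only for the final, short transaction.
        session = Session()
        merged = session.merge(item)  # copies detached state onto a persistent instance
        session.commit()
        session.close()
        return merged

The connection is only held for the initial load and the final merge/commit, so it cannot time out during the long middle stretch.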

SSIS restart/re-trigger package automatically

Good day!
I have an SSIS package that retrieves data from a database and exports it to a flat file (a simple process). The issue I am having is that the data my package retrieves each morning depends on a separate process loading the data into a table before my package runs.
Now, the process that initially loads the data inserts metadata into a table showing the start and end date/time. I would like to set up something in my package that checks the metadata table for an end date/time for the current date. If the current date exists, then the process continues... if no date/time exists, then the process stops (here is the kicker) BUT the package re-triggers itself automatically an hour later to check whether the initial data load is complete.
I have done research on checkpoints, etc., but all that seems to cover is the package picking up where it left off when it is restarted after a failure. I don't want to manually re-trigger the process; I'd like it to check the metadata and restart itself if possible. I could even add logic so that after checking the metadata 3 times, it stops completely.
Thanks so much for your help
What you want isn't possible exactly the way you describe it. When a package finishes running, it's inert. It can't re-trigger itself; something has to re-trigger it.
That doesn't mean you have to do it manually. The way I would handle this is to have a SQL Server Agent job scheduled to run every hour for X hours a day. The job would call the package every time, and the metadata would tell the package whether it needs to do anything or just do nothing.
There would be a couple of ways to handle this.
They all start by setting up the initial check, just as you've outlined above: see if the data you need exists. Based on that, set a Boolean variable (I'll call it DataExists) to TRUE if your data is there or FALSE if it isn't. Or 1 or 0, or what have you.
Create two Precedence Constraints coming off that task: one that requires DataExists==TRUE and, obviously enough, another that requires DataExists==FALSE.
The TRUE path is your happy path. The package executes your code.
On the FALSE path, you have options.
Personally, I'd have the FALSE path lead to a forced failure of the package. From there, I'd set up the job scheduler to wait an hour, then try again. BUT I'd also set a limit on the retries. After X retries, go ahead and actually raise an error. That way, you'll get a heads up if your data never actually lands in your table.
If you don't want to (or can't) get that level of assistance from your scheduler, you could mimic the functionality in SSIS, but it's not without risk.
On your FALSE path, trigger an Execute SQL Task with a simple WAITFOR DELAY '01:00:00.00' command in it, then have that task call the initial check again when it's done waiting. This will consume a thread on your SQL Server and could end up getting dropped by the SQL engine if it becomes thread-starved.
Going the second route, I'd also set up an Iteration variable, increment it with each try, and set a limit in the precedence constraint to, again, raise an actual error if your data doesn't show up within a reasonable number of attempts.
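For intuition, here is the control flow that second route implements, sketched in Python rather than SSIS components (the two function bodies are stand-ins for your Execute SQL Tasks):

    import time

    MAX_ATTEMPTS = 3        # raise a real error after this many tries
    WAIT_SECONDS = 60 * 60  # the WAITFOR DELAY '01:00:00.00' equivalent

    def data_is_ready():
        # Stand-in for the metadata check: look for an end date/time
        # stamped with today's date.
        return False

    def run_main_workload():
        # Stand-in for the happy path: the actual export to the flat file.
        print("exporting...")

    def run_package():
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if data_is_ready():
                run_main_workload()
                return
            if attempt < MAX_ATTEMPTS:
                time.sleep(WAIT_SECONDS)  # wait an hour, then re-check
        # The data never arrived; fail loudly so someone gets a heads-up.
        raise RuntimeError("Load metadata never appeared after %d attempts" % MAX_ATTEMPTS)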
Thanks so much for your help! With some additional research I found the following article, which I was able to use to create a solution for my needs. Although my process doesn't require a failure to attempt a retry, I set the process to force-fail after 3 attempts.
http://microsoft-ssis.blogspot.com/2014/06/retry-task-on-failure.html
Much appreciated
Best wishes

SSIS Best Practice - Do 1 of 2 dozen things

I have a SSIS package that is processing a queue.
I currently have a single package that is broken into 3 containers:
1. gather some metadata
2. do the work
3. re-examine the metadata and update the queue with what we think happened (success or some flavor of failure)
I am not super happy with the speed; part of it is that I am running on a hamster-powered server, but that is out of my control.
The middle piece may offer an opportunity for an improvement...
There are 20 tables that may need to be updated.
Each queue item will update 1 table.
I currently have a sequence that contains 20 sequence containers.
They all do essentially the same thing, but I couldn't figure out a way to abstract them.
The first box in each is an empty script task. There is a conditional flow to 'the guts' if there is a match on table name.
So I open 20 sequence containers, hit 20 empty script tasks, and do 20 true/false checks.
Watching the yellow/green light show, this seems to be slow.
Is there a more efficient way? The only way I can think of to make it better is to have the 20 empty scripts outside the sequence containers; what that would save is opening the container. I can't believe opening a sequence container is all that expensive. Does it possibly re-verify every task in the container every time?
Just fishing, if anyone has any thoughts I would be very happy to hear them.
Thanks
Greg
Your main issue right now is that you are running this in BIDS. BIDS is designed to make development and debugging of packages easy, so yes, to your point, it validates all of the objects as it runs. Plus, the "yellow/green light show" is more overhead to show you what is happening in the package as it runs. You will get much better performance when you run it with DTExec or as part of a scheduled job on SQL Server. Are you logging your packages? If so, run from the server and look at the logs to verify how long the process actually takes there. If it is still taking too long at that point, then you can implement some of @registered user's ideas.
Are you running each of the tasks in parallel? If it has to cycle through all 60 objects serially, then your major room for improvement is running them in parallel. If you are trying to parallelize the processes, then you could use a few solutions:
Create all 60 objects, as 20 chains of 3 objects each. This is labor-intensive to set up, but it is the easiest to troubleshoot and allows you to customize it when necessary. Obviously this does not abstract anything away!
Create a parent package and a child package. The child package would contain the structure of what you want to execute, and the parent package would contain 20 Execute Package tasks. This is similar to option 1, but it offers the advantage that you only have one set of code to maintain for the 3-task sequence container. It likely means you will move to a table-driven metadata model. This works well in SSIS with the CozyRoc Data Flow Plus task if you are transferring data from one server to another. If you are doing everything on the same server, then you are probably really organizing stored procedure executions, which would be easy to do with this model.
Create a package that uses the CozyRoc Parallel Task and Data Flow Plus. This can allow you to encapsulate all the logic in one package and execute everything in parallel. WARNING: I tried this approach in SQL Server 2008 R2 with great success. However, when SQL Server 2012 was released, the CozyRoc Parallel Task did not behave the way it did in previous versions for me, due to some under-the-covers changes in SSIS. I logged this as a bug with CozyRoc, but as best as I know the issue has not been resolved (as of 4/1/2013). Also, this model may abstract away too much of the ETL and make initial loads and troubleshooting individual table loads more difficult in the future.
Personally, I use solution 1 since any of my team members can implement this code successfully. Metadata driven solutions are sexy, but much harder to code correctly.
May I suggest wrapping your 20 updates in a single stored procedure? Not knowing how variable your input data is, I don't know how suitable this is, but it is my first reaction.
Well - here is what I did...
I added a dummy task at the 'top' of the parent sequence container. From that I added 20 flow links, one to each of the child sequence containers (CSCs). Now each CSC gets opened only if necessary.
My throughput did increase by about 30% (26 rpm -> 34 rpm on minimal sampling).
I could go with either zman's answer or registeredUser's; both were helpful. I chose zman's because the real answer always starts with looking at the log to see exactly how long something takes (green/yellow is not very reliable in my experience).
thanks

SSIS Handling External Issues

I have an SSIS package that works fine. The package runs every night and takes about 4 hours to complete. I am a newb to SSIS, so I want to see what my options are. I am not finding anything on the web about these two issues, so any advice is greatly appreciated.
1. What do I do when I have an external issue such as a power failure or an accidental restart? Is there a way to alert someone, or to have the package begin again on restart?
2. A couple of weeks ago there was a process that got hung and locked a table, which kept the package from executing. What is the best way to ensure I have the proper access before starting and, if I don't, to get that access? I am OK with killing the processes, etc.
Looking for best practice info. Thanks
For #1 - there is no inherent "restart" mechanism in SSIS since, to start with, there is no inherent "start" mechanism. You'll have to look at whatever process you've got managing the scheduled execution of your packages, which I assume is SQL Agent.
Given that, your options for determining if a SQL Agent job failed, and/or restarting that job are the same whether the contents of the job are SSIS packages or not. There are quite a few stored procedures for monitoring and querying job execution and results. You could also implement your own mechanism for recording job/package status.
SSIS does offer "checkpoints" to help you restart packages from certain points, but the general consensus on that feature is that it is limited in its applicability - your mileage may vary.
Personally, I always include a failure route in my jobs to email someone when a job fails, and I configure my jobs and packages to be idempotent - that is, they can be re-run without fear of improperly performing the same operations twice. They either "reset" the environment (delete and reload), or they can detect exactly where they left off.
Item #2 is a difficult question and depends greatly on your environment and scenario. You can use simple tasks like an Execute SQL Task to run "test" commands that will fail if you lack sufficient privileges or if blocking locks exist. Or you may be able to inquire directly through SPs or other mechanisms to determine whether you need to take remedial action before you attempt to run the meat of your package.
Using Precedence Constraints "on failure" can assist with that kind of logic. So can Event Handlers.

How to make this SSIS scenario more parallel

I have a million rows in a database table. For each row I have to run a custom exe, parse the output, and update another database table.
How can I process multiple rows in parallel?
I now have a simple data flow task -> GetData -> Run Script (Run Process, Parse Output) -> Store Data.
For 6,000 rows it took 3 hours. Way too much.
There is a single bottleneck here: running the process for each row. Increasing "EngineThreads" would not help at all, as there will be only one thread running this particular script transform anyway. The time spent in the other transforms probably does not matter at all. Processes are heavyweight objects, and running thousands of them will never be cheap.
I can think of the following ideas to make it better:
1) The best way to fix it is to convert your custom EXE into an assembly and call it from the script transform, to avoid the overhead of creating processes, parsing the output, etc.
2) If you have to use separate processes, you can try to run them in parallel. It will help if the process mostly waits for input/output (i.e., it is I/O-bound). If the processes are memory-bound or CPU-bound, you will not gain much by running them in parallel.
2A) Complex script, simple package.
To run them in parallel, modify the ProcessInput method in your script to start the process asynchronously, and don't wait for the process to complete - move to the next row and create the next process. Subscribe to the process output and the Exited event so you know when it has finished. Limit the number of processes running in parallel - otherwise you'll run out of memory. Wait until all the processes are done before returning from the ProcessInput call.
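Here is the shape of that pattern, sketched in Python (the real SSIS script transform would be C# or VB.NET; "custom.exe" and parse_output are placeholders): processes are launched without blocking on each one individually, the number running at once is capped, and nothing returns until all of them have finished.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    MAX_PARALLEL = 8  # cap simultaneous processes so you don't run out of memory

    def parse_output(text):
        # Placeholder for whatever parsing the real package performs.
        return text.strip()

    def run_one(row):
        # Launch the custom exe for one row and capture its output.
        result = subprocess.run(["custom.exe", str(row)],
                                capture_output=True, text=True)
        return parse_output(result.stdout)

    def process_rows(rows):
        # map() returns only after every process has finished, mirroring
        # "wait until all the processes are done before returning".
        with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
            return list(pool.map(run_one, rows))

Each worker thread spends most of its time waiting on its child process, which is why this helps for I/O-bound work.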
2B) Simple script, complex package.
Keep the current sequential script, but partition the data using SSIS. Add a Conditional Split transform and split the input stream into multiple streams, based on some hash expression - something that will make each output receive approximately the same amount of data. The number of streams equals the number of process instances you want to run in parallel. Add your script transform to each output of the Conditional Split. Now you should also increase the "EngineThreads" property :) and these transforms will run in parallel. (Note: based on the tag, I assume you use SSIS 2008. You'll need to insert additional Union All transforms to make it work in SSIS 2005.)
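For intuition, the hash-based split amounts to the following (Python for illustration; in the SSIS Conditional Split you would use expressions along the lines of ID % 4 == 0, ID % 4 == 1, and so on):

    N_STREAMS = 4  # one stream per parallel script transform instance

    def stream_for(row_key):
        # Deterministically assign each row to a stream; a reasonably uniform
        # key gives each stream roughly the same amount of data.
        return hash(row_key) % N_STREAMS

    rows = range(100)
    buckets = {s: [r for r in rows if stream_for(r) == s] for s in range(N_STREAMS)}
    print({s: len(b) for s, b in buckets.items()})  # roughly equal sizes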
This should make it perform better, but a million processes is still a lot. You'll hardly get really good performance here.
If you are executing this process using the Data Flow container, then there is a property on it called "EngineThreads" which defaults to 5. You can set it to a higher number, like 20, which will devote more threads to processing those rows.
That is just a performance tweak or optimisation; if your SSIS package is still running really slowly, then I would perhaps address the architecture and design of your package.