SSIS Handling Extenal Issues - ssis

I have an SSIS package that works fine. The package runs every night and takes about 4 hours to complete. I have am a newb to SSIS, so I want to see what my options are. I am not finding anything on the web about these two issues, so any advice is greatly appreciated.
What to do when I have an external
issue such as a power
failure/accidental restart. Is there
a way to alert someone or have the
package begin again on restart.
A couple weeks ago there was a
process that got hung and locked
table, making the process not
execute. How is the best way to
handle ensuring I have the proper
access before starting and if not,
get the access. I am ok with killing
the processes etc.
Looking for best practice info. Thanks

For #1 - there is no inherent "restart" mechanism in SSIS, since to start with, there is no inherent "start" mechanism. You'll have to look at the process that you've got managing the scheduled execution of your packages, which I assume could be SQL Agent.
Given that, your options for determining if a SQL Agent job failed, and/or restarting that job are the same whether the contents of the job are SSIS packages or not. There are quite a few stored procedures for monitoring and querying job execution and results. You could also implement your own mechanism for recording job/package status.
SSIS does offer "checkpoints" to help you restart packages from certain points, but the general concensus on that feature is that it is limited in it's applicability - your mileage may vary.
Personally, I always include a failure route in my job to email someone on failure of the job, and configure my jobs and packages to be idempotent - that is, they can be re-run without fear of improperly conducting the same operations twice. They either "reset" the environment (delete and reload), or they can detect exactly where they left off.
Item #2 is a difficult question and depends greatly on your environment and scenario. You can use simple Tasks like an Execute SQL Task to run "test" commands that are tested to fail if sufficient privileges or locks exist. Or you may be able to inquire directly through SPs or other mechanisms to determine if you need to take remedial action before you attempt to run the meat of your package.
Using Precedence Constraints "on failure" can assist with that kind of logic. So can Event Handlers.

Related

SQL Server continuous job

I have a requirement to run a job continuously which includes a stored procedure. This stored procedure does a critical task where it processes huge load of data as they come. As I know, it is not allowed to run 2 or more instances of a job in the same time by SQL Server it self. So, my questions are
Is there a way to run SQL Sever job continuously?
Do continuously running jobs hurt performance of the server?
There are continuous replication jobs; however, those are continuous because of an inline switch used in the command line and not due to the job being scheduled as continuous.
The only way to emulate a continuous job is to simply have it run often. There is an option under scheduling to run the job down to every second 24/7/365. With that said, you will need to be careful that the job isn't overrunning itself and that it is efficient enough to not cause issues with your server.
Whether it will effect performance is going to be reliant on what it does. If the job only selects the current date/time (not a very useful thing to do but an example), I would not expect an issue; however, if it runs complicated algorithms then it almost certainly going to cause issues.
I would recommend running this on a test server before putting it into production.

How to manage server-side processes using MySQL

I have a perl script which takes in unique parameters (one of the parameters being --user=username_here). Users can start these processes using a web interface I am developing.
A MySQL table, transactions, keeps track of users that run the perl script
id user script_parameters execute last_modified
23 alex --user=alex --keywords=thisthat 0 2014-05-06 05:49:01
24 alex --user=alex --keywords=thisthat 0 2014-05-06 05:49:01
25 alex --user=alex --keywords=lg 0 2014-05-06 05:49:01
26 alex --user=alex --keywords=lg 0 2014-04-30 04:31:39
The execute value for a given row will be "1" if the process should be running. It is set to "0" if the process should be ended.
My perl script constantly checks this value to make sure it's not "0" and if it is, the perl script terminates.
However, I need to manage these process to protect against this problem:
What if my server abruptly crashes and restarts, OR the script crashes? I will need something running in the background, reading the transactions table and make sure it restarts the perl script as many times as needed using the appropriate parameters.
And so, I'm having trouble figuring out how to balance giving control to the user to manage his/her own transaction(s), while I also make sure that the transactions that SHOULD be running, ARE running, and those that AREN'T, AREN'T.
Hope that makes sense and I appreciate any help!
It seems you're trying to launch long-running processes from a web server and then track those processes in a database. That's not impossible, but not a recommended practice.
The main problem is that an HTTP request needs to be currently being handled in your web server for you do actually do anything (including track processes running on the system) -- you need something that can run all the time...
Instead, a better idea would be to have another daemonized "manager" process (as you mention perl, that'd be a good language to write it in) spawn & track the long running tasks (by PID and signals), and for that process to update your SQL database.
You can then have your "manager" process listen for requests to start a new process from your web server. There are various IPC mechanisms you could use. (e.g: signals, SysV shm, unix domain sockets, in-process queues like ZeroMQ, etc).
This has multiple benefits:
If your spawned scripts need to run with user/group based isolation (either from the system or each other), then your webserver doesn't need to run as root, nor be setgid.
If a spawned process "crashes", a signal will be delivered to the "manager" process, so it can track mis-executiions without issues.
If you use in-process queues (e.g: ZeroMQ) to deliver requests to the "manager" process, it can "throttle" requests from the web server (so that users cannot intentionally or accidentally cause D.O.S).
Whether or not the spawned process ends well, you don't need an 'active' HTTP request to the web server in order to update your tracking database.
As to whether something that should be running is running, that's really up to your semantics. (i.e: is it based on a known run time? based on data consumed? etc).
The check as to whether it is running can be two-fold:
The "manager" process updates the database as appropriate, including the spawned PID.
Your web server hosted code can actually list processes to determine if the PID in the database is actually running, and even how much time it's been doing something useful!
The check for whether it is not running would have to be based on convention:
Name the spawned processes something you can predict.
Get a process list to determine what's still running (defunct?) that shouldn't be.
In either case, you could either inform the users who requested the processes be spawned and/or actually do something about it.
One approach might be to have a CRON job which reads from the SQL database and does ps to determine which spawned processes need to be restarted, and then re-requests that the "manager" process does so using the same IPC mechanism used by the web server. How you differentiate starts vs. restarts in your tracking/monitoring/logging is up to you.
If the server itself loses power or crashes, then you could have the "manager" process perform cleanup when it first runs, e.g:
Look for entries in the database for spawned processes that were alegedly running before the server was shut down.
Check for those processes by PID and run time (this is important).
Either re-spawn the spawned proceses that didn't complete, or store something in the database to indicate to the web server that this was the case.
Update #1
Per your comment, here are some pointers to get started:
You mentioned perl, so presuming you have some proficiency there -- here are some perl modules to help you on your way to writing the "manager" process script:
If you're not already familiar with it CPAN is the repository for perl modules that do basically anything.
Daemon::Daemonize - To daemonize process so that it will continue running after you log out. Also provides methods for writing scripts to start/stop/restart the daemon.
Proc::Spawn - Helps with 'spawning' child scripts. Basically does fork() then exec(), but also handles STDIN/STDOUT/STDERR (or even tty) of child process. You could use this to launch your long-running perl scripts.
If your web server front-end code is not already written in perl, you'll need something that's pretty portable for inter-process message-passing and queuing; I'd probably make your web server front end in something easy to deploy (like PHP).
Here are two possibilities (there are many more):
Perl and PHP implementations for the Spread Toolkit.
Perl and PHP implementations for the ZeroMQ library.
Proc::ProcessTable - You can use this check on running processes (and get all sorts of stats as discussed above).
Time::HiRes - Use the high-granularity time functions from this package to implement your 'throttling' framework. Basically just limit the number of requests you de-queue per unit of time.
DBI (with mysql) - Update your MySQL database from the "manager" process.

SSIS Best Practice - Do 1 of 2 dozen things

I have a SSIS package that is processing a queue.
I currently have a singel package that is broken into 3 containers
1. gather some meta data
2. do the work
3. re-examine meta data, update the queue w/ what we think happened (success of flavor of failure )
I am not super happy with the speed, part of it is that I am running on a hamster powered server, but that is out of my control.
The middle piece may offer an opportunity for an improvement...
There are 20 tables that may need to be updated.
Each queue item will update 1 table.
I currently have a sequence that contains 20 sequence containers.
They all do essentially the same thing, but I couldnt figure out a way to abstract them.
The first box in each is an empty script action. There is a conditional flow to 'the guts' if there is a match on tablename.
So I open up 20 sequence tasks, 20 empty script tasks and do 20 T/F checks.
Watching the yellow/green light show, this seems to be slow.
Is there a more efficient way? The only way I can think to make it better is to have the 20 empty scripts outside the sequence containers. What that would save is opening the container. I cant believe that is all that expensive to open a sequence container. Does it possibly reverify every task in the container every time?
Just fishing, if anyone has any thoughts I would be very happy to hear them.
Thanks
Greg
Your main issue right now is that you are running this in BIDS. This is designed to make development and debugging of packages easy, so yes to your point it validates all of the objects as it runs. Plus, the "yellow/green light show" is more overhead to show you what is happening in the package as it runs. You will get much better performance when you run it with DTSExec or as part of a scheduled task from Sql server. Are you logging your packages? If so, run from the server and look at the logs to verify how long the process actually takes on the server. If it is still taking too long at that point, then you can implement some of #registered user 's ideas.
Are you running each of the tasks in parallel? If it has to cycle through all 60 objects serially, then your major room for improvement is running each of these in parallel. If you are trying to parallelize the processes, then you could do a few solutions:
Create all 60 objects, each chains of 3 objects. This is labor intensive to setup, but it is the easiest to troubleshoot and allows you to customize it when necessary. Obviously this does not abstract away anything!
Create a parent package and a child package. The child package would contain the structure of what you want to execute. The parent package contains 20 Execute Package tasks. This is similar to 1, but it offers the advantage that you only have one set of code to maintain for the 3-task sequence container. This likely means you will move to a table-driven metadata model. This works well in SSIS with the CozyRoc Data Flow Plus task if you are transferring data from one server to another. If you are doing everything on the same server, then you're really probably organizing stored procedure executions which would be easy to do with this model.
Create a package that uses the CozyRoc Parallel Task and Data Flow Plus. This can allow you to encapsulate all the logic in one package and execute all of them in parallel. WARNING I tried this approach in SQL Server 2008 R2 with great success. However, when SQL Server 2012 was released, the CozyRoc Parallel Task did not behave the way it did in previous versions for me due to some under the cover changes in SSIS. I logged this as a bug with CozyRoc, but as best as I know this issue has not been resolved (as of 4/1/2013). Also, this model may abstract away too much of the ETL and make initial loads and troubleshooting individual table loads in the future more difficult.
Personally, I use solution 1 since any of my team members can implement this code successfully. Metadata driven solutions are sexy, but much harder to code correctly.
May I suggest wrapping your 20 updates in a single stored procedure. Not knowing how variable your input data is, I don't know how suitable this is, but this is my first reaction.
well - here is what I did....
I added a dummy task at the 'top' of the parent sequence container. From that I added 20 flow links to each of the child sequence containers (CSC). Now each CSC gets opened only if necessary.
My throughput did increase by about 30% (26 rpm--> 34 rpm on minimal sampling).
I could go w/ either zmans answer or registeredUsers. Both were helpful. I choose zmans because the real answer always starts with looking at the log to see exactly how long something takes (green/yellow is not real reliable in my experience).
thanks

Saving Execution plans before we restart SQL service? Can this even be done?

SQL server deletes all execution plans when SQL service is restarted. Is there any way these can be saved before we restart the service?
No.
The closest thing you can provide is plan guides, but even with plan guides the server will have to recompile every query the first time it is submitted after the restart. Plan guides can theoretically reduce the amount of work required during query compilation.
As Martin suggested, you can pull information out about the query plans, if you want to do analysis, but you can't put that information back in later:
SELECT query_plan FROM sys.dm_exec_cached_plans CROSS APPLY sys.dm_exec_query_plan(plan_handle)
If you are truly concerned, then set up a suite of queries to run after a service restart that will populate the plan cache. But I think that is a waste of time.
The mere fact that the SQL Server service will be down means that users will experience an outage. This already requires planning and communication. Understanding that such a restart's effects have a larger scope than merely the denial of service during the restart time--the first few minutes of usage may be slow--simply changes your planning and communication by a bit. If some kind of theoretical slowness is a big problem, you'll have to plan the downtime more carefully or wait for a more opportune time--2am instead of 2pm perhaps.
I say "theoretical slowness" because no one may experience any true issue with recompiling queries afterward. Queries generally compile very quickly.

Advantage of SSIS package over windows scheduled exe

I have an exe configured under windows scheduler to perform timely operations on a set of data.
The exe calls stored procs to retrieve data and perform some calcualtions and updates the data back to a different database.
I would like to know, what are the pros and cons of using SSIS package over scheduled exe.
Do you mean what are the pros and cons of using SQL Server Agent Jobs for scheduling running SSIS packages and command shell executions? I don't really know the pros about windows scheduler, so I'll stick to listing the pros of SQL Server Agent Jobs.
If you are already using SQL Server Agent Jobs on your server, then running SSIS packages from the agent consolidates the places that you need to monitor to one location.
SQL Server Agent Jobs have built in logging and notification features. I don't know how Windows Scheduler performs in this area.
SQL Server Agent Jobs can run more than just SSIS packages. So you may want to run a T-SQL command as step 1, retry if it fails, eventually move to step 2 if step 1 succeeds, or stop the job and send an error if the step 1 condition is never met. This is really useful for ETL processes where you are trying to monitor another server for some condition before running your ETL.
SQL Server Agent Jobs are easy to report on since their data is stored in the msdb database. We have regualrly scheduled subscriptions for SSRS reports that provide us with data about our jobs. This means I can get an email each morning before I come into the office that tells me if everything is going well or if there are any problems that need to be tackled ASAP.
SQL Server Agent Jobs are used by SSRS subscriptions for scheduling purposes. I commonly need to start SSRS reports by calling their job schedules, so I already have to work with SQL Server Agent Jobs.
SQL Server Agent Jobs can be chained together. A common scenario for my ETL is to have several jobs run on a schedule in the morning. Once all the jobs succeed, another job is called that triggers several SQL Server Agent Jobs. Some jobs run in parallel and some run serially.
SQL Server Agent Jobs are easy to script out and load into our source control system. This allows us to roll back to earlier versions of jobs if necessary. We've done this on a few occassions, particularly when someone deleted a job by accident.
On one ocassion we found a situation where Windows Scheduler was able to do something we couldn't do with a SQL Server Agent Job. During the early days after a SAN migration we had some scripts for snapshotting and cloning drives that didn't work in a SQL Server Agent Job. So we used a Windows Scheduler task to run the code for a while. After about a month, we figured out what we were missing and were able to move the step back to the SQL Server Agent Job.
Regarding SSIS over exe stored procedure calls.
If all you are doing is running stored procedures, then SSIS may not add much for you. Both approaches work, so it really comes down to the differences between what you get from a .exe approach and SSIS as well as how many stored procedures that are being called.
I prefer SSIS because we do so much on my team where we have to download data from other servers, import/export files, or do some crazy https posts. If we only had to run one set of processes and they were all stored procedure calls, then SSIS may have been overkill. For my environment, SSIS is the best tool for moving data because we move all kinds of types of data to and from the server. If you ever expect to move beyond running stored procedures, then it may make sense to adopt SSIS now.
If you are just running a few stored procedures, then you could get away with doing this from the SQL Server Agent Job without SSIS. You can even parallelize jobs by making a master job start several jobs via msdb.dbo.sp_start_job 'Job Name'.
If you want to parallelize a lot of stored procedure calls, then SSIS will probably beat out chaining SQL Server Agent Job calls. Although chaining is possible in code, there's no visual surface and it is harder to understand complex chaining scenarios that are easy to implement in SSIS with sequence containers and precedence constraints.
From a code maintainability perspective, SSIS beats out any exe solution for my team since everyone on my team can understand SSIS and few of us can actually code outside of SSIS. If you are planning to transfer this to someone down the line, then you need to determine what is more maintainable for your environment. If you are building in an environment where your future replacement will be a .NET programmer and not a SQL DBA or Business Intelligence specialist, then SSIS may not be the appropriate code-base to pass on to a future programmer.
SSIS gives you out of the box logging. Although you can certainly implement logging in code, you probably need to wrap everything in try-catch blocks and figure out some strategy for centralizing logging between executables. With SSIS, you can centralize logging to a SQL Server table, log files in some centralized folder, or use another log provider. Personally, I always log to the database and I have SSRS reports setup to help make sense of the data. We usually troubleshoot individual job failures based on the SQL Server Agent Job history step details. Logging from SSIS is more about understanding long-term failure patterns or monitoring warnings that don't result in failures like removing data flow columns that are unused (early indicator for us of changes in the underlying source data structure) or performance metrics (although stored procedures also have a separate form of logging in our systems).
SSIS give you a visual design surface. I mentioned this before briefly, but it is a point worth expanding upon on its own. BIDS is a decent design surface for understanding what's running in what order. You won't get this from writing do-while loops in code. Maybe you have some form of a visualizer that I've never used, but my experience with coding stored procedure calls always happened in a text editor, not in a visual design layer. SSIS makes it relatively easy to understand precedence and order of operations in the control flow which is where you would be working if you are using execute sql tasks.
The deployment story for SSIS is pretty decent. We use BIDS Helper (a free add-in for BIDS), so deploying changes to packages is a right click away on the Solution Explorer. We only have to deploy one package at a time. If you are writing a master executable that runs all the ETL, then you probably have to compile the code and deploy it when none of the ETL is running. SSIS packages are modular code containers, so if you have 50 packages on your server and you make a change in one package, then you only have to deploy the one changed package. If you setup your executable to run code from configuration files and don't have to recompile the whole application, then this may not be a major win.
Testing changes to an individual package is probably generally easier than testing changes in an application. Meaning, if you change one ETL process in one part of your code, you may have to regression test (or unit test) your entire application. If you change one SSIS package, you can generally test it by running it in BIDS and then deploying it when you are comfortable with the changes.
If you have to deploy all your changes through a release process and there are pre-release testing processes that you must pass, then an executable approach may be easier. I've never found an effective way to automatically unit test a SSIS package. I know there are frameworks and test harnesses for doing this, but I don't have any experience with them so I can't speak for the efficacy or ease of use. In all of my work with SSIS, I've always pushed the changes to our production server within minutes or seconds of writing the changes.
Let me know if you need me to elaborate on any points. Good luck!
If you have dependency on Windows features, like logging, eventing, access to windows resources- go windows scheduler/windows services route. If it is just db to db movement or if you need some kind of heavy db function usage- go SSIS route.