Nightly data integration using NServiceBus - ssis

In the investment firm I work for, we have several daily data integration tasks that every night/early morning.
Most if not all are done using SSIS, and they are all scheduled to start at certain times. They work (as in the SSIS packages do their job).
However, we have serious problems when it comes to tackling with dependent processes. For example, if SSIS package is scheduled to run at 3AM does not get the FTP file from a vendor, then dependent processes that are set to run at 3:30AM and 4AM all fail.
Currently we have about 8 different inhouse applications/endpoints, with data coming from around 6 external vendors and the vendor list is beginning to grow and we expect it would around 20 or so in a year.
We don't want to go BizTalk route for financial/complexity/implementation difficulties.
I wanted to create a event-driven approach to this, where the SSIS/nightly processes are oblivious to each other, as they are, yet need subscribe to dependent parent processes to start/stop.
At a glance I feel NServiceBus would give us that flexibility. It lets us keep what we already have and work, but yet give an event driven mechanism.
I need some input here.

Related

Active Collab 5 Webhooks / Maintaining "metric" data

I have an application I am working on that basically takes the data from Active Collab and creates reports / graphs out of the data. The API itself is insufficient to get the proper data on a per request basis so I resorted to pulling the data down into a separate data set that can be queried more efficiently.
So in order to avoid needing to query the entire API constantly I decided to make use of webhooks in order to make the transformations to the relevant data and lower the need to resync the data.
However I notice not all events are sent, notably the following.
TaskListUpdated
MemberUpdated
TimeRecordUpdated
ProjectUpdated
There is probably more but these are the main ones I noticed so far,
Time reports is probably the most important, in fact it missing from webhooks means that almost any application has a good chance of incorrect data if it needs time record data. Its fairly common to do a typo in a time record and then adjust it later.
So am I missing anything here? Is there some way to see these events reliably?
EDIT:
In order to avoid a long comment to Ilija I am putting the bulk here.
Webhooks apart, what information do you need to pull? API that powers
time tracking reports can do all sorts of cross project filtering, so
your approach to keep a separate database may be an overkill.
Basically we are doing a multi-variable tiered time report. It can be sorted / grouped by any conceivable method you may want to look at.
http://www.appsmagnet.com/product/time-reports-plus/
This is the closest to what we are trying to do, back when we used Active Collab 4 this did the job, but even with it we had to consolidate it in our own spreadsheets.
So the idea of this is to better integrate our Active Collab data into our own workflow.
So the main data we are looking for in this case is
Job Types
Projects
Task Lists
Tasks
Time Records
Categories
Members / Clients
Companies
These items can feed not only our reports, but many other aspects of our company as well. For us Active Collab is the point of truth, so we want the data quickly accessible and fully query-able.
So I have set up a sync system that initially grabs all the data it can from Active Collab and then uses a mix of cron's and webhooks to keep it up to date.
Cron jobs work well for all aspects that do not have "sub items" (projects/tasks/task lists/time records). So those I need to rely on the webhook since syncing them takes to much time to be able to keep it up to date in real time.
For the webhook I noticed the above do not carry through. Time Records I figured out a way around it listed in my answer, and member can be done through the cron. However Task list and project updating are the only 2 of some concern. Project is fairly important as the budget can change and that would be used in reports, task lists has the start / end dates that could be used as well. Since going through every project / task list constantly to see if there is a change is really not a great idea I am looking for a way to reliably see updates for them.
I have based this system on https://developers.activecollab.com/api-documentation/ but I know there are at least a few end points that are not listed.
Cross-project time-record filtering using Active Collab 5 API
This question is actually from another developer on the same system (and also shows a TrackingFilter report not listed in the docs). Due to issues with maintaining an accurate set of data we had to adapt it. I actually notice that you (Ilija) are the person replying and did recommend we move over to this style of system.
This is not a total answer but a way to solve the issue with TimeRecordUpdated not going through the webhook.
There is another API endpoint for /whats-new This endpoint describes changes for the last day or so and it has a category called TrackingObjectUpdatedActivityLog this refers to an updated time record.
So I set up a cron job to check this fairly consistently and manually push the TimeRecordUpdated event through my system to keep it consistent.
For MemberUpdated since the data for a member being updated is unlikely to affect much, having a daily cron for checking the users seems good enough.
ProjectUpdated could technically be considered the same, but with the absence of TaskListUpdated that leads to far to many api calls to sync the data. I have not found a solution for this yet unfortunately.

SSIS ETL solution needs to import 600,000 small simple files every hour. What would be optimal Agent scheduling interval?

The hardware, infrastructure, and redundancy are not in the scope of this question.
I am building an SSIS ETL solution needs to import ~600,000 small, simple files per hour. With my current design, SQL Agent runs the SSIS package, and it takes ā€œnā€ number of files and processes them.
Number of files per batch ā€œnā€ is configurable
The SQL Agent SSIS package execution is configurable
I wonder if the above approach is a right choice? Or alternatively, I must have an infinite loop in the SSIS package and keep taking/processing the files?
So the question boils down to a choice between infinite loop vs. batch+schedule. Is there any other better option?
Thank you
In a similar situation, I run an agent job every minute and process all files present. If the job takes 5 minutes to run because there are alot of files, the agent skips the scheduled runs until the first one finishes so there is no worry that two processes will conflict with each other.
Is SSIS the right tool?
Maybe. Let's start with the numbers
600000 files / 60 minutes = 10,000 files per minute
600000 files / (60 minutes * 60 seconds) = 167 files per second.
Regardless of what technology you use, you're looking at some extremes here. Windows NTFS starts to choke around 10k files in a folder so you'll need to employ some folder strategy to keep that count down in addition to regular maintenance
In 2008, the SSIS team managed to load 1TB in 30 minutes which was all sourced from disk so SSIS can perform very well. It can also perform really poorly which is how I've managed to gain ~36k SO Unicorn points.
6 years is a lifetime in the world of computing so you may not need to take such drastic measures as the SSIS team did to set their benchmark but you will need to look at their approach. I know you've stated the hardware is outside of the scope of discussion but it very much is inclusive. If the file system (san, nas, local disk, flash or whatever) can't server 600k files then you'll never be able to clear your work queue.
Your goal is to get as many workers as possible engaged in processing these files. The Work Pile Pattern can be pretty effective to this end. Basically, a process asks: Is there work to be done? If so, I'll take a bit and go work on it. And then you scale up the number of workers asking and doing work. The challenge here is to ensure you have some mechanism to prevent workers from processing the same file. Maybe that's as simple as filtering by directory or file name or some other mechanism that is right for your situation.
I think you're headed down this approach based on your problem definition with the agent jobs that handle N files but wanted to give your pattern a name for further research.
I would agree with Joe C's answer - schedule the SQL Agent job to run as frequently as needed. If it's already running, it won't spawn a second process. Perhaps you're going to have multiple agents that all start every minute - AgentFolderA, AgentFolderB... AgentFolderZZH and they are each launching a master package that then has subprocesses looking for work.
Use WMI Event viewer watcher to know if new file arrived or not and next step you can call job scheduler to execute or execute direct the ssis package.
More details on WMI event .
https://msdn.microsoft.com/en-us/library/ms141130%28v=sql.105%29.aspx

SSIS Best Practice - Do 1 of 2 dozen things

I have a SSIS package that is processing a queue.
I currently have a singel package that is broken into 3 containers
1. gather some meta data
2. do the work
3. re-examine meta data, update the queue w/ what we think happened (success of flavor of failure )
I am not super happy with the speed, part of it is that I am running on a hamster powered server, but that is out of my control.
The middle piece may offer an opportunity for an improvement...
There are 20 tables that may need to be updated.
Each queue item will update 1 table.
I currently have a sequence that contains 20 sequence containers.
They all do essentially the same thing, but I couldnt figure out a way to abstract them.
The first box in each is an empty script action. There is a conditional flow to 'the guts' if there is a match on tablename.
So I open up 20 sequence tasks, 20 empty script tasks and do 20 T/F checks.
Watching the yellow/green light show, this seems to be slow.
Is there a more efficient way? The only way I can think to make it better is to have the 20 empty scripts outside the sequence containers. What that would save is opening the container. I cant believe that is all that expensive to open a sequence container. Does it possibly reverify every task in the container every time?
Just fishing, if anyone has any thoughts I would be very happy to hear them.
Thanks
Greg
Your main issue right now is that you are running this in BIDS. This is designed to make development and debugging of packages easy, so yes to your point it validates all of the objects as it runs. Plus, the "yellow/green light show" is more overhead to show you what is happening in the package as it runs. You will get much better performance when you run it with DTSExec or as part of a scheduled task from Sql server. Are you logging your packages? If so, run from the server and look at the logs to verify how long the process actually takes on the server. If it is still taking too long at that point, then you can implement some of #registered user 's ideas.
Are you running each of the tasks in parallel? If it has to cycle through all 60 objects serially, then your major room for improvement is running each of these in parallel. If you are trying to parallelize the processes, then you could do a few solutions:
Create all 60 objects, each chains of 3 objects. This is labor intensive to setup, but it is the easiest to troubleshoot and allows you to customize it when necessary. Obviously this does not abstract away anything!
Create a parent package and a child package. The child package would contain the structure of what you want to execute. The parent package contains 20 Execute Package tasks. This is similar to 1, but it offers the advantage that you only have one set of code to maintain for the 3-task sequence container. This likely means you will move to a table-driven metadata model. This works well in SSIS with the CozyRoc Data Flow Plus task if you are transferring data from one server to another. If you are doing everything on the same server, then you're really probably organizing stored procedure executions which would be easy to do with this model.
Create a package that uses the CozyRoc Parallel Task and Data Flow Plus. This can allow you to encapsulate all the logic in one package and execute all of them in parallel. WARNING I tried this approach in SQL Server 2008 R2 with great success. However, when SQL Server 2012 was released, the CozyRoc Parallel Task did not behave the way it did in previous versions for me due to some under the cover changes in SSIS. I logged this as a bug with CozyRoc, but as best as I know this issue has not been resolved (as of 4/1/2013). Also, this model may abstract away too much of the ETL and make initial loads and troubleshooting individual table loads in the future more difficult.
Personally, I use solution 1 since any of my team members can implement this code successfully. Metadata driven solutions are sexy, but much harder to code correctly.
May I suggest wrapping your 20 updates in a single stored procedure. Not knowing how variable your input data is, I don't know how suitable this is, but this is my first reaction.
well - here is what I did....
I added a dummy task at the 'top' of the parent sequence container. From that I added 20 flow links to each of the child sequence containers (CSC). Now each CSC gets opened only if necessary.
My throughput did increase by about 30% (26 rpm--> 34 rpm on minimal sampling).
I could go w/ either zmans answer or registeredUsers. Both were helpful. I choose zmans because the real answer always starts with looking at the log to see exactly how long something takes (green/yellow is not real reliable in my experience).
thanks

SSIS Handling Extenal Issues

I have an SSIS package that works fine. The package runs every night and takes about 4 hours to complete. I have am a newb to SSIS, so I want to see what my options are. I am not finding anything on the web about these two issues, so any advice is greatly appreciated.
What to do when I have an external
issue such as a power
failure/accidental restart. Is there
a way to alert someone or have the
package begin again on restart.
A couple weeks ago there was a
process that got hung and locked
table, making the process not
execute. How is the best way to
handle ensuring I have the proper
access before starting and if not,
get the access. I am ok with killing
the processes etc.
Looking for best practice info. Thanks
For #1 - there is no inherent "restart" mechanism in SSIS, since to start with, there is no inherent "start" mechanism. You'll have to look at the process that you've got managing the scheduled execution of your packages, which I assume could be SQL Agent.
Given that, your options for determining if a SQL Agent job failed, and/or restarting that job are the same whether the contents of the job are SSIS packages or not. There are quite a few stored procedures for monitoring and querying job execution and results. You could also implement your own mechanism for recording job/package status.
SSIS does offer "checkpoints" to help you restart packages from certain points, but the general concensus on that feature is that it is limited in it's applicability - your mileage may vary.
Personally, I always include a failure route in my job to email someone on failure of the job, and configure my jobs and packages to be idempotent - that is, they can be re-run without fear of improperly conducting the same operations twice. They either "reset" the environment (delete and reload), or they can detect exactly where they left off.
Item #2 is a difficult question and depends greatly on your environment and scenario. You can use simple Tasks like an Execute SQL Task to run "test" commands that are tested to fail if sufficient privileges or locks exist. Or you may be able to inquire directly through SPs or other mechanisms to determine if you need to take remedial action before you attempt to run the meat of your package.
Using Precedence Constraints "on failure" can assist with that kind of logic. So can Event Handlers.

How do I ensure that only one of a certain category of job runs at once in Hudson?

I use Hudson to automate the testing of a very large important product. I want to have my testing-hosts able to run as many concurrent builds as they will theoretically support with the exception of excel-tests which must only run one per machine at any time. Any number of non-excel tests can run concurrently, however at most one excel test at a time must run per machine.
Background:
Most of my tests are normal unit-tests - the sort of thing that I can easily run in parallel. Unfortunately a substantial and time consuming part of my unit-testing plan consists of tests which have been implemented in Excel.
You might think it crazy to implement a test in Excel - actually there's an important reason: Most of our users access our system via a Excel. Excel has it's own quirky ways of handling data so the only way to guarantee that our stuff works for Excel users is to literally implement our reg-test our application Excel.
I've written a test-runner tool which allows me to easily fire off a group of excel tests: Each test is a single .xls file. Each group is a folder full of excel files. I've got about 30 groups which need to be run for an end-to-end test. My tool converts the result of each of the tests into JUnit style XML which Hudson is able to understand. The tests use the pywin32com library to automate excel. When run on their own they are reliable.
I've got a group of computers which are dedicated to running tests. Each machine is quad-core and can theoretically run quite a lot of stuff at once. Unfortunately I've found that COM cannot be used to safely control more than 1 excel per machine at a time.
That is to say if a 2nd build stars which tries to talk to Excel via COM it might interfere with the one which is already running and cause both tests to fail.
I can run as many other non-excel processes as the machine will allow but I need to find a way so that Hudson does not attempt to launch any more than 1 process which requires excel on any one machine concurrently.
Sounds like the Locks and Latches plugin might help you.
http://hudson.gotdns.com/wiki/display/HUDSON/Locks+and+Latches+plugin
Isn't hudson java?
Since you've tagged this post python, I'll point out that buildbot, has slave locks to limit individual steps on individual slaves (or use them as more coarse locks if you'd like).