SSIS Transformation level logging - sql-server-2008

Is it possible to log all the transformations in SSIS? I have a custom logging solution which logs the data for each of the control flow elements, but I would like to do the same at the transformation level if it is possible. If not, that's OK... but I'd still like to know whether it's possible or not.
For example, if the package were to fail at the Lookup, could I capture the exact reason why it failed and what data it failed on? Additionally, could I capture the outputs of the Source, Derived Column, and other components?

I found the answer to this. In the canned logging that comes with SSIS, you can select an event called "PipelineComponentTime". This gives the execution time of each component in milliseconds. The only catch is that you will have to filter the information in your log tables, because of the amount of log data this generates.
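For reference, with the SQL Server log provider on SQL Server 2008 these entries land in the dbo.sysssislog table, so the filtering can be a simple query. A minimal sketch, where the execution GUID is a placeholder you'd take from your own run:

    -- Component timings for one execution only, filtering out the rest of the (very chatty) log.
    -- Replace the GUID below with the ExecutionID of the run you care about.
    SELECT  source    AS component_name,
            message   AS timing_detail,
            starttime
    FROM    dbo.sysssislog
    WHERE   event = 'PipelineComponentTime'
      AND   executionid = '00000000-0000-0000-0000-000000000000'
    ORDER BY starttime;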

Related

Is there a good way for surfacing the time data was last updated in a Workshop Module?

Is there a good way to surface when a dataset backing an object was last built in a Workshop module? This would be helpful for giving users of the module a view on data freshness.
The ideal situation is that your data encodes the relevant information about how fresh it is; for instance, if your object type represents "flights", then you can write a Function that sorts and returns the most recent flight and present its departure timestamp as the "latest" update, since it represents the most recent data available.
The next best approach would be to have a last_updated column or similar that either comes from the source system or is added during the sync step. If the data connection is a JDBC or similar connection, this is straightforward; something like select *, now() as last_updated_timestamp. If you have a file-based connection, you might need to get a bit more creative. This still falls short of accurately conveying the actual "latest data" available in the object type, but it at least lets the user know when the last extract from the source system occurred.
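As a rough illustration of that sync-step approach, reusing the hypothetical flights example from above, the extract query could simply stamp every row it pulls:

    -- Hypothetical JDBC sync query: every extracted row carries the time it was pulled,
    -- so the newest last_updated_timestamp reflects the last extract from the source.
    SELECT f.*,
           now() AS last_updated_timestamp
    FROM   flights f;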
There are API endpoints for various services in Foundry related to executing schedules and builds, but metadata from these can be misleading if presented to users as an indication of data freshness, because they don't actually "know" anything about the data itself. For example, you might get the timestamp of when the latest pipeline build started, but if the source system has a 4-hour lag before data is synced to the relevant export tables, then you'll still be "off". So again, it's best to get this from inside your data wherever possible.

SSIS Is it a good idea to set ValidateExternalMetadata to false?

I have an SSIS package that reads from and writes to a table using OLE DB source and target. Recently we've added two new columns to the table. My SSIS package doesn't use these columns, but now I'm getting "The external columns are out of synchronization with the data source" warnings when I open the package.
I tried to run the package and it finished successfully, but I can see these warnings in the execution results. I could refresh the metadata of course, but there are many packages that are running in Production and using this table, so I don't think it's a good idea to refresh all of them and redeploy...
Is it a good idea to set ValidateExternalMetadata to false when I create a package so I won't get these warnings in the future? Any other suggestions for this?
The ValidateExternalMetadata setting is a case of "you can pay me now or pay me later."
A data flow must ensure the metadata it was built against remains true whenever the task runs. The SSIS designer also validates the metadata whenever the package is opened for editing.
Flipping the default setting from True to False can save you cycles during development if a participant (source/destination) is extremely complex/busy/latency-filled.
Setting this to False can also improve the start time of an SSIS package, as the task will only be validated if it runs. Say you have a Foreach File enumerator and it only finds a file once a quarter (but runs every day because Accounting can't quite tell you when they'll have the final numbers ready, HYPOTHETICALLY SPEAKING). Since the Data Flow task will only get validated 4 out of 365.25 days, that could be a beneficial performance saving. Probably not much, but if you're trying to eke out every last bit of performance, that's a knob you can flip.
Another coding aphorism is that "warnings are just errors waiting to grow up." Adding columns is unlikely to grow into an error, but you are now spending CPU cycles on every package execution raising and handling the OnWarning event due to mismatched metadata. If you have an ops team, they may grow complacent about warnings from SSIS packages and miss a more critical warning.
A way to avoid this in the future is to write explicit queries in your source. Currently, the SELECT * (or the underlying table used as the source) is reporting back the new columns, which is where the impedance mismatch comes into play. If you only ever bring in the columns you asked for, adding columns won't cause this warning to surface.
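For example, instead of pointing the OLE DB Source at the table itself (or at SELECT *), give it a query that names only the columns the data flow actually uses; the table and column names below are placeholders:

    -- Explicit column list: columns added to the table later never reach the data flow,
    -- so the external metadata stays in sync with what the package was built against.
    SELECT  CustomerID,
            OrderDate,
            TotalAmount
    FROM    dbo.MyTable;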
Removing a column will, of course, flat-out cause the package to fail (under either the explicit or the implicit column selection approach).

SSIS restart/re-trigger package automatically

Good day!
I have an SSIS package that retrieves data from a database and exports it to a flat file (a simple process). The issue I am having is that the data my package retrieves each morning depends on a separate process loading the data into a table before my package retrieves it.
Now, the process which initially loads the data inserts metadata into a table showing the start and end date/time. I would like to set up something in my package that checks the metadata table for an end date/time for the current date. If the current date exists, then the process continues... If no date/time exists, then the process stops (here is the kicker) BUT the package re-triggers itself automatically an hour later to check whether the initial data load is complete.
I have done research on checkpoints, etc., but all that seems to cover is the case where the package fails and picks up where it left off when it is restarted. I don't want to manually re-trigger the process; I'd like it to check the metadata and restart itself if possible. I could even add logic so that after it checks the metadata 3 times it stops completely.
Thanks so much for your help
What you want isn't possible exactly the way you describe it. When a package finishes running, it's inert. It can't re-trigger itself, something has to re-trigger it.
That doesn't mean you have to do it manually. The way I would handle this is to have an Agent job scheduled to run every hour for X number of hours a day. The job would call the package every time, and the metadata would tell the package whether it needs to do anything or simply exit.
There would be a couple of ways to handle this.
They all start by setting up the initial check, just as you've outlined above. See if the data you need exists. Based on that, set a boolean variable (I'll call it DataExists) to TRUE if your data is there or FALSE if it isn't. Or 1 or 0, or what have you.
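The initial check itself is usually just an Execute SQL Task that returns a single value you map into that variable. A minimal sketch, assuming a hypothetical dbo.LoadMetadata table with an EndDateTime column written by the upstream load (names are placeholders):

    -- Returns 1 if the upstream load has finished for today, otherwise 0.
    -- Map the single-row result to the DataExists package variable.
    SELECT CASE WHEN EXISTS (
               SELECT 1
               FROM   dbo.LoadMetadata
               WHERE  CAST(EndDateTime AS date) = CAST(GETDATE() AS date)
           ) THEN 1 ELSE 0 END AS DataExists;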
Create two precedence constraints coming off that task: one that requires that DataExists==TRUE and, obviously enough, another that requires that DataExists==FALSE.
The TRUE path is your happy path. The package executes your code.
On the FALSE path, you have options.
Personally, I'd have the FALSE path lead to a forced failure of the package. From there, I'd set up the job scheduler to wait an hour, then try again. BUT I'd also set a limit on the retries. After X retries, go ahead and actually raise an error. That way, you'll get a heads up if your data never actually lands in your table.
If you don't want to (or can't) get that level of assistance from your scheduler, you could mimic the functionality in SSIS, but it's not without risk.
On your FALSE path, trigger an Execute SQL Task with a simple WAITFOR DELAY '01:00:00.00' command in it, then have that task call the initial check again when it's done waiting. This will consume a thread on your SQL Server and could end up getting dropped by the SQL Engine if it gets thread starved.
Going the second route, I'd set up another Iteration variable, increment it with each try, and set a limit in the precedence constraint to, again, raise an actual error if your data doesn't show up within a reasonable number of attempts.
Thanks so much for your help! With some additional research I found the following article, which I was able to reference to create a solution for my needs. Although my process doesn't require a failure to attempt a retry, I set the process to force-fail after 3 attempts.
http://microsoft-ssis.blogspot.com/2014/06/retry-task-on-failure.html
Much appreciated
Best wishes

A way to execute pipeline periodically from bounded source in Apache Beam

I have a pipeline taking data from a MySQL server and inserting it into Datastore using the DataflowRunner.
It works fine as a batch job executed once. The thing is that I want to get the new data from the MySQL server into Datastore in near real time, but JdbcIO gives bounded data as a source (as it is the result of a query), so my pipeline executes only once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redo it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.
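The periodic synchronization itself usually boils down to an incremental query plus a watermark you advance after each successful write to Datastore. A rough SQL sketch of that pattern, assuming hypothetical orders and sync_watermark tables with an updated_at timestamp (the first query is the kind of thing you would parameterize and hand to JdbcIO.readAll() on each tick):

    -- Pull only the rows that appeared since the last successful sync to Datastore.
    SELECT o.id, o.payload, o.updated_at
    FROM   orders o
    JOIN   sync_watermark w ON o.updated_at > w.last_synced;

    -- Once that batch has been written to Datastore, advance the watermark.
    UPDATE sync_watermark
    SET    last_synced = (SELECT MAX(updated_at) FROM orders);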
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (additionally emitting records to Pub/Sub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redo it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.

SSIS Best Practice - Do 1 of 2 dozen things

I have an SSIS package that is processing a queue.
I currently have a single package that is broken into 3 containers:
1. gather some metadata
2. do the work
3. re-examine metadata, update the queue with what we think happened (success or a flavor of failure)
I am not super happy with the speed; part of it is that I am running on a hamster-powered server, but that is out of my control.
The middle piece may offer an opportunity for an improvement...
There are 20 tables that may need to be updated.
Each queue item will update 1 table.
I currently have a sequence that contains 20 sequence containers.
They all do essentially the same thing, but I couldn't figure out a way to abstract them.
The first box in each is an empty script task. There is a conditional flow to 'the guts' if there is a match on the table name.
So I open up 20 sequence containers, run 20 empty script tasks, and do 20 true/false checks.
Watching the yellow/green light show, this seems to be slow.
Is there a more efficient way? The only way I can think of to make it better is to have the 20 empty scripts outside the sequence containers. What that would save is opening the container. I can't believe that opening a sequence container is all that expensive. Does it possibly re-verify every task in the container every time?
Just fishing, if anyone has any thoughts I would be very happy to hear them.
Thanks
Greg
Your main issue right now is that you are running this in BIDS. This is designed to make development and debugging of packages easy, so yes, to your point, it validates all of the objects as it runs. Plus, the "yellow/green light show" is more overhead to show you what is happening in the package as it runs. You will get much better performance when you run it with DTExec or as part of a scheduled job from SQL Server. Are you logging your packages? If so, run from the server and look at the logs to verify how long the process actually takes on the server. If it is still taking too long at that point, then you can implement some of #registered user's ideas.
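If you are logging to the SQL Server provider, one quick way to get the real run time is to diff the PackageStart and PackageEnd events in the log table; a rough sketch against the standard dbo.sysssislog table (adjust names if your logging goes elsewhere):

    -- Elapsed seconds per execution, taken from the package start/end events
    -- instead of from watching the designer.
    SELECT  executionid,
            MIN(starttime) AS package_start,
            MAX(endtime)   AS package_end,
            DATEDIFF(SECOND, MIN(starttime), MAX(endtime)) AS elapsed_seconds
    FROM    dbo.sysssislog
    WHERE   event IN ('PackageStart', 'PackageEnd')
    GROUP BY executionid
    ORDER BY package_start DESC;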
Are you running each of the tasks in parallel? If it has to cycle through all 60 objects serially, then your major room for improvement is running them in parallel. If you are trying to parallelize the processes, there are a few solutions:
Create all 60 objects, as 20 chains of 3 objects each. This is labor-intensive to set up, but it is the easiest to troubleshoot and allows you to customize it when necessary. Obviously this does not abstract away anything!
Create a parent package and a child package. The child package would contain the structure of what you want to execute. The parent package contains 20 Execute Package tasks. This is similar to 1, but it offers the advantage that you only have one set of code to maintain for the 3-task sequence container. This likely means you will move to a table-driven metadata model. This works well in SSIS with the CozyRoc Data Flow Plus task if you are transferring data from one server to another. If you are doing everything on the same server, then you're really probably organizing stored procedure executions which would be easy to do with this model.
Create a package that uses the CozyRoc Parallel Task and Data Flow Plus. This can allow you to encapsulate all the logic in one package and execute everything in parallel. WARNING: I tried this approach in SQL Server 2008 R2 with great success. However, when SQL Server 2012 was released, the CozyRoc Parallel Task did not behave the way it did in previous versions for me, due to some under-the-cover changes in SSIS. I logged this as a bug with CozyRoc, but as far as I know this issue has not been resolved (as of 4/1/2013). Also, this model may abstract away too much of the ETL and make initial loads and troubleshooting individual table loads more difficult in the future.
Personally, I use solution 1 since any of my team members can implement this code successfully. Metadata driven solutions are sexy, but much harder to code correctly.
May I suggest wrapping your 20 updates in a single stored procedure? Not knowing how variable your input data is, I don't know how suitable this is, but it is my first reaction.
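A rough sketch of that idea, with the procedure keyed on the target table name coming from the queue item; every table, column, and procedure name here is a placeholder for whatever your queue and targets actually look like:

    -- One procedure handles all 20 targets; the queue item's table name picks the branch.
    CREATE PROCEDURE dbo.ProcessQueueItem
        @TableName   sysname,
        @QueueItemId int
    AS
    BEGIN
        SET NOCOUNT ON;

        IF @TableName = 'TableA'
            UPDATE t
            SET    t.SomeColumn = q.SomeValue
            FROM   dbo.TableA AS t
            JOIN   dbo.QueueItems AS q ON q.TargetKey = t.TargetKey
            WHERE  q.QueueItemId = @QueueItemId;
        ELSE IF @TableName = 'TableB'
            UPDATE t
            SET    t.SomeColumn = q.SomeValue
            FROM   dbo.TableB AS t
            JOIN   dbo.QueueItems AS q ON q.TargetKey = t.TargetKey
            WHERE  q.QueueItemId = @QueueItemId;
        -- ...one ELSE IF branch per remaining target table
    END

SSIS would then pass the queue item's table name and id into a single Execute SQL Task instead of branching across 20 containers.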
Well, here is what I did...
I added a dummy task at the 'top' of the parent sequence container. From that, I added 20 flow links, one to each of the child sequence containers (CSCs). Now each CSC gets opened only if necessary.
My throughput did increase by about 30% (26 rpm -> 34 rpm on minimal sampling).
I could go with either zman's answer or registeredUser's. Both were helpful. I chose zman's because the real answer always starts with looking at the log to see exactly how long something takes (green/yellow is not really reliable in my experience).
thanks