I have an SSIS package that reads from and writes to a table using OLE DB source and target. Recently we've added two new columns to the table. My SSIS package doesn't use these columns, but now I'm getting "The external columns are out of synchronization with the data source" warnings when I open the package.
I tried to run the package and it finished successfully, but I can see these warnings in the execution results. I could refresh the metadata of course, but there are many packages that are running in Production and using this table, so I don't think it's a good idea to refresh all of them and redeploy...
Is it a good idea to set ValidateExternalMetadata to false when I create a package so I won't get these warnings in the future? Any other suggestions for this?
The ValidateExternalMetadata setting is a case of "you can pay me now or pay me later."
A data flow must ensure that the metadata it was built against remains true whenever the task runs. The SSIS designer also validates the metadata whenever the package is opened for editing.
Flipping the default setting from True to False can save you cycles during development if a participant (source/destination) is extremely complex/busy/latency-filled.
Setting this to False can also improve the start time of an SSIS package, as the task will only be validated if it runs. Say you have a Foreach File enumerator and it only finds a file once a quarter (but runs every day, as Accounting can't quite tell you when they'll have the final numbers ready, HYPOTHETICALLY SPEAKING). Since the Data Flow Task will only get validated 4 out of 365.25 days, that could be a beneficial performance savings. Probably not much, but if you're trying to eke out every last bit of performance, that's a knob you can flip.
Another coding aphorism is that "Warnings are just errors waiting to grow up." Adding columns is unlikely to grow into an error, but you are now spending CPU cycles on every package execution raising and handling the OnWarning event due to mismatched metadata. If you have an ops team, they may grow complacent about warnings from SSIS packages and miss a more critical warning.
A way to avoid this in the future is to write explicit queries in your source. Currently, the SELECT * (or the underlying table used directly as the source) is reporting back the new columns, which is where the impedance mismatch comes into play. If you only ever bring back the columns you explicitly ask for, adding columns won't cause this warning to surface.
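For example, a minimal sketch of an explicit source query (the table and column names here are hypothetical):

SELECT CustomerID,
       CustomerName,
       ModifiedDate
FROM dbo.Customer;

Because the data flow only ever sees the columns named in the query, adding new columns to the table later does not change the source metadata.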
Removing a column, of course, will flat-out cause the package to fail (under either the explicit or the implicit column-selection approach).
Good day!
I have an SSIS package that retrieves data from a database and exports it to a flat file (a simple process). The issue I am having is that the data my package retrieves each morning depends on a separate process loading the data into a table before my package retrieves it.
Now, the process which initially loads the data inserts metadata into a table showing the start and end date/time. I would like to set up something in my package that checks the metadata table for an end date/time for the current date. If the current date exists, then the process continues... if no date/time exists, then the process stops (here is the kicker) BUT the package re-triggers itself automatically an hour later to check whether the initial data load is complete.
I have done research on checkpoints, etc., but all that seems to cover is the package picking up where it left off when it is restarted after a failure. I don't want to manually re-trigger the process; I'd like it to check the metadata and restart itself if possible. I could even add logic so that after it checks the metadata 3 times it stops completely.
Thanks so much for your help
What you want isn't possible exactly the way you describe it. When a package finishes running, it's inert. It can't re-trigger itself; something has to re-trigger it.
That doesn't mean you have to do it manually. The way I would handle this is to have an Agent job scheduled to run every hour for X number of hours a day. The job would call the package every time, and the metadata would tell the package whether it needs to do anything or just do nothing.
There would be a couple of ways to handle this.
They all start by setting up the initial check, just as you've outlined above. See if the data you need exists. Based on that, set a boolean variable (I'll call it DataExists) to TRUE if your data is there or FALSE if it isn't. Or 1 or 0, or what have you.
Create two precedence constraints coming off that task; one that requires that DataExists==TRUE and, obviously enough, another that requires that DataExists==FALSE.
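As a rough sketch of that initial check (dbo.LoadMetadata and EndDateTime are assumed names for your metadata table and its end date/time column), an Execute SQL Task could run:

SELECT CASE
           WHEN EXISTS (SELECT 1
                        FROM dbo.LoadMetadata
                        WHERE EndDateTime IS NOT NULL
                          AND CAST(EndDateTime AS date) = CAST(GETDATE() AS date))
           THEN 1
           ELSE 0
       END AS DataExists;

Map the single-row result to the DataExists variable, then have the two precedence constraints evaluate @[User::DataExists] == 1 and @[User::DataExists] == 0.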
The TRUE path is your happy path. The package executes your code.
On the FALSE path, you have options.
Personally, I'd have the FALSE path lead to a forced failure of the package. From there, I'd set up the job scheduler to wait an hour, then try again. BUT I'd also set a limit on the retries. After X retries, go ahead and actually raise an error. That way, you'll get a heads up if your data never actually lands in your table.
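If your scheduler is SQL Server Agent, the retry behaviour can live on the job step itself. A rough sketch only (the job, step, and package path are made up; the retry knobs are the documented @retry_attempts and @retry_interval parameters of sp_add_jobstep):

-- Assumes the Agent job N'Daily Extract' already exists.
EXEC msdb.dbo.sp_add_jobstep
    @job_name       = N'Daily Extract',
    @step_name      = N'Run export package',
    @subsystem      = N'SSIS',
    @command        = N'/FILE "C:\SSIS\DailyExport.dtsx"',  -- hypothetical package path
    @retry_attempts = 3,    -- give up and fail the job after 3 failed runs
    @retry_interval = 60;   -- wait 60 minutes between attempts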
If you don't want to (or can't) get that level of assistance from your scheduler, you could mimic the functionality in SSIS, but it's not without risk.
On your FALSE path, trigger an Execute SQL Task with a simple WAITFOR DELAY '01:00:00.00' command in it, then have that task call the initial check again when it's done waiting. This will consume a thread on your SQL Server and could end up getting dropped by the SQL engine if it gets thread-starved.
Going the second route, I'd set up another Iteration variable, increment it with each try, and set a limit in the precedence constraint to, again, raise an actual error if your data doesn't show up within a reasonable number of attempts.
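A minimal sketch of what that Execute SQL Task could run, using the same assumed metadata table as above (one batch that waits and then repeats the check):

WAITFOR DELAY '01:00:00';  -- hold this connection for one hour

SELECT CASE
           WHEN EXISTS (SELECT 1
                        FROM dbo.LoadMetadata
                        WHERE CAST(EndDateTime AS date) = CAST(GETDATE() AS date))
           THEN 1
           ELSE 0
       END AS DataExists;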
Thanks so much for your help! With some additional research I found the following article, which I was able to reference to create a solution for my needs. Although my process doesn't rely on a failure to trigger a retry, I set the process to force-fail after 3 attempts.
http://microsoft-ssis.blogspot.com/2014/06/retry-task-on-failure.html
Much appreciated
Best wishes
We are using MS Access 2010 and we are seeing an unnecessary 50% increase in the size of the data file every day. We run the compact and repair process on a daily basis, every night. But almost every day, in the middle of the day, this huge increase happens and performance is badly affected; we have to run the process again manually, and after that the huge size difference disappears. I suspect the problem is down to the internal behaviour of the Access engine while updating data.
Can anyone please explain how much space is wasted internally by the database engine when a record is updated?
For instance, suppose we have a record of 100 bytes and an update shrinks it to 80 bytes: how much space is wasted? Is it 20 bytes, or much more than that?
Conversely, when an update makes a record larger, is any wasted space created by the update process in the data file?
Any ideas or suggestions on how to boost performance would be appreciated.
You can run Compact & Repair via VBA:
Public Sub CompactDB()
    ' Drives the built-in "Compact and Repair Database" command via the CommandBars interface.
    CommandBars("Menu Bar").Controls("Tools").Controls("Database utilities").Controls("Compact and repair database...").accDoDefaultAction
End Sub
Reasons your database can bloat (compacting only solves some of this -- decompiling/recompiling is necessary for the rest, if you use code or macros):
MS Access is file-based, not server/transaction-based, so you're always writing and rewriting to the hard drive for variable space. To get around this, switch to MS Access ADP files using either MSDE, which you can install from the MS Office Professional CD by browsing to it on the CD (it is not part of the installation wizard), or hook the database up to a server such as SQL Server. You'll have to build a new MS Access document of type ADP (as opposed to MDB). Doing so puts you in a different developmental regime than you're used to, however, so read about this before doing it.
Compiling. Using macros plus the "compile in background" option is no different from compiling your MS Access project by having coded in Access Basic, Visual Basic for Access, or Visual Basic using the VB Editor that comes with MS Access. Whatever changes you made last time remain as compiled pseudocode, so you are pancaking one change on top of another, even though you are only playing with the latest version of your code.
Queries, especially large queries, take up space when they run, and that space is never reclaimed until you compact. You can make your queries more efficient, but you'll never get away from this completely.
LockType, CursorType, and CursorLocation settings on ADODB, depending on how you set them up, can take up a lot of space if you choose combinations that are really data-intensive. These can be marshalled (configured) in such a way as to return only what's necessary. There is a Knowledge Base article in the MSDN library at microsoft.com detailing how ADODB causes a lot of bloat and recommending DAO instead, but this is a cop-out: use ADODB well and you'll get around this, and DAO does not eliminate bloat either.
DAO functions.
Object creation -- tables, forms, controls, reports -- all take up space. If you create a form and delete it later, the space that the form took up is not reclaimed until you compact.
Cute pictures. These always take up space, and MS Access does not store them efficiently. A 20K JPEG can wind up as an 800K or 1MB bitmap once stored in Access, and there's nothing you can do about that in MS Access 97. You can put the image on a form and use subform references to the image wherever you want it, but you still don't get around the inefficient storage format.
OLE objects. If you have an OLE field and decide to insert, say, a spreadsheet into that field, you take the entire Excel workbook with it, not just that sheet. Be careful how you use OLE objects.
Table properties with the subdatasheet set to [Auto]. Set this property, for all tables, to [None]. Depending on how many tables you have, performance can also perceptibly improve.
You can also get the Jet Compact utility from Microsoft.com for databases that are corrupted.
Source
I have an SSIS package that is processing a queue.
I currently have a single package that is broken into 3 containers:
1. gather some metadata
2. do the work
3. re-examine metadata, update the queue with what we think happened (success or some flavor of failure)
I am not super happy with the speed, part of it is that I am running on a hamster powered server, but that is out of my control.
The middle piece may offer an opportunity for an improvement...
There are 20 tables that may need to be updated.
Each queue item will update 1 table.
I currently have a sequence that contains 20 sequence containers.
They all do essentially the same thing, but I couldn't figure out a way to abstract them.
The first box in each is an empty script task. There is a conditional flow to 'the guts' if there is a match on table name.
So I open up 20 sequence containers, 20 empty script tasks and do 20 true/false checks.
Watching the yellow/green light show, this seems to be slow.
Is there a more efficient way? The only way I can think of to make it better is to have the 20 empty scripts outside the sequence containers. What that would save is opening the container. I can't believe that is all that expensive to open a sequence container. Does it possibly re-verify every task in the container every time?
Just fishing, if anyone has any thoughts I would be very happy to hear them.
Thanks
Greg
Your main issue right now is that you are running this in BIDS. This is designed to make development and debugging of packages easy, so yes, to your point, it validates all of the objects as it runs. Plus, the "yellow/green light show" is more overhead to show you what is happening in the package as it runs. You will get much better performance when you run it with DTExec or as part of a scheduled job from SQL Server. Are you logging your packages? If so, run from the server and look at the logs to verify how long the process actually takes on the server. If it is still taking too long at that point, then you can implement some of #registered user's ideas.
Are you running each of the tasks in parallel? If it has to cycle through all 60 objects serially, then your major room for improvement is running each of these in parallel. If you are trying to parallelize the processes, then there are a few possible solutions:
Create all 60 objects, in 20 chains of 3 objects. This is labor-intensive to set up, but it is the easiest to troubleshoot and allows you to customize it when necessary. Obviously this does not abstract anything away!
Create a parent package and a child package. The child package would contain the structure of what you want to execute. The parent package contains 20 Execute Package tasks. This is similar to 1, but it offers the advantage that you only have one set of code to maintain for the 3-task sequence container. This likely means you will move to a table-driven metadata model. This works well in SSIS with the CozyRoc Data Flow Plus task if you are transferring data from one server to another. If you are doing everything on the same server, then you're really probably organizing stored procedure executions which would be easy to do with this model.
Create a package that uses the CozyRoc Parallel Task and Data Flow Plus. This can allow you to encapsulate all the logic in one package and execute all of them in parallel. WARNING: I tried this approach in SQL Server 2008 R2 with great success. However, when SQL Server 2012 was released, the CozyRoc Parallel Task did not behave the way it did in previous versions for me, due to some under-the-cover changes in SSIS. I logged this as a bug with CozyRoc, but as best I know this issue has not been resolved (as of 4/1/2013). Also, this model may abstract away too much of the ETL and make initial loads and troubleshooting individual table loads in the future more difficult.
Personally, I use solution 1 since any of my team members can implement this code successfully. Metadata driven solutions are sexy, but much harder to code correctly.
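For the table-driven metadata model mentioned in option 2, a minimal sketch of the kind of control table the parent package could loop over (table and column names are purely illustrative):

CREATE TABLE dbo.TableLoadControl
(
    TableLoadControlID int IDENTITY(1,1) PRIMARY KEY,
    TargetTableName    sysname       NOT NULL,  -- which of the 20 tables this row drives
    SourceQuery        nvarchar(max) NOT NULL,  -- query the child package runs for that table
    IsEnabled          bit           NOT NULL DEFAULT (1)
);

The parent package would read the enabled rows and hand each one to the child package through Execute Package task configurations or parameters.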
May I suggest wrapping your 20 updates in a single stored procedure? Not knowing how variable your input data is, I don't know how suitable this is, but it is my first reaction.
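Something along these lines, as a sketch only (the procedure, queue, and table names are invented); the idea is that SSIS makes one call per queue item and the procedure routes it to the right table:

CREATE PROCEDURE dbo.ProcessQueueItem
    @QueueItemID int,
    @TargetTable sysname
AS
BEGIN
    SET NOCOUNT ON;

    IF @TargetTable = N'TableA'
        UPDATE a
        SET    a.SomeColumn = q.SomeValue
        FROM   dbo.TableA AS a
        JOIN   dbo.WorkQueue AS q ON q.TargetKey = a.TargetKey
        WHERE  q.QueueItemID = @QueueItemID;
    ELSE IF @TargetTable = N'TableB'
        UPDATE b
        SET    b.SomeColumn = q.SomeValue
        FROM   dbo.TableB AS b
        JOIN   dbo.WorkQueue AS q ON q.TargetKey = b.TargetKey
        WHERE  q.QueueItemID = @QueueItemID;
    -- repeat (or code-generate) a branch for each of the 20 tables
END;

The 20 branches still exist, but they live in one compiled object on the server instead of 20 sequence containers.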
well - here is what I did....
I added a dummy task at the 'top' of the parent sequence container. From that I added 20 flow links, one to each of the child sequence containers (CSCs). Now each CSC gets opened only if necessary.
My throughput did increase by about 30% (26 rpm--> 34 rpm on minimal sampling).
I could go with either zman's answer or registeredUser's. Both were helpful. I chose zman's because the real answer always starts with looking at the log to see exactly how long something takes (green/yellow is not very reliable in my experience).
thanks
I have read a few/lots of things on this but they don't seem to help much.
I have an app (it's called "TieUp", but that is irrelevant) that I run manually every day to collate data from several locations.
It is using as sources:
A) Data from a remote SOAP source, loaded into an in-memory TClientDataSet via an XMLTransform setup
B) CSV files downloaded daily and loaded into an in-memory TClientDataSet
C) A MySQL database on the same computer as the program (it's a restored backup of the live source)
D) A remote MS-SQL (SQL Server 2008) database
E) A MySQL database on a remote server
Data is only read from sources A, B, C and D
Data source E is updated with the consolidated data.
There are between 800 and 2000 records daily, so the datasets are not vast, although the target (E) has grown to around 150,000 records and is increasing daily.
I can normally run this all happily and everything works as expected (if a little slowly, because of all the individual remote lookups to the MS-SQL system), but some days it really screws up and the error is always "Catastrophic Failure!".
The failure does not occur during any particular phase or operation that I can see. The steps are:
1) Get the SOAP (A) data first.
2) Tie in with the CSV/in-memory data (B).
3) Look up reference data on sources C and D to collate.
4) Write the consolidated data to source E.
After the data is read into the in-memory datasets, everything is in TClientDataSets accessed via DataSetProviders linked to TSQLQueries (they are all on the same servers currently, but I did it that way to keep some flexibility for a future move to true three-tier). All queries are contained within the SQLQuery components, as they are actually quite simple - it's just a matter of tying things together.
I am using completely standard components from Delphi 2009 Enterprise. All updates and database update packs have been applied. Each data source has its own DataModule; these are auto-created at startup.
There is obviously quite a lot of data access going on here, but when it crashes (with catastrophic failure) it gets stuck, completely stuck. Windows can't end the task from the normal "TieUp has stopped working" dialog; I have to go to the process and kill it.
There is so much going on and as this only happens once a week or so I really don't know where to start looking.
The reasons for asking the question are twofold: 1) I am trying to eliminate any manual steps and fully automate this, but I can't rely on it if it bombs every week or so. 2) If it happens in the update phase to E, I have to manually delete the new records for the day and start again, as I do not have (or haven't written yet) a mechanism to restart from a random point, and I would still have to query the DB manually to establish that point for certain.
My next step is to install Delphi on another computer and always run it under the debugger until I can catch it, if it does not freeze first. But that introduces yet another different network connection (instead of the localhost one).
So: "Is there a definite answer?" or what is the most likely offending component/connection? Where is the favoured place to start looking?
Thanks in advance...
Is it possible to log all the transformations in SSIS? I have a custom logging solution which logs the data for each control flow element, but I would like to do so at the transformation level if it is possible. If not, that's OK... but I'd still like to know whether it's possible or not.
For example, if the package were to fail at the Lookup, could I get the exact reason why it failed and what data it failed on? Additionally, could I capture the outputs of the Source, Derived Column, and so on?
I found the answer to this. With the canned logging (that comes along with SSIS) you can select a property called "PipelineComponentTime". This gives the component-wise execution time in milliseconds. The only thing is, you will have to filter the information out of your tables because of the amount of log data this generates.
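If you are logging to the SQL Server provider, the PipelineComponentTime rows end up in dbo.sysssislog and can be filtered with a query along these lines (a sketch; point it at whatever database your log provider writes to):

SELECT source  AS task_name,
       message,            -- contains the component name and the per-phase time in milliseconds
       starttime
FROM dbo.sysssislog
WHERE event = N'PipelineComponentTime'
ORDER BY starttime DESC;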