Best practice for organizing a 200+ table import project - SSIS

This question is going to be a purely organizational question about SSIS project best practice for medium sized imports.
So I have a source database which is continuously being enriched with new data. Then I have a staging database into which I periodically load the data from the source database, so I can work on a copy of the source database and migrate the current system. I am currently using an SSIS Visual Studio project to import this data.
My issue is that I realised the actual design of my project is not really optimal, and now I would like to move this project to SQL Server so I can schedule the import instead of running the Visual Studio project manually. That means the current project needs to be cleaned up and optimized.
So basically, for each table, the process is simple: truncate table, extract from source and load into destination. And I have about 200 tables. Extractions cannot be parallelized as the source database only accepts one connection at a time. So how would you design such a project?
I read in the Microsoft documentation that they recommend using one Data Flow per package, but managing 200 different packages seems quite impossible, especially since I will have to chain them for the scheduled import. On the other hand, a single package with 200 Data Flows seems unmanageable too...
Edit 21/11:
The first approach I wanted to use when starting this project was to extract my tables automatically by iterating over a list of table names. This could have worked out well if my source and destination tables all had the same schema object names, but the source and destination databases are from different vendors (BTrieve and Oracle), so they also have different naming restrictions. For example, BTrieve does not reserve names and allows names longer than 30 characters, which Oracle does not. So that is how I ended up manually creating 200 data flows with semi-automatic column mapping (most mappings were automatic).
When generating the CREATE TABLE queries for the destination database, I created a reusable C# library containing the methods to generate the new schema object names, just in case the methodology could be automated. If there were a custom tool for generating the packages that could use an external .NET library, then this might do the trick.

Have you looked into BIDS Helper's BIML (Business Intelligence Markup Language) as a package generation tool? I've used it to create multiple packages that all follow the same basic truncate-extract-load pattern. If you need slightly more cleverness than what's built into BIML, there's BimlScript, which adds the ability to embed C# code into the processing.
From your problem description, I believe you'd be able to write one BIML file and have that generate two hundred individual packages. You could probably use it to generate one package with two hundred data flow tasks, but I've never tried pushing SSIS that hard.

You can basically create 10 child packages, each containing 20 Data Flow Tasks, and create a master package which triggers these child packages. Using parent-to-child configuration, create a single XML configuration file. Define precedence constraints in the master package so the child packages execute in serial fashion. This way, maintainability will be better than with 200 packages or a single package with 200 Data Flow Tasks.

The following link may be useful to you:
Single SSIS Package for Staging Process
Hope this helps!

Related

Multiple Tables import using single dataflow in SSIS

I have 10 tables I am importing into another SQL Server database using SSIS.
Do I have to create 10 different Data Flow Tasks, or can I proceed with one Data Flow Task and add the 10 tables to it?
I have tried to use a single Data Flow Task, but it only allows for a single table.
Do all the source tables share one common schema?
Do all the destination tables share one common schema (which doesn't have to be the same as the common schema for the source tables)?
If the answer to both questions is "yes", then you can in fact write a single Data Flow Task (whose connection managers are parameterized) and put it in a Foreach Loop container.
If the answer to either (or both) of those questions is "no", then you'll have to have separate sources and destinations. You might want to investigate Business Intelligence Markup Language as a way to generate those data flows automatically, although it's probably overkill for "only" ten tables.
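If it helps, here is a minimal sketch of the kind of driver query such a Foreach Loop could be fed with; the mapping table and column names below are made up for illustration, not part of the question. The loop copies each row into package variables, and expressions on the parameterized source and destination build the actual statements from them.

    -- Hypothetical mapping table driving the Foreach Loop (ADO enumerator).
    -- Each row is pushed into package variables; the shared Data Flow Task's
    -- source and destination are built from those variables via expressions.
    SELECT SourceTableName,
           DestinationTableName
    FROM   dbo.TableLoadList      -- placeholder table name
    WHERE  IsEnabled = 1
    ORDER  BY LoadOrder;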
The answer depends on you, your best practices, and how many developers will be working on the project at the same time.
It is entirely possible to put more than one set of tables in a single dataflow. You can simply add additional sources and destinations to your dataflow. However, this is almost never a good idea as it adds to the maintenance effort later in the lifecycle of your project. It makes it more difficult to find and debug errors. It makes the entire project more complex.
If you are working alone and you will be building and maintaining this project's full lifecycle by yourself, then by all means do whatever you feel most comfortable with.
If you are in a group that may all maintain this project, I would suggest that you, at a minimum, break out the dataflows for different tables into separate dataflow tasks.
If you are in a larger group and for more flexibility in maintenance, I would suggest that each dataflow be broken out into a different package (assuming 2008 or below. I have not played with the 2012 project models yet, so won't comment on them here), so that each can be worked on by different developers simultaneously. (I would actually recommend coding this way even if you are the only one on the project, but that is just the style I have developed over my career.)

SSIS - 2008 - Use a single config table for multiple copies of the same package

I am somewhat new to SSIS.
I have to deliver a 'generic' SSIS package that the client will make multiple copies of, then deploy and schedule each copy against a different source database. I have a single SSIS Configuration table in a separate common database. I would like to use this single configuration table for all connections. However, the challenge is with the configuration filter. When the client makes a copy of my package, it will have the same configuration filter as all the others. I would like to give the client an option to change the configuration filter before deploying, because for this new copy the source database can be different. I do not find an option to control this.
Is there a way to change the configuration filter from outside the package (without editing the executable .dtsx file)? Or is there a better approach that I can follow? I would prefer not to use XML configuration files, the primary reason being that my packages are deployed to SQL Server.
Any help would be greatly appreciated.
-Shahul
Your preferred solution does not align well with the way that SSIS package configurations are typically used. See Jamie Thomson's answer to a similar question on the MSDN forums.
I have created a package with the same requirements for my company. It loads data from different sources into different destinations based on individual configurations for each instance. It is used as an internal ETL.
We have adapters that connect to different sources and pass data to a common staging table in XML format and the IETL Package loads this data into different tables depending on a number of different settings etc.
i.e. multiple SSIS package instances can be executed with different configurations. You are on the right track. It can be achieved by using SQL Server to hold the configurations and an XML config file to hold the info for the database that holds these configurations. When an instance of the package executes, it will load the default values configured with the package, but it needs to update all variables to reflect the purpose of the new instance.
I have created a Windows app to configure these instances and their settings in the database, to make it really easy for the client or consultant to configure them without actually opening the package.
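As a sketch of that idea (the table and column names below are invented for illustration, not part of the original setup): an Execute SQL Task at the start of each deployed copy could read that instance's row and push the values into package variables, so the copy behaves differently without editing the .dtsx.

    -- Hypothetical per-instance settings table; the package passes its own
    -- instance name as a parameter and maps the result set to its variables.
    SELECT SourceConnectionString,
           DestinationConnectionString,
           BatchSize
    FROM   dbo.PackageInstanceConfig      -- placeholder table name
    WHERE  InstanceName = ?;              -- parameter mapped from a package variable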

Refreshing a reporting database

We currently have an OLTP SQL Server 2005 database for our project. We are planning to build a separate, de-normalized reporting database so that we can take the load off our OLTP DB. I'm not quite sure which is the best approach to sync these databases. We are not looking for a real-time system, though. Is SSIS a good option? I'm completely new to SSIS, so I'm not sure about its feasibility. Kindly provide your inputs.
Everyone has their own opinion of SSIS. But I have used it for years for data marts, and my current environment is a full BI installation. I personally love its capabilities for moving data, and it still holds the world record for moving 1.13 terabytes in under 30 minutes.
As for setup, we use log shipping from our transactional DB to populate a second box, then use SSIS to de-normalize and warehouse the data. The community for SSIS is also very large, and there are tons of free training and helpful resources online.
We build our data warehouse using SSIS, and we run reports from it. It's a big learning curve and the errors it throws aren't particularly useful, and it helps to be good at SQL rather than treating it as a 'row by row' transfer - what I mean is you should be creating set-based queries in SQL command tasks rather than using lots of SSIS components and data flow tasks.
Understand that every warehouse is different and you need to decide how best to do it. This link may give you some good ideas.
How we implement ours (we have a Postgres backend and use the PGNP provider; making use of linked servers could make your life easier):
First of all you need to have a timestamp column in each table so you can tell when it was last changed.
Then write a query that selects the data that has changed since you last ran the package (using an audit table would help) and get that data into a staging table. We run this as a data flow task because (using Postgres) we don't have any other choice, although you may be able to make use of a normal reference to another database (dbname.schemaname.tablename or something like that) or use a linked server query. Either way the idea is the same: you end up with the data that has changed since your last run.
We then update (based on ID) the data that already exists, then insert the new data (by left-joining the table to find out what doesn't already exist in the current warehouse).
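As a rough illustration of that update-then-insert step (the warehouse and staging table names here are made up for the example):

    -- 1) Update rows that already exist in the warehouse with the staged values.
    UPDATE w
    SET    w.JobCount    = s.JobCount,
           w.LastUpdated = s.LastUpdated
    FROM   dbo.FactJobsPerDay AS w        -- placeholder warehouse table
    JOIN   stg.JobsPerDay     AS s        -- placeholder staging table
           ON s.JobId = w.JobId;

    -- 2) Insert the rows that don't exist yet; the LEFT JOIN ... IS NULL
    --    finds the staged ids missing from the warehouse.
    INSERT INTO dbo.FactJobsPerDay (JobId, JobDate, JobCount, LastUpdated)
    SELECT s.JobId, s.JobDate, s.JobCount, s.LastUpdated
    FROM   stg.JobsPerDay        AS s
    LEFT JOIN dbo.FactJobsPerDay AS w
           ON w.JobId = s.JobId
    WHERE  w.JobId IS NULL;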
So now we have one denormalised table that shows, in this case, jobs per day. From this we calculate other tables based on aggregated values from this one.
Hope that helps, here are some good links that I found useful:
Choosing .Net or SSIS
SSIS Talk
Package Configurations
Improving the Performance of the Data Flow
Transformations
Custom Logging / Good Blog

Load XML Using SSIS

I have an ETL-type requirement for SQL Server 2005. I am new to SSIS, but I believe it will be the right tool for the job.
The project is related to a loyalty card reward system. Each month, partners in the scheme send one or more XML files detailing the qualifying transactions from the previous month. Each XML file can contain up to 10,000 records. The format of the XML is very simple: 4 "header" elements, then a repeating sequence containing the record elements. The key record elements are card_number, partner_id and points_awarded.
The process is currently running in production, but it was developed as a C# app which runs an insert for each record individually. It is very slow, taking over 8 hours to process a 10,000-record file. By using SSIS I am hoping to improve performance and maintainability.
What I need to do:
Collect the file
Validate against XSD
Business Rule Validation on the records. For each record I need to ensure that a valid partner_id and card_number have been supplied. To do this I need to execute a lookup against the partner and card tables. Any "bad" records should be stripped out and written to a response XML file. This is the same format as the request XML, with the addition of an error_code element. The "good" records need to be imported into a single table.
I have points 1 and 2 working OK. I have also created an XSLT to transform the XML into a flat format ready for insert. For point 3 I had started down the road of using a Foreach Loop Container on the control flow surface to loop over each XML node, together with a SQL lookup task. However, this would require a call to the database for each lookup and a call to the file system to write out the XML files for the "bad" and "good" records.
I believe that better performance could be achieved by using the Lookup control on the data flow surface. Unfortunately, I have no experience of working with the data flow surface.
Does anyone have a suggestion as to the best way to solve the problem? I searched the web for examples of SSIS packages that do something similar to what I need but found none - are there any out there?
Thanks
Rob.
SSIS is frequently used to load data warehouses, so your requirement is nothing new. Take a look at this question/answer, to get you started with tutorials etc.
The Foreach Loop in the control flow is used to loop through files in a directory, tables in a DB, etc. The data flow is where records fly through transformations from a source (your XML file) to a destination (tables).
You do need a lookup in one of its many flavors. Google for "ssis loading data warehouse dimensions"; this will eventually show you several techniques for using the Lookup transformation efficiently.
To flatten the XML (if it is simple enough), I would simply use the XML Source in the data flow; the XML Task is for heavier stuff.
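If it helps to see what the Lookup-based validation amounts to as a set-based query (the staging and reference table names below are placeholders, not from the question):

    -- Tag each staged record with an error_code when its partner_id or
    -- card_number has no match in the reference tables; rows with a NULL
    -- error_code are the "good" records, the rest go to the response file.
    SELECT s.card_number,
           s.partner_id,
           s.points_awarded,
           CASE
               WHEN p.partner_id  IS NULL THEN 'INVALID_PARTNER'
               WHEN c.card_number IS NULL THEN 'INVALID_CARD'
           END AS error_code
    FROM   stg.LoyaltyTransaction AS s            -- placeholder staging table
    LEFT JOIN dbo.Partner AS p ON p.partner_id  = s.partner_id
    LEFT JOIN dbo.Card    AS c ON c.card_number = s.card_number;

In the data flow, the equivalent is two Lookup transformations (partner, then card) with their no-match outputs routed to the response-file destination while the matched rows continue to the main table.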

SSIS Package Design

What is the best way to design a SSIS package? I'm loading multiple dimensions and facts as part of a project. Would it be better to:
Have 1 package and 1 data flow with all data extract and load logic in 1 dataflow?
Have 1 package and multiple data flows with each data flow taking on the logic for 1 dimension?
Have 1 package per dimension and then a master package that calls them all?
After doing some research, options 2 and 3 appear to be the more viable ones. Any experts out there who want to share their experience and/or propose an alternative?
Microsoft's Project Real is an excellent example of many many best practices:
Package Design and Config for Dimensional Modeling
Package logging
Partitioning
It's based on SQL 2005 but is very applicable to 2008. It supports your option #3.
You could also consider having multiple packages called by a SQL Server Agent job.
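As a minimal sketch of that, assuming the master package is stored in msdb (the job name, package path and server below are placeholders, and you would still attach a schedule and a target server to make the job runnable):

    -- Create an Agent job with one step that runs the master SSIS package;
    -- the @command string uses dtexec-style options.
    EXEC msdb.dbo.sp_add_job
         @job_name = N'Load Warehouse';

    EXEC msdb.dbo.sp_add_jobstep
         @job_name  = N'Load Warehouse',
         @step_name = N'Run master package',
         @subsystem = N'SSIS',
         @command   = N'/SQL "\Warehouse\MasterPackage" /SERVER "localhost"';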
I would often go for option 3. This is the method used in the Kimball Microsoft Data Warehouse Toolkit book, worth a read.
http://www.amazon.co.uk/Microsoft-Data-Warehouse-Toolkit-Intelligence/dp/0471267155/ref=sr_1_1?ie=UTF8&s=books&qid=1245347732&sr=8-1
I think the answer is not quite as clear cut ... In the same way that there is often no "best" design for a DWH, I think there is no one "best" package method.
It is quite dependent on the number of dimensions and the number of related dimensions and the structure of data in your staging area.
I quite like the Project Real (mentioned above) approaches; I especially thought the package logging was quite well done. I think I have read somewhere that Denali (SQL 2011) will have SSIS logging/tracking built in, but I'm not sure of the details.
From a calling perspective, I would go for one SQL Agent job that calls a master package, which then calls all the child packages and manages the error handling/logic/emailing etc. between them, utilising log/error tables to track and manage the package flow. SSIS allows much more complex sets of logic than SQL Agent (e.g. call this child package if all of tasks A, B and C have finished and not task D).
Further, I would go for one package per snowflaked dimension, as usually one source table in the staging data will generate a number of snowflaked dimensions (e.g. DimProduct, DimProductCategory, DimProductSubCategory). It would make sense to have the data read in once in one data flow task (DFT) and written out to multiple tables. I would use one container per dimension for separation of logic.
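To make the "read once, write to several snowflaked tables" idea concrete in set-based terms (the staging and dimension names below are invented; inside a data flow the same thing is one source plus a Multicast feeding several destinations):

    -- Derive each level of the product snowflake from one staging table,
    -- inserting only the values that are not already present.
    INSERT INTO dbo.DimProductCategory (CategoryName)
    SELECT DISTINCT s.CategoryName
    FROM   stg.Product AS s                         -- placeholder staging table
    LEFT JOIN dbo.DimProductCategory AS c
           ON c.CategoryName = s.CategoryName
    WHERE  c.CategoryName IS NULL;

    INSERT INTO dbo.DimProductSubCategory (SubCategoryName, CategoryKey)
    SELECT DISTINCT s.SubCategoryName, c.CategoryKey
    FROM   stg.Product AS s
    JOIN   dbo.DimProductCategory AS c
           ON c.CategoryName = s.CategoryName
    LEFT JOIN dbo.DimProductSubCategory AS sc
           ON sc.SubCategoryName = s.SubCategoryName
    WHERE  sc.SubCategoryName IS NULL;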