I have an ETL-type requirement for SQL Server 2005. I am new to SSIS, but I believe it will be the right tool for the job.
The project is related to a loyalty card reward system. Each month, partners in the scheme send one or more XML files detailing the qualifying transactions from the previous month. Each XML file can contain up to 10,000 records. The format of the XML is very simple: 4 "header" elements, then a repeating sequence containing the record elements. The key record elements are card_number, partner_id and points_awarded.
The process is currently running in production, but it was developed as a C# app that runs an insert for each record individually. It is very slow, taking over 8 hours to process a 10,000-record file. By using SSIS I hope to improve both performance and maintainability.
What I need to do:
Collect the file
Validate against XSD
Business Rule Validation on the records. For each record I need to ensure that a valid partner_id and card_number have been supplied. To do this I need to execute a lookup against the partner and card tables. Any "bad" records should be stripped out and written to a response XML file. This is the same format as the request XML, with the addition of an error_code element. The "good" records need to be imported into a single table.
I have points 1 and 2 working OK. I have also created an XSLT to transform the XML into a flat format ready for insert. For point 3 I had started down the road of using a ForEach Loop Container on the control flow surface to loop over each XML node, together with the SQL Lookup task. However, this would require a database call for each lookup, plus file system calls to write out the XML files for the "bad" and "good" records.
I believe better performance could be achieved by using the Lookup transformation on the data flow surface. Unfortunately, I have no experience working with the data flow surface.
Does anyone have a suggestion as to the best way to solve the problem? I searched the web for examples of SSIS packages that do something similar to what I need but found none - are there any out there?
Thanks
Rob.
SSIS is frequently used to load data warehouses, so your requirement is nothing new. Take a look at this question/answer to get you started with tutorials etc.
For-each in the control flow is used to loop through files in a directory, tables in a database, etc. The data flow is where records pass through transformations from a source (your XML file) to a destination (tables).
You do need a lookup in one of its many flavors. Google for "ssis loading data warehouse dimensions"; this will eventually show you several techniques for using the lookup transformation efficiently.
To flatten the XML (if it's simple enough), I would simply use the XML source in the data flow; the XML task is for heavier stuff.
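To make that concrete, here is a rough sketch of the logic the data flow implements, written in Python purely for illustration (element names come from the question; the partner/card values and the error code are made up). Cached lookups against the partner and card tables split the stream into "good" rows bound for the destination table and "bad" rows bound for the response XML with its extra error_code element:

```python
import xml.etree.ElementTree as ET

# In SSIS these sets would be the Lookup transformation's cached reference
# data from the partner and card tables; values here are invented.
valid_partners = {"P001", "P002"}
valid_cards = {"1111", "2222"}

xml_in = """<records>
  <record><card_number>1111</card_number><partner_id>P001</partner_id><points_awarded>10</points_awarded></record>
  <record><card_number>9999</card_number><partner_id>P001</partner_id><points_awarded>5</points_awarded></record>
</records>"""

good, bad = [], []
for rec in ET.fromstring(xml_in).iter("record"):
    row = {child.tag: child.text for child in rec}
    if row["partner_id"] in valid_partners and row["card_number"] in valid_cards:
        good.append(row)                   # -> single destination table
    else:
        row["error_code"] = "INVALID_REF"  # -> response XML (hypothetical code)
        bad.append(row)
```

In the real package, the Lookup's "no match" output (or error output redirection) plays the role of the else branch, so no per-record database round trip is needed.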
I am using SSIS 2017, and part of what I am doing involves running several (30ish) SQL scripts whose output goes to flat files in the same folder. My question is: to do this, do I have to create 30 new file connections, or is there a way to define the folder I want all the outputs to go to and have them saved there?
I am only really thinking of keeping a tidy Connection Managers tab. If there's a more efficient way to do it than 30-something file connections, that would be great.
A data flow is tightly bound to the columns and types defined within for performance reasons.
If your use case is "I need to generate an extract of sales by year for the past 30ish" then yes, you can make do with a single Flat File Connection Manager because the columns and data types will not change - you're simply segmenting the data.
However, if your use case is "I need to extract Sales, Employees, Addresses, etc" then you will need a Flat File Connection Manager (and preferably a data flow) per entity/data shape.
It's my experience that you would be nicely served by designing this as 30ish packages (SQL Source -> Flat File Destination) with an overall orchestrator package that uses the Execute Package Task to run the dependent processes. Top benefits:
You can have a team of developers work on the individual packages
Packages can be re-run individually in the event of failure
Better performance
Being me, I'd also look at Biml and see whether you can't just script all that out.
Addressing comments
To future-proof location info, I'd define a project parameter named something like BaseFilePath (assuming the most probable change is the environment: in dev I use paths like C:\ssisdata\input\file1.txt and C:\ssisdata\input\file3.csv, while production would use \\server\share\input\file1.txt or E:\someplace\for\data\file1.txt). I would populate it with the dev value C:\ssisdata\input and then assign the production value \\server\share\input to the project via configuration.
The crucial piece is to ensure that an Expression exists on the Flat File Connection Manager's ConnectionString property, driven in part by the parameter's value. Again, being a programmatically lazy person, I have a Variable named CurrentFilePath with an expression like @[$Project::BaseFilePath] + "\\file1.csv"
The FFCM then uses @[User::CurrentFilePath] to ensure I write the file to the correct location. And since I create 1 package per extract, I don't have to worry about creating a Variable per Flat File Connection Manager, as it's all the same pattern.
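The path composition the expression performs is trivial but worth seeing end to end. A sketch (Python stands in for the SSIS expression language; the helper function and file name are illustrative):

```python
# Equivalent of the SSIS expression @[$Project::BaseFilePath] + "\\file1.csv":
# one project parameter carries the environment-specific folder, and each
# package appends its own file name.
def current_file_path(base_file_path: str, file_name: str) -> str:
    return base_file_path.rstrip("\\") + "\\" + file_name

# Dev and production differ only in the parameter value set via configuration.
dev = current_file_path(r"C:\ssisdata\input", "file1.csv")
prod = current_file_path(r"\\server\share\input", "file1.csv")
```

Swapping the single BaseFilePath value at deploy time repoints every extract at once, which is the whole point of the pattern.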
This question may seem very stupid, but I'm actually struggling to get it clear in my head.
I have some academic experience with SSIS, SSAS and SSRS.
In simple terms:
SSIS - Integration of data from a data source to a data destination;
SSAS - Building a cube of data, which allows you to analyze and explore the data.
SSRS - Allows you to create dashboards with charts, etc., from the data source...
Now, doing a comparison with Qlikview and Qliksense...
Can the Qlik products do exactly the same as SSIS, SSAS and SSRS? That is, can Qlik products do the extraction (SSIS), data processing (SSAS) and data visualization (SSRS)? Or do they work more on the SSRS side (creating dashboards from the data sources)? Do the Qlik tools do the ETL stages (extract, transform and load)?
I'm really struggling here, even after reading tons of information about it, so any clarification helps a lot!
Thanks,
Anna
Yes. Qlik (View and Sense) can be used as both an ETL tool and a presentation layer. Each single file (qvw/View and qvf/Sense) contains the script used for ETL (loading all the required data from all data sources and transforming it if needed), the actual data, and the visuals.
Depending on the complexity, a single file can be used for everything. But the process can be organised across multiple files as well (if needed). For example:
Extract - contains the script for the data extract (possibly with incremental load implemented if the data volumes are big) and stores the data in qvd files
Transform - loads the qvd files from the extraction process (qvd loads are quite fast) and performs the required transformations
Load - loads the data model from the transformation file (binary load) and creates the visualisations
Another example of multiple files: I had a project which required multiple extractor and multiple transformation files. Because the data was extracted from multiple data sources, to speed up the process we ran all the extractor files at the same time, then ran all the transformation files at the same time, then the main transform, which combined all the qvd files into a single data model.
In addition to the previous comment have a look at the layered Qlik architecture.
It describes quite well how you should structure your files.
However, I would not recommend using Qlik for a full-blown data warehouse (which you could do easily with SSIS), as it lacks some useful functions (e.g. helpers for slowly changing dimensions).
I have an SSIS data flow task that reads a CSV file with certain fields, tweaks it a little and inserts results into a table. The source file name is a package parameter. All is good and fine there.
Now I need to process a slightly different kind of CSV file with an extra field. This extra field can be safely ignored, so the processing is essentially the same. The only difference is in the column mapping of the data source.
I could, of course, create a copy of the whole package and tweak the data source to match the second file format. However, this "solution" seems like terrible duplication: if there are any changes in the course of processing, I will have to do them twice. I'd rather pass another parameter to the package that would tell it what kind of file to process.
The trouble is, I don't know how to make SSIS read from one data source or the other depending on the parameter, hence the question.
I would duplicate the Connection Manager (CSV definition) and Data Flow in the SSIS package and tweak them for the new file format. Then I would use the parameter you described to Enable/Disable either Data Flow.
In essence, SSIS doesn't work with variable metadata. If this is going to be a recurring pattern, I would deal with it upstream of SSIS by building a VB/C# command-line app to shred the files into SQL tables.
You could make the connection manager push all the data into 1 column. Then use a script transformation component to parse out the data to the output, depending on the number of fields in the row.
You can split the data based on delimiter into say a string array (I googled for help when I needed to do this). With the array you can tell the size of it and thus what type of file it is that has been connected to.
Then, your mapping to the destination can remain the same. No need to duplicate any components either.
I had to do something similar myself once: although the files I was using were meant to always be in the same format, the format could change depending on the version of the system sending the file, so by handling it in a script transformation this way I was able to cope with minor variations in the file format. If the files are 99% the same, that's fine; if they were radically different, you would be better off using a separate file connection manager.
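A minimal sketch of that script logic, written here in Python with assumed field names and delimiter (the real SSIS script component would be C# or VB): the whole line arrives as one column, gets split, and the field count decides which file variant this is:

```python
# Hypothetical row parser for the "one wide column" trick described above.
# Field names (name, qty, price) and the comma delimiter are assumptions;
# the real columns would match the destination mapping.
def parse_row(raw_line: str, delimiter: str = ","):
    fields = raw_line.split(delimiter)
    if len(fields) == 3:                     # original file format
        name, qty, price = fields
    elif len(fields) == 4:                   # new format with one extra field
        name, qty, price, _ignored = fields  # extra field safely ignored
    else:
        raise ValueError(f"unexpected field count: {len(fields)}")
    # Output columns stay identical either way, so the destination
    # mapping never has to change.
    return {"name": name, "qty": int(qty), "price": float(price)}
```

Because both branches emit the same output columns, a single data flow and a single destination mapping handle both file formats.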
I'm pretty inexperienced with SSIS, though I have much experience in SQL and C# and other technologies.
I am converting a task I have written as a stand-alone c# console app into an SSIS package.
I have an OLE DB input source that is a SQL command; this collects certain data from the database that I then feed into a Script Component transform. I use the input fields as parameters to an OAuth-based RESTful web service, which requires a lot of custom C# code to accomplish. The web service returns an XML response that includes many rows that must be output for each input row.
My understanding of how the script transform works is that it's more or less one row in, one row out.
I have several questions here really.
Is it a good idea to use the input source this way? Or is there a better way to feed input rows into my web service?
Is a script component transform the correct tool to use here? I can't use a normal web service because the web service is not SOAP or WCF based, and requires OAuth in the request. (or is there a way to use the web service component this way?)
How can I output more than one row for every input row?
Does SSIS support a way to take the XML results (that contain multiple rows) and map them to the rows of the output field in the script transform? I know there's an XML Input source, but that's not really this. I'm thinking something that takes XML input and spits out rows of data
UPDATE:
Data from the Web Service looks like this (extra cruft elided):
<user>
<item>
<col1>1</col1>
<col2>2</col2>
<col3>3</col3>
</item>
<item>
<col1>1</col1>
<col2>2</col2>
<col3>3</col3>
</item>
....
</user>
Essentially, the SQL data source returns a dataset of users. The users dataset is fed into the script and used as parameters for the web service calls. The web service calls return a set of XML results, which have multiple "rows" of data that must be output from the script.
In the above data, the outputs of the script would be multiple rows of col1, col2, and col3 for each user supplied in the input source. I need a way to extract those elements and put them into columns in the output buffer for each row of xml data. Or, a way to simply make the xml the output of the script and feed that output into another component to parse the xml into rows (like an XML source does, but obviously you can't put an XML source in the middle of a data flow).
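Sketching the desired expansion in Python for clarity (the actual script component would be C#; user_id is an assumed key carried over from the input row), each <item> in a response becomes one output row:

```python
import xml.etree.ElementTree as ET

# Flatten one web-service response into output rows: one row per <item>,
# keyed by the user the request was made for. Column names match the
# sample XML in the question.
def rows_from_response(user_id: str, response_xml: str):
    rows = []
    for item in ET.fromstring(response_xml).iter("item"):
        rows.append({
            "user_id": user_id,
            "col1": item.findtext("col1"),
            "col2": item.findtext("col2"),
            "col3": item.findtext("col3"),
        })
    return rows

xml = ("<user><item><col1>1</col1><col2>2</col2><col3>3</col3></item>"
       "<item><col1>1</col1><col2>2</col2><col3>3</col3></item></user>")
rows = rows_from_response("u1", xml)
```

In the SSIS version, each dictionary here corresponds to one row appended to the script component's output buffer.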
Answering what I can
Is it a good idea to use the input source this way? Or is there a better way to feed input rows into my web service?
It depends but generally, if your data is in a database, an OLE DB, or ADO.NET source is your preferred component for injecting it into the pipeline. Better? It depends on your needs but is there a reason you think it wouldn't be advisable? Nice benefits to using a data flow are built in buffering, parallelism, logging, configuration, etc. I'm assuming that or some other reason is leading you to move your .NET app into an Integration Services package so I would think if you're moving into this space, go whole hog.
Is a script component transform the correct tool to use here?
Definitely. The built-in web-service stuff is less-than-industrial-strength. You're already familiar with .NET so you're well positioned to take maximum advantage of that component.
How can output more than one row for every input row?
Yes. Your assumption of 1:1 input:output only holds for the default behaviour. By default, a script component is synchronous, so as you've observed, every input row has one output row. But by changing your script component into an asynchronous component, you can have a billion rows collapsed into a single row of output, or have 1 source row generate N rows of output. I had to do the latter for a Bill of Materials type problem: I'd receive a parent id and have to look up all the child rows associated with the parent. Anyway, the linked MSDN article describes how to make it async.
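A toy sketch of that 1-to-N pattern using the Bill of Materials example (data invented for illustration; in the real asynchronous component, each appended tuple corresponds to one Output0Buffer.AddRow() call):

```python
# Parent -> children reference data; in the real problem this came from
# a lookup against the Bill of Materials table.
children = {"bike": ["frame", "wheel", "wheel"], "frame": ["tube", "weld"]}

def expand(parent_ids):
    out = []
    for parent in parent_ids:            # one input row...
        for child in children.get(parent, []):
            out.append((parent, child))  # ...may emit many output rows
    return out
```

The key is that the output buffer is decoupled from the input buffer, so the component is free to add zero, one, or many rows per input row.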
Does SSIS support a way to take the XML results
I don't understand well enough what you're asking to address this. Dummy up some examples for this dummy and I'll see if it clicks.
This question is going to be a purely organizational question about SSIS project best practice for medium sized imports.
So I have source database which is continuously being enriched with new data. Then I have a staging database in which I sometimes load the data from the source database so I can work on a copy of the source database and migrate the current system. I am actually using a SSIS Visual Studio project to import this data.
My issue is that I have realised the actual design of my project is not really optimal, and now I would like to move this project to SQL Server so I can schedule the import instead of running the Visual Studio project manually. That means the actual project needs to be cleaned up and optimized.
So basically, for each table, the process is simple: truncate table, extract from source and load into destination. And I have about 200 tables. Extractions cannot be parallelized as the source database only accepts one connection at a time. So how would you design such a project?
I read in the Microsoft documentation that they recommend using one Data Flow per package, but managing 200 different packages seems quite impossible, especially as I will have to chain them for scheduling the import. On the other hand, a single package with 200 Data Flows seems unmanageable too...
Edit 21/11:
The first approach I wanted to use when starting this project was to extract my tables automatically by iterating over a list of table names. This could have worked out well if my source and destination tables had all had the same schema object names, but the source and destination databases being from different vendors (Btrieve and Oracle), they also have different naming restrictions. For example, Btrieve does not reserve names and allows names of more than 30 characters, which Oracle does not. So that is how I ended up manually creating 200 data flows with semi-automatic column mapping (most were automatic).
When generating the CREATE TABLE queries for the destination database, I created a reusable C# library containing the methods to generate the new schema object names, in case the methodology could be automated. If there were a custom tool to generate the packages that could use an external .NET library, that might do the trick.
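For illustration, a name-mapping helper of the kind described might look like the sketch below (Python stands in for the C# library; the reserved-word list and suffix rule are assumptions, not the author's actual code, though Oracle's classic 30-character identifier limit is real):

```python
# Derive an Oracle-legal identifier from a Btrieve table/column name.
# Btrieve allows long names and reserves nothing; Oracle reserves words
# and (pre-12.2) caps identifiers at 30 characters.
ORACLE_RESERVED = {"DATE", "LEVEL", "SIZE", "USER"}  # illustrative subset

def oracle_name(source_name: str) -> str:
    name = source_name.upper().replace(" ", "_")
    if name in ORACLE_RESERVED:
        name += "_T"      # suffix reserved words (assumed convention)
    return name[:30]      # enforce the 30-character limit
```

Driving package generation (e.g. via BimlScript, as suggested in the answer above) from such a function is what would make the 200-table mapping automatable.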
Have you looked into BIDS Helper's BIML (Business Intelligence Markup Language) as a package generation tool? I've used it to create multiple packages that all follow the same basic truncate-extract-load pattern. If you need slightly more cleverness than what's built into BIML, there's BimlScript, which adds the ability to embed C# code into the processing.
From your problem description, I believe you'd be able to write one BIML file and have that generate two hundred individual packages. You could probably use it to generate one package with two hundred data flow tasks, but I've never tried pushing SSIS that hard.
You can basically create 10 child packages, each having 20 data flow tasks, and create a master package which triggers these child packages. Using parent-to-child configuration, create a single XML configuration file. Define the precedence constraints in the master package so the packages execute in serial fashion. This way, maintainability will be better compared to having 200 packages or a single package with 200 data flow tasks.
The following link may be useful to you.
Single SSIS Package for Staging Process
Hope this helps!