SSIS data validation and cleaning

I need to do something like this:
The client puts data in an FTP folder (the data can be in one of three formats: .txt, .csv, or .xls). The SSIS package needs to pull the data from FTP and check the data file for correct format (e.g. last name is not empty, phone is 10 digits, zip code is 5 digits, address is not more than 20 characters, etc.).
After checking the data file, if everything is okay it should load the file into the dev database; if not, I need to run some cleaning queries (like taking the first 5 digits for zip, etc.) and then load the data. If a column is missing, it needs to send an email to the client asking for a different data file.
Until now I have done this task by manually importing the file and running a lot of SQL queries, which is time consuming. My manager asked me to write an SSIS package to automate this process.
I am fairly new to SSIS. Can someone give me an SSIS package design idea (I mean which task to use in which sequence, etc.) so I can try it and learn?
Thanks for the help

Here are a couple of suggestions:
Configure tasks to send errors caused by bad data to a separate file. This will identify problem rows while letting the good stuff continue. You can also use a Conditional Split to redirect rows with bad data, such as blank rows.
The Derived Column Transformation is handy to trim, format, slice, and dice data.
Use the Event Handler to send emails if a given condition is true.
Use the logging features. Very helpful in sorting out something that went sideways while you were sleeping.
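If none of the built-in transformations cover a rule, a Script Component transformation can do all the row-level checks in one place. Below is a minimal sketch in C# of the validations described in the question. The column names (LastName, Phone, Zip, Address) and the output names (ValidRows, BadRows) are assumptions; SSIS generates the DirectRowTo... methods from whatever you actually name the outputs, and Zip must be marked ReadWrite for the cleanup rule to stick.

    using System.Text.RegularExpressions;

    // Sketch of a Script Component (transformation) that validates each row and
    // redirects it to one of two outputs. Assumes input columns LastName, Phone,
    // Zip, Address and outputs named ValidRows / BadRows in the same exclusion group.
    public class ScriptMain : UserComponent
    {
        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            bool ok = true;

            // Last name must not be empty.
            if (Row.LastName_IsNull || Row.LastName.Trim().Length == 0)
                ok = false;

            // Phone must be 10 digits once non-digits are stripped.
            string phone = Row.Phone_IsNull ? "" : Regex.Replace(Row.Phone, @"\D", "");
            if (phone.Length != 10)
                ok = false;

            // Zip: keep the first 5 digits (the cleanup rule); fail if fewer than 5.
            string zip = Row.Zip_IsNull ? "" : Regex.Replace(Row.Zip, @"\D", "");
            if (zip.Length >= 5)
                Row.Zip = zip.Substring(0, 5);
            else
                ok = false;

            // Address must be no more than 20 characters.
            if (!Row.Address_IsNull && Row.Address.Length > 20)
                ok = false;

            if (ok)
                Row.DirectRowToValidRows();
            else
                Row.DirectRowToBadRows();
        }
    }

The same checks could also be expressed as Conditional Split and Derived Column expressions; the script version just keeps all the rules together.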

Related

Multiple Flat File Outputs to a folder in SSIS

I am using SSIS 2017, and part of what I am doing involves running several (30ish) SQL scripts whose output goes to flat files in the same folder. My question is: do I have to create 30 new File Connections to do this, or is there a way to define the folder I want all the outputs to go to and have them saved there?
I am only really thinking of keeping a tidy Connection Manager tab. If there's a more efficient way to do it than 30-something file connections, that would be great.
A data flow is tightly bound to the columns and types defined within it, for performance reasons.
If your use case is "I need to generate an extract of sales by year for the past 30ish years" then yes, you can make do with a single Flat File Connection Manager, because the columns and data types will not change - you're simply segmenting the data.
However, if your use case is "I need to extract Sales, Employees, Addresses, etc." then you will need a Flat File Connection Manager (and preferably a data flow) per entity/data shape.
It's my experience that you would be nicely served by designing this as 30ish packages (SQL Source -> Flat File Destination) with an overall orchestrator package that uses the Execute Package Task to run the dependent processes. Top benefits:
You can have a team of developers work on the individual packages
Packages can be re-run individually in the event of failure
Better performance
Being me, I'd also look at Biml and see whether you can't just script all that out.
Addressing comments
To future-proof the location info, I'd define a project parameter named something like BaseFilePath (assuming the most probable change is the path: in dev I use something like C:\ssisdata\input\file1.txt and C:\ssisdata\input\file3.csv, whereas production would be \\server\share\input\file1.txt or E:\someplace\for\data\file1.txt). I would populate it with the dev value C:\ssisdata\input and then assign the production value \\server\share\input to the project via configuration.
The crucial piece is to ensure that an Expression exists on the Flat File Connection Manager's ConnectionString property so that it is driven, in part, by the parameter's value. Again, being a programmatically lazy person, I have a Variable named CurrentFilePath with an expression like @[$Project::BaseFilePath] + "\\file1.csv".
The Flat File Connection Manager then uses @[User::CurrentFilePath] for its ConnectionString to ensure I write the file to the correct location. And since I create one package per extract, I don't have to worry about creating a Variable per flat file connection manager, as it's all the same pattern.

Access throwing unparseable records error from Cognos Report

Before I begin: yes, I know everything I am about to explain is back-asswards, but at the moment it is what I have to work with.
My organization is in the middle of a Cognos 10 BI implementation, and at the moment we are having HUGE performance issues with the data cubes, severely hampering our end users' ability to slice data in an ad-hoc fashion. Historically we used large data extracts from SAP, manipulated in MS Access, to provide a daily-updated data source that end users could pivot around in Excel.
As this is NOT transactional data, it worked: we never had more than half a million records, so performance was never an issue.
As our implementation team has been unable to provide management with functioning data cubes that we can use to provide static views and reports, I have been tasked with using Cognos data extracts to re-create the old system temporarily.
The issue I am running into is that randomly (3 times one week, once the next) the files will contain unparseable records. I doubt it is a special-character issue, as I can re-download the file and it works fine the 2nd or 3rd time.
Does anyone have any experience with something similar? I realize the data sets provided by Cognos were not designed for this purpose, but it seems strange that 20 percent of the files contain corruption. Also strange: when I select an .xls spreadsheet as the download format, it seems to actually be a Unicode text file with the extension changed to .xls.
Any insight would be appreciated.
EDIT: Diffing the files will be my next experiment, even though they should be byte-for-byte comparable. I HAVE, however, compared the specific records that are unparseable in one file yet parseable in the next, and have not found any difference.
As for the import, I manually convert the file to Unicode text and import it in that format.
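Not a fix, but one way to pin down which rows Access considers unparseable is a small stand-alone check that compares each row's field count against the header. This is only a sketch under assumptions: that the export really is a tab-delimited Unicode (UTF-16) text file, as described above, and that a mismatched field count is what breaks parsing. Running it against a "bad" download and a "good" re-download of the same extract and comparing the output might show whether the corruption is structural.

    using System;
    using System.IO;
    using System.Text;

    // Sketch: report rows whose field count differs from the header row.
    // Assumes a tab-delimited file saved as Unicode (UTF-16) text.
    class FindBadRows
    {
        static void Main(string[] args)
        {
            string path = args[0];
            using (var reader = new StreamReader(path, Encoding.Unicode))
            {
                string header = reader.ReadLine();
                int expected = header.Split('\t').Length;

                string line;
                int lineNumber = 1;
                while ((line = reader.ReadLine()) != null)
                {
                    lineNumber++;
                    int actual = line.Split('\t').Length;
                    if (actual != expected)
                        Console.WriteLine("Line {0}: {1} fields, expected {2}",
                                          lineNumber, actual, expected);
                }
            }
        }
    }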

How to make SSIS choose data source depending on parameter?

I have an SSIS data flow task that reads a CSV file with certain fields, tweaks it a little and inserts results into a table. The source file name is a package parameter. All is good and fine there.
Now I need to process a slightly different kind of CSV file with an extra field. This extra field can be safely ignored, so the processing is essentially the same. The only difference is in the column mapping of the data source.
I could, of course, create a copy of the whole package and tweak the data source to match the second file format. However, this "solution" seems like terrible duplication: if there are any changes in the course of processing, I will have to do them twice. I'd rather pass another parameter to the package that would tell it what kind of file to process.
The trouble is, I don't know how to make SSIS read from one data source or another depending on a parameter, hence the question.
I would duplicate the Connection Manager (CSV definition) and Data Flow in the SSIS package and tweak them for the new file format. Then I would use the parameter you described to Enable/Disable either Data Flow.
In essence, SSIS doesn't work with variable metadata. If this is going to be a recurring pattern, I would deal with it upstream of SSIS by building a VB / C# command-line app to shred the files into SQL tables.
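If you do go that upstream route, a minimal sketch of such a shredder is below. Everything named here is illustrative: it assumes a comma-delimited file with a header row, that only the first three columns matter (the extra trailing field is ignored), and a staging table whose column order matches.

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using System.Linq;

    // Sketch of a command-line shredder: read a CSV (with or without the extra
    // trailing field), keep the first three columns, and bulk load them.
    // The table name, column names and connection string are placeholders.
    class Shredder
    {
        static void Main(string[] args)
        {
            string csvPath = args[0];
            string connectionString = args[1];

            var table = new DataTable();
            table.Columns.Add("Field1");
            table.Columns.Add("Field2");
            table.Columns.Add("Field3");

            foreach (string line in File.ReadLines(csvPath).Skip(1)) // skip header
            {
                string[] parts = line.Split(',');
                // Any fields beyond the first three are ignored.
                table.Rows.Add(parts[0], parts[1], parts[2]);
            }

            using (var bulk = new SqlBulkCopy(connectionString))
            {
                bulk.DestinationTableName = "dbo.Staging"; // hypothetical table
                bulk.WriteToServer(table);
            }
        }
    }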
You could make the connection manager push all the data into one column, then use a Script Component transformation to parse the data out to the output, depending on the number of fields in the row.
You can split the data on the delimiter into, say, a string array (I googled for help when I needed to do this). From the size of the array you can tell what type of file has been connected.
Then your mapping to the destination can remain the same, and there is no need to duplicate any components either.
I had to do something similar myself once: although the files I was using were meant to always be the same format, it could change depending on the version of the system sending the file, and handling it in a script transformation this way let me cope with the minor variations in the file format. If the files are 99% the same, that is fine; if they were radically different, you would be better off with a separate file connection manager. A sketch of this approach follows.
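A minimal sketch of that Script Component, assuming the flat file connection manager delivers the whole line in a single input column called Line, a comma delimiter, and output columns Field1..Field3 on the synchronous output; all of those names are made up for illustration.

    // Sketch of a Script Component (transformation) that receives the whole row
    // in one column and splits it itself, so both file layouts flow through the
    // same data flow.
    public class ScriptMain : UserComponent
    {
        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            string[] parts = Row.Line.Split(',');

            // 3 fields = original layout, 4 fields = variant with the extra column.
            // Either way, the first three map to the same destination columns.
            Row.Field1 = parts[0];
            Row.Field2 = parts[1];
            Row.Field3 = parts[2];

            // parts.Length tells you which variant arrived, if you ever need
            // to branch on it (for example, to redirect or log the extra field).
        }
    }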

How to periodically extract data from a CSV file?

I'm currently working on some QA projects. I am running tests (which can vary from a couple of minutes to 2-3 days) in an application that generates some CSV files and updates them periodically, adding a new row with each update (once every couple of seconds or so).
Each CSV file is structured like this:
Header1,Header2,Header3,.................,HeaderN
numerical_value11,numerical_value12,numerical_value13,......,numerical_value1N,
numerical_value21,numerical_value22,numerical_value23,......,numerical_value2N,
etc
The number of columns may vary from CSV file to CSV file.
I am running in a Windows environment. I also have Cygwin (http://www.cygwin.com/) installed.
Is there a way I can write a script that runs periodically (once per hour or so), extracts data (a single value or multiple values from a row, or the average of the values from the specific rows added to the CSV between interrogations), and sends email alerts if, for example, the data in one column is out of range?
Thx
This can be done in several ways. Basically, you need to:
1) Write a script, in maybe Perl or Python, that does one iteration of what you want it to do.
2) Use the Windows scheduler to run this script at the frequency you want. The Windows scheduler is very easy to set up from the Control Panel.
Using Windows' scheduling you can very easily get the interval part down; for the parsing and alerting, however, you have a few options. I myself would use C# to write the program. If you want an actual script, VBA is a viable choice and could very easily parse a basic CSV file and connect to the web to send an email. If you already have Office installed, this should give you some more detail. Hope that helps.
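A minimal C# sketch of that route, meant to be scheduled hourly with the Windows Task Scheduler. The file path, the column being watched, the allowed range, the row window, and the SMTP details are all placeholder assumptions.

    using System;
    using System.IO;
    using System.Linq;
    using System.Net.Mail;

    // Sketch: read a CSV, average one numeric column over the most recent rows,
    // and send an email alert if the average falls outside an allowed range.
    class CsvMonitor
    {
        static void Main()
        {
            const string path = @"C:\tests\results.csv"; // hypothetical file
            const int column = 2;                        // column to watch
            const double min = 0.0, max = 100.0;         // allowed range
            const int window = 120;                      // roughly the rows added in an hour

            var values = File.ReadLines(path)
                             .Skip(1)                                  // skip header
                             .Select(line => line.Split(','))
                             .Where(fields => fields.Length > column)
                             .Select(fields => double.Parse(fields[column]))
                             .ToList();

            if (values.Count == 0) return;

            double average = values.Skip(Math.Max(0, values.Count - window)).Average();

            if (average < min || average > max)
            {
                using (var smtp = new SmtpClient("smtp.example.com")) // hypothetical host
                {
                    smtp.Send("alerts@example.com", "qa-team@example.com",
                              "CSV value out of range",
                              string.Format("Average of column {0} is {1:F2} (allowed {2}-{3}).",
                                            column, average, min, max));
                }
            }
        }
    }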

Load XML Using SSIS

I have an ETL-type requirement for SQL Server 2005. I am new to SSIS, but I believe it will be the right tool for the job.
The project is related to a loyalty card reward system. Each month, partners in the scheme send one or more XML files detailing the qualifying transactions from the previous month. Each XML file can contain up to 10,000 records. The format of the XML is very simple: 4 "header" elements, then a repeating sequence containing the record elements. The key record elements are card_number, partner_id and points_awarded.
The process is currently running in production, but it was developed as a C# app which runs an insert for each record individually. It is very slow, taking over 8 hours to process a 10,000-record file. By using SSIS I am hoping to improve performance and maintainability.
What I need to do:
Collect the file
Validate against XSD
Business Rule Validation on the records. For each record I need to ensure that a valid partner_id and card_number have been supplied. To do this I need to execute a lookup against the partner and card tables. Any "bad" records should be stripped out and written to a response XML file. This is the same format as the request XML, with the addition of an error_code element. The "good" records need to be imported into a single table.
I have points 1 and 2 working OK. I have also created an XSLT to transform the XML into a flat format ready for insert. For point 3 I had started down the road of using a ForEach Loop Container on the control flow surface to loop over each XML node, with a SQL lookup task. However, this would require a call to the database for each lookup and a call to the file system to write out the XML files for the "bad" and "good" records.
I believe that better performance could be achieved by using the Lookup component on the data flow surface. Unfortunately, I have no experience of working with the data flow surface.
Does anyone have a suggestion as to the best way to solve the problem? I searched the web for examples of SSIS packages that do something similar to what I need but found none - are there any out there?
Thanks
Rob.
SSIS is frequently used to load data warehouses, so your requirement is nothing new. Take a look at this question/answer to get started with tutorials etc.
A ForEach loop in the control flow is used to loop through files in a directory, tables in a database, etc. The data flow is where records fly through transformations from a source (your XML file) to a destination (tables).
You do need a lookup, in one of its many flavors. Google "ssis loading data warehouse dimensions"; that will eventually show you several techniques for using the Lookup transformation efficiently.
To flatten the XML (if it is simple enough), I would simply use the XML Source in the data flow; the XML Task is for heavier stuff.