Sorting files with the same header names using SSIS

I have a folder with a lot of data files in it. I want to be able to loop through the files, look at the headers, and sort the files into folders if they have the same headers. Is that possible to do in SSIS? If so, would anyone be able to point me in the direction of how to do this?

I am going to try and explain this as best I can without writing a book, as this is a multi-step process that isn't too complex but might be hard to explain with just text. My apologies, but I do not have access to SSDT at the moment, so I cannot provide images to aid here.
I would use the TextFieldParser class from Microsoft.VisualBasic.dll in a Script Task. This will allow you to read the header row from each file into a string array. You can then join the string array into a single delimited string and load an Object variable with a DataTable populated with two columns: the first being the file name and the second being the delimited headers.
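A minimal sketch of that first step, assuming the body runs inside the Script Task's Main method (with the usings at the top of the script); the folder path, delimiter, and variable name are illustrative, not from the question:

```csharp
using System.Data;
using System.IO;
using Microsoft.VisualBasic.FileIO;  // TextFieldParser lives here; reference Microsoft.VisualBasic.dll

// Build a DataTable of (FileName, Headers) pairs for every file in the folder.
var headerTable = new DataTable();
headerTable.Columns.Add("FileName", typeof(string));
headerTable.Columns.Add("Headers", typeof(string));

foreach (string file in Directory.GetFiles(@"C:\Data\Incoming"))
{
    using (var parser = new TextFieldParser(file))
    {
        parser.TextFieldType = FieldType.Delimited;
        parser.SetDelimiters(",");
        string[] headers = parser.ReadFields();  // reads only the first (header) row
        headerTable.Rows.Add(file, string.Join("|", headers));
    }
}

// Hand the result to the rest of the package via an Object variable.
Dts.Variables["User::HeaderData"].Value = headerTable;
```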
Once you have this variable you can load a SQL table with this information. (Optional to skip if you want to load the columns directly into SQL as you read them; your call.)
Once you have your SQL table, you can create an enumerator for that dataset based on the unique headers column.
Then use a Foreach Loop container with a Script Task to enumerate through the unique header sets, and an Execute SQL Task to retrieve the file names that belong to each unique header set.
Within the script, loop through the returned file names and apply the necessary logic to move the files to their respective folders, as in the sketch below.
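The move itself can stay small; a hedged sketch, where fileNames and headerSetId are hypothetical stand-ins for the values retrieved for the current header set:

```csharp
using System.IO;

// Move every file belonging to the current header set into its own folder.
void MoveFiles(string[] fileNames, string headerSetId)
{
    string destinationFolder = Path.Combine(@"C:\Data\Sorted", headerSetId);
    Directory.CreateDirectory(destinationFolder);  // no-op if the folder already exists
    foreach (string fileName in fileNames)
    {
        File.Move(fileName, Path.Combine(destinationFolder, Path.GetFileName(fileName)));
    }
}
```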
This is a high-level overview, as I am assuming you are familiar enough with SSIS to understand what is needed to complete each step. If not, I can elaborate later in the day when I am able to get to my SSIS rig.

Related

Validate .CSV output in ADF

I have a Data Flow that searches for information inside each one of my databases and returns it in several .CSV files. When the search returns data, the .CSV contains headers and the data that was found. When it does not, the .CSV contains only the headers. After that, all the .CSV files are moved into a SharePoint folder through a Logic App.
My question is: I need to sort those .CSV files into two folders, "with data" and "no data", to make it easier to check which of them has data. I have tried to use a Conditional Split in my Data Flow but it does not work. Does anyone have any suggestions on how to deal with that?
Since your incoming files either have rows or are empty, you can add an additional input stream with only a header (similar to your CSV with only a header and no rows). You can use this to compare with the earlier input stream to decide whether a file is empty. You could use a Lookup activity and then an If Condition activity to decide whether you need to run the Copy activity.
In the Lookup activity, you could set firstRowOnly to true since you only want to check whether there is data. Check whether the first row is empty; if yes, copy that file to the "no data" folder, else copy it to the "with data" folder. Use a Conditional Split here to direct the files to different streams to sink (copy), or use Copy activities in the pipeline separately.
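For example, with firstRowOnly enabled, the If Condition activity's expression can test whether the Lookup returned a first row at all (the activity name here is an assumption):

```
@not(contains(activity('LookupHeaderCheck').output, 'firstRow'))
```

This evaluates to true for a header-only file, which can then be routed to the "no data" folder.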
From my repro:
1. Consider inputs with and without data (CSV files)
2. Use lookup activity to compare input with a predefined empty source file.
3. Use conditional split activity with suitable condition expression depending on your data schema.
4. Route the files to the appropriate folders using the sink.
Refer: Lookup transformation in mapping data flow, Conditional split transformation in mapping data flow, Data transformation expressions in mapping data flow

Variable in recordset destination replaced when the dataflow is in For Each loop container?

I have an Excel spreadsheet with multiple sheets, so in a Foreach Loop container I have a script that reads the sheets and saves them to a variable. In a data flow, still inside the Foreach Loop container, is the process that leads to a Recordset Destination, which saves all the columns to another variable. Then, outside the Foreach Loop container, another data flow has to read all rows from that variable, check for duplicates (the second and third sheets contain duplicate product IDs), remove the duplicates, and upload the data into the database. I have been searching everywhere and cannot find how to set up the Recordset Destination to append values to the variable rather than replace it, because I end up with only the last sheet of data.
I cannot make changes to the Foreach Loop container settings because they control the looping through the sheets.
Thank you in advance for any advice.
Hopefully someone wiser in SSIS will chime in here, but I don't think your current approach will work.
Generally speaking, you can use Expressions within SSIS to get dynamic behaviour. However, the VariableName of the Recordset Destination does not support that.
You might be able to have a Script Task after the Data Flow that copies from rsCurrent into rsIteration1, rsIteration2, etc. based on the current loop, but at that point you're double-copying data for no real value.
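If you do go down the Script Task road, a variation that appends each iteration into one accumulator is sketched below; rsCurrent is the name used above, while rsAccumulated is a hypothetical Object variable added for this purpose:

```csharp
using System.Data;
using System.Data.OleDb;

public void Main()
{
    // Fill a DataTable from the recordset produced by this iteration.
    var adapter = new OleDbDataAdapter();
    var current = new DataTable();
    adapter.Fill(current, Dts.Variables["User::rsCurrent"].Value);

    // Merge into the accumulator, creating it on the first iteration.
    var accumulated = Dts.Variables["User::rsAccumulated"].Value as DataTable;
    if (accumulated == null)
    {
        accumulated = current.Clone();  // copies the schema only
    }
    accumulated.Merge(current);
    Dts.Variables["User::rsAccumulated"].Value = accumulated;

    Dts.TaskResult = (int)ScriptResults.Success;
}
```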
Since you're doing a duplicate check, perhaps a read of sheet 1 goes into a Cache Connection Manager, and then the reads from subsequent sheets use the CCM as a lookup. For rows that have matches, you know you have duplicates (or maybe you only import what doesn't match; I don't quite get your logic).
Debugging some of this is going to be quite challenging. If at all possible, I would stage the data to tables. There you could load all the data + the tab name and then you can test your deduplication and refer back to your inputs and outputs.
The tooling for SSIS variables of type Object is pretty limited, which is a pity.

How to pass Multiple Inputs to an SSIS Script Component

I have a custom source data flow component whose output will differ every time, and I need to insert those records into a destination table.
Problem:
I can't specify the input columns at design time for the destination component, because for every call to the data flow task the source component is going to return different output columns based on the table schema.
Solution needed:
How can a destination data flow component accept whatever inputs are available without any mapping (either with an existing component or a custom component)?
The data flow's fixed structure is there for data validation and optimization purposes. All of its components are going to have fixed input and output columns. I would suggest the following possibilities:
Write a data flow for every possible schema. There are probably a finite number of possibilities. You could reduce the effort of this task by using BIML, which could generate the package structure for you. This may also introduce the possibility of parallel loading.
Use a script task instead of a data flow. In the script task, write the rows for each input into a table; a sketch of this idea follows.
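A rough sketch of that second suggestion, assuming you can read the source into a DataTable at runtime; the method name, connection string, and table name are placeholders:

```csharp
using System.Data;
using System.Data.SqlClient;

// Bulk copy a DataTable whose columns are only known at runtime.
// Explicit mappings make SqlBulkCopy match columns by name instead of
// by ordinal, so no design-time metadata is needed.
void LoadTable(DataTable sourceRows, string connectionString, string destinationTable)
{
    using (var bulkCopy = new SqlBulkCopy(connectionString))
    {
        bulkCopy.DestinationTableName = destinationTable;
        foreach (DataColumn column in sourceRows.Columns)
        {
            bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
        }
        bulkCopy.WriteToServer(sourceRows);
    }
}
```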
If you need to pass multiple inputs to a single script component, the only way I know to do this is by passing the multiple inputs to a UNION ALL component, and then passing the single output from the UNION ALL to the Script.
You'll have to account for any differences between the two column structures in the UNION ALL, and maybe use Derived Columns if you need an easy way to identify which original input a row came from.
I know this is way late but I keep seeing this UNION ALL approach and don't like it.
How about this approach:
1. Run both data flows into their own Recordset Destination and save each into a variable of type Object.
2. Create a new data flow and use a Script Component source to bring in both ADO objects.
3. Fill DataTables using an adapter and then do whatever you want with them.
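A minimal sketch of that Script Component source, assuming the two Object variables are exposed as read-only variables named rsFirst and rsSecond and the output has a single string column (all hypothetical names):

```csharp
using System.Data;
using System.Data.OleDb;

public override void CreateNewOutputRows()
{
    // Load each ADO recordset variable into its own DataTable.
    var adapter = new OleDbDataAdapter();
    var first = new DataTable();
    var second = new DataTable();
    adapter.Fill(first, Variables.rsFirst);
    adapter.Fill(second, Variables.rsSecond);

    // Emit the rows from both tables through a single output.
    foreach (DataRow row in first.Rows)
    {
        Output0Buffer.AddRow();
        Output0Buffer.SomeColumn = row["SomeColumn"].ToString();
    }
    foreach (DataRow row in second.Rows)
    {
        Output0Buffer.AddRow();
        Output0Buffer.SomeColumn = row["SomeColumn"].ToString();
    }
}
```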

Process CSV file with multiple tables in SSIS

I am trying to figure out if it's possible to pre-process a CSV file in SSIS before importing the data into SQL.
I currently receive a file that contains 8 tables with different structures in one flat file.
The tables are identified by a row containing the table name encapsulated in square brackets, e.g. [DOL_PROD].
The data is underneath in standard CSV format: headers first and then the data.
The tables are separated by a blank line, and the pattern repeats for the next 7 tables.
[DOL_CONSUME]
TP Ref,Item Code,Description,Qty,Serial,Consume_Ref
12345,abc,xxxxxxxxx,4,123456789,abc
[DOL_ENGPD]
TP Ref,EquipLoc,BackClyLoc,EngineerCom,Changed,NewName
Is it possible to split it out into separate CSV files, or process it in a loop?
I would really like to be able to perform this all with SSIS automatically.
Kind Regards,
Adam
You can't do that with a Flat File Source and connection manager alone.
There are two ways to achieve your goal:
You can use a Script Component as the source of the rows and process the file there; then you can do whatever you want with the file programmatically.
The other way is to read your flat file treating every row as a single column (i.e. without specifying a delimiter), and then, via data flow transformations, split rows, recognize table names, split the flow, and so on.
I'd strongly advise you to use the Script Component, even if you have to learn .NET first, because the second option will be a nightmare :). I'd use a Flat File Source to extract lines from the file as a single column, and then work on them in a Script Component, rather than reading a "raw" file directly.
Here's a resource that should get you started: http://furrukhbaig.wordpress.com/2012/02/28/processing-large-poorly-formatted-text-file-with-ssis-9/
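As a starting point, here is a hedged sketch of a Script Task body for splitting the combined file into one CSV per [TableName] section; both paths are placeholders:

```csharp
using System.IO;

string inputPath = @"C:\Data\combined.csv";    // placeholder paths
string outputFolder = @"C:\Data\Split";
StreamWriter writer = null;

foreach (string line in File.ReadLines(inputPath))
{
    if (line.StartsWith("[") && line.TrimEnd().EndsWith("]"))
    {
        // A new table section starts: close the previous file, open a new one.
        if (writer != null) writer.Dispose();
        string tableName = line.Trim().Trim('[', ']');
        writer = new StreamWriter(Path.Combine(outputFolder, tableName + ".csv"));
    }
    else if (!string.IsNullOrWhiteSpace(line) && writer != null)
    {
        writer.WriteLine(line);  // header or data row for the current table
    }
}
if (writer != null) writer.Dispose();
```

Each resulting file then has a single, consistent structure, so a Foreach Loop with a per-table data flow can import them.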

Reading Data from Header in a flat file in SSIS

I have a pipe-delimited flat file that SSIS is reading in. This flat file has 7 header rows. There is an option to skip (n) header rows, but the problem is that I need the ability to retrieve data from these rows as well.
What is the best way of retrieving this information to be used later in the data flow?
A couple of things to try.
If there is a field that denotes the header, you can read in all the data and then use a Conditional Split to separate the header records from the data.
When all else fails, you could always use a Script Component of type Source; a sketch follows.
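A minimal Script Component source sketch along those lines, with the file path, the 7-row header count, and the output/column names as assumptions:

```csharp
using System.IO;

public override void CreateNewOutputRows()
{
    int lineNumber = 0;
    foreach (string line in File.ReadLines(@"C:\Data\input.txt"))
    {
        lineNumber++;
        if (lineNumber <= 7)
        {
            // Route the header rows to their own output for use downstream.
            HeaderOutputBuffer.AddRow();
            HeaderOutputBuffer.HeaderLine = line;
        }
        else
        {
            // Split the pipe-delimited detail rows into columns.
            string[] fields = line.Split('|');
            DetailOutputBuffer.AddRow();
            DetailOutputBuffer.Field1 = fields[0];
            DetailOutputBuffer.Field2 = fields[1];
        }
    }
}
```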