SSIS - Process a flat file with varying data

I have to process a flat file whose syntax is as follows, one record per line.
<header>|<datagroup_1>|...|<datagroup_n>|[CR][LF]
The header has a fixed-length field format that never changes (ID, timestamp, etc.). However, there are different types of data groups and, even though each is fixed-length, the number of fields varies depending on the data group type. The first three digits of a data group define its type. The number of data groups in each record also varies.
My idea is to have a staging table into which I would insert all the data groups. So two records like this,
12320160101|12323456KKSD3467|456SSGFED43520160101173802|
98720160102|456GGLWSD45960160108854802|
would produce three records in the staging table:
ID Timestamp Data
123 01/01/2016 12323456KKSD3467
123 01/01/2016 456SSGFED43520160101173802
987 02/01/2016 456GGLWSD45960160108854802
This would allow me to preprocess the staged records for further processing (some would be discarded, some have their data broken down further). My question is how to break down the flat file into the staging table. I can split the entire record with pipe (|) and then use a Derived Column Transformation to break down the header with SUBSTRING. After that it gets trickier because of the varying number of data groups.

The solution I came up with myself doesn't try to split at the flat file source, but rather in a script. My Data Flow looks like this.
So the Flat File Source output is just a single column containing the entire line. The Script Component contains output columns for each column in the Staging table. The script looks like this.
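public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Requires "using System.Globalization;" at the top of the script for CultureInfo.
    var splits = Row.Line.Split('|');
    // splits[0] is the header; every later element is one data group.
    for (int i = 1; i < splits.Length; i++)
    {
        // The trailing pipe produces an empty last element; skip it.
        if (splits[i].Length == 0) continue;
        Output0Buffer.AddRow();
        // These offsets do not match the simplified sample above; adjust them to your header layout.
        Output0Buffer.ID = splits[0].Substring(0, 11);
        Output0Buffer.Time = DateTime.ParseExact(splits[0].Substring(14, 14), "yyyyMMddHHmmssFFF", CultureInfo.InvariantCulture);
        Output0Buffer.Datagroup = splits[i];
    }
}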
Note that the SynchronousInputID property (Script Transformation Editor > Inputs and Outputs > Output0) must be set to None. Otherwise you won't have Output0Buffer available in your script. Finally, the OLE DB Destination just maps the script output columns to the Staging table columns. This solves the problem I had with creating multiple output records from a single input record.
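To check the splitting logic outside SSIS, here is a minimal Python sketch of the same idea, run against the two sample records from the question. It assumes the simplified header layout of the sample (a 3-character ID followed by a yyyyMMdd date), not the longer real header the C# offsets refer to; the function name split_record is made up for illustration.

```python
from datetime import datetime


def split_record(line):
    """Turn one '<header>|<group>|...|' line into (id, timestamp, datagroup) rows."""
    # Drop the line terminator, then split on the pipe delimiter.
    parts = line.rstrip("\r\n").split("|")
    header = parts[0]
    # Assumed header layout, taken from the sample data: 3-char ID + yyyyMMdd.
    record_id = header[:3]
    timestamp = datetime.strptime(header[3:11], "%Y%m%d")
    # The trailing pipe yields an empty final element, so skip empty groups.
    return [(record_id, timestamp, group) for group in parts[1:] if group]


lines = [
    "12320160101|12323456KKSD3467|456SSGFED43520160101173802|\r\n",
    "98720160102|456GGLWSD45960160108854802|\r\n",
]
rows = [row for line in lines for row in split_record(line)]
for row in rows:
    print(row)
```

Each input line yields one output row per non-empty data group, which is the one-to-many behaviour the asynchronous Script Component output provides.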

Related

SSIS consolidate and concatenate multiple rows into single rows without using SQL

I am trying to accomplish something that is pretty easy to do in SQL, but seemingly very challenging to do in SSIS without using SQL. Basically, I need to consolidate and concatenate a field of a many-to-one relationship.
Given entities: [Contract Item] (many) to (one) [Account]
There is a field [ari_productsummary] that contains the product listed on the Contract Item entity. We want to write that value to the Account as [ari_activecontractitems]. However, an Account may have more than one Contract Item record associated to it, in which case, we want to concatenate those values. We also only want the distinct values to be concatenated (distinct rows already solved within my data flow).
This can be accomplished by writing to a temporary table, and then using a query or view to obtain the summarized results as follows. I created a SQL table called TESTTABLE that contains the [ari_productsummary] from the Contract Item entity along with the referring [accountid] to map it back to Account. I then wrote the following query as a view:
SELECT DISTINCT accountid,
    (SELECT TT2.ari_productsummary + '; '
     FROM TESTTABLE TT2
     WHERE TT2.accountid = TT.accountid
     FOR XML PATH ('')
    ) AS 'ari_activecontractitems'
FROM TESTTABLE TT
Executing that Query provides me the results that I want, which I can then use for importing into the Account entity as shown below:
But how do I do this in an SSIS data flow without writing to a SQL table as a temporary placeholder for the data? I want to do the entire process inside one data flow container, without using a temporary SQL table/view. The whole summarization process needs to be done on the fly:
Does anyone have a solution that doesn't require a temporary SQL table/view/query, but is contained entirely within the data flow?
I am using VS 2017 and the KingswaySoft Dynamic CRM 365 ETL toolset to develop my solution/package.
Spitballing here, as I don't have Dynamics, nor do I have the custom components.
Data Flow 1 - Contract aggregation
The purpose of this data flow is to replicate your logic in the elegant query you provided and shove that into a Cache Connection Manager (see Notes for 2008+ at the end)
KingswaySoft Dynamics Source -> Script Component -> Cache Transform
If you want to keep the sort in there, do it before the Script Component. The implementation I'll take with the Script Component is fully blocking: all the rows must arrive before it can send any on. Components like the Merge Join are only partially blocking because the requirement of sorted data means that once you no longer have a match for the current item, you can send it on down the pipeline.
The Script Component is going to be an asynchronous transformation. You'll have two output columns: your key, accountid, and your new derived column, ari_activecontractitems. That column might need to be big. You'll know your data best, but if it's a blob type in Dynamics (> 4k Unicode or > 8k ASCII characters) then you'll have to define the data type as DT_TEXT/DT_NTEXT.
As inputs, you'll select accountid and ari_productsummary from your source.
The code should be pretty easy. We're going to accumulate the inbound data into a Dictionary.
// member variable
Dictionary<string, List<string>> accumulator;
In the PreExecute method, we'll initialize our variable:
// initialize in PreExecute method
accumulator = new Dictionary<string, List<string>>();
In the per-row method (Input0_ProcessInputRow if you keep the default input name):
// simulate the inbound queue
// row_id would be something like Row.row_id
if (!accumulator.ContainsKey(row_id))
{
    // Create an empty list for this key
    accumulator.Add(row_id, new List<string>());
}
// add the value only if we don't already have it
if (!accumulator[row_id].Contains(invoice))
{
    accumulator[row_id].Add(invoice);
}
Once you get the signal that no more data is available, that's when you start buffering output data. The auto-generated code will have placeholders for all this.
// This is how we shove data out the pipe
foreach (var kvp in accumulator)
{
    // approximately thus; the column names are placeholders for your own output columns
    OutputBuffer1.AddRow();
    OutputBuffer1.row_id = kvp.Key;
    OutputBuffer1.ari_productsummary = string.Join("; ", kvp.Value);
}
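The accumulate-then-emit pattern above can be tried outside SSIS. This is a minimal Python sketch under stated assumptions: the input is a list of (accountid, ari_productsummary) pairs, the sample values are invented, and concatenate_by_key is a hypothetical helper name, not part of any SSIS or KingswaySoft API.

```python
def concatenate_by_key(rows, separator="; "):
    """Collect the distinct values seen for each key, then join them per key."""
    accumulator = {}
    for key, value in rows:
        # Create an empty list the first time a key is seen.
        values = accumulator.setdefault(key, [])
        # Keep only distinct values, in first-seen order.
        if value not in values:
            values.append(value)
    # Nothing is emitted until every input row has been consumed (fully blocking).
    return {key: separator.join(values) for key, values in accumulator.items()}


# Invented sample data: (accountid, ari_productsummary)
sample = [("A1", "Widget"), ("A1", "Gadget"), ("A2", "Widget"), ("A1", "Widget")]
summary = concatenate_by_key(sample)
print(summary)
```

Unlike the FOR XML PATH query, joining the list leaves no trailing separator after the last value.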
We have an upcoming release that comes with a component that does exactly what you are trying to achieve without the need of writing custom code. The feature is currently under preview, please reach out to us for private access to the feature. You can find our contact information on our website.
UPDATE - June 5, 2020: we have made the components available for public access at https://www.kingswaysoft.com/products/ssis-productivity-pack/ as a result of our 2020 Release Wave 1. We have two components available that serve this kind of purpose. The Composition component takes input values and transforms them into a composite value in an SSIS column. The Decomposition component does the opposite: it takes an input value and splits it into multiple rows using either delimiter-based text splitting or XML/JSON array splitting.

Create a node for each column only once while importing csv into Neo4j

I have a csv file that looks the following way:
I want to create a database from it in Neo4j. Rows are nodes with the label Gene; columns are also nodes, with the label Cell. I need to write a CREATE query that would create all my Gene and Cell nodes and one relationship for each combination of gene and cell. Currently I am stuck with the following code:
LOAD CSV WITH HEADERS FROM 'file:///merged_full.csv' AS line
CREATE (:Gene {id: line.gene_ids, name: line.wikigene_name})
I need to somehow iterate over all columns - starting from index 3 - after creating gene nodes, but I do not know how to do that.
Here are 3 queries that, performed in order, should do what you want.
This query creates a temporary Headers node with a names property that contains the collection of headers from the CSV file. It uses LIMIT 1 to only process the first row of the file. It also creates all the Cell nodes, each with its own name property.
LOAD CSV FROM 'file:///merged_full.csv' AS line
MERGE (h:Headers)
SET h.names = line
WITH line
LIMIT 1
UNWIND line[3..] AS name
MERGE (c:Cell {name: name})
This query uses the APOC function apoc.map.fromNodes to generate a map named cells, which maps each cell name to its cell node. It also gets the Headers node. It then loads the non-header data from the CSV file (using SKIP 1 to skip over the header row), and processes each row as follows. It uses MERGE to get/create a Gene node, g, with the desired id and name. It uses the REDUCE function to generate a collection of the Cell nodes that have a "1" column value in the current row, and the FOREACH clause then creates a (g)-[:HAS]->(x) relationship (if necessary) for every cell, x, in that collection.
WITH apoc.map.fromNodes('Cell', 'name') AS cells
MATCH (h:Headers)
LOAD CSV FROM 'file:///merged_full.csv' AS line
WITH h, cells, line
SKIP 1
MERGE (g:Gene {id: line[1], name: line[2]})
FOREACH(
x IN REDUCE(s = [], i IN RANGE(3, SIZE(line)-1) |
CASE line[i] WHEN "1" THEN s + cells[h.names[i]] ELSE s END) |
MERGE (g)-[:HAS]->(x))
This query just deletes the temporary Headers node (if you wish):
MATCH (h:Headers)
DELETE h;
If the columns correspond with cell nodes, then you should know all the cell nodes you need just by looking at the CSV header.
I'd recommend writing a small query just to create each of the cell nodes you need, then create an index or unique constraint on :Cell(id) (or name, or whatever the property is that is meant to identify a :Cell).
At that point the problem becomes getting and processing each relevant column (I assume only the ones with 1 as the value). APOC Procedures may help here.
apoc.map.sortedProperties() can be used to take your line map and give you a list of key/value list pairs, which you can filter down to those where the key begins with 'V', and where the value is 1, then use what's remaining to match on the relevant :Cell node and create the relationship.
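Both answers boil down to the same step: read the cell names from the header, then pair each gene with every cell column whose value is 1. A minimal Python sketch of that step follows, so the expected relationships can be checked before running the import. The sample file is invented; its layout (an unnamed index column, gene_ids, wikigene_name, then cell columns starting at index 3 and named V1, V2, ...) is an assumption inferred from the queries above.

```python
import csv
import io

# Invented sample with the assumed layout: index, gene_ids, wikigene_name, cells...
SAMPLE = """,gene_ids,wikigene_name,V1,V2,V3
0,G1,Abc1,1,0,1
1,G2,Xyz2,0,0,1
"""


def gene_cell_pairs(text, first_cell_column=3):
    """Return the cell names and the (gene id, cell name) pairs flagged with "1"."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Every header from first_cell_column onward names one Cell node.
    cell_names = header[first_cell_column:]
    pairs = []
    for row in reader:
        gene_id = row[1]
        for name, value in zip(cell_names, row[first_cell_column:]):
            if value == "1":
                pairs.append((gene_id, name))
    return cell_names, pairs


cells, pairs = gene_cell_pairs(SAMPLE)
print(cells)
print(pairs)
```

Each pair corresponds to one (g)-[:HAS]->(x) relationship that the import should produce.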

SSIS: Flat File Source to SQL without Duplicate Rows

I have a fairly large flat file (CSV), which I am trying to import into my SQL Server table using an SSIS package. There is nothing special about it; it's a plain import. The problem is that more than 50% of the lines are duplicates.
E.g. Data:
Item Number | Item Name | Update Date
ITEM-01 | First Item | 1-Jan-2013
ITEM-01 | First Item | 5-Jan-2013
ITEM-24 | Another Item | 12-Mar-2012
ITEM-24 | Another Item | 13-Mar-2012
ITEM-24 | Another Item | 14-Mar-2012
Now I need to create my Master Item record table using this data. As you can see, the data is duplicated because of the Update Date. The file is guaranteed to always be sorted by Item Number, so all I need to do is check whether the next item number equals the previous item number and, if it does, NOT import that line.
I used Sort with Remove Duplicate, in SSIS package, but it is actually trying to sort all the lines which is useless because lines are already sorted. Plus it is taking forever to sort too many lines.
So is there any other way?
There are a couple of approaches you can take to do this.
1. Aggregate Transformation
Group by Item Number and Item Name and then perform an aggregate operation on Update Date. Based on the logic you mentioned above, the Minimum operation should work. In order to use the Minimum operation, you'll need to convert the Update Date column to a date (can't perform Minimum on a string). That conversion can be done in a Data Conversion Transformation. Below are the guts of what this would look like:
2. Script Component Transformation
Essentially, you could implement the logic you mentioned above:
if next item number = previous item number then do NOT import this line
First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):
Select Transformation as the Script Component type
Add the Script Component after the Flat File Source in your Data Flow:
Double Click the Script Component to open the Script Transformation Editor.
Under Input Columns, select all columns:
Under Inputs and Outputs, select Output 0, and set the SynchronousInputID property to None
Now manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types):
Finally, edit the script. There will be a method named Input0_ProcessInputRow. Modify it as below and add a private field named previousItemNumber:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Emit a row only when the item number differs from the previous row's.
    if (!Row.ItemNumber.Equals(previousItemNumber))
    {
        Output0Buffer.AddRow();
        Output0Buffer.ItemName = Row.ItemName;
        Output0Buffer.ItemNumber = Row.ItemNumber;
        Output0Buffer.UpdateDate = Row.UpdateDate;
    }
    previousItemNumber = Row.ItemNumber;
}
private string previousItemNumber = string.Empty;
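The same previous-key comparison can be sketched in Python to confirm it keeps the first row of each run without sorting anything. This is only an illustration on the question's sample rows; drop_repeated_keys is a made-up name, and it relies on the stated guarantee that the file is already sorted by Item Number.

```python
def drop_repeated_keys(rows, key_index=0):
    """Yield only the first row of each run of equal keys (input must be sorted by key)."""
    previous_key = None
    for row in rows:
        if row[key_index] != previous_key:
            yield row
        # Remember the key of the row just seen, whether or not it was emitted.
        previous_key = row[key_index]


rows = [
    ("ITEM-01", "First Item", "1-Jan-2013"),
    ("ITEM-01", "First Item", "5-Jan-2013"),
    ("ITEM-24", "Another Item", "12-Mar-2012"),
    ("ITEM-24", "Another Item", "13-Mar-2012"),
    ("ITEM-24", "Another Item", "14-Mar-2012"),
]
unique_rows = list(drop_repeated_keys(rows))
for row in unique_rows:
    print(row)
```

Because only one key is held in memory at a time, the cost is a single pass over the file, which is what makes this cheaper than the Sort transformation.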
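If performance is a big concern for you, I'd suggest dumping the entire text file into a temporary table on SQL Server and then using a SELECT DISTINCT over the item columns to get the desired values (a SELECT DISTINCT * would keep every row here, because Update Date differs between the duplicates).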

SSIS - Is there a Data Flow Source component that will handle CSV files where the column order may change?

We have written a number of SSIS packages that import data from CSV files using the Flat File Source.
It now seems that after these packages are deployed into production, the providers of these files may deliver files where the column order of the files changes (Don't ask!). Currently if this happens, our packages will fail.
For example, an additional column is inserted at the beginning of each row. In this case, the flat file source continues to use the existing column order, which obviously has a detrimental effect on the transformation!
Eg. Using a trivial example, the original file has the following content :
OurReference,Client,Amount
235,ClientA,20000.00
236,ClientB,30000.00
The output from the flat file source is :
OurReference Client Amount
235 ClientA 20000.00
236 ClientB 30000.00
Subsequently, the file delivered changes to :
OurReference,ClientReference,Client,Amount
235,A244,ClientA,20000.00
236,B222,ClientB,30000.00
When the existing unchanged package is run against this file, the output from the flat file source is :
OurReference Client Amount
235 A244 ClientA,20000.00
236 B222 ClientB,30000.00
Ideally, we would like to use a data source that will cope with this problem - ie which produces output based on the column names, instead of the column order.
Any suggestions would be welcomed!
Not that I know of.
A possibility to check for the problem in advance is to set up two different connection managers, one with a single flat row. This one can read the first row and tell if it's OK or not and abort.
If you want to do the work, you can take it a step further and make that flat one-field row the only connection manager, and use a script component in your flow to parse the row and assign to the columns you need later in the flow.
As far as I know, there is no way to dynamically add columns to the flow at runtime, so all the columns you need will have to be added to the Script Component output. Whether they can be found and parsed from each line is up to you. Any "new" (i.e. unanticipated) columns cannot be used. For columns which are missing, you could supply a default or throw an exception.
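The parsing step described above (read each line as a single field, then pick columns by header name rather than by position) can be sketched in Python. The sample rows are modeled on the question's files; REQUIRED and read_by_name are names invented for this illustration, and raising on a missing column is just one of the two options mentioned.

```python
import csv
import io

# The fixed set of columns the rest of the flow expects (cannot grow at runtime).
REQUIRED = ["OurReference", "Client", "Amount"]


def read_by_name(text, required=REQUIRED):
    """Return the required columns of each row, located by header name."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    missing = [name for name in required if name not in header]
    if missing:
        # Alternative: substitute a default value instead of failing.
        raise ValueError(f"missing columns: {missing}")
    positions = [header.index(name) for name in required]
    # Unanticipated extra columns are simply ignored.
    return [tuple(row[i] for i in positions) for row in reader]


old_file = "OurReference,Client,Amount\n235,ClientA,20000.00\n236,ClientB,30000.00\n"
new_file = (
    "OurReference,ClientReference,Client,Amount\n"
    "235,A244,ClientA,20000.00\n"
    "236,B222,ClientB,30000.00\n"
)
print(read_by_name(old_file))
print(read_by_name(new_file))
```

Both layouts produce the same output rows, because the extra ClientReference column is skipped by name.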
A final possibility is to use the SSIS object model to modify the package before running to alter the connection manager - or even to write the entire package dynamically using the object model based on an inspection of the input file. I have done quite a bit of package generation in C# using templates and then adding information based on metadata I obtained from master files describing the mainframe files.
Best approach would be to run a check before the SSIS package imports the CSV data. This may have to be an external script/application, because I don't think you can manipulate data in the MS Business Intelligence Studio.
Here is a rough approach. I will write down the limitations at the end.
Create a flat file source. Put the entire row in one column.
Do not check Column names in first data row.
Create a Script Component
Code:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string sRow = Row.Column0;
    string sManipulated = string.Empty;
    string[] columns = sRow.Split(',');
    foreach (string column in columns)
    {
        // Note: for the sake of demonstration, each field is padded to 15 characters.
        sManipulated = string.Format("{0}{1}", sManipulated, column.PadRight(15, ' '));
    }
    Row.Column0 = sManipulated;
}
Create a flat file destination
Map Column0 to Column0
Limitation: I have arbitrarily padded each field to 15 characters. Points to consider:
1. Do we need to have each field of same size?
2. If yes, what is that size?
A generic way to handle that would be to create a table to store the file name, fields, and field sizes.
Use the file name to dynamically create the source and destination connection manager.
Use the field name and corresponding field size to decide the padding. I'm not sure whether you need this much flexibility. If you have any questions, please respond.
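The metadata-driven padding suggested above might look like the following Python sketch. Everything in it is hypothetical: the LAYOUTS dictionary stands in for the proposed table of file names, fields, and field sizes, and the file name and widths are invented.

```python
# Hypothetical stand-in for the metadata table: file name -> [(field name, width)]
LAYOUTS = {
    "clients.csv": [("OurReference", 6), ("Client", 10), ("Amount", 12)],
}


def to_fixed_width(file_name, line):
    """Pad each comma-separated field to the width stored for that file."""
    layout = LAYOUTS[file_name]
    fields = line.rstrip("\r\n").split(",")
    if len(fields) != len(layout):
        raise ValueError(f"expected {len(layout)} fields, got {len(fields)}")
    # ljust pads on the right, like PadRight; longer values are not truncated.
    return "".join(value.ljust(width) for value, (_, width) in zip(fields, layout))


result = to_fixed_width("clients.csv", "235,ClientA,20000.00")
print(repr(result))
```

Checking the field count first means a file with an inserted column fails loudly instead of being padded into the wrong positions.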

How to export a flat file with different rows using SSIS?

I have three tables, Customer, Invoice and InvoiceRow, with the standard relations.
These I have to export in one fixed field length file with the first two characters of each row identifying the row type. The row types have different specifications.
I could probably do it with a nested loop in a script block, but this is my first ever SSIS package and that solution feels wrong.
edit:
The output has to have:
Customer
Invoice
Rows
Customer
Invoice
Rows
and so on
Your gut feeling on doing this using a Script Destination component is correct. Unfortunately, this scenario doesn't jibe well with SSIS. I don't consider this a beginner package. If you must use SSIS, then I'd start by inner joining all the data so there is one row for each InvoiceRow, containing the data needed from all three tables.
CustomerCols, InvoiceCols, RowCols
Then, in the script destination component, you'll need to keep track of the customer and invoice values; as they change, you'll need to write extra rows to the output.
See Creating a Destination with the Script Component for more information on script destination.
My experience shows that script destinations can have good performance.
I would avoid writing Script Destination, and use just Script Transform + Flat File Destination. This way, you concentrate on the logical output (strings of data), while allowing SSIS to do actual writing to the file (it might be a bit more efficient, plus you concentrate on your business, not on writing to files).
First, you'll need to get denormalized data. You can do joins and sorts in the DBMS, but if you don't want to put too much pressure on DBMS - just get sorted data out of it and merge it using two SSIS Merge Join transforms.
Then do the script: keep running values of current Customer and Invoice, output them when they change, output InvoiceRow on every input. Something like this:
if (this.CustomerID != InputBuffer.CustomerID) {
    this.CustomerID = InputBuffer.CustomerID;
    OutputBuffer.AddRow();
    OutputBuffer.OutputColumn = "Customer: " + InputBuffer.CustomerID + " " + InputBuffer.CustomerName;
}
// repeat the same code for Invoice
OutputBuffer.AddRow();
OutputBuffer.OutputColumn = "InvoiceRow: " + InputBuffer.InvoiceRowPrice;
Finally, add a Flat File Destination with a single column (OutputColumn created by the script) to write this to the file.
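Here is a runnable Python sketch of the same control-break logic, including the Invoice step the snippet above leaves out. It assumes the input is already denormalized and sorted by customer and then invoice; the sample rows, the tuple layout, and the export_lines name are invented, and the "Customer: " style prefixes stand in for the real two-character row type codes.

```python
def export_lines(rows):
    """Emit a Customer line and an Invoice line whenever those keys change."""
    current_customer = None
    current_invoice = None
    lines = []
    for customer_id, customer_name, invoice_id, row_price in rows:
        if customer_id != current_customer:
            current_customer = customer_id
            # A new customer always starts a new invoice.
            current_invoice = None
            lines.append(f"Customer: {customer_id} {customer_name}")
        if invoice_id != current_invoice:
            current_invoice = invoice_id
            lines.append(f"Invoice: {invoice_id}")
        # Every input row produces exactly one InvoiceRow line.
        lines.append(f"InvoiceRow: {row_price}")
    return lines


# Invented sample: (customer id, customer name, invoice id, row price)
rows = [
    (1, "Acme", 10, "5.00"),
    (1, "Acme", 10, "7.50"),
    (1, "Acme", 11, "1.00"),
    (2, "Bolt", 12, "9.99"),
]
lines = export_lines(rows)
print("\n".join(lines))
```

Resetting the invoice tracker when the customer changes keeps the Customer, Invoice, Rows ordering correct even if two customers happen to share an invoice number.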
Process your three tables so that the outputs are all appropriate for your output file (including the row type designator). You'll have to do this in three separate flow paths in your data flow, then bring the rows together in a Union All data flow element. From there, process them as needed to create your output file.