In a Data Flow, I have an ADO.NET Source which loads a table like this:
PersonID, Email
1, "john#hotmail.com"
1, "john_job#yahoo.com"
2, "susan#gmail.com"
2, "sus2010#hotmail.com"
I need to merge the emails for each person and get a result like this:
PersonID, EmailsArray
1, "john#hotmail.com,john_job#yahoo.com"
2, "susan#gmail.com,sus2010#hotmail.com"
How do I do it? Using a derived column? A script component? A foreach loop (which doesn't exist in a Data Flow)? Thanks in advance.
Use an asynchronous script component with something like the following logic (a code sketch follows these steps):
Sort your data on the ID column.
In the script component, declare a variable that keeps track of the previous ID; assign the ID column of your input buffer to it at the end of your script.
For each row in the input buffer, concatenate the email field onto a string variable.
Check whether the previous ID is equal to the current ID (coming from your input buffer). If it is different, add a row to the output buffer with the previous ID and the concatenated string, then reset the string to empty.
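Here is a minimal sketch of that logic in an asynchronous Script Component, assuming the input is sorted by PersonID and that the input/output columns are named PersonID, Email and EmailsArray (adjust the names and data types to your own metadata):

public class ScriptMain : UserComponent
{
    // running state for the person currently being accumulated
    private int previousId = -1;
    private string emails = string.Empty;

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        if (previousId != -1 && Row.PersonID != previousId)
        {
            // the ID changed, so flush the accumulated emails for the previous person
            Output0Buffer.AddRow();
            Output0Buffer.PersonID = previousId;
            Output0Buffer.EmailsArray = emails;
            emails = string.Empty;
        }

        emails = emails.Length == 0 ? Row.Email : emails + "," + Row.Email;
        previousId = Row.PersonID;
    }

    public override void Input0_ProcessInput(Input0Buffer Buffer)
    {
        // the base implementation only calls the per-row method; overriding it lets
        // us detect the end of the input and flush the final person
        while (Buffer.NextRow())
        {
            Input0_ProcessInputRow(Buffer);
        }

        if (Buffer.EndOfRowset())
        {
            if (previousId != -1)
            {
                Output0Buffer.AddRow();
                Output0Buffer.PersonID = previousId;
                Output0Buffer.EmailsArray = emails;
            }
            Output0Buffer.SetEndOfRowset();
        }
    }
}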
I'm attempting to pass the ID from one storage routine into another copy task, which requires a ForEach to process each ID. I've set up the Lookup ID task, which is working. It's passing these objects into my ForEach, whose settings are Sequential, with Items set to the following: @activity('LookupUID').output.value
In my ForEach, I have one activity to copy data from another API call to an Azure SQL Database. I have a linked service with a parameter that is being passed. I'm attempting to use a dynamic content expression to pass the current item from the ForEach into this parameter, which then gets sent to the API call as the ID parameter. When I manually plug in a value here, it works fine. However, passing the value from the ForEach into this copy task parameter doesn't produce a data row when running the task.
You must reference the column name along with the current item in the copy activity, like @item().ID
Example:
I have a lookup activity to get the IDs from a source. Below is the output of the Lookup activity with a list of IDs.
Lookup Output:
I am looping these IDs in the ForEach activity and passing the current item to a variable.
ForEach activity setting: Items - @activity('Lookup1').output.value
I have a string variable into which I pass the current item, as below, using a Set Variable activity.
@string(item().ID)
Output:
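Applied to the names in the question, the chain of dynamic content would look roughly like this (activity, parameter and column names are assumptions taken from the question and the example above):

ForEach > Settings > Items:                        @activity('LookupUID').output.value
Copy activity > dataset/linked service parameter:  @item().ID
Set Variable > Value:                              @string(item().ID)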
Use this expression, replacing columnname with your column name (note that firstRow only returns the Lookup's first row):
@activity('lookup1').output.firstRow.columnname
I am trying to accomplish something that is pretty easy to do in SQL, but seemingly very challenging to do in SSIS without using SQL. Basically, I need to consolidate and concatenate a field of a many-to-one relationship.
Given entities: [Contract Item] (many) to (one) [Account]
There is a field [ari_productsummary] that contains the product listed on the Contract Item entity. We want to write that value to the Account as [ari_activecontractitems]. However, an Account may have more than one Contract Item record associated with it, in which case we want to concatenate those values. We also only want the distinct values to be concatenated (distinct rows are already handled within my data flow).
This can be accomplished by writing to a temporary table and then using a query or view to obtain the summarized results, as follows. I created a SQL table called TESTTABLE that contains the [ari_productsummary] from the Contract Item entity along with the referring [accountid] to map it back to Account. I then wrote the following query as a view:
SELECT distinct accountid,
(SELECT TT2.ari_productsummary + '; '
FROM TESTTABLE TT2
WHERE TT2.accountid = TT.accountid
FOR XML PATH ('')
) AS 'ari_activecontractitems'
FROM TESTTABLE TT
Executing that query provides the results that I want, which I can then use for importing into the Account entity.
But how do I do this in an SSIS data flow without writing to a SQL table as a temporary placeholder for the data? I want to do the entire process inside one data flow container, without using a temporary SQL table/view. The whole summarization process needs to be done on the fly.
Does anyone have a solution that doesn't require a temporary SQL table/view/query, but is contained entirely within the data flow?
I am using VS 2017 and the KingswaySoft Dynamic CRM 365 ETL toolset to develop my solution/package.
Spitballing here, as I don't have Dynamics nor the custom components.
Data Flow 1 - Contract aggregation
The purpose of this data flow is to replicate your logic in the elegant query you provided and shove that into a Cache Connection Manager (see Notes for 2008+ at the end)
KingswaySoft Dynamics Source -> Script Component -> Cache Transform
If you want to keep the sort in there, do it before the Script Component. The approach I'll take with the Script Component is that it's fully blocking - that is, all the rows must arrive before it can send any on. Transformations like the Merge Join are only partially blocking, because the requirement of sorted data means that once you no longer have a match for the current item, you can send it on down the pipeline.
The Script Component is going to be an asynchronous transformation. You'll have two output columns: your key accountid and your new derived column ari_activecontractitems. That column might need to be big - you'll know your data best, but if it's a blob type in Dynamics (> 4k unicode or > 8k ascii characters) then you'll have to define the data type as DT_TEXT/DT_NTEXT.
As inputs, you'll select accountid and ari_productsummary from your source.
The code should be pretty easy. We're going to accumulate the inbound data into a Dictionary.
// member variable
Dictionary<string, List<string>> accumulator;
In the PreExecute method, we'll tack this in to initialize our variable:
// initialize in the PreExecute method
accumulator = new Dictionary<string, List<string>>();
In the Input0_ProcessInputRow method (the auto-generated per-row processing method):
// accumulate the inbound rows
// row_id and invoice stand in for your input columns, e.g. Row.accountid and Row.ari_productsummary
if (!accumulator.ContainsKey(row_id))
{
    // create an empty list to hold this key's values
    accumulator.Add(row_id, new List<string>());
}
// only add the value if we don't already have it (keeps the list distinct)
if (!accumulator[row_id].Contains(invoice))
{
    accumulator[row_id].Add(invoice);
}
Once you get the signal that no more data is available, that's when you start sending rows to the output. The auto-generated code will have placeholders for all of this.
// This is how we shove data out the pipe
foreach (var kvp in accumulator)
{
    // approximately thus
    Output0Buffer.AddRow();
    Output0Buffer.accountid = kvp.Key;
    Output0Buffer.ari_activecontractitems = string.Join("; ", kvp.Value);
}
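For completeness, the "signal of no more data" above is the input buffer's EndOfRowset. A minimal sketch of where the output loop from the previous snippet sits in the generated ScriptMain class (an assumption-laden outline, not the exact auto-generated code):

public override void Input0_ProcessInput(Input0Buffer Buffer)
{
    // let the per-row method above accumulate each inbound row
    while (Buffer.NextRow())
    {
        Input0_ProcessInputRow(Buffer);
    }

    // once the final buffer has arrived, emit the accumulated rows and close the output
    if (Buffer.EndOfRowset())
    {
        // the foreach over accumulator from the previous snippet goes here
        Output0Buffer.SetEndOfRowset();
    }
}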
We have an upcoming release that comes with a component that does exactly what you are trying to achieve, without the need to write custom code. The feature is currently in preview; please reach out to us for private access to it. You can find our contact information on our website.
UPDATE - June 5, 2020: we have made the components available for public access at https://www.kingswaysoft.com/products/ssis-productivity-pack/ as a result of our 2020 Release Wave 1. We have two components available that serve this kind of purpose. The Composition component will take input values and transform them into a composite value in an SSIS column. The Decomposition component does the opposite: it takes an input value and splits it into multiple rows, using either delimiter-based text splitting or XML/JSON array splitting.
I have a field in a MySQL table that stores events as JSON. The JSON contains the ID of a file that the event is about. Is there a way to have another field auto-populate with that file ID, something like a formula field in Excel? The format inside the JSON is "item_id": "1234567".
New to SQL so help is appreciated. :)
Example JSON in column "event":
{"video_proc_producer_ver": 2,"mp_event":{"project_name": "some
project","project_account": "some customer","mp_notes": "No playable
combo ids found for item_id: 1234abcd.\\n\n","item_name":
"file.mov","mp_m_stamp": "2020-03-09 02:27:50","mp_c_stamp":
"2020-03-09 02:22:14","mp_processing_mask": "4","c_user_id":
"123456","mp_export_time": "2020-03-09 02:24:22","item_id":
"0987654","project_id": "1234","mp_complete_time": "2020-03-09
02:27:50"}}
Desired result in column file_id: 0987654
I know I can use some process outside the table to pull that info and insert it, but I was wondering if there is the equivalent of a calculated column, like in Excel, that I can use inside the table to have that auto-populate. If so, what kind of column do I need to create and what formula would I use?
It looks like you want to access the content of the json object. In MySQL, you can use json_extract(), or the ->> operator, like so:
select event, event ->> '$.mp_event.item_id' file_id from mytable
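If you want that value to auto-populate as its own column (the calculated-column behaviour you describe), the same expression can be used in a generated column. A minimal sketch, assuming MySQL 5.7.13 or later, that event holds valid JSON, and the table/column names from the example (mytable, event, file_id); VARCHAR(20) is an assumed length:

ALTER TABLE mytable
  ADD COLUMN file_id VARCHAR(20)
  GENERATED ALWAYS AS (event ->> '$.mp_event.item_id') STORED;

STORED persists the value on disk; use VIRTUAL instead if you prefer it to be computed on read.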
I have to process a flat file whose syntax is as follows, one record per line.
<header>|<datagroup_1>|...|<datagroup_n>|[CR][LF]
The header has a fixed-length field format that never changes (ID, timestamp etc). However, there are different types of data groups and, even though fixed-length, the number of their fields vary depending on the data group type. The three first numbers of a data group define its type. The number of data groups in each record varies also.
My idea is to have a staging table to which I would insert all the data groups. So two records like this,
12320160101|12323456KKSD3467|456SSGFED43520160101173802|
98720160102|456GGLWSD45960160108854802|
Would produce three records in the staging table.
ID Timestamp Data
123 01/01/2016 12323456KKSD3467
123 01/01/2016 456SSGFED43520160101173802
987 02/01/2016 456GGLWSD45960160108854802
This would allow me to preprocess the staged records for further processing (some would be discarded, some have their data broken down further). My question is how to break down the flat file into the staging table. I can split the entire record with pipe (|) and then use a Derived Column Transformation to break down the header with SUBSTRING. After that it gets trickier because of the varying number of data groups.
The solution I came up with myself doesn't try to split at the flat file source, but rather in a script. My Data Flow looks like this.
So the Flat File Source output is just a single column containing the entire line. The Script Component contains output columns for each column in the Staging table. The script looks like this.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // the whole line arrives as a single column; the first pipe-delimited field
    // is the fixed-length header, every following field is one data group
    var splits = Row.Line.Split('|');
    for (int i = 1; i < splits.Length; i++)
    {
        // emit one output row per data group, repeating the header's ID and timestamp
        Output0Buffer.AddRow();
        Output0Buffer.ID = splits[0].Substring(0, 11);
        Output0Buffer.Time = DateTime.ParseExact(splits[0].Substring(14, 14), "yyyyMMddHHmmssFFF", CultureInfo.InvariantCulture);
        Output0Buffer.Datagroup = splits[i];
    }
}
Note that the SynchronousInputID property (Script Transformation Editor > Inputs and Outputs > Output0) must be set to None; otherwise you won't have Output0Buffer available in your script. Finally, the OLE DB Destination just maps the script output columns to the staging table columns. This solves the problem I had with creating multiple output records from a single input record.
I have a (somewhat large) flat file (CSV) which I am trying to import into my SQL Server table using an SSIS package. There is nothing special; it's a plain import. The problem is that more than 50% of the lines are duplicates.
E.g. Data:
Item Number | Item Name | Update Date
ITEM-01 | First Item | 1-Jan-2013
ITEM-01 | First Item | 5-Jan-2013
ITEM-24 | Another Item | 12-Mar-2012
ITEM-24 | Another Item | 13-Mar-2012
ITEM-24 | Another Item | 14-Mar-2012
Now I need to create my master Item record table using this data. As you can see, the data is duplicated due to the Update Date. It is guaranteed that the file will always be sorted by Item Number. So all I need to do is check: if the next item number = the previous item number, then do NOT import this line.
I used Sort with Remove Duplicates in the SSIS package, but it actually tries to sort all the lines, which is useless because the lines are already sorted. Plus, it takes forever to sort that many lines.
So is there any other way?
There are a couple of approaches you can take to do this.
1. Aggregate Transformation
Group by Item Number and Item Name and then perform an aggregate operation on Update Date. Based on the logic you mentioned above, the Minimum operation should work. In order to use the Minimum operation, you'll need to convert the Update Date column to a date (can't perform Minimum on a string). That conversion can be done in a Data Conversion Transformation. Below are the guts of what this would look like:
2. Script Component Transformation
Essentially, you could implement the logic you mentioned above:
if next item number = previous item number then do NOT import this line
First, you must configure the Script Component appropriately (the steps below assume that you don't rename the default input and output names):
Select Transformation as the Script Component type
Add the Script Component after the Flat File Source in your Data Flow:
Double Click the Script Component to open the Script Transformation Editor.
Under Input Columns, select all columns:
Under Inputs and Outputs, select Output 0, and set the SynchronousInputID property to None
Now manually add columns to Output 0 to match the columns in Input 0 (don't forget to set the data types):
Finally, edit the script. There will be a method named Input0_ProcessInputRow - modify it as below, and add a private field named previousItemNumber as below:
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
if (!Row.ItemNumber.Equals(previousItemNumber))
{
Output0Buffer.AddRow();
Output0Buffer.ItemName = Row.ItemName;
Output0Buffer.ItemNumber = Row.ItemNumber;
Output0Buffer.UpdateDate = Row.UpdateDate;
}
previousItemNumber = Row.ItemNumber;
}
private string previousItemNumber = string.Empty;
If performance is a big concern, I'd suggest dumping the entire text file into a temporary table on SQL Server and then using a SELECT DISTINCT * to get the desired values.
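Note that because the Update Date differs between the duplicate rows, a plain DISTINCT * would still return one row per date; grouping on the key columns collapses them to one row per item. A minimal sketch, with an assumed staging table name and the column names from the sample data:

SELECT ItemNumber,
       ItemName,
       MIN(UpdateDate) AS UpdateDate
FROM   dbo.StagingItems
GROUP BY ItemNumber, ItemName;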