How to convert Excel to JSON in Azure Data Factory?

I want to convert this Excel file, which contains two tables in a single worksheet, into this JSON format:
{
  "parent":
  {
    "P1": "x1",
    "P2": "y1",
    "P3": "z1"
  },
  "children": [
    {"C1": "a1", "C2": "b1", "C3": "c1", "C4": "d1"},
    {"C1": "a2", "C2": "b2", "C3": "c2", "C4": "d2"},
    ...
  ]
}
And then post the JSON to a REST endpoint.
How do I perform the mapping and post the result to the REST service?
Also, it appears that I need to sink the JSON to a physical JSON file before I can post it as a payload to the REST service - is this physical sink step necessary, or can the JSON be held in memory?
I cannot use a Lookup activity to read the Excel file because it is limited to 5,000 rows and 4 MB.

I managed to do it in ADF. The solution is a bit long, but you could also use an Azure Function to do it programmatically.
Here is a quick demo that I built:
The main idea is to split the data, add headers as requested, and then re-join the data and add the relevant keys, i.e. parent and children.
ADF:
Added a Conditional Split to separate the data (see attached pictures).
Added a surrogate key for each table.
Filtered the first row to get rid of the headers in the CSV.
Mapped the children/parent columns: renamed columns using a Derived Column activity.
Added a constant value in the children data flow so I can aggregate by it and convert the CSV into a complex data type.
childrenArray: in a Derived Column, added subcolumns to a new column named Children and added the relevant columns as its values.
Aggregated the children JSONs by grouping on the constant value.
In the parents data flow: after mapping the columns, I created the JSONs using a Derived Column (please see attached pictures).
Joined the children array and the parent JSONs into one table so it can be converted to the requested JSON.
Wrote to a cached sink (here you can do the POST request instead of writing to a sink).
DataFlow:
Activities:
Conditional Split:
AddSurrogateKey:
(It's the same for the parents data flow; just change the name of the incoming stream, as shown in the data flow above.)
FilterFirstRow:
MapChildrenColumns:
MapParentColumns:
AddConstantValue:
ParentsJson:
Here I added subcolumns in the Expression Builder and set the source columns as their values; this builds the parent JSON.
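As a rough sketch (the column names here are assumptions based on the requested output; in the demo the parent columns were renamed to P1, P2 and P3 in the mapping step), the derived column expression is something like:
parent = @(P1=P1, P2=P2, P3=P3)
The @( ) constructor builds a complex (struct) column out of the scalar columns, which is what serializes to the nested parent object.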
ChildrenArray:
Again in a Derived Column, I added a column named "children" and added the relevant columns in the Expression Builder.
Aggregate:
The purpose of this activity is to aggregate the children JSONs and build the array; without it you will not get an array.
The aggregation function is collect().
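Roughly, and assuming the child columns were renamed to C1-C4 and the constant column added earlier is named const, the two steps look like:
Children = @(C1=C1, C2=C2, C3=C3, C4=C4)   (derived column)
children = collect(Children)               (aggregate, grouped by const)
The aggregate grouped on the constant column collapses all the child rows into a single row whose children column is an array of structs.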
Join Activity:
Here I added an outer join to combine the parent JSON and the children array.
Select Relevant columns:
Output:

Related

Advanced mapping of JSON in Azure Data Factory - some guidance requested

I'm trying to map a JSON document (sensor data) into a more meaningful representation using Mapping Data Flows. However, I'm having a hard time getting this to work and would really appreciate some insight/recommendations on how to solve the following:
The input is
What I would like to end up with is the following:
Any pointers as to how this can be implemented are more than welcome.
This can be accomplished using the Copy activity and then the split() function in a Derived Column transformation in Azure Data Factory.
Use the Copy activity to read the JSON file as the source and, in the sink, use a SQL database to store the data as a table. In the Mapping tab, import the schema and map the JSON records to the corresponding column names. Refer to this third-party tutorial for guidance - https://sqlkover.com/dynamically-map-json-to-sql-in-azure-data-factory/
Finally, use a Data Flow activity and choose as its source the SQL table which you used as the sink above.
Select the Derived Column transformation.
Use the split function.
Add the column that will take the split values, i.e. the column you want to split, as shown below.
Use the split(<column_name_to_split>, '_') function to split the column on the _ delimiter. Change <column_name_to_split> to the name of the column you want to split. Refer to the image below.
Preview the data to check the result.
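As a small illustration (the column name and values are made up), if a column named SensorName holds values like device_temperature_celsius, then:
split(SensorName, '_')      returns the array ['device','temperature','celsius']
split(SensorName, '_')[2]   returns 'temperature' (array indexes are 1-based in data flow expressions)
Each element can then be assigned to its own derived column.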

SSIS consolidate and concatenate multiple rows into single rows without using SQL

I am trying to accomplish something that is pretty easy to do in SQL, but seemingly very challenging to do in SSIS without using SQL. Basically, I need to consolidate and concatenate a field of a many-to-one relationship.
Given entities: [Contract Item] (many) to (one) [Account]
There is a field [ari_productsummary] that contains the product listed on the Contract Item entity. We want to write that value to the Account as [ari_activecontractitems]. However, an Account may have more than one Contract Item record associated to it, in which case, we want to concatenate those values. We also only want the distinct values to be concatenated (distinct rows already solved within my data flow).
This can be accomplished by writing to a temporary table, and then using a query or view to obtain the summarized results, as follows. I created a SQL table called TESTTABLE that contains the [ari_productsummary] from the Contract Item entity along with the referring [accountid] to map it back to the Account. I then wrote the following query as a view:
SELECT distinct accountid,
(SELECT TT2.ari_productsummary + '; '
FROM TESTTABLE TT2
WHERE TT2.accountid = TT.accountid
FOR XML PATH ('')
) AS 'ari_activecontractitems'
FROM TESTTABLE TT
Executing that query provides the results that I want, which I can then use for importing into the Account entity as shown below:
But how do I do this in an SSIS data flow without writing to a SQL table as a temporary placeholder for the data? I want to do the entire process inside one data flow container, without using a temporary SQL table/view. The whole summarization process needs to be done on the fly:
Does anyone have a solution that doesn't require a temporary SQL table/view/query, but is contained entirely within the data flow?
I am using VS 2017 and the KingswaySoft Dynamic CRM 365 ETL toolset to develop my solution/package.
Spitballing here, as I don't have Dynamics nor the custom components.
Data Flow 1 - Contract aggregation
The purpose of this data flow is to replicate the logic in the elegant query you provided and shove the result into a Cache Connection Manager (see Notes for 2008+ at the end).
KingswaySoft Dynamics Source -> Script Component -> Cache Transform
If you want to keep the sort in there, do it before the Script Component. The implementation I'll take with the Script Component is fully blocking - that is, all the rows must arrive before it can send any on. Transformations like the Merge Join are only partially blocking, because the requirement of sorted data means that once you no longer have a match for the current item, you can send it on down the pipeline.
The Script Component is going to be an asynchronous transformation. You'll have two output columns: your key, accountid, and your new derived column, ari_activecontractitems. That column might need to be big - you know your data best, but if it's a blob type in Dynamics (> 4k Unicode or > 8k ASCII characters) then you'll have to define the data type as DT_NTEXT/DT_TEXT.
As inputs, you'll select accountid and ari_productsummary from your source.
The code should be pretty easy. We're going to accumulate the inbound data into a Dictionary.
// member variable
Dictionary<string, List<string>> accumulator;
In the PreExecute method, we'll tack this in to initialize our variable:
// initialize in PreExecute
accumulator = new Dictionary<string, List<string>>();
In the per-row method (Input0_ProcessInputRow in the generated code):
// Row exposes the columns we selected as inputs
string accountId = Row.accountid;
string productSummary = Row.ari_productsummary;

if (!accumulator.ContainsKey(accountId))
{
    // create an empty list to hold this account's values
    accumulator.Add(accountId, new List<string>());
}

// only add the value if we don't already have it (distinct values only)
if (!accumulator[accountId].Contains(productSummary))
{
    accumulator[accountId].Add(productSummary);
}
Once you get the signal that no more data is available (end of rowset), that's when you start buffering output data. The auto-generated code will have placeholders for all of this.
// This is how we shove data out the pipe
foreach (var kvp in accumulator)
{
    // approximately thus - buffer/property names depend on your output definition
    Output0Buffer.AddRow();
    Output0Buffer.accountid = kvp.Key;
    Output0Buffer.ari_activecontractitems = string.Join("; ", kvp.Value);
}
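For orientation, here is a rough skeleton of where those pieces live in the generated Script Component code (asynchronous output); the exact buffer and property names depend on how you name your inputs, outputs and columns:
using System.Collections.Generic;

public class ScriptMain : UserComponent
{
    // one list of distinct product summaries per accountid
    private Dictionary<string, List<string>> accumulator;

    public override void PreExecute()
    {
        base.PreExecute();
        accumulator = new Dictionary<string, List<string>>();
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // the per-row accumulation code shown above goes here
    }

    public override void Input0_ProcessInput(Input0Buffer Buffer)
    {
        while (Buffer.NextRow())
        {
            Input0_ProcessInputRow(Buffer);
        }

        if (Buffer.EndOfRowset())
        {
            // all rows have arrived - flush the accumulator to Output0Buffer here
            // (the foreach loop shown above)
        }
    }
}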
We have an upcoming release that comes with a component that does exactly what you are trying to achieve without the need to write custom code. The feature is currently in preview; please reach out to us for private access to it. You can find our contact information on our website.
UPDATE - June 5, 2020: we have made the components available for public access at https://www.kingswaysoft.com/products/ssis-productivity-pack/ as part of our 2020 Release Wave 1. We have two components that serve this kind of purpose. The Composition component takes input values and transforms them into a composite value in an SSIS column. The Decomposition component does the opposite: it takes an input value and splits it into multiple rows using either delimiter-based text splitting or XML/JSON array splitting.

How can I reference a JSON source for a derived column action in Azure Data Factory

I'm new to Azure Data Factory. I've been able to generate a set of JSON files from a REST API source using a Pipeline. Each file consists of one top level JSON object with an array of up to 100 child objects. The output is saved to an Azure Blob Storage container.
I now want to use a Mapping Data Flow to modify the JSON before I write it to Azure SQL, however I'm struggling with the syntax. I've configured the source to point to the directory containing the JSON files. The Source Projection tab displays the correct schema. I can preview the data and I see a row for each file and I can expand the child objects to see the full structure.
However, when I add a Derived Column action, the Input Schema is blank in the Expression Builder. I can refer to the top level elements in the source using the byName and byPosition functions, but I don't know how I can reference the child elements.
The examples that I have been able to find online use a SQL table or CSV file as a source. I can't find any examples that use hierarchical data as the source for a derived column.
Am I missing something? Is this scenario supported?
I found a way to achieve what I want. This may not be the best approach, but it works.
It seems that it is difficult to deal with JSON that has multiple hierarchies as a source for copy data activities. You can choose one level of repeating data to map to a table structure (the Collection Reference property on the Mapping tab).
In my scenario, there was additional repeating data within the data I was mapping to my table. I updated the mapping to write the child JSON data to a text field in my SQL table. To do this, I needed to use the Azure Data Factory JSON editor for my pipeline. You can access this from the "Code" link in the top right corner of the pipeline visual editor.
I added the following line after the closing bracket for the "mappings" array for my copy activity:
"mapComplexValuesToString": true
The full path to the mapping array in the activity definition is typeProperties - translator - mappings. Make sure your commas are correct after you add the new element.
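For reference, the relevant part of the copy activity JSON ends up looking roughly like this (the paths and column names below are placeholders, not taken from my actual pipeline):
"typeProperties": {
    ...
    "translator": {
        "type": "TabularTranslator",
        "collectionReference": "$['items']",
        "mappings": [
            { "source": { "path": "$['id']" }, "sink": { "name": "Id" } },
            { "source": { "path": "$['readings']" }, "sink": { "name": "ChildJson" } }
        ],
        "mapComplexValuesToString": true
    }
}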
With this approach, I had a row in my SQL table for each array item in my Collection Reference. The scalar child elements in the array items are mapped to table columns and the child JSON element is written to a data column in the same table.
To extract the values I need within the child JSON, I created a SQL view that uses the CROSS APPLY OPENJSON syntax. This allows me to treat the JSON in the data field similar to a related table. You can specify the structure that your JSON is in. If you have nested data in your JSON, you can apply the same approach for each level.
The OPENJSON command is only supported by more recent versions of SQL Server. I'm using Azure SQL, so that works for me.
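As an illustration (the table and column names are made up), the view looks something like:
SELECT t.Id,
       j.SensorName,
       j.SensorValue
FROM dbo.MyTable AS t
CROSS APPLY OPENJSON(t.ChildJson)
    WITH (
        SensorName  nvarchar(100) '$.name',
        SensorValue float         '$.value'
    ) AS j;
Each element of the JSON array stored in ChildJson becomes its own row, joined back to the parent row's columns.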

Data Factory v2 - Generate a json file per row

I'm using Data Factory v2. I have a copy activity that has an Azure SQL dataset as input and an Azure Storage Blob as output. I want to write each row in my SQL dataset as a separate blob, but I don't see how I can do this.
I see a copyBehavior in the copy activity, but that only works from a file based source.
Another possible setting is the filePattern in my dataset:
Indicate the pattern of data stored in each JSON file. Allowed values
are: setOfObjects and arrayOfObjects.
setOfObjects - Each file contains single object, or line-delimited/concatenated multiple objects. When this option is chosen in an output dataset, copy activity produces a single JSON file with each object per line (line-delimited).
arrayOfObjects - Each file contains an array of objects.
The description talks about "each file", so initially I thought it would be possible, but now that I've tested them it seems that setOfObjects creates a line-delimited file, where each row is written to a new line. The arrayOfObjects setting creates a file with a JSON array and adds each row as a new element of the array.
I'm wondering if I'm missing a configuration somewhere, or is it just not possible?
What I did for now is load the rows into a SQL table and run a foreach over each record in the table. I use a Lookup activity to get an array to loop over in a ForEach activity. The ForEach activity then writes each row to a blob store.
For Olga's documentDb question, it would look like this:
In the lookup, you get a list of the documentid's you want to copy:
You use that set in your foreach activity
Then you copy the files using a copy activity within the foreach activity. You query a single document in your source:
And you can use the id to dynamically name your file in the sink (you'll have to define the parameter in your dataset too):
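For completeness, the output dataset reference inside that copy activity looks roughly like this (assuming the blob dataset defines a parameter named fileName and the Lookup returns a documentid column - both names are placeholders):
"outputs": [
    {
        "referenceName": "BlobJsonDataset",
        "type": "DatasetReference",
        "parameters": {
            "fileName": "@concat(item().documentid, '.json')"
        }
    }
]
The dataset itself then uses @dataset().fileName in its file path.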

ServiceNow - JSON Web Service, display related tables

I'm working on a C# program that retrieves data from a ServiceNow database and converts that data into C# .NET objects. I'm using the JSON Web Service to return my data in JSON format.
What I want to achieve is as follows: if there is a relational mapping for a value (for
example: I have a table called Company, where CEO is not a TEXT field but a sys_id reference to an Employee table), I want to be able to output that data not as a sys_id (or just displaying the name property by using the 'displayvariable' parameter) but as an object displayed in JSON.
This means that the value of a property should be an object in JSON instead of just a single value.
A few examples:
// I don't want the JSON like this
{"Company":{"CEO":"b181e841c9212c008aeb36850331fab2"}}
// Or by displaying the name of the sys_id table
{"Company":{"CEO":"James Henderson" }}
// I want the data as follows, so I can have all the data I need inside a single JSON record.
{"Company":{"CEO":{"name":"James Henderson", "age":34, "sex":"male", "office":"SBN Left Floor 23"}}}
From reading the documentation, I couldn't find anything in the JSON Web Service that allows me to display the information like this, nor could I find any other alternative. It should have something to do with joining the tables and displaying it all in the right format.
I have been using SNC for almost three years and have not found a way to automatically join tables in a web service. Your best option would be to use a scripted web service which takes, for example, a query parameter and a table parameter. Then you can serialize your result to JSON as you see fit.
Or, another option would be to generate a new processor that will traverse the GlideRecord object. The ?JSON parameter you pass in the URL is merely a flag to route your request to a particular processor. Unfortunately, the out-of-the-box one is, I believe, a Java class rather than a JS script, so you would need to write a script much like the one I mentioned earlier to traverse the object path, serializing the object graph as far down as you want to go.