Extracting array properties from Cosmos DB documents using Azure Data Factory - json

I have an Azure Data Factory v2 pipeline that's pulling data from a Cosmos DB collection. This collection has a property that's an array.
I want to, at the least, be able to dump that entire property's value into a column in SQL Azure. I don't need it parsed (although that would be great too), but ADF lists this column as "Unsupported Type" in the dataset definition and places it in the Excluded Columns section.
Here is an example of the JSON I'm working with. The property I want is "MyArrayProperty":
{
  "id": "c4e2012e-af82-4c48-8960-11e0436e6d3f",
  "Created": "2019-06-14T16:04:13.9572567Z",
  "Updated": "2019-06-14T16:04:14.1920988Z",
  "IsActive": true,
  "MyArrayProperty": [
    {
      "SomeId": "a4427015-ca69-4958-90d3-0918fd5dcac1",
      "SomeName": "BlahBlah"
    }
  ]
}
I've tried manually specifying a column in the ADF data source like "MyArrayProperty" and using a string data type, but the value always comes across as null.

Please check this document about schema mapping, which includes an example of mapping between MongoDB and Azure SQL. Basically, you should define a collectionReference that iterates through your nested array of objects and does a cross apply.
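As a rough sketch (the sink column names here are just placeholders), the copy activity's translator could look something like this:

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "path": "$['id']" }, "sink": { "name": "Id" } },
        { "source": { "path": "['SomeId']" }, "sink": { "name": "SomeId" } },
        { "source": { "path": "['SomeName']" }, "sink": { "name": "SomeName" } }
    ],
    "collectionReference": "$['MyArrayProperty']"
}

With collectionReference pointing at MyArrayProperty, the copy activity emits one row per array element and repeats the root-level id on each row.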

There may be a better way to solve this problem, but I ended up creating a second copy activity which uses a query against Cosmos rather than a collection-based capture. The query flattens the array like so:
SELECT m.id, c.SomeId, c.SomeName
FROM myCollection m JOIN c IN m.MyArrayProperty
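As a rough sketch, the source side of such a copy activity could look like this (the exact source type name depends on which Cosmos DB connector version you use):

"source": {
    "type": "CosmosDbSqlApiSource",
    "query": "SELECT m.id, c.SomeId, c.SomeName FROM myCollection m JOIN c IN m.MyArrayProperty"
}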
I then took this data set, dumped it into a SQL table, and did the rest of my work inside SQL Azure itself. You could also use the new Join pipeline task to do this in memory before it gets to the destination.

Related

Map nested JSON (Mongo ATLAS) to SQL [Azure Data Factory]

I want to map nested JSON to a SQL table (Microsoft SSMS).
The source is a dataset of MongoDB Atlas &
the sink is a dataset of Azure SQL Database Managed Instance.
I am able to map the parentArray using the collection reference,
but I am not able to select the child arrays under it.
Also, the child arrays are essentially scalar arrays (they don't have any keys).
Note: I tried the option "Map complex values to string",
but it puts the values in the column cell like ["ABC", PQR], which I don't want.
Is there any way to map it?
Expected output for table: childarray2
Currently in ADF, the Copy activity supports mapping of arrays only one level deep.
There is no way to map nested arrays.
For this I had to use Data Flows.
The limitation was that MongoDB/Mongo Atlas cannot be used as an input source in a Data Flow, so the workaround was:
Convert Mongo to Azure Blob JSON (Copy activity)
Use the Azure Blob JSON files as the input source and SQL tables as the sink
Note: you can select the option to delete your blob files afterwards, to save storage space.

Dynamically refer to Json value in Data Factory copy

I have an ADF CopyRestToADLS activity which correctly saves a complex JSON object to Data Lake storage. But I additionally need to pass one of the JSON values (myextravalue) to a stored procedure. I tried referencing it in the stored procedure parameter as @{activity('CopyRestToADLS').output.myextravalue} but I am getting the error
The actions CopyRestToADLS referenced by 'inputs' in the action ExecuteStoredProcedure1 are not defined in the template
{
  "items": [1000 items],
  "count": 1000,
  "myextravalue": 15983444
}
I would like to dynamically reference this value, because the CopyRestToADLS source REST dataset dynamically calls different REST endpoints, so the structure of the JSON object is different each time. But myextravalue is always present in each JSON response.
How is it possible to reference myextravalue and use it as a parameter?
You could create another Lookup activity on the REST data source to get the JSON value, then pass it to the Stored Procedure activity.
Yes, it will create a new REST request, but it seems to be an easy way to achieve your purpose. The Lookup activity just reads the content of the source and doesn't save it anywhere.
Another solution may be to get the value from the Copy activity's output file after the Copy activity has completed.
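A rough sketch of the Stored Procedure activity reading the value from the Lookup activity (the activity, linked service, procedure, and parameter names below are just placeholders):

{
    "name": "ExecuteStoredProcedure1",
    "type": "SqlServerStoredProcedure",
    "dependsOn": [
        { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] }
    ],
    "linkedServiceName": { "referenceName": "AzureSqlLinkedService", "type": "LinkedServiceReference" },
    "typeProperties": {
        "storedProcedureName": "[dbo].[usp_SaveExtraValue]",
        "storedProcedureParameters": {
            "MyExtraValue": {
                "value": "@{activity('Lookup1').output.firstRow.myextravalue}",
                "type": "Int64"
            }
        }
    }
}

With the Lookup's default firstRowOnly setting, the whole JSON document comes back as output.firstRow, so the expression above picks myextravalue out of it.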
I'm glad you solved it this way:
"I created a Data Flow to read from the folder where Copy Activity saves dynamically named output json filenames. After importing schema from sample file, I selected the myextravalue as the only mapping in the Sink Mapping section."

How can I reference a JSON source for a derived column action in Azure Data Factory

I'm new to Azure Data Factory. I've been able to generate a set of JSON files from a REST API source using a Pipeline. Each file consists of one top level JSON object with an array of up to 100 child objects. The output is saved to an Azure Blob Storage container.
I now want to use a Mapping Data Flow to modify the JSON before I write it to Azure SQL, however I'm struggling with the syntax. I've configured the source to point to the directory containing the JSON files. The Source Projection tab displays the correct schema. I can preview the data and I see a row for each file and I can expand the child objects to see the full structure.
However, when I add a Derived Column action, the Input Schema is blank in the Expression Builder. I can refer to the top level elements in the source using the byName and byPosition functions, but I don't know how I can reference the child elements.
The examples that I have been able to find online use a SQL table or CSV file as a source. I can't find any examples that use hierarchical data as the source for a derived column.
Am I missing something? Is this scenario supported?
I found a way to achieve what I want. This may not be the best approach, but it works.
It seems that it is difficult to deal with JSON that has multiple hierarchies as a source for copy data activities. You can choose one level of repeating data to map to a table structure (the Collection Reference property on the Mapping tab).
In my scenario, there was additional repeating data within the data I was mapping to my table. I updated the mapping to write the child JSON data to a text field in my SQL table. To do this, I needed to use the Azure Data Factory JSON editor for my pipeline. You can access this from the "Code" link in the top right corner of the pipeline visual editor.
I added the following line after the closing bracket for the "mappings" array for my copy activity:
"mapComplexValuesToString": true
The full path to the mappings array in the activity definition is typeProperties > translator > mappings. Make sure your commas are correct after you add the new element.
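To illustrate where the new element goes, the translator section ends up shaped roughly like this (the paths and column names below are placeholders, not my actual mapping):

"translator": {
    "type": "TabularTranslator",
    "mappings": [
        { "source": { "path": "['ItemId']" }, "sink": { "name": "ItemId" } },
        { "source": { "path": "['ChildItems']" }, "sink": { "name": "ChildJsonData" } }
    ],
    "collectionReference": "$['Items']",
    "mapComplexValuesToString": true
}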
With this approach, I had a row in my SQL table for each array item in my Collection Reference. The scalar child elements in the array items are mapped to table columns and the child JSON element is written to a data column in the same table.
To extract the values I need within the child JSON, I created a SQL view that uses the CROSS APPLY OPENJSON syntax. This allows me to treat the JSON in the data field similar to a related table. You can specify the structure that your JSON is in. If you have nested data in your JSON, you can apply the same approach for each level.
The OPENJSON function is only supported by more recent versions of SQL Server. I'm using Azure SQL, so that works for me.
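A simplified sketch of that kind of view (the table, column, and JSON property names are made up to show the shape, not my actual schema):

-- Each element of the JSON array stored in ChildJsonData becomes its own row,
-- joined back to the scalar columns of the parent record.
CREATE VIEW dbo.vw_ChildItems AS
SELECT t.ItemId,
       child.ChildId,
       child.ChildName
FROM dbo.MyImportTable AS t
CROSS APPLY OPENJSON(t.ChildJsonData)
    WITH (
        ChildId   UNIQUEIDENTIFIER '$.ChildId',
        ChildName NVARCHAR(100)    '$.ChildName'
    ) AS child;

The WITH clause declares the columns and JSON paths to extract; nested JSON can be handled by applying another CROSS APPLY OPENJSON at the next level down.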

Database of weakly typed objects

I want to organize part of my system, but I can't choose a convenient form of data representation for interaction with my app.
So I have a local "repository" of data objects, described as follows:
Object1
{ id = TypeId, field1 = value1, otherObjectSpecifedField = value2 ... }
...
There are many objects (for example 1000) of many types (for example 50). Each type has its own UniqueId, its own description, and its own set of fields.
Next, for each object I have a set of filters which determine whether that object is currently applicable. It looks like this:
Filter
{ filterName1 = filterValue, filterName2 < filterValue }
Object // filter is applied for this object
{ ... }
The process of using this "repository":
In my app I have application states, which correspond to the filters above.
Example: the application localization can be 'en' (my application knows this value and can change it on startup), and we have a filter named 'localization'; in our repository we can use it like this:
Filter { localization = 'en'}
Object1 { ... } // this is the object I should choose when localization is 'en'
When my app decides to check which set of objects is currently applicable, it goes to the repository and asks: "Here is a TypeId; please walk through each filter+object pair and tell me which objects are applicable according to their filters. If you need some filter values resolved (like localization in the example above), I will resolve them for you."
The repository then walks through each object, compares which are applicable by their filters and which are not, and returns the applicable ones to the app. So it checks every filter of every object, returns an object only if all of its filters match, and it does all of this at runtime.
In the current implementation this set of filters + objects is stored in an XML file in a very specific XML format, which is convenient to read from the app but very hard for a human to maintain. I also think there is room to optimize the whole process: we could delegate walking through the objects and comparing their filters to someone else.
Now I am thinking about NoSQL document-oriented databases, because each object has its own unique structure, and maybe using a select routine I can choose what I need.
Does anyone have suggestions about this type of database organization? Maybe you know of a specific data structure for this type of data?
Maybe I've missed something, because it looks to me like you have a number of different types of objects: one type per TypeId. If that's so, then I think this can be done with a standard SQL database, assuming that the fields within an object have a consistent type. If not, it can still be done with a NoSQL database.
In a SQL database, you would use a separate table for each type (since they each have their own set of fields), and search across the appropriate table with SQL. So, for example, you can create a table with two fields (I'm using SQLite here, which doesn't require types for fields):
create table Object1 (field1, otherObjectSpecifiedField);
This table can then have data inserted into it:
insert into Object1 values ("field1value", "otherfieldValue");
Filtering uses standard SQL:
select * from Object1 where field1 = "field1value";
As I mentioned, this could also be done with a NoSQL database such as MongoDB. That would look something like this in the Mongo CLI:
Create a collection and insert the first object:
db.test.insert({ id: "TypeId", field1: "value1", otherObjectSpecifedField: "value2"});
Find an object in the collection:
db.test.find({id: "TypeId", field1: "value1"});
/* { "_id" : ObjectId("57cf97060216d33b891615ba"), "id" : "TypeId", "field1" : "value1", "otherObjectSpecifedField" : "value2" } */

ServiceNow - JSON Web Service, display related tables

I'm working on a C# program that retrieves data from a ServiceNow database and converts that data into C# .NET objects. I'm using the JSON Web Service to return my data in JSON format.
What I want to achieve is as follows: if there is a relational mapping behind a value (for example, I have a table called Company, where CEO is not a text field but a sys_id referencing an Employee table), I want to be able to output that data not as a sys_id (or just the name property via the 'displayvariable' parameter) but as an object displayed in JSON.
This means that the value of a property should be an object in JSON instead of just a single value.
A few examples:
// I don't want the JSON like this
{"Company":{"CEO":"b181e841c9212c008aeb36850331fab2"}}
// Or by displaying the name of the sys_id table
{"Company":{"CEO":"James Henderson" }}
// I want the data as follows, so I can have all the data I need inside a single JSON record.
{"Company":{"CEO":{"name":"James Henderson", "age":34, "sex":"male", "office":"SBN Left Floor 23"}}}
From reading the documentation, I couldn't find anything in the JSON Web Service that allows me to display the information like this, nor could I find any other alternative. It should have something to do with joining the tables and displaying it all in the right format.
I have been using SNC for almost three years and have not found a way to automatically join tables in a web service. Your best option would be to use a scripted web service which possibly takes a query parameter and a table parameter. Then you can JSON-serialize your result as you see fit.
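As a rough sketch (the table and field names are assumptions based on your example, not actual out-of-box fields), the script behind such a service could dereference the reference field itself and build the nested object before serializing it:

// Look up the company, then follow the CEO reference to build a nested object.
var companySysId = '<sys_id of the company record>'; // would normally come from the request
var result = { Company: {} };
var company = new GlideRecord('core_company');
if (company.get(companySysId)) {
    var ceo = company.ceo.getRefRecord(); // dereference the CEO sys_id into the employee record
    result.Company.CEO = {
        name:   ceo.getDisplayValue('name'),
        age:    parseInt(ceo.getValue('age'), 10),
        sex:    ceo.getValue('sex'),
        office: ceo.getDisplayValue('office')
    };
}
var payload = new JSON().encode(result); // or JSON.stringify(result) on newer instances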
Or, another option would be to generate a new processor that traverses the GlideRecord object. The ?JSON parameter you pass in the URL is merely a flag that routes your request to a particular processor. Unfortunately, the OOB one is, I believe, a Java class rather than a JS script, so you would need to write a script much like the one I mentioned earlier to traverse the object path, serializing the object graph as far down as you want to go.