ADF Data Flow: JSON expression for property names

I have a requirement to convert JSON into CSV (or a SQL table, or any other flat structure) using a Data Flow in Azure Data Factory. From the source JSON I need to take the property names at one level of the hierarchy and the values of the child properties at a lower level, and add both as column/row values in the CSV or other flat structure.
Source data rules/constraints:
Parent-level property names change dynamically (e.g. ABCDataPoints, CementUse, CoalUse, ABCUseIndicators are all dynamic names).
The hierarchy always remains the same as in the sample JSON below.
I need some help defining a JSON path/expression to get the names ABCDataPoints, CementUse, CoalUse, ABCUseIndicators, etc. I have already figured out how to retrieve the values of the properties Value, ValueDate, ValueScore and AsReported.
Source Data Structure :
{
  "ABCDataPoints": {
    "CementUse": {
      "Value": null,
      "ValueDate": null,
      "ValueScore": null,
      "AsReported": [],
      "Sources": []
    },
    "CoalUse": {
      "Value": null,
      "ValueDate": null,
      "AsReported": [],
      "Sources": []
    }
  },
  "ABCUseIndicators": {
    "EnvironmentalControversies": {
      "Value": false,
      "ValueDate": "2021-03-06T23:22:49.870Z"
    },
    "RenewableEnergyUseRatio": {
      "Value": null,
      "ValueDate": null,
      "ValueScore": null
    }
  },
  "XYZDataPoints": {
    "AccountingControversiesCount": {
      "Value": null,
      "ValueDate": null,
      "AsReported": [],
      "Sources": []
    },
    "AdvanceNotices": {
      "Value": null,
      "ValueDate": null,
      "Sources": []
    }
  },
  "XYXIndicators": {
    "AccountingControversies": {
      "Value": false,
      "ValueDate": "2021-03-06T23:22:49.870Z"
    },
    "AntiTakeoverDevicesAboveTwo": {
      "Value": 4,
      "ValueDate": "2021-03-06T23:22:49.870Z",
      "ValueScore": "0.8351945854483925"
    }
  }
}
Expected flattened structure:
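Presumably something like the following, with one row per second-level property (the column names here are illustrative, not from the original post):

Group             Name                          Value   ValueDate                  ValueScore          AsReported
ABCDataPoints     CementUse                     null    null                       null                []
ABCDataPoints     CoalUse                       null    null                                           []
ABCUseIndicators  EnvironmentalControversies    false   2021-03-06T23:22:49.870Z
ABCUseIndicators  RenewableEnergyUseRatio       null    null                       null
XYZDataPoints     AccountingControversiesCount  null    null                                           []
XYZDataPoints     AdvanceNotices                null    null
XYXIndicators     AccountingControversies       false   2021-03-06T23:22:49.870Z
XYXIndicators     AntiTakeoverDevicesAboveTwo   4       2021-03-06T23:22:49.870Z   0.8351945854483925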

Background:
After multiple calls with ADF experts at Microsoft (our workplace has a Microsoft/Azure partnership), they concluded that this is not possible with the out-of-the-box activities ADF provides, neither with Data Flow (though Data Flow is not required here) nor with the Flatten transformation. The reason is that Data Flow/Flatten only unrolls array objects, and there are no mapping functions available to pick up property names. Custom expressions are in internal beta testing and will reach public availability (PA) in the near future.
Conclusion/Solution:
Based on the calls with Microsoft employees, we agreed to pursue two approaches. Both need custom code; without custom code this is not possible using out-of-the-box activities.
Solution-1: Use custom code to flatten the JSON as required, running in an ADF Custom Activity. The downside is that you need external compute (VM/Azure Batch), and the supported options are not on-demand, so it is somewhat expensive; it works best if you have continuous workloads. With this approach you also have to keep monitoring whether the input sources vary in size, because the compute needs to be elastic in that case or you will hit out-of-memory exceptions.
Solution-2: Still needs custom code, but in a Function App.
Create a Copy activity with the JSON files as source (preferably in a storage account).
Use the function's REST endpoint as the sink (not an Azure Function activity, because that has a 90-second timeout when called from an ADF activity).
The function app takes JSON lines as input and parses and flattens them (a sketch of the flattening logic is shown below).
This way you can scale the number of lines sent in each request to the function, and also scale the number of parallel requests.
The function flattens the data as required into one or more files and stores them in blob storage.
The pipeline continues from there as needed.
One problem with this approach: if any of the ranges fails, the Copy activity retries, but it re-runs the whole process.
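For illustration, here is a minimal sketch of the flattening logic such a function app (or the custom activity from Solution-1) might run, written as a standalone Python script. The Section/Metric column names are my own choice, and the Sources arrays are ignored for brevity:

import csv
import json
import sys

# Leaf columns we expect at the lowest level of the hierarchy.
LEAF_COLUMNS = ["Value", "ValueDate", "ValueScore", "AsReported"]

def flatten(doc):
    # One row per second-level property, keeping the dynamic names
    # (e.g. ABCDataPoints, CementUse) as column values.
    for section_name, section in doc.items():
        for metric_name, metric in section.items():
            row = {"Section": section_name, "Metric": metric_name}
            for col in LEAF_COLUMNS:
                row[col] = metric.get(col)
            yield row

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        doc = json.load(f)
    writer = csv.DictWriter(sys.stdout,
                            fieldnames=["Section", "Metric"] + LEAF_COLUMNS)
    writer.writeheader()
    writer.writerows(flatten(doc))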

Trying something very similar; is there any other/native solution to address this?
As mentioned in the response above, has this been GA'd yet? If yes, any reference documentation/samples would be of great help!
"Custom expressions are in internal beta testing and will reach public availability in the near future."

Related

Modifying JSON Data Structure in Data Factory

I have a JSON file that I need to move to Cosmos DB. I currently have a PowerShell script that modifies this file into the proper format for a Data Flow or Copy activity in Azure Data Factory. However, I was wondering if there is a way to do all these modifications in Azure Data Factory without using the PowerShell script.
The PowerShell script can process a 50 MB file in a matter of seconds. I would like similar speed if we build something directly in Azure Data Factory.
Without the modification, I get an error because of the "#" sign. Furthermore, if I want to use companyId as my partition key, that is not allowed because it is inside an array.
The current JSON file looks similar to the below:
{
  "Extract": {
    "consumptionInfo": {
      "Name": "Test Stuff",
      "createdOnTimestamp": "20200101161521Z",
      "Version": "1.0",
      "extractType": "Incremental",
      "extractDate": "20200101113514Z"
    },
    "company": [{
      "company": {
        "#action": "create",
        "companyId": "xxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb",
        "Status": "1",
        "StatusName": "1 - Test - Calendar"
      }
    }]
  }
}
I would like it converted to the below:
{
  "action": "create",
  "companyId": "xxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb",
  "Status": "1",
  "StatusName": "1 - Test - Calendar"
}
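For reference, here is a sketch of the reshaping itself, roughly what the PowerShell script presumably does, written in Python; the file name is a placeholder:

import json

with open("extract.json") as f:  # hypothetical input path
    doc = json.load(f)

flattened = []
for item in doc["Extract"]["company"]:
    record = dict(item["company"])
    record["action"] = record.pop("#action")  # rename "#action" -> "action"
    flattened.append(record)

# One flat record per company entry, ready for a Cosmos DB upload.
print(json.dumps(flattened, indent=2))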
Create a new data flow that reads in your JSON file. Add a Select transformation to choose the properties you wish to send to Cosmos DB. If some of those properties are embedded inside an array, then first use Flatten. You can also use the Select transformation to rename "#action" to "action".
Data Factory and Data Flow don't work well with nested JSON files. In my experience, the workaround may be a little complex, but it works well:
Source 1 + Flatten transformation 1 to flatten the data under the key 'Extract'.
Source 2 (same as Source 1) + Flatten transformation 2 to flatten the data under the key 'company'.
Add a Union transformation 1 in the Source 1 flow to join the data after Flatten transformation 2.
Create a Derived Column to filter the columns/keys you want after Union transformation 1.
Then add Azure Cosmos DB as the sink.
The Data Flow overview should look like this:

JSON deserialization in Talend

Trying to figure out how to deserialize this kind of JSON in Talend components:
{
  "ryan#toofr.com": {
    "confidence": 119, "email": "ryan#toofr.com", "default": 20
  },
  "rbuckley#toofr.com": {
    "confidence": 20, "email": "rbuckley#toofr.com", "default": 15
  },
  "ryan.buckley#toofr.com": {
    "confidence": 18, "email": "ryan.buckley#toofr.com", "default": 16
  },
  "ryanbuckley#toofr.com": {
    "confidence": 17, "email": "ryanbuckley#toofr.com", "default": 17
  },
  "ryan_buckley#toofr.com": {
    "confidence": 16, "email": "ryan_buckley#toofr.com", "default": 18
  },
  "ryan-buckley#toofr.com": {
    "confidence": 15, "email": "ryan-buckley#toofr.com", "default": 19
  },
  "ryanb#toofr.com": {
    "confidence": 14, "email": "ryanb#toofr.com", "default": 14
  },
  "buckley#toofr.com": {
    "confidence": 13, "email": "buckley#toofr.com", "default": 13
  }
}
This JSON comes from the Toofr API, whose documentation can be found here.
Here is the actual situation:
For each line retrieved from the database, I call the API and get a response like this (the first name, last name and company change every time).
Does anyone know how to configure tExtractJSONFields (or use something else) to show the results in tLogRow (for each line in the database)?
Thank you in advance!
EDIT 1:
Here is my tExtractJSONFields configuration:
When using tExtractJSONFields with XPath, you need:
1) a valid XPath loop point
2) valid XPath mappings to your structure, relative to the loop path
Also, when using XPath with Talend, every value needs a key, and the key cannot change if you want to loop over it. This means the following is invalid:
{
  "ryan#toofr.com": {
    "confidence": 119, "email": "ryan#toofr.com", "default": 20
  },
  "rbuckley#toofr.com": {
    "confidence": 20, "email": "rbuckley#toofr.com", "default": 15
  },
but this structure would be valid:
{
  "contact": {
    "confidence": 119, "email": "ryan#toofr.com", "default": 20
  },
  "contact": {
    "confidence": 20, "email": "rbuckley#toofr.com", "default": 15
  },
So with the corrected data the loop point might be /contact.
Then the mapping for Confidence would be confidence (the name from the JSON), the mapping for Email would be email, and likewise default for Default.
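Since the API response itself cannot be changed, one option is to pre-process it before Talend reads it. Here is a minimal sketch in Python, assuming the response is stored in a file; the contacts wrapper key and file names are my own choices:

import json

# Load the raw Toofr-style response with dynamic top-level keys.
with open("toofr_response.json") as f:
    doc = json.load(f)

# Turn {"ryan#toofr.com": {...}, ...} into {"contacts": [{...}, ...]}
# so a loop has a stable key to iterate over. Each inner object already
# carries its email, so dropping the outer keys loses no information.
loopable = {"contacts": list(doc.values())}

with open("toofr_loopable.json", "w") as f:
    json.dump(loopable, f, indent=2)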
EDIT
JSONPath has a few disadvantages, one of them being that you cannot go back up the hierarchy. You can try finding the correct query with jsonpath.com.
The loop expression could be $.*. I am not sure whether that will satisfy your need, though; it has been a while since I used JSONPath in Talend, because of those downsides.
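To sanity-check the $.* loop outside Talend, here is a small sketch using the Python jsonpath-ng package (my own choice of tool; it is not what Talend uses internally):

import json
from jsonpath_ng import parse

with open("toofr_response.json") as f:  # same hypothetical file as above
    doc = json.load(f)

# "$.*" matches every top-level value, regardless of its (dynamic) key.
for match in parse("$.*").find(doc):
    print(match.value["email"], match.value["confidence"])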
I have been ingesting some complex JSON structures and did this via the minimal-json library and tJava components within Talend.

getDegree()/isOutgoing() functions don't work in graphAware/neo4j-to-elasticsearch mapping.json

Neo4j Version: 3.2.2
Operating System: Ubuntu 16.04
I use the getDegree() function in the mapping.json file, but the return value is always null. I am using the Movie/Actor dataset from the Neo4j tutorial.
Output from the Elasticsearch request:
mapping.json
{
  "defaults": {
    "key_property": "uuid",
    "nodes_index": "default-index-node",
    "relationships_index": "default-index-relationship",
    "include_remaining_properties": true
  },
  "node_mappings": [
    {
      "condition": "hasLabel('Person')",
      "type": "getLabels()",
      "properties": {
        "getDegree": "getDegree()",
        "getDegree(type)": "getDegree('ACTED_IN')",
        "getDegree(direction)": "getDegree('OUTGOING')",
        "getDegree('type', 'direction')": "getDegree('ACTED_IN', 'OUTGOING')",
        "getDegree-degree": "degree"
      }
    }
  ],
  "relationship_mappings": [
    {
      "condition": "allRelationships()",
      "type": "type"
    }
  ]
}
Also, if I use the isOutgoing(), isIncoming() or otherNode functions in the properties part of relationship_mappings, Elasticsearch never loads the relationship data from Neo4j. I think I have probably misunderstood the sentence "only when one of the participating nodes 'looking' at the relationship is provided" on this page: https://github.com/graphaware/neo4j-framework/tree/master/common#inclusion-policies
mapping.json
{
  "defaults": {
    "key_property": "uuid",
    "nodes_index": "default-index-node",
    "relationships_index": "default-index-relationship",
    "include_remaining_properties": true
  },
  "node_mappings": [
    {
      "condition": "allNodes()",
      "type": "getLabels()"
    }
  ],
  "relationship_mappings": [
    {
      "condition": "allRelationships()",
      "type": "type",
      "properties": {
        "isOutgoing": "isOutgoing()",
        "isIncoming": "isIncoming()",
        "otherNode": "otherNode"
      }
    }
  ]
}
BTW, is there any page that lists all of the functions we can use in mapping.json? I know two of them:
github.com/graphaware/neo4j-framework/tree/master/common#inclusion-policies
github.com/graphaware/neo4j-to-elasticsearch/blob/master/docs/json-mapper.md
but it seems there are more, since I can use getType(), which isn't listed on either of the above pages.
Please let me know if I can provide anything further to help solve the problem.
Thanks!
The getDegree() function is not available, contrary to getType(). I will explain why:
When the mapper (the part responsible for creating the node or relationship representation as an ES document) is doing its job, it receives a DetachedGraphObject, i.e. a detached node or relationship.
Detached means this happens outside of a transaction, so query operations against the database are no longer available. getType() is available because it is part of the relationship metadata and is cheap to read; doing the same for getDegree() could be seriously more costly during creation of the detached object (which does happen in a transaction), depending on the number of different relationship types, etc.
This is, however, something we are working on: externalising the mapper into a standalone Java application coupled with a broker like Kafka or RabbitMQ between Neo4j and that application. We would not, however, offer the possibility to re-query the graph in the current version of the module, as it can have serious performance impacts if the user is not very careful.
Lastly, the only suggestion I can give you is to keep a property on your node with the degree updates you need replicated to ES.
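Here is a minimal sketch of that suggestion, assuming the Python Neo4j driver and placeholder connection details; it recomputes an actedInDegree property (a name I made up) that the mapper can then ship to ES as an ordinary node property:

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))  # placeholders

# Store the degree as a plain property so it survives the detached
# mapping; re-run after writes that add or remove ACTED_IN relationships.
UPDATE_DEGREES = """
MATCH (p:Person)
SET p.actedInDegree = size((p)-[:ACTED_IN]->())
"""

with driver.session() as session:
    session.run(UPDATE_DEGREES)
driver.close()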
UPDATE
Regarding this part of the documentation:
"For Relationships: only when one of the participating nodes 'looking' at the relationship is provided."
This applies only when you are not using the JSON definition; you can use one or the other. The JSON definition was added later, and the two cannot be used together.
To answer this part: it means that the node on the incoming or outgoing side, depending on the definition, should be included in the inclusion policy for nodes, e.g. hasLabel('Employee') || hasProperty('form') || getProperty('age', 0) > 20. If you have an allNodes policy then it is fine.

How to tell when a document was created/updated in Couchbase?

I am using Spring Data to fetch Couchbase documents. I assume the document expiry cannot be set to a configurable value externalized in a property file (@Document(expiry = ...)), based on the answer in this link.
However, can I define a view in Couchbase and have a query return only the documents created in the past 15 minutes? I don't want to store additional date information in my documents, and I am looking to see if there is any metadata that can be used for this purpose.
function (doc, meta) {
  if (doc._class == "com.customer.types.CustomerInfo") {
    emit(meta.id, doc);
  }
}
If you go into the web console and create a development view, you can have a look at what metadata is available using the preview:
emit(meta.id, meta)
This gives us
{
  "id": "someKey",
  "rev": "3-00007098b90700000000000002000000",
  "seq": "3",
  "vb": "56",
  "expiration": 0,
  "flags": 33554432,
  "type": "json"
}
So as you can see, there is no explicit metadata you can use to tell whether a doc has been created/updated in the last 15 minutes.
Side note: the id of the doc will always be part of the view's response, so emitting it is a bit redundant. More importantly, don't emit the whole doc as the value (if in doubt about what to emit as the second field, just emit null). The value field is stored in the index, so you would basically be storing your document's content twice (once in the primary store, once in the view index)!

Ember-Data: How to get properties from nested JSON

I am getting JSON returned in this format:
{
  "status": "success",
  "data": {
    "debtor": {
      "debtor_id": 1301,
      "key": value,
      "key": value,
      "key": value
    }
  }
}
Somehow, my RESTAdapter needs to provide my debtor model properties from the "debtor" section of the JSON.
Currently, I am getting a successful response from the server, but also a console error saying that Ember cannot find a model for "status". I can't find anything in the Ember model guide about how to deal with JSON nested like this.
So far, I have been able to do a few simple things like extending the RESTSerializer to accept "debtor_id" as the primaryKey, and also removing the pluralization of the GET URL request... but I can't find any clear guide on reaching a deeply nested JSON property.
Extending the problem detail for clarity:
I need to somehow alter the default behavior of the Adapter/Serializer, because this JSON convention is being used for many purposes other than my Ember app.
My solution thus far:
With a friend, we were able to dissect the "extract API" (thanks @lame_coder for pointing me to it).
We came up with a way to extend the serializer on a case-by-case basis, but I am not sure it is really an "Ember-approved" solution...
// app/serializers/debtor.js
import DS from 'ember-data';

export default DS.RESTSerializer.extend({
  primaryKey: 'debtor_id',

  extract: function(store, type, payload, id, requestType) {
    // Promote debtor_id to the id Ember expects, then return only
    // the nested debtor object as the payload.
    payload.data.debtor.id = payload.data.debtor.debtor_id;
    return payload.data.debtor;
  }
});
It seems that even though I was able to change the primaryKey for requesting data, Ember was still trying to use a hard-coded id to identify the correct record (rather than the debtor_id I had set). So we simply overrode the extract method to force Ember to look for the primary key I wanted.
Again, this works for me currently, but I have yet to see whether the change will cause any problems moving forward...
I would still be interested in a different solution that might be more stable/reusable/future-proof, etc., if anyone has any insights.
From the description of the problem it looks like your model definition and JSON structure do not match. You need to make them exactly the same in order for the serializer to map them correctly.
If you decide to change your REST API, the return statement would be something like the following (I am using mock data):
//your Get method on the service
public object Get()
{
    return new { debtor = new { debtor_id = 1301, key1 = value1, key2 = value2 } };
}
The JSON that Ember is expecting needs to look like this:
"debtor": {
"id": 1301,
"key": value,
"key": value,
"key": value
}
Ember sees "status" as a model that it needs to load data for. The next problem is that the payload needs to contain "id", not "debtor_id".
If you need to return several objects, you would do this:
"debtors": [{
"id": 1301,
"key": value,
"key": value,
"key": value
},{
"id": 1302,
"key": value,
"key": value,
"key": value
}]
Make sense?