Modifying JSON Data Structure in Data Factory

I have a JSON file that I need to move to Cosmos DB. I currently have a PowerShell script that modifies this file into a proper format to be used in a Data Flow or Copy activity in Azure Data Factory. However, I was wondering if there is a way to do all of these modifications in Azure Data Factory without using the PowerShell script.
The PowerShell script can manipulate a 50MB file in a matter of seconds. I would also like similar speed if we build something directly in Azure Data Factory.
Without the modification, I get an error because of the "#" sign. Furthermore, if I want to use companyId as my partition key, it is not allowed because it is inside of an array.
The current JSON file looks similar to the below:
{
  "Extract": {
    "consumptionInfo": {
      "Name": "Test Stuff",
      "createdOnTimestamp": "20200101161521Z",
      "Version": "1.0",
      "extractType": "Incremental",
      "extractDate": "20200101113514Z"
    },
    "company": [{
      "company": {
        "#action": "create",
        "companyId": "xxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb",
        "Status": "1",
        "StatusName": "1 - Test - Calendar"
      }
    }]
  }
}
I would like it to be converted to the below:
{
  "action": "create",
  "companyId": "xxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb",
  "Status": "1",
  "StatusName": "1 - Test - Calendar"
}

Create a new Data Flow that reads in your JSON file. Add a Select transformation to choose the properties you wish to send to Cosmos DB. If some of those properties are embedded inside of an array, then first use Flatten. You can also use the Select transformation to rename "#action" to "action".
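For reference, here is a minimal Python sketch (outside of ADF) of the reshaping that the Flatten + Select steps perform, assuming the input structure shown in the question; the function name and file path are only illustrative:

# Hedged sketch of the flatten/rename step; not an ADF API, just plain Python.
import json

def transform_extract(doc: dict) -> list[dict]:
    """Unroll Extract.company and rename '#action' to 'action'."""
    out = []
    for wrapper in doc["Extract"]["company"]:
        company = wrapper["company"]
        out.append({
            "action": company["#action"],  # renamed key
            **{k: v for k, v in company.items() if k != "#action"},
        })
    return out

with open("input.json") as f:              # illustrative path
    docs = transform_extract(json.load(f))

print(json.dumps(docs, indent=2))          # one Cosmos DB document per company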

Data Factory and Data Flows don't work well with nested JSON files. In my experience, the workaround may be a little complex, but it works well:
Source 1 + Flatten activity 1 to flatten the data in the key 'Extract'.
Source 2 (same as Source 1) + Flatten activity 2 to flatten the data in the key 'company'.
Add a Union activity 1 in the Source 1 flow to join the data after Flatten activity 2.
Create a Derived Column to filter the columns/keys you want after Union activity 1.
Then create the Azure Cosmos DB as the sink.
The Data Flow overview should look like this:

Related

How to send a lot of POST requests in JSON format through JMeter?

So I have this huge file of JSON requests that I need to send to an API through POST; there are about 4000 different requests. I tried the CSV method, referencing the JSON_FILE in code, but it didn't work due to a timeout error; I think 4000 files is just too much for this method.
I could create 4000 thread groups, each one with its own individual JSON request, but that would be a huge amount of manual labor.
Is there any way to automate this process?
The JSON looks basically like this:
{
  "u_id": "00",
  "u_operation": "Address",
  "u_service": "Fiber",
  "u_characteristic": 2,
  "u_name": "Address #1"
},
{
  "u_id": "01",
  "u_operation": "Address",
  "u_service": "TV",
  "u_characteristic": 2,
  "u_name": "Address #2"
}
All the way up to 4000
What is the anticipated usage of the API endpoint? If it's supposed to process 4000 files at once and it doesn't - this sounds like a bug or a bottleneck and you need to report it.
If you have a large file with 4000 objects like this:
{
  "u_id": "00",
  "u_operation": "Address",
  "u_service": "Fiber",
  "u_characteristic": 2,
  "u_name": "Address #1"
}
and want to send them one by one with an arbitrary number of users/iterations, you can use the following trick:
Add setUp Thread Group to your Test Plan
Add JSR223 Sampler to the setUp Thread Group
Put the following code into "Script" area:
// Parse the large JSON file into a list of entries
def entries = new groovy.json.JsonSlurper().parse(new File('/path/to/your/large/file.json'))

// Store each entry as a JMeter property: entry_1, entry_2, ...
entries.eachWithIndex { entry, index ->
    props.put('entry_' + (index + 1), new groovy.json.JsonBuilder(entry).toPrettyString())
}
This will create 4000 JMeter properties named entry_1, entry_2, and so on, each one holding one entry from your large file.
Then in the main Thread Group you will be able to use a combination of the __P() and __counter() functions so that each user takes the next "entry" on each iteration, like:
${__P(entry_${__counter(FALSE,)},)}

Ingesting from large json files to kusto from blob - expanding array of objects

I am trying to ingest a JSON file (a .zip file) from blob into Kusto, and then do further processing of the JSON using update policies.
Approach 1: the file has the following contents:
{
  "id": "id0",
  "logs": [
    {
      "timestamp": "2021-05-26T11:33:26.182Z",
      "message": "test message 1"
    },
    {
      "timestamp": "2021-05-26T11:33:26.182Z",
      "message": "test message 1"
    }
  ]
}
.create table test (
    logs: dynamic
)

.create-or-alter table test ingestion json mapping 'testmapping'
'['
'{"column":"logs","path":"$.logs","datatype":"","transform":"None"}'
']'

.ingest into table test (
    h@"sasUrlfromazureBlob"
)
with (
    format = "multijson",
    ingestionMappingReference = "testmapping",
    zipPattern = "*"
)
The above ingests the entire logs array into one row, but I want it to be expanded into multiple rows.
Approach 2: the input file contains:
[
  {
    "timestamp": "2021-05-26T11:33:26.182Z",
    "message": "test message 1"
  },
  {
    "timestamp": "2021-05-26T11:33:26.182Z",
    "message": "test message 1"
  }
]
.create table test (
    logs: dynamic
)

.create-or-alter table test ingestion json mapping 'testmapping'
'['
'{"column":"logs","path":"$","datatype":"","transform":"None"}'
']'
The above nicely expands the logs into multiple rows (2 rows in this example).
But when I select the array from an object (Approach 1), it is dumped into a single row, and the problem with that is the 1MB size limit for the dynamic data type.
If the issue is the transformation from the input data as in option #1 to the desired data as in option #2, you should use an external service to do the transformation, for example an Azure Function that reads the data in format #1 and writes it as #2.
As a side note, with option #2, you lose the "id" property.
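For illustration, here is a hedged Python sketch of that external transformation (the kind of logic an Azure Function could host); it reads a document shaped like Approach 1 and writes the Approach 2 shape, carrying the parent "id" along so it is not lost. The file paths are placeholders:

# Hedged sketch: expand the 'logs' array into one record per entry,
# propagating the parent 'id'. Paths are placeholders.
import json

def explode_logs(doc: dict) -> list[dict]:
    """Turn {'id': ..., 'logs': [...]} into one record per log entry."""
    return [{"id": doc["id"], **entry} for entry in doc.get("logs", [])]

with open("input.json") as f:
    source = json.load(f)

# Write the Approach 2 shape: a JSON array with one element per log line.
with open("output.json", "w") as f:
    json.dump(explode_logs(source), f, indent=2)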

ADF - Data Flow- Json Expression for Property name

I have a requirement to convert JSON into CSV (or a SQL table) or any other flattened structure using Data Flow in Azure Data Factory. I need to take the property names at some level of the hierarchy and the values of the child properties at a lower level of the hierarchy from the source JSON, and add them both as column/row values in CSV or any other flattened structure.
Source Data Rules/Constraints:
Parent-level data property names will change dynamically (e.g. the names ABCDataPoints, CementUse, CoalUse, ABCUseIndicators are dynamic).
The hierarchy always remains the same as in the sample JSON below.
I need some help in defining a JSON path/expression to get the names ABCDataPoints, CementUse, CoalUse, ABCUseIndicators, etc. I am able to figure out how to retrieve the values for the properties Value, ValueDate, ValueScore, and AsReported.
Source Data Structure:
{
  "ABCDataPoints": {
    "CementUse": {
      "Value": null,
      "ValueDate": null,
      "ValueScore": null,
      "AsReported": [],
      "Sources": []
    },
    "CoalUse": {
      "Value": null,
      "ValueDate": null,
      "AsReported": [],
      "Sources": []
    }
  },
  "ABCUseIndicators": {
    "EnvironmentalControversies": {
      "Value": false,
      "ValueDate": "2021-03-06T23:22:49.870Z"
    },
    "RenewableEnergyUseRatio": {
      "Value": null,
      "ValueDate": null,
      "ValueScore": null
    }
  },
  "XYZDataPoints": {
    "AccountingControversiesCount": {
      "Value": null,
      "ValueDate": null,
      "AsReported": [],
      "Sources": []
    },
    "AdvanceNotices": {
      "Value": null,
      "ValueDate": null,
      "Sources": []
    }
  },
  "XYXIndicators": {
    "AccountingControversies": {
      "Value": false,
      "ValueDate": "2021-03-06T23:22:49.870Z"
    },
    "AntiTakeoverDevicesAboveTwo": {
      "Value": 4,
      "ValueDate": "2021-03-06T23:22:49.870Z",
      "ValueScore": "0.8351945854483925"
    }
  }
}
Expected Flatten structure
Background:
After having multiple calls with ADF experts at Microsoft (our workplace has a Microsoft/Azure partnership), they concluded this is not possible with the out-of-the-box activities provided by ADF as is, neither with Data Flow (we did not need to use Data Flow, though) nor with the Flatten feature. The reason is that Data Flow/Flatten only unrolls array objects, and there are no mapping functions available to pick up the property names; custom expressions are in internal beta testing and will be in PA in the near future.
Conclusion/Solution:
Based on the calls with Microsoft employees, we agreed to go with multiple approaches, but both need custom code; without custom code this is not possible using out-of-the-box activities.
Solution-1: Use some code to flatten as per the requirement, using an ADF Custom Activity. The downside is that you need to use external compute (VM/Batch), and the supported options are not on-demand, so it is a little more expensive, but it works best if you have continuous streaming workloads. With this approach you also have to continuously monitor whether the input sources vary in size, because the compute needs to be elastic in that case, or else you will get out-of-memory exceptions.
Solution-2: Still requires writing custom code, but in a Function App.
Create a Copy activity with the source being the files with JSON content (preferably in a storage account).
Use a REST endpoint of the function as the target (not an Azure Function activity, because that has a 90-second timeout when called from an ADF activity).
The Function App takes JSON lines as input, then parses and flattens them (a sketch of the flattening logic is shown after this list).
This way you can scale the number of lines sent in each request to the function and also scale the parallel requests.
The function flattens as required into one file or multiple files and stores them in blob storage.
The pipeline continues from there as needed.
One problem with this approach is that if any of the ranges fails, the Copy activity will retry, but it will run the whole process again.
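As an illustration of the kind of flattening code the Function App (or Custom Activity) would run, here is a hedged Python sketch. Since the expected flattened structure is not reproduced here, the output columns (Category, DataPoint, Value, ValueDate, ValueScore, AsReported) and the file paths are assumptions, not part of the original requirement:

# Hedged sketch: flatten the dynamic property names into rows.
# Output columns and file paths are assumptions for illustration.
import csv
import json

def flatten(doc: dict) -> list[dict]:
    """Emit one row per second-level property, keeping the dynamic names as values."""
    rows = []
    for category, datapoints in doc.items():      # e.g. ABCDataPoints, ABCUseIndicators
        for name, fields in datapoints.items():   # e.g. CementUse, CoalUse
            rows.append({
                "Category": category,
                "DataPoint": name,
                "Value": fields.get("Value"),
                "ValueDate": fields.get("ValueDate"),
                "ValueScore": fields.get("ValueScore"),
                "AsReported": json.dumps(fields.get("AsReported", [])),
            })
    return rows

with open("source.json") as f:
    rows = flatten(json.load(f))

with open("flattened.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)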
Trying something very similar, is there any other / native solution to address this?
As mentioned in the response above, "custom expressions are in internal beta testing and will be in PA in the near future". Has this been GA'd yet? If yes, any reference documentation / samples would be of great help!

Load multiple increasing json files by ELK stack

I crawled a lot of JSON files into a data folder, all named by timestamp (./data/2021-04-05-12-00.json, ./data/2021-04-05-12-30.json, ./data/2021-04-05-13-00.json, ...).
Now I'm trying to use the ELK stack to load those increasing JSON files.
The JSON files are pretty-printed like:
{
  "datetime": "2021-04-05 12:00:00",
  "length": 3,
  "data": [
    {
      "id": 97816,
      "num_list": [1, 2, 3],
      "meta_data": "{'abc', 'cde'}",
      "short_text": "This is data 97816"
    },
    {
      "id": 97817,
      "num_list": [4, 5, 6],
      "meta_data": "{'abc'}",
      "short_text": "This is data 97817"
    },
    {
      "id": 97818,
      "num_list": [],
      "meta_data": "{'abc', 'efg'}",
      "short_text": "This is data 97818"
    }
  ]
}
I tried using the Logstash multiline plugin to extract the JSON files, but it seems like it handles each file as one event. Is there any way to extract each record in the JSON data field as an event?
Also, what's the best practice for loading multiple increasing pretty-printed JSON files in ELK?
Using multiline is correct if you want to handle each file as one input event.
Then you need to leverage the split filter in order to create one event for each element in the data array:
filter {
  split {
    field => "data"
  }
}
So Logstash reads one file as a whole and passes its content as a single event to the filter layer, and the split filter shown above then spawns one new event for each element in the data array.

Issue with Cloud Datastore backup in BigQuery

I use an App Engine Datastore backup file to create a BigQuery table. The issue I face is that all the JSON values are treated as 'flattened strings' by default.
I couldn't access the repeated string values, for example the value below for the column qoption:
[{
  "optionId": 0,
  "optionTitle": "All inclusive",
  "optionImageUrl": "http://sampleurl",
  "masterCatInfo": 95680,
  "brInfo": 56502428160,
  "category": "",
  "tags": ["Holiday"]
}, {
  "optionId": 1,
  "optionTitle": "Self catered",
  "optionImageUrl": "http://sampleurl1",
  "masterCatInfo": 520280,
  "brId": 56598160,
  "category": "",
  "tags": ["Holiday"]
}]
Is it possible to recreate the existing table in JSON format, ideally through the BQ CLI, so that I can access qoption.optionId, qoption.optionTitle, etc.?
Take a look at Nested and Repeated Data. Basically, you have to manually set up your BigQuery schema with a nested data schema. Once that is done and your data is imported, you should be able to use your nested properties.
Alternatively, BigQuery can parse your JSON ad hoc at query time (for example with JSON functions such as JSON_EXTRACT).
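For the nested-schema route, here is a hedged sketch using the google-cloud-bigquery Python client (rather than the bq CLI) to define qoption as a repeated RECORD column; the project/dataset/table names are placeholders, and the same structure can also be expressed as a JSON schema file for the bq CLI:

# Hedged sketch: define 'qoption' as a repeated RECORD so that
# qoption.optionId, qoption.optionTitle, etc. become queryable.
# Project/dataset/table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

qoption = bigquery.SchemaField(
    "qoption", "RECORD", mode="REPEATED",
    fields=[
        bigquery.SchemaField("optionId", "INTEGER"),
        bigquery.SchemaField("optionTitle", "STRING"),
        bigquery.SchemaField("optionImageUrl", "STRING"),
        bigquery.SchemaField("masterCatInfo", "INTEGER"),
        bigquery.SchemaField("brInfo", "INTEGER"),
        bigquery.SchemaField("category", "STRING"),
        bigquery.SchemaField("tags", "STRING", mode="REPEATED"),
    ],
)

table = bigquery.Table("my-project.my_dataset.backup_table", schema=[qoption])
client.create_table(table)  # then load the data and query the nested fields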