Issue with Cloud Datastore backup in BigQuery - json

I use an App Engine Datastore backup file to create a BigQuery table. The issue I face is that all the JSON values are treated as flattened strings by default.
For example, I can't access the repeated values shown below. The value is for the column qoption:
[{
    "optionId": 0,
    "optionTitle": "All inclusive",
    "optionImageUrl": "http://sampleurl",
    "masterCatInfo": 95680,
    "brInfo": 56502428160,
    "category": "",
    "tags": ["Holiday"]
}, {
    "optionId": 1,
    "optionTitle": "Self catered",
    "optionImageUrl": "http://sampleurl1",
    "masterCatInfo": 520280,
    "brId": 56598160,
    "category": "",
    "tags": ["Holiday"]
}]
Is it possible to recreate the existing table with a JSON (nested) schema, ideally through the BQ CLI, so that I can access qoption.optionId, qoption.optionTitle, etc.?

Take a look at Nested and Repeated Data. Basically, you have to manually set up your BigQuery schema as a nested schema. Once that is done and your data is imported, you should be able to use your nested properties.
Alternatively, BigQuery can parse your JSON ad hoc.
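For reference, a nested/repeated schema for the qoption column might look like the sketch below (field names are taken from the sample value above; the types are assumptions). You would supply a schema like this when creating the table instead of letting the column land as a plain string:
[
    {"name": "qoption", "type": "RECORD", "mode": "REPEATED", "fields": [
        {"name": "optionId", "type": "INTEGER"},
        {"name": "optionTitle", "type": "STRING"},
        {"name": "optionImageUrl", "type": "STRING"},
        {"name": "masterCatInfo", "type": "INTEGER"},
        {"name": "brInfo", "type": "INTEGER"},
        {"name": "category", "type": "STRING"},
        {"name": "tags", "type": "STRING", "mode": "REPEATED"}
    ]}
]
For the ad-hoc route, BigQuery's JSON functions can pull individual fields out of the existing string column without reloading, for example JSON_EXTRACT_SCALAR(qoption, '$[0].optionTitle').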

Related

Modifying JSON Data Structure in Data Factory

I have a JSON file that I need to move to Cosmos DB. I currently have a PowerShell script that modifies this file into the proper format to be used in a Data Flow or Copy activity in Azure Data Factory. However, I was wondering if there is a way to do all of these modifications in Azure Data Factory without using the PowerShell script.
The PowerShell script can manipulate a 50 MB file in a matter of seconds. I would like similar speed if we build something directly in Azure Data Factory.
Without the modification, I get an error because of the "#" sign. Furthermore, if I want to use companyId as my partition key, it is not allowed because it is inside an array.
The current JSON file looks similar to the below:
{
    "Extract": {
        "consumptionInfo": {
            "Name": "Test Stuff",
            "createdOnTimestamp": "20200101161521Z",
            "Version": "1.0",
            "extractType": "Incremental",
            "extractDate": "20200101113514Z"
        },
        "company": [{
            "company": {
                "#action": "create",
                "companyId": "xxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb",
                "Status": "1",
                "StatusName": "1 - Test - Calendar"
            }
        }]
    }
}
I would like it to be converted to the below:
{
    "action": "create",
    "companyId": "xxxxxxx-yyyy-zzzz-aaaa-bbbbbbbbbbbb",
    "Status": "1",
    "StatusName": "1 - Test - Calendar"
}
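For reference, the reshaping itself is small. A minimal Python sketch of the transformation described above (file names are placeholders, and it assumes every element of the "company" array wraps a single "company" object), roughly what the PowerShell script presumably does:

import json

# Read the source document (path is a placeholder).
with open("extract.json") as f:
    doc = json.load(f)

records = []
for item in doc["Extract"]["company"]:
    company = item["company"]
    records.append({
        "action": company["#action"],       # "#action" renamed to "action"
        "companyId": company["companyId"],  # promoted out of the array so it can serve as a partition key
        "Status": company["Status"],
        "StatusName": company["StatusName"],
    })

# Write the flat company records out as JSON.
with open("companies.json", "w") as f:
    json.dump(records, f, indent=2)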
Create a new data flow that reads in your JSON file. Add a Select transformation to choose the properties you wish to send to Cosmos DB. If some of those properties are embedded inside an array, then use Flatten first. You can also use the Select transformation to rename "#action" to "action".
Data Factory and Data Flow don't work well with nested JSON files. In my experience the workaround is a little complex, but it works well:
Source 1 + Flatten transformation 1 to flatten the data in the key 'Extract'.
Source 2 (same as Source 1) + Flatten transformation 2 to flatten the data in the key 'company'.
Add a Union transformation 1 in the Source 1 flow to join the data after Flatten transformation 2.
Create a Derived Column to filter the columns/keys you want after Union transformation 1.
Then set Azure Cosmos DB as the sink.
The data flow overview should look like this:

ADF - Data Flow - JSON Expression for Property Name

I have a requirement to convert JSON into CSV (or a SQL table) or any other flattened structure using Data Flow in Azure Data Factory. I need to take the property names at one level of the hierarchy and the values of the child properties at a lower level of the hierarchy from the source JSON, and add both as column/row values in the CSV or other flattened structure.
Source Data Rules/Constraints:
Parent-level property names change dynamically (e.g. the names ABCDataPoints, CementUse, CoalUse, ABCUseIndicators are dynamic).
The hierarchy always remains the same as in the sample JSON below.
I need some help in defining a JSON path/expression to get the names ABCDataPoints, CementUse, CoalUse, ABCUseIndicators, etc. I have figured out how to retrieve the values for the properties Value, ValueDate, ValueScore and AsReported.
Source Data Structure:
{
    "ABCDataPoints": {
        "CementUse": {
            "Value": null,
            "ValueDate": null,
            "ValueScore": null,
            "AsReported": [],
            "Sources": []
        },
        "CoalUse": {
            "Value": null,
            "ValueDate": null,
            "AsReported": [],
            "Sources": []
        }
    },
    "ABCUseIndicators": {
        "EnvironmentalControversies": {
            "Value": false,
            "ValueDate": "2021-03-06T23:22:49.870Z"
        },
        "RenewableEnergyUseRatio": {
            "Value": null,
            "ValueDate": null,
            "ValueScore": null
        }
    },
    "XYZDataPoints": {
        "AccountingControversiesCount": {
            "Value": null,
            "ValueDate": null,
            "AsReported": [],
            "Sources": []
        },
        "AdvanceNotices": {
            "Value": null,
            "ValueDate": null,
            "Sources": []
        }
    },
    "XYXIndicators": {
        "AccountingControversies": {
            "Value": false,
            "ValueDate": "2021-03-06T23:22:49.870Z"
        },
        "AntiTakeoverDevicesAboveTwo": {
            "Value": 4,
            "ValueDate": "2021-03-06T23:22:49.870Z",
            "ValueScore": "0.8351945854483925"
        }
    }
}
Expected flattened structure
Background:
After multiple calls with ADF experts at Microsoft (our workplace has a Microsoft/Azure partnership), they concluded that this is not possible with the out-of-the-box activities provided by ADF as-is, neither with Data Flow (we did not strictly need Data Flow anyway) nor with the Flatten feature. The reason is that Data Flow/Flatten only unrolls array objects and there are no mapping functions available to pick up the property names; custom expressions are in internal beta testing and will be in PA in the near future.
Conclusion/Solution:
Based on the calls with the Microsoft employees, we agreed to pursue multiple approaches, but both need custom code; without custom code this is not possible using out-of-the-box activities.
Solution 1: Use some code to flatten as required via an ADF Custom Activity. The downside is that you need external compute (VM/Batch), and the supported options are not on-demand, so it is a little expensive, but it works best if you have continuous streaming workloads. With this approach you also have to continuously monitor whether input sources are of different sizes, because the compute needs to be elastic in that case or you will get out-of-memory exceptions.
Solution 2: Still requires writing custom code, but in a function app.
Create a Copy activity with the JSON files (preferably in a storage account) as the source.
Use the function's REST endpoint as the target (not as a Function activity, because it has a 90-second timeout when called from an ADF activity).
The function app takes JSON lines as input, parses them and flattens them (a sketch of this flattening step follows after these steps).
This way you can scale the number of lines sent in each request to the function, and also scale the number of parallel requests.
The function does the flattening as required into one file or multiple files and stores them in blob storage.
The pipeline continues from there as needed.
One problem with this approach: if any of the ranges fails, the Copy activity will retry, but it will rerun the whole process.
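As a rough illustration of the flattening step that Solution 2 delegates to the function app (not the actual implementation; the output column names are assumptions based on the properties listed in the question):

import csv
import json

def flatten(doc):
    # The dynamic parent/child property names become row values,
    # the fixed leaf properties become columns.
    for category, metrics in doc.items():        # e.g. "ABCDataPoints"
        for metric, leaf in metrics.items():     # e.g. "CementUse"
            yield {
                "Category": category,
                "Metric": metric,
                "Value": leaf.get("Value"),
                "ValueDate": leaf.get("ValueDate"),
                "ValueScore": leaf.get("ValueScore"),
                "AsReported": json.dumps(leaf.get("AsReported", [])),
            }

with open("source.json") as f:                   # placeholder path
    rows = list(flatten(json.load(f)))

with open("flattened.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)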
I'm trying something very similar; is there any other/native solution to address this?
As mentioned in the response above, has this gone GA yet? If so, any reference documentation/samples would be of great help!
"Custom expressions are in internal beta testing and will be in PA in the near future."

Load a growing set of JSON files with the ELK stack

I crawled a lot of JSON files into a data folder, all named by timestamp (./data/2021-04-05-12-00.json, ./data/2021-04-05-12-30.json, ./data/2021-04-05-13-00.json, ...).
Now I'm trying to use the ELK stack to load this growing set of JSON files.
Each JSON file is pretty-printed like:
{
    "datetime": "2021-04-05 12:00:00",
    "length": 3,
    "data": [
        {
            "id": 97816,
            "num_list": [1,2,3],
            "meta_data": "{'abc', 'cde'}",
            "short_text": "This is data 97816"
        },
        {
            "id": 97817,
            "num_list": [4,5,6],
            "meta_data": "{'abc'}",
            "short_text": "This is data 97817"
        },
        {
            "id": 97818,
            "num_list": [],
            "meta_data": "{'abc', 'efg'}",
            "short_text": "This is data 97818"
        }
    ]
}
I tried using the Logstash multiline plugin to ingest the JSON files, but it seems to handle each file as one event. Is there any way to turn each record in the JSON data field into its own event?
Also, what's the best practice for loading multiple growing, pretty-printed JSON files into ELK?
Using multiline is correct if you want to handle each file as one input event.
Then you need to leverage the split filter in order to create one event for each element in the data array:
filter {
    split {
        field => "data"
    }
}
So Logstash reads each file as a whole and passes its content as a single event to the filter layer; the split filter shown above then spawns one new event for each element in the data array.
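A fuller pipeline sketch along these lines (the path, index name and multiline pattern are assumptions; the pattern relies on only the top-level opening brace starting at the beginning of a line):

input {
    file {
        path => "/path/to/data/*.json"
        codec => multiline {
            pattern => "^\{"
            negate => true
            what => "previous"
            auto_flush_interval => 2
        }
    }
}
filter {
    json {
        source => "message"
    }
    split {
        field => "data"
    }
}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "crawled-data"
    }
}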

JMeter body data in CSV

I am running some tests in JMeter, and the accepted data has to be in the format shown in the example below.
{
    "messageType": 1,
    "customerId": 5922429,
    "name": "Joe Bloggs",
    "phone": "01234567890",
    "postcode": "PO6 3EN",
    "emailAddress": "joe.bloggs#example.com",
    "jobDescription": "do some stuff",
    "companyIds": [893999]
}
Now this works great, but we want to randomise things a little and read the test data from a CSV file with about 20 samples.
Is this possible with the data having to be laid out as above?
Currently the body data sits directly in the sampler.
You have 2 options:
Modify your payload to rely on JMeter Variables like:
{
    "messageType": ${messageType},
    "customerId": ${customerId},
    "name": "${name}",
    "phone": "${phone}",
    "postcode": "${postcode}",
    "emailAddress": "${emailAddress}",
    "jobDescription": "${jobDescription}",
    "companyIds": [${companyIds}]
}
Once done, you can put the values into a CSV file like:
messageType,customerId,name,phone,postcode,emailAddress,jobDescription,companyIds
1,5922429,Joe Bloggs,01234567890,PO6 3EN,joe.bloggs#example.com,do some stuff,893999
2,5922430,Jane Doe, 0987654321, P06 3EM,janedoe#example.com,do some other stuff,893998
and read the data using the CSV Data Set Config, so each virtual user will take the next line on each iteration and populate the body with the new values.
If you have 20 different JSON files, you can use the Directory Listing Config plugin to load the file paths and the __FileToString() function to read each file's data from the file system.
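With the second option, the Body Data can be reduced to a single function call (assuming the Directory Listing Config destination variable is named jsonFile):

${__FileToString(${jsonFile},,)}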

How to get structured JSON using KSQL?

There are two tables in the database, for example Layouts(Id, VenueId, Description) and Venues(Id, Adress, Phone). The Layouts table has a foreign key to Venues.
There are also two Kafka topics corresponding to these tables. How can I send JSON like this into my output topic:
{
    "Id": "1",
    "Description": "LayoutDesc",
    "Venue": {
        "Id": 5,
        "Adress": "VenueAdress",
        "Description": "VenueDesc"
    }
}
As of 5.1.1 you can't construct nested objects in KSQL (you can only read them). There is an open issue for this, please do upvote/comment on it: https://github.com/confluentinc/ksql/issues/2147
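What is possible in 5.1 is a flat (non-nested) join. A sketch, assuming LAYOUTS has been registered as a stream and VENUES as a table (keyed by Id) over their respective topics:

CREATE STREAM LAYOUTS_ENRICHED AS
    SELECT L.ID, L.DESCRIPTION,
           V.ID AS VENUE_ID, V.ADRESS AS VENUE_ADRESS, V.PHONE AS VENUE_PHONE
    FROM LAYOUTS L
    LEFT JOIN VENUES V ON L.VENUEID = V.ID;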