Parsing nested JSON using Scala

I am looking to ingest telemetry data, and the output is a multi-layered nested JSON file. I am interested in a few specific fields, but I am not able to parse the JSON file to get to the data.
Data Sample:
{ "version_str": "1.0.0", "node_id_str": "router-01", "encoding_path":
"sys/intf", "collection_id": 241466, "collection_start_time": 0,
"collection_end_time": 0, "msg_timestamp": 0, "subscription_id": [ ],
"sensor_group_id": [ ], "data_source": "DME", "data": {
"interfaceEntity": { "attributes": { "childAction": "", "descr": "",
"dn": "sys/intf", "modTs": "2017-09-19T13:24:14.751+00:00",
"monPolDn": "uni/fabric/monfab-default", "persistentOnReload": "true",
"status": "" }, "children": [ { "l3LbRtdIf": { "attributes": {
"adminSt": "up", "childAction": "", "descr": "Nothing", "id":
"lo103", "linkLog": "default", "modTs":
"2017-11-06T23:18:02.974+00:00", "monPolDn":
"uni/fabric/monfab-default", "name": "", "persistentOnReload": "true",
"rn": "lb-[lo103]", "status": "", "uid": "0" }, "children": [ {
"ethpmLbRtdIf": { "attributes": { "currErrIndex": "4294967295",
"ifIndex": "335544423", "iod": "14", "lastErrors": "0,0,0,0",
"operBitset": "", "operDescr": "Nothing", "operMtu": "1500",
"operSt": "up", "operStQual": "none", "rn": "lbrtdif" } } }, {
"nwRtVrfMbr": { "attributes": { "childAction": "", "l3vmCfgFailedBmp":
"", "l3vmCfgFailedTs": "00:00:00:00.000", "l3vmCfgState": "0",
"modTs": "2017-11-06T23:18:02.945+00:00", "monPolDn": "",
"parentSKey": "unspecified", "persistentOnReload": "true", "rn":
"rtvrfMbr", "status": "", "tCl": "l3Inst", "tDn": "sys/inst-default",
"tSKey": "" } } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt":
"up", "childAction": "", "descr": "Nothing", "id": "lo104",
"linkLog": "default", "modTs": "2018-01-25T15:54:20.367+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo104]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544424", "iod": "77",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"Nothing", "operMtu": "1500", "operSt": "up", "operStQual":
"none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr": { "attributes": {
"childAction": "", "l3vmCfgFailedBmp": "", "l3vmCfgFailedTs":
"00:00:00:00.000", "l3vmCfgState": "0", "modTs":
"2018-01-25T15:53:55.757+00:00", "monPolDn": "", "parentSKey":
"unspecified", "persistentOnReload": "true", "rn": "rtvrfMbr",
"status": "", "tCl": "l3Inst", "tDn": "sys/inst-default", "tSKey": ""
} } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt": "up",
"childAction": "", "descr": "Nothing", "id": "lo101",
"linkLog": "default", "modTs": "2017-11-13T21:39:58.910+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo101]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544421", "iod": "12",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"Nothing", "operMtu": "1500", "operSt": "up", "operStQual":
"none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr": { "attributes": {
"childAction": "", "l3vmCfgFailedBmp": "", "l3vmCfgFailedTs":
"00:00:00:00.000", "l3vmCfgState": "0", "modTs":
"2017-11-13T21:39:58.880+00:00", "monPolDn": "", "parentSKey":
"unspecified", "persistentOnReload": "true", "rn": "rtvrfMbr",
"status": "", "tCl": "l3Inst", "tDn": "sys/inst-default", "tSKey": ""
} } } ] } }, { "l3LbRtdIf": { "attributes": { "adminSt": "up",
"childAction": "", "descr": "\"^:tier2:if:loopback:mgmt:l3\"", "id":
"lo0", "linkLog": "default", "modTs": "2017-09-25T20:29:54.003+00:00",
"monPolDn": "uni/fabric/monfab-default", "name": "",
"persistentOnReload": "true", "rn": "lb-[lo0]", "status": "", "uid":
"0" }, "children": [ { "ethpmLbRtdIf": { "attributes": {
"currErrIndex": "4294967295", "ifIndex": "335544320", "iod": "11",
"lastErrors": "0,0,0,0", "operBitset": "", "operDescr":
"\"^:tier2:if:loopback:mgmt:l3\"", "operMtu": "1500", "operSt": "up",
"operStQual": "none", "rn": "lbrtdif" } } }, { "nwRtVrfMbr":...
I am interested in these attributes:
| | | | | | | |-- rmonIfIn: struct (nullable = true)
| | | | | | | | |-- attributes: struct (nullable = true)
| | | | | | | | | |-- broadcastPkts: string (nullable = true)
| | | | | | | | | |-- discards: string (nullable = true)
| | | | | | | | | |-- errors: string (nullable = true)
| | | | | | | | | |-- multicastPkts: string (nullable = true)
| | | | | | | | | |-- nUcastPkts: string (nullable = true)
| | | | | | | | | |-- packetRate: string (nullable = true)
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = spark.read.json("file:///usr/local/Projects/out.txt")
val mapDF = df.select($"node_id_str" as "nodename", $"data".getItem("InterfaceEntity").getItem("children").getItem("l1PhysIf").getItem("children").getItem("element"))
Whenever I attempt to go any deeper, I get a data type error:
stringJsonDF: org.apache.spark.sql.DataFrame = [nestDevice: string]
org.apache.spark.sql.AnalysisException: cannot resolve '`data`.`InterfaceEntity`.`children`.`l1PhysIf`.`children`['element']' due to data type mismatch: argument 2 requires integral type, however, ''element'' is of string type.;;

You can use the Google Gson library to work with JSON: it can convert any object to JSON and, of course, do the reverse. Here is an example:
Gson gson = new Gson();
List<Map<Long, String>> listOfMaps = new ArrayList<>();
// here you can create some maps and add them to listOfMaps
String listOfMapsInJsonFormat = gson.toJson(listOfMaps);
The sample code above converts an object to JSON. To do the reverse, you can check the one below too:
Gson gson = new Gson();
List list = gson.fromJson(listOfMapsInJsonFormat, List.class);
The code above will turn your input JSON string into a list containing maps. Of course, the map type Gson builds from the JSON string may differ from the map type you had before converting the original object to JSON. To avoid that, you can use the TypeToken class:
Gson gson = new Gson();
// requires: import java.lang.reflect.Type; import com.google.gson.reflect.TypeToken;
Type type = new TypeToken<ArrayList<Map<Long, String>>>() {}.getType();
ArrayList<Map<Long, String>> listOfMaps = gson.fromJson(listOfMapsInJsonFormat, type);

Since the fields are part of multiple nested arrays, the logic assumes that you are interested in every occurrence of those fields per record (so if one record contains n rmonIfIn items due to the nested arrays, you want each of them?).
If so, it makes sense to explode these nested arrays and process the expanded dataframe.
Based on your code and the incomplete JSON example, it could look something like this:
val nested = df
.select(explode($"data.InterfaceEntity").alias("l1"))
.select(explode($"l1.l1PhysIf").alias("l2"))
.select($"l2.rmonIfIn.attributes".alias("l3"))
.select($"l3.broadcastPkts", $"l3.discards", $"l3.errors", $"l3.multicastPkts", $"l3.packetRate")
Returning a dataframe that could look like this:
+-------------+--------+------+-------------+----------+
|broadcastPkts|discards|errors|multicastPkts|packetRate|
+-------------+--------+------+-------------+----------+
|1 |1 |1 |1 |1 |
|2 |2 |2 |2 |2 |
|3 |3 |3 |3 |3 |
|4 |4 |4 |4 |4 |
+-------------+--------+------+-------------+----------+
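In the posted sample, the arrays actually sit at data.interfaceEntity.children and, one level down, at l3LbRtdIf.children, so a variant matched to that layout might look like the sketch below. The rmonIfIn branch is not visible in the excerpt, so the leaf fields used here (ethpmLbRtdIf's operSt and operMtu) are stand-ins; check df.printSchema() for the exact path to rmonIfIn.attributes:
val flattened = df
  .select($"node_id_str".as("nodename"), explode($"data.interfaceEntity.children").as("c1"))
  .select($"nodename", explode($"c1.l3LbRtdIf.children").as("c2"))
  .select(
    $"nodename",
    $"c2.ethpmLbRtdIf.attributes.operSt",  // stand-in: swap in the rmonIfIn.attributes.* fields
    $"c2.ethpmLbRtdIf.attributes.operMtu") // once you confirm where that struct lives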

Related

JQ using an ETL, to create an inner array, in a JSON Array

I need to update an inner array with an ETL;
I want to create a new property in each element of an array in the JSON tree.
Something like
def insideETL(keyname; arrayname; cond; result):
def etl:
. as $parent
| .[arrayname][]
| { parent: $parent, child: .}
| select(cond) | result;
map ( .
+
{(keyname): map(etl)}
)
;
From a previous question, I had almost the needed result, but I need to create more than one array in each item of the main JSON array.
Data to filter
[
{
"storeId": "s2",
"storehouseInfo": {
"id": "025453",
"name": "00211 NW, OR",
"maxPallets": 10
},
"workorder":{
"id": "w2s2",
"startDate": "2019-09-06T10:00:00.000Z",
"vendorId":"v2"
},
"events": [
{
"id": "e4",
"storeId": "s2",
"vendorId": "v1",
"startDate": "2019-09-05T10:00:00.000Z",
"endDate": "2019-09-14T00:00:00.000Z",
"palletsUsed": 5
},
{
"id": "e5",
"storeId": "s2",
"vendorId": "v2",
"startDate": "2019-09-05T00:00:00.000Z",
"endDate": "2019-09-14T00:00:00.000Z",
"palletsUsed": 5
},
{
"id": "e10",
"storeId": "s2",
"vendorId": "v1",
"startDate": "2019-09-06T10:00:00.000Z",
"endDate": "2019-09-14T00:00:00.000Z",
"palletsUsed": 5
},
{
"id": "e11",
"storeId": "s2",
"vendorId": "v2",
"startDate": "2019-09-06T00:00:00.000Z",
"endDate": "2019-09-14T00:00:00.000Z",
"palletsUsed": 5
},
{
"id": "e12",
"storeId": "s2",
"vendorId": "v2",
"startDate": "2019-09-06T10:00:00.000Z",
"endDate": "2019-09-14T00:00:00.000Z",
"palletsUsed": 5
}
]
}
]
Desired invocation
.|
insideETL("conflictsInPeriod";
"events";
( (.parent.workorder.startDate | dateDaysAgo(12*7) ) < .child.endDate)
and
(.child.vendorId == .parent.workorder.vendorId);
{
event: .child.id,
wo_sd: .parent.workorder.startDate[:10],
workorder_id: .parent.workorder.id
}
)
Desired output
[
{
// our newly added array
"conflictsInPeriod":[
{
"event":"e5",
"wo_sd":"2019-09-06",
"workorder_id":"w2s2"
},
{
"event":"e11",
"wo_sd":"2019-09-06",
"workorder_id":"w2s2"
},
{
"event":"e12",
"wo_sd":"2019-09-06",
"workorder_id":"w2s2"
}
],
// all the other previous information in the Item
"storeId":"s2",
"storehouseInfo":{
"id":"025453",
"name":"00211 NW, OR",
"maxPallets":10
},
"workorder":{
"id":"w2s2",
"startDate":"2019-09-06T10:00:00.000Z",
"vendorId":"v2"
},
"events":[
// ... All the events data
]
}
]
I hope this is clear; if any clarification is needed, please comment.
Rather than tying yourself up in knots, I think it would be better to build on the reusable component that has already been developed and tested:
def etl(keyname; arrayname; cond; result):
def etl:
. as $parent
| .[arrayname][]
| { parent: $parent, child: .}
| select(cond) | result;
{(keyname): map(etl)}
;
Here's one way to do so:
def add(arrayname):
etl(arrayname;
"events";
( (.parent.workorder.startDate | dateDaysAgo(12*7) ) < .child.endDate)
and
(.child.vendorId == .parent.workorder.vendorId);
{
event: .child.id,
wo_sd: .parent.workorder.startDate[:10],
workorder_id: .parent.workorder.id
}
)
;
[add("conflictsInPeriod") + .[]]
With etl you have a reusable component that allows numerous variations while keeping things simple.

How to Bulk Upload Complex JSON to MySQL

I have a JSON file that I am trying to bulk upload to MySQL. The file is around 50 GB. Is there a simple method to get all of the data into MySQL? I tried watching YouTube videos on how to do this, but all of the tutorials were for very simple JSON data without nesting like this. Any help would be amazing. Here is a sample so you can see the structure:
{
"PatentData": [
{
"patentCaseMetadata": {
"applicationNumberText": {
"value": "16315092",
"electronicText": "16315092"
},
"filingDate": "2019-07-03",
"applicationTypeCategory": "Utility",
"partyBag": {
"applicantBagOrInventorBagOrOwnerBag": [
{
"applicant": [
{
"contactOrPublicationContact": [
{
"name": { "personNameOrOrganizationNameOrEntityName": [ { "organizationStandardName": { "content": [ "SEB S.A." ] } } ] },
"cityName": "ECULLY",
"geographicRegionName": {
"value": "",
"geographicRegionCategory": "STATE"
},
"countryCode": "FR"
}
]
}
]
},
{
"inventorOrDeceasedInventor": [
{
"contactOrPublicationContact": [
{
"name": {
"personNameOrOrganizationNameOrEntityName": [
{
"personStructuredName": {
"firstName": "Johan",
"middleName": "",
"lastName": "SABATTIER"
}
}
]
},
"cityName": "Mornant",
"geographicRegionName": {
"value": "",
"geographicRegionCategory": "STATE"
The end goal is to have the data from the JSON file in a MySQL database in the following format:
Name | Address   | State | Country | ... | Abstract
Tim  | 23 North  | TX    | US      | ... | The tissue...
Tom  | 33 North  | TX    | US      | ... | The engineer...
Kim  | 78 North  | TX    | US      | ... | The lung...
Bob  | 123 North | TX    | US      | ... | The tissue...
Rob  | 93 North  | TX    | US      | ... | The scope...
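One pattern that scales to a file this size is the Spark approach from the question above: read the JSON, flatten the columns you need, and write to MySQL over JDBC. A minimal sketch, assuming the field paths from the sample, a hypothetical file path, connection URL, and table name, and the MySQL JDBC driver on the classpath:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// multiLine is needed when the file is pretty-printed rather than line-delimited
val df = spark.read.option("multiLine", "true").json("file:///data/patents.json") // hypothetical path

// one row per patent record; extend the select with the other columns you need
val flat = df
  .select(explode($"PatentData").as("p"))
  .select(
    $"p.patentCaseMetadata.applicationNumberText.value".as("application_number"),
    $"p.patentCaseMetadata.filingDate".as("filing_date"),
    $"p.patentCaseMetadata.applicationTypeCategory".as("application_type"))

flat.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/patents") // hypothetical database
  .option("dbtable", "patent_case")                     // hypothetical table
  .option("user", "user")
  .option("password", "password")
  .mode("append")
  .save()
Note that a single large multiline JSON file cannot be split across executors, so it can help to break the input into smaller files first.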

How do I structure JSON? [closed]

I have this data sample that I need to put into a JSON format.
What's the best way/structure to do that? If it helps, I'll be developing an Angular product selection tool for this.
Item 1: Federation Phaser
Options:
| FORM FACTOR | PRICE |
| Compact | $545 |
| Pistol Grip | $600 |
Item 2: Sith Lightsaber
Options:
| BLADE COLOR | BLADE COUNT | PRICE |
| Red | Single | $1000 |
| Red | Double | $1750 |
| Blue | Single | $1125 |
| Blue | Double | $1875 |
| Green | Single | $1250 |
JSON objects are formed by name/value pairs and surrounded by curly braces {}. The name/value pairs are separated by commas, and the values themselves can be JSON objects or arrays.
Example 1 (Simple):
{
"fruit1": "apple",
"fruit2": "pear"
}
Example 2 (more complex):
{
"fruitBasket1": { "fruit1": "apple", "fruit2": "pear"},
"fruitBasket2": { "fruit1": "grape", "fruit2": "orange"}
}
For your example, you could construct the JSON as follows with an array:
{
"item": {
"name": "Federation Phaser",
"options": [
{
"form": "compact",
"price": "$545"
},
{
"form": "Pistol Grip",
"price": "$600"
}
]
},
"item2": {
"name": "Sith Lightsaber",
"options": [
{
"bladeColor": "red",
"count": "single",
"price": "$1000"
},
{
"bladeColor": "blue",
"count": "double",
"price": "$1875"
}
]
}
}
If you want to have a variable number of "items" you could put them into an array too. For example:
{
"items": [
{
"name": "Federation Phaser",
"options": [{
"form": "compact",
"price": "$545"
},
{
"form": "Pistol Grip",
"price": "$600"
}
]
},
{
"name": "Sith Lightsaber",
"options": [{
"bladeColor": "red",
"count": "single",
"price": "$1000"
},
{
"bladeColor": "blue",
"count": "double",
"price": "$1875"
}
]
}
]
}
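Since the two items carry different option keys (form vs. bladeColor/count), optional fields are one way to model the union on the consuming side. A hypothetical Scala sketch of matching case classes (the names are illustrative, for use with a JSON library such as Jackson or circe):
case class ProductOption(
  form: Option[String],       // present for the phaser
  bladeColor: Option[String], // present for the lightsaber
  count: Option[String],      // present for the lightsaber
  price: String)

case class Item(name: String, options: List[ProductOption])

case class Catalog(items: List[Item])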

Flatten Spark JSON Data Frame with nested children attributes having the same names

Being a noob in Scala/Spark, I am a bit stuck and would appreciate any help!
I am importing JSON data into a Spark DataFrame. In the process, I end up with a DataFrame having the same nested structure as the JSON input.
My aim is to flatten the entire DataFrame recursively (including the innermost child attributes in an array/dictionary), using Scala.
Additionally, there may be child attributes with the same names, so I need to differentiate them as well.
A somewhat similar solution (same child attributes for different parents) is shown here - https://stackoverflow.com/a/38460312/3228300
An example of what I am hoping to achieve is as follows:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}
The corresponding flattened output Spark DF structure would be:
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters_batter_id_0": "1001",
"batters_batter_type_0": "Regular",
"batters_batter_id_1": "1002",
"batters_batter_type_1": "Chocolate",
"batters_batter_id_2": "1003",
"batters_batter_type_2": "Blueberry",
"batters_batter_id_3": "1004",
"batters_batter_type_3": "Devil's Food",
"topping_id_0": "5001",
"topping_type_0": "None",
"topping_id_1": "5002",
"topping_type_1": "Glazed",
"topping_id_2": "5005",
"topping_type_2": "Sugar",
"topping_id_3": "5007",
"topping_type_3": "Powdered Sugar",
"topping_id_4": "5006",
"topping_type_4": "Chocolate with Sprinkles",
"topping_id_5": "5003",
"topping_type_5": "Chocolate",
"topping_id_6": "5004",
"topping_type_6": "Maple"
}
Not having worked much with Scala and Spark previously, I am unsure how to proceed.
Lastly, I would be extremely thankful if someone could help with the code for a general, schema-agnostic solution, as I need to apply it to a lot of different collections.
Thanks a lot :)
Here is one possibility, the approach we took in one of our projects:
Define a case class that maps a row of the dataframe:
case class BattersTopics(id: String, `type`: String, ..., batters_batter_id_0: String, ..., topping_id_0: String)
Map each row of the dataframe to the case class:
val dataSet = df.map(row => BattersTopics(id = row.getAs[String]("id"), ...,
  batters_batter_id_0 = row.getAs[String]("batters_batter_id_0"), ...))
Collect to a list and build a Map[String, Any] from each row:
val rows = dataSet.collect().toList
val maps = rows.map(bt => Map(
  "id" -> bt.id,
  "type" -> bt.`type`,
  "batters" -> Map(
    "batter" -> List(Map("id" -> bt.batters_batter_id_0, "type" ->
      bt.batters_batter_type_0), ...)), // same for the other ids and types
  "topping" -> List(Map("id" -> bt.topping_id_0, "type" -> bt.topping_type_0), ...) // same for the other ids and types
))
Use Jackson to convert the Map[String, Any] to JSON.
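A minimal sketch of that Jackson step, assuming jackson-module-scala is on the classpath and maps is the List[Map[String, Any]] built above:
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
val json = mapper.writeValueAsString(maps) // serialize the list of maps to a JSON string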
Sample data, which contains all the different types of JSON elements (nested JSON map, JSON array, long, String, etc.):
{"name":"Akash","age":16,"watches":{"name":"Apple","models":["Apple Watch Series 5","Apple Watch Nike"]},"phones":[{"name":"Apple","models":["iphone X","iphone XR","iphone XS","iphone 11","iphone 11 Pro"]},{"name":"Samsung","models":["Galaxy Note10","Galaxy S10e","Galaxy S10"]},{"name":"Google","models":["Pixel 3","Pixel 3a"]}]}
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- phones: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- models: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |-- watches: struct (nullable = true)
 |    |-- models: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- name: string (nullable = true)
This is sample data which has ArrayType and StructType (map) values in the JSON data.
We can write a switch condition for each of those two types and repeat the process until it flattens out to the required DataFrame.
https://medium.com/#ajpatel.bigdata/flatten-json-data-with-apache-spark-java-api-5f6a8e37596b
Here is the Spark Java API solution.
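For a general, schema-agnostic version in Scala, here is a minimal sketch of the same idea, assuming structs should expand into parentName_childName columns; arrays would additionally need an explode step (e.g. posexplode, if you want the _0, _1 index suffixes from the desired output above):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col

def flattenStructs(df: DataFrame): DataFrame = {
  val hasStruct = df.schema.fields.exists(_.dataType.isInstanceOf[StructType])
  if (!hasStruct) df
  else {
    val cols = df.schema.fields.flatMap { f =>
      f.dataType match {
        case st: StructType =>
          // expand each struct field, disambiguating with a parentName_ prefix
          st.fieldNames.map(n => col(s"${f.name}.$n").alias(s"${f.name}_$n"))
        case _ => Seq(col(f.name))
      }
    }
    flattenStructs(df.select(cols: _*))
  }
}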

JSON JQ Find Replace Value

I am trying to update the "image_id" value in a JSON structure. Using the command below, how do I change ami-d8cf5cab to ami-a4df7gah? So far, I have tried this:
cat cog.test.tfstate | jq -r '.modules[].resources[] | select(.type == "aws_launch_configuration") | select(.primary.attributes.name_prefix == "pmsadmin-lc-")'
The JSON data is
{
"type": "aws_launch_configuration",
"primary": {
"id": "pmsadmin-lc-v47thk6rcrdgza6dujfzjatmju",
"attributes": {
"associate_public_ip_address": "false",
"ebs_block_device.#": "0",
"ebs_optimized": "false",
"enable_monitoring": "true",
"ephemeral_block_device.#": "0",
"iam_instance_profile": "cog-test-pmsadmin",
"id": "pmsadmin-lc-v47thk6rcrdgza6dujfzjatmju",
"image_id": "ami-d8cf5cab",
"instance_type": "t2.small",
"key_name": "cog-test-internal",
"name": "pmsadmin-lc-v47thk6rcrdgza6dujfzjatmju",
"name_prefix": "pmsadmin-lc-",
"root_block_device.#": "0",
"security_groups.#": "4",
"security_groups.1893851868": "sg-7ee7bf1a",
"security_groups.2774384192": "sg-e2e7bf86",
"security_groups.2825850029": "sg-86e6bee2",
"security_groups.3095009517": "sg-f4e7bf90",
"spot_price": "",
"user_data": "ed03ac6642af8c97562b065c0b37f211b58ad0a2"
}
}
}
Use the |= operator to update the property:
jq -r '.modules[].resources[] | select(.type == "aws_launch_configuration") | select(.primary.attributes.name_prefix == "pmsadmin-lc-") | .primary.attributes.image_id |= "ami-a4df7gah"'