Extracting JSON in Scala

I have the following data structure:
val jsonStr = """
{
"data1": {
"field1": "data1",
"field2": 1.0,
"field3": true
},
"data211": {
"field1": "data211",
"field2": 4343.0,
"field3": false
},
"data344": {
"field1": "data344",
"field2": 436778.51,
"field3": true
},
"data41": {
"field1": "data41",
"field2": 14348.0,
"field3": true
}
}
"""
I want to extract it. Here is what I've tried, without any luck:
#1.
case class Fields(field1: String, field2: Double, field3: Boolean)
json.extract[Map[String, Map[Fields, String]]]
//org.json4s.package$MappingException: Do not know how to convert JBool(true)
//into class java.lang.String
#2.
json.extract[Map[String, Map[String, Fields]]]
//java.lang.InternalError: Malformed class name
#3.
json.extract[Map[String, Map[String, Any]]]
//org.json4s.package$MappingException: No information known about type
#4.
json.extract[Map[String, Map[String, String]]]
//org.json4s.package$MappingException: Do not know
//how to convert JBool(true) into class java.lang.String
How do I do that then?
P.S. -- actually, that's https://github.com/json4s/json4s, but it doesn't really matter since lift-json has the same API regarding JSON extraction.
UPDATE: It will probably require using the transform method. How would I use it?
val json = parse(jsonStr) transform {
case //.... what should be here to catch JBool -- "field3"?
}
UPDATE2:
#5
json.extract[Map[String, Map[String, JValue]]]
// Works! but it's not what I'm looking for, I need to use a pure Java/Scala type

scala> val jsonStr = """
| {
| "data1": {
| "field1": "data1",
| "field2": 1.0,
| "field3": true
| },
| "data211": {
| "field1": "data211",
| "field2": 4343.0,
| "field3": false
| },
| "data344": {
| "field1": "data344",
| "field2": 436778.51,
| "field3": true
| },
| "data41": {
| "field1": "data41",
| "field2": 14348.0,
| "field3": true
| }
| }
| """
jsonStr: java.lang.String = ...
scala> import net.liftweb.json._
import net.liftweb.json._
scala> implicit val formats = DefaultFormats
formats: net.liftweb.json.DefaultFormats.type = net.liftweb.json.DefaultFormats$@361ee3df
scala> val json = parse(jsonStr)
json: net.liftweb.json.package.JValue = JObject(List(JField(data1,JObject(List(JField(field1,JString(data1)), JField(field2,JDouble(1.0)), JField(field3,JBool(true))))), JField(data211,JObject(List(JField(field1,JString(data211)), JField(field2,JDouble(4343.0)), JField(field3,JBool(false))))), JField(data344,JObject(List(JField(field1,JString(data344)), JField(field2,JDouble(436778.51)), JField(field3,JBool(true))))), JField(data41,JObject(List(JField(field1,JString(data41)), JField(field2,JDouble(14348.0)), JField(field3,JBool(true)))))))
scala> case class Fields(field1: String, field2: Double, field3: Boolean)
defined class Fields
scala> json.extract[Map[String, Fields]]
res1: Map[String,Fields] = Map(data1 -> Fields(data1,1.0,true), data211 -> Fields(data211,4343.0,false), data344 -> Fields(data344,436778.51,true), data41 -> Fields(data41,14348.0,true))

Related

How to code a json tree pruning in Python3?

The structure of the JSON tree is known. However, how do we prune the JSON tree in Python 3?
I have been trying to create a medical file format for patients. Each JSON object is a case or detail about a patient.
I tried linearizing the JSON and counting the levels, but the code quickly gets untenable. I also looked at binary trees, but this is not a binary tree. I attempted to treat each JSON object as an atom, which would amount to a form of pointer; however, Python does not have pointers.
Examples:
insert / replace json into 0.1.2
delete json at 0.1.1.3
extract json at 0.1.1.1 // may be sub-tree
{ // 0
"field1": "value1",
"field2": { // 0.0
"field3": "val3",
"field4": "val4"
}
}
For example, I want to remove 0.0:
{ // 0
"field1": "value1",
// removed
}
to insert 0.1:
{ // 0
"field1": "value1",
"field2": { // 0.0
"field3": "val3",
"field4": "val4"
},
"field2x": { // 0.1
"field3x": "val3x",
"field4x": "val4x"
}
}
0.1 must be given:
"field2x": { // 0.1
"field3x": "val3x",
"field4x": "val4x"
}
now i want to insert 0.1.0:
"field2xx": { // 0.1.0
"field3xx": "val3xx",
"field4xx": "val4xx"
}
{ // 0
"field1": "value1",
"field2": { // 0.0
"field3": "val3",
"field4": "val4"
},
"field2x": { // 0.1
"field3x": "val3x",
"field4x": "val4x",
"field2xx": { // 0.1.0
"field3xx": "val3xx",
"field4xx": "val4xx"
}
}
}
now I want to extract 0.1, it should give me:
"field2x": { // 0.1
"field3x": "val3x",
"field4x": "val4x"
"field2xx": { // 0.1.0
"field3xx": "val3xx",
"field4xx": "val4xx"
}
}
leaving:
{ // 0
"field1": "value1",
"field2": { // 0.0
"field3": "val3",
"field4": "val4"
}
// removed 0.1
}
I would highly recommend not attempting to use indices to find a field in a dictionary. Given how JSON works, and its usual mapping to dictionaries/maps in a programming language, you generally cannot guarantee that the ordering of keys is preserved. Depending on your specific version it may work; you can check the documentation at https://docs.python.org/3.10/library/json.html#json.dump
If you really need this kind of access and these operations, then given a dictionary d you can find the key at index i using list(d.keys())[i], and its value using list(d.values())[i]. With that, you can parse your input parameters, crawl to the point in the document where you need to make your operation, and perform it.
Again, I highly, highly advise against this approach: use arrays instead of objects/dictionaries/maps if ordering is important. But if you really have no control over the input format, and you can guarantee that key ordering is preserved, then the above will work.
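For illustration, here is a minimal sketch of that crawl (the helper name get_by_path is mine, not from any library), assuming key order is preserved as it is in CPython 3.7+:

```python
import json

def get_by_path(obj, path):
    # Descend into the i-th value at each level, relying on
    # insertion-ordered dicts (guaranteed since Python 3.7;
    # json.loads preserves the document's key order).
    for i in path:
        obj = list(obj.values())[i]
    return obj

doc = json.loads('{"field1": "value1", "field2": {"field3": "val3", "field4": "val4"}}')
print(get_by_path(doc, [1, 0]))  # descends into "field2", then "field3": prints val3
```

The same walk with path[:-1] gets you the parent, at which point del / assignment implements remove and replace.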
json.load() and json.loads() in the standard library take an object_pairs_hook parameter that lets you create custom objects from the JSON source.
You want a dict that lets you access items by index as well as by key. So the strategy is to provide a mapping class that lets you access the items either way. Then provide that class as the object_pairs_hook argument.
There is probably a library that does this, but my Google-fu is off this morning and I couldn't find one. So I whipped this up. Basically the class keeps an internal list of keys by index as well as a regular dict. The dunder methods keep the list and dict in sync.
import json
from collections.abc import MutableMapping

class IndexableDict(MutableMapping):
    def __init__(self, *args, **kwds):
        self.key_from_index = []
        self.data = {}
        for key, value in args:
            self.__setitem__(key, value)
        for key, value in kwds.items():
            self.__setitem__(key, value)

    def __setitem__(self, key_or_index, value):
        if isinstance(key_or_index, (tuple, list)):
            # A sequence of keys/indices: crawl down to the parent, then set.
            obj = self
            for item in key_or_index[:-1]:
                obj = obj[item]
            obj[key_or_index[-1]] = value
        elif isinstance(key_or_index, int):
            key = self.key_from_index[key_or_index]
            self.data[key] = value
        elif isinstance(key_or_index, str):
            if key_or_index not in self.data:
                self.key_from_index.append(key_or_index)
            self.data[key_or_index] = value
        else:
            raise ValueError(f"Unknown type of key '{key_or_index}'")

    def __getitem__(self, key_or_index):
        if isinstance(key_or_index, (tuple, list)):
            obj = self
            for item in key_or_index:
                obj = obj[item]
            return obj
        elif isinstance(key_or_index, int):
            key = self.key_from_index[key_or_index]
            return self.data[key]
        elif isinstance(key_or_index, str):
            return self.data[key_or_index]
        else:
            raise ValueError(f"Unknown type of key '{key_or_index}'")

    def __delitem__(self, key_or_index):
        if isinstance(key_or_index, (tuple, list)):
            obj = self
            for item in key_or_index[:-1]:
                obj = obj[item]
            del obj[key_or_index[-1]]
        elif isinstance(key_or_index, int):
            key = self.key_from_index[key_or_index]
            del self.data[key]
            del self.key_from_index[key_or_index]
        elif isinstance(key_or_index, str):
            index = self.key_from_index.index(key_or_index)  # list.index, not .find
            del self.key_from_index[index]
            del self.data[key_or_index]
        else:
            raise ValueError(f"Unknown type of key '{key_or_index}'")

    def __iter__(self):
        # Yield keys, as the MutableMapping contract expects.
        yield from self.data

    def __len__(self):
        return len(self.data)

    def __repr__(self):
        s = ', '.join(f'{k}={repr(v)}' for k, v in self.data.items())
        if len(s) > 50:
            s = s[:47] + '...'
        return f'<IndexableDict({s})>'
It can be used like this:
data = """{"zero":0, "one":{"a":1, "b":2}, "two":[3, 4, 5]}"""
def object_pairs_hook(pairs):
    return IndexableDict(*pairs)
dd = json.loads(data, object_pairs_hook=object_pairs_hook)
print(dd[0], dd['zero']) # get values by index or key
print(dd[(1,0)]) # get values by a list or tuple of keys
# equivalent to dd[1][0]
print(dd[(2,1)])
dd[['two', 1]] = 42 # sequence works to set a value too
print(dd[(2,1)])
Prints:
0 0
1
4
42
No time to do an insert(), but it should be similar to __setitem__(). It has not been tested much, so there may be some bugs. It could also use some refactoring.
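As a rough sketch of what such an insert() could build on (a standalone helper of my own naming, not a drop-in method for the class above): since dicts keep insertion order, you can splice a pair in at a given position by rebuilding the mapping.

```python
def dict_insert(d, index, key, value):
    # Return a new dict with (key, value) spliced in at position `index`,
    # relying on the insertion-order guarantee of Python dicts (3.7+).
    items = list(d.items())
    items.insert(index, (key, value))
    return dict(items)

d = dict_insert({"zero": 0, "two": 2}, 1, "one", 1)
print(list(d.keys()))  # ['zero', 'one', 'two']
```

An in-place method version would additionally have to update key_from_index at the same position.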
I second the people saying that indexing a dictionary by position is not the natural way, but it is possible since Python 3.7, where the insertion ordering of dicts became a guaranteed language feature.
This is my working example; the indices are different from your schematic, but it made more sense to me to index it like that. It recursively traverses the data by the given indices and then, depending on the operation, removes, inserts or returns the nested data.
The insertion of data makes use of the mentioned ordering by insertion in python.
data.update(dict(**insert, **after))
It leaves the data before the insertion as is (so it is older and thus staying in front)
Then it updates the inserted data
And last the data after the inserted data (making it the oldest and thus at the back).
from copy import deepcopy
import itertools
import json
def traverse(data, index_list):
    index = index_list.pop()
    if index_list:
        nested_data = list(data.values())[index]
        return traverse(nested_data, index_list)
    return data, index

def insert(data, data_insert, index_list):
    data, index = traverse(data, index_list)
    after = dict(itertools.islice(data.items(), index))
    data.update(dict(**data_insert, **after))

def remove(data, index_list):
    key, data = get(data, index_list)
    return {key: data.pop(key)}

def get(data, index_list):
    data, index = traverse(data, index_list)
    key = list(data.keys())[index]
    return key, data

def run_example(example_name, json_in, index_str, operation, data_insert=None):
    print("-" * 40 + f"\n{example_name}")
    print(f"json before {operation} at {index_str}:")
    print(json.dumps(json_in, indent=2, sort_keys=False))
    index_list = [int(idx_char) for idx_char in index_str.split(".")]
    if operation == "insert":
        json_out = insert(json_in, data_insert, index_list)
    elif operation == "remove":
        json_out = remove(json_in, index_list)
    elif operation == "get":
        key, data = get(json_in, index_list)
        json_out = {key: data[key]}
    else:
        raise NotImplementedError("Not a valid operation")
    print("json after:")
    print(json.dumps(json_in, indent=2, sort_keys=False))
    print("json returned:")
    print(json.dumps(json_out, indent=2, sort_keys=False))

json_data = {
    "field1": "value1",
    "field2": {
        "field3": "val3",
        "field4": "val4"
    }
}

run_example("example 1", deepcopy(json_data), "1", "remove")
run_example("example 2", json_data, "2", "insert", {"field2x": {"field3x": "val3x", "field4x": "val4x"}})
run_example("example 3", json_data, "2", "get")
run_example("example 4", json_data, "2.2", "insert", {"field2xx": {"field3xx": "val3xx", "field4xx": "val4xx"}})
run_example("example 5", json_data, "2", "remove")
This gives the following output:
----------------------------------------
example 1
json before remove at 1:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
}
}
json after:
{
"field1": "value1"
}
json returned:
{
"field2": {
"field3": "val3",
"field4": "val4"
}
}
----------------------------------------
example 2
json before insert at 2:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
}
}
json after:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
},
"field2x": {
"field3x": "val3x",
"field4x": "val4x"
}
}
json returned:
null
----------------------------------------
example 3
json before get at 2:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
},
"field2x": {
"field3x": "val3x",
"field4x": "val4x"
}
}
json after:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
},
"field2x": {
"field3x": "val3x",
"field4x": "val4x"
}
}
json returned:
{
"field2x": {
"field3x": "val3x",
"field4x": "val4x"
}
}
----------------------------------------
example 4
json before insert at 2.2:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
},
"field2x": {
"field3x": "val3x",
"field4x": "val4x"
}
}
json after:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
},
"field2x": {
"field3x": "val3x",
"field4x": "val4x",
"field2xx": {
"field3xx": "val3xx",
"field4xx": "val4xx"
}
}
}
json returned:
null
----------------------------------------
example 5
json before remove at 2:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
},
"field2x": {
"field3x": "val3x",
"field4x": "val4x",
"field2xx": {
"field3xx": "val3xx",
"field4xx": "val4xx"
}
}
}
json after:
{
"field1": "value1",
"field2": {
"field3": "val3",
"field4": "val4"
}
}
json returned:
{
"field2x": {
"field3x": "val3x",
"field4x": "val4x",
"field2xx": {
"field3xx": "val3xx",
"field4xx": "val4xx"
}
}
}

Scala manipulate json object

I have a dynamic JSON object generated with a certain format, and I would like to manipulate that object to map it to another format in Scala.
The problem is that the names of the fields are dynamic so "field1" and "field2" can be anything.
Is there any way to do it dynamically in scala?
The original object:
{
"field1": {
"value" "some value 1",
"description": "some test description",
...
},
"field2": {
"value" "some value 2",
"description": "some test description",
...
}
}
And I'd like to convert it to something like:
{
"field1": "some value 1",
"field2": "some value 2"
}
You can collect all the keys and then check whether downField("value") exists:
import io.circe._
import io.circe.literal.JsonStringContext
object CirceFieldsToMap {
  def main(args: Array[String]): Unit = {
    val json: Json =
      json"""{
        "field1": {
          "foo": "bar1",
          "value": "foobar1"
        },
        "field2": {
          "foo": "bar2",
          "value": "foobar2"
        },
        "field3": {
          "foo": "bar2"
        }
      }"""

    implicit val decodeFoo: Decoder[Map[String, Option[String]]] = new Decoder[Map[String, Option[String]]] {
      final def apply(c: HCursor): Decoder.Result[Map[String, Option[String]]] = {
        val result = c.keys.get // NB: handle this .get properly; it throws if the cursor has no keys
          .toList
          .foldLeft(List.empty[(String, Option[String])]) { case (acc, key) =>
            acc :+ (key, c.downField(key).downField("value").as[String].toOption)
          }
        Right(result.toMap)
      }
    }

    val myField01 = json.as[Map[String, Option[String]]]
    println(myField01) // Right(Map(field1 -> Some(foobar1), field2 -> Some(foobar2), field3 -> None))
  }
}

Nested Json extract the value with unknown key in the middle

I have a JSON column (colJson) in a DataFrame like this:
{
"a": "value1",
"b": "value1",
"c": true,
"details": {
"qgiejfkfk123": { //unknown value
"model1": {
"score": 0.531,
"version": "v1"
},
"model2": {
"score": 0.840,
"version": "v2"
},
"other_details": {
"decision": false,
"version": "v1"
}
}
}
}
Here 'qgiejfkfk123' is a dynamic value that changes with each row. However, I need to extract model1.score as well as model2.score.
I tried
sourceDf.withColumn("model1_score",get_json_object(col("colJson"), "$.details.*.model1.score").cast(DoubleType))
.withColumn("model2_score",get_json_object(col("colJson"), "$.details.*.model2.score").cast(DoubleType))
but it did not work.
I managed to get your solution by using from_json, parsing the dynamic value as Map<String, Struct> and exploding the values from it:
import org.apache.spark.sql.functions.{col, explode, from_json, lit}
import spark.implicits._ // assumes a SparkSession named spark is in scope

val schema = "STRUCT<`details`: MAP<STRING, STRUCT<`model1`: STRUCT<`score`: DOUBLE, `version`: STRING>, `model2`: STRUCT<`score`: DOUBLE, `version`: STRING>, `other_details`: STRUCT<`decision`: BOOLEAN, `version`: STRING>>>>"
val fromJsonDf = sourceDf.withColumn("colJson", from_json(col("colJson"), lit(schema)))
val explodeDf = fromJsonDf.select($"*", explode(col("colJson.details")))
// +----------------------------------------------------------+------------+--------------------------------------+
// |colJson |key |value |
// +----------------------------------------------------------+------------+--------------------------------------+
// |{{qgiejfkfk123 -> {{0.531, v1}, {0.84, v2}, {false, v1}}}}|qgiejfkfk123|{{0.531, v1}, {0.84, v2}, {false, v1}}|
// +----------------------------------------------------------+------------+--------------------------------------+
val finalDf = explodeDf.select(col("value.model1.score").as("model1_score"), col("value.model2.score").as("model2_score"))
// +------------+------------+
// |model1_score|model2_score|
// +------------+------------+
// | 0.531| 0.84|
// +------------+------------+

Scala, Circe, Json - how to remove parent node from json?

I have a json structure like this:
"data" : {
"fields": {
"field1": "value1",
"field2": "value2"
}
}
Now I would like to remove fields node and keep data in data:
"data" : {
"field1": "value1",
"field2": "value2"
}
I tried to do it like this:
val result = data.hcursor.downField("fields").as[JsonObject].toOption.head.toString
but I got a strange result instead of just the JSON as a string.
I also tried:
val result = data.hcursor.downField("fields").top.head.toString
but it was the same as:
val result = data.toString
and it includes fields.
How should I change my code to remove the fields node and keep its contents under the data property?
Here is a full working solution that traverses the JSON, extracts the fields, removes them and then merges them under data:
import io.circe.Json
import io.circe.parser._
val s =
"""
|{
|"data": {
| "fields": {
| "field1": "value1",
| "field2": "value2"
| }
|}
|}
|""".stripMargin
val modifiedJson =
  for {
    json <- parse(s)
    fields <- json.hcursor
      .downField("data")
      .downField("fields")
      .as[Json]
    modifiedRoot <- json.hcursor
      .downField("data")
      .downField("fields")
      .delete
      .root
      .as[Json]
    res <- modifiedRoot.hcursor
      .downField("data")
      .withFocus(_.deepMerge(fields))
      .root
      .as[Json]
  } yield res
Yields:
Right({
"data" : {
"field1" : "value1",
"field2" : "value2"
}
})

Json object to map

Hello, I would like to transform a JSON object into a map using an implicit Reads.
With the code below I run into a StackOverflowError; can anyone see what the problem is?
"pass": {
"key-1": {
"field1": "aaaa",
"field2": "aaaa"
},
"key-2": {
"field1": "aaaa",
"field2": "aaaa"
},
"key-3": {
"field1": "aaaa",
"field2": "aaaa"
}
}
case class Pass(field1: String, field2: String)
implicit val mapReads: Reads[Map[String, Pass]] = new Reads[Map[String, Pass]] {
  def reads(jv: JsValue): JsResult[Map[String, Pass]] =
    JsSuccess(jv.as[Map[String, Pass]].map {
      case (k, v) => k -> v.asInstanceOf[Pass]
    })
}
val passMap = (json \ "pass").validate[Map[String, Pass]]
Here is the stack error :
java.lang.StackOverflowError
at play.api.libs.json.JsReadable$class.as(JsReadable.scala:23)
at play.api.libs.json.JsObject.as(JsValue.scala:124)
at com.MyHelper$$anon$1.reads(MyHelper.scala:51)
at play.api.libs.json.Format$$anon$3.reads(Format.scala:65)
at play.api.libs.json.JsValue$class.validate(JsValue.scala:17)
at play.api.libs.json.JsObject.validate(JsValue.scala:124)
at play.api.libs.json.JsReadable$class.as(JsReadable.scala:23)
at play.api.libs.json.JsObject.as(JsValue.scala:124)
Your reads calls jv.as[Map[String, Pass]], which resolves to the very same implicit Reads it is defining, so it recurses until the stack overflows. It may be easier to create a MapPass case class and use Json.format to do the work for you:
import play.api.libs.json._
val a: String = """{
"pass": {
"key-1": {
"field1": "aaaa",
"field2": "aaaa"
},
"key-2": {
"field1": "aaaa",
"field2": "aaaa"
},
"key-3": {
"field1": "aaaa",
"field2": "aaaa"
}
}
}"""
case class Pass(field1: String, field2: String)
case class MapPass(pass: Map[String, Pass])
implicit val passFormat: Format[Pass] = Json.format[Pass]
implicit val mapPassFormat: Format[MapPass] = Json.format[MapPass]
val json = Json.parse(a)
val mapPassJsResult = json.validate[MapPass]
val mapPass = mapPassJsResult.get
print(mapPass.pass.mkString("\n"))
It worked like that for me: