Scanning a HUGE JSON file for deserializable data in Scala

I need to be able to process large JSON files, instantiating objects from deserializable sub-strings as we are iterating-over/streaming-in the file.
For example:
Let's say I can only deserialize into instances of the following:
case class Data(val a: Int, val b: Int, val c: Int)
and the expected JSON format is:
{ "foo": [ {"a": 0, "b": 0, "c": 0 }, {"a": 0, "b": 0, "c": 1 } ],
  "bar": [ {"a": 1, "b": 0, "c": 0 }, {"a": 1, "b": 0, "c": 1 } ],
  .... MANY ITEMS .... ,
  "qux": [ {"a": 0, "b": 0, "c": 0 } ] }
What I would like to do is:
import com.codahale.jerkson.Json
val dataSeq : Seq[Data] = Json.advanceToValue("foo").stream[Data](fileStream)
// NOTE: this will not compile since I pulled the "advanceToValue" out of thin air.
As a final note, I would prefer to find a solution that involves Jerkson or any other library that comes with the Play framework, but if another Scala library handles this scenario with greater ease and decent performance, I'm not opposed to trying it. If there is a clean way of manually seeking through the file and then using a JSON library to continue parsing from there, I'm fine with that.
What I do not want to do is ingest the entire file without streaming or using an iterator, as keeping the entire file in memory at a time would be prohibitively expensive.

I have not done it with JSON (and I hope someone will come up with a turnkey solution for you), but I have done it with XML, and here is a way of handling it.
It is basically a simple Map -> Reduce process with the help of a stream parser.
Map (your advanceTo)
Use a streaming parser like JSON Simple (not tested). When the callback matches your "path", collect anything below it by writing it to a stream (file-backed or in-memory, depending on your data). That will be the foo array in your example. If your mapper is sophisticated enough, you may want to collect multiple paths during the map step.
Reduce (your stream[Data])
Since the streams you collected above look pretty small, you probably do not need to map/split them again, and you can parse them directly in memory as JSON objects/arrays and manipulate them (transform, recombine, etc.).
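For the map step, here is a sketch of the advance-then-stream idea using Jackson's streaming parser directly (Jerkson wraps Jackson, so the dependency is likely already there; the use of jackson-module-scala and the exact call sequence are my assumption, untested):

import java.io.File
import com.fasterxml.jackson.core.JsonToken
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import scala.jdk.CollectionConverters._

case class Data(a: Int, b: Int, c: Int)

// Advance a streaming parser to the named field, then hand the parser to
// databind so each array element is deserialized one at a time.
def streamField(file: File, field: String): Iterator[Data] = {
  val mapper = new ObjectMapper().registerModule(DefaultScalaModule)
  val parser = mapper.getFactory.createParser(file)
  // Naive scan: stops at the first FIELD_NAME matching `field`, at any depth.
  while (parser.nextToken() != null &&
         !(parser.getCurrentToken == JsonToken.FIELD_NAME &&
           parser.getCurrentName == field)) ()
  parser.nextToken() // now at START_ARRAY
  parser.nextToken() // now at the first element
  mapper.readValues(parser, classOf[Data]).asScala
}

Only the tokens up to and inside the target array are ever materialized, so the rest of the huge file is never held in memory.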

Here is the current way I am solving the problem:
import collection.immutable.PagedSeq
import util.parsing.input.PagedSeqReader
import com.codahale.jerkson.Json
import collection.mutable
private def fileContent = new PagedSeqReader(PagedSeq.fromFile("/home/me/data.json"))
private val clearAndStop = ']'
private def takeUntil(readerInitial: PagedSeqReader, text: String): Taken = {
  val str = new StringBuilder()
  var readerFinal = readerInitial
  while (!readerFinal.atEnd && !str.endsWith(text)) {
    str += readerFinal.first
    readerFinal = readerFinal.rest
  }
  if (!str.endsWith(text) || str.contains(clearAndStop))
    Taken(readerFinal, None)
  else
    Taken(readerFinal, Some(str.toString))
}

private def takeUntil(readerInitial: PagedSeqReader, chars: Char*): Taken = {
  var taken = Taken(readerInitial, None)
  chars.foreach(ch => taken = takeUntil(taken.reader, ch.toString))
  taken
}
def getJsonData(): Seq[Data] = {
  val data = mutable.ListBuffer[Data]()
  var taken = takeUntil(fileContent, "\"foo\"")
  taken = takeUntil(taken.reader, ':', '[')
  var doneFirst = false
  while (taken.text != None) {
    if (!doneFirst)
      doneFirst = true
    else
      taken = takeUntil(taken.reader, ',')
    taken = takeUntil(taken.reader, '}')
    if (taken.text != None) {
      print(taken.text.get)
      data += Json.parse[Data](taken.text.get)
    }
  }
  data
}
case class Taken(reader: PagedSeqReader, text: Option[String])
case class Data(val a: Int, val b: Int, val c: Int)
Granted, this code doesn't exactly handle malformed JSON very cleanly, and using it for multiple top-level keys ("foo", "bar" and "qux") would require looking ahead (or matching from a list of possible top-level keys), but in general I believe this does the job. It's not quite as functional as I'd like and isn't super robust, but PagedSeqReader definitely keeps this from getting too messy.

How to unpack nested JSON into Python Dataclass

Dataclass example:
@dataclass
class StatusElement:
    status: str
    orderindex: int
    color: str
    type: str

@dataclass
class List:
    id: int
    statuses: List[StatusElement]
JSON example:
json = {
    "id": "124",
    "statuses": [
        {
            "status": "to do",
            "orderindex": 0,
            "color": "#d3d3d3",
            "type": "open"
        }
    ]
}
I can unpack the JSON doing something like this:
object = List(**json)
But I'm not sure how I can also unpack the statuses into status objects and append them to the statuses list of the List object. I'm sure I need to loop over it somehow, but I'm not sure how to combine that with unpacking.
Python dataclasses is a great module, but one of the things it unfortunately doesn't handle is parsing a JSON object into a nested dataclass structure.
A few workarounds exist for this:
You can roll your own JSON parsing helper method, for example a from_json which converts a JSON string to a List instance with nested dataclasses.
You can make use of existing JSON serialization libraries. For example, pydantic is a popular one that supports this use case.
Here is an example using the dataclass-wizard library that works well enough for your use case. It's more lightweight than pydantic and coincidentally also a little faster. It also supports automatic case transforms and type conversions (for example, str to annotated int).
Example below:
from dataclasses import dataclass
from typing import List as PyList

from dataclass_wizard import JSONWizard


@dataclass
class List(JSONWizard):
    id: int
    statuses: PyList['StatusElement']
    # on Python 3.9+ you can use the following syntax:
    #   statuses: list['StatusElement']


@dataclass
class StatusElement:
    status: str
    order_index: int
    color: str
    type: str


json = {
    "id": "124",
    "statuses": [
        {
            "status": "to do",
            "orderIndex": 0,
            "color": "#d3d3d3",
            "type": "open"
        }
    ]
}

object = List.from_dict(json)
print(repr(object))
# List(id=124, statuses=[StatusElement(status='to do', order_index=0, color='#d3d3d3', type='open')])
Disclaimer: I am the creator (and maintainer) of this library.
You can now skip the class inheritance as of the latest release of dataclass-wizard. It's straightforward enough to use; here is the same example from above, but with the JSONWizard usage removed completely. Just remember not to accidentally import asdict from the dataclasses module instead, even though I guess that would coincidentally work.
Here's the modified version of the above without class inheritance:
from dataclasses import dataclass
from typing import List as PyList

from dataclass_wizard import fromdict, asdict


@dataclass
class List:
    id: int
    statuses: PyList['StatusElement']


@dataclass
class StatusElement:
    status: str
    order_index: int
    color: str
    type: str


json = {
    "id": "124",
    "statuses": [
        {
            "status": "to do",
            "orderIndex": 0,
            "color": "#d3d3d3",
            "type": "open"
        }
    ]
}

# De-serialize the JSON dictionary into a `List` instance.
c = fromdict(List, json)
print(c)
# List(id=124, statuses=[StatusElement(status='to do', order_index=0, color='#d3d3d3', type='open')])

# Convert the instance back to a dictionary object that is JSON-serializable.
d = asdict(c)
print(d)
# {'id': 124, 'statuses': [{'status': 'to do', 'orderIndex': 0, 'color': '#d3d3d3', 'type': 'open'}]}
Also, here's a quick performance comparison with dacite. I wasn't aware of this library before, but it's also very easy to use (and there's likewise no need to inherit from any class). However, from my personal tests (on a Windows 10 Alienware PC with Python 3.9.1), dataclass-wizard seemed to perform much better overall on the de-serialization process.
from dataclasses import dataclass
from timeit import timeit
from typing import List

from dacite import from_dict
from dataclass_wizard import JSONWizard, fromdict

data = {
    "id": 124,
    "statuses": [
        {
            "status": "to do",
            "orderindex": 0,
            "color": "#d3d3d3",
            "type": "open"
        }
    ]
}


@dataclass
class StatusElement:
    status: str
    orderindex: int
    color: str
    type: str


@dataclass
class List:
    id: int
    statuses: List[StatusElement]


class ListWiz(List, JSONWizard):
    ...


n = 100_000

# 0.37
print('dataclass-wizard: ', timeit('ListWiz.from_dict(data)', number=n, globals=globals()))
# 0.36
print('dataclass-wizard (fromdict): ', timeit('fromdict(List, data)', number=n, globals=globals()))
# 11.2
print('dacite: ', timeit('from_dict(List, data)', number=n, globals=globals()))

lst_wiz1 = ListWiz.from_dict(data)
lst_wiz2 = fromdict(List, data)
lst = from_dict(List, data)

# True
assert lst.__dict__ == lst_wiz1.__dict__ == lst_wiz2.__dict__
A "cleaner" solution (in my eyes): use dacite. There's no need to inherit from anything.
from dataclasses import dataclass
from typing import List

from dacite import from_dict

data = {
    "id": 124,
    "statuses": [
        {
            "status": "to do",
            "orderindex": 0,
            "color": "#d3d3d3",
            "type": "open"
        }
    ]
}


@dataclass
class StatusElement:
    status: str
    orderindex: int
    color: str
    type: str


@dataclass
class List:
    id: int
    statuses: List[StatusElement]


lst: List = from_dict(List, data)
print(lst)
output
List(id=124, statuses=[StatusElement(status='to do', orderindex=0, color='#d3d3d3', type='open')])
I've spent a few hours investigating options for this. There's no native Python functionality to do this, but there are a few third-party packages (writing in November 2022):
marshmallow_dataclass has this functionality (you need not be using marshmallow in any other capacity in your project). It gives good error messages and the package is actively maintained. I used this for a while before hitting what I believe is a bug parsing a large and complex JSON into deeply nested dataclasses, and then had to switch away.
dataclass-wizard is easy to use and specifically addresses this use case. It has excellent documentation. One significant disadvantage is that it won't automatically attempt to find the right fit for a given JSON, if trying to match against a union of dataclasses (see https://dataclass-wizard.readthedocs.io/en/latest/common_use_cases/dataclasses_in_union_types.html). Instead it asks you to add a "tag key" to the input JSON, which is a robust solution but may not be possible if you have no control over the input JSON.
dataclasses-json is similar to dataclass-wizard, and again doesn't attempt to match the correct dataclass within a union.
dacite is the option I have settled upon for the time being. It has similar functionality to marshmallow_dataclass, at least for JSON parsing. The error messages are significantly less clear than marshmallow_dataclass, but slightly offsetting this, it's easier to figure out what's wrong if you pdb in at the point that the error occurs - the internals are quite clear and you can experiment to see what's going wrong. According to others it is rather slow, but that's not a problem in my circumstance.

Scala: How to create a simple nested JSON object

In Scala I want to create a JSON Object or JSON String where the result would be something similar to this (i.e. a relatively basic nested JSON):
{
  "info": {
    "info1": "1234",
    "info2": "123456",
    "info3": false
  },
  "x": [
    {
      "x1_info": 10,
      "x2_info": "abcde"
    }
  ],
  "y": [
    {
      "y1_info": "28732",
      "y2_info": "defghi",
      "y3_info": "1261",
      "y_extra_info": [
        {
          "Id": "543890",
          "ye1": "asdfg",
          "ye2": "BOOLEAN",
          "ye3": false,
          "ye4": true,
          "ye5": true,
          "ye6": 0,
          "ye7": 396387
        }
      ]
    }
  ]
}
I plan to use this object in tests, so ideally I can set the information inside the JSON object to whatever I want (I don't plan on sourcing the info for the JSON object from a separate file). I tried the following, where I created a Map of the information I want in my JSON object and then tried converting it to JSON using Gson:
val myInfo = Map(
  "info" -> Map(
    "info1" -> "1234",
    "info2" -> "123456",
    "info3" -> "false"),
  "x" -> Map(
    "x1_info" -> 10,
    "x2_info" -> "abcde"),
  "y" -> Map(
    "y1_info" -> "28732",
    "y2_info" -> "defghi",
    "y3_info" -> "1261",
    "y_extra_info" -> Map(
      "Id" -> "543890",
      "ye1" -> "asdfg",
      "ye2" -> "BOOLEAN",
      "ye3" -> false,
      "ye4" -> true,
      "ye5" -> true,
      "ye6" -> 0,
      "ye7" -> 396387
    )
  )
)
val gson = new GsonBuilder().setPrettyPrinting.create
print(gson.toJson(myInfo))
However, though it produces valid JSON, it does not produce the JSON in the structure I have above, but instead structures it like so:
{
"key1": "info",
"value1": {
"key1": "info1",
"value1": "1234",
"key2": "info2",
"value2": "123456",
...(etc)...
I have figured out an inelegant way of achieving what I want by creating a string containing the JSON I expect and then parsing it as JSON (but I assume there is a better way):
val jsonString = "{\"info\":{\"info1\":\"1234\", \"info2\":\"123456\", " +
"\"info3\":false}," +
"\"x\":[{ \"x1_info\":10,\"x2_info\":\"abcde\"}]," +
"\"y\":[{\"y1_info\":\"28732\",\"y2_info\":\"defghi\",\"y3_info\":\"1261\", " +
"\"y_extra_info\":[{" +
"\"Id\":\"543890\",\"ye1\":\"asdfg\"," +
"\"ye2\":\"BOOLEAN\",\"ye3\":false,\"ye4\":true," +
"\"ye5\":true,\"ye6\":0,\"ye7\":396387}]}]}"
I'm relatively new to Scala so perhaps using a Map or writing it all out as a string is not the best or most "Scala way"?
Is there an easy way to create such a JSON object using preferable either gson or spray or otherwise by some other means?
In general, there's no good reason to use GSON with Scala, as GSON is Java-based and doesn't really know how to deal with Scala collections. In this case, it doesn't know how to deal with the key-value-ness of a Scala Map and instead (most likely due to reflection) just leaks the implementation detail of the default Scala Map being specialized for sizes of at most 4.
JSON libraries which are native to Scala (e.g. circe, play-json, spray-json, etc.) do what you expect for String-keyed Maps. They also have built-in support for directly handling case classes in Scala, e.g. for
case class Foo(id: String, code: Int, notes: String)
val foo = Foo("fooId", 42, "this is a very important foo!")
will serialize as something like (ignoring possible differences around pretty-printing and field ordering)
{ "id": "fooId", "code": 42, "notes": "this is a very important foo!" }
If there's a hard project requirement that you must use GSON, I believe it does understand Java's collections, so you could convert a Scala Map to a java.util.Map by using the conversions in CollectionConverters (in 2.13 it's in the scala.jdk package; in earlier versions it's scala.collection.JavaConverters). I'm not going to elaborate on this, because my very strong opinion is that in a Scala codebase it's best to use a native-to-Scala JSON library.
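As a concrete illustration of the native-library route, the question's structure can be built directly with play-json's Json.obj/Json.arr DSL (a sketch; the values are copied from the question, and any of the libraries mentioned above has an equivalent):

import play.api.libs.json.{JsObject, Json}

// Arrays become Json.arr(...), objects become Json.obj(...),
// matching the target structure in the question exactly.
val myInfo: JsObject = Json.obj(
  "info" -> Json.obj(
    "info1" -> "1234",
    "info2" -> "123456",
    "info3" -> false
  ),
  "x" -> Json.arr(
    Json.obj("x1_info" -> 10, "x2_info" -> "abcde")
  ),
  "y" -> Json.arr(
    Json.obj(
      "y1_info" -> "28732",
      "y2_info" -> "defghi",
      "y3_info" -> "1261",
      "y_extra_info" -> Json.arr(
        Json.obj(
          "Id" -> "543890", "ye1" -> "asdfg", "ye2" -> "BOOLEAN",
          "ye3" -> false, "ye4" -> true, "ye5" -> true,
          "ye6" -> 0, "ye7" -> 396387
        )
      )
    )
  )
)

println(Json.prettyPrint(myInfo))

Because the builder takes real Scala values (Boolean, Int, String), this also avoids the asker's "info3" -> "false" string-vs-boolean slip.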

Play framework (Scala) - Getting subset of json that contains arrays

I have many very large json-objects that I return from Play Framework with Scala.
In most cases the user doesn't need all the data in the objects, only a few fields. So I want to pass in the paths I need (as query parameters), and return a subset of the json object.
I have looked at using JSON Transformers for this task.
Filter code
def filterByPaths(paths: List[JsPath], inputObject: JsObject): JsObject = {
  paths
    .map(_.json.pick)
    .map(inputObject.transform)
    .filter(_.isSuccess)
    .map { case JsSuccess(value, path) => (value, path) }
    .foldLeft(Json.obj()) { (obj, jsValueAndPath) =>
      val (jsValue, path) = jsValueAndPath
      val transformer = __.json.update(path.json.put(jsValue))
      obj.transform(transformer).get
    }
}
Usage:
val input = Json.obj(
  "field1" -> Json.obj(
    "field2" -> "right result"
  ),
  "field4" -> Json.obj(
    "field5" -> "not included"
  )
)

val result = filterByPaths(List(JsPath \ "field1" \ "field2"), input)
// {"field1":{"field2":"right result"}}
Problem
This code works fine for JsObjects, but I can't make it work if there are JsArrays in the structure. I had hoped that my JsPath could contain an index to look up the field, but that's not the case. (I don't know why I assumed that; maybe my head was too far in the JavaScript world.)
So this would fail to return the first entry in the Array:
val input: JsObject = Json.parse("""
{
  "arr1" : [{
    "field1" : "value1"
  }]
}
""").as[JsObject]

val result = filterByPaths(List(JsPath \ "arr1" \ "0"), input)
// {}
Question
My question is: How can I return a subset of a json structure that contains arrays?
Alternative solution
I have the data as a case class first, serialize it to JSON, and then run filterByPaths on it. Having a Reads that only creates the JSON I need in the first place might be a better solution, but creating a Reads on the fly, with configuration from query params, seemed a more difficult task than just stripping down the JSON afterwards.
An example of returning an array element:
val input: JsValue = Json.parse("""
{
  "arr1" : [{
    "field1" : "value1"
  }]
}
""")

val firstElement = (input \ "arr1" \ 0).get
val firstElementAnotherWay = input("arr1")(0)
More about this in the Play Framework documentation: https://www.playframework.com/documentation/2.6.x/ScalaJson
Update
It looks like you've hit the old issue RuntimeException: expected KeyPathNode. JsPath.json.put and JsPath.json.update can't put an object into a nested array.
https://github.com/playframework/playframework/issues/943
https://github.com/playframework/play-json/issues/82
What you can do:
Use the JSZipper: https://github.com/mandubian/play-json-zipper
Create a script to update arrays "manually"
If you can afford it, strip arrays in the resulting object
Example of stripping array (point 3):
def filterByPaths(paths: List[JsPath], inputObject: JsObject): JsObject = {
  paths
    .map(_.json.pick)
    .map(inputObject.transform)
    .filter(_.isSuccess)
    .map { case JsSuccess(value, path) => (value, path) }
    .foldLeft(Json.obj()) { (obj, jsValueAndPath) =>
      val (jsValue, path) = jsValueAndPath
      val arrayStrippedPath = JsPath(path.path.filter(n => !(n.toJsonString matches """\[\d+\]""")))
      val transformer = __.json.update(arrayStrippedPath.json.put(jsValue))
      obj.transform(transformer).get
    }
}

val result = filterByPaths(List(JsPath \ "arr1" \ "0"), input)
// {"arr1":{"field1":"value1"}}
The best way to handle JSON objects is to use case classes and create implicit Reads and Writes; that way you can handle errors for every field directly. Don't make it complicated.
Don't use .get; it's much better to use .getOrElse, since Scala is a type-safe programming language.
Don't just use any library unless you know the process behind it; it can be better to create your own parsing method with a simplified solution, to save memory.
I hope this helps.

Capturing unused fields while decoding a JSON object with circe

Suppose I have a case class like the following, and I want to decode a JSON object into it, with all of the fields that haven't been used ending up in a special member for the leftovers:
import io.circe.Json
case class Foo(a: Int, b: String, leftovers: Json)
What's the best way to do this in Scala with circe?
(Note: I've seen questions like this a few times, so I'm Q-and-A-ing it for posterity.)
There are a couple of ways you could go about this. One fairly straightforward way would be to filter out the keys you've used after decoding:
import io.circe.{ Decoder, Json, JsonObject }

implicit val decodeFoo: Decoder[Foo] =
  Decoder.forProduct2[Int, String, (Int, String)]("a", "b")((_, _)).product(
    Decoder[JsonObject]
  ).map {
    case ((a, b), all) =>
      Foo(a, b, Json.fromJsonObject(all.remove("a").remove("b")))
  }
Which works as you'd expect:
scala> val doc = """{ "something": false, "a": 1, "b": "abc", "0": 0 }"""
doc: String = { "something": false, "a": 1, "b": "abc", "0": 0 }
scala> io.circe.jawn.decode[Foo](doc)
res0: Either[io.circe.Error,Foo] =
Right(Foo(1,abc,{
"something" : false,
"0" : 0
}))
The disadvantage of this approach is that you have to maintain code to remove the keys you've used separately from their use, which can be error-prone. Another approach is to use circe's state-monad-powered decoding tools:
import cats.data.StateT
import cats.instances.either._
import io.circe.{ ACursor, Decoder, Json }

implicit val decodeFoo: Decoder[Foo] = Decoder.fromState(
  for {
    a    <- Decoder.state.decodeField[Int]("a")
    b    <- Decoder.state.decodeField[String]("b")
    rest <- StateT.inspectF((_: ACursor).as[Json])
  } yield Foo(a, b, rest)
)
Which works the same way as the previous decoder (apart from some small differences in the errors you'll get if decoding fails):
scala> io.circe.jawn.decode[Foo](doc)
res1: Either[io.circe.Error,Foo] =
Right(Foo(1,abc,{
"something" : false,
"0" : 0
}))
This latter approach doesn't require you to change the used fields in multiple places, and it also has the advantage of looking a little more like any other decoder you'd write manually in circe.

Akka HTTP Streaming JSON Deserialization

Is it possible to dynamically deserialize an external ByteString stream of unknown length from Akka HTTP into domain objects?
Context
I call an infinitely long HTTP endpoint that outputs a JSON Array that keeps growing:
[
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
{ "prop": true, "prop2": false, "prop3": 97, "prop4": "sample" },
...
] <- Never sees the daylight
I guess that JsonFraming.objectScanner(Int.MaxValue) should be used in this case. As the docs state:
Returns a Flow that implements a "brace counting" based framing
operator for emitting valid JSON chunks. It scans the incoming data
stream for valid JSON objects and returns chunks of ByteStrings
containing only those valid chunks. Typical examples of data that one
may want to frame using this operator include: Very large arrays
So you can end up with something like this:
val response: Future[HttpResponse] = Http().singleRequest(HttpRequest(uri = serviceUrl))

response.onComplete {
  case Success(value) =>
    value.entity.dataBytes
      .via(JsonFraming.objectScanner(Int.MaxValue))
      .map(_.utf8String)          // In case you have ByteString
      .map(decode[MyEntity](_))   // Use any Unmarshaller here
      .grouped(20)
      .runWith(Sink.ignore)       // Do whatever you need here
  case Failure(exception) => log.error(exception, "Api call failed")
}
I had a very similar problem trying to parse the Twitter Stream (an infinite string) into a domain object.
I solved it using Json4s, like this:
import org.json4s._
import org.json4s.native.JsonMethods._

case class Tweet(username: String, geolocation: Option[Geo])
case class Geo(latitude: Float, longitude: Float)

object Tweet {
  // extract[Tweet] needs an implicit Formats in scope
  implicit val formats: Formats = DefaultFormats

  def apply(s: String): Tweet = {
    parse(StringInput(s), useBigDecimalForDouble = false, useBigIntForLong = false).extract[Tweet]
  }
}
Then I just buffered the stream and mapped each line to a Tweet:
val reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(inputStream), "UTF-8"))
var line = reader.readLine()
while (line != null) {
  store(Tweet.apply(line))
  line = reader.readLine()
}
Json4s has full support for Option (and for custom objects inside the object, like Geo in the example). Therefore, you can use an Option like I did, and if the field doesn't come in the JSON, it will be set to None.
Hope it helps!
I think that play-iteratees-extras can help you. This library allows you to parse JSON via the Enumerator/Iteratee pattern and, of course, doesn't wait to receive all the data.
For example, let's build an 'infinite' stream of bytes that represents an 'infinite' JSON array.
import play.api.libs.iteratee.{Enumeratee, Enumerator, Iteratee}

var i = 0
var isFirstWas = false
val max = 10000
val stream = Enumerator("[".getBytes) andThen Enumerator.generateM {
  Future {
    i += 1
    if (i < max) {
      val json = Json.stringify(Json.obj(
        "prop" -> Random.nextBoolean(),
        "prop2" -> Random.nextBoolean(),
        "prop3" -> Random.nextInt(),
        "prop4" -> Random.alphanumeric.take(5).mkString("")
      ))
      val string = if (isFirstWas) {
        "," + json
      } else {
        isFirstWas = true
        json
      }
      Some(Codec.utf_8.encode(string))
    } else if (i == max) Some("]".getBytes) // <------ this is the closing tag of the JSON array
    else None
  }
}
OK, this value contains a JSON array of 10000 (or more) objects. Let's define a case class that will contain the data of each object in our array.
case class Props(prop: Boolean, prop2: Boolean, prop3: Int, prop4: String)
Now let's write a parser that will parse each item:
import play.extras.iteratees._
import JsonBodyParser._
import JsonIteratees._
import JsonEnumeratees._

val parser = jsArray(jsValues(jsSimpleObject)) ><> Enumeratee.map { json =>
  for {
    prop  <- json.\("prop").asOpt[Boolean]
    prop2 <- json.\("prop2").asOpt[Boolean]
    prop3 <- json.\("prop3").asOpt[Int]
    prop4 <- json.\("prop4").asOpt[String]
  } yield Props(prop, prop2, prop3, prop4)
}
Please see the docs for jsArray, jsValues and jsSimpleObject. To build the result producer:
val result = stream &> Encoding.decode() ><> parser
Encoding.decode() from the JsonIteratees package will decode the bytes as a CharString. The result value has type Enumerator[Option[Props]], and you can apply an iteratee to this enumerator to start the parsing process.
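To actually kick off consumption, one can run the result enumerator from above into an iteratee, for example (a sketch, untested; |>>> and Iteratee.foreach are standard play-iteratees calls):

import play.api.libs.iteratee.Iteratee

// Drain the enumerator, handling each parsed item as it arrives;
// None marks an array element that didn't parse into a Props.
val done = result |>>> Iteratee.foreach[Option[Props]] {
  case Some(props) => println(props)
  case None        => // ignore unparsable elements
}

done is a Future that completes when the stream ends, so nothing is buffered beyond the element currently being handled.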
In short, I don't know how you receive the bytes (the solution depends heavily on this), but I think this shows one possible solution to your problem.