I am trying to parse json with this library https://github.com/momodi/Json4Scala
I have JSON that looks like this:
{
current: {pageviews: 5, time: 50, id: 'jafh784'},
allTime: {pageviews: 20, time: 438, id: 'adsf6447'}
}
val json = Json.parse(x.getString("user"))
json.asMap("current").asMap("pageviews").asInt
It is not working, and I have tried several combinations of the above. I tried some other libraries, but they were even less clear to me. The schema of the JSON varies, but the pageviews field is always in the same location. I am open to suggestions for another library.
Edit: I read about using case classes for nested objects but the schema is not exact across all my json. Can I just use a case class and only declare a minimum of keys?
Play Json allows you to do it without having to specify models:
import play.api.libs.json._
val raw = """{
"current": {"pageviews": 5, "time": 50, "id": "jafh784"},
"allTime": {"pageviews": 20, "time": 438, "id": "adsf6447"}
}"""
val json = Json.parse(raw).as[JsObject]
val currentPageviews = (json \ "current" \ "pageviews").as[Int]
println(currentPageviews) // 5
Here is a link to a live example.
To include the Play JSON dependency, add this to your build.sbt:
libraryDependencies += "com.typesafe.play" % "play-json_2.11" % "2.6.2"
(There are also builds for Scala 2.12.)
Dataclass example:
@dataclass
class StatusElement:
    status: str
    orderindex: int
    color: str
    type: str
@dataclass
class List:
    id: int
    statuses: List[StatusElement]
JSON example:
json = {
"id": "124",
"statuses": [
{
"status": "to do",
"orderindex": 0,
"color": "#d3d3d3",
"type": "open"
}]
}
I can unpack the JSON doing something like this:
object = List(**json)
But I'm not sure how I can also unpack each entry in statuses into a StatusElement object and append it to the statuses list of the List object. I'm sure I need to loop over it somehow, but I'm not sure how to combine that with unpacking.
Python's dataclasses module is great, but unfortunately it does not handle parsing a JSON object into a nested dataclass structure.
A few workarounds exist for this:
You can roll your own JSON parsing helper method, for example a from_json that converts a JSON string to a List instance with a nested dataclass.
You can make use of existing JSON serialization libraries. For example, pydantic is a popular one that supports this use case.
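The first workaround, a hand-rolled helper, might look roughly like the sketch below. It uses only the standard library; the from_dict and from_json method names are illustrative choices rather than part of any library, and the explicit int() conversion is an assumption made because the sample JSON carries the id as a string.

```python
import json
from dataclasses import dataclass
from typing import List as PyList


@dataclass
class StatusElement:
    status: str
    orderindex: int
    color: str
    type: str


@dataclass
class List:
    id: int
    statuses: PyList[StatusElement]

    @classmethod
    def from_dict(cls, data: dict) -> "List":
        # Hypothetical helper: build the nested objects by hand, one level at a time.
        statuses = [StatusElement(**item) for item in data["statuses"]]
        # Convert the id explicitly, since the sample JSON stores it as a string.
        return cls(id=int(data["id"]), statuses=statuses)

    @classmethod
    def from_json(cls, text: str) -> "List":
        # Parse the JSON string first, then reuse the dict-based helper.
        return cls.from_dict(json.loads(text))


payload = {
    "id": "124",
    "statuses": [
        {"status": "to do", "orderindex": 0, "color": "#d3d3d3", "type": "open"}
    ],
}

lst = List.from_dict(payload)
print(lst)
# List(id=124, statuses=[StatusElement(status='to do', orderindex=0, color='#d3d3d3', type='open')])
```

This is easy to follow for one or two levels of nesting, but every new nested type needs its own hand-written conversion, which is the main reason to reach for a library instead.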
Here is an example using the dataclass-wizard library that works well enough for your use case. It's more lightweight than pydantic and coincidentally also a little faster. It also supports automatic case transforms and type conversions (for example, str to an annotated int).
Example below:
from dataclasses import dataclass
from typing import List as PyList
from dataclass_wizard import JSONWizard
@dataclass
class List(JSONWizard):
    id: int
    statuses: PyList['StatusElement']
    # on Python 3.9+ you can use the following syntax:
    #   statuses: list['StatusElement']
@dataclass
class StatusElement:
    status: str
    order_index: int
    color: str
    type: str
json = {
"id": "124",
"statuses": [
{
"status": "to do",
"orderIndex": 0,
"color": "#d3d3d3",
"type": "open"
}]
}
object = List.from_dict(json)
print(repr(object))
# List(id=124, statuses=[StatusElement(status='to do', order_index=0, color='#d3d3d3', type='open')])
Disclaimer: I am the creator (and maintainer) of this library.
You can now skip the class inheritance as of the latest release of dataclass-wizard. It is straightforward to use: this is the same example as above, but with the JSONWizard usage removed completely. Just make sure you import asdict from dataclass_wizard rather than from the dataclasses module, even though I guess the latter should coincidentally work.
Here's the modified version of the above without class inheritance:
from dataclasses import dataclass
from typing import List as PyList
from dataclass_wizard import fromdict, asdict
@dataclass
class List:
    id: int
    statuses: PyList['StatusElement']
@dataclass
class StatusElement:
    status: str
    order_index: int
    color: str
    type: str
json = {
"id": "124",
"statuses": [
{
"status": "to do",
"orderIndex": 0,
"color": "#d3d3d3",
"type": "open"
}]
}
# De-serialize the JSON dictionary into a `List` instance.
c = fromdict(List, json)
print(c)
# List(id=124, statuses=[StatusElement(status='to do', order_index=0, color='#d3d3d3', type='open')])
# Convert the instance back to a dictionary object that is JSON-serializable.
d = asdict(c)
print(d)
# {'id': 124, 'statuses': [{'status': 'to do', 'orderIndex': 0, 'color': '#d3d3d3', 'type': 'open'}]}
Also, here's a quick performance comparison with dacite. I wasn't aware of this library before, but it's also very easy to use (and there's also no need to inherit from any class). However, from my personal tests - Windows 10 Alienware PC using Python 3.9.1 - dataclass-wizard seemed to perform much better overall on the de-serialization process.
from dataclasses import dataclass
from timeit import timeit
from typing import List
from dacite import from_dict
from dataclass_wizard import JSONWizard, fromdict
data = {
"id": 124,
"statuses": [
{
"status": "to do",
"orderindex": 0,
"color": "#d3d3d3",
"type": "open"
}]
}
@dataclass
class StatusElement:
    status: str
    orderindex: int
    color: str
    type: str
@dataclass
class List:
    id: int
    statuses: List[StatusElement]
class ListWiz(List, JSONWizard):
    ...
n = 100_000
# 0.37
print('dataclass-wizard: ', timeit('ListWiz.from_dict(data)', number=n, globals=globals()))
# 0.36
print('dataclass-wizard (fromdict): ', timeit('fromdict(List, data)', number=n, globals=globals()))
# 11.2
print('dacite: ', timeit('from_dict(List, data)', number=n, globals=globals()))
lst_wiz1 = ListWiz.from_dict(data)
lst_wiz2 = fromdict(List, data)
lst = from_dict(List, data)
# True
assert lst.__dict__ == lst_wiz1.__dict__ == lst_wiz2.__dict__
A "cleaner" solution (in my eyes): use dacite.
No need to inherit anything.
from dataclasses import dataclass
from typing import List
from dacite import from_dict
data = {
"id": 124,
"statuses": [
{
"status": "to do",
"orderindex": 0,
"color": "#d3d3d3",
"type": "open"
}]
}
@dataclass
class StatusElement:
    status: str
    orderindex: int
    color: str
    type: str
@dataclass
class List:
    id: int
    statuses: List[StatusElement]
lst: List = from_dict(List, data)
print(lst)
Output:
List(id=124, statuses=[StatusElement(status='to do', orderindex=0, color='#d3d3d3', type='open')])
I've spent a few hours investigating options for this. There's no native Python functionality to do this, but there are a few third-party packages (writing in November 2022):
marshmallow_dataclass has this functionality (you need not be using marshmallow in any other capacity in your project). It gives good error messages and the package is actively maintained. I used this for a while before hitting what I believe is a bug parsing a large and complex JSON into deeply nested dataclasses, and then had to switch away.
dataclass-wizard is easy to use and specifically addresses this use case. It has excellent documentation. One significant disadvantage is that it won't automatically attempt to find the right fit for a given JSON, if trying to match against a union of dataclasses (see https://dataclass-wizard.readthedocs.io/en/latest/common_use_cases/dataclasses_in_union_types.html). Instead it asks you to add a "tag key" to the input JSON, which is a robust solution but may not be possible if you have no control over the input JSON.
dataclass-json is similar to dataclass-wizard, and again doesn't attempt to match the correct dataclass within a union.
dacite is the option I have settled upon for the time being. It has similar functionality to marshmallow_dataclass, at least for JSON parsing. Its error messages are significantly less clear than those of marshmallow_dataclass, but slightly offsetting this, it's easier to figure out what's wrong if you drop into pdb at the point where the error occurs: the internals are quite clear and you can experiment to see what's going wrong. According to others it is rather slow, but that's not a problem in my circumstances.
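To make the union-matching point concrete, here is a minimal standard-library sketch of what "finding the right fit" for a dictionary can mean when no tag key is available. It is not how any of the packages above implement it; the Circle and Rectangle classes and the pick_dataclass and from_dict_union helpers are hypothetical names invented for this illustration, and the matching rule (field names must equal the dictionary keys exactly) is a deliberately simple assumption.

```python
from dataclasses import dataclass, fields
from typing import Sequence, Type


@dataclass
class Circle:
    radius: float


@dataclass
class Rectangle:
    width: float
    height: float


def pick_dataclass(data: dict, candidates: Sequence[Type]) -> Type:
    """Return the first candidate whose field names equal the dict's keys."""
    keys = set(data)
    for candidate in candidates:
        if {f.name for f in fields(candidate)} == keys:
            return candidate
    raise ValueError(f"no candidate matches keys {sorted(keys)}")


def from_dict_union(data: dict, candidates: Sequence[Type]):
    # Choose the matching class, then construct it from the dict.
    cls = pick_dataclass(data, candidates)
    return cls(**data)


shape = from_dict_union({"width": 2.0, "height": 3.0}, [Circle, Rectangle])
print(shape)
# Rectangle(width=2.0, height=3.0)
```

A rule this simple breaks down as soon as two dataclasses share the same field names or fields are optional, which is presumably why a tag key is the more robust approach when you do control the input.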
I am using Scala to parse json with a structure like this:
{
"root": {
"metadata": {
"name": "Farmer John",
"hasTractor": false
},
"plants": {
"corn": 137.137,
"soy": 0.45
},
"animals": {
"cow": 4,
"sheep": 12,
"pig": 1
}
}
}
And I am currently using the org.json library to parse it, like this:
val jsonObject = new JSONObject(jsonString) // Where jsonString is the above json tree
But when I run something like jsonObject.get("root.metadata.name") then I get the error:
JSONObject["root.metadata.name"] not found.
org.json.JSONException: JSONObject["root.metadata.name"] not found.
I suspect I can get the objects one at a time by splitting up that path, but that sounds tedious and I assume a better json library already exists. Is there a way to easily get the data the way I am trying to or is there a better library to use that works better with Scala?
The JSONObject you are using is deprecated. The deprecation message sends you to The Scala Library Index. I'll demonstrate how this can be done in play-json. Let's assume that the json above is stored at jsonString, so we can do:
import play.api.libs.json._
val json = Json.parse(jsonString)
val path = JsPath \ "root" \ "metadata" \ "name"
path(json) match {
  case Nil =>
    ??? // path does not exist
  case values =>
    println(s"Values are: $values")
}
Code run at Scastie.
Looks like I was able to solve it using JsonPath like this:
val document = Configuration.defaultConfiguration().jsonProvider().parse(jsonString): Object
val value = JsonPath.read(document, "$.root.metadata.name"): Any
I am writing a small scala practice code where my input is going to be in the fashion -
{
"code": "",
"unique ID": "",
"count": "",
"names": [
{
"Matt": {
"name": "Matt",
"properties": [
"a",
"b",
"c"
],
"fav-colour": "red"
},
"jack": {
"name": "jack",
"properties": [
"a",
"b"
],
"fav-colour": "blue"
}
}
]
}
I'll be passing this file as a command-line argument.
How do I accept the input file, parse the JSON, and use the JSON keys in my code?
You may use a json library such as play-json to parse the json content.
You could either operate on the json AST or you could write case classes that have the same structure as your json file and let them be parsed.
You can find the documentation of the library here.
You'll first have to add play-json as a dependency to your project. If you're using sbt, just add this to your build.sbt file:
libraryDependencies += "com.typesafe.play" %% "play-json" % "2.6.13"
Play json using AST
Let's read the input file:
import java.io.FileInputStream
import play.api.libs.json.Json
object Main extends App {
  // first we'll need an InputStream for your JSON file;
  // this should be familiar if you know Java.
  val in = new FileInputStream(args(0))
  // now we'll let play-json parse it
  val json = Json.parse(in)
}
Let's extract some fields from the AST:
// requires: import java.util.UUID and import play.api.libs.json._
val code = (json \ "code").as[String]
val uniqueID = (json \ "unique ID").as[UUID]
for {
  JsObject(nameMap) <- (json \ "names").as[Seq[JsObject]]
  (name, userMeta) <- nameMap // nameMap is a Map[String, JsValue]
} println(s"User $name has the favorite color ${(userMeta \ "fav-colour").as[String]}")
Using Deserialization
As I've just described, we may create case classes that represent your structure:
case class InputFile(code: String, `unique ID`: UUID, count: String, names: Seq[Map[String, UserData]])
case class UserData(name: String, properties: Seq[String], `fav-colour`: String)
In addition you'll need to define an implicit Format e.g. in the companion object of each case class. Instead of writing it by hand you can use the Json.format macro that derives it for you:
object UserData {
implicit val format: OFormat[UserData] = Json.format[UserData]
}
object InputFile {
implicit val format: OFormat[InputFile] = Json.format[InputFile]
}
You can now deserialize your json object:
val argumentData = json.as[InputFile]
I generally prefer this approach, but in your case the JSON structure does not fit really well. One improvement could be to add an additional getter to your InputFile class that makes accessing the fields with spaces or similar characters in their names easier:
case class InputFile(code: String, `unique ID`: UUID, count: String, names: Seq[Map[String, UserData]]) {
  // this method is nicer to use
  def uniqueId = `unique ID`
}
I have a huge JSON file, a small part from it as follows:
{
"socialNews": [{
"adminTagIds": "",
"fileIds": "",
"departmentTagIds": "",
........
........
"comments": [{
"commentId": "",
"newsId": "",
"entityId": "",
....
....
}]
}]
.....
}
I have applied lateral view explode on socialNews as follows:
val rdd = sqlContext.jsonFile("file:///home/ashish/test")
rdd.registerTempTable("social")
val result = sqlContext.sql("select * from social LATERAL VIEW explode(socialNews) social AS comment")
Now I want to convert back this result (DataFrame) to JSON and save into a file, but I am not able to find any Scala API to do the conversion.
Is there any standard library to do this or some way to figure it out?
val result: DataFrame = sqlContext.read.json(path)
result.write.json("/yourPath")
The write method returns a DataFrameWriter and should be accessible to you on DataFrame objects. Just make sure that your rdd is of type DataFrame and not of the deprecated type SchemaRDD. You can explicitly provide a type annotation, val data: DataFrame, or convert to a DataFrame with toDF().
If you have a DataFrame there is an API to convert back to an RDD[String] that contains the json records.
val df = Seq((2012, 8, "Batman", 9.8), (2012, 8, "Hero", 8.7), (2012, 7, "Robot", 5.5), (2011, 7, "Git", 2.0)).toDF("year", "month", "title", "rating")
df.toJSON.saveAsTextFile("/tmp/jsonRecords")
df.toJSON.take(2).foreach(println)
This should be available from Spark 1.4 onward. Call the API on the result DataFrame you created.
The APIs available are listed here
sqlContext.read().json(dataFrame.toJSON())
When you run your spark job as
--master local --deploy-mode client
Then,
df.write.json('path/to/file/data.json') works.
If you run on a cluster (on the head node) with --master yarn --deploy-mode cluster, a better approach is to write the data to AWS S3 or Azure Blob storage and read it from there.
df.write.json('s3://bucket/path/to/file/data.json') works.
If you still can't figure out a way to convert Dataframe into JSON, you can use to_json or toJSON inbuilt Spark functions.
Let me know if you have a sample Dataframe and a format of JSON to convert.
I need to be able to process large JSON files, instantiating objects from deserializable sub-strings as we are iterating-over/streaming-in the file.
For example:
Let's say I can only deserialize into instances of the following:
case class Data(val a: Int, val b: Int, val c: Int)
and the expected JSON format is:
{ "foo": [ {"a": 0, "b": 0, "c": 0 }, {"a": 0, "b": 0, "c": 1 } ],
"bar": [ {"a": 1, "b": 0, "c": 0 }, {"a": 1, "b": 0, "c": 1 } ],
.... MANY ITEMS .... ,
"qux": [ {"a": 0, "b": 0, "c": 0 } ] }
What I would like to do is:
import com.codahale.jerkson.Json
val dataSeq : Seq[Data] = Json.advanceToValue("foo").stream[Data](fileStream)
// NOTE: this will not compile since I pulled the "advanceToValue" out of thin air.
As a final note, I would prefer to find a solution that involves Jerkson or any other library that comes with the Play framework, but if another Scala library handles this scenario with greater ease and decent performance, I'm not opposed to trying it. If there is a clean way of manually seeking through the file and then using a JSON library to continue parsing from there, I'm fine with that.
What I do not want to do is ingest the entire file without streaming or using an iterator, as keeping the entire file in memory at a time would be prohibitively expensive.
I have not done it with JSON (and I hope someone will come up with a turnkey solution for you), but I have done it with XML, and here is a way of handling it.
It is basically a simple Map->Reduce process with the help of stream parser.
Map (your advanceTo)
Use a streaming parser like JSON Simple (not tested). When on the callback you match your "path", collect anything below by writing it to a stream (file backed or in-memory, depending on your data). That will be your foo array in your example. If your mapper is sophisticated enough, you may want to collect multiple paths during the map step.
Reduce (your stream[Data])
Since the streams you collected above look pretty small, you probably do not need to map/split them again and you can parse them directly in memory as JSON objects/arrays and manipulate them (transform, recombine, etc...).
Here is the current way I am solving the problem:
import collection.immutable.PagedSeq
import util.parsing.input.PagedSeqReader
import com.codahale.jerkson.Json
import collection.mutable
private def fileContent = new PagedSeqReader(PagedSeq.fromFile("/home/me/data.json"))
private val clearAndStop = ']'
private def takeUntil(readerInitial: PagedSeqReader, text: String) : Taken = {
val str = new StringBuilder()
var readerFinal = readerInitial
while(!readerFinal.atEnd && !str.endsWith(text)) {
str += readerFinal.first
readerFinal = readerFinal.rest
}
if (!str.endsWith(text) || str.contains(clearAndStop))
Taken(readerFinal, None)
else
Taken(readerFinal, Some(str.toString))
}
private def takeUntil(readerInitial: PagedSeqReader, chars: Char*) : Taken = {
var taken = Taken(readerInitial, None)
chars.foreach(ch => taken = takeUntil(taken.reader, ch.toString))
taken
}
def getJsonData() : Seq[Data] = {
  var data = mutable.ListBuffer[Data]()
  var taken = takeUntil(fileContent, "\"foo\"")
  taken = takeUntil(taken.reader, ':', '[')
  var doneFirst = false
  while(taken.text != None) {
    if (!doneFirst)
      doneFirst = true
    else
      taken = takeUntil(taken.reader, ',')
    taken = takeUntil(taken.reader, '}')
    if (taken.text != None) {
      print(taken.text.get)
      // parse the captured object and append it to the result buffer
      data += Json.parse[Data](taken.text.get)
    }
  }
  data
}
case class Taken(reader: PagedSeqReader, text: Option[String])
case class Data(val a: Int, val b: Int, val c: Int)
Granted, this code doesn't handle malformed JSON very cleanly, and using it for multiple top-level keys ("foo", "bar" and "qux") will require looking ahead (or matching from a list of possible top-level keys), but in general I believe it does the job. It's not quite as functional as I'd like and isn't super robust, but PagedSeqReader definitely keeps this from getting too messy.