1. The input is a JSON file that contains multiple records. Example:
[
{"user": "user1", "page": 1, "field": "some"},
{"user": "user2", "page": 2, "field": "some2"},
...
]
2. I need to load each record from the file as a document into a MongoDB collection.
Using Casbah to interact with MongoDB, inserting the data may look like:
def saveCollection(inputListOfDbObjects: List[DBObject]): Unit = {
  // Save every record as a separate document
  inputListOfDbObjects.foreach(obj => Collection.save(obj))
}
Question: What is the correct way (using Scala) to parse the JSON so that the output is a List[DBObject]?
Any help is appreciated.
You could use the parser combinator library in Scala.
Here's some code I found that does this for JSON: http://booksites.artima.com/programming_in_scala_2ed/examples/html/ch33.html#sec4
Step 1. Create a class named JSON that contains your parser rules:
import scala.util.parsing.combinator._
class JSON extends JavaTokenParsers {
def value : Parser[Any] = obj | arr |
stringLiteral |
floatingPointNumber |
"null" | "true" | "false"
def obj : Parser[Any] = "{"~repsep(member, ",")~"}"
def arr : Parser[Any] = "["~repsep(value, ",")~"]"
def member: Parser[Any] = stringLiteral~":"~value
}
Step 2. In your main function, read in your JSON file, passing the contents of the file to your parser.
import java.io.FileReader
object ParseJSON extends JSON {
  def main(args: Array[String]): Unit = {
    // args(0) is the path to the JSON file
    val reader = new FileReader(args(0))
    println(parseAll(value, reader))
  }
}
I'm traversing a directory tree, which contains directories and files. I know I could use os.walk for this, but this is just an example of what I'm doing, and the end result has to be recursive.
The function to get the data out is below:
def walkfn(dirname):
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        if os.path.isdir(path):
            print(name)
            walkfn(path)
        elif os.path.isfile(path):
            print(name)
Assuming we had a directory structure such as this:
testDir/
    a/
        1/
        2/
            testa2.txt
        testa.txt
    b/
        3/
            testb3.txt
        4/
The code above would print the following:
a
testa.txt
1
2
testa2.txt
c
d
b
4
3
testb3.txt
It's doing what I would expect at this point, and the values are all correct, but I'm trying to get this data into a JSON object. I've seen that I can add these into nested dictionaries, and then convert it to JSON, but I've failed miserably at getting them into nested dictionaries using this recursive method.
The JSON I'm expecting out would be something like:
{
"test": {
"b": {
"4": {},
"3": {
"testb3.txt": null
}
},
"a": {
"testa.txt": null,
"1": {},
"2": {
"testa2.txt": null
}
}
}
}
You should pass json_data into your recursive function:
import os
from pprint import pprint
from typing import Dict, Optional


def walkfn(dirname: str, json_data: Optional[Dict] = None) -> Dict:
    # Create a fresh dictionary only when none was passed in
    if json_data is None:
        json_data = dict()
    for name in os.listdir(dirname):
        path = os.path.join(dirname, name)
        if os.path.isdir(path):
            # A directory becomes a nested dictionary, filled by the recursive call
            json_data[name] = dict()
            json_data[name] = walkfn(path, json_data=json_data[name])
        elif os.path.isfile(path):
            # A file is a leaf with no content
            json_data.update({name: None})
    return json_data


json_data = walkfn(dirname="your_dir_name")
pprint(json_data)
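The question asks for JSON rather than a Python dictionary, so the last step is a call to json.dumps. Below is a minimal sketch of the whole round trip. The names walk_to_dict and build_demo_tree are hypothetical helpers introduced only for this illustration, and the directory layout is recreated in a temporary directory so that the snippet can run anywhere. Entries are sorted so that the output order does not depend on the file system.

```python
import json
import os
import tempfile
from typing import Dict, Optional


def walk_to_dict(dirname: str) -> Dict[str, Optional[dict]]:
    """Return a nested dict: directories map to dicts, files map to None."""
    tree: Dict[str, Optional[dict]] = {}
    for name in sorted(os.listdir(dirname)):
        path = os.path.join(dirname, name)
        if os.path.isdir(path):
            tree[name] = walk_to_dict(path)
        elif os.path.isfile(path):
            tree[name] = None
    return tree


def build_demo_tree(root: str) -> None:
    """Recreate the example layout from the question (hypothetical helper)."""
    for sub in ("a/1", "a/2", "b/3", "b/4"):
        os.makedirs(os.path.join(root, *sub.split("/")))
    for rel in ("a/testa.txt", "a/2/testa2.txt", "b/3/testb3.txt"):
        with open(os.path.join(root, *rel.split("/")), "w") as fh:
            fh.write("")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    # None is serialized as JSON null, an empty dict as {}
    demo_json = json.dumps({"test": walk_to_dict(tmp)}, indent=2, sort_keys=True)
    print(demo_json)
```

Returning a new dictionary from each call, instead of passing one down, avoids the optional argument entirely; either style should give the same structure.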
This is the sort of data I have in my JSON file:
{"globals":{"code":"1111","country_code":"8888","hits":80,"extra_hit":1,"keep_money":true},"time_window":{"from":"2020.12.14 08:40:00","to":"2020.12.14 08:45:00"},"car":{"have":"nope"}}
After I run it through this Groovy code in JMeter:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
import groovy.json.JsonOutput
def jsonSlurper = new JsonSlurper().parse(new File("C:/pathToFile/test.json"))
log.info(jsonSlurper.toString())
jsonSlurper.globals.hits = 70
jsonSlurper.time_window.from = "2020.12.14 08:42:00"
jsonSlurper.time_window.to = "2020.12.14 08:48:00"
def builder = new JsonBuilder(jsonSlurper)
log.info(builder.toString())
def json_str = JsonOutput.toJson(builder)
def json_beauty = JsonOutput.prettyPrint(json_str)
log.info(json_beauty.toString())
File file = new File("C:/pathToFile/test.json")
file.write(json_beauty)
The JSON file is updated, but all the data is wrapped in a new object, "content":
{
"content": {
"globals": {
"code":"1111",
"country_code": "8888",
"hits": 70,
"extra_hit": 1,
"keep_money": true
},
"time_window": {
"from": "2020.12.14 08:42:00",
"to": "2020.12.14 08:48:00"
},
"car": {
"have": "nope"
}
}
}
How can I avoid the wrapping into the "content" object?
Copying and pasting code from the Internet without any idea of what it does is not the best way to proceed; at some point you will end up running a Barmin's patch.
The wrapping happens because JsonOutput.toJson(builder) serializes the JsonBuilder object itself, whose content property holds your data. I expect that you're looking for the JsonBuilder.toPrettyString() function, so basically everything which goes after this line:
def builder = new JsonBuilder(jsonSlurper)
can be replaced with:
new File("C:/pathToFile/test.json").text = builder.toPrettyString()
More information:
Apache Groovy: Parsing and producing JSON
Apache Groovy - Why and How You Should Use It
An example of the sort of object I need to grab from my JSON is shown below (src):
{
"test": {
"attra": "2017-10-12T11:17:52.971Z",
"attrb": "2017-10-12T11:20:58.374Z"
},
"dummyCheck": false,
"type": "object",
"ruleOne": {
"default": 2557
},
"ruleTwo": {
"default": 2557
}
}
From the example above I want to access the default value under "ruleOne".
I've tried messing about with several different things below but I seem to be struggling. I can grab values like "dummyCheck" ok. What's the best way to key into where I need to go?
Example of how I am trying to get the value below:
import org.json4s._
import org.json4s.native.JsonMethods._
import org.json4s.DefaultFormats
implicit val formats = DefaultFormats
val test = parse(src)
println((test \ "ruleOne.default").extract[Integer])
Edit:
To further extend what is above:
def extractData(data: java.io.File) = {
val json = parse(data)
val result = (json \ "ruleOne" \ "default").extract[Int]
result
}
If I were to extend the above into a function that is called like this:
extractData(src)
it would only ever give me ruleOne.default. Is there a way to extend it so that I can dynamically pass it multiple string arguments that form the path (like a splat)?
def extractData(data: java.io.File, path: String*) = {
val json = parse(data)
val result = (json \ path: _*).extract[Int]
result
}
so consuming it would be like
extractData(src, "ruleOne", "default")
This here works with "json4s-jackson" % "3.6.0-M2", but it should work in exactly the same way with native backend.
val src = """
|{
| "test": {
| "attra": "2017-10-12T11:17:52.971Z",
| "attrb": "2017-10-12T11:20:58.374Z"
| },
| "dummyCheck": false,
| "type": "object",
| "ruleOne": {
| "default": 2557
| },
| "ruleTwo": {
| "default": 2557
| }
|}""".stripMargin
import org.json4s._
import org.json4s.jackson.JsonMethods._
import org.json4s.DefaultFormats
implicit val formats = DefaultFormats
val test = parse(src)
println((test \ "ruleOne" \ "default").extract[Int])
Output:
2557
To make it work with native, simply replace
import org.json4s.jackson.JsonMethods._
by
import org.json4s.native.JsonMethods._
and make sure that you have the right dependencies.
EDIT
Here is a vararg method that transforms string parameters into a path:
def extract(json: JValue, path: String*): Int = {
path.foldLeft(json)(_ \ _).extract[Int]
}
With this, you can now do:
println(extract(test, "ruleOne", "default"))
println(extract(test, "ruleTwo", "default"))
Note that it accepts a JValue, not a File, because the version with File would be unnecessarily painful to test, whereas JValue-version can be tested with parsed string constants.
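The fold over path segments is not specific to json4s. Purely as an illustration of the technique, here is the same idea sketched in Python with functools.reduce; the extract function and the shortened SRC sample are assumptions made for this example and are not part of any Scala library.

```python
import json
from functools import reduce
from typing import Any

# Shortened version of the sample document from the question
SRC = """
{
  "dummyCheck": false,
  "ruleOne": {"default": 2557},
  "ruleTwo": {"default": 2557}
}
"""


def extract(data: Any, *path: str) -> Any:
    """Fold the path segments over the parsed JSON, one lookup per segment."""
    return reduce(lambda node, key: node[key], path, data)


parsed = json.loads(SRC)
rule_one_default = extract(parsed, "ruleOne", "default")
print(rule_one_default)
```

One difference to keep in mind: in this sketch a missing key raises KeyError, whereas in json4s the \ operator yields JNothing and the failure only surfaces when extract is called.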
I am trying to use Spark to process JSON data with a variable structure (nested JSON). The input JSON data could be very large, with more than 1000 keys per row, and one batch could be more than 20 GB.
The entire batch is generated from 30 data sources. 'key2' of each JSON record can be used to identify the source, and the structure for each source is predefined.
What would be the best approach for processing such data?
I have tried using from_json as shown below, but it works only with a fixed schema; to use it, I first need to group the data by source and then apply the matching schema.
Because of the large data volume, I would prefer to scan the data only once and extract the required values from each source based on its predefined schema.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
val df = data.toDF
val schema = (new StructType)
.add("key1", StringType)
.add("key2", StringType)
.add("key3", (new StructType)
.add("key3_k1", StringType))
df.select(from_json($"value",schema).as("json_str"))
.select($"json_str.key3.key3_k1").collect
res17: Array[org.apache.spark.sql.Row] = Array([xxx])
This is just a restatement of @Ramesh Maharjan's answer, but with more modern Spark syntax.
I found this method lurking in DataFrameReader which allows you to parse JSON strings from a Dataset[String] into an arbitrary DataFrame and take advantage of the same schema inference Spark gives you with spark.read.json("filepath") when reading directly from a JSON file. The schema of each row can be completely different.
def json(jsonDataset: Dataset[String]): DataFrame
Example usage:
val jsonStringDs = spark.createDataset[String](
Seq(
("""{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}"""),
("""{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"}""")))
jsonStringDs.show
jsonStringDs:org.apache.spark.sql.Dataset[String] = [value: string]
+----------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                 |
+----------------------------------------------------------------------------------------------------------------------+
|{"firstname": "Sherlock", "lastname": "Holmes", "address": {"streetNumber": 121, "street": "Baker", "city": "London"}}|
|{"name": "Amazon", "employeeCount": 500000, "marketCap": 817117000000, "revenue": 177900000000, "CEO": "Jeff Bezos"} |
+----------------------------------------------------------------------------------------------------------------------+
val df = spark.read.json(jsonStringDs)
df.show(false)
df:org.apache.spark.sql.DataFrame = [CEO: string, address: struct ... 6 more fields]
+----------+------------------+-------------+---------+--------+------------+------+------------+
|CEO |address |employeeCount|firstname|lastname|marketCap |name |revenue |
+----------+------------------+-------------+---------+--------+------------+------+------------+
|null |[London,Baker,121]|null |Sherlock |Holmes |null |null |null |
|Jeff Bezos|null |500000 |null |null |817117000000|Amazon|177900000000|
+----------+------------------+-------------+---------+--------+------------+------+------------+
The method is available from Spark 2.2.0:
http://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameReader#json(jsonDataset:org.apache.spark.sql.Dataset[String]):org.apache.spark.sql.DataFrame
If you have data as you mentioned in the question as
val data = sc.parallelize(
"""{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1"}}"""
:: Nil)
You don't need to create a schema for the JSON data. Spark SQL can infer the schema from the JSON string. You just have to use SQLContext.read.json as below:
val df = sqlContext.read.json(data)
which will give you schema as below for the rdd data used above
root
|-- key1: string (nullable = true)
|-- key2: string (nullable = true)
|-- key3: struct (nullable = true)
| |-- key3_k1: string (nullable = true)
And you can just select key3_k1 as
df.select("key3.key3_k1").show(false)
//+-------+
//|key3_k1|
//+-------+
//|key3_v1|
//+-------+
You can manipulate the dataframe as you wish. I hope the answer is helpful
I am not sure whether my suggestion will help you, but I had a similar case and solved it as follows:
1) So the idea is to use json rapture (or some other json library) to
load JSON schema dynamically. For instance you could read the 1st
row of the json file to discover the schema(similarly to what I do
here with jsonSchema)
2) Generate schema dynamically. First iterate through the dynamic
fields (notice that I project values of key3 as Map[String, String])
and add a StructField for each one of them to schema
3) Apply the generated schema into your dataframe
import rapture.json._
import jsonBackends.jackson._
val jsonSchema = """{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v1", "key3_k2":"key3_v2", "key3_k3":"key3_v3"}}"""
val json = Json.parse(jsonSchema)
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.{StringType, StructType}
val schema = ArrayBuffer[StructField]()
//we could do this dynamic as well with json rapture
schema.appendAll(List(StructField("key1", StringType), StructField("key2", StringType)))
val items = ArrayBuffer[StructField]()
json.key3.as[Map[String, String]].foreach{
case(k, v) => {
items.append(StructField(k, StringType))
}
}
val complexColumn = new StructType(items.toArray)
schema.append(StructField("key3", complexColumn))
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkConf = new SparkConf().setAppName("dynamic-json-schema").setMaster("local")
val spark = SparkSession.builder().config(sparkConf).getOrCreate()
val jsonDF = spark.read.schema(StructType(schema.toList)).json("""your_path\data.json""")
jsonDF.select("key1", "key2", "key3.key3_k1", "key3.key3_k2", "key3.key3_k3").show()
I used the following data as input:
{"key1":"val1","key2":"source1","key3":{"key3_k1":"key3_v11", "key3_k2":"key3_v21", "key3_k3":"key3_v31"}}
{"key1":"val2","key2":"source2","key3":{"key3_k1":"key3_v12", "key3_k2":"key3_v22", "key3_k3":"key3_v32"}}
{"key1":"val3","key2":"source3","key3":{"key3_k1":"key3_v13", "key3_k2":"key3_v23", "key3_k3":"key3_v33"}}
And the output:
+----+-------+--------+--------+--------+
|key1| key2| key3_k1| key3_k2| key3_k3|
+----+-------+--------+--------+--------+
|val1|source1|key3_v11|key3_v21|key3_v31|
|val2|source2|key3_v12|key3_v22|key3_v32|
|val3|source3|key3_v13|key3_v23|key3_v33|
+----+-------+--------+--------+--------+
An advanced alternative, which I haven't tested yet, would be to generate a case class (e.g. called JsonRow) from the JSON schema. That gives you a strongly typed Dataset, which provides better serialization performance and also makes your code more maintainable. To make this work, you first need to create a JsonRow.scala file, and then implement an sbt pre-build script that modifies the content of JsonRow.scala (you might have more than one, of course) dynamically based on your source files. To generate the class JsonRow dynamically, you can use the following code:
def generateClass(members: Map[String, String], name: String) : Any = {
val classMembers = for (m <- members) yield {
s"${m._1}: String"
}
val classDef = s"""case class ${name}(${classMembers.mkString(",")});scala.reflect.classTag[${name}].runtimeClass"""
classDef
}
The method generateClass accepts a map of strings, used to create the class members, and the class name itself. You can again populate the members of the generated class from your JSON schema:
import org.codehaus.jackson.node.{ObjectNode, TextNode}
import collection.JavaConversions._
val mapping = collection.mutable.Map[String, String]()
val fields = json.$root.value.asInstanceOf[ObjectNode].getFields
for (f <- fields) {
(f.getKey, f.getValue) match {
case (k: String, v: TextNode) => mapping(k) = v.asText
case (k: String, v: ObjectNode) => v.getFields.foreach(f => mapping(f.getKey) = f.getValue.asText)
case _ => None
}
}
val dynClass = generateClass(mapping.toMap, "JsonRow")
println(dynClass)
This prints out:
case class JsonRow(key3_k2: String,key3_k1: String,key1: String,key2: String,key3_k3: String);scala.reflect.classTag[JsonRow].runtimeClass
Good luck
I was wondering if there is a parser or an easy way to iterate through a JSON object in Scala without knowing its keys/schema ahead of time. I took a look at a few libraries like json4s, but they still seem to require knowing the schema before extracting the fields. I just want to iterate over each field, extract it and print out its value, something like:
json.foreachkey(key -> println(key + ":" + json.get(key)))
In Play JSON you initially parse your JSON into a JsValue; you can then pattern-match on it to determine whether it is a JsObject (note that you can get its fields using fields or value), a JsArray (again, note the value), or a primitive such as JsString or JsNull.
def parse(jsVal: JsValue) {
jsVal match {
case json: JsObject =>
case json: JsArray =>
case json: JsString =>
...
}
}
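The match above is the core of a recursive traversal: objects and arrays recurse into their children, and primitives are the leaves. To make the shape of that recursion concrete, here is a small sketch in Python rather than Play JSON; iter_leaves and the sample document are made up for this illustration, and isinstance checks stand in for the pattern match.

```python
import json
from typing import Any, Iterator, Tuple


def iter_leaves(node: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every primitive value in parsed JSON."""
    if isinstance(node, dict):
        # Object: recurse into each field, extending the dotted path
        for key, value in node.items():
            yield from iter_leaves(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        # Array: recurse into each element, using its index in the path
        for index, value in enumerate(node):
            yield from iter_leaves(value, f"{prefix}[{index}]")
    else:
        # Primitive (string, number, boolean or null): a leaf
        yield prefix, node


doc = json.loads('{"numbers": [1, 2], "menu": {"id": "file", "popup": null}}')
leaves = list(iter_leaves(doc))
for path, value in leaves:
    print(f"{path}: {value}")
```

No schema is needed at any point; the structure of the parsed value alone drives the traversal. Empty objects and arrays produce no leaves in this sketch.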
If by json you mean any JValue, then json4s seems to have this functionality out of the box:
scala> import org.json4s.JsonDSL._
import org.json4s.JsonDSL._
scala> import org.json4s.native.JsonMethods._
import org.json4s.native.JsonMethods._
scala> val json = parse(""" { "numbers" : [1, 2, 3, 4] } """)
json: org.json4s.JValue = JObject(List((numbers,JArray(List(JInt(1), JInt(2), JInt(3), JInt(4))))))
scala> compact(render(json))
res1: String = {"numbers":[1,2,3,4]}
Use liftweb; it allows you to parse the JSON into a JValue first and then extract native Scala objects from it, no matter the schema:
val jsonString = """{"menu": {
| "id": "file",
| "value": "File",
| "popup": {
| "menuitem": [
| {"value": "New", "onclick": "CreateNewDoc()"},
| {"value": "Open", "onclick": "OpenDoc()"},
| {"value": "Close", "onclick": "CloseDoc()"}
| ]
| }
|}}""".stripMargin
val jVal: JValue = parse(jsonString)
jVal.values
>>> Map(menu -> Map(id -> file, value -> File, popup -> Map(menuitem -> List(Map(value -> New, onclick -> CreateNewDoc()), Map(value -> Open, onclick -> OpenDoc()), Map(value -> Close, onclick -> CloseDoc())))))