Convert Spark decision tree model debug string to nested JSON in Scala

Similar to the tree JSON parsing quoted here, I am trying to implement a simple visualization of decision trees in Scala. It should look exactly like the display method available in Databricks notebooks.
I am new to Scala and struggling to get the logic right. I understand that we have to make recursive calls to build the children and stop when the final prediction values are reached. I have attempted the code below, using the input model debug string shown inside it:
def getStatmentType(x: String): (String, String) = {
val ifPattern = "If+".r
val ifelsePattern = "Else+".r
var t = ifPattern.findFirstIn(x.toString)
if(t != None){
("If", (x.toString).replace("If",""))
}else {
var ts = ifelsePattern.findFirstIn(x.toString)
if(ts != None) ("Else", (x.toString).replace("Else", ""))
else ("None", (x.toString).replace("(", "").replace(")",""))
}
}
def delete[A](test:List[A])(i: Int) = test.take(i) ++ test.drop((i+1))
def BuildJson(tree:List[String]):List[Map[String, Any]] = {
var block:List[Map[String, Any]] = List()
var lines:List[String] = tree
loop.breakable {
while (lines.length > 0) {
println("here")
var (cond, name) = getStatmentType(lines(0))
println("initial" + cond)
if (cond == "If") {
println("if" + cond)
// lines = lines.tail
lines = delete(lines)(0)
block = block :+ Map("if-name" -> name, "children" -> BuildJson(lines))
println("After pop Else State"+lines(0))
val (p_cond, p_name) = getStatmentType(lines(0))
// println(p_cond + " = "+ p_name+ "\n")
cond = p_cond
name = p_name
println(cond + " after="+ name+ "\n")
if (cond == "Else") {
println("else" + cond)
lines = lines.tail
block = block :+ Map("else-name" -> name, "children" -> BuildJson(lines))
}
}else if( cond == "None") {
println(cond + "NONE")
lines = delete(lines)(0)
block = block :+ Map("predict" -> name)
}else {
println("Finaly Break")
println("While loop--" +lines)
loop.break()
}
}
}
block
}
def treeJson1(str: String):JsValue = {
val str = "If (feature 0 in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0})\n If (feature 0 in {6.0})\n Predict: 17.0\n Else (feature 0 not in {6.0})\n Predict: 6.0\n Else (feature 0 not in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0})\n Predict: 20.0"
val x = str.replace(" ","")
val xs = x.split("\n").toList
var js = BuildJson(xs)
println(MapReader.mapToJson(js))
Json.toJson("")
}
Expected output:
[
{
'name': 'Root',
'children': [
{
'name': 'feature 0 in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0}',
'children': [
{
'name': 'feature 0 in {6.0}',
'children': [
{
'name': 'Predict: 17.0'
}
]
},
{
'name': 'feature 0 not in {6.0}',
'children': [
{
'name': 'Predict: 6.0'
}
]
}
]
},
{
'name': 'feature 0 not in {1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0}',
'children': [
{
'name': 'Predict: 20.0'
}
]
}
]
}
]

You don't need to parse the debug string; instead, you can walk the tree starting from the model's root node.
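If you do want to stay with the debug string, as in the question, the recursion can be sketched roughly as follows. This is a minimal sketch, not a Spark API: the names `Node`, `DebugStringTree`, `parseSubtree` and `toJson` are made up for illustration, and it assumes every `If (...)` line is eventually followed by a matching `Else (...)` line and that leaves start with `Predict`. Any header line before the first `If`/`Predict` is skipped.

```scala
// Hypothetical helper types; none of these names come from Spark.
final case class Node(name: String, children: List[Node] = Nil)

object DebugStringTree {
  private val IfLine   = """If \((.*)\)""".r
  private val ElseLine = """Else \((.*)\)""".r

  // Parses one subtree and returns its nodes plus the lines not yet consumed.
  def parseSubtree(lines: List[String]): (List[Node], List[String]) = lines match {
    case IfLine(cond) :: afterIf =>
      val (ifChildren, rest) = parseSubtree(afterIf)
      rest match {
        case ElseLine(elseCond) :: afterElse =>
          val (elseChildren, remaining) = parseSubtree(afterElse)
          (List(Node(cond, ifChildren), Node(elseCond, elseChildren)), remaining)
        case _ =>
          throw new IllegalArgumentException(s"Expected an Else branch after: $cond")
      }
    case leaf :: remaining if leaf.startsWith("Predict") =>
      (List(Node(leaf)), remaining) // a prediction ends the recursion
    case other =>
      throw new IllegalArgumentException(s"Unexpected input: ${other.headOption}")
  }

  def parse(debugString: String): Node = {
    val lines = debugString.split("\n").map(_.trim).filter(_.nonEmpty).toList
      .dropWhile(l => !l.startsWith("If") && !l.startsWith("Predict")) // skip any header
    Node("Root", parseSubtree(lines)._1)
  }

  // Renders the tree as compact JSON using plain string concatenation.
  def toJson(node: Node): String = {
    val escaped   = node.name.replace("\\", "\\\\").replace("\"", "\\\"")
    val nameField = "\"name\":\"" + escaped + "\""
    if (node.children.isEmpty) "{" + nameField + "}"
    else "{" + nameField + ",\"children\":[" + node.children.map(toJson).mkString(",") + "]}"
  }
}
```

Calling `DebugStringTree.toJson(DebugStringTree.parse(str))` on the string from the question should give one `Root` object whose children mirror the If/Else branches; wrap it in `[` and `]` if the visualization expects an array.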

Related

How to print JSON objects in AWK

I was looking for some built-in functions inside awk to easily generate JSON objects. I came across several answers and decided to create my own.
I'd like to generate JSON from multidimensional arrays, where I store table style data, and to use separate and dynamic definition of JSON schema to be generated from that data.
Desired output:
{
"Name": JanA
"Surname": NowakA
"ID": 1234A
"Role": PrezesA
}
{
"Name": JanD
"Surname": NowakD
"ID": 12341D
"Role": PrezesD
}
{
"Name": JanC
"Surname": NowakC
"ID": 12342C
"Role": PrezesC
}
Input file:
pierwsza linia
druga linia
trzecia linia
dane wspólników
imie JanA
nazwisko NowakA
pesel 11111111111A
funkcja PrezesA
imie Ja"nD
nazwisko NowakD
pesel 11111111111
funkcja PrezesD
imie JanC
nazwisko NowakC
pesel 12342C
funkcja PrezesC
czwarta linia
reprezentanci
imie Tomek
Based on the input file, I created a multidimensional array:
JanA NowakA 1234A PrezesA
JanD NowakD 12341D PrezesD
JanC NowakC 12342C PrezesC
I'll take a stab at a gawk solution. The indenting isn't perfect and the results aren't ordered (see "Sorting" note below), but it's at least able to walk a true multidimensional array recursively and should produce valid, parsable JSON from any array. Bonus: the data array is the schema. Array keys become JSON keys. There's no need to create a separate schema array in addition to the data array.
Just be sure to use the true multidimensional array[d1][d2][d3]... convention of constructing your data array, rather than the concatenated index array[d1,d2,d3...] convention.
Update:
I've got an updated JSON gawk script posted as a GitHub Gist. Although the script below is tested as working with OP's data, I might've made improvements since this post was last edited. Please see the Gist for the most thoroughly tested, bug-squashed version.
#!/usr/bin/gawk -f
BEGIN { IGNORECASE = 1 }
$1 ~ "imie" { record[++idx]["name"] = $2 }
$1 ~ "nazwisko" { record[idx]["surname"] = $2 }
$1 ~ "pesel" { record[idx]["ID"] = $2 }
$1 ~ "funkcja" { record[idx]["role"] = $2 }
END { print serialize(record, "\t") }
# ==== FUNCTIONS ====
function join(arr, sep, _p, i) {
# syntax: join(array, string separator)
# returns a string
for (i in arr) {
_p["result"] = _p["result"] ~ "[[:print:]]" ? _p["result"] sep arr[i] : arr[i]
}
return _p["result"]
}
function quote(str) {
gsub(/\\/, "\\\\", str)
gsub(/\r/, "\\r", str)
gsub(/\n/, "\\n", str)
gsub(/\t/, "\\t", str)
return "\"" str "\""
}
function serialize(arr, indent_with, depth, _p, i, idx) {
# syntax: serialize(array of arrays, indent string)
# returns a JSON formatted string
# sort arrays on key, ensures [...] values remain properly ordered
if (!PROCINFO["sorted_in"]) PROCINFO["sorted_in"] = "#ind_num_asc"
# determine whether array is indexed or associative
for (i in arr) {
_p["assoc"] = or(_p["assoc"], !(++_p["idx"] in arr))
}
# if associative, indent
if (_p["assoc"]) {
for (i = ++depth; i--;) {
_p["end"] = _p["indent"]; _p["indent"] = _p["indent"] indent_with
}
}
for (i in arr) {
# If key length is 0, assume its an empty object
if (!length(i)) return "{}"
# quote key if not already quoted
_p["key"] = i !~ /^".*"$/ ? quote(i) : i
if (isarray(arr[i])) {
if (_p["assoc"]) {
_p["json"][++idx] = _p["indent"] _p["key"] ": " \
serialize(arr[i], indent_with, depth)
} else {
# if indexed array, dont print keys
_p["json"][++idx] = serialize(arr[i], indent_with, depth)
}
} else {
# quote if not numeric, boolean, null, already quoted, or too big for match()
if (!((arr[i] ~ /^[0-9]+([\.e][0-9]+)?$/ && arr[i] !~ /^0[0-9]/) ||
arr[i] ~ /^true|false|null|".*"$/) || length(arr[i]) > 1000)
arr[i] = quote(arr[i])
_p["json"][++idx] = _p["assoc"] ? _p["indent"] _p["key"] ": " arr[i] : arr[i]
}
}
# I trial and errored the hell out of this. Problem is, gawk cant distinguish between
# a value of null and no value. I think this hack is as close as I can get, although
# [""] will become [].
if (!_p["assoc"] && join(_p["json"]) == "\"\"") return "[]"
# surround with curly braces if object, square brackets if array
return _p["assoc"] ? "{\n" join(_p["json"], ",\n") "\n" _p["end"] "}" \
: "[" join(_p["json"], ", ") "]"
}
Output resulting from OP's example data:
[{
"ID": "1234A",
"name": "JanA",
"role": "PrezesA",
"surname": "NowakA"
}, {
"ID": "12341D",
"name": "JanD",
"role": "PrezesD",
"surname": "NowakD"
}, {
"ID": "12342C",
"name": "JanC",
"role": "PrezesC",
"surname": "NowakC"
}, {
"name": "Tomek"
}]
Sorting
Although the results by default are ordered in a manner only gawk understands, it is possible for gawk to sort the results on a field. If you'd like to sort on the ID field for example, add this function:
function cmp_ID(i1, v1, i2, v2) {
if (!isarray(v1) && v1 ~ /"ID"/ ) {
return v1 < v2 ? -1 : (v1 != v2)
}
}
Then insert this line within your END section above print serialize(record):
PROCINFO["sorted_in"] = "cmp_ID"
See Controlling Array Traversal for more information.
My updated awk implementation of a simple array printer with regex-based validation for each column (run using gawk):
function ltrim(s) { sub(/^[ \t]+/, "", s); return s }
function rtrim(s) { sub(/[ \t]+$/, "", s); return s }
function sTrim(s){
return rtrim(ltrim(s));
}
function jsonEscape(jsValue) {
gsub(/\\/, "\\\\", jsValue)
gsub(/"/, "\\\"", jsValue)
gsub(/\b/, "\\b", jsValue)
gsub(/\f/, "\\f", jsValue)
gsub(/\n/, "\\n", jsValue)
gsub(/\r/, "\\r", jsValue)
gsub(/\t/, "\\t", jsValue)
return jsValue
}
function jsonStringEscapeAndWrap(jsValue) {
return "\42" jsonEscape(jsValue) "\42"
}
function jsonPrint(contentArray, contentRowsCount, schemaArray){
result = ""
schemaLength = length(schemaArray)
for (x = 1; x <= contentRowsCount; x++) {
result = result "{"
for(y = 1; y <= schemaLength; y++){
result = result "\42" sTrim(schemaArray[y]) "\42:" sTrim(contentArray[x, y])
if(y < schemaLength){
result = result ","
}
}
result = result "}"
if(x < contentRowsCount){
result = result ",\n"
}
}
return result
}
function jsonValidateAndPrint(contentArray, contentRowsCount, schemaArray, schemaColumnsCount, errorArray){
result = ""
errorsCount = 1
for (x = 1; x <= contentRowsCount; x++) {
jsonRow = "{"
for(y = 1; y <= schemaColumnsCount; y++){
regexValue = schemaArray[y, 2]
jsonValue = sTrim(contentArray[x, y])
isValid = jsonValue ~ regexValue
if(isValid == 0){
errorArray[errorsCount, 1] = "\42" sTrim(schemaArray[y, 1]) "\42"
errorArray[errorsCount, 2] = "\42Value " jsonValue " not match format: " regexValue " \42"
errorArray[errorsCount, 3] = x
errorsCount++
jsonValue = "null"
}
jsonRow = jsonRow "\42" sTrim(schemaArray[y, 1]) "\42:" jsonValue
if(y < schemaColumnsCount){
jsonRow = jsonRow ","
}
}
jsonRow = jsonRow "}"
result = result jsonRow
if(x < contentRowsCount){
result = result ",\n"
}
}
return result
}
BEGIN{
rowsCount =1
matchCount = 0
errorsCount = 0
shareholdersJsonSchema[1, 1] = "Imie"
shareholdersJsonSchema[2, 1] = "Nazwisko"
shareholdersJsonSchema[3, 1] = "PESEL"
shareholdersJsonSchema[4, 1] = "Funkcja"
shareholdersJsonSchema[1, 2] = "\\.*"
shareholdersJsonSchema[2, 2] = "\\.*"
shareholdersJsonSchema[3, 2] = "^[0-9]{11}$"
shareholdersJsonSchema[4, 2] = "\\.*"
errorsSchema[1] = "PropertyName"
errorsSchema[2] = "Message"
errorsSchema[3] = "PositionIndex"
resultSchema[1]= "ShareHolders"
resultSchema[2]= "Errors"
}
/dane wspólników/,/czwarta linia/{
if(/imie/ || /nazwisko/ || /pesel/ || /funkcja/){
if(/imie/){
shareholdersArray[rowsCount, 1] = jsonStringEscapeAndWrap($2)
matchCount++
}
if(/nazwisko/){
shareholdersArray[rowsCount, 2] = jsonStringEscapeAndWrap($2)
matchCount ++
}
if(/pesel/){
shareholdersArray[rowsCount, 3] = $2
matchCount ++
}
if(/funkcja/){
shareholdersArray[rowsCount, 4] = jsonStringEscapeAndWrap($2)
matchCount ++
}
if(matchCount==4){
rowsCount++
matchCount = 0;
}
}
}
END{
shareHolders = jsonValidateAndPrint(shareholdersArray, rowsCount - 1, shareholdersJsonSchema, 4, errorArray)
shareHoldersErrors = jsonPrint(errorArray, length(errorArray) / length(errorsSchema), errorsSchema)
resultArray[1,1] = "\n[\n" shareHolders "\n]\n"
resultArray[1,2] = "\n[\n" shareHoldersErrors "\n]\n"
resultJson = jsonPrint(resultArray, 1, resultSchema)
print resultJson
}
Produces output:
{"ShareHolders":
[
{"Imie":"JanA","Nazwisko":"NowakA","PESEL":null,"Funkcja":"PrezesA"},
{"Imie":"Ja\"nD","Nazwisko":"NowakD","PESEL":11111111111,"Funkcja":"PrezesD"},
{"Imie":"JanC","Nazwisko":"NowakC","PESEL":null,"Funkcja":"PrezesC"}
]
,"Errors":
[
{"PropertyName":"PESEL","Message":"Value 11111111111A not match format: ^[0-9]{11}$ ","PositionIndex":1},
{"PropertyName":"PESEL","Message":"Value 12342C not match format: ^[0-9]{11}$ ","PositionIndex":3}
]
}

Scala: merging two JSON files using AmazonS3Client getObject Futures

I'm trying to merge two JSON files from an S3 bucket. First file pulls fine, but not the second file.
val eventLogJsonFuture = Future(new AmazonS3Client(credentials))
.map(_.getObject(logBucket, logDirectory + "/" + id + "/event_log.json"))
.map(_.getObjectContent)
.map(Source.fromInputStream(_))
.map(_.mkString)
.map(Json.parse) map { archiveEvents =>
Json.toJson(Json.obj("success" -> true, "data" -> archiveEvents))
} recover {
case NonFatal(error) =>
Json.obj("success" -> false, "errorCode" -> "archive_does_not_exist", "message" -> error.getMessage)
}
val infoJsonFuture = Future(new AmazonS3Client(credentials))
.map(_.getObject(logBucket, logDirectory + "/" + id + "/info.json"))
.map(_.getObjectContent)
.map(Source.fromInputStream(_))
.map(_.mkString)
.map(Json.parse) map { archiveInfo =>
Json.toJson(Json.obj("success" -> true, "data" -> archiveInfo))
} recover {
case NonFatal(error) =>
Json.obj("success" -> false, "errorCode" -> "archive_does_not_exist", "message" -> error.getMessage)
}
val combinedJson = for {
eventLogJson <- eventLogJsonFuture
infoJson <- infoJsonFuture
}
yield {
Json.obj("info" -> infoJson, "events" -> eventLogJson)
}
This is what the result JSON looks like ...
Is there another (better?) way of writing this?
Do you need to wait for 3 parts of JSON from different sources?
I can recommend a solution with case class DTOs.
A simple example:
val firstJson = Future {
  // case class json1(...)
}
val secondJson = Future {
  // case class json2(...)
  ...
}
val finalJson = for {
  f <- firstJson
  s <- secondJson
} yield (f, s)
finalJson onComplete {
  case Success(jsons) =>
    // merge json here
    jsons._1 + jsons._2 ...
  case Failure(error) =>
    // handle the error here
}
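To make the pattern concrete without S3 or a JSON library, here is a minimal sketch using only the standard library. The `fetch` function is a hypothetical stand-in for the S3 download (it fails for any key that does not end in `.json`), and JSON is represented as plain strings; both are assumptions for illustration only. The point it shows is that each Future recovers on its own and that both are started before the for-comprehension, so they run concurrently.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.control.NonFatal

object CombineFutures {
  // Hypothetical stand-in for the S3 download; fails for keys not ending in ".json".
  def fetch(key: String): Future[String] = Future {
    if (key.endsWith(".json")) s"""{"source":"$key"}"""
    else throw new NoSuchElementException(s"$key does not exist")
  }

  // Each Future recovers independently, so one missing file does not fail the other.
  def fetchOrError(key: String): Future[String] =
    fetch(key).recover {
      case NonFatal(e) => s"""{"success":false,"message":"${e.getMessage}"}"""
    }

  def combined(eventsKey: String, infoKey: String): Future[String] = {
    // Both futures are created here, before the for-comprehension, so they run concurrently.
    val events = fetchOrError(eventsKey)
    val info   = fetchOrError(infoKey)
    for {
      e <- events
      i <- info
    } yield s"""{"info":$i,"events":$e}"""
  }
}
```

For example, `Await.result(CombineFutures.combined("event_log.json", "info.json"), 5.seconds)` blocks until both parts are available; in a Play controller you would normally return the Future instead of awaiting it.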

How to yield a JSON object from a for loop in scala?

for (character <- content) {
if (character == '\n') {
val current_line = line.mkString
line.clear()
current_line match {
case docStartRegex(_*) => {
startDoc = true
endText = false
endDoc = false
}
case docnoRegex(group) => {
docID = group.trim
}
case docTextStartRegex(_*) => {
startText = true
}
case docTextEndRegex(_*) => {
endText = true
startText = false
}
case docEndRegex(_*) => {
endDoc = true
startDoc = false
es_json = Json.obj(
"_index" -> "ES_SPARK_AP",
"_type" -> "document",
"_id" -> docID,
"_source" -> Json.obj(
"text" -> textChunk.mkString(" ")
)
)
// yield es_json
textChunk.clear()
}
case _ => {
if (startDoc && !endDoc && startText) {
textChunk += current_line.trim
}
}
}
} else {
line += character
}
}
The above for-loop parses a text file and creates a JSON object for each chunk parsed in the loop. This JSON will be sent to Elasticsearch for further processing. In Python, we can yield the JSON and use a generator easily, like:
def func():
for i in range(num):
... some computations ...
yield {
JSON ## JSON is yielded
}
for json in func(): ## we parse through the generator here.
process(json)
I cannot understand how I can use yield in a similar fashion in Scala.
If you want lazy returns, Scala does this using Iterator types. Specifically, if you want to handle values line by line, I'd split the content into lines first with .lines
val content: String = ???
// on JDK 11+ use content.linesIterator, because .lines resolves to java.lang.String.lines there
val results: Iterator[JsValue] =
  for {
    line <- content.lines
  } yield {
    line match {
      case docEndRegex(_*) => ...
    }
  }
You can also use a function directly
def toJson(line: String): JsValue =
  line match {
    case "hi" => Json.obj("line" -> "hi")
    case "bye" => Json.obj("what" -> "a jerk")
  }
val results: Iterator[JsValue] =
  for {
    line <- content.lines
  } yield toJson(line)
This is equivalent to doing
content.lines.map(line => toJson(line))
Or somewhat equivalently in python
lines = (line.strip() for line in content.split("\n"))
jsons = (toJson(line) for line in lines)
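The question's loop only produces a value when a document ends, which `map` alone cannot express. One way to get generator-like behaviour for that case is `Iterator.flatMap` returning `None` for most lines and `Some(...)` at the end of a document. The sketch below is a simplification under stated assumptions: `LazyDocs` and `docs` are hypothetical names, each tag sits on its own line, every non-tag line is treated as document text, and a `(docID, text)` pair stands in for the JSON object.

```scala
import scala.collection.mutable.ListBuffer

object LazyDocs {
  private val DocNo = """<DOCNO>(.*?)</DOCNO>""".r

  // Returns a lazy Iterator of (docID, text) pairs; nothing is parsed until it is consumed.
  def docs(lines: Iterator[String]): Iterator[(String, String)] = {
    var docID = ""
    val chunk = ListBuffer.empty[String]
    lines.flatMap { line =>
      val emitted: Option[(String, String)] = line.trim match {
        case DocNo(id) => docID = id.trim; None
        case "</DOC>" =>
          val doc = (docID, chunk.mkString(" "))
          chunk.clear()
          Some(doc) // plays the role of Python's `yield`
        case "<DOC>" | "<TEXT>" | "</TEXT>" => None
        case text => chunk += text; None
      }
      emitted
    }
  }
}
```

The result can be consumed like the Python generator, for example `for (doc <- LazyDocs.docs(content.split("\n").iterator)) process(doc)`, where `process` is whatever sends the document on. Because the iterator keeps mutable state, it should be traversed only once.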

Serialization error while writing JSON to file

I am reading text files and creating JSON objects (JsValues) in every iteration. I want to save them to a file at every iteration. I am using the Play Framework to create the JSON objects.
class Cleaner {
def getDocumentData() = {
for (i <- no_of_files) {
.... do something ...
some_json = Json.obj("text" -> LARGE_TEXT)
final_json = Json.stringify(some_json)
//save final_json here to a file
}
}
}
I tried using PrintWriter to save that JSON, but I am getting the error Exception in thread "main" org.apache.spark.SparkException: Task not serializable.
How should I correct this? Or is there any other way I can save the JsValue?
UPDATE:
I read that the trait serializable has to be used in this case. I have the following function:
class Cleaner() extends Serializable {
def readDocumentData() {
val conf = new SparkConf()
.setAppName("linkin_spark")
.setMaster("local[2]")
.set("spark.executor.memory", "1g")
.set("spark.rdd.compress", "true")
.set("spark.storage.memoryFraction", "1")
val sc = new SparkContext(conf)
val temp = sc.wholeTextFiles("text_doc.dat")
val docStartRegex = """<DOC>""".r
val docEndRegex = """</DOC>""".r
val docTextStartRegex = """<TEXT>""".r
val docTextEndRegex = """</TEXT>""".r
val docnoRegex = """<DOCNO>(.*?)</DOCNO>""".r
val writer = new PrintWriter(new File("test.json"))
for (fileData <- temp) {
val filename = fileData._1
val content: String = fileData._2
println(s"For $filename, the data is:")
var startDoc = false // This is for the
var endDoc = false // whole file
var startText = false //
var endText = false //
var textChunk = new ListBuffer[String]()
var docID: String = ""
var es_json: JsValue = Json.obj()
for (current_line <- content.lines) {
current_line match {
case docStartRegex(_*) => {
startDoc = true
endText = false
endDoc = false
}
case docnoRegex(group) => {
docID = group.trim
}
case docTextStartRegex(_*) => {
startText = true
}
case docTextEndRegex(_*) => {
endText = true
startText = false
}
case docEndRegex(_*) => {
endDoc = true
startDoc = false
es_json = Json.obj(
"_id" -> docID,
"_source" -> Json.obj(
"text" -> textChunk.mkString(" ")
)
)
writer.write(es_json.toString())
println(es_json.toString())
textChunk.clear()
}
case _ => {
if (startDoc && !endDoc && startText) {
textChunk += current_line.trim
}
}
}
}
}
writer.close()
}
}
This is the function to which I added the trait, but I am still getting the same exception.
I rewrote a smaller version of it:
def foo() {
val conf = new SparkConf()
.setAppName("linkin_spark")
.setMaster("local[2]")
.set("spark.executor.memory", "1g")
.set("spark.rdd.compress", "true")
.set("spark.storage.memoryFraction", "1")
val sc = new SparkContext(conf)
var es_json: JsValue = Json.obj()
val writer = new PrintWriter(new File("test.json"))
for (i <- 1 to 10) {
es_json = Json.obj(
"_id" -> i,
"_source" -> Json.obj(
"text" -> "Eureka!"
)
)
println(es_json)
writer.write(es_json.toString() + "\n")
}
writer.close()
}
This function works fine both with and without Serializable. I cannot understand what's happening.
EDIT: First answer made on phone.
It's not your main class that needs to be serializable, but everything you use inside the RDD processing loop, in this case inside for (fileData <- temp).
It needs to be serializable because Spark data lives on multiple partitions that may be on multiple machines. The functions you apply to this data therefore need to be serializable, so that they can be sent to the other machines where they are executed in parallel.
PrintWriter cannot be serialized, since it refers to a file that is only available on the original machine. Hence the serialization error.
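The effect can be reproduced without Spark. The following sketch uses plain Java serialization, which is what Spark's closure serializer relies on, to show that an object holding a PrintWriter cannot be serialized even if its own class extends Serializable. The class names `TaskWithWriter` and `TaskWithoutWriter` are hypothetical stand-ins for the closure that Spark would ship to the executors.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream, PrintWriter}

object SerializationCheck {
  // Hypothetical stand-ins for a closure: one captures a PrintWriter, the other only a String.
  class TaskWithWriter(val writer: PrintWriter) extends Serializable
  class TaskWithoutWriter(val prefix: String) extends Serializable

  // Attempts Java serialization into an in-memory buffer and reports whether it succeeded.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }
}
```

Marking the enclosing class as Serializable is therefore not enough: every field that travels with the task has to be serializable as well, which is why the writer has to stay on the driver.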
To write your data on the machine that starts the Spark process, you need to take the data that is spread over the cluster, bring it to your machine, and then write it.
To do that, you can collect the result: rdd.collect() takes all the data from the cluster and puts it in the driver's memory. Then you can write it to a file using the PrintWriter.
Like this:
temp.flatMap { fileData =>
val filename = fileData._1
val content: String = fileData._2
println(s"For $filename, the data is:")
var startDoc = false // This is for the
var endDoc = false // whole file
var startText = false //
var endText = false //
var textChunk = new ListBuffer[String]()
var docID: String = ""
var es_json: JsValue = Json.obj()
var results = ArrayBuffer[String]()
for (current_line <- content.lines) {
current_line match {
case docStartRegex(_*) => {
startDoc = true
endText = false
endDoc = false
}
case docnoRegex(group) => {
docID = group.trim
}
case docTextStartRegex(_*) => {
startText = true
}
case docTextEndRegex(_*) => {
endText = true
startText = false
}
case docEndRegex(_*) => {
endDoc = true
startDoc = false
es_json = Json.obj(
"_id" -> docID,
"_source" -> Json.obj(
"text" -> textChunk.mkString(" ")
)
)
results.append(es_json.toString())
println(es_json.toString())
textChunk.clear()
}
case _ => {
if (startDoc && !endDoc && startText) {
textChunk += current_line.trim
}
}
}
}
results
}
.collect()
.foreach(es_json => writer.write(es_json))
If the result is too large for the driver's memory, you can use the saveAsTextFile function, which writes each partition out separately. In this second case, the path you give as an argument will be made into a folder, and each partition of your RDD will be written to a numbered file in it.
Like this:
temp.flatMap { fileData =>
val filename = fileData._1
val content: String = fileData._2
println(s"For $filename, the data is:")
var startDoc = false // This is for the
var endDoc = false // whole file
var startText = false //
var endText = false //
var textChunk = new ListBuffer[String]()
var docID: String = ""
var es_json: JsValue = Json.obj()
var results = ArrayBuffer[String]()
for (current_line <- content.lines) {
current_line match {
case docStartRegex(_*) => {
startDoc = true
endText = false
endDoc = false
}
case docnoRegex(group) => {
docID = group.trim
}
case docTextStartRegex(_*) => {
startText = true
}
case docTextEndRegex(_*) => {
endText = true
startText = false
}
case docEndRegex(_*) => {
endDoc = true
startDoc = false
es_json = Json.obj(
"_id" -> docID,
"_source" -> Json.obj(
"text" -> textChunk.mkString(" ")
)
)
results.append(es_json.toString())
println(es_json.toString())
textChunk.clear()
}
case _ => {
if (startDoc && !endDoc && startText) {
textChunk += current_line.trim
}
}
}
}
results
}
.saveAsTextFile("test.json")

Compare json equality in Scala

How can I compare if two json structures are the same in scala?
For example, if I have:
{
resultCount: 1,
results: [
{
artistId: 331764459,
collectionId: 780609005
}
]
}
and
{
results: [
{
collectionId: 780609005,
artistId: 331764459
}
],
resultCount: 1
}
They should be considered equal
You should be able to simply do json1 == json2, if the json libraries are written correctly. Is that not working for you?
This is with spray-json, although I would expect the same from every json library:
import spray.json._
import DefaultJsonProtocol._
Welcome to Scala version 2.10.4 (OpenJDK 64-Bit Server VM, Java 1.7.0_51).
Type in expressions to have them evaluated.
Type :help for more information.
scala> val json1 = """{ "a": 1, "b": [ { "c":2, "d":3 } ] }""".parseJson
json1: spray.json.JsValue = {"a":1,"b":[{"c":2,"d":3}]}
scala> val json2 = """{ "b": [ { "d":3, "c":2 } ], "a": 1 }""".parseJson
json2: spray.json.JsValue = {"b":[{"d":3,"c":2}],"a":1}
scala> json1 == json2
res1: Boolean = true
Spray-json uses an immutable scala Map to represent a JSON object in the abstract syntax tree resulting from a parse, so it is just Map's equality semantics that make this work.
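The same semantics can be seen with plain Scala collections, without any JSON library. In this small sketch (the names `EqualityDemo`, `obj1`, `obj2`, `arr1`, `arr2` are only for illustration), JSON objects are modelled as Maps and JSON arrays as Lists:

```scala
object EqualityDemo {
  // Objects modelled as Maps: key order is irrelevant to equality.
  val obj1: Map[String, Any] = Map("a" -> 1, "b" -> List(Map("c" -> 2, "d" -> 3)))
  val obj2: Map[String, Any] = Map("b" -> List(Map("d" -> 3, "c" -> 2)), "a" -> 1)

  // Arrays modelled as Lists: element order still matters.
  val arr1: List[Int] = List(1, 2)
  val arr2: List[Int] = List(2, 1)

  val objectsEqual: Boolean = obj1 == obj2
  val arraysEqual: Boolean  = arr1 == arr2
}
```

Here `objectsEqual` is true and `arraysEqual` is false. So an equality check built on Maps ignores the order of object fields but still treats arrays with reordered elements as different, which is usually what you want for JSON.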
You can also use scalatest-json
Example:
it("should fail on slightly different json explaining why") {
val input = """{"someField": "valid json"}""".stripMargin
val expected = """{"someField": "different json"}""".stripMargin
input should matchJson(expected)
}
When the two JSONs don't match, a nice diff is displayed, which is quite useful when working with big JSONs.
Can confirm that it also works just fine with the Jackson library using == operator:
val simpleJson =
"""
|{"field1":"value1","field2":"value2"}
""".stripMargin
val simpleJsonNode = objectMapper.readTree(simpleJson)
val simpleJsonNodeFromString = objectMapper.readTree(simpleJsonNode.toString)
assert(simpleJsonNode == simpleJsonNodeFromString)
spray-json is definitely great, but I use Gson since my project already had a dependency on the Gson library. I am using this in my unit tests; it works well for simple JSON.
import com.google.gson.{JsonParser}
import org.apache.flume.event.JSONEvent
import org.scalatest.FunSuite
class LogEnricherSpec extends FunSuite {
test("compares json to json") {
val parser = new JsonParser()
assert(parser.parse("""
{
"eventType" : "TransferItems",
"timeMillis" : "1234567890",
"messageXml":{
"TransferId" : 123456
}
} """.stripMargin)
==
parser.parse("""
{
"timeMillis" : "1234567890",
"eventType" : "TransferItems",
"messageXml":{
"TransferId" : 123456
}
}
""".stripMargin))
}
}
Calling the method compare_2Json(str1, str2) returns a Boolean value.
Please make sure that the two string parameters are JSON.
Feel free to use and test it.
def compare_2Json(js1:String,js2:String): Boolean = {
var js_str1 = js1
var js_str2 = js2
js_str1=js_str1.replaceAll(" ","")
js_str2=js_str2.replaceAll(" ","")
var issame = false
val arrbuff1 = ArrayBuffer[String]()
val arrbuff2 = ArrayBuffer[String]()
if(js_str1.substring(0,1)=="{" && js_str2.substring(0,1)=="{" || js_str1.substring(0,1)=="["&&js_str2.substring(0,1)=="["){
for(small_js1 <- split_JsonintoSmall(js_str1);small_js2 <- split_JsonintoSmall((js_str2))) {
issame = compare_2Json(small_js1,small_js2)
if(issame == true){
js_str1 = js_str1.substring(0,js_str1.indexOf(small_js1))+js_str1.substring(js_str1.indexOf(small_js1)+small_js1.length)
js_str2 = js_str2.substring(0,js_str2.indexOf(small_js2))+js_str2.substring(js_str2.indexOf(small_js2)+small_js2.length)
}
}
js_str1 = js_str1.substring(1,js_str1.length-1)
js_str2 = js_str2.substring(1,js_str2.length-1)
for(str_js1 <- js_str1.split(","); str_js2 <- js_str2.split(",")){
if(str_js1!="" && str_js2!="")
if(str_js1 == str_js2){
js_str1 = js_str1.substring(0,js_str1.indexOf(str_js1))+js_str1.substring(js_str1.indexOf(str_js1)+str_js1.length)
js_str2 = js_str2.substring(0,js_str2.indexOf(str_js2))+js_str2.substring(js_str2.indexOf(str_js2)+str_js2.length)
}
}
js_str1=js_str1.replace(",","")
js_str2=js_str2.replace(",","")
if(js_str1==""&&js_str2=="")return true
else return false
}
else return false
}
def split_JsonintoSmall(js_str: String):ArrayBuffer[String]={
val arrbuff = ArrayBuffer[String]()
var json_str = js_str
while(json_str.indexOf("{",1)>0 || json_str.indexOf("[",1)>0){
if (json_str.indexOf("{", 1) < json_str.indexOf("[", 1) && json_str.indexOf("{",1)>0 || json_str.indexOf("{", 1) > json_str.indexOf("[", 1) && json_str.indexOf("[",1)<0 ) {
val right = findrealm(1, json_str, '{', '}')
arrbuff += json_str.substring(json_str.indexOf("{", 1), right + 1)
json_str = json_str.substring(0,json_str.indexOf("{",1))+json_str.substring(right+1)
}
else {
if(json_str.indexOf("[",1)>0) {
val right = findrealm(1, json_str, '[', ']')
arrbuff += json_str.substring(json_str.indexOf("[", 1), right + 1)
json_str = json_str.substring(0, json_str.indexOf("[", 1)) + json_str.substring(right + 1)
}
}
}
arrbuff
}
def findrealm(begin_loc: Int, str: String, leftch: Char, rightch: Char): Int = {
var left = str.indexOf(leftch, begin_loc)
var right = str.indexOf(rightch, left)
left = str.indexOf(leftch, left + 1)
while (left < right && left > 0) {
right = str.indexOf(rightch, right + 1)
left = str.indexOf(leftch, left + 1)
}
right
}