How to get differences between two JSONs?

I want to compare two JSONs and get all the differences between them in Scala. For example, I would like to compare:
{"a":"aa", "b": "bb", "c":"cc" }
and
{"c":"cc", "a":"aa", "d":"dd"}
I'd like to get b and d.

If it isn't a restriction, you can use http://json4s.org/; it has a nice diff feature.
Here is an example based on the question:
import org.json4s._
import org.json4s.native.JsonMethods._
val json1 = parse("""{"a":"aa", "b":"bb", "c":"cc"}""")
val json2 = parse("""{"c":"cc", "a":"aa", "d":"dd"}""")
val Diff(changed, added, deleted) = json1 diff json2
It will return:
changed: org.json4s.JsonAST.JValue = JNothing
added: org.json4s.JsonAST.JValue = JObject(List((d,JString(dd))))
deleted: org.json4s.JsonAST.JValue = JObject(List((b,JString(bb))))
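Since the question asks for the keys b and d specifically, here is a small follow-up sketch that pulls just the field names out of the Diff result above:
// Collect the names of the added and deleted fields: List(d, b)
val keys = List(added, deleted).flatMap {
  case JObject(fields) => fields.map(_._1)
  case _               => Nil
}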

In the end, I used JSONassert, which does the same thing.
For example:
String expected = "{id:1,name:\"Joe\",friends:[{id:2,name:\"Pat\",pets:[\"dog\"]},{id:3,name:\"Sue\",pets:[\"bird\",\"fish\"]}],pets:[]}";
String actual = "{id:1,name:\"Joe\",friends:[{id:2,name:\"Pat\",pets:[\"dog\"]},{id:3,name:\"Sue\",pets:[\"cat\",\"fish\"]}],pets:[]}";
JSONAssert.assertEquals(expected, actual, false);
It returns:
friends[id=3].pets[]: Expected bird, but not found ; friends[id=3].pets[]: Contains cat, but not expected
source: http://jsonassert.skyscreamer.org/
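If you want the differences programmatically instead of via an AssertionError, JSONassert also exposes a compare API; a minimal sketch in Scala (the sample strings here are hypothetical):
import org.skyscreamer.jsonassert.{JSONCompare, JSONCompareMode}
val expectedJson = """{"id":1,"pets":["dog"]}"""
val actualJson = """{"id":1,"pets":["cat"]}"""
// compareJSON returns a JSONCompareResult rather than throwing on mismatch
val result = JSONCompare.compareJSON(expectedJson, actualJson, JSONCompareMode.LENIENT)
if (result.failed()) println(result.getMessage)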

Related

Converting csv RDD to map

I have a large CSV (> 500 MB) which I read into a Spark RDD, and I want to store it in a large Map[String, Array[Long]].
The CSV has multiple columns, but I need only the first two for the time being. They are of the form:
A 12312 [some_value] ....
B 123123 [some_value] ....
A 1222 [some_value] ....
C 1231 [some_value] ....
I want my map to basically group by the string and store an array of longs,
so for the above case my map would be:
{"A": [12312, 1222], "B": [123123], "C": [1231]}
But since this map would be huge, I can't simply build it directly.
I read the CSV into a sql.DataFrame.
My code so far (it looks incorrect, though):
def getMap(df: sql.DataFrame, sc: SparkContext): RDD[Map[String, Array[Long]]] = {
  var records = sc.emptyRDD[Map[String, Array[Long]]]
  val rows: RDD[Row] = df.rdd
  rows.foreachPartition(iter => {
    iter.foreach(x =>
      if (records.contains(x.get(0).toString)) {
        val arr = temp_map.getOrElse()
        records = records + (x.get(0).toString -> (temp_map.getOrElse(x.get(0).toString) :+ x.get(1).toString.toLong))
      }
      else {
        val arr = new Array[Long](1)
        arr(0) = x.get(1).toString.toLong
        records = records + (x.get(0).toString -> arr)
      }
    )
  })
}
Thanks in advance!
If I understood your question correctly, you could groupBy the first column and collect_list the second column:
import org.apache.spark.sql.functions._
val newDF = df.groupBy("column1").agg(collect_list("column2"))
newDF.show(false)
val rdd = newDF.rdd.map(r => (r.getString(0), r.getAs[Seq[Long]](1)))
This will give you RDD[(String, Seq[Long])] where the string will be unique
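If you'd rather avoid the SQL aggregation, a minimal alternative sketch that stays in RDD land (assuming df holds the two columns as shown) groups via reduceByKey, so nothing is collected on the driver:
val pairs = df.rdd.map(r => (r.get(0).toString, Array(r.get(1).toString.toLong)))
// Concatenate the per-key arrays across partitions
val grouped = pairs.reduceByKey(_ ++ _)  // RDD[(String, Array[Long])]
// grouped.collectAsMap() would build the Map, but only if it fits in driver memory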

How can I print nulls when converting a dataframe to json in Spark

I have a dataframe that I read from a csv.
CSV:
name,age,pets
Alice,23,dog
Bob,30,dog
Charlie,35,
Reading this into a DataFrame called myData:
+-------+---+----+
| name|age|pets|
+-------+---+----+
| Alice| 23| dog|
| Bob| 30| dog|
|Charlie| 35|null|
+-------+---+----+
Now, I want to convert each row of this dataframe to JSON using myData.toJSON. What I get are the following JSON strings:
{"name":"Alice","age":"23","pets":"dog"}
{"name":"Bob","age":"30","pets":"dog"}
{"name":"Charlie","age":"35"}
I would like the third row's JSON to include the null value, e.g.:
{"name":"Charlie","age":"35", "pets":null}
However, this doesn't seem to be possible. I debugged through the code and saw that Spark's org.apache.spark.sql.catalyst.json.JacksonGenerator class has the following implementation
private def writeFields(
    row: InternalRow, schema: StructType, fieldWriters: Seq[ValueWriter]): Unit = {
  var i = 0
  while (i < row.numFields) {
    val field = schema(i)
    if (!row.isNullAt(i)) {
      gen.writeFieldName(field.name)
      fieldWriters(i).apply(row, i)
    }
    i += 1
  }
}
This seems to skip a column if it is null. I am not quite sure why this is the default behavior, but is there a way to print null values in JSON using Spark's toJSON?
I am using Spark 2.1.0
To print the null values in JSON using Spark's toJSON method, you can use the following code:
myData.na.fill("null").toJSON
It will give you the expected-looking result (note that the null is replaced by the string "null", not a JSON null):
+-------------------------------------------+
|value |
+-------------------------------------------+
|{"name":"Alice","age":"23","pets":"dog"} |
|{"name":"Bob","age":"30","pets":"dog"} |
|{"name":"Charlie","age":"35","pets":"null"}|
+-------------------------------------------+
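For Spark 3.x there is also a per-expression route; a sketch, assuming the ignoreNullFields JSON option available there:
import org.apache.spark.sql.functions.{col, struct, to_json}
// Render all columns as one JSON string, keeping null-valued fields
val json = myData.select(
  to_json(struct(myData.columns.map(col): _*), Map("ignoreNullFields" -> "false")).as("value"))
json.show(false)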
I have modified the JacksonGenerator.writeFields function and included it in my project.
Below are the steps:
1) Create the package 'org.apache.spark.sql.catalyst.json' inside 'src/main/scala/'
2) Copy the JacksonGenerator class
3) Create a JacksonGenerator.scala file in that package and paste the copied code
4) Modify the writeFields function:
private def writeFields(row: InternalRow, schema: StructType, fieldWriters: Seq[ValueWriter]): Unit = {
  var i = 0
  while (i < row.numFields) {
    val field = schema(i)
    if (!row.isNullAt(i)) {
      gen.writeFieldName(field.name)
      fieldWriters(i).apply(row, i)
    } else {
      // Emit an explicit JSON null instead of skipping the field
      gen.writeNullField(field.name)
    }
    i += 1
  }
}
Tested with Spark 3.0.0:
When creating your spark session, set spark.sql.jsonGenerator.ignoreNullFields to false.
The toJSON function internally uses org.apache.spark.sql.catalyst.json.JacksonGenerator, which in turn takes org.apache.spark.sql.catalyst.json.JSONOptions for configuration.
The latter includes an option ignoreNullFields.
However, toJSON uses the defaults, which in the case of this particular option is taken from the sql config given above.
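A minimal sketch of that session setup (the master setting here is a placeholder):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.jsonGenerator.ignoreNullFields", "false")
  .getOrCreate()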
An example with the configuration set to false:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schema = StructType(Seq(StructField("a", StringType), StructField("b", StringType)))
val rows = Seq(Row("a", null), Row(null, "b"))
val frame = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
println(frame.toJSON.collect().mkString("\n"))
produces
{"a":"a","b":null}
{"a":null,"b":"b"}
import org.apache.spark.sql.Row
import scala.util.parsing.json.JSONObject
// Converts a single Row to a JSON string; note this drops null-valued fields
def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap[Any](row.schema.fieldNames).filter(_._2 != null)
  JSONObject(m).toString()
}
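A usage sketch with the myData frame from the question (the Charlie row comes out without the pets field, since nulls are filtered out here):
myData.rdd.map(convertRowToJSON).collect().foreach(println)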

Spark Sql Flatten Json

I have a JSON which looks like this
{"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley","year":2012}]}
I want to store output in a csv file like this:
Michael,{"sname":"stanford", "year":2010}
Michael,{"sname":"berkeley", "year":2012}
I have tried the following:
val people = sqlContext.read.json("people.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))
The above code does not give schools_flat as a JSON string.
Any ideas on how to get the expected output?
Thanks
You need to specify the schema explicitly to read the JSON file in the desired way: declaring schools as Array[String] makes Spark keep each school object as a raw JSON string.
In this case it would be like this:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.StructType
case class json_schema_class(cities: String, name: String, schools: Array[String])
val json_schema = ScalaReflection.schemaFor[json_schema_class].dataType.asInstanceOf[StructType]
val people = sqlContext.read.schema(json_schema).json("people.json")
val flattened = people.select($"name", explode($"schools").as("schools_flat"))
The 'flattened' dataframe looks like this:
+-------+--------------------+
| name| schools_flat|
+-------+--------------------+
|Michael|{"sname":"stanfor...|
|Michael|{"sname":"berkele...|
+-------+--------------------+
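To get from there to the CSV lines in the question, a short follow-up sketch (concatenating the two columns by hand so the embedded JSON stays unquoted; the output path is a placeholder):
flattened.rdd
  .map(r => s"${r.getString(0)},${r.getString(1)}")
  .saveAsTextFile("people_flat_csv")
// Lines look like: Michael,{"sname":"stanford","year":2010}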

Accessing a Single Value from Parsed JObject in Scala (Jackson, json4s)

I have an object like this:
val aa = parse(""" { "vals" : [[1,2,3,4], [4,5,6,7], [8,9,6,3]] } """)
I want to access the value '1' in the first JArray.
println(aa.values ???)
How is this done?
Thanks
One way (with an implicit Formats in scope, as defined below) would be:
val n = (aa \ "vals")(0)(0).extract[Int]
println(n)
Another way is to parse the whole JSON using a case class:
implicit val formats = DefaultFormats
case class Numbers(vals: List[List[Int]])
val numbers = aa.extract[Numbers]
This way you can access the first value of the first list however you like:
for { list <- numbers.vals.headOption; hd <- list.headOption } println(hd)
// or
println(numbers.vals.head.head)
// or ...
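A third sketch: pattern match directly on the AST, which needs no Formats at all (JInt carries a BigInt):
val first = (aa \ "vals") match {
  case JArray(JArray(JInt(n) :: _) :: _) => Some(n) // Some(1)
  case _ => None
}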

Spark RDD to CSV - Add empty columns

I have an RDD[Map[String,Int]] where the keys of the maps are the column names. Each map is incomplete, and to know all the column names I would need to union all the keys. Is there a way to avoid this collect operation, so that I can use a single rdd.saveAsTextFile(..) to get the CSV?
For example, say I have an RDD with two elements (scala notation):
Map("a"->1, "b"->2)
Map("b"->1, "c"->3)
I would like to end up with this csv:
a,b,c
1,2,0
0,1,3
Scala solutions are better but any other Spark-compatible language would do.
EDIT:
I can try to solve my problem from another direction also. Let's say I somehow know all the columns in the beginning, but I want to get rid of columns that have 0 value in all maps. So the problem becomes, I know that the keys are ("a", "b", "c") and from this:
Map("a"->1, "b"->2, "c"->0)
Map("a"->3, "b"->1, "c"->0)
I need to write the csv:
a,b
1,2
3,1
Would it be possible to do this with only one collect?
If your statement is "every new element in my RDD may add a new column name I have not seen so far", the answer is that you obviously can't avoid a full scan. But you don't need to collect all elements on the driver.
You could use aggregate to collect only the column names. This method takes two functions: one inserts a single element into the resulting collection, and the other merges results from two different partitions.
rdd.aggregate(Set.empty[String])((s, m) => s union m.keySet, (s1, s2) => s1 union s2)
You will get back a set of all column names in the RDD. In a second scan you can print the CSV file.
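A sketch of that second scan (colNames stands for the Set returned by the aggregate above):
val header = colNames.toSeq.sorted
val lines = rdd.map(m => header.map(c => m.getOrElse(c, 0)).mkString(","))
// Prepend the header row and write everything out in one pass
(sc.parallelize(Seq(header.mkString(","))) union lines).saveAsTextFile("out.csv")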
Scala and any other supported language
You can use spark-csv
First, let's find all the present columns:
val cols = sc.broadcast(rdd.flatMap(_.keys).distinct().collect())
Create RDD[Row]:
import org.apache.spark.sql.Row
val rows = rdd.map { row =>
  Row.fromSeq(cols.value.map(row.getOrElse(_, 0)))
}
Prepare schema:
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val schema = StructType(
cols.value.map(field => StructField(field, IntegerType, true)))
Convert the RDD[Row] to a DataFrame:
val df = sqlContext.createDataFrame(rows, schema)
Write results:
// Spark 1.4+, for other versions see spark-csv docs
df.write.format("com.databricks.spark.csv").save("mycsv.csv")
You can do pretty much the same thing using other supported languages.
Python
If you use Python and the final data fits in driver memory, you can use Pandas through the toPandas() method:
rdd = sc.parallelize([{'a': 1, 'b': 2}, {'b': 1, 'c': 3}])
cols = sc.broadcast(rdd.flatMap(lambda row: row.keys()).distinct().collect())
df = sqlContext.createDataFrame(
    rdd.map(lambda row: {k: row.get(k, 0) for k in cols.value}))
df.toPandas().to_csv('mycsv.csv', index=False)
or directly:
import pandas as pd
pd.DataFrame(rdd.collect()).fillna(0).to_csv('mycsv.csv', index=False)
Edit
One possible way to handle the second case is to use accumulators, either to build a set of all column names or to count the columns where you found only zeros, and then use this information to map over the rows and remove unnecessary columns or add zeros.
It is possible but inefficient and feels like cheating. The only situation where it makes some sense is when the number of zeros is very low, but I guess that is not the case here.
import org.apache.spark.AccumulatorParam

object ColsSetParam extends AccumulatorParam[Set[String]] {
  def zero(initialValue: Set[String]): Set[String] = Set.empty[String]
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}

val colSetAccum = sc.accumulator(Set.empty[String])(ColsSetParam)
rdd.foreach(row => colSetAccum += row.keys.toSet)
or
// We assume you know this upfront
val allColnames = sc.broadcast(Set("a", "b", "c"))
object ZeroColsParam extends AccumulatorParam[Map[String, Int]] {
  def zero(initialValue: Map[String, Int]): Map[String, Int] = Map.empty[String, Int]
  def addInPlace(m1: Map[String, Int], m2: Map[String, Int]): Map[String, Int] = {
    val keys = m1.keys ++ m2.keys
    keys.map(k => k -> (m1.getOrElse(k, 0) + m2.getOrElse(k, 0))).toMap
  }
}

val accum = sc.accumulator(Map.empty[String, Int])(ZeroColsParam)
rdd.foreach { row =>
  // If allColnames.value -- row.keys.toSet is empty we can avoid this part
  accum += (allColnames.value -- row.keys.toSet).map(x => x -> 1).toMap
}