How to query JSON data according to JSON array's size with Spark SQL?

I have a JSON like this:
{
"uin":10000,
"role":[
{"role_id":1, "role_level": 10},
{"role_id":2, "role_level": 1}
]
}
{ "uin":10001,
"role":[
{"role_id":1, "role_level": 1},
{"role_id":2, "role_level": 1},
{"role_id":3, "role_level": 1},
{"role_id":4, "role_level": 20}
]
}
I want to query a uin which has more than two roles. How can I do it using Spark SQL?

You can use a DataFrame and a UserDefinedFunction to achieve what you want, as shown below. I've tried it in spark-shell.
import org.apache.spark.sql.functions.udf
val jsonRdd = sc.parallelize(Seq("""{"uin":10000,"role":[{"role_id":1, "role_level": 10},{"role_id":2, "role_level": 1}]}"""))
val df = sqlContext.jsonRDD(jsonRdd)
// true when the "role" array has more than two entries
val predict = udf((array: Seq[Any]) => array.length > 2)
val df1 = df.where(predict(df("role")))
df1.show
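If you want the same filter expressed as an actual SQL statement (the question asks for Spark SQL), here is a minimal sketch, still assuming the Spark 1.x spark-shell above; size() is Spark SQL's built-in array/map length function (available since 1.5), also used in a later answer below:
df.registerTempTable("users")
sqlContext.sql("SELECT uin FROM users WHERE size(role) > 2").show()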

Here is a simplified Python version:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
r1 = sqlContext.read.json("role.json").select("uin", "role.role_id")
r1.show()
# UDF returning the length of the role_id array
slen = udf(lambda s: len(s), IntegerType())
r2 = r1.select(r1.uin, r1.role_id, slen(r1.role_id).alias("slen"))
res = r2.filter(r2.slen > 2)
res.show()

Maybe size is what you need:
size(expr) - Returns the size of an array or a map.
In your case, the size of "role" must be greater than 2.
If you have this JSON:
json = \
[
{
"uin":10000,
"role":[
{"role_id":1, "role_level": 10},
{"role_id":2, "role_level": 1}
]
},
{
"uin":10001,
"role":[
{"role_id":1, "role_level": 1},
{"role_id":2, "role_level": 1},
{"role_id":3, "role_level": 1},
{"role_id":4, "role_level": 20}
]
}
]
you can use this:
from pyspark.sql import functions as F
rdd = spark.sparkContext.parallelize([json])
df = spark.read.json(rdd)
df = df.filter(F.size('role') > 2)
df.show()
#+--------------------+-----+
#| role| uin|
#+--------------------+-----+
#|[{1, 1}, {2, 1}, ...|10001|
#+--------------------+-----+

Related

Writing Nested JSON Dictionary List To CSV

Issue
I'm trying to write the following nested list of dictionaries, where each value is itself a list of dictionaries, to CSV. I tried multiple ways but I cannot get it to write properly:
Json Data
[
{
"Basic_Information_Source": [
{
"Image": "image1.png",
"Image_Format": "PNG",
"Image_Mode": "RGB",
"Image_Width": 574,
"Image_Height": 262,
"Image_Size": 277274
}
],
"Basic_Information_Destination": [
{
"Image": "image1_dst.png",
"Image_Format": "PNG",
"Image_Mode": "RGB",
"Image_Width": 574,
"Image_Height": 262,
"Image_Size": 277539
}
],
"Values": [
{
"Value1": 75.05045463635267,
"Value2": 0.006097560975609756,
"Value3": 0.045083481733371615,
"Value4": 0.008639858263904898
}
]
},
{
"Basic_Information_Source": [
{
"Image": "image2.png",
"Image_Format": "PNG",
"Image_Mode": "RGB",
"Image_Width": 1600,
"Image_Height": 1066,
"Image_Size": 1786254
}
],
"Basic_Information_Destination": [
{
"Image": "image2_dst.png",
"Image_Format": "PNG",
"Image_Mode": "RGB",
"Image_Width": 1600,
"Image_Height": 1066,
"Image_Size": 1782197
}
],
"Values": [
{
"Value1": 85.52662890580055,
"Value2": 0.0005464352720450282,
"Value3": 0.013496113910369758,
"Value4": 0.003800236380811839
}
]
}
]
Working Code
I tried the following code and it works, but it only saves the headers and then dumps each underlying list as text in the CSV file:
import json
import csv

def Convert_CSV():
    ar_enc_file = open('analysis_results_enc.json', 'r')
    json_data = json.load(ar_enc_file)
    keys = json_data[0].keys()
    with open('test.csv', 'w', encoding='utf8', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(json_data)
    ar_enc_file.close()

Convert_CSV()
Working Output / Issue with it
The output writes the following header:
Basic_Information_Source
Basic_Information_Destination
Values
And then it dumps all other data inside each header as a list like this:
[{'Image': 'image1.png', 'Image_Format': 'PNG', 'Image_Mode': 'RGB', 'Image_Width': 574, 'Image_Height': 262, 'Image_Size': 277274}]
Expected Output / Sample
Trying to generate the above type of output for each dictionary in the array of dictionaries.
How do I write it properly?
I'm sure someone will come by with a much more elegant solution. That being said:
You have a few problems.
You have inconsistent entries in the fields you want to align.
Even if you pad your data, you have intermediate lists that need to be flattened out.
Then you still have separated data that needs to be merged together.
DictWriter, AFAIK, expects its data in the format of [{'column': 'entry'}, {'column': 'entry'}], so even if you do all the previous steps you're still not in the right format.
So let's get started.
For the first two parts we can combine.
def pad_list(lst, size, padding=None):
    # we wouldn't have to make a copy but I prefer to
    # avoid the possibility of getting bitten by mutability
    _lst = lst[:]
    for _ in range(len(lst), size):
        _lst.append(padding)
    return _lst

# this expects already parsed json data
def flatten(json_data):
    lst = []
    for dct in json_data:
        # here we're just setting a max size of all dict entries
        # this is in case the shorter entry is in the first iteration
        max_size = 0
        # we initialize a dict for each of the list entries
        # this is in case you have inconsistent lengths between lists
        flattened = dict()
        for k, v in dct.items():
            entries = list(next(iter(v), dict()).values())
            flattened[k] = entries
            max_size = max(len(entries), max_size)
        # here we append the padded version of the keys for the dict
        lst.append({k: pad_list(v, max_size) for k, v in flattened.items()})
    return lst
So now we have a flattened list of dicts whose values are lists of consistent length. Essentially:
[
{
"Basic_Information_Source": [
"image1.png",
"PNG",
"RGB",
574,
262,
277274
],
"Basic_Information_Destination": [
"image1_dst.png",
"PNG",
"RGB",
574,
262,
277539
],
"Values": [
75.05045463635267,
0.006097560975609756,
0.045083481733371615,
0.008639858263904898,
None,
None
]
}
]
But this list has multiple dicts that need to be merged, not just one.
So we need to merge.
# this should be self explanatory
def merge(flattened):
    merged = dict()
    for dct in flattened:
        for k, v in dct.items():
            if k not in merged:
                merged[k] = []
            merged[k].extend(v)
    return merged
This gives us something close to this:
{
"Basic_Information_Source": [
"image1.png",
"PNG",
"RGB",
574,
262,
277274,
"image2.png",
"PNG",
"RGB",
1600,
1066,
1786254
],
"Basic_Information_Destination": [
"image1_dst.png",
"PNG",
"RGB",
574,
262,
277539,
"image2_dst.png",
"PNG",
"RGB",
1600,
1066,
1782197
],
"Values": [
75.05045463635267,
0.006097560975609756,
0.045083481733371615,
0.008639858263904898,
None,
None,
85.52662890580055,
0.0005464352720450282,
0.013496113910369758,
0.003800236380811839,
None,
None
]
}
But wait, we still need to format it for the writer.
Our data needs to be in the format of [{'column_1': 'entry', 'column_2': 'entry'}, {'column_1': 'entry', 'column_2': 'entry'}].
So we format:
def format_for_writer(merged):
    formatted = []
    for k, v in merged.items():
        for i, item in enumerate(v):
            # on the first pass this will append an empty dict
            # on subsequent passes it will be ignored
            # and add keys into the existing dict
            if i >= len(formatted):
                formatted.append(dict())
            formatted[i][k] = item
    return formatted
So finally, we have a nice clean formatted data structure we can just hand to our writer function.
def convert_csv(formatted):
    keys = formatted[0].keys()
    with open('test.csv', 'w', encoding='utf8', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(formatted)
Full code with json string:
import json
import csv
json_raw = """\
[
{
"Basic_Information_Source": [
{
"Image": "image1.png",
"Image_Format": "PNG",
"Image_Mode": "RGB",
"Image_Width": 574,
"Image_Height": 262,
"Image_Size": 277274
}
],
"Basic_Information_Destination": [
{
"Image": "image1_dst.png",
"Image_Format": "PNG",
"Image_Mode": "RGB",
"Image_Width": 574,
"Image_Height": 262,
"Image_Size": 277539
}
],
"Values": [
{
"Value1": 75.05045463635267,
"Value2": 0.006097560975609756,
"Value3": 0.045083481733371615,
"Value4": 0.008639858263904898
}
]
},
{
"Basic_Information_Source": [
{
"Image": "image2.png",
"Image_Format": "PNG",
"Image_Mode": "RGB",
"Image_Width": 1600,
"Image_Height": 1066,
"Image_Size": 1786254
}
],
"Basic_Information_Destination": [
{
"Image": "image2_dst.png",
"Image_Format": "PNG",
"Image_Mode": "RGB",
"Image_Width": 1600,
"Image_Height": 1066,
"Image_Size": 1782197
}
],
"Values": [
{
"Value1": 85.52662890580055,
"Value2": 0.0005464352720450282,
"Value3": 0.013496113910369758,
"Value4": 0.003800236380811839
}
]
}
]
"""
def pad_list(lst, size, padding=None):
    _lst = lst[:]
    for _ in range(len(lst), size):
        _lst.append(padding)
    return _lst

def flatten(json_data):
    lst = []
    for dct in json_data:
        max_size = 0
        flattened = dict()
        for k, v in dct.items():
            entries = list(next(iter(v), dict()).values())
            flattened[k] = entries
            max_size = max(len(entries), max_size)
        lst.append({k: pad_list(v, max_size) for k, v in flattened.items()})
    return lst

def merge(flattened):
    merged = dict()
    for dct in flattened:
        for k, v in dct.items():
            if k not in merged:
                merged[k] = []
            merged[k].extend(v)
    return merged

def format_for_writer(merged):
    formatted = []
    for k, v in merged.items():
        for i, item in enumerate(v):
            if i >= len(formatted):
                formatted.append(dict())
            formatted[i][k] = item
    return formatted

def convert_csv(formatted):
    keys = formatted[0].keys()
    with open('test.csv', 'w', encoding='utf8', newline='') as output_file:
        dict_writer = csv.DictWriter(output_file, keys)
        dict_writer.writeheader()
        dict_writer.writerows(formatted)

def main():
    json_data = json.loads(json_raw)
    flattened = flatten(json_data)
    merged = merge(flattened)
    formatted = format_for_writer(merged)
    convert_csv(formatted)

if __name__ == '__main__':
    main()

Find the path of a JSON element with a dynamic key with Play JSON

I am using Play Framework with Scala. I have the following JSON structure:
{
"a": 1540554574847,
"b": 2,
"c": {
"pep3lpnp1n1ugmex5uevekg5k20wkfq3": {
"a": 1,
"b": 1,
"c": 1,
"d": 1
},
"p3zgudnf7tzqvt50g7lpr2ryno7yugmy": {
"b": [
"d10e5600d11e5517"
],
"c": 1,
"d": 1,
"e": 1,
"g": 1,
"h": [
"d10e5600d11e5517",
"d10e5615d11e5527",
"d10e5605d11e5520",
"d10e5610d11e5523",
"d10e5620d11e5530"
],
"q": "a_z6smu56gstysjpqbzp21ruxii6g2ph00"
},
"33qfthhugr36f5ts4251glpqx0o373pe": {
"b": [
"d10e5633d11e5536"
],
"c": 1,
"d": 1,
"e": 1,
"g": 1,
"h": [
"d10e5638d11e5539",
"d10e5633d11e5536",
"d10e5643d11e5542",
"d10e5653d11e5549",
"d10e5648d11e5546"
],
"q": "a_cydo6wu1ds340j3q6qxeig97thocttsp"
}
}
}
I need to fetch values from paths
"c" -> "pep3lpnp1n1ugmex5uevekg5k20wkfq3" -> "b",
"c" -> "p3zgudnf7tzqvt50g7lpr2ryno7yugmy" -> "b",
"c" -> "33qfthhugr36f5ts4251glpqx0o373pe" -> "b", and so on, where "pep3lpnp1n1ugmex5uevekg5k20wkfq3" is dynamic and changes for every JSON input.
Output should be like Seq(object(q,b,c)).
If you don't need to know which generated key belongs to which value, you can use the recursive path \\ operator:
import play.api.libs.json.Json
import play.api.libs.json._
val jsonText = """{
"a":1540554574847,
"b":2,
"c":{
"onegeneratedkey":{
"a":1,
"b":1,
"c":1,
"d":1
},
"secondsonegeneratedkey":{
"a":1,
"b": [1, 2, 3],
"c":1,
"d":1
}
}
}"""
val result: Seq[JsValue] = Json.parse(jsonText) \ "c" \\ "b"
// res: List(1, [1,2,3])
UPD.
To get all values stored inside the object under the generated keys, one can use JsObject#values:
val valuesSeq: Seq[JsValue] = (Json.parse(jsonText) \ "c").toOption // get the 'c' field
  .collect { case o: JsObject => o.values.toSeq } // get the objects that correspond to the generated keys
  .getOrElse(Seq.empty)
// res: Seq({"a":1,"b":1,"c":1,"d":1}, {"a":1,"b":[1,2,3],"c":1,"d":1})
val valuesABC = valuesSeq.map(it => (it \ "a", it \ "b", it \ "c"))
// res: Seq((JsDefined(1),JsDefined(1),JsDefined(1)), (JsDefined(1),JsDefined([1,2,3]),JsDefined(1)))
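If you want something closer to the Seq(object(q, b, c)) shape from the question, here is a minimal sketch on top of valuesSeq; the Entry case class and its field types are assumptions, so adjust them to your real data:
// each element of valuesSeq is one of the objects behind a generated key
case class Entry(q: Option[String], b: Option[JsValue], c: Option[Int])
val entries: Seq[Entry] = valuesSeq.map { obj =>
  Entry((obj \ "q").asOpt[String], (obj \ "b").toOption, (obj \ "c").asOpt[Int])
}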
I misread the question, and this is the modified version.
Here I used json.pick to read the JsObject and iterate over its keys from there.
PS: You don't have to create the Reads or the case classes, but they should make the calling program more readable.
import play.api.libs.json.Json
import play.api.libs.json._
val jsonText =
"""{
"top": {
"level2a": {
"a": 1,
"b": 1,
"c": 1,
"d": 1
},
"level2b": {
"a": 2,
"b": 2,
"nested": {
"b": "not interested"
}
}
}
}"""
case class Data(k: String, v: Int)
case class Datas(list: Seq[Data])
object Datas {
  implicit val reads: Reads[Datas] = (__ \ "top").json.pick.map {
    case obj: JsObject =>
      new Datas(obj.keys.flatMap(k => (obj \ k \ "b").validate[Int] match {
        case JsSuccess(v, _) => Some(Data(k, v))
        case _ => None
      }).toSeq)
  }
}

Json.parse(jsonText).validate[Datas].asOpt match {
  case Some(d) => println(s"found: $d")
  case _ => println("not found")
}
To deserialize the internal structure within level2, you may choose to model the internal structure and use Json.reads to create the default Reads, as long as the data structure is known and predictable.
For example
case class Internal(a: Int, b: Int, c: Option[Int], d: Option[Int])
object Internal {
  implicit val reads = Json.reads[Internal]
}
case class Data(k: String, v: Internal)
case class Datas(list: Seq[Data])
object Datas {
  implicit val reads: Reads[Datas] = (__ \ "top").json.pick.map {
    case obj: JsObject =>
      new Datas(obj.keys.flatMap(k => (obj \ k).validate[Internal].asOpt
        .map(v => Data(k, v))).toSeq)
  }
}

how to remove duplicates from a json defaultdict?

(Re-post with accurate data sample)
I have a json dictionary where each value in turn is a defaultdict as follows:
"Parent_Key_A": [{"a": 1.0, "b": 2.0}, {"a": 5.1, "c": 10}, {"b": 20.3, "a": 1.0}]
I am trying to remove both duplicate keys and values so that each element of the json has unique values. So for the above example, I am looking for output something like this:
"Parent_Key_A": {"a":[1.0,5.1], "b":[2.0,20.3], "c":[10]}
Then I need to write this output to a json file. I tried using set to handle duplicates but set is not json serializable.
Any suggestions on how to handle this?
A solution using the itertools.chain() and itertools.groupby() functions:
import itertools, json
input_d = { "Parent_Key_A": [{"a": 1.0, "b": 2.0}, {"a": 5.1, "c": 10}, {"b": 20.3, "a": 1.0}] }
items = itertools.chain.from_iterable(list(d.items()) for d in input_d["Parent_Key_A"])
# group the (key, value) pairs by key, keeping the sorted unique values per key
input_d["Parent_Key_A"] = {k: [i[1] for i in sorted(set(g))]
                           for k, g in itertools.groupby(sorted(items), key=lambda x: x[0])}
print(input_d)
The output:
{'Parent_Key_A': {'a': [1.0, 5.1], 'b': [2.0, 20.3], 'c': [10]}}
Writing to a JSON file:
json.dump(input_d, open('output.json', 'w+'), indent=4)
output.json contents:
{
"Parent_Key_A": {
"a": [
1.0,
5.1
],
"c": [
10
],
"b": [
2.0,
20.3
]
}
}

json to case class using multiple rows in spark scala

I have a JSON file with logs:
{"a": "cat1", "b": "name", "c": "Caesar", "d": "2016-10-01"}
{"a": "cat1", "b": "legs", "c": "4", "d": "2016-10-01"}
{"a": "cat1", "b": "color", "c": "black", "d": "2016-10-01"}
{"a": "cat1", "b": "tail", "c": "20cm", "d": "2016-10-01"}
{"a": "cat2", "b": "name", "c": "Dickens", "d": "2016-10-02"}
{"a": "cat2", "b": "legs", "c": "4", "d": "2016-10-02"}
{"a": "cat2", "b": "color", "c": "red", "d": "2016-10-02"}
{"a": "cat2", "b": "tail", "c": "15cm", "d": "2016-10-02"}
{"a": "cat2", "b": "ears", "c": "5cm", "d": "2016-10-02"}
{"a": "cat1", "b": "tail", "c": "10cm", "d": "2016-10-10"}
desired output:
("id": "cat1", "name": "Caesar", "legs": "4", "color": "black", "tail": "10cm", "day": "2016-10-10")
("id": "cat2", "name": "Dickens", "legs": "4", "color": "red", "tail": "10cm", "ears": "5cm", "day": "2016-10-02")
I can do it step by step using for loops and collects, but I need to do it the proper way using maps, flatMaps, aggregateByKey, and other Spark magic.
case class cat_input(a: String, b: String, c: String, d: String)
case class cat_output(id: String, name: String, legs: String, color: String, tail: String, day: String, ears: String, claws: String)

object CatLog {
  def main(args: Array[String]) {
    val sconf = new SparkConf().setAppName("Cat log")
    val sc = new SparkContext(sconf)
    sc.setLogLevel("WARN")
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val df = sqlContext.read.json("cats1.txt").as[cat_input]
    val step1 = df.rdd.groupBy(_.a)
    //step1 = (String, Iterator[cat_input]) = (cat1, CompactBuffer(cat_input( "cat1", "name", "Caesar", "2016-10-01"), ... ) )
    val step2 = step1.map(x => x._2)
    //step2 = Iterator[cat_input]
    val step3 = step2.map(y => (y.b, y.c))
    //step3 = ("name", "Caesar")
    val step4 = step3.map( case(x,y) => { cat_output(x) = y })
    // it should return cat_output(id: "cat1", name: "Caesar", legs: "4", color: "black", tail: "10cm", day: NULL, ears: NULL, claws: NULL)
step4 is obviously not working.
How do I return at least this: cat_output(id: "cat1", name: "Caesar", legs: "4", color: "black", tail: "10cm", day: NULL, ears: NULL, claws: NULL)?
How do I compare values by the d column, choose the newest one, and put the newest date into cat_output's day field?
Assuming the data has unique properties for each cat (cat1, cat2), and applying some logic for duplicates, you can try something like this with your case class:
// method to reduce two cat_output objects into one, preferring non-empty fields
def makeFinalRec(a: cat_output, b: cat_output): cat_output = {
  return cat_output(a.id,
    if (a.name == "" && b.name != "") b.name else a.name,
    if (a.legs == "" && b.legs != "") b.legs else a.legs,
    if (a.color == "" && b.color != "") b.color else a.color,
    if (a.tail == "" && b.tail != "") b.tail else a.tail,
    if (a.day == "" && b.day != "") b.day else a.day,
    if (a.ears == "" && b.ears != "") b.ears else a.ears,
    if (a.claws == "" && b.claws != "") b.claws else a.claws)
}
// dt is assumed to be a DataFrame over the same JSON log, where x(0)=a, x(1)=b, x(2)=c,
// e.g. sqlContext.read.json("cats1.txt")
dt.map(x => (x(0), x(1), x(2))).map(x => (x._1.toString,
  cat_output(x._1.toString,
    (x._2.toString match { case "name" => x._3.toString case _ => "" }),
    (x._2.toString match { case "legs" => x._3.toString case _ => "" }),
    (x._2.toString match { case "color" => x._3.toString case _ => "" }),
    (x._2.toString match { case "tail" => x._3.toString case _ => "" }),
    (x._2.toString match { case "day" => x._3.toString case _ => "" }),
    (x._2.toString match { case "ears" => x._3.toString case _ => "" }),
    (x._2.toString match { case "claws" => x._3.toString case _ => "" })
  ))).reduceByKey((a, b) => makeFinalRec(a, b)).map(x => x._2).toDF().toJSON.foreach(println)
Output:
{"id":"cat2","name":"Dickens","legs":"4","color":"red","tail":"15cm","day":"","ears":"5cm","claws":""}
{"id":"cat1","name":"Caesar","legs":"4","color":"black","tail":"20cm","day":"","ears":"","claws":""}
Also note, I didn't apply the actual "date" because there are duplicates. It needs another map() plus max logic to get the newest date for each key, and then a join of both datasets.
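A minimal sketch of that missing date handling (not part of the original answer): here reduced stands for the RDD[(String, cat_output)] produced by reduceByKey above, and df is the Dataset[cat_input] from the question.
val latestDay = df.rdd.map(x => (x.a, x.d))
  .reduceByKey((d1, d2) => if (d1 > d2) d1 else d2) // newest date per cat
val withDay = reduced.join(latestDay) // (id, (cat_output, day))
  .map { case (id, (rec, day)) => rec.copy(day = day) }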
One way is to use the aggregateByKey function and store the answer in a mutable map.
//case class defined outside main()
case class cat_input(a: String, b: String, c: String, d: String)

val df = sqlContext.read.json("cats1.txt").as[cat_input]

// fold one cat_input record into the per-cat map, keeping the newest date
val add_to_map = (a: scala.collection.mutable.Map[String, String], x: cat_input) => {
  val ts = x.d
  if (a contains "date") {
    if ((a contains x.b) && (ts >= a("date"))) {
      a(x.b) = x.c
      a("date") = ts
    }
    else if (!(a contains x.b)) {
      a(x.b) = x.c
      if (a("date") < ts) {
        a("date") = ts
      }
    }
  }
  else {
    a(x.b) = x.c
    a("date") = ts
  }
  a
}

// merge two partial maps, keeping the one with the newer date
val merge_maps = (a: scala.collection.mutable.Map[String, String], b: scala.collection.mutable.Map[String, String]) => {
  if (a("date") > b("date")) {
    a.keys.map(k => b(k) = a(k))
    a
  } else {
    b.keys.map(k => a(k) = b(k))
    b
  }
}

val step3 = df.rdd.map(x => (x.a, x)).aggregateByKey(scala.collection.mutable.Map[String, String]())(add_to_map, merge_maps)
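From here, a rough sketch (not part of the original answer) of turning each aggregated map into a cat_output, using empty strings for missing fields; note that the newest date is stored under the "date" key:
val result = step3.map { case (id, m) =>
  cat_output(id,
    m.getOrElse("name", ""), m.getOrElse("legs", ""), m.getOrElse("color", ""),
    m.getOrElse("tail", ""), m.getOrElse("date", ""), m.getOrElse("ears", ""),
    m.getOrElse("claws", ""))
}
result.toDF().show()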

Scala Play framework 'update' json array

Suppose I have a json array that looks like this (I receive it from a remote service):
[{"id": 1}, {"id": 2}, ... , {"id": 10}]
And, say, I want to 'transform' it like this (add 100 to 'id's and other values):
[{"id": 101}, {"id": 102}, ..., {"id": 110} ]
For starters, I wanted to 'update' it so that it would at least replace the initial array with a blank one (just to test how things work).
Json.parse("""[{"id": 1}, {"id": 2}]""").transform( (__).json.update( __.read[JsArray].map(_ => JsArray()) ))
But it produces a JsError:
play.api.libs.json.JsResult[play.api.libs.json.JsObject] = JsError(List((,List(ValidationError(List(error.expected.jsobject),WrappedArray())))))
However, if the array is wrapped inside a JSON object, then it more or less works:
Json.parse("""{"ids": [{"id": 1}, {"id": 2}]}""").transform( (__ \ "ids").json.update( __.read[JsArray].map(_ => JsArray()) ))
which results in
play.api.libs.json.JsResult[play.api.libs.json.JsObject] = JsSuccess({"ids":[]},/ids)
How do I deal with a JSON array?
Thanks in advance
Try the following code,
case class ID(id: Int)
implicit val reads = Json.reads[ID]

Json.parse("""[{"id": 1}, {"id": 2}]""").as[JsArray].value.map(_.transform(__.json.update {
  __.read[ID].map { case x: ID => Json.obj("id" -> (x.id + 100)) }
}))
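Each element of the result is a JsResult, so if you want a single JsArray back, here is a minimal sketch (assuming every element transforms successfully, since .get throws on a JsError):
val updated: JsArray = JsArray(
  Json.parse("""[{"id": 1}, {"id": 2}]""").as[JsArray].value.map { js =>
    js.transform(__.json.update(__.read[ID].map(x => Json.obj("id" -> (x.id + 100))))).get
  }
)
// updated: [{"id":101},{"id":102}]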