Converting a JSON string to a DataFrame when the schema contains ambiguous columns - Spark Scala - JSON

How do I convert this JSON to a DataFrame in Scala?
json_string = """{
"module": {
"col1": "a",
"col2": {
"5": 1,
"3": 4,
"4": {
"numeric reasoning": 2,
"verbal": 4
},
"7": {
"landline": 2,
"landLine": 4
}
}
}
}"""
The function I use:
val jsonRDD = spark.sparkContext.parallelize(json_string :: Nil)
val jsonDF = spark.read.json(jsonRDD)
val df = flattenRecursive(jsonDF)
df.show()
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenRecursive(df: DataFrame): DataFrame = {
  val fields = df.schema.fields
  val fieldNames = fields.map(_.name)
  for (i <- 0 until fields.length) {
    val field = fields(i)
    val fieldType = field.dataType
    val fieldName = field.name
    fieldType match {
      case arrayType: ArrayType =>
        println("flatten array")
        val newFieldNames = fieldNames.filter(_ != fieldName) ++
          Array(s"explode_outer($fieldName) as $fieldName")
        val explodedDf = df.selectExpr(newFieldNames: _*)
        return flattenRecursive(explodedDf)
      case structType: StructType =>
        println("flatten struct")
        val newFieldNames = fieldNames.filter(_ != fieldName) ++
          structType.fieldNames.map(childName => s"$fieldName.$childName as ${fieldName}_$childName")
        val explodedDf = df.selectExpr(newFieldNames: _*)
        return flattenRecursive(explodedDf)
      case _ =>
        println("other type")
    }
  }
  df
}
The error I get is:
Ambiguous reference to fields StructField(landLine,LongType,true), StructField(landline,LongType,true);
Required output: rename one of the duplicated landline columns (e.g. to landline_1) before exploding.
**Note: please provide generic code, because I don't know at which level this ambiguity will occur, and I don't know the schema while the code is running.**
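One possible generic approach (a sketch only; the helper names dedupeFields and dedupeColumns are made up, and here both colliding fields get a numeric suffix rather than just one of them): before flattening, rewrite the schema so that sibling field names are unique when compared case-insensitively, then cast each struct column to the rewritten schema. In Spark, casting a struct to a struct of the same shape renames its fields by position, so the data itself is untouched.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Recursively rename fields whose names collide case-insensitively
// ("landline" / "landLine" become "landline_1" / "landLine_2").
def dedupeFields(dt: DataType): DataType = dt match {
  case st: StructType =>
    val counts = st.fieldNames.groupBy(_.toLowerCase).map { case (k, v) => k -> v.length }
    var seen = Map.empty[String, Int]
    StructType(st.fields.map { f =>
      val key = f.name.toLowerCase
      val newName =
        if (counts(key) > 1) {
          val idx = seen.getOrElse(key, 0) + 1
          seen += key -> idx
          s"${f.name}_$idx"
        } else f.name
      StructField(newName, dedupeFields(f.dataType), f.nullable, f.metadata)
    })
  case at: ArrayType => at.copy(elementType = dedupeFields(at.elementType))
  case other => other
}

// Cast every column whose schema changed to the deduplicated schema.
def dedupeColumns(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, f) =>
    val cleaned = dedupeFields(f.dataType)
    if (cleaned == f.dataType) acc else acc.withColumn(f.name, col(f.name).cast(cleaned))
  }
With that in place, the existing flattening could be called as flattenRecursive(dedupeColumns(jsonDF)).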

Related

How to remove empty JSON objects in Scala

Sometimes we have JSON that looks like this:
{a: {}, b: {c: {}, d: 123}}
We want to remove the empty structures and turn it into {b: {d: 123}}.
Here is how you can do it in Scala using Jackson:
val json = """ {"a":{}, "b": {"c": {}, "d": 123}} """
val mapper = new ObjectMapper()
mapper
.setSerializationInclusion(JsonInclude.Include.NON_EMPTY)
.setSerializationInclusion(JsonInclude.Include.NON_NULL)
.setSerializationInclusion(JsonInclude.Include.NON_ABSENT)
val node = mapper.readTree(json)
removeEmptyFields(node)
val cleanJson = mapper.writeValueAsString(node) // {"b": {"d": 123}}
// Returns true if the node still contains data after cleaning, false if it is empty.
private def removeEmptyFields(node: JsonNode): Boolean = node match {
  case array: ArrayNode =>
    val iter = array.elements()
    while (iter.hasNext) {
      // remove through the iterator so elements are not skipped after a removal
      if (!removeEmptyFields(iter.next())) iter.remove()
    }
    true
  case json: ObjectNode =>
    val names = json.fieldNames().asScala.toList
    if (names.isEmpty) return false
    var removeRoot = true
    names.foreach { name =>
      if (!removeEmptyFields(json.get(name))) json.remove(name)
      else removeRoot = false
    }
    !removeRoot
  case _ => true
}
Pure, stack-safe implementation with circe:
import cats.Eval
import cats.implicits._
import io.circe.JsonObject
import io.circe.literal._
import io.circe.syntax._

object Main extends App {

  def removeEmpty(jo: JsonObject): JsonObject = {
    // `Eval` is trampolined, so this is stack-safe
    def loop(o: JsonObject): Eval[JsonObject] =
      o.toList.foldLeftM(o) { case (acc, (k, v)) =>
        v.asObject match {
          case Some(oo) if oo.isEmpty => acc.remove(k).pure[Eval]
          case Some(oo)               => Eval.defer(loop(oo)).map(_o => acc.add(k, _o.asJson))
          case _                      => acc.pure[Eval]
        }
      }

    loop(jo).value
  }

  // this is a json literal;
  // if it comes from a dynamic string, parse it with `io.circe.parser.parse` first
  val json = json"""{"a":{}, "b": {"c": {}, "d": 123}}"""
  val res = json.asObject.map(removeEmpty(_).asJson)
  println(res)
}

Need to convert a string to a JSON object in Scala

I am trying to convert a string to JSON using spray-json, but I am very new to Scala and am having trouble writing the code. My input is a string and may contain more elements.
Example input String
12 rob 133 millan
The expected JSON is below:
[
  {
    "M": {
      "Score": {
        "N": "12"
      },
      "TopicID": {
        "S": "rob"
      }
    }
  },
  {
    "M": {
      "Score": {
        "N": "133"
      },
      "TopicID": {
        "S": "milan"
      }
    }
  }
]
Any suggestion on the code approach would help as well.
Regarding creating the JSON:
First, you need to define the case classes:
import spray.json._
import DefaultJsonProtocol._

case class SClass(S: String)
case class NClass(N: String)
case class MClass(Score: NClass, TopicID: SClass)
Then define the formats (the inner ones must be in scope before the outer one):
implicit val sclassFormat: RootJsonFormat[SClass] = jsonFormat1(SClass)
implicit val nclassFormat: RootJsonFormat[NClass] = jsonFormat1(NClass)
implicit val mclassFormat: RootJsonFormat[MClass] = jsonFormat2(MClass)
// serialize the json (hardcoded values)
val mClass = MClass(NClass(12.toString), SClass("rob"))
val mClassJsonString = mClass.toJson.prettyPrint
About parsing your input:
val input = "12 rob 13 bla"
val a = input.split(" ").zipWithIndex.collect { case (v, i) if i % 2 == 0 => v }
val b = input.split(" ").zipWithIndex.collect { case (v, i) if i % 2 != 0 => v }
val result = a.zip(b) // Array((12,rob), (13,bla))
Now you can traverse your result like this:
result.foreach { case (x, y) =>
  val mClass = MClass(NClass(x), SClass(y))
  val mClassJsonString = mClass.toJson.prettyPrint
}
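For completeness, here is a minimal end-to-end sketch that produces the expected JSON array (assuming spray-json; the Entry wrapper class and the format names are made up for illustration):
import spray.json._
import DefaultJsonProtocol._

case class SClass(S: String)
case class NClass(N: String)
case class MClass(Score: NClass, TopicID: SClass)
case class Entry(M: MClass) // wrapper so each array element is keyed by "M"

implicit val sFormat: RootJsonFormat[SClass] = jsonFormat1(SClass)
implicit val nFormat: RootJsonFormat[NClass] = jsonFormat1(NClass)
implicit val mFormat: RootJsonFormat[MClass] = jsonFormat2(MClass)
implicit val eFormat: RootJsonFormat[Entry] = jsonFormat1(Entry)

val input = "12 rob 133 millan"
// pair up the tokens: (score, topic), (score, topic), ...
val entries = input.split("\\s+").grouped(2).collect {
  case Array(score, topic) => Entry(MClass(NClass(score), SClass(topic)))
}.toList

println(entries.toJson.prettyPrint) // prints the JSON array from the question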

Accessing inner JSON field in Scala

I have a JSON string such as:
{
"whatsNew" : {
"oldNotificationClass" : "WhatsNewNotification",
"notificationEvent" : { ..... },
"result" : {
"notificationCount" : 10
.....
}
},
......
"someEmpty": { },
......
}
I am trying to get the notificationCount field using json4s in Scala as follows, but notificationCount comes back empty every time. Any help?
UPDATE
Also, if some pair is empty, how can I handle the empty case and continue the loop?
The function returning the JSON string from a file:
def getData(): Map[String, AnyRef] = {
  val jsonString = scala.io.Source.fromInputStream(this.getClass.getResourceAsStream("/sample.json")).getLines.mkString
  val jsonObject = parse(s""" $jsonString """)
  jsonObject.values.asInstanceOf[Map[String, AnyRef]]
}
Code to get fields
val myMap: Map[String, AnyRef] = MyDataLoader.getData
for ((key, value) <- myMap) {
  val id = key
  val eventJsonStr: String = write(value.asInstanceOf[Map[String, String]] get "notificationEvent")
  val resultJsonStr: String = write(value.asInstanceOf[Map[String, String]] get "result")
  //val notificationCount: String = write(value.asInstanceOf[Map[String, Map[String, String]]] get "notificationCount")
}
You can use path and extract like so:
val count: Int = (parse(jsonString) \ "whatsNew" \ "result" \ "notificationCount").extract[Int]
You will need this implicit in scope (DefaultFormats comes from org.json4s) for .extract[Int] to work:
implicit val formats = DefaultFormats
To do it in a loop:
parse(jsonString).children.map { child =>
  val count: Int = (child \ "result" \ "notificationCount").extract[Int]
  ...
}
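To address the UPDATE part (skipping empty objects such as someEmpty and continuing the loop), one option is extractOpt, which yields None instead of throwing. A small sketch, assuming jsonString holds the JSON from the question and the json4s jackson backend is used:
import org.json4s._
import org.json4s.jackson.JsonMethods._

def notificationCounts(jsonString: String): List[Int] = {
  implicit val formats: Formats = DefaultFormats
  // the children of the top-level object are the values of "whatsNew", "someEmpty", ...
  parse(jsonString).children.flatMap { child =>
    (child \ "result" \ "notificationCount").extractOpt[Int] // None for empty objects
  }
}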

Play: How to remove the fields without value from JSON and create a new JSON with them

Given the following JSON:
{
"field1": "value1",
"field2": "",
"field3": "value3",
"field4": ""
}
How do I get two distinct JSONs, one containing the fields with values and another containing the fields without values? Here is how the final result should look:
{
"field1": "value1",
"field3": "value3"
}
{
"field2": "",
"field4": ""
}
You have access to the JSON object's fields as a sequence of (String, JsValue) pairs, and you can filter them: separate the ones with and without values and use the filtered sequences to construct new JsObject instances.
import play.api.libs.json._
val ls =
("field1", JsString("value1")) ::
("field2", JsString("")) ::
("field3", JsString("value3")) ::
("field4", JsString("")) ::
Nil
val js0 = JsObject(ls)
def withoutValue(v: JsValue) = v match {
  case JsNull => true
  case JsString("") => true
  case _ => false
}
val js1 = JsObject(js0.fields.filterNot(t => withoutValue(t._2)))
val js2 = JsObject(js0.fields.filter(t => withoutValue(t._2)))
You can improve on @nietaki's solution by using the partition function.
import play.api.libs.json._
val ls =
("field1", JsString("value1")) ::
("field2", JsString("")) ::
("field3", JsString("value3")) ::
("field4", JsString("")) ::
Nil
val js0 = JsObject(ls)
def withoutValue(v: JsValue) = v match {
  case JsNull => true
  case JsString("") => true
  case _ => false
}
val (js2, js1) = js0.fields.partition(t => withoutValue(t._2))
JsObject(js1)
JsObject(js2)
You can also do it as a one-liner:
val (js1, js2) = j.fields.partition(_._2 != JsString(""))
println(JsObject(js1))
println(JsObject(js2))
Code run at Scastie
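A small reusable wrapper around the same partition idea (a sketch only; the name splitByValue and the treatment of JsNull are my own choices):
import play.api.libs.json._

def splitByValue(obj: JsObject): (JsObject, JsObject) = {
  val (withValue, withoutValue) = obj.fields.partition {
    case (_, JsNull)       => false
    case (_, JsString("")) => false
    case _                 => true
  }
  (JsObject(withValue), JsObject(withoutValue))
}
Calling splitByValue(js0) then returns the pair of objects with and without values.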

How to parse JSON in Scala using standard Scala classes?

I am using the built-in JSON class in Scala 2.8 to parse JSON. I don't want to use the Liftweb one or any other library, to minimize dependencies.
The way I am doing it seems too imperative; is there a better way to do it?
import scala.util.parsing.json._
...
val json: Option[Any] = JSON.parseFull(jsonString)
val map: Map[String, Any] = json.get.asInstanceOf[Map[String, Any]]
val languages: List[Any] = map.get("languages").get.asInstanceOf[List[Any]]
languages.foreach { langMap =>
  val language: Map[String, Any] = langMap.asInstanceOf[Map[String, Any]]
  val name: String = language.get("name").get.asInstanceOf[String]
  val isActive: Boolean = language.get("is_active").get.asInstanceOf[Boolean]
  val completeness: Double = language.get("completeness").get.asInstanceOf[Double]
}
This is a solution based on extractors which will do the class cast:
class CC[T] { def unapply(a: Any): Option[T] = Some(a.asInstanceOf[T]) }

object M extends CC[Map[String, Any]]
object L extends CC[List[Any]]
object S extends CC[String]
object D extends CC[Double]
object B extends CC[Boolean]

val jsonString =
  """
{
  "languages": [{
    "name": "English",
    "is_active": true,
    "completeness": 2.5
  }, {
    "name": "Latin",
    "is_active": false,
    "completeness": 0.9
  }]
}
""".stripMargin

val result = for {
  Some(M(map)) <- List(JSON.parseFull(jsonString))
  L(languages) = map("languages")
  M(language) <- languages
  S(name) = language("name")
  B(active) = language("is_active")
  D(completeness) = language("completeness")
} yield {
  (name, active, completeness)
}

assert(result == List(("English", true, 2.5), ("Latin", false, 0.9)))
At the start of the for loop I artificially wrap the result in a list so that it yields a list at the end. Then in the rest of the for loop I use the fact that generators (using <-) and value definitions (using =) will make use of the unapply methods.
This is the way I do the pattern match:
val result = JSON.parseFull(jsonStr)
result match {
// Matches if jsonStr is valid JSON and represents a Map of Strings to Any
case Some(map: Map[String, Any]) => println(map)
case None => println("Parsing failed")
case other => println("Unknown data structure: " + other)
}
I like @huynhjl's answer; it led me down the right path. However, it isn't great at handling error conditions. If the desired node does not exist, you get a cast exception. I've adapted this slightly to make use of Option to better handle this.
class CC[T] {
  def unapply(a: Option[Any]): Option[T] =
    if (a.isEmpty) None
    else Some(a.get.asInstanceOf[T])
}

object M extends CC[Map[String, Any]]
object L extends CC[List[Any]]
object S extends CC[String]
object D extends CC[Double]
object B extends CC[Boolean]

for {
  M(map) <- List(JSON.parseFull(jsonString))
  L(languages) = map.get("languages")
  language <- languages
  M(lang) = Some(language)
  S(name) = lang.get("name")
  B(active) = lang.get("is_active")
  D(completeness) = lang.get("completeness")
} yield {
  (name, active, completeness)
}
Of course, this doesn't handle errors so much as avoid them. This will yield an empty list if any of the json nodes are missing. You can use a match to check for the presence of a node before acting...
for {
  M(map) <- Some(JSON.parseFull(jsonString))
} yield {
  map.get("languages") match {
    case L(languages) =>
      for {
        language <- languages
        M(lang) = Some(language)
        S(name) = lang.get("name")
        B(active) = lang.get("is_active")
        D(completeness) = lang.get("completeness")
      } yield {
        (name, active, completeness)
      }
    case None => "bad json"
  }
}
I tried a few things, favouring pattern matching as a way of avoiding casting but ran into trouble with type erasure on the collection types.
The main problem seems to be that the complete type of the parse result mirrors the structure of the JSON data and is either cumbersome or impossible to fully state. I guess that is why Any is used to truncate the type definitions. Using Any leads to the need for casting.
I've hacked something below which is concise but is extremely specific to the JSON data implied by the code in the question. Something more general would be more satisfactory but I'm not sure if it would be very elegant.
implicit def any2string(a: Any) = a.toString
implicit def any2boolean(a: Any) = a.asInstanceOf[Boolean]
implicit def any2double(a: Any) = a.asInstanceOf[Double]

case class Language(name: String, isActive: Boolean, completeness: Double)

val languages = JSON.parseFull(jstr) match {
  case Some(x) =>
    val m = x.asInstanceOf[Map[String, List[Map[String, Any]]]]
    m("languages") map { l => Language(l("name"), l("is_active"), l("completeness")) }
  case None => Nil
}

languages foreach { println }
val jsonString =
"""
|{
| "languages": [{
| "name": "English",
| "is_active": true,
| "completeness": 2.5
| }, {
| "name": "Latin",
| "is_active": false,
| "completeness": 0.9
| }]
|}
""".stripMargin
val result = JSON.parseFull(jsonString).map {
  case json: Map[String, List[Map[String, Any]]] =>
    json("languages").map(l => (l("name"), l("is_active"), l("completeness")))
}.get

println(result)
assert(result == List(("English", true, 2.5), ("Latin", false, 0.9)))
You can do it like this! It is very easy to parse JSON this way :P
package org.sqkb.service.common.bean

import java.text.SimpleDateFormat

import org.json4s
import org.json4s.JValue
import org.json4s.jackson.JsonMethods._
//import org.sqkb.service.common.kit.{IsvCode}

import scala.util.Try

case class Order(log: String) {

  implicit lazy val formats = org.json4s.DefaultFormats

  lazy val json: json4s.JValue = parse(log)

  lazy val create_time: String = (json \ "create_time").extractOrElse("1970-01-01 00:00:00")
  lazy val site_id: String = (json \ "site_id").extractOrElse("")
  lazy val alipay_total_price: Double = (json \ "alipay_total_price").extractOpt[String].filter(_.nonEmpty).getOrElse("0").toDouble
  lazy val gmv: Double = alipay_total_price
  lazy val pub_share_pre_fee: Double = (json \ "pub_share_pre_fee").extractOpt[String].filter(_.nonEmpty).getOrElse("0").toDouble
  lazy val profit: Double = pub_share_pre_fee
  lazy val trade_id: String = (json \ "trade_id").extractOrElse("")
  lazy val unid: Long = Try((json \ "unid").extractOpt[String].filter(_.nonEmpty).get.toLong).getOrElse(0L)
  lazy val cate_id1: Int = (json \ "cate_id").extractOrElse(0)
  lazy val cate_id2: Int = (json \ "subcate_id").extractOrElse(0)
  lazy val cate_id3: Int = (json \ "cate_id3").extractOrElse(0)
  lazy val cate_id4: Int = (json \ "cate_id4").extractOrElse(0)
  lazy val coupon_id: Long = (json \ "coupon_id").extractOrElse(0)

  lazy val platform: Option[String] = Order.siteMap.get(site_id)

  def time_fmt(fmt: String = "yyyy-MM-dd HH:mm:ss"): String = {
    val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    val date = dateFormat.parse(this.create_time)
    new SimpleDateFormat(fmt).format(date)
  }

}
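A quick usage sketch (the JSON literal below is made up, and Order.siteMap is assumed to exist in a companion object, since it is referenced but not shown above):
val order = Order("""{"create_time":"2021-01-01 00:00:00","trade_id":"T1","alipay_total_price":"12.5"}""")
println(order.trade_id)               // T1
println(order.gmv)                    // 12.5
println(order.time_fmt("yyyy/MM/dd")) // 2021/01/01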
This is the way I do it with the Scala parser combinator library:
import scala.util.parsing.combinator._

class ImprovedJsonParser extends JavaTokenParsers {

  def obj: Parser[Map[String, Any]] =
    "{" ~> repsep(member, ",") <~ "}" ^^ (Map() ++ _)

  def array: Parser[List[Any]] =
    "[" ~> repsep(value, ",") <~ "]"

  def member: Parser[(String, Any)] =
    stringLiteral ~ ":" ~ value ^^ { case name ~ ":" ~ value => (name, value) }

  def value: Parser[Any] = (
    obj
      | array
      | stringLiteral
      | floatingPointNumber ^^ (_.toDouble)
      | "true" ^^ (_ => true)   // map the literals to real Booleans
      | "false" ^^ (_ => false)
    )
}

object ImprovedJsonParserTest extends ImprovedJsonParser {
  def main(args: Array[String]): Unit = {
    val jsonString =
      """
      {
        "languages": [{
          "name": "English",
          "is_active": true,
          "completeness": 2.5
        }, {
          "name": "Latin",
          "is_active": false,
          "completeness": 0.9
        }]
      }
      """.stripMargin
    val result = parseAll(value, jsonString)
    println(result)
  }
}
scala.util.parsing.json.JSON is deprecated.
Here is another approach with circe. Documentation: https://circe.github.io/circe/cursors.html
Add the dependencies in build.sbt. I used Scala 2.13.4; note that the Scala version must be compatible with the library version.
val circeVersion = "0.14.0-M2"
libraryDependencies ++= Seq(
"io.circe" %% "circe-core" % circeVersion,
"io.circe" %% "circe-generic" % circeVersion,
"io.circe" %% "circe-parser" % circeVersion
)
Example 1:
import io.circe
import io.circe.{ACursor, Decoder, Json}
import io.circe.generic.semiauto.deriveDecoder

case class Person(name: String, age: Int)

object Main {
  def main(args: Array[String]): Unit = {
    val input =
      """
        |{
        |  "kind": "Listing",
        |  "data": [
        |    {
        |      "name": "Frodo",
        |      "age": 51
        |    },
        |    {
        |      "name": "Bilbo",
        |      "age": 60
        |    }
        |  ]
        |}
        |""".stripMargin

    implicit val decoderPerson: Decoder[Person] = deriveDecoder[Person] // decoder required to parse into the custom class
    val parseResult: Json = circe.parser.parse(input).getOrElse(Json.Null)
    val data: ACursor = parseResult.hcursor.downField("data") // get the data field
    val personList: List[Person] = data.as[List[Person]].getOrElse(Nil) // decode the data field into a list of Person

    for {
      person <- personList
    } println(person.name + " is " + person.age)
  }
}
Example 2, where the JSON has an object within an object:
import io.circe
import io.circe.{Decoder, Json}
import io.circe.generic.semiauto.deriveDecoder

case class Person(name: String, age: Int, position: Position)
case class Position(x: Int, y: Int)

object Main {
  def main(args: Array[String]): Unit = {
    val input =
      """
        |{
        |  "kind": "Listing",
        |  "data": [
        |    {
        |      "name": "Frodo",
        |      "age": 51,
        |      "position": {
        |        "x": 10,
        |        "y": 20
        |      }
        |    },
        |    {
        |      "name": "Bilbo",
        |      "age": 60,
        |      "position": {
        |        "x": 75,
        |        "y": 85
        |      }
        |    }
        |  ]
        |}
        |""".stripMargin

    implicit val decoderPosition: Decoder[Position] = deriveDecoder[Position] // must be defined before the Person decoder
    implicit val decoderPerson: Decoder[Person] = deriveDecoder[Person]

    val parseResult = circe.parser.parse(input).getOrElse(Json.Null)
    val data = parseResult.hcursor.downField("data")
    val personList = data.as[List[Person]].getOrElse(Nil)

    for {
      person <- personList
    } println(person.name + " is " + person.age + " at " + person.position)
  }
}