Unable to parse a binary JSON string in Spark Scala

I have a binary JSON array string which I need to parse in Spark Scala.
This is what I am doing:
import com.github.wnameless.json.flattener.JsonFlattener

def extractBlob(jsonString: String,
                defaultValue: String = ""): String = {
  JsonFlattener
    .flatten(jsonString)
    .toString
}

val extractBlobAsText = udf((jS: String) => extractBlob(jS))

val df1 = df.withColumn("extracted", extractBlobAsText(unhex(col("hex").cast(StringType))))

df1.show(false)
But I am getting this error:
com.eclipsesource.json.ParseException: Unexpected end of input at 1:1 at
com.eclipsesource.json.JsonParser.error(JsonParser.java:490) at
com.eclipsesource.json.JsonParser.expected(JsonParser.java:484) at
com.eclipsesource.json.JsonParser.readValue(JsonParser.java:193) at
com.eclipsesource.json.JsonParser.parse(JsonParser.java:152) at
com.eclipsesource.json.JsonParser.parse(JsonParser.java:91) at
com.eclipsesource.json.Json.parse(Json.java:295) at
com.github.wnameless.json.flattener.JsonFlattener.<init>(JsonFlattener.java:144) at
com.github.wnameless.json.flattener.JsonFlattener.flatten(JsonFlattener.java:100) at
notebook0.Cell5$285.extractBlob(Cell5:12) at
Also, I do not know the JSON schema. All I know is that it is a JSON array, so I will also need to explode it later.
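For what it's worth, here is a minimal sketch of one likely fix (an assumption on my part, not a confirmed answer): unhex() yields a BinaryType column, so the string-typed UDF may be receiving empty or garbled input; decoding the bytes to a UTF-8 string first should hand the flattener real JSON text.
import org.apache.spark.sql.functions.{col, decode, udf, unhex}

// Sketch only: decode the unhexed bytes to a UTF-8 string before calling the UDF.
// The column names ("hex", "extracted") are taken from the question above.
val extractBlobAsText = udf((jS: String) => extractBlob(jS))
val df1 = df.withColumn("extracted", extractBlobAsText(decode(unhex(col("hex")), "UTF-8")))
df1.show(false)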

Related

spark streaming JSON value in dataframe column scala

I have a text file with JSON values, and this gets read into a DataFrame:
{"name":"Michael"}
{"name":"Andy", "age":30}
I want to infer the schema dynamically for each line while streaming and store it in separate locations (tables) depending on its schema.
Unfortunately, when I try to read value.schema it still shows as String. Please help on how to do it in Streaming, as RDDs are not allowed in streaming.
I wanted to use the following code, which doesn't work as the value is still read in String format.
val jsonSchema = newdf1.select("value").as[String].schema
val df1 = newdf1.select(from_json($"value", jsonSchema).alias("value_new"))
val df2 = df1.select("value_new.*")
I even tried to use schema_of_json:
val jsonSchema: String = newdf.select(schema_of_json(col("value".toString))).as[String].first()
Still no luck. Please help.
You can load the data as a text file, create a case class Person, and parse every JSON string to a Person instance using json4s or gson, then create the DataFrame as follows:
case class Person(name: String, age: Int)

val jsons = spark.read.textFile("/my/input")
val persons = jsons.map(json => toPerson(json)) // instead of 'toPerson', actually parse with json4s or gson to return a Person instance
val df = sqlContext.createDataFrame(persons)
Deserialize json to case class using json4s:
https://commitlogs.com/2017/01/14/serialize-deserialize-json-with-json4s-in-scala/
Deserialize json to case class using gson:
https://alvinalexander.com/source-code/scala/scala-case-class-gson-json-object-deserialization-and-scalatra
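For reference, a minimal sketch (my addition, not part of the original answer) of what the toPerson helper could look like with json4s; age is made an Option so the sample line without an "age" field still parses:
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

case class Person(name: String, age: Option[Int])

def toPerson(json: String): Person = {
  implicit val formats = DefaultFormats
  parse(json).extract[Person] // e.g. {"name":"Andy", "age":30} => Person(Andy,Some(30))
}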

Extract a Json from an array inside a json in spark

I have a complicated JSON column whose structure is:
story {
  cards: [ { story-elements: [ {...}, {...}, {...} ] } ]
}
The length of the story-elements is variable. I need to extract a particular JSON block from the story-elements array. For this, I first need to extract the story-elements.
Here is the code which I have tried, but it is giving an error:
import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._

def getJsonContent(jsonstring: String): String = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  val value1 = (parsedJson \ "cards" \ "story-elements").extract[String]
  value1
}

val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))

input.withColumn("cards", getJsonContentUDF(input("storyDataFrame")))
According to the JSON you provided, story-elements is an array of JSON objects, but you are trying to extract that array as a string ((parsedJson \ "cards" \ "story-elements").extract[String]).
You can create a case class representing one story (like case class Story(description: String, pageUrl: String, ...)) and then, instead of extract[String], try extract[List[Story]] or extract[Array[Story]].
If you need just one piece of data from each story (e.g. description), then you can use the XPath-like syntax to get it and then extract a List[String].
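A hedged sketch of that suggestion (the Story fields come from the answer's example, not a known schema, and a single cards entry is assumed):
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

case class Story(description: String, pageUrl: String)

def getStoryElements(jsonstring: String): List[Story] = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  // extract a typed list instead of a String
  (parsedJson \ "cards" \ "story-elements").extract[List[Story]]
}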

Klaxon Parsing of Nested Arrays

I'm trying to parse this file with Klaxon. Generally it's going well, except I am totally not succeeding in parsing the subarray of features/[Number]/properties/.
So my thought is to get the raw string of properties and to parse it separately with Klaxon, though I don't succeed in that either. Apart from that I have taken many other approaches as well.
My code so far:
class Haltestelle(val type: String?, val totalFeatures: Int?, val features: Array<Any>?)

fun main(args: Array<String>) { // Main-Routine
    val haltejsonurl = URL("http://online-service.kvb-koeln.de/geoserver/OPENDATA/ows?service=WFS&version=1.0.0&request=GetFeature&typeName=ODENDATA%3Ahaltestellenbereiche&outputFormat=application/json")
    val haltestringurl = haltejsonurl.readText()
    val halteklx = Klaxon().parse<Haltestelle>(haltestringurl)
    println(halteklx?.type)
    println(halteklx?.totalFeatures)
    println(halteklx?.features)
    halteklx?.features!!.forEach {
        println(it)
    }
}
I am aware that I am declaring features as an Array of Any, so the output is just printing java.lang.Object@blabla every time. Using a more specific array type fails as well.
I have really spent hours on this; how would you go about it?
Regards, a newbie
Here's how I did something similar in Kotlin. You can parse the response as a Klaxon JsonObject, then access the "features" element to parse all the array objects into a JsonArray of JsonObjects. This can be iterated over and cast with parseFromJsonObject<Haltestelle> in your example:
import com.beust.klaxon.JsonArray
import com.beust.klaxon.JsonObject
import com.beust.klaxon.Klaxon
import com.beust.klaxon.Parser
import com.github.aivancioglo.resttest.*

val response: Response = RestTest.get("http://anyurlwithJSONresponse")
val myParser = Parser()
val data: JsonObject = myParser.parse(response.getBody()) as JsonObject
// read "features" from the parsed JsonObject (not from the raw response)
val allFeatures: JsonArray<JsonObject>? = data["features"] as JsonArray<JsonObject>?

for ((index, obj) in allFeatures!!.withIndex()) {
    println("Loop Iteration $index on each object")
    val yourObj = Klaxon().parseFromJsonObject<Haltestelle>(obj)
}

How to convert Row to json in Spark 2 Scala

Is there a simple way to convert a given Row object to JSON?
Found this about converting a whole Dataframe to json output:
Spark Row to JSON
But I just want to convert a single Row to JSON.
Here is pseudo code for what I am trying to do.
More precisely, I am reading JSON as input into a DataFrame.
I am producing a new output that is mainly based on columns, but with one json field for all the info that does not fit into the columns.
My question: what is the easiest way to write this function, convertRowToJson()?
def convertRowToJson(row: Row): String = ???

def transformVenueTry(row: Row): Try[Venue] = {
  Try({
    val name = row.getString(row.fieldIndex("name"))
    val metadataRow = row.getStruct(row.fieldIndex("meta"))
    val score: Double = calcScore(row)
    val combinedRow: Row = metadataRow ++ ("score" -> score) // pseudo code: merge the score into the metadata Row
    val jsonString: String = convertRowToJson(combinedRow)
    Venue(name = name, json = jsonString)
  })
}
Psidom's solution:
def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}
only works if the Row has a single level, not with nested Rows. This is the schema:
StructType(
  StructField(indicator, StringType, true),
  StructField(range,
    StructType(
      StructField(currency_code, StringType, true),
      StructField(maxrate, LongType, true),
      StructField(minrate, LongType, true)), true))
I also tried Artem's suggestion, but that did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
  val sparkContext = sqlContext.sparkContext
  import sparkContext._
  import sqlContext.implicits._
  import sqlContext._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataFrame = rowRDD.toDF() // XXX does not compile
  dataFrame
}
You can use getValuesMap to convert the Row object to a Map and then convert it to JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._

val df = Seq((1, 2, 3), (2, 3, 4)).toDF("A", "B", "C")
val row = df.first() // this is an example row object

def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}

convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}
I need to read json input and produce json output.
Most fields are handled individually, but a few JSON sub-objects just need to be preserved.
When Spark reads a DataFrame it turns each record into a Row. The Row is a JSON-like structure that can be transformed and written out to JSON.
But I need to pull some sub-JSON structures out into a string to use as a new field.
This can be done like this:
val dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))
location.address is the path to the sub-JSON object in the incoming JSON-based DataFrame; address_json is the column name of that object converted to a string version of the JSON.
to_json is implemented in Spark 2.1.
If generating the output JSON using json4s, address_json should be parsed back into an AST representation, otherwise the output JSON will have the address_json part escaped.
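As a hedged illustration of that step (the names embedAddress and addressJson are made up for the example), the stored string can be parsed back into a json4s AST and embedded so it is rendered as nested JSON rather than as an escaped string:
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

// addressJson holds the string value of the address_json column (illustrative)
def embedAddress(name: String, addressJson: String): String = {
  val body: JValue = JObject(
    "name" -> JString(name),
    "address" -> parse(addressJson) // an AST node, so it is not escaped in the output
  )
  compact(render(body))
}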
Pay attention: the Scala class scala.util.parsing.json.JSONObject is deprecated and does not support null values.
@deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092
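As a small, hedged illustration of that null issue (the exact failure mode may vary by Scala version):
import scala.util.parsing.json.JSONObject

// JSONFormat.defaultFormat calls toString on each value, so a null value
// is expected to blow up rather than render as "null".
val m: Map[String, Any] = Map("name" -> "abc", "meta" -> null)
JSONObject(m).toString() // likely throws a NullPointerException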
JSON has a schema but a Row doesn't, so you need to apply a schema to the Row and then convert it to JSON. Here is how you can do it.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def convertRowToJson(row: Row): String = {
  val schema = StructType(
    StructField("name", StringType, true) ::
    StructField("meta", StringType, false) :: Nil)
  // applySchema expects an RDD[Row], so wrap the single row; toJSON then yields one JSON string
  sqlContext.applySchema(sqlContext.sparkContext.makeRDD(row :: Nil), schema).toJSON.first()
}
Essentially, you can have a DataFrame which contains just one row. Thus, you can filter your initial DataFrame and then parse it to JSON.
I had the same issue: I had Parquet files with a canonical schema (no arrays), and I only wanted to get JSON events. I did it as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}

def getValuesMap[T](row: Row, schema: StructType): Map[String, Any] = {
  schema.fields.map { field =>
    try {
      if (field.dataType.typeName.equals("struct")) {
        field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
      } else {
        field.name -> row.getAs[T](field.name)
      }
    } catch {
      case e: Exception => field.name -> null.asInstanceOf[T]
    }
  }.filter(xy => xy._2 != null).toMap
}

def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
  val m: Map[String, Any] = getValuesMap(row, schema)
  JSONObject(m)
}

// I guess since I am using Any and not Nothing the regular ValueFormatter is not working,
// and I had to add: case jmap: Map[String, Any] => JSONObject(jmap).toString(defaultFormatter)
val defaultFormatter: ValueFormatter = (x: Any) => x match {
  case s: String => "\"" + JSONFormat.quoteString(s) + "\""
  case jo: JSONObject => jo.toString(defaultFormatter)
  case jmap: Map[String, Any] => JSONObject(jmap).toString(defaultFormatter)
  case ja: JSONArray => ja.toString(defaultFormatter)
  case other => other.toString
}

val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))
If you are iterating through a DataFrame, you can directly convert the DataFrame to a new Dataset with a JSON string for each row and iterate over that:
val df_json = df.toJSON
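For example, a brief hedged usage sketch (df is assumed to be any DataFrame); toJSON returns a Dataset[String] with one JSON document per row:
df_json.show(false)                  // inspect a few rows
df_json.collect().foreach(println)   // or iterate over all rows on the driver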
Combining the suggestions from Artem, KiranM, and Psidom: after a lot of trial and error I came up with this solution, which I tested for nested structures:
def row2Json(row: Row, sqlContext: SQLContext): String = {
  import sqlContext.implicits._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
  dataframe.toJSON.first
}
This solution worked, but only while running in driver mode (creating a DataFrame needs the SQLContext, which is only available on the driver).

Convert scala json string to an object

I need to convert a JSON string into an object using Scala.
I tried this, but I cannot access the prop Name, for example:
import scala.util.parsing.json._
val parsed = JSON.parseFull("""{"Name":"abc", "age":10}""")
How can I take a JSON string and convert it to an object? Thanks
val parsed = JSON.parseFull("""{"Name":"abc", "age":10}""")
// parseFull returns an Option[Any]; pattern match to reach the map and its "Name" prop
val name = parsed match {
  case Some(m: Map[String, Any] @unchecked) => m("Name").toString
  case _ => ""
}
Try it.
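Alternatively, a minimal sketch (my addition, not from the original answer) that maps the JSON string onto a case class with json4s, so the props become plain fields:
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

case class Item(Name: String, age: Int) // field names are assumed to match the JSON keys

implicit val formats = DefaultFormats
val obj = parse("""{"Name":"abc", "age":10}""").extract[Item]
println(obj.Name) // abc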