How to convert a Row to JSON in Spark 2 (Scala)

Is there a simple way to convert a given Row object to JSON?
I found this question about converting a whole DataFrame to JSON output:
Spark Row to JSON
But I just want to convert a single Row to JSON.
Here is pseudocode for what I am trying to do.
More precisely, I am reading JSON as input into a DataFrame.
I am producing a new output that is mainly based on columns, but with one JSON field for all the info that does not fit into the columns.
My question is: what is the easiest way to write this function, convertRowToJson()?
def convertRowToJson(row: Row): String = ???

def transformVenueTry(row: Row): Try[Venue] = {
  Try {
    val name = row.getString(row.fieldIndex("name"))
    val metadataRow = row.getStruct(row.fieldIndex("meta"))
    val score: Double = calcScore(row)
    // Pseudocode: Row has no ++ operator; the intent is to append a
    // "score" field to the metadata row.
    val combinedRow: Row = metadataRow ++ ("score" -> score)
    val jsonString: String = convertRowToJson(combinedRow)
    Venue(name = name, json = jsonString)
  }
}
Psidom's solution:

def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}

only works if the Row has a single level, not with nested Rows. This is the schema:
StructType(
  StructField(indicator, StringType, true),
  StructField(range,
    StructType(
      StructField(currency_code, StringType, true),
      StructField(maxrate, LongType, true),
      StructField(minrate, LongType, true)), true))
I also tried Artem's suggestion, but that did not compile:
def row2DataFrame(row: Row, sqlContext: SQLContext): DataFrame = {
  import sqlContext.implicits._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  // XXX does not compile: toDF() on an RDD[Row] needs an implicit
  // Encoder, which is not available for Row without a schema.
  val dataFrame = rowRDD.toDF()
  dataFrame
}

You can use getValuesMap to convert the Row object to a Map and then convert it to JSON:
import scala.util.parsing.json.JSONObject
import org.apache.spark.sql._

val df = Seq((1, 2, 3), (2, 3, 4)).toDF("A", "B", "C")
val row = df.first() // this is an example Row object

def convertRowToJSON(row: Row): String = {
  val m = row.getValuesMap(row.schema.fieldNames)
  JSONObject(m).toString()
}

convertRowToJSON(row)
// res46: String = {"A" : 1, "B" : 2, "C" : 3}

I need to read JSON input and produce JSON output.
Most fields are handled individually, but a few JSON sub-objects need to be preserved as-is.
When Spark reads a DataFrame it turns each record into a Row. The Row is a JSON-like structure that can be transformed and written out to JSON.
But I need to take some sub-JSON structures out to a string to use as a new field.
This can be done like this:

val dataFrameWithJsonField = dataFrame.withColumn("address_json", to_json($"location.address"))

location.address is the path to the sub-JSON object of the incoming JSON-based DataFrame; address_json is the column name of that object converted to a string version of the JSON.
to_json is implemented in Spark 2.1.
If generating the output JSON using json4s, address_json should be parsed into an AST representation; otherwise the output JSON will have the address_json part escaped.
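A minimal sketch of that json4s step, assuming the to_json output has already been collected into a plain Scala string (addressJson below is a hypothetical example value):

import org.json4s._
import org.json4s.jackson.JsonMethods._

val addressJson = """{"street":"Main St","zip":"12345"}"""
// Parse the string back into an AST node so it nests as a real object...
val addressAst: JValue = parse(addressJson)
// ...instead of being emitted as a single escaped string value.
val output = JObject("name" -> JString("venue-1"), "address" -> addressAst)
compact(render(output))
// {"name":"venue-1","address":{"street":"Main St","zip":"12345"}}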

Note that the Scala class scala.util.parsing.json.JSONObject is deprecated and does not support null values.

@deprecated("This class will be removed.", "2.11.0")
"JSONFormat.defaultFormat doesn't handle null values"
https://issues.scala-lang.org/browse/SI-5092

A DataFrame has a schema, but a plain Row doesn't necessarily carry one, so you need to apply a schema to the Row and then convert it to JSON. Here is how you can do it:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def convertRowToJson(row: Row): String = {
  val schema = StructType(
    StructField("name", StringType, true) ::
    StructField("meta", StringType, false) :: Nil)
  // applySchema expects an RDD and is deprecated in Spark 2;
  // build a one-row DataFrame with createDataFrame instead.
  val rdd = sqlContext.sparkContext.makeRDD(row :: Nil)
  sqlContext.createDataFrame(rdd, schema).toJSON.first()
}

Essentially, you can have a DataFrame which contains just one row. Thus, you can filter your initial DataFrame down to that row and then serialize it to JSON, as sketched below.
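A minimal sketch of that idea; df and its "id" column are hypothetical placeholders:

import org.apache.spark.sql.functions.col

// Filter the DataFrame down to the one record of interest, then use the
// built-in toJSON to serialize that single row as a JSON string.
val singleRowJson: String = df.filter(col("id") === 42).toJSON.first()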

I had the same issue; I had Parquet files with a canonical schema (no arrays), and I only wanted to get JSON events. I did it as follows, and it seems to work just fine (Spark 2.1):
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import scala.util.parsing.json.JSONFormat.ValueFormatter
import scala.util.parsing.json.{JSONArray, JSONFormat, JSONObject}

def getValuesMap[T](row: Row, schema: StructType): Map[String, Any] = {
  schema.fields.map { field =>
    try {
      if (field.dataType.typeName.equals("struct")) {
        // Recurse into nested structs so nested Rows become nested Maps.
        field.name -> getValuesMap(row.getAs[Row](field.name), field.dataType.asInstanceOf[StructType])
      } else {
        field.name -> row.getAs[T](field.name)
      }
    } catch {
      case e: Exception => field.name -> null.asInstanceOf[T]
    }
  }.filter(xy => xy._2 != null).toMap
}

def convertRowToJSON(row: Row, schema: StructType): JSONObject = {
  val m: Map[String, Any] = getValuesMap(row, schema)
  JSONObject(m)
}

// Since the map values are typed Any, the regular ValueFormatter does not
// handle nested Maps, so a case for Map[String, Any] had to be added.
val defaultFormatter: ValueFormatter = (x: Any) => x match {
  case s: String => "\"" + JSONFormat.quoteString(s) + "\""
  case jo: JSONObject => jo.toString(defaultFormatter)
  case jmap: Map[String, Any] => JSONObject(jmap).toString(defaultFormatter)
  case ja: JSONArray => ja.toString(defaultFormatter)
  case other => other.toString
}

val someFile = "s3a://bucket/file"
val df: DataFrame = sqlContext.read.load(someFile)
val schema: StructType = df.schema
val jsons: Dataset[JSONObject] = df.map(row => convertRowToJSON(row, schema))

If you are iterating through a DataFrame, you can directly convert it to a Dataset with one JSON string per record and iterate over that:

val df_json = df.toJSON
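Each record of df_json is now one JSON string, for example:

df_json.show(false)
df_json.collect().foreach(println)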

I combined the suggestions from Artem, KiranM and Psidom. After a lot of trial and error I came up with this solution, which I tested for nested structures:

def row2Json(row: Row, sqlContext: SQLContext): String = {
  import sqlContext.implicits._
  val rowRDD: RDD[Row] = sqlContext.sparkContext.makeRDD(row :: Nil)
  val dataframe = sqlContext.createDataFrame(rowRDD, row.schema)
  dataframe.toJSON.first
}

This solution worked, but only while running on the driver: it creates a DataFrame per Row, which cannot be done from inside an executor task.

Related

Scala - Couldn't remove double quotes for "{}" braces while building Json
import scala.util.Random
import play.api.libs.json._

val data1 = (1 to 2)
  .map { r => Json.toJson(Map(
    "name" -> Json.toJson(s"Perftest${Random.alphanumeric.take(6).mkString}"),
    "domainId" -> Json.toJson("343RDFDGF4RGGFG"),
    "value" -> Json.toJson("{}")))}

val data2 = Json.toJson(data1)
println(data2)
Result:
[{"name":"PerftestpXI1ID","domainId":"343RDFDGF4RGGFG","value":"{}"},{"name":"PerftestHoZSQR","domainId":"343RDFDGF4RGGFG","value":"{}"}]
Expected ("value" as an empty object, not a quoted string):
[{"name":"PerftestpXI1ID","domainId":"343RDFDGF4RGGFG","value":{}},{"name":"PerftestHoZSQR","domainId":"343RDFDGF4RGGFG","value":{}}]
Please suggest a solution.
You are giving it a String, so it is creating a string in the JSON output. What you actually want is an empty dictionary, which is a Map in Scala:

val data1 = (1 to 2)
  .map { r => Json.toJson(Map(
    "name" -> Json.toJson(s"Perftest${Random.alphanumeric.take(6).mkString}"),
    "domainId" -> Json.toJson("343RDFDGF4RGGFG"),
    "value" -> Json.toJson(Map.empty[String, String])))}
More generally you should create a case class for the data and create a custom Writes implementation for that class so that you don't have to call Json.toJson on every value.
Here is how to do the conversion using only a single Json.toJson call:

import scala.util.Random
import play.api.libs.json.Json

case class MyData(name: String, domainId: String, value: Map[String, String])
implicit val fmt = Json.format[MyData]

val data1 = (1 to 2)
  .map { r =>
    MyData(
      s"Perftest${Random.alphanumeric.take(6).mkString}",
      "343RDFDGF4RGGFG",
      Map.empty
    )
  }

val data2 = Json.toJson(data1)
println(data2)
The value field can be a standard type such as Boolean or Double. It could also be another case class to create nested JSON as long as there is a similar Json.format line for the new type.
More complex JSON can be generated by using a custom Writes (and Reads) implementation as described in the documentation.
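For example, a minimal sketch of a hand-written Writes for the MyData class above, as an alternative to the Json.format macro:

import play.api.libs.json._

implicit val myDataWrites: Writes[MyData] = new Writes[MyData] {
  // Spell out how each field maps onto the JSON object.
  def writes(d: MyData): JsValue = Json.obj(
    "name" -> d.name,
    "domainId" -> d.domainId,
    "value" -> d.value
  )
}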

Get value by key from json array

I have a few JSON arrays:
[{"key":"country","value":"aaa"},{"key":"region","value":"a"},{"key":"city","value":"a1"}]
[{"key":"city","value":"b"},{"key":"street","value":"1"}]
I need to extract the city and street values into different columns.
Using get_json_object($"address", "$[2].value").as("city") to get an element by its index doesn't work, because the arrays can be missing some fields.
Instead I decided to turn this array into a map of key -> value pairs, but I am having trouble doing it. So far I have only managed to get an array of arrays.
val schema = ArrayType(StructType(Array(
  StructField("key", StringType),
  StructField("value", StringType)
)))

from_json($"address", schema)
Returns
[[country, aaa],[region, a],[city, a1]]
[[city, b],[street, 1]]
I'm not sure where to go from here.
val schema = ArrayType(MapType(StringType, StringType))
Fails with
cannot resolve 'jsontostructs(`address`)' due to data type mismatch: Input schema array<map<string,string>> must be a struct or an array of structs.;;
I'm using Spark 2.2.
Using a UDF we can handle this easily. In the code below I create the map with a UDF. I hope this suffices:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val df1 = spark.read.format("text").load("file path")

val schema = ArrayType(StructType(Array(
  StructField("key", StringType),
  StructField("value", StringType)
)))

// Collapse the array of {key, value} structs into a single map column.
val arrayToMap = udf[Map[String, String], Seq[Row]] {
  array => array.map { case Row(key: String, value: String) => (key, value) }.toMap
}

val dfJSON = df1.withColumn("jsonData", from_json(col("value"), schema))
  .select("jsonData")
  .withColumn("address", arrayToMap(col("jsonData")))
  .withColumn("city", when(col("address.city").isNotNull, col("address.city")).otherwise(lit("")))
  .withColumn("street", when(col("address.street").isNotNull, col("address.street")).otherwise(lit("")))

dfJSON.printSchema()
dfJSON.show(false)
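As a side note, on Spark 2.4+ (newer than the Spark 2.2 in the question) the same array-of-structs to map conversion can be done without a UDF, using the built-in map_from_entries function:

// Assumes the same jsonData column of type array<struct<key,value>> as above.
val dfMap = dfJSON.withColumn("address", map_from_entries(col("jsonData")))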

Scala: Transform and replace values of Spark DataFrame with nested json structure

I have a nested JSON file that I am reading as a Spark DataFrame, and I want to replace certain values in it using my own transformation.
For now let's assume it looks as follows (which follows this):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Convenience function for turning JSON strings into DataFrames.
def jsonToDataFrame(json: String, schema: StructType = null): DataFrame = {
  // SparkSessions are available with Spark 2.0+
  val reader = spark.read
  Option(schema).foreach(reader.schema)
  reader.json(sc.parallelize(Array(json)))
}

val df = jsonToDataFrame("""
{
  "A": {
    "B": "b",
    "C": "c",
    "D": {"E": "e"}
  }
}
""")

display(df)
df.printSchema()
Suppose the following transformation (turning lower-case into upper-case) shall be applied to certain values in the above Spark DataFrame:

import org.apache.spark.sql.functions.udf

val upper: String => String = _.toUpperCase
val upperUDF = udf(upper)

While this doesn't work at all:

df.withColumn("A.B", upperUDF('A.B)).show()

the following works:

val df1 = df.select("A.B")
df1.withColumn("B", upperUDF('B)).show()

But in the end I want to stick to my nested structure and just replace certain values according to my transformation.
How can one achieve that? How can one preserve the schema when using withColumn?
Finally I found this thread, which gives the answer to my question. The trick is to dynamically preserve the schema while transforming the columns. Using the mutate() function defined therein (a sketch of such a helper follows the example below), the following works well for me:

val df2 = mutate(df, c => if (c.toString == "A.B") upperUDF(c) else c)
val df3 = mutate(df, c => if (c.toString == "A.D.E") upperUDF(c) else c)

display(df2)
df2.printSchema
display(df3)
df3.printSchema
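The mutate() helper itself is not reproduced in this thread; the following is a minimal sketch of what such a helper could look like, assuming Spark 2.x column APIs (not necessarily the exact code from the linked thread). It recursively rebuilds struct columns so the schema is preserved while the supplied transformation is applied to every leaf column:

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

def mutate(df: DataFrame, fn: Column => Column): DataFrame = {
  def traverse(schema: StructType, path: String = ""): Array[Column] =
    schema.fields.map { f =>
      f.dataType match {
        // Rebuild nested structs field by field so the nesting is preserved.
        case s: StructType => struct(traverse(s, path + f.name + "."): _*).as(f.name)
        // Leaf column: hand it to the caller's transformation.
        case _ => fn(col(path + f.name)).as(f.name)
      }
    }
  df.select(traverse(df.schema): _*)
}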

Save a sequence of JsObject into JSON file

I am using the Play Framework to convert a case class to a JSON object, for many instances of the case class LinkEvolution. Given the structure of each JSON object:
implicit val linkIPFormat = Json.format[LinkIPs]
implicit val linkState = Json.format[LinkState]

// user has JsObject as its type
val linkEvolution = LinkEvolution(rawDataLink.link, reference, current, alarms)
val user = Json.obj(
  "link" -> rawDataLink.link,
  "reference" -> linkEvolution.reference,
  "current" -> linkEvolution.current,
  "alarms" -> linkEvolution.alarms)

I have a list of users, so a list of JsObject.
My question is how I can save this list to a JSON file, with each line of the file being one JsObject.
You can convert them to strings and write them to a file (assuming users is a List[JsObject]):

import java.io._

val file = "file.json"
val writer = new BufferedWriter(new FileWriter(file))
users.map(_.toString).foreach { json =>
  writer.write(json)
  writer.newLine()
}
writer.close()

Extract a Json from an array inside a json in spark

I have a complicated JSON column whose structure is:

story: {
  cards: [
    { story-elements: [ {...}, {...}, {...} ] }
  ]
}

The length of the story-elements array is variable. I need to extract a particular JSON block from the story-elements array, and for that I first need to extract the story-elements themselves.
Here is the code I have tried, but it gives an error:

import org.json4s.{DefaultFormats, MappingException}
import org.json4s.jackson.JsonMethods._
import org.apache.spark.sql.functions._

def getJsonContent(jsonstring: String): String = {
  implicit val formats = DefaultFormats
  val parsedJson = parse(jsonstring)
  val value1 = (parsedJson \ "cards" \ "story-elements").extract[String]
  value1
}

val getJsonContentUDF = udf((jsonstring: String) => getJsonContent(jsonstring))

input.withColumn("cards", getJsonContentUDF(input("storyDataFrame")))
According to the JSON you provided, story-elements is an array of JSON objects, but you are trying to extract that array as a string ((parsedJson \ "cards" \ "story-elements").extract[String]).
You can create a case class representing one story (like case class Story(description: String, pageUrl: String, ...)) and then, instead of extract[String], try extract[List[Story]] or extract[Array[Story]].
If you need just one piece of data from a story (e.g. the description), you can use the xpath-like syntax to get it and then extract a List[String], as sketched below.
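A minimal sketch of both suggestions; the Story fields description and pageUrl are hypothetical placeholders, to be adjusted to the actual keys inside each story-element:

import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods._

case class Story(description: String, pageUrl: String)

def getStories(jsonString: String): List[Story] = {
  implicit val formats = DefaultFormats
  // "cards" is an array; walk its children and gather the story-elements
  // of every card into one flat list of Story objects.
  (parse(jsonString) \ "cards").children
    .flatMap(card => (card \ "story-elements").extract[List[Story]])
}

def getDescriptions(jsonString: String): List[String] = {
  implicit val formats = DefaultFormats
  // xpath-like navigation for a single field across all story-elements.
  (parse(jsonString) \ "cards" \ "story-elements" \ "description").extract[List[String]]
}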