XML field in JSON data

I would like to know how to read and parse an XML field that is part of JSON data. The schema is:
root
|-- fields: struct (nullable = true)
| |-- custid: string (nullable = true)
| |-- password: string (nullable = true)
| |-- role: string (nullable = true)
| |-- xml_data: string (nullable = true)
The xml_data field itself contains many columns. Think of the fields inside xml_data as nested columns of "fields". How can I parse all of the columns "custid", "password", "role", "xml_data.refid" and "xml_data.refname" into one DataFrame?
In short: how do I parse and read XML data that is stored as string content inside a JSON file?

This is a little tricky, but it can be achieved in a few simple steps:
1. Parse the XML string to a JSON string and wrap it in an identifier (in the code below, two single quotes: '').
2. Convert the entire DataFrame to a Dataset of JSON strings.
3. Map over the Dataset of strings and build valid JSON by locating the identifier added in step 1.
4. Convert the Dataset of valid JSON back to a DataFrame. That's it!
import spark.implicits._
import scala.xml.XML
import org.json4s.Xml.toJson
import org.json4s.jackson.JsonMethods.{compact, render}
import org.apache.spark.sql.functions.{col, udf} // col is used below and must be imported too

val rdd = spark
  .sparkContext
  .parallelize(Seq("{\"fields\":{\"custid\":\"custid\",\"password\":\"password\",\"role\":\"role\",\"xml_data\":\"<person><refname>Test Person</refname><country>India</country></person>\"}}"))
val df = spark.read.json(rdd.toDS())

// Step 1: XML string -> JSON string, wrapped in the '' identifier
val xmlToJsonUDF = udf { xmlString: String =>
  val xml = XML.loadString(xmlString)
  s"''${compact(render(toJson(xml)))}''"
}
val xmlParsedDf = df.withColumn("xml_data", xmlToJsonUDF(col("fields.xml_data")))

// Step 2: whole DataFrame -> Dataset of JSON strings
val jsonDs = xmlParsedDf.toJSON

// Step 3: un-escape the embedded JSON, then drop the identifier and its surrounding quotes.
// Every backslash in the marked segment is removed, so this assumes the XML text
// itself contains no backslashes or double quotes.
val validJsonDs = jsonDs.map(value => {
  val startIndex = value.indexOf("\"''")
  val endIndex = value.indexOf("''\"")
  val data = value.substring(startIndex, endIndex).replace("\\", "")
  val validJson = s"${value.substring(0, startIndex)}$data${value.substring(endIndex)}"
    .replace("\"''", "")
    .replace("''\"", "")
  validJson
})

// Step 4: Dataset of valid JSON -> DataFrame
val finalDf = spark.read.json(validJsonDs)
finalDf.show(10)
finalDf.printSchema()
finalDf
  .select("fields.custid", "fields.password", "fields.role", "fields.xml_data", "xml_data.person.refname", "xml_data.person.country")
  .show(10)
Input & Output:
//Input
{"fields":{"custid":"custid","password":"password","role":"role","xml_data":"<person><refname>Test Person</refname><country>India</country></person>"}}
//Final Dataframe
+--------------------+--------------------+
| fields| xml_data|
+--------------------+--------------------+
|[custid, password...|[[India, Test Per...|
+--------------------+--------------------+
//Final Dataframe Schema
root
|-- fields: struct (nullable = true)
| |-- custid: string (nullable = true)
| |-- password: string (nullable = true)
| |-- role: string (nullable = true)
| |-- xml_data: string (nullable = true)
|-- xml_data: struct (nullable = true)
| |-- person: struct (nullable = true)
| | |-- country: string (nullable = true)
| | |-- refname: string (nullable = true)
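If the XML elements of interest are known up front, a lighter alternative to the string surgery above may be a small UDF that pulls individual elements out with scala.xml. The following is only a sketch under that assumption: the helper name xmlField is hypothetical, it reads direct children of the XML root only, and it re-parses the XML once per extracted column, which may matter on large data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, udf}
import scala.xml.XML

val spark = SparkSession.builder().master("local[*]").appName("xml-in-json-sketch").getOrCreate()
import spark.implicits._

// Same sample record as in the answer above
val input = Seq("""{"fields":{"custid":"custid","password":"password","role":"role","xml_data":"<person><refname>Test Person</refname><country>India</country></person>"}}""")
val df = spark.read.json(input.toDS)

// Hypothetical helper: text of the direct child element `tag` under the XML root.
// Returns null when the XML column is null and "" when the element is absent.
val xmlField = udf { (xmlString: String, tag: String) =>
  Option(xmlString).map(s => (XML.loadString(s) \ tag).text).orNull
}

// One flat DataFrame with the plain JSON fields next to the extracted XML fields
val flat = df.select(
  col("fields.custid"), col("fields.password"), col("fields.role"),
  xmlField(col("fields.xml_data"), lit("refname")).as("refname"),
  xmlField(col("fields.xml_data"), lit("country")).as("country"))
flat.show(false)
```

This avoids the round trip through JSON strings, at the cost of listing the wanted element names explicitly.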

Related

How to create a schema from JSON file using Spark Scala for subset of fields?

I am trying to create a schema of a nested JSON file so that it can become a dataframe.
However, I am not sure if there is way to create a schema without defining all the fields in the JSON file if I only need the 'id' and 'text' from it - a subset.
I am currently doing it using scala in spark shell. As you can see from the file, I downloaded it as part-00000 from HDFS.
From the manuals on JSON:
Apply the schema using the .schema method. This read returns only
the columns specified in the schema.
So a schema that lists only the fields you need will work.
E.g.
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val schema = new StructType()
.add("op_ts", StringType, true)
val df = spark.read.schema(schema)
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
returns:
root
|-- op_ts: string (nullable = true)
+--------------------------+
|op_ts |
+--------------------------+
|2019-05-31 04:24:34.000327|
+--------------------------+
for a file whose full schema is:
root
|-- after: struct (nullable = true)
| |-- CODE: string (nullable = true)
| |-- CREATED: string (nullable = true)
| |-- ID: long (nullable = true)
| |-- STATUS: string (nullable = true)
| |-- UPDATE_TIME: string (nullable = true)
|-- before: string (nullable = true)
|-- current_ts: string (nullable = true)
|-- op_ts: string (nullable = true)
|-- op_type: string (nullable = true)
|-- pos: string (nullable = true)
|-- primary_keys: array (nullable = true)
| |-- element: string (containsNull = true)
|-- table: string (nullable = true)
|-- tokens: struct (nullable = true)
| |-- csn: string (nullable = true)
| |-- txid: string (nullable = true)
which was obtained from the same file using:
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
The second read is only there to show the full schema for comparison.
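Applied to the question's case, a subset schema for 'id' and 'text' might look like the sketch below. Only the field names id and text come from the question; the sample record, the field types, and the nested user.name field are invented here to show that a subset can also be taken inside a struct.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder().master("local[*]").appName("subset-schema-sketch").getOrCreate()
import spark.implicits._

// Hypothetical record: only "id" and "text" are named in the question
val sample = Seq("""{"id":1,"text":"hello","lang":"en","user":{"name":"a","followers":10}}""").toDS

// List only the fields to keep; nested structs can be trimmed the same way
val subset = new StructType()
  .add("id", LongType)
  .add("text", StringType)
  .add("user", new StructType().add("name", StringType))

// Fields missing from the schema ("lang", "user.followers") are not read
val df = spark.read.schema(subset).json(sample)
df.printSchema()
df.show(false)
```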

creating dataframe from nested json

I would like to know an efficient approach here. Let's say we have JSON data as follows:
root
|-- fields: struct (nullable = true)
| |-- custid: string (nullable = true)
| |-- password: string (nullable = true)
| |-- role: string (nullable = true)
I can read this into a DataFrame using:
jsonData_1.withColumn("custid", col("fields.custid")).withColumn("password", col("fields.password")).withColumn("role", col("fields.role"))
But if we have hundreds of nested columns, or if the columns are likely to change over time or gain more nested columns, I don't think this approach is a good choice. Is there a way to make the code automatically find all the columns and sub-columns and build a DataFrame from the input JSON file, or is this the only good approach? Please share your ideas.
You don't need to specify every column of a StructType in Spark.
We can extract all struct keys by specifying struct_field.* in .select
Example:
spark.read.json(Seq("""{"fields":{"custid":"1","password":"foo","role":"rr"}}""").toDS).printSchema
//schema
//root
// |-- fields: struct (nullable = true)
// | |-- custid: string (nullable = true)
// | |-- password: string (nullable = true)
// | |-- role: string (nullable = true)
//read the json data into Dataframe.
val df=spark.read.json(Seq("""{"fields":{"custid":"1","password":"foo","role":"rr"}}""").toDS)
//get all fields values extracted from fields struct
df.select("fields.*").show()
//+------+--------+----+
//|custid|password|role|
//+------+--------+----+
//| 1| foo| rr|
//+------+--------+----+
A more dynamic way of flattening the JSON:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col
def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
schema.fields.flatMap(f => {
val colName = if (prefix == null) f.name else (prefix + "." + f.name)
f.dataType match {
case st: StructType => flattenSchema(st, colName)
case _ => Array(col(colName))
}
})
}
val df=spark.read.json(Seq("""{"fields":{"custid":"1","password":"foo","role":"rr","nested-2":{"id":"1"}}}""").toDS)
df.select(flattenSchema(df.schema):_*).show()
//+------+---+--------+----+
//|custid| id|password|role|
//+------+---+--------+----+
//| 1| 1| foo| rr|
//+------+---+--------+----+
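One thing to watch with the flattening above is that every leaf column keeps only its last name segment, so two leaves with the same name (for example a top-level id and a nested id) would end up as duplicate column names. A possible variation, sketched here with a hypothetical helper flattenWithAliases, aliases each leaf with its full path joined by underscores and backtick-quotes every path segment. The sample record is invented for illustration.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().master("local[*]").appName("flatten-alias-sketch").getOrCreate()
import spark.implicits._

// Hypothetical variant of flattenSchema: walks nested structs and aliases each
// leaf column with its full path, e.g. fields.custid -> fields_custid
def flattenWithAliases(schema: StructType, path: Seq[String] = Nil): Array[Column] =
  schema.fields.flatMap { f =>
    val full = path :+ f.name
    f.dataType match {
      case st: StructType => flattenWithAliases(st, full)
      case _ =>
        // Backticks keep unusual characters in a field name from being misparsed
        val quoted = full.map(n => s"`$n`").mkString(".")
        Array(col(quoted).alias(full.mkString("_")))
    }
  }

// Invented record with an "id" at two levels
val df = spark.read.json(Seq("""{"fields":{"custid":"1","nested-2":{"id":"1"}},"id":"9"}""").toDS)
val flat = df.select(flattenWithAliases(df.schema): _*)
flat.printSchema()
```

Arrays of structs are left as they are in this sketch, exactly as in the original flattenSchema.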

Convert a string variable in nested JSON to datetime using Spark Scala

I have a nested JSON dataframe in Spark which looks like below
root
|-- data: struct (nullable = true)
| |-- average: long (nullable = true)
| |-- sum: long (nullable = true)
| |-- time: string (nullable = true)
|-- password: string (nullable = true)
|-- url: string (nullable = true)
|-- username: string (nullable = true)
I need to convert the time variable under the data struct to the timestamp data type. Below is the code I tried, but it did not give me the result I wanted.
val jsonStr = """{
"url": "imap.yahoo.com",
"username": "myusername",
"password": "mypassword",
"data": {
"time":"2017-1-29 0-54-32",
"average": 234,
"sum": 123}}"""
val json: JsValue = Json.parse(jsonStr)
import sqlContext.implicits._
val rdd = sc.parallelize(jsonStr::Nil);
var df = sqlContext.read.json(rdd);
df.printSchema()
val dfRes = df.withColumn("data",makeTimeStamp(unix_timestamp(df("data.time"),"yyyy-MM-dd hh-mm-ss").cast("timestamp")))
dfRes.printSchema();
case class Convert(time: java.sql.Timestamp)
val makeTimeStamp = udf((time: java.sql.Timestamp) => Convert(
time))
Result of my code:
root
|-- data: struct (nullable = true)
| |-- time: timestamp (nullable = true)
|-- password: string (nullable = true)
|-- url: string (nullable = true)
|-- username: string (nullable = true)
My code removes the other elements inside the data struct (average and sum) instead of just casting the time string to the timestamp data type. For basic data management operations on JSON DataFrames, do we need to write a UDF whenever we need some functionality, or is there a library available for JSON data management? I am currently using the Play framework to work with JSON objects in Spark. Thanks in advance.
You can try this:
val jsonStr = """{
"url": "imap.yahoo.com",
"username": "myusername",
"password": "mypassword",
"data": {
"time":"2017-1-29 0-54-32",
"average": 234,
"sum": 123}}"""
val json: JsValue = Json.parse(jsonStr)
import sqlContext.implicits._
val rdd = sc.parallelize(jsonStr::Nil);
var df = sqlContext.read.json(rdd);
df.printSchema()
val dfRes = df.withColumn("data",makeTimeStamp(unix_timestamp(df("data.time"),"yyyy-MM-dd hh-mm-ss").cast("timestamp"), df("data.average"), df("data.sum")))
case class Convert(time: java.sql.Timestamp, average: Long, sum: Long)
val makeTimeStamp = udf((time: java.sql.Timestamp, average: Long, sum: Long) => Convert(time, average, sum))
This will give the result:
root
|-- url: string (nullable = true)
|-- username: string (nullable = true)
|-- password: string (nullable = true)
|-- data: struct (nullable = true)
| |-- time: timestamp (nullable = true)
| |-- average: long (nullable = false)
| |-- sum: long (nullable = false)
The changes are in the Convert case class, the makeTimeStamp UDF, and the extra arguments passed to it.
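The same result can probably be reached without a UDF or a case class by rebuilding the data struct with the built-in struct function. The sketch below assumes the question's sample record. The pattern "yyyy-M-d H-m-s" is an assumption chosen to match the unpadded month and 24-hour value in "2017-1-29 0-54-32", and differs from the pattern used in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, unix_timestamp}

val spark = SparkSession.builder().master("local[*]").appName("struct-rebuild-sketch").getOrCreate()
import spark.implicits._

val jsonStr = """{"url":"imap.yahoo.com","username":"myusername","password":"mypassword","data":{"time":"2017-1-29 0-54-32","average":234,"sum":123}}"""
val df = spark.read.json(Seq(jsonStr).toDS)

// Rebuild the struct field by field, converting only "time";
// "average" and "sum" are carried over unchanged
val dfRes = df.withColumn("data", struct(
  col("data.average").as("average"),
  col("data.sum").as("sum"),
  unix_timestamp(col("data.time"), "yyyy-M-d H-m-s").cast("timestamp").as("time")
))
dfRes.printSchema()
```

Every field that should survive has to be listed in the struct call, so for wide structs this is best combined with a loop over the struct's schema.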
Assuming you can specify the Spark schema upfront, the automatic string-to-timestamp type coercion should take care of the conversions.
import org.apache.spark.sql.types._
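val dschema = (new StructType).add("url", StringType).add("username", StringType).add(
  "data", (new StructType).add("sum", LongType).add("time", TimestampType))
// The question's "2017-1-29 0-54-32" is not in the default timestamp format,
// so the pattern is passed explicitly through the timestampFormat option.
val df = spark.read.schema(dschema).option("timestampFormat", "yyyy-M-d H-m-s").json("/your/json/on/hdfs")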
df.printSchema
df.show
This article outlines a few more techniques to deal with bad data; worth a read for your use-case.

Re-using A Schema from JSON within a Spark DataFrame using Scala

I have some JSON data like this:
{"gid":"111","createHour":"2014-10-20 01:00:00.0","revisions":[{"revId":"2","modDate":"2014-11-20 01:40:37.0"},{"revId":"4","modDate":"2014-11-20 01:40:40.0"}],"comments":[],"replies":[]}
{"gid":"222","createHour":"2014-12-20 01:00:00.0","revisions":[{"revId":"2","modDate":"2014-11-20 01:39:31.0"},{"revId":"4","modDate":"2014-11-20 01:39:34.0"}],"comments":[],"replies":[]}
{"gid":"333","createHour":"2015-01-21 00:00:00.0","revisions":[{"revId":"25","modDate":"2014-11-21 00:34:53.0"},{"revId":"110","modDate":"2014-11-21 00:47:10.0"}],"comments":[{"comId":"4432","content":"How are you?"}],"replies":[{"repId":"4441","content":"I am good."}]}
{"gid":"444","createHour":"2015-09-20 23:00:00.0","revisions":[{"revId":"2","modDate":"2014-11-20 23:23:47.0"}],"comments":[],"replies":[]}
{"gid":"555","createHour":"2016-01-21 01:00:00.0","revisions":[{"revId":"135","modDate":"2014-11-21 01:01:58.0"}],"comments":[],"replies":[]}
{"gid":"666","createHour":"2016-04-23 19:00:00.0","revisions":[{"revId":"136","modDate":"2014-11-23 19:50:51.0"}],"comments":[],"replies":[]}
I can read it in:
val df = sqlContext.read.json("./data/full.json")
I can print the schema with df.printSchema
root
|-- comments: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- comId: string (nullable = true)
| | |-- content: string (nullable = true)
|-- createHour: string (nullable = true)
|-- gid: string (nullable = true)
|-- replies: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: string (nullable = true)
| | |-- repId: string (nullable = true)
|-- revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- modDate: string (nullable = true)
| | |-- revId: string (nullable = true)
I can show the data df.show(10,false)
+---------------------+---------------------+---+-------------------+---------------------------------------------------------+
|comments |createHour |gid|replies |revisions |
+---------------------+---------------------+---+-------------------+---------------------------------------------------------+
|[] |2014-10-20 01:00:00.0|111|[] |[[2014-11-20 01:40:37.0,2], [2014-11-20 01:40:40.0,4]] |
|[] |2014-12-20 01:00:00.0|222|[] |[[2014-11-20 01:39:31.0,2], [2014-11-20 01:39:34.0,4]] |
|[[4432,How are you?]]|2015-01-21 00:00:00.0|333|[[I am good.,4441]]|[[2014-11-21 00:34:53.0,25], [2014-11-21 00:47:10.0,110]]|
|[] |2015-09-20 23:00:00.0|444|[] |[[2014-11-20 23:23:47.0,2]] |
|[] |2016-01-21 01:00:00.0|555|[] |[[2014-11-21 01:01:58.0,135]] |
|[] |2016-04-23 19:00:00.0|666|[] |[[2014-11-23 19:50:51.0,136]] |
+---------------------+---------------------+---+-------------------+---------------------------------------------------------+
I can print / read the schema val dfSc = df.schema as:
StructType(StructField(comments,ArrayType(StructType(StructField(comId,StringType,true), StructField(content,StringType,true)),true),true), StructField(createHour,StringType,true), StructField(gid,StringType,true), StructField(replies,ArrayType(StructType(StructField(content,StringType,true), StructField(repId,StringType,true)),true),true), StructField(revisions,ArrayType(StructType(StructField(modDate,StringType,true), StructField(revId,StringType,true)),true),true))
I can print this out nicer:
println(df.schema.fields.mkString(",\n"))
StructField(comments,ArrayType(StructType(StructField(comId,StringType,true), StructField(content,StringType,true)),true),true),
StructField(createHour,StringType,true),
StructField(gid,StringType,true),
StructField(replies,ArrayType(StructType(StructField(content,StringType,true), StructField(repId,StringType,true)),true),true),
StructField(revisions,ArrayType(StructType(StructField(modDate,StringType,true), StructField(revId,StringType,true)),true),true)
Now if I read in the same file without the comments and replies rows (simply deleting those rows), with val df2 = sqlContext.read.json("./data/partialRevOnly.json"), I get something like this with printSchema:
root
|-- comments: array (nullable = true)
| |-- element: string (containsNull = true)
|-- createHour: string (nullable = true)
|-- gid: string (nullable = true)
|-- replies: array (nullable = true)
| |-- element: string (containsNull = true)
|-- revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- modDate: string (nullable = true)
| | |-- revId: string (nullable = true)
I don't like that, so I use:
val df3 = sqlContext.read.
schema(dfSc).
json("./data/partialRevOnly.json")
where the original schema was dfSc. So now I get exactly the schema I had before with the removed data:
root
|-- comments: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- comId: string (nullable = true)
| | |-- content: string (nullable = true)
|-- createHour: string (nullable = true)
|-- gid: string (nullable = true)
|-- replies: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: string (nullable = true)
| | |-- repId: string (nullable = true)
|-- revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- modDate: string (nullable = true)
| | |-- revId: string (nullable = true)
This is perfect ... well almost. I would like to assign this schema to a variable similar to this:
val textSc = StructField(comments,ArrayType(StructType(StructField(comId,StringType,true), StructField(content,StringType,true)),true),true),
StructField(createHour,StringType,true),
StructField(gid,StringType,true),
StructField(replies,ArrayType(StructType(StructField(content,StringType,true), StructField(repId,StringType,true)),true),true),
StructField(revisions,ArrayType(StructType(StructField(modDate,StringType,true), StructField(revId,StringType,true)),true),true)
OK - This won't work due to double quotes, and 'some other structural' stuff, so try this (with error):
import org.apache.spark.sql.types._
val textSc = StructType(Array(
StructField("comments",ArrayType(StructType(StructField("comId",StringType,true), StructField("content",StringType,true)),true),true),
StructField("createHour",StringType,true),
StructField("gid",StringType,true),
StructField("replies",ArrayType(StructType(StructField("content",StringType,true), StructField("repId",StringType,true)),true),true),
StructField("revisions",ArrayType(StructType(StructField("modDate",StringType,true), StructField("revId",StringType,true)),true),true)
))
Name: Compile Error
Message: <console>:78: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
StructField("comments",ArrayType(StructType(StructField("comId",StringType,true), StructField("content",StringType,true)),true),true),
... Without this error (that I cannot figure a quick way around), I would like to then use textSc in place of dfSc to read in the JSON data with an imposed schema.
I cannot find a '1-to-1 match' way of getting (via println or ...) the schema with acceptable syntax (sort of like above). I suppose some coding can be done with case matching to iron out the double quotes. However, I'm still unclear what rules are required to get the exact schema out of the test fixture that I can simply re-use in my recurring production (versus test fixture) code. Is there a way to get this schema to print exactly as I would code it?
Note: This includes double quotes and all the proper StructField/Types and so forth to be code-compatible drop in.
As a sidebar, I thought about saving a fully-formed golden JSON file to use at the start of the Spark job, but I would like to eventually use date fields and other more concise types instead of strings at the applicable structural locations.
How can I get the dataFrame information coming out of my test harness (using a fully-formed JSON input row with comments and replies) to a point where I can drop the schema as source-code into production code Scala Spark job?
Note: The best answer is some coding means, but an explanation so I can trudge, plod, toil, wade, plow and slog thru the coding is helpful too. :)
I recently ran into this. I'm using Spark 2.0.2 so I don't know if this solution works with earlier versions.
import scala.util.Try
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser
import org.apache.spark.sql.types.{DataType, StructType}
/** Produce a Schema string from a Dataset */
def serializeSchema(ds: Dataset[_]): String = ds.schema.json
/** Produce a StructType schema object from a JSON string */
def deserializeSchema(json: String): StructType = {
Try(DataType.fromJson(json)).getOrElse(LegacyTypeStringParser.parse(json)) match {
case t: StructType => t
case _ => throw new RuntimeException(s"Failed parsing StructType: $json")
}
}
Note that I copied the "deserialize" function from a private function in Spark's StructType object, so I don't know how well it will be supported across versions.
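As a rough usage sketch of the same idea, the round trip can be done with the public schema.json and DataType.fromJson calls alone. The records below are trimmed, hypothetical versions of the question's data, and in a real job the JSON string would be stored in a file or resource rather than held in a variable.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataType, StructType}

val spark = SparkSession.builder().master("local[*]").appName("schema-roundtrip-sketch").getOrCreate()
import spark.implicits._

// Test-harness side: infer the schema from a fully formed record and serialize it
val golden = Seq("""{"gid":"111","revisions":[{"revId":"2","modDate":"2014-11-20 01:40:37.0"}]}""").toDS
val inferred: StructType = spark.read.json(golden).schema
val schemaJson: String = inferred.json

// Production side: rebuild the StructType from the stored string and impose it
val restored: StructType = DataType.fromJson(schemaJson).asInstanceOf[StructType]
val df = spark.read.schema(restored).json(Seq("""{"gid":"222"}""").toDS)
df.printSchema()
```

The stored JSON can be edited by hand, for example to change a string field to a timestamp type, before it is loaded again.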
Well, the error message should tell you everything you have to know here - StructType expects a sequence of fields as an argument. So in your case schema should look like this:
StructType(Seq(
StructField("comments", ArrayType(StructType(Seq( // <- Seq[StructField]
StructField("comId", StringType, true),
StructField("content", StringType, true))), true), true),
StructField("createHour", StringType, true),
StructField("gid", StringType, true),
StructField("replies", ArrayType(StructType(Seq( // <- Seq[StructField]
StructField("content", StringType, true),
StructField("repId", StringType, true))), true), true),
StructField("revisions", ArrayType(StructType(Seq( // <- Seq[StructField]
StructField("modDate", StringType, true),
StructField("revId", StringType, true))),true), true)))

How to read a Json file with a specific format with Spark Scala?

I'm trying to read a JSON file that looks like this:
[
{"IFAM":"EQR","KTM":1430006400000,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"31","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"5","up":null,"Crate":"2"}
,{"MLrate":"34","Nrout":"0","up":null,"Crate":"4"}
,{"MLrate":"33","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"8","up":null,"Crate":"2"}
]}
,{"IFAM":"EQR","KTM":1430006400000,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"0","up":null,"Crate":"0"}
,{"MLrate":"35","Nrout":"1","up":null,"Crate":"5"}
,{"MLrate":"30","Nrout":"6","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"38","Nrout":"8","up":null,"Crate":"1"}
]}
,...
]
I've tried the command:
val df = sqlContext.read.json("namefile")
df.show()
But this does not work: my columns are not recognized.
If you want to use read.json, you need a single JSON document per line. If your file contains a valid JSON array of documents, it simply won't work as expected. For example, with your sample data the input file should be formatted like this:
{"IFAM":"EQR","KTM":1430006400000,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}, {"MLrate":"31","Nrout":"0","up":null,"Crate":"2"}, {"MLrate":"30","Nrout":"5","up":null,"Crate":"2"} ,{"MLrate":"34","Nrout":"0","up":null,"Crate":"4"} ,{"MLrate":"33","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"8","up":null,"Crate":"2"} ]}
{"IFAM":"EQR","KTM":1430006400000,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"0"} ,{"MLrate":"35","Nrout":"1","up":null,"Crate":"5"} ,{"MLrate":"30","Nrout":"6","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"38","Nrout":"8","up":null,"Crate":"1"} ]}
If you use read.json on above structure you'll see it is parsed as expected:
scala> sqlContext.read.json("namefile").printSchema
root
|-- COL: long (nullable = true)
|-- DATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Crate: string (nullable = true)
| | |-- MLrate: string (nullable = true)
| | |-- Nrout: string (nullable = true)
| | |-- up: string (nullable = true)
|-- IFAM: string (nullable = true)
|-- KTM: long (nullable = true)
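On Spark 2.2 and later, the multiLine reader option may make the reformatting unnecessary, since it lets read.json parse a file that holds one JSON array spanning several lines. The sketch below writes a shortened copy of the question's data to a temporary file so that it is self-contained; the temporary file and its name are only for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("multiline-json-sketch").getOrCreate()

// Shortened copy of the question's file: one JSON array spread over several lines
val path = Files.createTempFile("array-sample", ".json")
Files.write(path, """[
{"IFAM":"EQR","KTM":1430006400000,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"31","Nrout":"0","up":null,"Crate":"2"}
]}
,{"IFAM":"EQR","KTM":1430006400000,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
]}
]""".getBytes(StandardCharsets.UTF_8))

// multiLine parses each file as a whole; each element of the top-level array becomes a row
val df = spark.read.option("multiLine", true).json(path.toString)
df.printSchema()
df.show(false)
```

With multiLine a file is not split across tasks, so very large single files may read more slowly than line-delimited JSON.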
If you don't want to reformat your JSON file line by line, you could create a schema using StructType (and MapType where needed) from Spark SQL and pass the JSON to Spark as a string:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Convenience function for turning JSON strings into DataFrames
def jsonToDataFrame(json: String, schema: StructType = null): DataFrame = {
  val reader = spark.read
  Option(schema).foreach(reader.schema)
  reader.json(sc.parallelize(Array(json)))
}
// Using a struct
val schema = new StructType().add("a", new StructType().add("b", IntegerType))
// call the function passing the sample JSON data and the schema as parameter
val json_df = jsonToDataFrame("""
{
"a": {
"b": 1
}
} """, schema)
// now you can access your json fields
val b_value = json_df.select("a.b")
b_value.show()
See this reference documentation for more examples and details
https://docs.databricks.com/spark/latest/spark-sql/complex-types.html#transform-complex-data-types-scala