I have a text file with complex-type columns. Could you please tell me how to automatically infer a schema with array, map, and struct types in Spark?
Source:
name,work_place,gender_age,skills_score,depart_title,work_contractor
Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead
Code example:
val employeeComplexDF = spark
.read
.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/employee_complex/employee.txt")
Parsed schema (actual):
root
|-- name: string (nullable = true)
|-- work_place: string (nullable = true)
|-- gender_age: string (nullable = true)
|-- skills_score: string (nullable = true)
|-- depart_title: string (nullable = true)
|-- work_contractor: string (nullable = true)
The expected schema, however, would use ArrayType, MapType, and StructType for these columns.
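Note that Spark's CSV inferSchema only ever infers primitive types, so ArrayType, MapType, and StructType columns have to be built by hand after the fields are read as plain strings. Below is a minimal sketch of one way to do that, assuming the data rows are pipe-delimited as shown (the comma-separated header line would have to be skipped or rewritten, and work_contractor is left out because the sample rows carry no value for it); the split and map delimiters are assumptions based on the sample data:
import org.apache.spark.sql.functions.{col, expr, split}

// Read the pipe-delimited rows as plain strings; CSV schema inference cannot produce complex types.
val raw = spark.read
  .option("delimiter", "|")
  .csv("src/main/resources/employee_complex/employee.txt")
  .toDF("name", "work_place", "gender_age", "skills_score", "depart_title")

// Build the complex columns explicitly.
val employeeComplexDF = raw
  .withColumn("work_place", split(col("work_place"), ","))                // array<string>
  .withColumn("gender_age", expr(
    "named_struct('gender', split(gender_age, ',')[0], " +
    "'age', cast(split(gender_age, ',')[1] as int))"))                    // struct<gender:string, age:int>
  .withColumn("skills_score", expr("str_to_map(skills_score, ',', ':')")) // map<string,string>
  .withColumn("depart_title", expr("str_to_map(depart_title, ',', ':')")) // map<string,string>

employeeComplexDF.printSchema()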
I am trying to create a schema of a nested JSON file so that it can become a dataframe.
However, I am not sure if there is way to create a schema without defining all the fields in the JSON file if I only need the 'id' and 'text' from it - a subset.
I am currently doing this using Scala in the Spark shell. As you can see from the file, I downloaded it as part-00000 from HDFS.
From the manuals on JSON:
Apply the schema using the .schema method. This read returns only
the columns specified in the schema.
So you are good to go with what you are suggesting.
E.g.
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
val schema = new StructType()
.add("op_ts", StringType, true)
val df = spark.read.schema(schema)
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
returns:
root
|-- op_ts: string (nullable = true)
+--------------------------+
|op_ts |
+--------------------------+
|2019-05-31 04:24:34.000327|
+--------------------------+
For comparison, the full schema of the same file is:
root
|-- after: struct (nullable = true)
| |-- CODE: string (nullable = true)
| |-- CREATED: string (nullable = true)
| |-- ID: long (nullable = true)
| |-- STATUS: string (nullable = true)
| |-- UPDATE_TIME: string (nullable = true)
|-- before: string (nullable = true)
|-- current_ts: string (nullable = true)
|-- op_ts: string (nullable = true)
|-- op_type: string (nullable = true)
|-- pos: string (nullable = true)
|-- primary_keys: array (nullable = true)
| |-- element: string (containsNull = true)
|-- table: string (nullable = true)
|-- tokens: struct (nullable = true)
| |-- csn: string (nullable = true)
| |-- txid: string (nullable = true)
obtained from the same file using:
val df = spark.read
.option("multiLine", true).option("mode", "PERMISSIVE")
.json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
The latter is just for proof.
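The same idea extends to nested fields: if only part of a struct is needed, you can declare a partial nested StructType and Spark will skip everything else. A minimal sketch against the same file, assuming you want after.ID alongside op_ts (partialSchema and subsetDf are just illustrative names):
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Declare only the fields of interest; nested fields go inside a nested StructType.
val partialSchema = new StructType()
  .add("after", new StructType().add("ID", LongType, true), true)
  .add("op_ts", StringType, true)

val subsetDf = spark.read.schema(partialSchema)
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/FileStore/tables/json_stuff.txt")

subsetDf.printSchema()
subsetDf.select("after.ID", "op_ts").show(false)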
I would like to know how to read and parse the xml field which is a part of JSON data.
root
|-- fields: struct (nullable = true)
| |-- custid: string (nullable = true)
| |-- password: string (nullable = true)
| |-- role: string (nullable = true)
| |-- xml_data: string (nullable = true)
and that xml_data has lots of columns in it. Let's say the fields inside xml_data behave like nested columns of the "fields" data. So how do I parse all of the columns "custid", "password", "role", "xml_data.refid", "xml_data.refname" into one DataFrame?
Long question short: how do I read and parse XML data that sits inside a JSON file as string content?
This is a little tricky, but it can be achieved in the following simple steps:
1. Parse the XML string to a JSON string and append an identifier to it (in the code below: '').
2. Convert the entire DataFrame to a Dataset of JSON strings.
3. Map over the Dataset of strings and build valid JSON by locating the identifier appended in step 1.
4. Convert the Dataset of valid JSON back to a DataFrame. That's it, done!
import spark.implicits._
import scala.xml.XML
import org.json4s.Xml.toJson
import org.json4s.jackson.JsonMethods.{compact, render}
import org.apache.spark.sql.functions.{col, udf}
val rdd = spark
.sparkContext
.parallelize(Seq("{\"fields\":{\"custid\":\"custid\",\"password\":\"password\",\"role\":\"role\",\"xml_data\":\"<person><refname>Test Person</refname><country>India</country></person>\"}}"))
val df = spark.read.json(rdd.toDS())
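// UDF: parse the embedded XML string, convert it to a JSON string, and wrap it in '' markers so it can be located again later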
val xmlToJsonUDF = udf { xmlString: String =>
val xml = XML.loadString(xmlString)
s"''${compact(render(toJson(xml)))}''"
}
val xmlParsedDf = df.withColumn("xml_data", xmlToJsonUDF(col("fields.xml_data")))
val jsonDs = xmlParsedDf.toJSON
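// Rebuild valid JSON: locate the '' markers added by the UDF, strip the escaping inside the embedded JSON, then drop the markers and surrounding quotes so it becomes a nested object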
val validJsonDs = jsonDs.map(value => {
val startIndex = value.indexOf("\"''")
val endIndex = value.indexOf("''\"")
val data = value.substring(startIndex, endIndex).replace("\\", "")
val validJson = s"${value.substring(0, startIndex)}$data${value.substring(endIndex)}"
.replace("\"''", "")
.replace("''\"", "")
validJson
})
val finalDf = spark.read.json(validJsonDs)
finalDf.show(10)
finalDf.printSchema()
finalDf
.select("fields.custid", "fields.password", "fields.role", "fields.xml_data", "xml_data.person.refname", "xml_data.person.country")
.show(10)
Input & Output:
//Input
{"fields":{"custid":"custid","password":"password","role":"role","xml_data":"<person><refname>Test Person</refname><country>India</country></person>"}}
//Final Dataframe
+--------------------+--------------------+
| fields| xml_data|
+--------------------+--------------------+
|[custid, password...|[[India, Test Per...|
+--------------------+--------------------+
//Final Dataframe Schema
root
|-- fields: struct (nullable = true)
| |-- custid: string (nullable = true)
| |-- password: string (nullable = true)
| |-- role: string (nullable = true)
| |-- xml_data: string (nullable = true)
|-- xml_data: struct (nullable = true)
| |-- person: struct (nullable = true)
| | |-- country: string (nullable = true)
| | |-- refname: string (nullable = true)
I'm trying to learn how to use Spark to process JSON data, and I have a fairly simple JSON file that looks like this:
{"key": { "defaultWeights":"1" }, "measures": { "m1":-0.01, "m2":-0.5.....}}
When I load this file into a Spark dataframe and run the following code:
val flattened = dff.withColumn("default_weights", json_tuple(col("key"), "defaultWeights")).show
I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'json_tuple(`key`, 'defaultWeights')' due to data type mismatch: json_tuple requires that all arguments are strings;;
'Project [key#6, measures#7, json_tuple(key#6, defaultWeights) AS default_weights#13]
+- Relation[key#6,measures#7] json
If I change my code to make sure both arguments are strings, I get this error:
<console>:25: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
val flattened = dff.withColumn("default_weights", json_tuple("key", "defaultWeights")).show
So as you can see, I am literally going around in circles!
json_tuple would work if your key column were text and not a struct. Let me show you:
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{functions, SparkSession}

val contentStruct =
  """|{"key": { "defaultWeights":"1", "c": "a" }, "measures": { "m1":-0.01, "m2":-0.5}}""".stripMargin
FileUtils.writeStringToFile(new File("/tmp/test_flat.json"), contentStruct)
val sparkSession: SparkSession = SparkSession.builder()
.appName("Spark SQL json_tuple")
.master("local[*]").getOrCreate()
import sparkSession.implicits._
sparkSession.read.json("/tmp/test_flat.json").printSchema()
The schema will be:
root
|-- key: struct (nullable = true)
| |-- c: string (nullable = true)
| |-- defaultWeights: string (nullable = true)
|-- measures: struct (nullable = true)
| |-- m1: double (nullable = true)
| |-- m2: double (nullable = true)
So, de facto, you don't need to extract defaultWeights at all. You can simply reference it with a JSON path (key.defaultWeights):
sparkSession.read.json("/tmp/test_flat.json").select("key.defaultWeights").show()
+--------------+
|defaultWeights|
+--------------+
| 1|
+--------------+
Otherwise, to use json_tuple, your JSON should look like this:
val contentString =
"""|{"key": "{ \"defaultWeights\":\"1\", \"c\": \"a\" }", "measures": { "m1":-0.01, "m2":-0.5}}""".stripMargin
In that case, the schema will be:
root
|-- key: string (nullable = true)
|-- measures: struct (nullable = true)
| |-- m1: double (nullable = true)
| |-- m2: double (nullable = true)
And, once contentString has been written to the same file:
FileUtils.writeStringToFile(new File("/tmp/test_flat.json"), contentString)
sparkSession.read.json("/tmp/test_flat.json")
  .withColumn("default_weights", functions.json_tuple($"key", "defaultWeights")).show(false)
will return:
+----------------------------------+-------------+---------------+
|key |measures |default_weights|
+----------------------------------+-------------+---------------+
|{ "defaultWeights":"1", "c": "a" }|[-0.01, -0.5]|1 |
+----------------------------------+-------------+---------------+
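As a side note, if the column is already a struct (as in the question) and json_tuple is still wanted, one workaround not shown above is to serialize the struct back to a JSON string with to_json first; a minimal sketch, reusing the question's dff DataFrame:
import org.apache.spark.sql.functions.{col, json_tuple, to_json}

// to_json turns the struct column into a JSON string, which json_tuple can then parse.
dff.withColumn("default_weights", json_tuple(to_json(col("key")), "defaultWeights")).show(false)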
I'm trying to read a JSON file that looks like this:
[
{"IFAM":"EQR","KTM":1430006400000,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"31","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"5","up":null,"Crate":"2"}
,{"MLrate":"34","Nrout":"0","up":null,"Crate":"4"}
,{"MLrate":"33","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"8","up":null,"Crate":"2"}
]}
,{"IFAM":"EQR","KTM":1430006400000,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"0","up":null,"Crate":"0"}
,{"MLrate":"35","Nrout":"1","up":null,"Crate":"5"}
,{"MLrate":"30","Nrout":"6","up":null,"Crate":"2"}
,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
,{"MLrate":"38","Nrout":"8","up":null,"Crate":"1"}
]}
,...
]
I've tried the command:
val df = sqlContext.read.json("namefile")
df.show()
But this does not work: my columns are not recognized...
If you want to use read.json you need a single JSON document per line. If your file contains a valid JSON array of documents, it simply won't work as expected. For example, taking your example data, the input file should be formatted like this:
{"IFAM":"EQR","KTM":1430006400000,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}, {"MLrate":"31","Nrout":"0","up":null,"Crate":"2"}, {"MLrate":"30","Nrout":"5","up":null,"Crate":"2"} ,{"MLrate":"34","Nrout":"0","up":null,"Crate":"4"} ,{"MLrate":"33","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"8","up":null,"Crate":"2"} ]}
{"IFAM":"EQR","KTM":1430006400000,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"0"} ,{"MLrate":"35","Nrout":"1","up":null,"Crate":"5"} ,{"MLrate":"30","Nrout":"6","up":null,"Crate":"2"} ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"} ,{"MLrate":"38","Nrout":"8","up":null,"Crate":"1"} ]}
If you use read.json on above structure you'll see it is parsed as expected:
scala> sqlContext.read.json("namefile").printSchema
root
|-- COL: long (nullable = true)
|-- DATA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Crate: string (nullable = true)
| | |-- MLrate: string (nullable = true)
| | |-- Nrout: string (nullable = true)
| | |-- up: string (nullable = true)
|-- IFAM: string (nullable = true)
|-- KTM: long (nullable = true)
If you don't want to reformat your JSON file (one document per line), you can create a schema using StructType and MapType with the Spark SQL types:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Convenience function for turning JSON strings into DataFrames
def jsonToDataFrame(json: String, schema: StructType = null): DataFrame = {
  val reader = spark.read
  Option(schema).foreach(reader.schema)
  reader.json(sc.parallelize(Array(json)))
}
// Using a struct
val schema = new StructType().add("a", new StructType().add("b", IntegerType))
// call the function passing the sample JSON data and the schema as parameter
val json_df = jsonToDataFrame("""
{
"a": {
"b": 1
}
} """, schema)
// now you can access your json fields
val b_value = json_df.select("a.b")
b_value.show()
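As a side note, on Spark 2.2 and later the multiLine option (used in the earlier answers above) can also read a file that contains a single top-level JSON array directly, without reformatting it line by line; a minimal sketch:
// Spark 2.2+: each element of the top-level array becomes one row.
val df = spark.read.option("multiLine", true).json("namefile")
df.printSchema()
df.show()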
See this reference documentation for more examples and details
https://docs.databricks.com/spark/latest/spark-sql/complex-types.html#transform-complex-data-types-scala