Parse JSON file using Spark Scala

I have a JSON source data file like the one below, and I need the expected results in quite a different format, which is also shown below. Is there a way I can achieve this using Spark Scala? I'd appreciate your help on this.
JSON source data file
{
  "APP": [
    {
      "E": 1566799999225,
      "V": 44.0
    },
    {
      "E": 1566800002758,
      "V": 61.0
    }
  ],
  "ASP": [
    {
      "E": 1566800009446,
      "V": 23.399999618530273
    }
  ],
  "TT": 0,
  "TVD": [
    {
      "E": 1566799964040,
      "V": 50876515
    }
  ],
  "VIN": "FU74HZ501740XXXXX"
}
Expected Results:
JSON Schema:
|-- APP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- ASP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- ATO: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- RPM: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- TT: long (nullable = true)
|-- TVD: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- VIN: string (nullable = true)

You can start by reading your json file:
import org.apache.spark.sql.DataFrame

val inputDataFrame: DataFrame = sparkSession
  .read
  .option("multiline", true)
  .json(yourJsonPath)
Then you can create a simple rule to pick up the signal fields such as APP, ASP and TVD, since they are the only fields in the input with an array datatype:
import org.apache.spark.sql.types.{ArrayType, StructField}

val inputDataFrameFields: Array[StructField] = inputDataFrame.schema.fields
val snColumn = new Array[String](inputDataFrame.schema.length)
for (x <- 0 to (inputDataFrame.schema.length - 1)) {
  if (inputDataFrameFields(x).dataType.isInstanceOf[ArrayType] && !inputDataFrameFields(x).name.isEmpty) {
    snColumn(x) = inputDataFrameFields(x).name
  }
}
Then you create your empty dataframe as follows and populate it:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val outputSchema = StructType(
  List(
    StructField("VIN", StringType, true),
    StructField(
      "EVENTS",
      ArrayType(
        StructType(Array(
          StructField("SN", StringType, true),
          StructField("E", LongType, true), // E holds epoch milliseconds, which overflow an IntegerType
          StructField("V", DoubleType, true)
        )))),
    StructField("TT", StringType, true)
  )
)
val outputDataFrame = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row], outputSchema)
Then you need to create some udfs to parse your input and do the correct mapping.
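As a rough sketch of that mapping step (my own assumption rather than part of the original answer, and it assumes Spark 2.4+ for the transform and concat functions on arrays): instead of UDFs, the per-signal arrays collected in snColumn above can be reshaped with built-in functions and concatenated into the EVENTS column of the output schema:
import org.apache.spark.sql.functions._

// Turn each array of {E, V} structs into an array of {SN, E, V} structs, one column per signal name.
val eventColumns = snColumn.filter(_ != null).map { name =>
  expr(s"transform($name, x -> named_struct('SN', '$name', 'E', x.E, 'V', CAST(x.V AS DOUBLE)))")
}

// Concatenate the per-signal arrays into a single EVENTS array alongside VIN and TT.
val populatedDataFrame = inputDataFrame.select(
  col("VIN"),
  concat(eventColumns: _*).as("EVENTS"),
  col("TT").cast("string").as("TT")
)
populatedDataFrame.printSchema()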
Hope this helps

Here is a solution to parse JSON into a Spark dataframe, adapted to your data:
val input = "{\"APP\":[{\"E\":1566799999225,\"V\":44.0},{\"E\":1566800002758,\"V\":61.0}],\"ASP\":[{\"E\":1566800009446,\"V\":23.399999618530273}],\"TT\":0,\"TVD\":[{\"E\":1566799964040,\"V\":50876515}],\"VIN\":\"FU74HZ501740XXXXX\"}"
import sparkSession.implicits._
import org.apache.spark.sql.functions._

val outputDataFrame = sparkSession.read.option("multiline", true).option("mode", "PERMISSIVE")
  .json(Seq(input).toDS)
  .withColumn("APP", explode(col("APP")))
  .withColumn("ASP", explode(col("ASP")))
  .withColumn("TVD", explode(col("TVD")))
  .select(
    col("VIN"), col("TT"),
    col("APP").getItem("E").as("APP_E"),
    col("APP").getItem("V").as("APP_V"),
    col("ASP").getItem("E").as("ASP_E"),
    col("ASP").getItem("V").as("ASP_V"),
    col("TVD").getItem("E").as("TVD_E"),
    col("TVD").getItem("V").as("TVD_V")
  )
outputDataFrame.show(truncate = false)
/*
+-----------------+---+-------------+-----+-------------+------------------+-------------+--------+
|VIN              |TT |APP_E        |APP_V|ASP_E        |ASP_V             |TVD_E        |TVD_V   |
+-----------------+---+-------------+-----+-------------+------------------+-------------+--------+
|FU74HZ501740XXXXX|0 |1566799999225|44.0 |1566800009446|23.399999618530273|1566799964040|50876515|
|FU74HZ501740XXXXX|0 |1566800002758|61.0 |1566800009446|23.399999618530273|1566799964040|50876515|
+-----------------+---+-------------+-----+-------------+------------------+-------------+--------+
*/
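A side note (my own hedged sketch, not part of the answer above, reusing its input and sparkSession): exploding APP, ASP and TVD in the same pass produces a Cartesian product of the three arrays, which is why the ASP and TVD values are repeated in the output. If one row per reading is preferred, each signal can be exploded separately and the results unioned into a long format:
import org.apache.spark.sql.functions._

val df = sparkSession.read.option("multiline", true).json(Seq(input).toDS)
val longFormat = Seq("APP", "ASP", "TVD").map { name =>
  // one explode per signal, tagged with its name, then a common (VIN, TT, SN, E, V) layout
  df.select(col("VIN"), col("TT"), lit(name).as("SN"), explode(col(name)).as("evt"))
    .select(col("VIN"), col("TT"), col("SN"),
      col("evt.E").as("E"), col("evt.V").cast("double").as("V"))
}.reduce((a, b) => a.unionByName(b))
longFormat.show(truncate = false)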

Related

replace values in non standard Json files in pyspark

I need help to solve an issue in Databricks, preferably with PySpark code.
I have a troublesome JSON format that creates issues when importing the file. The issue lies in the data element, whose entries are keyed by the numbers 58251 and 58252. The desired state would be to remove this key/enumerator. The same issue applies to the "lines" elements 40000, 40001, etc., which would also be better without the keys. See the example code/JSON below.
jsonfile= """[ {
"success":true,
"numRows":2,
"data":{
"58251":{
"invoiceno":"58251",
"name":"invoice1",
"companyId":"1000",
"departmentId":"1",
"lines":{
"40000":{
"invoiceline":"40000",
"productid":"1",
"amount":"10000",
"quantity":"7"
},
"40001":{
"invoiceline":"40001",
"productid":"2",
"amount":"9000",
"quantity":"7"
}
}
},
"58252":{
"invoiceno":"58252",
"name":"invoice34",
"companyId":"1001",
"departmentId":"2",
"lines":{
"40002":{
"invoiceline":"40002",
"productid":"3",
"amount":"7000",
"quantity":"6"
},
"40003":{
"invoiceline":"40003",
"productid":"2",
"amount":"9000",
"quantity":"7"
},
"40004":{
"invoiceline":"40004",
"productid":"2",
"amount":"9000",
"quantity":"7"
} } } } }]"""
import pandas as pd
df = pd.read_json(jsonfile)
display(df)
Is it possible to change the json file to the format below with pyspark code?
The desired json format below:
jsonfile= """[ {
"success":true,
"numRows":2,
"data":[
{
"invoiceno":"58251",
"name":"invoice1",
"companyId":"1000",
"departmentId":"1",
"lines":[
{
"invoiceline":"40000",
"productid":"1",
"amount":"10000",
"quantity":"7"
},
{
"invoiceline":"40001",
"productid":"2",
"amount":"9000",
"quantity":"7"
}
]
},
{
"invoiceno":"58252",
"name":"invoice34",
"companyId":"1001",
"departmentId":"2",
"lines":[
{
"invoiceline":"40002",
"productid":"3",
"amount":"7000",
"quantity":"6"
},
{
"invoiceline":"40003",
"productid":"2",
"amount":"9000",
"quantity":"7"
},
{
"invoiceline":"40004",
"productid":"2",
"amount":"9000",
"quantity":"7"
}
]
} ] }]"""
import pandas as pd
df = pd.read_json(jsonfile)
display(df)
This is quite an interesting problem. The main idea of my solution is to loop through your schema to build a map between each invoice key and its line keys, then select and reconstruct another JSON structure from there.
Build mapping
# my `a.json` file has exactly the content of your sample JSON
df = spark.read.json('a.json', multiLine=True)

# for each invoice key under `data`, collect the keys of its `lines` struct
mapping = {d.name: d.dataType['lines'].dataType.fieldNames() for d in df.schema['data'].dataType.fields}
# {'58251': ['40000', '40001'], '58252': ['40002', '40003', '40004']}
Reconstruct JSON
You might want to break this down and debug it step by step:
import pyspark.sql.functions as F

cols = []
for company, lines in mapping.items():
    cols.append(F.struct(
        F.col(f'data.{company}.invoiceno'),
        F.col(f'data.{company}.name'),
        F.col(f'data.{company}.companyId'),
        F.col(f'data.{company}.departmentId'),
        F.array([f'data.{company}.lines.{line}' for line in lines]).alias('lines')
    ))
(df
    .select(
        F.col('success'),
        F.col('numRows'),
        F.array(cols).alias('data')
    )
    .printSchema()
)
root
|-- success: boolean (nullable = true)
|-- numRows: long (nullable = true)
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- invoiceno: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- companyId: string (nullable = true)
| | |-- departmentId: string (nullable = true)
| | |-- lines: array (nullable = false)
| | | |-- element: struct (containsNull = true)
| | | | |-- amount: string (nullable = true)
| | | | |-- invoiceline: string (nullable = true)
| | | | |-- productid: string (nullable = true)
| | | | |-- quantity: string (nullable = true)

How to create a schema from JSON file using Spark Scala for subset of fields?

I am trying to create a schema of a nested JSON file so that it can become a dataframe.
However, I am not sure if there is a way to create a schema without defining all the fields in the JSON file, if I only need the 'id' and 'text' fields from it - a subset.
I am currently doing it using Scala in the spark shell. As you can see from the file, I downloaded it as part-00000 from HDFS.
From the manuals on JSON:
Apply the schema using the .schema method. This read returns only
the columns specified in the schema.
So you are good to go with what you imply.
E.g.
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val schema = new StructType()
  .add("op_ts", StringType, true)

val df = spark.read.schema(schema)
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/FileStore/tables/json_stuff.txt")

df.printSchema()
df.show(false)
returns:
root
|-- op_ts: string (nullable = true)
+--------------------------+
|op_ts |
+--------------------------+
|2019-05-31 04:24:34.000327|
+--------------------------+
for this schema:
root
|-- after: struct (nullable = true)
| |-- CODE: string (nullable = true)
| |-- CREATED: string (nullable = true)
| |-- ID: long (nullable = true)
| |-- STATUS: string (nullable = true)
| |-- UPDATE_TIME: string (nullable = true)
|-- before: string (nullable = true)
|-- current_ts: string (nullable = true)
|-- op_ts: string (nullable = true)
|-- op_type: string (nullable = true)
|-- pos: string (nullable = true)
|-- primary_keys: array (nullable = true)
| |-- element: string (containsNull = true)
|-- table: string (nullable = true)
|-- tokens: struct (nullable = true)
| |-- csn: string (nullable = true)
| |-- txid: string (nullable = true)
obtained from the same file using:
val df = spark.read
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/FileStore/tables/json_stuff.txt")
df.printSchema()
df.show(false)
This latter is just for proof.
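As a small, hedged extension of the same idea (my own sketch, not part of the original answer, against the same file): the user-supplied schema can also name just a nested field, for example only after.ID together with op_ts from the full schema above, and Spark will parse only those:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val nestedSubset = new StructType()
  .add("after", new StructType().add("ID", LongType, true), true)
  .add("op_ts", StringType, true)

spark.read.schema(nestedSubset)
  .option("multiLine", true).option("mode", "PERMISSIVE")
  .json("/FileStore/tables/json_stuff.txt")
  .select(col("after.ID"), col("op_ts"))
  .show(false)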

Errors trying to flatten JSON in Spark

I'm trying to learn how to use Spark to process JSON data, and I have a fairly simple JSON file that looks like this:
{"key": { "defaultWeights":"1" }, "measures": { "m1":-0.01, "m2":-0.5.....}}
When I load this file into a Spark dataframe and run the following code:
val flattened = dff.withColumn("default_weights", json_tuple(col("key"), "defaultWeights")).show
I get this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'json_tuple(`key`, 'defaultWeights')' due to data type mismatch: json_tuple requires that all arguments are strings;;
'Project [key#6, measures#7, json_tuple(key#6, defaultWeights) AS default_weights#13]
+- Relation[key#6,measures#7] json
If I change my code to make sure both arguments are strings, I get this error:
<console>:25: error: type mismatch;
found : String
required: org.apache.spark.sql.Column
val flattened = dff.withColumn("default_weights", json_tuple("key", "defaultWeights")).show
So as you can see, I am literally going around in circles!
json_tuple would work if your key column were a string rather than a struct. Let me show you:
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.SparkSession

val contentStruct =
  """|{"key": { "defaultWeights":"1", "c": "a" }, "measures": { "m1":-0.01, "m2":-0.5}}""".stripMargin
FileUtils.writeStringToFile(new File("/tmp/test_flat.json"), contentStruct)

val sparkSession: SparkSession = SparkSession.builder()
  .appName("Spark SQL json_tuple")
  .master("local[*]").getOrCreate()
import sparkSession.implicits._

sparkSession.read.json("/tmp/test_flat.json").printSchema()
The schema will be:
root
|-- key: struct (nullable = true)
| |-- c: string (nullable = true)
| |-- defaultWeights: string (nullable = true)
|-- measures: struct (nullable = true)
| |-- m1: double (nullable = true)
| |-- m2: double (nullable = true)
So, in fact, you don't need to extract defaultWeights at all. You can simply address it with a JSON-style path (key.defaultWeights):
sparkSession.read.json("/tmp/test_flat.json").select("key.defaultWeights").show()
+--------------+
|defaultWeights|
+--------------+
| 1|
+--------------+
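A hedged aside of mine (not part of the original answer): if json_tuple is really wanted on this struct-typed key column, one option is to serialize the struct back to a JSON string with to_json first:
import org.apache.spark.sql.functions.{col, json_tuple, to_json}

// to_json turns the struct column back into a JSON string, which json_tuple accepts
sparkSession.read.json("/tmp/test_flat.json")
  .withColumn("default_weights", json_tuple(to_json(col("key")), "defaultWeights"))
  .show(false)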
Otherwise, to use json_tuple directly, your JSON should look like this:
val contentString =
  """|{"key": "{ \"defaultWeights\":\"1\", \"c\": \"a\" }", "measures": { "m1":-0.01, "m2":-0.5}}""".stripMargin
In that case, the schema will be:
root
|-- key: string (nullable = true)
|-- measures: struct (nullable = true)
| |-- m1: double (nullable = true)
| |-- m2: double (nullable = true)
And:
import org.apache.spark.sql.functions

// write the string-keyed variant over the same test file first
FileUtils.writeStringToFile(new File("/tmp/test_flat.json"), contentString)

sparkSession.read.json("/tmp/test_flat.json")
  .withColumn("default_weights", functions.json_tuple($"key", "defaultWeights")).show(false)
will return:
+----------------------------------+-------------+---------------+
|key |measures |default_weights|
+----------------------------------+-------------+---------------+
|{ "defaultWeights":"1", "c": "a" }|[-0.01, -0.5]|1 |
+----------------------------------+-------------+---------------+

Structtype definition for the nested Json file in pyspark

I have a JSON file for which I have created a dataframe in PySpark.
Here is the JSON file content:
{"xyz": [ {"c1": "a", "c2": "b", "c3": "d"}]}
Here is the StructType schema that I have created:
schema = StructType([
    StructField("abc", ArrayType(
        StructType([
            StructField("c1", StringType()),
            StructField("c2", StringType()),
            StructField("c3", StringType())])))])

rdd = sc.textFile(path).map(lambda x: x.encode("ascii", "ignore")).map(lambda line: json.loads(line))
df = rdd.toDF(schema=schema)

df_colsexp = df.select(
    col('cards.c1').alias('c1'),
    col('cards.c2').alias('c2'),
    col('cards.c3').alias('c3')
)
df_colsexp.show(5, False)
df_colsexp.printSchema()
>>> df_colsexp.printSchema()
root
|-- c1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- c2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- c3: array (nullable = true)
| |-- element: string (containsNull = true)
df_colsexp.show(5,False)
Output:
c1 c2 c3
[a] [b] [c]
My question is: why are the c1, c2, c3 columns shown as arrays in the output?
How can I make these columns c1, c2, c3 StringType, so that the square brackets in the output go away?

Re-using A Schema from JSON within a Spark DataFrame using Scala

I have some JSON data like this:
{"gid":"111","createHour":"2014-10-20 01:00:00.0","revisions":[{"revId":"2","modDate":"2014-11-20 01:40:37.0"},{"revId":"4","modDate":"2014-11-20 01:40:40.0"}],"comments":[],"replies":[]}
{"gid":"222","createHour":"2014-12-20 01:00:00.0","revisions":[{"revId":"2","modDate":"2014-11-20 01:39:31.0"},{"revId":"4","modDate":"2014-11-20 01:39:34.0"}],"comments":[],"replies":[]}
{"gid":"333","createHour":"2015-01-21 00:00:00.0","revisions":[{"revId":"25","modDate":"2014-11-21 00:34:53.0"},{"revId":"110","modDate":"2014-11-21 00:47:10.0"}],"comments":[{"comId":"4432","content":"How are you?"}],"replies":[{"repId":"4441","content":"I am good."}]}
{"gid":"444","createHour":"2015-09-20 23:00:00.0","revisions":[{"revId":"2","modDate":"2014-11-20 23:23:47.0"}],"comments":[],"replies":[]}
{"gid":"555","createHour":"2016-01-21 01:00:00.0","revisions":[{"revId":"135","modDate":"2014-11-21 01:01:58.0"}],"comments":[],"replies":[]}
{"gid":"666","createHour":"2016-04-23 19:00:00.0","revisions":[{"revId":"136","modDate":"2014-11-23 19:50:51.0"}],"comments":[],"replies":[]}
I can read it in:
val df = sqlContext.read.json("./data/full.json")
I can print the schema with df.printSchema
root
|-- comments: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- comId: string (nullable = true)
| | |-- content: string (nullable = true)
|-- createHour: string (nullable = true)
|-- gid: string (nullable = true)
|-- replies: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: string (nullable = true)
| | |-- repId: string (nullable = true)
|-- revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- modDate: string (nullable = true)
| | |-- revId: string (nullable = true)
I can show the data df.show(10,false)
+---------------------+---------------------+---+-------------------+---------------------------------------------------------+
|comments |createHour |gid|replies |revisions |
+---------------------+---------------------+---+-------------------+---------------------------------------------------------+
|[] |2014-10-20 01:00:00.0|111|[] |[[2014-11-20 01:40:37.0,2], [2014-11-20 01:40:40.0,4]] |
|[] |2014-12-20 01:00:00.0|222|[] |[[2014-11-20 01:39:31.0,2], [2014-11-20 01:39:34.0,4]] |
|[[4432,How are you?]]|2015-01-21 00:00:00.0|333|[[I am good.,4441]]|[[2014-11-21 00:34:53.0,25], [2014-11-21 00:47:10.0,110]]|
|[] |2015-09-20 23:00:00.0|444|[] |[[2014-11-20 23:23:47.0,2]] |
|[] |2016-01-21 01:00:00.0|555|[] |[[2014-11-21 01:01:58.0,135]] |
|[] |2016-04-23 19:00:00.0|666|[] |[[2014-11-23 19:50:51.0,136]] |
+---------------------+---------------------+---+-------------------+---------------------------------------------------------+
I can print / read the schema val dfSc = df.schema as:
StructType(StructField(comments,ArrayType(StructType(StructField(comId,StringType,true), StructField(content,StringType,true)),true),true), StructField(createHour,StringType,true), StructField(gid,StringType,true), StructField(replies,ArrayType(StructType(StructField(content,StringType,true), StructField(repId,StringType,true)),true),true), StructField(revisions,ArrayType(StructType(StructField(modDate,StringType,true), StructField(revId,StringType,true)),true),true))
I can print this out nicer:
println(df.schema.fields.mkString(",\n"))
StructField(comments,ArrayType(StructType(StructField(comId,StringType,true), StructField(content,StringType,true)),true),true),
StructField(createHour,StringType,true),
StructField(gid,StringType,true),
StructField(replies,ArrayType(StructType(StructField(content,StringType,true), StructField(repId,StringType,true)),true),true),
StructField(revisions,ArrayType(StructType(StructField(modDate,StringType,true), StructField(revId,StringType,true)),true),true)
Now if I read in the same file without the comments and replies data, with val df2 = sqlContext.read.json("./data/partialRevOnly.json") after simply deleting those entries, I get something like this with printSchema:
root
|-- comments: array (nullable = true)
| |-- element: string (containsNull = true)
|-- createHour: string (nullable = true)
|-- gid: string (nullable = true)
|-- replies: array (nullable = true)
| |-- element: string (containsNull = true)
|-- revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- modDate: string (nullable = true)
| | |-- revId: string (nullable = true)
I don't like that, so I use:
val df3 = sqlContext.read.
schema(dfSc).
json("./data/partialRevOnly.json")
where the original schema was dfSc. So now I get exactly the schema I had before with the removed data:
root
|-- comments: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- comId: string (nullable = true)
| | |-- content: string (nullable = true)
|-- createHour: string (nullable = true)
|-- gid: string (nullable = true)
|-- replies: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- content: string (nullable = true)
| | |-- repId: string (nullable = true)
|-- revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- modDate: string (nullable = true)
| | |-- revId: string (nullable = true)
This is perfect ... well almost. I would like to assign this schema to a variable similar to this:
val textSc = StructField(comments,ArrayType(StructType(StructField(comId,StringType,true), StructField(content,StringType,true)),true),true),
StructField(createHour,StringType,true),
StructField(gid,StringType,true),
StructField(replies,ArrayType(StructType(StructField(content,StringType,true), StructField(repId,StringType,true)),true),true),
StructField(revisions,ArrayType(StructType(StructField(modDate,StringType,true), StructField(revId,StringType,true)),true),true)
OK - This won't work due to double quotes, and 'some other structural' stuff, so try this (with error):
import org.apache.spark.sql.types._
val textSc = StructType(Array(
StructField("comments",ArrayType(StructType(StructField("comId",StringType,true), StructField("content",StringType,true)),true),true),
StructField("createHour",StringType,true),
StructField("gid",StringType,true),
StructField("replies",ArrayType(StructType(StructField("content",StringType,true), StructField("repId",StringType,true)),true),true),
StructField("revisions",ArrayType(StructType(StructField("modDate",StringType,true), StructField("revId",StringType,true)),true),true)
))
Name: Compile Error
Message: <console>:78: error: overloaded method value apply with alternatives:
(fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: java.util.List[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType <and>
(fields: Seq[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType
cannot be applied to (org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructField)
StructField("comments",ArrayType(StructType(StructField("comId",StringType,true), StructField("content",StringType,true)),true),true),
... Without this error (that I cannot figure a quick way around), I would like to then use textSc in place of dfSc to read in the JSON data with an imposed schema.
I cannot find a '1-to-1 match' way of getting (via println or ...) the schema with acceptable syntax (sort of like above). I suppose some coding can be done with case matching to iron out the double quotes. However, I'm still unclear what rules are required to get the exact schema out of the test fixture that I can simply re-use in my recurring production (versus test fixture) code. Is there a way to get this schema to print exactly as I would code it?
Note: This includes double quotes and all the proper StructField/Types and so forth to be code-compatible drop in.
As a sidebar, I thought about saving a fully-formed golden JSON file to use at the start of the Spark job, but I would like to eventually use date fields and other more concise types instead of strings at the applicable structural locations.
How can I get the dataFrame information coming out of my test harness (using a fully-formed JSON input row with comments and replies) to a point where I can drop the schema as source-code into production code Scala Spark job?
Note: The best answer is some coding means, but an explanation so I can trudge, plod, toil, wade, plow and slog thru the coding is helpful too. :)
I recently ran into this. I'm using Spark 2.0.2 so I don't know if this solution works with earlier versions.
import scala.util.Try
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.catalyst.parser.LegacyTypeStringParser
import org.apache.spark.sql.types.{DataType, StructType}

/** Produce a Schema string from a Dataset */
def serializeSchema(ds: Dataset[_]): String = ds.schema.json

/** Produce a StructType schema object from a JSON string */
def deserializeSchema(json: String): StructType = {
  Try(DataType.fromJson(json)).getOrElse(LegacyTypeStringParser.parse(json)) match {
    case t: StructType => t
    case _ => throw new RuntimeException(s"Failed parsing StructType: $json")
  }
}
Note that the "deserialize" function I just copied from a private function in the Spark StructType object. I don't know how well it will be supported across versions.
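A minimal usage sketch of these two helpers (my own illustration, assuming df and sqlContext from the question are in scope):
// serialize the schema inferred from the fully-formed test fixture and keep the string
val schemaJson: String = serializeSchema(df)

// later, rebuild the StructType from that string and impose it on the reduced file
val restoredSchema: StructType = deserializeSchema(schemaJson)
val df3 = sqlContext.read.schema(restoredSchema).json("./data/partialRevOnly.json")
df3.printSchema()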
Well, the error message should tell you everything you have to know here - StructType expects a sequence of fields as an argument. So in your case the schema should look like this:
StructType(Seq(
  StructField("comments", ArrayType(StructType(Seq(   // <- Seq[StructField]
    StructField("comId", StringType, true),
    StructField("content", StringType, true))), true), true),
  StructField("createHour", StringType, true),
  StructField("gid", StringType, true),
  StructField("replies", ArrayType(StructType(Seq(    // <- Seq[StructField]
    StructField("content", StringType, true),
    StructField("repId", StringType, true))), true), true),
  StructField("revisions", ArrayType(StructType(Seq(  // <- Seq[StructField]
    StructField("modDate", StringType, true),
    StructField("revId", StringType, true))), true), true)))