Using Pyspark to read JSON items from an array? - json

I'm having some issues with reading items from Cosmos DB in databricks, it seems to read the JSON as a string value, and having some issues getting the data out of it to columns.
I have a column called ProductRanges with the following values in a row:
[ {
"name": "Red",
"min": 0,
"max": 99,
"value": "Order More"
},
{
"name": "Amber",
"min": 100,
"max": 499,
"value": "Stock OK"
},
{
"name": "Green",
"min": 500,
"max": 1000000,
"value": "Overstocked"
}
]
In Cosmos DB the JSON document is valid, out when importing the data the datatype in the dataframe is a string, not a JSON object/struct as I would expect.
I would like the be able to get count the number of times "name" comes up and iterate through them the get the min, max and value items, as the number of ranges that we can have can be more than 3. I've been though a few post on stackoverflow and other places but stuck on the formatting. I've tried to use the explode and read the schema based in the column values, but it does say 'in vaild document', think it may be due to Pyspark needing {} at the start and the end, but even concatenating that in the SQL query from cosmos db still ends up as the datatype of string.
Any pointers would be appreciated

I see you retrieved JSON documents from Azure CosmosDB and convert them to PySpark DataFrame, but the nested JSON document or array could not be transformed as a JSON object in a DataFrame column as you expected, because there is not a JSON type defined in pyspark.sql.types module, as below.
I searched a document PySpark: Convert JSON String Column to Array of Object (StructType) in Data Frame which be a suitable solution for your current case, even the same as you want, while I was trying to solve it.
The document above shows how to use ArrayType, StructType, StructField and other base PySpark datatypes to convert a JSON string in a column to a combined datatype which can be processed easier in PySpark via define the column schema and an UDF.
Here is the summary of sample code. Hope it helps.
source = [{"attr_1": 1, "attr_2": "[{\"a\":1,\"b\":1},{\"a\":2,\"b\":2}]"}, {"attr_1": 2, "attr_2": "[{\"a\":3,\"b\":3},{\"a\":4,\"b\":4}]"}]
JSON is read into a data frame through sqlContext. The output is:
+------+--------------------+
|attr_1| attr_2|
+------+--------------------+
| 1|[{"a":1,"b":1},{"...|
| 2|[{"a":3,"b":3},{"...|
+------+--------------------+
root
|-- attr_1: long (nullable = true)
|-- attr_2: string (nullable = true)
Then, to convert the attr_2 column via define column schema and UDF.
# Function to convert JSON array string to a list
import json
def parse_json(array_str):
json_obj = json.loads(array_str)
for item in json_obj:
yield (item["a"], item["b"])
# Define the schema
from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField
json_schema = ArrayType(StructType([StructField('a', IntegerType(
), nullable=False), StructField('b', IntegerType(), nullable=False)]))
# Define udf
from pyspark.sql.functions import udf
udf_parse_json = udf(lambda str: parse_json(str), json_schema)
# Generate a new data frame with the expected schema
df_new = df.select(df.attr_1, udf_parse_json(df.attr_2).alias("attr_2"))
df_new.show()
df_new.printSchema()
The output is as the following:
+------+--------------+
|attr_1| attr_2|
+------+--------------+
| 1|[[1,1], [2,2]]|
| 2|[[3,3], [4,4]]|
+------+--------------+
root
|-- attr_1: long (nullable = true)
|-- attr_2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: integer (nullable = false)
| | |-- b: integer (nullable = false)

Related

Json creation from spark dataframe in scala

Currently, we are converting a spark dataframe to JSON String to be sent to kafka.
In the process, we are doing toJSON twice which inserts \ for the inner json.
Snippet of the code:
val df=spark.sql("select * from dB.tbl")
val bus_dt="2022-09-23"
case class kafkaMsg(busDate:String,msg:String)
Assuming my df has 2 columns as ID,STATUS, this will constitute the inner json of my kafka message.
JSON is created for msg and applied to case class.
val rdd=df.toJSON.rdd.map(msg=>kafkaMsg(busDate,msg))
Output at this step:
kafkaMsg(2022-09-23,{"id":1,"status":"active"})
Now, to send busDate and msg as JSON to kafka ,again a toJSON is applied.
val df1=spark.createDataFrame(rdd).toJSON
The output is:
{"busDate":"2022-09-23","msg":"{\"id\":1,\"status\":\"active\"}"}
The inner JSON is having \ which is not what the consumers are expecting.
Expected JSON:
{"busDate":"2022-09-23","msg":{"id":1,"status":"active"}}
How can I create this json without \ and send to kafka.
Please note the msg value varies and cannot be mapped to a schema.
Your msg is escaped because it's already a string. So, you are toString-ing a String when you convert to JSON...
JSON can be represented as Map[String, ?], so define a schema if your input data doesn't already have it.
Using PySpark as an example.
scm = StructType([
StructField('busDate', StringType(), nullable=False),
StructField('msg', MapType(StringType(), StringType()), nullable=False)
])
sdf = spark.createDataFrame([
('2022-09-23', {"id":1,"status":"active"}),
], schema=scm)
Schema - Notice that msg is not a string, but a Map[String, String]. And no, you cannot have multiple value types - Spark SQL and MapType with string keys and any values
root
|-- busDate: string (nullable = false)
|-- msg: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
As JSON - You don't need Jackson, or hack around with RDDs...
kafkaDf = sdf.selectExpr("to_json(struct(*)) as value")
kafkaDf.show(truncate=False)
Not escaped...
Notice that the id type was converted. If that's not something you want, then you need to use msg : StructType rather than MapType and give id : IntegerType, for example. (This assumes all records in the dataframe are consistently typed, obviously)
+-----------------------------------------------------------+
|value |
+-----------------------------------------------------------+
|{"busDate":"2022-09-23","msg":{"id":"1","status":"active"}}|
+-----------------------------------------------------------+
You could also pull out the key (switched to using spark.sql.functions)
kafkaDf = sdf.select(
f.col("msg.id").cast("int").alias('key'),
f.to_json(f.struct('*')).alias('value')
)
kafkaDf.printSchema()
kafkaDf.show(truncate=False)
root
|-- key: integer (nullable = true)
|-- value: string (nullable = true)
+---+-----------------------------------------------------------+
|key|value |
+---+-----------------------------------------------------------+
|1 |{"busDate":"2022-09-23","msg":{"id":"1","status":"active"}}|
+---+-----------------------------------------------------------+
Then you can use kafkaDf.write.format("kafka"), as normal
Alternatively, if you wanted to wrap string information in a single field, rather then key-value pairs, then your Kafka consumers would need to handle that on their own, such as double-deserializing both the record, then the inner string (JSON value).

PySpark: TypeError: col should be Column

I am trying to create a dataframe out of a nested JSON structure, but I am encountering a problem that I don't understand. I have exploded an array-of-dicts structure in the JSON and now I am trying to access these dicts and create columns with the values in there. This is how the dicts look like:
The values at index 1 (subject, glocations etc.) go under the key "name" according to the schema:
However, when I try:
dataframe = dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
it throws error:
PySpark: TypeError: col should be Column
There is no such problem with any other of the keys in the dict, i.e. "value".
I really do not understand the problem, do I have to assume that there are inconsistencies in the data? If yes, can you recommend a way to check for or even dodge them?
Edit: Khalid had a good idea to pre-define the schema. I tried to do so by storing one of the JSON files as a kind of default file. From that file, I wanted to extract the schema as follows:
schemapath = 'default_schema.json'
with open(schemapath) as f:
d = json.load(f)
schemaNew = StructType.fromJson(d)
responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)
however, line
schemaNew = StructType.fromJson(d)
throws following error:
KeyError: 'fields'
No idea, where this 'fields' is coming from...
Errors in Spark tell truth.
dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
TypeError: col should be Column
DataFrame.withColumn documentation tells you how its input parameters are called and their data types:
Parameters:
- colName: str
string, name of the new column.
- col: Column
a Column expression for the new column.
So, col is parameter's name and Column is its type. Column is the data type which withColumn expects to get as the parameter named col. What did it actually receive? It received dataframe.keywords_exp.name. But what data type is it of?
print(type(dataframe.keywords_exp.name))
# <class 'method'>
As can be seen, it's not of the expected type Column...
To get Column from Struct's field, you must use a different syntax.
Note: data types in the dataframe are not what you think they are. You don't have dicts anymore. Instead, you have a Struct type column. The keys from the old dictionaries are now Field names for Struct type column.
To access struct fields, you should be using any of the following options:
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df = dataframe.withColumn("keywords_name", dataframe.keywords_exp['name'])
(Both, F.col("keywords_exp.name") and dataframe.keywords_exp['name'] are of type Column.)
This is a dataframe having the same schema as yours. You can see that withColumn works well:
from pyspark.sql import functions as F
dataframe = spark.createDataFrame(
[(("N", "glocations", 1, "Cuba"),)],
'keywords_exp struct<major:string,name:string,rank:bigint,value:string>')
dataframe.printSchema()
# root
# |-- keywords_exp: struct (nullable = true)
# | |-- major: string (nullable = true)
# | |-- name: string (nullable = true)
# | |-- rank: long (nullable = true)
# | |-- value: string (nullable = true)
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df.show()
# +--------------------+-------------+
# | keywords_exp|keywords_name|
# +--------------------+-------------+
# |{N, glocations, 1...| glocations|
# +--------------------+-------------+
Try setting scheme before reading.
Edit: I think the json schema needs to be in specific format. I know it's not documented very well, but you can extract an example using .json() method to see the format and then adjust your schema files. See below updated example:
aa.json
[{"keyword_exp": {"name": "aa", "value": "bb"}}, {"keyword_exp": {"name": "oo", "value": "ee"}}]
test.py
from pyspark.sql.session import SparkSession
import json
if __name__ == '__main__':
spark = SparkSession.builder.appName("test-app").master("local[1]").getOrCreate()
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
StructField('keyword_exp', StructType([
StructField('name', StringType(), False),
StructField('value', StringType(), False),
])),
])
json_str = schema.json()
json_obj = json.loads(json_str)
# Save output of this as file
print(json_str)
# Just to see it pretty
print(json.dumps(json_obj, indent=4))
# Save to file
with open("file_schema.json", "w") as f:
f.write(json_str)
# Load
with open("file_schema.json", "r") as f:
scheme_obj = json.loads(f.read())
# Re-load
loaded_schema = StructType.fromJson(scheme_obj)
df = spark.read.json("./aa.json", schema=schema)
df.printSchema()
df = df.select("keyword_exp.name", "keyword_exp.value")
df.show()
output:
{"fields":[{"metadata":{},"name":"keyword_exp","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":false,"type":"string"},{"metadata":{},"name":"value","nullable":false,"type":"string"}],"type":"struct"}}],"type":"struct"}
{
"fields": [
{
"metadata": {},
"name": "keyword_exp",
"nullable": true,
"type": {
"fields": [
{
"metadata": {},
"name": "name",
"nullable": false,
"type": "string"
},
{
"metadata": {},
"name": "value",
"nullable": false,
"type": "string"
}
],
"type": "struct"
}
}
],
"type": "struct"
}
root
|-- keyword_exp: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- value: string (nullable = true)
+----+-----+
|name|value|
+----+-----+
| aa| bb|
| oo| ee|
+----+-----+
the Spark API seems to have problems with certain protected words. I came across this link when googling the error message
AttributeError: ‘function’ object has no attribute
https://learn.microsoft.com/en-us/azure/databricks/kb/python/function-object-no-attribute
Though "name" is not on the list, I changed all "name"-occurences in the JSON to "nameabcde" and now I can access it:

Create dataframe from json string having true false value

Wanted to create a spark dataframe from json string without using schema in Python. The json is mutlilevel nested which may contain array.
I had used below for creating dataframe, but getting 'Cannot infer Schema'
spark.createDataFrame(jsonStr)
I tried loading same json from file using below
spark.read.option("multiline", "true").json("/path")
This statement didn't have any issue and loaded the data to spark dataframe.
Is there any similar way to load the data from json variable?
It is fine even if all the values are not normallized.
Edit:
Found out that the issue might be due to true and false(Bool value) present in the json, when I was trying to use createDataFrame python is taking true and false as variable.
Is there any way to bypass this, the file also contains true or false. I tried to convert the list (list of nested dictionary) to json by using json.dumps() also. It is giving error as
Can not infer schema for type : <class 'str'>
Edit 2:
Input:
data = [
{
"a":"testA",
"b":"testB",
"c":false
}
]
Required output dataframe
a | b | c
--------------------
testA | testB | false
I get the required output when loading from the file. The file contains exact same as data.
spark.read.option("multiline", "true").json("/path/test.json")
Also if the data is string then I get a error Can not infer schema for type : <class 'str'>
If you don't want to load data from json file, you'd have to provide a schema for the JSON and use from_json to parse it
from pyspark.sql import functions as F
from pyspark.sql import types as T
schema = T.ArrayType(T.StructType([
T.StructField('a', T.StringType()),
T.StructField('b', T.StringType()),
T.StructField('c', T.BooleanType()),
]))
df = (spark
.createDataFrame([('dummy',)], ['x'])
.withColumn('x', F.from_json(F.lit(data), schema))
)
df.show(10, False)
df.printSchema()
+-----------------------+
|x |
+-----------------------+
|[{testA, testB, false}]|
+-----------------------+
root
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: boolean (nullable = true)
If your input is a json you can deserialize it to a list of dictionary before creating a spark dataframe:
spark.createDataFrame(json.loads(data))

Interpret timestamp fields in Spark while reading json

I am trying to read a pretty printed json which has time fields in it. I want to interpret the timestamps columns as timestamp fields while reading the json itself. However, it's still reading them as string when I printSchema
E.g.
Input json file -
[{
"time_field" : "2017-09-30 04:53:39.412496Z"
}]
Code -
df = spark.read.option("multiLine", "true").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')
Output of df.printSchema() -
root
|-- time_field: string (nullable = true)
What am I missing here?
My own experience with option timestampFormat is that it doesn't quite work as advertised. I would simply read the time fields as strings and use to_timestamp to do the conversion, as shown below (with slightly generalized sample input):
# /path/to/jsonfile
[{
"id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
"id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]
In Python:
from pyspark.sql.functions import to_timestamp
df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df = df.withColumn("timestamp", to_timestamp("time_field"))
df.show(2, False)
+---+---------------------------+-------------------+
|id |time_field |timestamp |
+---+---------------------------+-------------------+
|101|2017-09-30 04:53:39.412496Z|2017-09-30 04:53:39|
|102|2017-10-01 01:23:45.123456Z|2017-10-01 01:23:45|
+---+---------------------------+-------------------+
df.printSchema()
root
|-- id: long (nullable = true)
|-- time_field: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
In Scala:
val df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df.withColumn("timestamp", to_timestamp($"time_field"))
It's bug in Spark version 2.4.0 Issues SPARK-26325
For Spark Version 2.4.4
import org.apache.spark.sql.types.TimestampType
//String to timestamps
val df = Seq(("2019-07-01 12:01:19.000"),
("2019-06-24 12:01:19.000"),
("2019-11-16 16:44:55.406"),
("2019-11-16 16:50:59.406")).toDF("input_timestamp")
val df_mod = df.select($"input_timestamp".cast(TimestampType))
df_mod.printSchema
Output
root
|-- input_timestamp: timestamp (nullable = true)

From Postgres JSONB into Spark JSONRDD [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, schema can be determined using schema_of_json function (please note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
and extracts fields to individual strings which can be further casted to expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that blob field cannot be represented in JSON. Otherwise you cab omit splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map{
case Row(key: String, json: String) =>
s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use an UDF to parse JSON and output a struct or map column. For example something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
Here are the docs for spark.read.json(): Scala API / Python API
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
//You can define whatever struct type that your json states
val schema = StructType(Seq(
StructField("key", StringType, true),
StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
underlying JSON String is
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data in to Cassandra.
sqlContext.read.json(rdd).select("column_name1 or fields name in Json", "column_name2","column_name2")
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()
I use the following
(available since 2.2.0, and i am assuming that your json string column is at column index 0)
def parse(df: DataFrame, spark: SparkSession): DataFrame = {
val stringDf = df.map((value: Row) => value.getString(0), Encoders.STRING)
spark.read.json(stringDf)
}
It will automatically infer the schema in your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html