How to parse json with mixed nested and non-nested structure?

In file 1, the JSON element "image" is nested. The data is structured like this:
{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
Resulting schema is correctly inferred by Spark:
val df1 = spark.read.json("/xxx/xxxx/xxxx/nested1.json")
df1.printSchema
root
|-- id: string (nullable = true)
|-- image: struct (nullable = true)
| |-- height: long (nullable = true)
| |-- url: string (nullable = true)
| |-- width: long (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
File nested2.json contains some nested elements and some non-nested ones (in the second line below, the element "image" is not nested):
{"id": "0001", "type": "donut", "name": "Cake", "image":{"url": "images/0001.jpg", "width": 200, "height": 200}}
{"id": "0002", "type": "donut", "name": "CupCake", "image": "images/0001.jpg"}
Resulting schema does not contain the nested data:
val df2 = spark.read.json("/xxx/xxx/xxx/nested2.json")
df2.printSchema
root
|-- id: string (nullable = true)
|-- image: string (nullable = true)
|-- name: string (nullable = true)
|-- type: string (nullable = true)
Why is Spark not able to figure out the schema when non-nested elements are present?
How to process a JSON file containing a mixture of records like this using Spark and Scala?

When no schema is specified, Spark infers one from the data source by reading the data once. Looking at the source code for JSON schema inference, there is a comment in the code:
Infer the type of a collection of json records in three stages:
1. Infer the type of each record
2. Merge types by choosing the lowest type necessary to cover equal keys
3. Replace any remaining null fields with string, the top type
In other words, when the types differ between records, only the most general data type is kept. When nested and un-nested values are mixed for the same field, the merge falls back to the most general type, string, and the nested structure is lost.
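As a quick illustration (a minimal sketch, assuming a SparkSession named spark), inferring over two records where the same field is an object in one and a plain string in the other falls back to string:
import spark.implicits._

// One record has "image" as an object, the other as a plain string.
val mixed = Seq(
  """{"image": {"url": "images/0001.jpg", "width": 200, "height": 200}}""",
  """{"image": "images/0001.jpg"}"""
).toDS()

spark.read.json(mixed).printSchema()
// The struct is widened away and only string remains:
// root
//  |-- image: string (nullable = true)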
Instead of inferring the schema, create it yourself and supply it to the reader.
import org.apache.spark.sql.types._

val schema = StructType(
  StructField("id", StringType, true) ::
  StructField("type", StringType, true) ::
  StructField("name", StringType, true) ::
  StructField(
    "image",
    StructType(
      StructField("height", LongType, true) ::
      StructField("url", StringType, true) ::
      StructField("width", LongType, true) ::
      Nil
    )
  ) :: Nil
)
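As a side note (assuming Spark 2.3 or later), the same schema can also be written more compactly as a DDL string; this sketch is equivalent to the StructType built above:
import org.apache.spark.sql.types.StructType

val schemaFromDdl = StructType.fromDDL(
  "id STRING, type STRING, name STRING, image STRUCT<height: BIGINT, url: STRING, width: BIGINT>")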
Then use the schema when reading:
val df2 = spark.read
.schema(schema)
.json("/xxx/xxx/xxx/nested2.json")
However, this approach will result in null for the "image" field when it is simply a string. To get both types, read the file twice and merge the results.
val df3 = spark.read
.json("/xxx/xxx/xxx/nested2.json")
The new dataframe has a different schema than the first. Let's make them equal and then merge the dataframes together:
import org.apache.spark.sql.functions.{lit, struct}

val df4 = df3.withColumn("image",
    struct(
      lit(0L).as("height"),   // field order must match the image struct in df2
      $"image".as("url"),
      lit(0L).as("width")))
  .select("id", "type", "name", "image")
val df5 = df2.union(df4)
The last select is to make sure the columns are in the same order as df2, otherwise the union will fail.
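As a further note (a sketch assuming Spark 2.3 or later), unionByName matches top-level columns by name rather than by position, so the merge does not depend on the column order:
// unionByName resolves columns by name, so the explicit select above is not
// strictly needed for ordering (the nested struct fields still have to line up).
val merged = df2.unionByName(df4)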

PySpark: TypeError: col should be Column

I am trying to create a dataframe out of a nested JSON structure, but I am encountering a problem that I don't understand. I have exploded an array-of-dicts structure in the JSON and now I am trying to access these dicts and create columns with the values in there. This is what the dicts look like:
The values at index 1 (subject, glocations etc.) go under the key "name" according to the schema:
However, when I try:
dataframe = dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
it throws error:
PySpark: TypeError: col should be Column
There is no such problem with any other of the keys in the dict, i.e. "value".
I really do not understand the problem. Do I have to assume that there are inconsistencies in the data? If yes, can you recommend a way to check for them or even work around them?
Edit: Khalid had a good idea to pre-define the schema. I tried to do so by storing one of the JSON files as a kind of default file. From that file, I wanted to extract the schema as follows:
schemapath = 'default_schema.json'
with open(schemapath) as f:
    d = json.load(f)
schemaNew = StructType.fromJson(d)
responseDf = spark.read.schema(schemaNew).json("apiResponse.json", multiLine=True)
however, line
schemaNew = StructType.fromJson(d)
throws following error:
KeyError: 'fields'
No idea, where this 'fields' is coming from...
Errors in Spark tell the truth.
dataframe.withColumn("keywords_name", dataframe.keywords_exp.name)
TypeError: col should be Column
The DataFrame.withColumn documentation tells you what its input parameters are called and what their data types are:
Parameters:
- colName: str
string, name of the new column.
- col: Column
a Column expression for the new column.
So, col is the parameter's name and Column is its type. Column is the data type which withColumn expects to receive as the parameter named col. What did it actually receive? It received dataframe.keywords_exp.name. But what data type is that?
print(type(dataframe.keywords_exp.name))
# <class 'method'>
As can be seen, it's not of the expected type Column...
To get a Column from a struct's field, you must use a different syntax.
Note: data types in the dataframe are not what you think they are. You don't have dicts anymore. Instead, you have a struct-type column. The keys from the old dictionaries are now field names of that struct column.
To access struct fields, you should be using any of the following options:
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df = dataframe.withColumn("keywords_name", dataframe.keywords_exp['name'])
(Both, F.col("keywords_exp.name") and dataframe.keywords_exp['name'] are of type Column.)
This is a dataframe having the same schema as yours. You can see that withColumn works well:
from pyspark.sql import functions as F
dataframe = spark.createDataFrame(
    [(("N", "glocations", 1, "Cuba"),)],
    'keywords_exp struct<major:string,name:string,rank:bigint,value:string>')
dataframe.printSchema()
# root
# |-- keywords_exp: struct (nullable = true)
# | |-- major: string (nullable = true)
# | |-- name: string (nullable = true)
# | |-- rank: long (nullable = true)
# | |-- value: string (nullable = true)
df = dataframe.withColumn("keywords_name", F.col("keywords_exp.name"))
df.show()
# +--------------------+-------------+
# | keywords_exp|keywords_name|
# +--------------------+-------------+
# |{N, glocations, 1...| glocations|
# +--------------------+-------------+
Try setting the schema before reading.
Edit: I think the JSON schema needs to be in a specific format. I know it's not documented very well, but you can extract an example using the .json() method to see the format and then adjust your schema files. See the updated example below:
aa.json
[{"keyword_exp": {"name": "aa", "value": "bb"}}, {"keyword_exp": {"name": "oo", "value": "ee"}}]
test.py
from pyspark.sql.session import SparkSession
import json
if __name__ == '__main__':
    spark = SparkSession.builder.appName("test-app").master("local[1]").getOrCreate()
    from pyspark.sql.types import StructType, StructField, StringType
    schema = StructType([
        StructField('keyword_exp', StructType([
            StructField('name', StringType(), False),
            StructField('value', StringType(), False),
        ])),
    ])
    json_str = schema.json()
    json_obj = json.loads(json_str)
    # Save output of this as file
    print(json_str)
    # Just to see it pretty
    print(json.dumps(json_obj, indent=4))
    # Save to file
    with open("file_schema.json", "w") as f:
        f.write(json_str)
    # Load
    with open("file_schema.json", "r") as f:
        scheme_obj = json.loads(f.read())
    # Re-load
    loaded_schema = StructType.fromJson(scheme_obj)
    df = spark.read.json("./aa.json", schema=schema)
    df.printSchema()
    df = df.select("keyword_exp.name", "keyword_exp.value")
    df.show()
output:
{"fields":[{"metadata":{},"name":"keyword_exp","nullable":true,"type":{"fields":[{"metadata":{},"name":"name","nullable":false,"type":"string"},{"metadata":{},"name":"value","nullable":false,"type":"string"}],"type":"struct"}}],"type":"struct"}
{
    "fields": [
        {
            "metadata": {},
            "name": "keyword_exp",
            "nullable": true,
            "type": {
                "fields": [
                    {
                        "metadata": {},
                        "name": "name",
                        "nullable": false,
                        "type": "string"
                    },
                    {
                        "metadata": {},
                        "name": "value",
                        "nullable": false,
                        "type": "string"
                    }
                ],
                "type": "struct"
            }
        }
    ],
    "type": "struct"
}
root
|-- keyword_exp: struct (nullable = true)
| |-- name: string (nullable = true)
| |-- value: string (nullable = true)
+----+-----+
|name|value|
+----+-----+
| aa| bb|
| oo| ee|
+----+-----+
The Spark API seems to have problems with certain protected words. I came across this link when googling the error message:
AttributeError: 'function' object has no attribute
https://learn.microsoft.com/en-us/azure/databricks/kb/python/function-object-no-attribute
Though "name" is not on the list, I changed all "name"-occurences in the JSON to "nameabcde" and now I can access it:

Interpret timestamp fields in Spark while reading json

I am trying to read a pretty-printed JSON file which has time fields in it. I want to interpret the timestamp columns as timestamp fields while reading the JSON itself. However, they are still read as strings when I call printSchema.
E.g.
Input json file -
[{
"time_field" : "2017-09-30 04:53:39.412496Z"
}]
Code -
df = spark.read.option("multiLine", "true").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')
Output of df.printSchema() -
root
|-- time_field: string (nullable = true)
What am I missing here?
My own experience with option timestampFormat is that it doesn't quite work as advertised. I would simply read the time fields as strings and use to_timestamp to do the conversion, as shown below (with slightly generalized sample input):
# /path/to/jsonfile
[{
"id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
"id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]
In Python:
from pyspark.sql.functions import to_timestamp
df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df = df.withColumn("timestamp", to_timestamp("time_field"))
df.show(2, False)
+---+---------------------------+-------------------+
|id |time_field |timestamp |
+---+---------------------------+-------------------+
|101|2017-09-30 04:53:39.412496Z|2017-09-30 04:53:39|
|102|2017-10-01 01:23:45.123456Z|2017-10-01 01:23:45|
+---+---------------------------+-------------------+
df.printSchema()
root
|-- id: long (nullable = true)
|-- time_field: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
In Scala:
import org.apache.spark.sql.functions.to_timestamp

val df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df.withColumn("timestamp", to_timestamp($"time_field"))
It's a bug in Spark version 2.4.0: see SPARK-26325.
For Spark version 2.4.4:
import org.apache.spark.sql.types.TimestampType
//String to timestamps
val df = Seq(("2019-07-01 12:01:19.000"),
("2019-06-24 12:01:19.000"),
("2019-11-16 16:44:55.406"),
("2019-11-16 16:50:59.406")).toDF("input_timestamp")
val df_mod = df.select($"input_timestamp".cast(TimestampType))
df_mod.printSchema
Output
root
|-- input_timestamp: timestamp (nullable = true)
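A small sketch (not part of the answers above) combining both ideas for the question's exact format: strip the trailing 'Z' with regexp_replace, then cast the remaining string, which Spark parses including the microsecond part:
import org.apache.spark.sql.functions.regexp_replace
import org.apache.spark.sql.types.TimestampType

val raw = Seq("2017-09-30 04:53:39.412496Z").toDF("time_field")
val withTs = raw.withColumn(
  "timestamp",
  regexp_replace($"time_field", "Z$", "").cast(TimestampType))

withTs.printSchema
// root
//  |-- time_field: string (nullable = true)
//  |-- timestamp: timestamp (nullable = true)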

Spark 2.0 (not 2.1) Dataset[Row] or Dataframe - Select few columns to JSON

I have a Spark DataFrame with 10 columns and I need to store this in Postgres/an RDBMS. The table has 7 columns and the 7th column takes text (in JSON format) for further processing.
How do I select 6 columns and convert the remaining 4 columns in the DF to JSON format?
If the whole DF is to be stored as JSON, then we could use DF.write.format("json"), but only the last 4 columns are required to be in JSON format.
I tried creating a UDF (with either Jackson or Lift lib), but not successful in sending the 4 columns to the UDF.
For the JSON, the DF column name is the key and the DF column's value is the value.
eg:
dataset name: ds_base
root
|-- bill_id: string (nullable = true)
|-- trans_id: integer (nullable = true)
|-- billing_id: decimal(3,-10) (nullable = true)
|-- asset_id: string (nullable = true)
|-- row_id: string (nullable = true)
|-- created: string (nullable = true)
|-- end_dt: string (nullable = true)
|-- start_dt: string (nullable = true)
|-- status_cd: string (nullable = true)
|-- update_start_dt: string (nullable = true)
I want to do,
ds_base
.select ( $"bill_id",
$"trans_id",
$"billing_id",
$"asset_id",
$"row_id",
$"created",
?? <JSON format of 4 remaining columns>
)
You can use struct and to_json:
import org.apache.spark.sql.functions.{to_json, struct}
to_json(struct($"end_dt", $"start_dt", $"status_cd", $"update_start_dt"))
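A sketch wiring that expression into the select outlined in the question (the column names are the ones listed there; the output alias json_payload is just a placeholder):
import org.apache.spark.sql.functions.{struct, to_json}

val result = ds_base.select(
  $"bill_id",
  $"trans_id",
  $"billing_id",
  $"asset_id",
  $"row_id",
  $"created",
  // the remaining four columns serialized into a single JSON text column
  to_json(struct($"end_dt", $"start_dt", $"status_cd", $"update_start_dt")).as("json_payload")
)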
As a workaround for legacy Spark versions, you could convert the whole object to JSON and extract the required fields:
import org.apache.spark.sql.functions.{col, get_json_object, struct}
// List of column names to be kept as-is
val scalarColumns: Seq[String] = Seq("bill_id", "trans_id", ...)
// List of column names to be put in JSON
val jsonColumns: Seq[String] = Seq(
"end_dt", "start_dt", "status_cd", "update_start_dt"
)
// Convert all records to JSON, keeping selected fields as a nested document
val json = df.select(
scalarColumns.map(col _) :+
struct(jsonColumns map col: _*).alias("json"): _*
).toJSON
json.select(
// Extract selected columns from JSON field and cast to required types
scalarColumns.map(c =>
get_json_object($"value", s"$$.$c").alias(c).cast(df.schema(c).dataType)) :+
// Extract JSON struct
get_json_object($"value", "$.json").alias("json"): _*
)
This will work only as long as you have atomic types. Alternatively you could use standard JSON reader and specify schema for the JSON field.
import org.apache.spark.sql.types._
val combined = df.select(
scalarColumns.map(col _) :+
struct(jsonColumns map col: _*).alias("json"): _*
)
val newSchema = StructType(combined.schema.fields map {
case StructField("json", _, _, _) => StructField("json", StringType)
case s => s
})
spark.read.schema(newSchema).json(combined.toJSON.rdd)

Fail to specify JSON schema in Spark [duplicate]

I wrote the following code in both Scala and Python; however, the DataFrame that is returned doesn't appear to respect the non-nullable fields in the schema that I am applying. italianVotes.csv is a CSV file with '~' as a separator and four fields. I'm using Spark 2.1.0.
italianVotes.csv
2657~135~2~2013-11-22 00:00:00.0
2658~142~2~2013-11-22 00:00:00.0
2659~142~1~2013-11-22 00:00:00.0
2660~140~2~2013-11-22 00:00:00.0
2661~140~1~2013-11-22 00:00:00.0
2662~1354~2~2013-11-22 00:00:00.0
2663~1356~2~2013-11-22 00:00:00.0
2664~1353~2~2013-11-22 00:00:00.0
2665~1351~2~2013-11-22 00:00:00.0
2667~1357~2~2013-11-22 00:00:00.0
Scala
import org.apache.spark.sql.types._
val schema = StructType(
StructField("id", IntegerType, false) ::
StructField("postId", IntegerType, false) ::
StructField("voteType", IntegerType, true) ::
StructField("time", TimestampType, true) :: Nil)
val fileName = "italianVotes.csv"
val italianDF = spark.read.schema(schema).option("sep", "~").csv(fileName)
italianDF.printSchema()
// output
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
Python
from pyspark.sql.types import *
schema = StructType([
StructField("id", IntegerType(), False),
StructField("postId", IntegerType(), False),
StructField("voteType", IntegerType(), True),
StructField("time", TimestampType(), True),
])
file_name = "italianVotes.csv"
italian_df = spark.read.csv(file_name, schema = schema, sep = "~")
# print schema
italian_df.printSchema()
root
|-- id: integer (nullable = true)
|-- postId: integer (nullable = true)
|-- voteType: integer (nullable = true)
|-- time: timestamp (nullable = true)
My main question is why are the first two fields nullable when I have set them to non-nullable in my schema?
In general, Spark Datasets either inherit the nullable property from their parents or infer it based on the external data types.
You can argue whether it is a good approach or not, but ultimately it is sensible. If the semantics of a data source don't support nullability constraints, then applying a schema cannot enforce them either. At the end of the day it is always better to assume that things can be null than to fail at runtime if the opposite assumption turns out to be incorrect.
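If the non-nullable flags really matter downstream, one workaround (a sketch, not part of the original answer) is to re-apply the schema on top of the already-parsed rows; createDataFrame keeps the flags you pass, but Spark does not validate the data against them:
// Re-applying the schema preserves nullable = false, at the cost of a round
// trip through the RDD API and with no actual null check by Spark.
val strictDF = spark.createDataFrame(italianDF.rdd, schema)
strictDF.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- postId: integer (nullable = false)
//  |-- voteType: integer (nullable = true)
//  |-- time: timestamp (nullable = true)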

From Postgres JSONB into Spark JSONRDD [duplicate]

I have a Cassandra table that for simplicity looks something like:
key: text
jsonData: text
blobData: blob
I can create a basic data frame for this using spark and the spark-cassandra-connector using:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
I'm struggling though to expand the JSON data into its underlying structure. I ultimately want to be able to filter based on the attributes within the json string and return the blob data. Something like jsonData.foo = "bar" and return blobData. Is this currently possible?
Spark >= 2.4
If needed, the schema can be determined using the schema_of_json function (note that this assumes that an arbitrary row is a valid representative of the schema).
import org.apache.spark.sql.functions.{lit, schema_of_json, from_json}
import collection.JavaConverters._
val schema = schema_of_json(lit(df.select($"jsonData").as[String].first))
df.withColumn("jsonData", from_json($"jsonData", schema, Map[String, String]().asJava))
Spark >= 2.1
You can use from_json function:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("k", StringType, true), StructField("v", DoubleType, true)
))
df.withColumn("jsonData", from_json($"jsonData", schema))
Spark >= 1.6
You can use get_json_object which takes a column and a path:
import org.apache.spark.sql.functions.get_json_object
val exprs = Seq("k", "v").map(
c => get_json_object($"jsonData", s"$$.$c").alias(c))
df.select($"*" +: exprs: _*)
and extracts fields as individual strings, which can then be cast to the expected types.
The path argument is expressed using dot syntax, with leading $. denoting document root (since the code above uses string interpolation $ has to be escaped, hence $$.).
Spark <= 1.5:
Is this currently possible?
As far as I know it is not directly possible. You can try something similar to this:
val df = sc.parallelize(Seq(
("1", """{"k": "foo", "v": 1.0}""", "some_other_field_1"),
("2", """{"k": "bar", "v": 3.0}""", "some_other_field_2")
)).toDF("key", "jsonData", "blobData")
I assume that the blob field cannot be represented in JSON. Otherwise you can omit the splitting and joining:
import org.apache.spark.sql.Row
val blobs = df.drop("jsonData").withColumnRenamed("key", "bkey")
val jsons = sqlContext.read.json(df.drop("blobData").map{
case Row(key: String, json: String) =>
s"""{"key": "$key", "jsonData": $json}"""
})
val parsed = jsons.join(blobs, $"key" === $"bkey").drop("bkey")
parsed.printSchema
// root
// |-- jsonData: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: double (nullable = true)
// |-- key: long (nullable = true)
// |-- blobData: string (nullable = true)
An alternative (cheaper, although more complex) approach is to use a UDF to parse JSON and output a struct or map column. For example, something like this:
import net.liftweb.json.parse
case class KV(k: String, v: Int)
val parseJson = udf((s: String) => {
implicit val formats = net.liftweb.json.DefaultFormats
parse(s).extract[KV]
})
val parsed = df.withColumn("parsedJSON", parseJson($"jsonData"))
parsed.show
// +---+--------------------+------------------+----------+
// |key| jsonData| blobData|parsedJSON|
// +---+--------------------+------------------+----------+
// | 1|{"k": "foo", "v":...|some_other_field_1| [foo,1]|
// | 2|{"k": "bar", "v":...|some_other_field_2| [bar,3]|
// +---+--------------------+------------------+----------+
parsed.printSchema
// root
// |-- key: string (nullable = true)
// |-- jsonData: string (nullable = true)
// |-- blobData: string (nullable = true)
// |-- parsedJSON: struct (nullable = true)
// | |-- k: string (nullable = true)
// | |-- v: integer (nullable = false)
zero323's answer is thorough but misses one approach that is available in Spark 2.1+ and is simpler and more robust than using schema_of_json():
import org.apache.spark.sql.functions.from_json
val json_schema = spark.read.json(df.select("jsonData").as[String]).schema
df.withColumn("jsonData", from_json($"jsonData", json_schema))
Here's the Python equivalent:
from pyspark.sql.functions import from_json
json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
df.withColumn("jsonData", from_json("jsonData", json_schema))
The problem with schema_of_json(), as zero323 points out, is that it inspects a single string and derives a schema from that. If you have JSON data with varied schemas, then the schema you get back from schema_of_json() will not reflect what you would get if you were to merge the schemas of all the JSON data in your DataFrame. Parsing that data with from_json() will then yield a lot of null or empty values where the schema returned by schema_of_json() doesn't match the data.
By using Spark's ability to derive a comprehensive JSON schema from an RDD of JSON strings, we can guarantee that all the JSON data can be parsed.
Example: schema_of_json() vs. spark.read.json()
Here's an example (in Python, the code is very similar for Scala) to illustrate the difference between deriving the schema from a single element with schema_of_json() and deriving it from all the data using spark.read.json().
>>> df = spark.createDataFrame(
... [
... (1, '{"a": true}'),
... (2, '{"a": "hello"}'),
... (3, '{"b": 22}'),
... ],
... schema=['id', 'jsonData'],
... )
a has a boolean value in one row and a string value in another. The merged schema for a would set its type to string. b would be an integer.
Let's see how the different approaches compare. First, the schema_of_json() approach:
>>> from pyspark.sql.functions import schema_of_json, from_json
>>> json_schema = schema_of_json(df.select("jsonData").take(1)[0][0])
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: boolean (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true]|
| 2| null|
| 3| []|
+---+--------+
As you can see, the JSON schema we derived was very limited. "a": "hello" couldn't be parsed as a boolean and returned null, and "b": 22 was just dropped because it wasn't in our schema.
Now with spark.read.json():
>>> json_schema = spark.read.json(df.select("jsonData").rdd.map(lambda x: x[0])).schema
>>> parsed_json_df = df.withColumn("jsonData", from_json("jsonData", json_schema))
>>> parsed_json_df.printSchema()
root
|-- id: long (nullable = true)
|-- jsonData: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: long (nullable = true)
>>> parsed_json_df.show()
+---+--------+
| id|jsonData|
+---+--------+
| 1| [true,]|
| 2|[hello,]|
| 3| [, 22]|
+---+--------+
Here we have all our data preserved, and with a comprehensive schema that accounts for all the data. "a": true was cast as a string to match the schema of "a": "hello".
The main downside of using spark.read.json() is that Spark will scan through all your data to derive the schema. Depending on how much data you have, that overhead could be significant. If you know that all your JSON data has a consistent schema, it's fine to go ahead and just use schema_of_json() against a single element. If you have schema variability but don't want to scan through all your data, you can set samplingRatio to something less than 1.0 in your call to spark.read.json() to look at a subset of the data.
Here are the docs for spark.read.json(): Scala API / Python API
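For completeness, a short Scala sketch of the samplingRatio option mentioned above (the value is the fraction of input records used for schema inference):
import spark.implicits._

val sampledSchema = spark.read
  .option("samplingRatio", "0.1") // infer the schema from roughly 10% of the JSON strings
  .json(df.select($"jsonData").as[String])
  .schema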
The from_json function is exactly what you're looking for. Your code will look something like:
val df = sqlContext.read
.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "mytable", "keyspace" -> "ks1"))
.load()
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// You can define whatever struct type your JSON contains
val schema = StructType(Seq(
  StructField("key", StringType, true),
  StructField("value", DoubleType, true)
))
df.withColumn("jsonData", from_json(col("jsonData"), schema))
The underlying JSON string is:
"{ \"column_name1\":\"value1\",\"column_name2\":\"value2\",\"column_name3\":\"value3\",\"column_name5\":\"value5\"}";
Below is the script to filter the JSON and load the required data into Cassandra.
sqlContext.read.json(rdd).select("column_name1", "column_name2", "column_name3") // or whatever the field names in the JSON are
.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "Table_name", "keyspace" -> "Key_Space_name"))
.mode(SaveMode.Append)
.save()
I use the following (available since 2.2.0), assuming that your JSON string column is at column index 0:
import org.apache.spark.sql.{DataFrame, Encoders, Row, SparkSession}

def parse(df: DataFrame, spark: SparkSession): DataFrame = {
  val stringDf = df.map((value: Row) => value.getString(0))(Encoders.STRING)
  spark.read.json(stringDf)
}
It will automatically infer the schema of your JSON. Documented here:
https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/DataFrameReader.html
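A usage sketch for the helper above, assuming the JSON strings sit in the single selected column:
val jsonOnly = df.select("jsonData")
val expanded = parse(jsonOnly, spark)
expanded.printSchema()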