PySpark: parse an array of JSON objects into a one-column DataFrame

I have an array of nested JSON objects like that:
[
  {
    "a": 1,
    "n": {}
  }
]
I would like to read this JSON file (multiline) into a Spark DataFrame with a single column, where the column has StringType and contains the JSON object:
+----------+
| json |
+----------+
| {"a": 1, |
| "n": {}} |
+----------+
I tried to do the following:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("json", StringType(), True)])
spark.read.json('test.json', multiLine=True).show()
But it didn't work. Are there any options to do that in PySpark?

Found the solution myself:
json_schema = StructType([
    StructField("json", StringType(), True)
])
df = spark.read.json('test.json', multiLine=True)
df.toJSON().map(lambda x: [x]).toDF(schema=json_schema).show()
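The round trip through the RDD API is not strictly necessary, though. A minimal sketch of the same idea with only DataFrame functions (assuming the 'test.json' path from the question): read the file normally, then re-serialize each record into a single JSON string column with to_json. Note the output goes through Spark's inferred schema, so it may not be byte-identical to the input file.

from pyspark.sql import functions as F

df = spark.read.json('test.json', multiLine=True)
# pack all inferred columns back into one struct and serialize it to a JSON string
df.select(F.to_json(F.struct(*df.columns)).alias('json')).show(truncate=False)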

Related

Explode JSON array into rows

I have a dataframe which has 2 columns: "ID" and "input_array" (values are JSON arrays).
ID  input_array
1   [ {"A":300, "B":400}, { "A":500,"B": 600} ]
2   [ {"A": 800, "B": 900} ]
Output that I need:
ID A B
1 300 400
1 500 600
2 800 900
I tried the from_json and explode functions, but I get a data type mismatch error for the array columns.
Real data image
In the image, the 1st dataframe is the input dataframe which I need to read and convert to the 2nd dataframe; 3 input rows need to be converted to 5 output rows.
There are two interpretations of what data type your input column "input_array" could have.
If it's a string...
df = spark.createDataFrame(
    [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
     (2, '[ {"A": 800, "B": 900} ]')],
    ['ID', 'input_array'])
df.printSchema()
# root
#  |-- ID: long (nullable = true)
#  |-- input_array: string (nullable = true)
...you can use from_json to parse the JSON string into a Spark structure and then inline to explode the resulting array of structs into columns.
df = df.selectExpr(
    "ID",
    "inline(from_json(input_array, 'array<struct<A:long,B:long>>'))"
)
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
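On Spark 3.4+ the same thing can be written without selectExpr, since inline is also exposed in the Python API. A sketch equivalent to the SQL expression above:

from pyspark.sql import functions as F

df = df.select(
    'ID',
    F.inline(F.from_json('input_array', 'array<struct<A:long,B:long>>'))
)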
If it's an array of strings...
df = spark.createDataFrame(
    [(1, [ '{"A":300, "B":400}', '{ "A":500,"B": 600}' ]),
     (2, [ '{"A": 800, "B": 900}' ])],
    ['ID', 'input_array'])
df.printSchema()
# root
#  |-- ID: long (nullable = true)
#  |-- input_array: array (nullable = true)
#  |    |-- element: string (containsNull = true)
...you can first use explode to move every array element into its own row, which yields a column of string type; then use from_json to parse the strings into Spark structs; and finally expand the structs into columns with *.
from pyspark.sql import functions as F
df = df.withColumn('input_array', F.explode('input_array'))
df = df.withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))
df = df.select('ID', 'input_array.*')
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
You can remove the square brackets using the regexp_replace or substring functions.
Then you can turn a string holding multiple JSON objects into an array using the split function.
Then you can unwrap the array, making a new row for each element, using the explode function.
Then you can parse the JSON in each row using the from_json function (see the sketch below).
Doc: pyspark.sql.functions
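A minimal sketch of those steps chained together (assuming the "ID" and "input_array" columns from the question; the split pattern uses lookarounds so each object keeps its braces):

from pyspark.sql import functions as F

df = (df
    .withColumn('obj', F.regexp_replace('input_array', r'^\s*\[|\]\s*$', ''))  # drop the outer [ ]
    .withColumn('obj', F.split('obj', r'(?<=\})\s*,\s*(?=\{)'))                # split between objects
    .withColumn('obj', F.explode('obj'))                                       # one row per object
    .withColumn('obj', F.from_json('obj', 'struct<A:long,B:long>'))            # parse each object
    .select('ID', 'obj.*'))
df.show()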
If Input_array is a string, then you need to parse this string as JSON, explode it into rows, and expand the keys into columns. You can parse the array using an ArrayType schema:
from pyspark.sql.types import *
from pyspark.sql import functions as F

data = [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
        ('2', '[{"A": 800, "B": 900}]')]

my_schema = ArrayType(
    StructType([
        StructField('A', IntegerType()),
        StructField('B', IntegerType())
    ])
)

df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select("id", F.explode("Input_array").alias("Input_array"))\
    .select("id", F.col('Input_array.*'))

df.show(truncate=False)
# +---+---+---+
# |id |A |B |
# +---+---+---+
# |1 |300|400|
# |1 |500|600|
# |2 |800|900|
# +---+---+---+
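If some rows can carry a null or empty Input_array and you still want to keep them (with null A and B), explode_outer is a drop-in replacement for explode in the snippet above, a sketch:

df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select("id", F.explode_outer("Input_array").alias("Input_array"))\
    .select("id", F.col('Input_array.*'))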

Opening a json column as a string in pyspark schema and working with it

I have a big dataframe whose schema I cannot infer. There is a column that could be read as if each value were in JSON format, but I cannot know its full detail in advance (i.e. the keys and values can vary and I do not know what they can be).
I want to read it as a string and work with it, but the format changes in a strange way in the process; here is an example:
from pyspark.sql.types import *

data = [{"ID": 1, "Value": {"a": 12, "b": "test"}},
        {"ID": 2, "Value": {"a": 13, "b": "test2"}}]
df = spark.createDataFrame(data)

# change my schema to open the column as string
schema = df.schema
j = schema.jsonValue()
j["fields"][1] = {"name": "Value", "type": "string", "nullable": True, "metadata": {}}
new_schema = StructType.fromJson(j)

df2 = spark.createDataFrame(data, schema=new_schema)
df2.show()
Gives me
+---+---------------+
| ID| Value|
+---+---------------+
| 1| {a=12, b=test}|
| 2|{a=13, b=test2}|
+---+---------------+
As one can see, the format in column Value has lost its quotes and uses = instead of :, so I cannot work with it properly anymore.
How can I turn that back into a StructType or MapType?
Assuming this is your input dataframe:
df2 = spark.createDataFrame([
    (1, "{a=12, b=test}"), (2, "{a=13, b=test2}")
], ["ID", "Value"])
You can use the str_to_map function after removing { and } from the string column, like this:
from pyspark.sql import functions as F

df = df2.withColumn(
    "Value",
    F.regexp_replace("Value", "[{}]", "")
).withColumn(
    "Value",
    F.expr("str_to_map(Value, ', ', '=')")
)
df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Value: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
df.show()
#+---+---------------------+
#|ID |Value |
#+---+---------------------+
#|1 |{a -> 12, b -> test} |
#|2 |{a -> 13, b -> test2}|
#+---+---------------------+
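Note that the regexp/str_to_map approach is lossy if keys or values themselves contain =, commas, or braces. If you control the step that builds the dataframe, a sketch that sidesteps the mangling entirely (the json.dumps serialization here is an assumption about where your data comes from): keep the column as a valid JSON string and parse it back with from_json and a permissive map schema.

import json
from pyspark.sql import functions as F

data = [{"ID": 1, "Value": json.dumps({"a": 12, "b": "test"})},
        {"ID": 2, "Value": json.dumps({"a": 13, "b": "test2"})}]
df = spark.createDataFrame(data)  # Value is a valid JSON string, not a Python repr
df = df.withColumn("Value", F.from_json("Value", "map<string,string>"))
df.printSchema()  # Value: map (key: string, value: string)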

read pretty json format data through spark

We read data stored in hourly partitions in S3 through Spark in Scala. For example,
spark.read.textFile("s3://'Bucket'/'key'/'yyyy'/'MM'/'dd'/'hh'/*")
spark.read.textFile reads records one line at a time, so records stored as JSON Lines (the full JSON of a record on one line) are read and can be parsed later to retrieve data from the JSON.
Now I have to read data which contains multiple JSON documents in pretty-printed format instead of JSON Lines. Using the same strategy gives a corrupt record error. For example, the Dataset[String] obtained after reading through spark.read.textFile:
{
  "a": 1,
  "b": 2
}
is
+---------------+
|_corrupt_record|
+---------------+
|              {|
|       "a": 1, |
|         "b": 2|
|              }|
+---------------+
Input data:
{
  "key1": "value1",
  "key2": "value2"
}
{
  "key1": "value1",
  "key2": "value2"
}
Expected output:
+------+------+
|key1 |key2 |
+------+------+
|value1|value2|
|value1|value2|
+------+------+
The file contains multiple pretty-printed JSON documents, with a newline as the delimiter between records.
Approaches already used
spark.read.option("multiline", "true").json("") — this will not work, as multiline expects each file to contain a single JSON document (one object, or one array of the form [{},{}]), not several top-level objects.
Approach working
val x = sparkSession
  .read
  .json(sc
    .wholeTextFiles(filePath)
    .values
    .flatMap(x => {
      x
        .replace("\n", "")
        .replace("}{", "}}{{")
        .split("\\}\\{")
    }))
I just wanted to ask if there is a better approach, as the above solution does some slice and dice on the data, which might lead to performance issues for large data. Thanks
This can be a working solution for you: use from_json() and the correct schema in order to parse the JSON correctly.
Create the dataframe here:
from pyspark.sql import functions as F, types as T

df = spark.createDataFrame([str([{"key1": "value1", "key2": "value2"}, {"key1": "value3", "key2": "value4"}])], T.StringType())
df.show(truncate=False)
+----------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|
+----------------------------------------------------------------------------+
Now, use explode(), as the value/json column parses to a list, in order to map each element correctly. And finally, use getItem() to extract the columns:
df = df.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn("col", F.explode("col"))
df = df.withColumn("col", F.from_json("col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("key1", df.col.getItem("key1")).withColumn("key2", df.col.getItem("key2"))
df.show(truncate=False)
+----------------------------------------------------------------------------+--------------------------------+------+------+
|value                                                                       |col                             |key1  |key2  |
+----------------------------------------------------------------------------+--------------------------------+------+------+
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value1, key2 -> value2]|value1|value2|
|[{'key1': 'value1', 'key2': 'value2'}, {'key1': 'value3', 'key2': 'value4'}]|[key1 -> value3, key2 -> value4]|value3|value4|
+----------------------------------------------------------------------------+--------------------------------+------+------+
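A shorter variant of the same parse (a sketch, assuming each row holds one JSON array string; from_json's default allowSingleQuotes option lets it accept the single-quoted payload above): parse straight to an array of structs and explode in one pass.

df2 = df.select(
    F.explode(F.from_json("value", "array<struct<key1:string,key2:string>>")).alias("kv")
).select("kv.*")
df2.show()  # columns key1 and key2, one row per JSON object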

Convert string type column to struct column in pyspark

I have a dataframe which has a nested structure in it, so I know for sure it is a StructType; however, since it was converted from JSON, the schema is being inferred as string instead of struct. I would like it to remain a struct. How should I approach this problem?
The dataframe currently looks like this:
+------------+------------------------------------------------------+
|issuingClub | memberCards                                          |
+------------+------------------------------------------------------+
| 1234       | [{u'createdClub': u'1234', u'cardStatus': u'ACTIVE', |
|            |   u'issuedReason': u'new member', u'cardType':       |
|            |   u'MEMBERSHIPACCOUNT', u'cardNumber': u'109214092'}]|
+------------+------------------------------------------------------+
| 3712       | [{u'createdClub': u'3712', u'cardStatus': u'EXPIRE', |
|            |   u'issuedReason': u'old member', u'cardType':       |
|            |   u'MEMBERSHIPACCOUNT', u'cardNumber': u'109214092'}]|
+------------+------------------------------------------------------+
The schema for this dataframe is being inferred as:
root
 |-- issuingClub: string (nullable = true)
 |-- memberCards: string (nullable = true)
Instead of memberCards being a string, I would like to convert it into a StructType. How do I do this? Please help!
I have tried using this code:
import json

def parse_json(array_str):
    json_obj = json.loads(array_str)
    for item in json_obj:
        yield (item["createdClub"], item["cardStatus"], item["issuedReason"], item["cardType"], item["cardNumber"])

from pyspark.sql.types import ArrayType, IntegerType, StructType, StructField

json_schema = ArrayType(StructType([
    StructField("u'createdClub", StringType(), nullable=False),
    StructField("u'cardStatus", StringType(), nullable=False),
    StructField("u'issuedReason", StringType(), nullable=False),
    StructField("u'cardType", StringType(), nullable=False),
    StructField("u'cardNumber", StringType(), nullable=False)]))

from pyspark.sql.functions import udf

udf_parse_json = udf(lambda str: parse_json(str), json_schema)
df_new = old_df.select(delta_final_df["issuingClub"], udf_parse_json(delta_final_df["memberCards"]).alias("memberCards"))
Getting this error as a result of that:
ValueError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
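The error itself points at the root cause: the memberCards strings are Python reprs (single quotes, u'' prefixes), not valid JSON, so json.loads rejects them. A minimal sketch of a workaround (not from the original thread; old_df and the field names are taken from the question) is to parse the repr with ast.literal_eval instead:

import ast
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

card_schema = ArrayType(StructType([
    StructField("createdClub", StringType()),
    StructField("cardStatus", StringType()),
    StructField("issuedReason", StringType()),
    StructField("cardType", StringType()),
    StructField("cardNumber", StringType()),
]))

@udf(returnType=card_schema)
def parse_cards(s):
    # literal_eval safely evaluates the Python-literal string to a list of dicts
    return [(d["createdClub"], d["cardStatus"], d["issuedReason"],
             d["cardType"], d["cardNumber"]) for d in ast.literal_eval(s)]

df_new = old_df.withColumn("memberCards", parse_cards("memberCards"))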

Array of JSON to Dataframe in Spark received by Kafka

I'm writing a Spark application in Scala using Spark Structured Streaming that receives data formatted in JSON style from Kafka. The application can receive either a single JSON object or multiple ones, formatted in this way:
[{"key1":"value1","key2":"value2"},{"key1":"value1","key2":"value2"},...,{"key1":"value1","key2":"value2"}]
I tried to define a StructType like:
var schema = StructType(
  Array(
    StructField("key1", DataTypes.StringType),
    StructField("key2", DataTypes.StringType)
  ))
But it doesn't work.
My actual code for parsing JSON:
var data = (this.stream).getStreamer().load()
  .selectExpr("CAST (value AS STRING) as json")
  .select(from_json($"json", schema=schema).as("data"))
I would like to get this JSON objects in a dataframe like
+----------+---------+
| key1| key2|
+----------+---------+
| value1| value2|
| value1| value2|
........
| value1| value2|
+----------+---------+
Can anyone help me, please?
Thank you!
As your incoming string is an array of JSON, one way is to write a UDF to parse the array and then explode the parsed array. Below is the complete code with each step explained. I have written it for batch, but the same can be used for streaming with minimal changes.
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import org.apache.spark.sql.SparkSession

object JsonParser {

  // case class to parse the incoming JSON String
  case class JSON(key1: String, key2: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.
      builder().
      appName("JSON").
      master("local").
      getOrCreate()

    import spark.implicits._
    import org.apache.spark.sql.functions._

    // sample JSON array String coming from kafka
    val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")

    // UDF to parse JSON array String
    val jsonConverter = udf { jsonString: String =>
      val mapper = new ObjectMapper()
      mapper.registerModule(DefaultScalaModule)
      mapper.readValue(jsonString, classOf[Array[JSON]])
    }

    val df = str.toDF("json")                        // json String column
      .withColumn("array", jsonConverter($"json"))   // parse the JSON Array
      .withColumn("json", explode($"array"))         // explode the Array
      .drop("array")                                 // drop unwanted columns
      .select("json.*")                              // expand the struct to separate columns

    // display the DF
    df.show()
    // +------+------+
    // |  key1|  key2|
    // +------+------+
    // |value1|value2|
    // |value3|value4|
    // +------+------+
  }
}
This worked fine for me in Spark 3.0.0 and Scala 2.12.10. I used schema_of_json to get the schema of the data in a suitable format for from_json, and applied explode and * operator in the last step of the chain to expand accordingly.
// TO KNOW THE SCHEMA
scala> val str = Seq("""[{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}]""")
str: Seq[String] = List([{"key1":"value1","key2":"value2"},{"key1":"value3","key2":"value4"}])
scala> val df = str.toDF("json")
df: org.apache.spark.sql.DataFrame = [json: string]
scala> df.show()
+--------------------+
| json|
+--------------------+
|[{"key1":"value1"...|
+--------------------+
scala> val schema = df.select(schema_of_json(df.select(col("json")).first.getString(0))).as[String].first
schema: String = array<struct<key1:string,key2:string>>
Use the resulting string as your schema, 'array<struct<key1:string,key2:string>>', as follows:
// TO PARSE THE ARRAY OF JSON's
scala> val parsedJson1 = df.selectExpr("from_json(json, 'array<struct<key1:string,key2:string>>') as parsed_json")
parsedJson1: org.apache.spark.sql.DataFrame = [parsed_json: array<struct<key1:string,key2:string>>]
scala> parsedJson1.show()
+--------------------+
| parsed_json|
+--------------------+
|[[value1, value2]...|
+--------------------+
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json").select("json.*")
data: org.apache.spark.sql.DataFrame = [key1: string, key2: string]
scala> data.show()
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+
Just FYI, without the star expansion the intermediate result looks as follows:
scala> val data = parsedJson1.selectExpr("explode(parsed_json) as json")
data: org.apache.spark.sql.DataFrame = [json: struct<key1: string, key2: string>]
scala> data.show()
+----------------+
| json|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
You can add ArrayType to your schema, and from_json will then parse the data into an array of structs.
var schema = ArrayType(StructType(
  Array(
    StructField("key1", DataTypes.StringType),
    StructField("key2", DataTypes.StringType)
  )))
Explode it to get each JSON array element into its own row.
val explodedDf = df.withColumn("jsonData", explode(from_json(col("value"), schema)))
  .select($"jsonData")
explodedDf.show
+----------------+
| jsonData|
+----------------+
|[value1, value2]|
|[value3, value4]|
+----------------+
Select the JSON keys:
explodedDf.select("jsonData.*").show
+------+------+
| key1| key2|
+------+------+
|value1|value2|
|value3|value4|
+------+------+