Explode JSON array into rows - json

I have a dataframe which has 2 columns: "ID" and "input_array" (values are JSON arrays).
ID input_array
1 [ {"A":300, "B":400}, { "A":500,"B": 600} ]
2 [ {"A": 800, "B": 900} ]
Output that I need:
ID A B
1 300 400
1 500 600
2 800 900
I tried the from_json and explode functions, but I get a data type mismatch error for the array columns.
Real data image
In the image, the 1st dataframe is the input dataframe which I need to read and convert to the 2nd dataframe. 3 input rows need to be converted to 5 output rows.

I have 2 interpretations of what data type your input column "input_array" might have.
If it's a string...
df = spark.createDataFrame(
    [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
     (2, '[ {"A": 800, "B": 900} ]')],
    ['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: string (nullable = true)
...you can use from_json to extract a Spark structure from the JSON string and then inline to explode the resulting array of structs into columns.
df = df.selectExpr(
    "ID",
    "inline(from_json(input_array, 'array<struct<A:long,B:long>>'))"
)
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
If it's an array of strings...
df = spark.createDataFrame(
    [(1, ['{"A":300, "B":400}', '{ "A":500,"B": 600}']),
     (2, ['{"A": 800, "B": 900}'])],
    ['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: array (nullable = true)
# | |-- element: string (containsNull = true)
...you can first use explode to move every array element into its own row (resulting in a column of string type), then use from_json to create Spark data types from the strings, and finally select with * to expand the structs into columns.
from pyspark.sql import functions as F
df = df.withColumn('input_array', F.explode('input_array'))
df = df.withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))
df = df.select('ID', 'input_array.*')
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+

You can remove the square brackets by using the regexp_replace or substring functions.
Then you can transform a string holding multiple JSON objects into an array by using the split function.
Then you can unwrap the array and make a new row for each element by using the explode function.
Then you can handle the JSON column by using the from_json function.
Doc: pyspark.sql.functions
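A minimal sketch of that approach on the question's sample data (the split regex below is my assumption; it splits on a comma that sits between '}' and '{' so that it does not split inside an object):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
     (2, '[ {"A": 800, "B": 900} ]')],
    ['ID', 'input_array'])

df = (df
      # 1. drop the surrounding square brackets
      .withColumn('input_array', F.regexp_replace('input_array', r'^\s*\[|\]\s*$', ''))
      # 2. split the remaining string into one element per JSON object
      .withColumn('input_array', F.split('input_array', r'(?<=\})\s*,\s*(?=\{)'))
      # 3. one row per array element
      .withColumn('input_array', F.explode('input_array'))
      # 4. parse each JSON object and expand its fields into columns
      .withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))
      .select('ID', 'input_array.*'))

df.show()
# +---+---+---+
# | ID|  A|  B|
# +---+---+---+
# |  1|300|400|
# |  1|500|600|
# |  2|800|900|
# +---+---+---+
Splitting on a plain comma would break the objects apart, hence the lookaround regex; when the whole value is valid JSON, parsing it directly with from_json and an array schema (as in the answer above) is simpler.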

If Input_array is a string, then you need to parse this string as JSON, explode it into rows, and expand the keys into columns. You can parse the array using the ArrayType data structure:
from pyspark.sql.types import *
from pyspark.sql import functions as F
data = [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
        ('2', '[{"A": 800, "B": 900}]')]
my_schema = ArrayType(
    StructType([
        StructField('A', IntegerType()),
        StructField('B', IntegerType())
    ])
)
df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select("id", F.explode("Input_array").alias("Input_array"))\
    .select("id", F.col('Input_array.*'))
df.show(truncate=False)
# +---+---+---+
# |id |A |B |
# +---+---+---+
# |1 |300|400|
# |1 |500|600|
# |2 |800|900|
# +---+---+---+

Related

Opening a json column as a string in pyspark schema and working with it

I have a big dataframe whose schema I cannot infer. I have a column where each value could be read as JSON, but I cannot know its full detail (i.e. the keys and values can vary and I do not know what they can be).
I want to read it as a string and work with it, but the format changes in a strange way in the process; here is an example:
from pyspark.sql.types import *
data = [{"ID": 1, "Value": {"a":12, "b": "test"}},
{"ID": 2, "Value": {"a":13, "b": "test2"}}
]
df = spark.createDataFrame(data)
#change my schema to open the column as string
schema = df.schema
j = schema.jsonValue()
j["fields"][1] = {"name": "Value", "type": "string", "nullable": True, "metadata": {}}
new_schema = StructType.fromJson(j)
df2 = spark.createDataFrame(data, schema=new_schema)
df2.show()
Gives me
+---+---------------+
| ID| Value|
+---+---------------+
| 1| {a=12, b=test}|
| 2|{a=13, b=test2}|
+---+---------------+
As one can see, the format in column Value is now without quotes and uses = instead of :, and I cannot work with it properly anymore.
How can I turn that back into a StructType or MapType ?
Assuming this is your input dataframe:
df2 = spark.createDataFrame([
    (1, "{a=12, b=test}"), (2, "{a=13, b=test2}")
], ["ID", "Value"])
You can use the str_to_map function after removing the {} from the string column, like this:
from pyspark.sql import functions as F
df = df2.withColumn(
    "Value",
    F.regexp_replace("Value", "[{}]", "")
).withColumn(
    "Value",
    F.expr("str_to_map(Value, ', ', '=')")
)
df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Value: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
df.show()
#+---+---------------------+
#|ID |Value |
#+---+---------------------+
#|1 |{a -> 12, b -> test} |
#|2 |{a -> 13, b -> test2}|
#+---+---------------------+

Flattening json string in spark

I have the following dataframe in spark:
root
|-- user_id: string (nullable = true)
|-- payload: string (nullable = true)
in which payload is a JSON string with no fixed schema. Here are some sample data:
{'user_id': '001','payload': '{"country":"US","time":"11111"}'}
{'user_id': '002','payload': '{"message_id":"8936716"}'}
{'user_id': '003','payload': '{"brand":"adidas","when":""}'}
I want to output the above data in JSON format with a flattened payload (basically just extracting the key-value pairs from payload and putting them at the root level), for example:
{'user_id': '001','country':'US','time':'11111'}
{'user_id': '002','message_id':'8936716'}
{'user_id': '003','brand':'adidas','when':''}
Stack Overflow said this is a duplicate of Flatten Nested Spark Dataframe, but it's not.
The difference here is that the value of payload in my case is just string type.
You can parse the payload JSON as a map<string,string> and add the user_id to the payload:
import pyspark.sql.functions as F
# input dataframe
df.show(truncate=False)
+-------+-------------------------------+
|user_id|payload |
+-------+-------------------------------+
|001 |{"country":"US","time":"11111"}|
|002 |{"message_id":"8936716"} |
|003 |{"brand":"adidas","when":""} |
+-------+-------------------------------+
df2 = df.select(
    F.to_json(
        F.map_concat(
            F.create_map(F.lit('user_id'), F.col('user_id')),
            F.from_json('payload', 'map<string,string>')
        )
    ).alias('out')
)
df2.show(truncate=False)
+-----------------------------------------------+
|out |
+-----------------------------------------------+
|{"user_id":"001","country":"US","time":"11111"}|
|{"user_id":"002","message_id":"8936716"} |
|{"user_id":"003","brand":"adidas","when":""} |
+-----------------------------------------------+
To write it to a JSON file, you can do the following (each row of out is already a JSON string, so it is written as plain text):
df2.coalesce(1).write.text('filepath')
This is how I finally solved the problem:
from pyspark.sql.functions import col, from_json

# infer the payload schema from the data itself, then parse and flatten it
json_schema = spark.read.json(source_parquet_df.rdd.map(lambda row: row.payload)).schema
new_df = source_parquet_df.withColumn('payload_json_obj', from_json(col('payload'), json_schema)).drop(source_parquet_df.payload)
flat_df = new_df.select([c for c in new_df.columns if c != 'payload_json_obj'] + ['payload_json_obj.*'])

Pyspark dataframe with json, iteration to create new dataframe

I have data with the following format:
customer_id  model
1            [{color: 'red', group: 'A'},{color: 'green', group: 'B'}]
2            [{color: 'red', group: 'A'}]
I need to process it so that I create a new dataframe with the following output:
customer_id  color  group
1            red    A
1            green  B
2            red    A
Now I can do this easily with python:
import pandas as pd
import json
newdf = pd.DataFrame([])
for index, row in df.iterrows():
    s = row['model']
    x = json.loads(s)
    colors_list = []
    users_list = []
    groups_list = []
    for i in range(len(x)):
        colors_list.append(x[i]['color'])
        users_list.append(row['user_id'])
        groups_list.append(x[i]['group'])
    newdf = newdf.append(pd.DataFrame({'customer_id': users_list, 'group': groups_list, 'color': colors_list}))
How can I achieve the same result with pyspark?
I'm showing the first rows and schema of original dataframe:
+-----------+--------------------+
|customer_id| model |
+-----------+--------------------+
| 3541|[{"score":0.04767...|
| 171811|[{"score":0.04473...|
| 12008|[{"score":0.08043...|
| 78964|[{"score":0.06669...|
| 119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows
root
|-- user_id: integer (nullable = true)
|-- groups: string (nullable = true)
from_json can parse a string column that contains JSON data:
from pyspark.sql import functions as F
from pyspark.sql import types as T
data = [[1, "[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]"],
        [2, "[{color: 'red', group: 'A'}]"]]
df = spark.createDataFrame(data, schema=["customer_id", "model"]) \
    .withColumn("model", F.from_json("model", T.ArrayType(T.MapType(T.StringType(), T.StringType())), {"allowUnquotedFieldNames": True})) \
    .withColumn("model", F.explode("model")) \
    .withColumn("color", F.col("model")["color"]) \
    .withColumn("group", F.col("model")["group"]) \
    .drop("model")
Result:
+-----------+-----+-----+
|customer_id|color|group|
+-----------+-----+-----+
| 1| red| A|
| 1|green| B|
| 2| red| A|
+-----------+-----+-----+

spark dataframes : reading json having duplicate column names but different datatypes

I have JSON data like below, where the version field is the differentiator:
file_1 = {"version": 1, "stats": {"hits":20}}
file_2 = {"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}
In the new format, the stats column is now ArrayType(StructType).
Earlier only file_1 was needed so I was using
spark.read.schema(schema_def_v1).json(path)
Now I need to read both these types of multiple JSON files, which come together. I cannot define stats as a string in schema_def, as that would affect the corrupt-record feature (for the stats column), which checks malformed JSON and schema compliance of all the fields.
Example df output required in 1 read only:
version | hour | hits
1 | null | 20
2 | 1 | 10
2 | 2 | 12
I have tried to read with the mergeSchema option, but that makes the stats field String type.
Also, I have tried making two dataframes by filtering on the version field and applying spark.read.schema(schema_def_v1).json(df_v1.toJSON). Here too the stats column becomes String type.
I was thinking that if, while reading, I could parse the df column headers as stats_v1 and stats_v2 on the basis of data types, that could solve the problem. Please help with any possible solutions.
A UDF to check whether the value is a string or an array; if it is a string, it converts the string to an array:
import org.apache.spark.sql.functions.udf
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write
import scala.util.{Failure, Success, Try}
object Parse {
  implicit val formats = DefaultFormats
  def toArray(data: String) = {
    val json_data = parse(data)
    if (json_data.isInstanceOf[JObject]) write(List(json_data)) else data
  }
}
val toJsonArray = udf(Parse.toArray _)
scala> "ls -ltr /tmp/data".!
total 16
-rw-r--r-- 1 srinivas root 37 Jun 26 17:49 file_1.json
-rw-r--r-- 1 srinivas root 69 Jun 26 17:49 file_2.json
res4: Int = 0
scala> val df = spark.read.json("/tmp/data").select("stats","version")
df: org.apache.spark.sql.DataFrame = [stats: string, version: bigint]
scala> df.printSchema
root
|-- stats: string (nullable = true)
|-- version: long (nullable = true)
scala> df.show(false)
+-------+-------------------------------------------+
|version|stats |
+-------+-------------------------------------------+
|1 |{"hits":20} |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|
+-------+-------------------------------------------+
Output
scala>
import org.apache.spark.sql.types._
val schema = ArrayType(MapType(StringType,IntegerType))
df
  .withColumn("json_stats", explode(from_json(toJsonArray($"stats"), schema)))
  .select(
    $"version",
    $"stats",
    $"json_stats".getItem("hour").as("hour"),
    $"json_stats".getItem("hits").as("hits")
  ).show(false)
+-------+-------------------------------------------+----+----+
|version|stats |hour|hits|
+-------+-------------------------------------------+----+----+
|1 |{"hits":20} |null|20 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|1 |10 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|2 |12 |
+-------+-------------------------------------------+----+----+
Without UDF
scala> val schema = ArrayType(MapType(StringType,IntegerType))
scala> val expr = when(!$"stats".contains("[{"),concat(lit("["),$"stats",lit("]"))).otherwise($"stats")
df
  .withColumn("stats", expr)
  .withColumn("stats", explode(from_json($"stats", schema)))
  .select(
    $"version",
    $"stats",
    $"stats".getItem("hour").as("hour"),
    $"stats".getItem("hits").as("hits")
  )
  .show(false)
+-------+-----------------------+----+----+
|version|stats |hour|hits|
+-------+-----------------------+----+----+
|1 |[hits -> 20] |null|20 |
|2 |[hour -> 1, hits -> 10]|1 |10 |
|2 |[hour -> 2, hits -> 12]|2 |12 |
+-------+-----------------------+----+----+
Read the second file first, explode stats, use schema to read first file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_1 = {"version": 1, "stats": {"hits": 20}}
file_2 = {"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}
df1 = spark.read.json(sc.parallelize([file_2])).withColumn('stats', explode('stats'))
schema = df1.schema
spark.read.schema(schema).json(sc.parallelize([file_1])).printSchema()
output >> root
|-- stats: struct (nullable = true)
| |-- hits: long (nullable = true)
| |-- hour: long (nullable = true)
|-- version: long (nullable = true)
IIUC, you can read the JSON files using spark.read.text and then parse the value with json_tuple and from_json. Notice that for the stats field we use coalesce to parse it against two or more schemas. (Add wholetext=True as an argument of spark.read.text if each file contains a single JSON document across multiple lines.)
from pyspark.sql.functions import json_tuple, coalesce, from_json, array
df = spark.read.text("/path/to/all/jsons/")
schema_1 = "array<struct<hour:int,hits:int>>"
schema_2 = "struct<hour:int,hits:int>"
df.select(json_tuple('value', 'version', 'stats').alias('version', 'stats')) \
    .withColumn('status', coalesce(from_json('stats', schema_1), array(from_json('stats', schema_2)))) \
    .selectExpr('version', 'inline_outer(status)') \
    .show()
+-------+----+----+
|version|hour|hits|
+-------+----+----+
| 2| 1| 10|
| 2| 2| 12|
| 1|null| 20|
+-------+----+----+

pyspark convert row to json with nulls

Goal:
For a dataframe with schema
id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string
I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions.
My code
df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))
I am having one issue:
Issue:
When any row has a null value for a column (and my data has many...) the JSON string doesn't contain the key. I.e. if only 9 out of the 27 columns have values, then the JSON string only has 9 keys... What I would like to do is maintain all keys, but for the null values just pass an empty string "".
Any tips?
You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.
Consider the following example DataFrame:
data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]
sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)
Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.
from pyspark.sql.functions import col, to_json, struct, when, lit
sdf = sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [
                when(
                    col(x).isNotNull(),
                    col(x)
                ).otherwise(lit("")).alias(x)
                for x in sdf.columns
            ]
        )
    )
)
sdf.show()
#+-----+----+---+-----------------------------+
#|A |B |C |JSON |
#+-----+----+---+-----------------------------+
#|one |1 |10 |{"A":"one","B":"1","C":"10"} |
#|null |2 |20 |{"A":"","B":"2","C":"20"} |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"} |
#+-----+----+---+-----------------------------+
Another option is to use pyspark.sql.functions.coalesce instead of when:
from pyspark.sql.functions import coalesce
sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]
        )
    )
).show(truncate=False)
## Same as above