Opening a JSON column as a string in a PySpark schema and working with it

I have a big dataframe whose schema I cannot infer. I have a column that could be read as if each value were JSON, but I cannot know its full structure in advance (i.e. the keys and values can vary and I do not know what they can be).
I want to read it as a string and work with it, but the format changes in a strange way in the process; here is an example:
from pyspark.sql.types import *
data = [{"ID": 1, "Value": {"a":12, "b": "test"}},
{"ID": 2, "Value": {"a":13, "b": "test2"}}
]
df = spark.createDataFrame(data)
# change my schema to open the column as a string
schema = df.schema
j = schema.jsonValue()
j["fields"][1] = {"name": "Value", "type": "string", "nullable": True, "metadata": {}}
new_schema = StructType.fromJson(j)
df2 = spark.createDataFrame(data, schema=new_schema)
df2.show()
Gives me
+---+---------------+
| ID| Value|
+---+---------------+
| 1| {a=12, b=test}|
| 2|{a=13, b=test2}|
+---+---------------+
As one can see, the format in the column Value now has no quotes and uses = instead of :, and I cannot work with it properly anymore.
How can I turn that back into a StructType or MapType?

Assuming this is your input dataframe:
df2 = spark.createDataFrame([
    (1, "{a=12, b=test}"), (2, "{a=13, b=test2}")
], ["ID", "Value"])
You can use the str_to_map function after removing the {} from the string column, like this:
from pyspark.sql import functions as F

df = df2.withColumn(
    "Value",
    F.regexp_replace("Value", "[{}]", "")
).withColumn(
    "Value",
    F.expr("str_to_map(Value, ', ', '=')")
)
df.printSchema()
#root
# |-- ID: long (nullable = true)
# |-- Value: map (nullable = true)
# | |-- key: string
# | |-- value: string (valueContainsNull = true)
df.show()
#+---+---------------------+
#|ID |Value |
#+---+---------------------+
#|1 |{a -> 12, b -> test} |
#|2 |{a -> 13, b -> test2}|
#+---+---------------------+
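Once Value is a MapType you can work with it again. For example (a small usage sketch, not part of the original answer), individual keys can be pulled out with getItem, which returns null when a key is missing; note the values are strings, since str_to_map produces a map<string,string>:
from pyspark.sql import functions as F

df.select(
    "ID",
    F.col("Value").getItem("a").alias("a"),   # e.g. "12" for ID 1
    F.col("Value").getItem("b").alias("b")    # e.g. "test" for ID 1
).show()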

Related

Explode JSON array into rows

I have a dataframe which has 2 columns: "ID" and "input_array" (values are JSON arrays).
ID input_array
1 [ {"A":300, "B":400}, { "A":500,"B": 600} ]
2 [ {"A": 800, "B": 900} ]
Output that I need:
ID A B
1 300 400
1 500 600
2 800 900
I tried the from_json and explode functions, but I get a data type mismatch error for the array columns.
Real data image:
In the image, the 1st dataframe is the input dataframe which I need to read and convert into the 2nd dataframe; 3 input rows need to be converted into 5 output rows.
I have 2 interpretations of what data type your input column "input_array" could have.
If it's a string...
df = spark.createDataFrame(
    [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
     (2, '[ {"A": 800, "B": 900} ]')],
    ['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: string (nullable = true)
...you can use from_json to extract a Spark structure from the JSON string and then inline to explode the resulting array of structs into columns.
df = df.selectExpr(
    "ID",
    "inline(from_json(input_array, 'array<struct<A:long,B:long>>'))"
)
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
If it's an array of strings...
df = spark.createDataFrame(
    [(1, ['{"A":300, "B":400}', '{ "A":500,"B": 600}']),
     (2, ['{"A": 800, "B": 900}'])],
    ['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: array (nullable = true)
# | |-- element: string (containsNull = true)
...you can first use explode to move every element of the array into its own row, resulting in a column of string type, then use from_json to create Spark data types from the strings, and finally expand the structs into columns with .*.
from pyspark.sql import functions as F
df = df.withColumn('input_array', F.explode('input_array'))
df = df.withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))
df = df.select('ID', 'input_array.*')
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
You can remove the square brackets by using the regexp_replace or substring functions.
Then you can transform a string with multiple JSON objects into an array by using the split function.
Then you can unwrap the array and make a new row for each element in the array by using the explode function.
Then you can parse the column with JSON by using the from_json function.
Doc: pyspark.sql.functions
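A minimal sketch of those steps, using the sample string input from the answers above (the split pattern between } and { assumes the JSON objects are not nested):
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
     (2, '[ {"A": 800, "B": 900} ]')],
    ['ID', 'input_array'])

result = (df
    .withColumn('input_array', F.regexp_replace('input_array', r'^\s*\[|\]\s*$', ''))  # remove the outer [ ]
    .withColumn('input_array', F.split('input_array', r'(?<=\})\s*,\s*(?=\{)'))        # string -> array of JSON strings
    .withColumn('input_array', F.explode('input_array'))                               # one row per JSON object
    .withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))    # parse each object
    .select('ID', 'input_array.*'))                                                    # expand A and B into columns
result.show()  # same ID/A/B rows as in the answers above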
If Input_array is a string then you need to parse this string as JSON, then explode it into rows and expand the keys into columns. You can parse the array using an ArrayType schema:
from pyspark.sql.types import *
from pyspark.sql import functions as F
data = [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
        ('2', '[{"A": 800, "B": 900}]')]

my_schema = ArrayType(
    StructType([
        StructField('A', IntegerType()),
        StructField('B', IntegerType())
    ])
)

df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select("id", F.explode("Input_array").alias("Input_array"))\
    .select("id", F.col('Input_array.*'))
df.show(truncate=False)
# +---+---+---+
# |id |A |B |
# +---+---+---+
# |1 |300|400|
# |1 |500|600|
# |2 |800|900|
# +---+---+---+

How to transform json column in databricks with scala

I have two similar columns of json in a dataframe:
Column A's values look like this:
{"categoryCode": "ZZ","yesNoCode": "Y","conditionIndicator1": "ST"}
Column B's values look like this:
{"categoryCode": "ZZ","yesNoCode": "Y","conditionIndicator1": "ST", "conditionIndicator2": "RN"}
I have other columns as well, whose values are similar, but have 3 condition indicators, 4 condition indicators or 5 condition indicators. Each column's values always have the same number of condition indicators.
I'd like to do something easy like val resultDf = df.select(array($"col1", $"col2", ...)) but this fails because the array function won't allow you to select columns of different types. So I need to call some function f to do a transformation on each column's json, so that all of them have five condition indicators.
val resultDf = df.select(array(f($"col1"), f($"col2"), ...)).alias("normalizedData")
This would yield:
[
{"categoryCode": "ZZ", "yesNoCode": "Y", "conditionIndicator1": "ST", "conditionIndicator2": "", "conditionIndicator3": "", "conditionIndicator4": "", "conditionIndicator5": ""},
{"categoryCode": "KF", "yesNoCode": "Y", "conditionIndicator1": "ST", "conditionIndicator2": "RN", "conditionIndicator3": "", "conditionIndicator4": "", "conditionIndicator5": ""},
{"categoryCode": "64", "yesNoCode": "N", "conditionIndicator1": "2M", "conditionIndicator2": "7X", "conditionIndicator3": "34", "conditionIndicator4": "22", "conditionIndicator5": "AE"}
]
So far what I am using is a udf, but it seems like this is the Really Hard Way, worse than string manipulation!
val createCondition = udf { (theCondition: Row) =>
  if (theCondition != null) {
    var m = Map(
      "categoryCode" -> theCondition.getAs[String]("categoryCode"),
      "yesNoCode" -> theCondition.getAs[String]("yesNoCode"),
      "conditionIndicator1" -> theCondition.getAs[String]("conditionIndicator1")
    )
    var c2 = theCondition.getAs[String]("conditionIndicator2")
    if (c2 == null) {
      c2 = ""
    }
    m = m + ("conditionIndicator2" -> c2)
    var c3 = theCondition.getAs[String]("conditionIndicator3")
    if (c3 == null) {
      c3 = ""
    }
    m = m + ("conditionIndicator3" -> c3)
    // et cetera...and way worse if we had a more complex object graph...which I do!
    // then return my map, m.
    m
  } else {
    null
  }
}
val result = df.select(array(createCondition($"col1"), createCondition($"col2")))
Transformation of complex data seems like the reason we have Databricks, so I'm certain I am doing this wrong. There should be a way to have a function that maps objects of type x into objects of type y.
Let's write a function that generates a schema that matches your data and that can have a variable number of conditions:
import org.apache.spark.sql.types._
def gen_schema(numberOfConditions: Int) = {
  val root = Array(StructField("categoryCode", StringType), StructField("yesNoCode", StringType))
  val conditions = (1 to numberOfConditions)
    .map(i => StructField(s"conditionIndicator$i", StringType))
  StructType(root ++ conditions)
}
// Let's define sample data and try the function:
val df = Seq(("""{"categoryCode": "ZZ","yesNoCode": "Y","conditionIndicator1": "ST"}""",
              """{"categoryCode": "ZZ","yesNoCode": "Y","conditionIndicator1": "ST", "conditionIndicator2": "RN"}"""))
  .toDF("col1", "col2")

df
  .withColumn("col1", from_json('col1, gen_schema(1)))
  .withColumn("col2", from_json('col2, gen_schema(2)))
  .show(false)
+-----------+---------------+
|col1 |col2 |
+-----------+---------------+
|[ZZ, Y, ST]|[ZZ, Y, ST, RN]|
+-----------+---------------+
By doing this, col1 and col2 do not have the same type, as you mention. But if you use gen_schema(5) for all columns, the fields whose JSON entry does not exist are simply generated as null. You will then be able to put them all inside an array:
val result = df.select(array(from_json('col1, gen_schema(5)),
                             from_json('col2, gen_schema(5))))
result.show(false)
result.printSchema
+-----------------------------------------------+
|array(jsontostructs(col1), jsontostructs(col2))|
+-----------------------------------------------+
|[[ZZ, Y, ST,,,,], [ZZ, Y, ST, RN,,,]] |
+-----------------------------------------------+
root
|-- array(jsontostructs(col1), jsontostructs(col2)): array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- categoryCode: string (nullable = true)
| | |-- yesNoCode: string (nullable = true)
| | |-- conditionIndicator1: string (nullable = true)
| | |-- conditionIndicator2: string (nullable = true)
| | |-- conditionIndicator3: string (nullable = true)
| | |-- conditionIndicator4: string (nullable = true)
| | |-- conditionIndicator5: string (nullable = true)

Apache Spark - Transform Map[String, String] to Map[String, Struct]

I have below dataset:
{
  "col1": "val1",
  "col2": {
    "key1": "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
    "key2": "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"
  }
}
with schema StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StringType,true),true)).
I want to convert col2 to below format:
{
  "col1": "val1",
  "col2": {
    "key1": {"SubCol1":"ABCD","SubCol2":"EFGH"},
    "key2": {"SubCol1":"IJKL","SubCol2":"MNOP"}
  }
}
The updated dataset schema will be as below:
StructType(StructField(col1,StringType,true), StructField(col2,MapType(StringType,StructType(StructField(SubCol1,StringType,true), StructField(SubCol2,StringType,true)),true),true))
You can use transform_values (available since Spark 3.0) on the map column:
val df2 = df.withColumn(
  "col2",
  expr("transform_values(col2, (k, x) -> from_json(x, 'struct<SubCol1:string, SubCol2:string>'))")
)
Try the below code; it will work in Spark 2.4.7.
Creating DataFrame with sample data.
scala> val df = Seq(
         ("val1", Map(
           "key1" -> "{\"SubCol1\":\"ABCD\",\"SubCol2\":\"EFGH\"}",
           "key2" -> "{\"SubCol1\":\"IJKL\",\"SubCol2\":\"MNOP\"}"))
       ).toDF("col1", "col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: map<string,string>]
Steps:
Extract the map keys (map_keys) and values (map_values) into separate arrays.
Convert the map values into the desired output, i.e. Struct.
Use the map_from_arrays function to combine the keys & values from the above steps to create Map[String, Struct].
scala> val finalDF = df
         .withColumn(
           "col2_new",
           map_from_arrays(
             map_keys($"col2"),
             expr("""transform(map_values(col2), x -> from_json(x,"struct<SubCol1:string, SubCol2:string>"))""")
           )
         )
Printing Schema
finalDF.printSchema
root
|-- col1: string (nullable = true)
|-- col2: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- col2_new: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- SubCol1: string (nullable = true)
| | |-- SubCol2: string (nullable = true)
Printing Final Output
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|col1|col2 |col2_new |
+----+------------------------------------------------------------------------------------------+--------------------------------------------+
|val1|[key1 -> {"SubCol1":"ABCD","SubCol2":"EFGH"}, key2 -> {"SubCol1":"IJKL","SubCol2":"MNOP"}]|[key1 -> [ABCD, EFGH], key2 -> [IJKL, MNOP]]|
+----+------------------------------------------------------------------------------------------+--------------------------------------------+

Pyspark: store dataframe as JSON in MySQL table column

I have a Spark dataframe which needs to be stored in JSON format in a MySQL table as a column value (along with other string-type values in their respective columns).
Something similar to this:
column 1 | column 2
val 1    | [{"name":"Peter G", "age":44, "city":"Quahog"}, {"name":"John G", "age":30, "city":"Quahog"}, {...}, ...]
val 1    | [{"name":"Stewie G", "age":3, "city":"Quahog"}, {"name":"Ron G", "age":41, "city":"Quahog"}, {...}, ...]
...      | ...
Here [{"name":"Peter G", "age":44, "city":"Quahog"}, {"name":"John G", "age":30, "city":"Quahog"}, {...}, ...] is the result of one dataframe stored as list of dict
I can do:
str(dataframe_object.toJSON().collect())
and then store it to mysql table column, but this would mean loading the entire data in memory, before storing it in mysql table. Is there better/optimal way to achieve this i.e. without using collect()?
I suppose you can convert your StructType column into a JSON string, then use DataFrame.write.jdbc to write to MySQL. As long as your MySQL table has that column as a JSON type, you should be all set.
# My sample data
{
  "c1": "val1",
  "c2": [
    { "name": "N1", "age": 100 },
    { "name": "N2", "age": 101 }
  ]
}
from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.StructType([
    T.StructField('c1', T.StringType()),
    T.StructField('c2', T.ArrayType(T.StructType([
        T.StructField('name', T.StringType()),
        T.StructField('age', T.IntegerType())
    ])))
])
df = spark.read.json('a.json', schema=schema, multiLine=True)
df.show(10, False)
df.printSchema()
+----+----------------------+
|c1 |c2 |
+----+----------------------+
|val1|[{N1, 100}, {N2, 101}]|
+----+----------------------+
root
|-- c1: string (nullable = true)
|-- c2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)
df.withColumn('j', F.to_json('c2')).show(10, False)
+----+----------------------+-------------------------------------------------+
|c1 |c2 |j |
+----+----------------------+-------------------------------------------------+
|val1|[{N1, 100}, {N2, 101}]|[{"name":"N1","age":100},{"name":"N2","age":101}]|
+----+----------------------+-------------------------------------------------+
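From there the JSON-string column can be written to MySQL with DataFrame.write.jdbc. A minimal sketch of that step (the URL, table name, credentials and driver class below are placeholders, and the MySQL Connector/J JAR must be available to Spark):
out = df.withColumn('c2', F.to_json('c2'))   # serialize the array-of-structs column to a JSON string

out.write.jdbc(
    url='jdbc:mysql://localhost:3306/mydb',  # placeholder host/database
    table='my_table',                        # placeholder table; its c2 column can be JSON or TEXT
    mode='append',
    properties={'user': 'user', 'password': 'password', 'driver': 'com.mysql.cj.jdbc.Driver'}
)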
EDIT #1:
# My sample data
{
  "c1": "val1",
  "c2": "[{ \"name\": \"N1\", \"age\": 100 },{ \"name\": \"N2\", \"age\": 101 }]"
}
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = spark.read.json('a.json', multiLine=True)
df.show(10, False)
df.printSchema()
+----+-----------------------------------------------------------+
|c1 |c2 |
+----+-----------------------------------------------------------+
|val1|[{ "name": "N1", "age": 100 },{ "name": "N2", "age": 101 }]|
+----+-----------------------------------------------------------+
root
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
schema = T.ArrayType(T.StructType([
    T.StructField('name', T.StringType()),
    T.StructField('age', T.IntegerType())
]))
df2 = df.withColumn('j', F.from_json('c2', schema))
df2.show(10, False)
df2.printSchema()
+----+-----------------------------------------------------------+----------------------+
|c1 |c2 |j |
+----+-----------------------------------------------------------+----------------------+
|val1|[{ "name": "N1", "age": 100 },{ "name": "N2", "age": 101 }]|[{N1, 100}, {N2, 101}]|
+----+-----------------------------------------------------------+----------------------+
root
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- j: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- age: integer (nullable = true)

pyspark convert row to json with nulls

Goal:
For a dataframe with schema
id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string
I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions.
My code
df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))
I am having one issue:
Issue:
When any row has a null value for a column (and my data has many...), the JSON string doesn't contain that key. I.e. if only 9 out of the 27 columns have values, then the JSON string only has 9 keys... What I would like to do is maintain all keys, but for the null values just pass an empty string "".
Any tips?
You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.
Consider the following example DataFrame:
data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]
sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)
Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.
from pyspark.sql.functions import col, to_json, struct, when, lit

sdf = sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [
                when(
                    col(x).isNotNull(),
                    col(x)
                ).otherwise(lit("")).alias(x)
                for x in sdf.columns
            ]
        )
    )
)
sdf.show()
#+-----+----+---+-----------------------------+
#|A |B |C |JSON |
#+-----+----+---+-----------------------------+
#|one |1 |10 |{"A":"one","B":"1","C":"10"} |
#|null |2 |20 |{"A":"","B":"2","C":"20"} |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"} |
#+-----+----+---+-----------------------------+
Another option is to use pyspark.sql.functions.coalesce instead of when:
from pyspark.sql.functions import coalesce

sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]
        )
    )
).show(truncate=False)
## Same as above