PySpark Explode JSON String into Multiple Columns

I have a dataframe with a column of string datatype. The string represents the response of an API call that returns JSON.
df = spark.createDataFrame([
    ("[{original={ranking=1.0, input=top3}, response=[{to=Sam, position=guard}, {to=John, position=center}, {to=Andrew, position=forward}]}]", 1)],
    "col1:string, col2:int")
df.show()
Which generates a dataframe like:
+--------------------+----+
| col1|col2|
+--------------------+----+
|[{original={ranki...| 1|
+--------------------+----+
The output I would like keeps col2 and adds two additional columns from the response: col3 would capture the player name, indicated by to=, and col4 would hold their position, indicated by position=. The dataframe would then have three rows, since there are three players. Example:
+----+------+-------+
|col2| col3| col4|
+----+------+-------+
| 1| Sam| guard|
| 1| John| center|
| 1|Andrew|forward|
+----+------+-------+
I've read that I can leverage something like:
df.withColumn("col3",explode(from_json("col1")))
However, I'm not sure how to explode this into two columns instead of one, or how to define the schema.
Note: I can modify the response using json.dumps to return only the response piece of the string, i.e.
[{to=Sam, position=guard}, {to=John, position=center}, {to=Andrew, position=forward}]}]

If you simplify the output as mentioned, you can define a simple JSON schema, convert the JSON string into an array of structs, and read each field.
Input
df = spark.createDataFrame([("[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]",1)], "col1:string, col2:int")
# +-----------------------------------------------------------------------------------------------------------------+----+
# |col1 |col2|
# +-----------------------------------------------------------------------------------------------------------------+----+
# |[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]|1 |
# +-----------------------------------------------------------------------------------------------------------------+----+
And this is the transformation
from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.ArrayType(T.StructType([
    T.StructField('to', T.StringType()),
    T.StructField('position', T.StringType())
]))

(df
    .withColumn('temp', F.explode(F.from_json('col1', schema=schema)))
    .select(
        F.col('col2'),
        F.col('temp.to').alias('col3'),
        F.col('temp.position').alias('col4'),
    )
    .show()
)
# Output
# +----+------+-------+
# |col2| col3| col4|
# +----+------+-------+
# | 1| Sam| guard|
# | 1| John| center|
# | 1|Andrew|forward|
# +----+------+-------+
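If you do not want to trim the payload down to just the response part, here is a minimal sketch (an assumption, not part of the original answer): provided the full string were valid JSON (quoted keys and : instead of =), a nested schema lets you explode the outer array and then the inner response array directly.
from pyspark.sql import functions as F
from pyspark.sql import types as T
full_schema = T.ArrayType(T.StructType([
    T.StructField('original', T.StructType([
        T.StructField('ranking', T.DoubleType()),
        T.StructField('input', T.StringType())
    ])),
    T.StructField('response', T.ArrayType(T.StructType([
        T.StructField('to', T.StringType()),
        T.StructField('position', T.StringType())
    ])))
]))
full_json = '[{"original": {"ranking": 1.0, "input": "top3"}, "response": [{"to": "Sam", "position": "guard"}, {"to": "John", "position": "center"}, {"to": "Andrew", "position": "forward"}]}]'
df2 = spark.createDataFrame([(full_json, 1)], "col1:string, col2:int")
(df2
    .withColumn('item', F.explode(F.from_json('col1', full_schema)))  # one row per outer array element
    .withColumn('resp', F.explode('item.response'))                   # one row per player
    .select('col2', F.col('resp.to').alias('col3'), F.col('resp.position').alias('col4'))
    .show())
# Same output as above.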

Related

Explode JSON array into rows

I have a dataframe which has 2 columns: "ID" and "input_array" (values are JSON arrays).
ID  input_array
1   [ {"A":300, "B":400}, { "A":500,"B": 600} ]
2   [ {"A": 800, "B": 900} ]
Output that I need:
ID A B
1 300 400
1 500 600
2 800 900
I tried the from_json and explode functions, but I get a data type mismatch error for the array columns.
Real data image
In the image, the 1st dataframe is the input dataframe which I need to read and convert to the 2nd dataframe. 3 input rows need to be converted to 5 output rows.
There are 2 possible interpretations of the data type of your input column "input_array".
If it's a string...
df = spark.createDataFrame(
    [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
     (2, '[ {"A": 800, "B": 900} ]')],
    ['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: string (nullable = true)
...you can use from_json to extract a Spark structure from the JSON string and then inline to explode the resulting array of structs into columns.
df = df.selectExpr(
"ID",
"inline(from_json(input_array, 'array<struct<A:long,B:long>>'))"
)
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
If it's an array of strings...
df = spark.createDataFrame(
    [(1, [ '{"A":300, "B":400}', '{ "A":500,"B": 600}' ]),
     (2, [ '{"A": 800, "B": 900}' ])],
    ['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: array (nullable = true)
# | |-- element: string (containsNull = true)
...you can first use explode to move every array element into its own row, resulting in a column of string type, then use from_json to create Spark data types from the strings, and finally expand the structs into columns with .*.
from pyspark.sql import functions as F
df = df.withColumn('input_array', F.explode('input_array'))
df = df.withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))
df = df.select('ID', 'input_array.*')
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
You can remove the square brackets by using the regexp_replace or substring functions.
Then you can transform a string holding multiple JSON objects into an array by using the split function.
Then you can unwrap the array and make a new row for each element in the array by using the explode function.
Then you can handle the JSON column by using the from_json function, as sketched below.
Doc: pyspark.sql.functions
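Here is a minimal sketch of those steps, using the same sample strings as above (the regex used to split adjacent JSON objects is an assumption):
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
     (2, '[ {"A": 800, "B": 900} ]')],
    ['ID', 'input_array'])
result = (df
    .withColumn('no_brackets', F.regexp_replace('input_array', r'^\s*\[|\]\s*$', ''))  # 1. drop the outer [ ]
    .withColumn('json_strs', F.split('no_brackets', r'(?<=\})\s*,\s*(?=\{)'))          # 2. split between },{ into an array
    .withColumn('json_str', F.explode('json_strs'))                                    # 3. one row per JSON object string
    .withColumn('parsed', F.from_json('json_str', 'struct<A:long,B:long>'))            # 4. parse each object
    .select('ID', 'parsed.*'))
result.show()
# +---+---+---+
# | ID|  A|  B|
# +---+---+---+
# |  1|300|400|
# |  1|500|600|
# |  2|800|900|
# +---+---+---+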
If Input_array is a string, then you need to parse that string as JSON, explode it into rows, and expand the keys into columns. You can parse the array using the ArrayType data structure:
from pyspark.sql.types import *
from pyspark.sql import functions as F
data = [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]'),
        ('2', '[{"A": 800, "B": 900}]')]
my_schema = ArrayType(
    StructType([
        StructField('A', IntegerType()),
        StructField('B', IntegerType())
    ])
)
df = spark.createDataFrame(data, ['id', 'Input_array'])\
    .withColumn('Input_array', F.from_json('Input_array', my_schema))\
    .select("id", F.explode("Input_array").alias("Input_array"))\
    .select("id", F.col('Input_array.*'))
df.show(truncate=False)
# +---+---+---+
# |id |A |B |
# +---+---+---+
# |1 |300|400|
# |1 |500|600|
# |2 |800|900|
# +---+---+---+

Pyspark dataframe with json, iteration to create new dataframe

I have data with the following format:
customer_id  model
1            [{color: 'red', group: 'A'},{color: 'green', group: 'B'}]
2            [{color: 'red', group: 'A'}]
I need to process it so that I create a new dataframe with the following output:
customer_id  color  group
1            red    A
1            green  B
2            red    A
Now I can do this easily with python:
import pandas as pd
import json

newdf = pd.DataFrame([])
for index, row in df.iterrows():
    s = row['model']
    x = json.loads(s)
    colors_list = []
    users_list = []
    groups_list = []
    for i in range(len(x)):
        colors_list.append(x[i]['color'])
        users_list.append(row['user_id'])
        groups_list.append(x[i]['group'])
    newdf = newdf.append(pd.DataFrame({'customer_id': users_list, 'group': groups_list, 'color': colors_list}))
How can I achieve the same result with pyspark?
I'm showing the first rows and schema of original dataframe:
+-----------+--------------------+
|customer_id| model |
+-----------+--------------------+
| 3541|[{"score":0.04767...|
| 171811|[{"score":0.04473...|
| 12008|[{"score":0.08043...|
| 78964|[{"score":0.06669...|
| 119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows
root
|-- user_id: integer (nullable = true)
|-- groups: string (nullable = true)
from_json can parse a string column that contains JSON data:
from pyspark.sql import functions as F
from pyspark.sql import types as T
data = [[1, "[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]"],
        [2, "[{color: 'red', group: 'A'}]"]]
df = spark.createDataFrame(data, schema=["customer_id", "model"]) \
    .withColumn("model", F.from_json("model", T.ArrayType(T.MapType(T.StringType(), T.StringType())), {"allowUnquotedFieldNames": True})) \
    .withColumn("model", F.explode("model")) \
    .withColumn("color", F.col("model")["color"]) \
    .withColumn("group", F.col("model")["group"]) \
    .drop("model")
Result:
+-----------+-----+-----+
|customer_id|color|group|
+-----------+-----+-----+
| 1| red| A|
| 1|green| B|
| 2| red| A|
+-----------+-----+-----+

Convert DataFrame of JSON Strings

Is it possible to convert a DataFrame containing JSON strings to a DataFrame containing a typed representation of the JSON strings using Spark 2.4?
For example: given the definition below, I'd like to convert the single column in jsonDF using a schema that is inferred from the JSON string.
val jsonDF = spark.sparkContext.parallelize(Seq("""{"a": 1, "b": 2}""")).toDF
DataFrameReader can read JSON from string datasets, for example by using toDS instead of toDF:
val jsonDS = Seq("""{"a": 1, "b": 2}""").toDS
spark.read.json(jsonDS).show()
Output:
+---+---+
| a| b|
+---+---+
| 1| 2|
+---+---+
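For readers following along in PySpark, here is a sketch of the same idea (an assumption, not part of the original answer), using schema_of_json (Spark 2.4+) to infer the schema from one sample string and from_json to parse the column in place:
from pyspark.sql import functions as F
json_df = spark.createDataFrame([('{"a": 1, "b": 2}',)], ['value'])
sample = json_df.head()['value']                    # one sample row used to infer the schema
inferred = F.schema_of_json(F.lit(sample))
json_df.select(F.from_json('value', inferred).alias('parsed')).select('parsed.*').show()
# +---+---+
# |  a|  b|
# +---+---+
# |  1|  2|
# +---+---+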

How to read custom formatted dates as timestamp in pyspark

I want to use spark.read() to pull data from a .csv file, while enforcing a schema. However, I can't get spark to recognize my dates as timestamps.
First I create a dummy file to test with
%scala
Seq("1|1/15/2019 2:24:00 AM","2|test","3|").toDF().write.text("/tmp/input/csvDateReadTest")
Then I try to read it, and provide a dateFormat string, but it doesn't recognize my dates, and sends the records to the badRecordsPath
df = (spark.read.format('csv')
    .schema("id int, dt timestamp")
    .option("delimiter","|")
    .option("badRecordsPath","/tmp/badRecordsPath")
    .option("dateFormat","M/dd/yyyy hh:mm:ss aaa")
    .load("/tmp/input/csvDateReadTest"))
As a result, I get just 1 record in df (ID 3), when I'm expecting to see 2 (IDs 1 and 3).
df.show()
+---+----+
| id| dt|
+---+----+
| 3|null|
+---+----+
You must change the dateFormat option to timestampFormat, since in your case you need a timestamp type and not a date. Additionally, the timestamp format should be M/d/yyyy h:mm:ss a (in the Java date pattern, lowercase mm means minutes, so the month must use M or MM).
Sample data:
Seq(
"1|1/15/2019 2:24:00 AM",
"2|test",
"3|5/30/1981 3:11:00 PM"
).toDF().write.text("/tmp/input/csvDateReadTest")
With the changes for the timestamp:
val df = spark.read.format("csv")
  .schema("id int, dt timestamp")
  .option("delimiter","|")
  .option("badRecordsPath","/tmp/badRecordsPath")
  .option("timestampFormat","M/d/yyyy h:mm:ss a")
  .load("/tmp/input/csvDateReadTest")
And the output:
+----+-------------------+
| id| dt|
+----+-------------------+
| 1|2019-01-15 02:24:00|
| 3|1981-05-30 15:11:00|
|null| null|
+----+-------------------+
Note that the record with id 2 failed to comply with the schema definition and therefore it will contain null. If you also want to keep the invalid records, you need to change the timestamp column to string (see the sketch after the output below); the output in this case will be:
+---+--------------------+
| id| dt|
+---+--------------------+
| 1|1/15/2019 2:24:00 AM|
| 3|5/30/1981 3:11:00 PM|
| 2| test|
+---+--------------------+
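A PySpark sketch of that change (assuming the same sample file; the variable name df_str is illustrative): reading dt as a plain string means no record is treated as malformed.
df_str = (spark.read.format("csv")
    .schema("id int, dt string")   # dt stays a string, so nothing is rejected
    .option("delimiter", "|")
    .load("/tmp/input/csvDateReadTest"))
df_str.show(truncate=False)
# Output as shown in the table above.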
UPDATE:
In order to change the string dt into timestamp type you could try df.withColumn("dt", $"dt".cast("timestamp")), although the cast cannot parse this custom format and will replace all the values with null.
You can achieve this with the following code:
import org.apache.spark.sql.Row
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import java.sql.Timestamp
import scala.util.{Try, Success, Failure}

val formatter = new SimpleDateFormat("M/d/yyyy h:mm:ss a", Locale.US)

df.map{ case Row(id: Int, dt: String) =>
  val tryParse = Try[Date](formatter.parse(dt))
  val p_timestamp = tryParse match {
    case Success(parsed) => new Timestamp(parsed.getTime())
    case Failure(_) => null
  }
  (id, p_timestamp)
}.toDF("id", "dt").show
Output:
+---+-------------------+
| id| dt|
+---+-------------------+
| 1|2019-01-15 02:24:00|
| 3|1981-05-30 15:11:00|
| 2| null|
+---+-------------------+
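A lighter-weight PySpark alternative (an assumption, not part of the original answer) is to parse the string column of df_str from the sketch above with to_timestamp; values that don't match the pattern simply become null.
from pyspark.sql import functions as F
df_str.withColumn("dt", F.to_timestamp("dt", "M/d/yyyy h:mm:ss a")).show()
# Row order may vary:
# +---+-------------------+
# | id|                 dt|
# +---+-------------------+
# |  1|2019-01-15 02:24:00|
# |  2|               null|
# |  3|1981-05-30 15:11:00|
# +---+-------------------+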
Here is sample code using unix_timestamp and from_unixtime:
df.withColumn("times",
    from_unixtime(unix_timestamp(col("dt"), "M/d/yyyy h:mm:ss a"),
      "yyyy-MM-dd HH:mm:ss.SSSSSS"))
  .show(false)

pyspark convert row to json with nulls

Goal:
For a dataframe with schema
id:string
Cold:string
Medium:string
Hot:string
IsNull:string
annual_sales_c:string
average_check_c:string
credit_rating_c:string
cuisine_c:string
dayparts_c:string
location_name_c:string
market_category_c:string
market_segment_list_c:string
menu_items_c:string
msa_name_c:string
name:string
number_of_employees_c:string
number_of_rooms_c:string
Months In Role:integer
Tenured Status:string
IsCustomer:integer
units_c:string
years_in_business_c:string
medium_interactions_c:string
hot_interactions_c:string
cold_interactions_c:string
is_null_interactions_c:string
I want to add a new column that is a JSON string of all keys and values for the columns. I have used the approach in this post PySpark - Convert to JSON row by row and related questions.
My code
df = df.withColumn("JSON",func.to_json(func.struct([df[x] for x in small_df.columns])))
I am having one issue:
Issue:
When any row has a null value for a column (and my data has many...), the JSON string doesn't contain the key. I.e. if only 9 out of the 27 columns have values, then the JSON string only has 9 keys. What I would like to do is keep all the keys, but for the null values just pass an empty string "".
Any tips?
You should be able to just modify the answer on the question you linked using pyspark.sql.functions.when.
Consider the following example DataFrame:
data = [
    ('one', 1, 10),
    (None, 2, 20),
    ('three', None, 30),
    (None, None, 40)
]
sdf = spark.createDataFrame(data, ["A", "B", "C"])
sdf.printSchema()
#root
# |-- A: string (nullable = true)
# |-- B: long (nullable = true)
# |-- C: long (nullable = true)
Use when to implement if-then-else logic. Use the column if it is not null. Otherwise return an empty string.
from pyspark.sql.functions import col, to_json, struct, when, lit
sdf = sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [
                when(
                    col(x).isNotNull(),
                    col(x)
                ).otherwise(lit("")).alias(x)
                for x in sdf.columns
            ]
        )
    )
)
sdf.show()
#+-----+----+---+-----------------------------+
#|A |B |C |JSON |
#+-----+----+---+-----------------------------+
#|one |1 |10 |{"A":"one","B":"1","C":"10"} |
#|null |2 |20 |{"A":"","B":"2","C":"20"} |
#|three|null|30 |{"A":"three","B":"","C":"30"}|
#|null |null|40 |{"A":"","B":"","C":"40"} |
#+-----+----+---+-----------------------------+
Another option is to use pyspark.sql.functions.coalesce instead of when:
from pyspark.sql.functions import coalesce
sdf.withColumn(
    "JSON",
    to_json(
        struct(
            [coalesce(col(x), lit("")).alias(x) for x in sdf.columns]
        )
    )
).show(truncate=False)
## Same as above