I have data with the following format:
customer_id
model
1
[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]
2
[{color: 'red', group: 'A'}]
I need to process it so that I create a new dataframe with the following output:
customer_id
color
group
1
red
A
1
green
B
2
red
A
Now I can do this easily with python:
import pandas as pd
import json
newdf = pd.DataFrame([])
for index, row in df.iterrows():
s = row['model']
x = json.loads(s)
colors_list = []
users_list = []
groups_list = []
for i in range(len(x)):
colors_list.append(x[i]['color'])
users_list.append(row['user_id'])
groups_list.append(x[i]['group'])
newdf = newdf.append(pd.DataFrame({'customer_id': users_list, 'group': groups_list, 'color': colors_list}))
How can I achieve the same result with pyspark?
I'm showing the first rows and schema of original dataframe:
+-----------+--------------------+
|customer_id| model |
+-----------+--------------------+
| 3541|[{"score":0.04767...|
| 171811|[{"score":0.04473...|
| 12008|[{"score":0.08043...|
| 78964|[{"score":0.06669...|
| 119600|[{"score":0.06703...|
+-----------+--------------------+
only showing top 5 rows
root
|-- user_id: integer (nullable = true)
|-- groups: string (nullable = true)
from_json can parse a string column that contains Json data:
from pyspark.sql import functions as F
from pyspark.sql import types as T
data = [[1, "[{color: 'red', group: 'A'},{color: 'green', group: 'B'}]"],
[2, "[{color: 'red', group: 'A'}]"]]
df = spark.createDataFrame(data, schema = ["customer_id", "model"]) \
.withColumn("model", F.from_json("model", T.ArrayType(T.MapType(T.StringType(), T.StringType())), {"allowUnquotedFieldNames": True})) \
.withColumn("model", F.explode("model")) \
.withColumn("color", F.col("model")["color"]) \
.withColumn("group", F.col("model")["group"]) \
.drop("model")
Result:
+-----------+-----+-----+
|customer_id|color|group|
+-----------+-----+-----+
| 1| red| A|
| 1|green| B|
| 2| red| A|
+-----------+-----+-----+
Related
I have a dataframe which has 2 columns" "ID" and "input_array" (values are JSON arrays).
ID input_array
1 [ {“A”:300, “B”:400}, { “A”:500,”B”: 600} ]
2 [ {“A”: 800, “B”: 900} ]
Output that I need:
ID A B
1 300 400
1 500 600
2 800 900
I tried from_json, explode functions. But data type mismatch error is coming for array columns.
Real data image
In the image, the 1st dataframe is the input dataframe which I need to read and convert to the 2nd dataframe. 3 input rows needs to be converted to 5 output rows.
I have 2 interpretations of what input (column "input_array") data types you have.
If it's a string...
df = spark.createDataFrame(
[(1, '[ {"A":300, "B":400}, { "A":500,"B": 600} ]'),
(2, '[ {"A": 800, "B": 900} ]')],
['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: string (nullable = true)
...you can use from_json to extract Spark structure from JSON string and then inline to explode the resulting array of structs into columns.
df = df.selectExpr(
"ID",
"inline(from_json(input_array, 'array<struct<A:long,B:long>>'))"
)
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
If it's an array of strings...
df = spark.createDataFrame(
[(1, [ '{"A":300, "B":400}', '{ "A":500,"B": 600}' ]),
(2, [ '{"A": 800, "B": 900}' ])],
['ID', 'input_array'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- input_array: array (nullable = true)
# | |-- element: string (containsNull = true)
...you can first use explode to move every array's element into rows thus resulting in a column of string type, then use from_json to create Spark data types from the strings and finally expand * the structs into columns.
from pyspark.sql import functions as F
df = df.withColumn('input_array', F.explode('input_array'))
df = df.withColumn('input_array', F.from_json('input_array', 'struct<A:long,B:long>'))
df = df.select('ID', 'input_array.*')
df.show()
# +---+---+---+
# | ID| A| B|
# +---+---+---+
# | 1|300|400|
# | 1|500|600|
# | 2|800|900|
# +---+---+---+
You can remove square brackets by using regexp_replace or substring functions
Then you can transform strings with multiple jsons to an array by using split function
Then you can unwrap the array and make new row for each element in the array by using explode function
Then you can handle column with json by using from_json function
Doc: pyspark.sql.functions
If Input_array is string then you need to parse this string as a JSON and then explode it into rows and expand the keys to columns. You can parse the array as using ArrayType data structure:
from pyspark.sql.types import *
from pyspark.sql import functions as F
data = [('1', '[{"A":300, "B":400},{ "A":500,"B": 600}]')
,('2', '[{"A": 800, "B": 900}]')
]
my_schema = ArrayType(
StructType([
StructField('A', IntegerType()),
StructField('B', IntegerType())
])
)
df = spark.createDataFrame(data, ['id', 'Input_array'])\
.withColumn('Input_array', F.from_json('Input_array', my_schema))\
.select("id", F.explode("Input_array").alias("Input_array"))\
.select("id", F.col('Input_array.*'))
df.show(truncate=False)
# +---+---+---+
# |id |A |B |
# +---+---+---+
# |1 |300|400|
# |1 |500|600|
# |2 |800|900|
# +---+---+---+
I have a dataframe with a column of string datatype. The string represents an api request that returns a json.
df = spark.createDataFrame([
("[{original={ranking=1.0, input=top3}, response=[{to=Sam, position=guard}, {to=John, position=center}, {to=Andrew, position=forward}]}]",1)],
"col1:string, col2:int")
df.show()
Which generates a dataframe like:
+--------------------+----+
| col1|col2|
+--------------------+----+
|[{original={ranki...| 1|
+--------------------+----+
The output I would like to have col2 and have two additional columns from the response. Col3 would capture the player name, indicated by to= and col 4 would have their position indicated by position=. As well as the dataframe would now have three rows, since there's three players. Example:
+----+------+-------+
|col2| col3| col4|
+----+------+-------+
| 1| Sam| guard|
| 1| John| center|
| 1|Andrew|forward|
+----+------+-------+
I've read that I can leverage something like:
df.withColumn("col3",explode(from_json("col1")))
However, I'm not sure how to explode given I want two columns instead of one and need the schema.
Note, I can modify the response using json_dumps to return only the response piece of the string or...
[{to=Sam, position=guard}, {to=John, position=center}, {to=Andrew, position=forward}]}]
If you simplify the output like mentioned, you can define a simple JSON schema and convert JSON string into StructType and read each fields
Input
df = spark.createDataFrame([("[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]",1)], "col1:string, col2:int")
# +-----------------------------------------------------------------------------------------------------------------+----+
# |col1 |col2|
# +-----------------------------------------------------------------------------------------------------------------+----+
# |[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]|1 |
# +-----------------------------------------------------------------------------------------------------------------+----+
And this is the transformation
from pyspark.sql import functions as F
from pyspark.sql import types as T
schema = T.ArrayType(T.StructType([
T.StructField('to', T.StringType()),
T.StructField('position', T.StringType())
]))
(df
.withColumn('temp', F.explode(F.from_json('col1', schema=schema)))
.select(
F.col('col2'),
F.col('temp.to').alias('col3'),
F.col('temp.position').alias('col4'),
)
.show()
)
# Output
# +----+------+-------+
# |col2| col3| col4|
# +----+------+-------+
# | 1| Sam| guard|
# | 1| John| center|
# | 1|Andrew|forward|
# +----+------+-------+
I have json data like below where version field is the differentiator -
file_1 = {"version": 1, "stats": {"hits":20}}
file_2 = {"version": 2, "stats": [{"hour":1,"hits":10},{"hour":2,"hits":12}]}
In the new format, stats column is now Arraytype(StructType).
Earlier only file_1 was needed so I was using
spark.read.schema(schema_def_v1).json(path)
Now I need to read both these type of multiple json files which come together. I cannot define stats as string in schema_def as that would affect the corruptrecord feature(for stats column) which checks malformed json and schema compliance of all the fields.
Example df output required in 1 read only -
version | hour | hits
1 | null | 20
2 | 1 | 10
2 | 2 | 12
I have tried to read with mergeSchema option but that makes stats field String type.
Also, I have tried making two dataframes by filtering on the version field, and applying spark.read.schema(schema_def_v1).json(df_v1.toJSON). Here also stats column becomes String type.
I was thinking if while reading, I could parse the df column headers as stats_v1 and stats_v2 on basis of data-types can solve the problem. Please help with any possible solutions.
UDF to check string or array, if it is string it will convert string to an array.
import org.apache.spark.sql.functions.udf
import org.json4s.{DefaultFormats, JObject}
import org.json4s.jackson.JsonMethods.parse
import org.json4s.jackson.Serialization.write
import scala.util.{Failure, Success, Try}
object Parse {
implicit val formats = DefaultFormats
def toArray(data:String) = {
val json_data = (parse(data))
if(json_data.isInstanceOf[JObject]) write(List(json_data)) else data
}
}
val toJsonArray = udf(Parse.toArray _)
scala> "ls -ltr /tmp/data".!
total 16
-rw-r--r-- 1 srinivas root 37 Jun 26 17:49 file_1.json
-rw-r--r-- 1 srinivas root 69 Jun 26 17:49 file_2.json
res4: Int = 0
scala> val df = spark.read.json("/tmp/data").select("stats","version")
df: org.apache.spark.sql.DataFrame = [stats: string, version: bigint]
scala> df.printSchema
root
|-- stats: string (nullable = true)
|-- version: long (nullable = true)
scala> df.show(false)
+-------+-------------------------------------------+
|version|stats |
+-------+-------------------------------------------+
|1 |{"hits":20} |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|
+-------+-------------------------------------------+
Output
scala>
import org.apache.spark.sql.types._
val schema = ArrayType(MapType(StringType,IntegerType))
df
.withColumn("json_stats",explode(from_json(toJsonArray($"stats"),schema)))
.select(
$"version",
$"stats",
$"json_stats".getItem("hour").as("hour"),
$"json_stats".getItem("hits").as("hits")
).show(false)
+-------+-------------------------------------------+----+----+
|version|stats |hour|hits|
+-------+-------------------------------------------+----+----+
|1 |{"hits":20} |null|20 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|1 |10 |
|2 |[{"hour":1,"hits":10},{"hour":2,"hits":12}]|2 |12 |
+-------+-------------------------------------------+----+----+
Without UDF
scala> val schema = ArrayType(MapType(StringType,IntegerType))
scala> val expr = when(!$"stats".contains("[{"),concat(lit("["),$"stats",lit("]"))).otherwise($"stats")
df
.withColumn("stats",expr)
.withColumn("stats",explode(from_json($"stats",schema)))
.select(
$"version",
$"stats",
$"stats".getItem("hour").as("hour"),
$"stats".getItem("hits").as("hits")
)
.show(false)
+-------+-----------------------+----+----+
|version|stats |hour|hits|
+-------+-----------------------+----+----+
|1 |[hits -> 20] |null|20 |
|2 |[hour -> 1, hits -> 10]|1 |10 |
|2 |[hour -> 2, hits -> 12]|2 |12 |
+-------+-----------------------+----+----+
Read the second file first, explode stats, use schema to read first file.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
file_1 = {"version": 1, "stats": {"hits": 20}}
file_2 = {"version": 2, "stats": [{"hour": 1, "hits": 10}, {"hour": 2, "hits": 12}]}
df1 = spark.read.json(sc.parallelize([file_2])).withColumn('stats', explode('stats'))
schema = df1.schema
spark.read.schema(schema).json(sc.parallelize([file_1])).printSchema()
output >> root
|-- stats: struct (nullable = true)
| |-- hits: long (nullable = true)
| |-- hour: long (nullable = true)
|-- version: long (nullable = true)
IIUC, you can read the JSON files using spark.read.text and then parse the value with json_tuple, from_json. notice for stats field we use coalesce to parse fields based on two or more schema. (add wholetext=True as an argument of spark.read.text if each file contains a single JSON document cross multiple lines)
from pyspark.sql.functions import json_tuple, coalesce, from_json, array
df = spark.read.text("/path/to/all/jsons/")
schema_1 = "array<struct<hour:int,hits:int>>"
schema_2 = "struct<hour:int,hits:int>"
df.select(json_tuple('value', 'version', 'stats').alias('version', 'stats')) \
.withColumn('status', coalesce(from_json('stats', schema_1), array(from_json('stats', schema_2)))) \
.selectExpr('version', 'inline_outer(status)') \
.show()
+-------+----+----+
|version|hour|hits|
+-------+----+----+
| 2| 1| 10|
| 2| 2| 12|
| 1|null| 20|
+-------+----+----+
I want to convert an Spark DataFrame into Json file. Below is the input and output format.
Any help is appreciated.
Input :
+-------------------------+
|Name|Age|City |Data |
+-------------------------+
|Ram |30 |Delhi|[A -> ABC]|
|-------------------------|
|Shan|25 |Delhi|[X -> XYZ]|
|-------------------------|
|Riya|12 |U.P. |[M -> MNO]|
+-------------------------+
Output :
{"Name":"Ram","Age":"30","City":"Delhi","Delhi":{"A":"ABC"}}
{"Name":"Shan","Age":"25","City":"Delhi","Delhi":{"X":"XYZ"}}
{"Name":"Riya","Age":"12","City":"U.P.","U.P.":{"M":"MNO"}}
Scala: Starting from your data,
val df = Seq(("Ram",30,"Delhi",Map("A" -> "ABC")), ("Shan",25,"Delhi",Map("X" -> "XYZ")), ("Riya",12,"U.P.",Map("M" -> "MNO"))).toDF("Name", "Age", "City", "Data")
df.show
// +----+---+-----+----------+
// |Name|Age| City| Data|
// +----+---+-----+----------+
// | Ram| 30|Delhi|[A -> ABC]|
// |Shan| 25|Delhi|[X -> XYZ]|
// |Riya| 12| U.P.|[M -> MNO]|
// +----+---+-----+----------+
To change the key as City not Data,
val df2 = df.groupBy("Name", "Age", "City").pivot("City").agg(first("Data"))
df2.show
// +----+---+-----+----------+----------+
// |Name|Age| City| Delhi| U.P.|
// +----+---+-----+----------+----------+
// |Riya| 12| U.P.| null|[M -> MNO]|
// |Shan| 25|Delhi|[X -> XYZ]| null|
// | Ram| 30|Delhi|[A -> ABC]| null|
// +----+---+-----+----------+----------+
And make it by using toJson and collect.
val jsonArray = df.toJSON.collect
jsonArray.foreach(println)
It will print the result such as:
{"Name":"Riya","Age":12,"City":"U.P.","U.P.":{"M":"MNO"}}
{"Name":"Shan","Age":25,"City":"Delhi","Delhi":{"X":"XYZ"}}
{"Name":"Ram","Age":30,"City":"Delhi","Delhi":{"A":"ABC"}}
You can call write.json on DataFrame.
val df: DataFrame = ....
df.write.json("/jsonFilPath")
Here is a an example using Datasets
scala> case class Data(key: String, value: String)
scala> case class Person(name: String, age: Long, city: String, data: Data)
scala> val peopleDS = Seq(Person("Ram", 30, "Delhi", Data("A", "ABC")), Person("Shan", 25, "Delhi", Data("X", "XYZ")), Person("Riya", 12, "U.P", Data("M", "MNO"))).toDS()
scala> peopleDS.show()
+----+---+-----+--------+
|name|age| city| data|
+----+---+-----+--------+
| Ram| 30|Delhi|[A, ABC]|
|Shan| 25|Delhi|[X, XYZ]|
|Riya| 12| U.P|[M, MNO]|
+----+---+-----+--------+
scala> peopleDS.write.json("pathToData/people")
Then you would find written json files in the given folder.
> cd pathToData/people
> ls -l
part-00000-6bd00826-5a8e-4ab9-bfb0-65d722394108-c000.json
> cat part-00000-6bd00826-5a8e-4ab9-bfb0-65d722394108-c000.json
{"name":"Ram","age":30,"city":"Delhi","data":{"key":"A","value":"ABC"}}
{"name":"Shan","age":25,"city":"Delhi","data":{"key":"X","value":"XYZ"}}
{"name":"Riya","age":12,"city":"U.P","data":{"key":"M","value":"MNO"}}
I was trying to flatten the very nested JSON, and create spark dataframe and the ultimate goal is to push the given dataframe to phoenix. I am successfully able to flatten the JSON using code.
def recurs(df: DataFrame): DataFrame = {
if(df.schema.fields.find(_.dataType match {
case ArrayType(StructType(_),_) | StructType(_) => true
case _ => false
}).isEmpty) df
else {
val columns = df.schema.fields.map(f => f.dataType match {
case _: ArrayType => explode(col(f.name)).as(f.name)
case s: StructType => col(s"${f.name}.*")
case _ => col(f.name)
})
recurs(df.select(columns:_*))
}
}
val df = spark.read.json(json_location)
flatten_df = recurs(df)
flatten_df.show()
My nested json is something like:
{
"Total Value": 3,
"Topic": "Example",
"values": [
{
"value": "#example1",
"points": [
[
"123",
"156"
]
],
"properties": {
"date": "12-04-19",
"value": "Model example 1"
}
},
{"value": "#example2",
"points": [
[
"124",
"157"
]
],
"properties": {
"date": "12-05-19",
"value": "Model example 2"
}
}
]
}
The output I am getting:
+-----------+-----------+----------+-------------+------------------------+------------------------+
|Total Value| Topic |value | points | date | value |
+-----------+-----------+----------+-------------+------------------------+------------------------+
| 3 | Example | example1 | [123,156] | 12-04-19 | Model example 1 |
| 3 | Example | example2 | [124,157] | 12-05-19 | Model example 2 |
+-----------+-----------+----------+-------------+------------------------+------------------------+
So, value key is found 2 times in json so it is creating 2 column name but this is an error and not allowed in Phoenix to ingest this data.
The error message is:
ERROR 514 (42892): A duplicate column name was detected in the object definition or ALTER TABLE/VIEW statement
I am expecting this output so that phoenix could differentiate the columns.
+-----------+-----------+--------------+---------------+------------------------+------------------------+
|Total Value| Topic |values.value | values.points | values.properties.date | values.properties.value| |
+-----------+-----------+--------------+---------------+------------------------+------------------------+
| 3 | Example | example1 | [123,156] | 12-04-19 | Model example 1 |
| 3 | Example | example2 | [124,157] | 12-05-19 | Model example 2 |
+-----------+-----------+--------------+---------------+------------------------+------------------------+
In this way phoenix can ingest the data perfectly, please suggest any changes in flattening code or any help to achieve the same. Thanks
You need slight changes to the recurs method:
Dealing with ArrayType(st: StructType, _) instead of ArrayType.
Avoid using *, and name every field in the second match (StructType).
Use backticks at the right places to rename the fields, keeping precedence naming.
Here's some code:
def recurs(df: DataFrame): DataFrame = {
if(!df.schema.fields.exists(_.dataType match {
case ArrayType(StructType(_),_) | StructType(_) => true
case _ => false
})) df
else {
val columns = df.schema.fields.flatMap(f => f.dataType match {
case ArrayType(st: StructType, _) => Seq(explode(col(f.name)).as(f.name))
case s: StructType =>
s.fieldNames.map{sf => col(s"`${f.name}`.$sf").as(s"${f.name}.$sf")}
case _ => Seq(col(s"`${f.name}`"))
})
recurs(df.select(columns:_*))
}
}
val newDF = recurs(df).cache
newDF.show(false)
newDF.printSchema
And the new output:
+-------+-----------+-------------+----------------------+-----------------------+------------+
|Topic |Total Value|values.points|values.properties.date|values.properties.value|values.value|
+-------+-----------+-------------+----------------------+-----------------------+------------+
|Example|3 |[[123, 156]] |12-04-19 |Model example 1 |#example1 |
|Example|3 |[[124, 157]] |12-05-19 |Model example 2 |#example2 |
+-------+-----------+-------------+----------------------+-----------------------+------------+
root
|-- Topic: string (nullable = true)
|-- Total Value: long (nullable = true)
|-- values.points: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- values.properties.date: string (nullable = true)
|-- values.properties.value: string (nullable = true)
|-- values.value: string (nullable = true)