PySpark: automatically extract fields inside JSON to top-level columns

I have a json like below:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":29,"city":"New York"}
{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}
The following pyspark code:
sc = spark.sparkContext
peopleDF = spark.read.json("people.json")
peopleDF.createOrReplaceTempView("people")
tableDF = spark.sql("SELECT * from people")
tableDF.show()
Produces this output:
+----+--------+---------+-------+
| age| city| data| name|
+----+--------+---------+-------+
|null| null| null|Michael|
| 30| null| null| Andy|
| 19| null| null| Justin|
| 29|New York| null| Bob|
| 49| null|{Test, 1}| Ross|
+----+--------+---------+-------+
But I'm looking for output like the one below (notice how the elements inside data have become columns):
+----+--------+----+----+-------+
| age| city| id|Name| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null| 1|Test| Ross|
+----+--------+----+----+-------+
The fields inside the data struct change constantly, so I cannot pre-define the columns. Is there a function in PySpark that can automatically extract every element of a struct into its own top-level column? (It's okay if the performance is slow.)

You can use the "." operator to access nested elements and flatten your schema. The examples below are in Scala; the same select on dotted column names works in PySpark.
import spark.implicits._
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS).select("age", "city", "data.Name", "data.id", "name")
df.show()
+----+--------+----+----+-------+
| age| city|Name| id| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null|Test| 1| Ross|
+----+--------+----+----+-------+
If you want to flatten schema without selecting columns manually, you can use the following method to do it:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName))
    }
  })
}
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS)
df.select(flattenSchema(df.schema):_*).show()
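Since the question asks for PySpark, here is a minimal Python sketch of the same recursive idea. It is not the answer's code: it assumes the schema is passed in the dict form returned by df.schema.jsonValue(), and the function name flatten_field_paths and the hand-written people_schema dict are illustrative stand-ins so the example runs without a Spark session.

```python
def flatten_field_paths(schema_json, prefix=None):
    """Return the dotted paths of all leaf fields in a struct schema (dict form)."""
    paths = []
    for field in schema_json["fields"]:
        name = field["name"] if prefix is None else prefix + "." + field["name"]
        dtype = field["type"]
        # Nested structs appear as dicts with "type": "struct"; anything else is a leaf.
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            paths.extend(flatten_field_paths(dtype, name))
        else:
            paths.append(name)
    return paths


# Hand-written stand-in (an assumption) for peopleDF.schema.jsonValue() on the sample data.
people_schema = {
    "type": "struct",
    "fields": [
        {"name": "age", "type": "long", "nullable": True, "metadata": {}},
        {"name": "city", "type": "string", "nullable": True, "metadata": {}},
        {
            "name": "data",
            "type": {
                "type": "struct",
                "fields": [
                    {"name": "Name", "type": "string", "nullable": True, "metadata": {}},
                    {"name": "id", "type": "long", "nullable": True, "metadata": {}},
                ],
            },
            "nullable": True,
            "metadata": {},
        },
        {"name": "name", "type": "string", "nullable": True, "metadata": {}},
    ],
}

print(flatten_field_paths(people_schema))
```

With a live session, the returned paths could be passed straight to select, for example peopleDF.select(*flatten_field_paths(peopleDF.schema.jsonValue())). Two caveats of this sketch: leaf fields with the same name in different structs would collide after flattening, and field names that themselves contain dots would need backtick quoting.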

How to generate stale flag using analytic functions Pyspark?

What is an efficient way to create the output table below with a minimal number of joins, using analytic functions?
expiry_date - date until which the answer is valid.
stale_ans_flag - flag on the child answer showing that it predates the parent answer.
Input tables:
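Question:
+-----------+------------------+-------------------------+
|question_id|parent_question_id|question                 |
+-----------+------------------+-------------------------+
|1          |                  |Are you living in the US?|
|1A         |1                 |What is your state       |
+-----------+------------------+-------------------------+
Answer:
+-----------+------------+------+
|question_id|date        |answer|
+-----------+------------+------+
|1          |01-sept-2022|yes   |
|1A         |01-sept-2022|NY    |
|1          |05-sept-2022|yes   |
+-----------+------------+------+
Expected output table:
+-----------+------------------+-------------------------+------+------------+------------+--------------+
|question_id|parent_question_id|question                 |answer|date        |expiry_date |stale_ans_flag|
+-----------+------------------+-------------------------+------+------------+------------+--------------+
|1          |                  |Are you living in the US?|Yes   |01-sept-2022|05-sept-2022|Y             |
|1A         |1                 |What is your state       |NY    |01-sept-2022|'NULL'      |N             |
|1          |                  |Are you living in the US?|Yes   |05-sept-2022|'NULL'      |N             |
+-----------+------------------+-------------------------+------+------------+------------+--------------+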
A standard SCD2 (slowly changing dimension, type 2) update can be done as follows:
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType
import pyspark.sql.functions as f
# reuse the active session, or create one when running as a standalone script
spark = SparkSession.builder.getOrCreate()
# create some data, I added a user column to also add new entries
data_1 = [
("alex","1","2022-09-01","yes"),
("alex","1A","2022-09-01","NY"),
]
schema = ["user","id","date","answer"]
df_1 = spark.createDataFrame(data=data_1, schema = schema)
df_1.show(truncate=False)
+----+---+----------+------+
|user|id |date |answer|
+----+---+----------+------+
|alex|1 |2022-09-01|yes |
|alex|1A |2022-09-01|NY |
+----+---+----------+------+
data_2 = [
("alex","1","2022-09-05","no"),
("john","1","2022-09-05","yes"),
("john","1A","2022-09-05","VA")
]
schema = ["user","id","date","answer"]
df_2 = spark.createDataFrame(data=data_2, schema = schema)
df_2.show(truncate=False)
+----+---+----------+------+
|user|id |date |answer|
+----+---+----------+------+
|alex|1 |2022-09-05|no |
|john|1 |2022-09-05|yes |
|john|1A |2022-09-05|VA |
+----+---+----------+------+
# create the initial dataset of the "old" data with the two additional columns
df_old=(df_1
.withColumn("expiry_date",f.lit(None).cast(DateType()))
.withColumn("stale_ans_flag",f.lit(False))
)
df_old.show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-01| yes| null| false|
|alex| 1A|2022-09-01| NY| null| false|
+----+---+----------+------+-----------+--------------+
# create a new column on the new dataset that will update the old one
df_new=(df_2
.withColumn("expiry_date",f.lit(None).cast(DateType()))
.withColumn("stale_ans_flag",f.lit(False))
)
df_new.show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-05| no| null| false|
|john| 1|2022-09-05| yes| null| false|
|john| 1A|2022-09-05| VA| null| false|
+----+---+----------+------+-----------+--------------+
# now join both together on the relevant keys and add a helper column
df_merge=(df_old.alias("old")
# a full outer join makes it easy to compare new and old and create manual update strategies
.join(df_new.alias("new"),
(df_old.user == df_new.user) &
(df_old.id == df_new.id),
how='fullouter'
)
# helper column; you can do without it, and due to lazy evaluation it makes no difference, but it helps when reading and debugging the code
.withColumn("_action",f.when(
# identify new records that have to be inserted
(f.col("old.user").isNull())
& (f.col("old.id").isNull())
& (f.col("new.user").isNotNull())
& (f.col("new.id").isNotNull())
, "new"
).when(
# identify old records that are not changed
(f.col("old.user").isNotNull())
& (f.col("old.id").isNotNull())
& (f.col("new.user").isNull())
& (f.col("new.id").isNull())
, "old"
).when(
# identify update records
(f.col("old.user").isNotNull())
& (f.col("old.id").isNotNull())
& (f.col("new.user").isNotNull())
& (f.col("new.id").isNotNull())
, "update"
)
)
)
df_merge.show()
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
|user| id| date|answer|expiry_date|stale_ans_flag|user| id| date|answer|expiry_date|stale_ans_flag|_action|
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
|null|null| null| null| null| null|john| 1A|2022-09-05| VA| null| false| new|
|alex| 1|2022-09-01| yes| null| false|alex| 1|2022-09-05| no| null| false| update|
|alex| 1A|2022-09-01| NY| null| false|null|null| null| null| null| null| old|
|null|null| null| null| null| null|john| 1|2022-09-05| yes| null| false| new|
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
# finally put all together with a union
df_merge_union=( df_merge
#first take the old ones of the dataset that did not change and are kept as they are
.filter(f.col("_action")=="old")
.select("old.user",
"old.id",
"old.date",
"old.answer",
"old.expiry_date",
"old.stale_ans_flag"
)
# then add the brand new ones that did not exist before
.union(df_merge
.filter(f.col("_action")=="new")
.select("new.user",
"new.id",
"new.date",
"new.answer",
"new.expiry_date",
"new.stale_ans_flag"
)
)
# add the old row with the old values that has been updated and change the flag
.union(df_merge
.filter(f.col("_action")=="update")
.select("old.user",
"old.id",
"old.date",
"old.answer",
"new.date", # or f.current_date(),
f.lit(True)
)
)
# add the new row that carries the updated values
.union(df_merge
.filter(f.col("_action")=="update")
.select("new.user",
"new.id",
"new.date",
"new.answer",
f.lit(None),
f.lit(False)
)
)
)
df_merge_union.sort("user","id").show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-01| yes| 2022-09-05| true|
|alex| 1|2022-09-05| no| null| false|
|alex| 1A|2022-09-01| NY| null| false|
|john| 1|2022-09-05| yes| null| false|
|john| 1A|2022-09-05| VA| null| false|
+----+---+----------+------+-----------+--------------+
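The question asked about analytic functions, while the walkthrough above uses a full outer join. As a rough sketch of the analytic idea only, the plain-Python code below mimics what a "next row in the same group" (lead) window would compute: for each (user, id) group ordered by date, expiry_date is the date of the following answer, and the flag is true when such a later answer exists. The function name add_expiry and the combined answers list are assumptions made for illustration, and the flag follows the output of the walkthrough above rather than the parent/child definition in the question.

```python
from itertools import groupby

# Combined rows of df_1 and df_2 from the walkthrough: (user, id, date, answer).
answers = [
    ("alex", "1", "2022-09-01", "yes"),
    ("alex", "1A", "2022-09-01", "NY"),
    ("alex", "1", "2022-09-05", "no"),
    ("john", "1", "2022-09-05", "yes"),
    ("john", "1A", "2022-09-05", "VA"),
]


def add_expiry(rows):
    """Append (expiry_date, stale_ans_flag) to each row, lead-style."""
    out = []
    # ISO dates sort correctly as strings, so a plain sort orders each group by date.
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    for _, group in groupby(ordered, key=lambda r: (r[0], r[1])):
        group = list(group)
        # Pair every row with the next row of its group (None for the last one).
        for current, nxt in zip(group, group[1:] + [None]):
            expiry = nxt[2] if nxt is not None else None
            out.append(current + (expiry, expiry is not None))
    return out


for row in add_expiry(answers):
    print(row)
```

In PySpark the same pairing would typically be expressed with f.lead("date").over(Window.partitionBy("user", "id").orderBy("date")) on the union of old and new rows, which avoids the join entirely; treat that as a direction to explore rather than a drop-in replacement for the code above.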

Reading JSON into spark dataframe

I am learning Spark and I was building a sample project. I have a Spark DataFrame with the content shown below. This is what it looks like when I save the DataFrame to a JSON file:
{"str":["1001","19035004":{"Name":"Chris","Age":"29","Location":"USA"}]}
{"str":["1002","19035005":{"Name":"John","Age":"20","Location":"France"}]}
{"str":["1003","19035006":{"Name":"Mark","Age":"30","Location":"UK"}]}
{"str":["1004","19035007":{"Name":"Mary","Age":"22","Location":"UK"}]}
JSONInput.show() gave me something like the below:
+---------------------------------------------------------------------+
|str |
+---------------------------------------------------------------------+
|[1001,{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}] |
|[1002,{"19035005":{"Name":"John","Age":"20","Location":"France"}}] |
|[1003,{"19035006":{"Name":"Mark","Age":"30","Location":"UK"}}] |
|[1004,{"19035007":{"Name":"Mary","Age":"22","Location":"UK"}}] |
+---------------------------------------------------------------------+
I know this is not valid JSON, but this is what I have.
How can I first get this into a relational structure like the one below? (I am pretty new to JSON and Spark, so this step is not mandatory.)
Name Age Location
-----------------------
Chris 29 USA
John 20 France
Mark 30 UK
Mary 22 UK
And I want to filter for the specific country:
val resultToReturn = JSONInput.filter("Location=USA")
But this results in the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve Location given input columns: [str]; line 1 pos 0;
How do I get rid of "str" and make the data in a proper JSON structure? Can anyone help?
You can use from_json to parse the string values:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val schema = MapType(StringType,
StructType(Array(
StructField("Name", StringType, true),
StructField("Age", StringType, true),
StructField("Location", StringType, true)
)
), true
)
val resultToReturn = JSONInput.select(
explode(from_json(col("str")(1), schema))
).select("value.*")
resultToReturn.show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29| USA|
//| John| 20| France|
//| Mark| 30| UK|
//| Mary| 22| UK|
//+-----+---+--------+
Then you can filter:
resultToReturn.filter("Location = 'USA'").show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29| USA|
//+-----+---+--------+
You can extract the innermost JSON using regexp_extract and parse that JSON using from_json. Then you can star-expand the extracted JSON struct.
val parsed_df = JSONInput.selectExpr("""
from_json(
regexp_extract(str[1], '(\\{[^{}]+\\})', 1),
'Name string, Age string, Location string'
) as parsed
""").select("parsed.*")
parsed_df.show(false)
+-----+---+--------+
|Name |Age|Location|
+-----+---+--------+
|Chris|29 |USA |
|John |20 |France |
|Mark |30 |UK |
|Mary |22 |UK |
+-----+---+--------+
You can then filter it using:
val filtered = parsed_df.filter("Location = 'USA'")
P.S. Remember to put single quotes around USA; without them, Spark treats USA as a column name.
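To make the regex step concrete, here is a small plain-Python sketch of the same technique (extract the innermost {...} object, then parse it as JSON). It uses Python's re and json modules rather than Spark, and the rows list is a hand-copied stand-in for the second element of the str column shown in the question; parse_inner is a hypothetical helper name.

```python
import json
import re

# Second element of the "str" array for each row, copied from the question.
rows = [
    '{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}',
    '{"19035005":{"Name":"John","Age":"20","Location":"France"}}',
    '{"19035006":{"Name":"Mark","Age":"30","Location":"UK"}}',
    '{"19035007":{"Name":"Mary","Age":"22","Location":"UK"}}',
]

# Matches a brace pair with no braces inside it, i.e. the innermost object.
INNER_OBJECT = re.compile(r"(\{[^{}]+\})")


def parse_inner(json_text):
    """Return the innermost JSON object as a dict, or None if nothing matches."""
    match = INNER_OBJECT.search(json_text)
    return json.loads(match.group(1)) if match else None


people = [parse_inner(text) for text in rows]
usa = [p for p in people if p is not None and p["Location"] == "USA"]
print(usa)
```

The pattern only works because the innermost object contains no nested braces; if the person records ever gained nested objects, the from_json approach with an explicit map schema would be the safer choice.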

Adding a unique consecutive row number to a DataFrame in PySpark

I want to add a unique row number to my DataFrame in PySpark, and I don't want to use the monotonicallyIncreasingId or partitionBy methods.
This question might be a duplicate of similar questions asked earlier, but I am still looking for advice on whether I am doing it the right way.
The following is a snippet of my code.
I have a CSV file with the input records below:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
and I have loaded this csv file into a dataframe.
PATH_TO_FILE="file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD=empRDD.map(lambda x: (list(x[0]) + [x[1]]))
newRDD.take(2);
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I added the int value to my list, I lost the DataFrame schema.
newdf=newRDD.toDF(['emp_id','emp_name','emp_city','emp_salary','row_id'])
newdf.show();
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing it the right way, or is there a better way to add the column while preserving the schema of the DataFrame in PySpark?
Is it feasible to use the zipWithIndex method to add a unique consecutive row number to a large DataFrame as well? Can we use this row_id to repartition the DataFrame so that the data is distributed uniformly across the partitions?
I have found a solution and it's very simple.
Since no column in my DataFrame has the same value across all rows, row_number with a partitionBy clause on an existing column does not generate row numbers that are unique across the whole DataFrame.
Let's add a new column with a constant value to the existing DataFrame:
from pyspark.sql.functions import lit, row_number
from pyspark.sql.window import Window
emp_df = emp_df.withColumn("new_column", lit("ABC"))
Then create a window that partitions by that column, "new_column". Because every row shares the same value, all rows end up in a single partition, which can be slow for a large DataFrame:
w = Window.partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
You will get the desired result:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+

How to use a JSON mapping file to generate a new DataFrame in Spark using Scala

I have two DataFrames, DF1 and DF2, and a JSON file which I need to use as a mapping file to create another dataframe (DF3).
DF1:
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| 100| John| Mumbai|
| 101| Alex| Delhi|
| 104| Divas|Kolkata|
| 108| Jerry|Chennai|
+-------+-------+-------+
DF2:
+-------+-----------+-------+
|column4| column5|column6|
+-------+-----------+-------+
| S1| New| xxx|
| S2| Old| yyy|
| S5|replacement| zzz|
| S10| New| ppp|
+-------+-----------+-------+
In addition, I have a mapping file in JSON format that will be used to generate DF3.
Below is the JSON mapping file:
{"targetColumn":"newColumn1","sourceField1":"column2","sourceField2":"column4"}
{"targetColumn":"newColumn2","sourceField1":"column7","sourceField2":"column5"}
{"targetColumn":"newColumn3","sourceField1":"column8","sourceField2":"column6"}
From this JSON file I need to create DF3, with one column for each targetColumn in the mapping. For each entry, if the sourceField1 column is present in DF1, the target column is taken from that DF1 column; otherwise it is taken from the sourceField2 column of DF2.
Below is the expected output.
+----------+-----------+----------+
|newColumn1| newColumn2|newColumn3|
+----------+-----------+----------+
| John| New| xxx|
| Alex| Old| yyy|
| Divas|replacement| zzz|
| Jerry| New| ppp|
+----------+-----------+----------+
Any help here will be appreciated.
Parse the JSON and create the list of custom objects below:
case class SrcTgtMapping(targetColumn:String,sourceField1:String,sourceField2:String)
val srcTgtMappingList=List(SrcTgtMapping("newColumn1","column2","column4"),SrcTgtMapping("newColumn2","column7","column5"),SrcTgtMapping("newColumn3","column8","column6"))
Add a dummy index column to both DataFrames and join them on that column. Note that the generated IDs depend on how each DataFrame is partitioned, so the rows only line up if both DataFrames have the same partition layout:
import org.apache.spark.sql.functions._
val df1WithIndex=df1.withColumn("index",monotonically_increasing_id())
val df2WithIndex=df2.withColumn("index",monotonically_increasing_id())
val joinedDf=df1WithIndex.join(df2WithIndex,df1WithIndex.col("index")===df2WithIndex.col("index"))
Create the query and execute it.
val df1Columns=df1WithIndex.columns.toList
val df2Columns=df2WithIndex.columns.toList
val query=srcTgtMappingList.map(stm=>if(df1Columns.contains(stm.sourceField1)) joinedDf.col(stm.sourceField1).alias(stm.targetColumn) else joinedDf.col(stm.sourceField2).alias(stm.targetColumn))
val output=joinedDf.select(query:_*)
output.show
Sample Output:
+----------+-----------+----------+
|newColumn1| newColumn2|newColumn3|
+----------+-----------+----------+
| John| New| xxx|
| Alex| Old| yyy|
| Jerry| New| ppp|
| Divas|replacement| zzz|
+----------+-----------+----------+
Hope this approach will help you
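The answer hard-codes the mapping list instead of reading it from the file. As a hedged illustration of the "parse the JSON" step and of the column-selection rule, here is a plain-Python sketch: it reads the mapping as JSON lines and picks sourceField1 when DF1 has that column, otherwise sourceField2. The names resolve_sources and mapping_text are made up for this example, and the DF1 column list is typed in by hand rather than taken from a real DataFrame.

```python
import json

# The mapping file from the question, one JSON object per line.
mapping_text = """\
{"targetColumn":"newColumn1","sourceField1":"column2","sourceField2":"column4"}
{"targetColumn":"newColumn2","sourceField1":"column7","sourceField2":"column5"}
{"targetColumn":"newColumn3","sourceField1":"column8","sourceField2":"column6"}
"""


def resolve_sources(text, df1_columns):
    """Return (source_column, target_column) pairs following the mapping rule."""
    resolved = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        # Prefer sourceField1 when DF1 has it; otherwise fall back to sourceField2.
        if entry["sourceField1"] in df1_columns:
            source = entry["sourceField1"]
        else:
            source = entry["sourceField2"]
        resolved.append((source, entry["targetColumn"]))
    return resolved


pairs = resolve_sources(mapping_text, ["column1", "column2", "column3"])
print(pairs)
```

Each resulting pair corresponds to one joinedDf.col(source).alias(target) expression in the Scala query above. The sketch does not check that the fallback column actually exists in DF2, which a production version would probably want to do.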

Structured streaming load json convert to column output is null

My JSON data looks like {reId: "1",ratingFlowId: "1001",workFlowId:"1"} and I use the following program:
import org.apache.spark.sql.{Encoders, SparkSession}
case class CdrData(reId: String, ratingFlowId: String, workFlowId: String)
object StructuredHdfsJson {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("StructuredHdfsJson")
.master("local")
.getOrCreate()
val schema = Encoders.product[CdrData].schema
val lines = spark.readStream
.format("json")
.schema(schema)
.load("hdfs://iotsparkmaster:9000/json")
val query = lines.writeStream
.outputMode("update")
.format("console")
.start()
query.awaitTermination()
}
}
But the output is all null, as follows:
-------------------------------------------
Batch: 0
-------------------------------------------
+----+------------+----------+
|reId|ratingFlowId|workFlowId|
+----+------------+----------+
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
+----+------------+----------+
Spark probably cannot parse your JSON. The issue may be related to spaces (or other unexpected characters) inside the JSON. Try cleaning your data and running the application again.
Edit after comment (for future readers):
Keys should be enclosed in quotation marks.
Edit 2:
According to the JSON specification, keys are strings, and every string must be enclosed in double quotation marks. Spark uses the Jackson parser to convert JSON strings to objects.
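A quick way to check the point about quoted keys outside Spark is to run a sample record through any strict JSON parser. The sketch below uses Python's json module as a stand-in (it is not the Jackson parser Spark uses, but it applies the same rule from the JSON specification); is_valid_json is a hypothetical helper name.

```python
import json


def is_valid_json(text):
    """Return True if text parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Record as given in the question: keys are not quoted.
unquoted = '{reId: "1",ratingFlowId: "1001",workFlowId:"1"}'
# Same record with keys in double quotes.
quoted = '{"reId": "1", "ratingFlowId": "1001", "workFlowId": "1"}'

print(is_valid_json(unquoted))
print(is_valid_json(quoted))
```

If the input files cannot be fixed at the source, Spark's JSON data source also documents an allowUnquotedFieldNames option; whether it is sufficient for this particular data is an assumption that would need to be tested.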