How to generate a stale flag using analytic functions in PySpark?

What is an efficient way to create the output table below with minimal joins, using analytic functions?
expiry_date - the date until which the answer is valid.
stale_answer_flag - a flag on the child answer showing that it predates the parent answer.
Input tables:
Question:
+-----------+------------------+-------------------------+
|question_id|parent_question_id|question                 |
+-----------+------------------+-------------------------+
|1          |                  |Are you living in the US?|
|1A         |1                 |What is your state       |
+-----------+------------------+-------------------------+
Answer:
+-----------+------------+------+
|question_id|date        |answer|
+-----------+------------+------+
|1          |01-sept-2022|yes   |
|1A         |01-sept-2022|NY    |
|1          |05-sept-2022|yes   |
+-----------+------------+------+
Expected Output table:
+-----------+------------------+-------------------------+------+------------+------------+--------------+
|question_id|parent_question_id|question                 |answer|date        |expiry_date |stale_ans_flag|
+-----------+------------------+-------------------------+------+------------+------------+--------------+
|1          |                  |Are you living in the US?|Yes   |01-sept-2022|05-sept-2022|Y             |
|1A         |1                 |What is your state       |NY    |01-sept-2022|'NULL'      |N             |
|1          |                  |Are you living in the US?|Yes   |05-sept-2022|'NULL'      |N             |
+-----------+------------------+-------------------------+------+------------+------------+--------------+

A standard SCD2 (slowly changing dimension, type 2) can be done as follows:
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
# create some data; I added a user column so that new entries can be shown as well
data_1 = [
("alex","1","2022-09-01","yes"),
("alex","1A","2022-09-01","NY"),
]
schema = ["user","id","date","answer"]
df_1 = spark.createDataFrame(data=data_1, schema = schema)
df_1.show(truncate=False)
+----+---+----------+------+
|user|id |date |answer|
+----+---+----------+------+
|alex|1 |2022-09-01|yes |
|alex|1A |2022-09-01|NY |
+----+---+----------+------+
data_2 = [
("alex","1","2022-09-05","no"),
("john","1","2022-09-05","yes"),
("john","1A","2022-09-05","VA")
]
schema = ["user","id","date","answer"]
df_2 = spark.createDataFrame(data=data_2, schema = schema)
df_2.show(truncate=False)
+----+---+----------+------+
|user|id |date |answer|
+----+---+----------+------+
|alex|1 |2022-09-05|no |
|john|1 |2022-09-05|yes |
|john|1A |2022-09-05|VA |
+----+---+----------+------+
# create the initial dataset of the "old" data with the two additional columns
df_old=(df_1
.withColumn("expiry_date",f.lit(None).cast(DateType()))
.withColumn("stale_ans_flag",f.lit(False))
)
df_old.show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-01| yes| null| false|
|alex| 1A|2022-09-01| NY| null| false|
+----+---+----------+------+-----------+--------------+
# create the same additional columns on the new dataset that will update the old one
df_new=(df_2
.withColumn("expiry_date",f.lit(None).cast(DateType()))
.withColumn("stale_ans_flag",f.lit(False))
)
df_new.show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-05| no| null| false|
|john| 1|2022-09-05| yes| null| false|
|john| 1A|2022-09-05| VA| null| false|
+----+---+----------+------+-----------+--------------+
# now join both together on the relevant keys and add a helper column
df_merge=(df_old.alias("old")
# a full outer join makes it easy to compare new and old and create manual update strategies
.join(df_new.alias("new"),
(df_old.user == df_new.user) &
(df_old.id == df_new.id),
how='fullouter'
)
# helper column; you could do it without, but due to lazy evaluation it costs nothing and it helps with reading the code and debugging
.withColumn("_action",f.when(
# identify new records that have to be inserted
(f.col("old.user").isNull())
& (f.col("old.id").isNull())
& (f.col("new.user").isNotNull())
& (f.col("new.id").isNotNull())
, "new"
).when(
# identify old records that are not changed
(f.col("old.user").isNotNull())
& (f.col("old.id").isNotNull())
& (f.col("new.user").isNull())
& (f.col("new.id").isNull())
, "old"
).when(
# identify update records
(f.col("old.user").isNotNull())
& (f.col("old.id").isNotNull())
& (f.col("new.user").isNotNull())
& (f.col("new.id").isNotNull())
, "update"
)
)
)
df_merge.show()
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
|user| id| date|answer|expiry_date|stale_ans_flag|user| id| date|answer|expiry_date|stale_ans_flag|_action|
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
|null|null| null| null| null| null|john| 1A|2022-09-05| VA| null| false| new|
|alex| 1|2022-09-01| yes| null| false|alex| 1|2022-09-05| no| null| false| update|
|alex| 1A|2022-09-01| NY| null| false|null|null| null| null| null| null| old|
|null|null| null| null| null| null|john| 1|2022-09-05| yes| null| false| new|
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
# finally put it all together with a union
df_merge_union=( df_merge
# first take the rows of the old dataset that did not change and are kept as they are
.filter(f.col("_action")=="old")
.select("old.user",
"old.id",
"old.date",
"old.answer",
"old.expiry_date",
"old.stale_ans_flag"
)
# then add the brand new ones that did not exist before
.union(df_merge
.filter(f.col("_action")=="new")
.select("new.user",
"new.id",
"new.date",
"new.answer",
"new.expiry_date",
"new.stale_ans_flag"
)
)
# add the old rows (with the old values) that have been updated: set the expiry date and flip the flag
.union(df_merge
.filter(f.col("_action")=="update")
.select("old.user",
"old.id",
"old.date",
"old.answer",
"new.date", # or f.current_date(),
f.lit(True)
)
)
# add the new rows (with the new values) for the records that have been updated
.union(df_merge
.filter(f.col("_action")=="update")
.select("new.user",
"new.id",
"new.date",
"new.answer",
f.lit(None),
f.lit(False)
)
)
)
df_merge_union.sort("user","id").show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-01| yes| 2022-09-05| true|
|alex| 1|2022-09-05| no| null| false|
|alex| 1A|2022-09-01| NY| null| false|
|john| 1|2022-09-05| yes| null| false|
|john| 1A|2022-09-05| VA| null| false|
+----+---+----------+------+-----------+--------------+
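The question also asks for an approach based purely on analytic (window) functions. Below is a minimal sketch, not the answer's method: it assumes a single answer table with the columns question_id, date and answer (the hypothetical answer_df), partitions by question_id, orders by date, and takes lead(date) as the expiry date; an answer is then flagged stale whenever a newer answer for the same question exists. Flagging a child answer that predates its parent's latest answer would still need a join via parent_question_id, which this sketch leaves out.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
# hypothetical input mirroring the Answer table from the question
answer_df = spark.createDataFrame(
    [("1", "2022-09-01", "yes"), ("1A", "2022-09-01", "NY"), ("1", "2022-09-05", "yes")],
    ["question_id", "date", "answer"],
).withColumn("date", f.to_date("date"))
# the next answer's date for the same question becomes the expiry date of the current answer
w = Window.partitionBy("question_id").orderBy("date")
result = (answer_df
    .withColumn("expiry_date", f.lead("date").over(w))
    # stale as soon as a newer answer for the same question exists
    .withColumn("stale_ans_flag", f.when(f.col("expiry_date").isNotNull(), "Y").otherwise("N"))
)
result.show()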

Related

Wrong encoding when reading csv file with pyspark

For my university course, I run the pyspark-notebook Docker image:
docker pull jupyter/pyspark-notebook
docker run -it --rm -p 8888:8888 -v /path/to/my/working/directory:/home/jovyan/work jupyter/pyspark-notebook
and then run the following Python code:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
sc = pyspark.SparkContext('local[*]')
spark = SparkSession(sc)
spark
listings_df = spark.read.csv("listings.csv", header=True, mode='DROPMALFORMED')
# adding encoding="utf8" to the line above doesn't help either
listings_df.printSchema()
The problem appears while reading the file. It seems that Spark reads my file incorrectly (possibly because of an encoding problem?): after reading, listings_df has 16494 lines, while the correct number of lines is 16478 (checked with pandas.read_csv()). You can also see that something is definitely broken by running
listings_df.groupBy("room_type").count().show()
which gives the following output:
+---------------+-----+
| room_type|count|
+---------------+-----+
| 169| 1|
| 4.88612| 1|
| 4.90075| 1|
| Shared room| 44|
| 35| 1|
| 187| 1|
| null| 16|
| 70| 1|
| 27| 1|
| 75| 1|
| Hotel room| 109|
| 198| 1|
| 60| 1|
| 280| 1|
|Entire home/apt|12818|
| 220| 1|
| 190| 1|
| 156| 1|
| 450| 1|
| 4.88865| 1|
+---------------+-----+
only showing top 20 rows
while real room_type values are only ['Private room', 'Entire home/apt', 'Hotel room', 'Shared room'].
Spark info which might be useful:
SparkSession - in-memory
SparkContext
Spark UI
Version: v3.1.2
Master: local[*]
AppName: pyspark-shell
And the encoding of the file:
!file listings.csv
listings.csv: UTF-8 Unicode text
listings.csv is an Airbnb statistics csv file downloaded from here
I have also uploaded all the code I ran to Colab.
There are two things that I've found:
Some lines contain quotes that need to be escaped (escape='"').
Also, as @JosefZ mentioned, there are unwanted line breaks (multiLine=True).
This is how you should read it:
input_df = spark.read.csv(path, header=True, multiLine=True, escape='"')
output_df = input_df.groupBy("room_type").count()
output_df.show()
+---------------+-----+
| room_type|count|
+---------------+-----+
| Shared room| 44|
| Hotel room| 110|
|Entire home/apt|12829|
| Private room| 3495|
+---------------+-----+
I think specifying the encoding of the file should solve the problem, i.e. adding encoding="utf8" to the spark.read.csv call that creates listings_df, as shown below:
listings_df = spark.read.csv("listings.csv", encoding="utf8", header=True, mode='DROPMALFORMED')
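As a quick sanity check (a sketch, assuming the same listings.csv and an active spark session), you can compare the naive read with the multiLine/escape read; the fixed variant should match the pandas row count and only show the four real room types:
naive_df = spark.read.csv("listings.csv", header=True)
fixed_df = spark.read.csv("listings.csv", header=True, multiLine=True, escape='"')
print(naive_df.count(), fixed_df.count())  # 16494 vs. 16478 in the example above
fixed_df.select("room_type").distinct().show()  # should list only the four real room types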

pyspark automatically extract fields inside json to top level columns

I have a JSON file like the one below:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":29,"city":"New York"}
{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}
The following pyspark code:
sc = spark.sparkContext
peopleDF = spark.read.json("people.json")
peopleDF.createOrReplaceTempView("people")
tableDF = spark.sql("SELECT * from people")
tableDF.show()
Produces this output:
+----+--------+---------+-------+
| age| city| data| name|
+----+--------+---------+-------+
|null| null| null|Michael|
| 30| null| null| Andy|
| 19| null| null| Justin|
| 29|New York| null| Bob|
| 49| null|{Test, 1}| Ross|
+----+--------+---------+-------+
But I'm looking for an output like the one below (notice how the elements inside data have become columns):
+----+--------+----+----+-------+
| age| city| id|Name| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null| 1|Test| Ross|
+----+--------+----+----+-------+
The fields inside the data struct change constantly, so I cannot pre-define the columns. Is there a function in pyspark that can automatically extract every single element in a struct to a top-level column? (It's okay if the performance is slow.)
You can use the "." operator to access nested elements and flatten your schema.
import spark.implicits._
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS).select("age", "city", "data.Name", "data.id", "name")
df.show()
+----+--------+----+----+-------+
| age| city|Name| id| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null|Test| 1| Ross|
+----+--------+----+----+-------+
If you want to flatten schema without selecting columns manually, you can use the following method to do it:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col
def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
schema.fields.flatMap(f => {
val colName = if (prefix == null) f.name else (prefix + "." + f.name)
f.dataType match {
case st: StructType => flattenSchema(st, colName)
case _ => Array(col(colName))
}
})
}
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS)
df.select(flattenSchema(df.schema):_*).show()
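The answer above is in Scala, while the question asks about PySpark. A hedged equivalent sketch of the same recursive flattening idea, assuming an active spark session and that the JSON lines from the question are stored in people.json:
from pyspark.sql.types import StructType
from pyspark.sql.functions import col
def flatten_schema(schema, prefix=None):
    # collect one column expression per leaf field, recursing into structs
    columns = []
    for field in schema.fields:
        name = field.name if prefix is None else prefix + "." + field.name
        if isinstance(field.dataType, StructType):
            columns.extend(flatten_schema(field.dataType, name))
        else:
            columns.append(col(name))
    return columns
people_df = spark.read.json("people.json")
people_df.select(flatten_schema(people_df.schema)).show()
If two nested fields happen to share a name, you may want to alias them, for example col(name).alias(name.replace(".", "_")).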

adding a unique consecutive row number to dataframe in pyspark

I want to add a unique row number to my dataframe in pyspark and don't want to use the monotonicallyIncreasingId & partitionBy methods.
I think this question might be a duplicate of similar questions asked earlier, but I'm still looking for advice on whether I am doing it the right way or not.
The following is a snippet of my code:
I have a csv file with the below set of input records:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
and I have loaded this csv file into a dataframe.
PATH_TO_FILE="file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD=empRDD.map(lambda x: (list(x[0]) + [x[1]]))
newRDD.take(2);
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I appended the int value to my list, I lost the dataframe schema.
newdf=newRDD.toDF(['emp_id','emp_name','emp_city','emp_salary','row_id'])
newdf.show();
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing it the right way, or is there a better way to add the row number and preserve the schema of the dataframe in pyspark?
Is it feasible to use the zipWithIndex method to add a unique consecutive row number for a large dataframe as well? Can we use this row_id to re-partition the dataframe to uniformly distribute the data across the partitions?
I have found a solution and it's very simple.
Since I have no column in my dataframe that has the same value across all rows, using row_number does not generate unique row numbers when used with a partitionBy clause.
Let's add a new column to the existing dataframe with some default value in it:
emp_df = emp_df.withColumn("new_column", lit("ABC"))
and create a window function with partitionBy using that column "new_column":
w = Window().partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
You will get the desired results:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+
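Coming back to the zipWithIndex part of the question: a sketch of one way to keep the original schema is to rebuild the DataFrame from the indexed RDD using the old schema plus an extra LongType field (assuming the emp_df loaded above and an active spark session):
from pyspark.sql.types import StructType, StructField, LongType
# append the index to each row and extend the original schema with a row_id column
indexed_rdd = emp_df.rdd.zipWithIndex().map(lambda pair: list(pair[0]) + [pair[1]])
indexed_schema = StructType(emp_df.schema.fields + [StructField("row_id", LongType(), False)])
indexed_df = spark.createDataFrame(indexed_rdd, schema=indexed_schema)
indexed_df.show()
If an even distribution across partitions is the goal, indexed_df.repartition(8, "row_id") (or any other partition count) is one option, at the cost of a full shuffle.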

Structured streaming load json convert to column output is null

My JSON data is like {reId: "1",ratingFlowId: "1001",workFlowId:"1"} and I use a program as follows:
case class CdrData(reId: String, ratingFlowId: String, workFlowId: String)
object StructuredHdfsJson {
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.appName("StructuredHdfsJson")
.master("local")
.getOrCreate()
val schema = Encoders.product[CdrData].schema
val lines = spark.readStream
.format("json")
.schema(schema)
.load("hdfs://iotsparkmaster:9000/json")
val query = lines.writeStream
.outputMode("update")
.format("console")
.start()
query.awaitTermination()
}
}
But the output is all nulls, as follows:
-------------------------------------------
Batch: 0
-------------------------------------------
+----+------------+----------+
|reId|ratingFlowId|workFlowId|
+----+------------+----------+
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
+----+------------+----------+
Probably Spark can't parse your JSON. The issue can be related to spaces (or other characters) inside the JSON. You should try to clean your data and run the application again.
Edit after comment (for future readers): keys should be put in quotation marks.
Edit 2: according to the JSON specification, keys are represented by strings, and every string should be enclosed in quotation marks. Spark uses the Jackson parser to convert strings to objects.
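To make that concrete, a minimal sketch (a plain batch read shown for brevity; the streaming variant with spark.readStream.schema(schema).json(...) behaves the same way), assuming the cleaned records with quoted keys are stored at the hypothetical path /tmp/json:
# valid JSON: every key is quoted, e.g. {"reId": "1", "ratingFlowId": "1001", "workFlowId": "1"}
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([
    StructField("reId", StringType()),
    StructField("ratingFlowId", StringType()),
    StructField("workFlowId", StringType()),
])
df = spark.read.schema(schema).json("/tmp/json")
df.show()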

Split string into multiple rows with one character each

I want to split a word in a column into multiple rows, each with a single character. A small example is below:
Id Name StartDate EndDate
1 raj 2017-07-05 2008-08-06
Here the expected result is:
Id Name StartDate EndDate
1 r 2017-07-05 2008-08-06
1 a 2017-07-05 2008-08-06
1 j 2017-07-05 2008-08-06
First split the string into an array of characters and then use explode. Note that a filter needs to be used, as otherwise one row would contain an empty string.
val df = spark.createDataFrame(Seq((1, "raj"), (2, "test"))).toDF("Id", "Name")
val df2 = df.withColumn("Name", explode(split($"Name", ""))).filter($"Name" =!= "")
This will give you:
+---+----+
| Id|Name|
+---+----+
| 1| r|
| 1| a|
| 1| j|
| 2| t|
| 2| e|
| 2| s|
| 2| t|
+---+----+
Note, for older versions of Spark (older than 2.0.0), use !== instead of =!= when checking for inequality.
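The answer above is in Scala; a hedged PySpark equivalent of the same split + explode + filter approach (column names as in the question, assuming an active spark session) would look like this:
import pyspark.sql.functions as f
df = spark.createDataFrame([(1, "raj"), (2, "test")], ["Id", "Name"])
# split into single characters, explode into one row per character,
# and drop the empty string that splitting on "" can produce
chars_df = (df
    .withColumn("Name", f.explode(f.split(f.col("Name"), "")))
    .filter(f.col("Name") != "")
)
chars_df.show()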