My JSON data looks like {reId: "1",ratingFlowId: "1001",workFlowId:"1"} and I use the following program:
import org.apache.spark.sql.{Encoders, SparkSession}

case class CdrData(reId: String, ratingFlowId: String, workFlowId: String)

object StructuredHdfsJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("StructuredHdfsJson")
      .master("local")
      .getOrCreate()
    val schema = Encoders.product[CdrData].schema
    val lines = spark.readStream
      .format("json")
      .schema(schema)
      .load("hdfs://iotsparkmaster:9000/json")
    val query = lines.writeStream
      .outputMode("update")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
But the output is all nulls, as follows:
-------------------------------------------
Batch: 0
-------------------------------------------
+----+------------+----------+
|reId|ratingFlowId|workFlowId|
+----+------------+----------+
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
+----+------------+----------+
Probably Spark can't parse your JSON. The issue can be related to spaces (or other unexpected characters) inside the JSON. You should try to clean your data and run the application again.
Edit after comment (for future readers):
keys should be put in quotation marks
Edit 2:
according to the JSON specification, keys are represented by strings, and every string must be enclosed in quotation marks. Spark uses the Jackson parser to convert strings to objects.
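To make this concrete, here is a small illustrative sketch (not part of the original post) that reuses the schema and path from the code above. Either quote every key in the input file, or, if the producer cannot be changed, enable Spark's Jackson-backed JSON option that tolerates unquoted field names:
// Valid input: every key quoted, one JSON object per line
// {"reId": "1", "ratingFlowId": "1001", "workFlowId": "1"}

// Alternative sketch: tolerate unquoted keys via the JSON source option
val lines = spark.readStream
  .format("json")
  .schema(schema)
  .option("allowUnquotedFieldNames", "true")
  .load("hdfs://iotsparkmaster:9000/json")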
What is an efficient way to create the output table below with minimal joins, using analytic functions?
expiry_date - Date until the answer is valid.
stale_answer_flag - Flag on the child answer showing that it predates the parent answer
Input tables:

Question:

question_id | parent_question_id | question
------------+--------------------+--------------------------
1           |                    | Are you living in the US?
1A          | 1                  | What is your state

Answer:

question_id | date         | answer
------------+--------------+-------
1           | 01-sept-2022 | yes
1A          | 01-sept-2022 | NY
1           | 05-sept-2022 | yes
Expected Output table:

question_id | parent_question_id | question                  | answer | date         | expiry_date  | stale_ans_flag
------------+--------------------+---------------------------+--------+--------------+--------------+---------------
1           |                    | Are you living in the US? | Yes    | 01-sept-2022 | 05-sept-2022 | Y
1A          | 1                  | What is your state        | NY     | 01-sept-2022 | 'NULL'       | N
1           |                    | Are you living in the US? | Yes    | 05-sept-2022 | 'NULL'       | N
A standard SCD2 (slowly changing dimension, type 2) can be done as follows:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f

# create (or reuse) a SparkSession; in a notebook or shell `spark` already exists
spark = SparkSession.builder.getOrCreate()

# create some data; I added a user column so we can also insert new entries
data_1 = [
    ("alex", "1", "2022-09-01", "yes"),
    ("alex", "1A", "2022-09-01", "NY"),
]
schema = ["user", "id", "date", "answer"]
df_1 = spark.createDataFrame(data=data_1, schema=schema)
df_1.show(truncate=False)
+----+---+----------+------+
|user|id |date |answer|
+----+---+----------+------+
|alex|1 |2022-09-01|yes |
|alex|1A |2022-09-01|NY |
+----+---+----------+------+
data_2 = [
    ("alex", "1", "2022-09-05", "no"),
    ("john", "1", "2022-09-05", "yes"),
    ("john", "1A", "2022-09-05", "VA")
]
schema = ["user", "id", "date", "answer"]
df_2 = spark.createDataFrame(data=data_2, schema=schema)
df_2.show(truncate=False)
+----+---+----------+------+
|user|id |date |answer|
+----+---+----------+------+
|alex|1 |2022-09-05|no |
|john|1 |2022-09-05|yes |
|john|1A |2022-09-05|VA |
+----+---+----------+------+
# create the initial dataset of the "old" data with the two additional columns
df_old = (df_1
    .withColumn("expiry_date", f.lit(None).cast(DateType()))
    .withColumn("stale_ans_flag", f.lit(False))
)
df_old.show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-01| yes| null| false|
|alex| 1A|2022-09-01| NY| null| false|
+----+---+----------+------+-----------+--------------+
# create the same two columns on the new dataset that will update the old one
df_new = (df_2
    .withColumn("expiry_date", f.lit(None).cast(DateType()))
    .withColumn("stale_ans_flag", f.lit(False))
)
df_new.show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-05| no| null| false|
|john| 1|2022-09-05| yes| null| false|
|john| 1A|2022-09-05| VA| null| false|
+----+---+----------+------+-----------+--------------+
# now join both together on the relevant keys and add a helper column
df_merge = (df_old.alias("old")
    # a full outer join makes it easy to compare new and old and to create manual update strategies
    .join(df_new.alias("new"),
          (df_old.user == df_new.user) &
          (df_old.id == df_new.id),
          how='fullouter'
    )
    # helper column: you could do it without, but due to lazy evaluation it does not matter,
    # and it helps with reading the code and debugging
    .withColumn("_action", f.when(
        # identify new records that have to be inserted
        (f.col("old.user").isNull())
        & (f.col("old.id").isNull())
        & (f.col("new.user").isNotNull())
        & (f.col("new.id").isNotNull())
        , "new"
    ).when(
        # identify old records that are not changed
        (f.col("old.user").isNotNull())
        & (f.col("old.id").isNotNull())
        & (f.col("new.user").isNull())
        & (f.col("new.id").isNull())
        , "old"
    ).when(
        # identify updated records
        (f.col("old.user").isNotNull())
        & (f.col("old.id").isNotNull())
        & (f.col("new.user").isNotNull())
        & (f.col("new.id").isNotNull())
        , "update"
    ))
)
df_merge.show()
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
|user| id| date|answer|expiry_date|stale_ans_flag|user| id| date|answer|expiry_date|stale_ans_flag|_action|
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
|null|null| null| null| null| null|john| 1A|2022-09-05| VA| null| false| new|
|alex| 1|2022-09-01| yes| null| false|alex| 1|2022-09-05| no| null| false| update|
|alex| 1A|2022-09-01| NY| null| false|null|null| null| null| null| null| old|
|null|null| null| null| null| null|john| 1|2022-09-05| yes| null| false| new|
+----+----+----------+------+-----------+--------------+----+----+----------+------+-----------+--------------+-------+
# finally put everything together with a union
df_merge_union = (df_merge
    # first take the old rows that did not change and are kept as they are
    .filter(f.col("_action") == "old")
    .select("old.user",
            "old.id",
            "old.date",
            "old.answer",
            "old.expiry_date",
            "old.stale_ans_flag"
    )
    # then add the brand new ones that did not exist before
    .union(df_merge
        .filter(f.col("_action") == "new")
        .select("new.user",
                "new.id",
                "new.date",
                "new.answer",
                "new.expiry_date",
                "new.stale_ans_flag"
        )
    )
    # add the old rows (old values) that have been updated: expire them and change the flag
    .union(df_merge
        .filter(f.col("_action") == "update")
        .select("old.user",
                "old.id",
                "old.date",
                "old.answer",
                "new.date",  # or f.current_date(),
                f.lit(True)
        )
    )
    # add the rows with the new values for the updated records
    .union(df_merge
        .filter(f.col("_action") == "update")
        .select("new.user",
                "new.id",
                "new.date",
                "new.answer",
                f.lit(None),
                f.lit(False)
        )
    )
)
df_merge_union.sort("user","id").show()
+----+---+----------+------+-----------+--------------+
|user| id| date|answer|expiry_date|stale_ans_flag|
+----+---+----------+------+-----------+--------------+
|alex| 1|2022-09-01| yes| 2022-09-05| true|
|alex| 1|2022-09-05| no| null| false|
|alex| 1A|2022-09-01| NY| null| false|
|john| 1|2022-09-05| yes| null| false|
|john| 1A|2022-09-05| VA| null| false|
+----+---+----------+------+-----------+--------------+
I have a JSON file like below:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":29,"city":"New York"}
{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}
The following pyspark code:
sc = spark.sparkContext
peopleDF = spark.read.json("people.json")
peopleDF.createOrReplaceTempView("people")
tableDF = spark.sql("SELECT * from people")
tableDF.show()
Produces this output:
+----+--------+---------+-------+
| age| city| data| name|
+----+--------+---------+-------+
|null| null| null|Michael|
| 30| null| null| Andy|
| 19| null| null| Justin|
| 29|New York| null| Bob|
| 49| null|{Test, 1}| Ross|
+----+--------+---------+-------+
But I'm looking for an output like below (notice how the elements inside data have become columns):
+----+--------+----+----+-------+
| age| city| id|Name| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null| 1|Test| Ross|
+----+--------+----+----+-------+
The fields inside the data struct change constantly, so I cannot pre-define the columns. Is there a function in PySpark that can automatically extract every single element in a struct to its top-level column? (It's okay if the performance is slow.)
You can use "." operator to access nested elements and flatten your schema.
import spark.implicits._
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS).select("age", "city", "data.Name", "data.id", "name")
df.show()
+----+--------+----+----+-------+
| age| city|Name| id| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null|Test| 1| Ross|
+----+--------+----+----+-------+
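For the single struct in this example there is also a small variant of the select above (an illustrative sketch, reusing js and the implicits import from above): star-expanding the data struct directly.
val df2 = spark.read.json(Seq(js).toDS)
  .select("age", "city", "data.*", "name")  // data.* expands all fields of the struct
df2.show()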
If you want to flatten schema without selecting columns manually, you can use the following method to do it:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName))
    }
  })
}
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS)
df.select(flattenSchema(df.schema):_*).show()
I am learning Spark and building a sample project. I have a Spark DataFrame which, when saved to a JSON file, has the following syntax:
{"str":["1001","19035004":{"Name":"Chris","Age":"29","Location":"USA"}]}
{"str":["1002","19035005":{"Name":"John","Age":"20","Location":"France"}]}
{"str":["1003","19035006":{"Name":"Mark","Age":"30","Location":"UK"}]}
{"str":["1004","19035007":{"Name":"Mary","Age":"22","Location":"UK"}]}
JSONInput.show() gave me something like the below:
+---------------------------------------------------------------------+
|str |
+---------------------------------------------------------------------+
|[1001,{"19035004":{"Name":"Chris","Age":"29","Location":"USA"}}] |
|[1002,{"19035005":{"Name":"John","Age":"20","Location":"France"}}] |
|[1003,{"19035006":{"Name":"Mark","Age":"30","Location":"UK"}}] |
|[1004,{"19035007":{"Name":"Mary","Age":"22","Location":"UK"}}] |
+---------------------------------------------------------------------+
I know this is not the correct syntax for JSON, but this is what I have.
How can I get this into a relational structure in the first place (I am pretty new to JSON and Spark, so this part is not mandatory):
Name Age Location
-----------------------
Chris 29 USA
John 20 France
Mark 30 UK
Mary 22 UK
And I want to filter for a specific country:
val resultToReturn = JSONInput.filter("Location=USA")
But this results in the below error:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
cannot resolve Location given input columns: [str]; line 1 pos 0;
How do I get rid of "str" and get the data into a proper structure? Can anyone help?
You can use from_json to parse the string values:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.{col, explode, from_json}

val schema = MapType(StringType,
  StructType(Array(
    StructField("Name", StringType, true),
    StructField("Age", StringType, true),
    StructField("Location", StringType, true)
  )),
  true
)
val resultToReturn = JSONInput.select(
  explode(from_json(col("str")(1), schema))
).select("value.*")
resultToReturn.show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29| USA|
//| John| 20| France|
//| Mark| 30| UK|
//| Mary| 22| UK|
//+-----+---+--------+
Then you can filter:
resultToReturn.filter("Location = 'USA'").show
//+-----+---+--------+
//| Name|Age|Location|
//+-----+---+--------+
//|Chris| 29| USA|
//+-----+---+--------+
You can extract the innermost JSON using regexp_extract and parse that JSON using from_json. Then you can star-expand the extracted JSON struct.
val parsed_df = JSONInput.selectExpr("""
  from_json(
    regexp_extract(str[0], '(\\{[^{}]+\\})', 1),
    'Name string, Age string, Location string'
  ) as parsed
""").select("parsed.*")
parsed_df.show(false)
+-----+---+--------+
|Name |Age|Location|
+-----+---+--------+
|Chris|29 |USA |
|John |20 |France |
|Mark |30 |UK |
|Mary |22 |UK |
+-----+---+--------+
And you can filter it using
val filtered = parsed_df.filter("Location = 'USA'")
P.S. Remember to add single quotes around USA.
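Alternatively (a small sketch, not from the original answer; filtered is just a name introduced here), filtering with a Column expression sidesteps the quoting question entirely:
import org.apache.spark.sql.functions.col

// equivalent filter expressed with a Column instead of a SQL string
val filtered = parsed_df.filter(col("Location") === "USA")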
I have a Python script which gets stock data (as below) from NYSE every minute into a new file (a single line). It contains data for 4 stocks - MSFT, ADBE, GOOGL and FB - in the JSON format below:
[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "126.0800", "high": "126.1000", "low": "126.0500", "close": "126.0750", "volume": "57081"}}, {"symbol": "ADBE", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "279.2900", "high": "279.3400", "low": "279.2600", "close": "279.3050", "volume": "12711"}}, {"symbol": "GOOGL", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "1166.4100", "high": "1166.7400", "low": "1166.2900", "close": "1166.7400", "volume": "8803"}}, {"symbol": "FB", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "192.4200", "high": "192.5000", "low": "192.3600", "close": "192.4800", "volume": "33490"}}]
I'm trying to read this file stream into a Spark Streaming dataframe, but I'm not able to define the proper schema for it. I looked around the internet and have done the following so far:
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
public class Driver1 {

    public static void main(String args[]) throws InterruptedException, StreamingQueryException {
        SparkSession session = SparkSession.builder().appName("Spark_Streaming").master("local[2]").getOrCreate();
        Logger.getLogger("org").setLevel(Level.ERROR);

        StructType priceData = new StructType()
                .add("open", DataTypes.DoubleType)
                .add("high", DataTypes.DoubleType)
                .add("low", DataTypes.DoubleType)
                .add("close", DataTypes.DoubleType)
                .add("volume", DataTypes.LongType);

        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("stock", priceData);

        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.printSchema();
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();
    }
}
The output I'm getting is this:
root
|-- symbol: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- stock: struct (nullable = true)
| |-- open: double (nullable = true)
| |-- high: double (nullable = true)
| |-- low: double (nullable = true)
| |-- close: double (nullable = true)
| |-- volume: long (nullable = true)
-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------------------+-----+
|symbol| timestamp|stock|
+------+-------------------+-----+
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
+------+-------------------+-----+
I have even tried first reading the JSON string as a text file and then applying the schema (like it is done with Kafka streaming)...
Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"),schema));
raw2.writeStream().format("console").start().awaitTermination();
Getting the below output; in this case, the rawData dataframe has the JSON data in string format:
+--------------------+
|jsontostructs(value)|
+--------------------+
| null|
| null|
| null|
| null|
| null|
Please help me figure it out.
Just figured it out. Keep the following two things in mind:
While defining the schema, make sure you name and order the fields exactly the same as in your JSON file.
Initially, use only StringType for all your fields; you can apply a transformation later to change them back to specific data types.
This is what worked for me:
StructType priceData = new StructType()
        .add("open", DataTypes.StringType)
        .add("high", DataTypes.StringType)
        .add("low", DataTypes.StringType)
        .add("close", DataTypes.StringType)
        .add("volume", DataTypes.StringType);

StructType schema = new StructType()
        .add("symbol", DataTypes.StringType)
        .add("timestamp", DataTypes.StringType)
        .add("priceData", priceData);

Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
rawData.writeStream().format("console").start().awaitTermination();
session.close();
See the output:
+------+-------------------+--------------------+
|symbol| timestamp| priceData|
+------+-------------------+--------------------+
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
+------+-------------------+--------------------+
You can now flatten the priceData column using priceData.open, priceData.close etc.
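For illustration, here is a hedged sketch of that flattening and casting step, written in Scala for brevity (the Java DataFrame API is analogous); it assumes the streaming DataFrame from above is available as rawData, and typed is just a name introduced here:
import org.apache.spark.sql.functions.col

// pull each field out of the priceData struct and cast it to the intended type
val typed = rawData
  .withColumn("open", col("priceData.open").cast("double"))
  .withColumn("high", col("priceData.high").cast("double"))
  .withColumn("low", col("priceData.low").cast("double"))
  .withColumn("close", col("priceData.close").cast("double"))
  .withColumn("volume", col("priceData.volume").cast("long"))
  .drop("priceData")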
I have two DataFrames, DF1 and DF2, and a JSON file which I need to use as a mapping file to create another dataframe (DF3).
DF1:
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
| 100| John| Mumbai|
| 101| Alex| Delhi|
| 104| Divas|Kolkata|
| 108| Jerry|Chennai|
+-------+-------+-------+
DF2:
+-------+-----------+-------+
|column4| column5|column6|
+-------+-----------+-------+
| S1| New| xxx|
| S2| Old| yyy|
| S5|replacement| zzz|
| S10| New| ppp|
+-------+-----------+-------+
Apart from this, I have one mapping file in JSON format which will be used to generate DF3.
Below is the JSON mapping file:
{"targetColumn":"newColumn1","sourceField1":"column2","sourceField2":"column4"}
{"targetColumn":"newColumn2","sourceField1":"column7","sourceField2":"column5"}
{"targetColumn":"newColumn3","sourceField1":"column8","sourceField2":"column6"}
So from this JSON file I need to create DF3 with the columns listed in the targetColumn field of the mapping. For each mapping entry, if sourceField1 is present in DF1 then the target column maps to sourceField1 from DF1, otherwise to sourceField2 from DF2.
Below is the expected output.
+----------+-----------+----------+
|newColumn1| newColumn2|newColumn3|
+----------+-----------+----------+
| John| New| xxx|
| Alex| Old| yyy|
| Divas|replacement| zzz|
| Jerry| New| ppp|
+----------+-----------+----------+
Any help here will be appreciated.
Parse the JSON and create the below List of custom objects
case class SrcTgtMapping(targetColumn:String,sourceField1:String,sourceField2:String)
val srcTgtMappingList=List(SrcTgtMapping("newColumn1","column2","column4"),SrcTgtMapping("newColumn2","column7","column5"),SrcTgtMapping("newColumn3","column8","column6"))
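If you prefer not to hard-code the list, here is a sketch of one way to build it, assuming the mapping file is line-delimited JSON at a placeholder path "mapping.json" (adjust the path to your environment):
import spark.implicits._

// read the line-delimited mapping file and collect it into the case class
val srcTgtMappingList: List[SrcTgtMapping] = spark.read
  .json("mapping.json")
  .as[SrcTgtMapping]
  .collect()
  .toList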
Add a dummy index column to both dataframes and join them on the index column.
import org.apache.spark.sql.functions._
val df1WithIndex=df1.withColumn("index",monotonicallyIncreasingId)
val df2WithIndex=df2.withColumn("index",monotonicallyIncreasingId)
val joinedDf=df1WithIndex.join(df2WithIndex,df1WithIndex.col("index")===df2WithIndex.col("index"))
Create the query and execute it.
val df1Columns=df1WithIndex.columns.toList
val df2Columns=df2WithIndex.columns.toList
val query=srcTgtMappingList.map(stm=>if(df1Columns.contains(stm.sourceField1)) joinedDf.col(stm.sourceField1).alias(stm.targetColumn) else joinedDf.col(stm.sourceField2).alias(stm.targetColumn))
val output=joinedDf.select(query:_*)
output.show
Sample Output:
+----------+-----------+----------+
|newColumn1| newColumn2|newColumn3|
+----------+-----------+----------+
| John| New| xxx|
| Alex| Old| yyy|
| Jerry| New| ppp|
| Divas|replacement| zzz|
+----------+-----------+----------+
Hope this approach helps you.