I have a Python script that fetches stock data (shown below) from the NYSE every minute and writes it to a new file as a single line. Each file contains data for four stocks (MSFT, ADBE, GOOGL and FB) in the following JSON format:
[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "126.0800", "high": "126.1000", "low": "126.0500", "close": "126.0750", "volume": "57081"}}, {"symbol": "ADBE", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "279.2900", "high": "279.3400", "low": "279.2600", "close": "279.3050", "volume": "12711"}}, {"symbol": "GOOGL", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "1166.4100", "high": "1166.7400", "low": "1166.2900", "close": "1166.7400", "volume": "8803"}}, {"symbol": "FB", "timestamp": "2019-05-02 15:59:00", "priceData": {"open": "192.4200", "high": "192.5000", "low": "192.3600", "close": "192.4800", "volume": "33490"}}]
I'm trying to read this file stream into a Spark Streaming dataframe, but I'm not able to define the proper schema for it. After searching the internet, this is what I have so far:
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
public class Driver1 {
    public static void main(String args[]) throws InterruptedException, StreamingQueryException {
        SparkSession session = SparkSession.builder().appName("Spark_Streaming").master("local[2]").getOrCreate();
        Logger.getLogger("org").setLevel(Level.ERROR);
        StructType priceData = new StructType()
                .add("open", DataTypes.DoubleType)
                .add("high", DataTypes.DoubleType)
                .add("low", DataTypes.DoubleType)
                .add("close", DataTypes.DoubleType)
                .add("volume", DataTypes.LongType);
        StructType schema = new StructType()
                .add("symbol", DataTypes.StringType)
                .add("timestamp", DataTypes.StringType)
                .add("stock", priceData);
        Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
        rawData.printSchema();
        rawData.writeStream().format("console").start().awaitTermination();
        session.close();
    }
}
The output I'm getting is this-
root
|-- symbol: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- stock: struct (nullable = true)
| |-- open: double (nullable = true)
| |-- high: double (nullable = true)
| |-- low: double (nullable = true)
| |-- close: double (nullable = true)
| |-- volume: long (nullable = true)
-------------------------------------------
Batch: 0
-------------------------------------------
+------+-------------------+-----+
|symbol| timestamp|stock|
+------+-------------------+-----+
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
| MSFT|2019-05-02 15:59:00| null|
| ADBE|2019-05-02 15:59:00| null|
| GOOGL|2019-05-02 15:59:00| null|
| FB|2019-05-02 15:59:00| null|
+------+-------------------+-----+
I have even tried first reading the JSON string as a text file and then applying the schema (as is done with Kafka streaming):
Dataset<Row> rawData = session.readStream().format("text").load("/home/abhinavrawat/streamingData/data/*");
Dataset<Row> raw2 = rawData.select(org.apache.spark.sql.functions.from_json(rawData.col("value"),schema));
raw2.writeStream().format("console").start().awaitTermination();
In this case the rawData dataframe holds the JSON data in string format, and I get the output below:
+--------------------+
|jsontostructs(value)|
+--------------------+
| null|
| null|
| null|
| null|
| null|
Please help me figure it out.
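I just figured it out. Keep the following two things in mind:
While defining the schema, make sure the field names match the keys in your JSON file exactly, and keep them in the same order. My original schema called the nested struct "stock", but the key in the file is "priceData".
Initially, use only StringType for all your fields (every value in this file is a quoted string); you can apply a transformation afterwards to cast them to specific data types.
This is what worked for me: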
StructType priceData = new StructType()
        .add("open", DataTypes.StringType)
        .add("high", DataTypes.StringType)
        .add("low", DataTypes.StringType)
        .add("close", DataTypes.StringType)
        .add("volume", DataTypes.StringType);
StructType schema = new StructType()
        .add("symbol", DataTypes.StringType)
        .add("timestamp", DataTypes.StringType)
        .add("priceData", priceData);
Dataset<Row> rawData = session.readStream().format("json").schema(schema).json("/home/abhinavrawat/streamingData/data/*");
rawData.writeStream().format("console").start().awaitTermination();
session.close();
See the output-
+------+-------------------+--------------------+
|symbol| timestamp| priceData|
+------+-------------------+--------------------+
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
| MSFT|2019-05-02 15:59:00|[126.0800, 126.10...|
| ADBE|2019-05-02 15:59:00|[279.2900, 279.34...|
| GOOGL|2019-05-02 15:59:00|[1166.4100, 1166....|
| FB|2019-05-02 15:59:00|[192.4200, 192.50...|
+------+-------------------+--------------------+
You can now flatten the priceData column by selecting priceData.open, priceData.close, and so on.
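To make the two points above concrete, here is a minimal plain-Python sketch (standard library only, not Spark). It parses a shortened copy of the sample record, checks that the nested key is "priceData" and that every price value arrives as a string, and casts the values only after parsing. The helper name flatten_price_data is hypothetical and is not part of any Spark API.

```python
import json

# One record from the sample file above, shortened to a single stock.
line = ('[{"symbol": "MSFT", "timestamp": "2019-05-02 15:59:00", '
        '"priceData": {"open": "126.0800", "high": "126.1000", '
        '"low": "126.0500", "close": "126.0750", "volume": "57081"}}]')


def flatten_price_data(record):
    """Hypothetical helper: lift priceData fields to the top level and cast them."""
    prices = record["priceData"]  # the key is "priceData", not "stock"
    return {
        "symbol": record["symbol"],
        "timestamp": record["timestamp"],
        "open": float(prices["open"]),
        "high": float(prices["high"]),
        "low": float(prices["low"]),
        "close": float(prices["close"]),
        "volume": int(prices["volume"]),
    }


records = json.loads(line)
# Every priceData value is a JSON string, consistent with StringType working.
assert all(isinstance(v, str) for v in records[0]["priceData"].values())
flat = [flatten_price_data(r) for r in records]
print(flat[0])
```

The same order of operations (read as strings first, cast second) is what the schema above relies on.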
I have a dataframe with two string columns and a third column with an array structure:
|-- music: string (nullable = true)
|-- artist: string (nullable = true)
|-- details: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- Genre: string (nullable = true)
| | |-- Origin: string (nullable = true)
Here is some sample data:
music | artist | details
Music_1 | Artist_1 | [{"Genre": "Rock", "Origin": "USA"}]
Music_2 | Artist_3 | [{"Genre": "", "Origin": "USA"}]
Music_3 | Artist_1 | [{"Genre": "Rock", "Origin": "UK"}]
I am trying to do what I think is a simple operation: concatenate each key and value with ' - '. Basically, I want to get the following structure:
music | artist | details
Music_1 | Artist_1 | Genre - Rock, Origin - USA
Music_2 | Artist_3 | Genre - , Origin - USA
Music_3 | Artist_1 | Genre - Rock, Origin - UK
I already tried one approach, which was to first separate the key and value into different columns so that I could then concatenate the items:
display(df.select(col("music"), col("artist"), posexplode("details").alias("key","value")))
But I got the following result:
music | artist | key | value
Music_1 | Artist_1 | 0 | [{"Genre": "Rock", "Origin": "USA"}]
Music_2 | Artist_3 | 0 | [{"Genre": "", "Origin": "USA"}]
Music_3 | Artist_1 | 0 | [{"Genre": "Rock", "Origin": "UK"}]
This is probably not the best solution. Can anyone help me?
Thanks!
You can use the built-in higher-order function transform() to get the desired result (available from Spark 2.4).
from pyspark.sql.functions import expr, explode_outer

# df is the input DataFrame from the question
df.withColumn('details', expr("""
        transform(details, c -> concat_ws(', ',
            concat_ws(' - ', 'Genre', c['Genre']),
            concat_ws(' - ', 'Origin', c['Origin'])))""")) \
    .withColumn('details', explode_outer('details')) \
    .show(truncate=False)
+--------+--------------------------+-------+
|artist |details |music |
+--------+--------------------------+-------+
|Artist_1|Genre - Rock, Origin - USA|Music_1|
|Artist_3|Genre - , Origin - USA |Music_2|
|Artist_1|Genre - Rock, Origin - UK |Music_3|
+--------+--------------------------+-------+
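For readers who want to see what the transform() lambda computes per array element, here is a rough plain-Python equivalent (not Spark). It assumes "details" is a list of dicts with no null values; format_details is a hypothetical helper name, and the sample rows are copied from the question.

```python
# Sample rows copied from the question; "details" is a list of dicts.
rows = [
    {"music": "Music_1", "artist": "Artist_1", "details": [{"Genre": "Rock", "Origin": "USA"}]},
    {"music": "Music_2", "artist": "Artist_3", "details": [{"Genre": "", "Origin": "USA"}]},
    {"music": "Music_3", "artist": "Artist_1", "details": [{"Genre": "Rock", "Origin": "UK"}]},
]


def format_details(details):
    """Hypothetical helper mirroring the transform() lambda: one string per array element."""
    return [", ".join(f"{key} - {value}" for key, value in d.items()) for d in details]


for row in rows:
    # Roughly mirrors explode_outer: one output row per element, or a single None row.
    for text in format_details(row["details"]) or [None]:
        print(row["music"], "|", row["artist"], "|", text)
```

Unlike concat_ws in Spark, this sketch does not skip null values, which is why the no-nulls assumption matters.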
I have a JSON file like the one below:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Bob", "age":29,"city":"New York"}
{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}
The following pyspark code:
sc = spark.sparkContext
peopleDF = spark.read.json("people.json")
peopleDF.createOrReplaceTempView("people")
tableDF = spark.sql("SELECT * from people")
tableDF.show()
Produces this output:
+----+--------+---------+-------+
| age| city| data| name|
+----+--------+---------+-------+
|null| null| null|Michael|
| 30| null| null| Andy|
| 19| null| null| Justin|
| 29|New York| null| Bob|
| 49| null|{Test, 1}| Ross|
+----+--------+---------+-------+
But I'm looking for output like the one below (notice how the elements inside data have become columns):
+----+--------+----+----+-------+
| age| city| id|Name| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null| 1|Test| Ross|
+----+--------+----+----+-------+
The fields inside the data struct change constantly, so I cannot pre-define the columns. Is there a function in PySpark that can automatically extract every element of a struct into its own top-level column? (It's okay if the performance is slow.)
You can use the "." operator to access nested elements and flatten your schema.
import spark.implicits._
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS).select("age", "city", "data.Name", "data.id", "name")
df.show()
+----+--------+----+----+-------+
| age| city|Name| id| name|
+----+--------+----+----+-------+
|null| null|null|null|Michael|
| 30| null|null|null| Andy|
| 19| null|null|null| Justin|
| 29|New York|null|null| Bob|
| 49| null|Test| 1| Ross|
+----+--------+----+----+-------+
If you want to flatten the schema without selecting columns manually, you can use the following method:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.functions.col
def flattenSchema(schema: StructType, prefix: String = null) : Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName))
    }
  })
}
val js = """[{"name":"Michael"},{"name":"Andy", "age":30},{"name":"Justin", "age":19},{"name":"Bob", "age":29,"city":"New York"},{"name":"Ross", "age":49,"data":{"id":1,"Name":"Test"}}]"""
val df = spark.read.json(Seq(js).toDS)
df.select(flattenSchema(df.schema):_*).show()
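Since the question asked about Python, here is a minimal plain-Python sketch of the same recursive idea. It is an illustration only: it walks individual records (dicts) rather than a Spark schema, and flatten_record is a hypothetical helper name. It keeps dotted names such as data.Name so that a nested field cannot collide with a top-level field like name.

```python
def flatten_record(record, prefix=None):
    """Hypothetical helper: recursively flatten nested dicts into dotted keys."""
    flat = {}
    for key, value in record.items():
        name = key if prefix is None else f"{prefix}.{key}"
        if isinstance(value, dict):
            # Recurse into the nested struct, carrying the dotted prefix along.
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


people = [
    {"name": "Michael"},
    {"name": "Bob", "age": 29, "city": "New York"},
    {"name": "Ross", "age": 49, "data": {"id": 1, "Name": "Test"}},
]
flat_people = [flatten_record(p) for p in people]
# Collect every column seen in any record, since the nested fields vary.
columns = sorted({key for row in flat_people for key in row})
print(columns)
for row in flat_people:
    print([row.get(c) for c in columns])
```

Because the column list is collected from the data, nothing about the nested fields has to be known in advance.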
I have a JSONB string in this format
{
"RouteId": "90679754-89f5-48d7-99e1-5192bf0becf9",
"Started": "2019-11-20T21:24:33.7294486Z",
"RouteName": "ProcessRequestsAndPublishResponse",
"MachineName": "5CG8134NJW-LA",
"ChildProfiles": [
{
"ApiMethod": "ProcessApiRequest",
"ExecuteType": null,
"DurationMilliseconds": 2521.4
},
{
"ApiMethod": "PublishShipViaToQueue",
"ExecuteType": null,
"DurationMilliseconds": 0.6
}
],
"DataBaseTimings": null,
"DurationMilliseconds": 2522.6
}
How do I get the output in this format?
| RouteName | Metrics | Time | TotalDuration |
---------------------------------------------------------------------------------------------
| ProcessRequestsAndPublishResponse | ProcessApiRequest | 2521.4 | 2522.6 |
| ProcessRequestsAndPublishResponse | PublishShipViaToQueue | 0.6 | 2522.6 |
---------------------------------------------------------------------------------------------
Any help on this is appreciated.
How would you also extend this to the case where there are several different arrays? Sorry, I'm fairly new to the JSONB world.
{
"RouteId": "af2e9cba-11ae-43a9-813c-d24ea574ee62",
"RouteName": "GenerateRequestAndPublishToQueue",
"ChildProfiles": [
{
"ApiMethod": "PublishShipViaRequestToQueue",
"DurationMilliseconds": 0.1
}
],
"DataBaseTimings": [
{
"ExecuteType": "OpenAsync",
"DurationMilliseconds": 0.1
},
{
"ExecuteType": "Reader",
"DurationMilliseconds": 72.1
},
{
"ExecuteType": "Close",
"DurationMilliseconds": 15.9
}
],
"DurationMilliseconds": 88.6
}
The required output is something like this
| RouteName | Metrics | Time | TotalDuration |
--------------------------------------------------------------------------------------------------------
| GenerateRequestAndPublishToQueue | PublishShipViaRequestToQueue | 0.1 | 88.6 |
| GenerateRequestAndPublishToQueue | OpenAsync | 0.1 | 88.6 |
| GenerateRequestAndPublishToQueue | Reader | 72.1 | 88.6 |
| GenerateRequestAndPublishToQueue | Close | 15.9 | 88.6 |
---------------------------------------------------------------------------------------------------------
You can do a lateral join and use jsonb_to_recordset() to expand the inner json array as an inline table:
select
js ->> 'RouteName' RouteName,
xs."ApiMethod" Metrics,
xs."DurationMilliseconds" "Time",
js ->> 'DurationMilliseconds' TotalDuration
from t
cross join lateral jsonb_to_recordset( js -> 'ChildProfiles')
as xs("ApiMethod" text, "DurationMilliseconds" numeric)
Demo on DB Fiddle:
routename | metrics | Time | totalduration
:-------------------------------- | :-------------------- | -----: | :------------
ProcessRequestsAndPublishResponse | ProcessApiRequest | 2521.4 | 2522.6
ProcessRequestsAndPublishResponse | PublishShipViaToQueue | 0.6 | 2522.6
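The query above expands only the ChildProfiles array. As a hedged illustration of the row expansion itself, and of how a second array such as DataBaseTimings could be handled by the same pattern, here is a plain-Python sketch (not SQL). The expand_metrics helper is hypothetical; in SQL the equivalent would presumably be a second lateral join combined with UNION ALL, which is an assumption and is not tested here.

```python
import json

# The second sample document from the question, without RouteId.
doc = json.loads("""
{
  "RouteName": "GenerateRequestAndPublishToQueue",
  "ChildProfiles": [
    {"ApiMethod": "PublishShipViaRequestToQueue", "DurationMilliseconds": 0.1}
  ],
  "DataBaseTimings": [
    {"ExecuteType": "OpenAsync", "DurationMilliseconds": 0.1},
    {"ExecuteType": "Reader", "DurationMilliseconds": 72.1},
    {"ExecuteType": "Close", "DurationMilliseconds": 15.9}
  ],
  "DurationMilliseconds": 88.6
}
""")


def expand_metrics(route):
    """Hypothetical helper: one output row per element of either array."""
    rows = []
    # (array key, key that names the metric); a null or missing array yields no rows.
    for array_key, name_key in (("ChildProfiles", "ApiMethod"),
                                ("DataBaseTimings", "ExecuteType")):
        for item in route.get(array_key) or []:
            rows.append((route["RouteName"], item[name_key],
                         item["DurationMilliseconds"], route["DurationMilliseconds"]))
    return rows


for row in expand_metrics(doc):
    print(row)
```

Each tuple corresponds to one line of the required output table: RouteName, Metrics, Time, TotalDuration.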
I want to add a unique row number to my dataframe in PySpark, and I don't want to use the monotonicallyIncreasingId or partitionBy methods.
This question might be a duplicate of similar questions asked earlier, but I'm still looking for advice on whether I'm doing it the right way.
Here is a snippet of my code.
I have a CSV file with the following input records:
1,VIKRANT SINGH RANA ,NOIDA ,10000
3,GOVIND NIMBHAL ,DWARKA ,92000
2,RAGHVENDRA KUMAR GUPTA,GURGAON ,50000
4,ABHIJAN SINHA ,SAKET ,65000
5,SUPER DEVELOPER ,USA ,50000
6,RAJAT TYAGI ,UP ,65000
7,AJAY SHARMA ,NOIDA ,70000
8,SIDDHARTH BASU ,SAKET ,72000
9,ROBERT ,GURGAON ,70000
and I have loaded this csv file into a dataframe.
PATH_TO_FILE="file:///u/user/vikrant/testdata/EMP_FILE.csv"
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
+------+--------------------+--------+----------+
|emp_id| emp_name|emp_city|emp_salary|
+------+--------------------+--------+----------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000|
| 4|ABHIJAN SINHA ...|SAKET | 65000|
| 5|SUPER DEVELOPER ...|USA | 50000|
| 6|RAJAT TYAGI ...|UP | 65000|
| 7|AJAY SHARMA ...|NOIDA | 70000|
| 8|SIDDHARTH BASU ...|SAKET | 72000|
| 9|ROBERT ...|GURGAON | 70000|
+------+--------------------+--------+----------+
empRDD = emp_df.rdd.zipWithIndex()
newRDD=empRDD.map(lambda x: (list(x[0]) + [x[1]]))
newRDD.take(2);
[[1, u'VIKRANT SINGH RANA ', u'NOIDA ', 10000, 0], [3, u'GOVIND NIMBHAL ', u'DWARKA ', 92000, 1]]
When I appended the int value to my list, I lost the dataframe schema.
newdf=newRDD.toDF(['emp_id','emp_name','emp_city','emp_salary','row_id'])
newdf.show();
+------+--------------------+--------+----------+------+
|emp_id| emp_name|emp_city|emp_salary|row_id|
+------+--------------------+--------+----------+------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 0|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 3|
| 5|SUPER DEVELOPER ...|USA | 50000| 4|
| 6|RAJAT TYAGI ...|UP | 65000| 5|
| 7|AJAY SHARMA ...|NOIDA | 70000| 6|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 7|
| 9|ROBERT ...|GURGAON | 70000| 8|
+------+--------------------+--------+----------+------+
Am I doing this the right way, or is there a better way to add the column while preserving the dataframe's schema in PySpark?
Is it feasible to use the zipWithIndex method to add unique consecutive row numbers to a large dataframe as well? Can we use this row_id to repartition the dataframe so that the data is distributed uniformly across the partitions?
I have found a solution, and it's very simple.
row_number() numbers the rows within each partition, so using it with partitionBy on an existing column does not generate unique row numbers across the whole dataframe, and my dataframe has no column that holds the same value in every row.
Let's add a new column with a constant value to the existing dataframe:
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number
emp_df = emp_df.withColumn("new_column", lit("ABC"))
Then create a window specification partitioned by that column, "new_column". Because every row falls into the same partition, the numbering never restarts (note that this also means all rows are processed in a single partition):
w = Window.partitionBy('new_column').orderBy(lit('A'))
df = emp_df.withColumn("row_num", row_number().over(w)).drop("new_column")
You will get the desired results:
+------+--------------------+--------+----------+-------+
|emp_id| emp_name|emp_city|emp_salary|row_num|
+------+--------------------+--------+----------+-------+
| 1|VIKRANT SINGH RAN...|NOIDA | 10000| 1|
| 2|RAGHVENDRA KUMAR ...|GURGAON | 50000| 2|
| 7|AJAY SHARMA ...|NOIDA | 70000| 3|
| 9|ROBERT ...|GURGAON | 70000| 4|
| 4|ABHIJAN SINHA ...|SAKET | 65000| 5|
| 8|SIDDHARTH BASU ...|SAKET | 72000| 6|
| 5|SUPER DEVELOPER ...|USA | 50000| 7|
| 3|GOVIND NIMBHAL ...|DWARKA | 92000| 8|
| 6|RAJAT TYAGI ...|UP | 65000| 9|
+------+--------------------+--------+----------+-------+
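The reasoning behind the constant column can be sketched in plain Python (not Spark). The row_number helper below is hypothetical and only imitates the per-partition numbering; it keeps the input order inside a partition, whereas Spark makes no such promise when ordering by a constant.

```python
from itertools import groupby


def row_number(rows, partition_key):
    """Hypothetical helper: number rows 1..n within each partition, like row_number()."""
    numbered = []
    ordered = sorted(rows, key=partition_key)  # stable sort keeps input order inside a partition
    for _, group in groupby(ordered, key=partition_key):
        for n, row in enumerate(group, start=1):
            numbered.append(row + (n,))
    return numbered


# (emp_id, emp_city) pairs taken from the sample data above.
emps = [(1, "NOIDA"), (3, "DWARKA"), (2, "GURGAON"), (7, "NOIDA")]

by_city = row_number(emps, partition_key=lambda r: r[1])
constant = row_number(emps, partition_key=lambda r: "ABC")
print(by_city)   # numbering restarts for every city
print(constant)  # one partition, so the numbers are unique
```

Partitioning by city repeats the number 1 for each city, while the constant key yields a single run of unique numbers.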
Using Spark SQL:
df = spark.sql("""
SELECT
row_number() OVER (
PARTITION BY ''
ORDER BY ''
) as id,
*
FROM
VALUES
('Bob ', 20),
('Alice', 21),
('Gary ', 21),
('Kent ', 25),
('Gary ', 35)
""")
Output:
>>> df.printSchema()
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = false)
|-- col2: integer (nullable = false)
>>> df.show()
+---+-----+----+
| id| col1|col2|
+---+-----+----+
| 1|Bob | 20|
| 2|Alice| 21|
| 3|Gary | 21|
| 4|Kent | 25|
| 5|Gary | 35|
+---+-----+----+
My JSON data looks like {reId: "1",ratingFlowId: "1001",workFlowId:"1"}, and I use the following program:
import org.apache.spark.sql.{Encoders, SparkSession}
case class CdrData(reId: String, ratingFlowId: String, workFlowId: String)
object StructuredHdfsJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("StructuredHdfsJson")
      .master("local")
      .getOrCreate()
    // Derive the streaming schema from the case class
    val schema = Encoders.product[CdrData].schema
    val lines = spark.readStream
      .format("json")
      .schema(schema)
      .load("hdfs://iotsparkmaster:9000/json")
    val query = lines.writeStream
      .outputMode("update")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
But the output is all null, as follows:
-------------------------------------------
Batch: 0
-------------------------------------------
+----+------------+----------+
|reId|ratingFlowId|workFlowId|
+----+------------+----------+
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
|null| null| null|
+----+------------+----------+
Spark probably can't parse your JSON. The issue may be related to spaces (or other unexpected characters) inside the JSON. Try cleaning your data and running the application again.
Edit after comment (for future readers):
Keys should be enclosed in quotation marks.
Edit 2:
According to the JSON specification, keys are strings, and every string must be enclosed in quotation marks. Spark uses the Jackson parser to convert the strings to objects.
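A quick way to check a sample line before handing it to Spark is to run it through any strict JSON parser. The sketch below uses Python's standard json module as a stand-in (it is not the Jackson parser that Spark uses, so this is an approximation); try_parse is a hypothetical helper name.

```python
import json

# The line from the question, and the same line with quoted keys.
unquoted = '{reId: "1", ratingFlowId: "1001", workFlowId: "1"}'
quoted = '{"reId": "1", "ratingFlowId": "1001", "workFlowId": "1"}'


def try_parse(text):
    """Hypothetical helper: return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_parse(unquoted))  # keys without quotation marks are rejected
print(try_parse(quoted))
```

If a line fails this check, fixing the producer so that it emits quoted keys is likely the simplest remedy.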