Spark Streaming MQTT - Apply schema on dataset - json

I'm working on Databricks (Spark 2.0.1-db1, Scala 2.11) and trying to use the Spark Streaming functions. I am using this library:
- spark-sql-streaming-mqtt_2.11-2.1.0-SNAPSHOT.jar (see here: http://bahir.apache.org/docs/spark/current/spark-sql-streaming-mqtt/)
The following command gives me a Dataset:
val lines = spark.readStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
.option("clientId", "sparkTest")
.option("brokerUrl", "tcp://xxx.xxx.xxx.xxx:xxx")
.option("topic", "/Name/data")
.option("localStorage", "dbfs:/models/mqttPersist")
.option("cleanSession", "true")
.load().as[(String, Timestamp)]
with this printSchema:
root
|-- value: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
And I would like to apply a schema to the "value" column of my dataset. You can see my JSON schema below:
root
|-- id : string (nullable = true)
|-- DateTime : timestamp (nullable = true)
|-- label : double (nullable = true)
Is it possible to parse my JSON directly in the stream to obtain something like this:
root
|-- value: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- DateTime: timestamp (nullable = true)
| |-- label: double (nullable = true)
|-- timestamp: timestamp (nullable = true)
For now, I don't see any way to parse the JSON coming from MQTT, so any help would be greatly appreciated.
Thanks in advance.

I had this exact same problem today! I used json4s (with its Jackson backend) to parse the JSON.
Here is how I am getting the streaming Dataset (pretty much the same as what you have):
val lines = spark.readStream
.format("org.apache.bahir.sql.streaming.mqtt.MQTTStreamSourceProvider")
.option("topic", topic)
.option("brokerUrl",brokerUrl)
.load().as[(String,Timestamp)]
I defined the schema using a case class:
case class DeviceData(devicename: String, time: Long, metric: String, value: Long, unit: String)
Parsing out the JSON column using org.json4s.jackson.JsonMethods.parse:
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

val ds = lines.map { row =>
  // json4s needs an implicit Formats in scope for extract
  implicit val format = DefaultFormats
  parse(row._1).extract[DeviceData]
}
Outputting the results:
val query = ds.writeStream
.format("console")
.option("truncate", false)
.start()
The results:
+----------+-------------+-----------+-----+----+
|devicename|time |metric |value|unit|
+----------+-------------+-----------+-----+----+
|dht11_4 |1486656575772|temperature|9 |C |
|dht11_4 |1486656575772|humidity |36 |% |
+----------+-------------+-----------+-----+----+
I am kind of disappointed that I could not come up with a solution using Spark's native JSON parsing; instead we have to rely on Jackson. You can use Spark's native JSON parsing if you are reading a file as a stream, like so:
val lines = spark.readStream
.....
.json("./path/to/file").as[(String,Timestamp)]
But for MQTT we cannot do this.
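That said, if you are able to move to Spark 2.1+, the from_json function in org.apache.spark.sql.functions should let you keep the parsing native. A minimal sketch under that assumption (schema field names taken from the question; not tested against the MQTT source):
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import spark.implicits._ // for the $-notation

// Schema of the JSON carried in the "value" column (field names from the question).
// If DateTime does not match Spark's default timestamp format, declare it as StringType and convert afterwards.
val jsonSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("DateTime", TimestampType),
  StructField("label", DoubleType)
))

// Parse the JSON string into a struct column while keeping the MQTT timestamp
val parsed = lines.select(from_json($"value", jsonSchema).alias("value"), $"timestamp")
parsed.printSchema()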

Related

Unable to create the column and value from the JSON nested key Value Pair using Spark/Scala

I am working on converting JSON that has nested key-value pairs, so that columns are created automatically for the keys and the rows are populated with the values. I don't want to define the schema, as the number of columns (keys) will differ for each file.
I am using Spark version 2.3 and Scala version 2.11.8.
I am not a Scala expert and have only just started working hands-on with Scala, so I would appreciate your input to get this resolved.
Here is the sample JSON format
{"RequestID":"9883a6d0-e002-4487-88a6-c92f6a504d72","OverallStatus":"OK","ele":[{"Name":"UUID","Value":"53f93df3-6528-4d42-a7f5-2876535d4982"},{"Name":"id"},{"Name":"opt_newsletter_email","Value":"boutmathieu#me.com"},{"Name":"parm1","Value":"secure.snnow.ca/orders/summary"},{"Name":"parm2","Value":"fromET"},{"Name":"parm3","Value":"implied"},{"Name":"parm4"},{"Name":"subscribed","Value":"True"},{"Name":"timestamp","Value":"8/6/2019 4:59:00 PM"},{"Name":"list_id","Value":"6"},{"Name":"name","Value":"Event Alerts"},{"Name":"email","Value":"boutmathieu#me.com"},{"Name":"newsletterID","Value":"sports:snnow:event"},{"Name":"subscribeFormIdOrURL"},{"Name":"unsubscribeTimestamp","Value":"8/14/2021 4:58:56 AM"}]}
{"RequestID":"9883a6d0-e002-4487-88a6-c92f6a504d72","OverallStatus":"OK","ele":[{"Name":"UUID","Value":"53f93df3-6528-4d42-a7f5-2876535d4982"},{"Name":"id"},{"Name":"opt_newsletter_email","Value":"boutmathieu#me.com"},{"Name":"parm1","Value":"secure.snnow.ca/orders/summary"},{"Name":"parm2","Value":"fromET"},{"Name":"parm3","Value":"implied"},{"Name":"parm4"},{"Name":"subscribed","Value":"True"},{"Name":"timestamp","Value":"8/6/2019 4:59:00 PM"},{"Name":"list_id","Value":"7"},{"Name":"name","Value":"Partner & Sponsored Offers"},{"Name":"email","Value":"boutmathieu#me.com"},{"Name":"newsletterID","Value":"sports:snnow:affiliate"},{"Name":"subscribeFormIdOrURL"},{"Name":"unsubscribeTimestamp","Value":"8/14/2021 4:58:56 AM"}]}
Expected output (shown as an image in the original post): one column per Name key, with the rows populated from the corresponding Values.
This is my code.
val newDF = spark.read.json("408d392-8c50-425a-a799-355f1783e0be-c000.json")
scala> newDF.printSchema
root
|-- OverallStatus: string (nullable = true)
|-- RequestID: string (nullable = true)
|-- ele: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Name: string (nullable = true)
| | |-- Value: string (nullable = true)
val jsonDF = newDF.withColumn("colNames", explode($"ele")).select($"RequestID", ($"ColNames"))
scala> jsonDF.printSchema
root
|-- RequestID: string (nullable = true)
|-- setting: struct (nullable = true)
| |-- Name: string (nullable = true)
| |-- Value: string (nullable = true)
val finalDF=jsonDF.groupBy($"RequestID").pivot("ColNames.name").agg("ColNames.value")
---------------------------------------------------------------------------------------
I am getting this error while creating the finalDF
<console>:39: error: overloaded method value agg with alternatives:
(expr: org.apache.spark.sql.Column,exprs: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame <and>
(exprs: java.util.Map[String,String])org.apache.spark.sql.DataFrame <and>
(exprs: scala.collection.immutable.Map[String,String])org.apache.spark.sql.DataFrame <and>
(aggExpr: (String, String),aggExprs: (String, String)*)org.apache.spark.sql.DataFrame
cannot be applied to (String)
val finalDF=jsonDF.groupBy($"RequestID").pivot("ColNames.name").agg("ColNames.value")
Any help would be greatly appreciated
You are almost there. The agg function can be called in the following ways:
agg(aggExpr: (String, String)) -> agg("age" -> "max")
agg(exprs: Map[String, String]) -> agg(Map("age" -> "max"))
agg(expr: Column) -> agg(max($"age"))
Instead of
agg("ColNames.value")
you should use one of the forms above.
e.g.
import org.apache.spark.sql.functions._
jsonDF.groupBy($"RequestID").pivot("ColNames.name")
.agg(collect_list($"ColNames.value"))
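For completeness, a minimal end-to-end sketch under the same assumptions (using first as the aggregate, since each Name appears once per RequestID; the file path is the one from the question):
import org.apache.spark.sql.functions.{explode, first}

val newDF = spark.read.json("408d392-8c50-425a-a799-355f1783e0be-c000.json")

// One row per (RequestID, Name/Value) pair
val jsonDF = newDF
  .withColumn("colNames", explode($"ele"))
  .select($"RequestID", $"colNames")

// Pivot the Name entries into columns, taking the single Value for each
val finalDF = jsonDF
  .groupBy($"RequestID")
  .pivot("colNames.Name")
  .agg(first($"colNames.Value"))

finalDF.show(false)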

Interpret timestamp fields in Spark while reading json

I am trying to read a pretty-printed JSON file which has time fields in it. I want to interpret the timestamp columns as timestamp fields while reading the JSON itself, but they are still read as strings when I printSchema.
E.g.
Input json file -
[{
"time_field" : "2017-09-30 04:53:39.412496Z"
}]
Code -
df = spark.read.option("multiLine", "true").option("timestampFormat","yyyy-MM-dd HH:mm:ss.SSSSSS'Z'").json('path_to_json_file')
Output of df.printSchema() -
root
|-- time_field: string (nullable = true)
What am I missing here?
My own experience with option timestampFormat is that it doesn't quite work as advertised. I would simply read the time fields as strings and use to_timestamp to do the conversion, as shown below (with slightly generalized sample input):
# /path/to/jsonfile
[{
"id": 101, "time_field": "2017-09-30 04:53:39.412496Z"
},
{
"id": 102, "time_field": "2017-10-01 01:23:45.123456Z"
}]
In Python:
from pyspark.sql.functions import to_timestamp
df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df = df.withColumn("timestamp", to_timestamp("time_field"))
df.show(2, False)
+---+---------------------------+-------------------+
|id |time_field |timestamp |
+---+---------------------------+-------------------+
|101|2017-09-30 04:53:39.412496Z|2017-09-30 04:53:39|
|102|2017-10-01 01:23:45.123456Z|2017-10-01 01:23:45|
+---+---------------------------+-------------------+
df.printSchema()
root
|-- id: long (nullable = true)
|-- time_field: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
In Scala:
import org.apache.spark.sql.functions.to_timestamp

val df = spark.read.option("multiLine", "true").json("/path/to/jsonfile")
df.withColumn("timestamp", to_timestamp($"time_field"))
It's a bug in Spark version 2.4.0; see issue SPARK-26325.
For Spark version 2.4.4:
import org.apache.spark.sql.types.TimestampType
import spark.implicits._ // for toDF and the $-notation

// Strings to timestamps
val df = Seq("2019-07-01 12:01:19.000",
  "2019-06-24 12:01:19.000",
  "2019-11-16 16:44:55.406",
  "2019-11-16 16:50:59.406").toDF("input_timestamp")
val df_mod = df.select($"input_timestamp".cast(TimestampType))
df_mod.printSchema
Output
root
|-- input_timestamp: timestamp (nullable = true)

Extract json data in Spark/Scala

I have a json file with this structure
root
|-- labels: struct (nullable = true)
| |-- compute.googleapis.com/resource_name: string (nullable = true)
| |-- container.googleapis.com/namespace_name: string (nullable = true)
| |-- container.googleapis.com/pod_name: string (nullable = true)
| |-- container.googleapis.com/stream: string (nullable = true)
I want to extract the four ...googleapis.com/... fields into four columns.
I tried this:
import org.apache.spark.sql.functions._
df = df.withColumn("resource_name", df("labels.compute.googleapis.com/resource_name"))
.withColumn("namespace_name", df("labels.compute.googleapis.com/namespace_name"))
.withColumn("pod_name", df("labels.compute.googleapis.com/pod_name"))
.withColumn("stream", df("labels.compute.googleapis.com/stream"))
I also tried the following, turning labels into an array, which fixed the earlier error saying the sublevels are not an array or map:
df2 = df.withColumn("labels", explode(array(col("labels"))))
.select(
  col("labels.compute.googleapis.com/resource_name").as("resource_name"),
  col("labels.compute.googleapis.com/namespace_name").as("namespace_name"),
  col("labels.compute.googleapis.com/pod_name").as("pod_name"),
  col("labels.compute.googleapis.com/stream").as("stream"))
I still get this error
org.apache.spark.sql.AnalysisException: No such struct field compute in compute.googleapis.com/resource_name .....
I know Spark treats each dot as a nested level, but how can I reference compute.googleapis.com/resource_name so that Spark recognises it as a single field name rather than multiple levels?
I also tried the solution suggested here:
How to get Apache spark to ignore dots in a query?
But this did not solve my problem either. My field is labels.compute.googleapis.com/resource_name, and adding backticks around compute.googleapis.com/resource_name still gives the same error.
Rename the columns (sublevels) by casting the struct to a new schema, then do the withColumn:
val schema = """struct<resource_name:string, namespace_name:string, pod_name:string, stream:string>"""
val df1 = df.withColumn("labels", $"labels".cast(schema))
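Casting a struct renames its fields by position, so after this cast the dots are gone and the sublevels can be referenced directly. A small sketch of the follow-up step (the withColumn names are illustrative):
val flattened = df1
  .withColumn("resource_name", $"labels.resource_name")
  .withColumn("namespace_name", $"labels.namespace_name")
  .withColumn("pod_name", $"labels.pod_name")
  .withColumn("stream", $"labels.stream")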
You can use the backtick character (`) to escape names that contain special characters like '.'. The backticks go after labels, around the child name only, since labels is the parent field:
val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
.withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
.withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
.withColumn("stream", df("labels.`container.googleapis.com/stream`"))
extracted.show(10, false)
Output:
+--------------------+-------------+--------------+--------+------+
|labels |resource_name|namespace_name|pod_name|stream|
+--------------------+-------------+--------------+--------+------+
|[RN_1,NM_1,PM_1,S_1]|RN_1 |NM_1 |PM_1 |S_1 |
+--------------------+-------------+--------------+--------+------+
UPDATE 1
Full working example.
import org.apache.spark.sql.functions._
import spark.implicits._ // for .toDS
val j_1 =
"""
|{ "labels" : {
| "compute.googleapis.com/resource_name" : "RN_1",
| "container.googleapis.com/namespace_name" : "NM_1",
| "container.googleapis.com/pod_name" : "PM_1",
| "container.googleapis.com/stream" : "S_1"
| }
|}
""".stripMargin
val df = spark.read.json(Seq(j_1).toDS)
df.printSchema()
val extracted = df.withColumn("resource_name", df("labels.`compute.googleapis.com/resource_name`"))
.withColumn("namespace_name", df("labels.`container.googleapis.com/namespace_name`"))
.withColumn("pod_name", df("labels.`container.googleapis.com/pod_name`"))
.withColumn("stream", df("labels.`container.googleapis.com/stream`"))
extracted.show(10, false)

Spark 2.0 (not 2.1) Dataset[Row] or Dataframe - Select few columns to JSON

I have a Spark DataFrame with 10 columns that I need to store in Postgres/an RDBMS. The target table has 7 columns, and the 7th column takes text (in JSON format) for further processing.
How do I select 6 columns and convert the remaining 4 columns in the DF to JSON format?
If the whole DF is to be stored as JSON, then we could use DF.write.format("json"), but only the last 4 columns are required to be in JSON format.
I tried creating a UDF (with either the Jackson or Lift library), but was not successful in passing the 4 columns to the UDF.
For the JSON, the DF column name should be the key and the DF column's value should be the value.
eg:
dataset name: ds_base
root
|-- bill_id: string (nullable = true)
|-- trans_id: integer (nullable = true)
|-- billing_id: decimal(3,-10) (nullable = true)
|-- asset_id: string (nullable = true)
|-- row_id: string (nullable = true)
|-- created: string (nullable = true)
|-- end_dt: string (nullable = true)
|-- start_dt: string (nullable = true)
|-- status_cd: string (nullable = true)
|-- update_start_dt: string (nullable = true)
I want to do,
ds_base
.select ( $"bill_id",
$"trans_id",
$"billing_id",
$"asset_id",
$"row_id",
$"created",
?? <JSON format of 4 remaining columns>
)
You can use struct and to_json:
import org.apache.spark.sql.functions.{to_json, struct}
to_json(struct($"end_dt", $"start_dt", $"status_cd", $"update_start_dt"))
As a workaround for legacy Spark versions, you could convert the whole row to JSON and then extract the required fields:
import org.apache.spark.sql.functions.{col, get_json_object, struct}
// List of column names to be kept as-is
val scalarColumns: Seq[String] = Seq("bill_id", "trans_id", ...)
// List of column names to be put in JSON
val jsonColumns: Seq[String] = Seq(
"end_dt", "start_dt", "status_cd", "update_start_dt"
)
// Convert all records to JSON, keeping selected fields as a nested document
val json = df.select(
scalarColumns.map(col _) :+
struct(jsonColumns map col: _*).alias("json"): _*
).toJSON
json.select(
  // Extract the scalar columns from the JSON string and cast them back to their original types
  scalarColumns.map(c =>
    get_json_object($"value", s"$$.$c").cast(df.schema(c).dataType).alias(c)) :+
  // Extract the nested JSON document as a string
  get_json_object($"value", "$.json").alias("json"): _*
)
This will work only as long as you have atomic types. Alternatively, you could use the standard JSON reader and specify the schema for the JSON field:
import org.apache.spark.sql.types._
val combined = df.select(
scalarColumns.map(col _) :+
struct(jsonColumns map col: _*).alias("json"): _*
)
val newSchema = StructType(combined.schema.fields map {
case StructField("json", _, _, _) => StructField("json", StringType)
case s => s
})
spark.read.schema(newSchema).json(combined.toJSON.rdd)
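Once you have the 7-column DataFrame (call it finalDF here, a hypothetical name, with the JSON column as a string), writing it to Postgres is a plain JDBC write; a minimal sketch with placeholder URL, table name, and credentials:
import java.util.Properties

val connProps = new Properties()
connProps.setProperty("user", "db_user")          // placeholder
connProps.setProperty("password", "db_password")  // placeholder
connProps.setProperty("driver", "org.postgresql.Driver")

finalDF.write
  .mode("append")
  .jdbc("jdbc:postgresql://host:5432/dbname", "target_table", connProps)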

Simple join of two Spark DataFrame failing with "org.apache.spark.sql.AnalysisException: Cannot resolve column name"

Update
It turns out this has something to do with the way the Databricks Spark CSV reader is creating the DataFrame. In the example below that does not work, I read the people and address CSV using Databricks CSV reader, then write the resulting DataFrame to HDFS in Parquet format.
I changed the code to create the DataFrame with the following (similar for people.csv):
JavaRDD<Address> address = context.textFile("/Users/sfelsheim/data/address.csv").map(
new Function<String, Address>() {
public Address call(String line) throws Exception {
String[] parts = line.split(",");
Address addr = new Address();
addr.setAddrId(parts[0]);
addr.setCity(parts[1]);
addr.setState(parts[2]);
addr.setZip(parts[3]);
return addr;
}
});
I then write the resulting DataFrame to HDFS in Parquet format, and the join works as expected.
I am reading the exact same CSV in both cases.
Running into an issue trying to perform a simple join of two DataFrames created from two different parquet files on HDFS.
[main] INFO org.apache.spark.SparkContext - Running Spark version 1.4.1
Using HDFS from Hadoop 2.7.0
Here is a sample to illustrate.
public void testStrangeness(String[] args) {
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("joinIssue");
JavaSparkContext context = new JavaSparkContext(conf);
SQLContext sqlContext = new SQLContext(context);
DataFrame people = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/people.parquet");
DataFrame address = sqlContext.parquetFile("hdfs://localhost:9000//datalake/sample/address.parquet");
people.printSchema();
address.printSchema();
// yeah, works
DataFrame cartJoin = address.join(people);
cartJoin.printSchema();
// boo, fails
DataFrame joined = address.join(people,
address.col("addrid").equalTo(people.col("addressid")));
joined.printSchema();
}
Contents of people
first,last,addressid
your,mom,1
fred,flintstone,2
Contents of address
addrid,city,state,zip
1,sometown,wi,4444
2,bedrock,il,1111
people.printSchema();
results in...
root
|-- first: string (nullable = true)
|-- last: string (nullable = true)
|-- addressid: integer (nullable = true)
address.printSchema();
results in...
root
|-- addrid: integer (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- zip: integer (nullable = true)
DataFrame cartJoin = address.join(people);
cartJoin.printSchema();
Cartesian join works fine, printSchema() results in...
root
|-- addrid: integer (nullable = true)
|-- city: string (nullable = true)
|-- state: string (nullable = true)
|-- zip: integer (nullable = true)
|-- first: string (nullable = true)
|-- last: string (nullable = true)
|-- addressid: integer (nullable = true)
This join...
DataFrame joined = address.join(people,
address.col("addrid").equalTo(people.col("addressid")));
Results in the following exception.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve column name "addrid" among (addrid, city, state, zip);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:159)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:558)
at dw.dataflow.DataflowParser.testStrangeness(DataflowParser.java:36)
at dw.dataflow.DataflowParser.main(DataflowParser.java:119)
I tried changing it so that people and address have a common key attribute (addressid) and used:
address.join(people, "addressid");
But got the same result.
Any ideas??
Thanks
It turns out the problem was that the CSV files were in UTF-8 format with a BOM. The Databricks CSV implementation does not handle UTF-8 with a BOM. I converted the files to UTF-8 without the BOM and everything works fine.
I was able to fix this using Notepad++. Under the "Encoding" menu, I switched from "Encode in UTF-8 BOM" to "Encode in UTF-8".
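If you would rather strip the BOM in code than in Notepad++, here is a small Scala sketch (not part of the original fix; the output path is illustrative) that drops a leading UTF-8 BOM before the file is handed to the CSV reader:
import java.nio.file.{Files, Paths}

// The UTF-8 BOM is the 3-byte sequence 0xEF 0xBB 0xBF at the start of the file
val bom = Array(0xEF.toByte, 0xBB.toByte, 0xBF.toByte)
val bytes = Files.readAllBytes(Paths.get("/Users/sfelsheim/data/address.csv"))
val cleaned = if (bytes.length >= 3 && bytes.take(3).sameElements(bom)) bytes.drop(3) else bytes
Files.write(Paths.get("/Users/sfelsheim/data/address-nobom.csv"), cleaned)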