Aggregation over a Spark stream - JSON

I am new to Apache Spark.
My Scala code is consuming JSON messages as strings from a Kafka topic in Apache Spark.
Now I want to aggregate over a certain field in my JSON. What are my options?

You can put the JSON into a DataFrame/Dataset and perform the following aggregate operations:
groupBy
groupByKey
rollup
cube
Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either an RDD of String, or a JSON file.
val json_path = "dir/example.json"
val jsonDF = spark.read.json(json_path)
jsonDF.groupBy("col1").count().show()
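
Since your messages arrive from Kafka as a stream, here is a minimal Structured Streaming sketch along the same lines; the schema, topic and bootstrap servers below are placeholders for your own values:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

val spark = SparkSession.builder.appName("kafka-json-agg").getOrCreate()

// placeholder schema for the incoming JSON messages
val schema = new StructType()
  .add("col1", StringType)
  .add("amount", LongType)

val counts = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")   // placeholder
  .option("subscribe", "my-topic")                   // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  .select(from_json(col("json"), schema).as("data"))
  .select("data.*")
  .groupBy("col1")
  .count()

counts.writeStream
  .outputMode("complete")   // streaming aggregations need complete or update output mode
  .format("console")
  .start()
  .awaitTermination()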

Related

Nested JSON to dataframe in Scala

I am using Spark/Scala to make an API Request and parse the response into a dataframe. Following is the sample JSON response I am using for testing purpose:
API Request/Response
However, when I tried to use the following answer from StackOverflow to convert the JSON, the nested fields were not processed. Is there any way to convert the JSON string to a dataframe with proper columns?
I think the problem is that the JSON you have attached, when read as a dataframe, gives a single row (and it is very large), so Spark might be truncating the result.
If this is what you want, you can try setting the Spark property spark.debug.maxToStringFields to a higher value (the default is 25):
spark.conf.set("spark.debug.maxToStringFields", 100)
However, if you want to process the Results from the JSON, it would be better to get it as a dataframe and then do the processing. Here is how you can do it:
// JsonParser here is Gson's parser
import com.google.gson.JsonParser

// pull the "Results" array out of the raw JSON, then read it as a dataframe
val results = JsonParser.parseString(<json content>).getAsJsonObject().get("Results").getAsJsonArray.toString
import spark.implicits._
val df = spark.read.json(Seq(results).toDS)
df.show(false)
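
If some columns of df still come back as structs after this, nested fields can be reached with dot notation; the column names below are purely hypothetical:

import org.apache.spark.sql.functions.col

df.printSchema()   // shows which columns are structs and may need flattening
// hypothetical nested column: pull a struct field out into its own top-level column
df.select(col("Address.City").as("city")).show(false)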

Is there any way to use GeoJSON conveniently with Geomesa?

I wouldn't like to build a Geomesa Datastore, just want to use the Geomesa Spark Core/SQL module to do some spatial analysis process on spark. My data sources are some GeoJson files on hdfs. However, I have to create a SpatialRDD by SpatialRDDProvider.
There is a Converter RDD Provider example in the documents of Geomesa:
import com.typesafe.config.ConfigFactory
import org.apache.hadoop.conf.Configuration
import org.geotools.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark
val exampleConf = ConfigFactory.load("example.conf").root().render()
val params = Map(
"geomesa.converter" -> exampleConf,
"geomesa.converter.inputs" -> "example.csv",
"geomesa.sft" -> "phrase:String,dtg:Date,geom:Point:srid=4326",
"geomesa.sft.name" -> "example")
val query = new Query("example")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)
I can choose GeoMesa's JSON Converter to create the SpatialRDD. However, it seems necessary to specify all field names and types in the geomesa.sft parameter and in a converter config file. If I have many GeoJSON files, I would have to do this manually one by one, which is obviously very inconvenient.
Is there any way that the GeoMesa Converter can infer the field names and types from the file?
Yes, GeoMesa can infer the type and converter. For scala/java, see this unit test. Alternatively, the GeoMesa CLI tools can be used ahead of time to persist the type and converter to reusable files, using the convert command (type inference is described in the linked ingest command).

How structured streaming dynamically parses kafka's json data

I am trying to read data from Kafka using structured streaming. The data received from kafka is in json format.
My code is as follows:
In the code I use the from_json function to convert the JSON into a dataframe for further processing.
val schema: StructType = new StructType()
  .add("time", LongType)
  .add("id", LongType)
  .add("properties", new StructType()
    .add("$app_version", StringType)
    .
    .
  )
val df: DataFrame = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "...")
  .load()
  .selectExpr("CAST(value AS STRING) as value")
  .select(from_json(col("value"), schema))
My problem is that if new fields are added, I can't stop the Spark program to add these fields to the schema manually. How can I parse these fields dynamically? I tried schema_of_json(), but it only takes the first row to infer the field types, and it is not suitable for multi-level nested JSON data.
My problem is that if new fields are added, I can't stop the Spark program to add these fields manually, so how can I parse these fields dynamically?
It is not possible in Spark Structured Streaming (or even Spark SQL) out of the box. There are a couple of solutions though.
Changing Schema in Code and Resuming Streaming Query
You simply have to stop your streaming query, change the code to match the current schema, and resume it. It is possible in Spark Structured Streaming with data sources that support resuming from checkpoint. Kafka data source does support it.
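For example, as long as the query keeps the same checkpoint location, restarting it with an updated schema in from_json resumes from the stored Kafka offsets; the sink and paths below are placeholders:

// placeholders: any sink that supports checkpointing will do
df.writeStream
  .format("parquet")
  .option("path", "/tmp/output/my-query")
  .option("checkpointLocation", "/tmp/checkpoints/my-query")
  .start()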
User-Defined Function (UDF)
You could write a user-defined function (UDF) that would do this dynamic JSON parsing for you. That's also among the easiest options.
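As a rough sketch of that idea (using json4s, which ships with Spark, and applied to the raw value column from the CAST step above, before from_json), a UDF could flatten each message's top-level fields into a map, so newly added fields simply show up as new keys:

import org.apache.spark.sql.functions.{col, udf}
import org.json4s._
import org.json4s.jackson.JsonMethods.{compact, parse, render}

// sketch only: turn the top-level JSON fields into a Map[String, String]
val jsonToMap = udf { value: String =>
  parse(value) match {
    case JObject(fields) => fields.toMap.map { case (k, v) => k -> compact(render(v)) }
    case _               => Map.empty[String, String]
  }
}

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "...")
  .option("subscribe", "...")
  .load()
  .selectExpr("CAST(value AS STRING) as value")

val parsed = raw.select(jsonToMap(col("value")).as("fields"))
// individual keys can then be looked up by name, e.g. parsed.select(col("fields")("time"))

Values that are themselves nested objects simply stay in the map as JSON strings, so the query keeps running when the producer adds fields, at the cost of doing the typed parsing later yourself.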
New Data Source (MicroBatchReader)
Another option is to create an extension to the built-in Kafka data source that would do the dynamic JSON parsing (similarly to Kafka deserializers). That requires a bit more development, but is certainly doable.

Date fields transformation from AWS Glue table to RedShift Spectrum external table

I am trying to transform a JSON dataset from S3, catalogued with a Glue table schema, into a Redshift Spectrum external table for data analysis. While creating the external tables, how do I transform the DATE fields?
I need to highlight that the source data is coming from MongoDB in ISODate format. Here is the Glue table format.
struct $date:string
Tried the following formats within the External table
startDate:struct<$date:varchar(40)>
startDate:struct<date:varchar(40)>
startDate:struct<date:timestamp>
Is there a work around within the Redshift Spectrum or Glue to handle ISODate formats? Or the recommendation is to go back to the source to convert the ISOdate format?
Assuming you are using Python in glue, and assuming python understands your field as a date, you could do something like:
from pyspark.sql.functions import date_format
from awsglue.dynamicframe import DynamicFrame
from awsglue.context import GlueContext
def out_date_format(to_format):
    """formats the passed date column into MM/dd/yyyy format"""
    return date_format(to_format, "MM/dd/yyyy")

# if you have a dynamic frame you will need to convert it to a dataframe first:
# dataframe = dynamic_frame.toDF()
dataframe = dataframe.withColumn("new_column_name", out_date_format("your_old_date_column_name"))
# assuming you are outputting via glue, you will need to convert the dataframe back into a dynamic frame:
# glue_context = GlueContext(spark_context)
# final = DynamicFrame.fromDF(dataframe, glue_context, "final")
Depending on how you are getting the data, there may be other options to use mapping or formatting.
If python doesn't understand your field as a date object, you will need to parse it first, something like:
import dateutil.parser
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# and the conversion would change to a row-level function, registered as a UDF:
def out_date_format(to_format):
    """parses the passed ISO date string and formats it as MM/dd/yyyy"""
    return dateutil.parser.parse(to_format).strftime("%m/%d/%Y")

out_date_format_udf = udf(out_date_format, StringType())
dataframe = dataframe.withColumn("new_column_name", out_date_format_udf("your_old_date_column_name"))
Note that if dateutil isn't built into Glue, you will need to add it to your job parameters with syntax like:
"--additional-python-modules" = "python-dateutil==2.8.1"

Scala - Extracting a value from a file containing data in JSON format

I have a 'sample.json' file containing the following JSON data:
{"query": "SELECT count(*) FROM TABLE_NAME"}
This file is being generated by another application and is placed in that application directory.
What I want to do is to read this file and extract the value (i.e., SELECT count(*) FROM TABLE_NAME) of the key 'query' into a val query. This val query will be used to query a database.
Being new to Scala, I am quite lost in the other answers I have found.
What is the simplest way to extract the value from the file into val query in Scala?
I would prefer not using any external libraries unless very necessary.
There are countless ways of parsing JSON in Scala. For your particular task, here is one suggestion, assuming your JSON string has been read into the variable jsonStr:
import scala.util.parsing.json.JSON
val resultOption = JSON.parseFull(jsonStr) match {
case Some(map: Map[String, String]) => map.get("query")
case _ => None
}
Now, if the parsing succeeded, resultOption will contain the SQL query string, wrapped in an option (otherwise, it will be None). Next, you should probably check for errors, but for the sake of simplicity, if we blindly assume that everything worked out nicely, we can now access the final value as follows:
val query = resultOption.get
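
For completeness, the jsonStr variable itself can be filled by reading the file with the standard library (the file name follows the example above):

import scala.io.Source

// read the whole file into one string and make sure the handle is closed afterwards
val source = Source.fromFile("sample.json")
val jsonStr = try source.mkString finally source.close()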