Scala schema_of_json function fails in Spark Structured Streaming

I have created a function that reads a JSON string and infers its schema, and I am using that function inside Spark Structured Streaming. I get an error when doing so. The same piece of code works when I create the schema first and then use that schema to read, but it doesn't work in a single line. How can I fix it?
def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  TOPICS.split(',').foreach(topic => {
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF
    /*var schema = schema_of_json(df
      .select($"value")
      .filter($"topic".contains(topic))
      .as[String]
    )*/
    var jsonDataDf = df.filter($"topic".contains(topic))
      .withColumn("jsonData", from_json($"value", schema_of_json(lit($"value".as[String])), scala.collection.immutable.Map[String, String]().asJava))
    var srcTable = jsonDataDf
      .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")
    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  })
}
Spark streaming code
import org.apache.spark.sql.streaming.Trigger
val StreamingQuery = InputDf
  .select("*")
  .writeStream.outputMode("update")
  .option("queryName", "StreamingQuery")
  .foreachBatch(processBatch _)
  .start()
Error:
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)

Error – org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema_of_json(value)
The above error points to the schema argument of the from_json() function: the schema must be specified as a DDL-format string literal or as the output of schema_of_json/schema_of_csv over a literal, not derived from a per-row column.
Syntax: from_json(jsonStr, schema[, options]) - returns a struct value parsed from the given jsonStr using the given schema.
Refer to the examples below:
> SELECT from_json('{"a":1, "b":0.8}', 'a INT, b DOUBLE');
{"a":1,"b":0.8}
> SELECT from_json('{"time":"26/08/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));
{"time":2015-08-26 00:00:00}
Refer - https://docs.databricks.com/sql/language-manual/functions/from_json.html
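In the DataFrame API the same constraint applies: the schema argument of from_json must be foldable (a DDL string literal or schema_of_json over a literal), not a per-row column. A minimal Scala sketch, assuming a DataFrame df with a string column value; the sample document and the DDL schema are illustrative only:
import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

// Option 1: schema given as a DDL-format string literal
val parsedDdl = df.withColumn(
  "jsonData",
  from_json(col("value"), "a INT, b DOUBLE", Map.empty[String, String]))

// Option 2: schema inferred once from a single sample JSON literal and reused for every row
val sampleJson = """{"a":1, "b":0.8}"""
val parsedSampled = df.withColumn(
  "jsonData",
  from_json(col("value"), schema_of_json(lit(sampleJson))))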

This is how I solved it.
I created a per-topic filtered DataFrame from the Kafka output DataFrame and applied all the logic to it, as before. The problem with generating the schema while reading is that from_json doesn't know which row of the DataFrame to infer the schema from.
def processBatch(microBatchOutputDF: DataFrame, batchId: Long) {
  TOPICS.split(',').foreach(topic => {
    var TableName = topic.split('.').last.toUpperCase
    var df = microBatchOutputDF.where(col("topic") === topic)
    // df already contains only this topic, so infer the schema from one sample record
    var schema = schema_of_json(df
      .select($"value")
      .as[String]
      .first()
    )
    var jsonDataDf = df.withColumn("jsonData", from_json($"value", schema, scala.collection.immutable.Map[String, String]().asJava))
    var srcTable = jsonDataDf
      .select(col(s"jsonData.payload.after.*"), $"offset", $"timestamp")
    srcTable
      .select(srcTable.columns.map(c => col(c).cast(StringType)) : _*)
      .write
      .mode("append").format("delta").save("/mnt/datalake/raw/kafka/" + TableName)
    spark.sql(s"""CREATE TABLE IF NOT EXISTS kafka_raw.$TableName USING delta LOCATION '/mnt/datalake/raw/kafka/$TableName'""")
  })
}
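Note that inferring the schema from a single sampled record assumes every message for a given topic shares the same JSON structure; any fields absent from the sampled record will simply not appear in the parsed jsonData column.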

Related

Jsony newHook has `SIGSEGV: Illegal storage access. (Attempt to read from nil?)` when deserializing into ref-objects

I am writing a web application and am deserializing via jsony into norm model object types.
Norm model types are always ref objects. Somehow my code, which is very similar to the default example in jsony's GitHub documentation, fails with the error SIGSEGV: Illegal storage access. (Attempt to read from nil?).
Here is my code sample:
import std/[typetraits, times]
import norm/[pragmas, model]
import jsony

const OUTPUT_TIME_FORMAT* = "yyyy-MM-dd'T'HH:mm:ss'.'ffffff'Z'"

type Character* {.tableName: "wikientries_character".} = ref object of Model
  name*: string
  creation_datetime*: DateTime
  update_datetime*: DateTime

proc parseHook*(s: string, i: var int, v: var DateTime) =
  ##[ jsony hook that is automatically called to convert a JSON string to a DateTime
  ``s``: The full JSON string that needs to be deserialized. Your type may only be a part of this
  ``i``: The index into the JSON string where the section that needs to be deserialized here starts
  ``v``: The variable to fill with a proper value ]##
  var str: string
  s.parseHook(i, str)
  v = parse(str, OUTPUT_TIME_FORMAT, utc())

proc newHook*(entry: var Character) =
  let currentDateTime: DateTime = now()
  entry.creation_datetime = currentDateTime # <-- This line is listed as the reason for the SIGSEGV
  entry.update_datetime = currentDateTime
  entry.name = ""

var input = """ {"name":"Test"} """
let c = input.fromJson(Character)
I don't understand what the issue is here, as the jsony example on its GitHub page looks pretty similar:
type
  Foo5 = object
    visible: string
    id: string

proc newHook*(foo: var Foo5) =
  # Populates the object before it's fully deserialized.
  foo.visible = "yes"

var s = """{"id":"123"}"""
var v = s.fromJson(Foo5)
doAssert v.id == "123"
doAssert v.visible == "yes"
How can I fix this?
The answer lies in the fact that norm model types are ref objects, not normal (value) objects (thanks to ElegantBeef, Rika, and Yardanico from the Nim Discord for pointing this out). If you do not explicitly create a ref type at some point, the memory for it is never allocated; unlike with value types, the allocation does not happen for you automatically.
Therefore, you must initialize/create a ref object before you can use it, and jsony does not do that initialization for you.
The correct way to write the above newHook thus looks like this:
proc newHook*(entry: var Character) =
  entry = new(Character)
  let currentDateTime: DateTime = now()
  entry.creation_datetime = currentDateTime
  entry.update_datetime = currentDateTime
  entry.name = ""

Very simple Slick query, get a single value

I'm trying to introduce Slick into my code to replace some existing JDBC code.
First of all, I'd like to use a Scala worksheet to run a really simple query: I want to pass in an integer id and get back a string UUID. This is the simplest method in the whole codebase.
As I understand it, I need to make a connection to the database, set up an action, and then run the action. I have the following code:
val db = Database.forURL("jdbc:mysql://mysql-dev.${URL}/${DB}?autoReconnect=true&characterEncoding=UTF-8",
driver = "com.mysql.jdbc.Driver", user = "${user}",password= "${pass}")
val getUUID = sql"""SELECT ${UUIDFIELD} from users u WHERE u.${IDFIELD} = ${Id}""".as[String]
val uuid:String = db.run(getUUID)
println(uuid)
I'm pretty sure I don't have the driver set up correctly in the Database.forURL call, but the worksheet is also complaining that the result of db.run is not a String. How do I get to the String UUID value?
The db.run method returns a Future[_]. You should use Await to get the result out of it:
import scala.concurrent._
import scala.concurrent.duration._

val db = Database.forURL("jdbc:mysql://mysql-dev.${URL}/${DB}?autoReconnect=true&characterEncoding=UTF-8",
  driver = "com.mysql.jdbc.Driver", user = "${user}", password = "${pass}")
// .head turns the Vector[String] action into a single-value action
val getUUID = sql"""SELECT ${UUIDFIELD} from users u WHERE u.${IDFIELD} = ${Id}""".as[String].head
val uuidFuture: Future[String] = db.run(getUUID)
val uuid: String = Await.result(uuidFuture, Duration.Inf)
println(uuid)
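Blocking with Await is fine in a worksheet; in application code you would more typically keep the Future and transform it instead. A minimal sketch, reusing db and getUUID from above:
import scala.concurrent.ExecutionContext.Implicits.global

// Non-blocking alternative: register a callback instead of blocking the calling thread
db.run(getUUID).foreach(uuid => println(uuid))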

Spark UDF returns a length of field instead of length of value

Consider the code below
object SparkUDFApp {
  def main(args: Array[String]) {
    val df = ctx.read.json(".../example.json")
    df.registerTempTable("example")
    val fn = (_: String).length // % 10
    ctx.udf.register("len10", fn)
    val res0 = ctx sql "SELECT len10('id') FROM example LIMIT 1" map {_ getInt 0} collect
    println(res0.head)
  }
}
JSON example
{"id":529799371026485248,"text":"Example"}
The code should return the length of the field value from the JSON (e.g. the value of 'id' is 18 characters long). But instead of returning '18' it returns '2', which I suppose is the length of the string 'id'.
So my question is: how do I rewrite the UDF to fix this?
The problem is that you are passing the string 'id' as a literal to your UDF, so it is interpreted as a plain string instead of a column (notice that it has 2 letters, which is why you get that number). To solve this, just change how you formulate the SQL query.
E.g.
val res0 = ctx sql "SELECT len10(id) FROM example LIMIT 1" map {_ getInt 0} collect
// Or alternatively
val len10 = udf((word: String) => word.length)
df.select(len10(df("id")).as("length")).show()
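For reference, a minimal self-contained sketch of the same fix against the current SparkSession API; the session setup, the file path, and the CAST are assumptions, not part of the original code:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkUDFApp").master("local[*]").getOrCreate()

val df = spark.read.json("/path/to/example.json") // placeholder path
df.createOrReplaceTempView("example")

// Register the UDF with an explicit parameter type so Spark knows the input type
spark.udf.register("len10", (word: String) => word.length)

// Pass the column (cast to a string, since id is read as a number), not the literal 'id'
spark.sql("SELECT len10(CAST(id AS STRING)) FROM example LIMIT 1").show()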

Assign unique IDs in parallel [duplicate]

I have a JDBC connection between Apache Spark and PostgreSQL, and I want to insert some data into my database. When I use append mode I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?
Scala:
If all you need are unique numbers, you can use zipWithUniqueId and recreate the DataFrame. First, some imports and dummy data:
import sqlContext.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(
  ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")
Extract schema for further usage:
val schema = df.schema
Add id field:
val rows = df.rdd.zipWithUniqueId.map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}
Create DataFrame:
val dfWithPK = sqlContext.createDataFrame(
  rows, StructType(StructField("id", LongType, false) +: schema.fields))
The same thing in Python:
from pyspark.sql import Row
from pyspark.sql.types import StructField, StructType, LongType

row = Row("foo", "bar")
df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()
row_with_index = Row(*["id"] + df.columns)

def make_row(columns):
    def _make_row(row, uid):
        row_dict = row.asDict()
        return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
    return _make_row

f = make_row(df.columns)
df_with_pk = (df.rdd
    .zipWithUniqueId()
    .map(lambda x: f(*x))
    .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))
If you prefer consecutive numbers, you can replace zipWithUniqueId with zipWithIndex, but it is a little bit more expensive.
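For reference, a minimal Scala sketch of the zipWithIndex variant, reusing the df, schema, and sqlContext defined above:
// zipWithIndex gives consecutive ids, at the cost of an extra Spark job to count partition sizes
val indexedRows = df.rdd.zipWithIndex.map {
  case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
}
val dfWithConsecutiveId = sqlContext.createDataFrame(
  indexedRows, StructType(StructField("id", LongType, false) +: schema.fields))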
Directly with the DataFrame API:
(universal for Scala, Python, Java, and R, with pretty much the same syntax)
Previously I had missed the monotonicallyIncreasingId function, which should work just fine as long as you don't require consecutive numbers:
import org.apache.spark.sql.functions.monotonicallyIncreasingId
df.withColumn("id", monotonicallyIncreasingId).show()
// +---+----+-----------+
// |foo| bar| id|
// +---+----+-----------+
// | a|-1.0|17179869184|
// | b|-2.0|42949672960|
// | c|-3.0|60129542144|
// +---+----+-----------+
While useful, monotonicallyIncreasingId is non-deterministic. Not only may the ids differ from execution to execution, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.
Note:
It is also possible to use the rowNumber window function (called row_number in newer versions):
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df.withColumn("id", rowNumber().over(w)).show()
Unfortunately:
WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
So unless you have a natural way to partition your data and ensure uniqueness, it is not particularly useful at this moment.
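For completeness, a Scala sketch of the same window approach; the ordering column foo is only illustrative, and the single-partition warning above still applies:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// All rows are still pulled into one partition because the window has no partitioning
val w = Window.orderBy("foo")
df.withColumn("id", row_number().over(w)).show()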
from pyspark.sql.functions import monotonically_increasing_id
df.withColumn("id", monotonically_increasing_id()).show()
Note that the second argument of df.withColumn is monotonically_increasing_id() (the function call), not monotonically_increasing_id (the bare function).
I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those desiring consecutive integers.
In this case, we're using PySpark and relying on a dictionary comprehension to map the original row object to a new dictionary that fits a new schema including the unique index.
# read the initial dataframe without index
dfNoIndex = sqlContext.read.parquet(dataframePath)
# Need to zip together with a unique integer
# First create a new schema with the uuid field appended
newSchema = StructType([StructField("uuid", IntegerType(), False)]
                       + dfNoIndex.schema.fields)
# zip with the index, map it to a dictionary which includes the new field
# (each element produced by zipWithIndex is a (row, index) tuple)
df = dfNoIndex.rdd.zipWithIndex()\
    .map(lambda row_id: {k: v
                         for k, v
                         in list(row_id[0].asDict().items()) + [("uuid", row_id[1])]})\
    .toDF(newSchema)
For anyone else who doesn't require integer types, concatenating the values of several columns whose combinations are unique across the data can be a simple alternative. You have to handle nulls since concat/concat_ws won't do that for you. You can also hash the output if the concatenated values are long:
import pyspark.sql.functions as sf
unique_id_sub_cols = ["a", "b", "c"]
df = df.withColumn(
    "UniqueId",
    sf.md5(
        sf.concat_ws(
            "-",
            *[
                sf.when(sf.col(sub_col).isNull(), sf.lit("Missing")).otherwise(
                    sf.col(sub_col)
                )
                for sub_col in unique_id_sub_cols
            ]
        )
    ),
)

solrj QueryResponse getTermsResponse returns null

I'm trying to get a TermsResponse object from a SolrJ QueryResponse object, but it doesn't seem to be working. I'm using Scala, but I would be happy with a working Java example too.
First I set up the term vector query, which looks to be working:
val solrurl = "http://localhost:8983/solr"
val server= new HttpSolrServer( solrurl )
val query = new SolrQuery
query.setRequestHandler("/tvrh")
query.set("fl", "text")
query.set("tv.all", true)
query.setQuery("uid:" + id)
val response = server.query(query)
The query returns a QueryResponse object whose toString looks like a JSON object. This object includes the term vector information (terms, frequency, etc.) as part of the JSON.
But when I do the following I always get a null object:
val termsResponse = Option(response.getTermsResponse)
Is this function deprecated?
If so, what is the best way to retrieve the structure from the QueryResponse? Convert it to JSON? Some other sources point to using response.get("termVector"), but that seems to be deprecated.
Any ideas?
Thanks
I have been using a simple Java SolrQuery object for this, with the following configuration.
// qterms is a separate SolrQuery pointed at the /terms handler; solr is the SolrServer/SolrClient instance
SolrQuery qterms = new SolrQuery();
// Adding terms for 2-word phrases
qterms.setTerms(true);
qterms.setRequestHandler("/terms");
qterms.setTermsLimit(20);
qterms.addTermsField("PhraseIndx2");
qterms.setTermsMinCount(20);

QueryResponse response = solr.query(query);
SolrDocumentList results = response.getResults();

// Get all terms of the 2-phrase field from the terms query response
System.out.println("printing the terms from queryresponse: \n");
QueryResponse resTerms = solr.query(qterms);
TermsResponse termResp = resTerms.getTermsResponse();
List<Term> terms = termResp.getTerms("PhraseIndx2");
System.out.print(terms.size());
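Since the question mentions Scala, here is a minimal sketch of the same approach against the HttpSolrServer from the question; the field name PhraseIndx2 is carried over from the Java snippet and the limits are arbitrary:
import org.apache.solr.client.solrj.SolrQuery
import scala.collection.JavaConverters._

// A separate query sent to the /terms handler, so getTermsResponse is actually populated
val qterms = new SolrQuery
qterms.setTerms(true)
qterms.setRequestHandler("/terms")
qterms.setTermsLimit(20)
qterms.addTermsField("PhraseIndx2") // replace with a real field
qterms.setTermsMinCount(20)

val resTerms = server.query(qterms)
val termResp = resTerms.getTermsResponse
val terms = termResp.getTerms("PhraseIndx2").asScala
println(terms.size)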