Decode chunked JSON with AKKA Stream - json

I have a Source[ByteString, _] from an input file with 3 rows like this (in reality the input is a TCP socket with a continuos stream):
{"a":[2
33]
}
Now the problem is that I want to parse this into a Source[ChangeMessage,_], however the only examples I have found deals with when there is a whole JSON message for every row not when each JSON message can be fragmented over multiple rows.
One example I found is this this library, however it expects } or , as last character, that is one JSON per row. The example below shows this setup.
"My decoder" should "decode chunked json" in {
implicit val sys = ActorSystem("test")
implicit val mat = ActorMaterializer()
val file = Paths.get("chunked_json_stream.json")
val data = FileIO.fromPath(file)
.via(CirceStreamSupport.decode[ChangeMessage])
.runWith(TestSink.probe[ChangeMessage])
.request(1)
.expectComplete()
}
Another alternative would be to use a fold and balance } and only emit when a whole JSON is completed. The problem with this is that the fold operator only emits when the stream completes and since this is a continuous stream I can not use it here.
My question is: What is the fastest way to parse chunked JSON streams
in AKKA Stream and are there any available software that already does
this? If possible I would like to use circe

As documentation of knutwalker/akka-stream-json says:
This flow even supports parsing multiple json documents in whatever fragmentation they may arrive, which is great for consuming stream/sse based APIs.
In your case all you need to do is to just delimit the incoming ByteStrings:
"My decoder" should "decode chunked json" in {
implicit val sys = ActorSystem("test")
implicit val mat = ActorMaterializer()
val file = Paths.get("chunked_json_stream.json")
val sourceUnderTest =
FileIO.fromPath(file)
.via(Framing.delimiter(ByteString("\n"), 8192, allowTruncation = true))
.via(CirceStreamSupport.decode[ChangeMessage])
sourceUnderTest
.runWith(TestSink.probe[ChangeMessage])
.request(1)
.expectNext(ChangeMessage(List(233)))
.expectComplete()
}
That's because when reading from file, ByteString elements contain multiple lines and therefore Circe is not able to parse malformed jsons. When you delimit by new line, each element in the stream is a separate line and therefore Circe is able to parse it using the aformentioned feature.

Unfortunately, I'm not aware of any Scala libraries which support stream-based parsing of JSON. It seems to me that some support for this is available in Google Gson, but I'm not entirely sure it can properly handle "broken" input.
What you can do, however, is to collect JSON documents in a streaming fashion, similarly to what Framing.delimiter does. This is very similar to the alternative you have mentioned, but it is not using fold(); if you do go this way, you would probably need to mimic what Framing.delimiter does but instead of looking for a single delimiter, you will need to balance curly braces (and optionally brackets, if top-level arrays are possible), buffering the intermediate data, until the entire document comes through, which you would emit as a single chunk suitable for parsing.
Just as a side note, an appropriate interface for a streaming JSON parser suitable to be used in Akka Streams could look like this:
trait Parser {
def update(data: Array[Byte]) // or String
def pull(): Option[Either[Error, JsonEvent]]
}
where pull() returns None if it can't read anymore but there are no actual syntactic errors in the incoming document, and JsonEvent is some standard structure for describing events of streaming parsers (i.e. a sealed trait with subclasses like BeginObject, BeginArray, EndObject, EndArray, String, etc.). If you find such a library or create one, you can use it to parse data coming from an Akka stream of ByteStrings.

Related

Fail when decoding into a circe when JSON has more attributes than expected

I would like to capture JSON typos and other schema violations. I originally wanted to use circe-json-schema but the scala version provided in our execution environment is 2.11 which is not supported by said library. I turned to semi-automatic decoding but it, unfortunately, ignores any "additional" attributes in the incoming JSON. Example:
val json = json"""{ "pepe": "corre", "tito": "tira" }"""
case class Pepe(pepe: String)
implicit val pepeDecoder: Decoder[Pepe] = deriveDecoder
/* ideally, because "tito" is extra but the assertion nevertheless fails */
assert(json.as[Pepe] == Left("decoding failure"))
My other option would be to manually traverse the whole JSON w/ a Map of expectations but I think it pedestrian. If there are any other options w/in the circe world, including coexisting w/ a higher scala major version (in order to leverage circe-json-schema) w/in Maven, please advise.

How to capture incorrect (corrupt) JSON records in (Py)Spark Structured Streaming?

I have a Azure Eventhub, which is streaming data (in JSON format).
I read it as a Spark dataframe, parse the incoming "body" with from_json(col("body"), schema) where schema is pre-defined. In code it, looks like:
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import *
schema = StructType().add(...) # define the incoming JSON schema
df_stream_input = (spark
.readStream
.format("eventhubs")
.options(**ehConfInput)
.load()
.select(from_json(col("body").cast("string"), schema)
)
And now = if there is some inconsistency between the incoming JSON's schema and the defined schema (e.g. the source eventhub starts sending data in new format without notice), the from_json() functions will not throw an error = instead, it will put NULL to the fields, which are present in my schema definition but not in the JSONs eventhub sends.
I want to capture this information and log it somewhere (Spark's log4j, Azure Monitor, warning email, ...).
My question is: what is the best way how to achieve this.
Some of my thoughts:
First thing I can think of is to have a UDF, which checks for the NULLs and if there is any problem, it raise an Exception. I believe there it is not possible to send logs to log4j via PySpark, as the "spark" context cannot be initiated within the UDF (on the workers) and one wants to use the default:
log4jLogger = sc._jvm.org.apache.log4j
logger = log4jLogger.LogManager.getLogger('PySpark Logger')
Second thing I can think of is to use "foreach/foreachBatch" function and put this check logic there.
But I feel both these approaches are like.. like too much custom - I was hoping that Spark has something built-in for these purposes.
tl;dr You have to do this check logic yourself using foreach or foreachBatch operators.
It turns out I was mistaken thinking that columnNameOfCorruptRecord option could be an answer. It will not work.
Firstly, it won't work due to this:
case _: BadRecordException => null
And secondly due to this that simply disables any other parsing modes (incl. PERMISSIVE that seems to be used alongside columnNameOfCorruptRecord option):
new JSONOptions(options + ("mode" -> FailFastMode.name), timeZoneId.get))
In other words, your only option is to use the 2nd item in your list, i.e. foreach or foreachBatch and handle corrupted records yourself.
A solution could use from_json while keeping the initial body column. Any record with an incorrect JSON would end up with the result column null and foreach* would catch it, e.g.
def handleCorruptRecords:
// if json == null the body was corrupt
// handle it
df_stream_input = (spark
.readStream
.format("eventhubs")
.options(**ehConfInput)
.load()
.select("body", from_json(col("body").cast("string"), schema).as("json"))
).foreach(handleCorruptRecords).start()

Prevent parsing a JSON node with common-lisp YASON library

I am using the Yason library in common-lisp, I want to parse a json string but would like the parser to keep one a its node unparsed.
Typically with an example like that:
{
"metadata1" : "mydata1",
"metadata2" : "mydata2",
"payload" : {...my long payload object},
"otherNodesToParse" : {...}
}
How can I set the yason parser to parse my json but skip the payload node and keep it as a string in the json format.
Use: let's say I just want the envelope data (everything that's not the payload), and to forward the payload as-is (as json string) to another system.
If I parse the whole json (so including payload) and then re-encode the payload to json, it is inefficient. The payload size could also be pretty big.
How do you know where the end of the payload object is in the stream? You do so by parsing the stream: if you don't parse the stream you simply can't know where the end of the object is: that's the nature of JSON's syntax (as it is the nature of CL's default syntax). For instance the only way you can know the difference between where to continue after
{x:1}
and after
{x:1.2}
is by parsing the two things.
So you must necessarily parse the whole thing.
So the answer to your question is: you can't do this.
You could (but not, I think, with YASON) decide that you did not want to build an object as a result of the parse. And perhaps, if the stream you are parsing corresponds to something with random access like a string or a file, you could note the start and end positions in the stream to later extract a string from it corresponding to the unparsed data (or you could perhaps build it up as you go).
It looks as if some or all of this might be possible with CL-JSON, but you'd have to work at it.
Unless the objects you are reading are vast the benefit of this seems questionable-to-none. If you really do want to do something like this efficiently you need a serialisation scheme which tells you how long things are.

Parsing large JSON file with Scala and JSON4S

I'm working with Scala in IntelliJ IDEA 15 and trying to parse a large twitter record json file and count the total number of hashtags. I am very new to Scala and the idea of functional programming. Each line in the json file is a json object (representing a tweet). Each line in the file starts like so:
{"in_reply_to_status_id":null,"text":"To my followers sorry..
{"in_reply_to_status_id":null,"text":"#victory","in_reply_to_screen_name"..
{"in_reply_to_status_id":null,"text":"I'm so full I can't move"..
I am most interested in a property called "entities" which contains a property called "hastags" with a list of hashtags. Here is an example:
"entities":{"hashtags":[{"text":"thewayiseeit","indices":[0,13]}],"user_mentions":[],"urls":[]},
I've browsed the various scala frameworks for parsing json and have decided to use json4s. I have the following code in my Scala script.
import org.json4s.native.JsonMethods._
var json: String = ""
for (line <- io.Source.fromFile("twitter38.json").getLines) json += line
val data = parse(json)
My logic here is that I am trying to read each line from twitter38.json into a string and then parse the entire string with parse(). The parse function is throwing an error claiming:
"Type mismatch, expected: Nothing, found:String."
I have seen examples that use parse() on strings that hold json objects such as
val jsontest =
"""{
|"name" : "bob",
|"age" : "50",
|"gender" : "male"
|}
""".stripMargin
val data = parse(jsontest)
but I have received the same error. I am coming from an object oriented programming background, is there something fundamentally wrong with the way I am approaching this problem?
You have most likely incorrectly imported dependencies to your Intellij project or modules into your file. Make sure you have the following lines imported:
import org.json4s.native.JsonMethods._
Even if you correctly import this module, parse(String: json) will not work for you, because you have incorrectly formed a json. Your json String will look like this:
"""{"in_reply_...":"someValue1"}{"in_reply_...":"someValues2"}"""
but should look as follows to be a valid json that can be parsed:
"""{{"in_reply_...":"someValue1"},{"in_reply_...":"someValues2"}}"""
i.e. you need starting and ending brackets for the json, and a comma between each line of tweets. Please read the json4s documenation for more information.
Although being almost 6 years old, I think this question deserves another try.
JSON format has a few misunderstandings in people's minds, especially how they are stored and how they are read back.
JSON documents, are stored as either a single object having all the other fields, or an array of multiple object possibly in same format. this second part is important because arrays in almost every programming language are defined by angle brackets and values separated by commas (note here I used a person object as my single value):
[
{"name":"John","surname":"Doe"},
{"name":"Jane","surname":"Doe"}
]
also note that everything except brackets, numbers and booleans are enclosed in quotes when written into file.
however, there is another use that is not official but preferred to transfer datasets easily where every object, or document as in nosql/mongo language, are stored in a new line like this:
{"name":"John","surname":"Doe"}
{"name":"Jane","surname":"Doe"}
so for the question, OP has a document written in this second form, but tries an algorithm written to read the first form. following code has few simple changes to achieve this, and the user must read the file knowing that:
var json: String = "["
for (line <- io.Source.fromFile("twitter38.json").getLines) json += line + ","
json=json.splitAt(json.length()-1)._1
json+= "]"
val data = parse(json)
PS: although #sbrannon, has the correct idea, the example he/she gave has mistakenly curly braces instead of angle brackets to surround the data.
EDIT: I have added json=json.splitAt(json.length()-1)._1 because the code above ends with a trailing comma which will cause parse error per the JSON format definition.

Is it valid to define functions in JSON results?

Part of a website's JSON response had this (... added for context):
{..., now:function(){return(new Date).getTime()}, ...}
Is adding anonymous functions to JSON valid? I would expect each time you access 'time' to return a different value.
No.
JSON is purely meant to be a data description language. As noted on http://www.json.org, it is a "lightweight data-interchange format." - not a programming language.
Per http://en.wikipedia.org/wiki/JSON, the "basic types" supported are:
Number (integer, real, or floating
point)
String (double-quoted Unicode
with backslash escaping)
Boolean
(true and false)
Array (an ordered
sequence of values, comma-separated
and enclosed in square brackets)
Object (collection of key:value
pairs, comma-separated and enclosed
in curly braces)
null
The problem is that JSON as a data definition language evolved out of JSON as a JavaScript Object Notation. Since Javascript supports eval on JSON, it is legitimate to put JSON code inside JSON (in that use-case). If you're using JSON to pass data remotely, then I would say it is bad practice to put methods in the JSON because you may not have modeled your client-server interaction well. And, further, when wishing to use JSON as a data description language I would say you could get yourself into trouble by embedding methods because some JSON parsers were written with only data description in mind and may not support method definitions in the structure.
Wikipedia JSON entry makes a good case for not including methods in JSON, citing security concerns:
Unless you absolutely trust the source of the text, and you have a need to parse and accept text that is not strictly JSON compliant, you should avoid eval() and use JSON.parse() or another JSON specific parser instead. A JSON parser will recognize only JSON text and will reject other text, which could contain malevolent JavaScript. In browsers that provide native JSON support, JSON parsers are also much faster than eval. It is expected that native JSON support will be included in the next ECMAScript standard.
Let's quote one of the spec's - https://www.rfc-editor.org/rfc/rfc7159#section-12
The The JavaScript Object Notation (JSON) Data Interchange Format Specification states:
JSON is a subset of JavaScript but excludes assignment and invocation.
Since JSON's syntax is borrowed from JavaScript, it is possible to
use that language's "eval()" function to parse JSON texts. This
generally constitutes an unacceptable security risk, since the text
could contain executable code along with data declarations. The same
consideration applies to the use of eval()-like functions in any
other programming language in which JSON texts conform to that
language's syntax.
So all answers which state, that functions are not part of the JSON standard are correct.
The official answer is: No, it is not valid to define functions in JSON results!
The answer could be yes, because "code is data" and "data is code".
Even if JSON is used as a language independent data serialization format, a tunneling of "code" through other types will work.
A JSON string might be used to pass a JS function to the client-side browser for execution.
[{"data":[["1","2"],["3","4"]],"aFunction":"function(){return \"foo bar\";}"}]
This leads to question's like: How to "https://stackoverflow.com/questions/939326/execute-javascript-code-stored-as-a-string".
Be prepared, to raise your "eval() is evil" flag and stick your "do not tunnel functions through JSON" flag next to it.
It is not standard as far as I know. A quick look at http://json.org/ confirms this.
Nope, definitely not.
If you use a decent JSON serializer, it won't let you serialize a function like that. It's a valid OBJECT, but not valid JSON. Whatever that website's intent, it's not sending valid JSON.
JSON explicitly excludes functions because it isn't meant to be a JavaScript-only data
structure (despite the JS in the name).
A short answer is NO...
JSON is a text format that is completely language independent but uses
conventions that are familiar to programmers of the C-family of
languages, including C, C++, C#, Java, JavaScript, Perl, Python, and
many others. These properties make JSON an ideal data-interchange
language.
Look at the reason why:
When exchanging data between a browser and a server, the data can only
be text.
JSON is text, and we can convert any JavaScript object into JSON, and
send JSON to the server.
We can also convert any JSON received from the server into JavaScript
objects.
This way we can work with the data as JavaScript objects, with no
complicated parsing and translations.
But wait...
There is still ways to store your function, it's widely not recommended to that, but still possible:
We said, you can save a string... how about converting your function to a string then?
const data = {func: '()=>"a FUNC"'};
Then you can stringify data using JSON.stringify(data) and then using JSON.parse to parse it (if this step needed)...
And eval to execute a string function (before doing that, just let you know using eval widely not recommended):
eval(data.func)(); //return "a FUNC"
Via using NodeJS (commonJS syntax) I was able to get this type of functionality working, I originally had just a JSON structure inside some external JS file, but I wanted that structure to be more of a Class, with methods that could be decided at run time.
The declaration of 'Executor' in myJSON is not required.
var myJSON = {
"Hello": "World",
"Executor": ""
}
module.exports = {
init: () => { return { ...myJSON, "Executor": (first, last) => { return first + last } } }
}
Function expressions in the JSON are completely possible, just do not forget to wrap it in double quotes. Here is an example taken from noSQL database design:
{
"_id": "_design/testdb",
"views": {
"byName": {
"map": "function(doc){if(doc.name){emit(doc.name,doc.code)}}"
}
}
}
although eval is not recommended, this works:
<!DOCTYPE html>
<html>
<body>
<h2>Convert a string written in JSON format, into a JavaScript function.</h2>
<p id="demo"></p>
<script>
function test(val){return val + " it's OK;}
var someVar = "yup";
var myObj = { "func": "test(someVar);" };
document.getElementById("demo").innerHTML = eval(myObj.func);
</script>
</body>
</html>
Leave the quotes off...
var a = {"b":function(){alert('hello world');} };
a.b();