Avro Json.ObjectWriter - "Not the Json schema" error

I'm writing a tool to convert data from a homegrown format to Avro, JSON and Parquet, using Avro 1.8.0. Conversion to Avro and Parquet is working okay, but JSON conversion throws the following error:
Exception in thread "main" java.lang.RuntimeException: Not the Json schema:
{"type":"record","name":"Torperf","namespace":"converTor.torperf",
"fields":[{"name":"descriptor_type","type":"string","
[... rest of the schema omitted for brevity]
Irritatingly, this is exactly the schema that I passed in and that I want the converter to use. I have no idea what Avro is complaining about.
This is the relevant snippet of my code:
// parse the schema file
Schema.Parser parser = new Schema.Parser();
Schema mySchema;
// tried two ways to load the schema
// like this
File schemaFile = new File("myJsonSchema.avsc");
mySchema = parser.parse(schemaFile);
// and also the way Json.class loads its schema
mySchema = parser.parse(Json.class.getResourceAsStream("myJsonSchema.avsc"));
// initialize the writer
Json.ObjectWriter jsonDatumWriter = new Json.ObjectWriter();
jsonDatumWriter.setSchema(mySchema);
OutputStream out = new FileOutputStream(new File("output.avro"));
Encoder encoder = EncoderFactory.get().jsonEncoder(mySchema, out);
// append a record created by way of a specific mapping
jsonDatumWriter.write(specificRecord, encoder);
I replaced myJsonSchema.avsc with the one returned in the exception, without success (apart from whitespace and linefeeds they are identical). Initializing the jsonEncoder with org.apache.avro.data.Json.SCHEMA instead of mySchema didn't change anything either. Replacing the schema passed to Json.ObjectWriter with org.apache.avro.data.Json.SCHEMA leads to a NullPointerException at org.apache.avro.data.Json.write(Json.java:183) (which is a deprecated method).
From staring at org.apache.avro.data.Json.java it seems to me like Avro is checking my record schema against its own schema of a Json record (line 58) for equality (line 73).
58 SCHEMA = Schema.parse(Json.class.getResourceAsStream("/org/apache/avro/data/Json.avsc"));
72 public void setSchema(Schema schema) {
73 if(!Json.SCHEMA.equals(schema))
74 throw new RuntimeException("Not the Json schema: " + schema);
75 }
The referenced Json.avsc defines the field types of a record:
{"type": "record", "name": "Json", "namespace":"org.apache.avro.data",
"fields": [
{"name": "value",
"type": [
"long",
"double",
"string",
"boolean",
"null",
{"type": "array", "items": "Json"},
{"type": "map", "values": "Json"}
]
}
]
}
equals is implemented in org.apache.avro.Schema, line 346:
public boolean equals(Object o) {
    if (o == this) {
        return true;
    } else if (!(o instanceof Schema)) {
        return false;
    } else {
        Schema that = (Schema) o;
        return this.type != that.type ? false : this.equalCachedHash(that) && this.props.equals(that.props);
    }
}
I don't fully understand what's going on in the third check (especially equalCachedHash()), but the checks I do recognize test for equality in a trivial way, which doesn't make sense to me.
Also I can't find any examples or notes about usage of Avro's Json.ObjectWriter on the InterWebs. I wonder if I should go with the deprecated Json.Writer instead because there are at least a few code snippets online to learn and glean from.
The full source is available at https://github.com/tomlurge/converTor
Thanks,
Thomas

A little more debugging proved that passing org.apache.avro.data.Json.SCHEMA to Json.ObjectWriter is indeed the right thing to do. The object I get back, written to System.out, prints the JSON object that I expect. The NullPointerException, though, did not go away.
I probably wouldn't have needed to call setSchema() on Json.ObjectWriter at all, since omitting the call altogether leads to the same NullPointerException.
I finally filed a bug with Avro, and it turned out that my code was handing an object of type "specific" to ObjectWriter, which it couldn't handle. It returned silently, though, and an error was thrown only at a later stage. That was fixed in Avro 1.8.1 - see https://issues.apache.org/jira/browse/AVRO-1807 for details.
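For anyone stuck on 1.8.0 in the meantime: one way to get JSON output from a specific record without going through Json.ObjectWriter at all is to pair a SpecificDatumWriter with a JsonEncoder built from the record's own schema. A minimal sketch under that assumption (the class and file names are placeholders, not from the project above):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.avro.specific.SpecificRecordBase;

public class JsonDump {
    // Writes a generated specific record as JSON, using the record's own
    // schema rather than org.apache.avro.data.Json.SCHEMA.
    static <T extends SpecificRecordBase> void writeAsJson(T record, String path)
            throws IOException {
        Schema schema = record.getSchema();
        try (OutputStream out = new FileOutputStream(path)) {
            JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
            new SpecificDatumWriter<T>(schema).write(record, encoder);
            encoder.flush();
        }
    }
}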

Related

Newtonsoft JSON Schema - $ref is resolved, ignoring required though

I externalized a portion of my json schema into a separate schema file.
For example: "$ref": "http://schema.company.com/boroughschema.json"
Within this schema I specify required properties, when validating a known bad json file, it doesn't complain that a required property is missing.
"required": [
"Name",
"Representative",
"District"
]
I purposely leave off "District" in the source json, and there are no complaints when validating.
Using Newtonsoft.Json.Schema 3.0.11.
The original schema validates just fine, if I move the schema portion to a definitions that works as well.
private bool ValidateViaExternalReferences(string jsonstring)
{
    JSchemaPreloadedResolver resolver = new JSchemaPreloadedResolver();
    // load schema
    var schemaText = System.IO.File.ReadAllText(COMPLEXSCHEMAFILE);
    // Rather than rely on 100% http access, use a resolver and
    // preload schema with http://schema.company.com/boroughschema.json
    var schemaTextBorough = System.IO.File.ReadAllText(BOROUGHSCHEMAFILE);
    resolver.Add(new Uri("http://schema.company.com/boroughschema.json"), schemaTextBorough);
    JSchema schema = JSchema.Parse(schemaText, resolver);
    JToken json = JToken.Parse(jsonstring);
    // validate json
    IList<ValidationError> errors;
    bool valid = json.IsValid(schema, out errors);
    if (!valid)
    {
        foreach (var validationerr in errors)
        {
            Append2Log(validationerr.ToString());
        }
    }
    return valid;
}
Missing "District" yields no errors, I expect the same correct behavior when using the original schema.
I had placed the
"required": [
    "Name",
    "Representative",
    "District"
]
fragment in the wrong place.
I apologize, it behaves as expected.
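For anyone hitting the same symptom: "required" is only honored when it sits on the object schema itself, as a sibling of "properties", not nested inside a property definition. A minimal sketch of the working layout (the property types are made up):

{
  "type": "object",
  "properties": {
    "Name": { "type": "string" },
    "Representative": { "type": "string" },
    "District": { "type": "string" }
  },
  "required": [ "Name", "Representative", "District" ]
}

With this layout, leaving "District" out of the source JSON produces the expected validation error.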

ZonedDateTime Custom JSON Converter Grails 3.3.0

I am in the process of converting a really old Grails app to the latest version (3.3.0). Things have been a bit frustrating, but I'm pretty close to migrating everything except the JSON and XML marshallers which were previously registered in my BootStrap init.
The previous marshaller registering looked like this:
// register JSON marshallers at startup in all environments
MarshallerUtils.registerMarshallers()
This was defined like this:
class MarshallerUtils {
    // Registers marshaller logic for various types that
    // aren't supported out of the box or that we want to customize.
    // These are used whenever the JSON or XML converters are called,
    // e.g. return model as JSON
    static registerMarshallers() {
        final dateTimeFormatter = ISODateTimeFormat.dateTimeNoMillis()
        final isoDateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ")
        // register marshalling logic for both XML and JSON converters
        [XML, JSON].each { converter ->
            // This overrides the marshaller from the joda time plugin to
            // force all DateTime instances to use the UTC time zone
            // and the ISO standard "yyyy-mm-ddThh:mm:ssZ" format
            converter.registerObjectMarshaller(DateTime, 10) { DateTime it ->
                return it == null ? null : it.toString(dateTimeFormatter.withZone(org.joda.time.DateTimeZone.UTC))
            }
            converter.registerObjectMarshaller(Date, 10) { Date it ->
                return it == null ? null : isoDateFormat.format(it)
            }
            converter.registerObjectMarshaller(TIMESTAMP, 10) { TIMESTAMP it ->
                return it == null ? null : isoDateFormat.format(it.dateValue())
            }
        }
    }
}
During the migration, I ended up converting all instances of org.joda.time.DateTime to java.time.ZonedDateTime:
class MarshallerUtils {
    // Registers marshaller logic for various types that
    // aren't supported out of the box or that we want to customize.
    // These are used whenever the JSON or XML converters are called,
    // e.g. return model as JSON
    static registerMarshallers() {
        final dateTimeFormatter = DateTimeFormatter.ISO_ZONED_DATE_TIME
        final isoDateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZ")
        // register marshalling logic for both XML and JSON converters
        [XML, JSON].each { converter ->
            // This overrides the marshaller from java.time to
            // force all ZonedDateTime instances to use the UTC time zone
            // and the ISO standard "yyyy-mm-ddThh:mm:ssZ" format
            converter.registerObjectMarshaller(ZonedDateTime, 10) { ZonedDateTime it ->
                return it == null ? null : it.format(dateTimeFormatter.withZone(ZoneId.of("UTC")))
            }
            converter.registerObjectMarshaller(Date, 10) { Date it ->
                return it == null ? null : isoDateFormat.format(it)
            }
            converter.registerObjectMarshaller(TIMESTAMP, 10) { TIMESTAMP it ->
                return it == null ? null : isoDateFormat.format(it.dateValue())
            }
        }
    }
}
Unfortunately, after the upgrade to Grails 3.3.0, this marshaller registration doesn't seem to be used at all, no matter what I try.
I do know that there is a new "JSON Views" way of doing things, but this particular service has many endpoints, and I don't want to write custom converters and ".gson" templates for all of them if everything is already in the format I need. I just need the responses to be in JSON and the dates to behave properly (be formatted as strings).
Instead, what I am finding (compared to the previous behavior) is that the properties which use ZonedDateTime are "exploded" in my JSON output. There is an insane amount of unneeded date-object information, and it is not formatted as a simple string as I expect.
I have tried a few things (mostly per recommendations in the latest official Grails documentation):
Custom Converters
Default Date Format
Adding configurations for grails views in application.yml:
views:
    json:
        generator:
            dateFormat: "yyyy-MM-dd'T'HH:mm:ss.SSSZ"
            locale: "en/US"
            timeZone: "GMT"
Creating this path under "src":
src/main/resources/META-INF/services/grails.plugin.json.builder.JsonGenerator$Converter
And adding a Converter for my domain class, which is named in the services file above:
class MultibeamFileConverter implements JsonGenerator.Converter {
    final DateTimeFormatter isoDateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ").withZone(ZoneId.of("UTC"))

    @Override
    boolean handles(Class<?> type) {
        MultibeamFile.isAssignableFrom(type)
    }

    @Override
    Object convert(Object value, String key) {
        MultibeamFile multibeamFile = (MultibeamFile) value
        multibeamFile.startTime.format(isoDateFormat)
        multibeamFile.endTime.format(isoDateFormat)
        return multibeamFile
    }
}
In my controller, I have changed:
return multibeamCatalogService.findFiles(cmd, params)
To this (in order to get JSON output in the browser as before):
respond multibeamCatalogService.findFiles(cmd, params), formats: ['json', 'xml']
Unfortunately, most permutations of the above that I can think to try have resulted in errors such as "Could not resolve view". Otherwise, when I do get a response, the major issue is that the date is not formatted as a string. This job was previously done by the marshaller.
I am getting pretty frustrated. Can someone please tell me how to format ZonedDateTime as a simple string (e.g. "2009-06-21T00:00:00Z") in my JSON output instead of a giant object like the one below? Simply converting to java.util.Date causes the "Could not resolve view" error to show up again; that route then expects me to make a ".gson" view, which either never shows the format I expect or comes back empty.
"startTime": {
"dayOfMonth": 26,
"dayOfWeek": {
"enumType": "java.time.DayOfWeek",
"name": "FRIDAY"
},
"dayOfYear": 207,
"hour": 0,
"minute": 0,
"month": {
"enumType": "java.time.Month",
"name": "JULY"
},
"monthValue": 7,
"nano": 0,
"offset": {
"id": "-06:00",
"rules": {
"fixedOffset": true,
"transitionRules": [],
"transitions": []
},
"totalSeconds": -21600
}, ... // AND SO ON FOR_EVAH
The simple answer is: to format a ZonedDateTime object, you call .format(DateTimeFormatter) on it. Which format depends on what you want; you can specify your own pattern or use one of the predefined formatters in DateTimeFormatter.
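For instance, a quick sketch in plain Java (the values are made up) that produces the "2009-06-21T00:00:00Z" shape asked for above:

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class ZdtFormatDemo {
    public static void main(String[] args) {
        ZonedDateTime zdt = ZonedDateTime.of(2009, 6, 21, 0, 0, 0, 0, ZoneOffset.UTC);
        // ISO_INSTANT renders the instant in UTC: 2009-06-21T00:00:00Z
        System.out.println(DateTimeFormatter.ISO_INSTANT.format(zdt));
        // Or with an explicit pattern, after normalizing to UTC:
        System.out.println(zdt.withZoneSameInstant(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'")));
    }
}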
Though I too would love to know if there's an easy way to say "for every endpoint, display it as json". The only way I've found so far is to have this in every controller class, which isn't too bad but seems silly. I'm using respond followed by a return in my controller methods.
static responseFormats = ['json'] // This is needed for grails to indicate what format to use for respond.
I still see the "Could not resolve view" error logged for every endpoint I hit, but the REST API still appears to work.
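Building on the converter setup from the question, one more angle that may be worth trying: register a converter for ZonedDateTime itself (through the same META-INF/services file) so the generator is handed a ready-made string. A sketch under that assumption, reusing the handles/convert interface shown above (the class name is made up):

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

import grails.plugin.json.builder.JsonGenerator;

public class ZonedDateTimeJsonConverter implements JsonGenerator.Converter {

    @Override
    public boolean handles(Class<?> type) {
        return ZonedDateTime.class.isAssignableFrom(type);
    }

    @Override
    public Object convert(Object value, String key) {
        // Returning a String makes the generator emit "2009-06-21T00:00:00Z"
        // instead of walking the whole ZonedDateTime object graph.
        return DateTimeFormatter.ISO_INSTANT
                .format(((ZonedDateTime) value).withZoneSameInstant(ZoneOffset.UTC));
    }
}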

How to intercept map getProperty and list getAt?

I'm scraping external sources, mostly JSON. I'm using new JsonSlurper().parse(body) to parse them, and I operate on them using constructs like def name = json.user[0].name. Being external, these sources can change without notice, so I want to be able to detect this and log it somehow.
After reading a bit about the MOP I thought that I could change the appropriate methods of the maps and lists to log if a property is missing. I only want to do that on the json object and its properties, recursively. The thing is, I don't know how to do that.
Or, is there a better way to accomplish all this?
[EDIT] For example if I get this JSON:
def json = '''{
    "owners": [
        {
            "firstName": "John",
            "lastName": "Smith"
        },
        {
            "firstName": "Jane",
            "lastName": "Smith"
        }
    ]
}'''
def data = new groovy.json.JsonSlurper().parse(json.bytes)
assert data.owners[0].firstName == 'John'
However, if they change "owners" to "ownerInfo", the above access would throw an NPE. What I want is to intercept the access and do something (like log it in a special log, or whatever). I could also decide to throw a more specialized exception.
I don't want to catch NullPointerException, because it may be caused by some bug in my code instead of the changed data format. Besides, if they changed "firstName" to "givenName" but kept the "owners" name, I'd just get a null value, not an NPE. Ideally I want to detect this case as well.
I also don't want to sprinkle lots of ifs or Elvis operators, if possible.
I actually managed to intercept that for maps:
data.getMetaClass().getProperty = {name -> println ">>> $name"; delegate.get(name)}
assert data.owners // this prints ">>> owners"
I still can't find out how to do that for the list:
def owners = data.owners
owners.getMetaClass().getAt = { o -> println "]]] $o"; delegate.getAt(o) }
assert owners[0] // this doesn't print anything
Try this
owners.getMetaClass().getAt = { Integer o -> println "]]] $o"; delegate.get(o)}
I'm only guessing that it got lost because of the multiple overloaded getAt() methods, so you have to declare the type. I also delegated to ArrayList's Java get() method, since calling getAt() resulted in recursive calls.
If you want more control over all method calls, you could always do this:
owners.getMetaClass().invokeMethod = { String methodName, Object[] args ->
    if (methodName == "getAt") {
        println "]]] $args"
    }
    return ArrayList.getMetaClass().getMetaMethod(methodName, args).invoke(delegate, args)
}
The short answer is that you can't do this with the given example. The reason is that the owners object is a java.util.ArrayList, and you are calling the get(int index) method on it. The metaClass object is specific to Groovy; if a Java object makes a call to a Java method, it has no knowledge of the metaClass. Here's a somewhat related question.
The good news is that there is an option, although I'm not sure if it works for your use case. You can create a Groovy wrapper object for this list, so that you can capture method calls.
For example, you could change your code from this
def owners = data.owners
to this
def owners = new GroovyList(data.owners)
and then create this new class
class GroovyList {
    private List list

    public GroovyList(List list) {
        this.list = list
    }

    public Object getAt(int index) {
        println "]]] $index"
        list.getAt(index)
    }
}
Now when you call
owners[0]
you'll get the output
]]] 0
[firstName:John, lastName:Smith]

jackson jsonparser restart parsing in broken JSON

I am using Jackson to process JSON that comes in chunks in Hadoop. That means they are big files that are cut up into blocks (128M in my problem, but it doesn't really matter).
For efficiency reasons, I need it to be streaming (not possible to build the whole tree in memory).
I am using a mixture of JsonParser and ObjectMapper to read from my input.
At the moment, I am using a custom InputFormat that is not splittable, so I can read my whole JSON.
The structure of the (valid) JSON is something like:
[ { "Rep":
{
"date":"2013-07-26 00:00:00",
"TBook":
[
{
"TBookC":"ABCD",
"Records":
[
{"TSSName":"AAA",
...
},
{"TSSName":"AAB",
...
},
{"TSSName":"ZZZ",
...
}
] } ] } } ]
The records I want to read in my RecordReader are the elements inside the "Records" element. The "..." means that there is more info there, which makes up my record.
If I have only one split, there is no problem at all.
I use a JsonParser for fine-grained work (reading the headers and moving to the "Records" token), and then I use ObjectMapper and JsonParser together to read records as objects. For details:
MappingJsonFactory factory = new MappingJsonFactory();
factory.configure(JsonParser.Feature.AUTO_CLOSE_SOURCE, false);
mapper = new ObjectMapper(factory);
mapper.configure(Feature.FAIL_ON_UNKNOWN_PROPERTIES, false);
mapper.configure(SerializationConfig.Feature.FAIL_ON_EMPTY_BEANS, false);
parser = factory.createJsonParser(iStream);
mapper.readValue(parser, JsonNode.class);
Now, let's imagine I have a file with two input splits (i.e. there are a lot of elements in "Records").
The valid JSON starts on the first split, and I read and keep the headers (which I need for each record, in this case the "date" field).
The split would cut anywhere in the Records array. So let's assume I get a second split like this:
...
},
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
}
] } ] } } ]
I can check before I start parsing and move the InputStream (FSDataInputStream) to the beginning ("{") of the record with the next "TSSName" in it (and this works OK). It's fine to discard the "garbage" at the beginning. So we get this:
{"TSSName":"ZZZ",
...
},
{"TSSName":"ZZZ2",
...
},
...
] } ] } } ]
Then I hand it to the JsonParser/ObjectMapper pair seen above.
The first object "ZZZ" is read OK.
But for the next "ZZZ2", it breaks: the JsonParser complains about malformed JSON. It encounters a "," that is not inside an array. So it fails. And then I cannot keep reading my records.
How could this problem be solved, so that I can still read my records from the second (and nth) split? How could I make the parser ignore these errors on the commas, or else let the parser know in advance that it's reading the contents of an array?
It seems it's OK just catching the exception: the parser goes on and is able to keep reading objects via the ObjectMapper.
I don't really like it - I would like an option where the parser would not throw exceptions on nonstandard or even bad JSON. So I don't know if this fully answers the question, but I hope it helps.
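For what it's worth, a minimal sketch of that catch-and-continue idea, using the same Jackson 1.x classes as the snippet above (handle() stands in for whatever the RecordReader does with each record):

import java.io.IOException;

import org.codehaus.jackson.JsonNode;
import org.codehaus.jackson.JsonParseException;
import org.codehaus.jackson.JsonParser;
import org.codehaus.jackson.JsonToken;
import org.codehaus.jackson.map.ObjectMapper;

public class SplitRecordScanner {
    // Reads record objects from a parser positioned somewhere inside the
    // Records array, swallowing the parse error caused by the stray ","
    // at the split boundary.
    static void readRecords(JsonParser parser, ObjectMapper mapper) throws IOException {
        while (true) {
            try {
                JsonToken token = parser.nextToken();
                if (token == null) break;                  // end of this split
                if (token != JsonToken.START_OBJECT) continue;
                JsonNode record = mapper.readValue(parser, JsonNode.class);
                handle(record);                            // hypothetical consumer
            } catch (JsonParseException e) {
                // The stray "," lands here; as observed above, the parser
                // can keep going afterwards.
            }
        }
    }

    static void handle(JsonNode record) {
        System.out.println(record);
    }
}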

How to report parsing errors when using JSON.parseFull with Scala

When my app is fed syntactically incorrect JSON I want to be able to report the error to the user with some useful detail that will allow the problem area to be located.
So in this example j will be None because of the trailing comma after "File1". Is there a way to obtain the details of the last parse error?
val badSyntax = """
{
"menu1": {
"id": "file1",
"value": "File1",
},
"menu2": {
"id": "file2",
"value": "File2",
}
}"""
val j = JSON.parseFull(badSyntax)
When you get a parse error, use JSON.lastNoSuccess to get the last error. It is of type JSON.NoSuccess, of which there are two subclasses, JSON.Error and JSON.Failure, both containing a msg: String member detailing the error.
Note that JSON.lastNoSuccess is not thread-safe (it is a mere global variable) and is now deprecated (bound to disappear in Scala 2.11).
UPDATE: Apparently, I was wrong about it not being thread-safe: it was indeed not thread-safe before Scala 2.10, but now lastNoSuccess is backed by a thread-local variable (and is thus safe to use in a multi-threaded context).
After seeing this, you'd be forgiven for thinking that as long as you read it right after a parsing failure, in the same thread as the one that did the parsing (the thread where you called parseFull), everything will work as expected. Unfortunately, during this refactor they also changed how lastNoSuccess is used internally inside Parsers.phrase (which is called by JSON.parseFull).
See https://github.com/scala/scala/commit/1a4faa92faaf495f75a6dd2c58376f6bb3fbf44c
Since this refactor, lastNoSuccess is reset to None at the end of Parsers.phrase. This is no problem for parsers in general, as lastNoSuccess is used as a temporary value that is returned as the result of Parsers.phrase anyway.
The problem here is that we don't call Parsers.phrase, but JSON.parseFull, which drops any error info (see case None => None inside method JSON.parseRaw at https://github.com/scala/scala/blob/v2.10.0/src/library/scala/util/parsing/json/JSON.scala).
The fact that JSON.parseFull drops any error info could easily be circumvented prior to Scala 2.10 by directly reading JSON.lastNoSuccess as I advised above, but now that this value is reset at the end of Parsers.phrase, there is not much you can do to get the error information out of JSON.
Any solution? Yes. What you can do is create your very own version of JSON, one that will not drop the error information:
import util.parsing.json._
object MyJSON extends Parser {
  def parseRaw(input: String): Either[NoSuccess, JSONType] = {
    phrase(root)(new lexical.Scanner(input)) match {
      case Success(result, _) => Right(result)
      case ns: NoSuccess => Left(ns)
    }
  }

  def parseFull(input: String): Either[NoSuccess, Any] = {
    parseRaw(input).right.map(resolveType)
  }

  def resolveType(input: Any): Any = input match {
    case JSONObject(data) => data.transform {
      case (k, v) => resolveType(v)
    }
    case JSONArray(data) => data.map(resolveType)
    case x => x
  }
}
I just changed Option to Either as the result type, so that I can return parsing errors as a Left. Some tests in the REPL:
scala> MyJSON.parseFull("[1,2,3]")
res11: Either[MyJSON.NoSuccess,Any] = Right(List(1.0, 2.0, 3.0))
scala> MyJSON.parseFull("[1,2,3")
res12: Either[MyJSON.NoSuccess,Any] =
Left([1.7] failure: end of input
[1,2,3
^)