How to easily change a recurring property name in multiple schemas? - json

To be able to deserialize polymorphic types, I use a type discriminator across many of my JSON objects. E.g., { "$type": "SomeType", "otherProperties": "..." }
For the JSON schemas of concrete types, I specify a const value for "$type".
{
  "type": "object",
  "properties": {
    "$type": { "const": "SomeType" },
    "otherProperties": { "type": "string" }
  }
}
This works, but distributes the chosen "$type" property name throughout many different JSON schemas. In fact, we are considering renaming it to "__type" to play more nicely with BSON.
Could I have prevented having to rename this property in all affected schemas?
I tried searching for a way to load the property name from elsewhere. As far as I can tell $ref only works for property values.

JSON Schema has no way to dynamically load key names from another location, as you're asking. $ref only substitutes values, and here the value differs per schema; it is only the key that you want to load from elsewhere.
While you can't do this with JSON Schema, you could use a templating tool such as Jsonnet. I've seen this work well at scale.
This would require a pre-processing step, but it sounds like that's something you're planning for already, creating some sort of pipeline to generate your schemas.
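For example, a minimal Jsonnet sketch (the file and function names here are just placeholders) would define the discriminator key once and let every concrete schema import it:
// schema_lib.libsonnet - shared library; rename the key here, in one place
local typeKey = '$type';
{
  concrete(typeName, props):: {
    type: 'object',
    properties: { [typeKey]: { const: typeName } } + props,
  },
}
// some_type.schema.jsonnet - one concrete schema
local lib = import 'schema_lib.libsonnet';
lib.concrete('SomeType', {
  otherProperties: { type: 'string' },
})
Running jsonnet some_type.schema.jsonnet then emits the JSON schema shown in the question, and switching from "$type" to "__type" becomes a one-line change in the shared library.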
A word of warning: watch out for existing schema-generation tooling. It is often only good for scaffolding and requires lots of modifications. It sounds like you're building your own, which is likely a better approach.

Related

What is the best practice in REST-Api, to pass structured data or key-value pair?

I have a data structure similar to the one given below, which I am supposed to process. I am designing an API which should accept a POST request similar to the one given below (ignore the headers, etc.).
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": [
    {
      "Header": "Country of origin",
      "Value": "England"
    },
    {
      "Header": "Nature of work",
      "Value": "Secret Agent/Spy"
    }
  ]
}
Somehow I do not feel this is a correct way to pass/accept data. Here I am talking about structured data vs. key-value pairs. While I can extract the fields ("Name", "Id") directly into object attributes, for the key-value pairs I need to loop through the collection and compare against strings (e.g. "Nature of work") to extract values.
I searched a few sites looking for best practices but could not reach any conclusion. Are there any best practices, suggestions, etc.?
I don't think you are going to find any firm, evidence-based arguments against including a list of key-value pairs in your message schema. But that's the sort of thing to look for - people writing about message schema design, how to design messages to support change, and so on.
As a practical matter, there's not a whole lot of difference between
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": [
    {
      "Header": "Country of origin",
      "Value": "England"
    },
    {
      "Header": "Nature of work",
      "Value": "Secret Agent/Spy"
    }
  ]
}
and
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": {
    "Country of origin": "England",
    "Nature of work": "Secret Agent/Spy"
  }
}
In the early days of the World Wide Web, "everything" was key-value pairs, because it was easy to describe a collection of key-value pairs in such a way that a general-purpose component, like a web browser, could work with it (i.e., definitions of HTML forms). It got the job done.
It's usually good to structure your response data the same as what you'd expect the input of the corresponding POST, PUT, and PATCH endpoints to be. That way, altering a record doesn't require the consuming entity to transform the data first. And in that context, arrays of objects with "name"/"value" fields are much easier to write input validation for.
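As a rough illustration (a sketch, not taken from the question; the string types are assumptions), a JSON schema for the name/value array form needs only a single items sub-schema, no matter which headers show up:
{
  "type": "object",
  "properties": {
    "Name": { "type": "string" },
    "Id": { "type": "string" },
    "Message": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "Header": { "type": "string" },
          "Value": { "type": "string" }
        },
        "required": ["Header", "Value"]
      }
    }
  }
}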

Manipulating (nested) JSON keys and their values using NiFi

I am currently facing an issue where I have to read a JSON file that has mostly the same structure, has about 10k+ lines, and is nested.
I thought about creating my own custom processor which reads the JSON and replaces several matching keys/values with the ones needed. As I am trying to use NiFi, I assume there should be a more comfortable way, since the JSON structure itself is mostly consistent.
I already tried using the ReplaceText processor as well as the JoltTransformJson processor, but I could not figure it out. How can I transform both keys and values, if needed? For example, if there is something like this:
{
  "id": "test"
},
{
  "id": "14"
}
It might be necessary to turn "id" into "Number" and map "test" to "3", as I am using different keys/values in my JSON files/database, so they need to fit those. Is there a way of doing so without having to create my own processor?
Regards,
Steve
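For what it's worth, a JoltTransformJson shift spec along these lines (an untested sketch, with the "test" to "3" mapping hard-coded and the field names taken from the example above) can rewrite both the key and the value without a custom processor:
[
  {
    "operation": "shift",
    "spec": {
      "id": {
        "test": { "#3": "Number" },
        "*": { "$": "Number" }
      }
    }
  }
]
The "test": { "#3": "Number" } branch writes the literal "3" to "Number" whenever "id" is "test", while "*": { "$": "Number" } passes any other value of "id" through unchanged under the new key.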

JSON schema when the required properties depend on the value of another property

I have a number of request types I might be sent.
There's a request-type property that may have values "add", "update", "delete" (for example).
Depending on the request type, I will get different properties:
If the request type is "add", then I will get additional properties "add-red", "add-blue", "foo", for example.
If the request type is "update", then "update-xxx", "update-yyy", "update-xxx".
and if "delete" then "foo", "bar"...
Note that some additional properties could appear for more than one request type (see "foo" in the above example)
So I want to validate differently depending on the value of "request-type".
I tried to do this:
"oneOf": [
  {
    ...
    "properties": {
      "request-type": { "enum": ["add"] },
      "add-red": { ... }
    }
  },
  {
    ...
    "properties": {
      "request-type": { "enum": ["update"] },
      "update-xxx": { ... }
    }
  }
]
In the hope that the validator would match on the value of "request-type" when deciding which of the "oneOf" alternatives is selected.
This appears itself to be "valid" (in that the VS Code validator thinks it's a valid schema), but it doesn't do what I want - it seems that when I write the corresponding JSON it always matches the first alternative, and will only accept "add" as the value of "request-type".
So how should I do this? I can define the JSON format here, so I can require the use of something I can validate somehow.
It's nearly a duplicate of this: JSON schema anyOf validation based on one of properties except I think the answer there requires distinct sets of additional properties for each request type.
EDIT: According to the answer to validation of json schema having oneOf keyword, it looks like my approach should work, so maybe this is just a limitation of the IntelliSense in VS Code?
EDIT2: And this gives another approach: writing more complex json schemas that have dependencies upon other keys
I'll have to experiment some more and maybe end up deleting this!
Answering my own question - the approach in the question works fine. Using a validator like http://www.jsonschemavalidator.net/, I get the behaviour I expect.
It's just that Visual Studio Code's IntelliSense can't interpret it in a way that lets it provide useful guidance (and to be fair, it's a difficult problem, as it means partially matching all of the alternatives in the "oneOf" to see which ones might still be valid).
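For reference, the working shape looks roughly like this (a sketch; the per-type sub-schemas are assumptions filled in for illustration):
{
  "type": "object",
  "required": ["request-type"],
  "oneOf": [
    {
      "properties": {
        "request-type": { "enum": ["add"] },
        "add-red": { "type": "string" },
        "add-blue": { "type": "string" },
        "foo": { "type": "string" }
      }
    },
    {
      "properties": {
        "request-type": { "enum": ["update"] },
        "update-xxx": { "type": "string" },
        "update-yyy": { "type": "string" }
      }
    },
    {
      "properties": {
        "request-type": { "enum": ["delete"] },
        "foo": { "type": "string" },
        "bar": { "type": "string" }
      }
    }
  ]
}
Because each branch constrains "request-type" to a different single value, exactly one branch can match any given document, which is what "oneOf" requires.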

Schemaless Support for Elastic Search Queries

Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.
Consider this example document:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.
What is the best way to support search for this?
We thought a solution would be to just not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked and there were no performance problems with this approach. However, after running multiple tests, we haven't been able to get that to work.
Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.
Since this is not currently working for us, we’ve thought of a couple alternative solutions:
Reindexing: this would be costly as we would need to reindex every index that contains that document and do so every time a user updates a property with a different value type. Really bad for performance, so this is likely not a real option.
Use multi-match query: we would do this by appending a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
This means ES would create a new mapping for each 'random' field, and we would use a phrase multi_match query with a "starts with" wildcard for the field names when performing the queries. For example:
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
  "query": {
    "multi_match": {
      "query": "red",
      "type": "phrase",
      "fields": ["customData_*.favoriteColor"]
    }
  }
}'
This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?
This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
Any suggestions about any of this would be much appreciated.
Thanks!
You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers field type primitives based upon the first value it sees for that field. Therefore your non-deterministic customData object can get you in trouble if you first see "favoriteColor": 10 followed by "favoriteColor": "red".
For your requirements, you should take a look at SIREn Solutions Elasticsearch plugin which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.
Fields with the same mapping will be stored as the same Lucene field in the Lucene index (Elasticsearch shard). Different Lucene fields have separate inverted indexes (term dictionary and index entries) and separate doc values. Lucene is highly optimized to store documents with the same fields in a compressed way; using a mapping with a different field for each document prevents Lucene from doing its optimizations.
You should use Elasticsearch nested documents to search efficiently. The underlying technology is Lucene BlockJoin, which indexes parent/child documents as a single document block.
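As a rough sketch (assuming a recent Elasticsearch version; the key/value field names are invented, the index name is reused from the example above, and flattening nested paths into dotted key strings is just one option), that means reshaping customData into a fixed key/value form and mapping it as nested:
curl -XPUT 'eshost:9200/test' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "customData": {
        "type": "nested",
        "properties": {
          "key": { "type": "keyword" },
          "value": { "type": "keyword" }
        }
      }
    }
  }
}'
A document is then indexed with "customData": [ { "key": "favoriteColor", "value": "red" }, { "key": "someObject.someKey", "value": "someValue" } ] and searched with a nested query:
curl -XPOST 'eshost:9200/test/_search?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "nested": {
      "path": "customData",
      "query": {
        "bool": {
          "must": [
            { "term": { "customData.key": "favoriteColor" } },
            { "term": { "customData.value": "red" } }
          ]
        }
      }
    }
  }
}'
The trade-off is a single, fixed inner mapping no matter how many distinct custom fields users invent.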

Is it possible to create Avro schema for an array of maps?

I want to serialize a JSON object that has a potentially variable number of keys representing cell phone device IDs (due to differences between Android and iPhone). The JSON object might look like this, for example (for Android):
"deviceids": {
  "openudid": "",
  "androidid": "dcbfXXXXXc2d5f",
  "imei": "3533XXXXX941712"
}
whereas an iPhone looks like this:
"deviceids": {
  "openudid": "37368a5361XXXXXXXXXXdaedc186b4acf4cd4",
  "ifv": "BD87ECBF-XXXXXXXXXX-DDF46E18129A",
  "mac": "XXXXXXXXXX",
  "odin": "2f5672cXXXXXXXXXX2022a5349939a2d7b952",
  "ifa": "82F7B2AXXXXXXXXXX5-A2DADA99D05B"
}
In Avro, I was thinking a schema like this could account for the differences:
{
  "name": "deviceids",
  "type": "record",
  "fields": [
    {
      "type": "array",
      "items": {
        "type": "map",
        "values": "string"
      }
    }
  ]
}
Is this a valid Avro schema?
Yes, a map is a valid type for an array. Your particular schema is not legal, however; it should be:
{
  "name": "deviceids",
  "type": "record",
  "fields": [
    {
      "name": "arrayOfMaps",
      "type": {
        "type": "array",
        "items": {
          "type": "map",
          "values": "string"
        }
      }
    }
  ]
}
That is, the fields of your record must be named, and the type definitions for the array and the map both have to be full definitions giving both the outer complex type (array/map) and the contained type.
Since it can be hard sometimes to answer specific Avro questions based on the available documentation and repository of examples, the easiest way to answer this sort of question is probably to just try to compile it using the Avro tools jar, which can be found alongside the regular jars in the Avro releases.
java -jar avro-tools-1.7.5.jar compile schema /path/to/schema .
This will quickly resolve the concern over whether or not it is valid. If this still doesn't resolve the issue, the Avro mailing lists seem fairly active.