Currently I have a Kafka topic to which I'm writing the following JSON messages:
{
  "messageType": "NEW",
  "timestamp": 1656582818024,
  "fieldId": 266,
  "number": 9835,
  "contains": [
    "56644630997",
    "06014134231",
    "06014134231"
  ]
}
Please note that I can write an arbitrary number of strings in the "contains" list (one time it could be empty or populated with nulls; other times it could have thousands of strings).
Each message is then written in Parquet format to S3 storage using Confluent's S3 connector.
The connector works fine reading JSON schemas from the Confluent Schema Registry.
My problem is that I don't understand how to build the JSON schema for this particular message, since I don't know how to handle arrays.
Here is my current attempt, which isn't working:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "my_messages",
  "type": "object",
  "properties": {
    "messageType": {
      "type": ["string", "null"]
    },
    "timestamp": {
      "type": ["integer", "null"]
    },
    "fieldId": {
      "type": ["integer", "null"]
    },
    "number": {
      "type": ["integer", "null"]
    },
    "contains": {
      "type": ["array", "null"],
      "items": [
        {
          "type": ["string", "null"]
        }
      ]
    }
  },
  "additionalProperties": true
}
I'm fairly sure my Kafka connector is breaking because the "contains" array is not described correctly in my schema.
Here is the error I'm getting:
Caused by: org.apache.kafka.connect.errors.DataException: Array schema did not specify the items type
Lastly, if you have a link to comprehensive JSON schema documentation, that would be great.
Thank you for any help.
If you want all elements of the "contains" array to be validated against the same subschema, remove the extra array nesting around it. In draft-04, wrapping the items subschema in an array turns it into tuple validation, where each subschema applies only to the item at the same index; the connector needs a single schema that covers every item, hence the "Array schema did not specify the items type" error. That is, turn this:
"contains": {
"type": ["array", "null"],
"items": [
{
"type": ["string", "null"]
}
]
}
into this:
"contains": {
"type": ["array", "null"],
"items": {
"type": ["string", "null"]
}
}
Documentation is available here: https://json-schema.org/understanding-json-schema/reference/array.html
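As a quick local sanity check (a sketch assuming the Python jsonschema package, which is not part of the connector setup), the corrected schema accepts the sample message:

# Sketch: validate the sample message against the corrected schema using
# the Python jsonschema package (an assumption; any draft-04 validator works).
from jsonschema import Draft4Validator

schema = {
    "type": "object",
    "properties": {
        "messageType": {"type": ["string", "null"]},
        "timestamp": {"type": ["integer", "null"]},
        "fieldId": {"type": ["integer", "null"]},
        "number": {"type": ["integer", "null"]},
        "contains": {
            "type": ["array", "null"],
            # Single-schema form: applied to every element of the array.
            "items": {"type": ["string", "null"]},
        },
    },
}

message = {
    "messageType": "NEW",
    "timestamp": 1656582818024,
    "fieldId": 266,
    "number": 9835,
    "contains": ["56644630997", "06014134231", "06014134231"],
}

Draft4Validator(schema).validate(message)  # raises on failure; passes here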
I am using HTTP + Swagger to make an API call (hosted in an APIM gateway) in a Logic App.
In a Logic App we can use the Parse JSON action to parse a payload against a schema, which gives us access to the individual elements of the JSON payload (which we can use in downstream logic). But we have to manually paste the schema into the schema field of the Parse JSON action.
I want to automate the process. I am looking for a way to store the JSON schema in a variable and then parse my JSON against that variable (schema). This would save me from manually modifying the Logic App whenever the schema changes.
Any help is much appreciated.
Thanks,
You don't really have to write the JSON schema manually; you just need a sample of the JSON that you want to parse. Consider this sample:
[
  {
    "document": "A",
    "min": 7500001,
    "policy": "X"
  },
  {
    "document": "B",
    "min": 7500001,
    "policy": "Y"
  },
  {
    "document": "C",
    "min": 7500001,
    "policy": "Z"
  },
  {
    "document": "D",
    "min": 7500002,
    "policy": "X"
  }
]
So you just need to use Use sample payload to generate schema in the Parse JSON action, and the schema will be generated automatically.
Below is the generated schema:
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "document": {
        "type": "string"
      },
      "min": {
        "type": "integer"
      },
      "policy": {
        "type": "string"
      }
    },
    "required": [
      "document",
      "min",
      "policy"
    ]
  }
}
Updated Answer
This approach keeps working even when the JSON changes: just remove the required keyword from the schema, and validation will still pass when you use the same parameters as before.
For example, I have changed my sample to
[
  {
    "document": "A",
    "min": 7500001,
    "sample": "test1"
  },
  {
    "document": "B",
    "min": 7500001,
    "sample": "test2"
  },
  {
    "document": "C",
    "min": 7500001,
    "policy": "Z"
  },
  {
    "document": "D",
    "min": 7500002,
    "policy": "X"
  }
]
If the schema is kept the same as before, this sample fails schema validation (the first two objects are missing the required policy property),
but if you just remove required from the schema, i.e., use
{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "document": {
        "type": "string"
      },
      "min": {
        "type": "integer"
      },
      "policy": {
        "type": "string"
      }
    }
  }
}
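then validation succeeds for both the old and the new samples. To verify this behaviour outside the Logic App, here is a small sketch; it assumes the Python jsonschema package and is only for local testing, not something the Logic App runs:

# Sketch: with "required" removed, both the old and new samples validate.
from jsonschema import Draft7Validator

relaxed_schema = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "document": {"type": "string"},
            "min": {"type": "integer"},
            "policy": {"type": "string"},
        },
    },
}

old_sample = [{"document": "A", "min": 7500001, "policy": "X"}]
new_sample = [{"document": "A", "min": 7500001, "sample": "test1"}]

validator = Draft7Validator(relaxed_schema)
print(validator.is_valid(old_sample))  # True
print(validator.is_valid(new_sample))  # True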
I have a JSON object like:
{
  "result": [
    {
      "name": "abc",
      "email": "abc.test#mail.com"
    },
    {
      "name": "def",
      "email": "def.test#mail.com"
    },
    {
      "name": "xyz",
      "email": "abc.test#mail.com"
    }
  ]
}
and the schema for this:
{
  "definitions": {},
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "https://example.com/object1607582431.json",
  "title": "Root",
  "type": "object",
  "required": [
    "result"
  ],
  "properties": {
    "result": {
      "$id": "#root/result",
      "title": "Result",
      "type": "array",
      "default": [],
      "uniqueItems": true,
      "items": {
        "$id": "#root/result/items",
        "title": "Items",
        "type": "object",
        "required": [
          "name",
          "email"
        ],
        "properties": {
          "name": {
            "$id": "#root/result/items/name",
            "title": "Name",
            "type": "string"
          },
          "email": {
            "$id": "#root/result/items/email",
            "title": "Email",
            "type": "string"
          }
        }
      }
    }
  }
}
I am looking for a way to check uniqueness of email irrespective of name. How can I validate that every email is unique?
You can't. There are no keywords that let you compare one particular data value against another, other than uniqueItems, which compares an array element in toto against another.
The JSON Schema specification does not currently support this.
You can see the active GitHub issue here: https://github.com/json-schema-org/json-schema-vocabularies/issues/22
However, there are various extensions of JSON Schema that do validate unique fields within lists of objects.
If you happen to be using Python, you can use JsonVL, a package I created. It can be installed with pip install jsonvl and then run with jsonvl data.json schema.json.
Code examples in the GitHub repo: https://github.com/gregorybchris/jsonvl
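If you'd rather avoid a dependency, you can also enforce the constraint in application code after schema validation. A minimal sketch, assuming the document shape from the question (the file name data.json is illustrative):

# Sketch: post-validation uniqueness check for "email" in application code,
# since JSON Schema cannot express this constraint.
import json
from collections import Counter

with open("data.json") as f:
    doc = json.load(f)

email_counts = Counter(item["email"] for item in doc["result"])
duplicates = [email for email, count in email_counts.items() if count > 1]
if duplicates:
    raise ValueError(f"Duplicate emails: {duplicates}")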
I want to create a MongoDB collection using a JSON schema file.
Suppose the JSON file address.schema.json contains an address schema (this file is one of json-schema.org's examples):
{
  "$id": "https://example.com/address.schema.json",
  "$schema": "http://json-schema.org/draft-07/schema#",
  "description": "An address similar to http://microformats.org/wiki/h-card",
  "type": "object",
  "properties": {
    "post-office-box": {
      "type": "string"
    },
    "extended-address": {
      "type": "string"
    },
    "street-address": {
      "type": "string"
    },
    "locality": {
      "type": "string"
    },
    "region": {
      "type": "string"
    },
    "postal-code": {
      "type": "string"
    },
    "country-name": {
      "type": "string"
    }
  },
  "required": [ "locality", "region", "country-name" ],
  "dependencies": {
    "post-office-box": [ "street-address" ],
    "extended-address": [ "street-address" ]
  }
}
What is the MongoDB command, such as mongoimport or db.createCollection, to create a MongoDB collection using the above schema?
It would be nice if I could use the file directly in MongoDB without changing its format manually.
If the JSON schema format is a standard one, I wonder why I need to change it to adapt it for MongoDB. Why doesn't MongoDB have this functionality built in?
You can create it via the createCollection command or the shell helper:
db.createCollection(
  "mycollection",
  {
    validator: {
      $jsonSchema: {
        "description": "An address similar to http://microformats.org/wiki/h-card",
        "type": "object",
        "properties": {
          "post-office-box": { "type": "string" },
          "extended-address": { "type": "string" },
          "street-address": { "type": "string" },
          "locality": { "type": "string" },
          "region": { "type": "string" },
          "postal-code": { "type": "string" },
          "country-name": { "type": "string" }
        },
        "required": [ "locality", "region", "country-name" ],
        "dependencies": {
          "post-office-box": [ "street-address" ],
          "extended-address": [ "street-address" ]
        }
      }
    }
  }
)
You only need to specify bsonType instead of type if you want to use a type that exists in BSON but not in generic JSON schema. You do have to remove the $id and $schema lines, as those are not supported by MongoDB's JSON schema support (documented here).
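If you want to automate the cleanup, here is a sketch using PyMongo; it assumes a local mongod and the address.schema.json file from the question, and the database and collection names are illustrative:

# Sketch: create the collection from the schema file with PyMongo.
import json
from pymongo import MongoClient

with open("address.schema.json") as f:
    schema = json.load(f)

# MongoDB's $jsonSchema does not accept these metadata keywords.
schema.pop("$id", None)
schema.pop("$schema", None)

client = MongoClient("mongodb://localhost:27017")
client["mydb"].create_collection("addresses", validator={"$jsonSchema": schema})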
The only option you can use is adding a $jsonSchema validator when creating the collection: https://docs.mongodb.com/manual/reference/operator/query/jsonSchema/#document-validator. This means that any document you insert or update in the collection will have to match the provided schema.
I am working with JSON Schema draft 4 and am experiencing an issue I can't quite get my head around. Within the schema below you'll see an array, metricsGroups, where every item should match exactly one of the sub-schemas defined under oneOf. The sub-schemas both share the property name timestamp, but metricsGroupOne has the properties temperature and humidity, whilst metricsGroupTwo has the properties PIR and CO2. All properties within both metrics groups are required.
Please see the schema below. After the schema is an example of some data that I'd expect to be valid but that is instead deemed invalid, along with an explanation of my issue.
{
  "type": "object",
  "properties": {
    "uniqueId": {
      "type": "string"
    },
    "metricsGroups": {
      "type": "array",
      "minItems": 1,
      "items": {
        "oneOf": [
          {
            "type": "object",
            "properties": {
              "metricsGroupOne": {
                "type": "array",
                "minItems": 1,
                "items": {
                  "type": "object",
                  "properties": {
                    "timestamp": {
                      "type": "string",
                      "format": "date-time"
                    },
                    "temperature": {
                      "type": "number"
                    },
                    "humidity": {
                      "type": "array",
                      "items": {
                        "type": "number"
                      }
                    }
                  },
                  "additionalProperties": false,
                  "required": [
                    "timestamp",
                    "temperature",
                    "humidity"
                  ]
                }
              }
            },
            "required": [
              "metricsGroupOne"
            ]
          },
          {
            "type": "object",
            "properties": {
              "metricsGroupTwo": {
                "type": "array",
                "minItems": 1,
                "items": {
                  "type": "object",
                  "properties": {
                    "timestamp": {
                      "type": "string",
                      "format": "date-time"
                    },
                    "PIR": {
                      "type": "array",
                      "items": {
                        "type": "number"
                      }
                    },
                    "CO2": {
                      "type": "number"
                    }
                  },
                  "additionalProperties": false,
                  "required": [
                    "timestamp",
                    "PIR",
                    "CO2"
                  ]
                }
              }
            },
            "required": [
              "metricsGroupTwo"
            ]
          }
        ]
      }
    }
  },
  "additionalProperties": false,
  "required": [
    "uniqueId",
    "metricsGroups"
  ]
}
Here's some data that I believe should be valid:
{
  "uniqueId": "d3-52-f8-a1-89-ee",
  "metricsGroups": [
    {
      "metricsGroupOne": [
        { "timestamp": "2020-03-04T12:34:00Z", "temperature": 32.5, "humidity": [45.0] }
      ],
      "metricsGroupTwo": [
        { "timestamp": "2020-03-04T12:34:00Z", "PIR": [16, 20, 7], "CO2": 653.76 }
      ]
    }
  ]
}
The issue I am facing is that both of the metrics-group arrays in my supposedly valid data validate against both of the sub-schemas, and this then invalidates the data because of the oneOf keyword. I don't understand how the entry for metricsGroupOne validates against the schema for metricsGroupTwo, as the property names differ, and vice versa.
I'm using a Node library under the hood that throws this error, but I've also confirmed that the same error occurs on some online validation testing websites:
jsonschemavalidator
json-schema-validator
Any help is appreciated. Thanks,
Adam
JSON Schema uses a constraints-based approach: anything you don't explicitly disallow is allowed.
What's happening here is that you haven't specified anything in oneOf[1] that would make the first item in your instance data array invalid.
Let me illustrate this with a simple example. Here's my schema; I'm going to use draft-07, but this principle is no different in draft-04:
{
  "oneOf": [
    {
      "properties": {
        "a": true
      }
    },
    {
      "properties": {
        "b": false
      }
    }
  ]
}
And my instance:
{
  "a": 1
}
This fails validation, because the instance is valid against both of the oneOf subschemas (oneOf requires exactly one match).
Demo: https://jsonschema.dev/s/EfUc4
If the instance was instead...
{
  "a": 1,
  "b": 1
}
This would be valid, because the instance fails validation for the subschema oneOf[1] (whose schema for b is false), and so matches exactly one subschema.
If the instance was...
{
  "b": 1
}
It would be valid according to oneOf[0] but not according to oneOf[1], and therefore overall would be valid because it is only valid according to one subschema.
In your case, you probably need to use additionalProperties to disallow the properties you haven't defined in properties. I can't tell from your question whether you want to allow both groups in the same array item, because your schema uses oneOf, which conflicts with the instance you expect to be valid.
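To make the failure concrete, here is a minimal sketch assuming the Python jsonschema package, with the two branches stripped down to just their required keys; it shows why an item carrying both groups fails oneOf, and how anyOf behaves if such items are meant to be valid:

# A stripped-down model of the two oneOf branches; keeping only the
# "required" keys is enough to reproduce the problem.
from jsonschema import Draft4Validator

branch_one = {"type": "object", "required": ["metricsGroupOne"]}
branch_two = {"type": "object", "required": ["metricsGroupTwo"]}
item = {"metricsGroupOne": [], "metricsGroupTwo": []}

# oneOf: the item satisfies *both* branches (neither branch forbids the
# other group's property), so "exactly one" fails and the item is invalid.
print(Draft4Validator({"oneOf": [branch_one, branch_two]}).is_valid(item))  # False

# anyOf: matching more than one branch is allowed, so the item is valid.
print(Draft4Validator({"anyOf": [branch_one, branch_two]}).is_valid(item))  # True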
Any relevant help will be appreciated.
I have several different JSON docs which need to be inserted into BigQuery. To avoid generating the schema manually, I am using online JSON schema generation tools, but the schemas they generate are not accepted by the BigQuery Load Data wizard.
For example, for JSON data like this:
{"_id":100,"actor":"KK","message":"CCD is good in Pune",
"comment":[{"actor":"Subho","message":"CCD is not as good in Kolkata."},
{"actor":"bisu","message":"CCD is costly too in Kolkata"}]
}
the schema generated by the online tool is:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "Generated from c:jsonccd.json with shasum a003286a350a6889b152b3e33afc5458f3771e9c",
  "type": "object",
  "required": [
    "_id",
    "actor",
    "message",
    "comment"
  ],
  "properties": {
    "_id": {
      "type": "integer"
    },
    "actor": {
      "type": "string"
    },
    "message": {
      "type": "string"
    },
    "comment": {
      "type": "array",
      "minItems": 1,
      "uniqueItems": true,
      "items": {
        "type": "object",
        "required": [
          "actor",
          "message"
        ],
        "properties": {
          "actor": {
            "type": "string"
          },
          "message": {
            "type": "string"
          }
        }
      }
    }
  }
}
But when I put it into BigQuery in the Load Data wizard, it fails with errors.
How can this be mitigated?
Thanks.
The schema generated by that tool is way more complex than what BigQuery requires.
Look at the sample in the docs:
"schema": {
"fields": [
{"name":"f1", "type":"STRING"},
{"name":"f2", "type":"INTEGER"}
]
},
https://developers.google.com/bigquery/loading-data-into-bigquery?hl=en#loaddatapostrequest
Meanwhile, the tool mentioned in the question adds fields like $schema, description, type, required, and properties that are unnecessary and confuse the BigQuery schema parser.
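For the sample record in the question, the hand-written BigQuery schema is short. Here is a sketch using the google-cloud-bigquery Python client (an assumption on my part; the project, dataset, and table names are illustrative), with comment as a repeated RECORD:

# Sketch: hand-written BigQuery schema for the sample JSON.
from google.cloud import bigquery

schema = [
    bigquery.SchemaField("_id", "INTEGER"),
    bigquery.SchemaField("actor", "STRING"),
    bigquery.SchemaField("message", "STRING"),
    # Nested, repeated field for the "comment" array of objects.
    bigquery.SchemaField(
        "comment",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("actor", "STRING"),
            bigquery.SchemaField("message", "STRING"),
        ],
    ),
]

client = bigquery.Client()
table = bigquery.Table("my-project.my_dataset.ccd", schema=schema)
client.create_table(table)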