I have a newline-delimited JSON file. Is it possible to generate a schema using a tool like jq? I've had some success with jq in the past but haven't done something as complicated as this.
Here's the format of the schema I'm aiming for: https://cloud.google.com/bigquery/docs/nested-repeated#example_schema. Notice that nesting is handled with a fields key of the parent, and arrays are handled with "mode": "repeated". (Any help with some sort of schema is greatly appreciated and I then can massage into this format).
Copying from the link above, I'd like to generate from this:
{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
...to...
[
{
"name": "id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "first_name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "last_name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "dob",
"type": "DATE",
"mode": "NULLABLE"
},
{
"name": "addresses",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "status",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "address",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "city",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "state",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "zip",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "numberOfYears",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
(See "BigQuery autodetect doesn't work with inconsistent json?", which shows that I can't use BigQuery autodetect because the items aren't all the same. I'm fairly confident I can merge the per-item schemas manually to create a superset.)
Here's a simple recursive function that may help if you decide to roll your own:
def schema:
  # Treat a string as a date only if the whole value looks like YYYY-MM-DD.
  def isdate($v): $v | test("^[0-9]{4}-[0-9]{2}-[0-9]{2}$");
  # Arrays are assumed to hold objects; the first element supplies the nested fields.
  def array($k;$v): {"name":$k,"type":"RECORD",mode:"REPEATED","fields":($v[0] | schema)};
  def date($k): {"name":$k,"type":"DATE", mode:"NULLABLE"};
  def string($k): {"name":$k,"type":"STRING",mode:"NULLABLE"};
  # Values of any other type (numbers, booleans, objects, null) are skipped.
  def item($k;$v):
    $v | if type == "array" then array($k;$v)
    elif type == "string" and isdate($v) then date($k)
    elif type == "string" then string($k)
    else empty end;
  [ to_entries[] | item(.key;.value) ]
;
schema
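The function above produces a schema for a single document. Since the question mentions merging per-document schemas into a superset, here is a rough Python sketch of that step. merge_fields is a hypothetical helper, and falling back to STRING when two documents disagree on a type is my assumption, not a BigQuery rule.

```python
def merge_fields(a, b):
    """Merge two BigQuery-style field lists into a superset, keyed by field name."""
    merged = {field["name"]: dict(field) for field in a}
    for field in b:
        name = field["name"]
        if name not in merged:
            merged[name] = dict(field)
            continue
        current = merged[name]
        if current["type"] != field["type"]:
            # Assumption: fall back to STRING when documents disagree on a type.
            current["type"] = "STRING"
            current.pop("fields", None)
        elif current["type"] == "RECORD":
            # Nested records are merged recursively so no sub-field is lost.
            current["fields"] = merge_fields(
                current.get("fields", []), field.get("fields", [])
            )
    return list(merged.values())
```

You would fold this over the schema generated for each line of the file, for example with functools.reduce.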
Any help with some sort of schema is greatly appreciated and I then can massage into this format
There is a schema-inference module written in jq at http://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed but the inferred schemas are "structural" - they mirror the input JSON. For your sample, the inferred schema is as shown below. As you can see, it would be quite easy to transform this into the format you have in mind, except that extra work would be required to infer the mode values.
Please note that the above-mentioned module infers the "common schema" from an arbitrarily large "sample" of JSON documents. That is, it is a schema inference engine rather than simply a "schema generator".
The above link references a companion schema-checker named JESS, also written in jq. The "E" in "JESS" stands for "extended", signifying that the JESS schema language for specifying schemas allows complex constraints to be included.
{
"id": "string",
"first_name": "string",
"last_name": "string",
"dob": "string",
"addresses": [
{
"status": "string",
"address": "string",
"city": "string",
"state": "string",
"zip": "string",
"numberOfYears": "string"
}
]
}
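To show roughly what transforming this structural schema into the BigQuery format could look like, here is a small Python sketch. to_bigquery is a hypothetical helper, the type_map entries are my assumption about which type names appear in the structural schema, and every field is simply given NULLABLE or REPEATED mode.

```python
def to_bigquery(structural):
    """Convert a structural schema (name -> type) into BigQuery field definitions."""
    # Assumption: scalar types are named as below; anything unknown becomes STRING.
    type_map = {"string": "STRING", "number": "FLOAT", "boolean": "BOOLEAN"}
    fields = []
    for name, spec in structural.items():
        mode = "NULLABLE"
        if isinstance(spec, list):
            # Assumption: a one-element list describes the type shared by all items.
            mode = "REPEATED"
            spec = spec[0] if spec else "string"
        if isinstance(spec, dict):
            fields.append({"name": name, "type": "RECORD", "mode": mode,
                           "fields": to_bigquery(spec)})
        else:
            fields.append({"name": name, "type": type_map.get(spec, "STRING"),
                           "mode": mode})
    return fields
```

Note that dob comes out as STRING here, because the structural schema only records "string"; detecting DATE would need a look at the actual values.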
Let's say I have two schemas defined as follows:
ADDRESS_CLASS_SCHEMA_DEFINITION = {
"title": "Address",
"type": "object",
"properties": {
"country_code": {
"$ref": "#/definitions/CountryCode"
},
"city_code": {
"title": "City Code",
"type": "string"
},
"zipcode": {
"title": "Zipcode",
"type": "string"
},
"address_str": {
"title": "Address Str",
"type": "string"
}
},
"required": [
"country_code",
"city_code",
"zipcode"
],
"definitions": {
"CountryCode": {
"title": "CountryCode",
"description": "An enumeration.",
"enum": [
"CA",
"USA",
"UK"
],
"type": "string"
}
}
}
EMPLOYEE_CLASS_SCHEMA_DEFINITION = {
"title": "Employee",
"type": "object",
"properties": {
"id": {
"title": "Id",
"type": "integer"
},
"name": {
"title": "Name",
"type": "string"
},
"email": {
"title": "Email",
"type": "string"
},
"telephone": {
"title": "Telephone",
"type": "string"
},
"address": {
"$ref": "#/definitions/Address"
}
},
"required": [
"id",
"name",
"email"
],
"definitions": {
"Address": ADDRESS_CLASS_SCHEMA_DEFINITION
}
}
I'm trying to reuse sub-schema definitions by defining each one as a constant and referencing it in definitions (for example, the address schema is referenced through a constant in the employee schema's definitions). This approach works for the individual schemas, but there seems to be a JSON Pointer path issue for the Employee schema: #/definitions/CountryCode doesn't resolve there. I assumed that #/definitions/CountryCode would be resolved relative to the Address schema, since it is used inside that sub-schema, but my understanding seems to be wrong. I can make it work by flattening the definitions as below, but I do not want to take this route:
{
"title": "Employee",
"type": "object",
"properties": {
"id": {
"title": "Id",
"type": "integer"
},
"name": {
"title": "Name",
"type": "string"
},
"email": {
"title": "Email",
"type": "string"
},
"telephone": {
"title": "Telephone",
"type": "string"
},
"address": {
"$ref": "#/definitions/Address"
}
},
"required": [
"id",
"name",
"email"
],
"definitions": {
"CountryCode": {
"title": "CountryCode",
"description": "An enumeration.",
"enum": [
"CA",
"USA",
"UK"
],
"type": "string"
},
"Address": {
"title": "Address",
"type": "object",
"properties": {
"country_code": {
"$ref": "#/definitions/CountryCode"
},
"city_code": {
"title": "City Code",
"type": "string"
},
"zipcode": {
"title": "Zipcode",
"type": "string"
},
"address_str": {
"title": "Address Str",
"type": "string"
}
},
"required": [
"country_code",
"city_code",
"zipcode"
]
}
}
}
I'm wondering how to fix this. I've briefly looked into JSON Schema bundling and using $id, but from what I've read the general recommendation seems to be to use $id only when dealing with URIs. I would like to know the best practice for fixing this problem, and I would also appreciate it if someone could show me how to use $id correctly (for example, the constant-based approach seems to work when I provide identifiers like $id: Address and $id: Employee). Thanks in advance.
JSON Schema implementations work in JSON land. When you combine your schemas as in your example above, presumably in javascript/node.js, any knowledge that they were separate schemas is lost by the time the result reaches the JSON Schema implementation for validation. (This is generally not considered the best approach.)
The EASY fix here SHOULD be just to define $id in the root of each of your schemas. Each should be a fully qualified URI. It doesn't really matter what they are at this point. They could be https://example.com/a and https://example.com/b. Then, in the primary schema, you can do $ref: https://example.com/b.
Implementations should provide you with a way to load in your other/non-primary schemas so the $id values can be stored in an index. Using $id in your other schema with a fully qualified URI will signify a "resource boundary".
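To illustrate what an index of $id values and a "resource boundary" mean, here is a minimal Python sketch using only the standard library. It is not a validator and not how any particular implementation works: build_index and resolve are hypothetical helpers, the schemas are trimmed versions of the ones in the question, and the example.com URIs are arbitrary.

```python
from urllib.parse import urldefrag, urljoin

ADDRESS = {
    "$id": "https://example.com/address",
    "type": "object",
    "properties": {"country_code": {"$ref": "#/definitions/CountryCode"}},
    "definitions": {"CountryCode": {"enum": ["CA", "USA", "UK"], "type": "string"}},
}
EMPLOYEE = {
    "$id": "https://example.com/employee",
    "type": "object",
    "properties": {"address": {"$ref": "https://example.com/address"}},
}


def build_index(*schemas):
    """Map each root $id to its schema, as an implementation's loader might."""
    return {schema["$id"]: schema for schema in schemas}


def resolve(ref, base_uri, index):
    """Resolve ref against base_uri; return (target, base URI of its resource)."""
    uri, fragment = urldefrag(urljoin(base_uri, ref))
    node = index[uri]
    # Walk the JSON Pointer in the fragment (escape sequences are ignored here).
    for token in fragment.split("/")[1:]:
        node = node[token]
    return node, uri


index = build_index(ADDRESS, EMPLOYEE)
address, base = resolve(EMPLOYEE["properties"]["address"]["$ref"], EMPLOYEE["$id"], index)
country, _ = resolve(address["properties"]["country_code"]["$ref"], base, index)
print(country["enum"])
```

The point is that #/definitions/CountryCode is resolved against the base URI of the resource it appears in (the Address schema), not against the Employee root, which is what goes wrong when the schemas are merged into one document without $id.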
https://json-schema.hyperjump.io is the only web playground to support multiple files/schemas/"Schema Resources", so you can test this out there to confirm your expectations.
Not all implementations make it easy or even provide a means to import your other schemas, but they should.
If you have follow up questions, feel free to leave a comment, or join the JSON Schema slack server if it would be off-topic for StackOverflow.
I know that fields listed in a json schema object have no defined order, since they are not an array, but I am looking for a way to be able to display them in the proper order in my application UI.
Workarounds I have found so far include things like using a different serializer, or even hard-coding a number into the field name.
I would like to come up with something that works with my current setup: Hibernate, Spring Boot, and a React front end.
Given this GET request:
/profile/personEntities
with header: Accept: application/schema+json
I will receive this:
{
"title": "Person entity",
"properties": {
"birthday": {
"title": "Birthday",
"readOnly": false,
"type": "string",
"format": "date-time"
},
"lastName": {
"title": "Last name",
"readOnly": false,
"type": "string"
},
"address": {
"title": "Address",
"readOnly": false,
"type": "string",
"format": "uri"
},
"firstName": {
"title": "First name",
"readOnly": false,
"type": "string"
},
"email": {
"title": "Email",
"readOnly": false,
"type": "string"
},
"cellPhone": {
"title": "Cell phone",
"readOnly": false,
"type": "string"
}
},
"requiredProperties": [
"firstName",
"lastName"
],
"definitions": {},
"type": "object",
"$schema": "http://json-schema.org/draft-04/schema#"
}
I have tried adding @JsonProperty(index=2) to the field, but nothing changes.
Thank you very much for any tips.
If you're using Jackson to handle your serialization/deserialization you can use @JsonPropertyOrder - from their docs:
// ensure that "id" and "name" are output before other properties
@JsonPropertyOrder({ "id", "name" })
// order any properties that don't have explicit setting using alphabetic order
@JsonPropertyOrder(alphabetic=true)
See: http://fasterxml.github.io/jackson-annotations/javadoc/2.3.0/com/fasterxml/jackson/annotation/JsonPropertyOrder.html
I need to convert the following JSON string, returned by an HTTP request, into an object. Can someone help me loop through it and load it into an object? (I'm new to VB6.)
Please see the JSON response below.
{
"Participants": [
{
"Participant": {
"EntityHierarchy": {},
"ProviderPlatform": "string",
"ProviderPlatformDetail": [
{
"ProviderPlatform": "string",
"Primary": true
}
],
"FirstName": "string",
"LastName": "string",
"BusinessName": "string",
"City": "string",
"Region": "string",
"PostalCode": "string",
"Phone": "string",
"CountryCode": "string",
"Email": "string",
"AccountNumber": "string",
"Active": true,
"PSuiteAttribute": "string",
"ParticipantIdentifier": "string",
"SystemParticipantIdentifier": "string",
"ITAIdentifier": "string"
},
"Platform": "string",
"Program": "string",
"ProgramFriendlyName": "string",
"EnterpriseServicesIdentifier": "string",
"IdentityMapped": true,
"MappedToMasterPlatform": true,
"MasterPlatform": "string",
"SupportingPlatform": "string",
"MasterPlatformName": "string",
"SupportingPlatformName": "string",
"FaultedMessages": [
"string"
]
}
Bruce McPherson makes heavy use of a class called cJobject to do JSON handling in VBA/VB6. cJobject is too big to fit in an answer, but you can get its current source code off GitHub. See also his usage notes.
There are plenty of JSON classes floating around written in VB6, VBA, or VB6 that is portable to VBA.
Here are a couple more:
JsonBag, Another JSON Parser/Generator
JNode - JSON revisited
I have a roughly 10G JSON file. Each line contains exactly one JSON document. I was wondering what is the best way to convert this to Avro. Ideally I would like to keep several documents (like 10M) per file. I think Avro supports having multiple documents in the same file.
You should be able to use Avro tools' fromjson command (see here for more information and examples). You'll probably want to split your file into 10M chunks beforehand (for example using split(1)).
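If split(1) isn't available, the chunking step could be sketched in Python roughly as follows. split_ndjson and the part-NNNNN.json naming are hypothetical, and I'm assuming one document per line, as the question states. It streams the input, so memory use stays small even for a very large file.

```python
from itertools import count, islice
from pathlib import Path


def split_ndjson(source, out_dir, docs_per_file):
    """Copy `source` into files in `out_dir` holding at most `docs_per_file` lines each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with open(source, encoding="utf-8") as lines:
        for index in count():
            # islice lazily takes the next docs_per_file lines from the open file.
            chunk = islice(lines, docs_per_file)
            first = next(chunk, None)
            if first is None:
                break  # input exhausted, so no empty trailing file is created
            path = out_dir / f"part-{index:05d}.json"
            with open(path, "w", encoding="utf-8") as out:
                out.write(first)
                out.writelines(chunk)
            paths.append(path)
    return paths
```

Each resulting file can then be passed to fromjson separately.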
The easiest way to convert a large JSON file to Avro is using avro-tools from the Avro website.
After creating a simple schema the file can be directly converted.
java -jar avro-tools-1.7.7.jar fromjson --schema-file cpc.avsc --codec deflate test.1g.json > test.1g.deflate.avro
The example schema:
{
"type": "record",
"name": "cpc_schema",
"namespace": "com.streambright.avro",
"fields": [{
"name": "section",
"type": "string",
"doc": "Section of the CPC"
}, {
"name": "class",
"type": "string",
"doc": "Class of the CPC"
}, {
"name": "subclass",
"type": "string",
"doc": "Subclass of the CPC"
}, {
"name": "main_group",
"type": "string",
"doc": "Main-group of the CPC"
}, {
"name": "subgroup",
"type": "string",
"doc": "Subgroup of the CPC"
}, {
"name": "classification_value",
"type": "string",
"doc": "Classification value of the CPC"
}, {
"name": "doc_number",
"type": "string",
"doc": "Patent doc_number"
}, {
"name": "updated_at",
"type": "string",
"doc": "Document update time"
}],
"doc": "A basic schema for CPC codes"
}
I am loading a JSON file into a table in a BigQuery dataset. A sample JSON record in that file is:
{"a": "string_a","b": "string_b","c": 4.42,"d_list":["x","y","z"]}
I define the schema field as:
a:string, b:string, c:float, d_list:string
This gives an import error: Field:d_list, array specified for non-repeated field
I think d_list should be defined as:
{
"type": "STRING",
"name": "d_list",
"mode": "repeated"
}
Is that right? If so, how can I use the web UI to define it this way?
The web UI also accepts the schema as JSON text, as noted in the help icon, so you can define the fields as a JSON array like the one below and paste it into the web UI.
[
{
"type": "STRING",
"name": "a",
"mode": "nullable"
},
{
"type": "STRING",
"name": "b",
"mode": "nullable"
},
{
"type": "FLOAT",
"name": "c",
"mode": "nullable"
},
{
"type": "STRING",
"name": "d_list",
"mode": "repeated"
}
]
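If you would rather not write that array by hand, a rough Python sketch like the one below could generate it from a sample row. infer_fields is a hypothetical helper, and it assumes the row only contains scalars or lists of scalars (no nested objects), with the first list element deciding the type.

```python
import json


def infer_fields(row):
    """Build BigQuery field definitions from one decoded JSON row."""
    def scalar_type(value):
        # bool is checked first because bool is a subclass of int in Python.
        if isinstance(value, bool):
            return "BOOLEAN"
        if isinstance(value, int):
            return "INTEGER"
        if isinstance(value, float):
            return "FLOAT"
        return "STRING"

    fields = []
    for name, value in row.items():
        if isinstance(value, list):
            # Assumption: an empty list is treated as a list of strings.
            item = value[0] if value else ""
            fields.append({"type": scalar_type(item), "name": name, "mode": "REPEATED"})
        else:
            fields.append({"type": scalar_type(value), "name": name, "mode": "NULLABLE"})
    return fields


row = json.loads('{"a": "string_a","b": "string_b","c": 4.42,"d_list":["x","y","z"]}')
print(json.dumps(infer_fields(row), indent=2))
```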