I am deploying a model onto AWS via SageMaker. I set up my JSON schema as follows:
import json
schema = {
    "input": [
        {"name": "V1", "type": "double"},
        {"name": "V2", "type": "double"},
        {"name": "V3", "type": "double"},
        {"name": "V4", "type": "double"},
        {"name": "V5", "type": "double"},
        {"name": "V6", "type": "double"},
        {"name": "V7", "type": "double"},
        {"name": "V8", "type": "double"},
        {"name": "V9", "type": "double"},
        {"name": "V10", "type": "double"},
        {"name": "V11", "type": "double"},
        {"name": "V12", "type": "double"},
        {"name": "V13", "type": "double"},
        {"name": "V14", "type": "double"},
        {"name": "V15", "type": "double"},
        {"name": "V16", "type": "double"},
        {"name": "V17", "type": "double"},
        {"name": "V18", "type": "double"},
        {"name": "V19", "type": "double"},
        {"name": "V20", "type": "double"},
        {"name": "V21", "type": "double"},
        {"name": "V22", "type": "double"},
        {"name": "V23", "type": "double"},
        {"name": "V24", "type": "double"},
        {"name": "V25", "type": "double"},
        {"name": "V26", "type": "double"},
        {"name": "V27", "type": "double"},
        {"name": "V28", "type": "double"},
        {"name": "Amount", "type": "double"}
    ],
    "output": {
        "name": "features",
        "type": "double",
        "struct": "vector"
    }
}
schema_json = json.dumps(schema)
print(schema_json)
And deployed as:
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sparkml.model import SparkMLModel
sparkml_data = 's3://{}/{}/{}'.format(s3_model_bucket, s3_model_key_prefix, 'model.tar.gz')
# passing the schema defined above by using an environment variable that sagemaker-sparkml-serving understands
sparkml_model = SparkMLModel(model_data=sparkml_data, env={'SAGEMAKER_SPARKML_SCHEMA' : schema_json})
xgb_model = Model(model_data=xgb_model.model_data, image=training_image)
model_name = 'inference-pipeline-' + timestamp_prefix
sm_model = PipelineModel(name=model_name, role=role, models=[sparkml_model, xgb_model])
endpoint_name = 'inference-pipeline-ep-' + timestamp_prefix
sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
I got the following error:
ClientError: An error occurred (ValidationException) when calling the CreateModel operation: 1 validation error detected: Value '{SAGEMAKER_SPARKML_SCHEMA={"input": [{"type": "double", "name": "V1"}, {"type": "double", "name": "V2"}, {"type": "double", "name": "V3"}, {"type": "double", "name": "V4"}, {"type": "double", "name": "V5"}, {"type": "double", "name": "V6"}, {"type": "double", "name": "V7"}, {"type": "double", "name": "V8"}, {"type": "double", "name": "V9"}, {"type": "double", "name": "V10"}, {"type": "double", "name": "V11"}, {"type": "double", "name": "V12"}, {"type": "double", "name": "V13"}, {"type": "double", "name": "V14"}, {"type": "double", "name": "V15"}, {"type": "double", "name": "V16"}, {"type": "double", "name": "V17"}, {"type": "double", "name": "V18"}, {"type": "double", "name": "V19"}, {"type": "double", "name": "V20"}, {"type": "double", "name": "V21"}, {"type": "double", "name": "V22"}, {"type": "double", "name": "V23"}, {"type": "double", "name": "V24"}, {"type": "double", "name": "V25"}, {"type": "double", "name": "V26"}, {"type": "double", "name": "V27"}, {"type": "double", "name": "V28"}, {"type": "double", "name": "Amount"}], "output": {"type": "double", "name": "features", "struct": "vector"}}}' at 'containers.1.member.environment' failed to satisfy constraint: Map value must satisfy constraint: [Member must have length less than or equal to 1024, Member must have length greater than or equal to 0, Member must satisfy regular expression pattern: [\S\s]*]
I tried reducing my features to 20 and it was able to deploy. Just wondering how I can pass the schema with 29 attributes?
I do not think the 1024-character limit on environment variable values will be increased any time soon. To work around this, you could try rebuilding the SparkML serving container with the SAGEMAKER_SPARKML_SCHEMA env var baked in:
https://github.com/aws/sagemaker-sparkml-serving-container/blob/master/README.md#running-the-image-locally
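If you do rebuild the image with the schema baked in, one way to wire it into the pipeline is to reference the custom image via the generic Model class (the ECR URI below is a placeholder, not a real image). As a separate, unverified idea, serializing the schema without whitespace shaves a couple of characters per field, which may or may not be enough to get 29 attributes under the 1024-character limit. A rough sketch of both:
import json
from sagemaker.model import Model

# 1) Compact serialization: json.dumps adds spaces after ',' and ':' by default;
#    dropping them shortens the env var value, though it may still exceed 1024 chars.
schema_json = json.dumps(schema, separators=(',', ':'))

# 2) If SAGEMAKER_SPARKML_SCHEMA is baked into a rebuilt sagemaker-sparkml-serving
#    image as a default value, the env var no longer needs to be passed through
#    CreateModel at all. The image URI below is a placeholder.
sparkml_model = Model(model_data=sparkml_data,
                      image='<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-sparkml-serving:custom-schema')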
Related
I am trying to validate a schema with Postman.
I created the following schema using the website: https://www.liquid-technologies.com/online-json-to-schema-converter
Here's the result:
const schema = {
"type": "object",
"properties": {
"system_id": {"type": "string"},
"start": {"type": "string"},
"end": {"type": "string"},
"period": {"type": "string" },
"unit": {"type": "string"},
"values": {"type": "array",
"items": [
{
"type": "object",
"properties": {
"timestamp": {"type": "string"},
"battery_charge": {"type": "float" },
"battery_discharge": {"type": "number"},
"grid_export": {"type": "number"},
"grid_import": {"type": "number"},
"home_consumption": {"type": "number"},
"solar": {"type": "number"},
"battery_charge_state": {"type": "number"}
},
"required": [
"timestamp",
"battery_charge",
"battery_discharge",
"grid_export",
"grid_import",
"home_consumption",
"solar",
"battery_charge_state"
]
}
]
},
"totals": {
"type": "object",
"properties": {
"battery_charge": {"type": "number"},
"battery_discharge": {"type": "number"},
"grid_export": {"type": "number"},
"grid_import": {"type": "number"},
"home_consumption": {"type": "number"},
"solar": {"type": "number"}
},
"required": [
"battery_charge",
"battery_discharge",
"grid_export",
"grid_import",
"home_consumption",
"solar"
]
}
},
"required": [
"system_id",
"start",
"end",
"period",
"unit",
"values",
"totals"
]
}
pm.test("Schema validation", () => {
pm.response.to.have.jsonSchema(schema);
});
But the problem is that some properties appear as "type": "number" when I think they should be "float". If I change "number" to "float", Postman displays the following error message:
Schema validation | Error: schema is invalid: data.properties['values'].items should be object,boolean, data.properties['values'].items[0].properties['battery_charge'].type should be equal to one of the allowed values, data.properties['values'].items[0].properties['battery_charge'].type should be array, data.properties['values'].items[0].properties['battery_charge'].type should match some schema in anyOf, data.properties['values'].items should match some schema in anyOf
JSON response:
{
"system_id": "C18208",
"start": "2022-02-06T00:00:00+00:00",
"end": "2022-02-06T23:59:00+00:00",
"period": "minute",
"unit": "kWh",
"values": [{
"timestamp": "2022-02-06T00:00:00+00:00",
"battery_charge": 0.0,
"battery_discharge": 0.0,
"grid_export": 0.0,
"grid_import": 0.0,
"home_consumption": 0.0,
"solar": 0.0,
"battery_charge_state": 100.0
}],
"totals": {
"battery_charge": 9.3,
"battery_discharge": 4.8,
"grid_export": 4.5,
"grid_import": 31.9,
"home_consumption": 32.1,
"solar": 33.3
}
}
Any help please?
With JSON, there are only six data types: string, number, array, object, null, and boolean. There is no "float" data type.
Fix:
"battery_charge": {"type": "float" } --> "battery_charge": {"type": "number" }
I have a schema which has nested fields. When I try to convert it with:
jtopy = json.dumps(schema_message['SchemaDefinition'])  # json.dumps takes a dictionary as input and returns a string as output
print(jtopy)
dict_json = json.loads(jtopy)  # json.loads takes a string as input and returns a dictionary as output
print(dict_json)
new_schema = StructType.fromJson(dict_json)
print(new_schema)
It returns the error:
return StructType([StructField.fromJson(f) for f in json["fields"]])
TypeError: string indices must be integers
The schema definition described below is what I'm passing:
{
"type": "record",
"name": "tags",
"namespace": "com.tigertext.data.events.tags",
"doc": "Schema for tags association to accounts (role,etc..)",
"fields": [
{
"name": "header",
"type": {
"type": "record",
"name": "eventHeader",
"namespace": "com.tigertext.data.events",
"doc": "Metadata about the event record.",
"fields": [
{
"name": "topic",
"type": "string",
"doc": "The topic this record belongs to. e.g. messages"
},
{
"name": "server",
"type": "string",
"doc": "The server that generated this event. e.g. xmpp-07"
},
{
"name": "service",
"type": "string",
"doc": "The service that generated this event. e.g. erlang-producer"
},
{
"name": "environment",
"type": "string",
"doc": "The environment this record belongs to. e.g. dev, prod"
},
{
"name": "time",
"type": "long",
"doc": "The time in epoch this record was produced."
}
]
}
},
{
"name": "eventType",
"type": {
"type": "enum",
"name": "eventType",
"symbols": [
"CREATE",
"UPDATE",
"DELETE",
"INIT"
]
},
"doc": "event type"
},
{
"name": "tagId",
"type": "string",
"doc": "Tag ID for the tag"
},
{
"name": "orgToken",
"type": "string",
"doc": "org ID"
},
{
"name": "tagName",
"type": "string",
"doc": "name of the tag"
},
{
"name": "colorId",
"type": "string",
"doc": "color id"
},
{
"name": "colorName",
"type": "string",
"doc": "color name"
},
{
"name": "colorValue",
"type": "string",
"doc": "color value e.g. #C8C8C8"
},
{
"name": "entities",
"type": [
"null",
{
"type": "array",
"items": {
"type": "record",
"name": "entity",
"fields": [
{
"name": "entityToken",
"type": "string"
},
{
"name": "entityType",
"type": "string"
}
]
}
}
],
"default": null
}
]
}
Above is the schema of the Kafka topic that I want to parse into a PySpark schema.
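One hedged observation, since the code that builds schema_message is not shown: if schema_message['SchemaDefinition'] is already a JSON string, then json.dumps followed by json.loads simply round-trips that string, so StructType.fromJson ends up indexing a str instead of a dict, which produces exactly this TypeError. A minimal sketch of that behaviour:
import json

# Assumption: the schema definition arrives as a JSON string rather than a dict.
schema_definition = '{"type": "record", "name": "tags", "fields": []}'

jtopy = json.dumps(schema_definition)   # wraps the string in another layer of quotes
dict_json = json.loads(jtopy)           # ...which loads back the same str, not a dict
print(type(dict_json))                  # <class 'str'>
# dict_json["fields"] would raise: TypeError: string indices must be integers
Calling json.loads directly on the original string would at least give a dict, but note that the definition above is an Avro record schema, while StructType.fromJson expects Spark's own StructType JSON layout ("type": "struct" with "fields" entries carrying name, type, nullable and metadata), so the Avro schema would still need to be mapped to that shape first.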
I have a JSON file and I need to write a schema for it in Oxygen.
"characters": [
{
"house":"Gryffindor",
"orderOfThePhoenix":false,
"name":"Cuthbert Binns",
"bloodStatus":"unknown",
"deathEater":false,
"dumbledoresArmy":false,
"school":"Hogwarts School of Witchcraft and Wizardry",
"role":"Professor, History of Magic",
"__v":0,
"ministryOfMagic":false,
"_id":"5a0fa67dae5bc100213c2333",
"species":"ghost"
}
],
"spells": [
{
"spell":"Aberto",
"effect":"opens objects",
"_id":"5b74ebd5fb6fc0739646754c",
"type":"Charm"
}
],
"houses": [
{
"values": [
"courage",
"bravery",
"nerve",
"chivalry"
],
"headOfHouse":"Minerva McGonagall",
"mascot":"lion",
"name":"Gryffindor",
"houseGhost":"Nearly Headless Nick",
"founder":"Goderic Gryffindor",
"colors": [
"scarlet",
"gold"
],
"school":"Hogwarts School of Witchcraft and Wizardry",
"__v":0,
"members": [
"5a0fa648ae5bc100213c2332",
"5a0fa67dae5bc100213c2333",
"5a0fa7dcae5bc100213c2338",
"5a123f130f5ae10021650dcc"
],
"_id":"5a05e2b252f721a3cf2ea33f"
},
The actual JSON file is much bigger, of course. If someone could send related links or some kind of tutorial, that would help too.
Could you please help me with creating a schema for it?
If you want to create a JSON Schema, the best way to start is to check the "json-schema.org" tutorials. You can find them here:
https://json-schema.org/learn/getting-started-step-by-step.html
https://json-schema.org/understanding-json-schema/
In the next version of Oxygen there will be support for creating a JSON Schema based on a JSON instance or on an XSD, but you will need to check the generated schema and customize it for your needs.
For example, for the instance you provided the schema can look something like this:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"characters": {"$ref": "#/definitions/characters_type"},
"spells": {"$ref": "#/definitions/spells_type"},
"houses": {"$ref": "#/definitions/houses_type"}
},
"definitions": {
"characters_type": {
"type": "array",
"minItems": 0,
"items": {
"type": "object",
"properties": {
"house": {"type": "string"},
"orderOfThePhoenix": {"type": "boolean"},
"name": {"type": "string"},
"bloodStatus": {"type": "string"},
"deathEater": {"type": "boolean"},
"dumbledoresArmy": {"type": "boolean"},
"school": {"type": "string"},
"role": {"type": "string"},
"__v": {"type": "number"},
"ministryOfMagic": {"type": "boolean"},
"_id": {"type": "string"},
"species": {"type": "string"}
},
"required": [
"role",
"bloodStatus",
"school",
"species",
"deathEater",
"dumbledoresArmy",
"__v",
"name",
"ministryOfMagic",
"_id",
"orderOfThePhoenix",
"house"
]
}
},
"spells_type": {
"type": "array",
"minItems": 0,
"items": {
"type": "object",
"properties": {
"spell": {"type": "string"},
"effect": {"type": "string"},
"_id": {"type": "string"},
"type": {"type": "string"}
},
"required": [
"spell",
"effect",
"_id",
"type"
]
}
},
"values_type": {
"type": "array",
"minItems": 0,
"items": {"type": "string"}
},
"houses_type": {
"type": "array",
"minItems": 0,
"items": {
"type": "object",
"properties": {
"values": {"$ref": "#/definitions/values_type"},
"headOfHouse": {"type": "string"},
"mascot": {"type": "string"},
"name": {"type": "string"},
"houseGhost": {"type": "string"},
"founder": {"type": "string"},
"colors": {"$ref": "#/definitions/values_type"},
"school": {"type": "string"},
"__v": {"type": "number"},
"members": {"$ref": "#/definitions/values_type"},
"_id": {"type": "string"}
},
"required": [
"headOfHouse",
"houseGhost",
"mascot",
"school",
"founder",
"values",
"__v",
"members",
"name",
"_id",
"colors"
]
}
}
}
}
Best Regards,
Octavian
Using Draft-07, what I got was valid JSON. What I expected from the audit object was the error:
directory: String length must be greater than or equal to 2
I tried two different validators with the same results:
https://www.jsonschemavalidator.net/
GoLang https://github.com/xeipuuv/gojsonschema
This is my schema
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "ISAM-Wrapper",
"description": "Validate isam wrapper json",
"type": "object",
"properties": {
"directory": {
"description": "path to location of isam file",
"type": "string",
"minLength": 2
},
"isamFile": {
"description": "isam database file",
"type": "string",
"minLength": 4
},
"isamIndex": {
"description": "isam index file",
"type": "string",
"minLength": 4
},
"port": {
"description": "port number for REST listener",
"type": "integer",
"minimum": 60410,
"maximum": 69999
},
"actions": {
"description": "Which operations are supported",
"type": "object",
"items": {
"properties": {
"create": {
"type": "boolean"
},
"read": {
"type": "boolean"
},
"update": {
"type": "boolean"
},
"delete": {
"type": "boolean"
}
}
},
"required": [
"create",
"read",
"update",
"delete"
]
},
"fields": {
"description": "each object describes one field of the isam file",
"type": "array",
"minItems": 1,
"items": {
"title": "field",
"description": "field schema",
"type": "object",
"properties": {
"name": {
"type": "string",
"minLength": 1
},
"ordinal": {
"type": "integer",
"minimum": 0
},
"offset": {
"type": "integer",
"minimum": 0
},
"length": {
"type": "integer",
"minimum": 1
},
"dataType": {
"enum": [
"uchar",
"ulong",
"long",
"uint",
"int",
"ushort",
"short"
]
}
},
"required": [
"name",
"ordinal",
"offset",
"length",
"dataType"
]
}
},
"audit": {
"description": "input needed to enable and configure isam auditing",
"type": "object",
"items": {
"properties": {
"enable": {
"enum": [
true,
false
]
},
"directory": {
"type": "string",
"minLength": 2
},
"fileName": {
"type": "string",
"minLength": 4
},
"workDirectory": {
"type": "string",
"minLength": 2
},
"archiveDirectory": {
"type": "string",
"minLength": 2
},
"interval": {
"type": "integer",
"minimum": 1
},
"byteThreshold": {
"type": "integer",
"minimum": 1048576,
"maximum": 1073741824
}
}
},
"required": [
"enable"
],
"if": {
"not": {
"properties": {
"enable": {
"enum": [
false
]
}
}
}
},
"then": {
"required": [
"directory",
"fileName",
"workDirectory",
"archiveDirectory",
"interval",
"byteThreshold"
]
}
}
},
"required": [
"directory",
"isamFile",
"isamIndex",
"port",
"actions",
"fields",
"audit"
]
}
This is my JSON
{
"directory": "./",
"isamFile": "isam.dat",
"isamIndex": "isam.idx",
"port": 60410,
"actions": {
"create": true,
"read": true,
"update": true,
"delete": true
},
"fields": [
{
"name": "F1",
"ordinal": 0,
"offset": 0,
"length": 4,
"dataType": "ulong"
},
{
"name": "F2",
"ordinal": 1,
"offset": 4,
"length": 4,
"dataType": "ulong"
}
],
"audit": {
"enable": true,
"directory": "",
"fileName": "file",
"workDirectory": "./work",
"archiveDirectory": "./archive",
"interval": 5,
"byteThreshold": 1500000
}
}
The issue you have is that your schema is invalid. For both actions and audit you specify these as objects, but you don't provide any properties. What you do, however, is specify an items key (which does nothing here; that's a keyword for arrays) that contains the properties.
Once you correct this error, the schema behaves as you intend; see https://repl.it/repls/BlankWellmadeFrontpage
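To make the fix concrete, here is a minimal sketch of the corrected audit definition, checked with Python's jsonschema package (an assumption for illustration; any Draft-07 validator should behave the same way, though the exact error wording differs). The sub-properties now sit directly under properties, and the if/then sits beside them:
from jsonschema import Draft7Validator

audit_schema = {
    "type": "object",
    "properties": {
        "enable": {"enum": [True, False]},
        "directory": {"type": "string", "minLength": 2},
        "fileName": {"type": "string", "minLength": 4},
        "workDirectory": {"type": "string", "minLength": 2},
        "archiveDirectory": {"type": "string", "minLength": 2},
        "interval": {"type": "integer", "minimum": 1},
        "byteThreshold": {"type": "integer", "minimum": 1048576, "maximum": 1073741824}
    },
    "required": ["enable"],
    "if": {"not": {"properties": {"enable": {"enum": [False]}}}},
    "then": {"required": ["directory", "fileName", "workDirectory",
                          "archiveDirectory", "interval", "byteThreshold"]}
}

audit = {"enable": True, "directory": "", "fileName": "file",
         "workDirectory": "./work", "archiveDirectory": "./archive",
         "interval": 5, "byteThreshold": 1500000}

for error in Draft7Validator(audit_schema).iter_errors(audit):
    print(error.message)   # reports that '' is too short for "directory"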
In our data we have JSON fields that include repeated sections, as well as infinite nesting possibilities (the samples I have so far are quite simplistic). After seeing BQ repeated fields and records, I decided to try restructuring the data into repeated record fields, since our use case is related to analytics, and then wanted to test out different use cases for the data to see which approach is more efficient (time/cost/difficulty) for the analysis we intend to do on it. I have created a sample JSON record that I want to upload to BQ, which uses all the features I think we would need (I have validated it using http://jsonlint.com/):
{
"aid": "6dQcrgMVS0",
"hour": "2016042723",
"unixTimestamp": "1461814784",
"browserId": "BdHOHp2aL9REz9dXVeKDaxdvefE3Bgn6NHZcDQKeuC67vuQ7PBIXXJda3SOu",
"experienceId": "EXJYULQOXQ05",
"experienceVersion": "1.0",
"pageRule": "V1XJW61TPI99UWR",
"userSegmentRule": "67S3YVMB7EMQ6LP",
"branch": [{
"branchId": "1",
"branchType": "userSegments",
"itemId": "userSegment67S3YVMB7EMQ6LP",
"headerId": "null",
"itemMethod": "null"
}, {
"branchId": "1",
"branchType": "userSegments",
"itemId": "userSegment67S3YVMB7EMQ6LP",
"headerId": "null",
"itemMethod": "null"
}],
"event": [{
"eventId": "546",
"eventName": "testEvent",
"eventDetails": [{
"key": "a",
"value": "1"
}, {
"key": "b",
"value": "2"
}, {
"key": "c",
"value": "3"
}]
}, {
"eventId": "547",
"eventName": "testEvent2",
"eventDetails": [{
"key": "d",
"value": "4"
}, {
"key": "e",
"value": "5"
}, {
"key": "f",
"value": "6"
}]
}]
}
I am using the BQ interface to upload this JSON into a table with the following structure:
[
{
"name": "aid",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "hour",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "unixTimestamp",
"type": "INTEGER",
"mode": "NULLABLE"
},
{
"name": "browserId",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "experienceId",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "experienceVersion",
"type": "FLOAT",
"mode": "NULLABLE"
},
{
"name": "pageRule",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "userSegmentRule",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "branch",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "branchId",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "branchType",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "itemId",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "headerId",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "itemMethod",
"type": "STRING",
"mode": "NULLABLE"
}
]
},
{
"name": "event",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "evenId",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "eventName",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "eventDetails",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "key",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "value",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
}
]
My jobs fail with a
JSON parsing error in row starting at position 0 in <file_id>. Expected key (error code: invalid)
It is possible that I can't have multiple levels of nesting in a table, but the error seems more as if there were an issue with parsing the JSON itself. I was able to generate and successfully import a JSON file with a simple repeated record (see example below):
{
"eventId": "546",
"eventName": "testEvent",
"eventDetails": [{
"key": "a",
"value": "1"
}, {
"key": "b",
"value": "2"
}, {
"key": "c",
"value": "3"
}]
}
Any advice is appreciated.
There doesn't seem to be anything problematic with your schema, so BigQuery should be able to load your data with it.
First, make sure you are uploading newline-delimited JSON to BigQuery. Your example row has many newline characters in the middle of your JSON row, and the parser is trying to interpret each line as a separate JSON row.
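For the first point, a small sketch of what that conversion might look like in Python (the file names here are assumptions):
import json

# Read the pretty-printed record and rewrite it as newline-delimited JSON:
# one complete record per line, with no newline characters inside a record.
with open('record.json') as src:
    record = json.load(src)

with open('record.ndjson', 'w') as dst:
    dst.write(json.dumps(record) + '\n')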
Second, it looks like your schema has the key "evenId" in the "event" record, but your example row has the key "eventId".