Databricks reading JSON with JSON Schema 2020-12

I have json and its schema defined from the source as below:
Schema.json
{
  "$id": "https://example.com/person.schema.json",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Person",
  "type": "object",
  "properties": {
    "firstName": {"type": "string", "description": "The person's first name."},
    "lastName": {"type": "string", "description": "The person's last name."},
    "age": {
      "description": "Age in years which must be equal to or greater than zero.",
      "type": "integer",
      "minimum": 0
    }
  }
}
Data.json
{"firstName": "John","lastName": "Doe","age": 21}
Now I want to load the JSON into Databricks using the given schema.
My schema is in JSON Schema 2020-12 format, but PySpark expects a schema made up of fields (a StructType). Is there any way to automatically derive the PySpark schema from the JSON Schema so that I can explode the data at will? Any help pointing me down the right path is much appreciated, since I've been stuck on this for a day!
When I use StructType.fromJson(schema_plain_text), it returns:
TypeError: string indices must be integers
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-353243474841> in <cell line: 1>()
----> 1 StructType.fromJson(schema_plain_text)
/databricks/spark/python/pyspark/sql/types.py in fromJson(cls, json)
810 @classmethod
811 def fromJson(cls, json: Dict[str, Any]) -> "StructType":
--> 812 return StructType([StructField.fromJson(f) for f in json["fields"]])
813
814 def fieldNames(self) -> List[str]:
TypeError: string indices must be integers
P.S. My actual schema is much more complex, with hundreds of columns, so creating the schema manually with a StructType definition is not an option.
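
For context on the error: StructType.fromJson expects an already-parsed dict in Spark's own schema format (one with a top-level "fields" key), so passing a raw string, or a JSON Schema document, fails exactly as shown above. Since JSON Schema and Spark's schema format are different languages, some translation is needed. Below is a minimal sketch of such a translation; the helper name json_schema_to_spark is made up for illustration, it assumes a Databricks notebook where spark already exists, and it handles only plain object/array/scalar types (no unions, refs, or formats):

import json
from pyspark.sql.types import (
    ArrayType, BooleanType, DataType, DoubleType,
    IntegerType, StringType, StructField, StructType,
)

# Map JSON Schema scalar types onto Spark types.
_SCALARS = {
    "string": StringType(),
    "integer": IntegerType(),
    "number": DoubleType(),
    "boolean": BooleanType(),
}

def json_schema_to_spark(node: dict) -> DataType:
    """Recursively translate a simple JSON Schema node into a Spark type."""
    t = node.get("type")
    if t == "object":
        return StructType([
            StructField(name, json_schema_to_spark(prop), nullable=True)
            for name, prop in node.get("properties", {}).items()
        ])
    if t == "array":
        return ArrayType(json_schema_to_spark(node.get("items", {})))
    return _SCALARS.get(t, StringType())  # unknown types fall back to string

spark_schema = json_schema_to_spark(json.loads(schema_plain_text))
df = spark.read.schema(spark_schema).json("/path/to/Data.json")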

Related

JSON Schema object properties defined by enum

I'm attempting to reuse an enum in my JSON Schema to define the properties for an object.
I was wondering if the following is correct.
JSON Schema
{
"type": "object",
"propertyNames": {
"enum": ["Foo","Bar"]
},
"patternProperties": {
".*": {
"type": "number"
}
}
}
JSON Data
{
"Foo": 123,
"Bar": 456
}
The reason I ask is that I get inconsistent results from JSON Schema validation libraries. Some indicate the JSON validates, while others indicate the JSON is invalid.
P.S. If anyone is wondering why I'm trying to define the properties with an enum: the enum is shared in various parts of my JSON Schema. In some cases it is a constraint on a string, but I need the identical set of possible values both on those string properties and also on the object properties. As an enum, I can maintain the set of possible values in one place.
Yes, that's a valid JSON Schema. You could also express it like this:
{
"type": "object",
"propertyNames": {
"enum": ["Foo","Bar"]
},
"additionalProperties": {
"type": "number"
}
}
It says "all property names must conform to this schema: (one of these values listed in the enum); also, all property values must conform to this schema: (must be numeric type)."
What errors do you get from the implementations that report this as invalid? Those implementations have a bug; would you consider reporting it to them?
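
If you want to test this concretely, here is a quick check using the Python jsonschema package (propertyNames is supported from Draft 6 onward); treat it as a minimal sketch rather than a reference implementation:

from jsonschema import ValidationError, validate

schema = {
    "type": "object",
    "propertyNames": {"enum": ["Foo", "Bar"]},
    "additionalProperties": {"type": "number"},
}

validate({"Foo": 123, "Bar": 456}, schema)  # passes silently

try:
    validate({"Baz": 789}, schema)  # fails: "Baz" is not in the enum
except ValidationError as err:
    print(err.message)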

Storing JSON blob in an Avro Field

I have inherited a project where an Avro file is being consumed by Snowflake. The schema of the Avro file is as follows:
{
"name": "TableName",
"namespace": "sqlserver",
"type": "record",
"fields": [
{
"name": "hAccount",
"type": "string"
},
{
"name": "hTableName",
"type": "string"
},
{
"name": "hRawJSON",
"type": "string"
}
]
}
The hRawJSON field is a blob of JSON itself. The previous dev typed it as a string, and this is where I believe the problem lies.
The application takes a JSON object (the JSON is variable, so I never know its contents) and populates the hRawJSON field in the Avro record. But the stored value contains escape characters for the double quotes in the string:
hAccount:"H11122"
hTableName:"Departments"
hRawJSON:"{\"DepartmentID\":1,\"ModelID\":0,\"Description\":\"P Medicines\",\"Margin\":\"3.300000000000000e+001\",\"UCSVATRateID\":0,\"References\":719,\"HeadOfficeID\":1,\"DividendID\":0}"
As a result the JSON blob is staged into Snowflake as a VARIANT field but still retains the escape characters:
[Snowflake screenshot]
This means that when querying the data in the JSON, I constantly have to use this:
PARSE_JSON(RAW_FILE:hRawJSON):DepartmentID
I can't help feeling that the string field type in the Avro file is causing the issue and that a different type should be used. I've tried record, but without fields it's unusable; doc didn't work either.
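
Worth noting: this escaping isn't specific to Avro (which has no native JSON type anyway); it is what happens whenever a JSON document is carried inside a string field and then serialized again. A small plain-Python sketch, with field names borrowed from the question, shows the same effect:

import json

raw = {"DepartmentID": 1, "Description": "P Medicines"}

# The application stores the JSON document as a string...
h_raw_json = json.dumps(raw)

# ...so any outer serialization has to escape the inner quotes.
record = {"hAccount": "H11122", "hRawJSON": h_raw_json}
print(json.dumps(record))
# {"hAccount": "H11122", "hRawJSON": "{\"DepartmentID\": 1, \"Description\": \"P Medicines\"}"}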
The other alternative is that this behavior is correct and when moving the hRawJSON from staging into "proper" tables I should use something like:
INSERT INTO DATA.PUBLIC.DEPARTMENTS
SELECT
RAW_FILE:hAccount::VARCHAR(4) as Account,
PARSE_JSON(RAW_FILE:hRawJSON) as JsonRaw
FROM DATA.STAGING.AVRO_RAW WHERE RAW_FILE:hTableName::STRING = 'Department';
So if this is the correct approach and I'm overthinking this, I'd appreciate guidance.

Debezium Connector for PostgreSQL introduces \ in topic data of type JSON

I am new to Kafka and would like some advice on this problem. I am trying to produce data to a Kafka topic from a table in Postgres. One of the columns has type json, which is defined like this in the schema file (.avsc):
{
"name": "details",
"type": [
"null",
{
"type": "string",
"connect.version": 1,
"connect.name": "io.debezium.data.Json"
}
],
"default": null
}
As per https://debezium.io/documentation/reference/0.9/connectors/postgresql.html#data-types, it is mapped to the string data type of the Kafka connector.
The DB column contains data like:
{"type":"User","id":"123","attributes":{"id":"123","state":"active"}}
But the topic produces it like this:
{\"type\":\"User\",\"id\":\"123\",\"attributes\":{\"id\":\"123\",\"state\":\"active\"}}
I want this to be produced as the same string that was passed in, without the \. So the expected output stream for the topic should have something like:
{"type":"User","id":"123","attributes":{"id":"123","state":"active"}}
What would be the best way to achieve it?
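
For what it's worth, the backslashes are an artifact of the json column traveling as a JSON-encoded string inside the message envelope, so a consumer can recover the original object by parsing the field a second time. A minimal Python sketch (the message value is hand-built here to stand in for a real Kafka record):

import json

# The "details" column arrives as a JSON document encoded inside a string,
# so its inner quotes show up escaped on the wire.
message_value = '{"details": "{\\"type\\":\\"User\\",\\"id\\":\\"123\\"}"}'

envelope = json.loads(message_value)       # parse the outer envelope
details = json.loads(envelope["details"])  # parse the embedded JSON column
print(details["type"])                     # -> User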

Unable to parse a JSON list in Azure Data Factory (ADF)

In my Data Factory pipeline I have a Web activity which returns the JSON response below. In the next Stored Procedure activity I am unable to parse the output parameter. I have tried a few methods.
I have set Content-Type to application/json in the Web activity.
Sample JSON:
Output
{
"Response": "[{\"Message\":\"Number of barcode(s) found:1\",\"Status\":\"Success\",\"CCS Office\":[{\"Name\":\"Woodstock\",\"CCS Description\":null,\"BranchType\":\"Sub CFS Office\",\"Status\":\"Active\",\"Circle\":\"NJ\"}]}]"
}
For parameter in stored procedure activity:
@json(first(activity('Web1').output.Response))
output - System.Collections.Generic.List`1[System.Object]
@json(activity('Web1').output.Response[0])
output - cannot be evaluated because property '0' cannot be selected. Property selection is not supported on values of type 'String'
@json(activity('Web1').output.Response.Message)
output - cannot be evaluated because property 'Message' cannot be selected. Property selection is not supported on values of type 'String'
Here is what I did:
I created a new pipeline, and created a parameter of type 'object' using your 'output' in its entirety:
{ "Response": "[{\"Message\":\"Number of barcode(s) found:1\",\"Status\":\"Success\",\"CCS Office\":[{\"Name\":\"Woodstock\",\"CCS Description\":null,\"BranchType\":\"Sub CFS Office\",\"Status\":\"Active\",\"Circle\":\"NJ\"}]}]" }
I created a variable and a Set Variable activity. The variable is of type string. The dynamic expression I used is:
@{json(pipeline().parameters.output.response)[0]}
Let me break it down and explain. The {curly braces} were necessary because the variable is of type string; you may not want or need them.
json(....)
was necessary because the value of 'response' is a string. Whether it being a string is correct behavior or not is a different discussion. By converting from string to JSON, I can now do the final piece.
[0]
now works because Data Factory sees the contents as objects rather than a string literal. This conversion seems to apply to the nested contents as well, because without the encapsulating {curly braces} to convert back to a string, I would get a type error from my Set Variable activity, as the variable is of type string.
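
The same distinction can be seen in plain Python terms (purely an analogy, not ADF code): indexing the raw string yields a single character, while parsing it first yields the list element you actually want.

import json

output = {"Response": '[{"Message": "Number of barcode(s) found:1", "Status": "Success"}]'}

print(output["Response"][0])               # '[' -- indexing a string gives a character
first = json.loads(output["Response"])[0]  # parse first, then index
print(first["Message"])                    # Number of barcode(s) found:1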
Entire pipeline code:
{
"name": "pipeline11",
"properties": {
"activities": [
{
"name": "Set Variable1",
"type": "SetVariable",
"dependsOn": [],
"userProperties": [],
"typeProperties": {
"variableName": "thing",
"value": {
"value": "#{json(pipeline().parameters.output.response)[0]}",
"type": "Expression"
}
}
}
],
"parameters": {
"output": {
"type": "object",
"defaultValue": {
"Response": "[{\"Message\":\"Number of barcode(s) found:1\",\"Status\":\"Success\",\"CCS Office\":[{\"Name\":\"Woodstock\",\"CCS Description\":null,\"BranchType\":\"Sub CFS Office\",\"Status\":\"Active\",\"Circle\":\"NJ\"}]}]"
}
}
},
"variables": {
"thing": {
"type": "String"
}
},
"annotations": []
}
}
I had a similar problem, and this is how I resolved it.
I passed the value of Response as a string to a Lookup activity which calls a stored procedure in Azure SQL. The stored procedure parses the string using JSON_VALUE and returns the individual keys and values as rows. The output of the Lookup activity can then be accessed directly by subsequent activities.

How to validate number of properties in JSON schema

I am trying to create a schema for a piece of JSON and have slimmed down an example of what I am trying to achieve.
I have the following JSON schema:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "Set name",
"description": "The exmaple schema",
"type": "object",
"properties": {
"name": {
"type": "string"
}
},
"additionalProperties": false
}
The following JSON is classed as valid when compared to the schema:
{
"name": "W",
"name": "W"
}
I know that there should be a warning about the two fields having the same name, but is there a way to force validation to fail if the above is submitted? I want it to validate only when there is a single occurrence of the field 'name'.
This is outside the responsibility of JSON Schema. JSON Schema is built on top of JSON, and in JSON the behavior of duplicate properties in an object is undefined. If you want a warning about this, you should run the document through a separate validation step to ensure it is well-formed JSON before passing it to a JSON Schema validator.
There is a maxProperties constraint that can limit the total number of properties in an object.
But data with duplicated properties is a tricky case, as many JSON decoding implementations silently ignore duplicates.
So your JSON Schema validation library would not even know a duplicate existed.
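
To illustrate that last point: Python's json module keeps only the last occurrence of a duplicated key, so a schema validator downstream never sees the duplicate. A pre-validation step using object_pairs_hook (the reject_duplicates helper below is illustrative) can reject such documents before the schema check:

import json

doc = '{"name": "W", "name": "X"}'
print(json.loads(doc))  # {'name': 'X'} -- the first "name" is silently dropped

def reject_duplicates(pairs):
    """Fail loudly if any key appears more than once in a JSON object."""
    seen = set()
    for key, _ in pairs:
        if key in seen:
            raise ValueError(f"duplicate key: {key!r}")
        seen.add(key)
    return dict(pairs)

json.loads(doc, object_pairs_hook=reject_duplicates)  # raises ValueError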