Storing a JSON blob in an Avro field

I have inherited a project in which an Avro file is consumed by Snowflake. The schema of the Avro file is as follows:
{
  "name": "TableName",
  "namespace": "sqlserver",
  "type": "record",
  "fields": [
    {
      "name": "hAccount",
      "type": "string"
    },
    {
      "name": "hTableName",
      "type": "string"
    },
    {
      "name": "hRawJSON",
      "type": "string"
    }
  ]
}
The hRawJSON field is itself a blob of JSON. The previous developer declared it with a type of string, and this is where I believe the problem lies.
The application takes a JSON object (the JSON is variable, so I never know its contents in advance) and populates the hRawJSON field in the Avro record, but the string contains escape characters for the double quotes:
hAccount:"H11122"
hTableName:"Departments"
hRawJSON:"{\"DepartmentID\":1,\"ModelID\":0,\"Description\":\"P Medicines\",\"Margin\":\"3.300000000000000e+001\",\"UCSVATRateID\":0,\"References\":719,\"HeadOfficeID\":1,\"DividendID\":0}"
As a result the JSON blob is staged into Snowflake as a VARIANT field but still retains the escape characters:
[screenshot of the staged data in Snowflake]
This means when querying the data in the JSON I constantly have to use this:
PARSE_JSON(RAW_FILE:hRawJSON):DepartmentID
I can't help feeling that the string field type in the Avro file is causing the issue and that a different type should be used. I've tried record, but without fields it's unusable, and doc doesn't work either.
The other alternative is that this behavior is correct and that, when moving hRawJSON from staging into "proper" tables, I should use something like:
INSERT INTO DATA.PUBLIC.DEPARTMENTS
SELECT
RAW_FILE:hAccount::VARCHAR(4) as Account,
PARSE_JSON(RAW_FILE:hRawJSON) as JsonRaw
FROM DATA.STAGING.AVRO_RAW WHERE RAW_FILE:hTableName::STRING = 'Departments';
So if this is the correct approach and I'm overthinking this, I'd appreciate guidance.
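For what it's worth, one option I'm considering is to parse the string once in a view over the staging table, so downstream queries can address the JSON directly. This is only a sketch; the view and output column names are placeholders:

-- Sketch: apply PARSE_JSON once so downstream queries get a usable VARIANT
CREATE OR REPLACE VIEW DATA.STAGING.AVRO_PARSED AS
SELECT
    RAW_FILE:hAccount::VARCHAR(4)   AS ACCOUNT,
    RAW_FILE:hTableName::STRING     AS TABLE_NAME,
    PARSE_JSON(RAW_FILE:hRawJSON)   AS JSON_RAW
FROM DATA.STAGING.AVRO_RAW;

-- Downstream queries could then read the VARIANT directly, e.g.:
-- SELECT JSON_RAW:DepartmentID FROM DATA.STAGING.AVRO_PARSED WHERE TABLE_NAME = 'Departments';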

Related

Is there a way to transform a JSON Schema definition into a Big Query Schema definition?

According to https://cloud.google.com/bigquery/docs/schemas, a schema definition file looks like:
[
  {
    "name": string,
    "type": string,
    "mode": string,
    "fields": [
      {
        object (TableFieldSchema)
      }
    ],
    "description": string,
    "policyTags": {
      "names": [
        string
      ]
    },
    "maxLength": string,
    "precision": string,
    "scale": string,
    "collation": string,
    "defaultValueExpression": string
  },
  {
    "name": string,
    "type": string,
    ...
  }
]
Is there any tool or product that can take a https://json-schema.org file and convert it to the form that BigQuery prefers?
You can detect the schema of a file (stored in a bucket in the same GCP project, for example) by creating an external table that points to your file. The data from your file will then appear in BigQuery. (You can use the command line too; I have never used it, but it exists as well.)
Example with CSV (JSON is possible too):
CREATE OR REPLACE EXTERNAL TABLE projectGCP.DatasetsGCP.TableGCP
OPTIONS (
  format = 'CSV',
  uris = ['gs://nameofmybucket/*pattern_i_want_tobe_detect_inthe_namefile.csv']
)
After doing that, you can go to the table you just created and get the BigQuery schema of the table.
Here is more information on how to do it (you can provide the schema inline on the command line, or provide a JSON file containing the schema definition): https://cloud.google.com/bigquery/docs/external-table-definition
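As a sketch only, the JSON equivalent of the example above might look like this; the table and bucket names are placeholders, and I am assuming a newline-delimited JSON file so that BigQuery can auto-detect the schema:

-- Sketch: let BigQuery auto-detect the schema of a newline-delimited JSON file
CREATE OR REPLACE EXTERNAL TABLE projectGCP.DatasetsGCP.TableFromJSON
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://nameofmybucket/myfile.json']
);
-- The detected schema can then be dumped in schema-definition form, e.g.
-- bq show --schema --format=prettyjson projectGCP:DatasetsGCP.TableFromJSON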

How to validate number of properties in JSON schema

I am trying to create a schema for a piece of JSON and have slimmed down an example of what I am trying to achieve.
I have the following JSON schema:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Set name",
  "description": "The example schema",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    }
  },
  "additionalProperties": false
}
The following JSON is classed as valid when compared to the schema:
{
  "name": "W",
  "name": "W"
}
I know that there should be a warning about the two fields having the same name, but is there a way to force the validation to fail if the above is submitted? I want it to validate only when there is a single occurrence of the field 'name'.
This is outside of the responsibility of JSON Schema. JSON Schema is built on top of JSON. In JSON, the behavior of duplicate properties in an object is undefined. If you want a warning about this, you should run the document through a separate validation step to ensure it is valid JSON before passing it to a JSON Schema validator.
There is a maxProperties constraint that can limit the total number of properties in an object (a sketch follows below). However, data with duplicated properties is a tricky case: many JSON decoding implementations ignore duplicates, so your JSON Schema validation library would not even know a duplicate existed.
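For illustration, a minimal sketch of the schema from the question with a maxProperties constraint added (this limits the number of properties the decoder actually reports, not duplicates in the raw text):

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Set name",
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  },
  "additionalProperties": false,
  "maxProperties": 1
}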

Reading Inconsistent Nested JSON in Athena

In Athena, I am reading some nested JSON files into a table. The field that actually contains the nested JSON has an inconsistent number of fields within it across the different files in the raw data.
Sometimes the data looks something like this:
{
  "id": "9f1e07b4",
  "date": "05/20/2018 02:30:53.110 AM",
  "data": {
    "a": "asd",
    "b": "adf",
    "body": {
      "sid": {
        "uif": "yes",
        "sidd": "no",
        "state": "idle"
      }
    },
    "category": "scene"
  }
}
Other times the data looks something like this:
{
  "id": "9f1e07b4",
  "date": "05/20/2018 02:30:45.436 AM",
  "data": {
    "a": "event",
    "b": "state",
    "body": {
      "persona": {
        "one": {
          "movement": "idle"
        }
      }
    },
    "category": "scene"
  }
}
Other times the "body" field contains both the "sid" struct and the "persona" struct.
As you can see, the fields given within "body" are not always consistent. I tried to add all of the possible fields and their structures within my CREATE EXTERNAL TABLE query (a sketch of what I mean is below). However, the "data" column that contains the "body" field still does not populate and remains blank when I "preview table" in Athena.
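For reference, this is roughly the shape of the DDL I have been trying; the table name, location, and SerDe choice here are placeholders, and the column types are my guesses from the sample files above:

CREATE EXTERNAL TABLE my_events (
  id string,
  `date` string,
  data struct<
    a: string,
    b: string,
    body: struct<
      sid: struct<uif: string, sidd: string, state: string>,
      persona: struct<one: struct<movement: string>>
    >,
    category: string
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-bucket/raw-events/';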
In the CREATE TABLE DDL, is there a way to indicate that I want to fill all of columns that aren't present in the nested JSON of each file with null values?
Furthermore, the 'names' given to the fields in the query do not have to correspond to the key values in the raw JSON. It seems Athena is simply reading the structure and nothing else. Is there a way to indicate which JSON key corresponds to which Athena field name directly? So that if some fields are missing from the "body" of one file, Athena can know which one is missing and fill it in as null?

Key values of 'key' and 'type' in json schema

I am given to understand that the words type and id are reserved words in json schema. Is there any way to set these as keys in json schema? Here is an example of what I am trying to do.
"id": {
"type": "string"
},
"featureType": {
"type": "string"
},
"type": {
"type": "string"
}
I have tried validating this using a number of tools (including here). Googling around yields no suggestions either. Any help much appreciated. Cheers!
The snippet you pasted above will most probably work. "type" and "id" are reserved keys, but they only have special meaning when their corresponding value is a string. Since the values are objects in your case, there is no problem. I'm not 100% sure whether the JSON Schema spec explicitly states this, but this is how implementations usually work.
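For illustration, embedding that snippet in a complete schema gives something like the following; keys under "properties" are just property names, so "id" and "type" carry no keyword meaning there:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "id": { "type": "string" },
    "featureType": { "type": "string" },
    "type": { "type": "string" }
  }
}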

Is it possible to have an optional field in an Avro schema (i.e. the field does not appear at all in the .json file)?

In my Avro schema, I have two fields:
{"name": "author", "type": ["null", "string"], "default": null},
{"name": "importance", "type": ["null", "string"], "default": null},
And in my JSON files those two fields can exist or not.
However, when they do not exist, I receive an error (e.g. when I test such a JSON file using avro-tools command line client):
Expected field name not found: author
I understand that as long as the field name exists in a JSON, it can be null or a string value, but what I'm trying to express is something like "this JSON is valid if those field names do not exist, OR if they exist and they are null or string".
Is this possible to express in an Avro schema? If so, how?
You can define the default attribute as "undefined", for example, so that the field can be skipped:
{
  "name": "first_name",
  "type": "string",
  "default": "undefined"
},
Also, all fields are mandatory in Avro. If you want a field to be optional, then union its type with null, for example:
{
  "name": "username",
  "type": [
    "null",
    "string"
  ],
  "default": null
},
According to the Avro specification this is possible, using the default attribute.
See https://avro.apache.org/docs/1.8.2/spec.html
default: A default value for this field, used when reading instances that lack this field (optional). Permitted values depend on the field's schema type, according to the table below. Default values for union fields correspond to the first schema in the union.
In the example you gave, you do add the default attribute with the value null, so this should work. However, support for this also depends on the library you use for reading the Avro message (there are libraries for C, C++, Python, Java, C#, Ruby, etc.). Maybe (probably) the library you use lacks this feature.
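For reference, the complete record schema implied by the two field definitions in the question might look roughly like this (the record name here is just a placeholder):

{
  "type": "record",
  "name": "Document",
  "fields": [
    {"name": "author", "type": ["null", "string"], "default": null},
    {"name": "importance", "type": ["null", "string"], "default": null}
  ]
}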