Date Field Schema in Avro Input Kettle - json

I'm using Pentaho Data Integration (Kettle) for an ETL process, extracting from a MongoDB source.
My source has an ISODate field, so the JSON returned from the extraction looks like:
{ "_id" : { "$oid" : "533a0180e4b026f66594a13b"} , "fac_fecha" : { "$date" : "2014-04-01T00:00:00.760Z"} , "fac_fedlogin" : "KAYAK"}
So now I have to deserialize this JSON with an Avro Input step, so I've defined the Avro schema like this:
{
  "type": "record",
  "name": "xml_feeds",
  "fields": [
    {"name": "fac_fedlogin", "type": "string"},
    {"name": "fac_empcod", "type": "string"},
    {"name": "fac_fecha", "type": "string"}
  ]
}
Ideally fac_fecha would be a date type, but Avro doesn't support one.
At execution time, the Avro Input step rejects all rows as erroneous. This only occurs when I include the date field.
Any suggestions on how I can do this?
Kettle version: 4.4.0
Pentaho-big-data-plugin: 1.3.0

You can convert this date string to a long (milliseconds since the epoch).
This can be done in both Java and JavaScript.
You can then convert the long back to a Date if required.
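For example, a minimal Java sketch of the round trip (the class name is illustrative; the same logic could live in, say, a User Defined Java Class step):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class IsoDateToMillis {
    public static void main(String[] args) throws Exception {
        // MongoDB's $date strings are ISO-8601 with milliseconds, in UTC.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));

        // String -> long (milliseconds since the epoch), which Avro can hold as "long".
        long millis = fmt.parse("2014-04-01T00:00:00.760Z").getTime();
        System.out.println(millis); // 1396310400760

        // ...and back to a Date on the consuming side, if required.
        Date restored = new Date(millis);
        System.out.println(fmt.format(restored)); // 2014-04-01T00:00:00.760Z
    }
}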

The easiest solution I found for this problem was upgrading the Pentaho Big Data Plugin to a newer version, 1.3.3.
With this new version, explicitly declaring the schema for the MongoDB Input JSON is avoided. The final solution looks like this:
Global view: [screenshot of the transformation]
And inside MongoDB Input: [screenshot of the step configuration]
The schema is determined automatically, and it can be modified.

Related

Storing JSON blob in an Avro Field

I have inherited a project where an Avro file is being consumed by Snowflake. The schema of the Avro is as follows:
{
  "name": "TableName",
  "namespace": "sqlserver",
  "type": "record",
  "fields": [
    {"name": "hAccount", "type": "string"},
    {"name": "hTableName", "type": "string"},
    {"name": "hRawJSON", "type": "string"}
  ]
}
The hRawJSON field is a blob of JSON itself. The previous dev made it a string type, and this is where I believe the problem lies.
The application takes a JSON object (the JSON is variable, so I never know its contents) and populates the hRawJSON field in the Avro record. But it contains the escape characters for the double quotes in the string:
hAccount:"H11122"
hTableName:"Departments"
hRawJSON:"{\"DepartmentID\":1,\"ModelID\":0,\"Description\":\"P Medicines\",\"Margin\":\"3.300000000000000e+001\",\"UCSVATRateID\":0,\"References\":719,\"HeadOfficeID\":1,\"DividendID\":0}"
As a result the JSON blob is staged into Snowflake as a VARIANT field but still retains the escape characters:
[screenshot: the staged VARIANT column in Snowflake, still showing the escaped quotes]
This means when querying the data in the JSON I constantly have to use this:
PARSE_JSON(RAW_FILE:hRawJSON):DepartmentID
I can't help feeling that the string field type in the Avro file is causing the issue and that a different type should be used. I've tried record, but without fields it's unusable; doc doesn't work either.
The other alternative is that this behavior is correct and when moving the hRawJSON from staging into "proper" tables I should use something like:
INSERT INTO DATA.PUBLIC.DEPARTMENTS
SELECT
RAW_FILE:hAccount::VARCHAR(4) as Account,
PARSE_JSON(RAW_FILE:hRawJSON) as JsonRaw
FROM DATA.STAGING.AVRO_RAW WHERE RAW_FILE:hTableName::STRING = 'Department';
So if this is the correct approach and I'm overthinking it, I'd appreciate guidance.

Can we interchange JSON schema with YAML schema, or vice versa?

I have a device application which gets its data in JSON format. This JSON is generated by another, web-based application using a YAML schema.
Now, as the web tool validates the JSON data file against the YAML schema, my device application also has to validate it against a schema. Since resources on my device are limited and we already have JSON schema validation in place, we are restricted to schemas in JSON format only.
So my question is: could we replace the YAML schema with a JSON schema for the web tool? The web application uses Swagger.
On another note, is there any existing script or open-source tool to convert a YAML schema to a JSON schema?
I'm not sure about the OpenAPI definition. It's a simple schema file that will be used to validate JSON data. The JSON schema (draft v4) has the format below. Our device application is written in C++. I'm not sure what is used in the web tool, but it has some Swagger framework that generates the JSON data file for us.
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "definitions": {
    ...
    "foobar_Result": {
      "type": "object",
      "properties": {
        "request": {"type": "integer"},
        "success": {"type": "boolean"},
        "payload": {
          "type": "array",
          "items": {"$ref": "#/definitions/foobar_Parameter"}
        }
      },
      "required": ["request"],
      "additionalProperties": false
    }
  },
  "$ref": "#/definitions/foobar_Result"
}
If you are looking to convert between API specification formats, then this tool might help: https://www.apimatic.io/transformer/
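Beyond that tool: since YAML is a superset of JSON, a plain YAML schema can usually be converted mechanically. A minimal Java sketch using Jackson's YAML module (jackson-dataformat-yaml); the inline schema fragment is just an example:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLMapper;

public class YamlSchemaToJson {
    public static void main(String[] args) throws Exception {
        // A YAML rendering of a fragment of the schema above (example only).
        String yamlSchema =
              "type: object\n"
            + "properties:\n"
            + "  request:\n"
            + "    type: integer\n"
            + "required: [request]\n"
            + "additionalProperties: false\n";

        // Parse the YAML into a generic tree, then re-serialize that tree as JSON.
        JsonNode tree = new YAMLMapper().readTree(yamlSchema);
        String jsonSchema = new ObjectMapper()
                .writerWithDefaultPrettyPrinter()
                .writeValueAsString(tree);
        System.out.println(jsonSchema);
    }
}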

Can I validate JSON Schema draft-7 with tv4?

I have to upgrade a JavaScript application that validates JSON with JSON Schema. The old version uses tv4 to validate JSON Schema draft-4. I need to use draft-7 in the new software.
I just replaced the schema with a draft-7 JSON file in the current code. It worked fine at the beginning, but later the app started to show errors related to tv4.
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "lastName": {
      "type": "string"
    },
    ...
  }
}
My question is: can I use tv4 with draft-7? Is there a draft-7 library to replace tv4?
I found that the ajv library can be used to replace tv4.

Constructing JSON string from Oracle DB

I have a Web application which gets its data from a JSON string.
The JSON is in the following format:
{
  "contacts": [{
    "type": "contact",
    "name": "John Doe",
    "contact": 1,
    "links": ["Spouse", "Friends", "Jane Doe", "Harry Smith"]
  }]
}
Now, this is sample data; my actual DB is Oracle. My question is: how do I construct this JSON from Oracle?
This is the best method I've come across: http://ora-00001.blogspot.sk/2010/02/ref-cursor-to-json.html.
To summarise:
Use the DBMS_XMLGEN package to generate XML from a SYS_REFCURSOR.
Then transform it to JSON using the XSLT from that post.
I like it because there's no manual generation and because you have the option of returning XML too by skipping the final transformation.
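For completeness, a rough Java sketch of how that pipeline might be driven over JDBC; the connection details, the query, and the xml-to-json.xsl stylesheet (standing in for the XSLT from the linked post) are all placeholders:

import java.io.StringReader;
import java.io.StringWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RefCursorToJson {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//localhost:1521/ORCL", "scott", "tiger");
             Statement st = conn.createStatement();
             // Step 1: have DBMS_XMLGEN render the query result as XML.
             ResultSet rs = st.executeQuery(
                 "SELECT DBMS_XMLGEN.GETXML('SELECT name, contact FROM contacts') FROM dual")) {
            rs.next();
            String xml = rs.getString(1);

            // Step 2: transform the XML to JSON with the stylesheet from the post
            // (xml-to-json.xsl is a placeholder for that XSLT).
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource("xml-to-json.xsl"));
            StringWriter json = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(json));
            System.out.println(json);
        }
    }
}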

Can you put comments in Avro JSON schema files?

I'm writing my first Avro schema, which uses JSON as the schema language. I know you cannot put comments into plain JSON, but I'm wondering whether the Avro tooling allows them, e.g. by stripping them (like a preprocessor) before parsing the JSON.
Edit: I'm using the C++ Avro toolchain.
Yes, but it is limited. In the schema, Avro data types 'record', 'enum', and 'fixed' allow for a 'doc' field that contains an arbitrary documentation string. For example:
{"type": "record", "name": "test.Weather",
"doc": "A weather reading.",
"fields": [
{"name": "station", "type": "string", "order": "ignore"},
{"name": "time", "type": "long"},
{"name": "temp", "type": "int"}
]
}
From the official Avro spec:
doc: a JSON string providing documentation to the user of this schema (optional).
https://avro.apache.org/docs/current/spec.html#schema_record
An example:
https://github.com/apache/avro/blob/33d495840c896b693b7f37b5ec786ac1acacd3b4/share/test/schemas/weather.avsc#L2
Yes, you can use C-style comments in an Avro JSON schema: /* something */ or // something. The Avro tools ignore these during parsing.
EDIT: this only works with the Java API.
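For example, a minimal sketch with the Avro Java API (the schema content is just an example); this works reportedly because the Java parser's underlying Jackson factory is configured to allow comments:

import org.apache.avro.Schema;

public class CommentedSchema {
    public static void main(String[] args) {
        // Comments are tolerated and stripped by the Java parser;
        // other ports reject this same input.
        String schemaJson =
              "/* A weather reading. */\n"
            + "{\"type\": \"record\", \"name\": \"test.Weather\",\n"
            + " \"fields\": [\n"
            + "   {\"name\": \"station\", \"type\": \"string\"}, // station id\n"
            + "   {\"name\": \"temp\", \"type\": \"int\"}\n"
            + " ]}\n";

        Schema schema = new Schema.Parser().parse(schemaJson);
        System.out.println(schema.toString(true)); // comments do not survive parsing
    }
}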
According to the current (1.9.2) Avro specification, extra attributes that are not defined by the spec are permitted and kept as metadata.
This allows you to add comments like this:
{
  "type": "record",
  "name": "test",
  "comment": "This is a comment",
  "//": "This is also a comment",
  "TODO": "As per this comment we should remember to fix this schema",
  "fields": [
    {"name": "a", "type": "long"},
    {"name": "b", "type": "string"}
  ]
}
No, you can't in the C++ or C# versions (as of 1.7.5). If you look at the code, they just shove the JSON into the JSON parser without any comment preprocessing - a bizarre programming style. Documentation and language support appear to be pretty sloppy...