Store JSON field as string in Elasticsearch?

I am trying to index a JSON field in Elasticsearch. I have given it an explicit mapping saying that this field should be treated as a string and not JSON; indexing is also not required for it, so there is no need to analyze it. The mapping for this is the following:
"json_field": {
"type": "string",
"index": "no"
},
Still, at indexing time this field is getting analyzed, and because of that I am getting a MapperParsingException.
In short: how can we store JSON as a string in Elasticsearch without it getting analyzed?

Finally got it.
If you want to store JSON as a string without analyzing it, the mapping should be like this:
"json_field": {
"type": "object",
"enabled" : false
},
The enabled flag allows you to disable parsing and indexing of a named object completely. This is handy when a portion of the JSON document contains arbitrary JSON which should not be indexed, nor added to the mapping.
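For illustration, a minimal sketch of creating an index with this mapping and indexing a document, assuming Elasticsearch 7+ and an index name of my-index (both are assumptions, not from the original post):

# Create the index with json_field disabled (assumed index name: my-index)
curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "json_field": {
        "type": "object",
        "enabled": false
      }
    }
  }
}'

# Index a document whose json_field contains arbitrary JSON; it is kept in _source
# verbatim but never parsed, so no MapperParsingException is thrown.
curl -X PUT "localhost:9200/my-index/_doc/1" -H 'Content-Type: application/json' -d'
{
  "json_field": { "anything": { "nested": ["a", 1, true] } }
}'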
Update - From ES version 7.12 "enabled" has been changed to "index".

As of today, Elasticsearch 7.12, "enabled" is now "index".
So the mapping should be like this:
"json_field": {
"type": "object",
"index" : false
},

Solution
Set "enabled": false for the field.
curl -X PUT "localhost:9200/{{INDEX-NAME}}/_mapping/doc" -H 'Content-Type: application/json' -d'
{
  "properties" : {
    "json_field" : {
      "type" : "object",
      "enabled": false
    }
  }
}'
Note: this cannot be applied to an existing field. Either include it in the mapping when creating the index, or create a new field.
Explanation
The enabled setting, which can be applied only to the top-level mapping definition and to object fields, causes Elasticsearch to skip parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way:
Ref: Elasticsearch Docs
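To make the "retrievable but not searchable" part concrete, a hedged sketch against the same mapping; the document id and the query below are illustrative, not from the original post:

# Retrieving the document returns the stored JSON verbatim from _source
# (assumes the {{INDEX-NAME}} and doc type used above, and a document with id 1):
curl -X GET "localhost:9200/{{INDEX-NAME}}/doc/1"

# Searching inside json_field finds nothing, because its contents were never
# parsed or indexed:
curl -X GET "localhost:9200/{{INDEX-NAME}}/_search" -H 'Content-Type: application/json' -d'
{
  "query": { "match": { "json_field.some_key": "some value" } }
}'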

Related

Storing JSON blob in an Avro Field

I have inherited a project where an Avro file is being consumed by Snowflake. The schema of the Avro is as follows:
{
  "name": "TableName",
  "namespace": "sqlserver",
  "type": "record",
  "fields": [
    {
      "name": "hAccount",
      "type": "string"
    },
    {
      "name": "hTableName",
      "type": "string"
    },
    {
      "name": "hRawJSON",
      "type": "string"
    }
  ]
}
The hRawJSON field is a blob of JSON itself. The previous dev gave it a type of string, and this is where I believe the problem lies.
The application takes a JSON object (the JSON is variable, so I never know its contents) and populates the hRawJSON field in the Avro record. But it contains escape characters for the double quotes in the string:
hAccount:"H11122"
hTableName:"Departments"
hRawJSON:"{\"DepartmentID\":1,\"ModelID\":0,\"Description\":\"P Medicines\",\"Margin\":\"3.300000000000000e+001\",\"UCSVATRateID\":0,\"References\":719,\"HeadOfficeID\":1,\"DividendID\":0}"
As a result, the JSON blob is staged into Snowflake as a VARIANT field but still retains the escape characters.
This means when querying the data in the JSON I constantly have to use this:
PARSE_JSON(RAW_FILE:hRawJSON):DepartmentID
I can't help feeling that the field type of string in the Avro file is causing the issue and that a different type should be used. I've tried Record, but without fields it's unusable. Doc doesn't work either.
The other alternative is that this behavior is correct and when moving the hRawJSON from staging into "proper" tables I should use something like:
INSERT INTO DATA.PUBLIC.DEPARTMENTS
SELECT
RAW_FILE:hAccount::VARCHAR(4) as Account,
PARSE_JSON(RAW_FILE:hRawJSON) as JsonRaw
FROM DATA.STAGING.AVRO_RAW WHERE RAW_FILE:hTableName::STRING = 'Department';
So if this is the correct approach and I'm overthinking this, I'd appreciate guidance.
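If the INSERT above is the right approach, then once PARSE_JSON has been applied at load time the JSON can be traversed directly, without calling PARSE_JSON in every query. A sketch, where the target table definition is an assumption based on the statement above:

-- Assumed target table; JsonRaw is a VARIANT, so the JSON is stored parsed
-- rather than as an escaped string
CREATE TABLE IF NOT EXISTS DATA.PUBLIC.DEPARTMENTS (
    Account VARCHAR(4),
    JsonRaw VARIANT
);

-- After the INSERT from staging, fields can be accessed with plain path syntax
SELECT
    JsonRaw:DepartmentID::NUMBER AS DepartmentID,
    JsonRaw:Description::STRING  AS Description
FROM DATA.PUBLIC.DEPARTMENTS;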

Debezium Connector for PostgreSQL introduces \ in topic data of type JSON

I am new to Kafka and would like to get some advice on this problem. I am trying to produce data to one of the Kafka topics from a table in Postgres; one of the columns has type json, which is defined like this in the schema file (.avsc):
{
  "name": "details",
  "type": [
    "null",
    {
      "type": "string",
      "connect.version": 1,
      "connect.name": "io.debezium.data.Json"
    }
  ],
  "default": null
}
As per https://debezium.io/documentation/reference/0.9/connectors/postgresql.html#data-types, it is mapped to the string data type of the Kafka connector.
The DB column contains data like:
{"type":"User","id":"123","attributes":{"id":"123","state":"active"}}
But the topic produces it like this:
{\"type\":\"User\",\"id\":\"123\",\"attributes\":{\"id\":\"123\",\"state\":\"active\"}}
I want this to be produced as the same string as passed in, without the \. So the expected output stream for the topic should have something like:
{"type":"User","id":"123","attributes":{"id":"123","state":"active"}}
What would be the best way to achieve it?
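One thing worth checking (an observation, not a definitive answer): the backslashes may simply be JSON string escaping added when the string column is embedded in the JSON-serialized message value, in which case any consumer that parses the message gets the original string back. A sketch with jq, where the flat envelope shape is a simplified assumption:

# Simulated message value as it might appear on the topic (details is a JSON-encoded string):
echo '{"details":"{\"type\":\"User\",\"id\":\"123\",\"attributes\":{\"id\":\"123\",\"state\":\"active\"}}"}' \
  | jq -r '.details'
# prints: {"type":"User","id":"123","attributes":{"id":"123","state":"active"}}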

SOLR post json file Default fieldtype

I have a POSTAL_CODE field in my JSON file. If I try importing that data into Solr using solr/post, the field type is being set to 'plongs', which is not suitable for data like "108-0023". Because of that, the data import is throwing an error. Is there any workaround for this kind of issue?
Edit:
Sample data which you might use to check it.
{
  "id": "1",
  "POSTAL_CODE": "1982"
},
{
  "id": "2",
  "POSTAL_CODE": "1947"
},
{
  "id": "3",
  "POSTAL_CODE": "19473"
},
{
  "id": "4",
  "POSTAL_CODE": "19471"
},
{
  "id": "5",
  "POSTAL_CODE": "1947-123"
}
In the above sample, I don't understand why 'id' is not being considered as 'plongs' or 'pints', but only 'POSTAL_CODE' has that issue. If the first element has POSTAL_CODE as, say, "1947-145", then the field type is taken as 'text_general'. Generally, if the value has double quotes (i.e., "Data": "123"), shouldn't it be considered a string value?
Remove the collection, create it anew, and before you index anything, define a field POSTAL_CODE in your schema as type string. This will then index any incoming data in this field without guessing, and instead use the string type, which means it is indexed as-is.
Copied and adapted from https://lucene.apache.org/solr/guide/7_0/schema-api.html, but untested:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"POSTAL_CODE",
"type":"string",
"stored":true }
}' http://localhost:8983/solr/yourcollectionhere/schema
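After adding the field, posting the sample documents should index POSTAL_CODE as-is. A sketch, assuming the sample objects above are wrapped in a JSON array and saved as postal_codes.json (file name is an assumption):

# Index the sample data into the collection whose schema now defines POSTAL_CODE as string
bin/post -c yourcollectionhere postal_codes.json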
I tried to import the data by creating a raw JSON document with the field POSTAL_CODE. Below is my JSON; my Solr version is 7.2.1:
{"array": [1,2,3],"boolean": true,"color": "#82b92c","null": null,"number": 123,"POSTAL_CODE": "108-0023"}
It is indexed as a text field in Solr. The command I triggered to index the data is below:
bin/post -c gettingstarted test.json
Could you please provide the sample data and the version of Solr on which you are facing this issue?

How to validate number of properties in JSON schema

I am trying to create a schema for a piece of JSON and have slimmed down an example of what I am trying to achieve.
I have the following JSON schema:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Set name",
  "description": "The example schema",
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    }
  },
  "additionalProperties": false
}
The following JSON is classed as valid when compared to the schema:
{
  "name": "W",
  "name": "W"
}
I know that there should be a warning about the two fields having the same name, but is there a way to force the validation to fail if the above is submitted? I want it to validate only when there is a single occurrence of the field 'name'.
This is outside of the responsibility of JSON Schema. JSON Schema is built on top of JSON. In JSON, the behavior of duplicate properties in an object is undefined. If you want to get a warning about this, you should run the data through a separate validation step to ensure it is valid JSON before passing it to a JSON Schema validator.
There is a maxProperties constraint that can limit the total number of properties in an object.
Having data with duplicated properties is a tricky case, though, as many JSON decoding implementations would ignore the duplicates,
so your JSON Schema validation library would not even know a duplicate existed.
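For reference, a sketch of what that looks like applied to the schema from the question; maxProperties caps the number of distinct keys, but as noted it cannot catch a duplicated key that the parser has already collapsed:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "name": { "type": "string" }
  },
  "additionalProperties": false,
  "maxProperties": 1
}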

Is it possible to have an optional field in an Avro schema (i.e. the field does not appear at all in the .json file)?

Is it possible to have an optional field in an Avro schema (i.e. the field does not appear at all in the .JSON file)?
In my Avro schema, I have two fields:
{"name": "author", "type": ["null", "string"], "default": null},
{"name": "importance", "type": ["null", "string"], "default": null},
And in my JSON files those two fields can exist or not.
However, when they do not exist, I receive an error (e.g. when I test such a JSON file using the avro-tools command-line client):
Expected field name not found: author
I understand that as long as the field name exists in a JSON, it can be null or a string value, but what I'm trying to express is something like "this JSON is valid if those field names do not exist, OR if they exist and they are null or string".
Is this possible to express in an Avro schema? If so, how?
You can define the default attribute, for example as "undefined", so the field can be skipped:
{
  "name": "first_name",
  "type": "string",
  "default": "undefined"
},
Also, all fields are mandatory in Avro.
If you want a field to be optional, union its type with null.
Example:
{
  "name": "username",
  "type": [
    "null",
    "string"
  ],
  "default": null
},
According to the Avro specification this is possible, using the default attribute.
See https://avro.apache.org/docs/1.8.2/spec.html
default: A default value for this field, used when reading instances that lack this field (optional). Permitted values depend on the field's schema type, according to the table below. Default values for union fields correspond to the first schema in the union.
In the example you gave, you do add the default attribute with value null, so this should work. However, supporting this also depends on the library you use for reading the Avro message (there are libraries in C, C++, Python, Java, C#, Ruby, etc.). Maybe (probably) the library you use lacks this feature.
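For completeness, a minimal record schema combining the union-with-null pattern from the answers above; the field names come from the question, while the record name and namespace are illustrative:

{
  "name": "Document",
  "namespace": "example",
  "type": "record",
  "fields": [
    {"name": "author", "type": ["null", "string"], "default": null},
    {"name": "importance", "type": ["null", "string"], "default": null}
  ]
}

Whether a field that is missing entirely from the JSON input is tolerated still depends on the reader library, as noted above.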