Custom analyzer appearing in type mapping but not working in Elasticsearch - json

I'm trying to add a custom analyzer to my index while also mapping that analyzer to a property on a type. Here is my JSON object for doing this:
{ "settings" : {
"analysis" : {
"analyzer" : {
"test_analyzer" : {
"type" : "custom",
"tokenizer": "standard",
"filter" : ["lowercase", "asciifolding"],
"char_filter": ["html_strip"]
}
}
}
},
"mappings" : {
"test" : {
"properties" : {
"checkanalyzer" : {
"type" : "string",
"analyzer" : "test_analyzer"
}
}
}
}
}
I know this analyzer works because I've tested it using /wp2/_analyze?analyzer=test_analyzer -d '<p>Testing analyzer.</p>' and also it shows up as the analyzer for the checkanalyzer property when I check /wp2/test/_mapping. However, if I add a document like {"checkanalyzer": "<p>The tags should not show up</p>"}, the HTML tags don't get stripped out when I retrieve the document using the _search endpoint. Am I misunderstanding how the mapping works or is there something wrong with my JSON object? I'm dynamically creating the wp2 index and also the test type when I make this call to Elasticsearch, not sure if that matters.

The html doesn't get removed from the source, it gets removed from the terms generated by that source. You can see this if you use a terms aggregation:
POST /test_index/_search
{
"aggs": {
"checkanalyzer_field_terms": {
"terms": {
"field": "checkanalyzer"
}
}
}
}
{
"took": 77,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "test",
"_id": "1",
"_score": 1,
"_source": {
"checkanalyzer": "<p>The tags should not show up</p>"
}
}
]
},
"aggregations": {
"checkanalyzer_field_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "not",
"doc_count": 1
},
{
"key": "should",
"doc_count": 1
},
{
"key": "show",
"doc_count": 1
},
{
"key": "tags",
"doc_count": 1
},
{
"key": "the",
"doc_count": 1
},
{
"key": "up",
"doc_count": 1
}
]
}
}
}
Here's some code I used to test it:
http://sense.qbox.io/gist/2971767aa0f5949510fa0669dad6729bbcdf8570

Now if you want to completely strip out the html prior to indexing and storing the content as is, you can use the mapper attachment plugin - in which when you define the mapping, you can categorize the content_type to be "html."
The mapper attachment is useful for many things especially if you are handling multiple document types, but most notably - I believe just using this for the purpose of stripping out the html tags is sufficient enough (which you cannot do with the html_strip char filter).
Just a forewarning though - NONE of the html tags will be stored. So if you do need those tags somehow, I would suggest defining another field to store the original content. Another note: You cannot specify multifields for mapper attachment documents, so you would need to store that outside of the mapper attachment document. See my working example below.
You'll need to result in this mapping:
{
"html5-es" : {
"aliases" : { },
"mappings" : {
"document" : {
"properties" : {
"delete" : {
"type" : "boolean"
},
"file" : {
"type" : "attachment",
"fields" : {
"content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"author" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets"
},
"title" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "autocomplete"
},
"name" : {
"type" : "string"
},
"date" : {
"type" : "date",
"format" : "strict_date_optional_time||epoch_millis"
},
"keywords" : {
"type" : "string"
},
"content_type" : {
"type" : "string"
},
"content_length" : {
"type" : "integer"
},
"language" : {
"type" : "string"
}
}
},
"hash_id" : {
"type" : "string"
},
"path" : {
"type" : "string"
},
"raw_content" : {
"type" : "string",
"store" : true,
"term_vector" : "with_positions_offsets",
"analyzer" : "raw"
},
"title" : {
"type" : "string"
}
}
}
},
"settings" : { //insert your own settings here },
"warmers" : { }
}
}
Such that in NEST, I will assemble the content as such:
Attachment attachment = new Attachment();
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document"));
attachment.ContentType = "html";
Document document = new Document();
document.File = attachment;
document.RawContent = InsertRawContentFromString(originalText);
I have tested this in Sense - results are as follows:
"file": {
"_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=",
"_content_length": 0,
"_content_type": "html",
"_date": "0001-01-01T00:00:00",
"_title": "Topic10"
},
"delete": false,
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>"
},
"highlight": {
"file.content": [
"\n <em>Topic10</em>\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n  \n\n asdf\n\n  \n\n 10\n\n  \n\n Lavender.\n\n  \n\n 10/6 12:03\n\n  \n\n 5 09\n\n  \n\n 11 47\n\n  \n\n Halloween is in October.\n\n  \n\n jog\n\n "
]
}

Related

Index a JSON file into elasticsearch command/mapping errors

I'm new to ELK and I want to import a JSON file into Elasticsearch. this is my file:
{
"news":{
"1":{
"_score":1.0,
"_index":"newsvit",
"_source":{
"content":" \u0641\u0647\u06cc\u0645\u0647 \u062d\u0633\u0646\u200c\u0645\u06cc\u0631\u06cc: \u0627\u06af\u0631\u0686\u0647 \u062f\u0631 \u0647\u06cc\u0627\u0647\u0648\u06cc \u0627\u0646\u062a\u062e\u0627\u0628\u0627\u062a \u0631\u06cc\u0627\u0633\u062a \u062c\u0645\u0647\u0648\u0631\u06cc\u060c \u0645\u0648\u0636\u0648\u0639\u06cc \u0645\u0627\u0646\u0646\u062f \u0645\u0639\u0631\u0641\u06cc \u06a9\u0627\u0646\u062f\u06cc\u062f\u0627\u0647\u0627\u06cc \u0634\u0648\u0631\u0627\u06cc \u0634\u0647\u0631 \u062f\u0631 \u062d\u0627\u0634\u06cc\u0647 \u0642\u0631\u0627\u0631 \u06af\u0631\u0641\u062a\u0647\u060c \u0627\u0645\u0627 \u0627\u0645\u0633\u0627\u0644 \u0628\u0647 \u0639\u0646\u0648\u0627\u0646 \u067e\u0646\u062c\u0645\u06cc\u0646 \u062f\u0648\u0631\u0647 \u0627\u0646\u062a\u062e\u0627\u0628 \u0627\u0639\u0636\u0627\u06cc \u0634\u0648\u0631\u0627\u06cc \u0634\u0647\u0631\u060c \u0627\u06cc\u0646 \u0631\u0648\u06cc\u062f\u0627\u062f \u0628\u0647 \u0646\u0633\u0628\u062a \u062f\u0648\u0631\u0647\u200c\u0647\u0627\u06cc \u0642\u0628\u0644\u060c \u0628\u06cc\u0634\u062a\u0631 \u0645\u0648\u0631\u062f \u062a\u0648\u062c\u0647 \u0648\u0627\u0642\u0639 \u0634\u062f\u0647. \u0627\u06cc\u0646 \u0627\u0642\u0628\u0627\u0644\u060c \u0686\u0647 \u0627\u0632 \u0633\u0648\u06cc \u0686\u0647\u0631\u0647\u200c\u0647\u0627\u06cc \u0645\u0637\u0631\u062d \u0628\u0631\u0627\u06cc \u062b\u0628\u062a \u0646\u0627\u0645 \u0648 \u0686\u0647 \u0627\u0632 \u0633\u0648\u06cc \u0645\u0631\u062f\u0645 \u0628\u0631\u0627\u06cc \u0645\u0634\u0627\u0631\u06a9\u062a \u062f\u0631 \u0627\u06cc\u0646 \u0631\u0648\u06cc\u062f\u0627\u062f\u060c \u0639\u0644\u062a\u200c\u0647\u0627\u06cc \u06af\u0648\u0646\u0627\u06af\u0648\u0646\u06cc \u0645\u06cc\u200c\u062a\u0648\u0627\u0646\u062f \u062f\u0627\u0634\u062a\u0647 \u0628\u0627\u0634\u062f \u06a9\u0647 \u062a\u0648\u062c\u0647 \u0628\u0647 \u0622\u0646\u060c \u0645\u06cc\u200c\u062a\u0648\u0627\u0646\u062f \u0631\u0627\u0647\u06af\u0634\u0627\u06cc \u0627\u0639\u0636\u0627\u06cc \u0631\u06",
"lead":"\u062c\u0627\u0645\u0639\u0647 > \u0634\u0647\u0631\u06cc - \u0645\u06cc\u0632\u06af\u0631\u062f\u06cc \u062f\u0631\u0628\u0627\u0631\u0647 \u0639\u0645\u0644\u06a9\u0631\u062f \u062f\u0648\u0631\u0647\u200c\u0647\u0627\u06cc \u06af\u0630\u0634\u062a\u0647 \u0634\u0648\u0631\u0627\u06cc \u0634\u0647\u0631\u060c \u0622\u0646\u0686\u0647 \u0627\u0639\u0636\u0627\u06cc \u062c\u062f\u06cc\u062f \u0628\u0627\u06cc\u062f \u0645\u062f \u0646\u0638\u0631 \u062f\u0627\u0634\u062a\u0647 \u0628\u0627\u0634\u0646\u062f \u0648 \u0647\u0645\u0686\u0646\u06cc\u0646 \u0645\u0627\u0647\u06cc\u062a \u0633\u06cc\u0627\u0633\u06cc \u0628\u0648\u062f\u0646 \u06cc\u0627 \u0646\u0628\u0648\u062f\u0646 \u0634\u0648\u0631\u0627\u06cc \u0634\u0647\u0631.",
"agency":"13",
"date_created":1494518193,
"url":"http://www.khabaronline.ir/(X(1)S(bud4wg3ebzbxv51mj45iwjtp))/detail/663749/society/urban",
"image":"uploads/2017/05/11/1589793661.jpg",
"category":"15"
},
"_type":"news",
"_id":"2981643"
},
"2": {
...
based on what I have learnt, at first, I tried to create a mapping system for it in DevTools of Kibana. I want to be able to perform queries and search on this file based on fields in _source, such as category, id and so on. this is my mapping:
PUT /main-news-test-data
{
"mappings": {
"properties": {
"_score": {"type":"integer"},
"_index": {"type":"keyword"},
"_type":{"type":"keyword"},
"_id":{"type":"keyword"}
},
"_source":{
"properties": {
"content":{"type":"text"},
"title":{"type":"text"},
"lead":{"type":"text"},
"agency":{"type":"keyword"},
"date_created":{"type":"date"},
"url":{"type":"keyword"},
"image":{"type":"keyword"},
"category":{"type":"keyword"}
}
}
}
}
HEAD main-news-test-data
GET /main-news-test-data/_search?q=*
but when I run this in Devtools I receive this error:
{
"error" : {
"root_cause" : [
{
"type" : "mapper_parsing_exception",
"reason" : "Mapping definition for [_source] has unsupported parameters: [properties : {image={type=keyword}, agency={type=keyword}, date_created={type=date}, title={type=text}, category={type=keyword}, content={type=text}, lead={type=text}, url={type=keyword}}]"
}
],
"type" : "mapper_parsing_exception",
"reason" : "Failed to parse mapping [_doc]: Mapping definition for [_source] has unsupported parameters: [properties : {image={type=keyword}, agency={type=keyword}, date_created={type=date}, title={type=text}, category={type=keyword}, content={type=text}, lead={type=text}, url={type=keyword}}]",
"caused_by" : {
"type" : "mapper_parsing_exception",
"reason" : "Mapping definition for [_source] has unsupported parameters: [properties : {image={type=keyword}, agency={type=keyword}, date_created={type=date}, title={type=text}, category={type=keyword}, content={type=text}, lead={type=text}, url={type=keyword}}]"
}
},
"status" : 400
}
I also tried to index my file into elasticsearch using this PowerShell command afterwards:
Invoke-RestMethod "http://localhost:9200/main-news-test-data/doc/_bulk?pretty" -Method Post -ContentType 'application/x-ndjson' -InFile "test.json"
but again I get this error from Powershell:
Invoke-RestMethod : {
"error" : {
"root_cause" : [
{
"type" : "json_e_o_f_exception",
"reason" : "Unexpected end-of-input: expected close marker for Object (start marker at [Source:
(org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 1, column: 1])\n at
[Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 2, column: 1]"
}
],
"type" : "json_e_o_f_exception",
"reason" : "Unexpected end-of-input: expected close marker for Object (start marker at [Source:
(org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 1, column: 1])\n at
[Source: (org.elasticsearch.common.bytes.AbstractBytesReference$MarkSupportingStreamInputWrapper); line: 2, column: 1]"
},
"status" : 400
}
So what should I do? how do I import a JSON file into elasticsearch that is queryable by fields?
with what I read I can say that :
your mapping is strange
Just put :
PUT /main-news-test-data
{
"mappings": {
"properties": {
"content": {
"type": "text"
},
"title": {
"type": "text"
},
"lead": {
"type": "text"
},
"agency": {
"type": "keyword"
},
"date_created": {
"type": "date"
},
"url": {
"type": "keyword"
},
"image": {
"type": "keyword"
},
"category": {
"type": "keyword"
}
}
}
}
Your Json is wrong. A bulk don't use a valid json.
A file for the _bulk api will look like this :
{ "index" : { "_index" : "main-news-test-data", "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "main-news-test-data", "_id" : "2" } }
{ "field1" : "value2" }
Please also Note that "_score":1.0, has no reason to be in your request and that _type is deprecated (if you use a 7.0+, _type can only be _doc and should be ignored)

Value of property QueueConfigurations must be of type List

I am trying to write SQS triggers for my S3 bucket. I am running into an error saying "Value of property QueueConfigurations must be of type List." Is there something wrong with my indentation/formatting? Or is it a content error? I recently had to transcribe this from YAML to JSON, and I could really use a second pair of eyes on this issue. Keep in mind that the reason the codeblock below is so indented is because I have some sensitive info I shouldn't post. Thanks in advance!
"NotificationConfiguration" : {
"QueueConfigurations" : {
"Id" : "1",
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "prod_hvr/cdc/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
},
"QueueConfigurations" : {
"Id" : "2",
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "prod_hvr/latency/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
}
}
It should be something like below. And as per this docs, "Id" is not a valid attribute.
{
"NotificationConfiguration": {
"QueueConfigurations": [
{
"Event": "s3:ObjectCreated:*",
"Filter": {
"S3Key": {
"Rules": {
"Name": "prefix",
"Value": "prod_hvr/cdc/"
}
}
},
"Queue": "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
},
{
"Event": "s3:ObjectCreated:*",
"Filter": {
"S3Key": {
"Rules": {
"Name": "prefix",
"Value": "prod_hvr/latency/"
}
}
},
"Queue": "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
}
]
}
}

The `type` is not updating when creating or updating index

I'm new to Kibana and Elasticsearch. I have task to migrate data from our production site to staging. Currently, I have given a simple code on creating index.
I have successfully created index, but upon comparing to production site, the type declared as date became text on my new site. We have noticed that all types are converting to text and not sure is because we are using new version of kibana.
Here in production site...
"authorizationDate": {
"type": "date",
"ignore_malformed": true,
"format": "yyyy/MM/dd||yyyy-MM-dd"
},
This is how I implemented on staging site...
POST /orders/_doc/1
{
"order": {
"properties": {
"authorization": {
"authorizationDate": {
"type": "date",
"ignore_malformed": true,
"format": "yyyy/MM/dd||yyyy-MM-dd"
}
}
}
}
}
Upon checking...
GET orders?pretty
Output in orders mapping...
"mappings" : {
"properties" : {
"order" : {
"properties" : {
"properties" : {
"properties" : {
"authorization" : {
"properties" : {
"authorizationDate" : {
"properties" : {
"format" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"ignore_malformed" : {
"type" : "boolean"
},
"type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
}
}
}
},
The type became text instead of date, and the date format is not recorded.
Thanks in advance.
POST /orders/_doc/1 -- This will create a new index named orders with a default inferred mapping.
When you are running above it is treating "order", "properties", "ignore_malformed" as fields not part of mapping. hence you can see multiple nested properties in output mapping below.
"properties" : {
"properties" : {
"properties" : {
To create a new mapping first you should run
PUT orders ---> create a new index named orders
{
"mappings": {
"properties": {
"authorization": {
"type": "object", --->should be object/nested was not present in your quest
"properties": {
"authorizationDate": {
"type": "date",
"ignore_malformed": true,
"format": "yyyy/MM/dd||yyyy-MM-dd"
}
}
}
}
}
}
Then adding a new doc using
POST /orders/_doc/1
{
"authorization":{
"authorizationDate":"2019-01-01"
}
}
will give below data
[
{
"_index" : "orders12",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"authorization" : {
"authorizationDate" : "2019-01-01"
}
}
}
]
Link for (Mapping)[https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html]

Indexing CSV blobs does not work in Azure Search

I have a number of TSV files as Azure blobs that have following as the first four tab-separated columns:
metadata_path, document_url, access_date, content_type
I want to index them as described here: https://learn.microsoft.com/en-us/azure/search/search-howto-index-csv-blobs
My request for creating an indexer has the following body:
{
"name" : "webdata",
"dataSourceName" : "webdata",
"targetIndexName" : "webdata",
"schedule" : { "interval" : "PT1H", "startTime" : "2017-01-09T11:00:00Z" },
"parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextHeaders" : "metadata_path,document_url,access_date,content_type" , "firstLineContainsHeaders" : true, "delimitedTextDelimiter" : "\t" } },
"fieldMappings" : [ { "sourceFieldName" : "document_url", "targetFieldName" : "id", "mappingFunction" : { "name" : "base64Encode", "parameters" : "useHttpServerUtilityUrlTokenEncode" : false } } }, { "sourceFieldName" : "document_url", "targetFieldName" : "url" }, { "sourceFieldName" : "content_type", "targetFieldName" : "content_type" } ]
}
I am receiving an error:
{
"error": {
"code": "",
"message": "Data source does not contain column 'document_url', which is required because it maps to the document key field 'id' in the index 'webdata'. Ensure that the 'document_url' column is present in the data source, or add a field mapping that maps one of the existing column names to 'id'."
}
}
What do I do wrong?
What do I do wrong?
In your case, you supply the json format is invalid. The following is the request for creating an indexer. Detail info we could refer to this document
{
"name" : "Required for POST, optional for PUT. The name of the indexer",
"description" : "Optional. Anything you want, or null",
"dataSourceName" : "Required. The name of an existing data source",
"targetIndexName" : "Required. The name of an existing index",
"schedule" : { Optional. See Indexing Schedule below. },
"parameters" : { Optional. See Indexing Parameters below. },
"fieldMappings" : { Optional. See Field Mappings below. },
"disabled" : Optional boolean value indicating whether the indexer is disabled. False by default.
}
If we want to create an indexer with Rest API. We need 3 steps to do that. I also do a demo for it.
If Azure search SDK is acceptable, you also could refer to another SO thread.
1.Create datasource.
POST https://[service name].search.windows.net/datasources?api-version=2015-02-28-Preview
Content-Type: application/json
api-key: [admin key]
{
"name" : "my-blob-datasource",
"type" : "azureblob",
"credentials" : { "connectionString" : "DefaultEndpointsProtocol=https;AccountName=<account name>;AccountKey=<account key>;" },
"container" : { "name" : "my-container", "query" : "<optional, my-folder>" }
}
2.Create an index
{
"name" : "my-target-index",
"fields": [
{ "name": "metadata_path","type": "Edm.String", "key": true, "searchable": true },
{ "name": "document_url", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
{ "name": "access_date", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false },
{ "name": "content_type", "type": "Edm.String", "searchable": true, "filterable": false, "sortable": false, "facetable": false }
]
}
3. Create an indexer.
Below is the request body that works:
{
"name" : "webdata",
"dataSourceName" : "webdata",
"targetIndexName" : "webdata",
"schedule" :
{
"interval" : "PT1H",
"startTime" : "2017-01-09T11:00:00Z"
},
"parameters" :
{
"configuration" :
{
"parsingMode" : "delimitedText",
"delimitedTextHeaders" : "document_url,content_type,link_text" ,
"firstLineContainsHeaders" : true,
"delimitedTextDelimiter" : "\t",
"indexedFileNameExtensions" : ".tsv"
}
},
"fieldMappings" :
[
{
"sourceFieldName" : "document_url",
"targetFieldName" : "id",
"mappingFunction" : {
"name" : "base64Encode",
"parameters" : {
"useHttpServerUtilityUrlTokenEncode" : false
}
}
},
{
"sourceFieldName" : "document_url",
"targetFieldName" : "document_url"
},
{
"sourceFieldName" : "content_type",
"targetFieldName" : "content_type"
},
{
"sourceFieldName" : "link_text",
"targetFieldName" : "link_text"
}
]
}

Sub-records in Avro with Morphlines

I'm trying to convert JSON into Avro using the kite-sdk morphline module. After playing around I'm able to convert the JSON into Avro using a simple schema (no complex data types).
Then I took it one step further and modified the Avro schema as displayed below (subrec.avsc). As you can see the schema consist of a subrecord.
As soon as I tried to convert the JSON to Avro using the morphlines.conf and the subrec.avsc it failed.
Somehow the JSON paths "/record_type[]/alert/action" are not translated by the toAvro function.
The morphlines.conf
morphlines : [
{
id : morphline1
importCommands : ["org.kitesdk.**"]
commands : [
# Read the JSON blob
{ readJson: {} }
{ logError { format : "record: {}", args : ["#{}"] } }
# Extract JSON
{ extractJsonPaths { flatten: false, paths: {
"/record_type[]/alert/action" : /alert/action,
"/record_type[]/alert/signature_id" : /alert/signature_id,
"/record_type[]/alert/signature" : /alert/signature,
"/record_type[]/alert/category" : /alert/category,
"/record_type[]/alert/severity" : /alert/severity
} } }
{ logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } }
{ extractJsonPaths { flatten: false, paths: {
timestamp : /timestamp,
event_type : /event_type,
source_ip : /src_ip,
source_port : /src_port,
destination_ip : /dest_ip,
destination_port : /dest_port,
protocol : /proto,
} } }
# Create Avro according to schema
{ logError { format : "WE GO TO AVRO"} }
{ toAvro { schemaFile : /etc/flume/conf/conf.empty/subrec.avsc } }
# Create Avro container
{ logError { format : "WE GO TO BINARY"} }
{ writeAvroToByteArray { format: containerlessBinary } }
{ logError { format : "DONE!!!"} }
]
}
]
And the subrec.avsc
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "timestamp",
"type" : "string"
}, {
"name" : "event_type",
"type" : "string"
}, {
"name" : "source_ip",
"type" : "string"
}, {
"name" : "source_port",
"type" : "int"
}, {
"name" : "destination_ip",
"type" : "string"
}, {
"name" : "destination_port",
"type" : "int"
}, {
"name" : "protocol",
"type" : "string"
}, {
"name": "record_type",
"type" : ["null", {
"name" : "alert",
"type" : "record",
"fields" : [ {
"name" : "action",
"type" : "string"
}, {
"name" : "signature_id",
"type" : "int"
}, {
"name" : "signature",
"type" : "string"
}, {
"name" : "category",
"type" : "string"
}, {
"name" : "severity",
"type" : "int"
}
] } ]
} ]
}
The output on { logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } } I output the following:
[{
/record_type[]/alert / action = [allowed],
/record_type[]/alert / category = [],
/record_type[]/alert / severity = [3],
/record_type[]/alert / signature = [GeoIP from NL,
Netherlands],
/record_type[]/alert / signature_id = [88006],
_attachment_body = [{
"timestamp": "2015-03-23T07:42:01.303046",
"event_type": "alert",
"src_ip": "1.1.1.1",
"src_port": 18192,
"dest_ip": "46.231.41.166",
"dest_port": 62004,
"proto": "TCP",
"alert": {
"action": "allowed",
"gid": "1",
"signature_id": "88006",
"rev": "1",
"signature" : "GeoIP from NL, Netherlands ",
"category" : ""
"severity" : "3"
}
}],
_attachment_mimetype=[json/java + memory],
basename = [simple_eve.json]
}]
UPDATE 2017-06-22
you MUST populate the data in the structure in order for this to work, by using addValues or setValues
{
addValues {
micDefaultHeader : [
{
eventTimestampString : "2017-06-22 18:18:36"
}
]
}
}
after debugging the sources of morphline toAvro, it appears that the record is the first object to be evaluated, no matter what you put in your mappings structure.
the solution is quite simple, but unfortunately took a little extra time, eclipse, running the flume agent in debug mode, cloning the source code and lots of coffee.
here it goes.
my schema:
{
"type" : "record",
"name" : "co_lowbalance_event",
"namespace" : "co.tigo.billing.cboss.lowBalance",
"fields" : [ {
"name" : "dummyValue",
"type" : "string",
"default" : "dummy"
}, {
"name" : "micDefaultHeader",
"type" : {
"type" : "record",
"name" : "mic_default_header_v_1_0",
"namespace" : "com.millicom.schemas.root.struct",
"doc" : "standard millicom header definition",
"fields" : [ {
"name" : "eventTimestampString",
"type" : "string",
"default" : "12345678910"
} ]
}
} ]
}
morphlines file:
morphlines : [
{
id : convertJsonToAvro
importCommands : ["org.kitesdk.**"]
commands : [
{
readJson {
outputClass : java.util.Map
}
}
{
addValues {
micDefaultHeader : [{}]
}
}
{
logDebug { format : "my record: {}", args : ["#{}"] }
}
{
toAvro {
schemaFile : /home/asarubbi/Development/test/co_lowbalance_event.avsc
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
}
}
{
writeAvroToByteArray {
format : containerlessJSON
codec : null
}
}
]
}
]
the magic lies here:
{
addValues {
micDefaultHeader : [{}]
}
}
and in the mappings:
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
explanation:
inside the code the first field name that is evaluated is micDefaultHeader of type RECORD. as there's no way to specify a default value for a RECORD (logically correct), the toAvro code evaluates this, does not get any value configured in mappings and therefore it fails at it detects (wrongly) that the record is empty when it shouldn't.
however, taking a look at the code, you may see that it requires a Map object, containing no values to please the parser and continue to the next element.
so we add a map object using the addValues and fill it with an empty map [{}]. notice that this must match the name of the record that is causing you an empty value. in my case "micDefaultHeader"
feel free to comment if you have a better solution, as this looks like a "dirty fix"