Sub-records in Avro with Morphlines - json

I'm trying to convert JSON into Avro using the kite-sdk morphline module. After playing around I'm able to convert the JSON into Avro using a simple schema (no complex data types).
Then I took it one step further and modified the Avro schema as displayed below (subrec.avsc). As you can see the schema consist of a subrecord.
As soon as I tried to convert the JSON to Avro using the morphlines.conf and the subrec.avsc it failed.
Somehow the JSON paths "/record_type[]/alert/action" are not translated by the toAvro function.
The morphlines.conf
morphlines : [
{
id : morphline1
importCommands : ["org.kitesdk.**"]
commands : [
# Read the JSON blob
{ readJson: {} }
{ logError { format : "record: {}", args : ["#{}"] } }
# Extract JSON
{ extractJsonPaths { flatten: false, paths: {
"/record_type[]/alert/action" : /alert/action,
"/record_type[]/alert/signature_id" : /alert/signature_id,
"/record_type[]/alert/signature" : /alert/signature,
"/record_type[]/alert/category" : /alert/category,
"/record_type[]/alert/severity" : /alert/severity
} } }
{ logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } }
{ extractJsonPaths { flatten: false, paths: {
timestamp : /timestamp,
event_type : /event_type,
source_ip : /src_ip,
source_port : /src_port,
destination_ip : /dest_ip,
destination_port : /dest_port,
protocol : /proto,
} } }
# Create Avro according to schema
{ logError { format : "WE GO TO AVRO"} }
{ toAvro { schemaFile : /etc/flume/conf/conf.empty/subrec.avsc } }
# Create Avro container
{ logError { format : "WE GO TO BINARY"} }
{ writeAvroToByteArray { format: containerlessBinary } }
{ logError { format : "DONE!!!"} }
]
}
]
And the subrec.avsc
{
"type" : "record",
"name" : "Event",
"fields" : [ {
"name" : "timestamp",
"type" : "string"
}, {
"name" : "event_type",
"type" : "string"
}, {
"name" : "source_ip",
"type" : "string"
}, {
"name" : "source_port",
"type" : "int"
}, {
"name" : "destination_ip",
"type" : "string"
}, {
"name" : "destination_port",
"type" : "int"
}, {
"name" : "protocol",
"type" : "string"
}, {
"name": "record_type",
"type" : ["null", {
"name" : "alert",
"type" : "record",
"fields" : [ {
"name" : "action",
"type" : "string"
}, {
"name" : "signature_id",
"type" : "int"
}, {
"name" : "signature",
"type" : "string"
}, {
"name" : "category",
"type" : "string"
}, {
"name" : "severity",
"type" : "int"
}
] } ]
} ]
}
The output on { logError { format : "EXTRACTED THIS : {}", args : ["#{}"] } } I output the following:
[{
/record_type[]/alert / action = [allowed],
/record_type[]/alert / category = [],
/record_type[]/alert / severity = [3],
/record_type[]/alert / signature = [GeoIP from NL,
Netherlands],
/record_type[]/alert / signature_id = [88006],
_attachment_body = [{
"timestamp": "2015-03-23T07:42:01.303046",
"event_type": "alert",
"src_ip": "1.1.1.1",
"src_port": 18192,
"dest_ip": "46.231.41.166",
"dest_port": 62004,
"proto": "TCP",
"alert": {
"action": "allowed",
"gid": "1",
"signature_id": "88006",
"rev": "1",
"signature" : "GeoIP from NL, Netherlands ",
"category" : ""
"severity" : "3"
}
}],
_attachment_mimetype=[json/java + memory],
basename = [simple_eve.json]
}]

UPDATE 2017-06-22
you MUST populate the data in the structure in order for this to work, by using addValues or setValues
{
addValues {
micDefaultHeader : [
{
eventTimestampString : "2017-06-22 18:18:36"
}
]
}
}
after debugging the sources of morphline toAvro, it appears that the record is the first object to be evaluated, no matter what you put in your mappings structure.
the solution is quite simple, but unfortunately took a little extra time, eclipse, running the flume agent in debug mode, cloning the source code and lots of coffee.
here it goes.
my schema:
{
"type" : "record",
"name" : "co_lowbalance_event",
"namespace" : "co.tigo.billing.cboss.lowBalance",
"fields" : [ {
"name" : "dummyValue",
"type" : "string",
"default" : "dummy"
}, {
"name" : "micDefaultHeader",
"type" : {
"type" : "record",
"name" : "mic_default_header_v_1_0",
"namespace" : "com.millicom.schemas.root.struct",
"doc" : "standard millicom header definition",
"fields" : [ {
"name" : "eventTimestampString",
"type" : "string",
"default" : "12345678910"
} ]
}
} ]
}
morphlines file:
morphlines : [
{
id : convertJsonToAvro
importCommands : ["org.kitesdk.**"]
commands : [
{
readJson {
outputClass : java.util.Map
}
}
{
addValues {
micDefaultHeader : [{}]
}
}
{
logDebug { format : "my record: {}", args : ["#{}"] }
}
{
toAvro {
schemaFile : /home/asarubbi/Development/test/co_lowbalance_event.avsc
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
}
}
{
writeAvroToByteArray {
format : containerlessJSON
codec : null
}
}
]
}
]
the magic lies here:
{
addValues {
micDefaultHeader : [{}]
}
}
and in the mappings:
mappings : {
"micDefaultHeader" : micDefaultHeader
"micDefaultHeader/eventTimestampString" : eventTimestampString
}
explanation:
inside the code the first field name that is evaluated is micDefaultHeader of type RECORD. as there's no way to specify a default value for a RECORD (logically correct), the toAvro code evaluates this, does not get any value configured in mappings and therefore it fails at it detects (wrongly) that the record is empty when it shouldn't.
however, taking a look at the code, you may see that it requires a Map object, containing no values to please the parser and continue to the next element.
so we add a map object using the addValues and fill it with an empty map [{}]. notice that this must match the name of the record that is causing you an empty value. in my case "micDefaultHeader"
feel free to comment if you have a better solution, as this looks like a "dirty fix"

Related

Nifi: Get values form a Json by a Jsonpath, those were defined in other Json

I have two Json:
Json1:
{
"level_1": [
{
"level_2_1": [
{
"key_2_1": "value_1",
"key_2_2": "value_2",
},
{
"key_2_1": "value_1",
"key_2_3": "value_3",
},
{
"key_2_1": "value_1",
"key_2_4": "value_4",
}
],
"level_2_2": {
"key_2_2_1": "2022-08-30T06:57:31.331Z",
"key_2_2_2": "2022"
}
}
]
}
Json2
{
"level_1" : {
"level_2_1" : {
"value" : "default value",
"type" : "String"
},
"level_2_2" : {
"value" : "level_1[0].level_2_1[0].key_2_2", // this Jsonpath of Json 1
"type" : "String"
},
"level_2_3" : {
"value" : "level_1[0].level_2_2.key_2_2_1", // this Jsonpath of Json 1
"type" : "String"
}
}
I want to get the result like this (these Values come from json1)
{
"level_1" : {
"level_2_1" : {
"value" : "default value",
"type" : "String"
},
"level_2_2" : {
"value" : "value_1", // this Value of Json 1
"type" : "String"
},
"level_2_3" : {
"value" : "2022-08-30T06:57:31.331Z", // this Value of Json 1
"type" : "String"
}
}
please, can you give me some advice. Thank for your help!!

Creating Multiple QueueConfigurations in CloudFormation

I'm currently trying to write multiple QueueConfigurations into my CloudFormation template. Each is an SQS queue that is triggered when an object is created to a specified prefix. Here's what I have so far:
{
"Resources": {
"S3Bucket": {
"Type" : "AWS::S3::Bucket",
"Properties" :
"BucketName" : { "Ref" : "paramBucketName" },
"LoggingConfiguration" : {
"DestinationBucketName" : "test-bucket",
"LogFilePrefix" : { "Fn::Join": [ "", [ { "Ref": "paramBucketName" }, "/" ] ] }
},
"NotificationConfiguration" : {
"QueueConfigurations" : [{
"Id" : "1",
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "folder1/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
}],
"QueueConfigurations" : [{
"Id" : "2",
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "folder2/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
}]
}
}
}
}
}
}
I've encountered the error saying Encountered unsupported property Id. I thought that by defining the ID, I would be able to avoid the Duplicate object key error.
Does anyone know how to create multiple triggers in a single CloudFormation template? Thanks for the help in advance.
It should be structured like the below, There should only be one QueueConfigurations attribute
that contains all queue configurations within it. Also the Id parameter is not a valid property.
{
"Resources": {
"S3Bucket": {
"Type" : "AWS::S3::Bucket",
"Properties" :
"BucketName" : { "Ref" : "paramBucketName" },
"LoggingConfiguration" : {
"DestinationBucketName" : "test-bucket",
"LogFilePrefix" : { "Fn::Join": [ "", [ { "Ref": "paramBucketName" }, "/" ] ] }
},
"NotificationConfiguration" : {
"QueueConfigurations" : [{
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "folder1/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
},
{
"Event" : "s3:ObjectCreated:*",
"Filter" : {
"S3Key" : {
"Rules" : {
"Name" : "prefix",
"Value" : "folder2/"
}
}
},
"Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
}]
}
}
}
}
}
}
There is more information about QueueConfiguration in the documentation.

The `type` is not updating when creating or updating index

I'm new to Kibana and Elasticsearch. I have task to migrate data from our production site to staging. Currently, I have given a simple code on creating index.
I have successfully created index, but upon comparing to production site, the type declared as date became text on my new site. We have noticed that all types are converting to text and not sure is because we are using new version of kibana.
Here in production site...
"authorizationDate": {
"type": "date",
"ignore_malformed": true,
"format": "yyyy/MM/dd||yyyy-MM-dd"
},
This is how I implemented on staging site...
POST /orders/_doc/1
{
"order": {
"properties": {
"authorization": {
"authorizationDate": {
"type": "date",
"ignore_malformed": true,
"format": "yyyy/MM/dd||yyyy-MM-dd"
}
}
}
}
}
Upon checking...
GET orders?pretty
Output in orders mapping...
"mappings" : {
"properties" : {
"order" : {
"properties" : {
"properties" : {
"properties" : {
"authorization" : {
"properties" : {
"authorizationDate" : {
"properties" : {
"format" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"ignore_malformed" : {
"type" : "boolean"
},
"type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
}
}
}
}
},
The type became text instead of date, and the date format is not recorded.
Thanks in advance.
POST /orders/_doc/1 -- This will create a new index named orders with a default inferred mapping.
When you are running above it is treating "order", "properties", "ignore_malformed" as fields not part of mapping. hence you can see multiple nested properties in output mapping below.
"properties" : {
"properties" : {
"properties" : {
To create a new mapping first you should run
PUT orders ---> create a new index named orders
{
"mappings": {
"properties": {
"authorization": {
"type": "object", --->should be object/nested was not present in your quest
"properties": {
"authorizationDate": {
"type": "date",
"ignore_malformed": true,
"format": "yyyy/MM/dd||yyyy-MM-dd"
}
}
}
}
}
}
Then adding a new doc using
POST /orders/_doc/1
{
"authorization":{
"authorizationDate":"2019-01-01"
}
}
will give below data
[
{
"_index" : "orders12",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"authorization" : {
"authorizationDate" : "2019-01-01"
}
}
}
]
Link for (Mapping)[https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html]

Manipulating JSON messages from Kafka topic using Logstash filter

I am using Logstash 2.4 to read JSON messages from a Kafka topic and send them to an Elasticsearch Index.
The JSON format is as below --
{
"schema":
{
"type": "struct",
"fields": [
{
"type":"string",
"optional":false,
"field":"reloadID"
},
{
"type":"string",
"optional":false,
"field":"externalAccountID"
},
{
"type":"int64",
"optional":false,
"name":"org.apache.kafka.connect.data.Timestamp",
"version":1,
"field":"reloadDate"
},
{
"type":"int32",
"optional":false,
"field":"reloadAmount"
},
{
"type":"string",
"optional":true,
"field":"reloadChannel"
}
],
"optional":false,
"name":"reload"
},
"payload":
{
"reloadID":"328424295",
"externalAccountID":"9831200013",
"reloadDate":1446242463000,
"reloadAmount":240,
"reloadChannel":"C1"
}
}
Without any filter in my config file, the target documents from the ES index look like below --
{
"_index" : "kafka_reloads",
"_type" : "logs",
"_id" : "AVfcyTU4SyCFNFP2z5-l",
"_score" : 1.0,
"_source" : {
"schema" : {
"type" : "struct",
"fields" : [ {
"type" : "string",
"optional" : false,
"field" : "reloadID"
}, {
"type" : "string",
"optional" : false,
"field" : "externalAccountID"
}, {
"type" : "int64",
"optional" : false,
"name" : "org.apache.kafka.connect.data.Timestamp",
"version" : 1,
"field" : "reloadDate"
}, {
"type" : "int32",
"optional" : false,
"field" : "reloadAmount"
}, {
"type" : "string",
"optional" : true,
"field" : "reloadChannel"
} ],
"optional" : false,
"name" : "reload"
},
"payload" : {
"reloadID" : "155559213",
"externalAccountID" : "9831200014",
"reloadDate" : 1449529746000,
"reloadAmount" : 140,
"reloadChannel" : "C1"
},
"#version" : "1",
"#timestamp" : "2016-10-19T11:56:09.973Z",
}
}
But, I want only the value part of the "payload" field to move to my ES index as the target JSON body. So I tried to use the 'mutate' filter in the config file as below --
input {
kafka {
zk_connect => "zksrv-1:2181,zksrv-2:2181,zksrv-4:2181"
group_id => "logstash"
topic_id => "reload"
consumer_threads => 3
}
}
filter {
mutate {
remove_field => [ "schema","#version","#timestamp" ]
}
}
output {
elasticsearch {
hosts => ["datanode-6:9200","datanode-2:9200"]
index => "kafka_reloads"
}
}
With this filter, the ES documents now look like below --
{
"_index" : "kafka_reloads",
"_type" : "logs",
"_id" : "AVfch0yhSyCFNFP2z59f",
"_score" : 1.0,
"_source" : {
"payload" : {
"reloadID" : "850846698",
"externalAccountID" : "9831200013",
"reloadDate" : 1449356706000,
"reloadAmount" : 30,
"reloadChannel" : "C1"
}
}
}
But actually It should be like below --
{
"_index" : "kafka_reloads",
"_type" : "logs",
"_id" : "AVfch0yhSyCFNFP2z59f",
"_score" : 1.0,
"_source" : {
"reloadID" : "850846698",
"externalAccountID" : "9831200013",
"reloadDate" : 1449356706000,
"reloadAmount" : 30,
"reloadChannel" : "C1"
}
}
Is there a way to do this? Can anyone help me on this?
I also tried the below filter --
filter {
json {
source => "payload"
}
}
But that is giving me errors like --
Error parsing json {:source=>"payload", :raw=>{"reloadID"=>"572584696", "externalAccountID"=>"9831200011", "reloadDate"=>1449093851000, "reloadAmount"=>180, "reloadChannel"=>"C1"}, :exception=>java.lang.ClassCastException: org.jruby.RubyHash cannot be cast to org.jruby.RubyIO, :level=>:warn}
Any help will be much appreciated.
Thanks
Gautam Ghosh
You can achieve what you want using the following ruby filter:
ruby {
code => "
event.to_hash.delete_if {|k, v| k != 'payload'}
event.to_hash.update(event['payload'].to_hash)
event.to_hash.delete_if {|k, v| k == 'payload'}
"
}
What it does is:
remove all fields but the payload one
copy all payload inner fields at the root level
delete the payload field itself
You'll end up with what you need.
It's been a while but here there is a valid workaround, hope it would be useful.
json_encode {
source => "json"
target => "json_string"
}
json {
source => "json_string"
}

Elasticsearch: No filter registered for [geo_shape]]

I have a "counties" index, with county documents that resemble the following (extra polygon points ommitted for brevity):
{ "fips" : 1093,
"location" : {
"type" : "polygon",
"coordinates" : [ [ [ -88.194525, 34.157699 ], [ -88.192128, 34.175351 ], ..., [ -88.194525, 34.157699 ] ] ]
}
}
I have created a mapping for these counties:
"mappings" : {
"county" : {
"properties" : {
"location" : {
"type" : "geo_shape"
}
}
}
}
}
I then try to query these documents with a query like the following:
{
"query" : {
"filtered" : {
"query" : {
"match_all" : { }
},
"filter" : {
"geo_shape" : {
"location" : {
"shape" : {
"type" : "envelope",
"coordinates" : [[-87.17863946,41.57623478],[-87.17863846,41.57623578]]
}
}
}
}
}
}
}g
This returns the following error:
{[OQuTlmv0RdmC34MLnN8qHQ][counties][1]: SearchParseException[[counties][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{\"query\":{\"filtered\":{\"query\":{\"match_all\":{}},\"filter\":{\"geo_shape\":{\"location\":{\"shape\":{\"type\":\"envelope\",\"coordinates\":[[-87.17863946,41.57623478],[-87.17863846,41.57623578]]}}}}}}}]]]; nested: QueryParsingException[[counties] No filter registered for [geo_shape]]; }]",
"status" : 400
}
I can't find any reference to a similar error when searching. I have tried this in filter and query forms, with similar errors. Using Elasticsearch 1.1.0.