Manipulating JSON messages from Kafka topic using Logstash filter - json

I am using Logstash 2.4 to read JSON messages from a Kafka topic and send them to an Elasticsearch Index.
The JSON format is as below --
{
    "schema": {
        "type": "struct",
        "fields": [
            {
                "type": "string",
                "optional": false,
                "field": "reloadID"
            },
            {
                "type": "string",
                "optional": false,
                "field": "externalAccountID"
            },
            {
                "type": "int64",
                "optional": false,
                "name": "org.apache.kafka.connect.data.Timestamp",
                "version": 1,
                "field": "reloadDate"
            },
            {
                "type": "int32",
                "optional": false,
                "field": "reloadAmount"
            },
            {
                "type": "string",
                "optional": true,
                "field": "reloadChannel"
            }
        ],
        "optional": false,
        "name": "reload"
    },
    "payload": {
        "reloadID": "328424295",
        "externalAccountID": "9831200013",
        "reloadDate": 1446242463000,
        "reloadAmount": 240,
        "reloadChannel": "C1"
    }
}
Without any filter in my config file, the target documents from the ES index look like below --
{
    "_index" : "kafka_reloads",
    "_type" : "logs",
    "_id" : "AVfcyTU4SyCFNFP2z5-l",
    "_score" : 1.0,
    "_source" : {
        "schema" : {
            "type" : "struct",
            "fields" : [ {
                "type" : "string",
                "optional" : false,
                "field" : "reloadID"
            }, {
                "type" : "string",
                "optional" : false,
                "field" : "externalAccountID"
            }, {
                "type" : "int64",
                "optional" : false,
                "name" : "org.apache.kafka.connect.data.Timestamp",
                "version" : 1,
                "field" : "reloadDate"
            }, {
                "type" : "int32",
                "optional" : false,
                "field" : "reloadAmount"
            }, {
                "type" : "string",
                "optional" : true,
                "field" : "reloadChannel"
            } ],
            "optional" : false,
            "name" : "reload"
        },
        "payload" : {
            "reloadID" : "155559213",
            "externalAccountID" : "9831200014",
            "reloadDate" : 1449529746000,
            "reloadAmount" : 140,
            "reloadChannel" : "C1"
        },
        "@version" : "1",
        "@timestamp" : "2016-10-19T11:56:09.973Z"
    }
}
But I want only the value part of the "payload" field to go to my ES index as the target JSON body. So I tried to use the mutate filter in the config file, as below --
input {
    kafka {
        zk_connect => "zksrv-1:2181,zksrv-2:2181,zksrv-4:2181"
        group_id => "logstash"
        topic_id => "reload"
        consumer_threads => 3
    }
}
filter {
    mutate {
        remove_field => [ "schema", "@version", "@timestamp" ]
    }
}
output {
    elasticsearch {
        hosts => ["datanode-6:9200","datanode-2:9200"]
        index => "kafka_reloads"
    }
}
With this filter, the ES documents now look like below --
{
    "_index" : "kafka_reloads",
    "_type" : "logs",
    "_id" : "AVfch0yhSyCFNFP2z59f",
    "_score" : 1.0,
    "_source" : {
        "payload" : {
            "reloadID" : "850846698",
            "externalAccountID" : "9831200013",
            "reloadDate" : 1449356706000,
            "reloadAmount" : 30,
            "reloadChannel" : "C1"
        }
    }
}
But actually it should be like below --
{
    "_index" : "kafka_reloads",
    "_type" : "logs",
    "_id" : "AVfch0yhSyCFNFP2z59f",
    "_score" : 1.0,
    "_source" : {
        "reloadID" : "850846698",
        "externalAccountID" : "9831200013",
        "reloadDate" : 1449356706000,
        "reloadAmount" : 30,
        "reloadChannel" : "C1"
    }
}
Is there a way to do this? Can anyone help me on this?
I also tried the below filter --
filter {
    json {
        source => "payload"
    }
}
But that is giving me errors like --
Error parsing json {:source=>"payload", :raw=>{"reloadID"=>"572584696", "externalAccountID"=>"9831200011", "reloadDate"=>1449093851000, "reloadAmount"=>180, "reloadChannel"=>"C1"}, :exception=>java.lang.ClassCastException: org.jruby.RubyHash cannot be cast to org.jruby.RubyIO, :level=>:warn}
Any help will be much appreciated.
Thanks
Gautam Ghosh

You can achieve what you want using the following ruby filter:
ruby {
    code => "
        event.to_hash.delete_if {|k, v| k != 'payload'}
        event.to_hash.update(event['payload'].to_hash)
        event.to_hash.delete_if {|k, v| k == 'payload'}
    "
}
What it does is:
- remove all fields but the payload one
- copy all of payload's inner fields to the root level
- delete the payload field itself
You'll end up with what you need.
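Note that this relies on the Logstash 2.x event API, where the event can be manipulated as a hash. If you later move to Logstash 5.x or newer, direct hash access is removed in favor of get/set/remove; an equivalent, untested sketch for the newer API would be:

ruby {
    code => "
        # copy each inner field of payload up to the event root
        event.get('payload').each { |k, v| event.set(k, v) }
        # then drop the wrapper field itself
        event.remove('payload')
    "
}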

It's been a while, but here is a valid workaround; I hope it's useful. The trick is to serialize the object field back into a JSON string and then parse that string again:
json_encode {
    source => "json"
    target => "json_string"
}
json {
    source => "json_string"
}
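Note that json_encode does not ship with Logstash by default, so it may need to be installed first (the command name assumes Logstash 2.3+; older 2.x releases used bin/plugin):

bin/logstash-plugin install logstash-filter-json_encode

For the question above, source would presumably be the payload field: serialize payload into a string with json_encode, then parse that string with the json filter so its keys land at the root of the event.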

Related

Creating Multiple QueueConfigurations in CloudFormation

I'm currently trying to write multiple QueueConfigurations into my CloudFormation template. Each is an SQS queue that is triggered when an object is created under a specified prefix. Here's what I have so far:
{
    "Resources": {
        "S3Bucket": {
            "Type" : "AWS::S3::Bucket",
            "Properties" : {
                "BucketName" : { "Ref" : "paramBucketName" },
                "LoggingConfiguration" : {
                    "DestinationBucketName" : "test-bucket",
                    "LogFilePrefix" : { "Fn::Join": [ "", [ { "Ref": "paramBucketName" }, "/" ] ] }
                },
                "NotificationConfiguration" : {
                    "QueueConfigurations" : [{
                        "Id" : "1",
                        "Event" : "s3:ObjectCreated:*",
                        "Filter" : {
                            "S3Key" : {
                                "Rules" : {
                                    "Name" : "prefix",
                                    "Value" : "folder1/"
                                }
                            }
                        },
                        "Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
                    }],
                    "QueueConfigurations" : [{
                        "Id" : "2",
                        "Event" : "s3:ObjectCreated:*",
                        "Filter" : {
                            "S3Key" : {
                                "Rules" : {
                                    "Name" : "prefix",
                                    "Value" : "folder2/"
                                }
                            }
                        },
                        "Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
                    }]
                }
            }
        }
    }
}
I've encountered the error saying Encountered unsupported property Id. I thought that by defining the ID, I would be able to avoid the Duplicate object key error.
Does anyone know how to create multiple triggers in a single CloudFormation template? Thanks for the help in advance.
It should be structured like the example below: there should be only one QueueConfigurations attribute, containing all queue configurations within it. Also, the Id parameter is not a valid property.
{
    "Resources": {
        "S3Bucket": {
            "Type" : "AWS::S3::Bucket",
            "Properties" : {
                "BucketName" : { "Ref" : "paramBucketName" },
                "LoggingConfiguration" : {
                    "DestinationBucketName" : "test-bucket",
                    "LogFilePrefix" : { "Fn::Join": [ "", [ { "Ref": "paramBucketName" }, "/" ] ] }
                },
                "NotificationConfiguration" : {
                    "QueueConfigurations" : [{
                        "Event" : "s3:ObjectCreated:*",
                        "Filter" : {
                            "S3Key" : {
                                "Rules" : {
                                    "Name" : "prefix",
                                    "Value" : "folder1/"
                                }
                            }
                        },
                        "Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-cdc_feeder_prod_hvr_dev"
                    },
                    {
                        "Event" : "s3:ObjectCreated:*",
                        "Filter" : {
                            "S3Key" : {
                                "Rules" : {
                                    "Name" : "prefix",
                                    "Value" : "folder2/"
                                }
                            }
                        },
                        "Queue" : "arn:aws:sqs:us-east-1:958262988361:interstate-latency_hvr_dev"
                    }]
                }
            }
        }
    }
}
There is more information about QueueConfiguration in the documentation.
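Once the duplicate keys are gone, a quick way to catch structural mistakes before creating the stack is to validate the template (assuming the AWS CLI is configured and the template is saved as template.json):

aws cloudformation validate-template --template-body file://template.json

Note that this only checks template syntax; property-level errors such as the unsupported Id generally surface only when the stack is actually created or updated.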

Update deeply nested array in mongodb

I am trying to update a field value in Mongoose.
{
    "_id" : ObjectId("5b62c772efedb6bd3f0c983a"),
    "projectID" : ObjectId("0000000050e62416d0d75837"),
    "__v" : 0,
    "clientID" : ObjectId("00000000996b902b7c3f5efa"),
    "inspection_data" : [
        {
            "pdf" : null,
            "published" : "N",
            "submissionTime" : ISODate("2018-08-02T08:57:08.532Z"),
            "userID" : ObjectId("00000000cac68e3bc04643f7"),
            "insSummary" : "inspected areas",
            "insName" : "Infotech",
            "_id" : ObjectId("5b62c772fa02622a18655e7b"),
            "published_date" : ISODate("2018-08-02T08:57:22.041Z"),
            "locationAspects" : [
                {
                    "aspectname" : "Ground floor",
                    "_id" : ObjectId("5b62c772fa02622a18655e80"),
                    "comments" : [
                        {
                            "_id" : ObjectId("5b62c772fa02622a18655e81"),
                            "images" : [
                                {
                                    "path" : "/uploads/inspection/00000000996b902b7c3f5efa/images/1533200242005-IpjLKH4XFWNEcHXa.png",
                                    "img_name" : "1533200242005-IpjLKH4XFWNEcHXa.png",
                                    "title" : "Fan",
                                    "id" : "1"
                                },
                                {
                                    "path" : "/uploads/inspection/00000000996b902b7c3f5efa/images/1533200242008-YN8IlA5yrMn3cBnn.png",
                                    "img_name" : "1533200242008-YN8IlA5yrMn3cBnn.png",
                                    "title" : "Box",
                                    "id" : "2"
                                }
                            ],
                            "comment" : [
                                "comment4"
                            ],
                            "recommendation" : ""
                        }
                    ]
                }
            ]
        }
    ]
}
Here I want to update the title "Fan" in the images array to "table fan".
I tried $set, but I don't know how to write it for my DB structure.
Kindly suggest a solution.
**Updated:**
I tried this code:
mongo.inspection.update({ "projectID" : mongoose.Types.ObjectId(req.body.project_id) },
    { "$set": {
        "inspection_data.$[e1].locationAspects.$[e2].comments.$[e3].images.$[e4].title" : "TableFan"
    }},
    { "arrayFilters": [
        { "e1._id": mongoose.Types.ObjectId(req.body.insId) },
        { "e2._id": mongoose.Types.ObjectId(req.body.aspectId) },
        { "e3._id": mongoose.Types.ObjectId(req.body.commentId) },
        { "e4.id": "1" }
    ]}, function(err, response){
        if(err){
            console.log("error")
        }
        else{
            console.log('Updated')
            console.log(response)
        }
    })
db.adminCommand( { setFeatureCompatibilityVersion: "3.6" } )
It says updated, but there is no change in my DB. Did I make a mistake somewhere?
You can try arrayFilters in MongoDB:
var mongoose = require('mongoose')

Temp.update(
    { "_id" : mongoose.Types.ObjectId("5b62c772efedb6bd3f0c983a") },
    { "$set": {
        "inspection_data.$[e1].locationAspects.$[e2].comments.$[e3].images.$[e4].title": "TableFan"
    }},
    { "arrayFilters": [
        { "e1._id": mongoose.Types.ObjectId("5b62c772fa02622a18655e7b") },
        { "e2._id": mongoose.Types.ObjectId("5b62c772fa02622a18655e80") },
        { "e3._id": mongoose.Types.ObjectId("5b62c772fa02622a18655e81") },
        { "e4.id": "1" }
    ]}
)
Note: You have to cast _id to ObjectId
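Two things worth checking when the update reports success but nothing changes (these are common pitfalls, not something visible in the posted code): arrayFilters needs MongoDB 3.6+ on the server (hence the setFeatureCompatibilityVersion call above) and Mongoose 5.x on the client; Mongoose 4.x silently drops the arrayFilters option. To take Mongoose out of the picture, you can run the equivalent update directly in the mongo shell (the collection name inspections is a placeholder):

db.inspections.update(
    { "_id" : ObjectId("5b62c772efedb6bd3f0c983a") },
    { "$set": { "inspection_data.$[e1].locationAspects.$[e2].comments.$[e3].images.$[e4].title": "TableFan" } },
    { "arrayFilters": [
        { "e1._id": ObjectId("5b62c772fa02622a18655e7b") },
        { "e2._id": ObjectId("5b62c772fa02622a18655e80") },
        { "e3._id": ObjectId("5b62c772fa02622a18655e81") },
        { "e4.id": "1" }
    ]}
)
// a successful run should report nModified: 1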

How to only get the data of the nested JSON object MongoDB with node.js

I have a JSON document in my MongoDB:
{
    "_id" : ObjectId("59d4b9848621854d8fb2b1e1"),
    "Bot_name" : "Scheduling bot",
    "Modules" : [
        {
            "ModuleID" : "1111",
            "ModuleStatement" : "This is a Sceduling bot, Would you like to book a flight?",
            "_id" : ObjectId("59d4b9968621854d8fb2b1e3"),
            "ModuleResponse" : [
                {
                    "Response" : "yes",
                    "TransBotID" : "1112"
                },
                {
                    "Response" : "no",
                    "TransBotID" : "1113"
                }
            ]
        },
        {
            "ModuleID" : "1112",
            "ModuleStatement" : "Where would you like to go? New York ? LA?",
            "_id" : ObjectId("59d4b9968621854d8fb2b1e3"),
            "ModuleResponse" : [
                {
                    "Response" : "New York",
                    "TransBotID" : "1121"
                },
                {
                    "Response" : "LA",
                    "TransBotID" : "1122"
                }
            ]
        },
        {
            "ModuleID" : "1121",
            "ModuleStatement" : " New York..",
            "_id" : ObjectId("59d4b9968621854d8fb2b1e3"),
            "ModuleResponse" : []
        },
        {
            "ModuleID" : "1121",
            "ModuleStatement" : " New York..",
            "_id" : ObjectId("59d4b9968621854d8fb2b1e3"),
            "ModuleResponse" : []
        }
    ]
}
I'm making a query that first checks Bot_name and then checks the ModuleID, which lives in the nested Modules array of JSON objects (1111, 1112, 1121, and so on).
How do I get only the JSON object with ModuleID: 1111 of Bot_name: "Scheduling bot"?
So far my query is:
botSchema.findOne({ Bot_name: req.body.Name, 'Modules.ModuleID': req.body.MID }, function (err, data) {
    console.log(data)
})
Here the query returns all the JSON inside Modules.
How do I get only the one desired JSON object, like this?
{
    "ModuleID" : "1111",
    "ModuleStatement" : "This is a Sceduling bot, Would you like to book a flight?",
    "_id" : ObjectId("59d4b9968621854d8fb2b1e3"),
    "ModuleResponse" : [
        {
            "Response" : "yes",
            "TransBotID" : "1112"
        },
        {
            "Response" : "no",
            "TransBotID" : "1113"
        }
    ]
}
You need to use $elemMatch to filter sub-arrays.
db.botSchema.findOne(
    { Bot_name: "Scheduling bot" },
    { 'Modules': { $elemMatch: { 'ModuleID': "1111" } } },
    function (err, data) { console.log(data) })
Result:
{
    "_id" : ObjectId("59d4b9848621854d8fb2b1e1"),
    "Modules" : [
        {
            "ModuleID" : "1111",
            "ModuleStatement" : "This is a Sceduling bot, Would you like to book a flight?",
            "_id" : ObjectId("59d4b9968621854d8fb2b1e3"),
            "ModuleResponse" : [
                {
                    "Response" : "yes",
                    "TransBotID" : "1112"
                },
                {
                    "Response" : "no",
                    "TransBotID" : "1113"
                }
            ]
        }
    ]
}
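Note that the $elemMatch projection still returns the matching element wrapped in the Modules array, as the result above shows. A small sketch to unwrap it inside the callback (the guard against a missing match is my addition):

botSchema.findOne(
    { Bot_name: req.body.Name },
    { 'Modules': { $elemMatch: { 'ModuleID': req.body.MID } } },
    function (err, data) {
        // Modules contains at most one element thanks to $elemMatch
        var module = data && data.Modules && data.Modules[0];
        console.log(module)
    })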

Creating a geopoint object from CSV (latitude and longitude columns) in Logstash

I have a CSV with the columns latitude and longitude and I'm trying to create a geopoint object in Logstash 2.3.3 so that I can visualize these values in Kibana 4.5.1.
However, when visualizing the data in Kibana, I see location.lat and location.lon, both of type float, and not a location field of type geo_point.
I'm new to ELK in general and this is driving me crazy, especially because most of the information I'm finding is outdated.
The .conf file that I'm using looks like this:
input {
    file {
        path => "C:/file.csv"
        start_position => "beginning"
    }
}
filter {
    csv {
        separator => ","
        columns => ["longitude","latitude"]
    }
    mutate { convert => {"latitude" => "float"} }
    mutate { convert => {"longitude" => "float"} }
    mutate { rename => {"latitude" => "[location][lat]"} }
    mutate { rename => {"longitude" => "[location][lon]"} }
    mutate { convert => { "[location]" => "float" } }
}
output {
    elasticsearch {
        template => "...\elasticsearch-template.json"
        template_overwrite => true
        action => "index"
        hosts => "localhost"
        index => "testindex1"
        workers => 1
    }
    stdout {}
}
The template file I'm specifying (elasticsearch-template.json) is the following:
{
    "template" : "logstash-*",
    "settings" : {
        "index.refresh_interval" : "5s"
    },
    "mappings" : {
        "_default_" : {
            "_all" : { "enabled" : true, "omit_norms" : true },
            "dynamic_templates" : [ {
                "message_field" : {
                    "match" : "message",
                    "match_mapping_type" : "string",
                    "mapping" : {
                        "type" : "string", "index" : "analyzed", "omit_norms" : true,
                        "fielddata" : { "format" : "disabled" }
                    }
                }
            }, {
                "string_fields" : {
                    "match" : "*",
                    "match_mapping_type" : "string",
                    "mapping" : {
                        "type" : "string", "index" : "analyzed", "omit_norms" : true,
                        "fielddata" : { "format" : "disabled" },
                        "fields" : {
                            "raw" : { "type": "string", "index" : "not_analyzed", "ignore_above" : 256 }
                        }
                    }
                }
            } ],
            "properties" : {
                "@timestamp": { "type": "date" },
                "@version": { "type": "string", "index": "not_analyzed" },
                "geoip" : {
                    "dynamic": true,
                    "properties" : {
                        "ip": { "type": "ip" },
                        "location" : { "type" : "geo_point" },
                        "latitude" : { "type" : "float" },
                        "longitude" : { "type" : "float" }
                    }
                },
                "location" : { "type": "geo_point" }
            }
        }
    }
}
If anyone could help me or give me some insight into what I'm doing wrong, I would be very grateful. I'm also sure this will help everyone who is in the same boat as me.
I solved it and it is now working perfectly. The template only applies to indices matching logstash-*, and I was using testindex1. Changing my index to logstash-%{+dd.MM.YYYY} fixed it.
You need to remove the last mutate filter, which defeats the purpose of what you're trying to achieve.
You also need to make sure that the testindex1 mapping faithfully contains the mapping you have in your elasticsearch-template.json file.
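Putting both fixes together: drop the final mutate (so [location] stays an object with lat/lon floats) and write to an index that the template actually matches. A sketch of the corrected filter section:

filter {
    csv {
        separator => ","
        columns => ["longitude","latitude"]
    }
    mutate { convert => {"latitude" => "float"} }
    mutate { convert => {"longitude" => "float"} }
    mutate { rename => {"latitude" => "[location][lat]"} }
    mutate { rename => {"longitude" => "[location][lon]"} }
}

You can then confirm that location was mapped as geo_point (and not as two floats) with:

curl -XGET 'http://localhost:9200/logstash-*/_mapping?pretty'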

Sub-records in Avro with Morphlines

I'm trying to convert JSON into Avro using the kite-sdk morphline module. After playing around I'm able to convert the JSON into Avro using a simple schema (no complex data types).
Then I took it one step further and modified the Avro schema, as displayed below (subrec.avsc). As you can see, the schema consists of a subrecord.
As soon as I tried to convert the JSON to Avro using the morphlines.conf and the subrec.avsc, it failed.
Somehow the JSON paths "/record_type[]/alert/action" are not translated by the toAvro function.
The morphlines.conf
morphlines : [
    {
        id : morphline1
        importCommands : ["org.kitesdk.**"]
        commands : [
            # Read the JSON blob
            { readJson: {} }
            { logError { format : "record: {}", args : ["@{}"] } }

            # Extract JSON paths
            { extractJsonPaths { flatten: false, paths: {
                "/record_type[]/alert/action" : /alert/action,
                "/record_type[]/alert/signature_id" : /alert/signature_id,
                "/record_type[]/alert/signature" : /alert/signature,
                "/record_type[]/alert/category" : /alert/category,
                "/record_type[]/alert/severity" : /alert/severity
            } } }
            { logError { format : "EXTRACTED THIS : {}", args : ["@{}"] } }

            { extractJsonPaths { flatten: false, paths: {
                timestamp : /timestamp,
                event_type : /event_type,
                source_ip : /src_ip,
                source_port : /src_port,
                destination_ip : /dest_ip,
                destination_port : /dest_port,
                protocol : /proto
            } } }

            # Create Avro according to the schema
            { logError { format : "WE GO TO AVRO"} }
            { toAvro { schemaFile : /etc/flume/conf/conf.empty/subrec.avsc } }

            # Create Avro container
            { logError { format : "WE GO TO BINARY"} }
            { writeAvroToByteArray { format: containerlessBinary } }

            { logError { format : "DONE!!!"} }
        ]
    }
]
And the subrec.avsc
{
    "type" : "record",
    "name" : "Event",
    "fields" : [ {
        "name" : "timestamp",
        "type" : "string"
    }, {
        "name" : "event_type",
        "type" : "string"
    }, {
        "name" : "source_ip",
        "type" : "string"
    }, {
        "name" : "source_port",
        "type" : "int"
    }, {
        "name" : "destination_ip",
        "type" : "string"
    }, {
        "name" : "destination_port",
        "type" : "int"
    }, {
        "name" : "protocol",
        "type" : "string"
    }, {
        "name" : "record_type",
        "type" : [ "null", {
            "name" : "alert",
            "type" : "record",
            "fields" : [ {
                "name" : "action",
                "type" : "string"
            }, {
                "name" : "signature_id",
                "type" : "int"
            }, {
                "name" : "signature",
                "type" : "string"
            }, {
                "name" : "category",
                "type" : "string"
            }, {
                "name" : "severity",
                "type" : "int"
            } ]
        } ]
    } ]
}
The output of { logError { format : "EXTRACTED THIS : {}", args : ["@{}"] } } is the following:
[{
    /record_type[]/alert/action = [allowed],
    /record_type[]/alert/category = [],
    /record_type[]/alert/severity = [3],
    /record_type[]/alert/signature = [GeoIP from NL, Netherlands],
    /record_type[]/alert/signature_id = [88006],
    _attachment_body = [{
        "timestamp": "2015-03-23T07:42:01.303046",
        "event_type": "alert",
        "src_ip": "1.1.1.1",
        "src_port": 18192,
        "dest_ip": "46.231.41.166",
        "dest_port": 62004,
        "proto": "TCP",
        "alert": {
            "action": "allowed",
            "gid": "1",
            "signature_id": "88006",
            "rev": "1",
            "signature": "GeoIP from NL, Netherlands ",
            "category": "",
            "severity": "3"
        }
    }],
    _attachment_mimetype = [json/java + memory],
    basename = [simple_eve.json]
}]
UPDATE 2017-06-22: you MUST populate the data in the structure for this to work, using addValues or setValues:
{
    addValues {
        micDefaultHeader : [
            {
                eventTimestampString : "2017-06-22 18:18:36"
            }
        ]
    }
}
After debugging the sources of the morphlines toAvro command, it appears that the record is the first object to be evaluated, no matter what you put in your mappings structure.
The solution is quite simple, but unfortunately took a little extra time: Eclipse, running the Flume agent in debug mode, cloning the source code, and lots of coffee.
Here it goes.
My schema:
{
    "type" : "record",
    "name" : "co_lowbalance_event",
    "namespace" : "co.tigo.billing.cboss.lowBalance",
    "fields" : [ {
        "name" : "dummyValue",
        "type" : "string",
        "default" : "dummy"
    }, {
        "name" : "micDefaultHeader",
        "type" : {
            "type" : "record",
            "name" : "mic_default_header_v_1_0",
            "namespace" : "com.millicom.schemas.root.struct",
            "doc" : "standard millicom header definition",
            "fields" : [ {
                "name" : "eventTimestampString",
                "type" : "string",
                "default" : "12345678910"
            } ]
        }
    } ]
}
The morphlines file:
morphlines : [
    {
        id : convertJsonToAvro
        importCommands : ["org.kitesdk.**"]
        commands : [
            {
                readJson {
                    outputClass : java.util.Map
                }
            }
            {
                addValues {
                    micDefaultHeader : [{}]
                }
            }
            {
                logDebug { format : "my record: {}", args : ["@{}"] }
            }
            {
                toAvro {
                    schemaFile : /home/asarubbi/Development/test/co_lowbalance_event.avsc
                    mappings : {
                        "micDefaultHeader" : micDefaultHeader
                        "micDefaultHeader/eventTimestampString" : eventTimestampString
                    }
                }
            }
            {
                writeAvroToByteArray {
                    format : containerlessJSON
                    codec : null
                }
            }
        ]
    }
]
The magic lies here:
{
    addValues {
        micDefaultHeader : [{}]
    }
}
And in the mappings:
mappings : {
    "micDefaultHeader" : micDefaultHeader
    "micDefaultHeader/eventTimestampString" : eventTimestampString
}
Explanation: inside the code, the first field name that is evaluated is micDefaultHeader, of type RECORD. As there is no way to specify a default value for a RECORD (logically correct), the toAvro code evaluates it, gets no value configured in the mappings, and therefore fails, as it (wrongly) detects that the record is empty when it shouldn't be.
However, taking a look at the code, you may see that it requires a Map object containing no values to please the parser and continue to the next element.
So we add a map object using addValues and fill it with an empty map [{}]. Notice that its name must match the name of the record that is causing the empty value; in my case, "micDefaultHeader".
Feel free to comment if you have a better solution, as this looks like a "dirty fix".