Handling arbitrary JSON logs in the ELK stack

I'm trying to set up a full ELK stack for managing logs from our Kubernetes clusters. Our applications log either plain text or JSON objects. I want to be able to search the text logs, and also to index and search the fields in the JSON.
I have Filebeat running on each Kubernetes node, picking up the Docker logs, enriching them with various Kubernetes fields, and adding a few fields we use internally. The complete filebeat.yml is:
filebeat.inputs:
- type: container
  paths:
    - /var/log/containers/*.log
  processors:
    - add_kubernetes_metadata:
        host: ${NODE_NAME}
        matchers:
          - logs_path:
              logs_path: "/var/log/containers/"

fields:
  kubernetes.cluster: <name of the cluster>
  environment: <environment of the cluster>
  datacenter: <datacenter the cluster is running in>
fields_under_root: true

output.logstash:
  hosts: ["logstash-logstash-headless:5044"]
Filebeat ships the resulting logs to a central Logstash instance I have installed. In Logstash I attempt to parse the log message field into a new field called message_parsed. The complete pipeline looks like this:
input {
  beats {
    port => 5044
    type => "beats"
    tags => ["beats"]
  }
}

filter {
  json {
    source => "message"
    target => "message_parsed"
    skip_on_invalid_json => true
  }
}

output {
  elasticsearch {
    hosts => [
      "elasticsearch-logging-ingest-headless:9200"
    ]
  }
}
I then have an Elasticsearch cluster installed which receives the logs. I have separate Data, Ingest and Master nodes. Apart from some CPU and memory configuration, the cluster runs with completely default settings.
The trouble I'm having is that I do not control the contents of the JSON messages. They could have any field of any type, and we have many cases where the same field exists but its values are of differing types. One simple example is the field level, which is usually a string carrying the values "debug", "info", "warn" or "error", but we also run some software that outputs this level as a numeric value. Other cases include error fields that are sometimes objects and sometimes strings, and date fields that are sometimes Unix timestamps and sometimes human-readable dates.
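For illustration, here are two made-up log lines of the kind I mean, with the same field appearing as two different JSON types:

{"level": "info", "message": "user logged in"}
{"level": 30, "message": "user logged in"}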
This of course makes Elasticsearch complain with a mapper_parsing_exception. Here's an example of one such error:
[2021-04-07T15:57:31,200][WARN ][logstash.outputs.elasticsearch][main][19f6c57d0cbe928f269b66714ce77f539d021549b68dc20d8d3668bafe0acd21] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash", :routing=>nil, :_type=>"_doc"}, #<LogStash::Event:0x1211193c>], :response=>{"index"=>{"_index"=>"logstash-2021.04.06-000014", "_type"=>"_doc", "_id"=>"L80NrXgBRfSv8axlknaU", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [message_parsed.error] tried to parse field [error] as object, but found a concrete value"}}}}
Is there any way I can make Elasticsearch handle that case?

Related

Filtering with regex vs json

When filtering logs, Logstash may use grok to parse the received log file (let's say it is Nginx logs). Parsing with grok requires you to properly set the field type - e.g., %{HTTPDATE:timestamp}.
However, if Nginx starts logging in JSON format then Logstash does very little processing. It simply creates the index and outputs to Elasticsearch. This leads me to believe that only Elasticsearch benefits from the "way" it receives the index.
Is there any advantage for Elasticsearch in having index data that was processed with regex vs. JSON? E.g., does it impact query time?
For Elasticsearch it doesn't matter how you parse the messages; it has no information about that. You only need to send a JSON document with the fields that you want to store and search on, according to your index mapping.
However, how you parse the message matters for Logstash, since it directly impacts performance.
For example, consider the following message:
2020-04-17 08:10:50,123 [26] INFO ApplicationName - LogMessage From The Application
If you want to be able to search and apply filters on each part of this message, you will need to parse it into fields.
timestamp: 2020-04-17 08:10:50,123
thread: 26
loglevel: INFO
application: ApplicationName
logmessage: LogMessage From The Application
To parse this message you can use different filters. One of them is grok, which uses regex; but if your message always has the same format, you can use another filter, like dissect. Both will achieve the same thing here, but while grok uses regex to match the fields, dissect is purely positional, which makes a huge difference in CPU usage when you have a high number of events per second.
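As a rough sketch, the two approaches could look like this for the sample line above (the grok pattern is built from standard Logstash patterns; the exact field split is only illustrative):

filter {
  # Regex-based: flexible, but heavier on CPU per event.
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{NUMBER:thread}\] %{LOGLEVEL:loglevel} %{WORD:application} - %{GREEDYDATA:logmessage}" }
  }

  # Positional alternative: cheaper, but the line layout must never change.
  # Use one or the other, not both.
  # dissect {
  #   mapping => { "message" => "%{date} %{time} [%{thread}] %{loglevel} %{application} - %{logmessage}" }
  # }
}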
Consider now that you have the same message, but in a JSON format.
{ "timestamp":"2020-04-17 08:10:50,123", "thread":26, "loglevel":"INFO", "application":"ApplicationName","logmessage":"LogMessage From The Application" }
It is easier and faster for Logstash to parse this message: you can do it in your input using the json codec, or you can use the json filter in your filter block.
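For example, the codec approach could look like the following sketch (a file input and path are used only for illustration; any input that supports codecs works the same way):

input {
  file {
    path  => "/var/log/app/app.json"
    codec => "json"   # each line is decoded into event fields as it is read
  }
}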
If you have control over how your log messages are created, choose a format that does not require grok.

How to parse JSON from Terraform null_resource into map using data external block

I am trying to parse JSON key/value pairs into a map I can use in Terraform during a lookup.
I've created a null_resource with a local-exec provisioner to run my AWS CLI command, and then parsed the result with jq to clean it up. The JSON looks good; the correct key/value pairs are displayed when run from the CLI. I created an external data block to convert the JSON into a TF map, but I'm getting an Incorrect attribute error from TF.
resource "null_resource" "windows_vars" {
provisioner "local-exec" {
command = "aws ssm --region ${var.region} --profile ${var.profile} get-parameters-by-path --recursive --path ${var.path} --with-decryption | jq '.Parameters | map({'key': .Name, 'value': .Value}) | from_entries'"
}
}
data "external" "json" {
depends_on = [null_resource.windows_vars]
program = ["echo", "${null_resource.windows_vars}"]
}
output "map" {
value = ["${values(data.external.json.result)}"]
}
I expected the key/value pairs to be added to a TF map I could use elsewhere.
I got the following error:
Error: Incorrect attribute value type
on instances/variables.tf line 33, in data "external" "json":
33: program = ["echo", "${null_resource.windows_vars}"]
Inappropriate value for attribute "program": element 1: string required.
JSON output looks like this:
{
  "/vars/windows/KEY_1": "VALUE_1",
  "/vars/windows/KEY_2": "VALUE_2",
  "/vars/windows/KEY_3": "VALUE_3",
  "/vars/windows/KEY_4": "VALUE_4"
}
I actually answered my own question. I am using a data external block to run my aws cli command and referencing the block in my module.
data "external" "json" {
program = ["sh", "-c", "aws ssm --region ${var.region} --profile ${var.profile} get-parameters-by-path --recursive --path ${var.path} --with-decryption | jq '.Parameters | map({'key': .Name, 'value': .Value}) | from_entries'"]
}
The ${var.amis["win2k19_base"]} expression does a lookup on a map of AMI IDs I use, and I am using that as the key in the Parameter Store for the value I am looking for.
Inside my module I am using this:
instance_var = data.external.json.result["${var.path}${var.amis["win2k19_base"]}"]
Thank you for the great suggestions.
An alternative way to address this would be to write a data-only module which encapsulates the data fetching and has its own configured aws provider to fetch from the right account.
Although it's usually not recommended for a child module to have its own provider blocks, that is allowed and can be okay if the provider in question is only being used to fetch data sources, because Terraform will never need to "destroy" those. The recommendation against nested module provider blocks is that it will cause trouble if you remove a module while the resource objects declared inside it still exist, and then there's no provider configuration left to use to destroy them.
With that said, here's an example of the above idea, intended to be used as a child module which can be imported by any configuration that needs access to this data:
variable "region" {}
variable "profile" {}
variable "path" {}
provider "aws" {
region = var.region
profile = var.profile
}
data "aws_ssm_parameter" "foo" {
name = var.path
}
output "result" {
# For example we'll just return the entire thing, but in
# practice it might be better to pre-process the output
# into a well-defined shape for consumption by the calling
# modules, so that they can rely on a particular structure.
value = jsondecode(data.aws_ssm_parameter.foo)
}
I don't think the above is exactly equivalent to the original question since AFAIK aws_ssm_parameter does not do a recursive fetch at the time of writing, but I'm not totally sure. My main purpose here was to show the idea of using a nested module with its own provider configuration as an alternative way to fetch data from a specific account/region.
A more direct response to the original question is that provisioners are designed as one-shot actions and so it's not possible to access any data they might return. The external data source is one way to run an external program to gather some data, if a suitable data source is not already available.
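For completeness, a calling configuration might wire in the data-only module sketched above roughly like this (the module source path and the values passed in are hypothetical):

module "windows_vars" {
  source  = "./modules/windows-vars"   # hypothetical location of the module above
  region  = "eu-west-1"
  profile = "my-profile"
  path    = "/vars/windows/KEY_1"
}

output "windows_config" {
  value = module.windows_vars.result
}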

Kafka connect string to json in Postgresql

I have a topic with strings containing JSON. For example, a message could be:
'{"id":"foo", "datetime":1}'
In this topic everything is considered a string.
I would like to send the messages to a PostgreSQL table with Kafka Connect. My goal is to let PostgreSQL understand that the messages are JSON; indeed, PostgreSQL handles JSON pretty well.
How do I tell Kafka Connect or PostgreSQL that the messages are in fact JSON?
Thanks
EDIT:
For now, I use ./bin/connect-standalone config/connect-standalone.properties config/sink-sql-rules.properties.
With:
connect-standalone.properties
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=10000
rest.port=8084
plugin.path=share/java
sink-sql-rules.properties
name=mysink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
# The topics to consume from - required for sink connectors like this one
topics=mytopic
# Configuration specific to the JDBC sink connector.
connection.url=***
connection.user=***
connection.password=***
mode=timestamp+incrementing
auto.create=true
auto.evolve=true
table.name.format=mytable
batch.size=500
EDIT2:
With this configuration I get this error:
org.apache.kafka.connect.errors.ConnectException: No fields found using key and value schemas for table
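For context on that last error: with value.converter.schemas.enable=false the JDBC sink receives schemaless JSON and has no field/type information to map to table columns. One way to provide it (assuming the producer can be changed, which may not be the case here) is to publish messages in the standard Kafka Connect JSON-with-schema envelope and set schemas.enable=true. For the example message above, that envelope would look roughly like this:

{
  "schema": {
    "type": "struct",
    "name": "message",
    "optional": false,
    "fields": [
      { "field": "id",       "type": "string", "optional": false },
      { "field": "datetime", "type": "int64",  "optional": true  }
    ]
  },
  "payload": { "id": "foo", "datetime": 1 }
}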

Logstash: Handling of large messages

I'm trying to parse a large message with Logstash using a file input, a json filter, and an elasticsearch output. 99% of the time this works fine, but when one of my log messages is too large, I get JSON parse errors, as the initial message is broken up into two partial invalid JSON streams. The size of such messages is about 40,000+ characters long. I've looked to see if there is any information on the size of the buffer, or some max length that I should try to stay under, but haven't had any luck. The only answers I found related to the udp input, and being able to change the buffer size.
Does Logstash has a limit size for each event-message?
https://github.com/elastic/logstash/issues/1505
This could also be similar to this question, but there were never any replies or suggestions: Logstash Json filter behaving unexpectedly for large nested JSONs
As a workaround, I wanted to split my message up into multiple messages, but I'm unable to do this, as I need all the information to be in the same record in Elasticsearch. I don't believe there is a way to call the Update API from logstash. Additionally, most of the data is in an array, so while I can update an Elasticsearch record's array using a script (Elasticsearch upserting and appending to array), I can't do that from Logstash.
The data records look something like this:
{ "variable1":"value1",
......,
"variable30": "value30",
"attachements": [ {5500 charcters of JSON},
{5500 charcters of JSON},
{5500 charcters of JSON}..
...
{8th dictionary of JSON}]
}
Does anyone know of a way to have Logstash process these large JSON messages, or a way that I can split them up and have them end up in the same Elasticsearch record (using Logstash)?
Any help is appreciated, and I'm happy to add any information needed!
If your elasticsearch output has document_id set, it will update the document (the default action in Logstash is to index the data, which will replace the document if it already exists).
In your case, you'd need to include some unique field as part of your json messages and then rely on that to do the merge in elasticsearch. For example:
{"key":"123455","attachment1":"something big"}
{"key":"123455","attachment2":"something big"}
{"key":"123455","attachment3":"something big"}
And then have an elasticsearch output like:
elasticsearch {
  hosts       => ["localhost"]
  document_id => "%{key}"
}
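Note that the index action replaces the whole document, so later events with the same key would overwrite earlier ones rather than merge into them. If the attachments arrive as separate events, a partial update with upsert is closer to a merge; a sketch (the hosts value is illustrative):

elasticsearch {
  hosts         => ["localhost"]
  document_id   => "%{key}"
  action        => "update"    # send the event as a partial document update
  doc_as_upsert => true        # create the document on the first event for this key
}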

JSON - MongoDB versioning

I am trying to use JSON for application configuration. I need some of the objects in the JSON to be dynamically created (e.g., looked up from a SQL database). I also need to store the version history of the JSON file, since I want to be able to switch back and forth between an old configuration and a new configuration version.
My initial thought was to put the JSON in MongoDB and use placeholders for the dynamic parts of the JSON object. Can someone give guidance on whether my thinking here is correct? (I am thinking of using JSON.NET to serialize/deserialize the JSON object.) Thanks in advance.
Edit:
Example: let's assume we have two environments, env1 (v1.0.0.0) and env2 (v1.0.0.1).
**Env1**
{
  id: env1 specific process id
  processname: process_name_specific_to_env1
  host: env1_specific_host_ip
  ...
  threads: 10 (this is common across environments for the V1.0.0.0 release)
}

**Env2**
{
  id: env2 specific process id
  processname: process_name_specific_to_env2
  host: env2_specific_host_ip
  ...
  threads: 10 (this is common across environments for the V1.0.0.1 release)
  queue_size: 15 (this is common across environments for the V1.0.0.1 release)
}
What I want to store is a common JSON file per version. The idea is that if I want to upgrade an environment, say env1, from 1.0.0.0 to 1.0.0.1, I should be able to take the v1.0.0.1 JSON config, fill in the environment-specific data from SQL, and generate a new JSON file. This way, when moving environments from one release to another, I do not have to redo the configuration.
Example 1.0.0.0 JSON file:
{
  id: will be dynamically filled in from SQL
  processname: will be dynamically filled in from SQL
  host: will be dynamically filled in from SQL
  ...
  threads: 10 (this is common across environments for the V1.0.0.0 release)
}
=> generate a new file for any environment when requested.
I hope I am being clear about what I am trying to achieve.
As you said, you need some way to include the SQL part dynamically, and that means manual joins in your application. Simple IDs referring to the other table should be enough; you don't need to invent a placeholder mechanism.
Choose which one is better for you:
MongoDB to SQL reference
MongoDB
{
  "configParamA": "123", // ID of SQL row
  "configParamB": "456", // another SQL ID
  "configVersion": "2014-11-09"
}
SQL to MongoDB reference
MongoDB
{
  "configVersion": "2014-11-09"
}
SQL
Just add a column with the configuration id, which is used in MongoDB, to every associated configuration row.
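A minimal sketch of what that could look like on the SQL side (table and column names are made up for illustration):

ALTER TABLE process_config ADD COLUMN config_version VARCHAR(32);

-- Fetch the SQL-backed rows belonging to the MongoDB document
-- whose "configVersion" is "2014-11-09":
SELECT * FROM process_config WHERE config_version = '2014-11-09';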