Logstash won't parse JSON

I want to get data into Elasticsearch using Logstash. So far this has worked great, but when I try to parse JSON files, Logstash just won't do anything. I can start Logstash without any exception, but it won't parse anything.
Is there something wrong with my config? The path to the JSON file is correct.
my JSON:
{
"stats": [
{
"liveStatistic": {
"@scope": "21",
"@scopeType": "foo",
"@name": "minTime",
"@interval": "60",
"lastChange": "2011-01-11T15:19:53.259+02:00",
"start": "2011-01-18T14:19:48.333+02:00",
"unit": "s",
"value": 10
}
},
{
"liveStatistic": {
"@scope": "26",
"@scopeType": "bar",
"@name": "newCount",
"@interval": "60",
"lastChange": "2014-01-11T15:19:59.894+02:00",
"start": "2014-01-12T14:19:48.333+02:00",
"unit": 1,
"value": 5
}
},
...
]
}
my Logstash agent config:
input {
file {
path => "/home/me/logstash-1.4.2/values/stats.json"
codec => "json"
start_position => "beginning"
}
}
output {
elasticsearch {
host => localhost
protocol =>"http"
}
stdout {
codec => rubydebug
}
}

You should add the following line to your input:
start_position => "beginning"
Also, put the complete document on one line, and maybe add {} around it so that it is a valid JSON document.
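For example, the sample above collapsed onto a single line (trimmed here to the first array entry) would look like this:
{"stats":[{"liveStatistic":{"@scope":"21","@scopeType":"foo","@name":"minTime","@interval":"60","lastChange":"2011-01-11T15:19:53.259+02:00","start":"2011-01-18T14:19:48.333+02:00","unit":"s","value":10}}]}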

Okay, two things:
First, the file input is by default set to start reading at the end of the file. If you want the file to start reading at the beginning, you will need to set start_position. Example:
file {
path => "/mypath/myfile"
codec => "json"
start_position => "beginning"
}
Second, keep in mind that Logstash keeps a sincedb file which records how far into each file it has already read (so as to not parse the same data repeatedly!). This is usually a desirable feature, but for testing against a static file (which is what it looks like you're trying to do) you will want to work around this. There are two ways I know of.
One way is to make a new copy of the file every time you want to run Logstash, and remember to point Logstash at that copy.
The other way is to delete the sincedb file, wherever it is located. You can tell Logstash where to write the sincedb file with the sincedb_path setting.
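For testing, a common trick (my addition, not part of the original question) is to point sincedb_path at /dev/null so that no read position is ever remembered:

file {
  path => "/home/me/logstash-1.4.2/values/stats.json"
  codec => "json"
  start_position => "beginning"
  sincedb_path => "/dev/null"  # discard read positions so the file is re-read on every run (testing only)
}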
I hope this all helped!

Related

How to filter only JSON from TEXT and JSON mixed format in logstash

We have input coming from one of the applications in TEXT + JSON format like the below:
<12>1 2022-10-18T10:48:40.163Z 7VLX5D8 ERAServer 14016 - - {"event_type":"FilteredWebsites_Event","ipv4":"192.168.0.1","hostname":"9krkvs1","source_uuid":"11160173-r3bc-46cd-9f4e-99f66fc0a4eb","occured":"18-Oct-2022 10:48:37","severity":"Warning","event":"An attempt to connect to URL","target_address":"172.66.43.217","target_address_type":"IPv4","scanner_id":"HTTP filter","action_taken":"Blocked","handled":true,"object_uri":"https://free4pc.org","hash":"0E9ACB02118FBF52B28C3570D47D82AFB82EB58C","username":"CKFCVS1\\some.name","processname":"C:\\Users\\some.name\\AppData\\Local\\Programs\\Opera\\opera.exe","rule_id":"Blocked by internal blacklist"}
That is, <12>1 2022-10-18T10:48:40.163Z 7VLX5D8 ERAServer 14016 - - is the TEXT part and the rest is JSON.
The TEXT part is always similar; only the date and time differ, so it would be fine even if we deleted the whole TEXT part.
The JSON part is random, but it contains the useful information.
Currently, on Kibana, the logs appear in the message field, but the separate fields do not appear because of the improper JSON.
In fact, when we tried pushing ONLY the required JSON part by putting it into the file manually, it gave us the required output in Kibana.
So our question is how to achieve this through logstash filters/grok.
Update:
@Val - we already have the configuration below:
input {
syslog {
port => 5044
codec => json
}
}
But the output in Kibana appears as a single unparsed message field (screenshot omitted),
whereas we want the individual JSON fields extracted (screenshot omitted).
Even though syslog seems like an appealing way of shipping data, it is a big mess in terms of standardization and everyone ships data in a different way. The Logstash syslog input only supports RFC3164, and your log format doesn't match that standard.
You can still bypass the normal RFC3164 parsing by providing your own grok pattern, as shown below:
input {
syslog {
port => 5044
grok_pattern => "<%{POSINT:priority_key}>%{POSINT:version} %{TIMESTAMP_ISO8601:timestamp} %{HOSTNAME:[observer][hostname]} %{WORD:[observer][name]} %{WORD:[process][id]} - - %{GREEDYDATA:[event][original]}"
}
}
filter {
json {
source => "[event][original]"
}
}
output {
stdout { codec => json }
}
Running Logstash with the above config, your sample log line gets parsed as this:
{
"@timestamp": "2022-10-18T10:48:40.163Z",
"@version": "1",
"action_taken": "Blocked",
"event": "An attempt to connect to URL",
"event_type": "FilteredWebsites_Event",
"facility": 0,
"facility_label": "kernel",
"handled": true,
"hash": "0E9ACB02118FBF52B28C3570D47D82AFB82EB58C",
"host": "0:0:0:0:0:0:0:1",
"hostname": "9krkvs1",
"ipv4": "192.168.0.1",
"message": "<12>1 2022-10-18T10:48:40.163Z 7VLX5D8 ERAServer 14016 - - {\"event_type\":\"FilteredWebsites_Event\",\"ipv4\":\"192.168.0.1\",\"hostname\":\"9krkvs1\",\"source_uuid\":\"11160173-r3bc-46cd-9f4e-99f66fc0a4eb\",\"occured\":\"18-Oct-2022 10:48:37\",\"severity\":\"Warning\",\"event\":\"An attempt to connect to URL\",\"target_address\":\"172.66.43.217\",\"target_address_type\":\"IPv4\",\"scanner_id\":\"HTTP filter\",\"action_taken\":\"Blocked\",\"handled\":true,\"object_uri\":\"https://free4pc.org\",\"hash\":\"0E9ACB02118FBF52B28C3570D47D82AFB82EB58C\",\"username\":\"CKFCVS1\\\\some.name\",\"processname\":\"C:\\\\Users\\\\some.name\\\\AppData\\\\Local\\\\Programs\\\\Opera\\\\opera.exe\",\"rule_id\":\"Blocked by internal blacklist\"}\n",
"object_uri": "https://free4pc.org",
"observer": {
"hostname": "7VLX5D8",
"name": "ERAServer"
},
"occured": "18-Oct-2022 10:48:37",
"priority": 0,
"priority_key": "12",
"process": {
"id": "14016"
},
"processname": "C:\\Users\\some.name\\AppData\\Local\\Programs\\Opera\\opera.exe",
"rule_id": "Blocked by internal blacklist",
"scanner_id": "HTTP filter",
"severity": "Warning",
"severity_label": "Emergency",
"source_uuid": "11160173-r3bc-46cd-9f4e-99f66fc0a4eb",
"target_address": "172.66.43.217",
"target_address_type": "IPv4",
"timestamp": "2022-10-18T10:48:40.163Z",
"username": "CKFCVS1\\some.name",
"version": "1"
}

What logstash filter plugin to use for Elasticsearch?

I'm having trouble using Logstash to bring the following raw data into Elasticsearch. I've abstracted the raw data below; I was hoping the JSON plugin would work, but it currently does not. I've viewed other posts regarding JSON to no avail.
{
"offset": "stuff",
"results": [
{
"key": "value",
"key1": null,
"key2": null,
"key3": "true",
"key4": "value4",
"key4": [],
"key5": value5,
"key6": "value6",
"key7": "value7",
"key8": value8,
"key9": "value9",
"key10": null,
"key11": null,
"key12": "value12",
"key13": "value13",
"key14": [],
"key15": "key15",
"key16": "value16",
"key17": "value17",
"key18": "value18",
"key19": "value19"
},
{
"key20": "value20",
"key21": null,
"key22": null,
"key23": "value23",
"key24": "value24",
<etc.>
My current conf file:
input {
file {
codec => multiline
{
pattern => '^\{'
negate => true
what => previous
}
#type => "json"
path => <my path>
sincedb_path => "/dev/null"
start_position => "beginning"
}
}
#filter
#{
# json {
# source => message
# remove_field => message
# }
#}
filter
{
mutate
{
replace => [ "message", "%{message}}" ]
gsub => [ 'message','\n','']
}
if [message] =~ /^{.*}$/
{
json { source => message }
}
}
output {
#stdout { codec => rubydebug }
stdout { codec => json }
}
I get a long error that I can't read since it's full of
" \"key10\": null,\r \"key11\": \"value11\",\r
etc.
Does anyone know what I'm doing wrong or how to better see my error? This is valid json but maybe I'm using my regex for multiline codec wrong.
Can you use a different input plugin than file? Parsing a JSON file as multiline events can be problematic; if possible, use an input plugin with a JSON codec.
In the file input, you can set a real sincedb_path where Logstash can write its state.
In the line where you replace message, you have one curly bracket } too many:
replace => [ "message", "%{message}}" ]
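If that extra closing brace is not intentional, the corrected line would presumably just be:
replace => [ "message", "%{message}" ]
(at which point the replace is effectively a no-op and could be dropped, leaving only the gsub).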
I would write the output to Elasticsearch instead of stdout. Of course you don't have to for testing, but when you write the output to Elasticsearch you can see the index being created and use Kibana to check whether the content is to your liking.
output {
elasticsearch {
hosts => "localhost"
index => "stuff-%{+xxxx.ww}"
}
}
I use these curl commands to read from Elasticsearch:
curl -s -XGET 'http://localhost:9200/_cat/indices?v&pretty'
and
curl -s -XGET 'http://localhost:9200/stuff*/_search?pretty=true'

Logstash is not converting JSON correctly

Following is my json log file
[
{
"error_message": " Failed to get line from input file (end of file?).",
"type": "ERROR",
"line_no": "2625",
"file": "GTFplainText.c",
"time": "17:40:02",
"date": "01/07/16",
"error_code": "GTF-00014"
},
{
"error_message": " Bad GTF plain text file header or footer line. ",
"type": "ERROR",
"line_no": "2669",
"file": "GTFplainText.c",
"time": "17:40:02",
"date": "01/07/16",
"error_code": "GTF-00004"
},
{
"error_message": " '???' ",
"type": "ERROR",
"line_no": "2670",
"file": "GTFplainText.c",
"time": "17:40:02",
"date": "01/07/16",
"error_code": "GTF-00005"
},
{
"error_message": " Failed to find 'event source'/'product detail' records for event source '3025188506' host event type 1 valid",
"type": "ERROR",
"line_no": "0671",
"file": "RGUIDE.cc",
"time": "15:43:48",
"date": "06/07/16",
"error_code": "RGUIDE-00033"
}
]
According to my understanding, as the log is already in JSON, we do not need a filter section in the Logstash configuration. Following is my Logstash config:
input {
file{
path => "/home/ishan/sf_shared/log_json.json"
start_position => "beginning"
codec => "json"
}
}
and the output configuration is
output {
elasticsearch {
hosts => ["localhost:9200"]
sniffing => true
manage_template => false
index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
document_type => "%{[@metadata][type]}"
}
stdout { codec => rubydebug }
}
But it seems like the data is not going into ES, as I am not able to see the data when I query the index. What am I missing?
I think the problem is that the json codec expects a full json message on one line and won't work with a message on multiple lines.
A possible workaround would be to use the multiline codec and the json filter.
The configuration for the multiline codec would be:
multiline {
pattern => "]"
negate => "true"
what => "next"
}
All the lines that do not contain ] will be regrouped with the next line, so you'll end up with one full JSON document to give to the json filter.
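Putting the answer together, a minimal sketch of the whole pipeline (using the file path from the question; sincedb_path => "/dev/null" is my addition so the static file is re-read on every test run):

input {
  file {
    path => "/home/ishan/sf_shared/log_json.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"  # testing only: re-read the whole file on each run
    codec => multiline {
      pattern => "]"
      negate => "true"
      what => "next"
    }
  }
}
filter {
  json {
    source => "message"
    # if the top-level JSON is an array, the filter may also need a target => "..." to hold it
  }
}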

How should I filter multiple fields with the same name in logstash?

I'm putting tsung logs into ElasticSearch (ES) so that I can filter, visualize and compare results using Kibana.
I'm using logstash and its JSON parsing filter to push tsung logs in JSON format to ES.
Tsung logs are a bit complicated (IMO), with arrays nested inside arrays, multi-line events, and several fields having the same name, such as "value" in my example hereafter.
I would like to transform this event:
{
"stats":[
{"timestamp": 1317413861, "samples": [
{"name": "users", "value": 0, "max": 1},
{"name": "users_count", "value": 1, "total": 1},
{"name": "finish_users_count", "value": 1, "total": 1}]}]}
into this:
{"timestamp": 1317413861},{"users_value":0},{"users_max":1},{"users_count_value":1},{"users_count_total":1},{"finish_users_count_value":1},{"finish_users_count_total":1}
Since the entire tsung log file is forwarded to Logstash at the end of a performance test campaign, I'm thinking about using a regex to remove carriage returns and the unneeded stats and samples wrappers before sending the events to Logstash, in order to simplify things a little bit.
And then, I would use those kind of JSON filter options:
add_field => {"%{name}_value" => "%{value}"}
add_field => {"%{name}_max" => "%{max}"}
add_field => {"%{name}_total" => "%{total}"}
But how should I handle the fact that there are many value fields in one event for instance? What is the best thing to do?
Thanks for your help.
Feels like the ruby{} filter would be needed here. Loop across the entries in the 'samples' field, and construct your own fields based on the name/value/total/max.
There are examples of this type of behavior elsewhere on SO.
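A rough, untested sketch of that idea, assuming a Logstash version with the event.get/event.set API, the field names from the sample above, and that the raw JSON arrives in the message field (drop the json filter if your input already uses a json codec):

filter {
  json {
    source => "message"
  }
  ruby {
    code => '
      # flatten each sample into <name>_value / <name>_max / <name>_total fields
      (event.get("stats") || []).each do |stat|
        event.set("timestamp", stat["timestamp"]) if stat["timestamp"]
        (stat["samples"] || []).each do |sample|
          name = sample["name"]
          ["value", "max", "total"].each do |key|
            event.set("#{name}_#{key}", sample[key]) unless sample[key].nil?
          end
        end
      end
      event.remove("stats")
    '
  }
}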

Importing and updating data in Elasticsearch

We have an existing search function that involves data across multiple tables in SQL Server. This causes a heavy load on our DB, so I'm trying to find a better way to search through this data (it doesn't change very often). I have been working with Logstash and Elasticsearch for about a week using an import containing 1.2 million records. My question is essentially, "how do I update existing documents using my 'primary key'"?
CSV data file (pipe delimited) looks like this:
369|90045|123 ABC ST|LOS ANGELES|CA
368|90045|PVKA0010|LA|CA
367|90012|20000 Venice Boulvd|Los Angeles|CA
365|90045|ABC ST 123|LOS ANGELES|CA
363|90045|ADHOCTESTPROPERTY|DALES|CA
My logstash config looks like this:
input {
stdin {
type => "stdin-type"
}
file {
path => ["C:/Data/sample/*"]
start_position => "beginning"
}
}
filter {
csv {
columns => ["property_id","postal_code","address_1","city","state_code"]
separator => "|"
}
}
output {
elasticsearch {
embedded => true
index => "samples4"
index_type => "sample"
}
}
A document in elasticsearch, then looks like this:
{
"_index": "samples4",
"_type": "sample",
"_id": "64Dc0_1eQ3uSln_k-4X26A",
"_score": 1.4054651,
"_source": {
"message": [
"369|90045|123 ABC ST|LOS ANGELES|CA\r"
],
"@version": "1",
"@timestamp": "2014-02-11T22:58:38.365Z",
"host": "[host]",
"path": "C:/Data/sample/sample.csv",
"property_id": "369",
"postal_code": "90045",
"address_1": "123 ABC ST",
"city": "LOS ANGELES",
"state_code": "CA"
}
}
I think I would like the unique ID in the _id field to be replaced with the value of property_id. The idea is that subsequent data files would contain updates. I don't need to keep previous versions, and there wouldn't be a case where we add or remove keys from a document.
The document_id setting for the elasticsearch output doesn't put that field's value into _id (it just put the literal string "property_id" in and only stored/updated one document). I know I'm missing something here. Am I just taking the wrong approach?
EDIT: WORKING!
Using #rutter's suggestion, I've updated the output config to this:
output {
elasticsearch {
embedded => true
index => "samples6"
index_type => "sample"
document_id => "%{property_id}"
}
}
Now documents are updated as expected when I drop new files into the data folder. _id and property_id hold the same value.
{
"_index": "samples6",
"_type": "sample",
"_id": "351",
"_score": 1,
"_source": {
"message": [
"351|90045|Easy as 123 ST|LOS ANGELES|CA\r"
],
"@version": "1",
"@timestamp": "2014-02-12T16:12:52.102Z",
"host": "TXDFWL3474",
"path": "C:/Data/sample/sample_update_3.csv",
"property_id": "351",
"postal_code": "90045",
"address_1": "Easy as 123 ST",
"city": "LOS ANGELES",
"state_code": "CA"
}
}
Converting from comment:
You can overwrite a document by sending another document with the same ID... but that might be tricky with your previous data, since you'll get randomized IDs by default.
You can set an ID using the output plugin's document_id field, but it takes a literal string, not a field name. To use a field's contents, you could use an sprintf format string, such as %{property_id}.
Something like this, for example:
output {
elasticsearch {
... other settings...
document_id => "%{property_id}"
}
}
Disclaimer: I'm the author of ESL.
You can use elasticsearch_loader to load psv files into elasticsearch.
In order to set the _id field you can use --id-field=property_id.
For instance:
elasticsearch_loader --index=myindex --type=mytype --id-field=property_id csv --delimiter='|' filename.csv
Have you tried changing the config to this:
filter {
csv {
columns => ["_id","postal_code","address_1","city","state_code"]
separator => "|"
}
}
By naming the property_id column _id, it should get used as the document ID during indexing.