Filtering with regex vs. JSON

When filtering logs, Logstash may use grok to parse the received log file (let's say it is Nginx logs). Parsing with grok requires you to properly set the field type - e.g., %{HTTPDATE:timestamp}.
However, if Nginx starts logging in JSON format then Logstash does very little processing. It simply creates the index and outputs to Elasticsearch. This leads me to believe that only Elasticsearch benefits from the "way" it receives the index.
Is there any advantage for Elasticsearch in having index data that was processed with regex vs. JSON? E.g., does it impact query time?

For Elasticsearch it doesn't matter how you parse the messages; it has no knowledge of that step. You only need to send a JSON document with the fields that you want to store and search on, according to your index mapping.
However, how you parse the message does matter for Logstash, since it directly impacts performance.
For example, consider the following message:
2020-04-17 08:10:50,123 [26] INFO ApplicationName - LogMessage From The Application
If you want to be able to search and apply filters on each part of this message, you will need to parse it into fields.
timestamp: 2020-04-17 08:10:50,123
thread: 26
loglevel: INFO
application: ApplicationName
logmessage: LogMessage From The Application
To parse this message you can use different filters. One of them is grok, which uses regex, but if your message always has the same format you can use another filter, such as dissect. Both achieve the same result, but while grok uses regex to match the fields, dissect is purely positional; this makes a huge difference in CPU usage when you have a high number of events per second (see the sketch below).
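A minimal sketch of the two options for the example line above; the patterns are illustrative, not production-ready:
# Option 1: grok, regex-based matching
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{NUMBER:thread}\] %{LOGLEVEL:loglevel} %{WORD:application} - %{GREEDYDATA:logmessage}" }
  }
}
# Option 2: dissect, purely positional and much cheaper on CPU
filter {
  dissect {
    mapping => { "message" => "%{timestamp} [%{thread}] %{loglevel} %{application} - %{logmessage}" }
  }
}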
Consider now that you have the same message, but in a JSON format.
{ "timestamp":"2020-04-17 08:10:50,123", "thread":26, "loglevel":"INFO", "application":"ApplicationName","logmessage":"LogMessage From The Application" }
It is easier and faster for Logstash to parse this message: you can do it in your input using the json codec, or you can use the json filter in your filter block (both sketched below).
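A minimal sketch of both options; the file input and its path are just placeholders for whatever input you actually use:
# Option 1: decode the JSON as it arrives, with the json codec on the input
input {
  file {
    # placeholder path
    path => "/var/log/app/app.json"
    codec => json
  }
}
# Option 2: decode it later, with the json filter in the filter block
filter {
  json {
    source => "message"
  }
}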
If you have control over how your log messages are created, choose a format that does not require grok.

Related

How do I best construct complex NiFi routing

I'm a total noob when it comes to NiFi - so please feel free to highlight any stupidity/ignorance.
I'm reading messages from a Kafka topic using NiFi.
Each message contains JSON that contains a field called Function and then a whole bunch of different fields, based on the Function. For example, if Function ="Login", you can expect a username and password field, but if Function = "Pay", you can expect "From", "To" and "Amount" fields.
I need to process each type of Function differently. So, basically, I want to read the message from Kafka, determine the function, and then route the message, based on the function, to the appropriate set of rules.
It sounds like this should be simple - but for one small complication. I have about 500 different types of Functions. So, I don't want to add a RouteOnAttribute node for each function.
Is there a better way to do this? If this were "real code", I suppose I'm looking for the difference between a chain of "if" statements and some sort of "switch/case" statement...
You would first use EvaluateJsonPath to extract the function into a flow file attribute, then use RouteOnAttribute, which would need 500 conditions added to it, and then connect each of those 500 conditions to whatever follow-on processing is required (a rough configuration sketch is below). The only other thing you could do is implement a custom processor that handles the 500 conditions internally.
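A rough sketch of that configuration, assuming the attribute is named function; the two routes shown stand in for the ~500 dynamic properties you would add:
EvaluateJsonPath
  Destination: flowfile-attribute
  function: $.Function            (dynamic property: attribute name -> JsonPath)
RouteOnAttribute
  Routing Strategy: Route to Property name
  Login: ${function:equals('Login')}
  Pay: ${function:equals('Pay')}
Each named route becomes its own relationship, which you then connect to the follow-on processing for that function.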

How to process values in CSV format in streaming queries over Kafka source?

I'm new to Structured Streaming, and I'd like to know whether there is a way to specify the Kafka value's schema like we do in normal structured streaming jobs. The value format in Kafka is a syslog-like CSV with 50+ fields, and splitting it manually is painfully slow.
Here's the brief part of my code (see full gist here)
spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "myserver:9092")
.option("subscribe", "mytopic")
.load()
.select(split('value, """\^""") as "raw")
.select(ColumnExplode('raw, schema.size): _*) // flatten WrappedArray
.toDF(schema.fieldNames: _*) // apply column names
.select(fieldsWithTypeFix: _*) // cast column types from string
.select(schema.fieldNames.map(col): _*) // re-order columns, as defined in schema
.writeStream.format("console").start()
With no further operations, I can only achieve roughly 10 MB/s throughput on a 24-core, 128 GB memory server. Would it help if I converted the syslog to JSON beforehand? In that case I could use from_json with a schema, and maybe it would be faster.
is there a way to specify Kafka value's schema like what we do in normal structured streaming jobs.
No. The so-called output schema of the Kafka external data source is fixed and cannot be changed. See this line.
Would it help if I convert the syslog to JSON in prior? In that case I can use from_json with schema, and maybe it will be faster.
I don't think so. I'd even say that CSV is a simpler text format than JSON (as there is usually just a single separator).
Using the split standard function is the way to go, and I think you can hardly get better performance, since all it does is split a row and take every element to build the final output (see the sketch below).
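A minimal sketch of that approach, assuming an existing SparkSession named spark (as in the question) and a hypothetical 3-field schema standing in for the real 50+ field one; the custom ColumnExplode helper from the question is replaced with plain positional access on the split array:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// illustrative schema; the real one has 50+ fields
val schema = StructType(Seq(
  StructField("ts", StringType),
  StructField("host", StringType),
  StructField("bytes", LongType)
))

val parsed = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "myserver:9092")
  .option("subscribe", "mytopic")
  .load()
  // Kafka's value column is binary, so cast to string before splitting on '^'
  .select(split(col("value").cast("string"), "\\^").as("raw"))
  // take each positional element, cast it to the type declared in the schema,
  // and name it after the corresponding schema field
  .select(schema.fields.zipWithIndex.map { case (f, i) =>
    col("raw").getItem(i).cast(f.dataType).as(f.name)
  }: _*)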

JMeter - What is the best extractor to use on a JSON message?

Currently testing a system where the output is in the form of formatted JSON.
As part of my tests I need to extract and validate two values from the json record.
The values both have individual identifiers on them but don't appear in the same part of the record, so I can't just grab a single long string.
Loose format of the information in both cases:
"identifier1": [{"identifier2":"idname","values":["bit_I_want!]}]
In the case of the bit I want, this can either be a single quoted value (e.g. "12345") or multiple quoted values (e.g. "12345","23456","98765").
In both cases I'm only interested in validating the whole string of values, not individual values from the set.
Can anyone recommend which of the various extractors in Jmeter would be best to achieve this?
Many Thanks!
The most obvious choice seems to be the JSON Path Assertion (available via JMeter Plugins); it allows not only executing arbitrary JSONPath queries but also conditionally failing the sampler based on whether the actual and expected results match.
The recommended way of installing JMeter Plugins and keeping them up to date is the JMeter Plugins Manager.
JMeter 3.1 comes with a JSON Extractor to parse the JSON response. You could use the expression $.identifier1[0].values as the JSON Path to extract the values.
If your JSON response is always going to be as simple as shown in your question, you could use the Regular Expression Extractor as well; its advantage is that it is faster than the JSON Extractor. The regular expression would be "values":\[(.*?)\] - for the multi-value example above, its first capture group would be "12345","23456","98765", which you can then validate as a whole.
Reference: http://www.testautomationguru.com/jmeter-response-data-extractors-comparison/

Logstash: Handling of large messages

I'm trying to parse a large message with Logstash using a file input, a json filter, and an elasticsearch output. 99% of the time this works fine, but when one of my log messages is too large, I get JSON parse errors, as the initial message is broken up into two partial invalid JSON streams. The size of such messages is about 40,000+ characters long. I've looked to see if there is any information on the size of the buffer, or some max length that I should try to stay under, but haven't had any luck. The only answers I found related to the udp input, and being able to change the buffer size.
Does Logstash have a size limit for each event message?
https://github.com/elastic/logstash/issues/1505
This could also be similar to this question, but there were never any replies or suggestions: Logstash Json filter behaving unexpectedly for large nested JSONs
As a workaround, I wanted to split my message up into multiple messages, but I'm unable to do this, as I need all the information to be in the same record in Elasticsearch. I don't believe there is a way to call the Update API from logstash. Additionally, most of the data is in an array, so while I can update an Elasticsearch record's array using a script (Elasticsearch upserting and appending to array), I can't do that from Logstash.
The data records look something like this:
{ "variable1":"value1",
......,
"variable30": "value30",
"attachements": [ {5500 charcters of JSON},
{5500 charcters of JSON},
{5500 charcters of JSON}..
...
{8th dictionary of JSON}]
}
Does anyone know of a way to have Logstash process these large JSON messages, or a way that I can split them up and have them end up in the same Elasticsearch record (using Logstash)?
Any help is appreciated, and I'm happy to add any information needed!
If your elasticsearch output has a document_id set, all events that share the same id end up in the same Elasticsearch document. Note that the default action in Logstash is index, which overwrites an existing document, so to merge fields from several events into one document you also want action => "update" together with doc_as_upsert => true.
In your case, you'd need to include some unique field as part of your json messages and then rely on that to do the merge in elasticsearch. For example:
{"key":"123455","attachment1":"something big"}
{"key":"123455","attachment2":"something big"}
{"key":"123455","attachment3":"something big"}
And then have an elasticsearch output like:
elasticsearch {
  hosts => ["localhost:9200"]
  document_id => "%{key}"
  action => "update"
  doc_as_upsert => true
}
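With that output, the three events above would be merged into a single document with id 123455, roughly:
{ "key":"123455", "attachment1":"something big", "attachment2":"something big", "attachment3":"something big" }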

JMeter / AMQ - Substitute substring when reading strings from JSON file

I've been bashing against a brick wall on this ever since Monday, when the customer told me that we needed to simulate up to 50,000 pseudo-concurrent entities for the purposes of performance testing. This is the setup. I have text files full of JSON objects containing JSON data that looks a bit like this:
{"customerId"=>"900", "assetId"=>"NN_18_144", "employee"=>"", "visible"=>false,
"GenerationDate"=>"2012-09-21T09:41:39Z", "index"=>52, "Category"=>2...}
It's one object per line. I'm using JMeter's JMS publisher to read the lines sequentially:
${__StringFromFile(${PATH_TO_DATA_FILES}scenario_9.json)}
There are several such files, each of which contains a different scenario.
What I need to do is read the files in and substitute assetId's value with a randomly selected value from a list of 50,000 non-sequential, pre-generated strings (I can't possibly have a separate file for each assetId, as that would mean littering the load injector with 50,000 files and configuring a thread group within JMeter for each). Programmatically, it's a trivial matter to perform the substitution, but it's not so simple to do it in JMeter on the fly.
Normally, I'd treat this as the interesting technical challenge that it is and spend a few days working it out, but I only have the weekend, which I suspect I'll spend sleeping overnight in the office anyway.
Can anyone help me with this, please?
Thanks.
For reading your assets, use a CSV Data Set Config; I suppose assetId will be the variable name.
Modify your expression:
${__StringFromFile(${PATH_TO_DATA_FILES}scenario_9.json, lineToSubstitute)}
To do the substitution, add a Beanshell Sampler or JSR223 Sampler (using Groovy) and code the substitution:
String assetId = vars.get("assetId");
String lineToSubstitute = vars.get("lineToSubstitute");
String lineSubstituted = ....;
vars.put("lineSubstituted", lineSubstituted);
If your JSON body is always the same or only changes a little, you should:
Use an HTTP Sampler with RAW POST Body
Put the JSON body in it with variables for the asset ids (see the example after this list)
Put asset ids in CSV Data Set config
Avoid using ${__StringFromFile} as it has a cost.
If you need scripting, use a JSR223 PostProcessor with the script in an external file + caching (available since 2.8) so that the script is compiled.
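For example, mirroring the sample data from the question, the RAW body could look something like this, with ${assetId} filled in from the CSV Data Set Config on each iteration:
{"customerId"=>"900", "assetId"=>"${assetId}", "employee"=>"", "visible"=>false, "GenerationDate"=>"2012-09-21T09:41:39Z", "index"=>52, "Category"=>2}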