Logstash: Handling of large messages - json

I'm trying to parse a large message with Logstash using a file input, a json filter, and an elasticsearch output. 99% of the time this works fine, but when one of my log messages is too large, I get JSON parse errors, as the initial message is broken up into two partial invalid JSON streams. The size of such messages is about 40,000+ characters long. I've looked to see if there is any information on the size of the buffer, or some max length that I should try to stay under, but haven't had any luck. The only answers I found related to the udp input, and being able to change the buffer size.
Does Logstash has a limit size for each event-message?
https://github.com/elastic/logstash/issues/1505
This could also be similar to this question, but there were never any replies or suggestions: Logstash Json filter behaving unexpectedly for large nested JSONs
As a workaround, I wanted to split my message up into multiple messages, but I'm unable to do this, as I need all the information to be in the same record in Elasticsearch. I don't believe there is a way to call the Update API from logstash. Additionally, most of the data is in an array, so while I can update an Elasticsearch record's array using a script (Elasticsearch upserting and appending to array), I can't do that from Logstash.
The data records look something like this:
{ "variable1":"value1",
......,
"variable30": "value30",
"attachements": [ {5500 charcters of JSON},
{5500 charcters of JSON},
{5500 charcters of JSON}..
...
{8th dictionary of JSON}]
}
Does anyone know of a way to have Logstash process these large JSON messages, or a way that I can split them up and have them end up in the same Elasticsearch record (using Logstash)?
Any help is appreciated, and I'm happy to add any information needed!

If your elasticsearch output has a document_id set, it will update the document (the default action in logstash is to index the data -- which will update the document if it already exists)
In your case, you'd need to include some unique field as part of your json messages and then rely on that to do the merge in elasticsearch. For example:
{"key":"123455","attachment1":"something big"}
{"key":"123455","attachment2":"something big"}
{"key":"123455","attachment3":"something big"}
And then have an elasticsearch output like:
elasticsearch {
host => localhost
document_id => "%{key}"
}

Related

Processing a Kafka message using KSQL that has a field that can be either an ARRAY or a STRUCT

I'm consuming a Kafka topic published by another team (so I have very limited influence over the message format). The message has a field that holds an ARRAY of STRUCTS (an array of objects), but if the array has only one value then it just holds that STRUCT (no array, just an object). I'm trying to transform the message using Confluent KSQL. Unfortunately, I cannot figure out how to do this.
For example:
{ "field": {...} } <-- STRUCT (single element)
{ "field": [ {...}, {...} ] } <-- ARRAY (multiple elements)
{ "field": [ {...}, {...}, {...} ] <-- ARRAY (multiple elements)
If I configure the field in my message schema as a STRUCT then all messages with multiple values error. If I configure the field in my message schema as an ARRAY then all messages with a single value error. I could create two streams and merge them, but then my error log will be polluted with irrelevant errors.
I've tried capturing this field as a STRING/VARCHAR which is fine and I can split the messages into two streams. If I do this, then I can parse the single value messages and extract the data I need, but I cannot figure out how to parse the multivalue messages. None of the KSQL JSON functions seem to allow parsing of JSON Arrays out of JSON Strings. I can use EXTRACTJSONFIELD() to extract a particular element of the array, but not all of the elements.
Am I missing something? Is there any way to handle this reasonably?
In my experience, this is one use-case where KSQL just doesn't work. You would need to use Kafka Streams or a plain consumer to deserialize the event as a generic JSON type, then check object.get("field").isArray() or isObject(), and handle accordingly.
Even if you used a UDF in KSQL, the STREAM definition would be required to know ahead of time if you have field ARRAY<?> or field STRUCT<...>
I finally solved this in a roundabout way...
First, I created an initial stream reading the transaction as a stream of bytes using KAFKA format instead of JSON format. This allows me to put a filter conditional filter on the data so I can fork the stream into a version for the single (STRUCT) variation and a version for the multiple (ARRAY) variation.
The initial stream looks like:
CREATE OR REPLACE STREAM `my-topic-stream` (
id STRING KEY,
data BYTES
)
WITH (
KAFKA_TOPIC='my-topic',
VALUE_FORMAT='KAFKA'
);
Forking that stream looks like this with a second for a multiple version filtering for IS NOT NULL:
CREATE OR REPLACE STREAM `my-single-stream`
WITH (
kafka_topic='my-single-topic'
) AS
SELECT *
FROM `my-topic-stream`
WHERE JSON_ARRAY_LENGTH(EXTRACTJSONFIELD(FROM_BYTES(data, 'utf8'), '$.field')) IS NULL;
At this point I can create a schema for both variations, explode field, and merge the two streams back together. I don't know if this can be refined to be more efficient, but this successfully processes the transactions as I wanted.

Are there Size limits for flutter json.decode?

I have a dictionary file with 200,000 items in it.
I have a Dictionary Model which matches the SQLite db and the proper methods.
If I try to parse the whole file, it seems to hang. If I do 8000 items, it seems to do it quite quickly. Is there a size limit, or is just because there might be some corrupted data somewhere? This json was exported from the sqlite db as json pretty, so I would imagine it was done correctly. It also works fine with the first 8000 items.
String peuJson = await getPeuJson();
List<Dictionary> dicts = (json.decode(peuJson) as List)
.map((i) => Dictionary.fromJson(i))
.toList();
JSON is similar to other data formats like XML - if you need to transmit more data, you just send more data. There's no inherent size limitation to the JSON request. Any limitation would be set by the server parsing the request.

Foundry Data Connection Rest responses as rows

I'm trying to implement a very simple single get call, and the response returns some text with a bunch of ids separated by newline (like a single column csv). I want to save each one as a row in a dataset.
I understand that in general the Rest connector saves each response as a new row in an avro file, which works well for json responses which can then be parsed in code.
However in my case I need it to just save the response in a txt or csv file, which I can then apply a schema to, getting each id in its own row. How can I achieve this?
By default, the Data Connection Rest connector will place each response from the API as a row in the output dataset. If you know the format type of your response, and it's something that would usually be parsed to be one row per newline (csv for example), you can try setting the outputFileType to the correct format (undefined by default).
For example (for more details see the REST API Plugin documentation):
type: rest-source-adapter2
outputFileType: csv
restCalls:
- type: magritte-rest-call
method: GET
path: '/my/endpoint/file.csv'
If you don't know the format, or the above doesn't work regardless, you'll need to parse the response in transforms to split it into separate rows, this can be done as if the response was a string column, in this case exploding after splitting on newline (\n) might be useful: F.explode(F.split(F.col("response"), r'\n'))

Filtering with regex vs json

When filtering logs, Logstash may use grok to parse the received log file (let's say it is Nginx logs). Parsing with grok requires you to properly set the field type - e.g., %{HTTPDATE:timestamp}.
However, if Nginx starts logging in JSON format then Logstash does very little processing. It simply creates the index, and outputs to Elasticseach. This leads me to believe that only Elasticsearch benefits from the "way" it receives the index.
Is there any advantage for Elasticseatch in having index data that was processed with Regex vs. JSON? E.g., Does it impact query time?
For elasticsearch it doesn't matter how you are parsing the messages, it has no information about it, you only need to send a JSON document with the fields that you want to store and search on according to your index mapping.
However, how you are parsing the message matters for Logstash, since it will impact directly in the performance.
For example, consider the following message:
2020-04-17 08:10:50,123 [26] INFO ApplicationName - LogMessage From The Application
If you want to be able to search and apply filters on each part of this message, you will need to parse it into fields.
timestamp: 2020-04-17 08:10:50,123
thread: 26
loglevel: INFO
application: ApplicationName
logmessage: LogMessage From The Application
To parse this message you can use different filters, one of them is grok, which uses regex, but if your message has always the same format, you can use another filter, like dissect, in this case both will achieve the same thing, but while grok uses regex to match the fields, dissect is only positional, this make a huge difference in CPU use when you have a high number of events per seconds.
Consider now that you have the same message, but in a JSON format.
{ "timestamp":"2020-04-17 08:10:50,123", "thread":26, "loglevel":"INFO", "application":"ApplicationName","logmessage":"LogMessage From The Application" }
It is easier and fast for logstash to parse this message, you can do it in your input using the json codec or you can use the json filter in your filter block.
If you have control on how your log messages are created, choose something that will make you do not need to use grok.

Handling big JSONs in Azure Data Factory

I'm trying to use ADF for the following scenario:
a JSON is uploaded to a Azure Storage Blob, containing an array of similar objects
this JSON is read by ADF with a Lookup Activity and uploaded via a Web Activity to an external sink
I cannot use the Copy Activity, because I need to create a JSON payload for the Web Activity, so I have to lookup the array and paste it like this (payload of the Web Activity):
{
"some field": "value",
"some more fields": "value",
...
"items": #{activity('GetJsonLookupActivity').output.value}
}
The Lookup activity has a known limitation of an upper limit of 5000 rows at a time. If the JSON is larger, only 5000 top rows will be read and all else will be ignored.
I know this, so I have a system that chops payloads into chunks of 5000 rows before uploading to storage. But I'm not the only user, so there's a valid concern that someone else will try uploading bigger files and the pipeline will silently pass with a partial upload, while the user would obviously expect all rows to be uploaded.
I've come up with two concepts for a workaround, but I don't see how to implement either:
Is there any way for me to check if the JSON file is too large and fail the pipeline if so? The Lookup Activity doesn't seem to allow row counting, and the Get Metadata Activity only returns the size in bytes.
Alternatively, the MSDN docs propose a workaround of copying data in a foreach loop. But I cannot figure out how I'd use Lookup to first get rows 1-5000 and then 5001-10000 etc. from a JSON. It's easy enough with SQL using OFFSET N FETCH NEXT 5000 ROWS ONLY, but how to do it with a JSON?
You can't set any index range(1-5,000,5,000-10,000) when you use LookUp Activity.The workaround mentioned in the doc doesn't means you could use LookUp Activity with pagination,in my opinion.
My workaround is writing an azure function to get the total length of json array before data transfer.Inside azure function,divide the data into different sub temporary files with pagination like sub1.json,sub2.json....Then output an array contains file names.
Grab the array with ForEach Activity, execute lookup activity in the loop. The file path could be set as dynamic value.Then do next Web Activity.
Surely,my idea could be improved.For example,you get the total length of json array and it is under 5000 limitation,you could just return {"NeedIterate":false}.Evaluate that response by IfCondition Activity to decide which way should be next.It the value is false,execute the LookUp activity directly.All can be divided in the branches.