Processing large JSON attributes gives errors in NiFi

I have a NiFi flow which processes data from the webhose API; webhose returns a whole webpage of text in its result as an attribute in the JSON. When I try to extract this using the EvaluateJsonPath processor and write it to a new attribute, I get the error "nifi processor exception repository failed to update". The content is UTF-8 encoded, and I know that there is a limitation of 65535 bytes for an attribute. Is there a workaround for this?

I believe this limitation should be resolved in Apache NiFi 1.2.0 by this JIRA:
https://issues.apache.org/jira/browse/NIFI-3389
Also, keep in mind that having a lot of large attributes is not ideal for performance.

Related

NiFi RestLookupService in LookupRecord gives JsonParseException

I have a basic NiFi flow with one GenerateFlowFile processor and one LookupRecord processor with RestLookupService.
I need to do a rest lookup and enrich the incoming flow file with rest response.
I am getting errors that the lookup service is unable to lookup coordinates with the value I am extracting from the incoming flow file.
GenerateFlowFile is configured with a simple JSON payload.
LookupRecord is configured to extract the key from the JSON and pass it to the RestLookupService. A JsonReader and JsonSetWriter are also configured to read the incoming flow file and write the response back to it.
The RestLookupService itself fails with a JsonParseException about an unexpected character '<'.
RestLookupService is configured with my API running in the cloud, in which I am trying to use the KEY extracted from the incoming flow file.
The most interesting part is that when I configure the URL to point, for example, to mocky.io, everything works correctly, so the issue seems tied to the API URL I am using (http://iotosk.ddns.net:3006/devices/${key}/getParsingData). I have also tried removing the ${key}, using the exact URL, and using different URLs.
Of course, the API works fine over postman/curl and anything else. I have checked the logs on the container the API is running on and there are no requests in the logs, which means NiFi is failing even before reaching the API, at least at the application level.
I am absolutely out of options, without any clue how to solve this, and with NiFi even Google is not much help.
Does anybody see any issue in the configuration, or can anyone point me in some direction as to what could cause this?
After some additional digging, the issue turned out to be connected with authentication logic that runs even before the API receives the request; it cripples the request and returns XML, as Bryan Bende suggested in the comment.
But better logging in NiFi would definitely help to solve this kind of thing way faster...

Processing JSON data from Kafka using structured streaming

I want to convert incoming JSON data from Kafka into a dataframe.
I am using structured streaming with Scala 2.12
Most people add a hard-coded schema, but if the JSON can have additional fields, that requires changing the code base every time, which is tedious.
One approach is to write the data to a file and infer the schema from that, but I would rather avoid doing so.
Is there any other way to approach this problem?
Edit: I found a way to turn a JSON string into a dataframe, but I can't extract it from the stream source. Is it possible to extract it?
One way is to store the schema itself in the message headers (not in the key or value).
Though this increases the message size, it makes it easy to parse the JSON value without the need for any external resource like a file or a schema registry.
New messages can have new schemas while old messages can still be processed using their old ones, because each schema travels within its own message.
Alternatively, you can version the schemas and include an id for every schema in the message headers, or a magic byte in the key or value, and infer the schema from there.
This is the approach followed by the Confluent Schema Registry. It basically allows you to go through different versions of the same schema and see how it has evolved over time.
Read the data as a string and then convert it to a Map[String, String]; this way you can process any JSON without even knowing its schema.
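A minimal sketch of that approach (broker, topic, and field names below are placeholders, and it assumes flat JSON objects since every value is read back as a string):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

val spark = SparkSession.builder.appName("json-as-map").getOrCreate()

// Placeholder broker and topic; adjust to your environment.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Kafka delivers the payload as binary; cast it to a string and parse the
// whole JSON object into a Map[String, String], so no fixed schema is needed.
val asMap = raw
  .select(col("value").cast("string").as("json"))
  .select(from_json(col("json"), MapType(StringType, StringType)).as("kv"))

// Individual fields can then be pulled out by key.
val withField = asMap.select(col("kv").getItem("someField").as("someField"))

The trade-off is that every value comes back as a string, so numeric fields or nested objects need further handling downstream.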
Based on JavaTechnical's answer, the best approach would be to use a schema registry and Avro data instead of JSON; there is no way around hard-coding a schema (for now).
Include your schema name and id as a header and use them to read the schema from the schema registry.
Use the from_avro function to turn that data into a DataFrame!
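A rough sketch of that flow, assuming Spark 3.x with the spark-avro module on the classpath, a single schema id for the topic, and placeholder URLs and topic names; the schema JSON is fetched once from the registry's REST API and then applied with from_avro:

import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.avro.functions.from_avro

// Hypothetical registry URL and schema id (in practice the id comes from the message header).
val registryUrl = "http://schema-registry:8081"
val schemaId    = 1

// Fetch the writer schema once via the Schema Registry REST API.
val response   = scala.io.Source.fromURL(s"$registryUrl/schemas/ids/$schemaId").mkString
val schemaJson = new ObjectMapper().readTree(response).get("schema").asText()

val spark = SparkSession.builder.appName("avro-stream").getOrCreate()

val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// 'value' is assumed to be plain Avro binary (no Confluent wire-format prefix);
// decode it with the fetched schema and expand the record into columns.
val decoded = raw
  .select(from_avro(col("value"), schemaJson).as("data"))
  .select("data.*")

Note that if the messages were produced with the Confluent Avro serializer instead, the value carries a 5-byte magic-byte/schema-id prefix that has to be stripped (or handled by Confluent's own deserializer) before from_avro will accept it.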

JSON file that would be used in Elasticsearch

I want to know whether the JSON files that would be used in Elasticsearch should have a predefined structure, or whether any JSON document can be uploaded.
I've seen some JSON documents where, before each record, there is something like this:
{"index":{"_index":"plos","_type":"article","_id":0}}
{"id":"10.1371/journal.pone.0007737","title":"Phospholipase C-β4 Is Essential for the Progression of the Normal Sleep Sequence and Ultradian Body Temperature Rhythms in Mice"}
Theoretically you can upload any JSON document. However, be mindful that Elasticsearch can create/change the index mapping based on your create/update actions. So if you send a JSON that includes a previously unknown field? Congratulations, your index mapping now contains a new field! In the same way, the data type of a field might also be affected by introducing a document with data of a different type. So my advice is to be very careful in constructing your requests to avoid surprises.
FYI, the syntax you posted looks like a bulk request (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html). Those do place some demands on the syntax to clarify what you want to do to which documents. An "index" call sending a single document is much less restricted, though.

JSONs with (supposedly) same format treated differently by BigQuery - one accepted, one rejected

I am trying to upload JSON files to BigQuery. The JSON files are outputs from the Lighthouse auditing tool. I have made some changes to them in Python to make the field names acceptable for BigQuery and converted the format into newline-delimited JSON.
I am now testing this process and I have found that while for many web pages the upload runs without issue, BigQuery is rejecting some of the JSON files. The rejected JSONs always seem to be from the same website, for example, many of the audit JSONs from Topshop have failed on upload (the manipulations in Python run without issue). What I am confused by is that I can see no difference in the formatting/structure of the JSONs which succeed and fail.
I have included some examples here of the JSON files: https://drive.google.com/open?id=1x66PoDeQGfOCTEj4l3VqMIjjhdrjqs9w
The error I get from BigQuery when a JSON fails to load is this:
Error while reading table: build_test_2f38f439_7e9a_4206_ada6_ac393e55b8ec4_source, error message: Failed to parse JSON: No active field found.; ParsedString returned false; Could not parse value; Could not parse value; Could not parse value; Could not parse value; Could not parse value; Could not parse value; Parser terminated before end of string
I have also attempted to upload the failed JSONs to a new table through the interface using the autodetect feature (in an attempt to discover whether the Schema was at fault) and these uploads fail too, with the same error.
This makes me think the JSON files must be wrong, but I have copied them into several different JSON validators which all accept them as one row of valid JSON.
Any help understanding this issue would be much appreciated, thank you!
When you load JSON files to BigQuery, it's good to remember that there are some limitations associated with this format. You can find them here. Even though your files might be valid JSON files, some of them may not comply with BigQuery's limitations, so I would recommend you double-check whether they are actually acceptable to BigQuery.
I hope that helps.
I eventually found the error through a long trial-and-error process where I uploaded first the first half and then the second half of the JSON file to BigQuery. The second half failed, so I split that in half again to see which half the error occurred in. This continued until I found the offending line.
At a deep level of nesting there was a situation where one field was always a list of strings, but when there were no values associated with the field it appeared as an empty string (rather than an empty list). This inconsistency was causing the error. The trial-and-error process was long, but given the vague error message and the fact that the JSON was thousands of lines long, this seemed like the most efficient way to get there.
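To illustrate the kind of inconsistency (the field name below is made up), the failing lines effectively contained the second form rather than the first:

{"audit": {"warnings": ["missing alt text", "render-blocking resource"]}}
{"audit": {"warnings": ""}}

Normalising the empty-string case to an empty list (or dropping the field entirely) before upload makes every line consistent with the schema BigQuery detects from the rest of the file.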

Grails JSON max length

I am aware that JSON strings often have a max length defined in either Apache or PHP; however, where is the max length for JSON strings defined in Grails running on Tomcat?
The JSON string I am sending is 13,636 characters in length; however, I can shorten it a lot (although I don't want to while we're testing). Also, we may be sending images via JSON in the future, which I've read requires Base64 encoding and thus considerable overhead. If we were to do such a thing, then I am worried that if this limit is causing problems, it's something we should overcome now.
If there is no limit, then perhaps I am doing something wrong. I have a finite number of domain objects that I am encoding as JSON using domainInstance as grails.converters.deep.JSON - this is done in a for loop, and each time the JSON string is appended to a StringBuilder.
I then render the StringBuilder in a view using render(stringBuilder.toString()), and the first JSON string is fine; however, the second is truncated near the end. If I were to guesstimate, I'd say I am getting around 80% of the total length of the StringBuilder.
EDIT/SOLUTION: Apologies guys & girls, I've noticed that when I view the page source I get the complete JSON string; however, when I just view the page, it's truncated. It's an odd error; I'll accept answers on why it's truncated, though. Thanks.
There is a maximum size in Tomcat for POST requests, which you may be concerned with later if you start sending huge JSON / Base64 image requests (like you mentioned).
The default value in Tomcat is 2,097,152 bytes (2 MB); it can be changed by setting the maxPostSize attribute of the <Connector> in server.xml.
From the docs (for maxPostSize in Tomcat 7):
The maximum size in bytes of the POST which will be handled by the container FORM URL parameter parsing. The limit can be disabled by setting this attribute to a value less than or equal to 0. If not specified, this attribute is set to 2097152 (2 megabytes).
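For reference, the change is a single attribute on the HTTP connector in conf/server.xml; a minimal sketch with placeholder values (the rest of the connector definition stays as your installation already has it):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           maxPostSize="10485760" />
<!-- maxPostSize is in bytes (10 MB here); a value less than or equal to 0 disables the limit -->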
This is pretty straightforward to configure if you're deploying a Grails war to a standalone Tomcat instance. However, I can't find much about actually configuring server.xml if you're using the tomcat plugin. If it were me, I'd probably just run large-file tests using the war instead of with grails run-app.