NiFi - edit JSON flow file content

I am using NiFi 1.6.0. I am trying to copy an S3 file into Redshift. The JSON file on S3 looks like this:
[
  {
    "a": 1,
    "b": 2
  },
  {
    "a": 3,
    "b": 4
  }
]
However, this gives an error (Invalid JSONPath format: Member is not an object.) because of the '[' and ']' in the file (https://stackoverflow.com/a/45348425).
I need to convert the JSON from the above format to a format that looks like this:
{
  "a": 1,
  "b": 2
}
{
  "a": 3,
  "b": 4
}
So basically what I am trying to do is to remove '[' and ']' and replace '},\n' with '}\n'.
The file has over 14 million rows (113 MB in size).
How can I achieve this using NiFi?

You can use ReplaceText; a very similar problem has been solved this way before. First replace the brackets with an empty string, then remove the trailing commas using the Literal Replace strategy.
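A sketch of the two ReplaceText processor configurations (property names as they appear in NiFi 1.6's ReplaceText; treat the exact values as a starting point rather than a tested recipe):

ReplaceText #1 (remove the bracket lines):
  Replacement Strategy: Regex Replace
  Evaluation Mode: Line-by-Line
  Search Value: ^\[$|^\]$
  Replacement Value: (empty string)

ReplaceText #2 (drop the trailing commas):
  Replacement Strategy: Literal Replace
  Evaluation Mode: Line-by-Line
  Search Value: },
  Replacement Value: }

Line-by-Line evaluation avoids loading the whole 113 MB content into memory at once.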

Related

How would you collect the first few entries of a list from a large JSON file using jq?

I am trying to process a large JSON file for testing purposes that has a few thousand entries. The JSON contains a long list of data that is too large for me to process in one go. Using jq, is there an easy way to get a valid snippet of the JSON that only contains the first few entries from the data list? For example, is there a query that would look at the whole JSON file and return a valid JSON that only contains the first 4 entries from data? Thank you!
{
  "info":{
    "name":"some-name"
  },
  "data":[
    {...},
    {...},
    {...},
    {...}
  ]
}
Based on your snippet, the relevant jq would be:
.data |= .[:4]
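For example (a sketch; input.json and sample.json are placeholder file names):
# keep only the first four entries of the "data" array
jq '.data |= .[:4]' input.json > sample.json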
Here's an example using the --stream option:
$ cat input.json
{
  "info": {"name": "some-name"},
  "data": [
    {"a":1},
    {"b":2},
    {"c":3},
    {"d":4},
    {"e":5},
    {"f":6},
    {"g":7}
  ]
}
jq --stream -n '
  reduce (
    inputs | select(has(1) and (.[0] | .[0] == "data" and .[1] < 4))
  ) as $in (
    {}; .[$in[0][-1]] = $in[1]
  )
' input.json
{
  "a": 1,
  "b": 2,
  "c": 3,
  "d": 4
}
Note: Using limit would have been more efficient in this case, but I tried to be more generic for the purpose of scalability.
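For completeness, a limit-based variant of the streaming approach (a sketch; it assumes, as in the sample, that every element of "data" contributes exactly one leaf event):
jq --stream -n '
  reduce limit(4; inputs | select(has(1) and .[0][0] == "data")) as $in (
    {}; .[$in[0][-1]] = $in[1]
  )
' input.json
Here limit stops pulling events from inputs after the fourth match, so the rest of the file does not need to be processed.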

How to edit a specific text line of a large JSON/text file (~25 GB)?

I have a JSON file with Elasticsearch events which can't be parsed by jq (funnily enough, the JSON was produced by jq) due to a missing comma. Here is an extract from the problematic place in the file:
"end",
"protocol"
],
"dataset": "test",
"outcome": "success"
},
"#timestamp": "2020-08-23T04:47:10.000+02:00"
}
{
"agent": {
"hostname": "fb",
"type": "filebeat"
},
"destination": {
My jq command crashes at the closing brace (above "agent") because a comma is missing after that brace (a new event starts there). Now I know the exact line and would like to add a comma there, but I couldn't find any options on how to do that efficiently. Since the file is around 25 GB, it is impractical to open it in nano or other editors. The error is: parse error: Expected separator between values at line 192388762
Does anyone know if there is an efficient way to add a comma there so it looks like this?
"#timestamp": "2020-08-23T04:47:10.000+02:00"
},
{
"agent": {
Is there a command I can tell to go to line X, column 1 and add a comma there (after column 1)?
Are there brackets [] surrounding all these objects? If so, it is an array and there are indeed missing commas. But jq wouldn't have failed to produce them unless the previous filter was deliberately designed to behave that way. If there aren't surrounding brackets (which I presume, judging by the indentation of the sample), then it is a stream of objects that does not need commas in between. In fact, putting a comma in between without the surrounding brackets would render the file unprocessable, as it wouldn't be valid JSON anymore.
If it is a faulty array (the former case), you may be better off not using jq but rather a text stream editor such as sed or awk, as you seem to know exactly where the commas are missing ("Is there a command which I can tell to go to line X, column 1 and add a comma there?").
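For instance, with GNU sed (a sketch; events.json is a placeholder file name, and you should first check which line actually needs the trailing comma, since jq reports the line where it expected the separator, not necessarily the line to fix):
# inspect the lines around the reported position first
sed -n '192388755,192388765p' events.json

# then append a comma to the end of the offending line (in-place edit)
LINE=192388761   # assumption: the line just before the reported one; adjust after inspecting
sed -i "${LINE}"'s/$/,/' events.json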
If it is in fact a stream of objects (the latter case), then you could use jq --slurp '…' or jq -n '[inputs] | …' to make it an array (surrounded by brackets and with commas in between), but then the file (25 GB) has to fit entirely into your memory. If it doesn't, you need to use jq --stream '…' and handle the document (which then has a different format) according to the documentation for processing streams.
Illustrations:
This is an array of objects:
[
{"a": 1},
{"b": 2},
{"c": 3}
]
This is a stream of objects:
{"a": 1}
{"b": 2}
{"c": 3}
This is not valid JSON:
{"a": 1},
{"b": 2},
{"c": 3}

How to remove a block of code from a JSON using jq?

I have a JSON file temp.json like this:
{
  "data": {
    "stuff": [
      .....
    ]
  },
  "time": {
    "metrics": 83
  }
}
I want to remove this particular block from the above JSON file:
,
"time": {
  "metrics": 83
}
After removal I want to write the new JSON back to the same file, so that its content becomes:
{
  "data": {
    "stuff": [
      .....
    ]
  }
}
Is this possible to do by any chance?
Note: number 83 can be any number in general.
Here's an excellent tutorial: Baeldung: Guide to Linux jq Command for JSON Processing.
Maybe you can try something like this: jq 'del(.time)' temp.json > temp2.json.
Note that jq works at the semantic level; it's not just "text substitution". So things like the "comma" separators between objects will be deleted from the JSON text when you use jq to delete the object.
Experiment, and see what works best for your particular scenario.
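Since jq cannot write back to the file it is reading, a common pattern for updating the same file is to go through a temporary file (a sketch, using the file name from the question):
# remove the "time" object and replace the original file with the result
jq 'del(.time)' temp.json > temp.json.tmp && mv temp.json.tmp temp.json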

Upload CSV data to Elasticsearch without Logstash

I uploaded CSV data into Elasticsearch using the machine-learning approach described here.
This created an index and a pipeline with a csv preprocessor. The import was successful.
What is the corresponding curl command line to upload CSV data into Elasticsearch, assuming the index is called iislog and the pipeline iislog-pipeline?
The csv ingest processor will only work on a JSON document that contains a field with CSV data. You cannot throw raw CSV data at it using curl.
The CSV to JSON transformation happens in Kibana (when you drop the raw CSV file into the browser window), and only then will Kibana send the JSON-ified CSV.
If your CSV looks like this:
column1,column2,column3
1,2,3
4,5,6
7,8,9
Kibana will transform each line into
{"message": "1,2,3"}
{"message": "4,5,6"}
{"message": "7,8,9"}
And then Kibana will send each of those raw CSV/JSON documents to your iislog index through the iislog-pipeline ingest pipeline. The pipeline looks like this:
{
  "description" : "Ingest pipeline created by file structure finder",
  "processors" : [
    {
      "csv" : {
        "field" : "message",
        "target_fields" : [
          "column1",
          "column2",
          "column3"
        ],
        "ignore_missing" : false
      }
    },
    {
      "remove" : {
        "field" : "message"
      }
    }
  ]
}
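You can check what this pipeline does to a raw CSV line with the simulate API (a sketch; the localhost:9200 endpoint is an assumption, adjust host and authentication as needed):
curl -H 'Content-Type: application/json' \
  -XPOST 'http://localhost:9200/_ingest/pipeline/iislog-pipeline/_simulate' \
  -d '{"docs": [{"_source": {"message": "1,2,3"}}]}'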
In the end, the documents will look like this in your index:
{"column1": 1, "column2": 2, "column3": 3}
{"column1": 4, "column2": 5, "column3": 6}
{"column1": 7, "column2": 8, "column3": 9}
That's the way it works. So if you want to use curl, you need to do Kibana's pre-parsing job yourself: wrap each CSV line into a {"message": "..."} document and send it through the ingest pipeline, which then produces the parsed documents shown above.
curl -H 'Content-Type: application/json' -XPOST 'iislog/_doc?pipeline=iislog-pipeline' -d '{"message": "1,2,3"}'
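For more than a handful of rows, the bulk API accepts the same {"message": "..."} documents through the pipeline (a sketch; iislog.csv and the localhost:9200 endpoint are assumptions, and it presumes the CSV values need no JSON escaping):
# skip the CSV header, wrap every row in a {"message": "..."} document and bulk-index
# it through the ingest pipeline, which does the actual CSV parsing
tail -n +2 iislog.csv | while read -r line; do
  printf '{"index":{}}\n{"message":"%s"}\n' "$line"
done | curl -s -H 'Content-Type: application/x-ndjson' \
         -XPOST 'http://localhost:9200/iislog/_bulk?pipeline=iislog-pipeline' \
         --data-binary @-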
There is another approach to insert CSV into Elasticsearch using an ingest pipeline, described here: https://www.elastic.co/de/blog/indexing-csv-elasticsearch-ingest-node
In the end, it wraps each line into a JSON document and grok-parses it in order to map the CSV rows to specific document fields.

Converting a JSON file to a single line

I have a huge .json file like the one below. I want to convert that JSON to a DataFrame in Spark.
{
  "movie": {
    "id": 1,
    "name": "test"
  }
}
When I execute the following code, I get a _corrupt_record error:
val df = sqlContext.read.json("example.json")
df.first()
I later learned that Spark only supports single-line JSON records, like:
{ "movie": { "id": 1, "name": "test test" } }
How can I convert JSON text from multiple lines to a single line?
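For reference, jq can do this with its compact-output flag (a minimal sketch, assuming a single JSON document per file as above; the output file name is made up):
# print the JSON as one compact line and write it to a new file
jq -c . example.json > example-single-line.json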