Extract top-level key and contents from large JSON using stream - json

One procedure in a system is to 'extract' one key and its (object) value to a dedicated file, to subsequently process it in some way in an (irrelevant) script.
A representative subset of the original JSON file looks like:
{
"version" : null,
"produced" : "2021-01-01T00:00:00+0000",
"other": "content here",
"items" : [
{
"code" : "AA",
"name" : "Example 1",
"prices" : [ "other", "content", "here" ]
},
{
"code" : "BB",
"name" : "Example 2",
"prices" : [ "other", "content", "here" ]
}
]
}
And the current output, given that subset as input, simply equals:
[
{
"code" : "AA",
"name" : "Example 1",
"prices" : [ "other", "content", "here" ]
},
{
"code" : "BB",
"name" : "Example 2",
"prices" : [ "other", "content", "here" ]
},
...
]
Previously, we would extract the whole portion of "items" using jq with a very straightforward command (which worked fine):
cat file.json | jq '.items' > file.items.json
However, the original JSON file has recently grown drastically in size, causing the script to fail with an out-of-memory error. One obvious solution is to use jq's 'stream' option, but I am stuck on how to convert the above command into a valid filter in jq's stream syntax.
cat file.json | jq --stream '...' > file.items.json
Any advice on what to use as a filter for this command would be greatly appreciated. Thanks in advance!

You should use the --stream flag in combination with the fromstream builtin:
jq --stream --null-input '
fromstream(inputs | select(.[0][0] == "items"))[]
' file.json
[
{
"code": "AA",
"name": "Example 1",
"prices": [
"other",
"content",
"here"
]
},
{
"code": "BB",
"name": "Example 2",
"prices": [
"other",
"content",
"here"
]
}
]
The demo illustrates the syntax only, not the efficiency or memory consumption: jqplay.org lacks the --stream option, so I had to stream your original input using tostream instead.
Note: Although it works for the sample data, do not try to shortcut using
jq --stream --null-input 'fromstream(inputs).items' file.json
directly on your large JSON file: it reconstructs the entire input JSON entity in memory before selecting .items, thus defeating the purpose of using --stream (clarified by @peak).
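To see what select(.[0][0] == "items") is matching, it can help to look at the shape of the events. The following is a rough Python sketch, not jq itself: to_stream is a hypothetical helper that imitates the [path, value] events and the closing [path] events that jq emits with --stream, and the list comprehension keeps only events whose path starts with "items". Unlike jq --stream, this sketch works on data that is already in memory, so it illustrates the event format only, not the memory savings.

```python
def to_stream(value, path=()):
    """Yield jq-style stream events for an already-parsed JSON value.

    Hypothetical helper for illustration only: it walks in-memory data,
    whereas jq --stream produces these events while parsing.
    """
    if isinstance(value, dict) and value:
        children = list(value.items())
    elif isinstance(value, list) and value:
        children = list(enumerate(value))
    else:
        # Scalars and empty containers are leaves: [path, value]
        yield [list(path), value]
        return
    for key, child in children:
        yield from to_stream(child, path + (key,))
    # Closing event: the path of the last child, with no value attached
    yield [list(path + (children[-1][0],))]


# Reduced version of the sample input from the question
doc = {"version": None, "items": [{"code": "AA"}, {"code": "BB"}]}

# Same idea as select(.[0][0] == "items")
item_events = [event for event in to_stream(doc) if event[0][0] == "items"]
for event in item_events:
    print(event)
```

The last event kept is the closing event [["items"]], whose path has length 1; that is the event that lets fromstream emit the rebuilt {"items": [...]} object.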

If a stream of the {code, name, prices} objects is acceptable, then you could go with:
< input.json jq --stream -n '
fromstream( 2 | truncate_stream(inputs | select(.[0][0] == "items")) )'
This keeps memory requirements minimal, a saving that may or may not be significant depending on the value of .items|length.
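As a rough illustration of what 2 | truncate_stream(...) does to the events, here is a Python sketch. The function below is a hypothetical re-implementation written for this example, not jq's own code: it drops the first two path components and discards events whose path is not longer than the given depth.

```python
def truncate_stream(depth, events):
    """Strip `depth` leading path components; skip events that are too shallow."""
    for event in events:
        path = event[0]
        if len(path) > depth:
            # Keep the value (if any) and shorten the path
            yield [path[depth:]] + event[1:]


# Events for a reduced "items" array, in the form jq --stream emits them
events = [
    [["items", 0, "code"], "AA"],
    [["items", 0, "code"]],
    [["items", 1, "code"], "BB"],
    [["items", 1, "code"]],
    [["items", 1]],
    [["items"]],
]

truncated = list(truncate_stream(2, events))
for event in truncated:
    print(event)
```

After truncation every array element ends with its own closing event of path length 1, so fromstream can emit each object as soon as it is complete instead of holding the whole array.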

Related

Expand large array and select elements in JQ

This may just not be possible due to how conceptually streaming/filtering JSON works, but let's suppose I have something like the following JSON:
[
{
"name": "account_1",
"type": "account"
},
{
"name": "account_2",
"type": "account"
},
{
"name": "user_1",
"type": "user"
},
{
"name": "user_2",
"type": "user"
}
]
And now I want to print out only the user objects.
I know I can filter to just the streaming type entities with something like this:
cat file.json | jq --stream 'select(.[0][1] == "type" and .[1] == "user" | .)'
Which would produce:
[
[
2,
"type"
],
"user"
]
[
[
3,
"type"
],
"user"
]
Is there any way I can print out the parent objects of those types instead of the type entities? E.g. I'd like to get out:
[
{
"name": "user_1",
"type": "user"
},
{
"name": "user_2",
"type": "user"
}
]
Without streaming, this is a pretty straightforward exercise. E.g.:
cat file.json | jq '.[] | select(.type=="user")'
In reality the actual input file is around 5GB, so I need to use streaming input, but I can't seem to get the jq syntax right with --stream enabled. E.g.
cat file.json | jq --stream '.[] | select(.type=="user")'
Produces:
jq: error (at <stdin>:3): Cannot index array with string "type"
jq: error (at <stdin>:5): Cannot index array with string "type"
...
(edited to include desired output)
Just truncate the top-level array.
jq -n --stream 'fromstream(1 | truncate_stream(inputs)) | select(.type == "user")'
Online demo
jqplay does not support the --stream option, so the above demo has the output of --stream as the JSON input.

jq process json where an element can be an array or object

The output from the tool I am using is creating an element in the json that is an object when there is only 1 item but an array when there is more than 1.
How do I parse this with jq to return the full list of names only from within content?
{
"data": [
{
"name": "data block1",
"content": {
"name": "1 bit of data"
}
},
{
"name": "data block2",
"content": [
{
"name": "first bit"
},
{
"name": "another bit"
},
{
"name": "last bit"
}
]
}
]
}
What I can't work out is how to switch depending on the type of content.
# jq '.data[].content.name' test.json
"1 bit of data"
jq: error (at test.json:22): Cannot index array with string "name"
# jq '.data[].content[].name' test.json
jq: error (at test.json:22): Cannot index string with string "name"
I am sure I should be able to use type but my jq-fu is not strong enough!
# jq '.data[].content | type=="array"' test.json
false
true
jq version 1.5
jq '.data[].content | if type == "array" then .[] else . end | .name?'
(The trailing ? is there just in case.)
More succinctly:
jq '.data[].content | .name? // .[].name?'
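The type switch in the first filter can be sketched in Python as follows. This is only an illustration of the same idea (wrap a lone object in a list so both shapes are handled alike); the function name content_names is made up for this example.

```python
def content_names(doc):
    """Collect every name under data[].content, whether content is an object or a list."""
    names = []
    for block in doc["data"]:
        content = block["content"]
        # Wrap a lone object so both shapes can be iterated the same way
        items = content if isinstance(content, list) else [content]
        names.extend(item["name"] for item in items)
    return names


doc = {
    "data": [
        {"name": "data block1", "content": {"name": "1 bit of data"}},
        {
            "name": "data block2",
            "content": [
                {"name": "first bit"},
                {"name": "another bit"},
                {"name": "last bit"},
            ],
        },
    ]
}

names = content_names(doc)
print(names)
```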

jq, replace null values on any level, not touching non-null or not existing

Please assist a newbie in jq. :)
I have to update a field with a specific name that might occur at any level of the JSON structure, or might not occur at all, like all the *.description fields in the JSON below:
{
"a": {
"b": [{
"name": "b0",
"description": "b0 has description"
},
{
"name": "b1",
"description": null
},
{
"name": "b2"
}
],
"description": null
},
"s": "Some string value"
}
I need to update the "description" value with some dummy value only if it is null, without touching existing values and without creating new fields where they do not exist. So the desired result in this case is:
{
"a": {
"b": [{
"name": "b0",
"description": "b0 has description"
},
{
"name": "b1",
"description": "DUMMY DESCRIPTION"
},
{
"name": "b2"
}
],
"description": "DUMMY DESCRIPTION"
},
"s": "Some string value"
}
Here, .a.b[0].description is left untouched because it existed and was not null; .a.b[1].description and .a.description are forced to "DUMMY DESCRIPTION" because these fields existed and were null; and .a.b[2] as well as the root level are left untouched because there was no description field at all.
If, for example, I use a command on known paths like the one below
jq '.known.level.description //= "DUMMY DESCRIPTION"' ........
it fails to skip non-existing fields like .a.b[2].description, and of course it only works on known positions in the JSON. And if I try a recursive search like:
jq '.. | .description? //= "DUMMY DESCRIPTION"' ........
it does not seem to work correctly on arrays.
What's the correct approach to walk through entire JSON in this case? Thanks!
What's the correct approach to walk through entire JSON in this case?
The answer is walk!
If your jq does not already have walk/1, you can google for it easily enough (jq "def walk"), and then include its def before using it, e.g. as follows:
walk(if type == "object" and has("description") and .description == null
then .description = "DUMMY DESCRIPTION"
else . end)
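For comparison, the same recursive rule can be sketched in Python. This is an illustration of the logic only, not a translation of jq's walk/1, and the function name fill_null_descriptions is made up for this example: it replaces a "description" value only when the key exists and its value is null, and never adds the key.

```python
def fill_null_descriptions(value, replacement="DUMMY DESCRIPTION"):
    """Return a copy of value with null "description" fields replaced at any depth."""
    if isinstance(value, dict):
        return {
            key: (
                replacement
                if key == "description" and child is None
                else fill_null_descriptions(child, replacement)
            )
            for key, child in value.items()
        }
    if isinstance(value, list):
        return [fill_null_descriptions(item, replacement) for item in value]
    # Scalars are returned unchanged
    return value


doc = {
    "a": {
        "b": [
            {"name": "b0", "description": "b0 has description"},
            {"name": "b1", "description": None},
            {"name": "b2"},
        ],
        "description": None,
    },
    "s": "Some string value",
}

result = fill_null_descriptions(doc)
print(result)
```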
One option you could consider is using streams. You'll get paths and values to every item in the input. With that you could look for name/value pairs with the name "description" and update the value.
$ jq --arg replacement "DUMMY DESCRIPTION" '
fromstream(tostream | if length == 2 and .[0][-1] == "description"
then .[1] |= (if . == null then $replacement else . end)
else .
end)
' input.json

Grep for a name-value pair inside a JSON object

I have a shell script running on Unix that goes through a list of JSON objects like the following, collecting values like <init>() # JSONInputData.java:82. There are also other objects with other values that I need to retrieve.
Is there a better option than grepping for "STACKTRACE_LINE",\n\s*.* and then splitting up that result?
inb4: "add X package to the OS". Need to run generically.
. . .
"probableStartLocationView" : {
"lines" : [ {
"fragments" : [ {
"type" : "STACKTRACE_LINE",
"value" : "<init>() # JSONInputData.java:82"
} ],
"text" : "<init>() # JSONInputData.java:82"
} ],
"nested" : false
},
. . . .
What if I were looking for "description" : "Dangerous Data Received" in a series of objects like the following, and needed to know that it is associated with event 12345 and not with another event listed in the same file?
. . .
"events" : [ {
"id" : "12345",
"important" : true,
"type" : "Creation",
"description" : "Dangerous Data Received",
. . .
Is there a better option than grepping for "STACKTRACE_LINE",\n\s*.* and then splitting up that result?
Yes. Use jq to filter and extract the interesting parts.
Example 1, given this JSON:
{
"probableStartLocationView": {
"lines": [
{
"fragments": [
{
"type": "STACKTRACE_LINE",
"value": "<init>() # JSONInputData.java:82"
}
],
"text": "<init>() # JSONInputData.java:82"
}
],
"nested": false
}
}
Extract value where type is "STACKTRACE_LINE":
jq -r '.probableStartLocationView.lines[] | .fragments[] | select(.type == "STACKTRACE_LINE") | .value' file.json
This is going to produce one line per value.
Example 2, given this JSON:
{
"events": [
{
"id": "12345",
"important": true,
"type": "Creation",
"description": "Dangerous Data Received"
}
]
}
Extract the id where description starts with "Dangerous":
jq -r '.events[] | select(.description | startswith("Dangerous")) | .id' file.json
And so on.
See the jq manual for more examples and capabilities.
Also there are many questions on Stack Overflow using jq,
that should help you find the right combination of filtering and extracting the relevant parts.
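If jq cannot be installed on the target machines but a Python 3 interpreter happens to be present (an assumption, not something the question states), the same two extractions can be sketched with the standard json module alone. The function names below are made up for this example, and the inline strings stand in for the real file contents.

```python
import json


def stacktrace_values(doc):
    """Return the value of every fragment whose type is STACKTRACE_LINE."""
    view = doc.get("probableStartLocationView", {})
    return [
        fragment["value"]
        for line in view.get("lines", [])
        for fragment in line.get("fragments", [])
        if fragment.get("type") == "STACKTRACE_LINE"
    ]


def event_ids_by_description(doc, prefix):
    """Return the id of every event whose description starts with prefix."""
    return [
        event["id"]
        for event in doc.get("events", [])
        if event.get("description", "").startswith(prefix)
    ]


example_1 = json.loads("""
{"probableStartLocationView": {"lines": [{"fragments": [
  {"type": "STACKTRACE_LINE", "value": "<init>() # JSONInputData.java:82"}],
  "text": "<init>() # JSONInputData.java:82"}], "nested": false}}
""")
example_2 = json.loads("""
{"events": [{"id": "12345", "important": true, "type": "Creation",
             "description": "Dangerous Data Received"}]}
""")

values = stacktrace_values(example_1)
ids = event_ids_by_description(example_2, "Dangerous")
print(values, ids)
```

Because the id and the description are read from the same event object, the match stays tied to event 12345 rather than to a neighbouring event in the file.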

create an object from an existing json file using 'jq'

I have a messages.json file
[
{
"id": "title",
"description": "This is the Title",
"defaultMessage": "title",
"filepath": "src/title.js"
},
{
"id": "title1",
"description": "This is the Title1",
"defaultMessage": "title1",
"filepath": "src/title1.js"
},
{
"id": "title2",
"description": "This is the Title2",
"defaultMessage": "title2",
"filepath": "src/title2.js"
},
{
"id": "title2",
"description": "This is the Title2",
"defaultMessage": "title2",
"filepath": "src/title2.js"
}
]
I want to create an object
{
"title": "Dummy1",
"title1": "Dummy2",
"title2": "Dummy3",
"title3": "Dummy4"
}
from the top one.
So far I have
jq '.[] | .id' src/messages.json;
And it does give me the IDs
How do I add some random text and make the new object as above?
Can we also create a new JSON file and write the newly created object onto it using jq?
Your output included "title3" so I'll assume that you intended that the second occurrence of "title2" in the input was supposed to refer to "title3".
With this assumption, the following jq program seems to do what you want:
map( .id )
| . as $in
| reduce range(0;length) as $i ({};
. + {($in[$i]): "dummy\(1+$i)"})
In words, extract the values of .id, and then turn each into an object of the form: {(.id) : "dummy\(1+$i)"}
This uses string interpolation, and produces:
{
"title": "dummy1",
"title1": "dummy2",
"title2": "dummy3",
"title3": "dummy4"
}
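The same index-based construction can be sketched in Python for comparison. This is an illustration only; like the jq program above, it assumes the fourth id was meant to be "title3", and the messages list is trimmed to the id field.

```python
# Trimmed stand-in for messages.json, with the assumed "title3" correction
messages = [
    {"id": "title"},
    {"id": "title1"},
    {"id": "title2"},
    {"id": "title3"},
]

# Pair each id with a 1-based counter, like "dummy\(1+$i)" in the jq program
result = {
    message["id"]: f"dummy{index}"
    for index, message in enumerate(messages, start=1)
}
print(result)
```

As with the jq version, a repeated id would simply overwrite the earlier entry, so duplicates in the input yield fewer keys than input objects.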
reduce-free solution
map(.id )
| [., [range(0;length)]]
| transpose
| map( {(.[0]): "dummy\(.[1]+1)"})
| add
The output is the same as above.
Can we also create a new json file and write the newly created object onto it using jq?
Yes, just use output redirection:
jq -f program.jq messages.json > output.json
Addendum
I want a parent object "de" to the already created json file objects
You could just pipe either of the above solutions to: {de: .}