jq - parsing& replacement based on key-value pairs within json - json

I have a json file in the form of a key-value map. For example:
{
"users":[
{
"key1":"user1",
"key2":"user2"
}
]
}
I have another json file. The values in the second file has to be replaced based on the keys in first file.
For example 2nd file is:
{
"info":
{
"users":["key1","key2","key3","key4"]
}
}
This second file should be replaced with
{
"info":
{
"users":["user1","user2","key3","key4"]
}
}
Because the value of key1 in first file is user1. this could be done with any python program, but I am learning jq and would like to try if it is possible with jq itself. I tried different combinations with reading file using slurpfile, then select & walk etc. But couldn't arrive at the required solution.
Any suggestions for the same will be appreciated.

Since .users[0] is a JSON dictionary, it would make sense to use it as such (e.g. for efficiency):
Invocation:
jq -c --slurpfile users users.json -f program.jq input.json
program.jq:
$users[0].users[0] as $dict
| .info.users |= map($dict[.] // .)
Output:
{"info":{"users":["user1","user2","key3","key4"]}}
Note: the above assumes that the dictionary contains no null or false values, or rather that any such values in the dictionary should be ignored. This avoids the double lookup that would otherwise be required. If this assumption is invalid, then a solution using has or in (e.g. as provided by RomanPerekhrest) would be appropriate.
Solution to supplemental problem
(See "comments".)
$users[0].users[0] as $dict
| second
| .info.users |= (map($dict[.] | select(. != null)))
sponge
It is highly inadvisable to use redirection to overwrite an input file.
If you have or can install sponge, then it would be far better to use it. For further details, see e.g. "What is jq's equivalent of sed -i?" in the jq FAQ.

jq solution:
jq --slurpfile users 1st.json '$users[0].users[0] as $users
| .info.users |= map(if in($users) then $users[.] else . end)' 2nd.json
The output:
{
"info": {
"users": [
"user1",
"user2",
"key3",
"key4"
]
}
}

Related

JQ write each object to subdirectory file

I'm new to jq (around 24 hours). I'm getting the filtering/selection already, but I'm wondering about advanced I/O features. Let's say I have an existing jq query that works fine, producing a stream (not a list) of objects. That is, if I pipe them to a file, it produces:
{
"id": "foo"
"value": "123"
}
{
"id": "bar"
"value": "456"
}
Is there some fancy expression I can add to my jq query to output each object individually in a subdirectory, keyed by the id, in the form id/id.json? For example current-directory/foo/foo.json and current-directory/bar/bar.json?
As #pmf has pointed out, an "only-jq" solution is not possible. A solution using jq and awk is as follows, though it is far from robust:
<input.json jq -rc '.id, .' | awk '
id=="" {id=$0; next;}
{ path=id; gsub(/[/]/, "_", path);
system("mkdir -p " path);
print >> path "/" id ".json";
id="";
}
'
As you will need help from outside jq anyway (see #peak's answer using awk), you also might want to consider using another JSON processor instead which offers more I/O features. One that comes to my mind is mikefarah/yq, a jq-inspired processor for YAML, JSON, and other formats. It can split documents into multiple files, and since its v4.27.2 release it also supports reading multiple JSON documents from a single input source.
$ yq -p=json -o=json input.json -s '.id'
$ cat foo.json
{
"id": "foo",
"value": "123"
}
$ cat bar.json
{
"id": "bar",
"value": "456"
}
The argument following -s defines the evaluation filter for each output file's name, .id in this case (the .json suffix is added automatically), and can be manipulated to further needs, e.g. -s '"file_with_id_" + .id'. However, adding slashes will not result in subdirectories being created, so this (from here on comparatively easy) part will be left over for post-processing in the shell.

Need help deleting elements with special character # from json object with jtc or jq

I'm trying to identify object elements which a key starting with #t. My goal is to delete them from the object all together.
Example Input
{
"process_state": {
"#user_id": "john smith",
"#t39ee396f50": 1,
"#t375b0311e8": 1,
"#t12dd92bf45": 1
}
}
Expected Output
{
"process_state": {
"#user_id": "john smith",
}
}
I've tried using jq and jtc to accomplish this and both seem to struggle with the leading # symbol. I'm assuming it's a format issue with my code. Can I use wildcards? I've tried a couple methods with no luck.
JQ
jq '. |= map(select(. | contains("#t") | not))'
Error: and string ("#t") cannot have their containment checked
JTC
<file jtc -w'<process_state.#t*>l:'
No error but #t* fields still exist in json object.
Any help is much appreciated.
We're going to use test("^#t") instead of contains("#t") since we want to check for a leading #t. But that's not why you are getting that error. You have the wrong object on the left-hand side of |=, and map doesn't do what you think for objects.
Two solutions:
Using filtering (like you were attempting):
.process_state |= with_entries(
select( .key | test("^#t") | not )
)
Demo on jqplay
Using deletion:
.process_state |= del(
.[
keys_unsorted[] |
select( test("^#t") )
]
)
Demo on jqplay
This is correct jtc usage:
bash $ <file jtc -w'<^#t>L:' -p
{
"process_state": {
"#user_id": "john smith"
}
}
bash $

Using JQ to merge two JSON snippets from one file

I've got output from a script that outputs two structurally identical JSON snippets into one file:
{
"Objects": [
{
"Key": "somevalue",
"VersionId": "someversion"
}
],
"Quiet": false
}
{
"Objects": [
{
"Key": "someothervalue",
"VersionId": "someotherversion"
}
],
"Quiet": false
}
I would like to pass this output through JQ to have one Objects[] list, concatenating all of the objects within the two lists, and outputting the same overall structure. I can accomplish it with piping between two separate JQ commands:
jq '.Objects[]' inputfile | jq -s '{"Objects":., "Quiet":false}' -
But I'm wondering if there is a more elegant way to do so using only one invocation of JQ.
I'm currently using JQ version 1.5 but can update if needed.
You don't need to invoke JQ twice there. The second object can be fetched using the input keyword.
.Objects += input.Objects
Online demo
You can use reduce:
jq -s 'reduce .[] as $item ({ Quiet: false }; .Objects += $item.Objects)'
See it in action.
As #oguz-ismail suggested in a comment, the -s (slurp) flag can be removed by using inputs to get the rest of the entries after the first one:
jq 'reduce inputs as $item (.; .Objects += $item.Objects)'
See it in action.
Both versions work with any number of entries in the input (the second version requires at least one).

jq streaming - filter nested list and retain global structure

In a large json file, I want to remove some elements from a nested list, but keep the overall structure of the document.
My example input it this (but the real one is large enough to demand streaming).
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this":
[
{"keep" : "true"},
{
"keep": "true",
"extra": "keeper"
} ,
{
"keep": "false",
"extra": "non-keeper"
}
]
}
The required output just has one element of the 'filter_this' block removed:
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this":
[
{"keep" : "true"},
{
"keep": "true",
"extra": "keeper"
} ,
]
}
The standard way to handle such cases appears to be using 'truncate_stream' to reconstitute streamed objects, before filtering those in the usual jq way. Specifically, the command:
jq -nc --stream 'fromstream(1|truncate_stream(inputs))'
gives access to a stream of objects:
{"keep_this":["this","list"]}
[{"keep":"true"},{"keep":"true","extra":"keeper"},
{"keep":"false","extra":"non-keeper"}]
at which point it is easy to filter for the required objects. However, this strips the results from the context of their parent object, which is not what I want.
Looking at the streaming structure:
[["keep_untouched","keep_this",0],"this"]
[["keep_untouched","keep_this",1],"list"]
[["keep_untouched","keep_this",1]]
[["keep_untouched","keep_this"]]
[["filter_this",0,"keep"],"true"]
[["filter_this",0,"keep"]]
[["filter_this",1,"keep"],"true"]
[["filter_this",1,"extra"],"keeper"]
[["filter_this",1,"extra"]]
[["filter_this",2,"keep"],"false"]
[["filter_this",2,"extra"],"non-keeper"]
[["filter_this",2,"extra"]]
[["filter_this",2]]
[["filter_this"]]
it seems I need to select all the 'filter_this' rows, truncate those rows only (using 'truncate_stream'), rebuild these rows as objects (using 'from_stream'), filter them, and turn the objects back into the stream data format (using 'tostream') to join the stream of 'keep untouched' rows, which are still in the streaming format. At that point it would be possible to re-build the whole json. If that is the right approach - which seems overly converluted to me - how do I do that? Or is there a better way?
If your input file consists of a single very large JSON entity that is too big for the regular jq parser to handle in your environment, then there is the distinct possibility that you won't have enough memory to reconstitute the JSON document.
With that caveat, the following may be worth a try. The key insight is that reconstruction can be accomplished using reduce.
The following uses a bunch of temporary files for the sake of clarity:
TMP=/tmp/$$
jq -c --stream 'select(length==2)' input.json > $TMP.streamed
jq -c 'select(.[0][0] != "filter_this")' $TMP.streamed > $TMP.1
jq -c 'select(.[0][0] == "filter_this")' $TMP.streamed |
jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
| .filter_this |= map(select(.keep=="true"))
| tostream
| select(length==2)' > $TMP.2
# Reconstruction
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' $TMP.1 $TMP.2
Output
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this": [
{
"keep": "true"
},
{
"keep": "true",
"extra": "keeper"
}
]
}
Many thanks to #peak. I found his approach really useful, but unrealistic in terms of performance. Stealing some of #peak's ideas, though, I came up with the following:
Extract the 'parent' object:
jq -c --stream 'select(length==2)' input.json |
jq -c 'select(.[0][0] != "filter_this")' |
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' > $TMP.parent
Extract the 'keepers' - though this means reading the file twice (:-<):
jq -nc --stream '[fromstream(2|truncate_stream(inputs))
| select(type == "object" and .keep == "true")]
' input.json > $TMP.keepers
Insert the filtered list into the parent object.
jq -nc -s 'inputs as $items
| $items[0] as $parent
| $parent
| .filter_this |= $items[1]
' $TMP.parent $TMP.keepers > result.json
Here is a simplified version of #PeteC's script. It requires one fewer invocations of jq.
In both cases, please note that the invocation of jq that uses "2|truncate_stream(_)" requires a more recent version of jq than 1.5.
TMP=/tmp/$$
INPUT=input.json
# Extract all but .filter_this
< $INPUT jq -c --stream 'select(length==2 and .[0][0] != "filter_this")' |
jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
' > $TMP.parent
# Need jq > 1.5
# Extract the 'keepers'
< $INPUT jq -n -c --stream '
[fromstream(2|truncate_stream(inputs))
| select(type == "object" and .keep == "true")]
' $INPUT > $TMP.keepers
# Insert the filtered list into the parent object:
jq -s '. as $in | .[0] | (.filter_this |= $in[1])
' $TMP.parent $TMP.keepers > result.json

Flatten nested JSON using jq

I'd like to flatten a nested json object, e.g. {"a":{"b":1}} to {"a.b":1} in order to digest it in solr.
I have 11 TB of json files which are both nested and contains dots in field names, meaning not elasticsearch (dots) nor solr (nested without the _childDocument_ notation) can digest it as is.
The other solutions would be to replace dots in the field names with underscores and push it to elasticsearch, but I have far better experience with solr therefore I prefer the flatten solution (unless solr can digest those nested jsons as is??).
I will prefer elasticsearch only if the digestion process will take far less time than solr, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in python).
Kindly help.
EDIT:
I think the pair of examples 3&4 solves this for me:
https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/
I'll try soon.
You can also use the following jq command to flatten nested JSON objects in this manner:
[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
This is just a variant of Santiago's jq:
. as $in
| reduce leaf_paths as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })
It avoids the overhead of the key/value construction and destruction.
(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)
Two important points about both these jq solutions:
Arrays are also flattened.
E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:
{
"a.b.0": 0,
"a.b.1": 1,
"a.b.2": 2
}
If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:
{"a.b":0, "a": {"b": 1}}
Here is a solution that uses tostream, select, join, reduce and setpath
reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
{}
; setpath($p; $v)
)
I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.
$ jq . test/odd-values.json
{
"one": {
"start-string": "foo",
"null-value": null,
"integer-number": 101
},
"two": [
{
"two-a": {
"non-integer-number": 101.75,
"number-zero": 0
},
"true-boolean": true,
"two-b": {
"false-boolean": false
}
}
],
"three": {
"empty-string": "",
"empty-object": {},
"empty-array": []
},
"end-string": "bar"
}
$ jqg . test/odd-values.json
{
"one.start-string": "foo",
"one.null-value": null,
"one.integer-number": 101,
"two.0.two-a.non-integer-number": 101.75,
"two.0.two-a.number-zero": 0,
"two.0.true-boolean": true,
"two.0.two-b.false-boolean": false,
"three.empty-string": "",
"three.empty-object": {},
"three.empty-array": [],
"end-string": "bar"
}
jqg was tested using jq 1.6
Note: I am the author of the jqg script.
As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d #json_file does just this:
{
"a.b":[1],
"id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
"_version_":1535841499921514496
}
EDIT 1: solr 6.0.1 with bin/solr -e cloud. collection name is flat, all the rest are default (with data-driven-schema which is also default).
EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d #{} \;.
EDIT 3: Is is also possible to parallel with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d #-" where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P it created duplicates in the collection)
As #hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars), it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, or null otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:
. as $in
| reduce paths(type != "object" and type != "array") as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })