jq add list to object list until condition - json

Background
I have an object with each value being a nested list of only strings. For each string value within the nested list, look up the string value within the object and add all of its values into the current value.
Here's what I have so far:
#!/bin/bash
in=$(jq -n '{
"bar": [["re", "de"]],
"do": [["bar","baz"]],
"baz": [["re"]],
"re": [["zoo"]]
}')
echo "expected:"
jq -n '{
"bar": [["re", "de"], ["zoo"]],
"do": [["bar","baz"], ["re", "de"], ["re"], ["zoo"]],
"baz": [["re"], ["zoo"]],
"re": [["zoo"]]
}'
echo "actual:"
echo ${in} | jq '. as $origin
| map_values( . +
until(
length == 0;
(. | flatten | map($origin[.]) | map(select( . != [[]] )) | add )
)
)'
Problem:
The output is the exact same as the input $in. If the until() function is removed from the statement, then the output correctly outputs one iteration. Although I want to recursively lookup the output strings within the object and add the lookup value until the lookup value is empty or non-existing.
For example, the key do has a value of [["bar","baz"]]. If we iterate through the values of do we come across baz. The value of baz within the object is [["re"]]. Add baz's value ["re"] to do so that do equals: [["bar","baz"], ["re"]]. Since re IS a key within the object, add the value of ["re"] which is ["zoo"]. Since ["zoo"] is NOT a key within the object finish baz and continue to the next key within the object.

The following solves the problem as originally stated, but the "expected" output as shown does not quite match the stated problem.
echo ${in} | jq -c '
. as $dict
| map_values(reduce (..|strings) as $v (.;
. + $dict[$v] ))
'
produces (after some manual reformatting for clarity):
{"bar":[["re","de"],["zoo"]],
"do":[["bar","baz"],["re","de"],["re"]],
"baz":[["re"],["zoo"]],"re":[["zoo"]]}
If some kind of recursive lookup is needed, then please reformulate the problem statement, being sure to avoid infinite loops.

Related

Get value if object or string if string in jq array

I have a JSON object that looks like this:
[{"name":"NAME_1"},"NAME_2"]
I would like an output of
["NAME_1", "NAME_2"]
Some of the entries in the array are an object with a key "name" and some are just a string of the name. I am trying to extract an array of the names. Using
jq -cr '.[].name // []'
throws an error as it is trying to index .name of the string object. Is there a way to check if it is a string, and if so just use its value instead of .name?
echo '[{"name":"NAME_1"},"NAME_2"]' \
| jq '[ .[] | if (.|type) == "object" then .name else . end ]'
[
"NAME_1"
"NAME_2"
]
Ref:
https://stedolan.github.io/jq/manual/#ConditionalsandComparisons
https://stedolan.github.io/jq/manual/#type
As #LĂ©aGris comments, a simpler version
jq '[ .[] | .name? // . ]' file
https://stedolan.github.io/jq/manual/#ErrorSuppression/OptionalOperator:%3f
https://stedolan.github.io/jq/manual/#Alternativeoperator://
You can use the type function which returns "object" for objects.
jq '.[] | if type == "object" then .name else . end' file.json
To get the output as array, just wrap the whole expression into [ ... ].
Just use the error suppression operator with ?, map and scalars
jq 'map( .name?, scalars )'
Note that by using scalars, it is assumed that other than objects with name, all others are names of form NAME_*. If there are other strings as well, and you need to exclude some of them you might need to add some additional logic to do that. e.g. using startswith(..) with a string of your choice.
map( .name?, select( scalars | startswith("NAME") ) )
Demo
With your shown samples only, please try following jq code. Using tostream function here to get the required values from requirement.
jq -c '[.[] | tostream | if .[1] != null then .[1] else empty end]' Input_file

Update values in json (array elements) using jq in loop

I have a input json file with an array. I need to updated two values (ver & date) in each array element. I could come up with below script but need help. I have hardcoded the ver & date to simplify the script.
input.json
[
{
"svcname": "svc1",
"repo": "https://repo.mycom.org/repocontext/svc1-list",
"ver": "0.1",
"date": "2019-11-05"
},
{
"svcname": "svc1",
"repo": "https://repo.mycom.org/repocontext/svc1-list",
"ver": "0.1",
"date": "2019-12-21"
}
]
Script:
#!/bin/bash
set +x
injson=input.json
updatedjson=$(jq .[] ${injson})
services=$(cat ${injson} | jq '.[] | .svcname' | tr -d \")
i=1
for svc in $services; do
echo "==>$svc"
echo "======> input json=${updatedjson}"
echo "======> update ver=${i}"
updatedjson=$(echo ${updatedjson} | jq ". | select( .name ==\"$svc\").ver=\"$i\"" | jq . )
svcdate="2020-01-$i"
echo "======> update date=$svcdate"
updatedjson=$(echo ${updatedjson} | jq ". | select( .name ==\"$svc\").date=\"$svcdate\"" | jq . )
echo "============================================"
echo
i=`expr $i + 1`
done
echo "======= write to file ====="
echo ${updatedjson}
echo ${updatedjson} | jq . > outjson.json
You are not using the true features of jq. What you shown in a loop, iterating over all the JSON objects can be simply reduced to one reduce() construct that is sort of a for loop in jq given an initial value and runs the filter incrementally
jq 'reduce range(0, length) as $d (.; (.[$d].ver = ($d+1|tostring)) | (.[$d].date = "2020-01-\($d+1)")) '
A brief explanation of how it works
The range expression returns a list with numbers generated from 0 to upto the length of the objects in the array. For your given input it produces 0,1 which is assigned to d
The reduce expression given the input value . the whole JSON, runs by setting the values in each object indexed by $d. So .[$d].ver refers to the ver field in the zeroth index. This is done incrementally till all the objects are processed.
The same way the date field is modified using [$d].date with the value string prefixed (YYYY-MM-) and date is set accordingly.

JSON will not convert with jq in Unix

Having difficulties converting this JSON. It is multi-line similar to what is below. The example data at the bottom is what is reads as-is once unzipped.
An example of what has been tried:
jq -r '(([["user_id","server_received_time","app","device_carrier","$schema","city","uuid","event_time","platform","os_version","amplitude_id","processed_time","user_creation_time","version_name","ip_address","paying","dma","group_properties","user_properties","client_upload_time","$insert_id","event_type","library","amplitude_attribution_ids","device_type","device_manufacturer","start_version","location_lng","server_upload_time","event_id","location_lat","os_name","amplitude_event_type","device_brand","groups","event_properties","data","device_id","language","device_model","country","region","is_attribution_event","adid","session_id","device_family","sample_rate","idfa","client_event_time"]]) + [(.table.All[] | [.user_id,.server_received_time,.app,.device_carrier,.$schema,.city,.uuid,.event_time,.platform,.os_version,.amplitude_id,.processed_time,.user_creation_time,.version_name,.ip_address,.paying,.dma,.group_properties,.user_properties,.client_upload_time,.$insert_id,.event_type,.library,.amplitude_attribution_ids,.device_type,.device_manufacturer,.start_version,.location_lng,.server_upload_time,.event_id,.location_lat,.os_name,.amplitude_event_type,.device_brand,.groups,.event_properties,.data,.device_id,.language,.device_model,.country,.region,.is_attribution_event,.adid,.session_id,.device_family,.sample_rate,.idfa,.client_event_time])])[]|#csv' test.json > test.csv
As well as some other jq options. I need every column regardless of the value, and the values as-is. Does anyone have thoughts on why we are running into issues? One error we get is:
jq: error: try .["field"] instead of .field for unusually named fields at <top-level>, line 1:
Other jq lines have given the following error:
string (...) cannot be csv-formatted, only array
This is an excerpt from one of the JSON files:
{"groups":{},"country":"United States","device_id":"3d-88c-45-b6-ed81277eR","is_attribution_event":false,"server_received_time":"2019-12-17 17:29:11.113000","language":"English","event_time":"2019-12-17 17:27:49.047000","user_creation_time":"2019-11-08 13:15:32.919000","city":"Sure","uuid":"someID","device_model":"Windows","amplitude_event_type":null,"client_upload_time":"2019-12-17 17:29:21.958000","data":{},"library":"amplitude-js\/5.2.2","device_manufacturer":null,"dma":"Washington, DC (Townville, USA)","version_name":null,"region":"Virginia","group_properties":{},"location_lng":null,"device_family":"Windows","paying":null,"client_event_time":"2019-12-17 17:27:59.892000","$schema":12,"device_brand":null,"user_id":"email#gmail.com","event_properties":{"title":"Name","id":"1-253251","applicationName":"SomeName"},"os_version":"18","device_carrier":null,"server_upload_time":"2019-12-17 17:29:11.135000","session_id":1576603675620,"app":231165,"amplitude_attribution_ids":null,"event_type":"CHANGE_PERSPECTIVE","user_properties":{},"adid":null,"device_type":"Windows","$insert_id":"e308c923-d8eb-48c6-8ea5-600","event_id":24,"amplitude_id":515,"processed_time":"2019-12-17 17:29:12.760372","platform":"Web","idfa":null,"os_name":"Edge","location_lat":null,"ip_address":"123.456.78.90","sample_rate":null,"start_version":null}
Thank you!
There are several problems with your attempt.
First, the keys with "$" in their names cannot be specified using the abbreviated .foo syntax; you could use .["$foo"] instead.
Second, #csv expects an array of atomic values. Thus the keys with JSON objects as values must be handled specially.
Third, the "+" is incorrect. The relevant connector here is ",".
With your sample JSON, the following will work:
(["user_id","server_received_time","app","device_carrier","$schema","city","uuid","event_time","platform","os_version","amplitude_id","processed_time","user_creation_time","version_name","ip_address","paying","dma","group_properties","user_properties","client_upload_time","$insert_id","event_type","library","amplitude_attribution_ids","device_type","device_manufacturer","start_version","location_lng","server_upload_time","event_id","location_lat","os_name","amplitude_event_type","device_brand","groups","event_properties","data","device_id","language","device_model","country","region","is_attribution_event","adid","session_id","device_family","sample_rate","idfa","client_event_time"]),
([.user_id,.server_received_time,.app,.device_carrier,.["$schema"],.city,.uuid,.event_time,.platform,.os_version,.amplitude_id,.processed_time,.user_creation_time,.version_name,.ip_address,.paying,.dma,.group_properties,.user_properties,.client_upload_time,.["$insert_id"],.event_type,.library,.amplitude_attribution_ids,.device_type,.device_manufacturer,.start_version,.location_lng,.server_upload_time,.event_id,.location_lat,.os_name,.amplitude_event_type,.device_brand,.groups,.event_properties,.data,.device_id,.language,.device_model,.country,.region,.is_attribution_event,.adid,.session_id,.device_family,.sample_rate,.idfa,.client_event_time]
| map(if type=="object"
then to_entries
| map( "\(.key):\(.value)" )
| join(";")
else . end))
| #csv
A less error-prone solution
Specifying the long list of keys twice makes the above solution error-prone. It would be better to specify the keys just once, and then programatically generate the rows.
Here's a utility function that can be used to this end:
def toa($headers):
. as $in | $headers | map($in[.]);
Or you could handle the object-valued keys inside toa:
def toa($headers):
def flat:
if type == "object" or type == "array"
then to_entries | map( "\(.key):\(.value)" ) | join(";")
else .
end;
. as $in | $headers | map($in[.] | flat);
JSONL
If the input is a stream of JSON objects of the type illustrated in the question, an efficient solution would use inputs with the -n command line option. This could be along the lines of:
print_header,
(inputs | print_row)

Using jq, Flatten Arbitrary JSON to Delimiter-Separated Flat Dictionary

I'm looking to transform JSON using jq to a delimiter-separated and flattened structure.
There have been attempts at this. For example, Flatten nested JSON using jq.
However the solutions on that page fail if the JSON contains arrays. For example, if the JSON is:
{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}
The solution above will fail to transform the above to:
{"a.b.0":1,"x.0.y":2,"x.1.z":3}
In addition, I'm looking for a solution that will also allow for an arbitrary delimiter. For example, suppose the space character is the delimiter. In this case, the result would be:
{"a b 0":1,"x 0 y":2,"x 1 z":3}
I'm looking to have this functionality accessed via a Bash (4.2+) function as is found in CentOS 7, something like this:
flatten_json()
{
local JSONData="$1"
# jq command to flatten $JSONData, putting the result to stdout
jq ... <<<"$JSONData"
}
The solution should work with all JSON data types, including null and boolean. For example, consider the following input:
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
It should produce:
{"a b 0":"p q r","w 0 x":null,"w 1 y":false,"w 2 z":3}
If you stream the data in, you'll get pairings of paths and values of all leaf values. If not a pair, then a path marking the end of a definition of an object/array at that path. Using leaf_paths as you found would only give you paths to truthy leaf values so you'll miss out on null or even false values. As a stream, you won't get this problem.
There are many ways this could be combined to an object, I'm partial to using reduce and assignment in these situations.
$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
.[[$i[0][]|tostring]|join($delim)] = $i[1]
)' input.json
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
Here's the same solution broken up a bit to allow room for explanation of what's going on.
$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
[$i[0][]|tostring] as $path_as_strings
| ($path_as_strings|join($delim)) as $key
| $i[1] as $value
| .[$key] = $value
)' input.json
Converting the input to a stream with tostream, we'll receive multiple values of pairs/paths as input to our filter. With this, we can pass those multiple values into reduce which is designed to accept multiple values and do something with them. But before we do, we want to filter those pairs/paths by only the pairs (select(length==2)).
Then in the reduce call, we're starting with a clean object and assigning new values using a key derived from the path and the corresponding value. Remember that every value produced in the reduce call is used for the next value in the iteration. Binding values to variables doesn't change the current context and assignments effectively "modify" the current value (the initial object) and passes it along.
$path_as_strings is just the path which is an array of strings and numbers to just strings. [$i[0][]|tostring] is a shorthand I use as an alternative to using map when the array I want to map is not the current array. This is more compact since the mapping is done as a single expression. That instead of having to do this to get the same result: ($i[0]|map(tostring)). The outer parentheses might not be necessary in general but, it's still two separate filter expressions vs one (and more text).
Then from there we convert that array of strings to the desired key using the provided delimiter. Then assign the appropriate values to the current object.
The following has been tested with jq 1.4, jq 1.5 and the current "master" version. The requirement about including paths to null and false is the reason for "allpaths" and "all_leaf_paths".
# all paths, including paths to null
def allpaths:
def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
path(conditional_recurse(.[]?)) | select(length > 0);
def all_leaf_paths:
def isscalar: type | (. != "object" and . != "array");
allpaths as $p
| select(getpath($p)|isscalar)
| $p ;
. as $in
| reduce all_leaf_paths as $path ({};
. + { ($path | map(tostring) | join($delim)): $in | getpath($path) })
With this jq program in flatten.jq:
$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
$ jq --arg delim . -f flatten.jq input.json
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
Collisions
Here is a helper function that illustrates an alternative path-flattening algorithm. It converts keys that contain the delimiter to quoted strings, and array elements are presented in square brackets (see the example below):
def flattenPath(delim):
reduce .[] as $s ("";
if $s|type == "number"
then ((if . == "" then "." else . end) + "[\($s)]")
else . + ($s | tostring | if index(delim) then "\"\(.)\"" else . end)
end );
Example: Using flattenPath instead of map(tostring) | join($delim), the object:
{"a.b": [1]}
would become:
{
"\"a.b\"[0]": 1
}
To add a new option to the solutions already given, jqg is a script I wrote to flatten any JSON file and then search it using a regex. For your purposes your regex would simply be '.' which would match everything.
$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg .
{
"a.b.0": 1,
"x.0.y": 2,
"x.1.z": 3
}
and can produce compact output:
$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg -q -c .
{"a.b.0":1,"x.0.y":2,"x.1.z":3}
It also handles the more complicated example that #peak used:
$ echo '{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}' | jqg .
{
"a.b.0": "p q r",
"w.0.x": null,
"w.1.y": false,
"w.2.z": 3
}
as well as empty arrays and objects (and a few other edge-case values):
$ jqg . test/odd-values.json
{
"one.start-string": "foo",
"one.null-value": null,
"one.integer-number": 101,
"two.two-a.non-integer-number": 101.75,
"two.two-a.number-zero": 0,
"two.true-boolean": true,
"two.two-b.false-boolean": false,
"three.empty-string": "",
"three.empty-object": {},
"three.empty-array": [],
"end-string": "bar"
}
(reporting empty arrays & objects can be turned off with the -E option).
jqg was tested with jq 1.6
Note : I am the author of the jqg script.

Flatten nested JSON using jq

I'd like to flatten a nested json object, e.g. {"a":{"b":1}} to {"a.b":1} in order to digest it in solr.
I have 11 TB of json files which are both nested and contains dots in field names, meaning not elasticsearch (dots) nor solr (nested without the _childDocument_ notation) can digest it as is.
The other solutions would be to replace dots in the field names with underscores and push it to elasticsearch, but I have far better experience with solr therefore I prefer the flatten solution (unless solr can digest those nested jsons as is??).
I will prefer elasticsearch only if the digestion process will take far less time than solr, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in python).
Kindly help.
EDIT:
I think the pair of examples 3&4 solves this for me:
https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/
I'll try soon.
You can also use the following jq command to flatten nested JSON objects in this manner:
[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
This is just a variant of Santiago's jq:
. as $in
| reduce leaf_paths as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })
It avoids the overhead of the key/value construction and destruction.
(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)
Two important points about both these jq solutions:
Arrays are also flattened.
E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:
{
"a.b.0": 0,
"a.b.1": 1,
"a.b.2": 2
}
If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:
{"a.b":0, "a": {"b": 1}}
Here is a solution that uses tostream, select, join, reduce and setpath
reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
{}
; setpath($p; $v)
)
I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.
$ jq . test/odd-values.json
{
"one": {
"start-string": "foo",
"null-value": null,
"integer-number": 101
},
"two": [
{
"two-a": {
"non-integer-number": 101.75,
"number-zero": 0
},
"true-boolean": true,
"two-b": {
"false-boolean": false
}
}
],
"three": {
"empty-string": "",
"empty-object": {},
"empty-array": []
},
"end-string": "bar"
}
$ jqg . test/odd-values.json
{
"one.start-string": "foo",
"one.null-value": null,
"one.integer-number": 101,
"two.0.two-a.non-integer-number": 101.75,
"two.0.two-a.number-zero": 0,
"two.0.true-boolean": true,
"two.0.two-b.false-boolean": false,
"three.empty-string": "",
"three.empty-object": {},
"three.empty-array": [],
"end-string": "bar"
}
jqg was tested using jq 1.6
Note: I am the author of the jqg script.
As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d #json_file does just this:
{
"a.b":[1],
"id":"24e3e780-3a9e-4fa7-9159-fc5294e803cd",
"_version_":1535841499921514496
}
EDIT 1: solr 6.0.1 with bin/solr -e cloud. collection name is flat, all the rest are default (with data-driven-schema which is also default).
EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d #{} \;.
EDIT 3: Is is also possible to parallel with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d #-" where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P it created duplicates in the collection)
As #hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars), it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, or null otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:
. as $in
| reduce paths(type != "object" and type != "array") as $path ({};
. + { ($path | map(tostring) | join(".")): $in | getpath($path) })