Using jq, Flatten Arbitrary JSON to Delimiter-Separated Flat Dictionary

I'm looking to transform JSON using jq to a delimiter-separated and flattened structure.
There have been attempts at this. For example, Flatten nested JSON using jq.
However the solutions on that page fail if the JSON contains arrays. For example, if the JSON is:
{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}
The solutions there fail to transform the above into:
{"a.b.0":1,"x.0.y":2,"x.1.z":3}
In addition, I'm looking for a solution that will also allow for an arbitrary delimiter. For example, suppose the space character is the delimiter. In this case, the result would be:
{"a b 0":1,"x 0 y":2,"x 1 z":3}
I'm looking to have this functionality accessed via a Bash (4.2+) function as is found in CentOS 7, something like this:
flatten_json()
{
    local JSONData="$1"
    # jq command to flatten $JSONData, putting the result to stdout
    jq ... <<<"$JSONData"
}
The solution should work with all JSON data types, including null and boolean. For example, consider the following input:
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
It should produce:
{"a b 0":"p q r","w 0 x":null,"w 1 y":false,"w 2 z":3}

If you stream the data in, you'll get pairs of paths and values for all leaf values; any stream event that is not a pair is a lone path marking the end of an object or array definition at that path. Using leaf_paths as you found would only give you the paths to truthy leaf values, so you'd miss null and even false values. As a stream, you won't have this problem.
There are many ways this could be combined into an object; I'm partial to using reduce and assignment in these situations.
$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
    .[[$i[0][]|tostring]|join($delim)] = $i[1]
  )' input.json
{
  "a.b.0": "p q r",
  "w.0.x": null,
  "w.1.y": false,
  "w.2.z": 3
}
Here's the same solution broken up a bit to allow room for explanation of what's going on.
$ jq --arg delim '.' 'reduce (tostream|select(length==2)) as $i ({};
    [$i[0][]|tostring] as $path_as_strings
    | ($path_as_strings|join($delim)) as $key
    | $i[1] as $value
    | .[$key] = $value
  )' input.json
Converting the input to a stream with tostream, we'll receive multiple values of pairs/paths as input to our filter. With this, we can pass those multiple values into reduce, which is designed to accept multiple values and do something with them. But before we do, we want to filter those pairs/paths down to only the pairs (select(length==2)).
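For reference, here's what the stream looks like for the sample input; the single-element arrays are the markers for the ends of objects and arrays (a quick check, output per jq 1.5/1.6):
$ jq -c 'tostream' input.json
[["a","b",0],"p q r"]
[["a","b",0]]
[["a","b"]]
[["w",0,"x"],null]
[["w",0,"x"]]
[["w",1,"y"],false]
[["w",1,"y"]]
[["w",2,"z"],3]
[["w",2,"z"]]
[["w",2]]
[["w"]]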
Then in the reduce call, we're starting with a clean object and assigning new values using a key derived from the path and the corresponding value. Remember that every value produced in the reduce call is used for the next iteration. Binding values to variables doesn't change the current context, and assignments effectively "modify" the current value (the initial object) and pass it along.
$path_as_strings is just the path, which is an array of strings and numbers, converted to strings only. [$i[0][]|tostring] is a shorthand I use as an alternative to map when the array I want to map is not the current input. It is more compact since the mapping is done as a single expression, instead of having to write ($i[0]|map(tostring)) to get the same result. The outer parentheses might not be necessary in general, but it's still two separate filter expressions vs one (and more text).
Then from there we convert that array of strings to the desired key using the provided delimiter, and assign the appropriate value to the current object.
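Putting it together as the Bash function requested in the question (a sketch; the optional second argument selecting the delimiter is my addition):
flatten_json()
{
    local JSONData="$1" delim="${2:-.}"
    jq --arg delim "$delim" 'reduce (tostream|select(length==2)) as $i ({};
        .[[$i[0][]|tostring]|join($delim)] = $i[1]
    )' <<<"$JSONData"
}
$ flatten_json '{"a":{"b":["p q r"]}}' ' '
{
  "a b 0": "p q r"
}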

The following has been tested with jq 1.4, jq 1.5 and the current "master" version. The requirement about including paths to null and false is the reason for "allpaths" and "all_leaf_paths".
# all paths, including paths to null
def allpaths:
  def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);

def all_leaf_paths:
  def isscalar: type | (. != "object" and . != "array");
  allpaths as $p
  | select(getpath($p)|isscalar)
  | $p;

. as $in
| reduce all_leaf_paths as $path ({};
    . + { ($path | map(tostring) | join($delim)): $in | getpath($path) })
With this jq program in flatten.jq:
$ cat input.json
{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}
$ jq --arg delim . -f flatten.jq input.json
{
  "a.b.0": "p q r",
  "w.0.x": null,
  "w.1.y": false,
  "w.2.z": 3
}
Collisions
Here is a helper function that illustrates an alternative path-flattening algorithm. It converts keys that contain the delimiter to quoted strings, and array elements are presented in square brackets (see the example below):
def flattenPath(delim):
  reduce .[] as $s ("";
    if $s|type == "number"
    then ((if . == "" then "." else . end) + "[\($s)]")
    else . + ($s | tostring | if index(delim) then "\"\(.)\"" else . end)
    end );
Example: Using flattenPath instead of map(tostring) | join($delim), the object:
{"a.b": [1]}
would become:
{
  "\"a.b\"[0]": 1
}
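To wire this into the earlier program, one could swap it in for the key construction in the reduce (a sketch; the name flatten_collision_safe is mine, and it assumes the all_leaf_paths definition given above):
def flatten_collision_safe(delim):
  . as $in
  | reduce all_leaf_paths as $path ({};
      . + { ($path | flattenPath(delim)): ($in | getpath($path)) });
# e.g. {"a.b": [1]} | flatten_collision_safe(".")  yields  {"\"a.b\"[0]": 1}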

To add a new option to the solutions already given, jqg is a script I wrote to flatten any JSON file and then search it using a regex. For your purposes your regex would simply be '.' which would match everything.
$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg .
{
  "a.b.0": 1,
  "x.0.y": 2,
  "x.1.z": 3
}
and can produce compact output:
$ echo '{"a":{"b":[1]},"x":[{"y":2},{"z":3}]}' | jqg -q -c .
{"a.b.0":1,"x.0.y":2,"x.1.z":3}
It also handles the more complicated example that @peak used:
$ echo '{"a":{"b":["p q r"]},"w":[{"x":null},{"y":false},{"z":3}]}' | jqg .
{
  "a.b.0": "p q r",
  "w.0.x": null,
  "w.1.y": false,
  "w.2.z": 3
}
as well as empty arrays and objects (and a few other edge-case values):
$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.two-a.non-integer-number": 101.75,
  "two.two-a.number-zero": 0,
  "two.true-boolean": true,
  "two.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}
(reporting empty arrays & objects can be turned off with the -E option).
jqg was tested with jq 1.6.
Note: I am the author of the jqg script.


How to convert arbitrary nested JSON to CSV with jq – so you can convert it back?

How do I use jq to convert an arbitrary JSON array of objects to CSV, while objects in this array are nested?
StackOverflow has a sea of questions/answers where specific input or output fields are referenced, but I'd like to have a generic solution that
1. includes a header row,
2. works for any JSON input including nested arrays + objects,
3. allows records that have missing values for keys that are present in other records,
4. does not hard-code any field names,
5. allows converting the CSV back into the nested JSON structure if needed, and
6. uses key paths as header names (see the following description).
Dot notation
Many JSON-using products (like CouchDB, MongoDB, …) and libraries (like Lodash, …) use variations of a syntax that allows access to nested property values / subfields by joining key fragments with a character, often a dot (‘dot notation’).
An example of a key path like this would be "a.b.0.c" to refer to the deeply nested property in this JSON snippet:
{
  "a": {
    "b": [
      {
        "c": 123
      }
    ]
  }
}
Caveat: Using this method is a pragmatic solution for most cases, but means that either dot characters have to be banned in property names, or a more complex (and definitely never used property name) has to be invented for escaping dots in property names / accessing nested fields. MongoDB simply banned usage of "." in documents until v5.0, some libraries have workarounds for field access (Lodash example).
Despite this, for simplicity, a solution should use the described dot syntax in the CSV output’s header for nested properties. Bonus if there is a solution variant that solves this problem, e.g. with JSONPath.
Example JSON array as input
[
  {
    "a": {
      "b": [
        {
          "c": 123
        }
      ]
    }
  },
  {
    "a": {
      "b": [
        {
          "c": "foo \" bar",
          "d": "qux"
        }
      ]
    }
  },
  {
    "a": {
      "b": [
        {
          "d": 456
        }
      ]
    }
  }
]
Example CSV output
The output should have a header that includes all fields (even if the first object in the array does not have values for all existing key paths).
To make the output intuitively editable by humans, each row should represent one object in the input array.
The expected output should look like this:
"a.b.0.c","a.b.0.d"
123,
"foo "" bar","qux"
,456
Command line
This is what I need:
cat example.json | jq <MISSING CODE HERE>
Solution 1, using dot notation
Here is the jq call to convert your array of nested JSON objects to CSV:
jq -r '(. | map(leaf_paths) | unique) as $cols | map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]'
The fastest way to try this solution out is to use JQPlay.
The CSV output will have a header row. It will contain all properties that exist anywhere in the input objects, including nested ones, in dot notation. Each input array element will be represented as a single row, properties that are missing will be represented as empty CSV fields.
Using solution 1 in bash or a similar shell
Create the JSON input file…
echo '[{"a": {"b": [{"c": 123}]}},{"a": {"b": [{"c": "foo \" bar","d": "qux"}]}},{"a": {"b": [{"d": 456}]}}]' > example.json
Then use this jq command to output the CSV on the standard output:
cat example.json | jq -r '(. | map(leaf_paths) | unique) as $cols | map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]'
…or write the output to example.csv:
cat example.json | jq -r '(. | map(leaf_paths) | unique) as $cols | map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | map(tostring) | join(".")))] + $rows) | map(@csv) | .[]' > example.csv
Converting the data from solution 1 back to JSON
Here is a Node.js example that you can try on RunKit. It converts a CSV generated with the method in solution 1 back to an array of nested JSON objects.
Explanation for solution 1
Here is a longer, commented version of the jq filter.
# 1) Find all unique leaf property names of all objects in the input array. Each nested property name is an array with the components of its key path, for example ["a", 0, "b"].
(. | map(leaf_paths) | unique) as $cols |
# 2) Use the found key paths to determine all (nested) property values in the given input records.
map(. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows |
# 3) Create the raw output array of rows. Each row is represented as an array of values, one element per existing column.
(
    # 3.1) This represents the header row. Key paths are generated here.
    [($cols | map(. | map(tostring) | join(".")))]
    + # 3.2) concatenate the header row with all other rows
    $rows
)
# 4) Convert each row to an escaped CSV string.
| map(@csv)
# 5) Output each array element directly. Without this, the result would be a JSON array of CSV strings.
| .[]
Solution 2: for input that does have dots in property names
If you do need to support dot characters in property names, you can either use a different separator string for the key path syntax (replace the dot in "." with something else), or replace the map(tostring) | join(".") part with tostring - this yields each key path serialized as a JSON array string, so no dot notation is needed. Here is a JQPlay with this solution variant.
Full jq command:
jq -r '(. | map(leaf_paths) | unique) as $cols | map (. as $row | ($cols | map(. as $col | $row | getpath($col)))) as $rows | ([($cols | map(. | tostring))] + $rows) | map(@csv) | .[]'
The output CSV for this variant would then look like this – it’s less readable and not useful for cases where you want humans to intuitively understand the CSV’s header:
"[""a"",""b"",0,""c""]","[""a"",""b"",0,""d""]"
123,
"foo "" bar","qux"
,456
See below for an idea how to convert this format back to a representation in your programming language.
Bonus: Converting the generated CSV back to JSON
If the input's nested properties contain no ".", it’s simple to convert the CSV back to JSON, for example with a library that supports dot notation, or with JSONPath.
JavaScript: Use Lodash's _.set()
Other languages: Find a package/library that implements JSONPath and use selectors like $.a.b.0.c or $['a']['b'][0]['c'] to set each nested property of each record.
Solution 2 (with JSON arrays as headers) allows you to interpret the headers as JSON array strings. Then you can generate a JSON Path from each header, and re-create all records/objects:
"[""a"",""b"",0,""c""]" (CSV)
→ ["a","b",0,"c"] (array of key-path components after unescaping and parsing as JSON)
→ $["a"]["b"][0]["c"] (JSONPath)
→ { a: { b: [{c: … }] } } (Nested regenerated object)
I've written an example Node.js script to convert a CSV like this back to JSON. You can try solution 2 in RunKit.
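If you'd rather stay in jq for the reverse direction, note that each solution-2 header is itself valid JSON, so fromjson turns it back into a path that setpath can consume (a sketch, not part of the original answer; 123 stands in for the cell value):
$ jq -n '"[\"a\",\"b\",0,\"c\"]" | fromjson as $p | null | setpath($p; 123)'
{
  "a": {
    "b": [
      {
        "c": 123
      }
    ]
  }
}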
The following tocsv and fromcsv functions provide a solution to the stated problem except for one complication regarding requirement (6) concerning the headers. Essentially, this requirement can be met using the functions given here by adding a matrix transposition step.
Whether or not a transposition step is added, the advantage of the approach taken here is that there are no restrictions on the JSON keys or values. In particular, they may contain periods (dots), newlines and/or NUL characters.

In the example, an array of objects is given, but in fact any stream of valid JSON documents could be used as input to tocsv; thanks to the magic of jq, the original stream will be recreated by fromcsv (in the sense of entity-by-entity equality).

Of course, since there is no CSV standard, the CSV produced by the tocsv function might not be understood by all CSV processors. In particular, please note that the tocsv function defined here maps embedded newlines in JSON strings or key names to the two-character string "\n" (i.e., a literal backslash followed by the letter "n"); the inverse operation performs the inverse translation to meet the "round-trip" requirement.

(The use of tail is just to simplify the presentation; it would be trivial to modify the solution to make it jq-only.)

The CSV is generated on the assumption that any value can be included in a field so long as (a) the field is quoted, and (b) double-quotes within the field are doubled.

Any generic solution that supports "round-trips" is bound to be somewhat complicated. The main reason why the solution presented here is more complex than one might expect is that a third column is added, partly to make it easy to distinguish between integers and integer-valued strings, but mainly because it makes it easy to distinguish between the size-1 and size-2 arrays produced by jq's --stream option. Needless to say, there are other ways these issues could be addressed; the number of calls to jq could also be reduced.
The solution is presented as a test script that checks the round-trip requirement on a telling test case:
#!/bin/bash

function json {
    cat <<EOF
[
  {
    "a": 1,
    "b": [
      1,
      2,
      "1"
    ],
    "c": "d\",ef",
    "embed\"ed": "quote",
    "null": null,
    "string": "null",
    "control characters": "a\u0000c",
    "newline": "a\nb"
  },
  {
    "x": 1
  }
]
EOF
}

function tocsv {
    jq -ncr --stream '
      (["path", "value", "stringp"],
       (inputs | . + [.[1]|type=="string"]))
      | map( tostring|gsub("\"";"\"\"") | gsub("\n"; "\\n"))
      | "\"\(.[0])\",\"\(.[1])\",\(.[2])"
    '
}

function fromcsv {
    tail -n +2 |  # first duplicate backslashes and deduplicate double-quotes
    jq -rR '"[\(gsub("\\\\";"\\\\") | gsub("\"\"";"\\\"") ) ]"' |
    jq -c '.[2] as $s
      | .[0] |= fromjson
      | .[1] |= if $s then . else fromjson end
      | if $s == null then [.[0]] else .[:-1] end
      # handle newlines
      | map(if type == "string" then gsub("\\\\n";"\n") else . end)' |
    jq -n 'fromstream(inputs)'
}
# Check the roundtrip:
json | tocsv | fromcsv | jq -s '.[0] == .[1]' - <(json)
Here is the CSV that would be produced by json | tocsv, except that SO seems to disallow literal NULs, so I have replaced that by \0:
"path","value",stringp
"[0,""a""]","1",false
"[0,""b"",0]","1",false
"[0,""b"",1]","2",false
"[0,""b"",2]","1",true
"[0,""b"",2]","false",null
"[0,""c""]","d"",ef",true
"[0,""embed\""ed""]","quote",true
"[0,""null""]","null",false
"[0,""string""]","null",true
"[0,""control characters""]","a\0c",true
"[0,""newline""]","a\nb",true
"[0,""newline""]","false",null
"[1,""x""]","1",false
"[1,""x""]","false",null
"[1]","false",null

jq add list to object list until condition

Background
I have an object in which each value is a nested list containing only strings. For each string value within the nested list, look up that string as a key in the object and add all of its values to the current value.
Here's what I have so far:
#!/bin/bash
in=$(jq -n '{
    "bar": [["re", "de"]],
    "do": [["bar","baz"]],
    "baz": [["re"]],
    "re": [["zoo"]]
}')
echo "expected:"
jq -n '{
    "bar": [["re", "de"], ["zoo"]],
    "do": [["bar","baz"], ["re", "de"], ["re"], ["zoo"]],
    "baz": [["re"], ["zoo"]],
    "re": [["zoo"]]
}'
echo "actual:"
echo "$in" | jq '. as $origin
  | map_values( . +
      until(
        length == 0;
        (. | flatten | map($origin[.]) | map(select( . != [[]] )) | add )
      )
  )'
Problem:
The output is exactly the same as the input $in. If the until() function is removed from the statement, then the output correctly shows one iteration of the lookup. However, I want to recursively look up the output strings within the object and add the looked-up values until the lookup value is empty or non-existent.
For example, the key do has a value of [["bar","baz"]]. If we iterate through the values of do, we come across baz. The value of baz within the object is [["re"]], so add baz's value ["re"] to do, making do equal to [["bar","baz"], ["re"]]. Since re IS a key within the object, add the value of re, which is ["zoo"]. Since zoo is NOT a key within the object, finish with baz and continue to the next key within the object.
The following solves the problem as originally stated, but the "expected" output as shown does not quite match the stated problem.
echo "$in" | jq -c '
  . as $dict
  | map_values(reduce (..|strings) as $v (.;
      . + $dict[$v] ))
'
produces (after some manual reformatting for clarity):
{"bar":[["re","de"],["zoo"]],
"do":[["bar","baz"],["re","de"],["re"]],
"baz":[["re"],["zoo"]],"re":[["zoo"]]}
If some kind of recursive lookup is needed, then please reformulate the problem statement, being sure to avoid infinite loops.

create json from bash variable and associative array [duplicate]

This question already has answers here: Constructing a JSON object from a bash associative array.
Let's say I have the following declared in bash:
mcD="had_a_farm"
eei="eeieeio"
declare -A animals=( ["duck"]="quack_quack" ["cow"]="moo_moo" ["pig"]="oink_oink" )
and I want the following json:
{
  "oldMcD": "had a farm",
  "eei": "eeieeio",
  "onThisFarm": [
    {
      "duck": "quack_quack",
      "cow": "moo_moo",
      "pig": "oink_oink"
    }
  ]
}
Now I know I could do this with an echo, printf, or by assigning text to a variable, but let's assume animals is actually very large and it would be onerous to do so. I could also loop through my variables and associative array and build up a variable as I go. I could write either of these solutions, but both seem like the "wrong way". Not to mention it's obnoxious to deal with the last item in animals, after which I do not want a ",".
I'm thinking the right solution uses jq, but I'm having a hard time finding much documentation and examples on how to use this tool to write JSON (especially nested JSON) rather than parse it.
Here is what I came up with:
jq -n --arg mcD "$mcD" --arg eei "$eei" --arg duck "${animals['duck']}" --arg cow "${animals['cow']}" --arg pig "${animals['pig']}" '{onThisFarm:[ { pig: $pig, cow: $cow, duck: $duck } ], eei: $eei, oldMcD: $mcD }'
Produces the desired result. In reality, I don't really care about the order of the keys in the JSON, but it's still annoying that the input for jq has to go backwards to get it in the desired order. Regardless, this solution is clunky and was not any easier to write than simply declaring a string variable that looks like JSON (and it would be impossible with larger associative arrays). How can I build JSON like this in an efficient, logical manner?
Thanks!
Assuming that none of the keys or values in the "animals" array contains newline characters:
for i in "${!animals[@]}"
do
    printf "%s\n%s\n" "${i}" "${animals[$i]}"
done | jq -nR --arg oldMcD "$mcD" --arg eei "$eei" '
  def to_o:
    . as $in
    | reduce range(0; length; 2) as $i ({};
        .[$in[$i]] = $in[$i+1]);
  {$oldMcD,
   $eei,
   onthisfarm: [inputs] | to_o}
'
Notice the trick whereby {$oldMcD} in effect expands to {"oldMcD": $oldMcD}.
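A quick illustration of that shorthand:
$ jq -n --arg oldMcD "had_a_farm" '{$oldMcD}'
{
  "oldMcD": "had_a_farm"
}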
Using "\u0000" as the separator
If any of the keys or values contains a newline character, you could tweak the above so that "\u0000" is used as the separator:
for i in "${!animals[@]}"
do
    printf "%s\0%s\0" "${i}" "${animals[$i]}"
done | jq -sR --arg oldMcD "$mcD" --arg eei "$eei" '
  def to_o:
    . as $in
    # the final NUL leaves a trailing empty string after split, so stop at length-1
    | reduce range(0; length-1; 2) as $i ({};
        .[$in[$i]] = $in[$i+1]);
  {$oldMcD,
   $eei,
   onthisfarm: split("\u0000") | to_o }
'
Note: The above assumes jq version 1.5 or later.
You can emit the associative array's keys and values with a for loop and pipe them to jq:
for i in "${!animals[@]}"; do
    echo "$i"
    echo "${animals[$i]}"
done |
jq -n -R --arg mcD "$mcD" --arg eei "$eei" 'reduce inputs as $i ({onThisFarm: [], mcD: $mcD, eei: $eei}; .onThisFarm[0] += {($i): (input | tonumber? // .)})'
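For reference, this should produce output along the following lines (the order of the animal keys depends on bash's unspecified associative-array iteration order):
{
  "onThisFarm": [
    {
      "duck": "quack_quack",
      "cow": "moo_moo",
      "pig": "oink_oink"
    }
  ],
  "mcD": "had_a_farm",
  "eei": "eeieeio"
}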

Constructing a JSON object from a bash associative array

I would like to convert an associative array in bash to a JSON hash/dict. I would prefer to use jq to do this as it is already a dependency and I can rely on it to produce well-formed JSON. Could someone demonstrate how to achieve this?
#!/bin/bash
declare -A dict=()
dict["foo"]=1
dict["bar"]=2
dict["baz"]=3
for i in "${!dict[@]}"
do
    echo "key : $i"
    echo "value: ${dict[$i]}"
done
echo 'desired output using jq: { "foo": 1, "bar": 2, "baz": 3 }'
There are many possibilities, but given that you already have written a bash for loop, you might like to begin with this variation of your script:
#!/bin/bash
# Requires bash with associative arrays
declare -A dict
dict["foo"]=1
dict["bar"]=2
dict["baz"]=3
for i in "${!dict[@]}"
do
    echo "$i"
    echo "${dict[$i]}"
done |
jq -n -R 'reduce inputs as $i ({}; . + { ($i): (input|(tonumber? // .)) })'
The result reflects the ordering of keys produced by the bash for loop:
{
  "bar": 2,
  "baz": 3,
  "foo": 1
}
In general, the approach based on feeding jq the key-value pairs, with one key on a line followed by the corresponding value on the next line, has much to recommend it. A generic solution following this general scheme, but using NUL as the "line-end" character, is given below.
Keys and Values as JSON Entities
To make the above more generic, it would be better to present the keys and values as JSON entities. In the present case, we could write:
for i in "${!dict[@]}"
do
    echo "\"$i\""
    echo "${dict[$i]}"
done |
jq -n 'reduce inputs as $i ({}; . + { ($i): input })'
Other Variations
JSON keys must be JSON strings, so it may take some work to ensure that the desired mapping from bash keys to JSON keys is implemented. Similar remarks apply to the mapping from bash array values to JSON values. One way to handle arbitrary bash keys would be to let jq do the conversion:
printf "%s" "$i" | jq -Rs .
You could of course do the same thing with the bash array values, and let jq check whether the value can be converted to a number or to some other JSON type as desired (e.g. using fromjson? // .).
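For example, fromjson? // . decodes values that are valid JSON and leaves everything else as a string:
$ jq -n '"42", "true", "plain text" | fromjson? // .'
42
true
"plain text"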
A Generic Solution
Here is a generic solution along the lines mentioned in the jq FAQ and advocated by @CharlesDuffy. It uses NUL as the delimiter when passing the bash keys and values to jq, and has the advantage of only requiring one call to jq. If desired, the filter fromjson? // . can be omitted or replaced by another one.
declare -A dict=( [$'foo\naha']=$'a\nb' [bar]=2 [baz]=$'{"x":0}' )
for key in "${!dict[@]}"; do
    printf '%s\0%s\0' "$key" "${dict[$key]}"
done |
jq -Rs '
  split("\u0000")
  | . as $a
  # the final NUL leaves a trailing empty element, hence (length-1)
  | reduce range(0; (length-1)/2) as $i
      ({}; . + {($a[2*$i]): ($a[2*$i + 1]|fromjson? // .)})'
Output:
{
  "foo\naha": "a\nb",
  "bar": 2,
  "baz": {
    "x": 0
  }
}
This answer is from nico103 on freenode #jq:
#!/bin/bash
declare -A dict=()
dict["foo"]=1
dict["bar"]=2
dict["baz"]=3
assoc2json() {
    declare -n v=$1
    printf '%s\0' "${!v[@]}" "${v[@]}" |
        jq -Rs 'split("\u0000")
                | . as $v
                # the final NUL leaves a trailing empty element, hence (length - 1)
                | ((length - 1) / 2) as $n
                | reduce range($n) as $idx ({}; .[$v[$idx]] = $v[$idx+$n])'
}
assoc2json dict
You can initialize a variable to an empty object {} and add the key/value pairs {($key): $value} at each iteration, re-injecting the result into the same variable:
#!/bin/bash
declare -A dict=()
dict["foo"]=1
dict["bar"]=2
dict["baz"]=3
data='{}'
for i in "${!dict[@]}"
do
    data=$(jq -n --arg data "$data" \
                 --arg key "$i" \
                 --arg value "${dict[$i]}" \
                 '$data | fromjson + { ($key) : ($value | tonumber) }')
done
echo "$data"
This has been posted, and credited to nico103 on IRC, which is to say, me.
The thing that scares me, naturally, is that these associative array keys and values need quoting. Here's a start that requires some additional work to dequote keys and values:
function assoc2json {
    typeset -n v=$1
    printf '%q\n' "${!v[@]}" "${v[@]}" |
        jq -Rcn '[inputs]
                 | . as $v
                 | (length / 2) as $n
                 | reduce range($n) as $idx ({}; .[$v[$idx]]=$v[$idx+$n])'
}
$ assoc2json a
{"foo\\ bar":"1","b":"bar\\ baz\\\"\\{\\}\\[\\]","c":"$'a\\nb'","d":"1"}
$
So now all that's needed is a jq function that removes the quotes, which come in several flavors:
if the string starts with a single-quote (ksh) then it ends with a single quote and those need to be removed
if the string starts with a dollar sign and a single-quote and ends in a double-quote, then those need to be removed and internal backslash escapes need to be unescaped
else leave as-is
I leave this last item as an exercise for the reader.
I should note that I'm using printf here as the iterator!
bash 5.2 introduces the @k parameter transformation, which makes this much easier:
$ declare -A dict=([foo]=1 [bar]=2 [baz]=3)
$ jq -n '[$ARGS.positional | _nwise(2) | {(.[0]): .[1]}] | add' --args "${dict[@]@k}"
{
  "foo": "1",
  "bar": "2",
  "baz": "3"
}
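For reference, the @k transformation expands the associative array to alternating key and value words, which is what _nwise(2) then pairs up (iteration order may vary):
$ printf '%s\n' "${dict[@]@k}"
foo
1
bar
2
baz
3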

Flatten nested JSON using jq

I'd like to flatten a nested json object, e.g. {"a":{"b":1}} to {"a.b":1} in order to digest it in solr.
I have 11 TB of json files which are both nested and contain dots in field names, meaning neither elasticsearch (dots) nor solr (nested without the _childDocument_ notation) can digest them as is.
The other solution would be to replace dots in the field names with underscores and push it to elasticsearch, but I have far better experience with solr, therefore I prefer the flattening solution (unless solr can digest those nested jsons as is??).
I will prefer elasticsearch only if the digestion process will take far less time than solr, because my priority is digesting as fast as I can (thus I chose jq instead of scripting it in python).
Kindly help.
EDIT:
I think the pair of examples 3&4 solves this for me:
https://lucidworks.com/blog/2014/08/12/indexing-custom-json-data/
I'll try soon.
You can also use the following jq command to flatten nested JSON objects in this manner:
[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries
The way it works is: leaf_paths returns a stream of arrays which represent the paths on the given JSON document at which "leaf elements" appear, that is, elements which do not have child elements, such as numbers, strings and booleans. We pipe that stream into objects with key and value properties, where key contains the elements of the path array as a string joined by dots and value contains the element at that path. Finally, we put the entire thing in an array and run from_entries on it, which transforms an array of {key, value} objects into an object containing those key-value pairs.
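For example, on the question's input (a quick check):
$ jq -c '[leaf_paths as $path | {"key": $path | join("."), "value": getpath($path)}] | from_entries' <<< '{"a":{"b":1}}'
{"a.b":1}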
This is just a variant of Santiago's jq:
. as $in
| reduce leaf_paths as $path ({};
    . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
It avoids the overhead of the key/value construction and destruction.
(If you have access to a version of jq later than jq 1.5, you can omit the "map(tostring)".)
Two important points about both these jq solutions:
Arrays are also flattened.
E.g. given {"a": {"b": [0,1,2]}} as input, the output would be:
{
  "a.b.0": 0,
  "a.b.1": 1,
  "a.b.2": 2
}
If any of the keys in the original JSON contain periods, then key collisions are possible; such collisions will generally result in the loss of a value. This would happen, for example, with the following input:
{"a.b":0, "a": {"b": 1}}
Here is a solution that uses tostream, select, join, reduce and setpath:
reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] (
    {}
    ; setpath($p; $v)
)
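A quick check against the question's example (a sketch):
$ jq -c 'reduce ( tostream | select(length==2) | .[0] |= [join(".")] ) as [$p,$v] ({}; setpath($p; $v))' <<< '{"a":{"b":1}}'
{"a.b":1}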
I've recently written a script called jqg that flattens arbitrarily complex JSON and searches the results using a regex; to simply flatten the JSON, your regex would be '.', which matches everything. Unlike the answers above, the script will handle embedded arrays, false and null values, and can optionally treat empty arrays and objects ([] & {}) as leaf nodes.
$ jq . test/odd-values.json
{
  "one": {
    "start-string": "foo",
    "null-value": null,
    "integer-number": 101
  },
  "two": [
    {
      "two-a": {
        "non-integer-number": 101.75,
        "number-zero": 0
      },
      "true-boolean": true,
      "two-b": {
        "false-boolean": false
      }
    }
  ],
  "three": {
    "empty-string": "",
    "empty-object": {},
    "empty-array": []
  },
  "end-string": "bar"
}
$ jqg . test/odd-values.json
{
  "one.start-string": "foo",
  "one.null-value": null,
  "one.integer-number": 101,
  "two.0.two-a.non-integer-number": 101.75,
  "two.0.two-a.number-zero": 0,
  "two.0.true-boolean": true,
  "two.0.two-b.false-boolean": false,
  "three.empty-string": "",
  "three.empty-object": {},
  "three.empty-array": [],
  "end-string": "bar"
}
jqg was tested using jq 1.6.
Note: I am the author of the jqg script.
As it turns out, curl -XPOST 'http://localhost:8983/solr/flat/update/json/docs' -d @json_file does just this:
{
  "a.b": [1],
  "id": "24e3e780-3a9e-4fa7-9159-fc5294e803cd",
  "_version_": 1535841499921514496
}
EDIT 1: solr 6.0.1 with bin/solr -e cloud. collection name is flat, all the rest are default (with data-driven-schema which is also default).
EDIT 2: The final script I used: find . -name '*.json' -exec curl -XPOST 'http://localhost:8983/solr/collection1/update/json/docs' -d @{} \;.
EDIT 3: It is also possible to parallelize this with xargs and to add the id field with jq: find . -name '*.json' -print0 | xargs -0 -n 1 -P 8 -I {} sh -c "cat {} | jq '. + {id: .a.b}' | curl -XPOST 'http://localhost:8983/solr/collection/update/json/docs' -d @-" where -P is the parallelism factor. I used jq to set an id so multiple uploads of the same document won't create duplicates in the collection (when I searched for the optimal value of -P, it created duplicates in the collection).
As @hraban mentioned, leaf_paths does not work as expected (furthermore, it is deprecated). leaf_paths is equivalent to paths(scalars); it returns the paths of any values for which scalars returns a truthy value. scalars returns its input value if it is a scalar, or null otherwise. The problem with that is that null and false are not truthy values, so they will be removed from the output. The following code does work, by checking the type of the values directly:
. as $in
| reduce paths(type != "object" and type != "array") as $path ({};
    . + { ($path | map(tostring) | join(".")): $in | getpath($path) })
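A quick check with the null and false values that tripped up leaf_paths (a sketch):
$ jq -c '. as $in
  | reduce paths(type != "object" and type != "array") as $path ({};
      . + { ($path | map(tostring) | join(".")): $in | getpath($path) })' \
  <<< '{"w":[{"x":null},{"y":false},{"z":3}]}'
{"w.0.x":null,"w.1.y":false,"w.2.z":3}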