I would like to "transpose" (not sure that's the right word) JSON elements.
For example, I have a JSON file like this:
{
"name": {
"0": "fred",
"1": "barney"
},
"loudness": {
"0": "extreme",
"1": "not so loud"
}
}
... and I would like to generate a JSON array like this:
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
My original JSON has many more first-level elements than just "name" and "loudness", and many more names, features, etc.
For this simple example I could fully specify the transformation like this:
$ echo '{"name":{"0":"fred","1":"barney"},"loudness":{"0":"extreme","1":"not so loud"}}'| \
> jq '[{"name":.name."0", "loudness":.loudness."0"},{"name":.name."1", "loudness":.loudness."1"}]'
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
... but this isn't feasible for the original JSON.
How can jq create the desired output while being key-agnostic for my much larger JSON file?
Yes, transpose is an appropriate word, as the following makes explicit.
The following generic helper function makes for a simple solution that is completely agnostic about the key names, both of the enclosing object and the inner objects:
# Input: an array of values
def objectify($keys):
. as $in | reduce range(0;length) as $i ({}; .[$keys[$i]] = $in[$i]);
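For instance, applied to one "row" of the sample data:
["fred", "extreme"] | objectify(["name", "loudness"])
# => {"name": "fred", "loudness": "extreme"}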
Assuming consistency of the ordering of the inner keys
Assuming the key names in the inner objects are given in a consistent order, a solution can now be obtained as follows:
keys_unsorted as $keys
| [.[] | [.[]]] | transpose
| map(objectify($keys))
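Putting it together with the sample input from the question:
$ echo '{"name":{"0":"fred","1":"barney"},"loudness":{"0":"extreme","1":"not so loud"}}' | \
> jq 'def objectify($keys): . as $in | reduce range(0;length) as $i ({}; .[$keys[$i]] = $in[$i]); keys_unsorted as $keys | [.[] | [.[]]] | transpose | map(objectify($keys))'
Here [.[] | [.[]]] yields [["fred","barney"],["extreme","not so loud"]], transpose flips that into [["fred","extreme"],["barney","not so loud"]], and objectify rebuilds the objects, producing the desired array shown above.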
Without assuming consistency of the ordering of the inner keys
If the ordering of the inner keys cannot be assumed to be consistent, then one approach would be to order them, e.g. using this generic helper function:
def reorder($keys):
. as $in | reduce $keys[] as $k ({}; .[$k] = $in[$k]);
or if you prefer a reduce-free def:
def reorder($keys): [$keys[] as $k | {($k): .[$k]}] | add;
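For instance:
{"1": "b", "0": "a"} | reorder(["0", "1"])
# => {"0": "a", "1": "b"}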
The "main" program above can then be modified as follows:
keys_unsorted as $keys
| (.[$keys[0]]|keys_unsorted) as $inner
| map_values(reorder($inner))
| [.[] | [.[]]] | transpose
| map(objectify($keys))
Caveat
The preceding solution only considers the key names in the first inner object.
Building upon Peak's solution, here is an alternative based on group_by to deal with arbitrary orders of inner keys.
keys_unsorted as $keys
| map(to_entries[])
| group_by(.key)
| map(with_entries(.key = $keys[.key] | .value |= .value))
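To see why this works, here is the intermediate value after group_by(.key) for the sample input; with_entries then replaces each positional index with the corresponding outer key from $keys and unwraps the inner value:
[
  [ { "key": "0", "value": "fred" },   { "key": "0", "value": "extreme" } ],
  [ { "key": "1", "value": "barney" }, { "key": "1", "value": "not so loud" } ]
]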
Using paths is a good idea as pointed out by Hobbs. You could also do something like this:
[ path(.[][]) as $p | { key: $p[0], value: getpath($p), id: $p[1] } ]
| group_by(.id)
| map(from_entries)
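For the sample input, the array built on the first line looks as follows; the extra id field is simply ignored by from_entries:
[
  { "key": "name",     "value": "fred",        "id": "0" },
  { "key": "name",     "value": "barney",      "id": "1" },
  { "key": "loudness", "value": "extreme",     "id": "0" },
  { "key": "loudness", "value": "not so loud", "id": "1" }
]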
This is a bit hairy, but it works:
. as $data |
reduce paths(scalars) as $p (
[];
setpath(
[ $p[1] | tonumber, $p[0] ];
( $data | getpath($p) )
)
)
First, capture the top level as $data because . is about to get a new value in the reduce block.
Then, call paths(scalars) which gives a key path to all of the leaf nodes in the input. e.g. for your sample it would give ["name", "0"] then ["name", "1"], then ["loudness", "0"], then ["loudness", "1"].
Run a reduce on each of those paths, starting the reduction with an empty array.
For each path, construct a new path, in the opposite order, with numbers-in-strings turned into real numbers that can be used as array indices, e.g. ["name", "0"] becomes [0, "name"].
Then use getpath to get the value at the old path in $data and setpath to set a value at the new path in . and return it as the next . for the reduce.
At the end, the result will be
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
If your real data structure might be two levels deep then you would need to replace [ $p[1] | tonumber, $p[0] ] with a more appropriate expression to transform the path. Or maybe some of your "values" are objects/arrays that you want to leave alone, in which case you probably need to replace paths(scalars) with something like paths | select(length == 2).
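As a sketch of that last variant (assuming the values to keep all sit exactly two levels deep, even if they are themselves arrays or objects):
. as $data |
reduce (paths | select(length == 2)) as $p (
  [];
  # same reversed-path trick as above; composite values are copied whole
  setpath(
    [ $p[1] | tonumber, $p[0] ];
    ( $data | getpath($p) )
  )
)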
Related
I frequently need to create reusable functions that perform transformations on a given field of the input, for example:
def keep_field_only(field):
{field}
;
or
def count_by(field):
group_by(field) |
map(
{
field: .[0].field,
count: length
}
)
;
While group_by works fine with the field passed as an argument, using the argument to construct an object (e.g. to keep only that field in the object) doesn't work.
I believe it can always be worked around using path/1, but in my experience that significantly complicates the code.
Another workaround I have used is copying the field (+{new_field: field}) at the beginning of the function and deleting it at the end, but that doesn't look very efficient or readable either.
Is there a shorter and more readable way?
Update:
Sample input:
[
{"type":1, "name": "foo"},
{"type":1, "name": "bar"},
{"type":2, "name": "joe"}
]
Preferred function invocation and expected results:
.[] | keep_field_only(.type):
{"type": 1}
{"type": 1}
{"type": 2}
count_by(.type):
[
{"type":1, "count": 2},
{"type":2, "count": 1}
]
You can define pick/1 as below,
def pick(paths):
. as $in
| reduce path(paths) as $path (null;
setpath($path; $in | getpath($path))
);
and use it like so:
.[] | pick(.type)
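Since pick/1 takes a stream of paths, it also covers multi-field cases, e.g.:
.[] | pick(.type, .name)
# => {"type": 1, "name": "foo"} and so on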
def count_by(paths; filter):
group_by(paths | filter) | map(
(.[0] | pick(paths)) + {count: length}
);
def count_by(paths):
count_by(paths; .);
count_by(.type)
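Applied to the sample input from the question, count_by(.type) produces [{"type":1,"count":2},{"type":2,"count":1}], matching the expected results.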
I don't think there's a shorter and more readable way.
As you say, you can use path/1 to define your keep_field_only and count_by, but it can be done in a very simple way:
def keep_field_only(field):
(null | path(field)[0]) as $field
| {($field): field} ;
def count_by(field):
(null | path(field)[0]) as $field
| group_by(field)
| map(
{
($field): .[0][$field],
count: length
}
);
Of course this is only intended to work in examples like yours, e.g. with invocations like keep_field_only(.type) or count_by(.type).
However, thanks to setpath, the same technique can be used in more complex cases.
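For instance, a hypothetical keep_nested/1 in the same spirit (the function name and sample object here are illustrative, not from the question) also handles nested fields:
def keep_nested(field):
  (null | path(field)) as $p          # e.g. path(.user.name) => ["user", "name"]
  | . as $in
  | null
  | setpath($p; $in | getpath($p));

{"user": {"name": "alice", "age": 30}, "active": true} | keep_nested(.user.name)
# => {"user": {"name": "alice"}}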
I have the following JSON. From it, I'd like to count how many objects have a type attribute that is either "null" or an array containing the value "null". In the following example the answer would be two. Note that the JSON could also be deeply nested.
{
"A": {
"type": "string"
},
"B": {
"type": "null"
},
"C": {
"type": [
"null",
"string"
]
}
}
I came up with the following, but obviously this doesn't work since it misses the arrays. Any hints on how to solve this?
jq '[..|select(.type?=="null")] | length'
This answer focuses on efficiency, straightforwardness, and generality.
In brief, the following jq program produces 2 for the given example.
def count(s): reduce s as $x (0; .+1);
def hasValue($value):
has("type") and
(.type | . == $value or (type == "array" and any(. == $value)));
count(.. | objects | select(hasValue("null")))
Notice that using this approach, it would be easy to count the number of objects having null or "null":
count(.. | objects | select(hasValue("null") or hasValue(null)))
You were almost there. For arrays you could use IN. I also used objects, strings and arrays, which are shortcuts for a select on the corresponding types.
jq '[.. | objects.type | select(strings == "null", IN(arrays[]; "null"))] | length'
2
On larger structures you could also improve performance by not creating the array whose length you would then take, but instead just iterating over the matching items (e.g. using reduce) and counting as you go.
jq 'reduce (.. | objects.type | select(strings == "null", IN(arrays[]; "null"))) as $_ (0; .+1)'
2
My goal is to flatten a filesystem-like structure (nested directories), with history information for individual files, into a CSV file for further processing. Here is what I have tried so far.
The simplified input looks like this:
{ "dirs": [
{
"name": "documents",
"files": [
{
"name": "foo.bar",
"history": [
{ "hash": "123", "timestamp": "..."},
{ "hash": "234", "timestamp": "..."}
]
}
],
"subDirs": [
{ "name": "tmp", "files": [...], "subDirs": [...]
}
]
}
]}
The tricky part is that the csv file should contain full directory paths, not only the directory name. The desired output looks like this:
"documents","foo.bar","123","..."
"documents","foo.bar","234","..."
"documents","bar.baz","345","..."
"documents","bar.baz","456","..."
"documents/tmp","deleteme","567","..."
"documents/tmp","deleteme","678","..."
Flattening most of the data by using recurse works with this query:
.dirs[] | recurse(.subDirs[]?) | . as $d | $d.files[]? as $f | $f.history[]? as $h | [$d.name, $f.name, $h.hash, $h.timestamp] | @csv
... but I cannot wrap my head around how to build the full directory path. Any suggestions would be much appreciated.
Here's an approach that neither uses recursion explicitly (*) nor relies on a recursive structure:
def names($path):
reduce getpath($path[0:range(0; $path|length)]) as $v ("";
if $v | type == "object" and has("name") then . + "/" + $v["name"] else . end) ;
paths as $p
| getpath($p) as $v
| select($v | objects | has("history"))
| [names($p), getpath($p + ["name"])]
+ ($v["history"][] | [.hash, .timestamp] )
| @csv
This produces "absolute" paths (e.g. "/documents"); omitting the leading "/" can be accomplished easily enough.
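For example, one could simply replace names($p) with (names($p) | ltrimstr("/")) in the program above to strip the leading slash.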
(*) paths is defined recursively but in a way that takes advantage of jq's tail-call optimization (TCO), which is only applied to arity-0 recursive functions.
I think you need to define a custom recursive function for this, like the one below, which assumes that all files have a non-empty history.
def f(pfix):
( [ pfix, .name ] | join("/") ) as $path |
( .files[] | .history[] as $hist | [ $path, .name, $hist[] ] ),
( .subDirs[] | f($path) );
.dirs[] | f(empty) | @csv
Note that f(empty) passes a filter that produces no output as pfix, so at the top level [ pfix, .name ] evaluates to just [ .name ].
I have a nested JSON object where each level has the same property key and what distinguishes each level is a property called name. If I want to traverse down to a level which has a particular "path" of name properties, how would I formulate the jq filter?
Here is some sample JSON data that represents a file system's directory structure:
{
"subs": [
{
"name": "aaa",
"subs": [
{
"name": "bbb",
"subs": [
{
"name": "ccc",
"subs": [
{
"name": "ddd",
"payload": "xyz"
}
]
}
]
}
]
}
]
}
What's a jq filter for obtaining the value of the payload in the "path" aaa/bbb/ccc/ddd?
Prior research:
jq - select objects with given key name - helpful but looks for any element in the JSON which contains the specified name whereas I'm looking for an element that's nested under a set of objects who also have specific names.
http://arjanvandergaag.nl/blog/wrestling-json-with-jq.html - helpful in section 4 where it shows how to extract an object whose name property has a particular value. However, the recursion performed is based on a specific known set of property names ("values[].links.clone[]"). In my case, my equivalent is just "subs[].subs[].subs[]".
Here is the basis for a generic solution:
def descend(name): .subs[] | select(.name == name);
So your particular query could be formulated as follows:
descend("aaa") | descend("bbb") | descend("ccc") | descend("ddd") | .payload
Or slightly better, still using the above definition of descend:
def path(array):
if (array|length)==0 then .
else descend(array[0]) | path(array[1:])
end;
path( ["aaa", "bbb", "ccc", "ddd"] ) | .payload
TCO
The above recursive definition of path/1 is simple enough but would be unsuitable for very deeply nested data structures, e.g. if the depth is greater than 1000. Here is an alternative definition that takes advantage of jq's tail-call optimization, and that therefore runs very quickly:
def atpath(array):
[array, .]
| until( .[0] == []; .[0] as $a | .[1] | descend($a[0]) | [$a[1:], . ] )
| .[1];
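It is used in the same way, e.g.:
atpath(["aaa", "bbb", "ccc", "ddd"]) | .payload
# => "xyz"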
.aaa.bbb.ccc.ddd
If you want to be able to use the .aaa.bbb.ccc.ddd notation, one approach would be to begin by "flattening" the data:
def flat:
{ (.name): (if .subs then (.subs[] | flat) else .payload end) };
Since the top-level element does not have a "name" tag, the query would then be:
.subs[] | flat | .aaa.bbb.ccc.ddd
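Here .subs[] | flat first produces {"aaa": {"bbb": {"ccc": {"ddd": "xyz"}}}}, from which .aaa.bbb.ccc.ddd extracts "xyz".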
Here is a more efficient approach, once again using descend defined above:
def payload(p):
def get($array):
if $array == []
then .payload
else descend($array[0]) | get($array[1:]) end;
get( null | path(p) );
payload( .aaa.bbb.ccc.ddd )
Here null | path(p) uses the builtin path to turn the filter .aaa.bbb.ccc.ddd into the array ["aaa", "bbb", "ccc", "ddd"]. (If you combine these snippets into a single program, beware that the path/1 defined earlier would shadow the builtin.)
The filter in the following jq command recurses down a "path" of objects whose name properties correspond to the "path" aaa/bbb/ccc/ddd:
jq '.subs[] | select(.name == "aaa") | .subs[] | select(.name == "bbb") | .subs[] | select(.name == "ccc") | .subs[] | select(.name == "ddd") | .payload'
Here it is live on jqplay.org:
https://jqplay.org/s/tblW7UX0Si
I have a JSON data set with around 8.7 million key-value pairs extracted from a Redis store, where each key is guaranteed to be an 8-digit number and each value is an 8-character alphanumeric string, i.e.
[{
"91201544":"INXX0019",
"90429396":"THXX0020",
"20140367":"ITXX0043",
...
}]
To reduce Redis memory usage, I want to transform this into a hash of hashes, where the hash prefix key is the first 6 characters of the key (see this link) and then store this back into Redis.
Specifically, I want my resulting JSON data structure (that I'll then write some code to parse this JSON structure and create a Redis command file consisting of HSET, etc) to look more like
[{
"000000": { "00000023": "INCD1234",
"00000027": "INCF1423",
....
},
....
"904293": { "90429300": "THXX0020",
"90429302": "THXX0024",
"90429305": "THXY0013"}
}]
Since I've been impressed by jq and I'm trying to be more proficient at functional style programming, I wanted to use jq for this task. So far I've come up with the following:
% jq '.[0] | to_entries | map({key: .key, pfx: .key[0:6], value: .value}) | group_by(.pfx)'
This gives me something like
[
[
{
"key": "00000130",
"pfx": "000001",
"value": "CAXX3231"
},
{
"key": "00000162",
"pfx": "000001",
"value": "CAXX4606"
}
],
[
{
"key": "00000238",
"pfx": "000002",
"value": "CAXX1967"
},
{
"key": "00000256",
"pfx": "000002",
"value": "CAXX0727"
}
],
....
]
I've tried the following:
% jq 'map(map({key: .pfx, value: {key, value}}))
| map(reduce .[] as $item ({}; {key: $item.key, value: [.value[], $item.value]} ))
| map( {key, value: .value | from_entries} )
| from_entries'
which does give me the correct result, but also prints an error (I believe one per reduce) of
jq: error: Cannot iterate over null
The end result is
{
"000001": {
"00000130": "CAXX3231",
"00000162": "CAXX4606"
},
"000002": {
"00000238": "CAXX1967",
"00000256": "CAXX0727"
},
...
}
which is correct, but how can I avoid getting this stderr warning thrown as well?
I'm not sure there's enough data here to assess what the source of the problem is. I find it hard to believe that what you tried results in that. I'm getting errors with that all the way.
Try this filter instead:
.[0]
| to_entries
| group_by(.key[0:6])
| map({
key: .[0].key[0:6],
value: map(.key=.key[6:8]) | from_entries
})
| from_entries
Given data that looks like this:
[{
"91201544":"INXX0019",
"90429396":"THXX0020",
"20140367":"ITXX0043",
"00000023":"INCD1234",
"00000027":"INCF1423",
"90429300":"THXX0020",
"90429302":"THXX0024",
"90429305":"THXY0013"
}]
Results in this:
{
"000000": {
"23": "INCD1234",
"27": "INCF1423"
},
"201403": {
"67": "ITXX0043"
},
"904293": {
"00": "THXX0020",
"02": "THXX0024",
"05": "THXY0013",
"96": "THXX0020"
},
"912015": {
"44": "INXX0019"
}
}
I understand that this is not what you are asking for but, just for reference, I think it will be MUCH faster to do this with Redis's built-in Lua scripting.
And it turns out to be a bit more straightforward:
for _, key in pairs(redis.call('keys', '*')) do
  local val = redis.call('get', key)
  -- the first 6 of the 8 key characters become the hash key
  local short_key = string.sub(key, 1, -3)
  redis.call('hset', short_key, key, val)
  redis.call('del', key)
end
This will be done in place, without transferring data to and from Redis or converting to and from JSON.
Run it from console as:
$ redis-cli eval "$(cat script.lua)" 0
For the record, jq's group_by relies on sorting, which of course will slow things down noticeably when the input is sufficiently large. The following is about 40% faster even when the input array has just 100,000 items:
def compress:
. as $in
| reduce keys[] as $key ({};
$key[0:6] as $k6
| $key[6:] as $k2
| .[$k6] += {($k2): $in[$key]} );
.[0] | compress
Given Jeff's input, the output is identical.
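To trace the reduce: processing the single key "91201544" updates the accumulator via .["912015"] += {"44": "INXX0019"}, yielding {"912015": {"44": "INXX0019"}}; += then merges further suffixes into the same inner object.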