I have a JSON data set with around 8.7 million key value pairs extracted from a Redis store, where each key is guaranteed to be an 8 digit number, and the key is an 8 alphanumeric character value i.e.
[{
"91201544":"INXX0019",
"90429396":"THXX0020",
"20140367":"ITXX0043",
...
}]
To reduce Redis memory usage, I want to transform this into a hash of hashes, where the hash prefix key is the first 6 characters of the key (see this link) and then store this back into Redis.
Specifically, I want my resulting JSON data structure (that I'll then write some code to parse this JSON structure and create a Redis command file consisting of HSET, etc) to look more like
[{
"000000": { "00000023": "INCD1234",
"00000027": "INCF1423",
....
},
....
"904293": { "90429300": "THXX0020",
"90429302": "THXX0024",
"90429305": "THXY0013"}
}]
Since I've been impressed by jq and I'm trying to be more proficient at functional style programming, I wanted to use jq for this task. So far I've come up with the following:
% jq '.[0] | to_entries | map({key: .key, pfx: .key[0:6], value: .value}) | group_by(.pfx)'
This gives me something like
[
[
{
"key": "00000130",
"pfx": "000001",
"value": "CAXX3231"
},
{
"key": "00000162",
"pfx": "000001",
"value": "CAXX4606"
}
],
[
{
"key": "00000238",
"pfx": "000002",
"value": "CAXX1967"
},
{
"key": "00000256",
"pfx": "000002",
"value": "CAXX0727"
}
],
....
]
I've tried the following:
% jq 'map(map({key: .pfx, value: {key, value}}))
| map(reduce .[] as $item ({}; {key: $item.key, value: [.value[], $item.value]} ))
| map( {key, value: .value | from_entries} )
| from_entries'
which does give me the correct result, but also prints out an error for every reduce (I believe) of
jq: error: Cannot iterate over null
The end result is
{
"000001": {
"00000130": "CAXX3231",
"00000162": "CAXX4606"
},
"000002": {
"00000238": "CAXX1967",
"00000256": "CAXX0727"
},
...
}
which is correct, but how can I avoid getting this stderr warning thrown as well?
I'm not sure there's enough data here to assess what the source of the problem is. I find it hard to believe that what you tried results in that. I'm getting errors with that all the way.
Try this filter instead:
.[0]
| to_entries
| group_by(.key[0:6])
| map({
key: .[0].key[0:6],
value: map(.key=.key[6:8]) | from_entries
})
| from_entries
Given data that looks like this:
[{
"91201544":"INXX0019",
"90429396":"THXX0020",
"20140367":"ITXX0043",
"00000023":"INCD1234",
"00000027":"INCF1423",
"90429300":"THXX0020",
"90429302":"THXX0024",
"90429305":"THXY0013"
}]
Results in this:
{
"000000": {
"23": "INCD1234",
"27": "INCF1423"
},
"201403": {
"67": "ITXX0043"
},
"904293": {
"00": "THXX0020",
"02": "THXX0024",
"05": "THXY0013",
"96": "THXX0020"
},
"912015": {
"44": "INXX0019"
}
}
I understand that this is not what you are asking for but, just for the reference, I think it will be MUCH more faster to do this with Redis's built-in Lua scripting.
And it turns out that it is a bit more straightforward:
for _,key in pairs(redis.call('keys', '*')) do
local val = redis.call('get', key)
local short_key = string.sub(key, 0, -2)
redis.call('hset', short_key, key, val)
redis.call('del', key)
end
This will be done in place without transferring from/to Redis and converting to/from JSON.
Run it from console as:
$ redis-cli eval "$(cat script.lua)" 0
For the record, jq's group_by relies on sorting, which of course will slow things down noticeably when the input is sufficiently large. The following is about 40% faster even when the input array has just 100,000 items:
def compress:
. as $in
| reduce keys[] as $key ({};
$key[0:6] as $k6
| $key[6:] as $k2
| .[$k6] += {($k2): $in[$key]} );
.[0] | compress
Given Jeff's input, the output is identical.
Related
I frequently need to create reusable function that performs transformations for a given field of input, for example:
def keep_field_only(field):
{field}
;
or
def count_by(field):
group_by(field) |
map(
{
field: .[0].field,
count: length
}
)
;
While group_by works fine with key passed as an argument, using it to construct object (eg. to keep only key in the object) doesn't work.
I believe it can be always worked around using path/1, but in my experience it significantly complicates code.
Other workaround I used is copying field +{new_field: field} at beginning of function, deleting it in the end, but it doesn't look very efficient or readable either.
Is there a shorter and more readable way?
Update:
Sample input:
[
{"type":1, "name": "foo"},
{"type":1, "name": "bar"},
{"type":2, "name": "joe"}
]
Preferred function invocation and expected results:
.[] | keep_field_only(.type):
{"type": 1}
{"type": 1}
{"type": 2}
count_by(.type):
[
{"type":1, "count": 2},
{"type":2, "count": 1}
]
You can define pick/1 as below,
def pick(paths):
. as $in
| reduce path(paths) as $path (null;
setpath($path; $in | getpath($path))
);
and use it like so:
.[] | pick(.type)
Online demo
def count_by(paths; filter):
group_by(paths | filter) | map(
(.[0] | pick(paths)) + {count: length}
);
def count_by(paths):
count_by(paths; .);
count_by(.type)
Online demo
I don't think there's a shorter and more readable way.
As you say, you can use path/1 to define your keep_field_only and count_by, but it can be done in a very simple way:
def keep_field_only(field):
(null | path(field)[0]) as $field
| {($field): field} ;
def count_by(field):
(null | path(field)[0]) as $field
| group_by(field)
| map(
{
($field): .[0][$field],
count: length
}
);
Of course this is only intended to work in examples like yours, e.g. with invocations like keep_field_only(.type) or count_by(.type).
However, thanks to setpath, the same technique can be used in more complex cases.
I'm trying to use jq to parse the output of https://ssl-config.mozilla.org/guidelines/5.6.json, a pretty simple JSON structure.
How can I get the "openssl" values if "configurations" is "modern" or "intermediate"?
The basic JSON structure would be:
{
"configurations": {
"intermediate": {
"ciphers": {
"openssl": [
"ECDHE-ECDSA-AES128-GCM-SHA256",
"ECDHE-RSA-AES128-GCM-SHA256",
"ECDHE-ECDSA-AES256-GCM-SHA384",
"ECDHE-RSA-AES256-GCM-SHA384",
"ECDHE-ECDSA-CHACHA20-POLY1305",
"ECDHE-RSA-CHACHA20-POLY1305",
"DHE-RSA-AES128-GCM-SHA256",
"DHE-RSA-AES256-GCM-SHA384"
]
}
}
}
}
I had to shorten it in order to avoid the "It looks like your post is mostly code; please add some more detail" error message.
To get all both the modern and intermediate openssl arrays, we can use:
jq '.configurations | with_entries(select([.key] | inside([ "modern", "intermediate" ])))[] | .ciphers.openssl' input
This will show:
[]
[
"ECDHE-ECDSA-AES128-GCM-SHA256",
"ECDHE-RSA-AES128-GCM-SHA256",
"ECDHE-ECDSA-AES256-GCM-SHA384",
"ECDHE-RSA-AES256-GCM-SHA384",
"ECDHE-ECDSA-CHACHA20-POLY1305",
"ECDHE-RSA-CHACHA20-POLY1305",
"DHE-RSA-AES128-GCM-SHA256",
"DHE-RSA-AES256-GCM-SHA384"
]
To get a result with an object so we can see on what key those openssl certs are found, use something like:
jq '.configurations | to_entries | map(select([.key] | inside([ "modern", "intermediate" ])) | { "\(.key)": .value.ciphers.openssl }) | add' input
This will produce:
{
"modern": [],
"intermediate": [
"ECDHE-ECDSA-AES128-GCM-SHA256",
"ECDHE-RSA-AES128-GCM-SHA256",
"ECDHE-ECDSA-AES256-GCM-SHA384",
"ECDHE-RSA-AES256-GCM-SHA384",
"ECDHE-ECDSA-CHACHA20-POLY1305",
"ECDHE-RSA-CHACHA20-POLY1305",
"DHE-RSA-AES128-GCM-SHA256",
"DHE-RSA-AES256-GCM-SHA384"
]
}
I would like to "transpose" (not sure that's the right word) JSON elements.
For example, I have a JSON file like this:
{
"name": {
"0": "fred",
"1": "barney"
},
"loudness": {
"0": "extreme",
"1": "not so loud"
}
}
... and I would like to generate a JSON array like this:
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
My original JSON has many more first level elements than just "name" and "loudness", and many more names, features, etc.
For this simple example I could fully specify the transformation like this:
$ echo '{"name":{"0":"fred","1":"barney"},"loudness":{"0":"extreme","1":"not so loud"}}'| \
> jq '[{"name":.name."0", "loudness":.loudness."0"},{"name":.name."1", "loudness":.loudness."1"}]'
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
... but this isn't feasible for the original JSON.
How can jq create the desired output while being key-agnostic for my much larger JSON file?
Yes, transpose is an appropriate word, as the following makes explicit.
The following generic helper function makes for a simple solution that is completely agnostic about the key names, both of the enclosing object and the inner objects:
# Input: an array of values
def objectify($keys):
. as $in | reduce range(0;length) as $i ({}; .[$keys[$i]] = $in[$i]);
Assuming consistency of the ordering of the inner keys
Assuming the key names in the inner objects are given in a consistent order, a solution can now obtained as follows:
keys_unsorted as $keys
| [.[] | [.[]]] | transpose
| map(objectify($keys))
Without assuming consistency of the ordering of the inner keys
If the ordering of the inner keys cannot be assumed to be consistent, then one approach would be to order them, e.g. using this generic helper function:
def reorder($keys):
. as $in | reduce $keys[] as $k ({}; .[$k] = $in[$k]);
or if you prefer a reduce-free def:
def reorder($keys): [$keys[] as $k | {($k): .[$k]}] | add;
The "main" program above can then be modified as follows:
keys_unsorted as $keys
| (.[$keys[0]]|keys_unsorted) as $inner
| map_values(reorder($inner))
| [.[] | [.[]]] | transpose
| map(objectify($keys))
Caveat
The preceding solution only considers the key names in the first inner object.
Building upon Peak's solution, here is an alternative based on group_by to deal with arbitrary orders of inner keys.
keys_unsorted as $keys
| map(to_entries[])
| group_by(.key)
| map(with_entries(.key = $keys[.key] | .value |= .value))
Using paths is a good idea as pointed out by Hobbs. You could also do something like this :
[ path(.[][]) as $p | { key: $p[0], value: getpath($p), id: $p[1] } ]
| group_by(.id)
| map(from_entries)
This is a bit hairy, but it works:
. as $data |
reduce paths(scalars) as $p (
[];
setpath(
[ $p[1] | tonumber, $p[0] ];
( $data | getpath($p) )
)
)
First, capture the top level as $data because . is about to get a new value in the reduce block.
Then, call paths(scalars) which gives a key path to all of the leaf nodes in the input. e.g. for your sample it would give ["name", "0"] then ["name", "1"], then ["loudness", "0"], then ["loudness", "1"].
Run a reduce on each of those paths, starting the reduction with an empty array.
For each path, construct a new path, in the opposite order, with numbers-in-strings turned into real numbers that can be used as array indices, e.g. ["name", "0"] becomes [0, "name"].
Then use getpath to get the value at the old path in $data and setpath to set a value at the new path in . and return it as the next . for the reduce.
At the end, the result will be
[
{
"name": "fred",
"loudness": "extreme"
},
{
"name": "barney",
"loudness": "not so loud"
}
]
If your real data structure might be two levels deep then you would need to replace [ $p[1] | tonumber, $p[0] ] with a more appropriate expression to transform the path. Or maybe some of your "values" are objects/arrays that you want to leave alone, in which case you probably need to replace paths(scalars) with something like paths | select(length == 2).
command:
cat test.json | jq -r '.[] | select(.input[] | .["$link"] | contains("randomtext1")) | .id'
I was expecting to have both entries (a and b) to show up since they both contains randomtext1
Instead, I got the following output message:
a
jq: error (at <stdin>:22): Cannot index string with string "$link"
From some digging I understand that the issue is likely caused by the following object/value pair in the a entry:
"someotherobj": "123"
because it does not contain the object $link and the filter in the command expects to see $link in all objects under the input so it errors out before the command has a chance to search in the b entry.
What I really want is to be able to search for any entries that have at least one "$link": "randomtext1" pair under input. Is there a fuzzier search feature allowing me to achieve this?
I tried to use two contains hoping it will just pipe things through:
jq -r '.[] | select(.input[] | contains(["$link"]) | contains("randomtext1")) | .id'
but it did not like that at all..
the test.json file:
[
{
"input": {
"obj1": {
"$link": "randomtext1"
},
"obj2": {
"$link": "randomtext2"
},
"someotherobj": "123"
},
"id": "a"
},
{
"input": {
"obj3": {
"$link": "randomtext1"
},
"obj4": {
"$link": "randomtext2"
}
},
"id": "b"
}
]
What I really want is to be able to search for any entries that have at least one "$link": "randomtext1" pair under input.
The key word here, both in the question and the following answer, is any:
.[]
| select( any(.input[];
type=="object" and has("$link") and (.["$link"] | index("randomtext1"))))
| .id
Of course if you require the key's value to be "randomtext1", you'd write .["$link"] == "randomtext1".
Using jq I'd like to convert data of the format:
{
"key": "something-else",
"value": {
"value": "bloop",
"isEncrypted": false
}
}
{
"key": "something",
"value": {
"value": "blah",
"isEncrypted": false
}
}
To the format:
{
something: "blah",
something-else: "bloop"
}
Filtering out 'encrypted values' along the way. How can I achieve this? I've gotten as far as the following:
.parameters | to_entries[] | select (.value.isEncrypted == false) | .key + ": " + .value.value
Which produces:
"something-else: bloop"
"something: blah"
Close, but not there just yet. I suspect that there's some clever function for this.
Given the example input, here's a simple solution, assuming the stream of objects is available as an array. (This can be done using jq -s if the JSON objects are given as input to jq, or in your case, following your example, simply using .parameters | to_entries).
map( select(.value.isEncrypted == false) | {(.key): .value.value } )
| add
This produces the JSON object:
{
"something-else": "bloop",
"something": "blah"
}
The key ideas here are:
the syntax for object construction: {( KEYNAME ): VALUE}
add
One way to gain an understanding of how this works is to run the first part of the filter (map(...)) first.
Using keys_unsorted
If you want to avoid the overhead of to_entries, you might want to consider the following approach, which piggy-backs off your implicit description of .parameters:
.parameters
| [ keys_unsorted[] as $k
| if .[$k].isEncrypted == false
then { ($k) : .[$k].value } else empty end ]
| add