Extract schema of nested JSON object

Let's assume this is the source json file:
{
  "name": "tom",
  "age": 12,
  "visits": {
    "2017-01-25": 3,
    "2016-07-26": 4,
    "2016-01-24": 1
  }
}
I want to get:
[
  "age",
  "name",
  "visits.2017-01-25",
  "visits.2016-07-26",
  "visits.2016-01-24"
]
I am able to extract the keys using jq '. | keys' file.json, but this skips nested fields. How can I include those as well?

With your input, the invocation:
jq 'leaf_paths | join(".")'
produces:
"name"
"age"
"visits.2017-01-25"
"visits.2016-07-26"
"visits.2016-01-24"
If you want to include "visits" as well, use paths instead of leaf_paths. If you want the result as a JSON array, enclose the filter in square brackets: [ ... ]
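For instance, with file.json as above, the bracketed form:
jq '[leaf_paths | join(".")]' file.json
produces:
[
  "name",
  "age",
  "visits.2017-01-25",
  "visits.2016-07-26",
  "visits.2016-01-24"
]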
If your input might include arrays, then unless you are using jq 1.6 or later, you will need to convert the integer indices to strings explicitly; also, since leaf_paths is now deprecated, you might want to use its definition, paths(scalars), directly. The resulting filter is:
jq 'paths(scalars) | map(tostring) | join(".")'
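For instance, on an input that contains an array, say the hypothetical document {"a": [{"b": 1}]}, this filter emits:
"a.0.b"
with the array index converted to a string.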
allpaths
To include paths to null, you could use allpaths defined as follows:
def allpaths:
  def conditional_recurse(f): def r: ., (select(.!=null) | f | r); r;
  path(conditional_recurse(.[]?)) | select(length > 0);
Example:
{"a": null, "b": false} | allpaths | join(".")
produces:
"a"
"b"
all_leaf_paths
Assuming jq version 1.5 or higher, we can get to all_leaf_paths by following the strategy used in builtins.jq, that is, by adding these definitions:
def allpaths(f):
  . as $in | allpaths | select(. as $p | $in | getpath($p) | f);
def isscalar:
  . == null or . == true or . == false or type == "number" or type == "string";
def all_leaf_paths: allpaths(isscalar);
Example:
{"a": null, "b": false, "object":{"x":0} } | all_leaf_paths | join(".")
produces:
"a"
"b"
"object.x"

Some time ago, I wrote a structural-schema inference engine that
produces simple structural schemas that mirror the JSON documents under consideration,
e.g. for the sample JSON given here, the inferred schema is:
{
  "name": "string",
  "age": "number",
  "visits": {
    "2017-01-25": "number",
    "2016-07-26": "number",
    "2016-01-24": "number"
  }
}
This is not exactly the format requested in the original posting, but
for large collections of objects, it does provide a useful overview.
More importantly, there is now a complementary validator for
checking whether a collection of JSON documents matches a structural
schema. The validator checks against schemas written in
JESS (JSON Extended Structural Schemas), a superset of the simple
structural schemas (SSS) produced by the schema inference engine.
(The idea is that one can use the SSS as a starting point to add
more elaborate constraints, including recursive constraints,
within-document referential integrity constraints, etc.)
For reference, here is how the SSS for your sample.json
would be produced using the "schema" module:
jq 'include "schema"; schema' source.json > source.schema.json
And to validate source.json against an SSS or ESS:
JESS --schema source.schema.json source.json

This does what you want, though it doesn't return the data in an array; that should be an easy modification:
https://github.com/ilyash/show-struct
you can also check out this page:
https://ilya-sher.org/2016/05/11/most-jq-you-will-ever-need/

Related

Count number of objects whose attribute is "null" or contains "null"

I have the following JSON. From it, I'd like to count how many objects have a type attribute that is either "null" or is an array containing the value "null". In the following example, the answer would be two. Note that the JSON could also be deeply nested.
{
  "A": {
    "type": "string"
  },
  "B": {
    "type": "null"
  },
  "C": {
    "type": [
      "null",
      "string"
    ]
  }
}
I came up with the following, but obviously this doesn't work since it misses the arrays. Any hints on how to solve this?
jq '[..|select(.type?=="null")] | length'
This answer focuses on efficiency, straightforwardness, and generality.
In brief, the following jq program produces 2 for the given example.
def count(s): reduce s as $x (0; .+1);
def hasValue($value):
  has("type") and
  (.type | . == $value or (type == "array" and any(. == $value)));
count(.. | objects | select(hasValue("null")))
Notice that using this approach, it would be easy to count the number of objects having null or "null":
count(.. | objects | select(hasValue("null") or hasValue(null)))
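For instance, if the sample also contained a hypothetical fourth entry whose type is the JSON value null rather than the string "null":
"D": {
  "type": null
}
then the original filter would still report 2, whereas this combined filter would report 3.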
You were almost there. For arrays you could use IN. I also used objects, strings and arrays, which are shortcuts for a select on the corresponding types.
jq '[.. | objects.type | select(strings == "null", IN(arrays[]; "null"))] | length'
2
On larger structures you could also improve performance by not building an array only to take its length, and instead just iterating over the matching items (e.g. using reduce) and counting as you go.
jq 'reduce (.. | objects.type | select(strings == "null", IN(arrays[]; "null"))) as $_ (0; .+1)'
2

How to extract data from JSON converted from ruby script?

How to convert a ruby file to json?
I use the above approach (rb2json0.rb is in the above link) to convert a ruby script to JSON. But the resulting JSON is awkward to work with, as it contains only arrays and no dictionaries.
I specifically want to extract fields in update_info, e.g.,
Name
Description
License
References
Author
and to extract the fields in register_options, e.g.,
LHOST
SOURCE
FILENAME
DOCAUTHOR
Note that the extraction should not assume the field names are fixed to these specific ones, as other field names can be used in other similar files.
The output should be a two-column TSV, with the field name as the first column and the field value as the second column. For example,
Name<TAB>Microsoft Word UNC Path Injector
...
Could anybody let me know the best jq way to achieve this? Thanks.
word_unc_injector.rb is at
https://github.com/rapid7/metasploit-framework/blob/master/modules/auxiliary/docx/word_unc_injector.rb
$ rb2json0.rb < word_unc_injector.rb | jq . # too long to include all output.
[
  "program",
  [
    [
      "command",
      [
        "#ident",
        "require",
        [
          ...
EDIT: The full solution to this problem may be complicated, but the first step might be to extract the part corresponding to update_info. Here is the relevant JSON fragment.
...
[
  "method_add_arg",
  [
    "fcall",
    [
      "#ident",
      "update_info",
      [
        24,
        10
      ]
    ]
  ],
  [
    "arg_paren",
    [
      "args_add_block",
      [
        [
          "var_ref",
          [
            "#ident",
            "info",
            [
              24,
              22
            ]
          ]
        ],
        [
          "bare_assoc_hash",
          [
            [
              "assoc_new",
              [
                "string_literal",
                [
                  "string_content",
                  [
                    "#tstring_content",
                    "Name",
...
The following illustrates how to convert the array-oriented JSON produced by rb2json0.rb into a more object-oriented form, so that you can query for update_info#ident in a straightforward way.
def objectify:
  if type == "array"
  then if length >= 2 and (.[0]|type) == "string" and (.[1]|type) == "array"
       then {(.[0]): ( .[1:] | map(objectify)) | objectify }
       elif length >= 3 and .[0][0:1] == "#" and (.[1]|type) == "string"
       then { (.[1] + .[0]): ( .[2:] | map(objectify)) | objectify }
       else map(objectify)
       end
  else .
  end;
Illustrative query
Given the snippet shown, the following produces the output shown below:
objectify | .. | objects | .["update_info#ident"]? // empty
Output
[[24,10]]
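For reference, a minimal way to run the whole pipeline from the shell, assuming rb2json0.rb is on the PATH as in the linked question and that the def of objectify together with the query above is saved in a file, say objectify.jq (a filename chosen here for illustration):
rb2json0.rb < word_unc_injector.rb | jq -f objectify.jq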

ID lookup from an external file in JQ

I have a lookup file that maps IDs from one system onto another:
[
  {
    "idA": 2547,
    "idB": "5d0bf91d191c6554d14572a6"
  },
  {
    "idA": 2549,
    "idB": "5b0473f93d4e53db19f8c249"
  },
  {
    "idA": 2550,
    "idB": "5d0bfabc8f20917b92ff07dc"
  },
  ...
And I have a data file with values and an ID from one of these systems:
[
  {
    "idB": "5d0bf91d191c6554d14572a6",
    "description": "Description for 5d0bf91d191c6554d14572a6"
  },
  {
    "idB": "5d0bf49e9236c57281811cfc",
    "description": "Description for 5d0bf49e9236c57281811cfc"
  },
  {
    "idB": "5d0bfabc8f20917b92ff07dc",
    "description": "Description for 5d0bfabc8f20917b92ff07dc"
  },
  ...
I want to produce a new file of the descriptions with their IDs converted to the idA values in the lookup file. I tried this:
jq --slurpfile idmap ids.json 'map( {"description":.description, "id": (.idB as $b|$idmap[][]|select(.idB==$b)|.idA) } )' descriptions.json
But it produces only an empty array.
I have to double-dereference $idmap because slurping a file "binds an array of the parsed JSON values to the given global variable" -- so just doing $idmap[] throws an error, jq: error (at descriptions.json:70): Cannot index array with string "idB".
Can anyone explain what I'm doing wrong here?
Here's a concise and straightforward solution to the stated problem.
For simplicity, we'll begin by constructing a dictionary containing the relevant mapping using INDEX/2:
INDEX($idmap[]; .idB) | map_values(.idA)
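With the three lookup entries shown above, that dictionary looks like this:
{
  "5d0bf91d191c6554d14572a6": 2547,
  "5b0473f93d4e53db19f8c249": 2549,
  "5d0bfabc8f20917b92ff07dc": 2550
}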
Now the task is easy:
(INDEX($idmap[]; .idB) | map_values(.idA)) as $dict
| map( {description, "idA": $dict[.idB] } )
This assumes an invocation that uses --argfile idmap ids.json to avoid
the unwanted "slurping" caused by --slurpfile, but if the latter is used, then you would use $idmap[][] instead as noted in the original question.
With the truncated snippets shown above, the first and third idB values do have counterparts in the lookup file, while the second does not and therefore maps to null. For the three description entries shown, the result is:
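[
  {
    "description": "Description for 5d0bf91d191c6554d14572a6",
    "idA": 2547
  },
  {
    "description": "Description for 5d0bf49e9236c57281811cfc",
    "idA": null
  },
  {
    "description": "Description for 5d0bfabc8f20917b92ff07dc",
    "idA": 2550
  }
]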
Variation
If the objects in descriptions.json had other keys that should be retained, then the following variant would probably be a more useful guide:
(INDEX($idmap[]; .idB) | map_values(.idA)) as $dict # or $idmap[][] as above
| map( .idA = $dict[.idB] | del(.idB) )
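For instance, the first entry in descriptions.json would come out as:
{
  "description": "Description for 5d0bf91d191c6554d14572a6",
  "idA": 2547
}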

jq: Conditionally update/replace/add json elements using an input file

I receive the following input file:
input.json:
[
{"ID":"aaa_12301248","time_CET":"00:00:00","VALUE":10,"FLAG":"0"},
{"ID":"aaa_12301248","time_CET":"00:15:00","VALUE":18,"FLAG":"0"},
{"ID":"aaa_12301248","time_CET":"00:30:00","VALUE":160,"FLAG":"0"},
{"ID":"bbb_0021122","time_CET":"00:00:00","VALUE":null,"FLAG":"?"},
{"ID":"bbb_0021122","time_CET":"00:15:00","VALUE":null,"FLAG":"?"},
{"ID":"bbb_0021122","time_CET":"00:30:00","VALUE":22,"FLAG":"0"},
{"ID":"ccc_0021122","time_CET":"00:00:00","VALUE":null,"FLAG":"?"},
{"ID":"ccc_0021122","time_CET":"00:15:00","VALUE":null,"FLAG":"?"},
{"ID":"ccc_0021122","time_CET":"00:30:00","VALUE":20,"FLAG":"0"},
{"ID":"ddd_122455","time_CET":"00:00:00","VALUE":null,"FLAG":"?"},
{"ID":"ddd_122455","time_CET":"00:15:00","VALUE":null,"FLAG":"?"},
{"ID":"ddd_122455","time_CET":"00:30:00","VALUE":null,"FLAG":"?"},
]
As you can see there are some valid values (FLAG: 0) and some invalid values (FLAG: "?").
Now I got a file looking like this (one for each ID):
aaa.json:
[
{"ID":"aaa_12301248","time_CET":"00:00:00","VALUE":10,"FLAG":"0"},
{"ID":"aaa_12301248","time_CET":"00:15:00","VALUE":null,"FLAG":"?"},
{"ID":"aaa_12301248","time_CET":"00:55:00","VALUE":45,"FLAG":"0"}
]
As you can see, object one is the same as in input.json but object two is invalid (FLAG: "?"). That's why object two has to be replaced by the correct object from input.json (with VALUE:18).
Objects can be identified by "time_CET" and "ID" element.
Additionally, there will be new objects in input.json that have not been part of aaa.json etc. These objects should be added to the array, and valid objects from aaa.json should be kept.
In the end, aaa.json should look like this:
[
{"ID":"aaa_12301248","time_CET":"00:00:00","VALUE":10,"FLAG":"0"},
{"ID":"aaa_12301248","time_CET":"00:15:00","VALUE":18,"FLAG":"0"},
{"ID":"aaa_12301248","time_CET":"00:30:00","VALUE":160,"FLAG":"0"},
{"ID":"aaa_12301248","time_CET":"00:55:00","VALUE":45,"FLAG":"0"}
]
So, to summarize:
- look for FLAG: "?" in aaa.json
- replace this object with the matching object from input.json, using "ID" and "time_CET" for mapping
- keep existing valid objects, and add objects from input.json that did not exist in aaa.json before (this means only objects whose "ID" field starts with "aaa")
- repeat this for bbb.json, ccc.json and ddd.json
I am not sure if it's possible to get this done all at once with a command like this, because the output has to go back to the correct ID files (aaa.json, bbb.json, ccc.json):
jq --argfile aaa aaa.json --argfile bbb bbb.json .... -f prog.jq input.json
The problem is that the number after the identifier (aaa, bbb, ccc etc.) may change. So to make sure objects are added to the correct file/array, a statement like this would be required:
if (."ID"|contains("aaa")) then ....
Or is it better to run the program several times with different input parameters? I am not sure..
Thank you in advance!!
Here is one approach
#!/bin/bash
# usage: update.sh input.json aaa.json bbb.json ...
# updates each of aaa.json bbb.json ... in place
input_json="$1"
shift

for i in "$@"; do   # "$@" iterates over the remaining filenames
  jq -M --argfile input_json "$input_json" '

    # functions to restrict input.json to keys of the current xxx.json file
    def prefix: input_filename | split(".")[0];
    def selectprefix: select(.ID | startswith(prefix));

    # functions to build and probe a lookup table
    def pk: [.ID, .time_CET];
    def lookup($t; $k): $t | getpath($k);
    def lookup($t): lookup($t; pk);
    def organize(s): reduce s as $r ({}; setpath($r | pk; $r));

    # functions to identify objects in input.json missing from xxx.json
    def pks: paths | select(length == 2);
    def missing($t1; $t2): [$t1 | pks] - [$t2 | pks] | .[];
    def getmissing($t1; $t2): [ missing($t1; $t2) as $p | lookup($t1; $p) ];

    # main routine
    organize(.[]) as $xxx
    | organize($input_json[] | selectprefix) as $inp
    | map(if .FLAG != "?" then . else . += lookup($inp) end)
    | . + getmissing($inp; $xxx)

  ' "$i" | sponge "$i"   # sponge (from moreutils) writes the output back to the same file
done
The script uses jq in a loop to read and update each aaa.json... file.
The filter creates temporary objects to facilitate looking up values by [ID, time_CET], updates any values in aaa.json with a FLAG of "?", and finally adds any values from input.json that are missing from aaa.json.
The temporary lookup table for input.json uses input_filename so that only keys starting with a prefix matching the name of the currently processed file are included.
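If sponge (from moreutils) is not available, writing to a temporary file and moving it back works just as well; for example, keeping the jq program in its own file (say update.jq, a name chosen here for illustration):
jq -M --argfile input_json "$input_json" -f update.jq "$i" > "$i.tmp" && mv "$i.tmp" "$i"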
Sample Run:
$ ./update.sh input.json aaa.json
aaa.json after run:
[
  {
    "ID": "aaa_12301248",
    "time_CET": "00:00:00",
    "VALUE": 10,
    "FLAG": "0"
  },
  {
    "ID": "aaa_12301248",
    "time_CET": "00:15:00",
    "VALUE": 18,
    "FLAG": "0"
  },
  {
    "ID": "aaa_12301248",
    "time_CET": "00:55:00",
    "VALUE": 45,
    "FLAG": "0"
  },
  {
    "ID": "aaa_12301248",
    "time_CET": "00:30:00",
    "VALUE": 160,
    "FLAG": "0"
  }
]

Using jq to extract common prefixes in a JSON data structure

I have a JSON data set with around 8.7 million key-value pairs extracted from a Redis store, where each key is guaranteed to be an 8-digit number and each value is an 8-character alphanumeric string, i.e.
[{
  "91201544": "INXX0019",
  "90429396": "THXX0020",
  "20140367": "ITXX0043",
  ...
}]
To reduce Redis memory usage, I want to transform this into a hash of hashes, where the hash prefix key is the first 6 characters of the key (see this link) and then store this back into Redis.
Specifically, I want my resulting JSON data structure (that I'll then write some code to parse this JSON structure and create a Redis command file consisting of HSET, etc) to look more like
[{
  "000000": { "00000023": "INCD1234",
              "00000027": "INCF1423",
              ....
            },
  ....
  "904293": { "90429300": "THXX0020",
              "90429302": "THXX0024",
              "90429305": "THXY0013" }
}]
Since I've been impressed by jq and I'm trying to be more proficient at functional style programming, I wanted to use jq for this task. So far I've come up with the following:
% jq '.[0] | to_entries | map({key: .key, pfx: .key[0:6], value: .value}) | group_by(.pfx)'
This gives me something like
[
  [
    {
      "key": "00000130",
      "pfx": "000001",
      "value": "CAXX3231"
    },
    {
      "key": "00000162",
      "pfx": "000001",
      "value": "CAXX4606"
    }
  ],
  [
    {
      "key": "00000238",
      "pfx": "000002",
      "value": "CAXX1967"
    },
    {
      "key": "00000256",
      "pfx": "000002",
      "value": "CAXX0727"
    }
  ],
  ....
]
I've tried the following:
% jq 'map(map({key: .pfx, value: {key, value}}))
| map(reduce .[] as $item ({}; {key: $item.key, value: [.value[], $item.value]} ))
| map( {key, value: .value | from_entries} )
| from_entries'
which does give me the correct result, but also prints out an error for every reduce (I believe) of
jq: error: Cannot iterate over null
The end result is
{
  "000001": {
    "00000130": "CAXX3231",
    "00000162": "CAXX4606"
  },
  "000002": {
    "00000238": "CAXX1967",
    "00000256": "CAXX0727"
  },
  ...
}
which is correct, but how can I avoid getting this stderr warning thrown as well?
I'm not sure there's enough data here to assess the source of the problem; I find it hard to believe that what you tried produces that result, since running it I get errors all the way through.
Try this filter instead:
.[0]
| to_entries
| group_by(.key[0:6])
| map({
    key: .[0].key[0:6],
    value: map(.key = .key[6:8]) | from_entries
  })
| from_entries
Given data that looks like this:
[{
  "91201544": "INXX0019",
  "90429396": "THXX0020",
  "20140367": "ITXX0043",
  "00000023": "INCD1234",
  "00000027": "INCF1423",
  "90429300": "THXX0020",
  "90429302": "THXX0024",
  "90429305": "THXY0013"
}]
Results in this:
{
  "000000": {
    "23": "INCD1234",
    "27": "INCF1423"
  },
  "201403": {
    "67": "ITXX0043"
  },
  "904293": {
    "00": "THXX0020",
    "02": "THXX0024",
    "05": "THXY0013",
    "96": "THXX0020"
  },
  "912015": {
    "44": "INXX0019"
  }
}
I understand that this is not what you are asking for but, just for reference, I think it will be much faster to do this with Redis's built-in Lua scripting.
And it turns out that it is a bit more straightforward:
for _, key in pairs(redis.call('keys', '*')) do
  local val = redis.call('get', key)
  local short_key = string.sub(key, 1, 6)   -- first 6 characters of the key, used as the hash name
  redis.call('hset', short_key, key, val)
  redis.call('del', key)
end
This will be done in place without transferring from/to Redis and converting to/from JSON.
Run it from console as:
$ redis-cli eval "$(cat script.lua)" 0
For the record, jq's group_by relies on sorting, which of course will slow things down noticeably when the input is sufficiently large. The following is about 40% faster than the group_by approach even when the input array has just 100,000 items:
def compress:
  . as $in
  | reduce keys[] as $key ({};
      $key[0:6] as $k6
      | $key[6:] as $k2
      | .[$k6] += {($k2): $in[$key]} );

.[0] | compress
Given Jeff's input, the output is identical.
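For reference, assuming the program above is saved in a file such as compress.jq (a hypothetical name) and the input is in data.json, it can be run as:
jq -f compress.jq data.json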