Using jq to "normalize" JSON

Using jq to "normalize" JSON - json

I am trying to use jq to "normalize" JSON so that given the input:
[
[
{"M":6, "C":66, "R":0.1},
{"M":6, "C":81, "R":0.9}
],
[
{"M":11, "C":94, "R":0.8},
{"M":11, "C":55, "R":0.46}
]
,
...
]
the output should be:
[
{
"M" : 6,
"X" : [{"66" : 0.1},{"81": 0.9}]
},
{
"M" : 11,
"X" : [{"94" : 0.8},{"55": 0.46}]
},
...
]
I can extract M with map({M: .[0].M but not sure how to proceed

Set X to the result of mapping over the array and creating a one-element object for each entry, with C as the key and R as the value.
map({M: .[0].M, X: map({"\(.C)": .R})})

Since map(f) is defined as [.[]|f], here is an expanded form of Santiago's solution with some additional comments:
[ # compute result array from
.[] # scaning input array
| { M: .[0].M # compute M from the .M of the first element
, X: [ # compute X array from
.[] # scanning each element's objects
| {"\(.C)":.R} # and constructing new {C:R} objects
]
}
]

[foreach .[] as $list ({};
{
"M": $list[0].M,
"X": [foreach $list[] as $item ({};
{"\($item.C)": $item.R}
)]
}
)]

Related

jq: How to replace element in an array or add it if it doesn't exist

Given the following json structure:
{
"elements": [
{
"name": "disregard",
"value": "me"
},
{
"name": "foo",
"value": "bar"
},
{
"name": "dont-edit",
"value": "me"
}
]
}
What would be the appropriate jq query to replace the value of the name: foo element or create/add the element to the array, if it doesn't already exist?

Here is a safe if pedestrian solution:
.elements
|= (map(.name) | index("foo")) as $ix
| if $ix
then .[$ix]["value"] = "BAR"
else . + [{name: "foo", value: "BAR"}]
end
You might want to abstract away the "foo" and "BAR" bits:
upsert
# Input is assumed to be an array of {name:_, value:_} objects
def upsert($foo; $bar):
(map(.name) | index($foo)) as $ix
| if $ix then .[$ix]["value"] = $bar else . + [{name: $foo, value: $bar}] end;
Usage:
.elements |= upsert("foo"; "BAR")

I have a messy JSON that I am trying to clean up using jq

I have some messy JSON.
Some nodes are not consistent across rows. In some rows these nodes are arrays and in some these are objects or strings.
The example here is only two levels, but the actual data is nested many more levels.
Example:
[
{
"id": 1,
"person": {
"addresses": {
"address": {
"city": "FL"
}
},
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
}
},
{
"id": 2,
"person": {
"addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"phones": {
"type": "mobile",
"number": "555-555-5555"
},
"email": {
"type": "work",
"email": "jane.doe#gmail.com"
}
}
}
]
I would like to make the nodes consistent so that if any the node is an array in any of the nodes, then the remaining nodes should be converted into arrays.
Once the data is consistent, it would be easier to analyze and restructure the data.
Expected result:
[
{
"id": 1,
"person": {
"addresses": [
{
"address": {
"city": "FL"
}
}
],
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
}
},
{
"id": 2,
"person": {
"addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "jane.doe#gmail.com"
}
]
}
}
]
After making the arrays consistent I would like to flatten the data so that objects are flattened out but the arrays remain arrays. This
Expected result
[
{
"id": 1,
"person.addresses": [
{
"address": {
"city": "FL"
}
}
],
"person.phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"person.email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
},
{
"id": 2,
"person.addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"person.phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"person.email": [
{
"type": "work",
"email": "jane.doe#gmail.com"
}
]
}
]
I was able to do this partially using jq. It works when there are one or two paths to be fixed, but when there are more than two it seems to break.
The approach I took
Identify all possible paths
Group and count the datatypes for each path
Identify cases where there are mixed datatypes
Sort the paths by decreasing depth
Exclude paths that do not have mixed types
Exclude paths where one of the mixed types is not an array
For each path apply the fix on the original data
This generates a stream containing N copies one for each N transformation
Extract the last copy which should contain the cleaned result
My Experiment so far
def fix(data; path):
data |= map(. | getpath(path)?=([getpath(path)?]|flatten));
def hist:
length as $l
| group_by (.)
| map( .
| (.|length) as $c
| {(.[0]):{
"count": $c,
"diff": ($l - $c)
}} )
| (length>1) as $mixed
| {
"types": .[],
"count": $l,
"mixed":$mixed
};
def summary:
map( .
| path(..) as $p
| {
path:$p,
type: getpath($p)|type,
key:$p|join(".")
}
)
| flatten
| group_by(.key)
| map( .
| {
key: .[0].key,
path: .[0].path,
depth: (.[0].path|length),
type:([(.[] | .type)]|hist)
}
)
| sort_by(.depth)
| reverse;
. as $data
| .
| summary
| map( .
| select(.type.mixed)
| select(.type.types| keys| contains(["array"]))
| .path)
| map(. as $path | $data | fix($data;$path))
| length as $l
| .[$l-1]
Only the last conversion is present. I think the $data is not getting updated by my fix and this is probably the root cause, or maybe I am just doing this wrong.
Here is e where this doesn't work

The following response first solves the first task, namely:
make the nodes consistent so that if any ... node is an array in any of the nodes, then the remaining nodes should be converted into arrays.
in a generic way:
def paths_to_array:
[paths as $path
| select( any(.[]; (getpath($path[1:] )? | type) == "array"))
| $path] ;
# If a path to a value in .[] is an array,
# then ensure all corresponding values are also arrays
def make_uniform:
reduce (paths_to_array[][1:]) as $path (.;
map( (getpath($path)? // null) as $value
| if $value and ($value|type != "array")
then setpath($path; [$value])
else . end ) ) ;
make_uniform
For the second task, let's define a utility function:
# Input is assumed to be an object:
def flatten_top_level_keys:
[ to_entries[]
| if (.value|type) == "object"
then .key as $k
| (.value|to_entries)[] as $kv
| {key: ($k + "." + $kv.key), value: $kv.value}
else .
end ]
| from_entries;
This can be used in conjunction with walk/1 to achieve recursive
flattening.
In other words, the solution to the combined problem can be obtained
by:
make_uniform
| walk( if type == "object" then flatten_top_level_keys else . end )
Efficiency
The above def of make_uniform suffers from an obvious efficiency issue in the line:
reduce (paths_to_array[][1:]) as $path (.;
Using jq's unique would be one way to resolve it, but unique is implemented using a sort, which in this case introduces another inefficiency. So let's use this old chestnut:
# bag of words
def bow(stream):
reduce stream as $word ({}; .[$word|tostring] += 1);
Now we can define make_uniform more efficiently:
def make_uniform:
def uniques(s): bow(s) | keys_unsorted[] | fromjson;
reduce uniques(paths_to_array[][1:]) as $path (.;
map( (getpath($path)? // null) as $value
| if $value and ($value|type != "array")
then setpath($path; [$value])
else . end ) ) ;

Using a bit of python along with the JQ scripts that peak had given in the solution above, I was able to clean up my messy data.
I still think that the answer given by peak is the right answer given the question I had asked. Although the solution is very good and works well, it took a lot of time to complete. The time taken depended on the number of nodes, depth of the nodes and the number or arrays it found.
I had two different files that I needed to fix and both had around 5000 rows of data. On one of them, the jq script took about 6 hours to complete and I had to terminate the other one after 16 hours.
The solution below builds on the original solution by using python and jq together to process some of the steps in parallel. Finding paths to arrays is still the most time-consuming part.
setup
I split the scripts into the following
# paths_to_array.jq
def paths_to_array:
[paths as $path
| select( any(.[]; (getpath($path[1:] )? | type) == "array"))
| $path[1:]]
| unique
| map(. | select([.[]|type]|contains(["number"])|not));
paths_to_array
Minor adjustment to exclude any paths that had arrays in between. I just wanted all paths that end with arrays.
I also excluded the topmost array indices from the path to reduce the number of paths
# flatten.jq
def update_array($path):
(getpath($path)? // null) as $value
| (if $value and ($value|type != "array")
then . as $data | (try (setpath($path; [$value]))
catch $data)
else . end);
def make_uniform($paths):
map( .
| reduce($paths[]) as $path (
. ; update_array($path)
)
);
# Input is assumed to be an object:
def flatten_top_level_keys:
[ to_entries[]
| if (.value|type) == "object"
then .key as $k
| (.value|to_entries)[] as $kv
| {key: ($k + "." + $kv.key), value: $kv.value}
else .
end ]
| from_entries;
I had to add the walk function from jq builtins because the jq library for pythonn didn't include it.
I split the make_uniform function so that I could understand the script better and I added a try catch because of an issue I encountered when the path included array indices in between. Otherwise this is pretty much same as the code from the original solution
# apply.jq
make_uniform({path})
| map( .
| walk( if type == "object" then
flatten_top_level_keys
else . end ))
I had to split this because I was injecting the data for the path using the {path} and when this was in the full script I got an error when using .format() within python.
import math
import os
import JSON
from jq import jq
import multiprocessing as mp
def get_script(filename):
"""Utility function to read the jq script"""
with open(filename, "r") as f:
script = f.read()
return script
def get_data(filename):
"""Utility function to read json data from file"""
with open(filename, 'r') as f:
data = json.load(f)
return data
def transform(script, data):
"""Wrapper to be used by the parallel processor"""
return jq(script).transform(data)
def parallel_jq(script, data, rows=100, processes=8):
"""Executes the JQ script on data in parallel chuncks specified by rows"""
pool = mp.Pool(processes=processes)
size = math.ceil(len(data) / rows)
segments = [pool.apply_async(transform,
args=(script,
data[index*rows:(index+1)*rows]))
for index in range(size) ]
result = []
for seg in segments:
result.extend(seg.get())
return result
def get_paths_to_arrays(data, dest="data"):
"""Obtain the paths to arrays"""
filename = os.path.join(dest, "paths_to_arrays.json")
if os.path.isfile(filename):
paths = get_data(filename)
else:
script = get_script('jq/paths_to_array.jq')
paths = parallel_jq(script, data)
paths = jq("unique|sort_by(length)|reverse").transform(paths)
with open(filename, 'w') as f:
json.dump(paths, f, indent=2)
return paths
def flatten(data, paths, dest="data"):
"""Make the arrays uniform and flatten the result"""
filename = os.path.join(dest, "uniform_flat.json")
script = get_script('jq/flatten.jq')
script += "\n" + get_script('jq/apply.jq').format(path=json.dumps(paths))
data = parallel_jq(script, data)
with open(filename, 'w') as f:
json.dump(data, f, indent=2)
if __name__ == '__main__':
entity = 'messy_data'
sourcefile = os.path.join('data', entity+'.json')
dest = os.path.join('data', entity)
data = get_data(sourcefile)
# Finding paths with arrays
paths = get_paths_to_arrays(data, dest)
# Fixing array paths and flattening
flatten(data, paths, dest)
As I mentioned before the get_paths_to_arrays does take quite long even with parallel processing.
get_paths_to_arrays took 3811.834 seconds => Just over an hour.
flatten took 38 seconds

How to substitute strings in JSON with jq based on the input

Given the input in this form
[
{
"DIR" : "/foo/bar/a/b/c",
"OUT" : "/foo/bar/x/y/z",
"ARG" : [ "aaa", "bbb", "/foo/bar/a", "BASE=/foo/bar" ]
},
{
"DIR" : "/foo/baz/d/e/f",
"OUT" : "/foo/baz/x/y/z",
"ARG" : [ "ccc", "ddd", "/foo/baz/b", "BASE=/foo/baz" ]
},
{
"foo" : "bar"
}
]
I'm trying to find out how to make jq transform that into this:
[
{
"DIR" : "BASE/a/b/c",
"OUT" : "BASE/x/y/z",
"ARG" : [ "aaa", "bbb", "BASE/a", "BASE=/foo/bar" ]
},
{
"DIR" : "BASE/d/e/f",
"OUT" : "BASE/x/y/z",
"ARG" : [ "ccc", "ddd", "BASE/b", "BASE=/foo/baz" ]
},
{
"foo" : "bar"
}
]
In other words, objects having an "ARG" array, containing a string that starts with "BASE=" should use the string after "BASE=", e.g. "/foo" to substitute other string values that start with "/foo" (except the "BASE=/foo" which should remain unchanged")
I'm not even close to finding a solution myself, and at this point I'm unsure that jq alone will do the job.

With jq:
#!/usr/bin/jq -f
# fix-base.jq
def fix_base:
(.ARG[] | select(startswith("BASE=")) | split("=")[1]) as $base
| .DIR?|="BASE"+ltrimstr($base)
| .OUT?|="BASE"+ltrimstr($base)
| .ARG|=map(if startswith($base) then "BASE"+ltrimstr($base) else . end)
;
map(if .ARG? then fix_base else . end)
You can run it like this:
jq -f fix-base.jq input.json
or make it an executable like this:
chmod +x fix-base.jq
./fix-base.jq input.json

Don't worry, jq alone will do the job:
jq 'def sub_base($base): if (startswith("BASE") | not) then sub($base; "BASE") else . end;
map(if .["ARG"] then ((.ARG[] | select(startswith("BASE=")) | split("=")[1]) as $base
| to_entries
| map(if (.value | type == "string") then .value |= sub_base($base)
else .value |= map(sub_base($base)) end)
| from_entries)
else . end)' input.json
The output:
[
{
"DIR": "BASE/a/b/c",
"OUT": "BASE/x/y/z",
"ARG": [
"aaa",
"bbb",
"BASE/a",
"BASE=/foo/bar"
]
},
{
"DIR": "BASE/d/e/f",
"OUT": "BASE/x/y/z",
"ARG": [
"ccc",
"ddd",
"BASE/b",
"BASE=/foo/baz"
]
},
{
"foo": "bar"
}
]

Some helper functions make the going much easier. The first is generic and worthy perhaps of your standard library:
# Returns the integer index, $i, corresponding to the first element
# at which f is truthy, else null
def indexof(f):
label $out
| foreach .[] as $x (null; .+1;
if ($x|f) then (.-1, break $out) else empty end) // null;
# Change the string $base to BASE using gsub
def munge($base):
if type == "string" and (test("^BASE=")|not) then gsub($base; "BASE")
elif type=="array" then map(munge($base))
elif type=="object" then map_values(munge($base))
else .
end;
And now the easy part:
map(if has("ARG")
then (.ARG|indexof(test("^BASE="))) as $ix
| if $ix
then (.ARG[$ix]|sub("^BASE=";"")) as $base | munge($base)
else . end
else . end )
Some points to note:
You may wish to use sub rather than gsub in munge;
The above solution assumes you want to make the change in all keys, not just "DIR", "OUT", and "ARG"
The above solution allows specifications of BASE that include one or more occurrences of "=".

jq: How can I combine data from duplicate keys

I have a fairly complex JSON data structure that I've managed to use jq to filter down to certain keys and their values. I need to combine the results though, so duplicate keys have only one array of values.
e.g.
{
"1.NBT.B": [
{
"id": 545
},
{
"id": 546
}
]
},
{
"1.NBT.B": [
{
"id": 1281
},
{
"id": 1077
}
]
}
would result in
{
"1.NBT.B": [
{
"id": 545
},
{
"id": 546
},
{
"id": 1281
},
{
"id": 1077
}
]
},
...
or even better:
[{"1.NBT.B": [545, 546, 1281, 1077]}, ...]
I need to do it without having to put in the key ("1.NBT.B") directly, since there are hundreds of these keys. I think what has me most stuck is that the objects here aren't named -- the keys are not the same between objects.
Something like this only gives me the 2nd set of ids, completing skipping the first:
reduce .[] as $item ({}; . + $item)

Part 1
The following jq function combines an array of objects in the manner envisioned by the first part of the question.
# Given an array of objects, produce a single object with an array at
# every key, the array at each key, k, being formed from all the values at k.
def merge:
reduce .[] as $o ({}; reduce ($o|keys)[] as $key (.; .[$key] += $o[$key] ));
With this definition together with the line:
merge
in a file, and with the example input modified to be a valid JSON array,
the result is:
{
"1.NBT.B": [
{
"id": 545
},
{
"id": 546
},
{
"id": 1281
},
{
"id": 1077
}
]
}
Part 2
With merge as defined above, the filter:
merge | with_entries( .value |= map(.id) )
produces:
{
"1.NBT.B": [
545,
546,
1281,
1077
]
}

How to simplify aws DynamoDB query JSON output from the command line?

I'm working with The AWS Command Line Interface for DynamoDB.
When we query an item, we get a very detailed JSON output. You get something like this (it has been built from the get-item in order to be almost exhaustive (the NULL type has been omitted) aws command line help:
{
"Count": 1,
"Items": [
{
"Id": {
"S": "app1"
},
"Parameters": {
"M": {
"nfs": {
"M": {
"IP" : {
"S" : "172.16.0.178"
},
"defaultPath": {
"S": "/mnt/ebs/"
},
"key": {
"B": "dGhpcyB0ZXh0IGlzIGJhc2U2NC1lbmNvZGVk"
},
"activated": {
"BOOL": true
}
}
},
"ws" : {
"M" : {
"number" : {
"N" : "5"
},
"values" : {
"L" : [
{ "S" : "12253456346346"},
{ "S" : "23452353463464"},
{ "S" : "23523453461232"},
{ "S" : "34645745675675"},
{ "S" : "46456745757575"}
]
}
}
}
}
},
"Oldtypes": {
"typeSS" : {"SS" : ["foo", "bar", "baz"]},
"typeNS" : {"NS" : ["0", "1", "2", "3", "4", "5"]},
"typeBS" : {"BS" : ["VGVybWluYXRvcgo=", "VGVybWluYXRvciAyOiBKdWRnbWVudCBEYXkK", "VGVybWluYXRvciAzOiBSaXNlIG9mIHRoZSBNYWNoaW5lcwo=", "VGVybWluYXRvciA0OiBTYWx2YXRpb24K","VGVybWluYXRvciA1OiBHZW5lc2lzCg=="]}
}
}
],
"ScannedCount": 1,
"ConsumedCapacity": null
}
Is there any way to get a simpler output for the Items part? Like this:
{
"ConsumedCapacity": null,
"Count": 1,
"Items": [
{
"Id": "app1",
"Parameters": {
"nfs": {
"IP": "172.16.0.178",
"activated": true,
"defaultPath": "/mnt/ebs/",
"key": "dGhpcyB0ZXh0IGlzIGJhc2U2NC1lbmNvZGVk"
},
"ws": {
"number": 5,
"values": ["12253456346346","23452353463464","23523453461232","34645745675675","46456745757575"]
}
},
"Oldtypes": {
"typeBS": ["VGVybWluYXRvcgo=", "VGVybWluYXRvciAyOiBKdWRnbWVudCBEYXkK", "VGVybWluYXRvciAzOiBSaXNlIG9mIHRoZSBNYWNoaW5lcwo=", "VGVybWluYXRvciA0OiBTYWx2YXRpb24K", "VGVybWluYXRvciA1OiBHZW5lc2lzCg=="],
"typeNS": [0, 1, 2, 3, 4, 5],
"typeSS": ["foo","bar","baz"]
}
}
],
"ScannedCount": 1
}
There is nothing helpful in the dynamodb - AWS CLI 1.7.10 documentation.
We must get the result from the command line. I'm willing to use other command line tools like jq if necessary, but such a jq mapping appears to complicated to me.
Update 1: jq based solution (with help from DanielH's answer)
With jq it is easy, but not quite pretty, you can do something like:
$> aws dynamodb query --table-name ConfigCatalog --key-conditions '{ "Id" : {"AttributeValueList": [{"S":"app1"}], "ComparisonOperator": "EQ"}}' | jq -r '.Items[0].Parameters.M."nfs#IP".S'
Result will be: 172.16.0.178
The jq -r option gives you a raw output.
Update 2: jq based solution (with help from #jeff-mercado)
Here is an updated and commented version of Jeff Mercado jq function to unmarshall DynamoDB output. It will give you the expected output:
$> cat unmarshal_dynamodb.jq
def unmarshal_dynamodb:
# DynamoDB string type
(objects | .S)
# DynamoDB blob type
// (objects | .B)
# DynamoDB number type
// (objects | .N | strings | tonumber)
# DynamoDB boolean type
// (objects | .BOOL)
# DynamoDB map type, recursion on each item
// (objects | .M | objects | with_entries(.value |= unmarshal_dynamodb))
# DynamoDB list type, recursion on each item
// (objects | .L | arrays | map(unmarshal_dynamodb))
# DynamoDB typed list type SS, string set
// (objects | .SS | arrays | map(unmarshal_dynamodb))
# DynamoDB typed list type NS, number set
// (objects | .NS | arrays | map(tonumber))
# DynamoDB typed list type BS, blob set
// (objects | .BS | arrays | map(unmarshal_dynamodb))
# managing others DynamoDB output entries: "Count", "Items", "ScannedCount" and "ConsumedCapcity"
// (objects | with_entries(.value |= unmarshal_dynamodb))
// (arrays | map(unmarshal_dynamodb))
# leaves values
// .
;
unmarshal_dynamodb
If you save the DynamoDB query output to a file, lets say ddb-query-result.json, you can execute to get desired result:
$> jq -f unmarshal_dynamodb.jq ddb-query-result.json

You can decode the values recursively with a well crafted function. It looks like the key names correspond to a type:
S -> string
N -> number
M -> map
Handle each of the cases you want to decode if possible, otherwise filter it out. You can make use of the various type filters and the alternative operator to do so.
$ cat input.json
{
"Count": 1,
"Items": [
{
"Id": { "S": "app1" },
"Parameters": {
"M": {
"nfs#IP": { "S": "192.17.0.13" },
"maxCount": { "N": "1" },
"nfs#defaultPath": { "S": "/mnt/ebs/" }
}
}
}
],
"ScannedCount": 1,
"ConsumedCapacity": null
}
$ cat ~/.jq
def decode_ddb:
def _sprop($key): select(keys == [$key])[$key]; # single property objects only
((objects | { value: _sprop("S") }) # string (from string)
// (objects | { value: _sprop("NULL") | null }) # null (from boolean)
// (objects | { value: _sprop("B") }) # blob (from string)
// (objects | { value: _sprop("N") | tonumber }) # number (from string)
// (objects | { value: _sprop("BOOL") }) # boolean (from boolean)
// (objects | { value: _sprop("M") | map_values(decode_ddb) }) # map (from object)
// (objects | { value: _sprop("L") | map(decode_ddb) }) # list (from encoded array)
// (objects | { value: _sprop("SS") }) # string set (from string array)
// (objects | { value: _sprop("NS") | map(tonumber) }) # number set (from string array)
// (objects | { value: _sprop("BS") }) # blob set (from string array)
// (objects | { value: map_values(decode_ddb) }) # all other non-conforming objects
// (arrays | { value: map(decode_ddb) }) # all other non-conforming arrays
// { value: . }).value # everything else
;
$ jq 'decode_ddb' input.json
{
"Count": 1,
"Items": [
{
"Id": "app1",
"Parameters": {
"nfs#IP": "192.17.0.13",
"maxCount": 1,
"nfs#defaultPath": "/mnt/ebs/"
}
}
],
"ScannedCount": 1,
"ConsumedCapacity": null
}

Another way to achieve the post's goal would be to use a node.js extension like node-dynamodb or dynamodb-marshaler and build a node command line tool.
Interesting tutorial to build a node.js command line application with commander package: Creating Your First Node.js Command-line Application
Here's a quick and dirty oneliner that reads one record from stdin and prints it in simplified form:
node -e 'console.log(JSON.stringify(require("aws-sdk").DynamoDB.Converter.unmarshall(JSON.parse(require("fs").readFileSync(0, "utf-8")))))'

Here's an updated version of the jq solution that can handle null values.
$> cat unmarshal_dynamodb.jq
def unmarshal_dynamodb:
# null
walk( if type == "object" and .NULL then . |= null else . end ) |
# DynamoDB string type
(objects | .S)
# DynamoDB blob type
// (objects | .B)
# DynamoDB number type
// (objects | .N | strings | tonumber)
# DynamoDB boolean type
// (objects | .BOOL)
# DynamoDB map type, recursion on each item
// (objects | .M | objects | with_entries(.value |= unmarshal_dynamodb))
# DynamoDB list type, recursion on each item
// (objects | .L | arrays | map(unmarshal_dynamodb))
# DynamoDB typed list type SS, string set
// (objects | .SS | arrays | map(unmarshal_dynamodb))
# DynamoDB typed list type NS, number set
// (objects | .NS | arrays | map(tonumber))
# DynamoDB typed list type BS, blob set
// (objects | .BS | arrays | map(unmarshal_dynamodb))
# managing others DynamoDB output entries: "Count", "Items", "ScannedCount" and "ConsumedCapcity"
// (objects | with_entries(.value |= unmarshal_dynamodb))
// (arrays | map(unmarshal_dynamodb))
# leaves values
// .
;
unmarshal_dynamodb
$> jq -f unmarshal_dynamodb.jq ddb-query-result.json
Credit to #jeff-mercado and #herve for the original version.

As far as I know, there is no other output like the "verbose" one you've posted. Therefore I think, you can't avoid intermediate tools like jq oder sed
There are several proposals in this post for converting the raw dynamo data:
Export data from DynamoDB
Maybe you can adapt one of these scripts in conjunction with jq or sed

Here is another approach. This may be a little brutal but it shows the basic idea.
def unwanted: ["B","BOOL","M","S","L","BS","SS"];
def fixpath(p): [ p[] | select( unwanted[[.]]==[] ) ];
def fixnum(p;v):
if p[-2]=="NS" then [p[:-2]+p[-1:],(v|tonumber)]
elif p[-1]=="N" then [p[:-1], (v|tonumber)]
else [p,v] end;
reduce (tostream|select(length==2)) as [$p,$v] (
{}
; fixnum(fixpath($p);$v) as [$fp,$fv]
| setpath($fp;$fv)
)
Try it online!
Sample Run (assuming filter in filter.jq and data in data.json)
$ jq -M -f filter.jq data.json
{
"ConsumedCapacity": null,
"Count": 1,
"Items": [
{
"Id": "app1",
"Oldtypes": {
"typeBS": [
"VGVybWluYXRvcgo=",
"VGVybWluYXRvciAyOiBKdWRnbWVudCBEYXkK",
"VGVybWluYXRvciAzOiBSaXNlIG9mIHRoZSBNYWNoaW5lcwo=",
"VGVybWluYXRvciA0OiBTYWx2YXRpb24K",
"VGVybWluYXRvciA1OiBHZW5lc2lzCg=="
],
"typeNS": [
0,
1,
2,
3,
4,
5
],
"typeSS": [
"foo",
"bar",
"baz"
]
},
"Parameters": {
"nfs": {
"IP": "172.16.0.178",
"activated": true,
"defaultPath": "/mnt/ebs/",
"key": "dGhpcyB0ZXh0IGlzIGJhc2U2NC1lbmNvZGVk"
},
"ws": {
"number": 5,
"values": [
"12253456346346",
"23452353463464",
"23523453461232",
"34645745675675",
"46456745757575"
]
}
}
}
],
"ScannedCount": 1
}

Here is a script in node to do this.
I named the file reformat.js but you can call it whatever you want
'use strict';
/**
* This script will parse the AWS dynamo CLI JSON response into JS.
* This parses out the type keys in the objects.
*/
const fs = require('fs');
const rawData = fs.readFileSync('response.json'); // Import the raw response from the dynamoDB CLI query
const response = JSON.parse(rawData); // Parse to JS to make it easier to work with.
function shallowFormatData(data){
// Loop through the object and replace the Type key with the value.
for(const key in data){
const innerRawObject = data[key]
const innerKeys = Object.keys(innerRawObject)
innerKeys.forEach(innerKey => {
const innerFormattedObject = innerRawObject[innerKey]
if(typeof innerFormattedObject == 'object'){
data[key] = shallowFormatData(innerFormattedObject) // Recursively call formatData if there are nested objects
}else{
// Null items come back with a type of "NULL" and value of true. we want to set the value to null if the type is "NULL"
data[key] = innerKey == 'NULL' ? null : innerFormattedObject
}
})
}
return data
}
// this only gets the Items and not the meta data.
const result = response.Items.map(item => {
return shallowFormatData(item)
})
console.dir(result, {'maxArrayLength': null}); // There is a default limit on how big a console.log can be, this removes that limit.
Step 1) run your dynamoDB query via the CLI and save it to a JSON file. To save the response from the CLI just add > somefile.json. For convenience, I saved this in the same directory as my reformat file
// Example: Run in CLI
$ aws dynamodb query --table-name stage_requests-service_FoxEvents \
--key-condition-expression "PK = :v1" \
--expression-attribute-values file://expression-attributes.json > response.json
expression-attributes.json
{
":v1": {"S": "SOMEVAL"}
}
If you need more information on how I queried DynamoDB look at these examples in the documentation https://docs.aws.amazon.com/cli/latest/reference/dynamodb/query.html#examples
Now that you have a JSON file of the data you need to reformat run the format.js script from your terminal
Step 2)
// Run this in your terminal
$ node reformat.js > formatted.js
You should have a clean JS Object output if you want a JSON object output just put a JSON.stringify(result) in the console.dir at the end of the script

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Using jq to "normalize" JSON - json

Set X to the result of mapping over the array and creating a one-element object for each entry, with C as the key and R as the value. map({M: .[0].M, X: map({"\(.C)": .R})})

[foreach .[] as $list ({}; { "M": $list[0].M, "X": [foreach $list[] as $item ({}; {"\($item.C)": $item.R} )] } )]

Related

jq: How to replace element in an array or add it if it doesn't exist

I have a messy JSON that I am trying to clean up using jq

How to substitute strings in JSON with jq based on the input

jq: How can I combine data from duplicate keys

How to simplify aws DynamoDB query JSON output from the command line?

Categories

Resources