I'm working with The AWS Command Line Interface for DynamoDB.
When we query an item, we get a very detailed JSON output. You get something like this (it has been built from the get-item in order to be almost exhaustive (the NULL type has been omitted) aws command line help:
{
"Count": 1,
"Items": [
{
"Id": {
"S": "app1"
},
"Parameters": {
"M": {
"nfs": {
"M": {
"IP" : {
"S" : "172.16.0.178"
},
"defaultPath": {
"S": "/mnt/ebs/"
},
"key": {
"B": "dGhpcyB0ZXh0IGlzIGJhc2U2NC1lbmNvZGVk"
},
"activated": {
"BOOL": true
}
}
},
"ws" : {
"M" : {
"number" : {
"N" : "5"
},
"values" : {
"L" : [
{ "S" : "12253456346346"},
{ "S" : "23452353463464"},
{ "S" : "23523453461232"},
{ "S" : "34645745675675"},
{ "S" : "46456745757575"}
]
}
}
}
}
},
"Oldtypes": {
"typeSS" : {"SS" : ["foo", "bar", "baz"]},
"typeNS" : {"NS" : ["0", "1", "2", "3", "4", "5"]},
"typeBS" : {"BS" : ["VGVybWluYXRvcgo=", "VGVybWluYXRvciAyOiBKdWRnbWVudCBEYXkK", "VGVybWluYXRvciAzOiBSaXNlIG9mIHRoZSBNYWNoaW5lcwo=", "VGVybWluYXRvciA0OiBTYWx2YXRpb24K","VGVybWluYXRvciA1OiBHZW5lc2lzCg=="]}
}
}
],
"ScannedCount": 1,
"ConsumedCapacity": null
}
Is there any way to get a simpler output for the Items part? Like this:
{
"ConsumedCapacity": null,
"Count": 1,
"Items": [
{
"Id": "app1",
"Parameters": {
"nfs": {
"IP": "172.16.0.178",
"activated": true,
"defaultPath": "/mnt/ebs/",
"key": "dGhpcyB0ZXh0IGlzIGJhc2U2NC1lbmNvZGVk"
},
"ws": {
"number": 5,
"values": ["12253456346346","23452353463464","23523453461232","34645745675675","46456745757575"]
}
},
"Oldtypes": {
"typeBS": ["VGVybWluYXRvcgo=", "VGVybWluYXRvciAyOiBKdWRnbWVudCBEYXkK", "VGVybWluYXRvciAzOiBSaXNlIG9mIHRoZSBNYWNoaW5lcwo=", "VGVybWluYXRvciA0OiBTYWx2YXRpb24K", "VGVybWluYXRvciA1OiBHZW5lc2lzCg=="],
"typeNS": [0, 1, 2, 3, 4, 5],
"typeSS": ["foo","bar","baz"]
}
}
],
"ScannedCount": 1
}
There is nothing helpful in the dynamodb - AWS CLI 1.7.10 documentation.
We must get the result from the command line. I'm willing to use other command line tools like jq if necessary, but such a jq mapping appears to complicated to me.
Update 1: jq based solution (with help from DanielH's answer)
With jq it is easy, but not quite pretty, you can do something like:
$> aws dynamodb query --table-name ConfigCatalog --key-conditions '{ "Id" : {"AttributeValueList": [{"S":"app1"}], "ComparisonOperator": "EQ"}}' | jq -r '.Items[0].Parameters.M."nfs#IP".S'
Result will be: 172.16.0.178
The jq -r option gives you a raw output.
Update 2: jq based solution (with help from #jeff-mercado)
Here is an updated and commented version of Jeff Mercado jq function to unmarshall DynamoDB output. It will give you the expected output:
$> cat unmarshal_dynamodb.jq
def unmarshal_dynamodb:
# DynamoDB string type
(objects | .S)
# DynamoDB blob type
// (objects | .B)
# DynamoDB number type
// (objects | .N | strings | tonumber)
# DynamoDB boolean type
// (objects | .BOOL)
# DynamoDB map type, recursion on each item
// (objects | .M | objects | with_entries(.value |= unmarshal_dynamodb))
# DynamoDB list type, recursion on each item
// (objects | .L | arrays | map(unmarshal_dynamodb))
# DynamoDB typed list type SS, string set
// (objects | .SS | arrays | map(unmarshal_dynamodb))
# DynamoDB typed list type NS, number set
// (objects | .NS | arrays | map(tonumber))
# DynamoDB typed list type BS, blob set
// (objects | .BS | arrays | map(unmarshal_dynamodb))
# managing others DynamoDB output entries: "Count", "Items", "ScannedCount" and "ConsumedCapcity"
// (objects | with_entries(.value |= unmarshal_dynamodb))
// (arrays | map(unmarshal_dynamodb))
# leaves values
// .
;
unmarshal_dynamodb
If you save the DynamoDB query output to a file, lets say ddb-query-result.json, you can execute to get desired result:
$> jq -f unmarshal_dynamodb.jq ddb-query-result.json
You can decode the values recursively with a well crafted function. It looks like the key names correspond to a type:
S -> string
N -> number
M -> map
Handle each of the cases you want to decode if possible, otherwise filter it out. You can make use of the various type filters and the alternative operator to do so.
$ cat input.json
{
"Count": 1,
"Items": [
{
"Id": { "S": "app1" },
"Parameters": {
"M": {
"nfs#IP": { "S": "192.17.0.13" },
"maxCount": { "N": "1" },
"nfs#defaultPath": { "S": "/mnt/ebs/" }
}
}
}
],
"ScannedCount": 1,
"ConsumedCapacity": null
}
$ cat ~/.jq
def decode_ddb:
def _sprop($key): select(keys == [$key])[$key]; # single property objects only
((objects | { value: _sprop("S") }) # string (from string)
// (objects | { value: _sprop("NULL") | null }) # null (from boolean)
// (objects | { value: _sprop("B") }) # blob (from string)
// (objects | { value: _sprop("N") | tonumber }) # number (from string)
// (objects | { value: _sprop("BOOL") }) # boolean (from boolean)
// (objects | { value: _sprop("M") | map_values(decode_ddb) }) # map (from object)
// (objects | { value: _sprop("L") | map(decode_ddb) }) # list (from encoded array)
// (objects | { value: _sprop("SS") }) # string set (from string array)
// (objects | { value: _sprop("NS") | map(tonumber) }) # number set (from string array)
// (objects | { value: _sprop("BS") }) # blob set (from string array)
// (objects | { value: map_values(decode_ddb) }) # all other non-conforming objects
// (arrays | { value: map(decode_ddb) }) # all other non-conforming arrays
// { value: . }).value # everything else
;
$ jq 'decode_ddb' input.json
{
"Count": 1,
"Items": [
{
"Id": "app1",
"Parameters": {
"nfs#IP": "192.17.0.13",
"maxCount": 1,
"nfs#defaultPath": "/mnt/ebs/"
}
}
],
"ScannedCount": 1,
"ConsumedCapacity": null
}
Another way to achieve the post's goal would be to use a node.js extension like node-dynamodb or dynamodb-marshaler and build a node command line tool.
Interesting tutorial to build a node.js command line application with commander package: Creating Your First Node.js Command-line Application
Here's a quick and dirty oneliner that reads one record from stdin and prints it in simplified form:
node -e 'console.log(JSON.stringify(require("aws-sdk").DynamoDB.Converter.unmarshall(JSON.parse(require("fs").readFileSync(0, "utf-8")))))'
Here's an updated version of the jq solution that can handle null values.
$> cat unmarshal_dynamodb.jq
def unmarshal_dynamodb:
# null
walk( if type == "object" and .NULL then . |= null else . end ) |
# DynamoDB string type
(objects | .S)
# DynamoDB blob type
// (objects | .B)
# DynamoDB number type
// (objects | .N | strings | tonumber)
# DynamoDB boolean type
// (objects | .BOOL)
# DynamoDB map type, recursion on each item
// (objects | .M | objects | with_entries(.value |= unmarshal_dynamodb))
# DynamoDB list type, recursion on each item
// (objects | .L | arrays | map(unmarshal_dynamodb))
# DynamoDB typed list type SS, string set
// (objects | .SS | arrays | map(unmarshal_dynamodb))
# DynamoDB typed list type NS, number set
// (objects | .NS | arrays | map(tonumber))
# DynamoDB typed list type BS, blob set
// (objects | .BS | arrays | map(unmarshal_dynamodb))
# managing others DynamoDB output entries: "Count", "Items", "ScannedCount" and "ConsumedCapcity"
// (objects | with_entries(.value |= unmarshal_dynamodb))
// (arrays | map(unmarshal_dynamodb))
# leaves values
// .
;
unmarshal_dynamodb
$> jq -f unmarshal_dynamodb.jq ddb-query-result.json
Credit to #jeff-mercado and #herve for the original version.
As far as I know, there is no other output like the "verbose" one you've posted. Therefore I think, you can't avoid intermediate tools like jq oder sed
There are several proposals in this post for converting the raw dynamo data:
Export data from DynamoDB
Maybe you can adapt one of these scripts in conjunction with jq or sed
Here is another approach. This may be a little brutal but it shows the basic idea.
def unwanted: ["B","BOOL","M","S","L","BS","SS"];
def fixpath(p): [ p[] | select( unwanted[[.]]==[] ) ];
def fixnum(p;v):
if p[-2]=="NS" then [p[:-2]+p[-1:],(v|tonumber)]
elif p[-1]=="N" then [p[:-1], (v|tonumber)]
else [p,v] end;
reduce (tostream|select(length==2)) as [$p,$v] (
{}
; fixnum(fixpath($p);$v) as [$fp,$fv]
| setpath($fp;$fv)
)
Try it online!
Sample Run (assuming filter in filter.jq and data in data.json)
$ jq -M -f filter.jq data.json
{
"ConsumedCapacity": null,
"Count": 1,
"Items": [
{
"Id": "app1",
"Oldtypes": {
"typeBS": [
"VGVybWluYXRvcgo=",
"VGVybWluYXRvciAyOiBKdWRnbWVudCBEYXkK",
"VGVybWluYXRvciAzOiBSaXNlIG9mIHRoZSBNYWNoaW5lcwo=",
"VGVybWluYXRvciA0OiBTYWx2YXRpb24K",
"VGVybWluYXRvciA1OiBHZW5lc2lzCg=="
],
"typeNS": [
0,
1,
2,
3,
4,
5
],
"typeSS": [
"foo",
"bar",
"baz"
]
},
"Parameters": {
"nfs": {
"IP": "172.16.0.178",
"activated": true,
"defaultPath": "/mnt/ebs/",
"key": "dGhpcyB0ZXh0IGlzIGJhc2U2NC1lbmNvZGVk"
},
"ws": {
"number": 5,
"values": [
"12253456346346",
"23452353463464",
"23523453461232",
"34645745675675",
"46456745757575"
]
}
}
}
],
"ScannedCount": 1
}
Here is a script in node to do this.
I named the file reformat.js but you can call it whatever you want
'use strict';
/**
* This script will parse the AWS dynamo CLI JSON response into JS.
* This parses out the type keys in the objects.
*/
const fs = require('fs');
const rawData = fs.readFileSync('response.json'); // Import the raw response from the dynamoDB CLI query
const response = JSON.parse(rawData); // Parse to JS to make it easier to work with.
function shallowFormatData(data){
// Loop through the object and replace the Type key with the value.
for(const key in data){
const innerRawObject = data[key]
const innerKeys = Object.keys(innerRawObject)
innerKeys.forEach(innerKey => {
const innerFormattedObject = innerRawObject[innerKey]
if(typeof innerFormattedObject == 'object'){
data[key] = shallowFormatData(innerFormattedObject) // Recursively call formatData if there are nested objects
}else{
// Null items come back with a type of "NULL" and value of true. we want to set the value to null if the type is "NULL"
data[key] = innerKey == 'NULL' ? null : innerFormattedObject
}
})
}
return data
}
// this only gets the Items and not the meta data.
const result = response.Items.map(item => {
return shallowFormatData(item)
})
console.dir(result, {'maxArrayLength': null}); // There is a default limit on how big a console.log can be, this removes that limit.
Step 1) run your dynamoDB query via the CLI and save it to a JSON file. To save the response from the CLI just add > somefile.json. For convenience, I saved this in the same directory as my reformat file
// Example: Run in CLI
$ aws dynamodb query --table-name stage_requests-service_FoxEvents \
--key-condition-expression "PK = :v1" \
--expression-attribute-values file://expression-attributes.json > response.json
expression-attributes.json
{
":v1": {"S": "SOMEVAL"}
}
If you need more information on how I queried DynamoDB look at these examples in the documentation https://docs.aws.amazon.com/cli/latest/reference/dynamodb/query.html#examples
Now that you have a JSON file of the data you need to reformat run the format.js script from your terminal
Step 2)
// Run this in your terminal
$ node reformat.js > formatted.js
You should have a clean JS Object output if you want a JSON object output just put a JSON.stringify(result) in the console.dir at the end of the script
Related
I'm trying to filter for items from a list that contain the values of other properties in the same object.
Example of the data.json:
{ result:
[
{ name: 'foo', text: 'my name is foo' },
{ name: 'bar', text: 'my name is baz' },
]
}
What I've tried:
cat data.json | jq '.result[] | select(.text | ascii_downcase | contains(.name))'
However it throws the following error:
Cannot index string with string
Is there a way in jq to select based on a dynamic property rather than a string literal?
Assuming your JSON looks more like this (strings in double quotes, no comma after the last array item):
{ "result":
[
{ "name": "foo", "text": "my name is foo" },
{ "name": "bar", "text": "my name is baz" }
]
}
When going into .text, you have lost the context from which you can access .name. You could save it (or directly the desired value) in a variable and reference it when needed:
jq '.result[] | select(.name as $name | .text | ascii_downcase | contains($name))'
{
"name": "foo",
"text": "my name is foo"
}
Demo
I have 2 JSON files. I would like to use jq to take the value of "capital" from File 2 and merge it with File 1 for each element where the same "name"-value pair occurs. Otherwise, the element from File 2 should not occur in the output. If there is no "name"-value pair for an element in File 1, it should have empty text for "capital."
File 1:
{
"countries":[
{
"name":"china",
"continent":"asia"
},
{
"name":"france",
"continent":"europe"
}
]
}
File 2:
{
"countries":[
{
"name":"china",
"capital":"beijing"
},
{
"name":"argentina",
"capital":"buenos aires"
}
]
}
Desired result:
{
"countries":[
{
"name":"china",
"continent":"asia",
"capital":"beijing"
},
{
"name":"france",
"continent":"europe",
"capital":""
}
]
}
You could first construct a dictionary from File2, and then perform the update, e.g. like so:
jq --argfile dict File2.json '
($dict.countries | map( {(.name): .capital}) | add) as $capitals
| .countries |= map( .capital = ($capitals[.name] // ""))
' File2.json
From a JSON-esque perspective, it would probably be better to use null for missing values; in that case, you could simplify the above by omitting // "".
Using INDEX/2
If your jq has INDEX/2, then the $capitals dictionary could be constructed using the expression:
INDEX($dict.countries[]; .name) | map_values(.capital)
Using INDEX makes the intention clearer, but if efficiency were a major concern, you'd probably be better off using reduce explicitly:
reduce $dict.countries[] as $c ({}; . + ($c | {(.name): .capital}))
One way:
$ jq --slurpfile file2 file2.json '
{ countries:
[ .countries[] |
. as $curr |
$curr + { capital: (($file2[0].countries[] | select(.name == $curr.name) | .capital) // "") }
]
}' file1.json
{
"countries": [
{
"name": "china",
"continent": "asia",
"capital": "beijing"
},
{
"name": "france",
"continent": "europe",
"capital": ""
}
]
}
An alternative:
$ jq -n '{ countries: ([inputs] | map(.countries) | flatten | group_by(.name) |
map(select(.[] | has("continent")) | add | .capital //= ""))
}' file[12].json
I have some messy JSON.
Some nodes are not consistent across rows. In some rows these nodes are arrays and in some these are objects or strings.
The example here is only two levels, but the actual data is nested many more levels.
Example:
[
{
"id": 1,
"person": {
"addresses": {
"address": {
"city": "FL"
}
},
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
}
},
{
"id": 2,
"person": {
"addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"phones": {
"type": "mobile",
"number": "555-555-5555"
},
"email": {
"type": "work",
"email": "jane.doe#gmail.com"
}
}
}
]
I would like to make the nodes consistent so that if any the node is an array in any of the nodes, then the remaining nodes should be converted into arrays.
Once the data is consistent, it would be easier to analyze and restructure the data.
Expected result:
[
{
"id": 1,
"person": {
"addresses": [
{
"address": {
"city": "FL"
}
}
],
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
}
},
{
"id": 2,
"person": {
"addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "jane.doe#gmail.com"
}
]
}
}
]
After making the arrays consistent I would like to flatten the data so that objects are flattened out but the arrays remain arrays. This
Expected result
[
{
"id": 1,
"person.addresses": [
{
"address": {
"city": "FL"
}
}
],
"person.phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"person.email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
},
{
"id": 2,
"person.addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"person.phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"person.email": [
{
"type": "work",
"email": "jane.doe#gmail.com"
}
]
}
]
I was able to do this partially using jq. It works when there are one or two paths to be fixed, but when there are more than two it seems to break.
The approach I took
Identify all possible paths
Group and count the datatypes for each path
Identify cases where there are mixed datatypes
Sort the paths by decreasing depth
Exclude paths that do not have mixed types
Exclude paths where one of the mixed types is not an array
For each path apply the fix on the original data
This generates a stream containing N copies one for each N transformation
Extract the last copy which should contain the cleaned result
My Experiment so far
def fix(data; path):
data |= map(. | getpath(path)?=([getpath(path)?]|flatten));
def hist:
length as $l
| group_by (.)
| map( .
| (.|length) as $c
| {(.[0]):{
"count": $c,
"diff": ($l - $c)
}} )
| (length>1) as $mixed
| {
"types": .[],
"count": $l,
"mixed":$mixed
};
def summary:
map( .
| path(..) as $p
| {
path:$p,
type: getpath($p)|type,
key:$p|join(".")
}
)
| flatten
| group_by(.key)
| map( .
| {
key: .[0].key,
path: .[0].path,
depth: (.[0].path|length),
type:([(.[] | .type)]|hist)
}
)
| sort_by(.depth)
| reverse;
. as $data
| .
| summary
| map( .
| select(.type.mixed)
| select(.type.types| keys| contains(["array"]))
| .path)
| map(. as $path | $data | fix($data;$path))
| length as $l
| .[$l-1]
Only the last conversion is present. I think the $data is not getting updated by my fix and this is probably the root cause, or maybe I am just doing this wrong.
Here is e where this doesn't work
The following response first solves the first task, namely:
make the nodes consistent so that if any ... node is an array in any of the nodes, then the remaining nodes should be converted into arrays.
in a generic way:
def paths_to_array:
[paths as $path
| select( any(.[]; (getpath($path[1:] )? | type) == "array"))
| $path] ;
# If a path to a value in .[] is an array,
# then ensure all corresponding values are also arrays
def make_uniform:
reduce (paths_to_array[][1:]) as $path (.;
map( (getpath($path)? // null) as $value
| if $value and ($value|type != "array")
then setpath($path; [$value])
else . end ) ) ;
make_uniform
For the second task, let's define a utility function:
# Input is assumed to be an object:
def flatten_top_level_keys:
[ to_entries[]
| if (.value|type) == "object"
then .key as $k
| (.value|to_entries)[] as $kv
| {key: ($k + "." + $kv.key), value: $kv.value}
else .
end ]
| from_entries;
This can be used in conjunction with walk/1 to achieve recursive
flattening.
In other words, the solution to the combined problem can be obtained
by:
make_uniform
| walk( if type == "object" then flatten_top_level_keys else . end )
Efficiency
The above def of make_uniform suffers from an obvious efficiency issue in the line:
reduce (paths_to_array[][1:]) as $path (.;
Using jq's unique would be one way to resolve it, but unique is implemented using a sort, which in this case introduces another inefficiency. So let's use this old chestnut:
# bag of words
def bow(stream):
reduce stream as $word ({}; .[$word|tostring] += 1);
Now we can define make_uniform more efficiently:
def make_uniform:
def uniques(s): bow(s) | keys_unsorted[] | fromjson;
reduce uniques(paths_to_array[][1:]) as $path (.;
map( (getpath($path)? // null) as $value
| if $value and ($value|type != "array")
then setpath($path; [$value])
else . end ) ) ;
Using a bit of python along with the JQ scripts that peak had given in the solution above, I was able to clean up my messy data.
I still think that the answer given by peak is the right answer given the question I had asked. Although the solution is very good and works well, it took a lot of time to complete. The time taken depended on the number of nodes, depth of the nodes and the number or arrays it found.
I had two different files that I needed to fix and both had around 5000 rows of data. On one of them, the jq script took about 6 hours to complete and I had to terminate the other one after 16 hours.
The solution below builds on the original solution by using python and jq together to process some of the steps in parallel. Finding paths to arrays is still the most time-consuming part.
setup
I split the scripts into the following
# paths_to_array.jq
def paths_to_array:
[paths as $path
| select( any(.[]; (getpath($path[1:] )? | type) == "array"))
| $path[1:]]
| unique
| map(. | select([.[]|type]|contains(["number"])|not));
paths_to_array
Minor adjustment to exclude any paths that had arrays in between. I just wanted all paths that end with arrays.
I also excluded the topmost array indices from the path to reduce the number of paths
# flatten.jq
def update_array($path):
(getpath($path)? // null) as $value
| (if $value and ($value|type != "array")
then . as $data | (try (setpath($path; [$value]))
catch $data)
else . end);
def make_uniform($paths):
map( .
| reduce($paths[]) as $path (
. ; update_array($path)
)
);
# Input is assumed to be an object:
def flatten_top_level_keys:
[ to_entries[]
| if (.value|type) == "object"
then .key as $k
| (.value|to_entries)[] as $kv
| {key: ($k + "." + $kv.key), value: $kv.value}
else .
end ]
| from_entries;
I had to add the walk function from jq builtins because the jq library for pythonn didn't include it.
I split the make_uniform function so that I could understand the script better and I added a try catch because of an issue I encountered when the path included array indices in between. Otherwise this is pretty much same as the code from the original solution
# apply.jq
make_uniform({path})
| map( .
| walk( if type == "object" then
flatten_top_level_keys
else . end ))
I had to split this because I was injecting the data for the path using the {path} and when this was in the full script I got an error when using .format() within python.
import math
import os
import JSON
from jq import jq
import multiprocessing as mp
def get_script(filename):
"""Utility function to read the jq script"""
with open(filename, "r") as f:
script = f.read()
return script
def get_data(filename):
"""Utility function to read json data from file"""
with open(filename, 'r') as f:
data = json.load(f)
return data
def transform(script, data):
"""Wrapper to be used by the parallel processor"""
return jq(script).transform(data)
def parallel_jq(script, data, rows=100, processes=8):
"""Executes the JQ script on data in parallel chuncks specified by rows"""
pool = mp.Pool(processes=processes)
size = math.ceil(len(data) / rows)
segments = [pool.apply_async(transform,
args=(script,
data[index*rows:(index+1)*rows]))
for index in range(size) ]
result = []
for seg in segments:
result.extend(seg.get())
return result
def get_paths_to_arrays(data, dest="data"):
"""Obtain the paths to arrays"""
filename = os.path.join(dest, "paths_to_arrays.json")
if os.path.isfile(filename):
paths = get_data(filename)
else:
script = get_script('jq/paths_to_array.jq')
paths = parallel_jq(script, data)
paths = jq("unique|sort_by(length)|reverse").transform(paths)
with open(filename, 'w') as f:
json.dump(paths, f, indent=2)
return paths
def flatten(data, paths, dest="data"):
"""Make the arrays uniform and flatten the result"""
filename = os.path.join(dest, "uniform_flat.json")
script = get_script('jq/flatten.jq')
script += "\n" + get_script('jq/apply.jq').format(path=json.dumps(paths))
data = parallel_jq(script, data)
with open(filename, 'w') as f:
json.dump(data, f, indent=2)
if __name__ == '__main__':
entity = 'messy_data'
sourcefile = os.path.join('data', entity+'.json')
dest = os.path.join('data', entity)
data = get_data(sourcefile)
# Finding paths with arrays
paths = get_paths_to_arrays(data, dest)
# Fixing array paths and flattening
flatten(data, paths, dest)
As I mentioned before the get_paths_to_arrays does take quite long even with parallel processing.
get_paths_to_arrays took 3811.834 seconds => Just over an hour.
flatten took 38 seconds
I have an API that returns JSON - big blocks of it. Some of the key value pairs have more blocks of JSON as the value associated with a key. jq does a great job of parsing the main JSON levels. But I can't find a way to get it to 'recurse' into the values associated with the keys and pretty print them as well.
Here is the start of one of the JSON returns. Note it is only a small percent of the full return:
{
"code": 200,
"status": "OK",
"data": {
"PlayFabId": "xxxxxxx",
"InfoResultPayload": {
"AccountInfo": {
"PlayFabId": "xxxxxxxx",
"Created": "2018-03-22T19:23:29.018Z",
"TitleInfo": {
"Origination": "IOS",
"Created": "2018-03-22T19:23:29.033Z",
"LastLogin": "2018-03-22T19:23:29.033Z",
"FirstLogin": "2018-03-22T19:23:29.033Z",
"isBanned": false
},
"PrivateInfo": {},
"IosDeviceInfo": {
"IosDeviceId": "xxxxxxxxx"
}
},
"UserVirtualCurrency": {
"GT": 10,
"MB": 70
},
"UserVirtualCurrencyRechargeTimes": {},
"UserData": {},
"UserDataVersion": 15,
"UserReadOnlyData": {
"DataVersion": {
"Value": "6",
"LastUpdated": "2018-03-22T19:48:59.543Z",
"Permission": "Public"
},
"achievements": {
"Value": "[{\"id\":0,\"gamePack\":\"GAME.PACK.0.KK\",\"marblesAmount\":50,\"achievements\":[{\"id\":2,\"name\":\"Correct Round 4\",\"description\":\"Round 4 answered correctly\",\"maxValue\":10,\"increment\":1,\"currentValue\":3,\"valueUnit\":\"unit\",\"awardOnIncrement\":true,\"marbles\":10,\"image\":\"https://www.jamandcandy.com/kissinkuzzins/achievements/icons/sphinx\",\"SuccessKey\":[\"0_3_4_0\",\"0_5_4_0\",\"0_6_4_0\",\"0_7_4_0\",\"0_8_4_0\",\"0_9_4_0\",\"0_10_4_0\"],\"event\":\"Player_answered_round\",\"achieved\":false},{\"id\":0,\"name\":\"Complete
This was parsed using jq but as you can see when you get to the
"achievements": { "Vales": "[{\"id\":0,\"gamePack\":\"GAME.PACK.0.KK\",\"marblesAmount\":50,\
lq does no further parse the value at is also JSON.
Is there a filter I am missing to get it to parse the values as well as the higher level structure?
Is there a filter I am missing ...?
The filter you'll need is fromjson, but it should only be applied to the stringified JSON; consider therefore using |= as illustrated using your fragment:
echo '{"achievements": { "Vales": "[{\"id\":0,\"gamePack\":\"GAME.PACK.0.KK\",\"marblesAmount\":50}]"}}' |
jq '.achievements.Vales |= fromjson'
{
"achievements": {
"Vales": [
{
"id": 0,
"gamePack": "GAME.PACK.0.KK",
"marblesAmount": 50
}
]
}
}
recursively/1
If you want to apply fromjson recursively wherever possible, then recursively is your friend:
def recursively(f):
. as $in
| if type == "object" then
reduce keys[] as $key
( {}; . + { ($key): ($in[$key] | recursively(f) )} )
elif type == "array" then map( recursively(f) )
else try (f as $f | if $f == . then . else ($f | recursively(f)) end) catch $in
end;
This would be applied as follows:
recursively(fromjson)
Example
{a: ({b: "xyzzy"}) | tojson} | tojson
| recursively(fromjson)
yields:
{
"a": {
"b": "xyzzy"
}
}
I am trying to use jq to "normalize" JSON so that given the input:
[
[
{"M":6, "C":66, "R":0.1},
{"M":6, "C":81, "R":0.9}
],
[
{"M":11, "C":94, "R":0.8},
{"M":11, "C":55, "R":0.46}
]
,
...
]
the output should be:
[
{
"M" : 6,
"X" : [{"66" : 0.1},{"81": 0.9}]
},
{
"M" : 11,
"X" : [{"94" : 0.8},{"55": 0.46}]
},
...
]
I can extract M with map({M: .[0].M but not sure how to proceed
Set X to the result of mapping over the array and creating a one-element object for each entry, with C as the key and R as the value.
map({M: .[0].M, X: map({"\(.C)": .R})})
Since map(f) is defined as [.[]|f], here is an expanded form of Santiago's solution with some additional comments:
[ # compute result array from
.[] # scaning input array
| { M: .[0].M # compute M from the .M of the first element
, X: [ # compute X array from
.[] # scanning each element's objects
| {"\(.C)":.R} # and constructing new {C:R} objects
]
}
]
[foreach .[] as $list ({};
{
"M": $list[0].M,
"X": [foreach $list[] as $item ({};
{"\($item.C)": $item.R}
)]
}
)]