I have an API that returns JSON - big blocks of it. Some of the key value pairs have more blocks of JSON as the value associated with a key. jq does a great job of parsing the main JSON levels. But I can't find a way to get it to 'recurse' into the values associated with the keys and pretty print them as well.
Here is the start of one of the JSON returns. Note it is only a small percent of the full return:
{
"code": 200,
"status": "OK",
"data": {
"PlayFabId": "xxxxxxx",
"InfoResultPayload": {
"AccountInfo": {
"PlayFabId": "xxxxxxxx",
"Created": "2018-03-22T19:23:29.018Z",
"TitleInfo": {
"Origination": "IOS",
"Created": "2018-03-22T19:23:29.033Z",
"LastLogin": "2018-03-22T19:23:29.033Z",
"FirstLogin": "2018-03-22T19:23:29.033Z",
"isBanned": false
},
"PrivateInfo": {},
"IosDeviceInfo": {
"IosDeviceId": "xxxxxxxxx"
}
},
"UserVirtualCurrency": {
"GT": 10,
"MB": 70
},
"UserVirtualCurrencyRechargeTimes": {},
"UserData": {},
"UserDataVersion": 15,
"UserReadOnlyData": {
"DataVersion": {
"Value": "6",
"LastUpdated": "2018-03-22T19:48:59.543Z",
"Permission": "Public"
},
"achievements": {
"Value": "[{\"id\":0,\"gamePack\":\"GAME.PACK.0.KK\",\"marblesAmount\":50,\"achievements\":[{\"id\":2,\"name\":\"Correct Round 4\",\"description\":\"Round 4 answered correctly\",\"maxValue\":10,\"increment\":1,\"currentValue\":3,\"valueUnit\":\"unit\",\"awardOnIncrement\":true,\"marbles\":10,\"image\":\"https://www.jamandcandy.com/kissinkuzzins/achievements/icons/sphinx\",\"SuccessKey\":[\"0_3_4_0\",\"0_5_4_0\",\"0_6_4_0\",\"0_7_4_0\",\"0_8_4_0\",\"0_9_4_0\",\"0_10_4_0\"],\"event\":\"Player_answered_round\",\"achieved\":false},{\"id\":0,\"name\":\"Complete
This was parsed using jq but as you can see when you get to the
"achievements": { "Vales": "[{\"id\":0,\"gamePack\":\"GAME.PACK.0.KK\",\"marblesAmount\":50,\
jq does not further parse the value, which is also JSON.
Is there a filter I am missing to get it to parse the values as well as the higher level structure?
Is there a filter I am missing ...?
The filter you'll need is fromjson, but it should only be applied to the stringified JSON; consider therefore using |= as illustrated using your fragment:
echo '{"achievements": { "Vales": "[{\"id\":0,\"gamePack\":\"GAME.PACK.0.KK\",\"marblesAmount\":50}]"}}' |
jq '.achievements.Vales |= fromjson'
{
"achievements": {
"Vales": [
{
"id": 0,
"gamePack": "GAME.PACK.0.KK",
"marblesAmount": 50
}
]
}
}
recursively/1
If you want to apply fromjson recursively wherever possible, then recursively is your friend:
def recursively(f):
. as $in
| if type == "object" then
reduce keys[] as $key
( {}; . + { ($key): ($in[$key] | recursively(f) )} )
elif type == "array" then map( recursively(f) )
else try (f as $f | if $f == . then . else ($f | recursively(f)) end) catch $in
end;
This would be applied as follows:
recursively(fromjson)
Example
{a: ({b: "xyzzy"}) | tojson} | tojson
| recursively(fromjson)
yields:
{
"a": {
"b": "xyzzy"
}
}
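For readers who want to check the behaviour outside jq, here is a minimal Python sketch of the same idea: walk the structure and decode any string that happens to contain JSON. The function name decode_nested is hypothetical, and the sample document is a shortened version of the fragment from the question. Like recursively(fromjson), it also turns strings such as "6" into numbers, which may or may not be what you want.

```python
import json


def decode_nested(value):
    """Recursively replace strings that hold JSON with the decoded value."""
    if isinstance(value, dict):
        return {key: decode_nested(item) for key, item in value.items()}
    if isinstance(value, list):
        return [decode_nested(item) for item in value]
    if isinstance(value, str):
        try:
            decoded = json.loads(value)
        except ValueError:
            return value  # not JSON: keep the original string
        # The decoded value may itself contain stringified JSON
        return decode_nested(decoded)
    return value


# Shortened version of the fragment from the question
doc = {
    "achievements": {
        "Value": "[{\"id\":0,\"gamePack\":\"GAME.PACK.0.KK\",\"marblesAmount\":50}]"
    }
}
result = decode_nested(doc)

# Doubly encoded input, as in the jq example above
double = json.dumps({"a": json.dumps({"b": "xyzzy"})})
double_result = decode_nested(double)
```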
Related
I'm using the jq CLI to merge JSON from one document into another. The issue I am facing is that I have to select by the value of a property rather than by a numeric array index.
The first file, jqtest.json, contains a chunk of JSON:
{
"event": [
{
"listen": "test",
"script": {
"exec": [],
"type": "text/javascript"
}
}
]
}
The second file, collection.json, is where I want to merge the JSON into, under "accounts":
{
"item": [
{
"name": "accounts",
"item": [
{
"name": "Retrieves the collection of Account resources."
}
]
},
{
"name": "accounts mapped",
"item": [
{
"name": "Retrieves the collection of AccountMapped resources."
}
]
}
]
}
What I am trying to do is merge it under "accounts", into the item with "name": "Retrieves the collection of Account resources." I use the command:
jq -s '
.[0].event += .[1].item |
map(select(.name=="accounts")) |
.[].item
' jqtest.json collection.json
But when executed, nothing is output. What am I doing wrong with jq, or is there another tool I can use to accomplish this? The result I expect is:
{
"item": [
{
"name": "accounts",
"item": [
{
"name": "Retrieves the collection of Account resources.",
"event": [
{
"listen": "test",
"script": {
"exec": [],
"type": "text/javascript"
}
}
]
},
{
"name": "accounts mapped",
"item": [
{
"name": "Retrieves the collection of AccountMapped resources."
}
]
}
]
}
]
}
To merge two objects, one can use obj1 + obj2. From this, it follows that obj1 += obj2 can be used to merge an object (obj2) into another existing object (obj1).
Maybe that's what you were trying to use. If so, there were several problems: you were missing parentheses around the expression producing the object to merge into (causing the code to be misparsed); you had the operands to += backwards; you didn't produce the correct objects on each side of += (or even objects at all); and you didn't narrow down your output (accidentally including jqtest in the output).
Fixed:
jq -s '
( .[1].item[] | select( .name == "accounts" ) | .item[] ) += .[0] | .[1]
' jqtest.json collection.json
Demo on jqplay
I find the following clearer (less mental overhead):
jq -s '
.[0] as $to_insert |
.[1] | ( .item[] | select( .name == "accounts" ) | .item[] ) += $to_insert
' jqtest.json collection.json
Demo
That said, I would avoid slurping in favour of --argfile.
jq --argfile to_insert jqtest.json '
( .item[] | select( .name == "accounts" ) | .item[] ) += $to_insert
' collection.json
Demo on jqplay
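To make the selection logic explicit, here is a rough Python equivalent of the filter above: find every entry of .item whose name matches, then merge the extra object into each of its inner items. The function name merge_into_named is hypothetical; the data is copied from the question. As with jq's +, keys from the inserted object win on conflict.

```python
import copy


def merge_into_named(collection, name, to_insert):
    """Merge to_insert into every inner item of the groups called `name`."""
    result = copy.deepcopy(collection)  # leave the input untouched
    for group in result.get("item", []):
        if group.get("name") == name:
            for entry in group.get("item", []):
                entry.update(copy.deepcopy(to_insert))
    return result


jqtest = {
    "event": [
        {"listen": "test", "script": {"exec": [], "type": "text/javascript"}}
    ]
}
collection = {
    "item": [
        {
            "name": "accounts",
            "item": [{"name": "Retrieves the collection of Account resources."}],
        },
        {
            "name": "accounts mapped",
            "item": [{"name": "Retrieves the collection of AccountMapped resources."}],
        },
    ]
}

merged = merge_into_named(collection, "accounts", jqtest)
```

Only the group named exactly "accounts" is touched; "accounts mapped" is left as it was.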
I want to transform JSON data using jq filter
Json data:
{
"main": [
{
"firstKey": "ABCD",
"id": "12345",
"data": [
{
"name": "first_id",
"value": "first_id_value"
},
{
"name": "second_id",
"value": "second_id_value"
},
{
"name": "third_id",
"value": "third_id_value"
}
]
}
]
}
Expected OUTPUT:
{
"firstKey": "ABCD",
"id": "12345",
"data.name.first_id": "first_id_value",
"data.name.second_id": "second_id_value",
"data.name.third_id": "third_id_value"
}
After much trial and error, I got close to the expected output using the following filter expression:
[.main[]|{"firstKey", "id"},foreach .data[] as $item (0; "data.name.\($item.name)" as $a|$item.value as $b| {($a): $b})][]
I used foreach because the objects under "data" are dynamic; the number of objects can differ.
The output for the above expression is:
{
"firstKey": "ABCD",
"id": "12345"
}
{
"data.name.first_id": "first_id_value"
}
{
"data.name.second_id": "second_id_value"
}
{
"data.name.third_id": "third_id_value"
}
But I want the objects of data to be under the same braces as 'firstKey' and 'id'.
LINK to JqPlay
Any suggestions will be helpful.
Since your structure is so rigid, you can cheat and use the built-in from_entries, which takes a list of {key, value} pairs and constructs an object:
.main[] |
{firstKey, id} +
(.data | map({key: "data.name.\(.name)", value}) |
from_entries)
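The same reshaping can be sketched in Python to show what from_entries is doing here: each {name, value} pair becomes one key of the resulting object. The function name flatten_record is hypothetical, and the input is the sample from the question.

```python
def flatten_record(record):
    """Build one flat object from firstKey, id and the name/value pairs."""
    flat = {"firstKey": record.get("firstKey"), "id": record.get("id")}
    for pair in record.get("data", []):
        # Equivalent of {key: "data.name.\(.name)", value} | from_entries
        flat["data.name." + pair["name"]] = pair["value"]
    return flat


doc = {
    "main": [
        {
            "firstKey": "ABCD",
            "id": "12345",
            "data": [
                {"name": "first_id", "value": "first_id_value"},
                {"name": "second_id", "value": "second_id_value"},
                {"name": "third_id", "value": "third_id_value"},
            ],
        }
    ]
}

flattened = [flatten_record(record) for record in doc["main"]]
```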
I could not find how to count occurrences of "title" grouped by "member_id".
The json file is:
[
{
"member_id": 123,
"loans":[
{
"date": "123",
"media": [
{ "title": "foo" },
{ "title": "bar" }
]
},
{
"date": "456",
"media": [
{ "title": "foo" }
]
}
]
},
{
"member_id": 456,
"loans":[
{
"date": "789",
"media": [
{ "title": "foo"}
]
}
]
}
]
With this query I get loan entries for users with "title==foo"
jq '.[] | (.member_id) as $m | .loans[].media[] | select(.title=="foo") | {id: $m, title: .title}' member.json
{
"id": 123,
"title": "foo"
}
{
"id": 123,
"title": "foo"
}
{
"id": 456,
"title": "foo"
}
But I could not find how to get count by user (group by) for a title, to get a result like:
{
"id": 123,
"title": "foo",
"count": 2
}
{
"id": 456,
"title": "foo",
"count": 1
}
I got errors like jq: error (at member.json:31): object ({"title":"f...) and array ([[123]]) cannot be sorted, as they are not both arrays or similar...
When the main goal is to count, it is usually more efficient to avoid constructing an array if determining its length is the only reason for doing so. In the present case you could, for example, write:
def count(s): reduce s as $x (0; .+1);
"foo" as $title | .[] | {
id: .member_id,
$title,
count: count(.loans[].media[] | select(.title == $title))
}
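As a cross-check of the counting logic, here is a small Python sketch that counts matching titles per member without building an intermediate list. The function name count_title is hypothetical; the data is the member.json sample from the question.

```python
def count_title(members, title):
    """Return one {id, title, count} object per member."""
    return [
        {
            "id": member["member_id"],
            "title": title,
            # Count lazily, like reduce over a stream in jq
            "count": sum(
                1
                for loan in member["loans"]
                for media in loan["media"]
                if media["title"] == title
            ),
        }
        for member in members
    ]


members = [
    {
        "member_id": 123,
        "loans": [
            {"date": "123", "media": [{"title": "foo"}, {"title": "bar"}]},
            {"date": "456", "media": [{"title": "foo"}]},
        ],
    },
    {
        "member_id": 456,
        "loans": [{"date": "789", "media": [{"title": "foo"}]}],
    },
]

counts = count_title(members, "foo")
```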
group_by has its uses, but it is well to be aware that it is inefficient even for grouping, because its implementation involves a sort, which is not strictly necessary if the goal is to "group by" some criterion. A completely generic sort-free "group by" function is a bit tricky to implement, but often a simple but non-generic version is sufficient, such as:
# sort-free variant of group_by/1
# f must always evaluate to an integer or always to a string, which
# could be achieved by using `tostring`.
# Output: an array in the former case, or an object in the latter case
def GROUP_BY(f): reduce .[] as $x (null; .[$x|f] += [$x] );
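The idea behind a sort-free "group by" is simply to accumulate items into buckets keyed by the grouping value. A minimal Python sketch, with the hypothetical name group_by_key, looks like this; it returns a dictionary of buckets rather than a sorted array of groups.

```python
def group_by_key(items, key_func):
    """Group items into buckets by key_func(item), without sorting."""
    groups = {}
    for item in items:
        groups.setdefault(key_func(item), []).append(item)
    return groups


rows = [
    {"id": 123, "title": "foo"},
    {"id": 456, "title": "foo"},
    {"id": 123, "title": "foo"},
]
grouped = group_by_key(rows, lambda row: row["id"])
sizes = {key: len(bucket) for key, bucket in grouped.items()}
```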
Using group_by:
jq 'map(
(.member_id) as $m
| .loans[].media[]
| select(.title=="foo")
| {id: $m, title: .title}
)
|group_by(.id)[]
|.[0] + { count: length }
' input-file
This is my sample input:
[
{
"label": "test1",
"value": 1,
"path": "data/testData/testDataLevel3/testDataLevel3_1/0/testDataLevel3_1_a2"
},
{
"label": "test2",
"value": 2,
"path": "data/testData/testDataLevel1/testDataLevel1_1"
}
]
This input needs to be converted like this using jq
Expected output:
{
"data": {
"testData": {
"testDataLevel1": { //object
"testDataLevel1_1": 2
},
"testDataLevel3": {
"testDataLevel3_1": [ //array
{
"testDataLevel3_1_a2": 1
}
]
}
}
}
}
The path can contain an array index as one of its segments, and sometimes the keys will be combined in the path as well.
You need to convert each .path to a form setpath can understand. The rest is straightforward.
reduce .[] as {$path, $value} (null;
setpath($path / "/" | map(tonumber? // .); $value)
)
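To see what the path conversion and setpath are doing, here is a Python sketch of the same steps: split the path on "/", turn numeric segments into list indices, and create dictionaries or lists on the way down. The names parse_path, set_path and build_tree are hypothetical. One simplification to note: only segments made purely of digits are treated as indices, whereas tonumber in jq accepts more number formats.

```python
def parse_path(path):
    """Split on '/' and convert all-digit segments to integers."""
    return [int(part) if part.isdigit() else part for part in path.split("/")]


def set_path(node, keys, value):
    """Return node with value stored at keys, creating containers as needed."""
    if not keys:
        return value
    head, rest = keys[0], keys[1:]
    if isinstance(head, int):
        node = [] if node is None else node
        while len(node) <= head:
            node.append(None)  # pad missing slots, as jq pads with null
        node[head] = set_path(node[head], rest, value)
    else:
        node = {} if node is None else node
        node[head] = set_path(node.get(head), rest, value)
    return node


def build_tree(rows):
    tree = None
    for row in rows:
        tree = set_path(tree, parse_path(row["path"]), row["value"])
    return tree


rows = [
    {"label": "test1", "value": 1,
     "path": "data/testData/testDataLevel3/testDataLevel3_1/0/testDataLevel3_1_a2"},
    {"label": "test2", "value": 2,
     "path": "data/testData/testDataLevel1/testDataLevel1_1"},
]
tree = build_tree(rows)
```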
Online demo
I have some messy JSON.
Some nodes are not consistent across rows. In some rows these nodes are arrays and in some these are objects or strings.
The example here is only two levels, but the actual data is nested many more levels.
Example:
[
{
"id": 1,
"person": {
"addresses": {
"address": {
"city": "FL"
}
},
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
}
},
{
"id": 2,
"person": {
"addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"phones": {
"type": "mobile",
"number": "555-555-5555"
},
"email": {
"type": "work",
"email": "jane.doe#gmail.com"
}
}
}
]
I would like to make the nodes consistent, so that if a node is an array in any of the rows, then the corresponding nodes in the remaining rows are converted into arrays.
Once the data is consistent, it would be easier to analyze and restructure the data.
Expected result:
[
{
"id": 1,
"person": {
"addresses": [
{
"address": {
"city": "FL"
}
}
],
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
}
},
{
"id": 2,
"person": {
"addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"email": [
{
"type": "work",
"email": "jane.doe#gmail.com"
}
]
}
}
]
After making the arrays consistent, I would like to flatten the data so that objects are flattened out but the arrays remain arrays.
Expected result:
[
{
"id": 1,
"person.addresses": [
{
"address": {
"city": "FL"
}
}
],
"person.phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"person.email": [
{
"type": "work",
"email": "john.doe#gmail.com"
},
{
"type": "work",
"email": "john.doe#work.com"
}
]
},
{
"id": 2,
"person.addresses": [
{
"type": "home",
"address": {
"city": "FL"
}
}
],
"person.phones": [
{
"type": "mobile",
"number": "555-555-5555"
}
],
"person.email": [
{
"type": "work",
"email": "jane.doe#gmail.com"
}
]
}
]
I was able to do this partially using jq. It works when there are one or two paths to be fixed, but when there are more than two it seems to break.
The approach I took
Identify all possible paths
Group and count the datatypes for each path
Identify cases where there are mixed datatypes
Sort the paths by decreasing depth
Exclude paths that do not have mixed types
Exclude paths where one of the mixed types is not an array
For each path apply the fix on the original data
This generates a stream containing N copies one for each N transformation
Extract the last copy which should contain the cleaned result
My Experiment so far
def fix(data; path):
data |= map(. | getpath(path)?=([getpath(path)?]|flatten));
def hist:
length as $l
| group_by (.)
| map( .
| (.|length) as $c
| {(.[0]):{
"count": $c,
"diff": ($l - $c)
}} )
| (length>1) as $mixed
| {
"types": .[],
"count": $l,
"mixed":$mixed
};
def summary:
map( .
| path(..) as $p
| {
path:$p,
type: getpath($p)|type,
key:$p|join(".")
}
)
| flatten
| group_by(.key)
| map( .
| {
key: .[0].key,
path: .[0].path,
depth: (.[0].path|length),
type:([(.[] | .type)]|hist)
}
)
| sort_by(.depth)
| reverse;
. as $data
| .
| summary
| map( .
| select(.type.mixed)
| select(.type.types| keys| contains(["array"]))
| .path)
| map(. as $path | $data | fix($data;$path))
| length as $l
| .[$l-1]
Only the last conversion is present. I think the $data is not getting updated by my fix and this is probably the root cause, or maybe I am just doing this wrong.
Here is an example where this doesn't work.
The following response first solves the first task, namely:
make the nodes consistent so that if any ... node is an array in any of the nodes, then the remaining nodes should be converted into arrays.
in a generic way:
def paths_to_array:
[paths as $path
| select( any(.[]; (getpath($path[1:] )? | type) == "array"))
| $path] ;
# If a path to a value in .[] is an array,
# then ensure all corresponding values are also arrays
def make_uniform:
reduce (paths_to_array[][1:]) as $path (.;
map( (getpath($path)? // null) as $value
| if $value and ($value|type != "array")
then setpath($path; [$value])
else . end ) ) ;
make_uniform
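For comparison, the two-pass idea (first collect the paths that hold an array in any row, then wrap the non-array values found at those paths) can be sketched in Python. This is a simplified sketch, not a translation of the jq above: it only follows object keys and does not descend into arrays. The names array_paths and make_uniform_py are hypothetical, and the rows are a trimmed version of the example in the question.

```python
def array_paths(rows):
    """Collect key paths (tuples of keys) that hold a list in at least one row."""
    found = set()

    def visit(node, path):
        if isinstance(node, list):
            found.add(path)
        elif isinstance(node, dict):
            for key, child in node.items():
                visit(child, path + (key,))

    for row in rows:
        visit(row, ())
    return found


def make_uniform_py(rows):
    """Wrap non-list values in a list wherever some row has a list."""
    paths = array_paths(rows)

    def wrap(node, path):
        if isinstance(node, dict):
            node = {key: wrap(child, path + (key,)) for key, child in node.items()}
        skip = node is None or node is False  # jq treats null and false as "no value"
        if path in paths and not skip and not isinstance(node, list):
            return [node]
        return node

    return [wrap(row, ()) for row in rows]


rows = [
    {"id": 1, "person": {
        "addresses": {"address": {"city": "FL"}},
        "phones": [{"type": "mobile", "number": "555-555-5555"}]}},
    {"id": 2, "person": {
        "addresses": [{"type": "home", "address": {"city": "FL"}}],
        "phones": {"type": "mobile", "number": "555-555-5555"}}},
]
uniform = make_uniform_py(rows)
```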
For the second task, let's define a utility function:
# Input is assumed to be an object:
def flatten_top_level_keys:
[ to_entries[]
| if (.value|type) == "object"
then .key as $k
| (.value|to_entries)[] as $kv
| {key: ($k + "." + $kv.key), value: $kv.value}
else .
end ]
| from_entries;
This can be used in conjunction with walk/1 to achieve recursive flattening.
In other words, the solution to the combined problem can be obtained by:
make_uniform
| walk( if type == "object" then flatten_top_level_keys else . end )
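The bottom-up behaviour of walk is easy to miss, so here is a Python sketch of the flattening step on its own. Children are processed first, then each object has its object-valued keys merged one level up with a "." separator. The names flatten_top_level_keys_py and walk_flatten are hypothetical. Note that, like the walk-based filter, this also flattens objects that sit inside arrays.

```python
def flatten_top_level_keys_py(obj):
    """Lift the keys of object-valued entries one level up, joined by '.'."""
    flat = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[key + "." + sub_key] = sub_value
        else:
            flat[key] = value
    return flat


def walk_flatten(node):
    """Apply the flattening bottom-up, as walk/1 does in jq."""
    if isinstance(node, dict):
        children = {key: walk_flatten(value) for key, value in node.items()}
        return flatten_top_level_keys_py(children)
    if isinstance(node, list):
        return [walk_flatten(item) for item in node]
    return node


row = {
    "id": 1,
    "person": {
        "addresses": [{"address": {"city": "FL"}}],
        "name": {"first": "J"},
    },
}
flat_row = walk_flatten(row)
```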
Efficiency
The above def of make_uniform suffers from an obvious efficiency issue in the line:
reduce (paths_to_array[][1:]) as $path (.;
Using jq's unique would be one way to resolve it, but unique is implemented using a sort, which in this case introduces another inefficiency. So let's use this old chestnut:
# bag of words
def bow(stream):
reduce stream as $word ({}; .[$word|tostring] += 1);
Now we can define make_uniform more efficiently:
def make_uniform:
def uniques(s): bow(s) | keys_unsorted[] | fromjson;
reduce uniques(paths_to_array[][1:]) as $path (.;
map( (getpath($path)? // null) as $value
| if $value and ($value|type != "array")
then setpath($path; [$value])
else . end ) ) ;
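The "bag of words" trick amounts to using the serialized path as a dictionary key, so duplicates collapse without any sorting. A minimal Python sketch of that de-duplication, with the hypothetical name unique_paths:

```python
import json


def unique_paths(paths):
    """Drop duplicate paths without sorting, keyed on their JSON text."""
    seen = {}
    for path in paths:
        seen.setdefault(json.dumps(path), path)
    return list(seen.values())


paths = [["person", "phones"], ["person", "email"], ["person", "phones"]]
deduped = unique_paths(paths)
```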
Using a bit of python along with the JQ scripts that peak had given in the solution above, I was able to clean up my messy data.
I still think that the answer given by peak is the right answer given the question I had asked. Although the solution is very good and works well, it took a lot of time to complete. The time taken depended on the number of nodes, the depth of the nodes and the number of arrays it found.
I had two different files that I needed to fix and both had around 5000 rows of data. On one of them, the jq script took about 6 hours to complete and I had to terminate the other one after 16 hours.
The solution below builds on the original solution by using python and jq together to process some of the steps in parallel. Finding paths to arrays is still the most time-consuming part.
setup
I split the scripts into the following
# paths_to_array.jq
def paths_to_array:
[paths as $path
| select( any(.[]; (getpath($path[1:] )? | type) == "array"))
| $path[1:]]
| unique
| map(. | select([.[]|type]|contains(["number"])|not));
paths_to_array
Minor adjustment to exclude any paths that had arrays in between. I just wanted all paths that end with arrays.
I also excluded the topmost array indices from the path to reduce the number of paths
# flatten.jq
def update_array($path):
(getpath($path)? // null) as $value
| (if $value and ($value|type != "array")
then . as $data | (try (setpath($path; [$value]))
catch $data)
else . end);
def make_uniform($paths):
map( .
| reduce($paths[]) as $path (
. ; update_array($path)
)
);
# Input is assumed to be an object:
def flatten_top_level_keys:
[ to_entries[]
| if (.value|type) == "object"
then .key as $k
| (.value|to_entries)[] as $kv
| {key: ($k + "." + $kv.key), value: $kv.value}
else .
end ]
| from_entries;
I had to add the walk function from the jq builtins because the jq library for Python didn't include it.
I split the make_uniform function so that I could understand the script better, and I added a try/catch because of an issue I encountered when the path included array indices in between. Otherwise this is pretty much the same as the code from the original solution.
# apply.jq
make_uniform({path})
| map( .
| walk( if type == "object" then
flatten_top_level_keys
else . end ))
I had to split this because I was injecting the data for the path using the {path} and when this was in the full script I got an error when using .format() within python.
import math
import os
import json
from jq import jq
import multiprocessing as mp


def get_script(filename):
    """Utility function to read the jq script"""
    with open(filename, "r") as f:
        script = f.read()
    return script


def get_data(filename):
    """Utility function to read json data from file"""
    with open(filename, 'r') as f:
        data = json.load(f)
    return data


def transform(script, data):
    """Wrapper to be used by the parallel processor"""
    return jq(script).transform(data)


def parallel_jq(script, data, rows=100, processes=8):
    """Executes the jq script on data in parallel chunks specified by rows"""
    pool = mp.Pool(processes=processes)
    size = math.ceil(len(data) / rows)
    # One asynchronous task per chunk of `rows` records
    segments = [pool.apply_async(transform,
                                 args=(script,
                                       data[index*rows:(index+1)*rows]))
                for index in range(size)]
    result = []
    for seg in segments:
        result.extend(seg.get())
    return result


def get_paths_to_arrays(data, dest="data"):
    """Obtain the paths to arrays"""
    filename = os.path.join(dest, "paths_to_arrays.json")
    if os.path.isfile(filename):
        # Reuse the cached result from an earlier run
        paths = get_data(filename)
    else:
        script = get_script('jq/paths_to_array.jq')
        paths = parallel_jq(script, data)
        paths = jq("unique|sort_by(length)|reverse").transform(paths)
        with open(filename, 'w') as f:
            json.dump(paths, f, indent=2)
    return paths


def flatten(data, paths, dest="data"):
    """Make the arrays uniform and flatten the result"""
    filename = os.path.join(dest, "uniform_flat.json")
    script = get_script('jq/flatten.jq')
    script += "\n" + get_script('jq/apply.jq').format(path=json.dumps(paths))
    data = parallel_jq(script, data)
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)


if __name__ == '__main__':
    entity = 'messy_data'
    sourcefile = os.path.join('data', entity+'.json')
    dest = os.path.join('data', entity)
    data = get_data(sourcefile)
    # Finding paths with arrays
    paths = get_paths_to_arrays(data, dest)
    # Fixing array paths and flattening
    flatten(data, paths, dest)
As I mentioned before, get_paths_to_arrays still takes quite long, even with parallel processing.
get_paths_to_arrays took 3811.834 seconds => Just over an hour.
flatten took 38 seconds
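The chunking arithmetic used by parallel_jq can be checked in isolation, without jq or a process pool. The sketch below uses a hypothetical helper named chunk and made-up row counts; it only shows that slicing by ceil(len(data) / rows) covers every record exactly once, with a shorter final chunk.

```python
import math


def chunk(data, rows=100):
    """Split data into consecutive slices of at most `rows` items."""
    size = math.ceil(len(data) / rows)
    return [data[index * rows:(index + 1) * rows] for index in range(size)]


data = list(range(250))  # stand-in for 250 rows of JSON
chunks = chunk(data, rows=100)
lengths = [len(part) for part in chunks]

# Re-joining the chunks in order gives back the original rows
rejoined = [item for part in chunks for item in part]
```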