Count unique values in objects within large JSON file with Python

I have some rather large JSON files. Each contains thousands of objects within one (1) array. The JSONs are structured in the following format:
{
  "alert": [
    { "field1": "abc",
      "field2": "def",
      "field3": "xyz"
    },
    { "field1": null,
      "field2": null,
      "field3": "xyz"
    },
    ...
    ...
  ]
}
What's the most efficient way to use Python and the json library to search through a JSON file, find the unique values in each object within the array, and count how many times each appears? E.g., search the array's "field3" entries for the value "xyz" and count how many times it appears. I tried a few variations based on existing solutions on Stack Overflow, but they are not producing the results I'm looking for.

A quick search on PyPI turned up ijson 2.3, an iterative JSON parser with a standard Python iterator interface:
https://pypi.python.org/pypi/ijson
Here's an example which should work for your data:
import ijson
import json

counts = {}
with open("data.json", "rb") as f:
    # stream the objects inside the top-level "alert" array one at a time
    objects = ijson.items(f, 'alert.item')
    for o in objects:
        for k, v in o.items():
            field = counts.get(k, {})
            total = field.get(v, 0)
            field[v] = total + 1
            counts[k] = field

print(json.dumps(counts, indent=2))
Running this with your sample data in data.json produces:
{
  "field2": {
    "null": 1,
    "def": 1
  },
  "field3": {
    "xyz": 2
  },
  "field1": {
    "null": 1,
    "abc": 1
  }
}
Note, however, that the null values in your input were transformed into the string "null": json.dumps renders None as "null" when it appears as a dictionary key.
As a point of comparison, here is a jq command which produces an equivalent result using tostream:
jq -M '
  reduce (tostream | select(length==2)) as [$p, $v] (
    {};
    ($p[2:] + [$v | tostring]) as $k
    | setpath($k; getpath($k) + 1)
  )
' data.json
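If you only need the counts for a single field and the file fits comfortably in memory, a plain jq one-liner will also do it. A minimal sketch, assuming the top-level "alert" array shown above:
jq '[.alert[].field3] | group_by(.) | map({key: (.[0] | tostring), value: length}) | from_entries' data.json
For the sample data this prints {"xyz": 2}; the tostring mirrors the null-to-"null" behavior of the Python version.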


Why is adding parentheses to a filter in 'jq' producing valid JSON and without parentheses, multiple outputs of objects?

With jq, I would like to set a property within JSON data and have jq output the original JSON with the updated value. I found a solution more or less by trial and error, and I want to understand why and how it works.
I have the following JSON data:
{
  "notifications": [
    {
      "source": "observer01",
      "channel": "error",
      "time": "2021-01-01 01:01:01"
    },
    {
      "source": "observer01",
      "channel": "info",
      "time": "2021-02-02 02:02:02"
    }
  ]
}
My goal is to update the time property of an object with a specific source and channel (the original JSON is way longer with lots of objects in the notifications array of the same format).
(In the following example, I want to update the time property of observer01 with channel info, so the second object in the example data above.)
My first try, not producing the desired output, was the following jq command:
jq '.notifications[] | select(.source == "observer01" and .channel == "info").time = "NEWTIME"' data.json
That produces the following output:
{
  "source": "observer01",
  "channel": "error",
  "time": "2021-01-01 01:01:01"
}
{
  "source": "observer01",
  "channel": "info",
  "time": "NEWTIME"
}
This is just a stream of the JSON objects from within the notifications array. I understand that this can be useful, for example when piping the objects to other command-line tools.
Now let's try the following jq command, which is the same as above plus one pair of parentheses:
jq '(.notifications[] | select(.source == "observer01" and .channel == "info").time) = "NEWTIME"' data.json
This produces the desired output, the original valid JSON with the updated time property:
{
  "notifications": [
    {
      "source": "observer01",
      "channel": "error",
      "time": "2021-01-01 01:01:01"
    },
    {
      "source": "observer01",
      "channel": "info",
      "time": "NEWTIME"
    }
  ]
}
Why is adding the parentheses to the jq filter in the case above producing a different output?
The parentheses just change the precedence. It's documented in man jq:
Parenthesis work as a grouping operator just as in any typical programming language.
jq '(. + 2) * 5'
   1
=> 15
Let's have a simpler example:
echo '[{"a":1}, {"a":2}]' | jq '.[] | .a |= .+1'
It outputs
{
  "a": 2
}
{
  "a": 3
}
because it's implicitly interpreted with the parentheses around the second filter:
echo '[{"a":1}, {"a":2}]' | jq '.[] | (.a |= .+1)'
The first filter .[] outputs the array elements as separate objects; they are then modified by the second filter.
Placing the parentheses around the first two filters changes the precedence:
echo '[{"a":1}, {"a":2}]' | jq '(.[] | .a) |= .+1'
and produces a different output:
[
  {
    "a": 2
  },
  {
    "a": 3
  }
]
BTW, this is the same output as from
echo '[{"a":1}, {"a":2}]' | jq '.[].a |= .+1'
It updates the value associated with the "a" key in each element, while keeping the array intact.
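The difference between |= and = matters here: with |= the right-hand side is evaluated relative to each addressed value, while with = it is evaluated once against the whole input. A quick check with the same toy data:
echo '[{"a":1}, {"a":2}]' | jq -c '.[].a = 99'
outputs [{"a":99},{"a":99}], assigning the same value everywhere.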
Let's compare the two.
.notifications[] | select(...).time = "NEWTIME"
(.notifications[] | select(...).time) = "NEWTIME"
In the first one, the top-level filter is defined by |. The input is an object, and the output is the result of applying select(...).time = "NEWTIME" to each value produced by .notifications[]. In essence, the original object is "lost".
In the second one, the top-level filter is defined by =. x = y returns its input as output, but with a side effect produced by:
1. determining what the path expression x refers to in the input,
2. evaluating the filter y on the input (even an expression like "NEWTIME" is just a filter: one that ignores its input and returns the string "NEWTIME"), and
3. assigning the result of y to the thing addressed by x.
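A minimal illustration of that behavior, with made-up data:
echo '{"a":{"b":1},"c":2}' | jq -c '(.a.b) = 10'
returns the whole input with only the addressed path changed: {"a":{"b":10},"c":2}.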

Parsing JSON in SAS

Does anyone know how to convert the following JSON to table format in SAS? I appreciate any help in advance!
JSON
{
  "totalCount": 2,
  "facets": {},
  "content": [
    [
      {
        "name": "customer_ID",
        "value": "1"
      },
      {
        "name": "customer_name",
        "value": "John"
      }
    ],
    [
      {
        "name": "customer_ID",
        "value": "2"
      },
      {
        "name": "customer_name",
        "value": "Jennifer"
      }
    ]
  ]
}
Desired Output
customer_ID    customer_name
1              John
2              Jennifer
Steps I've Taken
1- Call API
filename request "C:\path.request.txt";
filename response "C:\path.response.json";
filename status "C:\path.status.json";
proc http
url="http://httpbin.org/get"
method="POST"
in=request
out=response
headerout=status;
run;
2- I have the following JSON map file saved:
{
  "DATASETS": [
    {
      "DSNAME": "customers",
      "TABLEPATH": "/root/content",
      "VARIABLES": [
        {
          "NAME": "name",
          "TYPE": "CHARACTER",
          "PATH": "/root/content/name"
        },
        {
          "NAME": "value",
          "TYPE": "CHARACTER",
          "PATH": "/root/content/value"
        }
      ]
    }
  ]
}
3- I use the above JSON map file as follows:
filename jmap "C:\path.jmap.map";
libname cust json map=jmap access=readonly;
proc copy inlib=cust outlib=work;
run;
4- This generates a table like this, which is not what I need:
name             value
customer_ID      1
customer_name    John
customer_ID      2
customer_name    Jennifer
From where you are, you have a very trivial step to convert to what you want - PROC TRANSPOSE.
filename test "h:\temp\test.json";
libname test json;

data pre_trans;
  set test.content;
  if name='customer_ID' then row+1;
run;

proc transpose data=pre_trans out=want;
  by row;
  id name;
  var value;
run;
You could also do this directly in the data step; there are advantages to going either way.
data want;
  set test.content;
  retain customer_ID customer_name;
  if name='customer_ID' then customer_ID=input(value,best.);
  else if name='customer_name' then do;
    customer_name = value;
    output;
  end;
run;
This data step works okay for the example above; the proc transpose works better for more complex examples, as you only have to hardcode the one value.
I suspect you could do this more directly with a proper JSON map, but I don't usually do this sort of thing that way - it's easier for me to just get it into a dataset and then work with it from there.
In this case, SAS is getting tripped up by the nested arrays with no other content before the second array; if there were some (any) content there, it would parse more naturally. Since there's nothing for SAS to use to judge what you want to do with that content array, it just lets you do whatever you want with it, which is easy enough.

Merge and Sort JSON using JQ

I have a file containing the following structure and an unknown number of results:
{
  "results": [
    [
      {
        "field": "AccountID",
        "value": "5177497"
      },
      {
        "field": "Requests",
        "value": "50900"
      }
    ],
    [
      {
        "field": "AccountID",
        "value": "pro"
      },
      {
        "field": "Requests",
        "value": "251"
      }
    ]
  ],
  "statistics": {
    "Matched": 51498,
    "Scanned": 8673577,
    "ScannedByte": 2.72400814E10
  },
  "status": "HOLD"
}
{
  "results": [
    [
      {
        "field": "AccountID",
        "value": "5577497"
      },
      {
        "field": "Requests",
        "value": "51900"
      }
    ],
  "statistics": {
    "Matched": 51498,
    "Scanned": 8673577,
    "ScannedByte": 2.72400814E10
  },
  "status": "HOLD"
}
There are multiple such result objects, each holding an array under the results key, and they are not separated by a comma.
I am trying to print just the "AccountID" values sorted by "Requests", in zsh using jq. I have tried flattening them and using:
jq -r '.results[][0] |.value ' filename
jq -r '.results[][1] |.value ' filename
to get the AccountID and Requests values separately and sort them. I don't think bash has a dictionary that can be used. The problem lies in the file: "field" and "value" are not themselves a key-value pair but a pair of pairs. Extracting them with the two lines above into separate arrays and sorting by the second array therefore seems too long-winded, and I was wondering if there is a way to combine both operations.
The other way would be to combine it all into a string and sort it in ascending order. Python would probably have the best solution, but the code needs to be a zsh or bash script.
Solutions that use sed, jq, or any other tool available from zsh are welcome. If there is a way to create a dictionary in bash, please let me know.
The projected output requirement is just the AccountID versus the request number:
5577497 has 51900 requests
5177497 has 50900 requests
pro has 251 requests
If you don't mind learning a little jq, it will probably be best to write a small jq program to do what you want.
To get you started, consider the following jq program, which assumes your input is a stream of valid JSON objects with a "results" key similar to your sample:
[inputs | .results[] | map( { (.field) : .value} ) | add]
After making minor changes to your input so that it consists of valid JSON objects, an invocation of jq with the -n option produces an array of AccountID/Requests objects:
[
  {
    "AccountID": "5177497",
    "Requests": "50900"
  },
  {
    "AccountID": "pro",
    "Requests": "251"
  },
  {
    "AccountID": "5577497",
    "Requests": "51900"
  }
]
You could (for example) now use jq's group_by to group these objects by AccountID, and thereby produce the result you want.
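For instance, here is one way to finish the job without group_by, as a sketch (assuming the stream has been fixed to consist of valid JSON objects in data.json): sort the objects by the numeric value of Requests and print the report directly:
jq -nr '[inputs | .results[] | map({(.field): .value}) | add]
        | sort_by(.Requests | tonumber) | reverse | .[]
        | "\(.AccountID) has \(.Requests) requests"' data.json
This prints the three lines shown in the question, highest request count first.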
jq -S '.results[] | map( { (.field) : .value} ) | add' query-results-aggregate \
| jq -s -c 'group_by(.number_of_requests) | .[]'
This does the trick. Thanks to peak for the guidance.

How to use jq to reconstruct complete contents of json file, operating only on part of interest?

All the examples I've seen so far "reduce" the output (filter out some part). I understand how to operate on the part of the input I want to, but I haven't figured out how to output the rest of the content "untouched".
The particular example would be an input file with several high-level entries, say "array1", "field1", "array2", "array3". The contents of each array are different. The specific processing I want to do is to sort the "array1" entries by a "name" field, which is doable with:
jq '.array1 | sort_by(.name)' test.json
but I also want this output to appear under "array1", with all the other data preserved.
Example input:
{
  "field1": "value1",
  "array1":
  [
    { "name": "B", "otherdata": "Bstuff" },
    { "name": "A", "otherdata": "Astuff" }
  ],
  "array2":
  [
    array2 stuff
  ],
  "array3":
  [
    array3 stuff
  ]
}
Expected output:
{
  "field1": "value1",
  "array1":
  [
    { "name": "A", "otherdata": "Astuff" },
    { "name": "B", "otherdata": "Bstuff" }
  ],
  "array2":
  [
    array2 stuff
  ],
  "array3":
  [
    array3 stuff
  ]
}
I've tried using map, but I can't seem to get the syntax right to handle any input other than the array I want sorted by name.
Whenever you use the assignment operators (=, |=, +=, etc.), the context of the expression is kept unchanged. So as long as your top-level filter(s) are assignments, in the end, you'll get the rest of the data (with your changes applied).
In this case, you're just sorting the array1 array so you could just update the array.
.array1 |= sort_by(.name)
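For example, with the sample input above in test.json:
jq '.array1 |= sort_by(.name)' test.json
prints the whole document with only "array1" reordered, matching the expected output.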

Using jq to list keys in a JSON object

I have a hierarchically deep JSON object created by a scientific instrument, so the file is somewhat large (1.3MB) and not readily human-readable. I would like to get a list of keys, up to a certain depth, for the JSON object. For example, given an input object like this:
{
  "acquisition_parameters": {
    "laser": {
      "wavelength": {
        "value": 632,
        "units": "nm"
      }
    },
    "date": "02/03/2525",
    "camera": {}
  },
  "software": {
    "repo": "github.com/username/repo",
    "commit": "a7642f",
    "branch": "develop"
  },
  "data": [{},{},{}]
}
I would like output like this:
{
  "acquisition_parameters": [
    "laser",
    "date",
    "camera"
  ],
  "software": [
    "repo",
    "commit",
    "branch"
  ]
}
This is mainly for the purpose of being able to enumerate what is in a JSON object. After processing, the JSON objects from the instrument begin to diverge: for example, some may have a field like .frame.cross_section.stats.fwhm, while others may have .sample.species, so it would be convenient to be able to interrogate the JSON object on the command line.
The following should do exactly what you want:
jq '[(keys - ["data"])[] as $key | { ($key): .[$key] | keys }] | add'
This will give the following output, using the input you described above:
{
  "acquisition_parameters": [
    "camera",
    "date",
    "laser"
  ],
  "software": [
    "branch",
    "commit",
    "repo"
  ]
}
Given your purpose you might have an easier time using the paths builtin to list all the paths in the input and then truncate at the desired depth:
$ echo '{"a":{"b":{"c":{"d":true}}}}' | jq -c '[paths|.[0:2]]|unique'
[["a"],["a","b"]]
Here is another variation using reduce and setpath which assumes you have a specific set of top-level keys you want to examine:
. as $v
| reduce ("acquisition_parameters", "software") as $k (
    {};
    setpath([$k]; $v[$k] | keys)
  )
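And if you would rather not hardcode any names, here is a small sketch that keeps every top-level key whose value is itself an object:
jq 'to_entries | map(select(.value | type == "object") | {(.key): (.value | keys)}) | add'
On the sample input this skips "data" (an array) and produces the same two-entry result as above.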