Select lots of known IDs from a big JSON document efficiently - json

I am trying to extract some values from JSON with jq in bash. With a small document it works fine, but with a big JSON document it is too slow: about one value every 2-3 seconds. Here is an example of my code:
pagejson=$(curl -s -A "some useragent" "url")
pid=$(jq '.page_ids[]' idlist.json)
for id in $pid
do
echo "$pagejson" | jq -r '.page[]|select(.id=='$id')|.url' >> path.url
done
"pid" is the list of ids that I enter before running the script. It may contain 700-1000 ids. Example of the JSON:
{
"page":[
{
"url":"some url",
"id":some numbers
},
{
"url":"some url",
"id":some numbers
}
]
}
Is there any way to speed this up? In JavaScript the same task runs faster. Example of the JavaScript:
//First sort the objects into id order
var url="";
var sortedjson= ids.map(id => obj.find(page => page.id === id));
//Then collect the urls
for (let x = 0; x < sortedjson.length; x++) {
url += sortedjson[x].url;
}
Should I sort the JSON as in the JavaScript for better performance? I haven't tried it because I don't know how.
Edit:
Replaced the "pid" variable with JSON to use less code, and for id in $(echo $pid) with for id in $pid.
But it still slows down once the id list has more than about 50 entries.

Calling jq once per id is always going to be slow. Don't do that -- call jq just once, and have it match against the full set.
You can accomplish that by passing the entire comma-separated list of ids into your one copy of jq, and letting jq itself do the work of splitting that string into individual items (and then putting them in a dictionary for fast access).
For example:
pid="24885,73648,38758,8377,747"
jq --arg pidListStr "$pid" '
($pidListStr | [split(",")[] | {(.): true}] | add) as $pidDict |
.page[] | select($pidDict[.id | tostring]) | .url
' <<<"$pagejson"
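As a minimal sketch of the same single-call idea, the ids can also be handed to jq as JSON with --argjson, which avoids building and splitting a comma-separated string. The sample data and the names idlist, $wanted and urls are made up for illustration; the real id list would come from idlist.json.

```shell
# Hypothetical sample data standing in for the real download and id list.
pagejson='{"page":[{"url":"u1","id":1},{"url":"u2","id":2},{"url":"u3","id":3}]}'
idlist='{"page_ids":[3,1]}'

# One jq call: turn the id array into a lookup object, then scan .page once.
urls=$(printf '%s' "$pagejson" | jq -r --argjson ids "$idlist" '
  ($ids.page_ids | map({(tostring): true}) | add) as $wanted
  | .page[]
  | select($wanted[.id | tostring])
  | .url
')

printf '%s\n' "$urls"
```

Note that the urls come out in the order of the .page array, not in the order of the id list.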

The following solution uses the same approach as the one posted by Charles Duffy (*) but is only applicable:
if each of the specified id values in $pid appears at most once as an id in the JSON objects in the .page array; or
if the goal is to extract, for each id in $pid, at most one corresponding object from the .page array.
The idea is to remove an id from the dictionary once it is found, and to stop if and when all ids have been found.
jq --arg pidListStr "$pid" '
($pidListStr | [splits(" *, *") | {(.): true}] | add) as $pidDict
| label $finish
| foreach .page[] as $page ($pidDict + {emit:null};
if length == 1 then break $finish
else ($page.id | tostring) as $id
| if .[$id] then delpaths([[$id]]) | .emit = $page.url
else .emit = null
end
end;
.emit // empty )
' <<<"$pagejson"
(*) Caveat
Using $pidDict here assumes there are no "collisions"; this condition would hold if all the id values in the .page objects are numeric.

The following is a response to the original question, which posited:
pid="24885,73648,38758,8377,747"
echo $pagejson|jq -r '.page[]|select(.id=='$pid')|.url'
(Based on subsequent edits to the question, it would appear that the intent was to iterate over the id values separately, invoking jq once per value. That is a bad idea as well but can be dealt with in a separate response.)
Response to original question
There are several problems with the invocation of jq based on
interpolating $pid as was originally done.
The major problem is that your query, when expanded, includes this select statement:
select(.id==24885,73648,38758,8377,747)
whereas what you evidently intend is:
select(.id==(24885,73648,38758,8377,747))
It's not difficult to see that there's a huge difference, which affects both functionality and performance.
Since you don't give any hints about the expected input, it's not feasible to suggest how the query might be optimized. To illustrate, though, suppose it's known that the .id values in the input are distinct. Then once all the ids in the query have been found, execution can stop.
In general, passing shell variables in by string interpolation is not a great idea. Some alternatives to consider are using --arg or --argjson.
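To illustrate the --argjson alternative mentioned above, here is a small sketch that passes the same comma-separated list as a JSON array instead of splicing it into the program text. The sample .page data is invented for the example.

```shell
# Hypothetical sample data; only two of the five ids are present.
pagejson='{"page":[{"url":"u1","id":24885},{"url":"u2","id":5},{"url":"u3","id":747}]}'
pid='24885,73648,38758,8377,747'

# Wrap the list in brackets so that it is a valid JSON array,
# and let jq receive it as the variable $ids.
urls=$(printf '%s' "$pagejson" | jq -r --argjson ids "[$pid]" '
  .page[] | select(.id == $ids[]) | .url
')

printf '%s\n' "$urls"
```

This compares every page against every id, so for long id lists a dictionary lookup is likely to be the better choice; it also emits a url more than once if the id list contains duplicates.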

The following solution, which is based on the one posted by Charles
Duffy (*), can be used if each of the specified id values in $pid
appears at most once as an id in the JSON objects in the .page array.
The idea is to stop if and when all the $pid ids have been found.
This can be accomplished with the following helper function:
def first_n(stream; $n):
label $done
| foreach stream as $x (0; .+1; $x, if . >= $n then break $done else empty end);
The solution can then be written as follows:
($pidListStr | [splits(" *, *") | {(.): true}] | add) as $pidDict
| ($pidDict|length) as $n
| first_n(.page[] | select($pidDict[.id | tostring]) | .url; $n)
This solution is similar to the one using foreach posted elsewhere
on this page, but is simpler and probably slightly more efficient as
the dictionary, once constructed, is unaltered.
The solution using foreach, however, can also be used if the ids of the
objects in the .page array are not unique, and if the goal is to
extract, for each id in $pid, at most one corresponding object from
the .page array.
(*) Caveat
Using $pidDict here assumes there are no "collisions"; this condition would hold if all the id values in the .page objects are numeric.
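For comparison, jq's built-in limit/2 can play the role of the helper function. The following is only a sketch under the same uniqueness assumption, with invented sample data; it stops reading .page as soon as as many urls have been emitted as there are ids in the dictionary.

```shell
# Hypothetical sample data; ids are unique within .page.
pagejson='{"page":[{"url":"u1","id":1},{"url":"u2","id":2},{"url":"u3","id":3},{"url":"u4","id":4}]}'
pid='1, 3'

urls=$(printf '%s' "$pagejson" | jq -r --arg pidListStr "$pid" '
  ($pidListStr | [splits(" *, *") | {(.): true}] | add) as $pidDict
  # Stop after as many matches as there are requested ids.
  | limit($pidDict | length;
          .page[] | select($pidDict[.id | tostring]) | .url)
')

printf '%s\n' "$urls"
```

With this input the page with id 4 is never examined, because the second match is the last one needed.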

Related

How to select an element in an array based on two conditions in JMESPath?

I'm trying to select the SerialNumber of a specific AWS MFADevice for different profiles.
This command returns the list of MFADevices for a certain profile:
aws iam list-mfa-devices --profile xxx
and this is a sample JSON output:
{
"MFADevices": [
{
"UserName": "foobar@example.com",
"SerialNumber": "arn:aws:iam::000000000000:mfa/foo",
"EnableDate": "2022-12-06T16:23:41+00:00"
},
{
"UserName": "barfoo@example.com",
"SerialNumber": "arn:aws:iam::111111111111:mfa/bar_cli",
"EnableDate": "2022-12-12T09:13:10+00:00"
}
]
}
I would like to select the SerialNumber of the device containing the string cli. But in case there is only one device in the list (regardless of the presence or absence of the string cli), I'd like to get its SerialNumber.
I have this expression which already filters for the first condition, namely the desired string:
aws iam list-mfa-devices --profile xxx --query 'MFADevices[].SerialNumber | [?contains(@,`cli`)] | [0]'
However I still haven't been able to figure out how to add the if number_of_devices == 1 then return the serial of that single device.
I can get the number of MFADevices with this command:
aws iam list-mfa-devices --profile yyy --query 'length(MFADevices)'
And as a first step towards my final solution I wanted to initially get the SerialNumber only in the case the list has exactly one element, so, I thought of something like this:
aws iam list-mfa-devices --profile yyy --query 'MFADevices[].SerialNumber | [?length(MFADevices) ==`1`]'
but already at this stage I get the error below (let alone the fact that I still need to combine it with the cli part):
In function length(), invalid type for value: None, expected one of: ['string', 'array', 'object'], received: "null"
Does anybody know how to achieve what I want?
I know that I could just pipe the raw output to jq and do the filtering there, but I was wondering if there is a way to do it directly in the command using some JMESPath expression.
To express this kind of condition in JMESPath you have to rely on logical or (||) and logical and (&&), because the language has no conditional keyword as such.
So, in pseudo-code, instead of doing:
if length(MFADevices) == 1
MFADevices[0]
else
MFADevices[?someFilter]
You have to do, like in bash:
length(MFADevices) == 1 and MFADevices[0] or MFADevices[?someFilter]
So, in JMESPath:
length(MFADevices) == `1`
&& MFADevices[0].SerialNumber
|| (MFADevices[?contains(SerialNumber, `cli`)] | [0]).SerialNumber
Note: this assumes that, if there is more than one element but none contains cli, we should get null.
If you want the first element even when there are multiple devices and no SerialNumber contains cli, then you can simplify further and use a single logical or, which takes over when the contains filter returns nothing (as a null result evaluates to false):
(MFADevices[?contains(SerialNumber, `cli`)] | [0]).SerialNumber
|| MFADevices[0].SerialNumber
With stedolan/jq you can filter for the substring and unconditionally add the first, then take the first of them:
.MFADevices | map(.SerialNumber) | first((.[] | select(contains("cli"))), first)
or
[.MFADevices[].SerialNumber] | map(select(contains("cli"))) + .[:1] | first
Output:
arn:aws:iam::111111111111:mfa/bar_cli

JSON: using jq with variable keys

I have input JSON data in a bunch of files, with an IP address as one of the keys. I need to iterate over the files, and I need to get "stuff" out of them. The IP address is different for each file, but I'd like to use a single jq command to get the data. I have tried a bunch of things; the closest I've come is this:
jq '.facts | keys | keys as $ip | .[0]' a_file
On my input in a_file of:
{
"facts": {
"1.1.1.1": {
"stuff":"value"
}
}
}
it returns the IP address, i.e. 1.1.1.1, but then how do I go back and do something like this (which is obviously wrong, but I hope you get the idea):
jq '.facts.$ip[0].stuff' a_file
In my mind I'm trying to populate a variable, and then use the value of that variable to rewind the input and scan it again.
=== EDIT ===
Note that my input was actually more like this:
{
"facts": {
"1.1.1.1": {
"stuff": {
"thing1":"value1"
}
},
"outer_thing": "outer_value"
}
}
So I got an error:
jq: error (at <stdin>:9): Cannot index string with string "stuff"
This fixed it- the question mark after .stuff:
.facts | keys_unsorted[] as $k | .[$k].stuff?
You almost got it right, but you need the object value iterator construct, .[], to get the value corresponding to the key:
.facts | keys_unsorted[] as $k | .[$k].stuff
This assumes that, inside facts you have one object containing the IP address as the key and you want to extract .stuff from it.
Optionally, to guard against objects that don't contain stuff inside, you could add ? as .[$k].stuff?. And also you could optionally validate keys against a valid IP regex condition and filter values only for those keys.
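As an alternative sketch for the edited input, which mixes an IP-keyed object with a plain string value, to_entries exposes each key together with its value, so non-object values can be skipped explicitly instead of being silenced with ?. The output field names ip and stuff are chosen here just for illustration.

```shell
# Sample input taken from the edit in the question.
input='{"facts":{"1.1.1.1":{"stuff":{"thing1":"value1"}},"outer_thing":"outer_value"}}'

result=$(printf '%s' "$input" | jq -c '
  .facts
  | to_entries[]
  | select(.value | type == "object")   # skip plain values such as outer_thing
  | {ip: .key, stuff: .value.stuff}
')

printf '%s\n' "$result"
```

This keeps the key (the IP address) alongside the extracted data, which may be useful when iterating over many files.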

How do I filter JSON using jq based on if an attribute value is in an array?

I have some JSON that I need to filter based on whether certain attribute values are present in an array.
I have something that works but it feels like a kludge. Is there a neater way of doing this?
Input
{"potato":4}
Filter
select(.potato as $k | ([1,2,3,4] | any(. == $k)))
Output
{
"potato": 4
}
jqplay link
https://jqplay.org/s/Ts97jkk21K
Does this seem less of a kludge?
[1,2,3,4] as $acceptable
| .potato as $k
| select( any($acceptable[]; . == $k) )
If your jq has IN/1, you could skip the $k:
[1,2,3,4] as $acceptable
| select(.potato | IN($acceptable[]))
This style makes it easy to pass $acceptable in as a command-line parameter, for example.
Temptation
It is easy to be tempted by the simplicity of a select-only solution such as:
[1,2,3,4] as $acceptable
| select($acceptable[] == .potato)
This would be fine under certain circumstances, e.g. if $acceptable is short and does not contain duplicates (assuming we want the semantics of any). But any and IN have a short-circuit semantics that may be desirable, e.g. for efficiency.
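The remark above about passing $acceptable as a command-line parameter could look roughly like this, assuming the list is available as JSON text. The second input object is added only to show a non-match.

```shell
# Two sample inputs, one per line; only the first has an acceptable value.
result=$(printf '%s\n' '{"potato":4}' '{"potato":9}' |
  jq -c --argjson acceptable '[1,2,3,4]' '
    .potato as $k
    | select(any($acceptable[]; . == $k))
  ')

printf '%s\n' "$result"
```

Because the list lives outside the jq program, the same filter can be reused with a different list on each invocation.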

JSON key-globbing

On the jq manual page there are a few examples of output formatting, particularly some shortcuts for when you want to just echo exactly what was in the input JSON.
What if I want to echo exactly what was in the input, but only for keys that match a certain pattern?
For example, given input like so ...
[
{"Name":"Widgets","Size":10,"SymUS":"Widg","SymCN":"Zyin","SymJP":"Kono"},
{"Name":"Blodgets","Size":400,"SymUS":"Blodg","SymAU":"Blod","SymJP":"Kado"},
{"Name":"Fonzes","Size":11,"SymRU":"Fyet","SymBR":"Foao"}
]
Say I want to select all objects where the Name ends in "ets" and then display the Name and all attributes of the form Sym*. All I know about those attributes is that there will be one or more per JSON object, and the names have the format Sym followed by a two-letter ISO country code.
I would like to just do this:
jq '.[] | select(.Name | endswith("ets")) | {Name, Sym*}'
but that's not a thing.
Is this just not something jq is designed to handle in a single operation? Should I do a first pass through the file to collect all the possible keys and then list them all explicitly via a slurpfile?
The key to a simple solution to the problem is to_entries, as described in the online manual. With your example data, the following filter produces the output shown below, in accordance with what I understand to be the expectations:
.[]
| select(.Name | test("ets$"))
| {Name} + (to_entries | map(select(.key|test("^Sym"))) | from_entries)
You might want to refine the regex tests, and/or make other minor adjustments.
Output:
{
"Name": "Widgets",
"SymUS": "Widg",
"SymCN": "Zyin",
"SymJP": "Kono"
}
{
"Name": "Blodgets",
"SymUS": "Blodg",
"SymAU": "Blod",
"SymJP": "Kado"
}
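A slightly shorter variant of the same idea uses with_entries, which wraps the to_entries / from_entries round trip. This sketch uses startswith instead of a regex, so it matches any key beginning with Sym rather than checking for a two-letter country code.

```shell
# Sample input from the question.
input='[
 {"Name":"Widgets","Size":10,"SymUS":"Widg","SymCN":"Zyin","SymJP":"Kono"},
 {"Name":"Blodgets","Size":400,"SymUS":"Blodg","SymAU":"Blod","SymJP":"Kado"},
 {"Name":"Fonzes","Size":11,"SymRU":"Fyet","SymBR":"Foao"}
]'

result=$(printf '%s' "$input" | jq -c '
  .[]
  | select(.Name | endswith("ets"))
  # Keep only the Name key and keys that start with "Sym".
  | with_entries(select(.key == "Name" or (.key | startswith("Sym"))))
')

printf '%s\n' "$result"
```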

jq - How do I print a parent value of an object when I am already deep into the object's children?

Say I have the following JSON, stored in my variable jsonVariable.
{
"id": 1,
"details": {
"username": "jamesbrown",
"name": "James Brown"
}
}
I parse this JSON with jq using the following:
echo "$jsonVariable" | jq -r '.details.name | select(. == "James Brown")'
This would give me the output
James Brown
But what if I want to get the id of this person as well? Now, I'm aware this is a rough and simple example - the program I'm working with at the moment is 5 or 6 levels deep with many different JQ functions other than select. I need a way to select a parent's field when I am already 5 or 6 layers deep after carrying out various methods of filtering.
Can anyone help? Is there any way of 'going in reverse', back up to the parent? (Not sure if I'm making sense!)
For a more generic approach, save the value of the "parent" element at the detail level you want, then pipe it at the end of your filter:
jq '. as $parent | .details.name | select(. == "James Brown") | $parent'
Of course, for the trivial case you present, you could omit this entirely:
jq 'select(.details.name == "James Brown")'
Also, consider that if your selecting filters return many matches for a single parent object, you will receive a copy of the parent object for each match. You may wish to make sure your select filters only return one element at the parent level by wrapping all matches below parent level into an array, or to deduplicate the final result with unique.
Give this a shot:
echo "$jsonVariable" | jq '{Name: .details.name, Id: .id} | select(.Name == "James Brown")'
Rather than querying up to the value you're testing for, query up to the root object that contains the value you're querying on and the values you wish to select.
You need the object that contains both the id and the name.
$ jq --arg name 'James Brown' 'select(.details.name == $name).id' input.json