Finding the location (line, column) of a field value in a JSON file

Consider the following JSON file example.json:
{
    "key1": ["arr value 1", "arr value 2", "arr value 3"],
    "key2": {
        "key2_1": ["a1", "a2"],
        "key2_2": {
            "key2_2_1": 1.43123123,
            "key2_2_2": 456.3123,
            "key2_2_3": "string1"
        }
    }
}
The following jq command extracts a value from the above file:
jq ".key2.key2_2.key2_2_1" example.json
Output:
1.43123123
Is there an option in jq that, instead of printing the value itself, prints the location (line and column, start and end position) of the value within a (valid) JSON file, given an Object Identifier-Index (.key2.key2_2.key2_2_1 in the example)?
The output could be something like:
some_utility ".key2.key2_2.key2_2_1" example.json
Output:
(6,25) (6,35)

Given JSON data and a query, there is no option in jq that, instead of printing the value itself, prints the location of possible matches.
This is because JSON parsers that provide an interface to developers usually focus on processing the logical structure of a JSON input, not the textual stream conveying it. You would have to instruct jq to explicitly treat its input as raw text, while properly parsing it at the same time, in order to extract the queried value. In the case of jq, the former can be achieved using the --raw-input (or -R) option, and the latter by parsing the read-in JSON-encoded string using fromjson.
The -R option alone would read the input linewise as a stream of strings, which would have to be collected and concatenated (e.g. using add) in order to provide the whole input at once to fromjson. The other way round, you could provide the --slurp (or -s) option, which (in combination with -R) already concatenates the input into a single string; that string, after having been parsed with fromjson, would then have to be split again into lines (e.g. using ./"\n") in order to provide row numbers. I found the latter to be more convenient.
That said, this could give you a starting point (the --raw-output (or -r) option outputs raw text instead of JSON):
jq -Rrs '
  "\(fromjson.key2.key2_2.key2_2_1)" as $query       # save the query value as string
  | ($query | length) as $length                     # save its length by counting its characters
  | . / "\n" | to_entries[]                          # split into lines with 0-based line numbers
  | {row: .key, col: .value | indices($query)[]}     # find occurrences of the query
  | "(\(.row),\(.col)) (\(.row),\(.col + $length))"  # format the output
' example.json
Output:
(5,24) (5,34)
Now, this works for the sample query, but how about the general case? Your example queried a number (1.43123123), which is an easy target as it has the same textual representation when encoded as JSON. Therefore, a simple string search and character count did a fairly good job (not a perfect one, because it would still find any occurrence of that character stream, not just "values"). For more precision, and especially with more complex JSON datatypes being queried, you would need to develop a more sophisticated searching approach, probably involving more JSON conversions, whitespace stripping and other normalizing shenanigans. So, unless your goal is to rebuild a full JSON parser within another one, you should narrow it down to the kinds of queries you expect, and compose an appropriately tailored searching approach. This solution provides you with concepts to simultaneously process the input textually and structurally, and with a simple search and output integration.
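For instance, if your queries always target object members, a sketch along the same lines (hypothetical, in that it searches for the key name "key2_2_1" rather than for the queried value) could look like this:
jq -Rrs '
  "\"key2_2_1\":" as $needle                         # the key name, JSON-encoded, plus colon
  | ($needle | length) as $length
  | . / "\n" | to_entries[]                          # split into lines with 0-based line numbers
  | {row: .key, col: .value | indices($needle)[]}    # find occurrences of the key
  | "(\(.row),\(.col)) (\(.row),\(.col + $length))"  # format the output
' example.json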


How do you conditionally change a string value to a number in JQ?

I am pulling a secret from SecretsManager in AWS and using the resulting JSON to build a parameters JSON file that I can pass on to the CloudFormation engine. Unfortunately, SecretsManager stores all values as strings, so when I try to pass these values to my CloudFormation template it fails, because it is passing a string where some CloudFormation parameters need to be numbers.
In the example below, I want to tell JQ that "HEALTH_CHECK_UNHEALTHY_THRESHOLD_COUNT" and "AUTOSCALING_MAX_CAPACITY" are numbers. So, I prefix the key with "NUMBER::".
This serves two purposes. First, it tells the person viewing this secret that it will be converted to a number; second, it tells JQ to convert the string value "2" to 2. This needs to scale so that I can have 1..n keys that need to be converted in the JSON.
Consider this JSON:
{
  "NUMBER::AUTOSCALING_MAX_CAPACITY": "12",
  "SERVICE_PLATFORM_VERSION": "1.3.0",
  "HEALTH_CHECK_PROTOCOL": "HTTPS",
  "NUMBER::HEALTH_CHECK_UNHEALTHY_THRESHOLD_COUNT": "2"
}
Here is what I'd like to do with JQ:
JQ will copy over the key/value pairs for the majority of elements in the JSON "as is": if a key has no "NUMBER::" prefix, the pair is copied over unchanged.
However, if a key is prefixed with "NUMBER::" I'd like the following to happen:
a. JQ will remove the "NUMBER::" prefix from the key name.
b. JQ will convert the value from a string to a number.
The end result is a JSON that looks like this:
{
  "AUTOSCALING_MAX_CAPACITY": 12,
  "SERVICE_PLATFORM_VERSION": "1.3.0",
  "HEALTH_CHECK_PROTOCOL": "HTTPS",
  "HEALTH_CHECK_UNHEALTHY_THRESHOLD_COUNT": 2
}
What I've tried
I have tried using map to do this with limited success. In this example I am looking for a specific field, mainly as a test. I don't want to have to call out specific keys by name, but rather use any key that begins with "NUMBER::" to do the conversions.
NOTE: The SECRET_STRING variable in the examples below contains the source JSON.
echo $SECRET_STRING | jq 'to_entries | map(if .key == "NUMBER::AUTOSCALING_MAX_CAPACITY" then . + {"value":.value} else . end ) | from_entries'
I've also tried to use "tonumber" across the entire JSON, letting JQ examine all the values and convert them to numbers. The problem is that it fails when it hits the "SERVICE_PLATFORM_VERSION" key, because "1.3.0" looks numeric but is not a valid number.
Example: echo $SECRET_STRING | jq -r '.[] | tonumber'
Recap
I'd like to use JQ to convert JSON string values to numbers by using a "NUMBER::" prefix in the key name.
Note: This problem does not exist when pulling entries from the Systems Manager Parameter Store, because AWS allows you to resolve entries as strings or numbers. The same feature does not exist in SecretsManager. I'd also like to use SecretsManager to provide a list of some 30 or more configuration items to set up my stack; with the Parameter Store you have to set up each config item as a separate entry, which would be a maintenance nightmare.
Select each entry with a key starting with NUMBER:: and update it to remove that prefix and convert the value to a number.
with_entries(
  select(.key | startswith("NUMBER::")) |= (
    (.key |= ltrimstr("NUMBER::")) |
    (.value |= tonumber)
  )
)
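For example, with the secret JSON in $SECRET_STRING as in the question, the invocation would be (producing the desired result shown above):
echo "$SECRET_STRING" | jq '
  with_entries(
    select(.key | startswith("NUMBER::")) |= (
      (.key |= ltrimstr("NUMBER::")) |
      (.value |= tonumber)
    )
  )'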

Create JSON with 2 large data sets?

I have 2 huge sets of numbers in columns 1 and 2 in an Excel sheet. I want to pair my first column with my second column to create a JSON file like this one: https://github.com/python-visualization/folium/blob/master/examples/data/data3.json,
something like
[{"0500000US33009": 51289.0, "0500000US38041": 46793.0, "0500000US38043": 39857.0}]
if I had just the numbers 0500000US33009, 0500000US38041, and 0500000US38043 in column 1 and 51289.0, 46793.0, and 39857.0 in column 2. How might I do this, and make sure that the resulting JSON has quotes around the "0500000US33009"?
Thanks in advance!
if I had just the numbers ...
The following assumes that the first two columns have been extracted into a CSV or TSV file, e.g.
0500000US33009,51289.0
0500000US38041,46793.0
0500000US38043,39857.0
With the input data in this format, it would of course be very easy to use your favorite text-processing tool to create a JSON dictionary.
Assuming a simple CSV format as above, the data could also be converted into a JSON dictionary using an invocation along the lines of:
jq -Rn 'reduce inputs as $in ({};
  . + ($in | split(",") | {(.[0]): .[1] | tonumber}))'
Using jq 1.6 or earlier, this would produce:
{
  "0500000US33009": 51289,
  "0500000US38041": 46793,
  "0500000US38043": 39857
}
The change in format of the numeric values is the result of a conversion to IEEE 754 64-bit numbers, and can be avoided by using a more recent version of jq. Using the current "master" version, the result would be:
{
  "0500000US33009": 51289.0,
  "0500000US38041": 46793.0,
  "0500000US38043": 39857.0
}
So if you're stuck with jq 1.6 or earlier and require the explicit decimal point, you might want to consider omitting |tonumber in the above program and adding a post-processing step as required.
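For example, a sketch of that variant, leaving the values as strings so their textual representation survives intact:
jq -Rn 'reduce inputs as $in ({};
  . + ($in | split(",") | {(.[0]): .[1]}))'
This would emit e.g. "0500000US33009": "51289.0", and a textual post-processing step could then strip the quotes around the numbers.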
Some words of caution
The jq solution above assumes there are no collisions (one key having more than one value), or rather, that if there are any collisions, then the last key-value pair should prevail. (A variant that collects colliding values instead is sketched after these notes.)
If any of the values in the second column cannot be represented with sufficient accuracy as IEEE 754 64-bit numbers, then a significantly different strategy may be warranted.
If the number of rows is indeed very large (e.g. in the billions), then the wisdom of constructing a single ginormous JSON dictionary might be worth reconsidering.
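Regarding the first point: if colliding values should be collected rather than overwritten, a small variation of the above program (a sketch, using jq's destructuring syntax) could accumulate an array of values per key:
jq -Rn 'reduce inputs as $in ({};
  ($in | split(",")) as [$k, $v]
  | .[$k] += [$v | tonumber])'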

Find whether length of a string in json array exceeds certain limit

I have a file which contains many JSON arrays. I need to find out whether the length of any value in any of the arrays exceeds a limit, say 1000. If it does, I have to trim that particular value. After that, the file will be fed to a downstream application. What is the best solution to implement in shell scripting? I tried jq and sed, but they didn't seem to work. Maybe I haven't explored them completely. Any suggestion on this use case will be highly appreciated!
Unfortunately the originally posted question is rather vague on a number of points, so I'll first focus on determining whether an arbitrary JSON document has a string value (excluding key names) that exceeds a certain given size.
To find the maximum of a stream of numbers, we can write:
def max(stream): reduce stream as $s (null;
  if $s > . then $s else . end);
Let us suppose the above def, together with the following line, is in a file named max.jq:
max(.. | strings | length) > $mx
Then we could find the answer by running a command such as:
jq --argjson mx 4 -f max.jq INPUT.json
A shorter but possibly less space-efficient answer:
jq --argjson mx 4 '[..|strings|length]|max > $mx' INPUT.json
Variants
There are many possible variants, e.g. you might want to arrange things so that jq returns a suitable return code rather than emitting a boolean value.
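For instance, jq's -e (--exit-status) option sets the exit code according to the last output value (0 for a truthy result, 1 for false or null), so the check can drive a shell conditional directly:
if jq -e --argjson mx 4 '[.. | strings | length] | max > $mx' INPUT.json > /dev/null
then
    echo "some string value exceeds the limit"
fi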
Truncating long strings
To truncate strings longer than a given length, say $mx, you could use walk/1, like so:
walk(if type == "string" and length > $mx
     then .[:$mx] else . end)
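A complete invocation for the original use case (with a limit of 1000, as in the question) might then be; note that walk/1 is built in as of jq 1.6, whereas with jq 1.5 it has to be defined explicitly:
jq --argjson mx 1000 '
  walk(if type == "string" and length > $mx
       then .[:$mx] else . end)' INPUT.json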

Convert CSV to Grouped JSON

I have several large CSV's which I would like to export to a particular JSON format but I'm not really sure how to convert it over. It's a list of usernames and urls.
b00nw33,harrypotter788.flv
b00nw33,harrypotter788.mov
b00nw33,levitation271.avi
b01spider,schimbvalutar109.avi
...
I want to export them to JSON grouped by the username like the following
{
  "b00nw33": [
    "harrypotter788.flv",
    "harrypotter788.mov",
    "levitation271.avi"
  ],
  "b01spider": [
    "schimbvalutar109.avi"
  ]
}
What is the JQ to do this? Thank you!
The key to a simple solution is the generic function aggregate_by:
# In this formulation, f must either always evaluate to a string or
# always to an integer, it being understood that negative integers
# might be problematic
def aggregate_by(s; f; g):
  reduce s as $x (null; .[$x|f] += [$x|g]);
If the CSV can be accurately parsed by simply splitting on commas, then the desired transformation can be accomplished using the following jq filter:
aggregate_by(inputs | split(","); .[0]; .[1])
This assumes jq is invoked with the -R (raw) and -n options.
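Putting the pieces together, a complete invocation might look like this (assuming the CSV shown above is in a file named input.csv):
jq -Rn '
  def aggregate_by(s; f; g):
    reduce s as $x (null; .[$x|f] += [$x|g]);
  aggregate_by(inputs | split(","); .[0]; .[1])
' input.csv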
Output
With the given CSV input, the output would be:
{
  "b00nw33": [
    "harrypotter788.flv",
    "harrypotter788.mov",
    "levitation271.avi"
  ],
  "b01spider": [
    "schimbvalutar109.avi"
  ]
}
Handling non-trivial CSV
The above solution assumes that the CSV is as uncomplicated as the sample. If, on the contrary, the CSV cannot be accurately parsed by simply splitting at commas, a more general parser will be needed.
One approach would be to use the very robust and fast csv2json parser at https://github.com/fadado/CSV
Alternatively, you could use one of the many available "csv2tsv" parsers to generate TSV, which jq can handle directly (by splitting on tabs, i.e. split("\t") rather than split(",")).
In any case, once the CSV has been converted to JSON, the filter aggregate_by defined above can be used.
If you are interested in a jq parser for CSV, you might want to look at fromcsvfile (https://gist.github.com/pkoppstein/bbbbdf7489c8c515680beb1c75fa59f2); see also the definitions for fromcsv being proposed at https://github.com/stedolan/jq/issues/1650#issuecomment-448050902

AWS CLI / jq - transforming JSON with tags, and showing information even for non-defined tags

I'm facing an issue when trying to process the output of the 'aws ec2 describe-instances' command with 'jq', and I really need some help. I want to transform the JSON output into a CSV file with the list of all instances, with columns 'Name,InstanceId,Tag-Client,Tag-CostCenter'.
I've been using jq's select with a command like:
aws ec2 describe-instances |
  jq -r '.Reservations[].Instances[]
    | (.Tags[] | select(.Key=="Name") | .Value) + "," + .InstanceId + ","
      + (.Tags[] | select(.Key=="Client") | .Value) + ","
      + (.Tags[] | select(.Key=="CostCenter") | .Value)'
However, using selects in this way, only those entries containing all three tags are displayed; entries that have only some of the tags are dropped.
I understand the behavior, which is similar to a grep, but I'm trying to figure out if it's possible to perform this operation using jq, so that when a tag is not defined, it would just return the empty string "" instead of removing the whole line.
I've found a reference about using 'if' clauses in jq (https://ilya-sher.org/2016/05/11/most-jq-you-will-ever-need/), but I'm wondering if anyone has resolved such a case without having to code this logic or split the command into different executions.
Whenever you are given an array of key/value pairs (the tags here) and you want to extract values by their key, it'll be easier to map them into an object so you can access them directly. Functions like from_entries will work well with this.
However, since you're also trying to retrieve values not within this tag array, you can approach it a little differently to save some steps. Using reduce or foreach, you can go through each of the tags and add it to an object that holds all the values you're interested in. Then you can map the values you want into an array and convert it to a CSV row.
So if your goal is to create rows of Tags[Name], InstanceId, Tags[Client], Tags[CostCenter] for each instance, you could do this:
# for each instance
.Reservations[].Instances[]
# map each instance to an object where we can easily extract the values
| reduce .Tags[] as $t (
    { InstanceId };        # we want the InstanceId from the instance
    .[$t.Key] = $t.Value   # add the tag values to the object
  )
# map the desired values to an array
| [ .Name, .InstanceId, .Client, .CostCenter ]
# convert to csv
| @csv
And the good news is: if Name, Client, or CostCenter doesn't exist in the tag array, or even InstanceId, they'll just be null, which becomes empty when converted to CSV.
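A complete pipeline might then look as follows; note the -r (--raw-output) flag, which is needed for @csv to emit plain CSV lines rather than JSON-encoded strings:
aws ec2 describe-instances | jq -r '
  .Reservations[].Instances[]
  | reduce .Tags[] as $t ({ InstanceId }; .[$t.Key] = $t.Value)
  | [ .Name, .InstanceId, .Client, .CostCenter ]
  | @csv
'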