extract key-value at an unknown level - json

jq -r .x <<< '{"x": "abc"}'
When I know at which level the key-value pair I want to extract sits, I do something like the above.
What if I am not sure at which level it is, like in {..., {"x": "abc"}, ...}? How can I extract the value in this case (suppose there is exactly one such match, or I only care about the first match)?

How can I extract the value
Assuming you regard null as a value, and that you want at most one .x value,
I would venture to say that the most succinct expression (code-golf style) would be:
first(..|select(has("x")?).x)
Note that this will yield the empty stream if there is no .x value.
The following is slightly shorter but has different behavior if there is no "x" key:
[..|select(has("x")?).x][0]
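For instance, a quick sketch with a made-up nested input:
jq 'first(.. | select(has("x")?).x)' <<< '{"a": 1, "b": {"c": {"x": "abc"}}}'
"abc"
Here .. visits every value recursively, has("x")? silently skips the values that are not objects, and first stops at the first match.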


Finding the location (line, column) of a field value in a JSON file

Consider the following JSON file example.json:
{
  "key1": ["arr value 1", "arr value 2", "arr value 3"],
  "key2": {
    "key2_1": ["a1", "a2"],
    "key2_2": {
      "key2_2_1": 1.43123123,
      "key2_2_2": 456.3123,
      "key2_2_3": "string1"
    }
  }
}
The following jq command extracts a value from the above file:
jq ".key2.key2_2.key2_2_1" example.json
Output:
1.43123123
Is there an option in jq that, instead of printing the value itself, prints the location (line and column, start and end position) of the value within a (valid) JSON file, given an Object Identifier-Index (.key2.key2_2.key2_2_1 in the example)?
The output could be something like:
some_utility ".key2.key2_2.key2_2_1" example.json
Output:
(6,25) (6,35)
Given JSON data and a query, there is no option in jq that, instead of printing the value itself, prints the location of possible matches.
This is because JSON parsers providing an interface to developers usually focus on processing the logical structure of a JSON input, not the textual stream conveying it. You would have to instruct it to explicitly treat its input as raw text, while properly parsing it at the same time in order to extract the queried value. In the case of jq, the former can be achieved using the --raw-input (or -R) option, the latter then by parsing the read-in JSON-encoded string using fromjson.
The -R option alone would read the input linewise into an array of strings, which would have to be concatenated (e.g. using add) in order to provide the whole input at once to fromjson. The other way round, you could also provide the --slurp (or -s) option which (in combination with -R) already concatenates the input to a single string which then, after having parsed it with fromjson, would have to be split again into lines (e.g. using /"\n") in order to provide row numbers. I found the latter to be more convenient.
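To illustrate both routes on the example.json from above (a sketch; both simply re-emit the parsed document):
jq -Rn '[inputs] | add | fromjson' example.json   # linewise: collect the lines, concatenate, parse
jq -Rs 'fromjson' example.json                    # slurped: parse the single input string directly
Note that the linewise reading already strips the newlines, which is harmless for parsing but is exactly why the slurped variant is more convenient when you also need line numbers.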
That said, this could give you a starting point (the --raw-output (or -r) option outputs raw text instead of JSON):
jq -Rrs '
"\(fromjson.key2.key2_2.key2_2_1)" as $query # save the query value as string
| ($query | length) as $length # save its length by counting its characters
| ./"\n" | to_entries[] # split into lines and provide 0-based line numbers
| {row: .key, col: .value | indices($query)[]} # find occurrences of the query
| "(\(.row),\(.col)) (\(.row),\(.col + $length))" # format the output
'
(5,24) (5,34)
Now, this works for the sample query, but how about the general case? Your example queried a number (1.43123123), which is an easy target as it has the same textual representation when encoded as JSON. Therefore, a simple string search and length count did a fairly good job (not a perfect one, because it would still find any occurrence of that character stream, not just "values"). For more precision, and especially with more complex JSON datatypes being queried, you would need to develop a more sophisticated searching approach, probably involving more JSON conversions, whitespace stripping and other normalizing shenanigans. So, unless your goal is to rebuild a full JSON parser within another one, you should narrow it down to the kind of queries you expect, and compose an appropriately tailored searching approach. This solution provides you with concepts to simultaneously process the input textually and structurally, and with a simple search and output integration.

Converting object to array or keeping an array with jq

I have an input file whose type may vary between array and object; one of the following inputs is expected:
[{"version": "1.0"}]
{"version": "1.0"}
How can I construct a jq expression so that the output is always an array? I came up with the following:
jq 'if (select(has("version")?)) then [.] else . end'
Once the version key is matched, the object is wrapped inside an array; but if it is not matched, which would mean the input is already an array, nothing is printed, whereas I would expect it to be printed as it is. Please suggest the right way to achieve this.
You can check the input's type directly:
jq 'if type == "object" then [.] else . end'
Or use the destructuring alternative operator ?// (available since jq 1.6):
jq '. as [$v] ?// $v | [$v]'
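Run against both sample inputs, either variant yields an array (a quick check with the type-based filter):
jq -c 'if type == "object" then [.] else . end' <<< '[{"version": "1.0"}]'
[{"version":"1.0"}]
jq -c 'if type == "object" then [.] else . end' <<< '{"version": "1.0"}'
[{"version":"1.0"}]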
Auxiliary to pmf's answer, why doesn't this work?
input.jsonl:
[{"version": "1.0"}]
{"version": "2.0"}
Command:
jq -c 'if (select(has("version")?)) then [.] else . end'
Actual output:
[{"version": "2.0"}]
Desired output:
[{"version": "1.0"}]
[{"version": "2.0"}]
First of all, select filters out values. For each value input to select, it evaluates the filter you give it, and if that filter evaluates to a true value, the whole input value is emitted unchanged. Otherwise there is no output. Using select here is unhelpful, because you want a true or false value for the if condition. If the condition part of your if-then-else expression emits no output, then the whole if-then-else expression emits no output.
A more thorough way to understand select(expr) is that every time expr emits a true value then the select(expr) emits its input value. A more thorough way to understand if cond then a else b end is that every time cond emits a true value the whole if-then-else emits a and every time cond emits a false value the whole if-then-else emits b.
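For instance, a condition emitting two values makes the whole if-then-else emit two results, and a condition emitting none makes it emit nothing at all:
jq -n 'if (true, false) then "A" else "B" end'
"A"
"B"
jq -n 'if empty then "A" else "B" end'
(no output)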
Okay, so forget the select... why doesn't this work?
Command:
jq -c 'if (has("version")?) then [.] else . end'
Actual output:
[{"version": "2.0"}]
In this case, the ? is the error suppression operator: has("version")? is equivalent to try has("version"), which is itself equivalent to try has("version") catch empty. This means that when an error occurs, the expression returns empty, which means no output. An error does indeed occur when the input is an array instead of an object. And when the condition part of the if-then-else expression emits no output, you guessed it, the whole expression emits no output.
You could make this work by doing this instead:
Command:
jq -c 'if (try has("version") catch false) then [.] else . end'
Actual output:
[{"version":"1.0"}]
[{"version":"2.0"}]
Of course that's a bit roundabout. You should follow pmf's answer. But this perhaps helps you understand why your attempt didn't go as you expected.
As a rule of thumb, try to make sure your select and if conditions always emit exactly one output for each input - that way you will not be surprised. Expressions that can emit zero outputs (e.g. expr? or select(...)) or multiple outputs (e.g. .[]) will make for a confusing time when used as conditions in select or if.
The arrays function selects an input if it is an array. Combined with // you can handle the case where it isn't:
$ jq -c 'arrays // [.]' versions.json
[{"version":"1.0"}]
[{"version":"2.0"}]
Where versions.json is:
[{"version":"1.0"}]
{"version":"2.0"}

Retrieving the first entity out of several

I am a rank beginner with jq, and I've been going through the tutorial, but I think there is a conceptual difference I don't understand. A common problem I encounter is that a large JSON file will contain many objects, each of which is quite big, and I'd like to view the first complete object, to see which fields exist, what types, how much nesting, etc.
In the tutorial, they do this:
# We can use jq to extract just the first commit.
$ curl 'https://api.github.com/repos/stedolan/jq/commits?per_page=5' | jq '.[0]'
Here is an example with one object - here, I'd like to return the whole array (just like my_array=['foo']; my_array[0] would return foo in Python).
wget https://hacker-news.firebaseio.com/v0/item/8863.json
I can access and pretty-print the whole thing with .
$ cat 8863.json | jq '.'
{
  "by": "dhouston",
  "descendants": 71,
  "id": 8863,
  "kids": [
    9224,
    ...
    8876
  ],
  "score": 104,
  "time": 1175714200,
  "title": "My YC app: Dropbox - Throw away your USB drive",
  "type": "story",
  "url": "http://www.getdropbox.com/u/2/screencast.html"
}
But trying to get the first element fails:
$ cat 8863.json | jq '.[0]'
jq: error (at <stdin>:0): Cannot index object with number
I get the same error with jq '.[0]' 8863.json, but strangely echo 8863.json | jq '.[0]' gives me parse error: Invalid numeric literal at line 2, column 0. What is the difference? Also, is this not the correct way to get the zeroth member of the JSON?
I've looked at other SO posts with this error message and at the manual, but I'm still confused. I think of the file as an array of JSON objects, and I'd like to get the first. But it looks like jq works with something called a "stream", and does operations on all of it (say, return one given field from every object).
Clarification:
Let's say I have 2 objects in my JSON:
{
  "by": "pg",
  "id": 160705,
  "poll": 160704,
  "score": 335,
  "text": "Yes, ban them; I'm tired of seeing Valleywag stories on News.YC.",
  "time": 1207886576,
  "type": "pollopt"
}
{
  "by": "dpapathanasiou",
  "id": 16070,
  "kids": [
    16078
  ],
  "parent": 16069,
  "text": "Dividends don't mean that much: Microsoft in its dominant years (when they had 40%-plus margins and were raking in the cash) never paid a dividend (they did so only recently).",
  "time": 1177355133,
  "type": "comment"
}
How would I get the entire first object (lines 1-9) with jq?
Cannot index object with number
This error message says it all: you can't index objects with numbers. If you want to get the value of the by field, you need to do
jq '.by' file
Wrt
echo 8863.json | jq '.[0]' gives me parse error: Invalid numeric literal at line 2, column 0.
It's expected: echo doesn't read files, it just prints its arguments, so jq receives the literal text 8863.json. Without the -R/--raw-input flag, jq tries to parse that text as JSON; 8863 starts a number and the trailing .json makes it an invalid numeric literal, hence the parse error. (Even read as a JSON string, array indexing wouldn't apply; to get the first character of a string you'd slice it with .[0:1].)
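With -R in place you can see the string interpretation directly (a quick sketch):
echo 8863.json | jq -R '.[0:1]'
"8"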
If your input file consists of several separate entities, to get the first one:
jq -n 'input' file
or,
jq -n 'first(inputs)' file
To get nth (let's say 5th for example):
jq -n 'nth(5; inputs)' file
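As a quick check against the two-object stream shown in the clarification above (assuming it is saved as items.json, a name made up here):
jq -nc 'input' items.json            # emits only the first object (the "pollopt" one)
jq -nc 'nth(1; inputs)' items.json   # emits the second (the "comment" one); nth counts from 0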
a large JSON file will contain many objects, each of which is quite big, and I'd like to view the first complete object, to see which fields exist, what types, how much nesting, etc.
As implied in @OguzIsmail's response, there are important differences between:
- a JSON file (i.e., a file containing exactly one JSON entity);
- a file containing a sequence (i.e., stream) of JSON entities;
- a file containing an array of JSON entities.
In the first two cases, you can write jq -n input to select the first entity, and in the case of an array of entities, jq .[0] will suffice.
(In JSON-speak, a "JSON object" is a kind of dictionary, and is not to be confused with JSON entities in general.)
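A sketch of both situations:
jq -nc 'input' <<< '{"a": 1} {"b": 2}'    # a stream of entities: take the first
{"a":1}
jq -c '.[0]' <<< '[{"a": 1}, {"b": 2}]'   # an array of entities: index into it
{"a":1}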
If you have a bunch of JSON objects (whether as a stream or array or whatever), just looking at the first often doesn't really give an accurate picture of all of them. For getting a bird's-eye view of a bunch of objects, using a "schema inference engine" is often the way to go. For this purpose, you might like to consider my schema.jq schema inference engine. It's usually very simple to use, but of course how you use it will depend on whether you have a stream or an array of JSON entities. For basic details, see https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed; for related topics (e.g. verification), see the entry for JESS at https://github.com/stedolan/jq/wiki/Modules
Please note that schema.jq infers a structural schema that mirrors the entities under consideration. Such structural schemas have little in common with JSON Schema schemas, which you might also like to consider.

JSON - rewrite string to numbers without quotes

Hello,
How can I rewrite this:
"Id": "3",
to this:
"Id": 3,
I have a long file with string records.
I tried using IntelliJ's find-and-replace with the regex "\d+", but $0 returns the complete match, including the quotes.
You need to take a look at how regex groups work.
The $0 will always represent the entire match. To get a subsection of it (the number, in your case) you need to put parentheses around the relevant portions to create a capturing group; then you can reference each group by a 1-based index.
So in your case, a pattern of "(\d+?)" on your sample string would return "3" for $0 (the entire match), and 3 for $1 (the first capturing group).
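So a find-and-replace along these lines should do it (a sketch; enable the Regex option in IntelliJ's replace dialog, and note the key name Id is simply taken from your sample):
Search:  ("Id": )"(\d+)"
Replace: $1$2
This turns "Id": "3", into "Id": 3, and anchoring on the key name keeps all-digit strings elsewhere in the file untouched.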

How to use the value of a variable as the name of an array in R

I am starting my Rscript with commandArgs(TRUE) and variable <- args[1].
variable holds the name of a column of my MySQL database. I select the column dynamically and query with rohDaten <- dbGetQuery(con, sql).
The result is an array. I want to do this:
rohDaten$XXX[rohDaten$XXX==NULL] <- NA, where XXX is the value of variable.
How can I set XXX to the value of variable? I tried many things, like variations of rohDaten$get(variable).
Instead of calling
rohDaten$XXX
try
rohDaten[variable]
This will translate to whatever your variable is, e.g.
rohDaten["columnname"]
This should work:
selected_col <- which(colnames(rohDaten) == variable)
rohDaten[,selected_col][rohDaten[, selected_col] == NULL] <- NA
There are a number of ways of subsetting a data.frame. The $ operator gets or sets a column as its underlying type but can only be used with literal column names, not column names in a variable. The [[ operator does the same as $ but takes a character vector (of length 1) as its argument. So these are all equivalent:
my_data$potatoes
my_data[["potatoes"]]
variable <- "potatoes"; my_data[[variable]]
The [ operator behaves differently depending on whether there are 1 or 2 arguments. With a single argument, it gets or sets a data.frame with the requested columns. This is important for repeated subsetting as you're doing:
my_data["potatoes"][my_data$id == 4]
This will select the column of my_data as a data.frame and then try to select columns from it again using a logical vector. This will fail unless there is only one row in my_data, and even then it won't be the desired result.
With 2 arguments, you can select rows, columns or both. Unless drop=FALSE is provided, the result will be a vector if only one column is requested.
my_data[my_data$id == 4, "potatoes"]
# only elements of my_data$potatoes where my_data$id is 4
my_data[, "potatoes"]
# entirely equivalent to `my_data$potatoes` or `my_data[["potatoes"]]`
For your original question, the neatest way of doing this is:
rohDaten[rohDaten[, variable]==NULL, variable] <- NA
However, this in itself raises another problem. An element of a vector cannot be NULL, and testing null would be done with is.null anyway. Can you add to your question the output of dput(rohDaten[, variable])?
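In the meantime, here is a minimal sketch of the [[-based pattern, assuming (purely for illustration) that the "missing" entries come back as empty strings rather than NULL:
# stand-in for the query result; the column name arrives in a variable
rohDaten <- data.frame(col_a = c("x", "", "y"), stringsAsFactors = FALSE)
variable <- "col_a"
# [[ accepts a name held in a variable, unlike $
rohDaten[[variable]][rohDaten[[variable]] == ""] <- NA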