Edit: I used the solution provided by @peak to do the following:
$ jq -r --argjson whitelist '["role1", "role2"]' '
select(has("roles") and any(.roles[]; . == "role1" or . == "role2"))
| (reduce ."roles"[] as $r ({}; .[$r]=true)) as $roles
| [.email, .username, .given_name, .family_name, ($roles[$whitelist[]]
| . != null)]
| @csv
' users.json
I added the select() to filter out users who haven't onboarded yet and so don't have any roles, and to ensure that the users included in the output have at least one of the target roles.
Scenario: user profiles as JSON docs, where each profile has a list object with their assigned roles. Example:
{
"username": "janedoe",
"roles": [
"role1",
"role4",
"role5"
]
}
The actual data file is an ndjson file, one user object as above per line.
I am only interested in specific roles, say role1, role3, and role4. I want to produce a CSV formatted as:
username,role1?,role3?,role4?
e.g.,
janedoe,true,false,true
The part I haven't figured out is how to output booleans or Y / N based on the values in the list object. Is this something I can do in jq itself?
With your input, the invocation:
jq -r --argjson whitelist '["role1", "role3", "role4"]' '
(["username"] + $whitelist),
[.username, ($whitelist[] as $w | .roles | index([$w]) != null)]
| @csv
'
produces:
"username","role1","role3","role4"
"janedoe",true,false,true
Notes:
The second last line of the jq filter above could be shortened to:
[.username, (.roles | index($whitelist[]) != null)]
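If you'd rather have Y / N than booleans, as mentioned in the question, one possible variation would be the following (a sketch; it relies on jq treating 0 as truthy, so a match at index 0 still counts):
jq -r --argjson whitelist '["role1", "role3", "role4"]' '
  (["username"] + $whitelist),
  [.username, (.roles | if index($whitelist[]) then "Y" else "N" end)]
  | @csv
'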
Presumably, if there were more than one user, you'd only want the header row once, in which case the above solution would need to be tweaked (see "With ndjson as input" below).
Using IN/1
Because index/1 is not as efficient as it might be,
you might like to consider this alternative:
(["username"] + $whitelist),
(.roles as $roles | [.username, ($whitelist[] | IN($roles[]) )])
| @csv
Using a JSON dictionary
If the number of roles was very large, then it would probably be more
efficient to construct a JSON dictionary to avoid repeated linear lookups:
(reduce .roles[] as $r ({}; .[$r]=true)) as $roles
| (["username"] + $whitelist),
[.username, ($roles[$whitelist[]] != null)]
| @csv
With ndjson as input
For efficiency, and to ensure there's just one header, you could use inputs with the -n command-line option. Adding the extra fields mentioned in the revised Q, you might end up with:
jq -nr --argjson whitelist '["role1", "role2"]' '
["email", "username", "given_name", "family_name"] as $greenlist
| ($greenlist + $whitelist),
(inputs
| select(has("roles") and any(.roles[]; IN($whitelist[])))
| (reduce ."roles"[] as $r ({}; .[$r]=true)) as $roles
| [ .[$greenlist[]], ($roles[$whitelist[]] != null) ])
| @csv
' users.json
I want to use jq (or anything else, if jq is the wrong tool) to convert a JSON object like this:
{
"https://github.com": {
"user-one": {
"repository-one": "version-one",
"repository-two": "version-two"
},
"user-two": {
"repository-three": "version-three",
"repository-four": "version-four"
}
},
"https://gitlab.com": {
"user-three": {
"repository-five": "version-five",
"repository-six": "version-six"
},
"user-four": {
"repository-seven": "version-seven",
"repository-eight": "version-eight"
}
}
}
recursively into a bash string array like this:
(
"https://github.com/user-one/repository-one/archive/refs/heads/version-one.tar.gz"
"https://github.com/user-one/repository-two/archive/refs/heads/version-two.tar.gz"
"https://github.com/user-two/repository-three/archive/refs/heads/version-three.tar.gz"
"https://github.com/user-two/repository-four/archive/refs/heads/version-four.tar.gz"
"https://gitlab.com/user-three/repository-five/-/archive/version-five/repository-five-version-five.tar.gz"
"https://gitlab.com/user-three/repository-six/-/archive/version-six/repository-six-version-six.tar.gz"
"https://gitlab.com/user-four/repository-seven/-/archive/version-seven/repository-seven-version-seven.tar.gz"
"https://gitlab.com/user-four/repository-eight/-/archive/version-eight/repository-eight-version-eight.tar.gz"
)
for subsequent use in a loop.
for i in "${arr[@]}"
do
echo "$i"
done
I have no idea how to do that.
As you can see, the values must be handled differently depending on the object name.
"https://github.com" + "/" + $user_name + "/" + $repository_name + "/archive/refs/heads/" + $version + ".tar.gz"
"https://gitlab.com" + "/" + $user_name + "/" + $repository_name + "/-/archive/" + $version + "/" + $repository_name + "-" + $version + ".tar.gz"
Could anyone help?
Easily done.
First, let's focus on the jq code alone:
to_entries[] # split items into keys and values
| .key as $site # store first key in $site
| .value # code below deals with the value
| to_entries[] # split that value into keys and values
| .key as $user # store the key in $user
| .value # code below deals with the value
| to_entries[] # split that value into keys and values
| .key as $repository_name # store the key in $repository_name
| .value as $version # store the value in $version
| if $site == "https://github.com" then
"\($site)/\($user)/\($repository_name)/archive/refs/heads/\($version).tar.gz"
else
"\($site)/\($user)/\($repository_name)/-/archive/\($version)/\($repository_name)-\($version).tar.gz"
end
That generates a list of lines. Reading lines into a bash array looks like readarray -t arrayname < ...datasource...
Thus, using a process substitution to redirect jq's stdout as if it were a file:
readarray -t uris < <(jq -r '
to_entries[]
| .key as $site
| .value
| to_entries[]
| .key as $user
| .value
| to_entries[]
| .key as $repository_name
| .value as $version
| if $site == "https://github.com" then
"\($site)/\($user)/\($repository_name)/archive/refs/heads/\($version).tar.gz"
else
"\($site)/\($user)/\($repository_name)/-/archive/\($version)/\($repository_name)-\($version).tar.gz"
end
' <config.json
)
The basic task of generating the strings can be done efficiently and generically (i.e., without any limits on the depths of the basenames) using the jq filter:
paths(strings) as $p | $p + [getpath($p)] | join("/")
There are several ways to populate a bash array accordingly, but if you merely wish to iterate through the values, you could use a bash while loop, like so:
< input.json jq -r '
paths(strings) as $p | $p + [getpath($p)] | join("/")' |
while read -r line ; do
echo "$line"
done
You might also wish to consider using jq's @sh or @uri filter. For a jq urlencode function, see e.g.
https://rosettacode.org/wiki/URL_encoding#jq
(If the strings contain newlines or tabs, then the above would need to be tweaked accordingly.)
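For instance, here is a quick illustration of @uri (percent-encoding) on a made-up string; @sh would instead quote each value for safe interpolation into a shell command:
$ jq -rn '"user two/repo" | @uri'
user%20two%2Frepo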
I have the following JSON:
{
"transmitterId": "30451155eda2",
"rssiSignature": [
{
"receiverId": "001bc509408201d5",
"receiverIdType": 1,
"rssi": -52,
"numberOfDecodings": 5,
"rssiSum": -52
},
{
"receiverId": "001bc50940820228",
"receiverIdType": 1,
"rssi": -85,
"numberOfDecodings": 5,
"rssiSum": -85
}
],
"timestamp": 1574228579837
}
I want to convert it to CSV format, where each row corresponds to an entry in rssiSignature (I have added the header row for visualization purposes):
timestamp,transmitterId,receiverId,rssi
1574228579837,"30451155eda2","001bc509408201d5",-52
1574228579837,"30451155eda2","001bc50940820228",-85
My current attempt is the following, but I get a single CSV row:
$ jq -r '[.timestamp, .transmitterId, .rssiSignature[].receiverId, .rssiSignature[].rssi] | @csv' test.jsonl
1574228579837,"30451155eda2","001bc509408201d5","001bc50940820228",-52,-85
How can I use jq to generate different rows for each entry of the rssiSignature array?
In order to reuse a value from the upper level, like the timestamp, for every item of the rssiSignature array, you can define it as a variable. You can get your CSV like this:
jq -r '.timestamp as $t | .transmitterId as $tid |
.rssiSignature[] | [ $t, $tid, .receiverId, .rssi] | @csv
' file.json
Output:
1574228579837,"30451155eda2","001bc509408201d5",-52
1574228579837,"30451155eda2","001bc50940820228",-85
Also, here is a way to print headers to an output file in bash, independent of what commands we call, using command grouping.
(
printf "timestamp,transmitterId,receiverId,rssi\n"
jq -r '.timestamp as $t | .transmitterId as $tid |
.rssiSignature[] | [ $t, $tid, .receiverId, .rssi] | @csv
' file.json
) > output.csv
Actually, the task can be accomplished without the use of any variables; one can also coax jq to include a header:
jq -r '
["timestamp","transmitterId","receiverId","rssi"],
[.timestamp, .transmitterId] + (.rssiSignature[] | [.receiverId,.rssi])
| @csv'
A single header with multiple files
One way to produce a single header with multiple input files would be to use inputs in conjunction with the -n command-line option. This happens also to be efficient:
jq -nr '
["timestamp","transmitterId","receiverId","rssi"],
(inputs |
[.timestamp, .transmitterId] + (.rssiSignature[] | [.receiverId,.rssi]))
| @csv'
I am trying to join() a relatively big array (20k elements) of objects with a character ('\n' in this particular case). I have a few operations upfront which run in about 8 seconds (acceptable), but when I add '| join("\n")' at the end, the runtime jumps to 3+ minutes.
Is there any reason for join() to be that slow? Is there another way of getting the same output without join()?
I am currently using jq-1.5 (latest stable).
Here is the JQ file
json2csv.jq
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers |
[(
$headers | join("\t")
), (
[ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | join("\n")
)] | join("\n")
;
json2csv
Considering:
$ jq 'length' test.json
23717
With the script as I want it (shown above):
$ time jq -rf json2csv.jq test.json > test.csv
real 3m46.721s
user 1m48.660s
sys 1m57.698s
With the same script, removing the join("\n")
$ time jq -rf json2csv.jq test.json > test.csv
real 0m8.564s
user 0m8.301s
sys 0m0.242s
(Note: I also removed the second join("\n"), because otherwise jq cannot join an array with a string, which makes sense; but that join operates on an array of only 2 elements anyway, so it isn't the problem.)
You don't need to use join at all. Rather than thinking of converting the whole file to a single string, think of it as converting each row to strings. The way jq outputs streams of results will give you the desired result in the end (assuming you take the raw output).
Try something more like this:
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers
# output headers followed by rows of values as arrays
| (
$headers
),
(
.[] | [ .[$headers[]] | tostring | tonull ]
)
# convert the arrays to tab separated values strings
| @tsv
;
After thinking about it, I remembered that jq automatically prints a newline ('\n') after each result when you iterate over an array (.[]), which means that in this particular case I can just do this:
def json2csv:
def tonull: if . == "null" then null else . end;
(.[0] | keys) as $headers |
[(
$headers | join("\t")
), (
[ .[] as $row | [ $headers[] as $h | $row[$h] | tostring | tonull ] | join("\t") ] | .[]
)] | .[]
;
json2csv
And this solved my problem:
$ time jq -rf json2csv.jq test.json > test.csv
real 0m6.725s
user 0m6.454s
sys 0m0.245s
I'm leaving the question up because, had I wanted to use any character other than '\n', this wouldn't have solved the issue.
When producing output such as CSV or TSV, the idea is to stream the data as much as possible. The last thing you want to do is run join on an array containing all the data. If you did want to use a delimiter other than \n, you'd add it to each item in the stream, and then use the -j command-line option.
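For instance, here is a minimal sketch of that approach with made-up rows; -j suppresses the newline jq would otherwise print after each output, so the appended delimiter is the only separator:
$ jq -nrj '["row1","row2","row3"][] | ., "|"'
row1|row2|row3|
(This leaves a trailing delimiter; trim it in the shell if that matters.)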
Also, I think your diagnosis is probably not quite right as joining an array with a large number of small strings is quite fast. Below are timings comparing joining an array with two strings and one with 100,000 strings. In case you're wondering, my machine is rather slow.
$ ./join.sh 2
3
real 0.03
user 0.02
sys 0.00
1896448 maximum resident set size
$ ./join.sh 100000
588889
real 2.20
user 2.05
sys 0.13
21188608 maximum resident set size
$ cat join.sh
#!/bin/bash
/usr/bin/time -lp jq -n --argjson n "$1" '[range(0;$n)|tostring]|join(".")|length'
The above runs used jq 1.6, but using jq 1.5 produces very similar results.
On the other hand, joining a large number (20,000) of very long strings (1K) is noticeably slow, so evidently the current jq implementation is not designed for such operations.
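To reproduce that case, a hedged variant of join.sh above, using jq's string-repetition operator to build strings of about 1K each:
/usr/bin/time -lp jq -n --argjson n "$1" '[range(0;$n) | "x" * 1024] | join(".") | length'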
I need to convert JSON to CSV where JSON has arrays of variable length, for example:
JSON objects:
{"labels": ["label1"]}
{"labels": ["label2", "label3"]}
{"labels": ["label1", "label4", "label5"]}
Resulting CSV:
labels,labels,labels
"label1",,
"label2","label3",
"label1","label4","label5"
There are many other properties in the source JSON; this is just an excerpt for the sake of simplicity.
Also, I should say the process has to work with JSON as a stream, because the source JSON could be very large (>1GB).
I wanted to use jq with two passes, the first pass would collect the maximum length of the 'labels' array, the second pass would create CSV as the number of the resulting columns is known by this time. But jq doesn't have a concept of global variables, so I don't know where I can store the running total.
I'd like to be able to do that on Windows via CLI.
Thank you in advance.
The question shows a stream of JSON objects, so the following solutions assume that the input file is already a sequence as shown. These solutions can also easily be adapted to cover the case where the input file contains a huge array of objects, e.g. as discussed in the epilog.
A two-invocation solution
Here's a two-pass solution using two invocations of jq. The presentation assumes a bash-like environment (in case you have WSL):
n=$(jq -n 'reduce (inputs|.labels|length) as $i (-1;
if $i > . then $i else . end)' stream.json)
jq -nr --argjson n $n '
def fill($n): . + [range(length;$n)|null];
[range(0;$n)|"labels"],
(inputs | .labels | fill($n))
| @csv' stream.json
Assuming the input is as described, this is guaranteed to produce valid CSV. Hopefully you can adapt the above to your shell as necessary -- maybe this link will help:
Assign output of a program to a variable using a MS batch file
Using input_filename and a single invocation of jq
Unfortunately, jq does not have a "rewind" facility, but
there is an alternative: read the file twice within a single invocation of jq. This is more cumbersome than the two-invocation solution above but avoids any difficulties associated with the latter.
cat sample.json | jq -nr '
def fill($n): . + [range(length;$n)|null];
def max($x): if . < $x then $x else . end;
foreach (inputs|.labels) as $in ( {n:0};
if input_filename == "<stdin>"
then .n |= max($in|length)
else .printed+=1
end;
if .printed == null then empty
else .n as $n
| (if .printed == 1 then [range(0;$n)|"labels"] else empty end),
($in | fill($n))
end)
| @csv' - sample.json
Another single-invocation solution
The following solution uses a special value (here null) to delineate the two streams:
(cat stream.json; echo null; cat stream.json) | jq -nr '
def fill($n): . + [range(length; $n) | null];
def max($x): if . < $x then $x else . end;
(label $loop | foreach inputs as $in (0;
if $in == null then . else max($in|.labels|length) end;
if $in == null then ., break $loop else empty end)) as $n
| [range(0;$n)|"labels"],
(inputs | .labels | fill($n))
| @csv '
Epilog
A file with a top-level JSON array that is too large to fit into memory can be converted into a stream of the array's items by invoking jq with the --stream option, e.g. as follows:
jq -cn --stream 'fromstream(1|truncate_stream(inputs))'
For such a large file, you will probably want to do this in two separate invocations: one to get the count, then another to actually output the CSV. You could do it in a single invocation if you read the whole file into memory, but we definitely don't want to do that; we want to stream it in where possible.
Things get a little ugly when it comes to storing the result of a command in a variable; writing to a file might be simpler, but I'd rather not use temp files if we don't have to.
REM assuming in a batch file
for /f "usebackq delims=" %%i in (`jq -n --stream "reduce (inputs | .[0][1] + 1) as $l (0; if $l > . then $l else . end)" input.json`) do set cols=%%i
jq -rn --stream --argjson cols "%cols%" "[range($cols)|\"labels\"],(fromstream(1|truncate_stream(inputs))|[.[],(range($cols-length)|null)])|@csv" input.json
> jq -n --stream "reduce (inputs | .[0][1] + 1) as $l (0; if $l > . then $l else . end)" input.json
For the first invocation, to get the count of columns, we're just taking advantage of the fact that the paths to the array values can be used to determine the lengths of the arrays. We just take the max across all items.
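To see why that works, here is what --stream would emit for a hypothetical one-line input.json containing {"labels":["a","b"]}; the array index in each path (.[0][1]) reveals the element's position (the last two events just close the array and the object):
> jq -cn --stream "inputs" input.json
[["labels",0],"a"]
[["labels",1],"b"]
[["labels",1]]
[["labels"]]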
> jq -rn --stream --argjson cols "%cols%" ^
"[range($cols)|\"labels\"],(fromstream(1|truncate_stream(inputs))|[.[],(range($cols-length)|null)])|#csv" input.json
Then, to output the rest, we take each labels array (assuming it's the only property on the objects) and pad it with null up to the $cols count, then output as CSV.
If the labels are in a different, deeply nested path than what's in your example here, you'll need to select based on the appropriate paths.
set labelspath=foo.bar.labels
jq -rn --stream --argjson cols "%cols%" --arg labelspath "%labelspath%" ^
"($labelspath|split(\".\")|[.,length]) as [$path,$depth] | [range($cols)|\"labels\"],(fromstream($depth|truncate_stream(inputs|select(.[0][:$depth] == $path)))|[.[],(range($cols-length)|null)])|#csv" input.json
I have a stream of JSON arrays like this
[{"id":"AQ","Count":0}]
[{"id":"AR","Count":1},{"id":"AR","Count":3},{"id":"AR","Count":13},
{"id":"AR","Count":12},{"id":"AR","Count":5}]
[{"id":"AS","Count":0}]
I want to use jq to get a new json like this
{"id":"AQ","Count":0}
{"id":"AR","Count":34}
{"id":"AS","Count":0}
34 = 1 + 3 + 13 + 12 + 5, the Counts in the second array.
I don't know how to describe it in more detail, but the basic idea is shown in my example.
I use bash and would prefer to use jq to solve this problem. Thank you!
If you want an efficient but generic solution that does NOT assume each input array has the same ids, then the following helper function makes a solution easy:
# Input: a JSON object representing the subtotals
# Output: the object augmented with additional subtotals
def adder(stream; id; filter):
reduce stream as $s (.; .[$s|id] += ($s|filter));
Assuming your jq has inputs, then the most efficient approach is to use it (but remember to use the -n command-line option):
reduce inputs as $row ({}; adder($row[]; .id; .Count) )
This produces:
{"AQ":0,"AR":34,"AS":0}
From here, it's easy to get the answer you want, e.g. using to_entries[] | {id: .key, Count: .value}
If your jq does not have inputs and if you don't want to upgrade, then use the -s option (instead of -n) and replace inputs by .[]
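Putting the pieces together, a complete invocation might look like this (a sketch, assuming the stream of arrays from the question is in a hypothetical file named arrays.json):
jq -cn '
  def adder(stream; id; filter):
    reduce stream as $s (.; .[$s|id] += ($s|filter));
  reduce inputs as $row ({}; adder($row[]; .id; .Count))
  | to_entries[]
  | {id: .key, Count: .value}
' arrays.json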
Assuming the .id is the same in each array:
first + {Count: map(.Count) | add}
Or perhaps more intelligibly:
(map(.Count) | add) as $sum | first | .Count = $sum
Or more declaratively:
{ id: (first|.id), Count: (map(.Count) | add) }
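For instance (a sketch; without -n, jq applies the filter to each input array independently, and -c keeps each result on one line; arrays.json is again hypothetical):
$ jq -c '{id: (first|.id), Count: (map(.Count) | add)}' arrays.json
{"id":"AQ","Count":0}
{"id":"AR","Count":34}
{"id":"AS","Count":0}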
It's a bit kludgey, but given your input:
jq -c '
reduce .[] as $item ({}; .[($item.id)] += ($item.Count))
| to_entries
| .[] | {"id": .key, "Count": .value}
'
Yields the output:
{"id":"AQ","Count":0}
{"id":"AR","Count":34}
{"id":"AS","Count":0}