How to skip first n objects in jq input

I have a VERY large stream of objects, which I am trying to import into MongoDB. I keep getting a broken pipe after about 10k objects, so I would like to be able to update my import script to skip the already imported objects and begin with the first one that was missed.
It seems to me that the tool for this would be jq. What I need is a way to skip (yield empty) all items before the nth, and then output the rest as-is.
I've tried using foreach to maintain an object counter, but I keep ending up with 1 as the value of the counter for all objects in my small test sample (using a bash here document):
$ jq 'foreach . as $item (0; (.+1); [ . , if . < 2 then empty else $item end ])' <<"end"
> { "item": "first" }
> { "item": "second" }
> { "item": "third" }
> { "item": "fourth" }
> end
The output from this is:
[
  1
]
[
  1
]
[
  1
]
[
  1
]
Any suggestions would be most welcome.

def skip(n; stream):
  foreach stream as $s (0; .+1; select(. > n) | $s);
Example:
skip(1000; inputs)
(When using inputs and/or input, don't forget you'll probably want to use the -n command-line option. Without it, jq consumes the first input implicitly and re-runs the whole filter, including any foreach state, from scratch for each top-level input, which is why the counter in your attempt never got past 1.)
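Putting this together, a complete invocation might look like the following sketch (objects.json and the count 10000 are placeholders for your actual file and the number of already-imported objects):

jq -n '
  def skip(n; stream):
    foreach stream as $s (0; .+1; select(. > n) | $s);
  skip(10000; inputs)
' objects.json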
Sledgehammer Approach
try (range(0; 1000) | input | empty), inputs
In this case, the try is necessary to avoid an error should there be fewer than the requested number of items.
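A complete command line for this variant would be along the same lines (again with a placeholder filename and count):

jq -n 'try (range(0; 10000) | input | empty), inputs' objects.json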

filter keys in JSON using jq

I have a complex nested JSON:
{
  ...
  "key1": {
    "key2": [
      { ...
        "base_score": 4.5
      }
    ],
    "key3": {
      "key4": [
        { ...
          "base_score": 0.5
          ...
        }
      ]
    }
    ...
  }
}
There may be multiple "base_score" keys in the JSON (their paths are unknown), and each corresponding value will be a number. I have to check whether at least one such value is greater than some known value, say 7.0, and if there is, I have to do "exit 1". I have to write this query in a shell script.
Assuming the input is valid JSON in a file named input.json, then based on my understanding of the requirements, you could go with:
jq --argjson limit 7.0 '
  any(.. | select(type=="object" and (.base_score|type=="number")) | .base_score; . > $limit)
  | halt_error(if . then 1 else 0 end)
' input.json
You can modify the argument to halt_error to set the exit code as you wish.
Note that halt_error redirects its input to stderr, so you might want to append 2> /dev/null (or the equivalent expression appropriate for your shell) to the above invocation.
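In the surrounding shell script, the exit status can then drive the required "exit 1" directly; a minimal sketch, assuming the filter above has been saved in a (hypothetical) file named check.jq:

# exits the script with status 1 only if some base_score exceeds the limit
jq --argjson limit 7.0 -f check.jq input.json 2>/dev/null || exit 1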
You can easily get a stream of base_score values at any level and use that with any:
any(..|.base_score?; . > 7)
The stream will contain null values for objects without the property, but null is not greater than any number, so that shouldn't be a stopper.
You could then compare the output or specify -e/--exit-status to be used with a condition directly:
jq -e 'any(..|.base_score?; . > 7)' complexnestedfile.json >/dev/null && exit 1

Using JQ to merge two JSON snippets from one file

I've got output from a script that outputs two structurally identical JSON snippets into one file:
{
  "Objects": [
    {
      "Key": "somevalue",
      "VersionId": "someversion"
    }
  ],
  "Quiet": false
}
{
  "Objects": [
    {
      "Key": "someothervalue",
      "VersionId": "someotherversion"
    }
  ],
  "Quiet": false
}
I would like to pass this output through JQ to get one Objects[] list, concatenating all of the objects within the two lists, while keeping the same overall structure. I can accomplish it by piping between two separate JQ commands:
jq '.Objects[]' inputfile | jq -s '{"Objects":., "Quiet":false}' -
But I'm wondering if there is a more elegant way to do so using only one invocation of JQ.
I'm currently using JQ version 1.5 but can update if needed.
You don't need to invoke JQ twice here. The second object can be fetched using the input builtin.
.Objects += input.Objects
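A complete invocation using the input file from the question would be:

jq '.Objects += input.Objects' inputfile

Here . is bound to the first object, and input fetches the next one from the same stream, so no slurping is needed.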
You can use reduce:
jq -s 'reduce .[] as $item ({ Quiet: false }; .Objects += $item.Objects)'
As @oguz-ismail suggested in a comment, the -s (slurp) flag can be removed by using inputs to get the rest of the entries after the first one:
jq 'reduce inputs as $item (.; .Objects += $item.Objects)'
Both versions work with any number of entries in the input (though the second version requires at least one).

Replacing a value in a JSON structure with randomly-chosen strings from a list using jq

There are lots of similar questions but none for dynamically joining 2 files.
What I'm trying to do is to dynamically edit the following structure:
{
  "features": [
    {
      "type": "Feature",
      "properties": {
        "name": "0",
        "height": 0.7
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "1",
        "height": 0
      }
    }
  ]
}
I want to replace only the one field .features[].properties.name with a random value from a 1d-array inside another txt file. There are 8,000 features and around 100 names I've prepared.
This is what I've got now; it fails with errors:
#!/bin/bash
declare -a names=("name1" "name2" "name3")
jq '{
  "features": [
    "type": "Feature",
    "properties": {
      "name": `$names[seq 0 100]`,
      "height": .features[].properties.height
    },
    .features[].geometry
  ]
}' < areas.json
Is it even possible to do in a single command, or should I use Python or JS for such tasks?
Your document (https://echarts.baidu.com/examples/data-gl/asset/data/buildings.json) is actually small enough that we don't need to do any crazy memory-conservation tricks to make it work; the following functions as-is:
# create sample data
[[ -e words.txt ]] || printf '%s\n' 'First Word' 'Second Word' 'Third Word' >words.txt

# actually run the replacements
jq -n --slurpfile buildings buildings.json '
  # define a jq function that changes the current property name with the next input
  def replaceName: (.properties.name |= input);
  # now, for each document in buildings.json, replace each name it contains
  $buildings[] | (.features |= map(replaceName))
' < <(shuf -r words.txt | jq -R .)
This works because shuf -r words.txt creates an unending stream of words randomly chosen from words.txt, and the jq -R . inside the process substitution quotes those as strings. (Because we only call input once per item in buildings.json, we don't try to keep running after that file's contents have been completely consumed).
For the tiny two-record document given in the question, the output looks like:
{
  "features": [
    {
      "type": "Feature",
      "properties": {
        "name": "Third Word",
        "height": 0.7
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "Second Word",
        "height": 0
      }
    }
  ]
}
...with the actual words varying each run; it's similarly been smoketested with the full externally-hosted file.
Here's a solution to the problem of choosing the names randomly with replacement, using the very simple PRNG written in jq copied from https://rosettacode.org/wiki/Random_numbers#jq.
Invocation:
jq --argjson names '["name1","name2","name3","name4"]' \
-f areas.jq areas.json
areas.jq
# The random numbers are in [0 -- 32767] inclusive.
# Input: an array of length at least 2 interpreted as [count, state, ...]
# Output: [count+1, newstate, r] where r is the next pseudo-random number.
def next_rand_Microsoft:
  .[0] as $count | .[1] as $state
  | ((214013 * $state) + 2531011) % 2147483648   # mod 2^31
  | [$count+1, ., (. / 65536 | floor)];

# generate a stream of random integers < $n
def randoms($n):
  def r: next_rand_Microsoft | (.[2] % $n), r;
  [0,11] | r;

. as $in
| ($names|length) as $count
| (.features|length) as $n
| [limit($n; randoms($count))] as $randoms
| reduce range(0; $n) as $i (.;
    .features[$i].properties.name = $names[$randoms[$i]])
Assuming your areas.json is valid JSON, then I believe the following would come close to accomplishing your intended edit:
names='["name1","name2","name3","name4"]'
jq --argjson names "$names" '.features[].properties.name = $names' < areas.json
However, given your proposed solution, it's not clear to me what you mean by a "random value from a 1d-array". If you mean that the index should be randomly chosen (as by a PRNG), then I would suggest computing it using your favorite PRNG and passing in that random value as another argument to jq, as illustrated in the following section.
So the question becomes how to transform the text
['name1','name2','name3','name4']
into a valid JSON array. There are numerous ways this can be done, whether using jq or not, but I believe that is best left as a separate question or as an exercise, because the selection of the method will probably depend on specific details which are not mentioned in this Q. Personally, I'd use sed if possible; you might also consider using hjson, as also illustrated in the following section.
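For instance, with sed, a rough sketch (assuming the names themselves contain no quotes or other characters that would need escaping) might be:

names=$(sed "s/'/\"/g" <<< "['name1','name2','name3','name4']")

after which "$names" can be passed to jq via --argjson as above.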
Illustration using hjson and awk
hjson -j <<< "['name1','name2','name3','name4']" > names.json.tmp

function randint {
  awk -v n="$(jq length names.json.tmp)" '
    function randint(n) {return int(n * rand())}
    BEGIN {srand(); print randint(n)}'
}

jq --argfile names names.json.tmp --argjson n $(randint) '
  .features[].properties.name = $names[$n]
' < areas.json
Addendum
Currently, jq does not have a builtin PRNG, but if you want to use jq and if you want a value from the "names" array to be chosen at random (with replacement?) for each occurrence of the .name field, then one option would be to pre-compute an array of the randomly selected names (an array of length features | length) using your favorite PRNG, and passing that array into jq:
jq --argjson randomnames "$randomnames" '
  reduce range(0; .features|length) as $i (.;
    .features[$i].properties.name = $randomnames[$i])
' < areas.json
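The pre-computed array itself could be built in the shell; here is one rough sketch, assuming GNU shuf is available and using name1 through name4 as the candidate names:

# one randomly-chosen name (with replacement) per feature, as a JSON array
n=$(jq '.features|length' areas.json)
randomnames=$(shuf -r -n "$n" -e name1 name2 name3 name4 | jq -R . | jq -s .)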
Another option would be to use a PRNG written in jq, as illustrated elsewhere on this page.

Need to get all key value pairs from a JSON containing a specific character '/'

I have a specific JSON content for which I need to get all keys whose values contain the character /.
JSON
{ "dig": "sha256:d2aae00e4bc6424d8a6ae7639d41cfff8c5aa56fc6f573e64552a62f35b6293e",
"name": "example",
"binding": {
"wf.example.input1": "/path/to/file1",
"wf.example.input2": "hello",
"wf.example.input3":
["/path/to/file3",
"/path/to/file4"],
"wf.example.input4": 44
}
}
I know I can get all the keys containing a file path or an array of file paths using the query jq 'paths(type == "string" and contains("/"))'. This would give me output like:
[ "binding", "wf.example.input1" ]
[ "binding", "wf.example.input3", 0]
[ "binding", "wf.example.input3", 1 ]
Now that I have all the elements that contain file paths as their values, is there a way to fetch both key and value for each match and then store them as another JSON? For example, for the JSON mentioned in this question, I need to get the output as another JSON containing all the matched paths. My output JSON should look something like the one below.
{ "binding":
{ "wf.example.input1": "/path/to/file1",
"wf.example.input3": [ "/path/to/file3", "/path/to/file4" ]
}
}
The following jq filter will produce the desired output if given input that is very similar to the example, but it is far from robust and glosses over some details that are unclear from the problem description. However, it should be easy enough to modify the filter in accordance with more precise specifications:
. as $in
| reduce paths(type == "string" and test("/")) as $path ({};
    ($in | getpath($path)) as $x
    | if ($path[-1]|type) == "string"
      then .[$path[-1]] = $x
      else .[$path[-2]|tostring] += [$x]
      end)
| {binding: .}
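Assuming the filter is saved in a (hypothetically named) file extract.jq, the invocation is simply:

jq -f extract.jq input.json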
Output:
{
  "binding": {
    "wf.example.input1": "/path/to/file1",
    "wf.example.input3": [
      "/path/to/file3",
      "/path/to/file4"
    ]
  }
}

Process huge GeoJSON file with jq

Given a GeoJSON file as follows:-
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "FEATCODE": 15014
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          .....
I want to end up with the following:-
{
  "type": "FeatureCollection",
  "features": [
    {
      "tippecanoe": {"minzoom": 13},
      "type": "Feature",
      "properties": {
        "FEATCODE": 15014
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          .....
i.e. I have added the tippecanoe object to each feature in the features array.
I can make this work with:-
jq '.features[].tippecanoe.minzoom = 13' <GEOJSON FILE> > <OUTPUT FILE>
Which is fine for small files. But processing a large file of 414MB seems to take forever, with the processor maxing out and nothing being written to the OUTPUT FILE.
Reading further into jq, it appears that the --stream command-line option may help, but I am completely confused as to how to use it for my purposes.
I would be grateful for an example command line that serves my purposes along with an explanation as to what --stream is doing.
A one-pass jq-only approach may require more RAM than is available. If that is the case, then a simple all-jq approach is shown below, together with a more economical approach based on using jq along with awk.
The two approaches are the same except for the reconstitution of the stream of objects into a single JSON document. This step can be accomplished very economically using awk.
In both cases, the large JSON input file with objects of the required form is assumed to be named input.json.
jq-only
jq -c '.features[]' input.json |
jq -c '.tippecanoe.minzoom = 13' |
jq -c -s '{type: "FeatureCollection", features: .}'
jq and awk
jq -c '.features[]' input.json |
jq -c '.tippecanoe.minzoom = 13' | awk '
BEGIN {print "{\"type\": \"FeatureCollection\", \"features\": ["; }
NR==1 { print; next }
{print ","; print}
END {print "] }";}'
Performance comparison
For comparison, an input file with 10,000,000 objects in .features[] was used. Its size is about 1GB.
u+s (user + system CPU time):
jq-only: 15m 15s
jq-awk: 7m 40s
jq one-pass using map: 6m 53s
An alternative solution could be for example:
jq '.features |= map_values(.tippecanoe.minzoom = 13)'
To test this, I created a sample JSON as
d = {'features': [{"type":"Feature", "properties":{"FEATCODE": 15014}} for i in range(0,N)]}
and inspected the execution time as a function of N. Interestingly, while the map_values approach seems to have linear complexity in N, .features[].tippecanoe.minzoom = 13 exhibits quadratic behavior (already for N=50000, the former method finishes in about 0.8 seconds, while the latter needs around 47 seconds).
Alternatively, one might just do it manually with, e.g., Python:
import json
import sys

data = {}
with open(sys.argv[1], 'r') as F:
    data = json.load(F)

extra_item = {"minzoom": 13}

for feature in data['features']:
    feature["tippecanoe"] = extra_item

with open(sys.argv[2], 'w') as F:
    F.write(json.dumps(data))
In this case, map rather than map_values is far faster (*):
.features |= map(.tippecanoe.minzoom = 13)
However, using this approach will still require enough RAM.
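For completeness, the corresponding full command (a sketch with placeholder filenames) would be:

jq '.features |= map(.tippecanoe.minzoom = 13)' input.json > output.json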
p.s. If you want to use jq to generate a large file for timing, consider:
def N: 1000000;
def data:
  {"features": [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}]};
(*) Using map, 20s for 100MB, and approximately linear.
Here, based on the work of @nicowilliams at GitHub, is a solution that uses the streaming parser available with jq. The solution is very economical with memory, but is currently quite slow if the input is large.
The solution has two parts: a function for injecting the update into the stream produced using the --stream command-line option; and a function for converting the stream back to JSON in the original form.
Invocation:
jq -cnr --stream -f program.jq input.json
program.jq
# inject the given object into the stream produced from "inputs" with the --stream option
def inject(object):
  [object|tostream] as $object
  | 2
  | truncate_stream(inputs)
  | if (.[0]|length == 1) and length == 1
    then $object[]
    else .
    end;

# Input: the object to be added
# Output: text
def output:
  . as $object
  | ("[",
     foreach fromstream( inject($object) ) as $o
       (0;
        if . == 0 then 1 else 2 end;
        if . == 1 then $o else ",", $o end),
     "]");

{}
| .tippecanoe.minzoom = 13
| output
Generation of test data
def data(N):
  {"features":
    [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}]};
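To produce an input.json for the invocation above, the def can be called with the desired N (here 2, matching the example output below):

jq -nc '
  def data(N):
    {"features": [range(0;N) | {"type":"Feature", "properties": {"FEATCODE": 15014}}]};
  data(2)
' > input.json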
Example output
With N=2:
[
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
,
{"type":"Feature","properties":{"FEATCODE":15014},"tippecanoe":{"minzoom":13}}
]