Split a JSON file into separate files

I have a large JSON file that is an object of objects, which I would like to split into separate files named after the object keys. Is it possible to achieve this using jq or any other off-the-shelf tools?
The original JSON is in the following format
{ "item1": {...}, "item2": {...}, ...}
Given this input I would like to produce files item1.json, item2.json etc.

This should give you a start:
for f in `cat input.json | jq -r 'keys[]'` ; do
cat input.json | jq ".[\"$f\"]" > "$f.json"
done
Or, if you prefer more bash-like syntax:
for f in $(jq -r 'keys[]' input.json) ; do
jq --arg k "$f" '.[$k]' input.json > "$f.json"
done

Here's a solution that requires only one call to jq:
jq -cr 'keys[] as $k | "\($k)\n\(.[$k])"' input.json |
while read -r key ; do
read -r item
printf "%s\n" "$item" > "/tmp/$key.json"
done
It might be faster to pipe the output of the jq command to awk, e.g.:
jq -cr 'keys[] as $k | "\($k)\t\(.[$k])"' input.json |
awk -F\\t '{ print $2 > ("/tmp/" $1 ".json") }'
Of course, these approaches will need to be modified if the key names contain characters that cannot be used in filenames.
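If jq is not a hard requirement, the caveat about key names can be handled by sanitizing each key before it becomes a filename. Here is a minimal Python sketch (standard library only). The helper names `safe_name` and `split_object` are made up for this example, and the replacement rule (anything other than letters, digits, dot, underscore and hyphen becomes `_`) is just an assumption; adjust it to your filesystem.

```python
import json
import re
from pathlib import Path


def safe_name(key):
    """Replace characters that are risky in filenames with underscores."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", key)
    # Avoid empty names and the special directory entries "." and "..".
    return cleaned if cleaned.strip(".") else "_"


def split_object(input_path, out_dir):
    """Write each top-level value of a JSON object to <out_dir>/<key>.json."""
    with open(input_path, encoding="utf-8") as fh:
        data = json.load(fh)  # loads the whole document into memory
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for key, value in data.items():
        target = out / (safe_name(key) + ".json")
        target.write_text(json.dumps(value), encoding="utf-8")
        written.append(target.name)
    return written
```

Calling `split_object("input.json", "/tmp/parts")` would write one file per key. Note that two different keys can map to the same sanitized name, in which case the later one overwrites the earlier one.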

Is it possible to achieve this using jq or any other off-the-shelf tools?
It is. xidel can also do this very efficiently.
Let's assume 'input.json' :
{
"item1": {
"a": 1
},
"item2": {
"b": 2
},
"item3": {
"c": 3
}
}
Inefficient Bash method:
for f in $(xidel -s input.json -e '$json()'); do
xidel -s input.json -e '$json("'$f'")' > $f.json
done
For every object key another instance of xidel is called to parse the object. Especially when you have a very large JSON this is pretty slow.
Efficient file:write() method:
xidel -s input.json -e '
$json() ! file:write(
.||".json",
$json(.),
{"method":"json"}
)
'
One xidel call creates 'item{1,2,3}.json'. Their content is a compact/minified object, like {"a": 1} for 'item1.json'.
xidel -s input.json -e '
for $x in $json() return
file:write(
concat($x,".json"),
$json($x),
{
"method":"json",
"indent":true()
}
)
'
One xidel call creates 'item{1,2,3}.json'. Their content is a prettified object (because of {"indent":true()}), like...
{
"a": 1
}
...for 'item1.json'. Different query (for-loop), same result.
This method is many times faster!

Related

Create a json from given list of filenames in unix script

Hello, I am trying to write a Unix script/command that lists all filenames in a given directory matching the format string-{number}.txt (e.g. filename-1.txt, filename-2.txt) and builds a JSON array from them. Any pointers would be helpful.
[{
"filenumber": "1",
"name": "filename-1.txt"
},
{
"filenumber": "2",
"name": "filename-2.txt"
}
]
In the JSON above, filenumber should be taken from the {number} part of each filename.
A single call to jq should suffice :
shopt -s extglob
printf "%s\0" *-+([0-9]).txt | \
jq -sR 'split("\u0000") |
map({filenumber:capture(".*-(?<n>.*)\\.txt").n,
name:.})'
Very easy for the command-line tool xidel and its integrated EXPath File Module:
$ xidel -se '
array{
for $x in file:list(.,false(),"*.txt")
return {
"filenumber":extract($x,"(\d+)\.txt",1),
"name":$x
}
}
'
Intuitively, I'd say you can do this with jq. However, in practice I've rarely been able to achieve what I wanted with jq :-)
With some lunch break puzzling, I've come up with this beauty:
ls | jq -R '{filenumber:input_line_number, name:.}' | jq -s .
Instead of ls you could use any other command that produces a newline separated list of strings.
I tried several examples and finally found that this does exactly what I wanted. Thanks.
for file in *.txt; do
file_number=$(echo "$file" | sed 's/^.*-\(.*\)\.txt$/\1/')
jq -n --arg filenumber "$file_number" --arg name "$file" '{filenumber: $filenumber, name: $name}'
done | jq -n '[inputs]'
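For comparison, the same extraction can be sketched in Python with a regular expression. This is only an illustration: `PATTERN` and `describe_files` are names invented for this example, and it assumes the number is the run of digits between the last hyphen and `.txt`.

```python
import re

# Matches names such as "filename-12.txt" and captures the number.
PATTERN = re.compile(r"^.+-(\d+)\.txt$")


def describe_files(names):
    """Return [{"filenumber": ..., "name": ...}] for names matching PATTERN."""
    records = []
    for name in sorted(names):
        match = PATTERN.match(name)
        if match:
            # Keep the number as a string, as in the desired output.
            records.append({"filenumber": match.group(1), "name": name})
    return records
```

Feeding it `os.listdir(".")` and passing the result to `json.dumps` would produce the array from the question. The sort is lexicographic, so filename-10.txt comes before filename-2.txt.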

jq - iterate through dictionaries

My json knowledge is shaky, so pardon me if I use the wrong terminology.
I have input.json, which can be simplified down to this:
[
{
"foo1": "bar1",
"baz1": "fizz1"
},
{
"foo2": "bar2",
"baz2": "fizz2"
}
]
I want to iterate through each object via a loop, so I'm essentially hoping to tackle just the 1's first, then loop through the 2's, etc.
I thought it was something like:
jq 'keys[]' input.json | while read key ; do
echo "loop --$(jq "[$key]" input.json)"
done
but that's giving me
loop 0
loop 1
where I would expect to see (spacing here is optional, not sure how jq would parse it):
loop { "foo1": "bar1", "baz1": "fizz1" }
loop { "foo2": "bar2", "baz2": "fizz2" }
What am I missing?
No need to use bash, you can do this in jq itself:
jq -r 'keys[] as $k | "loop: \(.[$k])"' file.json
loop: {"foo1":"bar1","baz1":"fizz1"}
loop: {"foo2":"bar2","baz2":"fizz2"}
What about using the -c option:
$ jq -c '.[]' file | sed 's/^/loop /'
loop {"foo1":"bar1","baz1":"fizz1"}
loop {"foo2":"bar2","baz2":"fizz2"}
Assuming response is a variable containing your data :
echo "$response" | jq --raw-output '.[] | "loop " + tostring'
loop {"foo1":"bar1","baz1":"fizz1"}
loop {"foo2":"bar2","baz2":"fizz2"}
Hope it helps!

jq streaming - filter nested list and retain global structure

In a large json file, I want to remove some elements from a nested list, but keep the overall structure of the document.
My example input is this (but the real one is large enough to demand streaming).
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this":
[
{"keep" : "true"},
{
"keep": "true",
"extra": "keeper"
} ,
{
"keep": "false",
"extra": "non-keeper"
}
]
}
The required output just has one element of the 'filter_this' block removed:
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this":
[
{"keep" : "true"},
{
"keep": "true",
"extra": "keeper"
}
]
}
The standard way to handle such cases appears to be using 'truncate_stream' to reconstitute streamed objects, before filtering those in the usual jq way. Specifically, the command:
jq -nc --stream 'fromstream(1|truncate_stream(inputs))'
gives access to a stream of objects:
{"keep_this":["this","list"]}
[{"keep":"true"},{"keep":"true","extra":"keeper"},
{"keep":"false","extra":"non-keeper"}]
at which point it is easy to filter for the required objects. However, this strips the results from the context of their parent object, which is not what I want.
Looking at the streaming structure:
[["keep_untouched","keep_this",0],"this"]
[["keep_untouched","keep_this",1],"list"]
[["keep_untouched","keep_this",1]]
[["keep_untouched","keep_this"]]
[["filter_this",0,"keep"],"true"]
[["filter_this",0,"keep"]]
[["filter_this",1,"keep"],"true"]
[["filter_this",1,"extra"],"keeper"]
[["filter_this",1,"extra"]]
[["filter_this",2,"keep"],"false"]
[["filter_this",2,"extra"],"non-keeper"]
[["filter_this",2,"extra"]]
[["filter_this",2]]
[["filter_this"]]
it seems I need to select all the 'filter_this' rows, truncate only those rows (using 'truncate_stream'), rebuild them as objects (using 'fromstream'), filter them, and turn the objects back into the streaming format (using 'tostream') so they can rejoin the 'keep_untouched' rows, which are still in the streaming format. At that point it would be possible to rebuild the whole JSON. If that is the right approach (which seems overly convoluted to me), how do I do that? Or is there a better way?
If your input file consists of a single very large JSON entity that is too big for the regular jq parser to handle in your environment, then there is the distinct possibility that you won't have enough memory to reconstitute the JSON document.
With that caveat, the following may be worth a try. The key insight is that reconstruction can be accomplished using reduce.
The following uses a bunch of temporary files for the sake of clarity:
TMP=/tmp/$$
jq -c --stream 'select(length==2)' input.json > $TMP.streamed
jq -c 'select(.[0][0] != "filter_this")' $TMP.streamed > $TMP.1
jq -c 'select(.[0][0] == "filter_this")' $TMP.streamed |
jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
| .filter_this |= map(select(.keep=="true"))
| tostream
| select(length==2)' > $TMP.2
# Reconstruction
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' $TMP.1 $TMP.2
Output
{
"keep_untouched": {
"keep_this": [
"this",
"list"
]
},
"filter_this": [
{
"keep": "true"
},
{
"keep": "true",
"extra": "keeper"
}
]
}
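To make the reduce/setpath idea more concrete, here is a rough Python analogue. It is not how jq implements it, just a sketch: `to_stream`, `set_path` and `rebuild` are hypothetical names, and only the length-2 (path, leaf) events are modelled, matching the select(length==2) filter used above.

```python
def to_stream(value, path=()):
    """Yield (path, leaf) pairs, like the length-2 events of jq's --stream."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from to_stream(child, path + (key,))
    elif isinstance(value, list) and value:
        for index, child in enumerate(value):
            yield from to_stream(child, path + (index,))
    else:
        yield path, value  # scalars and empty containers are leaves


def set_path(root, path, leaf):
    """Rough analogue of jq's setpath: create containers as needed."""
    if not path:
        return leaf
    head, rest = path[0], path[1:]
    if isinstance(head, int):
        root = [] if root is None else root
        while len(root) <= head:
            root.append(None)  # pad the list up to the requested index
        root[head] = set_path(root[head], rest, leaf)
    else:
        root = {} if root is None else root
        root[head] = set_path(root.get(head), rest, leaf)
    return root


def rebuild(events):
    """Fold (path, leaf) events back into a document, as the reduce does."""
    result = None
    for path, leaf in events:
        result = set_path(result, path, leaf)
    return result
```

As with the jq version, `rebuild` holds the whole reconstructed document in memory, so the caveat about very large inputs applies here too.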
Many thanks to @peak. I found his approach really useful, but unrealistic in terms of performance. Stealing some of @peak's ideas, though, I came up with the following:
Extract the 'parent' object:
jq -c --stream 'select(length==2)' input.json |
jq -c 'select(.[0][0] != "filter_this")' |
jq -n 'reduce inputs as [$p,$x] (null; setpath($p;$x))' > $TMP.parent
Extract the 'keepers' - though this means reading the file twice (:-<):
jq -nc --stream '[fromstream(2|truncate_stream(inputs))
| select(type == "object" and .keep == "true")]
' input.json > $TMP.keepers
Insert the filtered list into the parent object.
jq -nc -s 'inputs as $items
| $items[0] as $parent
| $parent
| .filter_this |= $items[1]
' $TMP.parent $TMP.keepers > result.json
Here is a simplified version of @PeteC's script. It requires one fewer invocation of jq.
In both cases, please note that the invocation of jq that uses "2|truncate_stream(_)" requires a more recent version of jq than 1.5.
TMP=/tmp/$$
INPUT=input.json
# Extract all but .filter_this
< $INPUT jq -c --stream 'select(length==2 and .[0][0] != "filter_this")' |
jq -nc 'reduce inputs as [$p,$x] (null; setpath($p;$x))
' > $TMP.parent
# Need jq > 1.5
# Extract the 'keepers'
< $INPUT jq -n -c --stream '
[fromstream(2|truncate_stream(inputs))
| select(type == "object" and .keep == "true")]
' > $TMP.keepers
# Insert the filtered list into the parent object:
jq -s '. as $in | .[0] | (.filter_this |= $in[1])
' $TMP.parent $TMP.keepers > result.json

transform json to key/value file in bash

I have a JSON which looks like this:
{
"lorem": "ipsum",
"dolor": "sid",
"data": {
"key1": "value1",
"key2": "value2"
}
}
and I want INI-like output containing only the content of 'data' (which is always flat, no branches). The output should look like this:
key1=value1
key2=value2
I can use jq (just don't get it running) but have to use a bash script for it. Can anyone help?
jq solution:
jq -r '.data | to_entries[] | "\(.key)=\(.value)"' input.json
The output:
key1=value1
key2=value2
This will work in BASH.
#!/bin/bash
all_keys=$( cat input.txt );
while read key
do
grep "$key" ./text.txt | awk -F':' '{ print $1"="$2}' | tr -d '[", ]'
done <<< "$all_keys"
This assumes your values are in text.txt and your keys are in input.txt.
Regards!
In python
This reads stdin and outputs the desired "key=value" one per line
#!/usr/bin/python
import json
import sys
data = sys.stdin.read()
json_structure=json.loads(data)
start_point=json_structure["data"]
for k in start_point.keys():
    print("%s=%s" % (k, start_point[k]))
If I was using python to unmangle the json input however, I would probably rewrite the bash script in python

Filter only specific keys from an external file in jq

I have a JSON file with the following format:
[
{
"id": "00001",
"attr": {
"a": "foo",
"b": "bar",
...
}
},
{
"id": "00002",
"attr": {
...
},
...
},
...
]
and a text file with a list of ids, one per line. I'd like to use jq to filter only the records whose ids are mentioned in the text file. I.e. if the list contains "00001", only the first one should be printed.
Note, that I can't simply grep since each record may have an arbitrary number of attributes and sub-attributes.
There are basically two ways to proceed:
(1) read the file of ids from STDIN
(2) read the JSON from STDIN
Both are feasible, but here we illustrate (2) as it leads to a simple but efficient solution.
Suppose the JSON file is named in.json and the list of ids is in a file named ids.txt like so:
00001
00010
Notice that this file has no quotation marks. If it does, then the following can be significantly simplified as shown in the postscript.
The trick is to convert ids.txt into a JSON array. With the above assumption about quotation marks, this can be done by:
jq -R . ids.txt | jq -s .
Assuming a reasonable shell, a simple solution is now at hand:
jq --argjson ids "$(jq -R . ids.txt | jq -s .)" '
map( select( .id as $id | $ids | index($id) ))' in.json
Faster
Assuming your jq has any/2, a simpler and more efficient solution can be obtained by defining:
def isin($a): . as $in | any($a[]; $in == .);
The required jq filter is then just:
map( select( .id | isin($ids) ) )
If these two lines of jq are put into a file named select.jq, the required incantation is simply:
jq --argjson ids "$(jq -R . ids.txt | jq -s .)" -f select.jq in.json
Postscript
If the index file consists of a stream of valid JSON texts (e.g., strings with quotation marks) and if your jq supports the --slurpfile option, the invocation can be further simplified to:
jq --slurpfile ids ids.txt -f select.jq in.json
Or if you want everything as a one-liner:
jq --slurpfile ids ids.txt 'map(select(.id as $id|any($ids[];$id==.)))' in.json
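If you end up doing this outside jq, the same membership filter can be sketched in Python. This assumes, as above, that ids.txt holds one unquoted id per line and that the whole of in.json fits in memory; `load_ids` and `filter_records` are names made up for this example.

```python
def load_ids(path):
    """Read one id per line, ignoring blank lines and surrounding whitespace."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def filter_records(records, ids):
    """Keep only the records whose "id" is in the given set."""
    return [rec for rec in records if rec.get("id") in ids]
```

Loading in.json with `json.load` and calling `filter_records(records, load_ids("ids.txt"))` would give the filtered array. Using a set makes each lookup constant time on average, which is the same motivation as replacing index with any/2 above.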