Trouble pretty-printing large JSON file on Linux command line with jq

I am trying to pretty-print and scroll through sections of an extremely large (tens of gigabytes) human-unreadable json file with jq on the command line.
less bigFile.json | jq
works but just makes it fly by.
I tried to pipe it back into less like this:
less bigFile.json | jq | less
but it produced some kind of error.
How do you make this work?

The command should look like this:
jq -C . bigfile.json | less -r
If that exhausts all your memory you might want to try the -B option of less or even better, use jq to filter out the interesting parts.

Related

Splitting large JSON data using Unix command Split

I have an issue with the Unix split command when splitting large data: split -l 1000 file.json myfile. I want to split this file into multiple files of 1000 records each, but I'm getting the output as a single file, with no change.
P.S. File is created converting Pandas Dataframe to JSON.
Edit: It turns out that my JSON is formatted in such a way that it contains only one row. wc -l file.json returns 0.
Here is the sample: file.json
[
{"id":683156,"overall_rating":5.0,"hotel_id":220216,"hotel_name":"Beacon Hill Hotel","title":"\u201cgreat hotel, great location\u201d","text":"The rooms here are not palatial","author_id":"C0F"},
{"id":692745,"overall_rating":5.0,"hotel_id":113317,"hotel_name":"Casablanca Hotel Times Square","title":"\u201cabsolutely delightful\u201d","text":"I travelled from Spain...","author_id":"8C1"}
]
Invoking jq once per partition plus once to determine the number of partitions would be extremely inefficient. The following solution suffices to achieve the partitioning deemed acceptable in your answer:
jq -c ".[]" file.json | split -l 1000
If, however, it is deemed necessary for each file to be pretty-printed, you could run jq -s . for each file, which would still be more efficient than running .[N:N+S] multiple times.
If each partition should itself be a single JSON array, then see Splitting / chunking JSON files with JQ in Bash or Fish shell?
After asking elsewhere, it turned out that the file was, in fact, a single line.
Reformatting it with jq (in compact form) would enable the split, though processing the file would at least require deleting its first and last characters (or adding '[' and ']' to the split files).
I'd recommend splitting the JSON array with jq (see manual).
cat file.json | jq length # get length of an array
cat file.json | jq -c '.[0:1000]' # first 1000 items (the end index is exclusive)
cat file.json | jq -c '.[1000:2000]' # second 1000 items
...
Notice -c for compact result (not pretty printed).
For automation, you can code a simple bash script to split your file into chunks given the array length (jq length).
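If a scripting language is an option, the chunking can also be done in one pass instead of one jq call per slice. The following is a minimal Python sketch, not a drop-in replacement: the function name `split_json_array`, the `chunk_` file prefix, and the output layout are assumptions made for illustration, and it loads the whole array into memory, so it only suits files that fit in RAM.

```python
import json
from pathlib import Path


def split_json_array(src, out_dir, chunk_size=1000, prefix="chunk_"):
    """Write the top-level array in `src` as files of `chunk_size` records each."""
    # json.load reads the entire array into memory at once.
    with open(src, encoding="utf-8") as fh:
        records = json.load(fh)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(records), chunk_size):
        # Slice ends are exclusive, so [0:1000] holds exactly 1000 items.
        chunk = records[start:start + chunk_size]
        path = out / f"{prefix}{start // chunk_size:04d}.json"
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(chunk, fh)  # each output file is itself a JSON array
        written.append(path)
    return written
```

Calling `split_json_array("file.json", "parts")` would write `parts/chunk_0000.json`, `parts/chunk_0001.json`, and so on, with the last file holding whatever remains.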

jq group_by does not play nice with .[]

I have a local JSON file called pokemini.json. These are its contents:
{"name":"Bulbasaur","type":["Grass","Poison"],"total":318,"hp":45,"attack":49}
{"name":"Ivysaur","type":["Grass","Poison"],"total":405,"hp":60,"attack":62}
{"name":"Venusaur","type":["Grass","Poison"],"total":525,"hp":80,"attack":82}
{"name":"VenusaurMega Venusaur","type":["Grass","Poison"],"total":625,"hp":80,"attack":100}
{"name":"Charmander","type":["Fire"],"total":309,"hp":39,"attack":52}
{"name":"Charmeleon","type":["Fire"],"total":405,"hp":58,"attack":64}
{"name":"Charizard","type":["Fire","Flying"],"total":534,"hp":78,"attack":84}
{"name":"CharizardMega Charizard X","type":["Fire","Dragon"],"total":634,"hp":78,"attack":130}
{"name":"CharizardMega Charizard Y","type":["Fire","Flying"],"total":634,"hp":78,"attack":104}
{"name":"Squirtle","type":["Water"],"total":314,"hp":44,"attack":48}
There are a few types of pokemon in here and I want to do some aggregation with jq.
I could, for example, write this command:
> jq -s -c 'group_by(.type[0]) | .[]' pokemini.json
[{"name":"Charmander","type":["Fire"],"total":309,"hp":39,"attack":52},{"name":"Charmeleon","type":["Fire"],"total":405,"hp":58,"attack":64},{"name":"Charizard","type":["Fire","Flying"],"total":534,"hp":78,"attack":84},{"name":"CharizardMega Charizard X","type":["Fire","Dragon"],"total":634,"hp":78,"attack":130},{"name":"CharizardMega Charizard Y","type":["Fire","Flying"],"total":634,"hp":78,"attack":104}]
[{"name":"Bulbasaur","type":["Grass","Poison"],"total":318,"hp":45,"attack":49},{"name":"Ivysaur","type":["Grass","Poison"],"total":405,"hp":60,"attack":62},{"name":"Venusaur","type":["Grass","Poison"],"total":525,"hp":80,"attack":82},{"name":"VenusaurMega Venusaur","type":["Grass","Poison"],"total":625,"hp":80,"attack":100}]
[{"name":"Squirtle","type":["Water"],"total":314,"hp":44,"attack":48}]
I am aware that the -c flag is what causes it to print one group per line, and that I need -s to handle the fact that my JSON file is more like JSON Lines than actual JSON. It should also be pointed out that only three types of pokemon are detected because I am grouping over .type[0] (note the [0]).
I don't get why this does not work though;
> jq -s '.[] | group_by(.type[0])' pokemini.json
jq: error (at pokemini.json:10): Cannot index string with string "type"
group_by/1 expects its input to be an array. By calling .[] first, you are effectively undoing the work of the -s option.
By the way, an alternative to using -s is to use inputs with the -n command-line option, but in this case it makes little difference. When you don’t actually need to read the entire stream of inputs at once, though, using inputs is in general more efficient.
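To see why the grouping needs the whole array, here is a rough Python sketch of what `jq -s 'group_by(.type[0])'` does with JSON Lines input. The function name `group_by_first_type` is made up for illustration; the point is only that every record has to be collected into one list before it can be sorted and grouped, which is the step that `.[]` undoes.

```python
import json
from itertools import groupby


def first_type(record):
    """Grouping key: the first entry of the record's "type" list."""
    return record["type"][0]


def group_by_first_type(jsonl_text):
    """Group JSON Lines records by their first type, sorted by that key."""
    # "Slurp": parse every non-blank line and collect the records into one list.
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    # sorted() is stable, so records keep their input order inside each group.
    ordered = sorted(records, key=first_type)
    # groupby() only merges adjacent items, hence the sort beforehand.
    return [list(group) for _, group in groupby(ordered, key=first_type)]
```

Passing a single record to such a function makes no sense for the same reason that `group_by` fails on a single object: there is no list to sort.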

Split JSON into multiple files

I have json file exported from mongodb which looks like:
{"_id":"99919","city":"THORNE BAY"}
{"_id":"99921","city":"CRAIG"}
{"_id":"99922","city":"HYDABURG"}
{"_id":"99923","city":"HYDER"}
There are about 30000 lines, and I want to split each line into its own .json file. (I'm trying to transfer my data onto a Couchbase cluster.)
I tried doing this:
cat cities.json | jq -c -M '.' | \
while read line; do echo $line > .chunks/cities_$(date +%s%N).json; done
but I found that it seems to drop loads of lines: running this command only gave me 50-odd files when I was expecting 30000-odd!
Is there a logical way to make this not drop any data, using anything that would suit?
Assuming you don't care about the exact filenames, if you want to split input into multiple files, just use split.
jq -c . < cities.json | split -l 1 --additional-suffix=.json - .chunks/cities_
In general, splitting any text file into one file per line, using any awk on any UNIX system, is simply:
awk '{close(f); f=".chunks/cities_"NR".json"; print > f}' cities.json
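The same per-line split can be sketched in Python, mirroring the awk version by numbering the files with the line number. The function name `write_one_file_per_line` and its parameters are assumptions for illustration, not part of any tool mentioned above.

```python
from pathlib import Path


def write_one_file_per_line(src, out_dir, prefix="cities_"):
    """Write every line of `src` to its own numbered .json file; return the count."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(src, encoding="utf-8") as fh:
        for count, line in enumerate(fh, start=1):
            # A running counter gives each file a unique name, whereas
            # timestamp-based names can collide and overwrite each other.
            (out / f"{prefix}{count}.json").write_text(line, encoding="utf-8")
    return count
```

Because each file is opened and closed in turn, this does not run into open-file limits even for tens of thousands of lines.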

Ignore Unparseable JSON with jq

I'm using jq to parse some of my logs, but some of the log lines can't be parsed for various reasons. Is there a way to have jq ignore those lines? I can't seem to find a solution. I tried to use the --seq argument that was recommended by some people, but --seq ignores all the lines in my file.
Assuming that each log entry is exactly one line, you can use the -R or --raw-input option to tell jq to leave the lines unparsed, after which you can prepend fromjson? | to your filter to make jq try to parse each line as JSON and throw away the ones that error.
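For comparison, the same "try to parse each line, drop the failures" idea looks like this as a small Python sketch. The generator name `parse_json_lines` is hypothetical, and like the jq approach it assumes one log entry per line.

```python
import json


def parse_json_lines(lines):
    """Yield the parsed value of every line that is valid JSON; skip the rest."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            # Comparable to `fromjson?`: a line that fails to parse is dropped silently.
            continue
```

It accepts any iterable of strings, so an open file or `sys.stdin` can be passed in directly.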
I have log stream where some messages are in json format.
I want to pipe the json messages through jq, and just echo the rest.
The json messages are on a single line.
Solution: use grep and tee to split the lines into two streams: lines starting with "{" are piped through jq, and the rest are just echoed to the terminal.
kubectl logs -f web-svjkn | tee >(grep -v "^{") | grep "^{" | jq .
or
cat logs | tee >(grep -v "^{") | grep "^{" | jq .
Explanation:
tee generates a second stream, in which grep -v prints the non-JSON lines; the second grep pipes only the lines that start with a JSON opening brace to jq.
This is an old thread, but here's another solution fully in jq. This allows you to both process proper json lines and also print out non-json lines.
jq -R '. as $line | try (fromjson | <further processing for proper json lines>) catch $line'
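A Python sketch of the same pass-through behaviour, for anyone who prefers a script: JSON lines are processed, everything else is returned untouched. The function name `transform_log_lines` and the `process` callback (standing in for the "further processing" placeholder) are assumptions for illustration.

```python
import json


def transform_log_lines(lines, process):
    """Apply `process` to each JSON line; return non-JSON lines unchanged."""
    out = []
    for line in lines:
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            out.append(line)  # like `catch $line`: echo the raw line as-is
            continue
        # Re-serialize the processed value as one compact JSON line.
        out.append(json.dumps(process(value)))
    return out
```

One difference worth knowing: in the jq one-liner, `try` also catches errors raised by the further processing and prints the raw line in that case, whereas this sketch lets exceptions from `process` propagate.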
There are several Q&As on the FAQ page dealing with the topic of "invalid JSON", but see in particular the Q:
Is there a way to have jq keep going after it hits an error in the input file?
In particular, this shows how to use --seq.
However, from the sparse details you've given (SO recommends a minimal example be given), it would seem it might be better simply to use inputs. The idea is to process one JSON entity at a time, using "try/catch", e.g.
def handle: inputs | [., "length is \(length)"] ;
def process: try handle catch ("Failed", process) ;
process
Don't forget to use the -n option when invoking jq.
See also Processing not-quite-valid JSON.
If JSON in curly braces {}:
grep -Pzo '\{(?>[^\{\}]|(?R))*\}' | jq 'objects'
If JSON in square brackets []:
grep -Pzo '\[(?>[^\[\]]|(?R))*\]' | jq 'arrays'
This works if there are no []{} in non-JSON lines.
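Python's standard `re` module has no recursive patterns like `(?R)`, so a direct port of the grep approach is not possible there. A different technique that serves the same purpose is to scan for an opening character and let `json.JSONDecoder.raw_decode` try to parse from that position. The sketch below does that; the function name `extract_json_values` is made up, and unlike the grep version it only returns spans that are actually valid JSON, not merely balanced braces.

```python
import json


def extract_json_values(text, opener="{"):
    """Return every JSON value in `text` that starts with `opener` ("{" or "[")."""
    decoder = json.JSONDecoder()
    found = []
    pos = 0
    while True:
        start = text.find(opener, pos)
        if start == -1:
            break
        try:
            # raw_decode parses one value beginning at `start` and
            # returns it together with the index just past its end.
            value, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not JSON here; keep scanning after this character
            continue
        found.append(value)
        pos = end  # skip past the parsed value, including anything nested in it
    return found
```

Stray braces in non-JSON text are simply skipped when they do not start a parseable value, which makes this somewhat more tolerant than the regex.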

Output UNIX environment as JSON

I'd like a unix one-liner that will output the current execution environment as a JSON structure like: { "env-var" : "env-value", ... etc ... }
This kinda works:
(echo "{"; printenv | sed 's/\"/\\\"/g' | sed -n 's|\(.*\)=\(.*\)|"\1"="\2"|p' | grep -v '^$' | paste -s -d"," -; echo "}")
but has some extra lines and I think won't work if the environment values or variables have '=' or newlines in them.
Would prefer pure bash/sh, but compact python / perl / ruby / etc one-liners would also be appreciated.
Using jq 1.5 (e.g. jq 1.5rc2 -- see http://stedolan.github.io/jq):
$ jq -n env
This works for me:
python -c 'import json, os;print(json.dumps(dict(os.environ)))'
It's pretty simple; the main complication is that os.environ is a dict-like object, but it is not actually a dict, so you have to convert it to a dict before you feed it to the json serializer.
Adding parentheses around the print statement lets it work in both Python 2 and 3, so it should work for the foreseeable future on most *nix systems (especially since Python comes by default on any major distro).
@Alexander Trauzzi asked: "Wondering if anyone knows how to do this, but only passing a subset of the current environment's variables?"
I just found the way to do this:
jq -n 'env | {USER, HOME, PS1}'
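The Python one-liner above can be narrowed to a subset in the same spirit. This is a small sketch under assumptions: the function name `env_subset_json` and its optional `environ` parameter (handy for testing) are hypothetical. Variables that are not set come out as JSON null, which is also what the jq object construction produces for missing keys.

```python
import json
import os


def env_subset_json(names, environ=None):
    """Return a JSON object string holding only the requested variables."""
    environ = os.environ if environ is None else environ
    # .get() yields None for unset variables, which json.dumps writes as null.
    return json.dumps({name: environ.get(name) for name in names})


if __name__ == "__main__":
    print(env_subset_json(["USER", "HOME", "PS1"]))
```

Note that PS1 is usually a shell variable that is not exported, so it will often show up as null in both versions.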