Using jq to concatenate directory of JSON files

I have a directory of about 100 JSON files, each an array of 100 simple records, that I want to concatenate into one file for inclusion as static data in an app, so I don't have to make repeated API calls to retrieve small pieces. (I'm limited to downloading only 100 records at a time; that's why I have 100 short files.)
Here's a sample file, shortened to two records for display here:
[
{
"id": 11531,
"title": "category 1",
"count": 5
},
{
"id": 11532,
"title": "category 2",
"count": 5
}
]
My research led to a solution that works but only for two files with two records each:
jq -s '.[0] + .[1]' file1.json file2.json > output.json
This source also suggested this line would work to handle a directory (right now only two files in it):
jq -s 'reduce .[] as $item ({}; . * $item)' json_files/* > output.json
but I get an error:
jq: error (at json_files/categories-11-20.json:0): object ({}) and array ([{"id":1153...) cannot be multiplied
I thought maybe the problem was the * trying to multiply, so I tried + in its place, but then I get a "... cannot be added" message.
Is there a way to do this through jq or is there a better tool?

The simplest and perfectly reasonable approach would be to use the -s command-line option and add along the following lines:
jq -s add json_files/*
Of course you may wish to specify the list of files differently. The order in which they are specified is also significant.
Notes:
This Q is really just a variant of Use jq to concatenate JSON arrays in multiple files
reduce can also be used, but you would need to start with null or [] rather than {}.
The operator '*' is (not surprisingly) quite different from '+'!
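As a minimal sketch of the above, the following builds a scratch directory with two small array files and concatenates them, first with add and then with the reduce variant seeded with []. The directory and file names are made up for illustration; only jq -s, add and reduce come from the answer.

```shell
# Scratch directory with two hypothetical array files.
mkdir -p demo_concat
printf '%s\n' '[{"id":1,"title":"category 1"},{"id":2,"title":"category 2"}]' > demo_concat/categories-01-10.json
printf '%s\n' '[{"id":3,"title":"category 3"}]' > demo_concat/categories-11-20.json

# -s collects all inputs into one array of arrays; add concatenates them.
jq -c -s 'add' demo_concat/categories-*.json > demo_concat/all.json

# Same result with reduce, seeded with [] and using + rather than *.
jq -c -s 'reduce .[] as $item ([]; . + $item)' demo_concat/categories-*.json > demo_concat/all-reduce.json

cat demo_concat/all.json
```

The shell expands the glob in sorted order, so that is the order in which the records end up in the output.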

Related

Providing a very large argument to a jq command to filter on keys

I am trying to parse a very large file which consists of JSON objects like this:
{"id": "100000002", "title": "some_title", "year": 1988}
I also have a very big list of IDs that I want to extract from the file, if they are there.
Now I know that I can do this:
jq '[ .[map(.id)|indices("1", "2")[]] ]' 0.txt > p0.json
This produces the result I want: it fills p0.json with only the objects that have "id" "1" and "2". Now comes the problem: my list of ids is very long too (100k or so), so I have a Python program that outputs the relevant ids. My line of thought was to first assign that output to a variable:
REL_IDS=`echo python3 rel_ids.py`
And then do:
jq --arg ids "$REL_IDS" '[ .[map(.id)|indices($ids)[]] ]' 0.txt > p0.json
I tried both with brackets [$ids] and without brackets, but no luck so far.
My question is, given a big amount of arguments for the filter, how would I proceed with putting them into my jq command?
Thanks a lot in advance!
Since the list of ids is long, the trick is NOT to use --arg. However, the details will depend on the details regarding the "long list of ids".
In general, though, you'd want to present the list of ids to jq as a file so that you could use --rawfile or --slurpfile or some such.
If for some reason you don't want to bother with an actual file, then provided your shell allows it, you could use these file-oriented options with process substitution: <( ... )
Example
Assuming ids.json contains a listing of the ids as JSON strings:
"1"
"2"
"3"
then one could write:
< objects.json jq -c -n --slurpfile ids ids.json '
inputs | . as $in | select( $ids | index($in.id))'
Notice the use of the -n command-line option.
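With a list of 100k ids, scanning the whole list for every object may get slow. One possible refinement, sketched here under the assumption that every object has a string-valued id, is to turn the slurped ids into a lookup object once and test membership by key. The sample files and the $wanted name are made up for illustration.

```shell
mkdir -p demo_ids
# Hypothetical sample data: a stream of objects and a stream of wanted ids.
printf '%s\n' '{"id":"1","title":"a"}' '{"id":"5","title":"b"}' '{"id":"3","title":"c"}' > demo_ids/objects.json
printf '%s\n' '"1"' '"2"' '"3"' > demo_ids/ids.json

# --slurpfile binds $ids to an array holding every JSON value in ids.json.
# The reduce builds {"1":true,"2":true,"3":true} once; each input is then
# checked with a single key lookup instead of a scan over the id list.
jq -c -n --slurpfile ids demo_ids/ids.json '
  (reduce $ids[] as $id ({}; .[$id] = true)) as $wanted
  | inputs
  | select($wanted[.id])
' demo_ids/objects.json > demo_ids/matches.json

cat demo_ids/matches.json
```

Only the objects with ids "1" and "3" are kept; "2" is in the list but not in the data, and "5" is in the data but not in the list.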

Retrieving the first entity out of several ones

I am a rank beginner with jq, and I've been going through the tutorial, but I think there is a conceptual difference I don't understand. A common problem I encounter is that a large JSON file will contain many objects, each of which is quite big, and I'd like to view the first complete object, to see which fields exist, what types, how much nesting, etc.
In the tutorial, they do this:
# We can use jq to extract just the first commit.
$ curl 'https://api.github.com/repos/stedolan/jq/commits?per_page=5' | jq '.[0]'
Here is an example with one object - here, I'd like to return the whole array (just like my_array=['foo']; my_array[0] would return foo in Python).
wget https://hacker-news.firebaseio.com/v0/item/8863.json
I can access and pretty-print the whole thing with .
$ cat 8863.json | jq '.'
$
{
"by": "dhouston",
"descendants": 71,
"id": 8863,
"kids": [
9224,
...
8876
],
"score": 104,
"time": 1175714200,
"title": "My YC app: Dropbox - Throw away your USB drive",
"type": "story",
"url": "http://www.getdropbox.com/u/2/screencast.html"
}
But trying to get the first element fails:
$ cat 8863.json| jq '.[0]'
$ jq: error (at <stdin>:0): Cannot index object with number
I get the same error jq '.[0]' 8863.json, but strangely echo 8863.json | jq '.[0]' gives me parse error: Invalid numeric literal at line 2, column 0. What is the difference? Also, is this not the correct way to get the zeroth member of the JSON?
I've looked at other SO posts with this error message and at the manual, but I'm still confused. I think of the file as an array of JSON objects, and I'd like to get the first. But it looks like jq works with something called a "stream", and does operations on all of it (say, return one given field from every object).
Clarification:
Let's say I have 2 objects in my JSON:
{
"by": "pg",
"id": 160705,
"poll": 160704,
"score": 335,
"text": "Yes, ban them; I'm tired of seeing Valleywag stories on News.YC.",
"time": 1207886576,
"type": "pollopt"
}
{
"by": "dpapathanasiou",
"id": 16070,
"kids": [
16078
],
"parent": 16069,
"text": "Dividends don't mean that much: Microsoft in its dominant years (when they had 40%-plus margins and were raking in the cash) never paid a dividend (they did so only recently).",
"time": 1177355133,
"type": "comment"
}
How would I get the entire first object (lines 1-9) with jq?
Cannot index object with number
This error message says it all: you can't index objects with numbers. If you want to get the value of the by field, you need to do
jq '.by' file
Wrt
echo 8863.json | jq '.[0]' gives me parse error: Invalid numeric literal at line 2, column 0.
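That's expected: echo 8863.json sends the literal text 8863.json to jq, not the contents of the file, and that text is not valid JSON, hence the parse error. Even with the -R/--raw-input flag, jq would only see the JSON string "8863.json", and one cannot apply array indexing to JSON strings. (To get the first character as a string, you'd write .[0:1].)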
If your input file consists of several separate entities, to get the first one:
jq -n 'input' file
or,
jq -n 'first(inputs)' file
To get the nth (let's say the 5th, for example; nth counts from 0):
jq -n 'nth(4; inputs)' file
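To see these in action, here is a small sketch on a made-up three-entity stream (the file name and contents are just for illustration):

```shell
mkdir -p demo_stream
# A file holding three separate JSON entities, one after another.
printf '%s\n' '{"n":"first"}' '{"n":"second"}' '{"n":"third"}' > demo_stream/items.json

jq -c -n 'input' demo_stream/items.json            # {"n":"first"}
jq -c -n 'first(inputs)' demo_stream/items.json    # {"n":"first"}
jq -c -n 'nth(2; inputs)' demo_stream/items.json   # {"n":"third"}, since nth counts from 0
```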
a large JSON file will contain many objects, each of which is quite big, and I'd like to view the first complete object, to see which fields exist, what types, how much nesting, etc.
As implied in @OguzIsmail's response, there are important differences between:
- a JSON file (i.e., a file containing exactly one JSON entity);
- a file containing a sequence (i.e., stream) of JSON entities;
- a file containing an array of JSON entities.
In the first two cases, you can write jq -n input to select the first entity, and in the case of an array of entities, jq .[0] will suffice.
(In JSON-speak, a "JSON object" is a kind of dictionary, and is not to be confused with JSON entities in general.)
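If you're not sure which of these cases you have, a quick check along the following lines may help. This is only a sketch with made-up file names; it uses input | type to peek at the first entity and then picks the matching way of getting at the first record.

```shell
mkdir -p demo_shapes
# The same two records, once as a single array and once as a stream.
printf '%s\n' '[{"id":1},{"id":2}]' > demo_shapes/array.json
printf '%s\n' '{"id":1}' '{"id":2}' > demo_shapes/stream.json

# Peek at the type of the first entity in each file.
jq -r -n 'input | type' demo_shapes/array.json    # array
jq -r -n 'input | type' demo_shapes/stream.json   # object

# Array: index into it. Stream: read just the first entity.
jq -c '.[0]' demo_shapes/array.json
jq -c -n 'input' demo_shapes/stream.json
```

Both of the last two commands print {"id":1}; running .[0] against the stream file instead would give the "Cannot index object with number" error.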
If you have a bunch of JSON objects (whether as a stream or array or whatever), just looking at the first often doesn't really give an accurate picture of all of them. For getting a bird's eye view of a bunch of objects, using a "schema inference engine" is often the way to go. For this purpose, you might like to consider my schema.jq schema inference engine. It's usually very simple to use but of course how you use it will depend on whether you have a stream or array of JSON entities. For basic details, see https://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed; for related topics (e.g. verification), see the entry for JESS at https://github.com/stedolan/jq/wiki/Modules
Please note that schema.jq infers a structural schema that mirrors the entities under consideration. Such structural schemas have little in common with JSON Schema schemas, which you might also like to consider.

Using jq to combine multiple JSON files

First off, I am not an expert with JSON files or with JQ. But here's my problem:
I am simply trying to download card data (for the MtG card game) through an API, so I can use it in my own spreadsheets etc.
The card data from the API comes in pages, since there is so much of it, and I am trying to find a nice command line method in Windows to combine the files into one. That will make it nice and easy for me to use the information as external data in my workbooks.
The data from the API looks like this:
{
"object": "list",
"total_cards": 290,
"has_more": true,
"next_page": "https://api.scryfall.com/cards/search?format=json&include_extras=false&order=set&page=2&q=e%3Alea&unique=cards",
"data": [
{
"object": "card",
"id": "d5c83259-9b90-47c2-b48e-c7d78519e792",
"oracle_id": "c7a6a165-b709-46e0-ae42-6f69a17c0621",
"multiverse_ids": [
232
],
"name": "Animate Wall",
......
}
{
"object": "card",
......
}
]
}
Basically I need to take what's inside the "data" part from each file after the first, and merge it into the first file.
I have tried a few examples I found online using jq, but I can't get it to work. I think it might be because in this case the data is sort of under an extra level, since there is some basic information, then the "data" category is beneath it. I don't know.
Anyway, any help on how to get this going would be appreciated. I don't know much about this, but I can learn quickly so even any pointers would be great.
Thanks!
To merge the .data elements of all the responses into the first response, you could run:
jq 'reduce inputs.data as $s (.; .data += $s)' page1.json page2.json ...
Alternatives
You could use the following filter in conjunction with the -n command-line option:
reduce inputs as $s (input; .data += ($s.data))
Or if you simply want an object of the form {"data": [ ... ]} then (again assuming you invoke jq with the -n command-line option) the following jq filter would suffice:
{data: [inputs.data] | add}
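Here is a small worked sketch of the first and last variants, using two heavily trimmed, made-up pages in a scratch directory (real responses carry many more fields):

```shell
mkdir -p demo_pages
# Two hypothetical pages with the same overall shape as the API response.
printf '%s\n' '{"object":"list","has_more":true,"data":[{"name":"A"},{"name":"B"}]}' > demo_pages/page1.json
printf '%s\n' '{"object":"list","has_more":false,"data":[{"name":"C"}]}' > demo_pages/page2.json

# Keep page 1's top-level fields and append each later page's .data to it.
jq -c 'reduce inputs.data as $s (.; .data += $s)' \
  demo_pages/page1.json demo_pages/page2.json > demo_pages/merged.json

# Only the combined list, without the paging metadata (note the -n).
jq -c -n '{data: [inputs.data] | add}' \
  demo_pages/page1.json demo_pages/page2.json > demo_pages/data-only.json

cat demo_pages/merged.json demo_pages/data-only.json
```

Note that in the first variant the top-level fields are copied from page 1 as they are, so values such as has_more and next_page still describe the first page rather than the merged result.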
Just to provide closure, @peak provided the solution. I am using it in conjunction with the method found here for using wildcards in batch files to address multiple files. The code looks like this now:
set expanded_list=
for /f "tokens=*" %%F in ('dir /b /a:-d "All Cards\!setname!_*.json"') do call set expanded_list=!expanded_list! "All Cards\%%F"
jq-win32 "reduce inputs.data as $s (.; .data += $s)" !expanded_list! > "All Cards\!setname!.json"
All the individual pages for each card set are named "setname"_"pagenumber".json
The code finds all the pages for each set and combines them into one variable which I can pass into jq.
Thanks again!

How to create 2 CSV files from 1 JSON using JQ

I have a lot of rather large JSON logs which need to be imported into several DB tables.
I can easily parse them and create 1 CSV for import.
But how can I parse the JSON and get 2 different CSV files as output?
Simple (nonsense) example:
testJQ.log
{"id":1234,"type":"A","group":"games"}
{"id":5678,"type":"B","group":"cars"}
using
cat testJQ.log|jq --raw-output '[.id,.type,.group]|@csv'>testJQ.csv
I get one file testJQ.csv
1234,"A","games"
5678,"B","cars"
But I would like to get this
types.csv
1234,"A"
5678,"B"
groups.csv
1234,"games"
5678,"cars"
Can this be done without having to parse the JSON twice, first time creating the types.csv and second time the groups.csv like this?
cat testJQ.log|jq --raw-output '[.id,.type]|@csv'>types.csv
cat testJQ.log|jq --raw-output '[.id,.group]|@csv'>groups.csv
I suppose one way you could hack this up is to output the contents of one file to stdout and the others to stderr and redirect to separate files. Of course you're limited to two files though.
$ <testJQ.log jq -r '([.id,.type]|@csv),([.id,.group]|@csv|stderr|empty)' \
1>types.csv 2>groups.csv
stderr outputs to stderr but the value propagates to the output, so you'll want to follow that up with empty to swallow that up.
Personally I wouldn't recommend doing this, I would just write a python script (or other language) to parse this if you needed to output to multiple files.
You will either need to run jq twice, or to run jq in conjunction with another program to "split" the output of the call to jq. For example, you could use a pipeline of the form: jq -c ... | awk ...
The potential disadvantage of the pipeline approach is that if JSON is the final output, it will be JSONL; but obviously that doesn't apply here.
There are many ways to craft such a pipeline. For example, assuming there are no raw newlines in the CSV:
< testJQ.log jq -r '
"types", ([.id,.type] |@csv),
"groups", ([.id,.group]|@csv)' |
awk 'NR % 2 == 1 {out=$1; next} {print >> (out ".csv")}'
Or:
< testJQ.log jq -r '([.id,.type],[.id,.group])|@csv' |
awk '{ out = ((NR % 2) == 1) ? "types" : "groups"; print >> (out ".csv")}'
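Another possible variant, sketched below, tags each CSV row with its destination so that awk does not have to rely on line parity; this also extends naturally to more than two output files. The demo_split directory and the tab-separated tagging scheme are choices made for this illustration, and the sketch still assumes there are no raw newlines in the CSV.

```shell
mkdir -p demo_split
printf '%s\n' '{"id":1234,"type":"A","group":"games"}' \
              '{"id":5678,"type":"B","group":"cars"}' > demo_split/testJQ.log

# jq prefixes every CSV row with a destination name and a tab;
# awk cuts at the first tab and writes the rest to <name>.csv.
jq -r '
  ("types\t"  + ([.id, .type]  | @csv)),
  ("groups\t" + ([.id, .group] | @csv))
' demo_split/testJQ.log |
awk '{
  i = index($0, "\t")
  # ">" truncates each file once per awk run, then keeps appending to it.
  print substr($0, i + 1) > ("demo_split/" substr($0, 1, i - 1) ".csv")
}'

cat demo_split/types.csv demo_split/groups.csv
```

Because awk only splits at the first tab, tabs inside the CSV fields themselves are left alone.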
For other examples, see e.g.
Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?
Splitting / chunking JSON files with JQ in Bash or Fish shell?
Split JSON into multiple files
Handling raw new-lines
Whether or not you split the CSV into multiple files, there is a potential issue with embedded raw newlines. One approach is to change "\n" in JSON strings to "\\n", e.g.
jq -r '([.id,.type],[.id,.group])
| map(if type == "string" then gsub("\n";"\\n") else . end)
| @csv'
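To illustrate the effect, here is a small sketch with a made-up record whose note field contains a raw newline (the field and file names are hypothetical). It assumes a jq build with regex support, which gsub needs.

```shell
mkdir -p demo_newline
# The JSON escape \n below becomes a real newline once jq parses the string.
printf '%s\n' '{"id":1,"note":"line one\nline two"}' > demo_newline/in.json

jq -r '[.id, .note]
  | map(if type == "string" then gsub("\n"; "\\n") else . end)
  | @csv' demo_newline/in.json > demo_newline/out.csv

cat demo_newline/out.csv
```

The record stays on a single CSV line, with the newline shown as the two characters \n, so line-oriented tools such as awk see exactly one line per record.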

Handling multiple "elements" in one JSON file with jq

I have a JSON-file which consists of multiple JSON-"elements", e.g.
{
"name": "Name 1",
"foo": "Bar"
}
{
"id": 123,
"bar": "Foo"
}
I'm only interested in the second element and I need to query by the 'index' of the element (i.e. I do not know what fields the element will contain).
How do I achieve this with jq?
There are several possible answers, depending on which version of jq you have, so here I'll focus on a generic and generally useful answer.
Use the -s ("slurp") option to get the second JSON entity, as in jq -s '.[1]'
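For instance, with the two elements from the question saved to a scratch file (the path is made up for illustration), this might look like:

```shell
mkdir -p demo_second
printf '%s\n' '{"name":"Name 1","foo":"Bar"}' '{"id":123,"bar":"Foo"}' > demo_second/elements.json

# -s reads every entity into one array, so positions start at 0.
jq -c -s '.[1]' demo_second/elements.json   # {"id":123,"bar":"Foo"}
```

Keep in mind that -s reads the whole file into memory before the filter runs, which may matter for very large inputs.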
In jq 1.4 and later, the jq filter .[] when used on objects preserves the order of the keys. (Using jq 1.3, you may be out of luck if you do not know anything about the key names.) For example, using jq 1.4 or later:
$ jq '.[]'
{"b":1, "a":2}
1
2