How to create 2 CSV files from 1 JSON using JQ

I have a lot of rather large JSON logs which need to be imported into several DB tables.
I can easily parse them and create 1 CSV for import.
But how can I parse the JSON and get 2 different CSV files as output?
Simple (nonsense) example:
testJQ.log
{"id":1234,"type":"A","group":"games"}
{"id":5678,"type":"B","group":"cars"}
using
cat testJQ.log|jq --raw-output '[.id,.type,.group]|@csv'>testJQ.csv
I get one file testJQ.csv
1234,"A","games"
5678,"B","cars"
But I would like to get this
types.csv
1234,"A"
5678,"B"
groups.csv
1234,"games"
5678,"cars"
Can this be done without having to parse the JSON twice, first time creating the types.csv and second time the groups.csv like this?
cat testJQ.log|jq --raw-output '[.id,.type]|@csv'>types.csv
cat testJQ.log|jq --raw-output '[.id,.group]|@csv'>groups.csv

I suppose one way you could hack this up is to output the contents of one file to stdout and the other to stderr, redirecting each to a separate file. Of course, you're limited to two files that way.
$ <testJQ.log jq -r '([.id,.type]|@csv),([.id,.group]|@csv|stderr|empty)' \
1>types.csv 2>groups.csv
stderr writes its input to stderr, but the value also propagates to the output, so you'll want to follow it with empty to swallow the duplicate.
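To see that pass-through in isolation (a small demo; the exact formatting stderr uses on the error stream varies between jq versions):
$ jq -n '"x" | stderr' 2>/dev/null
"x"
Even with the stderr copy discarded, the value still arrives on stdout, which is why the filter above ends with empty.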
Personally I wouldn't recommend doing this; I would just write a Python script (or use another language) to parse this if you needed to output to multiple files.

You will either need to run jq twice, or to run jq in conjunction with another program to "split" the output of the call to jq. For example, you could use a pipeline of the form: jq -c ... | awk ...
The potential disadvantage of the pipeline approach is that if JSON is the final output, it will be JSONL; but obviously that doesn't apply here.
There are many ways to craft such a pipeline. For example, assuming there are no raw newlines in the CSV:
< testJQ.log jq -r '
"types", ([.id,.type] |@csv),
"groups", ([.id,.group]|@csv)' |
awk 'NR % 2 == 1 {out=$1; next} {print >> (out ".csv")}'
Or:
< testJQ.log jq -r '([.id,.type],[.id,.group])|@csv' |
awk '{ out = ((NR % 2) == 1) ? "types" : "groups"; print >> (out ".csv") }'
For other examples, see e.g.
Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?
Splitting / chunking JSON files with JQ in Bash or Fish shell?
Split JSON into multiple files
Handling raw newlines
Whether or not you split the CSV into multiple files, there is a potential issue with embedded raw newlines. One approach is to change "\n" in JSON strings to "\\n", e.g.
jq -r '([.id,.type],[.id,.group])
| map(if type == "string" then gsub("\n";"\\n") else . end)
| @csv'

Related

Splitting large JSON data using Unix command Split

Issue with the Unix split command for splitting large data: split -l 1000 file.json myfile. I want to split this file into multiple files of 1000 records each, but I'm getting the output as a single file - no change.
P.S. The file was created by converting a Pandas DataFrame to JSON.
Edit: It turns out that my JSON is formatted in such a way that it contains only one line. wc -l file.json returns 0.
Here is the sample: file.json
[
{"id":683156,"overall_rating":5.0,"hotel_id":220216,"hotel_name":"Beacon Hill Hotel","title":"\u201cgreat hotel, great location\u201d","text":"The rooms here are not palatial","author_id":"C0F"},
{"id":692745,"overall_rating":5.0,"hotel_id":113317,"hotel_name":"Casablanca Hotel Times Square","title":"\u201cabsolutely delightful\u201d","text":"I travelled from Spain...","author_id":"8C1"}
]
Invoking jq once per partition plus once to determine the number of partitions would be extremely inefficient. The following solution suffices to achieve the partitioning deemed acceptable in your answer:
jq -c ".[]" file.json | split -l 1000
If, however, it is deemed necessary for each file to be pretty-printed, you could run jq -s . for each file, which would still be more efficient than running .[N:N+S] multiple times.
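For instance, a minimal sketch of that step, assuming split's default output names (xaa, xab, ...):
for f in x??; do
  jq -s . "$f" > "$f.json" && rm "$f"   # slurp each chunk of JSONL back into one pretty-printed array
done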
If each partition should itself be a single JSON array, then see Splitting / chunking JSON files with JQ in Bash or Fish shell?
After asking elsewhere, it turned out the file was, in fact, a single line.
Reformatting with jq (in compact form) would enable the split, though to process the resulting files you would at least need to delete the first and last characters (or add '[' and ']' to the split files).
I'd recommend splitting the JSON array with jq (see the manual).
cat file.json | jq length             # get the length of the array
cat file.json | jq -c '.[0:1000]'     # first 1000 items
cat file.json | jq -c '.[1000:2000]'  # second 1000 items
...
Notice -c for compact result (not pretty printed).
For automation, you can code a simple bash script to split your file into chunks given the array length (jq length).
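A minimal sketch of such a script (illustrative file and chunk-size names; note that it still rereads the whole file once per chunk, which is the inefficiency pointed out in the other answer):
#!/bin/bash
file=file.json
size=1000
len=$(jq length "$file")
for ((i = 0; i < len; i += size)); do
  jq -c ".[$i:$((i + size))]" "$file" > "chunk_$((i / size)).json"  # one array per chunk
done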

bash script to combine multiple json files into one json

I have a folder that contains subfolders of json files.
I need to write a bash script that combine all json files into one big valid json.
1) First I tried to use jq to combine all json files within each directory, planning to later combine those into one big file again.
I didn't manage to make it work. I used this command:
jq -rs 'reduce .[] as $item ({}; . * $item)'
2) The other option is to create a json file that starts with "[" --> process all files from all directories, appending the content of each --> append "]" at the end.
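(An untested sketch of what I mean, assuming each file holds exactly one JSON value and $target points at the top folder; big.json is just a placeholder name:)
{
  printf '['
  sep=''
  for f in "$target"/*/*.json; do
    printf '%s' "$sep"   # comma between elements, none before the first
    cat "$f"
    sep=','
  done
  printf ']'
} > big.json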
Can I achieve the same result with the first way, using jq only?
A very simple way is:
jq -s 'flatten' $target/*/*.json > $merged_json
An alternative (in case you need to use a pipe):
cat $target/*/*.json | jq -s 'flatten' > $merged_json
Or, if there are too many files for the shell to expand:
find $target -name '*.json' -exec cat {} + | jq -s 'flatten' > $merged_json

Extract JSON values from a txt file and write them separated by commas

I have a txt file with a curl response containing information on thousands of downloaded files and the year in which they were downloaded.
I have tried unsuccessfully (with sed+grep) to extract the filename and the year and write them to a separate file ("filename+year.txt"), separated by a comma.
{"status_code":"200",
"status_message":"Results found.",
"results":[{"filename":"test189.pdf",
"year":"2012",
"URL":"https:\/\/www.orkistar.org\/random.php?q=iper.pdf&y=2012"
}
......
Any idea?
Use a JSON-aware tool, e.g. jq:
jq -r '.results[] as $r | $r.filename + "," + $r.year' < file.json
jq has a filter for converting to CSV. Using it ensures various edge cases are handled appropriately, assuming the goal is to generate valid CSV:
jq -r '.results[] | [.filename, .year] | @csv' file.json
In any case, notice that there is no need to introduce any named variables.
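For instance, the first answer above can be written without the variable:
jq -r '.results[] | .filename + "," + .year' < file.json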

Is there a method to obtain a diff for `JSON Lines` files?

Is there a method to obtain a diff for JSON Lines files? In case there's confusion, by "JSON Lines", I mean the format described here, which basically requires that every line is a valid JSON structure. Anyway, there's an answer here that discusses using jq in order to diff two different JSON files.
However, there the question wanted the diff not to consider within-list ordering, whereas I do care about that ordering. In addition, the answers contain jq scripts that just give a true or false response and do not give a full diff. Ideally, I'd like a full diff. There is a project called json-diff that does diff JSON files, but it only works on a single JSON entity, not with JSON Lines.
To reiterate, is there a method or something like a jq script that can obtain a diff for JSON lines formatted files?
If I understand the question correctly, the following should do the job. I'll assume you have access to jq 1.5, which includes the filter walk/1 (if that is not the case, it's easy to supplement the file below with the definition, which can be found on the web, e.g. the src/builtin.jq file), and that you have a reasonably modern Mac or Linux-like shell.
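For reference, here is that definition, as it appears in jq's src/builtin.jq:
def walk(f):
  . as $in
  | if type == "object" then
      reduce keys[] as $key ( {}; . + { ($key): ($in[$key] | walk(f)) } ) | f
    elif type == "array" then map( walk(f) ) | f
    else f
    end;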
(1) Create a file called (let's say) jq-diff.jq with these two lines:
def sortKeys: to_entries | sort | from_entries;
walk( if type == "object" then sortKeys else . end )
(2) Assuming the two files with JSON entities in them are FILE1 and FILE2, then run one of the following commands, depending on whether you want the JSON entities within each file to be sorted:
diff <(jq -cf jq-diff.jq FILE1 | sort) <(jq -cf jq-diff.jq FILE2 | sort)
# OR:
diff <(jq -cf jq-diff.jq FILE1) <(jq -cf jq-diff.jq FILE2)
Brief explanation:
The role of jq here is to sort the keys in the objects (without sorting the arrays) and to print them in a standard way, one per line (courtesy of the -c option).
You can use the -s flag to slurp your newline-separated JSON objects into a JSON array containing them, thus making them eligible for comparison with json-diff.
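For example (a sketch, assuming json-diff is installed and accepts arbitrary file paths, such as bash process substitutions):
json-diff <(jq -s . FILE1) <(jq -s . FILE2)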

Read a json file with bash

I would like to read the json file from http://freifunk.in-kiel.de/alfred.json in bash and separate it into files named by the hostname of each element in that json string.
How do I read json with bash?
How do I read json with bash?
You can use jq for that. The first thing to do is extract the list of hostnames. Then, looping over that list, run another query for each hostname to extract the matching element and save it through redirection, with the filename likewise based on the hostname.
The easiest way to do this is with two instances of jq -- one listing hostnames, and another (inside the loop) extracting individual entries.
This is, alas, a bit inefficient (since it means rereading the file from the top for each record to extract).
while read -r hostname; do
[[ $hostname = */* ]] && continue # paranoia; see comments
jq --arg hostname "$hostname" \
'.[] | select(.hostname == $hostname)' <alfred.json >"out-${hostname}.json"
done < <(jq -r '.[] | .hostname' <alfred.json)
(The out- prefix prevents alfred.json from being overwritten if it includes an entry for a host named alfred).
You can use a Python one-liner in a similar way (I haven't checked):
curl -s http://freifunk.in-kiel.de/alfred.json | python -c '
import json, sys
tbl=json.load(sys.stdin)
for t in tbl:
    with open(tbl[t]["hostname"], "w") as fp:
        json.dump(tbl[t], fp)
'