A potentially huge JSON Lines file with objects of known structure is to be converted to CSV with headers.
Example:
{"name":"name_0","value_a":"value_a_0","value_b":"val_b_0"}
{"name":"name_1","value_a":"value_a_1","value_b":"val_b_1"}
{"name":"name_2","value_a":"value_a_2","value_b":"val_b_2"}
{"name":"name_3","value_a":"value_a_3","value_b":"val_b_3"}
{"name":"name_4","value_a":"value_a_4","value_b":"val_b_4"}
Expected output:
"name","value_a","value_b"
"name_0","value_a_0","val_b_0"
"name_1","value_a_1","val_b_1"
"name_2","value_a_2","val_b_2"
"name_3","value_a_3","val_b_3"
"name_4","value_a_4","val_b_4"
What I have tried so far:
(if (input_line_number == 1 ) then ([.|to_entries|.[].key]|@csv) else empty end),
(.|to_entries|[.[].value]|@csv )
However, this relies on the key order in the JSON.
As an alternative, I replaced it with a version that selects the values directly in the order I want:
(if (input_line_number == 1 ) then ("\"name\",\"value_a\",\"value_b\"") else empty end), (.|[.name?,.value_a?,.value_b?]|@csv )
Is there a better solution, especially for the if, which feels bulky?
I mainly want to avoid slurp because it loads the whole file into memory.
Don't overthink it: add a fixed header and use inputs together with -n/--null-input to format the actual content (-r prints the CSV rows as raw text rather than as JSON strings):
jq -rn '["name", "value_a", "value_b"],
(inputs | [.name?, .value_a?, .value_b?])
| @csv' input.json
Output:
"name","value_a","value_b"
"name_0","value_a_0","val_b_0"
"name_1","value_a_1","val_b_1"
"name_2","value_a_2","val_b_2"
"name_3","value_a_3","val_b_3"
"name_4","value_a_4","val_b_4"
This isn't jq, but I'm adding it because I think it's worth knowing about.
Using Miller, run
mlr --j2c cat input.jsonl >output.csv
you get
name,value_a,value_b
name_0,value_a_0,val_b_0
name_1,value_a_1,val_b_1
name_2,value_a_2,val_b_2
name_3,value_a_3,val_b_3
name_4,value_a_4,val_b_4
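For comparison with the jq and Miller answers, here is a minimal Python sketch of the same streaming idea: write a fixed header once, then convert one line at a time so the whole file never sits in memory. Python is not used in this thread; the function name jsonl_to_csv and the FIELDS constant are made up for illustration, and the sketch assumes every line is a single JSON object.

```python
import csv
import io
import json

# Fixed column order, mirroring the header used in the jq answer.
FIELDS = ["name", "value_a", "value_b"]

def jsonl_to_csv(lines, out, fields=FIELDS):
    """Write a header row, then one CSV row per JSON line, reading lazily."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(fields)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        obj = json.loads(line)
        # Values are picked by key, so key order in the input does not matter.
        # A missing key yields None, which is written as an empty field.
        writer.writerow([obj.get(f) for f in fields])

# Small in-memory demo; with a real file, pass the open file object instead.
sample = io.StringIO(
    '{"name":"name_0","value_a":"value_a_0","value_b":"val_b_0"}\n'
    '{"value_b":"val_b_1","name":"name_1","value_a":"value_a_1"}\n'
)
result = io.StringIO()
jsonl_to_csv(sample, result)
print(result.getvalue(), end="")
```

Note that csv.QUOTE_ALL quotes every field, whereas jq's @csv leaves numbers unquoted, so the two outputs only match exactly when all values are strings, as in the example data.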
Related
I have a JSON file and I am extracting data from it using jq. One simple use case is pulling out any JSON Object that contains an Id which is provided as an argument.
I use the following simple script to do so:
[.[] | select(.id == $ID)]
The script is stored in a separate file (by_id.jq) which I pass in using the -f argument.
The full command looks something like this:
cat ./my_json_file.json | jq -sf --arg ID "8df993c1-57d5-46b3-a8a3-d95066934e5b" ./by_id.jq
Is there a way, using only jq, to pass a comma-separated list of values as an argument to the jq script, iterate through the ids, and check them against the value of .id in the JSON file, so that the result is the objects that have one of those ids?
For example if I wanted to pull out three objects by their ids I would want to structure the command in this way:
cat ./my_json_file.json | jq -sf --arg ID "8df993c1-57d5-46b3-a8a3-d95066934e5b,1d5441ca-5758-474d-a9fc-40d0f68aa538,23cc618a-8ad4-4141-bc1c-0251y0663963" ./by_id.jq
Sure, though you'll need to parse (split) that list of ids into something that jq can work with, such as an array of ids. Then your problem becomes: given an array of keys, select the objects that have any of these ids, for which you could use the approaches found here.
$ jq --arg ID '8df993c1-57d5-46b3-a8a3-d95066934e5b,1d5441ca-5758-474d-a9fc-40d0f68aa538,23cc618a-8ad4-4141-bc1c-0251y0663963' '
select(.id | IN($ID|split(",")[]))
' ./my_json_file.json
I'm not sure what your input looks like, but judging by the way you slurp the input and then filter the resulting array, it's a stream of objects. The slurping is not necessary here.
Here is an approach that focuses on efficiency.
Your question indicates that you in fact have a stream of objects, so the first step towards efficiency is to avoid the -s option, and use -n with inputs instead.
The second step is to avoid splitting your comma-separated string of values more than once.
So your script might look like this:
INDEX($ids | splits(","); .) as $dict
| inputs
| select($dict[.id])
And the invocation would look like this:
jq -n --arg ids a,b,c -f by_id.jq
This of course assumes that simply splitting the string of ids on "," will suffice. You might need to trim the values and take care of other potential anomalies.
For efficiency, it would be better to split $ID just once.
So if you have to use the -s option, you could use the following jq program:
INDEX($ID | splits(","); .) as $dict
| .[]
| select($dict[.id])
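The "split once, then look up" idea behind INDEX can be sketched outside jq as well. The following Python snippet is only an illustration and not part of the answers here: the name select_by_ids is invented, and it assumes one JSON object per line with a string-valued "id" key.

```python
import io
import json

def select_by_ids(lines, id_list):
    """Yield objects whose "id" is in the comma-separated id_list."""
    wanted = set(id_list.split(","))  # split exactly once, before the loop
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        obj = json.loads(line)
        # Set membership is a constant-time lookup, like $dict[.id] in jq.
        if obj.get("id") in wanted:
            yield obj

# Small in-memory demo with made-up ids.
sample = io.StringIO(
    '{"id":"a","n":1}\n'
    '{"id":"b","n":2}\n'
    '{"id":"c","n":3}\n'
)
matches = list(select_by_ids(sample, "a,c"))
print(matches)
```

As with the jq version, this assumes that splitting on "," is enough; ids with surrounding whitespace would need to be trimmed first.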
I have a file which contains many JSON arrays. I need to find out whether the length of any value in any of the arrays exceeds a limit, say 1000. If it does, I have to trim that particular value to the limit. After that, the file will be fed to a downstream application. What is the best way to implement this in a shell script? I tried jq and sed, but they don't seem to work. Maybe I haven't explored them completely. Any suggestion on this use case will be highly appreciated!
Unfortunately the originally posted question is rather vague on a number of points, so I'll first focus on determining whether an arbitrary JSON document has a string value (excluding key names) that exceeds a certain given size.
To find the maximum of a stream of numbers, we can write:
def max(stream): reduce stream as $s (null;
if $s > . then $s else . end);
Let us suppose the above def, together with the following line, is in a file named max.jq:
max( .. | strings | length) > $mx
Then we could find the answer by running a command such as:
jq --argjson mx 4 -f max.jq INPUT.json
A shorter but possibly less space-efficient answer:
jq --argjson mx 4 '[..|strings|length]|max > $mx' INPUT.json
Variants
There are many possible variants, e.g. you might want to arrange things so that jq returns a suitable return code rather than emitting a boolean value.
Truncating long strings
To truncate strings longer than a given length, say $mx, you could use walk/1, like so:
walk(if type == "string" and length > $mx
then .[:$mx] else . end)
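To make the recursion that walk/1 performs more explicit, here is a rough Python equivalent. It is a sketch only: the function name truncate_strings is invented, and like the jq filter it truncates string values while leaving object keys and non-string values untouched.

```python
def truncate_strings(value, mx):
    """Return a copy of a parsed JSON value with every string cut to mx characters."""
    if isinstance(value, str):
        return value[:mx]  # slicing leaves shorter strings unchanged
    if isinstance(value, list):
        return [truncate_strings(item, mx) for item in value]
    if isinstance(value, dict):
        # Only the values are visited; the keys are kept as they are.
        return {key: truncate_strings(item, mx) for key, item in value.items()}
    return value  # numbers, booleans and null pass through

doc = {"a": "abcdefgh", "b": ["xy", "123456"], "c": 12345678}
print(truncate_strings(doc, 4))
```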
I have a json file like this:
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"234432","cust_name":"ghi"}
{"caller_id":"123321","cust_name":"abc"}
....
I tried:
jq -s 'unique_by(.field1)'
but this removes all the duplicated items. I'm looking to keep just one of each set of duplicates, so that the file looks like this:
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"234432","cust_name":"ghi"}
....
With field1, I doubt you are getting anything useful in the output, since there is no key with that name in the input: every object maps to null, so only one object survives. If you simply change your command to jq -s 'unique_by(.caller_id)', it will give you the desired result: unique objects, sorted by the caller_id key. The result is guaranteed to contain exactly one object for each caller_id.
NOTE: This is the same as what @Jeff Mercado explained in the comments.
If the file consists of a sequence (stream) of JSON objects, then a very simple way to produce a stream of the distinct objects would be to use the invocation:
jq -s 'unique[]'
A similar alternative would be:
jq -n '[inputs] | unique[]'
For large files, however, the above will probably be too inefficient, both with respect to RAM and run-time. Note that both unique and unique_by entail a sort.
A far better alternative would be to take advantage of the fact that the input is a stream, and to avoid the built-in unique and unique_by filters. This can be done with the assistance of the following filters, which are not yet built-in but likely to become so:
# emit a dictionary
def set(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);
# distinct entities in the stream s
def distinct(s): set(s)[];
We now have only to add:
distinct(inputs)
to achieve the objective, provided jq is invoked with the -n command-line option.
This approach will also preserve the original ordering.
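The same sort-free, order-preserving idea can be sketched in Python with a set of canonical serializations. This is an illustration under assumptions, not a translation of the jq definitions: the name distinct is reused only for readability, objects are treated as equal regardless of key order, and each new object is emitted as soon as it is first seen.

```python
import io
import json

def distinct(lines):
    """Yield each distinct JSON value once, in order of first appearance."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        obj = json.loads(line)
        # Canonical text form: sorted keys, no extra whitespace.
        key = json.dumps(obj, sort_keys=True, separators=(",", ":"))
        if key not in seen:
            seen.add(key)
            yield obj

sample = io.StringIO(
    '{"caller_id":"123321","cust_name":"abc"}\n'
    '{"caller_id":"123443","cust_name":"def"}\n'
    '{"caller_id":"123321","cust_name":"abc"}\n'
    '{"caller_id":"234432","cust_name":"ghi"}\n'
    '{"caller_id":"123321","cust_name":"abc"}\n'
)
unique_objects = list(distinct(sample))
for obj in unique_objects:
    print(json.dumps(obj, separators=(",", ":")))
```

Memory use grows with the number of distinct objects rather than with the size of the input, which is the point of avoiding a slurp-and-sort approach.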
If the input is an array ...
If the input is an array, then using distinct as defined above still has the advantage of not requiring a sort. For arrays that are too large to fit comfortably in memory, it would be advisable to use jq's streaming parser to create a stream.
One possibility would be to proceed in two steps (jq --stream .... | jq -n ...), but it might be better to do everything in one step (jq -cn --stream ...), using the following "main" program:
distinct(fromstream(inputs
| (.[0] |= .[1:] )
| select(. != [[]])))
I have a lot of rather large JSON logs which need to be imported into several DB tables.
I can easily parse them and create 1 CSV for import.
But how can I parse the JSON and get 2 different CSV files as output?
Simple (nonsense) example:
testJQ.log
{"id":1234,"type":"A","group":"games"}
{"id":5678,"type":"B","group":"cars"}
using
cat testJQ.log|jq --raw-output '[.id,.type,.group]|@csv'>testJQ.csv
I get one file testJQ.csv
1234,"A","games"
5678,"B","cars"
But I would like to get this
types.csv
1234,"A"
5678,"B"
groups.csv
1234,"games"
5678,"cars"
Can this be done without having to parse the JSON twice, first time creating the types.csv and second time the groups.csv like this?
cat testJQ.log|jq --raw-output '[.id,.type]|@csv'>types.csv
cat testJQ.log|jq --raw-output '[.id,.group]|@csv'>groups.csv
I suppose one way you could hack this up is to output the contents of one file to stdout and the others to stderr and redirect to separate files. Of course you're limited to two files though.
$ <testJQ.log jq -r '([.id,.type]|@csv),([.id,.group]|@csv|stderr|empty)' \
1>types.csv 2>groups.csv
stderr outputs to stderr but the value propagates to the output, so you'll want to follow that up with empty to swallow that up.
Personally I wouldn't recommend doing this, I would just write a python script (or other language) to parse this if you needed to output to multiple files.
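A script along those lines could look roughly like the sketch below, which parses each line once and writes a row to each of two outputs. The function name split_csv is invented, the field names come from the example log in the question, and the demo writes to in-memory buffers so that it stays self-contained.

```python
import csv
import io
import json

def split_csv(lines, types_out, groups_out):
    """Parse each JSON line once and write one row to each CSV output."""
    # QUOTE_NONNUMERIC quotes strings but leaves numbers bare, e.g. 1234,"A".
    options = {"quoting": csv.QUOTE_NONNUMERIC, "lineterminator": "\n"}
    types_writer = csv.writer(types_out, **options)
    groups_writer = csv.writer(groups_out, **options)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        obj = json.loads(line)
        types_writer.writerow([obj["id"], obj["type"]])
        groups_writer.writerow([obj["id"], obj["group"]])

log = io.StringIO(
    '{"id":1234,"type":"A","group":"games"}\n'
    '{"id":5678,"type":"B","group":"cars"}\n'
)
types_csv, groups_csv = io.StringIO(), io.StringIO()
split_csv(log, types_csv, groups_csv)
print(types_csv.getvalue(), end="")
print(groups_csv.getvalue(), end="")
```

With real files, the two buffers would be replaced by file objects opened for writing with newline="" (as the csv module documentation recommends), and the approach extends to any number of outputs.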
You will either need to run jq twice, or to run jq in conjunction with another program to "split" the output of the call to jq. For example, you could use a pipeline of the form: jq -c ... | awk ...
The potential disadvantage of the pipeline approach is that if JSON is the final output, it will be JSONL; but obviously that doesn't apply here.
There are many ways to craft such a pipeline. For example, assuming there are no raw newlines in the CSV:
< testJQ.log jq -r '
"types", ([.id,.type] |@csv),
"groups", ([.id,.group]|@csv)' |
awk 'NR % 2 == 1 {out=$1; next} {print >> out".csv"}'
Or:
< testJQ.log jq -r '([.id,.type],[.id,.group])|@csv' |
awk '{ out = ((NR % 2) == 1) ? "types" : "groups"; print >> out".csv"}'
For other examples, see e.g.
Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?
Splitting / chunking JSON files with JQ in Bash or Fish shell?
Split JSON into multiple files
Handling raw new-lines
Whether or not you split the CSV into multiple files, there is a potential issue with embedded raw newlines. One approach is to change "\n" in JSON strings to "\\n", e.g.
jq -r '([.id,.type],[.id,.group])
| map(if type == "string" then gsub("\n";"\\n") else . end)
| @csv'
Is there a method to obtain a diff for JSON Lines files? In case there's confusion, by "JSON Lines", I mean the format described here, which basically requires that every line is a valid JSON structure. Anyway, there's an answer here that discusses using jq in order to diff two different JSON files.
However, the question there wanted the diff to ignore within-list ordering, whereas I do care about that ordering. In addition, the answers contain jq scripts that just give a true or false response and do not give a full diff. Ideally, I'd like a full diff. There is a project called json-diff that does diff JSON files, but it only works for a single JSON entity, not with JSON Lines.
To reiterate, is there a method or something like a jq script that can obtain a diff for JSON lines formatted files?
If I understand the question correctly, the following should do the job. I'll assume you have access to jq 1.5, which includes the filter walk/1 (if that is not the case, it's easy to supplement the file below with the definition, which can be found on the web, e.g. the src/builtin.jq file), and that you have a reasonably modern Mac or Linux-like shell.
(1) Create a file called (let's say) jq-diff.jq with these two lines:
def sortKeys: to_entries | sort | from_entries;
walk( if type == "object" then sortKeys else . end )
(2) Assuming the two files with JSON entities in them are FILE1 and FILE2, then run one of the following commands, depending on whether you want the JSON entities within each file to be sorted:
diff <(jq -cf jq-diff.jq FILE1 | sort) <(jq -cf jq-diff.jq FILE2 | sort)
# OR:
diff <(jq -cf jq-diff.jq FILE1) <(jq -cf jq-diff.jq FILE2)
Brief explanation:
The role of jq here is to sort the keys in the objects (without sorting the arrays) and to print them in a standard way, one per line (courtesy of the -c option).
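The same normalize-then-diff pipeline can be sketched in Python, in case diff or process substitution is not available. This is an illustration only: the names canonical_lines and jsonl_diff are invented, and difflib's unified diff stands in for the diff command.

```python
import difflib
import json

def canonical_lines(lines):
    """Re-serialize each JSON line with sorted keys; array order is preserved."""
    return [
        json.dumps(json.loads(line), sort_keys=True, separators=(",", ":"))
        for line in lines
        if line.strip()
    ]

def jsonl_diff(a_lines, b_lines):
    """Return a unified diff of two JSON Lines inputs after key normalization."""
    return list(
        difflib.unified_diff(
            canonical_lines(a_lines),
            canonical_lines(b_lines),
            fromfile="FILE1",
            tofile="FILE2",
            lineterm="",
        )
    )

file1 = ['{"b":1,"a":[2,1]}', '{"x":1}']
file2 = ['{"a":[2,1],"b":1}', '{"x":2}']
for row in jsonl_diff(file1, file2):
    print(row)
```

The first pair of lines differs only in key order, so it does not show up as a change; the second pair differs in value and does. Lists are compared in their original order, which matches the requirement in the question.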
You can use the -s flag to slurp your newline-separated JSON objects into a JSON array containing them, thus making them eligible for comparison with json-diff.