I have a huge JSON file that contains records like this:
{"callsign":"abc","kruxSegmentIds":{"0":"q2d9nn1qv","1":"rle4kfgsf"},"liveFlag":"Y"}
I need to rename the keys inside the nested object "kruxSegmentIds" so that "0" becomes "zero" and "1" becomes "one", like this:
{"callsign":"abc","kruxSegmentIds":{"zero":"q2d9nn1qv","one":"rle4kfgsf"},"liveFlag":"Y"}
Is this possible using sed? I would rather not write a script, because the file is huge and may not fit into memory.
Any help/support is greatly appreciated.
From the problem description (and from the fact that the proposed awk solution has been accepted), it seems clear that although the file itself is large, each JSON document is relatively small, or at least small enough to fit in memory. If that is indeed the case, then a straightforward solution using jq would have similar performance characteristics to a sed or awk solution, but without the potential complications. Here therefore is such a solution:
jq '.kruxSegmentIds |= with_entries(.key |= if .=="0" then "zero" elif .=="1" then "one" else . end)'
If jq empty hugefile fails because of the file's size, then jq might still be useful because of its streaming parser, which is designed precisely for such cases.
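For comparison, here is a minimal Python sketch of the same line-at-a-time renaming. It is not part of the original answer; it assumes the file holds one JSON document per line, and the names KEY_MAP, rename_segment_keys and transform_lines are made up for illustration.

```python
import json

# Hypothetical mapping, mirroring the if/elif chain in the jq filter.
KEY_MAP = {"0": "zero", "1": "one"}

def rename_segment_keys(record):
    """Rename the keys of record["kruxSegmentIds"] according to KEY_MAP."""
    segments = record.get("kruxSegmentIds")
    if isinstance(segments, dict):
        record["kruxSegmentIds"] = {KEY_MAP.get(k, k): v for k, v in segments.items()}
    return record

def transform_lines(lines):
    """Yield one compact JSON string per non-blank input line.

    Only the current line is decoded, so memory use does not grow
    with the size of the file.
    """
    for line in lines:
        if line.strip():
            record = rename_segment_keys(json.loads(line))
            yield json.dumps(record, separators=(",", ":"))
```

Passing an open file object as lines keeps the whole thing streaming, since Python reads the file one line at a time.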
Variations
In the comments, the OP posted another example, so it might be useful to define a filter for performing the key-to-key transformation:
def twiddle:
with_entries(.key |= if .=="0" then "zero" elif .=="1" then "one" else . end);
With this, the solution to the original problem is:
.kruxSegmentIds |= twiddle
and the solution to the variant is:
(.users.L3AVIcqaDpZxLf6ispK.kruxSegmentIds) |= twiddle
Generalizing even further, if the task is to perform the transformation on all objects, wherever they occur, the solution is:
walk(if type == "object" then twiddle else . end)
If your jq does not have walk pre-defined, then you can snarf its def from https://raw.githubusercontent.com/stedolan/jq/master/src/builtin.jq
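To make the "all objects, wherever they occur" idea concrete, here is a rough Python equivalent of the walk-based filter. It is only a sketch: KEY_MAP and twiddle_all are illustrative names, and the function works on an already decoded document rather than on the file itself.

```python
# Hypothetical mapping, same as in the twiddle filter.
KEY_MAP = {"0": "zero", "1": "one"}

def twiddle_all(value):
    """Return a copy of value with keys renamed in every nested object."""
    if isinstance(value, dict):
        # Rename this object's keys and recurse into its values.
        return {KEY_MAP.get(k, k): twiddle_all(v) for k, v in value.items()}
    if isinstance(value, list):
        return [twiddle_all(v) for v in value]
    # Strings, numbers, booleans and None are returned unchanged.
    return value
```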
Related
A potentially huge JSON Lines file with objects of known structure is to be converted to CSV with headers.
Example:
{"name":"name_0","value_a":"value_a_0","value_b":"val_b_0"}
{"name":"name_1","value_a":"value_a_1","value_b":"val_b_1"}
{"name":"name_2","value_a":"value_a_2","value_b":"val_b_2"}
{"name":"name_3","value_a":"value_a_3","value_b":"val_b_3"}
{"name":"name_4","value_a":"value_a_4","value_b":"val_b_4"}
Expected output:
"name","value_a","value_b"
"name_0","value_a_0","val_b_0"
"name_1","value_a_1","val_b_1"
"name_2","value_a_2","val_b_2"
"name_3","value_a_3","val_b_3"
"name_4","value_a_4","val_b_4"
Currently tried:
(if (input_line_number == 1 ) then ([.|to_entries|.[].key]|@csv) else empty end),
(.|to_entries|[.[].value]|@csv )
However, this relies on the order of the keys in the JSON.
As an alternative, I replaced it with directly selecting the values in the order I want:
(if (input_line_number == 1 ) then ("\"name\",\"value_a\",\"value_b\"") else empty end), (.|[.name?,.value_a?,.value_b?]|@csv )
Is there a better solution? Especially regarding the if, which feels bulky.
I mainly want to avoid slurping, because that would load the whole file into memory.
Don't overthink it; add a fixed header and use inputs together with -n/--null-input to format the actual content (-r/--raw-output prints the CSV rows as raw text rather than as JSON strings):
jq -nr '["name", "value_a", "value_b"],
(inputs | [.name?, .value_a?, .value_b?])
| @csv' input.json
Output:
"name","value_a","value_b"
"name_0","value_a_0","val_b_0"
"name_1","value_a_1","val_b_1"
"name_2","value_a_2","val_b_2"
"name_3","value_a_3","val_b_3"
"name_4","value_a_4","val_b_4"
This isn't jq, but I'm adding it because I think it's worth knowing about.
Using Miller, run
mlr --j2c cat input.jsonl >output.csv
you get
name,value_a,value_b
name_0,value_a_0,val_b_0
name_1,value_a_1,val_b_1
name_2,value_a_2,val_b_2
name_3,value_a_3,val_b_3
name_4,value_a_4,val_b_4
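The fixed-header approach from the jq answer above can also be sketched in Python, for readers who want to see the streaming logic spelled out. This is an illustration, not something from either answer: FIELDS and jsonl_to_csv are made-up names, and csv.QUOTE_ALL quotes every field (including numbers and missing values), which differs slightly from jq's @csv.

```python
import csv
import json

# Column order is fixed up front, so it does not depend on key order in the input.
FIELDS = ["name", "value_a", "value_b"]

def jsonl_to_csv(lines, out):
    """Write a header row, then one CSV row per non-blank JSON line."""
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(FIELDS)
    for line in lines:
        if line.strip():
            record = json.loads(line)
            # Missing keys become empty fields, much like .name? yielding null in jq.
            writer.writerow([record.get(field) for field in FIELDS])
```

Because rows are written as they are read, only one record is in memory at any time.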
I have the following structure:
{"ID":"XX","guid":1}
{"ID":"YY","guid":2}
...
I have tried running:
jq 'sort_by(.guid)' conn.json
I however get an error:
Cannot index string with string "guid"
Please can you advise how I'd sort the file by guid and/or find the record where guid is the largest?
UPDATE
What I am actually looking for is the record where the GUID is the largest in the dataset. Thought sorting it would help me but it's proving to be very slow
Thanks
sort_by assumes its input is iterable, and expands it by applying .[] before sorting its members. You're providing a stream of objects to it, and each object expands to a stream of non-indexable values ("XX", 1 etc.) in this case, thus .guid fails.
Slurp them into a single array to make it work, e.g.:
jq -s 'sort_by(.guid)[]' conn.json
To extract the object with the largest GUID, you don't need to sort the slurped input at all; for such tasks, jq has max_by, e.g.:
jq -s 'max_by(.guid)' conn.json
and reduce, which is better suited to large inputs because it eliminates the need for slurping. Note the -n option: without it, the first record is consumed as the program's input and never takes part in the comparison.
jq -n 'reduce inputs as $in (input; if $in.guid > .guid then $in else . end)' conn.json
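The reduce above is just a running maximum. As a sketch of the same idea in Python (assuming one JSON object per line, each with a guid field; max_by_guid is an illustrative name):

```python
import json

def max_by_guid(lines):
    """Return the record with the largest guid, or None if there are no records."""
    best = None
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Keep only the best record seen so far; nothing else is retained.
        if best is None or record["guid"] > best["guid"]:
            best = record
    return best
```

Like the jq version, this makes a single pass and holds at most two records in memory.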
I have a file which contains many JSON arrays. I need to find out whether the length of any value in any of the arrays exceeds a limit, say 1000. If it does, I have to trim that particular value to the limit. After that, the file will be fed to a downstream application. What is the best way to implement this in a shell script? I tried jq and sed, but they don't seem to work. Maybe I haven't explored them fully. Any suggestion on this use case will be highly appreciated!
Unfortunately the originally posted question is rather vague on a number of points, so I'll first focus on determining whether an arbitrary JSON document has a string value (excluding key names) that exceeds a certain given size.
To find the maximum of a stream of numbers, we can write:
def max(stream): reduce stream as $s (null;
if $s > . then $s else . end);
Let us suppose the above def, together with the following line, is in a file named max.jq:
max( .. | strings | length) > $mx
Then we could find the answer by running a command such as:
jq --argjson mx 4 -f max.jq INPUT.json
A shorter but possibly less space-efficient answer
jq --argjson mx 4 '[..|strings|length]|max > $mx' INPUT.json
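For readers more at home in Python, the check can be sketched as follows. This is an illustration only: string_lengths and exceeds are made-up names, the input is an already decoded document, and the generator stops at the first over-long string instead of computing the full maximum.

```python
def string_lengths(value):
    """Yield the length of every string value (not key) in a decoded JSON document."""
    if isinstance(value, str):
        yield len(value)
    elif isinstance(value, list):
        for item in value:
            yield from string_lengths(item)
    elif isinstance(value, dict):
        # Only values are inspected; key names are ignored.
        for item in value.values():
            yield from string_lengths(item)

def exceeds(value, limit):
    """Return True if any string value is longer than limit."""
    return any(length > limit for length in string_lengths(value))
```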
Variants
There are many possible variants, e.g. you might want to arrange things so that jq returns a suitable return code rather than emitting a boolean value.
Truncating long strings
To truncate strings longer than a given length, say $mx, you could use walk/1, like so:
walk(if type == "string" and length > $mx
then .[:$mx] else . end)
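The same truncation can be sketched in Python as a recursive copy. The name truncate_strings is illustrative, and, as with the walk filter, only string values are shortened while key names are left alone.

```python
def truncate_strings(value, limit):
    """Return a copy of value in which every string value has at most limit characters."""
    if isinstance(value, str):
        return value[:limit]
    if isinstance(value, list):
        return [truncate_strings(item, limit) for item in value]
    if isinstance(value, dict):
        return {key: truncate_strings(item, limit) for key, item in value.items()}
    # Numbers, booleans and None pass through unchanged.
    return value
```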
I want to aggregate the JSON objects found on each line of a file by date and account. There may be multiple records with the same date and account, so the counts have to be summed per date and account_no.
Sample file:
{"date":"2019-04-01","count":0,"account_no":"1591"}
{"date":"2019-04-01","count":1,"account_no":"1592"}
Please suggest a solution.
The file contains about 2.5 crore (25 million) JSON objects.
jq using inputs is a good way to go.
First, here's a generic stream-oriented sigma_by function:
# In this formulation, f must either always evaluate to a string or
# always to an integer, it being understood that negative integers
# might be problematic
def sigma_by(s; f; g):
reduce s as $x (null; .[$x|f] += ($x|g));
Then a solution could be achieved by:
sigma_by(inputs; "\(.date):\(.account_no)"; .count)
provided the -n command-line option is used.
Output
With the sample input, the output would be:
{
"2019-04-01:1591": 0,
"2019-04-01:1592": 1
}
Variations
Needless to say, there are many possible variations. In particular, a variant of sigma_by that uses a dictionary of dictionaries might be warranted, e.g. to save space, and to avoid potential parsing issues for recovering the two "aggregate by" strings:
def sigma_by(s; a; b; g):
reduce s as $x (null; .[$x|a][$x|b] += ($x|g));
sigma_by(inputs; .date; .account_no; .count)
Note that jq's builtin "group_by" has a significant potential disadvantage for large arrays: it uses a sorting algorithm.
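As a cross-check of the dictionary-of-dictionaries variant, here is a small Python sketch of the same streaming aggregation. It assumes every line has date, account_no and count fields; the function name sigma_by_date_account is made up for this example.

```python
import json
from collections import defaultdict

def sigma_by_date_account(lines):
    """Sum count per (date, account_no), reading one JSON object per line."""
    totals = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if line.strip():
            record = json.loads(line)
            totals[record["date"]][record["account_no"]] += record["count"]
    # Convert to plain dicts so the result can be passed to json.dumps directly.
    return {date: dict(accounts) for date, accounts in totals.items()}
```

Memory use is proportional to the number of distinct date and account pairs, not to the number of input lines, and no sorting is involved.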
I have a json file like this:
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"234432","cust_name":"ghi"}
{"caller_id":"123321","cust_name":"abc"}
....
I tried:
jq -s 'unique_by(.field1)'
but this removes all the duplicated items. I'm looking to keep just one of each set of duplicates, so that the file looks like this:
{"caller_id":"123321","cust_name":"abc"}
{"caller_id":"123443","cust_name":"def"}
{"caller_id":"234432","cust_name":"ghi"}
....
With field1, I doubt you are getting anything useful in the output, since there is no key with that name. If you simply change your command to jq -s 'unique_by(.caller_id)', it will give you the desired result: the objects sorted by caller_id, with exactly one object for each caller_id.
NOTE: This is the same as what @Jeff Mercado explained in the comments.
If the file consists of a sequence (stream) of JSON objects, then a very simple way to produce a stream of the distinct objects would be to use the invocation:
jq -s 'unique[]'
A similar alternative would be:
jq -n '[inputs] | unique[]'
For large files, however, the above will probably be too inefficient, both with respect to RAM and run-time. Note that both unique and unique_by entail a sort.
A far better alternative would be to take advantage of the fact that the input is a stream, and to avoid the built-in unique and unique_by filters. This can be done with the assistance of the following filters, which are not yet built-in but likely to become so:
# emit a dictionary
def set(s): reduce s as $x ({}; .[$x | (type[0:1] + tostring)] = $x);
# distinct entities in the stream s
def distinct(s): set(s)[];
We now have only to add:
distinct(inputs)
to achieve the objective, provided jq is invoked with the -n command-line option.
This approach will also preserve the original ordering.
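The same sort-free idea can be sketched in Python with a set of already seen records. This is an illustration under two assumptions: the input has one JSON object per line, and objects that differ only in key order should count as duplicates (hence sort_keys=True, which is a choice of this sketch, not something the jq filters do). The name distinct_lines is made up.

```python
import json

def distinct_lines(lines):
    """Yield the first occurrence of each distinct JSON object, in input order."""
    seen = set()
    for line in lines:
        if not line.strip():
            continue
        # A canonical serialization serves as the dictionary key.
        key = json.dumps(json.loads(line), sort_keys=True)
        if key not in seen:
            seen.add(key)
            yield line.rstrip("\n")
```

Memory use grows with the number of distinct records only, which is usually far smaller than the file when duplicates are common.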
If the input is an array ...
If the input is an array, then using distinct as defined above still has the advantage of not requiring a sort. For arrays that are too large to fit comfortably in memory, it would be advisable to use jq's streaming parser to create a stream.
One possibility would be to proceed in two steps (jq --stream .... | jq -n ...), but it might be better to do everything in one step (jq -cn --stream ...), using the following "main" program:
distinct(fromstream(inputs
| (.[0] |= .[1:] )
| select(. != [[]])))