How to use `select` within a jq --stream command? - json

I have a very large JSON document (~100 GB) from which I am trying to extract, with jq, the objects that meet a given criterion. Because it is so large, I can't read it into memory and will need to use the --stream option.
I understand how to run a select to extract what I need when I'm not streaming, but could use some assistance in figuring out how to configure my command correctly.
Here's a sample of my document named example.json.
{
"reporting_entity_name" : "INSURANCE COMPANY",
"reporting_entity_type" : "INSURER",
"last_updated_on" : "2022-12-01",
"version" : "1.0.0",
"in_network" : [ {
"negotiation_arrangement" : "ffs",
"name" : "ER VISIT",
"billing_code_type" : "CPT",
"billing_code_type_version" : "2022",
"billing_code" : "99285",
"description" : "HIGHEST LEVEL ER VISIT",
"negotiated_rates" : [ {
"provider_groups" : [ {
"npi" : [ 111111111, 222222222],
"tin" : {
"type" : "ein",
"value" : "99-9999999"
}
} ],
"negotiated_prices" : [ {
"negotiated_type" : "negotiated",
"negotiated_rate" : 550.50,
"expiration_date" : "9999-12-31",
"service_code" : [ "23" ],
"billing_class" : "institutional"
} ]
} ]
}
]
}
I am trying to grab the in_network object where billing_code is equal to 99285.
If I were able to do this without streaming, here's how I would approach it:
jq '.in_network[] | select(.billing_code == "99285")' example.json
Expected output:
{
"negotiation_arrangement": "ffs",
"name": "ER VISIT",
"billing_code_type": "CPT",
"billing_code_type_version": "2022",
"billing_code": "99285",
"description": "HIGHEST LEVEL ER VISIT",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
111111111,
222222222
],
"tin": {
"type": "ein",
"value": "99-9999999"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 550.5,
"expiration_date": "9999-12-31",
"service_code": [
"23"
],
"billing_class": "institutional"
}
]
}
]
}
Any help on how I could configure this with the --stream option would be greatly appreciated!

If the objects from the .in_network array alone do fit into your memory, truncate at the array items (two levels deep):
jq --stream -n '
fromstream(2|truncate_stream(inputs | select(.[0][0] == "in_network")))
| select(.billing_code == "99285")
' example.json
{
"negotiation_arrangement": "ffs",
"name": "ER VISIT",
"billing_code_type": "CPT",
"billing_code_type_version": "2022",
"billing_code": "99285",
"description": "HIGHEST LEVEL ER VISIT",
"negotiated_rates": [
{
"provider_groups": [
{
"npi": [
111111111,
222222222
],
"tin": {
"type": "ein",
"value": "99-9999999"
}
}
],
"negotiated_prices": [
{
"negotiated_type": "negotiated",
"negotiated_rate": 550.5,
"expiration_date": "9999-12-31",
"service_code": [
"23"
],
"billing_class": "institutional"
}
]
}
]
}
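To see why truncating two levels deep works, it can help to look at the events that --stream emits. The following is only a sketch on a tiny made-up document (the `doc` variable and its contents are assumptions for illustration, not the poster's file), and it assumes a jq version that accepts the `2 | truncate_stream(...)` form used above:

```shell
# Tiny stand-in document, made up for illustration.
doc='{"version":"1.0.0","in_network":[{"billing_code":"99285","name":"ER VISIT"},{"billing_code":"99284","name":"OTHER"}]}'

# Each streamed event is [path, leaf] or a closing [path]. Every event that
# belongs to an array item starts with two path elements: "in_network" and an index.
printf '%s\n' "$doc" | jq -c --stream 'select(.[0][0] == "in_network")'

# Dropping those two leading path elements lets fromstream rebuild one array
# item at a time, so select only ever sees a single item.
out=$(printf '%s\n' "$doc" | jq -c --stream -n '
  fromstream(2 | truncate_stream(inputs | select(.[0][0] == "in_network")))
  | select(.billing_code == "99285")')
echo "$out"   # {"billing_code":"99285","name":"ER VISIT"}
```

The first command prints events such as [["in_network",0,"billing_code"],"99285"]; after truncation the same event becomes [["billing_code"],"99285"], which is what fromstream needs in order to emit each item as a top-level object.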

You will find jq --stream excruciatingly slow even for 10GB. Since jq is intended to complement other shell tools, I would recommend using jstream (https://github.com/bcicen/jstream), or my own jm or jm.py (https://github.com/pkoppstein/jm), to "splat" the array, and pipe the result to jq.
E.g. to achieve the same effect as your jq filter:
jm --pointer /in_network example.json |
jq 'select(.billing_code == "99285")'


Fill arrays in the first input with elements from the second based on common field

I have two files and I would need to merge the elements of the second file into an object array in the first file based on searching the reference field.
The first file:
[
{
"reference": 25422,
"order_number": "10_1",
"details" : []
},
{
"reference": 25423,
"order_number": "10_2",
"details" : []
}
]
The second file:
[
{
"record_id" : 1,
"reference": 25422,
"row_description": "descr_1_0"
},
{
"record_id" : 2,
"reference": 25422,
"row_description": "descr_1_1"
},
{
"record_id" : 3,
"reference": 25423,
"row_description": "descr_2_0"
}
]
I would like to get:
[
{
"reference": 25422,
"order_number": "10_1",
"details" : [
{
"record_id" : 1,
"reference": 25422,
"row_description": "descr_1_0"
},
{
"record_id" : 2,
"reference": 25422,
"row_description": "descr_1_1"
}
]
},
{
"reference": 25423,
"order_number": "10_2",
"details" :[
{
"record_id" : 3,
"reference": 25423,
"row_description": "descr_2_0"
}
]
}
]
Below is my code in es_func.jq file launched by this command:
jq -n --argfile f1 es_file1.json --argfile f2 es_file2.json -f es_func.jq
INDEX($f2[] ; .reference) as $details
| $f1
| map( ($details[.reference|tostring]| .row_description) as $vn
| if $vn then .details = [{"row_description" : $vn}] else . end)
For reference 25422 I get only the last record, with "row_description": "descr_1_1"; the record with "row_description": "descr_1_0" is missing:
[
{
"reference": 25422,
"order_number": "10_1",
"details": [
{
"row_description": "descr_1_1"
}
]
},
{
"reference": 25423,
"order_number": "10_2",
"details": [
{
"row_description": "descr_2_0"
}
]
}
]
I think I'm close to the solution but something is still missing. Thank you
This would be way easier if you used reduce instead.
jq 'reduce inputs[] as $rec (INDEX(.reference);
.[$rec.reference | tostring].details += [$rec]
) | map(.)' es_file1.json es_file2.json
Here's a straightforward, reduce-free solution:
jq '
group_by(.reference)
| INDEX(.[]; .[0]|.reference|tostring) as $dict
| input
| map_values(. + {details: $dict[.reference|tostring]})
' 2.json 1.json
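As a small sketch of why the original attempt kept only the last record: INDEX stores a single value per key, so a later record with the same reference overwrites the earlier one, whereas grouping first keeps them all. The inline data in `f2` is a shortened, made-up version of the second file:

```shell
# Shortened stand-in for the second file (row_description omitted).
f2='[{"record_id":1,"reference":25422},{"record_id":2,"reference":25422},{"record_id":3,"reference":25423}]'

# INDEX keeps one value per key, so record 1 is overwritten by record 2.
indexed=$(printf '%s\n' "$f2" | jq -c 'INDEX(.reference) | map_values(.record_id)')
echo "$indexed"   # {"25422":2,"25423":3}

# Grouping by reference first keeps every record for that reference.
grouped=$(printf '%s\n' "$f2" | jq -c '
  group_by(.reference) | INDEX(.[0].reference) | map_values(map(.record_id))')
echo "$grouped"   # {"25422":[1,2],"25423":[3]}
```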

Reconstructing JSON with jq

I have a JSON like this (sample.json):
{
"sheet1": [
{
"hostname": "sv001",
"role": "web",
"ip1": "172.17.0.3"
},
{
"hostname": "sv002",
"role": "web",
"ip1": "172.17.0.4"
},
{
"hostname": "sv003",
"role": "db",
"ip1": "172.17.0.5",
"ip2": "172.18.0.5"
}
],
"sheet2": [
{
"hostname": "sv004",
"role": "web",
"ip1": "172.17.0.6"
},
{
"hostname": "sv005",
"role": "db",
"ip1": "172.17.0.7"
},
{
"hostname": "vsv006",
"role": "db",
"ip1": "172.17.0.8"
}
],
"sheet3": []
}
I want to extract data like this:
sheet1
jq '(something command)' sample.json
{
"web": {
"hosts": [
"172.17.0.3",
"172.17.0.4"
]
},
"db": {
"hosts": [
"172.17.0.5"
]
}
}
Is it possible to perform the reconstruction with jq map?
(I will reuse the result for ansible inventory.)
Here's a short, straightforward and efficient solution -- efficient in part because it avoids group_by, thanks to the following generic helper function:
def add_by(f;g): reduce .[] as $x ({}; .[$x|f] += [$x|g]);
.sheet1
| add_by(.role; .ip1)
| map_values( {hosts: .} )
Output
This produces the required output:
{
"web": {
"hosts": [
"172.17.0.3",
"172.17.0.4"
]
},
"db": {
"hosts": [
"172.17.0.5"
]
}
}
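For illustration, here is the helper on its own with made-up inline data (the addresses below are placeholders, not taken from sample.json). It shows that add_by(f; g) builds an object keyed by f whose values are arrays of g, in one pass:

```shell
# add_by(f; g): for each item, append g's value to the array stored under key f.
out=$(jq -nc '
  def add_by(f; g): reduce .[] as $x ({}; .[$x|f] += [$x|g]);
  [ {"role":"web","ip1":"10.0.0.1"},
    {"role":"db", "ip1":"10.0.0.2"},
    {"role":"web","ip1":"10.0.0.3"} ]
  | add_by(.role; .ip1)')
echo "$out"   # {"web":["10.0.0.1","10.0.0.3"],"db":["10.0.0.2"]}
```

The first `+=` for a given key works because null + [x] evaluates to [x] in jq, so no explicit initialisation of each key is needed.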
If the goal is to regroup the ips by their roles within each sheet you could do this:
map_values(
reduce group_by(.role)[] as $g ({};
.[$g[0].role].hosts = [$g[] | del(.hostname, .role)[]]
)
)
Which produces something like this:
{
"sheet1": {
"db": {
"hosts": [
"172.17.0.5",
"172.18.0.5"
]
},
"web": {
"hosts": [
"172.17.0.3",
"172.17.0.4"
]
}
},
"sheet2": {
"db": {
"hosts": [
"172.17.0.7",
"172.17.0.8"
]
},
"web": {
"hosts": [
"172.17.0.6"
]
}
},
"sheet3": {}
}
https://jqplay.org/s/3VpRc5l4_m
If you want to flatten everything into a single object keeping only unique ips, you can keep most of it the same; you just need to flatten the inputs prior to grouping and remove the map_values/1 call.
$ jq -n '
reduce ([inputs[][]] | group_by(.role)[]) as $g ({};
.[$g[0].role].hosts = ([$g[] | del(.hostname, .role)[]] | unique)
)
' sample.json
{
"db": {
"hosts": [
"172.17.0.5",
"172.17.0.7",
"172.17.0.8",
"172.18.0.5"
]
},
"web": {
"hosts": [
"172.17.0.3",
"172.17.0.4",
"172.17.0.6"
]
}
}
https://jqplay.org/s/ZGj1wC8hU3

Selecting entries of a sub array

I have json that looks like:
{
"base": "abc",
"members" : [
{"fn": "maurice", "ln": "hickey"},
{"fn": "john", "ln": "smith"},
{"fn": "robin", "ln": "smith"},
...
],
"date": "2018-08-26"
}
I am trying to write a jq filter to give me the same schema output but with only a subset of the members array e.g. all the "smith"s
{
"base": "abc",
"members" : [
{"fn": "john", "ln": "smith"},
{"fn": "robin", "ln": "smith"}
],
"date": "2018-08-26"
}
Any pointers would be appreciated.
This gets the desired output.
jq '.members |= map(select(.ln == "smith"))'
It updates .members in place, keeping only the objects whose .ln is "smith"; the rest of the document is left unchanged.
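A quick way to see what the update operator contributes is to compare it with a plain pipe. This is just a sketch on a cut-down copy of the input (the `doc` variable is an assumption for illustration):

```shell
# Cut-down copy of the input from the question.
doc='{"base":"abc","members":[{"fn":"maurice","ln":"hickey"},{"fn":"john","ln":"smith"}],"date":"2018-08-26"}'

# |= rewrites .members and returns the whole document.
out=$(printf '%s\n' "$doc" | jq -c '.members |= map(select(.ln == "smith"))')
echo "$out"    # {"base":"abc","members":[{"fn":"john","ln":"smith"}],"date":"2018-08-26"}

# A plain pipe returns only the filtered array and drops base and date.
only=$(printf '%s\n' "$doc" | jq -c '.members | map(select(.ln == "smith"))')
echo "$only"   # [{"fn":"john","ln":"smith"}]
```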

jq: group and key by property

I have a list of objects that look like this:
[
{
"ip": "1.1.1.1",
"component": "name1"
},
{
"ip": "1.1.1.2",
"component": "name1"
},
{
"ip": "1.1.1.3",
"component": "name2"
},
{
"ip": "1.1.1.4",
"component": "name2"
}
]
Now I'd like to group and key that by the component and assign a list of ips to each of the components:
{
"name1": [
"1.1.1.1",
"1.1.1.2"
]
},{
"name2": [
"1.1.1.3",
"1.1.1.4"
]
}
I figured it out myself. I first group by .component and then just create new lists of ips that are indexed by the component of the first object of each group:
jq ' group_by(.component)[] | {(.[0].component): [.[] | .ip]}'
The accepted answer doesn't produce a single valid JSON document, but a stream of separate objects:
{
"name1": [
"1.1.1.1",
"1.1.1.2"
]
}
{
"name2": [
"1.1.1.3",
"1.1.1.4"
]
}
Each of the name1 and name2 objects is valid JSON on its own, but the output as a whole is not a single JSON document.
The following jq statement results in the desired output as specified in the question:
group_by(.component) | map({ key: (.[0].component), value: [.[] | .ip] }) | from_entries
Output:
{
"name1": [
"1.1.1.1",
"1.1.1.2"
],
"name2": [
"1.1.1.3",
"1.1.1.4"
]
}
Suggestions for simpler approaches are welcome.
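One possible simplification, offered only as a sketch: build the per-group objects as in the accepted answer, collect them in an array, and merge them with add. The inline `data` is a shortened, made-up version of the input:

```shell
# Shortened stand-in for the list of objects from the question.
data='[{"ip":"1.1.1.1","component":"name1"},{"ip":"1.1.1.2","component":"name1"},{"ip":"1.1.1.3","component":"name2"}]'

# map(...) yields one single-key object per group; add merges them into one object.
out=$(printf '%s\n' "$data" | jq -c '
  group_by(.component) | map({(.[0].component): map(.ip)}) | add')
echo "$out"   # {"name1":["1.1.1.1","1.1.1.2"],"name2":["1.1.1.3"]}
```

Since group_by guarantees that each component appears in exactly one group, add never has to reconcile duplicate keys here.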
If human readability is preferred over valid json, I'd suggest something like ...
jq -r 'group_by(.component)[] | "IPs for " + .[0].component + ": " + (map(.ip) | tostring)'
... which results in ...
IPs for name1: ["1.1.1.1","1.1.1.2"]
IPs for name2: ["1.1.1.3","1.1.1.4"]
As a further example of @replay's technique, after many failures using other methods, I finally built a filter that condenses this Wazuh report (excerpted for brevity):
{
"took" : 228,
"timed_out" : false,
"hits" : {
"total" : {
"value" : 2806,
"relation" : "eq"
},
"hits" : [
{
"_source" : {
"agent" : {
"name" : "100360xx"
},
"data" : {
"vulnerability" : {
"severity" : "High",
"package" : {
"condition" : "less than 78.0",
"name" : "Mozilla Firefox 68.11.0 ESR (x64 en-US)"
}
}
}
}
},
{
"_source" : {
"agent" : {
"name" : "100360xx"
},
"data" : {
"vulnerability" : {
"severity" : "High",
"package" : {
"condition" : "less than 78.0",
"name" : "Mozilla Firefox 68.11.0 ESR (x64 en-US)"
}
}
}
}
},
...
Here is the jq filter I use to provide an array of objects, each consisting of an agent name followed by an array of names of the agent's vulnerable packages:
jq ' .hits.hits |= unique_by(._source.agent.name, ._source.data.vulnerability.package.name) | .hits.hits | group_by(._source.agent.name)[] | { (.[0]._source.agent.name): [.[]._source.data.vulnerability.package | .name ]}'
Here is an excerpt of the output produced by the filter:
{
"100360xx": [
"Mozilla Firefox 68.11.0 ESR (x64 en-US)",
"VLC media player",
"Windows 10"
]
}
{
"WIN-KD5C4xxx": [
"Windows Server 2019"
]
}
{
"fridxxx": [
"java-1.8.0-openjdk",
"kernel",
"kernel-headers",
"kernel-tools",
"kernel-tools-libs",
"python-perf"
]
}
{
"mcd-xxx-xxx": [
"dbus",
"fribidi",
"gnupg2",
"graphite2",
...

Read JSON File for Records using Linq

The following is my JSON file. I need to get the fields listed for each page and for each Type as a comma-separated string. Please help with how to proceed using LINQ.
Example: if I want "Type = customFields" defined for "page1", I need to get the output as a comma-separated string: ProjectID,EmployeeID,EmployeeName,hasExpiration, etc.
{
"Fields": {
"Pages": {
"Page": {
"-Name": "page1",
"Type": [
{
"-TypeID": "CUSTOMIZEDFIELDS",
"Field": [
"ProjectID",
"EmployeeID",
"EmployeeName",
"HasExpiration",
"EndDate",
"OTStrategy",
"Division",
"AddTimesheets",
"SubmitTimesheets",
"ManagerTimesheetApprovalRequired",
"OTAllowed",
"AddExpenses",
"SubmitExpenses",
"ManagerExpenseApprovalRequired",
"SendApprovalEmails"
]
},
{
"-TypeID": "CFDATASET",
"Field": [
"ProjectID",
"EmployeeID",
"EmployeeName",
"HasExpiration",
"EndDate",
"OTStrategy",
"Division",
"AddTimesheets",
"SubmitTimesheets",
"ManagerTimesheetApprovalRequired",
"OTAllowed",
"AddExpenses",
"SubmitExpenses",
"ManagerExpenseApprovalRequired",
"SendApprovalEmails"
]
},
{
"-TypeID": "CFDATASETCAPTION",
"Field": [
"ProjectID",
"EmployeeID",
"EmployeeName",
"HasExpiration",
"EndDate",
"OTStrategy",
"Division",
"AddTimesheets",
"SubmitTimesheets",
"ManagerTimesheetApprovalRequired",
"OTAllowed",
"AddExpenses",
"SubmitExpenses",
"ManagerExpenseApprovalRequired",
"SendApprovalEmails"
]
}
]
}
}
}
}