Split large JSON file by using jq and awk - json

I have a large file called
Metadata_01.json
It consists of blocks that follow this structure:
[
{
"Participant_id": "P04_00001",
"no_of_people": "Multiple",
"apparent_gender": "F",
"geographic_location": "AUS",
"ethnicity": "Caucasian",
"capture_device_used": "iOS 14",
"camera_orientation": "Portrait",
"camera_position": "Side View",
"indoor_outdoor_env": "Indoors",
"lighting_condition": "Bright",
"Occluded": 1,
"category": "Two Person",
"camera_movement": "Still",
"action": "No action",
"indoor_outdoor_in_moving_car_or_train": "Indoor",
"daytime_nighttime": "Nighttime"
},
{
"Participant_id": "P04_00002",
"no_of_people": "Single",
"apparent_gender": "M",
"geographic_location": "AUS",
"ethnicity": "Caucasian",
"capture_device_used": "iOS 14",
"camera_orientation": "Portrait",
"camera_position": "Frontal View",
"indoor_outdoor_env": "Outdoors",
"lighting_condition": "Bright",
"Occluded": "None",
"category": "Animals",
"camera_movement": "Still",
"action": "Small action",
"indoor_outdoor_in_moving_car_or_train": "Outdoor",
"daytime_nighttime": "Daytime"
},
And so on... thousands of them.
I am using the following command:
jq -cr '.[]' Metadata_01.json | awk '{print > (NR ".json")}'
And it's kinda doing the expected work.
From the large file structured as shown above, I get a lot of files named 1.json, 2.json and so on, each holding a single object on one line.
Instead of those results I need each json file to be named after the "Participant_id" (e.g. P04_00002.json)
And I want to preserve the json structure to look like that for each file
{
"Participant_id": "P04_00002",
"no_of_people": "Single",
"apparent_gender": "M",
"geographic_location": "AUS",
"ethnicity": "Caucasian",
"capture_device_used": "iOS 14",
"camera_orientation": "Portrait",
"camera_position": "Frontal View",
"indoor_outdoor_env": "Outdoors",
"lighting_condition": "Bright",
"Occluded": "None",
"category": "Animals",
"camera_movement": "Still",
"action": "Small action",
"indoor_outdoor_in_moving_car_or_train": "Outdoor",
"daytime_nighttime": "Daytime"
}
What adjustments should I make to the command above to achieve this?
Or maybe there's an easier way to do this? Thank you!

What adjustments should I make ...?
I'd go with:
jq -cr '.[] | (.Participant_id, .)' Metadata_01.json | awk '
NR%2==1 {fn="id." $0 ".json"; next} {print >> fn; close(fn); }
'
and then run something like jq . "$FILE" | sponge "$FILE" to pretty-print each file.
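The awk half of this pipeline can be tried without the real input. The sketch below feeds it two hand-written records in the same alternating shape (an ID line, then the compact object) and writes into a scratch directory. The sample records and the outdir variable are made up for illustration, and the id. prefix is dropped so that the files are named after the ID alone, as the question asks.

```shell
# Scratch directory so the demo does not litter the current one (hypothetical name).
outdir=$(mktemp -d)

# Stand-in for: jq -cr '.[] | (.Participant_id, .)' Metadata_01.json
printf '%s\n' \
  'P04_00001' '{"Participant_id":"P04_00001","category":"Two Person"}' \
  'P04_00002' '{"Participant_id":"P04_00002","category":"Animals"}' |
awk -v dir="$outdir" '
  NR % 2 == 1 { fn = dir "/" $0 ".json"; next }  # odd lines: remember the target file
              { print > fn; close(fn) }          # even lines: write the object, release the handle
'

ls "$outdir"
```

Each file still holds one compact line; pretty-printing with jq . remains a separate pass.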
Alternatively, if you can navigate your way around any issues that might arise when escaping quotation marks, you could get awk to call jq:
jq -cr '.[] | (.Participant_id, .)' Metadata_01.json | awk -v q=$'\'' '
NR%2==1 {fn = "id." $0 ".json"; next}
{ system( ("jq . <<< " q $0 q " >> \"" fn "\"") );
close(fn);
}
'
"Big Data"
Of course if the input file is too large or too slow for jq empty, then you will want to consider alternatives, e.g. jq's --stream option, jstream, or my own jm. For example if you want the JSON to be pretty-printed in each file:
while read -r json
do
fn=$(jq -r .Participant_id <<< "$json")
<<< "$json" jq . > "id.$fn.json"
done < <(jm Metadata_01.json)

I would recommend PowerShell, since working with objects tends to be easier overall. PowerShell has a ConvertFrom-Json cmdlet that converts the text into a PowerShell object, letting you reference the properties via dot notation (.Participant_id). Then you just have to convert each object back to JSON and write it out. Here I use New-Item to create the file with the output, but piping to Out-File would work as well.
$json = Get-Content -Path '.\Metadata_01.json' -Raw | ConvertFrom-Json
foreach ($json_object in $json)
{
New-Item -Path ".\Desktop\" -Name "$($json_object.Participant_id).json" -Value (ConvertTo-Json -InputObject $json_object) -ItemType 'File' -Force
}
The issue you are most likely to run into is running out of memory, because in this example the whole file is read into a variable first. There are ways around that, but this is for demonstration purposes.

Related

How to loop over nested JSON arrays, get values and then modify them based off index in shell using JQ

this is the JSON I am working with -
{
"content": {
"macOS": {
"releases": [
{
"version": "3.21",
"updateItems": [
{
"id": 1,
"title": "Automatic detection for inactivity is here",
"message": "We've finally added a feature long requested - Orby now detects when you've been inactive on your computer. You can modify the maximum allowable inactive time through settings, or just turn it off if you don't need it",
"image": "https://static.image.png"
},
{
"id": 2,
"title": "In case you missed it... We have an iOS app 📱 🙌",
"message": "It's far from perfect, but it's come a long way since we first pushed out version 1.0. We don't that many users on it so far, but I'm hoping that it's useful. Please send any feedback and feature requests my way",
"image": "https://static.image.png"
}
]
}
]
},
"iOS": {
"releases": [
{
"version": "1.31",
"updateItems": [
{
"image": "https://static.image.png",
"id": 1,
"link": "orbit://com.orbit:/settings",
"message": "Strict mode offers a fantastic new way to keep your focus and get more done. To enable it, go to settings and toggle it on. Now when you want to run a timer, put your device face down on a surface. The timer will stop if you pick it up.",
"title": "Strict mode is here 🙌 ⚠️ ⚠️ ⚠️"
}
]
}
]
}
}
}
I want to translate all the title values and message values (I use shell translate).
In my attempt, I have looped through the releases, gotten the indices, then looped through the updateItems and gotten their indices, then translated the title and message based off both these indices, and then I've attempted to assign these new values to replace the existing title and message.
I've noticed that this results in all the title values being the same, and all the message values being the same.
I'm clearly using jq the wrong way, but am unsure how to correct.
Please help.
curl ${URL} | jq >englishContent.json
LANGUAGES=(
# "en"
# "de"
# "fr"
# "es"
"it"
# "ja"
# "ko"
# "nl"
# "pt-BR"
# "ru"
# "zh-Hans"
)
for language in $LANGUAGES; do
# Create new json payload from the downloaded english content json
result=$(jq '{ "language": "'$language'", "response": . }' englishContent.json)
# Get the total number of release items
macOS_releases_length=$(jq -r ".response.content.macOS.releases | length" <(echo "$result"))
# Iterate over releases items and then use index to substitute values into nested arrays
macOS_releases_length=$(expr "$macOS_releases_length" - 1)
for macOS_release_index in $(seq 0 $macOS_releases_length); do
update_items_length=$(jq ".response.content.macOS.releases[$macOS_release_index].updateItems | length" <(echo "$result"))
update_items_length=$(expr "$update_items_length" - 1)
for update_item_index in $(seq 0 $update_items_length); do
title=$(jq ".response.content.macOS.releases[$macOS_release_index].updateItems[$update_item_index].title" <(echo "$result"))
translated_title=$(trans -brief -no-warn :$language $title | xargs)
message=$(jq ".response.content.macOS.releases[$macOS_release_index].updateItems[$update_item_index].message" <(echo "$result"))
translated_message=$(trans -brief -no-warn :$language $message | xargs)
result=$(jq --arg release_index "$(echo "$macOS_release_index" | jq 'tonumber')" --arg item_index "$("$update_item_index" | jq 'tonumber')" --arg new_title $translated_title '.response.content.macOS.releases['$release_index'].updateItems['$item_index'].title |= $new_title' <(echo "$result"))
result=$(jq --arg release_index "$(echo "$macOS_release_index" | jq 'tonumber')" --arg item_index "$("$update_item_index" | jq 'tonumber')" --arg new_message $translated_message '.response.content.macOS.releases['$release_index'].updateItems['$item_index'].message |= $new_message' <(echo "$result"))
done
done
echo $result
done
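One likely cause of the identical values, offered as a reading of the script rather than a tested fix: in the last two jq calls the filter's single quotes are closed around $release_index and $item_index, so the shell expands them before jq ever runs. No shell variables with those names exist, so they expand to nothing and the path becomes releases[].updateItems[], which updates every item at once. The sketch below only builds and prints the two filter strings, without calling jq; the shortened path and the names i, j, t and file.json in the comment are placeholders.

```shell
# Neither name is set as a shell variable in the script; --arg only defines them inside jq.
unset release_index item_index

# As written in the script: the quotes close, so the shell substitutes (empty) values.
broken='.releases['$release_index'].updateItems['$item_index'].title'

# Kept inside one pair of single quotes, jq sees its own $release_index and $item_index.
fixed='.releases[$release_index].updateItems[$item_index].title'

printf '%s\n' "$broken" "$fixed"

# A call could then look like this (--argjson passes numbers rather than strings):
#   jq --argjson release_index "$i" --argjson item_index "$j" --arg new_title "$t" \
#      '.releases[$release_index].updateItems[$item_index].title = $new_title' file.json
```

The first printed line shows the empty brackets that make jq iterate over every release and every item.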

How to use data from a JQ key to name a new JSON file

I have been trying to modify the accepted code provided by @peak in this thread: Split a JSON file into separate files. I'm very grateful for this answer, as it saved me many hours.
Both of the solutions provided in that thread produce exactly the results I expect and want within the resulting split files. However, the output files are named "$key.json". I would like the file name to be the data contained in the first field of the output file.
Each output file looks something like this:
{
"name": "Bob Smith",
"description": "(some descriptive text)",
"image": "(link to an image file)",
...
}
I have spent several hours trying to figure out how to get the output file names to be "Bob Smith.json", "Jane Doe.json" etc., instead of "0.json", "1.json", etc. I have tried many different ways of modifying the output parameters printf "%s\n" "$item" > "/tmp/$key.json" and '{ print $2 > "/tmp/" $1 ".json" }' without any success. I am completely new to JQ, so I suspect that the solution may be very simple. But, without spending many more hours learning JQ, I don't think I will be able to find it on my own.
For your convenience, here are the solutions from the previous thread:
jq -cr 'keys[] as $k | "\($k)\n\(.[$k])"' input.json |
while read -r key ; do
read -r item
printf "%s\n" "$item" > "/tmp/$key.json"
done
and
jq -cr 'keys[] as $k | "\($k)\t\(.[$k])"' input.json |
awk -F\\t '{ print $2 > "/tmp/" $1 ".json" }'
Can someone who is proficient in JQ please give me a hint? Thank you.
Blindly using .name as the basis of the filename might not be a great idea,
so please adapt the following to your needs.
Assuming the input has the form as in the previous question, i.e.
{ "item1": { "name": "Bob Smith", ...}, ...}
you could use the following pipeline:
jq -cr '.[] | "\(.name)\t\(.)"' input.json |
awk -F\\t '{ print $2 >> "/tmp/" $1 ".json" }'
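Since the name becomes part of a path, one way to adapt the pipeline is to clean it inside awk before using it. This is a sketch under assumptions: the two input lines stand in for the jq output, the set of allowed characters is an arbitrary choice, and outdir is a scratch directory used instead of /tmp.

```shell
outdir=$(mktemp -d)

# Stand-in for: jq -cr '.[] | "\(.name)\t\(.)"' input.json
printf '%s\t%s\n' \
  'Bob Smith' '{"name":"Bob Smith"}' \
  '../Jane/Doe' '{"name":"../Jane/Doe"}' |
awk -F'\t' -v dir="$outdir" '{
  name = $1
  gsub(/[^A-Za-z0-9 _-]/, "_", name)   # replace anything outside the allowed set
  fn = dir "/" name ".json"
  print $2 >> fn                        # append, so duplicate names are not overwritten
  close(fn)
}'

ls "$outdir"
```

With this character set, a name such as ../Jane/Doe ends up as ___Jane_Doe.json inside the target directory instead of escaping it.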

How to select specific data in a large json file and save the result with same structure

I have a very large JSON file (3 GB) like this:
{
"listPoint": [{
"Paime": "RE6845",
"rmOi": "SNO-55",
"State": "OPEN",
"dateOpneing": "2017-12-22",
"adress": {
"ZIPCODE": "33410",
"codeRoc": "33105"
}
},
{
"Paime": "RE6243",
"rmOi": "SNO-65",
"State": "OPEN",
"dateOpneing": "2014-11-12",
"adress": {
"ZIPCODE": "453410",
"codeRoc": "35105"
}
}
]
}
I'm trying to filter it into another file with the same structure, keeping only the entries whose ZIPCODE belongs to a specific list:
['33410', '42000', '75015'....]
the result should be like this (the output file must have the same structure as the input):
{
"listPoint": [{
"Paime": "RE6845",
"rmOi": "SNO-55",
"State": "OPEN",
"dateOpneing": "2017-12-22",
"adress": {
"ZIPCODE": "33410",
"codeRoc": "35105"
}
},
{
"Paime": "RE6243",
"rmOi": "SNO-65",
"State": "OPEN",
"dateOpneing": "2014-11-12",
"adress": {
"ZIPCODE": "75015",
"codeRoc": "55115"
}
}
.....
]
}
I've tried this, but it just streams the whole file:
./jq-win64.exe -n --stream 'fromstream(0|truncate_stream(inputs))' test1.json
I don't know how to do this. Can you please help?
(1) If your computer has enough RAM, then the simplest solution would be along the lines of:
< very-large-file.json jq '
.listPoint |= map(select(.adress.ZIPCODE|startswith("33")))'
(2) Otherwise, you could use jq's streaming parser in a two-step solution such as:
< very-large-file.json jq -n --stream '
fromstream(2|truncate_stream(inputs))
| select(.adress.ZIPCODE|startswith("33"))' |
jq -s '{listPoint: .}'
In the first step, a stream of the relevant JSON objects is produced; in the second step, these are re-assembled into the desired structure.
(3) If even the winnowed array is too large to fit into memory, then you could do worse than:
echo '{"listPoint": ['
< very-large-file.json jq -n --stream '
fromstream(2|truncate_stream(inputs))
| select(.adress.ZIPCODE|startswith("33"))' |
jq -ncr 'input, (inputs | ",", .)'
echo ']}'
(4) If you want to select ZIP codes based on a whitelist, pass the list to jq (e.g. with --argjson whitelist '["33410","42000","75015"]') and change the selection criterion e.g. to
.adress.ZIPCODE|IN($whitelist[])
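The wrapping used in option (3) works on any stream that has one compact JSON value per line. As an illustration that runs without jq, the sketch below uses awk in place of the final jq call to put a comma between records. The function name wrap_stream and the two records are made up, and the opening and closing lines are written out as complete JSON.

```shell
wrap_stream() {
  # Reads one JSON value per line on stdin, writes a single {"listPoint": [...]} document.
  echo '{"listPoint": ['
  awk 'NR > 1 { print "," } { print }'   # comma before every record except the first
  echo ']}'
}

# Stand-in for the output of the streaming jq filter.
result=$(printf '%s\n' '{"Paime":"RE6845"}' '{"Paime":"RE6243"}' | wrap_stream)
printf '%s\n' "$result"
```

Only one record is held at a time, so the size of the filtered array does not matter.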
——-
Note: if your shell is not sufficiently bash-like, you might find it better to use the following invocation format:
jq OPTIONS -f program.jq INPUT.json

Merge multiple JSON files and include filename of each file in the resulting object

I have hundreds of files named [guid].json, all with a structure similar to this:
{
"Active": true,
"CaseType": "CaseType",
"CustomerGroup": ["Core", "Extended"]
}
First I need to append a new key-value pair to all files with "CaseId": "[filename]" and then merge them all into one big array and save it as a new json manifest file.
I would like one file with the following structure from a jq command:
[
{
"Active": true,
"CaseType": "CaseType",
"CustomerGroup": ["Core", "Extended"],
"CaseId": "43d47f66-5a0a-4b86-88d6-1f1f893098d2"
},
{
"Active": true,
"CaseType": "CaseType",
"CustomerGroup": ["Core", "Extended"],
"CaseId": "e3x47f66-5a0a-4b86-88d6-1f1f893098d2"
}
]
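You're looking for input_filename. It returns the name as given on the command line, extension included, so strip the .json suffix if only the GUID is wanted:
jq -n '[ inputs | .CaseId = (input_filename | rtrimstr(".json")) ]' *.json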
You can also use reduce, adding one input object at a time. input_filename gives the name of the file currently being read; here it is used, with the .json suffix stripped, to form the CaseId field of each record:
jq -n 'reduce inputs as $d (null; . + [ $d + { CaseId: (input_filename | rtrimstr(".json")) } ] )' *.json

How to get newline on every iteration in jq

I have the following file
[
{
"id": 1,
"name": "Arthur",
"age": "21"
},
{
"id": 2,
"name": "Richard",
"age": "32"
}
]
To display login and id together, I am using the following command
$ jq '.[] | .name' test
"Arthur"
"Richard"
But when I put it in a shell script and try to assign it to a variable then the whole output is displayed on a single line like below
#!/bin/bash
names=$(jq '.[] | .name' test)
echo $names
$ ./script.sh
"Arthur" "Richard"
I want to break at every iteration similar to how it works on the command line.
There are a couple of issues with the information you have provided. The jq filter .[] | .login, .id will not produce the output you claimed on jq-1.5. For your original JSON
{
"login":"dmaxfield",
"id":7449977
}
{
"login":"stackfield",
"id":2342323
}
It will produce four lines of output as,
jq -r '.login, .id' < json
dmaxfield
7449977
stackfield
2342323
If you want them side by side, you need string interpolation:
jq -r '"\(.login), \(.id)"' < json
dmaxfield, 7449977
stackfield, 2342323
If the output stored in a variable does not look right, it is probably because the variable was not double-quoted when you printed it in the shell.
jqOutput=$(jq -r '"\(.login), \(.id)"' < json)
printf "%s\n" "$jqOutput"
dmaxfield, 7449977
stackfield, 2342323
This way the embedded new lines in the command output are not swallowed by the shell.
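The difference is easy to reproduce with a hard-coded two-line value standing in for the jq output (the names unquoted and quoted are just for this demo):

```shell
# Two lines, as jq would print them.
jqOutput='dmaxfield, 7449977
stackfield, 2342323'

unquoted=$(echo $jqOutput)            # word splitting turns the newline into a space
quoted=$(printf '%s\n' "$jqOutput")   # quoting keeps the newline

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:\n%s\n' "$quoted"
```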
For your updated JSON (a totally new one compared to the old one), all you need to do is
jqOutput=$(jq -r '.[] | .name' < json)
printf "%s\n" "$jqOutput"
Arthur
Richard
In case the .login or .id contains embedded spaces or other characters that might cause problems, a more robust approach is to ensure each JSON value is on a separate line. Consider, for example:
jq -c .login,.id input.json | while read -r login ; do read -r id; echo login="$login" and id="$id" ; done
login="dmaxfield" and id=7449977
login="stackfield" and id=2342323
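The same two-reads-per-record loop can be sketched with printf standing in for jq, so it runs anywhere. The sample values are made up (one login contains a space on purpose), and read -r with an empty IFS is used so that backslashes and surrounding whitespace in the JSON text are left alone.

```shell
# Stand-in for: jq -c .login,.id input.json
pairs=$(
  printf '%s\n' '"dmax field"' '7449977' '"stackfield"' '2342323' |
  while IFS= read -r login; do
    IFS= read -r id            # the next line belongs to the same record
    printf 'login=%s and id=%s\n' "$login" "$id"
  done
)
printf '%s\n' "$pairs"
```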