Parsing JSON from a shell script using JSON.sh

I'm working on parsing JSON data using JSON.sh. I want to read data from a JSON file (test.json) whose content looks something like this:
{
"/home/ukrishnan/projects/test.yml": {
"LOG_DRIVER": "syslog",
"IMAGE": "mysql:5.6"
},
"/home/ukrishnan/projects/mysql/app.xml": {
"ENV_ACCOUNT_BRIDGE_ENDPOINT": "/u01/src/test/sample.txt"
}
}
I try to parse this JSON with JSON.sh like this:
test_parser=`sh ./lib/JSON.sh < test/test.json`
echo $test_parser
It prints,
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog" ["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6" ["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"} ["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt" ["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"} [] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
Whereas if I run the same command (sh ./lib/JSON.sh < test/test.json) directly in the terminal, it prints with line breaks:
["/home/ukrishnan/projects/test.yml","LOG_DRIVER"] "syslog"
["/home/ukrishnan/projects/test.yml","IMAGE"] "mysql:5.6"
["/home/ukrishnan/projects/test.yml"] {"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"}
["/home/ukrishnan/projects/mysql/app.xml","ENV_ACCOUNT_BRIDGE_ENDPOINT"] "/u01/src/test/sample.txt"
["/home/ukrishnan/projects/mysql/app.xml"] {"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}
[] {"/home/ukrishnan/projects/test.yml":{"LOG_DRIVER":"syslog","IMAGE":"mysql:5.6"},"/home/ukrishnan/projects/mysql/app.xml":{"ENV_ACCOUNT_BRIDGE_ENDPOINT":"/u01/src/test/sample.txt"}}
I wanted to read this and assign to bash variables like,
file_name='/home/ukrishnan/projects/test.yml'
key='LOG_DRIVER'
value='syslog'
As I'm almost completely new to shell script and grep or awk, I don't have much idea of how to achieve this. Any help on this would be greatly appreciated.
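On the missing line breaks: the newlines are still in the variable; they disappear because the unquoted $test_parser undergoes word splitting before echo sees it. A minimal illustration, where the variable name captured and the sample text are made up for the example:

```shell
#!/bin/bash
# Capture multi-line output, then compare unquoted and quoted expansion.
captured=$(printf '%s\n' 'first line' 'second line')

echo $captured      # unquoted: word splitting puts everything on one line
echo "$captured"    # quoted: the embedded newline is preserved
```

So echo "$test_parser" should print the same line-per-entry output as running JSON.sh directly.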

I wrote a JSON serializer / deserializer for gawk, if you're interested. Save that script and modify it, replacing everything above # === FUNCTIONS === with the following:
#!/usr/bin/gawk -f
# capture JSON string from beginning to end into a scalar variable
{ json = json ORS $0 }
END {
# objectify JSON string to the multilevel array "obj"
deserialize(json, obj)
for (filename in obj) {
print "file_name=" quote(filename)
for (key in obj[filename]) {
# print key="value"
print key "=" quote(obj[filename][key])
}
}
}
Do chmod 755 json.awk and execute it. Output will resemble this:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
ENV_ACCOUNT_BRIDGE_ENDPOINT="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
LOG_DRIVER="syslog"
IMAGE="mysql:5.6"
Hopefully the logic is reasonably easy to follow. If you prefer to output filename=, key=, and value= on every loop iteration, modify the nested for loops accordingly:
for (filename in obj) {
for (key in obj[filename]) {
print "file_name=" quote(filename)
print "key=" quote(key)
print "value=" quote(obj[filename][key])
}
}
That change will result in the following output:
$ ./json.awk test5.json
file_name="/home/ukrishnan/projects/mysql/app.xml"
key="ENV_ACCOUNT_BRIDGE_ENDPOINT"
value="/u01/src/test/sample.txt"
file_name="/home/ukrishnan/projects/test.yml"
key="LOG_DRIVER"
value="syslog"
file_name="/home/ukrishnan/projects/test.yml"
key="IMAGE"
value="mysql:5.6"
Anyway, with that output, you can do something silly in Bash like this to populate and act upon the variables:
#!/bin/bash
./json.awk test5.json | while read -r line; do {
eval "$line"
[ "${line/=*/}" = "value" ] && {
echo "bash: file_name=$file_name"
echo "bash: key=$key"
echo "bash: value=$value"
echo "------"
}
}; done
It'd probably be more graceful just to do all processing within gawk from start to finish and not mess with the polyglot handoff, though.
Getting back to json.awk, if you prefer to keep json.awk modular for easy reuse in future projects, you could remove everything above # === FUNCTIONS ===, create a separate main.awk containing the code block at the top of this answer, and @include "json.awk" as a helper library pretty much anywhere outside of END {...} (just below the shebang, for example).

JSON.sh (from http://json.org) offers a nice, Bash-friendly means of flattening a JSON file; you've already shown what its output looks like in your question. Each line of the flattened form has the format:
[node] tab value
To extract the information you want, think in terms of UNIX text processing. Note that the lines you're interested in follow this pattern:
["filename","key"] tab "value"
In regex notation, we replace:
filename with (.*)
key with (.*)
tab with \t
value with (.*)
We can retrieve the first, second and third matching groups with \1, \2, \3 respectively.
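In sed's basic regular expressions, a literal [ must be escaped with a backslash and the grouping parentheses are written \( and \). The trailing t;d keeps only the lines on which the substitution succeeded. That results in the following script: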
./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d'
/home/ukrishnan/projects/test.yml,LOG_DRIVER,syslog
/home/ukrishnan/projects/test.yml,IMAGE,mysql:5.6
/home/ukrishnan/projects/mysql/app.xml,ENV_ACCOUNT_BRIDGE_ENDPOINT,/u01/src/test/sample.txt
Now we put the lines in a loop and for each line, we can extract out filename,key,value:
for line in $(./lib/JSON.sh < test/test.json | sed 's/\["\(.*\)","\(.*\)\"]\t"\(.*\)"/\1,\2,\3/;t;d')
do
IFS="," read -ra arr <<< "$line"
filename=${arr[0]}
key=${arr[1]}
value=${arr[2]}
cat <<EOF
filename : $filename
key : $key
value : $value
EOF
done
Which outputs:
filename : /home/ukrishnan/projects/test.yml
key : LOG_DRIVER
value : syslog
filename : /home/ukrishnan/projects/test.yml
key : IMAGE
value : mysql:5.6
filename : /home/ukrishnan/projects/mysql/app.xml
key : ENV_ACCOUNT_BRIDGE_ENDPOINT
value : /u01/src/test/sample.txt
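The for line in $(...) loop above splits on any whitespace, so a value containing a space would be broken across iterations. Below is a hedged sketch of an alternative that reads the flattened lines directly with while read. The function name parse_flat is made up here, and it assumes the path and value are separated by a tab, as described above, and that names and values contain no embedded double quotes:

```shell
#!/bin/bash
# parse_flat: read JSON.sh-style flattened lines on stdin and print one
# file_name/key/value triple per leaf line of the (assumed) form
#   ["<file>","<key>"]<TAB>"<value>"
parse_flat() {
  local path value file_name key
  local re='^\["([^"]*)","([^"]*)"\]$'
  while IFS=$'\t' read -r path value; do
    # Skip lines whose path is not exactly ["file","key"].
    [[ $path =~ $re ]] || continue
    # Skip values that are not quoted strings (objects, arrays, numbers).
    [[ $value == \"*\" ]] || continue
    file_name=${BASH_REMATCH[1]}
    key=${BASH_REMATCH[2]}
    value=${value#\"}   # strip the leading quote
    value=${value%\"}   # strip the trailing quote
    printf 'file_name=%s key=%s value=%s\n' "$file_name" "$key" "$value"
  done
}

# Intended use: ./lib/JSON.sh < test/test.json | parse_flat
```

Inside the loop the three values are ordinary Bash variables, so the printf can be replaced with whatever processing is needed.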

Related

How can I prettyprint JSON on the command line, but allow invalid JSON objects to pass though?

I'm currently tailing some logs in bash that are half JSON, half text like below:
{"response":{"message":"asdfasdf"}}
{"log":{"example":"asdfasdf"}}
here is some text
{"another":{"example":"asdfasdf"}}
more text
Each line is either a full valid JSON object or some text that would fail a JSON parser.
I've looked at jq and underscore-cli to see if they have options to return the invalid object in the case of failure, but I'm not seeing any.
I've also tried using the || operator to fall back to cat on the piped input, but I'm losing the value somehow. Maybe I should read up on pipes more? Example: getLogs -t | (underscore print || cat)
I think I could write a script that stores the input, formats it, and returns the output if successful, or returns the stored value if it fails. I feel like there should be a simpler way, though. Any thoughts?
You can use this node library
install with
$ npm install -g js-beautify
Here is what I did:
$ js-beautify -r test.js
beautified test.js
I tested it with an incomplete json file and it worked
jq can check for invalid json
#!/bin/bash
while IFS= read -r p; do
if jq -e . >/dev/null 2>&1 <<<"$p"; then
jq . <<<"$p"
else
echo 'Skipping invalid json'
fi
done < /tmp/tst.txt
{
"response": {
"message": "asdfasdf"
}
}
{
"log": {
"example": "asdfasdf"
}
}
Skipping invalid json
{
"another": {
"example": "asdfasdf"
}
}
Skipping invalid json
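The loop above replaces non-JSON lines with a message, whereas the question asks for them to pass through unchanged. A minimal variation, assuming jq is installed (the function name pretty_or_passthrough is made up for this sketch):

```shell
#!/bin/bash
# pretty_or_passthrough: pretty-print each stdin line that jq accepts as
# JSON; print any other line unchanged. Requires jq on PATH.
pretty_or_passthrough() {
  local line pretty
  while IFS= read -r line || [[ -n $line ]]; do
    if pretty=$(jq . <<<"$line" 2>/dev/null); then
      printf '%s\n' "$pretty"
    else
      printf '%s\n' "$line"
    fi
  done
}

# Intended use: getLogs -t | pretty_or_passthrough
```

Starting one jq process per line is slow for heavy log volume, but it keeps the fallback logic simple.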

How to delete the last character of prior line with sed

I'm trying to delete a line, along with the last character of the prior line, with sed:
I have a json file :
{
"name":"John",
"age":"16",
"country":"Spain"
}
I would like to delete the country property from all entries; to keep the JSON syntax valid, I also have to delete the comma at the end of the prior line.
I'm using this pattern :
sed '/country/d' test.json
sed -n '/resolved//.$//{x;d;};1h;1!{x;p;};${x;p;}' test.json
Editor's note:
The OP later clarified the following additional requirements, which invalidated some of the existing answers:
- multiple occurrences of country properties should be removed
- across all levels of the object hierarchy
- whitespace variations should be tolerated
Using a proper JSON parser such as jq is generally the best choice (see below), but if installing a utility is not an option, try this GNU sed command:
$ sed -zr 's/,\s*"country":[^\n]+//g' test.json
{
"name":"John",
"age":"16"
}
-z splits the input into records by NULs, which, in this case means that the whole file is read at once, which enables cross-line substitutions.
-r enables extended regular expressions for a more modern syntax with more features.
s/,\s*"country":[^\n]+//g replaces all occurrences of a comma followed by a (possibly empty) run of whitespace (including possibly a newline) and then "country" through the end of that line with the empty string, i.e., effectively removes the matched strings.
Note that this assumes that no other property or closing } follows such a country property on the same line.
The following demonstrates a more robust solution based on jq.
Bertrand Martel's helpful answer contains a jq solution, which, however, does not address the requirement (added later) of replacing country attributes anywhere in the input object hierarchy.
In jq versions newer than v1.5.2 (that is, jq 1.6 and later), a builtin walk/1 function is available, which enables the following simple solution:
# Walk all nodes and remove a "country" property from any object.
jq 'walk(if type == "object" then del(.country) else . end)' test.json
In v1.5.2 and below, you can define a simplified variant of walk yourself:
jq '
# Define recursive function walk_objects/1 that walks all objects in the
# hierarchy.
def walk_objects(f): . as $in |
if type == "object" then
reduce keys[] as $key
( {}; . + { ($key): ($in[$key] | walk_objects(f)) } ) | f
elif type == "array" then map( walk_objects(f) )
else . end;
# Walk all objects and remove a "country" property, if present.
walk_objects(del(.country))
' test.json
As pointed out before, you should really consider using a JSON parser to parse JSON.
That said, you can slurp the whole file, remove the newlines, and then replace accordingly:
$ sed ':a;N;$!ba;s/\n//g;s/,"country"[^}]*//' test.json
{"name":"John","age":"16"}
Breakdown:
:a; # Define label 'a'
N; # Append next line to pattern space
$!ba; # Goto 'a' unless it's the last line
s/\n//g; # Replace all newlines with nothing
s/,"country"[^}]*// # Replace ',"country...' with nothing
This might work for you (GNU sed):
sed 'N;s/,\s*\n\s*"country".*//;P;D' file
Read two lines into the pattern space and, if the second is a "country" line, remove it together with the comma that ends the first.
N.B. Allows for whitespace on either side of the line break.
You can use a JSON parser like jq to parse json file. The following will return the document without the country field and write the new document in result.json :
jq 'del(.country)' file.json > result.json

Changing values in a JSON data file from shell

I have created a JSON file which in this case contains:
{"ipaddr":"10.1.1.2","hostname":"host2","role":"http","status":"active"},
{"ipaddr":"10.1.1.3","hostname":"host3","role":"sql","status":"active"},
{"ipaddr":"10.1.1.4","hostname":"host4","role":"quad","status":"active"},
On other side I have a variable with values for example:
arr="10.1.1.2 10.1.1.3"
which comes, for example, from a subsequent check of the server status. For those addresses I want to change the status field to "inactive"; in other words, find the host and change its "status" value.
Expected output:
{"ipaddr":"10.1.1.2","hostname":"host2","role":"http","status":"inactive"},
{"ipaddr":"10.1.1.3","hostname":"host3","role":"sql","status":"inactive"},
{"ipaddr":"10.1.1.4","hostname":"host4","role":"quad","status":"active"},
$ arr="10.1.1.2 10.1.1.3"
$ awk -v arr="$arr" -F, 'BEGIN { gsub(/\./,"\\.",arr); gsub(/ /,"|",arr) }
$1 ~ "\"(" arr ")\"" { sub(/active/,"in&") } 1' file
{"ipaddr":"10.1.1.2","hostname":"host2","role":"http","status":"inactive"},
{"ipaddr":"10.1.1.3","hostname":"host3","role":"sql","status":"inactive"},
{"ipaddr":"10.1.1.4","hostname":"host4","role":"quad","status":"active"},
Here is a quick Perl "wrap-around one-liner" that uses the JSON module and slurps with the -0 switch:
perl -MJSON -n0E '$j = decode_json($_);
for (@{$j->{hosts}}){$_->{status}="inactive" if $_->{ipaddr}=~/\.[23]$/} ;
say to_json( $j->{hosts}, {pretty=>1} )' status_data.json
This might be nicer, or it might violate PBP recommendations for map:
perl -MJSON -n0E '$j = decode_json($_);
map { $_->{status}="inactive" if $_->{ipaddr}=~/\.[23]$/ } @{ $j->{hosts} } ;
say to_json( $j->{hosts} )' status_data.json
A shell script that resets status using jq would also be possible. Here's a quick way to parse and output changes to JSON using jq:
jq '.hosts[] |
select(.ipaddr == "10.1.1.2" or .ipaddr == "10.1.1.3") | .status = "inactive"' status_data.json
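The jq command above hardcodes the addresses and prints only the matching objects. As a sketch of how the space-separated arr variable from the question could be passed in instead, the following keeps the whole document and only rewrites the status fields. The function name deactivate_hosts is made up here, and it assumes the status_data.json layout (a top-level "hosts" array) used in this answer:

```shell
#!/bin/bash
# deactivate_hosts "IP1 IP2 ..." FILE
# Print FILE's JSON (compact) with status set to "inactive" for every
# host whose ipaddr appears in the space-separated list.
# Requires jq on PATH.
deactivate_hosts() {
  jq -c --arg ips "$1" '
    ($ips | split(" ")) as $down
    | .hosts |= map(
        if (.ipaddr as $ip | any($down[]; . == $ip))
        then .status = "inactive"
        else .
        end)
  ' "$2"
}

arr="10.1.1.2 10.1.1.3"
# Intended use: deactivate_hosts "$arr" status_data.json
```

The addresses are compared as whole strings, so 10.1.1.2 does not accidentally match 10.1.1.20.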
EDIT In an earlier comment I was uncertain whether the OP was more interested in an application than a quick search and replace (something about the phrases "On other side..." and "check on the server status"). Here is a (still simple) perl approach in script form:
use v5.16; # enables strict and say
use warnings;
use JSON ;
use IO::All;
my $status_data < io 'status_data.json';
my $network = JSON->new->utf8->decode($status_data) ;
my @changed_hosts= qw/10.1.1.2 10.1.1.3/;
sub status_report {
foreach my $host ( @{ $network->{hosts} }) {
say "$host->{hostname} is $host->{status}";
}
}
sub change_status {
foreach my $host ( @{ $network->{hosts} }){
foreach (@changed_hosts) {
$host->{status} = "inactive" if $host->{ipaddr} eq $_ ;
}
}
status_report;
}
defined $ENV{CHANGE_HAPPENED} ? change_status : status_report ;
The script reads the JSON file status_data.json (using IO::All, which is great fun) and then decodes it with JSON into a hash. It is hard to tell whether this is a complete solution: if you are "monitoring" host status, the script should check the JSON data file periodically, compare it to the hash, and run the main body of the script once changes have occurred.
To simulate changes occurring, you can define or undefine CHANGE_HAPPENED in your environment with export CHANGE_HAPPENED=1 (or setenv if in tcsh) and unset CHANGE_HAPPENED; the script will then either update the statuses in the hash and report them, or just report. For this to be complete, the data in the hash should be updated to match the data file, either periodically or when an event occurs. The status_report() subroutine could be changed so that it builds arrays of @inactive_hosts and @active_hosts when update_status() tells it to do so: if ( something_happened() ) { update_status() }, etc.
Hope that helps.
status_data.json
{
"hosts":[
{"ipaddr":"10.1.1.2","hostname":"host2","role":"http","status":"active"},
{"ipaddr":"10.1.1.3","hostname":"host3","role":"sql","status":"active"},
{"ipaddr":"10.1.1.4","hostname":"host4","role":"quad","status":"active"}
]
}
output:
~/ % perl network_status_json.pl
host2 is active
host3 is active
host4 is active
~/ % export CHANGE_HAPPENED=1
~/ % perl network_status_json.pl
host2 is inactive
host3 is inactive
host4 is active
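The periodic check mentioned in this answer could be driven from the shell. The following is only a sketch of one way to detect that the data file changed, by comparing checksums between calls. The function name file_changed, the state-file path, and the 60-second interval are all assumptions made for illustration:

```shell
#!/bin/bash
# file_changed FILE STATE
# Succeed (exit status 0) when FILE's checksum differs from the one saved
# in STATE by the previous call, then save the new checksum in STATE.
# The very first call, with no STATE file yet, counts as a change.
file_changed() {
  local file=$1 state=$2 new old=""
  new=$(cksum < "$file") || return 2
  [[ -f $state ]] && old=$(< "$state")
  printf '%s\n' "$new" > "$state"
  [[ $new != "$old" ]]
}

# Hypothetical polling loop around the Perl script shown above:
# while sleep 60; do
#   file_changed status_data.json /tmp/status.cksum &&
#     CHANGE_HAPPENED=1 perl network_status_json.pl
# done
```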
Version 1:
This uses a simple regex-based transformation, which can be done in several ways. As in the question, the list of IP addresses is in the variable arr. Example using an exported Bash environment variable:
$ export arr="... ..."
Alternatively, the list could be passed as a command-line argument.
#!/usr/bin/perl
my %inact; # ipaddr to inactivate
my $arr=$ENV{arr} ; # from external var (export arr=...)
## $arr=shift; # from command line arg
for( split(/\s+/, $arr)){ $inact{$_}=1 }
while(<>){ # one "json" line at the time
if(/"ipaddr":"(.*?)"/ and $inact{$1}){
s/"active"/"inactive"/}
print $_;
}
Version 2:
Using a JSON parser we can do more complex transformations. As the input is not real JSON, we process one line of "almost JSON" at a time:
use JSON;
use strict;
my ($line, %inact);
my $arr=$ENV{arr} ;
for( split(/\s+/, $arr)){ $inact{$_}=1 }
while(<>){ # one "json" line at the time
if(/^\{.*\},/){
s/,\n//;
$line = from_json( $_);
if($inact{$line->{ipaddr}}){
$line->{status} = "inactive" ;}
print to_json($line), ",\n"; }
else { print $_;}
}
#!/bin/ksh
# your "array" of IP
arr="10.1.1.2 10.1.1.3"
# create and prepare temporary file for sed action
SedAction=/tmp/Action.sed
# --- for/do generating SedAction --------
echo "#sed action" > ${SedAction}
#take each IP from the arr variable one by one
for IP in ${arr}
do
# escape the dots so the IP can be used as a search pattern
IP_RE="$( echo "${IP}" | sed 's/\./\\./g' )"
# generate sed action in temporary file.
# final action will be like:
# s/\("ipaddr":"10\.1\.1\.2".*\)"active"}/\1"inactive"}/;t
# \ is doubled so one survives into the file; " is escaped once for this line's own interpretation
echo "s/\\\(\"ipaddr\":\"${IP_RE}\".*\\\)\"active\"}/\\\1\"inactive\"}/;t" >> ${SedAction}
done
# --- sed generating sed action ---------------
echo "${arr}" \
| tr " " "\n" \
| sed 's/\./\\./g
s#.*#s/\\("ipaddr":"&".*\\)"active"}/\\1"inactive"}/;t#
' \
> ${SedAction}
# core of the process (use -i for inline editing or "double" redirection for non GNU sed)
sed -f ${SedAction} YourFile
# clean temporary file
rm ${SedAction}
The script is self-commented and was tested in ksh on AIX.
It shows two ways to generate the SedAction file, and which one suits you may depend on any other actions you want to perform. You only need one of them; I prefer the second.
This is very simple indeed in Perl, using the JSON module.
use strict;
use warnings;
use JSON qw/ from_json to_json /;
my $json = JSON->new;
my $data = from_json(do { local $/; <DATA> });
my $arr = "10.1.1.2 10.1.1.3";
my %arr = map { $_ => 1 } split ' ', $arr;
for my $item (@$data) {
$item->{status} = 'inactive' if $arr{$item->{ipaddr}};
}
print to_json($data, { pretty => 1 }), "\n";
__DATA__
[
{"ipaddr":"10.1.1.2","hostname":"host2","role":"http","status":"active"},
{"ipaddr":"10.1.1.3","hostname":"host3","role":"sql","status":"active"},
{"ipaddr":"10.1.1.4","hostname":"host4","role":"quad","status":"active"}
]
output
[
{
"role" : "http",
"hostname" : "host2",
"status" : "inactive",
"ipaddr" : "10.1.1.2"
},
{
"hostname" : "host3",
"role" : "sql",
"ipaddr" : "10.1.1.3",
"status" : "inactive"
},
{
"ipaddr" : "10.1.1.4",
"status" : "active",
"hostname" : "host4",
"role" : "quad"
}
]

Read JSON data in a shell script [duplicate]

This question already has answers here:
Parsing JSON with Unix tools
(45 answers)
Closed 6 years ago.
In shell I have a requirement wherein I have to read the JSON response which is in the following format:
{ "Messages": [ { "Body": "172.16.1.42|/home/480/1234/5-12-2013/1234.toSort", "ReceiptHandle": "uUk89DYFzt1VAHtMW2iz0VSiDcGHY+H6WtTgcTSgBiFbpFUg5lythf+wQdWluzCoBziie8BiS2GFQVoRjQQfOx3R5jUASxDz7SmoCI5bNPJkWqU8ola+OYBIYNuCP1fYweKl1BOFUF+o2g7xLSIEkrdvLDAhYvHzfPb4QNgOSuN1JGG1GcZehvW3Q/9jq3vjYVIFz3Ho7blCUuWYhGFrpsBn5HWoRYE5VF5Bxc/zO6dPT0n4wRAd3hUEqF3WWeTMlWyTJp1KoMyX7Z8IXH4hKURGjdBQ0PwlSDF2cBYkBUA=", "MD5OfBody": "53e90dc3fa8afa3452c671080569642e", "MessageId": "e93e9238-f9f8-4bf4-bf5b-9a0cae8a0ebc" } ] }
Here I am only concerned with the "Body" property value. I made some unsuccessful attempts like:
jsawk -a 'return this.Body'
or
awk -v k="Body" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}'
But that did not suffice. Can anyone help me with this?
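There is jq for parsing JSON on the command line. Since "Body" sits inside the "Messages" array, select it through that array (-r prints the raw string without quotes):
jq -r '.Messages[].Body'
Visit this for jq: https://stedolan.github.io/jq/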
tl;dr
$ cat /tmp/so.json | underscore select '.Messages .Body'
["172.16.1.42|/home/480/1234/5-12-2013/1234.toSort"]
Javascript CLI tools
You can use Javascript CLI tools like
underscore-cli:
json:select(): CSS-like selectors for JSON.
Example
Select all name children of an addons element:
underscore select ".addons > .name"
The underscore-cli documentation provides other real-world examples, as well as the json:select() documentation.
Similarly using Bash regexp. Shall be able to snatch any key/value pair.
key="Body"
re="\"($key)\": \"([^\"]*)\""
while read -r l; do
if [[ $l =~ $re ]]; then
name="${BASH_REMATCH[1]}"
value="${BASH_REMATCH[2]}"
echo "$name=$value"
else
echo "No match"
fi
done
The regular expression can be tuned to match multiple spaces/tabs or newline(s). It won't work if the value has an embedded ". This is only an illustration; it is better to use some "industrial" parser :)
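As a sketch of that tuning, the variant below reads all of its input first and allows any whitespace, including newlines, around the colon. The function name extract_value is made up here; the key is used as-is inside the regex, so it is assumed to contain no regex metacharacters, and the same limitation about embedded " in the value applies:

```shell
#!/bin/bash
# extract_value KEY: print the first string value of "KEY" found in the
# JSON text read from stdin; return 1 if the key is not found.
extract_value() {
  local key=$1 json re
  json=$(cat)   # slurp all input so the pair may span several lines
  re="\"$key\"[[:space:]]*:[[:space:]]*\"([^\"]*)\""
  if [[ $json =~ $re ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  else
    return 1
  fi
}

# Intended use: extract_value Body < response.json
```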
Here is a crude way to do it: Transform JSON into bash variables to eval them.
This only works for:
JSON which does not contain nested arrays, and
JSON from trustworthy sources (else it may confuse your shell script, perhaps it may even be able to harm your system, You have been warned)
Well, yes, it uses PERL to do this job, thanks to CPAN, but is small enough for inclusion directly into a script and hence is quick and easy to debug:
json2bash() {
perl -MJSON -0777 -n -E 'sub J {
my ($p,$v) = @_; my $r = ref $v;
if ($r eq "HASH") { J("${p}_$_", $v->{$_}) for keys %$v; }
elsif ($r eq "ARRAY") { $n = 0; J("$p"."[".$n++."]", $_) foreach @$v; }
else { $v =~ '"s/'/'\\\\''/g"'; $p =~ s/^([^[]*)\[([0-9]*)\](.+)$/$1$3\[$2\]/;
$p =~ tr/-/_/; $p =~ tr/A-Za-z0-9_[]//cd; say "$p='\''$v'\'';"; }
}; J("json", decode_json($_));'
}
use it like eval "$(json2bash <<<'{"a":["b","c"]}')"
Not heavily tested, though. For updates, warnings and more examples, see my Gist.
Update
(Unfortunately, following is a link-only-solution, as the C code is far
too long to duplicate here.)
For all those, who do not like the above solution,
there now is a C program json2sh
which (hopefully safely) converts JSON into shell variables.
In contrast to the perl snippet, it is able to process any JSON,
as long as it is well formed.
Caveats:
json2sh was not tested much.
json2sh may create variables, which start with the shellshock pattern () {
I wrote json2sh to be able to post-process .bson with Shell:
bson2json()
{
printf '[';
{ bsondump "$1"; echo "\"END$?\""; } | sed '/^{/s/$/,/';
echo ']';
};
bsons2json()
{
printf '{';
c='';
for a;
do
printf '%s"%q":' "$c" "$a";
c=',';
bson2json "$a";
done;
echo '}';
};
bsons2json */*.bson | json2sh | ..
Explained:
bson2json dumps a .bson file such, that the records become a JSON array
If everything works OK, an END0-Marker is applied, else you will see something like END1.
The END-Marker is needed, else empty .bson files would not show up.
bsons2json dumps a bunch of .bson files as an object, where the output of bson2json is indexed by the filename.
This then is postprocessed by json2sh, such that you can use grep/source/eval/etc. what you need, to bring the values into the shell.
This way you can quickly process the contents of a MongoDB dump on shell level, without need to import it into MongoDB first.

JSON to fixed width file

I have to extract data from JSON file depending on a specific key. The data then has to be filtered (based on the key value) and separated into different fixed width flat files. I have to develop a solution using shell scripting.
Since the data is just key:value pair I can extract them by processing each line in the JSON file, checking the type and writing the values to the corresponding fixed-width file.
My problem is that the input JSON file is approximately 5GB in size. My method is very basic, and I would like to know if there is a better way to achieve this using shell scripting.
Sample JSON file would look like as below:
{"Type":"Mail","id":"101","Subject":"How are you ?","Attachment":"true"}
{"Type":"Chat","id":"12ABD","Mode:Online"}
The above is a sample of the kind of data I need to process.
Give this a try:
#!/usr/bin/awk -f
{
line = ""
gsub("[{}\x22]", "", $0)
f=split($0, a, "[:,]")
for (i=1;i<=f;i++)
if (a[i] == "Type")
file = a[++i]
else
line = line sprintf("%-15s",a[i])
print line > file ".fixed.out"
}
I made assumptions based on the sample data provided. There is a lot based on those assumptions that may need to be changed if the data varies much from what you've shown. In particular, this script will not work properly if the data values or field names contain colons, commas, quotes or braces. If this is a problem, it's one of the primary reasons that a proper JSON parser should be used. If it were my assignment, I'd push back hard on this point to get permission to use the proper tools.
This outputs lines that have type "Mail" to a file named "Mail.fixed.out" and type "Chat" to "Chat.fixed.out", etc.
The "Type" field name and field value ("Mail", etc.) are not output as part of the contents. This can be changed.
Otherwise, both the field names and values are output. This can be changed.
The field widths are all fixed at 15 characters, padded with spaces, with no delimiters. The field width can be changed, etc.
Let me know how close this comes to what you're looking for and I can make some adjustments.
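As an illustration of the "this can be changed" points above, here is a hedged sketch that takes the field width as a parameter and writes only the values, not the field names. The function name to_fixed and its two arguments (width and output directory) are assumptions for this example, and the same restriction on colons, commas, quotes and braces inside the data applies:

```shell
#!/bin/bash
# to_fixed WIDTH OUTDIR
# Read one flat JSON object per line on stdin and write its values
# (without field names) to OUTDIR/<Type>.fixed.out, each value
# left-justified and padded to WIDTH characters.
to_fixed() {
  awk -v w="$1" -v dir="$2" '
  {
    line = ""; file = "unknown"
    gsub(/[{}"]/, "")               # drop braces and double quotes
    n = split($0, a, /[:,]/)        # a[1]=name, a[2]=value, a[3]=name, ...
    for (i = 1; i < n; i += 2) {
      if (a[i] == "Type") file = a[i + 1]
      else line = line sprintf("%-" w "s", a[i + 1])
    }
    out = dir "/" file ".fixed.out"
    print line > out                # stays open; later lines are appended
  }'
}

# Intended use: to_fixed 15 . < input_file
```

Lines without a "Type" field end up in unknown.fixed.out in this sketch.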
perl script
#!/usr/bin/perl -w
use strict;
use warnings;
no strict 'refs'; # for FileCache
use FileCache; # avoid exceeding system's maximum number of file descriptors
use JSON;
my $type;
my $json = JSON->new->utf8(1); #NOTE: expect utf-8 strings
while(my $line = <>) { # for each input line
# extract type
eval { $type = $json->decode($line)->{Type} };
$type = 'json_decode_error' if $@;
$type ||= 'missing_type';
# print to the appropriate file
my $fh = cacheout '>>', "$type.out";
print $fh $line; #NOTE: use cache if there are too many hdd seeks
}
corresponding shell script
#!/bin/bash
#NOTE: bash is used to create non-ascii filenames correctly
__extract_type()
{
perl -MJSON -e 'print from_json(shift)->{Type}' "$1"
}
__process_input()
{
local IFS=$'\n'
while read -r line; do # for each input line (-r keeps backslashes intact)
# extract type
local type="$(__extract_type "$line" 2>/dev/null ||
echo json_decode_error)"
[ -z "$type" ] && local type=missing_type
# print to the appropriate file
echo "$line" >> "$type.out"
done
}
__process_input
Example:
$ ./script-name < input_file
$ ls -1 *.out
json_decode_error.out
Mail.out