I have an event-log file generated by a third-party tool that I cannot change. The log file is a huge JSON array in which the odd elements contain metadata and the even elements contain the body message associated with that metadata. I want to split the file based on the metadata, grouping the information by type into separate files.
I am working on this project on Windows, and I am trying to do it with a batch file and jq.
Basically the array looks like this:
[
{ "type": "abc123"},
{"name":"first component of type abc123"},
{ "type": "abc123"},
{"name":"second component of type abc123"},
{ "type": "def124"},
{"name":"first component of type def124"},
{ "type": "xyz999"},
{"name":"first component of type xyz999"},
{ "type": "abc123"},
{"name":"third component of type abc123"},
{ "type": "def124"},
{"name":"second component of type def124"},
{ "type": "abc123"},
{"name":"fifth component of type abc123"},
{ "type": "abc123"},
{"name":"sixth component of type abc123"},
{ "type": "def124"},
{"name":"third component of type def124"},
{ "type": "def124"},
{"name":"fourth component of type def124"},
{ "type": "abc123"},
{"name":"seventh component of type abc123"},
{ "type": "xyz999"},
{"name":"second component of type xyz999"}
...
]
I know that I only have 3 types, so what I am trying to achieve is to create a file for each of them, something like:
First file
{
  "componentLog": {
    "type": "abc123",
    "information": [
      "first component of type abc123",
      "second component of type abc123",
      "third component of type abc123",
      ...
    ]
  }
}
Second file
{
  "componentLog": {
    "type": "def124",
    "information": [
      "first component of type def124",
      "second component of type def124",
      "third component of type def124",
      ...
    ]
  }
}
Third file
{
  "componentLog": {
    "type": "xyz999",
    "information": [
      "first component of type xyz999",
      "second component of type xyz999",
      "third component of type xyz999",
      ...
    ]
  }
}
I know that I can select the metadata with something like this:
jq.exe ".[] | select(.type==\"product\")" file.json
And then I try to match the index. But index just returns the index of the first item that satisfies the select statement... So I don't know how to solve this...
The following bash script is a bit messy because it assumes none of the files (input or output) will fit into memory.
If you don't already have access to bash, sed and awk in your computing environment, you might want to consider installing WSL, MinGW, or some such, or you could adapt the script as appropriate, e.g. using gawk for Windows or Ruby for Windows.
The other main assumption not already embedded in the original question is that it's OK to remove the log-type.*.tmp files and to overwrite log-TYPE.json for the various values of "type".
Be sure to set input to the appropriate input file name.
# The input file name:
input=file.json

/bin/rm -f log-type.*.tmp

# Use jq to produce a stream of alternating .type and .name values,
# as per the jq FAQ, and let awk distribute the names into one
# temporary file per type:
jq -cn --stream '
  fromstream(1|truncate_stream(inputs))
  | if .type then .type else .name end' "$input" |
awk '
  # Odd lines hold the type; strip the JSON string quotes.
  NR%2 { fn=$1; sub(/^"/,"",fn); sub(/"$/,"",fn); next }
  # Even lines hold the name; append it to the per-type file.
  { print > ("log-type." fn ".tmp") }
'

for f in log-type.*.tmp ; do
  echo formatting "$f" ...
  g=$(sed -e 's/^log-type\.//' -e 's/\.tmp$//' <<< "$f")
  echo g="$g"
  awk -v type="\"$g\"" '
    BEGIN { print "{\"componentLog\": { \"type\": " type ","
            print "\"information\": [" }
    NR==1 { print; next }
          { print ",", $0 }
    END   { print "]}}" }' "$f" > "log-$g.json"
done
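Incidentally, if the whole input does fit into memory, you can skip the streaming machinery entirely and do the grouping with plain jq plus a small shell loop. A minimal sketch, assuming jq 1.5+; it rereads file.json once per type, which is fine for just three types:

# Collect the distinct types (every even-indexed element holds one),
# then build one componentLog document per type.
for t in $(jq -r '[.[range(0; length; 2)].type] | unique[]' file.json); do
  jq --arg t "$t" '
    {"componentLog":
      {"type": $t,
       "information": [range(0; length; 2) as $i
                       | select(.[$i].type == $t)
                       | .[$i+1].name]}}' file.json > "log-$t.json"
done

The inner filter walks the even indices, keeps the pairs whose metadata matches $t, and collects the following element's .name.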
I have a preference for shell commands to get things done. I have a very, very big file -- about 2.8 GB -- and its content is JSON. Everything is on one line, and I was told there are at least 1.5 million records in there.
I must prepare the file for consumption. Each record must be on its own line. Sample:
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},{"RecordId":"2",...},{"RecordId":"3",...},{"RecordId":"4",...},{"RecordId":"5",...} }}
Or, using the following fuller sample:
{"Accounts":{"Customer":[{"AccountHolderId":"9c585258-c94c-442b-a2f0-1ebbcc274795","Title":"Mrs","Forename":"Tina","Surname":"Wright","DateofBirth":"1988-01-01","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"1","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"2","Superseded":"Yes" },{"Contact_Info":"acne.pimple#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"3","Superseded":"No" },{"Contact_Info":"swati.singh#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"4","Superseded":"Yes" }, {"Contact_Info":"christian.bale#hollywood.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"5","Superseded":"NO" },{"Contact_Info":"15482475584","TypeId":"Mobile_Phone","PrimaryFlag":"No","Index":"6","Superseded":"No" }],"Address":[{"AddressPtr":"5","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB100KP","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"6","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB10V6T","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"6884133655531279","Field_B":"887.07","Field_C":"A Loan Product",...,"FieldY_":"2015-09-18","Field_Z":"24275627"}]},{"AccountHolderId":"92a5788f-cd8f-423d-ae5f-4eb0ceb457fd","_Title":"Dr","_Forename":"Christopher","_Surname":"Carroll","_DateofBirth":"1977-02-02","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"7","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"8","Superseded":"Yes" },{"Contact_Info":"acne.pimple#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"9","Superseded":"No" },{"Contact_Info":"swati.singh#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"10","Superseded":"Yes" }],"Address":[{"AddressPtr":"11","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB11TXF","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"12","Line1":"A-602","Line2":"Viva Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB11O8W","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"4121879819185553","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_X":"2015-09-18","Field_Z":"25679434"}]},{"AccountHolderId":"4aa10284-d9aa-4dc0-9652-70f01d22b19e","_Title":"Dr","_Forename":"Cheryl","_Surname":"Ortiz","_DateofBirth":"1977-03-03","Contact":[{"Contact_Info":"9168777943","TypeId":"Mobile Number","PrimaryFlag":"No","Index":"13","Superseded":"No" },{"Contact_Info":"9503588153","TypeId":"Home Telephone","PrimaryFlag":"Yes","Index":"14","Superseded":"Yes" },{"Contact_Info":"acne.pimple#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"No","Index":"15","Superseded":"No" },{"Contact_Info":"swati.singh#microchimerism.com","TypeId":"Email Address","PrimaryFlag":"Yes","Index":"16","Superseded":"Yes" }],"Address":[{"AddressPtr":"17","Line1":"Flat No.14","Line2":"Surya Estate","Line3":"Baner","Line4":"Pune ","Line5":"new","Addres_City":"pune","Country":"India","PostCode":"AB12SQR","PrimaryFlag":"No","Superseded":"No"},{"AddressPtr":"18","Line1":"A-602","Line2":"Viva 
Vadegiri","Line3":"Virar","Line4":"new","Line5":"banglow","Addres_City":"Mumbai","Country":"India","PostCode":"AB12BAQ","PrimaryFlag":"Yes","Superseded":"Yes"}],"Account":[{"Field_A":"3288214945919484","Field_B":"887.07","Field_C":"A Loan Product",...,"Field_Y":"2015-09-18","Field_Z":"66264768"}]}]}}
Final outcome should be:
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}
Attempted commands:
sed -e 's/,{"RecordId"/}]},\n{"RecordId"/g' sample.dat
awk '{gsub(",{\"RecordId\"",",\n{\"RecordId\"",$0); print $0}' sample.dat
The attempted commands work perfectly fine for small files, but they do not work for the 2.8 GB file that I must manipulate: sed quit midway after 10 minutes for no apparent reason, with nothing done; awk errored out with a segmentation fault (core dumped) after many hours in; and Perl's search and replace died with an "Out of memory" error.
Any help/ideas would be great!
Additional info on my machine:
More than 105 GB disk space available.
8 GB memory
4-core CPU
Running Ubuntu 14.04
Since you've tagged your question with sed, awk AND perl, I gather that what you really need is a recommendation for a tool. While that's kind of off-topic, I believe that jq is something you could use for this. It will be better than sed or awk because it actually understands JSON. Everything shown here with jq could also be done in Perl with a bit of programming.
Assuming content like the following (based on your sample):
{"RomanCharacters":{"Alphabet": [ {"RecordId":"1","data":"data"},{"RecordId":"2","data":"data"},{"RecordId":"3","data":"data"},{"RecordId":"4","data":"data"},{"RecordId":"5","data":"data"} ] }}
You can easily reformat this to "prettify" it:
$ jq '.' < data.json
{
  "RomanCharacters": {
    "Alphabet": [
      {
        "RecordId": "1",
        "data": "data"
      },
      {
        "RecordId": "2",
        "data": "data"
      },
      {
        "RecordId": "3",
        "data": "data"
      },
      {
        "RecordId": "4",
        "data": "data"
      },
      {
        "RecordId": "5",
        "data": "data"
      }
    ]
  }
}
And we can dig into the data to retrieve only the records you're interested in (regardless of what they're wrapped in):
$ jq '.[][][]' < data.json
{
  "RecordId": "1",
  "data": "data"
}
{
  "RecordId": "2",
  "data": "data"
}
{
  "RecordId": "3",
  "data": "data"
}
{
  "RecordId": "4",
  "data": "data"
}
{
  "RecordId": "5",
  "data": "data"
}
This is much more readable, both by humans and by tools like awk, which process content line-by-line. If you want to join your lines for processing per your question, the awk becomes much simpler:
$ jq '.[][][]' < data.json | awk '{printf("%s ",$0)} /}/{printf("\n")}'
{ "RecordId": "1", "data": "data" }
{ "RecordId": "2", "data": "data" }
{ "RecordId": "3", "data": "data" }
{ "RecordId": "4", "data": "data" }
{ "RecordId": "5", "data": "data" }
Or, as @peak suggested in the comments, eliminate the awk portion of this entirely by using jq's -c (compact output) option:
$ jq -c '.[][][]' < data.json
{"RecordId":"1","data":"data"}
{"RecordId":"2","data":"data"}
{"RecordId":"3","data":"data"}
{"RecordId":"4","data":"data"}
{"RecordId":"5","data":"data"}
Regarding Perl: try setting the input record separator $/ to "},", like this:
#!/usr/bin/perl
$/ = "},";      # read one "},"-terminated chunk at a time
while (<>) {
    print "$_\n";
}
or, as a one-liner:
$ perl -e '$/="},";while(<>){print "$_\n"}' sample.dat
Try using } as the record separator, e.g. in Perl:
perl -l -0175 -ne 'print $_, $/' < input
You might need to glue back lines containing only }.
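A minimal awk sketch of that glue-back step, assuming the stray lines really do contain nothing but a closing brace:

# Append any line consisting solely of "}" to the line before it.
awk '/^}$/ { buf = buf $0; next }
     { if (NR > 1) print buf; buf = $0 }
     END { print buf }' input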
This approach avoids the memory problem by not looking at the data as a single record, but it may go too far the other way with respect to performance (processing a single character at a time). Also note that it requires gawk for the built-in RT variable (the value of the current record separator):
$ cat j.awk
BEGIN { RS="[[:print:]]" }            # one printable character per record
RT == "{" { bal++ }                   # track brace nesting depth
RT == "}" { bal-- }
{ printf "%s", RT }                   # echo the character just read
RT == "," && bal == 2 { print "" }    # newline after each record-level comma
END { print "" }
$ gawk -f j.awk j.txt
{"RomanCharacters":{"Alphabet":[{"RecordId":"1",...]},
{"RecordId":"2",...},
{"RecordId":"3",...},
{"RecordId":"4",...},
{"RecordId":"5",...} }}
Using the sample data provided here (the one that begins with {"Accounts":{"Customer"... ), the solution to this problem is a script that reads the file and, as it reads, counts the delimiters defined in $/. For every 10,000 delimiters, it writes out a new file, and for each delimiter found it appends a newline. Here is what the script looks like:
#!/usr/bin/perl
$/ = "}]},";              # delimiter to find and insert a newline after
$n = 0;
$match = "";
$filecount = 0;
$recsPerFile = 10000;     # set number of records in a file
print "Processing input...\n";
while (<>) {
    $match = $match . $_ . "\n";
    $n++;
    print ".";            # this is so that we'd know it has done something
    if ($n >= $recsPerFile) {
        my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
        open(OUTPUT, '>', $newfile) or die "Cannot open $newfile: $!";
        print OUTPUT $match;
        close(OUTPUT);
        $match = "";
        $filecount++;
        $n = 0;
        print "Wrote file " . $newfile . "\n";
    }
}
# Flush the final, partial batch of records.
if ($n > 0) {
    my $newfile = "partfile" . $recsPerFile . "-" . $filecount . ".dat";
    open(OUTPUT, '>', $newfile) or die "Cannot open $newfile: $!";
    print OUTPUT $match;
    close(OUTPUT);
    print "Wrote file " . $newfile . "\n";
}
print "Finished\n\n";
I've used this script against the big 2.8 GB file, whose content is unformatted one-liner JSON. The resulting output files would be missing the correct JSON headers and footers, but this can be easily fixed.
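For what it's worth, a sketch of one way to patch those wrappers on afterwards; the header and footer strings here are guesses based on the Accounts sample above, so adjust them to the real data:

# Hypothetical wrapper strings taken from the Accounts sample.
header='{"Accounts":{"Customer":['
footer=']}}'
for f in partfile10000-*.dat; do
  { echo "$header"; cat "$f"; echo "$footer"; } > "${f%.dat}-fixed.dat"
done

Note that the first and last part files already carry the original header and footer, and each non-final record still ends with a comma, so a quick hand-check of the boundary files is still in order.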
Thank you so much guys for contributing!