How to split text file into multiple files and extract filename from line prefix?

I have a simple log file with content like:
1504007980.039:{"key":"valueA"}
1504007990.359:{"key":"valueB", "key2": "valueC"}
...
I'd like to split it into multiple files, each containing as content the JSON part that comes after the timestamp. As a result I would get the files:
1504007980039.json
1504007990359.json
...
This is similar to How to split one text file into multiple *.txt files?, but here the filename should be extracted from each line (with the extra dot removed), not generated from an index.
Preferably I'd want a one-liner that can be executed in bash.

Since you aren't using GNU awk, you need to close output files as you go to avoid the "too many open files" error. To avoid that, along with issues around specific values in your JSON and undefined behavior during output redirection, this is what you need:
awk '{
    fname = $0                 # copy the whole line
    sub(/\./,"",fname)         # drop the dot in the timestamp: 1504007980.039 -> 1504007980039
    sub(/:.*/,".json",fname)   # replace everything from the first ":" onward with ".json"
    sub(/[^:]+:/,"")           # strip the "timestamp:" prefix from the record itself
    print >> fname             # append the remaining JSON to that file
    close(fname)               # close it so we never exceed the open-files limit
}' file
You can of course squeeze it onto 1 line if you see some benefit to that:
awk '{f=$0;sub(/\./,"",f);sub(/:.*/,".json",f);sub(/[^:]+:/,"");print>>f;close(f)}' file

awk solution:
awk '{ idx=index($0,":"); fn=substr($0,1,idx-1)".json"; sub(/\./,"",fn);
print substr($0,idx+1) > fn; close(fn) }' input.log
idx=index($0,":") - capturing the index of the 1st :
fn=substr($0,1,idx-1)".json" - preparing the filename from the timestamp
sub(/\./,"",fn) - removing the dot, so 1504007980.039 becomes 1504007980039.json
Viewing results (for 2 sample lines from the question):
for f in *.json; do echo "$f"; cat "$f"; echo; done
The output (filename -> content):
1504007980039.json
{"key":"valueA"}
1504007990359.json
{"key":"valueB"}

Related

How to split line delimited JSON into many files using linux shell script

I have a huge newline-delimited JSON file input.json which looks like this:
{ "name":"a.txt", "content":"...", "other_keys":"..."}
{ "name":"b.txt", "content":"...", "something_else":"..."}
{ "name":"c.txt", "content":"...", "etc":"..."}
...
How can I split it into multiple text files, where file names are taken from "name" and file content is taken from "content"? Other keys can be ignored. Currently toying with jq tool without luck.
The key to an efficient, jq-based solution is to pipe the output of jq (invoked with the -c option) to a program such as awk to perform the actual writing of the output files.
jq -c '.name, .content' input.json |
awk 'fn {print > fn; close(fn); fn=""; next;}
{fn=$0; sub(/^"/,"",fn); sub(/"$/,"",fn);}'
Warnings
Blindly relying on the JSON input for the file names has some risks,
e.g.
what if the same "name" is specified more than once?
if a file already exists, the above program will simply append to it.
Also, somewhere along the line, the validity of .name as a filename should be checked; one way to do that is sketched below.
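As a partial mitigation of that last point, here is a hedged sketch (assuming jq 1.5+ for test()) that simply skips any record whose .name starts with a dot or contains a slash, so the output cannot escape the current directory:
jq -c 'select(.name | test("^[^/.][^/]*$")) | .name, .content' input.json |
awk 'fn {print > fn; close(fn); fn=""; next;}
     {fn=$0; sub(/^"/,"",fn); sub(/"$/,"",fn);}'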
Related answers on SO
This question has been asked and answered on SO in slightly different forms before,
see e.g. Split a JSON file into separate files
jq doesn't have the output capabilities to create the desired files after grouping the objects; you'll need to use another language with a JSON library. An example using Python:
import json
import fileinput

for line in fileinput.input():  # Read from standard input or filename arguments
    d = json.loads(line)
    with open(d['name'], "a") as f:
        print(d['content'], file=f)
This has the drawback of repeatedly opening and closing each file multiple times, but it's simple. A more complex, but more efficient, example would use an exit stack context manager.
import json
import fileinput
import contextlib

with contextlib.ExitStack() as es:
    files = {}
    for line in fileinput.input():
        d = json.loads(line)
        file_name = d['name']
        if file_name not in files:
            files[file_name] = es.enter_context(open(file_name, "w"))
        print(d['content'], file=files[file_name])
Put briefly, files are opened and cached as they are discovered. Once the loop completes (or in the event of an exception), the exit stack ensures all files previously opened are properly closed.
If there's a chance that there will be too many files to have open simultaneously, you'll have to use the simple-but-inefficient code, though you could implement something even more complex that just keeps a small, fixed number of files open at any given time, reopening them in append mode as necessary. Implementing that is beyond the scope of this answer, though.
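Either script reads line-delimited JSON from standard input or from filename arguments, since that is what fileinput.input() does. A hypothetical invocation (the script name split_by_name.py is an assumption, not from the original answer):
python3 split_by_name.py input.json
# or, reading from standard input:
python3 split_by_name.py < input.json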
The following jq-based solution ensures that the output in the JSON files is pretty-printed, but ignores any input object whose .content equals the JSON string "IGNORE ME":
jq 'if .content == "IGNORE ME"
then "Skipping IGNORE ME" | stderr | empty
else .name, .content, "IGNORE ME" end' input.json |
awk '/^"IGNORE ME"$/ {close(fn); fn=""; next}
fn {print >> fn; next}
{fn=$0; sub(/^"/,"",fn); sub(/"$/,"",fn);}'

Split large file size json into multiple files [duplicate]

I have json file exported from mongodb which looks like:
{"_id":"99919","city":"THORNE BAY"}
{"_id":"99921","city":"CRAIG"}
{"_id":"99922","city":"HYDABURG"}
{"_id":"99923","city":"HYDER"}
there are about 30000 lines, I want to split each line into its own .json file. (I'm trying to transfer my data onto a Couchbase cluster)
I tried doing this:
cat cities.json | jq -c -M '.' | \
while read line; do echo $line > .chunks/cities_$(date +%s%N).json; done
but I found that it seems to drop loads of lines: running this command only gave me 50-odd files when I was expecting 30000-odd!
Is there a logical way to make this not drop any data, using anything that would suit?
Assuming you don't care about the exact filenames, if you want to split input into multiple files, just use split.
jq -c . < cities.json | split -l 1 --additional-suffix=.json - .chunks/cities_
In general, splitting any text file into separate files, one per line, using any awk on any UNIX system is simply:
awk '{close(f); f=".chunks/cities_"NR".json"; print > f}' cities.json
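Both commands assume the .chunks directory already exists; a minimal sketch of a complete run (GNU split is assumed for --additional-suffix):
mkdir -p .chunks
jq -c . < cities.json | split -l 1 --additional-suffix=.json - .chunks/cities_
ls .chunks | wc -l    # should match the number of input lines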

Dynamically create and update json using bash

In my hypothetical folder /hd/log/, I have two dozen folders, and each folder has log files in this format: foldername.2017.07.09.log. I have a crontab that gzips the last log file every night, so there is a new log file with a new name every day.
I am trying to create a dynamic JSON file whose output looks like this:
[
    {
        "Foldername": "foldername",
        "lastmodifiedfile": "/hd/log/foldername/foldername.2017.07.09.log"
    },
    {
        "Foldername": "foldername2",
        "lastmodifiedfile": "/hd/log/foldername2/foldername2.2017.07.09.log"
    }
]
The bash script should be able to dynamically create array for each subfolder name (in case more folder are added or names are changed) and also give direct link to the last modified file.
I already have a PHP program to parse the JSON file, but no sane way to create this JSON file dynamically.
Any help or pointers are appreciated.
printf "%s" "["
for var in $(find /hd/log -type d)
do
path=$("ls -1t $var" | head -1)
echo $var"/"$path | awk -F\/ '{ printf "%s","\n\t{\n\t\t\"Foldername\":\""$(NF-1)"\",\n\t\tlastmodifiedfile\":\""$0"\"\n\t},"}'
done
printf "%s" "]"
Here we find all directories in /hd/log in a loop, taking each directory in turn and using ls -1t | head -1 to get the most recently modified file in that directory. The path and file are then passed through awk to produce the desired output. We first set the field delimiter for awk to / with the -F flag, then print the JSON syntax as required, using the second-to-last /-delimited field for the directory name (NF-1, where NF is the number of fields) and the complete line for the last modified file ($0).
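If jq is available, here is a hedged alternative sketch that builds strictly valid JSON (no trailing commas) and handles quoting for you; it assumes each log folder is a direct subdirectory of /hd/log:
for dir in /hd/log/*/; do
    latest=$(ls -1t "$dir" | head -n 1)
    jq -n --arg name "$(basename "$dir")" \
          --arg file "${dir%/}/$latest" \
          '{Foldername: $name, lastmodifiedfile: $file}'
done | jq -s .
The final jq -s . slurps the stream of per-folder objects into a single JSON array.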

awk, header: ignore when reading, include when writing

I know how to ignore the column header when reading a file. I can do something like this:
awk 'FNR > 1 { #process here }' table > out_table
But if I do this, everything other than the COLUMN HEADER is written to the output file, and I want the output file to include the COLUMN HEADER as well.
Of course I can do something like this after I execute the first statement:
awk 'BEGIN {print "Column Headers\t"} {print}' Out_table > out_table_with_header
But this becomes a 2-step process. So is there a way to do this in a SINGLE step?
In short, is there a way to ignore the column header while reading the file, perform operations on the data, and then include the column header when writing the output file, in a single step (or a block of steps with very little extra runtime)?
Not sure if I got you correctly, but you can simply do:
awk 'NR==1{print}; NR>1 { # process }' file
which can be simplified to:
awk 'NR==1; NR>1 { # process }' file
That works for a single input file.
If you want to process more than one file, all having the same column headers at line 1 use this:
awk 'FNR==1 && !h {print; h=1}; FNR>1 { # process }' file1 file2 ...
I'm using the variable h to check whether the headers have been printed already or not.
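For example, assuming a tab-separated table whose second column should be doubled (a made-up operation, just to make the pattern concrete):
awk 'BEGIN{FS=OFS="\t"} NR==1; NR>1 {$2=$2*2; print}' table > out_table
and the multi-file variant under the same assumption:
awk 'BEGIN{FS=OFS="\t"} FNR==1 && !h {print; h=1} FNR>1 {$2=$2*2; print}' table1 table2 > out_table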