Split large file size json into multiple files [duplicate] - json

I have a json file exported from mongodb which looks like:
{"_id":"99919","city":"THORNE BAY"}
{"_id":"99921","city":"CRAIG"}
{"_id":"99922","city":"HYDABURG"}
{"_id":"99923","city":"HYDER"}
There are about 30000 lines, and I want to split each line into its own .json file. (I'm trying to transfer my data onto a Couchbase cluster.)
I tried doing this:
cat cities.json | jq -c -M '.' | \
while read line; do echo $line > .chunks/cities_$(date +%s%N).json; done
but I found that it seems to drop loads of lines: running this command only gave me around 50 files when I was expecting around 30000!
Is there a reliable way to do this without dropping any data, using whatever tool would suit?

Assuming you don't care about the exact filenames, if you want to split input into multiple files, just use split.
jq -c . < cities.json | split -l 1 --additional-suffix=.json - .chunks/cities_
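For example, a minimal run might look like this (assuming GNU split, and noting that split won't create the target directory for you):
mkdir -p .chunks
jq -c . < cities.json | split -l 1 --additional-suffix=.json - .chunks/cities_
# produces .chunks/cities_aa.json, .chunks/cities_ab.json, ... with one JSON object per file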

In general, splitting any text file into separate per-line files with any awk on any UNIX system is simply:
awk '{close(f); f=".chunks/cities_"NR".json"; print > f}' cities.json

Related

Splitting large JSON data using Unix command Split

Issue with the Unix split command for splitting large data: split -l 1000 file.json myfile. I want to split this file into multiple files of 1000 records each, but I'm getting the output as a single file - no change.
P.S. The file is created by converting a Pandas DataFrame to JSON.
Edit: It turns out that my JSON is formatted in a way that it contains only one row; wc -l file.json returns 0.
Here is the sample: file.json
[
{"id":683156,"overall_rating":5.0,"hotel_id":220216,"hotel_name":"Beacon Hill Hotel","title":"\u201cgreat hotel, great location\u201d","text":"The rooms here are not palatial","author_id":"C0F"},
{"id":692745,"overall_rating":5.0,"hotel_id":113317,"hotel_name":"Casablanca Hotel Times Square","title":"\u201cabsolutely delightful\u201d","text":"I travelled from Spain...","author_id":"8C1"}
]
Invoking jq once per partition plus once to determine the number of partitions would be extremely inefficient. The following solution suffices to achieve the partitioning deemed acceptable in your answer:
jq -c ".[]" file.json | split -l 1000
If, however, it is deemed necessary for each file to be pretty-printed, you could run jq -s . for each file, which would still be more efficient than running .[N:N+S] multiple times.
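For instance, a rough sketch of that post-processing (assuming split's default output names xaa, xab, ... in the current directory) could be:
for f in x??; do
    jq -s . "$f" > "$f.json"   # wrap the chunk's JSON lines back into a pretty-printed array
done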
If each partition should itself be a single JSON array, then see Splitting / chunking JSON files with JQ in Bash or Fish shell?
After asking elsewhere, it turned out that the file was in fact a single line.
Reformatting it with jq (in compact form) would enable the split, though processing the result would at least require deleting the first and last characters (or adding '[' and ']' to each of the split files).
I'd recommend splitting the JSON array with jq (see manual).
cat file.json | jq length # get length of an array
cat file.json | jq -c '.[0:1000]' # first 1000 items (the end index is exclusive)
cat file.json | jq -c '.[1000:2000]' # second 1000 items
...
Notice -c for compact result (not pretty printed).
For automation, you can code a simple bash script to split your file into chunks given the array length (jq length).
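A minimal sketch of such a script, assuming file.json holds a single top-level array and using hypothetical chunk_N.json output names, might be:
#!/bin/bash
size=1000
len=$(jq length file.json)                  # total number of array elements
for ((i = 0; i < len; i += size)); do
    jq -c ".[$i:$((i + size))]" file.json > "chunk_$((i / size)).json"
done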

Search in large csv files

The problem
I have thousands of csv files in a folder. Every file has 128,000 entries with four columns in each line.
From time to time (two times a day) I need to compare a list (10,000 entries) with all csv files. If one of the entries is identical to the third or fourth column of one of the csv files, I need to write the whole csv row to an extra file.
Possible solutions
Grep
#!/bin/bash
getArray() {
    array=()
    while IFS= read -r line
    do
        array+=("$line")
    done < "$1"
}
getArray "entries.log"
for e in "${array[@]}"
do
    echo "$e"
    /bin/grep $e ./csv/* >> found
done
This seems to work, but it takes forever: after almost 48 hours the script had checked only 48 of about 10,000 entries.
MySQL
The next try was to import all csv files into a MySQL database. But there I had problems with my table at around 50,000,000 entries.
So I wrote a script which created a new table after 49,000,000 entries, and with that I was able to import all csv files.
I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import wasn't feasible either; it slowed the import down to days instead of only a few hours.
The select statement was horrible, but it worked. Much faster than the grep solution, but still too slow.
My question
What else can I try to search within the csv files?
To speed things up I copied all csv files to an ssd. But I hope there are other ways.
This is unlikely to offer you meaningful benefits, but here are some improvements to your script:
use the built-in mapfile to slurp a file into an array:
mapfile -t array < entries.log
use grep with a file of patterns and appropriate flags.
I assume you want to match items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.
grep -Fwf entries.log ./csv/*
This means you don't have to grep the thousands of csv files thousands of times (once for each item in entries.log). In fact, this alone should give you a real, meaningful performance improvement.
This also removes the need to read entries.log into an array at all.
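Putting it together, the whole loop collapses to a single command; optionally adding -h (not strictly needed) drops the filename: prefix in case the extra file should contain only the matching csv rows themselves:
# -F fixed strings, -w whole words, -f patterns from file, -h omit the filename prefix
grep -hFwf entries.log ./csv/* > found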
An awk solution, assuming all the csv files change; otherwise it would be wise to keep track of the files already checked. But first, some test material:
$ mkdir test # the csvs go here
$ cat > test/file1 # has a match in 3rd
not not this not
$ cat > test/file2 # no match
not not not not
$ cat > test/file3 # has a match in 4th
not not not that
$ cat > list # these we look for
this
that
Then the script:
$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that
Explained:
$ awk ' # awk
NR==FNR { # process the list file
a[$1] # hash list entries to a
next # next list item
}
($3 in a) || ($4 in a) { # if 3rd or 4th field entry in hash
print >> "out" # append whole record to file "out"
}' list test/* # first list then the rest of the files
The script hashes all the list entries into a and reads through the csv files looking for 3rd and 4th field entries in the hash, outputting the whole record when there is a match.
If you test it, let me know how long it ran.
You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution:
# prepare patterns file
while read -r line; do
printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$" # find value in third column
printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$" # find value in fourth column
done < entries.log > patterns.dat
find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep, in batches

How to create 2 CSV files from 1 JSON using JQ

I have a lot of rather large JSON logs which need to be imported into several DB tables.
I can easily parse them and create 1 CSV for import.
But how can I parse the JSON and get 2 different CSV files as output?
Simple (nonsense) example:
testJQ.log
{"id":1234,"type":"A","group":"games"}
{"id":5678,"type":"B","group":"cars"}
using
cat testJQ.log|jq --raw-output '[.id,.type,.group]|@csv'>testJQ.csv
I get one file testJQ.csv
1234,"A","games
5678,"B","cars"
But I would like to get this
types.csv
1234,"A"
5678,"B"
groups.csv
1234,"games"
5678,"cars"
Can this be done without having to parse the JSON twice, first creating types.csv and then groups.csv like this?
cat testJQ.log|jq --raw-output '[.id,.type]|@csv'>types.csv
cat testJQ.log|jq --raw-output '[.id,.group]|@csv'>groups.csv
I suppose one way you could hack this up is to output the contents of one file to stdout and the others to stderr and redirect to separate files. Of course you're limited to two files though.
$ <testJQ.log jq -r '([.id,.type]|@csv),([.id,.group]|@csv|stderr|empty)' \
1>types.csv 2>groups.csv
The stderr filter writes its input to stderr, but the value also propagates to the output, so you'll want to follow it with empty to swallow it.
Personally I wouldn't recommend doing this, I would just write a python script (or other language) to parse this if you needed to output to multiple files.
You will either need to run jq twice, or to run jq in conjunction with another program to "split" the output of the call to jq. For example, you could use a pipeline of the form: jq -c ... | awk ...
The potential disadvantage of the pipeline approach is that if JSON is the final output, it will be JSONL; but obviously that doesn't apply here.
There are many ways to craft such a pipeline. For example, assuming there are no raw newlines in the CSV:
< testJQ.log jq -r '
"types", ([.id,.type] |#csv),
"groups", ([.id,.group]|#csv)' |
awk 'NR % 2 == 1 {out=$1; next} {print >> out".csv"}'
Or:
< testJQ.log jq -r '([.id,.type],[.id,.group])|@csv' |
awk '{ out = ((NR % 2) == 1) ? "types" : "groups"; print >> out".csv"}'
For other examples, see e.g.
Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?
Splitting / chunking JSON files with JQ in Bash or Fish shell?
Split JSON into multiple files
Handling raw new-lines
Whether or not you split the CSV into multiple files, there is a potential issue with embedded raw newlines. One approach is to change "\n" in JSON strings to "\\n", e.g.
jq -r '([.id,.type],[.id,.group])
| map(if type == "string" then gsub("\n";"\\n") else . end)
| @csv'
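Folded into the second pipeline above, a sketch (same filenames and assumptions as before) would be:
< testJQ.log jq -r '([.id,.type],[.id,.group])
    | map(if type == "string" then gsub("\n";"\\n") else . end)
    | @csv' |
awk '{ out = ((NR % 2) == 1) ? "types" : "groups"; print >> (out ".csv") }'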

How to split text file into multiple files and extract filename from line prefix?

I have a simple log file with content like:
1504007980.039:{"key":"valueA"}
1504007990.359:{"key":"valueB", "key2": "valueC"}
...
I'd like to output this to multiple files, each containing the JSON part that comes after the timestamp. As a result I would get the files:
1504007980039.json
1504007990359.json
...
This is similar to How to split one text file into multiple *.txt files?, but the filename should be extracted from each line (with the extra dot removed), and not generated from an index.
Preferably I'd want a one-liner that can be executed in bash.
Since you aren't using GNU awk, you need to close output files as you go to avoid the "too many open files" error. To avoid that, along with issues around specific values in your JSON and undefined behavior during output redirection, this is what you need:
awk '{
    fname = $0
    sub(/\./,"",fname)        # remove the dot from the timestamp
    sub(/:.*/,".json",fname)  # keep everything before the ":" and append .json
    sub(/[^:]+:/,"")          # strip the "timestamp:" prefix from the record itself
    print >> fname
    close(fname)
}' file
You can of course squeeze it onto 1 line if you see some benefit to that:
awk '{f=$0;sub(/\./,"",f);sub(/:.*/,".json",f);sub(/[^:]+:/,"");print>>f;close(f)}' file
Another awk solution:
awk '{ idx=index($0,":"); fn=substr($0,1,idx-1)".json"; sub(/\./,"",fn);
print substr($0,idx+1) > fn; close(fn) }' input.log
idx=index($0,":") - capture the index of the 1st :
fn=substr($0,1,idx-1)".json" - prepare the filename from everything before the : (the following sub() then removes the dot)
print substr($0,idx+1) > fn - write everything after the : into that file
Viewing results (for 2 sample lines from the question):
for f in *.json; do echo "$f"; cat "$f"; echo; done
The output (filename -> content):
1504007980039.json
{"key":"valueA"}
1504007990359.json
{"key":"valueB"}
