json2csv output multiple jsons to one csv

I am using json2csv to convert multiple json files structured like
{
"address": "0xe9f6191596bca549e20431978ee09d3f8db959a9",
"copyright": "None",
"created_at": "None"
...
}
The problem is that I need to put multiple json files into one csv file.
In my code I iterate through a file of hashes, call curl with each hash, and write the response to a json file. Then I use json2csv to convert each json to csv.
mkdir -p curl_outs
{ cat hashes.hash; echo; } | while read h; do
echo "Downloading $h"
curl -L https://main.net955305.contentfabric.io/s/main/q/$h/meta/public/nft > curl_outs/$h.json;
node index.js $h;
json2csv -i curl_outs/$h.json -o main.csv;
done
I use -o to output the converted csv, but each run overwrites the previous data, so I end up with only one row.
I have used >>, and this does append to the csv file.
json2csv -i "curl_outs/${h}.json" >> main.csv
But for some reason it appends the data's keys to the end of the csv file.
I've also tried
cat csv_outs/*.csv > main.csv
However I get the same output.
How do I append multiple json files to one main csv file?

It's not entirely clear from the image and your description what's wrong with >>, but it looks like maybe the CSV file doesn't have a trailing line break, so appending the next file (>>) starts writing directly at the end of the last row and column (cell) of the previous file's data.
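If that is what's happening, a minimal workaround sketch (placed inside your loop, before the append) is to add a newline only when main.csv doesn't already end with one. Note this only fixes the run-together rows, not the repeated header lines:
# command substitution strips a trailing newline, so the test succeeds
# only when main.csv exists, is non-empty, and does NOT end with a newline
[ -s main.csv ] && [ -n "$(tail -c 1 main.csv)" ] && echo >> main.csv
json2csv -i "curl_outs/${h}.json" >> main.csv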
I deal with CSVs almost daily and love the GoCSV tool. Its stack subcommand will do just what the name implies: stack multiple CSVs, one on top of the other.
In your case, you could download each JSON and convert it to an individual (intermediate) CSV. Then, at the end, stack all the intermediate CSVs, then delete all the intermediate CSVs.
mkdir -p curl_outs
{ cat hashes.hash; echo; } | while read h; do
echo "Downloading $h"
curl -L https://main.net955305.contentfabric.io/s/main/q/$h/meta/public/nft > curl_outs/$h.json;
node index.js $h;
json2csv -i curl_outs/$h.json -o curl_outs/$h.csv;
done
gocsv stack curl_outs/*.csv > main.csv;
# I suggest deleting the intermediate CSVs
# rm curl_outs/*.csv
# ...
I changed the last line of your loop to json2csv -i curl_outs/$h.json -o curl_outs/$h.csv; to create those intermediate CSVs I mentioned before. Now, gocsv's stack subcommand can take a list of those intermediate CSVs and give you main.csv.
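A hedged alternative, if you'd rather not install another tool: each downloaded file holds a single JSON object, so (assuming jq is available, and that your json2csv accepts an array of objects the way the npm CLI does) you could merge all the downloads into one array after the loop and convert once, which writes only one header row:
jq -s '.' curl_outs/*.json > curl_outs/all.json
json2csv -i curl_outs/all.json -o main.csv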

Related

Search in large csv files

The problem
I have thousands of csv files in a folder. Every file has 128,000 entries with four columns in each line.
From time to time (two times a day) I need to compare a list (10,000 entries) with all csv files. If one of the entries is identical to the third or fourth column of one of the csv files, I need to write the whole csv row to an extra file.
Possible solutions
Grep
#!/bin/bash
getArray() {
array=()
while IFS= read -r line
do
array+=("$line")
done < "$1"
}
getArray "entries.log"
for e in "${array[@]}"
do
echo "$e"
/bin/grep $e ./csv/* >> found
done
This seems to work, but it takes forever. After almost 48 hours the script had checked only 48 of about 10,000 entries.
MySQL
The next try was to import all csv files to a mysql database. But there I had problems with my table at around 50,000,000 entries.
So I wrote a script that created a new table after 49,000,000 entries, and with that I was able to import all csv files.
I tried to create an index on the second column, but it always failed (timeout). Creating the index before the import wasn't possible either; it slowed the import down from a few hours to days.
The select statement was horrible, but it worked. It was much faster than the grep solution but still too slow.
My question
What else can I try to search within the csv files?
To speed things up I copied all csv files to an ssd. But I hope there are other ways.
Some improvements to your script (the first is minor, but the second should make a real difference):
use the built-in mapfile to slurp a file into an array:
mapfile -t array < entries.log
use grep with a file of patterns and appropriate flags.
I assume you want to match items in entries.log as fixed strings, not as regex patterns.
I also assume you want to match whole words.
grep -Fwf entries.log ./csv/*
This means you don't have to grep the thousands of csv files thousands of times (once for each item in entries.log). This alone should give you a really meaningful performance improvement.
This also removes the need to read entries.log into an array at all.
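Putting it together, the whole script collapses to one command. A minimal sketch, adding -h to suppress the filename prefix on the assumption that you want bare csv rows in the output file, as in your original loop:
grep -hFwf entries.log ./csv/* > found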
An awk solution, assuming all the csv files change; otherwise it would be wise to keep track of the files already checked. But first, some test material:
$ mkdir test # the csvs go here
$ cat > test/file1 # has a match in 3rd
not not this not
$ cat > test/file2 # no match
not not not not
$ cat > test/file3 # has a match in 4th
not not not that
$ cat > list # these we look for
this
that
Then the script:
$ awk 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list test/*
$ cat out
not not this not
not not not that
Explained:
$ awk ' # awk
NR==FNR { # process the list file
a[$1] # hash list entries to a
next # next list item
}
($3 in a) || ($4 in a) { # if 3rd or 4th field entry in hash
print >> "out" # append whole record to file "out"
}' list test/* # first list then the rest of the files
The script hashes all the list entries into a and reads through the csv files looking for 3rd and 4th field entries in the hash, outputting the whole record when there is a match.
If you test it, let me know how long it ran.
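One assumption worth calling out: the test material above is whitespace-separated, while your real files are csv. If they are comma-separated, pass the field separator explicitly, for example:
awk -F',' 'NR==FNR{a[$1];next} ($3 in a) || ($4 in a){print >> "out"}' list ./csv/*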
You can build a patterns file and then use xargs and grep -Ef to search for all patterns in batches of csv files, rather than one pattern at a time as in your current solution:
# prepare patterns file
while read -r line; do
printf '%s\n' "^[^,]+,[^,]+,$line,[^,]+$" # find value in third column
printf '%s\n' "^[^,]+,[^,]+,[^,]+,$line$" # find value in fourth column
done < entries.log > patterns.dat
find /path/to/csv -type f -name '*.csv' -print0 | xargs -0 grep -hEf patterns.dat > found.dat
find ... - emits a NUL-delimited list of all csv files found
xargs -0 ... - passes the file list to grep, in batches
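For a hypothetical entry abc123 in entries.log, patterns.dat would contain the two lines below. Because grep -E treats these as regexes, any regex metacharacters in your entries (if they can occur) would need escaping first:
^[^,]+,[^,]+,abc123,[^,]+$
^[^,]+,[^,]+,[^,]+,abc123$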

Split large file size json into multiple files [duplicate]

I have json file exported from mongodb which looks like:
{"_id":"99919","city":"THORNE BAY"}
{"_id":"99921","city":"CRAIG"}
{"_id":"99922","city":"HYDABURG"}
{"_id":"99923","city":"HYDER"}
there are about 30000 lines, and I want to split each line into its own .json file. (I'm trying to transfer my data onto a couchbase cluster)
I tried doing this:
cat cities.json | jq -c -M '.' | \
while read line; do echo $line > .chunks/cities_$(date +%s%N).json; done
but I found that it seems to drop loads of lines: running this command only gave me 50-odd files when I was expecting 30000-odd!
Is there a logical way to make this not drop any data, using anything that would suit?
Assuming you don't care about the exact filenames, if you want to split input into multiple files, just use split.
jq -c . < cities.json | split -l 1 --additional-suffix=.json - .chunks/cities_
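With GNU split's default alphabetic suffixes, the first chunks come out named like this (illustration only):
.chunks/cities_aa.json
.chunks/cities_ab.json
.chunks/cities_ac.json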
In general, splitting any text file into separate files, one per line, with any awk on any UNIX system is simply:
awk '{close(f); f=".chunks/cities_"NR".json"; print > f}' cities.json
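Both variants assume the target directory already exists, so create it first:
mkdir -p .chunks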

How to split text file into multiple files and extract filename from line prefix?

I have a simple log file with content like:
1504007980.039:{"key":"valueA"}
1504007990.359:{"key":"valueB", "key2": "valueC"}
...
I'd like to split this into multiple files, each containing just the JSON part that comes after the timestamp. So I would get as a result the files:
1504007980039.json
1504007990359.json
...
This is similar to How to split one text file into multiple *.txt files? but the name of each file should be extracted from its line (with the extra dot removed), not generated from an index.
Preferably I'd want a one-liner that can be executed in bash.
Since you aren't using GNU awk, you need to close output files as you go to avoid the "too many open files" error. To avoid that, along with issues around specific values in your JSON and undefined behavior during output redirection, this is what you need:
awk '{
fname = $0
sub(/\./,"",fname)
sub(/:.*/,".json",fname)
sub(/[^:]+:/,"")
print >> fname
close(fname)
}' file
You can of course squeeze it onto 1 line if you see some benefit to that:
awk '{f=$0;sub(/\./,"",f);sub(/:.*/,".json",f);sub(/[^:]+:/,"");print>>f;close(f)}' file
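As a quick check (assuming the two sample lines above are saved in a file named file), either form produces one file per line, named from the timestamp and containing only the JSON part:
$ ls *.json
1504007980039.json  1504007990359.json
$ cat 1504007980039.json
{"key":"valueA"}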
awk solution:
awk '{ idx=index($0,":"); fn=substr($0,1,idx-1)".json"; sub(/\./,"",fn);
print substr($0,idx+1) > fn; close(fn) }' input.log
idx=index($0,":") - capturing index of the 1st :
fn=substr($0,1,idx-1)".json" - preparing filename
Viewing results (for 2 sample lines from the question):
for f in *.json; do echo "$f"; cat "$f"; echo; done
The output (filename -> content):
1504007980039.json
{"key":"valueA"}
1504007990359.json
{"key":"valueB", "key2": "valueC"}

Dynamically create and update json using bash

In my hypothetical folder /hd/log/, I have two dozen folders, and each folder has log files in the format foldername.2017.07.09.log. I have a crontab that gzips the last log file every night, so there is a new log file with a new name every day.
I am trying to create a dynamic json file whose output looks like this:
[
{
"Foldername": "foldername",
"lastmodifiedfile": "/hd/log/foldername/foldername.2017.07.09.log"
},
{
"Foldername": "foldername2",
"lastmodifiedfile": "/hd/log/foldername2/foldername2.2017.07.09.log"
}
]
The bash script should dynamically create an entry for each subfolder name (in case more folders are added or names are changed) and also give a direct link to the last modified file.
I already have a PHP program to parse the json file, but no sane way to create this json file dynamically.
Any help or pointers is appreciated.
printf "%s" "["
for var in $(find /hd/log -type d)
do
path=$(ls -1t "$var" | head -1)
echo "$var/$path" | awk -F\/ '{ printf "%s","\n\t{\n\t\t\"Foldername\":\""$(NF-1)"\",\n\t\t\"lastmodifiedfile\":\""$0"\"\n\t},"}'
done
printf "%s" "]"
Here we find all directories under /hd/log in a loop, take each directory in turn, and use ls -1t | head -1 to get the last modified file in that directory. The path and file are then parsed through awk to get the desired output. We first set the field delimiter for awk to / with the -F flag, then print the json syntax as required, using the last-but-one /-delimited field for the directory name (NF-1, i.e. the number of fields minus one) and the complete line for the last modified file ($0).
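If you need the result to be valid JSON, note that the handcrafted version above leaves a trailing comma after the last object. A hedged alternative sketch that builds the array with jq instead (assuming jq is installed, the log folders sit directly under /hd/log, and each one contains at least one file):
{
for d in /hd/log/*/; do
  newest=$(ls -1t "$d" | head -1)               # most recently modified file in the folder
  jq -n --arg fn "$(basename "$d")" --arg lm "$d$newest" \
     '{Foldername: $fn, lastmodifiedfile: $lm}' # one JSON object per folder
done
} | jq -s '.'                                   # wrap the stream of objects into an array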
