explore multiple GB of JSON data - json

I've got a live Firebase app with a database that's about 5GB in size. The Firebase dashboard refuses to show me the contents of my database and just fails to load every time, presumably because the thing is too big. I've been digging around for some time now in search of a tool that lets me come up with an ERD of my data. Help?
Atom crashes, vim takes forever and doesn't load anything, and jq simply spits out a formatted version of my data. I've tried a couple of Java tools to generate JSON schemas, but they crash after a while, and most Python programs that do the same don't even start properly.
How would you explore 5GB of JSON data?

Most text editors page a file in line by line, so a file with many lines should load.
The exception is a file that consists of one single line.
In that case, you can use sed or jq to reformat the file so that it has more than one line.
After that operation you should be able to open it.
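If you only need to extract data, you could use grep -o "what you need to extract" file.json.
The -o option prints only the matching text instead of the whole line, which matters on a single-line 5GB file, although grep may still need enough memory to hold that line.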
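For the schema side of the question, one possible approach is to summarize which JSON types occur at each path. The sketch below is only an illustration and assumes you have already cut out a slice that fits in memory (for example a single top-level branch); the function name `summarize` and the `$`-style path notation are made up for this example.

```python
import json
from collections import defaultdict

# Names of the JSON types that json.loads can produce for scalar values
JSON_TYPE_NAMES = {bool: "boolean", int: "number", float: "number",
                   str: "string", type(None): "null"}

def summarize(value, path="$", schema=None):
    """Record the JSON types seen at each path; all items of an array share one '[]' path."""
    if schema is None:
        schema = defaultdict(set)
    if isinstance(value, dict):
        schema[path].add("object")
        for key, child in value.items():
            summarize(child, f"{path}.{key}", schema)
    elif isinstance(value, list):
        schema[path].add("array")
        for child in value:
            summarize(child, f"{path}[]", schema)
    else:
        schema[path].add(JSON_TYPE_NAMES[type(value)])
    return schema

if __name__ == "__main__":
    sample = json.loads('{"users": [{"name": "a", "age": 1}, {"name": "b", "age": null}]}')
    for path, types in sorted(summarize(sample).items()):
        print(path, sorted(types))
```

If the object keys are generated IDs rather than field names, every ID becomes its own path, so you would probably want to collapse those keys into a placeholder before reading the result as an ERD.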

Related

split a huge single line json into separate files

I've got a huge (>4GB) JSON file where all the information is written on one single line. Unfortunately, my scripts can't work with such a huge file and I wasn't able to split it into multiple lines. Every script or program I've used crashed because of limited memory.
Can you help me split a JSON file into several files, for example with a bash command?

Efficiently Aggregate Many CSVs in Spark

Pardon my simple question but I'm relatively new to Spark/Hadoop.
I'm trying to load a bunch of small CSV files into Apache Spark. They're currently stored in S3, but I can download them locally if that simplifies things. My goal is to do this as efficiently as possible. It seems like it would be a shame to have some single-threaded master downloading and parsing a bunch of CSV files while my dozens of Spark workers sit idly. I'm hoping there's an idiomatic way to distribute this work.
The CSV files are arranged in a directory structure that looks like:
2014/01-01/fileabcd.csv
2014/01-01/filedefg.csv
...
I have two years of data, with directories for each day, and a few hundred CSVs inside of each. All of those CSVs should have an identical schema, but it's of course possible that one CSV is awry and I'd hate for the whole job to crash if there are a couple problematic files. Those files can be skipped as long as I'm notified in a log somewhere that that happened.
It seems that every Spark project I have in mind is in this same form and I don't know how to solve it. (e.g. trying to read in a bunch of tab-delimited weather data, or reading in a bunch of log files to look at those.)
What I've Tried
I've tried both SparkR and the Scala libraries. I don't really care which language I need to use; I'm more interested in the correct idioms/tools to use.
Pure Scala
My original thought was to enumerate and parallelize the list of all year/mm-dd combinations so that I could have my Spark workers all processing each day independently (download and parse all CSV files, then stack them on top of each other (unionAll()) to reduce them). Unfortunately, downloading and parsing the CSV files using the spark-csv library can only be done in the "parent"/master job, and not from each child as Spark doesn't allow job nesting. So that won't work as long as I want to use the Spark libraries to do the importing/parsing.
Mixed-Language
You can, of course, use the language's native CSV parsing to read in each file then "upload" them to Spark. In R, this is a combination of some package to get the file out of S3 followed by a read.csv, and finishing off with a createDataFrame() to get the data into Spark. Unfortunately, this is really slow and also seems backwards to the way I want Spark to work. If all my data is piping through R before it can get into Spark, why bother with Spark?
Hive/Sqoop/Phoenix/Pig/Flume/Flume Ng/s3distcp
I've started looking into these tailored tools and quickly got overwhelmed. My understanding is that many/all of these tools could be used to get my CSV files from S3 into HDFS.
Of course it would be faster to read my CSV files in from HDFS than S3, so that solves some portion of the problem. But I still have tens of thousands of CSVs that I need to parse and am unaware of a distributed way to do that in Spark.
So right now (Spark 1.4) SparkR has support for JSON or Parquet file structures. CSV files can be parsed, but then the Spark context needs to be started with an extra jar (which needs to be downloaded and placed in the appropriate folder; I've never done this myself, but my colleagues have).
sc <- sparkR.init(sparkPackages="com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)
There is more information in the docs. I expect that a newer spark release would have more support for this.
If you don't do this you'll need to either resort to a different file structure or use python to convert all your files from .csv into .parquet. Here is a snippet from a recent python talk that does this.
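from pyspark.sql import Row

# Read every CSV line from S3 into 1200 partitions
data = sc.textFile(s3_paths, 1200).cache()
def caster(x):
    # Map one split line onto named columns
    return Row(colname1=x[0], colname2=x[1])
df_rdd = data\
    .map(lambda x: x.split(','))\
    .map(caster)
# createDataFrame infers the schema from the Rows (it replaces the deprecated inferSchema)
ddf = sqlContext.createDataFrame(df_rdd).cache()
ddf.write.save('s3n://<bucket>/<filename>.parquet')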
Also, how big is your dataset? You may not even need Spark for analysis. Note also that, as of right now:
SparkR has only DataFrame support.
no distributed machine learning yet.
for visualisation you will need to convert a distributed dataframe back into a normal one if you want to use libraries like ggplot2.
if your dataset is no larger than a few gigabytes, then the extra bother of learning spark might not be worthwhile yet
it's modest now, but you can expect more from the future
I've run into this problem before (but w/ reading a large qty of Parquet files) and my recommendation would be to avoid dataframes and to use RDDs.
The general idiom used was:
Read in a list of the files, with each file being a line (in the driver). The expected output here is a list of strings.
Parallelize the list of strings and map over them with a custom CSV reader that returns a list of case classes.
You can also use flatMap if at the end of the day you want a data structure like List[weather_data] that could be rewritten to Parquet or a database.
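The idiom above might look roughly like this in Python. It is a sketch under assumptions, not the answerer's actual code: `EXPECTED_HEADER`, `parse_csv_text`, `load_all` and the `read_text` callable (whatever fetches one file's text, e.g. from S3) are hypothetical names, and plain dicts stand in for case classes. Malformed files are skipped with a warning, as the question asks.

```python
import csv
import io
import logging

log = logging.getLogger("csv_loader")
EXPECTED_HEADER = ["station", "date", "temp"]  # hypothetical schema, for illustration only

def parse_csv_text(path, text):
    """Return the rows of one CSV as dicts, or [] (with a warning) if the file is malformed."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        log.warning("skipping %s: unexpected header %r", path, header)
        return []
    rows = []
    for fields in reader:
        if len(fields) != len(header):
            log.warning("skipping %s: bad row %r", path, fields)
            return []
        rows.append(dict(zip(header, fields)))
    return rows

def load_all(sc, paths, read_text):
    """Distribute the list of paths; each worker fetches and parses its own files."""
    # flatMap flattens the per-file row lists into one RDD of rows
    return sc.parallelize(paths).flatMap(lambda p: parse_csv_text(p, read_text(p)))
```

Only the list of paths is built in the driver, so the downloading and parsing happen on the workers. The warnings end up in the executor logs rather than the driver's.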

Explore database contents from .sql file

I inherited the maintenance of a small web forum. Near as I can tell, it is powered by a MySQL database on the backend (the frontend is all PHP).
I need to extract some of the data (which also involves searching for the data I need to extract), but I don't want to touch the production database. I exported a database backup, which produced a several-hundred-megabyte .sql file.
What's the best way to mine these data? I can see several options:
grep through the .sql script in text mode, trying to extract the relevant data
Load it up in sqlite3 (I tried doing this, but it barfed on some of the statements in the script and didn't produce any tables. I have no database experience whatsoever though, so I haven't ruled it out as a dead end just yet).
Install MySQL on my home box, create a database, and execute the .sql script to recreate the data. Then just attach some database explorer tool.
Find some (Linux) app which can understand the .sql file natively (seems unlikely after a bit of Googling).
Any pointers to which of these options (or one I haven't thought of yet) would be the most productive?
I would say any option might work, but for data mining you definitely want to load it up in a new database so you can start querying the data and building reports on it. I would load it up on your home box. No need to have it remote.

Conversion of GRIB and NetCDF to my database

I have downloaded "High Resolution Initial Conditions" climate forecast data for one day. It came as a .tar.gz archive, so I extracted it into my local directory and got the files shown in the attached image. I think that the files without an extension are GRIB data (because the first word in them is "GRIB"). I want to get data from the big files (GRIB and NetCDF formats containing climate data like temperature and pressure on a grid) into my database, but they are binary. Can you recommend an easy way of getting data out of these files? I can't find any information about handling their datasets on their website.
Converting these files to .csv would be nice, but I can't find a program to convert the GRIB files.
Using python and some available modules it is simple...
The Enthought Python Distribution includes several packages, including netCDF4, to deal with NetCDF files!
I've never worked with GRIB files, but Google says that another Python package exists, pygrib2.
Or you can use PyNIO, a Python package that can read and write netCDF3 and netCDF4 classic format files, and read GRIB1 and GRIB2 files.
I don't know the amount of data you have, but usually it is crazy to convert it to *.csv! Python is easy to learn and well suited to this kind of data (with the matplotlib package you can even plot it). Or, if you really need it as *.csv, you can use Python to select a smaller domain, for example, or only the variables you need...
For conversion into text, look into http://www.cpc.ncep.noaa.gov/products/wesley/wgrib.html or http://www.cpc.ncep.noaa.gov/products/wesley/wgrib2/
Both are C programs from one of the big names in GRIB.
I'm currently dealing with a similar issue.
In my case I'm trying to rely on the GrADS software, which can "easily" transform GRIB data into other formats.
If your dataset is not huge, then you can export it to csv using this tutorial.
My dataset is 80gb in GRIB binary files, so I'm very restricted in what software I can use to handle it (no R unless I find a computer with more than 80gb of RAM).

Bulk loading MongoDB from JSON file with a number of objects

I want to do a bulk load into MongoDB. I have about 200GB of files containing JSON objects which I want to load. The problem is that I cannot use the mongoimport tool, as the objects contain objects (i.e. I'd need to use the --jsonArray param), which is limited to 4MB.
There is the Bulk Load API in CouchDB where I can just write a script and use cURL to send a POST request to insert the documents, no size limits...
Is there anything like this in MongoDB? I know there is Sleepy but I am wondering if this can cope with a JSON nest array insert..?
Thanks!
OK, it basically appears there is no really good answer unless I write my own tool in something like Java or Ruby to pass the objects in (meh effort)... But that's a real pain, so instead I decided to simply split the files down to 4MB chunks... I just wrote a simple shell script using split (note that I had to split the files multiple times because of the limitations). I used the split command with -l (number of lines) so each file had x number of lines in it. In my case each JSON object was about 4kb, so I just guessed the line counts.
For anyone wanting to do this, remember that with its default two-letter suffixes split can only make 676 files (26*26), so either make sure each file has enough lines in it or lengthen the suffix with -a to avoid missing half the files. Anyway, I put all this in a good old bash script, used mongoimport and let it run overnight. Easiest solution IMO, with no need to cut and mash files and parse JSON in Ruby/Java or whatever else.
The scripts are a bit custom, but if anyone wants them just leave a comment and I'll post them.
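For readers who would rather not depend on split's suffix limit, the same line-based chunking can be sketched in Python. This is not the shell script described above, just an illustration of the idea; it assumes one JSON object per line, and the function name `split_lines` and the `chunk_NNNNNN.json` naming are made up for the example.

```python
import os

def split_lines(src_path, out_dir, lines_per_file):
    """Write consecutive runs of lines_per_file lines to numbered chunk files; return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    with open(src_path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new chunk file every lines_per_file lines
            if i % lines_per_file == 0:
                if out:
                    out.close()
                path = os.path.join(out_dir, f"chunk_{len(paths):06d}.json")
                paths.append(path)
                out = open(path, "w", encoding="utf-8")
            out.write(line)
    if out:
        out.close()
    return paths
```

The file is read one line at a time, so memory use stays small regardless of the input size, and the six-digit counter leaves room for far more than 676 chunks.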
Without knowing anything about the structure of your data I would say that if you can't use mongoimport you're out of luck. There is no other standard utility that can be tweaked to interpret arbitrary JSON data.
When your data isn't a 1:1 fit to what the import utilities expect, it's almost always easiest to write a one-off import script in a language like Ruby or Python to do it. Batch inserts will speed up the import considerably, but don't make the batches too large or you will get errors (the max size of an insert in 1.8+ is 16 MB). In the Ruby driver a batch insert can be done by simply passing an array of hashes to the insert method, instead of a single hash.
If you add an example of your data to the question I might be able to help you further.
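The batching part of such a one-off script could be sketched as follows in Python. It assumes one JSON object per input line, and it uses the size of the JSON text as a rough stand-in for the encoded size of the documents, which is an approximation; the names `batches` and `MAX_BATCH_BYTES` are invented for the example. Each yielded list would then be handed to whatever batch-insert call your driver provides.

```python
import json

MAX_BATCH_BYTES = 16 * 1024 * 1024  # the 16 MB insert limit mentioned above

def batches(lines, max_bytes=MAX_BATCH_BYTES):
    """Group JSON lines into lists of parsed documents whose text size stays within max_bytes."""
    batch, size = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        n = len(line.encode("utf-8"))
        # Flush the current batch before it would exceed the limit
        if batch and size + n > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(json.loads(line))
        size += n
    if batch:
        yield batch
```

Because `batches` is a generator, it can be fed an open file object directly and never holds more than one batch in memory. A single document larger than the limit still ends up in a batch of its own, so it would need separate handling.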